2012年10月31日 星期三

VIM + python


Refs:
https://github.com/b4winckler/macvim
https://github.com/crosbymichael/.dotfiles
http://blog.othree.net/log/2010/11/22/vim-for-python/

decorator + cProfile + sqlite3 python

Refs:
http://stackoverflow.com/questions/5375624/a-decorator-that-profiles-a-method-call-and-logs-the-profiling-result
http://www.doughellmann.com/PyMOTW/profile/
http://www.artima.com/weblogs/viewpost.jsp?thread=240808
http://docs.python.org/dev/library/functools.html

Download:
https://docs.google.com/open?id=0B35jh2lwIpeKb3R0Mnp6ckhGZTA
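The linked Stack Overflow thread describes the idea: wrap a method with a decorator that runs it under cProfile and logs the profiling result into sqlite3. A minimal sketch of that pattern (my own version, using an in-memory database and a made-up `profile_log` table, not the code from the download link):

```python
import cProfile
import pstats
import io
import sqlite3
import functools

# assumed schema for illustration: one row per profiled call
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE profile_log (func TEXT, stats TEXT)')

def profiled(func):
    """Decorator: profile each call with cProfile, log stats to sqlite3."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        prof = cProfile.Profile()
        result = prof.runcall(func, *args, **kwargs)
        buf = io.StringIO()
        pstats.Stats(prof, stream=buf).sort_stats('cumulative').print_stats(5)
        conn.execute('INSERT INTO profile_log VALUES (?, ?)',
                     (func.__name__, buf.getvalue()))
        conn.commit()
        return result
    return wrapper

@profiled
def work(n):
    return sum(i * i for i in range(n))

print(work(1000))  # 332833500
print(conn.execute('SELECT func FROM profile_log').fetchall())  # [('work',)]
```

Using `functools.wraps` (see the functools ref above) keeps the wrapped function's name and docstring intact, which matters when you later query the log by function name.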

2012年10月30日 星期二

python table HDF5

The HDF5 library is a versatile, mature library designed for the storage of numerical data. The h5py package provides a simple, Pythonic interface to HDF5. A straightforward high-level interface allows the manipulation of HDF5 files, groups and datasets using established Python and NumPy metaphors. HDF5 provides a robust way to store data, organized by name in a tree-like fashion. You can create datasets (arrays on disk) hundreds of gigabytes in size, and perform random-access I/O on desired sections. Datasets are organized in a filesystem-like hierarchy using containers called "groups", and accessed using the traditional POSIX /path/to/resource syntax.

Refs:
example https://github.com/qsnake/h5py/tree/master/h5py/tests
http://code.google.com/p/h5py/
http://alfven.org/wp/hdf5-for-python/
http://pytables.github.com/usersguide/
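A quick sketch of that API (the file name, group path, and attribute here are made up for illustration): groups behave like directories, datasets like arrays on disk, and slicing a dataset reads only the requested section.

```python
import numpy as np
import h5py  # assumed installed: pip install h5py

# write: groups form a filesystem-like hierarchy, datasets hold arrays
with h5py.File('example.h5', 'w') as f:
    dset = f.create_group('experiment/run1').create_dataset(
        'data', data=np.arange(100).reshape(10, 10))
    dset.attrs['units'] = 'counts'  # metadata attached to the dataset

# read: POSIX-style path lookup, random-access slice of rows 2-3 only
with h5py.File('example.h5', 'r') as f:
    block = f['/experiment/run1/data'][2:4, :]
    print(block.shape)  # (2, 10)
```

Because the slice is resolved by the HDF5 library itself, only those two rows are read from disk, which is what makes the "hundreds of gigabytes" claim above practical.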

2012年10月19日 星期五

nose unittest extender

nose is a handy unittest manager: each test suite can be run together or separately, and you get a coverage report after the run. It is well suited for regression verification. Below is a very simple example.

dut.py the module under test
""" dut """

__all__ = ['frame']

class frame(object):

    def __init__(self):
        self.name = "frame"

    def double(self, w):
        return w * 2

    def triple(self, w):
        return w * 3
test_dut.py tests for the module above
 
import os
import unittest

import nose

from dut import *
import gc

class TestDouble(unittest.TestCase):

    def setUp(self):
        self.frame = frame()

    def tearDown(self):
        # must be spelled tearDown (camelCase) or unittest will not call it
        self.frame = None
        gc.collect()

    @unittest.skip("calling test skip test_double_word")
    def test_double_word(self):
        """ test double word """

        expect  = "hihi"
        results = self.frame.double("hi")
        self.assertTrue(expect == results)

    def test_double_dec(self):
        """ test double dec """

        expect = 4.0
        results = self.frame.double(2.0)
        self.assertTrue(expect == results)


if __name__ == '__main__':
    # unittest.main()
    import nose
    # nose.runmodule(argv=[__file__,'-vvs','-x', '--ipdb-failure'],
    #                exit=False)
    nose.runmodule(argv=[__file__,'-vvs','-x','--pdb', '--pdb-failures'],
                   exit=False)
__init__.py makes the directory a package so nose can discover the tests

runtest.sh top test run manager

#!/bin/sh
coverage erase
nosetests -w ./ --with-coverage --cover-package=dut $*

how to run it
$ ./runtest.sh

test results
.S
----------------------------------------------------------------------
Ran 2 tests in 0.001s

OK (SKIP=1)

2012年10月17日 星期三

MongoDB VS SQL performance benchmark

performance benchmark
Tips:
Memcached obviously wins the competition, as it does not have to sync anything to disk. Surprisingly, MongoDB beats it on small dataset inserts! I guess this is because the MongoDB driver uses a binary protocol and performs fire-and-forget inserts by default (unsafe mode). In addition, MongoDB does not enforce sync to disk, so many writes are kept in memory. That's why it does so well on inserts of small rows.

SQL requires joins, joins are slow. MongoDB is fast in large part because it doesn’t use joins (most of the time).

Refs:
http://blog.michaelckennedy.net/2010/04/29/mongodb-vs-sql-server-2008-performance-showdown/ 

http://tobami.wordpress.com/2011/02/28/benchmarking-mongodb/ 

http://stackoverflow.com/questions/4465027/sql-server-and-mongodb-comparison

http://atlantischiu.blog.ithome.com.tw/post/3058/110773 

http://zh.scribd.com/doc/28862327/MongoDB-High-Performance-SQL-Free-Database

2012年10月16日 星期二

pandas for data access

pandas makes it convenient to process large amounts of data and provides many ways to query it, e.g. join, split, groupby, column/row selection, hierarchical indexing, ... The idea is somewhat similar to SQL, but it is faster and more convenient than SQL and saves you from writing lots of queries yourself. It also supports data formats such as CSV, HDF5 (PyTables) with compression, JSON, ... Example: adapt the pandas sample code to compute a moving average.
"""
Some examples playing around with yahoo finance data
"""

from datetime import datetime

import matplotlib.finance as fin
import numpy as np
from pylab import show
import pprint

from pandas import Index, DataFrame
from pandas.core.datetools import BMonthEnd
from pandas import ols

startDate = datetime(2009, 9, 1)
endDate = datetime(2009, 9, 10)

def getQuotes(symbol, start, end):
    quotes = fin.quotes_historical_yahoo(symbol, start, end)
    dates, open, close, high, low, volume = zip(*quotes)

    data = {
        'open' : open,
        'close' : close,
        'high' : high,
        'low' : low,
        'volume' : volume
    }

    dates = Index([datetime.fromordinal(int(d)) for d in dates])
    return DataFrame(data, index=dates)


def getMoveAvage(frame, label='close', mvavg=5):
    """ get moving average """

    assert(label in ['open', 'close', 'high', 'low', 'volume'])

    avgs    = []

    for indx, val in enumerate(frame.index):
        tot_sum = 0.0

        if indx > mvavg and mvavg >0:
            for i in range(mvavg):
                tot_sum += frame[label][indx-i]

            avgs.append(tot_sum/mvavg)

        else:
            avgs.append(0.0)

    data = {
            "%s_avg_%s" %(label,mvavg)  : avgs
            }

    return DataFrame(data, index=frame.index)


msft = getQuotes('MSFT', startDate, endDate)
msft_close_mv5 = getMoveAvage(msft, 'close', 5)
msft_open_mv5 = getMoveAvage(msft, 'open', 5)

new_msft = msft.join(msft_close_mv5)
print new_msft
Use np.sum to speed this up and reduce the number of memory accesses
....

def getMoveAvage2(frame, label='close', mvavg=5):
    """ get moving average """

    assert(label in ['open', 'close', 'high', 'low', 'volume'])

    avgs    = []

    for indx, val in enumerate(frame.index):
        tot_sum = 0.0

        if indx > mvavg and mvavg >0:
            tot_sum = np.sum(frame[label][indx-mvavg+1:indx+1])

            avgs.append(tot_sum/mvavg)

        else:
            avgs.append(0.0)

    data = {
            "%s_avg_%s" %(label,mvavg)  : avgs
            }

    return DataFrame(data, index=frame.index)

#--------------------------------------

import profile
import pstats

msft = getQuotes('MSFT', startDate, endDate)

profile.run("getMoveAvage(msft, 'close', 5)", 'status0')
p0 = pstats.Stats('status0')
p0.sort_stats('time', 'cumulative').print_stats(5)

profile.run("getMoveAvage2(msft, 'close', 5)", 'status1')
p1 = pstats.Stats('status1')
p1.sort_stats('time', 'cumulative').print_stats(5)

rst0 = eval("getMoveAvage(msft, 'close', 5)")
rst1 = eval("getMoveAvage2(msft, 'close', 5)")
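Going one step further than np.sum: the whole Python loop can be collapsed into a single np.convolve call. This is my own sketch, not from the original example; it zero-fills the positions before the first full window, similar in spirit to the 0.0 padding above (the original's `indx > mvavg` guard zero-fills a couple of extra leading entries).

```python
import numpy as np

def moving_average(values, window=5):
    """Vectorized trailing moving average; leading entries without a
    full window are zero-filled."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    full = np.convolve(values, kernel, mode='valid')  # length n - window + 1
    out = np.zeros_like(values)
    out[window - 1:] = full  # first full window ends at index window - 1
    return out

print(moving_average([1, 2, 3, 4, 5, 6], window=3))  # [0. 0. 2. 3. 4. 5.]
```

The same idea applied to the DataFrame column would be `moving_average(frame[label].values, mvavg)`, which avoids the per-element indexing entirely.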

Refs:
http://pandas.pydata.org/pandas-docs/dev/
http://www.pytables.org/moin