numpy - Performing math on a Python Pandas Group By DataFrame


I have a pandas DataFrame with the following structure:

In [1]: df
Out[1]:
    location_code  month  amount
0               1      1      10
1               1      2      11
2               1      3      12
3               1      4      13
4               1      5      14
5               1      6      15
6               2      1      23
7               2      2      25
8               2      3      27
9               2      4      29
10              2      5      31
11              2      6      33
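For anyone who wants to reproduce this, here is a minimal sketch of how the sample frame above could be constructed (my own construction for illustration, not part of the original post):

import pandas as pd

df = pd.DataFrame({
    'location_code': [1] * 6 + [2] * 6,       # two locations
    'month': list(range(1, 7)) * 2,           # months 1-6 for each location
    'amount': [10, 11, 12, 13, 14, 15,
               23, 25, 27, 29, 31, 33],
})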

I also have a DataFrame that looks like the following:

In [2]: output_df
Out[2]:
   location_code regression_coef
0              1            None
1              2            None

What I would like to do is something like this:

output_df = df.groupby('location_code')['amount'].apply(linear_regression_and_return_coefficient)

That is, I want to group by location_code, perform a linear regression on the values of amount, and store the coefficient. I have tried the following code:

import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')['amount']

x = []
for j in range(6):
    x.append(j + 1)
x = sm.add_constant(x)  # design matrix: constant plus months 1-6

for location_code, amount in gb:
    trans = amount.tolist()
    model = sm.OLS(trans, x)
    results = model.fit()
    # slope coefficient, normalized by the group's mean amount
    output_df['regression_coef'][location_code] = results.params[1] / np.mean(trans)

This code works, but my data set is large (about 5 GB) and a bit more complex, and it is taking a really long time. I was wondering if there is a vectorized operation that can do this more efficiently? I know that looping over a pandas DataFrame is bad.

Solution

After tinkering around, I wrote a function that can be used with the apply method on a groupby:

def get_lin_reg_coef(series):
    x = sm.add_constant(range(1, 7))
    result = sm.OLS(series, x).fit().params[1]
    return result / series.mean()

gb = df.groupby('location_code')['amount']

output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
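One caveat worth noting: gb.apply here returns a Series indexed by location_code, while output_df uses a default integer index, so a plain column assignment can misalign the values. A safer sketch (my own addition, assuming output_df keeps location_code as a column as shown above):

coefs = gb.apply(get_lin_reg_coef)                        # Series indexed by location_code
output_df['lin_reg_coef'] = output_df['location_code'].map(coefs)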

Benchmarking this against the iterative solution I had before, with varying input sizes, gives:

DataFrame rows    Iterative solution (sec)    Vectorized solution (sec)
       370,000                       81.42                        84.46
     1,850,000                      448.36                       365.66
     3,700,000                     1282.83                       715.89
     7,400,000                     5034.62                      1407.88

Clearly a lot faster as the dataset grows in size!

Without knowing more about the data, the number of records, etc., this code should run faster:

import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')

x = sm.add_constant(range(1, 7))

def fit(stuff):
    return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()

output = gb.apply(fit)
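If even a per-group apply is too slow at 5 GB, the slope of a simple linear regression has a closed form, slope = cov(x, y) / var(x), which can be computed entirely with vectorized groupby aggregations and no Python-level loop over groups. A sketch under that assumption (my own addition, using the month column as the regressor, as in the sample data):

import pandas as pd

g = df.groupby('location_code')
mean_x = g['month'].transform('mean')
mean_y = g['amount'].transform('mean')

# per-group sums of cross and squared deviations
cov_xy = ((df['month'] - mean_x) * (df['amount'] - mean_y)).groupby(df['location_code']).sum()
var_x = ((df['month'] - mean_x) ** 2).groupby(df['location_code']).sum()

# same normalized coefficient as above: slope divided by the group's mean amount
coef = (cov_xy / var_x) / g['amount'].mean()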
