numpy - Performing math on a Python Pandas Group By DataFrame
I have a pandas DataFrame with the following structure:

In [1]: df
Out[1]:
    location_code  month  amount
0               1      1      10
1               1      2      11
2               1      3      12
3               1      4      13
4               1      5      14
5               1      6      15
6               2      1      23
7               2      2      25
8               2      3      27
9               2      4      29
10              2      5      31
11              2      6      33
I also have a second DataFrame that looks like the following:

In [2]: output_df
Out[2]:
   location_code regression_coef
0              1            None
1              2            None
What I would like is something along these lines:

output_df = df.groupby('location_code')['amount'].apply(linear_regression_and_return_coefficient)
I want to group by location_code, perform a linear regression on the values of amount within each group, and store the coefficient. I have tried the following code:
import numpy as np
import pandas as pd
import statsmodels.api as sm

gb = df.groupby('location_code')['amount']

# Regressor: months 1..6 plus an intercept column.
x = sm.add_constant(np.arange(1, 7))

for location_code, amount in gb:
    trans = amount.tolist()
    model = sm.OLS(trans, x)
    results = model.fit()
    # params[0] is the intercept, params[1] the slope; normalise by the group mean.
    output_df.loc[output_df['location_code'] == location_code, 'regression_coef'] = (
        results.params[1] / np.mean(trans)
    )
This code works, but my data set is large (about 5 GB) and a bit more complex, so it takes a really long time. Is there a vectorized operation that can do this more efficiently? I know that looping over a pandas DataFrame is bad.
Solution
After tinkering around, I wrote a function that can be used with the apply method on a groupby.
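# Assumes the imports from the question (numpy as np, statsmodels.api as sm).
def get_lin_reg_coef(series):
    # Months 1..6 plus an intercept column.
    x = sm.add_constant(np.arange(1, 7))
    # Pass a plain array so params is positional: [intercept, slope].
    result = sm.OLS(series.to_numpy(), x).fit().params[1]
    return result / series.mean()

gb = df.groupby('location_code')['amount']
# The apply result is indexed by location_code, so map it onto output_df by code.
output_df['lin_reg_coef'] = output_df['location_code'].map(gb.apply(get_lin_reg_coef))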
Benchmarking this against the iterative solution I had before, with varying input sizes, gives:

DataFrame rows   Iterative solution (sec)   Vectorized solution (sec)
370,000                             81.42                       84.46
1,850,000                          448.36                      365.66
3,700,000                         1282.83                      715.89
7,400,000                         5034.62                     1407.88
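Clearly a lot faster as the dataset grows in size! At 370,000 rows the two are about the same, but at 7,400,000 rows the apply version is roughly 3.6 times faster.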
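For anyone who wants to try the groupby/apply pattern on the sample data from the question, here is a minimal sketch. Note that it swaps statsmodels for np.polyfit purely to keep the example dependency-light, and the function name slope_over_mean is made up for illustration. It also assumes every group has one row per month, already sorted by month.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question: two locations, six months each.
df = pd.DataFrame({
    "location_code": [1] * 6 + [2] * 6,
    "month": list(range(1, 7)) * 2,
    "amount": [10, 11, 12, 13, 14, 15, 23, 25, 27, 29, 31, 33],
})

def slope_over_mean(series):
    # Assumption: one value per month, in month order, so x is simply 1..n.
    x = np.arange(1, len(series) + 1)
    # A degree-1 polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(x, series.to_numpy(), 1)
    # Normalise the slope by the group mean, as in the answer above.
    return slope / series.mean()

# Result is a Series indexed by location_code.
coef = df.groupby("location_code")["amount"].apply(slope_over_mean)
print(coef)
```

On this sample the slopes are 1 and 2 with group means 12.5 and 28, so the values should come out near 0.08 and 0.0714.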
Without knowing more about the data, the number of records, etc., this code should run faster:

import numpy as np
import pandas as pd
import statsmodels.api as sm

gb = df.groupby('location_code')['amount']

# Build the design matrix once instead of once per group.
x = sm.add_constant(np.arange(1, 7))

def fit(amounts):
    # gb yields one Series of amounts per location, so use it directly.
    return sm.OLS(amounts.to_numpy(), x).fit().params[1] / amounts.mean()

output = gb.apply(fit)
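Since the question asks for a truly vectorized operation, it may be worth noting that a one-variable least-squares slope has a closed form, slope = sum((x - mean(x)) * y) / sum((x - mean(x))^2), which can be computed with grouped aggregations and no per-group Python function at all. The sketch below is not what either answer uses; it assumes the month column holds the x values (the answers hard-code 1..6 instead), and it has not been benchmarked against the timings above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "location_code": [1] * 6 + [2] * 6,
    "month": list(range(1, 7)) * 2,
    "amount": [10, 11, 12, 13, 14, 15, 23, 25, 27, 29, 31, 33],
})

keys = df["location_code"]

# Centre x within each group; assumes 'month' is the regressor.
x_centered = df["month"] - df.groupby("location_code")["month"].transform("mean")

# Because the centred x sums to zero per group, y does not need centring.
numerator = (x_centered * df["amount"]).groupby(keys).sum()
denominator = (x_centered ** 2).groupby(keys).sum()
slope = numerator / denominator

# Normalise by the group mean, matching the coefficient used in the answers.
coef = slope / df.groupby("location_code")["amount"].mean()
print(coef)
```

The trade-off is that this only gives the slope; standard errors, p-values and the other statsmodels diagnostics are not available this way.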