python - pandas: concatenate dataframes, forward-fill and multiindex on column data -
i have 2 csv files same column names, different values.
the first column index (time
) , 1 of data columns unique identifier (id
)
the index (time
) different each csv file.
i have read data 2 dataframes using read_csv
, giving me following:
+-------+------+-------+ | id | size | price | +-------+-------+------+-------+ | time | | | | +-------+-------+------+-------+ | t0 | id1 | 10 | 110 | | t2 | id1 | 12 | 109 | | t6 | id1 | 20 | 108 | +-------+-------+------+-------+ +-------+------+-------+ | id | size | price | +-------+-------+------+-------+ | time | | | | +-------+-------+------+-------+ | t1 | id2 | 9 | 97 | | t3 | id2 | 15 | 94 | | t5 | id2 | 13 | 100 | +-------+-------+------+-------+
i create single large dataframe entries both, , use ffill
forward fill values previous time-step.
i able achieve using combination of concat
, sort
, ffill
.
however, requires renaming columns of 1 of dataframes first, there aren't name clashes
df2.columns = [ 'id', 'id2_size', 'id2_price' ] df = pd.concat([df1, df2]).sort().ffill()
this results in following dataframe:
+------+------+-------+----------+-----------+ | id | size | price | id2_size | id2_price | +-------+------+------+-------+----------+-----------+ | time | | | | | | +-------+------+------+-------+----------+-----------+ | t0 | id1 | 10 | 110 | nan | nan | | t1 | id2 | 10 | 110 | 9 | 97 | | t2 | id1 | 12 | 109 | 9 | 97 | | t3 | id2 | 12 | 109 | 15 | 94 | | t5 | id2 | 12 | 109 | 13 | 100 | | t6 | id1 | 20 | 108 | 13 | 100 | +-------+------+------+-------+----------+-----------+
my current method klunky in have rename columns of 1 of dataframes.
i believe better way represent data use multiindex 2nd dimension's value coming id
column.
the resulting dataframe this:
+--------------+--------------+ | id1 | id2 | +------+-------+------+-------+ | size | price | size | price | +-------+------+-------+------+-------+ | time | | | | | +-------+------+-------+------+-------+ | t0 | 10 | 110 | nan | nan | | t1 | 10 | 110 | 9 | 97 | | t2 | 12 | 109 | 9 | 97 | | t3 | 12 | 109 | 15 | 94 | | t5 | 12 | 109 | 13 | 100 | | t6 | 20 | 108 | 13 | 100 | +-------+------+-------+------+-------+
is possible?
if so, steps required go 2 dataframes read csv, final merged multiindexed dataframe?
here's one-liner ask, although it's bit convoluted in terms of stacking/unstacking:
df1.append(df2).set_index(['time','id']).sort().stack().unstack(level=[1,2]).ffill() id id1 id2 size price size price time t0 10 110 nan nan t1 10 110 9 97 t2 12 109 9 97 t3 12 109 15 94 t5 12 109 13 100 t6 20 108 13 100
fwiw, default approach have been following, little more straightforward (less stacking/unstacking) , give same basic results, different column organization:
df1.append(df2).set_index(['time','id']).sort().unstack().ffill() size price id id1 id2 id1 id2 time t0 10 nan 110 nan t1 10 9 110 97 t2 12 9 109 97 t3 12 15 109 94 t5 12 13 109 100 t6 20 13 108 100
and along lines, add swaplevel
, sort
columns reorganized in first approach:
df1.append(df2).set_index(['time','id']).sort().unstack().ffill().swaplevel(0,1,axis=1).sort(axis=1)
Comments
Post a Comment