I have a dataframe which has the following columns:
region_id
, name
, parent
, parent_name
, t2m
, d2m
, and tp
.
I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:
w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']
I have written the function agg_daily
to group column values by date
and region_id
and aggregate.
def agg_daily(df, common_cols, w_params, operation):
"""
Aggregate the data for each day.
Parameters
----------
df : pandas dataframe
Dataframe containing daily data.
Returns
-------
agg_daily_df : pandas dataframe
Dataframe containing aggregated data for each day.
"""
agg_daily_df = df.groupby(['date', 'region_id']).agg(
name=('name', 'first'),
parent=('parent', 'first'),
parent_name=('parent_name', 'first'),
t2m_max=('t2m', 'max'),
t2m_min=('t2m', 'min'),
t2m_mean=('t2m', 'mean'),
d2m=('d2m', 'mean'),
tp=('tp', 'sum')
).reset_index()
agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
return agg_daily_df
However, notice inside agg_daily
that the arguments within agg
, e.g. t2m_max
, t2m_min
, t2m_mean
are hard coded. Instead, I want to pass common_cols
, w_params
, operation
as arguments to agg_daily
, avoid the hard coding, and yet get the function agg_daily
perform the desired operation.
Note that for columns belonging to common_cols
, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params
, I want to create a column corresponding to the operation
being performed.
Can anyone help me get a customizable function?
I have a dataframe which has the following columns:
region_id
, name
, parent
, parent_name
, t2m
, d2m
, and tp
.
I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:
w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']
I have written the function agg_daily
to group column values by date
and region_id
and aggregate.
def agg_daily(df, common_cols, w_params, operation):
"""
Aggregate the data for each day.
Parameters
----------
df : pandas dataframe
Dataframe containing daily data.
Returns
-------
agg_daily_df : pandas dataframe
Dataframe containing aggregated data for each day.
"""
agg_daily_df = df.groupby(['date', 'region_id']).agg(
name=('name', 'first'),
parent=('parent', 'first'),
parent_name=('parent_name', 'first'),
t2m_max=('t2m', 'max'),
t2m_min=('t2m', 'min'),
t2m_mean=('t2m', 'mean'),
d2m=('d2m', 'mean'),
tp=('tp', 'sum')
).reset_index()
agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
return agg_daily_df
However, notice inside agg_daily
that the arguments within agg
, e.g. t2m_max
, t2m_min
, t2m_mean
are hard coded. Instead, I want to pass common_cols
, w_params
, operation
as arguments to agg_daily
, avoid the hard coding, and yet get the function agg_daily
perform the desired operation.
Note that for columns belonging to common_cols
, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params
, I want to create a column corresponding to the operation
being performed.
Can anyone help me get a customizable function?
Share Improve this question asked 2 days ago TRKTRK 2031 gold badge2 silver badges8 bronze badges2 Answers
Reset to default 0does this work for you ?
w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['name', 'parent', 'parent_name']
def agg_df(df, common_cols, w_params, operation):
# get list of col methods
cols = pd.DataFrame(zip(w_params, operation), columns=["col","method"]).groupby("col").agg(list).reset_index()
# create agg_dict and add common_cols methods
aggs = pd.concat([cols,pd.DataFrame(zip(common_cols,len(common_cols)*[["first"]]), columns=["col","method"])], ignore_index=True)
# aggregation with created dict
result_df = df.groupby(['date', 'region_id']).agg(aggs.set_index("col").method).sort_values(['region_id', 'date'], ascending=[True, True]).reset_index()
# and rename columns have multiindex
have_multi_method_cols = aggs[aggs.method.apply(len) > 1].col.tolist()
result_df.columns = result_df.columns.map(lambda x: x[0] if x[0] not in have_multi_method_cols else "_".join(x))
# return df
return result_df
agg_df(data, common_cols, w_params, operation)
unpack a dictionary created from the pairing of w_params and operation within the named aggregation:
def agg_daily(df, common_cols, w_params, operation):
mapped = zip(w_params, operation)
mapped = {f"{col}_{func}": (col, func) for col, func in mapped}
outcome = df.groupby(common_cols, as_index=False).agg(**mapped)
return outcome
Application:
data = {'model': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710'},
'mpg': {0: 21.0, 1: 21.0, 2: 22.8},
'cyl': {0: 6, 1: 6, 2: 4},
'disp': {0: 160.0, 1: 160.0, 2: 108.0},
'hp': {0: 110, 1: 110, 2: 93},
'drat': {0: 3.9, 1: 3.9, 2: 3.85},
'wt': {0: 2.62, 1: 2.875, 2: 2.32},
'qsec': {0: 16.46, 1: 17.02, 2: 18.61},
'vs': {0: 0, 1: 0, 2: 1},
'am': {0: 1, 1: 1, 2: 1},
'gear': {0: 4, 1: 4, 2: 4},
'carb': {0: 4, 1: 4, 2: 1}}
mtcars = pd.DataFrame(data)
agg_daily(df=mtcars,
common_cols='cyl',
w_params=['disp','hp','drat'],
operation=['min','max','min'])
cyl disp_min hp_max drat_min
0 4 108.0 93 3.85
1 6 160.0 110 3.90
ideally, you'll add some checks - w_params should be same length as operation, the entries in operation should be strings (if they are not you have to consider how to grab the names - .__name__()
possibly), ...