最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

pandas - How to customize column names while grouping and aggregating? - Stack Overflow

programmeradmin1浏览0评论

I have a dataframe which has the following columns: region_id, name, parent, parent_name, t2m, d2m, and tp.

I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:

w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']

I have written the function agg_daily to group column values by date and region_id and aggregate.

def agg_daily(df, common_cols, w_params, operation):
    """
    Aggregate the data for each day.

    Parameters
    ----------
    df : pandas dataframe
        Dataframe containing daily data.

    Returns
    -------
    agg_daily_df : pandas dataframe
        Dataframe containing aggregated data for each day.

    """
    agg_daily_df = df.groupby(['date', 'region_id']).agg(
        name=('name', 'first'),
        parent=('parent', 'first'),
        parent_name=('parent_name', 'first'),
        t2m_max=('t2m', 'max'),
        t2m_min=('t2m', 'min'),
        t2m_mean=('t2m', 'mean'),
        d2m=('d2m', 'mean'),
        tp=('tp', 'sum')
    ).reset_index()
    agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
    return agg_daily_df

However, notice inside agg_daily that the arguments within agg, e.g. t2m_max, t2m_min, t2m_mean are hard coded. Instead, I want to pass common_cols, w_params, operation as arguments to agg_daily, avoid the hard coding, and yet get the function agg_daily perform the desired operation.

Note that for columns belonging to common_cols, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params, I want to create a column corresponding to the operation being performed.

Can anyone help me get a customizable function?

I have a dataframe which has the following columns: region_id, name, parent, parent_name, t2m, d2m, and tp.

I want to group and aggregate column values in a specific way. To enable that, I have defined the following lists:

w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['region_id', 'name', 'parent', 'parent_name']

I have written the function agg_daily to group column values by date and region_id and aggregate.

def agg_daily(df, common_cols, w_params, operation):
    """
    Aggregate the data for each day.

    Parameters
    ----------
    df : pandas dataframe
        Dataframe containing daily data.

    Returns
    -------
    agg_daily_df : pandas dataframe
        Dataframe containing aggregated data for each day.

    """
    agg_daily_df = df.groupby(['date', 'region_id']).agg(
        name=('name', 'first'),
        parent=('parent', 'first'),
        parent_name=('parent_name', 'first'),
        t2m_max=('t2m', 'max'),
        t2m_min=('t2m', 'min'),
        t2m_mean=('t2m', 'mean'),
        d2m=('d2m', 'mean'),
        tp=('tp', 'sum')
    ).reset_index()
    agg_daily_df = agg_daily_df.sort_values(['region_id', 'date'], ascending=[True, True]).reset_index(drop=True)
    return agg_daily_df

However, notice inside agg_daily that the arguments within agg, e.g. t2m_max, t2m_min, t2m_mean are hard coded. Instead, I want to pass common_cols, w_params, operation as arguments to agg_daily, avoid the hard coding, and yet get the function agg_daily perform the desired operation.

Note that for columns belonging to common_cols, I do not wish to create a new column name in the final output. However, for the columns belonging to w_params, I want to create a column corresponding to the operation being performed.

Can anyone help me get a customizable function?

Share Improve this question asked 2 days ago TRKTRK 2031 gold badge2 silver badges8 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 0

does this work for you ?

w_params = ['t2m', 't2m', 't2m', 'd2m', 'tp']
operation = ['max', 'min', 'mean', 'mean', 'sum']
common_cols = ['name', 'parent', 'parent_name']

def agg_df(df, common_cols, w_params, operation):

    # get list of col methods
    cols = pd.DataFrame(zip(w_params, operation), columns=["col","method"]).groupby("col").agg(list).reset_index()
    # create agg_dict and add common_cols methods
    aggs = pd.concat([cols,pd.DataFrame(zip(common_cols,len(common_cols)*[["first"]]), columns=["col","method"])], ignore_index=True)
    # aggregation with created dict
    result_df = df.groupby(['date', 'region_id']).agg(aggs.set_index("col").method).sort_values(['region_id', 'date'], ascending=[True, True]).reset_index()
    # and rename columns have multiindex
    have_multi_method_cols = aggs[aggs.method.apply(len) > 1].col.tolist()
    result_df.columns = result_df.columns.map(lambda x: x[0] if x[0] not in have_multi_method_cols else "_".join(x))
    # return df
    return result_df

agg_df(data, common_cols, w_params, operation)

unpack a dictionary created from the pairing of w_params and operation within the named aggregation:

def agg_daily(df, common_cols, w_params, operation):
    mapped = zip(w_params, operation)
    mapped = {f"{col}_{func}": (col, func) for col, func in mapped}
    outcome = df.groupby(common_cols, as_index=False).agg(**mapped)
    return outcome

Application:

data = {'model': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710'},
 'mpg': {0: 21.0, 1: 21.0, 2: 22.8},
 'cyl': {0: 6, 1: 6, 2: 4},
 'disp': {0: 160.0, 1: 160.0, 2: 108.0},
 'hp': {0: 110, 1: 110, 2: 93},
 'drat': {0: 3.9, 1: 3.9, 2: 3.85},
 'wt': {0: 2.62, 1: 2.875, 2: 2.32},
 'qsec': {0: 16.46, 1: 17.02, 2: 18.61},
 'vs': {0: 0, 1: 0, 2: 1},
 'am': {0: 1, 1: 1, 2: 1},
 'gear': {0: 4, 1: 4, 2: 4},
 'carb': {0: 4, 1: 4, 2: 1}}

mtcars = pd.DataFrame(data)
agg_daily(df=mtcars, 
          common_cols='cyl', 
          w_params=['disp','hp','drat'], 
          operation=['min','max','min'])
   cyl  disp_min  hp_max  drat_min
0    4     108.0      93      3.85
1    6     160.0     110      3.90

ideally, you'll add some checks - w_params should be same length as operation, the entries in operation should be strings (if they are not you have to consider how to grab the names - .__name__() possibly), ...

发布评论

评论列表(0)

  1. 暂无评论