Pandas groupby

Here we show a few examples of how to use the pandas groupby function to:

Filter data: select groups with given characteristics
Transform data: add a new column based on group characteristics (e.g. normalization within groups)
Aggregation: calculate aggregative quantities for each group

All the examples are based on a dummy dataset like the following one: A_cat and B_cat are categorical variables, while C and D are numerical. The code to generate the dummy dataset can be found at the bottom of the page.

dummy dataset

Filtering

Select data such that grouping by A_cat will result in groups of 20 or more elements:

df = df.groupby('A_cat').filter(lambda df:len(df)>=20)

Select data such that grouping by B_cat will result in groups having low standard dev on variable C:

df = df.groupby('B_cat').filter(lambda df:df.C.std() < 2)

Transformations

Create column C_std storing the standard dev of variable C withing the B_cat groups:

df = df.assign(C_std = lambda df:df.groupby('B_cat').C.transform('std'))

Aggregations

Aggregate A_cat groups by calculating the mean of the C columns and the standard error of the mean of the D column:

df.groupby('A_cat').agg(
    C_mean = ('C', np.mean),
    D_sem = ('D', 'sem')
)

agg

For custom aggregations, which need mixing more columns we can use the apply function:

def custom_agg(df):
    C_mean = df.C.mean()
    D_std = df.D.std()
    return pd.Series(dict(
        C_std_D = C_mean/D_std,
        C_mean = C_mean
    ))

df.groupby('A_cat').apply(custom_agg)

apply

Utilities

df.groupby('A_cat').ngroups           # get number of groups
df.groupby('A_cat').groups            # get dictionary of data grouped
df.groupby('A_cat').size()            # get the size of each group
df.groupby('A_cat').groups.keys()     # get the groups name
df.groupby('A_cat').get_group('1.0')  # get a single group by name

Multiple grouping

df.groupby(['A_cat', 'B_cat']).apply(custom_agg).head()

apply

df.groupby(['A_cat', 'B_cat']).apply(custom_agg).unstack().head()

apply

df.groupby(['A_cat', 'B_cat']).apply(custom_agg).swaplevel(axis='index').unstack().head()

apply

Grouping using index

The grouping can be based on the dataframe index by using a mapping function:

df.groupby(lambda idx: (idx%3)==True).sum()

apply

Grouping using bins

Grouping can be performed using a list of labels (sharing same len with dataframe). E.g. one can bin the data using the pandas cut function.

df.groupby(pd.cut(df.B_cat,
                  np.linspace(-8, 8, 6),
                  labels=['Very low', 'Low', 'Medium', 'High', 'Very high'])
            )[['C', 'D']].mean()

apply

Create a dummy dataset

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.normal(0, 2, size=(200,4)), columns=['A', 'B', 'C', 'D'])

df = (df
        .assign(A_cat = lambda df: np.round(df.A).astype('str')).drop('A', axis=1)
        .assign(B_cat = lambda df: np.round(df.B)).drop('B', axis=1)
)

df.head()