Python ModulesPandas TutorialData Aggregation and Grouping

Data Aggregation and Grouping in Pandas

Pandas provides powerful tools for grouping and aggregating data, allowing you to analyze datasets effectively. This page covers groupby operations, built-in aggregation methods, and applying custom functions.


GroupBy Operations

The groupby() method splits data into groups based on a column or set of columns, enabling aggregation or transformation on each group.

Example: Grouping Data by a Column

import pandas as pd
 
# Sample DataFrame
data = {
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Groceries'],
    'Sales': [1000, 1500, 800, 1200, 500],
    'Quantity': [5, 7, 10, 4, 20]
}
df = pd.DataFrame(data)
 
# Group by 'Category'
grouped = df.groupby('Category')
print(grouped)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

To view the grouped data:

for name, group in grouped:
    print(name)
    print(group)

Aggregations

Aggregation refers to performing operations like sum, mean, count, etc., on grouped data.

Built-In Aggregations

# Aggregate sales by category
sales_sum = grouped['Sales'].sum()
print(sales_sum)

Output:

Category
Clothing        2000
Electronics     2500
Groceries        500
Name: Sales, dtype: int64

Multiple Aggregations

You can apply multiple aggregation functions to a column or across multiple columns.

# Apply multiple aggregations
agg_result = grouped.agg({'Sales': ['sum', 'mean'], 'Quantity': 'sum'})
print(agg_result)

Output:

                 Sales         Quantity
                   sum    mean     sum
Category
Clothing         2000  1000.0       14
Electronics      2500  1250.0       12
Groceries         500   500.0       20

Applying Custom Functions

Custom aggregation functions can be applied using the agg() method or apply() method.

Example: Custom Aggregation with agg()

# Custom function to calculate range (max - min)
def range_func(series):
    return series.max() - series.min()
 
# Apply custom aggregation
df_range = grouped['Sales'].agg(range_func)
print(df_range)

Output:

Category
Clothing       400
Electronics    500
Groceries        0
Name: Sales, dtype: int64

Example: Custom Transformation with apply()

# Apply a custom function to each group
def normalize(group):
    return group - group.mean()
 
normalized_sales = grouped['Sales'].apply(normalize)
print(normalized_sales)

Output:

0   -250.0
1    250.0
2   -200.0
3    200.0
4      0.0
Name: Sales, dtype: float64

Try It Yourself

Problem 1: Aggregate Grouped Data

Given the following DataFrame:

import pandas as pd
 
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Sales'],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Experience': [5, 7, 10, 12, 6]
}
df = pd.DataFrame(data)
  1. Group by Department.
  2. Calculate the total and average salary for each department.
Show Code
grouped = df.groupby('Department')
 
# Total and average salary
result = grouped['Salary'].agg(['sum', 'mean'])
print(result)

Problem 2: Apply Custom Functions

  1. Use the same DataFrame.
  2. Create a custom function to calculate the square of each Salary value within groups and apply it.
Show Code
# Custom function
def square_values(series):
    return series ** 2
 
squared_salaries = grouped['Salary'].apply(square_values)
print(squared_salaries)

Pyground

Play with Python!

Output: