Data Aggregation and Grouping in Pandas
Pandas provides powerful tools for grouping and aggregating data, allowing you to analyze datasets effectively. This page covers groupby
operations, built-in aggregation methods, and applying custom functions.
GroupBy Operations
The groupby()
method splits data into groups based on a column or set of columns, enabling aggregation or transformation on each group.
Example: Grouping Data by a Column
import pandas as pd
# Sample DataFrame
data = {
'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Groceries'],
'Sales': [1000, 1500, 800, 1200, 500],
'Quantity': [5, 7, 10, 4, 20]
}
df = pd.DataFrame(data)
# Group by 'Category'
grouped = df.groupby('Category')
print(grouped)
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
To view the grouped data:
for name, group in grouped:
print(name)
print(group)
Aggregations
Aggregation refers to performing operations like sum
, mean
, count
, etc., on grouped data.
Built-In Aggregations
# Aggregate sales by category
sales_sum = grouped['Sales'].sum()
print(sales_sum)
Output:
Category
Clothing 2000
Electronics 2500
Groceries 500
Name: Sales, dtype: int64
Multiple Aggregations
You can apply multiple aggregation functions to a column or across multiple columns.
# Apply multiple aggregations
agg_result = grouped.agg({'Sales': ['sum', 'mean'], 'Quantity': 'sum'})
print(agg_result)
Output:
Sales Quantity
sum mean sum
Category
Clothing 2000 1000.0 14
Electronics 2500 1250.0 12
Groceries 500 500.0 20
Applying Custom Functions
Custom aggregation functions can be applied using the agg()
method or apply()
method.
Example: Custom Aggregation with agg()
# Custom function to calculate range (max - min)
def range_func(series):
return series.max() - series.min()
# Apply custom aggregation
df_range = grouped['Sales'].agg(range_func)
print(df_range)
Output:
Category
Clothing 400
Electronics 500
Groceries 0
Name: Sales, dtype: int64
Example: Custom Transformation with apply()
# Apply a custom function to each group
def normalize(group):
return group - group.mean()
normalized_sales = grouped['Sales'].apply(normalize)
print(normalized_sales)
Output:
0 -250.0
1 250.0
2 -200.0
3 200.0
4 0.0
Name: Sales, dtype: float64
Try It Yourself
Problem 1: Aggregate Grouped Data
Given the following DataFrame:
import pandas as pd
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Sales'],
'Salary': [50000, 60000, 70000, 80000, 55000],
'Experience': [5, 7, 10, 12, 6]
}
df = pd.DataFrame(data)
- Group by
Department
. - Calculate the total and average salary for each department.
Show Code
grouped = df.groupby('Department')
# Total and average salary
result = grouped['Salary'].agg(['sum', 'mean'])
print(result)
Problem 2: Apply Custom Functions
- Use the same DataFrame.
- Create a custom function to calculate the square of each
Salary
value within groups and apply it.
Show Code
# Custom function
def square_values(series):
return series ** 2
squared_salaries = grouped['Salary'].apply(square_values)
print(squared_salaries)