Advanced Pandas
As you become more proficient in Pandas, mastering its advanced features can significantly improve your data analysis workflows. This page covers MultiIndex DataFrames, pivot tables, crosstabs, and performance tuning techniques.
MultiIndex DataFrames
MultiIndex DataFrames allow hierarchical indexing, enabling you to work with complex datasets more effectively.
Creating a MultiIndex DataFrame
import pandas as pd
# Creating a MultiIndex DataFrame
data = {
"Year": [2020, 2020, 2021, 2021],
"City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
"Population": [30, 20, 35, 25]
}
df = pd.DataFrame(data)
df.set_index(["Year", "City"], inplace=True)
print(df)
Output:
Population
Year City
2020 Delhi 30
Mumbai 20
2021 Delhi 35
Mumbai 25
Accessing Data in MultiIndex
# Accessing data for a specific index
print(df.loc[(2020, "Delhi")])
# Accessing data for a specific level
print(df.xs(2020, level="Year"))
Pivot Tables and Crosstabs
Pivot tables and crosstabs are powerful tools for summarizing data.
Creating a Pivot Table
# Sample DataFrame
data = {
"City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
"Year": [2020, 2020, 2021, 2021],
"Sales": [100, 200, 150, 250]
}
df = pd.DataFrame(data)
# Creating a pivot table
pivot = df.pivot_table(values="Sales", index="Year", columns="City", aggfunc="sum")
print(pivot)
Output:
City Delhi Mumbai
Year
2020 100 200
2021 150 250
Creating a Crosstab
# Creating a crosstab
crosstab = pd.crosstab(df["City"], df["Year"])
print(crosstab)
Output:
Year 2020 2021
City
Delhi 1 1
Mumbai 1 1
Efficient Performance Tuning with Pandas
When working with large datasets, efficiency is critical. Here are some tips to improve performance:
1. Using eval()
and query()
for Speed
# Using eval for faster computation
df["Total"] = pd.eval("df['Sales'] * 2")
print(df)
# Using query for efficient filtering
filtered = df.query("City == 'Delhi' and Year == 2021")
print(filtered)
2. Optimizing Memory Usage
# Downcasting numeric columns
df["Sales"] = pd.to_numeric(df["Sales"], downcast="integer")
print(df.dtypes)
3. Parallelizing Operations
# Using Dask for parallelized Pandas operations
import dask.dataframe as dd
dask_df = dd.from_pandas(df, npartitions=2)
print(dask_df.compute())
Try It Yourself
Problem 1: MultiIndex DataFrame
Create a MultiIndex DataFrame for a dataset containing years, cities, and temperatures. Access data for a specific year and city.
Show Code
import pandas as pd
# Create MultiIndex DataFrame
data = {
"Year": [2020, 2020, 2021, 2021],
"City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
"Temperature": [40, 35, 42, 38]
}
df = pd.DataFrame(data).set_index(["Year", "City"])
print(df)
# Access specific data
print(df.loc[(2020, "Delhi")])
Problem 2: Create a Pivot Table
Given a dataset of students’ marks in different subjects, create a pivot table to summarize the average marks per subject for each class.
Show Code
import pandas as pd
# Sample Data
data = {
"Class": ["10th", "10th", "11th", "11th"],
"Subject": ["Math", "Science", "Math", "Science"],
"Marks": [85, 90, 88, 92]
}
df = pd.DataFrame(data)
# Create Pivot Table
pivot = df.pivot_table(values="Marks", index="Class", columns="Subject", aggfunc="mean")
print(pivot)