Advanced Pandas

As you become more proficient in Pandas, mastering its advanced features can significantly improve your data analysis workflows. This page covers MultiIndex DataFrames, pivot tables, crosstabs, and performance tuning techniques.


MultiIndex DataFrames

MultiIndex DataFrames allow hierarchical indexing, enabling you to work with complex datasets more effectively.

Creating a MultiIndex DataFrame

import pandas as pd
 
# Creating a MultiIndex DataFrame
data = {
    "Year": [2020, 2020, 2021, 2021],
    "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "Population": [30, 20, 35, 25]
}
df = pd.DataFrame(data)
df.set_index(["Year", "City"], inplace=True)
 
print(df)

Output:

               Population
Year City               
2020 Delhi            30
     Mumbai           20
2021 Delhi            35
     Mumbai           25

Accessing Data in MultiIndex

# Accessing data for a specific index
print(df.loc[(2020, "Delhi")])
 
# Accessing data for a specific level
print(df.xs(2020, level="Year"))
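
When you need every row for one value of a level, sorting the index first makes label-based slicing reliable. A minimal sketch building on the df above, using pd.IndexSlice (the slicer object Pandas provides for MultiIndex selection):

# Sort the MultiIndex so label-based slicing works predictably
df_sorted = df.sort_index()

# Select all cities for 2021 across the "Year" level
idx = pd.IndexSlice
print(df_sorted.loc[idx[2021, :], :])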

Pivot Tables and Crosstabs

Pivot tables and crosstabs are powerful tools for summarizing data.

Creating a Pivot Table

# Sample DataFrame
data = {
    "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "Year": [2020, 2020, 2021, 2021],
    "Sales": [100, 200, 150, 250]
}
df = pd.DataFrame(data)
 
# Creating a pivot table
pivot = df.pivot_table(values="Sales", index="Year", columns="City", aggfunc="sum")
print(pivot)

Output:

City   Delhi  Mumbai
Year                
2020    100     200
2021    150     250
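
pivot_table can also apply more than one aggregation at a time and append grand totals with margins=True. A short sketch reusing the same df; the variable name pivot_detailed is just for illustration:

# Pivot with two aggregations and an "All" row/column of totals
pivot_detailed = df.pivot_table(
    values="Sales",
    index="Year",
    columns="City",
    aggfunc=["sum", "mean"],
    margins=True
)
print(pivot_detailed)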

Creating a Crosstab

# Creating a crosstab
crosstab = pd.crosstab(df["City"], df["Year"])
print(crosstab)

Output:

Year   2020  2021
City              
Delhi     1     1
Mumbai    1     1
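
Beyond counting, pd.crosstab accepts a values column with an aggfunc, and normalize can turn counts into proportions. A small sketch with the same df; sales_by_city_year is an illustrative name:

# Aggregate a values column instead of counting occurrences
sales_by_city_year = pd.crosstab(
    df["City"], df["Year"],
    values=df["Sales"], aggfunc="sum"
)
print(sales_by_city_year)

# normalize="all" expresses each cell as a share of the total
print(pd.crosstab(df["City"], df["Year"], normalize="all"))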

Performance Tuning with Pandas

When working with large datasets, efficiency is critical. Here are some tips to improve performance:

1. Using eval() and query() for Speed

# Using eval() for faster column arithmetic
df["Total"] = df.eval("Sales * 2")
print(df)
 
# Using query for efficient filtering
filtered = df.query("City == 'Delhi' and Year == 2021")
print(filtered)
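
query() can also reference Python variables with the @ prefix, which keeps filter strings readable when values come from elsewhere in your code. A quick sketch; target_year is a hypothetical variable:

# Reference a local Python variable inside query() with @
target_year = 2021
print(df.query("City == 'Delhi' and Year == @target_year"))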

2. Optimizing Memory Usage

# Downcasting numeric columns
df["Sales"] = pd.to_numeric(df["Sales"], downcast="integer")
print(df.dtypes)
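
Repetitive string columns are another common source of memory bloat; converting them to the category dtype usually shrinks them considerably. A minimal sketch on the same df:

# Convert a low-cardinality string column to the category dtype
df["City"] = df["City"].astype("category")

# deep=True counts the actual memory used by object/category data
print(df.memory_usage(deep=True))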

3. Parallelizing Operations

# Using Dask for parallelized Pandas operations
import dask.dataframe as dd
 
dask_df = dd.from_pandas(df, npartitions=2)
print(dask_df.compute())
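
Note that Dask pays off when the data is too large to process comfortably in memory; for a small frame like this one, the partitioning overhead outweighs any gain, so the example above only illustrates the API.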

Try It Yourself

Problem 1: MultiIndex DataFrame

Create a MultiIndex DataFrame for a dataset containing years, cities, and temperatures. Access data for a specific year and city.

Solution:
import pandas as pd
 
# Create MultiIndex DataFrame
data = {
    "Year": [2020, 2020, 2021, 2021],
    "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "Temperature": [40, 35, 42, 38]
}
df = pd.DataFrame(data).set_index(["Year", "City"])
print(df)
 
# Access specific data
print(df.loc[(2020, "Delhi")])

Problem 2: Create a Pivot Table

Given a dataset of students’ marks in different subjects, create a pivot table to summarize the average marks per subject for each class.

Solution:
import pandas as pd
 
# Sample Data
data = {
    "Class": ["10th", "10th", "11th", "11th"],
    "Subject": ["Math", "Science", "Math", "Science"],
    "Marks": [85, 90, 88, 92]
}
df = pd.DataFrame(data)
 
# Create Pivot Table
pivot = df.pivot_table(values="Marks", index="Class", columns="Subject", aggfunc="mean")
print(pivot)
