Data Cleaning and Preprocessing in Pandas
Data cleaning is an essential step in preparing raw data for analysis. Pandas provides several tools for handling missing data, replacing values, removing duplicates, and changing data types.
Handling Missing Data
Missing data can occur in datasets for various reasons. Pandas offers methods to identify and handle these missing values.
Identifying Missing Data
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Anika', 'Rahul', None],
'Age': [25, None, 22],
'City': ['Delhi', 'Mumbai', None]
}
df = pd.DataFrame(data)
print(df.isnull()) # Check for missing values
print(df.notnull()) # Check for non-missing values
Output:
Name Age City
0 False False False
1 False True False
2 True False True
Dropping Missing Values
# Drop rows with missing values
df_dropped = df.dropna()
print(df_dropped)
# Drop columns with missing values
df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)
Filling Missing Values
# Fill missing values with a specified value
df_filled = df.fillna("Unknown")
print(df_filled)
# Fill missing values with different values for each column
df_filled_custom = df.fillna({"Name": "No Name", "Age": 0, "City": "Unknown"})
print(df_filled_custom)
Replacing Values
Replace specific values in the dataset using the replace()
method.
# Replace values in a column
df_replaced = df.replace({"Delhi": "New Delhi", None: "Unavailable"})
print(df_replaced)
Removing Duplicates
Identify and remove duplicate rows using the duplicated()
and drop_duplicates()
methods.
# Sample DataFrame with duplicates
data = {
'Name': ['Anika', 'Rahul', 'Anika'],
'Age': [25, 30, 25],
'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print(df.duplicated()) # Identify duplicates
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Changing Data Types
Convert data types using the astype()
method.
# Sample DataFrame
data = {
'Name': ['Anika', 'Rahul', 'Sneha'],
'Age': ['25', '30', '22'],
'Salary': ['50000', '60000', '45000']
}
df = pd.DataFrame(data)
# Convert Age and Salary to integers
df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(float)
print(df.dtypes)
Try It Yourself
Problem 1: Handle Missing Data
Create a DataFrame with some missing values. Fill missing values in one column with a specific value and drop rows with missing values in another column.
Show Code
data = {
'Product': ['Laptop', 'Phone', None],
'Price': [80000, None, 20000],
'Stock': [50, 150, None]
}
df = pd.DataFrame(data)
# Fill missing Product values with "Unknown"
df['Product'] = df['Product'].fillna("Unknown")
# Drop rows with missing values in the Price column
df_cleaned = df.dropna(subset=['Price'])
print(df_cleaned)
Problem 2: Remove Duplicates and Change Data Types
Create a DataFrame with duplicate rows. Remove the duplicates and convert a column to a different data type.
Show Code
data = {
'Name': ['Anika', 'Rahul', 'Anika'],
'Age': ['25', '30', '25'],
'Salary': ['50000', '60000', '50000']
}
df = pd.DataFrame(data)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
# Convert Age and Salary to numeric types
df_no_duplicates['Age'] = df_no_duplicates['Age'].astype(int)
df_no_duplicates['Salary'] = df_no_duplicates['Salary'].astype(float)
print(df_no_duplicates)