Data Cleaning and Preprocessing in Pandas

Data cleaning is an essential step in preparing raw data for analysis. Pandas provides several tools for handling missing data, replacing values, removing duplicates, and changing data types.

Handling Missing Data

Missing data can occur in datasets for various reasons. Pandas offers methods to identify and handle these missing values.

Identifying Missing Data

import pandas as pd
 
# Sample DataFrame
data = {
    'Name': ['Anika', 'Rahul', None],
    'Age': [25, None, 22],
    'City': ['Delhi', 'Mumbai', None]
}
df = pd.DataFrame(data)
 
print(df.isnull())  # Check for missing values
print(df.notnull()) # Check for non-missing values

Output:

    Name    Age   City
0  False  False  False
1  False   True  False
2   True  False   True

Dropping Missing Values

# Drop rows with missing values
df_dropped = df.dropna()
print(df_dropped)
 
# Drop columns with missing values
df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)

Filling Missing Values

# Fill missing values with a specified value
df_filled = df.fillna("Unknown")
print(df_filled)
 
# Fill missing values with different values for each column
df_filled_custom = df.fillna({"Name": "No Name", "Age": 0, "City": "Unknown"})
print(df_filled_custom)

Replacing Values

Replace specific values in the dataset using the replace() method.

# Replace values in a column
df_replaced = df.replace({"Delhi": "New Delhi", None: "Unavailable"})
print(df_replaced)

Removing Duplicates

Identify and remove duplicate rows using the duplicated() and drop_duplicates() methods.

# Sample DataFrame with duplicates
data = {
    'Name': ['Anika', 'Rahul', 'Anika'],
    'Age': [25, 30, 25],
    'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
 
print(df.duplicated())  # Identify duplicates
 
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

Changing Data Types

Convert data types using the astype() method.

# Sample DataFrame
data = {
    'Name': ['Anika', 'Rahul', 'Sneha'],
    'Age': ['25', '30', '22'],
    'Salary': ['50000', '60000', '45000']
}
df = pd.DataFrame(data)
 
# Convert Age and Salary to integers
df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(float)
print(df.dtypes)

Try It Yourself

Problem 1: Handle Missing Data

Create a DataFrame with some missing values. Fill missing values in one column with a specific value and drop rows with missing values in another column.

Show Code

data = {
    'Product': ['Laptop', 'Phone', None],
    'Price': [80000, None, 20000],
    'Stock': [50, 150, None]
}
df = pd.DataFrame(data)
 
# Fill missing Product values with "Unknown"
df['Product'] = df['Product'].fillna("Unknown")
 
# Drop rows with missing values in the Price column
df_cleaned = df.dropna(subset=['Price'])
print(df_cleaned)

Problem 2: Remove Duplicates and Change Data Types

Create a DataFrame with duplicate rows. Remove the duplicates and convert a column to a different data type.

Show Code

data = {
    'Name': ['Anika', 'Rahul', 'Anika'],
    'Age': ['25', '30', '25'],
    'Salary': ['50000', '60000', '50000']
}
df = pd.DataFrame(data)
 
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
 
# Convert Age and Salary to numeric types
df_no_duplicates['Age'] = df_no_duplicates['Age'].astype(int)
df_no_duplicates['Salary'] = df_no_duplicates['Salary'].astype(float)
 
print(df_no_duplicates)

Pyground

Play with Python!

Output:

Data Selection and Filtering Data Aggregation and Grouping