Introduction to Pandas
Pandas is one of the most powerful and widely used Python libraries for data manipulation and analysis. Its user-friendly data structures and rich set of functions make it a favorite among data scientists and analysts.
What is Pandas?
Pandas is an open-source Python library designed for:
- Manipulating and analyzing structured data.
- Handling large in-memory datasets efficiently.
- Providing easy-to-use data structures such as Series (1D) and DataFrame (2D).
It builds on NumPy and seamlessly integrates with other popular libraries like Matplotlib and Scikit-learn.
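As a minimal sketch of the NumPy integration, the example below passes a Series to a NumPy function and then extracts the underlying array. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series holding illustrative values (assumed data, not from a real dataset)
s = pd.Series([1.0, 4.0, 9.0])

# NumPy functions apply element-wise and return a Series with the same index
roots = np.sqrt(s)

# to_numpy() returns the values as a plain NumPy array
values = s.to_numpy()

print(roots.tolist())         # [1.0, 2.0, 3.0]
print(type(values).__name__)  # ndarray
```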
Key Features and Use Cases
Key Features:
- Data Structures: Offers Series and DataFrame for handling one-dimensional and two-dimensional data.
- Indexing: Enables intuitive indexing and selection of data.
- Data Cleaning: Simplifies handling missing data and duplicates.
- Data Aggregation: Provides powerful GroupBy operations for summarizing data.
- File Handling: Supports reading from and writing to multiple formats and data sources (CSV, Excel, JSON, SQL databases, etc.).
- Time Series: Offers tools for working with time-series data.
- Visualization: Includes basic plotting methods built on Matplotlib.
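A few of these features can be sketched together on a small table. The sales data, column names, and the choice to fill missing amounts with 0.0 are assumptions made for illustration; a real dataset may call for a different cleaning strategy.

```python
import pandas as pd

# Hypothetical sales records with one duplicate row and one missing amount
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [100.0, 100.0, None, 250.0, 50.0],
})

# Data cleaning: drop the repeated row, then fill the missing amount with 0.0
cleaned = sales.drop_duplicates().fillna({"amount": 0.0})

# Data aggregation: total amount per region using GroupBy
totals = cleaned.groupby("region")["amount"].sum()

print(totals["North"])  # 100.0
print(totals["South"])  # 300.0
```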
Use Cases:
- Data Wrangling: Cleaning and reshaping raw data for analysis.
- Exploratory Data Analysis (EDA): Summarizing datasets to uncover patterns.
- Data Aggregation: Summarizing sales, user behavior, and other metrics.
- Time Series Analysis: Stock market data, IoT data, etc.
- ETL Pipelines: Extracting, transforming, and loading data efficiently.
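To make the time series use case concrete, here is a small sketch that aggregates daily values into weekly totals. The readings are invented (the integers 1 through 14), and the start date is arbitrary.

```python
import pandas as pd

# Hypothetical daily readings for 14 days, starting Monday 2024-01-01
dates = pd.date_range("2024-01-01", periods=14, freq="D")
readings = pd.Series(range(1, 15), index=dates)

# Resample to weekly bins (weeks ending on Sunday by default) and sum each bin
weekly = readings.resample("W").sum()

print(weekly.tolist())  # [28, 77]
```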
Installing Pandas
To install Pandas, use the following command:
pip install pandas
If you’re using Anaconda, Pandas comes pre-installed. To update:
conda update pandas
Verifying Installation
After installation, verify the version using:
import pandas as pd
print(pd.__version__)
Importing and Basic Usage
To use Pandas, import it as pd (a convention in the Python community):
import pandas as pd
Creating a Series
A Series is a one-dimensional, array-like object in which each value is paired with an index label.
# Create a Series
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0    10
1    20
2    30
3    40
dtype: int64
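The left-hand column in that output is the index. As a brief sketch, the index can also be set explicitly and then used to look up values; the labels "a" through "d" are arbitrary choices for this example.

```python
import pandas as pd

# Same values as above, with hypothetical string labels as the index
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s["b"]        # label-based access
by_position = s.iloc[0]  # position-based access
large = s[s > 20]        # boolean filtering keeps matching entries

print(by_label)        # 20
print(by_position)     # 10
print(large.tolist())  # [30, 40]
```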
Creating a DataFrame
A DataFrame is a two-dimensional table-like data structure.
# Create a DataFrame
data = {
    'Name': ['Anika', 'Rahul', 'Sneha'],
    'Age': [25, 30, 22],
    'City': ['Delhi', 'Mumbai', 'Bangalore']
}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age       City
0  Anika   25      Delhi
1  Rahul   30     Mumbai
2  Sneha   22  Bangalore
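Once a DataFrame exists, columns and rows can be selected from it. The sketch below rebuilds the same table and shows two common selections; the age threshold of 24 is an arbitrary value chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Anika", "Rahul", "Sneha"],
    "Age": [25, 30, 22],
    "City": ["Delhi", "Mumbai", "Bangalore"],
})

ages = df["Age"]              # selecting one column returns a Series
over_24 = df[df["Age"] > 24]  # a boolean mask keeps only the matching rows

print(df.shape)                  # (3, 3)
print(ages.tolist())             # [25, 30, 22]
print(over_24["Name"].tolist())  # ['Anika', 'Rahul']
```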
Reading a CSV File
Pandas makes it easy to load datasets from files.
# Reading a CSV file
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows
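The snippet above needs a data.csv file on disk. For a version that runs without one, the sketch below reads the same kind of content from an in-memory string using io.StringIO; the CSV text is assumed sample data, not the contents of any real file.

```python
import io

import pandas as pd

# Assumed CSV content standing in for a file on disk
csv_text = (
    "Name,Age,City\n"
    "Anika,25,Delhi\n"
    "Rahul,30,Mumbai\n"
    "Sneha,22,Bangalore\n"
)

# read_csv accepts file-like objects as well as file paths
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # Display the first 2 rows
print(df.shape)    # (3, 3)
```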
Try It Yourself
Problem 1: Create a Series
Create a Pandas Series from a list of integers [1, 2, 3, 4, 5] and display it.
import pandas as pd
# Create a Series
series = pd.Series([1, 2, 3, 4, 5])
print(series)
Problem 2: Create a DataFrame
Create a DataFrame with columns Product, Price, and Stock. Add sample data for 3 products and display the DataFrame.
import pandas as pd
# Create a DataFrame
data = {
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Price': [80000, 30000, 20000],
    'Stock': [50, 150, 100]
}
df = pd.DataFrame(data)
print(df)
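As an optional follow-up to Problem 2, one possible next step is to derive a new column from existing ones. The StockValue column name is a hypothetical choice for this sketch; it holds Price multiplied by Stock for each product.

```python
import pandas as pd

products = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Price": [80000, 30000, 20000],
    "Stock": [50, 150, 100],
})

# Column arithmetic is element-wise: each row's Price times its Stock
products["StockValue"] = products["Price"] * products["Stock"]

print(products["StockValue"].tolist())  # [4000000, 4500000, 2000000]
print(products["StockValue"].sum())     # 10500000
```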