PANDAS LIBRARY
Pandas is an open-source Python library that offers high-performance data manipulation and analysis. Its name is derived from "Panel Data", an econometrics term for multidimensional data sets. Data analysis requires processing steps such as merging, cleansing, and restructuring, all of which can be done with Pandas.
Advantages of Pandas:
- It handles missing data easily.
- It provides an easy way to slice data.
- It provides flexible ways to merge, concatenate or reshape the data.
- Column and row operations are simple in Pandas.
For installing the Pandas library (run at the command prompt or terminal, not at the Python prompt):
pip install pandas
For importing the Pandas library:
>>> import pandas as pd
Common data structures used in Pandas are:
- Series (one-dimensional structure)
- DataFrame (two-dimensional structure)
- Panel (three-dimensional structure)
*Note: As per the CBSE curriculum, Panel is not in the course. Panel has also been removed from Pandas itself since version 0.25.
Creating Series
A Series is a one-dimensional labelled array that can store data of any type (integers, floats, strings, Python objects).
- The term "index" refers to the row labels of a series.
- A list, tuple, or dictionary can easily be converted into a series using the pd.Series() constructor.
- A series cannot have multiple columns, as it has a 1-D structure.
- Data in a series is mutable (i.e. changeable), but its size is immutable.
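The points above can be sketched as follows. The series name marks and its subject labels are illustrative and not part of the original notes.

```python
import pandas as pd

# A series with user-defined row labels (the index)
marks = pd.Series([78, 85, 92], index=["Maths", "Science", "Hindi"])

# Values are mutable: an existing element can be changed in place
marks["Science"] = 88

print(marks)
print(marks.ndim)   # number of dimensions of a series
```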
For creating empty Series :
>>> s1=pd.Series()
A Series can be created in several ways, using:
- Lists
- Array (using numpy)
- Dictionary
Creating Series Using Lists
>>> import pandas as pd
>>> list1=[1,2,3,"a","b","c"]
>>> s1=pd.Series(list1)
>>> print(s1)
Output:
0    1
1    2
2    3
3    a
4    b
5    c
dtype: object
Creating Series Using Dictionary
>>> dict1={"One":1,"Two":2,"Three":3,"Four":4,"Five":5}
>>> s2=pd.Series(dict1)
>>> print(s2)
Output:
One      1
Two      2
Three    3
Four     4
Five     5
dtype: int64
Note: Here, keys are used as index for the series.
Creating Series Using Array (Numpy)
NumPy is a Python library used for working with arrays. The name stands for "Numerical Python". It is one of the most commonly used packages for scientific computing and mathematical operations in Python. NumPy was created in 2005 by Travis Oliphant.
Installing NumPy
NumPy can be installed by typing the following command at the command prompt or terminal (not at the Python prompt):
pip install numpy
NumPy arrays are used to store collections of numerical data. They are a versatile and efficient data structure. The NumPy array is officially called ndarray but is commonly known as array.
Difference between List and array:
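A minimal sketch of two commonly cited differences, assuming illustrative variable names that are not from the original notes: arithmetic on a list repeats or joins it, while arithmetic on an array works element by element, and an array stores elements of a single data type.

```python
import numpy as np

list1 = [1, 2, 3]
arr1 = np.array([1, 2, 3])

# Multiplying a list repeats its elements
print(list1 * 2)    # [1, 2, 3, 1, 2, 3]

# Multiplying an array multiplies every element
print(arr1 * 2)     # [2 4 6]

# A list can hold mixed types; an array converts its elements to one type
mixed_list = [1, "a", 2.5]
float_arr = np.array([1, 2.5])
print(float_arr.dtype)   # float64
```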
Methods to create ndarray:
1. Using array()
>>> import numpy as np   #np is used as an alias for numpy
>>> arr1=np.array([2,3,5])
>>> print(arr1)
Output: [2 3 5]
>>> arr2=np.array([[10,20,30],[80,90,50]])
>>> print(arr2)
Output:
[[10 20 30]
 [80 90 50]]
2. Using arange() - This method returns an array of evenly spaced values within the given range. The stop value is not included.
Syntax: np.arange([start,] stop[, step], dtype=None)
>>> import numpy as np
>>> arr4=np.arange(5,10)
>>> print(arr4)
Output: [5 6 7 8 9]
>>> arr4=np.arange(5,10,2)
>>> print(arr4)
Output: [5 7 9]
Creating Series Using Array (numpy) - contd.
>>> array1=np.array([1,2,3,4,5])
>>> s3=pd.Series(array1)
>>> print(s3)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int32
Series Attributes
Series Methods
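A short sketch of some commonly used Series attributes and methods. The selection shown here and the series s are assumptions for illustration, not a list taken from the original notes.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20, 25], index=["a", "b", "c", "d", "e"], name="marks")

# Attributes (accessed without parentheses)
print(s.name)     # name given to the series
print(s.index)    # row labels
print(s.values)   # data as a NumPy array
print(s.size)     # number of elements
print(s.empty)    # True only if the series has no elements

# Methods (called with parentheses)
print(s.head(2))  # first 2 elements
print(s.tail(2))  # last 2 elements
print(s.count())  # number of non-NaN elements
```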
Accessing Elements of a Series
Elements can be accessed by indexing, by slicing, or with the .loc and .iloc attributes.
1. Indexing: It is used to extract an element stored in the series by providing its index. Indices can be of two types:
a. Positional index b. Labeled index
A positional index is an integer that corresponds to the position of the element in the series, starting from 0, whereas a labeled index is any user-defined label.
Example: see the "Positional Index" and "Labeled Index" listings that follow.
2. Slicing: It is used to extract a sequence of values from the series.
Syntax - Series[start:end:step]
Start - position from where the range starts.
End - position where the range stops. The element at the end position is not included in the result.
Step - gap between the selected positions. Use it to skip some values in between or, with a negative value, to get the values in reverse order.
{Note: start, end and step are all optional}
Example: see the ser[0:3] listing that follows.
3. Using iloc and loc
- iloc: The attribute .iloc takes integer values, i.e. positional value(s), for accessing a particular series element.
- loc: The attribute .loc takes row labels, i.e. user-defined indexes, for accessing a particular series element.
Examples:
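Positional Index
>>> ser=pd.Series([1,2,3,4])   #default index 0,1,2,3, so label and position coincide
>>> print(ser[2])
Output: 3
>>> print(ser[[2,3]])
Output:
2    3
3    4
dtype: int64
Labeled Index
>>> ser=pd.Series([1,2,3,4], index=[10,20,30,40])
>>> print(ser[20])
Output: 2
>>> print(ser[[20,40]])
Output:
20    2
40    4
dtype: int64
Note: when the labels are themselves integers, as in the second series, ser[2] is looked up as a label and raises KeyError because no label 2 exists. Use ser.iloc[2] to access such a series by position.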
>>> ser=pd.Series(['Unity','Integrity','Loyalty','Devotion'])
>>> ser[0:3]
0        Unity
1    Integrity
2      Loyalty
dtype: object
>>> ser[:2]
0        Unity
1    Integrity
dtype: object
>>> ser[0:3:2]
0      Unity
2    Loyalty
dtype: object
>>> ser
a     5
b    10
c    15
d    20
e    25
dtype: int32
>>> ser.loc['a':'d']
a     5
b    10
c    15
d    20
dtype: int32
>>> ser.iloc[1:3]
b    10
c    15
dtype: int32
Note: With .loc, both the start and the end labels are included in the result, but with .iloc, the end position is excluded from the result.
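The difference between .loc and .iloc is easiest to see on a series whose labels are integers. The following is a minimal sketch that reuses the integer-labelled series from the indexing example.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=[10, 20, 30, 40])

# .iloc works on positions (0, 1, 2, 3)
print(ser.iloc[2])       # element at position 2
print(ser.iloc[1:3])     # positions 1 and 2; end position 3 is excluded

# .loc works on labels (10, 20, 30, 40)
print(ser.loc[20])       # element with label 20
print(ser.loc[20:40])    # labels 20 to 40; end label 40 is included
```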
Conditional based extraction from the series
Elements can also be extracted on the basis of a condition applied to the data.
Example: see the ser>20 and ser[ser>15] listing that follows.
Mathematical Operations on Series
Pandas allows arithmetic between two series.
The operation is performed on elements with matching index labels. Where a label is present in only one of the series, the result for that label is NaN (Not a Number) by default.
Example: see the serA and serB listing that follows.
For mathematical operations on Series, the following methods can be used in place of operators:
- add()
- sub()
- mul()
- div()
The fill_value argument of these methods supplies a substitute for the missing value when a label is present in only one of the series, as demonstrated below:
>>> ser
a     5
b    10
c    15
d    20
e    25
dtype: int32
>>> ser>20   #if we only give the condition, we get a boolean result
a    False
b    False
c    False
d    False
e     True
dtype: bool
#but if we give the condition inside [ ], we get the rows that satisfy the condition.
>>> ser[ser>15]
d    20
e    25
dtype: int32
>>> import numpy as np
>>> serA=pd.Series(np.arange(5,30,5), index=['a','b','c','d','e'])
>>> serA
a     5
b    10
c    15
d    20
e    25
dtype: int32
>>> serB=pd.Series(np.arange(6,36,6), index=['x','y','c','d','e'])
>>> serB
x     6
y    12
c    18
d    24
e    30
dtype: int32
>>> print(serA+serB)
a     NaN
b     NaN
c    33.0
d    44.0
e    55.0
x     NaN
y     NaN
dtype: float64
>>> print(serA*serB)
a      NaN
b      NaN
c    270.0
d    480.0
e    750.0
x      NaN
y      NaN
dtype: float64
>>> print(serA,serB,sep="\n")   #sep="\n" prints each series on its own lines
a     5
b    10
c    15
d    20
e    25
dtype: int32
x     6
y    12
c    18
d    24
e    30
dtype: int32
>>> serA.add(serB,fill_value=0)
a     5.0
b    10.0
c    33.0
d    44.0
e    55.0
x     6.0
y    12.0
dtype: float64
>>> serA.sub(serB,fill_value=10)
a   -5.0
b    0.0
c   -3.0
d   -4.0
e   -5.0
x    4.0
y   -2.0
dtype: float64
>>> serA.mul(serB,fill_value=2)
a     10.0
b     20.0
c    270.0
d    480.0
e    750.0
x     12.0
y     24.0
dtype: float64
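The div() method follows the same pattern as add(), sub() and mul(). A minimal sketch with the same two series is given here; the choice of fill_value=1 is only an illustration.

```python
import numpy as np
import pandas as pd

serA = pd.Series(np.arange(5, 30, 5), index=["a", "b", "c", "d", "e"])
serB = pd.Series(np.arange(6, 36, 6), index=["x", "y", "c", "d", "e"])

# Labels a, b exist only in serA, so serB is taken as 1 there (5/1, 10/1).
# Labels x, y exist only in serB, so serA is taken as 1 there (1/6, 1/12).
result = serA.div(serB, fill_value=1)
print(result)
```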
Pandas Basics (Continued)
Creating Dataframe
A DataFrame is a two-dimensional labelled data structure, just like a table with rows and columns, in which each column can store a different type of data. The term "index" refers to the row labels and "columns" to the column labels, which makes it easy to access any element present in the table. Lists, dictionaries, arrays, and series can easily be converted into a dataframe using the pd.DataFrame() constructor.
For creating an empty DataFrame:
>>> df1=pd.DataFrame()
A DataFrame can be created in several ways, using:
- Arrays
- List of Dictionaries
- Dictionary of Lists
- Series
- Dictionary of Series
Creating DataFrame Using Arrays
>>> import numpy as np
>>> array1 = np.array([10,20,30])
>>> array2 = np.array([100,200,300])
>>> array3 = np.array([-100,-200,-300, -400])
>>> df1 = pd.DataFrame(array1) #From Single Array
>>> df2 = pd.DataFrame([array1,array2,array3],columns=["A","B","C","D"]) #From Multiple Array
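In the listing above, array3 has four elements while array1 and array2 have three. The following sketch, which assumes a recent Pandas version, shows what the resulting dataframe looks like: each array becomes one row, and the missing fourth value of the shorter arrays is expected to appear as NaN.

```python
import numpy as np
import pandas as pd

array1 = np.array([10, 20, 30])
array2 = np.array([100, 200, 300])
array3 = np.array([-100, -200, -300, -400])

# Each array forms one row; column D has a value only for array3
df2 = pd.DataFrame([array1, array2, array3], columns=["A", "B", "C", "D"])
print(df2)
print(df2.shape)
```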
Creating DataFrame Using List of Dictionaries
>>> listDict = [{'a':10, 'b':20}, {'a':5, 'b':10, 'c':20}]
>>> df1 = pd.DataFrame(listDict)
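To see the result of the statement above, the dataframe can be printed. In this sketch the dictionary keys become column labels, each dictionary becomes one row, and a key missing from a dictionary (here 'c' in the first one) gives NaN.

```python
import pandas as pd

listDict = [{'a': 10, 'b': 20}, {'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(listDict)

print(df1)
print(df1.columns)   # keys of the dictionaries
```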
Creating DataFrame Using Dictionary of Lists
>>> dict1 = {'State': ['Goa', 'Maharashtra', 'Delhi'], 'Population': [98438, 5481, 56835], 'Pollution': [27, 6.72, 16]}   #dict1 is used as the name because dict is a built-in type
>>> df1 = pd.DataFrame(dict1)
Creating DataFrame Using Series
>>> series1 = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])
>>> series2 = pd.Series ([100,200,-300,-400,-1000], index = ['a', 'b', 'c', 'd', 'e'])
>>> series3 = pd.Series([12,80,10,-30,10], index = ['z', 'y', 'a', 'c', 'e'])
>>> df1 = pd.DataFrame(series1) #From Single Series
>>> df1 = pd.DataFrame([series1,series2,series3]) #From Multiple Series
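When a dataframe is built from a list of series, each series becomes one row and the index labels of the series become column labels. The sketch below uses series1 and series3 from the listing above; a label that is absent from a series gives NaN in that row.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series3 = pd.Series([12, 80, 10, -30, 10], index=['z', 'y', 'a', 'c', 'e'])

# Columns are the union of the labels of both series
df = pd.DataFrame([series1, series3])
print(df)
print(df.shape)
```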
Creating DataFrame Using Dictionary of Series
>>> Result={ 'Vaibhav': pd.Series([90, 91, 97], index=['Maths','Science','Hindi']),
'Akshitaa': pd.Series([92, 81, 96], index=['Maths','Science','Hindi']),
'Keshav': pd.Series([81, 71, 67], index=['Maths','Science','Hindi']),
'Nikhil': pd.Series([94, 95, 99], index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(Result)
Attributes of Dataframe
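A short sketch of some commonly used DataFrame attributes, using two of the students from ResultDF. The selection of attributes shown is an assumption for illustration, not a list taken from the original notes.

```python
import pandas as pd

ResultDF = pd.DataFrame({
    'Vaibhav': pd.Series([90, 91, 97], index=['Maths', 'Science', 'Hindi']),
    'Akshitaa': pd.Series([92, 81, 96], index=['Maths', 'Science', 'Hindi']),
})

print(ResultDF.index)     # row labels
print(ResultDF.columns)   # column labels
print(ResultDF.dtypes)    # data type of each column
print(ResultDF.shape)     # (number of rows, number of columns)
print(ResultDF.size)      # total number of values
print(ResultDF.empty)     # True only if the dataframe has no data
print(ResultDF.T)         # transpose: rows and columns interchanged
```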
Handling CSV (Comma Separated Value) Files
Importing CSV Files to Dataframe
To import a CSV file into a dataframe:
>>> df1=pd.read_csv(filepath, parameters)   #filepath is the location (path) of the CSV file
Parameter
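A minimal sketch of three frequently used read_csv() parameters: sep, header and names. The CSV text and the column names are made up for illustration, and io.StringIO stands in for a file path so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Two data rows, no header row, values separated by commas
csv_text = "1,Asha,91\n2,Ravi,78\n"

df1 = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",                            # character that separates the values
    header=None,                        # the data has no header row
    names=["RollNo", "Name", "Marks"],  # column labels to use
)
print(df1)
```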
Exporting Dataframe to CSV Files
To export a dataframe to a CSV file:
>>> dataframename.to_csv(filepath, parameters)   #filepath is the location (path) where the file is saved
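A minimal sketch of exporting a dataframe and reading it back. The dataframe, the file name result.csv and the temporary folder are assumptions for illustration. The parameter index=False leaves the row labels out of the file, and header=True writes the column labels as the first line.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ravi"], "Marks": [91, 78]})

# Save into a temporary folder so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "result.csv")
df.to_csv(path, sep=",", index=False, header=True)

# Read the file back to check what was written
df_back = pd.read_csv(path)
print(df_back)
```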