Pandas is an easy-to-use, flexible tool that many data scientists rely on for data analysis, manipulation, and visualization. It is my go-to library for quick analytic work at the start of any project.
Its core data structure is the DataFrame, which is similar to the data frame in R. It allows for a lot of flexibility and works well in tandem with Jupyter Notebook for ad-hoc analysis and reporting.
Let's download arrival flight data for the last three days of March 2016. We can obtain it from the Bureau of Transportation Statistics:
https://transtats.bts.gov/
```python
import pandas as pd

# Skip the report header rows at the top of the BTS export.
df = pd.read_csv('data/Detailed_Statistics_Arrivals.csv', sep=',', skiprows=6)
```
| | Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Origin Airport | Scheduled Arrival Time | Actual Arrival Time | Scheduled Elapsed Time (Minutes) | Actual Elapsed Time (Minutes) | Arrival Delay (Minutes) | Wheels-on Time | Taxi-In time (Minutes) | Delay Carrier (Minutes) | Delay Weather (Minutes) | Delay National Aviation System (Minutes) | Delay Security (Minutes) | Delay Late Aircraft Arrival (Minutes) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DL | 03/29/2016 | 5.0 | N3734B | SLC | 05:21 | 05:18 | 206.0 | 204.0 | -3.0 | 05:12 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | DL | 03/29/2016 | 8.0 | N803DN | MIA | 19:50 | 20:27 | 130.0 | 132.0 | 37.0 | 20:20 | 7.0 | 0.0 | 0.0 | 2.0 | 0.0 | 35.0 |
| 2 | DL | 03/29/2016 | 14.0 | N947DN | RSW | 16:48 | 16:47 | 113.0 | 117.0 | -1.0 | 16:34 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | DL | 03/29/2016 | 16.0 | N175DZ | SEA | 18:31 | 18:17 | 291.0 | 276.0 | -14.0 | 18:09 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | DL | 03/29/2016 | 23.0 | N927DA | MIA | 07:28 | 07:07 | 123.0 | 105.0 | -21.0 | 07:01 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Data wrangling is a very common and important step: inspecting the data and then shaping it into a usable format for building a model or doing analysis. Together with data cleaning, it often takes up a majority of the time in a data science project.
Let's filter on all flights that originated at LAX.
```python
# Keep only rows whose origin airport is LAX.
LAX = df[df['Origin Airport'] == 'LAX']
```
Grouping with `groupby` is a great way to build groupings and perform aggregations and other operations.
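Here is a minimal sketch of what that looks like, using the column names from the BTS export above:

```python
# Total arrival delay (in minutes) per carrier.
delays = df.groupby('Carrier Code')['Arrival Delay (Minutes)'].sum()

# Several aggregations at once per origin airport.
stats = df.groupby('Origin Airport')['Arrival Delay (Minutes)'].agg(['mean', 'max', 'count'])
```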
Splitting columns is a common task when a single column holds a string, date, or other value that could be broken into two or more columns. The reverse also comes up: combining columns that make sense together, for example merging two categories like gender and age group into one.
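A quick sketch with this dataset (assuming no missing flight numbers; the `Flight ID` column name is my own for illustration):

```python
# Split the date string into separate month/day/year columns.
df[['Month', 'Day', 'Year']] = df['Date (MM/DD/YYYY)'].str.split('/', expand=True)

# Combine columns that make sense together: carrier code plus
# flight number as a single flight identifier.
df['Flight ID'] = df['Carrier Code'] + '-' + df['Flight Number'].astype(int).astype(str)
```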
Oftentimes data is stored in a record (long) format. Pivot tables are a great way to aggregate numeric data.
```python
# Total arrival delay by origin airport.
d = pd.pivot_table(df, values='Arrival Delay (Minutes)', index=['Origin Airport'], aggfunc='sum')

# Count of flights by carrier and origin airport.
pd.pivot_table(df, values='Flight Number', index=['Carrier Code', 'Origin Airport'], aggfunc='count')
```
| Carrier Code | Origin Airport | Flight Number |
|---|---|---|
| DL | ABE | 3 |
| | ABQ | 6 |
| | AGS | 9 |
| | ALB | 9 |
| | ATW | 3 |
| ... | ... | ... |
| | TUL | 10 |
| | TUS | 6 |
| | TYS | 5 |
| | VPS | 17 |
| | XNA | 3 |
122 rows × 1 columns
```python
# Unique tail numbers per origin airport, broken out by date.
# Assign to a new variable so the original DataFrame stays intact.
tails = pd.pivot_table(df, values='Tail Number', index=['Origin Airport'], columns='Date (MM/DD/YYYY)', aggfunc=['nunique'])
```
Data cleaning is how you harmonize and standardize the data. You will often come back to this step if you notice problems with your model: you may need outlier detection to remove problematic records, drop nulls, or drop variables and later add them back in. It's important to talk with subject matter experts and consult any available documentation to understand the meaning of each column.
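A minimal cleaning sketch on this dataset, using an IQR rule as one of several possible outlier-detection approaches:

```python
# Drop rows with a missing arrival delay (e.g. cancelled or diverted flights).
clean = df.dropna(subset=['Arrival Delay (Minutes)'])

# Keep delays within 1.5 IQRs of the interquartile range.
q1 = clean['Arrival Delay (Minutes)'].quantile(0.25)
q3 = clean['Arrival Delay (Minutes)'].quantile(0.75)
iqr = q3 - q1
clean = clean[clean['Arrival Delay (Minutes)'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```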
Building a quick function that loops through each column and reports the count, datatype, number of nulls, and a few unique values is helpful when you are trying to get an understanding of your data.
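One possible version of such a function (the name `describe_columns` is my own):

```python
def describe_columns(df, n_values=5):
    """Loop through each column and print its count, dtype,
    null count, and a few unique values."""
    for col in df.columns:
        print(col)
        print(f"  count:         {df[col].count()}")
        print(f"  dtype:         {df[col].dtype}")
        print(f"  nulls:         {df[col].isnull().sum()}")
        print(f"  unique values: {df[col].unique()[:n_values]}")

describe_columns(df)
```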