Home projects readings blog about

Having fun with Pandas!

Published on Tuesday, 05 April, 2016 pandas

Introduction

What is pandas?

Pandas is an easy and flexible tool used by many Data Scientists to work on data analysis, maniplulation, and visualization. It is the go-to library for me when beginning any project to do some quick analytic work.

Why pandas?

The use of a Dataframe which is similar to one used in R. It always for a lot of flexibility and works well in tandem with Jupyter Notebook for ad-hoc analysis and reporting.

Loading Data

Let's download Arrival flight data from the last three days of March in 2016. The Bureau of Transportation Statistics is where we can go to obtain this.

https://transtats.bts.gov/

```python import pandas as pd

df pd.read_csv('data/Detailed_Statistics_Arrivals.csv',sep=',',skiprows=6) ```

Carrier Code Date (MM/DD/YYYY) Flight Number Tail Number Origin Airport Scheduled Arrival Time Actual Arrival Time Scheduled Elapsed Time (Minutes) Actual Elapsed Time (Minutes) Arrival Delay (Minutes) Wheels-on Time Taxi-In time (Minutes) Delay Carrier (Minutes) Delay Weather (Minutes) Delay National Aviation System (Minutes) Delay Security (Minutes) Delay Late Aircraft Arrival (Minutes)
0 DL 03/29/2016 5.0 N3734B SLC 05:21 05:18 206.0 204.0 -3.0 05:12 6.0 0.0 0.0 0.0 0.0 0.0
1 DL 03/29/2016 8.0 N803DN MIA 19:50 20:27 130.0 132.0 37.0 20:20 7.0 0.0 0.0 2.0 0.0 35.0
2 DL 03/29/2016 14.0 N947DN RSW 16:48 16:47 113.0 117.0 -1.0 16:34 13.0 0.0 0.0 0.0 0.0 0.0
3 DL 03/29/2016 16.0 N175DZ SEA 18:31 18:17 291.0 276.0 -14.0 18:09 8.0 0.0 0.0 0.0 0.0 0.0
4 DL 03/29/2016 23.0 N927DA MIA 07:28 07:07 123.0 105.0 -21.0 07:01 6.0 0.0 0.0 0.0 0.0 0.0

Transforming Data

Data Wrangling

Data Wrangling is a very important step and very common. Inspecting the data and then working with it to put it into a usable format to either build a model with or do analysis. This with Data Cleaning often takes a majority of the time during a Data Science project.

Filtering

Let's filter on all flights that originated at LAX.

python LAX = df[(df['Origin Airport'] == "LAX")]

Group By

Great way to build groupings and do aggregations and other operations.

Splitting and Combining Columns

This task is used often when you have a column with a string, date, or another datatype that could be split into two columns. Also, combining columns that would make sense together maybe combining two categories like gender and age group.

Pivoting

Often times data is stored as a record format. Pivot tables are a great way to aggregate on numeric data.

python d = pd.pivot_table(df, values="Arrival Delay (Minutes)", index=['Origin Airport'],aggfunc = "sum") pd.pivot_table(df, values="Flight Number", index=['Carrier Code','Origin Airport'],aggfunc = "count")

Flight Number
Carrier Code Origin Airport
DL ABE 3
ABQ 6
AGS 9
ALB 9
ATW 3
... ...
TUL 10
TUS 6
TYS 5
VPS 17
XNA 3

122 rows × 1 columns

python df = pd.pivot_table(df, values="Tail Number", index=['Origin Airport'],columns='Date (MM/DD/YYYY)',aggfunc = ["nunique"])

Data Cleaning

Data Cleaning is a way to harmonize and standardize the data. Often, you will go back to this step if you notice problems with your model. Maybe, you will need to use outlier detection to remove problematic data. You will need to drop nulls and even drop or add variables back in. Its important to talk with subject matter experts or consult any and all documentation to understand the meaning of each column.

Data Profiling

Building a quick function that can loop through each column and explain the count, datatype, number of nulls, and a few unique values is helpful when you are trying to get an understanding of your data.

Dropping Nulls

Standardizing Data

Normalizing Data

Visualizing Data

Plotting

Writing to Disk