Pandas is an easy-to-use, flexible tool that many data scientists rely on for data analysis, manipulation, and visualization. It is my go-to library for quick analytic work at the start of any project.
Its core data structure is the DataFrame, which is similar to the data frame in R. It allows for a lot of flexibility and works well in tandem with Jupyter Notebook for ad-hoc analysis and reporting.
Let's download arrival flight data for the last three days of March 2016. We can obtain it from the Bureau of Transportation Statistics:
https://transtats.bts.gov/
```python
import pandas as pd

# Skip the report header rows at the top of the BTS export.
df = pd.read_csv('data/Detailed_Statistics_Arrivals.csv', sep=',', skiprows=6)
```
| | Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Origin Airport | Scheduled Arrival Time | Actual Arrival Time | Scheduled Elapsed Time (Minutes) | Actual Elapsed Time (Minutes) | Arrival Delay (Minutes) | Wheels-on Time | Taxi-In time (Minutes) | Delay Carrier (Minutes) | Delay Weather (Minutes) | Delay National Aviation System (Minutes) | Delay Security (Minutes) | Delay Late Aircraft Arrival (Minutes) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DL | 03/29/2016 | 5.0 | N3734B | SLC | 05:21 | 05:18 | 206.0 | 204.0 | -3.0 | 05:12 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | DL | 03/29/2016 | 8.0 | N803DN | MIA | 19:50 | 20:27 | 130.0 | 132.0 | 37.0 | 20:20 | 7.0 | 0.0 | 0.0 | 2.0 | 0.0 | 35.0 |
| 2 | DL | 03/29/2016 | 14.0 | N947DN | RSW | 16:48 | 16:47 | 113.0 | 117.0 | -1.0 | 16:34 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | DL | 03/29/2016 | 16.0 | N175DZ | SEA | 18:31 | 18:17 | 291.0 | 276.0 | -14.0 | 18:09 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | DL | 03/29/2016 | 23.0 | N927DA | MIA | 07:28 | 07:07 | 123.0 | 105.0 | -21.0 | 07:01 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Data wrangling is a very common and important step: inspecting the data and then shaping it into a usable format for building a model or doing analysis. Together with data cleaning, it often takes up a majority of the time in a data science project.
Let's filter on all flights that originated at LAX.
```python
# Keep only rows whose origin airport is LAX.
LAX = df[df['Origin Airport'] == 'LAX']
```
Grouping with `groupby` is a great way to build groupings and perform aggregations and other operations.
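Here is a minimal sketch of what that looks like, using the column names from the BTS export above:

```python
# Total arrival delay (in minutes) per carrier.
delays = df.groupby('Carrier Code')['Arrival Delay (Minutes)'].sum()

# Several aggregations at once per origin airport.
stats = df.groupby('Origin Airport')['Arrival Delay (Minutes)'].agg(['mean', 'max', 'count'])
```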
Splitting columns is a common task when a single column holds a string, date, or other value that could be broken into two or more columns. The reverse also comes up: combining columns that make sense together, for example merging two categories like gender and age group into one.
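A quick sketch with this dataset (assuming no missing flight numbers; the `Flight ID` column name is my own for illustration):

```python
# Split the date string into separate month/day/year columns.
df[['Month', 'Day', 'Year']] = df['Date (MM/DD/YYYY)'].str.split('/', expand=True)

# Combine columns that make sense together: carrier code plus
# flight number as a single flight identifier.
df['Flight ID'] = df['Carrier Code'] + '-' + df['Flight Number'].astype(int).astype(str)
```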
Oftentimes data is stored in a record (long) format. Pivot tables are a great way to aggregate numeric data.
```python
# Total arrival delay by origin airport.
d = pd.pivot_table(df, values='Arrival Delay (Minutes)', index=['Origin Airport'], aggfunc='sum')

# Count of flights by carrier and origin airport.
pd.pivot_table(df, values='Flight Number', index=['Carrier Code', 'Origin Airport'], aggfunc='count')
```
| Carrier Code | Origin Airport | Flight Number |
|---|---|---|
| DL | ABE | 3 |
| | ABQ | 6 |
| | AGS | 9 |
| | ALB | 9 |
| | ATW | 3 |
| ... | ... | ... |
| | TUL | 10 |
| | TUS | 6 |
| | TYS | 5 |
| | VPS | 17 |
| | XNA | 3 |
122 rows × 1 columns
```python
# Unique tail numbers per origin airport, broken out by date.
# Assign to a new variable so the original DataFrame stays intact.
tails = pd.pivot_table(df, values='Tail Number', index=['Origin Airport'], columns='Date (MM/DD/YYYY)', aggfunc=['nunique'])
```
Data cleaning is how you harmonize and standardize the data. You will often come back to this step if you notice problems with your model: you may need outlier detection to remove problematic records, drop nulls, or drop variables and later add them back in. It's important to talk with subject matter experts and consult any available documentation to understand the meaning of each column.
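A minimal cleaning sketch on this dataset, using an IQR rule as one of several possible outlier-detection approaches:

```python
# Drop rows with a missing arrival delay (e.g. cancelled or diverted flights).
clean = df.dropna(subset=['Arrival Delay (Minutes)'])

# Keep delays within 1.5 IQRs of the interquartile range.
q1 = clean['Arrival Delay (Minutes)'].quantile(0.25)
q3 = clean['Arrival Delay (Minutes)'].quantile(0.75)
iqr = q3 - q1
clean = clean[clean['Arrival Delay (Minutes)'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```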
Building a quick function that loops through each column and reports the count, datatype, number of nulls, and a few unique values is helpful when you are trying to get an understanding of your data.
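One possible version of such a function (the name `describe_columns` is my own):

```python
def describe_columns(df, n_values=5):
    """Loop through each column and print its count, dtype,
    null count, and a few unique values."""
    for col in df.columns:
        print(col)
        print(f"  count:         {df[col].count()}")
        print(f"  dtype:         {df[col].dtype}")
        print(f"  nulls:         {df[col].isnull().sum()}")
        print(f"  unique values: {df[col].unique()[:n_values]}")

describe_columns(df)
```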