Let's compare the PM2.5 levels of two American cities (Denver, CO and Yuma, AZ) to see whether their mean air quality is statistically different.
import requests
import json
import pandas as pd  # needed for the DataFrames built below
class Measurement(object):
    """One measurement record from the API response."""
    def __init__(self, record):
        self.location = record.get("location")
        self.parameter = record.get("parameter")
        self.value = record.get("value")
        self.date = record.get("date")
        self.unit = record.get("unit")
        self.coordinates = record.get("coordinates")
        self.country = record.get("country")
        self.city = record.get("city")

class DataObj(object):
    """The date field of a measurement, holding UTC and local timestamps."""
    def __init__(self, record):
        self.utc = record.get('utc')
        self.local = record.get('local')
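def databuild(city, date_from, date_to):
    """Fetch up to 1000 PM2.5 measurements for a US city between two dates."""
    payload = {'country': 'US', 'city': city, 'date_from': date_from, 'date_to': date_to,
               'parameter': 'pm25', 'limit': '1000'}
    url = 'https://api.openaq.org/v1/measurements/'
    r = requests.get(url, params=payload)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    data = json.loads(r.text)['results']
    m = [Measurement(x) for x in data]
    return m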
a = databuild('DENVER','2017-12-01','2017-12-31')
b = databuild('Yuma','2017-12-01','2017-12-31')
denver = pd.DataFrame([[x.city,DataObj(x.date).utc,x.parameter,x.value] for x in a],columns=['city','date','type','value'])
yuma = pd.DataFrame([[x.city,DataObj(x.date).utc,x.parameter,x.value] for x in b],columns=['city','date','type','value'])
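To make the shape of each DataFrame row concrete, here is a minimal sketch of how a single record is flattened into `[city, date, type, value]`. The record is hypothetical: its field names mirror the ones read by `Measurement` and `DataObj` above, but the values are made up for illustration, and `to_row` is a helper name introduced only for this example.

```python
# Hypothetical record: field names follow Measurement/DataObj above,
# values are illustrative only and do not come from the API.
record = {
    "location": "Example Station",
    "parameter": "pm25",
    "value": 7.2,
    "date": {"utc": "2017-12-01T07:00:00.000Z",
             "local": "2017-12-01T00:00:00-07:00"},
    "unit": "µg/m³",
    "city": "DENVER",
    "country": "US",
}

def to_row(rec):
    """Return [city, UTC timestamp, parameter, value] for one record."""
    date = rec.get("date") or {}  # tolerate a missing date field
    return [rec.get("city"), date.get("utc"), rec.get("parameter"), rec.get("value")]

row = to_row(record)
print(row)
```

Each such row becomes one line of the `denver` or `yuma` DataFrame, with the `value` column holding the PM2.5 reading used in the test.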
In a two-sample z-test, as in a two-sample t-test, we compare two independent groups of data and decide whether their means are equal.
$$H_0: \mu_1 = \mu_2$$
We are testing whether the mean PM2.5 levels during December 2017 were the same in both cities. Our null hypothesis is that there is no difference between the two population means, so the hypothesized difference is 0.
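For reference, a two-sample z statistic with a pooled variance estimate can be written as follows, where $\Delta_0$ is the hypothesized difference (0 here) and $n_1$, $n_2$ are the sample sizes:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad s_p^2 = \frac{\sum_i (x_{1i} - \bar{x}_1)^2 + \sum_j (x_{2j} - \bar{x}_2)^2}{n_1 + n_2 - 2}$$

The sketch below computes this statistic and its two-sided p-value using only the standard library. The function name `two_sample_ztest` is hypothetical, and the pooled-variance form is an assumption about what a library routine such as statsmodels' `ztest` does by default; treat it as an illustration of the idea rather than a reference implementation.

```python
from math import sqrt
from statistics import NormalDist, mean

def two_sample_ztest(x1, x2, value=0.0):
    """Two-sided, pooled-variance two-sample z-test.

    Returns (z, p). `value` is the hypothesized difference in means.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Sums of squared deviations around each sample mean
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (m1 - m2 - value) / std_err
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p
```

A p-value below the chosen significance level (0.05 in this post) leads to rejecting the null hypothesis; otherwise we fail to reject it.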
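from scipy import stats
from statsmodels.stats import weightstats as stests

# Two-sided, two-sample z-test; value=0 is the hypothesized difference in means
z_stat, pval = stests.ztest(denver['value'], x2=yuma['value'], value=0, alternative='two-sided')
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    # strictly speaking, we fail to reject the null hypothesis
    print("accept null hypothesis")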
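0.7228985828468157
accept null hypothesis
The reported p-value of about 0.72 is well above the 0.05 significance level, so we fail to reject the null hypothesis: the December 2017 data give no evidence that the two cities' mean PM2.5 levels differ.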
Let's add more samples to each group, which should give a more accurate picture of the two cities' average air quality. Let's add a year's worth of data.
# databuild takes (city, date_from, date_to); the country is already fixed to 'US' inside it
a = databuild('DENVER','2016-12-01','2017-12-31')
b = databuild('Yuma','2016-12-01','2017-12-31')
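We run the z-test code again.
0.7228985828468157
reject null hypothesis
This time the test rejects the null hypothesis, indicating that the two cities' mean PM2.5 levels over the longer period are statistically different at the 0.05 level. Note that the printed p-value is identical to the December-only run and is inconsistent with a rejection at that level, so it should not be read as the p-value for the year-long sample.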
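Why can more data change the conclusion? For a fixed difference in means and a fixed spread, the standard error shrinks in proportion to $1/\sqrt{n}$, so the z statistic grows and the p-value falls as the samples get larger. The sketch below illustrates this with hypothetical numbers (a mean difference of 0.5 and a common standard deviation of 5.0); these are not the Denver or Yuma values, and `z_pvalue` is a helper name introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist

def z_pvalue(mean_diff, sd, n):
    """Two-sided p-value for two groups of size n sharing standard deviation sd."""
    std_err = sd * sqrt(2 / n)  # standard error of the difference in means
    z = mean_diff / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same hypothetical effect, increasing sample sizes
for n in (30, 300, 3000):
    print(n, z_pvalue(0.5, 5.0, n))
```

With these illustrative inputs the difference is not significant at n = 30 but is at n = 3000, which is the same pattern described above: a small real difference may only become detectable once enough measurements are included.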