Hypothesis Testing

Published on Saturday, 05 January, 2019 statistics

Comparing Two Means

Let's compare the PM2.5 levels of two American cities (Denver, CO and Yuma, AZ) to see whether their mean air quality readings are statistically different.


import requests
import json
import pandas as pd


class Measurement(object):
    """A single air quality reading returned by the OpenAQ API."""

    def __init__(self, record):
        self.location = record.get("location")
        self.parameter = record.get("parameter")
        self.value = record.get("value")
        self.date = record.get("date")
        self.unit = record.get("unit")
        self.coordinates = record.get("coordinates")
        self.country = record.get("country")
        self.city = record.get("city")


class DataObj(object):
    """The date field of a measurement, holding UTC and local timestamps."""

    def __init__(self, record):
        self.utc = record.get('utc')
        self.local = record.get('local')


def databuild(city, date_from, date_to):
    # Query parameters: US PM2.5 readings for one city and date range (up to 1000 rows).
    payload = {'country': 'US', 'city': city, 'date_from': date_from, 'date_to': date_to,
               'parameter': 'pm25', 'limit': '1000'}
    url = 'https://api.openaq.org/v1/measurements/'
    r = requests.get(url, params=payload)
    data = json.loads(r.text)['results']
    # Wrap each raw result in a Measurement object.
    m = [Measurement(x) for x in data]
    return m


a = databuild('DENVER','2017-12-01','2017-12-31')
b = databuild('Yuma','2017-12-01','2017-12-31')


denver = pd.DataFrame([[x.city,DataObj(x.date).utc,x.parameter,x.value] for x in a],columns=['city','date','type','value'])
yuma = pd.DataFrame([[x.city,DataObj(x.date).utc,x.parameter,x.value] for x in b],columns=['city','date','type','value'])

Two-sample Z test

In a two-sample z-test, as in a t-test, we take two independent groups of data and decide whether the means of the two groups are equal.
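To make the idea concrete, here is a minimal sketch of the two-sample z statistic computed by hand, using only the Python standard library. The function name `two_sample_z` and the sample readings are made up for illustration, and the sketch uses the textbook unpooled standard error, so its result may differ slightly from what a library routine reports.

```python
import math
import statistics


def two_sample_z(x1, x2, diff=0.0):
    """Return (z, two-sided p-value) for the hypothesis mean(x1) - mean(x2) == diff."""
    # Standard error of the difference between the two sample means (unpooled).
    se = math.sqrt(statistics.variance(x1) / len(x1) +
                   statistics.variance(x2) / len(x2))
    z = (statistics.mean(x1) - statistics.mean(x2) - diff) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Made-up readings, not real OpenAQ data.
city_a = [1, 2, 3, 4, 5]
city_b = [11, 12, 13, 14, 15]
z, p = two_sample_z(city_a, city_b)
print(z, p)  # z is -10.0 here, and the p-value is far below 0.05
```

A large absolute z value (and a correspondingly small p-value) is evidence against the two means being equal.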

We are testing whether the mean PM2.5 levels during December 2017 were the same. Our null hypothesis is that there is no statistical difference between the two means; that is, the hypothesized difference between the population means is 0.

from scipy import stats
from statsmodels.stats import weightstats as stests
ztest ,pval = stests.ztest(denver['value'], x2=yuma['value'], value=0,alternative='two-sided')
print(float(pval))
if pval<0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

We fail to reject the null hypothesis at a significance level of 0.05.

0.7228985828468157
accept null hypothesis

Let's add more samples to each group to see whether a larger sample gives a more accurate picture of the two air quality averages.

Let's add a year's worth of data.


a = databuild('DENVER','2016-12-01','2017-12-31')
b = databuild('Yuma','2016-12-01','2017-12-31')

We run the z-test code again. This time the p-value falls below 0.05, so the output is:

reject null hypothesis

We reject the null hypothesis, which indicates that the mean PM2.5 levels of the two cities are statistically different.
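One reason a larger sample can change the outcome is that, for the same difference in means, the standard error shrinks as the sample sizes grow, so the z statistic gets larger. The sketch below illustrates this with made-up numbers (a difference of 0.5, a standard deviation of 8, and sample sizes of 700 and 8000). None of these values come from the OpenAQ data, and the helper names are hypothetical.

```python
import math


def standard_error(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def z_and_p(diff, sd1, n1, sd2, n2):
    """z statistic and two-sided p-value for an observed difference in means."""
    z = diff / standard_error(sd1, n1, sd2, n2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical values: same difference and spread, two different sample sizes.
diff, sd = 0.5, 8.0
for n in (700, 8000):
    z, p = z_and_p(diff, sd, n, sd, n)
    decision = "reject null hypothesis" if p < 0.05 else "fail to reject null hypothesis"
    print(n, round(z, 2), decision)
```

With these assumed numbers, the smaller sample does not reach significance while the larger one does, even though the difference in means is identical.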