%load_ext load_style
%load_style talk.css
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
from IPython.display import Image, HTML
import numpy as np
from numpy import ma
import matplotlib.pyplot as plt
%matplotlib inline
SciPy is a library (though it can be thought of more as a scientific Python distribution) built on top of NumPy that provides a large set of standard scientific computing algorithms, organized into sub-packages such as scipy.stats, scipy.interpolate, scipy.optimize, scipy.integrate, scipy.linalg, scipy.signal and scipy.io.
We'll only have a look at some of these sub-packages, the ones most directly relevant to analyzing data
HTML("<iframe src='http://scipy.org/' width=1000 height=500></iframe>")
import scipy; print(scipy.__version__)
from scipy import stats
dir(stats)
stats.distributions  # type stats.distributions.<TAB> in the notebook to list the available distributions
# create a (continuous) random variable with a normal distribution
Y = stats.distributions.norm()
x = np.linspace(-5,5,100)
fig, axes = plt.subplots(3,1, sharex=True, figsize=(10,10))
# plot the probability distribution function (PDF)
axes[0].plot(x, Y.pdf(x))
axes[0].set_title('PDF')
# plot the cumulative distribution function (CDF)
axes[1].plot(x, Y.cdf(x));
axes[1].set_title('CDF')
# plot histogram of 1000 random realizations of the variable Y ~ N(0,1)
axes[2].hist(Y.rvs(size=1000), bins=20);
axes[2].set_title('Random Variable');
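A frozen distribution also exposes its moments and quantiles directly; as a quick sanity check (the central 95% interval of a standard normal is roughly $\pm 1.96$):
print(Y.mean(), Y.std())        # mean and standard deviation of the frozen N(0,1) variable
print(Y.ppf([0.025, 0.975]))    # quantiles bounding the central 95% interval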
A recurring statistical problem is finding estimates of the relevant parameters that correspond to the distribution that best represents our data. In parametric inference, we specify a priori a suitable distribution, then choose the parameters that best fit the data.
We're going to see how to do that in Python using two methods: the method of moments, and Maximum Likelihood Estimation (MLE).
A real life example: Monthly rainfall amounts at Auckland Airport station
import pandas as pd; print(pd.__version__)
data = pd.read_excel('../data/AKL_aero_rain_monthly.xlsx', sheetname='AKL',index_col=1)
let's look at the empirical distribution (thanks to Pandas)
data['rain'].hist(normed=True, bins=20)
There are a few possible choices, but one suitable alternative is the gamma distribution:
There are three different parametrizations in common use; we're going to use the one with a shape parameter $\alpha$ and a scale parameter $\beta$ (the shape/scale form used by SciPy's gamma; the corresponding rate parameter is $1/\beta$).
The method of moments simply assigns the empirical mean and variance to their theoretical counterparts, so that we can solve for the parameters.
So, for the gamma distribution, it turns out (see the relevant section of the Wikipedia article) that the mean and variance are $\mathrm{E}[X] = \alpha\beta$ and $\mathrm{Var}[X] = \alpha\beta^2$.
So, solving for $\alpha$ and $\beta$ using the sample mean ($\bar{X}$) and variance ($S^2$) gives $\alpha = \bar{X}^2 / S^2$ and $\beta = S^2 / \bar{X}$, and we can use a gamma distribution with these parameters to describe our data:
first step: we calculate the sample mean and variance, using pandas convenience methods
precip_mean = data['rain'].mean() # sample mean
precip_var = data['rain'].var() # sample variance
print("mean: {:4.2f}\nvariance: {:4.2f}".format(precip_mean, precip_var))
second step: we calculate the parameters $\alpha$ and $\beta$ using the relations above
alpha_mom = precip_mean ** 2 / precip_var
beta_mom = precip_var / precip_mean
We import the gamma function from the scipy.stats.distributions sub-package
The implementation of the gamma distribution in SciPy takes an additional location parameter, which we leave at zero
from scipy.stats.distributions import gamma
gmom = gamma(alpha_mom, 0, beta_mom)
#or gmom = gamma(alpha_mom, scale=beta_mom)
plt.hist(data['rain'], normed=True, bins=20)
plt.plot(np.linspace(0, 300, 100), gmom.pdf(np.linspace(0, 300, 100)), lw=3, c='r')
# Maximum Likelihood Estimation: gamma.fit returns the (shape, loc, scale) parameters
shape, loc, scale = gamma.fit(data['rain'])
shape, loc, scale
alpha_mom, beta_mom
gmle = gamma(shape,loc,scale)
f, (ax0, ax1) = plt.subplots(1,2,figsize=(13,5))
rmax = 300; N=100
ax0.hist(data['rain'], normed=True, bins=20, color='0.6')
ax0.plot(np.linspace(0, rmax, N), gmle.pdf(np.linspace(0, rmax, N)), 'b-', lw=2, label='MLE')
ax0.plot(np.linspace(0, rmax, N), gmom.pdf(np.linspace(0, rmax, N)), 'r-', lw=2, label='Moments')
ax0.legend()
ax0.set_title('Histogram and fitted PDFs')
ax1.plot(np.linspace(0, rmax, N), gmle.cdf(np.linspace(0, rmax, N)), 'b-', lw=2, label='MLE')
ax1.plot(np.linspace(0, rmax, N), gmom.cdf(np.linspace(0, rmax, N)), 'r-', lw=2, label='Moments')
ax1.legend()
ax1.set_title('CDFs');
Goodness-of-fit tests are available in scipy.stats, e.g. scipy.stats.kstest (Kolmogorov-Smirnov) and scipy.stats.anderson (Anderson-Darling).
from scipy.stats import kstest
kstest(data['rain'], 'gamma', (shape, loc, scale))
kstest(data['rain'], 'gamma', (alpha_mom, 0, beta_mom))
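kstest returns the KS statistic and the associated p-value; they can be unpacked and printed explicitly, e.g. for the MLE fit (a small sketch using the variables defined above):
D, p = kstest(data['rain'], 'gamma', (shape, loc, scale))
print("KS statistic = {0:4.3f}, p-value = {1:4.3f}".format(D, p))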
In some instances, we may not be interested in the parameters of a particular distribution of data, but just a smoothed representation of the data. In this case, we can estimate the distribution non-parametrically (i.e. making no assumptions about the form of the underlying distribution) using Kernel Density Estimation.
If you are interested in an excellent discussion of the various options available in Python to perform Kernel Density Estimation, have a look at this post on Jake VanderPlas's blog Pythonic Perambulations. More material on KDE is available here from Michael Lerner.
In a nutshell, (Gaussian) Kernel Density Estimation works 'simply' by summing up Gaussian functions (PDFs) centered on each data point. Each Gaussian function is characterized by a location parameter (the value of the data point) and a scale parameter ($\sigma$), which is related to the 'bandwidth' of your (Gaussian) kernel.
# Some random data
y = np.random.random(15) * 10
y
x = np.linspace(0, 10, 100)
# Smoothing parameter
s = 0.4
# Calculate the kernels
kernels = np.transpose([stats.distributions.norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:', lw=1)
plt.plot(x, kernels.sum(1), lw=1.5)
plt.plot(y, np.zeros(len(y)), 'ro', ms=10)
Kernel Density Estimation is implemented in scipy.stats.kde by scipy.stats.kde.gaussian_kde
from scipy.stats.kde import gaussian_kde
f, ax = plt.subplots(figsize=(7,7))
xgrid = np.linspace(data['rain'].min(), data['rain'].max(), 100)
density = gaussian_kde(data['rain']).evaluate(xgrid)
ax.hist(data['rain'], bins=20, normed=True, label='data', alpha=.6)
ax.plot(xgrid, density, 'r-', lw=2, label='KDE')
ax.legend()
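The amount of smoothing can be controlled through the bw_method argument of gaussian_kde (available in recent scipy versions); the value below is arbitrary, just to show the effect of a narrower bandwidth:
density_narrow = gaussian_kde(data['rain'], bw_method=0.2).evaluate(xgrid)  # narrower kernels
f, ax = plt.subplots(figsize=(7,7))
ax.hist(data['rain'], bins=20, normed=True, label='data', alpha=.6)
ax.plot(xgrid, density, 'r-', lw=2, label='KDE (default bandwidth)')
ax.plot(xgrid, density_narrow, 'g-', lw=2, label='KDE (bw_method=0.2)')
ax.legend();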
An example in 2 dimensions (from https://gist.github.com/endolith/1035069)
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(200,1)),
stats.norm.rvs(loc=1,scale=3,size=(200,1)),
axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
# Plot
f, ax = plt.subplots(figsize=(8,8))
ax.scatter(rvs[:,0],rvs[:,1],alpha=0.8,color='white')
im = ax.imshow(z,aspect=x_flat.ptp()/y_flat.ptp(),origin='lower',\
extent=(rvs[:,0].min(),rvs[:,0].max(),rvs[:,1].min(),rvs[:,1].max()),\
cmap=plt.get_cmap('spectral'))
plt.colorbar(im, shrink=0.8);
One commonly used parametric test is the Student t-test, which determines whether the means of two independent samples are statistically different, under the assumption that the two samples are normally distributed. The null hypothesis is that the two samples have identical means.
stats.ttest_ind?
X = stats.distributions.norm(loc=5, scale=10)
Y = stats.distributions.norm(loc=3, scale=10)
Xrvs = X.rvs(size=(1000))
Yrvs = Y.rvs(size=(1000))
f, ax = plt.subplots(figsize=(6,5))
ax.hist(Xrvs,histtype='stepfilled', bins=20, color='coral', alpha=.6, label=r'$\mu = 5$')
ax.hist(Yrvs,histtype='stepfilled', bins=20, color='steelblue', alpha=.6, label=r'$\mu = 3$')
ax.grid(color='b',alpha=.6)
ax.legend();
t, p = stats.ttest_ind(Xrvs, Yrvs)
print("""\n
T statistics = {0} ~= {0:4.2f}
P-value = {1} ~= {1:4.2f}
""".format(t, p))
from scipy.stats import pearsonr
We're going to apply that to time series of NINO3.4 and the SOI
The NINO indices come from http://www.cpc.ncep.noaa.gov/data/indices/ersst3b.nino.mth.81-10.ascii
The SOI is in the data directory
import os
from datetime import datetime
from dateutil import parser
import pandas as pd
nino_url = "http://www.cpc.ncep.noaa.gov/data/indices/ersst3b.nino.mth.81-10.ascii"
nino = pd.read_table(nino_url, sep='\s+', engine='python')
# if not working try:
# nino = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+', engine='python')
dates = [datetime(x[0], x[1], 1) for x in zip(nino['YR'].values, nino['MON'].values)]
nino.index = dates
nino.tail()
soipath = '../data'
def get_SOI(soipath, start_date='1950-01-01'):
soi = pd.read_csv(os.path.join(soipath,'NIWA_SOI.csv'), index_col=0)
soi = soi.stack()
soi = soi.dropna()
dates = [parser.parse("%s-%s-%s" % (str(int(x[0])), x[1], "1")) for x in soi.index]
soidf = pd.DataFrame(soi.values, index=dates, columns=['SOI'])
soidf = soidf.truncate(before=start_date)
return soidf
creates another DataFrame containing the SOI
df = get_SOI(soipath)
df.tail()
Assigning the series will automatically ALIGN the DataFrames on their index, see Pandas.ipynb
df['nino'] = nino['ANOM.3']
r, p = pearsonr(df['SOI'], df['nino'])
print("R = {0:<4.2f}, p-value = {1:<4.2f}".format(r,p))
#df = pd.DataFrame({'SOI':soi['SOI'],'nino':nino['ANOM.3']}, index=nino.index)
#df.to_csv('./data/soi_nino.csv')
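A quick scatter plot (just a sketch, using the aligned DataFrame) makes the relationship behind this correlation coefficient visible:
f, ax = plt.subplots(figsize=(6,6))
ax.scatter(df['SOI'], df['nino'], alpha=0.5)
ax.set_xlabel('SOI')
ax.set_ylabel('NINO3.4 anomaly');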
There are several interfaces for performing 1D, 2D or N Dimensional interpolation using scipy.
For more on that I refer you to the scipy.interpolate documentation.
Here we're going to see a simple example of 1D interpolation, and a 2D example using Radial Basis Functions with the default kernel (multiquadric)
from scipy.interpolate import interp1d
x = np.linspace(0, 10, 10)
y = np.sin(x)
x
plt.plot(x,y,'ro')
f1 = interp1d(x, y) # default is linear
f2 = interp1d(x, y, kind='cubic') # and we're gonna try cubic interpolation for comparison
xnew = np.linspace(0, 10, 50)
ylin = f1(xnew)
ycub = f2(xnew)
f, ax = plt.subplots(figsize=(10,6))
ax.plot(xnew, ylin, 'b.', ms=8, label='linear')
ax.plot(xnew, ycub, 'gx-', ms=10, label='cubic')
ax.plot(x,y,'ro-', label='data')
ax.legend(loc='lower left')
from scipy.interpolate import Rbf
# defines a function of X and Y
def func(x,y):
return x*np.exp(-x**2-y**2)
np.random.seed(1) # we 'fix' the generation of random numbers so that we've got consistent results
x = np.random.uniform(-2., 2., 100)
y = np.random.uniform(-2., 2., 100)
z = func(x,y)
ti = np.linspace(-2.0, 2.0, 100)
XI, YI = np.meshgrid(ti, ti) # meshgrid creates uniform 2D grids from 1D vectors
# use RBF
rbf = Rbf(x, y, z, epsilon=2) # instantiates the interpolator
# you might want to play with the epsilon optional parameter
ZI = rbf(XI, YI) # interpolate on grid
# this is the 'True' field, the result of the function evaluated on a regular grid
true_Z = func(XI, YI)
# plot the result
f, ax = plt.subplots(figsize=(8,6))
im = ax.pcolor(XI, YI, ZI, cmap=plt.get_cmap('RdBu_r'))
ax.scatter(x, y, 50, z, cmap=plt.get_cmap('RdBu_r'), edgecolor='.5')
ax.set_title('RBF interpolation - multiquadrics')
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
plt.colorbar(im, orientation='vertical', pad=0.06);
# plot the difference between the 'true' field and the interpolated
f, ax = plt.subplots(figsize=(8,6))
im = ax.pcolor(XI, YI, ZI - true_Z, cmap=plt.get_cmap('RdBu_r'), vmin=-1e-3, vmax=1e-3)
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
plt.colorbar(im, orientation='vertical', pad=0.06);
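A single number summarizing the interpolation error (just a quick check on the fields computed above):
print("maximum absolute interpolation error: {:.2e}".format(np.abs(ZI - true_Z).max()))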
In this case you want to fit noisy data (e.g. experimental measurements) to a specific function with unknown parameters
from scipy.optimize import curve_fit
curve_fit uses the Levenberg-Marquardt algorithm to perform non-linear least-squares optimization
def fitFunc(t, a, b, c):
"""
defines the function
takes 3 parameters: a, b and c
"""
return a*np.exp(-b*t) + c
### defines the evaluation domain
t = np.linspace(0,4,50)
### defines the parameters
a = 5.0
b = 1.5
c = 0.5
### create the response
temp = fitFunc(t, a, b, c)
### add some noise to simulate "real world observations" such as experimental data
noisy = temp + 0.4 * np.random.normal(size=len(temp))
### use curve_fit to estimate the parameters and the covariance matrices
fitParams, fitCovariances = curve_fit(fitFunc, t, noisy)
afit, bfit, cfit = tuple(fitParams)
print("\nEstimated parameters\na: {0:<4.2f}, b: {1:<4.2f}, c: {2:<4.2f}\n\n".format(afit, bfit, cfit))
f, ax = plt.subplots(figsize=(8,8))
ax.set_ylabel(u'Temperature (\xb0C)', fontsize = 16)
ax.set_xlabel('time (s)', fontsize = 16)
ax.set_xlim(0,4.1)
# plot the data as red circles with vertical errorbars
ax.errorbar(t, noisy, fmt = 'ro', yerr = 0.2)
# now plot the best fit curve
ax.plot(t, fitFunc(t, afit, bfit, cfit),'k-', lw=2)
# and plot the +- 1 sigma curves
# (the square root of the diagonal covariance matrix
# element is the uncertainty on the fit parameter.)
sigma_a, sigma_b, sigma_c = np.sqrt(fitCovariances[0,0]), \
np.sqrt(fitCovariances[1,1]), \
np.sqrt(fitCovariances[2,2])
ax.plot(t, fitFunc(t, afit + sigma_a, bfit - sigma_b, cfit + sigma_c), 'b-')
ax.plot(t, fitFunc(t, afit - sigma_a, bfit + sigma_b, cfit - sigma_c), 'b-');
solution available at http://nbviewer.ipython.org/gist/nicolasfauchereau/79131703a0340e1ed82a
import pandas as pd
data = pd.read_excel('../data/rainfall_calibration.xlsx', sheetname='Sheet1')
data.head()
f, ax = plt.subplots()
ax.plot(data['Rainfall'], data['Level'], 'b', lw=1.5)
ax.plot(data['Rainfall'], data['Level'], 'ro')
ax.set_xlabel('Rainfall')
ax.set_ylabel('Level')
...
Python itself has very good support for IO and for dealing with ASCII files
f = open('filename','r')
out = f.readlines()
f.close()
f = open('../data/ascii_table.txt', 'r')
out = f.readlines()
f.close()
out
out = [list(map(int, x.split())) for x in out]
out
out = np.array(out)
print(out)
out.shape
with open('../data/ascii_table.txt', 'r') as f:
    out = f.readlines()
out = [list(map(int, x.split())) for x in out]
out = np.array(out)
np.genfromtxt
a = np.genfromtxt('../data/ascii_table.txt', dtype=int) # can handle missing values
a
#np.loadtxt() ### each row must have the same number of values
a = np.loadtxt('../data/ascii_table.txt', dtype=int)
a
But all of that is a pain! For reading CSV and other tabular formats, including xls, xlsx, etc., the most convenient way is to use the special IO functions of Pandas, which we have seen for the SOI / NINO example and will see in more detail in the pandas notebook. All these functions load such files directly into Pandas DataFrames, which are a very convenient data structure for interacting with spreadsheet-like data. See the Pandas.ipynb notebook.
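As a quick illustration (assuming the same whitespace-separated ascii_table.txt used above, and that it has no header row), the whole table can be read in one line with Pandas:
df_table = pd.read_csv('../data/ascii_table.txt', sep='\s+', header=None, engine='python')
df_table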
NetCDF stands for Network Common Data Form; it is widely used in the atmosphere and ocean sciences to store mostly gridded datasets.
To read / write NetCDF files, I recommend using the standalone netCDF4 module, which is installed as part of the Anaconda distribution and has lots of functionality, including support for the NetCDF4 (HDF5-based) format, remote access to datasets over OPeNDAP, and utilities for handling dates and missing values.
from netCDF4 import Dataset
#with Dataset('data/Hadley_SST.nc', 'r') as nc:
nc = Dataset('../data/NCEP2_airtemp2m_2014_anoms.nc', 'r')
nc.variables.keys()
temp = nc.variables['air'][:]
lon = nc.variables['lon'][:]
lat = nc.variables['lat'][:]
temp.shape
f, ax = plt.subplots(figsize=(8,4))
im = ax.imshow(temp.mean(0), vmin=-4, vmax=4, aspect='auto',\
cmap=plt.get_cmap('RdBu_r'))
ax.set_title('temperature anomaly')
cb = plt.colorbar(im);
cb.set_label(u'\xb0C', rotation=0)
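The file presumably also contains a time coordinate; assuming it is named 'time' and carries CF-style units metadata (not verified here), netCDF4.num2date converts it to datetime objects:
from netCDF4 import num2date
time_var = nc.variables['time']                      # assumed variable name
dates = num2date(time_var[:], units=time_var.units)  # assumes a CF-compliant 'units' attribute
dates[:3]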
Image(url='http://www.giss.nasa.gov/research/news/20150116/2014_annual_w-colorbar.png')
netCDF4 is not restricted to local files: you should be able to access NetCDF datasets over the network through e.g. OPeNDAP
nc = Dataset('http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/interp_OLR/olr.mon.ltm.nc', 'r')
nc.variables.keys()
olr = nc.variables['olr'][0,:,:]
olr.shape
f, ax = plt.subplots(figsize=(8,4))
im = ax.imshow(olr, interpolation='nearest', aspect='auto')
plt.colorbar(im, label=r'$W.m^{-2}$')
from scipy.io.matlab import loadmat, savemat, whosmat
We first create a matlab file containing one variable
x = np.random.randn(10,10)
savemat('../data/random_2darray.mat', {'xmat':x})
Then inspect its content with whosmat
whosmat('../data/random_2darray.mat')
And read it back, accessing the saved variable as a NumPy array using a dict-like syntax
matfile = loadmat('../data/random_2darray.mat')
x = matfile['xmat'] # like a dictionary
For MATLAB files containing MATLAB structures, the syntax is marginally more complicated
whosmat('../data/clusters_monthly.mat')
matfile = loadmat('../data/clusters_monthly.mat', struct_as_record=False)
matfile.keys()
clusm = matfile['clusm'][0][0] ### corresponds to the (1,1) shape as indicated by whosmat
type(clusm)
### inspect the attributes of clusm
dir(clusm)
clusm._fieldnames
t = clusm.time
t
data = clusm.data
data.shape
IMPORTANT: A note on matlab dates and time
MATLAB's datenum representation is the number of days since midnight on January 1st of year 0, while Python's datetime handling (through the datetime.fromordinal function) counts days since midnight on January 1st, 1 AD. So the duration of year 0 (366 days, since it is treated as a leap year) separates the time representation returned by Python's fromordinal function from the time representation used by MATLAB.
matlab_datenum = 731965.04835648148
The following snippet is from Conrad Lee at https://gist.github.com/conradlee/4366296
from datetime import datetime, timedelta
python_datetime = datetime.fromordinal(int(matlab_datenum)) \
+ timedelta(days=matlab_datenum%1) - timedelta(days = 366) ## year 0 was a leap year
python_datetime
print(python_datetime)
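The same conversion can be wrapped into a small helper (a minimal sketch) to convert a whole array of MATLAB datenums at once, e.g. the t array extracted from the .mat structure above, assuming it does contain datenums:
def matlab2datetime(datenums):
    """Convert an array (or list) of MATLAB datenums into Python datetime objects."""
    return [datetime.fromordinal(int(d)) + timedelta(days=d % 1) - timedelta(days=366)
            for d in np.asarray(datenums).ravel()]

matlab2datetime([matlab_datenum])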
HDF5
Please see http://www.h5py.org/
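For completeness, a minimal sketch of writing and reading an array with h5py (the file name below is hypothetical; h5py ships with the Anaconda distribution):
import h5py
with h5py.File('../data/example.h5', 'w') as hf:          # hypothetical file name
    hf.create_dataset('x', data=np.random.randn(10, 10))
with h5py.File('../data/example.h5', 'r') as hf:
    x = hf['x'][:]
x.shape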
IDL:
from scipy.io import idl
# idl.readsav('filename.sav') reads an IDL save file into a dict-like object (placeholder file name)
Fortran files:
from scipy.io import FortranFile
# FortranFile requires a file name and a mode, e.g. FortranFile('test.unf', 'r'); see the examples below
Examples
--------
To create an unformatted sequential Fortran file:
>>> from scipy.io import FortranFile
>>> f = FortranFile('test.unf', 'w')
>>> f.write_record(np.array([1,2,3,4,5],dtype=np.int32))
>>> f.write_record(np.linspace(0,1,20).reshape((5,-1)))
>>> f.close()
To read this file:
>>> from scipy.io import FortranFile
>>> f = FortranFile('test.unf', 'r')
>>> print(f.read_ints(dtype=np.int32))
[1 2 3 4 5]
>>> print(f.read_reals(dtype=np.float).reshape((5,-1)))
[[ 0. 0.05263158 0.10526316 0.15789474]
[ 0.21052632 0.26315789 0.31578947 0.36842105]
[ 0.42105263 0.47368421 0.52631579 0.57894737]
[ 0.63157895 0.68421053 0.73684211 0.78947368]
[ 0.84210526 0.89473684 0.94736842 1. ]]
>>> f.close()