March 10, 2016

Introduction

  • Background
  • PhD in Bioinformatics, MS Biostatistics (U of M)
  • Data Scientist at Trinity Health (1.5 years)
  • Two boys (2.5 and 0.25 years)
  • Closet metalhead

Motivation

  • Medical billing codes assigned to encounter
  • Aggregate code usage (monthly)
  • "Did we observe more live births at SJMHS than usual in November?"

Decomposition of time series

STL: a seasonal-trend decomposition procedure based on LOESS
Robert B. Cleveland, William S. Cleveland, Jean E. McRae, and Irma Terpenning
Journal of Official Statistics (1990)

  • Trend component: low-frequency variation and long-term changes in data
  • Seasonal component: seasonal variation in data
  • Remainder component: variation in data without trend and seasonal components

For \(\nu\) in \(1\dots N\),

\[Y_\nu = T_\nu + S_\nu + R_\nu\]

LOESS fit poorly approximates underlying trend

msft_data %>% 
  ggplot(aes(x = timestamp, y = count)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") +
  xlab("Date") + 
  ylab("Closing Price") + 
  ggtitle("MSFT")

STL to separate time series components

msft_zoo <- zoo::zoo(msft_data$count, order.by = msft_data$timestamp) # i like zoo, so sue me
msft_ts <- stats::ts(msft_zoo, frequency = 12)                        # monthly data, so 12 obs./period
msft_stl <- stats::stl(msft_ts, s.window = "periodic")                # seasonal decomposition by loess
plot(msft_stl)

Increased robustness if outliers are expected

msft_stl <- stats::stl(msft_ts, s.window = "periodic",
                       s.degree = 1,                 # seasonal polynomial fit degree, {0, 1}
                       robust = T)                   # robust fitting
#                      inner = if(robust)  1 else 2, # backfitting iterations
#                      outer = if(robust) 15 else 0  # robustness iterations
plot(msft_stl)

Outlier detection from univariate time series

  • Residual component roughly follows a standard normal distribution
  • Exploit this using Student's t-distribution
  • Grubbs' test for outliers (extreme studentized deviate test, or ESD)
  • Test is iterated until no outliers are found (e.g. reject \(H_0\))

\[G = \frac{\max\limits_{\nu}\left|R_\nu-\bar{R}\right|}{s}\]

\(H_0\): no outliers

\(H_a\): at least one outlier

For two-sided test, the null hypothesis is rejected if

\[G \gt \frac{N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}} {N-2+t^2_{\alpha/(2N),N-2}} } \]

Extracting the residual component from STL

names(msft_stl)
## [1] "time.series" "weights"     "call"        "win"         "deg"        
## [6] "jump"        "inner"       "outer"
head(msft_stl$time.series)
##          seasonal     trend    remainder
## [1,] -0.292745057 0.1035591  0.289185945
## [2,]  0.006206882 0.1111303 -0.007337161
## [3,] -0.027554415 0.1187014  0.028852968
## [4,]  0.238298734 0.1291275 -0.257426188
## [5,] -0.117503871 0.1395535  0.077950408
## [6,]  0.032527383 0.1519589 -0.084486241

Approximate normality of residual component

residComp <- msft_stl$time.series[,3]
hist(residComp, breaks = 50, main = "", xlab = "Residual Component")

AnomalyDetection from Twitter

  • Designed to detect outliers in time series data
  • Seasonal hybrid ESD (S-H-ESD)
  • Uses piecewise median instead of STL trend
  • More robust to extreme outliers
  • \(Y_\nu\) is split into non-overlapping windows

Residual component is calculated as:

\[R_\nu = Y_\nu - S_\nu - \tilde{Y_\nu}\]

Then ESD is calculated using this residual and median absolute deviation (MAD)

\[G = \frac {\max\limits_{\nu} \left| R_\nu-\tilde{R} \right| } {median \left| R_\nu - \tilde{R} \right| } \]

AnomalyDetection syntax and results

msft_vec <- msft_data$count[order(msft_data$timestamp)]
outliers <- AnomalyDetectionVec(msft_vec, plot = T,
                                direction = 'both',   # detect positive and negative spikes
                                period = 12,          # number obs. in single period
                                longterm_period = 72) # number obs. for which trend is "flat"
outliers$plot

BreakoutDetection from Twitter

  • Designed to detect mean shifts in time series data
  • E-Divisive with Medians (EDM)
  • Robust to extreme outliers
  • Non-parametric

For \(Y_1,\,Y_2,\,\dots,\,Y_N\), let there be a point \(\tau\) where

\[Y_1,\,Y_2,\,\dots,\,Y_\tau \thicksim F\]

\[Y_{\tau+1},\,Y_{\tau+2},\,\dots,\,Y_N \thicksim G\]

Kolmogorov-Smirnov test to compare distributions

\(H_0 \colon F = G\)

\(H_a \colon F \neq G\)

But we don't know the location and number of \(\tau\)…

E-Divisive (with medians)

A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
David S. MAtteson and Nicholas A. James
arXiv (1990)

  • algorithms to narrow search space
  • Twitter's modification uses medians to estimate the energy distance between two distributions
  • Energy distance combines within and between levels of variation
  • Tree searching (bisection) limits the space in which this must be calculated

BreakoutDetection syntax and results

changes <- breakout(msft_vec, plot = T,
                    method = 'multi', # detect multiple change points
                    min.size = 12)    # min(# obs.) between change points
changes$plot

Acknowledgements