Using Twitter’s algorithms to detect outliers and change points in time series data

March 10, 2016

Introduction

Background
PhD in Bioinformatics, MS Biostatistics (U of M)
Data Scientist at Trinity Health (1.5 years)
Two boys (2.5 and 0.25 years)
Closet metalhead

Motivation

Medical billing codes assigned to encounter
Aggregate code usage (monthly)
"Did we observe more live births at SJMHS than usual in November?"

Decomposition of time series

STL: a seasonal-trend decomposition procedure based on LOESS
Robert B. Cleveland, William S. Cleveland, Jean E. McRae, and Irma Terpenning
Journal of Official Statistics (1990)

Trend component: low-frequency variation and long-term changes in data
Seasonal component: seasonal variation in data
Remainder component: variation in data without trend and seasonal components

For \(\nu\) in \(1\dots N\),

\[Y_\nu = T_\nu + S_\nu + R_\nu\]

LOESS fit poorly approximates underlying trend

msft_data %>% 
  ggplot(aes(x = timestamp, y = count)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") +
  xlab("Date") + 
  ylab("Closing Price") + 
  ggtitle("MSFT")

STL to separate time series components

msft_zoo <- zoo::zoo(msft_data$count, order.by = msft_data$timestamp) # i like zoo, so sue me
msft_ts <- stats::ts(msft_zoo, frequency = 12)                        # monthly data, so 12 obs./period
msft_stl <- stats::stl(msft_ts, s.window = "periodic")                # seasonal decomposition by loess
plot(msft_stl)

Increased robustness if outliers are expected

msft_stl <- stats::stl(msft_ts, s.window = "periodic",
                       s.degree = 1,                 # seasonal polynomial fit degree, {0, 1}
                       robust = T)                   # robust fitting
#                      inner = if(robust)  1 else 2, # backfitting iterations
#                      outer = if(robust) 15 else 0  # robustness iterations
plot(msft_stl)

Outlier detection from univariate time series

Residual component roughly follows a standard normal distribution
Exploit this using Student's t-distribution
Grubbs' test for outliers (extreme studentized deviate test, or ESD)
Test is iterated until no outliers are found (e.g. reject \(H_0\))

\[G = \frac{\max\limits_{\nu}\left|R_\nu-\bar{R}\right|}{s}\]

\(H_0\): no outliers

\(H_a\): at least one outlier

For two-sided test, the null hypothesis is rejected if

\[G \gt \frac{N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}} {N-2+t^2_{\alpha/(2N),N-2}} } \]

Extracting the residual component from STL

names(msft_stl)

## [1] "time.series" "weights"     "call"        "win"         "deg"        
## [6] "jump"        "inner"       "outer"

head(msft_stl$time.series)

##          seasonal     trend    remainder
## [1,] -0.292745057 0.1035591  0.289185945
## [2,]  0.006206882 0.1111303 -0.007337161
## [3,] -0.027554415 0.1187014  0.028852968
## [4,]  0.238298734 0.1291275 -0.257426188
## [5,] -0.117503871 0.1395535  0.077950408
## [6,]  0.032527383 0.1519589 -0.084486241

Approximate normality of residual component

residComp <- msft_stl$time.series[,3]
hist(residComp, breaks = 50, main = "", xlab = "Residual Component")

AnomalyDetection from Twitter

Designed to detect outliers in time series data
Seasonal hybrid ESD (S-H-ESD)
Uses piecewise median instead of STL trend
More robust to extreme outliers
\(Y_\nu\) is split into non-overlapping windows

Residual component is calculated as:

\[R_\nu = Y_\nu - S_\nu - \tilde{Y_\nu}\]

Then ESD is calculated using this residual and median absolute deviation (MAD)

\[G = \frac {\max\limits_{\nu} \left| R_\nu-\tilde{R} \right| } {median \left| R_\nu - \tilde{R} \right| } \]

AnomalyDetection syntax and results

msft_vec <- msft_data$count[order(msft_data$timestamp)]
outliers <- AnomalyDetectionVec(msft_vec, plot = T,
                                direction = 'both',   # detect positive and negative spikes
                                period = 12,          # number obs. in single period
                                longterm_period = 72) # number obs. for which trend is "flat"
outliers$plot

BreakoutDetection from Twitter

Designed to detect mean shifts in time series data
E-Divisive with Medians (EDM)
Robust to extreme outliers
Non-parametric

For \(Y_1,\,Y_2,\,\dots,\,Y_N\), let there be a point \(\tau\) where

\[Y_1,\,Y_2,\,\dots,\,Y_\tau \thicksim F\]

\[Y_{\tau+1},\,Y_{\tau+2},\,\dots,\,Y_N \thicksim G\]

Kolmogorov-Smirnov test to compare distributions

\(H_0 \colon F = G\)

\(H_a \colon F \neq G\)

But we don't know the location and number of \(\tau\)…

E-Divisive (with medians)

A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
David S. MAtteson and Nicholas A. James
arXiv (1990)

algorithms to narrow search space
Twitter's modification uses medians to estimate the energy distance between two distributions
Energy distance combines within and between levels of variation
Tree searching (bisection) limits the space in which this must be calculated

BreakoutDetection syntax and results

changes <- breakout(msft_vec, plot = T,
                    method = 'multi', # detect multiple change points
                    min.size = 12)    # min(# obs.) between change points
changes$plot

Acknowledgements

Thank you very much for your attention!

Resources

Quandl: https://www.quandl.com

ESD: https://en.wikipedia.org/wiki/Grubbs%27_test_for_outliers

AnomalyDetection: https://github.com/twitter/AnomalyDetection

BreakoutDetection: https://github.com/twitter/BreakoutDetection