- Background
- PhD in Bioinformatics, MS Biostatistics (U of M)
- Data Scientist at Trinity Health (1.5 years)
- Two boys (2.5 and 0.25 years)
- Closet metalhead

March 10, 2016

- Background
- PhD in Bioinformatics, MS Biostatistics (U of M)
- Data Scientist at Trinity Health (1.5 years)
- Two boys (2.5 and 0.25 years)
- Closet metalhead

- Medical billing codes assigned to encounter
- Aggregate code usage (monthly)
- "Did we observe more live births at SJMHS than usual in November?"

**STL: a seasonal-trend decomposition procedure based on LOESS***Robert B. Cleveland, William S. Cleveland, Jean E. McRae, and Irma Terpenning*

Journal of Official Statistics (1990)

- Trend component: low-frequency variation and long-term changes in data
- Seasonal component: seasonal variation in data
- Remainder component: variation in data without trend and seasonal components

For \(\nu\) in \(1\dots N\),

\[Y_\nu = T_\nu + S_\nu + R_\nu\]

msft_data %>% ggplot(aes(x = timestamp, y = count)) + geom_point() + geom_smooth(method = "loess", color = "red") + xlab("Date") + ylab("Closing Price") + ggtitle("MSFT")

msft_zoo <- zoo::zoo(msft_data$count, order.by = msft_data$timestamp) # i like zoo, so sue me msft_ts <- stats::ts(msft_zoo, frequency = 12) # monthly data, so 12 obs./period msft_stl <- stats::stl(msft_ts, s.window = "periodic") # seasonal decomposition by loess plot(msft_stl)

msft_stl <- stats::stl(msft_ts, s.window = "periodic", s.degree = 1, # seasonal polynomial fit degree, {0, 1} robust = T) # robust fitting # inner = if(robust) 1 else 2, # backfitting iterations # outer = if(robust) 15 else 0 # robustness iterations plot(msft_stl)

- Residual component roughly follows a standard normal distribution
- Exploit this using Student's
*t*-distribution - Grubbs' test for outliers (extreme studentized deviate test, or ESD)
- Test is iterated until no outliers are found (
*e.g.*reject \(H_0\))

\[G = \frac{\max\limits_{\nu}\left|R_\nu-\bar{R}\right|}{s}\]

\(H_0\): no outliers

\(H_a\): at least one outlier

For two-sided test, the null hypothesis is rejected if

\[G \gt \frac{N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}} {N-2+t^2_{\alpha/(2N),N-2}} } \]

names(msft_stl)

## [1] "time.series" "weights" "call" "win" "deg" ## [6] "jump" "inner" "outer"

head(msft_stl$time.series)

## seasonal trend remainder ## [1,] -0.292745057 0.1035591 0.289185945 ## [2,] 0.006206882 0.1111303 -0.007337161 ## [3,] -0.027554415 0.1187014 0.028852968 ## [4,] 0.238298734 0.1291275 -0.257426188 ## [5,] -0.117503871 0.1395535 0.077950408 ## [6,] 0.032527383 0.1519589 -0.084486241

residComp <- msft_stl$time.series[,3] hist(residComp, breaks = 50, main = "", xlab = "Residual Component")

- Designed to detect outliers in time series data
- Seasonal hybrid ESD (S-H-ESD)
- Uses piecewise median instead of STL trend
- More robust to extreme outliers
- \(Y_\nu\) is split into non-overlapping windows

Residual component is calculated as:

\[R_\nu = Y_\nu - S_\nu - \tilde{Y_\nu}\]

Then ESD is calculated using this residual and median absolute deviation (MAD)

\[G = \frac {\max\limits_{\nu} \left| R_\nu-\tilde{R} \right| } {median \left| R_\nu - \tilde{R} \right| } \]

msft_vec <- msft_data$count[order(msft_data$timestamp)] outliers <- AnomalyDetectionVec(msft_vec, plot = T, direction = 'both', # detect positive and negative spikes period = 12, # number obs. in single period longterm_period = 72) # number obs. for which trend is "flat" outliers$plot

- Designed to detect mean shifts in time series data
- E-Divisive with Medians (EDM)
- Robust to extreme outliers
- Non-parametric

For \(Y_1,\,Y_2,\,\dots,\,Y_N\), let there be a point \(\tau\) where

\[Y_1,\,Y_2,\,\dots,\,Y_\tau \thicksim F\]

\[Y_{\tau+1},\,Y_{\tau+2},\,\dots,\,Y_N \thicksim G\]

Kolmogorov-Smirnov test to compare distributions

\(H_0 \colon F = G\)

\(H_a \colon F \neq G\)

But we don't know the location and number of \(\tau\)…

**A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data***David S. MAtteson and Nicholas A. James*

arXiv (1990)

- algorithms to narrow search space
- Twitter's modification uses medians to estimate the energy distance between two distributions
- Energy distance combines within and between levels of variation
- Tree searching (bisection) limits the space in which this must be calculated

changes <- breakout(msft_vec, plot = T, method = 'multi', # detect multiple change points min.size = 12) # min(# obs.) between change points changes$plot

Thank you very much for your attention!

*Resources*

Quandl: https://www.quandl.com

ESD: https://en.wikipedia.org/wiki/Grubbs%27_test_for_outliers

AnomalyDetection: https://github.com/twitter/AnomalyDetection

BreakoutDetection: https://github.com/twitter/BreakoutDetection