State Space Time Series Analysis - Part 1

Modern time series analysis techniques like ARIMA and ARMA models (also known as the Box-Jenkins method) are relatively young innovations, having been pioneered by the statisticians George Box and Gwilym Jenkins in 1970 with the publication of their seminal text Time Series Analysis: Forecasting and Control. Despite their adolescence, these models are already familiar ground for students and seasoned statisticians alike and have become ubiquitous across massive industries like finance, tech, and engineering.

However, these techniques come with some significant drawbacks. The Box-Jenkins method is often regarded as a black box method in the sense that it can be difficult to see what’s going on under the hood. It can be challenging or even impossible to determine how seasonality, cycles, pulses, interventions, and other components of time series contribute to the overall model in a precise manner.

Box-Jenkins also makes strong assumptions about certain properties of time series; in particular, Box-Jenkins is generally only valid when applied to series that are stationary. If a series isn’t stationary, many developers will resort to repeated application of the differencing operator until a reasonably stationary series is produced. Furthermore, stationarity itself is a difficult property to measure and test for, so developers often face the question of whether a series is “stationary enough.”

Another significant drawback is that the Box-Jenkins method does not allow for varying certain components over time. Seasonality can be accounted for in an ARIMA model, but it is not straightforward to allow seasonal or cyclical patterns to fluctuate over time.

Enter state space methods for time series analysis, a distinct set of methods for modeling time series that addresses each of the problems of the Box-Jenkins method described above. In this series of posts, I’ll discuss the basic theory of state space representations of time series. We’ll begin with the simplest type of state space model, the local level model. In later parts, we’ll build in additional components like seasonality, trend, and exogenous variables.

These posts are primarily built off of the work of James Durbin and Siem Jan Koopman from their text Time Series Analysis by State Space Methods.

Prerequisites

This series of posts assumes that the reader is familiar with foundational topics in statistics like random variables, density functions, expectation, variance/covariance, and the normal and multivariate normal distributions. In multivariate cases, some basic matrix algebra is also needed. No background in time series analysis is needed.

Notation

Distribution functions will be denoted by \(p(\cdot)\), and conditional distributions will similarly be denoted by \(p(\cdot \vert \cdot)\).

Expectation, variance, and covariance will be respectively denoted by \(\mathbb{E}, \mathbb{V}\), and \(\text{Cov}()\).

Vectors will be represented by lower or upper case letters like \(x, y, Y\) and matrices will be represented with boldface capital letters like \(\mathbf{A}, \mathbf{\Sigma}, \mathbf{Y}\).

Building the Local Level Model

Time series analysis is concerned with analyzing and building models for sets of observations that are ordered by time. For example, you could have a time series of observations \(y_1, y_2, \ldots, y_{365}\) where each \(y_i\) represents the daily high temperature in your city for the last year. One way to think about modeling a time series is by using this basic model:

\[\begin{equation} y_t = \mu_t + \gamma_t + \epsilon_t, t = 1, \ldots, n. \tag{1} \end{equation}\]

In state space methodology, each of these symbols represents a different component of a time series, namely

  • \(\mu_t\) is a slowly changing component called the trend
  • \(\gamma_t\) is a periodic component with a fixed period called the seasonal
  • \(\epsilon_t\) is an irregular component called the disturbance (or less commonly, the error)

We can create a very simple model for \((1)\) by setting \(\mu_t = \alpha_t\) where \(\alpha_t\) is a random walk with no seasonal component. More precisely, \(\alpha_{t+1} = \alpha_t + \eta_t\), where \(\eta_t \sim N(0, \sigma^2_\eta)\). Let’s also assume that \(\epsilon_t \sim N(0, \sigma^2_\epsilon)\) and that all the \(\eta_t\) and \(\epsilon_t\) are mutually independent. Our model then is

\[\begin{equation} \begin{split} y_t &= \alpha_t + \epsilon_t, \epsilon_t \sim N(0, \sigma^2_\epsilon), \\ \alpha_{t + 1} &= \alpha_t + \eta_t, \eta_t \sim N(0, \sigma^2_\eta) \end{split} \tag{2} \end{equation}\]

This model is the simplest state space model and it’s called the local level model. It demonstrates the characteristic structure of a state space model: there is a series of unobserved values \(\alpha_t\), called the states, representing the development of the system over time along with the observed time series \(y_t\). The series \(y_t\) is related to the unobserved series \(\alpha_t\) by the state space model. The first equation is called the observation equation and the second is called the state equation.
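Before deriving anything, it can help to see the model in action. Here’s a short simulation sketch of \((2)\) in Python (the function name, parameter values, and seed are arbitrary choices of mine, not anything from Durbin and Koopman):

```python
import random

def simulate_local_level(n, sigma_eps, sigma_eta, alpha1=0.0, seed=42):
    """Simulate y_t = alpha_t + eps_t with alpha_{t+1} = alpha_t + eta_t."""
    rng = random.Random(seed)
    alpha = alpha1
    states, obs = [], []
    for _ in range(n):
        states.append(alpha)
        obs.append(alpha + rng.gauss(0.0, sigma_eps))  # observation equation
        alpha = alpha + rng.gauss(0.0, sigma_eta)      # state equation
    return states, obs

# A noisy series whose underlying level drifts as a random walk:
states, obs = simulate_local_level(200, sigma_eps=1.0, sigma_eta=0.3)
```

Plotting `obs` against `states` makes the structure visible: the observations scatter around an unobserved level that wanders slowly over time.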

The Kalman Filter

The goal of this post will be to derive a recursion that allows us to estimate the unobserved states \(\alpha_t\) based on the \(y_t\) from the observed time series. This recursion is known as the Kalman Filter, but before we get into that, we’ll need some additional notation and properties.

Setting the Stage

We will define \(Y_{t-1}\) as the vector of observations \([y_1, y_2, \ldots, y_{t-1}]^\top\).

We’ll also assume that \(p(\alpha_t \vert Y_{t-1}) \sim N(a_t, P_t)\) where \(a_t\) and \(P_t\) are known. Later, we’ll relax this assumption, since knowing these parameters is rarely the case in practice. From this assumption, we also have that \(p(\alpha_{t+1} \vert Y_t) \sim N(a_{t+1}, P_{t+1})\).

We will also define the parameters \(a_{t \vert t}\) and \(P_{t \vert t}\) as the mean and variance of the conditional distribution of \(\alpha_t\) given \(Y_t\).

From these assumptions and definitions, our goal is to compute \(a_{t \vert t}, P_{t \vert t}, a_{t+1},\) and \(P_{t+1}\) when a new observation \(y_t\) is brought in. Durbin calls \(a_{t \vert t}\) the filtered estimator of the state \(\alpha_t\) and \(a_{t+1}\) the one-step ahead predictor of \(\alpha_{t+1}\). Their respective variances are \(P_{t \vert t}\) and \(P_{t+1}\).

Prediction Errors and Properties

We can also define the one-step ahead prediction error of \(y_t\) as \(v_t = y_t - a_t\). In this section, we’ll derive some properties of these prediction errors that will prove useful later on.

\(\fbox{Lemma 1.}\) Let \(v_t, \alpha_t, Y_{t-1}, a_t, P_t,\) and \(\sigma_{\epsilon}^2\) be defined as above. Then

  1. \(\mathbb{E}[v_t \vert Y_{t - 1}] = 0.\)
  2. \(\mathbb{V}[v_t \vert Y_{t - 1}] = P_t + \sigma_{\epsilon}^2.\)
  3. \(\mathbb{E}[v_t \vert \alpha_t, Y_{t - 1}] = \alpha_t - a_t.\)
  4. \(\mathbb{V}[v_t \vert \alpha_t, Y_{t - 1}] = \sigma_{\epsilon}^{2}.\)

\( \fbox{Proof.}\) Let’s just brute force derive each of these properties individually. Note that \( v_t = y_t - a_t = \alpha_t + \epsilon_t - a_t \); we’re going to use that a lot.

\(\fbox{1.}\) We have that

\[\begin{align*} \mathbb{E}[v_t \vert Y_{t - 1}] & = \mathbb{E}[y_t - a_t \vert Y_{t-1}] = \mathbb{E}[\alpha_t + \epsilon_t - a_t \vert Y_{t - 1}] \\ \\ & = \mathbb{E}[\alpha_t \vert Y_{t - 1}] + \mathbb{E}[\epsilon_t | Y_{t - 1}] - \mathbb{E}[a_t \vert Y_{t - 1}], \end{align*}\]

where the final equality comes from the linearity of expectation. Let’s tackle each term of this sum individually. From our assumptions in the previous section, we know that \( p(\alpha_t \vert Y_{t - 1}) \thicksim N(a_t, P_t) \), so the first term in this sum is just \(a_t\).

For the second term, remember that the \( \epsilon_t \) are independent and identically distributed normal random variables with zero mean. That means that the second term is zero.

Finally, for the third term, since \(a_t\) is a known constant based on our assumptions from the previous section, \( \mathbb{E}[a_t \vert Y_{t-1}] = a_t\). That leaves us with

\[ \mathbb{E} [v_t \vert Y_{t - 1}] = a_t + 0 - a_t = 0.\]

\(\fbox{2.}\) Deriving this property is very similar to the previous one. We have that

\[\mathbb{V}[v_t \vert Y_{t - 1}] = \mathbb{V}[\alpha_t + \epsilon_t - a_t \vert Y_{t - 1}].\]

In general, variance is not additive. Lucky for us though, in this case it is because the terms inside the variance operator are mutually independent. More precisely, \(a_t\) is independent of the other terms because it’s a constant, and \( \alpha_t \) is independent of \( \epsilon_t \) since only the observation equation contains \( \epsilon_t \). Therefore,

\[\mathbb{V}[v_t \vert Y_{t - 1}] = \mathbb{V}[\alpha_t \vert Y_{t - 1}] + \mathbb{V}[\epsilon_t \vert Y_{t - 1}] + \mathbb{V}[a_t \vert Y_{t - 1}].\]

Applying our assumptions and definitions from the previous section (and noting that the variance of a constant is zero), we have that

\[ \mathbb{V}[v_t \vert Y_{t - 1}] = P_t + \sigma_{\epsilon}^{2}. \]

\(\fbox{3.}\) This is exactly the same as part 1 except that now, we are given \( \alpha_t \), so the first term \( \mathbb{E}[\alpha_t \vert \alpha_t, Y_{t-1}] \) is just \( \alpha_t \) instead of \( a_t \). Glancing back at part 1, that gives

\[ \mathbb{E}[v_t \vert \alpha_t, Y_{t-1}] = \alpha_t - a_t.\]

\(\fbox{4.}\) This is exactly the same as part 2 except that now, we are given \( \alpha_t\), so the term \( \mathbb{V}[\alpha_t \vert \alpha_t, Y_{t-1}] \) vanishes, leaving

\[\mathbb{V}[v_t \vert \alpha_t, Y_{t - 1}] = \sigma_{\epsilon}^{2}. \tag*{$\blacksquare$}\]

Deriving \( p(\alpha_t \vert Y_t) \) to Obtain the Kalman Filter

Using these properties and noting that knowing \( Y_t \) is equivalent to knowing \( Y_{t-1} \) and \( v_t \), we can now compute \( p(\alpha_t \vert Y_t) \). I won’t put all the details here because the algebra is gory and long, but if you’re curious, you can always work out the details yourself! We have that

\[\begin{align} p(\alpha_t \vert Y_t) & = p(\alpha_t \vert Y_{t - 1}, v_t) \\ & = p(\alpha_t, v_t \vert Y_{t - 1}) / p(v_t \vert Y_{t - 1}) \\ & = p(\alpha_t \vert Y_{t - 1}) p(v_t \vert \alpha_t, Y_{t - 1}) / p(v_t \vert Y_{t-1}) \\ & = \text{constant} \cdot \text{exp}(-\frac{1}{2} Q). \end{align}\]

The reason this all reduces down to the product of a constant and an exponential is that each of the distribution functions in the expression above is a normal distribution, which comes from how we defined the \( y_t \) and \( \alpha_t\). So, this form is really just a very general form of a normal distribution. Once we know \( Q\), we’ll know the mean and variance of \( p(\alpha_t \vert Y_t)\). If you work out the details, it turns out that

\[Q = \frac{P_t + \sigma_{\epsilon}^2}{P_t \sigma_{\epsilon}^2} (\alpha_t - a_t - \frac{P_t v_t}{P_t + \sigma_{\epsilon}^2})^2.\]
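For the curious, the gory algebra is mostly a single completion of the square. From our assumptions, \( p(\alpha_t \vert Y_{t-1}) \sim N(a_t, P_t) \), and from parts 3 and 4 of Lemma 1, \( p(v_t \vert \alpha_t, Y_{t-1}) \sim N(\alpha_t - a_t, \sigma_{\epsilon}^2) \). Writing \( x = \alpha_t - a_t \), the part of the exponent that depends on \( \alpha_t \) is

\[\frac{x^2}{P_t} + \frac{(v_t - x)^2}{\sigma_{\epsilon}^2} = \left( \frac{1}{P_t} + \frac{1}{\sigma_{\epsilon}^2} \right) x^2 - \frac{2 v_t x}{\sigma_{\epsilon}^2} + \frac{v_t^2}{\sigma_{\epsilon}^2} = \frac{P_t + \sigma_{\epsilon}^2}{P_t \sigma_{\epsilon}^2} \left( x - \frac{P_t v_t}{P_t + \sigma_{\epsilon}^2} \right)^2 + \frac{v_t^2}{P_t + \sigma_{\epsilon}^2},\]

and the last term doesn’t involve \( \alpha_t \), so it gets absorbed into the constant, leaving exactly the \( Q \) above.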

This finally implies that

\[ p(\alpha_t \vert Y_t) = N\left(a_t + \frac{P_t v_t}{P_t + \sigma_{\epsilon}^2}, \frac{P_t \sigma_{\epsilon}^2}{P_t + \sigma_{\epsilon}^2}\right). \]

Recall from earlier that we defined \( p(\alpha_t \vert Y_t) \thicksim N(a_{t \vert t}, P_{t \vert t}) \), implying that

\[ a_{t \vert t} = a_t + \frac{P_t v_t}{P_t + \sigma_{\epsilon}^2} \text{ and } P_{t \vert t} = \frac{P_t \sigma_{\epsilon}^2}{P_t + \sigma_{\epsilon}^2}. \]
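To make this update concrete, here’s a single filtering step in Python (all of the numbers are made up purely for illustration):

```python
# Prior: alpha_t | Y_{t-1} ~ N(a_t, P_t); observation noise variance sigma_eps^2.
a_t, P_t = 0.0, 1.0
y_t, var_eps = 1.5, 0.5

v_t = y_t - a_t                                   # one-step ahead prediction error
a_filtered = a_t + P_t * v_t / (P_t + var_eps)    # a_{t|t} -> 1.0
P_filtered = P_t * var_eps / (P_t + var_eps)      # P_{t|t} -> 1/3
```

The filtered mean lands between the prior mean \(0\) and the observation \(1.5\), pulled toward the observation because the prior variance \(P_t\) is large relative to the noise variance, and the filtered variance shrinks below both \(P_t\) and \(\sigma_{\epsilon}^2\).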

Next, since \( a_{t + 1} = \mathbb{E}[\alpha_{t + 1} \vert Y_t] = \mathbb{E}[\alpha_t + \eta_t \vert Y_t] \) and \( P_{t + 1} = \mathbb{V}[\alpha_{t + 1} \vert Y_t] = \mathbb{V}[\alpha_t + \eta_t \vert Y_t] \), we have that

\[\begin{align} a_{t + 1} & = \mathbb{E}[\alpha_t \vert Y_t] + \mathbb{E}[\eta_t \vert Y_t] = a_{t \vert t} = a_t + \frac{P_t v_t}{P_t + \sigma_{\epsilon}^2}, \\ P_{t + 1} & = \mathbb{V}[\alpha_t \vert Y_t] + \mathbb{V}[\eta_t \vert Y_t] = P_{t \vert t} + \sigma_{\eta}^{2} = \frac{P_t \sigma_{\epsilon}^2}{P_t + \sigma_{\epsilon}^2} + \sigma_{\eta}^2. \end{align}\]

For conciseness, Durbin uses the following notation:

\[\begin{align} F_t & = \mathbb{V}[v_t \vert Y_{t - 1}] = P_t + \sigma_{\epsilon}^2, \\ K_t & = P_t / F_t. \end{align}\]

\( K_t \) is called the Kalman gain and \( F_t \) is called the variance of the prediction error. At this point, we’ve completely derived the set of Kalman filter recursions for the local level model. To summarize,

\[ v_t = y_t - a_t, \hspace{1cm} F_t = P_t + \sigma_{\epsilon}^2, \] \[a_{t \vert t} = a_t + K_t v_t, \hspace{1cm} P_{t \vert t} = P_t (1 - K_t), \] \[ a_{t + 1} = a_t + K_t v_t, \hspace{1cm} P_{t + 1} = P_t (1 - K_t) + \sigma_{\eta}^2, \] \[ \text{for } t = 1, \ldots, n \text{ where } K_t = P_t / F_t. \]
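These recursions translate almost line for line into code. Here’s a minimal sketch of the filter in Python (the function name and signature are my own; it assumes \(a_1\) and \(P_1\) are known, as we did throughout):

```python
def kalman_filter_local_level(y, a1, P1, var_eps, var_eta):
    """Run the Kalman filter recursions for the local level model.

    y       : observed series y_1, ..., y_n
    a1, P1  : mean and variance of the initial state distribution
    var_eps : observation disturbance variance sigma_eps^2
    var_eta : state disturbance variance sigma_eta^2
    Returns lists of (a_{t|t}, P_{t|t}) and (a_{t+1}, P_{t+1}).
    """
    a, P = a1, P1
    filtered, predicted = [], []
    for y_t in y:
        v = y_t - a                # prediction error v_t
        F = P + var_eps            # prediction error variance F_t
        K = P / F                  # Kalman gain K_t
        a_filt = a + K * v         # filtered estimator a_{t|t}
        P_filt = P * (1 - K)       # its variance P_{t|t}
        filtered.append((a_filt, P_filt))
        a = a_filt                 # one-step ahead predictor a_{t+1}
        P = P_filt + var_eta       # its variance P_{t+1}
        predicted.append((a, P))
    return filtered, predicted
```

Feeding in a flat series makes the behavior easy to sanity check: the filtered mean converges to the observed level, and the variances settle toward a steady state.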

So what does all of this mean? This recursion gives us a way to update our estimate of the states \( \alpha_t \) with each additional observation \( y_t \) that’s brought into the system. In the case of the local level model, the one-step ahead predictor of the state \( \alpha_{t + 1} \), denoted \( a_{t + 1}\), turns out to just be the prior one-step ahead predictor plus the product of the prediction error and the Kalman gain.

It’s important to recognize that we arrived at this recursion based on several assumptions that may not necessarily be true for every application. In particular, the assumption that the initial conditions \(a_1\) and \( P_1\) are known is almost never the case in practice. Luckily, there are methods to work around this issue. The most common way to initialize the Kalman filter without making assumptions about these parameters is through a process called diffuse initialization, which we’ll discuss in a future post.

We also assumed that all of the distributions we worked with were normal. Throughout my posts, I will continue using this assumption, but there are also formulations for non-Gaussian state space models that utilize other distribution functions. Durbin’s book that is referenced near the beginning of this post contains several such formulations.

Coming Up

In this post, we introduced the concept of state space models and we derived the Kalman filter for the local level model (albeit while also skipping a lot of algebra). In an upcoming post, we’ll modify the derivation for more complex state space models that account for trend, seasonality, and exogenous variables.