How to Use scikit-learn Methods with statsmodels Estimators

Sometimes you want to use estimators from one package but methods from another. Maybe, like me, you want to use scikit-learn’s grid searching cross validation function with an estimator from statsmodels. These two don’t work together straight out of the box, but by writing a quick wrapper, you can make a statsmodels estimator play nice with scikit-learn.

Okay, but why
Writing the Wrapper
Implementation
There you have it

Okay, but why

If you don’t care about my personal background with this problem, skip ahead to the wrapper. No hard feelings, I promise.

I’ve been reading a lot about state space time series analysis for the last couple months and have just recently started building state space models with Python. Even though, in my opinion, state space methodology is more powerful than classical Box-Jenkins approaches, I have found that many standard machine learning and statistical libraries have not yet implemented state space models into their toolkits. Luckily, statsmodels.tsa has a robust child class called statespace that allows you to build state space models with just about any level of detail you care about.

I landed on using statsmodels.tsa.statespace.sarimax for my projects. SARIMAX is essentially an ARIMA model that supports explicitly modeling seasonality (SARIMAX) and exogenous regressors (SARIMAX), along with the added bonus that these models are trained via the Kalman filter rather than the Box-Jenkins method. It works perfectly for my use case and I could not be more satisfied with the algorithms that train these models, primarily developed by Chad Fulton (my sincere gratitude, Mr. Fulton). But as is the case for most luxuries, this convenience comes at a price.

At the time of writing this, statsmodels does not have comparable functionality to scikit-learn’s GridSearchCV() function in its model_selection class. As far as I can tell, there are not plans to implement this kind of functionality either. For those that have not used this function before, GridSearchCV() is an enormously powerful tool in scikit-learn that allows you to train any number of candidate models across a hyperparameter grid. For example, if I want to train a random forest to solve some kind of problem but I’m not sure what my maximum tree depth should be or how many trees I should include in my forest, I could use GridSearchCV() to test out any number of combinations of choices for both these hyperparameters and choose the “best” one with regard to some preselected performance metric.

The primary hyperparameters that need tuning in a SARIMAX model are the order and the seasonal_order. In statsmodels.tsa.statespace, the order argument refers to the length three tuple respectively specifying the number of autoregressive (AR) terms, the order of integration (OOI), and the number of moving average (MA) terms that you want to build your model around. The seasonal_order argument represents the seasonal counterparts of this tuple along with a fourth argument representing the length of a season with respect to your data’s frequency.

Now, at this point, many statisticians and engineers that are well versed in time series analysis may want to point out to me that it is more sensible or simpler to determine the proper number of AR terms by looking at partial autocorrelation plots or that OOI can be estimated by cleverly applying the Augmented Dickey-Fuller test. This is a perfectly reasonable approach that has been used for decades. But in this age of increasingly powerful computers and tools that can rigorously check the fit of thousands (or thousands of millions, depends on the computer I guess!) of candidate models, why not make good use of those tools and leave the model selection to the computer? I would also argue that grid searching a hyperparameter space for the best model is a more rigorous approach than using subjective experience to determine the right number of AR terms based off a partial autocorrelation plot.

So, bearing all of this in mind, I set off to get my SARIMAX models to work with scikit-learn’s grid searching and cross validation functions. It turns out to only require a pretty straight forward wrapper so that scikit-learn can recognize a statsmodels estimator as a scikit-learn estimator.

Writing the Wrapper

Versions

In my project’s conda environment, I’m using the following relevant package versions. I imagine this type of implementation should work for similar versions, but do keep in mind that when using a different set of versions, your results could vary.

Python 3.11.5
scikit-learn=1.4.1
statsmodels=0.14.1

Developing a Custom `scikit-learn` Estimator

The maintainers of scikit-learn have already recognized that some developers may want to do exactly what I’m setting out to do in this post and have put together a very helpful document in their API reference called Developing scikit-learn estimators. To summarize this process so we can finally get to coding, scikit-learn estimators need to have

1) A constructor (__init__) that specifies all the parameters used in hyperparameter tuning

2) A fit() method; we’ll set this up to call the fit() method in the statsmodels package specific to SARIMAX model objects and return the fitted estimator

3) A set_params() method

4) If you want forecasting functionality, a predict() method is also needed

It is also a good idea to determine which description of your estimator fits best out of classifiers, regressors, and clusters. In my case, my estimator will be a regressor since its predictions fall on the real number line and since state space models fall under the umbrella of supervised learning.

My SARIMAX Wrapper

What follows is all of the code I used to make SARIMAX compatible with (most of) scikit-learn. After this, we’ll go through the creation of the class and each method line by line.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
import statsmodels.api as sm
import numpy as np

class SarimaxSKL(BaseEstimator, RegressorMixin):
    def __init__(self, **kwargs):
        """
        Constructs the groundwork for a `statsmodels` `SARIMAX` model object
        with modified `fit` and `predict` methods to play well with 
        `scikit-learn`'s cross validation and forecasting functionality.
        The actual `SARIMAX` model is not created until the `fit` method
        is called, at which point the object is created and the `statsmodels`
        fit method is run using the supplied parameters and data set.

        Parameters
        ----------
        **kwargs: keyword arguments corresponding to the parameters to feed 
        the `SARIMAX` objects; the parameters that are currently supported are
        `order`, `seasonal_order`, `enforce_stationarity`, `maxiter`, 
        `mle_regresion`, and `time_varying_regression`; refer to the `statsmodels`
        API reference for details on how each parameter impacts the training process

        Returns
        ----------
        `SarimaxSKL` object with `fit` and `predict` methods enabled

        """

        self.order = kwargs.get("order", (0, 0, 0))
        self.seasonal_order = kwargs.get("seasonal_order", (0, 0, 0, 12))
        self.enforce_stationarity = kwargs.get("enforce_stationarity", False)
        self.maxiter = kwargs.get("maxiter", 1000)
        self.mle_regression = kwargs.get("mle_regression", True)
        self.time_varying_regression = kwargs.get("time_varying_regression", False)

    def get_params(self, deep = True):
        return {"order": {self.order}, "seasonal_order": {self.seasonal_order},
                "enforce_stationarity": {self.enforce_stationarity},
                "max_iter": {self.maxiter}, 
                "mle_regression": {self.mle_regression},
                "time_varying_regression": {self.time_varying_regression}}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.X_ = X
        self.y_ = y

        self.model_ = sm.tsa.statespace.SARIMAX(
            endog = y, exog = X, order = self.order,
            seasonal_order = self.seasonal_order,
            enforce_stationarity = self.enforce_stationarity,
            mle_regression = self.mle_regression,
            time_varying_regression = self.time_varying_regression
        )

        self.results_ = self.model_.fit(maxiter = self.maxiter)

        return self

    def predict(self, X):
        check_is_fitted(self, 'model_')
        X = check_array(X)
        pred_obj = self.results_.get_forecast(steps = X.shape[0], exog = X)
        return np.squeeze(pred_obj.prediction_results.forecasts)
    
    def summary(self):
        print(self.results_.summary())

Import Statements

At the beginning of the wrapper, I import the packages required to run the wrapper.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
import statsmodels.api as sm
import numpy as np

Importing the BaseEstimator and RegressorMixin classes from sklearn.base adds a lot of backend functionality from scikit-learn for estimators in general and regressor estimators specifically. The wrapper itself is a child class to these classes. Of course, if your estimator is either a classifier or cluster type, you should import ClassifierMixin or ClusterMixin instead of RegressorMixin.

Constructor `init`

sklearn’s documentation stresses the importance of configuring the constructor so that each keyword argument corresponds to an attribute on the instance. In other words, each keyword argument should correspond to some type of parameter that is required to train the estimator and no training data should be supplied to the estimator at instantiation. Additionally, no input validation or manipulation should occur in the constructor. In my constructor, you can see that it only takes values from **kwargs and plugs them into different attributes in the instance:

class SarimaxSKL(BaseEstimator, RegressorMixin):
    def __init__(self, **kwargs):
        self.order = kwargs.get("order", (0, 0, 0))
        self.seasonal_order = kwargs.get("seasonal_order", (0, 0, 0, 12))
        self.enforce_stationarity = kwargs.get("enforce_stationarity", False)
        self.maxiter = kwargs.get("maxiter", 1000)
        self.mle_regression = kwargs.get("mle_regression", True)
        self.time_varying_regression = kwargs.get("time_varying_regression", False)

I like to use **kwargs this way because I can still specify default values the way I want to while also making it simple to wrap additional classes around this one if that’s something I want to do eventually. For those that don’t know how **kwargs works, take a look at this.

It would also be perfectly valid to specify argument names that correspond to your estimator’s parameters in the constructor instead of using **kwargs, do what you’re most comfortable with. If I were to do that with my constructor, I would just change my constructor’s declaration to something more like this:

def __init__(self, order = (0, 0, 0), seasonal_order = (0, 0, 0, 12), ...) # ...and so on

`get_params()`

    def get_params(self, deep = True):
        return {"order": {self.order}, "seasonal_order": {self.seasonal_order},
                "enforce_stationarity": {self.enforce_stationarity},
                "max_iter": {self.maxiter}, 
                "mle_regression": {self.mle_regression},
                "time_varying_regression": {self.time_varying_regression}}

The get_params() method is a method found in all sklearn estimators that takes no arguments beyond deep and returns a dictionary containing the instance’s parameters. The deep keyword argument is used to determine whether the method should return parameters of sub-estimators. In my case, this isn’t really necessary, but I thought to include it to illustrate all the required moving parts of a sklearn estimator. As long as your estimator is a child of BaseEstimator, you don’t actually need to explicitly write this method yourself if you don’t want to.

`set_params()`

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

This method is absolutely essential for my use case since I want to use sklearn’s grid searching functionality. The set_params() method takes the parameters supplied at instantiation and unpacks them into a dictionary, then sets the estimator’s parameters using that dictionary. The sklearn documentation specifies that this method should always return the estimator itself, hence why the method ends with return self.

`fit()`

This is the meat of the estimator. It is called an estimator, after all, because we use it to estimate the parameters that describe the process we’re studying through a fitting process. Let’s go line by line to understand how this method works.

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.X_ = X
        self.y_ = y

Before doing anything, it’s best to validate your input to make sure that there’s a ground truth value in y for each training example in X. sklearn has a very useful validation function called check_X_y() that does this for you; it simply checks that the length of y and X match and that y is one-dimensional and X is two-dimensional. It also verifies that y only contains finite, non-null values. From there, if everything validates properly, the function converts your data to arrays, which tend to be more efficient than lists or pandas dataframes.

Keep in mind that all fit() methods should accept X and y arguments, even cluster estimators that are unsupervised. sklearn’s documentation says this is required “to make it possible to use the estimator as part of a pipeline that can mix both supervised and unsupervised transformers”. If you are using an unsupervised estimator, just set y = None in the function’s declaration.

I like to save the converted X and y arrays as attributes of the estimator in case I want to examine them after fitting, but I don’t believe this is strictly necessary.

        self.model_ = sm.tsa.statespace.SARIMAX(
            endog = y, exog = X, order = self.order,
            seasonal_order = self.seasonal_order,
            enforce_stationarity = self.enforce_stationarity,
            mle_regression = self.mle_regression,
            time_varying_regression = self.time_varying_regression
        )

        self.results_ = self.model_.fit(maxiter = self.maxiter)

        return self

The model_ attribute of an estimator is what it sounds like; this is the actual model object you intend to use. In my case, it’s a sm.tsa.statespace.SARIMAX object. You can see that I feed the parameters from the constructor to this model and, conveniently, they have the same names in the statsmodels package.

Finally, I call the fit() method on my SARIMAX instance to fit the model. Note that this fit() method is the fit() method from statsmodels, not a fit() method from sklearn or the fit() method we’re currently writing. If your custom estimator does not have a fit() method, you will want to put code here that will return fitted coefficients or parameters based on the training data.

Finally, we simply return the instance which has been fitted. At this point, we have a fitted estimator but we do not have a way to get predictions from it. For that, we need a separate predict() method.

`predict()`

    def predict(self, X):
        check_is_fitted(self, 'model_')
        X = check_array(X)
        pred_obj = self.results_.get_forecast(steps = X.shape[0], exog = X)
        return np.squeeze(pred_obj.prediction_results.forecasts)

Before creating any predictions, we want to make sure that we are trying to generate predictions with an estimator that has been fitted. It wouldn’t make much sense to get predictions from an unfitted estimator! That is where another useful sklearn validation function comes in, check_is_fitted(), which simply checks whether the estimator has had the fit() method called on it yet.

Next, we want to make sure that our estimator can make sense of the examples X we give to the predict() method. In this case, we use the check_array() function from sklearn, which makes sure that our examples are two-dimensional and only contain finite, non-null values.

Finally, we can generate predictions. In the case of statsmodels.tsa estimators, this usually means calling get_forecast() on the fitted estimator, which is what I’m doing on the third line of the predict() method. If your estimator does not have explicit methods to retrieve predictions, you will want to put code here that does that for you.

Finally, I can return my forecast. Neat!

Implementation

Now we have a wrapper that allows sklearn to interpret my new SarimaxSKL class as a sklearn estimator class. The entire point of doing all this work was to use this estimator with sklearn’s grid searching functionality, so let’s do that with a small sample data set. Before moving on to data preparation, I’ll need these imports to complete my implementation:

from pmdarima.datasets import load_msft
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_absolute_percentage_error, make_scorer
import statsmodels.api as sm
import numpy as np

Data Preparation

I’m going to use a small portion of a Kaggle data set available through the pmdarima package that contains information about Microsoft’s stock price including opening prices, closing prices, daily highs, daily lows, and daily volume.

from pmdarima.datasets import load msft

msft = load_msft()
msft = msft[msft["Date"] >= "2016-06-01"]

X = msft.drop(['OpenInt', 'Volume', 'Date'], axis = 1)
y = msft['Volume']

Since this is just an example, I am filtering out observations earlier than June 2016. SARIMAX can be quite computationally expensive when using large training data sets and since we will be grid searching, paring down the data set is wise.

Also, notice that I split the data set into X and y. This is to prepare it for the grid search. My exogenous variables in X include the daily highs and lows and the opening and closing prices. The target variable is going to be volume.

Preparing the Grid Search

Next, I need to determine how I want to score each of the candidate models. I’m going to use mean absolute percentage error as the score metric in this example.

I also need to determine my splitting strategy for the cross validation. The default behavior in GridSearchCV() is to shuffle the data and split it into \( k \) random sets for standard \( k \)-fold cross validation. With time series analysis, this is a poor strategy because training data should always come before testing data.

A cursory search on your favorite engine about cross validation in time series will give you mountains of reading about this, but in short, random shuffling like this will give us a poor representation of the model’s true performance because there’s a good chance that some test examples will have occurred prior to some training examples.

It’s sort of like building a model that has to know what happens in the future to perform well, which is functionally useless. Like if you know what’s going to happen in the future, why are you even building a model in the first place?

Anywho, all of this to say, I use TimeSeriesSplit() from sklearn.model_selection to prevent this problem. It gives me a way to tell the grid search function to keep training data strictly prior to testing data in each of the cross validation folds.

Finally, I also need to specify my hyperparameter grid. I am only going to test four different orders and four seasonal orders to keep processing time low, but in practice you’d probably want to use a larger grid than this.

scorer = make_scorer(mean_absolute_percentage_error, greater_is_better = False)
# greater_is_better = False because mean_absolute_percentage_error is a loss function
tscv = TimeSeriesSplit(n_splits = 3, gap = 14)

param_grid = {
    'order': [
        (1, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)
        ], 
    'seasonal_order': [
        (1, 0, 0, 365), (0, 0, 1, 365), (0, 1, 0, 365), (1, 0, 1, 365), (1, 1, 1, 365)
        ]
}

Running the Grid Search

We’re finally ready to grid search! Let’s do it:

gscv = GridSearchCV(estimator = SarimaxSKL(), param_grid = param_grid, scoring = scorer, cv = tscv.split(X, y), n_jobs = 4)

gscv.fit(X, y)

print(f"Best Parameters: {gscv.best_params_} \n Best Score: {gscv.best_score_:.3f}")

>>> Best Parameters: {'order': (0, 0, 1), 'seasonal_order': (1, 0, 0, 365)}
>>> Best Score: -0.465

At this point, I can train my best performing model and generate predictions and forecasts with the predict() method in the SarimaxSKL class.

One quick note about the best score being -0.465: you may be wondering, how can the mean average percentage error be negative? Rest assured, it isn’t. It’s negative because MAPE is a loss function, and I specified this to sklearn with the greater_is_better argument in this line:

scorer = make_scorer(mean_absolute_percentage_error, greater_is_better = False)

The make_scorer function just multiples all of the scores by \( -1 \) when greater_is_better = False so it can still select the “greatest” score as corresponding to the best model even when you are using a loss function to score candidate models.

There you have it

That’s all there is to it. This is just the tip of the iceberg when it comes to using custom estimators with sklearn. For example, your custom estimator class could accept additional methods or you could set up an ensemble estimator by aggregating several custom subestimators.

If you were really ambitious, you could code up your own estimators or (like me) you can just coerce sklearn to use estimators from other machine learning and statistical libraries. No matter your level of amibition, I hope this has given you enough to get the ball rolling for your own custom estimators!