Windowless Transformation for Multivariate Time Series

Long Vu
Nov 4, 2020

Time series data are collected over time and may exhibit temporal dependencies and internal structure (such as autocorrelation, stationarity, and seasonality) that should be accounted for. Time series analysis has long been an important scientific topic in both academia and industry, and historically, statistical methods such as ARIMA [1], Holt-Winters [2], and BATS/TBATS [3][4] have been the go-to modeling tools for time series analysis. Recently, machine learning and deep learning models have achieved improved performance in time series modeling. While deep learning techniques may not require a feature engineering step prior to applying the final estimator in the pipeline, machine learning estimators such as Random Forest, XGBoost, Linear Regression, and SVM can benefit greatly from feature engineering.

In this post, we focus on a method to transform time series raw data into features that are ready to be consumed by machine learning estimators.

Introduction

Existing techniques usually segment time series data into windows, apply different computations (or transformations) to each window, and merge the transformed output of all windows to generate the final results. There are different types of windows, such as sliding windows and tumbling windows, of which the sliding window is the more popular.

Deciding the right window size is a critical factor. Various techniques try to learn the window size from the data values or from the temporal information of the time series. However, there is no universal method that works for all types of time series, especially for multivariate time series whose individual time series have distinct internal structures.

In this post, we present a different way to do feature engineering for time series. For each data point in the input (univariate) time series, define a maximum look-back window m. A good value of m is one third of the length of the time series. Then, create a list L of look-back windows of sizes base¹, base², base³, base⁴, …, baseⁿ, such that baseⁿ ≤ m, where base is an integer such as 2, 3, 4, … For each window size in L, compute rolling values such as the mean, standard deviation, maximum, and minimum; these are standard pandas rolling functions that can be used in a straightforward fashion. Below, we also introduce a function that applies the Discrete Cosine Transform to each window in L and keeps only the most significant coefficients as features. Doing this, we ensure that the most significant information of each window is retained in the final feature values.
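The window construction and rolling statistics can be sketched as follows; the function name `lookback_windows` and the sample series are illustrative, not taken from the original code:

```python
import numpy as np
import pandas as pd

def lookback_windows(n_points, base=2, max_lookback=None):
    """Build the list L of look-back sizes base^1, base^2, ...,
    capped at m (one third of the series length by default)."""
    m = max_lookback if max_lookback else n_points // 3
    windows, size = [], base
    while size <= m:
        windows.append(size)
        size *= base
    return windows

# Rolling statistics over every window size in L for one series.
series = pd.Series(np.sin(np.linspace(0.0, 12.0, 120)))
features = {}
for w in lookback_windows(len(series), base=2):
    rolled = series.rolling(window=w)
    features[f"mean_{w}"] = rolled.mean()
    features[f"std_{w}"] = rolled.std()
feature_frame = pd.DataFrame(features)  # 120 rows, 2 stats x 5 windows
```

For a series of length 120 with base 2, m is 40 and L is [2, 4, 8, 16, 32], so no single window size ever has to be hand-picked.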

The same procedure can be applied to every individual time series in a multivariate time series, transforming the multivariate time series into features for ML estimators. We call this a “Windowless Transformation” because the user does not need to manually specify the look-back window size for each individual time series; instead, multiple look-back windows are applied as described above to compute the transformed features.

Figure 1 shows the overall architecture of our presented solution. The input multivariate time series goes through a chain of nested transformers: a time series transformer, column transformers, and rolling transformers. These transformers transform the input multivariate time series and concatenate the results.

In the following sections, we discuss each of these transformers in detail.

Time Series Transformer

As the outermost layer of the nested architecture, this transformer is responsible for two main tasks:

  • Loop through all columns (i.e., all individual time series) of the multivariate time series and, for each one, create and execute a column transformer.
  • Aggregate the output of the column transformers into the correct format and return it to the caller of this transformer.
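A minimal sketch of this outer transformer; the class name and the `make_column_transformer` factory argument are illustrative assumptions, not the original implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

class TimeSeriesTransformer(BaseEstimator, TransformerMixin):
    """Outermost transformer: applies a column transformer to each
    individual time series and concatenates the results column-wise."""
    def __init__(self, make_column_transformer):
        # Factory that builds a fresh column transformer per series.
        self.make_column_transformer = make_column_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        outputs = []
        for col in range(X.shape[1]):        # loop over every time series
            ct = self.make_column_transformer()
            outputs.append(ct.fit_transform(X[:, [col]]))
        return np.hstack(outputs)            # aggregate into one matrix

# Usage with a trivial stand-in column transformer (doubles each value):
tst = TimeSeriesTransformer(lambda: FunctionTransformer(lambda x: x * 2))
transformed = tst.transform(np.arange(12.0).reshape(4, 3))
```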

Column Transformer

This transformer creates multiple look-back windows from each data point in the input univariate time series. As discussed above, it creates a list L of look-back windows of sizes base¹, base², base³, base⁴, …, baseⁿ, such that baseⁿ ≤ m, where base is an integer such as 2, 3, 4, … If a maximum look-back value is specified, that value is used as m. Otherwise, as a rule of thumb, m is set to one third of the length of the time series.

This transformer also includes a sklearn FeatureUnion [5] object that concatenates the results of all rolling transformer objects. Using FeatureUnion, all rolling transformers (each handling a different look-back window) can run in parallel, and the final result is obtained by concatenating the output of these rolling transformers. This allows a fast and clean implementation.

The column transformer’s transform method delegates to the ‘transform’ method of its FeatureUnion object, which in turn calls all of its rolling transformers. Each rolling transformer is responsible for transforming one single look-back window length, as we discuss next.
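The column transformer can be sketched as follows; `RollingMeanTransformer` is a simplified stand-in for the rolling transformer discussed in the next section, and all class and parameter names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class RollingMeanTransformer(BaseEstimator, TransformerMixin):
    """Stand-in rolling transformer for one window size; a full
    version would also emit std, min, max, DCT coefficients, etc."""
    def __init__(self, window):
        self.window = window

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Rolling mean keeps the row count, leaving NaN in the first
        # window-1 rows, so all window outputs align row for row.
        return pd.DataFrame(X).rolling(self.window).mean().to_numpy()

class ColumnTransformer(BaseEstimator, TransformerMixin):
    """Builds L = [base, base^2, ...] with base^n <= m and fans out
    one rolling transformer per window size via a FeatureUnion."""
    def __init__(self, base=2, max_lookback=None):
        self.base = base
        self.max_lookback = max_lookback

    def fit(self, X, y=None):
        # Rule of thumb: m is one third of the series length unless
        # a maximum look-back value is specified.
        m = self.max_lookback if self.max_lookback else len(X) // 3
        windows, size = [], self.base
        while size <= m:
            windows.append(size)
            size *= self.base
        self.union_ = FeatureUnion(
            [(f"roll_{w}", RollingMeanTransformer(w)) for w in windows])
        self.union_.fit(X)
        return self

    def transform(self, X):
        # Column-wise concatenation of all rolling transformer outputs.
        return self.union_.transform(X)
```

FeatureUnion’s `n_jobs` parameter provides the parallel execution across rolling transformers mentioned above.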

Rolling Transformer

This transformer first converts the input 2D NumPy array X into a pandas DataFrame. Then, it uses the DataFrame’s rolling and apply functions, passing a function named ‘compute’ (more about the ‘compute’ function below). After the ‘compute’ function outputs its results, the transformer prepends the output with np.nan rows to make sure that the shape is correct. This is important because the outer FeatureUnion can only concatenate (by column) transformed matrices that have the same number of rows.
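A sketch of such a rolling transformer, assuming a `compute` callable that maps one window of values to a pandas Series of features (stubbed with mean and standard deviation below); the names are illustrative:

```python
import numpy as np
import pandas as pd

class RollingTransformer:
    """Transforms one look-back window length."""
    def __init__(self, window, compute):
        self.window = window
        self.compute = compute      # window values -> pandas Series

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        frame = pd.DataFrame(X)     # 2D NumPy array -> DataFrame
        results = []

        def collect(values):
            # rolling.apply must return a scalar, so the real output
            # (a Series) is captured in `results` as a side effect.
            results.append(self.compute(values))
            return 0.0

        frame[0].rolling(self.window).apply(collect, raw=True)
        out = pd.concat(results, axis=1).T.to_numpy()
        # Prepend NaN rows so every rolling transformer emits the same
        # number of rows and FeatureUnion can concatenate by column.
        pad = np.full((self.window - 1, out.shape[1]), np.nan)
        return np.vstack([pad, out])

rt = RollingTransformer(4, lambda v: pd.Series([v.mean(), v.std()]))
features = rt.transform(np.arange(20.0).reshape(-1, 1))
```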

The ‘compute’ method is a helper (or workaround) used to allow the DataFrame rolling function to output more than one scalar per window, since rolling.apply expects a scalar return value.

At its core, the ‘compute’ function uses the Discrete Cosine Transform [6] to calculate the coefficients of the input data. DCT-II (i.e., type 2) is used here since it is the most common variant of the discrete cosine transform. Note that the output of the DCT has the same length as its input. However, we do not want to keep all coefficients in the output; instead, only the largest coefficients (in absolute value) are kept, so that we reduce memory usage while retaining most of the useful information in the input data. If the input data is too short, zeros are padded at the end. The output is a pandas Series, which allows the outer FeatureUnion to concatenate these series and create the correct output format.
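A sketch of such a ‘compute’ function; the number of coefficients kept (`n_keep`) is an illustrative parameter, not a value from the original post:

```python
import numpy as np
import pandas as pd
from scipy.fftpack import dct

def compute(data, n_keep=4):
    """Type-2 DCT of one window, keeping only the n_keep largest
    coefficients (by absolute value) in their original order."""
    coeffs = dct(np.asarray(data, dtype=float), type=2)
    # Indices of the n_keep largest-magnitude coefficients.
    idx = np.argsort(np.abs(coeffs))[::-1][:n_keep]
    kept = coeffs[np.sort(idx)]
    if len(kept) < n_keep:
        # Pad short inputs with zeros at the end so the output
        # length is constant across windows.
        kept = np.pad(kept, (0, n_keep - len(kept)))
    return pd.Series(kept)
```

Returning a pandas Series (rather than an array) is what lets the outer FeatureUnion stack the per-window results cleanly.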

Source code

The source code of our presented solution is available at: https://github.com/lhvu2/ts_windowless_transformer

References

  1. PJ Brockwell, RA Davis, MV Calder, “Introduction to time series and forecasting”, 2002, Springer.
  2. Prajakta S. Kalekar, “Time series Forecasting using Holt-Winters Exponential Smoothing”, 2004
  3. De Livera, Alysha M. “Automatic forecasting with a modified exponential smoothing state space framework.” Monash Econometrics and Business Statistics Working Papers 10, no. 10 (2010).
  4. De Livera, Alysha M., Rob J. Hyndman, and Ralph D. Snyder. “Forecasting time series with complex seasonal patterns using exponential smoothing.” Journal of the American Statistical Association 106, no. 496 (2011): 1513–1527.
  5. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
  6. https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html



Long Vu is a Research Staff Member at IBM T.J. Watson Research Center, New York, USA.