Hello fellow reader (and hello again if you read the first part of this article series). My name is Artur, and I am the head of the Machine Learning team in Akvelon’s Kazan office. You are about to read the second part of our tutorial on anomaly detection in time series.
Implementation of ARIMA, CNN, and LSTM
During our own research, we’ve managed to gather a lot of information from tiny useful pieces all over the internet and we don’t want this knowledge to be lost!
We already dove into theory and data preparation in Part I:
1 more part is coming very soon, stay in touch and don’t miss it! This item will become a link to the next chapter:
- Part III — Eventually easier than it seemed
We reuse our code so if something seems unclear, consider visiting the previous part once more.
Alright then, let’s move on!
As a brief reminder, here are the tools that we use:
- Jupyter Notebooks environment for the implementation of the models
- Scikit-learn for some data preprocessing
- statsmodels library for the ARIMA model
- PyTorch for neural networks
- Plotly for plots and graphs
Amongst all possible approaches listed in Part I, we chose these suitable ones:
- ARIMA statistical model — predicts next value
- Convolutional Neural Network — predicts next value
- Long Short-Term Memory Neural Network — reconstructs current value
Let’s start with the ARIMA model.
ARIMA Statistical Model
ARIMA is an autoregressive statistical model. Its coefficients are fitted during training and then used for inference, while its hyperparameters (the orders of the autoregressive, differencing, and moving-average parts) are chosen by us beforehand.
CPU usage naturally differs between nighttime and daytime, and that is perfectly normal behavior. To account for it, we can treat night and day as separate “seasons”. For this reason we use SARIMAX, an extension of ARIMA that adds seasonal components to the model. SARIMAX handles the separation of seasons for us, so we don’t have to provide anything else except our dataset.
Here you can see the schema for the training process of SARIMAX:
The model tries to predict the next value from the current one and then compares the prediction with the actual value.
The implementation of this model is not very interesting, since all we can play with are its hyperparameters. To find the model that produces the best predictions, we iterate over hyperparameter combinations and pick the best one.
All we need is the model itself from the statsmodels.api package and the product function from itertools to iterate over hyperparameters.
It is also a good idea to write the predictions of the best model into a new column of the DataFrame with the initial data, which makes later metric calculation easier.
Here is the code to pick the best model and write its predictions into training and validation DataFrames:
At this point, we have the best set of parameters and the predictions for both training and validation data. To understand how good our model is, we should calculate metrics such as precision, recall, and F-score. We will cover metrics fully in Part III, but we can already visualize the predictions and see how our model performs:
Note: one important thing about ARIMA is that training it and optimizing its coefficients takes about 10 times longer than the same training of both neural networks.
Convolutional Neural Network
Convolutional neural networks are usually applied to image-related tasks such as classification and segmentation. But the purpose of convolutional layers is to find and recognize patterns, which is just as applicable to analyzing the CPU utilization metric.
We use a ResNet-like architecture (a widely adopted choice for CNNs) that consists of Residual Blocks (ResBlocks). The idea behind a ResBlock is simple yet effective: add the input of the block to its output. This lets the network “remember” each intermediate result and take it into account in the final layers.
CNN tries to predict the next value using some number of previous values. In our case, this number equals 10, but of course, it may be configured.
Before coding the CNN itself, we need a few additional data preparations. Since we use PyTorch, the data should be wrapped into something compatible with PyTorch’s Dataset. This can be a class inherited from the Dataset class, a generator, or even a simple iterator.
We will use the class option because of its readability and the enhancements PyTorch provides for it implicitly.
Once again, there are some necessary imports for Dataset creation:
Let’s recall the task for CNN. We want the model to predict the next value using some amount of the previous values. This means that our Dataset class should contain each item in a specific format — divided into 2 parts:
- n values in sequential order that the model uses for prediction (to easily change the number of these values, we make n configurable)
- 1 value that comes right after the n values from the first part of the item
Coming back to the implementation: it is actually very easy to wrap the data with a custom class that inherits from PyTorch’s Dataset. It only needs to implement the __len__ and __getitem__ methods.
Now it is the right time to define the number of previous values that are used to predict the next one. And, of course, to wrap the data with our custom Dataset class.
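The item format described above could be sketched like this. The class name WindowDataset, the variable n_prev, and the sine-wave stand-in data are hypothetical and chosen only for illustration. The extra channel dimension is an assumption that makes the items directly usable by 1D convolutions.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class WindowDataset(Dataset):
    """Pairs n consecutive values with the single value that follows them."""

    def __init__(self, values, n):
        self.values = torch.as_tensor(np.asarray(values), dtype=torch.float32)
        self.n = n

    def __len__(self):
        # Each item needs n inputs plus 1 target.
        return max(len(self.values) - self.n, 0)

    def __getitem__(self, idx):
        x = self.values[idx : idx + self.n]  # the n previous values
        y = self.values[idx + self.n]        # the value to predict
        # Add a channel dimension so Conv1d sees (channels=1, length=n).
        return x.unsqueeze(0), y.unsqueeze(0)


n_prev = 10  # number of previous values, as in the article
values = np.sin(np.linspace(0, 20, 200))  # stand-in for the scaled CPU column
train_ds = WindowDataset(values, n_prev)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
```

Wrapping the Dataset in a DataLoader gives batching and shuffling for free, which is one of the conveniences of the class-based option.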
Here comes the best part — the definition of the neural network.
Let’s recall what a Residual Block looks like in regular ResNets. It consists of:
The only difference between the regular ResBlock and ours is that we removed the last ReLU activation: it turned out that in our case a CNN without the last ReLU in the ResBlock generalizes better.
Every Residual Block contains two Convolution + Batch Norm (+ ReLU) combinations, so a network with many blocks repeats this combination many times. That makes it a good building block to define first.
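A possible shape for such a building block is sketched below. The helper name conv_bn, the kernel size of 3, and the use of 1D layers are assumptions for this example and not necessarily what the original code used.

```python
import torch
from torch import nn


def conv_bn(in_feat, out_feat, kernel_size=3, relu=True):
    """Conv1d -> BatchNorm1d (-> optional ReLU).

    With an odd kernel_size, padding=kernel_size // 2 keeps the sequence length.
    """
    layers = [
        nn.Conv1d(in_feat, out_feat, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm1d(out_feat),
    ]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)


block = conv_bn(1, 8)
out = block(torch.randn(4, 1, 10))  # input shape: (batch, channels, length)
```

The relu flag is there because the second combination inside a Residual Block is used without an activation.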
In each Residual Block, we have to handle the case where the number of output channels changes (when in_feat != out_feat). One way to match the number of channels is to duplicate or cut them. However, there is a better way: a 1×1 convolution without padding. This trick not only fits the layer input to the layer output but also adds more meaningful computation to the neural network.
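To make the 1×1 convolution trick concrete, here is a minimal sketch of a 1D Residual Block without the final ReLU. The layer sizes and the attribute names body and shortcut are illustrative assumptions. Only in_feat and out_feat come from the text.

```python
import torch
from torch import nn


class ResBlock(nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_feat, out_feat, 3, padding=1, bias=False),
            nn.BatchNorm1d(out_feat),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_feat, out_feat, 3, padding=1, bias=False),
            nn.BatchNorm1d(out_feat),
        )
        # A 1x1 convolution (no padding) matches the channel counts on the
        # shortcut path; when they already match, the input passes through.
        self.shortcut = (
            nn.Identity()
            if in_feat == out_feat
            else nn.Conv1d(in_feat, out_feat, kernel_size=1, bias=False)
        )

    def forward(self, x):
        # No ReLU after the sum, following the modification described above.
        return self.body(x) + self.shortcut(x)


res_out = ResBlock(1, 16)(torch.randn(4, 1, 10))
```

Because the kernel size is 1 and there is no padding, the shortcut changes only the number of channels and leaves the sequence length untouched, so the sum is always well defined.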
It is common to finish the base block of a convolutional net with Max Pooling or Average Pooling, depending on the task. Here comes another useful trick for Convolutional Neural Nets (thanks to Jeremy Howard and his fantastic fast.ai library): concatenate Average Pooling and Max Pooling. This allows the neural net to decide which approach is better for the current task and how to combine them to get better results:
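The concatenation idea can be sketched for 1D inputs as follows. This is a small re-implementation written for illustration, not code taken from fast.ai, and the class name is an assumption.

```python
import torch
from torch import nn


class AdaptiveConcatPool1d(nn.Module):
    def __init__(self, size=1):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool1d(size)
        self.max = nn.AdaptiveMaxPool1d(size)

    def forward(self, x):
        # Concatenate along the channel dimension: C channels become 2 * C.
        return torch.cat([self.max(x), self.avg(x)], dim=1)


pool = AdaptiveConcatPool1d()
x = torch.tensor([[[1.0, 2.0, 6.0]]])  # (batch=1, channels=1, length=3)
pooled = pool(x)                       # max is 6.0, mean is 3.0
```

Note that the layer after this pooling must expect twice as many input features as the last convolutional block produces.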
And here is our resulting CNN class that is made of the building blocks that we implemented above:
We always want both training and validation losses to move down because this behavior means that the model has learned something useful about our data. If any loss eventually moves up, then the model can’t figure out how to solve the task, and you should change or modify it.
The losses in the picture above look good because they eventually move down, but this is only a preliminary judgment, since we haven’t checked the predictions yet.
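For readers who want to reproduce such loss curves, a generic training loop that records both losses per epoch might look like the sketch below. Everything in it is an assumption made for illustration: the small placeholder model (used instead of the CNN), the sine-wave data, the Adam optimizer, the MSE loss, and the number of epochs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 10 previous values -> next value of a sine wave.
series = torch.sin(torch.linspace(0, 30, 400))
windows = series.unfold(0, 11, 1)  # (390, 11) sliding windows
x, y = windows[:, :10].unsqueeze(1), windows[:, 10:]
train_dl = DataLoader(TensorDataset(x[:300], y[:300]), batch_size=32, shuffle=True)
valid_dl = DataLoader(TensorDataset(x[300:], y[300:]), batch_size=32)

# Placeholder model; the CNN from the article would be used here instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()


def run_epoch(dl, train):
    """Runs one pass over dl and returns the average loss per sample."""
    model.train(train)
    total = 0.0
    with torch.set_grad_enabled(train):
        for xb, yb in dl:
            loss = loss_fn(model(xb), yb)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
    return total / len(dl.dataset)


history = []  # one (train_loss, valid_loss) pair per epoch
for epoch in range(20):
    history.append((run_epoch(train_dl, True), run_epoch(valid_dl, False)))
```

Plotting the two columns of history against the epoch number gives the kind of curves discussed above.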
At this very moment, we can easily calculate them with our model:
And take a look:
Just to be clear, we don’t really want our models to perfectly predict values, as such perfection would destroy the whole idea of the anomaly detection process (described in “Saying what we want from our models out loud” in Part I). That’s why when we look at plots, we want to see our model catch the main trend, not the particular values.
Long Short-Term Memory Neural Network
LSTM neural networks have an internal memory state that allows them to remember their previous evaluations, which makes this architecture a strong candidate for time series anomaly detection.
Unlike the previous two models, this neural network tries to reconstruct the current value using the value itself. It may seem trivial, but this approach has extremely good results in anomaly detection.
We can simplify the Dataset wrapper from the CNN section because for reconstruction the output is the very same as the input. That is why we get rid of the first part of each item and keep just 1 value. We should also consider that LSTM models usually can’t be fed the whole sequence at once because of the memory it would consume, so we have to split the data into partial sequences for training. Moreover, training on the whole sequence may reduce the model’s ability to generalize, because the model may simply memorize the data.
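One way to split the data into partial sequences is sketched below. The class name, the seq_len parameter, and the decision to use non-overlapping chunks and drop the incomplete tail are assumptions for this example. The original code may split the data differently.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class SequenceChunkDataset(Dataset):
    """Splits a long series into non-overlapping chunks; target equals input."""

    def __init__(self, values, seq_len):
        values = torch.as_tensor(np.asarray(values), dtype=torch.float32)
        n_chunks = len(values) // seq_len  # the incomplete tail is dropped
        self.chunks = values[: n_chunks * seq_len].reshape(n_chunks, seq_len, 1)

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        chunk = self.chunks[idx]  # shape: (seq_len, 1)
        return chunk, chunk       # the model has to reconstruct its input


ds = SequenceChunkDataset(np.arange(105), seq_len=20)
```

Here 105 values with seq_len=20 yield 5 chunks, and the last 5 values are left out.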
The definition of the LSTM neural net is much easier than the CNN one. PyTorch already has an implemented class of LSTM cells that we can use.
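A reconstruction model of this kind can be very small, as the following sketch suggests. It uses nn.LSTM with a linear output layer. The class name, the hidden size, and the optional state argument are assumptions, and the original model may be built from different PyTorch classes.

```python
import torch
from torch import nn


class LSTMReconstructor(nn.Module):
    def __init__(self, hidden_size=32, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x, state=None):
        # x: (batch, seq_len, 1); state optionally carries (h, c) between calls.
        out, state = self.lstm(x, state)
        return self.head(out), state


model = LSTMReconstructor()
recon, state = model(torch.randn(4, 20, 1))
```

The output has the same shape as the input, so it can be compared with the input directly using a loss such as mean squared error.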
Because of the training with partial sequences, we can’t directly send the Dataset instance to get the results, but we certainly can extract the entire sequences from DataFrames:
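Extracting a whole sequence from a DataFrame and passing it through the network could look roughly like this. The column names "value" and "reconstruction", the synthetic data, and the untrained stand-in layers are assumptions. In practice the trained model would be used.

```python
import numpy as np
import pandas as pd
import torch
from torch import nn

df = pd.DataFrame({"value": np.sin(np.linspace(0, 20, 300))})

# Untrained stand-ins for the trained LSTM and its output layer.
lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)

# The whole column as a single sequence: (batch=1, seq_len, features=1).
seq = torch.tensor(df["value"].to_numpy(), dtype=torch.float32).view(1, -1, 1)

lstm.eval()
with torch.no_grad():  # inference only, so no gradients are stored
    out, _ = lstm(seq)
    df["reconstruction"] = head(out).squeeze().numpy()
```

Since no gradients are kept during inference, processing the entire sequence at once is far cheaper in memory than it would be during training.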
Another intermediate conclusion
Terrific, we have 3 trained models and their results on our dataset!
The toughest part is behind us and now we are prepared to make our final steps towards anomaly detection. In the last chapter, we will do some extra preparations and reveal the detection process along with its results.
If you want to refresh some theory or data preprocessing, don’t be shy and go to the first part:
Otherwise, the third part awaits (this item will become a link like the one in the header):
- Part III — Eventually easier than it seemed
This article was written by Artur Khanin, a Technical Project Lead at Akvelon’s Kazan office, and was originally published on Medium.
This project was designed and implemented with love by the following members of Akvelon’s team: