ARIMA, CNN, and LSTM in Time Series Anomaly Detection

Hello fellow reader (and hello again if you read the first part of this article series). My name is Artur, and I am the head of the Machine Learning team in Akvelon’s Kazan office. You are about to read the second part of the tutorial on anomaly detection in time series.

Implementation of ARIMA, CNN, and LSTM

During our own research, we’ve managed to gather a lot of information from tiny useful pieces all over the internet and we don’t want this knowledge to be lost!

We already dove into theory and data preparation in Part I:

Part I — Intro to Anomaly Detection and Data Preparation

One more part is coming very soon, so stay in touch and don’t miss it! This item will become a link to the next chapter:

Part III — Eventually easier than it seemed

We reuse our code from Part I, so if something seems unclear, consider visiting the previous part once more.

Alright then, let’s move on!

As a brief reminder, here are the tools that we use:

Jupyter Notebooks environment for the implementation of the models
Scikit-learn for some data preprocessing
statsmodels library for the ARIMA model
PyTorch for neural networks
Plotly for plots and graphs

Implemented Approaches

From the approaches listed in Part I, we chose the following three:

ARIMA statistical model — predicts next value
Convolutional Neural Network — predicts next value
Long Short-Term Memory Neural Network — reconstructs current value

Let’s start with the ARIMA model.

ARIMA Statistical Model

ARIMA is an autoregressive statistical model that fits its coefficients during training; these coefficients are then used in inference. The orders of the model (p, d, q) are hyperparameters that we choose before training.

CPU usage naturally differs between nighttime and daytime, and this is completely normal behavior. To account for it, we can treat night and day as separate “seasons”. For this reason we use SARIMAX, an extension of ARIMA that adds seasonal terms to the model. SARIMAX handles the seasonal effects for us, so beyond the model’s hyperparameters we only have to provide our dataset.

Here you can see the schema for the training process of SARIMAX:

Training process and architecture of the SARIMAX model

The model tries to predict the next value using the current one, and then it compares the predicted result with the actual value.

SARIMAX Implementation

The implementation of this model is not so interesting, since all we can play with are its hyperparameters. To find the model that produces the best predictions, we iterate over the hyperparameters and pick the best combination.

All we need is the model itself from the statsmodels.api package and the product function from itertools for iterating over hyperparameters.

It is also a good idea to write the predictions of the best model into a new column of the DataFrame with the initial data, to ease further metrics calculation.

Here is the code to pick the best model and write its predictions into training and validation DataFrames:

At this point, we have the best set of parameters and predictions for both training and validation data. To understand how good our model is, we should calculate metrics such as precision, recall, and F-score. We will fully cover metrics in Part III, but we can already visualize the predictions and see how our model performs:

ARIMA’s predictions on training data

ARIMA’s prediction on validation data

Note: one important thing about ARIMA is that training it and optimizing its coefficients takes roughly 10 times longer than training both neural networks.

Convolutional Neural Network

Convolutional neural networks are usually applied to image-related tasks, such as image classification and segmentation. But the purpose of convolutional layers is to find and recognize patterns, which is just as applicable to the analysis of the CPU utilization metric.

The architecture of the CNN model

We use a ResNet-like architecture (one of the most widely used architecture families for CNNs) that consists of Residual Blocks (ResBlocks). The idea behind a ResBlock is simple yet efficient: add the input of the block to its output. This lets the network carry each intermediate result forward and take it into account in the final layers.

CNN tries to predict the next value using some number of previous values. In our case, this number equals 10, but of course, it may be configured.

CNN Implementation

Before coding the CNN itself, we should make a few additional preparations of the data. Since we use PyTorch, the data should be wrapped into something compatible with PyTorch’s Dataset. This can be a class inherited from the Dataset class, a generator, or even a simple iterator. We will use the class option because of its readability and because PyTorch’s utilities, such as DataLoader, work with it directly.

Once again, there are some necessary imports for Dataset creation:

Let’s recall the task for CNN. We want the model to predict the next value using some amount of the previous values. This means that our Dataset class should contain each item in a specific format — divided into 2 parts:

n values in sequential order that the model uses for prediction (to easily change the amount of these values, we make n the parameter)
1 value that goes next after n values from the first part of the item

Coming back to the implementation — actually, it is very easy to wrap data with our custom class that just inherits from PyTorch’s Dataset. It should implement only __init__, __len__ and __getitem__ methods:

Now it is the right time to define the number of previous values that are to be used for the prediction of the next one. And, of course, wrap the data with our CPUDataset class.
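The Dataset class and the wrapping step described above might look like the sketch below. Only the class name CPUDataset and the window size of 10 come from the text; the constructor arguments, the tensor shapes, and the stand-in series are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CPUDataset(Dataset):
    """Sliding-window dataset: n previous values and the value that follows them."""

    def __init__(self, values, n_prev):
        self.values = torch.as_tensor(np.asarray(values), dtype=torch.float32)
        self.n_prev = n_prev

    def __len__(self):
        # Every item needs n_prev inputs plus one target value.
        return max(len(self.values) - self.n_prev, 0)

    def __getitem__(self, idx):
        window = self.values[idx : idx + self.n_prev]
        target = self.values[idx + self.n_prev]
        # Shape (1, n_prev): a single input channel, as Conv1d expects.
        return window.unsqueeze(0), target.unsqueeze(0)


n_prev = 10  # number of previous values used for each prediction
series = np.arange(25, dtype=np.float32)  # stand-in for a DataFrame column
dataset = CPUDataset(series, n_prev)
loader = DataLoader(dataset, batch_size=4, shuffle=False)
x, y = dataset[0]  # x holds values 0..9, y holds the value 10
```

Keeping n_prev as a constructor argument makes it easy to experiment with different window sizes without touching the class.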

Here comes the best part — the definition of the neural network.

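Let’s recall what a Residual Block looks like in regular ResNets. It consists of two Convolution + Batch Norm stages with a ReLU activation between them; the input of the block is then added to the result, and a final ReLU activation is applied.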

The only difference between the regular ResBlock and ours is that we removed the last ReLU activation — it turned out that in our case CNN without the last ReLU in ResBlock generalizes better.

Each Residual Block contains two Convolution + Batch Norm (+ ReLU) combinations, so this combination is a good building block to define first.

In each Residual Block, we should handle the case where the number of output channels changes (when in_feat != out_feat). One possible way to match the number of channels is to duplicate or cut them. However, there is a better way: we can use a 1×1 convolution without padding. This trick not only lets us fit the block input to the block output but also gives the network an extra learnable transformation.
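A possible sketch of these two building blocks follows. The names conv_bn and ResBlock, the kernel size of 3, and the use of 1D layers are assumptions; in_feat and out_feat follow the naming used above.

```python
import torch
from torch import nn


def conv_bn(in_feat, out_feat, kernel_size=3, relu=True):
    """Convolution + Batch Norm, optionally followed by ReLU."""
    layers = [
        # Padding keeps the sequence length unchanged for odd kernel sizes.
        nn.Conv1d(in_feat, out_feat, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm1d(out_feat),
    ]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)


class ResBlock(nn.Module):
    """Residual block without the ReLU after the addition."""

    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn(in_feat, out_feat, relu=True),
            conv_bn(out_feat, out_feat, relu=False),
        )
        # A 1x1 convolution without padding matches the channel counts when they differ.
        if in_feat == out_feat:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Conv1d(in_feat, out_feat, kernel_size=1, bias=False)

    def forward(self, x):
        # Add the (possibly projected) input of the block to its output.
        return self.body(x) + self.shortcut(x)


block = ResBlock(in_feat=1, out_feat=8)
out = block(torch.randn(4, 1, 10))  # (batch, channels, length) -> (4, 8, 10)
```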

It is common to finish the base block of a convolutional net with Max Pooling or Average Pooling, depending on the task. Here comes another useful trick for Convolutional Neural Nets (thanks to Jeremy Howard and his fantastic fast.ai library): concatenate Average Pooling and Max Pooling. This lets our neural net decide which approach is better for the current task and how to combine them to get better results.
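One way to express the concatenated pooling for 1D data is sketched below. The class name AdaptiveConcatPool1d is chosen here for illustration and is not taken from the original code.

```python
import torch
from torch import nn


class AdaptiveConcatPool1d(nn.Module):
    """Concatenates max pooling and average pooling along the channel dimension."""

    def __init__(self, output_size=1):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool1d(output_size)
        self.max_pool = nn.AdaptiveMaxPool1d(output_size)

    def forward(self, x):
        # C input channels become 2 * C output channels.
        return torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)


pool = AdaptiveConcatPool1d()
x = torch.tensor([[[1.0, 2.0, 3.0, 6.0]]])  # (batch=1, channels=1, length=4)
pooled = pool(x)  # max is 6.0 and mean is 3.0, so the shape is (1, 2, 1)
```

Because the number of channels doubles, the layer that follows the pooling has to expect twice as many input features.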

And here is our resulting CNN class that is made of the building blocks that we implemented above:

We want both training and validation losses to go down, because this means that the model has learned something useful about our data. If the training loss does not go down, the model can’t figure out how to solve the task; if the validation loss starts to rise while the training loss keeps falling, the model is overfitting. In both cases you should change or modify the model.

The losses in the plot above look good because they eventually move down, but it is too early to call them good since we haven’t checked the predictions yet.
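To make the loss tracking concrete, here is a minimal training loop that records both losses per epoch. The tiny network, the sine-wave data, the optimizer settings, and the number of epochs are stand-ins chosen for illustration, not the setup used in this article.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: windows of 10 values from a sine wave plus the value that follows.
series = torch.sin(torch.linspace(0, 20, 300))
windows = series.unfold(0, 11, 1)  # shape (290, 11)
x, y = windows[:, :10].unsqueeze(1), windows[:, 10:]
train_dl = DataLoader(TensorDataset(x[:200], y[:200]), batch_size=32, shuffle=True)
valid_dl = DataLoader(TensorDataset(x[200:], y[200:]), batch_size=32)

# Tiny stand-in network, much smaller than a ResNet-style model.
model = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(40, 1),
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)


def run_epoch(loader, train):
    """Runs one pass over the loader and returns the mean loss per sample."""
    model.train(train)
    total, count = 0.0, 0
    with torch.set_grad_enabled(train):
        for xb, yb in loader:
            loss = criterion(model(xb), yb)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
            count += len(xb)
    return total / count


train_losses, valid_losses = [], []
for epoch in range(20):
    train_losses.append(run_epoch(train_dl, train=True))
    valid_losses.append(run_epoch(valid_dl, train=False))
```

Plotting train_losses and valid_losses against the epoch number gives the kind of loss curves discussed above.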

At this point, we can easily calculate the predictions with our model:
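A sketch of this step is given below, with an untrained stand-in in place of the trained CNN and hypothetical column names value and predictions. The point it illustrates is the alignment: the first n_prev rows have no prediction because there is no full window of previous values for them.

```python
import numpy as np
import pandas as pd
import torch
from torch import nn

n_prev = 10
df = pd.DataFrame({"value": np.sin(np.linspace(0, 6, 50))})

# Untrained stand-in; in practice this would be the trained CNN.
model = nn.Sequential(nn.Flatten(), nn.Linear(n_prev, 1))

values = torch.tensor(df["value"].to_numpy(dtype="float32", copy=True))
# Window i covers rows i .. i + n_prev - 1 and predicts row i + n_prev.
# The last window is dropped because no row follows it.
windows = values.unfold(0, n_prev, 1)[:-1].unsqueeze(1)  # (len(df) - n_prev, 1, n_prev)

model.eval()  # switch layers such as Batch Norm to inference mode
with torch.no_grad():  # gradients are not needed for predictions
    preds = model(windows).squeeze(1).numpy()

df["predictions"] = np.nan
df.loc[df.index[n_prev:], "predictions"] = preds
```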

And take a look:

CNN’s predictions on training data

CNN’s predictions on validation data

Just to be clear, we don’t really want our models to perfectly predict values, as such perfection would destroy the whole idea of the anomaly detection process (described in “Saying what we want from our models out loud” in Part I). That’s why when we look at plots, we want to see our model catch the main trend, not the particular values.

Long Short-Term Memory Neural Network

LSTM neural networks have an internal memory state that allows them to remember their previous evaluations, which makes this architecture a strong candidate for time series anomaly detection.

LSTM architecture

Unlike the previous two models, this neural network tries to reconstruct the current value using the value itself. It may seem trivial, but this approach has extremely good results in anomaly detection.

LSTM Implementation

We can simplify the wrapping of data into the Dataset from the CNN section because for reconstruction the output is the very same as the input. That is why we get rid of the first part of each item and just keep 1 value. We should also consider that LSTM models usually can’t be fed the whole sequence at once (because of the memory consumption for the whole sequence), so we have to split the data into partial sequences for training. Moreover, training on the whole sequence may reduce the model’s ability to generalize: the model may simply memorize the data.
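One possible way to split a series into partial sequences is sketched below. The class name SequenceDataset, the non-overlapping chunks, and dropping the incomplete tail are assumptions made for illustration.

```python
import torch
from torch.utils.data import Dataset


class SequenceDataset(Dataset):
    """Splits one long series into consecutive chunks for reconstruction."""

    def __init__(self, values, seq_len):
        values = torch.as_tensor(values, dtype=torch.float32)
        n_chunks = len(values) // seq_len  # the incomplete tail is dropped
        self.chunks = values[: n_chunks * seq_len].reshape(n_chunks, seq_len, 1)

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        chunk = self.chunks[idx]
        # Input and target are the same because the model reconstructs its input.
        return chunk, chunk


dataset = SequenceDataset(list(range(10)), seq_len=4)
x, y = dataset[1]  # second chunk: values 4, 5, 6, 7 with shape (4, 1)
```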

The definition of the LSTM neural net is much easier than the CNN one. PyTorch already has an implemented class of LSTM cells that we can use.
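A minimal definition might look like the sketch below. It uses PyTorch’s nn.LSTM layer; the class name LSTMModel, the hidden size, and the linear output layer are assumptions rather than the original architecture.

```python
import torch
from torch import nn


class LSTMModel(nn.Module):
    """LSTM followed by a linear layer that maps each hidden state to one value."""

    def __init__(self, hidden_size=32, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1, hidden_size=hidden_size,
            num_layers=num_layers, batch_first=True,
        )
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x, state=None):
        # x has shape (batch, seq_len, 1); state is the optional (h, c) pair.
        out, state = self.lstm(x, state)
        return self.head(out), state


model = LSTMModel()
out, (h, c) = model(torch.randn(2, 5, 1))  # out has shape (2, 5, 1)
```

Returning the state lets the caller pass it into the next call when consecutive partial sequences should share memory.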

Because of the training with partial sequences, we can’t directly send the Dataset instance to get the results, but we certainly can extract the entire sequences from DataFrames:
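The extraction step could be sketched as follows, with untrained stand-in layers in place of the trained model and hypothetical column names value and predictions. The whole column is passed as a single sequence, so every row gets a reconstructed value.

```python
import numpy as np
import pandas as pd
import torch
from torch import nn

df = pd.DataFrame({"value": np.sin(np.linspace(0, 6, 50))})

# Untrained stand-ins for the trained reconstruction model.
lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

# One batch holding the entire column: shape (1, len(df), 1).
sequence = torch.tensor(
    df["value"].to_numpy(dtype="float32", copy=True)
).reshape(1, -1, 1)

lstm.eval()
head.eval()
with torch.no_grad():  # inference only, no gradients needed
    hidden, _ = lstm(sequence)
    reconstruction = head(hidden)

# One reconstructed value per row, in the original order.
df["predictions"] = reconstruction.reshape(-1).numpy()
```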

LSTM’s predictions on training data

LSTM’s predictions on validation data

Another intermediate conclusion

Terrific, we have 3 trained models and their results on our dataset!

The toughest part is behind us and now we are prepared to make our final steps towards anomaly detection. In the last chapter, we will do some extra preparations and reveal the detection process along with its results.

If you want to refresh some theory or data preprocessing, don’t be shy and go to the first part:

Part I — Intro to Anomaly Detection and Data Preparation

Otherwise, the third part awaits (this item will become a link like the one in the header):

Part III — Eventually easier than it seemed

This article was written by Artur Khanin, a Technical Project Lead at Akvelon’s Kazan office, and was originally published on Medium.

This project was designed and implemented with love by the following members of Akvelon’s team:

Team Lead — Artur Khanin
Delivery Manager — Sergei Volynkin
Technical Account Manager — Max Kostin
ML Engineers — Irina Nikolaeva, Rustem Saitgareev
