Hello there, my name is Artur.
You might be reading this intro for the third time — and if this is the case, I appreciate your sticking with this article series.
I am the head of the Machine Learning team at Akvelon's Kazan office, and you are about to read the last part of our tutorial on anomaly detection in time series.
During our research, we managed to gather a lot of information from tiny useful pieces all over the internet, and we don't want this knowledge to be lost, so we are sharing it with you!
In the end, easier than it seemed
We already dove into the theory and data preparation in Part I and defined and trained three models in Part II:
- Part I — Intro to Anomaly Detection and Data Preparation
- Part II — Implementation of ARIMA, CNN, and LSTM
We reuse code from those parts, so if something seems unclear, consider revisiting them.
Fantastic, let’s complete this series!
Just a brief reminder of the tools that we use:
- Jupyter Notebooks environment for the implementation of the models
- Scikit-learn for some data preprocessing
- statsmodels library for the ARIMA model
- PyTorch for neural networks
- Plotly for plots and graphs
And a reminder of the models and what each one was trained to do:
- ARIMA statistical model — predicts next value
- Convolutional Neural Network — predicts next value
- Long Short-Term Memory Neural Network — reconstructs current value
Anomaly Detection with Static and Dynamic Threshold
Amazing, we trained all three models! But every line of code before this was just preparation for the anomaly detection.
After just a small amount of additional preparation, we will finally be able to detect anomalies.
What exactly lies behind this “additional preparation”? These two things (remember “Saying what we want from our models out loud” from Part I?):
- Calculation of the errors for each item in datasets
- Threshold calculation based on errors
And then we will be able to detect anomalies extremely fast, literally by just comparing the errors with the threshold.
What are we waiting for? We’re ready for this!
The error calculation code differs for each model due to the different Dataset implementations, but the algorithm stays the same.
ARIMA’s error calculation
Just as a reminder: we use the absolute error for ARIMA because it gives better results. And if you are wondering how we came to this, the answer is simple: we just tried it, and it worked.
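The original snippet is not reproduced here, so below is a minimal sketch of what this calculation could look like, assuming `predictions` holds ARIMA’s one-step-ahead forecasts and `actuals` the corresponding true values from Part II:

```python
import numpy as np

# Absolute error between ARIMA's one-step-ahead forecasts and the true values.
# `predictions` and `actuals` are assumed to be aligned 1-D sequences.
def arima_errors(predictions, actuals):
    return np.abs(np.asarray(actuals) - np.asarray(predictions))
```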
CNN’s error calculation
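The approach is the same, only the predictions now come batch by batch from the network. Here is a sketch, assuming `model` is the CNN from Part II and `loader` is a PyTorch DataLoader yielding (window, next value) pairs in chronological order; the same loop also serves the LSTM, whose targets are the values it reconstructs:

```python
import numpy as np
import torch

# Collect per-item absolute errors for a network that predicts (or
# reconstructs) a value from a window of past observations.
def nn_errors(model, loader, device="cpu"):
    model.eval()
    errors = []
    with torch.no_grad():
        for windows, targets in loader:
            preds = model(windows.to(device)).reshape(-1)
            diff = targets.to(device).reshape(-1) - preds
            errors.append(diff.abs().cpu().numpy())
    return np.concatenate(errors)
```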
Threshold calculation — common for all three models
Static threshold
This threshold is calculated with the formula of the three-sigma rule: the mean of the training errors plus three of their standard deviations.
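In code, that is a one-liner over the training errors (a sketch; `train_errors` is whatever the error step above produced):

```python
import numpy as np

# Three-sigma rule: everything farther than three standard deviations
# from the mean error will be considered anomalous.
def static_threshold(train_errors):
    train_errors = np.asarray(train_errors)
    return train_errors.mean() + 3 * train_errors.std()
```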
Dynamic threshold
For the dynamic threshold, we will need two more parameters: window, inside which we will calculate the threshold, and std_coef, which we will use instead of the 3 from the static threshold formula.
- For ARIMA: window=40 and std_coef=5
- For CNN and LSTM: window=40 and std_coef=6
These two parameters are empirically chosen for each model using only the training data.
You may wonder — “Why does he always emphasize the usage of only training data? Why can’t I also use validation to choose better parameters?”.
The reason we use only training data to choose the parameters of our models is that this is the only way to be sure our models will work on real-world data outside the training dataset. The validation part of the dataset imitates such real-world data and gives a better picture of the models’ capabilities precisely because we know it wasn’t used to train or tune them.
Let’s get down to business! Here is the code to calculate the dynamic threshold:
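The original listing is not reproduced here, so this is a sketch of the idea: a rolling mean plus std_coef standard deviations inside each window (whether the original code used a trailing or a centered window is an assumption; a trailing one is shown):

```python
import numpy as np

# Rolling threshold: for every point, mean + std_coef * std of the errors
# inside the trailing window ending at that point.
def dynamic_threshold(errors, window=40, std_coef=5):
    errors = np.asarray(errors, dtype=float)
    threshold = np.empty_like(errors)
    for i in range(len(errors)):
        chunk = errors[max(0, i - window + 1):i + 1]
        threshold[i] = chunk.mean() + std_coef * chunk.std()
    return threshold
```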
And the last piece of our puzzle is metrics calculation. What kind of metrics? I am glad you asked. We calculate all the basic metrics to fully analyze the models’ performance:
- Confusion matrix to see how a model performs in detail
- Precision to see what share of the detected anomalies are real ones
- Recall to see what share of the true anomalies the model detects
- F2-score to combine precision and recall; we use F2 instead of F1 because detecting true anomalies is more important than avoiding false alarms (recall matters more than precision)
Excellent! We can move on to the code that does the actual anomaly detection.
ARIMA with static threshold
For each model, we are going to filter the errors with the given threshold and return the indexes of the values that exceed it. These values are what we will consider detected anomalies!
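In code this filtering is a single comparison (a sketch; it works with both a scalar static threshold and a per-point dynamic one):

```python
import numpy as np

# Indexes of the errors that exceed the threshold are our detected anomalies.
def detect_anomalies(errors, threshold):
    return np.where(np.asarray(errors) > threshold)[0]
```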
And of course, we are going to visualize everything that we detected (still using the same code from Part I)!
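For completeness, a minimal Plotly sketch in the spirit of the Part I code (the real plots also show the true anomalies; `values` and `anomalies` are hypothetical names here):

```python
import plotly.graph_objects as go

# Plot the series and mark the detected anomalies on top of it.
def plot_anomalies(values, anomalies):
    fig = go.Figure()
    fig.add_trace(go.Scatter(y=list(values), mode="lines", name="data"))
    fig.add_trace(go.Scatter(x=list(anomalies),
                             y=[values[i] for i in anomalies],
                             mode="markers", name="detected anomalies"))
    fig.show()
```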
We will leave the metrics discussion until the results part, but here are the code and the printed confusion matrices:
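A scikit-learn-based sketch of such a metrics block (the original listing may differ; `y_true` and `y_pred` are assumed to be binary masks marking true and detected anomaly positions):

```python
from sklearn.metrics import (confusion_matrix, fbeta_score,
                             precision_score, recall_score)

# Print all the metrics listed above for one model/threshold combination.
def print_metrics(y_true, y_pred):
    print(confusion_matrix(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))
    print("F2-score: ", fbeta_score(y_true, y_pred, beta=2))
```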
Yeah, this doesn’t look so good (because of the many falsely detected anomalies), but the model still catches every true anomaly.
ARIMA with dynamic threshold
Let’s do the same for the dynamic threshold and see if it can change the situation.
The code for metrics is the same, so we can skip it and take a look at the confusion matrices.
Well, these look much better (no more huge amounts of incorrectly detected anomalies)! A tough baseline for our neural nets!
NNs’ anomaly detection
For both neural nets, we will provide a unified generic function for anomaly detection.
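A sketch of what this generic function could look like, composed from the helpers above (the original signature is an assumption):

```python
# Run a trained net over a loader, compute a threshold from its errors,
# and return the detected anomaly indexes. `threshold_fn` maps the error
# array to a threshold; for the static case one could instead precompute
# the scalar on training errors and pass `lambda e: that_value`.
def nn_detect(model, loader, threshold_fn, device="cpu"):
    errors = nn_errors(model, loader, device)
    return detect_anomalies(errors, threshold_fn(errors))
```

For example, `nn_detect(cnn_model, val_loader, lambda e: dynamic_threshold(e, window=40, std_coef=6))` would reproduce the CNN-with-dynamic-threshold setup described below (the model and loader names are hypothetical).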
And that’s it! We can effortlessly process the results of neural nets.
CNN with static threshold
It seems that our CNN model overfitted: it produces an enormous number of incorrect anomalies. But there is no need to make hasty decisions; it is better to look at the results with the dynamic threshold.
CNN with dynamic threshold
And let’s do the same with the dynamic threshold:
The metrics calculation is still the same.
These results are better than ARIMA’s. We can already say that we didn’t waste our time on this!
And the last model (but certainly not the least) is LSTM.
LSTM with static threshold
Once again, metrics calculations are identical to CNN’s.
Here we have the same situation as with CNN, but now we know that the dynamic threshold will reveal the truth!
LSTM with dynamic threshold
And the dynamic evaluation certainly made a near-perfect detector out of our LSTM model.
Real-time evaluation with static/dynamic threshold
If it is hard to figure out from the code how to apply these models to real-life data (and that is completely normal), here are some visualizations of the real-time evaluation:
The top chart shows the original data with true anomalies and detected anomalies. On the bottom chart, we can see the error of a model with the purple static threshold line.
And here is the visualization of the same process with the dynamic threshold.
As you can see, the dynamic threshold adapts to the dispersion of the error: it stays low where the error deviates only slightly and rises where the error fluctuates strongly.
Results of the models
Finally, we can compare the metrics to make sure that we correctly put the LSTM in first place. We use the F2-score to decide which model is best; precision and recall are shown separately to reveal the weak and strong sides of our models.
However, ARIMA performs slightly better with the static threshold, while the neural networks outperform it with the dynamic threshold, especially the LSTM.
Ultimate Conclusion
Lastly, I would like to emphasize that these models can already be put into production without much additional effort.
Nevertheless, these models are far from their limits and can be enhanced via:
- Increasing the amount of training data
- Adding other metrics such as memory and network usage
- Combination of LSTM and CNN architectures
- Feature Engineering
Thank you very much for your attention, I hope that this tutorial gave you some understanding and hints on implementation.
And don’t stop looking for anomalies!
This article was written by Artur Khanin, a Technical Project Lead at Akvelon’s Kazan office, and was originally published on BecomingHuman.AI.
This project was designed and implemented with love by the following members of Akvelon’s team:
Team Lead — Artur Khanin
Delivery Manager — Sergei Volynkin
Technical Account Manager — Max Kostin
ML Engineers — Irina Nikolaeva, Rustem Saitgareev