Hello, fellow reader! My name is Artur, I am the head of the Machine Learning team at Akvelon, and you are about to read a tutorial on anomaly detection in time series.
Intro to Anomaly Detection and Data Preparation
During our research, we managed to gather a lot of information from tiny useful pieces scattered all over the internet, and we don't want this knowledge to be lost! That's exactly why you can relax and dive into this end-to-end series of articles.
Two more parts are coming very soon; stay in touch and don't miss them! These items will become links to the next chapters:
- Part II — Implementation of ARIMA, CNN, and LSTM
- Part III — Eventually easier than it seemed
Let’s get this started!
First of all, let's define what an anomaly detection problem is in general:
Anomaly detection is the identification of rare items, events, or patterns that significantly differ from the majority of the data.
Well, basically, an anomaly is something that makes little or no sense when you look at it from a bird's-eye view.
This brings us to the fact that anomalies are extremely context-dependent, and different people may consider different pieces of data anomalous.
Why do we bother finding anomalies?
Imagine the situation: you are the co-founder and CTO of a small startup. Your company has one web application, and every single client really matters for success. One day a clever client finds a backdoor in your app and starts sending large queries directly to your database. The CPU usage changes because of these queries, but it stays within normal boundaries.
The data may leak for a very long period of time, and when somebody else finds this backdoor and makes it public, it may cause enormous damage. This situation is not about code quality; there is always a risk that mistakes and backdoors will appear in your app and codebase. It is about drawing your attention (as CTO and one of the main decision-makers) to the issue in time to reveal it and save your business.
Generally speaking, if you can notice any deviation of your system from its normal behavior, you can find out the reasons for it, uncover and eliminate hidden issues, and discover new non-obvious opportunities.
However, it is quite expensive to hire someone to monitor all your metrics 24/7. That is why we want to detect unexpected deviations (anomalies) automatically, inexpensively (including in terms of potential damage), and quickly.
Alright, now we know why we want to solve this problem. But before moving to the how part, we need to distinguish between generic anomalies and anomalies in time series. This will help us to understand what types of techniques are more appropriate for our problem.
Generic anomaly detection
Here is a generic example that illustrates the cluster-based approach. In short, cluster-based approaches try to group similar data points into clusters and consider values that don't fit into these clusters anomalies.
Although this picture illustrates one specific approach, the idea is the same across all techniques: we have some values in no particular order and try to figure out which of them seem uncommon.
Time series anomaly detection
On the other hand, when we talk about anomaly detection for time series, the value itself may not seem suspicious, but it becomes suspicious due to the time at which it appears and the values that came before it:
For example, it is okay when the CPU load of some application is about 20%, but it seems strange when the load unexpectedly jumps to 80%. Another important point is that a load level of around 80% right after this jump is not suspicious anymore. That is why simple threshold-based alerts can't handle such complex situations.
The moment of the jump is way more important than the stabilization after it.
When you're dealing with anomalies in general, you can shuffle the data into any order and the anomalies will still be the same. If a person bought 1000 packs of toilet paper, you will probably always say that this seems anomalous.
When you're dealing with anomalies in time series, the order of the data is important. If nobody ever bought a single pack of toilet paper in a particular store, and then a person bought just one pack, you should already consider it an anomaly.
Cool, now we are ready to move on to the how part!
How to find anomalies
The general idea
One of the most common and most successful approaches so far can be described as:
Find out what is normal and if something deviates too much from it — this is an anomaly.
And almost all of the approaches listed below try to achieve this one way or another.
Possible Approaches
Here are the most common approaches to use for anomaly detection:
- Statistical Methods
- Deviations from association rules and frequent itemsets
- One-class Support Vector Machine
- Clustering-based techniques (k-means)
- Density-based techniques (k-nearest neighbor, local outlier factor)
- Autoencoders and replicators (Neural Networks)
These are just some of the more popular techniques among the many possible ones.
How we find anomalies
Of course, every technique listed above can be used for our purposes. But nowadays, neural networks are trending and outperform classical algorithms in many areas. After our research, we decided to try two types of neural networks and took one classic model as a baseline for comparison. Here is why:
- ARIMA statistical model as a baseline — this is the classic autoregressive model designed specifically for time series
- Convolutional Neural Network — such networks are usually used for image processing, but if you dig deeper, you will find that they essentially look for patterns in images. Our time series also consists of patterns
- Long Short-Term Memory Neural Network — this type was designed especially for time-related data
And for development, we chose this set of tools:
- Jupyter Notebooks environment for the implementation of the models.
We prefer using Jupyter Notebooks instead of regular editors (such as JetBrains' PyCharm) since notebooks split code into separate cells, which can be run in any order you need. Most editors use .py files and execute the whole bunch of code every time you run it, which makes the code modification process painful (training a model sometimes takes hours) because you have to wait just to see the output of a simple print at the end of your code. Thus, notebook cells provide flexibility for machine learning development and make it easier to try things out.
- Scikit-learn for some data preprocessing.
Scikit-learn has many tools that are already implemented. There are no big secrets behind the data preprocessing steps, and we don't really want to waste our time implementing them from scratch, optimizing them for large amounts of data, etc. This is where Scikit-learn comes in handy.
- Statsmodels library for the ARIMA model.
This library is used with the same motivation as Scikit-learn: there are no big secrets behind ARIMA, the tool is already implemented, and there is no need to waste time.
- PyTorch for the neural networks.
It is one of the most used libraries for constructing neural nets. The alternative is, of course, TensorFlow. They both have pros and cons, and we chose PyTorch because it is designed for eager execution, which is far more flexible. TensorFlow, on the other hand, is designed around building the whole computation graph with the training/validation phases, metrics calculation, etc., meaning that a lot of work has to be done up front. For example, you may only discover a syntax error at the very beginning of a model definition after you have finished coding the whole model.
- Plotly for plots and graphs.
When we started, we decided to use Matplotlib, which is used almost everywhere for plots and graphs. However, we noticed a new player, Plotly, and it turned out that it produces prettier and more informative graphs with less code.
Before diving into model architecture and implementation, we should load the data and inspect it.
Dataset Loading
First of all, we should choose a dataset for our experiments. The internet is full of datasets; for example, you can use Google's Dataset Search to find an appropriate one. We already have a perfect repository for time series anomaly detection — the Numenta Anomaly Benchmark (NAB):
- NAB contains many files with different metrics from different places. It is in the nature of metrics to be ordered in time, which makes them some of the best candidates for time series anomaly detection.
- From NAB we decided to use the real CPU utilization metrics from AWS CloudWatch for the Amazon Relational Database Service. These metrics are saved in csv format, and the exact timestamps of the anomalies in them (let's call them labels) are saved in json format. You may use any other metric that you like.
- Just to clarify — the anomalies that we see in the dataset were marked as anomalies by its creators.
Here are direct links to the files in NAB that we use; you can download them, take a look, and play around:
The dataset is chosen, great! Certainly, you can pick any other dataset, but we will focus on NAB. The next steps are:
- Go through the README file from the NAB repository to understand what we are working with
- Load the data into our local filesystem
- Inspect the data of interest visually with anything that can open csv and json files. Actually, we can open them directly in the GitHub repo
- Load the data into our notebook's environment to make further manipulations inside the notebook using Python. We use the Pandas and NumPy libraries — a pretty standard choice for data loading and manipulation, with many tools already implemented
Assuming that we have already gone through the README, we can move on to loading the data into our local filesystem.
This dataset is stored in a git repository, but that doesn't mean every dataset should be version controlled. We just need the csv file with the data and the json file with the anomaly labels, and they can simply live on the local hard drive. Git here is just for convenience.
Let's grab the data into our local environment so we can manipulate it any way we want:
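The original download cell isn't reproduced here, so below is a minimal sketch of how this step might look. The NAB paths for the two RDS CPU utilization files and the combined labels file are assumptions — swap in whichever metric you chose:

```python
import urllib.request

# Assumed NAB file paths -- adjust them if you chose a different metric.
BASE = "https://raw.githubusercontent.com/numenta/NAB/master"
FILES = {
    "train.csv": f"{BASE}/data/realAWSCloudwatch/rds_cpu_utilization_e47b3b28.csv",
    "val.csv": f"{BASE}/data/realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv",
    "labels.json": f"{BASE}/labels/combined_labels.json",
}

for local_name, url in FILES.items():
    urllib.request.urlretrieve(url, local_name)  # save next to the notebook
    print("downloaded", local_name)
```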
After some initial inspection in NAB’s GitHub repo, we see that both training and validation files consist of timestamps with corresponding values.
Each file contains 4032 ordered data rows with a 5-minute rate.
Let's find these files in the labels (once more, the json file with the exact timestamps of the anomalies).
We can see that each file contains 2 anomalies.
Since we know what we are dealing with, we can start loading the data into the notebook, beginning with imports of some useful packages:
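As a rough sketch (assuming the local file names from the download step above), the imports and the csv loading could look like this:

```python
import json

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler

# Load both metric files downloaded earlier (local names are an assumption).
train_df = pd.read_csv("train.csv")
val_df = pd.read_csv("val.csv")

train_df.head()  # shows the first 5 rows: 'timestamp' and 'value' columns
```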
Finally, we can load the timestamps of the anomalies from the json file into our notebook environment:
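A minimal sketch, assuming the labels file maps each data file's relative path inside NAB to a list of anomaly timestamps:

```python
import json

# Assumed structure: {"realAWSCloudwatch/rds_cpu_utilization_e47b3b28.csv": ["<timestamp>", "<timestamp>"], ...}
with open("labels.json") as f:
    labels = json.load(f)

train_anomaly_timestamps = labels["realAWSCloudwatch/rds_cpu_utilization_e47b3b28.csv"]
val_anomaly_timestamps = labels["realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv"]
print(train_anomaly_timestamps)
print(val_anomaly_timestamps)
```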
As you can see, we've managed to load the data and take a look at the first 5 rows of the training file. That is enough to see that we succeeded — these are exactly the first 5 rows that we saw on GitHub.
Data Preprocessing
Great, we’ve loaded data into our notebook and inspected it a bit, but are we good to go with it and move on to the models part? Unfortunately, the answer is no.
We have to define exactly what part of the loaded data will be used by our models. We have to think about the problem we want to solve and what data can be used for it.
In our case, we want to detect anomalies in CPU usage. Well, here is the answer: the obvious choice is the value column, because it represents CPU usage.
We can also consider the timestamp column — a timestamp encodes a lot of information, e.g. what day of the week it is, what month of the year it is, etc. This information can be extracted and used by the models, for example as sketched below. We won't do it in this article, but if you want, you can try it. Maybe you'll achieve even better results!
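For illustration only, here is a hypothetical sketch of such feature extraction with Pandas (these extra columns are not used anywhere else in this article):

```python
# Hypothetical calendar features derived from the timestamp column (not used later in this article).
ts = pd.to_datetime(train_df["timestamp"])
calendar_features = pd.DataFrame({
    "day_of_week": ts.dt.dayofweek,  # 0 = Monday
    "hour_of_day": ts.dt.hour,
    "month": ts.dt.month,
})
```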
So, we are going to use the value column from the training and validation DataFrames. The next step is to transform the values from this column into an appropriate format. The appropriate format, as you might guess, is numbers: computers use numbers for calculations (and ML models as well), so you have to turn everything into numbers; it's just the way it works.
Luckily, the value column already consists of numbers, and we could use them in our models as they are. But it is very often a good idea to standardize the numbers before feeding the data to our models. It helps them generalize better and avoid problems with different scales of values with different meanings (yes, somebody tried it in ML and it worked). Quite often, standardization simply means rescaling the numbers to mean = 0 and standard deviation = 1.
That’s why the next thing we have to do is to parse datetime from timestamps (just for convenient visualization) and standardize values.
We will follow the standard rescaling policy (as we said, mean = 0 and standard deviation = 1) and use StandardScaler from Scikit-learn:
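Here is a sketch of this step; fitting the scaler on the training values only and reusing it for the validation set is one reasonable policy (the original notebook may do it slightly differently):

```python
# Parse the timestamp strings into proper datetimes (handy for plotting).
train_df["timestamp"] = pd.to_datetime(train_df["timestamp"])
val_df["timestamp"] = pd.to_datetime(val_df["timestamp"])

# Rescale the values to mean = 0 and standard deviation = 1.
# The scaler is fit on the training values only and then reused for validation.
scaler = StandardScaler()
train_df["value"] = scaler.fit_transform(train_df[["value"]]).ravel()
val_df["value"] = scaler.transform(val_df[["value"]]).ravel()
```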
Then we can extract the anomalies from the DataFrames into dedicated variables for both the training and validation data:
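A possible sketch, reusing the anomaly timestamps loaded from the labels file earlier:

```python
# Rows whose timestamp is listed as anomalous in the labels file.
train_anomalies = train_df[train_df["timestamp"].isin(pd.to_datetime(train_anomaly_timestamps))]
val_anomalies = val_df[val_df["timestamp"].isin(pd.to_datetime(val_anomaly_timestamps))]
```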
And plot all the data with the help of the Plotly library to visualize the whole set of data points and gain a better understanding of what it represents.
Firstly, we plot the training data.
(We are going to use this code for plotting everything we need)
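The original plotting cell isn't shown here; a minimal sketch with plotly.graph_objects might look like this (the helper name plot_series is our own):

```python
def plot_series(df, anomalies, title):
    """Plot the metric as a line and overlay the labeled anomalies as markers."""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df["timestamp"], y=df["value"],
                             mode="lines", name="CPU utilization"))
    fig.add_trace(go.Scatter(x=anomalies["timestamp"], y=anomalies["value"],
                             mode="markers", name="anomaly",
                             marker=dict(color="green", size=10)))
    fig.update_layout(title=title)
    fig.show()

plot_series(train_df, train_anomalies, "Training data")
```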
Secondly, we plot the validation data.
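With the same helper, the validation plot is a one-liner:

```python
plot_series(val_df, val_anomalies, "Validation data")
```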
I suppose this annoying blue dot between 60 and 70 needs an explanation — why isn't it marked as an anomaly? The thing is that this dot comes right after the green anomaly dot between 70 and 80, and after ~77% CPU load, ~67% doesn't seem suspicious at all. You should also keep in mind that we can't look ahead, because real data arrives in real time, so what looks like an anomaly at some moment may not look like one in the full picture.
You may notice that at this point you feel much more comfortable with the data. That usually happens thanks to visual inspection, so it is strongly advised to visualize the data and examine it with your own eyes.
Saying what we want from our models out loud
Now we know what our data looks like and what kind of data we have. We don't have much information, actually: just timestamps and CPU loads. But it is still enough to build quite good models.
So, here comes the best moment to ask the question: "How are our models going to detect anomalies?"
To answer it, we need to figure out three things (keeping in mind the general idea of anomaly detection and the fact that our data is a time series):
- What is normal?
This question leaves room for creativity, because there is no strict definition of normality.
Here is what we came up with (like many other people). We will use two almost identical tasks to teach our models "normality":
1. Given the value of CPU usage, try to reconstruct it. This task will be given to the LSTM model.
2. Given the values of CPU usage, try to predict the next value. This task will be given to the ARIMA and CNN models.
If the model reconstructs or predicts a value easily (meaning, with little deviation), then the data is normal.
You can change this assignment of tasks to models. We tried different combinations, and this one worked best for our dataset.
- How to measure deviation?
The most common way to measure deviation is some kind of error metric.
We will use the squared error between the predicted value (x*) and the true one (x): squared error = (x* − x)².
For ARIMA we will use the absolute error = |x* − x|, since it performed better. The absolute error may also be used in the other models instead of the squared error.
- And what is "too much" deviation?
Or, in other words, how do we pick the threshold? You can simply take some number or figure out a rule to calculate it.
We are going to use the three-sigma statistical rule, measuring the mean and the standard deviation of the errors on all the training data (and only the training data, because we use the validation data only to see how our models perform). Then we calculate the threshold above which a deviation is "too much".
And, looking ahead, we will also use a slightly modified version that measures these statistics over a window behind the current position, which dramatically improves accuracy (see the sketch below).
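To make the three-sigma idea concrete, here is a small sketch with hypothetical training errors; the windowed variant mentioned above simply recomputes the same statistics over the last N errors instead of over the whole training set:

```python
def three_sigma_threshold(errors):
    """Anything above mean + 3 * std of the training errors counts as 'too much' deviation."""
    return np.mean(errors) + 3 * np.std(errors)

# Hypothetical squared errors produced by a model on the training data.
training_errors = np.array([0.010, 0.020, 0.015, 0.030, 0.012])
threshold = three_sigma_threshold(training_errors)

# A new error is flagged as an anomaly if it exceeds the threshold.
print(threshold, 0.5 > threshold)
```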
Don't worry if it seems too complicated right now; it will become much clearer when you see it in the code.
Intermediate conclusion
Great, most of the theory part is done, and we have loaded, inspected, and standardized our dataset!
As practice shows, data preparation is one of the most important parts of the whole process, and we will prove this in the following parts, because the code that we implemented here is a strong foundation that we will use in all our models.
And very soon you will be able to move forward (these items will become links like the ones from the header):
- Part II — Implementation of ARIMA, CNN, and LSTM
- Part III — Eventually easier than it seemed
Artur Khanin is a Technical Project Lead at Akvelon’s Kazan office.
This project was designed and implemented with love by the following members of Akvelon’s team:
Team Lead — Artur Khanin
Delivery Manager — Sergei Volynkin
Technical Account Manager — Max Kostin
ML Engineers — Irina Nikolaeva, Rustem Saitgareev
Special thanks to Aleksey Veretennikov as well.