Hello fellow reader, my name is Artur. I am the head of the Machine Learning team at Akvelon, and you are about to read a tutorial on anomaly detection in time series.
Intro to Anomaly Detection and Data Preparation
During our research, we’ve managed to gather a lot of information from tiny useful pieces all over the internet and we don’t want this knowledge to be lost! That’s exactly why you can exhale and dive into these end-to-end working articles.
2 more parts are coming very soon, stay in touch and don’t miss them! These items will become links to the next chapters:
- Part II — Implementation of ARIMA, CNN, and LSTM
- Part III — Eventually easier than it seemed
Let’s get this started!
First of all, let’s define what an anomaly detection problem is in general.
Anomaly detection is the identification of rare items, events, or patterns that significantly differ from the majority of the data.
Basically, an anomaly is something that makes little or no sense when you look at the bigger picture.
And it brings us to the fact that anomalies are extremely context-dependent. And also, different people may consider different pieces of data as anomalies.
Why do we bother finding anomalies?
Imagine the situation: you are the co-founder and CTO of a small startup. Your company has 1 web application and every single client really matters for success. One day a smart client finds a backdoor in your app and starts sending large queries directly to your database. The CPU usage changes because of these queries, but it stays within normal boundaries.
The data may leak for a very long time, and when somebody else finds this backdoor and makes it public, it may cause enormous damage. This situation is not about code quality; there is always a risk that mistakes and backdoors appear in your app and codebase. It is about drawing your attention (as CTO and one of the main decision-makers) to the issue in time to save your business.
Generally speaking, if you can notice any variations of your system from its normal behavior, you can find out the reasons for such behavior, uncover and eliminate hidden issues and find new non-obvious opportunities.
However, it is quite expensive to hire someone for monitoring all your metrics 24/7. That is why we want to detect unexpected variations (anomalies) automatically, inexpensively (also in terms of damage), and quickly.
Alright, now we know why we want to solve this problem. But before moving to the how part, we need to distinguish between generic anomalies and anomalies in time series. This will help us to understand what types of techniques are more appropriate for our problem.
Generic anomaly detection
Here is a generic example that illustrates the cluster-based approach. In short, cluster-based approaches try to group similar data into clusters and treat values that don’t fit any cluster as anomalies.
Although this picture illustrates some specific approach, the idea is the same across any technique used. We have some values with no particular order and we try to figure out which values seem uncommon.
Time series anomaly detection
On the other hand, when we talk about anomaly detection for time series, the value itself may not seem suspicious, but it becomes suspicious due to the time when it appears and the values before:
For example, it is okay when the CPU load of some application is about 20%, but it seems strange when the load unexpectedly jumps to 80%. One more important thing is that a load level around 80% after this jump is not suspicious anymore. That is why fixed warning thresholds can’t handle such situations.
The moment of the jump is way more important than the stabilization after it.
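To make the CPU example concrete, here is a minimal sketch that compares a fixed warning threshold with a check on the change between neighboring points. The function names, the sample values, and both limits are made up for illustration; this is not one of the models from this series.

```python
def threshold_alerts(values, limit):
    """Indices where the value itself exceeds a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def jump_alerts(values, max_step):
    """Indices where the value changed by more than max_step since the previous point."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > max_step
    ]


# Hypothetical CPU load in percent: stable around 20, then a jump to around 80.
cpu = [20, 21, 19, 20, 80, 81, 79, 80]

print(threshold_alerts(cpu, limit=50))  # [4, 5, 6, 7] - keeps firing on the whole plateau
print(threshold_alerts(cpu, limit=90))  # [] - misses the event entirely
print(jump_alerts(cpu, max_step=30))    # [4] - only the moment of the jump
```

The fixed limit either alerts on every point after the jump or, if it is set too high, on none of them. The step check reports only the moment when the behavior changed.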
When you’re dealing with the anomalies in general, you can shuffle them into any order and the anomalies will still be the same. If a person bought 1000 packs of toilet paper, you probably will always say that this seems anomalous.
When you’re dealing with the anomalies in time series, the order of data is important. If nobody ever bought a single pack of toilet paper in a particular store, and then a person bought just 1 pack — you already should consider it as an anomaly.
Cool, now we are ready to move on to the how part!
How to find anomalies
The general idea
One of the most common and most successful approaches so far can be described as:
Find out what is normal and if something deviates too much from it — this is an anomaly.
And almost all of the approaches listed below try to achieve this one way or another.
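As a rough illustration of this idea, the sketch below treats the last few points as "normal" and flags a new point when it lies too many standard deviations away from them. The window size, the limit of 3 standard deviations, and the sample data are our assumptions for the example, not tuned values.

```python
from statistics import mean, stdev


def rolling_deviation_anomalies(values, window=5, z_limit=3.0):
    """Flag index i when values[i] is more than z_limit standard deviations
    away from the mean of the previous `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]       # what we currently consider normal
        mu, sigma = mean(history), stdev(history)
        deviation = abs(values[i] - mu)
        # Comparing against z_limit * sigma avoids dividing by zero on flat history.
        if deviation > z_limit * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU load in percent with a level shift at index 6.
cpu = [20, 21, 19, 20, 22, 21, 80, 81, 79, 80, 81, 80]
anomalies = rolling_deviation_anomalies(cpu)
print(anomalies)  # [6]
```

Only the jump itself is reported. Once the window has filled with values around 80, that level becomes the new normal and stops being flagged.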
Here are the most common approaches to use for anomaly detection:
- Statistical Methods
- Deviations from association rules and frequent itemsets
- One-class Support Vector Machine
- Clustering-based techniques (k-means)
- Density-based techniques (k-nearest neighbor, local outlier factor)
- Autoencoders and replicators (Neural Networks)
These are just some of the popular techniques amongst the possible ones.
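To give a feel for one item from the list, here is a small sketch of a density-based score built on k-nearest neighbors: each point is scored by the distance to its k-th nearest neighbor, so points in sparse regions get high scores. The data, the value of k, and the function name are hypothetical, and a real project would more likely use a ready-made implementation than this plain-Python loop.

```python
import math


def knn_distance_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.
    A larger score means a sparser neighborhood, hence a more suspicious point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Five points in a tight group and one far away (made-up data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (5.0, 5.0)]
scores = knn_distance_scores(points, k=3)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous)  # 5
```

Note that the order of the points does not matter here, which is exactly the "generic" setting described earlier.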
How we find anomalies
Of course, every technique listed above can be used for our purposes. But nowadays neural networks are popular and outperform classical algorithms on many tasks. After our research, we decided to try 2 types of neural networks and took 1 classic model as a baseline for comparison. And here is why:
- ARIMA statistical model as a baseline — this is the classic auto-regression model that is made exactly for the time series
- Convolutional Neural Network — such neural networks are usually used for image processing, but if you dig deeper into them, you may find that they actually look for the patterns in the images. Our time series also consists of patterns
- Long Short-Term Memory Neural Network — this type was designed especially for time-related data
And for the development we chose this set of tools:
- Jupyter Notebooks environment for the implementation of the models.
We prefer Jupyter Notebooks to regular editors (such as JetBrains’ PyCharm) because Notebooks keep code in separate cells, which can be run in any order you need. Most editors work with .py files and execute the whole file every time you run it. This makes modifying code painful (training a model sometimes takes hours) because you have to wait to see the outcome of even a simple change.
- Scikit-learn for some data preprocessing.
Scikit-learn has many tools that are already implemented. There are no big secrets behind the data preprocessing, and we don’t really want to waste our time implementing them from scratch, optimizing them for large amounts of data, etc. This is where Scikit-learn comes in handy.
- Statsmodels library for the ARIMA model.
This library is used with the same motivation as Scikit-learn: there are no big secrets behind ARIMA, the tool is already implemented, and there is no need to waste time.
- PyTorch for neural networks.
It is one of the most used libraries for building neural nets. The alternative is, of course, TensorFlow. Both have pros and cons, and we chose PyTorch because it is designed for eager execution, which is far more flexible. TensorFlow, on the other hand, was originally designed around building the whole computation graph up front, including the training/validation phase, metrics calculation, etc., meaning that a lot has to be defined before anything runs. For example, you may only discover an error at the very beginning of a model definition after you have finished coding the whole model.
- Plotly for plots and graphs.
When we started, we planned to use Matplotlib, which is used almost everywhere for plots and graphs. However, we noticed a newer player, Plotly, and it turned out that it produces prettier and more informative graphs with less code.
Before diving into model architecture and implementation, we should load the data and inspect it.
First of all, we should choose the dataset for our experiments. The internet is full of different datasets; for example, you can use Dataset Search from Google to find an appropriate one. We already have the perfect repository for time series anomaly detection, the Numenta Anomaly Benchmark (NAB):
- NAB contains many files with different metrics from different places. Metrics are ordered in time by nature, which makes them one of the best candidates for time series anomaly detection.
- From NAB we decided to use real CPU utilization from AWS CloudWatch metrics for Amazon Relational Database Service. These metrics are saved in csv format, and the exact timestamps of the anomalies in them (let’s call them labels) are saved in json format. You may use any other metric that you like.
- Just to clarify — these anomalies that we see in the dataset were marked as anomalies by the creators.
Here are direct links to the files in NAB that we use, you can download them, take a look and play around:
The dataset is chosen, great! Certainly, you can pick any other dataset, but we will focus on NAB. Next steps that we should do:
- Go through the README file from NAB repository to understand what we are working with
- Load data into our local filesystem
- Inspect our data of interest visually with anything that can open json. Actually, we can open the files directly in the GitHub repo
- Load data into our notebook’s environment for further manipulation in Python. We use the Pandas and NumPy libraries, a pretty standard choice for data loading and manipulation that already implements most of the tools we need
Assuming that we already went through the README, we can move to the loading into our local filesystem.
This dataset is located in git, but that doesn’t mean every dataset should be version controlled. We just need a csv with data and a json with anomaly labels, and they can simply sit on the local hard drive. Git is here for convenience.
Let’s grab the data into our local environment to manipulate them any way we want:
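One possible way to do this from Python is sketched below. It assumes that git is installed and that NAB is available at its public GitHub address; the helper name fetch_nab and the default target folder are our own choices for illustration.

```python
import subprocess
from pathlib import Path

# Public location of the Numenta Anomaly Benchmark repository (assumed to be reachable).
NAB_URL = "https://github.com/numenta/NAB.git"


def fetch_nab(target_dir="NAB"):
    """Clone NAB into target_dir unless that folder already exists.
    Returns the path to the local copy."""
    target = Path(target_dir)
    if target.exists():
        # Nothing to download: reuse whatever is already on disk.
        return target
    # A shallow clone is enough because we only need the current files, not the history.
    subprocess.run(
        ["git", "clone", "--depth", "1", NAB_URL, str(target)],
        check=True,
    )
    return target
```

Calling fetch_nab() once from a notebook cell downloads the repository; later calls simply return the existing folder, so re-running the cell is cheap.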
After some initial inspection in NAB’s GitHub repo, we see that both training and validation files consist of timestamps with corresponding values.
Each file contains 4032 ordered data rows sampled every 5 minutes, which is 14 days of data.
Let’s find these files in the labels (once more, a json with the exact timestamps of anomalies).
We can see that each file contains 2 anomalies.
Since we know what we are dealing with, we can start loading into the notebook with imports of some useful packages:
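A minimal sketch of this step with Pandas might look like the following. To keep it self-contained, it reads a tiny inline sample instead of a real NAB file; the two column names follow the timestamp/value layout described above, while the sample rows and the helper name load_metric are made up.

```python
import io

import pandas as pd

# Tiny stand-in for a metric file: a timestamp column and a value column.
# The rows are invented; a real file would be passed as a path instead.
SAMPLE_CSV = """timestamp,value
2020-01-01 00:00:00,6.4
2020-01-01 00:05:00,5.8
2020-01-01 00:10:00,6.2
2020-01-01 00:15:00,5.9
"""


def load_metric(source):
    """Read a csv (path or file-like object) into a DataFrame ordered by time."""
    df = pd.read_csv(source, parse_dates=["timestamp"])
    return df.sort_values("timestamp").reset_index(drop=True)


df = load_metric(io.StringIO(SAMPLE_CSV))

# Sanity check: every step between neighboring rows should be 5 minutes.
steps = df["timestamp"].diff().dropna()
print(len(df), (steps == pd.Timedelta(minutes=5)).all())
```

Parsing the timestamps right away makes later steps, such as matching rows against anomaly labels, a simple comparison of datetime values.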
Finally, we can load the timestamps of the anomalies from json into our local environment:
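Here is a hedged sketch of that step using only the standard library. It assumes the labels file maps each data file name to a list of timestamp strings; the keys, the timestamps, and the helper name below are placeholders, so check the actual json before relying on this layout.

```python
import json
from datetime import datetime

# Stand-in for the labels file: file name -> list of anomaly timestamps.
# Both the keys and the timestamps are invented for this example.
SAMPLE_LABELS = """{
  "some_folder/metric_a.csv": ["2020-01-02 03:05:00", "2020-01-05 12:40:00"],
  "some_folder/metric_b.csv": []
}"""


def load_anomaly_timestamps(labels_text, file_key):
    """Return the anomaly timestamps of one data file as datetime objects."""
    labels = json.loads(labels_text)
    return [
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        for ts in labels[file_key]
    ]


anomaly_times = load_anomaly_timestamps(SAMPLE_LABELS, "some_folder/metric_a.csv")
print(anomaly_times)
```

With a real file, the text would come from reading the json on disk, and the key would be the name of the metric file we picked.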
Artur Khanin is a Technical Project Lead at Akvelon’s Kazan office.
This project was designed and implemented with love by the following members of Akvelon’s team:
Special thanks to Aleksey Veretennikov as well.