Anomaly detection is a dynamic technique used by antivirus software and by host- and network-based intrusion detection systems. With this method, the software observes certain activities (program/process behavior, network traffic parameters, user actions) and monitors them for unusual or suspicious events and trends.
In the context of detecting network anomalies, intrusions, and abuse, the events of interest are often not rare - they are simply unusual. For example, unexpected spikes in activity are usually conspicuous, yet such a spike can still slip past many traditional statistical anomaly detection methods.
Types of anomalies
Before studying detection methods, it helps to know what kinds of anomalies you typically have to deal with:
Point anomaly: an individual anomalous instance in a large dataset. In other words, when a data point takes a value far outside the range of all other values in the dataset, it can be considered a point anomaly. This is a rare event.
Collective anomaly: a subset of data points that together deviate from normal behavior. In a collective anomaly, the individual instances may not be anomalous on their own, but their collective occurrence is.
Contextual anomaly: an outlier is called contextual when its value does not correspond to what we would expect to observe for a similar data point in the same context. Contexts are usually temporal, and the same observation made at a different time may not be an anomaly at all.
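A point anomaly can be illustrated with a simple z-score check - a minimal sketch using NumPy, where the dataset and the 2.5 threshold are illustrative choices, not part of any particular product:

```python
import numpy as np

# Toy 1-D dataset with one obvious point anomaly (the value 95.0).
data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 95.0, 10.0, 9.7, 10.4])

# Flag points whose z-score (distance from the mean, measured in
# standard deviations) exceeds a chosen threshold; 2.5 is a common heuristic.
z_scores = np.abs((data - data.mean()) / data.std())
anomalies = data[z_scores > 2.5]

print(anomalies)  # the 95.0 spike stands out
```

Note that a contextual anomaly would not be caught this way: a value of 10.0 could be perfectly normal at noon and anomalous at 3 a.m., which a global z-score cannot express.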
Anomaly Detection Method Classes
There are three main classes of anomaly detection methods. Essentially, the right method for a given problem depends on the labels available in the dataset.
Supervised anomaly detection methods require a dataset with a full set of "normal" and "abnormal" labels so that a classification algorithm can be trained. This is similar to traditional pattern recognition, except that outlier detection involves a naturally strong imbalance between the classes. Not all statistical classification algorithms cope well with the imbalanced nature of anomaly detection.
Semi-supervised anomaly detection techniques use a labeled set of normal training data to build a model representing normal behavior. They then use this model to detect anomalies by checking how likely the model is to have generated any given instance.
Unsupervised anomaly detection techniques detect anomalies in an unlabeled test dataset based solely on the intrinsic properties of the data. The working assumption is that, as in most cases, the vast majority of instances in the dataset are normal. The anomaly detection algorithm then flags the instances that appear least congruent with the rest of the data.
Anomaly detection methods
There are various methods for detecting anomalies. Depending on the circumstances, one may suit a particular user or dataset better than the others.
Clustering-based anomaly detection
Clustering-based anomaly detection remains popular in unsupervised learning. It rests on the assumption that similar data points tend to group together, as determined by their proximity to local centroids.
The K-means method, a widely used clustering algorithm, partitions the data points into k clusters of similar points. Users can then configure their systems to mark data instances falling outside these groups as anomalies. As an unsupervised method, clustering does not require data labeling.
Clustering algorithms can also be deployed to capture an anomalous class of data. Having built clusters from the training sample, the algorithm can compute a threshold for anomalous events and then use that rule to form new clusters that presumably collect new anomalous data.
However, clustering does not always work for time series data, because such data evolves over time while the method produces a fixed set of clusters.
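The K-means approach described above can be sketched with scikit-learn: fit clusters, measure each point's distance to its nearest centroid, and flag the farthest points. The cluster count, the synthetic data, and the 99th-percentile threshold are all illustrative assumptions, not prescribed values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense "normal" blobs plus two far-away outliers (indices 200, 201).
normal = np.vstack([rng.normal(0, 0.5, (100, 2)),
                    rng.normal(5, 0.5, (100, 2))])
outliers = np.array([[10.0, 10.0], [-8.0, 9.0]])
X = np.vstack([normal, outliers])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# transform() gives distances to all centroids; keep the nearest one.
dist = np.min(km.transform(X), axis=1)
# Treat the farthest ~1% of points as anomalies.
threshold = np.quantile(dist, 0.99)
flagged = np.where(dist > threshold)[0]
print(flagged)  # includes the two injected outliers
```

The distance-to-centroid score doubles as the "threshold of the anomalous event" mentioned above: new points can be scored against the fitted centroids without re-clustering.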
Density-based anomaly detection
Density-based anomaly detection methods require labeled data. This approach rests on the assumption that normal data points tend to occur in dense neighborhoods, while anomalies appear only sparsely.
There are two types of algorithms for this type of evaluation of data anomalies:
- K-nearest neighbors (k-NN) is a basic non-parametric supervised machine learning method that can be used either for regression or for classification, based on distance measures such as Euclidean, Hamming, Manhattan, or Minkowski distance.
- Local outlier factor (LOF), also interpreted as relative data density, is based on reachability distance.
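The LOF idea above can be sketched with scikit-learn's LocalOutlierFactor. The synthetic dataset, neighbor count, and contamination fraction are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# One dense normal cluster plus two sparse outliers (indices 60, 61).
X = np.vstack([rng.normal(0, 0.3, (60, 2)),
               [[4.0, 4.0], [-4.0, 3.5]]])

# LOF compares each point's local density to that of its neighbors;
# contamination sets the expected share of anomalies.
lof = LocalOutlierFactor(n_neighbors=10, contamination=0.05)
labels = lof.fit_predict(X)  # -1 marks anomalies, 1 marks inliers
print(np.where(labels == -1)[0])  # the two sparse points are flagged
```

Points in regions much less dense than their neighbors' receive a high LOF score, which is exactly the "relative data density" notion described above.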
Anomaly detection based on support vector machines
A support vector machine (SVM) is commonly used in supervised settings, but SVM extensions can also detect anomalies in unlabeled data. An SVM is a supervised learning model well suited to classifying linearly separable binary patterns - naturally, the cleaner the separation, the clearer the results.
Such an anomaly detection algorithm can learn a softer boundary, depending on the goals of clustering the data instances and correctly detecting anomalies. Depending on the situation, an anomaly detector of this kind can also output numerical scalar scores for various purposes.
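One common SVM extension for unlabeled data is the one-class SVM, sketched here with scikit-learn; the `nu` and `gamma` settings and the synthetic data are illustrative choices, not tuned values:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(0, 0.5, (200, 2))        # "normal" training data
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])  # one typical point, one far-off point

# nu bounds the fraction of training points treated as outliers;
# it controls how "soft" the learned boundary is.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_test))           # 1 = inlier, -1 = anomaly
# decision_function gives the scalar scores mentioned above
# (more negative = more anomalous).
print(ocsvm.decision_function(X_test))
```

The `decision_function` output is one example of the numerical scalar values such a detector can produce: it can feed a threshold, a ranking, or a downstream alerting rule.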
In summary, anomaly detection is the identification of data points that do not conform to normal patterns. This is useful for many tasks, including detecting fraud, intrusions, defects, and the like.