Outlier Detection - Statgraphics Version 20

Outlier Detection

Four procedures are available for detecting outliers, one based on statistical assumptions and the other 3 based on machine learning algorithms. In machine learning, outliers are often called "anomalies".

Methodology

Isolation Forest

Local Outlier Factor

Outlier Identification

One-Class SVM

Isolation Forest

The Isolation Forest procedure implements a machine-learning process to identify potential outliers or other anomalies in a set of multivariate quantitative variables. It does so by creating a forest of decision trees and measuring how many splits are required to isolate each observation. The primary output of the procedure is an anomaly score, which is related to the average number of required splits amongst the trees. The higher the score, the more likely it is that a particular observation is an outlier.

Local Outlier Factor

The Local Outlier Factor procedure implements a machine-learning process to identify potential outliers or other anomalies in a set of multivariate quantitative variables. It does so by calculating the “local reachable density” of each observation and comparing it to its nearest neighbors. Observations with substantially lower density than their neighbors are classified as outliers.

One-Class SVM

The One-Class SVM (Support Vector Machine) procedure implements a machine-learning process to identify potential outliers or other anomalies in a set of multivariate quantitative variables. It does so by transforming the data to a higher dimensional space and searching for a hypersphere or hyperplane that separates the majority of the observations from the outliers.

Outlier Identification

The Outlier Identification procedure is designed to help determine whether or not a sample of n numeric observations contains outliers. Outliers are observations that do not come from the same distribution as the rest of the sample. Both graphical methods and formal statistical tests due to Grubbs and Dixon are included. The procedure will also save a column of flags identifying the outliers in a form that can be used to exclude those observations when running other procedures.