The Art and Science of Data Analysis, Chapter 1: Deciphering Outliers - Noise or Informative?
Anyone who has done data analysis knows how strongly outliers can affect results. That is why many analysts try to identify outliers and remove them from the dataset.
First, let's answer a question: why is it crucial to detect outliers? One reason is that in many statistical tests, the presence of outliers affects the statistical power of the methods used. As a result, an analyst may get skewed and unreliable results.
To understand the impact better, let's consider an example - what happens to a data set with and without outliers?
Sample data set:
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5
We calculate the mean, median, and (population) standard deviation:
Mean = 2.64
Median = 2.5
Standard Deviation = 1.23
Let's now replace two of the values with outliers (300 and 500):
1, 1, 1, 2, 2, 2, 2, 300, 3, 3, 4, 4, 4, 500
The new values we have are:
Mean = 59.21
Median = 2.5
Standard Deviation = 144.17
As you can see, outliers significantly affect the mean and standard deviation values but not the median.
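The figures above can be reproduced with a short Python sketch. It uses only the standard library and assumes the population standard deviation (statistics.pstdev), which is the variant that matches the quoted values. The helper name summarize is made up for this illustration.

```python
import statistics

def summarize(values):
    """Return (mean, median, population standard deviation), rounded to 2 places."""
    return (
        round(statistics.mean(values), 2),
        round(statistics.median(values), 2),
        round(statistics.pstdev(values), 2),  # population SD, not sample SD
    )

clean = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5]
with_outliers = [1, 1, 1, 2, 2, 2, 2, 300, 3, 3, 4, 4, 4, 500]

print(summarize(clean))          # (2.64, 2.5, 1.23)
print(summarize(with_outliers))  # (59.21, 2.5, 144.17)
```

The median is identical in both runs because it depends only on the middle of the sorted data, while the mean and standard deviation use every value.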
So what should we do with outliers? Is it better to remove them from the data completely?
Before we go deeper into what's wrong with that approach, let's define an outlier. An outlier is abnormal behavior due to ... what?
Let's list the different types of outliers:
- due to data errors
- due to random fluctuations
- due to unexpected behavior
- due to a new trend in data
a) Outliers due to a data error:
When an outlier is caused by a measurement error or incorrectly entered data, it is actual noise and should be dropped in most cases.
This is crucial for data quality, but detecting such cases can be harder than it sounds. To understand why, let's take a look at three examples:
- -1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5
- 1, 1, 1, 2, 2, 2, 2, 5, 3, 3, 3, 4, 4, 4, 5
- 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 6, 6, 4, 4
In the first example, "-1" is the outlier even though it is not far from the other values. To catch it, we may need expert rules that check the acceptable domain of the data (for example, only positive values are allowed).
In the second case, we have "5" following "2", which may not look like an outlier because "5" also appears elsewhere in the dataset. However, if we know that the data should be sorted in ascending order, that value violates the rule.
And finally, the double "6" in the tail of the data array looks a bit above the typical values. Is it an outlier or not? That depends entirely on the nature of the data and the process behind it.
The examples above show that data quality checks require extensive domain knowledge and a set of manual rules that have to be configured and maintained.
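As a rough illustration of what such manual rules might look like, here is a minimal Python sketch covering the first two examples. The function names and the rules themselves (positive values only, ascending order) are assumptions taken from the discussion above, not a general-purpose validator.

```python
def find_non_positive(values):
    """Domain rule (assumed): only positive values are allowed."""
    return [v for v in values if v <= 0]

def find_order_breaks(values):
    """Ordering rule (assumed): data should be ascending.

    Returns (index, value) for every value that is larger than its successor.
    """
    return [
        (i, values[i])
        for i in range(len(values) - 1)
        if values[i] > values[i + 1]
    ]

example_1 = [-1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5]
example_2 = [1, 1, 1, 2, 2, 2, 2, 5, 3, 3, 3, 4, 4, 4, 5]

print(find_non_positive(example_1))  # [-1]
print(find_order_breaks(example_2))  # [(7, 5)]
```

Neither rule says anything about the third example, which is exactly the point: every new kind of error needs its own rule.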
Is there a simpler way to do this? One answer is time series analysis.
Analyzing current values against their historical behavior is a direct way to ask the data itself: "Is the value I see expected or abnormal?"
That's what we do at Comparative: we provide our users with time series analytics that can automatically cover all the cases above.
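To make the idea of "comparing against historical behavior" concrete, here is a deliberately simple Python sketch. It is not Comparative's method. It assumes a fixed trailing window and flags a value when it lies more than k standard deviations from the mean of the preceding window. The function name, the window size, the threshold k, and the sample readings are all hypothetical.

```python
import statistics

def flag_against_history(series, window=5, k=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]        # the `window` values before index i
        center = statistics.mean(history)
        spread = statistics.pstdev(history)
        # Skip flat histories (spread == 0) to avoid flagging on zero variance.
        if spread > 0 and abs(series[i] - center) > k * spread:
            flagged.append(i)
    return flagged

readings = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]  # made-up example series
print(flag_against_history(readings))  # [7]
```

A production approach would also have to deal with trend, seasonality, and the fact that a past outlier inflates the spread of the windows that follow it.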
b/c) Outliers due to random fluctuations or unexpected behavior.
The tricky part with those outliers is that they are usually valid and represent an actual value in the data. A reasonable question here is: "Should we remove those outliers from the data or not?" Before we answer it, let's take a look at a few techniques that can help to detect such outliers:
- graphical methods
A box plot is one of the most effective and easiest ways of identifying outliers. In a box plot, an outlier is a data point that lies beyond the whiskers, which conventionally extend up to 1.5 times the interquartile range from the edges of the box.
The disadvantage of that method is that it performs relatively poorly when you have skewed data or more than ten outliers.
- z-score method: (observation - mean) / standard deviation
The z-score (also called the standard score) is a statistical measure that tells you how many standard deviations a data point lies from the mean.
In a normal distribution, about 99.7% of the data points lie within +/- 3 standard deviations of the mean, so points beyond that range are commonly treated as outliers.
The drawback of this approach is that it assumes the data is normally distributed, which is often not the case in real examples.
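A minimal Python sketch of the z-score method follows. It assumes the population standard deviation and the common threshold of 3. The function names are made up for this illustration, and the data is the earlier example with 300 and 500 in it.

```python
import statistics

def z_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # all values identical: nothing deviates
    return [(v - mean) / sd for v in values]

def flag_by_z_score(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

data = [1, 1, 1, 2, 2, 2, 2, 300, 3, 3, 4, 4, 4, 500]
print(flag_by_z_score(data))  # [500]
```

Note that 300 is not flagged: the outliers themselves inflate the mean and the standard deviation, so its z-score is only about 1.67. This is another practical weakness of the method on small samples.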
- interquartile range
The interquartile range (IQR) is the length of the box in the box plot; it measures how spread out the middle half of the values is. Overall, IQR is a good and solid approach that tries to overcome the disadvantages of both methods above: it is simple, automatic, and does not rely on distribution assumptions.
Outlier bounds based on the IQR are calculated with the following steps:
- Sort the data in ascending order
- Calculate the first and third quartile
- Find the interquartile range: IQR = q3 - q1
- Find the lower bound: q1 - 1.5 * IQR
- Find the upper bound: q3 + 1.5 * IQR
The main disadvantage is that this approach is very sensitive to skewness in the data. It produces misleading bounds when the data is right- or left-skewed, which is a widespread case with business metrics.
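The IQR procedure can be sketched in Python as follows. This version uses the conventional fences q1 - 1.5 * IQR and q3 + 1.5 * IQR, and it assumes the "inclusive" quartile method of statistics.quantiles. Other quartile conventions give slightly different bounds. The function names are made up for this illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: q1 - k * IQR and q3 + k * IQR."""
    # statistics.quantiles sorts the data internally and returns [q1, q2, q3].
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_by_iqr(values, k=1.5):
    """Return values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

data = [1, 1, 1, 2, 2, 2, 2, 300, 3, 3, 4, 4, 4, 500]
print(iqr_bounds(data))   # (-1.0, 7.0)
print(flag_by_iqr(data))  # [300, 500]
```

Unlike the z-score, the quartiles are barely affected by the extreme values, so both 300 and 500 fall outside the fences.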
At Comparative, we overcome this by modifying IQR with a medcouple algorithm, which adjusts the coefficients for the lower and upper bounds to account for skewness in the data. One more thing: instead of just filtering outliers out, we let you analyze them separately, so you can still detect and analyze new behavior.
d) Outliers due to a new trend in data
To understand this specific case, let's consider a small example:
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 40, 40, 40, 600, 600, 400, 4
If we use the methods described above, the following values might be considered outliers: 40, 40, 40, 600, 600, 400. But are they?
If we plot this data, we can see three noticeable segments:
- 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4
- 40, 40, 40
- 600, 600, 400
So instead of removing those values, we can split the data into three groups and analyze each group separately.
We use this approach at Comparative to perform a univariate segmentation of data and automatically detect if such groups exist.
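To show what a univariate split could look like in its simplest form, here is a hypothetical Python heuristic. It is not Comparative's algorithm. It sorts the values and starts a new segment whenever a value is more than max_ratio times larger than the previous one. The function name and the ratio of 5 are assumptions chosen for this example, and the heuristic only makes sense for positive values.

```python
def split_by_ratio_gaps(values, max_ratio=5.0):
    """Split sorted positive values into segments at large multiplicative jumps."""
    ordered = sorted(values)
    if not ordered:
        return []
    segments = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current > previous * max_ratio:
            segments.append([current])    # big jump: start a new segment
        else:
            segments[-1].append(current)  # same scale: extend the current segment
    return segments

data = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 40, 40, 40, 600, 600, 400, 4]
for segment in split_by_ratio_gaps(data):
    print(segment)
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
# [40, 40, 40]
# [400, 600, 600]
```

Established alternatives for this kind of one-dimensional grouping include clustering methods such as k-means or natural-breaks classification. The ratio rule is used here only because it fits in a few lines.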
So what is the right way to address outliers in the data?
You should know your data and research its outliers in advance. You can also try different methods and see which one makes the most sense for your specific case.
Or use Comparative, as we do all those things out of the box.
Gerard is an experienced data scientist who currently serves as the Director of Data Science at Comparative. With a wealth of experience in machine learning, NLP, and time-series analysis, he is passionate about democratizing data and empowering individuals across industries to make data-driven decisions. Reach out to him at firstname.lastname@example.org