In this series of articles on anomaly detection, we begin by defining what an anomaly is and then discuss various methods for detecting anomalies.
What is an anomaly? An anomaly is "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism."
Anomalies are a subset of outliers (Aggarwal 2013).
All observations = normal data + outliers
Outliers = noise + anomalies
Noise = uninteresting outlier
Anomaly = sufficiently interesting outlier
A common method for detecting anomalies in one-dimensional data is the Z-Score. If the mean and standard deviation are known, the Z-Score of each data point is calculated as Z = (data point − mean) / standard deviation.
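As a minimal sketch of this formula, the Z-Score can be computed with Python's standard library. The function name `z_scores` and the sample values are illustrative assumptions, and the population standard deviation is assumed because the text does not say which variant is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("Z-Scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Made-up sample values, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(sample)
print([round(z, 2) for z in scores])
```

A Z-Score of 0 means the point sits exactly on the mean; the sign shows on which side of the mean it lies, and the magnitude counts standard deviations.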
Let's start by looking at how the Z-Score is calculated and how it can be used to detect anomalies.
We calculate the Z-Score and add the result to the data table.
Since we are concerned about low values of the Mean Participation Rate, our threshold will be a negative number: we are looking for schools whose participation rate is well below the average. Here we choose Z = −2. That is, any school with a Z-Score below −2 will be flagged as an anomaly.
Finally, we get the list of schools flagged as anomalies.
We have found our anomalies.
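The steps above (compute the Z-Score, add it to the table, filter with a lower threshold) could be sketched as follows. The school names and participation rates are hypothetical and are not the article's data; the field names are also assumptions, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

# Hypothetical table: one record per school (not the article's data).
schools = [
    {"school": "School A", "participation_rate": 0.82},
    {"school": "School B", "participation_rate": 0.79},
    {"school": "School C", "participation_rate": 0.85},
    {"school": "School D", "participation_rate": 0.81},
    {"school": "School E", "participation_rate": 0.78},
    {"school": "School F", "participation_rate": 0.84},
    {"school": "School G", "participation_rate": 0.80},
    {"school": "School H", "participation_rate": 0.83},
    {"school": "School I", "participation_rate": 0.77},
    {"school": "School J", "participation_rate": 0.30},
]
THRESHOLD = -2.0  # lower threshold: only unusually low rates are flagged

rates = [s["participation_rate"] for s in schools]
mu, sigma = mean(rates), pstdev(rates)

# Add the Z-Score to every record, like a new column in the data table.
for s in schools:
    s["z_score"] = (s["participation_rate"] - mu) / sigma

anomalies = [s for s in schools if s["z_score"] < THRESHOLD]
for s in anomalies:
    print(f'{s["school"]}: z = {s["z_score"]:.2f}')
```

With these made-up numbers only the school with a 30% participation rate falls below the threshold.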
Now we will consider a set of data that shows the limitations of the Z-Score and why the modified Z-Score can be useful.
We will consider the number of goals scored by the top scorer at every World Cup from 1930 to 2018 (21 tournaments in total).
Again, let's start by using the Z-Score to identify anomalies. Since we are interested in superstars, this time we use an upper threshold. We choose Z = +2: any player whose Z-Score exceeds this value will be flagged as an anomaly.
Now we adapt the earlier plotting function to display the results.
Only one player is flagged: Just Fontaine.
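A rough reproduction of this step, without the plot, is sketched below. The goal tallies are the commonly reported top-scorer figures, typed in here as an assumption rather than taken from the article's own table. The population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

# Top-scorer goal tallies per World Cup, 1930-2018, as commonly reported.
# Assumption: these are not copied from the article's table.
top_scorer_goals = {
    1930: 8, 1934: 5, 1938: 7, 1950: 8, 1954: 11, 1958: 13, 1962: 4,
    1966: 9, 1970: 10, 1974: 7, 1978: 6, 1982: 6, 1986: 6, 1990: 6,
    1994: 6, 1998: 6, 2002: 8, 2006: 5, 2010: 5, 2014: 6, 2018: 6,
}
THRESHOLD = 2.0  # upper threshold: only unusually high tallies are flagged

goals = list(top_scorer_goals.values())
mu, sigma = mean(goals), pstdev(goals)

z_by_year = {year: (g - mu) / sigma for year, g in top_scorer_goals.items()}
flagged = [year for year, z in z_by_year.items() if z > THRESHOLD]

print(f"mean = {mu:.2f}, standard deviation = {sigma:.2f}")
print("flagged tournaments:", flagged)
```

Under these assumptions the only flagged tournament is 1958, the year of Fontaine's 13 goals.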
Clearly, our analysis is flawed. Looking at the plot, we see that in 12 of the 21 tournaments the top scorer(s) scored fewer goals than the average (7.05).
The question arises: why?
The answer is that the mean and standard deviation are themselves affected by anomalies. With his 13 goals, the amazing Fontaine raises the mean so much that most players fall below it. As a result, he becomes the only anomaly.
Because of this, the Z-Score can sometimes be unreliable: the mean and standard deviation are themselves sensitive to anomalies, whereas the median and mode are not.
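A small experiment makes this sensitivity visible. The values below are made up for illustration, and `summarize` is a hypothetical helper; the sketch simply compares the statistics before and after one extreme value is added.

```python
from statistics import mean, median, pstdev

# Made-up values: a tight cluster, then the same cluster plus one extreme point.
typical = [5, 6, 6, 6, 6, 7, 7, 8]
with_outlier = typical + [20]


def summarize(data):
    """Collect the three statistics compared in the text."""
    return {"mean": mean(data), "std": pstdev(data), "median": median(data)}


before = summarize(typical)
after = summarize(with_outlier)

for name in ("mean", "std", "median"):
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f}")
```

The mean and standard deviation shift noticeably when the extreme value is added, while the median stays where it was.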
The modified Z-Score addresses this problem by using medians instead.
Now we calculate the modified Z-Score. For each data point x_i, it is defined as follows:
y_i = (x_i − median(x)) / MAD
where x_i is a data point, median(x) is the median of the data, and MAD is the median absolute deviation from the median, MAD = median(|x_i − median(x)|).
The value of MAD here is 1.00. We will make one small modification and introduce a consistency correction k, which allows us to use MAD as a consistent estimator of the standard deviation. The value of k depends on the underlying distribution of the data. For simplicity, we will use the value for the normal distribution, k = 1.4826.
Note: the correction factor k = 1.4826 still assumes that the underlying data are normally distributed!
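To see what the correction factor does, the following sketch draws simulated normal data and compares k · MAD with the ordinary standard deviation. The sample size, seed, and distribution parameters are arbitrary assumptions chosen for the demonstration, and `mad` is a hypothetical helper name.

```python
import random
from statistics import median, pstdev

K = 1.4826  # consistency correction for normally distributed data


def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median([abs(x - med) for x in values])


rng = random.Random(0)  # fixed seed so the run is repeatable
true_sigma = 3.0        # arbitrary choice for the simulation
sample = [rng.gauss(10.0, true_sigma) for _ in range(20000)]

print(f"standard deviation: {pstdev(sample):.3f}")
print(f"raw MAD:            {mad(sample):.3f}")
print(f"K * MAD:            {K * mad(sample):.3f}")
```

On normal data the scaled MAD lands close to the standard deviation, while the raw MAD is clearly smaller. On data that are far from normal this agreement is not guaranteed, which is the point of the note above.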
Thus, the modified Z-Score of a data point x_i becomes
y_i = (x_i − median(x)) / (k · MAD)
and it is precisely this form that we will use in the functions below.
As before, we calculate the modified Z-Score for all players, then plot and print the results. Note that the threshold stays the same: y = +2.
Now we find four anomalous players.
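The calculation can be sketched as below, again without the plot. The function name `modified_z_scores` is an assumption. The goal tallies are the commonly reported top-scorer figures rather than the article's own table.

```python
from statistics import median

K = 1.4826        # consistency correction, assuming normally distributed data
THRESHOLD = 2.0   # same upper threshold as for the ordinary Z-Score

# Top-scorer goal tallies per World Cup, 1930-2018, as commonly reported.
# Assumption: these are not copied from the article's table.
top_scorer_goals = {
    1930: 8, 1934: 5, 1938: 7, 1950: 8, 1954: 11, 1958: 13, 1962: 4,
    1966: 9, 1970: 10, 1974: 7, 1978: 6, 1982: 6, 1986: 6, 1990: 6,
    1994: 6, 1998: 6, 2002: 8, 2006: 5, 2010: 5, 2014: 6, 2018: 6,
}


def modified_z_scores(values, k=K):
    """Return (x - median) / (k * MAD) for every x in values."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        raise ValueError("modified Z-Score is undefined when MAD is zero")
    return [(x - med) / (k * mad) for x in values]


years = list(top_scorer_goals)
scores = modified_z_scores(list(top_scorer_goals.values()))
flagged = [year for year, y in zip(years, scores) if y > THRESHOLD]

print("flagged tournaments:", flagged)
```

With a median of 6 and a MAD of 1, every tally of 9 goals or more exceeds the threshold, which gives four flagged tournaments under these assumptions.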
So that was anomaly detection with the Z-Score, and how the modified Z-Score overcomes the limitations of the Z-Score.