Outliers

Outliers#

example of outlier test using modified Thompson technique#

%pylab inline
import matplotlib as mpl
mpl.rcParams['lines.linewidth']=2
mpl.rcParams['lines.color']='r'
mpl.rcParams['figure.figsize']=(8,6)
mpl.rcParams['font.size']=14
mpl.rcParams['axes.labelsize']=20

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib

x = array([48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 49.9, 50.1, 50.2, 50.7])
plot(arange(1,11),x,'o'),xlim([0,11]),xlabel('$n$'),ylabel('$T\; [^\circ C \,]$');

<>:2: SyntaxWarning: invalid escape sequence '\;'
<>:2: SyntaxWarning: invalid escape sequence '\;'
/tmp/ipykernel_34561/2914799422.py:2: SyntaxWarning: invalid escape sequence '\;'
  plot(arange(1,11),x,'o'),xlim([0,11]),xlabel('$n$'),ylabel('$T\; [^\circ C \,]$');

../_images/f882982b74dd48c5963616eb4b4b7aee3a8014494a3374cf18e03d49587c2e07.png

is there an outlier?#

Sort, look at the lowest and largest and plot to visualize

plot(arange(2,10),x[1:-1],'o'), plot(1,x[0],'rs',markersize=10,linewidth=2),plot(10,x[-1],'gs',markersize=10)
xlim([0,11])
xlabel('$\hat{n}$'),ylabel('Sorted $T\; [^\circ C \,]$');

<>:3: SyntaxWarning: invalid escape sequence '\h'
<>:3: SyntaxWarning: invalid escape sequence '\;'
<>:3: SyntaxWarning: invalid escape sequence '\h'
<>:3: SyntaxWarning: invalid escape sequence '\;'
/tmp/ipykernel_34561/3195646993.py:3: SyntaxWarning: invalid escape sequence '\h'
  xlabel('$\hat{n}$'),ylabel('Sorted $T\; [^\circ C \,]$');
/tmp/ipykernel_34561/3195646993.py:3: SyntaxWarning: invalid escape sequence '\;'
  xlabel('$\hat{n}$'),ylabel('Sorted $T\; [^\circ C \,]$');

../_images/77c449687617e48155e6a4196dec98a561323099a2dff44601cfd14165d8835a.png

The sample mean and standard deviation, \(\bar{x}\), \(S_x\)#

meanx = mean(x)
stdx = std(x,ddof=1)
print('mean = %6.2f,  std = %6.2f' % (meanx,stdx))

mean =  49.66,  std =   0.57

Defivations for each suspicious point, take the largest \(\delta_i = |x_i - \bar{x}|\)#

delta = abs(x - meanx)
print ('suspicious points are first and last:')
print ('%4.3f %4.3f' % (delta[0],delta[-1]))
print ('suspicious point is:',  argmax(delta), 'deviation is = %4.3f' % max(delta))

suspicious points are first and last:
0.760 1.040
suspicious point is: 9 deviation is = 1.040

Modified Thompson technique, \(\tau\)#

Define the confidence level (95%), therefore \(\alpha = 5\%\). Because we deal with outliers, the DOF is smaller:

\[ \tau = \frac{t_{\alpha/2} (n-1)}{\sqrt{n} \sqrt{n-2+t_{\alpha/2}^2}}, \qquad \alpha = 0.05, \quad df = n - 2 \]

if:

\[\delta_i > \tau S\]

then the point is an outlier

remove it, estimate new \(\bar{x}, S, \delta_i\) and repeat the test. until there is no outlier in the set.

from scipy.stats import t


n = len(x)
tv =  t.isf(0.05/2,n-2)
tau = tv*(n-1)/(sqrt(n)*sqrt(n-2+tv**2))

print ('n = %d, t = %6.4f, tau = %6.4f' % (n,tv,tau))

print ('compare: %6.3f vs. %6.3f ' % (max(delta), tau*stdx))
print ('Is max() above $t_{\\nu,95}S$? %s ' % (max(delta) > tau*stdx))

n = 10, t = 2.3060, tau = 1.7984
compare:  1.040 vs.  1.022 
Is max() above $t_{\nu,95}S$? True 

So we remove the outlier and repeat the test (from the beginning)#

# new data is one point less
xnew = x[:-1].copy()
meanx = mean(xnew)
stdx = std(xnew,ddof=1)
print ('x: ', xnew)
print ('mean = %6.2f,  std = %6.2f\n' % (meanx,stdx))

# new deviations
delta = abs(xnew - meanx)
print('deviations: ',delta)
print('\n')
print ('suspicious point is: %f, its deviation is = %f \n' %( argmax(delta), max(delta)))
# new \tau test
n = len(xnew)
tv =  t.isf(0.05/2,n-2)
tau = tv*(n-1)/(sqrt(n)*sqrt(n-2+tv**2))

print ('n = %d, t = %6.4f, tau = %6.4f\n' % (n,tv,tau))

print ('compare: %6.3f vs. %6.3f \n' % (max(delta), tau*stdx))
print('Is it outlier? :', max(delta) > tau*stdx)

x:  [48.9 49.2 49.2 49.3 49.3 49.8 49.9 50.1 50.2]
mean =  49.54,  std =   0.46

deviations:  [0.64444444 0.34444444 0.34444444 0.24444444 0.24444444 0.25555556
 0.35555556 0.55555556 0.65555556]

suspicious point is: 8.000000, its deviation is = 0.655556 

n = 9, t = 2.3646, tau = 1.7770

compare:  0.656 vs.  0.820 

Is it outlier? : False