Local Outlier Factor

Arun Mohan
4 min readDec 31, 2018

--

We have come across problems created by outliers in our model many times.If our data is one dimensional or two dimensional we can easily remove it.Did you ever thought how we can remove outlier in a 100 dimensional data?How do we find such outliers.Here comes our Local Outlier Factor to rescue.It helps to detect outliers in a data.Local outlier factor is a density-based method that relies on nearest neighbours search. The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors

The LOF is a calculation that looks at the neighbors of a certain point to find out its density and compare this to the density of neighbour points later on. In short we can say that the density around an outlier object is significantly different from the density around its neighbors. Let me explain how it will help.

Let us take data points a(0,0), b(0,1), c(1,1),d(3,0).We will find local outlier factor of each points.

Step 1

Calculate distance between all the points.Here we will use manhattan distance.

distance(a,b) = 1
distance(a,c) = 2
distance(a,d) = 3
distance(b,c) = 1
distance(b,d) = 4
distance(c,d) = 3

Step 2

Calculate k-distance between point z and it’s k th nearest neighbour. Here k=2 .

k-distance(a) = distance(a,c) = 2(c is the 2nd nearest neighbour)
k-distance(b) = distance(b,a) = 1(a and c are 2nd nearest neighbour)
k-distance(c) = distance(c,a) = 2(e is the 2nd nearest neighbour)
k-distance(d) = distance(d,a) = 3(a and c are 2nd nearest neighbour)

Step 3

Calculate all k-neighbours (k=2) of the each point

N(a) = {b,c}
N(b) = {a,c}
N(c) = {b,a}
N(d) = {a,c}

Step 4

Calculate all local reachable densities of each point. By intution we can say that local reachability density tells how far we have to travel from our point to reach the next point or cluster of points.The lover lrd it is less dense.

In order to find local reachable density of a point we have to find reachable distances.Here ||Nk(o)|| means number of neighbours . For example, ||Nk(a)|| = 2.

The equation for reachable distance is as follows:(here k=2)

Here first we will find lrd(a)

reachability distance(b,a) = max{k-distance(b),distance(a,b)} 
= max{1,1} = 1
reachability distance(c,a) = max{k-distance(c),distance(a,c)}
= max{2,2} = 2
lrd(a) = ||N(a)||
-------------------------------------------------------
reachability distance(b,a) + reachability distance(c,a)

= 2/(1+2) = 0.667

Similarly we can find local reachable density of b,c and d

lrd(b) = 2/(2+2) = 0.5
lrd(c) = 2/(1+2) = 0.667
lrd(d) = 2/(3+3) = 0.33

Step 5

Calculate local outlier factor

LOF(a) =  (lrd(b) + lrd(c))
-----------------
||N(a)||*(lrd(a))

= (0.5 + 0.667) /(2*0.667) = 0.8748

Similarly we calculate local outlier factor of b, c and d.

LOF(b) =  (lrd(a) + lrd(c))
-----------------
||N(b)||*(lrd(b))
= (0.667+0.667) /(2*0.5) = 1.33LOF(c) = (lrd(b) + lrd(a))
-----------------
||N(c)||*(lrd(c))

= (0.5+0.667) / (2*0.667) = 0.8748
LOF(d) = (lrd(a) + lrd(c))
-----------------
||N(d)||*(lrd(d))
= (0.667+0.667) /(2*0.33) = 2

Step 6

Sort all LOF

LOF(d) = 2
LOF(b) = 1.33
LOF(a) = 0.8748
LOF(c) = 0.8748

The next question that comes to our mind is which point is outlier? Actually it depends on data.We can say Point d is the first outlier.In general we can say that if LOF value is small (LOF <1),it can be an inlier and if it is large,it can be outlier(LOF >>1).

Implementation using Python

Here we will implement the same using python. We will use LocalOutlierFactor from sklearn for that.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
x = np.array([[0,0],[0,1],[1,1],[3,0]])
clf = LocalOutlierFactor(n_neighbors=2,metric='manhattan')
y_pred = clf.fit_predict(x)
scores = clf.negative_outlier_factor_
print(-scores)

Output:

[0.875      1.33333333 0.875      2.        ]

Here we print(-scores) because by default negative_outlier_factor gives negative value of the same.To change to positive we used print(-score)

--

--