Build k-NN from scratch in Python

In this article, we shall understand how the k-Nearest Neighbors (kNN) algorithm works and build it from the ground up. We shall also evaluate our algorithm using k-Fold Cross Validation, which is likewise developed from scratch.

After completing this tutorial you will know:

  • How to code the k-Nearest Neighbors algorithm step-by-step
  • How to use k-Nearest Neighbors to make a prediction for new data
  • How to code the k-Fold Cross Validation step-by-step
  • How to evaluate k-Nearest Neighbors on a real dataset using k-Fold Cross Validation

Prerequisites: Basic understanding of Python and the concept of classes and objects from Object-oriented Programming (OOP)

k-Nearest Neighbors

k-Nearest Neighbors, kNN for short, is a very simple but powerful technique used for making predictions. The principle behind kNN is to use “most similar historical examples to the new data.”

‘k’ is the number of similar neighbors to consider for the new data point.

The entire training dataset is initially stored. When a prediction is required for new data, kNN considers the k most similar neighbors (records) to decide the outcome for the new data point based on feature similarity.

Once we compute the distance or similarity, we choose the k closest records and make the prediction by returning the most common outcome or by taking the average. As such, kNN can be used for classification or regression problems.

The kNN algorithm doesn’t have a training phase. The model simply holds the data and does no work until a prediction is required. For this reason, kNN is often referred to as a “lazy learning” method.

k-Nearest Neighbors in 4 easy steps

  1. Choose a value for k
  2. Find the distance of the new point to each record of training data
  3. Get the k-Nearest Neighbors
  4. For a classification problem, the new data point belongs to the class that most of the neighbors belong to. For a regression problem, the prediction can be the average or weighted average of the labels of the k-nearest neighbors

Building kNN from scratch using Python

You can follow along using the code available in my GitHub.

You can also install it by using:

pip install simple-kNN

GitHub repo for PyPI package version: https://github.com/chaitanyakasaraneni/simple-kNN

Step 1: Choosing a k value

The choice of k has a drastic impact on the results we obtain from kNN. For classification it is better to choose an odd value of k, so that the majority vote among the neighbors cannot end in a tie.
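
As a small illustration (this snippet is not part of the package, just a sketch of the voting behavior), an even k can produce a tied vote in a two-class problem while an odd k cannot:

from collections import Counter

# Labels of the k nearest neighbors for a hypothetical test point
even_k_votes = ['A', 'A', 'B', 'B']        # k = 4 -> 2 vs 2, ambiguous
odd_k_votes = ['A', 'A', 'B', 'B', 'A']    # k = 5 -> clear majority

print(Counter(even_k_votes).most_common())  # [('A', 2), ('B', 2)] -> tie
print(Counter(odd_k_votes).most_common(1))  # [('A', 3)] -> predict 'A'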

Step 2: Calculating Distance

The next step is to calculate distance between two rows in a dataset.

Problem- or data-specific methods are used to calculate the distance or similarity between two records. In general, for tabular or vector data, Euclidean distance is considered a good starting point. There are several other similarity or distance metrics, such as Manhattan distance, Hamming distance, etc.

Euclidean distance is defined as the square root of the sum of the squared differences between two points. It is also known as the L2 norm.

Euclidean distance between two vectors ‘x’ and ‘y’: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)

Manhattan distance is the sum of the absolute values of the differences between two points.

Manhattan distance between two vectors ‘x’ and ‘y’: d(x, y) = Σᵢ |xᵢ − yᵢ|

Hamming distance is used for categorical variables. In simple terms, it counts the number of positions at which the two vectors differ.

Hamming distance between two vectors ‘x’ and ‘y’: d(x, y) = Σᵢ δ(xᵢ, yᵢ)

where δ(xᵢ, yᵢ) is 1 if the two elements differ and 0 if they are equal.

In Python, we create a separate class that holds the methods to calculate the distance between two vectors.

import numpy as np

class distanceMetrics:
    '''
    Description:
        This class contains methods to calculate various distance metrics
    '''
    def __init__(self):
        '''
        Description:
            Initialization/Constructor function
        '''
        pass
        
    def euclideanDistance(self, vector1, vector2):
        '''
        Description:
            Function to calculate Euclidean Distance
                
        Inputs:
            vector1, vector2: input vectors for which the distance is to be calculated
        Output:
            Calculated euclidean distance of two vectors
        '''
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        distance = 0.0
        for i in range(len(self.vectorA)):
            distance += (self.vectorA[i] - self.vectorB[i])**2
        return (distance)**0.5
    
    def manhattanDistance(self, vector1, vector2):
        """
            Description:
            Takes 2 vectors a, b and returns the manhattan distance
        Inputs:
            vector1, vector2: two vectors for which the distance is to be calculated
        Output:
            Manhattan Distance of two input vectors
        """
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        return np.abs(np.array(self.vectorA) - np.array(self.vectorB)).sum()
    
    def hammingDistance(self, vector1, vector2):
        """
            Description:
            Takes 2 vectors a, b and returns the hamming distance
            Hamming distance is meant for discrete-valued vectors, though it is a 
            valid metric for real-valued vectors.
        Inputs:
            vector1, vector2: two vectors for which the distance is to be calculated
        Output:
           Hamming Distance of two input vectors 
        """
        self.vectorA, self.vectorB = vector1, vector2
        if len(self.vectorA) != len(self.vectorB):
            raise ValueError("Undefined for sequences of unequal length.")
        return sum(el1 != el2 for el1, el2 in zip(self.vectorA, self.vectorB))

A class that contains methods to calculate the distance metrics
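
As a quick sanity check (a small usage example, not part of the original code), the class can be called directly on a pair of vectors:

calcDM = distanceMetrics()

print(calcDM.euclideanDistance([1, 2, 3], [4, 6, 3]))   # 5.0 -> sqrt(9 + 16 + 0)
print(calcDM.manhattanDistance([1, 2, 3], [4, 6, 3]))   # 7   -> 3 + 4 + 0
print(calcDM.hammingDistance([1, 2, 3], [4, 6, 3]))     # 2   -> two positions differ

# Hamming distance is most natural for categorical values
print(calcDM.hammingDistance(['red', 'circle'], ['red', 'square']))  # 1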

We shall utilize this class to find the nearest neighbors in the next step.

Step 3: Get Nearest Neighbors

Neighbors for a new piece of data are the top k closest instances in the training dataset, obtained using the distance metrics defined above.

To locate the neighbors for a new piece of data within a dataset, we must first calculate the distance from each record in the dataset to the new piece of data. We can do this by creating an object of the distanceMetrics class defined above.

Once distances are calculated, we must sort all of the records in the training dataset by their distance to the new data. We can then select the top k to return as the most similar neighbors.

We can do this by keeping track of the distance for each record in a list, sorting the list of lists by distance, and then retrieving the top k entries as the neighbors.

def getNeighbors(self, testRow):
        '''
        Description:
            Find the k nearest training records to a given test row
        Input:
            testRow: testing data with coordinates
        Output:
            k-nearest neighbors to the test data
        '''
        
        calcDM = distanceMetrics()
        distances = []
        for i, trainRow in enumerate(self.trainData):
            if self.distanceMetric == 'euclidean':
                distances.append([trainRow, calcDM.euclideanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'manhattan':
                distances.append([trainRow, calcDM.manhattanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'hamming':
                distances.append([trainRow, calcDM.hammingDistance(testRow, trainRow), self.trainLabels[i]])
        distances.sort(key=operator.itemgetter(1))

        neighbors = []
        for index in range(self.k):
            neighbors.append(distances[index])
        return neighbors
       

Now that we know how to get the top k neighbors from the dataset, we will use them to make predictions.

Step 4: Predictions

In this step, we shall use the top k similar neighbors collected from the training dataset to make predictions.

def predict(self, xTest, k, distanceMetric):
        '''
        Description:
            Apply kNN model on test data
        Input:
            xTest: testing data with coordinates
            k: number of neighbors
            distanceMetric: technique to calculate distance metric
        Output:
            predicted label 
        '''
        self.testData = xTest
        self.k = k
        self.distanceMetric = distanceMetric
        predictions = []
        
        for i, testCase in enumerate(self.testData):
            neighbors = self.getNeighbors(testCase)
            output= [row[-1] for row in neighbors]
            prediction = max(set(output), key=output.count)
            predictions.append(prediction)
        
        return predictions

In the case of classification, we can return the most represented class among the neighbors.

We can achieve this by calling the max() function on the list of output values from the neighbors: max(set(output), key=output.count) takes the set of unique class values and, for each one, counts how often it occurs in the list, returning the most frequent class.
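
For example, the voting step on a hypothetical list of neighbor labels works as follows:

output = ['cat', 'dog', 'cat']
prediction = max(set(output), key=output.count)
print(prediction)  # 'cat', since it appears in 2 of the 3 neighbors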

Below is the complete kNN class:

import operator

class kNNClassifier:
    '''
    Description:
        This class implements the k-Nearest Neighbors classifier
    '''
    def __init__(self,k = 3, distanceMetric = 'euclidean'):
        '''
        Description:
            KNearestNeighbors constructor
        Input    
            k: number of neighbors to consider. Defaults to 3
            distanceMetric: type of distance metric to be used. Defaults to Euclidean distance.
        '''
        self.k = k
        self.distanceMetric = distanceMetric
    
    def fit(self, xTrain, yTrain):
        '''
        Description:
            Train kNN model with x data
        Input:
            xTrain: training data with coordinates
            yTrain: labels of training data set
        Output:
            None
        '''
        assert len(xTrain) == len(yTrain)
        self.trainData = xTrain
        self.trainLabels = yTrain

    def getNeighbors(self, testRow):
        '''
        Description:
            Find the k nearest training records to a given test row
        Input:
            testRow: testing data with coordinates
        Output:
            k-nearest neighbors to the test data
        '''
        
        calcDM = distanceMetrics()
        distances = []
        for i, trainRow in enumerate(self.trainData):
            if self.distanceMetric == 'euclidean':
                distances.append([trainRow, calcDM.euclideanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'manhattan':
                distances.append([trainRow, calcDM.manhattanDistance(testRow, trainRow), self.trainLabels[i]])
            elif self.distanceMetric == 'hamming':
                distances.append([trainRow, calcDM.hammingDistance(testRow, trainRow), self.trainLabels[i]])
        distances.sort(key=operator.itemgetter(1))

        neighbors = []
        for index in range(self.k):
            neighbors.append(distances[index])
        return neighbors
        
    def predict(self, xTest, k, distanceMetric):
        '''
        Description:
            Apply kNN model on test data
        Input:
            xTest: testing data with coordinates
            k: number of neighbors
            distanceMetric: technique to calculate distance metric
        Output:
            predicted label 
        '''
        self.testData = xTest
        self.k = k
        self.distanceMetric = distanceMetric
        predictions = []
        
        for i, testCase in enumerate(self.testData):
            neighbors = self.getNeighbors(testCase)
            output= [row[-1] for row in neighbors]
            prediction = max(set(output), key=output.count)
            predictions.append(prediction)
        
        return predictions
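
Before wiring up cross-validation, here is a tiny hand-made example of the classifier in action (the feature values and labels are made up purely for illustration):

xTrain = [[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [4.1, 3.9], [3.9, 4.0]]
yTrain = ['small', 'small', 'large', 'large', 'large']

knn = kNNClassifier()
knn.fit(xTrain, yTrain)

# Classify two new points using the 3 nearest neighbors and Euclidean distance
print(knn.predict([[1.1, 1.0], [4.0, 4.0]], 3, 'euclidean'))  # ['small', 'large']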

Now that we have our predictions, we need to evaluate the performance of our model. For this, we shall use k-Fold Cross Validation which is defined in the next part.

k-Fold Cross Validation

This technique involves randomly dividing the dataset into k groups, or folds, of approximately equal size. One fold is held out for testing and the model is trained on the remaining k-1 folds; the process is repeated so that each fold serves as the test set once.

5 fold cross validation. Blue block is the fold used for testing. Source: sklearn documentation

There are many variants of k-Fold Cross Validation. You can read more about them here.

In our approach, we calculate the accuracy after each fold, and the accuracy of k-Fold CV is computed by taking the average of the accuracies over the k folds.
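
For instance, with 5 folds the reported score is simply the mean of the per-fold accuracies (the numbers below are hypothetical):

foldAccuracies = [90.0, 85.0, 88.0, 92.0, 87.0]   # accuracy (%) of each fold
print(sum(foldAccuracies) / len(foldAccuracies))  # 88.4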

Building kFCV from scratch using Python

As a first step, we divide the dataset into k folds.

Then, for each fold, we run the kNN algorithm, get predictions, and evaluate the performance using accuracy as the evaluation metric.

The method to split the data into k-Folds:

def crossValSplit(self, dataset, numFolds):
        '''
        Description:
            Function to split the data into number of folds specified
        Input:
            dataset: data that is to be split
            numFolds: integer - number of folds into which the data is to be split
        Output:
            split data
        '''
        dataSplit = list()
        dataCopy = list(dataset)
        foldSize = int(len(dataset) / numFolds)
        for _ in range(numFolds):
            fold = list()
            while len(fold) < foldSize:
                index = randrange(len(dataCopy))
                fold.append(dataCopy.pop(index))
            dataSplit.append(fold)
        return dataSplit

The method for evaluation:

def kFCVEvaluate(self, dataset, numFolds, *args):
        '''
        Description:
            Driver function for k-Fold cross validation 
        '''
        knn = kNNClassifier()
        folds = self.crossValSplit(dataset, numFolds)
        scores = list()
        for fold in folds:
            trainSet = list(folds)
            trainSet.remove(fold)
            trainSet = sum(trainSet, [])
            testSet = list()
            for row in fold:
                rowCopy = list(row)
                testSet.append(rowCopy)
                
            trainLabels = [row[-1] for row in trainSet]
            trainSet = [train[:-1] for train in trainSet]
            knn.fit(trainSet,trainLabels)
            
            actual = [row[-1] for row in testSet]
            testSet = [test[:-1] for test in testSet]
            
            predicted = knn.predict(testSet, *args)
            
            accuracy = printMetrics(actual, predicted)
            scores.append(accuracy)

        print('*'*20)
        print('Scores: %s' % scores)
        print('*'*20)        
        print('\nMaximum Accuracy: %.3f%%' % max(scores))
        print('\nMean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Both methods combined into a single class. (Note that printMetrics, used in the evaluation loop, is a small helper from the accompanying code that computes the accuracy from the actual and predicted labels.)

from random import randrange

class kFoldCV:
    '''
    This class is to perform k-Fold Cross validation on a given dataset
    '''
    def __init__(self):
        pass
    
    def crossValSplit(self, dataset, numFolds):
        '''
        Description:
            Function to split the data into number of folds specified
        Input:
            dataset: data that is to be split
            numFolds: integer - number of folds into which the data is to be split
        Output:
            split data
        '''
        dataSplit = list()
        dataCopy = list(dataset)
        foldSize = int(len(dataset) / numFolds)
        for _ in range(numFolds):
            fold = list()
            while len(fold) < foldSize:
                index = randrange(len(dataCopy))
                fold.append(dataCopy.pop(index))
            dataSplit.append(fold)
        return dataSplit
    
    
    def kFCVEvaluate(self, dataset, numFolds, *args):
        '''
        Description:
            Driver function for k-Fold cross validation 
        '''
        knn = kNNClassifier()
        folds = self.crossValSplit(dataset, numFolds)
        scores = list()
        for fold in folds:
            trainSet = list(folds)
            trainSet.remove(fold)
            trainSet = sum(trainSet, [])
            testSet = list()
            for row in fold:
                rowCopy = list(row)
                testSet.append(rowCopy)
                
            trainLabels = [row[-1] for row in trainSet]
            trainSet = [train[:-1] for train in trainSet]
            knn.fit(trainSet,trainLabels)
            
            actual = [row[-1] for row in testSet]
            testSet = [test[:-1] for test in testSet]
            
            predicted = knn.predict(testSet, *args)
            
            accuracy = printMetrics(actual, predicted)
            scores.append(accuracy)

        print('*'*20)
        print('Scores: %s' % scores)
        print('*'*20)        
        print('\nMaximum Accuracy: %.3f%%' % max(scores))
        print('\nMean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

We can execute this by creating an object of the k-Fold cross-validation class and calling the evaluate method as shown below.

kfcv = kFoldCV()
kfcv.kFCVEvaluate(data, foldCount, neighborCount, distanceMetric)

kfcv.kFCVEvaluate() then splits the data into the specified number of folds and evaluates the kNN algorithm by considering the top k neighbors using the distance metric specified.
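
Putting it all together, a minimal end-to-end run might look like the sketch below. The file name data.csv, the column layout (feature columns followed by the class label in the last column), the stand-in printMetrics helper, and the fold and neighbor settings are assumptions made purely for illustration:

import csv

def printMetrics(actual, predicted):
    # Stand-in for the printMetrics helper used by kFCVEvaluate:
    # returns the classification accuracy as a percentage
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

# Hypothetical CSV where each row is: feature_1, ..., feature_n, class_label
with open('data.csv', 'r') as f:
    data = []
    for row in csv.reader(f):
        data.append([float(value) for value in row[:-1]] + [row[-1]])

kfcv = kFoldCV()
kfcv.kFCVEvaluate(data, 10, 3, 'euclidean')  # 10 folds, k = 3, Euclidean distance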

Examples and implementation can be seen in my GitHub repository.

Conclusion

In this blog, we have seen:

  • kNN algorithm
  • Some distance metrics used in kNN algorithm
  • Predictions using kNN algorithm
  • Evaluating kNN algorithm using kFold Cross validation

Hope you gained some knowledge reading this article. Please remember that this article is just an overview, based on my understanding of the kNN algorithm and the k-Fold cross-validation technique gathered from various online sources.

By Chaitanya

I am a Computer Engineering Graduate and an Aspiring Data Scientist.
