Nearest = Most Similar?
Let’s consider the following situation. We work at a gardening store and have the purchase history and address of every resident in town. A new resident has just moved into the city, and we need to come up with a list of products to recommend to them. The tricky part is that we have no information about the new resident except their address. How would we do this?
We can make the realization that the new resident probably has a garden that shares a lot of similarities with their close neighbors’ gardens. For example, if all of the new resident’s neighbors purchase items specifically for gardens on a hill, it is very likely that our new resident has a garden on a hill as well.
This is the key assumption at the foundation of K-nearest neighbors: the closer two points are, the more similar they are likely to be.
Setup and notation
We will use the following conventions throughout:
- $\textcolor{#38bdf8}{X} \in \mathbb{R}^{n \times d}$: training data
- $\textcolor{#fb7185}{y} \in \{1, \dots, C\}$: class labels for training data
- $\textcolor{#c084fc}{x}_\ast \in \mathbb{R}^{d \times 1}$: testing data
- $\textcolor{#93c5fd}{K} \in \mathbb{N}$: the number of neighbors that vote.
The K-Nearest-Neighbor (KNN) model
The KNN model is very simple and intuitive to understand. Exactly as the name entails, the KNN model assigns a label to an unseen datapoint $\textcolor{#c084fc}{x}_\ast$ by counting the majority class label among the $\textcolor{#93c5fd}{K}$ nearest neighbors around $\textcolor{#c084fc}{x}_\ast$. Let’s first establish the fundamental KNN algorithm and then walk through the algorithm and its variations step by step.
Computation and memory
KNN is an example of a “lazy learning” model. Unlike many other models, where we define the model with a set of mathematical formulas and an objective function, KNN doesn’t build a model in the traditional sense: it computes the label of each datapoint on the fly and does not go through a training stage. There is a trade-off between lazy-learning and non-lazy-learning models. In a non-lazy-learning model, we spend resources on training the model and finding the best parameters. In a lazy-learning model, we spend resources on storing all of the training data in memory.
Finding the K nearest neighbors
Since there is no notion of “training” in a KNN model, how do we find the label of an unseen datapoint? The answer is that we take advantage of the fact that the model “memorizes” the entire training dataset and its corresponding class labels. When the model sees a new datapoint, we simply calculate the distance between the new datapoint and every single training datapoint and look at the closest $\textcolor{#93c5fd}{K}$ neighbors. For example, say we are doing 3-class classification with $\textcolor{#93c5fd}{K} = 5$: if 3 out of 5 neighbors have class label 0, 1 neighbor has class 1, and 1 neighbor has class 2, we categorize the new datapoint as class 0, since that is the majority-voted class.
To formalize this inference procedure, we can write the following formulation:
Let a test sample be $\textcolor{#c084fc}{x}_\ast \in \mathbb{R}^{d\times 1}$ and let the training set be $\{(\textcolor{#38bdf8}{x}_i,\textcolor{#fb7185}{y}_i)\}_{i=1}^n$ with $\textcolor{#fb7185}{y}_i \in \{1,\dots,C\}$.
For each training point, compute the pairwise distance (we will formally define distance metrics in the next section): \(d_i=\delta(\textcolor{#c084fc}{x}_\ast,\textcolor{#38bdf8}{x}_i),\quad i=1,\dots,n.\)
Let $\mathcal{N}_K(\textcolor{#c084fc}{x}_\ast)$ be the index set of the $\textcolor{#93c5fd}{K}$ smallest values among $\{d_i\}_{i=1}^n$. The KNN prediction with majority vote is \(\hat{\textcolor{#fb7185}{y}}(\textcolor{#c084fc}{x}_\ast) = \arg\max_{c\in\{1,\dots,C\}} \sum_{i\in\mathcal{N}_K(\textcolor{#c084fc}{x}_\ast)} \mathbf{1}[\textcolor{#fb7185}{y}_i=c].\)
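The inference procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the function name `knn_predict` is our own, and we assume Euclidean distance here (distance metrics are defined in the next section) and integer labels $0,\dots,C-1$.

```python
import numpy as np

def knn_predict(X, y, x_star, K):
    """Classify x_star by majority vote among its K nearest training points.

    X: (n, d) training data, y: (n,) integer class labels,
    x_star: (d,) test point, K: number of neighbors that vote.
    """
    # Distance from x_star to every training point (Euclidean here)
    dists = np.linalg.norm(X - x_star, axis=1)
    # Indices of the K smallest distances: the neighbor set N_K(x_star)
    neighbors = np.argsort(dists)[:K]
    # Count votes per class and return the majority class
    votes = np.bincount(y[neighbors])
    return int(np.argmax(votes))
```

Note that every call scans the full training set, which is exactly the lazy-learning trade-off discussed above: no training cost, but linear-time inference in $n$.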
We have so far taken “closest neighbors” for granted; to make it precise, we need to choose a distance metric.
Minkowski Distance
Minkowski distance is one of the most intuitive and popular choices as it is a family of distance metrics with one parameter: p.
\(d(\textcolor{#c084fc}{x}_\ast,\textcolor{#38bdf8}{x}_i)=\left(\sum_{j=1}^d |\textcolor{#c084fc}{x}_{\ast,j}-\textcolor{#38bdf8}{x}_{i,j}|^p\right)^{1/p},\quad p\ge 1\)
The two notable special cases are (i) $p = 1$, which gives the Manhattan distance, and (ii) $p = 2$, which gives the Euclidean distance. Note that both Manhattan and Euclidean distance calculations require the features to be scaled consistently.
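As a quick sketch, the Minkowski family can be written as a single function of $p$ (the name `minkowski_distance` is ours):

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between vectors a and b, for p >= 1.

    p=1 gives Manhattan distance; p=2 gives Euclidean distance.
    """
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))
```

For example, between $(0, 0)$ and $(3, 4)$ the Manhattan distance ($p=1$) is $7$ while the Euclidean distance ($p=2$) is $5$.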
Hamming Distance
Hamming distance is often used when the features are binary or one-hot encoded. It simply counts the number of mismatches between the two vectors.
\(d(\textcolor{#c084fc}{x}_\ast,\textcolor{#38bdf8}{x}_i)= \text{number of positions where } \textcolor{#c084fc}{x}_\ast \neq \textcolor{#38bdf8}{x}_i\)
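This mismatch count is a one-liner in NumPy; a minimal sketch (the helper name `hamming_distance` is ours):

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions where vectors a and b differ."""
    return int(np.sum(a != b))
```

For instance, the binary vectors $(1,0,1,1)$ and $(1,1,1,0)$ differ in two positions, so their Hamming distance is 2.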
Cosine Distance
We often use cosine similarity to calculate the similarities between a pair of vectors. We can also use this notion to quantify distances.
\(\text{cosine similarity}(\textcolor{#c084fc}{x}_\ast,\textcolor{#38bdf8}{x}_i) = \frac{\textcolor{#c084fc}{x}_\ast^T\textcolor{#38bdf8}{x}_i}{∥\textcolor{#c084fc}{x}_\ast∥∥\textcolor{#38bdf8}{x}_i∥}\) Since the greater the similarity, the smaller the distance should be, we take the complement: \(d(\textcolor{#c084fc}{x}_\ast,\textcolor{#38bdf8}{x}_i) = 1- \frac{\textcolor{#c084fc}{x}_\ast^T\textcolor{#38bdf8}{x}_i}{∥\textcolor{#c084fc}{x}_\ast∥∥\textcolor{#38bdf8}{x}_i∥}\)