Introduction
KNN algorithm is a supervised learning algorithm which we store training dataset (labeled) in the training time. And in the test time, we compare distance of the test data with every every each data in the training dataset, then we select the k (hyperparameter) nearest training examples to classify our test data. The most frequent label of these k training examples, is used to classify the test data. So the main drawback here is that since there is not actual training (because we just store the data), we need to compare distance with every training data, and as your training data is increasing, so is your test time (linearly). The other disadvantage is that rgb values are used to measure the distance and this results with color matching, e.g. picture of the airplane on the air mostly have blue color because of the sky and a ship on the ocean may also have blue background which might be confusing for KNN, you can end up with misclassifying plane as ship or vice versa.
There are two distance formulas which are called as L1 (Manhattan distance) and L2 (Euclidean distance) formulas (another hyperparameter);
L1 Distance;
\[d(I_i, I_j) = \sum_p|I_i^p - I_j^p|\]and L2 Distance;
\[d(I_i, I_j) = \sqrt{\sum_p(I_i^p - I_j^p)^2}\]Calculating Distance Matrix Without Loops
Let’s suppose that we have images with dimensions of 32323, then let’s store training dataset (5000 images) in the \(X_{train}\) matrix, and test dataset (500 images) in \(X_{test}\) as follows;
Then element-wise square has been calculated, each row vector has been summed and stored in cloumn vectors named as Test and Train;
So, the size of the Test and Train vectors are (500, 1), and (5000, 1) respectively. After that, Transpose of Test vector and Train vector has been multiplied and Mult matrix (in the code it is called as C matrix) has been obtained with size of (500, 5000).
Then each row vector of the Mult matrix has been divided by Train vector and the D matrix below has been obtained as a matrix that with size of (500, 5000) and each column vector equals to Test vector. To remind the reader this calculations are made for to obtain true matrix sizes in order to perform vectorizing in the end.
Same procedure has been applied to obtain E matrix whose row vectors equal to Train vector and its size again (500, 5000). And for this one, each column vector of Mult matrix has been divided by transpose of Test vector.
Lastly F matrix with the size of (500, 5000) has been obtained by multiplication of Xtest and transpose of Xtrain matrices. Now, L2 equation can be expanded as follows (4);
\[d(I_i, I_j) = \sqrt{\sum_p(I_i^2 + I_j^2 - 2I_i I_j)}\]And since summation has performed for square of every pixel values and stored in Test and Train vectors. After that those vector has been stored in D and E matrices. Then summation can be removed from the last equation above via changing the inputs of the formula. Now, dists matrix can be calculated as follows;
\[dists = \sqrt{(D + E - 2F)}\]Passing time to classify inputs without loops has been found about 1.7 seconds which at least 20 times faster than with looped versions.