K-Nearest Neighbors (KNN) is a simple yet powerful algorithm that classifies new observations based on a similarity measure. This supervised ML algorithm can be used for both classification and regression problems, though in industry it is mainly used for classification.
A KNN classifier labels unlabeled observations by assigning them to the class of the most similar labeled examples. Rather than building an explicit model, KNN stores the training data and classifies each newly inputted point by its similarity to those stored examples: the input is assigned to the class it shares with its nearest neighbors.
For example, fruits, vegetables, and grains can be distinguished by their crunchiness and sweetness. Only two characteristics are used here so the foods can be displayed on a two-dimensional plot; in reality there can be any number of predictors, and the example extends to any set of characteristics. In general, fruits are sweeter than vegetables, and grains are neither crunchy nor sweet. Our task is to determine to which category a sweet potato belongs. With K = 4, we choose the four nearest foods: apple, green bean, lettuce, and corn. Because vegetables win the most votes (two of the four neighbors), the sweet potato is assigned to the vegetable class. As this example shows, the key concept of KNN is easy to understand.
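The logic of this example fits in a few lines of code. Below is a minimal sketch in Python; the (sweetness, crunchiness) scores are invented for illustration, and since only these four foods are in the training set, K = 4 simply polls all of them.

```python
from collections import Counter
import math

# Hypothetical (sweetness, crunchiness) scores on a 1-10 scale.
training_data = [
    ((10, 9), "fruit"),      # apple
    ((3, 7), "vegetable"),   # green bean
    ((1, 9), "vegetable"),   # lettuce
    ((6, 5), "grain"),       # corn
]

def knn_classify(point, data, k):
    """Majority vote among the k training examples closest to `point`."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

sweet_potato = (7, 4)  # fairly sweet, not very crunchy (made-up scores)
print(knn_classify(sweet_potato, training_data, k=4))  # -> vegetable
```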
Two properties that define KNN
- Lazy learning algorithm: KNN is a lazy learning algorithm because it has no specialized training phase; it stores the training data and defers all computation until classification time.
- Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm because it makes no assumptions about the underlying data distribution.
As explained earlier, the K-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points: a value is assigned to a new data point based on how closely it resembles the points in the training set. The algorithm proceeds in the following steps (a runnable sketch follows the list):
- Step 1: Load the training and test datasets.
- Step 2: Choose the value of K, i.e., the number of nearest neighbors to consider. K can be any positive integer.
- Step 3: For each point in the test data, do the following:
- a. Calculate the distance between the test point and each row of the training data using a distance metric such as Euclidean, Manhattan, or Hamming distance. Euclidean distance is the most common choice.
- b. Sort the training rows in ascending order by distance.
- c. Choose the top K rows from the sorted array.
- d. Assign the test point to the most frequent class among those K rows.
- Step 4: End
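Here is a runnable sketch of these four steps in Python. The Iris dataset from scikit-learn stands in for the training/test data, and K = 5 is an arbitrary choice for illustration.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Step 1: load the training and test datasets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: choose K, the number of neighbors to consult.
K = 5

def predict(x):
    # Step 3a: Euclidean distance from x to every training row.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Steps 3b-3c: sort ascending and keep the indices of the top K rows.
    nearest = np.argsort(distances)[:K]
    # Step 3d: assign the most frequent class among those K neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Step 4: done -- evaluate on the held-out test set.
predictions = [predict(x) for x in X_test]
print(f"accuracy: {np.mean(predictions == y_test):.2f}")
```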
Pros of KNN algorithm
- KNN is known for its simplicity and comprehensibility: learning and implementation are extremely simple and intuitive.
- It is easy to interpret, and the underlying mathematics is easy to follow.
- There is no model-fitting step, so training time is negligible.
- Its predictive power can be high when the training data is representative.
- Accuracy generally improves with larger, more representative training sets (though prediction cost grows with them).
- It is very useful for nonlinear data because the algorithm makes no assumptions about the data's distribution.
- It is a versatile algorithm: it can be used for both classification and regression, as the sketch after this list shows.
- It has relatively high accuracy.
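As a quick illustration of that versatility, the sketch below uses scikit-learn's KNeighborsClassifier and KNeighborsRegressor; the tiny arrays are made up purely for demonstration, not a tuned model.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [3], [4], [5]]

# Classification: majority vote among the 3 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, [0, 0, 0, 1, 1, 1])
print(clf.predict([[1.2]]))   # -> [0]

# Regression: average of the 3 nearest neighbors' target values.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, [0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
print(reg.predict([[1.2]]))   # -> [0.5] (mean of targets at x = 0, 1, 2)
```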
Cons of KNN algorithm
- Determining a good K can be expensive if the dataset is large, and KNN requires more memory than eager classifiers because the entire training set must be stored.
- The prediction phase is slow for larger datasets, and the choice of distance computation plays a big role in the algorithm's accuracy.
- Determining the parameter K is a major step, and it is sometimes unclear which distance metric and which features will give the best results (see the sketch after this list).
- It is very sensitive to the data's scale and to irrelevant features; irrelevant or correlated features distort the distance computation and should be eliminated.
- The computation cost is quite high as each training example’s distance is calculated.
- KNN is a lazy learning algorithm as it doesn’t learn from the training data; it simply memorizes it and then uses that data to classify the new input.
- It typically struggles with high-dimensional data (the curse of dimensionality), since distances become less informative as the number of features grows.
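Two of these weaknesses, the choice of K and the sensitivity to feature scale, are commonly mitigated by standardizing the features and selecting K with cross-validation. A sketch with scikit-learn (the parameter grid here is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize features so no single feature dominates the distances,
# then pick K by 5-fold cross-validation.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, f"score: {search.best_score_:.2f}")
```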
Applications of KNN
KNN can be used in the banking system to predict whether an individual is fit for loan approval, or whether their profile resembles that of known defaulters. KNN can also be used to estimate an individual's credit rating by comparing them with people who have similar traits. In politics, KNN can classify potential voters into classes such as "will vote" and "will not vote." Other areas in which the KNN algorithm is used include speech recognition, handwriting detection, image recognition, and video recognition.