
KNN example

  1. Concept Introduction:

    • KNN (k-nearest neighbors) is a supervised learning algorithm used for both classification and regression.
    • How it works: find the k training points closest to a query point and use their labels to classify it (or their values to predict).
  2. Mathematics Behind KNN:

    • Distance calculation (e.g., Euclidean distance): d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
    • Decision rule: the majority class among the k nearest neighbors (see the from-scratch sketch after this list).
  3. Step-by-Step Example:
    We'll classify a 2D dataset, show decision boundaries, and explain the classification mathematically.
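
To make the mathematics concrete before turning to scikit-learn, here is a minimal from-scratch sketch. The helper names euclidean and knn_predict are our own, not from any library; the sketch just computes Euclidean distances and takes a majority vote:

import numpy as np
from collections import Counter

def euclidean(a, b):
    # d(a, b) = sqrt(sum_i (a_i - b_i)^2)
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, query, k=3):
    # Distance from the query to every training point
    distances = [euclidean(x, query) for x in X_train]
    # Labels of the k closest points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote decides the predicted class
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny example: two clusters, around (0, 0) and (5, 5)
X_train = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4, 4]), k=3))  # -> 1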

Here’s the Python code to demonstrate KNN with a graph and a simple dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Generate a simple dataset (2D for visualization)
X, y = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

# Separate the points of each class (easier for visualization)
class_0 = X[y == 0]
class_1 = X[y == 1]

# Plot the dataset
plt.figure(figsize=(8, 6))
plt.scatter(class_0[:, 0], class_0[:, 1], c='blue', label='Class 0')
plt.scatter(class_1[:, 0], class_1[:, 1], c='red', label='Class 1')
plt.title("Simple 2D Dataset for KNN")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()
plt.show()

# Train KNN on the dataset
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y)

# Plot decision boundary
def plot_decision_boundary(X, y, model, title="Decision Boundary"):
    h = 0.02  # Step size of the mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the class of every point on the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.Paired)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
    plt.title(title)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.grid()
    plt.show()

plot_decision_boundary(X, y, knn, title=f"KNN Decision Boundary (k={k})")

# Example classification: find the k nearest neighbors of a query point
query_point = np.array([[0, 0]])
distances, indices = knn.kneighbors(query_point, return_distance=True)

plt.figure(figsize=(8, 6))
plt.scatter(class_0[:, 0], class_0[:, 1], c='blue', label='Class 0')
plt.scatter(class_1[:, 0], class_1[:, 1], c='red', label='Class 1')
plt.scatter(query_point[:, 0], query_point[:, 1], c='green', label='Query Point', s=100)
# Draw a dashed line from the query point to each of its k nearest neighbors
for i in range(k):
    plt.plot(
        [query_point[0, 0], X[indices[0, i], 0]],
        [query_point[0, 1], X[indices[0, i], 1]],
        'k--'
    )
plt.title("Query Point and Nearest Neighbors")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()
plt.show()
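
For completeness, the trained classifier can also label the query point directly; a quick check using the knn and query_point objects defined above:

# Majority vote among the 5 nearest neighbors determines the class
predicted_class = knn.predict(query_point)[0]
print(f"Predicted class for {query_point[0]}: {predicted_class}")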

Explanation of Code:

  1. Dataset:

    • A 2D dataset is created using make_classification, with two features and two classes.
    • Points from different classes are visualized in different colors.
  2. Visualization:

    • The scatter plot shows how the two classes are distributed in the feature space.
  3. Training:

    • The KNeighborsClassifier is trained with k=5.
  4. Decision Boundary:

    • The decision boundary is plotted to show how KNN partitions the feature space; the choice of k controls how smooth this boundary is (see the sketch after this list).
  5. Query Point:

    • A query point is plotted together with dashed lines connecting it to its 5 nearest neighbors.
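
Small k lets the boundary bend around individual points (low bias, high variance), while large k averages over more neighbors and smooths it out. One way to see this, reusing X, y, and plot_decision_boundary from above:

# Compare a very flexible model (k=1) with a much smoother one (k=15)
for k_value in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k_value)
    model.fit(X, y)
    plot_decision_boundary(X, y, model, title=f"KNN Decision Boundary (k={k_value})")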

Mathematical Explanation:

  1. Distance:
    For the query point (x, y) = (0, 0), compute the Euclidean distance to every training point (x_i, y_i) and keep the 5 smallest:

    d_i = \sqrt{(x_i - x)^2 + (y_i - y)^2}
  2. Classification Rule:
    If most of the 5 neighbors belong to Class 0, classify the query point as Class 0; otherwise, classify it as Class 1. A numerical check is sketched below.
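
To tie both steps back to the code, the distances and indices arrays returned by kneighbors earlier can be inspected directly; a minimal sketch (run after the snippets above):

# Show each neighbor's distance and label, then take the majority vote
neighbor_labels = y[indices[0]]
for dist, label in zip(distances[0], neighbor_labels):
    print(f"distance = {dist:.3f}, class = {label}")
print("Majority class:", np.bincount(neighbor_labels).argmax())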