
Introduction to the K-Nearest Neighbors Algorithm (KNN)


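What Is KNN

K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. Its core idea is that similar samples lie close together in feature space. Given a new sample, KNN finds the K training samples nearest to it and predicts the output from those "neighbors": a majority vote over their classes for classification, or the average of their values for regression.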
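The procedure described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a production implementation: the function name `knn_predict`, its `task` parameter, and the toy data are assumptions made up for this example, and Euclidean distance is assumed as the metric.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classify"):
    """Predict the output for one sample from its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_outputs = [y_train[i] for i in nearest]

    if task == "classify":
        # Majority vote among the neighbors
        return Counter(neighbor_outputs).most_common(1)[0][0]
    # Regression: average of the neighbors' values
    return float(np.mean(neighbor_outputs))


# Hypothetical toy data: two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 11.0, 12.0, 50.0, 51.0, 52.0]

print(knn_predict(X_toy, labels, [1.1, 1.0], k=3))                  # a
print(knn_predict(X_toy, values, [5.0, 5.0], k=3, task="regress"))  # 51.0
```

Note that nothing is "learned" ahead of time here: every call computes the distance to all training samples, which is why prediction cost grows with the size of the training set.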

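KNN is a lazy learning algorithm: the training phase does little more than store the training data (some implementations also build a search index), and the actual computation is deferred to prediction time.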

Distance Metrics

Distance computation is at the heart of KNN. Commonly used distance metrics include:

Euclidean Distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

The most commonly used distance metric; well suited to continuous numeric features.

Manhattan Distance

$$d(x, y) = \sum_{i=1}^{n}|x_i - y_i|$$

Often a good fit for high-dimensional, sparse data.

Minkowski Distance

$$d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$$

With p=1 this reduces to the Manhattan distance, and with p=2 to the Euclidean distance.
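As a quick numerical check of the formulas above, the following sketch computes the Minkowski distance for p=1 and p=2 on a pair of example vectors. The helper name `minkowski_distance` and the vectors are assumptions chosen for illustration.

```python
import numpy as np


def minkowski_distance(x, y, p=2):
    """Minkowski distance between two vectors, for p >= 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))


x = [1.0, 2.0]
y = [4.0, 6.0]

manhattan = minkowski_distance(x, y, p=1)  # |1-4| + |2-6| = 7
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```

The coordinate differences are 3 and 4, so the Manhattan distance is their sum and the Euclidean distance is the hypotenuse of the 3-4-5 triangle.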

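Choosing K

The choice of K has a large effect on model performance: a small K follows noise in the training data and tends to overfit, while a large K smooths over local structure and tends to underfit.

Cross-validation is generally recommended for choosing the best K.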
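One possible way to run that cross-validation is scikit-learn's `GridSearchCV` over a `Pipeline`. The sketch below assumes the wine dataset, a search range of K from 1 to 30, and 5-fold cross-validation; all three are illustrative choices. Putting the scaler inside the pipeline means it is refit on each training fold, so the validation fold does not influence the scaling statistics.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and KNN are chained so both are fit inside each CV fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try every K from 1 to 30 and score each with 5-fold cross-validation
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print("best K:", best_k)
print("mean CV accuracy:", round(search.best_score_, 4))
```

The selected K depends on the dataset and the fold split, so it is best treated as a starting point rather than a universal setting.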

Code Example: KNN Classification in Practice

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Load the data
wine = load_wine()
X, y = wine.data, wine.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize the features (KNN is sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Find the best K ---
print("\n=== Choosing K ===")
k_range = range(1, 31)
train_scores = []
test_scores = []
cv_scores_list = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)

    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))
    cv_temp = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    cv_scores_list.append(cv_temp.mean())

# Pick the K with the highest mean cross-validation accuracy
best_k = k_range[np.argmax(cv_scores_list)]
print(f"Best K: {best_k}")
print(f"CV accuracy at best K: {max(cv_scores_list):.4f}")

# Train the final model with the best K
knn_best = KNeighborsClassifier(
    n_neighbors=best_k,
    weights='distance',  # distance-weighted voting
    metric='euclidean',
    n_jobs=-1
)
knn_best.fit(X_train_scaled, y_train)

y_pred = knn_best.predict(X_test_scaled)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Detailed classification report
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

# --- Compare distance metrics ---
print("\n=== Distance Metric Comparison ===")
# Note: 'minkowski' uses p=2 by default, so it matches 'euclidean' here
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
for metric in metrics:
    knn_m = KNeighborsClassifier(n_neighbors=best_k, metric=metric)
    cv_m = cross_val_score(knn_m, X_train_scaled, y_train, cv=5)
    print(f"  {metric:12} | CV accuracy: {cv_m.mean():.4f}")

# --- Weighted KNN ---
print("\n=== Weighting Scheme Comparison ===")
weights_options = ['uniform', 'distance']
for weight in weights_options:
    knn_w = KNeighborsClassifier(n_neighbors=best_k, weights=weight)
    cv_w = cross_val_score(knn_w, X_train_scaled, y_train, cv=5)
    print(f"  {weight:10} | CV accuracy: {cv_w.mean():.4f}")

# --- Predicted probabilities ---
print("\n=== Predicted Probability Examples ===")
knn_prob = KNeighborsClassifier(n_neighbors=best_k, weights='distance')
knn_prob.fit(X_train_scaled, y_train)
probs = knn_prob.predict_proba(X_test_scaled[:5])
for i, prob in enumerate(probs):
    pred_class = np.argmax(prob)
    print(f"  Sample {i+1}: predicted class={pred_class}, probabilities={prob.round(3)}")
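The example above standardizes the features because KNN is sensitive to feature scale. To see how much this matters, a small side experiment can compare cross-validated accuracy with and without `StandardScaler`. The sketch below assumes K=5 and 5-fold cross-validation on the same wine dataset; both are arbitrary illustrative settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same classifier, once on raw features and once behind a scaler
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"without scaling: {raw_acc:.4f}")
print(f"with scaling:    {scaled_acc:.4f}")
```

On raw data, features with large numeric ranges dominate the distance, so the neighbors are effectively chosen by only a few columns.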

Pros and Cons of KNN

Advantages:

Disadvantages:

Ways to Improve KNN

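  1. KD-Tree / Ball-Tree: speed up nearest-neighbor search
  2. Dimensionality reduction: reduce the number of dimensions first with PCA or a similar method
  3. Feature selection: use only the important features
  4. Distance weighting: give closer neighbors larger weights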
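The four ideas in the list above can be combined in a single scikit-learn pipeline. The following is a rough sketch rather than a tuned recipe: keeping 10 features, reducing to 5 principal components, and K=5 are arbitrary values assumed for illustration, and `algorithm="ball_tree"` could be swapped in for the KD-tree.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

improved_knn = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),   # feature selection: keep 10 of the 13 features
    PCA(n_components=5),            # dimensionality reduction
    KNeighborsClassifier(
        n_neighbors=5,
        algorithm="kd_tree",        # tree-based nearest-neighbor search
        weights="distance",         # closer neighbors get larger weights
    ),
)

scores = cross_val_score(improved_knn, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.4f}")
```

On a dataset this small the tree index brings no noticeable speedup; it pays off mainly with many samples and a moderate number of dimensions. Whether feature selection or PCA improves accuracy should be checked by cross-validation for each dataset.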

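Practical Applications

Summary

Simple as it is, KNN captures an intuitive idea, the one behind the proverb "近朱者赤" (one who stays near vermilion is stained red): a sample tends to resemble its nearest neighbors. Understanding KNN makes it easier to understand more complex machine learning algorithms. In practice, KNN is often used as a baseline model, and its ideas have inspired many other algorithms.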