ZICONG

Zicong's Personal Homepage


Sklearn Machine Learning: KNN

Characteristics

  • A data-driven rather than model-driven algorithm
  • Makes no assumptions about the data
  • Core idea: classify by similarity

Basic idea

  • For a given instance $i$, find its nearest neighboring instances
  • Assign the instance to whichever class is most common among those neighbors
  • "Near" means the values of the explanatory variables are close
  • Many distance metrics exist; the most basic is Euclidean distance
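
As a toy illustration (hypothetical points, not from the dataset below), Euclidean distance and a brute-force nearest-neighbor lookup can be sketched as:

```python
import numpy as np

# Hypothetical 2-D points with made-up class labels 0/1
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0]])
y = np.array([0, 0, 1])
query = np.array([1.5, 1.5])

# Euclidean distance from the query to every point
dists = np.sqrt(((X - query) ** 2).sum(axis=1))

# Index and label of the nearest neighbor
nearest = dists.argmin()
print(nearest, y[nearest])   # → 0 0 (the query sits next to the class-0 points)
```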

Dependencies

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline

Distance metric

  • Euclidean distance
  • Each explanatory variable is standardized (z-scored) first
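
The standardization step can be sketched by hand (hypothetical values, not the dataset used below): z-scoring a column gives it zero mean and unit variance, so no single variable dominates the Euclidean distance.

```python
import numpy as np

# Hypothetical column whose scale differs from the other variables
x = np.array([60.0, 85.5, 64.8, 51.6, 87.0])
z = (x - x.mean()) / x.std()   # z-score: subtract mean, divide by std
print(z.mean(), z.std())       # mean ≈ 0, std = 1
```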

Choosing K

K is the number of nearest neighbors consulted for each data point; the point is assigned to whichever class is most common among those K neighbors. For a two-class problem, K should be odd so that the vote cannot tie.

The criterion for choosing K is the smallest error on the test set.

  • A low K captures local structure, but also picks up noise
  • A high K smooths out noise, but fails to reflect local structure

When K = N, every point is simply assigned the majority class of the entire dataset. This serves as a naive benchmark.
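
That naive benchmark can be sketched with made-up labels: predict the majority class for everything and measure the resulting accuracy.

```python
from collections import Counter

# Hypothetical labels; the 7/5 split is made up for illustration
labels = ['Owner'] * 7 + ['Nonowner'] * 5
majority = Counter(labels).most_common(1)[0][0]
baseline_accuracy = labels.count(majority) / len(labels)
print(majority, baseline_accuracy)   # → Owner 0.5833333333333334
```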

Example: Riding Mowers

Training and test sets

Predict whether an instance owns a riding mower from its income (Income) and lot size (Lot_Size).

mower_df = pd.read_csv('../big_data/datasets/RidingMowers.csv')
mower_df['Number'] = mower_df.index + 1
mower_df.tail()

    Income  Lot_Size Ownership  Number
19    66.0      18.4  Nonowner      20
20    47.4      16.4  Nonowner      21
21    33.0      18.8  Nonowner      22
22    51.0      14.0  Nonowner      23
23    63.0      14.8  Nonowner      24

Split the data into a training set and a test set.

test_frac = 0.4      # fraction of the data held out as the test set
train_data, test_data = train_test_split(mower_df, test_size=test_frac)

Standardization

Next, standardize the relevant columns. The subtlety is that we only know the mean and variance of the training set and (pretend we) cannot observe those of the test set. So we first "learn" the standardizing transform from the training set, then apply the learned transform to the full dataset.

# Learn the transform from the training set
scaler = preprocessing.StandardScaler()
scaler.fit(train_data[['Income', 'Lot_Size']])

# Apply it to the relevant columns of the full dataset
normalized_cols = scaler.transform(mower_df[['Income', 'Lot_Size']])
normalized_cols = pd.DataFrame(normalized_cols, columns = ['zIncome', 'zLot_Size'])

# Reassemble the dataset
unchanged_cols = mower_df[['Ownership', 'Number']]
mower_norm = pd.concat([normalized_cols, unchanged_cols], axis=1)

mower_norm.head()

   zIncome  zLot_Size Ownership  Number
0    -0.19      -0.19     Owner       1
1     1.08      -0.87     Owner       2
2     0.05       1.16     Owner       3
3    -0.12       0.82     Owner       4
4     1.16       2.01     Owner       5

Next, re-extract the training and test sets. The frames returned by train_test_split() carry an .index attribute: an array of each row's label in the original full dataset. Because the default index here equals row position, these labels can be used to pull the same rows out of the transformed dataset.

# Row positions are unchanged, so reuse the original split's indices
train_norm = mower_norm.iloc[train_data.index]
test_norm = mower_norm.iloc[test_data.index]
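
The assumption this relies on (a default index equal to row position, so .iloc with the split's index labels recovers exactly the split's rows) can be checked on a tiny hypothetical frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with a default RangeIndex
df = pd.DataFrame({'a': [10, 20, 30, 40]})
tr, te = train_test_split(df, test_size=0.5, random_state=0)

# Position-based lookup with the split's index labels
# returns exactly the split's rows, in the same order
print(df.iloc[tr.index].equals(tr))   # → True
```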

Prediction set

Create a small dataset to predict on.

predict_df = pd.DataFrame([
    {'Income': 60, 'Lot_Size': 20},
    {'Income': 90, 'Lot_Size': 22},
    {'Income': 75, 'Lot_Size': 18}
])

Visualization

fig1, ax = plt.subplots()

# Plot the training set
subset_owner = train_data[train_data['Ownership'] == 'Owner']
subset_nonowner = train_data[train_data['Ownership'] == 'Nonowner']
ax.scatter(subset_owner['Income'], subset_owner['Lot_Size'], color='r', marker='o', label='Owner')
ax.scatter(subset_nonowner['Income'], subset_nonowner['Lot_Size'], color='b', marker='o', label='Nonowner')

# Plot the test set
subset_owner = test_data[test_data['Ownership'] == 'Owner']
subset_nonowner = test_data[test_data['Ownership'] == 'Nonowner']
ax.scatter(subset_owner['Income'], subset_owner['Lot_Size'], color='r', marker='$t$')
ax.scatter(subset_nonowner['Income'], subset_nonowner['Lot_Size'], color='b', marker='$t$')

# Plot the prediction set
ax.scatter(predict_df['Income'], predict_df['Lot_Size'], color='black', marker='*', label='Predict')

plt.xlabel('Income')
plt.ylabel('Lot Size')
plt.title('($t$ to denote test set)')
plt.legend()
plt.show()

(Figure: Income vs. Lot Size scatter plot of the training, test, and prediction sets)

Finding the best K

The best K is the one that maximizes accuracy on the test set.

# Set up x and y for the training and test sets
train_norm_x = train_norm[['zIncome', 'zLot_Size']]
train_norm_y = train_norm[['Ownership']]

test_norm_x = test_norm[['zIncome', 'zLot_Size']]
test_norm_y = test_norm[['Ownership']]

Note that the training set has 14 rows, so the largest possible K is 14.

train_norm_x.shape
(14, 2)

Loop K from 1 to 14, train a model for each value, compute the test-set accuracy, and collect the results in a list of dicts.

# List to store the test-set accuracy for each K
results = []
# Loop over K
for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_norm_x, train_norm_y.values.ravel())
    k_test_accuracy = accuracy_score(test_norm_y, knn.predict(test_norm_x))
    
    results.append({'k':k, 'accuracy': k_test_accuracy})
# Convert the list of dicts into a pd.DataFrame
results = pd.DataFrame(results)

results

     k  accuracy
0    1       0.5
1    2       0.3
2    3       0.5
3    4       0.3
4    5       0.5
5    6       0.2
6    7       0.2
7    8       0.2
8    9       0.2
9   10       0.2
10  11       0.2
11  12       0.2
12  13       0.2
13  14       0.2

Note: when fitting, y starts out as a one-column pd.DataFrame; the .values attribute gives a (14, 1) np.array, and the .ravel() method flattens it into a 1-D numpy array, which is the shape fit() expects.
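
The reshaping can be seen on a small hypothetical frame standing in for train_norm_y:

```python
import pandas as pd

# Hypothetical one-column frame of class labels
df = pd.DataFrame({'Ownership': ['Owner', 'Nonowner', 'Owner']})
arr = df.values     # 2-D column vector
flat = arr.ravel()  # flattened 1-D array, the shape fit() expects
print(arr.shape, flat.shape)   # → (3, 1) (3,)
```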

Accuracy peaks at K = 1, 3, and 5; K = 3 looks like the most sensible choice.
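
Rather than reading the table by eye, the tied-best K values can be pulled out programmatically; a sketch using a toy accuracy table shaped like `results`:

```python
import pandas as pd

# Hypothetical accuracy table mimicking the structure of `results`
results = pd.DataFrame({'k': [1, 2, 3, 4, 5],
                        'accuracy': [0.5, 0.3, 0.5, 0.3, 0.5]})

# All K values tied at the maximum accuracy
best = results.loc[results['accuracy'] == results['accuracy'].max(), 'k']
print(best.tolist())   # → [1, 3, 5]
```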

Prediction

Standardize the full dataset and the prediction set.

scaler = preprocessing.StandardScaler()
scaler.fit(mower_df[['Income', 'Lot_Size']])

mower_norm_x = scaler.transform(mower_df[['Income', 'Lot_Size']])
mower_norm_y = mower_df[['Ownership']].values.ravel()

predict_norm_x = scaler.transform(predict_df)

Train on the full dataset:

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(mower_norm_x, mower_norm_y)
KNeighborsClassifier(n_neighbors=3)

Find each prediction item's three (k = 3) nearest neighbors, returning their distances and indices.

distances, indices = knn.kneighbors(predict_norm_x)
print("Distances: ")
print(distances, "\n")
print("Indices: ")
print(indices)
Distances: 
[[0.34532669 0.46448259 0.50133206]
 [0.40791006 0.52801632 0.69065338]
 [0.49402275 0.49402275 0.62479509]] 

Indices: 
[[ 3  8 13]
 [ 7  9  4]
 [16 19 14]]

Predict:

predictions = knn.predict(predict_norm_x)
for index, prediction in enumerate(predictions):
    print(64 * '-')
    print(prediction)                           # predicted class
    print(predict_df.iloc[index])               # values of prediction item `index` (index = 0, 1, 2)
    print(mower_df.iloc[indices[index,:]])      # values of that item's nearest neighbors
    print(mower_norm.iloc[indices[index,:]])    # standardized values of those neighbors
----------------------------------------------------------------
Owner
Income      60
Lot_Size    20
Name: 0, dtype: int64
    Income  Lot_Size Ownership  Number
3     61.5      20.8     Owner       4
8     69.0      20.0     Owner       9
13    52.8      20.8  Nonowner      14
    zIncome  zLot_Size Ownership  Number
3     -0.12       0.82     Owner       4
8      0.26       0.49     Owner       9
13    -0.55       0.82  Nonowner      14
----------------------------------------------------------------
Owner
Income      90
Lot_Size    22
Name: 1, dtype: int64
   Income  Lot_Size Ownership  Number
7    82.8      22.4     Owner       8
9    93.0      20.8     Owner      10
4    87.0      23.6     Owner       5
   zIncome  zLot_Size Ownership  Number
7     0.95       1.50     Owner       8
9     1.46       0.82     Owner      10
4     1.16       2.01     Owner       5
----------------------------------------------------------------
Nonowner
Income      75
Lot_Size    18
Name: 2, dtype: int64
    Income  Lot_Size Ownership  Number
16    84.0      17.6  Nonowner      17
19    66.0      18.4  Nonowner      20
14    64.8      17.2  Nonowner      15
    zIncome  zLot_Size Ownership  Number
16     1.01      -0.53  Nonowner      17
19     0.11      -0.19  Nonowner      20
14     0.05      -0.70  Nonowner      15

Visualization: predictions

For plotting, append the predictions (an np.array) to the prediction pd.DataFrame.

predict_df['Ownership'] = predictions
predict_df

   Income  Lot_Size Ownership
0      60        20     Owner
1      90        22     Owner
2      75        18  Nonowner

fig2, ax = plt.subplots()

# Plot the full dataset
subset_owner = mower_df[mower_df['Ownership'] == 'Owner']
subset_nonowner = mower_df[mower_df['Ownership'] == 'Nonowner']
ax.scatter(subset_owner['Income'], subset_owner['Lot_Size'], color='r', label='Owner')
ax.scatter(subset_nonowner['Income'], subset_nonowner['Lot_Size'], color='b', label='Nonowner')

# Plot the prediction set
subset_owner = predict_df[predict_df['Ownership'] == 'Owner']
subset_nonowner = predict_df[predict_df['Ownership'] == 'Nonowner']
ax.scatter(subset_owner['Income'], subset_owner['Lot_Size'], color='r', marker='*')
ax.scatter(subset_nonowner['Income'], subset_nonowner['Lot_Size'], color='b', marker='*')

# Draw lines from each prediction point to its neighbors
for i in range(3):                   # loop over prediction items
    for j in range(3):               # loop over the three neighbors
        x_values = [mower_df['Income'][indices[i, j]], predict_df['Income'][i]]
        y_values = [mower_df['Lot_Size'][indices[i, j]], predict_df['Lot_Size'][i]]
        ax.plot(x_values, y_values, color = 'black', alpha = 0.3)

plt.xlabel('Income')
plt.ylabel('Lot Size')
plt.legend()
plt.show()

(Figure: Income vs. Lot Size scatter with each prediction point linked to its three nearest neighbors)