
Logistic Regression and Support Vector Machines

Objectives


  1. Compare gradient descent with mini-batch stochastic gradient descent and understand how they differ and how they are related.
  2. Compare logistic regression with linear classification and understand how they differ and how they are related.
  3. Deepen your understanding of how SVMs work and put them into practice on a relatively large dataset.

Data


This lab uses the a9a dataset from LIBSVM Data, which contains 32,561 training samples and 16,281 testing samples, each with 123 features. Download the training and test files yourself. When loading the data you may find that the feature dimension comes out wrong: the trailing feature columns are all zero and are therefore dropped by the sparse format. You can either append the missing columns to the downloaded files by hand, or simply pass n_features=123 when reading the dataset.
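For example, a minimal loading sketch (assuming the downloaded files are named a9a and a9a.t; adjust the paths to whatever you saved them as):

from sklearn.datasets import load_svmlight_file
# n_features=123 prevents the trailing all-zero feature columns from being dropped
X_train, y_train = load_svmlight_file('a9a', n_features=123)    # 32561 x 123
X_test, y_test = load_svmlight_file('a9a.t', n_features=123)    # 16281 x 123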

Procedure


Logistic Regression with Mini-Batch Stochastic Gradient Descent

  1. Load the training and validation sets.
  2. Initialize the logistic regression model parameters (all-zeros, uniform random, or normal-distribution initialization are all reasonable choices).
  3. Choose a loss function and derive its gradient; see the course slides for the full derivation (a reference formulation is given after this list).
  4. Pick a batch_size, randomly draw that many samples, and compute the gradient of the loss function on this mini-batch.
  5. Update the model parameters with the SGD optimizer; you are encouraged to also try the Adam optimizer.
  6. Choose a suitable threshold, mark validation samples whose computed score exceeds the threshold as positive and the rest as negative, then evaluate on the validation set and record the loss value Lvalidation.
  7. Repeat steps 4-6 a number of times and plot how Lvalidation changes with the number of iterations.
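The slides remain the authoritative reference for step 3; for convenience, one common formulation with labels $y_i \in \{-1, +1\}$ (the convention a9a uses, and the one the sample code below follows) is

$$L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\frac{1+y_i}{2}\log h_\theta(x_i) + \frac{1-y_i}{2}\log\bigl(1-h_\theta(x_i)\bigr)\right], \qquad h_\theta(x) = \sigma(\theta^\top x),$$

with gradient

$$\nabla_\theta L = \frac{1}{n}\sum_{i=1}^{n} x_i\left(h_\theta(x_i) - \frac{1+y_i}{2}\right) = \frac{1}{n}X^\top\!\left(\sigma(X\theta) - \frac{1+y}{2}\right).$$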

Linear Classification (SVM) with Mini-Batch Stochastic Gradient Descent

  1. Load the training and validation sets.
  2. Initialize the support vector machine model parameters (all-zeros, uniform random, or normal-distribution initialization are all reasonable choices).
  3. Choose a loss function and derive its gradient; see the course slides for the full derivation (a reference formulation is given after this list).
  4. Pick a batch_size, randomly draw that many samples, and compute the gradient of the loss function on this mini-batch.
  5. Update the model parameters with the SGD optimizer; you are encouraged to also try the Adam optimizer.
  6. Choose a suitable threshold, mark validation samples whose computed score exceeds the threshold as positive and the rest as negative, then evaluate on the validation set and record the loss value Lvalidation.
  7. Repeat steps 4-6 a number of times and plot how Lvalidation changes with the number of iterations.
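Again deferring to the slides for the full derivation, the soft-margin objective that the sample code below approximates is

$$J(\theta) = \frac{1}{2}\lVert\theta\rVert^2 + \frac{C}{n}\sum_{i=1}^{n}\max\bigl(0,\; 1 - y_i\,\theta^\top x_i\bigr),$$

whose subgradient (excluding the bias component from the regularization term) is

$$\nabla_\theta J = \theta - \frac{C}{n}\sum_{i:\; y_i\theta^\top x_i < 1} y_i x_i.$$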

Sample Code


Import dependencies

import numpy as np
import pandas as pd
import sklearn.datasets as sd
import sklearn.model_selection as sms
import matplotlib.pyplot as plt
import random

Load the dataset

# Load the experiment data; n_features=123 keeps the trailing all-zero feature columns
X, y = sd.load_svmlight_file('a9a.txt',n_features = 123)
# Split the data into a training set and a validation set
X_train, X_valid, y_train, y_valid = sms.train_test_split(X, y)
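If you want the split to be reproducible across runs, you can also pass random_state (e.g. random_state=0) to train_test_split.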
# Convert the sparse matrices to ndarrays and reshape the labels into column vectors
X_train = X_train.toarray()
X_valid = X_valid.toarray()
y_train = y_train.reshape(len(y_train),1)
y_valid = y_valid.reshape(len(y_valid),1)
# Prepend a column of ones so that theta[0] acts as the bias term
X_train = np.concatenate((np.ones((X_train.shape[0],1)), X_train), axis = 1)
X_valid = np.concatenate((np.ones((X_valid.shape[0],1)), X_valid), axis = 1)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
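With train_test_split's default 75/25 split, this should print shapes close to (24420, 124), (8141, 124), (24420, 1), (8141, 1): 123 features plus the bias column.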

Logistic Regression

Define the sigmoid function

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
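Note that np.exp(-z) can overflow for inputs with large magnitude; an optional, numerically safer variant (not part of the original post) clips the input first:

def sigmoid_stable(z):
    # clipping keeps np.exp from overflowing for very negative inputs
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))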

Define the logistic loss function

def logistic_loss(X, y, theta):
    # cross-entropy for labels y in {-1, +1}, with h = sigmoid(X @ theta)
    hx = sigmoid(X.dot(theta))
    cost = np.multiply((1 + y), np.log(hx)) + np.multiply((1 - y), np.log(1 - hx))
    return -cost.mean() / 2

Compute the current loss

theta = np.zeros((X_train.shape[1],1))
logistic_loss(X_train, y_train, theta)
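With theta initialized to zeros the sigmoid outputs 0.5 everywhere, so the initial loss should be ln 2 ≈ 0.693.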

Define the logistic gradient function

def logistic_gradient(X, y, theta):
    # gradient of the loss above for labels in {-1, +1}: (1 + y) / 2 maps the labels to {0, 1}
    # (summed over the batch; divide by len(y) if you prefer the mean gradient)
    return X.T.dot(sigmoid(X.dot(theta)) - (1 + y) / 2)

Define the logistic score function

def logistic_score(X, y, theta):
    # classify with threshold 0.5 on the sigmoid output and report accuracy
    hx = sigmoid(X.dot(theta))
    hx[hx >= 0.5] = 1
    hx[hx < 0.5] = -1
    hx = (hx == y)
    return np.mean(hx)

Define the logistic descent function

def logistic_descent(X, y, theta, alpha, num_iters, batch_size, X_valid, y_valid):
    loss_train = np.zeros((num_iters, 1))
    loss_valid = np.zeros((num_iters, 1))
    # stack labels and features so that each row can be sampled as a whole
    data = np.concatenate((y, X), axis=1)
    for i in range(num_iters):
        # draw a random mini-batch; column 0 is y, columns 1:125 are the 124 features (bias included)
        sample = np.matrix(random.sample(data.tolist(), batch_size))
        grad = logistic_gradient(sample[:, 1:125], sample[:, 0], theta)
        theta = theta - alpha * grad
        loss_train[i] = logistic_loss(X, y, theta)
        loss_valid[i] = logistic_loss(X_valid, y_valid, theta)
    return theta, loss_train, loss_valid
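Step 5 of the procedure also encourages trying Adam. The variant below is a minimal sketch of that idea (the name logistic_descent_adam and the default hyperparameters are mine, not from the original post); it reuses logistic_gradient and logistic_loss from above and can be called exactly like logistic_descent, typically with a larger learning rate such as 0.001:

def logistic_descent_adam(X, y, theta, alpha, num_iters, batch_size, X_valid, y_valid,
                          beta1=0.9, beta2=0.999, eps=1e-8):
    loss_train = np.zeros((num_iters, 1))
    loss_valid = np.zeros((num_iters, 1))
    data = np.concatenate((y, X), axis=1)
    m = np.zeros_like(theta)  # first moment: moving average of gradients
    v = np.zeros_like(theta)  # second moment: moving average of squared gradients
    for i in range(num_iters):
        sample = np.matrix(random.sample(data.tolist(), batch_size))
        grad = logistic_gradient(sample[:, 1:125], sample[:, 0], theta)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * np.multiply(grad, grad)
        m_hat = m / (1 - beta1 ** (i + 1))  # bias-corrected moment estimates
        v_hat = v / (1 - beta2 ** (i + 1))
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        loss_train[i] = logistic_loss(X, y, theta)
        loss_valid[i] = logistic_loss(X_valid, y_valid, theta)
    return theta, loss_train, loss_valid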

Run gradient descent

theta = np.zeros((X_train.shape[1],1))
alpha = 0.0001
num_iters = 200
opt_theta, loss_train, loss_valid = logistic_descent(X_train, y_train, theta, alpha, num_iters, 64, X_valid, y_valid)
loss_train.max(), loss_train.min(), loss_valid.max(), loss_valid.min()

logistic_score(X_valid, y_valid, opt_theta)

iteration = np.arange(0, num_iters, step = 1)
fig, ax = plt.subplots(figsize = (12,8))
ax.set_title('Train vs Valid')
ax.set_xlabel('iteration')
ax.set_ylabel('loss')
plt.plot(iteration, loss_train, 'b', label='Training Set Loss')
plt.plot(iteration, loss_valid, 'r', label='Validation Set Loss')
# plt.plot(iteration, scores, 'g', label='Score on Validation Set')
plt.legend()
plt.show()

SVM

Define the hinge loss function

def hinge_loss(X, y, theta, C):
    # soft-margin objective: C * mean hinge loss + L2 regularization (the bias theta[0] is not regularized)
    loss = np.maximum(0, 1 - np.multiply(y, X.dot(theta))).mean()
    reg = np.multiply(theta[1:], theta[1:]).sum() / 2
    return C * loss + reg

Compute the current loss

theta = np.random.random((X_train.shape[1],1))
C = 0.4
hinge_loss(X_train, y_train, theta, C)

Define the hinge gradient function

def hinge_gradient(X, y, theta, C):
    # samples with error == 0 satisfy the margin and contribute nothing to the hinge term
    error = np.maximum(0, 1 - np.multiply(y, X.dot(theta)))
    index = np.where(error == 0)[0]
    x = X.copy()
    x[index, :] = 0
    grad = theta - C * x.T.dot(y) / len(y)
    # theta[0] is the bias, so remove its regularization contribution
    grad[0] = grad[0] - theta[0]
    return grad
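A quick way to sanity-check hinge_gradient (an optional sketch, not from the original post; numerical_gradient is a name I chose) is to compare it with central finite differences of hinge_loss on a small batch; away from the hinge's kink the two should agree to several decimal places:

def numerical_gradient(X, y, theta, C, eps=1e-6):
    # central finite-difference estimate of d hinge_loss / d theta
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        t_plus = theta.copy()
        t_minus = theta.copy()
        t_plus[j] += eps
        t_minus[j] -= eps
        grad[j] = (hinge_loss(X, y, t_plus, C) - hinge_loss(X, y, t_minus, C)) / (2 * eps)
    return grad

# compare the analytical and numerical gradients on a small random batch
idx = np.random.choice(X_train.shape[0], 32, replace=False)
print(np.abs(hinge_gradient(X_train[idx], y_train[idx], theta, C) -
             numerical_gradient(X_train[idx], y_train[idx], theta, C)).max())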

Define the svm descent function

def svm_descent(X, y, theta, alpha, num_iters, batch_size, X_valid, y_valid, C):
    loss_train = np.zeros((num_iters, 1))
    loss_valid = np.zeros((num_iters, 1))
    # stack labels and features so that each row can be sampled as a whole
    data = np.concatenate((y, X), axis=1)
    for i in range(num_iters):
        # draw a random mini-batch; column 0 is y, columns 1:125 are the 124 features (bias included)
        sample = np.matrix(random.sample(data.tolist(), batch_size))
        grad = hinge_gradient(sample[:, 1:125], sample[:, 0], theta, C)
        theta = theta - alpha * grad
        loss_train[i] = hinge_loss(X, y, theta, C)
        loss_valid[i] = hinge_loss(X_valid, y_valid, theta, C)
    return theta, loss_train, loss_valid

Define the svm score function

def svm_score(X, y, theta):
    # the SVM decision rule is the sign of X @ theta, i.e. threshold at 0
    hx = X.dot(theta)
    hx[hx >= 0] = 1
    hx[hx < 0] = -1
    hx = (hx == y)
    return np.mean(hx)

Run gradient descent

theta = np.random.random((X_train.shape[1],1))
alpha = 0.01
num_iters = 500
opt_theta, loss_train, loss_valid = svm_descent(X_train, y_train, theta, alpha, num_iters, 64, X_valid, y_valid, C)
loss_train.max(), loss_train.min(), loss_valid.max(), loss_valid.min()

svm_score(X_valid, y_valid, opt_theta)

iteration = np.arange(0, num_iters, step = 1)
fig, ax = plt.subplots(figsize = (12,8))
ax.set_title('Train vs Valid')
ax.set_xlabel('iteration')
ax.set_ylabel('loss')
plt.plot(iteration, loss_train, 'b', label='Training Set Loss')
plt.plot(iteration, loss_valid, 'r', label='Validation Set Loss')
plt.legend()
plt.show()

Author: WJZheng
Link: https://wellenzheng.github.io/2020/04/09/%E9%80%BB%E8%BE%91%E5%9B%9E%E5%BD%92%E5%92%8C%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.
