今日立冬,刚好一杯咖啡的工夫,看一段机器学习的几行经典代码,放松一下。


代码很简单,但却不失机器学习标准流程的必要步骤。

使用SVM(支持向量机)对肺癌数据集进行学习,并对测试集进行预测。
数据集:
https://raw.githubusercontent.com/aviralb13/git-codes/main/datas/lung%20cancer.csv
1. 导入包,并读取数据集并获取前5行数据:
import pandas as pd
import numpy as np
URL = 'https://raw.githubusercontent.com/aviralb13/git-codes/main/datas/lung%20cancer.csv'
data = pd.read_csv(URL)
data.head()

2. 对离散非数据特征,使用Onehot特征编码:
one_hot = pd.get_dummies(data['GENDER'])
data = data.drop('GENDER',axis = 1)
data = data.join(one_hot)
data.head()
3. 导入预处理库,并对目标类Label编码:
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
data['LUNG_CANCER'] = label.fit_transform(data['LUNG_CANCER'])
data.head()
4. 获取并设定数据集特征X,以及预测目标值y:
features = ['AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE','CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING','ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH','SWALLOWING DIFFICULTY', 'CHEST PAIN', 'F', 'M']
x = data[features]
y = data['LUNG_CANCER']
5. 导入Split库,并对对原数据集分为训练和测试数据:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y)
6. 导入SVM算法,并对训练数据进行拟合,训练模型:
from sklearn.svm import LinearSVC
SVC = LinearSVC()
SVC.fit(train_x,train_y)
7. 导入模型评估库,并使用已经训练的模型对测试数据进行预测:
from sklearn.metrics import accuracy_score
prediction = SVC.predict(test_x)
accuracy_score(test_y, prediction)
预测准确率(accuracy)如下:
0.9358974358975359
咖啡喝完,收工!
完整源码如下:
https://github.com/aviralb13/git-codes/blob/main/datas/lung%20cancer.csv
原创文章,作者:门童靖博士,如若转载,请注明出处:https://www.agent-universe.cn/2022/11/12704.html