Text Classification in Practice
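What Is Text Classification
Text classification is the task of automatically assigning text to predefined categories. It is widely used in applications such as spam filtering and sentiment analysis.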
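Naive Bayes Classifier
Naive Bayes applies Bayes' theorem under the assumption that features are conditionally independent given the class. It is cheap to train and often gives a solid baseline: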
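from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: label 0 = technology, label 1 = sports
texts = [
    "机器学习算法研究",  # machine learning algorithm research
    "深度神经网络",      # deep neural networks
    "自然语言处理",      # natural language processing
    "足球比赛精彩",      # an exciting football match
    "篮球世界杯",        # basketball World Cup
    "体育新闻报道",      # sports news coverage
]
labels = [0, 0, 0, 1, 1, 1]

# Chinese has no spaces between words, so the default word tokenizer would
# treat each sentence as a single token. Character n-grams avoid that
# (a word segmenter is another option).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Stratify so both classes appear in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
accuracy = nb_clf.score(X_test, y_test)
print(f"Naive Bayes accuracy: {accuracy:.2f}")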
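To make the Bayes' theorem step concrete, here is a minimal sketch of multinomial Naive Bayes on raw word counts with add-one (Laplace) smoothing, written without scikit-learn. The tiny English corpus, the helper names `train_nb` and `predict_nb`, and the smoothing constant are illustrative assumptions rather than part of the example above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, doc_labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(doc_labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, doc_labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(doc, class_counts, word_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior for doc."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Start from the log prior, log P(class)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in doc.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add log P(word | class) with add-alpha smoothing
            count = word_counts[label][token]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, chosen only to exercise the functions
toy_docs = [
    "neural network training",
    "machine learning model",
    "football match tonight",
    "basketball world cup",
]
toy_labels = ["tech", "tech", "sports", "sports"]
nb_model = train_nb(toy_docs, toy_labels)
print(predict_nb("learning neural model", *nb_model))  # tech
print(predict_nb("world cup match", *nb_model))        # sports
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on longer documents.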
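SVM Classifier
Support vector machines tend to perform well on text classification, where feature vectors are high-dimensional and sparse: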
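import numpy as np
from sklearn.svm import SVC

# Reuses X_train, X_test, y_train, y_test and vectorizer from the Naive Bayes example
svm_clf = SVC(kernel="linear")
svm_clf.fit(X_train, y_train)
accuracy = svm_clf.score(X_test, y_test)
print(f"SVM accuracy: {accuracy:.2f}")

# With a linear kernel, each feature has one weight. coef_ is a sparse matrix
# when the model was fit on sparse input, so convert it to a flat dense array.
feature_names = vectorizer.get_feature_names_out()
coef = svm_clf.coef_.toarray().ravel()
# Five features with the largest weights, i.e. pushing hardest toward label 1
top_indices = np.argsort(coef)[-5:]
print("Top features:", [feature_names[i] for i in top_indices])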
Deep Learning Approach
An LSTM-based text classifier in PyTorch:
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        # Map token ids to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: inputs are shaped (batch, seq_len, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: LongTensor of token ids, shape (batch, seq_len)
        embeds = self.embedding(x)
        lstm_out, (hidden, _) = self.lstm(embeds)
        # hidden[-1] is the final hidden state of the last LSTM layer
        output = self.fc(hidden[-1])  # logits, shape (batch, num_classes)
        return output

model = LSTMClassifier(vocab_size=10000, embedding_dim=128,
                       hidden_dim=256, num_classes=2)
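The model above expects batches of integer token ids rather than raw strings. The sketch below shows one way to build a vocabulary and encode text into fixed-length id sequences. The helper names `build_vocab` and `encode`, the reserved ids 0 (padding) and 1 (unknown), and whitespace tokenization are assumptions made for illustration.

```python
PAD_ID = 0  # assumed id for padding positions
UNK_ID = 1  # assumed id for out-of-vocabulary tokens

def build_vocab(corpus):
    """Assign an integer id to every distinct whitespace-separated token."""
    token_to_id = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for text in corpus:
        for token in text.split():
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def encode(text, token_to_id, max_len):
    """Convert text to exactly max_len ids, truncating or right-padding."""
    ids = [token_to_id.get(token, UNK_ID) for token in text.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

# Hypothetical two-sentence corpus
corpus = ["a great movie", "a boring movie overall"]
token_to_id = build_vocab(corpus)
batch = [encode(text, token_to_id, max_len=5) for text in corpus]
print(batch)  # [[2, 3, 4, 0, 0], [2, 5, 4, 6, 0]]
```

Wrapping the result as `torch.tensor(batch, dtype=torch.long)` gives a tensor of shape (batch, seq_len) that can be passed to `model`, provided `vocab_size` is at least `len(token_to_id)`. With right-padding, the final hidden state is computed after the padding steps; `torch.nn.utils.rnn.pack_padded_sequence` is the usual way to avoid that.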
Text Preprocessing
def preprocess_text(text):
    # Lowercase, drop punctuation and symbols, and collapse runs of whitespace
    text = text.lower()
    text = ''.join(c for c in text if c.isalnum() or c.isspace())
    tokens = text.split()
    return ' '.join(tokens)

sample = "This is a GREAT movie!!!"
processed = preprocess_text(sample)
print(processed)  # this is a great movie
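Evaluation Metrics
from sklearn.metrics import classification_report, confusion_matrix

y_pred = nb_clf.predict(X_test)
# zero_division=0 avoids warnings when a class is never predicted,
# which is likely on a test set this small
print(classification_report(y_test, y_pred, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))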
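The numbers in `classification_report` follow directly from the confusion-matrix counts. As a minimal sketch, the function below computes precision, recall, and F1 for one class in plain Python. The name `binary_metrics` and the sample labels are made up for illustration and are not results from the models above.

```python
def binary_metrics(true_labels, pred_labels, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define a metric as 0.0 when its denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, chosen only to exercise the function
demo_true = [1, 1, 1, 0, 0, 0]
demo_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = binary_metrics(demo_true, demo_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

Here two of the three predicted positives are correct (precision 2/3) and two of the three actual positives are found (recall 2/3), so F1, their harmonic mean, is also 2/3.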
Cross-Validation
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation on the full dataset
scores = cross_val_score(nb_clf, X, labels, cv=3)
print(f"Cross-validation score: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")
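In the snippet above, the TF-IDF vocabulary and IDF weights were fitted on the whole corpus before cross-validation, so the features in every fold already reflect that fold's test documents. A common way to avoid this is to put the vectorizer and classifier in one pipeline and cross-validate on the raw texts. The following is a sketch under that assumption; the small English corpus and variable names are illustrative only, and scores on six documents are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative corpus: label 0 = technology, label 1 = sports
cv_texts = [
    "machine learning algorithm research",
    "deep neural network",
    "natural language processing",
    "exciting football match",
    "basketball world cup",
    "sports news report",
]
cv_labels = [0, 0, 0, 1, 1, 1]

# The vectorizer is refitted on the training part of every fold,
# so test documents never influence the vocabulary or IDF weights
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv_scores = cross_val_score(pipeline, cv_texts, cv_labels, cv=3)
print(f"Pipeline CV accuracy: {cv_scores.mean():.2f}")
```

For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, so each of the three folds here holds out one document per class.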
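Summary
Text classification is a core NLP task. Approaches range from traditional Naive Bayes and SVM to deep learning methods, and the right choice depends on the size of the dataset and the complexity of the task.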