Getting Started with NLP Basics
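What Is NLP
Natural language processing (NLP) is a major branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.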
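Text Preprocessing
Text preprocessing is the first step in an NLP pipeline and includes operations such as cleaning and tokenization: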
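import re

def clean_text(text):
    """Lowercase the text, strip HTML tags and punctuation, and normalize whitespace."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', '', text)       # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

sample_text = "Hello, World! This is NLP <b>basics</b>."
cleaned = clean_text(sample_text)
print(cleaned)  # hello world this is nlp basics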
Tokenization
Tokenization is the process of splitting text into words:
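import jieba

# Chinese has no spaces between words, so a segmenter such as jieba is needed.
chinese_text = "自然语言处理是人工智能的重要分支"  # "NLP is an important branch of AI"
words = jieba.lcut(chinese_text)
print(words)

def tokenize_english(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

english_text = "Natural Language Processing is Important"
tokens = tokenize_english(english_text)
print(tokens)  # ['natural', 'language', 'processing', 'is', 'important']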
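Whitespace splitting leaves punctuation attached to words, so "important," and "important" would count as different tokens. The following is a minimal sketch of a regex-based alternative, assuming that runs of word characters are an acceptable definition of a token. The name `tokenize_with_regex` is hypothetical and not part of the original tutorial.

```python
import re

def tokenize_with_regex(text):
    """Lowercase the text and return runs of word characters as tokens."""
    # \w+ matches letters, digits, and underscores, so punctuation is dropped.
    return re.findall(r"\w+", text.lower())

regex_tokens = tokenize_with_regex("Natural Language Processing is important, right?")
print(regex_tokens)
```

Note that this pattern also splits contractions at the apostrophe ("isn't" becomes "isn" and "t"), so it is only a starting point for English text.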
TF-IDF
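TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document: the score rises with how often the word appears in that document and falls with how many documents in the corpus contain it: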
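import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "机器学习是人工智能的分支",   # "Machine learning is a branch of AI"
    "深度学习使用神经网络",       # "Deep learning uses neural networks"
    "自然语言处理处理文本数据",   # "NLP processes text data"
]

# The default token pattern would treat each unsegmented Chinese sentence as a
# single token, so jieba is used as the tokenizer instead.
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
print("Feature terms:", feature_names)
print("TF-IDF matrix shape:", tfidf_matrix.shape)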
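To make the scoring concrete, here is a minimal from-scratch sketch of the textbook formulation, where tf is a term's count divided by the document length and idf is log(N / df). The function `compute_tfidf` and the `toy_docs` data are hypothetical illustrations. scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def compute_tfidf(tokenized_docs):
    """Return one {term: tf-idf} dict per document (textbook formulation)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(tokens)                   # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])    # rarer terms get a larger weight
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

# Hypothetical pre-tokenized corpus, used only for illustration.
toy_docs = [
    ["i", "love", "machine", "learning"],
    ["i", "love", "deep", "learning"],
    ["deep", "learning", "is", "great"],
]
tfidf_scores = compute_tfidf(toy_docs)
for doc_scores in tfidf_scores:
    print(doc_scores)
```

In this toy corpus "learning" occurs in every document, so its idf is log(1) = 0 and it scores zero everywhere, while "machine" occurs in only one document and receives the highest weight there.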
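Bag-of-Words Model
The bag-of-words (BoW) model is one of the simplest text representations: each document becomes a vector of word counts, and word order is ignored: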
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love machine learning",
    "I love deep learning",
    "Deep learning is great",
]

# By default CountVectorizer lowercases the text and ignores one-character
# tokens, so "I" does not appear in the vocabulary.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", bow_matrix.toarray())
Word Frequency Counting
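from collections import Counter

def word_frequency(text):
    """Count how many times each whitespace-separated word occurs."""
    words = text.split()
    return Counter(words)

text = "the cat sat on the mat the cat"
freq = word_frequency(text)
print("Word frequencies:", freq.most_common(3))  # [('the', 3), ('cat', 2), ('sat', 1)]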
Text Vectorization
import numpy as np

def text_to_vector(text, vocab):
    """Count occurrences of each vocabulary word in the text."""
    words = text.split()
    vector = np.zeros(len(vocab))
    for word in words:
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector

vocab = {"love": 0, "machine": 1, "learning": 2, "deep": 3}
text = "I love machine learning"
vector = text_to_vector(text, vocab)
print("Vector representation:", vector)  # [1. 1. 1. 0.]
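The vocabulary above is written by hand. As a minimal sketch, assuming lowercase whitespace tokenization is sufficient, a vocabulary could instead be derived from a corpus by assigning each new word the next free index. The names `build_vocab` and `corpus` are hypothetical and not part of the original tutorial.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to an index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused index
    return vocab

# Hypothetical corpus, used only for illustration.
corpus = ["I love machine learning", "I love deep learning"]
built_vocab = build_vocab(corpus)
print(built_vocab)  # {'i': 0, 'love': 1, 'machine': 2, 'learning': 3, 'deep': 4}
```

Because this sketch lowercases its input, any text vectorized against the resulting vocabulary should be lowercased the same way, or words such as "I" will not match.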
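Summary
NLP fundamentals are the cornerstone of more complex NLP applications. Mastering text preprocessing, tokenization, and feature extraction lays the groundwork for later text analysis and understanding.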