Introduction to NLP Basics

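What Is NLP

Natural language processing (NLP) is a major branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.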

Text Preprocessing

Text preprocessing is the first step in NLP. It covers operations such as cleaning and tokenization:

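import re

def clean_text(text):
    """Lowercase the text, then strip HTML tags, punctuation, and extra whitespace."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', '', text)       # remove HTML tags such as <b>
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation (\w keeps underscores)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

sample_text = "Hello, World! This is NLP <b>basics</b>."
cleaned = clean_text(sample_text)
print(cleaned)  # hello world this is nlp basics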
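
To make the effect of each cleaning step visible, here is a minimal sketch that applies the same four substitutions one at a time and records the intermediate text. The names `STEPS` and `trace_cleaning` are hypothetical helpers introduced only for this illustration.

```python
import re

# The same four operations as clean_text, kept separate so each effect is visible.
STEPS = [
    ("lowercase", lambda s: s.lower()),
    ("strip HTML tags", lambda s: re.sub(r"<[^>]+>", "", s)),
    ("drop punctuation", lambda s: re.sub(r"[^\w\s]", "", s)),
    ("collapse whitespace", lambda s: re.sub(r"\s+", " ", s).strip()),
]

def trace_cleaning(text):
    """Return a list of (step name, text after that step)."""
    trace = []
    for name, step in STEPS:
        text = step(text)
        trace.append((name, text))
    return trace

for name, result in trace_cleaning("Hello, World! This is NLP <b>basics</b>."):
    print(f"{name:>20}: {result!r}")
```

The order matters: tags have to be stripped before punctuation, otherwise the angle brackets and slash are deleted first and `<b>basics</b>` collapses into `bbasicsb`.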

Tokenization

Tokenization is the process of splitting text into words:

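import jieba  # third-party Chinese word segmentation library

# Chinese has no spaces between words, so a segmenter is needed.
chinese_text = "自然语言处理是人工智能的重要分支"
words = jieba.lcut(chinese_text)  # returns the segmented words as a list
print(words)

def tokenize_english(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

english_text = "Natural Language Processing is Important"
tokens = tokenize_english(english_text)
print(tokens)  # ['natural', 'language', 'processing', 'is', 'important']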
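
Splitting on whitespace leaves punctuation attached to the neighbouring word. A regex-based tokenizer is one possible way around that. The sketch below is an assumption rather than part of the original example; `tokenize_with_regex` is a hypothetical name, and the pattern only covers ASCII letters, digits, and apostrophes.

```python
import re

def tokenize_with_regex(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "Hello, world! It's NLP."
print(sentence.lower().split())       # punctuation stays glued to the words
print(tokenize_with_regex(sentence))  # punctuation is dropped
```

Keeping the apostrophe in the character class preserves contractions such as "it's" as a single token. Whether that is desirable depends on the downstream task.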

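TF-IDF

TF-IDF (term frequency-inverse document frequency) scores how important a term is to one document relative to the rest of the corpus: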

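import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "机器学习是人工智能的分支",
    "深度学习使用神经网络",
    "自然语言处理处理文本数据"
]

# The default token pattern matches runs of word characters, so each unsegmented
# Chinese sentence would become a single token. Segment with jieba instead.
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()
print("Feature terms:", feature_names)
print("TF-IDF matrix shape:", tfidf_matrix.shape)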
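
To show what the score measures, here is a rough sketch that computes the textbook variant by hand: term frequency (count divided by document length) multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is not what scikit-learn computes by default, since `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row, so the numbers will differ. The function name `tf_idf` and the pre-tokenized English sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical sample corpus, already split into tokens.
docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["language", "is", "data"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A term that appears in every document ("is") gets a score of 0 because log(3 / 3) = 0, while a term unique to one document ("machine") scores higher than one shared by two documents ("learning").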

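Bag of Words Model

The bag-of-words (BoW) model is the simplest text representation: each document becomes a vector of word counts, and word order is ignored: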

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love machine learning",
    "I love deep learning",
    "Deep learning is great"
]

# By default CountVectorizer lowercases the text and drops single-character
# tokens, so "I" does not appear in the vocabulary.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", bow_matrix.toarray())
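
The count matrix can also be built with the standard library alone, which makes the mechanics explicit. The sketch below is an approximation written for illustration: `bag_of_words` is a hypothetical helper, and it assumes tokens are lowercase runs of two or more word characters, mirroring `CountVectorizer`'s default settings.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in document i."""
    # Assumption: a token is a lowercase run of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # Sort the vocabulary so every term has a fixed column.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words([
    "I love machine learning",
    "I love deep learning",
    "Deep learning is great",
])
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, so all documents end up as vectors of the same length regardless of how many words they contain.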

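Word Frequency Counting

from collections import Counter

def word_frequency(text):
    """Count how often each whitespace-separated word occurs."""
    words = text.split()
    return Counter(words)

text = "the cat sat on the mat the cat"
freq = word_frequency(text)
print("Word frequencies:", freq.most_common(3))  # [('the', 3), ('cat', 2), ('sat', 1)]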

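Text Vectorization

import numpy as np

def text_to_vector(text, vocab):
    """Map text to a count vector; vocab maps each known word to its index."""
    words = text.split()
    vector = np.zeros(len(vocab))
    for word in words:
        if word in vocab:  # words outside the vocabulary (here "I") are skipped
            vector[vocab[word]] += 1
    return vector

vocab = {"love": 0, "machine": 1, "learning": 2, "deep": 3}
text = "I love machine learning"
vector = text_to_vector(text, vocab)
print("Vector representation:", vector)  # [1. 1. 1. 0.]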

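Summary

These NLP basics are the foundation for building more complex NLP applications. Mastering text preprocessing, tokenization, and feature extraction lays the groundwork for later text analysis and understanding.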