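Lemmatization

I Still Don't Really Get What AI Is, So I'm Studying It for 100 Days: Day 46

For the background, please see the post below.

I Still Don't Really Get What AI Is, So I'm Studying It for 100 Days: Day 0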

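■Today's progress

Understood stemming
Understood lemmatization

■Introduction

I'm continuing to work through 「Pythonではじめる機械学習」 (the Japanese edition of Introduction to Machine Learning with Python, O'Reilly Japan).

To further improve the accuracy of the machine learning models for text data that I've studied so far, I'll refine the tokenization.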

■Stemming

Stemming is a technique that reduces a word to its stem, the core part of the word.

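A basic BoW (leaving aside cleanup steps such as removing typos and noise) treats every distinct surface form as a separate token, even when the forms share a stem, so related words are kept apart no matter how closely they are related.

The drawback is that inflected and derived forms (case, plural, past tense, progressive, and so on) cannot be tied back to one another.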

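Well-known algorithms include the Porter Stemmer, a simple algorithm that strips English suffixes; the Snowball Stemmer, a refinement of it that supports multiple languages; and the Lancaster Stemmer, which cuts words more aggressively and tends to produce shorter stems.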
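The three stemmers can be compared side by side with nltk. This is only a sketch: the sample words are arbitrary, and the outputs are printed rather than stated here because the stemmers do not always agree.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words, chosen to resemble the review vocabulary.
words = ["waste", "wasted", "wasting", "poorly", "excellent"]

# Collect each stemmer's output for every sample word.
results = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in results.items():
    print(f"{name:10s}", stems)
```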

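scikit-learn's CountVectorizer tokenizes text by default, but you can also write your own function and plug it in, either as a custom tokenizer or, as the Porter Stemmer script in this post does, as a preprocessor that normalizes the text before the default tokenization runs.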

Let's try the Porter Stemmer right away.

import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import nltk
from nltk.stem import PorterStemmer

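stemmer = PorterStemmer()

def normalization(text):
    # Lowercase, split on whitespace, stem each token, and rejoin the
    # result so that CountVectorizer can tokenize it as usual afterwards.
    tokens = text.lower().split()
    return ' '.join([stemmer.stem(token) for token in tokens])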

reviews_train = load_files("C:/Users/****/Documents/Python/aclImdb/train")
text_train, y_train = reviews_train.data, reviews_train.target

reviews_test = load_files("C:/Users/****/Documents/Python/aclImdb/test")
text_test, y_test = reviews_test.data, reviews_test.target

vectorizer = CountVectorizer(preprocessor=normalization)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

model = LogisticRegression(max_iter=10000)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_)
print("cv best score: {:.2f}".format(grid.best_score_))

best_model = grid.best_estimator_
feature_names = vectorizer.get_feature_names_out()
coefficients = best_model.coef_[0]
n = 20

class_0_features = np.argsort(coefficients)[:n]
print(f"Class 0 (Negative) top {n} features:")
for idx in class_0_features:
    print(f"{feature_names[idx]}: {coefficients[idx]:.3f}")
print()

class_1_features = np.argsort(coefficients)[-n:][::-1]
print(f"Class 1 (Positive) top {n} features:")
for idx in class_1_features:
    print(f"{feature_names[idx]}: {coefficients[idx]:.3f}")
print()

best C: {'C': 0.1}
cv best score: 0.88
Class 0 (Negative) top 20 features:
worst: -1.357
wast: -1.250
awful: -1.169
poorli: -1.018
terrible: -0.932
bore: -0.865
aw: -0.850
disappointment: -0.843
dull: -0.771
unfortunately: -0.763
disappoint: -0.724
mess: -0.722
boring: -0.720
poor: -0.719
fail: -0.697
worse: -0.689
lame: -0.687
unless: -0.685
badli: -0.672
save: -0.654

Class 1 (Positive) top 20 features:
excellent: 0.997
excel: 0.767
perfect: 0.743
favorit: 0.738
superb: 0.708
highli: 0.708
wonderful: 0.659
funniest: 0.647
delight: 0.645
refresh: 0.642
surprisingli: 0.637
today: 0.623
amazing: 0.607
gem: 0.591
brilliant: 0.587
subtl: 0.584
everyone: 0.564
flawless: 0.540
recommended: 0.525
great: 0.523

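The best cross-validation score is essentially unchanged. Looking at the tokens the model weights most heavily, some words have been shortened, but whether they have been cut down to linguistically valid English stems is rather doubtful.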
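One possible reason unstemmed forms such as "terrible" and "boring" still appear in the list above: normalization() splits on whitespace only, so a token like "terrible." keeps its punctuation and no suffix rule applies to it. This is a guess from reading the code, not something checked against the full dataset. A minimal check on a made-up sentence:

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sample = "A terrible, boring mess. Truly terrible."

# Same approach as normalization(): split on whitespace only.
naive_stems = [stemmer.stem(token) for token in sample.lower().split()]

# Separate the words from punctuation first, then stem.
clean_stems = [stemmer.stem(token) for token in re.findall(r"\w+", sample.lower())]

print(naive_stems)
print(clean_stems)
```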

This time I used the nltk library (it has to be installed separately with the command below). nltk also includes several tokenizers, and using them may improve accuracy further.
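As one example, nltk's wordpunct_tokenize (a regex-based tokenizer that needs no extra data download) could replace the whitespace split in the preprocessor. The function name normalization_with_tokenizer is hypothetical, and whether this actually improves the score on the IMDb data is untested here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = PorterStemmer()

def normalization_with_tokenizer(text):
    # Separate words from punctuation before stemming, so that
    # "terrible." is handled as "terrible" plus a "." token.
    tokens = wordpunct_tokenize(text.lower())
    return " ".join(stemmer.stem(token) for token in tokens)

result = normalization_with_tokenizer("Terrible. Poorly acted!")
print(result)
```

It would be passed the same way as before, for example CountVectorizer(preprocessor=normalization_with_tokenizer).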

●Installing nltk

Normally, run the following:

pip install nltk

If "ModuleNotFoundError: No module named 'nltk'" is displayed, try running the following (on Windows):

py -3 -m pip install nltk

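■Lemmatization

Lemmatization is a technique that converts a word to its lemma (its dictionary form) based on the word's meaning and role. It is a more sophisticated and complex process than stemming, which simply strips suffixes.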

For example, stemming leaves "better" as "better" (or perhaps "bett"), whereas lemmatization is expected to turn "better" into "good".
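The difference can be illustrated with a toy sketch in plain Python. Everything here is hypothetical: TOY_LEXICON is a hand-written table standing in for a real dictionary such as WordNet, and toy_stem is a crude suffix stripper, not the Porter algorithm.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
TOY_LEXICON = {
    ("better", "ADJ"): "good",
    ("best", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("movies", "NOUN"): "movie",
}

def toy_stem(word):
    """Strip a few common suffixes, ignoring meaning entirely."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look the word up by part of speech; fall back to the word itself."""
    return TOY_LEXICON.get((word, pos), word)

print(toy_stem("better"), toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
print(toy_stem("movies"), toy_lemmatize("movies", "NOUN"))
```

The "saw" entries show why a lemmatizer needs the part of speech: the same spelling maps to different lemmas depending on its role in the sentence.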

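Well-known libraries include WordNetLemmatizer, which ships with nltk and is based on the WordNet dictionary, and spaCy, which provides fast natural language processing.

As with stemming, this can be plugged into CountVectorizer as a custom function. The code below again passes it as the preprocessor.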

import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import spacy

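# Load spaCy's small English pipeline (it must be downloaded beforehand).
nlp = spacy.load("en_core_web_sm")

def normalization(text):
    # Run the spaCy pipeline and replace each token with its lemma.
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])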

reviews_train = load_files("C:/Users/****/Documents/Python/aclImdb/train")
text_train, y_train = reviews_train.data, reviews_train.target

reviews_test = load_files("C:/Users/****/Documents/Python/aclImdb/test")
text_test, y_test = reviews_test.data, reviews_test.target

vectorizer = CountVectorizer(preprocessor=normalization)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

model = LogisticRegression(max_iter=10000)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_)
print("cv best score: {:.2f}".format(grid.best_score_))

best_model = grid.best_estimator_
feature_names = vectorizer.get_feature_names_out()
coefficients = best_model.coef_[0]
n = 20

class_0_features = np.argsort(coefficients)[:n]
print(f"Class 0 (Negative) top {n} features:")
for idx in class_0_features:
    print(f"{feature_names[idx]}: {coefficients[idx]:.3f}")
print()

class_1_features = np.argsort(coefficients)[-n:][::-1]
print(f"Class 1 (Positive) top {n} features:")
for idx in class_1_features:
    print(f"{feature_names[idx]}: {coefficients[idx]:.3f}")
print()

best C: {'C': 0.1}
cv best score: 0.88
Class 0 (Negative) top 20 features:
waste: -1.221
awful: -1.147
disappointment: -1.065
poorly: -0.979
boring: -0.917
disappointing: -0.859
horrible: -0.845
dull: -0.792
save: -0.733
mess: -0.729
unfunny: -0.727
unless: -0.710
unfortunately: -0.700
bad: -0.695
poor: -0.689
badly: -0.683
terrible: -0.660
lame: -0.653
pointless: -0.652
ridiculous: -0.652

Class 1 (Positive) top 20 features:
excellent: 0.875
perfect: 0.845
superb: 0.765
favorite: 0.718
wonderfully: 0.704
amazing: 0.704
funniest: 0.690
rare: 0.668
enjoyable: 0.660
today: 0.651
surprisingly: 0.646
highly: 0.643
refreshing: 0.634
wonderful: 0.622
incredible: 0.608
perfectly: 0.589
entertaining: 0.588
gem: 0.585
delightful: 0.562
subtle: 0.550

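It is fairly hard to tell what has changed, but "perfect" and "perfectly" both rank near the top as separate features. Could this be the effect of spaCy distinguishing meanings by part of speech in context? (Lemmatization only undoes inflection, so an adverb such as "perfectly" is not folded into the adjective "perfect".)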
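One way to see what changed is to diff the two top-20 positive lists directly. The lists below are copied from the two outputs above; the variable names are just for this sketch.

```python
# Top 20 positive features from the Porter Stemmer run.
stem_top = [
    "excellent", "excel", "perfect", "favorit", "superb", "highli",
    "wonderful", "funniest", "delight", "refresh", "surprisingli", "today",
    "amazing", "gem", "brilliant", "subtl", "everyone", "flawless",
    "recommended", "great",
]

# Top 20 positive features from the spaCy lemmatization run.
lemma_top = [
    "excellent", "perfect", "superb", "favorite", "wonderfully", "amazing",
    "funniest", "rare", "enjoyable", "today", "surprisingly", "highly",
    "refreshing", "wonderful", "incredible", "perfectly", "entertaining",
    "gem", "delightful", "subtle",
]

# Keep list order so the ranking is still visible in the output.
common = [word for word in lemma_top if word in stem_top]
only_lemma = [word for word in lemma_top if word not in stem_top]
only_stem = [word for word in stem_top if word not in lemma_top]

print("in both:   ", common)
print("lemma only:", only_lemma)
print("stem only: ", only_stem)
```

Several of the "stem only" entries (for example "highli" and "subtl") are truncated versions of words that appear intact in the "lemma only" list.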

●Installing spaCy

Normally, run the following:

pip install spacy
python -m spacy download en_core_web_sm

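If "en_core_web_sm" still cannot be used after the above, try running the following (on Windows). Note that the model version in the URL has to be compatible with the installed spaCy version.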

python3 -m spacy download en_core_web_sm
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

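■Conclusion

This time I incorporated more advanced tokenization. Lemmatization in particular lets you refine the features by leaning on a library. I think it is an approach that plays to a strength of natural language processing, where the meaning of the data, such as its regularities, is understood to some extent.

That said, although it also has the effect of reducing the data, running spaCy felt quite heavy. It seems to be customizable and can apparently run on a GPU, so I'd like to try that.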

■References

Andreas C. Müller, Sarah Guido. Pythonではじめる機械学習 (Introduction to Machine Learning with Python). Translated by 中田秀基. O'Reilly Japan. 2017. 392p.
ChatGPT. 4o mini. OpenAI. 2024. https://chatgpt.com/
API Reference. scikit-learn.org. https://scikit-learn.org/stable/api/index.html
Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher, Learning Word Vectors for Sentiment Analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June, 2011, Portland, Oregon, USA, Association for Computational Linguistics, 142–150, http://www.aclweb.org/anthology/P11-1015
Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.