Co-occurrence Matrices and Cosine Similarity

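I still don't really understand what AI is, so I'm studying it for 100 days: Day 75

For the background to this series, please see the following post.

I still don't really understand what AI is, so I'm studying it for 100 days: Day 0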

■Today's progress

Understood co-occurrence matrices
Understood cosine similarity

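■Introduction

Thanks to the excellent book 「ゼロから作るDeep Learning　Pythonで学ぶディープラーニングの理論と実装」 (O'Reilly Japan), which I had been working through until the previous post, I was able to learn the fundamentals of deep learning, from the simple perceptron to convolutional neural networks.

So far I have mostly studied models for image recognition, so I considered going deeper into practical techniques, using more realistic feature maps with real-world image recognition in mind. But since the main goal of this series is to study, I will instead use the sequel to that book to move on to a field that is a little different from image recognition (and just as widely used). I hope you will bear with me a little longer.

The sequel is 「ゼロから作るDeep Learning②　自然言語処理編」 (O'Reilly Japan).

As before, Chapter 1 covers material I already know (mainly implementing basic neural networks), so I will start with count-based methods, one of the natural language processing techniques that the subtitle refers to.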

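■Count-based methods

Natural language processing (NLP) is the field that aims to make computers (machine learning models) understand human language. Within it, count-based methods analyze text data by turning words into numbers, using information such as how often and where they appear in a corpus.

To train a machine learning model, words first have to be converted into numbers. Here we will implement a "distributed representation", which expresses the meaning of a word as a fixed-length vector.

As a preprocessing step, let's implement a function that simply takes the words out of a given text one at a time and assigns each of them an ID (a number).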

import numpy as np

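def preprocess(text):
    # Lowercase the text and split off the period so it becomes its own token
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    # Assign a new ID to each word the first time it appears
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    # The corpus is the text expressed as a sequence of word IDs
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word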

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

print("corpus:{}".format(corpus))
print(word_to_id)
print("word_to_id[you]: {}".format(word_to_id["you"]))
print(id_to_word)
print("id_to_word[0]: {}".format(id_to_word[0]))

corpus:[0 1 2 3 4 1 5 6]
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}
word_to_id[you]: 0
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
id_to_word[0]: you

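The preprocess function splits the input text on spaces and, each time a word appears in the corpus for the first time, stores it in word_to_id, which maps words to IDs, and in id_to_word, which maps IDs back to words. corpus is a NumPy array holding the ID of every word in order of appearance.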
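
One quick way to convince yourself that the two dictionaries agree is to decode the corpus back into words. The sketch below repeats preprocess so that it runs on its own; the variable name decoded is introduced here purely for illustration.

```python
import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

# Map every ID back to its word; this should reproduce the tokenized text
decoded = ' '.join(id_to_word[int(i)] for i in corpus)
print(decoded)                        # you say goodbye and i say hello .
print(len(corpus), len(word_to_id))   # 8 7 ("say" appears twice)
```

The corpus has 8 entries but the vocabulary has only 7, because the repeated "say" reuses ID 1.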

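■Co-occurrence matrix

A co-occurrence matrix records how often words appear together. Each row can be read as a vector representation of a target word: it counts how often that target appears alongside each possible context word in the corpus.

We will implement a co-occurrence matrix that treats neighboring words as the context.

First, from the function's arguments, the corpus (corpus) and the vocabulary size (vocab_size, the number of distinct words), we compute the number of words in the corpus counting repeats (corpus_size) and initialize a square vocab_size × vocab_size matrix (co_matrix) with zeros.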

corpus_size = len(corpus)
co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

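Next, we build the co-occurrence matrix using Python's built-in enumerate function, which yields each position in the corpus (idx) together with the word ID at that position (word_id). When window_size is 1, the one word to the left and the one word to the right of the target are considered. The positions of those neighbors are stored in left_idx and right_idx, and only when a position falls inside the corpus do we look up the word ID there and increment the corresponding cell of the initialized co-occurrence matrix.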

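for idx, word_id in enumerate(corpus):
    for i in range(1, window_size + 1):
        # Positions i words to the left and right of the target
        left_idx = idx - i
        right_idx = idx + i

        # Count the left neighbor only if it lies inside the corpus
        if left_idx >= 0:
            left_word_id = corpus[left_idx]
            co_matrix[word_id, left_word_id] += 1

        # Count the right neighbor only if it lies inside the corpus
        if right_idx < corpus_size:
            right_word_id = corpus[right_idx]
            co_matrix[word_id, right_word_id] += 1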
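
The boundary checks are the easiest part of this loop to get wrong, so here is a minimal sketch that isolates them. context_positions is a hypothetical helper that does not appear in the code above; it only lists which positions would be counted for a given target position.

```python
def context_positions(idx, corpus_size, window_size=1):
    """Return the in-bounds positions within window_size of idx (hypothetical helper)."""
    positions = []
    for i in range(1, window_size + 1):
        left_idx = idx - i
        right_idx = idx + i
        if left_idx >= 0:             # skip positions before the start of the corpus
            positions.append(left_idx)
        if right_idx < corpus_size:   # skip positions past the end of the corpus
            positions.append(right_idx)
    return sorted(positions)

corpus_size = 8  # length of 'you say goodbye and i say hello .'
print(context_positions(0, corpus_size))                 # [1]
print(context_positions(3, corpus_size))                 # [2, 4]
print(context_positions(7, corpus_size))                 # [6]
print(context_positions(3, corpus_size, window_size=2))  # [1, 2, 4, 5]
```

The first and last words have only one neighbor each, which is why their rows in the co-occurrence matrix contain a single count.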

Let's build a co-occurrence matrix using the earlier corpus and the preprocess function.

import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

for i in corpus:
    print(id_to_word[i])

print(C)

you
say
goodbye
and
i
say
hello
.
[[0 1 0 0 0 0 0]
 [1 0 1 0 1 1 0]
 [0 1 0 1 0 0 0]
 [0 0 1 0 1 0 0]
 [0 1 0 1 0 0 0]
 [0 1 0 0 0 0 1]
 [0 0 0 0 0 1 0]]

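The first row is the vector for "you". It is the very first word of the corpus, so it is adjacent only to the second word, "say" (ID 1). The second row is "say". Its first occurrence is the second word, so it is adjacent to the first and third words (IDs 0 and 2). This word appears twice in the corpus, but each word has only one row in the co-occurrence matrix, so the counts from both occurrences accumulate in that row. That is why the columns for "i" (ID 4) and "hello" (ID 5) are also counted: in the original corpus, the second "say" sits between "i" and "hello". The co-occurrence matrix has therefore been built correctly.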
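
Two further sanity checks can be sketched here. Because the window looks the same distance left and right, the matrix should be symmetric, and the sum of all its entries should equal the number of in-bounds (target, neighbor) pairs. The corpus IDs are hard-coded below so the block runs on its own, and the window_size=2 case is an extra illustration that is not part of the run above.

```python
import numpy as np

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                co_matrix[word_id, corpus[left_idx]] += 1

            if right_idx < corpus_size:
                co_matrix[word_id, corpus[right_idx]] += 1

    return co_matrix

# IDs for 'you say goodbye and i say hello .'
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
vocab_size = 7

C1 = create_co_matrix(corpus, vocab_size, window_size=1)
C2 = create_co_matrix(corpus, vocab_size, window_size=2)

print(np.array_equal(C1, C1.T))  # True: "a next to b" implies "b next to a"
print(C1.sum())                  # 14 = 2 * 7 adjacent pairs
print(C2.sum())                  # 26 = 14 + 2 * 6 pairs at distance two
```

A larger window adds counts for words two positions apart, so the rows become denser as window_size grows.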

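■Cosine similarity

Cosine similarity measures how similar two vectors are, using the cosine of the angle between them. If the two vectors point in the same direction the similarity is 1, and if they are orthogonal it is 0. It can be as low as -1 when they point in opposite directions, but the co-occurrence vectors used here contain only non-negative counts, so their similarity always falls between 0 and 1.

$$ \mathrm{cosine \ similarity}(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\| \boldsymbol{x} \| \ \| \boldsymbol{y} \|} $$

As the formula above suggests, cosine similarity is very easy to implement. The one thing to watch is that the denominator must not be zero: the version below assumes neither input is the zero vector, and would return NaN (with a NumPy warning) if one were.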

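def cos_similarity(x, y):
    # Normalize each vector to unit length (assumes neither is the zero vector)
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    # The dot product of two unit vectors is the cosine of the angle between them
    return np.dot(nx, ny)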
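
A common way to protect against a zero vector is to add a tiny constant to each norm. The sketch below is one possible variant and not the function used in the rest of this post: the name cos_similarity_safe and the default eps=1e-8 are assumptions made for illustration. It also tries the same-direction, orthogonal, and opposite-direction cases.

```python
import numpy as np

def cos_similarity_safe(x, y, eps=1e-8):
    # eps is an assumed small constant that keeps the denominators away from zero
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

a = np.array([1, 2, 0])

print(cos_similarity_safe(a, 2 * a))                # approximately 1 (same direction)
print(cos_similarity_safe(a, np.array([0, 0, 3])))  # 0.0 (orthogonal)
print(cos_similarity_safe(a, -a))                   # approximately -1 (opposite direction)
print(cos_similarity_safe(a, np.zeros(3)))          # 0.0 instead of NaN
```

The trade-off is that the results are very slightly smaller in magnitude than the exact cosine, which is usually negligible for count vectors.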

Finally, we vectorize the words that appear in the corpus and compute the similarity between "you" and "i", which are probably similar, and between "and" and "hello", which are probably not.

import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

def cos_similarity(x, y):
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    return np.dot(nx, ny)

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

c0 = C[word_to_id["you"]]
c1 = C[word_to_id["i"]]

print("you : {}".format(c0))
print("i   : {}".format(c1))
print("Cosine Similarity : {}".format(cos_similarity(c0, c1)))
print("")

c0 = C[word_to_id["and"]]
c1 = C[word_to_id["hello"]]

print("and  : {}".format(c0))
print("hello: {}".format(c1))
print("Cosine Similarity : {}".format(cos_similarity(c0, c1)))

you : [0 1 0 0 0 0 0]
i   : [0 1 0 1 0 0 0]
Cosine Similarity : 0.7071067811865475

and  : [0 0 1 0 1 0 0]
hello: [0 1 0 0 0 0 1]
Cosine Similarity : 0.0
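
Both printed values can be checked by hand against the formula. Taking $\boldsymbol{x}$ as the vector for "you" and $\boldsymbol{y}$ as the vector for "i", the only position where both are non-zero is the column for "say":

$$ \boldsymbol{x} \cdot \boldsymbol{y} = 1, \quad \| \boldsymbol{x} \| = \sqrt{1^2} = 1, \quad \| \boldsymbol{y} \| = \sqrt{1^2 + 1^2} = \sqrt{2} $$

$$ \mathrm{cosine \ similarity}(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{1 \cdot \sqrt{2}} \approx 0.7071 $$

For "and" and "hello" there is no position where both vectors are non-zero, so the dot product, and therefore the similarity, is 0.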

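The similarity between "you" and "i" is about 0.71, which seems intuitively right given that both act as subjects. Meanwhile, the similarity between "and" and "hello", a conjunction and a noun, is 0, because the two words share no neighbors in this corpus. That too suggests the vectors are capturing something meaningful.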
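
To see these two numbers in context, the same functions can be used to score every other word against "you". This is only a sketch built from the code above; the names query, query_vec, and scores are introduced here for illustration.

```python
import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                co_matrix[word_id, corpus[left_idx]] += 1

            if right_idx < corpus_size:
                co_matrix[word_id, corpus[right_idx]] += 1

    return co_matrix

def cos_similarity(x, y):
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    return np.dot(nx, ny)

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
C = create_co_matrix(corpus, len(word_to_id))

# Score every word except the query itself against the query vector
query = "you"
query_vec = C[word_to_id[query]]
scores = {}
for word, word_id in word_to_id.items():
    if word != query:
        scores[word] = cos_similarity(query_vec, C[word_id])

# Print the words from most to least similar
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("{:8s} {:.4f}".format(word, score))
```

With this corpus, "goodbye" and "hello" score about 0.71 as well, the same as "i", because all three have "say" as a neighbor. With so few sentences, the ranking is probably best read as an illustration of the mechanics.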

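■Closing thoughts

Starting with this post, I will study deep learning with a focus on natural language processing. NLP is applied in many areas and is arguably one of the hottest fields right now.

It is at work in chatbots such as ChatGPT and Gemini, in the interfaces of other generative AI systems, in translation tools such as Google Translate, and in many other places. As long as humans are the ones using computers, it will remain an indispensable field. (I am the kind of person who seriously believes that, in the not-too-distant future, humans will be the ones being used by computers...) It is probably also largely thanks to NLP that AI has come to seem human-like.

Even though I gave it only a very short and simple corpus, cosine similarity returned results that feel intuitive to a human. That is remarkable considering this is a form of unsupervised learning, with no dictionary, prior information, or labeled data supplied. It is of course not a method in heavy use today, but I can see why research in this area is so heated, and I can't help feeling excited about it.

This was a good algorithm to learn as a first step into natural language processing.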

■References

Andreas C. Müller, Sarah Guido. Pythonではじめる機械学習. Translated by 中田秀基. O'Reilly Japan. 2017. 392p.
斎藤康毅. ゼロから作るDeep Learning Pythonで学ぶディープラーニングの理論と実装. O'Reilly Japan. 2016. 320p.
斎藤康毅. ゼロから作るDeep Learning② 自然言語処理編. O'Reilly Japan. 2018. 432p.
ChatGPT. 4o mini. OpenAI. 2024. https://chatgpt.com/
API Reference. scikit-learn.org. https://scikit-learn.org/stable/api/index.html
PyTorch documentation. pytorch.org. https://pytorch.org/docs/stable/index.html
Keiron O’Shea, Ryan Nash. An Introduction to Convolutional Neural Networks. https://ar5iv.labs.arxiv.org/html/1511.08458
API Reference. scipy.org. 2024. https://docs.scipy.org/doc/scipy/reference/index.html