PMI：相互情報量

AIって結局何なのかよく分からないので、とりあえず100日間勉強してみた Day76

経緯についてはこちらをご参照ください。

AIって結局何なのかよく分からないので、とりあえず100日間勉強してみた Day0

■本日の進捗

PMIを理解

■はじめに

今回も「ゼロから作るDeep Learning②　自然言語処理編（オライリー・ジャパン）」から学んでいきます。

前回実装した共起行列とコサイン類似度を用いてより良い分散表現に改善していきたいと思います。

■ランキング表示

指定した単語に対してコサイン類似度が高い順に表示するための関数を実装していきます。

関数への引数として、対象とする単語（query）や単語のIDと内容（word_to_id, id_to_word）、行列（word_matrix）、表示数（top）を得られているとします。

まずは入力されたqueryは存在するかどうかを確認します。表示の際には、%sの位置にqueryが代入されます。

if query not in word_to_id:
    print('%s is not found' % query)
    return

次に、入力された単語のIDと、そのベクトルをそれぞれquery_idとquery_vecに格納しておきます。

print('\n[query] ' + query)
query_id = word_to_id[query]
query_vec = word_matrix[query_id]

単語の総数から語彙数（vocab_size）を算出し、コサイン類似度を格納するための行列を初期化します。

vocab_size = len(id_to_word)
similarity = np.zeros(vocab_size)

語彙の全ての単語に対してコサイン類似度を算出し、用意しておいた行列に格納していきます。

for i in range(vocab_size):
    similarity[i] = cos_similarity(word_matrix[i], query_vec)

算出したコサイン類似度に対して、小さい順にソートするargsort関数を逆数をかけたコサイン類似度に適用することで、大きい順に取り出していきます。

入力された単語（query）を除いて、topに格納した表示数まで繰り返します。

count = 0
for i in (-1 * similarity).argsort():
    if id_to_word[i] == query:
        continue
    print(' %s: %s' % (id_to_word[i], similarity[i]))
    count += 1
    if count >= top:
        return

ここまでのアルゴリズムを関数化し、「i」をqueryとして適用してみます。

import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

def cos_similarity(x, y):
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return
    
    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

most_similar('i', word_to_id, id_to_word, C, top=5)

[query] i
 goodbye: 0.9999999999999998
 you: 0.7071067811865475
 hello: 0.4999999999999999
 and: 0.0
 say: 0.0

「i」のコサイン類似度を高い順に並べることに成功しました。結果を見てみると直観的に最も類似しているであろう「you」よりも「goodbye」の方が遥かに高い類似度を返してきています。（というよりベクトルがほとんど同じ方向を向いています。）これはある意味で正しい挙動でしょう。コーパスを見てみれば、「i」と「goodbye」はどちらも隣に「say」と「and」があります。これでは「i」と「goodbye」が同じだよ、と返してきても仕方がないでしょう。

この問題を解決するためにはコーパス（教師データ）を大きく増やすか、あるいは手法の改善を考えてみるのが良いかもしれません。

■相互情報量

相互情報量（PMI：Pointwise Mutual Information）とは、2つの事象の関連度を示す指標で、同時に出現する確率が、それぞれ独立して出現する確率に対してどれだけ高いかを示します。

$$ \mathrm{PMI}(x, y) = \log_2 \frac{\mathrm{P}(x, y)}{\mathrm{P}(x) \mathrm{P}(y)} $$

確率Pは共起行列で示された出現回数を総単語数で割ることで記述することができるので、

$$ \mathrm{P}(x, y) = \frac{\mathrm{C}(x, y)}{\mathrm{N}} $$

よって、PMIは下記のように示されます。

$$ \mathrm{PMI}(x, y) = \log_2 \frac{\mathrm{C}(x, y) \cdot \mathrm{N}}{\mathrm{C}(x) \mathrm{C}(y)} $$

ただし、C(x), C(y)ともに共起回数が0の場合にPMIが発散してしまうため、最低でも0になるように条件分けしてあげる必要があります。これを正の相互情報量（Positive PPM）と呼びます。

$$ \mathrm{PPMI}(x, y) = \mathrm{max}(0, \mathrm{PMI}(x, y)) $$

このPPMIを実装していきます。

まずは行列Mに引数である共起行列Cと同じ形で初期化します。Nに共起行列Cの合計である総共起回数を、Sに各単語の出現回数をそれぞれ格納します。

進捗表示用に全要素数（total）を求めて、カウンター（cnt）を初期化しておきます。

M = np.zeros_like(C, dtype=np.float32)
N = np.sum(C)
S = np.sum(C, axis=0)
total = C.shape[0] * C.shape[1]
cnt = 0

i, jにそれぞれ行と列のインデックスを繰り返していき、各PMIを計算していきます。max関数にて最小0のPMIの値を行列Mに格納して、最後にこのMを返り値とすれば関数は完成です。

verboseをTrueにした場合は、途中の進捗を1%ごとに表示して確認することもできます。

for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
        M[i, j] = max(0, pmi)

        if verbose:
            cnt += 1
            if cnt % (total//100 + 1) == 0:
                print('%.lf%% done' % (100 * cnt/total))

これを実装して得られるPPMI行列を確認してみます。

import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

def cos_similarity(x, y):
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return
    
    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
        
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100 + 1) == 0:
                    print('%.lf%% done' % (100 * cnt/total))
    
    return M

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

W = ppmi(C)
print(W)

[[0.        1.8073549 0.        0.        0.        0.        0.       ]
 [1.8073549 0.        0.8073549 0.        0.8073549 0.8073549 0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.        1.8073549 0.        1.8073549 0.        0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.8073549 0.        0.        0.        0.        2.807355 ]
 [0.        0.        0.        0.        0.        2.807355  0.       ]]

■おわりに

これまでの共起行列では冠詞や接続詞など、単に頻繁に使われるが故に共起している単語に対して偏りが出てしまいましたが、PMIを用いることでこのような偏りを補正することができます。これによってより意味合いの関連度を重視するような指標を得ることができるようになります。

ちなみに変数名をMとかSとかいう名前で定義するのは自分が学生の頃は教授からひどく怒られたものですが、これは大丈夫なのでしょうか笑

■参考文献

Andreas C. Muller, Sarah Guido. Pythonではじめる機械学習. 中田秀基訳. オライリー・ジャパン. 2017. 392p.
斎藤康毅. ゼロから作るDeep Learning Pythonで学ぶディープラーニングの理論と実装. オライリー・ジャパン. 2016. 320p.
斎藤康毅. ゼロから作るDeep Learning② 自然言語処理編. オライリー・ジャパン. 2018. 432p.
ChatGPT. 4o mini. OpenAI. 2024. https://chatgpt.com/
API Reference. scikit-learn.org. https://scikit-learn.org/stable/api/index.html
PyTorch documentation. pytorch.org. https://pytorch.org/docs/stable/index.html
Keiron O’Shea, Ryan Nash. An Introduction to Convolutional Neural Networks. https://ar5iv.labs.arxiv.org/html/1511.08458
API Reference. scipy.org. 2024. https://docs.scipy.org/doc/scipy/reference/index.html