共起行列の次元削減

AIって結局何なのかよく分からないので、とりあえず100日間勉強してみた Day77

経緯についてはこちらをご参照ください。

AIって結局何なのかよく分からないので、とりあえず100日間勉強してみた Day0

■本日の進捗

SVDを理解

■はじめに

今回も「ゼロから作るDeep Learning②　自然言語処理編（オライリー・ジャパン）」から学んでいきます。

これまで用いてきた共起行列に対する次元削減手法であるSVDを実装していきます。

■SVD

特異値分解（Singular Value Decomposition：SVD）とは、共起行列やPMIを用いて単語を分散表現に落とし込んだ行列に対して次元削減する手法で、単に次元を落とすのではなく特異値の大きいものだけを取り出すことで、特徴を残したまま低次元化することができます。

対象とする共起行列をXとすると、SVDを行うことで下記のように分解します。

$$ \mathrm{\boldsymbol{X}} = \mathrm{\boldsymbol{U}} \mathrm{\boldsymbol{S}} \mathrm{\boldsymbol{V}}^{\mathrm{T}} $$

Uは単語の特徴を表す直交行列、Sは特異値からなる対角行列、V^Tはコンテキストの特徴を表す直交行列です。

前回適用したPPMI（Positive Pointwise Mutual Information）の結果を振り返ってみます。

[[0.        1.8073549 0.        0.        0.        0.        0.       ]
 [1.8073549 0.        0.8073549 0.        0.8073549 0.8073549 0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.        1.8073549 0.        1.8073549 0.        0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.8073549 0.        0.        0.        0.        2.807355 ]
 [0.        0.        0.        0.        0.        2.807355  0.       ]]

単語間の依存性を捉えた分散表現ができていますが、行列のおよそ半分を占める0が目立ちます。このような行列はノイズ感度が高く、計算コスト的にも良くありません。（この時は小さいコーパスでしたが、実際のコーパスに対して同様の処理を行うと途方もなく大きな行列計算を行うことになります。）

取り扱うデータにこのような性質があるので、SVDの実行には効率的で高速に行列計算を行える外部ライブラリを用いるのが一般的のようです。

NumPyのlinalg関数にあるsvdや、scikit-learnのdecompositionクラスにあるTruncatedSVDなどがありますが、既にインポート済みのNumPyで挙動を確認してみます。

import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

def cos_similarity(x, y):
    nx = x / np.sqrt(np.sum(x**2))
    ny = y / np.sqrt(np.sum(y**2))
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return
    
    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
        
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100 + 1) == 0:
                    print('%.lf%% done' % (100 * cnt/total))
    
    return M

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

W = ppmi(C)

U, S, V = np.linalg.svd(W)
print("U:", U)
print("S:", S)
print("V:", V)

U: [[ 3.40948761e-01 -1.11022302e-16 -4.44089210e-16 -1.20516241e-01
   0.00000000e+00 -9.32324946e-01 -1.08553466e-16]
 [ 0.00000000e+00 -5.97636402e-01  1.80237904e-01  0.00000000e+00
  -7.81245828e-01  0.00000000e+00  0.00000000e+00]
 [ 4.36312199e-01 -5.55111512e-17 -2.22044605e-16 -5.08782864e-01
  -1.38777878e-17  2.25325629e-01 -7.07106769e-01]
 [ 1.61997685e-16 -4.97828126e-01  6.80396318e-01 -5.24907442e-17
   5.37799239e-01 -1.28514287e-16  8.57616310e-18]
 [ 4.36312199e-01 -3.22893989e-17 -1.65386068e-16 -5.08782864e-01
  -1.34548148e-17  2.25325629e-01  7.07106769e-01]
 [ 7.09237099e-01 -3.22893989e-17 -1.65386068e-16  6.83926761e-01
  -1.34548148e-17  1.70958877e-01  7.68754397e-17]
 [-2.56401452e-16 -6.28488600e-01 -7.10334539e-01  8.45960656e-17
   3.16902101e-01  2.49218348e-16 -1.10425444e-17]]
S: [3.1680453e+00 3.1680453e+00 2.7029872e+00 2.7029872e+00 1.5144811e+00
 1.5144811e+00 2.5550233e-17]
V: [[-0.0000000e+00  5.9763640e-01  2.2565209e-16  4.9782813e-01
   1.7014093e-16 -2.7394827e-16  6.2848860e-01]
 [-3.4094876e-01 -1.1102230e-16 -4.3631220e-01  0.0000000e+00
  -4.3631220e-01 -7.0923710e-01 -0.0000000e+00]
 [ 1.2051624e-01 -4.7184479e-16  5.0878286e-01  0.0000000e+00
   5.0878286e-01 -6.8392676e-01 -0.0000000e+00]
 [ 0.0000000e+00 -1.8023790e-01 -6.9897476e-17 -6.8039632e-01
  -4.2141898e-17  6.8880402e-17  7.1033454e-01]
 [-9.3232495e-01 -5.5511151e-17  2.2532563e-01  0.0000000e+00
   2.2532563e-01  1.7095888e-01 -0.0000000e+00]
 [-0.0000000e+00 -7.8124583e-01 -2.6626587e-16  5.3779924e-01
  -1.5524356e-16  3.4435680e-16  3.1690210e-01]
 [-0.0000000e+00  3.8861137e-17 -7.0710677e-01 -1.1742222e-16
   7.0710677e-01  5.3874928e-17 -7.1880089e-17]]

これでPPMI後の共起行列を分解することができました。次元削減後の単語特徴ベクトルを単語ごとに集まっているのがUで、Sは言わば重要度を示しているのでした。つまり（例えば）2次元に削減した分散表現は続けて下記のように求めることができます。

U2 = U[:, :2]
S2 = np.diag(S[:2])
print("features:", np.dot(U2, S2))

features: [[ 1.08014107e+00 -3.51723682e-16]
 [ 0.00000000e+00 -1.89333916e+00]
 [ 1.38225675e+00 -1.75861841e-16]
 [ 5.13215992e-16 -1.57714200e+00]
 [ 1.38225675e+00 -1.02294276e-16]
 [ 2.24689531e+00 -1.02294276e-16]
 [-8.12291414e-16 -1.99108040e+00]]

上位2次元の基底ベクトルをそれぞれ抜き出し、上位2つの特異値（重要度）をかけています。

■おわりに

今回は共起行列に対して次元削減を行いました。単語の分散表現自体は変換できていましたが、疎なベクトルになっていたものを（かなり）密なベクトルにすることができました。

■参考文献

Andreas C. Muller, Sarah Guido. Pythonではじめる機械学習. 中田秀基訳. オライリー・ジャパン. 2017. 392p.
斎藤康毅. ゼロから作るDeep Learning Pythonで学ぶディープラーニングの理論と実装. オライリー・ジャパン. 2016. 320p.
斎藤康毅. ゼロから作るDeep Learning② 自然言語処理編. オライリー・ジャパン. 2018. 432p.
ChatGPT. 4o mini. OpenAI. 2024. https://chatgpt.com/
API Reference. scikit-learn.org. https://scikit-learn.org/stable/api/index.html
PyTorch documentation. pytorch.org. https://pytorch.org/docs/stable/index.html
Keiron O’Shea, Ryan Nash. An Introduction to Convolutional Neural Networks. https://ar5iv.labs.arxiv.org/html/1511.08458
API Reference. scipy.org. 2024. https://docs.scipy.org/doc/scipy/reference/index.html