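Playing with the PTB dataset

I still don't really understand what AI is, so I decided to study it for 100 days: Day 78

For the background, please see the following post.

I still don't really understand what AI is, so I decided to study it for 100 days: Day 0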

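■Today's progress

Understood the PTB dataset

■Introduction

Once again I am learning from ゼロから作るDeep Learning② 自然言語処理編 (Deep Learning from Scratch 2: Natural Language Processing, O'Reilly Japan).

This time I take the modules I have implemented so far (co-occurrence matrix construction, PMI, and SVD) and apply them to a more realistic dataset.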

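■The PTB dataset

The PTB (Penn Treebank) dataset is widely used in natural language processing for research and benchmarking. It was developed at the University of Pennsylvania from articles that were actually published in the Wall Street Journal.

A treebank is a corpus in which each sentence is annotated with information such as part-of-speech tags and phrase structure. The name comes from the tree-like hierarchical structure of these annotations.

The full dataset is a paid one that requires a license agreement, but Tomas Mikolov provides a simplified version adapted for language-modeling tasks, and that version is freely available for research and educational use.

The simplified version drops the annotations mentioned above, so strictly speaking it is no longer a treebank. Instead, it has been preprocessed: numbers are generalized to the token N, and rare words are replaced with <unk>.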
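To make these two normalizations concrete, here is a minimal sketch of the idea on a made-up sentence. This is not the script that produced the simplified PTB files; the function name `normalize_tokens`, the `min_count` threshold, the numeric pattern, and the sample text are all assumptions for illustration.

```python
import re
from collections import Counter

NUMBER = re.compile(r'\d+(\.\d+)?')  # assumed definition of "a number"

def normalize_tokens(tokens, min_count=2):
    """Replace numeric tokens with 'N' and rare words with '<unk>'.

    Hypothetical helper: the threshold and the numeric test are assumptions,
    not the rules actually used to prepare the simplified PTB files.
    """
    # First pass: lowercase words and map numeric tokens to the placeholder 'N'.
    mapped = ['N' if NUMBER.fullmatch(t) else t.lower() for t in tokens]

    # Second pass: tokens seen fewer than min_count times become '<unk>'.
    counts = Counter(mapped)
    return [t if counts[t] >= min_count else '<unk>' for t in mapped]

sample = 'the index rose 3.5 points and the index fell 12 points'.split()
normalized = normalize_tokens(sample)
print(' '.join(normalized))
```

Both numbers collapse into the same N token, which is why N ends up being a fairly frequent "word" in this kind of corpus.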

The simplified PTB dataset can be downloaded from the following repository.

https://github.com/tomsercu/lstm

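Now let's actually use the dataset. First, import the libraries needed to read and write the data, and store the file names. Because mid_path is joined onto dataset_dir, set it to the relative path from the directory containing this script to the directory where you saved the dataset.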

import sys
import os
sys.path.append('..')
import numpy as np
import pickle

# setting for PTB dataset
key_file = {
    'train':'ptb.train.txt',
    'test':'ptb.test.txt',
    'valid':'ptb.valid.txt'
}
save_file = {
    'train':'ptb.train.npy',
    'test':'ptb.test.npy',
    'valid':'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'
dataset_dir = os.path.dirname(os.path.abspath(__file__))
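# Build the path with os.path.join so it works on any OS and avoids
# invalid escape sequences such as '\D' or '\l' in a plain string literal.
mid_path = os.path.join('..', '..', 'Download_Dataset', 'lstm-master', 'data')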

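Next, prepare a function that builds the word-to-ID and ID-to-word mappings from the train data.

It first checks whether the file saved by a previous run (ptb.vocab.pkl) exists. If it does, the file is loaded and returned; if not, the corpus is processed from scratch.

The corpus is stored as sentences separated by newlines. These are first joined into a single text in which the sentences are delimited by <eos>, then strip() removes extra whitespace at the beginning and end, and split() splits the text on whitespace. The result is stored in the words variable.

Each distinct word in words is then registered once, together with its ID, in the two dictionaries. The result is saved to ptb.vocab.pkl and returned to the caller. Here, 'wb' opens the file in write mode (w) and binary mode (b); pickle.dump serializes a Python object into a binary format, so the file has to be opened this way.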

def load_vocab():
    vocab_path = os.path.join(dataset_dir, vocab_file)
    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = os.path.join(dataset_dir, mid_path, file_name)

    # Use a context manager so the file handle is closed after reading.
    with open(file_path) as f:
        words = f.read().replace('\n', '<eos>').strip().split()

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word
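The tokenization and ID assignment inside load_vocab can be tried on a small in-memory string, which makes it easier to see what each step does. This is only a sketch: the sample text, the helper name `build_vocab`, and the in-memory pickle round trip are my own assumptions, and the sample assumes every line is padded with a space on both sides.

```python
import pickle

def build_vocab(words):
    """Assign consecutive IDs to words in order of first appearance."""
    word_to_id, id_to_word = {}, {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    return word_to_id, id_to_word

# Made-up text: each line starts and ends with a space (an assumption).
text = ' the cat sat \n the dog sat \n'

# Newlines become <eos>, outer whitespace is stripped, then split on whitespace.
words = text.replace('\n', '<eos>').strip().split()
print(words)

word_to_id, id_to_word = build_vocab(words)
print(word_to_id)

# Round trip through pickle in memory instead of a file on disk.
restored_w2i, restored_i2w = pickle.loads(pickle.dumps((word_to_id, id_to_word)))
```

Note that if the lines were not padded with spaces, '<eos>' would fuse with its neighbours (for example 'sat<eos>the'), so this replace-and-split approach depends on how the text file is formatted.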

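Finally, implement the function that serves as the actual interface and returns the corpus together with the vocabulary.

If val is passed as data_type, it is first changed to valid, and then the save file name matching train, test, or valid is selected.

After calling load_vocab to get the word and ID mappings, the function checks whether the save file exists and, if so, returns its contents as they are. If not, it does almost the same processing as before and stores all the words, converted to IDs and in their original order, in the corpus variable.

It finishes by returning the three variables corpus, word_to_id, and id_to_word.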

def load_data(data_type='train'):
    if data_type == 'val':
        data_type = 'valid'
    save_path = os.path.join(dataset_dir, save_file[data_type])

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word
    
    file_name = key_file[data_type]
    file_path = os.path.join(dataset_dir, mid_path, file_name)
    # Use a context manager so the file handle is closed after reading.
    with open(file_path) as f:
        words = f.read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word
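The "load the cache if it exists, otherwise build and save it" pattern used by load_data can be isolated into a small sketch. The helper name `load_or_build`, the toy vocabulary, and the use of a temporary directory are assumptions so that the example runs without the PTB files.

```python
import os
import tempfile
import numpy as np

def load_or_build(save_path, build_fn):
    """Return the cached array if present; otherwise build, cache, and return it."""
    if os.path.exists(save_path):
        return np.load(save_path)
    arr = build_fn()
    np.save(save_path, arr)
    return arr

calls = []  # records how many times the expensive step actually ran

def build_corpus():
    calls.append(1)
    word_to_id = {'the': 0, 'cat': 1, 'sat': 2, '<eos>': 3}
    words = ['the', 'cat', 'sat', '<eos>']
    return np.array([word_to_id[w] for w in words])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.npy')
    first = load_or_build(path, build_corpus)   # builds and writes toy.npy
    second = load_or_build(path, build_corpus)  # reads toy.npy, no rebuild

print(first, second, len(calls))
```

One consequence of this pattern is that the cache is never invalidated: if the text files or mid_path change, the saved .npy and .pkl files have to be deleted by hand, otherwise the old data keeps being returned.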

With the pieces above combined into one script, we can extract the words and IDs from the dataset and print the size of the corpus.

corpus, word_to_id, id_to_word = load_data('train')

print('num of words:', len(corpus))

num of words: 929589

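The corpus contains more than 920,000 words (counting duplicates and the <eos> markers). It is small compared with large-scale datasets, but it is far bigger than the corpora of just a few words that I have used so far.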
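Besides the total length, it is easy to look at the vocabulary size and the most frequent words once the corpus is an array of IDs. The sketch below uses a toy corpus as a stand-in for the real one; `top_words` is a hypothetical helper name, and with the real data you would pass the corpus and id_to_word returned by load_data.

```python
import numpy as np

def top_words(corpus, id_to_word, k=3):
    """Return the k most frequent (word, count) pairs in an ID corpus."""
    # bincount gives the number of occurrences of each word ID.
    counts = np.bincount(corpus, minlength=len(id_to_word))
    # Stable sort keeps the lower ID first when counts are tied.
    order = np.argsort(-counts, kind='stable')[:k]
    return [(id_to_word[int(i)], int(counts[i])) for i in order]

# Toy stand-in for the real corpus (assumption, not PTB data).
id_to_word = {0: 'the', 1: 'cat', 2: 'sat', 3: '<eos>', 4: 'dog'}
corpus = np.array([0, 1, 2, 3, 0, 4, 2, 3, 0])

print('num of words:', len(corpus))
print('vocabulary size:', len(id_to_word))
print(top_words(corpus, id_to_word))
```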

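■Cosine similarity on PTB

Let's compute cosine similarities using the functions built above. For each of the four query words 'we', 'have', 'honda', and 'car', the script prints the top five words by cosine similarity.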
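As a reminder of what is being measured, cosine similarity is the dot product of two vectors after each has been scaled to unit length, so it depends only on direction and not on length. Here is a small sketch with made-up vectors. The `eps` term is an assumption I added to avoid division by zero for an all-zero vector.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine of the angle between x and y; eps guards against zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cos_similarity(a, b))  # very close to 1
print(cos_similarity(a, c))  # 0.0
```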

from sklearn.utils.extmath import randomized_svd

# The imports, PTB settings, load_vocab, and load_data are the same as shown earlier.

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

def cos_similarity(x, y, eps=1e-8):
    # eps avoids division by zero when a vector is all zeros
    nx = x / (np.sqrt(np.sum(x**2)) + eps)
    ny = y / (np.sqrt(np.sum(y**2)) + eps)
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return
    
    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
        
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100 + 1) == 0:
                    print('%.1f%% done' % (100 * cnt/total))
    
    return M

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = load_data('train')

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size, window_size)
W = ppmi(C)

U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=8)
word_vecs = U[:, :wordvec_size]

querys = ['we', 'have', 'honda', 'car']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

[query] we
 i: 0.6876371502876282
 're: 0.6452517509460449
 've: 0.6233857870101929
 you: 0.6085333228111267
 'm: 0.586887776851654

[query] have
 been: 0.6786949038505554
 has: 0.5799141526222229
 had: 0.5554906725883484
 've: 0.39367467164993286
 be: 0.33541420102119446

[query] honda
 toyota: 0.6151961088180542
 nissan: 0.557483971118927
 motor: 0.555155336856842
 procter: 0.45490726828575134
 sits: 0.45157667994499207

[query] car
 auto: 0.5797815322875977
 cars: 0.575616717338562
 vehicle: 0.5617407560348511
 luxury: 0.5404506325721741
 truck: 0.517784595489502

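As expected, with this much data the similarities come out intuitively reasonable. For 'we' and 'have', the neighbours include words that often appear together with them, such as "'re" and "been", alongside words that are similar in meaning. For 'honda', the competitors 'toyota' and 'nissan' come first, although lower-ranked entries such as 'procter' and 'sits' look like noise. For 'car', the list also contains words that describe a car, such as 'luxury'.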
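One practical note: the ppmi function above visits every cell of the matrix in a Python double loop, which gets slow as the vocabulary grows. The same formula can be written with NumPy broadcasting. The sketch below is my own variant, not the book's code. `ppmi_vectorized` is a hypothetical name, and the toy matrix is made up so that the two versions can be compared directly.

```python
import numpy as np

def ppmi_loop(C, eps=1e-8):
    """Reference: the same double loop as the ppmi function above."""
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)
    return M

def ppmi_vectorized(C, eps=1e-8):
    """Same formula, computed for all cells at once with broadcasting."""
    C = C.astype(np.float64)
    N = C.sum()
    S = C.sum(axis=0)
    # np.outer(S, S)[i, j] == S[i] * S[j], the denominator for cell (i, j).
    pmi = np.log2(C * N / np.outer(S, S) + eps)
    return np.maximum(0, pmi).astype(np.float32)

# Toy symmetric co-occurrence matrix (made up; every word occurs at least once).
C = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=np.int32)

print(np.allclose(ppmi_loop(C), ppmi_vectorized(C), atol=1e-6))
print(ppmi_vectorized(C))
```

The vectorized form trades memory for speed: it creates several temporary float64 arrays of shape (vocab_size, vocab_size), so on a large vocabulary it may need a few gigabytes of RAM.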

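■Conclusion

This time I turned the downloaded PTB dataset into a usable form, built distributed representations with the methods I have studied so far, and computed word similarities.

The results confirm that these count-based methods produce sensible neighbours on a realistic corpus.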

■References

Andreas C. Muller, Sarah Guido. Pythonではじめる機械学習. Translated by 中田秀基. O'Reilly Japan. 2017. 392p.
斎藤康毅. ゼロから作るDeep Learning Pythonで学ぶディープラーニングの理論と実装. O'Reilly Japan. 2016. 320p.
斎藤康毅. ゼロから作るDeep Learning② 自然言語処理編. O'Reilly Japan. 2018. 432p.
ChatGPT. 4o mini. OpenAI. 2024. https://chatgpt.com/
API Reference. scikit-learn.org. https://scikit-learn.org/stable/api/index.html
PyTorch documentation. pytorch.org. https://pytorch.org/docs/stable/index.html
Keiron O’Shea, Ryan Nash. An Introduction to Convolutional Neural Networks. https://ar5iv.labs.arxiv.org/html/1511.08458
API Reference. scipy.org. 2024. https://docs.scipy.org/doc/scipy/reference/index.html