NLP 100 Exercise (言語処理100本ノック)

39. Zipf's law

Plot a log-log graph with each word's frequency rank on the horizontal axis and its frequency on the vertical axis.

%matplotlib inline
import re
from collections import Counter

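# Read the MeCab output and collect the morphemes of each sentence.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab",encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        line = line.rstrip("\n")
        if line == "EOS":
            # EOS marks the end of a sentence; empty sentences are skipped.
            # Comparing the whole line avoids matching a word that merely contains "EOS".
            if len(keitaiso)>0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            # Columns after the split: surface, pos, pos1, ..., base form at index 7.
            cols = re.split(r'[\t,]',line)
            keitaiso.append({"surface":cols[0],"base":cols[7],"pos":cols[1],"pos1":cols[2]})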

# Count how often each surface form occurs.
word_count = Counter()
for sentence in sentences:
    for morpheme in sentence:
        word_count[morpheme["surface"]] += 1
# (surface, count) pairs in descending order of frequency.
# Named ranking so that the built-in list is not shadowed.
ranking = word_count.most_common()

import matplotlib.pyplot as plt
Y = [count for _, count in ranking]
# Ranks start at 1; a rank of 0 cannot be drawn on a logarithmic axis.
plt.loglog(range(1,len(Y)+1),Y)
plt.xlabel("Frequency rank")
plt.ylabel("Frequency")
plt.show()

[Figure: output of the code above, a log-log plot of word frequency against frequency rank]
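
Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so the log-log plot should lie close to a straight line with a slope near -1. One way to check this numerically is to fit a least-squares line to log(frequency) against log(rank). The following is only a minimal sketch using the standard library: `zipf_slope` and `toy_words` are hypothetical names that do not appear in the original code, and the toy data is constructed to follow the law exactly rather than taken from neko.txt.

```python
import math
from collections import Counter

def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct words; otherwise the variance is zero.
    """
    counts = [count for _, count in Counter(words).most_common()]
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical toy data: word i occurs 120 / i times (120, 60, 40, 30, 24, 20),
# which is an ideal Zipf distribution.
toy_words = [f"w{i}" for i in range(1, 7) for _ in range(120 // i)]
print(zipf_slope(toy_words))
```

On the toy data the slope is -1 up to floating-point error. For the real text, one could pass the surface forms collected in `sentences`, for example `zipf_slope([m["surface"] for s in sentences for m in s])`; no result is quoted here because it was not computed for this post.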