言語処理100本ノック (NLP 100 Exercise)

37. Top 10 most frequent words

Display the 10 most frequently occurring words and their frequencies in a graph (for example, a bar chart).

import re
import matplotlib.pyplot as plt

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# Count occurrences of each surface form.
word_count = {}
for sentence in sentences:
    for word in sentence:
        item = word["surface"]
        if item in word_count:
            word_count[item] += 1
        else:
            word_count[item] = 1

# Sort by descending frequency (renamed so it no longer shadows the built-in `list`).
freq = [(k, word_count[k]) for k in sorted(word_count, key=word_count.get, reverse=True)]

top10 = freq[:10]
plt.bar(range(10), [count for word, count in top10])
plt.xticks(range(10), [word for word, count in top10])  # Japanese labels need a CJK-capable font
plt.show()

[Figure: bar chart of the top 10 word frequencies]
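A minimal sketch of how one MeCab output line maps to the dict fields used in the parse loop above. The line below is a hypothetical example in the default MeCab (IPAdic) output format: the surface form, a tab, then comma-separated features, with the base form at feature index 6 (index 7 after the combined split).

```python
import re

# Hypothetical MeCab output line: surface \t pos,pos1,pos2,pos3,conj-type,conj-form,base,reading,pronunciation
sample = "生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ"
fields = re.split(r"[\t,]", sample)

morpheme = {
    "surface": fields[0],  # surface form (表層形)
    "base": fields[7],     # base form (基本形)
    "pos": fields[1],      # part of speech (品詞)
    "pos1": fields[2],     # POS subcategory (品詞細分類1)
}
print(morpheme)
```

Splitting on both the tab and the commas in one pass is why the base form lands at index 7 rather than 6.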


36. Word frequency

Find the words appearing in the text and their frequencies, and sort them in descending order of frequency.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# Count occurrences of each surface form.
word_count = {}
for sentence in sentences:
    for word in sentence:
        item = word["surface"]
        if item in word_count:
            word_count[item] += 1
        else:
            word_count[item] = 1

# Sort by descending frequency (renamed so it no longer shadows the built-in `list`).
freq = [(k, word_count[k]) for k in sorted(word_count, key=word_count.get, reverse=True)]
for k, v in freq:
    print(k, v)

<Results (excerpt)>

の 9194
。 7486
て 6868
、 6772
は 6420
に 6243
を 6071
と 5508
が 5337
た 3988
で 3806
「 3231
」 3225
も 2479
ない 2390
だ 2363
し 2322
から 2032
ある 1728
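The hand-rolled word_count loop and the sort could also be expressed with collections.Counter, which tallies and sorts in one step. A minimal sketch on a toy token list (in practice the tokens would come from the MeCab parse above):

```python
from collections import Counter

# Toy stand-in for the parsed sentences (lists of surface forms).
sentences = [["吾輩", "は", "猫", "で", "ある"],
             ["名前", "は", "まだ", "無い"]]

counter = Counter(word for sentence in sentences for word in sentence)

# most_common() yields (word, count) pairs in descending frequency order.
for word, count in counter.most_common():
    print(word, count)
```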


35. Noun sequences

Extract concatenations of nouns (nouns that appear consecutively) using the longest match.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

for sentence in sentences:
    rensetu = []
    for word in sentence:
        if word["pos"] == "名詞":
            rensetu.append(word["surface"])
        else:
            if len(rensetu) > 1:
                print(rensetu)
            rensetu = []
    # Flush a noun run that reaches the end of the sentence, so it is
    # neither dropped nor carried over into the next sentence.
    if len(rensetu) > 1:
        print(rensetu)

<Results (excerpt)>
['人間', '中']
['一番', '獰悪']
['時', '妙']
['一', '毛']
['その後', '猫']
['一', '度']
['ぷうぷうと', '煙']
['邸', '内']
['三', '毛']
['書生', '以外']
['四', '五', '遍']
['この間', 'おさん']
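The run detection can also be written with itertools.groupby, which partitions each sentence into maximal runs sharing the same key, so no manual flushing is needed. A sketch on hypothetical morpheme dicts (in practice they would come from the MeCab parse):

```python
from itertools import groupby

# Toy sentence: 名詞, 名詞, 助詞, 名詞
sentence = [
    {"surface": "人間", "pos": "名詞"},
    {"surface": "中", "pos": "名詞"},
    {"surface": "で", "pos": "助詞"},
    {"surface": "猫", "pos": "名詞"},
]

runs = []
for is_noun, group in groupby(sentence, key=lambda m: m["pos"] == "名詞"):
    words = [m["surface"] for m in group]
    if is_noun and len(words) > 1:  # keep only runs of two or more nouns
        runs.append(words)
print(runs)
```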


34. "A の B"

Extract noun phrases in which two nouns are connected by the particle 「の」.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# "A の B" is a noun, the particle の, then another noun.
for sentence in sentences:
    for index in range(1, len(sentence) - 1):
        prev, cur, nxt = sentence[index - 1], sentence[index], sentence[index + 1]
        if prev["pos"] == "名詞" and cur["surface"] == "の" and nxt["pos"] == "名詞":
            print(prev["surface"] + cur["surface"] + nxt["surface"])

<Results (excerpt)>

彼の掌
掌の上
書生の顔
はずの顔
顔の真中
穴の中
書生の掌
掌の裏
何の事
肝心の母親
藁の上
笹原の中
池の前
池の上    
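The triple test can also be written with zip over three staggered views of the sentence, avoiding index arithmetic entirely. A sketch on hypothetical morpheme dicts:

```python
# Toy sentence corresponding to 彼の掌.
sentence = [
    {"surface": "彼", "pos": "名詞"},
    {"surface": "の", "pos": "助詞"},
    {"surface": "掌", "pos": "名詞"},
]

phrases = []
# zip stops at the shortest view, so every (a, の, b) triple is visited once.
for a, no, b in zip(sentence, sentence[1:], sentence[2:]):
    if a["pos"] == "名詞" and no["surface"] == "の" and b["pos"] == "名詞":
        phrases.append(a["surface"] + no["surface"] + b["surface"])
print(phrases)
```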


33. Sahen nouns

Extract all nouns of the sahen (サ変接続, verbal noun) type.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# Print every morpheme whose POS subcategory is サ変接続.
for sentence in sentences:
    for word in sentence:
        if word["pos1"] == "サ変接続":
            print(word["surface"])

<Results (excerpt)>
見当
記憶
話
装飾
突起
運転
記憶
分別
決心
我慢
餓死
訪問
始末
猶予
遭遇
我慢
記憶
返報
勉強


32. Base forms of verbs

Extract all base forms of verbs.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# Print the base form of every verb.
for sentence in sentences:
    for word in sentence:
        if word["pos"] == "動詞":
            print(word["base"])

<Results (excerpt)>
生れる
つく
する
泣く
する
いる
始める
見る
聞く
捕える
煮る
食う
思う
載せる
られる
持ち上げる
られる


31. Verbs

Extract all surface forms of verbs.

import re

# Parse the MeCab output into a list of sentences, each a list of morpheme dicts.
sentences = []
with open("D:\\nlp100\\neko.txt.mecab", encoding="UTF-8") as fr:
    keitaiso = []
    for line in fr:
        if "EOS" in line:
            if len(keitaiso) > 0:
                sentences.append(keitaiso)
                keitaiso = []
        else:
            fields = re.split(r'[\t,]', line)
            keitaiso.append({"surface": fields[0], "base": fields[7],
                             "pos": fields[1], "pos1": fields[2]})

# Print the surface form of every verb.
for sentence in sentences:
    for word in sentence:
        if word["pos"] == "動詞":
            print(word["surface"])
<Results (excerpt)>

生れ
つか
し
泣い
し
いる
始め
見
聞く
捕え
煮
食う
思わ
載せ
られ
持ち上げ
られ
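Problems 31–33 are each a single-condition filter over the parsed morphemes, so each can be collapsed into one list comprehension. A sketch on hypothetical morpheme dicts (in practice `sentences` comes from the MeCab parse above):

```python
# Toy stand-in for the parsed sentences.
sentences = [[
    {"surface": "生れ", "base": "生れる", "pos": "動詞", "pos1": "自立"},
    {"surface": "た", "base": "た", "pos": "助動詞", "pos1": "*"},
]]

# Problem 31: surface forms of verbs; problem 32: base forms of verbs.
verb_surfaces = [w["surface"] for s in sentences for w in s if w["pos"] == "動詞"]
verb_bases = [w["base"] for s in sentences for w in s if w["pos"] == "動詞"]
print(verb_surfaces, verb_bases)
```

The same shape handles problem 33 by filtering on `w["pos1"] == "サ変接続"` instead.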