Reading "Data Visualization with Python and JavaScript"

6.3 Targeting HTML with xpath

XPath is a notation for addressing elements within the tree structure of an HTML document.

f:id:bitop:20170916140506p:plain

Running the Copy XPath command on the h2 returned:  
//*[@id="mw-content-text"]/div/h2[2]  
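To see how a path like this walks the tree, here is a minimal sketch using Python's standard-library xml.etree.ElementTree instead of the browser or Scrapy. The markup is a made-up, heavily reduced stand-in for the Wikipedia page, and ElementTree implements only a subset of XPath (a leading // has to be written as .//), so this is an illustration, not what the browser does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page (must be well-formed XML).
PAGE = """
<html><body>
  <div id="mw-content-text">
    <div>
      <h2><span class="mw-headline" id="Summary">Summary</span></h2>
      <p>Intro text</p>
      <h2><span class="mw-headline" id="Argentina">Argentina</span></h2>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Any element with id="mw-content-text", then its child div,
# then the second h2 child of that div.
h2 = root.find(".//*[@id='mw-content-text']/div/h2[2]")

# Collect all text below the h2 (here: the text of the inner span).
heading = "".join(h2.itertext())
print(heading)
```

In this fragment the path lands on the Argentina heading because it is the second h2 child of the inner div; the p element between the two headings is not counted.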

I also tried the other Copy commands.
outerHTML
<h2><span class="mw-headline" id="Argentina">Argentina</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_Nobel_laureates_by_country&amp;action=edit&amp;section=2" title="Edit section: Argentina">edit</a><span class="mw-editsection-bracket">]</span></span></h2>   
This copies the whole element, from <h2> to </h2>.  

Copy selector
#mw-content-text > div > h2:nth-child(11)
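The copied selector ends in h2:nth-child(11) while the copied XPath ends in h2[2], apparently for the same heading. The numbers differ because CSS :nth-child counts every sibling element regardless of tag, whereas the XPath position in h2[2] counts only h2 siblings. The sketch below computes both positions for one element in a made-up fragment, using the standard library only; the fragment and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: h2 elements mixed with other sibling tags.
div = ET.fromstring(
    "<div><p>intro</p><h2>Summary</h2><table/><p>note</p><h2>Argentina</h2></div>"
)

children = list(div)
target = div.findall("h2")[1]  # the "Argentina" heading

# CSS :nth-child(n) counts every sibling element (1-based).
nth_child = next(i for i, c in enumerate(children, start=1) if c is target)

# XPath h2[n] counts only siblings with the same tag (1-based).
h2_siblings = [c for c in children if c.tag == "h2"]
xpath_position = next(i for i, c in enumerate(h2_siblings, start=1) if c is target)

print(f"h2:nth-child({nth_child})  vs  h2[{xpath_position}]")
```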

6.3.1 Testing xpaths with the Scrapy shell

After creating the project with "> scrapy startproject nobel_winners", move into the nobel_winners folder and type scrapy shell at the command prompt. This starts an IPython-style shell.

Open settings.py, uncomment the DOWNLOAD_DELAY=3 line, and save. Then start the shell against the target page:
> scrapy shell https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country

In [1]: h2s = response.xpath('//h2')
In [2]: print(type(h2s))
<class 'scrapy.selector.unified.SelectorList'>
In [3]: len(h2s)
Out[3]: 72
The count differs slightly from the one in the book.
h2 = h2s[0]
Typing h2. and pressing the Tab key lists:
 css()                re()                 select()
 extract()            re_first()           selectorlist_cls
 extract_unquoted()   register_namespace() type
 get()                remove_namespaces()  xpath()
 getall()             response             h2.text
 namespaces           root

 Compared with the book, re_first(), selectorlist_cls, extract_unquoted(), get(), getall(), and root are new.  
In [7]: h2.extract()
Out[7]: '<h2>Contents</h2>'

This appears to be the h2 tag from the part of the page shown here:

f:id:bitop:20170916152115p:plain
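The steps so far (select every h2, count them, serialize the first one) can be tried offline. This sketch uses the standard-library xml.etree.ElementTree on a made-up fragment as a stand-in for response.xpath() and extract(). It is not how Scrapy implements them, and the count will not match the live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the page, not the real Wikipedia markup.
PAGE = """
<body>
  <div><h2>Contents</h2></div>
  <h2><span class="mw-headline" id="Summary">Summary</span></h2>
  <h2><span class="mw-headline" id="Argentina">Argentina</span></h2>
</body>
"""
root = ET.fromstring(PAGE.strip())

# './/h2' matches h2 elements at any depth, like '//h2' in the shell session.
h2s = root.findall(".//h2")
print(len(h2s))

# Serializing an element back to markup plays the role of extract().
first = ET.tostring(h2s[0], encoding="unicode").strip()
print(first)
```

As in the shell session, the first match is the table-of-contents heading and not a country, because the search covers the whole document.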

In [8]: h2s[1].extract()
Out[8]: '<h2><span class="mw-headline" id="Summary">Summary</span><span class="m
w-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.ph
p?title=List_of_Nobel_laureates_by_country&amp;action=edit&amp;section=1" title=
"Edit section: Summary">edit</a><span class="mw-editsection-bracket">]</span></s
pan></h2>'
In [9]: h2_arg = h2s[1]
In [10]: country = h2_arg.xpath('span[@class="mw-headline"]/text()').extract()
In [11]: country
Out[11]: ['Summary']

This is the h2 it picked up:
f:id:bitop:20170916155544p:plain

This differs from the book; the page structure has probably changed since it was written.  
In [12]: h2_arg = h2s[2]
In [13]: country = h2_arg.xpath('span[@class="mw-headline"]/text()').extract()
In [14]: country
Out[14]: ['Argentina']
This returns the country name.
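Because the index of a given heading within h2s shifted between the book and the live page, picking headings by position is fragile. One option, sketched here with ElementTree on a made-up fragment, is to apply the relative path to every h2 and keep only those that actually contain a mw-headline span. ElementTree has no text() step, so the .text attribute is used in its place.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the three headings seen in the session.
PAGE = """
<body>
  <h2>Contents</h2>
  <h2><span class="mw-headline" id="Summary">Summary</span></h2>
  <h2><span class="mw-headline" id="Argentina">Argentina</span></h2>
</body>
"""
root = ET.fromstring(PAGE.strip())

headlines = []
for h2 in root.findall(".//h2"):
    # Relative path: evaluated from this h2, not from the document root.
    span = h2.find("span[@class='mw-headline']")
    if span is not None:  # the plain "Contents" heading has no such span
        headlines.append(span.text)

print(headlines)
```

Non-country headings such as Summary still pass this check, so a real scraper would need a further filter.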
In [15]: ol_arg = h2_arg.xpath('following-sibling::ol[1]')
In [16]: ol_arg
Out[16]: [<Selector xpath='following-sibling::ol[1]' data='<ol>\n<li><a href="/w
iki/C%C3%A9sar_Milst'>]
In [17]: ol_arg = h2_arg.xpath('following-sibling::ol[1]')[0]
In [18]: ol_arg
Out[18]: <Selector xpath='following-sibling::ol[1]' data='<ol>\n<li><a href="/wi
ki/C%C3%A9sar_Milst'>
In [19]: lis_arg = ol_arg.xpath('li')
In [20]: lis_arg
Out[20]:
[<Selector xpath='li' data='<li><a href="/wiki/C%C3%A9sar_Milstein" '>,
<Selector xpath='li' data='<li><a href="/wiki/Adolfo_P%C3%A9rez_Esq'>,
<Selector xpath='li' data='<li><a href="/wiki/Luis_Federico_Leloir"'>,
<Selector xpath='li' data='<li><a href="/wiki/Bernardo_Houssay" tit'>,
<Selector xpath='li' data='<li><a href="/wiki/Carlos_Saavedra_Lamas'>]
In [21]: len(lis_arg)
Out[21]: 5
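following-sibling::ol[1] selects the first ol that comes after the heading among its siblings. ElementTree does not implement that axis, so this sketch emulates it with a hypothetical helper, first_following_sibling, that scans the parent's children in document order. The fragment and the "Country B" entry are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each heading is followed by its own ordered list.
PAGE = """
<div>
  <h2>Argentina</h2>
  <ol><li>Luis Federico Leloir</li><li>Bernardo Houssay</li></ol>
  <h2>Country B</h2>
  <ol><li>Laureate X</li></ol>
</div>
"""


def first_following_sibling(parent, node, tag):
    """Return the first child of parent with the given tag that comes after node."""
    seen = False
    for child in parent:
        if child is node:
            seen = True
        elif seen and child.tag == tag:
            return child
    return None


parent = ET.fromstring(PAGE.strip())
h2_arg = parent.findall("h2")[0]

# Emulates h2_arg.xpath('following-sibling::ol[1]')[0] from the session.
ol_arg = first_following_sibling(parent, h2_arg, "ol")
names = [li.text for li in ol_arg.findall("li")]
print(names)
```

ElementTree elements do not keep a reference to their parent, which is why the helper takes the parent explicitly.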
In [22]: li = lis_arg[0]
In [23]: li.ex
2017-09-16 15:32:08 [py.warnings] WARNING: C:\Users\joshua\Anaconda3\lib\site-packages\jedi\evaluate\compiled\__init__.py:328: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
getattr(obj, name)
In [24]: li.extract()
Out[24]: '<li><a href="/wiki/C%C3%A9sar_Milstein" title="César Milstein">César M
ilstein</a>, Physiology or Medicine, 1984</li>'
In [25]: name=li.xpath('a//text()')[0].extract()
In [26]: name
Out[26]: 'César Milstein'
In [27]: list_text = li.xpath('descendant-or-self::text()').extract()
In [28]: list_text
Out[28]: ['César Milstein', ', Physiology or Medicine, 1984']
In [29]: ''.join(list_text)
Out[29]: 'César Milstein, Physiology or Medicine, 1984'
This yields the name, category, and award year.
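The joined string still has to be split into fields. The sketch below repeats the text-joining step with ElementTree's itertext(), used here as a stand-in for descendant-or-self::text(), and then splits the result. It assumes every entry has the shape "name, category, year", which is true of this item but may not hold for every entry on the page.

```python
import xml.etree.ElementTree as ET

# One list item as printed by li.extract() in the session above.
LI = (
    '<li><a href="/wiki/C%C3%A9sar_Milstein" title="César Milstein">'
    "César Milstein</a>, Physiology or Medicine, 1984</li>"
)
li = ET.fromstring(LI)

# itertext() walks every text node under li in document order.
text = "".join(li.itertext())

# Assumption: the entry has the shape "name, category, year".
# Splitting from the right keeps any commas inside the name intact.
name, category, year = (part.strip() for part in text.rsplit(",", 2))
record = {"name": name, "category": category, "year": int(year)}
print(record)
```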