Hadoop Source Code Reading, 2nd session ...

Analyzing Wikipedia with Hadoop (≈ analyzing XML with Hadoop)

Blog: http://d.hatena.ne.jp/yamiura/

Twitter: yamiura

Wikipedia data (XML)

A 16 GB compressed file! XML!

<page>
  <title>GNU Free Documentation License</title>
  <id>75</id>
  <revision>
    <id>135</id>
    <timestamp>2002-12-17T06:04:47Z</timestamp>
    <contributor>
      <username>Tomos</username>
      <id>10</id>
    </contributor>
    <comment>さわり/just started</comment>
    <text xml:space="preserve">[[GNU]]　Free Documentation　Licenseの略称。
GNU フリー文書利用許諾契約書として、・・・・・・・・・
  <revision>
    <id>7103</id>
    <timestamp>2003-02-25T16:40:31Z</timestamp>
    <contributor>
      <ip>211.123.199.231</ip>
    </contributor>

Contents of the XML

During work hours?

What about categories?

Posting from the office?

A dream XML packed with every kind of information!!!
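To make the questions above concrete, here is a minimal sketch of pulling the title, the revision timestamps, and the contributor names or IP addresses out of one `<page>` element with the JDK's StAX parser. The class name `PageFields` and the abridged sample page are illustrative assumptions, not code from the talk, and the sketch ignores the article text itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical holder for a few fields of one <page> element.
class PageFields {
    String title;
    final List<String> timestamps = new ArrayList<>();   // one per revision
    final List<String> contributors = new ArrayList<>(); // <username> or <ip>

    // Parses one well-formed <page>...</page> string.
    static PageFields parse(String pageXml) {
        PageFields f = new PageFields();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(pageXml));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = r.getLocalName();
                if (name.equals("title")) {
                    f.title = r.getElementText();
                } else if (name.equals("timestamp")) {
                    f.timestamps.add(r.getElementText());
                } else if (name.equals("username") || name.equals("ip")) {
                    f.contributors.add(r.getElementText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not a well-formed <page> element", e);
        }
        return f;
    }

    public static void main(String[] args) {
        // Abridged, closed-off version of the sample page (assumption).
        String page = "<page><title>GNU Free Documentation License</title><id>75</id>"
                + "<revision><id>135</id><timestamp>2002-12-17T06:04:47Z</timestamp>"
                + "<contributor><username>Tomos</username><id>10</id></contributor></revision>"
                + "<revision><id>7103</id><timestamp>2003-02-25T16:40:31Z</timestamp>"
                + "<contributor><ip>211.123.199.231</ip></contributor></revision></page>";
        PageFields f = parse(page);
        System.out.println(f.title + " " + f.timestamps + " " + f.contributors);
    }
}
```

In a Hadoop job, logic of this kind would typically run inside the Mapper, once per `<page>` record.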

http://www.mwsoft.jp/programming/munou/wikipedia_data_list.html

Reference: a page explaining the types of XML dumps

Now for the main topic: processing XML with Hadoop

The class that determines the input to Map

By default, the input is one line at a time
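The default can be imitated in plain Java to see why it is a poor fit for XML. Hadoop's default TextInputFormat hands the Mapper one record per line, keyed by the byte offset of that line. The sketch below reproduces that behaviour without any Hadoop dependency; the class `LineInputSketch` is hypothetical and assumes `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for line-oriented input: (byte offset, line) records.
class LineInputSketch {
    static Map<Long, String> records(String input) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            out.put(offset, line);
            // Advance by the line's UTF-8 length plus the newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<page>\n<title>GNU</title>\n</page>\n";
        // Each record is only a fragment of the <page> element.
        records(xml).forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

A single `<page>` ends up spread over several records, so no Mapper call ever sees a complete element.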

Sketch of the main method that defines the Job

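What are the strikethroughs and the yellow underlines?

Old (deprecated?) classes and new classes

Mapper and Reducer are in the same state

The old (deprecated?) classes have more implementations...

Not recommended

The old classes are overwhelmingly richer

Newer ≠ better

This is how I felt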

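The old API has a class for XML processing

It lives in Hadoop Streaming, though the class can also be used from ordinary Hadoop jobs

The new API has no class for XML processing

orz... But writing your own is fairly easy!!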
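As a rough illustration of why a home-made reader is manageable, the core is just: skip ahead to a start tag, then copy bytes until the end tag. The sketch below does that in plain Java over an InputStream. The names `PageRecordScanner`, `readUntilMatch`, and `scan` are hypothetical, and this is not the speaker's implementation; in a real job the loop would sit inside a custom RecordReader created by an InputFormat, which would also have to respect split boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: cuts a stream into <page>...</page> records.
class PageRecordScanner {
    private static final byte[] START = "<page>".getBytes(StandardCharsets.UTF_8);
    private static final byte[] END = "</page>".getBytes(StandardCharsets.UTF_8);

    // Reads until `match` has been seen; copies every byte read into `sink` if it is non-null.
    // Returns false on end of stream. The simple fallback on mismatch is enough here
    // because '<' occurs only at the start of both patterns.
    static boolean readUntilMatch(InputStream in, byte[] match, ByteArrayOutputStream sink)
            throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (sink != null) {
                sink.write(b);
            }
            if (b == match[matched]) {
                matched++;
                if (matched == match.length) {
                    return true;
                }
            } else {
                matched = (b == match[0]) ? 1 : 0;
            }
        }
        return false;
    }

    // Returns every complete <page>...</page> record; a truncated last record is dropped.
    static List<String> scan(InputStream in) {
        List<String> records = new ArrayList<>();
        try {
            while (readUntilMatch(in, START, null)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                buf.write(START, 0, START.length);
                if (!readUntilMatch(in, END, buf)) {
                    break;
                }
                // Decode only after the whole record is collected, so multi-byte
                // characters are never split.
                records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A</title></page>\n"
                + "<page><title>B</title></page></mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String record : scan(in)) {
            System.out.println(record);
        }
    }
}
```

Matching on the literal `<page>` works here only because that tag carries no attributes in the sample; a tag such as `<text xml:space="preserve">` would need prefix matching instead.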

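Summary
- XML processing is possible too
  - Controlled through InputFormat and RecordReader
- There are traps
  - (Take care when strict processing is required)
  - (Use the Wik-IE code as a reference)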
