
Web-based Ultra-Large-Scale Corpora at NINJAL

Masayuki ASAHARA, Mizuho IMADA, Sachi YASUDA, Hikari KONISHI, Kikuo MAEKAWA

National Institute for Japanese Language and Linguistics, Japan
Center for Corpus Development

[Introduction] National Institute for Japanese Language and Linguistics (NINJAL), Japan

• Founded as the ‘National Language Research Institute’ in 1948
• Located in Tachikawa, Tokyo, since 2005
• ‘Center for Corpus Development’
  – Released ‘Corpus of Spontaneous Japanese (CSJ)’ (2001-2005)
  – Released ‘Balanced Corpus of Contemporary Written Japanese (BCCWJ)’ (2006-2010)
  – Developing ‘NINJAL Historical/Diachronic Corpus’ (2011-2015)
  – Developing ‘NINJAL Web Corpus’ (2011-201)

2014/05/19 IIPC Open Day 2

[Introduction] An ongoing NINJAL project: Compilation of a web-scale Japanese corpus

Project goal: To compile a ten-billion-word corpus of web texts for linguistic research

• Covering rarely occurring linguistic expressions
• Ensuring balanced sampling over time (seasons) and domains
• Profiling originators
• Annotating word boundaries, morphological information, and syntactic dependency structures
• Providing a search environment including metadata, strings, and annotations

Project term: Late fiscal year (FY) 2011 to the end of FY 2015

Table of contents

• Introduction
• Previous studies
  – Japanese Web corpora and linguistic resources
• Design of a web-scale Japanese corpus
  – Four basic technologies
    • Page collection
    • Linguistic annotation
    • Release
    • Preservation
• Research progress
• Conclusion


JAPANESE WEB CORPORA AND LINGUISTIC RESOURCES

[Previous Studies]


[Previous Studies] Basic premise: Copyright Law of Japan

• ‘Publishing collected web texts’ is in a legal grey area
  Types of publication
  • Publish word list and n-gram (label: wordlist)
  • Provide search environment with snippets (label: search)
  • Publish resources by the content's copyright holders (label: copyrighted)
  • Compile data in countries other than Japan (label: foreign)
  • Exception: Web Archiving Project (WARP) by National Diet Library (NDL), Japan (label: NDL)
• Who has created Japanese web-scale language resources (JWLR)?
  Types of developer
  • Private companies
  • Universities and public institutes
  • Individuals
  • Foreign researchers

[Previous studies] JWLRs created by private companies

• Google: ‘Japanese Web n-gram Version 1’ [publication type: wordlist]
  – Word n-grams from web texts (255 billion tokens)
• Baidu: ‘Baidu Blog and Forum-Times Corpus’ [publication type: wordlist]
  – Word list and n-grams from blogs and BBSs
  – Ten million sentences crawled from 2000–2010
• Baidu: ‘Baidu Mobile Web Corpus with Emoji’ [publication type: wordlist]
  – Word list and n-grams of texts used for mobile search
• Rakuten: ‘Rakuten Data Release’ [publication type: copyrighted]
  – Review data from internet shopping mall
• Yahoo Japan: ‘Yahoo Answers Corpus Version 2’ [publication type: copyrighted]
  – 26 million questions and 73 million answers

[Previous studies] JWLRs by universities and public institutions

• NICT: ‘Japanese Syntactic Dependency Database Version 1.1’ [publication type: wordlist]
  – 480 million syntactic dependency relations in 600 million pages and 43 billion sentences
• Kyoto University: ‘Kyoto-U Case Frames (Version 1.0)’ in 2009 [publication type: wordlist]
  – 40,000 case frames from 1.6 billion sentences
• Tsukuba-U: ‘Tsukuba Web Corpus’ [publication type: search]
  – 1.1 billion-word text corpus developed by lexical profiling using Yahoo API
• NDL: ‘Web Archive Project’ [publication type: NDL]
  – Web archive of the official websites of Japanese institutions

[Previous studies] JWLR created by individuals

• Yata: ‘Japanese Web Corpus 2010’ [publication type: wordlist]
  – HTML and text archive using the Yahoo API in 2010
  – Seed lexicon for Web API is IPADIC-2.7.0
  – Provides original texts and word n-grams

[Previous studies] JWLRs created in countries other than Japan

• [Ueyama and Baroni 2005] [publication type: foreign]
  – Two web corpora: 3.5 + 4.5 million words
• [Baroni and Ueyama 2006] [publication type: foreign]
  – Blog data: 62 million words
• [Srdanovic+ 2008] [publication type: foreign]
  – ‘JPWaC 2008’: 400 million words
• [Pomikalek and Suchomel 2012] [publication type: foreign]
  – ‘JpTenTen11’: 10 billion-word text corpus developed by crawling in 2011


FOUR BASIC TECHNOLOGIES
[Design of a web-scale Japanese corpus]

1. Page collection
2. Linguistic annotation
3. Release
4. Preservation

[Design of a web-scale Japanese corpus] Four basic technologies

1. Page collection
   Crawling techniques, strategies, and plans
2. Linguistic annotation
   Character normalisation, word segmentation, morphological information annotation, syntactic dependency parsing, and register estimation
3. Release
   How to make the corpus publicly available
4. Preservation
   Web archive in chronological order

[Design of a web-scale Japanese corpus] Four basic technologies: 1. Page collection

Performing remote harvesting (bulk collection) using a web crawler

How?
• Heritrix Crawler (Version 3.1)
  – Developed by Internet Archive (United States)
  – Used by national libraries (e.g., NDL in Japan)
• Crawling strategy and plan
  – Crawling Japanese web pages including spam blogs (splogs) and machine-generated pages
  – Crawling 100 million pages every three months (fixed-point observation)
  – Changing target pages yearly

[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

Four sorts of (automatic) annotation
2.1 Normalisation
  – HTML-to-text and character-encoding normalisation
2.2 Japanese morphological analysis
  – Word segmentation and POS annotation
2.3 Japanese dependency analysis
  – Syntactic dependency structure annotation
2.4 Register estimation
  – Metadata alternative

[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

2.1 Normalisation
• HTML to text and character encoding issues*
  – NWC (Nihongo Web Corpus) Toolkit [Yata 2010], compatible with the Google Web Japanese n-gram method

* Japanese character encoding: encoding Japanese characters for use on a computer. Several standard methods exist, including JIS, Shift-JIS, EUC, and Unicode.
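The normalisation step above can be illustrated with a minimal sketch. It is not the NWC Toolkit's implementation: the candidate encoding order, the use of NFKC, and the helper names `decode_japanese` and `html_to_text` are all assumptions made for illustration, using only the Python standard library.

```python
import unicodedata
from html.parser import HTMLParser

# Encodings tried in order; this list and its order are illustrative assumptions.
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_japanese(raw: bytes) -> str:
    """Decode bytes by trying common Japanese encodings in turn."""
    # ISO-2022-JP ("JIS") is 7-bit and signalled by ESC sequences, so try it first.
    if b"\x1b" in raw:
        try:
            return raw.decode("iso2022_jp")
        except UnicodeDecodeError:
            pass
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the page but mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw: bytes) -> str:
    """Decode a page, strip markup, and apply NFKC character normalisation."""
    parser = _TextExtractor()
    parser.feed(decode_japanese(raw))
    parser.close()
    return unicodedata.normalize("NFKC", "\n".join(parser.parts))
```

NFKC folds full-width Latin letters and digits into their ASCII forms, which is one common way to make differently encoded pages comparable; whether the project used this exact form is not stated in the slides.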

[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

2.2 Japanese morphological analysis
• Part-of-speech (POS) tagset and word unit
  – UniDic POS tagset (Kokugo-ken Short Unit)
    • Analyser: MeCab with UniDic
  – UniDic POS tagset (Kokugo-ken Long Unit)
    • Analyser: MeCab with UniDic and Chunker CRF++
  – Masuoka–Takubo POS tagset
    • Analyser: JUMAN or MeCab with JUMAN-compatible dictionary
  – Purely unsupervised word unit without POS
    • Analyser: Bayesian unsupervised word segmenter [Mochihashi 2009]

[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

2.3 Japanese dependency analysis
• Dependency annotation standard
  – Kyoto text corpus standard
    • The de facto standard in Japan
    • Analyser: KNP or CaboCha
  – BCCWJ Standard
    • Covers phenomena in web texts
      – Sentence fragments, scrambling, URLs, and smileys
    • Analyser: CaboCha with the Balanced Corpus of Contemporary Written Japanese (BCCWJ)

[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

2.4 Register estimation
• Register (style) as a category of page metadata
  – Unsupervised clustering and manual annotation on the representative pages
  – (Semi-supervised) register annotation using BCCWJ metadata

[Design of a web-scale Japanese corpus] Four basic technologies: 3. Release

Three sorts of release
3.1 Online release: search application
3.2 Offline release: word list and n-gram release
3.3 Natural language analysers

[Design of a web-scale Japanese corpus] Four basic technologies: 3. Release

3.1 Online release: search application
• 10 billion-scale search application as a web service
  – String search
  – Word-unit- and POS-based query
    e.g., Chuunagon by NINJAL using BCCWJ
  – Bunsetsu unit and dependency-based query
    e.g., ChaKi.NET by Nara Institute of Science and Technology (NAIST)
  – Facet navigation by register information

[Design of a web-scale Japanese corpus] Four basic technologies: 3. Release

3.2 Offline release: quarterly word list and n-gram release
  – Word list
    • With morphological information, orthography-based
  – Character n-gram
    • Without morphological information, orthography-based
  – Word n-gram
    • Without morphological information, lemma-based
  – Frequent subtrees in dependency structure
  – Frequent HTML tags
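As a rough illustration of the character and word n-gram lists in the offline release, the sketch below counts n-grams with the Python standard library. It assumes sentences are already tokenised (for example into lemmas by a morphological analyser, which is not shown); the function names and the `min_count` cut-off parameter are hypothetical and not taken from the project.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def char_ngrams(sentence: str, n: int) -> List[str]:
    """All character n-grams of a sentence (orthography-based)."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def word_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All word n-grams of one tokenised sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_word_ngrams(
    sentences: Iterable[Sequence[str]], n: int, min_count: int = 1
) -> Dict[Tuple[str, ...], int]:
    """Count word n-grams over sentences, keeping those seen at least min_count times."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(word_ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


sentences = [["し", "て", "い", "ます"], ["し", "て", "い", "た"]]
frequent_bigrams = count_word_ngrams(sentences, 2, min_count=2)
```

At web scale the counts would not fit in one in-memory `Counter`; the sketch only shows the counting logic.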

[Design of a web-scale Japanese corpus] Four basic technologies: 3. Release

3.3 Natural language analysers
Develop natural language analysers for web-scale corpus statistics
  – Lexicon for Japanese morphological analyser
  – Japanese dependency analyser based on co-occurrence statistics

[Design of a web-scale Japanese corpus] Four basic technologies: 4. Preservation

Preserve the data collected for linguistic studies to monitor any changes
• Web ARChive (WARC) format
  – A web archive preservation format
• Open-source wayback (hot backup)
  – Harvesting WARC files on a web application
  – Same as internet archives
• Linear Tape-Open (LTO) tape libraries (cold backup)


RESEARCH PROGRESS


[Research Progress]

Page collection:
  – Began in October 2012 (2012-4Q)
  – Crawled six quarters (from 2012-4Q to 2014-1Q)

Linguistic annotation:
  – Analysed four quarters of data (from 2012-4Q to 2013-3Q)

Statistical data (from 2012-4Q to 2013-3Q):
  – Collected pages and page conflicts
  – Collected links
  – Analysed data: number of morphemes and sentences
  – N-grams
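The quarter labels used above can be reproduced with a short sketch, assuming they denote calendar quarters (October falls in 4Q, which matches the stated start). The helper names are hypothetical, and the day of the month is a placeholder since the slides give only the month.

```python
from datetime import date
from typing import List


def quarter_label(d: date) -> str:
    """Format a date as a calendar-quarter label such as '2012-4Q'."""
    return f"{d.year}-{(d.month - 1) // 3 + 1}Q"


def quarter_range(start: date, count: int) -> List[str]:
    """List `count` consecutive quarter labels, starting at the quarter of `start`."""
    year, quarter = start.year, (start.month - 1) // 3 + 1
    labels = []
    for _ in range(count):
        labels.append(f"{year}-{quarter}Q")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels


# Six quarters from October 2012 end at 2014-1Q, as stated on the slide.
crawled = quarter_range(date(2012, 10, 1), 6)
```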

[Research Progress] Collected pages from 2012-4Q to 2013-3Q

One quarter statistics
• 100 million crawl attempts → 60 million crawled pages
  (the loss is caused by HTTP errors and observance of the Robots Exclusion Protocol)
• 60 million crawled pages → 42-45 million deduplicated pages (72.9-74.5%)

Four quarters statistics
• 42.7% of URLs are unmodified in four crawls
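Part of the gap between crawl attempts and crawled pages comes from observing the Robots Exclusion Protocol. The crawler handles this itself; the sketch below only illustrates the rule with Python's standard `urllib.robotparser`. The user-agent string, the example URLs, and the helper name `allowed_urls` are hypothetical.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def allowed_urls(
    robots_txt: str, urls: Iterable[str], user_agent: str = "example-crawler"
) -> List[str]:
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines (no network access).
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.jp/index.html",
    "http://example.jp/private/a.html",
]
fetchable = allowed_urls(robots, candidates)
```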

                                 2012-4Q      2013-1Q      2013-2Q      2013-3Q
Tokens of pages (1 quarter)      61,668,805   58,844,092   61,479,268   57,892,917
Deduplicated numbers of pages    45,933,605   42,932,982   45,111,527   42,192,931
  (share of tokens)              74.5%        73.0%        73.4%        72.9%

Statistics of page changes in 4 quarters
Types of URLs (4 quarters)       64,539,233
Numbers of unmodified URLs       27,604,915 (42.7%)
Numbers of modified URLs         36,934,706 (57.3%)
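The slides do not state how pages were deduplicated or how a URL was judged unmodified. As one possible reading, the sketch below treats pages with byte-identical bodies as duplicates and treats a URL as unmodified when it appears in every crawl with the same body. SHA-1 and all function names are assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def content_digest(body: bytes) -> str:
    """Digest of a page body; identical bodies give identical digests."""
    return hashlib.sha1(body).hexdigest()


def deduplicate(pages: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Map each distinct digest to the first URL seen with that body."""
    first_url: Dict[str, str] = {}
    for url, body in pages:
        first_url.setdefault(content_digest(body), url)
    return first_url


def unmodified_urls(crawls: List[Dict[str, bytes]]) -> Set[str]:
    """URLs present in every crawl whose body digest never changes."""
    common = set(crawls[0]).intersection(*crawls[1:])
    return {
        url
        for url in common
        if len({content_digest(crawl[url]) for crawl in crawls}) == 1
    }


pages = [
    ("http://a.example/1", b"same"),
    ("http://b.example/copy", b"same"),  # a copy of the first page
    ("http://a.example/2", b"other"),
]
deduplicated_ratio = len(deduplicate(pages)) / len(pages)
```

Exact-match digests miss near-duplicates such as pages that differ only in a timestamp, so a production pipeline might need a fuzzier criterion.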

[Research Progress] Page conflict in 2012-4Q

Figure annotations: copy sites; robots.txt or ‘soft 404’

[Research Progress] Collected links from 2012-4Q to 2013-3Q

• Seed URLs: 100 million URLs
  → Seven billion links (tokens) in one quarter
  → 843-892 million links (types) in one quarter
  ⇒ 1.6 billion links (types) in four quarters

                  2012-4Q         2013-1Q         2013-2Q         2013-3Q
Links (Tokens)    6,905,805,383   6,610,763,700   7,064,611,259   7,222,958,033
Links (Types)     892,135,930     843,166,672     865,694,816     855,684,918

Statistics in 4 quarters
Links (Types)     1,642,699,579
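The token and type counts above differ in a simple way: tokens count every link occurrence, types count distinct link targets, and the four-quarter type count is the size of the union across quarters. That is why it (about 1.64 billion) is smaller than the sum of the per-quarter type counts (about 3.46 billion). A minimal sketch, with a hypothetical function name and toy data:

```python
from typing import Dict, List, Tuple


def link_statistics(
    quarters: Dict[str, List[str]]
) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Return per-quarter (tokens, types) and the number of types over all quarters."""
    per_quarter: Dict[str, Tuple[int, int]] = {}
    all_types = set()
    for label, links in quarters.items():
        types = set(links)
        per_quarter[label] = (len(links), len(types))
        all_types |= types  # union: a link seen in several quarters counts once
    return per_quarter, len(all_types)


toy = {"2012-4Q": ["a", "a", "b"], "2013-1Q": ["b", "c"]}
per_quarter, total_types = link_statistics(toy)
```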

[Research Progress] Incoming links from our seed URLs to target in 2012-4Q

[Research Progress] Statistics of analysed data

60 million URLs
⇒ 60 billion morphemes (without sentence extraction)
⇒ 30 billion morphemes (with sentence extraction; about 50% filtered out as non-Japanese text)
⇒ 2.5 billion sentences (tokens) and one billion sentences (types)

                                                2012-4Q          2013-1Q          2013-2Q          2013-3Q
Number of WARC files                            814              870              910              905
Number of URLs                                  61,668,805       58,844,092       61,479,268       57,892,917
Number of morphemes (w/o sentence extraction)   64,714,650,129   62,077,520,745   63,414,252,638   65,736,027,334
Number of morphemes (w/ sentence extraction)    33,767,409,441   32,651,138,004   33,073,991,355   30,923,912,566
  (share retained)                              52.2%            52.6%            52.2%            47.0%
Number of sentences (tokens)                    2,678,315,774    2,600,122,908    2,659,617,620    2,478,309,312
Number of sentences (types)                     1,097,011,506    1,048,772,913    1,063,649,324    1,007,771,383
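The slides do not describe how sentence extraction decides that a text is non-Japanese. One simple heuristic, shown here purely as an assumption, keeps a sentence when it contains at least one kana character and when kana plus kanji make up a minimum share of its non-space characters. The Unicode ranges, the 0.5 threshold, and the function names are illustrative and are not the project's or the NWC Toolkit's criteria.

```python
from typing import Iterable, Tuple


def _is_kana(ch: str) -> bool:
    # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) blocks.
    return "\u3040" <= ch <= "\u30ff"


def _is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs (U+4E00-U+9FFF).
    return "\u4e00" <= ch <= "\u9fff"


def looks_japanese(sentence: str, min_ratio: float = 0.5) -> bool:
    """Heuristic: require kana, and a minimum share of kana or kanji characters."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    kana = sum(1 for c in chars if _is_kana(c))
    kanji = sum(1 for c in chars if _is_kanji(c))
    return kana > 0 and (kana + kanji) / len(chars) >= min_ratio


def sentence_counts(sentences: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, types) for the sentences that pass the filter."""
    kept = [s for s in sentences if looks_japanese(s)]
    return len(kept), len(set(kept))
```

Requiring kana keeps Chinese text, which shares kanji code points, from passing the filter.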

[Research Progress] Sentence duplication in 2012-4Q data

Figure annotations:
• Sentences appearing only once in the corpus
• The most frequent one: ‘職業とキャリア’ (occupation and career) in Yahoo! Answers
• Titles, anchor texts of links or fixed phrases

[Research Progress] Statistics of n-gram data

                               Our Web Corpus (2012-4Q)   Our Web Corpus (2012-4Q)   Google N-gram
                               Deduplicated sentences     Original sentences
                               (n ≧ 3)                    (n ≧ 3)                    (n ≧ 20)
Number of morphemes (tokens)   18.0 billion               33.7 billion               255 billion
Number of sentences            1.0 billion                2.6 billion                20 billion
1-gram                         3.9 million                5.0 million                2.5 million
2-gram                         47 million                 85 million                 80 million
3-gram                         160 million                440 million                390 million
4-gram                         210 million                870 million                700 million
5-gram                         170 million                1030 million               770 million
6-gram                         120 million                970 million                680 million
7-gram                         84 million                 850 million                570 million

Our Web Corpus (2012-4Q), deduplicated
Rank | 1-gram | 2-gram | 3-gram | 4-gram
1 | の | し て | て い ます | し て い ます
2 | に | まし た | て い た | て い まし た
3 | て | て い | し て い | さ れ て いる
4 | が | て いる | し て いる | し て い た
5 | は | し た | と 思い ます | さ れ て い
6 | を | で は | さ れ て | た の です が
7 | た | に は | に なっ て | て き まし た
8 | で | さ れ | の です が | れ て い ます
9 | と | ませ ん | し まし た | は あり ませ ん
10 | し | い ます | さ れ た | に なり まし た

Our Web Corpus (2012-4Q), original
Rank | 1-gram | 2-gram | 3-gram | 4-gram
1 | の | まし た | 記事 へ の | 記事 へ の トラック
2 | に | でしょ う | お願い し ます | 専用 ページ を 表示
3 | を | 行っ て | Q & A | 利用 する こと が
4 | は | 思っ て | 続き を 読む | 機能 を 利用 する
5 | て | 情報 を | マーク へ 投稿 | おすすめ の 知恵 ノート
6 | が | 利用 規約 | 専用 ページ を | 正確 性 の 保証
7 | た | おすすめ の | 機能 を 利用 | お客様 自身 の 責任
8 | で | 記事 へ | 済み の 質問 | 回答 を 指示 する
9 | と | 追加 する | おすすめ の 知恵 | 便利 に 新規 取得
10 | し | 場合 は | エンターテインメント と 趣味 | はてな ブック マーク へ

‘Google N-gram’
Rank | 1-gram | 2-gram | 3-gram | 4-gram
1 | の | し て | て い ます | し て い ます
2 | に | まし た | し て い | さ れ て いる
3 | を | て い | て い た | さ れ て い
4 | は | て いる | し て いる | は あり ませ ん
5 | て | し た | さ れ て | れ て い ます
6 | が | ませ ん | に なっ て | て い まし た
7 | た | さ れ | し まし た | に なり まし た
8 | で | に は | さ れ た | し て おり ます
9 | と | で は | れ て いる | て き まし た
10 | し | い ます | あり ま せん | し て い た

Glosses given on the slide: 利用 規約 = ‘user policy’; はてな ブック マーク へ = ‘for the social bookmark’

Our Web Corpus (2012-4Q), deduplicated
Rank | 5-gram | 6-gram | 7-gram
1 | されています | ではないでしょうか | のではないでしょうか
2 | ではありません | ていたのですが | のタグが付けられた質問
3 | と思っています | のではないでしょう | ではないかと思います
4 | していました | のではないかと | に関するウェブ上 の情報 を探す
5 | ではないでしょう | に行ってきました | ああああああああああああああ
6 | のではないか | ような気 がします | のではないかと思い
7 | はないでしょうか | タグが付けられた質問 | していたのですが
8 | になっています | のタグが付けられた | 思っていたのですが
9 | ていましたが | させていただきました | えええええええ
10 | ていたのです | たいと思っています | と思っていたのです

Our Web Corpus (2012-4Q), original
Rank | 5-gram | 6-gram | 7-gram
1 | 記事 へのトラックバック | 機能 を利用 することが | 機能 を利用 することができ
2 | 機能 を利用 すること | 利用 することができませ | 利用 することができません
3 | 利用 することができ | 正確 性 を保証 して | 正確 性 を保証 しており
4 | 正確 性 を保証 し | お客様 自身 の責任 と判断 | お客様 自身 の責任 と判断 で
5 | お客様 自身 の責任 と | すべての機能 を利用 する | すべての機能 を利用 すること
6 | はてなブックマークへ投稿 | 知恵袋 のすべての機能 を | 知恵袋 のすべての機能 を利用
7 | 更新 情報 が届きます | おすすめの解決 済みの質問 | My ニックネームの 知恵袋 で確認 でき
8 | おすすめの解決 済みの | URL記事 へのトラックバック | 質問 年月 や画像 の有無 を
9 | すべての機能 を利用 | My ニックネームの 知恵袋 で確認 | 質問 や知恵 ノートは選択 さ
10 | 質問 年月 や画像 の | することができません | 以上 更新 がないブログに表示

‘Google N-gram’
Rank | 5-gram | 6-gram | 7-gram
1 | されています | 無料 でお届けします | 料 無料 でお届けします
2 | ではありません | 料 無料 でお届けし | 配送 料 無料 でお届けし
3 | でお届けします | 配送 料 無料 でお届け | 国内 配送 料 無料 でお届け
4 | 無料 でお届けし | 国内 配送 料 無料 でお | 以上 国内 配送 料 無料 でお
5 | 1500 円 以上 国内 配送 | 円 以上 国内 配送 料 無料 | 円 以上 国内 配送 料 無料 で
6 | 料 無料 でお届け | 以上 国内 配送 料 無料 で | 1500 円 以上 国内 配送 料 無料
7 | 配送 料 無料 でお | 1500 円 以上 国内 配送 料 | はインラインフレームを使用 して
8 | 国内 配送 料 無料 で | を使用 しています | フレームを使用 しています
9 | 以上 国内 配送 料 無料 | インラインフレームを使用 して | インラインフレームを使用 してい
10 | 円 以上 国内 配送 料 | この記事 へのトラックバック | 部分 はインラインフレームを使用 し

Glosses given on the slide: 1500 円 以上 国内 配送 料 無料 = ‘Free shipping within Japan for items worth 1,500 yen or more’; 記事 へのトラックバック = ‘trackback for the article’; タグが付けられた質問 = ‘tagged question’


CONCLUSION

[Conclusions]

An overview of the design of the web-scale corpus at NINJAL
• Ten billion-scale web corpus
• Remote harvesting page collection
• Multi-layered linguistic annotation
  – Word unit, morphological information, syntactic dependency structure, and register information
• Release for linguists
  – Web service
  – Word list and n-gram
  – Language analysers
• Preservation so that linguistic studies can monitor any changes
  – Web archive for linguistic research