
Web-based Ultra-Large-Scale Corpora at NINJAL

Masayuki ASAHARA, Mizuho IMADA, Sachi YASUDA, Hikari KONISHI, Kikuo MAEKAWA

National Institute for Japanese Language and Linguistics, Japan
Center for Corpus Development

[Introduction] National Institute for Japanese Language and Linguistics (NINJAL), Japan

• Founded as the ‘National Language Research Institute’ in 1948
• Located in Tachikawa, Tokyo, since 2005
• ‘Center for Corpus Development’
  – Released ‘Corpus of Spontaneous Japanese (CSJ)’ (2001-2005)
  – Released ‘Balanced Corpus of Contemporary Written Japanese (BCCWJ)’ (2006-2010)
  – Developing ‘NINJAL Historical/Diachronic Corpus’ (2011-2015)
  – Developing ‘NINJAL Web Corpus’ (2011-201)

2014/05/19 IIPC Open Day 2

[Introduction] An ongoing NINJAL project: Compilation of a web-scale Japanese corpus

Project goal: To compile a ten-billion-word corpus of web texts for linguistic research

• Covering rarely occurring linguistic expressions
• Ensuring balanced sampling over time (seasons) and domains
• Profiling originators
• Annotating word boundaries, morphological information, and syntactic dependency structures
• Providing a search environment including metadata, strings, and annotations

Project term: Late fiscal year 2011–the end of FY 2015


Table of contents

• Introduction
• Previous studies
  – Japanese Web corpora and linguistic resources
• Design of a web-scale Japanese corpus
  – Four basic technologies
    • Page collection
    • Linguistic annotation
    • Release
    • Preservation
• Research progress
• Conclusion


JAPANESE WEB CORPORA AND LINGUISTIC RESOURCES

[Previous Studies]


[Previous Studies] Basic premise: Copyright Law of Japan

• ‘Publishing collected web texts’ is in a legal grey area
• Types of publication (short tag used on the slides in parentheses)
  – Publish word list and n-gram (wordlist)
  – Provide search environment with snippets (search)
  – Publish resources by copyrighted content holders (copyrighted)
  – Compile data in countries other than Japan (foreign)
  – Exception: Web Archiving Project (WARP) by National Diet Library (NDL), Japan (NDL)
• Who has created Japanese web-scale language resources (JWLR)? Types of developer:
  – Private companies
  – Universities and public institutes
  – Individuals
  – Foreign researchers

[Previous studies] JWLRs created by private companies

• Google: ‘Japanese Web n-gram Version 1’ (type: wordlist)
  – Word n-grams from web texts (255 billion tokens)
• Baidu: ‘Baidu Blog and Forum-Times Corpus’ (type: wordlist)
  – Word list and n-grams from blogs and BBSs
  – Ten million sentences crawled from 2000–2010
• Baidu: ‘Baidu Mobile Web Corpus with Emoji’ (type: wordlist)
  – Word list and n-grams of texts used for mobile search
• Rakuten: ‘Rakuten Data Release’
  – Review data from internet shopping mall
• Yahoo Japan: ‘Yahoo Answers Corpus Version 2’
  – 26 million questions and 73 million answers



[Previous studies] JWLRs by universities and public institutions

• NICT: ‘Japanese Syntactic Dependency Database Version 1.1’ (type: wordlist)
  – 480 million syntactic dependency relations in 600 million pages and 43 billion sentences
• Kyoto University: ‘Kyoto-U Case Frames (Version 1.0)’ in 2009 (type: wordlist)
  – 40,000 case frames from 1.6 billion sentences
• Tsukuba-U: ‘Tsukuba Web Corpus’ (type: search)
  – 1.1 billion-word text corpus developed by lexical profiling using Yahoo API
• NDL: ‘Web Archiving Project’ (WARP) (type: NDL)
  – Web archive of the official websites of Japanese institutions

[Previous studies] JWLR created by individuals

• Yata: ‘Japanese Web Corpus 2010’ (type: wordlist)
  – HTML and text archive using the Yahoo API in 2010
  – Seed lexicon for Web API is IPADIC-2.7.0
  – Provides original texts and word n-grams




[Previous studies] JWLRs created in countries other than Japan

• [Ueyama and Baroni 2005] (type: foreign)
  – Two web corpora: 3.5 + 4.5 million words
• [Baroni and Ueyama 2006] (type: foreign)
  – Blog data: 62 million words
• [Srdanovic+ 2008] (type: foreign)
  – ‘JPWaC 2008’: 400 million words
• [Pomikalek and Suchomel 2012] (type: foreign)
  – ‘JpTenTen11’: 10 billion-word text corpus developed by crawling in 2011







FOUR BASIC TECHNOLOGIES

[Design of a web-scale Japanese corpus]

1. Page collection
2. Linguistic annotation
3. Release
4. Preservation

[Design of a web-scale Japanese corpus] Four basic technologies

1. Page collection: crawling techniques, strategies, and plans
2. Linguistic annotation: character normalisation, word segmentation, morphological information annotation, syntactic dependency parsing, and register estimation
3. Release: how to make the corpus publicly available
4. Preservation: web archive in chronological order


[Design of a web-scale Japanese corpus] Four basic technologies: 1. Page collection

How? Performing remote harvesting (bulk collection) using a web crawler

• Heritrix Crawler (Version 3.1)
  – Developed by Internet Archive (United States)
  – Used by national libraries (e.g., NDL in Japan)
• Crawling strategy and plan
  – Crawling Japanese web pages including spam blogs (splogs) and machine-generated pages
  – Crawling 100 million pages every three months (fixed-point observation)
  – Changing target pages yearly
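One rule any such bulk crawl has to respect is the Robots Exclusion Protocol, which the progress statistics later name as one reason crawl attempts fail. The sketch below is only an illustration of that check using Python's standard `urllib.robotparser`; it is not Heritrix configuration, and the robots.txt content, the user-agent name `example-corpus-bot`, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.jp/index.html", "http://example.jp/private/a.html"):
        print(url, is_allowed(ROBOTS_TXT, "example-corpus-bot", url))
```

A page rejected here is simply never fetched, so it counts as a crawl attempt but not as a crawled page.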


[Design of a web-scale Japanese corpus] Four basic technologies: 2. Linguistic annotation

Four sorts of (automatic) annotation

2.1 Normalisation
  – HTML-to-text and character-encoding normalisation
2.2 Japanese morphological analysis
  – Word segmentation and POS annotation
2.3 Japanese dependency analysis
  – Syntactic dependency structure annotation
2.4 Register estimation
  – Metadata alternative


2.1 Normalisation
• HTML to text and character encoding issues*
  – NWC (Nihongo Web Corpus) Toolkit [Yata 2010], compatible with Google Web Japanese n-gram method

* Japanese character encoding: encoding Japanese characters for use on a computer. Several standard methods exist, including JIS, Shift-JIS, EUC, and Unicode.
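To make the encoding issue concrete, here is a minimal sketch of decoding a page whose encoding is unknown and then normalising the characters. It is not the NWC Toolkit: the trial-decoding order, the function name `to_normalised_text`, and the choice of Unicode NFKC as the normalisation form are assumptions made for illustration.

```python
import unicodedata


def to_normalised_text(raw: bytes):
    """Decode raw bytes by trial and apply NFKC; return (text, encoding).

    Trial decoding is a heuristic: short or unusual inputs can be valid in
    more than one encoding, so the first match is not guaranteed to be right.
    """
    candidates = ["utf-8", "euc_jp", "shift_jis"]
    # ISO-2022-JP ("JIS") is 7-bit and marks charset switches with ESC, so it
    # is only tried, and tried first, when an ESC byte is present.
    if b"\x1b" in raw:
        candidates.insert(0, "iso2022_jp")
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # NFKC folds full-width Latin letters and half-width katakana
        # into their ordinary forms.
        return unicodedata.normalize("NFKC", text), encoding
    raise ValueError("none of the candidate encodings fits")


if __name__ == "__main__":
    print(to_normalised_text("日本語".encode("shift_jis")))
    print(to_normalised_text("ＡＢＣ１２３ｶﾞ".encode("utf-8")))
```

Normalising before segmentation keeps variants such as full-width and half-width characters from being counted as different word forms.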



2.2 Japanese morphological analysis
• Part-of-speech (POS) tagset and word unit
  – UniDic POS tagset (Kokugo-ken Short Unit)
    • Analyser: MeCab with UniDic
  – UniDic POS tagset (Kokugo-ken Long Unit)
    • Analyser: MeCab with UniDic and Chunker CRF++
  – Masuoka–Takubo POS tagset
    • Analyser: JUMAN or MeCab with JUMAN compatible dictionary
  – Purely unsupervised word unit without POS
    • Analyser: Bayesian unsupervised word segmenter [Mochihashi 2009]



2.3 Japanese dependency analysis
• Dependency annotation standard
  – Kyoto text corpus standard
    • The de facto standard in Japan
    • Analyser: KNP or CaboCha
  – BCCWJ Standard
    • Covers phenomena in web texts: sentence fragments, scrambling, URLs, and smileys
    • Analyser: CaboCha with the Balanced Corpus of Contemporary Written Japanese (BCCWJ)



2.4 Register estimation
• Register (style) as a category of page metadata
  – Unsupervised clustering and manual annotation on the representative pages
  – (Semi-supervised) register annotation using BCCWJ metadata



[Design of a web-scale Japanese corpus] Four basic technologies: 3. Release

Three sorts of release

3.1 Online release: search application
3.2 Offline release: word list and n-gram release
3.3 Natural language analysers



3.1 Online release: search application
• 10 billion-scale search application as a web service
  – String search
  – Word-unit- and POS-based query (e.g., Chuunagon by NINJAL using BCCWJ)
  – Bunsetsu unit and dependency-based query (e.g., ChaKi.NET by Nara Institute of Science and Technology (NAIST))
  – Facet navigation by register information



3.2 Offline release: quarterly word list and n-gram release
– Word list
  • With morphological information, orthography-based
– Character n-gram
  • Without morphological information, orthography-based
– Word n-gram
  • Without morphological information, lemma-based
– Frequent subtrees in dependency structure
– Frequent HTML tags
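As a small illustration of what a word or character n-gram list contains, the sketch below counts n-grams over already segmented sentences and keeps only those that reach a minimum frequency. The function names and the `min_count` parameter are hypothetical, the toy sentences are invented, and a real release at this scale would need an out-of-core implementation rather than an in-memory counter.

```python
from collections import Counter


def ngrams(units, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def count_ngrams(sentences, n, min_count=1):
    """Count n-grams over sentences and drop those rarer than min_count."""
    counts = Counter()
    for units in sentences:
        counts.update(ngrams(units, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Word n-grams: each sentence is a list of word units from a segmenter.
    segmented = [["し", "て", "い", "ます"], ["し", "て", "い", "た"]]
    print(count_ngrams(segmented, 2, min_count=2))
    # Character n-grams: pass each sentence as a list of characters instead.
    print(count_ngrams([list("しています")], 3))
```

The same function serves both list types because only the unit of segmentation differs between word and character n-grams.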



3.3 Natural language analysers

Develop natural language analysers for web-scale corpus statistics
– Lexicon for Japanese morphological analyser
– Japanese dependency analyser based on co-occurrence statistics


[Design of a web-scale Japanese corpus] Four basic technologies: 4. Preservation

Preserve the data collected for linguistic studies to monitor any changes

• Web ARChive (WARC) format
  – A web archive preservation format
• Open-source wayback (hot backup)
  – Harvesting WARC files on a web application
  – Same as internet archives
• Linear Tape-Open (LTO) tape libraries (cold backup)
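For readers unfamiliar with WARC, the following sketch shows the general shape of a single record: a version line, named header fields, a blank line, the content block, and a two-line terminator. It is a simplified illustration that assumes the WARC 1.0 layout and uses a `resource` record; archives written by a crawler mostly hold `response` records with the full HTTP exchange. The helper names, the URI, and the date are made up.

```python
import uuid


def build_record(uri: str, payload: bytes, date: str) -> bytes:
    """Serialise one simplified WARC 1.0 'resource' record."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Blank line after the headers, two CRLFs after the content block.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


def parse_record(raw: bytes):
    """Split one record into (version, header fields, content block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return lines[0], fields, rest[:length]


if __name__ == "__main__":
    record = build_record("http://example.jp/", "こんにちは".encode("utf-8"),
                          "2012-10-01T00:00:00Z")
    version, fields, body = parse_record(record)
    print(version, fields["WARC-Target-URI"], body.decode("utf-8"))
```

Because each record carries its own capture date and target URI, repeated crawls of the same page can sit side by side in the archive, which is what makes chronological comparison possible.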








RESEARCH PROGRESS


[Research Progress]


Page collection:
  – Began in October 2012 (2012-4Q)
  – Crawled six quarters (from 2012-4Q to 2014-1Q)

Linguistic annotation:
  – Analysed four quarters of data (from 2012-4Q to 2013-3Q)

Statistical data (from 2012-4Q to 2013-3Q):
  – Collected pages and page conflicts
  – Collected links
  – Analysed data: number of morphemes and sentences
  – N-grams

[Research Progress] Collected pages from 2012-4Q to 2013-3Q

One-quarter statistics
• 100 million crawl attempts → 60 million crawled pages (the loss is caused by HTTP errors and observance of the Robots Exclusion Protocol)
• 60 million crawled pages → 42-45 million deduplicated pages (72.9-74.5%)

Four-quarter statistics
• 42.7% of URLs are unmodified in four crawls

                               2012-4Q              2013-1Q              2013-2Q              2013-3Q
Tokens of pages (1 quarter)    61,668,805           58,844,092           61,479,268           57,892,917
Deduplicated number of pages   45,933,605 (74.5%)   42,932,982 (73.0%)   45,111,527 (73.4%)   42,192,931 (72.9%)

Statistics of page changes in 4 quarters
Types of URLs (4 quarters)     64,539,233
Number of unmodified URLs      27,604,915 (42.7%)
Number of modified URLs        36,934,706 (57.3%)
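The per-quarter percentages can be reproduced from the page counts. The short check below takes the figures from the slide and computes the share of crawled pages that survive deduplication; the variable and function names are only for illustration.

```python
# (crawled pages, deduplicated pages) per quarter, copied from the slide.
QUARTERS = {
    "2012-4Q": (61_668_805, 45_933_605),
    "2013-1Q": (58_844_092, 42_932_982),
    "2013-2Q": (61_479_268, 45_111_527),
    "2013-3Q": (57_892_917, 42_192_931),
}


def dedup_share(crawled: int, deduplicated: int) -> float:
    """Percentage of crawled pages left after deduplication, to one decimal."""
    return round(100 * deduplicated / crawled, 1)


shares = {quarter: dedup_share(*counts) for quarter, counts in QUARTERS.items()}

if __name__ == "__main__":
    for quarter, share in shares.items():
        print(quarter, f"{share}%")
```

Roughly a quarter of the pages fetched in each crawl are therefore duplicates of other pages in the same crawl.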

[Research Progress] Page conflict in 2012-4Q

[Figure: page conflicts in the 2012-4Q crawl. Labelled groups: copy sites; robots.txt or ‘soft 404’]

[Research Progress] Collected links from 2012-4Q to 2013-3Q

• Seed URLs: 100 million URLs
  → Seven billion links (tokens) in one quarter
  → 843-892 million links (types) in one quarter
  ⇒ 1.6 billion links (types) in four quarters

                 2012-4Q         2013-1Q         2013-2Q         2013-3Q
Links (tokens)   6,905,805,383   6,610,763,700   7,064,611,259   7,222,958,033
Links (types)    892,135,930     843,166,672     865,694,816     855,684,918

Statistics in 4 quarters
Links (types)    1,642,699,579

[Research Progress] Incoming links from our seed URLs to target in 2012-4Q


[Research Progress] Statistics of analysed data

60 million URLs
⇒ 60 billion morphemes (without sentence extraction)
⇒ 30 billion morphemes (with sentence extraction; filtered out 50% of non-Japanese texts)
⇒ 2.5 billion sentences (tokens) and one billion sentences (types)

                                                2012-4Q                  2013-1Q                  2013-2Q                  2013-3Q
Number of WARC files                            814                      870                      910                      905
Number of URLs                                  61,668,805               58,844,092               61,479,268               57,892,917
Number of morphemes (w/o sentence extraction)   64,714,650,129           62,077,520,745           63,414,252,638           65,736,027,334
Number of morphemes (w/ sentence extraction)    33,767,409,441 (52.2%)   32,651,138,004 (52.6%)   33,073,991,355 (52.2%)   30,923,912,566 (47.0%)
Number of sentences (tokens)                    2,678,315,774            2,600,122,908            2,659,617,620            2,478,309,312
Number of sentences (types)                     1,097,011,506            1,048,772,913            1,063,649,324            1,007,771,383

[Research Progress] Sentence duplication in 2012-4Q data

[Figure: sentence duplication in the 2012-4Q data, with these annotations]
• Titles, anchor texts of links, or fixed phrases
• ← Sentences appearing only once in the corpus
• ← The most frequent sentence: ‘職業とキャリア’ (occupation and career) in Yahoo! Answers
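The distinction between sentence tokens and sentence types that underlies these duplication statistics can be sketched in a few lines. The function name `sentence_profile` and the toy sentences are invented for the example; exact string matching is assumed as the definition of a duplicate.

```python
from collections import Counter


def sentence_profile(sentences):
    """Summarise duplication: tokens, types, singletons, most frequent sentence."""
    counts = Counter(sentences)
    return {
        "tokens": sum(counts.values()),          # every occurrence
        "types": len(counts),                    # distinct sentences
        "singletons": sum(1 for c in counts.values() if c == 1),
        "most_frequent": counts.most_common(1)[0] if counts else None,
    }


if __name__ == "__main__":
    toy = ["続きを読む", "続きを読む", "今日は晴れです。", "続きを読む", "明日は雨です。"]
    print(sentence_profile(toy))
```

In this toy input the navigation phrase accounts for most of the tokens but only one type, which is the same pattern the slide reports for titles, anchor texts, and fixed phrases.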

[Research Progress] Statistics of n-gram data

                               Our Web Corpus (2012-4Q)   Our Web Corpus (2012-4Q)   Google N-gram
                               Deduplicated sentences     Original sentences
                               (n≧3)                      (n≧3)                      (n≧20)
Number of morphemes (tokens)   18.0 billion               33.7 billion               255 billion
Number of sentences            1.0 billion                2.6 billion                20 billion

1-gram                         3.9 million                5.0 million                2.5 million
2-gram                         47 million                 85 million                 80 million
3-gram                         160 million                440 million                390 million
4-gram                         210 million                870 million                700 million
5-gram                         170 million                1,030 million              770 million
6-gram                         120 million                970 million                680 million
7-gram                         84 million                 850 million                570 million

Most frequent 1-grams to 4-grams

Our Web Corpus, 2012-4Q, deduplicated
Rank    1-gram    2-gram    3-gram    4-gram
1    の    して    ています    しています
2    に    ました    ていた    ていました
3    て    てい    してい    されている
4    が    ている    している    していた
5    は    した    と思います    されてい
6    を    では    されて    たのですが
7    た    には    になって    てきました
8    で    され    のですが    れています
9    と    ません    しました    はありません
10    し    います    された    になりました

Our Web Corpus, 2012-4Q, original
Rank    1-gram    2-gram    3-gram    4-gram
1    の    ました    記事への    記事へのトラック
2    に    でしょう    お願いします    専用ページを表示
3    を    行って    Q & A    利用することが
4    は    思って    続きを読む    機能を利用する
5    て    情報を    マークへ投稿    おすすめの知恵ノート
6    が    利用規約    専用ページを    正確性の保証
7    た    おすすめの    機能を利用    お客様自身の責任
8    で    記事へ    済みの質問    回答を指示する
9    と    追加する    おすすめの知恵    便利に新規取得
10    し    場合は    エンターテインメントと趣味    はてなブックマークへ

‘Google N-gram’
Rank    1-gram    2-gram    3-gram    4-gram
1    の    して    ています    しています
2    に    ました    してい    されている
3    を    てい    ていた    されてい
4    は    ている    している    はありません
5    て    した    されて    れています
6    が    ません    になって    ていました
7    た    され    しました    になりました
8    で    には    された    しております
9    と    では    れている    てきました
10    し    います    ありません    していた

Slide annotations (English glosses): ‘user policy’ (利用規約); ‘for the social bookmark’ (はてなブックマークへ)

Most frequent 5-grams to 7-grams

Our Web Corpus, 2012-4Q, deduplicated
Rank    5-gram    6-gram    7-gram
1    されています    ではないでしょうか    のではないでしょうか
2    ではありません    ていたのですが    のタグが付けられた質問
3    と思っています    のではないでしょう    ではないかと思います
4    していました    のではないかと    に関するウェブ上の情報を探す
5    ではないでしょう    に行ってきました    ああああああああああああああ
6    のではないか    ような気がします    のではないかと思い
7    はないでしょうか    タグが付けられた質問    していたのですが
8    になっています    のタグが付けられた    思っていたのですが
9    ていましたが    させていただきました    えええええええ
10    ていたのです    たいと思っています    と思っていたのです

Our Web Corpus, 2012-4Q, original
Rank    5-gram    6-gram    7-gram
1    記事へのトラックバック    機能を利用することが    機能を利用することができ
2    機能を利用すること    利用することができませ    利用することができません
3    利用することができ    正確性を保証して    正確性を保証しており
4    正確性を保証し    お客様自身の責任と判断    お客様自身の責任と判断で
5    お客様自身の責任と    すべての機能を利用する    すべての機能を利用すること
6    はてなブックマークへ投稿    知恵袋のすべての機能を    知恵袋のすべての機能を利用
7    更新情報が届きます    おすすめの解決済みの質問    Myニックネームの知恵袋で確認でき
8    おすすめの解決済みの    URL記事へのトラックバック    質問年月や画像の有無を
9    すべての機能を利用    Myニックネームの知恵袋で確認    質問や知恵ノートは選択さ
10    質問年月や画像の    することができません    以上更新がないブログに表示

‘Google N-gram’
Rank    5-gram    6-gram    7-gram
1    されています    無料でお届けします    料無料でお届けします
2    ではありません    料無料でお届けし    配送料無料でお届けし
3    でお届けします    配送料無料でお届け    国内配送料無料でお届け
4    無料でお届けし    国内配送料無料でお    以上国内配送料無料でお
5    1500円以上国内配送    円以上国内配送料無料    円以上国内配送料無料で
6    料無料でお届け    以上国内配送料無料で    1500円以上国内配送料無料
7    配送料無料でお    1500円以上国内配送料    はインラインフレームを使用して
8    国内配送料無料で    を使用しています    フレームを使用しています
9    以上国内配送料無料    インラインフレームを使用して    インラインフレームを使用してい
10    円以上国内配送料    この記事へのトラックバック    部分はインラインフレームを使用し

Slide annotations (English glosses): ‘Free shipping within Japan for items worth 1,500 yen or more’; ‘trackback for the article’; ‘tagged question’







CONCLUSION


[Conclusions]

An overview of the design of the web-scale corpus at NINJAL

• Ten billion-scale web corpus
• Remote harvesting page collection
• Multi-layered linguistic annotation
  – Word unit, morphological information, syntactic dependency structure, and register information
• Release for linguists
  – Web service
  – Word list and n-gram
  – Language analysers
• Preservation so that linguistic studies can monitor any changes
  – Web archive for linguistic research

