NLP and Big Data

NLP and Big Data Shanxi HPC Research Center Xiaoge LI [email protected] WBDB2013, Xi’an, China


Post on 23-Feb-2016



Page 1: NLP  and  Big  Data

NLP and Big Data

 Shanxi HPC Research Center

Xiaoge LI [email protected]

WBDB2013, Xi’an, China

Page 2: NLP  and  Big  Data

Introduction

• The Internet is a big, unstructured knowledge base
• NLP & IE: "understand" human language
• Unstructured data to structured data

Page 3: NLP  and  Big  Data

Problems

Human language changes
• "Let's Google it!"
• Net language (LOL, 给力 "geili", slang for "awesome")
• Compound words (JFK airport)

Domain knowledge
• Domain-specific training sets

Chinese tokenization
• Without the new word: 小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
• With the new word: 小菊/nr 的/u 生活/vn 很/d 给力/a
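The tokenization problem on this slide can be illustrated with a minimal sketch. It uses forward maximum matching against a word list, which is a simple stand-in and not the method of the engine described in this talk. The function name `forward_max_match` and the toy lexicon are assumptions made for illustration only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
base_lexicon = {"小菊", "的", "生活", "很", "给", "力"}
sentence = "小菊的生活很给力"

# Without the net-language word, 给力 is split into two characters.
before = forward_max_match(sentence, base_lexicon)
# Once the new word is in the lexicon, it is kept as one token.
after = forward_max_match(sentence, base_lexicon | {"给力"})

print(before)
print(after)
```

The only difference between the two runs is one lexicon entry, which is why acquiring new words from large corpora matters for segmentation quality.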

Page 4: NLP  and  Big  Data

NLP needs big data

• Unsupervised (weakly supervised) learning
• Knowledge acquisition
• Relationships
• New words
• NE gazetteer

Page 5: NLP  and  Big  Data

System Architecture

• Linux cluster
• HDFS
• Knowledge acquisition
• NLP & IE
• MapReduce
• HBase
• Entity graph
• Information fusion

Page 6: NLP  and  Big  Data

Knowledge acquisition

• Large-scale corpus from the Web
• Weakly supervised learning
• Bootstrapping technique
• MapReduce, HBase
• Location NE and new word: P = 87.28%, 72.1%
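As a rough illustration of the bootstrapping idea on this slide, the sketch below alternates between learning context patterns from known location names and harvesting new names that occur in those contexts. It is a single-process toy, not the MapReduce/HBase implementation the talk refers to; the function `bootstrap`, the left-context pattern, the thresholds, and the tiny English corpus are all assumptions.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_count=2):
    """Alternate between learning left-context patterns and harvesting new names."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count the word that precedes each known entity.
        patterns = Counter()
        for words in sentences:
            for prev, cur in zip(words, words[1:]):
                if cur in known:
                    patterns[prev] += 1
        trusted = {p for p, n in patterns.items() if n >= min_count}

        # Step 2: count unknown words that follow a trusted pattern.
        candidates = Counter()
        for words in sentences:
            for prev, cur in zip(words, words[1:]):
                if prev in trusted and cur not in known:
                    candidates[cur] += 1
        new = {c for c, n in candidates.items() if n >= min_count}
        if not new:
            break
        known |= new
    return known


# Hypothetical toy corpus, already tokenized.
sentences = [
    ["flights", "to", "Beijing", "resumed"],
    ["trains", "to", "Beijing", "are", "full"],
    ["moved", "to", "Xian", "last", "year"],
    ["travelled", "to", "Xian", "by", "rail"],
    ["flights", "to", "Shanghai", "resumed"],
]
locations = bootstrap(sentences, seeds={"Beijing"})
print(sorted(locations))
```

The `min_count` threshold is what keeps a single noisy match from being accepted; on a Web-scale corpus the same counting steps map naturally onto map and reduce phases.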

Page 7: NLP  and  Big  Data

Chinese NLP & IE engine

• Pipeline
• FST & statistical mixture model
• Input: plain text
• Output: structured XML
• MapReduce
• Speed: 500 KB/s on 10 nodes
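To make the "plain text in, structured XML out" contract concrete, here is a minimal sketch that tags gazetteer matches and serializes the result. The slides do not show the engine's XML schema, so the `<doc>` and `<entity type="...">` element names, the `annotate` function, and the two-entry gazetteer are assumptions; a plain dictionary lookup stands in for the FST and statistical models.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer mapping surface strings to entity types.
GAZETTEER = {"Xi'an": "Location", "China": "Location"}


def annotate(text):
    """Wrap gazetteer matches in <entity> elements inside a <doc> root."""
    doc = ET.Element("doc")
    doc.text = ""
    last = None  # most recent <entity>; text after it goes into its tail
    names = sorted(GAZETTEER, key=len, reverse=True)  # prefer longer matches
    i = 0
    while i < len(text):
        match = next((n for n in names if text.startswith(n, i)), None)
        if match:
            last = ET.SubElement(doc, "entity", type=GAZETTEER[match])
            last.text = match
            last.tail = ""
            i += len(match)
        else:
            if last is None:
                doc.text += text[i]
            else:
                last.tail += text[i]
            i += 1
    return ET.tostring(doc, encoding="unicode")


source_text = "WBDB2013 was held in Xi'an, China."
xml_out = annotate(source_text)
print(xml_out)
```

Because each document is processed independently, a step like this can run as a map task over files stored in HDFS.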

Page 8: NLP  and  Big  Data

Information Object

• Named Entity: Person, Organization, Location, Product, Time
• Event: Pre-defined Event, General Event
• Profile and Event

Page 9: NLP  and  Big  Data

Example Profile

In a concept-based profile, the attributes are filled by the profiles of its participants.

Page 10: NLP  and  Big  Data

Information Network

NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing

IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge

Page 11: NLP  and  Big  Data

Cross-Document Information Fusion

• Hierarchical clustering
• MapReduce, HBase
• Half a million profiles
• Computing complexity
• P = 94.65%, R = 88.24%, F = 91.33%
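The reported F value is consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the slides do not state which F variant was used); the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide, in percent.
f = f_measure(94.65, 88.24)
print(round(f, 2))  # 91.33
```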

Page 12: NLP  and  Big  Data

Information Graph (multi-dimensional)

• Orange: Location
• Gray: Organization
• Blue: Person

Source: 2012 People's Daily
Query: China Agricultural University

Expand 1 level
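The "query, expand, filter" interaction shown on the graph slides can be sketched as a bounded neighborhood search over a typed graph. This is only an illustration of the idea: the `expand` function, the adjacency-dictionary representation, and the placeholder node names are assumptions, and the data does not come from the People's Daily corpus.

```python
def expand(graph, types, query, levels=1, keep=None):
    """Return nodes within `levels` hops of `query`, optionally filtered by type."""
    frontier, seen = {query}, {query}
    for _ in range(levels):
        # Move one hop outward, skipping nodes already visited.
        frontier = {n for node in frontier for n in graph.get(node, ())} - seen
        seen |= frontier
    found = seen - {query}
    if keep is not None:
        found = {n for n in found if types[n] == keep}
    return found


# Hypothetical placeholder graph: node -> set of neighbours.
graph = {
    "org_A": {"person_B", "place_C", "org_D"},
    "person_B": {"org_A", "org_E"},
}
types = {
    "org_A": "Organization",
    "person_B": "Person",
    "place_C": "Location",
    "org_D": "Organization",
    "org_E": "Organization",
}

one_level = expand(graph, types, "org_A")
orgs_only = expand(graph, types, "org_A", keep="Organization")
two_levels = expand(graph, types, "org_A", levels=2)
print(sorted(one_level), sorted(orgs_only), sorted(two_levels))
```

Filtering by type after expansion corresponds to the single-type networks on the following slides, for example keeping only Organization nodes.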

Page 13: NLP  and  Big  Data

Organization-Organization Network

Query: China Agricultural University, filter: Organization

Page 14: NLP  and  Big  Data

Location-Person Network

Query: 青岛港 (Port of Qingdao), filter: Location

Page 15: NLP  and  Big  Data

Person-Location Network

Query: 金日成 (Kim Il-sung)

Page 16: NLP  and  Big  Data

Future Work

• Query language
• Graph mining
• Enhance NLP engine
• Visualization

Page 17: NLP  and  Big  Data

Questions?

Thank you