swt final project presentation

What happened? Martin Majlis

Upload: martin-majlis

Post on 18-Jun-2015


Page 1: SWT Final Project Presentation

What happened?
Martin Majlis


28/01/10 SWT - Final Project 2

Outline

Introduction
Architecture
Back-end
    Downloading
    Extraction
Front-end
    Web application
    iGoogle Gadget


Introduction

Answers questions such as:
    what happened on 3 January
    what happened on 3 January 1865
    what happened in January 1825
    what happened from January until July 1985
    what happened during the 16th century
    what started in January 1930
    what ended in 1990


Architecture

Back-end
    Downloading Structure Converting Parsing
Front-end
    Web application
    iGoogle Gadget


Build process

Fully automated
Target for each phase
Less error-prone
GNU Make


Data Source

Czech Wikipedia
Documented format
Dumps regularly generated
Cleaner than general texts


Downloading / Conversion

Downloading
    Script from DBPedia
    Added traffic shaping
Data Conversion
    Recognizing pages/categories
    Building category “hierarchy”
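
The slide does not describe how traffic shaping was added to the download script. As a rough illustration only, the sketch below enforces a minimum delay between consecutive requests. The class name Throttle and its parameters are hypothetical, and real traffic shaping may limit bandwidth rather than request frequency.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the behaviour can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A download loop would call wait() before fetching each dump file, so that the server never sees two requests closer together than min_interval.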


Categories

Confusing structure
    Netherlands - 229
        Physics, Planets, Illusions, Psychology, Literature, Organ, Neuroscience, etc.
Maximal depth 5
    Median: 31
    Mean: 33.87
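
The slide does not say how these figures were computed. One plausible reading, assumed here, is that they count the distinct ancestor categories reachable from a page within a maximal depth. The sketch below does that with a breadth-first walk; the function name ancestors_within and the toy category data are hypothetical.

```python
from collections import deque


def ancestors_within(category, parents, max_depth):
    """Return all distinct ancestors of `category` at most `max_depth` levels up.

    `parents` maps a category name to the list of its direct parent categories.
    """
    seen = set()
    frontier = deque([(category, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not climb above the depth limit
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    seen.discard(category)  # a cycle must not make a category its own ancestor
    return seen


# Toy data, invented for illustration only.
PARENTS = {
    "Netherlands": ["Europe", "Countries"],
    "Europe": ["Continents"],
    "Continents": ["Geography"],
    "Countries": ["Geography"],
    "Geography": ["Science"],
    "Science": ["Knowledge"],
}
```

Because a category can have several parents, the number of reachable ancestors grows quickly with depth, which is one way unrelated topics can end up above a country page.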


Date Extraction – Regular Exp.

Regular expressions aren't for parsing
    Day = (\d+)\.
    Month = (Jan|Feb|...)
    Year = (\d+)
    Date = (Day Month Year | Day Month | Month Year | Year)
    Extract = (“from” Date “until” Date | Date “-” Date | “between” Date “and” Date | “from” Date)
The day number can appear in 14 positions
In practice, more than 1000 slots
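
The grammar above can be written out as a single composed expression to see where the 14 day positions come from: Date contains Day in two alternatives, and Extract contains Date seven times. This is a minimal sketch using English month names and single spaces between tokens, both of which are simplifying assumptions.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

DAY = r"(\d+)\."
MONTH = rf"({MONTHS})"
YEAR = r"(\d+)"
DATE = rf"(?:{DAY} {MONTH} {YEAR}|{DAY} {MONTH}|{MONTH} {YEAR}|{YEAR})"
EXTRACT = re.compile(
    rf"from {DATE} until {DATE}|{DATE} - {DATE}|"
    rf"between {DATE} and {DATE}|from {DATE}"
)

# Each Date contributes 8 capture groups, 2 of which are day numbers.
day_positions = EXTRACT.pattern.count(DAY)   # 14
total_groups = EXTRACT.groups                # 56

match = EXTRACT.search("from 23. January 1956 until 2. February 1960")
filled = [g for g in match.groups() if g is not None]
# filled == ['23', 'January', '1956', '2', 'February', '1960']
```

Only 6 of the 56 groups are filled for this input, and the caller has to work out which group index corresponds to which field. That bookkeeping is what makes the pure regular-expression approach hard to maintain.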


Date Extraction - Tools

Standard way:
    GNU Flex / GNU Bison
    Ragel
Problem with UTF-8 support
    Unicode: almost 100,000 characters
    Big transition tables (100,000 vs 127)


Date Extraction - Mixed

Lexical Analysis
    Regular Expressions
    Filling Table
Syntactic Analysis
    Theoretically CFG
    Practically again regular expressions


Date Extraction - Example

Lexical Analysis
    “From 23 January 1956 until 2 February 1960”
    “From {{DATE_1}} until {{DATE_2}}”
Syntactic Analysis
    Interval = “From” DATE “until” DATE
    Interval = “Between” DATE “and” DATE
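
The two phases can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the names lex, parse_interval, DATE_RE and INTERVAL_RE are hypothetical, only "day month year" and "month year" dates are recognized, and the table stores (day, month, year) tuples.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Lexical phase: an optional day, a month name and a year.
DATE_RE = re.compile(r"(?:(\d{1,2}) )?(" + "|".join(MONTHS) + r") (\d{1,4})")

# Syntactic phase: rules over placeholders instead of raw dates.
INTERVAL_RE = re.compile(
    r"From \{\{(DATE_\d+)\}\} until \{\{(DATE_\d+)\}\}"
    r"|Between \{\{(DATE_\d+)\}\} and \{\{(DATE_\d+)\}\}"
)


def lex(text):
    """Replace each date with a {{DATE_n}} placeholder and fill the date table."""
    table = {}

    def replace(match):
        key = "DATE_%d" % (len(table) + 1)
        day = int(match.group(1)) if match.group(1) else None
        table[key] = (day, MONTHS.index(match.group(2)) + 1, int(match.group(3)))
        return "{{" + key + "}}"

    return DATE_RE.sub(replace, text), table


def parse_interval(text):
    """Return (start, end) date tuples for an interval, or None if no rule matches."""
    templated, table = lex(text)
    match = INTERVAL_RE.search(templated)
    if match is None:
        return None
    start_key, end_key = [g for g in match.groups() if g is not None]
    return table[start_key], table[end_key]
```

For the slide's example, lex yields “From {{DATE_1}} until {{DATE_2}}” and parse_interval returns ((23, 1, 1956), (2, 2, 1960)). The syntactic rules stay short because every date form has already been collapsed into one placeholder token.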


Date Representation

Dates from 10,000 BC to 2500 AD
Not exact: 13th century, June 1689
Zero (there is no year 0)
    2 January - 5 days = 28 December
    2 January 1 AD - 5 days = 28 December 1 BC
Simple tuples
    (“I”, 23, 1, 1956, 20, 2, 2, 1960, 20)
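
The year-zero example can be checked with a small sketch. Python's datetime module does not cover BC dates, so the code below uses integer day counts on the proleptic Gregorian calendar (an assumption; the slides do not name a calendar) and maps 1 BC to the internal year 0. All function names are hypothetical.

```python
def days_from_civil(year, month, day):
    """Day count relative to 1970-01-01; `year` is astronomical (0 = 1 BC)."""
    year -= month <= 2                      # treat Jan/Feb as end of previous year
    era = year // 400
    year_of_era = year - era * 400
    day_of_year = (153 * (month + (-3 if month > 2 else 9)) + 2) // 5 + day - 1
    day_of_era = (year_of_era * 365 + year_of_era // 4
                  - year_of_era // 100 + day_of_year)
    return era * 146097 + day_of_era - 719468


def civil_from_days(days):
    """Inverse of days_from_civil; returns (astronomical year, month, day)."""
    days += 719468
    era = days // 146097
    doe = days - era * 146097
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    return yoe + era * 400 + (month <= 2), month, day


def subtract_days(day, month, year, era, count):
    """Subtract `count` days from a date given as (day, month, year, 'AD' or 'BC')."""
    astronomical = year if era == "AD" else 1 - year   # 1 BC -> 0, 2 BC -> -1
    y, m, d = civil_from_days(days_from_civil(astronomical, month, day) - count)
    return (d, m, y, "AD") if y > 0 else (d, m, 1 - y, "BC")
```

With these definitions, subtract_days(2, 1, 1, "AD", 5) gives (28, 12, 1, "BC"), matching the slide. Storing the astronomical year internally keeps the arithmetic uniform, and the BC/AD labels are applied only at the edges.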


Web application

PHP5 + MySQL
Nette Framework + Dibi
http://css.majlis.cz/
    GT: http://jdem.cz/dspw9
HTML, JSON, XML output


iGoogle Gadget

iGoogle = Google personalized homepage
URL: http://jdem.cz/dspx7
Using JSON
Tricky development


Future Work

Improve performance
    20th century events: 28 s, 406,980 (one OR)
    20th century events: 0.0007 s, 392,573 (no OR)
Improve parser architecture


Questions?


Thank You!