

Page 1: 2XML

Marko Tadić (marko.tadic@ffzg.hr)

Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr)

Tübingen, 2000-11-08

Page 2: Human language technology

language resources
– corpora
– dictionaries

language tools
– language resource organizing and retrieval tools
– morphology
– syntax
– semantics
– ...

Page 3: Text availability for building corpora

written language
– flood of text in digital form
– “cheap” sources

spoken language
– difficulties in data collection
  • problems of recording
  • problems of transcription
  • problems of spontaneity of speakers
  • “expensive” source (typing)

both language varieties
– corpus as text in digital form

Page 4: WWW as a text source

estimated number of words accessible through AltaVista, 2000-02 (source: Greg Grefenstette, XRCE, 2000-02)

automated conversion of texts to a standardized format is needed

Language    Word count estimate
Welsh            7,590,000
Albanian         9,203,000
Breton           9,975,000
Lithuanian      20,927,000
Latvian         21,925,000
Esperanto       26,795,000
Basque          28,296,000
Latin           38,256,000
Estonian        43,257,000
Irish           49,778,000
Icelandic       53,167,000
Romanian        63,846,000
Croatian        72,122,000
Slovene         74,998,000
Turkish        100,548,000
Malay          113,236,000
Catalan        126,324,000
Slovakian      140,909,000
Finnish        192,105,000
Danish         206,167,000
Polish         235,726,000
Hungarian      268,944,000
Czech          269,310,000
Norwegian      455,391,000
Dutch          622,063,000
Swedish        644,740,000
Portuguese     924,965,000
Italian      1,240,205,000
Spanish      1,595,489,000
French       2,208,418,000
German       3,068,760,000
English     47,264,700,000

Page 5: Corpus encoding standards

pre-mark-up encoding

SGML (1980s to mid-1990s)
– Text Encoding Initiative (TEI)
– Corpus Encoding Standard (CES)
  • Ide et al. (1996)

XML (last couple of years)
– XCES (XML version of CES)
  • Ide, Bonhomme & Romary (2000)

Page 6: Conversion to XML

2XML
– tool for conversion
– input formats
  • HTML
  • RTF
– output format
  • XML

Page 7: 2XML 1

producer
– Institute of linguistics, Faculty of Philosophy, University of Zagreb

programming
– Softleks d.o.o., Zagreb

platforms
– Windows 9x/ME/NT/2000

requirements
– Internet Explorer 5.* to run

Page 8: 2XML 2

principle: two-step conversion

1st step
– input: HTML or RTF
– output: intermediate “dirty” XML

2nd step
– input: “dirty” XML
– user-defined script applied to it
– output: XML document
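The two-step principle can be illustrated with a minimal Python sketch. This is not the 2XML implementation: the slides do not show 2XML's script language, so the rule table (`RULES`), the element names (`dirty`, `text`, `head`, `hi`, `lb`) and all function names below are assumptions made for illustration, and only HTML input is covered (RTF is omitted).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have an end tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


def append_text(parent, text):
    """Append text after the last child of `parent`, or to its own text."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


class DirtyXMLBuilder(HTMLParser):
    """Step 1: mirror the HTML tag structure as a well-formed XML tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("dirty")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        append_text(self.stack[-1], data)


def apply_script(src, rules, dst):
    """Step 2: rename elements listed in `rules`; unwrap all others, keeping text."""
    append_text(dst, src.text)
    for child in src:
        if child.tag in rules:
            apply_script(child, rules, ET.SubElement(dst, rules[child.tag]))
        else:
            apply_script(child, rules, dst)
        append_text(dst, child.tail)


def html_to_xml(html, rules, root_tag="text"):
    builder = DirtyXMLBuilder()
    builder.feed(html)
    builder.close()
    clean = ET.Element(root_tag)
    apply_script(builder.root, rules, clean)
    return ET.tostring(clean, encoding="unicode")


# Hypothetical "user-defined script": dirty tag -> target tag.
RULES = {"h1": "head", "p": "p", "i": "hi", "br": "lb"}

if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Some <i>marked</i> text<br>here.</p></body></html>"
    print(html_to_xml(page, RULES))
```

Keeping the rule table separate from the parser mirrors the slide's split: step 1 is the same for every document, while step 2 changes with the target encoding.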

Page 9: 2XML Conversion: step 1

Page 10: 2XML Conversion: step 2

Page 11: 2XML user-defined script

Page 12: 2XML Goodies

goodies
– XML tree labeling
– XML text editing
– execute script on load
– batch processing: whole directory
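Batch processing of a whole directory can be sketched as below. This is an illustration only, not 2XML's behaviour: the function name `convert_directory`, the `*.html` pattern and the `.xml` output naming are assumptions, and `convert` stands for any string-to-string conversion function.

```python
import tempfile
from pathlib import Path


def convert_directory(src_dir, dst_dir, convert, pattern="*.html"):
    """Apply `convert` to every matching file in src_dir; write results to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        out = dst / (path.stem + ".xml")
        out.write_text(convert(path.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(out)
    return written


if __name__ == "__main__":
    # Demo with a throwaway directory and a placeholder conversion (str.upper).
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "in"
        source.mkdir()
        (source / "a.html").write_text("<p>x</p>", encoding="utf-8")
        for result in convert_directory(source, Path(tmp) / "out", str.upper):
            print(result.name)
```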

Page 13: 2XML: tree labeling & editing

Page 14: 2XML Tokenizer

program which tokenizes XML files

output in two formats
– tokenized XML file
– tabbed file
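The two output formats can be sketched in Python as follows. The slides do not specify the tokenizer's rules or file layouts, so everything here is an assumption for illustration: the token pattern, the `<w n="…">` element, and the tabbed columns (token number, token, enclosing element).

```python
import re
import xml.etree.ElementTree as ET

# Assumed token rule: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def add_tokens(parent, text, rows):
    """Wrap each token of `text` in a <w> child of `parent` and record it in `rows`."""
    for tok in TOKEN_RE.findall(text or ""):
        n = len(rows) + 1
        w = ET.SubElement(parent, "w", {"n": str(n)})
        w.text = tok
        rows.append((n, tok, parent.tag))


def tokenize(src, rows):
    """Return a copy of `src` in which all text is split into <w> elements."""
    dst = ET.Element(src.tag, dict(src.attrib))
    add_tokens(dst, src.text, rows)
    for child in src:
        dst.append(tokenize(child, rows))
        add_tokens(dst, child.tail, rows)
    return dst


def tokenize_xml(xml_text):
    """Return (tokenized XML string, tab-separated token table)."""
    rows = []
    root = tokenize(ET.fromstring(xml_text), rows)
    tabbed = "\n".join(f"{n}\t{tok}\t{tag}" for n, tok, tag in rows)
    return ET.tostring(root, encoding="unicode"), tabbed


if __name__ == "__main__":
    xml_out, tab_out = tokenize_xml("<text><p>Hello, <hi>world</hi>!</p></text>")
    print(xml_out)
    print(tab_out)
```

Both outputs come from the same pass over the tree, so the token numbers in the tabbed file match the `n` attributes in the tokenized XML.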

Page 15: 2XML Tokenizer 2

Page 16: Tokenizer output: dic file

Page 17: Tokenizer output: XML file
