2XML
Marko Tadić (marko.tadic@ffzg.hr)
Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr)
Tübingen, 2000-11-08

Human language technology
language resources
– corpora
– dictionaries
language tools
– language resource organizing and retrieval tools
– morphology
– syntax
– semantics
– ...

Text availability for building corpora
written language
– flood of text in digital form
– “cheap” sources
spoken language
– difficulties in data collection
  • problems of recording
  • problems of transcription
  • problems of spontaneity of speakers
  • “expensive” source (typing)
both language varieties
– corpus as text in digital form

WWW as a text source
estimated number of words accessible through AltaVista, 2000-02 (source: Greg Grefenstette, XRCE, 2000-02)
automated conversion of texts to a standardized format needed

Language      Word count estimate
Welsh         7,590,000
Albanian      9,203,000
Breton        9,975,000
Lithuanian    20,927,000
Latvian       21,925,000
Esperanto     26,795,000
Basque        28,296,000
Latin         38,256,000
Estonian      43,257,000
Irish         49,778,000
Icelandic     53,167,000
Roumanian     63,846,000
Croatian      72,122,000
Slovene       74,998,000
Turkish       100,548,000
Malay         113,236,000
Catalan       126,324,000
Slovakian     140,909,000
Finnish       192,105,000
Danish        206,167,000
Polish        235,726,000
Hungarian     268,944,000
Czech         269,310,000
Norwegian     455,391,000
Dutch         622,063,000
Swedish       644,740,000
Portuguese    924,965,000
Italian       1,240,205,000
Spanish       1,595,489,000
French        2,208,418,000
German        3,068,760,000
English       47,264,700,000

Corpus encoding standards
pre-mark-up encoding
SGML (1980s to mid-1990s)
– Text Encoding Initiative (TEI)
– Corpus Encoding Standard (CES)
  • Ide et al. (1996)
XML (last couple of years)
– XCES (XML version of CES)
  • Ide, Bonhomme & Romary (2000)

Conversion to XML
2XML
– tool for conversion
– input formats
  • HTML
  • RTF
– output format
  • XML

2XML 1
producer
– Institute of linguistics, Faculty of Philosophy, University of Zagreb
programming
– Softleks d.o.o., Zagreb
platforms
– Windows 9x/ME/NT/2000
requirements
– Internet Explorer 5.* to run

2XML 2
principle: two-step conversion
1st step
– input: HTML or RTF
– output: intermediate “dirty” XML
2nd step
– input: “dirty” XML
– user-defined script applied to it
– output: XML document

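
The two-step principle can be illustrated with a minimal Python sketch. This is not the 2XML implementation: the class `DirtyXMLBuilder`, the function `apply_script`, the wrapper element `dirty`, and the representation of the user-defined script as a simple tag-renaming table are all assumptions made for illustration, and only HTML input is covered.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class DirtyXMLBuilder(HTMLParser):
    """Step 1 (hypothetical): HTML in, well-formed but still 'dirty' XML out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("dirty")  # assumed wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child element becomes its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def apply_script(root, rename):
    """Step 2 (hypothetical): a 'script' reduced to a tag-renaming table.

    Renamed elements also lose their presentation attributes.
    """
    for elem in root.iter():
        if elem.tag in rename:
            elem.tag = rename[elem.tag]
            elem.attrib.clear()
    return root


html = '<html><body><h1>Title</h1><p class="x">First<br>second</p></body></html>'
builder = DirtyXMLBuilder()
builder.feed(html)
builder.close()

clean = apply_script(builder.root, {"dirty": "text", "h1": "head", "p": "p"})
print(ET.tostring(clean, encoding="unicode"))
```

Keeping the two steps separate means the first step can stay generic (any HTML becomes well-formed XML), while everything specific to a given source or target encoding lives in the second-step rules.
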
2XML Conversion: step 1

2XML Conversion: step 2

2XML user-defined script

2XML Goodies
goodies
– XML tree labeling
– XML text editing
– execute script on load
– batch processing: whole directory

2XML: tree labeling & editing

2XML Tokenizer
program which tokenizes XML files
output in two formats
– tokenized XML file
– tabbed file

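
A minimal Python sketch of what producing the two outputs could look like. It is not the 2XML Tokenizer itself: the token element name `w`, the regular expression, and the columns of the tabbed output (running number, token, enclosing element) are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed token definition: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def make_tokens(text, parent_tag, rows, token_tag):
    """Wrap each token of `text` in an element and record one row per token."""
    out = []
    for tok in TOKEN_RE.findall(text or ""):
        w = ET.Element(token_tag)
        w.text = tok
        out.append(w)
        rows.append((len(rows) + 1, tok, parent_tag))
    return out


def tokenize_element(elem, rows, token_tag="w"):
    """Tokenize `elem` in place, keeping existing child elements in position."""
    new_children = make_tokens(elem.text, elem.tag, rows, token_tag)
    for child in list(elem):
        tokenize_element(child, rows, token_tag)
        tail, child.tail = child.tail, None
        new_children.append(child)
        # Text after a child belongs to the parent element.
        new_children.extend(make_tokens(tail, elem.tag, rows, token_tag))
    elem.text = None
    elem[:] = new_children


root = ET.fromstring("<p>Text, in <hi>digital</hi> form.</p>")
rows = []
tokenize_element(root, rows)

# Output 1: tokenized XML
print(ET.tostring(root, encoding="unicode"))

# Output 2: tabbed file content, one token per line
tabbed = "\n".join("\t".join(str(col) for col in row) for row in rows)
print(tabbed)
```

Both outputs are derived from the same pass over the tree, so the token order in the tabbed listing always matches the order of the token elements in the XML.
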
2XML Tokenizer 2

Tokenizer output: dic file

Tokenizer output: XML file