2XML
Marko Tadić ([email protected])
Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr)
Tübingen, 2000-11-08
Human language technology
  language resources
    – corpora
    – dictionaries
  language tools
    – language resource organizing and retrieval tools
    – morphology
    – syntax
    – semantics
    – ...
Text availability for building corpora
  written language
    – flood of text in digital form
    – “cheap” sources
  spoken language
    – difficulties in data collecting
      • problems of recording
      • problems of transcription
      • problems of spontaneity of speakers
      • “expensive” source (typing)
  both language varieties
    – corpus as text in digital form
WWW as a text source
  estimated number of words accessible through AltaVista, 2000-02 (source: Greg Grefenstette, XRCE, 2000-02)
    Language    Word count estimate
    Welsh                 7,590,000
    Albanian              9,203,000
    Breton                9,975,000
    Lithuanian           20,927,000
    Latvian              21,925,000
    Esperanto            26,795,000
    Basque               28,296,000
    Latin                38,256,000
    Estonian             43,257,000
    Irish                49,778,000
    Icelandic            53,167,000
    Romanian             63,846,000
    Croatian             72,122,000
    Slovene              74,998,000
    Turkish             100,548,000
    Malay               113,236,000
    Catalan             126,324,000
    Slovakian           140,909,000
    Finnish             192,105,000
    Danish              206,167,000
    Polish              235,726,000
    Hungarian           268,944,000
    Czech               269,310,000
    Norwegian           455,391,000
    Dutch               622,063,000
    Swedish             644,740,000
    Portuguese          924,965,000
    Italian           1,240,205,000
    Spanish           1,595,489,000
    French            2,208,418,000
    German            3,068,760,000
    English          47,264,700,000
  automated conversion of texts to a standardized format needed
Corpus encoding standards
  pre-mark-up encoding
  SGML (1980s to mid-1990s)
    – Text Encoding Initiative (TEI)
    – Corpus Encoding Standard (CES)
      • Ide et al. (1996)
  XML (last couple of years)
    – XCES (XML version of CES)
      • Ide, Bonhomme & Romary (2000)
Conversion to XML
  2XML
    – tool for conversion
    – input formats
      • HTML
      • RTF
    – output format
      • XML
2XML 1
  producer
    – Institute of Linguistics, Faculty of Philosophy, University of Zagreb
  programming
    – Softleks d.o.o., Zagreb
  platforms
    – Windows 9x/ME/NT/2000
  requirements
    – Internet Explorer 5.* to run
2XML 2
  principle: two-step conversion
  1st step
    – input: HTML or RTF
    – output: intermediate “dirty” XML
  2nd step
    – input: “dirty” XML
    – user-defined script applied to it
    – output: XML document
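The two-step principle can be sketched in Python as follows. This is only an illustration, not 2XML's actual implementation: step 1 mirrors the HTML tags into a well-formed intermediate tree, and step 2 applies a user-defined mapping to it. The names `DirtyXMLBuilder`, `html_to_dirty_xml` and `apply_script`, the `<dirty>` and `<text>` wrapper elements, and the dict-based "script" are all assumptions made for this sketch; the slides do not show 2XML's real script language, and RTF input is not covered here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class DirtyXMLBuilder(HTMLParser):
    """Step 1 (hypothetical): mirror the HTML tag structure as an XML tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("dirty")   # assumed wrapper element
        self.stack = [self.root]          # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                   # text following an already closed child
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_dirty_xml(html):
    """Step 1: HTML string in, intermediate "dirty" XML tree out."""
    builder = DirtyXMLBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def apply_script(dirty_root, rules):
    """Step 2: keep only the elements named in `rules`, renamed, as flat text.

    `rules` plays the role of the user-defined script: a dict that maps
    "dirty" tag names to target tag names. Nested mapped elements are not
    handled specially in this sketch.
    """
    clean = ET.Element("text")            # assumed root of the output document
    for elem in dirty_root.iter():
        if elem.tag in rules:
            out = ET.SubElement(clean, rules[elem.tag])
            out.text = " ".join("".join(elem.itertext()).split())
    return clean


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
    dirty = html_to_dirty_xml(page)
    clean = apply_script(dirty, {"h1": "head", "p": "p"})
    print(ET.tostring(clean, encoding="unicode"))
```

Keeping the intermediate tree separate from the script step means the same step-1 output can be reused with different mappings, which is presumably the point of the two-step design.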
2XML Conversion: step 1
2XML Conversion: step 2
2XML user-defined script
2XML Goodies
  goodies
    – XML tree labeling
    – XML text editing
    – execute script on load
    – batch processing: whole directory
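Batch processing of a whole directory could look roughly like the Python sketch below. It is not taken from 2XML: the function name `batch_convert`, the `convert` callback (any function that takes the file text and returns an XML string), the default suffix list and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path


def batch_convert(directory, convert, suffixes=(".html", ".htm")):
    """Apply `convert` to every matching file in `directory`.

    Each result is written next to its source file with an .xml suffix.
    Returns the list of written paths, sorted by file name.
    """
    written = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            target = path.with_suffix(".xml")
            source_text = path.read_text(encoding="utf-8")
            target.write_text(convert(source_text), encoding="utf-8")
            written.append(target)
    return written
```

A call such as `batch_convert("corpus_raw", my_converter)` would then process every `.html` and `.htm` file in the hypothetical `corpus_raw` directory and leave other files untouched.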
2XML: tree labeling & editing
2XML Tokenizer
  program which tokenizes XML files
  output in two formats
    – tokenized XML file
    – tabbed file
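The two output formats can be illustrated with a short Python sketch. It is not the 2XML Tokenizer itself: the function name `tokenize_xml`, the `<w>` element used for tokens, the choice of `<p>` as the unit to tokenize, the token pattern and the "running number, tab, token" layout of the tabbed output are all assumptions, since the slides do not specify them.

```python
import re
import xml.etree.ElementTree as ET

# Assumed token pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_xml(xml_text, unit="p"):
    """Tokenize the text of every `unit` element.

    Returns a pair (tokenized_xml, tabbed):
      tokenized_xml - the document with each token wrapped in <w>...</w>
      tabbed        - one line per token: running number, TAB, token
    Inline markup inside a unit is flattened in this sketch.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in list(root.iter(unit)):
        tokens = TOKEN_RE.findall("".join(elem.itertext()))
        for child in list(elem):          # drop old children, keep only tokens
            elem.remove(child)
        elem.text = None
        for token in tokens:
            w = ET.SubElement(elem, "w")
            w.text = token
            rows.append(token)
    tabbed = "\n".join(f"{n}\t{token}" for n, token in enumerate(rows, start=1))
    return ET.tostring(root, encoding="unicode"), tabbed


if __name__ == "__main__":
    xml_out, tabbed = tokenize_xml("<text><p>Dobar dan, svijete!</p></text>")
    print(xml_out)
    print(tabbed)
```

Because `\w` is Unicode-aware for Python strings, words with diacritics such as "Tadić" stay in one token.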
2XML Tokenizer 2
Tokenizer output: dic file
Tokenizer output: XML file