
Tagging The first editions of the TEI Guidelines used the Standard Generalized Markup Language (SGML) The most recent edition can also be expressed in the Extensible Markup Language (XML)

Upload: pauline-oneal

Post on 26-Dec-2015





Tagging

The first editions of the TEI Guidelines used the Standard Generalized Markup Language (SGML)

The most recent edition can also be expressed in the Extensible Markup Language (XML)


An example

<pb n='474'/>
<div1 type="chapter" n='38'>
<p>Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said &mdash;</p>

<p><q>Mary, I have been married to Mr Rochester this morning.</q> The housekeeper and her husband were of that decent, phlegmatic order of people,[…]; but Mary, bending again over the roast, said only &mdash; </p>

<p><q>Have you, miss? Well, for sure!</q></p>


A TEI document at the textual level consists of the following elements:

<front> – contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.

<group> – contains a number of unitary texts or groups of texts.

<body> – contains the whole body of a single unitary text, excluding any front or back matter.

<back> – contains any appendixes, etc., following the main part of a text.
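As a rough illustration of this layout, the Python sketch below parses a tiny TEI-like fragment with the standard library and lists its top-level parts. The sample document, its contents and the enclosing `<text>` wrapper are assumptions made for the example and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the element contents are invented.
SAMPLE = """
<text>
  <front><p>Preface</p></front>
  <body><p>Reader, I married him.</p></body>
  <back><p>Appendix</p></back>
</text>
"""

def top_level_parts(xml_source):
    """Return the tag names of the direct children of the root element."""
    root = ET.fromstring(xml_source.strip())
    return [child.tag for child in root]

parts = top_level_parts(SAMPLE)
print(parts)
```

A `<group>` element could stand in place of `<body>` when the document gathers several unitary texts.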


Marking Highlighted Phrases

On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité.

<p>On the one hand the <title>Nibelungenlied</title> is associated with the new rise of romance of twelfth-century France, the <foreign>romans d'antiquit&eacute;</foreign>.</p>
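One practical benefit of marking highlighted phrases is that they can be retrieved mechanically. The sketch below, using only the Python standard library, pulls the tagged phrases out of the paragraph above and also recovers the untagged text. It assumes the SGML entity `&eacute;` is rewritten as the numeric reference `&#233;`, because a bare XML parser has no declaration for the named entity; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's paragraph, with &eacute; written as a numeric character
# reference so that it parses without an entity declaration.
PARAGRAPH = (
    "<p>On the one hand the <title>Nibelungenlied</title> is associated "
    "with the new rise of romance of twelfth-century France, the "
    "<foreign>romans d'antiquit&#233;</foreign>.</p>"
)

def highlighted(xml_source):
    """Return (tag, text) pairs for every element nested in the paragraph."""
    root = ET.fromstring(xml_source)
    return [(el.tag, el.text) for el in root.iter() if el is not root]

def plain_text(xml_source):
    """Return the paragraph with all markup stripped."""
    return "".join(ET.fromstring(xml_source).itertext())

phrases = highlighted(PARAGRAPH)
print(phrases)
print(plain_text(PARAGRAPH))
```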


A major caveat: complication

Lou Burnard admits: “[…] there remain many situations in which the TEI's desire to exclude no-one has led to a multiplication of distinctions at first sight rather bewildering. It seems to say the least unlikely that anyone will ever encode a document using every possible element defined by the union of every TEI tag set, though such a monster DTD is indeed possible.”


POS (part of speech) tagging

Each word form is given a POS tag:

Results of trials of selective gut decontamination have been mixed

Results_NN2 of_IO trials_NN2 of_IO selective_JJ gut_NN1 decontamination_NN1 have_VH0 been_VBN mixed_VVN ._.
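The `word_TAG` notation is easy to read back into a program. A minimal Python sketch, assuming tokens are separated by spaces and that the tag follows the last underscore of each token (the function name is hypothetical):

```python
TAGGED = ("Results_NN2 of_IO trials_NN2 of_IO selective_JJ gut_NN1 "
          "decontamination_NN1 have_VH0 been_VBN mixed_VVN ._.")

def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so that '._.' yields ('.', '.').
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(TAGGED)
words = [word for word, _ in pairs]
print(pairs)
```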


Result of Morphological Analysis

Results Results+Prop+Fam+Sg
of of+Prep
trials trial+Noun+Pl
of of+Prep
selective selective+Adj
gut gut+Noun+Sg OR gut gut+Verb+Pres+Non3sg
decontamination decontamination+Noun+Sg
have have+Noun+Sg OR have have+Aux+Pres+Non3sg OR have have+Verb+Pres+Non3sg
been be+Verb+PastPerf+123SP
mixed mix+Verb+PastBoth+123SP OR mixed mixed+Adj


Result of Part-of-Speech Disambiguation

Results result +NOUN
of of +PREP
trials trial +NOUN
of of +PREP
selective selective +ADJ
gut gut +NOUN
decontamination decontamination +NOUN
have have +VHPRES
been be +VBPAP
mixed mixed +ADJ


A simple example: We can fish

Result of Morphological Analysis:
– We we+Pron+Pers+Nom+1P+Pl
– can can+Noun+Sg OR can can+Verb+Pres+Non3sg OR can can+Aux
– fish fish+Verb+Pres+Non3sg OR fish fish+Noun+SP

Result of Part-of-Speech Disambiguation:
– We we +PRONPERS
– can can +VAUX
– fish fish +VINF
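The step from several readings per word to one reading per word can be sketched in Python. The readings are copied from the morphological analysis above; the two preference rules (after a pronoun prefer an auxiliary, after an auxiliary prefer a verb) are invented for this example and are not the method used by the analyser shown in the slides.

```python
# Analyses copied from the slide: surface form -> candidate readings.
ANALYSES = {
    "We": ["we+Pron+Pers+Nom+1P+Pl"],
    "can": ["can+Noun+Sg", "can+Verb+Pres+Non3sg", "can+Aux"],
    "fish": ["fish+Verb+Pres+Non3sg", "fish+Noun+SP"],
}

def category(reading):
    """Return the part of speech, i.e. the first feature after the lemma."""
    return reading.split("+")[1]

# Hypothetical preference rules (not from the slides): the category chosen
# for the previous word decides which category is preferred next.
PREFERRED_AFTER = {"Pron": "Aux", "Aux": "Verb"}

def disambiguate(words):
    """Pick one reading per word, left to right."""
    chosen, previous = [], None
    for word in words:
        readings = ANALYSES[word]
        wanted = PREFERRED_AFTER.get(previous)
        # Take the preferred reading if one exists, else the first listed.
        pick = next((r for r in readings if category(r) == wanted), readings[0])
        chosen.append(pick)
        previous = category(pick)
    return chosen

result = disambiguate(["We", "can", "fish"])
print(result)
```

With these toy rules the selected readings agree with the disambiguated output above: a pronoun, an auxiliary and a verb.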


SUPERVISED VS. UNSUPERVISED

Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set.


Unsupervised models, on the other hand, do not require a pretagged corpus. Instead they use sophisticated computational methods to induce word groupings (i.e. tag sets) automatically and, based on those groupings, either calculate the probabilistic information needed by stochastic taggers or induce the context rules needed by rule-based systems.


RULE BASED TAGGING

Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. These rules are often known as context frame rules. For example, a context frame rule might say: if an ambiguous or unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective:

det - X - n = X/adj
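A minimal Python sketch of this one rule follows. It assumes a sentence is a list of (word, tag) pairs in which an unknown or ambiguous word carries the tag `None`; that representation, the function name and the nonsense word in the example are all hypothetical.

```python
def apply_context_rule(tagged):
    """Tag an unresolved word as 'adj' when it sits between a det and an n."""
    result = list(tagged)
    for i in range(1, len(result) - 1):
        word, tag = result[i]
        if tag is None and result[i - 1][1] == "det" and result[i + 1][1] == "n":
            result[i] = (word, "adj")
    return result

# 'blorf' is an invented unknown word standing in for X.
sentence = [("the", "det"), ("blorf", None), ("dog", "n")]
resolved = apply_context_rule(sentence)
print(resolved)
```

A word whose context does not match the frame is left untagged, so a real system would need further rules or a fallback.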


STOCHASTIC TAGGING

Any model which somehow incorporates frequency or probability, i.e. statistics, may be properly labelled stochastic.

Etymology: Greek stochastikos, skillful in aiming, from stochazesthai, to aim at, guess at, from stochos, target, aim, guess

1: RANDOM; specifically: involving a random variable *a stochastic process*

2: involving chance or probability: PROBABILISTIC *a stochastic model of radiation-induced mutation*


The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set is the one assigned to an ambiguous instance of that word. The problem with this approach is that while it may yield a valid tag for a given word, it can also yield inadmissible sequences of tags.
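The most-frequent-tag approach fits in a few lines of Python. The training pairs below are invented for illustration, and the fallback tag for unseen words is an assumption rather than part of the method described above.

```python
from collections import Counter, defaultdict

# Invented toy training data: (word, tag) pairs.
TRAINING = [
    ("the", "AT"), ("man", "NN"), ("saw", "VBD"), ("her", "PPO"),
    ("man", "VB"), ("the", "AT"), ("saw", "NN"),
    ("the", "AT"), ("man", "NN"), ("saw", "VBD"),
]

def train(pairs):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word independently; unseen words get the assumed default."""
    return [(w, model.get(w, default)) for w in words]

model = train(TRAINING)
tagged = tag_words(["the", "saw"], model)
print(tagged)
```

Because each word is tagged in isolation, "the saw" comes out as AT followed by VBD, an article followed by a past-tense verb. Each tag is valid for its word, but the sequence is not a plausible one.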


An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags.

The combination of the previous two approaches, word/tag frequencies and tag sequence probabilities, is known as a Hidden Markov Model.
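How the two sources of evidence combine can be sketched in Python with a bigram model. The code uses Viterbi decoding, a standard technique that the slides do not name, and every probability in it is invented for illustration. At each word it multiplies the probability of moving from the previous tag to the current one by the probability of the word given the tag, and keeps the best path.

```python
# All probabilities below are invented for illustration.
TAGS = ["AT", "NN", "VB"]
START = {"AT": 0.8, "NN": 0.1, "VB": 0.1}   # P(first tag)
TRANS = {                                    # P(next tag | previous tag)
    "AT": {"AT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"AT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"AT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {                                     # P(word | tag)
    "AT": {"the": 1.0},
    "NN": {"man": 0.7, "saw": 0.3},
    "VB": {"man": 0.2, "saw": 0.8},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    # best[t] = (probability, path) of the best sequence ending in tag t.
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for word in words[1:]:
        step = {}
        for t in TAGS:
            # Best predecessor for tag t, by path probability times transition.
            prob, path = max(
                ((best[p][0] * TRANS[p][t], best[p][1]) for p in TAGS),
                key=lambda item: item[0],
            )
            step[t] = (prob * EMIT[t].get(word, 0.0), path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

best_path = viterbi(["the", "man", "saw"])
print(best_path)
```

Under these invented numbers "man" is read as a noun and "saw" as a verb, because the transition from article to noun and from noun to verb outweighs the alternatives.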


An example: the man still saw her

the: AT
man: NN VB
still: NN VB RB (JJ)
saw: NN VB VBD
her: PPO PP$
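To see why disambiguation is needed even for five words, the candidate tags can be expanded into every possible tag sequence. The Python sketch below copies the candidates from the slide and counts the parenthesised JJ reading of "still" as a full candidate, which is an assumption.

```python
from itertools import product

# Candidate tags per word, as listed on the slide (JJ included for "still").
CANDIDATES = {
    "the": ["AT"],
    "man": ["NN", "VB"],
    "still": ["NN", "VB", "RB", "JJ"],
    "saw": ["NN", "VB", "VBD"],
    "her": ["PPO", "PP$"],
}

def tag_sequences(words):
    """Return every combination of one candidate tag per word."""
    return list(product(*(CANDIDATES[w] for w in words)))

sequences = tag_sequences("the man still saw her".split())
print(len(sequences))  # 1 * 2 * 4 * 3 * 2 = 48
```

A tagger has to pick one of these 48 sequences, for instance AT NN RB VBD PPO.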


Table of transitional probabilities for "the man still saw her"

NN PPO PP$ RB VB VBD .

AT 186 1

NN 4 40 9

PPO

PP$

RB 5 16

VB

VBD 584 143


An example of tree tagging


Semantic ambiguity

Results of trials of selective gut decontamination have been mixed


TRIAL

1. Law. Examination of evidence and applicable law by a competent tribunal to determine the issue of specified charges or claims. (PROCES)

2.a. The act or process of testing, trying, or putting to the proof: a trial of one's faith. b. An instance of such testing, especially as part of a series of tests or experiments. (TEST)

3. An effort or attempt: succeeded on the third trial. (TENTATIVE)

4. A state of pain or anguish that tests patience, endurance, or belief. (DIFFICULTE, TOURMENT, EPREUVE)

5. A trying, troublesome, or annoying person or thing: The child was a trial to his parents. (?) (PERSONNE DIFFICILE A SUPPORTER)

6. A preliminary competition or test to determine qualifications, as in a sport. (ESSAI, EPREUVE)


A few ambiguous nouns in the medical vocabulary (HCA = Health care activity)

HCA / Spatial concept: section, abord
HCA / Body space or junction: ouverture, séparation
HCA / organization: administration
HCA / finding: décollement, déviation
HCA / substance: préparation


WordNet (a lexical database for the English language)

English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept.

Different relations link the synonym sets.

The database can be searched online at:

http://www.cogsci.princeton.edu/cgi-bin/webwn


WordNet search for trial (1)

1. trial -- ((law) legal proceedings consisting of the judicial examination of issues by a competent tribunal; "most of these complaints are settled before they go to trial") procès, jugement

2. test, trial, run -- (the act of testing something; "in the experimental trials the amount of carbon was measured separately"; "he called each flip of the coin a new trial") test, galop d’essai

3. trial -- ((sports) a preliminary competition to determine qualifications; "the trials for the semifinals began yesterday") éliminatoires, essais


WordNet search for trial (2)

4. trial -- ((law) the determination of a person's innocence or guilt by due process of law; "he had a fair trial and the jury found him guilty") procès

5. trial, trial run, test, tryout -- (trying something to find out about it; "a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain") essai

6. trial, tribulation, visitation -- (an annoying or frustrating or catastrophic event; "his mother-in-law's visits were a great trial for him"; "life is full of tribulations"; "a visitation of the plague") épreuve, difficulté

7. test, trial -- (the act of undergoing testing; "he survived the great test of battle"; "candidates must compete in a trial of skill") épreuve


Synonyms for the various senses of trial

Sense 1 => proceeding, legal proceeding, proceedings
Sense 2 => attempt, effort, endeavor, endeavour, try
Sense 3 => contest, competition
Sense 4 => proceeding, legal proceeding, proceedings
Sense 5 => experiment, experimentation
Sense 6 => affliction
Sense 7 => attempt, effort, endeavor, endeavour, try


Other WordNet searches

Coordinate terms (sense 1)
    => foreclosure (saisie)
    => intervention
    => procedure
    => legal action, action, action at law
    => lawsuit, suit, case, cause, causa
    => adoption
    => appeal
    => bankruptcy
    => receivership
    => litigation, judicial proceeding
    => naturalization, naturalisation
    => review
    => hearing


Hypernyms

Sense 1
trial
    => proceeding, legal proceeding, proceedings
        => due process, due process of law
            => group action
                => act, human action, human activity
                    => event
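The hypernym listing is a chain of "is a kind of" links, which a small Python sketch can follow from the most specific term to the most general. The links are copied from the slide, with each synonym set represented by its first member only; that simplification and the function name are assumptions of the example, and no WordNet library is involved.

```python
# Hypernym links for sense 1 of "trial", copied from the slide.
# Each synonym set is represented by its first member only.
HYPERNYM = {
    "trial": "proceeding",
    "proceeding": "due process",
    "due process": "group action",
    "group action": "act",
    "act": "event",
}

def hypernym_chain(term):
    """Follow the hypernym links upward until no more general term is known."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

chain = hypernym_chain("trial")
print(" => ".join(chain))
```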


Hyponyms

Sense 1
trial
    => mistrial (procès entaché d’un vice de procédure)
    => retrial (nouveau procès)


Meronyms

Sense 1
trial
    HAS PART: plea
    HAS PART: prosecution, criminal prosecution
    HAS PART: defense, defence, denial, demurrer (personne qui soulève une objection)


Derivationally related forms (none for trial)

Domain

Sense 1
trial
    CATEGORY --> (noun) law#2, jurisprudence#2

Familiarity

trial used as a noun is common (polysemy count = 7) (other values are: uncommon, rare, very rare)

trial lawyer: avocat plaidant