Modern Information Retrieval
![Page 1: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/1.jpg)
Modern Information Retrieval
Chap. 06: Text and Multimedia Languages and Properties
(Introduction, Metadata and Text) 6.1, 6.2, 6.3
![Page 2: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/2.jpg)
Introduction
• Text
  – main form of communicating knowledge.
• Document
  – loosely defined, denotes a single unit of information.
  – can be any physical unit
    • a file
    • an email
    • a Web page
![Page 3: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/3.jpg)
Introduction
• Document
  – Syntax and structure
  – Semantics
  – Information about itself
![Page 4: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/4.jpg)
Introduction
• Document Syntax
  – Implicit, or expressed in a language (e.g., TeX)
  – Powerful languages: easier to parse, difficult to convert to other formats.
  – Open languages are better (interchange)
  – Semantics of texts in natural language are not easy for a computer to understand
  – Trend: languages that provide information on structure, format and semantics while being readable by humans and computers
![Page 5: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/5.jpg)
Introduction
• New applications are pushing for formats in which information can be represented independently of style.
• Style: defined by the author, but the reader may decide part of it
• Style can include treatment of other media
![Page 6: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/6.jpg)
Metadata
• “Data about the data”
  – e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
• Descriptive Metadata
  – Author, source, length
  – Dublin Core Metadata Element Set
• Semantic Metadata
  – Characterizes the subject matter within the document contents
  – MEDLINE
![Page 7: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/7.jpg)
Metadata
• Metadata information on Web documents
  – cataloging, content rating, property rights, digital signatures
• New standard: Resource Description Framework
  – description of Web resources to facilitate automated processing of information
  – nodes and attached attribute/value pairs
• Metadescription of non-textual objects
  – keywords can be used to search the objects
![Page 8: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/8.jpg)
RDF Model
• A model is a collection of statements
• Statement := (predicate, subject, object)
• Predicate is a resource
• Subject is a resource
• Object is either a resource or a literal
[Figure: a statement drawn as an arrow labeled Predicate from Subject to Object]
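
The model above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Statement` and `Literal` classes, the `objects_of` helper and the `ex:` names are hypothetical and do not come from any RDF library. The field order (predicate, subject, object) follows the slide.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A literal value (a plain string), as opposed to a resource."""
    value: str


class Statement(NamedTuple):
    """One RDF statement, in the (predicate, subject, object) order used above."""
    predicate: str                 # a resource, written here as a URI-like string
    subject: str                   # a resource
    object: Union[str, Literal]    # either a resource or a literal


# A model is a collection of statements; a set silently drops duplicates.
# The "ex:" names are made up for this example.
model = {
    Statement("ex:sells", "ex:company", Literal("batteries")),
    Statement("ex:hasHomePage", "ex:company", "ex:homepage"),
}


def objects_of(model, subject, predicate):
    """Return every object attached to `subject` through `predicate`."""
    return [s.object for s in model
            if s.subject == subject and s.predicate == predicate]
```

Storing the model as a set of immutable tuples keeps the sketch close to the definition: nothing exists in the model except statements.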
![Page 9: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/9.jpg)
Example shown in triples view
![Page 10: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/10.jpg)
RDF model and natural language
• Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.”
• Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells.”
• Object. In grammar, this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”
![Page 11: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/11.jpg)
XML vs. RDF
• RDF is not just an XML dialect.
  – XML:
    • Has a tree structure data model.
    • Only nodes are labeled.
  – RDF:
    • Has a graph structure data model.
    • Both edges (properties) and nodes (subjects/objects) are labeled.
![Page 12: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/12.jpg)
Linking Statements
• The subject of one statement can be the object of another
• Such collections of statements form a directed, labeled graph
[Figure: directed, labeled graph: Ganji --studentOF--> CE, CE --departmentOF--> Sharif, CE --hasHomePage--> http://ce.sharif.edu]
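
A small Python sketch of how linked statements form a directed, labeled graph. The node and edge labels come from the slide; which nodes each edge connects is an assumption that matches the RDF/XML example on a later slide. The helper names `build_graph` and `reachable` are hypothetical.

```python
from collections import defaultdict

# (subject, predicate, object) triples; CE is the object of the first
# statement and the subject of the other two, which links them together.
triples = [
    ("Ganji", "studentOF", "CE"),
    ("CE", "departmentOF", "Sharif"),
    ("CE", "hasHomePage", "http://ce.sharif.edu"),
]


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return all nodes that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _predicate, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen
```

Starting from `Ganji`, the traversal reaches `CE` directly and then `Sharif` and the home page through `CE`, which is exactly the linking the slide describes.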
![Page 13: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/13.jpg)
RDF Graph: ‘anonymous’ nodes
[Figure: RDF graph with the nodes Person12345, “Jonathan” and “Borden”; the edge labels person.name, first, last and value; and the node types Person, PersonName and Literal]
![Page 14: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/14.jpg)
How can RDF be implemented
• Usually RDF/XML syntax
• However, other notations are possible
  – e.g. Notation3:
    • Buddy Belden owns a business.
    • The business has a Web site accessible at http://www.c2i2.com/~budstv.
    • Buddy is the father of Lynne.

    • <#Buddy> <#owns> <#business>.
    • <#business> <#has-website> <http://www.c2i2.com/~budstv>.
    • <#Buddy> <#father-of> <#Lynne>.
![Page 15: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/15.jpg)
Converting N3 to RDF
• Jena toolkit can do such conversion
![Page 16: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/16.jpg)
XML Syntax for RDF
• RDF has an XML syntax that has a specific meaning:
• Every Description element describes a resource
• Every attribute or nested element inside a Description is a property of that resource
• We can refer to resources by using URIs

<rdf:Description about="some.uri/person/ganji">
  <studentOf resource="some.uri/Sharif/CE"/>
</rdf:Description>
<rdf:Description about="some.uri/Sharif/CE">
  <hasHomePage>http://ce.sharif.edu</hasHomePage>
  <departmentOf resource="some.uri/~Sharif"/>
</rdf:Description>
![Page 17: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/17.jpg)
RDF type
• RDF predefined property
• Its value: a resource that represents a category or class
• Its subject: an instance of that category or class
prefix ex:, URI: http://www.example.org/terms
![Page 18: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/18.jpg)
Containers
• Containers are collections
  – they allow grouping of resources (or literal values)
• It is possible to make statements about the container (as a whole) or about its members individually
• It is also possible to create collections based on URI patterns
  – for example, all files in a particular web site
![Page 19: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/19.jpg)
RDF containers
• Bag: (A resource having type rdf:Bag)
  – Represents an unordered list of resources or literals
  – Duplicated values are permitted
• Sequence: (A resource having type rdf:Seq)
  – Represents an ordered list of resources or literals
  – Duplicated values are permitted
• Alternatives: (A resource having type rdf:Alt)
  – Represents a group of resources or literals that are alternatives
![Page 20: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/20.jpg)
Sequence example
[Figure: http://www.w3.org/TR/REC-rdf-syntax --dc:Creator--> a node of rdf:Type rdf:Seq, whose members are rdf:_1 “Ora Lassila” and rdf:_2 “Ralph Swick”]
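
As a rough illustration of the sequence example, the Python sketch below expands an ordered container into plain triples: one type statement plus one numbered membership statement (`rdf:_1`, `rdf:_2`, ...) per member. The function name `seq_to_triples` and the blank-node label `_:creators` are hypothetical, chosen only for this example.

```python
def seq_to_triples(container_id, members):
    """Expand an ordered container into (subject, predicate, object) triples.

    The first triple types the container as rdf:Seq; each member then gets
    a membership property rdf:_n, where n is its 1-based position.
    """
    triples = [(container_id, "rdf:type", "rdf:Seq")]
    for position, member in enumerate(members, start=1):
        triples.append((container_id, f"rdf:_{position}", member))
    return triples


# The document's creator property points at the container, not at the members.
document = "http://www.w3.org/TR/REC-rdf-syntax"
statements = [(document, "dc:Creator", "_:creators")]
statements += seq_to_triples("_:creators", ["Ora Lassila", "Ralph Swick"])
```

Because the order lives in the property names rather than in the storage, the triples can be kept in any order without losing the sequence.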
![Page 21: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/21.jpg)
Bag example
![Page 22: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/22.jpg)
RDF Schema (RDFS)
• RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type
• RDF Schema allows you to define vocabulary terms and the relations between those terms
  – it gives “extra meaning” to particular RDF predicates and resources
  – this “extra meaning”, or semantics, specifies how a term should be interpreted
![Page 23: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/23.jpg)
Core Classes & Properties
Core Classes
• rdfs:Resource
• rdfs:Literal
• rdf:XMLLiteral
• rdfs:Class
• rdf:Property
Core Properties
• rdf:type
• rdfs:subClassOf
• rdfs:subPropertyOf
• rdfs:domain
• rdfs:range
• rdfs:label
• rdfs:comment
![Page 24: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/24.jpg)
RDFS Examples
<Person, type, Class>
<hasColleague, type, Property>
<Professor, subClassOf, Person>
<Carole, type, Professor>
<hasColleague, range, Person>
<hasColleague, domain, Person>
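
To show what the “extra meaning” of subClassOf buys, here is a minimal Python sketch that applies a single RDFS entailment rule to the example triples: if x has type C1 and C1 is a subClassOf C2, then x also has type C2. The function name `infer_types` is hypothetical, and the other RDFS rules (domain, range, subPropertyOf) are deliberately left out.

```python
# The example statements, written as (subject, predicate, object).
facts = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}


def infer_types(facts):
    """Close the facts under: (x type C1) and (C1 subClassOf C2) => (x type C2)."""
    inferred = set(facts)
    changed = True
    while changed:                      # repeat until no new triple appears
        changed = False
        for x, p1, c1 in list(inferred):
            if p1 != "type":
                continue
            for c, p2, c2 in list(inferred):
                derived = (x, "type", c2)
                if p2 == "subClassOf" and c == c1 and derived not in inferred:
                    inferred.add(derived)
                    changed = True
    return inferred
```

On the example, the only new statement is that Carole is a Person, which follows from her being a Professor.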
![Page 25: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/25.jpg)
RDF/RDFS “Liberality”
• No distinction between classes and instances (individuals)
  <Species, type, Class>
  <Lion, type, Species>
  <Leo, type, Lion>
• Properties can themselves have properties
  <hasDaughter, subPropertyOf, hasChild>
  <hasDaughter, type, familyProperty>
• No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other
  <type, range, Class>
  <Property, type, Class>
  <type, subPropertyOf, subClassOf>
![Page 26: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/26.jpg)
Problems with RDFS
• RDFS too weak to describe resources in sufficient detail
  – No localised range and domain constraints
    • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants
  – No existence/cardinality constraints
    • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents
  – No transitive, inverse or symmetrical properties
    • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical
  – …
• Difficult to provide reasoning support
  – No “native” reasoners for non-standard semantics
  – May be possible to reason via FO axiomatisation
![Page 27: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/27.jpg)
RDF(S) tools
• Read RDF data
  – Parsers: Jena, Redland, SWI-Prolog
  – Validators: W3C RDF validation service
  – Editors: IsaViz, RDF Author, RDFEd, InferEd
• Store RDF data (XML format, triples or relational/OO DB)
  – Sesame, RSSDB, RDFLib
• Use RDF data (applications, RSS news, etc.)
• Manipulate RDF data (inference, query, etc.)
  – Jena RDQL, etc.
  – Example:

SELECT ?person, ?knows
WHERE (?x <http://xmlns.com/foap/knows> ?z),
      (?x <http://xmlns.com/foap/name> ?person),
      (?z <http://xmlns.com/foap/name> ?knows)
![Page 28: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/28.jpg)
RDF Validators
• RDF Validation Service
  – http://www.w3.org/RDF/Validator/
• In general all the RDF parsers do some kind of validation
![Page 29: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/29.jpg)
References
• RDF Resource Guide:
  – http://www.ilrt.bris.ac.uk/discovery/rdf/resources/
• http://www.w3.org/RDF
• http://www.w3.org/RDF/Validator/
![Page 30: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/30.jpg)
Text
• Text coding in bits
  – EBCDIC, ASCII
    • Initially, 7 bits. Later, 8 bits
  – Unicode
    • 16 bits, to accommodate oriental languages
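
The codings above can be inspected with Python's built-in codecs. This is a small sketch, assuming the `cp500` codec as a stand-in for EBCDIC and UTF-16 (big-endian, without a byte-order mark) as the 16-bit Unicode coding; the sample string and the helper `bits` are arbitrary choices for illustration.

```python
text = "IR"

ascii_bytes = text.encode("ascii")       # one byte per character
ebcdic_bytes = text.encode("cp500")      # one byte per character, different code values
utf16_bytes = text.encode("utf-16-be")   # two bytes per character for this text


def bits(data):
    """Render bytes as groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


print("ASCII :", bits(ascii_bytes))
print("EBCDIC:", bits(ebcdic_bytes))
print("UTF-16:", bits(utf16_bytes))
```

The same two characters map to different bit patterns under each coding, which is why a retrieval system has to know the coding before it can interpret the text.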
![Page 31: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/31.jpg)
Text
• Formats
  – No single format exists
  – IR system should retrieve information from different formats
  – Past: IR systems convert the documents
  – Today: IR systems use filters
![Page 32: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/32.jpg)
Text
• Formats
  – Formats for document interchange (RTF)
  – Formats for displaying (PDF, PostScript)
  – Formats for encoding email (MIME)
  – Compressed files
    • uuencode/uudecode, binhex
![Page 33: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/33.jpg)
Text
• Information Theory
  – Amount of information is related to the distribution of symbols in the document.
  – Entropy:
$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$
  – Definition of entropy depends on the probabilities of each symbol ($p_i$ is the probability of the $i$-th of the $\sigma$ symbols).
  – Text models are used to obtain those probabilities
![Page 34: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/34.jpg)
Text
• Example - Entropy
  – 001001011011
$$E = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$
![Page 35: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/35.jpg)
Text
• Example - Entropy
  – 111111111111
$$E = -\left(0\log_2 0 + 1\log_2 1\right) = 0$$
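
The two entropy examples can be checked with a short Python sketch. It assumes the simplest text model, in which the probability of each symbol is its relative frequency in the string; the function name `entropy` is just a label for this example.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, with probabilities taken from symbol counts."""
    counts = Counter(text)
    total = len(text)
    # Symbols that never occur are absent from `counts`, so the 0 * log2(0)
    # term is never evaluated (it is taken to be 0 by convention).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy("001001011011"))   # six 0s and six 1s
print(entropy("111111111111"))   # a single repeated symbol
```

The first string has two equally likely symbols, so each symbol carries one bit; the second is fully predictable and carries none.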
![Page 36: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/36.jpg)
Text
• Modeling Natural Language
  – Symbols: separate words or belong to words
  – Symbols are not uniformly distributed
    • binomial model
  – Dependency on previous symbols
    • k-order Markovian model
  – We can take words as symbols
![Page 37: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/37.jpg)
Text
• Modeling Natural Language
  – Word distribution inside documents
  – Zipf’s Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word. Hence, in a text of $n$ words with a vocabulary of $V$ words, the i-th most frequent word appears
$$\frac{n}{i^{\theta} H_V(\theta)} \ \text{times, where} \quad H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$
  – Real data fits better with $\theta$ between 1.5 and 2.0
![Page 38: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/38.jpg)
Text
• Modeling Natural Language
  – Example - word distribution (Zipf’s Law)
    • V=1000, θ = 2
    • most frequent word: n=300
    • 2nd most frequent: n=76
    • 3rd most frequent: n=33
    • 4th most frequent: n=19
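
A Python sketch of the Zipf's Law computation. The slide gives V and θ but not the text length n; the value n = 500 used below is an assumption, chosen because it roughly reproduces the counts in the example. The function names are hypothetical.

```python
def harmonic(V, theta):
    """Harmonic number of order theta: the sum of 1 / j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_count(i, n, V, theta):
    """Expected occurrences of the i-th most frequent word in a text of n words."""
    return n / (i ** theta * harmonic(V, theta))


V, theta = 1000, 2
n = 500  # assumed: not stated in the example
for rank in range(1, 5):
    print(rank, int(zipf_count(rank, n, V, theta)))
```

Under that assumption the sketch gives about 304, 76, 33 and 19 occurrences for ranks 1 to 4, close to the figures in the example. Whatever n is, the ratio between rank 1 and rank i is always $i^{\theta}$.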
![Page 39: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/39.jpg)
Text
• Modeling Natural Language
  – Number of distinct words
  – Heaps’ Law (vocabulary size $V$ for a text of $n$ words):
$$V = K n^{\beta}$$
  – Set of different words is bounded by a constant, but the limit is too high
![Page 40: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/40.jpg)
Text
• Modeling Natural Language
  – Heaps’ Law example
    • K between 10 and 100, β is less than 1
    • example: n=400000, β = 0.5
      – K=25, V=15811
      – K=35, V=22135
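
The Heaps' Law example can be reproduced with a one-line function. This is a sketch under the reading that the law is V = K·n^β, with n the text size in words; the function name is hypothetical.

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' Law: V = K * n**beta."""
    return K * n ** beta


n, beta = 400_000, 0.5
for K in (25, 35):
    # int() truncates, which matches the values given in the example.
    print(K, int(heaps_vocabulary(n, K, beta)))
```

With β = 0.5 the vocabulary grows with the square root of the text size, so a text four times longer has only about twice as many distinct words.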
![Page 41: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/41.jpg)
Text
• Modeling Natural Language
  – Length of the words
    • defines total space needed for vocabulary
  – Heaps’ Law: length increases logarithmically with text size.
  – In practice, a finite-state model is used
    • space has p=0.2
    • space cannot appear twice in a row
    • there are 26 letters
![Page 42: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/42.jpg)
Text
• Similarity Models
  – Distance Function
    • Should be symmetric and satisfy triangle inequality
  – Hamming Distance
    • number of positions that have different characters
    • example: reverse vs. receive (the two words differ in 3 positions)
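
A minimal Python sketch of the Hamming distance. It assumes, as the definition requires, that both strings have the same length; the function name is just a label for this example.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("reverse", "receive"))
```

For reverse and receive the differing positions are the third, fifth and sixth characters. The function is symmetric by construction, as a distance function should be.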
![Page 43: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/43.jpg)
Text
• Similarity Models
  – Edit (Levenshtein) Distance
    • minimum number of operations needed to make strings equal
    • example: survey vs. surgery
    • superior for modeling syntactic errors
    • extensions: weights, transpositions, etc.
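
The edit distance can be computed with the usual dynamic-programming recurrence. The Python sketch below assumes unit cost for insertions, deletions and substitutions and does not include the weighted or transposition extensions; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]                           # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute, free when equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("survey", "surgery"))
```

Turning survey into surgery takes two operations: substitute v with g and insert an r before the final y. Keeping only two rows of the table keeps memory proportional to the length of one string.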
![Page 44: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/44.jpg)
Text
• Similarity Models
  – Longest Common Subsequence (LCS)
    • survey - surgery, LCS: surey
  – Documents: lines as symbols (diff in Unix)
    • time consuming
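
A Python sketch of the longest common subsequence, using the standard dynamic-programming table. The function name `lcs` is hypothetical. The table has one cell per pair of positions, which is where the time cost comes from when the symbols are the lines of two long documents.

```python
def lcs(a, b):
    """Return one longest common subsequence of the strings a and b."""
    # table[i][j] is the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the symbols.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


print(lcs("survey", "surgery"))
```

For survey and surgery this recovers surey, the subsequence given in the example.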
![Page 45: Modern Information Retreival](https://reader035.vdocuments.mx/reader035/viewer/2022070419/56815b8d550346895dc98fc0/html5/thumbnails/45.jpg)
Conclusions
• Text is the main form of communicating knowledge.
• Documents have syntax, structure and semantics
• Metadata: information about data
• Formats of text
• Modeling Natural Language
  – Entropy
  – Distribution of symbols
• Similarity