
Post on 28-May-2020


Gestione Avanzata dell'Informazione
Part A - Full-Text Information Management

Full-Text Indexing


Contents
• Introduction
• Inverted Indices
• Construction
• Searching

GAvI - Full-Text Information Management: Full-Text Indexing


Sequential or online searching
• Involves finding the occurrences of a pattern in a text when the text is not preprocessed
• Is appropriate when the text is small
• Is the only choice if the text collection is very volatile (i.e. undergoes modifications very frequently), or the index space overhead cannot be afforded


Indexed searching

Index
• A data structure built over the text to speed up the search
• Is appropriate when the text collection is large and semi-static
• Semi-static collection: it is updated at reasonably regular intervals, but is not expected to support thousands of insertions of single words per second

Indexing techniques
• Inverted indices, suffix arrays, and signature files
• Criteria to consider:
    • search cost
    • space overhead
    • cost of building and updating the indexing structures


Indexing techniques

Inverted indices
• Word-oriented mechanism for indexing a text collection
• Composed of a vocabulary and occurrences
• Are currently the best choice for most applications

Suffix arrays
• Are faster for phrase searches and other less common queries
• Are harder to build and maintain

Signature files
• Word-oriented index structures based on hashing
• Were popular in the 1980s


Notations and Background

• Notations
    • n: the size of the text database
    • m: pattern length
    • M: amount of main memory available
• Background
    • sorted arrays
    • binary search tree
    • B-tree
    • hash table
    • trie


Trie
• TRIE or PREFIX TREE
• An in-memory multiway tree storing a set of strings
• Strings are stored in the leaves
• Every edge of the trie is labelled with a letter
• All the descendants of a node have a common prefix: the sequence of letters from the root to the node

Example text:

    This is a text. A text has many words. Words are made from letters.

Starting position of each word: This 1, is 6, a 9, text 11, A 17, text 19, has 24, many 28, words 33, Words 40, are 46, made 50, from 55, letters 60.

Strings to be stored and the corresponding starting positions:
    letters: 60
    made: 50
    many: 28
    text: 11, 19
    words: 33, 40

[Figure: trie over the stored strings; edges are labelled with letters, leaves hold the strings]
    root
        'l'  letters: 60
        'm'
            'a'
                'd'  made: 50
                'n'  many: 28
        't'  text: 11, 19
        'w'  words: 33, 40

 


Trie (cont.)

Construction
• The root of the trie uses the first character
• The children of the root use the second character, and so on
• If the remaining subtrie contains only one string, that string's identity is stored in a leaf node

Access
• Start from the root
• Follow the path given by the character sequence of the pattern
• Stop when a leaf is found or when no edge matches the current character
• The access cost is O(m)
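The construction and access rules can be sketched in a few lines of Python. This is a simplified illustration, not the exact structure of the slides: it creates one node per character and does not collapse a single remaining string into a leaf. The names `TrieNode`, `insert` and `lookup` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # letter -> TrieNode (one labelled edge per letter)
        self.positions = None   # starting positions if a stored string ends here


def insert(root, word, position):
    """Follow or create one edge per character, then record the position."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    if node.positions is None:
        node.positions = []
    node.positions.append(position)


def lookup(root, pattern):
    """Walk at most len(pattern) edges, so the cost is O(m)."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:        # no edge matches the current character
            return None
    return node.positions       # None if pattern is only a prefix


root = TrieNode()
for word, pos in [("letters", 60), ("made", 50), ("many", 28),
                  ("text", 11), ("text", 19), ("words", 33), ("words", 40)]:
    insert(root, word, pos)

print(lookup(root, "text"))   # [11, 19]
print(lookup(root, "ma"))     # None
```

The lookup never depends on how many strings are stored, only on the pattern length m.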


Unstructured data in 1680
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
    • Slow (for large corpora)
    • NOT Calpurnia is non-trivial
    • Other operations (e.g., find the word Romans near countrymen) are not feasible

Sec. 1.1


Term-document incidence

             Antony and   Julius   The
             Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony       1            1        0         0        0         1
Brutus       1            1        0         1        0         0
Caesar       1            1        0         1        1         1
Calpurnia    0            1        0         0        0         0
Cleopatra    1            0        0         0        0         0
mercy        1            0        1         1        1         1
worser       1            0        1         1        1         0

1 if play contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia


Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query:
    • take the vectors for
        • Brutus: 110100
        • Caesar: 110111
        • Calpurnia (complemented): 101111
    • bitwise AND
        • 110100 AND 110111 AND 101111 = 100100
• Select the documents corresponding to 1
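The same computation can be reproduced with Python integers used as bit vectors. This is a minimal sketch: the vectors are those of the incidence table, while the names `PLAYS`, `VECTORS` and `MASK` are only illustrative.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play (same order as the table).
VECTORS = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(PLAYS)) - 1          # 0b111111, keeps the complement to 6 bits


def brutus_and_caesar_not_calpurnia():
    not_calpurnia = ~VECTORS["Calpurnia"] & MASK      # 101111
    result = VECTORS["Brutus"] & VECTORS["Caesar"] & not_calpurnia
    # Read the bits left to right and keep the plays whose bit is 1.
    bits = format(result, f"0{len(PLAYS)}b")
    return bits, [play for play, bit in zip(PLAYS, bits) if bit == "1"]


bits, answer = brutus_and_caesar_not_calpurnia()
print(bits, answer)   # 100100 ['Antony and Cleopatra', 'Hamlet']
```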


Answers to query

• Antony and Cleopatra, Act III, Scene ii
    Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
    Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.


Bigger collection
• Consider N = 10^6 documents, each with about 1000 tokens ⇒ total of 10^9 tokens
• On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 · 10^9 bytes = 6 GB
• Assume there are M = 500,000 distinct terms in the collection (Note that we are making a term/token distinction.)


Can't build the incidence matrix
• M × N = 500,000 × 10^6 = 5 · 10^11, i.e. half a trillion 0s and 1s.
• But the matrix has no more than one billion 1s, because the collection has only about 10^9 tokens.
• Matrix is extremely sparse.
• What is a better representation?
• We only record the 1s.


Inverted index
• For each term t, we must store a list of all documents that contain t.
    • Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?

    Brutus       1  2  4  11  31  45  173  174
    Caesar       1  2  4  5  6  16  57  132
    Calpurnia    2  31  54  101

What happens if the word Caesar is added to document 14?

Sec. 1.2


Definition of Inverted index
• A word-oriented mechanism for indexing a text collection in order to speed up the searching task
• Two elements
    • Vocabulary
        • The set of all different words in the text
    • Occurrences (posting lists)
        • For each word, a list of the locations where the term occurs
            • Document-based: a list of documents with the corresponding term frequency
            • Word-based: a list of documents with the corresponding word positions


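Document-based inverted index

[Figure: document-based inverted index. The vocabulary lists the index terms (computer, database, science, system), each with its document frequency df. Every term points to a posting list whose entries have the form (Dj, tfj); the entries visible in the figure are (D1, 3), (D2, 4), (D5, 2) and (D7, 4).]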


Word-based inverted index
• The position of the term in the document, wp_i, facilitates proximity searching

[Figure: word-based inverted index. The vocabulary lists the index terms (computer, database, science, system), each with its document frequency df. Every term points to a posting list whose entries have the form (Dj, wp1, …, wpn); the entries visible in the figure are (D1: 1, 100, 634) and (D7: 50, 90, 150, 800).]
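The difference between the two kinds of posting lists can be seen by building both from the same documents. The sketch below assumes whitespace tokenization and a tiny made-up collection; `build_indexes` and the document contents are hypothetical.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs: dict doc_id -> text. Returns (document_based, word_based)."""
    word_based = defaultdict(dict)            # term -> {doc_id: [word positions]}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            word_based[term].setdefault(doc_id, []).append(position)
    # The document-based index keeps only the term frequency of each posting.
    document_based = {
        term: {doc_id: len(positions) for doc_id, positions in postings.items()}
        for term, postings in word_based.items()
    }
    return document_based, dict(word_based)


docs = {  # hypothetical toy collection
    "D1": "database systems and database science",
    "D2": "computer science",
}
document_based, word_based = build_indexes(docs)
print(document_based["database"])   # {'D1': 2}
print(word_based["database"])       # {'D1': [1, 4]}
print(len(word_based["science"]))   # 2, the df of "science"
```

The word-based index contains everything the document-based one does (tf is the length of the position list), at the price of more space.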


Space requirements
• The space required for the vocabulary is rather small
    • HEAPS' LAW: the vocabulary grows as O(n^b), where b ∈ [0,1] (usually b ∈ [0.4, 0.6])
• The occurrences demand much more space

Occurrence space (in relation to original collection size):

                        Small collection (1 Mb)      Medium collection (200 Mb)   Large collection (2 Gb)
                        No stop words   All text     No stop words   All text     No stop words   All text
Addressing words        45%             73%          36%             64%          35%             63%
Addressing documents    19%             26%          18%             32%          26%             47%
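To get a feeling for why the vocabulary stays small, Heaps' law can be evaluated numerically. In this sketch the law is written as V = k · n^b; the constant k = 10 and the exponent b = 0.5 are arbitrary illustrative values, not measurements from the slides.

```python
def heaps_vocabulary_size(n, k=10.0, b=0.5):
    """Estimated number of distinct words for a text of n words: V = k * n**b."""
    return k * n ** b


for n in (10**6, 10**8):
    print(n, round(heaps_vocabulary_size(n)))
# 1000000 10000
# 100000000 100000
```

With b = 0.5, a text that is 100 times larger has a vocabulary only 10 times larger.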


Inverted index construction

Documents to be indexed:    Friends, Romans, countrymen.
        | Tokenizer
Tokens:                     Friends  Romans  Countrymen
        | Linguistic modules
Modified tokens:            friend  roman  countryman
        | Indexer
Inverted index:
    friend        2   4
    roman         1   2
    countryman    13  16


Construction based on a trie
• All the vocabulary known up to now is kept in a trie structure

Steps
1. Read each word of the text
2. Search for the word in the trie
3. If the word is not found in the trie, it is added to the trie with an empty list of occurrences
4. Once the word is in the trie, the new position is added to the end of its list of occurrences

• Once the text is exhausted, the trie is written to disk together with the lists of occurrences

Complexity
• O(1) operations per text character (step 2)
• O(1) insertion of the position in the list of occurrences (steps 3 & 4)
• O(n) for the overall process
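A minimal sketch of this single-pass construction follows. It assumes that words are maximal runs of ASCII letters, that case is folded, and that positions are 1-based character offsets; the trie is represented with nested dictionaries and a hypothetical `"$"` key that holds the list of occurrences. Writing the result to disk is left out.

```python
import re


def build_index(text):
    """Single pass over the text; the vocabulary lives in a trie of nested dicts."""
    trie = {}
    for match in re.finditer(r"[A-Za-z]+", text):          # step 1: read each word
        node = trie
        for ch in match.group().lower():                    # step 2: O(1) work per character
            node = node.setdefault(ch, {})
        occurrences = node.setdefault("$", [])              # step 3: new word gets an empty list
        occurrences.append(match.start() + 1)               # step 4: append the 1-based position
    return trie


def occurrences_of(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])


text = "This is a text. A text has many words. Words are made from letters."
index = build_index(text)
print(occurrences_of(index, "text"))    # [11, 19]
print(occurrences_of(index, "words"))   # [33, 40]
```

Each character of the text is touched a constant number of times, which gives the O(n) overall cost.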


Splitting the index into two files
• Splitting the index into two files allows the vocabulary to be kept in memory at search time in many cases
• Posting file
    • The lists of occurrences are stored contiguously
• Vocabulary file
    • Stores the vocabulary and, for each word, the number of documents associated with it and a pointer to its list in the posting file


An example

Vocabulary file (implemented through a sorted array):

    Term        Start   n
    boundary    0       1
    case        1       2
    computer    3       2
    database    5       1
    deliver     6       1
    document    7       3
    fan         10      3
    play        13      1
    position    14      1
    science     15      2
    system      17      1

Posting file (each entry: document identifier followed by frequency, e.g. 0023 = document 002, frequency 3):

    Position   Entry
    0          0023
    1          0012
    2          0041
    3          0011
    4          0032
    5          0021
    6          0031
    7          0011
    8          0021
    9          0031
    10         0011
    11         0021
    12         0042
    13         0022
    14         0011
    15         0032
    16         0041
    17         0031

The posting list of "document" starts at position 7 and contains 3 entries.
"document" is contained in documents 001, 002, 003 with frequency …
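The role of the (Start, n) pair can be sketched as a slice of the posting file. The data below copy the example; reading each four-digit entry as a three-digit document identifier followed by a one-digit frequency is an interpretation of the figure, and the names `VOCABULARY`, `POSTINGS` and `posting_list` are illustrative.

```python
# Vocabulary file: term -> (start, n).
VOCABULARY = {
    "boundary": (0, 1), "case": (1, 2), "computer": (3, 2), "database": (5, 1),
    "deliver": (6, 1), "document": (7, 3), "fan": (10, 3), "play": (13, 1),
    "position": (14, 1), "science": (15, 2), "system": (17, 1),
}

# Posting file: one (document, frequency) entry per slot, stored contiguously.
POSTINGS = [
    ("002", 3), ("001", 2), ("004", 1), ("001", 1), ("003", 2),
    ("002", 1), ("003", 1), ("001", 1), ("002", 1), ("003", 1),
    ("001", 1), ("002", 1), ("004", 2), ("002", 2), ("001", 1),
    ("003", 2), ("004", 1), ("003", 1),
]


def posting_list(term):
    """Follow the pointer stored in the vocabulary: n entries starting at start."""
    start, n = VOCABULARY.get(term, (0, 0))
    return POSTINGS[start:start + n]


print(posting_list("document"))   # [('001', 1), ('002', 1), ('003', 1)]
```

Only the small vocabulary has to stay in memory; one contiguous read of the posting file returns the whole list.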


Structures for vocabulary file
To accelerate vocabulary searches:
• Sorted array
    • The vocabulary is stored in lexicographical order
    • Searched using a standard binary search
    • Complexity O(log_2 |vocabulary|)
    • Disadvantage: updating is expensive
• B+-tree
    • The vocabulary is stored in a B+-tree
    • Disadvantage: a B+-tree uses more space than a sorted array
• Trie
• Hash
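For the sorted-array option, both the logarithmic search and the expensive update can be illustrated with Python's standard `bisect` module. The helper names `find` and `add` are hypothetical; the vocabulary is the one of the earlier example.

```python
import bisect

vocabulary = ["boundary", "case", "computer", "database", "deliver",
              "document", "fan", "play", "position", "science", "system"]


def find(term):
    """Binary search: about log2(len(vocabulary)) comparisons."""
    i = bisect.bisect_left(vocabulary, term)
    return i if i < len(vocabulary) and vocabulary[i] == term else -1


def add(term):
    """Keeps the array sorted; the insertion shifts all later entries."""
    if find(term) == -1:
        bisect.insort(vocabulary, term)


print(find("document"))   # 5
add("index")
print(find("index"))      # 7
```

The search is cheap, but every insertion moves, on average, half of the array, which is the update cost mentioned above.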


Construction using PARTIAL INDICES
• For large texts where the index does not fit in main memory

Construction step
1. The algorithm already described is used until the main memory is exhausted.
2. When no more memory is available, the partial index obtained up to now is written to disk.
3. The partial index is erased from main memory.
4. The algorithm continues with the rest of the text.
5. The partial indices on disk are merged in a hierarchical fashion.
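Steps 1 to 4 can be sketched as follows. This is a simulation under simplifying assumptions: "memory exhausted" is modelled by a hypothetical `memory_limit` on the number of occurrences held at once, a plain dictionary replaces the trie, and a Python list stands in for the files on disk. Step 5 (merging) is not covered here.

```python
def build_partial_indices(words, memory_limit):
    """words: iterable of (position, word). Returns the list of dumped partial indices."""
    dumps = []                      # stands in for the files written to disk
    index, used = {}, 0
    for position, word in words:
        index.setdefault(word, []).append(position)    # step 1: in-memory construction
        used += 1
        if used == memory_limit:                        # step 2: memory exhausted, dump
            dumps.append(dict(sorted(index.items())))
            index, used = {}, 0                         # step 3: erase from memory
        # step 4: the loop simply continues with the rest of the text
    if index:                                           # dump whatever is left at the end
        dumps.append(dict(sorted(index.items())))
    return dumps


words = list(enumerate("a b a c b a d a".split(), start=1))
for partial in build_partial_indices(words, memory_limit=3):
    print(partial)
# {'a': [1, 3], 'b': [2]}
# {'a': [6], 'b': [5], 'c': [4]}
# {'a': [8], 'd': [7]}
```

Each dump has a sorted vocabulary, and all positions in an earlier dump precede those of a later one.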


Merging partial indices
Merging step
• Given two partial indexes I1 and I2
1. Merge the sorted vocabularies
    • Complexity O(|I1| + |I2|)
2. Whenever the same word appears in both indices, merge both lists of occurrences
    • By construction, the occurrences of the smaller-numbered index are before those of the larger-numbered index
    • We can perform list concatenation
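The merging step is essentially the merge phase of merge sort applied to the two vocabularies. The sketch below assumes each partial index is a list of (word, occurrences) pairs sorted by word; the function name and the toy data are hypothetical.

```python
def merge_partial_indices(i1, i2):
    """i1, i2: lists of (word, occurrences) sorted by word; i1 covers the earlier text."""
    merged, a, b = [], 0, 0
    while a < len(i1) and b < len(i2):          # one pass: O(|I1| + |I2|)
        (w1, occ1), (w2, occ2) = i1[a], i2[b]
        if w1 == w2:
            merged.append((w1, occ1 + occ2))    # same word: concatenate, I1 first
            a, b = a + 1, b + 1
        elif w1 < w2:
            merged.append((w1, occ1))
            a += 1
        else:
            merged.append((w2, occ2))
            b += 1
    return merged + i1[a:] + i2[b:]             # at most one of the tails is non-empty


# Hypothetical toy data: each occurrence is (document, word positions).
i1 = [("computer", [("D1", [20, 50, 90])]),
      ("database", [("D1", [30])]),
      ("system",   [("D1", [100])])]
i2 = [("cad",      [("D2", [20])]),
      ("database", [("D2", [10])]),
      ("system",   [("D1", [120]), ("D2", [50])])]
for word, occurrences in merge_partial_indices(i1, i2):
    print(word, occurrences)
```

No sorting of occurrences is needed: concatenation preserves the order because I1 was built from an earlier part of the text than I2.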


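Merging partial indices

[Figure: merging two partial indices; the vocabularies are compared in lexicographical order (≤) and the lists of shared words are concatenated]

I1:
    computer    df 1    D1: 20, 50, 90
    database    df 1    D1: 30
    system      df 1    D1: 100

I2:
    CAD         df 1    D2: 20
    database    df 1    D2: 10
    system      df 2    D1: 120; D2: 50

I1..2:
    CAD         df 1    D2: 20
    computer    df 1    D1: 20, 50, 90
    database    df 2    D1: 30; D2: 10
    system      df 2    D1: 100; D1: 120; D2: 50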


Merging partial indices

Binary fashion
• More than two indices can be merged
• Complexity:
    • n/M: number of partial indexes
    • O(n) merging cost at each level
    • O(n log(n/M)) overall cost
• To reduce build-time space requirements
    • It is possible to perform the merging in-place

[Figure: binary merging of eight partial indices; the number in parentheses is the order in which the merge is performed]
    Level 1 (initial dumps):   I-1   I-2   I-3   I-4   I-5   I-6   I-7   I-8
    Level 2:                   I-1..2 (1)   I-3..4 (2)   I-5..6 (4)   I-7..8 (5)
    Level 3:                   I-1..4 (3)   I-5..8 (6)
    Level 4 (final index):     I-1..8 (7)
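The overall cost follows from counting the levels of the merge tree. A short derivation, assuming the number of partial indices n/M is halved at each level and that each level reads and writes every occurrence once:

```latex
\[
\text{merge levels} = \left\lceil \log_2 \frac{n}{M} \right\rceil,
\qquad
\text{total cost} = \underbrace{O(n)}_{\text{per level}} \cdot \left\lceil \log_2 \frac{n}{M} \right\rceil
= O\!\left(n \log \frac{n}{M}\right)
\]
```

For the eight initial dumps of the figure this gives log_2 8 = 3 merge levels and 4 + 2 + 1 = 7 merges, which matches the merge numbers 1 to 7.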


Popular indexing systems
• Indexing and search technology
• Enterprise search platform
• Web crawler
• Information platform for enterprise
• Open semantic search


Additional material
• Inside a Google data center: https://www.youtube.com/watch?v=PBx7rgqeGG8
• Inverted index compression
    • Stanford IR book chapter: http://nlp.stanford.edu/IR-book/html/htmledition/index-compression-1.html
    • Matteo Catena, Craig Macdonald, Iadh Ounis, "On Inverted Index Compression for Search Engine Efficiency", ECIR 2014 (available online)
• Inverted index distribution
    • Apache SolrCloud: https://cwiki.apache.org/confluence/display/solr/How+SolrCloud+Works


Three General steps

1. Vocabulary search
    • The patterns and words present in the query are isolated and searched in the vocabulary
    • Phrases and proximity queries are split into single words
    • The cost depends on the structure used
        • Trie or hash: O(m), thus independent of the text size
        • B-tree or sorted array: O(log n)
2. Retrieval of occurrences
    • The lists of the occurrences of all the words found are retrieved
3. Manipulation of occurrences
    • The occurrences are processed to solve basic queries, phrases, proximity, or Boolean operations


Single-word Retrieval
• Single-word queries
    • Can be searched using any suitable data structure to speed up the search in the vocabulary file
• Prefix and range queries
    • Can be solved with binary search, tries, or B-trees, but not with hashing.
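Prefix and range queries work on a sorted vocabulary because all matching words are adjacent, a property that hashing destroys. A minimal sketch with the standard `bisect` module follows; it assumes vocabulary words contain no character at or above U+FFFF, which is used as an upper sentinel, and the function names are hypothetical.

```python
import bisect

vocabulary = ["boundary", "case", "computer", "database", "deliver",
              "document", "fan", "play", "position", "science", "system"]


def prefix_query(prefix):
    """All vocabulary words starting with prefix, found with two binary searches."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")   # just past the last match
    return vocabulary[lo:hi]


def range_query(first, last):
    """All vocabulary words w with first <= w <= last."""
    return vocabulary[bisect.bisect_left(vocabulary, first):
                      bisect.bisect_right(vocabulary, last)]


print(prefix_query("d"))              # ['database', 'deliver', 'document']
print(range_query("fan", "science"))  # ['fan', 'play', 'position', 'science']
```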


Boolean Retrieval
• Given a Boolean query
    • Parse the query into its syntax tree
        • Leaves: basic queries
        • Internal nodes: operators
    • Solve the leaves of the query syntax tree using the appropriate algorithm
    • Work out the relevant documents by composing the results with the operators
        • OR: recursively retrieve e1 and e2 and take the union of the results
        • AND: recursively retrieve e1 and e2 and take the intersection of the results
        • BUT: recursively retrieve e1 and e2 and take the set difference of the results
• Optimization
    • Algebraic optimization
        • E.g. a OR (a AND b) = a
    • Keep the sets sorted in order to proceed sequentially on both lists when intersection, union, etc. are required

Q: database OR science AND case ?

[Figure: syntax tree of Q]
    OR
        database
        AND
            science
            case
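The recursive evaluation of the syntax tree can be sketched as follows. The tree is written by hand as nested tuples (no parser is shown), Python sets replace sorted lists for brevity, and the docID sets are those of the example vocabulary and posting files used in these slides; `evaluate` is a hypothetical name.

```python
POSTINGS = {  # docID sets of the three query terms
    "database": {"002"},
    "science": {"003", "004"},
    "case": {"001", "004"},
}


def evaluate(node):
    """node is either a term (leaf) or a tuple (operator, left, right)."""
    if isinstance(node, str):
        return POSTINGS.get(node, set())
    op, left, right = node
    e1, e2 = evaluate(left), evaluate(right)
    if op == "OR":
        return e1 | e2          # union
    if op == "AND":
        return e1 & e2          # intersection
    if op == "BUT":
        return e1 - e2          # set difference
    raise ValueError(f"unknown operator: {op}")


# Q: database OR science AND case, with AND binding tighter than OR
query = ("OR", "database", ("AND", "science", "case"))
print(sorted(evaluate(query)))   # ['002', '004']
```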


Intersecting two postings lists (a "merge" algorithm)
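A minimal sketch of such a merge-style intersection, assuming both postings lists are sorted by docID. The two lists are the Brutus and Calpurnia postings of the inverted index example; the function name `intersect` is illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])      # docID present in both lists
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1                    # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]
```

Each step advances at least one of the two pointers, so the lists are scanned sequentially and only once.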


Boolean Retrieval

Using the vocabulary file and the posting file of the earlier example ("An example"):

Q: database OR science AND case ?

       database OR science AND case
    ⇒  database OR ( {003,004} ∩ {001,004} )
    ⇒  {002} ∪ {004}
    ⇒  {002,004}


Phrasal Retrieval
• Better with a word-based inverted file
    1. Retrieve documents and positions for each individual word
    2. Intersect documents
    3. Check for ordered contiguity of keyword positions
• Best to start the contiguity check with the least common word in the phrase
• Simple and effective optimization: process in order of increasing frequency


Searching: Phrasal Retrieval
1. Find the set of documents D in which all keywords (k1…km) of the phrase occur (using AND query processing)
2. Initialize an empty set, R, of retrieved documents
3. For each document, d, in D:
    1. Get the array, Pi, of positions of occurrences for each ki in d
    2. Find the shortest array Ps of the Pi's
    3. For each position p of keyword ks in Ps:
        For each keyword ki except ks:
            1. Use binary search to find the position (p + i - s) in the array Pi
    4. If the correct position is found for every keyword, add d to R
4. Return R


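Searching: Phrasal Retrieval

Example text (the numbers 1 to 7 mark the positions of the indexed words):

    This is a text. A text has many words. Words are made from letters.
    text: 1, 2    many: 3    words: 4, 5    made: 6    letters: 7

Inverted index (term, df, word positions):
    letters    1    7
    made       1    6
    many       1    3
    text       1    1, 2
    words      1    4, 5

Q: "many words" ?    (k1 = many, k2 = words)
• P1 (many) = 3
• P2 (words) = 4, 5
• |P1| ≤ |P2|, so Ps = P1
• Step 3.3: take p = 3
• Step 3.3.1: use binary search to find the position (p + i - s) = (3 + 2 - 1) = 4 in the array P2
• Step 4: add the document to R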
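The phrasal retrieval algorithm can be sketched in Python as follows. It assumes a word-based index of the form term -> {document: sorted word positions}; the document identifier "D1" and the function names are hypothetical, and keyword indices are 0-based here, which leaves the offset p + i - s unchanged.

```python
import bisect


def contains(sorted_positions, target):
    """Binary search for target in a sorted list of positions."""
    i = bisect.bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


def phrase_search(index, keywords):
    """index: term -> {doc_id: sorted word positions}. Returns docs containing the phrase."""
    postings = [index.get(k, {}) for k in keywords]
    # Step 1: documents in which all keywords occur (AND processing).
    candidates = set(postings[0]).intersection(*postings[1:])
    result = set()                                            # step 2
    for d in candidates:                                      # step 3
        arrays = [p[d] for p in postings]                     # 3.1: P_i for each k_i
        s = min(range(len(arrays)), key=lambda i: len(arrays[i]))   # 3.2: shortest array
        for p in arrays[s]:                                   # 3.3
            if all(contains(arrays[i], p + i - s)             # 3.3.1
                   for i in range(len(arrays)) if i != s):
                result.add(d)                                 # 3.4
                break
    return result                                             # step 4


index = {
    "text": {"D1": [1, 2]}, "many": {"D1": [3]}, "words": {"D1": [4, 5]},
    "made": {"D1": [6]}, "letters": {"D1": [7]},
}
print(phrase_search(index, ["many", "words"]))   # {'D1'}
print(phrase_search(index, ["words", "many"]))   # set()
```

Starting from the shortest array keeps the number of binary searches proportional to the occurrences of the rarest keyword.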


Searching: Proximity Retrieval
• Use an approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints
• During the binary search for positions of the remaining keywords, find the closest position of ki to p and check that it is within the maximum allowed distance
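One possible reading of this check, for a single document, is sketched below. It assumes an unordered constraint: every other keyword must occur within `max_distance` word positions of some occurrence p of the rarest keyword. The function names and this interpretation of the constraint are assumptions, not taken from the slides.

```python
import bisect


def closest(sorted_positions, p):
    """Position in sorted_positions nearest to p, found by binary search."""
    i = bisect.bisect_left(sorted_positions, p)
    neighbours = sorted_positions[max(i - 1, 0):i + 1]   # candidates on both sides of p
    return min(neighbours, key=lambda q: abs(q - p))


def within_distance(position_arrays, max_distance):
    """position_arrays: one sorted, non-empty position list per keyword (same document)."""
    s = min(range(len(position_arrays)), key=lambda i: len(position_arrays[i]))
    for p in position_arrays[s]:
        if all(abs(closest(arr, p) - p) <= max_distance
               for i, arr in enumerate(position_arrays) if i != s):
            return True
    return False


text, letters = [1, 2], [7]
print(within_distance([text, letters], max_distance=5))   # True: |7 - 2| = 5
print(within_distance([text, letters], max_distance=4))   # False
```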
