On the Usefulness of Backspace
Shmuel Tomi Klein, Bar Ilan University
Dana Shapira, Ashkelon Academic College
Extension of study of NEGATION in large IR systems
United -Nations
Edgar (-1:2) Po
Backspace
Not really a character, but can be useful
Three applications
Handling large numbers
Text compression in IR
Blockwise Huffman decoding
Handling large numbers
1
Syntax:
A (1:3) -B (1:5) C -D (1:1) E
In use at the Responsa Project
Handling large numbers
1
Too many large numbers
Break in blocks of k digits
1234567 → 1234 567
Problem with precision:
5678 also retrieves 123456789
Handling large numbers
1
Each word includes a trailing blank
House of Lords
I declared an income of 1000000 on my last 10 1040 forms
Long numbers use Backspace BS
1234567890 → 1234 BS 5678 BS 90
I declared an income of 1000 BS 000 on my last 10 1040 forms
1 2        3  4      5  6    7  8   9  10 11   12 13   14
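The block splitting above can be sketched in a few lines of Python. This is a minimal illustration, not the Responsa Project implementation: the function names `split_number` and `tokenize`, the token spelling `"BS"`, and the default block size `k = 4` (taken from the slide examples) are all assumptions.

```python
BS = "BS"  # assumed spelling of the backspace pseudo-word


def split_number(digits: str, k: int = 4) -> list[str]:
    """Break a digit string into blocks of k digits joined by BS tokens."""
    tokens: list[str] = []
    for start in range(0, len(digits), k):
        if start > 0:
            tokens.append(BS)  # glue this block to the previous one
        tokens.append(digits[start:start + k])
    return tokens


def tokenize(text: str, k: int = 4) -> list[str]:
    """Split on blanks; numbers longer than k digits become BS-joined blocks."""
    tokens: list[str] = []
    for word in text.split():
        if word.isdigit() and len(word) > k:
            tokens.extend(split_number(word, k))
        else:
            tokens.append(word)
    return tokens
```

Under these assumptions, `split_number("1234567890")` yields the five tokens `1234 BS 5678 BS 90`, and tokenizing the example sentence yields 14 tokens, matching the word positions shown on the slide.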
Handling large numbers
1
To search for      submit query
234                -BS 234
2000 1040          -BS 2000 1040 -BS
12345678           -BS 1234 BS 5678 -BS
1234567            -BS 1234 BS 567
[email protected]  user @ BS addr . BS com
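The single-number rows of this table suggest a simple rule, sketched below in Python. The rule is inferred from the examples only: a leading `-BS` excludes matches inside a longer number, and a trailing `-BS` is added only when the last block is full (a shorter block can only end a number). The function name `number_query` and the default `k = 4` are assumptions, and the sketch does not cover multi-word queries such as `2000 1040`.

```python
def number_query(digits: str, k: int = 4) -> str:
    """Build a query for one number, following the pattern of the examples."""
    if not digits.isdigit():
        raise ValueError("expected a non-empty string of digits")
    blocks = [digits[i:i + k] for i in range(0, len(digits), k)]
    parts = ["-BS", " BS ".join(blocks)]  # must not continue a longer number
    if len(blocks[-1]) == k:
        # A full last block could be followed by more digits, so exclude that.
        parts.append("-BS")
    return " ".join(parts)
```

For instance, `number_query("1234567")` gives `-BS 1234 BS 567`, while `number_query("12345678")` gives `-BS 1234 BS 5678 -BS`.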
Text Compression in IR
2
Huffword: alternating words and non-words
Use single Huffman tree for:
— words including a trailing blank
— punctuation signs: BS ;
— Backspace, to handle exceptions
Text Compression in IR
2
File     Size    Huffword  BSHuff  gzip  bzip
English  3.1MB   3.91      3.97    3.28  4.41
French   7.1MB   3.98      4.03    3.27  4.63
Given alphabet $a_1, \ldots, a_n$ with probabilities $p_1, \ldots, p_n$,
find lengths $l_1, \ldots, l_n$ such that the average length
$\sum_{i=1}^{n} p_i l_i$ is minimized
[Figure: Huffman tree with leaves A, B, C, D, E; edges labeled 0 and 1]
Symbol       A    B    C    D     E
Probability  0.4  0.3  0.1  0.1   0.1
Length       1    2    3    4     4
HUFFMAN      0    11   101  1000  1001
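As a quick check of this example, the following Python sketch computes the average codeword length, verifies that the code is prefix-free, and derives Huffman codeword lengths with a standard heap-based construction. The names `PROBS`, `CODE`, `average_length`, `is_prefix_free` and `huffman_lengths` are illustrative only. Because ties among C, D and E can be broken in several ways, the construction is only guaranteed to reproduce the same multiset of lengths as the slide, not the same assignment to symbols.

```python
import heapq
from itertools import count

PROBS = {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.1, "E": 0.1}
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def average_length(probs, code):
    """Sum of p_i * l_i over all symbols."""
    return sum(p * len(code[s]) for s, p in probs.items())


def is_prefix_free(code):
    """True if no codeword is a proper prefix of another."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def huffman_lengths(weights):
    """Codeword lengths from the classic merge-two-smallest construction."""
    tiebreak = count()  # keeps heap tuples comparable when weights tie
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:
            lengths[s] += 1  # every merge adds one bit to each member
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths
```

With the values above, `average_length(PROBS, CODE)` is 2.1 bits per symbol up to floating-point rounding.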
Blockwise Huffman decoding
3
Table  Entry  Pattern  Decoding
0      1      001      0 0 1      A A Rem
1      6      1110     11 10      B Rem
3      3      100011   1000 11    D B
[Figure: Huffman tree with leaves A, B, C, D, E and internal nodes numbered 0, 1, 2, 3]
Decoding k bits together
Partial decoding tables
[Figure: Huffman tree with leaves A, B, C, D, E and internal nodes numbered 0, 1, 2, 3]
Entry  Pattern for Table 0   Table 0    Table 1    Table 2    Table 3
                             W    l     W    l     W    l     W    l
0      000                   AAA  0     D    0     DA   0     DAA  0
1      001                   AA   1     E    0     D    1     DA   1
2      010                   A    2     CA   0     EA   0     D    2
3      011                   AB   0     C    1     E    1     DB   0
4      100                   -    3     BAA  0     CAA  0     EAA  0
5      101                   C    0     BA   1     CA   1     EA   1
6      110                   BA   0     B    2     C    2     E    2
7      111                   B    1     BB   0     CB   0     EB   0
Prefix:                      Λ          1          10         100
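The tables can be generated mechanically from the code: there is one table per proper prefix of a codeword (that is, per internal node of the Huffman tree), and each entry decodes the prefix followed by the k-bit pattern. The Python sketch below is one possible construction under the assumption that the code is complete; the names `proper_prefixes` and `build_tables` are illustrative and not taken from the slides. An empty string stands for the "-" entries.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def proper_prefixes(code):
    """Proper prefixes of the codewords (internal tree nodes), shortest first."""
    prefixes = {cw[:i] for cw in code.values() for i in range(len(cw))}
    return sorted(prefixes, key=lambda p: (len(p), p))


def build_tables(code, k):
    """Return (prefixes, tables); tables[j][entry] = (decoded string W, next table l)."""
    decode = {cw: s for s, cw in code.items()}
    prefixes = proper_prefixes(code)
    index = {p: j for j, p in enumerate(prefixes)}
    tables = []
    for prefix in prefixes:
        rows = []
        for entry in range(2 ** k):
            out, cur = "", ""
            # Decode the table's prefix followed by the k-bit pattern.
            for bit in prefix + format(entry, f"0{k}b"):
                cur += bit
                if cur in decode:
                    out += decode[cur]
                    cur = ""
            # The undecoded remainder selects the table for the next block.
            rows.append((out, index[cur]))
        tables.append(rows)
    return prefixes, tables
```

For k = 3 this yields the four prefixes Λ, 1, 10, 100 and, for example, the entry `("B", 2)` for pattern 110 in Table 1.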
Decoding Algorithm
j ← 0
f ← 1
while f ≤ EOI
    (output, j) ← T(j, M[f ; f + k - 1])
    f ← f + k

A 0   B 11   C 101   D 1000   E 1001

input   100  101  110  000  101
j       0    3    1    2    0
output  -    EA   B    DA   C
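The loop can be tried out with the Python sketch below, which rebuilds the partial decoding tables from the code and then consumes the input k bits at a time. It assumes a complete prefix code and an input whose length is a multiple of k, as in the slide example; the names `build_tables` and `decode_blockwise` are illustrative. An empty string in the result corresponds to a "-" output.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def build_tables(code, k):
    """tables[j][entry] = (decoded string, next table), one table per proper prefix."""
    decode = {cw: s for s, cw in code.items()}
    prefixes = sorted({cw[:i] for cw in code.values() for i in range(len(cw))},
                      key=lambda p: (len(p), p))
    index = {p: j for j, p in enumerate(prefixes)}
    tables = []
    for prefix in prefixes:
        rows = []
        for entry in range(2 ** k):
            out, cur = "", ""
            for bit in prefix + format(entry, f"0{k}b"):
                cur += bit
                if cur in decode:
                    out += decode[cur]
                    cur = ""
            rows.append((out, index[cur]))
        tables.append(rows)
    return tables


def decode_blockwise(bits, code, k):
    """Return the output produced by each table lookup, in order."""
    if len(bits) % k:
        raise ValueError("this sketch assumes the input length is a multiple of k")
    tables = build_tables(code, k)
    j, steps = 0, []
    for f in range(0, len(bits), k):
        # One lookup per block: the block value indexes the current table.
        out, j = tables[j][int(bits[f:f + k], 2)]
        steps.append(out)
    return steps
```

On the input `100101110000101` with k = 3, the lookups produce the outputs -, EA, B, DA, C, as in the trace above.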
Looking for new tradeoffs
Reduced partial decoding tables including backspaces
[Figure: Huffman tree with leaves A, B, C, D, E and internal nodes numbered 0, 1, 2, 3; only tables 0 and 3 are kept]
Entry  Pattern for Table 0   Table 0       Table 3
                             W    l  b     W    l  b
0      000                   AAA  0  0     DAA  0  0
1      001                   AA   0  1     DA   0  1
2      010                   A    0  2     D    0  2
3      011                   AB   0  0     DB   0  0
4      100                   -    3  0     EAA  0  0
5      101                   C    0  0     EA   0  1
6      110                   BA   0  0     E    0  2
7      111                   B    0  1     EB   0  0
Revised Decoding Algorithm
j ← 0
f ← 1
back ← 0
while f ≤ EOI
    (output, j, back) ← T(j, M[f ; f + k - 1])
    f ← f + k - back

input   1 0 0 1 0 1 1 1 0 0 0 0 1 0 1
output  -    EA   B    -    DA   C
back    0    1    1    0    1    0
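A Python sketch of the reduced tables and the revised loop follows. It encodes one reading of the slides: when the undecoded remainder has no table of its own, the decoder backs up over those bits and restarts from Table 0. Tables are keyed by their prefix string rather than by number, so Table 3 is `"100"`. The names `build_reduced_tables`, `decode_reduced` and `KEPT`, the fallback to Table 0, and the bit-by-bit handling of a final short block are all assumptions of this sketch.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}
KEPT = ("", "100")  # prefixes of the tables that are kept (Tables 0 and 3)


def build_reduced_tables(code, kept, k):
    """tables[prefix][entry] = (decoded string, next prefix, bits to back up)."""
    if "" not in kept:
        raise ValueError("Table 0 (the empty prefix) must be kept")
    decode = {cw: s for s, cw in code.items()}
    tables = {}
    for prefix in kept:
        rows = []
        for entry in range(2 ** k):
            out, cur = "", ""
            for bit in prefix + format(entry, f"0{k}b"):
                cur += bit
                if cur in decode:
                    out += decode[cur]
                    cur = ""
            if cur in kept:
                rows.append((out, cur, 0))         # continue in a kept table
            elif len(cur) < k:
                rows.append((out, "", len(cur)))   # backspace over the remainder
            else:
                raise ValueError(f"table for prefix {cur!r} must be kept")
        tables[prefix] = rows
    return tables


def decode_reduced(bits, code, kept, k):
    """Return the output produced by each table lookup, in order."""
    decode = {cw: s for s, cw in code.items()}
    tables = build_reduced_tables(code, kept, k)
    j, f, steps = "", 0, []
    while f + k <= len(bits):
        out, j, back = tables[j][int(bits[f:f + k], 2)]
        steps.append(out)
        f += k - back  # re-read the bits that were backed over
    if f < len(bits):
        # Fewer than k bits remain: finish bit by bit from the current prefix.
        out, cur = "", j
        for bit in bits[f:]:
            cur += bit
            if cur in decode:
                out += decode[cur]
                cur = ""
        steps.append(out)
    return steps
```

With `KEPT` as above and k = 3, the input `100101110000101` produces the outputs -, EA, B, -, DA, C. Only two tables are stored instead of four, at the cost of one extra lookup.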
Reduced tables
A 0   B 11   C 101   D 1000   E 1001
1 0 0 1 0 1 1 1 0 0 0 0 1 0 1
E A B D A C       Regular Huffman
- EA B DA C       Partial decoding tables
- EA B - DA C     Reduced tables with backspace
Experimental results
WSJ      Bit    partial decode tables   reduced tables
k        1      8                       8
bpa      1      8                       6.4
MB/sec   6.6    -                       7.6
RAM      2.1    197                     34.1
Experimental results
KJV      Bit    partial decode tables   reduced tables
k        1      8                       8
bpa      1      8                       6.4
MB/sec   10.1   0.4                     13.7
RAM      0.21   17                      8.7
Conclusion
3 examples of IR applications
Use of conceptual elements, like backspaces, may improve algorithms.
Thank you!
Questions?