On the Usefulness of Backspace

Shmuel Tomi Klein (Bar Ilan University) and Dana Shapira (Ashkelon Academic College)

Post on 29-Dec-2015

TRANSCRIPT

Page 1:

On the Usefulness of Backspace

Shmuel Tomi Klein, Bar Ilan University

Dana Shapira, Ashkelon Academic College

Page 2:

Extension of study of NEGATION in large IR systems

United -Nations

Edgar (-1:2) Po

Backspace

Not really a character, but can be useful

Page 3:

Three applications

Handling large numbers

Text compression in IR

Blockwise Huffman decoding

Page 4:

1. Handling large numbers

Syntax:

A (1:3) -B (1:5) C -D (1:1) E

In use at the Responsa Project

Page 5:

1. Handling large numbers

Too many large numbers

Break into blocks of k digits

1234567 → 1234 567

Problem with precision:

5678 also retrieves 123456789
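The precision problem can be reproduced in a few lines. The Python sketch below is only an illustration: the function names are hypothetical, and the block size k = 4 is an assumption read off the examples on this page.

```python
def split_blocks(number, k=4):
    """Split a digit string into consecutive blocks of at most k digits."""
    return [number[i:i + k] for i in range(0, len(number), k)]


def naive_index(text, k=4):
    """Index each block of a number as if it were an independent word."""
    tokens = []
    for word in text.split():
        if word.isdigit():
            tokens.extend(split_blocks(word, k))
        else:
            tokens.append(word)
    return tokens


blocks = split_blocks("1234567")
# The document contains 123456789, yet a query for 5678 finds a matching token.
false_match = "5678" in naive_index("123456789")
print(blocks, false_match)
```

Because the blocks are stored as ordinary words, nothing distinguishes a block in the middle of a long number from a standalone number, which is the gap the backspace token is meant to close.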

Page 6:

1. Handling large numbers

Each word includes a trailing blank

House of Lords

I declared an income of 1000000 on my last 10 1040 forms

Long numbers use Backspace BS

1234567890 → 1234 BS 5678 BS 90

I declared an income of 1000 BS 000 on my last 10 1040 forms

Word positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
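A minimal Python sketch of this tokenization follows. It assumes k = 4 (as in the examples), treats only all-digit words longer than k as long numbers, and uses a hypothetical `tokenize` helper; the real system may differ in all of these respects.

```python
BS = "BS"  # conceptual backspace token, named as on the slides


def tokenize(text, k=4):
    """Split text into words; break long numbers into k-digit blocks joined by BS."""
    tokens = []
    for word in text.split():
        if word.isdigit() and len(word) > k:
            blocks = [word[i:i + k] for i in range(0, len(word), k)]
            for i, block in enumerate(blocks):
                if i > 0:
                    tokens.append(BS)  # glue this block to the previous one
                tokens.append(block)
        else:
            tokens.append(word)
    return tokens


sentence = "I declared an income of 1000000 on my last 10 1040 forms"
tokens = tokenize(sentence)
for position, token in enumerate(tokens, start=1):
    print(position, token)
```

The BS token occupies a word position of its own, so the sentence yields 14 positions, matching the numbering on the slide.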

Page 7:

1. Handling large numbers

To search for → submit query

234 → -BS 234

2000 1040 → -BS 2000 1040 -BS

12345678 → -BS 1234 BS 5678 -BS

1234567 → -BS 1234 BS 567

[email protected] → user @ BS addr . BS com
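The single-number rows of this mapping follow a regular pattern, sketched below in Python. The function name `number_query` is hypothetical and k = 4 is assumed from the examples; the sketch does not cover multi-word phrases or e-mail addresses.

```python
def number_query(number, k=4):
    """Translate a number into a query with BS / -BS (negated BS) terms."""
    blocks = [number[i:i + k] for i in range(0, len(number), k)]
    terms = ["-BS"]  # the number must not continue a longer number
    for i, block in enumerate(blocks):
        if i > 0:
            terms.append("BS")
        terms.append(block)
    # A full final block could be followed by further blocks, so forbid a BS.
    # A short final block can never be followed by BS, so no guard is needed.
    if len(blocks[-1]) == k:
        terms.append("-BS")
    return " ".join(terms)


for n in ["234", "12345678", "1234567"]:
    print(n, "->", number_query(n))
```

The leading -BS excludes hits where the first block is the tail of a longer number; the trailing -BS excludes hits where the last block is followed by more digits.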

Page 8:

2. Text Compression in IR

Huffword: alternating words and non-words

Use a single Huffman tree for:

— words including a trailing blank

— punctuation signs: BS ;

— Backspace, to handle exceptions
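One way to read these three token classes is through the step that turns a decoded token stream back into text. The Python sketch below is an interpretation, not the authors' implementation: it assumes every token is emitted with a trailing blank, that a punctuation sign first removes the blank left by the previous token (the "BS ;" on the slide), and that an explicit BS token removes one trailing blank. The `render` name is hypothetical.

```python
import string

BS = "BS"                       # explicit backspace token
PUNCT = set(string.punctuation)  # single-character punctuation signs


def render(tokens):
    """Rebuild text from words, punctuation signs and backspace tokens."""
    text = ""
    for tok in tokens:
        if tok == BS or tok in PUNCT:
            # Both cases first delete the blank left by the previous token.
            if text.endswith(" "):
                text = text[:-1]
        if tok != BS:
            text += tok + " "   # every real token carries a trailing blank
    return text


print(repr(render(["House", "of", "Lords"])))
print(repr(render(["wait", ";", "then", "go"])))
print(repr(render(["user", "@", BS, "addr", ".", BS, "com"])))
```

Under these assumptions the common cases (word followed by blank, punctuation attached to the preceding word) cost no extra token, and BS is needed only for exceptions such as an address with no blanks.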

Page 9:

2. Text Compression in IR

File     Size    Huffword  BSHuff  gzip  bzip
English  3.1MB   3.91      3.97    3.28  4.41
French   7.1MB   3.98      4.03    3.27  4.63

Page 10:

3. Blockwise Huffman decoding

Given an alphabet $a_1, \ldots, a_n$ with probabilities $p_1, \ldots, p_n$, find lengths $l_1, \ldots, l_n$ such that the average length $\sum_{i=1}^{n} p_i l_i$ is minimized.

[Figure: Huffman tree with leaves A, B, C, D, E; edges labeled 0 and 1]

HUFFMAN

Symbol       A    B    C    D     E
Probability  0.4  0.3  0.1  0.1   0.1
Length       1    2    3    4     4
Codeword     0    11   101  1000  1001
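As a quick check of the example code, the Python sketch below computes the average length $\sum_i p_i l_i$ for the five codewords and compares it with the optimum obtained by repeatedly merging the two smallest weights, as Huffman's algorithm does. Exact fractions are used to avoid rounding; the helper names are illustrative only.

```python
import heapq
from fractions import Fraction

PROBS = {"A": Fraction(4, 10), "B": Fraction(3, 10), "C": Fraction(1, 10),
         "D": Fraction(1, 10), "E": Fraction(1, 10)}
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def average_length(probs, code):
    """Weighted average codeword length, sum of p_i * l_i."""
    return sum(probs[s] * len(code[s]) for s in probs)


def huffman_cost(probs):
    """Optimal average length: sum of all merged weights in Huffman's algorithm."""
    heap = list(probs.values())
    heapq.heapify(heap)
    cost = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        cost += merged
        heapq.heappush(heap, merged)
    return cost


def is_prefix_free(code):
    """In sorted order, a prefix is always immediately followed by an extension."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


print(average_length(PROBS, CODE), huffman_cost(PROBS), is_prefix_free(CODE))
```

Both quantities come out as 21/10, that is 2.1 bits per symbol, so the listed codeword lengths are optimal for these probabilities.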

Page 11:

Decoding k bits together: partial decoding tables

[Figure: Huffman tree with leaves A, B, C, D, E; edges labeled 0 and 1; internal nodes numbered 0 to 3]

Table  Entry  Pattern            Decoding
0      1      001 = 0 0 1        A A Rem
1      6      1110 = 11 10       B Rem
3      3      100011 = 1000 11   D B

Page 12:

Decoding k bits together: partial decoding tables

[Figure: Huffman tree with leaves A, B, C, D, E; internal nodes numbered 0 to 3]

W = decoded output, l = index of the next table. The pattern column shows the bits read, which is the full pattern for Table 0; for the other tables the prefix below is prepended.

Entry  Pattern  Table 0   Table 1   Table 2   Table 3
                W    l    W    l    W    l    W    l
0      000      AAA  0    D    0    DA   0    DAA  0
1      001      AA   1    E    0    D    1    DA   1
2      010      A    2    CA   0    EA   0    D    2
3      011      AB   0    C    1    E    1    DB   0
4      100      -    3    BAA  0    CAA  0    EAA  0
5      101      C    0    BA   1    CA   1    EA   1
6      110      BA   0    B    2    C    2    E    2
7      111      B    1    BB   0    CB   0    EB   0

Prefix:         Λ         1         10        100
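The four tables can be generated mechanically from the code. The Python sketch below is one way to do it, under the assumption that table j belongs to the j-th proper prefix of a codeword (ordered by length), which matches the prefixes Λ, 1, 10, 100 on this page; `build_tables` is a hypothetical name.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def build_tables(code, k):
    """Return (prefixes, tables); tables[j][v] = (decoded symbols, next table)."""
    inverse = {word: sym for sym, word in code.items()}
    # Proper prefixes of codewords correspond to internal nodes of the tree.
    prefixes = sorted({w[:i] for w in code.values() for i in range(len(w))},
                      key=lambda p: (len(p), p))
    index = {p: j for j, p in enumerate(prefixes)}
    tables = []
    for prefix in prefixes:
        rows = []
        for v in range(2 ** k):
            decoded, current = "", ""
            for bit in prefix + format(v, f"0{k}b"):
                current += bit
                if current in inverse:      # a complete codeword was read
                    decoded += inverse[current]
                    current = ""
            rows.append((decoded, index[current]))  # remainder picks next table
        tables.append(rows)
    return prefixes, tables


prefixes, tables = build_tables(CODE, 3)
for v in range(8):
    print(format(v, "03b"), [tables[j][v] for j in range(len(tables))])
```

An empty decoded string corresponds to the "-" entry: the three bits 100 read in table 0 do not complete any codeword, so decoding continues in table 3.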

Page 13:

Decoding Algorithm

[Partial decoding tables 0 to 3, as on page 12]

j ← 0
for f ← 1 to EOI
    (output, j) ← T(j, M[f; f + k − 1])
    f ← f + k

Example (codewords: A 0, B 11, C 101, D 1000, E 1001):

input   100  101  110  000  101
j       0    3    1    2    0
output  -    EA   B    DA   C
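The decoding loop can be sketched in Python as follows. The table construction is repeated in compact form so the block runs on its own; the bit-by-bit handling of a final block shorter than k bits is an addition of this sketch and is not taken from the slides, and all names are hypothetical.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def decode_bits(bits, start, inverse):
    """Decode bit by bit from the pending prefix `start`; return (symbols, rest)."""
    decoded, current = "", start
    for bit in bits:
        current += bit
        if current in inverse:
            decoded += inverse[current]
            current = ""
    return decoded, current


def decode_blockwise(bits, code, k):
    """Decode k bits per table access; returns the output of each access."""
    inverse = {word: sym for sym, word in code.items()}
    prefixes = sorted({w[:i] for w in code.values() for i in range(len(w))},
                      key=lambda p: (len(p), p))
    index = {p: j for j, p in enumerate(prefixes)}
    tables = []
    for prefix in prefixes:
        rows = []
        for v in range(2 ** k):
            decoded, rest = decode_bits(format(v, f"0{k}b"), prefix, inverse)
            rows.append((decoded, index[rest]))
        tables.append(rows)

    outputs, j = [], 0
    full = len(bits) - len(bits) % k
    for f in range(0, full, k):             # one table lookup per k-bit block
        symbols, j = tables[j][int(bits[f:f + k], 2)]
        outputs.append(symbols)
    if full < len(bits):                     # leftover bits: fall back to bitwise
        symbols, _ = decode_bits(bits[full:], prefixes[j], inverse)
        outputs.append(symbols)
    return outputs


result = decode_blockwise("100101110000101", CODE, 3)
print(result)
```

On the 15-bit example the five lookups produce the empty output, EA, B, DA and C in turn, which concatenates to EABDAC.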

Page 14:

Looking for new tradeoffs

[Figure: Huffman tree with leaves A, B, C, D, E; internal nodes numbered 0 to 3; only tables 0 and 3 are kept]

Reduced partial decoding tables, including backspaces

Page 15:

Reduced tables

W = decoded output, l = index of the next table, b = number of bits to back up. The pattern column shows the bits read, which is the full pattern for Table 0.

Entry  Pattern  Table 0      Table 3
                W    l  b    W    l  b
0      000      AAA  0  0    DAA  0  0
1      001      AA   0  1    DA   0  1
2      010      A    0  2    D    0  2
3      011      AB   0  0    DB   0  0
4      100      -    3  0    EAA  0  0
5      101      C    0  0    EA   0  1
6      110      BA   0  0    E    0  2
7      111      B    0  1    EB   0  0

Revised Decoding Algorithm

j ← 0
back ← 0
f ← 1
for f to EOI
    (output, j, back) ← T(j, M[f; f + k − 1])
    f ← f + k − back

Example (codewords: A 0, B 11, C 101, D 1000, E 1001):

input   1 0 0 1 0 1 1 1 0 0 0 0 1 0 1
output  -    EA   B    -    DA   C
back    0    1    1    0    1    0
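A Python sketch of decoding with reduced tables is given below. The rule used to fill the l and b columns is an assumption that reproduces the two tables on this page: after decoding, the remainder is matched against the longest kept prefix, and the bits beyond that prefix are handed back to the input. The names and the bit-by-bit treatment of a short final block are likewise illustrative.

```python
CODE = {"A": "0", "B": "11", "C": "101", "D": "1000", "E": "1001"}


def build_reduced_tables(code, k, kept):
    """tables[p][v] = (symbols, next kept prefix, bits to back up)."""
    inverse = {word: sym for sym, word in code.items()}
    tables = {}
    for prefix in kept:
        rows = []
        for v in range(2 ** k):
            decoded, current = "", ""
            for bit in prefix + format(v, f"0{k}b"):
                current += bit
                if current in inverse:
                    decoded += inverse[current]
                    current = ""
            # Longest kept prefix that starts the remainder; "" must be kept.
            nxt = max((q for q in kept if current.startswith(q)), key=len)
            back = len(current) - len(nxt)
            assert 0 <= back < k, "backing up k or more bits would never advance"
            rows.append((decoded, nxt, back))
        tables[prefix] = rows
    return tables


def decode_reduced(bits, code, k, kept):
    """Decode with reduced tables; returns the output of each table access."""
    tables = build_reduced_tables(code, k, kept)
    inverse = {word: sym for sym, word in code.items()}
    outputs, prefix, f = [], "", 0
    while f + k <= len(bits):
        symbols, prefix, back = tables[prefix][int(bits[f:f + k], 2)]
        outputs.append(symbols)
        f += k - back                        # re-read the last `back` bits
    tail, current = "", prefix               # fewer than k bits left: go bitwise
    for bit in bits[f:]:
        current += bit
        if current in inverse:
            tail += inverse[current]
            current = ""
    if tail:
        outputs.append(tail)
    return outputs


reduced = build_reduced_tables(CODE, 3, ["", "100"])
result = decode_reduced("100101110000101", CODE, 3, ["", "100"])
print(result)
```

Keeping only two of the four tables costs one extra access on this input (six lookups instead of five), which is the space versus time trade-off the reduced tables explore.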

Page 16:

Input: 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1

Regular Huffman: E A B D A C

Partial decoding tables: - EA B DA C

Reduced tables with backspace: - EA B - DA C

Page 17:

Experimental results

WSJ

         Bit   Partial decode tables   Reduced tables
k        1     8                       8
bpa      1     8                       6.4
MB/sec   6.6   -                       7.6
RAM      2.1   197                     34.1

Page 18:

Experimental results

KJV

         Bit    Partial decode tables   Reduced tables
k        1      8                       8
bpa      1      8                       6.4
MB/sec   10.1   0.4                     13.7
RAM      0.21   17                      8.7

Page 19:

Conclusion

3 examples of IR applications

Use of conceptual elements, like backspaces, may improve algorithms.

Page 20:

Thank you!

Questions?