Language Independent Collocation Extraction
(LICE)
Vidas Daudaravičius
Andrius Utka
(Vytautas Magnus University)
Mutual Information
[Figure: maximum, average and minimum MI values (axis from -10 to 30) plotted against the sum of word frequencies in a word pair (0 to 2,900,000)]
• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of institutions and organisations
Midshipmen Abdulla Mohammed Al-Kaabi; Ahmed Suleman Al-Mamari; Ali Adam Al-Maimani; Ali Suleman Al-Rawahi; L P Chariandy; Feras Al-Kandari; Khalid Al-Moqbali; Khamis Ali Al-Sulaitni; Khamis
Saeed Al-Mazrouei; Majed Al-Majed; Mansour Sultan Al-Ramyan; Mohammed A Al-Mazrouei; Mohammed Ali Al-Wahaibi; Naser Al-Mutairi; Osama Khaled Al-Ammar.
$$MI(x;y) = \log_2 \frac{N \cdot f(x,y)}{f(x) \cdot f(y)}$$
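The MI score on this slide (log base 2 of N times f(x,y), divided by f(x) times f(y)) can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the helper names `pair_counts` and `mutual_information`, the toy sentence, and the reading of N as the number of tokens in the corpus are all assumptions.

```python
import math
from collections import Counter


def pair_counts(tokens):
    """Return word frequencies, adjacent-pair frequencies and corpus size.

    Assumption: N is taken to be the number of tokens in the corpus.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)


def mutual_information(f_xy, f_x, f_y, n):
    """MI(x;y) = log2(N * f(x,y) / (f(x) * f(y)))."""
    return math.log2(n * f_xy / (f_x * f_y))


# Toy corpus, invented for illustration only.
tokens = "new york is big and new york is old".split()
unigrams, bigrams, n = pair_counts(tokens)
mi = mutual_information(
    bigrams[("new", "york")], unigrams["new"], unigrams["york"], n
)
print(round(mi, 3))
```

Because the product f(x) times f(y) sits in the denominator, pairs of rare words that always occur together get very high MI, which fits the kinds of items listed on the slide (names, foreign quotations).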
T-score
[Figure: maximum, average and minimum Log*(T-score) values (axis from -6 to 12) plotted against the sum of word frequencies in a word pair (1 to 10,000,000)]
• specific noun phrases
• proper nouns
• idioms
• verb phrases
“We think that there should be tighter safeguards with us being used as an example of what can go wrong. The Law Society has done the right thing but it was one of its members who did this, so
it is bad it spent two years and two previous attempts denying us our compensation.”
$$T(x,y) = \frac{f(x,y) - \frac{f(x) \cdot f(y)}{N}}{\sqrt{f(x,y)}}$$
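A minimal sketch of the T-score, assuming the usual form: the observed pair frequency minus the frequency expected by chance, f(x) times f(y) over N, divided by the square root of the observed pair frequency. The function name `t_score` and the counts are illustrative assumptions. The Log* transform used on the plot's axis is not defined on the slide and is not reproduced here.

```python
import math


def t_score(f_xy, f_x, f_y, n):
    """T(x,y) = (f(x,y) - f(x) * f(y) / N) / sqrt(f(x,y))."""
    expected = f_x * f_y / n  # pair frequency expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)


# Invented counts: the pair occurs far more often than chance predicts.
print(t_score(f_xy=16, f_x=32, f_y=64, n=2048))

# Invented counts: two frequent words that almost never co-occur.
print(t_score(f_xy=1, f_x=1000, f_y=1000, n=10000))
```

Unlike MI, the score grows with the absolute pair frequency, so it favours frequent combinations over rare ones.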
Dice
[Figure: maximum, average and minimum Dice values (axis from -25 to 5) plotted against the sum of the word frequencies in a word pair (0 to 2,900,000)]
• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of organisations and institutions
• exclamations
Fade in theme music. Tum-ti-tum-ti-tum-ti-tum Tum-ti-tum-ti-tum tum etc (trad arr Snoop Doggy Dogg).
$$Dice(x;y) = \log_2 \frac{2 \cdot f(x,y)}{f(x) + f(y)}$$
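The logarithmic Dice score can be sketched as follows. The sketch assumes the standard Dice denominator f(x) + f(y) (the operator is lost in the extracted slide text), and the function name `log_dice` and the counts are illustrative only.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Dice(x;y) = log2(2 * f(x,y) / (f(x) + f(y))).

    Assumption: the denominator is the sum of the two word frequencies.
    """
    return math.log2(2 * f_xy / (f_x + f_y))


# Invented counts: the words always occur together, so the score is 0.
print(log_dice(f_xy=5, f_x=5, f_y=5))

# Invented counts: the pair covers a quarter of the occurrences of each word.
print(log_dice(f_xy=4, f_x=16, f_y=16))
```

Under this assumption the score does not depend on the corpus size N, only on how much of each word's frequency the pair accounts for.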
Gravity Counts
[Figure: maximum, average and minimum Gravity Counts values (axis from -15 to 25) plotted against the sum of the word frequencies in a word pair (0 to 2,900,000)]
• specific noun phrases
• proper nouns
• idioms
• verb phrases
… he replied: “The Conservative party wants to win the next election. I want to win the next election. I have the will to win the next election and I believe we will have a case to take to the British people that
will encourage them to believe it’s right that we carry on the job we’ve been trying to do.
$$G(x,y) = \log \frac{f(x,y) \cdot n(x)}{f(x)} + \log \frac{f(x,y) \cdot n'(y)}{f(y)}$$
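A minimal sketch of Gravity Counts for adjacent word pairs follows. The slide does not define n(x) and n'(y); the sketch assumes n(x) is the number of distinct words seen immediately to the right of x and n'(y) the number of distinct words seen immediately to the left of y. The natural logarithm, the function name `gravity_counts` and the toy token sequence are also assumptions.

```python
import math
from collections import Counter, defaultdict


def gravity_counts(tokens):
    """Score every adjacent pair (x, y) with
    G(x,y) = log(f(x,y) * n(x) / f(x)) + log(f(x,y) * n'(y) / f(y)).

    Assumptions: n(x) counts distinct right neighbours of x, n'(y) counts
    distinct left neighbours of y, and log is the natural logarithm.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    right_types = defaultdict(set)  # distinct words observed right after x
    left_types = defaultdict(set)   # distinct words observed right before y
    for x, y in bigrams:
        right_types[x].add(y)
        left_types[y].add(x)
    scores = {}
    for (x, y), f_xy in bigrams.items():
        left_term = math.log(f_xy * len(right_types[x]) / unigrams[x])
        right_term = math.log(f_xy * len(left_types[y]) / unigrams[y])
        scores[(x, y)] = left_term + right_term
    return scores


# Toy token sequence, invented for illustration only.
scores = gravity_counts("a b a c a b".split())
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

The neighbour-diversity factors are what distinguish this measure from the purely frequency-based ones: a pair scores higher when its words combine with many different neighbours yet still prefer each other.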
Extraction of Collocational Strings
[Figure: values plotted for each pair of adjacent words (axis from -5 to 25) along the sentence "He will work for a new free trade area embracing North America and Europe, an idea President Clinton is interested in". Data labels, in extracted order: 16.4, 10.8, 13.6, 20.8, 18.8, 5.1, 11.7, 6.2, 0.6, 13.1, 13.7, 11.5, 7.2, 10.6, 0, 15.4, 9.3, 8.8, 14.8, -3.2]
Extracted strings:
• He will work for a new
• free trade area
• North America and Europe, an idea
• President Clinton is interested in
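The slide shows a sentence cut into collocational strings but does not state the cutting rule. As a rough sketch only, one plausible assumption is that the sentence is split wherever the score between two adjacent words falls below some threshold. The function name `split_at_weak_links`, the threshold, and the scores below are hypothetical and are not taken from the slide.

```python
def split_at_weak_links(words, scores, threshold):
    """Split `words` into chunks, cutting between words[i] and words[i + 1]
    whenever scores[i] < threshold.

    Hypothetical rule for illustration; the slide does not specify how
    string boundaries are chosen.
    """
    if not words:
        return []
    if len(scores) != len(words) - 1:
        raise ValueError("need exactly one score per adjacent word pair")
    chunks = []
    current = [words[0]]
    for word, score in zip(words[1:], scores):
        if score < threshold:
            chunks.append(current)  # weak link: close the current chunk
            current = [word]
        else:
            current.append(word)    # strong link: extend the current chunk
    chunks.append(current)
    return chunks


# Invented scores, chosen only to demonstrate the mechanics.
words = ["free", "trade", "area", "embracing", "north", "america"]
scores = [12.0, 10.0, 1.0, 2.0, 15.0]
for chunk in split_at_weak_links(words, scores, threshold=5.0):
    print(" ".join(chunk))
```

A fixed threshold is the simplest choice; a rule that compares each score with its neighbours would be another option under the same interface.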
Extraction of Nominal Phrases from Lithuanian Language Corpus (100m)
AC (French)
[Figure: GC and MI values between adjacent words, for Span = 1 and Span = 3 (axes from -10 to 30 and from -10 to 25), along the sentence: CHAQUE ÉTAT MEMBRE COMPARE SUR UNE PÉRIODE D AU MOINS DEUX ANS LES INDICES DE QUALITÉ DES VARIÉTÉS DE BLÉ DUR À CEUX DES VARIÉTÉS REPRÉSE… AU NIVEAU RÉGIONAL]
AC (English)
[Figure: GC and MI values between adjacent words, for Span = 1 and Span = 3 (axes from -10 to 25 and from -10 to 30), along the sentence: EACH MEMBER STATE SHALL COMPARE OVER A PERIOD OF AT LEAST TWO YEARS THE QUALITY INDEXES OF THE DURUM WHEAT VARIETIES WITH THOSE OF THE REPRESENTATIVE VARIETIES AT REGIONAL LEVEL]
AC (Italian)
[Figure: GC and MI values between adjacent words, for Span = 1 and Span = 3 (axes from -15 to 20 and from -10 to 30), along the sentence: CIASCUNO STATO MEMBRO RAFFRONTA NELL ARCO DI UN PERIODO DI ALMENO DUE ANNI GLI INDICI DI QUALITÀ DELLE VARIETÀ DI FRUMENTO DURO CON QUELLI DELLE VARIETÀ RAPPRESENTATIVE A LIVELLO REGIONALE]
AC (Maltese)
[Figure: GC and MI values between adjacent words, for Span = 1 and Span = 3 (axes from -10 to 25 and from -10 to 30), along the sentence: KULL STAT MEMBRU GĦANDU JQABBEL FUQ FIRXA TA MILL ANQAS SENTEJN L INDIĊI TAL KWALITÀ TAL VARJETAJIET TA QAMĦ TA L AWSTRALJA MA DAWK TAL VARJETAJIET FUQ LIVELL REĠJONALI RAPPREZENTATTIVI]
AC (Dutch)
[Figure: GC and MI values between adjacent words, for Span = 1 and Span = 3 (axes from -10 to 25 and from -10 to 30), along the sentence: ELKE LIDSTAAT VERGELIJKT OP REGIONAAL NIVEAU OVER EEN PERIODE VAN TEN MINSTE TWEE JAAR DE KWALITEITSINDEX VAN DE DURUMTARWERA MET DIE VAN DE REPRESENTATIEVE RASSEN]
Phrase Alignment
FR                 | EN                      | IT                    | MT                      | NL
CHAQUE ÉTAT MEMBRE | EACH MEMBER STATE SHALL | CIASCUNO STATO MEMBRO | KULL STAT MEMBRU GĦANDU | ELKE LIDSTAAT
DE                 | OF THE                  | DI                    | TA                      | VAN DE
BLÉ DUR            | DURUM WHEAT             | FRUMENTO DURO         | TAL QAMĦ AWSTRALJA      | DURUMTARWERA
DES                | OF THE                  | DELLE                 | TAL                     | VAN DE
AU NIVEAU RÉGIONAL | AT REGIONAL LEVEL       | A LIVELLO REGIONALE   | FUQ LIVELL REĠJONALI    | OP REGIONAAL NIVEAU
Language Independent Collocation Extraction
(LICE)
http://donelaitis.vdu.lt/~vidas/celex/lice.php
Vidas Daudaravičius
Andrius Utka
(Vytautas Magnus University)