NLP in Perl

Natural Language Processing in Perl

Alberto [email protected]

Portuguese Perl Workshop, 28 September 2012

Disclaimer

Tools for the Portuguese language from Projeto Natura

Language Identification: Lingua::Identify

Perl implementation of several language identification algorithms;

Languages detected: pt en de bg da es it fr fi hr nl ro ru pl el la sq sv tr sl hu id uk;

    use Lingua::Identify qw(:language_identification);
    my $lang = langof($textstring);

    use Lingua::Identify qw(:language_identification);
    # restrict the candidates to Spanish and Portuguese
    my ($lang, $prob) = langof_file(
        { 'active-languages' => ['es', 'pt'] },
        $filename,
    );

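To give a feel for what one such algorithm can look like, here is a minimal sketch of identification by frequent small words. It is a toy under stated assumptions, not the code of Lingua::Identify: the `%SMALL_WORDS` lists and the `guess_language` function are hypothetical names invented for this illustration, and the word lists are far too small for real use.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical, hand-picked function words per language (illustration only).
my %SMALL_WORDS = (
    pt => [qw(o a os as de do da que e em um uma não para com)],
    es => [qw(el la los las de del que y en un una no para con)],
    en => [qw(the a an of that and in to is it not for with)],
);

# Return the language whose small words cover the largest share of the
# tokens, together with that share (a crude score between 0 and 1).
sub guess_language {
    my ($text) = @_;
    my @tokens = grep { length } split /[^\p{L}]+/, lc $text;
    return (undef, 0) unless @tokens;

    my ($best, $best_score) = (undef, 0);
    for my $lang (sort keys %SMALL_WORDS) {
        my %is_small = map { $_ => 1 } @{ $SMALL_WORDS{$lang} };
        my $hits     = grep { $is_small{$_} } @tokens;   # count of matches
        my $score    = $hits / @tokens;
        ($best, $best_score) = ($lang, $score) if $score > $best_score;
    }
    return ($best, $best_score);
}

my ($lang, $score) = guess_language("O gato não gosta de água e foge para a rua");
print "$lang\n";   # prints: pt
```

A real identifier would combine several signals (small words, affixes, character n-grams) and use lists trained on corpora; this sketch only shows the scoring idea.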

Language Identification: Lingua::Identify::CLD

C++ implementation of the Chromium Compact Language Detector;

C++ code shipped directly in the Perl module;

Support for dozens of languages (including Pig Latin);

    use Lingua::Identify::CLD;

    my $cld  = Lingua::Identify::CLD->new(isPlainText => 0);
    my $lang = $cld->identify("Text");

Segmentation and Tokenization: Lingua::PT::PLNbase

Perl implementation;

Sentence segmentation and tokenization for Portuguese;

Some hacks to give less bad results in other languages;

    use Lingua::PT::PLNbase;

    # get the list of words/tokens
    my @atomos = atomiza($texto);

    # get the sentences
    my @frases = frases($texto);

Segmentation and Tokenization: Lingua::FreeLing3

Interface to the FreeLing C++ library;

Support for RU, PT, ES, GA, CA, EN, IT;

With other advanced features (covered later...)

    # language chosen once, at import time
    use FL3 'en';

    my $tokens    = tokenizer->tokenize($text);
    my $sentences = splitter->split($tokens);

    # language chosen per call
    use FL3;
    my $atomos = tokenizer('pt')->tokenize($texto);
    my $frases = splitter('pt')->split($atomos);


Morphological Analysis: Lingua::Jspell

Interface to the jspell C library/application;

Library and application included in the Perl module;

Works with dictionaries similar to those of ispell;

No disambiguation;

Dictionaries for PT, ES, LA, EN, FR.

    use Lingua::Jspell;
    my $dict = Lingua::Jspell->new("pt_PT");

    # possible lemmas for a word form
    my @radicals = $dict->rad("pode");    # poder, podar

    # full morphological analyses
    my @analysis = $dict->fea("gatinha");
    # { rad=>'gatinhar', .. }, { rad=>'gatinhar', .. }, { rad=>'gato', ..

    # forms derived from a lemma
    my @derivated = $dict->der("gato");
    # gata, gatinho, gatinha, gatinhos, gatas, gato, gatinhas, g.


Morphological Analysis: Lingua::FreeLing3

With detection of multiword expressions, proper nouns, entities, etc.;

And still other features (covered later...)

    use FL3 'en';

    my $tokens    = tokenizer->tokenize($text);
    my $sentences = splitter->split($tokens);
    $sentences = morph->analyze($sentences);

    # $sentences is a reference to a list of objects of
    # type "Lingua::FreeLing3::Word"
    my @analysis = $sentences->[0]->analysis;

Part-of-Speech Tagging: Lingua::FreeLing3

Disambiguation;

Two different approaches (hmm or relax);

    use FL3 'en';

    my $tokens    = tokenizer->tokenize($text);
    my $sentences = splitter->split($tokens);
    $sentences = morph->analyze($sentences);
    $sentences = hmm->tag($sentences);

    use FL3 'en';

    my $tokens    = tokenizer->tokenize($text);
    my $sentences = splitter->split($tokens);
    $sentences = morph->analyze($sentences);
    $sentences = relax->tag($sentences);


Dependency Parsing: Lingua::FreeLing3

Available for some of the FreeLing3 languages;

    use FL3 'es';

    my $tokens    = tokenizer->tokenize($text);
    my $sentences = splitter->split($tokens);
    $sentences = morph->analyze($sentences);
    $sentences = hmm->tag($sentences);
    $sentences = chart->parse($sentences);

NLGrep with Lingua::FreeLing3

Search for patterns (sequences);

of words, lemmas, or morphological properties;

    $ fl3-nlgrep -l pt pg33056.txt ~ser A C A
    era grosso e baixo
    era excellente e detestavel
    é pura e severa
    Sou exclusivo e pessoal
    era orgulhoso e fraco
    é independente e superior
    era grande e vistosa
    era justo nem bonito
    é trivial e chocho
    era restricta e mansa
    (...)

    open my $fh, "<:utf8", $filename
        or die "Cannot open $filename: $!";

    while (my $l = <$fh>) {
        my ($tokens, $frases);
        $tokens = tokenizer->tokenize($l);
        $frases = splitter->split($tokens);
        $frases = morph->analyze($frases);
        $frases = hmm->tag($frases);

        # for each sentence
        for my $frase (@$frases) {
            my @words = $frase->words;

            # sliding window; >= so the last full window is also tested
            while (@words >= @query) {
                if (match(\@words, \@query)) {
                    show_match(@words[0 .. $#query]);
                }
                shift @words;
            }
        }
    }

    # check the words against the search
    # expression
    sub match {
        my ($words, $query) = @_;
        for my $i (0 .. $#$query) {
            # skip this position: it is a wildcard
            next if $query->[$i] eq "_";

            # looking for an exact word
            if ($query->[$i] =~ /^=(.*)$/) {
                return 0 if $1 ne $words->[$i]->lc_form;
            }
            # looking for a lemma
            elsif ($query->[$i] =~ /^~(.*)$/) {
                return 0 if $1 ne $words->[$i]->lemma;
            }
            # otherwise, a POS tag
            else {
                my $tag = $words->[$i]->tag;
                return 0 if $tag !~ /^$query->[$i]/i;
            }
        }
        return 1;
    }

    # print the words
    sub show_match {
        print join(" ", map { $_->form } @_), "\n";
    }
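The sliding window and the matcher can be tried without FreeLing installed. The sketch below is an assumption-laden stand-in, not the fl3-nlgrep source: plain hashes with `form`, `lemma`, and `tag` keys replace the FreeLing word objects, the tags are hand-written, `_` is assumed to be the wildcard, and `term_matches` and `nlgrep` are hypothetical names.

```perl
use strict;
use warnings;

# Test one query term against one word (a hash standing in for a tagged token).
sub term_matches {
    my ($term, $word) = @_;
    return 1 if $term eq '_';                                 # wildcard (assumed)
    return ($1 eq lc $word->{form}) if $term =~ /^=(.*)$/;    # exact word form
    return ($1 eq $word->{lemma})   if $term =~ /^~(.*)$/;    # lemma
    return $word->{tag} =~ /^\Q$term\E/i ? 1 : 0;             # POS tag prefix
}

# Slide a window of query length over the words and
# return the text of every window that matches.
sub nlgrep {
    my ($words, $query) = @_;
    return unless @$query;
    my @found;
    for my $start (0 .. @$words - @$query) {
        my @window = @{$words}[ $start .. $start + $#$query ];
        my $ok = 1;
        for my $i (0 .. $#$query) {
            next if term_matches($query->[$i], $window[$i]);
            $ok = 0;
            last;
        }
        push @found, join(' ', map { $_->{form} } @window) if $ok;
    }
    return @found;
}

# Hand-tagged sentence (hypothetical tags, for illustration only).
my @sentence = (
    { form => 'A',       lemma => 'o',       tag => 'DA' },
    { form => 'casa',    lemma => 'casa',    tag => 'NC' },
    { form => 'era',     lemma => 'ser',     tag => 'VM' },
    { form => 'grande',  lemma => 'grande',  tag => 'AQ' },
    { form => 'e',       lemma => 'e',       tag => 'CC' },
    { form => 'vistosa', lemma => 'vistoso', tag => 'AQ' },
);

print "$_\n" for nlgrep(\@sentence, ['~ser', 'A', 'C', 'A']);
# prints: era grande e vistosa
```

Iterating over start offsets instead of shifting the array leaves the input untouched and makes the bounds explicit, including the final full-length window.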

Other Tools

Lingua::FreeLing3::Utils - a set of functions/commands for computing n-grams, morphological analysis, and nlgrep, implemented with FL3;

XML::TMX - for handling translation memories;

Lingua::NATools - for aligning parallel texts and extracting terminology/bilingual dictionaries;

Biblio::Thesaurus - for handling ontologies/thesauri;

Lingua::PT::ProperNames - for detecting and extracting proper names;

Text::Ngram - n-gram computation (characters);

Text::WordGrams - n-gram computation (words);
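For readers unfamiliar with n-grams, the following sketch counts them in plain Perl, for both words and characters. It does not use or reproduce the API of Text::Ngram or Text::WordGrams; `ngram_counts` is a hypothetical helper written only for this illustration.

```perl
use strict;
use warnings;

# Count every run of $n consecutive items, joined with $sep.
sub ngram_counts {
    my ($n, $sep, @items) = @_;
    my %count;
    return \%count if $n < 1;
    for my $i (0 .. @items - $n) {
        my $gram = join($sep, @items[ $i .. $i + $n - 1 ]);
        $count{$gram}++;
    }
    return \%count;
}

# Word bigrams: items are whitespace-separated words.
my $word_bigrams = ngram_counts(2, ' ', split(' ', 'o gato viu o gato'));

# Character trigrams: items are single characters.
my $char_trigrams = ngram_counts(3, '', split(//, 'gato gata'));

printf "%s => %d\n", $_, $word_bigrams->{$_} for sort keys %$word_bigrams;
```

The same function serves both cases because an n-gram is just a window over a sequence; only the way the text is split into items changes.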

Thank you!
