THIAGO CUNHA DE MOURA SALLES
AUTOMATIC DOCUMENT CLASSIFICATION
TEMPORALLY ROBUST
Dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

ADVISOR: MARCOS ANDRÉ GONÇALVES

CO-ADVISOR: LEONARDO CHAVES DUTRA DA ROCHA
Belo Horizonte
March 2011
© 2011, Thiago Cunha de Moura Salles. All rights reserved.

Salles, Thiago Cunha de Moura.
S168c  Classificação Automática de Documentos Temporalmente Robusta / Thiago Cunha de Moura Salles. — Belo Horizonte, 2011.
xxxvi, 106 f. : il. ; 29cm

Master's dissertation — Universidade Federal de Minas Gerais, Department of Computer Science.
Advisor: Marcos André Gonçalves. Co-advisor: Leonardo Chaves Dutra da Rocha.

1. Computing – Theses. 2. Information retrieval – Theses. I. Advisor. II. Title.

CDU 519.6*73 (043)
“In times of change, learners inherit the Earth,
while the learned find themselves beautifully equipped
to deal with a world that no longer exists.”
(Eric Hoffer)
Abstract

Automatic Document Classification (ADC) continues to be a relevant research topic in the machine learning and information retrieval communities, and several ADC algorithms have been proposed. However, the majority of ADC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual datasets. We provide evidence of the existence of three main temporal effects in three textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known ADC algorithms. We show that these temporal effects affect each analyzed dataset differently, and that they restrict the performance of each considered ADC algorithm to different extents. The reported quantitative analyses provide valuable insights to better understand the behavior of ADC algorithms when faced with non-static (temporal) data distributions, and highlight important requirements for the proposal of more accurate classification models. Based on the performed analyses, in order to minimize the impact of temporal effects on ADC algorithms, we introduce a temporal weighting function (TWF) which reflects the varying nature of textual datasets, and propose a methodology to determine its expression and parameters. We applied this methodology to three textual datasets and then proposed two strategies to extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF, which we call temporally-aware classifiers. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms in almost all cases.
Extended Summary

Introduction

Automatic Document Classification (ADC) is a research topic of great relevance in the Machine Learning and Information Retrieval communities. Indeed, developing effective and efficient ADC algorithms has proven increasingly important, given the growing complexity and scale of current application scenarios, such as the Web. The ADC task consists of learning models that associate documents with semantically cohesive classes, based on a set of previously labeled documents. Such models are key components to support and improve a variety of tasks, such as the design of topic directories, the identification of writing styles, the organization of digital libraries, and helping users interact better with search engines, among others.
The Problem

To better understand the problem studied in this work, we briefly present the ADC task under the supervised paradigm. The main goal of ADC is to predict the (unknown) class of a new document, based on a set of previously labeled documents (Sebastiani, 2002). Let d_i = (x_i, c_i) be a document whose bag-of-words vector representation is given by x_i and whose class c_i ∈ C is a categorical attribute drawn from a finite set C of classes. The goal of ADC can thus be defined as learning a discrete approximation of the posterior class distribution P(c_i | d_i), which reflects the predictive relationship between documents and classes. This learning is carried out based on the set of previously labeled documents (the training set).
The approximation of P(c_i | d_i) can be obtained either by direct estimation or by indirect estimation (through Bayes' rule). The first strategy defines the so-called discriminative classifiers, characterized by learning the inter-class boundaries so as to minimize the error rate (or some related metric), literally discriminating between the classes without making any assumption about the probability density function of each class. The second strategy, in turn, defines the so-called generative classifiers, which rely on estimating both the class-conditional probability P(d_i | c_i) and the class prior probability P(c_i) in order to estimate the desired posterior probability. In this case, a model is assumed for both the densities P(d_i | c_i) and the priors P(c_i), with the model parameters estimated from the training set. The posterior probability is then obtained by applying Bayes' rule:

    P(c_i | d_i) = P(c_i) · P(d_i | c_i) / Σ_{c' ∈ C} P(c') · P(d_i | c'),    (1)

where P(c_i) and P(d_i | c_i) denote the class prior and class-conditional probabilities, respectively.
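As a concrete illustration of Equation 1, the sketch below computes the posterior P(c|d) from given priors and class-conditional likelihoods. All numeric values are toy figures for illustration, not estimates from any of the collections studied here.

```python
# Sketch of Equation 1: posterior class probabilities via Bayes' rule.
# The priors P(c) and class-conditional likelihoods P(d|c) below are toy
# values for illustration only, not estimates from any dataset.

def posterior(priors, likelihoods):
    """Return P(c|d) for every class c, given P(c) and P(d|c)."""
    # Numerator of Equation 1, per class: P(c) * P(d|c).
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Denominator: sum over all classes c' of P(c') * P(d|c').
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

priors = {"A": 0.6, "B": 0.4}         # P(c)
likelihoods = {"A": 0.02, "B": 0.08}  # P(d|c) for one fixed document d

post = posterior(priors, likelihoods)
print(post)
```

Note that the class with the larger prior ("A") does not necessarily win the posterior: the class-conditional likelihood can dominate, which is exactly why drifting estimates of either quantity hurt both generative and discriminative classifiers.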
The basic assumption adopted by the vast majority of ADC algorithms is that the training data used to build a classification model are random samples drawn from a stationary data distribution. However, this may not be the case. Indeed, in several (perhaps most) real classification problems, the training data may not come from the same distribution that governs the data to be classified, due to their temporal dynamics. For example, spam filtering and recommender systems are naturally confronted with inherently dynamic data. Thus, the success of classification algorithms may be compromised when faced with non-static data.

As analyzed by Kelly et al. (1999), the variations observed in data distributions are reflected in at least three aspects:

• Variations in the prior probabilities P(c_i);
• Variations in the posterior probabilities P(c_i | d_i);
• Variations in the class-conditional probabilities P(d_i | c_i).

Note that, according to Equation 1, since P(c_i | d_i) depends on P(d_i | c_i), both discriminative and generative classifiers that assume a stationary data distribution may have their effectiveness limited when applied to non-stationary data distributions.
In this work, we are particularly interested in the impact that the temporal dynamics observed in textual data have on ADC algorithms. Due to the dynamics of knowledge, and even of languages, the characteristics of textual collections may vary over time. Indeed, as analyzed by Mourão et al. (2008), three temporal effects, which can ultimately be seen as manifestations of the three aspects listed above, proved significant in two real textual collections. The first effect, CD ("Class Distribution variation"), refers to variations in the class distribution over time (that is, the relative class frequencies do not remain static). The second effect, TD ("Term Distribution variation"), refers to variations observed over time in the term distribution, reflected by variations in the representativeness of terms with respect to the classes in which they occur. Finally, the third effect, CS ("Class Similarity variation"), refers to variations in the pairwise class similarities as time goes by. Indeed, two classes may be similar (or dissimilar) to each other at a given moment, and this similarity may decrease (or increase) over time. Furthermore, in (Mourão et al., 2008) the authors showed that this temporal evolution is a challenge for learning algorithms, which may have their effectiveness limited if this aspect is neglected.
In this work, we advance knowledge in the area through the quantification and minimization of the impact of temporal effects on ADC algorithms. By carrying out a series of full factorial designs, we quantify the extent of the temporal effects in different textual collections, as well as their impact on four traditional ADC algorithms. Based on the knowledge obtained from this deeper characterization, we developed strategies to minimize the impact of such effects on three algorithms, achieving results competitive with the state of the art in automatic document classification at a lower computational cost.
Quantitative Analysis of Temporal Effects in
Automatic Document Classification
In order to quantify the impact of temporal effects on ADC algorithms, we first revisit the characterization reported in (Mourão et al., 2008), in which the authors present evidence of the existence of the three temporal effects discussed above in two real textual collections: ACM-DL and MEDLINE. The former comprises 24,897 documents from the ACM Digital Library, distributed over 11 disjoint classes and created between 1980 and 2002. The latter comprises 861,454 documents classified into 7 Medicine-related classes, created between 1970 and 1985. We also include a third collection, from the news domain, in order to provide evidence of the existence of the temporal effects in it. This is AG-NEWS, a collection of 835,795 documents, distributed over 11 disjoint classes and created within an interval of 573 days. It is potentially a more dynamic collection than the others.

Indeed, when characterizing this collection with respect to the temporal effects, following the methodology proposed in (Mourão et al., 2008), it became clear that AG-NEWS is affected by all three temporal effects. As an example, Figure 1 shows the relative class distribution observed over time (using a weekly time unit). Clearly, the class distribution varies. Further details on this characterization can be found in the full text of the dissertation.
Figure 1: Class Distribution Variation—AG-NEWS.
Full Factorial Design

Once the existence of the temporal effects in the three adopted textual collections had been established, we moved on to a deeper characterization, quantifying how they affect the collections and the effectiveness of four ADC algorithms widely used by the Machine Learning community, namely Rocchio, k-Nearest Neighbors (kNN), Naïve Bayes and Support Vector Machine (SVM).

Given k factors, each of which may take n levels (possible values), and a response variable, an n^k r factorial design seeks to quantify the impact of each factor (as well as the interactions among them) on the response variable, through r experimental replications. In our case, we aim to quantify the impact of the temporal effects (factors), and their interactions, on the effectiveness of ADC algorithms (response variable). We consider two possible levels: a low level and a high level, referring to a low and a high influence of the temporal effects, respectively.
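The effect estimation behind such a design can be sketched as follows for the 2^2 case with r = 3 replications, using the standard sign-table method. The factor encodings and response values below are invented for illustration; they are not results from this dissertation.

```python
# Sketch of a 2^2 r full factorial design: estimate the main effects of
# two factors (e.g., two temporal effects at low/high levels) and their
# interaction on a response variable (e.g., classification effectiveness).
# All response values are invented for illustration.

from itertools import product
from statistics import mean

# responses[(a, b)] = r replicated measurements, a, b in {-1, +1}
# (-1 = low level, +1 = high level of each factor).
responses = {
    (-1, -1): [80.1, 79.8, 80.3],
    (+1, -1): [74.9, 75.2, 75.0],
    (-1, +1): [77.8, 78.1, 78.0],
    (+1, +1): [70.2, 69.9, 70.1],
}

means = {cell: mean(vals) for cell, vals in responses.items()}

# Sign-table estimates: each effect is the average response at its high
# level minus the average response at its low level.
effect_a = mean(means[(+1, b)] for b in (-1, +1)) - mean(means[(-1, b)] for b in (-1, +1))
effect_b = mean(means[(a, +1)] for a in (-1, +1)) - mean(means[(a, -1)] for a in (-1, +1))
# Interaction effect: cells where the factor signs agree vs. disagree.
effect_ab = mean(means[(a, b)] for a, b in product((-1, +1), repeat=2) if a * b == +1) \
          - mean(means[(a, b)] for a, b in product((-1, +1), repeat=2) if a * b == -1)

print(effect_a, effect_b, effect_ab)
```

In this toy setup, both main effects are negative (effectiveness drops when either factor is at its high level), with factor A hurting more than factor B, which is the kind of reading the factorial designs in this work extract for each temporal effect.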
A first issue to be addressed in order to carry out the factorial design is to isolate the levels of each factor. This is accomplished by partitioning the documents of the collection under study into groups exhibiting low and high levels of influence of the temporal effects. To this end, we propose some mechanisms to perform this isolation, as described next:
Class Distribution (CD): We measure the variation of the distribution of each class c over time through the Coefficient of Variation (CV_c = σ_c/µ_c) of the relative proportion of c at each time point. To do so, we compute the proportion P_{c,p} of documents of class c at each time point p and obtain both the mean µ_c and the standard deviation σ_c of these values. We thus associate with each class c its respective Coefficient of Variation CV_c. We then define a threshold δ_CD such that documents belonging to classes whose CV is below δ_CD are assigned to the low level (group CD↓) and the remaining ones are assigned to the high level (group CD↑).
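The CD level isolation just described can be sketched as follows; the class-proportion series and the threshold δ_CD below are hypothetical values, not figures from the studied collections.

```python
# Sketch of the CD level isolation: for each class c, compute the
# Coefficient of Variation CV_c = sigma_c / mu_c of its relative
# proportion over time, and split the classes at a threshold delta_cd.
# The proportions and the threshold are invented for illustration.

from statistics import mean, pstdev

# proportions[c] = relative frequency P_{c,p} of class c at each time point p.
proportions = {
    "stable_class":   [0.30, 0.31, 0.29, 0.30],
    "volatile_class": [0.10, 0.35, 0.05, 0.50],
}

cv = {c: pstdev(series) / mean(series) for c, series in proportions.items()}

delta_cd = 0.2  # hypothetical threshold
cd_low = {c for c, v in cv.items() if v < delta_cd}    # group CD (low level)
cd_high = {c for c, v in cv.items() if v >= delta_cd}  # group CD (high level)
print(cv, cd_low, cd_high)
```

Normalizing the standard deviation by the mean is what lets classes of very different sizes be compared on the same variability scale.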
Term Distribution (TD): In order to isolate the low and high levels of this temporal effect, we propose a metric called the "Document Stability Level" (DSL). The DSL of a document d denotes the density of stable terms (that is, terms with low variation in their representativeness with respect to the classes) composing d. We define a threshold δ_TD to isolate the two levels: documents whose DSL is below δ_TD are assigned to the low level (group TD↓) and the remaining ones to the high level (group TD↑).
Class Similarity (CS): To isolate the levels associated with this effect, we consider the variations observed over time in the pairwise class similarities. Consider the pair of classes ⟨c_i, c_j⟩, with i ≠ j. For each time point p, we define V_{i,p} and V_{j,p} as the vocabularies of classes c_i and c_j observed at p, respectively, composed of the k most representative terms for those classes at that time point, according to the Information Gain metric. We compute the cosine similarity between both vocabularies and measure the variability observed over time through the Coefficient of Variation. Thus, for each class c_i, we measure the variability observed in its similarity with the other classes c_j ≠ c_i. As before, we define a threshold δ_CS in order to separate the documents associated with the classes with lower variability (group CS↓) from those associated with the classes with higher variability (group CS↑).
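The CS characterization can be sketched as below: at each time point we take the top-k terms of each class (here ranked by made-up weights standing in for actual Information Gain scores) and track the cosine similarity of the two vocabulary vectors over time, summarizing its variability with the Coefficient of Variation.

```python
# Sketch of the CS characterization: cosine similarity between the
# vocabularies of two classes at each time point, then the Coefficient
# of Variation of those similarities over time. The term weights are
# invented stand-ins for Information Gain scores.

from math import sqrt
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# vocab[p] = (top-k terms of class c_i at p, top-k terms of class c_j at p)
vocab = {
    0: ({"net": 0.9, "web": 0.8}, {"net": 0.7, "java": 0.6}),
    1: ({"net": 0.9, "grid": 0.5}, {"ruby": 0.8, "java": 0.6}),
}

sims = [cosine(vi, vj) for vi, vj in vocab.values()]
cv_sim = pstdev(sims) / mean(sims)  # variability of the pairwise similarity
print(sims, cv_sim)
```

Here the two classes share vocabulary at the first time point but none at the second, so the similarity series is highly variable, the situation that would push the pair toward the CS high-level group.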
Having isolated the levels of each factor, we observed a high correlation between the temporal effects CD and CS. This correlation makes it infeasible to conduct a 2^3 r factorial design (that is, one with the three temporal effects considered simultaneously). We therefore adopted a pairwise experimentation strategy, separately evaluating the impact of the CD and TD effects (CD×TD factorial design) and the impact of the CS and TD effects (CS×TD factorial design) on the ADC algorithms. For each combination of adopted ADC algorithm and collection, we ran the pair of factorial designs CD×TD and CS×TD.
Main Results

The factorial designs described above revealed a series of pertinent insights into the behavior of the collections from a temporal perspective, as well as into the behavior of ADC algorithms when applied to collections whose characteristics vary over time.

First, we show that the temporal effects are more prominent in the ACM-DL and AG-NEWS collections than in MEDLINE. More specifically, with 99% confidence, we obtained the following partial orderings:

CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS,
CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS,
TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.
Second, considering the ACM-DL collection, the impact of the CD and CS effects proved statistically equivalent to the impact of the TD effect, whereas for the MEDLINE and AG-NEWS collections both CD and CS proved more prominent than the TD effect.
Moreover, all four analyzed ADC algorithms were negatively impacted by the temporal effects in terms of classification effectiveness. Indeed, the largest degradations in effectiveness were observed when the algorithms were applied to the most dynamic collections (ACM-DL and AG-NEWS). Considering the algorithms individually, the quantitative analysis gave us a better understanding of the strengths and weaknesses of the classifiers with respect to the three temporal effects studied. For example, the SVM classifier proved more robust to the TD effect, while being markedly impacted by the other effects. This behavior can be explained by the very characteristics of the classifier, as discussed in the dissertation. We also show that the other three classifiers under study are quite sensitive to all three temporal effects. Table 1 presents the partial ordering of the algorithms, for each adopted dataset, with respect to the impact of the observed temporal effects. The reported relationships highlight the fact that, besides ADC algorithms being negatively affected by the temporal effects, the observed degradation is peculiar to each algorithm and to each dataset.
Temporal Effect   ACM-DL                 MEDLINE                AG-NEWS
CD                SVM > NB ∼ KNN ∼ RO    RO > SVM > NB > KNN    RO ∼ KNN > SVM ∼ NB
CS                SVM > KNN ∼ RO > NB    RO > SVM ∼ NB > KNN    RO ∼ KNN ∼ NB > SVM
TD                SVM ∼ KNN ∼ RO ∼ NB    SVM > RO ∼ NB ∼ KNN    RO > NB > KNN > SVM

Table 1: A Comparative Study of the Impact of the Temporal Effects on Each ADC Algorithm—Rocchio (RO), SVM, Naïve Bayes (NB) and KNN.
The results obtained from this analysis therefore corroborate our argument that the temporal dimension is a highly important aspect which, despite the intrinsic challenges associated with temporal dynamics, must be properly taken into account in the development of accurate classification models.
Temporally Robust Automatic Document Classification

Based on the lessons learned from the temporal characterization described above, we propose some strategies to minimize the impact of the temporal effects on ADC algorithms when applied to data drawn from distributions that vary over time. These strategies rely on the use of what we call the Temporal Weighting Function (TWF). We first propose a methodology, based on a series of statistical tests, to determine the expression and the parameters of the TWF, so as to best describe the underlying evolutionary process that governs the data variation. We instantiated this methodology on the three textual collections described above. We then found that the TWFs associated with the ACM-DL and MEDLINE collections follow a lognormal distribution, with 99% confidence. However, the same tests failed for AG-NEWS. Therefore, the TWF associated with the AG-NEWS collection follows a different distribution, and other tests (potentially more complex, which may preclude their use by those lacking the required statistical skills) become necessary. In fact, for temporally robust classification, only the positive real values associated with the temporal distances are needed. Thus, to make these classifiers applicable in cases where the tests required to determine the TWF are more complex (or even unknown), we offer an automatic strategy to determine this function, without the need for any statistical test.
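Assuming, as found for ACM-DL and MEDLINE, that a lognormal fits, the statistical determination of the TWF can be sketched with nothing more than the log-transformed temporal distances: the maximum-likelihood µ and σ of a lognormal are the mean and standard deviation of the logs. The sample of temporal distances below is invented for illustration.

```python
# Sketch: fit a lognormal-shaped TWF to observed temporal distances by
# maximum likelihood. For a lognormal, the MLE of mu and sigma are the
# mean and standard deviation of the log-transformed data. The sample
# of temporal distances below is invented for illustration.

from math import exp, log, pi, sqrt
from statistics import mean, pstdev

distances = [1, 1, 2, 2, 3, 5, 8, 13]  # hypothetical temporal distances (> 0)

logs = [log(d) for d in distances]
mu, sigma = mean(logs), pstdev(logs)

def twf(x):
    """Lognormal density used as the temporal weighting function."""
    return exp(-((log(x) - mu) ** 2) / (2 * sigma ** 2)) / (x * sigma * sqrt(2 * pi))

weights = [twf(d) for d in distances]
print(mu, sigma, weights)
```

The resulting function assigns positive weights that decay for temporal distances far from the bulk of the fitted distribution, which is the behavior the temporally robust classifiers exploit.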
Once the TWF has been defined, mechanisms must be provided to incorporate it into the classification framework. We then propose three strategies to do so:
TWF applied to Documents: This strategy consists of weighting each training document by the TWF, according to the temporal distance between it and the document to be classified. In this way, training documents from time points at which the data distribution diverges from the one observed at the creation time of the document to be classified have their influence on the decision rule minimized. Figure 2 presents a schematic description of this strategy.

Figure 2: TWF Applied to Documents.
TWF applied to Scores: In this case, we take the scores produced by a traditional classifier over a training set in which the class c of each document is transformed into the derived class ⟨c, p⟩ (where p denotes the time point at which the document was created), tying the observed patterns not only to the classes but also to the time point at which they were observed. The scores obtained by the traditional classifier for each ⟨c, p⟩ are then aggregated through a weighted sum, where the weights are given by the TWF. Figure 3 presents a graphical description of this strategy.

Figure 3: TWF Applied to Scores.
TWF applied to Scores (Extended Version): This strategy partitions the training documents into subgroups composed of documents created at the same time point (hence, with no temporal variation). Traditional classifiers are then applied to each document partition, in order to classify the test document based on these several training sets. The scores for class c, obtained for each data partition, are aggregated through a weighted sum, with the weights given by the TWF. A schematic representation of this strategy is shown in Figure 4.

Figure 4: TWF Applied to Scores (Extended Version).
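The first strategy ("TWF applied to Documents") can be sketched as a weighted-vote decision rule in which each training document's contribution is scaled by the TWF of its temporal distance to the test document. The neighbors, similarity values and TWF shape below are all hypothetical, not the dissertation's actual function or data.

```python
# Sketch of "TWF applied to Documents": a kNN-style decision rule in
# which each training document's vote is weighted by the TWF of its
# temporal distance to the test document. Neighbors, similarities and
# the TWF shape are invented for illustration.

from collections import defaultdict

def twf(delta):
    """Hypothetical TWF: influence decays with temporal distance."""
    return 1.0 / (1.0 + delta)

# Each neighbor: (class, similarity to the test document, creation time).
neighbors = [("sports", 0.9, 2000), ("sports", 0.8, 1990), ("tech", 2001, 2001)]
neighbors = [("sports", 0.9, 2000), ("sports", 0.8, 1990), ("tech", 0.7, 2001)]
test_time = 2001

scores = defaultdict(float)
for cls, sim, created in neighbors:
    # Each training document's contribution is down-weighted by how
    # temporally distant it is from the document being classified.
    scores[cls] += sim * twf(abs(test_time - created))

predicted = max(scores, key=scores.get)
print(dict(scores), predicted)
```

Without the temporal weighting, the two "sports" neighbors would win; with it, the single contemporaneous "tech" neighbor prevails, illustrating how the decision rule discounts training documents from divergent time points.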
The three strategies described were implemented using three classifiers, namely Rocchio, KNN and Naïve Bayes.
Main Results

We experimentally evaluated the effectiveness of the proposed classifiers. To this end, we adopted a 10-fold cross-validation strategy and statistically validated the results using a two-tailed t-test, with 99% confidence.

The temporally robust classifiers achieved statistically significant improvements over the traditional approaches in most cases. As an example, consider Table 2. As we can observe, all the temporally robust versions of Rocchio and KNN obtained results statistically superior to those of their traditional versions (in terms of MacroF1 and MicroF1). We also observed that the temporal version of Naïve Bayes based on applying the TWF to scores incurred significant losses. We attribute this problem to the class imbalance artificially amplified by this strategy, as well as to the reduced number of training documents associated with the ⟨c, p⟩ classes available for producing accurate estimates of the data distribution. The extended strategy of applying the TWF to scores seeks to mitigate the imbalance problem (although it remains penalized by data scarcity).
Algorithm              Rocchio               KNN                   Naïve Bayes
Metric                 macF1(%)  micF1(%)    macF1(%)  micF1(%)    macF1(%)   micF1(%)
Traditional            57.39     68.24       58.48     71.84       57.27      73.24
TWF on documents       60.02     70.64       59.92     73.84       60.78      74.11
                       (+4.58)▲  (+3.52)▲    (+2.46)▲  (+2.78)▲    (+6.13)▲   (+1.19)•
TWF on scores          59.85     72.47       62.02     74.45       44.85      63.93
                       (+4.29)▲  (+6.20)▲    (+6.05)▲  (+3.63)▲    (-27.69)▼  (-14.56)▼
TWF on scores (ext.)   59.27     71.39       59.78     73.85       56.23      72.35
                       (+3.28)▲  (+4.62)▲    (+2.22)▲  (+2.80)▲    (-1.84)•   (+1.23)•

Table 2: Results Obtained by Incorporating the Statistically Defined TWF into Rocchio, KNN and Naïve Bayes—ACM-DL (▲/▼: statistically significant gain/loss over the traditional version; •: statistical tie).
We also evaluated the use of the automated strategy for determining the TWF. For illustration purposes, Table 3 reports the results on the ACM-DL collection obtained by the temporally robust classifiers using the TWF determined by this strategy. Indeed, the automatic TWF determination procedure proved effective: its use yielded results statistically equivalent to those obtained with the statistically determined TWF, as can be observed by contrasting Tables 2 and 3. We also compared the effectiveness of this strategy when using the whole training set or only 10% of it to determine the TWF (rows "100% of D" and "10% of D", respectively). As we can observe, with only 10% of the training set it is possible to determine the TWF accurately and obtain results statistically equivalent to those obtained using the whole training set. Clearly, determining the TWF with a reduced training sample leads to a drastic reduction in execution time. For example, determining the TWF using Rocchio takes 4.49 ± 0.04 seconds when using the whole training set, whereas with 10% of it the execution time drops to only 0.77 ± 0.02 seconds, a negligible value compared to the time spent on the classification task itself.
Finally, we compared our best temporal classifiers with the state-of-the-art SVM in terms of effectiveness and efficiency. As can be observed in Table 4, our best classifiers showed effectiveness statistically equivalent (or even superior) to that of SVM, with a much lower execution time (given by the time spent on both training and testing), even considering the fact that the temporal classifiers incur an overhead from taking the temporal aspect into account and are naturally lazy classifiers. Clearly, this attests to the quality of the proposed solutions.
Algorithm                         Rocchio               KNN                   Naïve Bayes
Metric                            macF1(%)  micF1(%)    macF1(%)  micF1(%)    macF1(%)   micF1(%)
Traditional                       57.39     68.24       58.48     71.84       57.27      73.24
TWF (100% of D) on documents      60.21     70.70       60.08     73.88       61.38      74.60
                                  (+4.91)▲  (+3.60)▲    (+2.74)▲  (+2.84)▲    (+7.18)▲   (+1.86)•
TWF (10% of D) on documents       60.52     70.88       61.02     74.27       61.44      74.24
                                  (+5.45)▲  (+3.87)▲    (+4.84)▲  (+3.82)▲    (+7.28)▲   (+1.36)•
TWF (100% of D) on scores         60.47     72.90       61.88     74.53       45.16      64.55
                                  (+5.47)▲  (+6.83)▲    (+5.81)▲  (+3.74)▲    (-26.82)▼  (-13.46)▼
TWF (10% of D) on scores          59.68     72.40       61.37     73.77       44.47      64.58
                                  (+3.99)▲  (+6.10)▲    (+4.94)▲  (+2.69)▲    (-28.78)▼  (-13.41)▼
TWF (100% of D) on scores (ext.)  59.96     71.99       59.80     73.95       56.28      72.73
                                  (+4.48)▲  (+5.49)▲    (+2.26)▲  (+2.94)▲    (-1.76)•   (-0.70)•
TWF (10% of D) on scores (ext.)   59.85     71.79       59.76     73.85       56.19      72.70
                                  (+4.29)▲  (+5.20)▲    (+2.19)▲  (+2.80)▲    (-1.89)•   (-0.74)•

Table 3: Results Obtained by Incorporating the Automatically Defined TWF into Rocchio, KNN and Naïve Bayes—ACM-DL (▲/▼: statistically significant gain/loss over the traditional version; •: statistical tie).
Algorithm                           macF1(%)         micF1(%)         Time (s)
SVM                                 59.91            73.88            144.10±5.30
Rocchio with TWF on scores          60.47 (+0.93)•   72.90 (−1.34)•   9.00±0.00
KNN with TWF on documents           59.78 (−0.22)•   73.88 (+0.00)•   11.03±0.48
KNN with TWF on scores              61.88 (+3.29)▲   74.53 (+0.88)•   10.10±0.31
Naïve Bayes with TWF on documents   61.38 (+2.45)▲   74.60 (+0.97)•   9.10±0.32

Table 4: Best temporal classifiers versus SVM—ACM-DL (▲: statistically significant gain over SVM; •: statistical tie).
Conclusions

In this work we presented a quantitative analysis of the impact of temporal effects on four ADC algorithms widely used by the Machine Learning community, applied to three real textual collections with potentially distinct temporal dynamics. We showed that, contrary to the assumption adopted by most learning algorithms that the data follow a static distribution, the studied collections exhibit distinct temporal dynamics, with variations in the data distribution. Such temporal variations potentially limit the effectiveness of the classifiers. Indeed, the conducted analysis showed that all four studied classifiers were negatively affected by the temporal effects, with the most prominent degradations observed when they were applied to the most dynamic collections (ACM-DL and AG-NEWS). The temporal dimension thus stands as an important aspect to be considered in order to provide accurate classifiers.
Besides quantifying the impact of the temporal effects on ADC algorithms, we proposed three strategies to minimize that impact. These strategies rely on the application of what we call the Temporal Weighting Function (TWF). We proposed both a statistical methodology and an automated procedure to determine the TWF. The results obtained with the temporally robust classifiers showed that taking temporal information into account leads to statistically significant gains over the traditional approaches. Furthermore, the proposed classifiers that achieved the best results proved competitive with the state-of-the-art SVM classifier, both in terms of effectiveness and in terms of execution time.
List of Figures
4.1 Class Distributions in the Three Reference Datasets . . . . . . . . . . . . . . 27
4.2 Class Distribution Temporal Variation in Each Reference Dataset . . . . . . . 31
4.3 Term Distribution Temporal Variation of Each Reference Dataset . . . . . . . 32
4.4 Determining the Lower and Upper Levels of CD and CS—ACM-DL . . . . . 43
4.5 Determining the Lower and Upper Levels of TD—ACM-DL . . . . . . . . . 44
4.6 Determining the Lower and Upper Levels of CD and CS—MEDLINE . . . . 46
4.7 Determining the Lower and Upper Levels of TD—MEDLINE . . . . . . . . . 47
4.8 Determining the Lower and Upper Levels of CD and CS—AG-NEWS . . . . 48
4.9 Determining the Lower and Upper Levels of TD—AG-NEWS . . . . . . . . 49
4.10 Cumulative Distribution Function of Document Stability Level Values . . . . 54
5.1 Dδ Distribution (Scaled to the [0, 1] Interval) . . . . . . . . . . . . . . . . . 66
5.2 Fitted Temporal Weighting Function with Log-Transformed Data . . . . . . . 68
5.3 Estimated Temporal Weighting Function . . . . . . . . . . . . . . . . . . . . 71
5.4 Graphical Representation of TWF in Documents . . . . . . . . . . . . . . . . 72
5.5 Graphical Representation of TWF in Scores . . . . . . . . . . . . . . . . . . 75
5.6 Graphical Representation of Extended TWF in Scores . . . . . . . . . . . . . 77
5.7 Relative ⟨c, p⟩ Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Relative ⟨c, p⟩ Sizes for the AG-NEWS Dataset . . . . . . . . . . . . . . . . 88
List of Tables
2.1 Contingency Table for Classification Effectiveness Evaluation . . . . . . . . 11
4.1 Adopted Class Identifiers for each Reference Dataset . . . . . . . . . . . . . 26
4.2 Pairwise Class Similarity (standard deviations) in ACM-DL . . . . . . . . . 33
4.3 Pairwise Class Similarity (standard deviations) in MEDLINE . . . . . . . . . 33
4.4 Pairwise Class Similarity (standard deviations) in AG-NEWS . . . . . . . . . 34
4.5 Factorial Design—ACM-DL . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Factorial Design—MEDLINE . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Factorial Design—AG-NEWS . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 Comparative Study: The Impact of the Temporal Effects on the ADC Algorithms 57
5.1 D'Agostino's D-Statistic Test of Normality . . . . . . . . . . . . . . . . . . 66
5.2 Temporal Distances versus Terms . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Estimated Parameters for Both Datasets, with 99% Confidence Intervals . . . 67
5.4 Results Obtained with the Statistically Defined TWF—ACM-DL . . . . . . . 81
5.5 Results Obtained with the Statistically Defined TWF—MEDLINE . . . . . . 81
5.6 Results Obtained for the Least and Most Frequent Classes ⟨c, p⟩ Sampling for
    Naïve Bayes—MEDLINE . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 Results Obtained with the Estimated TWF—ACM-DL . . . . . . . . . . . . 86
5.8 Results Obtained with the Estimated TWF—MEDLINE . . . . . . . . . . . 86
5.9 Results Obtained with the Estimated TWF—AG-NEWS . . . . . . . . . . . 87
5.10 Effectiveness Comparison: Best Performing Temporally-Aware Classifiers
    versus SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.11 Execution Time (in seconds) of each Explored ADC Algorithm . . . . . . . 91
5.12 Execution Time Comparison: Best Performing Temporally-Aware Classifiers
    versus SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.13 Execution Time of the TWF Estimation using the Rocchio Classifier . . . . . 93
List of Algorithms
1 Factorial Design Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2 Automatic TWF Determination . . . . . . . . . . . . . . . . . . . . . . . . . 70
3 Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents . . . . 73
4 KNN-TWF-Doc: KNN with Temporal Weighting in Documents . . . . . . . 74
5 Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents 75
6 TWF-Sc: Temporal Weighting in Scores . . . . . . . . . . . . . . . . . . . . 76
7 TWF-Sc-Ext: Extended Temporal Weighting in Scores . . . . . . . . . . . . 78
Contents
Resumo xi
Abstract xiii
Resumo Estendido xv
List of Figures xxvii
List of Tables xxix
List of Algorithms xxxi
1 Introduction 1
1.1 Context and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Dissertation Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Work Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Preliminaries: Basic Concepts 9
2.1 Automatic Document Classification . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Evaluation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Temporal Representation of Documents . . . . . . . . . . . . . . . . . . . . 13

3 Related Work 15
3.1 Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Strategies Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    3.2.1 Detecting Data Variations . . . . . . . . . . . . . . . . . . . . . . . 17
    3.2.2 Dealing with Data Variations . . . . . . . . . . . . . . . . . . . . . 17
    3.2.3 Characterizing Data Variations . . . . . . . . . . . . . . . . . . . . 20
3.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 A Quantitative Analysis of Temporal Effects on ADC 23
4.1 Experimental Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    4.1.1 Reference Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    4.1.2 ADC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Characterization of Temporal Effects on Textual Datasets . . . . . . . . . . 29
    4.2.1 Class Distribution Temporal Variation . . . . . . . . . . . . . . . . 30
    4.2.2 Term Distribution Temporal Variation . . . . . . . . . . . . . . . . 31
    4.2.3 Class Similarity Temporal Variation . . . . . . . . . . . . . . . . . 32
4.3 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    4.3.1 Factorial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    4.3.2 Applying the 2^k r Design in the Characterization of Temporal Effects 38
    4.3.3 Quantifying the Impact of Temporal Effects on ADC . . . . . . . . 42
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
    4.4.1 Impact of Temporal Effects on the Reference Datasets . . . . . . . . 51
    4.4.2 Impact of Temporal Effects on the ADC Algorithms . . . . . . . . . 53
    4.4.3 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5 Temporally-Aware Algorithms for ADC 61
5.1 Temporal Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Fully-Automated TWF Definition . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Temporally-aware ADC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
    5.3.1 Temporal Weighting in Documents . . . . . . . . . . . . . . . . . . 72
    5.3.2 Temporal Weighting in Scores . . . . . . . . . . . . . . . . . . . . 74
    5.3.3 Extended Temporal Weighting in Scores . . . . . . . . . . . . . . . 76
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
    5.4.1 Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 79
    5.4.2 Experiments with the Statistically Defined TWF . . . . . . . . . . . 80
    5.4.3 Experiments with the Estimated TWF . . . . . . . . . . . . . . . . 85
    5.4.4 Runtime Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6 Conclusions and Future Work 95
6.1 A Quantitative Analysis of Temporal Effects on ADC . . . . . . . . . . . . 95
6.2 Temporally-Aware Algorithms for ADC . . . . . . . . . . . . . . . . . . . 96
    6.2.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Bibliography 101
Chapter 1
Introduction
In this chapter, we discuss the main motivations and arguments that support this work. We
also briefly describe our work and explicitly state our contributions.
1.1 Context and Motivation
Text classification is still one of the major information retrieval problems, and developing robust and accurate classification models remains in great demand as a consequence of the increasing complexity and scale of current application scenarios, such as the Web. The task of Automatic Document Classification (ADC) aims at creating models that associate documents with semantically meaningful categories. These models are key components for supporting and enhancing a variety of other tasks, such as automated topic tagging (that is, assigning labels to documents), building topic directories, identifying the writing style of a document, organizing digital libraries, improving the precision of Web search, and even helping users to interact with search engines.

Similarly to other machine learning techniques, ADC usually follows a supervised learning strategy: a training set of already classified documents is employed for creating a classifier. Once built, the classifier is used for predicting classes for a new set of unclassified documents. The majority of supervised algorithms consider that all (pre-classified) documents provide equally important information to discover the features that better identify a (previously unclassified) document's class. However, this may not hold in practice due to several factors, such as the document's timeliness, the venue in which it was published, and its authors, among others (de M. Palotti et al., 2010).
1.2 Dissertation Hypothesis
In the following, we state the fundamental hypotheses that serve as guidance to this work:
• The temporal evolution of textual data limits the performance of ADC classifiers;
• Distinct textual datasets present differing dynamical behavior;
• Different ADC algorithms may be distinctively affected by the temporal evolution of
data;
• The temporal evolution of data may be explored to devise more effective classification models.
1.3 Work Description
In this work, we are particularly concerned with the impact that the temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but also the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. Consider, for instance, the terms pheromone and ant colony. Before the 1990s, they referred exclusively to documents in the area of Natural Sciences. However, after the introduction of the Ant Colony Optimization technique in the area of Artificial Intelligence, these terms became relevant for classifying Computer Science documents too. In such scenarios, classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account when learning more accurate classification models.
As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.
Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how, and to which extent, these temporal effects really impact ADC algorithms has not been performed yet. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be distinctively affected by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, making the learning of a more accurate classification model that deals with these effects an even more challenging task.
In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects are more influential in each dataset? (ii) How does each ADC algorithm behave when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this work, we take a step further towards answering the posed questions by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.
Hence, the first part of this dissertation aims at quantifying the impact of temporal effects on ADC algorithms and provides as contributions: (i) a re-visitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well-known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).
As we shall see, there is a higher impact of the temporal effects in the ACM-DL and AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset, the impact of the class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamical behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the most dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.
Based on the performed quantitative analysis of the impact of temporal effects on ADC algorithms, the second part of this dissertation focuses on how to minimize their impact on ADC algorithms. We propose a strategy to incorporate temporal information into document classifiers, aiming at improving their effectiveness by properly handling data with varying distributions. Our strategy is based on the evolution of the term-class relationship over time, captured by a metric of dominance. We start by determining a temporal weighting function (TWF) for a collection according to its characteristics, based on a series of statistical tests performed to determine its expression, and a curve fitting procedure to determine its parameters. We found that this function follows a lognormal distribution for two datasets we used, namely ACM-DL and MEDLINE. However, the set of statistical tests performed to define the TWF expressions for the ACM-DL and MEDLINE datasets was not able to properly define the TWF expression for the AG-NEWS dataset, which does not follow a (log-)normal distribution. Indeed, the required tests may be prohibitively complex to perform depending on the dataset characteristics, limiting the practical applicability of this strategy. Thus, we also propose an automatic procedure to learn the TWF, without the need to perform such statistical tests.
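To make the curve-fitting step concrete, the following sketch recovers the parameters (μ, σ) of a lognormal-shaped TWF from observed (temporal distance, weight) pairs. It exploits the fact that the log of a lognormal density is quadratic in log δ, fitting that quadratic by least squares; the function names and this particular fitting shortcut are illustrative assumptions, not the dissertation's actual statistical procedure.

```python
import math

def lognormal_twf(delta, mu, sigma):
    """Lognormal-shaped temporal weighting function of the temporal
    distance delta (the functional form reported for ACM-DL/MEDLINE)."""
    return (1.0 / (delta * sigma * math.sqrt(2 * math.pi))) * \
        math.exp(-(math.log(delta) - mu) ** 2 / (2 * sigma ** 2))

def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [v] for row, v in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    x = [0.0] * 3
    for i in range(2, -1, -1):
        x[i] = (m[i][3] - sum(m[i][c] * x[c] for c in range(i + 1, 3))) / m[i][i]
    return x

def fit_twf(distances, weights):
    """Least-squares fit of (mu, sigma): ln w is quadratic in x = ln(delta),
    so fit ln w = a*x^2 + b*x + c via the normal equations and invert."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(w) for w in weights]
    S = lambda k: sum(x ** k for x in xs)
    A = [[S(4), S(3), S(2)], [S(3), S(2), S(1)], [S(2), S(1), float(len(xs))]]
    rhs = [sum((x ** 2) * y for x, y in zip(xs, ys)),
           sum(x * y for x, y in zip(xs, ys)),
           sum(ys)]
    a, b, _ = _solve3(A, rhs)
    sigma2 = -1.0 / (2.0 * a)   # a = -1 / (2 * sigma^2)
    mu = (b + 1.0) * sigma2     # b = mu / sigma^2 - 1
    return mu, math.sqrt(sigma2)
```

Because the fitted curve is linear in its quadratic coefficients, the estimate is exact when the observed weights follow the lognormal shape, and degrades gracefully to a least-squares approximation otherwise.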
The final step is to incorporate the temporal weighting function into ADC algorithms, and we propose three strategies that follow a lazy classification approach. In the three strategies, the weights assigned to each example depend on the notion of a temporal distance δ, defined as the difference between the creation time p of a training example and a reference time point p_r. The first strategy, named temporal weighting in documents, weights training instances according to δ. The second strategy, called temporal weighting in scores, takes into account the scores (e.g., similarities, probabilities) produced by a traditional classifier applied to a modified training set where the class c of each training document is mapped to a derived class c ↦ ⟨c, p⟩, with p denoting the training document's creation point in time, ultimately tying together the observed patterns and both the class and temporal information. A weighted sum of the learned scores is then performed, according to the TWF, and used to make the final classification decision. Finally, the third strategy, named extended temporal weighting in scores, partitions the training set D into sub-groups of documents D_p with the same creation point in time p. Then, a classification model is built based on each D_p in isolation. The class scores are then produced for each D_p and, as before, they are aggregated using the TWF to weight them. We specifically show how these strategies are implemented in three traditional ADC algorithms, namely Rocchio, k Nearest Neighbors (KNN), and Naïve Bayes.
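As a minimal illustration of the score-aggregation idea behind the extended temporal weighting in scores, the sketch below combines per-creation-time model scores weighted by the TWF of the temporal distance. The data structures and names here are illustrative assumptions, not the dissertation's implementation.

```python
def classify_with_twf(doc, models, twf, p_ref):
    """Extended temporal weighting in scores (a minimal sketch).

    models: maps a creation time point p to a scoring function
            doc -> {class: score}, trained on the partition D_p.
    twf:    maps a temporal distance delta to a weight.
    p_ref:  reference time point of the test document.
    Returns the class with the highest TWF-weighted combined score.
    """
    combined = {}
    for p, score_fn in models.items():
        w = twf(abs(p_ref - p))  # TWF weight for this partition
        for c, s in score_fn(doc).items():
            combined[c] = combined.get(c, 0.0) + w * s
    return max(combined, key=combined.get)
```

A toy usage: with one model trained on documents from 2000 favoring class "A" and another from 1990 favoring "B", a test document referenced at 2001 is pulled toward "A" because the TWF discounts the temporally distant model.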
We evaluated our strategies using three real textual datasets that span decades (ACM-DL and MEDLINE) or several months (AG-NEWS). The temporally-aware classifiers achieved significant improvements in classification effectiveness, even matching or outperforming the state-of-the-art SVM classifier in some cases, with a drastically reduced execution time.
1.4 Contributions
The specific contributions of this work are:
• a quantification of the impact of three main temporal effects on four widely used ADC algorithms. More specifically,

  – we re-visit the characterization reported in (Mourão et al., 2008), by including a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of variations in textual data;

  – we propose a methodology to enable a deeper study of the three temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset;
  – we instantiate that methodology considering three real textual datasets and four ADC algorithms, and provide a detailed study regarding the impact of the temporal effects on them;

• the proposal of strategies to minimize the impact of the temporal effects in ADC algorithms. Again, more specifically,

  – we introduce a temporal weighting function to capture the varying behavior of textual datasets, and propose two strategies to devise it;

  – we extend three well-known ADC algorithms to incorporate such a function, devising the temporally-aware algorithms for ADC;

  – we perform an extensive experimental analysis in order to assess the benefits of considering the temporal dynamics of data.
In the following we enumerate the work already published as direct contributions of this dissertation, along with some work published during the M.Sc. course:
• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314, Geneva, Switzerland, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Classificação Automática de Documentos Robusta Temporalmente. In XXIV Simpósio Brasileiro de Banco de Dados, pages 106–119, Fortaleza, Brazil, 2009.

• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. A Quantitative Analysis of the Temporal Effects on Automatic Document Classification. Journal of Machine Learning Research, 2011 (submitted).

• Pappa, G. L., Zadrozny, B., Rocha, L., Salles, T., Meira Jr., W., Gonçalves, M. A. Exploiting Contexts to Deal with Uncertainty in Classification. In Proceedings of the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, pages 19–22, Paris, France, 2009.
• de M. Palotti, J. R., Salles, T., Pappa, G. L., Gonçalves, M. A., and Meira Jr., W. Assessing Documents' Credibility with Genetic Programming. IEEE Congress on Evolutionary Computation, 2011 (to appear).

• de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454, 2010.

• Figueiredo, F., Rocha, L., Couto, T., Salles, T., Gonçalves, M. A., Meira Jr., W. Word Co-occurrence Features for Text Classification. Information Systems, 2011 (in press).
1.5 Roadmap
This work is structured in six chapters. The remainder of this work is organized as follows.
Chapter 2: In this chapter we briefly describe the supervised ADC task and some evaluation
strategies. We also present some of the notational conventions adopted in this work.
Chapter 3: In this chapter we describe related work. We start by discussing some of the application scenarios where time is an important aspect to be considered. Then, we discuss some of the efforts towards either detecting or handling variations in the data distribution. We distinguish two broad areas for doing so: concept drift and adaptive document classification.
Chapter 4: In this chapter we provide evidence of the existence of temporal effects. We provide an extensive characterization of the properties of three textual datasets with respect to the extent of each temporal effect on them, and quantify the impact of the temporal effects on four well-known ADC algorithms (i.e., Rocchio, K Nearest Neighbors, Naïve Bayes and Support Vector Machine).
Chapter 5: In this chapter we propose three strategies, based on a temporal weighting function (TWF), to address and minimize the impact of the temporal effects in extended versions of three ADC algorithms. We start by introducing the TWF and proposing two strategies to determine it. Then, we describe how to modify three ADC algorithms (namely, Rocchio, K Nearest Neighbors and Naïve Bayes) in order to incorporate the TWF into them, proposing three strategies for doing so.
Chapter 6: Finally, in this chapter we conclude the dissertation, summarize our main find-
ings and propose some directions for further investigation.
Chapter 2
Preliminaries: Basic Concepts
In this work, we are mainly concerned with Automatic Document Classification (ADC), a well-studied subject related to the classification problem,¹ considering a supervised learning paradigm. This section serves two main purposes: (i) to briefly describe the supervised ADC task and some evaluation strategies, in order to provide the reader with some basic notions on the subject; and (ii) to present some notational conventions adopted in this work.
2.1 Automatic Document Classification
The purpose of supervised ADC algorithms is to predict the unknown class of a document,
based on a set of already classified documents (Sebastiani, 2002). Let di = (~xi, ci) be
a document, where~xi denotes its vectorial (bag of words) representation andci ∈ C a
categorical attribute (or response variable) indicating its class (C is a finite set composed
by all the possible classes). The main goal of an ADC algorithm is thus to learn a discrete
approximation of the class a posteriori probability distributionP (ci|di), which underlies the
relationships between documents and their associated classes. This probability distribution
is learned according to a training set composed by already classified documents. There are
two approaches for doing so, either based on a direct estimation of P (ci|di), or based on an
indirect estimation ofP (ci|di).
The first approach, which defines the so-called discriminative classifiers, learns the class boundaries that minimize the error rate (or some correlated measure), ultimately discriminating between classes without making any assumption regarding the probability density function of each class. The second approach, which defines the generative classifiers, learns the class conditional probability distribution and the a priori class probabilities to estimate the class a posteriori probability distribution $P(c_i \mid d_i)$. In this case, one should assume a model for the class densities $P(d_i \mid c_i)$, whose parameters are estimated from the training set. For example, a normal distribution may be chosen, and its mean and variance parameters estimated according to the already classified data. Then the class a posteriori probability distribution $P(c_i \mid d_i)$ is estimated according to Bayes' rule:

$$P(c_i \mid d_i) = \frac{P(c_i)\, P(d_i \mid c_i)}{\sum_{c' \in C} P(c')\, P(d_i \mid c')}, \qquad (2.1)$$

where $P(c_i)$ denotes the class priors and $P(d_i \mid c_i)$ denotes the class densities.

¹Also known as the discrimination problem in the statistics literature.
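For concreteness, Eq. 2.1 can be evaluated directly once class priors and class density estimates are available. The sketch below (with illustrative names, not tied to any specific generative model) normalizes the joint probabilities $P(c) \cdot P(d \mid c)$ over all classes:

```python
def posterior(priors, likelihoods, doc):
    """Class a posteriori probabilities via Bayes' rule (Eq. 2.1).

    priors:      {class: P(c)}, the a priori class probabilities.
    likelihoods: {class: fn}, where fn(doc) approximates P(d | c)
                 under some assumed class density model.
    Returns {class: P(c | d)}, normalized over all classes.
    """
    joint = {c: priors[c] * likelihoods[c](doc) for c in priors}
    z = sum(joint.values())  # the denominator of Eq. 2.1
    return {c: v / z for c, v in joint.items()}
```

With uniform priors, the posterior reduces to the normalized likelihoods, which makes the role of the prior in Eq. 2.1 easy to see.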
Informally, given a training set of already classified documents with feature measurements, we build a classification model, or learner, which enables us to classify a new, unseen document. A good learner is one that accurately predicts such a class. From the perspective of function approximation, this translates into finding a good approximation $\hat{f}$ of $f : D_U \mapsto C$, which underlies the predictive relationship between the documents and their associated classes, based on the training set $D \subset D_U$, where $D_U$ denotes the input space composed of both classified and unclassified documents.

In order to assess how good an approximation is, one should consider the generalization capabilities of the approximated $\hat{f}$. Recall that $\hat{f}$ is an approximation based on the training set, that is, $\hat{f} : D \mapsto C$. The quality of such an approximation refers to how well $\hat{f}$ predicts the classes of unseen documents (i.e., documents $d' \notin D$), which is assessed by the generalization capability of $\hat{f}$. Clearly, a function $\hat{f}$ that accurately predicts the class of documents from $D$ may not be accurate in predicting the class of documents from $D_U \setminus D$ (i.e., the set of unclassified documents).² In this case, we say that $\hat{f}$ is overfitted w.r.t. $D$. Hence, there exists a trade-off between the complexity of $\hat{f}$ (the more complex $\hat{f}$ is, the more specific are the patterns it learns from the training set) and the generalization power of $\hat{f}$ (overly specific patterns observed in $D$ may not be observed in $D_U \setminus D$).
It has been proved that, asymptotically, the discriminative classifiers are superior to the generative ones (Vapnik, 1998), with several reported experiments corroborating this superiority (Drummond, 2006). In fact, if there are not enough training examples, the parametric model is prone to overfit, decreasing its generalization power (Hastie et al., 2009). However, some authors claim, based on experimental evaluation, that with realistic training set sizes the generative classifiers can perform as well as or better than the discriminative ones. This holds if the parametric model assumed by the generative classifier is correct; in this case, the class priors become useful information that is ignored by the discriminative classifiers. As will be described in Section 4.1, in this work we consider both generative classifiers (represented by the Naïve Bayes classifier) and discriminative classifiers (represented by the Rocchio, K Nearest Neighbors and Support Vector Machine classifiers).

²$A \setminus B$ denotes the set difference between $A$ and $B$, i.e., the set composed of the elements in $A$ but not in $B$.
2.2 Evaluation Techniques
An important aspect to be considered is how to evaluate the effectiveness of a classifier (that is, its accuracy in classifying unseen data or, in other words, its generalization power). This is assessed by first learning a classification model based on the training set and then applying it to classify a set of unseen documents (the test set). Some measures of classification effectiveness are then used to assess the quality of the learned classification model. Several measures for this purpose have been proposed in the literature and some of them are widely used by the machine learning community. Perhaps the most used are precision, recall and the F1 measure. In order to describe each of these measures, consider the contingency table represented in Table 2.1 (also known as a confusion matrix), where $TP$, $TN$, $FP$ and $FN$ denote, respectively, the number of true positives, true negatives, false positives and false negatives, defined as:

True Positive (TP): positive test document correctly classified into the positive class;

True Negative (TN): negative test document correctly classified into the negative class;

False Positive (FP): negative test document incorrectly classified into the positive class;

False Negative (FN): positive test document incorrectly classified into the negative class.
The precision $p$ of a performed classification denotes the fraction of all documents assigned to the positive class $c_i$ by the classifier that really belong to $c_i$. In terms of the contingency table, this translates into

$$p = \frac{TP}{TP + FP}.$$
                              Ground Truth
                        Class = $c_i$    Not $c_i$
  Prediction  $c_i$          TP              FP
              Not $c_i$      FN              TN

Table 2.1: Contingency Table for Classification Effectiveness Evaluation.
The recall r of a performed classification denotes the fraction of all documents that belong to the positive class c_i that were correctly assigned to c_i by the classifier. Again, in terms of the contingency table, this can be expressed as

$$r = \frac{TP}{TP + FN}.$$

Finally, the F1 measure is defined as the harmonic mean of the precision and the recall, given by

$$F_1 = \frac{2pr}{p + r}.$$
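The three measures can be computed directly from the contingency-table counts. The sketch below is illustrative; the function name and the convention of returning 0 for empty denominators are our own choices, not from the dissertation:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from contingency-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0      # fraction of predicted positives that are correct
    r = tp / (tp + fn) if tp + fn else 0.0      # fraction of actual positives that are found
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f1

# Example: 80 true positives, 20 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
# p = 0.8, r ≈ 0.889, F1 ≈ 0.842
```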
There are two conventional methods to evaluate classification algorithms when applied to problems with more than two classes, namely by micro-averaging and macro-averaging the F1 measure. The micro-averaged F1 (microF1) is calculated from a global contingency table (similar to Table 2.1), with the precision and recall being calculated from the sums over all classes of each entry of the table:

$$p_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad r_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}.$$
In contrast, the macro-averaged F1 (macroF1) is calculated by first computing the precision and recall values for each class and then averaging them:

$$p_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i}, \qquad r_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i}.$$

Notice that the main difference between the two strategies is that microF1 is a document-pivoted measure that gives equal weight to the documents, while macroF1 is a class-pivoted measure that gives equal weight to the classes.
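The document-pivoted versus class-pivoted distinction can be made concrete with a small sketch. The per-class count representation below is our own; the macroF1 follows the dissertation's definition, i.e., the harmonic mean of the macro-averaged precision and recall, rather than the average of per-class F1 scores:

```python
def micro_macro_f1(counts):
    """counts: one (TP, FP, FN) tuple per class."""
    # microF1: aggregate the contingency tables first, then compute p and r once.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p_micro = tp / (tp + fp)
    r_micro = tp / (tp + fn)
    micro_f1 = 2 * p_micro * r_micro / (p_micro + r_micro)
    # macroF1: compute p and r per class, then average across classes.
    p_macro = sum(c[0] / (c[0] + c[1]) for c in counts) / len(counts)
    r_macro = sum(c[0] / (c[0] + c[2]) for c in counts) / len(counts)
    macro_f1 = 2 * p_macro * r_macro / (p_macro + r_macro)
    return micro_f1, macro_f1

# A large, well-classified class dominates microF1, while the poorly
# classified small class drags macroF1 down:
micro, macro = micro_macro_f1([(900, 10, 10), (5, 20, 20)])
# micro ≈ 0.97, macro ≈ 0.59
```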
Since the ADC task is inherently a stochastic process, it is fundamental to adopt evaluation strategies that guarantee the statistical validity of the obtained classification results, which is achieved by replicating the experiments using different training sets to learn a classification model. For this purpose, the cross validation strategy has become a standard in the machine learning community. There are at least two usual strategies for cross validation: K-fold cross validation and repeated random sub-sampling (Kohavi, 1995).
K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. Repeated random sub-sampling consists of randomly selecting a fraction of documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set; this is performed for each replication. Since in K-fold cross validation the size of the folds depends on the number of iterations, it is more suitable for medium/large-sized datasets, while repeated random sub-sampling is usually adopted for small-sized datasets when the number of replications is large.
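Both strategies can be sketched as follows (a minimal illustration; the function names and the fixed seed are our own choices):

```python
import random

def kfold_splits(indices, k, seed=0):
    """K-fold: shuffle once, partition into K disjoint folds; each fold
    serves as the test set in exactly one iteration."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [d for j in range(k) if j != i for d in folds[j]]
        yield train, folds[i]

def subsampling_splits(indices, test_fraction, replications, seed=0):
    """Repeated random sub-sampling: per replication, draw a test set
    without replacement; the remaining documents form the training set."""
    rng = random.Random(seed)
    n_test = int(len(indices) * test_fraction)
    for _ in range(replications):
        idx = list(indices)
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]
```

Note that the K test folds are pairwise disjoint, whereas sub-sampling test sets may overlap across replications, which is why the latter suits small datasets evaluated with many replications.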
For more details on ADC and evaluation strategies, we refer the reader to
(Baeza-Yates and Ribeiro-Neto, 2011; Hastie et al., 2009; Manning et al., 2008).
2.3 Temporal Representation of Documents
In this work we deal with the documents' timeliness, represented by their creation points in time. We consider time as a discrete attribute associated to documents. Thus, we represent each document by a triple d_i = (x⃗_i, c_i, p_i), where x⃗_i denotes the vectorial "bag of words" representation of d_i, c_i denotes its associated class and p_i denotes its creation point in time.
An important aspect to consider is the temporal unit used. The temporal unit should be the minimum time interval between relevant changes observed in the data and is, clearly, dataset dependent. For example, since scientific conferences are usually annual, relevant changes usually occur yearly, and the temporal unit should be one year. On the other hand, the temporal unit for data from published news articles should be more fine-grained (e.g., one day or one month).
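The representation above can be sketched directly. The `Document` type and the year-to-unit helper below are hypothetical illustrations, not part of the dissertation's formalism:

```python
from collections import namedtuple

# A labeled document d_i = (x_i, c_i, p_i): bag-of-words vector x,
# class c, and discrete creation point p in the dataset's temporal unit.
Document = namedtuple("Document", ["x", "c", "p"])

def to_temporal_unit(year, origin=1980, unit_in_years=1):
    """Discretize a creation year into a time point (yearly unit by default)."""
    return (year - origin) // unit_in_years

d = Document(x={"svm": 2, "kernel": 1}, c="Theory of Computation",
             p=to_temporal_unit(1998))
# d.p == 18: the creation point counted in yearly units from 1980
```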
Chapter 3
Related Work
In this chapter, we discuss related work. First, we report efforts related to the dissertation's target problem, that is, the impact of varying data distributions on learning algorithms when applied to some important scenarios. Then, we focus our attention on works aimed at either detecting or dealing with such a problem.
3.1 Problem Overview
A fundamental assumption of the vast majority of automatic classifiers is that the data used to learn a classification model are random samples drawn independently and identically distributed (i.i.d.) from a stationary distribution that also governs the test data. However, this may not be the case. In fact, in many (perhaps most) real-world classification problems, the training data may not be randomly drawn from the same distribution as the test data (to which the classifier will be applied) when there are variations in the underlying data distribution. Hence, the success of classification algorithms may be diminished when faced with real-world time-varying data. As argued by Alonso et al. (2007), "time is an important dimension of any information space and can be very useful in information retrieval".
As analyzed by Kelly et al. (1999), the observed variations in the data distributions may be reflected in, at least, three aspects:

1. Varying a priori class probabilities — P(c_i);
2. Varying class a posteriori probabilities — P(c_i|d_i);
3. Varying class densities — P(d_i|c_i).
Notice that, according to Equation 2.1, since p(c_i|d_i) depends on p(d_i|c_i), both the generative and the discriminative classifiers that do assume a static underlying data distribution are deemed to be error prone when faced with non-stationary data. This problem becomes critical, as it is not hard to enumerate real-world examples of scenarios in which automatic classification procedures are applied to inherently dynamic data. For example, in spam filtering applications the ultimate goal is to filter out undesired spam messages. However, spammers actively change the nature of their messages to elude the spam filters, and developing strategies that take into account such dynamic behavior becomes a necessary task to guarantee the effectiveness of the filters (Fdez-Riverola et al., 2007).
Another example relates to the information filtering techniques employed by personal assistance applications aimed at personalizing the flow of information according to the user's interests. A specific type of information filtering technique is to recommend information items to users, according to their interests. This is accomplished by predicting which information items meet the users' interests, based on their profiles. Clearly, changes in user interests are problematic and should be addressed in order to guarantee effective recommendations. Thus, modeling the temporal dynamics of user interests should be a key concern when designing such systems. Indeed, there was recently an open competition for the best filtering algorithm to predict user ratings for films, based on previous ratings (the Netflix Prize). The winners of the contest explored the temporal aspect as one of the keys to the problem, considering that both movie popularities and user preferences change over time (Koren, 2010). This reinforces the importance of a proper handling of dynamical data. Another example is automatic credit card fraud detection (Wang et al., 2003), where previously observed patterns regarding fraudulent credit card transactions are used to learn a classification model that is able to predict the legitimacy of new transactions. However, such patterns also change over time, and this should be taken into account in order to avoid fraudulent transactions. It should be clear by now that variations in the data distribution pose an important problem to be tackled in order to improve the effectiveness of learning algorithms.
In this work, we focus on the temporal dynamics observed in textual datasets. As a matter of fact, due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of textual data may change over time (Mourão et al., 2008). As previously discussed, automatic document classifiers may have trouble with such kind of data. Thus, this work tackles the following problem:

Problem 1 (Problem Statement). The majority of automatic document classifiers assume a stationary data distribution. However, in many (perhaps most) real-world classification problems this premise is violated, making it an important task to consider the temporal dynamics of the data in order to boost the effectiveness of the classifiers.
3.2 Strategies Overview
Although ADC is a widely studied subject, the analysis of temporal aspects in this class of
algorithms is quite recent—it has been studied only in the last decade. Most previous studies
have focused on detecting and dealing with these effects to improve classification quality,
whereas we are aware of only one prior effort towards characterizing the impact of temporal
effects on ADC effectiveness.
3.2.1 Detecting Data Variations
We start by reviewing previous attempts to detect significant changes in the underlying data distribution due to temporal effects. Gama et al. (2004) presented a method to detect changes in the distribution of the training examples by means of an online classifier that performs a sequence of trials to perform the classification. On each trial, it makes some predictions and receives feedback accounting for the classification error, in order to detect significant changes in the data at hand. This approach is able to detect both gradual and abrupt changes. Similarly, Nishida and Yamauchi (2009) propose a system to detect and predict changing distributions by managing a set of offline and online classifiers to account for, respectively, data variations and classifiers' prediction errors. Furthermore, the system also performs a clustering step to allow the prediction of future variations. Other studies explore statistical tests to detect drift (Dries and Rückert, 2009; Nishida and Yamauchi, 2007). In (Dries and Rückert, 2009), for instance, the authors propose three adaptive tests that are capable of adapting to different (gradual or abrupt) changing behaviors. In (Nishida and Yamauchi, 2007), the authors propose to classify a set of examples belonging to a recent time window, and to compare the achieved accuracy against the one obtained with a global classifier that considers all available data. The basic idea is that statistically significant decreases in accuracy suggest data variations. Such a solution is able to quickly detect drift when the window size is small, at the cost of being susceptible to data sparseness.
3.2.2 Dealing with Data Variations
Previous efforts to deal with varying data distributions can be categorized into two broad areas, namely, adaptive document classification and concept drift.
3.2.2.1 Adaptive Document Classification
Adaptive document classification (Cohen and Singer, 1999) embodies a set of techniques to deal with changes in the underlying data distribution so as to improve the effectiveness of document classifiers through incremental and efficient adaptation of the classification models. Adaptive document classification brings three main challenges to document classification (Liu and Lu, 2002). The first one is the definition of a context and how it may be exploited to devise more accurate classification models. A context is a semantically significant set of documents. Previous research suggests that contexts may be determined through at least two strategies: identification of terms neighboring a certain keyword (Lawrence and Giles, 1998), and identification of terms that indicate the scope and semantics of the document (Caldwell et al., 2000). In our case, the strategies to deal with varying data distributions explore the stability of terms, which can be seen as a kind of (temporal) context, but at a finer granularity (i.e., terms). The second challenge is how to build the classification models incrementally (Kim et al., 2004), whereas the third challenge relates to the computational efficiency of the resulting classifiers. Here, we do not consider the incremental construction of classification models. Our temporally-aware classifiers use the temporal information to learn more accurate classification models, instead of updating them in an incremental fashion. This is a natural extension of our work that we intend to pursue in the future.
3.2.2.2 Concept Drift
Concept or topic drift (Tsymbal, 2004) comprises another relevant set of efforts to deal with varying data distributions in classification. A prevailing approach to address concept drift is to completely retrain the classifier according to a sliding window, which ultimately involves example selection techniques. A number of previous studies fall into this category. For instance, the method presented in (Klinkenberg and Joachims, 2000) maintains a window with examples sufficiently "close" to the current target concept, and automatically adjusts the window size so that the estimated generalization error is minimized. In (Žliobaite, 2009), a classification model is built using training examples which are close to the test examples in terms of both time and space. The methods presented in (Klinkenberg, 2004) either maintain an adaptive time window on the training data, select representative training examples, or weight them. Widmer and Kubat (1996) describe a set of algorithms that react to concept drift in a flexible way and can take advantage of situations where contexts reappear. The main idea of these algorithms is to keep only a window of currently trusted examples and hypotheses, and to store concept descriptions in order to reuse them if a previous context reappears. In (Rocha et al., 2008), the authors introduce the concept of temporal context, defined as a subset of the dataset that minimizes the impact of temporal effects on the performance of classifiers. They also propose an algorithm, named Chronos, to identify these contexts based on the stability of the terms in the training set. Temporal contexts are used to sample the training examples for the classification process, and examples considered to be outside the temporal context are discarded by the classifier.
Unlike previous efforts that use a single window to determine drift in the data, Lazarescu et al. (2004) present a method that uses three windows of different sizes to estimate the change in the data. While algorithms that use a window of fixed size impose hard constraints on drift patterns, those that use heuristics to adjust the window size to the current extent of concept drift often involve many parameters to be calibrated. In order to provide some theoretical basis for the choice of window size, Kuncheva and Žliobaite (2009) developed a framework relating the classification error to the window size, aiming at providing an optimal window size choice. Such an optimal choice leads to statistically significant improvements in window-based strategies. Following this direction, in (Bifet and Gavaldà, 2006) the authors propose a window-based strategy for drifting data streams, called ADWIN, that automatically chooses the optimal window size. This approach keeps a window W with the most recent data and splits it into two adjacent sub-windows W0 and W1. Using statistical tests to compare both windows, it detects when a drift occurred. In this case, all possible adjacent sub-windows must be considered. Clearly, this is a costly operation (both in terms of time and memory). In (Bifet and Gavaldà, 2007), the authors propose an improvement over ADWIN, called ADWIN2, with the same effectiveness guarantees as ADWIN and more efficient data structures.
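The window-splitting idea can be illustrated with a much-simplified sketch. This is not the actual ADWIN algorithm of Bifet and Gavaldà; the Hoeffding-style threshold and the parameters below are our own illustrative choices:

```python
import math

def detect_drift(window, delta=0.05):
    """Split the window W of 0/1 error indicators into every pair of adjacent
    sub-windows (W0, W1); flag drift when their means differ by more than a
    Hoeffding-style threshold. Checking every cut is the costly step that
    ADWIN2 later alleviates with better data structures."""
    n = len(window)
    for cut in range(1, n):
        w0, w1 = window[:cut], window[cut:]
        m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))  # harmonic mean of sub-window sizes
        eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
        if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) > eps:
            return cut  # drift detected at this split point
    return None

# An error stream that jumps from ~0% to ~100% triggers detection,
# while a stable stream does not:
assert detect_drift([0] * 30 + [1] * 30) is not None
assert detect_drift([0] * 60) is None
```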
Window-based approaches may be considered too rigid, since they may miss valuable information lying outside of the window. Accordingly, a second approach to deal with concept drift consists in properly weighting training examples while building the classification model, in order to reflect the temporal variations in the underlying data distribution instead of simply discarding them.1 Following this direction, Koychev (2000) defined a linear time-based utility function to account for variations in the data distribution, such that the impact of the examples on the classification model decreases with time. Experimental evaluation conducted with the Naïve Bayes and ID3 algorithms showed the effectiveness of such an approach. In (Klinkenberg and Rüping, 2003), the authors defined an exponential time-based function in order to weight examples based on their age. The reported experimental evaluation showed that weighting examples in drifting scenarios leads to significant improvements over fixed-window strategies, while being outperformed by an adaptive-window approach. However, such time-based utility functions are typically defined in a very ad-hoc manner (e.g., linear functions, exponential functions, etc.), without any theoretical justification built from changes in data patterns.
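The linear and exponential shapes mentioned above can be written down directly. This is a sketch; the parameterizations are illustrative, not the exact functions of Koychev (2000) or Klinkenberg and Rüping (2003):

```python
import math

def linear_weight(age, max_age):
    """Linear time-based utility: decays linearly with the example's age
    (in temporal units) and vanishes at max_age."""
    return max(0.0, 1.0 - age / max_age)

def exponential_weight(age, rate=0.1):
    """Exponential time-based utility: decays as exp(-rate * age)."""
    return math.exp(-rate * age)

# Weights for training examples 0, 5 and 10 temporal units older than the test:
linear = [linear_weight(a, 10) for a in (0, 5, 10)]   # [1.0, 0.5, 0.0]
expo = [exponential_weight(a) for a in (0, 5, 10)]
```

Both shapes are monotonically decreasing in the example's age; the open question raised next is how to justify any particular shape from the data itself.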
Thus, the following question remains unanswered: how can we properly define such a time-based utility function? In order to answer that question, not only the temporal distance
1 In this sense, window-based approaches can be thought of as a type of binary weighting function.
between training and test examples should be considered, but also the varying characteristics of the underlying data distribution. Following this direction, in this work we report a statistical analysis of the temporal effects on three textual datasets in order to define a temporal weighting function (TWF) which properly models the changing behavior of the underlying data distribution, reflecting its dynamical nature and capturing both the temporal distance between training and test examples and the variations of the characteristics of the dataset (Salles et al., 2010b). We also propose three instance weighting strategies that employ the temporal weighting function to deal with these temporal effects (Salles et al., 2010a). We applied these strategies to three well known ADC algorithms, namely Rocchio, KNN and Naïve Bayes, and, as reported in Section 5.4, we found that the new temporally-aware classifiers achieve statistically significant gains over their traditional counterparts.
Another common approach to deal with concept drift focuses on the combination of various classification models generated from different algorithms (ensembles) for classification, pruning or adapting the weights according to recent data (Folino et al., 2007; Kolter and Maloof, 2003; Scholz and Klinkenberg, 2007). Scholz and Klinkenberg (2007) proposed a boosting-like method to train a classifier ensemble from data streams. It naturally adapts to concept drift and allows one to quantify the drift in terms of its base learners. The algorithm was shown to outperform learning algorithms that ignore concept drift. In the same direction, Kolter and Maloof (2003) presented a technique that maintains an ensemble of base learners, predicts instance classes using a weighted-majority vote of these "experts", and dynamically creates and deletes experts in response to changes in performance. Additionally, Folino et al. (2007) proposed to build an ensemble of classifiers using genetic programming to inductively generate decision trees. In spite of these prior proposals, one important challenge of approaches based on classifier ensembles is the efficient management of multiple models. As a matter of fact, one of our proposed strategies is based on the combination of various classification models, but with a much simpler way to manage them, by exploiting the TWF.
3.2.3 Characterizing Data Variations
In addition to the aforementioned studies, which aim at either detecting or exploiting the changes in data distribution, in (Forman, 2006) the author provides a characterization of varying data distributions in the textual data domain, where the concept drift problem is studied considering three main types of data variations: (i) shifting class distribution, which is reflected in the observed variations over time in the proportion of documents assigned to each class; (ii) shifting subclass distribution, which accounts for varying feature distributions; and, finally, (iii) fickle concept drift, which denotes the cases where documents are assigned to distinct classes at different points in time. Moreover, in that work, the author proposes a visualization tool aimed at analyzing the feature space (in a binary classification setting) and thus providing clues about the varying behavior of the most predictive features as time goes by. A real textual dataset, composed of news articles, was characterized according to the three mentioned drifting patterns, and was shown to be a very dynamic dataset.
Following this direction, in (Mourão et al., 2008), the authors provide a characterization of these changes in terms of three main temporal effects: (i) the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes; (ii) the term distribution variation, which refers to changes in the representativeness of the terms with respect to the classes as time goes by; and, finally, (iii) the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. In fact, the class distribution variation and the term distribution variation effects correspond, respectively, to the shifting class distribution and the shifting subclass distribution discussed in (Forman, 2006). Furthermore, while the class similarity variation effect is not analyzed in (Forman, 2006), the fickle drifting pattern is not considered in (Mourão et al., 2008). As a matter of fact, the fickle drift type, which corresponds to the change of class of a given document due to some eventual correction, is probably the most difficult case to be handled. These are very rare events which may not affect the classifier effectiveness, and even the strategies discussed in (Forman, 2006) to handle concept drift do not deal with this case. Hence, here we focus on the three temporal effects analyzed in (Mourão et al., 2008), adopting the authors' proposed nomenclature.
Building upon the characterization reported in both studies, we here propose a methodology to enable a deeper study of temporal effects. We propose to use a factorial experimental design to quantify the extent to which each of these variations impacts ADC algorithms, according to datasets with distinct temporal dynamics. This quantitative analysis is an advance over the aforementioned studies, since both analyze the variations in the data distribution in a purely qualitative manner. We also instantiate the proposed methodology using three real textual datasets and four traditional ADC algorithms. In comparison with previous work, our characterization methodology and results contribute directly to the definition of more successful strategies to deal with and to exploit temporal effects. They also provide valuable insights into the behavior of the analyzed algorithms when faced with changing distributions.
It is interesting to notice that, while the majority of the aforementioned works aimed at dealing with varying data distributions typically consider scenarios characterized by the classification of future data (with older data becoming obsolete as time goes by), here we propose an approach to classify documents in scenarios where we may have information about both the past and the future when classifying the test data, and this information may change over time. For example, considering a training set composed of documents created between the years 1980 and 2011, when classifying a test document created in the year 2000 we take into account both past and future data. It should be noticed, however, that our approach may be easily adapted to scenarios where we only have past information, such as Adaptive Document Classification and Concept Drift.
3.3 Chapter Summary
We discussed in this chapter the importance of considering the temporal dynamics of data
in machine learning techniques. We also reported some work aimed at either detecting or
handling variations in the data distribution in automatic classification tasks. We saw that the
main approaches for detecting data distribution variations are based on statistical tests and
classifier ensembles. Moreover, we discussed three main techniques for handling varying
data distributions (instance selection, instance weighting and ensembles) along with their
merits and drawbacks. Throughout the discussion, we pointed out how our work advances
the current research efforts.
Chapter 4
A Quantitative Analysis of Temporal
Effects on ADC
In this chapter, we are particularly concerned with the impact that temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to the class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but also the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. In such scenarios, the classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account in the learning of more accurate classification models.
As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.
Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how and to what extent these temporal effects really impact ADC algorithms has not been performed yet. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be distinctively affected by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, making the learning of a more accurate classification model that deals with these effects an even more challenging task.
In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects have more influence in each dataset? (ii) What is the behavior of each ADC algorithm when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this chapter, we take a step further towards answering the posed questions, by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.
The original contributions of this chapter are: (i) a revisitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely, ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).
As we shall see, there is a higher impact of the temporal effects on the ACM-DL and AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset, the impact of the class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamical behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the more dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.
This chapter is organized as follows: In Section 4.1 we describe the workload used in our experimental design, that is, the reference datasets and the analyzed ADC algorithms. An extension of the characterization done by Mourão et al. (2008), providing evidence of the existence of temporal effects in three textual datasets, is presented in Section 4.2. Next, Section 4.3 describes the factorial experimental approach proposed as a methodology to provide a more precise picture of the impact of temporal effects on different ADC algorithms, whereas the results of applying the proposed methodology to the considered datasets and ADC algorithms are discussed in Section 4.4. Finally, Section 4.5 summarizes our findings.
4.1 Experimental Workload
In this section, we present the experimental workload used in our analysis and in the remaining chapters. We provide a brief description of the three reference datasets (Section 4.1.1) as well as of the four ADC algorithms analyzed (Section 4.1.2).
4.1.1 Reference Datasets
The three reference datasets considered in our study consist of sets of textual documents, each one assigned to a single class (a single-label problem). For clarity purposes, throughout this paper we refer to each class by a corresponding identifier, as listed, for each dataset, in Table 4.1. The considered datasets are:
ACM-DL: a subset of the ACM Digital Library with 24897 documents containing articles related to Computer Science created between 1980 and 2002. We considered only the first level of the taxonomy adopted by ACM, including 11 classes, which remained the same throughout the period of analysis. The distribution of the 24897 documents among the 11 classes, in the entire time period, is presented in Figure 4.1a.
MEDLINE: a derived subset of the MedLine dataset, with 861454 documents classified into 7 distinct classes related to Medicine, and created between the years of 1970 and 1985. The class distribution of the 861454 documents during the entire time period is depicted in Figure 4.1b.
AG-NEWS: a collection of 835795 news articles, classified into 11 distinct classes, that spans 573 days. This dataset presents some interesting characteristics that are typical of news datasets. For instance, some topics appear and disappear very suddenly due to periodical or ephemeral events. Moreover, there is a higher variability in the meaning of the terms, along with a greater extent of class imbalance, due to the very dynamic nature of the news domain. The class distribution, spanning the whole 573-day period, is shown in Figure 4.1c.
    ACM-DL                            MEDLINE                      AG-NEWS
 0. General Literature             0. Aids                      0. Business
 1. Hardware                       1. Bioethics                 1. Science & Technology
 2. Computer Systems Organization  2. Cancer                    2. Entertainment
 3. Software                       3. Complementary Medicine    3. Sports
 4. Data                           4. History                   4. United States
 5. Theory of Computation          5. Space Life                5. World
 6. Mathematics of Computing       6. Toxicology                6. Health
 7. Information Systems                                         7. Top News
 8. Computing Methodologies                                     8. Europe
 9. Computer Applications                                       9. Italia
10. Computing Milieux                                          10. Top Stories
Table 4.1: Adopted Class Identifiers for each Reference Dataset.
These datasets potentially present distinct evolution patterns, due to their own characteristics. In particular, we expect that MEDLINE exhibits a more stable behavior, in comparison to the other two datasets, since it represents a more consolidated knowledge area. Thus, we expect a tendency of newly inserted terms becoming stable along the years. In contrast, we expect a higher dynamism in AG-NEWS, a natural behavior of news datasets, which tend to present higher variability in their characteristics (for example, variations in class distributions according to transient events, hot topics, and so on).
Figure 4.1: Class Distributions in the Three Reference Datasets. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
4.1.2 ADC Algorithms
We selected four representative and widely used ADC algorithms to conduct our study. These
algorithms are:
Rocchio: an eager classifier that uses the centroid of a class to find boundaries between classes. The centroid of a class is defined as the average vector computed over its training examples. When classifying a new example d′, Rocchio associates it with the class represented by the centroid closest to d′.
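The centroid rule above can be sketched in a few lines. This is an illustrative implementation (not the dissertation's code), assuming documents are dense NumPy term-weight vectors and using cosine similarity to pick the closest centroid:

```python
import numpy as np

def train_rocchio(X, y):
    """One centroid per class: the mean of that class's training vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(centroids, d):
    """Assign d to the class whose centroid is most cosine-similar to it."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(centroids[c], d))
```
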
KNN: a lazy classifier that assigns to a test document d′ the majority class among those of its k nearest neighbor training documents in the vector space. Unlike Rocchio, KNN determines the decision boundary locally, considering each training document independently. We here use cosine similarity to determine the nearest neighbors of a test document.
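A corresponding sketch of the kNN rule with cosine similarity (illustrative only; function names and the toy vector representation are our own assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Majority class among the k training vectors most cosine-similar to d."""
    sims = (X_train @ d) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(d) + 1e-12)
    top_k = np.argsort(-sims)[:k]          # indices of the k nearest neighbors
    return Counter(y_train[top_k].tolist()).most_common(1)[0][0]
```
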
Naïve Bayes (NB): a probabilistic learning method that aims at inferring a model for each class, assigning to a test document d′ the class associated with the most probable model that would have generated it. Here, we adopt the Multinomial Naïve Bayes approach (Manning et al., 2008), since it is widely used for probabilistic text classification. The posterior class probabilities P(d′|c) are defined as

P(d′|c) = η × P(c) × ∏_{t ∈ d′} P(t|c),    (4.1)

where η denotes a normalizing factor, P(c) is the class prior probability, and P(t|c) denotes the conditional probability of observing term t given class c. The NB classifier assigns to a test example d′ the class c with the highest posterior probability P(d′|c).
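A minimal multinomial NB sketch of Equation 4.1 in log space. The Laplace smoothing and all names below are our own assumptions, added to avoid zero probabilities; they are not prescribed by the text:

```python
import math
from collections import Counter

def train_mnb(docs, labels, vocab, alpha=1.0):
    """Estimate P(c) and Laplace-smoothed P(t|c) from token-list documents."""
    prior = {c: labels.count(c) / len(docs) for c in set(labels)}
    counts = {c: Counter() for c in prior}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    cond = {}
    for c in prior:
        total = sum(counts[c].values())
        cond[c] = {t: (counts[c][t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return prior, cond

def classify_mnb(prior, cond, doc):
    """argmax_c of log P(c) + sum_{t in doc} log P(t|c): Equation 4.1 in log space."""
    return max(prior, key=lambda c: math.log(prior[c])
               + sum(math.log(cond[c][t]) for t in doc if t in cond[c]))
```
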
Support Vector Machine (SVM): the SVM classifier aims at finding an optimal separating hyperplane between the positive and negative training documents, maximizing the distance (margin) to the closest points from either class. Given N training documents represented as pairs (x_i, y_i), where x_i is the weighted feature vector of the i-th training document and y_i ∈ {−1, +1} the set membership of the document, SVM tries to maximize the margin between them on the training data, which leads to better classification effectiveness on test data. We may state the problem as

min_{β, β0}  (1/2)‖β‖²,  subject to  y_i(x_i^T β + β0) ≥ 1,    (4.2)

where β is a vector normal to the hyperplane (the so-called weight vector), β0 is its intercept, and 1 ≤ i ≤ N.
After introducing Lagrange multipliers α_i (1 ≤ i ≤ N) for each inequality constraint in Equation 4.2, along with slack variables ξ_i to account for non-separable data (a bounded tolerable training error rate), we form the following Lagrangian (primal):

L_P = (1/2)‖β‖² + C ∑_{i=1}^{N} ξ_i − ∑_{i=1}^{N} α_i [y_i(x_i^T β + β0) − (1 − ξ_i)] − ∑_{i=1}^{N} μ_i ξ_i,    (4.3)
which we minimize with respect to β, β0 and ξ_i, where μ_i are Lagrange multipliers employed to enforce ξ_i ≥ 0. Setting the corresponding derivatives to zero yields:

β = ∑_{i=1}^{N} α_i y_i x_i    (4.4)

0 = ∑_{i=1}^{N} α_i y_i    (4.5)

α_i = C − μ_i,    (4.6)
where α_i ≥ 0, μ_i ≥ 0 and ξ_i ≥ 0, ∀i. By substitution into Equation 4.3, we get the so-called Wolfe (dual) function:

L_D = ∑_{i=1}^{N} α_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j.
Furthermore, the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, which include, along with Equations 4.4, 4.5 and 4.6, the following ones:

α_i [y_i(x_i^T β + β0) − (1 − ξ_i)] = 0    (4.7)

μ_i ξ_i = 0

y_i(x_i^T β + β0) − (1 − ξ_i) ≥ 0,

where 1 ≤ i ≤ N.
Finally, the solution for β is β = ∑_{i=1}^{N} α_i y_i x_i, with non-zero α_i only for the support vectors. The solution for β0 may be derived from Equation 4.7, normally averaging the solutions over the support points to achieve numerical stability. Thus, we can express the SVM's decision function as:

F = sign(x^T β + β0),

where the sign of the score is used to predict the example's class. Since SVM is a binary classifier, it must be adapted to handle multi-class classification problems. The two most common strategies for doing so are the one-against-one and the one-against-all (Manning et al., 2008).
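Assuming one already-trained binary SVM (β, β0) per class, the one-against-all prediction rule can be sketched as below (illustrative only; training itself is omitted):

```python
import numpy as np

def ova_predict(models, x):
    """models: class -> (beta, beta0), one binary SVM per class.
    Score x with each decision function f_c(x) = x^T beta + beta0 and
    return the class with the highest score (one-against-all rule)."""
    return max(models, key=lambda c: float(x @ models[c][0] + models[c][1]))
```
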
4.2 Characterization of Temporal Effects on Textual Datasets
In this section, we briefly describe the characterization reported in (Mourão et al., 2008), which uncovered three main temporal effects that affect the ACM-DL and MEDLINE datasets: (i) the class distribution variation; (ii) the term distribution variation; and, finally, (iii) the class similarity variation. More importantly, we also extend this prior characterization to include a third, distinct, and more dynamic dataset, namely AG-NEWS. Our main goal is to strengthen the argument for the existence of temporal effects in the reference datasets, thus motivating our quantitative analysis of their impact on ADC algorithms when applied to these datasets.
Before proceeding, we must first discretize the temporal dimension in order to capture the variabilities in the characteristics of the explored datasets. Time can be seen as a discretization of natural changes inherent to any knowledge area. Detectable changes, however, may occur at different time scales, depending on the characteristics of the given knowledge area. In the case of ACM-DL and MEDLINE, which are sets of scientific articles, we adopted yearly intervals for identifying such changes, as scientific conferences usually occur once per year. For the AG-NEWS dataset, we adopted, instead, a daily granularity, which should more accurately capture changes in a set of news articles. Next, we discuss the main findings of the characterization of each temporal effect in the three datasets.
4.2.1 Class Distribution Temporal Variation
The impact of temporal evolution on class distribution (CD) relates to the variation of the fraction of documents assigned to each class over time. CD temporal variation should be properly considered to avoid an undesirable classifier bias. For instance, as mentioned before, if CD varies significantly, the "assumed" class distribution may not reflect the "true" class distribution observed when the test data was created. Notice that, as an extreme case, classes may appear and disappear as a consequence of splits and joins of existing classes. For example, the sub-classes Information Retrieval and Artificial Intelligence in the ACM-DL Computing Classification System (CCS) belonged to the same class, Applications, in 1964. Currently, each one belongs to a different class: Information Retrieval belongs to Information Systems, whereas Artificial Intelligence belongs to Computing Methodologies.
To assist the analysis of the CD temporal variation in each dataset, Figure 4.2 shows the class probability distributions for each year of ACM-DL and MEDLINE (as in Mourão et al. 2008) and for each week of AG-NEWS.¹ The figure illustrates the variation in terms of the representativeness of the classes, that is, in terms of the fraction of document occurrences in each class, as time goes by. As the figures show, most classes, particularly in ACM-DL and AG-NEWS, exhibit frequent oscillations in their representativeness, whereas others become more or less representative with time. For instance, the Mathematics of Computing class, in ACM-DL, became less representative with time, whereas the AG-NEWS World class presented a peak in its representativeness between the 25th and 37th weeks. Another interesting case is the MEDLINE Aids class. Although it contains documents dating from 1970, the fraction of documents belonging to it only became significant after 1985.

These results illustrate that one needs to be very careful when creating classification models in order to avoid generating a biased model that may not be accurate for the dataset to be tested. The fact that the fractions of documents in several classes are constantly changing over time, as can be seen for several classes in all three datasets in Figure 4.2, makes this a real problem that must be taken into account.
¹ We show AG-NEWS results on a weekly basis to improve graph readability.
Figure 4.2: Class Distribution Temporal Variation in Each Reference Dataset. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
4.2.2 Term Distribution Temporal Variation
Term Distribution (TD) variation is related to how the distribution of terms among the classes changes over time as a consequence of terms appearing, disappearing, and having variable discriminative power across classes. Take the following example of two classes, Mythology and Astrophysics. Besides being the god of the underworld in classical mythology, Pluto was also considered to be a planet until mid-2006. Up to this date, documents with the term Pluto had a higher probability of being classified in the Astrophysics class due to the great amount of references that mention Pluto as a planet. From this date on, since Pluto is not considered to be a planet anymore, there has been a significant reduction in the number of documents referring to it in this context. In mythology, however, the references to Pluto did not present any noticeable variation. In this case, the term Pluto lost discriminative power in the Astrophysics class and gained it in the Mythology class. Intuitively, we may state that TD evolution usually happens gradually, so that the distributions of terms observed at time periods that are closer time-wise tend also to be more similar.
In order to characterize the TD temporal variation effect, we define, for each class and each point in time, the class vocabulary as the set of terms that have the highest values of info-gain (Forman, 2003). The vocabulary of a class at a given point in time t represents that class in t. We then compare the vocabularies produced for the same class across all points in time using the normalized cosine similarity between them. Figure 4.3 shows the average cosine similarities as we vary the time distance between the vocabularies. For the sake of clarity, we present results for a subset of the classes of each dataset, since the same behavior is observed for all classes. Clearly, for all three datasets, the class vocabularies vary significantly over time. For the less stable ACM-DL and AG-NEWS datasets, the similarities drop significantly even for a time distance equal to 1.²
Figure 4.3: Term Distribution Temporal Variation of Each Reference Dataset. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
Since the class vocabulary changes significantly with time, it becomes clear that a classification model generated considering documents created at a certain period of time may be less effective when tested using documents from another period of time, because the vocabulary may have changed in such a way that the assumptions made when learning the classifier may no longer hold, that is, the discriminative terms may not be the same. Such difficulty turns out to be a very interesting challenge as well.
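The vocabulary comparison described above can be sketched as follows. We treat each class vocabulary as a term set whose binary-vector cosine is |V1 ∩ V2| / sqrt(|V1|·|V2|); this exact weighting is our assumption, since the text does not fully specify it:

```python
def vocab_cosine(v1, v2):
    """Cosine similarity between two vocabularies seen as binary term vectors:
    |v1 & v2| / sqrt(|v1| * |v2|)."""
    if not v1 or not v2:
        return 0.0
    return len(v1 & v2) / (len(v1) * len(v2)) ** 0.5

def avg_similarity_by_distance(vocabs):
    """vocabs: one vocabulary (set of top info-gain terms) per point in time,
    for a single class. Returns {time distance: average cosine similarity}."""
    n = len(vocabs)
    return {dist: sum(vocab_cosine(vocabs[i], vocabs[i + dist])
                      for i in range(n - dist)) / (n - dist)
            for dist in range(n)}
```

At time distance zero each vocabulary is compared to itself, so the average similarity is 1, matching the footnote above.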
4.2.3 Class Similarity Temporal Variation
Finally, class similarity (CS) variation relates to how the pairwise similarity among classes, as a function of the terms that occur in their documents, varies over time. The similarity between two arbitrary classes may change over time due to the migration and to the variation of the frequency of the terms in their vocabularies: two classes may be similar at a given moment, and become less similar later on, and vice versa.
² At time distance zero, the similarity, for all classes, is equal to 1, since we are comparing a vocabulary to itself, which obviously corresponds to the maximum possible similarity value.
In order to analyze the CS temporal variation, we calculate the cosine similarity between the vocabularies of each pair of distinct classes at any given point in time (years for ACM-DL and MEDLINE, and days for AG-NEWS). Tables 4.2, 4.3 and 4.4 show the results for ACM-DL, MEDLINE and AG-NEWS, respectively. Each entry in the tables comprises the standard deviation of the similarities between the associated pair of classes computed over all points in time. As we can observe, the similarities between some pairs of classes vary significantly with time. For example, the similarities between General Literature (id 0) and Theory of Computation (id 5) in ACM-DL, Complementary Medicine (id 3) and History (id 4) in MEDLINE, and World and Top Stories (ids 5 and 10, respectively) in AG-NEWS have standard deviations equal to 0.29, 0.21 and 0.33, respectively. This means that these pairs of classes may have been very similar in some periods, but also loosely related in others. Thus, the difficulty in separating them varies significantly as time goes by.
Class ID    0     1     2     3     4     5     6     7     8     9     10
 0          0   0.14  0.12  0.12  0.12  0.29  0.14  0.13  0.14  0.12  0.29
 1          -    0    0.08  0.13  0.11  0.12  0.11  0.12  0.10  0.10  0.13
 2          -    -     0    0.10  0.09  0.10  0.08  0.07  0.08  0.10  0.13
 3          -    -     -     0    0.09  0.06  0.09  0.10  0.11  0.12  0.13
 4          -    -     -     -     0    0.05  0.08  0.09  0.10  0.13  0.13
 5          -    -     -     -     -     0    0.14  0.13  0.07  0.06  0.29
 6          -    -     -     -     -     -     0    0.13  0.10  0.09  0.15
 7          -    -     -     -     -     -     -     0    0.10  0.08  0.15
 8          -    -     -     -     -     -     -     -     0    0.11  0.13
 9          -    -     -     -     -     -     -     -     -     0    0.12
10          -    -     -     -     -     -     -     -     -     -     0
Table 4.2: Pairwise Class Similarity (standard deviations) in ACM-DL.
Class ID    0     1     2     3     4     5     6
 0          0   0.19  0.16  0.18  0.19  0.18  0.19
 1          -    0    0.04  0.20  0.17  0.19  0.12
 2          -    -     0    0.04  0.03  0.04  0.05
 3          -    -     -     0    0.21  0.08  0.05
 4          -    -     -     -     0    0.20  0.11
 5          -    -     -     -     -     0    0.05
 6          -    -     -     -     -     -     0
Table 4.3: Pairwise Class Similarity (standard deviations) in MEDLINE.
Class ID    0     1     2     3     4     5     6     7     8     9     10
 0          0   0.17  0.16  0.23  0.15  0.18  0.20  0.15  0.17  0.01  0.16
 1          -    0    0.20  0.25  0.24  0.24  0.21  0.19  0.20  0.02  0.15
 2          -    -     0    0.24  0.18  0.10  0.20  0.18  0.26  0.04  0.40
 3          -    -     -     0    0.30  0.30  0.27  0.20  0.31  0.01  0.16
 4          -    -     -     -     0    0.19  0.19  0.21  0.26  0.01  0.19
 5          -    -     -     -     -     0    0.23  0.16  0.28  0.04  0.33
 6          -    -     -     -     -     -     0    0.15  0.22  0.01  0.13
 7          -    -     -     -     -     -     -     0    0.13  0.02  0.15
 8          -    -     -     -     -     -     -     -     0    0.02  0.24
 9          -    -     -     -     -     -     -     -     -     0    0.05
10          -    -     -     -     -     -     -     -     -     -     0

Table 4.4: Pairwise Class Similarity (standard deviations) in AG-NEWS.

Summarizing this discussion, there is clear evidence of temporal variations in the class and term distributions, as well as in the similarities among classes, in all three analyzed datasets. These variations may ultimately affect the performance of classifiers. In the following section, we detail the proposed methodology to quantify the impact of each of these three temporal effects on ADC algorithms.
4.3 Experimental Design
In this section, we describe our proposed methodology to assess the impact of the identified temporal effects on each ADC algorithm and textual dataset. The core component of our methodology is a factorial experimental design (Jain, 1991). This technique has already been applied in multiple contexts to quantify the effect of different factors and inter-factor interactions on a given response variable (see examples in de Lima et al. 2010; Jain 1991; Orair et al. 2010; Vaz de Melo et al. 2008). However, to the best of our knowledge, this is the first time it is applied in the specific context of temporal effects and ADC algorithms. As will be discussed below, the application of this technique in this context brings challenges of its own.

We start, in Section 4.3.1, by reviewing the factorial design procedure in general terms. We discuss how it can be applied to evaluate the impact of temporal effects on ADC algorithms in Section 4.3.2, presenting its application on the four selected ADC algorithms and the three chosen textual datasets in Section 4.3.3.
4.3.1 Factorial Design
Given k factors (the so-called independent variables), which can assume n levels (possible values), and a response variable, a full factorial n^k experimental design aims at quantifying the impact of each individual factor as well as of all inter-factor interactions (of all orders) on a given response variable. In other words, it aims at quantifying the effect of these factors and interactions on the variations observed in the response across a series of n^k experiments, carefully designed to cover all possible configurations of factor levels.
To conduct the n^k design, the parameters that affect the system under study must be carefully controlled, in order to avoid misleading conclusions due to unexpected effects. Thus, one has to be able to isolate and carefully vary the factors, which are parameters related to the goals of the study and thus selected to be analyzed, while controlling the other parameters, which are kept fixed. Usually, factors are varied from smaller to larger values, based on the assumption of monotonicity, that is, that the response variable continuously increases (or decreases) as the factor value becomes larger. In many scenarios, the system under study presents an inherent variability, and, thus, measurements are susceptible to inaccuracies, referred to as experimental errors. In such cases, the impact of the factors and of their interactions should be assessed in comparison to such errors, and an experimental design with r replications (n^k r) should be adopted. This is done by replicating the measurements for each factor-level combination r times. It is important to emphasize the need of controlling all parameters with significant impact on the system, by either treating them as factors or keeping them fixed, as the effect of uncontrolled parameters cannot be distinguished from experimental errors.
Such an experimental design is typically used as a primary tool to help one sort factors and inter-factor interactions in terms of their impact on the response variable, thus providing quantitative evidence of which factors (and/or interactions) are more relevant for further (more detailed) investigation. The examination of every possible factor-level combination enables one to have a complete picture of the system behavior regarding the factors considered. However, it comes at the expense of a potentially very costly study. The required number of experiments (i.e., n^k r experiments) may be too large and unfeasible to perform due to resource and time constraints. One of the most recommended strategies to reduce the number of required experiments consists in reducing the number of levels considered for each factor (Jain, 1991). As a matter of fact, for an initial assessment, one can consider only two levels (lower and upper) of each factor, thus performing a 2^k r factorial design. By doing so, one can determine the relative importance of all factors and interactions, and leave for a more detailed study the analysis of more levels of the most relevant factors.
We describe the main steps of a 2^k r factorial design using, for illustration purposes, k = 2 factors, referred to as A and B. The 2^2 r design aims to fit an additive model that characterizes the impact of each factor A and B, as well as of their interaction AB, on the response variable y. This model is given by:
y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε,    (4.8)

where q0 is the mean value of the response variable, qA, qB and qAB stand for the effects associated with factors A, B and interaction AB, and ε denotes the experimental errors. For each factor f ∈ {A, B}, a variable x_f is defined as

x_f = −1 if f is at the lower level, and x_f = +1 if f is at the upper level.

Thus, q_f denotes the extent of the variation on the global average q0 imposed by factor f, on average.
The 2^2 r experimental design can be summarized into five steps. Step 1 consists of parameter estimation, in which we compute q0 and the effects qA, qB and qAB. Once the effects have been computed, the model can be used to estimate the response for any given factor values (x-values). For instance, the estimated response when factors A and B are at levels xAi and xBi, respectively, is computed as:

y_i = q0 + qA·xAi + qB·xBi + qAB·xAi·xBi    (4.9)
The importance of a factor can be measured by the proportion of the total variation in the response variable that can be explained by it. Thus, in Step 2, we compute the variation of response y across all experiments that can be explained by each factor (SSA, SSB, SSAB, respectively) as well as the variation that remains unexplained, being thus credited to experimental errors (SSE). In other words, we compute SS_f = 2^k r q_f² (f ∈ {A, B, AB}), and SSE = ∑_{i=1}^{2^k} ∑_{j=1}^{r} e_ij², where the error e_ij denotes the difference between the estimated response for the i-th experiment (y_i) and the value measured in its j-th replication (y_ij). The total variation, referred to as Sum of Squares Total (SST), is also computed as the sum of SSA, SSB, SSAB and SSE.
Next (Step 3), we express SS_f (f ∈ {A, B, AB}) and SSE as percentages of the total variation SST, so as to more easily assess the importance of each factor and of the experimental errors in the observed response variations. Factors (and interactions) that explain a higher percentage of the total variation are considered more important and, thus, are candidates for further analysis.
Since the effects are computed from a sample, they are indeed random variables, and could take different values if another set of experiments were performed. Thus, it is necessary to compute their associated confidence intervals (Step 4). We do so by first computing the root mean squared error (RMSE) and the standard deviation s_f of each effect q_f (f ∈ {A, B, AB}). RMSE denotes the standard error of the estimates, thus measuring how well the model explains the observations. It is computed as the square root of the ratio of SSE to the degrees of freedom associated with the experimental errors (in the current design, 2²(r − 1)).³ The 100(1 − α)% two-sided confidence intervals are computed using either a Student's t distribution or a z distribution, depending on the degrees of freedom 2²(r − 1) (see Jain, 1991). Any effect whose confidence interval does not include zero is statistically significant with the given confidence.
Finally, in Step 5 we assess the model quality by means of the coefficient of determination R². This is done by comparing the unexplained variation (SSE) with the total variation (SST), being a measure of goodness of fit for the additive model in Equation 4.8. The closer R² is to 1, the better the fitted model.
The general procedure to perform a 2^k r design, for any values of k and r, is presented in Algorithm 1.
Algorithm 1 Factorial Design Procedure.
function FactorialDesign
    Step 1: Estimate model parameters (i.e., grand mean and factor effects)
        q0 ← (1 / (2^k r)) ∑_{i=1}^{2^k} ∑_{j=1}^{r} y_ij
        q_f ← (1 / 2^k) ∑_{i=1}^{2^k} x_fi y_i, where f ∈ [1, 2^k − 1] and y_i = (1/r) ∑_{j=1}^{r} y_ij
    Step 2: Compute total variation as well as variation due to each factor and to experimental errors
        SS_f ← 2^k r q_f², where f ∈ [1, 2^k − 1]
        SSE ← ∑_{i=1}^{2^k} ∑_{j=1}^{r} e_ij², where e_ij = y_ij − y_i
        SST ← ∑_{f=1}^{2^k − 1} SS_f + SSE
    Step 3: Compute percentage of variation each factor/error is responsible for
        P_f ← (SS_f / SST) × 100, where f ∈ [1, 2^k − 1]
        P_E ← (SSE / SST) × 100
    Step 4: Compute confidence intervals of the effects
        RMSE ← sqrt(SSE / (2^k (r − 1)))
        s_f ← RMSE / sqrt(2^k r), where f ∈ [1, 2^k − 1]
        CI_f ← q_f ± t[1 − α/2; 2^k (r − 1)] · s_f, where f ∈ [1, 2^k − 1]
    Step 5: Assess model accuracy by the coefficient of determination
        R² ← 1 − SSE / SST
end function
³ In a general 2^k r design, the degrees of freedom of the experimental errors is given by 2^k (r − 1).
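The five steps can be sketched in plain Python for the k = 2 case used in the illustration (confidence intervals omitted for brevity; all names are ours):

```python
def factorial_design_22r(y):
    """2^2 r factorial design (k = 2). y maps (xA, xB) in {-1,+1}^2 to the
    list of its r replicated responses. Returns the grand mean q0, the
    effects (qA, qB, qAB), the percentage of variation explained by each
    factor and by errors, and R^2."""
    configs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    r = len(y[configs[0]])
    ybar = {c: sum(y[c]) / r for c in configs}           # per-configuration mean
    q0 = sum(ybar.values()) / 4
    qA = sum(xa * ybar[(xa, xb)] for xa, xb in configs) / 4
    qB = sum(xb * ybar[(xa, xb)] for xa, xb in configs) / 4
    qAB = sum(xa * xb * ybar[(xa, xb)] for xa, xb in configs) / 4
    # SS_f = 2^k r q_f^2; with the full model, the estimate for config i is ybar_i
    ss = {"A": 4 * r * qA ** 2, "B": 4 * r * qB ** 2, "AB": 4 * r * qAB ** 2}
    sse = sum((yij - ybar[c]) ** 2 for c in configs for yij in y[c])
    sst = sum(ss.values()) + sse
    pct = {f: 100 * v / sst for f, v in ss.items()}
    pct["E"] = 100 * sse / sst
    return q0, (qA, qB, qAB), pct, 1 - sse / sst
```

For a purely additive response with no error, the effects are recovered exactly and R² = 1.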
4.3.2 Applying the 2^k r Design in the Characterization of Temporal Effects
In this section we describe how the 2^k r design can be applied to quantify the impact of temporal aspects on ADC algorithms, considering different datasets. As we focus on three different temporal aspects, namely the class distribution temporal variation (CD), the term distribution temporal variation (TD) and the class similarity temporal variation (CS), our experimental design takes k = 3 factors. The two levels considered for each factor, which we call "lower" and "upper" levels, defined below, refer to the degree of temporal variation observed on it. Given a reference dataset and an ADC algorithm, the goal is to partition the document set into 2³ groups corresponding to all possible factor-level configurations, and then evaluate the algorithm for each configuration, considering the grouped documents. We then apply the 2^k r design procedure, described in Algorithm 1, to quantify the effect of each factor and inter-factor interaction on the effectiveness of the ADC algorithm. The response variable y is thus the classification effectiveness, which is here assessed by the commonly used F1 measure. F1 is the harmonic mean between the precision p and the recall r, given by:

F1 = 2pr / (p + r),

where precision is the percentage of documents assigned by the classifier to class c_i that were correctly classified, and recall is the percentage of documents belonging to class c_i that were correctly classified.⁴
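A minimal helper for this overall (pooled) F1, assuming true-positive, false-positive and false-negative counts aggregated across all classes:

```python
def micro_f1(tp, fp, fn):
    """Overall F1 from counts pooled across all classes:
    harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```
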
For each configuration, we run a number r of replications following a cross-validation strategy, commonly adopted by the machine learning community. There are, at least, two usual approaches for doing so: K-fold cross validation and repeated random sub-sampling. K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. Repeated random sub-sampling consists of randomly selecting a fraction of documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set. This is performed for each replication. Since in K-fold cross validation the size of the folds depends on the number of iterations, it becomes more suitable for medium/large sized datasets, while repeated random sub-sampling is usually adopted for small sized datasets when the number of replications is large.

⁴ The described F1 measure corresponds to the overall performance of the methods across all classes. Using a per-class variation of the measure (also known as Macro-F1) would imply having to consider another parameter in the analysis: the class imbalance. In order to focus our analysis on the time-related factors, the goal of the present study, we would have to isolate or control this parameter. However, possible solutions to isolate it (for example, under- or oversampling; Lin et al. 2009; Liu et al. 2007) are typically very hard to perform in practice without affecting the temporal factors, which ultimately could compromise our study. Thus, we leave for future work the consideration of this metric in our experimental design. We note, however, that, in the absence of very skewed class distributions, both variations of the F1 metric tend to produce compatible results.
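The repeated random sub-sampling protocol described above can be sketched as follows (K-fold is omitted; parameter names are illustrative):

```python
import random

def repeated_random_subsampling(docs, test_fraction, r, seed=0):
    """Yield r (train, test) splits: each test set is drawn without
    replacement; the remaining documents form the training set."""
    rng = random.Random(seed)
    n_test = int(len(docs) * test_fraction)
    for _ in range(r):
        test_idx = set(rng.sample(range(len(docs)), n_test))
        yield ([d for i, d in enumerate(docs) if i not in test_idx],
               [docs[i] for i in sorted(test_idx)])
```
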
One challenge in building our factorial design is how to define the 2³ groups of documents. For that, we must quantify the temporal variation of each factor in the set of documents of the reference dataset, define the two levels for each factor and, based on them, group the documents according to all possible factor-level combinations. The following three sections (Sections 4.3.2.1-4.3.2.3) describe how we performed these steps for the CD, TD and CS factors. Note that, since CD and CS relate exclusively to the characteristics of the class to which a document belongs, we define the CD and CS levels associated with a document based on the corresponding values of its class. TD, on the other hand, relates to the relationships among terms and classes. Thus, in order to define the TD level associated with a document, isolating this factor from the others, we adopt a finer grained approach that analyzes the document's contents. After defining the factor levels, we discuss a few other aspects that require attention to avoid misleading results (Section 4.3.2.4).
4.3.2.1 Class Distribution: Lower and Upper Levels
Let C and P be the sets of classes and points in time observed in the reference dataset, respectively. To isolate the class distribution effect into lower and upper levels, we consider the relative sizes of the classes (i.e., the fraction of the dataset documents assigned to the classes) at each point in time p ∈ P. For each class c ∈ C, we compute the coefficient of variation CV (that is, the ratio of the standard deviation to the mean) of the relative size of c over all values of p. The CV is used since it is dimensionless and scale invariant, and thus more appropriate to deal with temporal changes in class distribution observed in the reference dataset.

We then partition the documents into two groups based on a given threshold δ_CD: those whose classes present CV values lower than δ_CD are assigned to the "lower" group (CD↓, with associated variable x_CD = −1), while those whose classes present CV values higher than δ_CD are assigned to the "upper" group (CD↑, with associated variable x_CD = +1). We defer to Section 4.3.3 the details regarding how we define the δ_CD threshold.
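The CV-based grouping can be sketched as below (we use the population standard deviation; the text does not specify which variant):

```python
from statistics import mean, pstdev

def cd_levels(class_sizes, delta_cd):
    """class_sizes: class -> list of relative sizes, one per point in time.
    Returns class -> +1 (upper CD group) or -1 (lower CD group), comparing
    the coefficient of variation CV = stddev/mean against delta_cd."""
    return {c: (1 if pstdev(sizes) / mean(sizes) > delta_cd else -1)
            for c, sizes in class_sizes.items()}
```

Every document then inherits the level of its class.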
4.3.2.2 Term Distribution: Lower and Upper Levels
We determine the TD level to which a document belongs by computing the document
stability level, which is characterized by the density of the document's terms that are stable.
In order to assess the stability of a given term, we use the concept of stability period
(Rocha et al., 2008).
40 CHAPTER 4. A QUANTITATIVE ANALYSIS OF TEMPORAL EFFECTS ON ADC
Definition 1 (Stability period) Let DF(t, c, p) be the number of documents belonging to
class c ∈ C that contain term t and that were created at the point in time p ∈ P. A stability
period S_{t,p_r} of a term t, considering p_r ∈ P as the reference point in time, is the set of
points in time p present in the largest continuous period of time,⁵ starting from p_r and
growing both to the past and to the future, until there exists some class c such that

    DOMINANCE(t, c, p) = DF(t, c, p) / Σ_{c′ ∈ C} DF(t, c′, p) > α,

for some predefined 0 < α ≤ 1.
We characterize the stability of a term t, regarding a reference point in time p_r, by the
term stability level (TSL), defined as:

    TSL(t, p_r) = |StabilityPeriod(t, p_r)| / |P|

We then use the TSL to estimate the document stability level (DSL) of a given document d.
Let p be the point in time when d was created. We define the DSL of d as:

    DSL(d) = Σ_{t ∈ d} TSL(t, p) / |{t′ | t′ ∈ d}|

As we can observe, 0 ≤ DSL(d) ≤ 1, where the lower bound (DSL(d) = 0) occurs for
documents without stable terms, and the upper bound (DSL(d) = 1) occurs for documents
composed only of terms t with maximal TSL(t, p) (that is, terms whose stability periods
have maximum duration regarding the time when d was created).
The documents are then partitioned into two groups: those with DSL lower than a
pre-defined threshold δ_TD are assigned to the "lower" group (TD↓, with associated
variable x_TD = −1), and the remaining documents are assigned to the "upper" group
(TD↑, with associated variable x_TD = +1). Again, we defer the definition of this
threshold to Section 4.3.3.
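The stability period, TSL and DSL computations can be sketched as follows. This is one minimal reading of Definition 1, in which a point in time belongs to the period while no class dominates the term; the DF counts below are a hypothetical toy example.

```python
def dominance(df, t, c, p, classes):
    """DOMINANCE(t, c, p) = DF(t, c, p) / Σ_{c'} DF(t, c', p)."""
    total = sum(df.get((t, c2, p), 0) for c2 in classes)
    return df.get((t, c, p), 0) / total if total else 0.0

def stability_period(df, t, pr, points, classes, alpha=0.5):
    """Largest continuous run of points around pr in which no class
    dominates term t (one reading of Definition 1)."""
    ordered = sorted(points)

    def stable(p):
        return all(dominance(df, t, c, p, classes) <= alpha for c in classes)

    if not stable(pr):
        return set()
    i = ordered.index(pr)
    period = {pr}
    j = i - 1                      # grow towards the past
    while j >= 0 and stable(ordered[j]):
        period.add(ordered[j]); j -= 1
    j = i + 1                      # grow towards the future
    while j < len(ordered) and stable(ordered[j]):
        period.add(ordered[j]); j += 1
    return period

def tsl(df, t, pr, points, classes, alpha=0.5):
    """Term stability level: |StabilityPeriod(t, pr)| / |P|."""
    return len(stability_period(df, t, pr, points, classes, alpha)) / len(points)

def dsl(doc_terms, p, df, points, classes, alpha=0.5):
    """Document stability level: average TSL of the document's distinct terms."""
    terms = set(doc_terms)
    return sum(tsl(df, t, p, points, classes, alpha) for t in terms) / len(terms)
```

For a term split evenly between two classes at times 1 and 2 but exclusive to one class at time 3, the stability period around p_r = 1 is {1, 2} and the TSL is 2/3.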
4.3.2.3 Class Similarity: Lower and Upper Levels
The "lower" group (CS↓, with associated variable x_CS = −1) is composed of documents
whose classes are more stable in terms of their similarities with other classes during the
whole period covered in the reference dataset. Accordingly, the "upper" group (CS↑, with
associated variable x_CS = +1) is composed of documents whose classes present higher
variability in their similarities with other classes. To quantify this variability for a class c,
we first compute the similarities sim(V_{c,p}, V_{c′,p}), where c, c′ ∈ C, c ≠ c′, and
V_{c,p} denotes c's vocabulary at the point in time p ∈ P. The vocabulary of a class c at
time p consists of the top-K terms with highest Information Gain (Forman, 2003) in c at
that time. We then compute the coefficient of variation CV of the (|C| − 1)|P| pooled
similarities.⁶ We separate the documents into two groups based on the CV values of their
classes and on a pre-defined threshold δ_CS, which will be further discussed in Section 4.3.3.

⁵ We consider the same definition of stability period as Rocha et al. (2008), adopting a
continuous period of time due to computational feasibility: considering non-continuous
intervals increases the search space exponentially with the number of points in time (2^|P|
possible intervals to be considered). This is a safe decision because, as we can observe in
Figure 4.3, the variations observed in the relationships between terms and classes are
smooth (that is, we do not observe any abrupt steps in the curves).
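A sketch of the CS computation follows. Note that the Jaccard similarity used here is an illustrative stand-in for sim(V_{c,p}, V_{c′,p}), and the toy vocabularies are hypothetical.

```python
from statistics import mean, pstdev

def jaccard(a, b):
    """Set similarity, used here as a stand-in for sim(V_{c,p}, V_{c',p})."""
    return len(a & b) / len(a | b) if a | b else 0.0

def class_similarity_cv(vocab, classes, points):
    """CV of the (|C| − 1)·|P| pooled similarities of each class.

    vocab[(c, p)]: set with the top-K terms of class c at time p."""
    cv = {}
    for c in classes:
        pooled = [jaccard(vocab[(c, p)], vocab[(c2, p)])
                  for p in points for c2 in classes if c2 != c]
        m = mean(pooled)
        cv[c] = pstdev(pooled) / m if m > 0 else 0.0
    return cv

# Toy vocabularies: class "b" converges towards class "a" over time,
# so both classes show variable pairwise similarities (higher CV).
vocab = {("a", 1): {"x", "y"}, ("b", 1): {"x", "z"},
         ("a", 2): {"x", "y"}, ("b", 2): {"x", "y"}}
cv = class_similarity_cv(vocab, classes=["a", "b"], points=[1, 2])
```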
4.3.2.4 Other Challenging Aspects
As a requirement for a well-conducted experimental design, we must control the
parameters that may influence the responses but are not the target of the analysis (i.e., are
not treated as factors in the design). One such parameter is the sampling effect,
characterized by the differences in classification effectiveness obtained by varying the size
of the training set. As is well known, the larger the training set used by supervised learning
strategies, the more information becomes available to build the classification model, which
ultimately influences the effectiveness of the classifier. If we neglect this matter, and
consider different training set sizes for each factor-level combination, we may mask the
actual impact of the temporal effects on the ADC algorithms. Clearly, we must isolate the
sampling effect to remove its influence on the response variable. Therefore, for each
experimental replication, we randomly selected the same number of documents for each of
the 2^k partitions, according to the size of the smallest partition. This ensures training sets
with equal sizes across all factor-level combinations, thus isolating the sampling effect.
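The equal-size sampling step can be sketched as follows; the partition names and sizes are illustrative, not the actual ones.

```python
import random

def equalize_partitions(partitions, seed=0):
    """Sample the same number of documents from every factor-level partition,
    given by the size of the smallest one, isolating the sampling effect."""
    rng = random.Random(seed)
    n = min(len(docs) for docs in partitions.values())
    return {name: rng.sample(docs, n) for name, docs in partitions.items()}

# Hypothetical partitions with unequal sizes.
partitions = {"CD-low/TD-low": list(range(40)),
              "CD-low/TD-high": list(range(25)),
              "CD-high/TD-low": list(range(60)),
              "CD-high/TD-high": list(range(32))}
balanced = equalize_partitions(partitions)
```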
One important dataset-dependent aspect is that the documents and classes in the reference
dataset must fill all 2³ groups to enable us to conduct the proposed experimental design.
However, in some cases, as in the reference datasets analyzed here, this might not hold,
particularly due to combinations regarding the CD and CS factors (see discussion in
Section 4.3.3). In such cases, we are not able to isolate and simultaneously analyze all
three temporal factors. To overcome this issue, and yet provide valuable insights about the
temporal effects, we propose a pairwise approach, consisting of two 2²r designs, referred
to as CD×TD and CS×TD. This decision comes with a cost, as we are not able to analyze a
possible interaction between CD and CS. However, as we will see in the next section, these
two factors are typically very correlated. Thus, analyzing them in separate experimental
designs might still be worthwhile.

⁶ Note that, as in Section 4.3.2.1, we here use the CV metric to characterize temporal
variations in class similarity. This is in contrast to Section 4.2, where, following Mourão
et al. (2008) strictly, we characterize the class similarity variation using the standard
deviation of the pooled similarities, a metric that depends on the unit and scale of the
measurements.
The first experimental design, CD×TD, aims at analyzing the impact of CD, TD and their
interaction on the classification effectiveness achieved by the four algorithms in the three
reference datasets. The second one, referred to as CS×TD, allows one to quantify the
impact of CS, TD and their interaction. For both designs, all documents of each reference
dataset are divided into four partitions, with the same number of documents randomly
sampled at each replication, thus covering all possible factor-level combinations:
{CD↓TD↓, CD↓TD↑, CD↑TD↓, CD↑TD↑} for the former and {CS↓TD↓, CS↓TD↑,
CS↑TD↓, CS↑TD↑} for the latter, where ↓ and ↑ denote the "lower" and "upper" levels,
respectively.
4.3.3 Quantifying the Impact of Temporal Effects on ADC
In this section, we present how we applied the proposed methodology to quantify the
impact of temporal effects on ADC using, as experimental workload, the four ADC
algorithms (Rocchio, KNN, Naïve Bayes and SVM) and the three textual datasets
(ACM-DL, MEDLINE and AG-NEWS) presented in Section 4.1. In other words, we
performed a series of experiments, following the proposed methodology, for each
combination of ADC algorithm and reference dataset. As the number of available
documents in all three datasets is not enough to cover all 2³ partitions, we adopted the
strategy described in Section 4.3.2.4, conducting two separate 2²r designs in each case.
Recall that, by Definition 1, in order to define the TD↓ and TD↑ document groups, we
must determine the dominance threshold α to compute the stability periods. Different
values of α were evaluated and, as they lead to similar results, we fixed α = 50%, ensuring
that the terms will have a high degree of exclusivity with a single class. Furthermore, as
described in Section 4.1, the KNN and SVM classifiers have some tuning parameters. In
particular, one must define the number of nearest neighbors to be considered (parameter K)
to use KNN. The SVM parameters depend on the kernel function used. We chose the RBF
kernel function, since it yielded more stable results across replications than its linear
counterpart. For this classifier, the tuning parameters are the cost C of misclassification and
the shape of the RBF kernel function (parameter γ). All parameters were calibrated with a
cross-validation performed over the training set. We used the LibSVM implementation
(Chang and Lin, 2001), employing a one-against-one procedure to adapt the binary SVM
to the multi-class scenario, since this is the case in our reference datasets.
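For instance, this calibration step could be sketched with scikit-learn's SVC, which also wraps LibSVM and adopts a one-against-one scheme for multi-class problems. The grid values and the synthetic data are illustrative, not the dissertation's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a (training) document representation with 3 classes.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)

# Cross-validation over the training set to pick the cost C and the RBF shape γ.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
best_C, best_gamma = grid.best_params_["C"], grid.best_params_["gamma"]
```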
Next, we discuss the experimental design conducted for each reference dataset.
4.3. EXPERIMENTAL DESIGN 43
4.3.3.1 ACM-DL
The first step is to partition the ACM-DL documents into four groups for the CD×TD
design and four other groups for the CS×TD design. We do so by first partitioning them
into one pair of groups for each design, using the δ_CD and δ_CS thresholds. We set δ_CD
as the average CV of class sizes (Section 4.3.2.1), computed across all classes. Similarly,
we set δ_CS as the average CV of pooled similarities (Section 4.3.2.3). These thresholds,
along with the CV values of individual classes, are shown in Figure 4.4. We note that, to
compute δ_CS, we disregarded the CV associated with the General Literature class (id 0),
since, as shown in Figure 4.4b, it is significantly larger than the CVs of the other classes.
We believe that, for an initial assessment, this decision might not significantly impair our
analysis. The documents of class 0 were then assigned to the CS↑ partition.
Analyzing Figure 4.4, we can further understand why the ideal 2³ design could not be
conducted on the ACM-DL dataset. Let C_CD↑, C_CD↓, C_CS↑ and C_CS↓ denote the
sets of classes in partitions CD↑, CD↓, CS↑ and CS↓, respectively. As we can observe,
C_CD↑ ∩ C_CS↓ = ∅, whereas |C_CD↓ ∩ C_CS↑| = 1. As we need at least two classes in
each partition to proceed with the classification task, there are not enough documents to
fill all the cells of the ideal 2³r design. Figure 4.4 also shows that 3 out of the 4 classes
with high CS also present high CD, and all classes with low CS also have low CD. In
other words, there is a high correlation between these two factors, which supports our
decision to ignore a possible interaction between them, decoupling the analysis into two
separate 2²r factorial designs.
(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.4: Determining the Lower and Upper Levels of CD and CS — ACM-DL.
Next, we further subdivide each CD-based document partition according to the TD factor,
using the δ_TD threshold (Section 4.3.2.2). We do the same for the CS-based document
partitions. We set δ_TD equal to the average DSL value across all documents in each
partition. Figure 4.5 shows the distribution of DSL values and the δ_TD for each partition.
Documents from the CD↓ (or CS↓) partition with DSL lower than the corresponding δ_TD
are assigned to the CD↓TD↓ (or CS↓TD↓) group, whereas those with DSL higher than
δ_TD are assigned to the CD↓TD↑ (or CS↓TD↑) group. The same applies to the
documents from CD↑ and CS↑.
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.5: Determining the Lower and Upper Levels of TD — ACM-DL.
Recall that a 2^k r design requires r replications to be performed for each configuration
and, as discussed in the previous section, this can be achieved by employing either K-fold
cross-validation or repeated random sub-sampling. Due to the small size of the ACM-DL
dataset and the use of sampling to isolate the sampling effect, we use the repeated random
sub-sampling strategy, selecting 50% of the documents to compose the test set and
retaining the remainder for the training set. We performed r = 50 replications.
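The replication protocol can be sketched as follows; this is a generic sketch in which the classifier evaluation inside each replication is omitted.

```python
import random

def random_subsampling_splits(docs, r=50, test_fraction=0.5, seed=0):
    """Yield r independent train/test splits (repeated random sub-sampling)."""
    rng = random.Random(seed)
    n_test = int(len(docs) * test_fraction)
    for _ in range(r):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Toy usage: 3 replications over 10 documents, 50/50 split.
splits = list(random_subsampling_splits(list(range(10)), r=3))
```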
Table 4.5 shows the results of both factorial designs (CD×TD and CS×TD) for each ADC
algorithm (first column). For better presentation, we represent CD (CS) as A and TD as B
for the CD×TD (CS×TD) design. For each algorithm and design, the "%Var" row lists the
percentage of variation in classification effectiveness that can be explained by each effect
q_f (f ∈ {A, B, AB}) and by experimental errors (ε). Similarly, the "Mean" row denotes
the estimated coefficients of the model, capturing the "average" impact of each factor:
positive values indicate an increase in classification effectiveness and negative values
indicate the opposite. Note that q_0 refers to the grand mean, computed over all
observations. The "99% CI" rows report the 99% confidence intervals associated with the
grand mean q_0 and each effect q_f (f ∈ {A, B, AB}). Intervals that include zero indicate
a statistically non-significant impact of the associated factors. Finally, the "R²" column
reports the coefficient of determination of the proposed model: values close to 1 indicate a
well-fitted model. Similar tables, referred to as ANOVA (ANalysis Of VAriance) tables,
will be used to summarize the results obtained with the other datasets as well. We leave to
Section 4.4 a detailed discussion of all results.
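The effect estimation behind these tables can be sketched for a single 2²r design as follows. This is a textbook computation in the spirit of the tables (confidence-interval estimation is omitted), and the toy responses are hypothetical.

```python
from statistics import mean

def factorial_22r(y):
    """Effect and %-of-variation estimation for a 2^2·r factorial design,
    following the model y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε.

    y[(xA, xB)]: list with the r replicated responses observed at each
    factor-level combination (xA, xB ∈ {-1, +1})."""
    combos = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]
    r = len(y[combos[0]])
    ybar = {c: mean(y[c]) for c in combos}
    q0 = mean(ybar[c] for c in combos)
    qA = mean(xa * ybar[(xa, xb)] for (xa, xb) in combos)
    qB = mean(xb * ybar[(xa, xb)] for (xa, xb) in combos)
    qAB = mean(xa * xb * ybar[(xa, xb)] for (xa, xb) in combos)
    # Each effect explains 4·r·q_f² of the total variation (SST).
    ss = {"A": 4 * r * qA**2, "B": 4 * r * qB**2, "AB": 4 * r * qAB**2}
    sse = sum((v - ybar[c])**2 for c in combos for v in y[c])
    sst = sum(ss.values()) + sse
    pct = {f: 100 * s / sst for f, s in ss.items()}
    pct["err"] = 100 * sse / sst
    return {"q0": q0, "qA": qA, "qB": qB, "qAB": qAB,
            "pct_var": pct, "R2": 1 - sse / sst}
```

For noiseless toy responses that degrade at the upper level of each factor, the estimated qA and qB come out negative, matching the sign convention of the tables.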
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA                qB                 qAB             ε       R²
Rocchio  CD×TD   %Var    −               40.41%            53.83%             1.80%           3.96%   0.96
                 Mean    64.55           −10.38            −11.98             2.19            −
                 99% CI  [62.93, 66.17]  [−12.00, −8.76]   [−13.60, −10.36]   [0.57, 3.81]    −
         CS×TD   %Var    −               50.86%            40.97%             3.72%           4.45%   0.95
                 Mean    66.86           −7.95             −7.13              −2.15           −
                 99% CI  [65.69, 68.04]  [−9.12, −6.78]    [−8.31, −5.96]     [−3.32, −0.98]  −
KNN      CD×TD   %Var    −               48.61%            44.77%             2.17%           4.45%   0.95
                 Mean    65.40           −12.90            −12.38             2.73            −
                 99% CI  [63.45, 67.34]  [−14.84, −10.95]  [−14.32, −10.43]   [0.78, 4.67]    −
         CS×TD   %Var    −               50.64%            44.16%             2.26%           2.93%   0.97
                 Mean    65.83           −9.14             −8.54              −1.93           −
                 99% CI  [64.74, 67.34]  [−10.24, −8.05]   [−9.63, −7.44]     [−3.03, −0.84]  −
NB       CD×TD   %Var    −               42.92%            51.39%             1.70%           3.98%   0.96
                 Mean    63.56           −11.41            −12.48             2.27            −
                 99% CI  [61.83, 65.29]  [−13.14, −9.68]   [−14.21, −10.75]   [0.54, 4.00]    −
         CS×TD   %Var    −               34.38%            53.90%             3.84%           7.87%   0.92
                 Mean    62.50           −5.97             −7.47              −1.99           −
                 99% CI  [61.08, 63.92]  [−7.39, −4.55]    [−8.90, −6.05]     [−3.42, −0.57]  −
SVM      CD×TD   %Var    −               64.67%            33.48%             0.21%           1.64%   0.98
                 Mean    58.44           −17.16            −12.35             0.98            −
                 99% CI  [57.08, 59.80]  [−18.52, −15.80]  [−13.71, −10.98]   [−0.38, 2.34]   −
         CS×TD   %Var    −               61.51%            31.28%             3.99%           3.22%   0.97
                 Mean    60.98           −10.81            −7.71              −2.75           −
                 99% CI  [59.75, 62.21]  [−12.04, −9.58]   [−8.94, −6.48]     [−3.98, −1.52]  −

Table 4.5: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for ACM-DL
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
4.3.3.2 MEDLINE
In order to build the four document partitions of each factorial design for the MEDLINE
dataset, we follow the same strategy adopted for ACM-DL. First, we partition the
documents regarding the CD and CS factors, setting δ_CD and δ_CS as the average CV
measures computed for each factor. These partitions and corresponding thresholds are
shown in Figure 4.6. As with the General Literature class in ACM-DL and the CS-based
partition, we here choose to ignore the Aids class (id 0) to define δ_CD, assigning its
documents to the CD↑ partition. We then further subdivide each of these partitions
according to the TD factor, setting δ_TD as the average DSL of all documents in each
partition, as depicted in Figure 4.7.

(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.6: Determining the Lower and Upper Levels of CD and CS — MEDLINE.
Note that, according to Figure 4.6, C_CD↑ ∩ C_CS↓ = ∅ and |C_CD↓ ∩ C_CS↑| = 1.
Thus, the argument for the unfeasibility of a three-factor experimental design applied to
ACM-DL also holds for MEDLINE. However, the figure also shows that 3 out of the 4
classes with high CS also have high CD, and all classes with low CS also have low CD.
Thus, once again, there is a high correlation between both factors, motivating our approach
to decouple the three-factor design into two independent 2-factor analyses.
Since the MEDLINE dataset is quite large (over 800 thousand documents), we are able to
replicate each experiment by performing a 10-fold cross-validation, as the test set is
sufficiently large to achieve stable results among the replications. The results achieved
with both factorial designs (CD×TD and CS×TD), considering each ADC algorithm, are
summarized in Table 4.6, and will be analyzed in Section 4.4.
4.3.3.3 AG-NEWS
Finally, the same overall procedure is also adopted to build the two 2²r experimental
designs for AG-NEWS. We partition the documents with respect to the CD and CS factors,
using δ_CD and δ_CS values equal to the corresponding average CV values, as shown in
Figure 4.8. Once again, we choose to disregard the Top Stories class (id 10) from the
δ_CD computation, as it
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.7: Determining the Lower and Upper Levels of TD — MEDLINE.
presents a much higher CV in comparison to the other 9 classes. We assign its documents
to the CD↑ partition. Regarding the computation of δ_CS, Figure 4.8b shows that, unlike
in the previous cases, not one but two classes, namely Top Stories (id 10) and Italia (id 9),
have much larger average CV values. Instead of disregarding both measures, we adopt a
different strategy to deal with these large deviations, smoothing their impact on the final
computation. We first compute an average CV across classes 9 and 10; let it be CV_{9,10}.
We then take as δ_CS the overall average computed over the average CVs of all remaining
classes (shown in Figure 4.8b) and CV_{9,10}.
Similarly to the other two datasets, Figure 4.8 shows that the number of documents in
AG-NEWS is not enough to fill all partitions of the ideal 2³ experimental design, as
|C_CD↓ ∩ C_CS↑| = 1. Thus, the lack of enough samples to fill the CD↓CS↑ partition
prevents us from performing a complete three-factor design. The figure also shows that 2
out of the 3 classes with high CS also have high CD, and 3 out of the 4 classes with low
CS also have low CD, indicating, once again, that both factors are very correlated.
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA               qB               qAB             ε       R²
Rocchio  CD×TD   %Var    −               92.57%           5.77%            0.85%           0.80%   0.99
                 Mean    79.97           −6.28            −1.57            0.60            −
                 99% CI  [79.68, 80.26]  [−6.57, −5.99]   [−1.86, −1.28]   [0.31, 0.89]    −
         CS×TD   %Var    −               87.96%           11.15%           0.00%           0.89%   0.99
                 Mean    81.59           −4.41            −1.57            −0.01           −
                 99% CI  [81.37, 81.81]  [−4.63, −4.19]   [−1.79, −1.35]   [−0.23, 0.21]   −
KNN      CD×TD   %Var    −               72.24%           25.68%           0.24%           1.84%   0.98
                 Mean    84.95           −3.48            −2.08            0.20            −
                 99% CI  [84.67, 85.22]  [−3.76, −3.20]   [−2.35, −1.80]   [−0.08, 0.48]   −
         CS×TD   %Var    −               76.84%           20.17%           0.36%           2.63%   0.97
                 Mean    84.39           −3.86            −1.98            −0.27           −
                 99% CI  [84.03, 84.74]  [−4.21, −3.50]   [−2.33, −1.62]   [−0.62, 0.09]   −
NB       CD×TD   %Var    −               76.01%           22.69%           0.01%           1.28%   0.99
                 Mean    86.49           −4.18            −2.28            0.06            −
                 99% CI  [86.22, 86.76]  [−4.45, −3.91]   [−2.55, −2.01]   [−0.21, 0.33]   −
         CS×TD   %Var    −               62.93%           35.70%           0.17%           1.20%   0.99
                 Mean    87.67           −2.57            −1.94            −0.13           −
                 99% CI  [87.49, 87.85]  [−2.75, −2.39]   [−2.11, −1.76]   [−0.31, 0.04]   −
SVM      CD×TD   %Var    −               76.33%           22.45%           0.26%           0.96%   0.99
                 Mean    86.19           −4.75            −2.58            −0.28           −
                 99% CI  [85.92, 86.45]  [−5.02, −4.49]   [−2.84, −2.31]   [−0.54, −0.01]  −
         CS×TD   %Var    −               61.90%           35.14%           0.93%           2.03%   0.98
                 Mean    87.90           −2.68            −2.02            −0.33           −
                 99% CI  [87.66, 88.14]  [−2.92, −2.44]   [−2.26, −1.78]   [−0.57, −0.09]  −

Table 4.6: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for MEDLINE
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.8: Determining the Lower and Upper Levels of CD and CS — AG-NEWS.
We replicate each experiment by performing a 10-fold cross-validation since, similarly to
MEDLINE, AG-NEWS is also a very large dataset. Table 4.7 summarizes the results,
which are discussed in the next section.
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.9: Determining the Lower and Upper Levels of TD — AG-NEWS.
4.4 Discussion
Having presented our methodology to analyze the impact of temporal effects on ADC
algorithms and illustrated its applicability to four algorithms and three reference datasets,
we now discuss our results, reported in Tables 4.5–4.7. Recall that, when analyzing the
results of a specific experimental design, the impact of each factor on the response variable
is captured by the percentage of variation explained by it ("%Var" in the ANOVA tables).
However, when comparing results across different designs, as we do here, it is important
also to analyze the effect q_f associated with each factor f and its relative impact on the
grand mean q_0. Since the total variation of the responses (SST) may vary across different
designs, the relative impact of each q_f on the grand mean q_0 allows a fairer comparison
of the impact of each factor on the results across the designs. Ultimately, it represents the
extent to which classification effectiveness improves or degrades, depending on the sign of
q_f, when factor f is at its higher or lower level.
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA                 qB                qAB              ε       R²
Rocchio  CD×TD   %Var    −               55.56%             44.09%            0.30%            0.06%   0.99
                 Mean    77.12           −11.17             −9.95             0.82             −
                 99% CI  [76.94, 77.30]  [−11.35, −10.99]   [−10.13, −9.77]   [0.64, 1.00]     −
         CS×TD   %Var    −               51.11%             46.69%            2.10%            0.09%   0.99
                 Mean    76.83           −10.52             −10.05            2.13             −
                 99% CI  [76.61, 77.05]  [−10.74, −10.30]   [−10.28, −9.83]   [1.91, 2.35]     −
KNN      CD×TD   %Var    −               80.18%             18.56%            1.21%            0.06%   0.99
                 Mean    84.29           −11.24             −5.41             −1.38            −
                 99% CI  [84.14, 84.44]  [−11.38, −11.09]   [−5.55, −5.26]    [−1.53, −1.23]   −
         CS×TD   %Var    −               80.29%             18.63%            1.01%            0.06%   0.99
                 Mean    85.57           −10.43             −5.02             −1.17            −
                 99% CI  [85.43, 85.72]  [−10.57, −10.28]   [−5.17, −4.88]    [−1.32, −1.03]   −
NB       CD×TD   %Var    −               68.16%             31.78%            0.003%           0.06%   0.99
                 Mean    82.86           −10.12             −6.91             −0.07            −
                 99% CI  [82.71, 83.01]  [−10.27, −9.97]    [−7.06, −6.76]    [−0.22, 0.08]    −
         CS×TD   %Var    −               81.05%             18.67%            0.19%            0.09%   0.99
                 Mean    84.61           −10.71             −5.14             −0.53            −
                 99% CI  [84.42, 84.78]  [−10.89, −10.53]   [−5.32, −4.96]    [−0.70, −0.34]   −
SVM      CD×TD   %Var    −               80.98%             17.76%            1.15%            0.11%   0.99
                 Mean    85.91           −10.32             −4.83             −1.23            −
                 99% CI  [85.71, 86.10]  [−10.51, −10.12]   [−5.03, −4.64]    [−1.42, −1.03]   −
         CS×TD   %Var    −               83.37%             14.54%            2.00%            0.09%   0.99
                 Mean    87.75           −9.63              −4.02             −1.49            −
                 99% CI  [87.58, 87.90]  [−9.78, −9.47]     [−4.18, −3.86]    [−1.65, −1.33]   −

Table 4.7: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for AG-NEWS
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
We start with two general observations. First, across all reference datasets and ADC
algorithms, our experimental designs are successful in isolating the parameters that are the
target of the study: the analyzed temporal effects explain the vast majority of the variations
observed in the results. Indeed, the percentages of variation left unexplained, and thus
credited to experimental errors (column ε), are under 8%, 3% and 1% for ACM-DL,
MEDLINE and AG-NEWS, respectively. The larger variations left unexplained for the
ACM-DL dataset are possibly due to the fact that this dataset is much smaller than the
other two (small sample sizes incur greater variability). However, as we can observe in
Table 4.5, the percentages of variation credited to experimental errors are inferior to the
percentages credited to the temporal factors. Consistently, the coefficient of determination
R² is above 0.95 in most cases.
Our second general observation is that the percentages of variation explained by the
secondary factors (column q_AB), i.e., the interactions between CD and TD for the
CD×TD design and between CS and TD for the CS×TD design, are very small across all
datasets and algorithms, falling below 4%, 1% and 2.2% for the ACM-DL, MEDLINE and
AG-NEWS datasets, respectively. Indeed, the effect of this interaction is statistically
insignificant, with 99% confidence, in many of these cases (see the "99% CI" rows). If
significant, the effect associated with the interaction is often negative, implying that it
contributes to a degradation in classification effectiveness, although the magnitude of such
degradation is very small (up to 4.51%, 0.37% and 1.70%, on average, with respect to the
overall average performance reported in q_0, for ACM-DL, MEDLINE and AG-NEWS,
respectively). In a few cases, the effect of the interaction is positive, implying that it
actually contributes to improve classification effectiveness. We conjecture that this is a
side effect of the interactions between the CD and CS factors that are not captured by our
pairwise experimental designs. In other words, the positive interaction is possibly due to a
few classes having CD↓ and CS↑ in all three datasets, as argued in
Sections 4.3.3.1–4.3.3.3. Nevertheless, even when positive, the effect due to the interaction
of multiple factors is very small, with an impact on the grand mean of at most 4.17%,
0.75% and 2.77%, on average, for ACM-DL, MEDLINE and AG-NEWS, respectively.
Thus, we argue that the primary factors CD, CS and TD are the main sources of impact on
classification effectiveness across all analyzed scenarios, and we focus our discussion on
them.
In the following, we analyze specific results for each reference dataset, discussing the
overall behavior observed across all ADC algorithms in Section 4.4.1. We then discuss the
results for each specific ADC algorithm, pointing out invariants across datasets and
drawing insights into the influence of the temporal effects on each algorithm in
Section 4.4.2. Finally, in Section 4.4.3, we summarize the main implications of our
findings.
4.4.1 Impact of Temporal Effects on the Reference Datasets
We start by analyzing the relative impact of the temporal factors (q_f) on the average
effectiveness q_0 of the ADC algorithms in each dataset, given by the ratio q_f / q_0. As
Tables 4.5–4.7 show, the effects associated with the temporal factors (i.e., columns q_A
and q_B) represent an impact on the average effectiveness of the ADC algorithms (i.e.,
column q_0) that falls, on average, between 9.55% and 29.36% of q_0 in the ACM-DL
dataset, 1.92% and 7.85% of q_0 in the MEDLINE dataset, and 4.58% and 14.48% of q_0
in the AG-NEWS dataset. Thus, the impact of the temporal effects on classification is
much higher in the ACM-DL and AG-NEWS datasets than in the MEDLINE dataset. This
observation is consistent with the characterization reported in Section 4.2, which points out
a more stable behavior of MEDLINE in contrast to the more dynamic nature of ACM-DL
and AG-NEWS. It is also consistent with the qualitative analysis reported in (Mourão
et al., 2008), which showed that: (i) once a term appears, it tends to remain more stable
over time in MEDLINE than in the other two datasets, thus implying a smaller impact of
TD on the classification of the former; and (ii) the more consolidated knowledge area
captured in the MEDLINE dataset justifies the smaller impact of CD and CS on it.
To further corroborate these findings, we performed a two-sided Mann-Whitney test
(Hollander and Wolfe, 1999) to compare the coefficients of variation (CVs) of class sizes
(i.e., regarding CD) computed for the three reference datasets.⁷ Recall that the CV values
are reported in Figures 4.4a, 4.6a and 4.8a for ACM-DL, MEDLINE and AG-NEWS,
respectively. With 99% confidence, we found that the CVs of class sizes in the MEDLINE
dataset are indeed smaller than those computed for the ACM-DL and AG-NEWS datasets
(p-values of 0.001 and 0.005, respectively). Comparing the CV values computed for
ACM-DL and AG-NEWS, we found that both samples are statistically indistinguishable
(p-value of 0.24). Thus, we write CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS to
refer to the relative impact of CD in each dataset.
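Such a comparison can be sketched with SciPy's two-sided Mann-Whitney U test. The CV samples below are illustrative stand-ins for the values of Figures 4.4a and 4.6a, not the actual measurements.

```python
from scipy.stats import mannwhitneyu

# Hypothetical CV samples: a stable dataset versus a more dynamic one.
cv_medline = [0.05, 0.08, 0.06, 0.07, 0.04]
cv_acm_dl = [0.30, 0.45, 0.25, 0.50, 0.40]

stat, p_value = mannwhitneyu(cv_medline, cv_acm_dl, alternative="two-sided")
significant = p_value < 0.01  # reject equality with 99% confidence
```

With these fully separated toy samples, the exact two-sided p-value is below 0.01, so the difference would be deemed significant at the 99% level.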
The same test was performed to compare the CVs of pooled similarities (i.e., related to
CS), reported in Figures 4.4b, 4.6b and 4.8b. Once again, we found that the CV values
computed for the MEDLINE dataset are smaller than those obtained for ACM-DL and
AG-NEWS (p-values of 0.005 and 0.0006, respectively), whereas no statistical difference
was observed between the values computed for ACM-DL and AG-NEWS (p-value of
0.15). Thus, we state that CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS.
To compare the impact of TD on the three datasets, we show, in Figure 4.10, the empirical
cumulative distribution of the observed document stability level (DSL) values, for each
level of CD and CS and for each reference dataset. The curves for MEDLINE show a clear
bias towards higher DSL levels, thus indicating a smaller impact of TD. The curves for
both ACM-DL and AG-NEWS exhibit a much stronger bias towards less stable
documents, exposing the more dynamic nature of these datasets. We note that, for the
MEDLINE dataset, the bias towards more stable documents is stronger for the CD↑ and
CS↑ levels. In other words, the partitions with higher temporal variations in CD and CS
tend to have more stable documents, in comparison with the partitions with lower
variations on the two effects. This behavior is a peculiarity of the MEDLINE dataset, not
being observed in either the ACM-DL or the AG-NEWS dataset. Once again, we applied
the Mann-Whitney test, finding that the DSL values are indeed larger for MEDLINE than
for ACM-DL and AG-NEWS (p-values smaller than 10⁻⁵), and that DSL values are larger
in ACM-DL than in AG-NEWS (p-value < 10⁻⁵). Thus, we state that
TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.
Having compared the relative impact of each temporal aspect across the three datasets, we
now analyze the relative impact of the three temporal effects within each dataset. For the
ACM-DL dataset, we cannot distinguish, with 99% confidence, the relative impact of CD
(or CS) from the relative impact of TD on most ADC algorithms. One exception is
observed when SVM is used, for which the impact of TD is smaller than the impact of the
other two effects.

⁷ We chose a nonparametric test since the CV values regarding the CD aspect are not
normally distributed.

Indeed, the effect associated with TD is 38.95% (40.21%) smaller than the effect of
CD (CS). Another exception is Naïve Bayes, for which the effect associated with TD is
statistically different from, and somewhat larger (25.13%) than, the effect of CS. We reach
this finding by analyzing the 99% confidence intervals for the effects of each factor ("99%
CI" rows in Table 4.5), which show statistical ties between q_TD and q_CD (or q_CS) for
the Rocchio and KNN classifiers, as well as a statistical tie between q_TD and q_CD for
Naïve Bayes. In contrast, for MEDLINE and AG-NEWS, the impact of TD is consistently
lower than the impact of CD (or CS) on all four algorithms. For instance, in the case of
MEDLINE, the impact of TD is four times smaller than the impact of CD, and almost two
times smaller than the impact of CS, considering the Rocchio classifier. In the case of
AG-NEWS, there is also a pronounced skew between the impact of TD and the impact of
CD/CS, with cases where the impact of CD and CS is almost double the impact of TD.
Similar conclusions are reached by analyzing the percentages of variation explained by
each individual factor. These findings reveal the challenges imposed by the temporal
effects, and developing strategies to handle them in ADC algorithms emerges as a
promising research direction.
Note that, except for Naïve Bayes on the ACM-DL dataset, CD's impact on classification
is higher than TD's impact if and only if CS's impact is also higher than TD's. This should
come as no surprise given the strong positive correlation between both factors, as
discussed in Section 4.3.3. Temporal variations in class sizes directly impact temporal
variations in class vocabularies, and ultimately the similarities across classes. For instance,
if a class increases in size with time, the number of candidate terms in its vocabulary may
also increase, causing more variations in its similarities with the remaining classes. Thus,
temporal variations in class distribution contribute to variations in class similarities,
justifying the high correlation between both factors.
4.4.2 Impact of Temporal Effects on the ADC Algorithms
We now turn our attention to the impact of the temporal effects on the ADC algorithms.
As we can observe from the three ANOVA tables, all three factors have negative effects
(columns qA and qB) in all analyzed scenarios, implying that all explored ADC algorithms
are negatively impacted by the temporal effects in all three datasets. In fact, relative to the
overall average performance (q0), the effect of CD contributes to an average decrease in
classification effectiveness of as much as 29.36% (for the SVM classifier). Similarly, higher levels
of CS and TD contribute to classification degradations of as much as 17.73% and 21.13%
(also for the SVM classifier), on average. Moreover, the degradation is more significant for
the reference datasets in which the impact of the temporal effects is stronger, i.e., ACM-DL
54 CHAPTER 4. A QUANTITATIVE ANALYSIS OF TEMPORAL EFFECTS ON ADC
Figure 4.10: Cumulative Distribution Function of Document Stability Level Values. Panels: (a) Low CD, (b) High CD, (c) Low CS, (d) High CS.
and AG-NEWS, as expected. In the following, we discuss the impact on each specific
algorithm, focusing on the results for the ACM-DL dataset (Table 4.5), as it is the one most
influenced by all three temporal effects.
Starting with the Rocchio classifier, we observe that all three temporal effects
greatly impact classification effectiveness, with more than 40% of the observed variations
allocated to each of them in both experimental designs. Indeed, the factors contribute to a
significant degradation in the overall classification effectiveness in each design. For instance,
in the CD×TD design, a higher level of CD incurs an average degradation of 16.08% in the
average performance, whereas the degradation caused by a higher level of TD is 18.56%, on
average. Similarly, in the CS×TD design, the corresponding degradations due to higher levels of
CS and TD are 11.89% and 10.66%, on average. The reasons for such a significant impact on
Rocchio's performance are the following. CD and CS affect the coordinates of the centroids
learned by the Rocchio classifier: as Miao and Kamel (2011) pointed out, the centroid vector
does not take the distribution of class sizes into account, and thus may be affected by variations
in that distribution. Since the distribution of class sizes in the entire dataset may not
be the same as the corresponding distribution observed when the test document was created,
the classifier's prediction may be error prone. TD also significantly affects this classifier
since, when averaging the vectors of each class to compute the class centroids, it considers
all training points to determine the class of a test document. Thus, it may be affected by
variations in the term-class relationships.
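To make the centroid's sensitivity concrete, the sketch below is a minimal Rocchio classifier (illustrative names; the term weighting and normalization details of the actual experiments are omitted). The centroid is a plain average of the class's document vectors, so shifts over time in class sizes or in term usage move the centroids directly:

```python
import math
from collections import defaultdict

def train_rocchio(docs):
    """Rocchio: one centroid per class, the average of its documents' term
    vectors. The centroid ignores the class-size distribution, which is why
    temporal changes in that distribution (CD) perturb it."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for terms, label in docs:  # terms: {term: weight}
        counts[label] += 1
        for t, w in terms.items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}

def classify_rocchio(centroids, terms):
    """Assign the class whose centroid is most cosine-similar to the test vector."""
    def cos(u, v):
        dot = sum(u.get(t, 0.0) * w for t, w in v.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return max(centroids, key=lambda c: cos(centroids[c], terms))
```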
Similarly, the KNN and Naïve Bayes classifiers are also greatly impacted by the three
temporal effects, with more than 44% and 34%, respectively, of the total variations allocated
to each of them. Indeed, both classifiers have a bias regarding the distribution of class
sizes. In the KNN classifier, larger classes tend to have more documents in the K-neighbor
set of each test document (Tan, 2005). The Naïve Bayes classifier, in turn, tends to privilege
larger classes due to the class prior probability expressed in Equation 4.1: when the class
sizes are similar, this classifier uses a priori information to break ties, being directly affected
by variations in the distribution of class sizes. Thus, similarly to Rocchio, CD affects both
classifiers' decision boundaries: since the distribution of class sizes in the entire
dataset may not reflect the distribution when the test document was created, both classifiers
may make wrong predictions. In fact, the average degradations incurred by the CD effect
in KNN and Naïve Bayes are 19.72% and 17.95% in the average response, respectively.
Moreover, CS affects the KNN classifier (with an average decrease of 13.88% in the average
performance) because it directly perturbs the K-nearest-neighbor set: due to
differences in the pairwise class similarities, this set may be composed of classes that were
similar at different points in time. Naïve Bayes, in turn, is affected by CS (with an average
decrease of 9.55% in the average performance) because this classifier considers a somewhat
local metric to assess the relationships between terms and classes, expressed by the term
conditional probability P(t|c). As discussed in Section 4.3.2.3, estimating P(t|c) ultimately
searches a subset of the class vocabularies and, when vocabularies change over time, the
decision rule also changes. Finally, TD significantly impacts both classifiers because both of
them consider the terms present in all training points to determine the decision boundaries.
In KNN, the impact of TD takes place when building the K-neighbor set, while in Naïve
Bayes it occurs when computing the maximum likelihood estimates (MLE) from
the training set, more specifically the term conditional probabilities P(t|c). Thus, both
classifiers are also sensitive to variations in the term-class relationships.
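A minimal multinomial Naïve Bayes sketch makes the two ingredients above explicit: the class prior P(c), which favors larger classes and therefore tracks the CD effect, and the term conditional P(t|c), which tracks the TD/CS effects. Names and the Laplace smoothing choice are illustrative, not the dissertation's exact Equation 4.1:

```python
import math
from collections import defaultdict

def train_nb(docs):
    """Multinomial Naïve Bayes. docs is a list of (term_set, label) pairs."""
    tf = defaultdict(lambda: defaultdict(int))  # term frequency per class
    cdocs = defaultdict(int)                    # documents per class
    vocab = set()
    for terms, label in docs:
        cdocs[label] += 1
        for t in terms:
            tf[label][t] += 1
            vocab.add(t)
    return tf, cdocs, vocab, sum(cdocs.values())

def classify_nb(model, terms):
    tf, cdocs, vocab, n = model
    def score(c):
        total = sum(tf[c].values())
        s = math.log(cdocs[c] / n)  # prior P(c): biased towards large classes
        for t in terms:             # Laplace-smoothed P(t|c)
            s += math.log((tf[c][t] + 1) / (total + len(vocab)))
        return s
    return max(cdocs, key=score)
```

With an empty test document the decision is made by the prior alone, illustrating how the class-size distribution seeps into every prediction.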
The reader should note the peculiar behavior of the Naïve Bayes classifier on the ACM-DL
dataset regarding the impacts of the TD and CS effects. Unlike the other three classifiers,
the degradation caused by the TD effect is somewhat larger (25.13%) than the degradation
incurred by the CS effect. This can be explained by the nature of this classifier. Recall
that Rocchio, KNN and SVM are examples of discriminative classifiers, which directly
model P(c|d) by learning the class boundaries that minimize the error rate (or some correlated
measure), ultimately discriminating between classes (that is, they learn class borders).
On the other hand, as a generative classifier, Naïve Bayes learns P(c|d) indirectly, by applying
Bayes' rule and estimating both P(c) and P(d|c) (Rasmussen and Williams, 2006). In
fact, the class conditional P(d|c) is influenced by variations in the term-class relationships
(hence, by the TD and CS effects). However, as discussed in Section 4.3.2.3, unlike the TD
effect, the CS effect is bounded by the variations in the relationships between the most
informative terms and the classes as time goes by. Recall that the vocabulary Vc,p denotes the
set of terms that occurred in class c at the point in time p. Let Vc be the set of all terms that
occurred in class c, irrespective of the point in time (that is, the class vocabulary). Since Vc,p is
smaller than Vc, it is expected that in the generative case the influence of TD dominates the
influence of CS. For the discriminative case, however, minimizing the error rate (or some
correlated measure) bounds, to a certain extent, the influence of the variations in the term-class
relationships to the most discriminative terms. Thus, the discriminative classifiers are
naturally less sensitive to the TD effect than to the CS effect.
Considering SVM, CD and CS each explain more than 60% of the variations in
classification effectiveness in both experimental designs. TD, in contrast, is responsible for at
most 33% of the explained variation. Thus, CS and CD do have a more significant impact
on SVM's effectiveness than TD. The reasons for this are the following. First, variations in
the distribution of class sizes lead to boundary hyperplane skewness (see Sun et al., 2009),
potentially misleading the classification decisions when considering data distributed over
several points in time with a changing distribution. Due to the KKT condition expressed by
Equation 4.5, the increase of some αi on the positive side of the hyperplane forces an increase
of some αi on the negative side to satisfy that constraint; due to possible imbalances in the
distribution of class sizes, either of them may receive a higher value, causing the hyperplane
to be skewed towards the smaller class. Thus, clearly, CD has a strong impact on this
classifier, and so does CS, given that the two effects are strongly correlated. In contrast,
SVM has a natural robustness to the TD aspect: only the support points are taken into
account during the test phase to determine the classes, ultimately limiting the impact
of TD during the classification process. As expressed by Equation 4.4, only the points with
αi > 0 (the support points) affect the decision rule, this being the only step in which TD impacts
the classification process.
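The dual decision rule of Equation 4.4, f(x) = Σi αi yi K(xi, x) + b, can be sketched as below; only support points (αi > 0) contribute, which is the source of SVM's relative robustness to TD at test time. This is an illustrative reimplementation with hypothetical names, not the dissertation's code:

```python
def svm_decision(support, x,
                 kernel=lambda u, v: sum(a * b for a, b in zip(u, v)),
                 b=0.0):
    """SVM decision value f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    `support` is a list of (alpha_i, y_i, x_i) triples; points with
    alpha_i == 0 are skipped, so only the support points matter.
    The default kernel is the linear (dot-product) kernel.
    """
    return sum(a * y * kernel(xi, x) for a, y, xi in support if a > 0) + b
```

Dropping a non-support training point (αi = 0) leaves the decision value unchanged, whereas Rocchio, KNN and Naïve Bayes aggregate evidence from all training points.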
Turning our attention to the results for MEDLINE and AG-NEWS, reported in Tables
4.6 and 4.7, we find that, in both datasets, the impacts of CD and CS on all four ADC
algorithms are consistently higher than the impact of TD. This should come as no surprise
since, as discussed in Section 4.4.1, these datasets are more influenced by the CD/CS
effects than by TD.
We summarize our findings on the impact of the temporal effects on the four ADC
algorithms in Table 4.8, which shows a partial ordering of the algorithms with respect to
the average impact of each temporal effect, for each dataset. This ordering is determined
by taking the overall average performance of each algorithm (q0) as a baseline, analyzing the
effect associated with each factor qf (along with its corresponding 99% confidence interval),
and its relative difference to the overall average. As we can see, the Rocchio classifier is
the most affected by all three effects in both MEDLINE and AG-NEWS. SVM is also strongly
affected, particularly in ACM-DL and MEDLINE, with both CD and CS impacting
SVM more than the other three algorithms in ACM-DL. The impact of TD, on the other hand,
is approximately the same on all four algorithms in ACM-DL. These relationships reinforce
that, apart from being negatively impacted by all three temporal effects, the four explored
ADC algorithms exhibit distinct behavior when faced with datasets with specific temporal
dynamics, as revealed by the conducted factorial designs.
Temporal Effect | ACM-DL              | MEDLINE             | AG-NEWS
CD              | SVM > NB ∼ KNN ∼ RO | RO > SVM > NB > KNN | RO ∼ KNN > SVM ∼ NB
CS              | SVM > KNN ∼ RO > NB | RO > SVM ∼ NB > KNN | RO ∼ KNN ∼ NB > SVM
TD              | SVM ∼ KNN ∼ RO ∼ NB | SVM > RO ∼ NB ∼ KNN | RO > NB > KNN > SVM

Table 4.8: A Comparative Study of the Impact of the Temporal Effects on each ADC Algorithm: Rocchio (RO), SVM, Naïve Bayes (NB) and KNN.
4.4.3 Implications
The analyses performed in the previous sections provide some general guidelines regarding
the definition of requirements for strategies to deal with temporal effects in ADC. First, these
strategies should consider stable terms, since they untangle some latent structural properties
of the classes. It may be tempting to consider only training documents created at (or near)
the creation time of the test document (a window-based approach) to define class boundaries.
However, such a strategy may not be a wise choice, since it can lead to data sparseness
problems. Moreover, it may also discard valuable information regarding stable terms occurring in
training documents created at points in time other than the test document's creation time,
which may reveal discriminative evidence about the classes' structural properties. Even
when considering training documents created at different points in time with respect to the
test document, stable terms can still provide valuable information to the classifier. Such
information, however, is neglected when adopting window-based strategies. This motivates
the use of instance weighting strategies, especially when dealing with more stable datasets,
such as MEDLINE. However, in order for this strategy to be successful, the weighting
function must capture the underlying process that guides the temporal evolution of the dataset.
Furthermore, not only the stability of terms over time should be explored, but also the
variations in the distributions of class sizes and class similarities.
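The contrast between the two families of strategies above can be sketched as follows (hypothetical function names; the TWF introduced in Chapter 5 would play the role of the `twf` argument):

```python
def window_select(train, p_ref, width):
    """Window-based approach: keep only the training documents created
    within `width` time units of the test document's creation time p_ref.
    This risks data sparseness and discards stable terms from other periods.
    `train` is a list of (document, creation_time) pairs."""
    return [(doc, p) for doc, p in train if abs(p - p_ref) <= width]

def instance_weights(train, p_ref, twf):
    """Instance weighting: keep every training document, weighted by a
    temporal weighting function twf(delta) of its temporal distance to
    p_ref, so distant but stable evidence is attenuated, not discarded."""
    return [(doc, twf(p - p_ref)) for doc, p in train]
```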
Similarly, the proposed methodology to evaluate the impact of the temporal effects on
classification effectiveness provides valuable insights to better understand the behavior not
only of the considered ADC algorithms when faced with these effects, but also of strategies
aimed at overcoming them. This aspect will be taken into account when analyzing the
behavior of the temporally-aware algorithms proposed in Chapter 5. In fact, the analyses and
methodologies presented here allow us to develop a deeper understanding of the results
reported in Section 5.4.
4.5 Chapter Summary
In this chapter, we proposed a methodology, based on a series of full factorial designs, to
evaluate the impact of temporal aspects on ADC algorithms when applied to several reference
datasets. First, we extended the characterization performed by Mourão et al. (2008),
providing evidence of the existence of three temporal aspects in three textual datasets,
namely ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to
quantify the impact of the temporal aspects on the classification effectiveness of four well-known
ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.
Our characterization results show that, contrary to the assumption of a static data
distribution on which all four explored algorithms are based, each reference dataset has a specific
temporal behavior, exhibiting changes in the underlying data distribution over time. Such
temporal variations potentially limit classification performance. According to our results,
the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE dataset,
resulting in the four explored ADC algorithms being more impacted by the temporal aspects
in the first two datasets. In addition to these findings, our proposed methodology enabled
us to quantify the impact of each temporal aspect on the analyzed datasets and algorithms,
allowing us to answer the two following questions, posed in this chapter:
1. Which temporal effects are more representative in each dataset? In the ACM-DL
dataset, the impact of the observed temporal variations in the distribution of class
sizes and in the pairwise class similarities is statistically equivalent to the impact
of the observed variations in the term distribution on most classifiers (SVM being an
exception). MEDLINE and AG-NEWS, on the other hand, are clearly more impacted
by the first two temporal aspects. These findings reveal the challenges imposed by the
temporal effects and show that developing strategies to handle them in ADC algorithms is a
promising research direction.
2. What is the behavior of each ADC algorithm when faced with different levels of each
temporal aspect? All four explored ADC algorithms suffer a negative impact of the
temporal aspects in terms of classification effectiveness, with the most significant
impacts observed when these algorithms are applied to the most dynamic datasets
(i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to
the term distribution aspect, while still being impacted by the other two aspects. The
other three algorithms, on the other hand, are very sensitive to all three aspects. Thus,
the temporal dimension turns out to be an important aspect that has to be considered
when learning accurate classification models.
Chapter 5
Temporally-Aware Algorithms for
Automatic Document Classification
In this chapter we are particularly concerned with how to minimize the impact that temporal
effects may have on ADC algorithms, in light of the lessons learned in Chapter 4. As
previously discussed, the temporal dynamics of data, reflected by the quantified temporal
effects, may violate the common assumption of stationary data distributions, limiting the
performance of ADC algorithms. As we have shown, all three explored textual datasets
present varying class distributions, along with varying term distributions and pairwise class
similarities, to differing extents. Moreover, the analyzed ADC algorithms had their effectiveness
hindered by these variations (again, to different degrees). To address this issue, we
propose strategies to devise temporally-aware ADC algorithms. Recall that the class distribution
variation relates to the observed variations over time in the representativeness of classes,
whereas the term distribution variation and class similarity variation effects relate to the
observed variations over time in the term-class relationships and in the pairwise
class similarities, respectively. Similarly to the strategies adopted to isolate each factor in
the experimental designs described in Section 4.3, here the first effect is addressed on a finer-grained
basis, at the document level, while the other two effects are handled at the collection level.
In order to incorporate temporal awareness into document classifiers, we introduce a
weighting function that we call the temporal weighting function, or simply TWF, aimed at
addressing the previously quantified temporal effects. This weighting function is modeled
according to the observed evolution of the term-class relationships over time, captured by a
metric of dominance (see Section 4.3.2). We start by determining the temporal weighting
function for a collection according to its characteristics, instead of considering simple ad-hoc
functions based on the document's age, as done in previous work (see Klinkenberg 2004;
Koychev 2000). Towards this end, we offer a modeling framework that enables us to conduct
a series of statistical tests in order to derive a function that effectively models the underlying
process governing the dynamic nature of the datasets. For example, as we shall see, this
function follows a lognormal distribution for the ACM-DL and MEDLINE datasets.
As will be described in Section 5.1, deriving the TWF demands some statistical
procedures that may not be suitable for a practitioner to perform, due to the diversity
and sophistication of the tests that may be needed to determine its expression. As reported
in Section 5.1, the widely used procedures for independence and normality tests of random
variables failed when applied to the AG-NEWS dataset, since its TWF does not follow a
Gaussian process, even in the log-transformed space. In this case, some other (possibly
more complex) tests should be performed, which may be prohibitively hard for a practitioner
(who typically aims to achieve highly accurate classification without regard for the
properties of the function that underlies the data variation process, which are, incidentally,
reflected by its expression and parameters), hurting the practical applicability of the proposed
framework. Furthermore, automatic data-mining processes focused on classification may
need a fully automated way to determine the TWF. As a matter of fact, for the sake of
temporally-aware ADC, one just needs to know the positive real-valued weights associated
with each temporal distance; while the proposed statistical framework is able to
uncover the properties of the function that underlies the data variations, it may not be
applicable in the two mentioned scenarios, and strategies to overcome this are desirable. To cover
these practical scenarios, we propose an automatic approach, which removes the need to
perform the required statistical tests, in which the ADC algorithms themselves are employed to
derive the TWF.
Finally, the last step is to incorporate the TWF into the ADC algorithms. We propose
three strategies for doing so, where the weights assigned to each example depend on the
notion of a temporal distance δ. The temporal distance is defined as the difference between
the point in time p at which a training example was created and a reference point in time
pr at which the test example was created. These weights reflect the observed variability in
the data distribution, as captured by the TWF. The first strategy, named temporal weighting
in documents, weights training instances according to δ, ultimately addressing the explored
term distribution variation (TD) effect, since the TWF considers the observed variations
over time in the term-class relationships. However, as we have shown, the class distribution
variation (CD) and the class similarity variation (CS) effects also play an important role
for some datasets, and we should minimize their impact. Towards this end, we propose a
second strategy, called temporal weighting in scores, which is based on a mapping in the
class domain c ↦ 〈c, p〉, in which the training documents' classes are mapped to derived
classes 〈c, p〉, where c denotes the actual document's class and p denotes its creation point in
time. In this case, the scores (for example, similarities or probabilities) learned by a traditional
classifier applied to the modified training set are weighted according to δ = p − pr, where p
is the point in time associated with the derived class 〈c, p〉 and pr is the point in time at which
the test document was created; that is, scorec = Σp score〈c,p〉 · TWFδ. The combined
weighted scores for each class are then used to make the final classification decision. This
strategy is motivated by the fact that, when considering each point in time in isolation, we
ultimately isolate the temporal effects. But, since it is not always plausible to consider only
the documents created at the reference point in time pr (due to data scarcity), we aggregate
the obtained scores for each point in time, using the TWF to account for the variations in
the term-class relationships (the only potential effect reflected when aggregating the
intermediate scores). However, this strategy has a somewhat undesirable property related to the
mapping c ↦ 〈c, p〉. As the number of documents of the derived class 〈c, p〉 is typically
much smaller than the number of documents belonging to class c, the class imbalance artificially
increases. Since class imbalance may be harmful for document classifiers (Chen et al.,
2011), we propose a third strategy aimed at ameliorating this, namely the extended temporal
weighting in scores. In this strategy the training set D is partitioned into sub-groups of
documents Dp with the same creation point in time p. Then, a classification model is built
from each Dp in isolation. The class scores are then produced for each Dp and, as
before, they are aggregated using the TWF to weight them. By construction, the class
imbalance problem is bounded by the imbalance observed in the class distribution of each
Dp, which is usually smaller.
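The score-aggregation step shared by the two score-weighting strategies can be sketched as follows (illustrative names; `scores[(c, p)]` stands for the per-〈c, p〉 scores produced by the underlying classifier, whether trained on the mapped classes or on each partition Dp):

```python
from collections import defaultdict

def aggregate_scores(scores, p_ref, twf):
    """Temporal weighting in scores: given scores[(c, p)] for the derived
    classes <c, p>, combine them per original class c as
        score_c = sum_p scores[(c, p)] * TWF(p - p_ref),
    where p_ref is the test document's creation time. Returns the winning
    class and the combined per-class scores."""
    combined = defaultdict(float)
    for (c, p), s in scores.items():
        combined[c] += s * twf(p - p_ref)
    return max(combined, key=combined.get), dict(combined)
```

A step TWF that down-weights distant periods, for instance, lets recent evidence dominate without discarding the older scores entirely.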
The three proposed strategies were implemented in three ADC algorithms (Rocchio,
K Nearest Neighbors (KNN), and Naïve Bayes) and were evaluated using the three digital
libraries described in Section 4.1 (ACM-DL, MEDLINE and AG-NEWS). As we shall see,
we achieved significant improvements in classification effectiveness for all classifiers. For
instance, the temporally-aware version of Naïve Bayes outperformed the state-of-the-art
classifier (SVM) by up to 10%.
This chapter is organized as follows. In Section 5.1 we introduce the temporal weighting
function and describe a methodology to determine its expression and its parameters. In
Section 5.2 we propose a strategy to automatically determine the TWF, without the need
to perform any kind of statistical testing. In Section 5.3 we describe our extensions to
traditional ADC algorithms, in order to incorporate the TWF into them. We report the
experimental evaluation performed to assess the benefits of considering the temporal dimension
in ADC algorithms in Section 5.4. Finally, in Section 5.5 we summarize our findings.
5.1 Temporal Weighting Function
The potential impact that certain temporal effects have on term-class relationships may have
a great influence on the results of the classification process, as characterized in Chapter 4.
Thus, incorporating information about these changes into the ADC algorithms has the
potential to improve their effectiveness.
We address this issue through a temporal weighting function (TWF) that quantifies
the influence of a training document when classifying a test document, as a function of the
temporal distance between their creation times. We distinguish two major steps in defining
such a function: determining its expression and its parameters. The expression is usually harder to
determine, since it may express the generative process behind the function, that is, the
fundamental properties of the data variation phenomena, which can be smooth (possibly
of a linear nature), abrupt (perhaps with some exponential behavior) or even
periodic, while the parameters are usually obtained using approximation strategies.
Intuitively, given a test document to be classified, the TWF must assign higher weights to
training documents that are more similar to that test document with respect to the strength
of term-class relationships. As described in Section 4.3.2, one metric that expresses such
strength is the dominance, since the more exclusive a term is to a given predefined class, the
stronger this relationship. The simplest approach to model the function that governs such
variations would be to use a unit pulse function ⊓(δ) at temporal distance 0,

⊓(δ) = α if δ = 0,  and  ⊓(δ) = 0 if δ ≠ 0,

with the pulse magnitude α proportional to the observed term dominance associated with
the training documents created at the same point in time as the test document. However, as
argued in Section 4.4, considering a larger time interval instead of a single point in time is
better, since it better handles smooth data variations and does not discard potentially useful
information regarding stable terms. We then need to determine the time period that must be
considered when modeling the underlying data variations, and this can be accomplished through
the notion of stability period described in Section 4.3.2.2 (see Definition 1).
Recall that the stability period St,pr of a term t, considering the reference point in time
pr at which the test document was created, consists of the largest continuous period of time,
starting from pr and growing both towards the past and the future, in which Dominance(t, c) > α
(for some predefined α and any class c). For the explored datasets, we investigated
different values of α when computing stability periods and, as they led to similar results,
we adopted α = 50%, ensuring that the terms have a high degree of exclusivity with
some class.
Notice that the stability period of a term depends on a reference point in time, and
thus a term may present different stability periods, one for each point in time at which it
occurred in the dataset. We first determine the stability period for each term and then combine
them, as follows. In order to handle this situation, we mapped all the time points
in a stability period to temporal distances, where the reference year is considered as
distance 0. For instance, a term t1 may have different stability periods when considering
the years 1989 or 2000 as a reference. More specifically, if the stability period of t1 is
{1999, 2000, 2001} regarding pr = 2000, and {1988, 1989, 1990} regarding pr = 1989, both
periods would be mapped to {-1, 0, 1}. Considering S′t as the set of temporal distances
that occur in the stability periods of term t (considering all reference moments pr), then
S′t = {δ ← pn − pr | ∀pr ∈ P and pn ∈ St,pr}. Making the stability periods easily comparable
is important because our real interest is to know what kind of distribution the temporal
distances follow with respect to different terms.
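The mapping from stability periods to temporal distances can be sketched directly from the definition of S′t (illustrative function name; the input maps each reference point pr to the set St,pr):

```python
def temporal_distances(stability_periods):
    """Compute S'_t = { p_n - p_r : p_r in P, p_n in S_{t, p_r} }, i.e. the
    set of reference-relative temporal distances occurring in all of a
    term's stability periods. `stability_periods` maps each reference
    point in time p_r to the set of points S_{t, p_r}."""
    return {pn - pr
            for pr, period in stability_periods.items()
            for pn in period}
```

Running it on the t1 example from the text, the periods for pr = 2000 and pr = 1989 both collapse onto the same distance set {-1, 0, 1}, which is what makes stability periods comparable across reference points.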
The next step is to determine the function expression and, towards this goal, we
considered the stability period of each term as a random variable (RV), where the occurrence of
each possible temporal distance in its stability period is an event. More formally, as Table 5.2
shows, we are interested in the frequencies of the temporal distances δ1 to δn, for terms t1
to tk. An interesting property that we may test is whether these RVs are independent. This
hypothesis can be corroborated by Fisher's Exact Test (Clarkson et al., 1993), used to assess
the independence of each pair RVi and RVj, ∀i ≠ j, where, as mentioned, each RV represents
the occurrence of a temporal distance δ for a term t.
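A Monte Carlo flavor of such an independence test can be sketched with a permutation test over the δ × term frequency table of Table 5.2. This is an illustrative stand-in built on a chi-square statistic, not the Fisher's Exact Test implementation actually used in the experiments:

```python
import random

def chi2_stat(table):
    """Chi-square statistic of an r x c contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols))
               if rows[i] and cols[j])

def mc_independence_pvalue(table, trials=2000, seed=42):
    """Monte Carlo p-value for independence: repeatedly shuffle the column
    label of each individual observation and count how often the permuted
    table's statistic reaches the observed one."""
    rng = random.Random(seed)
    # Expand the table into one (row, col) pair per observed count.
    obs = [(i, j) for i, row in enumerate(table)
           for j, f in enumerate(row) for _ in range(f)]
    cols = [j for _, j in obs]
    observed = chi2_stat(table)
    hits = 0
    for _ in range(trials):
        rng.shuffle(cols)
        perm = [[0] * len(table[0]) for _ in table]
        for (i, _), j in zip(obs, cols):
            perm[i][j] += 1
        if chi2_stat(perm) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A high p-value (as obtained for ACM-DL and MEDLINE) is consistent with independence; a very low one (as for AG-NEWS) rejects it.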
We applied this test to the three reference datasets, ACM-DL, MEDLINE and AG-NEWS.
For the first two, we obtained a p-value of 0.99 through a Monte Carlo simulation,
which allows us to state that their associated random variables are indeed independent. Thus,
the observed variability of occurrences of δ for different terms is a result of independent
effects (Limpert et al., 2001). For the AG-NEWS dataset, this independence does not hold, as
indicated by the low p-value obtained (10^-4), and other hypotheses should be tested.
This highlights the difficulty faced when defining the function that best models the varying
behavior of the data at hand, motivating the development of a fully automated strategy that
overcomes the need to explicitly determine the TWF expression. As a matter of fact, one can
afford to avoid the explicit determination of the TWF expression and parameters since, for
the sake of temporally-aware ADC, only the TWF image matters (that is, the weights associated
with each temporal distance). Clearly, avoiding the determination of the TWF expression and
parameters comes at the cost of missing the opportunity to discover the TWF's properties,
which are revealed when determining its expression and parameters. With this trade-off in
mind, in Section 5.2 we describe a fully automated strategy to determine the weight of each
temporal distance.
Turning our attention to the ACM-DL and MEDLINE datasets (which passed the
independence tests), it is still not clear whether the effects responsible for the observed variability
in the temporal distance distribution Dδ are additive (leading to a normal distribution) or
multiplicative (leading to a lognormal distribution). In Figure 5.1 we show the Dδ
distribution, scaled to the [0, 1] interval. We then apply a statistical normality test to both the original
and the log-transformed distribution. According to D'Agostino's D-Statistic Test of Normality
(D'Agostino R.B., 1973), with 99% confidence, we found that the lognormal distribution
best fits both the ACM-DL and MEDLINE collections, as presented in Table 5.1.
Figure 5.1: Dδ Distribution (Scaled to the [0, 1] Interval). Panels: (a) ACM-DL Dataset, (b) MEDLINE Dataset.
Data            | ACM-DL   | MEDLINE
Original        | 4.497e-6 | 0.002762
Log-Transformed | 0.2144   | 0.6802

Table 5.1: D'Agostino's D-Statistic Test of Normality. Bold face marks tests for which we cannot reject the null hypothesis of normality.
     | t1   t2   ...  tk  | Dδ
δ1   | f11  f12  ...  f1k | Σ(i=1..k) f1i
δ2   | f21  f22  ...  f2k | Σ(i=1..k) f2i
...  |                    |
δn   | fn1  fn2  ...  fnk | Σ(i=1..k) fni

Table 5.2: Temporal Distances versus Terms.

Consider the distribution Dδ of the occurrences of the temporal distances δ in the
stability periods, which represents the distribution of each δi over all terms t; Dδ is
lognormally distributed if ln Dδ is normally distributed. More generally, since under the
independence assumption the temporal distances δi are RVs with finite mean and variance,
by the Central Limit Theorem ln Dδ = Σ(i=1..n) ln δi will asymptotically approach a
normal distribution and, by definition, Dδ converges to a lognormal distribution (Crow EL,
1988). For a lognormal distribution, the asymptotically most efficient method for estimating
its associated parameters relies on a log-transformation (Limpert et al., 2001). Using
a Maximum Likelihood method, we estimated those parameters for both collections, and
then back-transformed them, as shown in Table 5.3. We considered a 3-parameter gaussian
function,

F = ai · e^(−(x − bi)² / (2ci²)),
where the parameter a_i is the height of the curve's peak, b_i is the position of the center of the peak, and c_i controls the width of the curve. The latter, also called the shape parameter, reflects the nature of the variations of term-class relationships over time. Since abrupt or smooth variations lead to small or large stability periods, respectively, the shape of the distribution changes accordingly, and capturing such distinct natures becomes a matter of parameter estimation. We performed two curve-fitting procedures, considering a single gaussian F and a mixture of two gaussians, given by G = G_1 + G_2, where each G_i denotes a gaussian function. The latter was the model that best fitted D_δ, and its parameters are presented in Table 5.3, along with the goodness-of-fit measure R². The R² measure denotes the percentage of variance explained by the model and, for both collections, the obtained model explains 99% of such variance.
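This kind of two-gaussian fit can be sketched with `scipy.optimize.curve_fit`. The data below are synthetic, generated with parameters close to the ACM-DL column of Table 5.3, so the fit and the R² computation are illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, b, c):
    """3-parameter gaussian: peak height a, center b, width c."""
    return a * np.exp(-((x - b) ** 2) / (2.0 * c ** 2))

def mixture(x, a1, b1, c1, a2, b2, c2):
    """G = G1 + G2, the two-gaussian mixture fitted to D_delta."""
    return gauss(x, a1, b1, c1) + gauss(x, a2, b2, c2)

# Synthetic D_delta-like curve using parameters near Table 5.3 (ACM-DL).
x = np.linspace(-60, 60, 241)
y = mixture(x, 0.325, 0.0, 3.6, 0.616, 0.0, 20.1)
y += np.random.default_rng(0).normal(scale=0.002, size=x.size)  # mild noise

popt, _ = curve_fit(mixture, x, y, p0=[0.3, 0.0, 3.0, 0.6, 0.0, 18.0])

# Goodness of fit: R^2 = 1 - SS_res / SS_tot.
residuals = y - mixture(x, *popt)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
assert r2 > 0.99
print(f"R^2 = {r2:.4f}")
```

As in the thesis, R² close to 1 indicates that the mixture explains almost all the variance of the fitted curve.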
Parameters   ACM-DL                          MEDLINE
             Value     Confidence Interval   Value     Confidence Interval
a1           0.325     (0.288, 0.362)        0.089     (0.066, 0.113)
b1           -0.028    (-0.309, 0.253)       -0.013    (-0.349, 0.324)
c1           3.636     (3.117, 4.154)        1.635     (1.099, 2.17)
a2           0.616     (0.589, 0.643)        0.901     (0.891, 0.911)
b2           0.037     (-0.395, 0.470)       0.092     (-0.130, 0.314)
c2           20.14     (20.93, 23.35)        24.51     (23.71, 25.3)
R²           0.990                           0.992

Table 5.3: Estimated Parameters for Both Datasets, with 99% Confidence Intervals.
The greater the frequency of δ in stability periods, the more suitable the training documents created at δ are to build an accurate classification model, making the modeling of the Temporal Weighting Function as a lognormal distribution an effective strategy.
Figure 5.2 shows the distribution of temporal scores for each possible temporal distance between the creation time of a test document d′ and the training documents, for the ACM-DL and MEDLINE datasets.
Figure 5.2: Fitted Temporal Weighting Function with Log-Transformed Data. (a) ACM-DL dataset; (b) MEDLINE dataset.
5.2 Fully-Automated TWF Definition
Clearly, determining the expression and parameters of a function that effectively models the underlying data variations is an important task, since it reveals the properties of those variations and offers substantial knowledge that can be exploited towards the development of accurate classification models. However, defining the TWF demands statistical procedures that may not be suitable for a practitioner to perform, due to the diversity and sophistication of the tests that may be needed to define its expression. As reported in Section 5.1, the most straightforward procedures for independence testing of random variables failed when applied to the AG-NEWS dataset. Thus, unlike the TWFs associated with the ACM-DL and MEDLINE datasets, the AG-NEWS TWF does not follow a Gaussian process, and other (possibly more complex) tests would have to be performed to assess its expression. This may be prohibitively hard for a practitioner (who typically aims at high classification accuracy without regard to the properties of the function underlying the data variation process, which is, by the way, reflected by its expression and parameters), hurting the practical applicability of the proposed framework to devise the TWF and, consequently, the applicability of the temporally-aware classifiers described in Section 5.3.
Furthermore, automatic data-mining processes focused on classification may need automatic ways to determine the TWF. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance. While the proposed statistical framework is able to uncover the properties of the function that underlies the data variations, it may not be applicable to the two mentioned scenarios, and strategies to overcome this issue are desirable.
To cover such practical scenarios, in this section we describe a technique to automatically determine the TWF, without the need to perform any statistical test. Hence, we describe a straightforward and suitable way for a practitioner, or for some other automated data-mining process, to devise the TWF. Our goal is thus to develop a procedure which, given a set of already classified documents, outputs a function TWF_EST : δ ↦ [0, 1] that ultimately models the underlying data variations. More specifically, the ADC algorithms themselves are used to devise such a mapping.
Let D be the training set, composed of already classified documents d_i = (⟨x⃗_i, p_i⟩, c_i), where x⃗_i is the vectorial (bag-of-words) representation of d_i, p_i denotes its creation point in time and c_i denotes its associated class. The first step of our procedure consists of changing the associated class of each document to its creation point in time, that is, we represent d_i as d′_i = (x⃗_i, p_i). Then, the training set is randomly partitioned into two subsets, D_t and D_v, and a classification procedure is performed, using D_t as a training set and D_v as a validation set. Our basic assumption is that, due to the temporal effects previously described, the underlying data distribution may be different for each point in time p_i, and so we expect that the classifier may be able to learn some structural properties observed in each of them. Clearly, the classifier will not achieve high accuracy since, as shown in Chapter 4, the variations observed due to the temporal effects are typically smooth, and hence it is unlikely that the observed changes produce enough variation to enable discriminating data between each point in time. However, the classifier will potentially predict nearby points in time, since data from nearby points in time tend to have similar underlying distributions.
Formally, an ADC algorithm is used to learn the underlying relationships between the data and their creation points in time, expressed by the a posteriori probability distribution P(p_i | d_i). Thus, if documents created at a point in time p_i share some structural properties with data from a reference point in time p_r (namely, the point in time when the documents from D_v were created), then p_i will receive a higher score than some other uncorrelated point in time p_j ≠ p_i. Since we aim at associating a real-valued weight with the temporal distances δ_i = p_i − p_r, we adopt the following rule to devise the TWF:

TWF(δ_i) = ( Σ_{j=1}^{N} I(p_j − p_r = δ_i) ) / N,
where I(•) is an indicator function which returns 1 if the predicate • is true and 0 otherwise, p_r is the actual creation point in time of the classified documents from D_v (that is, the reference point in time), p_j is the predicted point in time (the one which received the highest score from the classifier) and N = |D_v| denotes the number of documents classified. Intuitively, temporal distances with higher weights contain the most useful documents to build the classification model, since they provide data sampled from distributions similar to the ones that govern the test data. On the other hand, temporal distances with smaller weights tend to have more unstable data, which may induce the classifier to misleading predictions. The described procedure to automatically determine the TWF is listed in Algorithm 2.
Algorithm 2 Automatic TWF Determination
1: function LEARNTWF(D)
2:   D′ ← {d′_i | d′_i = (d_i.x_i, d_i.p_i) ∧ d_i ∈ D}
3:   (D_t, D_v) ← RANDOMSPLIT(D′)
4:   p[ ] ← CLASSIFY(D_t, D_v)
5:   TWF(δ_i) = ( Σ_{j=1}^{N} I(p_j − d′_j.p = δ_i) ) / N, where d′_j ∈ D_v
6:   return TWF
7: end function
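Step 5 of Algorithm 2 reduces to a normalized histogram of the temporal distances between predicted and actual creation times. A minimal sketch (function and variable names are ours, not the thesis code; the classification step of lines 3-4 is assumed to have already produced the predictions):

```python
from collections import Counter

def learn_twf(predicted_times, actual_times):
    """TWF(delta) = (1/N) * #{j : predicted_j - actual_j = delta}
    (step 5 of Algorithm 2); times are discrete points, e.g. week indices."""
    n = len(actual_times)
    counts = Counter(p - r for p, r in zip(predicted_times, actual_times))
    return {delta: c / n for delta, c in counts.items()}

# Toy validation-set output: the classifier mostly predicts points in time
# near the true creation time, so small |delta| gets higher weight.
predicted = [3, 4, 3, 2, 5, 3, 4, 3, 1, 3]
actual = [3] * 10
twf = learn_twf(predicted, actual)
print(twf)  # -> {0: 0.5, 1: 0.2, -1: 0.1, 2: 0.1, -2: 0.1}
```

The resulting weights sum to 1 and peak at delta = 0, mirroring the bell-shaped TWFs of Figure 5.3.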
We show in Figure 5.3 the estimated TWFs for each explored textual dataset, as estimated by each explored classifier. They were obtained during the 10-fold cross-validation procedure used to assess the effectiveness of the temporally-aware algorithms, as reported in Section 5.4. The reader should notice the similarities among the TWFs learned by each classifier. In fact, it does not matter which classifier is employed in line 4 of Algorithm 2, as we shall discuss in Section 5.4.
Figure 5.3: Estimated Temporal Weighting Function. (a) ACM-DL dataset; (b) MEDLINE dataset; (c) AG-NEWS dataset.
5.3 Temporally-aware ADC
This section shows how three well-known text classifiers, namely Rocchio, KNN and Naïve Bayes (Manning et al., 2008), can be modified to take into account the temporal weighting function defined in Sections 5.1 and 5.2. The three algorithms are modified following two strategies: temporal weighting in documents and temporal weighting in scores, as detailed below.
5.3.1 Temporal Weighting in Documents
The temporal weighting in documents strategy weights each training document by the temporal weighting function, according to its temporal distance to the test document d′, as represented in Figure 5.4. In the following, we detail this strategy.
Figure 5.4: Graphical Representation of TWF in Documents.
The strategy used to incorporate the weight of each training document into a given classifier depends inherently on the characteristics of the classification algorithm being modified. In the case of distance-based classifiers, the temporal weighting function can be easily applied when calculating the distance between the training and test documents, by weighting each training document (vector representation) by its associated temporal weight. In the case of Naïve Bayes, the temporal function can be used to weight the impact of each training example on both the a priori and conditional probability estimates (that is, to weight its impact on the counts), in order to generate a more accurate a posteriori probability.
Rocchio Recall from Section 4.1.2 that the Rocchio classifier uses the centroid of a class to find boundaries between classes. As an eager classifier, Rocchio does not require any information from d′ to create a classification model. Hence, we adapt it to become a lazy classifier when using the temporal weighting function, since the weights depend on the creation point in time of the test document. When classifying a new document d′, Rocchio assigns it to the class represented by the centroid closest to d′. In order to make Rocchio a lazy classifier, we explicitly change the separation boundaries of the classes according to the temporal weights produced by the TWF.
Hence, we need to calculate each Rocchio class centroid based on the creation point in time p_r of the test document d′. Consider the set D_c ⊆ D of training documents that belong to class c. This set can be partitioned into subgroups D_{c,p} ⊆ D_c of documents created at the same point in time p ∈ P. The centroid μ⃗_c for class c is thus defined by weighting the document vector representations with the score produced by the temporal function TWF(δ), obtained using the temporal distance δ between the creation point in time of d ∈ D_{c,p} and d′, for all p ∈ P. Thus, a centroid μ⃗_c is given by:

μ⃗_c = (1 / |D_c|) · Σ_{d∈D_c} ( Σ_{p∈P} d⃗_p · TWF(δ) ),

where |D_c| is the number of documents in class c, P is the set of points in time observed in the training set, d⃗_p ∈ D_c denotes a training document created at the point in time p, and δ is the temporal distance between d⃗_p and the test document d′.

This approach redefines the centroid's coordinates in the vector space considering each document's representativeness in class c w.r.t. the reference point in time p_r. Both training and classification procedures are presented in Algorithm 3.
Algorithm 3 Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents
1: function TRAIN(C, D, d′, TWF)
2:   for each c ∈ C do
3:     μ⃗_c ← (1 / |D_c|) · Σ_{d∈D_c} ( Σ_{p∈P} d⃗_p · TWF(p − d′.p) )
4:   end for
5:   return {μ⃗_c : c ∈ C}
6: end function
7: function CLASSIFY(D, C, d′)
8:   {μ⃗_c : c ∈ C} ← TRAIN(D, C, d′)
9:   return argmax_c cos(μ⃗_c, d⃗′)
10: end function
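The lazy, temporally weighted centroid computation of Algorithm 3 can be sketched as follows (a toy sketch with dense NumPy vectors and an example TWF table; names are ours):

```python
import numpy as np

def rocchio_twf_classify(docs, classes, times, d_test, p_test, twf):
    """Lazy Rocchio: centroids are recomputed per test document, with each
    training vector weighted by TWF of its temporal distance to the test
    time. `docs` is an (n, m) array; `classes`/`times` are length-n lists;
    `twf` maps a temporal distance to a weight."""
    centroids = {}
    for c in set(classes):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        weighted = [docs[i] * twf(times[i] - p_test) for i in idx]
        centroids[c] = np.sum(weighted, axis=0) / len(idx)

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    return max(centroids, key=lambda c: cos(centroids[c], d_test))

# Toy example: two classes, two points in time, a simple TWF table.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
classes = ["a", "a", "b", "b"]
times = [1, 2, 1, 2]
twf = lambda d: {0: 1.0, -1: 0.6}.get(d, 0.0)
print(rocchio_twf_classify(docs, classes, times, np.array([1.0, 0.2]), 2, twf))
```

Because the centroids depend on the test document's creation time, training is deferred to classification time, exactly as in the lazy formulation above.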
KNN As described in Section 4.1, KNN is a lazy classifier that assigns to a test document d′ the majority class among those of its k nearest neighbor documents in the vector space. Determining the test document's class from the k nearest training documents may not be ideal in the presence of term-class relationships that vary considerably over time. To deal with this, we apply the proposed temporal weighting function during the computation of the similarities between d′ and the documents in the training set, aiming to select the closest documents in terms of both similarity and timeliness.
Let s be the cosine similarity between a training document d and d′. If d is similar to d′ but temporally distant, then it is moved away from d′, reducing the probability of it being among the k nearest documents of d′. Let TWF(δ) be the temporal weight associated with the temporal distance between the creation times of documents d and d′. Then, the documents' similarity is given by:

sim(d, d′) ← cos(d, d′) · TWF(δ).
Both training and classification procedures are presented in Algorithm 4.

Algorithm 4 KNN-TWF-Doc: KNN with Temporal Weighting in Documents
1: function KNEARESTNEIGHBORS(D, d′, k, TWF)
2:   for each d ∈ D do
3:     δ ← d.p − d′.p
4:     sim(d, d′) ← cos(d, d′) · TWF(δ)
5:     priorityQueue.insert(sim, d)
6:   end for
7:   return priorityQueue.first(k)
8: end function
9: function CLASSIFY(D, d′, k)
10:   knn ← KNEARESTNEIGHBORS(D, d′, k)
11:   return argmax_c Σ_{d∈knn} I(d.c = c)
12: end function
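The similarity rule sim(d, d′) = cos(d, d′) · TWF(δ) from Algorithm 4 can be sketched as (toy data and names are ours):

```python
import heapq
from collections import Counter
import numpy as np

def knn_twf_classify(docs, classes, times, d_test, p_test, twf, k):
    """KNN-TWF-Doc sketch: neighbor similarity is cosine similarity
    multiplied by TWF(delta), so temporally distant documents are pushed
    away from d_test before the k nearest neighbors are selected."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    sims = [(cos(d, d_test) * twf(t - p_test), c)
            for d, c, t in zip(docs, classes, times)]
    top_k = heapq.nlargest(k, sims, key=lambda x: x[0])
    return Counter(c for _, c in top_k).most_common(1)[0][0]

# Toy run: two recent "a" documents outweigh an equally similar older one.
docs = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
classes = ["a", "a", "b"]
times = [5, 5, 1]
twf = lambda d: {0: 1.0, -4: 0.2}.get(d, 0.0)
print(knn_twf_classify(docs, classes, times, np.array([1.0, 0.1]), 5, twf, 2))
```

Multiplying rather than replacing the cosine keeps the ranking driven by content, with time acting only as a penalty.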
Naïve Bayes Similarly to the previously defined "temporal weighting in documents" approaches, here we apply the temporal weighting function to the information used by the learning method, namely the relative frequencies of documents and terms, as follows:

P(d′|c) = η · [ Σ_p (N_cp · TWF(δ)) / Σ_p (N_p · TWF(δ)) ] · Π_{t∈d′} [ Σ_p (f_tcp · TWF(δ)) / Σ_p Σ_{t′∈V} (f_t′cp · TWF(δ)) ],

where η denotes a normalizing factor, N_cp is the number of training documents of D assigned to class c and created at the point in time p, N_p is the number of training documents created at the point in time p, f_tcp stands for the frequency of occurrence of term t in training documents of class c that were created at the point in time p and, finally, δ denotes the temporal distance between p and the creation time of d′ (that is, the reference point in time).

The main goal of this strategy is to reduce the impact that temporally distant information has when estimating a posteriori probabilities. Algorithm 5 presents this strategy.
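The TWF-weighted counts can be sketched as follows (a toy sketch; the log-space computation and the add-one smoothing are our additions, not spelled out in the text):

```python
from collections import defaultdict
from math import log

def nb_twf_classify(train, d_test, p_test, twf, vocab):
    """Naive Bayes with TWF-weighted counts. `train` holds tuples
    (term_counts: dict, p: int, c: label); priors and conditionals use
    counts weighted by TWF(p - p_test)."""
    prior_num = defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))
    total_w = 0.0
    for counts, p, c in train:
        w = twf(p - p_test)
        prior_num[c] += w
        total_w += w
        for t, f in counts.items():
            cond[c][t] += f * w
    scores = {}
    for c in prior_num:
        denom = sum(cond[c].values()) + len(vocab)  # add-one smoothing
        s = log((prior_num[c] + 1e-12) / (total_w + 1e-12))
        for t, f in d_test.items():
            s += f * log((cond[c][t] + 1.0) / denom)
        scores[c] = s
    return max(scores, key=scores.get)

# Toy run: the recent class-"a" document (which uses term "y") dominates
# the temporally distant one (which used "x").
train = [({"x": 3}, 1, "a"), ({"y": 3}, 1, "b"), ({"y": 3}, 5, "a")]
twf = lambda d: {0: 1.0}.get(d, 0.1)
print(nb_twf_classify(train, {"y": 2}, 5, twf, {"x", "y"}))
```

Weighting the counts, rather than the final scores, lets the temporally closest evidence dominate both the prior and the conditional estimates.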
Algorithm 5 Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents
1: function CLASSIFY(D, d′, TWF)
2:   for each c ∈ C do
3:     aPriori[c] ← Σ_p (N_cp · TWF(δ)) / Σ_p (N_p · TWF(δ))
4:     termCond[c] ← Π_{t∈d′} Σ_p (f_tcp · TWF(δ)) / ( Σ_p Σ_{t′∈V} (f_t′cp · TWF(δ)) )
5:   end for
6:   return {argmax_c η · aPriori[c] · termCond[c]}
7: end function

5.3.2 Temporal Weighting in Scores

A more sophisticated approach to exploit the temporal weighting function considers the "scores" produced by the traditional classifiers, as represented in Figure 5.5. By score we mean: (i) the smallest distance from the test document d′ to a class centroid, for Rocchio; (ii) the smallest sum of the distances of the k nearest neighbors of document d′ assigned to class c, in the case of KNN; or (iii) the probability of generating d′ with the model associated with some class c, for Naïve Bayes. From now on, we refer to this approach as temporal weighting in scores.
Figure 5.5: Graphical Representation of TWF in Scores.
Let C and P be the sets of classes and of creation points in time of the training documents. First, each training document's class c ∈ C is associated with the corresponding creation point in time p ∈ P, generating a new class defined as ⟨c, p⟩ ∈ C × P. Then, we use a traditional classification algorithm to generate scores for each new class ⟨c, p⟩. Thus, the first step of this strategy consists of generating a new training set D_{c,p} with the class domain transformed from C to C × P. Then, the test document d′ is classified by a traditional classifier applied to this new training set, ultimately generating scores for each ⟨c, p⟩. Note that this scenario isolates term-class relationship variations, since it ties the predictive relationships of the patterns observed in each class c to the point in time p in which those patterns were observed. To decide to which class c the document d′ should be assigned, the learned scores for each ⟨c, p⟩ are summed up, for all p ∈ P, weighting them by TWF(δ), where δ = p − p_r corresponds to the temporal distance between p and the creation time p_r of d′, that is,
scores_{c,p} ← TRADITIONALCLASSIFIER(d′, D_{c,p}),

score_c ← Σ_{p∈P} scores_{c,p}(c, p) · TWF(δ),

where D_{c,p} is the new set of training documents generated by mapping each document's class c to the derived class ⟨c, p⟩, according to its creation point in time. At the end of this process, d′ is assigned to the class c with the highest score, as listed in Algorithm 6.
Algorithm 6 TWF-Sc: Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:   D_{c,p} ← {d_{c,p} = (d.x⃗, ⟨d.c, d.p⟩) | d ∈ D}
3:   scores_{c,p} ← TRADITIONALCLASSIFIER(d′, D_{c,p})
4:   for each c ∈ C do
5:     δ ← p − d′.p
6:     score_c ← Σ_{p∈P} scores_{c,p}(c, p) · TWF(δ)
7:   end for
8:   return {argmax_c score_c}
9: end function
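The class-domain transformation and score aggregation of Algorithm 6 can be sketched as follows; `score_fn` stands in for any traditional classifier that returns per-derived-class scores (the stub below is ours, purely to exercise the aggregation):

```python
from collections import defaultdict

def twf_in_scores(train, d_test, p_test, twf, score_fn):
    """TWF-Sc sketch (Algorithm 6): map each training document's class c to
    the derived class (c, p), obtain per-derived-class scores from a
    traditional classifier, then aggregate them weighted by TWF."""
    derived = [(x, (c, p)) for x, p, c in train]  # class domain C x P
    scores_cp = score_fn(derived, d_test)         # {(c, p): score}
    score_c = defaultdict(float)
    for (c, p), s in scores_cp.items():
        score_c[c] += s * twf(p - p_test)
    return max(score_c, key=score_c.get)

# Stand-in classifier with fixed scores: class "a" looks strong only at a
# temporally distant point in time, so "b" wins after TWF weighting.
stub = lambda derived, d: {("a", 1): 0.9, ("a", 5): 0.2, ("b", 5): 0.6}
twf = lambda d: {0: 1.0}.get(d, 0.1)
print(twf_in_scores([("x", 1, "a"), ("y", 5, "a"), ("z", 5, "b")],
                    "doc", 5, twf, stub))
```

The key point mirrored here is that scores are learned per ⟨c, p⟩ and only then collapsed back to the original class domain.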
5.3.3 Extended Temporal Weighting in Scores
When generating the scores for the derived classes ⟨c, p⟩ during the classification of a test document, the number of positive training documents (that is, training documents belonging to ⟨c, p⟩) is usually outnumbered by the number of negative training documents (that is, training documents belonging to a derived class ⟨c′, p′⟩ ≠ ⟨c, p⟩). This is known as the class imbalance problem, and it is an issue for classifiers with some bias towards the majority classes. Indeed, this problem is inherent to the classification task, and the majority of automatic classifiers are affected by it. The "in scores" strategy becomes vulnerable to the class imbalance problem since it artificially increases the imbalance when mapping the classes to ⟨c, p⟩. Several works have proposed strategies to minimize this problem, for example, by under-sampling the majority classes (Lin et al., 2009) or by over-sampling the minority classes (Chen et al., 2011). We address this issue by modifying the "in scores" strategy so as to minimize the class imbalance problem.
More formally, let D_c denote the set of documents belonging to class c and D_c^p ⊆ D_c denote the set of documents created at the point in time p that also belong to class c. Clearly, |D_c^p| ≤ |D_c|. Now, consider our previously proposed "in scores" strategy, in which a classifier is used to learn the scores for ⟨c, p⟩. The difference between |D_c^p| (the number of positive documents) and |D_c \ D_c^p| (the number of negative documents) is expected to be greater than the difference between |D_c| and |D \ D_c|. In other words, the number of negative documents observed in the transformed class domain (C × P) outnumbers the number of positive documents much more expressively than in the original class domain (C). Thus, the "in scores" strategy artificially increases the class imbalance when considering the derived classes ⟨c, p⟩, and such imbalance is greater than that observed under the original class distribution.
Figure 5.6: Graphical Representation of Extended TWF in Scores.
Based on this observation, the extended version of the "in scores" strategy aims at mitigating the class imbalance problem by considering each point in time in isolation, as represented in Figure 5.6, employing a series of classifiers to associate scores with the classes, each considering only the documents belonging to one point in time (but belonging to all classes). The scores obtained by each classifier are then aggregated with the corresponding TWF weight, according to the temporal distance between the point in time associated with each classifier and the creation time of the test document. Let D^p denote the set of documents created at the point in time p. Since D_c^p ⊆ D^p, then |D_c^p| ≤ |D^p|, and the majority class size observed in D^p is bounded by |D^p|. In the "in scores" strategy, the majority class size is bounded by |D_{c,p}| = |D|. Consequently, the class imbalance observed in the first approach is smaller than that observed in the second one.
There are two rather subtle differences between this new strategy and the previous "in scores" approach. First, as mentioned, by construction the class imbalance problem is bounded by the class imbalance observed in D^p, since it is the set considered when training the intermediate classifiers. Second, consider the traditional classification procedure performed by both strategies. While in the "in scores" strategy such classification is performed considering the modified training set D_{c,p}, in the extended version only documents belonging to D^p are considered, which implies that documents created at a point in time p′ ≠ p do not influence the learned scores. As an example, consider the class Top Stories (id 10) of the AG-NEWS dataset. From Figure 4.2, we can observe that, in the 50th week, none of the documents belongs to this class. Now assume that the classification procedure of both strategies attempts to learn a score for this class when classifying a test document belonging to the 1st week. The scores learned by the classifier under the "in scores" strategy will consider all training documents, including those belonging to the 50th week. Thus, these negative documents ultimately influence the learned scores. On the other hand, in the "extended in scores" strategy, the classification procedures applied to each point in time act in isolation: the classifiers assigned to points in time within the 50th week will output scores equal to zero for this class, and only the classifiers assigned to points in time with documents belonging to this class will output non-zero scores. The extended in scores procedure is listed in Algorithm 7.
Algorithm 7 TWF-Sc-Ext: Extended Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:   for each p ∈ P do
3:     D_p ← {d ∈ D | d.p = p}
4:     δ ← p − d′.p
5:     score_c(c) += TRADITIONALCLASSIFIER(d′, D_p) · TWF(δ)
6:   end for
7:   return {argmax_c score_c}
8: end function
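The per-point-in-time decomposition of Algorithm 7 can be sketched as follows; as before, `score_fn` is a stand-in for any traditional classifier returning per-class scores (the frequency-based stub is ours):

```python
from collections import Counter, defaultdict

def twf_sc_ext(train, d_test, p_test, twf, score_fn):
    """TWF-Sc-Ext sketch (Algorithm 7): train one classifier per point in
    time p on D_p only (all classes, single p), then aggregate each
    classifier's class scores weighted by TWF(p - p_test)."""
    by_p = defaultdict(list)
    for x, p, c in train:
        by_p[p].append((x, c))
    score_c = defaultdict(float)
    for p, subset in by_p.items():
        w = twf(p - p_test)
        for c, s in score_fn(subset, d_test).items():
            score_c[c] += s * w
    return max(score_c, key=score_c.get)

# Stand-in classifier: score = fraction of the subset in each class.
def stub(subset, d):
    counts = Counter(c for _, c in subset)
    n = sum(counts.values())
    return {c: cnt / n for c, cnt in counts.items()}

twf = lambda d: {0: 1.0}.get(d, 0.2)
train = [("x", 1, "a"), ("y", 1, "a"), ("z", 5, "b"), ("w", 5, "a")]
print(twf_sc_ext(train, "doc", 5, twf, stub))
```

Because each intermediate classifier sees only D_p, points in time with no documents of a class simply contribute a zero score for it, as discussed above.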
5.4 Results
Having presented our strategies to determine the TWF and our temporally-aware classifiers, we now report our experimental evaluation, which assesses the effectiveness of the temporally-aware classifiers in minimizing the impact of the temporal effects observed in the three explored textual datasets. Recall from Section 5.1 that, for the ACM-DL and MEDLINE datasets, the TWF follows a lognormal distribution, unlike the TWF associated with the AG-NEWS dataset, whose expression is still unknown (meaning that a different, possibly more complex, statistical test would be required to assess its TWF). Hence, we start by evaluating the temporal algorithms using the original TWF obtained in Section 5.1, applied to the ACM-DL and MEDLINE datasets. Next, we evaluate our temporally-aware classifiers using the TWF estimated with a machine learning approach, as described in Section 5.2. In this case, since complex statistical tests are no longer necessary, we are able to determine the TWF for all three textual datasets in a fully-automated way. Thus, we evaluate our temporally-aware classifiers using this TWF applied to the three reference datasets in order to assess their effectiveness.
In order to evaluate the impact that the proposed TWF has on the classification task, we compare both the traditional and temporally-aware versions of Rocchio, KNN and Naïve Bayes on the three adopted datasets (ACM-DL, MEDLINE and AG-NEWS). For comparison we use two standard information retrieval measures: micro-averaged F1 (MicroF1) and macro-averaged F1 (MacroF1). As described in Section 2.2, while MicroF1 measures the classification effectiveness over all decisions made by the classifier, MacroF1 measures the classification effectiveness for each individual class and averages them. All experiments were executed using a 10-fold cross-validation (Breiman and Spector, 1992) procedure considering training, validation and test sets. The parameters were set using the validation set, and the effectiveness of the algorithms was measured on the test partition.

We start by reporting, in Section 5.4.1, the parameter setup performed in order to conduct our experimental evaluation. Then, in Section 5.4.2, we report and analyze the results obtained when using the original definition of the TWF (described in Section 5.1) for the ACM-DL and MEDLINE datasets. Finally, in Section 5.4.3, we evaluate the use of the fully-automated strategy to devise the TWF that feeds our temporally-aware classifiers, and discuss some important aspects regarding its efficiency in terms of runtime. All the experiments were run on a Quad-Core AMD Opteron™ CPU with 16 GB of RAM.
5.4.1 Parameter Settings
An important aspect to be considered when dealing with the temporally-aware classifiers is that the TWF scale must be compatible with the values weighted by the TWF. Clearly, this is algorithm specific and should be properly set to ensure that the TWF effectively improves the classifier's decision rules without compromising them. To explicitly control the TWF scale (without modifying its shape), we introduce a scaling factor β, which should be properly calibrated over the training set.
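The role of β can be sketched as a simple rescaling of the TWF. How exactly β enters is our reading of the text, which only states that it controls the TWF scale without modifying its shape:

```python
def scaled_twf(twf, beta):
    """Return a TWF rescaled by beta; relative weights (shape) unchanged."""
    return lambda delta: beta * twf(delta)

# Toy base TWF and a version scaled by beta = 10.
base = lambda d: {0: 1.0, -1: 0.5}.get(d, 0.0)
t10 = scaled_twf(base, 10)
print(t10(0), t10(-1))  # -> 10.0 5.0
```

In practice, β would be chosen from {1, 10, 100} by validation over the training set, as described below.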
Hence, in order to run the experiments, two important parameters had to be set: the value of K for KNN and the scaling factor β. We first performed some experiments with KNN to define the value of K. This parameter significantly impacts the quality of the classifier, and must be carefully chosen. The following values were tested, by means of cross-validation over the training set, for each version of the traditional and temporally-aware algorithms: 3, 10, 30, 50, 150, and 200. For the traditional version of the algorithm, K = 30 achieved the best results, while for both the in documents and in scores versions of KNN the best value of K was 50. The intuition for the traditional KNN performing better with smaller values of K is that, as the number of neighbors increases, the variation in term-class relationships also increases, and so does the probability of misclassification. On the other hand, when setting K < 30, the traditional version of KNN performed poorly due to overfitting. When considering temporal information by means of the proposed temporal weights, in contrast, more consistent information becomes available (due to a larger K), allowing a more accurate model. Finally, the extended in scores version of KNN performed best with K = 3 on the ACM-DL dataset and K = 10 on the MEDLINE and AG-NEWS datasets (recall that in this strategy the KNN classification model is built from reduced training sets, composed of documents belonging to the same point in time, which justifies these smaller values).
We empirically tested three values for β: 1, 10, and 100. The best value for each version of each classifier was considered. For Rocchio and KNN, the best results were obtained with β = 1. For Naïve Bayes, the best value was β = 10.
5.4.2 Experiments with the Statistically Defined TWF
In this section, we report our experiments comparing the traditional and the proposed temporally-aware versions of Rocchio, KNN and Naïve Bayes, using the statistically defined TWF, reported in Section 5.1, for the ACM-DL and MEDLINE datasets. We defer the analysis regarding the AG-NEWS dataset to Section 5.4.3, where we discuss the results obtained using the estimated TWF (as described in Section 5.2). The results obtained for the ACM-DL and MEDLINE datasets are reported in Tables 5.4 and 5.5, respectively. In both tables, each line presents the results achieved by the versions of the classifiers identified in the first row and column. The values obtained for MacroF1 ("macF1") and MicroF1 ("micF1") are reported, as well as the percentage difference between the values achieved by the temporally-aware methods and the traditional version of the classifiers. This percentage difference is followed by a symbol that indicates whether the variation is statistically significant according to a 2-tailed paired t-test with a 99% confidence level: ▲ denotes a significant positive variation, • a non-significant variation, and ▼ a significant negative variation. This notation is also adopted in Section 5.4.3.
Algorithm            Rocchio               KNN                   Naïve Bayes
Metric               macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline             57.39      68.24      58.48      71.84      57.27      73.24
TWF in documents     60.02      70.64      59.92      73.84      60.78      74.11
                     (+4.58)▲   (+3.52)▲   (+2.46)▲   (+2.78)▲   (+6.13)▲   (+1.19)•
TWF in scores        59.85      72.47      62.02      74.45      44.85      63.93
                     (+4.29)▲   (+6.20)▲   (+6.05)▲   (+3.63)▲   (-27.69)▼  (-14.56)▼
TWF in scores ext.   59.27      71.39      59.78      73.85      56.23      72.35
                     (+3.28)▲   (+4.62)▲   (+2.22)▲   (+2.80)▲   (-1.84)•   (+1.23)•

Table 5.4: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes—ACM-DL.

Algorithm            Rocchio               KNN                   Naïve Bayes
Metric               macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline             54.26      69.27      72.49      82.86      64.61      80.82
TWF in documents     54.08      69.48      74.10      83.36      66.75      82.87
                     (-0.33)•   (+0.30)•   (+2.22)▲   (+0.60)•   (+3.31)▲   (+2.54)▲
TWF in scores        63.95      77.63      75.89      86.35      58.12      80.49
                     (+17.86)▲  (+12.07)▲  (+4.69)▲   (+4.21)▲   (-10.04)▼  (-0.41)•
TWF in scores ext.   63.63      77.28      74.45      84.96      63.41      81.06
                     (+17.27)▲  (+11.56)▲  (+2.70)▲   (+2.53)▲   (-1.89)•   (+0.30)•

Table 5.5: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes—MEDLINE.

As we can see in Tables 5.4 and 5.5, all modified versions of Rocchio and KNN achieved better results than the baseline on ACM-DL. On MEDLINE, the "in scores" and "extended in scores" versions achieved statistically significant gains, while the "in documents" versions were statistically tied with the baseline. In particular, Rocchio with TWF in scores presents the most significant improvements in both datasets, with gains of up to +17.86% and +12.07% for MacroF1 and MicroF1, respectively. Similarly, KNN with TWF in scores
achieves the best results among all KNN variations, with gains of +6.05% and +4.21% for MacroF1 and MicroF1, respectively. In the case of Rocchio, the improvements achieved using the TWF can be explained by the fact that, in the traditional version, the documents are summarized in a unique representative vector (centroid), aggregating documents from distinct creation points in time, which ultimately affects the prediction ability of the classifier. In the case of KNN, the definition of class boundaries is done considering each training document independently. KNN assumes that documents of the same class are located close to each other in the vector space. By using the TWF, the k nearest documents are reorganized, and the most temporally relevant documents are placed closer to the document being classified, according to the temporal distance between them.
The Naïve Bayes with TWF in documents presents better results for MacroF1 on both ACM-DL and MEDLINE, and better MicroF1 on the MEDLINE dataset. Note that the best improvement was achieved in MacroF1, indicating that this strategy effectively reduces the Naïve Bayes bias towards the most frequent classes and, consequently, improves the effectiveness of this classifier when predicting documents from the smaller classes. However,
in contrast with Rocchio and KNN, the Naïve Bayes with TWF in scores performs poorly on both datasets. A closer look at the "in scores" strategy reveals that, if it is built upon a traditional classifier whose decision rule is strongly influenced by the negative documents (as KNN and Naïve Bayes are), its performance is bound to be poor when applied to datasets with skewed ⟨c, p⟩ distributions. Although in KNN this problem can be ameliorated by properly tuning the parameter K, in Naïve Bayes this is not possible. Thus, we attribute the poor performance of the Naïve Bayes with TWF in scores to two major weaknesses of the traditional Naïve Bayes version. First, when facing skewed data distributions, the traditional version of Naïve Bayes unwittingly prefers larger classes over the others, causing decision boundaries to be biased (in this case, the prediction of the smaller classes is influenced by the negative documents belonging to the major classes). Second, when data is scarce, there is not enough information to perform accurate estimates, leading to bad results.
The skewness of the data distribution among classes 〈c, p〉 can be quantified by the
Coefficient of Variation CV = σ/µ of their sizes, where σ and µ stand for the standard
deviation and the mean, respectively. To explore the impact of data skewness on Naïve Bayes,
we sampled MEDLINE, creating two sub-collections composed of the least and the most
frequent classes 〈c, p〉, thereby minimizing data skewness. While the entire collection presents
CV = 1.33, the sub-collections with the least and most frequent classes present CV equal
to 0.57 and 0.43, respectively. As we can observe in Tables 5.5 and 5.6, the greater the CV,
the worse the results.
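For reference, the CV of the 〈c, p〉 class sizes can be computed as in the minimal sketch below. We assume the population standard deviation here, since the text does not state which estimator was used.

```python
import statistics

def coefficient_of_variation(class_sizes):
    """CV = sigma / mu over the <c, p> class sizes.
    A higher CV indicates a more skewed class-size distribution."""
    mu = statistics.mean(class_sizes)
    sigma = statistics.pstdev(class_sizes)  # population standard deviation
    return sigma / mu
```

For example, a collection whose 〈c, p〉 sizes are all equal yields CV = 0, while sizes of 100 and 300 documents yield CV = 0.5.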
Naïve Bayes       Least frequent classes 〈c, p〉       Most frequent classes 〈c, p〉
Metric            CV     macF1(%)    micF1(%)          CV     macF1(%)    micF1(%)
Baseline          0.57   74.72       80.42             0.43   88.70       87.65
TWF in scores            78.49       84.75                    91.16       89.66
                         (+5.04)N    (+5.38)N                 (+2.77)N    (+2.29)N

Table 5.6: Results Obtained for the Least and Most Frequent Classes 〈c, p〉 Sampling for Naïve Bayes—MEDLINE.
Figure 5.7 shows the histogram with the percentage of classes 〈c, p〉 with up to a given
size (specified in the x-axis) for the ACM-DL and MEDLINE datasets. As we can observe
in Figure 5.7a, data scarcity is prominent in the ACM-DL dataset, contributing to the
poor performance of the Naïve Bayes with TWF in scores. Notice that 70% of the classes 〈c, p〉
have less than 100 documents, a number too low to guarantee accurate estimates. This is
also observed in the MEDLINE dataset (see Figure 5.7b), but to a smaller extent: more
specifically, 13% of the classes 〈c, p〉 are composed of less than 500 documents, whereas
35% are composed of 2500 to 3000 documents. In addition, ACM-DL has an even more
skewed data distribution over each time point (with a CV equal to 1.69 regarding the 〈c, p〉
sizes), preventing us from sampling it into sub-collections with smaller CV, as performed
with the MEDLINE dataset.
Figure 5.7: Relative 〈c, p〉 Sizes. (a) ACM-DL Dataset; (b) MEDLINE Dataset.
Recall that the main motivation behind the “extended in scores” strategy is to ameliorate
the class imbalance problem, which negatively impacts the “in scores” effectiveness.
In the “extended in scores” strategy, the influence of negative documents is bounded by
considering the data from each point in time in isolation. More specifically, the class imbalance
is not given by the 〈c, p〉 distribution as in the “in scores” strategy, but by the class
imbalance observed within each point in time. In fact, this class distribution is typically more
even than the artificial 〈c, p〉 distribution. As we can observe in the reported results,
the extended in scores version of Naïve Bayes performed better than its in scores version.
However, it still did not perform better than the baseline (with statistically equivalent
results in all cases), due to the discussed data scarcity problem, which prevents this classifier
from learning accurate estimates of the class densities. Strategies to handle data scarcity (for
instance, by oversampling the training set) are one of our current research focuses, and we
plan to further investigate this matter as future work.
We now analyze the obtained results in light of the quantitative analysis reported
in Chapter 4. As observed, using TWF in scores in most cases led to better results than
applying TWF in documents. This is due to the fact that the “in scores” strategy simultaneously
addresses the three discussed temporal effects, namely, the class distribution variation
(CD), the pairwise class similarity variation (CS) and the term distribution variation (TD),
whereas the “in documents” strategy takes into account just the TD effect, as discussed
next. Furthermore, as we can observe in Table 5.5 regarding the MEDLINE dataset, with the
“in documents” strategy the results obtained were statistically equivalent to the baselines
in almost all cases, with the Naïve Bayes being an exception. As will be discussed in the
following, this is due to the MEDLINE characteristics with respect to the extent of the TD
effect.
Recall that the temporal weighting in documents strategy weights each training document
by the TWF according to its temporal distance to the test document. The TWF is
modeled according to the observed variations over time in the term-class relationships of
each dataset, ultimately addressing the TD aspect. Furthermore, recall that both the temporal
weighting in scores and its extended version tie the observed patterns to both
class and temporal information. While the “in scores” strategy transforms the class domain from C
to C × P (generating a new training set), the “extended in scores” strategy groups training
documents into partitions composed of documents created at the same point in time,
performing a traditional classification procedure over each partition. Both strategies assume
that the temporal effects may be safely neglected within a single point in time, and thus that the
classification models learned considering each point in time in isolation are not affected by
them. However, as previously stated, considering only the data related to a single point in
time may disregard valuable information for learning an accurate classification model. Thus, the
second step of these strategies consists of aggregating the information learned for each point
in time, weighting the obtained classification scores by the TWF. Aggregating the scores
obtained for each point in time is affected by the TD effect, since the scores reflect the
relationships between terms and classes. In order to overcome the observed variations in the
term-class relationships across the different points in time, the TWF is used to weight the scores
according to the temporal distance between the point in time associated to each partition
and the creation time of the test document. Thus, while the first step addresses the CD and
CS effects, the second step addresses the TD aspect observed when aggregating the scores.
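The aggregation step common to both “in scores” variants can be sketched as follows. This is an illustrative implementation under assumed data structures (per-time score dictionaries and a TWF callable), not the authors' exact code; it computes score_c = Σ_p score_〈c,p〉 · TWF(δ), where δ is the temporal distance between partition p and the test document.

```python
def aggregate_scores(scores_per_time, twf, test_time):
    """scores_per_time: {point_in_time: {class_label: score}}, the scores a
    base classifier assigned to the derived <c, p> classes (or to per-time
    partitions). Each score is weighted by the TWF of the temporal distance
    to the test document and summed per class:
        score_c = sum over p of score_<c,p> * TWF(delta)."""
    final = {}
    for p, class_scores in scores_per_time.items():
        weight = twf(abs(test_time - p))
        for c, s in class_scores.items():
            final[c] = final.get(c, 0.0) + weight * s
    predicted = max(final, key=final.get)
    return predicted, final
```

With a TWF that decays with δ, scores from partitions close in time to the test document dominate the final per-class score, as intended.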
Analyzing the reported results, we can observe that, for ACM-DL, the three strategies
achieved significant gains. Considering the temporal weighting in documents approach, its
gains can be justified by the high impact of TD in that dataset. Moreover, since ACM-DL
is also subject to a high impact of both CD and CS, both the temporal weighting in
scores and its extended version also performed well, since they address such effects, as
previously discussed. In MEDLINE, in contrast, since the impact of TD is smaller than the
impact of the other two effects, we should expect less significant gains from temporal
weighting in documents. Indeed, this was the observed behavior: this approach achieved
statistical ties compared to the baselines in almost all cases. However, as both CD and CS
are important factors in that dataset, we can observe statistically significant improvements
in classification effectiveness when the temporal weighting in scores and its extension are
applied. Furthermore, the largest improvements are achieved when the temporal weighting
in scores is applied with the Rocchio classifier, which, as discussed in the previous section,
is the most affected by both CD and CS in that dataset (see summary in Table 4.8).
5.4.3 Experiments with the Estimated TWF
In this section, we report our experimental evaluation to assess the effectiveness of the
proposed temporally-aware classifiers using the TWF learned by the fully-automated procedure
described in Section 5.2. The goal here is to increase the applicability of the temporally-aware
classifiers. For example, even if uncertain about the expression (and parameters) of
the TWF associated to the AG-NEWS dataset, we can still determine the weights associated
to each temporal distance using the procedure described in Algorithm 2, and use our
temporally-aware classifiers with the learned TWF. Thus, in this section we examine the
results obtained when applying the temporally-aware classifiers to the three reference datasets,
using the estimated TWF. Recall that, in this case, the TWF is learned from the training set
D. An interesting aspect to be analyzed refers to the amount of data required to accurately
estimate this function: is the whole training set needed to learn the TWF? In order to have
a first glance at this matter, we evaluate our strategies using the TWF learned from the
entire D and from a sample composed of 10% of D, selected by a per point in time random
sampling (to guarantee that each point in time will have at least one document).
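The per point in time sampling just described can be sketched as below (an assumed implementation; the text does not give code). Each time point keeps at least one document, so no point in time disappears from the sample.

```python
import random
from collections import defaultdict

def sample_per_time(docs, fraction=0.10, seed=42):
    """docs: list of (document, creation_time) pairs. Randomly keeps
    `fraction` of the documents of each point in time, but always at
    least one document per time point, so that every point in time
    remains represented in the sample."""
    rng = random.Random(seed)
    by_time = defaultdict(list)
    for doc, t in docs:
        by_time[t].append((doc, t))
    sample = []
    for group in by_time.values():
        n = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, n))
    return sample
```

For a time point with 100 documents this keeps 10 of them, while a time point with a single document always contributes that one document.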
We start by comparing the results obtained using the estimated TWF with the results
obtained when using the statistically defined TWF, considering the ACM-DL and MEDLINE
datasets. We stress here that the results obtained when estimating the TWF using each of
the three classifiers (line 4 of Algorithm 2) were statistically equivalent, as could be expected
from the observed similarities in Figure 5.3. Thus, we report just the results obtained by
estimating the TWF using the Rocchio classifier. As we shall see, the use of the estimated
TWF led to results statistically equivalent to the ones obtained when using the original
definition of the TWF. Then, we compare the effectiveness of the traditional and the
temporally-aware classifiers when applied to the AG-NEWS dataset (since the conclusions
drawn in the previous section, regarding ACM-DL and MEDLINE, also hold here).
An important aspect to be observed in Tables 5.7 and 5.8 is that using the estimated
TWF led to results statistically equivalent to those obtained with the statistically defined
TWF. This was assessed by a 2-tailed paired t-test, with 99% confidence level. In fact, there
is an interesting similarity between the distribution of temporal distances used to determine
the TWF expression, illustrated in Figure 5.1, and the estimated TWFs for both datasets,
illustrated in Figures 5.3a and 5.3b. This implies that we can adopt the automated procedure
to determine the TWF without affecting the effectiveness of the temporally-aware algorithms.
Furthermore, the same discussion presented in the previous section, regarding the
quantitative analysis of the behavior of the temporally-aware classifiers w.r.t. each dataset,
also holds here. Similarly, the poor performance of both “in scores” versions of the Naïve
Bayes classifier is again attributed to the class imbalance problem (for the TWF in scores
strategy) and to the lack of training documents associated with each class (for the extended
TWF in scores strategy), just as before.

Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        57.39      68.24       58.48      71.84       57.27       73.24
TWF in documents (100% of D)    60.21      70.70       60.08      73.88       61.38       74.60
                                (+4.91)N   (+3.60)N    (+2.74)N   (+2.84)N    (+7.18)N    (+1.86)•
TWF in documents (10% of D)     60.52      70.88       61.02      74.27       61.44       74.24
                                (+5.45)N   (+3.87)N    (+4.84)N   (+3.82)N    (+7.28)N    (+1.36)•
TWF in scores (100% of D)       60.47      72.90       61.88      74.53       45.16       64.55
                                (+5.47)N   (+6.83)N    (+5.81)N   (+3.74)N    (-26.82)H   (-13.46)H
TWF in scores (10% of D)        59.68      72.40       61.37      73.77       44.47       64.58
                                (+3.99)N   (+6.10)N    (+4.94)N   (+2.69)N    (-28.78)H   (-13.41)H
TWF in scores ext. (100% of D)  59.96      71.99       59.80      73.95       56.28       72.73
                                (+4.48)N   (+5.49)N    (+2.26)N   (+2.94)N    (-1.76)•    (-0.70)•
TWF in scores ext. (10% of D)   59.85      71.79       59.76      73.85       56.19       72.70
                                (+4.29)N   (+5.20)N    (+2.19)N   (+2.80)N    (-1.89)•    (-0.74)•

Table 5.7: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—ACM-DL.

Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        54.26      69.27       72.49      82.86       64.61       80.82
TWF in documents (100% of D)    54.03      69.48       73.96      82.76       67.95       82.98
                                (-0.43)•   (+0.30)•    (+2.03)N   (-0.12)•    (+5.17)N    (+2.67)N
TWF in documents (10% of D)     55.01      70.35       73.63      82.87       67.84       82.89
                                (+1.38)•   (+1.56)•    (+1.57)•   (+0.01)•    (+5.00)N    (+2.56)N
TWF in scores (100% of D)       64.47      77.12       75.99      86.33       58.20       80.48
                                (+18.82)N  (+11.33)N   (+4.83)N   (+4.19)N    (-9.92)H    (-0.42)•
TWF in scores (10% of D)        64.25      77.03       75.88      86.36       58.23       80.51
                                (+18.41)N  (+11.20)N   (+4.68)N   (+4.22)N    (-9.87)H    (-0.38)•
TWF in scores ext. (100% of D)  64.53      77.16       74.63      85.07       64.64       81.12
                                (+18.93)N  (+11.39)N   (+2.95)N   (+2.67)N    (-0.05)•    (+0.37)•
TWF in scores ext. (10% of D)   64.32      77.24       74.74      84.99       64.74       81.10
                                (+18.54)N  (+11.51)N   (+3.10)N   (+2.57)N    (+0.20)•    (+0.35)•

Table 5.8: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—MEDLINE.
Another important aspect to be observed in Tables 5.7 and 5.8 is that it does not matter
whether the entire training set D or just 10% of it is used to learn the TWF. In fact, both
cases led to statistically equivalent results (assessed by a 2-tailed paired t-test with 99%
confidence) in all cases. This is an important property of Algorithm 2, since the smaller the
training set (that is, its input), the smaller the expected runtime to learn the TWF.
Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        54.89      58.16       60.05      68.49       60.92       67.83
TWF in documents (100% of D)    58.34      62.34       58.96      67.45       62.24       68.46
                                (+6.29)N   (+7.19)N    (-1.85)•   (-1.54)•    (+2.17)N    (+0.93)•
TWF in documents (10% of D)     58.35      62.29       58.91      67.35       62.38       68.55
                                (+6.30)N   (+5.68)N    (-1.90)•   (-1.66)•    (+2.40)N    (+1.06)•
TWF in scores (100% of D)       57.82      66.26       58.36      64.94       51.65       61.91
                                (+5.34)N   (+13.93)N   (-2.90)H   (-5.47)H    (-15.22)H   (-8.73)H
TWF in scores (10% of D)        58.01      66.30       58.15      64.84       51.69       61.97
                                (+5.68)N   (+14.00)N   (-3.16)H   (-5.33)H    (-15.15)H   (-8.64)H
TWF in scores ext. (100% of D)  57.72      66.12       59.12      68.93       56.43       65.21
                                (+5.16)N   (+13.69)N   (-1.57)•   (+0.64)•    (-7.37)H    (-3.86)H
TWF in scores ext. (10% of D)   57.69      65.99       59.08      68.77       56.47       65.22
                                (+5.10)N   (+13.46)N   (-1.61)•   (+0.41)•    (-7.30)H    (-3.85)H

Table 5.9: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—AG-NEWS.
We now turn our attention to the AG-NEWS dataset. Similarly to the ACM-DL and
MEDLINE datasets, the temporally-aware versions of the Rocchio classifier present the most
significant improvements over the baseline, with gains of up to 6.29% and 14.00% for
MacroF1 and MicroF1, respectively. As in the MEDLINE dataset, both “in scores” versions of
the Rocchio classifier performed better than its “in documents” version. This is due to the
nature of this dataset, evidenced by the quantitative analysis reported in Chapter 4. In fact,
this dataset presents more prominent variations in the class distribution (CD) and the class
similarities (CS) than in the term distribution (TD): as reported in Section 4.4.1, “the impact
of the TD effect is consistently lower than the impact of CD (or CS) on all four algorithms” in
this dataset. However, unlike the temporally-aware versions of Rocchio, the “in documents”
versions of KNN and Naïve Bayes were statistically tied with their baselines (with the Naïve
Bayes with TWF in documents being an exception, with statistically significant gains in
MacroF1 of up to 2.40%). This is justified by the smaller extent of the TD effect in the
AG-NEWS dataset. Accordingly, their “in scores” versions should perform better, but this
was not observed. Indeed, both the KNN and Naïve Bayes with TWF in scores led to
significant losses in both MacroF1 and MicroF1. Again, we attribute this to both the class
imbalance and the data scarcity problems. In order to provide evidence for this problem, in
Figure 5.8 we show the histogram of the 〈c, p〉 sizes (as done with the other two datasets). In
fact, we can observe that 72% of the 〈c, p〉 sizes are smaller than 200 (with 46% of the 〈c, p〉
classes composed of at most 100 documents), with CV equal to 1.85, a much more skewed
and sparse distribution than that of the other two datasets.
Figure 5.8: Relative 〈c, p〉 Sizes for the AG-NEWS Dataset.
Recall that the “extended in scores” strategy aims at minimizing the influence of the
class imbalance problem on classification effectiveness. In fact, this strategy outperformed
the “in scores” strategy in all cases. However, it was not able to outperform the
baselines. This is due to the previously discussed data scarcity problem, which is, incidentally,
more pronounced in the AG-NEWS dataset than in the other two datasets. In contrast
to the improvements obtained by the “extended in scores” version of the KNN classifier in the
MEDLINE dataset, in AG-NEWS it resulted in statistical ties. Furthermore, in contrast
to the statistical ties obtained by the extended in scores version of the Naïve Bayes classifier,
in the AG-NEWS dataset it produced statistically significant losses. We conjecture that the
adoption of strategies to overcome the data scarcity problem may improve the effectiveness
of this strategy. Again, we leave this matter for future work.
Finally, we also compared the best temporally-aware classifiers to the state-of-the-art
Support Vector Machine (Joachims, 1999) classifier. We adopted an efficient SVM
implementation, SVM_Perf (Joachims, 2006), which is based on the maximum-margin approach
and can be trained in linear time. We used a one-against-all methodology (see Manning et al., 2008)
to adapt the binary SVM to multi-class classification, since, as presented in Section 4.1,
the explored datasets have more than two classes. This comparison is presented
in Table 5.10. For the ACM-DL dataset (Table 5.10a), the significant gains were 3.29% and
2.45% in MacroF1 (with statistical ties in MicroF1), for the KNN with TWF in scores and the
Naïve Bayes with TWF in documents, respectively. Furthermore, both the Rocchio with TWF in
scores and the KNN with TWF in documents obtained results statistically equivalent to the SVM
results. In all these three cases, the temporally-aware classifiers were faster than the SVM by
more than an order of magnitude. For the MEDLINE dataset (Table 5.10b), the most significant gains
were 2.03% and 2.48% in MacroF1 and MicroF1, respectively, obtained by the KNN with TWF
in scores. The extended in scores version of KNN achieved statistically tied results. As we
shall discuss in Section 5.4.4, both classifiers are significantly faster than SVM. Considering
that SVM is a state-of-the-art classifier, and that both datasets are imbalanced, our results
evidence the quality of the proposed solution. Considering the AG-NEWS dataset (Table 5.10c),
the best performing temporally-aware classifier was unable to outperform the SVM due to
the limitations already discussed. However, it is worth noting that our temporally-aware
classifier was not drastically outperformed, and there is still room for improvement.
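The one-against-all adaptation mentioned above can be sketched generically as follows. This is an illustrative wrapper, not the SVM_Perf setup used in the experiments: `make_binary` stands for any procedure that fits a binary scoring function (e.g. a linear SVM decision function), and the class whose scorer yields the highest score wins.

```python
class OneVsRest:
    """Generic one-against-all wrapper: one binary scorer is trained per
    class (positives = the class, negatives = all the others), and the
    predicted class is the one whose scorer yields the highest score."""
    def __init__(self, make_binary):
        self.make_binary = make_binary  # callable: (X, labels) -> scorer
        self.scorers = {}

    def fit(self, X, y):
        for c in set(y):
            # relabel the multi-class problem as "c versus the rest"
            labels = [1 if yi == c else -1 for yi in y]
            self.scorers[c] = self.make_binary(X, labels)
        return self

    def predict(self, x):
        return max(self.scorers, key=lambda c: self.scorers[c](x))
```

Any binary learner can be plugged in; for instance, a toy scorer that dots the input with the centroid of the positive examples already yields a working multi-class classifier.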
5.4.4 Runtime Analysis
Now we turn our attention to the efficiency of our proposed classifiers, in terms of execution
time. We start by considering the average time spent by the classifiers in each iteration of
the K-Fold cross validation, and comparing the temporally-aware algorithms with both their
traditional counterparts and the state of the artSupport Vector Machine(Joachims, 1999)
classifier (using the previously described SVM_Perf implementation). Next, we analyze the
additional cost associated with the automatic determination of the TWF.
We report in Table5.11the average execution time of each traditional classifier (rows
entitled “traditional”, including the measurement for the SVM classifier), their temporally-
aware versions (lines “In Documents”, “In Scores” and “Extended In Scores”), along with
the standard deviation over the mean value (reported after the± symbol). We consider the
execution time measured for the overall classification task, comprised by both the training
and test stages.1 The measurements regarding the ACM-DL dataset refers to theclassifica-
tion of 2490 documents using19918 training documents, while the measurements regard-
ing the MEDLINE dataset refers to the testing of86145 documents with a classification
model learned from689163 training documents. Finally, the measurements regarding the
AG-NEWS dataset refers to the classification of83580 documents based on a classification
model learned from668636 documents. Clearly, the columns of this table are not compara-
ble, and we consider the measurements for each dataset independently.
1Recall that in our experimental setup, using the 10-fold cross validation strategy, one fold is used as testset, another one is retained as the validation set and the remaining folds are used as training set.
(a) ACM-DL Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  59.91       73.88
Rocchio with TWF in scores           60.47       72.90
                                     (+0.93)•    (-1.34)•
KNN with TWF in documents            59.78       73.88
                                     (-0.22)•    (+0.00)•
KNN with TWF in scores               61.88       74.53
                                     (+3.29)N    (+0.88)•
Naïve Bayes with TWF in documents    61.38       74.60
                                     (+2.45)N    (+0.97)•

(b) MEDLINE Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  74.48       84.24
KNN with TWF in scores               75.99       86.33
                                     (+2.03)N    (+2.48)N
KNN with TWF in scores ext.          74.63       85.07
                                     (+0.20)•    (+0.98)•

(c) AG-NEWS Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  64.94       72.59
Naïve Bayes with TWF in documents    62.38       68.55
                                     (-4.10)H    (-5.56)H

Table 5.10: Effectiveness Comparison: Best Performing Temporally-Aware Classifiers versus SVM.
As one could expect, our temporally-aware classifiers are typically slower than their
traditional counterparts. This comes as no surprise, since there is the overhead of considering
and managing the temporal information. Furthermore, the temporally-aware classifiers are,
by nature, lazy classifiers, which comes at the cost of a higher runtime. The temporally-aware
versions incurred the largest increase in execution time in the AG-NEWS dataset, due to the
higher number of points in time in this dataset. However, in almost all cases our lazy
temporally-aware classifiers were more efficient, in terms of execution time, than the SVM
classifier.
Algorithm                          Runtime (seconds) per Dataset
                                   ACM-DL         MEDLINE            AG-NEWS
SVM         Traditional            144.10±5.30    26955.0±2356.0     28667.0±1151.0
Rocchio     Traditional            2.00±0.00      111.0±0.0          96.5±0.5
            In Documents           6.60±0.52      209.5±12.5         4615.5±89.5
            In Scores              9.00±0.00      300.5±3.5          5287.5±9.5
            Extended In Scores     7.20±0.42      263.5±0.5          3807.0±29.0
KNN         Traditional            8.90±0.32      13442.5±79.5       8154.0±60.0
            In Documents           11.03±0.48     15949.0±51.0       10368.5±8.5
            In Scores              10.10±0.31     12557.5±630.5      8630.5±349.5
            Extended In Scores     8.40±0.75      7753.5±78.5        4711.5±46.5
Naïve Bayes Traditional            5.00±0.00      213.0±7.0          186.5±0.5
            In Documents           9.10±0.32      293.0±2.0          3780.0±95.0
            In Scores              63.80±1.32     1311.0±1.0         43570.0±85.0
            Extended In Scores     60.50±1.18     656.5±6.5          38966.5±108.5

Table 5.11: Execution Time (in seconds) of each Explored ADC Algorithm.

We also compared the best versions of the previously proposed methods, for example,
the KNN with TWF in scores and the Naïve Bayes with TWF in documents, to the SVM
classifier, in terms of efficiency (execution time). This comparison is presented in Table 5.12. For
the ACM-DL dataset (Table 5.12a), our best performing classifiers were up to 13 times faster
than SVM while, at least, matching the SVM effectiveness (as reported in Table 5.10a). For
the MEDLINE dataset (Table 5.12b), the KNN with TWF in scores was more than two times
faster than SVM and, as previously reported, outperformed this classifier in both MacroF1
and MicroF1. The extended in scores version of KNN was more than three times
faster than SVM. Considering the AG-NEWS dataset (Table 5.12c), our best performing
temporally-aware classifier was almost eight times faster than the SVM (but unable to
outperform it in terms of effectiveness).
Finally, we now consider the efficiency of the TWF determination algorithm. Recall
from Algorithm 2 that it is necessary to perform a classification over the training set in order
to estimate the a posteriori probability distribution P(pi|di) and then determine the TWF.
There are two key aspects to be considered. First, since it does not matter which of the three
classifiers is used to learn the TWF (they led to statistically equivalent results), it is
advisable to use the Rocchio classifier for doing so, since it is, by far, the most efficient one
(as can be observed in Table 5.11, by comparing the traditional versions of each of them).
Second, it is clear that as the training set size increases, the cost involved in determining the
TWF also increases and can be potentially prohibitive. To better understand the dependency
between the execution time of the TWF determination and the training set size, we measured
the execution time spent to determine the TWF using the entire training set D and a per
point in time sample of D, obtained by randomly selecting 10% of the documents of D. We
then compared these measurements with the time spent by the fastest temporally-aware
classifiers and the SVM classifier, for each explored dataset. This comparison is reported in
Table 5.13. As we can observe, the time required to automatically learn the TWF is negligible
when compared to the time spent on the classification task. In addition to this efficiency
aspect, the TWF determination is inherently an offline procedure, guaranteeing its practical
applicability.

(a) ACM-DL Dataset

Algorithm                            Time (s)
SVM                                  144.10±5.30
Rocchio with TWF in scores           9.00±0.00
KNN with TWF in documents            11.03±0.48
KNN with TWF in scores               10.10±0.31
Naïve Bayes with TWF in documents    9.10±0.32

(b) MEDLINE Dataset

Algorithm                            Time (s)
SVM                                  26955.0±2356.0
KNN with TWF in scores               12557.5±630.5
KNN with TWF in scores ext.          7753.5±78.5

(c) AG-NEWS Dataset

Algorithm                            Time (s)
SVM                                  28667.0±1151.0
Naïve Bayes with TWF in documents    3780.0±95.0

Table 5.12: Execution Time Comparison: Best Performing Temporally-Aware Classifiers versus SVM.
Dataset     TWF Determination Runtime (s)       Fastest Temporally-Aware Classifier
            10% of D        Entire D
ACM-DL      0.77±0.02       4.49±0.04           6.60±0.52 (Rocchio in documents)
MEDLINE     31.00±3.00      180.00±28.00        209.50±12.50 (Rocchio in documents)
AG-NEWS     120.00±8.00     1560.00±25.00       3780.00±95.00 (Naïve Bayes in documents)

Table 5.13: Execution Time of the TWF Estimation using the Rocchio Classifier.
5.5 Chapter Summary
In this chapter, we discussed the impact that temporal effects may have on ADC, and
proposed new instance weighting strategies that lead to more accurate classification models.
We started by proposing a methodology to model a temporal weighting function (TWF) that
captures changes in term-class relationships over a given period of time. For the ACM-DL
and MEDLINE datasets, we showed that the TWF follows a lognormal distribution, whose
parameters may be easily determined using statistical methods (see Section 5.1). For the
AG-NEWS dataset, on the other hand, we showed that the same hypothesis testing procedures
adopted for the ACM-DL and MEDLINE datasets failed, implying that its associated
TWF follows a distinct (yet unknown) distribution. Thus, assessing the TWF associated to the
AG-NEWS dataset requires some other (possibly more complex) statistical tests, motivating
the development of a strategy to determine the TWF that avoids the need to perform
such tests. As a matter of fact, for the sake of temporally-aware ADC, one just needs
to know the positive real valued weights associated to each temporal distance δ. In
Section 5.2 we described such a strategy, which uses the ADC algorithms themselves to determine
the mapping δ → R+. This is accomplished by estimating the a posteriori probability
distribution P(pi|di) and gathering the relative frequencies of the temporal distances δ between
the predicted point in time pi and the actual point in time di.p in which di was created.
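The estimation step just summarized can be sketched as below. This is an illustrative reading of the procedure under assumed interfaces (the `classify_time` callable is a hypothetical stand-in for a classifier, such as Rocchio, trained to predict a document's point in time), not the exact Algorithm 2.

```python
from collections import Counter

def estimate_twf(train_docs, classify_time):
    """train_docs: list of (document, actual_creation_time) pairs.
    For each training document we take the temporal distance delta
    between the predicted and the actual creation time; the TWF is
    then given by the relative frequency of each observed delta."""
    deltas = Counter(abs(classify_time(doc) - created)
                     for doc, created in train_docs)
    total = sum(deltas.values())
    relative = {delta: count / total for delta, count in deltas.items()}
    # unseen temporal distances receive zero weight
    return lambda delta: relative.get(delta, 0.0)
```

The returned function maps each temporal distance δ to a positive real weight, which is exactly what the temporally-aware classifiers consume.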
We then presented our three strategies to incorporate the TWF into classifiers: TWF
in documents, TWF in scores, and an extended version of the TWF in scores strategy. The
TWF in documents strategy weights each training document by the TWF according to its
temporal distance to the test document. The TWF in scores strategy, in contrast, learns
the scores for each class c by using a traditional ADC algorithm to first learn scores for
the derived classes 〈c, p〉, and then aggregating these scores using the TWF to weight them
(that is, score_c = Σ_p score_〈c,p〉 · TWF(δ)). Finally, the extended TWF in scores strategy partitions
the training documents into sub-groups of documents with the same creation point in time
(and thus without temporal variability in the term-class relationships), learns a
classification model for each partition, and aggregates the generated scores using the TWF
to weight them.
The three strategies were implemented considering three traditional classifiers, namely
Rocchio, KNN, and Naïve Bayes. Results with the traditional versions of these classifiers
and the temporally-aware ones showed that considering temporal information significantly
improves the results of the traditional classifiers. We also showed that, even using only 10% of
the training set to automatically determine the TWF, we can accurately estimate it and achieve
results comparable to the ones obtained using the whole training set. This highlights
that, in addition to avoiding potentially complex hypothesis testing to determine the TWF,
this strategy demands quite a small additional cost, being usually performed in an offline
manner. Also, both the temporally-aware KNN and Naïve Bayes achieved better results than
SVM in the ACM-DL and MEDLINE datasets, with better runtime performance. Considering
that SVM is a state-of-the-art classifier, and that the explored datasets are imbalanced, our
results evidence the quality of our solution, coupled with an efficient implementation.
Chapter 6
Conclusions and Future Work
In this chapter we summarize the research contributions of this dissertation and point out
some directions for further investigation.
6.1 A Quantitative Analysis of Temporal Effects on
ADC
In this work, we proposed a methodology, based on a series of full factorial designs, to
evaluate the impact of temporal effects on ADC algorithms when applied to distinct textual
datasets. First, we extended the characterization performed by Mourão et al. (2008),
providing evidence of the existence of three temporal effects in three textual datasets, namely
ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to quantify
the impact of the temporal aspects on the classification effectiveness of four well-known
ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.
Our characterization results show that, contrary to the assumption of a static data
distribution on which most ADC algorithms rely, each reference dataset has a specific
temporal behavior, exhibiting changes in the underlying data distribution over time. Such
temporal variations potentially limit classification performance. According to our results,
the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE
dataset, implying that the four explored ADC algorithms would be more impacted by the
temporal aspects in the first two datasets. In addition to these findings, our proposed
methodology enabled us to quantify the impact of each temporal aspect on the analyzed datasets
and algorithms, allowing us to answer the two following questions, posed in Chapter 4:
1. Which temporal effects have more influence in each dataset? In the ACM-DL dataset,
the impact of the observed temporal variations in the distribution of class sizes and in
the pairwise class similarities is statistically equivalent to the impact of the observed
variations in the term distribution for most classifiers (SVM being an exception).
MEDLINE and AG-NEWS, on the other hand, are clearly more impacted by the first two
temporal aspects. These findings reveal the challenges imposed by the temporal effects
and indicate that developing strategies to handle them in ADC algorithms is a promising
research direction.
2. What is the behavior of each ADC algorithm when faced with different levels of each
temporal aspect? All four explored ADC algorithms suffer a negative impact of the
temporal aspects in terms of classification effectiveness, with the most significant
impacts observed when these algorithms are applied to the most dynamic datasets
(i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to
the term distribution aspect, while still being impacted by the other two aspects. The
other three algorithms, on the other hand, are very sensitive to all three aspects. Thus,
the temporal dimension turns out to be an important aspect that has to be considered
when learning accurate classification models.
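The term-distribution aspect discussed above can be illustrated with a small sketch: compare a class's relative term frequencies at two points in time and take one minus their cosine similarity as a rough drift score. This is only an illustration of the kind of variation the characterization measures, not the factorial-design methodology itself; all documents and tokens below are invented.

```python
from collections import Counter
import math

def term_distribution(docs):
    """Relative term frequencies over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse term distributions."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical snapshots of one class at two points in time.
docs_t1 = [["neural", "network", "training"], ["network", "protocol", "routing"]]
docs_t2 = [["web", "search", "ranking"], ["search", "network", "ranking"]]

# Drift in [0, 1]: 0 means identical term usage, 1 means disjoint vocabularies.
drift = 1.0 - cosine(term_distribution(docs_t1), term_distribution(docs_t2))
```

A dataset like MEDLINE, whose vocabulary per class is comparatively stable, would yield low drift scores between consecutive periods, while ACM-DL and AG-NEWS would yield higher ones.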
6.2 Temporally-Aware Algorithms for ADC
Beyond quantifying the impact of the temporal effects on ADC algorithms, we proposed
strategies to minimize their impact in three well-known ADC algorithms, based on an instance weighting paradigm, to devise more accurate classification models. We started by
proposing a methodology to model a Temporal Weighting Function (TWF) that captures
changes in term-class relationships over a given period of time. For two of the three real
datasets explored, namely ACM-DL and MEDLINE, we showed that their TWFs follow
a lognormal distribution, whose parameters may easily be tuned using statistical methods.
On the other hand, the TWF associated with the AG-NEWS dataset does not follow a normal
distribution (even in the log-transformed space). Indeed, the straightforward tests for independence and normality of random variables failed, with 99% confidence, and other
(possibly more complex) tests should be performed. To guarantee the practical employment
of the temporally-aware classifiers, automated ways to determine the TWF are desirable. As
a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance. Thus, we also proposed a
fully-automated strategy to devise the TWF.
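As a rough illustration of the lognormal modeling, the sketch below treats a hypothetical table of raw TWF weights (one per temporal distance) as a normalized mass over distances, fits lognormal parameters by weighted moments of log(distance), and uses the fitted density as a smoothed TWF. This is a simplification for illustration only, not the exact estimation procedure of the dissertation; all weights are invented.

```python
import math

# Hypothetical observed TWF: temporal distance (e.g., in years) -> raw weight.
raw_twf = {1: 0.9, 2: 0.7, 3: 0.45, 4: 0.25, 5: 0.12, 6: 0.06}

# Treat the normalized weights as a probability mass over distances and
# fit lognormal parameters by weighted moments of log(distance).
total = sum(raw_twf.values())
probs = {d: w / total for d, w in raw_twf.items()}
mu = sum(p * math.log(d) for d, p in probs.items())
sigma = math.sqrt(sum(p * (math.log(d) - mu) ** 2 for d, p in probs.items()))

def twf(distance):
    """Smoothed TWF: lognormal density evaluated at a temporal distance > 0."""
    return (1.0 / (distance * sigma * math.sqrt(2 * math.pi))
            * math.exp(-(math.log(distance) - mu) ** 2 / (2 * sigma ** 2)))
```

With the decaying weights above, the fitted curve gives recent training documents (small distances) substantially more influence than old ones, which is the behavior the TWF is meant to encode.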
In order to incorporate the TWF into classifiers, we proposed three approaches: TWF
in documents, TWF in scores, and the extended TWF in scores. TWF in documents weights
each training document by the TWF according to its temporal distance to the test document.
TWF in scores, in contrast, takes into account scores produced by a traditional classifier
applied to a modified training set where the class c of each training document is mapped to
a derived class c ↦ 〈c, p〉, with p denoting the training document's creation point in time,
ultimately tying the observed patterns to both the class and temporal information.
A weighted sum of the learned scores is then performed, according to the TWF. Finally,
the extended TWF in scores partitions the training documents into sub-groups of documents
with the same creation point in time (and thus without temporal variability in the term-class
relationships), each including documents of all classes, learns a classification model
for each partition, and aggregates the generated scores using the TWF to weight them. These
strategies were incorporated into three traditional classifiers, namely Rocchio, KNN, and Naïve
Bayes.
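The "TWF in documents" idea can be sketched in a few lines: each training document votes for its class with a similarity score scaled by the TWF of its temporal distance to the test document. The TWF table, the toy term-overlap similarity, and all documents below are hypothetical stand-ins for the real TWF and classifier similarities.

```python
from collections import defaultdict

def twf(distance):
    """Hypothetical precomputed TWF: temporal distance (years) -> weight."""
    table = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.25, 4: 0.1}
    return table.get(abs(distance), 0.05)

def overlap(doc_a, doc_b):
    """Toy similarity: number of shared terms."""
    return len(set(doc_a) & set(doc_b))

def classify(train, test_doc, test_year):
    """'TWF in documents' sketch: every training document's vote is its
    similarity to the test document scaled by the TWF of their temporal
    distance; the class with the highest accumulated score wins."""
    scores = defaultdict(float)
    for doc, label, year in train:
        scores[label] += twf(test_year - year) * overlap(doc, test_doc)
    return max(scores, key=scores.get)

train = [
    (["markup", "hypertext"], "web", 1998),
    (["ranking", "search"], "web", 2004),
    (["protein", "sequence"], "bio", 2003),
]
label = classify(train, ["search", "ranking", "web"], 2005)
```

Note how the 1998 document, although in the right class, contributes almost nothing to the decision: temporally distant evidence is discounted before it can mislead the model.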
Results with the traditional versions of these classifiers and the temporally-aware ones
showed that considering temporal information significantly improves the results of the traditional classifiers. We also studied the impact of estimating the TWF and incorporating it into
the classifiers, both in terms of effectiveness and efficiency. Two important aspects were discussed. First, all three explored ADC algorithms provided an accurate TWF estimation.
Due to its efficiency and the similar results obtained when compared to the other classifiers, we
chose Rocchio to estimate the TWF. Second, sampling 10% of the training documents (on a
per-point-in-time basis) to learn the TWF provided the same gains in the temporally-aware
classifiers as using the whole training set. This further reduces the additional runtime cost
of the classification task. Also, both the temporally-aware KNN and Naïve Bayes
achieved more effective results than SVM, with better overall performance as well (i.e., considering the ACM-DL dataset, our best performing classifiers were up to 13 times faster than
SVM). Considering that SVM is a state-of-the-art classifier, and that both collections are
very unbalanced, our results evidence the quality of our solution, coupled with an efficient
implementation.
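The per-point-in-time sampling used to cheapen TWF estimation amounts to stratified sampling on the creation time: every time point contributes the same fraction of its documents, so the temporal profile of the sample mirrors the full training set. The data layout below is hypothetical.

```python
import random
from collections import defaultdict

def sample_per_time_point(train, fraction=0.10, seed=42):
    """Draw the same fraction of documents from every creation time point,
    preserving the temporal profile of the full training set."""
    by_time = defaultdict(list)
    for doc in train:
        by_time[doc["year"]].append(doc)
    rng = random.Random(seed)
    sample = []
    for year, docs in by_time.items():
        k = max(1, round(fraction * len(docs)))  # at least one doc per point
        sample.extend(rng.sample(docs, k))
    return sample

# Hypothetical training set: 300 documents spread evenly over three years.
train = [{"id": i, "year": 2000 + (i % 3)} for i in range(300)]
subset = sample_per_time_point(train)
```

The TWF is then estimated on `subset` only, roughly a tenth of the data, which is where the reported reduction in runtime overhead comes from.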
6.2.1 Limitations
The proposed temporally-aware algorithms have some limitations and, consequently, there
is room for further improvements. These include:
Data Imbalance: As discussed, the "in scores" versions of our classifiers are sensitive
to the data imbalance observed when considering each derived class 〈c, p〉. Indeed, class imbalance is considered a challenge by the Data Mining community (Yang and Wu, 2006). This is a rather common scenario that arises due to several
factors, such as incomplete sampling of labeled data due to crawling problems, ephemeral
events, the high costs involved in labeling data, and so on. Strategies to handle these
cases are promising avenues for improving the effectiveness of the "in scores" strategy.
Data Scarcity: Another major technical challenge faced by the Data Mining community
relates to the scarcity of training data. In fact, both the "in scores" and the "extended
in scores" versions of our temporally-aware classifiers have their performance limited
by this problem, since the number of documents assigned to some class c and created
at a given point in time p may not be sufficient to learn accurate estimates. Again, strategies
to tackle this problem are good candidates to improve the effectiveness of both versions
of the temporally-aware classifiers.
6.3 Future Work
As future work, we intend to incorporate temporal information into the SVM classifier by
defining kernel functions that use the proposed TWF. We also plan to refine the TWF, which
can be further improved in at least two ways. First, it can be defined on a finer-grained basis,
in order to account for the potentially distinct evolving behavior of terms (that is, the TWF
may be refined to account not only for the temporal distances between documents but
also for each term in isolation). Second, as discussed in Section 2.3, the temporal
unit used to determine the documents' timeliness is defined according to the domain to which
the temporally-aware classifiers are applied. This is done in a purely qualitative fashion. A
well-established way to define the temporal unit is thus highly desirable, and a promising strategy for doing so is Formal Concept Analysis (FCA) (Ganter and Wille, 1999). FCA is a
well-studied mathematical framework that is able to uncover implicit relationships between
objects and their attributes, ultimately deriving ontologies (Ganter et al., 2005; Wille, 2005).
This framework is widely used in concept classification and knowledge management. Considering our temporally-aware strategies, one can use FCA to automatically determine
temporal periods to be used as temporal units, instead of determining them purely qualitatively. With such a strategy, one can identify semantically meaningful groups
of documents that share some underlying data distribution, invariant over time, which
can thus be exploited to infer a proper temporal granularity in a fully-automated manner.
Another aspect that can be further improved relates to memory and time efficiency.
Nowadays, very large databases are becoming ever more common. Several organizations
have to maintain databases that grow without limit, at a surprisingly fast rate. Clearly,
the classification of such data streams brings challenging problems, such as
hard memory/time constraints. In fact, mining high-speed drifting data streams is a topic that
continuously receives attention from the Data Mining community. While our classifiers
are still able to provide high-quality classification with execution times much smaller than
the state-of-the-art SVM classifier, the assessment of the test document's creation point in time before
learning the classification model (that is, the lazy nature of our classifiers) may prevent the
applicability of the temporally-aware classifiers in such high-speed streaming scenarios. The
definition of non-lazy strategies for ADC that can take advantage of temporal information
in a memory/time-efficient way (e.g., by incrementally adjusting the classification model according to the observed variations in the underlying data distribution) is a promising research
direction.
Factors other than the documents' timeliness may also be exploited in the construction of more effective classification models. Indeed, we have already achieved some interesting results when exploiting the underlying citation and authorship networks extracted
from the ACM-DL dataset (de M. Palotti et al., 2010), and further investigation on this
matter may be valuable. For example, tying together the information gathered from these
networks with the documents' timeliness may be an interesting research direction.
Finally, in a classification framework, not only the learning step may be affected by
the temporal dynamics of data, but also some of the data pre-processing steps, such as feature selection and data sampling. For example, since several ADC algorithms are affected
by the class imbalance problem, where some classes are more representative than others,
it is a common strategy to pre-process the data in order to provide more balanced training
sets. The usual way to balance the class distribution in training data consists of oversampling the smaller classes or undersampling the larger ones. However, to the best of our
knowledge, none of the already proposed strategies for data balancing handles the temporal
dimension. Thus, we plan to further study the impact of the temporal dynamics on class
balancing strategies. Furthermore, we consider that strategies for feature selection may be
improved by considering the evolving behavior of terms (for example, considering not only
the predictive power of terms, but also their temporal stability). This may reveal effective
approaches to further improve such data processing strategies and, ultimately, lead to more
accurate classification models.
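As a reference point for this future direction, the sketch below shows the usual temporally-unaware undersampling baseline: every class is cut down to the size of the smallest one, with no regard for when documents were created, which is precisely the gap a temporally-aware balancing strategy would address. Class names and sizes are invented.

```python
import random
from collections import defaultdict

def undersample(train, seed=7):
    """Baseline class balancing: undersample every class to the size of the
    smallest one. The draw ignores creation time entirely, so the balanced
    set may over-represent some time periods within a class."""
    by_class = defaultdict(list)
    for doc, label in train:
        by_class[label].append((doc, label))
    floor = min(len(docs) for docs in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for docs in by_class.values():
        balanced.extend(rng.sample(docs, floor))
    return balanced

# Hypothetical skewed training set: 90 "big" documents vs. 10 "small" ones.
train = ([("d%d" % i, "big") for i in range(90)]
         + [("d%d" % i, "small") for i in range(10)])
balanced = undersample(train)
```

A temporally-aware variant would additionally stratify the draw by creation time within each class, so that balancing does not silently distort the temporal profile the TWF relies on.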
Bibliography
Alonso, O., Gertz, M., and Baeza-Yates, R. (2007). On the value of temporal information in information retrieval. SIGIR Forum, 41(2):35–41.
Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology Behind Search. Addison-Wesley, Boston, MA.
Bifet, A. and Gavaldà, R. (2006). Kalman filters and adaptive windows for learning in data streams. In Discovery Science, pages 29–40, Barcelona, Spain.
Bifet, A. and Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM International Conference on Data Mining, pages 443–448, Minneapolis, USA.
Breiman, L. and Spector, P. (1992). Submodel Selection and Evaluation in Regression - the X-Random Case. International Statistical Review, 60(3):291–319.
Caldwell, N. H. M., Clarkson, P. J., Rodgers, P. A., and Huxor, A. P. (2000). Web-based knowledge management for distributed design. IEEE Intelligent Systems, 15(3):40–47.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, E., Lin, Y., Xiong, H., Luo, Q., and Ma, H. (2011). Exploiting probabilistic topic
models to improve text categorization under class imbalance. Information Processing &
Management, 47(2):202–214.
Clarkson, D. B., Fan, Y.-a., and Joe, H. (1993). A remark on algorithm 643: FEXACT: an algorithm for performing Fisher's exact test in r × c contingency tables. ACM Transactions on Mathematical Software, 19(4):484–488.
Cohen, W. W. and Singer, Y. (1999). Context-sensitive learning methods for text categoriza-
tion. ACM Transactions on Information Systems, 17(2):141–173.
Crow, E. L. and Shimizu, K., editors (1988). Lognormal Distributions: Theory and Applications. Dekker, New York, NY.
D'Agostino, R. B. and Pearson, E. S. (1973). Tests for departure from normality. Biometrika, 60:613–622.
de Lima, E. B., Pappa, G. L., de Almeida, J. M., Gonçalves, M. A., and Meira Jr., W. (2010). Tuning genetic programming parameters with factorial designs. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 1–8, Barcelona, Spain.
de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. (2010). Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454.
Dries, A. and Rückert, U. (2009). Adaptive concept drift detection. Statistical Analysis and Data Mining, 2(5-6):311–327.
Drummond, C. (2006). Discriminative vs. generative classifiers for cost sensitive learning.
In Canadian Conference on AI, pages 479–490, Québec, Canada.
Fdez-Riverola, F., Iglesias, E., Díaz, F., Méndez, J., and Corchado, J. (2007). Applying
lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with
Applications, 33(1):36–48.
Folino, G., Pizzuti, C., and Spezzano, G. (2007). An adaptive distributed ensemble approach to mine concept-drifting data streams. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pages 183–188, Patras, Greece.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classi-
fication. Journal of Machine Learning Research, 3:1289–1305.
Forman, G. (2006). Tackling concept drift by temporal inductive transfer. In Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 252–259, Washington, USA.
Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial Intelligence, pages 286–295, São Luís, Brazil.
Ganter, B., Stumme, G., and Wille, R., editors (2005). Formal Concept Analysis, Foundations and Applications, volume 3626 of Lecture Notes in Computer Science. Springer.
Ganter, B. and Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, Heidelberg.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning. Springer, New York, NY.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. Wiley-Interscience, New York, NY.
Jain, R. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley, New York, NY.
Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press.
Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, Philadelphia, USA.
Kelly, M. G., Hand, D. J., and Adams, N. M. (1999). The impact of changing populations on classifier performance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–371, San Diego, USA.
Kim, Y. S., Park, S. S., Deards, E., and Kang, B. H. (2004). Adaptive web document classification with MCRDR. In Proceedings of the International Conference on Information Technology: Coding and Computing, pages 476–480, Las Vegas, USA.
Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weight-
ing. Intelligent Data Analysis, 8(3):281–300.
Klinkenberg, R. and Joachims, T. (2000). Detecting concept drift with support vector machines. In Proceedings of the International Conference on Machine Learning, pages 487–494, Stanford, USA.
Klinkenberg, R. and Rüping, S. (2003). Concept drift and the importance of examples. In Text Mining - Theoretical Aspects and Applications, pages 55–78. Physica-Verlag.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1137–1143, Québec, Canada.
Kolter, J. Z. and Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble
method for tracking concept drift. Technical report, Department of Computer Science,
Georgetown University, Washington, USA.
Koren, Y. (2010). Collaborative filtering with temporal dynamics. Communications of the
ACM, 53:89–97.
Koychev, I. (2000). Gradual forgetting for adaptation to concept drift. In Proceedings of the ECAI Workshop on Current Issues in Spatio-Temporal Reasoning, pages 101–106, Berlin, Germany.
Kuncheva, L. I. and Žliobaite, I. (2009). On the window size for classification in changing environments. Intelligent Data Analysis, 13(6):861–872.
Lawrence, S. and Giles, C. L. (1998). Context and page analysis for improved web search.
IEEE Internet Computing, 2(4):38–46.
Lazarescu, M. M., Venkatesh, S., and Bui, H. H. (2004). Using multiple windows to track concept drift. Intelligent Data Analysis, 8(1):29–59.
Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341–352.
Lin, Z., Hao, Z., Yang, X., and Liu, X. (2009). Several SVM ensemble methods integrated with under-sampling for imbalanced data learning. In Proceedings of the International Conference on Advanced Data Mining and Applications, pages 536–544, Beijing, China.
Liu, A., Ghosh, J., and Martin, C. (2007). Generative oversampling for mining imbalanced datasets. In Proceedings of the International Conference on Data Mining, pages 66–72, Las Vegas, USA.
Liu, R.-L. and Lu, Y.-L. (2002). Incremental context mining for adaptive document classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 599–604, Edmonton, Canada.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY.
Miao, Y.-Q. and Kamel, M. (2011). Pairwise optimized Rocchio algorithm for text categorization. Pattern Recognition Letters, 32(2):375–382.
Mourão, F., Rocha, L., Araújo, R., Couto, T., Gonçalves, M., and Meira Jr., W. (2008). Understanding temporal aspects in document classification. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 159–170, Palo Alto, USA.
Nishida, K. and Yamauchi, K. (2007). Detecting concept drift using statistical testing. In Proceedings of the International Conference on Discovery Science, pages 264–269, Sendai, Japan.
Nishida, K. and Yamauchi, K. (2009). Learning, detecting, understanding, and predicting concept changes. In Proceedings of the International Joint Conference on Neural Networks, pages 283–290, Atlanta, USA.
Orair, G. H., Teixeira, C., Wang, Y., Meira Jr., W., and Parthasarathy, S. (2010). Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment, 3(2):1469–1480.
Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
Rocha, L., Mourão, F., Pereira, A., Gonçalves, M. A., and Meira Jr., W. (2008). Exploiting temporal contexts in text classification. In Proceedings of the International Conference on Information and Knowledge Engineering, pages 243–252, Napa Valley, USA.
Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. (2010a). Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212.
Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. (2010b). Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 307–314, Geneva, Switzerland.
Scholz, M. and Klinkenberg, R. (2007). Boosting classifiers for drifting concepts. Intelligent Data Analysis, 11(1):3–28.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Sun, A., Lim, E.-P., and Liu, Y. (2009). On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1):191–201.
Tan, S. (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4):667–671.
Tsymbal, A. (2004). The problem of concept drift: Definitions and related work. Technical
report, Department of Computer Science, Trinity College, Dublin, Ireland.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience, New York, NY.
Vaz de Melo, P. O., da Cunha, F. D., Almeida, J. M., Loureiro, A. A., and Mini, R. A. (2008). The problem of cooperation among different wireless sensor networks. In Proceedings of the International Symposium on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pages 86–91, Vancouver, Canada.
Žliobaite, I. (2009). Combining time and space similarity for small size learning under concept drift. In Proceedings of the International Symposium on Foundations of Intelligent Systems, pages 412–421, Prague, Czech Republic.
Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, USA.
Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101.
Wille, R. (2005). Formal concept analysis as mathematical theory of concepts and concept hierarchies. In Formal Concept Analysis, pages 1–33.
Yang, C. and Zhou, J. (2008). Non-stationary data sequence classification using online class priors estimation. Pattern Recognition, 41(8):2656–2664.
Yang, Q. and Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4):597–604.
Zhang, Z. and Zhou, J. (2010). Transfer estimation of evolving class priors in data stream classification. Pattern Recognition, 43(9):3151–3161.