THIAGO CUNHA DE MOURA SALLES
AUTOMATIC DOCUMENT CLASSIFICATION
TEMPORALLY ROBUST
Dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

ADVISOR: MARCOS ANDRÉ GONÇALVES

CO-ADVISOR: LEONARDO CHAVES DUTRA DA ROCHA
Belo Horizonte
March 2011
© 2011, Thiago Cunha de Moura Salles. All rights reserved.

Salles, Thiago Cunha de Moura.
S168c  Classificação Automática de Documentos Temporalmente Robusta / Thiago Cunha de Moura Salles. — Belo Horizonte, 2011.
xxxvi, 106 f. : il. ; 29cm

Master's dissertation — Universidade Federal de Minas Gerais, Department of Computer Science.
Advisor: Marcos André Gonçalves. Co-advisor: Leonardo Chaves Dutra da Rocha.

1. Computing – Theses. 2. Information retrieval – Theses. I. Advisor. II. Title.

CDU 519.6*73 (043)
“In times of change, learners inherit the Earth,
while the learned find themselves beautifully equipped
to deal with a world that no longer exists.”
(Eric Hoffer)
Abstract

Automatic Document Classification (ADC) continues to be a relevant research topic in the machine learning and information retrieval communities, and several ADC algorithms have been proposed. However, the majority of ADC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual datasets. We provide evidence of the existence of three main temporal effects in three textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known ADC algorithms. We show that these temporal effects affect each analyzed dataset differently, and that they restrict the performance of each considered ADC algorithm to different extents. The reported quantitative analyses provide valuable insights to better understand the behavior of ADC algorithms when faced with non-static (temporal) data distributions, and highlight important requirements for the proposal of more accurate classification models. Based on the performed analyses, in order to minimize the impact of temporal effects on ADC algorithms, we introduce a temporal weighting function (TWF) which reflects the varying nature of textual datasets, and propose a methodology to determine its expression and parameters. We applied this methodology to three textual datasets and then proposed two strategies to extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF, which we call temporally-aware classifiers. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms in almost all cases.
Extended Summary

Introduction

Automatic Document Classification (ADC) is a research topic of great relevance in the Machine Learning and Information Retrieval communities. Indeed, developing effective and efficient ADC algorithms has proven increasingly important, given the growing complexity and scale of current application scenarios, such as the Web. The ADC task consists of learning models that associate documents with semantically cohesive classes, based on a set of previously labeled documents. Such models are key components to support and improve a variety of tasks, such as the design of topic directories, the identification of writing styles, the organization of digital libraries, and helping users interact better with search engines, among others.
The Problem

To better understand the problem studied in this work, we briefly present the ADC task under the supervised paradigm. The main goal of ADC is to predict the (unknown) class of a new document, based on a set of previously labeled documents (Sebastiani, 2002). Let d_i = (x_i, c_i) be a document whose bag-of-words vector representation is given by x_i and whose class c_i ∈ C is a categorical attribute drawn from a finite set C of classes. The goal of ADC can thus be defined as learning a discrete approximation of the posterior class distribution P(c_i | d_i), which reflects the predictive relationship between documents and classes. This learning is carried out based on the set of previously labeled documents (the training set).
The approximation of P(c_i | d_i) can be obtained either by direct estimation or by indirect estimation (through Bayes' rule). The first strategy defines the so-called discriminative classifiers, characterized by learning the inter-class boundaries so as to minimize the error rate (or some related metric), literally discriminating between the classes without making any assumption about the probability density function of each class. The second strategy, in turn, defines the so-called generative classifiers, which rely on estimating both the class-conditional probability P(d_i | c_i) and the class prior probability P(c_i) in order to estimate the desired posterior probability. In this case, a model is assumed for both the densities P(d_i | c_i) and the priors P(c_i), with the model parameters estimated from the training set. The posterior probability is then obtained by applying Bayes' rule:

    P(c_i | d_i) = P(c_i) · P(d_i | c_i) / Σ_{c' ∈ C} P(c') · P(d_i | c'),    (1)

where P(c_i) and P(d_i | c_i) denote the class prior and class-conditional probabilities, respectively.
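As a concrete illustration of Equation 1, the sketch below computes the posterior P(c|d) from given priors and class-conditional likelihoods. All numeric values are toy figures for illustration, not estimates from any of the collections studied here.

```python
# Sketch of Equation 1: posterior class probabilities via Bayes' rule.
# The priors P(c) and class-conditional likelihoods P(d|c) below are toy
# values for illustration only, not estimates from any dataset.

def posterior(priors, likelihoods):
    """Return P(c|d) for every class c, given P(c) and P(d|c)."""
    # Numerator of Equation 1, per class: P(c) * P(d|c).
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Denominator: sum over all classes c' of P(c') * P(d|c').
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

priors = {"A": 0.6, "B": 0.4}         # P(c)
likelihoods = {"A": 0.02, "B": 0.08}  # P(d|c) for one fixed document d

post = posterior(priors, likelihoods)
print(post)
```

Note that the class with the larger prior ("A") does not necessarily win the posterior: the class-conditional likelihood can dominate, which is exactly why drifting estimates of either quantity hurt both generative and discriminative classifiers.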
The basic assumption adopted by the vast majority of ADC algorithms is that the training data used to build a classification model are random samples drawn from a stationary data distribution. However, this may not be the case. Indeed, in several (perhaps most) real classification problems, the training data may not come from the same distribution that governs the data to be classified, due to their temporal dynamics. For example, spam filtering and recommender systems are naturally confronted with inherently dynamic data. Thus, the success of classification algorithms may be compromised when faced with non-static data.

As analyzed by Kelly et al. (1999), the variations observed in data distributions are reflected in at least three aspects:

• Variations in the prior probabilities P(c_i);
• Variations in the posterior probabilities P(c_i | d_i);
• Variations in the class-conditional probabilities P(d_i | c_i).

Note that, according to Equation 1, since P(c_i | d_i) depends on P(d_i | c_i), both discriminative and generative classifiers that assume a stationary data distribution may have their effectiveness limited when applied to non-stationary data distributions.
In this work, we are particularly interested in the impact that the temporal dynamics observed in textual data have on ADC algorithms. Due to the dynamics of knowledge, and even of languages, the characteristics of textual collections may vary over time. Indeed, as analyzed by Mourão et al. (2008), three temporal effects, which can ultimately be seen as manifestations of the three aspects listed above, proved significant in two real textual collections. The first effect, CD ("Class Distribution variation"), refers to variations in the class distribution over time (that is, the relative class frequencies do not remain static). The second effect, TD ("Term Distribution variation"), refers to variations observed over time in the term distribution, reflected by variations in the representativeness of terms with respect to the classes in which they occur. Finally, the third effect, CS ("Class Similarity variation"), refers to variations in the pairwise class similarities as time goes by. Indeed, two classes may be similar (or dissimilar) to each other at a given moment, and this similarity may decrease (or increase) over time. Furthermore, in (Mourão et al., 2008) the authors showed that this temporal evolution is a challenge for learning algorithms, which may have their effectiveness limited if this aspect is neglected.
In this work, we advance knowledge in the area through the quantification and minimization of the impact of temporal effects on ADC algorithms. By carrying out a series of full factorial designs, we quantify the extent of the temporal effects in different textual collections, as well as their impact on four traditional ADC algorithms. Based on the knowledge obtained from this deeper characterization, we developed strategies to minimize the impact of such effects on three algorithms, achieving results competitive with the state of the art in automatic document classification at a lower computational cost.
Quantitative Analysis of Temporal Effects in
Automatic Document Classification
In order to quantify the impact of temporal effects on ADC algorithms, we first revisit the characterization reported in (Mourão et al., 2008), in which the authors present evidence of the existence of the three temporal effects discussed above in two real textual collections: ACM-DL and MEDLINE. The former comprises 24,897 documents from the ACM Digital Library, distributed over 11 disjoint classes and created between 1980 and 2002. The latter comprises 861,454 documents classified into 7 Medicine-related classes, created between 1970 and 1985. We also include a third collection, from the news domain, in order to provide evidence of the existence of the temporal effects in it. This is AG-NEWS, a collection of 835,795 documents, distributed over 11 disjoint classes and created within an interval of 573 days. It is potentially a more dynamic collection than the others.

Indeed, when characterizing this collection with respect to the temporal effects, following the methodology proposed in (Mourão et al., 2008), it became clear that AG-NEWS is affected by all three temporal effects. As an example, Figure 1 shows the relative class distribution observed over time (using a weekly time unit). Clearly, the class distribution varies. Further details on this characterization can be found in the full text of the dissertation.
Figure 1: Class Distribution Variation—AG-NEWS.
Full Factorial Design

Once the existence of the temporal effects in the three adopted textual collections had been established, we moved on to a deeper characterization, quantifying how they affect the collections and the effectiveness of four ADC algorithms widely used by the Machine Learning community, namely Rocchio, k-Nearest Neighbors (kNN), Naïve Bayes and Support Vector Machine (SVM).

Given k factors, each of which may take n levels (possible values), and a response variable, an n^k r factorial design seeks to quantify the impact of each factor (as well as the interactions among them) on the response variable, through r experimental replications. In our case, we aim to quantify the impact of the temporal effects (factors), and their interactions, on the effectiveness of ADC algorithms (response variable). We consider two possible levels: a low level and a high level, referring to a low and a high influence of the temporal effects, respectively.
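The effect estimation behind such a design can be sketched as follows for the 2^2 case with r = 3 replications, using the standard sign-table method. The factor encodings and response values below are invented for illustration; they are not results from this dissertation.

```python
# Sketch of a 2^2 r full factorial design: estimate the main effects of
# two factors (e.g., two temporal effects at low/high levels) and their
# interaction on a response variable (e.g., classification effectiveness).
# All response values are invented for illustration.

from itertools import product
from statistics import mean

# responses[(a, b)] = r replicated measurements, a, b in {-1, +1}
# (-1 = low level, +1 = high level of each factor).
responses = {
    (-1, -1): [80.1, 79.8, 80.3],
    (+1, -1): [74.9, 75.2, 75.0],
    (-1, +1): [77.8, 78.1, 78.0],
    (+1, +1): [70.2, 69.9, 70.1],
}

means = {cell: mean(vals) for cell, vals in responses.items()}

# Sign-table estimates: each effect is the average response at its high
# level minus the average response at its low level.
effect_a = mean(means[(+1, b)] for b in (-1, +1)) - mean(means[(-1, b)] for b in (-1, +1))
effect_b = mean(means[(a, +1)] for a in (-1, +1)) - mean(means[(a, -1)] for a in (-1, +1))
# Interaction effect: cells where the factor signs agree vs. disagree.
effect_ab = mean(means[(a, b)] for a, b in product((-1, +1), repeat=2) if a * b == +1) \
          - mean(means[(a, b)] for a, b in product((-1, +1), repeat=2) if a * b == -1)

print(effect_a, effect_b, effect_ab)
```

In this toy setup, both main effects are negative (effectiveness drops when either factor is at its high level), with factor A hurting more than factor B, which is the kind of reading the factorial designs in this work extract for each temporal effect.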
A first issue to be addressed in order to carry out the factorial design is to isolate the levels of each factor. This is accomplished by partitioning the documents of the collection under study into groups exhibiting low and high levels of influence of the temporal effects. To this end, we propose some mechanisms to perform this isolation, as described next:
Class Distribution (CD): We measure the variation of the distribution of each class c over time through the Coefficient of Variation (CV_c = σ_c/µ_c) of the relative proportion of c at each time point. To do so, we compute the proportion P_{c,p} of documents of class c at each time point p and obtain both the mean µ_c and the standard deviation σ_c of these values. We thus associate with each class c its respective Coefficient of Variation CV_c. We then define a threshold δ_CD such that documents belonging to classes whose CV is below δ_CD are assigned to the low level (group CD↓) and the remaining ones are assigned to the high level (group CD↑).
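The CD level isolation just described can be sketched as follows; the class-proportion series and the threshold δ_CD below are hypothetical values, not figures from the studied collections.

```python
# Sketch of the CD level isolation: for each class c, compute the
# Coefficient of Variation CV_c = sigma_c / mu_c of its relative
# proportion over time, and split the classes at a threshold delta_cd.
# The proportions and the threshold are invented for illustration.

from statistics import mean, pstdev

# proportions[c] = relative frequency P_{c,p} of class c at each time point p.
proportions = {
    "stable_class":   [0.30, 0.31, 0.29, 0.30],
    "volatile_class": [0.10, 0.35, 0.05, 0.50],
}

cv = {c: pstdev(series) / mean(series) for c, series in proportions.items()}

delta_cd = 0.2  # hypothetical threshold
cd_low = {c for c, v in cv.items() if v < delta_cd}    # group CD (low level)
cd_high = {c for c, v in cv.items() if v >= delta_cd}  # group CD (high level)
print(cv, cd_low, cd_high)
```

Normalizing the standard deviation by the mean is what lets classes of very different sizes be compared on the same variability scale.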
Term Distribution (TD): In order to isolate the low and high levels of this temporal effect, we propose a metric called the "Document Stability Level" (DSL). The DSL of a document d denotes the density of stable terms (that is, terms with low variation in their representativeness with respect to the classes) composing d. We define a threshold δ_TD to isolate the two levels: documents whose DSL is below δ_TD are assigned to the low level (group TD↓) and the remaining ones to the high level (group TD↑).
Class Similarity (CS): To isolate the levels associated with this effect, we consider the variations observed over time in the pairwise class similarities. Consider the pair of classes ⟨c_i, c_j⟩, with i ≠ j. For each time point p, we define V_{i,p} and V_{j,p} as the vocabularies of classes c_i and c_j observed at p, respectively, composed of the k most representative terms for those classes at that time point, according to the Information Gain metric. We compute the cosine similarity between both vocabularies and measure the variability observed over time through the Coefficient of Variation. Thus, for each class c_i, we measure the variability observed in its similarity with the other classes c_j ≠ c_i. As before, we define a threshold δ_CS in order to separate the documents associated with the classes with lower variability (group CS↓) from those associated with the classes with higher variability (group CS↑).
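The CS characterization can be sketched as below: at each time point we take the top-k terms of each class (here ranked by made-up weights standing in for actual Information Gain scores) and track the cosine similarity of the two vocabulary vectors over time, summarizing its variability with the Coefficient of Variation.

```python
# Sketch of the CS characterization: cosine similarity between the
# vocabularies of two classes at each time point, then the Coefficient
# of Variation of those similarities over time. The term weights are
# invented stand-ins for Information Gain scores.

from math import sqrt
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# vocab[p] = (top-k terms of class c_i at p, top-k terms of class c_j at p)
vocab = {
    0: ({"net": 0.9, "web": 0.8}, {"net": 0.7, "java": 0.6}),
    1: ({"net": 0.9, "grid": 0.5}, {"ruby": 0.8, "java": 0.6}),
}

sims = [cosine(vi, vj) for vi, vj in vocab.values()]
cv_sim = pstdev(sims) / mean(sims)  # variability of the pairwise similarity
print(sims, cv_sim)
```

Here the two classes share vocabulary at the first time point but none at the second, so the similarity series is highly variable, the situation that would push the pair toward the CS high-level group.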
Having isolated the levels of each factor, we observed a high correlation between the temporal effects CD and CS. This correlation makes it infeasible to conduct a 2^3 r factorial design (that is, one with the three temporal effects considered simultaneously). We therefore adopted a pairwise experimentation strategy, separately evaluating the impact of the CD and TD effects (CD×TD factorial design) and the impact of the CS and TD effects (CS×TD factorial design) on the ADC algorithms. For each combination of adopted ADC algorithm and collection, we ran the pair of factorial designs CD×TD and CS×TD.
Main Results

The factorial designs described above revealed a series of pertinent insights into the behavior of the collections from a temporal perspective, as well as into the behavior of ADC algorithms when applied to collections whose characteristics vary over time.

First, we show that the temporal effects are more prominent in the ACM-DL and AG-NEWS collections than in MEDLINE. More specifically, with 99% confidence, we obtained the following partial orderings:

CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS,
CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS,
TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.
Second, considering the ACM-DL collection, the impact of the CD and CS effects proved statistically equivalent to the impact of the TD effect, whereas for the MEDLINE and AG-NEWS collections both CD and CS proved more prominent than the TD effect.
Moreover, all four analyzed ADC algorithms were negatively impacted by the temporal effects in terms of classification effectiveness. Indeed, the largest degradations in effectiveness were observed when the algorithms were applied to the most dynamic collections (ACM-DL and AG-NEWS). Considering the algorithms individually, the quantitative analysis gave us a better understanding of the strengths and weaknesses of the classifiers with respect to the three temporal effects studied. For example, the SVM classifier proved more robust to the TD effect, while being markedly impacted by the other effects. This behavior can be explained by the very characteristics of the classifier, as discussed in the dissertation. We also show that the other three classifiers under study are quite sensitive to all three temporal effects. Table 1 presents the partial ordering of the algorithms, for each adopted dataset, with respect to the impact of the observed temporal effects. The reported relationships highlight the fact that, besides ADC algorithms being negatively affected by the temporal effects, the observed degradation is peculiar to each algorithm and to each dataset.
Temporal Effect   ACM-DL                 MEDLINE                AG-NEWS
CD                SVM > NB ∼ KNN ∼ RO    RO > SVM > NB > KNN    RO ∼ KNN > SVM ∼ NB
CS                SVM > KNN ∼ RO > NB    RO > SVM ∼ NB > KNN    RO ∼ KNN ∼ NB > SVM
TD                SVM ∼ KNN ∼ RO ∼ NB    SVM > RO ∼ NB ∼ KNN    RO > NB > KNN > SVM

Table 1: A Comparative Study of the Impact of the Temporal Effects on Each ADC Algorithm—Rocchio (RO), SVM, Naïve Bayes (NB) and KNN.
The results obtained from this analysis therefore corroborate our argument that the temporal dimension is a highly important aspect which, despite the intrinsic challenges associated with temporal dynamics, must be properly taken into account in the development of accurate classification models.
Temporally Robust Automatic Document Classification

Based on the lessons learned from the temporal characterization described above, we propose some strategies to minimize the impact of the temporal effects on ADC algorithms when applied to data drawn from distributions that vary over time. These strategies rely on the use of what we call the Temporal Weighting Function (TWF). We first propose a methodology, based on a series of statistical tests, to determine the expression and the parameters of the TWF, so as to best describe the underlying evolutionary process that governs the data variation. We instantiated this methodology on the three textual collections described above. We then found that the TWFs associated with the ACM-DL and MEDLINE collections follow a lognormal distribution, with 99% confidence. However, the same tests failed for AG-NEWS. Therefore, the TWF associated with the AG-NEWS collection follows a different distribution, and other tests (potentially more complex, which may preclude their use by those lacking the required statistical skills) become necessary. In fact, for temporally robust classification, only the positive real values associated with the temporal distances are needed. Thus, to make these classifiers applicable in cases where the tests required to determine the TWF are more complex (or even unknown), we offer an automatic strategy to determine this function, without the need for any statistical test.
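Assuming, as found for ACM-DL and MEDLINE, that a lognormal fits, the statistical determination of the TWF can be sketched with nothing more than the log-transformed temporal distances: the maximum-likelihood µ and σ of a lognormal are the mean and standard deviation of the logs. The sample of temporal distances below is invented for illustration.

```python
# Sketch: fit a lognormal-shaped TWF to observed temporal distances by
# maximum likelihood. For a lognormal, the MLE of mu and sigma are the
# mean and standard deviation of the log-transformed data. The sample
# of temporal distances below is invented for illustration.

from math import exp, log, pi, sqrt
from statistics import mean, pstdev

distances = [1, 1, 2, 2, 3, 5, 8, 13]  # hypothetical temporal distances (> 0)

logs = [log(d) for d in distances]
mu, sigma = mean(logs), pstdev(logs)

def twf(x):
    """Lognormal density used as the temporal weighting function."""
    return exp(-((log(x) - mu) ** 2) / (2 * sigma ** 2)) / (x * sigma * sqrt(2 * pi))

weights = [twf(d) for d in distances]
print(mu, sigma, weights)
```

The resulting function assigns positive weights that decay for temporal distances far from the bulk of the fitted distribution, which is the behavior the temporally robust classifiers exploit.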
Once the TWF has been defined, mechanisms must be provided to incorporate it into the classification framework. We then propose three strategies to do so:
TWF applied to Documents: This strategy consists of weighting each training document by the TWF, according to the temporal distance between it and the document to be classified. In this way, training documents from time points at which the data distribution diverges from the one observed at the creation time of the document to be classified have their influence on the decision rule minimized. Figure 2 presents a schematic description of this strategy.

Figure 2: TWF Applied to Documents.
TWF applied to Scores: In this case, we take the scores produced by a traditional classifier over a training set in which the class c of each document is transformed into the derived class ⟨c, p⟩ (where p denotes the time point at which the document was created), tying the observed patterns not only to the classes but also to the time point at which they were observed. The scores obtained by the traditional classifier for each ⟨c, p⟩ are then aggregated through a weighted sum, where the weights are given by the TWF. Figure 3 presents a graphical description of this strategy.

Figure 3: TWF Applied to Scores.
TWF applied to Scores (Extended Version): This strategy partitions the training documents into subgroups composed of documents created at the same time point (hence, with no temporal variation). Traditional classifiers are then applied to each document partition, in order to classify the test document based on these several training sets. The scores for class c, obtained for each data partition, are aggregated through a weighted sum, with the weights given by the TWF. A schematic representation of this strategy is shown in Figure 4.

Figure 4: TWF Applied to Scores (Extended Version).
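The first strategy ("TWF applied to Documents") can be sketched as a weighted-vote decision rule in which each training document's contribution is scaled by the TWF of its temporal distance to the test document. The neighbors, similarity values and TWF shape below are all hypothetical, not the dissertation's actual function or data.

```python
# Sketch of "TWF applied to Documents": a kNN-style decision rule in
# which each training document's vote is weighted by the TWF of its
# temporal distance to the test document. Neighbors, similarities and
# the TWF shape are invented for illustration.

from collections import defaultdict

def twf(delta):
    """Hypothetical TWF: influence decays with temporal distance."""
    return 1.0 / (1.0 + delta)

# Each neighbor: (class, similarity to the test document, creation time).
neighbors = [("sports", 0.9, 2000), ("sports", 0.8, 1990), ("tech", 2001, 2001)]
neighbors = [("sports", 0.9, 2000), ("sports", 0.8, 1990), ("tech", 0.7, 2001)]
test_time = 2001

scores = defaultdict(float)
for cls, sim, created in neighbors:
    # Each training document's contribution is down-weighted by how
    # temporally distant it is from the document being classified.
    scores[cls] += sim * twf(abs(test_time - created))

predicted = max(scores, key=scores.get)
print(dict(scores), predicted)
```

Without the temporal weighting, the two "sports" neighbors would win; with it, the single contemporaneous "tech" neighbor prevails, illustrating how the decision rule discounts training documents from divergent time points.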
The three strategies described were implemented using three classifiers, namely Rocchio, KNN and Naïve Bayes.
Main Results

We experimentally evaluated the effectiveness of the proposed classifiers. To this end, we adopted a 10-fold cross-validation strategy and statistically validated the results using a two-tailed t-test, with 99% confidence.

The temporally robust classifiers achieved statistically significant improvements over the traditional approaches in most cases. As an example, consider Table 2. As we can observe, all the temporally robust versions of Rocchio and KNN obtained results statistically superior to those of their traditional versions (in terms of MacroF1 and MicroF1). We also observed that the temporal version of Naïve Bayes based on applying the TWF to scores incurred significant losses. We attribute this problem to the class imbalance artificially amplified by this strategy, as well as to the reduced number of training documents associated with the ⟨c, p⟩ classes available for producing accurate estimates of the data distribution. The extended strategy of applying the TWF to scores seeks to mitigate the imbalance problem (although it remains penalized by data scarcity).
Algorithm              Rocchio               KNN                   Naïve Bayes
Metric                 macF1(%)  micF1(%)    macF1(%)  micF1(%)    macF1(%)   micF1(%)
Traditional            57.39     68.24       58.48     71.84       57.27      73.24
TWF on documents       60.02     70.64       59.92     73.84       60.78      74.11
                       (+4.58)▲  (+3.52)▲    (+2.46)▲  (+2.78)▲    (+6.13)▲   (+1.19)•
TWF on scores          59.85     72.47       62.02     74.45       44.85      63.93
                       (+4.29)▲  (+6.20)▲    (+6.05)▲  (+3.63)▲    (-27.69)▼  (-14.56)▼
TWF on scores (ext.)   59.27     71.39       59.78     73.85       56.23      72.35
                       (+3.28)▲  (+4.62)▲    (+2.22)▲  (+2.80)▲    (-1.84)•   (+1.23)•

Table 2: Results Obtained by Incorporating the Statistically Defined TWF into Rocchio, KNN and Naïve Bayes—ACM-DL (▲/▼: statistically significant gain/loss over the traditional version; •: statistical tie).
We also evaluated the use of the automated strategy for determining the TWF. For illustration purposes, Table 3 reports the results on the ACM-DL collection obtained by the temporally robust classifiers using the TWF determined by this strategy. Indeed, the automatic TWF determination procedure proved effective: its use yielded results statistically equivalent to those obtained with the statistically determined TWF, as can be observed by contrasting Tables 2 and 3. We also compared the effectiveness of this strategy when using the whole training set or only 10% of it to determine the TWF (rows "100% of D" and "10% of D", respectively). As we can observe, with only 10% of the training set it is possible to determine the TWF accurately and obtain results statistically equivalent to those obtained using the whole training set. Clearly, determining the TWF with a reduced training sample leads to a drastic reduction in execution time. For example, determining the TWF using Rocchio takes 4.49 ± 0.04 seconds when using the whole training set, whereas with 10% of it the execution time drops to only 0.77 ± 0.02 seconds, a negligible value compared to the time spent on the classification task itself.
Finally, we compared our best temporal classifiers with the state-of-the-art SVM in terms of effectiveness and efficiency. As can be observed in Table 4, our best classifiers showed effectiveness statistically equivalent (or even superior) to that of SVM, with a much lower execution time (given by the time spent on both training and testing), even considering the fact that the temporal classifiers incur an overhead from taking the temporal aspect into account and are naturally lazy classifiers. Clearly, this attests to the quality of the proposed solutions.
Algorithm                         Rocchio               KNN                   Naïve Bayes
Metric                            macF1(%)  micF1(%)    macF1(%)  micF1(%)    macF1(%)   micF1(%)
Traditional                       57.39     68.24       58.48     71.84       57.27      73.24
TWF (100% of D) on documents      60.21     70.70       60.08     73.88       61.38      74.60
                                  (+4.91)▲  (+3.60)▲    (+2.74)▲  (+2.84)▲    (+7.18)▲   (+1.86)•
TWF (10% of D) on documents       60.52     70.88       61.02     74.27       61.44      74.24
                                  (+5.45)▲  (+3.87)▲    (+4.84)▲  (+3.82)▲    (+7.28)▲   (+1.36)•
TWF (100% of D) on scores         60.47     72.90       61.88     74.53       45.16      64.55
                                  (+5.47)▲  (+6.83)▲    (+5.81)▲  (+3.74)▲    (-26.82)▼  (-13.46)▼
TWF (10% of D) on scores          59.68     72.40       61.37     73.77       44.47      64.58
                                  (+3.99)▲  (+6.10)▲    (+4.94)▲  (+2.69)▲    (-28.78)▼  (-13.41)▼
TWF (100% of D) on scores (ext.)  59.96     71.99       59.80     73.95       56.28      72.73
                                  (+4.48)▲  (+5.49)▲    (+2.26)▲  (+2.94)▲    (-1.76)•   (-0.70)•
TWF (10% of D) on scores (ext.)   59.85     71.79       59.76     73.85       56.19      72.70
                                  (+4.29)▲  (+5.20)▲    (+2.19)▲  (+2.80)▲    (-1.89)•   (-0.74)•

Table 3: Results Obtained by Incorporating the Automatically Defined TWF into Rocchio, KNN and Naïve Bayes—ACM-DL (▲/▼: statistically significant gain/loss over the traditional version; •: statistical tie).
Algorithm                           macF1(%)         micF1(%)         Time (s)
SVM                                 59.91            73.88            144.10±5.30
Rocchio with TWF on scores          60.47 (+0.93)•   72.90 (−1.34)•   9.00±0.00
KNN with TWF on documents           59.78 (−0.22)•   73.88 (+0.00)•   11.03±0.48
KNN with TWF on scores              61.88 (+3.29)▲   74.53 (+0.88)•   10.10±0.31
Naïve Bayes with TWF on documents   61.38 (+2.45)▲   74.60 (+0.97)•   9.10±0.32

Table 4: Best temporal classifiers versus SVM—ACM-DL (▲: statistically significant gain over SVM; •: statistical tie).
Conclusions

In this work we presented a quantitative analysis of the impact of temporal effects on four ADC algorithms widely used by the Machine Learning community, applied to three real textual collections with potentially distinct temporal dynamics. We showed that, contrary to the assumption adopted by most learning algorithms that the data follow a static distribution, the studied collections exhibit distinct temporal dynamics, with variations in the data distribution. Such temporal variations potentially limit the effectiveness of the classifiers. Indeed, the conducted analysis showed that all four studied classifiers were negatively affected by the temporal effects, with the most prominent degradations observed when they were applied to the most dynamic collections (ACM-DL and AG-NEWS). The temporal dimension thus stands as an important aspect to be considered in order to provide accurate classifiers.
Besides quantifying the impact of the temporal effects on ADC algorithms, we proposed three strategies to minimize that impact. These strategies rely on the application of what we call the Temporal Weighting Function (TWF). We proposed both a statistical methodology and an automated procedure to determine the TWF. The results obtained with the temporally robust classifiers showed that taking temporal information into account leads to statistically significant gains over the traditional approaches. Furthermore, the proposed classifiers that achieved the best results proved competitive with the state-of-the-art SVM classifier, both in terms of effectiveness and in terms of execution time.
List of Figures
4.1 Class Distributions in the Three Reference Datasets . . . . . . . . . . . . . . 27
4.2 Class Distribution Temporal Variation in Each Reference Dataset . . . . . . . 31
4.3 Term Distribution Temporal Variation of Each Reference Dataset . . . . . . . 32
4.4 Determining the Lower and Upper Levels of CD and CS—ACM-DL . . . . . 43
4.5 Determining the Lower and Upper Levels of TD—ACM-DL . . . . . . . . . 44
4.6 Determining the Lower and Upper Levels of CD and CS—MEDLINE . . . . 46
4.7 Determining the Lower and Upper Levels of TD—MEDLINE . . . . . . . . . 47
4.8 Determining the Lower and Upper Levels of CD and CS—AG-NEWS . . . . 48
4.9 Determining the Lower and Upper Levels of TD—AG-NEWS . . . . . . . . 49
4.10 Cumulative Distribution Function of Document Stability Level Values . . . . 54
5.1 Dδ Distribution (Scaled to the [0, 1] Interval) . . . . . . . . . . . . . . . . . 66
5.2 Fitted Temporal Weighting Function with Log-Transformed Data . . . . . . . 68
5.3 Estimated Temporal Weighting Function . . . . . . . . . . . . . . . . . . . . 71
5.4 Graphical Representation of TWF in Documents . . . . . . . . . . . . . . . . 72
5.5 Graphical Representation of TWF in Scores . . . . . . . . . . . . . . . . . . 75
5.6 Graphical Representation of Extended TWF in Scores . . . . . . . . . . . . . 77
5.7 Relative ⟨c, p⟩ Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Relative ⟨c, p⟩ Sizes for the AG-NEWS Dataset . . . . . . . . . . . . . . . . 88
List of Tables
2.1 Contingency Table for Classification Effectiveness Evaluation . . . . . . . . 11
4.1 Adopted Class Identifiers for each Reference Dataset . . . . . . . . . . . . . 26
4.2 Pairwise Class Similarity (standard deviations) in ACM-DL . . . . . . . . . 33
4.3 Pairwise Class Similarity (standard deviations) in MEDLINE . . . . . . . . . 33
4.4 Pairwise Class Similarity (standard deviations) in AG-NEWS . . . . . . . . . 34
4.5 Factorial Design—ACM-DL . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Factorial Design—MEDLINE . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Factorial Design—AG-NEWS . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 Comparative Study: The Impact of the Temporal Effects on the ADC Algorithms 57
5.1 D'Agostino's D-Statistic Test of Normality . . . . . . . . . . . . . . . . . . 66
5.2 Temporal Distances versus Terms . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Estimated Parameters for Both Datasets, with 99% Confidence Intervals . . . 67
5.4 Results Obtained with the Statistically Defined TWF—ACM-DL . . . . . . . 81
5.5 Results Obtained with the Statistically Defined TWF—MEDLINE . . . . . . 81
5.6 Results Obtained for the Least and Most Frequent Classes ⟨c, p⟩ Sampling for
    Naïve Bayes—MEDLINE . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 Results Obtained with the Estimated TWF—ACM-DL . . . . . . . . . . . . 86
5.8 Results Obtained with the Estimated TWF—MEDLINE . . . . . . . . . . . 86
5.9 Results Obtained with the Estimated TWF—AG-NEWS . . . . . . . . . . . 87
5.10 Effectiveness Comparison: Best Performing Temporally-Aware Classifiers
    versus SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.11 Execution Time (in seconds) of each Explored ADC Algorithm . . . . . . . 91
5.12 Execution Time Comparison: Best Performing Temporally-Aware Classifiers
    versus SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.13 Execution Time of the TWF Estimation using the Rocchio Classifier . . . . . 93
List of Algorithms
1 Factorial Design Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2 Automatic TWF Determination . . . . . . . . . . . . . . . . . . . . . . . . . 70
3 Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents . . . . 73
4 KNN-TWF-Doc: KNN with Temporal Weighting in Documents . . . . . . . 74
5 Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents 75
6 TWF-Sc: Temporal Weighting in Scores . . . . . . . . . . . . . . . . . . . . 76
7 TWF-Sc-Ext: Extended Temporal Weighting in Scores . . . . . . . . . . . . 78
Contents
Resumo xi
Abstract xiii
Resumo Estendido xv
List of Figures xxvii
List of Tables xxix
List of Algorithms xxxi
1 Introduction 1
1.1 Context and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Dissertation Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Work Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Preliminaries: Basic Concepts 9
2.1 Automatic Document Classification . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Evaluation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Temporal Representation of Documents . . . . . . . . . . . . . . . . . . . . 13

3 Related Work 15
3.1 Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Strategies Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    3.2.1 Detecting Data Variations . . . . . . . . . . . . . . . . . . . . . . . 17
    3.2.2 Dealing with Data Variations . . . . . . . . . . . . . . . . . . . . . 17
    3.2.3 Characterizing Data Variations . . . . . . . . . . . . . . . . . . . . 20
3.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 A Quantitative Analysis of Temporal Effects on ADC 23
4.1 Experimental Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    4.1.1 Reference Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    4.1.2 ADC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Characterization of Temporal Effects on Textual Datasets . . . . . . . . . . 29
    4.2.1 Class Distribution Temporal Variation . . . . . . . . . . . . . . . . 30
    4.2.2 Term Distribution Temporal Variation . . . . . . . . . . . . . . . . 31
    4.2.3 Class Similarity Temporal Variation . . . . . . . . . . . . . . . . . 32
4.3 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    4.3.1 Factorial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    4.3.2 Applying the 2^k r Design in the Characterization of Temporal Effects 38
    4.3.3 Quantifying the Impact of Temporal Effects on ADC . . . . . . . . 42
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
    4.4.1 Impact of Temporal Effects on the Reference Datasets . . . . . . . . 51
    4.4.2 Impact of Temporal Effects on the ADC Algorithms . . . . . . . . . 53
    4.4.3 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5 Temporally-Aware Algorithms for ADC 61
5.1 Temporal Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Fully-Automated TWF Definition . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Temporally-aware ADC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
    5.3.1 Temporal Weighting in Documents . . . . . . . . . . . . . . . . . . 72
    5.3.2 Temporal Weighting in Scores . . . . . . . . . . . . . . . . . . . . 74
    5.3.3 Extended Temporal Weighting in Scores . . . . . . . . . . . . . . . 76
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
    5.4.1 Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 79
    5.4.2 Experiments with the Statistically Defined TWF . . . . . . . . . . . 80
    5.4.3 Experiments with the Estimated TWF . . . . . . . . . . . . . . . . 85
    5.4.4 Runtime Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6 Conclusions and Future Work 95
6.1 A Quantitative Analysis of Temporal Effects on ADC . . . . . . . . . . . . 95
6.2 Temporally-Aware Algorithms for ADC . . . . . . . . . . . . . . . . . . . 96
    6.2.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Bibliography 101
Chapter 1
Introduction
In this chapter, we discuss the main motivations and arguments that support this work. We
also briefly describe our work and explicitly state our contributions.
1.1 Context and Motivation
Text classification is still one of the major information retrieval problems, and developing robust and accurate classification models remains in great demand as a consequence of the increasing complexity and scale of current application scenarios, such as the Web. The task of Automatic Document Classification (ADC) aims at creating models that associate documents with semantically meaningful categories. These models are key components for supporting and enhancing a variety of other tasks, such as automated topic tagging (that is, assigning labels to documents), building topic directories, identifying the writing style of a document, organizing digital libraries, improving the precision of Web search, and even helping users to interact with search engines.

Similarly to other machine learning techniques, ADC usually follows a supervised learning strategy: a training set of already classified documents is employed for creating a classifier. Once built, the classifier is used for predicting classes for a new set of unclassified documents. The majority of supervised algorithms consider that all (pre-classified) documents provide equally important information to discover the features that better identify a (previously unclassified) document's class. However, this may not hold in practice due to several factors, such as the document's timeliness, the venue in which it was published, and its authors, among others (de M. Palotti et al., 2010).
1.2 Dissertation Hypothesis
In the following, we state the fundamental hypotheses that serve as guidance to this work:
• The temporal evolution of textual data limits the performance of ADC classifiers;
• Distinct textual datasets present differing dynamical behavior;
• Different ADC algorithms may be distinctively affected by the temporal evolution of
data;
• The temporal evolution of data may be explored to devise more effective classification models.
1.3 Work Description
In this work, we are particularly concerned with the impact that the temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but also the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. Consider, for instance, the terms pheromone and ant colony. Before the 1990s, they referred exclusively to documents in the area of Natural Sciences. However, after the introduction of the Ant Colony Optimization technique in the area of Artificial Intelligence, these terms became relevant for classifying Computer Science documents too. In such scenarios, classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account when learning more accurate classification models.
As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.
Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how, and to which extent, these temporal effects really impact ADC algorithms has not been performed yet. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be distinctively affected by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, making the learning of a more accurate classification model that deals with these effects an even more challenging task.
In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects are more influential in each dataset? (ii) How does each ADC algorithm behave when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this work, we take a step further towards answering the posed questions by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.
Hence, the first part of this dissertation aims at quantifying the impact of temporal effects on ADC algorithms and provides as contributions: (i) a re-visitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well-known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).
As we shall see, there is a higher impact of the temporal effects in the ACM-DL and AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset, the impact of the class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamical behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the most dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.
Based on the performed quantitative analysis of the impact of temporal effects on ADC algorithms, the second part of this dissertation focuses on how to minimize their impact on ADC algorithms. We propose a strategy to incorporate temporal information into document classifiers, aiming at improving their effectiveness by properly handling data with varying distributions. Our strategy is based on the evolution of the term-class relationship over time, captured by a metric of dominance. We start by determining a temporal weighting function (TWF) for a collection according to its characteristics, based on a series of statistical tests performed to determine its expression, and a curve fitting procedure to determine its parameters. We found that this function follows a lognormal distribution for two datasets we used, namely ACM-DL and MEDLINE. However, the set of statistical tests performed to define the TWF expressions for the ACM-DL and MEDLINE datasets was not able to properly define the TWF expression for the AG-NEWS dataset, which does not follow a (log-)normal distribution. Indeed, the required tests may be prohibitively complex to perform depending on the dataset characteristics, limiting the practical applicability of this strategy. Thus, we also propose an automatic procedure to learn the TWF, without the need to perform such statistical tests.
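To make the curve-fitting step concrete, the following sketch recovers the parameters (μ, σ) of a lognormal-shaped TWF from observed (temporal distance, weight) pairs. It exploits the fact that the log of a lognormal density is quadratic in log δ, fitting that quadratic by least squares; the function names and this particular fitting shortcut are illustrative assumptions, not the dissertation's actual statistical procedure.

```python
import math

def lognormal_twf(delta, mu, sigma):
    """Lognormal-shaped temporal weighting function of the temporal
    distance delta (the functional form reported for ACM-DL/MEDLINE)."""
    return (1.0 / (delta * sigma * math.sqrt(2 * math.pi))) * \
        math.exp(-(math.log(delta) - mu) ** 2 / (2 * sigma ** 2))

def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [v] for row, v in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    x = [0.0] * 3
    for i in range(2, -1, -1):
        x[i] = (m[i][3] - sum(m[i][c] * x[c] for c in range(i + 1, 3))) / m[i][i]
    return x

def fit_twf(distances, weights):
    """Least-squares fit of (mu, sigma): ln w is quadratic in x = ln(delta),
    so fit ln w = a*x^2 + b*x + c via the normal equations and invert."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(w) for w in weights]
    S = lambda k: sum(x ** k for x in xs)
    A = [[S(4), S(3), S(2)], [S(3), S(2), S(1)], [S(2), S(1), float(len(xs))]]
    rhs = [sum((x ** 2) * y for x, y in zip(xs, ys)),
           sum(x * y for x, y in zip(xs, ys)),
           sum(ys)]
    a, b, _ = _solve3(A, rhs)
    sigma2 = -1.0 / (2.0 * a)   # a = -1 / (2 * sigma^2)
    mu = (b + 1.0) * sigma2     # b = mu / sigma^2 - 1
    return mu, math.sqrt(sigma2)
```

Because the fitted curve is linear in its quadratic coefficients, the estimate is exact when the observed weights follow the lognormal shape, and degrades gracefully to a least-squares approximation otherwise.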
The final step is to incorporate the temporal weighting function into ADC algorithms, and we propose three strategies that follow a lazy classification approach. In the three strategies, the weights assigned to each example depend on the notion of a temporal distance δ, defined as the difference between the creation time p of a training example and a reference time point p_r. The first strategy, named temporal weighting in documents, weights training instances according to δ. The second strategy, called temporal weighting in scores, takes into account the scores (e.g., similarities, probabilities) produced by a traditional classifier applied to a modified training set where the class c of each training document is mapped to a derived class c ↦ ⟨c, p⟩, with p denoting the training document's creation point in time, ultimately tying together the observed patterns and both the class and temporal information. A weighted sum of the learned scores is then performed, according to the TWF, and used to make the final classification decision. Finally, the third strategy, named extended temporal weighting in scores, partitions the training set D into sub-groups of documents D_p with the same creation point in time p. Then, a classification model is built based on each D_p in isolation. The class scores are then produced for each D_p and, as before, they are aggregated using the TWF to weight them. We specifically show how these strategies are implemented in three traditional ADC algorithms, namely Rocchio, k Nearest Neighbors (KNN), and Naïve Bayes.
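As a minimal illustration of the score-aggregation idea behind the extended temporal weighting in scores, the sketch below combines per-creation-time model scores weighted by the TWF of the temporal distance. The data structures and names here are illustrative assumptions, not the dissertation's implementation.

```python
def classify_with_twf(doc, models, twf, p_ref):
    """Extended temporal weighting in scores (a minimal sketch).

    models: maps a creation time point p to a scoring function
            doc -> {class: score}, trained on the partition D_p.
    twf:    maps a temporal distance delta to a weight.
    p_ref:  reference time point of the test document.
    Returns the class with the highest TWF-weighted combined score.
    """
    combined = {}
    for p, score_fn in models.items():
        w = twf(abs(p_ref - p))  # TWF weight for this partition
        for c, s in score_fn(doc).items():
            combined[c] = combined.get(c, 0.0) + w * s
    return max(combined, key=combined.get)
```

A toy usage: with one model trained on documents from 2000 favoring class "A" and another from 1990 favoring "B", a test document referenced at 2001 is pulled toward "A" because the TWF discounts the temporally distant model.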
We evaluated our strategies using three real textual datasets that span decades (ACM-DL and MEDLINE) or several months (AG-NEWS). The temporally-aware classifiers achieved significant improvements in classification effectiveness, even matching or outperforming the state-of-the-art SVM classifier in some cases, with a drastically reduced execution time.
1.4 Contributions
The specific contributions of this work are:
• a quantification of the impact of three main temporal effects on four widely used ADC algorithms. More specifically,

  – we re-visit the characterization reported in (Mourão et al., 2008), by including a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of variations in textual data;

  – we propose a methodology to enable a deeper study of the three temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset;
  – we instantiate that methodology considering three real textual datasets and four ADC algorithms, and provide a detailed study regarding the impact of the temporal effects on them;

• the proposal of strategies to minimize the impact of the temporal effects in ADC algorithms. Again, more specifically,

  – we introduce a temporal weighting function to capture the varying behavior of textual datasets, and propose two strategies to devise it;

  – we extend three well-known ADC algorithms to incorporate such a function, devising the temporally-aware algorithms for ADC;

  – we perform an extensive experimental analysis in order to assess the benefits of considering the temporal dynamics of data.
In the following we enumerate the work already published as direct contributions of this dissertation, along with some work published during the M.Sc. course:
• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314, Geneva, Switzerland, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Classificação Automática de Documentos Robusta Temporalmente. In XXIV Simpósio Brasileiro de Banco de Dados, pages 106–119, Fortaleza, Brazil, 2009.

• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. A Quantitative Analysis of the Temporal Effects on Automatic Document Classification. Journal of Machine Learning Research, 2011 (submitted).

• Pappa, G. L., Zadrozny, B., Rocha, L., Salles, T., Meira Jr., W., Gonçalves, M. A. Exploiting Contexts to Deal with Uncertainty in Classification. In Proceedings of the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, pages 19–22, Paris, France, 2009.
• de M. Palotti, J. R., Salles, T., Pappa, G. L., Gonçalves, M. A., and Meira Jr., W. Assessing Documents' Credibility with Genetic Programming. IEEE Congress on Evolutionary Computation, 2011 (to appear).

• de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454, 2010.

• Figueiredo, F., Rocha, L., Couto, T., Salles, T., Gonçalves, M. A., Meira Jr., W. Word Co-occurrence Features for Text Classification. Information Systems, 2011 (in press).
1.5 Roadmap
This work is structured in six chapters. The remainder of this work is organized as follows.
Chapter 2: In this chapter we briefly describe the supervised ADC task and some evaluation
strategies. We also present some of the notational conventions adopted in this work.
Chapter 3: In this chapter we describe related work. We start by discussing some of the application scenarios where time is an important aspect to be considered. Then, we discuss some of the efforts towards either detecting or handling variations in the data distribution. We distinguish two broad areas for doing so: concept drift and adaptive document classification.
Chapter 4: In this chapter we provide evidence of the existence of temporal effects. We provide an extensive characterization of the properties of three textual datasets with respect to the extent of each temporal effect on them, and quantify the impact of the temporal effects on four well-known ADC algorithms (i.e., Rocchio, K Nearest Neighbors, Naïve Bayes and Support Vector Machine).
Chapter 5: In this chapter we propose three strategies, based on a temporal weighting function (TWF), to address and minimize the impact of the temporal effects in extended versions of three ADC algorithms. We start by introducing the TWF and proposing two strategies to determine it. Then, we describe how to modify three ADC algorithms (namely, Rocchio, K Nearest Neighbors and Naïve Bayes) in order to incorporate the TWF into them, proposing three strategies for doing so.
Chapter 6: Finally, in this chapter we conclude the dissertation, summarize our main find-
ings and propose some directions for further investigation.
Chapter 2
Preliminaries: Basic Concepts
In this work, we are mainly concerned with Automatic Document Classification (ADC), a well-studied subject related to the classification problem,¹ considering a supervised learning paradigm. This section serves two main purposes: (i) to briefly describe the supervised ADC task and some evaluation strategies, in order to provide the reader with some basic notions on the subject; and (ii) to present some notational conventions adopted in this work.
2.1 Automatic Document Classification
The purpose of supervised ADC algorithms is to predict the unknown class of a document,
based on a set of already classified documents (Sebastiani, 2002). Let di = (~xi, ci) be
a document, where~xi denotes its vectorial (bag of words) representation andci ∈ C a
categorical attribute (or response variable) indicating its class (C is a finite set composed
by all the possible classes). The main goal of an ADC algorithm is thus to learn a discrete
approximation of the class a posteriori probability distributionP (ci|di), which underlies the
relationships between documents and their associated classes. This probability distribution
is learned according to a training set composed by already classified documents. There are
two approaches for doing so, either based on a direct estimation of P (ci|di), or based on an
indirect estimation ofP (ci|di).
The first approach, which defines the so-called discriminative classifiers, learns the class boundaries that minimize the error rate (or some correlated measure), ultimately discriminating between classes without making any assumption regarding the probability density function of each class. The second approach, which defines the generative classifiers, learns the class conditional probability distribution and the a priori class probabilities to estimate the class a posteriori probability distribution $P(c_i \mid d_i)$. In this case, one should assume a model for the class densities $P(d_i \mid c_i)$, whose parameters are estimated from the training set. For example, a normal distribution may be chosen, and its mean and variance parameters estimated according to the already classified data. Then the class a posteriori probability distribution $P(c_i \mid d_i)$ is estimated according to Bayes' rule:

$$P(c_i \mid d_i) = \frac{P(c_i)\, P(d_i \mid c_i)}{\sum_{c' \in C} P(c')\, P(d_i \mid c')}, \qquad (2.1)$$

where $P(c_i)$ denotes the class priors and $P(d_i \mid c_i)$ denotes the class densities.

¹Also known as the discrimination problem in the statistics literature.
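For concreteness, Eq. 2.1 can be evaluated directly once class priors and class density estimates are available. The sketch below (with illustrative names, not tied to any specific generative model) normalizes the joint probabilities $P(c) \cdot P(d \mid c)$ over all classes:

```python
def posterior(priors, likelihoods, doc):
    """Class a posteriori probabilities via Bayes' rule (Eq. 2.1).

    priors:      {class: P(c)}, the a priori class probabilities.
    likelihoods: {class: fn}, where fn(doc) approximates P(d | c)
                 under some assumed class density model.
    Returns {class: P(c | d)}, normalized over all classes.
    """
    joint = {c: priors[c] * likelihoods[c](doc) for c in priors}
    z = sum(joint.values())  # the denominator of Eq. 2.1
    return {c: v / z for c, v in joint.items()}
```

With uniform priors, the posterior reduces to the normalized likelihoods, which makes the role of the prior in Eq. 2.1 easy to see.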
Informally, given a training set of already classified documents with feature measurements, we build a classification model, or learner, which enables us to classify a new, unseen document. A good learner is one that accurately predicts such a class. From the perspective of function approximation, this translates into finding a good approximation $\hat{f}$ of $f : D_U \mapsto C$, which underlies the predictive relationship between the documents and their associated classes, based on the training set $D \subset D_U$, where $D_U$ denotes the input space composed of both classified and unclassified documents.

In order to assess how good an approximation is, one should consider the generalization capabilities of the approximated $\hat{f}$. Recall that $\hat{f}$ is an approximation based on the training set, that is, $\hat{f} : D \mapsto C$. The quality of such an approximation refers to how well $\hat{f}$ predicts the classes of unseen documents (i.e., documents $d' \notin D$), which is assessed by the generalization capability of $\hat{f}$. Clearly, a function $\hat{f}$ that accurately predicts the class of documents from $D$ may not be accurate in predicting the class of documents from $D_U \setminus D$ (i.e., the set of unclassified documents).² In this case, we say that $\hat{f}$ is overfitted w.r.t. $D$. Hence, there exists a trade-off between the complexity of $\hat{f}$ (the more complex $\hat{f}$ is, the more specific are the patterns it learns from the training set) and the generalization power of $\hat{f}$ (overly specific patterns observed in $D$ may not be observed in $D_U \setminus D$).
It has been proved that, asymptotically, the discriminative classifiers are superior to the generative ones (Vapnik, 1998), with several reported experiments corroborating this superiority (Drummond, 2006). In fact, if there are not enough training examples, the parametric model is prone to overfit, decreasing its generalization power (Hastie et al., 2009). However, some authors claim, based on experimental evaluation, that with realistic training set sizes the generative classifiers can perform as well as or better than the discriminative ones. This holds if the parametric model assumed by the generative classifier is correct; in this case, the class priors become useful information that is ignored by the discriminative classifiers. As will be described in Section 4.1, in this work we consider both generative classifiers (represented by the Naïve Bayes classifier) and discriminative classifiers (represented by the Rocchio, K Nearest Neighbors and Support Vector Machine classifiers).

²$A \setminus B$ denotes the set difference between $A$ and $B$, i.e., the set composed of the elements in $A$ but not in $B$.
2.2 Evaluation Techniques
An important aspect to be considered is how to evaluate the effectiveness of a classifier (that is, its accuracy in classifying unseen data or, in other words, its generalization power). This is assessed by first learning a classification model based on the training set and then applying it to classify a set of unseen documents (the test set). Some measures of classification effectiveness are then used to assess the quality of the learned classification model. Several measures for this purpose have been proposed in the literature and some of them are widely used by the machine learning community. Perhaps the most used are precision, recall and the F1 measure. In order to describe each of these measures, consider the contingency table represented in Table 2.1 (also known as a confusion matrix), where $TP$, $TN$, $FP$ and $FN$ denote, respectively, the number of true positives, true negatives, false positives and false negatives, defined as:

True Positive (TP): positive test document correctly classified into the positive class;

True Negative (TN): negative test document correctly classified into the negative class;

False Positive (FP): negative test document incorrectly classified into the positive class;

False Negative (FN): positive test document incorrectly classified into the negative class.
The precision $p$ of a performed classification denotes the fraction of all documents assigned to the positive class $c_i$ by the classifier that really belong to $c_i$. In terms of the contingency table, this translates into

$$p = \frac{TP}{TP + FP}.$$
                              Ground Truth
                        Class = $c_i$    Not $c_i$
  Prediction  $c_i$          TP              FP
              Not $c_i$      FN              TN

Table 2.1: Contingency Table for Classification Effectiveness Evaluation.
The recall r of a performed classification denotes the fraction of all documents that belong to the positive class c_i that were correctly assigned to c_i by the classifier. Again, in terms of the contingency table, this can be expressed as

$$r = \frac{TP}{TP + FN}.$$

Finally, the F1 measure is defined as the harmonic mean of the precision and the recall, given by

$$F_1 = \frac{2pr}{p + r}.$$
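The three measures can be computed directly from the contingency-table counts. The sketch below is illustrative; the function name and the convention of returning 0 for empty denominators are our own choices, not from the dissertation:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from contingency-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0      # fraction of predicted positives that are correct
    r = tp / (tp + fn) if tp + fn else 0.0      # fraction of actual positives that are found
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f1

# Example: 80 true positives, 20 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
# p = 0.8, r ≈ 0.889, F1 ≈ 0.842
```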
There are two conventional methods to evaluate classification algorithms when applied to problems with more than two classes, namely by micro-averaging and macro-averaging the F1 measure. The micro-averaged F1 (microF1) is calculated from a global contingency table (similar to Table 2.1), with the precision and recall being calculated from the sums over all classes of each entry of the table:

$$p_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad r_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}.$$
In contrast, the macro-averaged F1 (macroF1) is calculated by first computing the precision and recall values for each class and then averaging them:

$$p_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i}, \qquad r_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i}.$$

Notice that the main difference between the two strategies is that microF1 is a document-pivoted measure that gives equal weight to the documents, while macroF1 is a class-pivoted measure that gives equal weight to the classes.
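The document-pivoted versus class-pivoted distinction can be made concrete with a small sketch. The per-class count representation below is our own; the macroF1 follows the dissertation's definition, i.e., the harmonic mean of the macro-averaged precision and recall, rather than the average of per-class F1 scores:

```python
def micro_macro_f1(counts):
    """counts: one (TP, FP, FN) tuple per class."""
    # microF1: aggregate the contingency tables first, then compute p and r once.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p_micro = tp / (tp + fp)
    r_micro = tp / (tp + fn)
    micro_f1 = 2 * p_micro * r_micro / (p_micro + r_micro)
    # macroF1: compute p and r per class, then average across classes.
    p_macro = sum(c[0] / (c[0] + c[1]) for c in counts) / len(counts)
    r_macro = sum(c[0] / (c[0] + c[2]) for c in counts) / len(counts)
    macro_f1 = 2 * p_macro * r_macro / (p_macro + r_macro)
    return micro_f1, macro_f1

# A large, well-classified class dominates microF1, while the poorly
# classified small class drags macroF1 down:
micro, macro = micro_macro_f1([(900, 10, 10), (5, 20, 20)])
# micro ≈ 0.97, macro ≈ 0.59
```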
Since the ADC task is inherently a stochastic process, it is fundamental to adopt evaluation strategies that guarantee the statistical validity of the obtained classification results, which is achieved by replicating the experiments using different training sets to learn a classification model. For this purpose, the cross validation strategy has become a standard in the machine learning community. There are at least two usual strategies for cross validation: K-fold cross validation and repeated random sub-sampling (Kohavi, 1995).
K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. Repeated random sub-sampling consists of randomly selecting a fraction of documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set; this is performed for each replication. Since in K-fold cross validation the size of the folds depends on the number of iterations, it is more suitable for medium/large-sized datasets, while repeated random sub-sampling is usually adopted for small-sized datasets when the number of replications is large.
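Both strategies can be sketched as follows (a minimal illustration; the function names and the fixed seed are our own choices):

```python
import random

def kfold_splits(indices, k, seed=0):
    """K-fold: shuffle once, partition into K disjoint folds; each fold
    serves as the test set in exactly one iteration."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [d for j in range(k) if j != i for d in folds[j]]
        yield train, folds[i]

def subsampling_splits(indices, test_fraction, replications, seed=0):
    """Repeated random sub-sampling: per replication, draw a test set
    without replacement; the remaining documents form the training set."""
    rng = random.Random(seed)
    n_test = int(len(indices) * test_fraction)
    for _ in range(replications):
        idx = list(indices)
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]
```

Note that the K test folds are pairwise disjoint, whereas sub-sampling test sets may overlap across replications, which is why the latter suits small datasets evaluated with many replications.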
For more details on ADC and evaluation strategies, we refer the reader to
(Baeza-Yates and Ribeiro-Neto, 2011; Hastie et al., 2009; Manning et al., 2008).
2.3 Temporal Representation of Documents
In this work we deal with the documents' timeliness, represented by their creation points in time. We consider time as a discrete attribute associated to documents. Thus, we represent each document by a triple d_i = (x⃗_i, c_i, p_i), where x⃗_i denotes the vectorial "bag of words" representation of d_i, c_i denotes its associated class and p_i denotes its creation point in time.
An important aspect to consider is the temporal unit used. The temporal unit should be the minimum time interval between relevant changes observed in the data and is, clearly, dataset dependent. For example, since scientific conferences are usually annual, relevant changes usually occur yearly, and the temporal unit should be one year. On the other hand, the temporal unit for data from published news articles should be more fine-grained (e.g., one day or one month).
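The representation above can be sketched directly. The `Document` type and the year-to-unit helper below are hypothetical illustrations, not part of the dissertation's formalism:

```python
from collections import namedtuple

# A labeled document d_i = (x_i, c_i, p_i): bag-of-words vector x,
# class c, and discrete creation point p in the dataset's temporal unit.
Document = namedtuple("Document", ["x", "c", "p"])

def to_temporal_unit(year, origin=1980, unit_in_years=1):
    """Discretize a creation year into a time point (yearly unit by default)."""
    return (year - origin) // unit_in_years

d = Document(x={"svm": 2, "kernel": 1}, c="Theory of Computation",
             p=to_temporal_unit(1998))
# d.p == 18: the creation point counted in yearly units from 1980
```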
Chapter 3
Related Work
In this chapter, we discuss related work. First, we report efforts related to the dissertation's target problem, that is, the impact of varying data distributions on learning algorithms when applied to some important scenarios. Then, we focus our attention on works aimed at either detecting or dealing with such a problem.
3.1 Problem Overview
A fundamental assumption of the vast majority of automatic classifiers is that the data used to learn a classification model are random samples drawn independently and identically distributed (i.i.d.) from a stationary distribution that also governs the test data. However, this may not be the case. In fact, in many (perhaps most) real-world classification problems, the training data may not be randomly drawn from the same distribution as the test data (to which the classifier will be applied) when there are variations in the underlying data distribution. Hence, the success of classification algorithms may be diminished when faced with real-world time-varying data. As argued by Alonso et al. (2007), "time is an important dimension of any information space and can be very useful in information retrieval".
As analyzed by Kelly et al. (1999), the observed variations in the data distributions may be reflected in, at least, three aspects:

1. Varying a priori class probabilities — P(c_i);
2. Varying class a posteriori probabilities — P(c_i|d_i);
3. Varying class densities — P(d_i|c_i).
Notice that, according to Equation 2.1, since p(c_i|d_i) depends on p(d_i|c_i), both the generative and the discriminative classifiers that do assume a static underlying data distribution are deemed to be error prone when faced with non-stationary data. This problem becomes critical, as it is not hard to enumerate real-world examples of scenarios in which automatic classification procedures are applied to inherently dynamic data. For example, in spam filtering applications the ultimate goal is to filter out undesired spam messages. However, spammers actively change the nature of their messages to elude the spam filters, and developing strategies that take into account such dynamic behavior becomes a necessary task to guarantee the effectiveness of the filters (Fdez-Riverola et al., 2007).
Another example relates to the information filtering techniques employed by personal assistance applications aimed at personalizing the flow of information according to the user's interests. A specific type of information filtering technique is to recommend information items to users, according to their interests. This is accomplished by predicting which information items meet the users' interests, based on their profiles. Clearly, changes in user interests are problematic and should be addressed in order to guarantee effective recommendations. Thus, modeling the temporal dynamics of user interests should be a key concern when designing such systems. Indeed, there was recently an open competition for the best filtering algorithm to predict user ratings for films, based on previous ratings (the Netflix Prize). The winners of the contest explored the temporal aspect as one of the keys to the problem, considering that both movie popularities and user preferences change over time (Koren, 2010). This reinforces the importance of a proper handling of dynamical data. Another example is automatic credit card fraud detection (Wang et al., 2003), where previously observed patterns regarding fraudulent credit card transactions are used to learn a classification model that is able to predict the legitimacy of new transactions. However, such patterns also change over time, and this should be taken into account in order to avoid fraudulent transactions. It should be clear by now that variations in the data distribution pose an important problem to be tackled in order to improve the effectiveness of learning algorithms.
In this work, we focus on the temporal dynamics observed in textual datasets. As a matter of fact, due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of textual data may change over time (Mourão et al., 2008). As previously discussed, automatic document classifiers may have trouble with such kind of data. Thus, this work tackles the following problem:

Problem 1 (Problem Statement). The majority of automatic document classifiers assume a stationary data distribution. However, in many (perhaps most) real-world classification problems this premise is violated, making it an important task to consider the temporal dynamics of the data in order to boost the effectiveness of the classifiers.
3.2 Strategies Overview
Although ADC is a widely studied subject, the analysis of temporal aspects in this class of
algorithms is quite recent—it has been studied only in the last decade. Most previous studies
have focused on detecting and dealing with these effects to improve classification quality,
whereas we are aware of only one prior effort towards characterizing the impact of temporal
effects on ADC effectiveness.
3.2.1 Detecting Data Variations
We start by reviewing previous attempts to detect significant changes in the underlying data distribution due to temporal effects. Gama et al. (2004) presented a method to detect changes in the distribution of the training examples by means of an online classifier that performs a sequence of trials to perform the classification. On each trial, it makes some predictions and receives feedback accounting for the classification error, in order to detect significant changes in the data at hand. This approach is able to detect both gradual and abrupt changes. Similarly, Nishida and Yamauchi (2009) propose a system to detect and predict changing distributions by managing a set of offline and online classifiers to account for, respectively, data variations and classifiers' prediction errors. Furthermore, the system also performs a clustering step to allow the prediction of future variations. Other studies explore statistical tests to detect drift (Dries and Rückert, 2009; Nishida and Yamauchi, 2007). In (Dries and Rückert, 2009), for instance, the authors propose three adaptive tests that are capable of adapting to different (gradual or abrupt) changing behaviors. In (Nishida and Yamauchi, 2007), the authors propose to classify a set of examples belonging to a recent time window, and to compare the achieved accuracy against the one obtained with a global classifier that considers all available data. The basic idea is that statistically significant decreases in accuracy suggest data variations. Such a solution is able to quickly detect drift when the window size is small, at the cost of being susceptible to data sparseness.
3.2.2 Dealing with Data Variations
Previous efforts to deal with varying data distributions can be categorized into two broad areas, namely, adaptive document classification and concept drift.
3.2.2.1 Adaptive Document Classification
Adaptive document classification (Cohen and Singer, 1999) embodies a set of techniques to deal with changes in the underlying data distribution so as to improve the effectiveness of document classifiers through incremental and efficient adaptation of the classification models. Adaptive document classification brings three main challenges to document classification (Liu and Lu, 2002). The first one is the definition of a context and how it may be exploited to devise more accurate classification models. A context is a semantically significant set of documents. Previous research suggests that contexts may be determined through at least two strategies: identification of terms neighboring a certain keyword (Lawrence and Giles, 1998), and identification of terms that indicate the scope and semantics of the document (Caldwell et al., 2000). In our case, the strategies to deal with varying data distributions explore the stability of terms, which can be seen as a kind of (temporal) context, but at a finer granularity (i.e., terms). The second challenge is how to build the classification models incrementally (Kim et al., 2004), whereas the third challenge relates to the computational efficiency of the resulting classifiers. Here, we do not consider the incremental construction of classification models. Our temporally-aware classifiers use the temporal information to learn more accurate classification models, instead of updating them in an incremental fashion. This is a natural extension of our work that we intend to pursue in the future.
3.2.2.2 Concept Drift
Concept or topic drift (Tsymbal, 2004) comprises another relevant set of efforts to deal with varying data distributions in classification. A prevailing approach to address concept drift is to completely retrain the classifier according to a sliding window, which ultimately involves example selection techniques. A number of previous studies fall into this category. For instance, the method presented in (Klinkenberg and Joachims, 2000) maintains a window with examples sufficiently "close" to the current target concept, and automatically adjusts the window size so that the estimated generalization error is minimized. In (Žliobaite, 2009), a classification model is built using training examples which are close to the test examples in terms of both time and space. The methods presented in (Klinkenberg, 2004) either maintain an adaptive time window on the training data, select representative training examples, or weight them. Widmer and Kubat (1996) describe a set of algorithms that react to concept drift in a flexible way and can take advantage of situations where contexts reappear. The main idea of these algorithms is to keep only a window of currently trusted examples and hypotheses, and to store concept descriptions in order to reuse them if a previous context reappears. In (Rocha et al., 2008), the authors introduce the concept of temporal context, defined as a subset of the dataset that minimizes the impact of temporal effects on the performance of classifiers. They also propose an algorithm, named Chronos, to identify these contexts based on the stability of the terms in the training set. Temporal contexts are used to sample the training examples for the classification process, and examples considered to be outside the temporal context are discarded by the classifier.
Unlike previous efforts that use a single window to determine drift in the data, Lazarescu et al. (2004) present a method that uses three windows of different sizes to estimate the change in the data. While algorithms that use a window of fixed size impose hard constraints on drift patterns, those that use heuristics to adjust the window size to the current extent of concept drift often involve many parameters to be calibrated. In order to provide some theoretical basis for the choice of window size, Kuncheva and Žliobaite (2009) developed a framework relating the classification error to the window size, aiming at providing an optimal window size choice. Such an optimal choice leads to statistically significant improvements in window-based strategies. Following this direction, in (Bifet and Gavaldà, 2006) the authors propose a window-based strategy for drifting data streams, called ADWIN, that automatically chooses the optimal window size. This approach keeps a window W with the most recent data and splits it into two adjacent sub-windows W0 and W1. Using statistical tests to compare both windows, it detects when a drift occurred. In this case, all possible adjacent sub-windows must be considered. Clearly, this is a costly operation (both in terms of time and memory). In (Bifet and Gavaldà, 2007), the authors propose an improvement over ADWIN, called ADWIN2, with the same effectiveness guarantees as ADWIN and more efficient data structures.
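The window-splitting idea can be illustrated with a much-simplified sketch. This is not the actual ADWIN algorithm of Bifet and Gavaldà; the Hoeffding-style threshold and the parameters below are our own illustrative choices:

```python
import math

def detect_drift(window, delta=0.05):
    """Split the window W of 0/1 error indicators into every pair of adjacent
    sub-windows (W0, W1); flag drift when their means differ by more than a
    Hoeffding-style threshold. Checking every cut is the costly step that
    ADWIN2 later alleviates with better data structures."""
    n = len(window)
    for cut in range(1, n):
        w0, w1 = window[:cut], window[cut:]
        m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))  # harmonic mean of sub-window sizes
        eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
        if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) > eps:
            return cut  # drift detected at this split point
    return None

# An error stream that jumps from ~0% to ~100% triggers detection,
# while a stable stream does not:
assert detect_drift([0] * 30 + [1] * 30) is not None
assert detect_drift([0] * 60) is None
```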
Window-based approaches may be considered too rigid, since they may miss valuable information lying outside of the window. Accordingly, a second approach to deal with concept drift consists in properly weighting training examples while building the classification model, in order to reflect the temporal variations in the underlying data distribution instead of simply discarding them.1 Following this direction, Koychev (2000) defined a linear time-based utility function to account for variations in the data distribution, such that the impact of the examples on the classification model decreases with time. Experimental evaluation conducted with the Naïve Bayes and ID3 algorithms showed the effectiveness of such an approach. In (Klinkenberg and Rüping, 2003), the authors defined an exponential time-based function in order to weight examples based on their age. The reported experimental evaluation showed that weighting examples in drifting scenarios leads to significant improvements over fixed-window strategies, while being outperformed by an adaptive-window approach. However, such time-based utility functions are typically defined in a very ad-hoc manner (e.g., linear functions, exponential functions, etc.), without any theoretical justification built from changes in data patterns.
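The linear and exponential shapes mentioned above can be written down directly. This is a sketch; the parameterizations are illustrative, not the exact functions of Koychev (2000) or Klinkenberg and Rüping (2003):

```python
import math

def linear_weight(age, max_age):
    """Linear time-based utility: decays linearly with the example's age
    (in temporal units) and vanishes at max_age."""
    return max(0.0, 1.0 - age / max_age)

def exponential_weight(age, rate=0.1):
    """Exponential time-based utility: decays as exp(-rate * age)."""
    return math.exp(-rate * age)

# Weights for training examples 0, 5 and 10 temporal units older than the test:
linear = [linear_weight(a, 10) for a in (0, 5, 10)]   # [1.0, 0.5, 0.0]
expo = [exponential_weight(a) for a in (0, 5, 10)]
```

Both shapes are monotonically decreasing in the example's age; the open question raised next is how to justify any particular shape from the data itself.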
Thus, the following question remains unanswered: how can we properly define such a time-based utility function? In order to answer that question, not only the temporal distance
1 In this sense, window-based approaches can be thought of as a type of binary weighting function.
between training and test examples should be considered, but also the varying characteristics of the underlying data distribution. Following this direction, in this work we report a statistical analysis of the temporal effects on three textual datasets in order to define a temporal weighting function (TWF) which properly models the changing behavior of the underlying data distribution, reflecting its dynamical nature and capturing both the temporal distance between training and test examples and the variations of the characteristics of the dataset (Salles et al., 2010b). We also propose three instance weighting strategies that employ the temporal weighting function to deal with these temporal effects (Salles et al., 2010a). We applied these strategies to three well known ADC algorithms, namely Rocchio, KNN and Naïve Bayes, and, as reported in Section 5.4, we found that the new temporally-aware classifiers achieve statistically significant gains over their traditional counterparts.
Another common approach to deal with concept drift focuses on the combination of various classification models generated from different algorithms (ensembles) for classification, pruning or adapting the weights according to recent data (Folino et al., 2007; Kolter and Maloof, 2003; Scholz and Klinkenberg, 2007). Scholz and Klinkenberg (2007) proposed a boosting-like method to train a classifier ensemble from data streams. It naturally adapts to concept drift and allows one to quantify the drift in terms of its base learners. The algorithm was shown to outperform learning algorithms that ignore concept drift. In the same direction, Kolter and Maloof (2003) presented a technique that maintains an ensemble of base learners, predicts instance classes using a weighted-majority vote of these "experts", and dynamically creates and deletes experts in response to changes in performance. Additionally, Folino et al. (2007) proposed to build an ensemble of classifiers using genetic programming to inductively generate decision trees. In spite of these prior proposals, one important challenge of approaches based on classifier ensembles is the efficient management of multiple models. As a matter of fact, one of our proposed strategies is based on the combination of various classification models, but with a much simpler way to manage them, by exploiting the TWF.
3.2.3 Characterizing Data Variations
In addition to the aforementioned studies, which aim at either detecting or exploiting the changes in data distribution, in (Forman, 2006) the author provides a characterization of varying data distributions in the textual data domain, where the concept drift problem is studied considering three main types of data variations: (i) shifting class distribution, which is reflected in the observed variations over time in the proportion of documents assigned to each class; (ii) shifting subclass distribution, which accounts for varying feature distributions; and, finally, (iii) fickle concept drift, which denotes the cases where documents are assigned to distinct classes at different points in time. Moreover, in that work, the author proposes a visualization tool aimed at analyzing the feature space (in a binary classification setting) and thus providing clues about the varying behavior of the most predictive features as time goes by. A real textual dataset, composed of news articles, was characterized according to the three mentioned drifting patterns, and was shown to be a very dynamic dataset.
Following this direction, in (Mourão et al., 2008), the authors provide a characterization of these changes in terms of three main temporal effects: (i) the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes; (ii) the term distribution variation, which refers to changes in the representativeness of the terms with respect to the classes as time goes by; and, finally, (iii) the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. In fact, the class distribution variation and the term distribution variation effects correspond, respectively, to the shifting class distribution and the shifting subclass distribution discussed in (Forman, 2006). Furthermore, while the class similarity variation effect is not analyzed in (Forman, 2006), the fickle drifting pattern is not considered in (Mourão et al., 2008). As a matter of fact, the fickle drift type, which corresponds to the change of class of a given document due to some eventual correction, is probably the most difficult case to be handled. These are very rare events which may not affect the classifier effectiveness, and even the strategies discussed in (Forman, 2006) to handle concept drift do not deal with this case. Hence, here we focus on the three temporal effects analyzed in (Mourão et al., 2008), adopting the authors' proposed nomenclature.
Building upon the characterization reported in both studies, we here propose a methodology to enable a deeper study of temporal effects. We propose to use a factorial experimental design to quantify the extent to which each of these variations impacts ADC algorithms, according to datasets with distinct temporal dynamics. This quantitative analysis is an advance over the aforementioned studies, since both analyze the variations in the data distribution in a purely qualitative manner. We also instantiate the proposed methodology using three real textual datasets and four traditional ADC algorithms. In comparison with previous work, our characterization methodology and results contribute directly to the definition of more successful strategies to deal with and to exploit temporal effects. They also provide valuable insights into the behavior of the analyzed algorithms when faced with changing distributions.
It is interesting to notice that, while the majority of the aforementioned works aimed at dealing with varying data distributions typically consider scenarios characterized by the classification of future data (with older data becoming obsolete as time goes by), here we propose an approach to classify documents in scenarios where we may have information about both the past and the future when classifying the test data, and this information may change over time. For example, considering a training set composed of documents created between the years 1980 and 2011, when classifying a test document created in the year 2000 we take into account both past and future data. It should be noticed, however, that our approach may be easily adapted to scenarios where we only have past information, such as Adaptive Document Classification and Concept Drift.
3.3 Chapter Summary
We discussed in this chapter the importance of considering the temporal dynamics of data
in machine learning techniques. We also reported some work aimed at either detecting or
handling variations in the data distribution in automatic classification tasks. We saw that the
main approaches for detecting data distribution variations are based on statistical tests and
classifier ensembles. Moreover, we discussed three main techniques for handling varying
data distributions (instance selection, instance weighting and ensembles) along with their
merits and drawbacks. Throughout the discussion, we pointed out how our work advances
the current research efforts.
Chapter 4
A Quantitative Analysis of Temporal
Effects on ADC
In this chapter, we are particularly concerned with the impact that temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to the class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but also the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. In such scenarios, the classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account in the learning of more accurate classification models.
As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.
Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how and to what extent these temporal effects really impact ADC algorithms has not been performed yet. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be distinctively affected by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, making the learning of a more accurate classification model that deals with these effects an even more challenging task.
In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects have more influence in each dataset? (ii) What is the behavior of each ADC algorithm when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this chapter, we take a step further towards answering the posed questions, by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.
The original contributions of this chapter are: (i) a revisitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely, ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).
As we shall see, there is a higher impact of the temporal effects on the ACM-DL and AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset, the impact of the class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamical behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the more dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.
This chapter is organized as follows: In Section 4.1 we describe the workload used in our experimental design, that is, the reference datasets and the analyzed ADC algorithms. An extension of the characterization done by Mourão et al. (2008), providing evidence of the existence of temporal effects in three textual datasets, is presented in Section 4.2. Next, Section 4.3 describes the factorial experimental approach proposed as a methodology to provide a more precise picture of the impact of temporal effects on different ADC algorithms, whereas the results of applying the proposed methodology to the considered datasets and ADC algorithms are discussed in Section 4.4. Finally, Section 4.5 summarizes our findings.
4.1 Experimental Workload
In this section, we present the experimental workload used in our analysis and in the remaining chapters. We provide a brief description of the three reference datasets (Section 4.1.1) as well as of the four ADC algorithms analyzed (Section 4.1.2).
4.1.1 Reference Datasets
The three reference datasets considered in our study consist of sets of textual documents, each one assigned to a single class (a single-label problem). For clarity purposes, throughout this paper we refer to each class by a corresponding identifier, as listed, for each dataset, in Table 4.1. The considered datasets are:
ACM-DL: a subset of the ACM Digital Library with 24897 documents containing articles related to Computer Science created between 1980 and 2002. We considered only the first level of the taxonomy adopted by ACM, including 11 classes, which remained the same throughout the period of analysis. The distribution of the 24897 documents among the 11 classes, in the entire time period, is presented in Figure 4.1a.
MEDLINE: a derived subset of the MedLine dataset, with 861454 documents classified into 7 distinct classes related to Medicine, and created between the years of 1970 and 1985. The class distribution of the 861454 documents during the entire time period is depicted in Figure 4.1b.
AG-NEWS: a collection of 835795 news articles, classified into 11 distinct classes, that spans 573 days. This dataset presents some interesting characteristics that are typical of news datasets. For instance, some topics appear and disappear very suddenly due to periodical or ephemeral events. Moreover, there is a higher variability in the meaning of the terms, along with a greater extent of class imbalance, due to the very dynamic nature of the news domain. The class distribution, spanning the whole 573-day period, is shown in Figure 4.1c.
    ACM-DL                            MEDLINE                      AG-NEWS
 0. General Literature             0. Aids                      0. Business
 1. Hardware                       1. Bioethics                 1. Science & Technology
 2. Computer Systems Organization  2. Cancer                    2. Entertainment
 3. Software                       3. Complementary Medicine    3. Sports
 4. Data                           4. History                   4. United States
 5. Theory of Computation          5. Space Life                5. World
 6. Mathematics of Computing       6. Toxicology                6. Health
 7. Information Systems                                         7. Top News
 8. Computing Methodologies                                     8. Europe
 9. Computer Applications                                       9. Italia
10. Computing Milieux                                          10. Top Stories
Table 4.1: Adopted Class Identifiers for each Reference Dataset.
These datasets potentially present distinct evolution patterns, due to their own characteristics. In particular, we expect that MEDLINE exhibits a more stable behavior, in comparison to the other two datasets, since it represents a more consolidated knowledge area. Thus, we expect a tendency of newly inserted terms becoming stable along the years. In contrast, we expect a higher dynamism in AG-NEWS, a natural behavior of news datasets, which tend to present higher variability in their characteristics (for example, variations in class distributions according to transient events, hot topics, and so on).
Figure 4.1: Class Distributions in the Three Reference Datasets. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
4.1.2 ADC Algorithms
We selected four representative and widely used ADC algorithms to conduct our study. These
algorithms are:
Rocchio: an eager classifier that uses the centroid of a class to find boundaries between classes. The centroid of a class is defined as the average vector computed over its training examples. When classifying a new example d′, Rocchio associates it with the class represented by the centroid closest to d′.
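The centroid rule above can be sketched in a few lines. This is an illustrative implementation (not the dissertation's code), assuming documents are dense NumPy term-weight vectors and using cosine similarity to pick the closest centroid:

```python
import numpy as np

def train_rocchio(X, y):
    """One centroid per class: the mean of that class's training vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(centroids, d):
    """Assign d to the class whose centroid is most cosine-similar to it."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(centroids[c], d))
```
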
KNN: a lazy classifier that assigns to a test document d′ the majority class among those of its k nearest neighbor training documents in the vector space. Unlike Rocchio, KNN determines the decision boundary locally, considering each training document independently. We here use cosine similarity to determine the nearest neighbors of a test document.
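A corresponding sketch of the kNN rule with cosine similarity (illustrative only; function names and the toy vector representation are our own assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Majority class among the k training vectors most cosine-similar to d."""
    sims = (X_train @ d) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(d) + 1e-12)
    top_k = np.argsort(-sims)[:k]          # indices of the k nearest neighbors
    return Counter(y_train[top_k].tolist()).most_common(1)[0][0]
```
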
Naïve Bayes (NB): a probabilistic learning method that aims at inferring a model for each class, assigning to a test document d′ the class associated with the most probable model that would have generated it. Here, we adopt the Multinomial Naïve Bayes approach (Manning et al., 2008), since it is widely used for probabilistic text classification. The posterior class probabilities P(d′|c) are defined as

P(d′|c) = η × P(c) × ∏_{t ∈ d′} P(t|c),    (4.1)

where η denotes a normalizing factor, P(c) is the class prior probability, and P(t|c) denotes the conditional probability of observing term t given class c. The NB classifier assigns to a test example d′ the class c with the highest posterior probability P(d′|c).
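A minimal multinomial NB sketch of Equation 4.1 in log space. The Laplace smoothing and all names below are our own assumptions, added to avoid zero probabilities; they are not prescribed by the text:

```python
import math
from collections import Counter

def train_mnb(docs, labels, vocab, alpha=1.0):
    """Estimate P(c) and Laplace-smoothed P(t|c) from token-list documents."""
    prior = {c: labels.count(c) / len(docs) for c in set(labels)}
    counts = {c: Counter() for c in prior}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    cond = {}
    for c in prior:
        total = sum(counts[c].values())
        cond[c] = {t: (counts[c][t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return prior, cond

def classify_mnb(prior, cond, doc):
    """argmax_c of log P(c) + sum_{t in doc} log P(t|c): Equation 4.1 in log space."""
    return max(prior, key=lambda c: math.log(prior[c])
               + sum(math.log(cond[c][t]) for t in doc if t in cond[c]))
```
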
Support Vector Machine (SVM): the SVM classifier aims at finding an optimal separating hyperplane between the positive and negative training documents, maximizing the distance (margin) to the closest points from either class. Given N training documents represented as pairs (x_i, y_i), where x_i is the weighted feature vector of the i-th training document and y_i ∈ {−1, +1} the set membership of the document, SVM tries to maximize the margin between them on the training data, which leads to better classification effectiveness on test data. We may state the problem as

min_{β, β0}  (1/2)‖β‖²,  subject to  y_i(x_i^T β + β0) ≥ 1,    (4.2)

where β is a vector normal to the hyperplane (the so-called weight vector), β0 is its intercept, and 1 ≤ i ≤ N.
After introducing Lagrange multipliers α_i (1 ≤ i ≤ N) for each inequality constraint in Equation 4.2, along with slack variables ξ_i to account for non-separable data (a bounded tolerable training error rate), we form the following Lagrangian (primal):

L_P = (1/2)‖β‖² + C ∑_{i=1}^{N} ξ_i − ∑_{i=1}^{N} α_i [y_i(x_i^T β + β0) − (1 − ξ_i)] − ∑_{i=1}^{N} μ_i ξ_i,    (4.3)
which we minimize with respect to β, β0 and ξ_i, where μ_i are Lagrange multipliers employed to enforce ξ_i ≥ 0. Setting the corresponding derivatives to zero yields:

β = ∑_{i=1}^{N} α_i y_i x_i    (4.4)

0 = ∑_{i=1}^{N} α_i y_i    (4.5)

α_i = C − μ_i,    (4.6)
where α_i ≥ 0, μ_i ≥ 0 and ξ_i ≥ 0, ∀i. By substitution into Equation 4.3, we get the so-called Wolfe (dual) function:

L_D = ∑_{i=1}^{N} α_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j.
Furthermore, the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, which include, along with Equations 4.4, 4.5 and 4.6, the following ones:

α_i [y_i(x_i^T β + β0) − (1 − ξ_i)] = 0    (4.7)

μ_i ξ_i = 0

y_i(x_i^T β + β0) − (1 − ξ_i) ≥ 0,

where 1 ≤ i ≤ N.
Finally, the solution for β is β = ∑_{i=1}^{N} α_i y_i x_i, with non-zero α_i only for the support vectors. The solution for β0 may be derived from Equation 4.7, normally averaging the solutions over the support points to achieve numerical stability. Thus, we can express the SVM's decision function as:

F = sign(x^T β + β0),

where the sign of the score is used to predict the example's class. Since SVM is a binary classifier, it must be adapted to handle multi-class classification problems. The two most common strategies for doing so are the one-against-one and the one-against-all (Manning et al., 2008).
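Assuming one already-trained binary SVM (β, β0) per class, the one-against-all prediction rule can be sketched as below (illustrative only; training itself is omitted):

```python
import numpy as np

def ova_predict(models, x):
    """models: class -> (beta, beta0), one binary SVM per class.
    Score x with each decision function f_c(x) = x^T beta + beta0 and
    return the class with the highest score (one-against-all rule)."""
    return max(models, key=lambda c: float(x @ models[c][0] + models[c][1]))
```
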
4.2 Characterization of Temporal Effects on Textual Datasets
In this section, we briefly describe the characterization reported in (Mourão et al., 2008), which uncovered three main temporal effects that affect the ACM-DL and MEDLINE datasets: (i) the class distribution variation; (ii) the term distribution variation; and, finally, (iii) the class similarity variation. More importantly, we also extend this prior characterization to include a third, distinct, and more dynamic dataset, namely AG-NEWS. Our main goal is to strengthen the argument for the existence of temporal effects in the reference datasets, thus motivating our quantitative analysis of their impact on ADC algorithms when applied to these datasets.
Before proceeding, we must first discretize the temporal dimension in order to capture the variabilities in the characteristics of the explored datasets. Time can be seen as a discretization of natural changes inherent to any knowledge area. Detectable changes, however, may occur at different time scales, depending on the characteristics of the given knowledge area. In the case of ACM-DL and MEDLINE, which are sets of scientific articles, we adopted yearly intervals for identifying such changes, as scientific conferences usually occur once per year. For the AG-NEWS dataset, we adopted, instead, a daily granularity, which should more accurately capture changes in a set of news articles. Next, we discuss the main findings of the characterization of each temporal effect in the three datasets.
4.2.1 Class Distribution Temporal Variation
The impact of temporal evolution on class distribution (CD) relates to the variation of the fraction of documents assigned to each class over time. CD temporal variation should be properly considered to avoid an undesirable classifier bias. For instance, as mentioned before, if CD varies significantly, the "assumed" class distribution may not reflect the "true" class distribution observed when the test data was created. Notice that, as an extreme case, classes may appear and disappear as a consequence of splits and joins of existing classes. For example, the sub-classes Information Retrieval and Artificial Intelligence in the ACM-DL Computing Classification System (CCS) belonged to the same class, Applications, in 1964. Currently, each one belongs to a different class: Information Retrieval belongs to Information Systems, whereas Artificial Intelligence belongs to Computing Methodologies.
To assist the analysis of the CD temporal variation in each dataset, Figure 4.2 shows the class probability distributions for each year of ACM-DL and MEDLINE (as in Mourão et al. 2008) and for each week of AG-NEWS.¹ The figure illustrates the variation in terms of the representativeness of the classes, that is, in terms of the fraction of document occurrences in each class, as time goes by. As the figures show, most classes, particularly in ACM-DL and AG-NEWS, exhibit frequent oscillations in their representativeness, whereas others become more or less representative with time. For instance, the Mathematics of Computing class, in ACM-DL, became less representative with time, whereas the AG-NEWS World class presented a peak in its representativeness between the 25th and 37th weeks. Another interesting case is the MEDLINE Aids class. Although it contains documents dating from 1970, the fraction of documents belonging to it only became significant after 1985.

These results illustrate that one needs to be very careful when creating classification models in order to avoid generating a biased model that may not be accurate for the dataset to be tested. The fact that the fractions of documents in several classes are constantly changing over time, as can be seen for several classes in all three datasets in Figure 4.2, makes this a real problem that must be taken into account.
¹ We show AG-NEWS results on a weekly basis to improve graph readability.
Figure 4.2: Class Distribution Temporal Variation in Each Reference Dataset. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
4.2.2 Term Distribution Temporal Variation
Term Distribution (TD) variation is related to how the distribution of terms among the classes changes over time as a consequence of terms appearing, disappearing, and having variable discriminative power across classes. Take the following example of two classes, Mythology and Astrophysics. Besides being the god of the underworld in classical mythology, Pluto was also considered to be a planet until mid-2006. Up to this date, documents with the term Pluto had a higher probability of being classified in the Astrophysics class due to the great amount of references that mention Pluto as a planet. From this date on, since Pluto is not considered to be a planet anymore, there has been a significant reduction in the number of documents referring to it in this context. In mythology, however, the references to Pluto did not present any noticeable variation. In this case, the term Pluto lost discriminative power in the Astrophysics class and gained it in the Mythology class. Intuitively, we may state that TD evolution usually happens gradually, so that the distributions of terms observed at time periods that are closer time-wise tend also to be more similar.
In order to characterize the TD temporal variation effect, we define, for each class and each point in time, the class vocabulary as the set of terms that have the highest values of info-gain (Forman, 2003). The vocabulary of a class at a given point in time t represents that class in t. We then compare the vocabularies produced for the same class across all points in time using the normalized cosine similarity between them. Figure 4.3 shows the average cosine similarities as we vary the time distance between the vocabularies. For the sake of clarity, we present results for a subset of the classes of each dataset, since the same behavior is observed for all classes. Clearly, for all three datasets, the class vocabularies vary significantly over time. For the less stable ACM-DL and AG-NEWS datasets, the similarities drop significantly even for a time distance equal to 1.²
Figure 4.3: Term Distribution Temporal Variation of Each Reference Dataset. (a) ACM-DL; (b) MEDLINE; (c) AG-NEWS.
Since the class vocabulary changes significantly with time, it becomes clear that a classification model generated considering documents created at a certain period of time may be less effective when tested using documents from another period of time, because the vocabulary may have changed in such a way that the assumptions made when learning the classifier may no longer hold, that is, the discriminative terms may not be the same. Such difficulty turns out to be a very interesting challenge as well.
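The vocabulary comparison described above can be sketched as follows. We treat each class vocabulary as a term set whose binary-vector cosine is |V1 ∩ V2| / sqrt(|V1|·|V2|); this exact weighting is our assumption, since the text does not fully specify it:

```python
def vocab_cosine(v1, v2):
    """Cosine similarity between two vocabularies seen as binary term vectors:
    |v1 & v2| / sqrt(|v1| * |v2|)."""
    if not v1 or not v2:
        return 0.0
    return len(v1 & v2) / (len(v1) * len(v2)) ** 0.5

def avg_similarity_by_distance(vocabs):
    """vocabs: one vocabulary (set of top info-gain terms) per point in time,
    for a single class. Returns {time distance: average cosine similarity}."""
    n = len(vocabs)
    return {dist: sum(vocab_cosine(vocabs[i], vocabs[i + dist])
                      for i in range(n - dist)) / (n - dist)
            for dist in range(n)}
```

At time distance zero each vocabulary is compared to itself, so the average similarity is 1, matching the footnote above.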
4.2.3 Class Similarity Temporal Variation
Finally, class similarity (CS) variation relates to how the pairwise similarity among classes, as a function of the terms that occur in their documents, varies over time. The similarity between two arbitrary classes may change over time due to the migration and to the variation of the frequency of the terms in their vocabularies: two classes may be similar at a given moment, and become less similar later on, and vice versa.
² At time distance zero, the similarity, for all classes, is equal to 1, since we are comparing a vocabulary to itself, which obviously corresponds to the maximum possible similarity value.
In order to analyze the CS temporal variation, we calculate the cosine similarity between the vocabularies of each pair of distinct classes at any given point in time (years for ACM-DL and MEDLINE, and days for AG-NEWS). Tables 4.2, 4.3 and 4.4 show the results for ACM-DL, MEDLINE and AG-NEWS, respectively. Each entry in the tables comprises the standard deviation of the similarities between the associated pair of classes computed over all points in time. As we can observe, the similarities between some pairs of classes vary significantly with time. For example, the similarities between General Literature (id 0) and Theory of Computation (id 5) in ACM-DL, Complementary Medicine (id 3) and History (id 4) in MEDLINE, and World and Top Stories (ids 5 and 10, respectively) in AG-NEWS have standard deviations equal to 0.29, 0.21 and 0.33, respectively. This means that these pairs of classes may have been very similar in some periods, but also loosely related in others. Thus, the difficulty in separating them varies significantly as time goes by.
Class ID    0     1     2     3     4     5     6     7     8     9     10
 0          0   0.14  0.12  0.12  0.12  0.29  0.14  0.13  0.14  0.12  0.29
 1          -    0    0.08  0.13  0.11  0.12  0.11  0.12  0.10  0.10  0.13
 2          -    -     0    0.10  0.09  0.10  0.08  0.07  0.08  0.10  0.13
 3          -    -     -     0    0.09  0.06  0.09  0.10  0.11  0.12  0.13
 4          -    -     -     -     0    0.05  0.08  0.09  0.10  0.13  0.13
 5          -    -     -     -     -     0    0.14  0.13  0.07  0.06  0.29
 6          -    -     -     -     -     -     0    0.13  0.10  0.09  0.15
 7          -    -     -     -     -     -     -     0    0.10  0.08  0.15
 8          -    -     -     -     -     -     -     -     0    0.11  0.13
 9          -    -     -     -     -     -     -     -     -     0    0.12
10          -    -     -     -     -     -     -     -     -     -     0
Table 4.2: Pairwise Class Similarity (standard deviations) in ACM-DL.
Class ID    0     1     2     3     4     5     6
 0          0   0.19  0.16  0.18  0.19  0.18  0.19
 1          -    0    0.04  0.20  0.17  0.19  0.12
 2          -    -     0    0.04  0.03  0.04  0.05
 3          -    -     -     0    0.21  0.08  0.05
 4          -    -     -     -     0    0.20  0.11
 5          -    -     -     -     -     0    0.05
 6          -    -     -     -     -     -     0
Table 4.3: Pairwise Class Similarity (standard deviations) in MEDLINE.
Class ID    0     1     2     3     4     5     6     7     8     9     10
 0          0   0.17  0.16  0.23  0.15  0.18  0.20  0.15  0.17  0.01  0.16
 1          -    0    0.20  0.25  0.24  0.24  0.21  0.19  0.20  0.02  0.15
 2          -    -     0    0.24  0.18  0.10  0.20  0.18  0.26  0.04  0.40
 3          -    -     -     0    0.30  0.30  0.27  0.20  0.31  0.01  0.16
 4          -    -     -     -     0    0.19  0.19  0.21  0.26  0.01  0.19
 5          -    -     -     -     -     0    0.23  0.16  0.28  0.04  0.33
 6          -    -     -     -     -     -     0    0.15  0.22  0.01  0.13
 7          -    -     -     -     -     -     -     0    0.13  0.02  0.15
 8          -    -     -     -     -     -     -     -     0    0.02  0.24
 9          -    -     -     -     -     -     -     -     -     0    0.05
10          -    -     -     -     -     -     -     -     -     -     0

Table 4.4: Pairwise Class Similarity (standard deviations) in AG-NEWS.

Summarizing this discussion, there is clear evidence of temporal variations in the class and term distributions, as well as in the similarities among classes, in all three analyzed datasets. These variations may ultimately affect the performance of classifiers. In the following section, we detail the proposed methodology to quantify the impact of each of these three temporal effects on ADC algorithms.
4.3 Experimental Design
In this section, we describe our proposed methodology to assess the impact of the identified temporal effects on each ADC algorithm and textual dataset. The core component of our methodology is a factorial experimental design (Jain, 1991). This technique has already been applied in multiple contexts to quantify the effect of different factors and inter-factor interactions on a given response variable (see examples in de Lima et al. 2010; Jain 1991; Orair et al. 2010; Vaz de Melo et al. 2008). However, to the best of our knowledge, this is the first time it is applied in the specific context of temporal effects and ADC algorithms. As will be discussed below, the application of this technique in this context brings challenges of its own.

We start, in Section 4.3.1, by reviewing the factorial design procedure in general terms. We discuss how it can be applied to evaluate the impact of temporal effects on ADC algorithms in Section 4.3.2, presenting its application on the four selected ADC algorithms and the three chosen textual datasets in Section 4.3.3.
4.3.1 Factorial Design
Given k factors (the so-called independent variables), which can assume n levels (possible values), and a response variable, a full factorial n^k experimental design aims at quantifying the impact of each individual factor as well as of all inter-factor interactions (of all orders) on a given response variable. In other words, it aims at quantifying the effect of these factors and interactions on the variations observed in the response across a series of n^k experiments, carefully designed to cover all possible configurations of factor levels.
To conduct the n^k design, the parameters that affect the system under study must be carefully controlled, in order to avoid misleading conclusions due to unexpected effects. Thus, one has to be able to isolate and carefully vary the factors, which are parameters related to the goals of the study and thus selected to be analyzed, while controlling the other parameters, which are kept fixed. Usually, factors are varied from smaller to larger values, based on the assumption of monotonicity, that is, that the response variable continuously increases (or decreases) as the factor value becomes larger. In many scenarios, the system under study presents an inherent variability, and, thus, measurements are susceptible to inaccuracies, referred to as experimental errors. In such cases, the impact of the factors and of their interactions should be assessed in comparison to such errors, and an experimental design with r replications (n^k r) should be adopted. This is done by replicating the measurements for each factor-level combination r times. It is important to emphasize the need of controlling all parameters with significant impact on the system, by either treating them as factors or keeping them fixed, as the effect of uncontrolled parameters cannot be distinguished from experimental errors.
Such an experimental design is typically used as a primary tool to help one sort factors and inter-factor interactions in terms of their impact on the response variable, thus providing quantitative evidence of which factors (and/or interactions) are more relevant for further (more detailed) investigation. The examination of every possible factor-level combination enables one to have a complete picture of the system behavior regarding the factors considered. However, it comes at the expense of a potentially very costly study. The required number of experiments (i.e., n^k r experiments) may be too large and unfeasible to perform due to resource and time constraints. One of the most recommended strategies to reduce the number of required experiments consists in reducing the number of levels considered for each factor (Jain, 1991). As a matter of fact, for an initial assessment, one can consider only two levels (lower and upper) of each factor, thus performing a 2^k r factorial design. By doing so, one can determine the relative importance of all factors and interactions, and leave for a more detailed study the analysis of more levels of the most relevant factors.
We describe the main steps of a 2^k r factorial design using, for illustration purposes, k = 2 factors, referred to as A and B. The 2^2 r design aims to fit an additive model that characterizes the impact of each factor A and B, as well as of their interaction AB, on the response variable y. This model is given by:
y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε,    (4.8)

where q0 is the mean value of the response variable, qA, qB and qAB stand for the effects associated with factors A, B and interaction AB, and ε denotes the experimental errors. For each factor f ∈ {A, B}, a variable x_f is defined as

x_f = −1 if f is at the lower level, and x_f = +1 if f is at the upper level.

Thus, q_f denotes the extent of the variation on the global average q0 imposed by factor f, on average.
The 2^2 r experimental design can be summarized into five steps. Step 1 consists of parameter estimation, in which we compute q0 and the effects qA, qB and qAB. Once the effects have been computed, the model can be used to estimate the response for any given factor values (x-values). For instance, the estimated response when factors A and B are at levels xAi and xBi, respectively, is computed as:

y_i = q0 + qA·xAi + qB·xBi + qAB·xAi·xBi    (4.9)
The importance of a factor can be measured by the proportion of the total variation in the response variable that can be explained by it. Thus, in Step 2, we compute the variation of response y across all experiments that can be explained by each factor (SSA, SSB, SSAB, respectively) as well as the variation that remains unexplained, being thus credited to experimental errors (SSE). In other words, we compute SS_f = 2^k r q_f² (f ∈ {A, B, AB}), and SSE = ∑_{i=1}^{2^k} ∑_{j=1}^{r} e_ij², where the error e_ij denotes the difference between the estimated response for the i-th experiment (y_i) and the value measured in its j-th replication (y_ij). The total variation, referred to as Sum of Squares Total (SST), is also computed as the sum of SSA, SSB, SSAB and SSE.
Next (Step 3), we express SS_f (f ∈ {A, B, AB}) and SSE as percentages of the total variation SST, so as to more easily assess the importance of each factor and of the experimental errors in the observed response variations. Factors (and interactions) that explain a higher percentage of the total variation are considered more important and, thus, are candidates for further analysis.
Since the effects are computed from a sample, they are indeed random variables, and could take different values if another set of experiments were performed. Thus, it is necessary to compute their associated confidence intervals (Step 4). We do so by first computing the root mean squared error (RMSE) and the standard deviation s_f of each effect q_f (f ∈ {A, B, AB}). RMSE denotes the standard error of the estimates, thus measuring how well the model explains the observations. It is computed as the square root of the ratio of SSE to the degrees of freedom associated with the experimental errors (in the current design, 2²(r − 1)).³ The 100(1 − α)% two-sided confidence intervals are computed using either a Student's t distribution or a z distribution, depending on the degrees of freedom 2²(r − 1) (see Jain, 1991). Any effect whose confidence interval does not include zero is statistically significant with the given confidence.
Finally, in Step 5 we assess the model quality by means of the coefficient of determination R². This is done by comparing the unexplained variation (SSE) with the total variation (SST), being a measure of goodness of fit for the additive model in Equation 4.8. The closer R² is to 1, the better the fitted model.
The general procedure to perform a 2^k r design, for any values of k and r, is presented in Algorithm 1.
Algorithm 1 Factorial Design Procedure.
function FactorialDesign
    Step 1: Estimate model parameters (i.e., grand mean and factor effects)
        q0 ← (1 / (2^k r)) ∑_{i=1}^{2^k} ∑_{j=1}^{r} y_ij
        q_f ← (1 / 2^k) ∑_{i=1}^{2^k} x_fi y_i, where f ∈ [1, 2^k − 1] and y_i = (1/r) ∑_{j=1}^{r} y_ij
    Step 2: Compute total variation as well as variation due to each factor and to experimental errors
        SS_f ← 2^k r q_f², where f ∈ [1, 2^k − 1]
        SSE ← ∑_{i=1}^{2^k} ∑_{j=1}^{r} e_ij², where e_ij = y_ij − y_i
        SST ← ∑_{f=1}^{2^k − 1} SS_f + SSE
    Step 3: Compute percentage of variation each factor/error is responsible for
        P_f ← (SS_f / SST) × 100, where f ∈ [1, 2^k − 1]
        P_E ← (SSE / SST) × 100
    Step 4: Compute confidence intervals of the effects
        RMSE ← sqrt(SSE / (2^k (r − 1)))
        s_f ← RMSE / sqrt(2^k r), where f ∈ [1, 2^k − 1]
        CI_f ← q_f ± t[1 − α/2; 2^k (r − 1)] · s_f, where f ∈ [1, 2^k − 1]
    Step 5: Assess model accuracy by the coefficient of determination
        R² ← 1 − SSE / SST
end function
³ In a general 2^k r design, the degrees of freedom of the experimental errors is given by 2^k (r − 1).
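The five steps can be sketched in plain Python for the k = 2 case used in the illustration (confidence intervals omitted for brevity; all names are ours):

```python
def factorial_design_22r(y):
    """2^2 r factorial design (k = 2). y maps (xA, xB) in {-1,+1}^2 to the
    list of its r replicated responses. Returns the grand mean q0, the
    effects (qA, qB, qAB), the percentage of variation explained by each
    factor and by errors, and R^2."""
    configs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    r = len(y[configs[0]])
    ybar = {c: sum(y[c]) / r for c in configs}           # per-configuration mean
    q0 = sum(ybar.values()) / 4
    qA = sum(xa * ybar[(xa, xb)] for xa, xb in configs) / 4
    qB = sum(xb * ybar[(xa, xb)] for xa, xb in configs) / 4
    qAB = sum(xa * xb * ybar[(xa, xb)] for xa, xb in configs) / 4
    # SS_f = 2^k r q_f^2; with the full model, the estimate for config i is ybar_i
    ss = {"A": 4 * r * qA ** 2, "B": 4 * r * qB ** 2, "AB": 4 * r * qAB ** 2}
    sse = sum((yij - ybar[c]) ** 2 for c in configs for yij in y[c])
    sst = sum(ss.values()) + sse
    pct = {f: 100 * v / sst for f, v in ss.items()}
    pct["E"] = 100 * sse / sst
    return q0, (qA, qB, qAB), pct, 1 - sse / sst
```

For a purely additive response with no error, the effects are recovered exactly and R² = 1.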
4.3.2 Applying the 2^k r Design in the Characterization of Temporal Effects
In this section we describe how the 2^k r design can be applied to quantify the impact of temporal aspects on ADC algorithms, considering different datasets. As we focus on three different temporal aspects, namely the class distribution temporal variation (CD), the term distribution temporal variation (TD) and the class similarity temporal variation (CS), our experimental design takes k = 3 factors. The two levels considered for each factor, which we call "lower" and "upper" levels, defined below, refer to the degree of temporal variation observed on it. Given a reference dataset and an ADC algorithm, the goal is to partition the document set into 2³ groups corresponding to all possible factor-level configurations, and then evaluate the algorithm for each configuration, considering the grouped documents. We then apply the 2^k r design procedure, described in Algorithm 1, to quantify the effect of each factor and inter-factor interaction on the effectiveness of the ADC algorithm. The response variable y is thus the classification effectiveness, which is here assessed by the commonly used F1 measure. F1 is the harmonic mean between the precision p and the recall r, given by:

F1 = 2pr / (p + r),

where precision is the percentage of documents assigned by the classifier to class c_i that were correctly classified, and recall is the percentage of documents belonging to class c_i that were correctly classified.⁴
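A minimal helper for this overall (pooled) F1, assuming true-positive, false-positive and false-negative counts aggregated across all classes:

```python
def micro_f1(tp, fp, fn):
    """Overall F1 from counts pooled across all classes:
    harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```
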
For each configuration, we run a number r of replications following a cross-validation strategy, commonly adopted by the machine learning community. There are, at least, two usual approaches for doing so: K-fold cross validation and repeated random sub-sampling. K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. Repeated random sub-sampling consists of randomly selecting a fraction of documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set. This is performed for each replication. Since in K-fold cross validation the size of the folds depends on the number of iterations, it becomes more suitable for medium/large sized datasets, while repeated random sub-sampling is usually adopted for small sized datasets when the number of replications is large.

⁴ The described F1 measure corresponds to the overall performance of the methods across all classes. Using a per-class variation of the measure (also known as Macro-F1) would imply having to consider another parameter in the analysis: the class imbalance. In order to focus our analysis on the time-related factors, the goal of the present study, we would have to isolate or control this parameter. However, possible solutions to isolate it (for example, under- or oversampling; Lin et al. 2009; Liu et al. 2007) are typically very hard to perform in practice without affecting the temporal factors, which ultimately could compromise our study. Thus, we leave for future work the consideration of this metric in our experimental design. We note, however, that, in the absence of very skewed class distributions, both variations of the F1 metric tend to produce compatible results.
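The repeated random sub-sampling protocol described above can be sketched as follows (K-fold is omitted; parameter names are illustrative):

```python
import random

def repeated_random_subsampling(docs, test_fraction, r, seed=0):
    """Yield r (train, test) splits: each test set is drawn without
    replacement; the remaining documents form the training set."""
    rng = random.Random(seed)
    n_test = int(len(docs) * test_fraction)
    for _ in range(r):
        test_idx = set(rng.sample(range(len(docs)), n_test))
        yield ([d for i, d in enumerate(docs) if i not in test_idx],
               [docs[i] for i in sorted(test_idx)])
```
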
One challenge in building our factorial design is how to define the 2³ groups of documents. For that, we must quantify the temporal variation of each factor in the set of documents of the reference dataset, define the two levels for each factor and, based on them, group the documents according to all possible factor-level combinations. The following three sections (Sections 4.3.2.1-4.3.2.3) describe how we performed these steps for the CD, TD and CS factors. Note that, since CD and CS relate exclusively to the characteristics of the class to which a document belongs, we define the CD and CS levels associated with a document based on the corresponding values of its class. TD, on the other hand, relates to the relationships among terms and classes. Thus, in order to define the TD level associated with a document, isolating this factor from the others, we adopt a finer grained approach that analyzes the document's contents. After defining the factor levels, we discuss a few other aspects that require attention to avoid misleading results (Section 4.3.2.4).
4.3.2.1 Class Distribution: Lower and Upper Levels
Let C and P be the sets of classes and points in time observed in the reference dataset, respectively. To isolate the class distribution effect into lower and upper levels, we consider the relative sizes of the classes (i.e., the fraction of the dataset documents assigned to the classes) at each point in time p ∈ P. For each class c ∈ C, we compute the coefficient of variation CV (that is, the ratio of the standard deviation to the mean) of the relative size of c over all values of p. The CV is used since it is dimensionless and scale invariant, and thus more appropriate to deal with temporal changes in class distribution observed in the reference dataset.

We then partition the documents into two groups based on a given threshold δ_CD: those whose classes present CV values lower than δ_CD are assigned to the "lower" group (CD↓, with associated variable x_CD = −1), while those whose classes present CV values higher than δ_CD are assigned to the "upper" group (CD↑, with associated variable x_CD = +1). We defer to Section 4.3.3 the details regarding how we define the δ_CD threshold.
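The CV-based grouping can be sketched as below (we use the population standard deviation; the text does not specify which variant):

```python
from statistics import mean, pstdev

def cd_levels(class_sizes, delta_cd):
    """class_sizes: class -> list of relative sizes, one per point in time.
    Returns class -> +1 (upper CD group) or -1 (lower CD group), comparing
    the coefficient of variation CV = stddev/mean against delta_cd."""
    return {c: (1 if pstdev(sizes) / mean(sizes) > delta_cd else -1)
            for c, sizes in class_sizes.items()}
```

Every document then inherits the level of its class.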
4.3.2.2 Term Distribution: Lower and Upper Levels
We determine the TD level to which a document belongs by computing the document
stability level, which is characterized by the density of the document's terms that are stable.
In order to assess the stability of a given term, we use the concept of stability period
(Rocha et al., 2008).
40 CHAPTER 4. A QUANTITATIVE ANALYSIS OF TEMPORAL EFFECTS ON ADC
Definition 1 (Stability period) Let DF(t, c, p) be the number of documents belonging to
class c ∈ C that contain term t and that were created at the point in time p ∈ P. A stability
period S_{t,p_r} of a term t, considering p_r ∈ P as the reference point in time, is the set of
points in time p present in the largest continuous period of time,⁵ starting from p_r and
growing both to the past and to the future, until there exists some class c such that

    DOMINANCE(t, c, p) = DF(t, c, p) / Σ_{c′ ∈ C} DF(t, c′, p) > α,

for some predefined 0 < α ≤ 1.
We characterize the stability of a term t, regarding a reference point in time p_r, by the
term stability level (TSL), defined as:

    TSL(t, p_r) = |StabilityPeriod(t, p_r)| / |P|

We then use the TSL to estimate the document stability level (DSL) of a given document d.
Let p be the point in time when d was created. We define the DSL of d as:

    DSL(d) = Σ_{t ∈ d} TSL(t, p) / |{t′ | t′ ∈ d}|

As we can observe, 0 ≤ DSL(d) ≤ 1, where the lower bound (DSL(d) = 0) occurs for
documents without stable terms, and the upper bound (DSL(d) = 1) occurs for documents
composed only of terms t with maximal TSL(t, p) (that is, terms whose stability periods
have maximum duration regarding the time when d was created).
The documents are then partitioned into two groups: those with DSL lower than a
pre-defined threshold δ_TD are assigned to the "lower" group (TD↓, with associated
variable x_TD = −1), and the remaining documents are assigned to the "upper" group
(TD↑, with associated variable x_TD = +1). Again, we defer the definition of this
threshold to Section 4.3.3.
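The stability period, TSL and DSL computations can be sketched as follows. This is one minimal reading of Definition 1, in which a point in time belongs to the period while no class dominates the term; the DF counts below are a hypothetical toy example.

```python
def dominance(df, t, c, p, classes):
    """DOMINANCE(t, c, p) = DF(t, c, p) / Σ_{c'} DF(t, c', p)."""
    total = sum(df.get((t, c2, p), 0) for c2 in classes)
    return df.get((t, c, p), 0) / total if total else 0.0

def stability_period(df, t, pr, points, classes, alpha=0.5):
    """Largest continuous run of points around pr in which no class
    dominates term t (one reading of Definition 1)."""
    ordered = sorted(points)

    def stable(p):
        return all(dominance(df, t, c, p, classes) <= alpha for c in classes)

    if not stable(pr):
        return set()
    i = ordered.index(pr)
    period = {pr}
    j = i - 1                      # grow towards the past
    while j >= 0 and stable(ordered[j]):
        period.add(ordered[j]); j -= 1
    j = i + 1                      # grow towards the future
    while j < len(ordered) and stable(ordered[j]):
        period.add(ordered[j]); j += 1
    return period

def tsl(df, t, pr, points, classes, alpha=0.5):
    """Term stability level: |StabilityPeriod(t, pr)| / |P|."""
    return len(stability_period(df, t, pr, points, classes, alpha)) / len(points)

def dsl(doc_terms, p, df, points, classes, alpha=0.5):
    """Document stability level: average TSL of the document's distinct terms."""
    terms = set(doc_terms)
    return sum(tsl(df, t, p, points, classes, alpha) for t in terms) / len(terms)
```

For a term split evenly between two classes at times 1 and 2 but exclusive to one class at time 3, the stability period around p_r = 1 is {1, 2} and the TSL is 2/3.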
4.3.2.3 Class Similarity: Lower and Upper Levels
The "lower" group (CS↓, with associated variable x_CS = −1) is composed of documents
whose classes are more stable in terms of their similarities with other classes during the
whole period covered in the reference dataset. Accordingly, the "upper" group (CS↑, with
associated variable x_CS = +1) is composed of documents whose classes present higher
variability in their similarities with other classes. To quantify this variability for a class c,
we first compute the similarities sim(V_{c,p}, V_{c′,p}), where c, c′ ∈ C, c ≠ c′, and
V_{c,p} denotes c's vocabulary at the point in time p ∈ P. The vocabulary of a class c at
time p consists of the top-K terms with highest Information Gain (Forman, 2003) in c at
that time. We then compute the coefficient of variation CV of the (|C| − 1)|P| pooled
similarities.⁶ We separate the documents into two groups based on the CV values of their
classes and on a pre-defined threshold δ_CS, which will be further discussed in Section 4.3.3.

⁵ We consider the same definition of stability period as Rocha et al. (2008), adopting a
continuous period of time due to computational feasibility: considering non-continuous
intervals increases the search space exponentially with the number of points in time (2^|P|
possible intervals to be considered). This is a safe decision because, as we can observe in
Figure 4.3, the variations observed in the relationships between terms and classes are
smooth (that is, we do not observe any abrupt steps in the curves).
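A sketch of the CS computation follows. Note that the Jaccard similarity used here is an illustrative stand-in for sim(V_{c,p}, V_{c′,p}), and the toy vocabularies are hypothetical.

```python
from statistics import mean, pstdev

def jaccard(a, b):
    """Set similarity, used here as a stand-in for sim(V_{c,p}, V_{c',p})."""
    return len(a & b) / len(a | b) if a | b else 0.0

def class_similarity_cv(vocab, classes, points):
    """CV of the (|C| − 1)·|P| pooled similarities of each class.

    vocab[(c, p)]: set with the top-K terms of class c at time p."""
    cv = {}
    for c in classes:
        pooled = [jaccard(vocab[(c, p)], vocab[(c2, p)])
                  for p in points for c2 in classes if c2 != c]
        m = mean(pooled)
        cv[c] = pstdev(pooled) / m if m > 0 else 0.0
    return cv

# Toy vocabularies: class "b" converges towards class "a" over time,
# so both classes show variable pairwise similarities (higher CV).
vocab = {("a", 1): {"x", "y"}, ("b", 1): {"x", "z"},
         ("a", 2): {"x", "y"}, ("b", 2): {"x", "y"}}
cv = class_similarity_cv(vocab, classes=["a", "b"], points=[1, 2])
```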
4.3.2.4 Other Challenging Aspects
As a requirement for a well-conducted experimental design, we must control the
parameters that may influence the responses but are not the target of the analysis (i.e., are
not treated as factors in the design). One such parameter is the sampling effect,
characterized by the differences in classification effectiveness obtained by varying the size
of the training set. As is well known, the larger the training set used by supervised learning
strategies, the more information becomes available to build the classification model, which
ultimately influences the effectiveness of the classifier. If we neglect this matter, and
consider different training set sizes for each factor-level combination, we may mask the
actual impact of the temporal effects on the ADC algorithms. Clearly, we must isolate the
sampling effect to remove its influence on the response variable. Therefore, for each
experimental replication, we randomly selected the same number of documents for each of
the 2^k partitions, according to the size of the smallest partition. This ensures training sets
with equal sizes across all factor-level combinations, thus isolating the sampling effect.
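The equal-size sampling step can be sketched as follows; the partition names and sizes are illustrative, not the actual ones.

```python
import random

def equalize_partitions(partitions, seed=0):
    """Sample the same number of documents from every factor-level partition,
    given by the size of the smallest one, isolating the sampling effect."""
    rng = random.Random(seed)
    n = min(len(docs) for docs in partitions.values())
    return {name: rng.sample(docs, n) for name, docs in partitions.items()}

# Hypothetical partitions with unequal sizes.
partitions = {"CD-low/TD-low": list(range(40)),
              "CD-low/TD-high": list(range(25)),
              "CD-high/TD-low": list(range(60)),
              "CD-high/TD-high": list(range(32))}
balanced = equalize_partitions(partitions)
```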
One important dataset-dependent aspect is that the documents and classes in the reference
dataset must fill all 2³ groups to enable us to conduct the proposed experimental design.
However, in some cases, as in the reference datasets analyzed here, this might not hold,
particularly due to combinations regarding the CD and CS factors (see discussion in
Section 4.3.3). In such cases, we are not able to isolate and simultaneously analyze all
three temporal factors. To overcome this issue, and yet provide valuable insights about the
temporal effects, we propose a pairwise approach, consisting of two 2²r designs, referred
to as CD×TD and CS×TD. This decision comes with a cost, as we are not able to analyze a
possible interaction between CD and CS. However, as we will see in the next section, these
two factors are typically very correlated. Thus, analyzing them in separate experimental
designs might still be worthwhile.

⁶ Note that, as in Section 4.3.2.1, we here use the CV metric to characterize temporal
variations in class similarity. This is in contrast to Section 4.2, where, following Mourão
et al. (2008) strictly, we characterize the class similarity variation using the standard
deviation of the pooled similarities, a metric that depends on the unit and scale of the
measurements.
The first experimental design, CD×TD, aims at analyzing the impact of CD, TD and their
interaction on the classification effectiveness achieved by the four algorithms in the three
reference datasets. The second one, referred to as CS×TD, allows one to quantify the
impact of CS, TD and their interaction. For both designs, all documents of each reference
dataset are divided into four partitions, with the same number of documents randomly
sampled at each replication, thus covering all possible factor-level combinations:
{CD↓TD↓, CD↓TD↑, CD↑TD↓, CD↑TD↑} for the former and {CS↓TD↓, CS↓TD↑,
CS↑TD↓, CS↑TD↑} for the latter, where ↓ and ↑ denote the "lower" and "upper" levels,
respectively.
4.3.3 Quantifying the Impact of Temporal Effects on ADC
In this section, we present how we applied the proposed methodology to quantify the
impact of temporal effects on ADC using, as experimental workload, the four ADC
algorithms (Rocchio, KNN, Naïve Bayes and SVM) and the three textual datasets
(ACM-DL, MEDLINE and AG-NEWS) presented in Section 4.1. In other words, we
performed a series of experiments, following the proposed methodology, for each
combination of ADC algorithm and reference dataset. As the number of available
documents in all three datasets is not enough to cover all 2³ partitions, we adopted the
strategy described in Section 4.3.2.4, conducting two separate 2²r designs in each case.
Recall that, by Definition 1, in order to define the TD↓ and TD↑ document groups, we
must determine the dominance threshold α to compute the stability periods. Different
values of α were evaluated and, as they lead to similar results, we fixed α = 50%, ensuring
that the terms will have a high degree of exclusivity with a single class. Furthermore, as
described in Section 4.1, the KNN and SVM classifiers have some tuning parameters. In
particular, one must define the number of nearest neighbors to be considered (parameter K)
to use KNN. The SVM parameters depend on the kernel function used. We chose the RBF
kernel function, since it yielded more stable results across replications than its linear
counterpart. For this classifier, the tuning parameters are the cost C of misclassification and
the shape of the RBF kernel function (parameter γ). All parameters were calibrated with a
cross-validation performed over the training set. We used the LibSVM implementation
(Chang and Lin, 2001), employing a one-against-one procedure to adapt the binary SVM
to the multi-class scenario, since this is the case in our reference datasets.
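For instance, this calibration step could be sketched with scikit-learn's SVC, which also wraps LibSVM and adopts a one-against-one scheme for multi-class problems. The grid values and the synthetic data are illustrative, not the dissertation's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a (training) document representation with 3 classes.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)

# Cross-validation over the training set to pick the cost C and the RBF shape γ.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
best_C, best_gamma = grid.best_params_["C"], grid.best_params_["gamma"]
```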
Next, we discuss the experimental design conducted for each reference dataset.
4.3. EXPERIMENTAL DESIGN 43
4.3.3.1 ACM-DL
The first step is to partition the ACM-DL documents into four groups for the CD×TD
design and four other groups for the CS×TD design. We do so by first partitioning them
into one pair of groups for each design, using the δ_CD and δ_CS thresholds. We set δ_CD
as the average CV of class sizes (Section 4.3.2.1), computed across all classes. Similarly,
we set δ_CS as the average CV of pooled similarities (Section 4.3.2.3). These thresholds,
along with the CV values of individual classes, are shown in Figure 4.4. We note that, to
compute δ_CS, we disregarded the CV associated with the General Literature class (id 0),
since, as shown in Figure 4.4b, it is significantly larger than the CVs of the other classes.
We believe that, for an initial assessment, this decision might not significantly impair our
analysis. The documents of class 0 were then assigned to the CS↑ partition.
Analyzing Figure 4.4, we can further understand why the ideal 2³ design could not be
conducted on the ACM-DL dataset. Let C_CD↑, C_CD↓, C_CS↑ and C_CS↓ denote the
sets of classes in partitions CD↑, CD↓, CS↑ and CS↓, respectively. As we can observe,
C_CD↑ ∩ C_CS↓ = ∅, whereas |C_CD↓ ∩ C_CS↑| = 1. As we need at least two classes in
each partition to proceed with the classification task, there are not enough documents to
fill all the cells of the ideal 2³r design. Figure 4.4 also shows that 3 out of the 4 classes
with high CS also present high CD, and all classes with low CS also have low CD. In
other words, there is a high correlation between these two factors, which supports our
decision to ignore a possible interaction between them, decoupling the analysis into two
separate 2²r factorial designs.
(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.4: Determining the Lower and Upper Levels of CD and CS — ACM-DL.
Next, we further subdivide each CD-based document partition according to the TD factor,
using the δ_TD threshold (Section 4.3.2.2). We do the same for the CS-based document
partitions. We set δ_TD equal to the average DSL value across all documents in each
partition. Figure 4.5 shows the distribution of DSL values and the δ_TD for each partition.
Documents from the CD↓ (or CS↓) partition with DSL lower than the corresponding δ_TD
are assigned to the CD↓TD↓ (or CS↓TD↓) group, whereas those with DSL higher than
δ_TD are assigned to the CD↓TD↑ (or CS↓TD↑) group. The same applies to the
documents from CD↑ and CS↑.
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.5: Determining the Lower and Upper Levels of TD — ACM-DL.
Recall that a 2^k r design requires r replications to be performed for each configuration
and, as discussed in the previous section, this can be achieved by employing either K-fold
cross-validation or repeated random sub-sampling. Due to the small size of the ACM-DL
dataset and the use of sampling to isolate the sampling effect, we use the repeated random
sub-sampling strategy, selecting 50% of the documents to compose the test set and
retaining the remainder for the training set. We performed r = 50 replications.
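The replication protocol can be sketched as follows; this is a generic sketch in which the classifier evaluation inside each replication is omitted.

```python
import random

def random_subsampling_splits(docs, r=50, test_fraction=0.5, seed=0):
    """Yield r independent train/test splits (repeated random sub-sampling)."""
    rng = random.Random(seed)
    n_test = int(len(docs) * test_fraction)
    for _ in range(r):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Toy usage: 3 replications over 10 documents, 50/50 split.
splits = list(random_subsampling_splits(list(range(10)), r=3))
```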
Table 4.5 shows the results of both factorial designs (CD×TD and CS×TD) for each ADC
algorithm (first column). For better presentation, we represent CD (CS) as A and TD as B
for the CD×TD (CS×TD) design. For each algorithm and design, the "%Var" row lists the
percentage of variation in classification effectiveness that can be explained by each effect
q_f (f ∈ {A, B, AB}) and by experimental errors (ε). Similarly, the "Mean" row denotes
the estimated coefficients of the model, capturing the "average" impact of each factor:
positive values indicate an increase in classification effectiveness and negative values
indicate the opposite. Note that q_0 refers to the grand mean, computed over all
observations. The "99% CI" rows report the 99% confidence intervals associated with the
grand mean q_0 and each effect q_f (f ∈ {A, B, AB}). Intervals that include zero indicate
a statistically non-significant impact of the associated factors. Finally, the "R²" column
reports the coefficient of determination of the proposed model: values close to 1 indicate a
well-fitted model. Similar tables, referred to as ANOVA (ANalysis Of VAriance) tables,
will be used to summarize the results obtained with the other datasets as well. We leave to
Section 4.4 a detailed discussion of all results.
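The effect estimation behind these tables can be sketched for a single 2²r design as follows. This is a textbook computation in the spirit of the tables (confidence-interval estimation is omitted), and the toy responses are hypothetical.

```python
from statistics import mean

def factorial_22r(y):
    """Effect and %-of-variation estimation for a 2^2·r factorial design,
    following the model y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε.

    y[(xA, xB)]: list with the r replicated responses observed at each
    factor-level combination (xA, xB ∈ {-1, +1})."""
    combos = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]
    r = len(y[combos[0]])
    ybar = {c: mean(y[c]) for c in combos}
    q0 = mean(ybar[c] for c in combos)
    qA = mean(xa * ybar[(xa, xb)] for (xa, xb) in combos)
    qB = mean(xb * ybar[(xa, xb)] for (xa, xb) in combos)
    qAB = mean(xa * xb * ybar[(xa, xb)] for (xa, xb) in combos)
    # Each effect explains 4·r·q_f² of the total variation (SST).
    ss = {"A": 4 * r * qA**2, "B": 4 * r * qB**2, "AB": 4 * r * qAB**2}
    sse = sum((v - ybar[c])**2 for c in combos for v in y[c])
    sst = sum(ss.values()) + sse
    pct = {f: 100 * s / sst for f, s in ss.items()}
    pct["err"] = 100 * sse / sst
    return {"q0": q0, "qA": qA, "qB": qB, "qAB": qAB,
            "pct_var": pct, "R2": 1 - sse / sst}
```

For noiseless toy responses that degrade at the upper level of each factor, the estimated qA and qB come out negative, matching the sign convention of the tables.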
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA                qB                 qAB             ε       R²
Rocchio  CD×TD   %Var    −               40.41%            53.83%             1.80%           3.96%   0.96
                 Mean    64.55           −10.38            −11.98             2.19            −
                 99% CI  [62.93, 66.17]  [−12.00, −8.76]   [−13.60, −10.36]   [0.57, 3.81]    −
         CS×TD   %Var    −               50.86%            40.97%             3.72%           4.45%   0.95
                 Mean    66.86           −7.95             −7.13              −2.15           −
                 99% CI  [65.69, 68.04]  [−9.12, −6.78]    [−8.31, −5.96]     [−3.32, −0.98]  −
KNN      CD×TD   %Var    −               48.61%            44.77%             2.17%           4.45%   0.95
                 Mean    65.40           −12.90            −12.38             2.73            −
                 99% CI  [63.45, 67.34]  [−14.84, −10.95]  [−14.32, −10.43]   [0.78, 4.67]    −
         CS×TD   %Var    −               50.64%            44.16%             2.26%           2.93%   0.97
                 Mean    65.83           −9.14             −8.54              −1.93           −
                 99% CI  [64.74, 67.34]  [−10.24, −8.05]   [−9.63, −7.44]     [−3.03, −0.84]  −
NB       CD×TD   %Var    −               42.92%            51.39%             1.70%           3.98%   0.96
                 Mean    63.56           −11.41            −12.48             2.27            −
                 99% CI  [61.83, 65.29]  [−13.14, −9.68]   [−14.21, −10.75]   [0.54, 4.00]    −
         CS×TD   %Var    −               34.38%            53.90%             3.84%           7.87%   0.92
                 Mean    62.50           −5.97             −7.47              −1.99           −
                 99% CI  [61.08, 63.92]  [−7.39, −4.55]    [−8.90, −6.05]     [−3.42, −0.57]  −
SVM      CD×TD   %Var    −               64.67%            33.48%             0.21%           1.64%   0.98
                 Mean    58.44           −17.16            −12.35             0.98            −
                 99% CI  [57.08, 59.80]  [−18.52, −15.80]  [−13.71, −10.98]   [−0.38, 2.34]   −
         CS×TD   %Var    −               61.51%            31.28%             3.99%           3.22%   0.97
                 Mean    60.98           −10.81            −7.71              −2.75           −
                 99% CI  [59.75, 62.21]  [−12.04, −9.58]   [−8.94, −6.48]     [−3.98, −1.52]  −

Table 4.5: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for ACM-DL
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
4.3.3.2 MEDLINE
In order to build the four document partitions of each factorial design for the MEDLINE
dataset, we follow the same strategy adopted for ACM-DL. First, we partition the
documents regarding the CD and CS factors, setting δ_CD and δ_CS as the average CV
measures computed for each factor. These partitions and corresponding thresholds are
shown in Figure 4.6. As with the General Literature class in ACM-DL and the CS-based
partition, we here choose to ignore the Aids class (id 0) to define δ_CD, assigning its
documents to the CD↑ partition. We then further subdivide each of these partitions
according to the TD factor, setting δ_TD as the average DSL of all documents in each
partition, as depicted in Figure 4.7.

(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.6: Determining the Lower and Upper Levels of CD and CS — MEDLINE.
Note that, according to Figure 4.6, C_CD↑ ∩ C_CS↓ = ∅ and |C_CD↓ ∩ C_CS↑| = 1.
Thus, the argument for the unfeasibility of a three-factor experimental design applied to
ACM-DL also holds for MEDLINE. However, the figure also shows that 3 out of the 4
classes with high CS also have high CD, and all classes with low CS also have low CD.
Thus, once again, there is a high correlation between both factors, motivating our approach
to decouple the three-factor design into two independent 2-factor analyses.
Since the MEDLINE dataset is quite large (over 800 thousand documents), we are able to
replicate each experiment by performing a 10-fold cross-validation, as the test set is
sufficiently large to achieve stable results among the replications. The results achieved
with both factorial designs (CD×TD and CS×TD), considering each ADC algorithm, are
summarized in Table 4.6, and will be analyzed in Section 4.4.
4.3.3.3 AG-NEWS
Finally, the same overall procedure is also adopted to build the two 2²r experimental
designs for AG-NEWS. We partition the documents with respect to the CD and CS factors,
using δ_CD and δ_CS values equal to the corresponding average CV values, as shown in
Figure 4.8. Once again, we choose to disregard the Top Stories class (id 10) from the
δ_CD computation, as it
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.7: Determining the Lower and Upper Levels of TD — MEDLINE.
presents a much higher CV in comparison to the other 9 classes. We assign its documents
to the CD↑ partition. Regarding the computation of δ_CS, Figure 4.8b shows that, unlike
in the previous cases, not one but two classes, namely Top Stories (id 10) and Italia (id 9),
have much larger average CV values. Instead of disregarding both measures, we adopt a
different strategy to deal with these large deviations, smoothing their impact on the final
computation. We first compute an average CV across classes 9 and 10; let it be CV_{9,10}.
We then take as δ_CS the overall average computed over the average CVs of all remaining
classes (shown in Figure 4.8b) and CV_{9,10}.
Similarly to the other two datasets, Figure 4.8 shows that the number of documents in
AG-NEWS is not enough to fill all partitions of the ideal 2³ experimental design, as
|C_CD↓ ∩ C_CS↑| = 1. Thus, the lack of enough samples to fill the CD↓CS↑ partition
prevents us from performing a complete three-factor design. The figure also shows that 2
out of the 3 classes with high CS also have high CD, and 3 out of the 4 classes with low
CS also have low CD, indicating, once again, that both factors are very correlated.
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA               qB               qAB             ε       R²
Rocchio  CD×TD   %Var    −               92.57%           5.77%            0.85%           0.80%   0.99
                 Mean    79.97           −6.28            −1.57            0.60            −
                 99% CI  [79.68, 80.26]  [−6.57, −5.99]   [−1.86, −1.28]   [0.31, 0.89]    −
         CS×TD   %Var    −               87.96%           11.15%           0.00%           0.89%   0.99
                 Mean    81.59           −4.41            −1.57            −0.01           −
                 99% CI  [81.37, 81.81]  [−4.63, −4.19]   [−1.79, −1.35]   [−0.23, 0.21]   −
KNN      CD×TD   %Var    −               72.24%           25.68%           0.24%           1.84%   0.98
                 Mean    84.95           −3.48            −2.08            0.20            −
                 99% CI  [84.67, 85.22]  [−3.76, −3.20]   [−2.35, −1.80]   [−0.08, 0.48]   −
         CS×TD   %Var    −               76.84%           20.17%           0.36%           2.63%   0.97
                 Mean    84.39           −3.86            −1.98            −0.27           −
                 99% CI  [84.03, 84.74]  [−4.21, −3.50]   [−2.33, −1.62]   [−0.62, 0.09]   −
NB       CD×TD   %Var    −               76.01%           22.69%           0.01%           1.28%   0.99
                 Mean    86.49           −4.18            −2.28            0.06            −
                 99% CI  [86.22, 86.76]  [−4.45, −3.91]   [−2.55, −2.01]   [−0.21, 0.33]   −
         CS×TD   %Var    −               62.93%           35.70%           0.17%           1.20%   0.99
                 Mean    87.67           −2.57            −1.94            −0.13           −
                 99% CI  [87.49, 87.85]  [−2.75, −2.39]   [−2.11, −1.76]   [−0.31, 0.04]   −
SVM      CD×TD   %Var    −               76.33%           22.45%           0.26%           0.96%   0.99
                 Mean    86.19           −4.75            −2.58            −0.28           −
                 99% CI  [85.92, 86.45]  [−5.02, −4.49]   [−2.84, −2.31]   [−0.54, −0.01]  −
         CS×TD   %Var    −               61.90%           35.14%           0.93%           2.03%   0.98
                 Mean    87.90           −2.68            −2.02            −0.33           −
                 99% CI  [87.66, 88.14]  [−2.92, −2.44]   [−2.26, −1.78]   [−0.57, −0.09]  −

Table 4.6: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for MEDLINE
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
(a) Class Distribution Variation (CD); (b) Class Similarity Variation (CS)
Figure 4.8: Determining the Lower and Upper Levels of CD and CS — AG-NEWS.
We replicate each experiment by performing a 10-fold cross-validation since, similarly to
MEDLINE, AG-NEWS is also a very large dataset. Table 4.7 summarizes the results,
which are discussed in the next section.
(a) Low CD; (b) High CD; (c) Low CS; (d) High CS
Figure 4.9: Determining the Lower and Upper Levels of TD — AG-NEWS.
4.4 Discussion
Having presented our methodology to analyze the impact of temporal effects on ADC
algorithms and illustrated its applicability to four algorithms and three reference datasets,
we now discuss our results, reported in Tables 4.5–4.7. Recall that, when analyzing the
results of a specific experimental design, the impact of each factor on the response variable
is captured by the percentage of variation explained by it ("%Var" in the ANOVA tables).
However, when comparing results across different designs, as we do here, it is important
also to analyze the effect q_f associated with each factor f and its relative impact on the
grand mean q_0. Since the total variation of the responses (SST) may vary across different
designs, the relative impact of each q_f on the grand mean q_0 allows a fairer comparison
of the impact of each factor on the results across the designs. Ultimately, it represents the
extent to which classification effectiveness improves or degrades, depending on the sign of
q_f, when factor f is at its higher or lower level.
Analysis of Variance (ANOVA)
Model: y = q0 + qA·xA + qB·xB + qAB·xA·xB + ε

ADC      Design  Row     q0              qA                 qB                qAB              ε       R²
Rocchio  CD×TD   %Var    −               55.56%             44.09%            0.30%            0.06%   0.99
                 Mean    77.12           −11.17             −9.95             0.82             −
                 99% CI  [76.94, 77.30]  [−11.35, −10.99]   [−10.13, −9.77]   [0.64, 1.00]     −
         CS×TD   %Var    −               51.11%             46.69%            2.10%            0.09%   0.99
                 Mean    76.83           −10.52             −10.05            2.13             −
                 99% CI  [76.61, 77.05]  [−10.74, −10.30]   [−10.28, −9.83]   [1.91, 2.35]     −
KNN      CD×TD   %Var    −               80.18%             18.56%            1.21%            0.06%   0.99
                 Mean    84.29           −11.24             −5.41             −1.38            −
                 99% CI  [84.14, 84.44]  [−11.38, −11.09]   [−5.55, −5.26]    [−1.53, −1.23]   −
         CS×TD   %Var    −               80.29%             18.63%            1.01%            0.06%   0.99
                 Mean    85.57           −10.43             −5.02             −1.17            −
                 99% CI  [85.43, 85.72]  [−10.57, −10.28]   [−5.17, −4.88]    [−1.32, −1.03]   −
NB       CD×TD   %Var    −               68.16%             31.78%            0.003%           0.06%   0.99
                 Mean    82.86           −10.12             −6.91             −0.07            −
                 99% CI  [82.71, 83.01]  [−10.27, −9.97]    [−7.06, −6.76]    [−0.22, 0.08]    −
         CS×TD   %Var    −               81.05%             18.67%            0.19%            0.09%   0.99
                 Mean    84.61           −10.71             −5.14             −0.53            −
                 99% CI  [84.42, 84.78]  [−10.89, −10.53]   [−5.32, −4.96]    [−0.70, −0.34]   −
SVM      CD×TD   %Var    −               80.98%             17.76%            1.15%            0.11%   0.99
                 Mean    85.91           −10.32             −4.83             −1.23            −
                 99% CI  [85.71, 86.10]  [−10.51, −10.12]   [−5.03, −4.64]    [−1.42, −1.03]   −
         CS×TD   %Var    −               83.37%             14.54%            2.00%            0.09%   0.99
                 Mean    87.75           −9.63              −4.02             −1.49            −
                 99% CI  [87.58, 87.90]  [−9.78, −9.47]     [−4.18, −3.86]    [−1.65, −1.33]   −

Table 4.7: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for AG-NEWS
(CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).
We start with two general observations. First, across all reference datasets and ADC
algorithms, our experimental designs are successful in isolating the parameters that are the
target of the study: the analyzed temporal effects explain the vast majority of the variations
observed in the results. Indeed, the percentages of variation left unexplained, and thus
credited to experimental errors (column ε), are under 8%, 3% and 1% for ACM-DL,
MEDLINE and AG-NEWS, respectively. The larger variations left unexplained for the
ACM-DL dataset are possibly due to the fact that this dataset is much smaller than the
other two (small sample sizes incur greater variability). However, as we can observe in
Table 4.5, the percentages of variation credited to experimental errors are inferior to the
percentages credited to the temporal factors. Consistently, the coefficient of determination
R² is above 0.95 in most cases.
Our second general observation is that the percentages of variation explained by the
secondary factors (column q_AB), i.e., the interactions between CD and TD for the
CD×TD design and between CS and TD for the CS×TD design, are very small across all
datasets and algorithms, falling below 4%, 1% and 2.2% for the ACM-DL, MEDLINE and
AG-NEWS datasets, respectively. Indeed, the effect of this interaction is statistically
insignificant, with 99% confidence, in many of these cases (see the "99% CI" rows). If
significant, the effect associated with the interaction is often negative, implying that it
contributes to a degradation in classification effectiveness, although the magnitude of such
degradation is very small (up to 4.51%, 0.37% and 1.70%, on average, with respect to the
overall average performance reported in q_0, for ACM-DL, MEDLINE and AG-NEWS,
respectively). In a few cases, the effect of the interaction is positive, implying that it
actually contributes to improve classification effectiveness. We conjecture that this is a
side effect of the interactions between the CD and CS factors that are not captured by our
pairwise experimental designs. In other words, the positive interaction is possibly due to a
few classes having CD↓ and CS↑ in all three datasets, as argued in
Sections 4.3.3.1–4.3.3.3. Nevertheless, even when positive, the effect due to the interaction
of multiple factors is very small, with an impact on the grand mean of at most 4.17%,
0.75% and 2.77%, on average, for ACM-DL, MEDLINE and AG-NEWS, respectively.
Thus, we argue that the primary factors CD, CS and TD are the main sources of impact on
classification effectiveness across all analyzed scenarios, and we focus our discussion on
them.
In the following, we analyze specific results for each reference dataset, discussing the
overall behavior observed across all ADC algorithms in Section 4.4.1. We then discuss the
results for each specific ADC algorithm, pointing out invariants across datasets and
drawing insights into the influence of the temporal effects on each algorithm in
Section 4.4.2. Finally, in Section 4.4.3, we summarize the main implications of our
findings.
4.4.1 Impact of Temporal Effects on the Reference Datasets
We start by analyzing the relative impact of the temporal factors (q_f) on the average
effectiveness q_0 of the ADC algorithms in each dataset, given by the ratio q_f / q_0. As
Tables 4.5–4.7 show, the effects associated with the temporal factors (i.e., columns q_A
and q_B) represent an impact on the average effectiveness of the ADC algorithms (i.e.,
column q_0) that falls, on average, between 9.55% and 29.36% of q_0 in the ACM-DL
dataset, 1.92% and 7.85% of q_0 in the MEDLINE dataset, and 4.58% and 14.48% of q_0
in the AG-NEWS dataset. Thus, the impact of the temporal effects on classification is
much higher in the ACM-DL and AG-NEWS datasets than in the MEDLINE dataset. This
observation is consistent with the characterization reported in Section 4.2, which points out
a more stable behavior of MEDLINE in contrast to the more dynamic nature of ACM-DL
and AG-NEWS. It is also consistent with the qualitative analysis reported in (Mourão
et al., 2008), which showed that: (i) once a term appears, it tends to remain more stable
over time in MEDLINE than in the other two datasets, thus implying a smaller impact of
TD on the classification of the former; and (ii) the more consolidated knowledge area
captured in the MEDLINE dataset justifies the smaller impact of CD and CS on it.
To further corroborate these findings, we performed a two-sided Mann-Whitney test
(Hollander and Wolfe, 1999) to compare the coefficients of variation (CVs) of class sizes
(i.e., regarding CD) computed for the three reference datasets.⁷ Recall that the CV values
are reported in Figures 4.4a, 4.6a and 4.8a for ACM-DL, MEDLINE and AG-NEWS,
respectively. With 99% confidence, we found that the CVs of class sizes in the MEDLINE
dataset are indeed smaller than those computed for the ACM-DL and AG-NEWS datasets
(p-values of 0.001 and 0.005, respectively). Comparing the CV values computed for
ACM-DL and AG-NEWS, we found that both samples are statistically indistinguishable
(p-value of 0.24). Thus, we write CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS to
refer to the relative impact of CD in each dataset.
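Such a comparison can be sketched with SciPy's two-sided Mann-Whitney U test. The CV samples below are illustrative stand-ins for the values of Figures 4.4a and 4.6a, not the actual measurements.

```python
from scipy.stats import mannwhitneyu

# Hypothetical CV samples: a stable dataset versus a more dynamic one.
cv_medline = [0.05, 0.08, 0.06, 0.07, 0.04]
cv_acm_dl = [0.30, 0.45, 0.25, 0.50, 0.40]

stat, p_value = mannwhitneyu(cv_medline, cv_acm_dl, alternative="two-sided")
significant = p_value < 0.01  # reject equality with 99% confidence
```

With these fully separated toy samples, the exact two-sided p-value is below 0.01, so the difference would be deemed significant at the 99% level.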
The same test was performed to compare the CVs of pooled similarities (i.e., related to
CS), reported in Figures 4.4b, 4.6b and 4.8b. Once again, we found that the CV values
computed for the MEDLINE dataset are smaller than those obtained for ACM-DL and
AG-NEWS (p-values of 0.005 and 0.0006, respectively), whereas no statistical difference
was observed between the values computed for ACM-DL and AG-NEWS (p-value of
0.15). Thus, we state that CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS.
To compare the impact of TD on the three datasets, we show, in Figure 4.10, the empirical
cumulative distribution of the observed document stability level (DSL) values, for each
level of CD and CS and for each reference dataset. The curves for MEDLINE show a clear
bias towards higher DSL levels, thus indicating a smaller impact of TD. The curves for
both ACM-DL and AG-NEWS exhibit a much stronger bias towards less stable
documents, exposing the more dynamic nature of these datasets. We note that, for the
MEDLINE dataset, the bias towards more stable documents is stronger for the CD↑ and
CS↑ levels. In other words, the partitions with higher temporal variations in CD and CS
tend to have more stable documents, in comparison with the partitions with lower
variations on the two effects. This behavior is a peculiarity of the MEDLINE dataset, not
being observed in either the ACM-DL or the AG-NEWS dataset. Once again, we applied
the Mann-Whitney test, finding that the DSL values are indeed larger for MEDLINE than
for ACM-DL and AG-NEWS (p-values smaller than 10⁻⁵), and that DSL values are larger
in ACM-DL than in AG-NEWS (p-value < 10⁻⁵). Thus, we state that
TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.
Having compared the relative impact of each temporal aspect across the three datasets, we
now analyze the relative impact of the three temporal effects within each dataset. For the
ACM-DL dataset, we cannot distinguish, with 99% confidence, the relative impact of CD
(or CS) from the relative impact of TD on most ADC algorithms. One exception is
observed when SVM is used, for which the impact of TD is smaller than the impact of the
other two effects.

⁷ We chose a nonparametric test since the CV values regarding the CD aspect are not
normally distributed.

Indeed, the effect associated with TD is 38.95% (40.21%) smaller than the effect of
CD (CS). Another exception is Naïve Bayes, for which the effect associated with TD is
statistically different from, and somewhat larger (25.13%) than, the effect of CS. We reach
this finding by analyzing the 99% confidence intervals for the effects of each factor ("99%
CI" rows in Table 4.5), which show statistical ties between q_TD and q_CD (or q_CS) for
the Rocchio and KNN classifiers, as well as a statistical tie between q_TD and q_CD for
Naïve Bayes. In contrast, for MEDLINE and AG-NEWS, the impact of TD is consistently
lower than the impact of CD (or CS) on all four algorithms. For instance, in the case of
MEDLINE, the impact of TD is four times smaller than the impact of CD, and almost two
times smaller than the impact of CS, considering the Rocchio classifier. In the case of
AG-NEWS, there is also a pronounced skew between the impact of TD and the impact of
CD/CS, with cases where the impact of CD and CS is almost double the impact of TD.
Similar conclusions are reached by analyzing the percentages of variation explained by
each individual factor. These findings reveal the challenges imposed by the temporal
effects, and developing strategies to handle them in ADC algorithms emerges as a
promising research direction.
Note that, except for Naïve Bayes on the ACM-DL dataset, CD's impact on classification
is higher than TD's impact if and only if CS's impact is also higher than TD's. This should
come as no surprise given the strong positive correlation between both factors, as
discussed in Section 4.3.3. Temporal variations in class sizes directly impact temporal
variations in class vocabularies, and ultimately the similarities across classes. For instance,
if a class increases in size with time, the number of candidate terms in its vocabulary may
also increase, causing more variations in its similarities with the remaining classes. Thus,
temporal variations in class distribution contribute to variations in class similarities,
justifying the high correlation between both factors.
4.4.2 Impact of Temporal Effects on the ADC Algorithms
We now turn our attention to the impact of the temporal effects on the ADC algorithms.
As we can observe from the three ANOVA tables, all three factors have negative effects
(columns qA and qB) in all analyzed scenarios, implying that all explored ADC algorithms
are negatively impacted by the temporal effects in all three datasets. In fact, relative to the
overall average performance (q0), the effect of CD contributes to an average decrease in
classification effectiveness of as much as 29.36% (for the SVM classifier). Similarly, higher levels
of CS and TD contribute to classification degradations of as much as 17.73% and 21.13%
(also for the SVM classifier), on average. Moreover, the degradation is more significant for
the reference datasets in which the impact of the temporal effects is stronger, i.e., ACM-DL
54 CHAPTER 4. A QUANTITATIVE ANALYSIS OF TEMPORAL EFFECTS ON ADC
Figure 4.10: Cumulative Distribution Function of Document Stability Level Values. Panels: (a) Low CD, (b) High CD, (c) Low CS, (d) High CS.
and AG-NEWS, as expected. In the following, we discuss the impact on each specific
algorithm, focusing on the results for the ACM-DL dataset (Table 4.5), as it is the one most
influenced by all three temporal effects.
Starting with the Rocchio classifier, we observe that all three temporal effects
greatly impact classification effectiveness, with more than 40% of the observed variations
allocated to each of them in both experimental designs. Indeed, the factors contribute to a
significant degradation in the overall classification effectiveness in each design. For instance,
in the CD×TD design, a higher level of CD incurs an average degradation of 16.08% in the
average performance, whereas the degradation caused by a higher level of TD is 18.56%, on
average. Similarly, in the CS×TD design, the corresponding degradations due to higher levels of
CS and TD are 11.89% and 10.66%, on average. The reasons for such a significant impact on
Rocchio's performance are the following. CD and CS affect the coordinates of the centroids
learned by the Rocchio classifier: as Miao and Kamel (2011) pointed out, the centroid vector
does not take the distribution of class sizes into account, and thus may be affected by variations
in that distribution. Since the distribution of class sizes in the entire dataset may not
be the same as the corresponding distribution observed when the test document was created,
the classifier's prediction may be error prone. TD also significantly affects this classifier
since, when averaging the vectors of each class to compute the class centroids, it considers
all training points to determine the class of a test document. Thus, it may be affected by
variations in the term-class relationships.
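To make the centroid's sensitivity concrete, the sketch below is a minimal Rocchio classifier (illustrative names; the term weighting and normalization details of the actual experiments are omitted). The centroid is a plain average of the class's document vectors, so shifts over time in class sizes or in term usage move the centroids directly:

```python
import math
from collections import defaultdict

def train_rocchio(docs):
    """Rocchio: one centroid per class, the average of its documents' term
    vectors. The centroid ignores the class-size distribution, which is why
    temporal changes in that distribution (CD) perturb it."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for terms, label in docs:  # terms: {term: weight}
        counts[label] += 1
        for t, w in terms.items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}

def classify_rocchio(centroids, terms):
    """Assign the class whose centroid is most cosine-similar to the test vector."""
    def cos(u, v):
        dot = sum(u.get(t, 0.0) * w for t, w in v.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return max(centroids, key=lambda c: cos(centroids[c], terms))
```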
Similarly, the KNN and Naïve Bayes classifiers are also greatly impacted by the three
temporal effects, with more than 44% and 34%, respectively, of the total variations allocated
to each of them. Indeed, both classifiers have a bias regarding the distribution of class
sizes. In the KNN classifier, larger classes tend to have more documents in the K-neighbor
set of each test document (Tan, 2005). The Naïve Bayes classifier, in turn, tends to privilege
larger classes due to the class prior probability expressed in Equation 4.1: when the class
sizes are similar, this classifier uses a priori information to break ties, being directly affected
by variations in the distribution of class sizes. Thus, similarly to Rocchio, CD affects both
classifiers' decision boundaries: since the distribution of class sizes in the entire
dataset may not reflect the distribution when the test document was created, both classifiers
may make wrong predictions. In fact, the average degradations incurred by the CD effect
in KNN and Naïve Bayes are 19.72% and 17.95% in the average response, respectively.
Moreover, CS affects the KNN classifier (with an average decrease of 13.88% in the average
performance) because it directly perturbs the K-nearest-neighbor set: due to
differences in the pairwise class similarities, this set may be composed of classes that were
similar at different points in time. Naïve Bayes, in turn, is affected by CS (with an average
decrease of 9.55% in the average performance) because this classifier considers a somewhat
local metric to assess the relationships between terms and classes, expressed by the term
conditional probability P(t|c). As discussed in Section 4.3.2.3, estimating P(t|c) ultimately
searches a subset of the class vocabularies and, when vocabularies change over time, the
decision rule also changes. Finally, TD significantly impacts both classifiers because both of
them consider the terms present in all training points to determine the decision boundaries.
In KNN, the impact of TD takes place when building the K-neighbor set, while in Naïve
Bayes it occurs when computing the maximum likelihood estimates (MLE) from
the training set, more specifically the term conditional probabilities P(t|c). Thus, both
classifiers are also sensitive to variations in the term-class relationships.
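A minimal multinomial Naïve Bayes sketch makes the two ingredients above explicit: the class prior P(c), which favors larger classes and therefore tracks the CD effect, and the term conditional P(t|c), which tracks the TD/CS effects. Names and the Laplace smoothing choice are illustrative, not the dissertation's exact Equation 4.1:

```python
import math
from collections import defaultdict

def train_nb(docs):
    """Multinomial Naïve Bayes. docs is a list of (term_set, label) pairs."""
    tf = defaultdict(lambda: defaultdict(int))  # term frequency per class
    cdocs = defaultdict(int)                    # documents per class
    vocab = set()
    for terms, label in docs:
        cdocs[label] += 1
        for t in terms:
            tf[label][t] += 1
            vocab.add(t)
    return tf, cdocs, vocab, sum(cdocs.values())

def classify_nb(model, terms):
    tf, cdocs, vocab, n = model
    def score(c):
        total = sum(tf[c].values())
        s = math.log(cdocs[c] / n)  # prior P(c): biased towards large classes
        for t in terms:             # Laplace-smoothed P(t|c)
            s += math.log((tf[c][t] + 1) / (total + len(vocab)))
        return s
    return max(cdocs, key=score)
```

With an empty test document the decision is made by the prior alone, illustrating how the class-size distribution seeps into every prediction.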
The reader should note the peculiar behavior of the Naïve Bayes classifier on the ACM-DL
dataset regarding the impacts of the TD and CS effects. Unlike the other three classifiers,
the degradation caused by the TD effect is somewhat larger (25.13%) than the degradation
incurred by the CS effect. This can be explained by the nature of this classifier. Recall
that Rocchio, KNN and SVM are examples of discriminative classifiers, which directly
model P(c|d) by learning the class boundaries that minimize the error rate (or some correlated
measure), ultimately discriminating between classes (that is, they learn class borders).
On the other hand, as a generative classifier, Naïve Bayes learns P(c|d) indirectly, by applying
Bayes' rule and estimating both P(c) and P(d|c) (Rasmussen and Williams, 2006). In
fact, the class conditional P(d|c) is influenced by variations in the term-class relationships
(hence, by the TD and CS effects). However, as discussed in Section 4.3.2.3, unlike the TD
effect, the CS effect is bounded by the variations in the relationships between the most
informative terms and the classes as time goes by. Recall that the vocabulary Vc,p denotes the
set of terms that occurred in class c at the point in time p. Let Vc be the set of all terms that
occurred in class c, irrespective of the point in time (that is, the class vocabulary). Since Vc,p is
smaller than Vc, it is expected that in the generative case the influence of TD dominates the
influence of CS. For the discriminative case, however, minimizing the error rate (or some
correlated measure) bounds, to a certain extent, the influence of the variations in the term-class
relationships to the most discriminative terms. Thus, the discriminative classifiers are
naturally less sensitive to the TD effect than to the CS effect.
Considering SVM, CD and CS each explain more than 60% of the variations in
classification effectiveness in both experimental designs. TD, in contrast, is responsible for at
most 33% of the explained variation. Thus, CS and CD do have a more significant impact
on SVM's effectiveness than TD. The reasons for this are the following. First, variations in
the distribution of class sizes lead to boundary hyperplane skewness (see Sun et al., 2009),
potentially misleading the classification decisions when considering data distributed over
several points in time with a changing distribution. Due to the KKT condition expressed by
Equation 4.5, the increase of some αi on the positive side of the hyperplane forces an increase
of some αi on the negative side to satisfy that constraint; due to possible imbalances in the
distribution of class sizes, either of them may receive a higher value, causing the hyperplane
to be skewed towards the smaller class. Thus, clearly, CD has a strong impact on this
classifier, and so does CS, given that the two effects are strongly correlated. In contrast,
SVM has a natural robustness to the TD aspect: only the support points are taken into
account during the test phase to determine the classes, ultimately limiting the impact
of TD during the classification process. As expressed by Equation 4.4, only the points with
αi > 0 (the support points) affect the decision rule, this being the only step in which TD impacts
the classification process.
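The dual decision rule of Equation 4.4, f(x) = Σi αi yi K(xi, x) + b, can be sketched as below; only support points (αi > 0) contribute, which is the source of SVM's relative robustness to TD at test time. This is an illustrative reimplementation with hypothetical names, not the dissertation's code:

```python
def svm_decision(support, x,
                 kernel=lambda u, v: sum(a * b for a, b in zip(u, v)),
                 b=0.0):
    """SVM decision value f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    `support` is a list of (alpha_i, y_i, x_i) triples; points with
    alpha_i == 0 are skipped, so only the support points matter.
    The default kernel is the linear (dot-product) kernel.
    """
    return sum(a * y * kernel(xi, x) for a, y, xi in support if a > 0) + b
```

Dropping a non-support training point (αi = 0) leaves the decision value unchanged, whereas Rocchio, KNN and Naïve Bayes aggregate evidence from all training points.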
Turning our attention to the results for MEDLINE and AG-NEWS, reported in Tables
4.6 and 4.7, we find that, in both datasets, the impacts of CD and CS on all four ADC
algorithms are consistently higher than the impact of TD. This should come as no surprise
since, as discussed in Section 4.4.1, these datasets are more influenced by the CD/CS
effects than by TD.
We summarize our findings on the impact of the temporal effects on the four ADC
algorithms in Table 4.8, which shows a partial ordering of the algorithms with respect to
the average impact of each temporal effect, for each dataset. This ordering is determined
by taking the overall average performance of each algorithm (q0) as a baseline, analyzing the
effect associated with each factor qf (along with its corresponding 99% confidence interval),
and its relative difference to the overall average. As we can see, the Rocchio classifier is
the most affected by all three effects in both MEDLINE and AG-NEWS. SVM is also strongly
affected, particularly in ACM-DL and MEDLINE, with both CD and CS impacting
SVM more than the other three algorithms in ACM-DL. The impact of TD, on the other hand,
is approximately the same on all four algorithms in ACM-DL. These relationships reinforce
that, apart from being negatively impacted by all three temporal effects, the four explored
ADC algorithms exhibit distinct behavior when faced with datasets with specific temporal
dynamics, as revealed by the conducted factorial designs.
Temporal Effect | ACM-DL              | MEDLINE             | AG-NEWS
CD              | SVM > NB ∼ KNN ∼ RO | RO > SVM > NB > KNN | RO ∼ KNN > SVM ∼ NB
CS              | SVM > KNN ∼ RO > NB | RO > SVM ∼ NB > KNN | RO ∼ KNN ∼ NB > SVM
TD              | SVM ∼ KNN ∼ RO ∼ NB | SVM > RO ∼ NB ∼ KNN | RO > NB > KNN > SVM

Table 4.8: A Comparative Study of the Impact of the Temporal Effects on each ADC Algorithm: Rocchio (RO), SVM, Naïve Bayes (NB) and KNN.
4.4.3 Implications
The analyses performed in the previous sections provide some general guidelines regarding
the definition of requirements for strategies to deal with temporal effects in ADC. First, these
strategies should consider stable terms, since they untangle some latent structural properties
of the classes. It may be tempting to consider only training documents created at (or near)
the creation time of the test document (a window-based approach) to define class boundaries.
However, such a strategy may not be a wise choice, since it can lead to data sparseness
problems. Moreover, it may also discard valuable information regarding stable terms occurring in
training documents created at points in time other than the test document's creation time,
which may reveal discriminative evidence about the classes' structural properties. Even
when considering training documents created at different points in time with respect to the
test document, stable terms can still provide valuable information to the classifier. Such
information, however, is neglected when adopting window-based strategies. This motivates
the use of instance weighting strategies, especially when dealing with more stable datasets,
such as MEDLINE. However, in order for this strategy to be successful, the weighting
function must capture the underlying process that guides the temporal evolution of the dataset.
Furthermore, not only the stability of terms over time should be explored, but also the
variations in the distributions of class sizes and class similarities.
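The contrast between the two families of strategies above can be sketched as follows (hypothetical function names; the TWF introduced in Chapter 5 would play the role of the `twf` argument):

```python
def window_select(train, p_ref, width):
    """Window-based approach: keep only the training documents created
    within `width` time units of the test document's creation time p_ref.
    This risks data sparseness and discards stable terms from other periods.
    `train` is a list of (document, creation_time) pairs."""
    return [(doc, p) for doc, p in train if abs(p - p_ref) <= width]

def instance_weights(train, p_ref, twf):
    """Instance weighting: keep every training document, weighted by a
    temporal weighting function twf(delta) of its temporal distance to
    p_ref, so distant but stable evidence is attenuated, not discarded."""
    return [(doc, twf(p - p_ref)) for doc, p in train]
```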
Similarly, the proposed methodology to evaluate the impact of the temporal effects on
classification effectiveness provides valuable insights to better understand the behavior not
only of the considered ADC algorithms when faced with these effects, but also of strategies
aimed at overcoming them. This aspect will be taken into account when analyzing the
behavior of the temporally-aware algorithms proposed in Chapter 5. In fact, the analyses and
methodologies presented here allow us to develop a deeper understanding of the results
reported in Section 5.4.
4.5 Chapter Summary
In this chapter, we proposed a methodology, based on a series of full factorial designs, to
evaluate the impact of temporal aspects on ADC algorithms when applied to several reference
datasets. First, we extended the characterization performed by Mourão et al. (2008),
providing evidence of the existence of three temporal aspects in three textual datasets,
namely ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to
quantify the impact of the temporal aspects on the classification effectiveness of four well-known
ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.
Our characterization results show that, contrary to the assumption of a static data
distribution on which all four explored algorithms are based, each reference dataset has a specific
temporal behavior, exhibiting changes in the underlying data distribution over time. Such
temporal variations potentially limit classification performance. According to our results,
the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE dataset,
resulting in the four explored ADC algorithms being more impacted by the temporal aspects
in the first two datasets. In addition to these findings, our proposed methodology enabled
us to quantify the impact of each temporal aspect on the analyzed datasets and algorithms,
allowing us to answer the two following questions, posed in this chapter:
1. Which temporal effects are more representative in each dataset? In the ACM-DL
dataset, the impact of the observed temporal variations in the distribution of class
sizes and in the pairwise class similarities is statistically equivalent to the impact
of the observed variations in the term distribution on most classifiers (SVM being an
exception). MEDLINE and AG-NEWS, on the other hand, are clearly more impacted
by the first two temporal aspects. These findings reveal the challenges imposed by the
temporal effects and show that developing strategies to handle them in ADC algorithms is a
promising research direction.
2. What is the behavior of each ADC algorithm when faced with different levels of each
temporal aspect? All four explored ADC algorithms suffer a negative impact of the
temporal aspects in terms of classification effectiveness, with the most significant
impacts observed when these algorithms are applied to the most dynamic datasets
(i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to
the term distribution aspect, while still being impacted by the other two aspects. The
other three algorithms, on the other hand, are very sensitive to all three aspects. Thus,
the temporal dimension turns out to be an important aspect that has to be considered
when learning accurate classification models.
Chapter 5
Temporally-Aware Algorithms for
Automatic Document Classification
In this chapter we are particularly concerned with how to minimize the impact that temporal
effects may have on ADC algorithms, in light of the lessons learned in Chapter 4. As
previously discussed, the temporal dynamics of data, reflected by the quantified temporal
effects, may violate the common assumption of stationary data distributions, limiting the
performance of ADC algorithms. As we have shown, all three explored textual datasets
present varying class distributions, along with varying term distributions and pairwise class
similarities, to differing extents. Moreover, the analyzed ADC algorithms had their effectiveness
hindered by these variations (again, to different degrees). To address this issue, we
propose strategies to devise temporally-aware ADC algorithms. Recall that the class distribution
variation relates to the observed variations over time in the representativeness of classes,
whereas the term distribution variation and class similarity variation effects relate to the
observed variations over time in the term-class relationships and in the pairwise
class similarities, respectively. Similarly to the strategies adopted to isolate each factor in
the experimental designs described in Section 4.3, here the first effect is addressed on a finer-grained
basis, at the document level, while the other two effects are handled at the collection level.
In order to incorporate temporal awareness into document classifiers, we introduce a
weighting function that we call the temporal weighting function, or simply TWF, aimed at
addressing the previously quantified temporal effects. This weighting function is modeled
according to the observed evolution of the term-class relationships over time, captured by a
metric of dominance (see Section 4.3.2). We start by determining the temporal weighting
function for a collection according to its characteristics, instead of considering simple ad-hoc
functions based on the document's age, as done in previous work (see Klinkenberg 2004;
Koychev 2000). Towards this end, we offer a modeling framework that enables us to conduct
a series of statistical tests in order to derive a function that effectively models the underlying
process governing the dynamic nature of the datasets. For example, as we shall see, this
function follows a lognormal distribution for the ACM-DL and MEDLINE datasets.
As will be described in Section 5.1, deriving the TWF demands some statistical
procedures that may not be suitable for a practitioner to perform, due to the diversity
and sophistication of the tests that may be needed to determine its expression. As reported
in Section 5.1, the widely used procedures for independence and normality tests of random
variables failed when applied to the AG-NEWS dataset, since its TWF does not follow a
Gaussian process, even in the log-transformed space. In this case, some other (possibly
more complex) tests should be performed, which may be prohibitively hard for a practitioner
(who typically aims to achieve highly accurate classification without regard for the
properties of the function that underlies the data variation process, which are, incidentally,
reflected by its expression and parameters), hurting the practical applicability of the proposed
framework. Furthermore, automatic data-mining processes focused on classification may
need a fully automated way to determine the TWF. As a matter of fact, for the sake of
temporally-aware ADC, one just needs to know the positive real-valued weights associated
with each temporal distance; while the proposed statistical framework is able to
uncover the properties of the function that underlies the data variations, it may not be
applicable in the two mentioned scenarios, and strategies to overcome this are desirable. To cover
these practical scenarios, we propose an automatic approach, which removes the need to
perform the required statistical tests, in which the ADC algorithms themselves are employed to
derive the TWF.
Finally, the last step is to incorporate the TWF into the ADC algorithms. We propose
three strategies for doing so, where the weights assigned to each example depend on the
notion of a temporal distance δ. The temporal distance is defined as the difference between
the point in time p at which a training example was created and a reference point in time
pr at which the test example was created. These weights reflect the observed variability in
the data distribution, as captured by the TWF. The first strategy, named temporal weighting
in documents, weights training instances according to δ, ultimately addressing the explored
term distribution variation (TD) effect, since the TWF considers the observed variations
over time in the term-class relationships. However, as we have shown, the class distribution
variation (CD) and the class similarity variation (CS) effects also play an important role
for some datasets, and we should minimize their impact. Towards this end, we propose a
second strategy, called temporal weighting in scores, which is based on a mapping in the
class domain c ↦ 〈c, p〉, in which the training documents' classes are mapped to derived
classes 〈c, p〉, where c denotes the actual document's class and p denotes its creation point in
time. In this case, the scores (for example, similarities or probabilities) learned by a traditional
classifier applied to the modified training set are weighted according to δ = p − pr, where p
is the point in time associated with the derived class 〈c, p〉 and pr is the point in time at which
the test document was created; that is, scorec = Σp score〈c,p〉 · TWFδ. The combined
weighted scores for each class are then used to make the final classification decision. This
strategy is motivated by the fact that, when considering each point in time in isolation, we
ultimately isolate the temporal effects. But, since it is not always plausible to consider only
the documents created at the reference point in time pr (due to data scarcity), we aggregate
the obtained scores for each point in time, using the TWF to account for the variations in
the term-class relationships (the only potential effect reflected when aggregating the
intermediate scores). However, this strategy has a somewhat undesirable property related to the
mapping c ↦ 〈c, p〉. As the number of documents of the derived class 〈c, p〉 is typically
much smaller than the number of documents belonging to class c, the class imbalance artificially
increases. Since class imbalance may be harmful for document classifiers (Chen et al.,
2011), we propose a third strategy aimed at ameliorating this, namely the extended temporal
weighting in scores. In this strategy the training set D is partitioned into sub-groups of
documents Dp with the same creation point in time p. Then, a classification model is built
from each Dp in isolation. The class scores are then produced for each Dp and, as
before, they are aggregated using the TWF to weight them. By construction, the class
imbalance problem is bounded by the imbalance observed in the class distribution of each
Dp, which is usually smaller.
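The score-aggregation step shared by the two score-weighting strategies can be sketched as follows (illustrative names; `scores[(c, p)]` stands for the per-〈c, p〉 scores produced by the underlying classifier, whether trained on the mapped classes or on each partition Dp):

```python
from collections import defaultdict

def aggregate_scores(scores, p_ref, twf):
    """Temporal weighting in scores: given scores[(c, p)] for the derived
    classes <c, p>, combine them per original class c as
        score_c = sum_p scores[(c, p)] * TWF(p - p_ref),
    where p_ref is the test document's creation time. Returns the winning
    class and the combined per-class scores."""
    combined = defaultdict(float)
    for (c, p), s in scores.items():
        combined[c] += s * twf(p - p_ref)
    return max(combined, key=combined.get), dict(combined)
```

A step TWF that down-weights distant periods, for instance, lets recent evidence dominate without discarding the older scores entirely.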
The three proposed strategies were implemented in three ADC algorithms (Rocchio,
K Nearest Neighbors (KNN), and Naïve Bayes) and were evaluated using the three digital
libraries described in Section 4.1 (ACM-DL, MEDLINE and AG-NEWS). As we shall see,
we achieved significant improvements in classification effectiveness for all classifiers. For
instance, the temporally-aware version of Naïve Bayes outperformed the state-of-the-art
classifier (SVM) by up to 10%.
This chapter is organized as follows. In Section 5.1 we introduce the temporal weighting
function and describe a methodology to determine its expression and its parameters. In
Section 5.2 we propose a strategy to automatically determine the TWF, without the need
to perform any kind of statistical testing. In Section 5.3 we describe our extensions to
traditional ADC algorithms, in order to incorporate the TWF into them. We report the
experimental evaluation performed to assess the benefits of considering the temporal dimension
in ADC algorithms in Section 5.4. Finally, in Section 5.5 we summarize our findings.
5.1 Temporal Weighting Function
The potential impact that certain temporal effects have on term-class relationships may have
a great influence on the results of the classification process, as characterized in Chapter 4.
Thus, incorporating information about these changes into the ADC algorithms has the
potential to improve their effectiveness.
We address this issue through a temporal weighting function (TWF) that quantifies
the influence of a training document when classifying a test document, as a function of the
temporal distance between their creation times. We distinguish two major steps in defining
such a function: determining its expression and its parameters. The expression is usually harder to
determine, since it may express the generative process behind the function, that is, the
fundamental properties of the data variation phenomena, which can be smooth (possibly
of a linear nature), abrupt (perhaps with some exponential behavior) or even
periodic, while the parameters are usually obtained using approximation strategies.
Intuitively, given a test document to be classified, the TWF must assign higher weights to
training documents that are more similar to that test document with respect to the strength
of term-class relationships. As described in Section 4.3.2, one metric that expresses such
strength is the dominance, since the more exclusive a term is to a given predefined class, the
stronger this relationship. The simplest approach to model the function that governs such
variations would be to use a unit pulse function ⊓(δ) at temporal distance 0,

⊓(δ) = α if δ = 0,  and  ⊓(δ) = 0 if δ ≠ 0,

with the pulse magnitude α proportional to the observed term dominance associated with
the training documents created at the same point in time as the test document. However, as
argued in Section 4.4, considering a larger time interval instead of a single point in time is
better, since it better handles smooth data variations and does not discard potentially useful
information regarding stable terms. We then need to determine the time period that must be
considered when modeling the underlying data variations, and this can be accomplished through
the notion of stability period described in Section 4.3.2.2 (see Definition 1).
Recall that the stability period St,pr of a term t, considering the reference point in time
pr at which the test document was created, consists of the largest continuous period of time,
starting from pr and growing both towards the past and the future, in which Dominance(t, c) > α
(for some predefined α and any class c). For the explored datasets, we investigated
different values of α when computing stability periods and, as they led to similar results,
we adopted α = 50%, ensuring that the terms have a high degree of exclusivity with
some class.
Notice that the stability period of a term depends on a reference point in time, and
thus a term may present different stability periods, one for each point in time at which it
occurred in the dataset. We first determine the stability period for each term and then combine
them, as follows. In order to handle this situation, we mapped all the time points
in a stability period to temporal distances, where the reference year is considered as
distance 0. For instance, a term t1 may have different stability periods when considering
the years 1989 or 2000 as a reference. More specifically, if the stability period of t1 is
{1999, 2000, 2001} regarding pr = 2000, and {1988, 1989, 1990} regarding pr = 1989, both
periods would be mapped to {-1, 0, 1}. Considering S′t as the set of temporal distances
that occur in the stability periods of term t (considering all reference moments pr), then
S′t = {δ ← pn − pr | ∀pr ∈ P and pn ∈ St,pr}. Making the stability periods easily comparable
is important because our real interest is to know what kind of distribution the temporal
distances follow with respect to different terms.
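The mapping from stability periods to temporal distances can be sketched directly from the definition of S′t (illustrative function name; the input maps each reference point pr to the set St,pr):

```python
def temporal_distances(stability_periods):
    """Compute S'_t = { p_n - p_r : p_r in P, p_n in S_{t, p_r} }, i.e. the
    set of reference-relative temporal distances occurring in all of a
    term's stability periods. `stability_periods` maps each reference
    point in time p_r to the set of points S_{t, p_r}."""
    return {pn - pr
            for pr, period in stability_periods.items()
            for pn in period}
```

Running it on the t1 example from the text, the periods for pr = 2000 and pr = 1989 both collapse onto the same distance set {-1, 0, 1}, which is what makes stability periods comparable across reference points.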
The next step is to determine the function expression and, towards this goal, we
considered the stability period of each term as a random variable (RV), where the occurrence of
each possible temporal distance in its stability period is an event. More formally, as Table 5.2
shows, we are interested in the frequencies of the temporal distances δ1 to δn, for terms t1
to tk. An interesting property that we may test is whether these RVs are independent. This
hypothesis can be corroborated by Fisher's Exact Test (Clarkson et al., 1993), used to assess
the independence of each pair RVi and RVj, ∀i ≠ j, where, as mentioned, each RV represents
the occurrence of a temporal distance δ for a term t.
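A Monte Carlo flavor of such an independence test can be sketched with a permutation test over the δ × term frequency table of Table 5.2. This is an illustrative stand-in built on a chi-square statistic, not the Fisher's Exact Test implementation actually used in the experiments:

```python
import random

def chi2_stat(table):
    """Chi-square statistic of an r x c contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols))
               if rows[i] and cols[j])

def mc_independence_pvalue(table, trials=2000, seed=42):
    """Monte Carlo p-value for independence: repeatedly shuffle the column
    label of each individual observation and count how often the permuted
    table's statistic reaches the observed one."""
    rng = random.Random(seed)
    # Expand the table into one (row, col) pair per observed count.
    obs = [(i, j) for i, row in enumerate(table)
           for j, f in enumerate(row) for _ in range(f)]
    cols = [j for _, j in obs]
    observed = chi2_stat(table)
    hits = 0
    for _ in range(trials):
        rng.shuffle(cols)
        perm = [[0] * len(table[0]) for _ in table]
        for (i, _), j in zip(obs, cols):
            perm[i][j] += 1
        if chi2_stat(perm) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A high p-value (as obtained for ACM-DL and MEDLINE) is consistent with independence; a very low one (as for AG-NEWS) rejects it.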
We applied this test to the three reference datasets, ACM-DL, MEDLINE and AG-NEWS.
For the first two, we obtained a p-value of 0.99 through a Monte Carlo simulation,
which allows us to state that their associated random variables are indeed independent. Thus,
the observed variability of occurrences of δ for different terms is a result of independent
effects (Limpert et al., 2001). For the AG-NEWS dataset, this independence does not hold, as
indicated by the low p-value obtained (10^-4), and other hypotheses should be tested.
This highlights the difficulty faced when defining the function that best models the varying
behavior of the data at hand, motivating the development of a fully automated strategy that
overcomes the need to explicitly determine the TWF expression. As a matter of fact, one can
afford to avoid the explicit determination of the TWF expression and parameters since, for
the sake of temporally-aware ADC, only the TWF image matters (that is, the weights associated
with each temporal distance). Clearly, avoiding the determination of the TWF expression and
parameters comes at the cost of missing the opportunity to discover the TWF's properties,
which are revealed when determining its expression and parameters. With this trade-off in
mind, in Section 5.2 we describe a fully automated strategy to determine the weight of each
temporal distance.
Turning our attention to the ACM-DL and MEDLINE datasets (which passed the
independence tests), it is still not clear whether the effects responsible for the observed variability
in the temporal distance distribution Dδ are additive (leading to a normal distribution) or
multiplicative (leading to a lognormal distribution). In Figure 5.1 we show the Dδ
distribution, scaled to the [0, 1] interval. We then apply a statistical normality test to both the original
and the log-transformed distribution. According to D'Agostino's D-Statistic Test of Normality
(D'Agostino R.B., 1973), with 99% confidence, we found that the lognormal distribution
best fits both the ACM-DL and MEDLINE collections, as presented in Table 5.1.
Figure 5.1: Dδ Distribution (Scaled to the [0, 1] Interval). Panels: (a) ACM-DL Dataset, (b) MEDLINE Dataset.
Data            | ACM-DL   | MEDLINE
Original        | 4.497e-6 | 0.002762
Log-Transformed | 0.2144   | 0.6802

Table 5.1: D'Agostino's D-Statistic Test of Normality. Bold face marks tests for which we cannot reject the null hypothesis of normality.
     | t1   t2   ...  tk  | Dδ
δ1   | f11  f12  ...  f1k | Σ(i=1..k) f1i
δ2   | f21  f22  ...  f2k | Σ(i=1..k) f2i
...  |                    |
δn   | fn1  fn2  ...  fnk | Σ(i=1..k) fni

Table 5.2: Temporal Distances versus Terms.

Consider the distribution Dδ of the occurrences of the temporal distances δ in the
stability periods, which represents the distribution of each δi over all terms t; Dδ is
lognormally distributed if ln Dδ is normally distributed. More generally, since under the
independence assumption the temporal distances δi are RVs with finite mean and variance,
by the Central Limit Theorem ln Dδ = Σ(i=1..n) ln δi will asymptotically approach a
normal distribution and, by definition, Dδ converges to a lognormal distribution (Crow EL,
1988). For a lognormal distribution, the asymptotically most efficient method for estimating
its associated parameters relies on a log-transformation (Limpert et al., 2001). Using
a Maximum Likelihood method, we estimated those parameters for both collections, and
then back-transformed them, as shown in Table 5.3. We considered a 3-parameter gaussian
function,

F = ai · e^(−(x − bi)² / (2ci²)),
where the parameter a_i is the height of the curve's peak, b_i is the position of the center of the peak, and c_i controls the width of the curve. The latter, also called the shape parameter, reflects the nature of the variations of term-class relationships over time. Since abrupt or smooth variations lead to small or large stability periods, respectively, the shape of the distribution changes accordingly, and capturing such distinct natures becomes a matter of parameter estimation. We performed two curve-fitting procedures, considering a single gaussian F and a mixture of two gaussians, given by G = G_1 + G_2, where each G_i denotes a gaussian function. The latter was the model that best fitted D_δ, and its parameters are presented in Table 5.3, along with the goodness-of-fit measure R². The R² measure denotes the percentage of variance explained by the model and, for both collections, the obtained model explains 99% of such variance.
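This kind of two-gaussian fit can be sketched with `scipy.optimize.curve_fit`. The data below are synthetic, generated with parameters close to the ACM-DL column of Table 5.3, so the fit and the R² computation are illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, b, c):
    """3-parameter gaussian: peak height a, center b, width c."""
    return a * np.exp(-((x - b) ** 2) / (2.0 * c ** 2))

def mixture(x, a1, b1, c1, a2, b2, c2):
    """G = G1 + G2, the two-gaussian mixture fitted to D_delta."""
    return gauss(x, a1, b1, c1) + gauss(x, a2, b2, c2)

# Synthetic D_delta-like curve using parameters near Table 5.3 (ACM-DL).
x = np.linspace(-60, 60, 241)
y = mixture(x, 0.325, 0.0, 3.6, 0.616, 0.0, 20.1)
y += np.random.default_rng(0).normal(scale=0.002, size=x.size)  # mild noise

popt, _ = curve_fit(mixture, x, y, p0=[0.3, 0.0, 3.0, 0.6, 0.0, 18.0])

# Goodness of fit: R^2 = 1 - SS_res / SS_tot.
residuals = y - mixture(x, *popt)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
assert r2 > 0.99
print(f"R^2 = {r2:.4f}")
```

As in the thesis, R² close to 1 indicates that the mixture explains almost all the variance of the fitted curve.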
Parameters   ACM-DL                          MEDLINE
             Value     Confidence Interval   Value     Confidence Interval
a1           0.325     (0.288, 0.362)        0.089     (0.066, 0.113)
b1           -0.028    (-0.309, 0.253)       -0.013    (-0.349, 0.324)
c1           3.636     (3.117, 4.154)        1.635     (1.099, 2.17)
a2           0.616     (0.589, 0.643)        0.901     (0.891, 0.911)
b2           0.037     (-0.395, 0.470)       0.092     (-0.130, 0.314)
c2           20.14     (20.93, 23.35)        24.51     (23.71, 25.3)
R²           0.990                           0.992

Table 5.3: Estimated Parameters for Both Datasets, with 99% Confidence Intervals.
The greater the frequency of δ in stability periods, the more suitable the training documents created at δ are to build an accurate classification model, making the modeling of the Temporal Weighting Function as a lognormal distribution an effective strategy.
Figure 5.2 shows the distribution of temporal scores for each possible temporal distance between the creation time of a test document d′ and the training documents, for the ACM-DL and MEDLINE datasets.
Figure 5.2: Fitted Temporal Weighting Function with Log-Transformed Data. (a) ACM-DL dataset; (b) MEDLINE dataset.
5.2 Fully-Automated TWF Definition
Clearly, determining the expression and parameters of a function that effectively models the underlying data variations is an important task, since it reveals the properties of those variations and offers substantial knowledge that can be exploited towards the development of accurate classification models. However, defining the TWF demands statistical procedures that may not be suitable for a practitioner to perform, due to the diversity and sophistication of the tests that may be needed to define its expression. As reported in Section 5.1, the most straightforward procedures for independence testing of random variables failed when applied to the AG-NEWS dataset. Thus, unlike the TWFs associated with the ACM-DL and MEDLINE datasets, the AG-NEWS TWF does not follow a Gaussian process, and other (possibly more complex) tests would have to be performed to assess its expression. This may be prohibitively hard for a practitioner (who typically aims at high classification accuracy without regard to the properties of the function underlying the data variation process, which is, by the way, reflected by its expression and parameters), hurting the practical applicability of the proposed framework to devise the TWF and, consequently, the applicability of the temporally-aware classifiers described in Section 5.3.
Furthermore, automatic data-mining processes focused on classification may need automatic ways to determine the TWF. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance. While the proposed statistical framework is able to uncover the properties of the function that underlies the data variations, it may not be applicable to the two mentioned scenarios, and strategies to overcome this issue are desirable.
To cover such practical scenarios, in this section we describe a technique to automatically determine the TWF, without the need to perform any statistical test. Hence, we describe a straightforward and suitable way for a practitioner, or for some other automated data-mining process, to devise the TWF. Our goal is thus to develop a procedure which, given a set of already classified documents, outputs a function TWF_EST : δ ↦ [0, 1] that ultimately models the underlying data variations. More specifically, the ADC algorithms themselves are used to devise such a mapping.
Let D be the training set, composed of already classified documents d_i = (⟨x⃗_i, p_i⟩, c_i), where x⃗_i is the vectorial (bag-of-words) representation of d_i, p_i denotes its creation point in time and c_i denotes its associated class. The first step of our procedure consists of changing the associated class of each document to its creation point in time, that is, we represent d_i as d′_i = (x⃗_i, p_i). Then, the training set is randomly partitioned into two subsets, D_t and D_v, and a classification procedure is performed, using D_t as a training set and D_v as a validation set. Our basic assumption is that, due to the temporal effects previously described, the underlying data distribution may be different for each point in time p_i, and so we expect that the classifier may be able to learn some structural properties observed in each of them. Clearly, the classifier will not achieve high accuracy since, as shown in Chapter 4, the variations observed due to the temporal effects are typically smooth, and hence it is unlikely that the observed changes produce enough variation to enable discriminating data between each point in time. However, the classifier will potentially predict nearby points in time, since data from nearby points in time tend to have similar underlying distributions.
Formally, an ADC algorithm is used to learn the underlying relationships between the data and their creation points in time, expressed by the a posteriori probability distribution P(p_i | d_i). Thus, if documents created at a point in time p_i share some structural properties with data from a reference point in time p_r (namely, the point in time when the documents from D_v were created), then p_i will receive a higher score than some other uncorrelated point in time p_j ≠ p_i. Since we aim at associating a real-valued weight with the temporal distances δ_i = p_i − p_r, we adopt the following rule to devise the TWF:

TWF(δ_i) = ( Σ_{j=1}^{N} I(p_j − p_r = δ_i) ) / N,
where I(•) is an indicator function which returns 1 if the predicate • is true and 0 otherwise, p_r is the actual creation point in time of the classified documents from D_v (that is, the reference point in time), p_j is the predicted point in time (the one which received the highest score from the classifier) and N = |D_v| denotes the number of documents classified. Intuitively, temporal distances with higher weights contain the most useful documents to build the classification model, since they provide data sampled from distributions similar to the ones that govern the test data. On the other hand, temporal distances with smaller weights tend to have more unstable data, which may induce the classifier to misleading predictions. The described procedure to automatically determine the TWF is listed in Algorithm 2.
Algorithm 2 Automatic TWF Determination
1: function LEARNTWF(D)
2:   D′ ← {d′_i | d′_i = (d_i.x_i, d_i.p_i) ∧ d_i ∈ D}
3:   (D_t, D_v) ← RANDOMSPLIT(D′)
4:   p[ ] ← CLASSIFY(D_t, D_v)
5:   TWF(δ_i) = ( Σ_{j=1}^{N} I(p_j − d′_j.p = δ_i) ) / N, where d′_j ∈ D_v
6:   return TWF
7: end function
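Step 5 of Algorithm 2 reduces to a normalized histogram of the temporal distances between predicted and actual creation times. A minimal sketch (function and variable names are ours, not the thesis code; the classification step of lines 3-4 is assumed to have already produced the predictions):

```python
from collections import Counter

def learn_twf(predicted_times, actual_times):
    """TWF(delta) = (1/N) * #{j : predicted_j - actual_j = delta}
    (step 5 of Algorithm 2); times are discrete points, e.g. week indices."""
    n = len(actual_times)
    counts = Counter(p - r for p, r in zip(predicted_times, actual_times))
    return {delta: c / n for delta, c in counts.items()}

# Toy validation-set output: the classifier mostly predicts points in time
# near the true creation time, so small |delta| gets higher weight.
predicted = [3, 4, 3, 2, 5, 3, 4, 3, 1, 3]
actual = [3] * 10
twf = learn_twf(predicted, actual)
print(twf)  # -> {0: 0.5, 1: 0.2, -1: 0.1, 2: 0.1, -2: 0.1}
```

The resulting weights sum to 1 and peak at delta = 0, mirroring the bell-shaped TWFs of Figure 5.3.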
We show in Figure 5.3 the estimated TWFs for each explored textual dataset, as estimated by each explored classifier. They were obtained during the 10-fold cross-validation procedure used to assess the effectiveness of the temporally-aware algorithms, as reported in Section 5.4. The reader should notice the similarities among the TWFs learned by each classifier. In fact, it does not matter which classifier is employed in line 4 of Algorithm 2, as we shall discuss in Section 5.4.
Figure 5.3: Estimated Temporal Weighting Function. (a) ACM-DL dataset; (b) MEDLINE dataset; (c) AG-NEWS dataset.
5.3 Temporally-aware ADC
This section shows how three well-known text classifiers, namely Rocchio, KNN and Naïve Bayes (Manning et al., 2008), can be modified to take into account the temporal weighting function defined in Sections 5.1 and 5.2. The three algorithms are modified following two strategies: temporal weighting in documents and temporal weighting in scores, as detailed below.
5.3.1 Temporal Weighting in Documents
The temporal weighting in documents strategy weights each training document by the temporal weighting function, according to its temporal distance to the test document d′, as represented in Figure 5.4. In the following, we detail this strategy.
Figure 5.4: Graphical Representation of TWF in Documents.
The strategy used to incorporate the weight of each training document into a given classifier depends inherently on the characteristics of the classification algorithm being modified. In the case of distance-based classifiers, the temporal weighting function can be easily applied when calculating the distance between the training and test documents, by weighting each training document (vector representation) by its associated temporal weight. In the case of Naïve Bayes, the temporal function can be used to weight the impact of each training example on both the a priori and conditional probability estimates (that is, to weight its impact on the counts), in order to generate a more accurate a posteriori probability.
Rocchio Recall from Section 4.1.2 that the Rocchio classifier uses the centroid of a class to find boundaries between classes. As an eager classifier, Rocchio does not require any information from d′ to create a classification model. Hence, we adapt it to become a lazy classifier when using the temporal weighting function, since the weights depend on the creation point in time of the test document. When classifying a new document d′, Rocchio assigns it to the class represented by the centroid closest to d′. In order to make Rocchio a lazy classifier, we explicitly change the separation boundaries of the classes according to the temporal weights produced by the TWF.
Hence, we need to calculate each Rocchio class centroid based on the creation point in time p_r of the test document d′. Consider the set D_c ⊆ D of training documents that belong to class c. This set can be partitioned into subgroups D_{c,p} ⊆ D_c of documents created at the same point in time p ∈ P. The centroid μ⃗_c for class c is thus defined by weighting the document vector representations with the score produced by the temporal function TWF(δ), obtained using the temporal distance δ between the creation point in time of d ∈ D_{c,p} and d′, for all p ∈ P. Thus, a centroid μ⃗_c is given by:

μ⃗_c = (1 / |D_c|) · Σ_{d∈D_c} ( Σ_{p∈P} d⃗_p · TWF(δ) ),

where |D_c| is the number of documents in class c, P is the set of points in time observed in the training set, d⃗_p ∈ D_c denotes a training document created at the point in time p, and δ is the temporal distance between d⃗_p and the test document d′.

This approach redefines the centroid's coordinates in the vector space considering each document's representativeness in class c w.r.t. the reference point in time p_r. Both training and classification procedures are presented in Algorithm 3.
Algorithm 3 Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents
1: function TRAIN(C, D, d′, TWF)
2:   for each c ∈ C do
3:     μ⃗_c ← (1 / |D_c|) · Σ_{d∈D_c} ( Σ_{p∈P} d⃗_p · TWF(p − d′.p) )
4:   end for
5:   return {μ⃗_c : c ∈ C}
6: end function
7: function CLASSIFY(D, C, d′)
8:   {μ⃗_c : c ∈ C} ← TRAIN(D, C, d′)
9:   return argmax_c cos(μ⃗_c, d⃗′)
10: end function
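The lazy, temporally weighted centroid computation of Algorithm 3 can be sketched as follows (a toy sketch with dense NumPy vectors and an example TWF table; names are ours):

```python
import numpy as np

def rocchio_twf_classify(docs, classes, times, d_test, p_test, twf):
    """Lazy Rocchio: centroids are recomputed per test document, with each
    training vector weighted by TWF of its temporal distance to the test
    time. `docs` is an (n, m) array; `classes`/`times` are length-n lists;
    `twf` maps a temporal distance to a weight."""
    centroids = {}
    for c in set(classes):
        idx = [i for i, ci in enumerate(classes) if ci == c]
        weighted = [docs[i] * twf(times[i] - p_test) for i in idx]
        centroids[c] = np.sum(weighted, axis=0) / len(idx)

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    return max(centroids, key=lambda c: cos(centroids[c], d_test))

# Toy example: two classes, two points in time, a simple TWF table.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
classes = ["a", "a", "b", "b"]
times = [1, 2, 1, 2]
twf = lambda d: {0: 1.0, -1: 0.6}.get(d, 0.0)
print(rocchio_twf_classify(docs, classes, times, np.array([1.0, 0.2]), 2, twf))
```

Because the centroids depend on the test document's creation time, training is deferred to classification time, exactly as in the lazy formulation above.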
KNN As described in Section 4.1, KNN is a lazy classifier that assigns to a test document d′ the majority class among those of its k nearest neighbor documents in the vector space. Determining the test document's class from the k nearest training documents may not be ideal in the presence of term-class relationships that vary considerably over time. To deal with this, we apply the proposed temporal weighting function during the computation of the similarities between d′ and the documents in the training set, aiming to select the closest documents in terms of both similarity and timeliness.
Let s be the cosine similarity between a training document d and d′. If d is similar to d′ but temporally distant, then it is moved away from d′, reducing the probability of it being among the k nearest documents of d′. Let TWF(δ) be the temporal weight associated with the temporal distance between the creation times of documents d and d′. Then, the documents' similarity is given by:

sim(d, d′) ← cos(d, d′) · TWF(δ).
Both training and classification procedures are presented in Algorithm 4.

Algorithm 4 KNN-TWF-Doc: KNN with Temporal Weighting in Documents
1: function KNEARESTNEIGHBORS(D, d′, k, TWF)
2:   for each d ∈ D do
3:     δ ← d.p − d′.p
4:     sim(d, d′) ← cos(d, d′) · TWF(δ)
5:     priorityQueue.insert(sim, d)
6:   end for
7:   return priorityQueue.first(k)
8: end function
9: function CLASSIFY(D, d′, k)
10:   knn ← KNEARESTNEIGHBORS(D, d′, k)
11:   return argmax_c Σ_{d∈knn} I(d.c = c)
12: end function
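The similarity rule sim(d, d′) = cos(d, d′) · TWF(δ) from Algorithm 4 can be sketched as (toy data and names are ours):

```python
import heapq
from collections import Counter
import numpy as np

def knn_twf_classify(docs, classes, times, d_test, p_test, twf, k):
    """KNN-TWF-Doc sketch: neighbor similarity is cosine similarity
    multiplied by TWF(delta), so temporally distant documents are pushed
    away from d_test before the k nearest neighbors are selected."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    sims = [(cos(d, d_test) * twf(t - p_test), c)
            for d, c, t in zip(docs, classes, times)]
    top_k = heapq.nlargest(k, sims, key=lambda x: x[0])
    return Counter(c for _, c in top_k).most_common(1)[0][0]

# Toy run: two recent "a" documents outweigh an equally similar older one.
docs = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
classes = ["a", "a", "b"]
times = [5, 5, 1]
twf = lambda d: {0: 1.0, -4: 0.2}.get(d, 0.0)
print(knn_twf_classify(docs, classes, times, np.array([1.0, 0.1]), 5, twf, 2))
```

Multiplying rather than replacing the cosine keeps the ranking driven by content, with time acting only as a penalty.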
Naïve Bayes Similarly to the previously defined "temporal weighting in documents" approaches, here we apply the temporal weighting function to the information used by the learning method, namely the relative frequencies of documents and terms, as follows:

P(d′|c) = η · [ Σ_p (N_cp · TWF(δ)) / Σ_p (N_p · TWF(δ)) ] · Π_{t∈d′} [ Σ_p (f_tcp · TWF(δ)) / Σ_p Σ_{t′∈V} (f_t′cp · TWF(δ)) ],

where η denotes a normalizing factor, N_cp is the number of training documents of D assigned to class c and created at the point in time p, N_p is the number of training documents created at the point in time p, f_tcp stands for the frequency of occurrence of term t in training documents of class c that were created at the point in time p and, finally, δ denotes the temporal distance between p and the creation time of d′ (that is, the reference point in time).

The main goal of this strategy is to reduce the impact that temporally distant information has when estimating a posteriori probabilities. Algorithm 5 presents this strategy.
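The TWF-weighted counts can be sketched as follows (a toy sketch; the log-space computation and the add-one smoothing are our additions, not spelled out in the text):

```python
from collections import defaultdict
from math import log

def nb_twf_classify(train, d_test, p_test, twf, vocab):
    """Naive Bayes with TWF-weighted counts. `train` holds tuples
    (term_counts: dict, p: int, c: label); priors and conditionals use
    counts weighted by TWF(p - p_test)."""
    prior_num = defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))
    total_w = 0.0
    for counts, p, c in train:
        w = twf(p - p_test)
        prior_num[c] += w
        total_w += w
        for t, f in counts.items():
            cond[c][t] += f * w
    scores = {}
    for c in prior_num:
        denom = sum(cond[c].values()) + len(vocab)  # add-one smoothing
        s = log((prior_num[c] + 1e-12) / (total_w + 1e-12))
        for t, f in d_test.items():
            s += f * log((cond[c][t] + 1.0) / denom)
        scores[c] = s
    return max(scores, key=scores.get)

# Toy run: the recent class-"a" document (which uses term "y") dominates
# the temporally distant one (which used "x").
train = [({"x": 3}, 1, "a"), ({"y": 3}, 1, "b"), ({"y": 3}, 5, "a")]
twf = lambda d: {0: 1.0}.get(d, 0.1)
print(nb_twf_classify(train, {"y": 2}, 5, twf, {"x", "y"}))
```

Weighting the counts, rather than the final scores, lets the temporally closest evidence dominate both the prior and the conditional estimates.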
Algorithm 5 Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents
1: function CLASSIFY(D, d′, TWF)
2:   for each c ∈ C do
3:     aPriori[c] ← Σ_p (N_cp · TWF(δ)) / Σ_p (N_p · TWF(δ))
4:     termCond[c] ← Π_{t∈d′} Σ_p (f_tcp · TWF(δ)) / ( Σ_p Σ_{t′∈V} (f_t′cp · TWF(δ)) )
5:   end for
6:   return {argmax_c η · aPriori[c] · termCond[c]}
7: end function

5.3.2 Temporal Weighting in Scores

A more sophisticated approach to exploit the temporal weighting function considers the "scores" produced by the traditional classifiers, as represented in Figure 5.5. By score we mean: (i) the smallest distance from the test document d′ to a class centroid, for Rocchio; (ii) the smallest sum of the distances of the k nearest neighbors of document d′ assigned to class c, in the case of KNN; or (iii) the probability of generating d′ with the model associated with some class c, for Naïve Bayes. From now on, we refer to this approach as temporal weighting in scores.
Figure 5.5: Graphical Representation of TWF in Scores.
Let C and P be the sets of classes and of creation points in time of the training documents. First, each training document's class c ∈ C is associated with the corresponding creation point in time p ∈ P, generating a new class defined as ⟨c, p⟩ ∈ C × P. Then, we use a traditional classification algorithm to generate scores for each new class ⟨c, p⟩. Thus, the first step of this strategy consists of generating a new training set D_{c,p} with the class domain transformed from C to C × P. Then, the test document d′ is classified by a traditional classifier applied to this new training set, ultimately generating scores for each ⟨c, p⟩. Note that this scenario isolates term-class relationship variations, since it ties the predictive relationships of the patterns observed in each class c to the point in time p in which those patterns were observed. To decide to which class c the document d′ should be assigned, the learned scores for each ⟨c, p⟩ are summed up, for all p ∈ P, weighting them by TWF(δ), where δ = p − p_r corresponds to the temporal distance between p and the creation time p_r of d′, that is,
scores_{c,p} ← TRADITIONALCLASSIFIER(d′, D_{c,p}),

score_c ← Σ_{p∈P} scores_{c,p}(c, p) · TWF(δ),

where D_{c,p} is the new set of training documents generated by mapping each document's class c to the derived class ⟨c, p⟩, according to its creation point in time. At the end of this process, d′ is assigned to the class c with the highest score, as listed in Algorithm 6.
Algorithm 6 TWF-Sc: Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:   D_{c,p} ← {d_{c,p} = (d.x⃗, ⟨d.c, d.p⟩) | d ∈ D}
3:   scores_{c,p} ← TRADITIONALCLASSIFIER(d′, D_{c,p})
4:   for each c ∈ C do
5:     δ ← p − d′.p
6:     score_c ← Σ_{p∈P} scores_{c,p}(c, p) · TWF(δ)
7:   end for
8:   return {argmax_c score_c}
9: end function
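The class-domain transformation and score aggregation of Algorithm 6 can be sketched as follows; `score_fn` stands in for any traditional classifier that returns per-derived-class scores (the stub below is ours, purely to exercise the aggregation):

```python
from collections import defaultdict

def twf_in_scores(train, d_test, p_test, twf, score_fn):
    """TWF-Sc sketch (Algorithm 6): map each training document's class c to
    the derived class (c, p), obtain per-derived-class scores from a
    traditional classifier, then aggregate them weighted by TWF."""
    derived = [(x, (c, p)) for x, p, c in train]  # class domain C x P
    scores_cp = score_fn(derived, d_test)         # {(c, p): score}
    score_c = defaultdict(float)
    for (c, p), s in scores_cp.items():
        score_c[c] += s * twf(p - p_test)
    return max(score_c, key=score_c.get)

# Stand-in classifier with fixed scores: class "a" looks strong only at a
# temporally distant point in time, so "b" wins after TWF weighting.
stub = lambda derived, d: {("a", 1): 0.9, ("a", 5): 0.2, ("b", 5): 0.6}
twf = lambda d: {0: 1.0}.get(d, 0.1)
print(twf_in_scores([("x", 1, "a"), ("y", 5, "a"), ("z", 5, "b")],
                    "doc", 5, twf, stub))
```

The key point mirrored here is that scores are learned per ⟨c, p⟩ and only then collapsed back to the original class domain.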
5.3.3 Extended Temporal Weighting in Scores
When generating the scores for the derived classes ⟨c, p⟩ during the classification of a test document, the number of positive training documents (that is, training documents belonging to ⟨c, p⟩) is usually outnumbered by the number of negative training documents (that is, training documents belonging to a derived class ⟨c′, p′⟩ ≠ ⟨c, p⟩). This is known as the class imbalance problem, and it is an issue for classifiers with some bias towards the majority classes. Indeed, this problem is inherent to the classification task, and the majority of automatic classifiers are affected by it. The "in scores" strategy becomes vulnerable to the class imbalance problem since it artificially increases the imbalance when mapping the classes to ⟨c, p⟩. Several works have proposed strategies to minimize this problem, for example, by under-sampling the majority classes (Lin et al., 2009) or by over-sampling the minority classes (Chen et al., 2011). We address this issue by modifying the "in scores" strategy so as to minimize the class imbalance problem.
More formally, let D_c denote the set of documents belonging to class c and D_c^p ⊆ D_c denote the set of documents created at the point in time p that also belong to class c. Clearly, |D_c^p| ≤ |D_c|. Now, consider our previously proposed "in scores" strategy, in which a classifier is used to learn the scores for ⟨c, p⟩. The difference between |D_c^p| (the number of positive documents) and |D_c \ D_c^p| (the number of negative documents) is expected to be greater than the difference between |D_c| and |D \ D_c|. In other words, the number of negative documents observed in the transformed class domain (C × P) outnumbers the number of positive documents much more expressively than in the original class domain (C). Thus, the "in scores" strategy artificially increases the class imbalance when considering the derived classes ⟨c, p⟩, and such imbalance is greater than that observed under the original class distribution.
Figure 5.6: Graphical Representation of Extended TWF in Scores.
Based on this observation, the extended version of the "in scores" strategy aims at mitigating the class imbalance problem by considering each point in time in isolation, as represented in Figure 5.6, employing a series of classifiers to associate scores with the classes, each considering only the documents belonging to one point in time (but belonging to all classes). The scores obtained by each classifier are then aggregated with the corresponding TWF weight, according to the temporal distance between the point in time associated with each classifier and the creation time of the test document. Let D^p denote the set of documents created at the point in time p. Since D_c^p ⊆ D^p, then |D_c^p| ≤ |D^p|, and the majority class size observed in D^p is bounded by |D^p|. In the "in scores" strategy, the majority class size is bounded by |D_{c,p}| = |D|. Consequently, the class imbalance observed in the first approach is smaller than that observed in the second one.
There are two rather subtle differences between this new strategy and the previous "in scores" approach. First, as mentioned, by construction the class imbalance problem is bounded by the class imbalance observed in D^p, since it is the set considered when training the intermediate classifiers. Second, consider the traditional classification procedure performed by both strategies. While in the "in scores" strategy such classification is performed considering the modified training set D_{c,p}, in the extended version only documents belonging to D^p are considered, which implies that documents created at a point in time p′ ≠ p do not influence the learned scores. As an example, consider the class Top Stories (id 10) of the AG-NEWS dataset. From Figure 4.2, we can observe that, in the 50th week, none of the documents belongs to this class. Now assume that the classification procedure of both strategies attempts to learn a score for this class when classifying a test document belonging to the 1st week. The scores learned by the classifier under the "in scores" strategy will consider all training documents, including those belonging to the 50th week. Thus, these negative documents ultimately influence the learned scores. On the other hand, in the "extended in scores" strategy, the classification procedures applied to each point in time act in isolation: the classifiers assigned to points in time within the 50th week will output scores equal to zero for this class, and only the classifiers assigned to points in time with documents belonging to this class will output non-zero scores. The extended in scores procedure is listed in Algorithm 7.
Algorithm 7 TWF-Sc-Ext: Extended Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:   for each p ∈ P do
3:     D_p ← {d ∈ D | d.p = p}
4:     δ ← p − d′.p
5:     score_c(c) += TRADITIONALCLASSIFIER(d′, D_p) · TWF(δ)
6:   end for
7:   return {argmax_c score_c}
8: end function
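The per-point-in-time decomposition of Algorithm 7 can be sketched as follows; as before, `score_fn` is a stand-in for any traditional classifier returning per-class scores (the frequency-based stub is ours):

```python
from collections import Counter, defaultdict

def twf_sc_ext(train, d_test, p_test, twf, score_fn):
    """TWF-Sc-Ext sketch (Algorithm 7): train one classifier per point in
    time p on D_p only (all classes, single p), then aggregate each
    classifier's class scores weighted by TWF(p - p_test)."""
    by_p = defaultdict(list)
    for x, p, c in train:
        by_p[p].append((x, c))
    score_c = defaultdict(float)
    for p, subset in by_p.items():
        w = twf(p - p_test)
        for c, s in score_fn(subset, d_test).items():
            score_c[c] += s * w
    return max(score_c, key=score_c.get)

# Stand-in classifier: score = fraction of the subset in each class.
def stub(subset, d):
    counts = Counter(c for _, c in subset)
    n = sum(counts.values())
    return {c: cnt / n for c, cnt in counts.items()}

twf = lambda d: {0: 1.0}.get(d, 0.2)
train = [("x", 1, "a"), ("y", 1, "a"), ("z", 5, "b"), ("w", 5, "a")]
print(twf_sc_ext(train, "doc", 5, twf, stub))
```

Because each intermediate classifier sees only D_p, points in time with no documents of a class simply contribute a zero score for it, as discussed above.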
5.4 Results
Having presented our strategies to determine the TWF and our temporally-aware classifiers, we now report our experimental evaluation, which assesses the effectiveness of the temporally-aware classifiers in minimizing the impact of the temporal effects observed in the three explored textual datasets. Recall from Section 5.1 that, for the ACM-DL and MEDLINE datasets, the TWF follows a lognormal distribution, unlike the TWF associated with the AG-NEWS dataset, whose expression is still unknown (meaning that a different, possibly more complex, statistical test would be required to assess its TWF). Hence, we start by evaluating the temporal algorithms using the original TWF obtained in Section 5.1, applied to the ACM-DL and MEDLINE datasets. Next, we evaluate our temporally-aware classifiers using the TWF estimated with a machine learning approach, as described in Section 5.2. In this case, since complex statistical tests are no longer necessary, we are able to determine the TWF for all three textual datasets in a fully-automated way. Thus, we evaluate our temporally-aware classifiers using this TWF applied to the three reference datasets in order to assess their effectiveness.
In order to evaluate the impact that the proposed TWF has on the classification task, we compare both the traditional and temporally-aware versions of Rocchio, KNN and Naïve Bayes on the three adopted datasets (ACM-DL, MEDLINE and AG-NEWS). For comparison we use two standard information retrieval measures: micro-averaged F1 (MicroF1) and macro-averaged F1 (MacroF1). As described in Section 2.2, while MicroF1 measures the classification effectiveness over all decisions made by the classifier, MacroF1 measures the classification effectiveness for each individual class and averages them. All experiments were executed using a 10-fold cross-validation (Breiman and Spector, 1992) procedure considering training, validation and test sets. The parameters were set using the validation set, and the effectiveness of the algorithms was measured on the test partition.

We start by reporting, in Section 5.4.1, the parameter setup performed in order to conduct our experimental evaluation. Then, in Section 5.4.2, we report and analyze the results obtained when using the original definition of the TWF (described in Section 5.1) for the ACM-DL and MEDLINE datasets. Finally, in Section 5.4.3, we evaluate the use of the fully-automated strategy to devise the TWF that feeds our temporally-aware classifiers, and discuss some important aspects regarding its efficiency in terms of runtime. All the experiments were run on a Quad-Core AMD Opteron™ CPU with 16 GB of RAM.
5.4.1 Parameter Settings
An important aspect to be considered when dealing with the temporally-aware classifiers is that the TWF scale must be compatible with the values weighted by the TWF. Clearly, this is algorithm specific and should be properly set to ensure that the TWF effectively improves the classifier's decision rules without compromising them. To explicitly control the TWF scale (without modifying its shape), we introduce a scaling factor β, which should be properly calibrated over the training set.
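The role of β can be sketched as a simple rescaling of the TWF. How exactly β enters is our reading of the text, which only states that it controls the TWF scale without modifying its shape:

```python
def scaled_twf(twf, beta):
    """Return a TWF rescaled by beta; relative weights (shape) unchanged."""
    return lambda delta: beta * twf(delta)

# Toy base TWF and a version scaled by beta = 10.
base = lambda d: {0: 1.0, -1: 0.5}.get(d, 0.0)
t10 = scaled_twf(base, 10)
print(t10(0), t10(-1))  # -> 10.0 5.0
```

In practice, β would be chosen from {1, 10, 100} by validation over the training set, as described below.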
Hence, in order to run the experiments, two important parameters had to be set: the value of K for KNN and the scaling factor β. We first performed some experiments with KNN to define the value of K. This parameter significantly impacts the quality of the classifier, and must be carefully chosen. The following values were tested, by means of cross-validation over the training set, for each version of the traditional and temporally-aware algorithms: 3, 10, 30, 50, 150, and 200. For the traditional version of the algorithm, K = 30 achieved the best results, while for both the in documents and in scores versions of KNN the best value of K was 50. The intuition for the traditional KNN performing better with smaller values of K is that, as the number of neighbors increases, the variation in term-class relationships also increases, and so does the probability of misclassification. On the other hand, when setting K < 30, the traditional version of KNN performed poorly due to overfitting. When considering temporal information by means of the proposed temporal weights, in contrast, more consistent information becomes available (due to a larger K), allowing a more accurate model. Finally, the extended in scores version of KNN performed best with K = 3 on the ACM-DL dataset and K = 10 on the MEDLINE and AG-NEWS datasets (recall that in this strategy the KNN classification model is built from reduced training sets, composed of documents belonging to the same point in time, which justifies these smaller values).
We empirically tested three values for β: 1, 10, and 100. The best value for each version of each classifier was considered. For Rocchio and KNN, the best results were obtained with β = 1. For Naïve Bayes, the best value was β = 10.
5.4.2 Experiments with the Statistically Defined TWF
In this section, we report our experiments comparing the traditional and the proposed temporally-aware versions of Rocchio, KNN and Naïve Bayes, using the statistically defined TWF, reported in Section 5.1, for the ACM-DL and MEDLINE datasets. We defer the analysis regarding the AG-NEWS dataset to Section 5.4.3, where we discuss the results obtained using the estimated TWF (as described in Section 5.2). The results obtained for the ACM-DL and MEDLINE datasets are reported in Tables 5.4 and 5.5, respectively. In both tables, each line presents the results achieved by the versions of the classifiers identified in the first row and column. The values obtained for MacroF1 ("macF1") and MicroF1 ("micF1") are reported, as well as the percentage difference between the values achieved by the temporally-aware methods and the traditional version of the classifiers. This percentage difference is followed by a symbol that indicates whether the variation is statistically significant according to a 2-tailed paired t-test with a 99% confidence level: ▲ denotes a significant positive variation, • a non-significant variation, and ▼ a significant negative variation. This notation is also adopted in Section 5.4.3.
Algorithm            Rocchio               KNN                   Naïve Bayes
Metric               macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline             57.39      68.24      58.48      71.84      57.27      73.24
TWF in documents     60.02      70.64      59.92      73.84      60.78      74.11
                     (+4.58)▲   (+3.52)▲   (+2.46)▲   (+2.78)▲   (+6.13)▲   (+1.19)•
TWF in scores        59.85      72.47      62.02      74.45      44.85      63.93
                     (+4.29)▲   (+6.20)▲   (+6.05)▲   (+3.63)▲   (-27.69)▼  (-14.56)▼
TWF in scores ext.   59.27      71.39      59.78      73.85      56.23      72.35
                     (+3.28)▲   (+4.62)▲   (+2.22)▲   (+2.80)▲   (-1.84)•   (+1.23)•

Table 5.4: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes—ACM-DL.

Algorithm            Rocchio               KNN                   Naïve Bayes
Metric               macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline             54.26      69.27      72.49      82.86      64.61      80.82
TWF in documents     54.08      69.48      74.10      83.36      66.75      82.87
                     (-0.33)•   (+0.30)•   (+2.22)▲   (+0.60)•   (+3.31)▲   (+2.54)▲
TWF in scores        63.95      77.63      75.89      86.35      58.12      80.49
                     (+17.86)▲  (+12.07)▲  (+4.69)▲   (+4.21)▲   (-10.04)▼  (-0.41)•
TWF in scores ext.   63.63      77.28      74.45      84.96      63.41      81.06
                     (+17.27)▲  (+11.56)▲  (+2.70)▲   (+2.53)▲   (-1.89)•   (+0.30)•

Table 5.5: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes—MEDLINE.

As we can see in Tables 5.4 and 5.5, all modified versions of Rocchio and KNN achieved better results than the baseline on ACM-DL. On MEDLINE, the "in scores" and "extended in scores" versions achieved statistically significant gains, while the "in documents" versions were statistically tied with the baseline. In particular, Rocchio with TWF in scores presents the most significant improvements in both datasets, with gains of up to +17.86% and +12.07% for MacroF1 and MicroF1, respectively. Similarly, KNN with TWF in scores
achieves the best results among all KNN variations, with gains of +6.05% and +4.21% for MacroF1 and MicroF1, respectively. In the case of Rocchio, the improvements achieved using the TWF can be explained by the fact that, in the traditional version, the documents are summarized in a unique representative vector (centroid), aggregating documents from distinct creation points in time, which ultimately affects the prediction ability of the classifier. In the case of KNN, the definition of class boundaries is done considering each training document independently. KNN assumes that documents of the same class are located close to each other in the vector space. By using the TWF, the k nearest documents are reorganized, and the most temporally relevant documents are placed closer to the document being classified, according to the temporal distance between them.
The Naïve Bayes with TWF in documents presents better results for MacroF1 on both ACM-DL and MEDLINE, and better MicroF1 on the MEDLINE dataset. Note that the best improvement was achieved in MacroF1, indicating that this strategy effectively reduces the Naïve Bayes bias towards the most frequent classes and, consequently, improves the effectiveness of this classifier when predicting documents from the smaller classes. However,
in contrast with Rocchio and KNN, the Naïve Bayes with TWF in scores performs poorly on both datasets. A closer look at the "in scores" strategy reveals that, if it is built upon a traditional classifier whose decision rule is strongly influenced by the negative documents (as KNN and Naïve Bayes are), its performance is bound to be poor when applied to datasets with skewed ⟨c, p⟩ distributions. Although in KNN this problem can be ameliorated by properly tuning the parameter K, in Naïve Bayes this is not possible. Thus, we attribute the poor performance of the Naïve Bayes with TWF in scores to two major weaknesses of the traditional Naïve Bayes version. First, when facing skewed data distributions, the traditional version of Naïve Bayes unwittingly prefers larger classes over the others, causing decision boundaries to be biased (in this case, the prediction of the smaller classes is influenced by the negative documents belonging to the major classes). Second, when data is scarce, there is not enough information to perform accurate estimates, leading to bad results.
The skewness of the data distribution among classes 〈c, p〉 can be quantified by the
Coefficient of Variation CV = σ/µ of their sizes, where σ and µ stand for the standard
deviation and the mean, respectively. To explore the impact of data skewness on Naïve Bayes,
we sampled MEDLINE, creating two sub-collections composed of the least and the most
frequent classes 〈c, p〉, thereby minimizing data skewness. While the entire collection presents
CV = 1.33, the sub-collections with the least and most frequent classes present CV equal
to 0.57 and 0.43, respectively. As we can observe in Tables 5.5 and 5.6, the greater the CV,
the worse the results.
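For reference, the CV of the 〈c, p〉 class sizes can be computed as in the minimal sketch below. We assume the population standard deviation here, since the text does not state which estimator was used.

```python
import statistics

def coefficient_of_variation(class_sizes):
    """CV = sigma / mu over the <c, p> class sizes.
    A higher CV indicates a more skewed class-size distribution."""
    mu = statistics.mean(class_sizes)
    sigma = statistics.pstdev(class_sizes)  # population standard deviation
    return sigma / mu
```

For example, a collection whose 〈c, p〉 sizes are all equal yields CV = 0, while sizes of 100 and 300 documents yield CV = 0.5.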
Naïve Bayes       Least frequent classes 〈c, p〉       Most frequent classes 〈c, p〉
Metric            CV     macF1(%)    micF1(%)          CV     macF1(%)    micF1(%)
Baseline          0.57   74.72       80.42             0.43   88.70       87.65
TWF in scores            78.49       84.75                    91.16       89.66
                         (+5.04)N    (+5.38)N                 (+2.77)N    (+2.29)N

Table 5.6: Results Obtained for the Least and Most Frequent Classes 〈c, p〉 Sampling for Naïve Bayes—MEDLINE.
Figure 5.7 shows the histogram with the percentage of classes 〈c, p〉 with up to a given
size (specified in the x-axis) for the ACM-DL and MEDLINE datasets. As we can observe
in Figure 5.7a, data scarcity is prominent in the ACM-DL dataset, contributing to the
poor performance of the Naïve Bayes with TWF in scores. Notice that 70% of the classes 〈c, p〉
have less than 100 documents, a number too low to guarantee accurate estimates. This is
also observed in the MEDLINE dataset (see Figure 5.7b), but to a smaller extent: more
specifically, 13% of the classes 〈c, p〉 are composed of less than 500 documents, whereas
35% are composed of 2500 to 3000 documents. In addition, ACM-DL has an even more
skewed data distribution over each time point (with a CV equal to 1.69 regarding the 〈c, p〉
sizes), preventing us from sampling it into sub-collections with smaller CV, as performed
with the MEDLINE dataset.
Figure 5.7: Relative 〈c, p〉 Sizes. (a) ACM-DL Dataset; (b) MEDLINE Dataset.
Recall that the main motivation behind the “extended in scores” strategy is to ameliorate
the class imbalance problem, which negatively impacts the “in scores” effectiveness.
In the “extended in scores” strategy, the influence of negative documents is bounded by
considering the data from each point in time in isolation. More specifically, the class imbalance
is not given by the 〈c, p〉 distribution as in the “in scores” strategy, but by the class
imbalance observed within each point in time. In fact, this class distribution is typically more
even than the artificial 〈c, p〉 distribution. As we can observe in the reported results,
the extended in scores version of Naïve Bayes performed better than its in scores version.
However, it still did not perform better than the baseline (with statistically equivalent
results in all cases), due to the discussed data scarcity problem, which prevents this classifier
from learning accurate estimates of the class densities. Strategies to handle data scarcity (for
instance, by oversampling the training set) are one of our current research focuses, and we
plan to further investigate this matter as future work.
We now analyze the obtained results in light of the quantitative analysis reported
in Chapter 4. As observed, using TWF in scores in most cases led to better results than
applying TWF in documents. This is due to the fact that the “in scores” strategy simultaneously
addresses the three discussed temporal effects, namely, the class distribution variation
(CD), the pairwise class similarity variation (CS) and the term distribution variation (TD),
whereas the “in documents” strategy takes into account just the TD effect, as discussed
next. Furthermore, as we can observe in Table 5.5 regarding the MEDLINE dataset, with the
“in documents” strategy the results obtained were statistically equivalent to the baselines
in almost all cases, with the Naïve Bayes being an exception. As will be discussed in the
following, this is due to the MEDLINE characteristics with respect to the extent of the TD
effect.
Recall that the temporal weighting in documents strategy weights each training document
by the TWF according to its temporal distance to the test document. The TWF is
modeled according to the observed variations over time in the term-class relationships of
each dataset, ultimately addressing the TD aspect. Furthermore, recall that both the temporal
weighting in scores and its extended version tie the observed patterns to both
class and temporal information. While the “in scores” strategy transforms the class domain from C
to C × P (generating a new training set), the “extended in scores” strategy groups training
documents into partitions composed of documents created at the same point in time,
performing a traditional classification procedure over each partition. Both strategies assume
that the temporal effects may be safely neglected within a single point in time, and thus that the
classification models learned considering each point in time in isolation are not affected by
them. However, as previously stated, considering only the data related to a single point in
time may disregard valuable information for learning an accurate classification model. Thus, the
second step of these strategies consists of aggregating the information learned for each point
in time, weighting the obtained classification scores by the TWF. Aggregating the scores
obtained for each point in time is affected by the TD effect, since the scores reflect the
relationships between terms and classes. In order to overcome the observed variations in the
term-class relationships across the different points in time, the TWF is used to weight the scores
according to the temporal distance between the point in time associated to each partition
and the creation time of the test document. Thus, while the first step addresses the CD and
CS effects, the second step addresses the TD aspect observed when aggregating the scores.
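The aggregation step common to both “in scores” variants can be sketched as follows. This is an illustrative implementation under assumed data structures (per-time score dictionaries and a TWF callable), not the authors' exact code; it computes score_c = Σ_p score_〈c,p〉 · TWF(δ), where δ is the temporal distance between partition p and the test document.

```python
def aggregate_scores(scores_per_time, twf, test_time):
    """scores_per_time: {point_in_time: {class_label: score}}, the scores a
    base classifier assigned to the derived <c, p> classes (or to per-time
    partitions). Each score is weighted by the TWF of the temporal distance
    to the test document and summed per class:
        score_c = sum over p of score_<c,p> * TWF(delta)."""
    final = {}
    for p, class_scores in scores_per_time.items():
        weight = twf(abs(test_time - p))
        for c, s in class_scores.items():
            final[c] = final.get(c, 0.0) + weight * s
    predicted = max(final, key=final.get)
    return predicted, final
```

With a TWF that decays with δ, scores from partitions close in time to the test document dominate the final per-class score, as intended.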
Analyzing the reported results, we can observe that, for ACM-DL, the three strategies
achieved significant gains. Considering the temporal weighting in documents approach, its
gains can be justified by the high impact of TD in that dataset. Moreover, since ACM-DL
is also subject to a high impact of both CD and CS, both the temporal weighting in
scores and its extended version also performed well, since they address such effects, as
previously discussed. In MEDLINE, in contrast, since the impact of TD is smaller than the
impact of the other two effects, we should expect less significant gains from temporal
weighting in documents. Indeed, this was the observed behavior: this approach achieved
statistical ties compared to the baselines in almost all cases. However, as both CD and CS
are important factors in that dataset, we can observe statistically significant improvements
in classification effectiveness when the temporal weighting in scores and its extension are
applied. Furthermore, the largest improvements are achieved when the temporal weighting
in scores is applied with the Rocchio classifier, which, as discussed in the previous section,
is the most affected by both CD and CS in that dataset (see summary in Table 4.8).
5.4.3 Experiments with the Estimated TWF
In this section, we report our experimental evaluation to assess the effectiveness of the
proposed temporally-aware classifiers using the TWF learned by the fully-automated procedure
described in Section 5.2. The goal here is to increase the applicability of the temporally-aware
classifiers. For example, even if uncertain about the expression (and parameters) of
the TWF associated to the AG-NEWS dataset, we can still determine the weights associated
to each temporal distance using the procedure described in Algorithm 2, and use our
temporally-aware classifiers with the learned TWF. Thus, in this section we examine the
results obtained when applying the temporally-aware classifiers to the three reference datasets,
using the estimated TWF. Recall that, in this case, the TWF is learned from the training set
D. An interesting aspect to be analyzed refers to the amount of data required to accurately
estimate this function: is the whole training set needed to learn the TWF? In order to have
a first glance at this matter, we evaluate our strategies using the TWF learned from the
entire D and from a sample composed of 10% of D, selected by a per point in time random
sampling (to guarantee that each point in time will have at least one document).
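The per point in time sampling just described can be sketched as below (an assumed implementation; the text does not give code). Each time point keeps at least one document, so no point in time disappears from the sample.

```python
import random
from collections import defaultdict

def sample_per_time(docs, fraction=0.10, seed=42):
    """docs: list of (document, creation_time) pairs. Randomly keeps
    `fraction` of the documents of each point in time, but always at
    least one document per time point, so that every point in time
    remains represented in the sample."""
    rng = random.Random(seed)
    by_time = defaultdict(list)
    for doc, t in docs:
        by_time[t].append((doc, t))
    sample = []
    for group in by_time.values():
        n = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, n))
    return sample
```

For a time point with 100 documents this keeps 10 of them, while a time point with a single document always contributes that one document.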
We start by comparing the results obtained using the estimated TWF with the results
obtained when using the statistically defined TWF, considering the ACM-DL and MEDLINE
datasets. We stress here that the results obtained when estimating the TWF using each of
the three classifiers (line 4 of Algorithm 2) were statistically equivalent, as could be expected
from the observed similarities in Figure 5.3. Thus, we report just the results obtained by
estimating the TWF using the Rocchio classifier. As we shall see, the use of the estimated
TWF led to results statistically equivalent to the ones obtained when using the original
definition of the TWF. Then, we compare the effectiveness of the traditional and the
temporally-aware classifiers when applied to the AG-NEWS dataset (since the conclusions
drawn in the previous section, regarding ACM-DL and MEDLINE, also hold here).
An important aspect to be observed in Tables 5.7 and 5.8 is that using the estimated
TWF led to results statistically equivalent to those obtained with the statistically defined
TWF. This was assessed by a 2-tailed paired t-test, with 99% confidence level. In fact, there
is an interesting similarity between the distribution of temporal distances used to determine
the TWF expression, illustrated in Figure 5.1, and the estimated TWFs for both datasets,
illustrated in Figures 5.3a and 5.3b. This implies that we can adopt the automated procedure
to determine the TWF without affecting the effectiveness of the temporally-aware algorithms.
Furthermore, the same discussion presented in the previous section, regarding the
quantitative analysis of the behavior of the temporally-aware classifiers w.r.t. each dataset,
also holds here. Similarly, the poor performance of both “in scores” versions of the Naïve
Bayes classifier is again attributed to the class imbalance problem (for the TWF in scores
strategy) and to the lack of training documents associated with each class (for the extended
TWF in scores strategy), just as before.

Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        57.39      68.24       58.48      71.84       57.27       73.24
TWF in documents (100% of D)    60.21      70.70       60.08      73.88       61.38       74.60
                                (+4.91)N   (+3.60)N    (+2.74)N   (+2.84)N    (+7.18)N    (+1.86)•
TWF in documents (10% of D)     60.52      70.88       61.02      74.27       61.44       74.24
                                (+5.45)N   (+3.87)N    (+4.84)N   (+3.82)N    (+7.28)N    (+1.36)•
TWF in scores (100% of D)       60.47      72.90       61.88      74.53       45.16       64.55
                                (+5.47)N   (+6.83)N    (+5.81)N   (+3.74)N    (-26.82)H   (-13.46)H
TWF in scores (10% of D)        59.68      72.40       61.37      73.77       44.47       64.58
                                (+3.99)N   (+6.10)N    (+4.94)N   (+2.69)N    (-28.78)H   (-13.41)H
TWF in scores ext. (100% of D)  59.96      71.99       59.80      73.95       56.28       72.73
                                (+4.48)N   (+5.49)N    (+2.26)N   (+2.94)N    (-1.76)•    (-0.70)•
TWF in scores ext. (10% of D)   59.85      71.79       59.76      73.85       56.19       72.70
                                (+4.29)N   (+5.20)N    (+2.19)N   (+2.80)N    (-1.89)•    (-0.74)•

Table 5.7: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—ACM-DL.

Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        54.26      69.27       72.49      82.86       64.61       80.82
TWF in documents (100% of D)    54.03      69.48       73.96      82.76       67.95       82.98
                                (-0.43)•   (+0.30)•    (+2.03)N   (-0.12)•    (+5.17)N    (+2.67)N
TWF in documents (10% of D)     55.01      70.35       73.63      82.87       67.84       82.89
                                (+1.38)•   (+1.56)•    (+1.57)•   (+0.01)•    (+5.00)N    (+2.56)N
TWF in scores (100% of D)       64.47      77.12       75.99      86.33       58.20       80.48
                                (+18.82)N  (+11.33)N   (+4.83)N   (+4.19)N    (-9.92)H    (-0.42)•
TWF in scores (10% of D)        64.25      77.03       75.88      86.36       58.23       80.51
                                (+18.41)N  (+11.20)N   (+4.68)N   (+4.22)N    (-9.87)H    (-0.38)•
TWF in scores ext. (100% of D)  64.53      77.16       74.63      85.07       64.64       81.12
                                (+18.93)N  (+11.39)N   (+2.95)N   (+2.67)N    (-0.05)•    (+0.37)•
TWF in scores ext. (10% of D)   64.32      77.24       74.74      84.99       64.74       81.10
                                (+18.54)N  (+11.51)N   (+3.10)N   (+2.57)N    (+0.20)•    (+0.35)•

Table 5.8: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—MEDLINE.
Another important aspect to be observed in Tables 5.7 and 5.8 is that it does not matter
whether the entire training set D or just 10% of it is used to learn the TWF. In fact, both
cases led to statistically equivalent results (assessed by a 2-tailed paired t-test with 99%
confidence) in all cases. This is an important property of Algorithm 2, since the smaller the
training set (that is, its input), the smaller the expected runtime to learn the TWF.
Algorithm                       Rocchio                KNN                    Naïve Bayes
Metric                          macF1(%)   micF1(%)    macF1(%)   micF1(%)    macF1(%)    micF1(%)
Baseline                        54.89      58.16       60.05      68.49       60.92       67.83
TWF in documents (100% of D)    58.34      62.34       58.96      67.45       62.24       68.46
                                (+6.29)N   (+7.19)N    (-1.85)•   (-1.54)•    (+2.17)N    (+0.93)•
TWF in documents (10% of D)     58.35      62.29       58.91      67.35       62.38       68.55
                                (+6.30)N   (+5.68)N    (-1.90)•   (-1.66)•    (+2.40)N    (+1.06)•
TWF in scores (100% of D)       57.82      66.26       58.36      64.94       51.65       61.91
                                (+5.34)N   (+13.93)N   (-2.90)H   (-5.47)H    (-15.22)H   (-8.73)H
TWF in scores (10% of D)        58.01      66.30       58.15      64.84       51.69       61.97
                                (+5.68)N   (+14.00)N   (-3.16)H   (-5.33)H    (-15.15)H   (-8.64)H
TWF in scores ext. (100% of D)  57.72      66.12       59.12      68.93       56.43       65.21
                                (+5.16)N   (+13.69)N   (-1.57)•   (+0.64)•    (-7.37)H    (-3.86)H
TWF in scores ext. (10% of D)   57.69      65.99       59.08      68.77       56.47       65.22
                                (+5.10)N   (+13.46)N   (-1.61)•   (+0.41)•    (-7.30)H    (-3.85)H

Table 5.9: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes—AG-NEWS.
We now turn our attention to the AG-NEWS dataset. Similarly to the ACM-DL and
MEDLINE datasets, the temporally-aware versions of the Rocchio classifier present the most
significant improvements over the baseline, with gains of up to 6.29% and 14.00% for
MacroF1 and MicroF1, respectively. As in the MEDLINE dataset, both “in scores” versions of
the Rocchio classifier performed better than its “in documents” version. This is due to the
nature of this dataset, evidenced by the quantitative analysis reported in Chapter 4. In fact,
this dataset presents more prominent variations in the class distribution (CD) and the class
similarities (CS) than in the term distribution (TD): as reported in Section 4.4.1, “the impact
of the TD effect is consistently lower than the impact of CD (or CS) on all four algorithms” in
this dataset. However, unlike the temporally-aware versions of Rocchio, the “in documents”
versions of KNN and Naïve Bayes were statistically tied with their baselines (with the Naïve
Bayes with TWF in documents being an exception, with statistically significant gains in
MacroF1 of up to 2.40%). This is justified by the smaller extent of the TD effect in the
AG-NEWS dataset. Accordingly, their “in scores” versions should perform better, but this
was not observed. Indeed, both the KNN and Naïve Bayes with TWF in scores led to
significant losses in both MacroF1 and MicroF1. Again, we attribute this to both the class
imbalance and the data scarcity problems. In order to provide evidence for this problem, in
Figure 5.8 we show the histogram of the 〈c, p〉 sizes (as done with the other two datasets). In
fact, we can observe that 72% of the 〈c, p〉 sizes are smaller than 200 (with 46% of the 〈c, p〉
classes composed of at most 100 documents), with CV equal to 1.85, a much more skewed
and sparse distribution than that of the other two datasets.
Figure 5.8: Relative 〈c, p〉 Sizes for the AG-NEWS Dataset.
Recall that the “extended in scores” strategy aims at minimizing the influence of the
class imbalance problem on classification effectiveness. In fact, this strategy outperformed
the “in scores” strategy in all cases. However, it was not able to outperform the
baselines. This is due to the previously discussed data scarcity problem, which is, incidentally,
more pronounced in the AG-NEWS dataset than in the other two datasets. In contrast
to the improvements obtained by the “extended in scores” version of the KNN classifier in the
MEDLINE dataset, in AG-NEWS it resulted in statistical ties. Furthermore, in contrast
to the statistical ties obtained by the extended in scores version of the Naïve Bayes classifier,
in the AG-NEWS dataset it produced statistically significant losses. We conjecture that the
adoption of strategies to overcome the data scarcity problem may improve the effectiveness
of this strategy. Again, we leave this matter for future work.
Finally, we also compared the best temporally-aware classifiers to the state-of-the-art
Support Vector Machine (Joachims, 1999) classifier. We adopted an efficient SVM
implementation, SVM_Perf (Joachims, 2006), which is based on the maximum-margin approach
and can be trained in linear time. We used a one-against-all methodology (see Manning et al., 2008)
to adapt the binary SVM to multi-class classification, since, as presented in Section 4.1,
the explored datasets have more than two classes. This comparison is presented
in Table 5.10. For the ACM-DL dataset (Table 5.10a), the significant gains were 3.29% and
2.45% in MacroF1 (with statistical ties in MicroF1), for the KNN with TWF in scores and the
Naïve Bayes with TWF in documents, respectively. Furthermore, both the Rocchio with TWF in
scores and the KNN with TWF in documents obtained results statistically equivalent to the SVM
results. In all these three cases, the temporally-aware classifiers were faster than the SVM by
more than an order of magnitude. For the MEDLINE dataset (Table 5.10b), the most significant gains
were 2.03% and 2.48% in MacroF1 and MicroF1, respectively, obtained by the KNN with TWF
in scores. The extended in scores version of KNN achieved statistically tied results. As we
shall discuss in Section 5.4.4, both classifiers are significantly faster than SVM. Considering
that SVM is a state-of-the-art classifier, and that both datasets are imbalanced, our results
evidence the quality of the proposed solution. Considering the AG-NEWS dataset (Table 5.10c),
the best performing temporally-aware classifier was unable to outperform the SVM due to
the limitations already discussed. However, it is worth noting that our temporally-aware
classifier was not drastically outperformed, and there is still room for improvement.
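The one-against-all adaptation mentioned above can be sketched generically as follows. This is an illustrative wrapper, not the SVM_Perf setup used in the experiments: `make_binary` stands for any procedure that fits a binary scoring function (e.g. a linear SVM decision function), and the class whose scorer yields the highest score wins.

```python
class OneVsRest:
    """Generic one-against-all wrapper: one binary scorer is trained per
    class (positives = the class, negatives = all the others), and the
    predicted class is the one whose scorer yields the highest score."""
    def __init__(self, make_binary):
        self.make_binary = make_binary  # callable: (X, labels) -> scorer
        self.scorers = {}

    def fit(self, X, y):
        for c in set(y):
            # relabel the multi-class problem as "c versus the rest"
            labels = [1 if yi == c else -1 for yi in y]
            self.scorers[c] = self.make_binary(X, labels)
        return self

    def predict(self, x):
        return max(self.scorers, key=lambda c: self.scorers[c](x))
```

Any binary learner can be plugged in; for instance, a toy scorer that dots the input with the centroid of the positive examples already yields a working multi-class classifier.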
5.4.4 Runtime Analysis
Now we turn our attention to the efficiency of our proposed classifiers, in terms of execution
time. We start by considering the average time spent by the classifiers in each iteration of
the K-Fold cross validation, and comparing the temporally-aware algorithms with both their
traditional counterparts and the state of the artSupport Vector Machine(Joachims, 1999)
classifier (using the previously described SVM_Perf implementation). Next, we analyze the
additional cost associated with the automatic determination of the TWF.
We report in Table5.11the average execution time of each traditional classifier (rows
entitled “traditional”, including the measurement for the SVM classifier), their temporally-
aware versions (lines “In Documents”, “In Scores” and “Extended In Scores”), along with
the standard deviation over the mean value (reported after the± symbol). We consider the
execution time measured for the overall classification task, comprised by both the training
and test stages.1 The measurements regarding the ACM-DL dataset refers to theclassifica-
tion of 2490 documents using19918 training documents, while the measurements regard-
ing the MEDLINE dataset refers to the testing of86145 documents with a classification
model learned from689163 training documents. Finally, the measurements regarding the
AG-NEWS dataset refers to the classification of83580 documents based on a classification
model learned from668636 documents. Clearly, the columns of this table are not compara-
ble, and we consider the measurements for each dataset independently.
1Recall that in our experimental setup, using the 10-fold cross validation strategy, one fold is used as testset, another one is retained as the validation set and the remaining folds are used as training set.
(a) ACM-DL Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  59.91       73.88
Rocchio with TWF in scores           60.47       72.90
                                     (+0.93)•    (-1.34)•
KNN with TWF in documents            59.78       73.88
                                     (-0.22)•    (+0.00)•
KNN with TWF in scores               61.88       74.53
                                     (+3.29)N    (+0.88)•
Naïve Bayes with TWF in documents    61.38       74.60
                                     (+2.45)N    (+0.97)•

(b) MEDLINE Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  74.48       84.24
KNN with TWF in scores               75.99       86.33
                                     (+2.03)N    (+2.48)N
KNN with TWF in scores ext.          74.63       85.07
                                     (+0.20)•    (+0.98)•

(c) AG-NEWS Dataset

Algorithm                            macF1(%)    micF1(%)
SVM                                  64.94       72.59
Naïve Bayes with TWF in documents    62.38       68.55
                                     (-4.10)H    (-5.56)H

Table 5.10: Effectiveness Comparison: Best Performing Temporally-Aware Classifiers versus SVM.
As one could expect, our temporally-aware classifiers are typically slower than their
traditional counterparts. This comes as no surprise, since there is the overhead of considering
and managing the temporal information. Furthermore, the temporally-aware classifiers are,
by nature, lazy classifiers, which comes at the cost of a higher runtime. The temporally-aware
versions incurred the largest increase in execution time in the AG-NEWS dataset, due to the
higher number of points in time in this dataset. However, in almost all cases our lazy
temporally-aware classifiers were more efficient, in terms of execution time, than the SVM
classifier.
Algorithm                          Runtime (seconds) per Dataset
                                   ACM-DL         MEDLINE            AG-NEWS
SVM         Traditional            144.10±5.30    26955.0±2356.0     28667.0±1151.0
Rocchio     Traditional            2.00±0.00      111.0±0.0          96.5±0.5
            In Documents           6.60±0.52      209.5±12.5         4615.5±89.5
            In Scores              9.00±0.00      300.5±3.5          5287.5±9.5
            Extended In Scores     7.20±0.42      263.5±0.5          3807.0±29.0
KNN         Traditional            8.90±0.32      13442.5±79.5       8154.0±60.0
            In Documents           11.03±0.48     15949.0±51.0       10368.5±8.5
            In Scores              10.10±0.31     12557.5±630.5      8630.5±349.5
            Extended In Scores     8.40±0.75      7753.5±78.5        4711.5±46.5
Naïve Bayes Traditional            5.00±0.00      213.0±7.0          186.5±0.5
            In Documents           9.10±0.32      293.0±2.0          3780.0±95.0
            In Scores              63.80±1.32     1311.0±1.0         43570.0±85.0
            Extended In Scores     60.50±1.18     656.5±6.5          38966.5±108.5

Table 5.11: Execution Time (in seconds) of each Explored ADC Algorithm.

We also compared the best versions of the previously proposed methods, for example,
the KNN with TWF in scores and the Naïve Bayes with TWF in documents, to the SVM
classifier, in terms of efficiency (execution time). This comparison is presented in Table 5.12. For
the ACM-DL dataset (Table 5.12a), our best performing classifiers were up to 13 times faster
than SVM while, at least, matching the SVM effectiveness (as reported in Table 5.10a). For
the MEDLINE dataset (Table 5.12b), the KNN with TWF in scores was more than two times
faster than SVM and, as previously reported, outperformed this classifier in both MacroF1
and MicroF1. The extended in scores version of KNN was more than three times
faster than SVM. Considering the AG-NEWS dataset (Table 5.12c), our best performing
temporally-aware classifier was almost eight times faster than the SVM (but unable to
outperform it in terms of effectiveness).
Finally, we now consider the efficiency of the TWF determination algorithm. Recall
from Algorithm 2 that it is necessary to perform a classification over the training set in order
to estimate the a posteriori probability distribution P(pi|di) and then determine the TWF.
There are two key aspects to be considered. First, since it does not matter which of the three
classifiers is used to learn the TWF (they led to statistically equivalent results), it is
advisable to use the Rocchio classifier for doing so, since it is, by far, the most efficient one
(as can be observed in Table 5.11, by comparing the traditional versions of each of them).
Second, it is clear that as the training set size increases, the cost involved in determining the
TWF also increases and can be potentially prohibitive. To better understand the dependency
between the execution time of the TWF determination and the training set size, we measured
the execution time spent to determine the TWF using the entire training set D and a per
point in time sample of D, obtained by randomly selecting 10% of the documents of D. We
then compared these measurements with the time spent by the fastest temporally-aware
classifiers and the SVM classifier, for each explored dataset. This comparison is reported in
Table 5.13. As we can observe, the time required to automatically learn the TWF is negligible
when compared to the time spent on the classification task. In addition to this efficiency
aspect, the TWF determination is inherently an offline procedure, guaranteeing its practical
applicability.

(a) ACM-DL Dataset

Algorithm                            Time (s)
SVM                                  144.10±5.30
Rocchio with TWF in scores           9.00±0.00
KNN with TWF in documents            11.03±0.48
KNN with TWF in scores               10.10±0.31
Naïve Bayes with TWF in documents    9.10±0.32

(b) MEDLINE Dataset

Algorithm                            Time (s)
SVM                                  26955.0±2356.0
KNN with TWF in scores               12557.5±630.5
KNN with TWF in scores ext.          7753.5±78.5

(c) AG-NEWS Dataset

Algorithm                            Time (s)
SVM                                  28667.0±1151.0
Naïve Bayes with TWF in documents    3780.0±95.0

Table 5.12: Execution Time Comparison: Best Performing Temporally-Aware Classifiers versus SVM.
Dataset     TWF Determination Runtime (s)       Fastest Temporally-Aware Classifier
            10% of D        Entire D
ACM-DL      0.77±0.02       4.49±0.04           6.60±0.52 (Rocchio in documents)
MEDLINE     31.00±3.00      180.00±28.00        209.50±12.50 (Rocchio in documents)
AG-NEWS     120.00±8.00     1560.00±25.00       3780.00±95.00 (Naïve Bayes in documents)

Table 5.13: Execution Time of the TWF Estimation using the Rocchio Classifier.
5.5 Chapter Summary
In this chapter, we discussed the impact that temporal effects may have on ADC, and
proposed new instance weighting strategies that lead to more accurate classification models.
We started by proposing a methodology to model a temporal weighting function (TWF) that
captures changes in term-class relationships over a given period of time. For the ACM-DL
and MEDLINE datasets, we showed that the TWF follows a lognormal distribution, whose
parameters may be easily determined using statistical methods (see Section 5.1). For the
AG-NEWS dataset, on the other hand, we showed that the same hypothesis testing procedures
adopted for the ACM-DL and MEDLINE datasets failed, implying that its associated
TWF follows a distinct (yet unknown) distribution. Thus, assessing the TWF associated to the
AG-NEWS dataset requires some other (possibly more complex) statistical tests, motivating
the development of a strategy to determine the TWF that avoids the need to perform
such tests. As a matter of fact, for the sake of temporally-aware ADC, one just needs
to know the positive real valued weights associated to each temporal distance δ. In
Section 5.2 we described such a strategy, which uses the ADC algorithms themselves to determine
the mapping δ → R+. This is accomplished by estimating the a posteriori probability
distribution P(pi|di) and gathering the relative frequencies of the temporal distances δ between
the predicted point in time pi and the actual point in time di.p in which di was created.
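The estimation step just summarized can be sketched as below. This is an illustrative reading of the procedure under assumed interfaces (the `classify_time` callable is a hypothetical stand-in for a classifier, such as Rocchio, trained to predict a document's point in time), not the exact Algorithm 2.

```python
from collections import Counter

def estimate_twf(train_docs, classify_time):
    """train_docs: list of (document, actual_creation_time) pairs.
    For each training document we take the temporal distance delta
    between the predicted and the actual creation time; the TWF is
    then given by the relative frequency of each observed delta."""
    deltas = Counter(abs(classify_time(doc) - created)
                     for doc, created in train_docs)
    total = sum(deltas.values())
    relative = {delta: count / total for delta, count in deltas.items()}
    # unseen temporal distances receive zero weight
    return lambda delta: relative.get(delta, 0.0)
```

The returned function maps each temporal distance δ to a positive real weight, which is exactly what the temporally-aware classifiers consume.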
We then presented our three strategies to incorporate the TWF into classifiers: TWF
in documents, TWF in scores, and an extended version of the TWF in scores strategy. The
TWF in documents strategy weights each training document by the TWF according to its
temporal distance to the test document. The TWF in scores strategy, in contrast, learns
the scores for each class c by using a traditional ADC algorithm to first learn scores for
the derived classes 〈c, p〉, and then aggregating these scores using the TWF to weight them
(that is, score_c = Σ_p score_〈c,p〉 · TWF(δ)). Finally, the extended TWF in scores strategy partitions
the training documents into sub-groups of documents with the same creation point in time
(and thus without temporal variability in the term-class relationships), learns a
classification model for each partition, and aggregates the generated scores using the TWF
to weight them.
The three strategies were implemented considering three traditional classifiers, namely
Rocchio, KNN, and Naïve Bayes. Results with the traditional versions of these classifiers
and the temporally-aware ones showed that considering temporal information significantly
improves the results of the traditional classifiers. We also showed that, even using only 10% of
the training set to automatically determine the TWF, we can accurately estimate it and achieve
results comparable to the ones obtained using the whole training set. This highlights
that, in addition to avoiding potentially complex hypothesis testing to determine the TWF,
this strategy demands quite a small additional cost, being usually performed in an offline
manner. Also, both the temporally-aware KNN and Naïve Bayes achieved better results than
SVM in the ACM-DL and MEDLINE datasets, with better runtime performance. Considering
that SVM is a state-of-the-art classifier, and that the explored datasets are imbalanced, our
results evidence the quality of our solution, coupled with an efficient implementation.
Chapter 6
Conclusions and Future Work
In this chapter we summarize the research contributions of this dissertation and point out
some directions for further investigation.
6.1 A Quantitative Analysis of Temporal Effects on
ADC
In this work, we proposed a methodology, based on a series of full factorial designs, to
evaluate the impact of temporal effects on ADC algorithms when applied to distinct textual
datasets. First, we extended the characterization performed by Mourão et al. (2008),
providing evidence of the existence of three temporal effects in three textual datasets, namely
ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to quantify
the impact of the temporal aspects on the classification effectiveness of four well-known
ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.
Our characterization results show that, contrary to the assumption of a static data
distribution on which most ADC algorithms rely, each reference dataset has a specific
temporal behavior, exhibiting changes in the underlying data distribution over time. Such
temporal variations potentially limit classification performance. According to our results,
the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE
dataset, implying that the four explored ADC algorithms would be more impacted by the
temporal aspects in the first two datasets. In addition to these findings, our proposed
methodology enabled us to quantify the impact of each temporal aspect on the analyzed datasets
and algorithms, allowing us to answer the two following questions, posed in Chapter 4:
1. Which temporal effects have more influence in each dataset? In the ACM-DL dataset,
the impact of the observed temporal variations in the distribution of class sizes and in
the pairwise class similarities is statistically equivalent to the impact of the observed
variations in the term distribution for most classifiers (SVM being an exception).
MEDLINE and AG-NEWS, on the other hand, are clearly more impacted by the first two
temporal aspects. These findings reveal the challenges imposed by the temporal effects
and indicate that developing strategies to handle them in ADC algorithms is a promising
research direction.
2. What is the behavior of each ADC algorithm when faced with different levels of each
temporal aspect? All four explored ADC algorithms suffer a negative impact of the
temporal aspects in terms of classification effectiveness, with the most significant
impacts observed when these algorithms are applied to the most dynamic datasets
(i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to
the term distribution aspect, while still being impacted by the other two aspects. The
other three algorithms, on the other hand, are very sensitive to all three aspects. Thus,
the temporal dimension turns out to be an important aspect that has to be considered
when learning accurate classification models.
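The term-distribution aspect discussed above can be illustrated with a small sketch: compare a class's relative term frequencies at two points in time and take one minus their cosine similarity as a rough drift score. This is only an illustration of the kind of variation the characterization measures, not the factorial-design methodology itself; all documents and tokens below are invented.

```python
from collections import Counter
import math

def term_distribution(docs):
    """Relative term frequencies over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse term distributions."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical snapshots of one class at two points in time.
docs_t1 = [["neural", "network", "training"], ["network", "protocol", "routing"]]
docs_t2 = [["web", "search", "ranking"], ["search", "network", "ranking"]]

# Drift in [0, 1]: 0 means identical term usage, 1 means disjoint vocabularies.
drift = 1.0 - cosine(term_distribution(docs_t1), term_distribution(docs_t2))
```

A dataset like MEDLINE, whose vocabulary per class is comparatively stable, would yield low drift scores between consecutive periods, while ACM-DL and AG-NEWS would yield higher ones.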
6.2 Temporally-Aware Algorithms for ADC
Beyond quantifying the impact of the temporal effects on ADC algorithms, we proposed
strategies to minimize their impact in three well-known ADC algorithms, based on an instance weighting paradigm, to devise more accurate classification models. We started by
proposing a methodology to model a Temporal Weighting Function (TWF) that captures
changes in term-class relationships over a given period of time. For two of the three real
datasets explored, namely ACM-DL and MEDLINE, we showed that their TWFs follow
a lognormal distribution, whose parameters may easily be tuned using statistical methods.
On the other hand, the TWF associated with the AG-NEWS dataset does not follow a normal
distribution (even in the log-transformed space). Indeed, the straightforward tests for independence and normality of random variables failed, with 99% confidence, and other
(possibly more complex) tests should be performed. To guarantee the practical employment
of the temporally-aware classifiers, automated ways to determine the TWF are desirable. As
a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance. Thus, we also proposed a
fully-automated strategy to devise the TWF.
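As a rough illustration of the lognormal modeling, the sketch below treats a hypothetical table of raw TWF weights (one per temporal distance) as a normalized mass over distances, fits lognormal parameters by weighted moments of log(distance), and uses the fitted density as a smoothed TWF. This is a simplification for illustration only, not the exact estimation procedure of the dissertation; all weights are invented.

```python
import math

# Hypothetical observed TWF: temporal distance (e.g., in years) -> raw weight.
raw_twf = {1: 0.9, 2: 0.7, 3: 0.45, 4: 0.25, 5: 0.12, 6: 0.06}

# Treat the normalized weights as a probability mass over distances and
# fit lognormal parameters by weighted moments of log(distance).
total = sum(raw_twf.values())
probs = {d: w / total for d, w in raw_twf.items()}
mu = sum(p * math.log(d) for d, p in probs.items())
sigma = math.sqrt(sum(p * (math.log(d) - mu) ** 2 for d, p in probs.items()))

def twf(distance):
    """Smoothed TWF: lognormal density evaluated at a temporal distance > 0."""
    return (1.0 / (distance * sigma * math.sqrt(2 * math.pi))
            * math.exp(-(math.log(distance) - mu) ** 2 / (2 * sigma ** 2)))
```

With the decaying weights above, the fitted curve gives recent training documents (small distances) substantially more influence than old ones, which is the behavior the TWF is meant to encode.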
In order to incorporate the TWF into classifiers, we proposed three approaches: TWF
in documents, TWF in scores, and the extended TWF in scores. TWF in documents weights
each training document by the TWF according to its temporal distance to the test document.
TWF in scores, in contrast, takes into account scores produced by a traditional classifier
applied to a modified training set where the class c of each training document is mapped to
a derived class c ↦ 〈c, p〉, with p denoting the training document's creation point in time,
ultimately tying the observed patterns to both the class and temporal information.
A weighted sum of the learned scores is then performed, according to the TWF. Finally,
the extended TWF in scores partitions the training documents into sub-groups of documents
with the same creation point in time (and thus without temporal variability in the term-class
relationships), each including documents of all classes, learns a classification model
for each partition, and aggregates the generated scores using the TWF to weight them. These
strategies were incorporated into three traditional classifiers, namely Rocchio, KNN, and Naïve
Bayes.
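The "TWF in documents" idea can be sketched in a few lines: each training document votes for its class with a similarity score scaled by the TWF of its temporal distance to the test document. The TWF table, the toy term-overlap similarity, and all documents below are hypothetical stand-ins for the real TWF and classifier similarities.

```python
from collections import defaultdict

def twf(distance):
    """Hypothetical precomputed TWF: temporal distance (years) -> weight."""
    table = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.25, 4: 0.1}
    return table.get(abs(distance), 0.05)

def overlap(doc_a, doc_b):
    """Toy similarity: number of shared terms."""
    return len(set(doc_a) & set(doc_b))

def classify(train, test_doc, test_year):
    """'TWF in documents' sketch: every training document's vote is its
    similarity to the test document scaled by the TWF of their temporal
    distance; the class with the highest accumulated score wins."""
    scores = defaultdict(float)
    for doc, label, year in train:
        scores[label] += twf(test_year - year) * overlap(doc, test_doc)
    return max(scores, key=scores.get)

train = [
    (["markup", "hypertext"], "web", 1998),
    (["ranking", "search"], "web", 2004),
    (["protein", "sequence"], "bio", 2003),
]
label = classify(train, ["search", "ranking", "web"], 2005)
```

Note how the 1998 document, although in the right class, contributes almost nothing to the decision: temporally distant evidence is discounted before it can mislead the model.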
Results with the traditional versions of these classifiers and the temporally-aware ones
showed that considering temporal information significantly improves the results of the traditional classifiers. We also studied the impact of estimating the TWF and incorporating it into
the classifiers, both in terms of effectiveness and efficiency. Two important aspects were discussed. First, all three explored ADC algorithms provided an accurate TWF estimation.
Due to its efficiency and the similar results obtained when compared to the other classifiers, we
chose Rocchio to estimate the TWF. Second, sampling 10% of the training documents (on a
per-point-in-time basis) to learn the TWF provided the same gains in the temporally-aware
classifiers as using the whole training set. This further reduces the additional runtime cost
of the classification task. Also, both the temporally-aware KNN and Naïve Bayes
achieved more effective results than SVM, with better overall performance as well (i.e., considering the ACM-DL dataset, our best performing classifiers were up to 13 times faster than
SVM). Considering that SVM is a state-of-the-art classifier, and that both collections are
very unbalanced, our results evidence the quality of our solution, coupled with an efficient
implementation.
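The per-point-in-time sampling used to cheapen TWF estimation amounts to stratified sampling on the creation time: every time point contributes the same fraction of its documents, so the temporal profile of the sample mirrors the full training set. The data layout below is hypothetical.

```python
import random
from collections import defaultdict

def sample_per_time_point(train, fraction=0.10, seed=42):
    """Draw the same fraction of documents from every creation time point,
    preserving the temporal profile of the full training set."""
    by_time = defaultdict(list)
    for doc in train:
        by_time[doc["year"]].append(doc)
    rng = random.Random(seed)
    sample = []
    for year, docs in by_time.items():
        k = max(1, round(fraction * len(docs)))  # at least one doc per point
        sample.extend(rng.sample(docs, k))
    return sample

# Hypothetical training set: 300 documents spread evenly over three years.
train = [{"id": i, "year": 2000 + (i % 3)} for i in range(300)]
subset = sample_per_time_point(train)
```

The TWF is then estimated on `subset` only, roughly a tenth of the data, which is where the reported reduction in runtime overhead comes from.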
6.2.1 Limitations
The proposed temporally-aware algorithms have some limitations and, consequently, there
is room for further improvements. These include:
Data Imbalance: As discussed, the "in scores" versions of our classifiers are sensitive
to the data imbalance observed when considering each derived class 〈c, p〉. Indeed, class imbalance is considered a challenge by the Data Mining community (Yang and Wu, 2006). This is a rather common scenario that arises due to several
factors, such as incomplete sampling of labeled data due to crawling problems, ephemeral
events, the high costs involved in labeling data, and so on. Strategies to handle these
cases are promising avenues for improving the effectiveness of the "in scores" strategy.
Data Scarcity: Another major technical challenge faced by the Data Mining community
relates to the scarcity of training data. In fact, both the "in scores" and the "extended
in scores" versions of our temporally-aware classifiers have their performance limited
by this problem, since the number of documents assigned to some class c and created
at a given point in time p may not be sufficient to learn accurate estimates. Again, strategies
to tackle this problem are good candidates to improve the effectiveness of both versions
of the temporally-aware classifiers.
6.3 Future Work
As future work, we intend to incorporate temporal information into the SVM classifier by
defining kernel functions that use the proposed TWF. We also plan to refine the TWF, which
can be further improved in at least two ways. First, it can be defined on a finer-grained basis,
in order to account for the potentially distinct evolving behavior of terms (that is, the TWF
may be refined to account not only for the temporal distances between documents but
also for each term in isolation). Second, as discussed in Section 2.3, the temporal
unit used to determine the documents' timeliness is defined according to the domain to which
the temporally-aware classifiers are applied. This is done in a purely qualitative fashion. A
well-established way to define the temporal unit is thus highly desirable, and a promising strategy for doing so is Formal Concept Analysis (FCA) (Ganter and Wille, 1999). FCA is a
well-studied mathematical framework that is able to uncover implicit relationships between
objects and their attributes, ultimately deriving ontologies (Ganter et al., 2005; Wille, 2005).
This framework is widely used in concept classification and knowledge management. Considering our temporally-aware strategies, one can use FCA to automatically determine
temporal periods to be used as temporal units, instead of determining them purely qualitatively. With such a strategy, one can identify semantically meaningful groups
of documents that share some underlying data distribution, invariant over time, which
can thus be exploited to infer a proper temporal granularity in a fully-automated manner.
Another aspect that can be further improved relates to memory and time efficiency.
Nowadays, very large databases are becoming ever more common. Several organizations
have to maintain databases that grow without limit, at a surprisingly fast rate. Clearly,
the classification of such data streams brings challenging problems, such as
hard memory/time constraints. In fact, mining high-speed drifting data streams is a topic that
continuously receives attention from the Data Mining community. While our classifiers
are still able to provide high-quality classification with execution times much smaller than
the state-of-the-art SVM classifier, the assessment of the test document's creation point in time before
learning the classification model (that is, the lazy nature of our classifiers) may prevent the
applicability of the temporally-aware classifiers in such high-speed streaming scenarios. The
definition of non-lazy strategies for ADC that can take advantage of temporal information
in a memory/time-efficient way (e.g., by incrementally adjusting the classification model according to the observed variations in the underlying data distribution) is a promising research
direction.
Factors other than the documents' timeliness may also be exploited in the construction of more effective classification models. Indeed, we have already achieved some interesting results when exploiting the underlying citation and authorship networks extracted
from the ACM-DL dataset (de M. Palotti et al., 2010), and further investigation on this
matter may be valuable. For example, tying together the information gathered from these
networks with the documents' timeliness may be an interesting research direction.
Finally, in a classification framework, not only the learning step may be affected by
the temporal dynamics of data, but also some of the data pre-processing steps, such as feature selection and data sampling. For example, since several ADC algorithms are affected
by the class imbalance problem, where some classes are more representative than others,
it is a common strategy to pre-process the data in order to provide more balanced training
sets. The usual way to balance the class distribution in training data consists of oversampling the smaller classes or undersampling the larger ones. However, to the best of our
knowledge, none of the already proposed strategies for data balancing handles the temporal
dimension. Thus, we plan to further study the impact of the temporal dynamics on class
balancing strategies. Furthermore, we consider that strategies for feature selection may be
improved by considering the evolving behavior of terms (for example, considering not only
the predictive power of terms, but also their temporal stability). This may reveal effective
approaches to further improve such data processing strategies and, ultimately, lead to more
accurate classification models.
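As a reference point for this future direction, the sketch below shows the usual temporally-unaware undersampling baseline: every class is cut down to the size of the smallest one, with no regard for when documents were created, which is precisely the gap a temporally-aware balancing strategy would address. Class names and sizes are invented.

```python
import random
from collections import defaultdict

def undersample(train, seed=7):
    """Baseline class balancing: undersample every class to the size of the
    smallest one. The draw ignores creation time entirely, so the balanced
    set may over-represent some time periods within a class."""
    by_class = defaultdict(list)
    for doc, label in train:
        by_class[label].append((doc, label))
    floor = min(len(docs) for docs in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for docs in by_class.values():
        balanced.extend(rng.sample(docs, floor))
    return balanced

# Hypothetical skewed training set: 90 "big" documents vs. 10 "small" ones.
train = ([("d%d" % i, "big") for i in range(90)]
         + [("d%d" % i, "small") for i in range(10)])
balanced = undersample(train)
```

A temporally-aware variant would additionally stratify the draw by creation time within each class, so that balancing does not silently distort the temporal profile the TWF relies on.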
Bibliography
Alonso, O., Gertz, M., and Baeza-Yates, R. (2007). On the value of temporal information in information retrieval. SIGIR Forum, 41(2):35–41.
Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology Behind Search. Addison-Wesley, Boston, MA.
Bifet, A. and Gavaldà, R. (2006). Kalman filters and adaptive windows for learning in data streams. In Discovery Science, pages 29–40, Barcelona, Spain.
Bifet, A. and Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM International Conference on Data Mining, pages 443–448, Minneapolis, USA.
Breiman, L. and Spector, P. (1992). Submodel Selection and Evaluation in Regression - the X-Random Case. International Statistical Review, 60(3):291–319.
Caldwell, N. H. M., Clarkson, P. J., Rodgers, P. A., and Huxor, A. P. (2000). Web-based knowledge management for distributed design. IEEE Intelligent Systems, 15(3):40–47.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, E., Lin, Y., Xiong, H., Luo, Q., and Ma, H. (2011). Exploiting probabilistic topic
models to improve text categorization under class imbalance. Information Processing &
Management, 47(2):202–214.
Clarkson, D. B., Fan, Y.-a., and Joe, H. (1993). A remark on algorithm 643: FEXACT: an algorithm for performing Fisher's exact test in r × c contingency tables. ACM Transactions on Mathematical Software, 19(4):484–488.
Cohen, W. W. and Singer, Y. (1999). Context-sensitive learning methods for text categoriza-
tion. ACM Transactions on Information Systems, 17(2):141–173.
Crow, E. L. and Shimizu, K., editors (1988). Lognormal Distributions: Theory and Applications. Dekker, New York, NY.
D'Agostino, R. B. and Pearson, E. S. (1973). Tests for departure from normality. Biometrika, 60:613–622.
de Lima, E. B., Pappa, G. L., de Almeida, J. M., Gonçalves, M. A., and Meira Jr., W. (2010). Tuning genetic programming parameters with factorial designs. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 1–8, Barcelona, Spain.
de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. (2010). Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454.
Dries, A. and Rückert, U. (2009). Adaptive concept drift detection. Statistical Analysis and Data Mining, 2(5-6):311–327.
Drummond, C. (2006). Discriminative vs. generative classifiers for cost sensitive learning.
In Canadian Conference on AI, pages 479–490, Québec, Canada.
Fdez-Riverola, F., Iglesias, E., Díaz, F., Méndez, J., and Corchado, J. (2007). Applying
lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with
Applications, 33(1):36–48.
Folino, G., Pizzuti, C., and Spezzano, G. (2007). An adaptive distributed ensemble approach to mine concept-drifting data streams. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pages 183–188, Patras, Greece.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classi-
fication. Journal of Machine Learning Research, 3:1289–1305.
Forman, G. (2006). Tackling concept drift by temporal inductive transfer. In Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 252–259, Washington, USA.
Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial Intelligence, pages 286–295, São Luís, Brazil.
Ganter, B., Stumme, G., and Wille, R., editors (2005). Formal Concept Analysis, Foundations and Applications, volume 3626 of Lecture Notes in Computer Science. Springer.
Ganter, B. and Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, Heidelberg.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning. Springer, New York, NY.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. Wiley-Interscience, New York, NY.
Jain, R. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley, New York, NY.
Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press.
Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, Philadelphia, USA.
Kelly, M. G., Hand, D. J., and Adams, N. M. (1999). The impact of changing populations on classifier performance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–371, San Diego, USA.
Kim, Y. S., Park, S. S., Deards, E., and Kang, B. H. (2004). Adaptive web document classification with MCRDR. In Proceedings of the International Conference on Information Technology: Coding and Computing, pages 476–480, Las Vegas, USA.
Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weight-
ing. Intelligent Data Analysis, 8(3):281–300.
Klinkenberg, R. and Joachims, T. (2000). Detecting concept drift with support vector machines. In Proceedings of the International Conference on Machine Learning, pages 487–494, Stanford, USA.
Klinkenberg, R. and Rüping, S. (2003). Concept drift and the importance of examples. In Text Mining - Theoretical Aspects and Applications, pages 55–78. Physica-Verlag.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1137–1143, Québec, Canada.
Kolter, J. Z. and Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble
method for tracking concept drift. Technical report, Department of Computer Science,
Georgetown University, Washington, USA.
Koren, Y. (2010). Collaborative filtering with temporal dynamics. Communications of the
ACM, 53:89–97.
Koychev, I. (2000). Gradual forgetting for adaptation to concept drift. In Proceedings of the ECAI Workshop on Current Issues in Spatio-Temporal Reasoning, pages 101–106, Berlin, Germany.
Kuncheva, L. I. and Žliobaite, I. (2009). On the window size for classification in changing environments. Intelligent Data Analysis, 13(6):861–872.
Lawrence, S. and Giles, C. L. (1998). Context and page analysis for improved web search.
IEEE Internet Computing, 2(4):38–46.
Lazarescu, M. M., Venkatesh, S., and Bui, H. H. (2004). Using multiple windows to track concept drift. Intelligent Data Analysis, 8(1):29–59.
Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341–352.
Lin, Z., Hao, Z., Yang, X., and Liu, X. (2009). Several SVM ensemble methods integrated with under-sampling for imbalanced data learning. In Proceedings of the International Conference on Advanced Data Mining and Applications, pages 536–544, Beijing, China.
Liu, A., Ghosh, J., and Martin, C. (2007). Generative oversampling for mining imbalanced datasets. In Proceedings of the International Conference on Data Mining, pages 66–72, Las Vegas, USA.
Liu, R.-L. and Lu, Y.-L. (2002). Incremental context mining for adaptive document classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 599–604, Edmonton, Canada.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY.
Miao, Y.-Q. and Kamel, M. (2011). Pairwise optimized Rocchio algorithm for text categorization. Pattern Recognition Letters, 32(2):375–382.
Mourão, F., Rocha, L., Araújo, R., Couto, T., Gonçalves, M., and Meira Jr., W. (2008). Understanding temporal aspects in document classification. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 159–170, Palo Alto, USA.
Nishida, K. and Yamauchi, K. (2007). Detecting concept drift using statistical testing. In Proceedings of the International Conference on Discovery Science, pages 264–269, Sendai, Japan.
Nishida, K. and Yamauchi, K. (2009). Learning, detecting, understanding, and predicting concept changes. In Proceedings of the International Joint Conference on Neural Networks, pages 283–290, Atlanta, USA.
Orair, G. H., Teixeira, C., Wang, Y., Meira Jr., W., and Parthasarathy, S. (2010). Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment, 3(2):1469–1480.
Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
Rocha, L., Mourão, F., Pereira, A., Gonçalves, M. A., and Meira Jr., W. (2008). Exploiting temporal contexts in text classification. In Proceedings of the International Conference on Information and Knowledge Engineering, pages 243–252, Napa Valley, USA.
Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. (2010a). Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212.
Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. (2010b). Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 307–314, Geneva, Switzerland.
Scholz, M. and Klinkenberg, R. (2007). Boosting classifiers for drifting concepts. Intelligent Data Analysis, 11(1):3–28.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Sun, A., Lim, E.-P., and Liu, Y. (2009). On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1):191–201.
Tan, S. (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4):667–671.
Tsymbal, A. (2004). The problem of concept drift: Definitions and related work. Technical
report, Department of Computer Science, Trinity College, Dublin, Ireland.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience, New York, NY.
Vaz de Melo, P. O., da Cunha, F. D., Almeida, J. M., Loureiro, A. A., and Mini, R. A. (2008). The problem of cooperation among different wireless sensor networks. In Proceedings of the International Symposium on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pages 86–91, Vancouver, Canada.
Žliobaite, I. (2009). Combining time and space similarity for small size learning under concept drift. In Proceedings of the International Symposium on Foundations of Intelligent Systems, pages 412–421, Prague, Czech Republic.
Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, USA.
Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101.
Wille, R. (2005). Formal concept analysis as mathematical theory of concepts and concept hierarchies. In Formal Concept Analysis, pages 1–33.
Yang, C. and Zhou, J. (2008). Non-stationary data sequence classification using online class priors estimation. Pattern Recognition, 41(8):2656–2664.
Yang, Q. and Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4):597–604.
Zhang, Z. and Zhou, J. (2010). Transfer estimation of evolving class priors in data stream classification. Pattern Recognition, 43(9):3151–3161.