TRANSCRIPT
Data Science in the UK and Examples of Academia-Industry Collaboration
Emine Yilmaz
Visiting Faculty Member, Sabanci University; Associate Professor and Turing Fellow, University College London (UCL)
The Turing Institute
• The UK's national data science center, founded in 2015
• Aims to build a world-leading research center in data science
• Aims to assemble expert staff in fields related to data science, such as computer science, mathematics, statistics, and the social sciences
• Founding universities: University of Cambridge, Edinburgh, Oxford, UCL, and Warwick
• Institute staff
  • Academics at the founding universities (fellows)
  • Doctoral students and post-docs working full-time at the institute
  • Visiting faculty members and interns
Turing Institute-Industry Partnership
• Industry partners: Lloyd's Register Foundation, Intel, HSBC
• Industry partners
  • Have a say in setting the institute's research priorities
  • Run joint projects with institute staff
  • Share data, so that work targets problems directly relevant to industry
  • Get priority access to training at Turing (masterclasses, seminars, workshops)
Joint Project with Elsevier
• Elsevier is an academic publishing platform
  • Many products, including Mendeley, ScienceDirect, and Scopus
• Goal: understand users' interests and deliver personalized results
• The collaboration was carried out through a post-doctoral researcher
  • The work was published in several journals and conferences
  • The research was also applied to Elsevier data
Elsevier Project: Research Topics
• Identifying users' interests
  • Topic modeling (LDA): currently the most popular method
  • Users' topics of interest are inferred from the documents they have previously read
• Two important problems that topic modeling cannot solve
  • Users' interests can change over time (dynamically)
  • Interests are hard to determine when users first start using the system
• The goal of the Elsevier project is to address these two problems
Topic Modeling: Latent Dirichlet Allocation (LDA)
• Each topic is a probability distribution over words
• Each document is a mixture over latent topics
• Each user profile is a probability distribution over topics, determined by the documents the user has visited
• To generate a document (generative model)
  • Choose a distribution over topics: $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
  • For each word in the document
    • Choose a topic at random from the distribution over topics: $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$
    • Choose a word at random from the chosen topic: $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$
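The generative steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the vocabulary, the topic-word matrix `beta`, the prior `alpha`, and the helper names are all hypothetical toy values chosen here, not taken from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document by following the LDA generative story."""
    theta_d = sample_dirichlet(alpha, rng)       # theta_d ~ Dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta_d, rng)     # z_{d,n} ~ Multinomial(theta_d)
        w = sample_categorical(beta[z], rng)     # w_{d,n} ~ Multinomial(beta_z)
        topics.append(z)
        words.append(vocab[w])
    return theta_d, topics, words


# Hypothetical toy setup: two topics over a four-word vocabulary.
vocab = ["gene", "protein", "neural", "network"]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits biology words
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits machine-learning words
]
alpha = [0.5, 0.5]
rng = random.Random(0)
theta_d, topics, words = generate_document(alpha, beta, vocab, 8, rng)
```

Inference in LDA runs this story in reverse: given only the observed words, it estimates the hidden `theta_d` and topic assignments.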
Dynamic and Social-Network-Based Topic Modeling
• Topic modeling does not perform well at modeling users' changing interests, or when users first start using the system
• A new modeling algorithm
  • The probability distributions in consecutive time intervals depend on one another
  • Friend information is used in the model
    • User u's interests: $\theta_{t,u}$
    • Interests of user u's friends: $\psi_{t,u}$
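To make the idea of dependent distributions concrete, here is a minimal sketch of one way a user's current interests could be tied to the previous time interval and to friends' interests. The slides do not give the exact formula, so everything here is an assumption for illustration: the function name `smoothed_interests`, the mixing weight `lam`, and the prior `strength` are hypothetical, and the estimate is simply the mean of a Dirichlet posterior whose prior is centered on a blend of the two earlier distributions.

```python
def smoothed_interests(topic_counts, prev_theta, friends_psi, lam=0.5, strength=10.0):
    """Posterior-mean estimate of a user's topic distribution at time t.

    Assumption (not from the slides): the Dirichlet prior for period t is
    centered on a blend of the user's previous interests and the friends'
    interests, with `strength` acting as a pseudo-count.
    """
    prior = [strength * (lam * p + (1.0 - lam) * f)
             for p, f in zip(prev_theta, friends_psi)]
    total = sum(topic_counts) + sum(prior)
    return [(c + a) / total for c, a in zip(topic_counts, prior)]


prev_theta = [0.7, 0.2, 0.1]    # hypothetical interests at time t-1
friends_psi = [0.2, 0.2, 0.6]   # hypothetical friends' interests at time t-1

# No activity in the current period: the estimate falls back on the prior.
cold_start = smoothed_interests([0, 0, 0], prev_theta, friends_psi)

# Thirty observations of topic 1 in the current period pull the estimate toward it.
active = smoothed_interests([0, 30, 0], prev_theta, friends_psi)
```

Under these assumptions, setting `lam=0.0` relies on friends alone, which is one way to handle a user with no history of their own.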
each document exhibits those topics (Blei, Ng, and Jordan 2003). Since the well-known topic models, PLSI (Probabilistic Latent Semantic Indexing) (Hofmann 1999) and LDA (Latent Dirichlet Allocation) (Blei, Ng, and Jordan 2003), were proposed, topic models with dynamics have been widely studied. These include the Dynamic Topic Model (DTM) (Blei and Lafferty 2006), Dynamic Mixture Model (DMM) (Wei, Sun, and Wang 2007), Topic over Time (ToT) (Wang and McCallum 2006), Topic Tracking Model (TTM) (Iwata et al. 2009), and more recently, Generalized Linear Dynamic topic model (Caballero and Akella 2015), the dynamic User Clustering Topic model (UCT) (Zhao et al. 2016), News and Twitter Interaction Topic Model (NTIT) (Hua et al. 2016), Dynamic Clustering Topic model (DCT) (Liang, Yilmaz, and Kanoulas 2016) and scaling-up dynamic model (Bhadury et al. 2016). All of these models except DCT aim at inferring documents' dynamic topic distributions rather than user clustering. Except UCT and DCT that work in the context of short text streams, most of the previous dynamic topic models work in the context of long text streams. To the best of our knowledge, none of the existing dynamic topic models has considered the problem of clustering users with collaborative information, e.g., followees' interests, in the context of short text streams.
Problem Formulation
The problem we address is to track users' dynamic interests and cluster them over time in the context of short text streams such that users in the same cluster at a specific point in time share similar interests. The dynamic user clustering algorithm is essentially a function $g$ that satisfies:

$$\mathbf{u}_t = \{u_1, u_2, \ldots, u_{|\mathbf{u}_t|}\} \xrightarrow{\;g\;} C_t = \{c_1, c_2, \ldots, c_Z\},$$

where $\mathbf{u}_t$ represents a set of users appearing in the stream up to time $t$, with $u_i$ being the $i$-th user in $\mathbf{u}_t$ and $|\mathbf{u}_t|$ the total number of users in the user set, while $C_t$ is the resulting set of clusters of users with $c_z$ being the $z$-th cluster in $C_t$ and $Z$ being the total number of clusters. We let $D_t = \{\ldots, \mathbf{d}_{t-2}, \mathbf{d}_{t-1}, \mathbf{d}_t\}$ denote the stream of documents generated by users in $\mathbf{u}_t$ up to time $t$ with $\mathbf{d}_t$ being the most recent set of short documents arriving at time period $t$. We assume that the length of a document $d$ in $D_t$ is no more than a predefined small length (for instance, 140 characters in the case of Twitter).
Method
In this section, we describe our proposed User Collaborative Interest Tracking topic model, UCIT, aiming at tracking users' and their followees' interests, and dynamically clustering them in the context of short text streams.

Overview
We use Twitter as our default setting of short text streams and provide an overview of our proposed UCIT model in Algorithm 1. Following (Liang, Ren, and de Rijke 2014; Zhao et al. 2016), we represent each user's interests by topics. Thus, the interests of each user $u \in \mathbf{u}_t$ at time period $t$ are represented as a multinomial distribution $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$ over topics. Here $Z$ is the total number of topics. The distribution $\theta_{t,u}$ is inferred by the UCIT model. To alleviate the sparsity problem of short texts, and by following recent work on the topic (Yan et al. 2013; 2015), we construct and represent documents by their biterms, i.e. word pairs in them (step 1 in Algorithm 1). Next, we propose a dynamic Dirichlet multinomial mixture user collaborative interest tracking topic model to capture each user's dynamic interests $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$ and their collaborative interests $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$ inferred from their followees $\mathbf{f}_{t,u}$, at time $t$, in the context of short text streams (step 2 in Algorithm 1). Here $\mathbf{f}_{t,u}$ is the set of all of user $u$'s followees at time $t$. Based on each user's multinomial distributions $\theta_{t,u}$ and $\psi_{t,u}$, we cluster users using K-means clustering (Jain 2010) (step 3 in Algorithm 1). With the time period $t$ moving forward, the clustering result changes dynamically.

Algorithm 1: Overview of the proposed UCIT model.
Input: A set of users $\mathbf{u}_t$ along with their tweets $D_t$
Output: Clusters of users $C_t$
1. Construct a collection of word-pairs $\mathbf{b}_{t,u}$ for each user $u$
2. Use the UCIT model to track each user's interests as $\theta_{t,u}$ and their collaborative interest as $\psi_{t,u}$
3. Cluster users based on each user's interest $\theta_{t,u}$ and their collaborative interest $\psi_{t,u}$

Figure 1: Graphical representation of our user interest tracking clustering topic model, UCIT. Shaded nodes represent observed variables.
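Steps 1 and 3 of Algorithm 1 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: step 2 (UCIT inference) is skipped entirely, the per-user vectors are made-up numbers, concatenating the two distributions into one feature vector is an assumption, and the farthest-point initialization is a simplification chosen here to keep the toy run deterministic.

```python
import math
from itertools import combinations


def extract_biterms(text):
    """Return all unordered word pairs (biterms) from one short document."""
    words = text.lower().split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]


def kmeans(points, k, iterations=20):
    """Plain K-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Step 1: biterms of one short text.
biterms = extract_biterms("deep learning for search")

# Step 3: hypothetical per-user vectors, theta_{t,u} concatenated with psi_{t,u}.
users = {
    "u1": [0.9, 0.1] + [0.8, 0.2],
    "u2": [0.8, 0.2] + [0.9, 0.1],
    "u3": [0.1, 0.9] + [0.2, 0.8],
    "u4": [0.2, 0.8] + [0.1, 0.9],
}
labels = kmeans(list(users.values()), k=2)
```

Rerunning the clustering on each period's vectors gives the dynamically changing clusters described above.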
User Collaborative Interest Tracking Model

Modeling interests over time. The goal of the UCIT topic model is to infer the dynamic topic distribution of each user, $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$, and the user's collaborative topic distribution, $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$, in short text streams at a given time $t$, and to dynamically cluster all users based on each user's $\theta_{t,u}$ and $\psi_{t,u}$ over time. Fig. 1 shows a graphical representation of our UCIT model.

Given a user $u$, to track the dynamics of their interests, we
Experimental Results: Data
• Dataset: 1375 Twitter users, their friends, and their friends' tweets
  • Users who registered up to May 2015
  • Most users have between 2 and 50 followers
• Quality evaluation based on predicting users' future actions
Sample Results
Experimental Results
[Figure: (a) Topical representation quality, measured by H-score, for LDA, AuthorT, DTM, TTM, ToT, UCITavg, UCITavg+ψ, and UCITψ. (b) Perplexity for LDA, AuthorT, DTM, TTM, ToT, GSDMM, and UCITψ.]
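For readers unfamiliar with the second metric, perplexity is the exponential of the average negative log-likelihood a model assigns to held-out words, so lower values are better. The sketch below uses the standard definition with made-up probabilities; the exact held-out protocol used in the experiments is not given in the slides.

```python
import math


def perplexity(word_probs):
    """Perplexity of held-out words, given the model's probability for each one."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))


# A model that spreads probability evenly over 1000 words scores about 1000.
uniform = perplexity([1 / 1000] * 50)

# A model that assigns each held-out word ten times more probability scores lower.
sharper = perplexity([1 / 100] * 50)
```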
Joint Project with Elsevier: Execution
• Elsevier assigned a manager to lead the project and a programmer to handle its implementation at Elsevier
• The post-doc working on the project spent one day a week at Elsevier, briefing Elsevier staff and coding alongside them
• Before being tried on Elsevier data, the project was tested on publicly available data (Twitter, etc.)
• UCL and Elsevier held periodic meetings
On Industry-Academia Collaborations
• Collaborations that contribute greatly to both sides
• Advantages for industry
  • Working with academics on problems that matter to them
  • The opportunity to learn and apply the latest techniques
  • Reorganizing infrastructure and building data collection infrastructure
• Advantages for academics
  • Access to datasets
  • A focus on real problems
  • The opportunity to test methods in real systems, with real users
Examples of Industry-Academia Collaboration
• Industry-funded post-doc and doctoral positions
• Part-time doctoral positions
• Faculty awards
• Joint master's thesis supervision with industry
• Consulting for industry
• Sabbaticals spent in industry
Summary
• The Turing Institute, its purpose, and its way of working
• An example industry-academia collaboration project with Elsevier
  • How does it work?
  • The problems studied
  • What results were obtained?
• Examples and advantages of industry-academia collaboration