AINL 2016: Dmitry Bugaychenko, Eugeny Malytin. Trend detection at OK
Uploaded by lidia-pivovarova, 23-Jan-2017

TRANSCRIPT

Page 1: Trend detection at OK

Dmitry Bugaychenko, Eugeny Malytin

Page 2: Trend detection at a glance

Page 3: Texts extraction

Input: raw user activity logs in JSON
Output: extracted text and metadata
In between:
- Unified data collection pipeline: Kafka + Hadoop + Samza
- Different types of objects: posts, photos, videos, comments
- Large volumes: 50 GB of raw data daily, 20 GB after extraction
- Initial filtering applied: documents that are too small are removed

Page 4: Language detection

Input: single extracted text
Output: text labeled with language
In between:
- Based on the open-source library https://github.com/optimaize/language-detector
- Math is built on top of trigram distributions, 70+ languages
- Custom language profiles added for Azerbaijani, Armenian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Uzbek: https://github.com/denniean/language_profiles
- Language distribution priors are important!
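The deck relies on the Java optimaize library; the idea behind trigram-based detection with per-language priors can be sketched in Python as follows. The profile-building and scoring scheme here is a simplification of what the real library does, and the prior weighting is an assumption.

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts of a padded, lowercased string."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def score(text, profile, prior=1.0):
    """Fraction of the text's trigrams found in a language profile,
    scaled by a per-language prior (the deck stresses priors matter)."""
    tg = trigrams(text)
    total = sum(tg.values())
    overlap = sum(c for g, c in tg.items() if g in profile)
    return prior * overlap / total

def detect(text, profiles, priors):
    """Pick the language whose profile scores highest."""
    return max(profiles, key=lambda lang: score(text, profiles[lang], priors.get(lang, 1.0)))
```

Profiles here are just trigram sets built from sample text, whereas the real library ships precomputed per-language distributions.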

Page 5: Tokenization and canonicalization

Input: text with language label
Output: token stream
In between:
- Apache Lucene Analyzers (tokenization, stop-word removal, stemming)
- Profiles for 23 languages available, including Russian, Armenian, Latvian
- Most ex-USSR languages still missing: Azerbaijani, Belarusian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Ukrainian, Uzbek, etc.
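Lucene's Analyzer chain is Java; as a rough stand-in, the tokenize → drop stop words → stem pipeline can be sketched in Python. The stop list and the toy suffix stemmer are illustrative only, not Lucene's per-language logic.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # illustrative English stop list

def analyze(text, stop_words=STOP_WORDS):
    """Tokenize, drop stop words, and apply a crude suffix stemmer
    (a stand-in for Lucene's per-language Analyzer chain)."""
    tokens = re.findall(r"\w+", text.lower())
    out = []
    for tok in tokens:
        if tok in stop_words:
            continue
        for suffix in ("ing", "es", "s"):  # toy stemming, not a real Porter stemmer
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out
```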

Page 6: Dictionary extraction

Input: corpus as a set of token streams
Output: word index (dictionary)
In between:
- Term frequency limits for inclusion
- Previous day's dictionary analyzed to keep indices for common tokens the same
- Large enough to capture multiple languages (1M+ entries)
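The index-stability trick above (common tokens keep yesterday's indices, new tokens get appended) can be sketched as follows; the `min_df` and `max_size` parameters are assumed stand-ins for the frequency limits the deck mentions.

```python
def build_dictionary(doc_freq, prev_dict, min_df=5, max_size=1_000_000):
    """Word index that keeps yesterday's indices stable for tokens still
    above the frequency limit, and appends new frequent tokens at the end."""
    # Keep surviving tokens at their old indices.
    index = {tok: i for tok, i in prev_dict.items() if doc_freq.get(tok, 0) >= min_df}
    next_id = max(index.values(), default=-1) + 1
    # Add new tokens, most frequent first, until the size cap.
    for tok, df in sorted(doc_freq.items(), key=lambda kv: -kv[1]):
        if tok not in index and df >= min_df and len(index) < max_size:
            index[tok] = next_id
            next_id += 1
    return index
```

Keeping indices stable means vectors from consecutive days stay comparable without re-vectorizing the whole history.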

Page 7: Vectorization

Input: token stream and dictionary
Output: sparse vector
In between:
- Raw term-frequency vectorization
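Raw term-frequency vectorization against a fixed dictionary is straightforward; a minimal sketch, representing the sparse vector as an index-to-count map:

```python
from collections import Counter

def vectorize(tokens, dictionary):
    """Raw term-frequency vectorization into a sparse {index: count} map;
    tokens outside the dictionary are dropped."""
    counts = Counter(tokens)
    return {dictionary[t]: c for t, c in counts.items() if t in dictionary}
```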

Page 8: Deduplication

Input: corpus as a set of vectors
Output: corpus with duplicates removed
In between:
- Cosine similarity as the measure (> 0.9 => duplicates)
- Random projection hashing to speed up the computation
- 18-bit hash, 50% basis sparsity
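The random-projection hash can be sketched as signed projections onto sparse random vectors, using the deck's 18-bit / 50%-sparsity settings; the ±1 coordinate scheme and the seed are assumptions.

```python
import random

def make_basis(dim, bits=18, sparsity=0.5, seed=42):
    """Random projection basis: `bits` sparse random vectors; each
    coordinate is 0 with probability `sparsity`, else +/-1."""
    rnd = random.Random(seed)
    return [[0 if rnd.random() < sparsity else rnd.choice((-1, 1))
             for _ in range(dim)] for _ in range(bits)]

def rp_hash(vec, basis):
    """The sign of each projection becomes one bit of the hash; vectors
    with high cosine similarity almost always land in the same bucket,
    so the expensive cosine check runs only within buckets."""
    h = 0
    for row in basis:
        dot = sum(row[i] * v for i, v in vec.items())  # vec is a sparse {index: value} map
        h = (h << 1) | (1 if dot >= 0 else 0)
    return h
```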

Page 9: Current day statistics

Input: filtered corpus as a set of token streams
Output: % of documents in which each term or 2-gram was used
In between:
- 2-gram addition
- Aggregation
- Absolute filtration
- Different limits for terms and 2-grams
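The per-day document-frequency pass above can be sketched as follows; the two thresholds mirror the deck's "different limits for terms and 2-grams", but their values are illustrative.

```python
def day_stats(corpus, term_min=0.0001, bigram_min=0.00005):
    """Fraction of documents containing each term / 2-gram, with separate
    absolute inclusion limits for terms and 2-grams (limits are illustrative)."""
    n = len(corpus)
    df = {}
    for tokens in corpus:
        # Count each term/2-gram once per document.
        grams = set(tokens) | {" ".join(p) for p in zip(tokens, tokens[1:])}
        for g in grams:
            df[g] = df.get(g, 0) + 1
    return {g: c / n for g, c in df.items()
            if c / n >= (bigram_min if " " in g else term_min)}
```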

Page 10: Accumulated state aggregation

Input: current day statistics, previous day's accumulated state
Output: exponentially weighted moving average and variance for terms and 2-grams (new accumulated state)
In between:
- Inclusion limit > exclusion limit
- Different limits for terms and 2-grams
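One day of the EWMA mean/variance update can be sketched with the standard exponentially weighted moments recurrence; the smoothing factor `alpha` is an assumption, and the deck's inclusion/exclusion hysteresis limits would be applied on top of this state.

```python
def update_state(state, today, alpha=0.3):
    """One EWMA update of (mean, variance) per term:
    mean' = mean + alpha*(x - mean); var' = (1 - alpha)*(var + alpha*(x - mean)^2)."""
    new = {}
    for term, x in today.items():
        mean, var = state.get(term, (x, 0.0))  # unseen terms start at today's value
        diff = x - mean
        incr = alpha * diff
        new[term] = (mean + incr, (1 - alpha) * (var + diff * incr))
    return new
```

Using an inclusion limit higher than the exclusion limit (as the slide states) gives hysteresis, so terms hovering near a single threshold do not flap in and out of the state every day.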

Page 11: Trending terms identification

Input: exponentially weighted moving average and variance for terms and 2-grams
Output: trending terms and 2-grams with significance
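The deck does not spell out the significance test; a natural choice, given that the state carries an EWMA mean and variance, is a z-score of today's frequency against that mean. Treat the following entirely as an assumption, including the `z_min` cutoff.

```python
from math import sqrt

def trending(today, state, z_min=3.0, eps=1e-9):
    """Flag terms whose today's document frequency lies z_min standard
    deviations above the accumulated EWMA mean (the test itself is an
    assumption; the deck does not spell it out)."""
    out = {}
    for term, x in today.items():
        mean, var = state.get(term, (0.0, 0.0))
        z = (x - mean) / sqrt(var + eps)  # eps guards against zero variance
        if z >= z_min:
            out[term] = z
    return out
```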

Page 12: Trending terms identification

Page 13: Trending terms clustering

Input: list of trending terms, corpus as a set of token streams
Output: trending terms grouped into clusters with a high level of co-occurrence
In between:
- Term-term matrix of normalized pointwise mutual information
- DBSCAN clustering (ELKI implementation) with cosine distance
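The normalized PMI entries of the term-term matrix can be computed from document frequencies as sketched below; the rows of the resulting matrix would then be fed to DBSCAN with cosine distance (ELKI in the deck). The input format here is an assumption.

```python
from math import log

def npmi(df, co_df, n_docs):
    """Normalized pointwise mutual information between term pairs:
    npmi(x, y) = pmi(x, y) / -log p(x, y), bounded in [-1, 1],
    where 1 means the terms only ever occur together."""
    sim = {}
    for (x, y), c in co_df.items():
        pxy = c / n_docs
        px, py = df[x] / n_docs, df[y] / n_docs
        sim[(x, y)] = log(pxy / (px * py)) / -log(pxy)
    return sim
```

Normalization matters here: raw PMI favors rare pairs, while NPMI keeps scores comparable across terms of very different frequency.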

Page 14: Trending terms clustering

Page 15: Relevant documents extraction

Input: identified trending term clusters, corpus as a set of token streams
Output: set of relevant documents and a "spamminess" level for each cluster
In between:
- For each document, find the most relevant cluster by counting terms
- For each cluster, select the top liked documents
- Count unique users/groups/IPs relative to the overall count
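The assignment and spamminess steps can be sketched as below. The deck counts unique users/groups/IPs relative to the overall count; this sketch simplifies to a single author field per document, and the `assign_and_score` name and data shapes are assumptions.

```python
def assign_and_score(corpus, clusters):
    """Assign each document to the cluster whose terms it mentions most;
    per cluster, estimate spamminess as 1 - unique_authors / documents
    (a simplification of the deck's users/groups/IPs ratio)."""
    assigned = {cid: [] for cid in clusters}
    for doc_id, tokens, author in corpus:
        hits = {cid: sum(t in terms for t in tokens) for cid, terms in clusters.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:  # skip documents matching no cluster at all
            assigned[best].append((doc_id, author))
    spam = {cid: 1 - len({a for _, a in docs}) / len(docs) if docs else 0.0
            for cid, docs in assigned.items()}
    return assigned, spam
```

A cluster pushed by few distinct authors across many documents scores close to 1, flagging likely spam campaigns rather than organic trends.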

Page 16: Results visualization

Input: trending term clusters with relevant documents
Output: nice interactive visualization
In between:
- Add navigation for dates and clusters
- Extract geo location for each document
- Plot on an interactive map
- Display details on hover

Page 17: Visualization

Page 18: Visualization

Page 19: Need for speed!

- Trends are valuable only while they are trending
- Daily batch processing is inherently lagging
- Alternatives:
  - Mini-batch
  - Streaming!
  - Lambda architecture

Page 20: Streaming trending terms detection

Page 21: Not yet there!

- Visualizing just trending terms is not informative
- Clustering is required
- Relevant documents extraction is required
- A mini-batch model is more appropriate here

Page 22: Mini-batch trend clustering

Page 23: Mini-batch trend clustering

Page 24: Technologies used

- Apache Kafka for data collection
- Apache YARN for resource negotiation
- Apache Spark for batch and mini-batch processing
- Apache Samza for stream processing
- Apache Lucene for text preprocessing
- Optimaize language-detector for language detection
- ELKI for clustering

Page 26: Thank you for your attention!

?