mining the social web - leibniz universität hannover · user recommendation ... (lda) to generate...

Mining the Social Web

Asmelash Teka [email protected]

L3S Research CenterApril 24, 2018

mailto:[email protected]

Outline

2

Deep LearningSelected Papers

Introduction: The big picture

Definition from Web Science Conference1

Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web

Studying the online world to understand the offline world

3

What is Web Science?

http://websci16.org

http://websci16.org/

Web 2.0 describes World Wide Web sites that emphasize user-generated content, usability, and interoperability.

The social web is a set of social relations that link people through the World Wide Web.

4

The Social Web

https://makeawebsitehub.com/social-media-sites

Problem: How can we automatically construct user profiles?

5

User Identification

Authoritative users extraction - Discovering expert users for a target topic.

Personalized web search - Personalized social media posts retrieval.

User recommendation - Suggesting new interesting users to a target user.

Preprocessing step for many interesting downstream analysis.

6

User Identification Applications

Challenges:

Labels - Generating labelled data to train Models

Feature Engineering - What are good features

Modeling - Mostly off the shelf Predictive models

7

Machine Learning for User Identification [1,2,3]

Feature engineering is key in a machine learning task.

● Profile features● Tweeting behaviour features● Linguistic content● Social network features

8

Feature Engineering

Baseline

Bag-of-words (BOW) model with weighting (e.g., TF-IDF)

Each user is represented as a bag of the words in their tweets.

These words are then used as features to train a classifier.

9

Feature Engineering

Ideally Profile information should be sufficient. In reality it does not contain enough quality information to be directly used for user classification.

- Length of name, Number of alphanumeric chars.- Capitalization forms in user name- Use of avatar picture- Number of followers, friends- Regular expression matches in bio:

(I |i)(m|am| m|[0 − 9] + (yo|yearold)man|woman|boy |girl

10

Profile Features

A set of statistics capturing the way users interact with the micro-blogging service.

- Number of tweets of a user.- Number and fraction of retweets of a user.- Ave. number of hashtags and URLs per tweet.- Ave. time and std between tweets.

11

Tweeting Behaviour Features

Linguistic content contains the user’s lexical usage and the main topics of interest to the user.

- List of socio-linguistic words such as emoticons, ellipses etc.- Prototypical words, hashtags etc. Given n classes, each class ci containing

seed users Si . Each word w from seed users is assigned a score:

(1)

where |w, Si| is the number of times the word w is issued by all users for class ci .

- Latent Dirichlet Allocation (LDA) to generate topics used as features for classification.

- Sentiment words- Word embeddings

12

Linguistic Content Features

These features contain the social connections between a user and those one follows, replies to or whose messages they retweet.

Prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set by bootstrapping as in Eq. (1)

- Number of prototypical friends, percentage number of prototypical friend - Prototypical replied users, Prototypical retweeted users

13

Social Network Features

Manually building ground truth data and validating result not easy.

● Combine multiple data sources. e.g. scraping websites that contain users in each class.

● Using key terms in profile of users. Prone to systematic bias and care should be taken on features.

● Using crowdsourcing services such as Amazon Mechanical Turk1 and CrowdFlower2

14

Ground-Truth and Validation

1. https://www.mturk.com 2. http://www.crowdflower.com/

https://www.mturk.com

http://www.crowdflower.com/

Identifying Computer Science researchers on Twitter [3]

15

A Complete Pipeline

● Current State-of-the-art methods for classification are based on deep learning.

● Key difference is in representation learning - instead of hand crafting features, layers of neural networks, learn representations (features) from raw inputs.

● Bag of words -> words in semantic spaces through word embeddings e.g., word2vec

16

Deep Learning

Intro Deep Learning

Applications:

Face Recognition

Object Detection

Neural Style Transfer

From Feedforward to Convolutional Neural Networks

1. Deep Learning on large images

Consider an RGB image with 1000x1000 dimensions, if you use FCs with 1000 hidden units => 3 billion params.

- Need lots of data to prevent overfitting- computational/memory requirements are huge

2. Preserve Spatial relationship of input images.

The Big Picture (What’s inside the box?)

cat

Convolutional Neural Network

Why Convolutions?

Main advantages of convolutional layers over fully connected layers:

● Sparse interactions

● Parameter sharing

● Translation invariance

Hierarchy of Feature Representations

Earlier layers of deep neural networks detect low level features such as edges and later layers detect high level features.

Convolution Operation:

● Apply a dot-product of filter over image: ● Convolutional networks are neural networks that use convolution in at least one

of their layers.

1 1 1 0 0

0 1 1 1 0

0 0 1 1 1

0 0 1 1 0

0 1 1 0 0

1 0 1

0 1 0

1 0 1

Image

Filter/kernel

Animation Source: https://www.coursera.org/learn/convolutional-neural-networks

https://www.coursera.org/learn/convolutional-neural-networks

Padding

Problems with the basic Convolution operation:

- Shrinking output- Throwing away information from edge

E.g., nxn image convolved with fxf filter becomes

To solve these problems we can pad the image before applying convolution.

Given: nxn input convolved with fxf filter filter padding p gives:

Valid and Same Convolutions

Valid Convolution: Convolution with no padding

Input image: n by n becomes

Same Convolution: Pad so that output size is the same as the input size

Input image: n by n output feature map:

n + 2p - f + 1 = n => p = (f-1)/2

Common filter sizes for Convolution: 1x1, 3x3, 5x5, 7x7

Strided Convolution

Strided Convolution is another important basic building block.

Instead of shifting filter by one, we jump over by stride s of steps.

Given an n x n image convolved with f x f filter with padding p and stride s, the output size is given by:

Convolutions over Volumes

● Allow us to operate directly on multi-dimensional input e.g., RGB images.

● Multiple filters stacked together allow us to detect multiple features.

● The resulting output is volume which in turn we can perform convolution to form a convolutional network.

Video Source: https://www.coursera.org/learn/convolutional-neural-networks

https://docs.google.com/file/d/1QZ4kCFq8cINcJIZS366Ow838LMiiWbBI/preview

https://www.coursera.org/learn/convolutional-neural-networks

Convolution Computation [Dimensions]

Input: 5x5x3 image

2 filters 3x3 with Stride = 2 and padding = 1

Output: (5+2-3)/2 +1 = 3x3x2

We add bias and apply nonlinearity, e.g., Relu to obtain the next output to complete one step of a convolution layer.

Video Source: http://cs231n.github.io/convolutional-networks/

https://docs.google.com/file/d/1LJiE9F5pPS8uHsFOpnJHZ_-RRRfApK9I/preview

http://cs231n.github.io/convolutional-networks/

One Convolutional Layer [Parameters]

Consider the image on the right is 64x64x3. If you apply 10 filters that are 3x3 in the first CONV layer of neural network, how many parameters does that layer have?

CONV

3x3x3 = 27 params + bias => 28 params * 10 filters => 280 params

Note: No matter how big the input image is, the number of parameters stays the same. This makes convNets less prone to overfitting.

Types of Layers in a Convolutional Neural Network

- Convolution (CONV)- Pooling (POOL)- Fully Connected (FC)

CONV POOL CONV POOL CONV

FC FC

Pooling Layer

- Reduce the size of representation- Speed up computation- Make features more robust

No parameters to learn!

Hyperparameters: filter size, stride, Max or Average pooling, mostly no padding.

Pooling Operation:

3 4 0 1

2 0 2 0

5 3 2 1

1 0 6 7

Max Pooling: take the max of values in each filter. Similly,

Average Pooling: take the average of values in each filter.

4 2

5 7

3 4 0 1

2 0 2 0

5 3 2 1

1 0 6 7

2.25

0.75

2.25

4

Max pooling Average pooling

ConvNet Example

64x64x3

f =5x5s=1#f=8

64x64x8

CONV

MAXPOOL

f =2x2s=2

32x32x8

f =5x5s=1#f=16

CONV

28x28x16

MAXPOOL

f =2x2s=2 14x14x16

f =3x3s=1#f=32

CONV

FC FC

How activation dimensions change

Why Convolutions?

Main advantages of convolutional layers over fully connected layers:

● Sparse interactions: In each layer, each output value depends only on a small number of inputs.

● Parameter sharing: a feature detector (such as a vector edge detector) that’s useful in one part of the image is probably useful in another part of the image.

● Translation invariance: An image shifted a few pixels should result in similar features and should be assigned the same output label.

Training ConvNets

Training examples (x(1) ,y(1)), (x(2) ,y(2)), … (x(m) ,y(m))

Cost

Maximum likelihood provides a guide for how to design a good cost function.

Use gradient descent to optimize parameters to reduce J.

y^

Summary Part-One

We have seen basic building blocks

● Convolution operation, padding and strided convolution● Convolution Layer● Pooling layer

How to put these things together?

Case Studies: Deep ConvNet Architectures

- Start with what’s used or modify for own purpose- Gain intuition from successful architectures- Architecture that works for one task often works well for other tasks

ImageNet Classification

1.2 million high-resolution images into 1000 different classes.

LeNet-5

Architecture: CONV-POOL-CONV-POOL-CONV-FC-FC

● Relatively small network ~ 60k parameters● Notice the pattern of one or more CONVs followed by a POOL and then finally FCc● As you go deeper in the network, a decrease in HxW 32 => 28 => 14 => 10 => 10 => 5 and an increase

in channels 1 => 6 => 16

LeCun et al., 1998, Gradient-Based Learning Applied to Document Recognition

Input: 32x32 grayscale imageVALID CONV: 6 filters 5x5, stride 1AVG POOL: 2x2, stride 2VALID CONV: 6 filters 5x5, stride 1AVG POOL: 2x2, stride 2FC 120 nodesFC 84 nodes

AlexNet

● Similar to LeNet but much bigger ~60M parameters instead of ~60k● Used ReLu activation function

Krizhevsky et al., 2012, ImageNet Classification with Deep Convolutional Neural Networks

Input: 227x227x3SAME CONV 96 filters 11x11 s=4 MAX-POOL 3x3 s=2SAME CONV 256 filters 5x5MAX-POOL 3x3 s=2SAME CONV 384 filters 3x3 SAME CONV 384 filters 3x3 SAME CONV 256 filters 3x3 MAX-POOL 3x3 s=2FC 4096FC 4096Softmax 1000

VGG-16

● CONV = 3x3 filter, s=1, same● MAX-POOL = 2x2, s = 2● From 8 to 16 layers

Input: 224x224x3[CONV 64] x 2POOL[CONV 128] x 2POOL[CONV 256] x 3POOL[CONV 512] x 3POOL[CONV 512] x 3POOLFC 4096FC 4096Softmax 1000

Simonyan and Zisserman, 2015, Very Deep Convolutional Networks for Large-Scale Image Recognition

Notable for simple and uniform architecture.

ResNetHe et al., 2015, Deep residual learning for image recognition

Very deep networks are difficult to train because of vanishing and exploding gradients.

Skip Connection: allows to take the activation from one layer and feed it directly to a layer much deeper in the network.

Plain connection Residual block

ResNet

ResNets are formed by stacking residual blocks together to form a deep network

ResNets do well because it’s easy for a residual block learn the identity function

Plain connection

Residual Network

He et al., 2015, Deep residual learning for image recognition

ResNet

● Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on Cifar)

● Swept 1st place in all ILSVRC and COCO 2015 competitions

He et al., 2015, Deep residual learning for image recognition

GoogLeNet

When designing a layer for a ConvNet you might ask:

1x1 CONV? 3x3 CONV? 5x5 CONV? MAX POOL?

Inception Network says Why not all!

Szegedy et al. 2014, Going deeper with convolutions

GoogLeNetSzegedy et al. 2014, Going deeper with convolutions

- 22 layers- Efficient ‘Inception’ module- No FC layers- Only 5 million Params 12x less than AlexNet- ILSVRC’14 classification winner (6.7% top 5 error)

Summary Part Two

Use Cases and Architectural Choices

Classic Examples:

LeNet-5, AlexNet, VGG

Recent Advances:

ResNet, GoogLeNet (Inception)

References

Convolutional Neural Networks On Coursera

https://www.coursera.org/specializations/deep-learning

CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.stanford.edu/

Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courvill http://www.deeplearningbook.org/

https://www.coursera.org/specializations/deep-learning

http://cs231n.stanford.edu/

http://www.deeplearningbook.org/

ApplicationsSpeech Recognition

Machine Translation

Named Entity Recognition

Image Captioning

Why Not a standard feed-forward Network?

● Input, output can be of different lengths in different examples

● Doesn’t share features learned across time

● Reduced Number of parameters

Why Not CNNs?

CNN - specialized for processing a grid of values such as an imageRNN - specialized for processing a sequence of values x(1), . . . , x(τ).

CNN - can readily scale to images with large width and heightRNN - can scale to much longer sequences

Types of RNNs (RNN Architectures)

Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Image classification Music generationImage captioning

Sentiment classification Named Entity RecognitionVideo activity recognition

Machine translation

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

What are Recurrent Neural Networks (RNNs)?

xt - the input at time step tst - the hidden state at time step t. It’s the “memory” of the network. st is calculated based on the previous hidden state and the input at the current step: st=f(Ut + Wst-1). The function f usually is a nonlinearity such as tanh or ReLU.ot - is the output at step t Training RNN (learn parameter U, V, W): backpropagation with Backpropagation Through Time (BPTT) algorithm

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Standard Recurrent Neural Networks

The repeating module in a standard RNN contains a single layer neural network.

Source: Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTMs

54

LSTMs are a special type of RNNs capable of learning long-term dependencies.

Selected Papers

Gender Shades

● Computer vision is being utilized in high-stakes sectors such as health-care and law enforcement.

● Machine learning algorithms can discriminate based on classes like race and gender.

● An approach to evaluate bias present in automated facial analysis algorithms and datasets.

Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. 2018.

Gender Shades

● Members of parliament from Africa: Rwanda, Senegal, South Africa and Europe: Iceland, Finland, Sweden

● Public figures with known identities and photos available under non-restrictive licenses

● gender labeling based on name, gendered title, prefixes such as Mr. or Ms and appearance of the photo


Pilot Parliaments Benchmark (PPB) dataset

Gender Shades

● Current benchmark datasets: IJB-A and Adience majority (79.6% for IJB-A and 86.2% for Adience) of the subjects are lighter-skinned faces

● Evaluation on commercial gender classification○ Microsoft - Microsoft’s Cognitive Services Face API○ IBM - IBM's Watson Visual Recognition API○ Face++ - Publicly available demonstration of Face++ from China

darker-skinned females are the most misclassified group (with error rates of up to 34.7%)


CheXNet

● Deep learning is being applied to help with improved medical diagnoses, disease analyses, and pharmaceutical development.

● Chest X-rays are the most common imaging examination tool for diagnosing Pneumonia and other diseases.

● Automated detection of diseases from chest X-rays at level of expert radiologists helps in clinical settings, also important to deliver health care in populations with inadequate access to diagnostic imaging specialists.

Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).

CheXNet

● An algorithm that detects Pneumonia from chest X-rays at a level exceeding professional radiologists.

● CheXNet, is a 121-layer convolutional neural network trained on ChestX-ray14, publicly available chest X-ray dataset.

● Uses dense connections and batch normalization to make the optimization of such a deep network tractable.


CheXNet

● Heatmaps visualize areas of image most indicative of disease with class activation maps (CAMs).

● Let fkbe kth feature map and wc,kweight in the final classification layer for feature map k leading to pathology c, compute a map Mc:

● Heatmap generated by upscaling Mcto dimensions of image and overlaying


DeepMoji

● NLP tasks are often limited by scarcity of manually annotated data.

● Researchers typically use emoticons and specific hashtags as forms of distant supervision.

● Learn emotion, sentiment and sarcasm detection using a single model through emoji prediction on a dataset of 1246 million tweets containing one of 64 common emojis.

Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).

Top five most likely emojis for each text

DeepMoji

Pretraining on the classification task of predicting which emoji were initially part of a text.

● English tweets without URLs● word tokens, repeated characters shortened ● Single tweet for the pretraining per unique emoji type

regardless of the number of emojis in a tweet● Balance dataset to avoid over representation of most

frequent emojis


DeepMoji model S is text length and C number of classes

DeepMoji

Transfer Learning: fine tuning the pretrained model to a target task.

- Using model as a feature extractor with all layers frozen

- Using model as an initialization where the full model is unfrozen

- Sequentially unfreezing and fine tuning a single layer at a time, chain-thaw.


Chain-thaw transfer learning approach

References

1. Pennacchiotti, Marco, and Ana-Maria Popescu. "Democrats, republicans and starbucks afficionados: user classification in twitter." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.

2. Rao, Delip, et al. "Classifying latent user attributes in twitter." Proceedings of the 2nd international workshop on Search and mining user-generated contents. ACM, 2010.

3. Hadgu, Asmelash Teka, and Robert Jäschke. "Identifying and analyzing researchers on twitter." Proceedings of the 2014 ACM conference on Web science. ACM, 2014.

4. Remain, Joseph, EST al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

5. Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).

6. Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).

7. Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference8. on Fairness, Accountability and Transparency. 2018.

65

mining the social web - leibniz universität hannover · user recommendation ... (lda) to generate...

Documents