how emotional are users' needs? emotion in query logs
DESCRIPTION
Emotional behaviour seems to be ubiquitous on the web. Predictably, social media web genres such as tweets, blog posts and blog comments show high emotional involvement. What about other genres on the web? In this talk, the focus is on the search query log genre. According to recent IR research, searchers’ behaviour is not limited to traditional informational, navigational and transactional needs. A novel hypothesis is that seeking behaviour is driven by emotion. But can emotion be detected by analysing the queries typed by users in a search box? In this talk, I will present the results of some experiments carried out to investigate whether it is possible to identify emotion in the query log genre, and discuss how emotion could be used to improve the relevance of retrieved documents in searches. These experiments are part of SearchInFocus, a study centred on search.
TRANSCRIPT
1
How Emotional are Users’ Needs?
Exploring Emotion in Query Logs
Marina Santini
29 Jan 2013
Marina Santini - CyberEmotions2013 Warsaw University of Technology 29-30 Jan 2013
2
Outline
• Inspirational Triggers:
o The Big Unstructured Textual Data Issue
o Emotion in IR
o Research hypothesis
• Genre- and Emotion-Profiling of Query Logs
o Characterization of genre
o Definition of emotion
o Benefits of genre and emotion awareness in query log analysis
• Experiments
o Query Logs from GenitoriCrescono thematic blog (in Italian)
o Query Logs from Västra Götalands Region (in Swedish)
• Conclusions
3
Inspirational Trigger 1
BIG UNSTRUCTURED TEXTUAL DATA
4
Big Unstructured Textual Data
“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue]
ECONOMIC LOSS!
Lots of different genres!
5
Simple search is not enough…
• Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms.
o ”a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13]
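The felony example can be made concrete with a small sketch. The documents and the mapping from a broad term to narrower terms below are invented for illustration; they are assumptions, not part of the cited source or of any product.

```python
# Toy documents, invented for illustration only.
DOCS = {
    "d1": "the suspect was charged with a felony",
    "d2": "police are investigating an arson at the warehouse",
    "d3": "the accountant was convicted of embezzlement",
    "d4": "the city opened a new library",
}

# Hypothetical hand-built mapping from a broad term to narrower terms.
NARROWER_TERMS = {
    "felony": {"arson", "murder", "embezzlement"},
}


def simple_search(term, docs):
    """Return ids of documents that contain the literal term."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


def expanded_search(term, docs, narrower):
    """Return ids of documents that contain the term or one of its narrower terms."""
    terms = {term} | narrower.get(term, set())
    return sorted(doc_id for doc_id, text in docs.items() if terms & set(text.split()))


print(simple_search("felony", DOCS))
print(expanded_search("felony", DOCS, NARROWER_TERMS))
```

On this toy data the simple search only reaches the document that spells out "felony", while the expanded search also reaches the arson and embezzlement documents.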
6
Text Analytics
• A set of NLP techniques that provide some structure to textual documents.
• Common components:
o Tokenization
o Morphological Analysis
o Syntactic Analysis
o Named Entity Recognition
o Sentiment Analysis
o Automatic Summarization
o Etc.
7
Text Analytics Products and Frameworks
• Commercial:
o Attensity
o Clarabridge
o Temis
o Lexalytics
o Texify
o SAS
o IBM Cognos
o etc.
• Open Source:
o GATE
o NLTK
o UIMA
o etc.
Business Intelligence (BI)
Customer Experience Management (CEM)
8
Actionable Intelligence
• Business Intelligence (BI) + Customer Experience Management (CEM) = Actionable Intelligence
• Actionable Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. must give the power to make decisions and to act straightaway
9
Today…
In 2003, Merrill Lynch pointed out that it was too difficult to automatically extract usable intelligence from the following genres:
o e‐mails
o memos
o notes from call centers and support operations
o news
o user groups
o chats
o reports
o letters
o surveys
o white papers
o marketing material
o research
o presentations
o web pages
Previous genres plus
• Blogs
• Tweets
• FB microposts
• FB comments
• Many other social network textual ”interactions”
10
From Big Data to Query Logs
Current State of Affairs
1. Big Unstructured Textual Data
2. Text Analytics (commercial products and frameworks)
3. Structured information for BI and CEM
Viable Alternative
• Query Logs
• Genre- & Context-aware Text Analytics
• Actionable Information (BI, CEM, sentiment, emerging topics…)
The main advantage of using query logs (when they are available) instead of other genres is REDUCED DATA SIZE, REDUCED PRE-PROCESSING, REDUCED NOISE, REDUCED DATA CLEANING!
Typical Use Case
A company managing:
• Website
• Blog
• eMails
• Facebook Page
• Twitter account
11
Query Logs provide Actionable Intelligence for:
- search providers
- clients
- end-users
SearchInFocus
Exploratory Study on Query Logs and Actionable Intelligence
Exploratory Query-log Analysis Workshop
Organized by Findwise AB, Sweden
SLTC 2012
12
Inspirational Trigger 2
EMOTION IN INFORMATION RETRIEVAL (IR)
13
Role of Emotion in Information Retrieval
by Yashar Moshfeghi
PhD Thesis at University of Glasgow, 2012
Emotion in IR
o Three concepts:
• Emotion need
• Emotion object
• Emotion relevance
”uncover social situations where emotion is the primary factor (i.e., source of motivation) in an IR&S process.” (from the abstract)
14
Emotion Need
• The whole IR&S behaviour is driven by an emotion need.
• An emotion need is more fundamental than an information need in the sense that if an information need exists it implies that there is an underlying emotion need to satisfy this information need.
• Emotion needs, even when they do not lead to a particular information need, can motivate searchers to use an IR system.
15
Research Hypothesis for the exploration of emotion in query logs
It is plausible that much of the IR&S behaviour is driven by an emotion need and that users’ emotions are expressed in the queries that are typed in search boxes and stored in query logs.
If this is true, emotion extraction from query logs also provides actionable intelligence, because extracted emotions can be used to improve decision making and to make better-grounded future choices.
16
Research Questions
• Is it possible to extract emotion from query logs?
• If so, is it possible to use emotion from query logs for actionable intelligence?
17
Genre Profiling of Query Logs
18
What characterizes a genre?
1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)
19
The query log genre is…
a newly acknowledged but fully emerged web genre
1. Name: in line with other digital genres (ex: web log, blog)
2. Community: internet users, IR practitioners
3. Task: to express searchers’ needs in a search engine
4. Conventions: short texts written in ”keywordese”
5. Expectations: to find information relevant to the query
6. Cultural artifact: a product of internet-based society OR a by-product of search engines
20
The query log genre: Linguistic and Textual Conventions
• Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: REDUCED
• Syntax: REDUCED (usually no subclauses, etc.)
21
The Query Log Genre: Benefits
• wrt discourse analysis:
o Conceptually lean and essential jargon
• reduced morphology
• reduced syntax
• short texts
• mostly nouns and verbs
BENEFIT 1: PREDICTABLE SUBLANGUAGE
• wrt BIG UNSTRUCTURED TEXTUAL DATA
BENEFIT 2: REDUCED SIZE, REDUCED PRE-PROCESSING, LITTLE DATA CLEANING!
22
Emotion Profiling of Query Logs
23
What is emotion?
BROAD DEFINITION: ANY DEGREE OF JUDGEMENTAL EVALUATION.
LIKE SENTISTRENGTH’S SCALE: DUAL 5-POINT SYSTEM FOR POSITIVE [1; 2; 3; 4; 5] AND NEGATIVE [-1; -2; -3; -4; -5] EMOTIONS
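To make the dual scale concrete, here is a minimal sketch of a lexicon lookup that returns one positive and one negative score per text. It is not SentiStrength and does not use its word lists or rules: the mini-lexicon and the function name are hypothetical, and the sketch assumes that 1 and -1 stand for "no emotion detected" on their respective scales.

```python
# Hypothetical mini-lexicon: word -> strength on the dual scale.
# Positive words carry 2..5, negative words carry -2..-5 (invented values).
TOY_LEXICON = {"love": 3, "happy": 3, "hate": -4, "aggressive": -3, "afraid": -3}


def dual_score(text, lexicon=TOY_LEXICON):
    """Return (positive, negative), with positive in 1..5 and negative in -1..-5."""
    positive, negative = 1, -1  # assumed baseline: no emotion detected
    for word in text.lower().split():
        strength = lexicon.get(word, 0)
        if strength > 0:
            positive = max(positive, strength)  # keep the strongest positive word
        elif strength < 0:
            negative = min(negative, strength)  # keep the strongest negative word
    return positive, negative


print(dual_score("aggressive children"))
print(dual_score("nursery opening hours"))
```

Reporting both values separately means a text can carry positive and negative emotion at the same time, which a single polarity score would hide.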
24
Explorations
25
Thematic Blog – Italian
Logs from Google Analytics
26
Genitori Crescono
http://genitoricrescono.com/
• Parents Grow Up: learning the profession of parenting together
• About: parenthood, childcare, maternity, upbringing, behaviours during childhood…
belongs to: FattoreMammaNetwork (gathers websites targeted to mothers and written by mothers)
27
Queries from Google Analytics
www.genitoricrescono.com - Search Overview 2009-01-01 to 2012-11-10
togliere il pannolino = stop wearing nappies/stop using diapers
genitori crescono = the website's name
Nopron = the name of a controversial syrup to make children sleep all night long
Tracy Hogg = maternity nurse to Hollywood stars, known as 'the baby whisperer' for her skill in calming unruly infants
nanna = (familiar) bye-byes (Brit), beddy-byes
neonato 4 mesi = 4-month-old baby
io mi svezzo da solo = I wean by myself
nulla osta = certificate of no impediment
aborto terapeutico = therapeutic abortion
28
Zipf’s distribution
“… much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually."
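A rank/frequency table is the usual first step when checking whether a query log has this long-tail shape. The sketch below counts query terms and lists them by rank; the sample queries are made up from terms glossed in this talk and are far too few to show a real power law, so this only illustrates the procedure.

```python
from collections import Counter


def term_frequencies(queries):
    """Count whitespace-separated terms over all queries, most frequent first."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common()


# Made-up sample; a real log would contain many thousands of queries.
queries = [
    "togliere il pannolino",
    "pannolino notte",
    "nanna",
    "togliere pannolino",
    "nanna neonato",
]

ranked = term_frequencies(queries)
for rank, (term, freq) in enumerate(ranked, start=1):
    print(rank, term, freq)
```

On a large log, plotting frequency against rank on log-log axes should give a roughly straight line if the distribution is Zipf-like.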
29
Parts of Speech
[Chart: distribution of parts of speech. Categories: nouns; verbs; adjectives and adverbs; articles and prepositions]
30
Most Frequent Syntactic Patterns
inserimento al nido = settling in at nursery
bambini aggressivi = aggressive children
metodo estivill = Estivill method
31
Average Lengths
“The average length of a search query was 2.4 terms"
“in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries."
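Average query length is easy to compute for any log. A minimal sketch, assuming one query per string and whitespace tokenization (the function name and the sample are illustrative):

```python
def average_query_length(queries):
    """Mean number of whitespace-separated terms per non-empty query."""
    lengths = [len(query.split()) for query in queries if query.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0


sample = ["bambini aggressivi", "nanna", "inserimento al nido"]
print(average_query_length(sample))
```

Whitespace tokenization counts a compound word as one term, which matters when comparing languages with different compounding habits.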
32
Long query, informal syntax
How to stop breastfeeding and make it sleep alone i am planning second pregnacy
33
SentiStrength (basic options)
• Queries’ Emotional Strength (i)
34
The power of genre and the importance of the communicative situation
• ”bambini aggressivi” (aggressive children)
• Refinement of the concept presented in ”Topic-based Sentiment Analysis in the Social Media …” (Thelwall and Buckley, 2012): the polarity of affect words might flip according to genre and the communicative situation, and not only according to the topic.
35
Addition: Emotion Words
36
Emotional Strength: Basic vs. Boosted
37
Negation
Basic Options:
bambini che non mangiano: 1 -1
quando i bambini non dormono: 1 -1
Boosted:
bambini che non mangiano: 2 -1
quando i bambini non dormono: 2 -1
bambini che non mangiano = children who do not eat
quando i bambini non dormono = when children do not sleep
38
Most frequent word trigrams
39
Query ”Normalization”
• Stopword removal
• Lemmatization
• And ideally synonym expansion
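The three normalization steps can be sketched as a small pipeline. Everything below is a toy assumption: the stopword list, the lemma table and the synonym table are tiny hand-made dictionaries for illustration, where a real setup would rely on proper Italian language resources.

```python
# Tiny hand-made resources, for illustration only.
STOPWORDS = {"il", "al", "i", "che", "da"}
LEMMAS = {"bambini": "bambino", "aggressivi": "aggressivo", "dormono": "dormire"}
SYNONYMS = {"nanna": {"sonno"}}  # hypothetical synonym table


def normalize_query(query):
    """Lowercase, drop stopwords, then map each remaining token to its lemma."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def expand_synonyms(tokens):
    """Return the set of tokens plus any synonyms listed for them."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded


print(normalize_query("Bambini aggressivi"))
print(normalize_query("inserimento al nido"))
print(sorted(expand_synonyms(normalize_query("nanna"))))
```

Normalizing before counting lets inflected variants of the same query fall into one frequency bucket.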
40
Ex: for increasing traffic to a website
Increase emotion relevance:
• Be empathetic to searchers’ problems by sympathising and by converting the negative words into more neutral concepts
• Give heart and hope and offer many solutions…
• In a few words: offer a new communication strategy…
Use emotion needs as Actionable Intelligence
41
Public organization website
Enterprise search and log server by Findwise AB.
42
Within the Västra Götaland Region website…
43
…hittavård [find health care center]
Regional HealthCare
44
VGR Corpus Description
• Corpus time frame: 2010-2011 (2 years)
• Description: “These logs come from the search at hittavard.vgregion.se. The biggest bulk should come from 1177.se. The rest should be from vgregion.se. The target audience are both VGR (Västra Götalands Region) users/employees as well as the general public, as it is a public site. The internal files are searches made from within the VGR…”
• Corpus size:
o size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
o number of queries = 249,243
o number of words = 306,453
• Average query length: 1.23 words
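The average query length follows directly from the two corpus counts. A quick check, using only the figures reported on this slide (the variable names are illustrative):

```python
# Figures taken from the corpus description above.
number_of_words = 306_453
number_of_queries = 249_243

average_length = number_of_words / number_of_queries
print(round(average_length, 2))
```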
45
VGR Top Queries
egenremiss = self-referral
mina vårdkontakter = my healthcare contacts
webbisar = an invented word referring to newborn babies whose pictures have been published on the web
sjukresa/or = trip(s) to the hospital
46
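Linguistic Remarks
• At the top of the frequency list:
o Simple nouns
o Compounds
o V+N
Simple nouns
• feber = fever
• influensa = influenza
• klamydia = chlamydia
• …
Compounds
• urinvägsinfektion = urinary tract infection
• öroninflammation = ear infection
• Reseersättning = travel reimbursement
• …
V+N
• byta vårdcentral = change health care centre
• avboka tid = cancel an appointment
• boka tid = book an appointment
• …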
47
More complex constructions at the bottom
48
SentiStrength on VGR
49
It seems that no emotion is conveyed by VGR users…
• Are Swedes less emotional than Italians?
• Is the ”healthcare” topic less emotional than the ”childcare” topic?
50
It might be that…
There is a difference in users’ emotional behaviour when specifying queries to a web search engine OR when using the search engine of a specialized website.
51
Emotion Interpretation… is not straightforward…
• There are several factors to be accounted for:
o One important factor is the context of communication: similar words or sentences can convey positive emotion in a query and negative emotion in a Facebook post, for example.
52
different communicative contexts = different genres
53
Genre Awareness
• In practical terms, genre awareness is important in text analytics and sentiment analysis because, all things being equal, it:
o lets you choose the easiest and least problematic texts to process;
o helps interpret and disambiguate the real meanings of words and sentences according to the different communicative contexts in which they appear.
54
In Summary
55
Is it possible to identify and extract emotion from query logs?
• It is possible to identify and extract emotion from web query logs.
• It seems more difficult to extract emotion from enterprise search engine query logs.
56
Is it possible to use emotion from query logs for actionable intelligence?
• If present, query log emotion can be used for actionable intelligence.
57
What do you think?
Thank you for your attention!
58
Details
59
Benefits for the Search Provider
• Mining query logs to extract user-created knowledge, i.e. queries that can be used as tags (metadata)
• Quickly create domain-specific taxonomies you can capitalize upon, especially for new client companies working in related fields
• Enhancements of current search products
• Inexpensive creation of annotated corpora: document annotation through query logs is a simple technique that in a short time will build massive annotated corpora to use for machine learning, which will allow more sophisticated search refinements.
60
Benefits for Clients & End Users
• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, friendlier, more exhaustive and more accurate.
• If this happens, clients will spend less on customer care. If the end user finds what s/he needs online and quickly, there is no need to call a helpdesk or customer care service.
• Through the analysis of query logs, log analysts can spot the less ”satisfied” queries (i.e. users’ needs). Companies can use this information to plan future products, product enhancements, marketing strategies, etc. (BI)