textual report generation from email utilizing temporal topic analysis · 2019-01-02 · input...




● Use doc2vec for topic calculus

● Use a model trained on Wikipedia articles for topics

● Extract topic labels by comparing email vectors and cluster keyword sets to topic vectors

● Choose a set of topics that together best describe an email
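The label-extraction step above can be sketched as a nearest-neighbour lookup by cosine similarity. This is a minimal illustration, not the poster's implementation: the function names, the `top_k` parameter, and the 3-dimensional toy vectors are assumptions, and real doc2vec vectors would come from a trained model rather than being written by hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def label_email(email_vec, topic_vecs, top_k=2):
    """Return the top_k topic labels whose vectors are closest to the email vector."""
    scored = [(cosine_similarity(email_vec, vec), label)
              for label, vec in topic_vecs.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:top_k]]

# Hypothetical toy vectors; real doc2vec vectors have hundreds of dimensions.
topic_vecs = {
    "Energy trading": [0.9, 0.1, 0.0],
    "Accounting": [0.1, 0.9, 0.1],
    "Baseball": [0.0, 0.1, 0.9],
}
email_vec = [0.8, 0.3, 0.05]

labels = label_email(email_vec, topic_vecs)
print(labels)
```

Picking the `top_k` nearest topics independently ignores how well the chosen topics work together; the Topic Ranking criteria listed later on the poster address that.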

Topic Analysis

Input

Communication groups

Temporal Chains

Textual Report Generation from Email utilizing Temporal Topic Analysis

● Two email datasets: ENRON & Avocado

● Enron contains ~500K emails from 150 employees

● Avocado Research Email Collection contains ~1M emails from 282 accounts

● Group people into clusters based on communication frequency

● Draw a graph of communications, weighting edges by email count

● Extract topics for each cluster

● Use clusters to determine communication patterns & anomalies

● Resulting components represent communication groups
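The grouping steps above can be sketched as follows: count emails per pair of people, keep the edges that carry enough traffic, and take the connected components of what remains. The `min_count` threshold and the `(sender, recipient)` input format are assumptions made for illustration; the poster does not say how edges are filtered.

```python
from collections import Counter, defaultdict

def communication_groups(emails, min_count=2):
    """Group people into connected components of the weighted communication graph.

    emails: iterable of (sender, recipient) pairs.
    min_count: hypothetical threshold; edges with fewer emails are ignored.
    """
    # Edge weight = number of emails exchanged between two people, in either direction.
    weights = Counter(frozenset(pair) for pair in emails if pair[0] != pair[1])

    adjacency = defaultdict(set)
    for edge, count in weights.items():
        if count >= min_count:
            a, b = tuple(edge)
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Depth-first search to collect each connected component.
    groups, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            person = stack.pop()
            if person in component:
                continue
            component.add(person)
            stack.extend(adjacency[person] - component)
        seen |= component
        groups.append(sorted(component))
    return groups

emails = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("bob", "carol"), ("carol", "bob"),
    ("dave", "erin"), ("erin", "dave"),
    ("carol", "dave"),  # a single email: below the threshold, so no edge
]
groups = communication_groups(emails)
print(groups)
```

In this toy data the lone email from carol to dave does not merge the two groups; a first-time contact of that kind is one example of what an anomaly check could flag.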

Report Generation

Topic Ranking

● Use the hierarchical structure from the analysis (communication groups, email grouping, topic chains, anomalies, etc.)

● Select relevant details to help the user understand the context of the report, based on the chosen template (summary vs. anomalies)

● Reason over content to select a good organization and display style

● Supports multiple report templates, including summary- and anomaly-focused output, with modular extensibility for other styles
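One way to get the modular extensibility described above is a small registry that maps template names to rendering functions. The sketch below is an assumption about how such a design could look, not the system's actual architecture; the registry, the decorator, and the shape of the `analysis` dictionary are all hypothetical.

```python
REPORT_TEMPLATES = {}

def register_template(name):
    """Decorator that adds a rendering function to the template registry."""
    def decorator(func):
        REPORT_TEMPLATES[name] = func
        return func
    return decorator

@register_template("summary")
def summary_report(analysis):
    """Render one line per communication group with its topics."""
    lines = [f"Communication groups: {len(analysis['groups'])}"]
    for group in analysis["groups"]:
        lines.append(f"- {', '.join(group['members'])}: {', '.join(group['topics'])}")
    return "\n".join(lines)

@register_template("anomalies")
def anomaly_report(analysis):
    """Render only the anomalies found during analysis."""
    if not analysis["anomalies"]:
        return "No anomalies detected."
    return "\n".join(f"! {item}" for item in analysis["anomalies"])

def generate_report(analysis, template="summary"):
    """Look up the requested template and render the analysis with it."""
    return REPORT_TEMPLATES[template](analysis)

# Hypothetical analysis output, shaped only for this example.
analysis = {
    "groups": [{"members": ["alice", "bob"], "topics": ["Energy trading"]}],
    "anomalies": ["bob emailed erin for the first time"],
}
summary = generate_report(analysis, "summary")
anomalies = generate_report(analysis, "anomalies")
print(summary)
print(anomalies)
```

Adding another report style then only requires writing one more decorated function; the selection logic in `generate_report` stays unchanged.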

Reply / Forward / Related

● Organize emails into topic chains by looking at replies, forwards, and by comparing topics

● Identify topic flow/change over time
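A minimal sketch of the chaining idea: an email joins the chain of the message it replies to or forwards, and otherwise joins a chain whose latest email has similar topics. The dictionary keys, the Jaccard overlap on topic labels, and the `related_threshold` value are assumptions for illustration; the poster's system may compare topics differently, for example on doc2vec vectors.

```python
def build_topic_chains(emails, related_threshold=0.5):
    """Chain emails linked by reply/forward headers or by overlapping topics.

    Each email is a dict with hypothetical keys: id, time, parent (id of the
    email it replies to or forwards, or None), and topics (a set of labels).
    """
    ordered = sorted(emails, key=lambda e: e["time"])
    chain_of = {}  # email id -> index of the chain it belongs to
    chains = []    # each chain is a list of emails in time order

    for email in ordered:
        target = chain_of.get(email["parent"])
        if target is None:
            # No explicit link: search chains, newest first, for similar topics.
            for index in range(len(chains) - 1, -1, -1):
                last = chains[index][-1]
                union = email["topics"] | last["topics"]
                overlap = len(email["topics"] & last["topics"]) / len(union) if union else 0.0
                if overlap >= related_threshold:
                    target = index
                    break
        if target is None:
            chains.append([])
            target = len(chains) - 1
        chains[target].append(email)
        chain_of[email["id"]] = target
    return chains

def topic_flow(chain):
    """List (time, added topics, dropped topics) between consecutive emails."""
    return [
        (cur["time"],
         sorted(cur["topics"] - prev["topics"]),
         sorted(prev["topics"] - cur["topics"]))
        for prev, cur in zip(chain, chain[1:])
    ]

emails = [
    {"id": 1, "time": 1, "parent": None, "topics": {"budget"}},
    {"id": 2, "time": 2, "parent": 1, "topics": {"budget", "audit"}},
    {"id": 3, "time": 3, "parent": None, "topics": {"softball"}},
    {"id": 4, "time": 4, "parent": None, "topics": {"audit", "budget"}},
]
chains = build_topic_chains(emails)
print([[e["id"] for e in chain] for chain in chains])
print(topic_flow(chains[0]))
```

Here email 4 has no reply or forward link but joins the first chain as a "related" message because its topics match, while the softball email starts a chain of its own.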

Collaboration

We are proud of a successful collaboration between NC State and the LAS, including monthly meetings with excellent feedback and ideas.

• We use doc2vec to compute similarity via cosine distance

• For topic labeling, we rank topics using additional criteria:

○ PageRank

○ Coverage

○ Redundancy
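The poster names the three ranking criteria but not how they are combined. The sketch below assumes a simple additive score and a greedy selection: PageRank over a topic link graph gives each topic a prior, coverage rewards keywords not yet explained, and redundancy penalizes keywords that the chosen topics already cover. The graph, the keyword sets, and the scoring formula are all hypothetical, and `keywords` is assumed to be non-empty.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict: node -> list of linked nodes.

    Every linked node must also appear as a key of the graph.
    """
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, links in graph.items():
            targets = links or nodes  # dangling nodes spread their rank evenly
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def select_topics(candidates, keywords, rank, k=2):
    """Greedily pick k topics: reward rank and new coverage, penalize redundancy.

    candidates: topic -> set of keywords the topic covers (hypothetical input).
    keywords: non-empty set of keywords extracted from the email or cluster.
    """
    chosen, covered = [], set()
    while len(chosen) < min(k, len(candidates)):
        def score(topic):
            relevant = candidates[topic] & keywords
            coverage = len(relevant - covered) / len(keywords)
            redundancy = len(relevant & covered) / len(keywords)
            return rank.get(topic, 0.0) + coverage - redundancy
        best = max((t for t in candidates if t not in chosen), key=score)
        chosen.append(best)
        covered |= candidates[best] & keywords
    return chosen

# Hypothetical topic link graph and keyword coverage.
graph = {
    "Energy": ["Finance"],
    "Finance": ["Energy", "Accounting"],
    "Accounting": ["Finance"],
}
candidates = {
    "Energy": {"gas", "pipeline"},
    "Finance": {"gas", "invoice"},
    "Accounting": {"invoice", "audit", "tax"},
}
keywords = {"gas", "pipeline", "invoice", "audit", "tax"}

rank = pagerank(graph)
chosen = select_topics(candidates, keywords, rank)
print(chosen)
```

With these inputs the well-connected Finance topic is picked first, and Accounting is preferred over Energy in the second round because it explains more of the keywords that remain uncovered.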

Colin M. Potts, NC State University, [email protected]

Sean Lynch & Tracy Standafer, Laboratory for Analytic Science, [email protected] | [email protected]
