Hybrid System Architecture Overview

Upload: jesse-wang

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Hybrid system architecture overview

Overview of Hybrid Architecture in Project Halo
Jesse Wang, Peter Clark
March 18, 2013

Page 2: Hybrid system architecture overview

2

Status of Hybrid Architecture
Goals, Modularity, Dispatcher, Evaluation

Page 3: Hybrid system architecture overview

3

Hybrid System Near Term Goals

• Set up the infrastructure to communicate with existing reasoners

• Reliably dispatch questions and collect answers

• Create related tools and resources
  o Question generation/selection, answer evaluation, report analysis, etc.

• Experiment with ways to choose the answers from available reasoners – as a hybrid solver

[Diagram: a Dispatcher routing questions to, and collecting answers from, AURA, CYC, and TEQA]

Page 4: Hybrid system architecture overview

4

Focus Areas of Hybrid Framework (until mid 2013)

• Loose coupling, high cohesion, data exchange protocols

Modularity

• Send requests and handle the responses

Dispatching

• Ability to get ratings on answers, and report results

Evaluation

Page 5: Hybrid system architecture overview

5

Hybrid System Core Components

[Diagram: DirectQA (AURA, CYC, TEQA, IR?, SQDB) and Retrieval (AURA SQs, CYC SQs, TEQA SQs, Hybrid SQs) over a filtered set of questions in Campbell Chapter 7 (Find-A-Value), feeding EVALUATION and Report; yellow outline marks new or updated components]

SQs: suggested questions; SQA: QA with suggested questions; TEQA: Textual Entailment QA; IR: Information Retrieval

Page 6: Hybrid system architecture overview

6

Infrastructure: Dispatchers

[Diagram: a Dispatcher connecting AURA, CYC, TEQA, and IR, serving Live Single QA, Suggested QA, and Batch QA]

Page 7: Hybrid system architecture overview

7

Dispatcher Features

• Asynchronous batch mode and single/experiment mode

• Parallel dispatching to reasoners
  o Very functional UI: live progress indicator, view question file, logs
  o Exception and error handling

• Retry question when server is busy

• Batch service can continue to finish even if the client dies
  o Cancel/stop of the batch process also available

• Input and output support both XML and CSV/TSV formats
  o Pipeline support: accept Question-Selector input

• Configurable dispatchers, select reasoners
  o Collect answers and compute basic statistics
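A minimal sketch of how such a dispatcher might behave, with parallel dispatch and retry-on-busy; `BusyError`, `ask_with_retry`, and the reasoner callables are illustrative assumptions, not the real service API:

```python
import concurrent.futures
import time

class BusyError(Exception):
    """Raised by a reasoner endpoint when the server is busy (hypothetical)."""

def ask_with_retry(reasoner, question, retries=3, delay=0.0):
    # "Retry question when server is busy": re-ask up to `retries` times.
    for _ in range(retries):
        try:
            return reasoner(question)
        except BusyError:
            time.sleep(delay)
    return None  # give up; collected as "no answer"

def dispatch(reasoners, question):
    # Dispatch one question to all configured reasoners in parallel
    # and collect whatever answers come back.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(ask_with_retry, fn, question)
                   for name, fn in reasoners.items()}
        return {name: f.result() for name, f in futures.items()}
```

A batch mode would simply loop `dispatch` over a question file, writing results as they arrive so the batch survives a client disconnect.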

Page 8: Hybrid system architecture overview

8

Question-Answering via Suggested Questions

• Similar features as Live/Direct QA

• Aggregate suggested questions’ answers as a solver

• Unique features:
  o Interactively browse the suggested-questions database
  o Filter on certain facets
  o Use Q/A concepts, question types, etc. to improve relevance
  o Automatic comparison of filtered and non-filtered results by chapters
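Faceted filtering of suggested questions could look roughly like this; the field names (`qtype`, `chapter`, `concepts`) are hypothetical, not the actual SQDB schema:

```python
def filter_questions(questions, facets):
    """Keep suggested questions matching every requested facet
    (question type, chapter, concepts...). Facet values may be a
    scalar (exact match) or a list (any-overlap match)."""
    def matches(q):
        for key, wanted in facets.items():
            value = q.get(key)
            if isinstance(value, (list, set)):
                # list-valued field: require at least one shared item
                if not set(wanted) & set(value):
                    return False
            elif value != wanted:
                return False
        return True
    return [q for q in questions if matches(q)]
```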

Page 9: Hybrid system architecture overview

9

Question and Answer Handling

• Handling and parsing reasoners’ returned results
  o Customized programming

• Information on execution: details and summary

• Report generation
  o Automatic evaluation

• Question Selector
  o Support multiple facets/filters
  o Question banks
  o Dynamic UI to pick questions
  o Hidden tags support

Page 10: Hybrid system architecture overview

10

Automatic Evaluation: Status as of 2013.3

• Automatic result evaluation features
• Web UI/service to use
• Algorithms to score exact and variable answers
  – brevity/clarity
  – relevance: correctness + completeness
  – overall score

• Generate reports
  – Summary & details
  – Graph plot

• Improving evaluation result accuracy
  o Using: basic text processing tricks (stop words, stemming, trigram similarity, etc.), location of answer, length of answer, bio concepts, counts of concepts, chapters referred, question types, answer type

• Experiments and analysis (several rounds, W.I.P.)

[Chart: User overall vs. AutoEval overall, counts 0–120]
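Trigram similarity, one of the "basic text processing tricks" listed above, can be sketched as Jaccard overlap of character trigrams (an assumption; the deck does not say which trigram variant was used):

```python
def trigrams(text):
    # Character trigrams of the lowercased text.
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def trigram_similarity(a, b):
    # Jaccard overlap of trigram sets; 1.0 means an identical trigram profile.
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Such a fuzzy overlap tolerates inflection and small wording differences when comparing a reasoner's answer against a gold answer.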

Page 11: Hybrid system architecture overview

11

Hybrid Performance
How we evaluate, and how we can improve, overall system performance

Page 12: Hybrid system architecture overview

12

Caveats: Question Generation and Selection

• Generated by a small group of SMEs (senior biology students)

• In natural language, without textbook (only syllabus)

Page 13: Hybrid system architecture overview

13

Question Set Facets

[Chart: Chapter Distribution over chapters 0 and 4–12 (EV)]

Question Types:
FIND-A-VALUE 46%, IS-IT-TRUE-THAT 9%, HAVE-RELATIONSHIP 8%, HOW 7%, PROPERTY 6%, WHY 5%, HOW-MANY 5%, WHERE 5%, WHAT-DOES-X-DO 3%, WHAT-IS-A 3%, HAVE-SIMILARITIES 2%, X-OR-Y 2%, FUNCTION-OF-X 1%, HAVE-DIFFERENCES 1%

Page 14: Hybrid system architecture overview

14

Caveat: Evaluation Criteria

• We provided a clear guideline, but ratings are still subjective
  o A (4.0) = correct, complete answer, no major weakness
  o B (3.0) = correct, complete answer with small cosmetic issues
  o C (2.0) = partially correct or complete answers, with some big issues
  o D (1.0) = somewhat relevant answer or information, or poor presentation
  o F (0.0) = wrong or irrelevant, conflicting or hard-to-locate answers

• Only 3 users to rate the answers, under a tight timeline
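The rubric maps letter grades to the numeric 0–4 scale used in later slides; a minimal sketch (function names are illustrative):

```python
# Letter grades map to numeric scores exactly as the rubric states.
GRADE_SCORE = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def average_rating(grades):
    # Mean numeric score across raters (the deck notes only 3 raters were used).
    return sum(GRADE_SCORE[g] for g in grades) / len(grades)

def is_good(grades, threshold=3.0):
    # Later slides treat a rating >= 3.0 as a "good" answer.
    return average_rating(grades) >= threshold
```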

[Chart: User Preferences (0–3 scale) for Aura, Cyc, Text QA]

Page 15: Hybrid system architecture overview

15

Evaluation Example
Q: What is the maximum number of different atoms a carbon atom can bind at once?

Page 16: Hybrid system architecture overview

16

More Evaluation Samples (Snapshot)

Page 17: Hybrid system architecture overview

17

Reasoner Quality Overview

[Chart: Answer Counts Over Rating – counts (0–160) over ratings 0.00–4.00 for Aura, Cyc, Text QA]

Page 18: Hybrid system architecture overview

18

Performance Numbers

[Chart: Reasoner Performance on All Ratings (0..4) – Precision, Recall, F1 (0.000–0.600) for Aura, Cyc, Text QA]

[Chart: Reasoner Performance on "Good" (>= 3.0) Answers – Precision, Recall, F1 (0.000–0.400) for Aura, Cyc, Text QA]
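The slides do not spell out how precision and recall are defined here; one plausible reading (precision over attempted answers, recall over all questions, "good" meaning rating >= 3.0) can be sketched as:

```python
def precision_recall_f1(n_good, n_answered, n_questions):
    # Precision: fraction of attempted answers rated good.
    # Recall: fraction of all questions answered well.
    # (An assumed reading; the deck does not state the definitions.)
    p = n_good / n_answered if n_answered else 0.0
    r = n_good / n_questions if n_questions else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```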

Page 19: Hybrid system architecture overview

19

Answers Over Question Types

[Chart: Answer Overall Rating (0.00–4.00) by question type – FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP – for Text QA, Cyc, Aura]

[Chart: Count of Answered Questions by question type (0–36) for Text QA, Cyc, Aura]

Page 20: Hybrid system architecture overview

20

Answer Distribution Over Chapters

[Chart: answer counts per chapter (0, 4–12) for Aura, Cyc, Text QA]

[Chart: Answer Quality Over Chapters (0.00–4.00) for Aura, Cyc, Text QA]

Page 21: Hybrid system architecture overview

21

Answers on Questions with E/V Answer Type

[Chart: Exact/Various Answer Quality (0.00–3.00) for Aura, Cyc, Text QA, and Average]

[Chart: Exact/Various Answer Count for Aura, Cyc, Text QA – values 5, 5, 45, 25, 13, 40]

Page 22: Hybrid system architecture overview

22

Improve Performance: Hybrid Solver – Combine!

• Random selector (dumbest, baseline)
  o Total questions answered correctly should beat the best solver

• Priority selector (less dumb)
  o Pick a reasoner following a good order (e.g. Aura > Cyc > Text QA) *
  o Expected performance: better than the best individual

• Trained selector: feature- and rule-based selector (smarter)
  o Decision-Tree (CTree…) learning over Q-Type, Chapter, …
  o Expected performance: slightly better than above

• Theoretical best selector: MAX – the upper limit (smartest)
  o Suppose we can always pick the best performing reasoner
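The three simplest selectors above can be sketched as follows; the reasoner names come from the deck, but the functions and data shapes are illustrative assumptions:

```python
import random

def random_select(answers, rng=random.Random(0)):
    # Baseline: pick uniformly among reasoners that produced an answer.
    answered = [(n, a) for n, a in answers.items() if a is not None]
    return rng.choice(answered) if answered else (None, None)

def priority_select(answers, order=("Aura", "Cyc", "Text QA")):
    # Priority selector: first reasoner in a fixed preference order
    # that produced an answer.
    for name in order:
        if answers.get(name) is not None:
            return name, answers[name]
    return None, None

def max_select(answers, ratings):
    # Theoretical MAX selector: always pick the best-rated answer.
    # An oracle - only computable after ratings are known.
    answered = {n: a for n, a in answers.items() if a is not None}
    if not answered:
        return None, None
    best = max(answered, key=lambda n: ratings.get(n, 0.0))
    return best, answered[best]
```

A trained (decision-tree) selector would replace the fixed `order` with a per-question choice learned from features such as question type and chapter.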

Page 23: Hybrid system architecture overview

24

Performance (F1) with Hybrid Solvers

[Chart: Performance of Solvers on Good Answers (Good: Rating >= 3.0) – F1 (0.000–0.300) for Aura, Cyc, Text QA, Random, Priority, D-Tree, Max]

Page 24: Hybrid system architecture overview

25

Conclusion

• Each reasoner has its own strengths and weaknesses
  o Some aspects not handled well by AURA & CYC
  o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …

• Aggregated performance easily beats the best individual (Text QA)
  o Random solver does a good job (F1: mean=0.609): F1(MAX) – F1(Random) ~ 2.5%

• Little room for better performance via answer selection
  o F1(MAX) – F1(D-Tree) ~ 0.5%
  o Better to focus on MORE and/or BETTER solvers

Page 25: Hybrid system architecture overview

26

Future and Discussions

Page 26: Hybrid system architecture overview

27

Near Future Plans

• Include SQDB-based answers as a “Solver”
  o Helps alleviate reasoners’ question-interpretation problems

• Include Information Retrieval-based answers as a “Solver”
  o Helps understand the extra power reasoners can have over search

• Improve the evaluation mechanism

• Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)

• Improve question selector to support multiple sources and automatic update/merge of question metadata

• Find ways to handle question bank evolution

Page 27: Hybrid system architecture overview

28

Further Technical Directions (2013.6+)

Get More, Better Reasoners (machine learning, evidence combination)
• Extract and use more features to select best answers
• Evidence collection and weighing

Analytics & Tuning
• Easier to explore individual results and diagnose failures
• Support to tune and optimize performance over target question-answer datasets

Inter-solver Communication
• Support shared data, shared answers
• Subgoaling: allow reasoners to call each other for subgoals

Open Data, Open Services, Open Environment

Page 28: Hybrid system architecture overview

29

Open *Data*

Requirements
• Clear semantics, common format (standard), easy to access, persistent (available)

Data Sources
• Question bank, training sets, knowledge base, protocols for intermediate and final data exchange

Open Data Access Layer
• Design and implement protocols and services for data I/O

Page 29: Hybrid system architecture overview

30

Open *Services*

Two Categories
• Pure machine/algorithm based
• Human computation (social, crowdsourcing)

Requirements
• Communicate with open data, generate metadata
• More reliable, scalable, reusable

Goal: Process and Refine Data
• Convert raw, noisy, inaccurate data into refined, structured, useful data

Page 30: Hybrid system architecture overview

31

Open *Environment*

Definition
• An AI development environment to facilitate collaboration, efficiency, and scalability

Operation
• Like an MMORPG, each “player” gets credits: contribution, resource consumption; interests, loans; ratings…

Opportunities
• Self-organized projects, growth potential, encourage collaboration, grand prize

Page 31: Hybrid system architecture overview

32

Thank You!
For having the opportunity for Q&A

Backup slides next

Page 32: Hybrid system architecture overview

33

IBM Watson’s “DeepQA” Hybrid Architecture

Page 33: Hybrid system architecture overview

34

DeepQA Answer Merging And Ranking Module

Page 34: Hybrid system architecture overview

35

Wolfram Alpha Hybrid Architecture

• Data Curation

• Computation

• Linguistic components

• Presentation

Page 35: Hybrid system architecture overview

36

Page 36: Hybrid system architecture overview

37

Page 37: Hybrid system architecture overview

38

Answer Distribution (Density)

[Chart: Answer Distribution – count of answers (0–16) over average user rating (0.00–4.00) for Text QA, Cyc, Aura]

Page 38: Hybrid system architecture overview

39

Data Table for Answer Quality Distribution

Page 39: Hybrid system architecture overview

40

Work Performed

• Created web-based dispatcher infrastructure
  o For both Live Direct QA and Live Suggested Questions
  o Batch mode to handle larger amounts

• Built a web UI for UW students to rate answers of questions (HEF)
  o Coherent UI, duplicate removal, queued tasks

• Established automatic ways for result evaluation and comparison

• Employed initial file and data exchange formats and protocols

• Set up faceted browsing and search (retrieval) UI
  o And web services for 3rd-party consumption

• Carried out many rounds of relevance studies and analysis

Page 40: Hybrid system architecture overview

41

First Evaluation via Halo Evaluation Framework

• We sent individual QA result sets to UW students for evaluation

• First round hybrid system evaluation:
  o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
  o Aura QA: 1 best, 9 good, 14/60 answered
  o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
  o Text QA: 27 best, 29 good; SQA: 3 best, 5 good, 7/60 answered
  o Best scenario: 41/60 answered
  o Note: Cyc Live was not included

* SQA = answering via suggested questions

Page 41: Hybrid system architecture overview

42

Live Direct QA Dispatcher Service

[Screenshot: ask a question (“What does ribosome make?”), wait for answers, answers returned]

Page 42: Hybrid system architecture overview

43

Live Suggested QA Dispatcher Service

Page 43: Hybrid system architecture overview

44

Batch QA Dispatcher Service

Result automatically downloaded once finished

Page 44: Hybrid system architecture overview

45

Live solver Service Dispatchers

Page 45: Hybrid system architecture overview

46

Direct Live QA: What does ribosome make?

Page 46: Hybrid system architecture overview

47

Direct Live QA: What does ribosome make?

Page 47: Hybrid system architecture overview

48

Suggested Questions Dispatcher

Page 48: Hybrid system architecture overview

49

Results for Suggested Question Dispatcher

Page 49: Hybrid system architecture overview

50

Batch Mode QA Dispatcher

Page 50: Hybrid system architecture overview

51

Batch QA Progress Bar

Result automatically downloaded once finished

Page 51: Hybrid system architecture overview

52

Suggested questions database browser

Page 52: Hybrid system architecture overview

53

Faceted Search on Suggested Questions

Page 53: Hybrid system architecture overview

54

Tuning the Suggested Question Recommendation

Accomplished
• Indexed the suggested-questions database
  – Concept, question, answers
• Created a web service for uploading new sets of suggested questions
• Extracted chapter information from answer text (TEXT)
• Analyzed question types
  – Pattern-based
• Experimented with some basic retrieval criteria

Not Yet Implemented
• Parsing the questions
• More experiments (heuristics) on retrieval/ranking criteria
  – Manual
• Get SMEs to generate training data for evaluation
  – Automatic
• More feature extraction

Page 54: Hybrid system architecture overview

55

Parsing, Indexing and Ranking

In Place
• New local concept extraction service
• Concepts extracted and in index
• Both sentences and paragraphs are in index
• Basic sentence type identified
• Chapter and section information in
• Several ways of ranking evaluated

Not Yet Implemented
• More sentence features
  – Content type: questions, figures, header, regular, review…
  – Previous and next concepts
  – Count of concepts
  – Clauses
  – Universal truth
  – Relevance or not
• Question parsing
• More refining on ranking
• Learning to Rank??

Page 55: Hybrid system architecture overview

56

Browse Hybrid system

Page 56: Hybrid system architecture overview

57

WIP: Ranking Experiments (Ablation Study)

Features                      | Only (Easy) | Without (Easy) | Only (Hard) | W/O (Hard)
Sentence Text                 | 139/201     |                | 31/146      |
Sentence Concept              | 79/201      |                | 13/146      |
Prev/Next Sentence Concept    | -           |                | -           |
Locality info (Chapter, etc.) | -           |                | -           |
Stopword list                 | -           |                | -           |
Stemming comparison           | -           |                | -           |
Other features (type…)        | -           |                | -           |
Weighting (variations)        |             |                |             |

Page 57: Hybrid system architecture overview

58

Automatic Evaluation of IR Results

• Inexpensive, consistent results for tuning
  o Always using human judgments would be expensive and somewhat inconsistent

• Quick turnaround

• With both “easy” and “difficult” question-answer sets

• Validated by UW students to be trustworthy
  o 95% accuracy on average with threshold

Page 58: Hybrid system architecture overview

59

First UW Students’ Evaluation on AutoEval

• Notation:
  o 0 = right on. 100% is right, 0% is wrong.
  o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
  o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer

• We gave each of 4 students:
  o 15 questions, 15*5 = 75 sentences and scores to rank
  o 5 of the questions are the same, 10 are unique to each student
  o 23/45 questions from the “hard” set, 22/45 from the “easy” set
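The notation above maps directly to a small checking function; `validate` and its argument names are illustrative, not the project's actual code:

```python
def validate(auto_score, text_contains_answer, threshold=0.5):
    #  0 = auto-eval agrees with the human judgment
    # -1 = false positive: high score, but the text lacks the answer
    # +1 = false negative: low score, but the text has the answer
    predicted_yes = auto_score > threshold
    if predicted_yes and not text_contains_answer:
        return -1
    if not predicted_yes and text_contains_answer:
        return +1
    return 0
```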

Page 59: Hybrid system architecture overview

60

Results: Auto-Evaluation Validity Verification

[Chart: auto-evaluation validity per rater (1–4), 0–1 scale, at thresholds of 50% and 80%]

Page 60: Hybrid system architecture overview

61

The “Easy” QA Set *

• Task: automatically evaluate whether retrieved sentences contain the answer

• Scoring: Max score, Mean Average Precision (MAP)

• Result using Max (with threshold at 80%):
  o 193 regular questions and 8 yes/no questions (via concept overlap)
  • Only with sentence text: 139 (69.2%)
  • Peter’s test set: 149 (74.1%)
  • Peter’s more refined: 158 (78.6%)
  • (Lower) upper bound for IR: 170 (84.2%)
  • Jesse’s best: ??

* The evaluation is for the IR portion ONLY; no answer pinpointing
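The two scoring schemes named above (Max score with a threshold, and MAP) might look like this; these are textbook definitions, not necessarily the project's exact implementation:

```python
def max_score(scores, threshold=0.8):
    # "Max" scoring: the question counts as answered if the best
    # retrieved sentence clears the threshold (the slide uses 80%).
    return bool(scores) and max(scores) >= threshold

def mean_average_precision(ranked_relevance):
    # MAP over questions; each entry is a ranked list of True/False
    # relevance judgments for the retrieved sentences.
    aps = []
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for i, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)  # precision at this hit
        aps.append(sum(precisions) / hits if hits else 0.0)
    return sum(aps) / len(aps) if aps else 0.0
```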

Page 61: Hybrid system architecture overview

62

“Easy” QA Set Auto-Evaluation

[Chart: results (0–0.9) for Q text Only, Vulcan Basic, Vulcan Refined, BaseIR Current, and Upper Bound]

Page 62: Hybrid system architecture overview

64

Best Upper Bound for Hard Set as of Today

With weighting on answer text, answer concepts, question text, and question concepts; matching over sentence text, concepts, concepts from previous and next sentences, and sentence type… Comparison with keyword overlap, concept overlap, stopword removal, and smart stemming techniques…

56/146 = 38.4%

Page 63: Hybrid system architecture overview

66

Sharing the Data and Knowledge

• Information we want – and each solver may also want:

• Everyone’s results

• Everyone’s confidence in results

• Everyone’s supporting evidence
  o From textbook sentences, reviews, homework sections, figures…
  o From related web material, e.g. biology Wikipedia
  o From common world knowledge: ParaPara, WordNet, …

• Training data – for offline use

Page 64: Hybrid system architecture overview

67

More Timeline Details for First Integration

We are in control
• AURA – now
• Text – before 12/7
• Vulcan IR Baseline – before 12/15
• Initial Hybrid System Output – before 12/21
  – Without unified data format
  – With limited (possibly outdated) suggested questions

Partners
• Cyc – ? Hopefully before EOY 2012
• JHU – ?? Hopefully before EOY 2012
• ReVerb – ??? EOM January 2013

Page 65: Hybrid system architecture overview

68

Rounds of Improvements

Analysis (evaluation)
• Evaluation with humans
• With each solver + hybrid system

Page 66: Hybrid system architecture overview

69

OpenHalo

[Diagram: the Vulcan Hybrid System connected through Data Service Collaboration to CYC QA, SILK QA, Other QA, TEQA, and AURA]