Hybrid System Architecture Overview

Upload: jesse-wang

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Hybrid system architecture overview

Overview of Hybrid Architecture in Project Halo
Jesse Wang, Peter Clark
March 18, 2013

Page 2: Hybrid system architecture overview

2

Status of Hybrid Architecture
Goals, Modularity, Dispatcher, Evaluation

Page 3: Hybrid system architecture overview

3

Hybrid System Near Term Goals

• Set up the infrastructure to communicate with existing reasoners

• Reliably dispatch questions and collect answers

• Create related tools and resources
  o Question generation/selection, answer evaluation, report analysis, etc.

• Experiment with ways to choose the answers from available reasoners – as a hybrid solver

[Diagram: a Dispatcher routing questions to, and collecting answers from, AURA, CYC, and TEQA]

Page 4: Hybrid system architecture overview

4

Focus Areas of Hybrid Framework (until mid 2013)

• Loose coupling, high cohesion, data exchange protocols

Modularity

• Send requests and handle the responses

Dispatching

• Ability to get ratings on answers, and report results

Evaluation

Page 5: Hybrid system architecture overview

5

Hybrid System Core Components

[Diagram: DirectQA (AURA, CYC, TEQA, IR?, SQDB) and Retrieval (AURA SQs, CYC SQs, TEQA SQs, Hybrid SQs) over a filtered set of questions in Campbell Chapter 7 (Find-A-Value), feeding EVALUATION and Report; yellow outline marks new or updated components]

SQs: suggested questions; SQA: QA with suggested questions; TEQA: Textual Entailment QA; IR: Information Retrieval

Page 6: Hybrid system architecture overview

6

Infrastructure: Dispatchers

[Diagram: a Dispatcher connecting AURA, CYC, TEQA, and IR, serving Live Single QA, Suggested QA, and Batch QA]

Page 7: Hybrid system architecture overview

7

Dispatcher Features

• Asynchronous batch mode and single/experiment mode

• Parallel dispatching to reasoners
  o Very functional UI: live progress indicator, view question file, logs
  o Exception and error handling

• Retry question when server is busy

• Batch service can continue to finish even if the client dies
  o Cancel/stop of the batch process also available

• Input and output support both XML and CSV/TSV formats
  o Pipeline support: accept Question-Selector input

• Configurable dispatchers, select reasoners
  o Collect answers and compute basic statistics
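A minimal sketch of how such a dispatcher might behave, with parallel dispatch and retry-on-busy; `BusyError`, `ask_with_retry`, and the reasoner callables are illustrative assumptions, not the real service API:

```python
import concurrent.futures
import time

class BusyError(Exception):
    """Raised by a reasoner endpoint when the server is busy (hypothetical)."""

def ask_with_retry(reasoner, question, retries=3, delay=0.0):
    # "Retry question when server is busy": re-ask up to `retries` times.
    for _ in range(retries):
        try:
            return reasoner(question)
        except BusyError:
            time.sleep(delay)
    return None  # give up; collected as "no answer"

def dispatch(reasoners, question):
    # Dispatch one question to all configured reasoners in parallel
    # and collect whatever answers come back.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(ask_with_retry, fn, question)
                   for name, fn in reasoners.items()}
        return {name: f.result() for name, f in futures.items()}
```

A batch mode would simply loop `dispatch` over a question file, writing results as they arrive so the batch survives a client disconnect.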

Page 8: Hybrid system architecture overview

8

Question-Answering via Suggested Questions

• Similar features as Live/Direct QA

• Aggregate suggested questions’ answers as a solver

• Unique features:
  o Interactively browse the suggested-questions database
  o Filter on certain facets
  o Use Q/A concepts, question types, etc. to improve relevance
  o Automatic comparison of filtered and non-filtered results by chapters
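Faceted filtering of suggested questions could look roughly like this; the field names (`qtype`, `chapter`, `concepts`) are hypothetical, not the actual SQDB schema:

```python
def filter_questions(questions, facets):
    """Keep suggested questions matching every requested facet
    (question type, chapter, concepts...). Facet values may be a
    scalar (exact match) or a list (any-overlap match)."""
    def matches(q):
        for key, wanted in facets.items():
            value = q.get(key)
            if isinstance(value, (list, set)):
                # list-valued field: require at least one shared item
                if not set(wanted) & set(value):
                    return False
            elif value != wanted:
                return False
        return True
    return [q for q in questions if matches(q)]
```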

Page 9: Hybrid system architecture overview

9

Question and Answer Handling

• Handling and parsing reasoners’ returned results
  o Customized programming

• Information on execution: details and summary

• Report generation
  o Automatic evaluation

• Question Selector
  o Support multiple facets/filters
  o Question banks
  o Dynamic UI to pick questions
  o Hidden tags support

Page 10: Hybrid system architecture overview

10

Automatic Evaluation: Status as of 2013.3

• Automatic result evaluation features
• Web UI/service to use
• Algorithms to score exact and variable answers
  – brevity/clarity
  – relevance: correctness + completeness
  – overall score

• Generate reports
  – Summary & details
  – Graph plot

• Improving evaluation result accuracy
  o Using: basic text processing tricks (stop words, stemming, trigram similarity, etc.), location of answer, length of answer, bio concepts, counts of concepts, chapters referred, question types, answer type

• Experiments and analysis (several rounds, W.I.P.)

[Chart: User overall vs. AutoEval overall, counts 0–120]
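Trigram similarity, one of the "basic text processing tricks" listed above, can be sketched as Jaccard overlap of character trigrams (an assumption; the deck does not say which trigram variant was used):

```python
def trigrams(text):
    # Character trigrams of the lowercased text.
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def trigram_similarity(a, b):
    # Jaccard overlap of trigram sets; 1.0 means an identical trigram profile.
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Such a fuzzy overlap tolerates inflection and small wording differences when comparing a reasoner's answer against a gold answer.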

Page 11: Hybrid system architecture overview

11

Hybrid Performance
How we evaluate, and how we can improve, overall system performance

Page 12: Hybrid system architecture overview

12

Caveats: Question Generation and Selection

• Generated by a small group of SMEs (senior biology students)

• In natural language, without textbook (only syllabus)

Page 13: Hybrid system architecture overview

13

Question Set Facets

[Chart: Chapter Distribution over chapters 0 and 4–12 (EV)]

Question Types:
FIND-A-VALUE 46%, IS-IT-TRUE-THAT 9%, HAVE-RELATIONSHIP 8%, HOW 7%, PROPERTY 6%, WHY 5%, HOW-MANY 5%, WHERE 5%, WHAT-DOES-X-DO 3%, WHAT-IS-A 3%, HAVE-SIMILARITIES 2%, X-OR-Y 2%, FUNCTION-OF-X 1%, HAVE-DIFFERENCES 1%

Page 14: Hybrid system architecture overview

14

Caveat: Evaluation Criteria

• We provided a clear guideline, but ratings are still subjective
  o A (4.0) = correct, complete answer, no major weakness
  o B (3.0) = correct, complete answer with small cosmetic issues
  o C (2.0) = partially correct or complete answers, with some big issues
  o D (1.0) = somewhat relevant answer or information, or poor presentation
  o F (0.0) = wrong or irrelevant, conflicting or hard-to-locate answers

• Only 3 users to rate the answers, under a tight timeline
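The rubric maps letter grades to the numeric 0–4 scale used in later slides; a minimal sketch (function names are illustrative):

```python
# Letter grades map to numeric scores exactly as the rubric states.
GRADE_SCORE = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def average_rating(grades):
    # Mean numeric score across raters (the deck notes only 3 raters were used).
    return sum(GRADE_SCORE[g] for g in grades) / len(grades)

def is_good(grades, threshold=3.0):
    # Later slides treat a rating >= 3.0 as a "good" answer.
    return average_rating(grades) >= threshold
```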

[Chart: User Preferences (0–3 scale) for Aura, Cyc, Text QA]

Page 15: Hybrid system architecture overview

15

Evaluation Example
Q: What is the maximum number of different atoms a carbon atom can bind at once?

Page 16: Hybrid system architecture overview

16

More Evaluation Samples (Snapshot)

Page 17: Hybrid system architecture overview

17

Reasoner Quality Overview

[Chart: Answer Counts Over Rating – counts (0–160) over ratings 0.00–4.00 for Aura, Cyc, Text QA]

Page 18: Hybrid system architecture overview

18

Performance Numbers

[Chart: Reasoner Performance on All Ratings (0..4) – Precision, Recall, F1 (0.000–0.600) for Aura, Cyc, Text QA]

[Chart: Reasoner Performance on "Good" (>= 3.0) Answers – Precision, Recall, F1 (0.000–0.400) for Aura, Cyc, Text QA]
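The slides do not spell out how precision and recall are defined here; one plausible reading (precision over attempted answers, recall over all questions, "good" meaning rating >= 3.0) can be sketched as:

```python
def precision_recall_f1(n_good, n_answered, n_questions):
    # Precision: fraction of attempted answers rated good.
    # Recall: fraction of all questions answered well.
    # (An assumed reading; the deck does not state the definitions.)
    p = n_good / n_answered if n_answered else 0.0
    r = n_good / n_questions if n_questions else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```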

Page 19: Hybrid system architecture overview

19

Answers Over Question Types

[Chart: Answer Overall Rating (0.00–4.00) by question type – FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP – for Text QA, Cyc, Aura]

[Chart: Count of Answered Questions by question type (0–36) for Text QA, Cyc, Aura]

Page 20: Hybrid system architecture overview

20

Answer Distribution Over Chapters

[Chart: answer counts per chapter (0, 4–12) for Aura, Cyc, Text QA]

[Chart: Answer Quality Over Chapters (0.00–4.00) for Aura, Cyc, Text QA]

Page 21: Hybrid system architecture overview

21

Answers on Questions with E/V Answer Type

[Chart: Exact/Various Answer Quality (0.00–3.00) for Aura, Cyc, Text QA, and Average]

[Chart: Exact/Various Answer Count for Aura, Cyc, Text QA – values 5, 5, 45, 25, 13, 40]

Page 22: Hybrid system architecture overview

22

Improve Performance: Hybrid Solver – Combine!

• Random selector (dumbest, baseline)
  o Total questions answered correctly should beat the best solver

• Priority selector (less dumb)
  o Pick a reasoner following a good order (e.g. Aura > Cyc > Text QA) *
  o Expected performance: better than the best individual

• Trained selector: feature- and rule-based selector (smarter)
  o Decision-Tree (CTree…) learning over Q-Type, Chapter, …
  o Expected performance: slightly better than above

• Theoretical best selector: MAX – the upper limit (smartest)
  o Suppose we can always pick the best performing reasoner
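The three simplest selectors above can be sketched as follows; the reasoner names come from the deck, but the functions and data shapes are illustrative assumptions:

```python
import random

def random_select(answers, rng=random.Random(0)):
    # Baseline: pick uniformly among reasoners that produced an answer.
    answered = [(n, a) for n, a in answers.items() if a is not None]
    return rng.choice(answered) if answered else (None, None)

def priority_select(answers, order=("Aura", "Cyc", "Text QA")):
    # Priority selector: first reasoner in a fixed preference order
    # that produced an answer.
    for name in order:
        if answers.get(name) is not None:
            return name, answers[name]
    return None, None

def max_select(answers, ratings):
    # Theoretical MAX selector: always pick the best-rated answer.
    # An oracle - only computable after ratings are known.
    answered = {n: a for n, a in answers.items() if a is not None}
    if not answered:
        return None, None
    best = max(answered, key=lambda n: ratings.get(n, 0.0))
    return best, answered[best]
```

A trained (decision-tree) selector would replace the fixed `order` with a per-question choice learned from features such as question type and chapter.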

Page 23: Hybrid system architecture overview

24

Performance (F1) with Hybrid Solvers

[Chart: Performance of Solvers on Good Answers (Good: Rating >= 3.0) – F1 (0.000–0.300) for Aura, Cyc, Text QA, Random, Priority, D-Tree, Max]

Page 24: Hybrid system architecture overview

25

Conclusion

• Each reasoner has its own strengths and weaknesses
  o Some aspects not handled well by AURA & CYC
  o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …

• Aggregated performance easily beats the best individual (Text QA)
  o Random solver does a good job (F1: mean=0.609): F1(MAX) – F1(Random) ~ 2.5%

• Little room for better performance via answer selection
  o F1(MAX) – F1(D-Tree) ~ 0.5%
  o Better to focus on MORE and/or BETTER solvers

Page 25: Hybrid system architecture overview

26

Future and Discussions

Page 26: Hybrid system architecture overview

27

Near Future Plans

• Include SQDB-based answers as a “Solver”
  o Helps alleviate reasoners’ question-interpretation problems

• Include Information Retrieval-based answers as a “Solver”
  o Helps understand the extra power reasoners can have over search

• Improve the evaluation mechanism

• Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)

• Improve question selector to support multiple sources and automatic update/merge of question metadata

• Find ways to handle question bank evolution

Page 27: Hybrid system architecture overview

28

Further Technical Directions (2013.6+)

Get More, Better Reasoners (machine learning, evidence combination)
• Extract and use more features to select best answers
• Evidence collection and weighing

Analytics & Tuning
• Easier to explore individual results and diagnose failures
• Support to tune and optimize performance over target question-answer datasets

Inter-solver Communication
• Support shared data, shared answers
• Subgoaling: allow reasoners to call each other for subgoals

Open Data, Open Services, Open Environment

Page 28: Hybrid system architecture overview

29

Open *Data*

Requirements
• Clear semantics, common format (standard), easy to access, persistent (available)

Data Sources
• Question bank, training sets, knowledge base, protocols for intermediate and final data exchange

Open Data Access Layer
• Design and implement protocols and services for data I/O

Page 29: Hybrid system architecture overview

30

Open *Services*

Two Categories
• Pure machine/algorithm based
• Human computation (social, crowdsourcing)

Requirements
• Communicate with open data, generate metadata
• More reliable, scalable, reusable

Goal: Process and Refine Data
• Convert raw, noisy, inaccurate data into refined, structured, useful data

Page 30: Hybrid system architecture overview

31

Open *Environment*

Definition
• An AI development environment to facilitate collaboration, efficiency, and scalability

Operation
• Like an MMORPG, each “player” gets credits: contribution, resource consumption; interests, loans; ratings…

Opportunities
• Self-organized projects, growth potential, encourage collaboration, grand prize

Page 31: Hybrid system architecture overview

32

Thank You!
For having the opportunity for Q&A

Backup slides next

Page 32: Hybrid system architecture overview

33

IBM Watson’s “DeepQA” Hybrid Architecture

Page 33: Hybrid system architecture overview

34

DeepQA Answer Merging And Ranking Module

Page 34: Hybrid system architecture overview

35

Wolfram Alpha Hybrid Architecture

• Data Curation

• Computation

• Linguistic components

• Presentation

Page 35: Hybrid system architecture overview

36

Page 36: Hybrid system architecture overview

37

Page 37: Hybrid system architecture overview

38

Answer Distribution (Density)

[Chart: Answer Distribution – count of answers (0–16) over average user rating (0.00–4.00) for Text QA, Cyc, Aura]

Page 38: Hybrid system architecture overview

39

Data Table for Answer Quality Distribution

Page 39: Hybrid system architecture overview

40

Work Performed

• Created web-based dispatcher infrastructure
  o For both Live Direct QA and Live Suggested Questions
  o Batch mode to handle larger amounts

• Built a web UI for UW students to rate answers of questions (HEF)
  o Coherent UI, duplicate removal, queued tasks

• Established automatic ways for result evaluation and comparison

• Employed initial file and data exchange formats and protocols

• Set up faceted browsing and search (retrieval) UI
  o And web services for 3rd-party consumption

• Carried out many rounds of relevance studies and analysis

Page 40: Hybrid system architecture overview

41

First Evaluation via Halo Evaluation Framework

• We sent individual QA result sets to UW students for evaluation

• First round hybrid system evaluation:
  o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
  o Aura QA: 1 best, 9 good, 14/60 answered
  o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
  o Text QA: 27 best, 29 good; SQA: 3 best, 5 good, 7/60 answered
  o Best scenario: 41/60 answered
  o Note: Cyc Live was not included

* SQA = answering via suggested questions

Page 41: Hybrid system architecture overview

42

Live Direct QA Dispatcher Service

[Screenshot: ask a question (“What does ribosome make?”), wait for answers, answers returned]

Page 42: Hybrid system architecture overview

43

Live Suggested QA Dispatcher Service

Page 43: Hybrid system architecture overview

44

Batch QA Dispatcher Service

Result automatically downloaded once finished

Page 44: Hybrid system architecture overview

45

Live solver Service Dispatchers

Page 45: Hybrid system architecture overview

46

Direct Live QA: What does ribosome make?

Page 46: Hybrid system architecture overview

47

Direct Live QA: What does ribosome make?

Page 47: Hybrid system architecture overview

48

Suggested Questions Dispatcher

Page 48: Hybrid system architecture overview

49

Results for Suggested Question Dispatcher

Page 49: Hybrid system architecture overview

50

Batch Mode QA Dispatcher

Page 50: Hybrid system architecture overview

51

Batch QA Progress Bar

Result automatically downloaded once finished

Page 51: Hybrid system architecture overview

52

Suggested questions database browser

Page 52: Hybrid system architecture overview

53

Faceted Search on Suggested Questions

Page 53: Hybrid system architecture overview

54

Tuning the Suggested Question Recommendation

Accomplished
• Indexed the suggested-questions database
  – Concept, question, answers
• Created a web service for uploading new sets of suggested questions
• Extracted chapter information from answer text (TEXT)
• Analyzed question types
  – Pattern-based
• Experimented with some basic retrieval criteria

Not Yet Implemented
• Parsing the questions
• More experiments (heuristics) on retrieval/ranking criteria
  – Manual
• Get SMEs to generate training data for evaluation
  – Automatic
• More feature extraction

Page 54: Hybrid system architecture overview

55

Parsing, Indexing and Ranking

In Place
• New local concept extraction service
• Concepts extracted and in index
• Both sentences and paragraphs are in index
• Basic sentence type identified
• Chapter and section information in
• Several ways of ranking evaluated

Not Yet Implemented
• More sentence features
  – Content type: questions, figures, header, regular, review…
  – Previous and next concepts
  – Count of concepts
  – Clauses
  – Universal truth
  – Relevance or not
• Question parsing
• More refining on ranking
• Learning to Rank??

Page 55: Hybrid system architecture overview

56

Browse Hybrid system

Page 56: Hybrid system architecture overview

57

WIP: Ranking Experiments (Ablation Study)

Features                      | Only (Easy) | Without (Easy) | Only (Hard) | W/O (Hard)
Sentence Text                 | 139/201     |                | 31/146      |
Sentence Concept              | 79/201      |                | 13/146      |
Prev/Next Sentence Concept    | -           |                | -           |
Locality info (Chapter, etc.) | -           |                | -           |
Stopword list                 | -           |                | -           |
Stemming comparison           | -           |                | -           |
Other features (type…)        | -           |                | -           |
Weighting (variations)        |             |                |             |

Page 57: Hybrid system architecture overview

58

Automatic Evaluation of IR Results

• Inexpensive, consistent results for tuning
  o Always using human judgments would be expensive and somewhat inconsistent

• Quick turnaround

• With both “easy” and “difficult” question-answer sets

• Validated by UW students to be trustworthy
  o 95% accuracy on average with threshold

Page 58: Hybrid system architecture overview

59

First UW Students’ Evaluation on AutoEval

• Notation:
  o 0 = right on. 100% is right, 0% is wrong.
  o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
  o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer

• We gave each of 4 students:
  o 15 questions, 15*5 = 75 sentences and scores to rank
  o 5 of the questions are the same, 10 are unique to each student
  o 23/45 questions from the “hard” set, 22/45 from the “easy” set
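The notation above maps directly to a small checking function; `validate` and its argument names are illustrative, not the project's actual code:

```python
def validate(auto_score, text_contains_answer, threshold=0.5):
    #  0 = auto-eval agrees with the human judgment
    # -1 = false positive: high score, but the text lacks the answer
    # +1 = false negative: low score, but the text has the answer
    predicted_yes = auto_score > threshold
    if predicted_yes and not text_contains_answer:
        return -1
    if not predicted_yes and text_contains_answer:
        return +1
    return 0
```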

Page 59: Hybrid system architecture overview

60

Results: Auto-Evaluation Validity Verification

[Chart: auto-evaluation validity per rater (1–4), 0–1 scale, at thresholds of 50% and 80%]

Page 60: Hybrid system architecture overview

61

The “Easy” QA Set *

• Task: automatically evaluate whether retrieved sentences contain the answer

• Scoring: Max score, Mean Average Precision (MAP)

• Result using Max (with threshold at 80%):
  o 193 regular questions and 8 yes/no questions (via concept overlap)
  • Only with sentence text: 139 (69.2%)
  • Peter’s test set: 149 (74.1%)
  • Peter’s more refined: 158 (78.6%)
  • (Lower) upper bound for IR: 170 (84.2%)
  • Jesse’s best: ??

* The evaluation is for the IR portion ONLY; no answer pinpointing
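The two scoring schemes named above (Max score with a threshold, and MAP) might look like this; these are textbook definitions, not necessarily the project's exact implementation:

```python
def max_score(scores, threshold=0.8):
    # "Max" scoring: the question counts as answered if the best
    # retrieved sentence clears the threshold (the slide uses 80%).
    return bool(scores) and max(scores) >= threshold

def mean_average_precision(ranked_relevance):
    # MAP over questions; each entry is a ranked list of True/False
    # relevance judgments for the retrieved sentences.
    aps = []
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for i, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)  # precision at this hit
        aps.append(sum(precisions) / hits if hits else 0.0)
    return sum(aps) / len(aps) if aps else 0.0
```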

Page 61: Hybrid system architecture overview

62

“Easy” QA Set Auto-Evaluation

[Chart: results (0–0.9) for Q text Only, Vulcan Basic, Vulcan Refined, BaseIR Current, and Upper Bound]

Page 62: Hybrid system architecture overview

64

Best Upper Bound for Hard Set as of Today

With weighting on answer text, answer concepts, question text, and question concepts; matching over sentence text, concepts, concepts from previous and next sentences, and sentence type… Comparison with keyword overlap, concept overlap, stopword removal, and smart stemming techniques…

56/146 = 38.4%

Page 63: Hybrid system architecture overview

66

Sharing the Data and Knowledge

• Information we want – and each solver may also want:

• Everyone’s results

• Everyone’s confidence in results

• Everyone’s supporting evidence
  o From textbook sentences, reviews, homework sections, figures…
  o From related web material, e.g. biology Wikipedia
  o From common world knowledge: ParaPara, WordNet, …

• Training data – for offline use

Page 64: Hybrid system architecture overview

67

More Timeline Details for First Integration

We are in control
• AURA – now
• Text – before 12/7
• Vulcan IR Baseline – before 12/15
• Initial Hybrid System Output – before 12/21
  – Without unified data format
  – With limited (possibly outdated) suggested questions

Partners
• Cyc – ? Hopefully before EOY 2012
• JHU – ?? Hopefully before EOY 2012
• ReVerb – ??? EOM January 2013

Page 65: Hybrid system architecture overview

68

Rounds of Improvements

Analysis (evaluation)
• Evaluation with humans
• With each solver + hybrid system

Page 66: Hybrid system architecture overview

69

OpenHalo

[Diagram: the Vulcan Hybrid System connected through Data Service Collaboration to CYC QA, SILK QA, Other QA, TEQA, and AURA]