an empirical study on using hidden markov models for search interface segmentation


AN EMPIRICAL STUDY ON USING HIDDEN MARKOV MODEL FOR SEARCH INTERFACE SEGMENTATION

Ritu Khare and Yuan An

The iSchool at Drexel

Drexel University, USA

Presentation Order

1. Problem: Interface Segmentation

2. Solution: Hidden Markov Model

3. Empirical Results

4. Summing Up

Part 1

1. Problem: Interface Segmentation
   Motivation: The Deep Web
   Search Interface Segmentation
   Challenges
   Novelty of the Solution
2. Solution: Hidden Markov Model
3. Empirical Results
4. Summing Up

Motivation: The Deep Web

What is the DEEP WEB?
- The portion of the Web that search engines do not return through traditional crawling and indexing.
- Its contents lie in online databases and are accessed by manually filling in HTML forms on search interfaces.

How to make it USEFUL?
- Meta-search engines, e.g. Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005)
- Deep Web crawlers, e.g. Raghavan and Garcia-Molina (2001), Madhavan et al. (2008)

A prerequisite is a thorough understanding of the semantics of search interfaces.

Search Interface Segmentation

A critical part of understanding the semantics of search interfaces:
- The segmentation of search interfaces into logical groups of implied queries.
- The grouping of related interface components together.

[Figure: an example search interface divided into two segments. Top segment = 7 components; bottom segment = 4 components.]

Why is Segmentation Challenging?

Human designer / user:
- A segment has an apparent semantic existence.
- Visual arrangements
- Past experiences

Machine:
- Cannot “see” a segment.
- Visually close components might be located far apart in the HTML code.
- No cognitive ability

In this paper, we investigate whether a machine can “learn” how to segment an interface.

The Novelty of the Solution: Model-Based

Shortcomings of existing works:
- They use rules and heuristics for segmentation.
- These techniques have problems handling scalability and heterogeneity.
- E.g. Zhang et al. (2004), He et al. (2004), Raghavan and Garcia-Molina (2001), Kalijuvee et al. (2001)

We overcome these shortcomings with a model-based approach:

Implicit knowledge (used by a designer to design an interface) -> HMM (artificial designer) -> SEGMENTATION

The Novelty of the Solution: The Domain Aspect

The deep Web has diverse domains, and interface designs differ across domains.

To segment interfaces from a given subject domain …
Existing works have compared the accuracies attained by two methods:
- Domain-specific method: uses interfaces from domain Di to segment an interface I from domain Di.
- Generic method: uses interfaces from a mix of arbitrary domains D1, D2, D3 … to segment an interface I from domain Di.

Using Hidden Markov Models . . .
Fresh perspective: we do not limit ourselves to a comparison between the two methods. For a given domain, we investigate what kind of training interfaces result in high segmentation accuracy, and why.

Part 2

1. Problem: Interface Segmentation
2. Solution: Hidden Markov Model
   Hidden Markov Models (HMMs)
   Search Interface Analysis
   HMM: An Artificial Designer
   2-Layered Approach
   Model Specification & Architecture

What is an HMM?

“A finite state automaton with stochastic state transitions and symbol emissions” (Rabiner, 1989).

[Figure: a chain of hidden STATES q0 → q1 → q2 → q3 → q4 linked by TRANSITIONS; each state qi produces an observable SYMBOL σi through an EMISSION.]

1. State Space: a finite set of states {q0, q1, q2, …, qn}.
2. Transition Matrix: probability P(qi → qj) of transitioning from state qi to state qj.
3. Symbol Space: a set of output tokens {σ1, σ2, …, σm}.
4. Emission Matrix: probability P(qi ↑ σk) of state qi emitting the token σk.

Two ‘stochastic processes’: state transitions and symbol emissions. They are needed to model and explain ‘real-world processes’ that are implicit and unobservable.
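The four elements listed above can be written down as a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `HMM`, the initial distribution `start` (not listed on the slide, assumed here so that a sequence has a well-defined first state), and all probability values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMM:
    states: List[str]                    # state space {q0, ..., qn}
    symbols: List[str]                   # symbol space {s1, ..., sm}
    start: Dict[str, float]              # ASSUMPTION: initial state distribution
    trans: Dict[str, Dict[str, float]]   # trans[qi][qj] = P(qi -> qj)
    emit: Dict[str, Dict[str, float]]    # emit[qi][sk] = P(qi emits sk)

    def validate(self, tol: float = 1e-9) -> None:
        """Check that every probability distribution sums to 1."""
        rows = [self.start]
        rows += [self.trans[q] for q in self.states]
        rows += [self.emit[q] for q in self.states]
        for row in rows:
            if abs(sum(row.values()) - 1.0) > tol:
                raise ValueError(f"distribution does not sum to 1: {row}")

    def joint_probability(self, state_seq: List[str], symbol_seq: List[str]) -> float:
        """P(states, symbols): one transition and one emission per position."""
        if len(state_seq) != len(symbol_seq) or not state_seq:
            raise ValueError("sequences must be non-empty and of equal length")
        prob = self.start[state_seq[0]] * self.emit[state_seq[0]][symbol_seq[0]]
        for prev, cur, sym in zip(state_seq, state_seq[1:], symbol_seq[1:]):
            prob *= self.trans[prev][cur] * self.emit[cur][sym]
        return prob


# Hypothetical toy model; the numbers are illustrative, not taken from the study.
toy = HMM(
    states=["Attribute-name", "Operand"],
    symbols=["Text", "Textbox"],
    start={"Attribute-name": 1.0, "Operand": 0.0},
    trans={
        "Attribute-name": {"Attribute-name": 0.1, "Operand": 0.9},
        "Operand": {"Attribute-name": 0.8, "Operand": 0.2},
    },
    emit={
        "Attribute-name": {"Text": 0.95, "Textbox": 0.05},
        "Operand": {"Text": 0.1, "Textbox": 0.9},
    },
)
toy.validate()
p = toy.joint_probability(["Attribute-name", "Operand"], ["Text", "Textbox"])
```

Here `p` is the product 1.0 × 0.95 × 0.9 × 0.9: the start probability, the first emission, one transition, and the second emission.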

Search Interface Analysis: Semantic Labels

For data-driven Web applications, interface components are translated into structured query (e.g. SQL) expressions:
    SELECT * FROM Gene WHERE Gene_Name = 'maggie';

A segment in a search interface corresponds to a WHERE clause: each segment collects values, qualified using a built-in operator, for a particular attribute in the DB schema.

Segmentation is a two-fold problem:
- Identification of the boundaries of logical groups.
- Assignment of semantic labels to components.

[Figure: a search interface with two logical groups marked; the components of a group are labeled Attribute-name, Operator, and Operand.]
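The correspondence between a labeled segment and a WHERE condition can be sketched as follows. This is an illustration only: the `Segment` class, the `ALLOWED_OPERATORS` whitelist, the choice to join several segments with AND, and the two-row `Gene` table are assumptions made for the example and are not taken from the study.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    attribute_name: str   # column the segment refers to, e.g. Gene_Name
    operator: str         # comparison operator chosen on the interface
    operand: str          # value typed or selected by the user


# ASSUMPTION: a small whitelist of operators for the illustration.
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">=", "LIKE"}


def build_query(table: str, segments: List[Segment]) -> Tuple[str, List[str]]:
    """Turn segments into a parameterized SELECT; one condition per segment."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    if not segments:
        raise ValueError("at least one segment is required")
    conditions, params = [], []
    for seg in segments:
        if not seg.attribute_name.isidentifier():
            raise ValueError(f"unsafe attribute name: {seg.attribute_name!r}")
        if seg.operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {seg.operator!r}")
        conditions.append(f"{seg.attribute_name} {seg.operator} ?")
        params.append(seg.operand)   # the operand is passed as a bound parameter
    return f"SELECT * FROM {table} WHERE " + " AND ".join(conditions), params


# Hypothetical in-memory database that mirrors the slide's Gene example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (Gene_ID TEXT, Gene_Name TEXT)")
conn.execute("INSERT INTO Gene VALUES ('g1', 'maggie')")
conn.execute("INSERT INTO Gene VALUES ('g2', 'other')")

sql, params = build_query("Gene", [Segment("Gene_Name", "=", "maggie")])
rows = conn.execute(sql, params).fetchall()
```

Passing the operand as a bound parameter rather than pasting it into the SQL string is a design choice of this sketch; the slide's query shows the literal value inline.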

INTERFACE DESIGN PROCESS

While the components are observable, their semantic roles appear hidden to a machine.

The way one semantic label follows another is similar to the transitioning of HMM states.

[Figure: the design process for an example interface with two segments.]

Component (observable)   Semantic label (hidden)
Text (Gene ID)           Attribute-name
Textbox                  Operand
Text (Gene Name)         Attribute-name
RB Group                 Operator
Textbox                  Operand

HMM: An Artificial Designer

An HMM can act like a human designer who can design an interface and determine the segment boundaries and semantic labels of its components.

We encoded the implicit knowledge required for interface segmentation in an HMM-based artificial designer. We employ a 2-layered HMM:
- The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand).
- The second layer, S-HMM, segments the interface into logical attributes.

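2-LAYERED HMM

[Figure: the 2-layered pipeline (Parser, then T-HMM, then S-HMM) applied to an example interface.]

Parser (HTML construct)   T-HMM (semantic label)   S-HMM (segment position)
Text                      Attribute-name           Begin-segment
Textbox                   Operand                  End-segment
Text                      Attribute-name           Begin-segment
RB Group                  Operator                 Inside-segment
Textbox                   Operand                  End-segment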
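Once both layers have produced their output, the segment positions can be turned into explicit groups of components. The sketch below shows one possible way to do that grouping; the function name `group_segments` and the policy for malformed sequences (an unterminated segment is closed when a new one begins or when the sequence ends) are assumptions of this example, not part of the presented method.

```python
from typing import List, Tuple

POSITIONS = {"Begin", "Inside", "End", "Outside"}


def group_segments(labels: List[str], positions: List[str]) -> List[List[Tuple[int, str]]]:
    """Group components into segments from the S-HMM position sequence.

    labels[i] is the T-HMM semantic label of component i and positions[i]
    is its S-HMM segment position. Each segment is returned as a list of
    (component index, semantic label) pairs.
    """
    if len(labels) != len(positions):
        raise ValueError("labels and positions must have equal length")
    segments, current = [], []
    for index, (label, position) in enumerate(zip(labels, positions)):
        if position not in POSITIONS:
            raise ValueError(f"unknown segment position: {position!r}")
        if position == "Outside":
            # Components outside any segment also end an unterminated segment.
            if current:
                segments.append(current)
                current = []
            continue
        if position == "Begin" and current:
            # ASSUMPTION: a new Begin closes a segment that never saw its End.
            segments.append(current)
            current = []
        current.append((index, label))
        if position == "End":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# The example interface from the slide.
labels = ["Attribute-name", "Operand", "Attribute-name", "Operator", "Operand"]
positions = ["Begin", "End", "Begin", "Inside", "End"]
segments = group_segments(labels, positions)
```

For the slide's example this yields two segments: components 0 and 1, then components 2 to 4.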

MODEL SPECIFICATION: T-HMM & S-HMM

T-HMM
  Symbols: HTML constructs (text label, textbox, textarea, radio button, checkbox, select list, etc.)
  States: semantic labels (Attribute-name, Operator, Operand, Text-Misc)

S-HMM
  Symbols: semantic labels (Attribute-name, Operator, Operand, Text-Misc)
  States: segment positions (Begin, Inside, End, Outside)

[Figure: architecture. Training interfaces supply symbol sequences and state sequences to the T-HMM and the S-HMM; test interfaces are then passed through both models, which output the semantic labels and segment boundaries of the test interfaces.]

Part 3

2. Solution: Hidden Markov Model
3. Empirical Results
   Initial Experiments
   Variations of Models
   Some Interesting Results
   Conclusions
4. Summing Up

INITIAL EXPERIMENTS: Domain-Specific

FIRST EXPERIMENT: BIOLOGY DOMAIN
- Dataset: 200 interfaces
- Cross-validation: 190 training and 10 testing examples
- Training: maximum likelihood method
- Testing: Viterbi algorithm

Layer   Semantic label     Accuracy (%)
S-HMM   Segment            86.05
T-HMM   Attribute-name     90.11
T-HMM   Attribute-name *   99.75
T-HMM   Operator           85.10
T-HMM   Operand            98.60

* For segments with multiple instances of attribute-names, at least one was correctly identified.

COMPARISON WITH LEX (He et al. 2007): 4 DOMAINS
- Dataset: 100 interfaces each

The HMM and HMMbio columns give the change relative to LEX.

Domain       LEX     HMM      HMMbio
Biology      70.94   +16.66   +16.66
Health       66.85   +5.39    +13.74
Automobile   54.34   +24.66   +18.01
Movie        70      +0       +5.9

Why did the 2-layered HMM outperform LEX?
- LEX does not model text-misc and thus suffered from under-segmentation.
- LEX considers only those texts as attribute-names that are located within 2-top-row distance from the form element. In reality, attribute-name and operand might be located far apart in the source code.
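The training and testing steps named on this slide, maximum likelihood training from manually tagged sequences and Viterbi decoding, can be sketched as below. This is a generic textbook version under stated assumptions, not the authors' code: the function names, the add-`alpha` smoothing (added here so that unseen transitions do not receive zero probability; pure maximum likelihood corresponds to no smoothing), and the three hypothetical training interfaces are all illustrative.

```python
import math
from collections import Counter, defaultdict


def train_mle(tagged_sequences, states, symbols, alpha=1.0):
    """Estimate start, transition and emission probabilities by counting.

    tagged_sequences: list of sequences of (symbol, state) pairs.
    alpha: ASSUMPTION, additive smoothing constant; must be positive.
    """
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sequence in tagged_sequences:
        prev = None
        for symbol, state in sequence:
            if prev is None:
                start_counts[state] += 1
            else:
                trans_counts[prev][state] += 1
            emit_counts[state][symbol] += 1
            prev = state

    def normalize(counts, keys):
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: (counts[k] + alpha) / total for k in keys}

    start = normalize(start_counts, states)
    trans = {s: normalize(trans_counts[s], states) for s in states}
    emit = {s: normalize(emit_counts[s], symbols) for s in states}
    return start, trans, emit


def viterbi(observations, states, start, trans, emit):
    """Return the most probable state sequence (log-space Viterbi)."""
    if not observations:
        return []
    first = observations[0]
    score = [{s: math.log(start[s]) + math.log(emit[s][first]) for s in states}]
    back = []
    for obs in observations[1:]:
        prev_score = score[-1]
        row, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_score[p] + math.log(trans[p][s]))
            row[s] = prev_score[best] + math.log(trans[best][s]) + math.log(emit[s][obs])
            pointers[s] = best
        score.append(row)
        back.append(pointers)
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):   # follow back-pointers from the last position
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tagged training interfaces for a T-HMM-like model.
STATES = ["Attribute-name", "Operator", "Operand"]
SYMBOLS = ["Text", "Textbox", "RB Group", "Select"]
training = [
    [("Text", "Attribute-name"), ("Textbox", "Operand")],
    [("Text", "Attribute-name"), ("RB Group", "Operator"), ("Textbox", "Operand")],
    [("Text", "Attribute-name"), ("Textbox", "Operand"),
     ("Text", "Attribute-name"), ("Select", "Operand")],
]
start, trans, emit = train_mle(training, STATES, SYMBOLS)
decoded = viterbi(["Text", "Textbox", "Text", "RB Group", "Textbox"],
                  STATES, start, trans, emit)
```

Working in log space avoids numerical underflow on long component sequences. A symbol that never appears in `SYMBOLS` would raise a `KeyError` in this sketch; a real system would need a policy for unseen HTML constructs.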

HMM Variations: T-HMM Topology

Design preferences of designers from different domains are different.

[Figure: T-HMM topologies for six training sets: AUTOMOBILE, BIOLOGY, HEALTH, REFERENCE & EDU, MOVIE, and MIXED. Transitions with probability below 5% are not shown.]

RESULTS

HMM variations (based on training data), by test domain:

Test Domain   HMMauto   HMMbio   HMMhealth   HMMMovie   HMMref_edu   HMMmixed
Auto          79        72.35    73.63       68.81      67.52        70.7
Bio           48.7      87.6     48.72       45.29      52.56        51.2
Health        70.35     80.59    72.24       69         74.12        73.05
Movie         72.96     75.9     73.33       70         74.81        74
Ref. & Edu.   44.44     62.3     43.25       38.88      51           44

1. Domain-Specific
2. Generic
3. Cross-Domain

Domain-specific models do not always result in the best performance, e.g. in the movie domain.

[Figures: “A Pattern Captured by Domain Specific Model” and “A Pattern Captured by Cross-Domain Model”; labels: Automobile, Health, Text-misc.]
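The results table can be read programmatically to see which model performs best on each test domain. The snippet below is a small sketch over the figures copied from the table; the dictionary layout and the names `RESULTS`, `OWN_MODEL`, `best`, `bio_share`, and `domain_specific_wins` are choices made for this illustration.

```python
# Figures copied from the results table (rows: test domain, columns: model).
RESULTS = {
    "Auto": {"HMMauto": 79.0, "HMMbio": 72.35, "HMMhealth": 73.63,
             "HMMmovie": 68.81, "HMMref_edu": 67.52, "HMMmixed": 70.7},
    "Bio": {"HMMauto": 48.7, "HMMbio": 87.6, "HMMhealth": 48.72,
            "HMMmovie": 45.29, "HMMref_edu": 52.56, "HMMmixed": 51.2},
    "Health": {"HMMauto": 70.35, "HMMbio": 80.59, "HMMhealth": 72.24,
               "HMMmovie": 69.0, "HMMref_edu": 74.12, "HMMmixed": 73.05},
    "Movie": {"HMMauto": 72.96, "HMMbio": 75.9, "HMMhealth": 73.33,
              "HMMmovie": 70.0, "HMMref_edu": 74.81, "HMMmixed": 74.0},
    "Ref. & Edu.": {"HMMauto": 44.44, "HMMbio": 62.3, "HMMhealth": 43.25,
                    "HMMmovie": 38.88, "HMMref_edu": 51.0, "HMMmixed": 44.0},
}

# The domain-specific model for each test domain.
OWN_MODEL = {"Auto": "HMMauto", "Bio": "HMMbio", "Health": "HMMhealth",
             "Movie": "HMMmovie", "Ref. & Edu.": "HMMref_edu"}

# Best-scoring model per test domain.
best = {domain: max(row, key=row.get) for domain, row in RESULTS.items()}

# Share of test domains on which HMMbio is the best model.
bio_share = sum(model == "HMMbio" for model in best.values()) / len(best)

# Domains where the domain-specific model is also the best one.
domain_specific_wins = [d for d in RESULTS if best[d] == OWN_MODEL[d]]
```

On these figures the domain-specific model is the best only for the automobile and biology domains, while HMMbio is the best model on four of the five test domains.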

CONCLUSION

Frequent pattern P from domain D1:
  P can be recovered by HMMD1. E.g., biology and automobile.

Rare pattern P from domain D1:
  P can best be recovered using HMMD2, where D2 is a domain that has P as a frequent pattern. E.g., movie and health, wherein most of the rare patterns are recovered by HMMbio.

An artificial designer trained on more appropriate interfaces leads to more accurate results. The appropriateness depends on:
- the frequency of segment design patterns in the test domain
- the frequency of segment design patterns in the training dataset.

Part 4

2. Solution: Hidden Markov Model
4. Summing Up
   Contributions
   Future Work

CONTRIBUTIONS

- Introduction of a 2-layered HMM approach for interface segmentation, motivated by the probabilistic nature of the interface design process.
  - First work to apply HMMs to deep Web search interfaces.
- Effectiveness test across representative domains of the deep Web.
  - High segmentation accuracy in most domains.
  - Outperformed a previous approach, LEX, by at least 10% in most cases.
- Design and comparison of various learning models.
  - A single model has the potential to accurately segment interfaces from multiple domains, provided it is trained on data with an appropriate variety and frequency of design patterns.
  - An example is HMMbio, which performed better than the other models on 80% of the tested domains. The variety and frequency of patterns in the biology domain help HMMbio contain more design knowledge and be a smarter designer.

FUTURE WORK

- Design a minimal set of models that reaches as many deep Web domains as possible.
  - Involve more domains.
  - Each model should return higher accuracy than its domain-specific counterpart.
- Transition to a new interface representation scheme: distributed segments and segments with intertwined components.
- Recover the schema of deep Web databases: extract finer details, such as data types and constraints.
- Overcome the challenges posed by HMMs.
  - Manual tagging of training data: explore unsupervised training methods such as the Baum-Welch algorithm.
  - Time taken by the Viterbi algorithm for state recovery: find optimization techniques to improve efficiency. Use this method as an off-line pre-processing module for other applications such as meta-search engines and deep Web crawlers.

Suggestions, Thoughts, Ideas, Questions…

THANK YOU !

Acknowledgements: To the Anonymous Reviewers of CIKM 2009

References: [1] to [23] (in full paper).
