kantornsf-nij-isi-03-06-04.ppt

Libraries and Intelligence. NSF/NIJ Symposium on Intelligence and Security Informatics, Tucson, AZ. Paul Kantor, June 2, 2003. Research supported in part by the National Science Foundation under Grant EIA-0087022 and by the Advanced Research Development Activity under Contract 2002-H790400-000. The views expressed in this presentation are those of the author, and do not necessarily represent the views of the sponsoring agency.


Page 1: kantorNSF-NIJ-ISI-03-06-04.ppt

Libraries and Intelligence

NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AZ.

Paul Kantor

June 2, 2003

Research supported in part by the National Science Foundation under Grant EIA-0087022 and by the Advanced Research Development Activity under Contract 2002-H790400-000. The views expressed in this presentation are those of the author, and do not necessarily represent the views of the sponsoring agency.

Page 2: kantorNSF-NIJ-ISI-03-06-04.ppt

Relation to General Intelligence and Security Informatics

• Signal information

• Map and image information

• Sound/voice information

• Geographic information

• Structured (Database) information

• Free form textual information in machine readable form

Page 3: kantorNSF-NIJ-ISI-03-06-04.ppt

Relation to Librarianship

• Much of the needed “technology” for managing information related to homeland security is of the same type that librarians have provided “by hand”.

• But ...
– millions of documents
– dozens of languages
– many media

Page 4: kantorNSF-NIJ-ISI-03-06-04.ppt

Librarianship

• Cataloging
– Organizing information according to what it is about
– Classification – Machine Learning
– Use training examples
– Adapt as more data is received
– Filter huge streams of potentially relevant data

• Monitoring Message Streams

Page 5: kantorNSF-NIJ-ISI-03-06-04.ppt

Librarianship

• Reference
– Understand what the user wants
– Understand both relevance and quality/genre
– Learn from a dialog with the user

• Intelligent Question Answering

Page 6: kantorNSF-NIJ-ISI-03-06-04.ppt

Two Projects

• Filtering/Monitoring Message Streams
– National Science Foundation (NSF), acting for the National Security Agency

• HITIQA: High-Quality Interactive Question Answering
– Advanced Research Development Activity (ARDA) of the Intelligence Community

Page 7: kantorNSF-NIJ-ISI-03-06-04.ppt

Motivation:

monitoring of global satellite communications (though this may produce voice rather than text)

sniffing and monitoring email traffic

OBJECTIVE:

Monitor streams of textualized communication to detect pattern changes and "significant" events

Page 8: kantorNSF-NIJ-ISI-03-06-04.ppt

Rutgers DIMACS: Automatic Event Finding in Streams of Messages

© Paul Kantor 2002

Retrospective/Supervised/Tracking

1. Accumulated documents
2. Unexpected event
3. Initial Profile
4. Guided Retrieval
5. Clustering
6. Revision and Iteration
7. Track New documents

Prospective/Unsupervised/Detection

1. Accumulated documents
2. Clustering
3. Initial Profile
4. Anticipated event
5. Guided Retrieval

[Diagram: "Analysts" appear in both flows.]

Page 9: kantorNSF-NIJ-ISI-03-06-04.ppt

MMS Team

Statisticians, computer scientists, experts in information retrieval and library science, etc.

Prof. Fred Roberts – decision rules

Prof. David Madigan – statistics

Dr. David Lewis – text classification

Prof. Paul Kantor – info science

Prof. Ilya Muchnik – statistics

Prof. Muthu Muthukrishnan – algorithms

Dr. Martin Strauss, AT&T Labs – algorithms

Dr. Rafail Ostrovsky, Telcordia Technologies – algorithms

Prof. Endre Boros – Boolean optimization

Dr. Vladimir Menkov – programming

Dr. Alex Genkin – programming

Mr. Andrei Anghelescu – graduate assistant

Mr. Dmitriy Fradkin – graduate assistant

Page 10: kantorNSF-NIJ-ISI-03-06-04.ppt

TECHNICAL PROBLEM:

• Given a stream of text in any language.

• Decide whether "new events" are present in the flow of messages.

• Event: a new topic, or a topic with an unusual level of activity.

• Retrospective or “Supervised” Event Identification: classification into pre-existing classes.

Page 11: kantorNSF-NIJ-ISI-03-06-04.ppt

More Complex Problem: Prospective Detection or “Unsupervised” Learning

1) Classes change: new classes appear, or existing classes change meaning

2) A difficult problem in statistics

3) Recent new CS approaches

4) Algorithm detects a new class

5) Human analyst labels it; determines its significance
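Steps 4 and 5 can be illustrated with a minimal single-pass sketch. It is not the project's algorithm: the cosine similarity, the 0.3 threshold, the class centroids, and all names below are assumptions chosen for illustration. A message that is not close to any known class is handed to the analyst (here, marked `None`).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def detect_new_classes(stream, centroids, threshold=0.3):
    """Yield (message_index, best_known_class_or_None) for each message.

    None means no known class is similar enough, so the message is
    a candidate new class for a human analyst to label.
    """
    for i, vec in enumerate(stream):
        best_label, best_sim = None, 0.0
        for label, centroid in centroids.items():
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_label, best_sim = label, sim
        yield i, (best_label if best_sim >= threshold else None)

# Hypothetical known classes and incoming messages (term -> weight).
known = {"shipping": {"port": 1.0, "cargo": 1.0},
         "weather": {"storm": 1.0, "rain": 1.0}}
messages = [{"cargo": 2.0, "port": 1.0},
            {"tunnel": 1.0, "fire": 1.0}]
results = list(detect_new_classes(messages, known))
```

The first message matches the "shipping" class; the second shares no terms with any known class and is flagged for labeling.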

Page 12: kantorNSF-NIJ-ISI-03-06-04.ppt

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING

(1). Compression of Text -- to meet storage and processing limitations;

(2). Representation of Text -- put in form amenable to computation and statistical analysis;

(3). Matching Scheme -- computing similarity between documents;

(4). Learning Method -- build on judged examples to determine characteristics of document cluster (“event”)

(5). Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
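Components (2) and (3) can be sketched together under simple assumptions: a bag-of-words representation with tf-idf weights, and matching by the dot product of length-normalized vectors. The weighting formula, function names, and toy documents are illustrative choices, not details taken from the project.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized tf-idf dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def match(u, v):
    """Similarity of two normalized vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [
    "ship arrives at port with cargo".split(),
    "cargo ship leaves port".split(),
    "storm warning issued for coast".split(),
]
vecs = tfidf_vectors(docs)
sim_related = match(vecs[0], vecs[1])    # documents sharing terms
sim_unrelated = match(vecs[0], vecs[2])  # documents with no shared terms
```

Documents that share informative terms score higher than documents with disjoint vocabularies, which is the property a matching scheme needs before any learning method is applied.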

Page 13: kantorNSF-NIJ-ISI-03-06-04.ppt

Project Components: Rutgers DIMACS MMS

[Diagram. Component columns: Compression, Representation, Matching, Learning, Fusion. Methods shown: Random Projections; Boolean Random Projections; Robust Feature Selection; Bag of Words; Bag of Bits; tf-idf; kNN; Boolean; r-NN; Rocchio separator; Combinatorial Clustering; Naïve Bayes; Sparse Bayes; Discriminant Analysis; Support Vector Machines; Non-linear Classifiers.]
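As an illustration of the compression component, here is a minimal sketch of a Gaussian random projection, one common form of the "Random Projections" named on this slide. The dimensions, seeds, and function names are assumptions; the project's own variant (for example the Boolean one) may differ.

```python
import math
import random

def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian random matrix, scaled so squared lengths are preserved on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]

def project(matrix, x):
    """Map a dense vector x to the lower-dimensional space."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in matrix]

def distance(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical 1000-dimensional document vectors.
rng = random.Random(1)
x = [rng.random() for _ in range(1000)]
y = [rng.random() for _ in range(1000)]

R = random_projection_matrix(n_features=1000, n_components=200)
# Ratio of the distance after projection to the distance before it.
ratio = distance(project(R, x), project(R, y)) / distance(x, y)
```

The ratio stays close to 1, so distance-based matching can run on the 200-dimensional vectors at a fraction of the storage cost.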

Page 14: kantorNSF-NIJ-ISI-03-06-04.ppt

Proposed Advances

• Existing methods use some or all of the five automatic processing components, but don’t exploit the full power of the components and/or an understanding of how to apply them to text data.

• Lewis' methods used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data. Very good TREC 2002 results on batch learning.

• Chinese Academy of Sciences used the most basic linear classifier (the Rocchio model) and achieved the best adaptive learning results.

Page 15: kantorNSF-NIJ-ISI-03-06-04.ppt

Proposed Advances II

• We can trace a path (called a homotopy) in method space, from a poor Rocchio model to the CAS (Chinese Academy of Sciences) one, and find some better results along the way.

• Next steps are:
– more sophisticated statistical methods
– sophisticated data compression in a pre-processing stage
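One way to read "a path in method space" is as a continuous interpolation between the parameter settings of two Rocchio models, evaluating each intermediate model along the way. The sketch below makes that reading concrete. The slide does not say which parameters were varied, so the `beta`/`gamma` weights, the toy data, and every name here are assumptions.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of dense vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def rocchio_weights(pos, neg, beta, gamma):
    """Rocchio prototype: beta * mean(positives) - gamma * mean(negatives)."""
    dim = len((pos or neg)[0])
    mp, mn = mean_vector(pos, dim), mean_vector(neg, dim)
    return [beta * p - gamma * q for p, q in zip(mp, mn)]

def accuracy(w, data, threshold=0.0):
    """Fraction of (vector, label) pairs classified correctly by sign of w.x."""
    hits = 0
    for x, label in data:
        score = sum(wj * xj for wj, xj in zip(w, x))
        hits += (score > threshold) == label
    return hits / len(data)

def homotopy(start, end, steps):
    """Yield parameter settings on the straight line from start to end."""
    for k in range(steps + 1):
        t = k / steps
        yield tuple((1 - t) * a + t * b for a, b in zip(start, end))

# Hypothetical training and held-out examples in a 2-feature space.
pos = [[1.0, 0.2], [0.9, 0.4]]
neg = [[0.8, 0.9], [0.6, 1.0]]
held_out = [([1.0, 0.3], True), ([0.7, 0.95], False)]

path = []
for beta, gamma in homotopy(start=(1.0, 0.0), end=(1.0, 1.0), steps=4):
    w = rocchio_weights(pos, neg, beta, gamma)
    path.append((gamma, accuracy(w, held_out)))
```

On this toy data the model that ignores negative examples (gamma = 0) misclassifies one of the two held-out items, while the models near the other end of the path classify both correctly.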

Page 16: kantorNSF-NIJ-ISI-03-06-04.ppt

MORE SOPHISTICATED STATISTICAL APPROACHES:

• Representations: Boolean representations; weighting schemes

• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values

• Learning Methods: new kernel-based methods (nonlinear classification); more complex Bayes classifiers to assign objects to highest probability class

• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes

Page 17: kantorNSF-NIJ-ISI-03-06-04.ppt

THE APPROACH

• Identify the best combination of newer methods through careful exploration of a variety of tools.

• Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).

• Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.

• Systematic experimentation on components and on fusion schemes.

Page 18: kantorNSF-NIJ-ISI-03-06-04.ppt

Mercer Kernels

Mercer’s Theorem gives necessary and sufficient conditions for a continuous symmetric function K to admit this representation:

$K(x, z) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(z) = \Phi(x) \cdot \Phi(z)$

Condition: K “pos. semi-definite”. Such kernels are “Mercer Kernels”.

This kernel defines a set of functions $H_K$, elements of which have an expansion as:

$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$

This set of functions is a “reproducing kernel Hilbert space”.

Prepared by David L. Madigan
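A small numerical sketch may help with the two ideas on this slide: a Mercer kernel yields a positive semi-definite Gram matrix, and functions in the associated space can be evaluated as a finite kernel expansion f(x) = sum_i alpha_i K(x, x_i). The Gaussian kernel is used only because it is a standard example of a Mercer kernel. The points, coefficients, and names are assumptions.

```python
import math

def rbf_kernel(x, z, width=1.0):
    """Gaussian kernel, a standard example of a Mercer kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel values K(x_i, x_j) over all pairs of points."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, c):
    """Compute c^T K c, which is >= 0 for a positive semi-definite K."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def kernel_expansion(x, points, alphas, kernel):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, p) for a, p in zip(alphas, points))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, rbf_kernel)
q = quadratic_form(K, [1.0, -2.0, 0.5])  # any coefficient vector gives q >= 0
f0 = kernel_expansion((0.0, 0.0), points, [1.0, 0.0, 0.0], rbf_kernel)
```

Checking one coefficient vector does not prove positive semi-definiteness; it only illustrates what the condition asserts.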

Page 19: kantorNSF-NIJ-ISI-03-06-04.ppt

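Support Vector Machine

Two-class classifier with the form:

$f(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i K(x, x_i)$

parameters chosen to minimize:

$\min_{\alpha_0, \alpha} \sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \lambda \alpha^T K \alpha$

where $\lambda$ is the tuning constant, $K$ is the Gram matrix, and $\lambda \alpha^T K \alpha$ is the complexity penalty.

Many of the fitted $\alpha_i$’s are usually zero; the $x_i$’s corresponding to the non-zero $\alpha_i$’s are the “support vectors.”

Prepared by David L. Madigan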
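The objective on this slide (hinge loss on the training points plus a complexity penalty built from the Gram matrix) can be evaluated directly for a given set of coefficients. The sketch below only computes the objective; it is not a solver. The toy Gram matrix, labels, and names are assumptions.

```python
def svm_objective(alpha0, alphas, K, labels, lam):
    """Sum of hinge losses [1 - y_i f(x_i)]_+ plus lam * alpha^T K alpha."""
    n = len(labels)
    hinge = 0.0
    for i in range(n):
        # f(x_i) = alpha0 + sum_j alpha_j K(x_i, x_j)
        f_i = alpha0 + sum(alphas[j] * K[i][j] for j in range(n))
        hinge += max(0.0, 1.0 - labels[i] * f_i)
    penalty = sum(alphas[i] * K[i][j] * alphas[j]
                  for i in range(n) for j in range(n))
    return hinge + lam * penalty

# Gram matrix of the 1-D points -1 and +1 under the linear kernel K(x, z) = x*z.
K = [[1.0, -1.0],
     [-1.0, 1.0]]
labels = [-1, 1]

fitted = svm_objective(0.0, [-0.5, 0.5], K, labels, lam=0.1)
untrained = svm_objective(0.0, [0.0, 0.0], K, labels, lam=0.1)

# Training points with non-zero coefficients play the role of support vectors.
support = [i for i, a in enumerate([-0.5, 0.5]) if a != 0.0]
```

With all coefficients at zero every point incurs a hinge loss of 1; the fitted coefficients remove the loss entirely and pay only the penalty term.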

Page 20: kantorNSF-NIJ-ISI-03-06-04.ppt

Regularized Linear Feature Space Model

Choose a model of the form:

$f(x) = \sum_{i=1}^{\infty} w_i \phi_i(x) + b$

to minimize:

$\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|^2$

Solution is finite dimensional:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i) \cdot \Phi(x) + b$

prediction is sign(f(x))

just need to know K, not $\Phi$!

A kernel is a function K, such that for all $x, z \in X$

$K(x, z) = \Phi(x) \cdot \Phi(z)$

where $\Phi$ is a mapping from X to an inner product feature space F

Prepared by David L. Madigan
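The remark that one only needs K, not the feature map, can be checked on a textbook case. For 2-D inputs, the quadratic kernel K(x, z) = (x . z)^2 equals the inner product of the explicit features (x1^2, sqrt(2) x1 x2, x2^2). This particular kernel and the sample points are illustrative choices, not taken from the slides.

```python
import math

def phi(x):
    """Explicit feature map whose inner product equals (x . z)^2 in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Quadratic kernel computed without ever forming phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in feature space
implicit = poly_kernel(x, z)     # same value from the kernel alone
```

Both routes give the same number, but the kernel route never builds the feature vectors, which matters when the feature space is very large or infinite-dimensional.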

Page 21: kantorNSF-NIJ-ISI-03-06-04.ppt

Mixture Models

• Pr(d|Rel)=af(d)+(1-a)g(d)

• f, g may be centered at different points in document space. So distinct conceptual representations are accommodated easily.

• Examples: multinomial distributions.
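The mixture Pr(d|Rel) = a f(d) + (1-a) g(d) can be made concrete with two multinomial components over a small vocabulary. The vocabulary, the component probabilities, and the mixing weight below are invented for illustration. The sketch also assumes every term in the document has non-zero probability under both components.

```python
import math
from collections import Counter

def multinomial_log_prob(counts, probs):
    """Log-probability of a bag of words under a multinomial model."""
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)  # log n!
    for term, c in counts.items():
        log_p += c * math.log(probs[term]) - math.lgamma(c + 1)
    return log_p

def mixture_prob(counts, a, f_probs, g_probs):
    """Pr(d|Rel) = a * f(d) + (1 - a) * g(d)."""
    f = math.exp(multinomial_log_prob(counts, f_probs))
    g = math.exp(multinomial_log_prob(counts, g_probs))
    return a * f + (1.0 - a) * g

# Two hypothetical components centered on different vocabulary.
f_probs = {"tunnel": 0.6, "fire": 0.3, "train": 0.1}
g_probs = {"tunnel": 0.1, "fire": 0.2, "train": 0.7}

doc = Counter(["tunnel", "tunnel", "fire"])
p = mixture_prob(doc, a=0.5, f_probs=f_probs, g_probs=g_probs)
```

Here f(d) = 3 x 0.6^2 x 0.3 = 0.324 and g(d) = 3 x 0.1^2 x 0.2 = 0.006, so the mixture gives 0.165. The document is explained almost entirely by the first component.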

Page 22: kantorNSF-NIJ-ISI-03-06-04.ppt

Example Results on Fusion

• http://dimacspc6.rutgers.edu/~dfradkin/fusion/centroid/try.pdf

• http://dimacspc6.rutgers.edu/~dfradkin/applet/topicShowApplet.jsp

• 60,000 documents.

Page 23: kantorNSF-NIJ-ISI-03-06-04.ppt

Learning takes place in two spaces: For matching and filtering, we learn rules in the primary space of document features. For fusion processes we learn rules in a secondary space of “pseudo-features” which are assigned by entire systems to incoming documents.

[Figure: panels labeled Feature space, Random Subspace, and Score space; regions marked “Relevant”.]

Page 24: kantorNSF-NIJ-ISI-03-06-04.ppt

REFERENCE ASPECT

Effective Communication with the Analyst User

Page 25: kantorNSF-NIJ-ISI-03-06-04.ppt

HITIQA: High-Quality Interactive Question Answering

University at Albany, SUNY

Rutgers University

Page 26: kantorNSF-NIJ-ISI-03-06-04.ppt

HITIQA Team

• SUNY Albany:
– Prof. Tomek Strzalkowski, PI/PM
– Prof. Rong Tang
– Prof. Boris Yamrom, consultant
– Ms. Sharon Small, Research Scientist
– Mr. Ting Liu, Graduate Student
– Mr. Nobuyuki Shimizu, Graduate Student
– Mr. Tom Palen, summer intern
– Mr. Peter LaMonica, summer intern/AFRL

• Rutgers:
– Prof. Paul Kantor, co-PI
– Prof. K.B. Ng
– Prof. Nina Wacholder
– Mr. Robert Rittman, Graduate Student
– Ms. Ying Sun, Graduate Student
– Mr. Peng Song, Graduate Student

Page 27: kantorNSF-NIJ-ISI-03-06-04.ppt

HITIQA Concept

Question: What recent disasters occurred in tunnels used for transportation?

Clarification Dialogue:
S: Are you interested in train accidents, automobile accidents or others?
U: Any that involved lost life or a major disruption in communication. Must identify losses.

Semantics: What the question “means”:
• to the system
• to the user

Possible Category Axes Seen: Vehicle type; Losses/Cost; location (values shown: other, auto, train)

[Diagram components: USER PROFILE; TASK CONTEXT. QUESTION NL PROCESSING. SEMANTIC PROC. TEMPLATE SELECTION. SEARCH & CATEGORIZE. KB. QUALITY ASSESSMENT. FUSE & SUMMARIZE. ANSWER GENER. Focused Information Need. Answer & Justification.]

Page 28: kantorNSF-NIJ-ISI-03-06-04.ppt

Key Research Issues

• Question Semantics – how the system “understands” user requests

• Human-Computer Dialogue – how the user and the system negotiate this understanding

• Information Quality Metrics – what makes some information better than other information

• Information Fusion – how to assemble the answer that fits user needs

Page 29: kantorNSF-NIJ-ISI-03-06-04.ppt

[Architecture diagram. Components: Question Processor; Document Retrieval; Segment/Filter; Cluster Segments; Build Frames; Process Frames; Dialogue Manager; Query Refinement; Answer Generator; Visualization. Resources: Wordnet; DB; Gate. Input: question. Output: answer. Legend: Completed Work; Current Focus.]

Page 30: kantorNSF-NIJ-ISI-03-06-04.ppt

Data-Driven NL Semantics

What does the question mean to the user?
– The speech act
– The focus
– User’s task, intention, goal
– User’s background knowledge

What does the question mean to the system?
– Available information
– Information that can be retrieved
– The dimensions of the retrieved information

Page 31: kantorNSF-NIJ-ISI-03-06-04.ppt

Answer Space Topology

[Diagram regions: KERNEL QUESTION MATCH; NEAR MISSES, ALTERNATIVE INTERPRETATIONS; ALL RETRIEVED FRAMES.]

Page 32: kantorNSF-NIJ-ISI-03-06-04.ppt

Quality Judgments

• Focus Group:
– Sessions conducted: March-April, 2002
– Results: Nine quality aspects generated

• Expert Sessions:
– Sessions conducted: May-June, 2002
– Results: 100 documents scored twice along 9 quality aspects

• Student Sessions:
– Training and Testing Sessions: June-July, 2002
  • 10 documents judged by experts used for training/testing
– Actual Judgment Sessions: June-August, 2002
  • Qualified students evaluated 10 documents per session
– Results: 900 documents scored twice along 9 quality aspects

Page 33: kantorNSF-NIJ-ISI-03-06-04.ppt

Factor Analysis of 9 Quality Features

Appearance

Content

Page 34: kantorNSF-NIJ-ISI-03-06-04.ppt

Modeling Quality of Text

• Kitchen sink approach
– 160 “independent” variables
– Part-of-speech, vocabulary
– stylistics, named entities, …

• Statistical pruning
– Statistically significant variables
– May be nonsensical to human

• Human pruning
– Only “sensible” variables retained for each quality

• Pruning improves performance
– Kitchen sink overfits
– Statistics and Human close in performance
– More work needed to understand the relationship

Page 35: kantorNSF-NIJ-ISI-03-06-04.ppt

Performance of models

Quality Prediction by Linear Combination of Textual Features (from 5 to 17 variables). Split Half for Training and Testing.

Quality Factor            Prediction Rate
Depth                     67%
Author Credential         55%
Accuracy                  69%
Source                    57%
Objectivity               64%
Grammar                   79%
One Side vs Multi View    70%
Verbosity                 63%
Readability               76%

Page 36: kantorNSF-NIJ-ISI-03-06-04.ppt

Quality Aspect: Depth ROC Using Stepwise Discriminant Function

[ROC plot. Horizontal axis: False Alarm Rate (0 to 1). Vertical axis: Detection Rate (0 to 1). Curves: Training; Testing; Perfect Knowledge; No Knowledge.]
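For readers unfamiliar with the plot, an ROC curve of this kind can be traced by sweeping a threshold down a ranked list of classifier scores and recording the false alarm rate and detection rate at each step. The scores and labels below are made up, and ties between scores are not treated specially in this minimal sketch.

```python
def roc_points(scores, labels):
    """Return (false_alarm_rate, detection_rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1   # one more true detection
        else:
            fp += 1   # one more false alarm
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical discriminant scores; label 1 marks the high-quality class.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

A "Perfect Knowledge" curve would have area 1 and a "No Knowledge" diagonal would have area 0.5; this toy ranking, with one misordered pair out of nine, has area 8/9.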

Page 37: kantorNSF-NIJ-ISI-03-06-04.ppt

In Summary

• The two conceptual foundations of librarianship, cataloging and reference, translate to two important problems in managing streams of textual messages.

• Both involve pattern recognition or machine learning.

Page 38: kantorNSF-NIJ-ISI-03-06-04.ppt

Two Roles for Learning

• Cataloging: learning which features of a message mean that it is significant to the problem at hand

• Reference: learning which features of a message mean that it is “salient” to a specific user of the system.

Page 39: kantorNSF-NIJ-ISI-03-06-04.ppt

Appendix: The following slides were not presented at the conference.

Page 40: kantorNSF-NIJ-ISI-03-06-04.ppt

Communicating Credibility

• A system that is correct 75% or 80% of the time will be wrong one time in every four or five.

• Unless it can “shade” its judgments or recommendations, the analyst will lose confidence in it.

• Credibility must be high enough to avoid extensive rework.

Page 41: kantorNSF-NIJ-ISI-03-06-04.ppt

Data Fusion

• Use multiple methods to assess the relevance of documents or passages
– For a given question, dialogue, or cluster
– Each method assigns a “score”

• Candidates → points in a “score space”

• Seek patterns to localize the most relevant documents or passages in this “score space”

• Developed interactive data analysis tool
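The idea of candidates as points in a "score space" can be sketched with two hypothetical scoring methods. Each document becomes a tuple of its scores. One simple pattern to look for is agreement in rank, implemented here as a sum-of-ranks fusion. The method names, scores, and tie-breaking rule are assumptions, not the project's tool.

```python
def ranks(scores):
    """Map each document id to its rank (1 = best) under one method."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: r for r, doc in enumerate(ordered, start=1)}

def score_space(*methods):
    """Place each candidate at a point whose coordinates are its scores."""
    docs = set.intersection(*(set(m) for m in methods))
    return {d: tuple(m[d] for m in methods) for d in sorted(docs)}

def rank_fusion(*methods):
    """Order documents by the sum of their ranks (lower is better)."""
    per_method = [ranks(m) for m in methods]
    docs = set.intersection(*(set(m) for m in methods))
    # Ties in the rank sum are broken by document id for determinism.
    return sorted(docs, key=lambda d: (sum(r[d] for r in per_method), d))

# Two hypothetical methods scoring the same three candidates on different scales.
method_a = {"d1": 0.9, "d2": 0.5, "d3": 0.4}
method_b = {"d1": 12.0, "d2": 3.0, "d3": 7.0}

points = score_space(method_a, method_b)
fused = rank_fusion(method_a, method_b)
```

Rank-based combination avoids having to calibrate the raw scores against each other, which matters when methods report scores on unrelated scales.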

Page 42: kantorNSF-NIJ-ISI-03-06-04.ppt

Background on Fusion Problem

• There are systems S, T, U, …

• There are problems to be solved P,Q,R…

• This defines several fusion problems

Local fusion: for a given problem P, and a pair of systems S, T, what is the best fusion rule?

Let s(d), t(d) be the scores assigned to document d by systems S and T. Fusion tries to find the “best” combining function f(s,t).

Page 43: kantorNSF-NIJ-ISI-03-06-04.ppt

Non-linear “iso-relevance”

Page 44: kantorNSF-NIJ-ISI-03-06-04.ppt

Local Fusion Rule

• A local fusion rule fP(s,t) depends on the specific problem P.
– This is relevant if P represents a static problem or profile, which will be considered on many occasions

• A global fusion rule f(s,t) does not depend on a specific problem P,
– and can be safely used on a variety of problems.

Page 45: kantorNSF-NIJ-ISI-03-06-04.ppt

Local Fusion Results are Good

• Completely rigorous

For each topic:

• 1) Randomly split the documents into two parts: training and testing

• 2) Do the logistic regression on the training part and get the fusion scores for both training and testing documents

• 3) Calculate p_100 on testing documents.

• 4) Excellent results (one random sample for each)

• 5) Test SMART and InQuery on the same random testing set
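The procedure for one topic can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the generated (s, t) score pairs, the plain gradient-ascent fit, and the cutoff of 20 used in place of 100 because the toy test set is small. It does not reproduce the reported results.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, epochs=500, rate=0.5):
    """Fit weights (bias first) by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        w = [wj + rate * g / len(rows) for wj, g in zip(w, grad)]
    return w

def fusion_score(w, x):
    """Fused relevance estimate f(s, t) for one document."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def precision_at(k, scored):
    """Fraction of relevant documents among the k highest-scored ones."""
    top = sorted(scored, key=lambda p: -p[0])[:k]
    return sum(rel for _, rel in top) / len(top)

# Synthetic (s, t) score pairs for one topic; 1 marks a relevant document.
rng = random.Random(0)
docs = []
for _ in range(400):
    rel = 1 if rng.random() < 0.3 else 0
    s = rng.gauss(0.6 if rel else 0.4, 0.15)
    t = rng.gauss(0.6 if rel else 0.4, 0.15)
    docs.append(((s, t), rel))

# 1) random split, 2) fit on training half, 3) precision on testing half.
rng.shuffle(docs)
train, held_out = docs[:200], docs[200:]
w = fit_logistic([x for x, _ in train], [y for _, y in train])
p_fused = precision_at(20, [(fusion_score(w, x), y) for x, y in held_out])
# 5) score the single systems on the same testing set for comparison.
p_s = precision_at(20, [(x[0], y) for x, y in held_out])
p_t = precision_at(20, [(x[1], y) for x, y in held_out])
```

Comparing `p_fused` with `p_s` and `p_t` on the same held-out half mirrors the win/tie/lose comparison against SMART and InQuery.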

Page 46: kantorNSF-NIJ-ISI-03-06-04.ppt

Summary of Local Fusion

        F & SM   F & IN   F & BEST
win     11       7        5
tie     4        7        9
lose    0        1        1

PROBLEM CASE

Topic   Smart   InQuery   Fusion
318     0       3         0
318     1       5         2
318     1       5         4
318     0       3         1
318     1       4         2

We ran 5 split half runs on the odd case (318) and the results persist.

Page 47: kantorNSF-NIJ-ISI-03-06-04.ppt

Is Local Sensible?

• Local fusion depends on getting information about a particular topic, and doing the best possible fusion.

• Not available in an ad hoc (e.g. Google) setting

• Potentially available in intelligence applications: filtering; standing profile

Page 48: kantorNSF-NIJ-ISI-03-06-04.ppt

Fusion of InQuery and Smart: Topic 450
* Easy case – almost any linear rule works well. Either system works well.

Fusion of InQuery and Smart: Topic 392
* Easy case – SMART works well. InQuery works poorly.

Fusion of InQuery and Smart: Topic 432
* Another hard case – relevant documents not compactly grouped in the score space. Not many relevant documents found at all.

Fusion of InQuery and Smart: Topic 318
* Interesting case – no linear rule works well. Relevant documents embedded. Requires non-linear methods – Quadratic; SVM; other.

Page 49: kantorNSF-NIJ-ISI-03-06-04.ppt

Fusion of InQuery and Smart: Topic 421
* Really challenging case: Quite a few relevant documents. Very diffuse in score space. Neither system works well. Possibly Boolean AND.

Fusion of InQuery and Smart: Topic 359
* A disaster.

Fusion of InQuery and Smart: Topic 374
* Possible Boolean AND. Neither works well alone.

Fusion of InQuery and Smart: Topic 415
* Part of the relevant material is easily found. Part is embedded.

Page 50: kantorNSF-NIJ-ISI-03-06-04.ppt

Our Approach to Retrieval Fusion

Adaptive “Local” Fusion

[Flow diagram. Labels: Request; DOCUMENTS SETS; SMART; InQuery; Result Set (one per system); FUSION PROCESS; Delivered SET; Monitor Fusion Set and Receive Feedback; ADOPT: Fusion System; USE: Better System.]