SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
Amit Sheth
CTO, Voquette*, Inc. Large Scale Distributed Information Systems (LSDIS) Lab
University Of Georgia; http://lsdis.cs.uga.edu
*Now Semagix, http://www.semagix.com
July 15, 2002 © Amit Sheth
Keynote
CONTENT- AND SEMANTIC-BASED INFORMATION RETRIEVAL @ SCI 2002
New Enterprise Content Management Challenges
1. More variety and complexity
   - More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc.)
   - More types (Docs, Images -> Audio, Video, Variety of text: structured, unstructured)
   - More sources (internal, extranet, internet, feeds)
2. Scalability, Information Overload
   - Too much data, precious little information (Relevance)
3. Creating Value from Content
   - How to distribute the right content to the right people as needed? (Personalization -- book of business)
   - Customized delivery for different consumption options (mobile/desktop, devices)
   - Insight, Decision Making (Actionable)
New Enterprise Content Management Technical Challenges
1. Aggregation
   - Feed handlers/Agents that understand content representation and media semantics
   - Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types
2. Homogenization and Enhancement
   - Enterprise-wide common view
   - Domain model, taxonomy/classification, metadata standards
   - Semantic Metadata, created automatically if possible
3. Semantic Applications
   - Search, personalization, directory, alerts, etc. using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization
The Semantic Web -- a vision with several views:
• “The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what data means to process it.” [B99]
• “The semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [BHL01]
• “The Semantic Web is a vision: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.” [W3C01]
Semantics: The Next Step in the Web’s Evolution
Semantics for the Web
On the Semantic Web every resource (people, enterprises, information services, application services, and devices) is augmented with machine-processable descriptions to support finding, reasoning about (e.g., which service is best), and using (e.g., executing or manipulating) the resource. The idea is that self-descriptions of data and other techniques would allow context-understanding programs to selectively find what users want, or allow programs to work on behalf of humans and organizations to make them more efficient and productive.
Central Role of Metadata
- Produce, Aggregate: Where is the content? Whose is it?
- Catalog/Index: What is this content about?
- Integrate, Syndicate: What other content is it related to?
- Personalize: What is the right content for this user?
- Interactive Marketing: What is the best way to monetize this interaction?
Broadcast, Wireline, Wireless, Interactive TV
Semantic Metadata
Applications, Back End
"A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
“Metadata increases content value in each step of content value chain.” - Amit Sheth
A Metadata Classification
Data (Heterogeneous Types/Media)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Content Dependent Metadata (size, max colors, rows, columns...)
Direct Content Based Metadata (inverted lists, document vectors, LSI)
Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Domain Specific Metadata: area, population (Census); land-cover, relief (GIS); metadata concept descriptions from ontologies
Ontologies, Classifications, Domain Models
User
More Semantics for Relevance to tackle Information Overload!!
Semantic Content Organization and Retrieval Engine (SCORE) technology
• Automatically aggregates and extracts information from disparate sources and multiple formats
• Automatically tags/annotates and categorizes content
• Automatically creates relevant associations
  - Maps content topics and their relationships
• Semantic query engine relates information and knowledge both internal and external to the organization into a single view
SCORE Architecture
Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content
Fast main-memory based query engine with APIs and XML output
CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata
Distributed agents that automatically extract/mine knowledge from trusted sources
Toolkit to design and maintain the Knowledgebase
Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel
WorldModel specifies enterprise’s normalized view of information (ontology)
Voquette Enterprise Semantic Platform Product Components
[Architecture diagram; components shown:]
World Model; WM Toolkit
Knowledgebase and Metabase; Main Memory Index; KB Toolkit
Semantic Engine: Search, Alerts, Portals, Directory, Personalize
XML APIs, Web Services
Enterprise Applications (EA)
Enhancement Engine: Entity Extraction, Enhanced Metadata, Automatic Classification, Classification Committee, Domain Experts
Content Agents (CA); Content Agent Monitor; CA Toolkit
Content Sources: Databases, XML/Feeds, Websites, Reports, Documents (Structured, Semi-Structured, Unstructured)
Knowledge Agents (KA); Knowledge Agent Monitor; KA Toolkit
Knowledge Sources (KS)
Market Guide (MG)
ZDNet (ZD)
Hoover’s (H)
Data supplied from NASA (DPL)
Federation of American Scientists (FAS)
Central Intelligence Agency (CIA)
The Interdisciplinary Center (ICT)
Federal Bureau of Investigation (FBI)
Capital Advantage (CA)
Office of Foreign Assets Control (OFAC)
PERSON (OFAC, FBI, DPL)
-politician (OFAC, FBI, CIA, CA)
politician associated with politicalOrganization
politician held politicalOffice
politician associated with politicalOffice
-terrorist (OFAC, FBI, DPL)
terrorist memberOf organization
terrorist appears on watchList
-companyExecutive (MG)
companyExecutive holdsOffice companyPosition
person has permanent address address (OFAC, FBI)
person has dob(date of birth) (OFAC, FBI)
person has pob(place of birth) (OFAC, FBI)
Knowledge Sources Used
THING
-event (ICT)
terroristOrganization participated in terroristSponsoredEvent (ICT)
-politicalOffice (CIA, CA)
politicalOffice office(s) within govtOrganization
politicalOffice associated with organization
-watchList (OFAC, FBI, DPL)
terroristOrganization appears on watchList (OFAC, FBI, DPL)
-organization (OFAC, FBI, FAS, ICT, CA, CIA)
organization appears on watchList
organization memberOf suborganization
-company
company manufactures product (ZD)
company identifiedBy tickerSymbol (H)
companyPosition position in company (MG)
company memberOf industry (H)
-tickerSymbol (H)
tickerSymbol memberOf exchange (H)
PLACE
-organization located in place (H, OFAC)
-religiousAffiliation practiced in place (CIA)
-company headquarters in city (H)
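The entity classes and relationships listed here can be read as a small schema plus the facts that instantiate it. The sketch below is a minimal illustration, assuming a simple triple representation; the names `SCHEMA`, `Entity`, `assert_fact` and `related` are hypothetical and not part of Voquette's product. The relationship names come from the list above, and the sample facts come from the Cisco example used elsewhere in this talk.

```python
from dataclasses import dataclass

# Definitional part: which relationship may link which entity classes.
# Relationship names are copied from the slide; the tuple layout is an assumption.
SCHEMA = {
    ("company", "identifiedBy", "tickerSymbol"),
    ("tickerSymbol", "memberOf", "exchange"),
    ("company", "headquarters in", "city"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    entity_class: str


def assert_fact(kb, subject, relation, obj):
    """Add a fact to the knowledgebase only if the schema allows it."""
    if (subject.entity_class, relation, obj.entity_class) not in SCHEMA:
        raise ValueError(
            f"{relation!r} is not defined between "
            f"{subject.entity_class} and {obj.entity_class}"
        )
    kb.add((subject, relation, obj))


def related(kb, subject, relation):
    """Return the names of all entities linked from `subject` by `relation`."""
    return sorted(o.name for s, r, o in kb if s == subject and r == relation)


cisco = Entity("Cisco Systems", "company")
csco = Entity("CSCO", "tickerSymbol")
nasdaq = Entity("NASDAQ", "exchange")
san_jose = Entity("San Jose", "city")

# Assertional part: the facts themselves.
kb = set()
assert_fact(kb, cisco, "identifiedBy", csco)
assert_fact(kb, csco, "memberOf", nasdaq)
assert_fact(kb, cisco, "headquarters in", san_jose)
```

Keeping the schema separate from the facts means a knowledge agent can only add relationships that the domain model already defines.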
Entity Classes and Relationships populated by these knowledge sources:
JIVA
SCORE Capabilities
• Semantics (understanding of content and user needs)
• Extreme relevance
• Semantic associations
• Near real-time
• Multiple applications/usage patterns (not just search)
• Automation
• Scalability in all aspects
Technologies Involved
• Ontology driven architecture (definitional, assertional components)
• Automatic Classification with classifier committee (multiple technologies, rather than one size fits all)
• Automatic Semantic Metadata Extraction/Annotation
• Semantic associations/ knowledge inferences
• Scalability throughout with distributed architecture and implementation (number of content and knowledge sources, indexing, etc.)
• Main memory implementation, incremental check pointing
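The classifier committee mentioned above can be illustrated with a toy majority vote. The slides do not say how the committee combines its members' outputs, so the voting rule here is an assumption, and the keyword classifiers are stand-ins for whatever technologies the real committee uses. All function names are hypothetical.

```python
from collections import Counter


def keyword_classifier(keywords_by_topic):
    """Build a toy classifier that picks the topic with the most keyword hits."""
    def classify(text):
        words = text.lower().split()
        hits = {
            topic: sum(words.count(k) for k in keywords)
            for topic, keywords in keywords_by_topic.items()
        }
        best = max(hits, key=hits.get)
        # Abstain when no keyword matched at all.
        return best if hits[best] > 0 else None
    return classify


def committee_vote(classifiers, text):
    """Return the topic most members agree on, or None if every member abstains."""
    votes = Counter(v for v in (c(text) for c in classifiers) if v is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


committee = [
    keyword_classifier({"Company News": ["earnings", "revenue"], "Sports": ["playoff", "team"]}),
    keyword_classifier({"Company News": ["quarterly"], "Sports": ["game"]}),
    keyword_classifier({"Company News": ["merger"], "Sports": ["reports"]}),
]

topic = committee_vote(committee, "Cisco reports quarterly earnings and revenue growth")
```

Here two members vote "Company News" and one votes "Sports", so the committee settles on "Company News" even though one member is misled.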
Performance
Population/update rate in a Knowledgebase with 1 million entities/relationships: > 10,000 entities/relationships per hr.
Incremental Index Update Frequency: 1 minute (near real-time)
Query Response Time (64 concurrent users): 65 ms
Query Response Time (light load): 1 - 10 ms
Queries per server per hour: > 1,980,000
Information Extraction for Metadata Creation
Sources: WWW, Enterprise Repositories, Nexis, UPI, AP, Feeds/Documents, Digital Maps, Digital Audios, Data Stores, Digital Videos, Digital Images, ...
METADATA EXTRACTORS
Key challenge: Create/extract as much (semantic) metadata automatically as possible
Video with Editorialized Text on the Web
Automatic Categorization & Metadata Tagging (Web page)
Auto Categorization
Semantic Metadata
Extraction Agent
Web Page
Enhanced Metadata Asset
Content Extraction and Knowledgebase Enhancement
Semantic Metadata
Syntax Metadata
Content Enhancement Workflow

1. Extractor Agent for Bloomberg
   - Scans text for analysis
   - Metadata extracted automatically
   - Creates asset (index) out of extracted metadata

   Asset
   Syntax Metadata
     Producer: BusinessWire
     Source: Bloomberg
     Date: Sept. 10 2001
     Location: San Jose, CA
     URL: http://bloomberg.com/1.htm
     Media: Text
   Semantic Metadata
     Company: Cisco Systems, Inc.

2. Categorization & Auto-Cataloging System (CACS)
   - Scans text for analysis
   - Classifies document into pre-defined category/topic
   - Appends topic metadata to asset

   Semantic Metadata
     Company: Cisco Systems, Inc.
     Topic: Company News

3. Knowledge Base
   - Leverages knowledge to enhance metatagging

   Company: Cisco Systems
     Ticker: CSCO
     Exchange: NASDAQ
     Industry: Telecomm.
     Sector: Computer Hardware
     Executives: John Chambers (CEO of)
     Competition: Nortel Networks (Competes with)
     Headquarters: San Jose

   Enhanced Content Asset (Indexed)
   Syntax Metadata
     Producer: BusinessWire
     Source: Bloomberg
     Date: Sept. 10 2001
     Location: San Jose, CA
     URL: http://bloomberg.com/1.htm
     Media: Text
   Semantic Metadata
     Company: Cisco Systems, Inc.
     Topic: Company News
     Ticker: CSCO
     Exchange: NASDAQ
     Industry: Telecomm.
     Sector: Computer Hardware
     Executive: John Chambers
     Competition: Nortel Networks
     Headquarters: San Jose, CA

XML Feed
Semantic Engine
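The three stages of this workflow (extract, classify, enhance from the knowledge base) can be sketched as a small pipeline. This is only an illustration under simplifying assumptions: the function names, the dictionary layout, the substring match and the one-word topic rule are all hypothetical stand-ins for the extractor agents and CACS, and the sample sentence is invented. The metadata values are the ones shown on the slide.

```python
# Hypothetical in-memory knowledge base holding the Cisco facts from the slide.
KNOWLEDGE_BASE = {
    "Cisco Systems, Inc.": {
        "Ticker": "CSCO",
        "Exchange": "NASDAQ",
        "Industry": "Telecomm.",
        "Sector": "Computer Hardware",
        "Executive": "John Chambers",
        "Competition": "Nortel Networks",
        "Headquarters": "San Jose, CA",
    }
}


def extract(document):
    """Step 1: build an asset from syntax metadata plus any known company in the text."""
    asset = {"syntax": dict(document["syntax"]), "semantic": {}}
    for company in KNOWLEDGE_BASE:
        # Naive match on the short company name; a real extractor is far richer.
        if company.split(",")[0] in document["text"]:
            asset["semantic"]["Company"] = company
    return asset


def classify(asset, text):
    """Step 2: append a topic (a single keyword rule standing in for CACS)."""
    if "announced" in text.lower():
        asset["semantic"]["Topic"] = "Company News"
    return asset


def enhance(asset):
    """Step 3: copy what the knowledge base knows about the company into the asset."""
    company = asset["semantic"].get("Company")
    asset["semantic"].update(KNOWLEDGE_BASE.get(company, {}))
    return asset


document = {
    "syntax": {
        "Producer": "BusinessWire",
        "Source": "Bloomberg",
        "Date": "Sept. 10 2001",
        "Location": "San Jose, CA",
        "URL": "http://bloomberg.com/1.htm",
        "Media": "Text",
    },
    # Invented placeholder text, not taken from any real story.
    "text": "Cisco Systems announced a new product line.",
}

asset = enhance(classify(extract(document), document["text"]))
```

After the last step the asset carries the ticker, exchange, industry, sector, executive, competitor and headquarters, none of which appear in the source text.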
Content Asset Index Evolution
Extractor Agents: Content which does contain the words the user asked for
+ Value-added Metadata: Content which does not contain the words the user asked for, but is about what he asked for.
+ Semantic Associations: Content the user did not think to ask for, but which he needs to know.
Intelligent Content
End-User
Intelligent Content Empowers the User
Example 1 – Snapshots (“Jamal Anderson”)
1. Search for ‘Jamal Anderson’ in ‘Football’.
2. Click on first result for Jamal Anderson.
3. View metadata. Note that Team name and League name are also included in the metadata.
4. View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They are value-additions to the metadata to facilitate easier search.
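The effect of value-added metadata on search can be shown with a toy comparison between matching on source text alone and matching on text plus metadata. The function names, the asset layout and the sample sentences are hypothetical; the Team and League values are illustrative, since the slide only says that such fields were added.

```python
def keyword_search(assets, term):
    """Match only on words that appear in the source text."""
    return [a["id"] for a in assets if term.lower() in a["text"].lower()]


def semantic_search(assets, term):
    """Match on the source text or on any metadata value attached to the asset."""
    term = term.lower()
    hits = []
    for a in assets:
        values = " ".join(a["metadata"].values()).lower()
        if term in a["text"].lower() or term in values:
            hits.append(a["id"])
    return hits


assets = [
    {
        "id": 1,
        # Invented text that names the player but not his team or league.
        "text": "Jamal Anderson rushed for two touchdowns on Sunday.",
        "metadata": {"Player": "Jamal Anderson", "Team": "Atlanta Falcons", "League": "NFL"},
    },
    {
        "id": 2,
        "text": "Quarterly results beat expectations.",
        "metadata": {"Topic": "Company News"},
    },
]

plain = keyword_search(assets, "Atlanta Falcons")
enriched = semantic_search(assets, "Atlanta Falcons")
```

The plain keyword search finds nothing for the team name, while the metadata-aware search returns the story about the player.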
- Focused relevant content organized by topic (semantic categorization)
- Automatic Content Aggregation from multiple content providers and feeds
- Related relevant content not explicitly asked for (semantic associations)
- Competitive research inferred automatically
- Automatic 3rd party content integration
Semantic Application Example – Research Dashboard
Related Stock
News
Semantic Web – Intelligent Content
Industry News
Technology Products
COMPANY
SEC
EPA
Regulations
Competition
COMPANIES in Same or Related INDUSTRY
COMPANIES in INDUSTRY with Competing PRODUCTS
Impacting INDUSTRY or Filed By COMPANY
Important to INDUSTRY or COMPANY
Intelligent Content = What You Asked for + What you need to know!
Syntax Metadata
Semantic Metadata
led by
Same entity
Human-assisted inference
Knowledge-based & Manual Associations
Intelligence Analyst Browsing Scenario
Innovations that affect User Experience
• BSBQ: Blended Semantic Browsing and Querying
– Ability to query and browse relevant desired content in a highly contextual manner
• Seamless access/processing of Content, Metadata and Knowledge
– Ability to retrieve relevant content, view related metadata, access relevant knowledge and switch between all the above, allowing user to follow his train of thought
• dACE: dynamic Automatic Content Enhancement
– Ability to provide enhanced annotation features, allowing the user to retrieve relevant knowledge about significant pieces of content during content consumption
• Semantic Engine APIs with XML output
– Ability to create customized APIs for the Semantic Engine involving Semantic Associations with XML output to cater to any user application
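As an illustration of XML output from a semantic query, the sketch below serializes one enhanced asset with Python's standard `xml.etree.ElementTree`. The slides do not show the Semantic Engine's actual XML format, so the element and attribute names (`asset`, `syntax`, `semantic`, `field`, `name`) are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def asset_to_xml(asset):
    """Serialize an asset's syntax and semantic metadata as an XML string."""
    root = ET.Element("asset")
    for section in ("syntax", "semantic"):
        parent = ET.SubElement(root, section)
        for key, value in asset[section].items():
            field = ET.SubElement(parent, "field", name=key)
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Values taken from the Cisco example in this talk; the layout is hypothetical.
sample_asset = {
    "syntax": {"Source": "Bloomberg", "Media": "Text"},
    "semantic": {"Company": "Cisco Systems, Inc.", "Ticker": "CSCO"},
}

xml_text = asset_to_xml(sample_asset)
```

A client application can parse the string with any XML library and pick out only the fields it needs.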
[Airport security diagram; labels shown:]
Visionics AcSys Security Portal
Check-in, Interrogation, Boarding Gate, Airport, Airspace
Voquette Knowledgebase, Metabase, Threat Scoring
Gov’t Watchlists, News Media, Web Info
LexisNexis RiskWise
Passenger Records, Reservation Data
Airline Data, Airport Data
Airline and Airport Data; Future and Current Risks
Airport LEO
ARC AvSec Manager
Data Management, Data Mining
IPG
Sources Used
Knowledge Sources:
FBI - Most Wanted Terrorists
Denied Persons Lists
Terrorism Files
ICT
Office of Foreign Asset Control (OFAC)
Hamas terrorists
CNN Locations
FAA_Airport_Codes
About.com
Comtex_International
Hindustan Times
JerusalemPost
CNN
Newstrove_Hamas
Content Sources:
Africa News Service
AFX News – Asia/UK/Europe
AP Worldstream
Asia Pulse
BusinessWire
ComputerWire (CTW)
EFE News Services
FWN Select
Itar-TASS
Knight Ridder News (Open)
Knight-Ridder Open
M2 - International
M2 Airline Industry Information
New World Publishing
PR Newswire
PRLine (PRL)
Resource News International
RosBusiness
United Press International
UPI Spotlights
Voquette’s Semantic Technology enables flight authorities to:
- take a quick look at the passenger’s history
- check quickly if the passenger is on any official watchlist
- interpret and understand passenger’s links to other organizations (possibly terrorist)
- verify if the passenger has boarded the flight from a “high risk” region
- verify if the passenger originally belongs to a “high risk” region
- check if the passenger’s name has been mentioned in any news article along with the name of a known bad guy
Interrogation Kiosk – Unique Advantages of Voquette
SmithJohn
Threat Score Components
WATCHLIST ANALYSIS
Action: Voquette’s rich knowledgebase is automatically searched for the possible appearance of this name on any of the watchlists
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge and automatically co-relate it and rank the threat factors to indicate threat level of the passenger on the watchlist front
METABASE SEARCH
Action: Voquette’s rich metabase is searched for this name and associated content stories mentioning the passenger’s name are retrieved
Ability Proven: Ability to automatically aggregate and retrieve relevant content stories, field reports, etc. about the passenger that can be used by flight officials to determine if the passenger has any connections with known bad people or organizations
appearsOn watchList:
FBI
KNOWLEDGEBASE SEARCH
Action: Voquette’s rich knowledgebase is searched for this name and associated information like position, aliases, relationships (past or present) of this name to other organizations, watchlists, country, etc. are retrieved
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge about a passenger and automatically co-relate it with other data in the knowledgebase to present a visual association picture to the flight official
LEXIS NEXIS ANNOTATION
Action: Information about or related to the passenger returned by Lexis Nexis is enhanced by linking important entities to Voquette’s rich knowledgebase
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text and further automatically co-relate it with other data in the knowledgebase to present a clear picture about the passenger to the flight official
Flight Country Check 45 0.15
Person Country Check 25 0.15
Nested Organizations Check 75 0.8
Aggregate Link Analysis Score: 17.7
LINK ANALYSIS
Action: Semantic analysis of the various components (watchlist, Lexis Nexis, knowledgebase search, metabase search, etc.) to come up with an aggregate threat score for the passenger
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text, automatically co-relate it with other data in the knowledgebase, and search for relevant content to present an overall idea of the threat level of the passenger, allowing the flight official to take quick action
What it will take RDBMS to support flight security application
Link Analysis Component | # Queries (Voquette) | # Queries (RDBMS) | Time (Voquette) | Time (RDBMS)

Direct Watchlist Match (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Organization Watchlist Match (person name, organization name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Nested Organization Watchlist Match (person name, organization name)
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organization's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Flight Origin (country name)
  retrieve country entity | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  see if country is on a list containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Person Origin (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's home country | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organization's relationships to lists containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Field Report Search (person name)
  perform SSE query for field reports that mention this person | 1 SSE Request | 2 SQL Queries | .03 sec | 5-30 sec
  retrieve a list of people associated with these field reports | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  determine which people are on watchlists, terrorists, etc… | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec

Total | 18 requests | 39-64 SQL Queries | .33 sec | 30-80 sec.
Query Comparison: Voquette vs. RDBMS
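The direct, organization and nested-organization watchlist matches in this comparison amount to walking relationship links outward from the passenger. The sketch below shows that idea as a breadth-first traversal over an in-memory graph. It is an assumption about the general technique, not Voquette's implementation; the names `MEMBER_OF`, `APPEARS_ON` and `watchlist_matches`, and the sample people and organizations, are made up.

```python
from collections import deque

# Hypothetical relationship data: entity -> organizations it is a member of.
MEMBER_OF = {
    "John Smith": ["Org A"],
    "Org A": ["Org B"],
    "Org B": [],
}

# Hypothetical relationship data: entity -> watchlists it appears on.
APPEARS_ON = {
    "Org B": ["FBI"],
}


def watchlist_matches(name, max_depth=2):
    """Breadth-first walk over memberOf links, returning (entity, watchlist, depth).

    Depth 0 is a direct match on the person, depth 1 a match through an
    organization, and depth 2 a match through a nested organization.
    """
    matches = []
    seen = {name}
    queue = deque([(name, 0)])
    while queue:
        entity, depth = queue.popleft()
        for watchlist in APPEARS_ON.get(entity, []):
            matches.append((entity, watchlist, depth))
        if depth < max_depth:
            for org in MEMBER_OF.get(entity, []):
                if org not in seen:
                    seen.add(org)
                    queue.append((org, depth + 1))
    return matches


nested = watchlist_matches("John Smith")
shallow = watchlist_matches("John Smith", max_depth=1)
```

With the relationships held in main memory, each extra hop is a dictionary lookup, so checking one more level of nesting adds very little work.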
JIVA Semantic Console Start-up Interface
The mission of the JIVA project is to gather and analyze as much information of diverse kinds as possible about suspected individuals, terrorist and other groups, organizations, events, etc. For this Terrorism domain, the JIVA Semantic Console provides an information retrieval interface (shown below) that displays some fundamental semantic attributes (based on a corresponding Terrorism domain model) to enable information retrieval in the right context.
- Most fundamental semantic attributes specific to the Terrorism domain (fully customizable)
- Syntactic or domain-independent attributes for general and media-specific search
- Analyst can enter search values in the appropriate attribute fields (to search in the right context)
- Analyst can choose the type of media of the desired content
- Once all other values are set, click the “Search” button to search semantically
- Search interface with more search features (explained later)
JIVA Functionality Interface
“Complete Picture” View – Knowledgebase Results
This section of the ‘Complete Picture’ shows factually known real-world information about the entity (person, organization,
event, etc.) of interest along with its contextual classification(s) and relationships with other entities in the Knowledgebase,
to provide a comprehensive overview of the entity.
Such knowledge is kept up-to-date by means of automated knowledge extractor agents that aggregate such knowledge
about millions of entities from various trusted knowledge sources.
- Entity’s canonical name
- Entity’s classifications in taxonomy
- Entity’s aliases and other names
- Entity’s real-world relationships to various other entities across multiple entity classes (as defined in the Terrorism domain model)
- Individual related entities are clickable to navigate to a new knowledge page for that entity, e.g. Al Qaeda
Knowledgebase Navigation
While browsing through relevant knowledge, analyst can search for content on the focal entity or any of the related entities.
The analyst can also search for specific relationships between two or more entities by checking corresponding entity boxes for search.
Blended Semantic Browsing & Querying (BSBQ)
Fraud investigation of focal entity, placing it in one of five levels of threats, based on score
JIVA
Facilitating Knowledge Discovery
On clicking any bin Laden-related entity (e.g. Al Qaeda), a page is
displayed to the analyst showing knowledge pertaining to that
entity, which can be used in a BSBQ mode, as described on the
previous screen.
Continuing this integrated approach of Semantic Browsing and
Querying, the analyst has the necessary ammunition to perform
Knowledge Discovery. The analyst can follow his train of thought
as he browses and queries to possibly discover unexpected
relationships and links between entities at various levels in an
indirect manner. Automatically uncovering such hidden related
entities facilitates addition of new and meaningful entities and
relationships to the analyst’s assessment tasks.
JIVA
Wireless Application of Semantic Metadata and Automatic Content Enrichment
MyStocks
News
Sports
Music
MyMedia
$
My Stocks
CSCO
NT
IBM
Market
CSCO
Analyst Call
Conf Call
Earnings
11/08 ON24 Payne
11/07 ON24 H&Q 11/06 CBS Langlesis
CSCO Analysis
Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icons denote additional metadata, such as “Strong Buy” by H&Q Analyst.
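Semantic filtering for a small screen can be sketched as choosing, per content type, which metadata fields to show and then fitting the line to a character budget. The field selection table, the function names and the 20-character limit are assumptions for illustration; the three sample entries reuse the dates, sources and analyst names from the CSCO Analyst Call listing above.

```python
# Hypothetical mapping from content type to the fields worth showing on a small screen.
DISPLAY_FIELDS = {
    "Analyst Call": ["date", "source", "analyst"],
    "Earnings": ["date", "source"],
}


def render_line(asset, max_chars=20):
    """Build one listing line from only the fields chosen for this content type."""
    fields = DISPLAY_FIELDS.get(asset["type"], ["date"])
    line = " ".join(asset[f] for f in fields if f in asset)
    return line[:max_chars]


def listing(assets):
    """Sort newest first (dates are MM/DD strings within one year) and render."""
    ordered = sorted(assets, key=lambda a: a["date"], reverse=True)
    return [render_line(a) for a in ordered]


calls = [
    {"type": "Analyst Call", "date": "11/06", "source": "CBS", "analyst": "Langlesis"},
    {"type": "Analyst Call", "date": "11/08", "source": "ON24", "analyst": "Payne"},
    # The rating is extra metadata that the listing leaves to an icon.
    {"type": "Analyst Call", "date": "11/07", "source": "ON24", "analyst": "H&Q",
     "rating": "Strong Buy"},
]

lines = listing(calls)
```

The rating stays in the asset but is left out of the text line, which matches the idea of showing it through an icon instead.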
SceneDescriptionTree
Retrieve Scene Description Track
“NSF Playoff”
Node
Enhanced XML
Description
MPEG-2/4/7
Enhanced Digital Cable
Video
MPEGEncoder
MPEGDecoder
Node = AVO Object
Voquette/Taalee Semantic Engine
“NSF Playoff”
Produced by: Fox Sports
Creation Date: 12/05/2000
League: NFL
Teams: Seattle Seahawks, Atlanta Falcons
Players: John Kitna
Coaches: Mike Holmgren, Dan Reeves
Location: Atlanta
Object Content Information (OCI)
Metadata-richValue-added Node
Create Scene Description Tree
GREATUSER
EXPERIENCE
Metadata’s role in emerging iTV infrastructure
Channel sales through Video Server Vendors, Video App Servers, and Broadcasters
License metadata decoder and semantic applications to device makers
Metadata for Automatic Content Enrichment
Interactive Television
This segment has embedded or referenced metadata that is used by personalization application to show only the stocks that user is interested in.
This screen is customizable with interactivity feature using metadata such as whether there is a new Conference Call video on CSCO.
Part of the screen can be automatically customized to show conference call specific information, including transcript, participation, etc., all of which are relevant metadata.
Conference Call itself can have embedded metadata to support personalization and interactivity.
Future
• Multimodal interfaces
• Multimodal semantics
• Multivalent Semantics
Metadata Usage: Keyword, Attribute and Content Based Access
The VisualHarness system at LSDIS/UGA