calais @ the palo alto semantic web meetup

Calais PAWS Sep 4, 2008

Upload: krista-thomas

Post on 22-Nov-2014


DESCRIPTION

An overview of the Thomson Reuters Calais Initiative, given by Tom Tague at the Palo Alto Semantic Web Meetup in San Francisco, CA in August

TRANSCRIPT

Page 1: Calais @ the Palo Alto Semantic Web Meetup

Calais PAWS

Sep 4, 2008

Page 2: Calais @ the Palo Alto Semantic Web Meetup

Calais?

Page 3: Calais @ the Palo Alto Semantic Web Meetup

ClearForest

• Founded in 1998 by text analytics pioneers

• A software organization that enables Intelligent Information

• Enterprise and government customers

• Led the market in the establishment of unstructured text as a key corporate asset

• Acquired by Reuters June 2007

• Offices: Boston, Israel

Page 4: Calais @ the Palo Alto Semantic Web Meetup

The Text Problem

• People consume text

• Most of it isn’t semantically enabled

• Most of it won’t be semantically enabled

• Why: Latency, cost and short shelf-life

Page 5: Calais @ the Palo Alto Semantic Web Meetup

Calais’ Piece of the Puzzle

• A semantic metadata generation service that extracts entities, facts and events from unstructured text

• Two new capabilities: topics & relevance

• Available for commercial or non-commercial use up to 40,000 times per day

[Diagram: Unstructured Documents (Text / HTML / XML) are processed by Calais into three kinds of output]

Named Entities: People, Companies, Geographies, Albums, Authors, etc.

Facts: Position, Alliance, Education, Political Affiliation, etc.

Events: Management Change, IPO, Labor Action, Sporting, Entertainment, etc.

Page 6: Calais @ the Palo Alto Semantic Web Meetup

<Topic>M&A</Topic>

<Acquisition offset="494" length="130">
  <Company_Acquirer>Reuters</Company_Acquirer>
  <Company_Acquired>ClearForest Ltd.</Company_Acquired>
  <Status>Planned</Status>
</Acquisition>

<Company>Reuters</Company>

<Company>ClearForest Ltd.</Company>

<Product>Text Analytic Solution </Product>

<Company>ClearForest Ltd.</Company>

<Company>Reuters</Company>

<Country>United States</Country>

<Country>Israel</Country>

<Company>Reuters</Company>

<Person>Gerry Campbell</Person>

<ManagementChange offset="2789" length="92">
  <Person>Gerry Campbell</Person>
  <Company>Reuters</Company>
  <Action>Enters</Action>
</ManagementChange>

Reuters Announced the Acquisition of ClearForest

New York - April 30, 2007

Reuters, the global information company, has entered into an agreement to acquire all of the outstanding shares of ClearForest Ltd., a privately held provider of Text Analytics solutions, whose tagging platform and analytical products allow clients to derive precise business information from huge amounts of textual content.

ClearForest has received sufficient shareholder approval to complete the transaction, which is expected to close in approximately 30 days, subject to customary closing conditions. The financial terms were not disclosed. Reuters plans to retain and continue to work with the existing management team and their highly skilled workforces in the US and Israel. It also plans to continue to support existing products and customers.

Reuters believes that search will be a pivotal element to the future of how financial information is sourced and consumed. As part of its drive into this space, Reuters has created a new strategic group and appointed Gerry Campbell, who will oversee the integration of ClearForest and drive this innovation.
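The simplified markup on this slide can be read with any XML parser. The following is a minimal sketch, not the actual Calais response format: it assumes the tags are well formed, wraps them in a hypothetical `<Document>` root so they parse as one tree, and escapes the ampersand in "M&A". The function name `summarize_events` is an illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified markup copied from the slide, wrapped in a hypothetical <Document>
# root so it parses as a single XML tree. "M&A" is escaped as "M&amp;A".
SAMPLE = """
<Document>
  <Topic>M&amp;A</Topic>
  <Acquisition offset="494" length="130">
    <Company_Acquirer>Reuters</Company_Acquirer>
    <Company_Acquired>ClearForest Ltd.</Company_Acquired>
    <Status>Planned</Status>
  </Acquisition>
  <Person>Gerry Campbell</Person>
  <ManagementChange offset="2789" length="92">
    <Person>Gerry Campbell</Person>
    <Company>Reuters</Company>
    <Action>Enters</Action>
  </ManagementChange>
</Document>
"""


def summarize_events(xml_text):
    """Return one dict per element that carries offset/length attributes."""
    root = ET.fromstring(xml_text)
    events = []
    for element in root:
        if "offset" not in element.attrib:
            continue  # plain entity or topic tag, not an event
        events.append({
            "type": element.tag,
            "offset": int(element.attrib["offset"]),
            "length": int(element.attrib["length"]),
            "fields": {child.tag: child.text.strip() for child in element},
        })
    return events


if __name__ == "__main__":
    for event in summarize_events(SAMPLE):
        print(event["type"], event["offset"], event["length"], event["fields"])
```

The `offset` and `length` attributes presumably locate the supporting passage in the source text, so a consumer holding the original document could slice it as `text[offset:offset + length]`.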

Page 7: Calais @ the Palo Alto Semantic Web Meetup

What’s Behind an Event … An Example

Digital Marketing Services, Inc. (DMS), the leading provider of online marketing research and a division of America Online Inc. (AOL), today announced an alliance with Netcentives Inc. (Nasdaq: NCNT)

Extracted instances:

Company = Digital Marketing Services, Inc.

Company = Netcentives Inc.

Status = announced

DateString = today

Date = 2000-01-31
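The last two instances show a relative date string ("today") being normalized to a calendar date. One plausible way to do this, sketched below, is to anchor the word to the document's publication date. The lookup table and the function `resolve_date_string` are hypothetical, and the document date is assumed to be 2000-01-31 to match the slide; the slide does not say how Calais performs this step.

```python
from datetime import date, timedelta

# Hypothetical lookup of relative date words; the slide only shows "today".
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_date_string(date_string, document_date):
    """Map a relative date word to a calendar date anchored at the document date.

    Returns None when the word is not in the lookup table.
    """
    shift = RELATIVE_DAYS.get(date_string.strip().lower())
    if shift is None:
        return None
    return document_date + timedelta(days=shift)


if __name__ == "__main__":
    # Document date assumed to be 2000-01-31, matching the Date value on the slide.
    print(resolve_date_string("today", date(2000, 1, 31)).isoformat())
```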

Page 8: Calais @ the Palo Alto Semantic Web Meetup

Live Example

Viewer Demo

Gnosis Demo

Page 9: Calais @ the Palo Alto Semantic Web Meetup

Extending Calais’ Reach

More than just a web service – a growing collection of tools and applications to make it valuable in the real world

Calais

Browser Extensions

Gnosis

Content Management Tools

WordPress

Drupal

UIMA

Development Tools & Libraries

PHP

Ruby

JAVA

.NET

Applications

And more…

TopBraid

RSS Tagger

Powerhouse

LinkedFacts

Wirecatch

FeedShaver

Page 10: Calais @ the Palo Alto Semantic Web Meetup

How Calais is Being Used Today

• Gist: automatically aggregates multiple news sources and slots them into topics, etc.

Page 11: Calais @ the Palo Alto Semantic Web Meetup

The Stack

[Diagram: ClearForest Tags Platform. Labels: External Content / Live Feed / Enterprise Content; File Based Connector; Programmatic API (SOAP Web Service); RDBMS Connector; Web Crawlers (Agents); ClearForest Extraction Modules; ClearForest Categorizer; RichXML; Live Feed; Console; Tooling (Modeler, Developer, Cat Manager)]

Page 12: Calais @ the Palo Alto Semantic Web Meetup

Detailed Stack

[Diagram: ClearForest Tags Platform, detailed view. Labels: Files; Enterprise System; External Feed; Web Agents; File Based API; Programmatic API (SOAP Web Service); RDBMS based API; Tags API; Control API; Control; DB; Document Conversion and Normalization; Language ID; Language Classifier; Categorizer; Classifier; Categorization Manager; Templates; Semantic Tagging; Extraction Modules; Headline Generation; RichXML; ClearForest Studio; ClearForest Developer/Modeler; Languages Configuration; Key Concepts Configuration; Configuration & Monitoring; Console; Farm Manager]

Page 13: Calais @ the Palo Alto Semantic Web Meetup

Platform Highlights

• Single run-time platform for all technologies

• Modular architecture

• Additional functional plug-in can be added anywhere

• Web services interfaces

• SOA ready

• Java based

• Programmatic API to all components

• Farming support for scalability

• Best practices/standards (XML, Unicode, Architectural Patterns, Design patterns …)

Page 14: Calais @ the Palo Alto Semantic Web Meetup

[Diagram (untitled slide). Labels: File API; Programmatic API (SOAP Web Service); RDBMS based API; Web; Custom; File/Web/DB based API (Document Provider); Document Injector (flight plan); Profile; Listener; Queues: CPU Bound; IO Bound; Document Conversion; Conversion & Normalization (PDF Conv., XML Conv., Doc Conv.); Language identification; Document Tagging (Doc Runner); Categorization; Information extraction; Other (Headline Generation); Tags Pipeline; KB Writer; DB Writer; XML Writer; RichXML; ANSCollection; DB; Control; Control API; Console; Technology]
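The labels on this slide (queues, listeners, conversion, tagging, writers) suggest a staged pipeline in which each stage reads from one queue and writes to the next. The sketch below illustrates that general pattern only; it is not ClearForest code. The stage functions `normalize` and `tag` are toy stand-ins, and all names are assumptions made for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(work, inbox, outbox):
    """Pull items from inbox, apply work, push results to outbox until STOP."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            return
        outbox.put(work(item))


def normalize(raw):
    # Stand-in for document conversion and normalization: collapse whitespace.
    return " ".join(raw.split())


def tag(text):
    # Stand-in for tagging: treat capitalized words as candidate tags.
    tags = [word.strip(".,") for word in text.split() if word[:1].isupper()]
    return {"text": text, "tags": tags}


def run_pipeline(documents):
    """Feed documents through normalize -> tag and collect the results in order."""
    to_normalize, to_tag, to_write = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=run_stage, args=(normalize, to_normalize, to_tag)),
        threading.Thread(target=run_stage, args=(tag, to_tag, to_write)),
    ]
    for stage in stages:
        stage.start()
    for document in documents:
        to_normalize.put(document)
    to_normalize.put(STOP)

    results = []
    while True:
        item = to_write.get()
        if item is STOP:
            break
        results.append(item)
    for stage in stages:
        stage.join()
    return results


if __name__ == "__main__":
    for result in run_pipeline(["Reuters   acquires\nClearForest Ltd."]):
        print(result)
```

Separating stages with queues lets CPU-bound work (tagging) and IO-bound work (reading and writing) be scaled independently, which may be why the slide distinguishes the two kinds of queue.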

Page 15: Calais @ the Palo Alto Semantic Web Meetup

The NLP Stack

[Layers, top to bottom]

Events & Facts

Entities: Candidates, Resolution, Normalization

Basic NLP: Noun Groups, Verb Groups, Numbers Phrases, Abbreviations

Metadata Analysis: Title, Date, Body, Paragraph

Sentence Marking

Morphological Analyzer: POS Tagging (per word); Stem, Tense, Aspect, Singular/Plural, Gender, Prefix/Suffix Separation

Tokenization
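To make the layering concrete, here is a toy sketch of three of these layers (tokenization, sentence marking, entity candidates) chained bottom-up. The rules are deliberately naive assumptions for illustration and say nothing about the actual ClearForest algorithms; in particular, the candidate step just collects runs of capitalized tokens and leaves filtering to a later resolution layer.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def mark_sentences(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def entity_candidates(sentence):
    """Collect runs of consecutive capitalized tokens as entity candidates.

    Sentence-initial words such as pronouns are kept here on purpose; a later
    resolution layer would be expected to discard them.
    """
    candidates, run = [], []
    for token in sentence:
        if token[0].isupper():
            run.append(token)
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates


if __name__ == "__main__":
    text = "Reuters has appointed Gerry Campbell. He will oversee ClearForest."
    for sentence in mark_sentences(tokenize(text)):
        print(entity_candidates(sentence))
```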

Page 16: Calais @ the Palo Alto Semantic Web Meetup

Calais, Semantics and the Semantic Web

• Issues, Opportunities

– Ontologies

  • How do we make this a community effort?

– Dereferenceable URIs & Endpoints

  • Engineering

  • Population

    – Basic data

    – Links

    – Proprietary data sources

    – Functions? Code?
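A dereferenceable URI is one that a client can fetch over HTTP to obtain data about the thing it names, typically choosing the format through the `Accept` header. The sketch below only builds such a request without sending it. The URI is a placeholder on example.org, not a real Calais endpoint, and `build_rdf_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_rdf_request(uri):
    """Build (but do not send) an HTTP GET that asks for RDF/XML via content negotiation."""
    return Request(uri, headers={"Accept": "application/rdf+xml"}, method="GET")


if __name__ == "__main__":
    # Hypothetical URI, used only to show the request shape.
    request = build_rdf_request("http://example.org/id/company/reuters")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup against a server that supports this kind of content negotiation.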

Page 17: Calais @ the Palo Alto Semantic Web Meetup

What’s in the Pipeline?

• 2008

– The basics of dereferenceable URIs

– Disambiguation – company & geography

– Hooks

• 2009 (this is a fuzzy list)

– Person disambiguation (social networks?)

– Other disambiguation

– Continued population of endpoints

– Calais as hub

– Exposure of the IDE

– User managed lexicons

– Lots and lots of hooks

Page 18: Calais @ the Palo Alto Semantic Web Meetup

• www.opencalais.com

– Gallery – code and application examples

– Forums

– Documentation