
Page 1: Apache UIMA Introduction

Apache UIMA Introduction

Gestione delle Informazioni su Web (Web Information Management) - 2010/2011

Tommaso Teofili

tommaso [at] apache [dot] org

Page 2: Apache UIMA Introduction

UIM?

Unstructured Information Management

A wide topic: text, audio, video

Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)

Apache UIMA

Page 3: Apache UIMA Introduction

Apache Software Foundation

Non-profit corporation

“...provides organizational, legal, and financial support for a broad range of open source software projects...”

“...collaborative and meritocratic development process...”

“...pragmatic Apache License...”

Page 4: Apache UIMA Introduction

Apache UIMA

Architectural framework to manage unstructured data (Java, C++, ...)

Former IBM research project donated to ASF

OASIS Standard for unstructured information management

Page 5: Apache UIMA Introduction

Apache UIMA - Goals

“Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”

Page 6: Apache UIMA Introduction

Apache UIMA - bridging worlds

Page 7: Apache UIMA Introduction

Apache UIMA - Overview

UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies

Page 8: Apache UIMA Introduction

Apache UIMA - Multimodal Analysis

Multimodal analysis means the ability to process a resource from several “points of view”

Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved

Here, though, we are mainly interested in text

Page 9: Apache UIMA Introduction

Sample scenario

Content Management System containing free text articles about movies

We want such articles to be automatically enriched with metadata found in the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (e.g., articles dealing with the same movies or actors)

So that we can search for “similar” articles

Page 10: Apache UIMA Introduction

Sample scenario - articles about movies

Page 11: Apache UIMA Introduction

Sample scenario

UIMA can help enrich articles with metadata

Think of filling the instance variables of an Article.java class with the proper values

Then persist it to a database, so that articles dealing with the same actors can be queried
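A minimal sketch of what such an Article class might look like. The slides do not show the actual class, so the field names below are assumptions derived from the metadata listed in the scenario (movies, directors, actors, distribution), and the similarity check is only one possible reading of “similar”.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object: the free text plus the metadata we hope to extract.
class Article {
    final String text;                                   // free text of the article
    final List<String> movies = new ArrayList<>();
    final List<String> directors = new ArrayList<>();
    final List<String> actors = new ArrayList<>();
    String distribution;                                 // stays null if nothing is found

    Article(String text) {
        this.text = text;
    }

    // Assumption: two articles count as "similar" if they share at least one actor.
    boolean sharesActorWith(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) {
                return true;
            }
        }
        return false;
    }
}
```

Once the lists are filled by the analysis, the same comparison could be expressed as a database query instead of an in-memory loop.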

Page 12: Apache UIMA Introduction

Filling Article with metadata

Page 13: Apache UIMA Introduction

Sample scenario - metadata

Page 14: Apache UIMA Introduction

UIMA - Annotations

Page 15: Apache UIMA Introduction

Apache UIMA - Annotation

The association of metadata, such as a label, with a region of text (or another type of artifact).

For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
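The definition above can be sketched in plain Java as a label plus two character offsets. This is an illustration only, not the UIMA annotation class; the names LabeledSpan and annotateFirst are hypothetical.

```java
// Hypothetical minimal annotation: a label attached to the region [begin, end) of a text.
class LabeledSpan {
    final String label;
    final int begin;   // offset of the first covered character (X)
    final int end;     // offset just past the last covered character (Y)

    LabeledSpan(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    // Annotates the first occurrence of a phrase, or returns null if it is absent.
    static LabeledSpan annotateFirst(String document, String phrase, String label) {
        int begin = document.indexOf(phrase);
        return begin < 0 ? null : new LabeledSpan(label, begin, begin + phrase.length());
    }
}
```

Storing offsets rather than a copy of the text keeps the annotation tied to its position, so several annotations can overlap on the same region.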

Page 16: Apache UIMA Introduction

Apache UIMA - Basic Steps

Domain model definition

Analysis pipeline definition

Arrange components:

Define components that read data from sources

Add and customize analysis components: patterns, dictionaries, RegEx, external services, NLP, etc.

Define components that write information to target storage

Analysis pipeline(s) execution

Page 17: Apache UIMA Introduction

Defining domain model within UIMA using Type Systems

The Type System is where we describe the metadata we would like to extract

Low representational gap

Like almost everything in UIMA: described (and generated!) using XML

Possible to define multiple Type Systems for different purposes

Page 18: Apache UIMA Introduction

How does UIMA extract metadata?

Page 19: Apache UIMA Introduction

Apache UIMA - Analysis Engines

Basic UIMA building blocks

Analyze a document

Infer and record descriptive attributes (about documents/regions)

Generating analysis results

Page 20: Apache UIMA Introduction

Apache UIMA - AEs

Analysis Engines are described by a descriptor (XML)

Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)

Analysis algorithms can be switched by changing the descriptor instead of the code

Contain Type System definitions

Define Capabilities

Page 21: Apache UIMA Introduction

Apache UIMA - AnalysisComponent API

initialize : Performs (once) any startup tasks required by this component

process : Processes the resource to analyze, generating analysis results (metadata)

destroy : Frees all resources held by this component; called only once, when the framework has finished using it
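The three-method lifecycle can be illustrated with a plain-Java stand-in. MiniComponent below is a hypothetical interface written for this sketch, not the UIMA AnalysisComponent interface (whose methods take framework-specific arguments); it only shows the calling order a framework would follow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the component contract (not the UIMA interface).
interface MiniComponent {
    void initialize();              // one-time startup work
    void process(String document);  // called once per document
    void destroy();                 // one-time cleanup
}

// Records every lifecycle call so that the order becomes visible.
class LoggingComponent implements MiniComponent {
    final List<String> calls = new ArrayList<>();

    public void initialize() { calls.add("initialize"); }
    public void process(String document) { calls.add("process:" + document); }
    public void destroy() { calls.add("destroy"); }
}

class LifecycleDemo {
    // Drives a component the way a framework would:
    // initialize once, process many documents, destroy once.
    static void run(MiniComponent component, List<String> documents) {
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document);
            }
        } finally {
            component.destroy();   // release resources even if processing fails
        }
    }
}
```

Expensive setup such as loading a dictionary belongs in initialize, so that process stays cheap for every document.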

Page 22: Apache UIMA Introduction

Apache UIMA - Annotators

Analysis Engine algorithm

Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)

Annotators implement AnalysisComponent interface

Page 23: Apache UIMA Introduction

Apache UIMA - Roles

AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent

AnalysisComponent : interface for any component responsible for analyzing artifacts

Annotator : implementation of AnalysisComponent responsible for creating Annotations

Page 24: Apache UIMA Introduction

Apache UIMA - AEs

Page 25: Apache UIMA Introduction

Analysis Engines in a Pipeline

Page 26: Apache UIMA Introduction

Apache UIMA - Analysis Results

Where do analysis results end up?

How do annotators represent and share their results?

CAS - Common Analysis Structure

Maintains typed indexes of extracted results
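The idea of a shared structure with typed indexes can be sketched as follows. MiniCas is a hypothetical, heavily simplified stand-in and not the UIMA CAS API: it holds the document text and one list of spans per type name, and the second step reads what the first step wrote.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a CAS: the text plus results indexed by type name.
class MiniCas {
    final String text;
    private final Map<String, List<int[]>> index = new LinkedHashMap<>();

    MiniCas(String text) { this.text = text; }

    // Records a result of the given type covering text[begin, end).
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, k -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Returns the spans recorded for a type (empty if none).
    List<int[]> spans(String type) {
        return index.getOrDefault(type, Collections.<int[]>emptyList());
    }

    // Returns the covered text of every result of the given type.
    List<String> covered(String type) {
        List<String> out = new ArrayList<>();
        for (int[] span : spans(type)) {
            out.add(text.substring(span[0], span[1]));
        }
        return out;
    }
}

class MiniPipeline {
    // Step 1: records every whitespace-separated chunk as a "Token".
    static void tokenize(MiniCas cas) {
        int i = 0;
        int n = cas.text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(cas.text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(cas.text.charAt(i))) i++;
            if (i > start) cas.add("Token", start, i);
        }
    }

    // Step 2: reads the Tokens written by step 1 and records the capitalized ones.
    static void markCapitalized(MiniCas cas) {
        for (int[] span : cas.spans("Token")) {
            if (Character.isUpperCase(cas.text.charAt(span[0]))) {
                cas.add("Capitalized", span[0], span[1]);
            }
        }
    }
}
```

Neither step calls the other; they communicate only through the shared structure, which is what lets steps be rearranged into different pipelines.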

Page 27: Apache UIMA Introduction

Common Analysis Structure

Page 28: Apache UIMA Introduction

Which algorithms lie beneath AEs?

Page 29: Apache UIMA Introduction

Apache UIMA & NLP

NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications

It’s an AI discipline

Page 30: Apache UIMA Introduction

Apache UIMA & NLP

“accomplish human-like language processing”

Paraphrase an input text

Translate the text into another language

Answer questions about the contents of the text

Draw inferences from the text

Page 31: Apache UIMA Introduction

Apache UIMA & NLP

“an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”

various levels of processing

Page 32: Apache UIMA Introduction

Apache UIMA - Approaches

Simplest : Write RegEx and Dictionaries and mix them together

NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entity Recognition -> Coreference Identification ...
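As a rough illustration of the “simplest” approach, the sketch below mixes one regular expression with one dictionary. The year pattern, the dictionary entries and the class name SimpleExtractors are all assumptions made for this example; a real dictionary would be loaded from a resource.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleExtractors {
    // Regex rule: a standalone four-digit year between 1900 and 2099.
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // Tiny illustrative dictionary of director names (hypothetical content).
    private static final Set<String> DIRECTORS =
            new HashSet<>(Arrays.asList("Robert Zemeckis", "Steven Spielberg"));

    // Returns every year matched by the regex, in order of appearance.
    static List<String> years(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = YEAR.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns every dictionary entry that occurs verbatim in the text.
    static List<String> directors(String text) {
        List<String> found = new ArrayList<>();
        for (String name : DIRECTORS) {
            if (text.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }
}
```

This works well for closed vocabularies and rigid formats, and poorly for anything the dictionary or pattern did not anticipate, which is where the NLP-like approach comes in.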

Page 33: Apache UIMA Introduction

Analysis Engines in a Pipeline

Page 34: Apache UIMA Introduction

NLP - Language Identification

NLP takes advantage of language specific syntax, forms, rules and meanings

Not easy to write language independent extraction algorithms

Often this is the first block of NLP pipelines

Techniques: Stopwords dictionaries, statistical models, etc.
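A toy version of the stopword-dictionary technique: count how many words of the text appear in each language's stopword list and pick the language with the most hits. The two short word lists and the class name are assumptions for illustration; real lists are much longer, and statistical models are usually more robust.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordLanguageGuesser {
    // Illustrative stopword lists (hypothetical, far from complete).
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "in", "to")));
        STOPWORDS.put("it", new HashSet<>(Arrays.asList("il", "la", "e", "di", "che", "un")));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    static String guess(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The result can then select which language-specific tokenizer, tagger and dictionaries the rest of the pipeline should use.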

Page 35: Apache UIMA Introduction

NLP - Tokens and Sentences

Humans learn the meaning of words in order to understand the semantics of the whole context

Split the target text into words to be able to analyze their meaning and role

Discover sentences to later assign roles to each token

Easy for English, Italian & co., but what about Chinese?
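A deliberately naive sketch of both steps using regular expressions. The splitting rules are assumptions chosen for brevity: a sentence ends at '.', '!' or '?' followed by whitespace, and a token is a run of letters or digits. Real tokenizers handle abbreviations, and languages written without spaces need a different strategy altogether.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSegmenter {
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Splits after '.', '!' or '?' when followed by whitespace.
    // Naive on purpose: an abbreviation such as "Dr. Brown" would be split too.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // A token is a maximal run of letters or digits; punctuation is dropped.
    static List<String> tokens(String sentence) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

On a Chinese sentence written without spaces, tokens returns the whole run of characters as a single token, which shows why whitespace-style rules do not carry over.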

Page 36: Apache UIMA Introduction

NLP - PoS Tagging

Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step

Many language- or domain-specific patterns can be discovered and exploited with just PoS-tagged tokens and sentences
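To make the input and output of this step concrete, here is a toy tagger that looks each token up in a small lexicon. The lexicon entries, the tag names and the “undefined” fallback are assumptions for this sketch; real PoS taggers are typically statistical and use context to resolve ambiguous words (for example “stars” as a noun or a verb).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LexiconPosTagger {
    // Illustrative lexicon (hypothetical content): one fixed tag per word.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", "determiner");
        LEXICON.put("famous", "adjective");
        LEXICON.put("actor", "noun");
        LEXICON.put("movie", "noun");
        LEXICON.put("stars", "verb");
        LEXICON.put("as", "preposition");
    }

    // Returns one tag per token; words missing from the lexicon get "undefined".
    static List<String> tag(List<String> tokens) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            String tag = LEXICON.get(token.toLowerCase(Locale.ROOT));
            tags.add(tag == null ? "undefined" : tag);
        }
        return tags;
    }
}
```

Even this crude output supports simple patterns, such as treating an untagged capitalized token as a candidate proper name.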

Page 37: Apache UIMA Introduction

NLP - Chunking & Parsing

Parse sentences into a meaningful set or tree of relationships

Chunks are the building blocks of a sentence (e.g., verbal forms)

Parse tree highlights the structure of a sentence

Can leverage logic analysis

[Figure: chunking, parsing]

Page 38: Apache UIMA Introduction

NLP - Named Entity Recognition

Answer the questions: where? when? who? how often? how much?

Identify key entities in the text

Common techniques: dictionaries, rules, statistical models

Page 39: Apache UIMA Introduction

Debugging NER in UIMA

Page 40: Apache UIMA Introduction

Using UIMA

Define TypeSystem

Define AnalysisEngine descriptor(s)

Implement Annotator(s)

Execute the UIMA pipeline

Page 41: Apache UIMA Introduction

Sample scenario - extract actors

Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors

Page 42: Apache UIMA Introduction

Sample scenario - extract persons

I have a dictionary of names (simple to find and/or build)

I use a dictionary based Annotator to extract annotations of first names (NameAnnotation)

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to cover persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter

If found, then the name + token(s) sequence annotates a Person (e.g., “Michael J. Fox”)
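The rule described above can be sketched in plain Java as follows. This is not UIMA code: tokens arrive as a list of strings, PoS tags arrive as a map (missing entries count as “undefined”), and the first-name dictionary and the class name DictionaryPersonFinder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DictionaryPersonFinder {
    // Illustrative first-name dictionary (hypothetical content).
    private static final Set<String> FIRST_NAMES =
            new HashSet<>(Arrays.asList("Michael", "Christopher", "Lea"));

    // Returns Persons: a dictionary first name followed by one or more capitalized
    // tokens whose PoS is "noun" or unknown ("undefined").
    static List<String> findPersons(List<String> tokens, Map<String, String> posByToken) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!FIRST_NAMES.contains(tokens.get(i))) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(tokens.get(i));
            int j = i + 1;
            while (j < tokens.size() && isSurnameCandidate(tokens.get(j), posByToken)) {
                person.append(' ').append(tokens.get(j));
                j++;
            }
            if (j > i + 1) {
                persons.add(person.toString());   // a first name alone is not a Person
            }
            i = j;
        }
        return persons;
    }

    private static boolean isSurnameCandidate(String token, Map<String, String> posByToken) {
        String pos = posByToken.getOrDefault(token, "undefined");
        boolean posOk = pos.equals("undefined") || pos.equals("noun");
        return posOk && !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }
}
```

The sketch ignores sentence boundaries, so a capitalized word that opens the next sentence could be absorbed into the name; checking against Sentence annotations would avoid that.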

Page 43: Apache UIMA Introduction

from Persons to Actors

Getting actors can be simple if we know that Persons who are also Actors perform some well-known actions, or that widely used patterns exist

e.g.: a Person who “stars as” a CharacterInTheMovie (which may itself end up tagged as a Person) is also an Actor

e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor

Then we could build an ActorAnnotator
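One possible shape for the core of such an ActorAnnotator, written as plain Java over raw text rather than as a UIMA annotator. It assumes the Persons have already been identified and applies the two patterns above with regular expressions; the class name ActorRules and the exact patterns are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ActorRules {
    // Keeps the Persons that match one of two textual patterns:
    //   1. "<Person> stars as ..."
    //   2. "<Capitalized words> (<Person>)", a character name followed by the person in parentheses
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person);   // match the name literally, dots included
            Pattern starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b");
            Pattern inParens = Pattern.compile("\\p{Lu}[\\p{L} .]*\\(" + quoted + "\\)");
            if (starsAs.matcher(text).find() || inParens.matcher(text).find()) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Both patterns are heuristics, as the slide notes with “usually”: a parenthesized name after a capitalized phrase is not always an actor, so the results may need filtering.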

Page 44: Apache UIMA Introduction

1. Define TypeSystem

Define at least one Type in the Type System for each object in the domain model

It is useful to define finer-grained Types (for the values of Type properties, called Features)

If we want to extract information about articles, we create an Article type in the Type System

We will also need to create annotations/entities for movies, actors, directors, etc.

Page 45: Apache UIMA Introduction

2. Define AnalysisEngine descriptor

Define which type system it’s going to use

Define which capabilities the analysis engine has: which annotations it needs in order to work and which annotations it will (possibly) generate

Define configuration parameters for the underlying algorithm

Define resources needed by the analysis engine

Page 46: Apache UIMA Introduction

3. Implement Annotator

create a new class extending JCasAnnotator_ImplBase

implement the process() method that actually does the job

the algorithm implementation is (called) in the process() method

you can use configuration parameters/resources defined in the descriptor

optionally override the initialize() and destroy() methods

Page 47: Apache UIMA Introduction

DummyPersonAnnotator

Page 48: Apache UIMA Introduction

4. Execute the UIMA pipeline

Instantiate the AnalysisEngine with its descriptor as a parameter

Create a CAS which will contain the text to be analyzed and the annotations extracted

Run the AnalysisEngine on the given CAS

Browse results

Page 49: Apache UIMA Introduction

Execute a UIMA pipeline

Page 50: Apache UIMA Introduction

What’s next

UIMA Use cases

Using UIMA in search engines

Hands on code (assignment)