palantir xml formats

Post on 18-Nov-2014

© 2008 Palantir Technologies Inc. All rights reserved.

Palantir XML Formats

PalantirXML (pXML) & PalantirDocXML (DocXML)

Ari Gordon-Schlosberg, Senior Software Engineer

Palantir XML Formats

Written in XML Schema Definition (XSD) language
– W3C standard
– Widely accepted

Allows developers to leverage existing XML tools
– Editing
– Verification
– Transformation (XSLT) friendly

Designed to be simple & human-readable
– Follows Palantir design principles
– Meant to make it easier for developers to code, debug, and learn

PalantirXML: An Introduction

A rendering of a Palantir object graph into XML
– Encodes nearly all features in our lowest-level data model
– “Close to the metal”

Used as open import format
– Makes Palantir integration-friendly and a truly open platform
– Federated Search on-the-fly import uses it internally
– Super efficient storage format

Used for export/interchange
– Allows an organization to pull knowledge out of Palantir
– Can be transformed using XSLT to other XML formats

PalantirDocXML: An Introduction

Container for textual docs and entity extraction output
– raw text
– source document
– entity extraction results
– textual references to those entities
– document metadata

Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir
– Not tied to a single extractor; multiple vendors already support it

Designed to be a simple-to-author import format
– XSLT friendly
– Used existing entity extractor formats as design guides

Object-Model Refresher

Example Text Document

Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino

We're currently stuck in Atlanta, waiting for our flight to IL. We learned that our display case is 83 lineal inches, 3 inches longer than we're supposed to be able to fly with, but let us go this time. (I wonder if this is just a Delta thing?)

Eric Poirier called us and told us that the presentation at Cornell went very well, which gives us high hopes for tomorrow's presentation at UIUC. John and I are excited to get back home for a visit and I've been contacting professors to look for students that we should target for recruiting.

Things are going well.

Sincerely,

Your field team: Kevin, John, and Ari.

Imported Into Palantir

A Simple Example

Keep In Mind…

We’ll be covering:
– Details of these two formats
– Explanations of where to use them
– Some simple examples

Examples have been edited for brevity and clarity
– Covering important features
– Reference manuals and XSDs are the full references
– Some elements are abbreviated as <element/> where details are not relevant; more detail may be required there

PALANTIR XML (pXML)

pXML: Where To Use It

To import structured data that doesn’t import easily
– Data from a database where objects span tables
– Objects assembled from multiple data sources
– Other “exotic” data sources

To export data from Palantir
– Other analytic tools
– Other data platforms
– Other Palantir instances

pXML And The Object Model

pXML is strongly coupled to the object model
– Data sources
– Objects
– Properties
– Notes
– Media
– Links
– Data source records

pXML And The Object Model

pXML elements come directly from the object model
– Data sources <dataSource/>
– Objects <object/>
– Properties <property/>
– Notes <note/>
– Media <media/>
– Links <link/>
– Data source records <dataSourceRecord/>
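To make the element list above concrete, here is a minimal Python sketch that assembles a tiny object graph from those element names. Only the element names (dataSource, object, property, dataSourceRecord, link) come from the slides. The root element, every attribute name, the type names, and the nesting are assumptions made for illustration; the pXML XSD is the authority on the real structure.

```python
import xml.etree.ElementTree as ET


def build_pxml_sketch():
    """Assemble a small, illustrative object graph (not schema-valid pXML)."""
    # Hypothetical root element; the real root is defined by the pXML XSD.
    root = ET.Element("pxmlSketch")

    # A data source is a reference to where data came from, not the data itself.
    ET.SubElement(root, "dataSource", {"id": "ds-1", "name": "field-report.txt"})

    # Two objects, each carrying one property. Attribute names are assumed.
    for obj_id, obj_type, name in [
        ("obj-1", "Person", "Eric Poirier"),
        ("obj-2", "Organization", "Cornell"),
    ]:
        obj = ET.SubElement(root, "object", {"id": obj_id, "type": obj_type})
        prop = ET.SubElement(obj, "property", {"type": "Name"})
        prop.text = name
        # Tie the property back to the data source it was taken from.
        ET.SubElement(prop, "dataSourceRecord", {"dataSource": "ds-1"})

    # Links are directed, so the sketch records an explicit source and target.
    ET.SubElement(
        root, "link", {"type": "AssociatedWith", "from": "obj-1", "to": "obj-2"}
    )
    return root


root = build_pxml_sketch()
print(ET.tostring(root, encoding="unicode"))
```

The sketch only shows how the object-model concepts line up with XML elements; a real generator would be written against the XSD and validated with it.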

pXML Document Structure

Document/Data Source Duality

Data sources represent real-world sources of data
– do not contain data
– a collection of references

Palantir document objects contain real-world data

Primary object connects a data source to the object holding its data

Used by data sources representing unstructured data
– Documents
– Emails
– Other sources of unstructured text

Data Sources

Object

Property

Property Values

Three types of property values are supported in pXML:
– Simple
  • Used for single, unparsed values
  • e.g. Nationality, Organization Name
– Composite
  • Used for values composed of discrete, semantic units
  • e.g. Name (first & last), Address (city, state, zip, etc.)
– Raw
  • Convenience format
  • Keeps pXML simple and allows the parsers to do the work
  • Allows ontology to change around existing pXML generators
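The difference between the three value types can be sketched in Python as follows. The property element name comes from the slides; the child element names (simpleValue, compositeValue, component, rawValue) and the type and name attributes are hypothetical stand-ins, since the original example slides did not survive in this transcript. Consult the XSD for the real names.

```python
import xml.etree.ElementTree as ET


def simple_value(prop_type, value):
    """A single, unparsed value."""
    prop = ET.Element("property", {"type": prop_type})
    ET.SubElement(prop, "simpleValue").text = value  # hypothetical element name
    return prop


def composite_value(prop_type, components):
    """A value made of discrete, named semantic units."""
    prop = ET.Element("property", {"type": prop_type})
    comp = ET.SubElement(prop, "compositeValue")  # hypothetical element name
    for name, value in components.items():
        ET.SubElement(comp, "component", {"name": name}).text = value
    return prop


def raw_value(prop_type, text):
    """An unsplit string that is left for the importing parser to break apart."""
    prop = ET.Element("property", {"type": prop_type})
    ET.SubElement(prop, "rawValue").text = text  # hypothetical element name
    return prop


org = simple_value("Organization Name", "UIUC")
name = composite_value("Name", {"first": "Kevin", "last": "Simler"})
raw = raw_value("Name", "Kevin Simler")

for prop in (org, name, raw):
    print(ET.tostring(prop, encoding="unicode"))
```

Note that the composite and raw examples carry the same name. With the raw form the generator does not need to know how the ontology splits a name, which is the point the slide makes about ontologies changing around existing generators.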

Simple Property Value

Composite Property Value

Raw Property Value

Media

Notes

DataSourceRecords

Data source records (DSRs) tie data to their source

Apply to all pieces of data
– Properties
– Notes
– Media
– Links

Have two modes
– Import keys tie data to a record's primary key or index in a structured data source, e.g. a line number or primary key
– String position locators mark references in unstructured text using character offsets and lengths
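The string position locator mode can be illustrated with a short Python sketch that records an offset and a length for each mention of a term in a text. The class and function names are hypothetical, and zero-based offsets are an assumption; the slides say only that locators use character offsets and lengths. The sample text is taken from the example document earlier in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringPositionLocator:
    """Hypothetical locator: where a reference sits in the source text."""
    offset: int  # character offset into the text (assumed zero-based)
    length: int  # number of characters the reference spans


def locate_all(text, mention):
    """Return one locator per non-overlapping occurrence of mention in text."""
    locators = []
    start = text.find(mention)
    while start != -1:
        locators.append(StringPositionLocator(start, len(mention)))
        start = text.find(mention, start + len(mention))
    return locators


text = (
    "Eric Poirier called us and told us that the presentation at "
    "Cornell went very well, which gives us high hopes for tomorrow's "
    "presentation at UIUC."
)

for mention in ("Eric Poirier", "Cornell", "presentation"):
    for loc in locate_all(text, mention):
        # Slicing with the locator recovers the original mention.
        print(mention, loc, repr(text[loc.offset:loc.offset + loc.length]))
```

An import key, by contrast, would carry an identifier such as a primary key or line number instead of a position, because structured sources have records rather than running text.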

DataSourceRecords

Links

Links represent a link between two objects

All links are directed in Palantir

PALANTIR DOCUMENT XML (DocXML)

PalantirDocXML: An Introduction

Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir

Support for multiple entity extractors per document
– Object data is designed to be an easy transform target from popular extractors
– Can hold the original output from the entity extractors

Allows ontologies to change over time
– Architected to use pluggable type mappings
– Compatible with multiple Palantir instances
– Never need to rebuild a DocXML document

Advanced Features

Advanced character set handling
– Stores document originals in original character set
– Careful UTF-8 encapsulation supports all human languages

Support for flexible document metadata
– Captures arbitrary organizational or handling metadata

Easy to understand and transform into other formats
– XSLT friendly by design
– Can hold extractor configuration as well as output
– Cross-data-platform format for extracted documents
– Intermediate format for multi-step extraction
– Single interface for ingestion of extracted documents
– Completely Palantir agnostic

DocXML Document Structure

Document Metadata

Document Metadata Example

Object Data

Extraction Metadata

Extraction Metadata Example

Object

Example Object

Relationship

Type Mapping

DocXML documents are not tied to an ontology
– A single document can be ingested into different ontologies
– Changes in an ontology do not require re-extraction or changes to the extractor, just an edit of the type mapping
– Each document can use multiple mappings

Mappings map extractor types and document properties
– Separate mapping for each supported extractor
– Document properties map into properties on the Palantir Document object

Centrally managed resource for each enterprise
– Analysts don’t write type mappings, architects do
– Imports seamlessly “just work”
– Everyone uses a consistent mapping
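The idea behind type mappings can be sketched in Python as a lookup table kept separately from the extracted documents. The extractor names, extractor type labels, and ontology type names below are all hypothetical, and a real type mapping is an XML resource defined by the DocXML tooling rather than a Python dictionary. The sketch only illustrates why an ontology change means editing the mapping instead of re-extracting.

```python
# One mapping per supported extractor: extractor type label -> ontology type.
# All names here are made up for illustration.
TYPE_MAPPINGS = {
    "extractor_a": {"PERSON": "Person", "ORG": "Organization"},
    "extractor_b": {"per": "Person", "company": "Organization"},
}


def map_type(extractor, extractor_type, mappings=TYPE_MAPPINGS):
    """Return the ontology type for an extractor's label, or None if unmapped."""
    return mappings.get(extractor, {}).get(extractor_type)


def with_renamed_ontology_type(mappings, old, new):
    """Return a copy of the mappings with one ontology type renamed.

    This models an ontology change: the extracted documents stay untouched
    and only the mapping is edited.
    """
    return {
        extractor: {
            label: (new if target == old else target)
            for label, target in table.items()
        }
        for extractor, table in mappings.items()
    }


print(map_type("extractor_a", "ORG"))      # looked up in extractor_a's table
print(map_type("extractor_b", "company"))  # same ontology type, different label

updated = with_renamed_ontology_type(TYPE_MAPPINGS, "Organization", "Institution")
print(map_type("extractor_a", "ORG", updated))
```

Keeping one table per extractor mirrors the slide's point that each supported extractor gets its own mapping while everyone in the enterprise shares the same centrally managed set.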

Type Mapping Overview

Document Properties

Extractor Type Mappings

Extractor Type Mappings Example

Final Thoughts

This presentation is an overview
– Both pXML and DocXML have features not covered here

The XSD files are the canonical reference
– Full syntax and rules are covered there
– Consult the reference manual for usage and in-depth explanations

Living standards
– Backwards compatible
– May add new features to support customer needs

See our blog for tips and techniques on XML processing
– http://blog.palantirtech.com/
