palantir xml formats

© 2008 Palantir Technologies Inc. All rights reserved. Palantir XML Formats PalantirXML (pXML) & PalantirDocXML (DocXML) Ari Gordon-Schlosberg Senior Software Engineer

Upload: palantirtech

Post on 18-Nov-2014


Page 1: Palantir XML Formats

© 2008 Palantir Technologies Inc. All rights reserved.

Palantir XML Formats

PalantirXML (pXML) & PalantirDocXML (DocXML)

Ari Gordon-Schlosberg, Senior Software Engineer

Page 2: Palantir XML Formats

Palantir XML Formats

Written in XML Schema Definition (XSD) language
– W3C standard
– Widely accepted

Allows developers to leverage existing XML tools
– Editing
– Verification
– Transformation (XSLT) friendly

Designed to be simple & human-readable
– Follows Palantir design principles
– Meant to make it easier for developers to code, debug, and learn

Page 3: Palantir XML Formats

PalantirXML: An Introduction

A rendering of a Palantir object graph into XML
– Encodes nearly all features in our lowest-level data model
– “Close to the metal”

Used as open import format
– Makes Palantir integration-friendly and a truly open platform
– Federated Search on-the-fly import uses it internally
– Super efficient storage format

Used for export/interchange
– Allows organization to pull knowledge out of Palantir
– Can be transformed using XSLT to other XML formats

Page 4: Palantir XML Formats

PalantirDocXML: An Introduction

Container for textual docs and entity extraction output
– raw text
– source document
– entity extraction results
– textual references to those entities
– document metadata

Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir
– Not tied to a single extractor; multiple vendors already support it

Designed to be simple-to-author import format
– XSLT friendly
– Used existing entity extractor formats as design guides

Page 5: Palantir XML Formats

Object-Model Refresher

Page 6: Palantir XML Formats

Example Text Document

Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino

We're currently stuck in Atlanta, waiting for our flight to IL. We learned that our display case is 83 lineal inches, 3 inches longer than we're supposed to be able to fly with, but let us go this time. (I wonder if this is just a Delta thing?)

Eric Poirier called us and told us that the presentation at Cornell went very well, which gives us high hopes for tomorrow's presentation at UIUC. John and I are excited to get back home for a visit and I've been contacting professors to look for students that we should target for recruiting.

Things are going well.

Sincerely,

Your field team: Kevin, John, and Ari.

Page 7: Palantir XML Formats

Imported Into Palantir

Page 8: Palantir XML Formats

A Simple Example

Page 9: Palantir XML Formats

A Simple Example

Page 10: Palantir XML Formats

Keep In Mind…

We’ll be covering:
– Details of these two formats
– Explanations of where to use them
– Some simple examples

Examples have been edited for brevity and clarity
– Covering important features
– Reference manuals and XSDs are the full references
– Some elements are abbreviated as <element/> where details are not relevant; more detail may be required there

Page 11: Palantir XML Formats

PALANTIR XML

pXML

Page 12: Palantir XML Formats

pXML: Where To Use It

To import structured data that doesn’t import easily
– Data from a database where objects span tables
– Objects assembled from multiple DataSources
– Other “exotic” data sources

To export data from Palantir
– Other analytic tools
– Other data platforms
– Other Palantir instances

Page 13: Palantir XML Formats

pXML And The Object Model

pXML is strongly coupled to the object model
– Data sources
– Objects
– Properties
– Notes
– Media
– Links
– Data source records

Page 14: Palantir XML Formats

pXML And The Object Model

pXML elements come directly from the object model
– Data sources: <dataSource/>
– Objects: <object/>
– Properties: <property/>
– Notes: <note/>
– Media: <media/>
– Links: <link/>
– Data source records: <dataSourceRecord/>
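As a rough illustration of how the element names above might fit together, here is a minimal Python sketch that builds a small XML tree with the standard library. Only the seven element names come from the slide. The root element, the nesting, and every attribute name are hypothetical placeholders, not the real pXML schema; the XSD remains the reference for actual structure.

```python
import xml.etree.ElementTree as ET


def build_sketch() -> ET.Element:
    """Build a toy tree using the element names listed on the slide.

    The root name, nesting, and all attributes are hypothetical.
    """
    root = ET.Element("sketchRoot")
    ET.SubElement(root, "dataSource", {"id": "ds-1"})

    person = ET.SubElement(root, "object", {"id": "obj-1"})
    name = ET.SubElement(person, "property", {"type": "Name", "value": "Eric Poirier"})
    # A data source record ties this property back to its source.
    ET.SubElement(name, "dataSourceRecord", {"dataSource": "ds-1"})
    ET.SubElement(person, "note").text = "Reported on the Cornell presentation."
    ET.SubElement(person, "media", {"title": "placeholder"})

    ET.SubElement(root, "object", {"id": "obj-2"})
    # Links connect two objects by identifier.
    ET.SubElement(root, "link", {"parent": "obj-1", "child": "obj-2"})
    return root


if __name__ == "__main__":
    print(ET.tostring(build_sketch(), encoding="unicode"))
```

Running the script prints the tree as a single line of XML, which is enough to see how each object-model concept gets its own element.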

Page 15: Palantir XML Formats

pXML Document Structure

Page 16: Palantir XML Formats

Document/Data Source Duality

Data sources represent real-world sources of data
– do not contain data
– a collection of references

Palantir document objects contain real-world data

Primary object connects a data source to the object holding its data

Used by data sources representing unstructured data
– Documents
– Emails
– Other sources of unstructured text

Page 17: Palantir XML Formats

Data Sources

Page 18: Palantir XML Formats

Object

Page 19: Palantir XML Formats

Property

Page 20: Palantir XML Formats

Property Values

Three types of property values are supported in pXML:

– Simple
  • Used for single, unparsed values
  • e.g. Nationality, Organization Name

– Composite
  • Used for values composed of discrete, semantic units
  • e.g. Name (first & last), Address (city, state, zip, etc.)

– Raw
  • Convenience format
  • Keeps pXML simple and allows the parsers to do the work
  • Allows ontology to change around existing pXML generators
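The distinction between the three value kinds can be sketched in Python as follows. This is an illustrative model only: the class names, field names, and the toy name parser are assumptions made for this example and do not reflect how Palantir's parsers actually work.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class SimpleValue:
    """A single, unparsed value, e.g. a nationality."""
    value: str


@dataclass
class CompositeValue:
    """A value made of named semantic units, e.g. first and last name."""
    components: Dict[str, str]


@dataclass
class RawValue:
    """Unparsed text that the importing side is expected to parse."""
    text: str


PropertyValue = Union[SimpleValue, CompositeValue, RawValue]


def toy_name_parser(raw: RawValue) -> CompositeValue:
    """Hypothetical stand-in for an importer-side parser.

    Splits "First Last" at the first space into two components.
    """
    first, _, last = raw.text.strip().partition(" ")
    return CompositeValue({"first": first, "last": last})
```

The point of the raw form is visible here: a generator only emits `RawValue("Eric Poirier")`, and the decision about how to split it stays on the importing side, so the component layout can change without touching the generator.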

Page 21: Palantir XML Formats

Simple Property Value

Page 22: Palantir XML Formats

Composite Property Value

Page 23: Palantir XML Formats

Raw Property Value

Page 24: Palantir XML Formats

Media

Page 25: Palantir XML Formats

Notes

Page 26: Palantir XML Formats

DataSourceRecords

Data source records (DSRs) tie data to their source

Apply to all pieces of data
– Properties
– Notes
– Media
– Links

Have two modes
– Import keys are used to tie data to a record primary key or index in structured data sources, e.g. a line number, primary key, etc.
– String position locators are used to mark references in unstructured text using character offsets and lengths.
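The two modes can be sketched as two small record types. The class and field names below are hypothetical, and the sketch assumes zero-based character offsets; the XSD and reference manual define the real representation.

```python
from dataclasses import dataclass


@dataclass
class ImportKey:
    """Points at a record in a structured source, e.g. a line number or primary key."""
    key: str


@dataclass
class StringPositionLocator:
    """Marks a span of unstructured text by character offset and length."""
    offset: int
    length: int


def locate(text: str, mention: str) -> StringPositionLocator:
    """Locate the first occurrence of `mention`; raises ValueError if it is absent."""
    return StringPositionLocator(text.index(mention), len(mention))


def resolve(text: str, locator: StringPositionLocator) -> str:
    """Return the span of `text` that the locator points at."""
    return text[locator.offset:locator.offset + locator.length]


# One sentence from the example text document earlier in the deck.
TEXT = (
    "Eric Poirier called us and told us that the presentation "
    "at Cornell went very well."
)
```

For example, `resolve(TEXT, locate(TEXT, "Cornell"))` returns the original mention, which is the round trip a string position locator is meant to support.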

Page 27: Palantir XML Formats

DataSourceRecords

Page 28: Palantir XML Formats

Links

Links represent a link between two objects

All links are directed in Palantir
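Directedness means the order of the two endpoints matters. A minimal sketch, with hypothetical field names and an invented link type:

```python
from typing import NamedTuple


class Link(NamedTuple):
    """A directed link between two objects (field names are hypothetical)."""
    source_id: str
    target_id: str
    link_type: str


a_to_b = Link("obj-1", "obj-2", "Reports To")
b_to_a = Link("obj-2", "obj-1", "Reports To")

# Swapping the endpoints yields a different link.
links_differ = a_to_b != b_to_a
```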

Page 29: Palantir XML Formats

PALANTIR DOCUMENT XML

DocXML

Page 30: Palantir XML Formats

PalantirDocXML: An Introduction

Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir

Support for multiple entity extractors per document
– Object data is designed to be an easy transform target from popular extractors
– Can hold the original output from the entity extractors

Allows ontologies to change over time
– Architected to use pluggable type-mappings
– Compatible with multiple Palantir instances
– Never need to rebuild a DocXML document

Page 31: Palantir XML Formats

Advanced Features

Advanced character set handling
– Stores document originals in original character set
– Careful UTF-8 encapsulation supports all human languages

Support for flexible document metadata
– Captures arbitrary organizational or handling metadata

Easy to understand and transform into other formats
– XSLT friendly by design
– Can hold extractor configuration as well as output
– Cross-data-platform format for extracted documents
– Intermediate format for multi-step extraction
– Single interface for ingestion of extracted document
– Completely Palantir agnostic
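One way to picture the character set bullets above is to keep the original bytes next to a decoded text view. The sketch below is an assumption for illustration only: it uses base64 to carry the original bytes safely inside a text container, which may not be how DocXML actually encapsulates them, and the dictionary keys are invented.

```python
import base64
from typing import Dict


def encapsulate(original: bytes, charset: str) -> Dict[str, str]:
    """Keep the original bytes (base64, an assumed encoding) plus decoded text."""
    return {
        "charset": charset,
        "original_b64": base64.b64encode(original).decode("ascii"),
        "text": original.decode(charset),
    }


def recover_original(record: Dict[str, str]) -> bytes:
    """Return the exact bytes of the document as originally received."""
    return base64.b64decode(record["original_b64"])
```

With this shape, the text view can be serialized as UTF-8 for any language, while `recover_original` still returns the source bytes unchanged in their original character set.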

Page 32: Palantir XML Formats

DocXML Document Structure

Page 33: Palantir XML Formats

Document Metadata

Page 34: Palantir XML Formats

Document Metadata Example

Page 35: Palantir XML Formats

Object Data

Page 36: Palantir XML Formats

Extraction Metadata

Page 37: Palantir XML Formats

Extraction Metadata Example

Page 38: Palantir XML Formats

Object

Page 39: Palantir XML Formats

Example Object

Page 40: Palantir XML Formats

Relationship

Page 41: Palantir XML Formats

Type Mapping

DocXML documents are not tied to an ontology
– Single document can be ingested into different ontologies
– Changes in an ontology do not require re-extraction or changes to the extractor, just an edit of the type mapping
– Each document can use multiple mappings

Mappings map extractor types and document properties
– Separate mapping for each supported extractor
– Document properties map into properties on the Palantir Document object

Centrally-managed resource for each enterprise
– Analysts don’t write type mappings, architects do
– Imports seamlessly “just work”
– Everyone uses a consistent mapping
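The idea that one extracted document can be ingested into different ontologies just by swapping the mapping can be sketched as a lookup table. Everything named here (the extractor name, the extractor types, the ontology types, and the rule that unmapped types are skipped) is a hypothetical choice for this example, not the real type-mapping format.

```python
from typing import Dict, List, Tuple

# extractor name -> extractor type -> ontology type (all names hypothetical)
TypeMapping = Dict[str, Dict[str, str]]

MAPPING_A: TypeMapping = {"extractor_x": {"PERSON": "Person", "ORG": "Organization"}}
MAPPING_B: TypeMapping = {"extractor_x": {"PERSON": "Individual", "ORG": "Company"}}


def map_entities(
    entities: List[Tuple[str, str]],
    extractor: str,
    mapping: TypeMapping,
) -> List[Tuple[str, str]]:
    """Translate (extractor_type, text) pairs into (ontology_type, text) pairs.

    Types with no entry in the mapping are skipped, an assumption of this sketch.
    """
    per_extractor = mapping.get(extractor, {})
    return [
        (per_extractor[extractor_type], text)
        for extractor_type, text in entities
        if extractor_type in per_extractor
    ]


# The same extracted entities, unchanged, under two different mappings.
EXTRACTED = [("PERSON", "Eric Poirier"), ("ORG", "Cornell"), ("DATE", "tomorrow")]
```

Calling `map_entities(EXTRACTED, "extractor_x", MAPPING_A)` and then the same call with `MAPPING_B` yields differently typed results from identical extractor output, which is why an ontology change only needs a mapping edit rather than re-extraction.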

Page 42: Palantir XML Formats

Type Mapping Overview

Page 43: Palantir XML Formats

Document Properties

Page 44: Palantir XML Formats

Extractor Type Mappings

Page 45: Palantir XML Formats

Extractor Type Mappings Example

Page 46: Palantir XML Formats

Final Thoughts

This presentation is an overview
– Both pXML and DocXML have features not covered here

The XSD files are the canonical reference
– Full syntax and rules are covered there
– Consult reference manual for usage and in-depth explanations

Living Standards
– Backwards compatible
– May add new features to support customer needs

See our blog for tips and techniques on XML processing
– http://blog.palantirtech.com/

Page 47: Palantir XML Formats

© 2008 Palantir Technologies Inc. All rights reserved.

Palantir XML Formats

PalantirXML (pXML) & PalantirDocXML (DocXML)

Ari Gordon-Schlosberg, Senior Software Engineer