
UNCLASSIFIED


Mostly-Automatic Construction of a Palantir Knowledge Base with NetOwl Extractor

Matthew C. Lowry

Command, Control, Communications and Intelligence Division Defence Science and Technology Organisation

DSTO-GD-0748

ABSTRACT A knowledge base describing entities, events, and links between them is a valuable tool for intelligence analysis. However constructing a knowledge base from unstructured source material is a labour intensive process. This leads to the desire for a process to automatically construct a knowledge base from unstructured source material. NetOwl Extractor is an information extraction system that processes unstructured text documents to extract structured information. Palantir is a knowledge base system that enables structured information to be combined into a single knowledge base and effectively exploited by analysts. This report describes the result of an experiment to combine these two systems; specifically to translate the output of NetOwl Extractor into a form that Palantir can ingest into its knowledge base. It was found that although the translation process was straightforward, the knowledge base obtained was of a poor quality and questionable utility for an intelligence analyst.

RELEASE LIMITATION

Approved for public release


Published by Command, Control, Communications and Intelligence Division DSTO Defence Science and Technology Organisation PO Box 1500 Edinburgh South Australia 5111 Australia Telephone: 1300 DEFENCE Fax: (08) 7389 6567 Commonwealth of Australia 2013 AR-015-636 June 2013


Executive Summary

A knowledge base describing entities (people, places, organisations, etc.), events, and links between them is a valuable tool for intelligence analysis. However, for analysts to manually construct a knowledge base from unstructured (i.e. free text) source material is a labour-intensive process. When an analyst's immediate goal is to produce reporting on a topic of interest, any effort spent in the construction of a knowledge base actually detracts from their efficiency in achieving their immediate task. So while the existence of a knowledge base with clear, accurate, and topical knowledge is useful to the analyst and will increase their efficiency in the long term, the effort to construct the knowledge base is seen as a distraction. This leads to the desire for intelligence processing systems that can automatically construct a knowledge base from available source material. To date, commercial offerings have focussed on individual tasks required for an automatic knowledge base construction process. Two particular commercial tools that are relevant are NetOwl Extractor from SRA International, Inc. and Palantir from Palantir Technologies, Inc. NetOwl Extractor is an information extraction system that processes unstructured text documents to extract structured information; in particular information about entities, events, and the links between them. However, NetOwl Extractor works on the level of individual documents, and has no knowledge base capability. Palantir is a knowledge base system that enables information that is in a structured form (databases, spreadsheets, etc.) to be combined into a single knowledge base and effectively exploited by analysts. However, Palantir does not have any mechanism to automatically process unstructured textual source material. This report describes the result of an experiment to combine these two systems.
The goal was to examine the potential for a fully automated process to generate a knowledge base from a large corpus of unstructured text documents. It was found that although the process was straightforward, the knowledge base obtained was of a poor quality and questionable utility for an intelligence analyst. Also it was found that the architecture and design of the Palantir system was not optimised to efficiently support consumption of large amounts of output from information extraction systems such as NetOwl Extractor.


Contents

1. INTRODUCTION

2. PROCEDURE
2.1 Trial Corpus and its Preparation
2.2 Processing with NetOwl Extractor
2.3 Generation of DocXML Files
2.3.1 Mapping Entities
2.3.2 Mapping Events
2.3.3 Mapping Links
2.4 Importation of DocXML and Entity Resolution in Palantir

3. OBSERVATIONS
3.1 Poor Quality Knowledge Base
3.1.1 Source Document Artifacts Confusing Information Extraction
3.1.2 Erroneous Named Entity Recognition
3.1.3 Erroneous Named Entity Resolution
3.2 Performance Issues

4. CONCLUSION



1. Introduction

A knowledge base describing entities (people, places, organisations, etc.), events, and links between them is a valuable tool for intelligence analysis. However, for analysts to manually construct a knowledge base from unstructured (i.e. free text) source material is a labour-intensive process. When an analyst's immediate goal is to produce reporting on a topic of interest, any effort spent in the construction of a knowledge base actually detracts from their efficiency in achieving their immediate task. So while the existence of a knowledge base with clear, accurate, and topical knowledge is useful to the analyst and would increase their efficiency in the long term, the construction of the knowledge base is seen as a distraction. This leads to the desire for intelligence processing systems that can automatically construct a knowledge base from available source material. To date, commercial offerings have focussed on individual tasks required for an automatic knowledge base construction process. Two particular commercial tools that are relevant are NetOwl Extractor from SRA International, Inc. and Palantir from Palantir Technologies, Inc. NetOwl Extractor is an information extraction system that processes unstructured text documents to extract structured information; in particular information about entities, events, and the links between them. However, NetOwl Extractor works on the level of individual documents, and has no knowledge base capability. Palantir is a knowledge base system that enables information that is in a structured form (databases, spreadsheets, etc.) to be combined into a single knowledge base and effectively exploited by analysts. However, Palantir does not have any mechanism to automatically process unstructured textual source material. This report describes the result of an experiment to combine these two systems.
The goal was to examine the potential for a fully automated process to generate a knowledge base from a large corpus of unstructured text documents. However, the process is described as mostly-automatic because one aspect of the process was not automated for the sake of expediency. This is explained in more detail in Section 2.4. The core activity in the experiment was writing Java code that translates from the output of NetOwl Extractor to a format suitable for passing to Palantir. NetOwl is capable of generating a range of output formats, including a compact XML format that describes all information extracted by the tool from a document. This format can be readily processed using the XML parsing facilities in the standard Java runtime environment. Palantir includes a facility for importing information into its knowledge base in a format called DocXML. This is essentially an XML schema that is specialised for combining the content of a text document together with information that can be inferred from the document. The extensive Java API provided by Palantir includes convenience classes for constructing DocXML document object models and writing DocXML files. The fundamental issue in translating from the output of NetOwl Extractor to Palantir's DocXML format was the different data models the two tools use. The way NetOwl expresses the information it extracts from a document can be characterised as a full entity-relationship model; that is, entities, events, and links between entities or events are all first-class elements of the data model. However, the data model in Palantir's knowledge base can be characterised as a simple entity model: entities and events are first-class elements of the data model, while links between entities and events are second-class elements. The consequences of this difference, and an approach to resolving it, are discussed in the following section.

2. Procedure

The process for mostly-automatic construction of a Palantir knowledge base that was explored is summarised as follows:

1. Select and preprocess a trial corpus.

2. Process the corpus documents with NetOwl Extractor.

3. Translate the output of NetOwl Extractor to Palantir's DocXML format.

4. Load the generated DocXML into a Palantir knowledge base.

The details of these steps are described below.

2.1 Trial Corpus and its Preparation

The corpus used for the experiment was a collection of documents dating from early 2002 that were disseminated by the Foreign Broadcast Information Service (FBIS; then a component of the United States Central Intelligence Agency, now known as the Open Source Centre and a component of the United States Office of the Director of National Intelligence). The documents consist primarily of translations of non-English language print and World Wide Web media articles, and synopses of non-English language radio and television broadcasts, from around the world. The text content is written in upper-case letters, and is generally well-formed English prose but exhibits occasional grammatical or spelling mistakes. Due to the transmission and storage history of the corpus, the physical files containing the documents were limited to approximately 8 kilobytes in size. Documents larger than this size had been split into multiple physical files. Before submission to NetOwl Extractor, the split documents were reconstructed into single physical files to enable submission to NetOwl as single units for processing. Without this reconstruction step, the potential for intra-document resolution of entity mentions would be reduced, and hence the quality of the information extraction would be reduced. Another preprocessing step was removal of metadata that had been stored within the document text as header and footer sections. If left in place, NetOwl Extractor would attempt to treat this metadata as data, potentially confusing its information extraction routines and reducing the quality of its output.


2.2 Processing with NetOwl Extractor

The document corpus was processed using NetOwl Extractor version 6.5.1 augmented with the optional Link and Event version 2.4.0.1 rule base. The processing was performed using the linkandevent-plain predefined configuration parameter preset. This preset causes the tool to treat the input data as plain text, which is appropriate for the trial corpus, and apply the link recognition and event recognition subtasks in addition to the standard entity mention and equivalence recognition. The output generated by the preset is the xml-full format. Note that for the purposes of the experiment, NetOwl Extractor was used "out-of-the-box". That is, only the general-purpose semantic rules provided with the software were used. In a typical deployment of the software, the semantic rules the software uses to extract information from text would be specialised to suit the source documents and the analysis to be performed on the extracted information. However, optimising the performance of information extraction was not a concern for this experiment, so no effort was made to customise NetOwl for the context of the experiment.

2.3 Generation of DocXML Files

The primary challenge is how to achieve the mapping of the result set from NetOwl Extractor, expressed in that tool's ontology, to a corresponding description of objects in a Palantir knowledge base ontology. Once the details of this are established, the mechanics are straightforward: parse the XML produced by NetOwl, construct a DocXML DOM using the convenience classes provided by the Palantir developer API, and use the convenience methods in those DOM classes to create DocXML files. These files can then be easily ingested into Palantir using the Palantir Workspace application. In the case of this experiment, there was no predefined Palantir ontology that was to be the target of the mapping. So a simple Palantir ontology was constructed to directly match the ontology used by the NetOwl Extractor rule base. The NetOwl ontology was sliced at level 2, with object types created in the Palantir ontology to match. This slicing of the NetOwl ontology at level 2 for the purposes of the mapping was purely for convenience. The information encoded in level 3 of the NetOwl Extractor ontology class hierarchy is retained by mapping the level 3 term to a property value. For example, an instance of class entity:organisation:military in NetOwl results is mapped to a Palantir object of type Organisation and the object is given a property Organisation Type = Military. The practical issue with this approach is that in a DocXML file all pieces of information for addition to the knowledge base must appeal to some portion of the text of the document as the source of that information. For a property derived from the NetOwl ontological class, there is no obvious source within the document text. However, in the DocXML schema it is valid to specify a text reference that starts at character index 0 and has a length of 0 characters.
The Palantir system will accept DocXML containing such text references, and the behaviour of the client in dealing with these null text references is sensible. If the user requests the source of the property, they are simply shown the document text but no part of the document text is highlighted. Within this general approach to mapping, there are some differences in how the classes of entities, events, and links identified by NetOwl Extractor need to be handled. This is described below.
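The level 2 slicing and the null text reference described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code written for the experiment and not the Palantir API: the names OntologySlicer, SlicedClass and TextRef are hypothetical, and the rule used to turn an ontology term into a display value (underscores become spaces, each word is capitalised) is an assumption chosen so that attack_target becomes Attack Target.

```java
import java.util.Optional;

/** Hypothetical text reference: a character offset and a length into the document text. */
record TextRef(int start, int length) {
    /** The "null" reference described in the text: offset 0, length 0. */
    static final TextRef NULL_REF = new TextRef(0, 0);
}

/** Hypothetical result of slicing a NetOwl class name at level 2. */
record SlicedClass(String objectType, Optional<String> subtypeValue, TextRef subtypeSource) {}

final class OntologySlicer {
    private OntologySlicer() {}

    /** Splits e.g. "entity:organisation:military" into type "Organisation" and subtype "Military". */
    static SlicedClass slice(String netOwlClass) {
        String[] levels = netOwlClass.split(":");
        if (levels.length < 2) {
            throw new IllegalArgumentException("Expected at least two levels: " + netOwlClass);
        }
        String objectType = toDisplayName(levels[1]);
        Optional<String> subtype = levels.length >= 3
                ? Optional.of(toDisplayName(levels[2]))
                : Optional.empty();
        // The subtype is derived from the class name, so it has no source span in the text.
        return new SlicedClass(objectType, subtype, TextRef.NULL_REF);
    }

    /** Assumed convention: underscores become spaces and each word is capitalised. */
    static String toDisplayName(String term) {
        StringBuilder sb = new StringBuilder();
        for (String word : term.split("_")) {
            if (word.isEmpty()) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return sb.toString();
    }
}
```

In this sketch the level 3 term always carries TextRef.NULL_REF, which mirrors the zero-length reference at character index 0 that the DocXML schema accepts.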

2.3.1 Mapping Entities

There is a difference in the way the concept of entity is used by NetOwl Extractor and Palantir, which necessitates limiting the entity types that will be mapped from the NetOwl results. In NetOwl output, anything that might be referred to by an event or link instance must be an entity. Thus NetOwl Extractor will identify things like dates and times, quantities of money, and even unitless numbers as date, currency or numeric entities. Doing so allows them to be referred to by events; e.g. giving the date and amount of a financial transaction between two organisations. However, the Palantir system is designed to have details like dates and quantities stored as properties of objects. So in the mapping, only the subset of NetOwl entity types that are sensible to have as entity objects in the Palantir knowledge base is mapped. For this experiment people, places, and organisations were mapped over. For entity types that are to be mapped over, there needs to be a chosen threshold of minimum information content below which an instance is considered too information-poor to be of value in a knowledge base. The threshold chosen for this experiment was at least one name mention. Entities that are only mentioned by description (e.g. "a man on the back of a donkey", or "a group of three militants") are not mapped over. To map the information regarding mentions of an entity, a property on the Palantir entity types must be chosen to receive the information. This is easily achieved for name mentions: the mention is mapped to a value of a name property with a text reference corresponding to the mention text identified by NetOwl Extractor. If NetOwl also identified descriptive mentions of the entity, then the description can be mapped to a value of a suitable property type with a text reference corresponding to the descriptive mention identified by NetOwl.
For example, for descriptive mentions of an organisation entity, a property such as Organisation Description would be suitable. In contrast, the pronoun mentions for people are not mapped over. Having a property on entities in the knowledge base for pronouns that have been used to refer to an entity has little analytical utility for users of the knowledge base. While the gender-specificity of English third-person pronouns does carry information, NetOwl Extractor infers a gender attribute of person entities so this information is not lost. Any attributes of an entity inferred by NetOwl are also mapped across as properties on the corresponding Palantir object. Doing so has the same issue as noted above: because NetOwl has inferred the attribute, there is no segment of text in the document that corresponds directly to the attribute value. Again the solution is to use a null text reference that identifies a zero-length segment starting at character index 0 of the document as the source of the attribute value.
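The entity filtering rules above can be illustrated with a short Java sketch. It is a hedged example only: the types MentionKind, Mention, ExtractedEntity, PropertyValue and EntityMapper are hypothetical, the level 2 type strings "person", "place" and "organisation" are assumed spellings, and the property names "Name" and "Description" stand in for whatever property types the target ontology defines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Kinds of mention; the names are illustrative, not NetOwl's own vocabulary. */
enum MentionKind { NAME, DESCRIPTION, PRONOUN }

/** A mention of an entity with its character span in the document text. */
record Mention(MentionKind kind, String text, int start, int length) {}

/** An entity as extracted from one document, reduced to its level 2 type and mentions. */
record ExtractedEntity(String level2Type, List<Mention> mentions) {}

/** A property value destined for the knowledge base, with its source span. */
record PropertyValue(String propertyType, String value, int start, int length) {}

final class EntityMapper {
    private EntityMapper() {}

    /** Level 2 types treated as knowledge base objects in the experiment (assumed spellings). */
    static final Set<String> MAPPED_TYPES = Set.of("person", "place", "organisation");

    /** An entity is mapped only if its type is mapped and it has at least one name mention. */
    static boolean shouldMap(ExtractedEntity e) {
        return MAPPED_TYPES.contains(e.level2Type())
                && e.mentions().stream().anyMatch(m -> m.kind() == MentionKind.NAME);
    }

    /** Name and descriptive mentions become property values; pronoun mentions are dropped. */
    static List<PropertyValue> toProperties(ExtractedEntity e) {
        List<PropertyValue> out = new ArrayList<>();
        for (Mention m : e.mentions()) {
            switch (m.kind()) {
                case NAME -> out.add(new PropertyValue("Name", m.text(), m.start(), m.length()));
                case DESCRIPTION -> out.add(new PropertyValue("Description", m.text(), m.start(), m.length()));
                case PRONOUN -> { /* intentionally not mapped */ }
            }
        }
        return out;
    }
}
```

Dates, currency amounts and numbers never pass shouldMap in this sketch because their types are not in MAPPED_TYPES; they would instead surface as property values on the events that refer to them.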



2.3.2 Mapping Events

The issues involved in mapping event instances identified by NetOwl Extractor are similar to the issues discussed previously with entities. However, there are some differences in how the issues are best dealt with. As with entities, the mapping from NetOwl Extractor ontology class to Palantir ontology object type is done at level 2 of the NetOwl ontology hierarchy. The level 3 term of the class of an event instance is mapped to a property value with a null text reference. The segments of text that NetOwl Extractor identifies as mentions of an event are generally descriptive in nature, so they can be handled in the same way as descriptive mentions of an entity. The event mention is mapped to a value of a description property with a text reference corresponding to the mention text identified by NetOwl. There is also the issue of deciding whether, for a given event instance, NetOwl Extractor has identified sufficient information for it to be worthwhile translating the instance to the knowledge base. This is similar to the issue faced with entity mapping discussed previously. In the case of event mapping, there are two aspects to consider. The first aspect to consider is whether an important attribute of the event has been identified. For example, the default Link and Event rule base will recognise any occurrence of the word "attack" as an instance of the event class event:conflict:attack_target. However, the text will not necessarily identify who is the attacker or target involved. Even if the text does specify the attacker and target, NetOwl may fail to recognise this information. Secondly, when an attribute of an event is a reference to an entity (e.g. the attacker or target in the case of an attack event), there is the consideration of whether that entity has been identified to a sufficient level. The assessment of this will depend on the purpose for which the knowledge base is being constructed.
Continuing the example, for some purposes the recognition of an attack event is not useful unless the attacker is recognised and identified by name. But in other situations, it may be that an attack event where the attacker is only identified by description (e.g. a man on the back of a donkey) is still a useful addition to the knowledge base. When an attribute of an event is a reference to a named entity, then that entity will have been mapped to an entity object in the DocXML translation. Hence the appropriate translation of the attribute of the event that is a reference to that named entity is a link in the Palantir knowledge base, where the event is the parent of the link and the entity is the child of the link. When an attribute of an event is a reference to an entity that is only mentioned by description, the appropriate translation is to make a property on the event object with the value of the property being the description of the entity. To facilitate flexibility with regard to these issues, the approach taken was to develop a simple file format that allowed specification of the choices made. This allows the translation process to be tailored to the circumstances and purpose of the knowledge base being generated. An example is shown below:



event:conflict:attack_target
dsto.c3id.ia.object.ConflictEvent
dsto.c3id.ia.property.Description
dsto.c3id.ia.property.EventType Attack Target
target = reqnamed / dsto.c3id.ia.link.Target / -
attacker = req / dsto.c3id.ia.link.Attacker / dsto.c3id.ia.property.Attacker
weapon = opt / - / dsto.c3id.ia.property.Weapon
time = opt / - / dsto.c3id.ia.property.Time
place = opt / dsto.c3id.ia.link.Place / dsto.c3id.ia.property.Place

Figure 1 Example configuration for translating events

The translation code makes use of this format as follows:

The first line specifies the NetOwl ontology event type that the translation details following apply to.

The second line specifies the Palantir ontology object type that the NetOwl type is mapped to.

The third line specifies the Palantir ontology property type that is used for mapping the descriptive mentions of the event instance from NetOwl's ontology to property values in the Palantir ontology.

The fourth line specifies the property type that is used to capture the third level of the NetOwl ontology hierarchy as a property value rather than a subtype.

For example, consider a document containing the text "Yesterday a group of tribal militants mounted an attack on Yemeni President Ali Abdullah Saleh." Assume that NetOwl has recognised this sentence as describing an event:conflict:attack_target event, and specifically identified the phrase "mounted an attack" as the mention of the event. In this case, the translation based on the specification in Figure 1 would result in a Palantir event object of type dsto.c3id.ia.object.ConflictEvent with two property values. There would be a property of type dsto.c3id.ia.property.EventType with a value of "Attack Target", and a property of type dsto.c3id.ia.property.Description with a value "mounted an attack". In Figure 1 the fifth and subsequent lines describe how attributes of the event identified by NetOwl are to be mapped into the Palantir ontology. Each line contains four tokens, with the following meaning.

The first token is the name of the attribute in NetOwl's ontology.

The second token specifies the level of information required for this attribute before the event instance recognised by NetOwl will be mapped over to the Palantir knowledge base. The keyword reqnamed means the attribute is required and must refer to an entity that has been identified by name, req means the attribute is required but may refer to an entity that has only been identified by description, while opt means the attribute is optional and its absence does not result in the event instance being discarded for being too information-poor.



The third and fourth tokens specify, respectively, the link type to use when the attribute refers to an entity that has been mapped into the Palantir knowledge base as an object, and the property type to use when the attribute refers to an entity that has not been mapped into the Palantir knowledge base.

Continuing the example started above, assume that NetOwl has identified the phrase "a group of tribal militants" as an entity mentioned by description, and the attacker attribute of the event refers to this entity. Also assume that "Ali Abdullah Saleh" has been identified as a person mentioned by name, and the target attribute of the event refers to this entity. Further assume that "Yesterday" has been identified as a temporal entity, and the time attribute of the event refers to this entity. Following the specification in Figure 1, the translation will be as follows. There will be a link of type dsto.c3id.ia.link.Target created from the event object to the entity object corresponding to the name mention of Ali Abdullah Saleh. There will be a property of type dsto.c3id.ia.property.Attacker that contains the value "a group of tribal militants", and a property of type dsto.c3id.ia.property.Time that contains the value "Yesterday".
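As an illustration of how one attribute line of the Figure 1 format could be read and applied, the following Java sketch parses a line into its four tokens and evaluates the reqnamed / req / opt requirement. It is an assumption-laden sketch, not the translation code from the experiment: the names Requirement and AttributeRule are hypothetical, and treating "-" as "no type given" is inferred from the example rather than stated in the report.

```java
import java.util.Locale;
import java.util.Optional;

/** The three requirement keywords used in the second token of an attribute line. */
enum Requirement { REQNAMED, REQ, OPT }

/** One attribute line of the Figure 1 format; "-" is read as "no type given" (an assumption). */
record AttributeRule(String attribute, Requirement requirement,
                     Optional<String> linkType, Optional<String> propertyType) {

    /** Parses a line such as "target = reqnamed / dsto.c3id.ia.link.Target / -". */
    static AttributeRule parse(String line) {
        String[] nameAndRest = line.split("=", 2);
        if (nameAndRest.length != 2) {
            throw new IllegalArgumentException("Missing '=': " + line);
        }
        String[] parts = nameAndRest[1].split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three '/'-separated fields: " + line);
        }
        return new AttributeRule(
                nameAndRest[0].trim(),
                Requirement.valueOf(parts[0].trim().toUpperCase(Locale.ROOT)),
                optional(parts[1]),
                optional(parts[2]));
    }

    private static Optional<String> optional(String token) {
        String t = token.trim();
        return t.equals("-") ? Optional.empty() : Optional.of(t);
    }

    /**
     * Decides whether an event instance survives this rule.
     * present: NetOwl filled the attribute; named: the referenced entity has a name mention.
     */
    boolean satisfiedBy(boolean present, boolean named) {
        return switch (requirement) {
            case REQNAMED -> present && named;
            case REQ -> present;
            case OPT -> true;
        };
    }
}
```

Under this reading, an event instance would be translated only if every rule for its event type is satisfied; for each attribute that is present, the link type would be used when the referenced entity was mapped as an object and the property type otherwise.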

2.3.3 Mapping Links

The primary issue to resolve in mapping links from NetOwl Extractor output to a Palantir knowledge base is that there is a fundamental difference between the way links are treated by the two systems. In NetOwl output, links are a first-class element of the data model; that is, a link instance has an identity, can be referred to by its identity, and can in principle have any number of attributes. However, in the data model of the Palantir knowledge base, a link is a second-class element; that is, an instance of a link is always attached to a parent object and the link does not itself have an identity, so it cannot be referred to and it cannot have any properties associated with it. In effect, a link in a Palantir knowledge base is a special kind of property that contains a value that is always interpreted as a reference to another object. Viewed in graph-theoretic terms, in NetOwl output a link is a node that has outgoing directed edges to other nodes that it is linking together. But in Palantir a link is itself a directed edge from one node to another. This difference is elucidated in an example shown in Figure 2.


[Figure 2 diagram. Example text: "Alice is an associate of Bob." Knowledge base structure expected by Palantir: the objects Person Name=Alice and Person Name=Bob joined directly by an Associate link. NetOwl output structure: a link:person:person_associate node with outgoing edges labelled person and associate to the nodes entity:person name=Alice and entity:person name=Bob.]

Figure 2 Difference in nature of links between NetOwl and Palantir

Given the above consideration, there are two approaches available:

1. Directly map the heavy-weight representation in NetOwl output into Palantir. This can be achieved by creating object types in a Palantir ontology that correspond to the link types in the NetOwl ontology.

2. Translate the NetOwl output to the light-weight representation that is natural for the data model of Palantir knowledge bases.

The advantage of the first option is that all the information contained in the NetOwl output is mapped across to the Palantir knowledge base. The disadvantage is that the resulting content in the Palantir knowledge base is not in the natural structure that is assumed by the Palantir Workspace application that analysts would use to access, visualise, and analyse the content. Conversely, the disadvantage of the second option is the potential to lose information when a link is translated from a first-class to a second-class data model element, while the advantage is the production of Palantir knowledge base content that is in the structure expected by the analyst-facing components of the Palantir system. Both of the options listed above were tested, and the conclusion was that the second option was preferable. Firstly, in practice there was little actual information lost by translating from NetOwl's first-class links to Palantir's second-class links. Of the link types that the Link and Event rule base can recognise, only two types actually have attributes that would need to be discarded by the translation process. In testing it was found that these two link types typically constituted less than 1.2% of the link instances recognised. Secondly, it was found that the Palantir Workspace application did not have sufficient user interface mechanisms to allow an analyst to readily accommodate the first option. The application does have a mechanism in its graph view to allow an object that represents a link between two other objects to be collapsed so that the intermediary object appears like a direct link. However, this mechanism must be invoked manually by a user and there is no convenient way for a user to direct the application to visually collapse intermediary objects to direct links in bulk. Thus it was found that, in practice, the second option provided the better cost-benefit trade-off. Given this, a flexible translation mechanism was developed based on a simple text file format similar to the mechanism described previously for handling events. An example is shown below:

link:organization:org_founder
organization
founder / dsto.c3id.ia.link.Founder / dsto.c3id.ia.property.Founder

Figure 3 Example Configuration for Translating Links

The translation code makes use of this format as follows. The first line specifies the NetOwl ontology link type that the translation details following apply to. The second line specifies the attribute of the NetOwl link that specifies the entity that will be the parent object for the translation into Palantir's knowledge base. This implicitly mandates that this attribute is both present and referring to an entity that was mentioned by name (otherwise the entity will not have been translated; see Section 2.3.1). The third line gives the attribute of the NetOwl link that specifies the entity that will be the child object for the translation. In the case where this attribute refers to an entity that was mentioned by name (and hence translated as an entity object), the NetOwl link can be translated as the Palantir link type given in the second token of the third line. However, if this attribute refers to an entity that was only mentioned by description (and hence not translated as an entity object), then the NetOwl link must be translated as a property on the parent object, and the property type to use is given in the third token of the third line. As an example, consider the specification in Figure 3 in the context of the text "The crime gang La Putatos was founded by Fred Bloggs." Assume NetOwl recognised the two entities and the link from organisation to founder. Since the two entities were mentioned by name, they will have been translated as entity objects. So the link between them would be translated as a link of type dsto.c3id.ia.link.Founder with the object representing the La Putatos entity as the parent of the link and the object representing the Fred Bloggs entity as the child of the link. In contrast, consider this in the context of the text "The crime gang La Putatos was founded by a shadowy underworld figure." In this case the founder entity has only been mentioned by description, so will not have been translated as an entity object. Thus the translation of the link will be a property added to the object representing the La Putatos entity, of type dsto.c3id.ia.property.Founder and value "a shadowy underworld figure".
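The link translation just described can be sketched in Java as follows. This is a minimal sketch under stated assumptions rather than the code used in the experiment: LinkRule, Kind, Translated and LinkTranslator are hypothetical names, object identifiers are plain strings, and a null identifier is used to mean "this entity was not mapped as an object".

```java
import java.util.Optional;

/** Three-line link rule: NetOwl link type, parent attribute, then child attribute with target types. */
record LinkRule(String netOwlLinkType, String parentAttribute, String childAttribute,
                String palantirLinkType, String palantirPropertyType) {

    /** Parses the three-line format shown in Figure 3. */
    static LinkRule parse(String text) {
        String[] lines = text.strip().split("\\R");
        if (lines.length != 3) {
            throw new IllegalArgumentException("Expected three lines");
        }
        String[] child = lines[2].split("/");
        if (child.length != 3) {
            throw new IllegalArgumentException("Expected three tokens on the third line");
        }
        return new LinkRule(lines[0].trim(), lines[1].trim(),
                child[0].trim(), child[1].trim(), child[2].trim());
    }
}

/** Whether a NetOwl link became a second-class link or a property on the parent. */
enum Kind { LINK, PROPERTY }

/** Result of translating one link instance; the last field is a child id or a description. */
record Translated(Kind kind, String type, String parentId, String childIdOrValue) {}

final class LinkTranslator {
    private LinkTranslator() {}

    /**
     * parentId and childId are null when the entity was not mapped as an object,
     * i.e. it had no name mention. childText is the child's mention text.
     */
    static Optional<Translated> translate(LinkRule rule, String parentId,
                                          String childId, String childText) {
        if (parentId == null) {
            return Optional.empty(); // no parent object to attach anything to
        }
        if (childId != null) {
            return Optional.of(new Translated(Kind.LINK, rule.palantirLinkType(), parentId, childId));
        }
        return Optional.of(new Translated(Kind.PROPERTY, rule.palantirPropertyType(), parentId, childText));
    }
}
```

Any attributes carried by the NetOwl link instance itself are not represented in Translated, which reflects the information loss accepted when the second option was chosen.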

2.4 Importation of DocXML and Entity Resolution in Palantir

The DocXML files generated by translating NetOwl Extractor output using the process described above can easily be ingested into a Palantir knowledge base using the Workspace client application. The Import function can be used to select any number of DocXML documents and load their content into an investigation, and the imported data can then be published to the base realm (i.e. the shared knowledge base). As noted in the introduction, this part of the process was the only step that was not automated. It was decided that the effort required to import DocXML manually was quite small: click a button, use a file selection dialog to select the files, choose a few options, then click OK and wait for the process to finish. Although there are client-side APIs that allow this import process to be done automatically, for the purposes of this experiment the effort required to develop and test code was not warranted.

The primary issue encountered in this process was choosing the behaviour of the entity resolution that can optionally be performed when importing DocXML-encoded data. The NetOwl Extractor information extraction routines work on the basis of individual documents. When an entity is mentioned in more than one document, a different entity object will be created from each document. Palantir supports the resolution of multiple objects that actually describe the same logical entity into a single logical object. When multiple objects are resolved together, the properties and links of all the individual objects are combined in the resolved object. When a large number of documents is ingested and an entity is mentioned frequently, there will potentially be a large number of separate objects created to describe the one entity. To produce a knowledge base that is uncluttered and convenient for analysts to use, these objects must be resolved and fused together.

Palantir allows the creation of object resolution suites, which are sets of rules describing criteria for automatically resolving and fusing knowledge base objects that describe the same entity. These criteria are generally of the form "If two objects of type X have a matching value for property Y, assume they are describing the same logical entity and resolve them into a single object". For this experiment the criterion used was to resolve and fuse any given set of person objects that shared an exactly matching value in their name property, and similarly for place or organisation objects.
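The resolution criterion described above can be sketched outside Palantir as a grouping on object type and exact name. This is an illustrative sketch under assumptions: objects are modelled as plain dictionaries, and the function name `resolve_by_exact_name` is hypothetical. It does not use or reproduce Palantir's object resolution suite mechanism.

```python
from collections import defaultdict


def resolve_by_exact_name(objects):
    """Fuse objects that share the same type and an exactly matching name.

    Each object is a dict with keys 'type', 'name', 'properties' (a dict
    mapping property type to a list of values) and 'links' (a list).
    The properties and links of all fused objects are combined.
    """
    groups = defaultdict(list)
    for obj in objects:
        groups[(obj["type"], obj["name"])].append(obj)

    resolved = []
    for (obj_type, name), members in groups.items():
        properties = {}
        links = []
        for member in members:
            for prop_type, values in member.get("properties", {}).items():
                merged_values = properties.setdefault(prop_type, [])
                for value in values:
                    if value not in merged_values:
                        merged_values.append(value)
            for link in member.get("links", []):
                if link not in links:
                    links.append(link)
        resolved.append({"type": obj_type, "name": name,
                         "properties": properties, "links": links})
    return resolved


# Two documents each mention an organisation called "Defence Ministry".
extracted = [
    {"type": "organisation", "name": "Defence Ministry",
     "properties": {}, "links": [("location", "France")]},
    {"type": "organisation", "name": "Defence Ministry",
     "properties": {}, "links": [("location", "Tajikistan")]},
]
for obj in resolve_by_exact_name(extracted):
    print(obj["name"], obj["links"])
```

Because the only evidence used is the name string, the rule is cheap to apply but cannot distinguish different entities that share a name.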



3. Observations

The process described above was applied to 2000 documents from the trial corpus, using QuickStart version 3.3.1.1 of Palantir. Although the corpus was much larger, ingestion was halted at 2000 documents due to the performance issues that were encountered. It was also found that there were serious quality issues with the knowledge base being constructed through the process. These observations are elaborated upon below.

3.1 Poor Quality Knowledge Base

The knowledge base that was constructed was generally of poor quality. This appeared to be due to a compounding effect from three primary sources of error. Firstly, some of the source documents were malformed. Secondly, errors were made in the information extraction by NetOwl (both false positives and false negatives). Thirdly, there were errors in entity resolution and fusion at the intra-document level by NetOwl, and at the inter-document level by Palantir. Examples of some of the types of error seen are given below.

3.1.1 Source Document Artifacts Confusing Information Extraction

Information extraction tools generally struggle to correctly interpret documents if the text deviates from modes of expression that the tool expects. For example, exploration of the knowledge base revealed a person entity with a name value "Abd" and a description value "Former State Minister". It was found that this entity came from a document in the trial corpus that gave a synopsis of a Turkish television news broadcast. The format of the document was a series of numbered points, which included the following: "FORMER STATE MINISTER ABD(U DIERESIS)LHALIK (C CEDILLA)AY HAS [...]". This can be explained as the translator capturing the correct orthography and pronunciation of the Turkish name Abdülhalik Çay in a document that can only contain standard English letters. Turkish orthography is more phonemic than English in that the spelling of a word completely determines its pronunciation; hence ü and u are considered distinct letters with different pronunciations, as are ç and c. A human analyst marking up this document could easily recognise the intent of the translator; however, an automatic information extraction routine fails to understand the translator's intent and does not recognise the text correctly.
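To make the nature of this artifact concrete, the following sketch shows how markers of the form seen in the example could be rewritten as the intended letters before text is passed to an extraction tool. No such pre-processing was performed in the experiment. The marker pattern, the two-entry marker table, and the function name `restore_diacritics` are assumptions based solely on the single example quoted above.

```python
import re
import unicodedata

# Combining characters for the two marker names seen in the example text.
# Other marker names may exist in the corpus; they are not covered here.
COMBINING = {
    "DIERESIS": "\u0308",  # combining diaeresis
    "CEDILLA": "\u0327",   # combining cedilla
}

# Matches markers such as "(U DIERESIS)" or "(C CEDILLA)".
MARKER = re.compile(r"\(([A-Z]) (DIERESIS|CEDILLA)\)")


def restore_diacritics(text: str) -> str:
    """Replace transliteration markers with the accented letter they describe."""
    def substitute(match: re.Match) -> str:
        base, mark = match.group(1), match.group(2)
        # Compose base letter + combining mark into one precomposed character.
        return unicodedata.normalize("NFC", base + COMBINING[mark])
    return MARKER.sub(substitute, text)


print(restore_diacritics("FORMER STATE MINISTER ABD(U DIERESIS)LHALIK (C CEDILLA)AY HAS"))
```

Whether an extraction tool would then recognise the restored name correctly is a separate question that this sketch does not address.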

3.1.2 Erroneous Named Entity Recognition

In addition to the source of confusion noted above, the named entity recognition routines in information extraction tools can fail to correctly parse even well-formed text. Examples seen in the trial corpus include:

Text: "JAMMU-KASHMIR". Entity recognition result: a place entity named "Kashmir"; i.e. the "Jammu-" fragment was not recognised as part of the place name.



Text: "IN 2001 USAMA BIN LADIN ROSE TO PROMINENCE [...]". Entity recognition result: a person entity named "Usama Bin Ladin Rose"; i.e. a verb following the name was erroneously considered part of the name.

In these examples, and many more observed in the results from the trial corpus, fragments of text adjacent to a name were erroneously included in the name, or fragments of a name separated by punctuation marks were omitted from the name.

3.1.3 Erroneous Named Entity Resolution

The experiment's naïve approach to the resolution and fusion of entities between documents led to numerous errors as well. There were cases where recognised entities in different documents should have been resolved but were not, and conversely cases where resolution occurred but should not have. Examples of both these forms of error can be seen in the diagram shown in Figure 4.

Figure 4 Example of erroneous and inadequate knowledge base content - the Defence Ministry of multiple countries being treated as a single organisation.



Figure 4 shows the result of using the graph visualisation component of the Palantir Workspace to search for an organisation entity named "Defence Ministry" and display entities related to it. The trial corpus contained many documents discussing entities named "Defence Ministry". In particular, there were documents discussing the Defence Ministry of France, of Tajikistan, and of Afghanistan. But the naïve entity resolution based only on name has combined these entities. The result is a knowledge base that contains a single "Defence Ministry" entity linked to numerous other entities including the countries of France and Tajikistan, and the persons of Alain Richard (French Defence Minister at the time), Sherali Khayrulloyev (Tajik Defence Minister at the time), Emomali Rahmonov (Tajik President at the time), Mohammad Qasim Fahim (Afghan Defence Minister at the time), and Abdul Rashid Dostam (Afghan Deputy Defence Minister at the time).

Also apparent in Figure 4 is the way naïve fusion based on resolving objects with an exact name match can fail when a non-English name is transcribed into English in different ways in multiple documents. "Mohammad Qasim Fahim" and "Mohammad Qasem Fahim" are two of the many possible romanisations of the same name, but exact name matching does not recognise this. Similarly, the entities labelled "Sherali Khayrulloyev" and "Khayrulloyev" should be resolved, but when some documents refer to a person exclusively by last name and others refer to the person exclusively by full name, exact name matching will not resolve the entities.
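The name pairs discussed above can be used to illustrate why exact matching under-resolves. The sketch below compares exact equality with two simple alternative signals: a character-level similarity ratio from Python's standard `difflib` module, and a check on the final name token. These alternatives were not used in the experiment and are shown only to make the failure concrete; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(name_a: str, name_b: str) -> bool:
    """The criterion used in the experiment: names must be identical."""
    return name_a == name_b


def similarity(name_a: str, name_b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def shares_last_token(name_a: str, name_b: str) -> bool:
    """True if the final whitespace-separated tokens agree, ignoring case."""
    return name_a.split()[-1].lower() == name_b.split()[-1].lower()


pairs = [
    ("Mohammad Qasim Fahim", "Mohammad Qasem Fahim"),  # romanisation variants
    ("Sherali Khayrulloyev", "Khayrulloyev"),          # full name vs last name
]
for a, b in pairs:
    print(a, "|", b, "| exact:", exact_match(a, b),
          "| similarity:", round(similarity(a, b), 2),
          "| same last token:", shares_last_token(a, b))
```

Looser signals of this kind would not help with the opposite failure: the "Defence Ministry" objects match exactly and are still different organisations, so any relaxation of the name criterion risks increasing over-resolution.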

3.2 Performance Issues

Two particular performance issues were observed during the experiment. The first was that the Palantir QuickStart edition, which uses the open source database software PostgreSQL as its backend, performed poorly when running on a machine that also had anti-virus software active. In particular, I/O throughput and CPU utilisation on a multi-core CPU were poor (most of the time only one core would be active). Investigation of the issue suggested that PostgreSQL was interacting poorly with the anti-virus scanning software. This was raised with a Palantir Technologies engineer, who confirmed this was a known issue; their recommendation was to configure the anti-virus scanning software to ignore the data directory used by PostgreSQL. However, in the context of this experiment it was not possible to verify the extent to which this would resolve the observed performance issue.

The second issue was that as increasing numbers of documents processed by NetOwl were ingested into the Palantir knowledge base, the performance of the system steadily decreased. Most notably, the time taken to ingest a batch of documents and publish the ingested knowledge to the core knowledge base rapidly increased. The first batch of 200 documents was ingested and published in less than 10 minutes, while the fifth batch took over 1.5 hours to process, and the tenth batch approximately 5 hours. Also, as the number of documents ingested into the system increased, the time taken by the Workspace application to display any given document increased. After 2000 documents had been ingested, loading and displaying any given document within the Workspace would take as long as 30 seconds, compared to less than 10 seconds when fewer than 1000 documents had been ingested.

Investigation of the causes of this performance degradation, including an examination of the database structure used by the software, the architecture of the client application, and the way these two components interact, led to the conclusion that the Palantir software had not been designed and optimised to support the usage model of this experiment. Specifically, the system was not designed to store large quantities of documents where each document has a large number of automatically generated tags concerning the entities, events, and links mentioned in the document. This suggests that this performance issue would also be observed, albeit to a lesser extent, in the full version of the software that runs over Oracle databases.
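The batch timings reported in Section 3.2 can be restated as a per-document cost to show the scale of the degradation. This is simple arithmetic on the reported figures, with two assumptions: every batch contained 200 documents (consistent with ten batches totalling 2000 documents), and the reported bounds ("less than 10 minutes", "over 1.5 hours", "approximately 5 hours") are treated as point values.

```python
BATCH_SIZE = 200  # assumed constant across batches

# Batch number -> reported time in minutes, treated as point values.
batch_minutes = {1: 10, 5: 90, 10: 300}


def seconds_per_document(minutes: float, batch_size: int = BATCH_SIZE) -> float:
    """Average ingest-and-publish time per document for one batch."""
    return minutes * 60 / batch_size


for batch, minutes in batch_minutes.items():
    print(f"batch {batch}: about {seconds_per_document(minutes):.0f} s per document")

slowdown = batch_minutes[10] / batch_minutes[1]
print(f"batch 10 took roughly {slowdown:.0f} times as long as batch 1")
```

Under these assumptions the per-document cost grows from about 3 seconds in the first batch to about 90 seconds in the tenth, so the cost of each additional batch grows faster than linearly with the size of the knowledge base.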

4. Conclusion

The mapping from NetOwl output to a Palantir knowledge base via DocXML worked well. There are some issues with the expressivity of the Palantir knowledge base and DocXML data models, but in practice there did not appear to be much impedance mismatch, and the information useful for knowledge base construction that was produced by NetOwl Extractor could be mapped over.

However, the general approach taken is not suitable for use in practice. Notably, it is a usage model of the Palantir system that the software does not appear to have been designed to support efficiently. Even if this were not the case, the performance of NetOwl Extractor when used "out of the box", combined with simplistic entity resolution, produces a low-quality knowledge base. It seems unlikely that an analyst would find the knowledge base useful, as any content used for analysis would need to be rigorously inspected and validated. Given this requirement for effort on the analyst's part, it would seem more sensible for that effort to be spent creating a focused, high-quality knowledge base by hand.

Page classification: UNCLASSIFIED

DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION

DOCUMENT CONTROL DATA 1. PRIVACY MARKING/CAVEAT (OF DOCUMENT)

2. TITLE Mostly-Automatic Construction of a Palantir Knowledge Base with NetOwl Extractor

3. SECURITY CLASSIFICATION (FOR UNCLASSIFIED REPORTS THAT ARE LIMITED RELEASE USE (L) NEXT TO DOCUMENT CLASSIFICATION) Document (U) Title (U) Abstract (U)

4. AUTHOR(S) Matthew C. Lowry

5. CORPORATE AUTHOR DSTO Defence Science and Technology Organisation PO Box 1500 Edinburgh South Australia 5111 Australia

6a. DSTO NUMBER DSTO-GD-0748

6b. AR NUMBER AR-015-636

6c. TYPE OF REPORT General Document

7. DOCUMENT DATE June 2013

8. FILE NUMBER 2013/1012539/1

9. TASK NUMBER 07/329

10. TASK SPONSOR DCDS(I&WS)

11. NO. OF PAGES 14

12. NO. OF REFERENCES 0

13. DSTO Publications Repository http://dspace.dsto.defence.gov.au/dspace/

14. RELEASE AUTHORITY Chief, Command, Control, Communications and Intelligence Division

15. SECONDARY RELEASE STATEMENT OF THIS DOCUMENT Approved for public release

OVERSEAS ENQUIRIES OUTSIDE STATED LIMITATIONS SHOULD BE REFERRED THROUGH DOCUMENT EXCHANGE, PO BOX 1500, EDINBURGH, SA 5111

16. DELIBERATE ANNOUNCEMENT No Limitations

17. CITATION IN OTHER DOCUMENTS Yes

18. DSTO RESEARCH LIBRARY THESAURUS Knowledge management, Natural language processing, Information extraction, Information fusion
