Page 1: AIRFrame - hawaii.edu

AIRFrame: Astrobiology Integrative Research Framework

Lisa Miller and Rich Gazan

Department of Information and Computer Sciences

UH-NASA Astrobiology Institute

[email protected], [email protected]

September 13, 2010

Page 2: AIRFrame - hawaii.edu

Overview

● AIRFrame project rationale
● AIRFrame project goals
● Current project status
– Textpresso system
  ● Ontology
  ● Database
● Current and future work direction
– Clustering/Classification
– Visual representations of data
– Graphical interface

Page 3: AIRFrame - hawaii.edu

Project Rationale: Nature of Astrobiology

● System-level science
– Concerned with complex, multidisciplinary, multi-phenomena behaviors of large physical and biological systems
– Information technology needed to consolidate and represent knowledge and data across many disciplines

● New field
– No centralized repositories of knowledge
– No established, standardized vocabulary

Page 4: AIRFrame - hawaii.edu

Project Rationale: Interdisciplinary Collaboration

● Proven to be difficult
– Different disciplines have different:
  ● Vocabularies/terminologies
  ● Methods and formats for sharing and presenting research
  ● Assumed levels of precision
– Institutional and cultural boundaries

● Concern about duplication of research
– Lack of access to existing knowledge
– Lack of discipline-specific knowledge to efficiently access what is available

Page 5: AIRFrame - hawaii.edu
Page 6: AIRFrame - hawaii.edu

A computer sees only your footprints: published research, websites, graphic records…

● Inputs: words, data, images…

● Processes: algorithms, categorizations, associations…

● Outputs: representations

Probabilistic, not deterministic

Page 7: AIRFrame - hawaii.edu

One way a computer sees NASA Astrobiology on the Web…

● Watch it happen: MIT Personas project
– http://personas.media.mit.edu/personasWeb.html

● Inputs: words from Web documents

● Processes: harvest words, associate them with color-coded broader concepts

● Outputs: representation of a person, organization etc. by its proportion of constituent components

Page 8: AIRFrame - hawaii.edu

Personas

Page 9: AIRFrame - hawaii.edu

Another way computers see astrobiology…

● As of April 2010, under 10K scholarly publications indexed with astrobiology and related terms across major scholarly databases

– NASA Astrophysics Data System (ADS): 3717
– Web of Science: 1092
– Meteorological & Geoastrophysical Abstracts: 496
– ScienceDirect: 3638

Page 10: AIRFrame - hawaii.edu

Astrobiology: how interdisciplinary?

● One (research) question

● Multiple scientists, datasets, disciplines, literatures, vocabularies…and ways to measure interdisciplinarity

Page 11: AIRFrame - hawaii.edu

Astrobiology publication profile

● Subject areas of the 1,000 most recent publications indexed with “astrobiology” (ISI Web of Science)

Page 12: AIRFrame - hawaii.edu

Vocabulary problem

● Same concept, different words

● New fields like astrobiology have low query term overlap (QTO) until authors, indexers, journals, mapping algorithms etc. ‘catch up’

● QTO: ratio of docs findable with a given query term to the number of docs on the topic

Web of Science, April 2010
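
The QTO definition above can be made concrete with a small sketch. The document IDs and the term-to-document index below are invented for illustration; they are not taken from Web of Science, and the function name is hypothetical.

```python
def query_term_overlap(findable_docs, topic_docs):
    """Ratio of on-topic documents findable with one query term
    to all documents on the topic."""
    topic = set(topic_docs)
    if not topic:
        raise ValueError("topic_docs must not be empty")
    return len(set(findable_docs) & topic) / len(topic)


# Hypothetical corpus: ten documents on one topic, indexed under different terms.
topic_docs = range(1, 11)
index = {
    "astrobiology": [1, 2, 3],
    "exobiology": [3, 4],
    "origin of life": [5, 6, 7, 8, 9],
}

for term, docs in index.items():
    print(f"{term}: QTO = {query_term_overlap(docs, topic_docs):.1f}")
```

In this toy index no single term retrieves more than half of the on-topic documents, which is the situation the slide describes for a young field.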

Page 13: AIRFrame - hawaii.edu

Keyword searches on Elsevier's ScienceDirect using some astrobiologically relevant synonyms

Keywords                 Number of results   An article found only with this search
“prebiotic synthesis”    566                 The origin of life – a review of facts and speculations
“prebiotic chemistry”    611                 A possible circular RNA at the origin of life
“abiotic synthesis”      312                 Recent advances in the chemical evolution and the origins of life
“organic synthesis”      53,434              Prebiotic organic synthesis in early Earth and Mars atmospheres

Keyword Searching Inadequate without a well-defined vocabulary

Page 14: AIRFrame - hawaii.edu

Ontology

● Any arbitrary knowledge structure

● Which elements are included and excluded?

● How are the elements represented and related?

Page 15: AIRFrame - hawaii.edu

NASA ADS: over-associated ontology

Page 16: AIRFrame - hawaii.edu

Web of Science: broad ontology

Page 17: AIRFrame - hawaii.edu

ScienceDirect: narrow ontology

Page 18: AIRFrame - hawaii.edu

AIRFrame Project Goals

● Clustering not cloistering:
– Discover and relate diverse research concepts as a high level activity

● Eliminate the need to search for data using combinations of specific keywords

● Show users conceptual and functional relationships between diverse research documents

Astrobiology Integrative Research Framework

Page 19: AIRFrame - hawaii.edu

Textpresso-based

● Open-source information retrieval and extraction system

– Developed at Caltech for biological researchers

– Deployed in 17 different focused literatures

● Two major components

– Ontology cataloging types of objects, abstract concepts, and relationships

– Database of documents marked up with XML based on the ontology

Current status: proof-of-concept system in place

Page 20: AIRFrame - hawaii.edu

Textpresso Ontology

● A shallow ontology catalogs terms

Concepts, Descriptions, Relationships

Purpose: ● Fulfill ● Make ● Govern ● Produce ● ...

NucleicAcid: ● Adenine ● Cytosine ● DNA ● Guanine ● ...

Comparison: ● Dissimilar ● Equal sized ● Like ● Related ● ...

Page 21: AIRFrame - hawaii.edu

Textpresso Database: Articles marked up with XML

All Documents

Single Document

Abstract

Body Text

Authors

Title

Individual sentences

We report the synthesis of glycine on interstellar ice-analog films composed of water, methylamine (MA), and carbon dioxide ...

XML Markup

We report the <biologicalprocess>synthesis</biologicalprocess> of <aminoacid>glycine</aminoacid> on interstellar ice-analog films <action>composed</action> of <volatile><water>water</water></volatile>, methylamine ( MA ), and <volatile><gases><chemicalelement>carbon</chemicalelement> dioxide</gases></volatile>

Ontology
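
The markup step shown on this slide can be sketched roughly as follows. This is not Textpresso's implementation: the `ONTOLOGY` dictionary and the `mark_up` function are hypothetical, the ontology is a toy subset, and each term gets a single tag (the nested tags in the slide's example are not reproduced).

```python
import re
from xml.sax.saxutils import escape

# Toy ontology (hypothetical): category name -> terms in that category.
ONTOLOGY = {
    "biologicalprocess": ["synthesis"],
    "aminoacid": ["glycine"],
    "action": ["composed"],
    "water": ["water"],
}


def mark_up(sentence, ontology):
    """Wrap every whole-word ontology term in a tag named after its category."""
    term_to_category = {
        term.lower(): category
        for category, terms in ontology.items()
        for term in terms
    }
    # Try longer terms first so multi-word terms win over shorter ones.
    alternatives = sorted(term_to_category, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )
    pieces, last = [], 0
    for match in pattern.finditer(sentence):
        pieces.append(escape(sentence[last:match.start()]))
        category = term_to_category[match.group(1).lower()]
        pieces.append(f"<{category}>{escape(match.group(1))}</{category}>")
        last = match.end()
    pieces.append(escape(sentence[last:]))
    return "".join(pieces)


print(mark_up("We report the synthesis of glycine on films composed of water", ONTOLOGY))
```

Text outside the matched terms is XML-escaped so that characters such as `&` or `<` in an article cannot break the markup.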

Page 22: AIRFrame - hawaii.edu

Textpresso system query

Allows search by meaning rather than specific keyword

Search by:

● keyword,
● keyword + synonyms,
● and/or whole categories

Ontology categories

Page 23: AIRFrame - hawaii.edu

System Query - Results

Results ranked by number of sentences with matching terms

[Screenshot callouts: Keyword, Synonym, Category 1, Category 2]
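
The query modes and the ranking rule described on the last two slides can be illustrated with a minimal sketch. The synonym table, category, corpus, and function names are all hypothetical, and the real system's query handling may differ; the only behavior taken from the slides is searching by keyword, keyword plus synonyms, and/or category, and ranking by the number of matching sentences.

```python
import re

# Hypothetical data: a synonym table, one ontology category, a tiny corpus.
SYNONYMS = {"prebiotic synthesis": ["abiotic synthesis", "prebiotic chemistry"]}
CATEGORIES = {"aminoacid": ["glycine", "alanine"]}
CORPUS = {
    "doc_a": [
        "Glycine forms on ice films.",
        "Abiotic synthesis is plausible.",
        "Alanine was also detected.",
    ],
    "doc_b": ["Prebiotic synthesis remains debated."],
    "doc_c": ["Dust grains scatter light."],
}


def expand_query(keyword=None, use_synonyms=False, category=None):
    """Build the set of search terms from a keyword, its synonyms, and/or a category."""
    terms = set()
    if keyword:
        terms.add(keyword.lower())
        if use_synonyms:
            terms.update(s.lower() for s in SYNONYMS.get(keyword.lower(), []))
    if category:
        terms.update(t.lower() for t in CATEGORIES.get(category, []))
    return terms


def rank(corpus, terms):
    """Rank documents by the number of sentences containing any search term."""
    if not terms:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    hits = []
    for doc, sentences in corpus.items():
        count = sum(1 for sentence in sentences if pattern.search(sentence))
        if count:
            hits.append((doc, count))
    # Most matching sentences first; ties broken by document name.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


print(rank(CORPUS, expand_query("prebiotic synthesis")))
print(rank(CORPUS, expand_query("prebiotic synthesis", use_synonyms=True, category="aminoacid")))
```

With the keyword alone only `doc_b` is found; adding synonyms and the `aminoacid` category also surfaces `doc_a`, which never uses the original phrase.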

Page 24: AIRFrame - hawaii.edu

Textpresso adoption challenges

● Other implementations use preexisting ontologies

● Other implementations are narrowly focused

● Many biological journals have open full-text access through a single source

Page 25: AIRFrame - hawaii.edu

Textpresso adoption challenges

● Current interface is rudimentary (at best)
– Ugly!
– Not intuitive
– Results spread across pages: hard to see overall picture

Page 26: AIRFrame - hawaii.edu

Current Work

● To allow more breadth to the ontology
– Convert to the SKOS standard (a W3C Recommendation), as used by Vocabulary Explorer at IVOA
– Adapt Textpresso to directly read SKOS ontology

● Standards-based ontology can
– Ease porting existing ontologies to ours
– Allow easy sharing of our data with other systems

Ontology Standards Adoption

Page 27: AIRFrame - hawaii.edu

Why SKOS?

● Allows fast addition of terms from existing sources such as the International Astronomical Union Thesaurus and IVOA.

● Created to allow conversion from other formats

● SKOS allows the kind of relationships we want to leverage with AIRFrame
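
As a rough illustration of the relationships SKOS offers, the sketch below serializes one concept as SKOS in Turtle syntax using plain string formatting. The `skos:prefLabel`, `skos:altLabel`, and `skos:broader` properties and the namespace are part of SKOS; the base URI, concept identifiers, the choice of broader concept, and the helper function are assumptions made for the example, not AIRFrame's actual vocabulary.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"


def skos_concept(base, concept_id, pref_label, alt_labels=(), broader=()):
    """Serialize one concept as SKOS Turtle, with labels tagged as English."""

    def literal(text):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@en'

    lines = [
        f"<{base}{concept_id}> a skos:Concept",
        f"    skos:prefLabel {literal(pref_label)}",
    ]
    lines += [f"    skos:altLabel {literal(label)}" for label in alt_labels]
    lines += [f"    skos:broader <{base}{parent}>" for parent in broader]
    return " ;\n".join(lines) + " .\n"


# Hypothetical concept: synonyms become altLabels, the parent becomes skos:broader.
print(SKOS_PREFIX + skos_concept(
    "http://example.org/airframe#",
    "prebioticSynthesis",
    "prebiotic synthesis",
    alt_labels=["abiotic synthesis", "prebiotic chemistry"],
    broader=["chemicalEvolution"],
))
```

Synonyms map naturally onto `skos:altLabel`, and category membership onto `skos:broader`, which is the kind of structure a keyword-plus-synonyms or category search needs.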

Page 28: AIRFrame - hawaii.edu

Database Building

Workflow

[Workflow diagram nodes: New articles; AIRFrame ontology; Mark up documents & index; Mine for new terms; Outside ontologies; AIRFrame searchable database]

Page 29: AIRFrame - hawaii.edu

Document Classification

● Use machine learning methods to discover connections in the data

– Possibly use several to give users many views.

● Developing a classifier will allow a user to enter a document and get related ones back

● My research area is information-theoretic clustering

– Information Bottleneck Method

Page 30: AIRFrame - hawaii.edu

Information Bottleneck Method

● Based on Shannon's Theory of Mutual Information
– Measures the reduction in uncertainty or the distortion between an original signal and its compressed representation
– Where uncertainty about an entity is its entropy

$$I[x, y] = H[x] - H[x \mid y]$$

$$H[x] = \sum_{x}^{N} p(x) \log \frac{1}{p(x)}$$
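
The two quantities on this slide can be computed directly for a small discrete distribution. The sketch below assumes base-2 logarithms (bits), which the slide does not specify, and the function names and example joint distributions are illustrative only.

```python
import math


def entropy(p):
    """H[x] = sum_x p(x) * log2(1 / p(x)); zero-probability outcomes contribute 0."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)


def mutual_information(joint):
    """I[x, y] = H[x] - H[x|y], with joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(column) for column in zip(*joint)]
    h_x_given_y = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            # H[x | y = y_j], weighted by p(y_j)
            h_x_given_y += py * entropy([row[j] / py for row in joint])
    return entropy(p_x) - h_x_given_y


perfectly_correlated = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(perfectly_correlated))  # knowing y removes all uncertainty about x
print(mutual_information(independent))           # knowing y removes none
```

When x and y are perfectly correlated the mutual information equals the full entropy of x (one bit here); when they are independent it is zero.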

Page 31: AIRFrame - hawaii.edu

Information Bottleneck Method 2

● Data are compressed so that as much information as possible about a quantity of interest (in this case, words) is retained.

● Relative entropy or K-L divergence emerges as the distortion function

● The optimal cluster assignment rule is a Boltzmann distribution

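$$\max_{p(c \mid x)} \left[ I(y, c) - T\, I(x, c) \right]$$

$$D_{KL}\left[ p(y \mid x) \,\|\, p(y \mid c) \right] = \sum_{y} p(y \mid x) \log_2 \left[ \frac{p(y \mid x)}{p(y \mid c)} \right]$$

$$p(c \mid x) = \frac{p(c)}{Z(x, T)} \exp\left( -\frac{1}{T} D_{KL}\left[ p(y \mid x) \,\|\, p(y \mid c) \right] \right)$$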
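
The assignment rule above can be evaluated for a single document as in the sketch below. It covers one assignment step only, not the full iterative information bottleneck procedure, and the word distributions, cluster priors, and function names are made up for the example.

```python
import math


def kl_divergence(p, q):
    """D_KL[p || q] in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def assign_clusters(p_y_given_x, p_c, p_y_given_c, temperature):
    """p(c|x) = p(c) / Z(x, T) * exp(-D_KL[p(y|x) || p(y|c)] / T)."""
    weights = [
        pc * math.exp(-kl_divergence(p_y_given_x, centroid) / temperature)
        for pc, centroid in zip(p_c, p_y_given_c)
    ]
    z = sum(weights)  # partition function Z(x, T) normalizes over clusters
    return [w / z for w in weights]


# Hypothetical document x: distribution over three words, p(y|x).
document = [0.7, 0.2, 0.1]
# Two hypothetical clusters with equal priors p(c) and word distributions p(y|c).
priors = [0.5, 0.5]
centroids = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]

soft = assign_clusters(document, priors, centroids, temperature=1.0)
sharp = assign_clusters(document, priors, centroids, temperature=0.1)
print(soft)
print(sharp)
```

The document is closest (in K-L divergence) to the first cluster, so that cluster receives most of the probability; lowering T pushes the assignment toward a hard choice.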

Page 32: AIRFrame - hawaii.edu

Information Bottleneck Clustering: Preliminary Results

Database clustered into 16 clusters

[Bubble chart: axes are cluster number and number of different journals; bubble size = number of documents in cluster]

Page 33: AIRFrame - hawaii.edu

Information Bottleneck Clustering: Preliminary Results

[Chart "Cluster Journal Disciplines": share of journal disciplines (0% to 100%) for each cluster number, 1 to 16. Legend: Other, Compute/Electronics, Biophysics, Chem, Astronomy/Astrophysics, Geo]

Page 34: AIRFrame - hawaii.edu

Where are we going with this?

● Visualizations!

Page 35: AIRFrame - hawaii.edu

Future Plans

● Create a cleaner, browsable interface which displays results and links in an easily understandable way

● Allow users to input a document and get back information such as:

– Connections to other researchers

– Relevant NASA/NAI goals

– Related articles

Page 36: AIRFrame - hawaii.edu

How do we get there?

● Continue work on automatic document classification/clustering

– Focus on top-level classification first

– Need at least an idea of how data fits into classes for visualization planning

– Look at relationships that arise from unsupervised methods and how to represent those.

● Continue to gather documents for the database

● Continue to build and improve the ontology

Page 37: AIRFrame - hawaii.edu

● Project is at a point where formal working guidelines can be put in place

– Investigate user interface requirements

– Build a set of use cases/scenarios

● Need usability studies for various visualization paradigms

– Need to discover where visualizations will/will not work in the system

● Hoping to collaborate with a graduate-level HCI class this semester

System Interface Design Process

Page 38: AIRFrame - hawaii.edu

Open Questions

● How best to use our semantic markup to present our data?

● How to visualize our information to allow discovery and understanding of connections?

● How best to search for and represent
  ● Category links between terms?
  ● Authorship linkages between projects?
  ● NASA/NAI goal links to research?

● Ontology building
– Can it be at least partially automated?

Page 39: AIRFrame - hawaii.edu

Long-term goals

● Iteratively build/design interface with user participation

● Integrate fine-grained Textpresso search with high-level visualizations

● Finish standardization of ontology and offer searchable interface or visualization for it

● Integrate Steve Freeland's amino acid database

– mechanism partly in place now

Page 40: AIRFrame - hawaii.edu

Thank you!

● For updates visit our site at:

– www.ifa.hawaii.edu/airframe

– “Proof-of-concept” AIRFrame/Textpresso is currently up and functioning

● www.ifa.hawaii.edu/airframe/textpresso

● Mahalo to:

– Justin Rajkowski & Kathryn Aiko Arinaga, LIS

– Liz Jayne, NAI REU participant