


On Embedding Machine-Processable Semantics into Documents

Krishnaprasad Thirunarayan

Department of Computer Science & Engineering

Wright State University

Dayton, OH-45435, USA


Talk Outline

Background and Motivation (Why?)

Goals (What?)

Details (How?)

Conclusions


Background and Motivation


Heterogeneous Doc. Spec. Defn. Rep.

Content Extraction: formalize the document using a controlled vocabulary


Problems with this approach to content extraction

Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability.

Manual extraction from the spec (from scratch) for each use is labor-intensive, time-consuming, and prone to typographical errors.


Observation

Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, to the controlled vocabulary.

So, explore techniques to maintain the correspondence between a spec fragment and its formalization.


Goal


General Problem

Embed domain-specific mark-up (annotations) into a human-sensible document to make explicit the semantics of “content” text and complex data, and to augment an interpretation in a modular fashion.

Document text: human comprehensible

Semantic mark-up: machine processable
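One way to picture the two layers is as a list of fragments, each pairing a piece of document text with the mark-up that formalizes it. The sketch below is only an illustration in Python, not the talk's implementation: the `Fragment` type, the `pair_fragments` function, and the rule used by `is_markup` (an annotation line contains `<` and ends with `/>`) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    text: str               # human-comprehensible document text
    markup: Optional[str]   # machine-processable annotation, if any


def is_markup(line: str) -> bool:
    # Assumption: an annotation line contains "<" and ends with "/>".
    return "<" in line and line.endswith("/>")


def pair_fragments(lines: List[str]) -> List[Fragment]:
    """Pair each text line with the annotation line that directly follows it."""
    fragments: List[Fragment] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if is_markup(stripped):
            if fragments and fragments[-1].markup is None:
                fragments[-1].markup = stripped
            else:
                # Annotation with no preceding text line
                fragments.append(Fragment(text="", markup=stripped))
        else:
            fragments.append(Fragment(text=stripped, markup=None))
    return fragments


annotated = [
    "Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)",
    "table.<setHeading thickness strength.tensile strength.yield/>",
    "",
    "0.50 and under 165 155",
    "table.<addRow 0 0.50 165 155 />",
    "Unannotated prose stays readable as plain text.",
]
fragments = pair_fragments(annotated)
human_view = [f.text for f in fragments]
machine_view = [f.markup for f in fragments if f.markup]
```

Because each annotation stays attached to the text it came from, the human view and the machine view can both be regenerated from the same list.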


Details (How?)


Nature of Specs

Semi-structured

Heterogeneous: text, tables, images

Constrained technical vocabulary

Available as MS Word document


Pre-processing Spec

Abstract content from the spec document by removing display-oriented information
  Save text
  Save tabular data, preserving grid layout
  Retain links to images
  …

Note: The “Save As text” option in MS Word is inadequate
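The result of pre-processing can be pictured as a sequence of typed blocks. The following Python sketch is a hypothetical representation, not the format used in the talk: the names `TextBlock`, `TableBlock`, `ImageLink`, and `to_plain_text`, the tab-separated rendering, and the file path in the usage lines are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class TableBlock:
    grid: List[List[str]]   # rows of cells; the grid layout is preserved


@dataclass
class ImageLink:
    href: str               # a link to the image, not the image data itself


Block = Union[TextBlock, TableBlock, ImageLink]


def to_plain_text(blocks: List[Block]) -> str:
    """Render the blocks without display information, one table row per line."""
    lines: List[str] = []
    for block in blocks:
        if isinstance(block, TextBlock):
            lines.append(block.text)
        elif isinstance(block, TableBlock):
            # Tab-separated cells keep the column structure recoverable.
            lines.extend("\t".join(row) for row in block.grid)
        else:
            lines.append(f"[image: {block.href}]")
    return "\n".join(lines)


spec = [
    TextBlock("Minimum mechanical properties:"),
    TableBlock([["0.50 and under", "165", "155"],
                ["0.50 - 1.00", "160", "150"]]),
    ImageLink("figures/specimen.png"),  # hypothetical path
]
plain = to_plain_text(spec)
```

Keeping tables as grids rather than flattened text is the point of the sketch: a later annotation step can still tell which cell belongs to which column.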


Heterogeneous Document


XML generated by Majix


ASCII Output


Annotating Pre-processed Spec

Embedding machine-processable semantics
  Recognizing and tagging text using a controlled vocabulary
    By-product of: document indexing and semantic search
  Tagging tabular data to make its semantics explicit: same grid layout, but different interpretation and dependencies based on headings

Explore: Water, an XML-based programming language, for defining data and its behavior (semantics)
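Recognizing controlled-vocabulary terms in text can be sketched as a phrase lookup. The Python example below is a minimal illustration under stated assumptions: the vocabulary entries are hypothetical (the concept names are borrowed from the tagged-table example in this talk), `locate_terms` is an invented name, and real term recognition would likely need more than case-insensitive phrase matching.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical controlled vocabulary: lowercase phrase -> concept identifier.
VOCAB: Dict[str, str] = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def locate_terms(text: str, vocab: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, concept) for each vocabulary phrase found in text."""
    # Try longer phrases first so a multi-word term wins over a shorter one.
    phrases = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), vocab[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Tensile strength depends on thickness."
hits = locate_terms(sentence, VOCAB)
```

The character offsets returned for each hit are what would let a tag be placed around the exact phrase, keeping the annotation tied to its source text.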


Locating Controlled Vocabulary Terms


Example Table

Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145


Example of Tagged Table

Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)

table.<setHeading thickness strength.tensile strength.yield/>

0.50 and under 165 155

table.<addRow 0 0.50 165 155 />

0.50 - 1.00 160 150

table.<addRow 0.50 1.00 160 150 />

1.00 - 1.50 155 145

table.<addRow 1.00 1.50 155 145 /> ...
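The row annotations above follow a regular pattern, so they could in principle be produced by a small string-processing routine rather than typed by hand. The Python sketch below is an assumption-laden illustration: `annotate_row` and the two regular expressions are invented here, and mapping "and under" to a lower bound of 0 simply mirrors the first annotation shown above.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"

# "0.50 - 1.00 160 150" (hyphen or en dash between the bounds)
RANGE_ROW = re.compile(rf"^\s*{NUM}\s*[-–]\s*{NUM}\s+{NUM}\s+{NUM}\s*$")
# "0.50 and under 165 155"
UNDER_ROW = re.compile(rf"^\s*{NUM}\s+and under\s+{NUM}\s+{NUM}\s*$")


def annotate_row(row: str) -> str:
    """Turn one table row of text into an addRow annotation string."""
    m = UNDER_ROW.match(row)
    if m:
        upper, tensile, yield_ = m.groups()
        # Assumption: "and under" means the range starts at 0.
        return f"table.<addRow 0 {upper} {tensile} {yield_} />"
    m = RANGE_ROW.match(row)
    if m:
        lower, upper, tensile, yield_ = m.groups()
        return f"table.<addRow {lower} {upper} {tensile} {yield_} />"
    raise ValueError(f"unrecognized table row: {row!r}")


first = annotate_row("0.50 and under 165 155")
second = annotate_row("0.50 - 1.00 160 150")
```

Rows that do not fit either pattern raise an error instead of being guessed at, which suits irregular tables where a wrong annotation would be worse than a missing one.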


Example of Processing Code

<defclass table rows=required=vector heading=optional=vector>
  <defmethod setHeading t=required ts=required ys=required>
    <set heading=<vector t ts ys/>/>
  </>
  <defmethod addRow smin smax ts ys>
    <set rows=table.rows.<insert <vector smin smax ts ys/>/>/>
  </>
  <defmethod computeYieldStrength> … </>
  <defmethod computeTensileStrength> … </>
</>


(cont’d)

<defclass table rows=required=vector heading=optional=vector>
  <defmethod computeTensileStrength>
    <set temp=fluid.Thickness/>
    <set i=0/>
    <do>
      <until <and temp.<less table.rows.<get i/>.1/>
                  temp.<more_or_equal table.rows.<get i/>.0/> /> >
        table.rows.<get i/>.2
      </until>
      <set i=i.<plus 1/>/>
    </do>
  </>
</>


(cont’d)

<defclass table rows=required=vector heading=optional=vector>
</>

fluid.<set Thickness=0.60>

<try
  <set TensileStrength=table.<computeTensileStrength/>/>
  TensileStrength
>
  "TABLE: out of range error occurred"
</try>
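For readers unfamiliar with Water, the same lookup can be approximated in Python. This is a rough rendition, not a translation checked against a Water interpreter: the class and method names are adapted for this example, and the thickness is passed as an explicit argument instead of being read from a fluid variable.

```python
from typing import List, Tuple

Row = Tuple[float, float, float, float]  # (lower, upper, tensile, yield)


class StrengthTable:
    def __init__(self) -> None:
        self.heading: List[str] = []
        self.rows: List[Row] = []

    def set_heading(self, *names: str) -> None:
        self.heading = list(names)

    def add_row(self, lower: float, upper: float,
                tensile: float, yield_: float) -> None:
        self.rows.append((lower, upper, tensile, yield_))

    def _find_row(self, thickness: float) -> Row:
        # Same test as the Water code: lower <= thickness < upper.
        for row in self.rows:
            if row[0] <= thickness < row[1]:
                return row
        raise LookupError("TABLE: out of range error occurred")

    def compute_tensile_strength(self, thickness: float) -> float:
        return self._find_row(thickness)[2]

    def compute_yield_strength(self, thickness: float) -> float:
        return self._find_row(thickness)[3]


table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

tensile = table.compute_tensile_strength(0.60)
```

With half-open ranges, a thickness of exactly 0.50 falls into the second row, whereas the table text "0.50 and under" suggests the first. The sketch follows the comparison used in the code rather than the table wording.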


Water

XML-based OO scripting language

Facilitates creating Web Services
  Run methods remotely via web browser

Generalizes dynamic typing to constraint checking
  Conformance of actuals to formals


Pros and cons

Encoding Improvement
  Amount of tagging can be controlled by suitably delimiting table data and annotating it with the corresponding “string-processing” method

Master Copy Update
  Changes to the spec require manual modification of the archived annotated version

Irregular Tables in Specs
  Different units, etc.


Some Related Work

Microsoft Smart Tags

Recognize “controlled” words in Office 2003 documents and associate a predefined list of actions with each occurrence

SHOE

Table data in a declarative (logic) language


Prolog rendition

strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...

strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).

thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

?- thicknessToYieldStrength(0.6, YS).


Conclusions


A Step towards the Holy Grail

Ultimately, enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.