A Primer on Converting Analysis Results Data to RDF Data Cubes using Free and Open Source Tools Tim Williams Principal Statistical Solutions Analyst Global Statistical Sciences UCB BioSciences, Inc. PhUSE 2014 TT03


A Primer on Converting Analysis Results Data to RDF Data Cubes using Free and Open Source Tools

Tim Williams Principal Statistical Solutions Analyst Global Statistical Sciences UCB BioSciences, Inc.

PhUSE 2014

TT03

The Semantic Web (circa 2011)

"I want to take the clinical trials results..."

"..and put them in an RDF Data Cube!"

    Baseline      Placebo      LowDose      HighDose
                  N=28         N=30         N=29
    --------------------------------------------------
    Sex   F       12 (42.9)    14 (46.7)    16 (55.2)
          M       16 (57.1)    16 (53.3)    13 (44.8)

    ds:obs1 a qb:Observation ;
        prop:treatment "Plc" ;
        prop:sex "F" ;
        prop:statistic "count" ;
        prop:result "12"^^xsd:double ;
        qb:dataSet ds:dataset-demog .

ds:obs2 a qb:Observation ; ...
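As a rough illustration of how one table cell becomes one observation, the sketch below renders the same Turtle pattern shown above from plain values. The function name `observation_turtle` is hypothetical and only builds a string; it is not part of any RDF library, and prefix declarations are assumed to exist elsewhere in the file.

```python
def observation_turtle(index, treatment, sex, statistic, result,
                       dataset="ds:dataset-demog"):
    """Render one table cell as a qb:Observation in Turtle syntax.

    Hypothetical helper: it only formats text and does no validation.
    """
    lines = [
        f"ds:obs{index} a qb:Observation ;",
        f'    prop:treatment "{treatment}" ;',
        f'    prop:sex "{sex}" ;',
        f'    prop:statistic "{statistic}" ;',
        f'    prop:result "{result}"^^xsd:double ;',
        f"    qb:dataSet {dataset} .",
    ]
    return "\n".join(lines)


# The Placebo / Female / count cell from the demographics table.
print(observation_turtle(1, "Plc", "F", "count", 12))
```

Calling it once per cell and per statistic would give twelve observations for the table above.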

JSON

    "op": "rdf-extension/save-rdf-schema",
    "description": "Save RDF schema skeleton",
    "schema": {
      "baseUri": "http://www.example.org/",
      "prefixes": [
        { "name": "dccs", "uri": "http://www.example.org/dc/demog/dccs/" },
        { "name": "rdfs", "uri": "http://www.w3.org/2000/01/rdf-schema#" },
        { "name": "prov", "uri": "http://www.w3.org/ns/prov#" },
        ........

Turtle

    ts:i7832 ts:firstName "Homer" ;
             ts:lastName "Simpson" ;
             ts:hasSpouse ts:i5628 .
    ts:i5628 ts:firstName "Marge" ;
             ts:lastName "Simpson" .

Triple

    Homer Simpson -- hasSpouse --> Marge Simpson

[Images: a Tribble, a turtle, and Jason, as puns on Triple, Turtle, and JSON]

How to start?

In Scope

•  Simplified RDF Data Cube
•  Two creation methods (overview)

Out of Scope

•  Introduction to Semantic Web, RDF....
•  PhUSE Wiki "PhUSE Semantic Technology Curriculum"
•  Detailed tutorial

PhUSE Wiki: Companion Documents


What is an RDF Data Cube?


3 Main Components in the Cube Model

•  Attributes
   •  metadata
   •  status=final, issued="2014-08-06T00:00:00"^^xsd:dateTime
•  Measure (or Primary Measure)
   •  the observed value of primary interest
   •  count=12
•  Dimensions
   •  value keys or indices that identify the measure
   •  treatment="Plc", sex="F", statistic="count"
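One informal way to picture the three components is a lookup table in which the dimensions form the key and the measure and attributes are stored under it. The sketch below is only an analogy in plain Python, not the RDF Data Cube vocabulary itself; the names `cube` and `lookup` are made up for illustration.

```python
# Dimensions (treatment, sex, statistic) form the key.
# The measure is the observed value; attributes are metadata about it.
cube = {
    ("Plc", "F", "count"): {
        "measure": 12,
        "attributes": {"status": "final", "issued": "2014-08-06T00:00:00"},
    },
    ("Plc", "F", "percentage"): {"measure": 42.9, "attributes": {}},
    ("Plc", "M", "count"): {"measure": 16, "attributes": {}},
    ("Plc", "M", "percentage"): {"measure": 57.1, "attributes": {}},
}


def lookup(cube, treatment, sex, statistic):
    """Return the measure identified by one value for each dimension."""
    return cube[(treatment, sex, statistic)]["measure"]


print(lookup(cube, "Plc", "F", "count"))  # 12
```

Fixing a value for every dimension identifies exactly one measure, which is the property the cube model relies on.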


[Figure: the demographics table drawn as a cube with three dimensions: Treatment (Plc, LowD, HighD), Sex (F, M), and Statistic (count, percentage)]

    Baseline          Placebo      LowDose      HighDose
    Characteristic    N=28         N=30         N=29
    ------------------------------------------------------
    Sex   F           12 (42.9)    14 (46.7)    16 (55.2)
          M           16 (57.1)    16 (53.3)    13 (44.8)

Statistic
•  count
•  percentage

[Figure: the same cube, with the cell at Treatment=Plc, Sex=F, Statistic=count highlighted]

Dimensions: treatment="Plc", sex="F", statistic="count"

It's a hit!!

Measure: count=12

[Figure: the two creation methods, Publisci and OpenRefine]

Publisci Method

Workflow: Map table -> CSV -> Ruby Script

    Baseline          Placebo      LowDose      HighDose
    Characteristic    N=28         N=30         N=29
    ------------------------------------------------------
    Sex   F           12 (42.9)    14 (46.7)    16 (55.2)
          M           16 (57.1)    16 (53.3)    13 (44.8)

Statistic
•  count
•  percentage

    Treatment,Sex,Statistic,Result
    Plc,F,count,12
    Plc,F,percentage,42.9
    Plc,M,count,16
    Plc,M,percentage,57.1
    LowD,F,count,14
    LowD,F,percentage,46.7
    LowD,M,count,16
    LowD,M,percentage,53.3
    etc.
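The "map table" step, turning the display table into the long-format CSV above, could be scripted rather than typed by hand. The following is a minimal sketch using only the Python standard library; the `TABLE` layout and the function names are assumptions made for this example, and the values are the ones on the slide.

```python
import csv
import io

# Display table cells as (count, percentage) pairs, copied from the slide.
TABLE = {
    "F": {"Plc": (12, 42.9), "LowD": (14, 46.7), "HighD": (16, 55.2)},
    "M": {"Plc": (16, 57.1), "LowD": (16, 53.3), "HighD": (13, 44.8)},
}


def to_long_rows(table):
    """Emit one row per treatment, sex, and statistic."""
    rows = []
    for treatment in ("Plc", "LowD", "HighD"):
        for sex in ("F", "M"):
            count, percentage = table[sex][treatment]
            rows.append((treatment, sex, "count", count))
            rows.append((treatment, sex, "percentage", percentage))
    return rows


def to_csv(rows):
    """Serialize the rows with the header used in the source CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Treatment", "Sex", "Statistic", "Result"])
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(to_long_rows(TABLE)))
```

Each of the six table cells yields two rows, one per statistic, so the full file has twelve data rows.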

    require 'publisci'
    include PubliSci::DSL

    # Describe the source CSV: which columns are dimensions, which is the measure
    data do
      source 'demog3DimSource.csv'
      dimension 'Treatment', 'Sex', 'Statistic'
      measure 'Result'
      option :base_url, 'http://example.org'
      option 'base', 'http://example.org/'
      option 'label_column', 'Statistic'
    end

    # Dataset-level metadata
    metadata do
      dataset 'Demographics Analysis Results'
      title 'Demographics'
      creator 'Your-Name-Here'
      description 'Table example for Demographics and Baseline Characteristics'
      date '2014-07-07T00:00:00'
    end

    # Write the generated RDF to a file
    open('demog3Dim_p.ttl', 'w') { |file| file.write generate_n3 }

Output:

    ...
    ns:obs1 a qb:Observation ;
        qb:dataSet ns:dataset-demog3DimSource ;
        rdfs:label "1" ;
        prop:Treatment <code/treatment/Plc> ;
        prop:Sex <code/sex/F> ;
        prop:Statistic <code/statistic/count> ;
        prop:Result 12 ;
        ... .

Publisci Method

Advantages

•  Simple, quick, easy
•  Minimal cube knowledge
•  Automatic code list generation

Disadvantages

•  Limited support*
•  Harder to extend unless you are a Ruby and Cube expert
•  Not as flexible as OpenRefine

RDF Data Cube

OpenRefine Method

•  Map table: CSV/XLS
•  Import: Create Project
•  Construct: Cube Skeleton
   •  Components: Attributes, Dimensions, Measure
•  Attach: Values
•  Export

OpenRefine

Save & Re-use JSON from OpenRefine

OpenRefine method

Advantages

•  Flexibility in cube design
•  Incremental development
•  Cube components available within interface
•  Data reconciliation

Disadvantages

•  Greater cube knowledge required
•  Steep learning curve
•  Labour-intensive, manual steps
•  Measures in the same cube all receive the same data type
   Example: count and percentage as xsd:double
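To make the data type limitation concrete: when a single measure property carries one datatype, a count and a percentage are written out the same way. The helper below is a hypothetical formatter for Turtle typed literals, written for this example only.

```python
def typed_literal(value, datatype="xsd:double"):
    """Format a value as a Turtle typed literal (hypothetical helper)."""
    return f'"{value}"^^{datatype}'


# One measure property, one datatype: counts and percentages look alike.
print(typed_literal(12))     # "12"^^xsd:double
print(typed_literal(42.9))   # "42.9"^^xsd:double

# What a count might look like if each statistic could carry its own type.
print(typed_literal(12, "xsd:integer"))  # "12"^^xsd:integer
```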

Query the data with SPARQL

SPARQL Protocol and RDF Query Language

    PREFIX prop: <http://www.example.org/dc/demog/prop/>
    SELECT ?value
    WHERE {
      ?obs prop:treatment "Plc" ;
           prop:sex "F" ;
           prop:statistic "count" ;
           prop:result ?value .
    }

"Where did I go wrong with this child?"

"I blame The Internets, honey."
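For readers new to SPARQL, the query on this slide can be read as "find every observation whose treatment, sex, and statistic match, and return its result". The sketch below imitates that pattern match over a small in-memory list of triples. It is a stand-in for a real SPARQL engine, not a replacement for one; the triples and the function `select_result` are invented for illustration.

```python
PROP = "http://www.example.org/dc/demog/prop/"

# A tiny stand-in for a triple store: (subject, predicate, object).
TRIPLES = [
    ("obs1", PROP + "treatment", "Plc"),
    ("obs1", PROP + "sex", "F"),
    ("obs1", PROP + "statistic", "count"),
    ("obs1", PROP + "result", 12.0),
    ("obs2", PROP + "treatment", "Plc"),
    ("obs2", PROP + "sex", "F"),
    ("obs2", PROP + "statistic", "percentage"),
    ("obs2", PROP + "result", 42.9),
]


def select_result(triples, **conditions):
    """Return prop:result for each subject matching all conditions."""
    # Group predicate/object pairs by subject.
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, {})[predicate] = obj

    values = []
    for properties in by_subject.values():
        matches = all(properties.get(PROP + name) == wanted
                      for name, wanted in conditions.items())
        if matches and PROP + "result" in properties:
            values.append(properties[PROP + "result"])
    return values


print(select_result(TRIPLES, treatment="Plc", sex="F", statistic="count"))
```

Here the three keyword arguments play the role of the three fixed patterns in the WHERE clause, and the returned list corresponds to the bindings of ?value.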

Cube Construction: an Evolution.

Publisci
•  "My first cube!"
•  Codelists

OpenRefine
•  Structure and Skeletons
•  Customization
•  Data reconciliation

rrdf, rrdflibs
•  Production solution

Data Transparency?

•  Metadata
   •  embedded with the data
•  Standardization
   •  data reconciliation with online vocabularies & thesauri
   •  translation between different coding systems and data models
•  Merge data
   •  similar and dissimilar sources
•  Machine readable
   •  Reasoning, logic, intelligent search

Semantic Interoperability: "The ability for computer systems to exchange data with unambiguous, shared meaning". - Wikipedia.

Thank you!

Tim Williams
UCB BioSciences, Inc.
Raleigh, NC USA

Acknowledgements
•  Will Strinz - Publisci Author
•  OpenRefine team
•  Ian Fleming, Marc Andersen - PhUSE WG Leads
•  PhUSE WG team members
•  Open Source Movement
•  The Internets

Contact: www.linkedin.com/in/timpwilliams/

Copyright & Source Attributions

All images are copyrights of their creators and respective owners.

Paramount Pictures

Hasbro Inc.

Ron Leishman. Image 440722, illustrationsof.com

LOD Cloud Diagram as of September 2011. CC BY-SA 3.0. Anja Jentzsch, own work

http://dgallery.s3.amazonaws.com/sparql-protocol.png

Davidson University Dept. of Biology Herpetology Lab Research http://www.bio.davidson.edu/people/midorcas/research/stresearch/tercar.jpg

My life with Fly Ball dogs http://mylifewithflyballdogs.com http://farm6.staticflickr.com/5341/7186194778_3c9d6b56be.jpg

Daily Tombstone Photo http://dailytombstonephoto.blogspot.com/2010/05/mausoleum-of-charles-lucky-luciano-st.html MAUSOLEUM OF CHARLES "LUCKY" LUCIANO - St. John's Cemetery, Middle Village, New York Image modifications by TW, Aug 2014


Nickelodeon
