1
NIST Scientific Data for Data Science
United Nations Open Data / Open Government Conference, April 26-28, Abu Dhabi
http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science
Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
April 26, 2014
2
Open Data / Open Government Conference
• Request:
  – Interesting case studies about open government / open data.
  – Information on relevant federal apps designed.
  – A short bio.
• Response:
  – AOL Government published about 80 of my roughly 200 Semantic Community stories about open government data and activities.
  – Over 250 Spotfire dashboard apps in my cloud library, including most of the major open government dashboards and new data sets.
  – Helped Data.gov get started in the US, and helped open government data get started at SEMIC.EU and in Japan.
3
Speaker Bio
• Brand Niemann, former Senior Enterprise Architect & Data Scientist with the US EPA, works as a data scientist, produces data science products, and publishes data stories for Semantic Community, AOL Government, and Data Science & Data Visualization DC.
• With Kate Goodier, he co-organized the Federal Big Data Working Group Meetup, whose Data Science Teams produce big data applications for government and business and which provides a free online graduate course entitled Practical Data Science for Data Scientists.
4
Broader Context
• NIST and other agencies need to support the following Federal Government Initiatives:
  – Big Data
  – Digital Government Strategy
  – Public access mandated for "scientific results" supported by the U.S. government
  – Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP
  – Digital Object Architecture: One result will be to make the scientific record into a first class scientific object
• The author has suggested that all of these can be addressed with agency digital content by following the Data Mining Standard.
  – See “Data Science Makes Data More Important Than Code and Ontology”
5
Data Mining Standard
• Business Understanding:
  – NIST Mission
    • Standardize measurement
• Data Understanding:
  – NIST Digital Archives
    • Promised to publish raw data sets
• Data Preparation:
  – Knowledge Base of the Above
    • Need raw data for figures
• Modeling:
  – Semantic Knowledge Base, Data Papers, and NanoPublications
    • See White Paper on “Making Big Data Small” using Data Science and Semantics
• Evaluation:
  – Searchability, Discovery, and Reasoning
    • Relational Queries and Graph Traversal
• Deployment:
  – Story and Knowledge Base in MindTouch, Excel, NodeXL, Spotfire, and Be Informed
    • Data ecosystem
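The Evaluation step names two ways of interrogating the same knowledge base: relational queries and graph traversal. The sketch below contrasts them on a tiny set of page-to-page reference links. The link data, table name, and function names are all hypothetical placeholders, not taken from the NIST knowledge base, and the standard-library tools used here (sqlite3, a breadth-first search) merely stand in for whatever the deck's actual tooling does.

```python
import sqlite3
from collections import deque

# Hypothetical reference links between knowledge-base pages (illustrative only).
LINKS = [
    ("story", "paper_a"),
    ("story", "paper_b"),
    ("paper_a", "data_table"),
    ("paper_b", "data_figure"),
]


def count_outgoing_links(links):
    """Relational query: count outgoing links per source page with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link (source TEXT, target TEXT)")
    conn.executemany("INSERT INTO link VALUES (?, ?)", links)
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM link GROUP BY source ORDER BY source"
    ).fetchall()
    conn.close()
    return dict(rows)


def reachable_from(links, start):
    """Graph traversal: breadth-first walk over the same links from one page."""
    adjacency = {}
    for source, target in links:
        adjacency.setdefault(source, []).append(target)
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order
```

The relational view answers "which pages carry the most links", while the traversal answers "what can be discovered starting from this page"; both read the same rows.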
6
NIST
• NIST supports its employees and others with the following Information Services:
  – Research Library
  – Publishing Services
  – NIST Museum and Archives
• The NIST Digital Archives (NDA) present images of NIST Museum artifacts and full-text NIST publications:
  – NBS Bulletins
  – Journal of Research of NIST
  – NBS-NIST Directors
  – NBS-NIST Histories
  – NBS Circulars and Reports
9
NIST Digital Archive Interface
http://nistdigitalarchives.contentdm.oclc.org/
10
NIST Digital Archive Contents
http://nistdigitalarchives.contentdm.oclc.org/cdm/search/display/200/order/title/ad/asc
My Note: 9602 Items!
11
NIST Digital Archive Example
http://cdm16009.contentdm.oclc.org/cdm/compoundobject/collection/p13011coll6/id/153009/rec/1
My Note: Can Read PDF On-line, but Where Is the Data?
12
PDF-to-MindTouch
Figure 8 The solid circles show the measured absorbance
Table 1 Properties of 2.0 μm microspheres at 266 nm obtained from the fit of the L-M apparent cross section to the absorbance measurements
My Note: Need Data for Figure 8 and for Table 1 to be Real Data (it is!)
13
Modeling: Approaches by the Federal Big Data Working Group Meetup
• Semantic Medline:
  – Semantic MEDLINE Query: mesothelioma and Data Science for VIVO
• Data Papers:
  – Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context
    • http://sepublica.mywikipaper.org/
• Nanopublications:
  – The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
    • http://nanopub.org/wordpress/?page_id=65
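The nanopublication definition above has three parts: an assertion, a unique identifier, and attribution to an author. A minimal sketch of that shape as a data structure follows. The class names, field names, and the subject-predicate-object layout of the assertion are assumptions made for illustration; this is not the nanopub.org schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One statement, modeled here (as an assumption) as a triple."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopublication:
    identifier: str       # must uniquely identify this assertion
    assertion: Assertion  # the single statement being published
    author: str           # who the assertion is attributed to

    def is_publishable(self):
        """True only when the assertion is both identified and attributed."""
        return bool(self.identifier.strip()) and bool(self.author.strip())

    def as_row(self):
        """Flatten to one spreadsheet-style row."""
        a = self.assertion
        return [self.identifier, a.subject, a.predicate, a.obj, self.author]
```

For example, `Nanopublication("urn:example:np1", Assertion("s", "p", "o"), "Example Author").as_row()` yields one flat row, which is the form an extract-to-Excel step would need. The identifier and author values here are placeholders.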
14
Modeling: Examples
Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions
Dr. Barend Mons: BRAIN
Dr. Tom Rindflesch: Semantic Medline
15
Evaluation and Deployment
• Evaluation and Deployment examples for each are as follows:
  – Semantic Knowledge Base: Web & PDF
  – Selected Data Papers: PDF-to-MindTouch
    • Measurement of Scattering and Absorption Cross Sections of Microspheres for Wavelengths between 240 nm and 800 nm
    • OMNIDATA and the Computerization of Scientific Data
  – Nanopublication: Extracts from the Data Papers-to-Excel
    • My Note: Still need the NIST raw data sources to re-create the figures in the publications.
  – I have been promised that NIST is going to publish their data sets as part of the Open Government Data Initiative.
16
How was the data collected?
http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science
My Note: Unstructured Information to Structured Data, Including the Two PDF Papers, with Well-defined URLs According to the SEMIC.EU Standards.
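One way to read "well-defined URLs" is that a page address can be derived predictably from its titles. The sketch below rebuilds the knowledge base URL shown on this slide from its two title segments. The underscore-for-space rule is inferred from that URL alone; the function name is hypothetical, and the SEMIC.EU rules themselves are not given here, so treat this as an assumption rather than the standard.

```python
from urllib.parse import quote


def page_url(base, *titles):
    """Build a predictable page URL from a base address and page titles.

    Assumed convention: whitespace runs in each title become one underscore,
    and any remaining unsafe characters are percent-encoded.
    """
    segments = [quote("_".join(title.split()), safe="") for title in titles]
    return "/".join([base.rstrip("/")] + segments)


url = page_url(
    "http://semanticommunity.info/",
    "Data Science",
    "NIST Scientific Data for Data Science",
)
```

Here `url` equals `http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science`, the address on the slide.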
17
Where is the unstructured and structured data stored?
http://semanticommunity.info/@api/deki/files/28860/NISTDataScience.xlsx
Web and PDF
Footnote and References
Metadata and Data Sources
Well-defined URLs for Linked Data
Relational and Graph
Ready for NodeXL & Spotfire
18
What are the results?: NIST Scientific Data Knowledge Base Visualization
My Note: Sections with Many Reference Links Can be Very Important!
19
What are the results?: NIST Digital Archives Century of Excellence
My Note: The Featured Seminal Data Paper is the 60th out of 106 Which I Found from Doing the Index Below!
20
What are the results?: NIST Digital Archives
My Note: The NIST Digital Archive Can be an Interface to Data Papers with Data Tables and Interactive Visualizations. This Work Can be Used to Prioritize the Additional Work and Reduce Duplication.
21
What are the results?: NIST Library Catalog Search for Data
My Note: This Was a Test for Searching the Catalog for “data” and Converting the Results to a Spreadsheet (20 of 259). There is Also the Need to Search for Data Tables Within the Individual Publications.
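The test described in the note has two steps: filter catalog records for a keyword, then write the matches in a form a spreadsheet can open. A minimal sketch of those two steps follows, using CSV as the spreadsheet format. The record fields, sample titles, and function names are hypothetical; the real catalog export format is not shown on the slide.

```python
import csv
import io

# Hypothetical catalog records (illustrative only, not real NIST catalog entries).
SAMPLE_RECORDS = [
    {"title": "Example Data Handbook", "year": "1990"},
    {"title": "Example Agency History", "year": "1985"},
]


def filter_by_keyword(records, keyword):
    """Keep records whose title contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [record for record in records if needle in record["title"].lower()]


def records_to_csv(records, fieldnames):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


hits = filter_by_keyword(SAMPLE_RECORDS, "data")
csv_text = records_to_csv(hits, ["title", "year"])
```

A title-only match like this would miss data tables buried inside individual publications, which is the second need the note points out.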
22
What is our data story and product?
• Need a scientific data publishing environment that:
  – Conforms to editorial policies
  – Facilitates peer review
  – Standardizes dissemination
  – Manages references and URLs
  – Promotes data publication, validation, and mining
• Semantic Community is doing that for NIST:
  – More work in progress to be reported at the conference and elsewhere