ChemXSeer: Digital library tools, features, and crawling characteristics

Edward A. Fox

Professor, Computer Science, Virginia Tech

Blacksburg, VA 24061 USA

[email protected] http://fox.cs.vt.edu

and

Sagnik Ray Choudhury

Ph.D. Student, College of Information Science and Technology, Penn State, USA

[email protected]

13 Jan. 2014 -- QU Library, Doha, Qatar

Outline

• Acknowledgments
• Introduction
• ELISQ
• Technology

HTTP://WWW.QU.EDU.QA/

HTTP://WWW.TAMU.EDU/ HTTP://WWW.PSU.EDU/ HTTP://WWW.VT.EDU/

Funding provided through the ELISQ project: Electronic Library Institute - SeerQ


Sponsored by Qatar University Library

HTTP://qnl.qa

Acknowledgments

• Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University

• Dr. Rashid Alammari, Dean, College of Engineering, Qatar University

• Dr. Moumen Hasnah, Director of Academic Research, Qatar University

• Dr. Imad Bachir, Qatar University Library Director

• Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University

• Prof. Ramazan Kahraman, Head of the Department of Chemical Engineering, Qatar University


Additional Thanks


QScience – providing collection:

Christopher J. Leonard, Editorial Director

Paul Coyne, CTO

US National Science Foundation (recent and current grants to Fox):
• IIS-1319578
• IIS-0916733
• DUE-0840719
• OCI-1032677
• plus those to PSU, TAMU

Outline

• Acknowledgments
• Introduction
• ELISQ
• Technology

Introduction

• Digital libraries have emerged since 1991.
• Now each major publisher has its own digital library; many others exist too.
• Related systems include:
  • Institutional repositories, e.g., at QU
  • Content & courseware management systems
• Research and development funding of hundreds of millions of dollars has led to powerful tailored systems, such as for chemical information.


Information Life Cycle

[Figure: information life cycle diagram. Labeled activities: Authoring, Modifying, Organizing, Indexing, Storing, Retrieving, Distributing, Networking, Retention / Mining, Accessing, Filtering, Using, Creating.]

[Figure: taxonomy of digital library services]

Infrastructure Services

• Repository-Building
  – Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
  – Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
• Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)

Information Satisfaction Services

• Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing

Outline

• Acknowledgments
• Introduction
• ELISQ
• Technology

ELISQ – Electronic Library Institute – SeerQ: Project Team

Qatar University, Qatar:

Mohammed Samaka (Ph.D., Co-Lead PI)

Sumaya Ali S A Al-Maadeed (Ph.D., PI)

Myrna Tabet

Asad Nafees

Tahseena Moideen

This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).

Virginia Tech, USA:

Edward Fox (Ph.D., Lead-PI)

Tarek Kanan

Penn State University, USA:

C. Lee Giles (Ph.D., PI)

Sagnik Ray Choudhury

Texas A&M, USA:

Richard Furuta (Ph.D., PI)

Hamed Alhoori


Consultants:

John Impagliazzo (Ph.D., Key Investigator)

Susan Lukesh (Ph.D.)

Carole Thompson

Qatar National Library, Qatar:

Claudia Lux (PI)

Krishna RoyChowdhury

Postdoc - TBA

ELISQ Project (1 of 2)

Project Objectives/Aims

A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.

Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines, and all types of government information.

ELISQ Project (2 of 2)

Project Objectives/Aims (continued)

B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.

Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.

Outline

• Acknowledgments
• Introduction
• ELISQ
• Technology

Crawler (Heritrix) (for search engines & Web archives)

• A Web crawler starts with a list of URLs to visit, called the seeds.

• On those pages, it identifies all the hyperlinks

• adds them to the list of URLs to visit

• recursively visits pages pointed to

• according to a set of policies.

• Prioritizes its downloads – some pages change often.
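The crawl loop described on this slide can be sketched in a few lines. This is a minimal illustration, not Heritrix itself: the in-memory `TOY_WEB` link graph, the `toy_fetch` helper, and the `same_site_policy` rule are all hypothetical stand-ins so the example runs without network access, and the first-in, first-out frontier ignores download prioritization.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/", "http://a.example/p2"],
    "http://a.example/p2": [],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, fetch_links, allowed, max_pages=100):
    """Breadth-first crawl: start from the seeds, queue unseen links that pass the policy."""
    frontier = deque(seeds)   # list of URLs still to visit
    seen = set(seeds)         # URLs already queued, so no page is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append(link)
    return visited


def toy_fetch(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])


def same_site_policy(url):
    """Example crawl policy (an assumption): stay on the seed's site."""
    return url.startswith("http://a.example/")


order = crawl(["http://a.example/"], toy_fetch, same_site_policy)
print(order)
```

A production crawler would replace the queue with a priority structure so that frequently changing pages are revisited sooner, and would add politeness rules such as per-host delays.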


Selected SeerSuite Instantiations

• CiteSeerX: http://citeseerx.ist.psu.edu
  • A scientific literature digital library and search engine
• ChemXSeer: http://chemxseer.ist.psu.edu
  • Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
• ArchSeer: http://archseer.ist.psu.edu/
  • Archeology literature
• TableSeer

http://citeseerx.ist.psu.edu CiteSeerX

• 3 M documents

• Millions of files

• 60 M citations

• 3 to 6 M authors

• 2 to 4 M hits per day

• 100K documents added monthly

• 800K individual users

• several Tbytes

• CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science

• Converts PDF to text
• Automatically extracts OAI metadata and other data
• Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
• Software open source – can be used to build other such tools


SeerSuite

• Tool kit used to build search engines and digital libraries
  • CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
• Built on commercial grade open source tools (Solr/Lucene)
• Penn State expertise – automated specialized metadata extraction
• Supports research in
  • Indexing and search
  • Data mining & structures
  • Information and knowledge extraction
  • Social networks: Name/entity disambiguation
  • Scientometrics/infometrics
  • Systems engineering
  • User interface design (HCI = human-computer interaction)
  • Software engineering and management

SeerSuite is not Google

• Metadata (as in library catalogs) as well as content
• Sets of collections, rather than the Web as a whole
  • Provided by a curator (e.g., publisher, museum)
  • Provided by user submissions
  • Or collected by focused ‘crawling’
• Tailored services, rather than the same for everyone
  • Browsing using categories, preserving, adding value
  • Based on studying user requirements, e.g., chemists
• Working with entities, rather than just words
  • Citations, tables, figures, names, chemical formulae
  • Using knowledge bases, machine learning, artificial intelligence

Search Engine and Repository for eChemistry

C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang, James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan, Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray Choudhury

Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences and Technology

Pennsylvania State University, University Park, PA, USA

Past funding: NSF Cyberinfrastructure Chemistry, Microsoft

Current support: Dow Chemical

http://chemxseer.ist.psu.edu

Talk Overview

● Challenges and Motivation
● Functionalities
  – Fulltext Search
  – Author Search
  – Table Search
  – Figure Search
  – Expertise Search
  – Chemical Name and Formula Tagging
  – Chemical Name and Formula Search
● Summary

Based on cyberinfrastructure for CiteSeerX

Built on Solr/Lucene, SeerSuite, other OSS

ChemXSeer RSC

ChemXSeer Fulltext Search

ChemXSeer Author Search

ChemXSeer Table Search

• Tables are widely used to present experimental results or statistical data in scientific documents.

• Existing search engines treat tabular data as regular text
  – Structural information and semantics are not preserved.
  – We automatically identify tables and extract table metadata in XML.

Table Metadata Representation:
• Environment metadata: (document specifics: type, title, …)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (caption, footnote, …)
• Layout metadata: (number of rows, columns, headers, …)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu et al., AAAI 2007, JCDL 2007.
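To make the six metadata categories concrete, here is a small sketch of how one extracted table might be serialized to XML. The element names (`environment`, `frame`, `affiliated`, `layout`, `type`, `cells`), the field names, and the sample values are all assumptions for illustration; the actual TableSeer schema is not given on the slide.

```python
import xml.etree.ElementTree as ET


def build_table_metadata(sections, rows):
    """Build an XML tree with one child element per metadata category."""
    root = ET.Element("table")
    for section, fields in sections.items():
        node = ET.SubElement(root, section)
        for name, value in fields.items():
            ET.SubElement(node, name).text = str(value)
    # Cell content metadata: one <cell> per value, addressed by row and column.
    cells = ET.SubElement(root, "cells")
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            ET.SubElement(cells, "cell", row=str(r), col=str(c)).text = str(value)
    return root


# Made-up example table: one header row and one data row.
rows = [["Compound", "pH"], ["lactic acid", "2.4"]]
sections = {
    "environment": {"doc_type": "article", "title": "Example paper"},
    "frame": {"left": "none", "right": "none", "top": "line", "bottom": "line"},
    "affiliated": {"caption": "Table 1. Example measurements"},
    "layout": {"rows": len(rows), "columns": len(rows[0]), "header_rows": 1},
    "type": {"kind": "hybrid"},
}

root = build_table_metadata(sections, rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping layout and cell positions as explicit fields is what lets a search engine answer structural queries (for example, matching a column header) that plain full-text indexing would lose.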

Sample Table Metadata Extracted File


ChemXSeer Table Search

ChemXSeer Figure/Plot Data Extraction and Search

Numerical data in scientific publications are often found in figures. No search engine allows searching on figures and their data in chemical documents. Tools that automate data extraction from figures and allow search on them can provide the following:
• Increased understanding of the key concepts of papers.
• Data for automatic comparative analyses.
• Regeneration of figures in different contexts.
• Search for documents with figures containing specific experimental results.

X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013

Our Contribution

ChemXSeer Name and Formula Extraction and Search

• Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
  – Many chemical molecules are created every day, so any dictionary-based name recognizer will eventually fail.
  – Names need to be segmented to get semantically meaningful sub-terms such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.
• Identifying formulae is hard:
  • “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
  • “… such as hydroxyl radical OH, superoxide O2- …” (formula)
• For searching, formulae cannot be treated as text:
  • Domain knowledge (formula identification)
  • Structural knowledge (substructure finding and search)

B. Sun et al., WWW 2007, WWW 2008, TOIS
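One way to see why formulae cannot be treated as plain text: two strings can denote the same composition without sharing a single token. The sketch below is an assumption-laden illustration, not the ChemXSeer method. Its hypothetical `parse_formula` helper handles only flat formulae (no parentheses, charges, or isotopes) and does not check that symbols are real elements.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Turn a flat formula such as 'CH3CH2OH' into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(digits) if digits else 1
    return counts


# Different strings, same composition: a text index would not match these.
same = parse_formula("CH3CH2OH") == parse_formula("C2H6O")
print(same)
```

Note that this parser would happily accept the "OH" in "Yellow Springs, OH" too, which is exactly why formula identification needs domain knowledge and context rather than pattern matching alone.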

Chemical Entity Extraction and Tagging

● Name tagging
  – Each chemical name can be a phrase
  – Examples
    ● "... Determination of lactic acid and ..."
    ● "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
● Formula tagging
  – Each formula is a single term
  – Example
    ● "... such as hydroxyl radical OH, superoxide ..."
  – Non-formula example
    ● "... YSI 5301, Yellow Springs, OH, USA ..."
● Tagging examples
  – Name tagging: "... of <name-type>lactic acid</name-type> and ..."
  – Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
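The tagged output format shown above can be produced mechanically once entity spans are known. The sketch below assumes the spans come from some upstream recognizer; the `tag_entities` function and its `(start, end, kind)` span format are hypothetical, and only the `<name-type>` and `<formula-type>` tag names are taken from the slide.

```python
def tag_entities(text, spans):
    """Wrap each (start, end, kind) span in <kind-type> tags.

    Spans must not overlap; kind is "name" or "formula".
    """
    out = text
    # Insert tags from right to left so earlier offsets stay valid.
    for start, end, kind in sorted(spans, reverse=True):
        tag = f"{kind}-type"
        out = out[:start] + f"<{tag}>" + out[start:end] + f"</{tag}>" + out[end:]
    return out


sentence = "such as hydroxyl radical OH, superoxide"
start = sentence.index("OH")
tagged = tag_entities(sentence, [(start, start + 2, "formula")])
print(tagged)
```

A name span works the same way: passing the offsets of "lactic acid" with kind "name" yields the `<name-type>` form.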

Online Chemical Entity Tagger

● We have an open source chemical name and formula tagger and a web-based interface for evaluation.

● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.

Online Chemical Entity Tagger: Chemical Name Tagging Example

● Results on a sample PDF.

● Some chemical formulas are erroneously identified as chemical names (loss of precision).

● High recall (most chemical names identified)

Online Chemical Entity Tagger: Chemical Formula Tagging Example

● Results on a sample PDF.

● Some chemical formulas not identified (loss of recall).

● High precision (words identified as formula are actual formulas)
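The precision and recall trade-off in these two examples can be made concrete with the standard definitions: precision is the fraction of tagged items that are correct, and recall is the fraction of true items that were tagged. The sets below are made-up illustrations, not results from the tagger.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical formula tagging outcome: everything tagged is right, but half is missed.
gold_formulas = {"OH", "O2-", "H2O", "CO2"}
tagged_formulas = {"OH", "H2O"}
p, r = precision_recall(tagged_formulas, gold_formulas)
print(p, r)
```

This mirrors the formula tagging case on the slide: high precision with some loss of recall. The name tagging case is the reverse, where extra items in the predicted set lower precision while recall stays high.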

Chemical Name Indexing and Search

● Segmentation-based index scheme
  – Used for indexing chemical names
  – First segment a chemical name hierarchically, then index the substring at each node if it is frequent.
  – acetaldoxime -> aldoxime -> oxime
  – A search for oxime returns all of them, depending on the ranking function.
  – This cannot be done in the usual text search.

• Index schemes:
  – Which tokens to index?
  – Indexing all subsequences generates a very large index.
  – “but” is a morpheme in “butane”, but not in “nembutal”.
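A toy version of the segmentation-based index shows the idea. Everything here is an assumption for illustration: the `SEGMENTS` table is hand-written (the real system learns segmentations), `build_index` is a hypothetical helper, and the frequency rule is a simple document count with threshold `min_freq`.

```python
from collections import defaultdict

# Hand-written hierarchical segmentations (hypothetical); each list runs from
# the full name down to its smaller meaningful sub-terms.
SEGMENTS = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime"],
    "benzaldoxime": ["benzaldoxime", "aldoxime", "oxime"],
    "butane": ["butane", "but", "ane"],
    "nembutal": ["nembutal"],  # "but" is not a morpheme here, so it is not split out
}


def build_index(segments, min_freq=1):
    """Map each full name, and each sub-term seen in at least min_freq names, to the names containing it."""
    counts = defaultdict(int)
    for terms in segments.values():
        for term in set(terms):
            counts[term] += 1
    index = defaultdict(set)
    for name, terms in segments.items():
        for term in terms:
            if term == name or counts[term] >= min_freq:
                index[term].add(name)
    return dict(index)


index = build_index(SEGMENTS, min_freq=2)
print(sorted(index["oxime"]))
```

Searching for "oxime" reaches both aldoximes, which a whole-word text index would miss, while "nembutal" is never reachable through "but" because the segmentation does not treat it as a sub-term. Raising `min_freq` keeps rare sub-terms out of the index, which addresses the index size concern.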

Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm

Expert Recommendation - CiteSeerX

• Built on top of millions of papers in CiteSeerX.
• A similar system was developed for Dow Chemical.
• Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
• Finds an expert based on their publications.
• Many approaches:
  • Keyphrases
  • Citations
  • Download count
  • Affiliation

http://seerseer.ist.psu.edu (new version CSSeer)

Treeratpituk, Chen, JCDL’13

Future Work

Lots of interesting work to do! Few computer/machine learning scientists involved.

• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
  • spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features

Questions for Us?

• http://elisq.qu.edu.qa/

[email protected]

• http://fox.cs.vt.edu
