SubSift web services and workflows for profiling and comparing scientists and their published works
Simon Price, Peter Flach, Sebastian Spiegler, Christopher Bailey and Nikki Rogers
Outline of this paper
1. SubSift – submission sifting
2. Background Theory: Vector Space Model
3. SubSift REST API
4. Demonstration Workflows
5. Conclusions
1. SubSift – submission sifting
SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website: http://subsift.ilrt.bris.ac.uk
Contribution of this work
SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol
Re-usable workflows for profiling and comparing scientists and their published works.
Tool for constructing, manipulating and publishing document-centric datasets.
Related Work
• SubSift uses techniques more normally associated with Information Retrieval.
• Full-text search tools support text matching on large-scale document collections, e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch. They are designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
  • Exposes detailed metadata throughout.
  • Partly a research tool: need to plug in and instrument new algorithms.
  • Fewer licensing restrictions and dependencies for open source.
2. Background Theory: Vector Space Model
Vector Space Model (from Information Retrieval)
Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting
For a query (q), rank the documents (dj) in collection (D) by descending similarity to the query.
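The ranking step above can be sketched in a few lines of Python. This is a minimal illustration, not SubSift's implementation: the whitespace tokenizer, the log(N / df) form of the idf weight, the choice to include the query when counting document frequencies, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())


def tfidf_vectors(bags):
    """Weight each term count by log(N / df), df = number of bags containing the term."""
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return [{t: tf * math.log(n / df[t]) for t, tf in bag.items()} for bag in bags]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (similarity, document index) pairs, most similar first."""
    # Assumption: the query is treated as one more bag when computing idf.
    bags = [bag_of_words(d) for d in documents] + [bag_of_words(query)]
    vectors = tfidf_vectors(bags)
    q_vec = vectors[-1]
    scores = [(cosine(q_vec, d_vec), j) for j, d_vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [
        "kernel methods for text classification",
        "protein folding simulation",
        "text mining and classification of abstracts",
    ]
    for score, j in rank("text classification", docs):
        print(f"{score:.3f}  {docs[j]}")
```

In the peer review setting, the "query" would be a submitted abstract and the "documents" the reviewers' publication profiles, or the reverse.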
Vector Space Model: bag-of-words representation
[Figure labels: number of terms in each abstract; number of terms in the DBLP author page of each PC member]
Representational State Transfer (REST)
“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to the usual Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV
REST is a design pattern for web services based on HTTP using its familiar URIs, requests, responses, authentication, etc.
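The method-to-CRUD correspondence can be illustrated with a toy in-memory resource store. This sketch makes no network calls and does not reproduce the SubSift API: the class name, the URI, and the status codes chosen for each case are assumptions for illustration, and real services often POST to a collection URI rather than to the resource itself.

```python
class ResourceStore:
    """In-memory stand-in for a RESTful service: URIs name resources, methods act on them."""

    def __init__(self):
        self._resources = {}

    def handle(self, method, uri, body=None):
        """Return a (status_code, body) pair for one request."""
        exists = uri in self._resources
        if method == "POST":  # Create
            if exists:
                return 409, None  # Conflict: resource already exists
            self._resources[uri] = body
            return 201, body
        if method == "GET":  # Read
            return (200, self._resources[uri]) if exists else (404, None)
        if method == "PUT":  # Update
            if not exists:
                return 404, None
            self._resources[uri] = body
            return 200, body
        if method == "DELETE":  # Delete
            if not exists:
                return 404, None
            del self._resources[uri]
            return 204, None
        return 405, None  # Method not allowed


if __name__ == "__main__":
    store = ResourceStore()
    uri = "/example_user/documents/pc_members"  # hypothetical URI
    print(store.handle("POST", uri, {"description": "PC members"}))
    print(store.handle("GET", uri))
    print(store.handle("PUT", uri, {"description": "Programme committee"}))
    print(store.handle("DELETE", uri))
    print(store.handle("GET", uri))
```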
3. SubSift REST API
4. Demonstration Workflows
Clustering staff based on homepage similarity
Dendrogram produced in Matlab from SubSift generated similarity matrix
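The dendrogram on the slide was produced in Matlab. As a rough illustration of how a similarity matrix drives such a clustering, here is a pure-Python single-linkage sketch. The linkage criterion, the function name, and the toy names and similarity values are assumptions; the slide does not state which linkage or settings were used.

```python
def agglomerate(names, similarity):
    """Repeatedly merge the two most similar clusters (single linkage).

    Returns the merges in order as (cluster_a, cluster_b, similarity) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [(name,) for name in names]
    index = {name: i for i, name in enumerate(names)}

    def linkage(a, b):
        # Single linkage: similarity of the closest pair across the two clusters.
        return max(similarity[index[x]][index[y]] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        score, i, j = max(
            ((linkage(a, b), i, j)
             for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j),
            key=lambda t: t[0],
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, score))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a + b]
    return merges


if __name__ == "__main__":
    staff = ["A", "B", "C", "D"]  # hypothetical staff members
    sim = [  # made-up symmetric similarity values
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.1],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
    for a, b, score in agglomerate(staff, sim):
        print(a, "+", b, "at similarity", score)
```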
Profiling a research group by its publications
Diagram produced in Wordle using SubSift profile data
Future Work
• Scaling-up
  • Currently a small-scale web application running on modest hardware.
  • Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
  • Mining and mapping the University of Bristol’s research landscape.
  • Crawling the University’s web pages to profile and visualise research interests of and similarities between faculty, departments, research groups and researchers.
  • Plans to apply to websites of other Universities.
5. Conclusions
Conclusion
• SubSift services are useful outside the peer review domain.
• Workflows for profiling/comparing scientists
  • Promising e-Science and e-Research use cases for profiling and comparing scientists and their published works.
• Tool for constructing, manipulating and publishing document-centric datasets
  • E.g. for information retrieval, data mining and pattern analysis research.
  • Publication of datasets in this way supports reproducibility of science.
  • Connects data through Linked Data and the Semantic Web.