ShareSci: A Research Paper Networking Site
Status Report - April 25, 2017
Mike D'Arcy (2553280)
Utkarsh Patel (2623325)
Introduction
We proposed ShareSci, a research paper library site where users can find, discuss, and contribute
scholarly articles. Our primary goal is to improve on existing sites like ResearchGate by putting a
greater focus on content than on users, and streamlining the process of finding and interacting with
the uploaded articles and data. In this document, we will report the end-of-semester status of the
project, and provide detailed documentation on how to replicate the site on a different server.
Site Status
In our proposal, we stated the core features that we wanted to implement this semester, which form
the basis for a functioning site. These can be broadly divided into the user account system and the
article management system. For user accounts, we had aimed to create a system in which users
could log in with a username and password and manage all of their account information, such as
their password, email, name, institution, and personal biography. For articles, we had aimed to find
a method for obtaining existing article metadata in bulk, to allow users to upload their own
articles, to allow users to view and download articles, and to support keyword search over our
article database. We are pleased to report that all of these features have been completed.
Tools and Setup
We used a variant of a MEAN stack for our site. NodeJS Express was used for the webserver,
MongoDB for the article database, and AngularJS for the frontend. PostgreSQL was used for our
user database, as we deemed ACID compliance more important for this type of data than the
horizontal scalability and flexible schema offered by MongoDB. The site runs on an Ubuntu 16.04
server, although it should be capable of running on any Linux based server. Beyond this core
framework, we also used Git for version control and deployment. We will now describe the setup
procedures for each tool individually, along with the main references we consulted to use them.[1]
Note that the setup of the base OS is beyond the scope of this document, but there is effectively no
deviation from the official Ubuntu installation instructions found at:
https://www.ubuntu.com/download/desktop/install-ubuntu-desktop
[1] Only the major references are listed in this section. For a complete list, see the References section at the end of the report.
Git
Primary references: https://git-scm.com/book/en/v2, https://developer.github.com/webhooks/
We created a repository on GitHub to act as the central repository for our project. To enable
deployment to our production server, we created a webhook on GitHub that automatically sends a
JSON request to our server whenever a commit is pushed to the repository. We created a script on
the server that listens for this request, and then automatically pulls the latest stable server code from
the master branch and starts the site server. The instructions for setting up a GitHub webhook can
be found in the second reference listed above, and the script we used to receive the webhook events
is available in our source code repository: https://github.com/mdarcy220/sharesci
NodeJS (v4.2.6)
Primary references: https://nodejs.org/api/, https://github.com/vitaly-t/pg-promise,
http://mongodb.github.io/node-mongodb-native/2.2/, http://expressjs.com/en/4x/api.html
Setting up NodeJS consisted primarily of using package managers. We first installed NodeJS and
NPM with the command: sudo apt install nodejs npm. We then installed the necessary
community-developed NodeJS modules. We specify these in a package.json file in our repository
(to ensure consistent versioning and for convenient installation), which can be found in our source
code repository. With this file, we can simply run npm install to update all of our dependencies.
One caveat, due to the way Angular is set up, is that there are actually two package.json files: one
in the root of our repository and one under the clients/ directory. The npm install command
must be run from both locations to install the necessary modules for both Angular 2 and the core
server.
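The full dependency setup can be summarized as follows (a sketch; paths assume the repository layout described above):

```shell
# Install NodeJS and npm, then the project's dependencies.
sudo apt install nodejs npm
cd sharesci            # root of the cloned repository
npm install            # core server dependencies (package.json in the root)
cd clients
npm install            # Angular 2 dependencies (clients/package.json)
```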
MongoDB (v2.6.10)
Primary reference: http://mongodb.github.io/node-mongodb-native/2.2/
There was very little setup here. Just run: sudo apt install mongodb. Our data collection
procedure will be described separately. After collecting data, we enabled full text search of articles
by indexing our article collection: db.papers.createIndex({"$**": "text"}).
PostgreSQL (v9.5.6)
Primary reference: https://www.postgresql.org/docs/9.5/static/index.html
Again, setup was simple: sudo apt install postgresql. Then we ran psql to create a new user
for our database: CREATE ROLE sharesci WITH LOGIN ENCRYPTED PASSWORD 'sharesci';.
Create the database: CREATE DATABASE sharesci;. Finally: GRANT ALL ON DATABASE sharesci
TO sharesci;. The database can now be used with the connection string:
postgres://sharesci:sharesci@localhost/sharesci. The creation of our schema and stored
procedures was automated, and can be found in our source code under
scripts/pg_db_schema_setup.sql. The schema can be generated by running psql <
scripts/pg_db_schema_setup.sql.
Angular 2
Primary reference: https://angular.io/docs/ts/latest/
Setting up Angular for local development is a relatively complex procedure. The standard procedure
is explained in detail at https://angular.io/docs/ts/latest/guide/setup.html. After following the
standard setup of Angular 2, we installed community-developed Node components for Angular. The
list of all required components can be found in clients/npm-shrinkwrap.json from our source
repository.
Hosting
Our site is currently hosted on a DigitalOcean cloud server (1 CPU, 1 GB RAM, 20 GB disk). There
isn't much else to say about the hosting, but we explain some of our domain security in the HTTPS
section below.
HTTPS
Primary reference: https://letsencrypt.org/getting-started/
Our site uses HTTPS to ensure the security of our users, which requires obtaining a pair of public
and private keys for use by the SSL/TLS protocol. The public key must be digitally signed by a
trusted certificate authority (CA) to verify the authenticity of the site. We purchased a domain
(sharesci.org) from Google Domains and obtained our certificates from Let's Encrypt. This
consisted of simply following the instructions from their website, and the procedure may vary for
different platforms so we will not reprint it here. The most vital note in this process is that the
private key should be kept absolutely secret to ensure security. Only highly trusted administrator
accounts should have permissions to read it.
CAs will usually not provide certificates for connecting to the site via IP address, so if the server is
run without owning a domain name the administrator should generate their own keys and self-sign
them. This is inconvenient for the end user because most browsers will automatically block access
to sites using self-signed certificates and require the user to manually override the authenticity
check. Because we got certificates from a CA, we will omit instructions for self-signing, but
interested readers may find the following link helpful:
https://www.digitalocean.com/community/tutorials/openssl-essentials-working-with-ssl-certificates-private-keys-and-csrs
Final Setup
After all of the above tools have been installed and configured, the server can be started by
changing to the root of the source tree and running sudo nodejs server.js (sudo is needed to
allow the server to bind to ports 80 and 443, as well as access the TLS certificates). It is likely
desirable to seed the article database as part of the setup process, for which we point readers to the
Data Collection section.
Server Side Implementation
Code Structure
To manage the complexity of our site, we have attempted to maintain a modular code structure. In
the root of our source tree, we have several directories, each containing a different part of the site's
functions. The clients/ directory holds all the files for Angular 2 and the client-side components
of the site. The controllers/ directory contains the code for all of our REST API endpoints. The
routes/ directory is responsible solely for assigning each incoming request to an appropriate
controller. The util/ directory contains code that is used by multiple separate components. The
scripts/ directory contains miscellaneous script files that are not directly included in the core site,
but are useful for administration (these include the arXiv harvesting script and the script triggered
by the Github webhook). Finally, the uploads/ directory is simply an empty directory to be used by
the server for storing uploaded PDFs and other large data items.
User Accounts
Our REST API contains endpoints for several account-related features, including account creation,
login, logout, profile updates, and password changes. Account creation and profile updates are
simple wrappers around the database, so there is little merit in explaining them here beyond
pointing to the relevant source code.
Logins are session-based, and we manage sessions using the express-session module for NodeJS.
We securely store the users' passwords in the database, using bcrypt for salting and hashing.
Logging in consists of sending credentials to the server, which then returns a session cookie.
Logging out simply deletes the session on the server side.
One technicality that arises from the session-based login scheme is that a few of our REST API
endpoints are not actually RESTful. Requests such as changing a user's password require being
logged in, which means these requests are not entirely stateless. This has no significant
consequences at this stage of development, but in the future we may consider using token-based
authentication to create a truly stateless and RESTful API.
Article Searching
We use the built-in text indexing and searching features in MongoDB to enable searching for
articles. An unfortunate drawback is that MongoDB allows only one text index per collection, so
there is no way to make the search granular to the level of separating title, author, or full-text
searches. However, using the built-in algorithm allowed us to develop the site much more quickly
and resulted in a more robust and powerful algorithm than we likely could have developed in just a
few weeks. Improving upon this search algorithm is a potential topic of future research.
When a request is sent to the search endpoint of the API, the backend code feeds the query to
MongoDB, which searches based on all textual fields in each document. This most notably includes
title, author, abstract, and full text. It then sorts based on the relevance score for each result. We
included parameters to the API for controlling the number of results returned, so the client could
request only the 10 most relevant results, for example. This is very important for improving load
times, as there are often over 100,000 results for a query.
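To illustrate, the sketch below assembles the arguments for such a query using MongoDB's $text operator and textScore relevance metadata; buildTextSearch is a hypothetical helper shown for clarity, not code from our repository.

```javascript
// Sketch: build the arguments for a relevance-sorted MongoDB text search.
// Relies on the text index created in the Tools and Setup section.
function buildTextSearch(queryString, maxResults) {
	return {
		filter: { $text: { $search: queryString } },
		projection: { score: { $meta: 'textScore' } },
		sort: { score: { $meta: 'textScore' } },
		limit: maxResults
	};
}

// With the native driver (given an open `db` handle), this would be used as:
//   var q = buildTextSearch('deep learning', 10);
//   db.collection('papers').find(q.filter, q.projection)
//       .sort(q.sort).limit(q.limit).toArray(callback);
```

Capping the result count with limit() is what keeps response sizes manageable when a query matches 100,000+ documents.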
Article Uploading
Article uploads are handled in part by the Multer middleware for NodeJS/Express. File uploads in
POST requests require using the multipart/form-data encoding, as opposed to the default
urlencoded encoding. Multer parses this encoding and stores the uploaded file in our uploads/
directory. Then, we insert the filename of the upload along with the metadata into MongoDB. We
additionally store the original filename and the MIME type of the data (which are typically given in
the file upload request). Storing these allows us to maintain the user's original filename and filetype,
while allowing Multer to assign a unique filename to each upload.
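The resulting MongoDB document can be sketched as follows. makeUploadDoc is a hypothetical helper for illustration, and the exact fields we store may differ slightly; the file object shape matches what Multer attaches to the request as req.file.

```javascript
// Sketch: build the MongoDB document for an uploaded file from the
// file object that Multer attaches to the request (req.file).
function makeUploadDoc(file, articleMeta) {
	return {
		filename: file.filename,           // unique name assigned by Multer
		originalname: file.originalname,   // name the user uploaded it as
		mimetype: file.mimetype,           // e.g. 'application/pdf'
		meta: articleMeta                  // title, authors, abstract, ...
	};
}

// In an Express route with Multer configured (illustrative):
//   app.post('/upload', upload.single('article'), function (req, res) {
//       db.collection('papers').insertOne(makeUploadDoc(req.file, req.body));
//   });
```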
We want users to be able to find their articles, so after a PDF is uploaded, we use the pdftotextjs
NodeJS module to extract text from the PDF. This module is simply a wrapper around the pdftotext
application for Linux. It currently supports embedded text only (no OCR), but we believe it is very
feasible to add OCR capabilities in the future. The extracted text is stored in MongoDB, so that it
will be indexed and enable searching for the uploaded article.
Article Downloading
This is relatively simple. Given an article ID, we can find the filename in the database and serve it
to the client. The main point of note is that we must adjust the Content-Type response header to
indicate the original type of the file, and the Content-Disposition response header with the
filename= parameter specified to indicate the original filename for download.
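As a sketch, the two headers can be assembled as follows (downloadHeaders is a hypothetical helper for illustration):

```javascript
// Sketch: response headers needed to serve a stored upload back to the
// client under its original name and type.
function downloadHeaders(mimetype, originalname) {
	return {
		'Content-Type': mimetype,
		// filename= tells the browser what to call the saved file
		'Content-Disposition': 'attachment; filename="' + originalname + '"'
	};
}

// In an Express handler this might be used as:
//   res.set(downloadHeaders(doc.mimetype, doc.originalname));
//   res.sendFile(path.join(__dirname, 'uploads', doc.filename));
```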
Client Side Implementation
Code Structure
Primary Reference: https://angular.io/styleguide
We follow the recommended Angular 2 style guide to structure our client-side code, and we follow
its standard naming conventions for all client-side files. Due to the small size of our application,
all of the components are currently under one root module. As the application grows, we will
refactor our code and separate each feature into its own Angular module.
Components
Our application is a single-page application: all views are part of a single HTML page, and the
view is dynamically updated as the user interacts with the application. The application has a total
of seven views, and each view is controlled by a separate component and styled by a separate
stylesheet. The components are:
1. Home
2. Login
3. Account
4. Search-Results
5. Article
6. Upload
7. Create-Account
Besides the seven main components listed above, we also have a navbar component that is used by
all of the main components.
Services
The client communicates with the server through Angular services, which the components consume
to send and receive data. Currently, the client has a total of four services that communicate with
the server. The authentication service is used to log users in and out. The article service is used
to get the metadata and the PDF of an article by posting its article_id to the article API. The
search-result service is used to get the list of articles matching a user's search query. The
account service is used to get and modify user account information such as name, username, and
institution. Besides these four main services, we also developed an internal service to share data
between components; it does not communicate with the server and is used solely to exchange data
between different components.
Entities
The server API sends data to the client in a specific format, and the structure of all server-side
entities is also defined on the client side using TypeScript interfaces. Having a common definition
of entities on both the server side and the client side makes it easier to map a JSON object sent
by the server onto a TypeScript interface, and also reduces the possibility of error. The client
currently has three entities defined: Article, User, and Search-Results.
Data Collection
Primary reference: https://www.openarchives.org/OAI/openarchivesprotocol.html
For our testing, we collected metadata for 800,000 papers from the arXiv article database using the
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). These papers were
inserted into our MongoDB instance. OAI-PMH is an open protocol designed to facilitate metadata
exchange in a more efficient way than web scraping, to save resources for both servers and
harvesters.
The data returned by requests to an OAI-PMH API is in XML format, which means we had to
convert it to JSON for use with MongoDB. We used the cheerio module to do some preliminary
XPath selection and pruning of the responses, which were returned in batches of 1000 papers (the
protocol specifies "resumption tokens" so large result sets can be downloaded in batches). This
module allowed us to select the relevant article-metadata sections from the result set and prepare
them for conversion to JSON. We then used xml2js to convert the data to JSON and inserted it into
the database.
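The batch-processing step can be sketched as follows. extractBatch is a hypothetical helper, and the object shape it consumes is a simplified illustration of parsed output rather than the exact structure xml2js produces for an OAI-PMH response:

```javascript
// Sketch: pull the records and the resumption token out of one parsed
// ListRecords response (simplified shape; the real response nests these
// under OAI-PMH > ListRecords, and xml2js wraps elements in arrays).
function extractBatch(listRecords) {
	var records = listRecords.record || [];
	var tokenElem = listRecords.resumptionToken;
	// An empty or absent token means this was the final batch
	var token = (tokenElem && tokenElem.length > 0) ? tokenElem[0] : null;
	return { records: records, resumptionToken: token };
}

// The harvesting loop then repeats: request a batch, extractBatch(), insert
// the records into MongoDB, wait, and request again with the resumption
// token until it comes back null.
```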
Lastly, because we were harvesting such a large amount of data, we wanted to avoid putting too
much load on the arXiv server, so we limited our requests to one per 30 seconds. Even at this rate,
we were able to get metadata for all papers since 2010 by allowing the script to run overnight.
We could have obtained more, but decided not to because we were using a low-powered server.
The full harvesting script can be found in our source code under scripts/harvest_arxiv.js.
Conclusion
We have created ShareSci, a content-focused research paper sharing site. While there are numerous
paths for future improvement, including (but not limited to) the higher-tier features listed in our
original proposal, we have made significant progress as we developed the site throughout the
semester. In this report, we have documented the tools used, setup procedures, and the high-level
design and implementation strategies we used in building the site. For readers interested in the
precise details of the implementation, we recommend taking a look at our source code repository on
GitHub: https://github.com/mdarcy220/sharesci.[2]
[2] Note that we do not include the source code in the print submission, as it is more than 100 pages in length. In our electronic submission, we will include a PDF compilation of the source code.
References
Naturally, it would not be feasible to provide links to every single source viewed during the
development of this site, but here we do provide links to sources and documentation which were
actively consulted in the design and implementation.
● File structure and routing pattern partially inspired by:
https://www.caffeinecoding.com/better-express-routing-for-nodejs/
● pg-promise NodeJS module documentation: https://github.com/vitaly-t/pg-promise
● PostgreSQL documentation: https://www.postgresql.org/docs/9.6/static/index.html
● MongoDB NodeJS module documentation:
http://mongodb.github.io/node-mongodb-native/2.2/
● ResearchGate had an important role in some design decisions: https://www.researchgate.net
● Angular 2 documentation: https://angular.io/docs/ts/latest/guide/
● NodeJS Express reference: http://expressjs.com/en/4x/api.html
● Sessions for Express: https://github.com/expressjs/session
● Cheerio documentation: https://github.com/cheeriojs/cheerio
● Xml2js documentation: https://github.com/Leonidas-from-XIV/node-xml2js
● Bcrypt NodeJS documentation: https://www.npmjs.com/package/bcrypt
● Multer documentation: https://www.npmjs.com/package/multer
● Body-parser for Express: https://github.com/expressjs/body-parser
● Let's Encrypt Getting Started: https://letsencrypt.org/getting-started/
● OAI-PMH specification: https://www.openarchives.org/OAI/openarchivesprotocol.html
● arXiv OAI-PMH interface: https://arxiv.org/help/oa/index
● Postman (for API testing): https://www.getpostman.com/
● Force HTTPS: http://stackoverflow.com/a/10715802
● NodeJS documentation: https://nodejs.org/docs/latest-v4.x/api/
● Promises in JavaScript:
○ https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise
○ https://ponyfoo.com/articles/es6-promises-in-depth
● Best practices and tool selection:
○ http://stackoverflow.com/a/10323055
○ http://stackoverflow.com/a/630475
○ http://stackoverflow.com/a/2894665
○ https://dba.stackexchange.com/a/74313
● Angular 2 Pagination:
http://jasonwatmore.com/post/2016/08/23/angular-2-pagination-example-with
./server.js Page 1
const express = require('express'),
      express_session = require('express-session'),
      https = require('https'),
      http = require('http'),
      fs = require('fs'),
      rootRouter = require('./routes/index');

const app = express();

var https_ok = true;
var https_options = {};
try {
	https_options = {
		key: fs.readFileSync('/etc/letsencrypt/live/sharesci.org/privkey.pem'),
		cert: fs.readFileSync('/etc/letsencrypt/live/sharesci.org/cert.pem')
	};
} catch (err) {
	https_ok = false;
	if (err.errno === -13 && err.syscall === 'open') {
		console.error('Access permissions denied to SSL certificate files.'
			+ ' HTTPS will not be available. Try running as root.');
	} else {
		console.error(err);
	}
}

app.use('/', (req, res, next) => {
	if (!req.secure && https_ok) {
		return res.redirect(['https://', req.get('Host'), req.url].join(''));
	}
	next();
});
app.use(express_session({
	secret: require('crypto').randomBytes(64).toString('base64'),
	resave: false,
	saveUninitialized: false,
	httpOnly: true,
	secure: true,
	ephemeral: true,
	cookie: { maxAge: 16*60*60*1000 }
}));
app.use('/', rootRouter);
app.use('/', express.static(__dirname + '/client'));

http.createServer(app).listen(80);
if (https_ok) {
	https.createServer(https_options, app).listen(443);
}
./README.md Page 1
# ShareSci

ShareSci is a research publication and discussion website, currently being developed as a school
project by Mike D'Arcy and Utkarsh Patel. We aim to make it easier for researchers to find papers
related to their field of study, to discuss the contributions and flaws of studies, and to publish
their own studies (even negative results).
./controllers/account.js Page 1
const pgdb = require('../util/sharesci-pg-db'),
      bcrypt = require('bcrypt'),
      validator = require('../util/account_info_validation');

function index(req, res) {
	// TODO: change the redirect based on whether the user is logged in
	res.redirect('/login');
}

function createAction(req, res) {
	var responseObj = {
		errno: 0,
		errstr: ""
	};

	function onInsertComplete(data) {
		responseObj.errno = 0;
		responseObj.errstr = "";
		res.json(responseObj);
		res.end();
	}

	function onInsertFailed(err) {
		if (err.code === '23505') {
			// Violated 'UNIQUE' constraint, so username was already in use
			responseObj.errno = 8;
			responseObj.errstr = "Account already exists";
			res.json(responseObj);
		} else {
			console.error(err);
			responseObj.errno = 1;
			responseObj.errstr = "Unknown error";
			res.json(responseObj);
		}
		res.end();
	}

	const valuesPromise = new Promise((resolve, reject) => {
		values_from_request(req, resolve, reject);
	});
	valuesPromise.then((values) => {
		return new Promise((resolve, reject) => {
			insertValues(values, resolve, reject);
		}).then(onInsertComplete, onInsertFailed);
	})
	.catch((err) => {
		// values_from_request rejected; err holds the validation errno/errstr
		responseObj.errno = err.errno;
		responseObj.errstr = err.errstr;
		res.json(responseObj);
		res.end();
	});
}
function insertValues(values, resolve, reject) {
	const query = 'INSERT INTO account '
		+ '(username, passhash, firstname, lastname, self_bio, institution) '
		+ 'VALUES (${username}, ${passhash}, ${firstname}, ${lastname}, ${self_bio}, ${institution});';
	pgdb.any(query, values)
		.then((data) => {
			resolve(data);
		})
		.catch((err) => {
			reject(err);
		});
}
// Sets up values for insertion into the database and validates them.
// Calls `resolve` with a JSON object containing the values on success;
// calls `reject` with a JSON object containing error info on failure.
function values_from_request(req, resolve, reject) {
	if (!req.body.password) {
		reject({errno: 6, errstr: 'Missing password'});
		return;
	}
	var passsalt = bcrypt.genSaltSync(10);
	var passhash = bcrypt.hashSync(req.body.password, passsalt);
	var values = {
		'username': req.body.username,
		'passhash': passhash,
		'firstname': req.body.firstname,
		'lastname': req.body.lastname,
		'self_bio': req.body.self_bio,
		'institution': req.body.institution
	};

	for (var key in values) {
		if (typeof values[key] === 'undefined') {
			values[key] = null;
		}
	}

	// Validate values
	if (!validator.is_valid_username(values['username'])) {
		reject({errno: 2, errstr: 'Invalid username'});
		return;
	}
	if (!validator.is_valid_password(req.body.password)) {
		reject({errno: 3, errstr: 'Invalid password'});
		return;
	}
	if (!validator.is_valid_firstname(values['firstname'])) {
		reject({errno: 6, errstr: 'Invalid firstname'});
		return;
	}
	if (!validator.is_valid_lastname(values['lastname'])) {
		reject({errno: 6, errstr: 'Invalid lastname'});
		return;
	}
	if (!validator.is_valid_institution(values['institution'])) {
		reject({errno: 6, errstr: 'Invalid institution'});
		return;
	}
	if (!validator.is_valid_self_bio(values['self_bio'])) {
		reject({errno: 6, errstr: 'Invalid self-biography'});
		return;
	}

	resolve(values);
}
module.exports = {
	index: index,
	createAction: createAction
};
./controllers/login.js Page 1
const pgdb = require('../util/sharesci-pg-db'),
      bcrypt = require('bcrypt');

function loginAction(req, res) {
	var responseObj = {
		errno: 0,
		errstr: ""
	};
	if (req.session.user_id) {
		responseObj.errno = 4;
		responseObj.errstr = "Already logged in";
		res.json(responseObj);
		res.end();
		// Don't fall through to the credential check below
		return;
	}
	pgdb.func('get_user_passhash', [req.body.username])
		.then((data) => {
			if (bcrypt.compareSync(req.body.password, data[0]['passhash'])) {
				req.session.user_id = req.body.username;
				responseObj.errno = 0;
				responseObj.errstr = "";
				res.json(responseObj);
			} else {
				responseObj.errno = 3;
				responseObj.errstr = "Incorrect password";
				res.json(responseObj);
			}
			res.end();
		})
		.catch((err) => {
			if (err.received === 0) {
				console.log('Invalid username \'' + req.body.username + '\' tried to log in.');
				responseObj.errno = 2;
				responseObj.errstr = "Incorrect username";
			} else {
				console.error(err);
				responseObj.errno = 1;
				responseObj.errstr = "Unknown error";
			}
			res.json(responseObj);
			res.end();
		});
}

function loginPage(req, res) {
	res.redirect('/');
	res.end();
}

module.exports = {
	loginAction: loginAction,
	loginPage: loginPage
};
./controllers/logout.js Page 1
function logoutAction(req, res) {
	delete req.session['user_id'];
	req.session.destroy();
	if (req.body.successRedirect) {
		res.redirect(req.body.successRedirect);
	} else {
		res.redirect('/');
	}
}

module.exports = {
	logoutAction: logoutAction,
};