ShareSci: A Research Paper Networking Site
Status Report - April 25, 2017
Mike D'Arcy (2553280)
Utkarsh Patel (2623325)
Introduction
We proposed ShareSci, a research paper library site where users can find, discuss, and contribute
scholarly articles. Our primary goal is to improve on existing sites like ResearchGate by putting a
greater focus on content than on users, and streamlining the process of finding and interacting with
the uploaded articles and data. In this document, we will report the end-of-semester status of the
project, and provide detailed documentation on how to replicate the site on a different server.
Site Status
In our proposal, we stated the core features that we wanted to implement this semester, which form
the basis for a functioning site. These can be broadly divided into the user account system and the
article management system. For user accounts, we had aimed to create a system in which users
could log in with a username and password and manage all of their account information, such as
their password, email, name, institution, and personal biography. For articles, we had aimed to find
a method for obtaining existing article metadata in bulk, to allow users to upload their own
articles, to allow users to view and download articles, and to support keyword search over our
article database. We are pleased to report that all of these features have been completed.
Tools and Setup
We used a variant of a MEAN stack for our site. NodeJS Express was used for the webserver,
MongoDB for the article database, and AngularJS for the frontend. PostgreSQL was used for our
user database, as we deemed ACID compliance more important for this type of data than the
horizontal scalability and flexible schema offered by MongoDB. The site runs on an Ubuntu 16.04
server, although it should be capable of running on any Linux based server. Beyond this core
framework, we also used Git for version control and deployment. We will now describe the setup
procedures for each tool individually, along with the main references we consulted to use them.[1]
Note that the setup of the base OS is beyond the scope of this document, but there is effectively no
deviation from the official Ubuntu installation instructions found at:
https://www.ubuntu.com/download/desktop/install-ubuntu-desktop
[1] Only the major references are listed in this section. For a complete list, see the References section at the end of the report.
Git
Primary references: https://git-scm.com/book/en/v2, https://developer.github.com/webhooks/
We created a repository on GitHub to act as the central repository for our project. To enable
deployment to our production server, we created a webhook on GitHub that automatically sends a
JSON request to our server whenever a commit is pushed to the repository. We created a script on
the server that listens for this request, and then automatically pulls the latest stable server code from
the master branch and starts the site server. The instructions for setting up a GitHub webhook can
be found in the second reference listed above, and the script we used to receive the webhook events
is available in our source code repository: https://github.com/mdarcy220/sharesci
NodeJS (v4.2.6)
Primary references: https://nodejs.org/api/, https://github.com/vitaly-t/pg-promise,
http://mongodb.github.io/node-mongodb-native/2.2/, http://expressjs.com/en/4x/api.html
Setting up NodeJS consisted primarily of using package managers. We first installed NodeJS and
NPM with the command: sudo apt install nodejs npm. We then installed the necessary
community-developed NodeJS modules. We specify these in a package.json file in our repository
(to ensure consistent versioning and for convenient installation), which can be found in our source
code repository. With this file, we can simply run npm install to update all of our dependencies.
One caveat, due to the way Angular is set up, is that there are actually two package.json files: one
in the root of our repository and one under the clients/ directory. The npm install command
must be run from both locations to install the necessary modules for both Angular 2 and the core
server.
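The full dependency setup can be summarized as follows (a sketch; paths assume the repository layout described above):

```shell
# Install NodeJS and npm, then the project's dependencies.
sudo apt install nodejs npm
cd sharesci            # root of the cloned repository
npm install            # core server dependencies (package.json in the root)
cd clients
npm install            # Angular 2 dependencies (clients/package.json)
```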
MongoDB (v2.6.10)
Primary reference: http://mongodb.github.io/node-mongodb-native/2.2/
There was very little setup here. Just run: sudo apt install mongodb. Our data collection
procedure will be described separately. After collecting data, we enabled full text search of articles
by indexing our article collection: db.papers.createIndex({"$**": "text"}).
PostgreSQL (v9.5.6)
Primary reference: https://www.postgresql.org/docs/9.5/static/index.html
Again, setup was simple: sudo apt install postgresql. Then we ran psql to create a new user
for our database: CREATE ROLE sharesci WITH LOGIN ENCRYPTED PASSWORD 'sharesci';.
Create the database: CREATE DATABASE sharesci;. Finally: GRANT ALL ON DATABASE sharesci
TO sharesci;. The database can now be used with the connection string:
postgres://sharesci:sharesci@localhost/sharesci. The creation of our schema and stored
procedures was automated, and can be found in our source code under
scripts/pg_db_schema_setup.sql. The schema can be generated by running psql <
scripts/pg_db_schema_setup.sql.
Angular 2
Primary reference: https://angular.io/docs/ts/latest/
Setting up Angular for local development is a relatively complex procedure. The standard procedure
is explained in detail at https://angular.io/docs/ts/latest/guide/setup.html. After following the
standard setup of Angular 2, we installed community-developed Node components for Angular. The
list of all required components can be found in clients/npm-shrinkwrap.json from our source
repository.
Hosting
Our site is currently hosted on a DigitalOcean cloud server (1 CPU, 1 GB RAM, 20 GB disk). There
isn't much else to say about the hosting, but we explain some of our domain security in the HTTPS
section below.
HTTPS
Primary reference: https://letsencrypt.org/getting-started/
Our site uses HTTPS to ensure the security of our users, which requires obtaining a pair of public
and private keys for use by the SSL/TLS protocol. The public key must be digitally signed by a
trusted certificate authority (CA) to verify the authenticity of the site. We purchased a domain
(sharesci.org) from Google Domains and obtained our certificates from Let's Encrypt. This
consisted of simply following the instructions from their website, and the procedure may vary for
different platforms so we will not reprint it here. The most vital note in this process is that the
private key should be kept absolutely secret to ensure security. Only highly trusted administrator
accounts should have permissions to read it.
CAs will usually not provide certificates for connecting to the site via IP address, so if the server is
run without owning a domain name the administrator should generate their own keys and self-sign
them. This is inconvenient for the end user because most browsers will automatically block access
to sites using self-signed certificates and require the user to manually override the authenticity
check. Because we got certificates from a CA, we will omit instructions for self-signing, but
interested readers may find the following link helpful:
https://www.digitalocean.com/community/tutorials/openssl-essentials-working-with-ssl-certificates-private-keys-and-csrs
Final Setup
After all of the above tools have been installed and configured, the server can be started by
changing to the root of the source tree and running sudo nodejs server.js (sudo is needed to
allow the server to bind to ports 80 and 443, as well as access the TLS certificates). It is likely
desirable to seed the article database as part of the setup process, for which we point readers to the
Data Collection section.
Server Side Implementation
Code Structure
To manage the complexity of our site, we have attempted to maintain a modular code structure. In
the root of our source tree, we have several directories, each containing a different part of the site's
functions. The clients/ directory holds all the files for Angular 2 and the client-side components
of the site. The controllers/ directory contains the code for all of our REST API endpoints. The
routes/ directory is responsible solely for assigning each incoming request to an appropriate
controller. The util/ directory contains code that is used by multiple separate components. The
scripts/ directory contains miscellaneous script files that are not directly included in the core site,
but are useful for administration (these include the arXiv harvesting script and the script triggered
by the Github webhook). Finally, the uploads/ directory is simply an empty directory to be used by
the server for storing uploaded PDFs and other large data items.
User Accounts
Our REST API contains endpoints for several account-related features, including account creation,
login, logout, profile updates, and password changes. Account creation and profile updates are
simple wrappers around the database, so there is little merit in explaining them here beyond
pointing to the relevant source code.
Logins are session-based, and we manage sessions using the express-session module for NodeJS.
We securely store the users' passwords in the database, using bcrypt for salting and hashing.
Logging in consists of sending credentials to the server, which then returns a session cookie.
Logging out simply deletes the session on the server side.
One technicality that arises from the session-based login scheme is that a few of our REST API
endpoints are not actually RESTful. Requests such as changing a user's password require being
logged in, which means these requests are not entirely stateless. This has no significant
consequences at this stage of development, but in the future we may consider using token-based
authentication to create a truly stateless and RESTful API.
Article Searching
We use the built-in text indexing and searching features in MongoDB to enable searching for
articles. An unfortunate drawback is that MongoDB allows only one text index per collection, so
there is no way to make the search granular to the level of separating title, author, or full-text
searches. However, using the built-in algorithm allowed us to develop the site much more quickly
and resulted in a more robust and powerful algorithm than we likely could have developed in just a
few weeks. Improving upon this search algorithm is a potential topic of future research.
When a request is sent to the search endpoint of the API, the backend code feeds the query to
MongoDB, which searches based on all textual fields in each document. This most notably includes
title, author, abstract, and full text. It then sorts based on the relevance score for each result. We
included parameters to the API for controlling the number of results returned, so the client could
request only the 10 most relevant results, for example. This is very important for improving load
times, as there are often over 100,000 results for a query.
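To illustrate, the sketch below assembles the arguments for such a query using MongoDB's $text operator and textScore relevance metadata; buildTextSearch is a hypothetical helper shown for clarity, not code from our repository.

```javascript
// Sketch: build the arguments for a relevance-sorted MongoDB text search.
// Relies on the text index created in the Tools and Setup section.
function buildTextSearch(queryString, maxResults) {
	return {
		filter: { $text: { $search: queryString } },
		projection: { score: { $meta: 'textScore' } },
		sort: { score: { $meta: 'textScore' } },
		limit: maxResults
	};
}

// With the native driver (given an open `db` handle), this would be used as:
//   var q = buildTextSearch('deep learning', 10);
//   db.collection('papers').find(q.filter, q.projection)
//       .sort(q.sort).limit(q.limit).toArray(callback);
```

Capping the result count with limit() is what keeps response sizes manageable when a query matches 100,000+ documents.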
Article Uploading
Article uploads are handled in part by the Multer middleware for NodeJS/Express. File uploads in
POST requests require using the multipart/form-data encoding, as opposed to the default
urlencoded encoding. Multer parses this encoding and stores the uploaded file in our uploads/
directory. Then, we insert the filename of the upload along with the metadata into MongoDB. We
additionally store the original filename and the MIME type of the data (which are typically given in
the file upload request). Storing these allows us to maintain the user's original filename and filetype,
while allowing Multer to assign a unique filename to each upload.
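The resulting MongoDB document can be sketched as follows. makeUploadDoc is a hypothetical helper for illustration, and the exact fields we store may differ slightly; the file object shape matches what Multer attaches to the request as req.file.

```javascript
// Sketch: build the MongoDB document for an uploaded file from the
// file object that Multer attaches to the request (req.file).
function makeUploadDoc(file, articleMeta) {
	return {
		filename: file.filename,           // unique name assigned by Multer
		originalname: file.originalname,   // name the user uploaded it as
		mimetype: file.mimetype,           // e.g. 'application/pdf'
		meta: articleMeta                  // title, authors, abstract, ...
	};
}

// In an Express route with Multer configured (illustrative):
//   app.post('/upload', upload.single('article'), function (req, res) {
//       db.collection('papers').insertOne(makeUploadDoc(req.file, req.body));
//   });
```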
We want users to be able to find their articles, so after a PDF is uploaded, we use the pdftotextjs
NodeJS module to extract text from the PDF. This module is simply a wrapper around the pdftotext
application for Linux. It currently supports embedded text only (no OCR), but we believe it is very
feasible to add OCR capabilities in the future. The extracted text is stored in MongoDB, so that it
will be indexed and enable searching for the uploaded article.
Article Downloading
This is relatively simple. Given an article ID, we can find the filename in the database and serve it
to the client. The main point of note is that we must adjust the Content-Type response header to
indicate the original type of the file, and the Content-Disposition response header with the
filename= parameter specified to indicate the original filename for download.
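As a sketch, the two headers can be assembled as follows (downloadHeaders is a hypothetical helper for illustration):

```javascript
// Sketch: response headers needed to serve a stored upload back to the
// client under its original name and type.
function downloadHeaders(mimetype, originalname) {
	return {
		'Content-Type': mimetype,
		// filename= tells the browser what to call the saved file
		'Content-Disposition': 'attachment; filename="' + originalname + '"'
	};
}

// In an Express handler this might be used as:
//   res.set(downloadHeaders(doc.mimetype, doc.originalname));
//   res.sendFile(path.join(__dirname, 'uploads', doc.filename));
```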
Client Side Implementation
Code Structure
Primary Reference: https://angular.io/styleguide
We follow the recommended Angular 2 style guide to structure our client-side code, and we follow
its standard naming conventions for all client-side files. Due to the small size of our application,
all of the components are currently under one root module. As the application grows, we will
refactor our code and separate each feature into its own Angular module.
Components
Our application is a single-page application: all views are part of a single HTML page, and the
view is dynamically updated as the user interacts with the application. The application has a total
of seven views, and each view is controlled by a separate component and styled by a separate
stylesheet. The components are:
1. Home
2. Login
3. Account
4. Search-Results
5. Article
6. Upload
7. Create-Account
Besides the seven main components listed above, we also have a navbar component that is used by
all of the main components.
Services
The client communicates with the server through Angular services, which the components consume
to send and receive data. Currently, the client has a total of four services that communicate with
the server. The authentication service is used to log users in and out. The article service is used
to get the metadata and the PDF of an article by posting its article_id to the article API. The
search-result service is used to get the list of articles matching a user's search query. The
account service is used to get and modify user account information such as name, username, and
institution. Besides these four main services, we also developed an internal service to share data
between components; it does not communicate with the server and is used solely to exchange data
between different components.
Entities
The server API sends data to the client in a specific format, and the structure of all server-side
entities is also defined on the client side using TypeScript interfaces. Having a common definition
of entities on both the server side and the client side makes it easier to map a JSON object sent
by the server onto a TypeScript interface, and also reduces the possibility of error. The client
currently has three entities defined: Article, User, and Search-Results.
Data Collection
Primary reference: https://www.openarchives.org/OAI/openarchivesprotocol.html
For our testing, we collected metadata for 800,000 papers from the arXiv article database using the
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). These papers were
inserted into our MongoDB instance. OAI-PMH is an open protocol designed to facilitate metadata
exchange in a more efficient way than web scraping, to save resources for both servers and
harvesters.
The data returned by requests to an OAI-PMH API is in XML format, which means we had to
convert it to JSON for use with MongoDB. We used the cheerio module to do some preliminary
XPath selection and pruning of the responses, which were returned in batches of 1000 papers (the
protocol specifies "resumption tokens" so large result sets can be downloaded in batches). This
module allowed us to select the relevant article-metadata sections from the result set and prepare
them for conversion to JSON. We then used xml2js to convert the data to JSON and inserted it into
the database.
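The batch-processing step can be sketched as follows. extractBatch is a hypothetical helper, and the object shape it consumes is a simplified illustration of parsed output rather than the exact structure xml2js produces for an OAI-PMH response:

```javascript
// Sketch: pull the records and the resumption token out of one parsed
// ListRecords response (simplified shape; the real response nests these
// under OAI-PMH > ListRecords, and xml2js wraps elements in arrays).
function extractBatch(listRecords) {
	var records = listRecords.record || [];
	var tokenElem = listRecords.resumptionToken;
	// An empty or absent token means this was the final batch
	var token = (tokenElem && tokenElem.length > 0) ? tokenElem[0] : null;
	return { records: records, resumptionToken: token };
}

// The harvesting loop then repeats: request a batch, extractBatch(), insert
// the records into MongoDB, wait, and request again with the resumption
// token until it comes back null.
```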
Lastly, because we were harvesting such a large amount of data, we wanted to avoid putting too
much load on the arXiv server, so we limited our requests to one per 30 seconds. Even at this rate,
we were able to get metadata for all papers since 2010 by allowing the script to run overnight.
We could have obtained more, but decided not to because we were using a low-powered server.
The full harvesting script can be found in our source code under scripts/harvest_arxiv.js.
Conclusion
We have created ShareSci, a content-focused research paper sharing site. While there are numerous
paths for future improvement, including (but not limited to) the higher-tier features listed in our
original proposal, we have made significant progress as we developed the site throughout the
semester. In this report, we have documented the tools used, setup procedures, and the high-level
design and implementation strategies we used in building the site. For readers interested in the
precise details of the implementation, we recommend taking a look at our source code repository on
GitHub: https://github.com/mdarcy220/sharesci.[2]
[2] Note that we do not include the source code in the print submission, as it is more than 100 pages in length. In our electronic submission, we will include a PDF compilation of the source code.
References
Naturally, it would not be feasible to provide links to every single source viewed during the
development of this site, but here we do provide links to sources and documentation which were
actively consulted in the design and implementation.
● File structure and routing pattern partially inspired by:
https://www.caffeinecoding.com/better-express-routing-for-nodejs/
● pg-promise NodeJS module documentation: https://github.com/vitaly-t/pg-promise
● PostgreSQL documentation: https://www.postgresql.org/docs/9.6/static/index.html
● MongoDB NodeJS module documentation:
http://mongodb.github.io/node-mongodb-native/2.2/
● ResearchGate had an important role in some design decisions: https://www.researchgate.net
● Angular 2 documentation: https://angular.io/docs/ts/latest/guide/
● NodeJS Express reference: http://expressjs.com/en/4x/api.html
● Sessions for Express: https://github.com/expressjs/session
● Cheerio documentation: https://github.com/cheeriojs/cheerio
● Xml2js documentation: https://github.com/Leonidas-from-XIV/node-xml2js
● Bcrypt NodeJS documentation: https://www.npmjs.com/package/bcrypt
● Multer documentation: https://www.npmjs.com/package/multer
● Body-parser for Express: https://github.com/expressjs/body-parser
● Let's Encrypt Getting Started: https://letsencrypt.org/getting-started/
● OAI-PMH specification: https://www.openarchives.org/OAI/openarchivesprotocol.html
● arXiv OAI-PMH interface: https://arxiv.org/help/oa/index
● Postman (for API testing): https://www.getpostman.com/
● Force HTTPS: http://stackoverflow.com/a/10715802
● NodeJS documentation: https://nodejs.org/docs/latest-v4.x/api/
● Promises in JavaScript:
○ https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise
○ https://ponyfoo.com/articles/es6-promises-in-depth
● Best practices and tool selection:
○ http://stackoverflow.com/a/10323055
○ http://stackoverflow.com/a/630475
○ http://stackoverflow.com/a/2894665
○ https://dba.stackexchange.com/a/74313
● Angular 2 Pagination:
http://jasonwatmore.com/post/2016/08/23/angular-2-pagination-example-with
./server.js Page 1
const express = require('express'),
      express_session = require('express-session'),
      https = require('https'),
      http = require('http'),
      fs = require('fs'),
      rootRouter = require('./routes/index');

const app = express();

var https_ok = true;
var https_options = {};
try {
	https_options = {
		key: fs.readFileSync('/etc/letsencrypt/live/sharesci.org/privkey.pem'),
		cert: fs.readFileSync('/etc/letsencrypt/live/sharesci.org/cert.pem')
	};
} catch (err) {
	https_ok = false;
	if (err.errno === -13 && err.syscall === 'open') {
		console.error('Access permissions denied to SSL certificate files.'
			+ ' HTTPS will not be available. Try running as root.');
	} else {
		console.error(err);
	}
}

app.use('/', (req, res, next) => {
	if (!req.secure && https_ok) {
		return res.redirect(['https://', req.get('Host'), req.url].join(''));
	}
	next();
});
app.use(express_session({
	secret: require('crypto').randomBytes(64).toString('base64'),
	resave: false,
	saveUninitialized: false,
	httpOnly: true,
	secure: true,
	ephemeral: true,
	cookie: { maxAge: 16*60*60*1000 }
}));
app.use('/', rootRouter);
app.use('/', express.static(__dirname + '/client'));

http.createServer(app).listen(80);
if (https_ok) {
	https.createServer(https_options, app).listen(443);
}
./README.md Page 1
# ShareSci

ShareSci is a research publication and discussion website, currently being developed as a school
project by Mike D'Arcy and Utkarsh Patel. We aim to make it easier for researchers to find papers
related to their field of study, to discuss the contributions and flaws of studies, and to publish
their own studies (even negative results).
./controllers/account.js Page 1
const pgdb = require('../util/sharesci-pg-db'),
      bcrypt = require('bcrypt'),
      validator = require('../util/account_info_validation');

function index(req, res) {
	// TODO: change the redirect based on whether the user is logged in
	res.redirect('/login');
}

function createAction(req, res) {
	var responseObj = {
		errno: 0,
		errstr: ""
	};

	function onInsertComplete(data) {
		responseObj.errno = 0;
		responseObj.errstr = "";
		res.json(responseObj);
		res.end();
	}

	function onInsertFailed(err) {
		if (err.code === '23505') {
			// Violated 'UNIQUE' constraint, so username was already in use
			responseObj.errno = 8;
			responseObj.errstr = "Account already exists";
			res.json(responseObj);
		} else {
			console.error(err);
			responseObj.errno = 1;
			responseObj.errstr = "Unknown error";
			res.json(responseObj);
		}
		res.end();
	}

	const valuesPromise = new Promise((resolve, reject) => {
		values_from_request(req, resolve, reject);
	});
	valuesPromise.then((values) => {
		return new Promise((resolve, reject) => {
			insertValues(values, resolve, reject);
		}).then(onInsertComplete, onInsertFailed);
	})
	.catch((err) => {
		// values_from_request rejected; err holds the validation errno/errstr
		responseObj.errno = err.errno;
		responseObj.errstr = err.errstr;
		res.json(responseObj);
		res.end();
	});
}
function insertValues(values, resolve, reject) {
	const query = 'INSERT INTO account '
		+ '(username, passhash, firstname, lastname, self_bio, institution) '
		+ 'VALUES (${username}, ${passhash}, ${firstname}, ${lastname}, ${self_bio}, ${institution});';
	pgdb.any(query, values)
		.then((data) => {
			resolve(data);
		})
		.catch((err) => {
			reject(err);
		});
}
// Sets up values for insertion into the database and validates them.
// Calls `resolve` with a JSON object containing the values on success;
// calls `reject` with a JSON object containing error info on failure.
function values_from_request(req, resolve, reject) {
	if (!req.body.password) {
		reject({errno: 6, errstr: 'Missing password'});
		return;
	}
	var passsalt = bcrypt.genSaltSync(10);
	var passhash = bcrypt.hashSync(req.body.password, passsalt);
	var values = {
		'username': req.body.username,
		'passhash': passhash,
		'firstname': req.body.firstname,
		'lastname': req.body.lastname,
		'self_bio': req.body.self_bio,
		'institution': req.body.institution
	};

	for (var key in values) {
		if (typeof values[key] === 'undefined') {
			values[key] = null;
		}
	}

	// Validate values
	if (!validator.is_valid_username(values['username'])) {
		reject({errno: 2, errstr: 'Invalid username'});
		return;
	}
	if (!validator.is_valid_password(req.body.password)) {
		reject({errno: 3, errstr: 'Invalid password'});
		return;
	}
	if (!validator.is_valid_firstname(values['firstname'])) {
		reject({errno: 6, errstr: 'Invalid firstname'});
		return;
	}
	if (!validator.is_valid_lastname(values['lastname'])) {
		reject({errno: 6, errstr: 'Invalid lastname'});
		return;
	}
	if (!validator.is_valid_institution(values['institution'])) {
		reject({errno: 6, errstr: 'Invalid institution'});
		return;
	}
	if (!validator.is_valid_self_bio(values['self_bio'])) {
		reject({errno: 6, errstr: 'Invalid self-biography'});
		return;
	}

	resolve(values);
}
module.exports = {
	index: index,
	createAction: createAction
};
./controllers/login.js Page 1
const pgdb = require('../util/sharesci-pg-db'),
      bcrypt = require('bcrypt');

function loginAction(req, res) {
	var responseObj = {
		errno: 0,
		errstr: ""
	};
	if (req.session.user_id) {
		responseObj.errno = 4;
		responseObj.errstr = "Already logged in";
		res.json(responseObj);
		res.end();
		// Don't fall through to the credential check below
		return;
	}
	pgdb.func('get_user_passhash', [req.body.username])
		.then((data) => {
			if (bcrypt.compareSync(req.body.password, data[0]['passhash'])) {
				req.session.user_id = req.body.username;
				responseObj.errno = 0;
				responseObj.errstr = "";
				res.json(responseObj);
			} else {
				responseObj.errno = 3;
				responseObj.errstr = "Incorrect password";
				res.json(responseObj);
			}
			res.end();
		})
		.catch((err) => {
			if (err.received === 0) {
				console.log('Invalid username \'' + req.body.username + '\' tried to log in.');
				responseObj.errno = 2;
				responseObj.errstr = "Incorrect username";
			} else {
				console.error(err);
				responseObj.errno = 1;
				responseObj.errstr = "Unknown error";
			}
			res.json(responseObj);
			res.end();
		});
}

function loginPage(req, res) {
	res.redirect('/');
	res.end();
}

module.exports = {
	loginAction: loginAction,
	loginPage: loginPage
};
./controllers/logout.js Page 1
function logoutAction(req, res) {
	delete req.session['user_id'];
	req.session.destroy();
	if (req.body.successRedirect) {
		res.redirect(req.body.successRedirect);
	} else {
		res.redirect('/');
	}
}

module.exports = {
	logoutAction: logoutAction,
};