Building a Web Application to Monitor PubMed Retraction Notices
Neil Saunders
CSIRO Mathematics, Informatics and Statistics
Building E6B, Macquarie University Campus
North Ryde
December 1, 2011
Retraction Watch
Project Aims
Monitor PubMed for retractions
Retrieve retraction data and store locally for analysis
Develop web application to display retraction data
PubMed - advanced search, RSS and send-to-file
Updates in Google Reader
PubMed - MeSH
PubMed - EUtils
http://www.ncbi.nlm.nih.gov/books/NBK25501/
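A minimal sketch of how an EUtils request URL can be assembled with Ruby's standard library alone. The esearch.fcgi endpoint and its db, term and rettype=count parameters belong to EUtils; the helper name esearch_count_url is hypothetical and does not appear in the project code.

```ruby
require "uri"

# Base address, as used by the scripts in this talk.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hypothetical helper: build an ESearch URL that asks only for the hit count.
def esearch_count_url(term, db = "pubmed")
  query = URI.encode_www_form("db" => db, "term" => term, "rettype" => "count")
  "#{EUTILS_BASE}esearch.fcgi?#{query}"
end

puts esearch_count_url("Retraction of Publication[ptyp] 2011[dp]")
```

The search term is the same one the ecount script passes to BioRuby; encode_www_form takes care of escaping the spaces and square brackets.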
EInfo example script
#!/usr/bin/ruby
require 'rubygems'
require 'bio'
require 'hpricot'
require 'open-uri'

Bio::NCBI.default_email = "[email protected]"
ncbi = Bio::NCBI::REST.new
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db="

ncbi.einfo.each do |db|
  puts "Processing #{db}..."
  File.open("#{db}.txt", "w") do |f|
    doc = Hpricot(open("#{url + db}"))
    (doc/'//fieldlist/field').each do |field|
      name = (field/'/name').inner_html
      fullname = (field/'/fullname').inner_html
      description = (field/'description').inner_html
      f.write("#{name},#{fullname},#{description}\n")
    end
  end
end
EInfo script - output
ALL,All Fields,All terms from all searchable fields
UID,UID,Unique number assigned to publication
FILT,Filter,Limits the records
TITL,Title,Words in title of publication
WORD,Text Word,Free text associated with publication
MESH,MeSH Terms,Medical Subject Headings assigned to publication
MAJR,MeSH Major Topic,MeSH terms of major importance to publication
AUTH,Author,Author(s) of publication
JOUR,Journal,Journal abbreviation of publication
AFFL,Affiliation,Author's institutional affiliation and address
...
MongoDB Overview
MongoDB is a so-called “NoSQL” database
Key features:
Document-oriented
Schema-free
Documents stored in collections
http://www.mongodb.org/
Saving to a database collection: ecount
#!/usr/bin/ruby

require "rubygems"
require "bio"
require "mongo"

db = Mongo::Connection.new.db('pubmed')
col = db.collection('ecount')
Bio::NCBI.default_email = "[email protected]"
ncbi = Bio::NCBI::REST.new

1977.upto(Time.now.year) do |year|
  all = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
  term = ncbi.esearch_count("Retraction of Publication[ptyp] #{year}[dp]",
                            {"db" => "pubmed"})
  record = {"_id" => year, "year" => year, "total" => all,
            "retracted" => term, "updated_at" => Time.now}
  col.save(record)
  puts "#{year}..."
end

puts "Saved #{col.count} records."
ecount collection
> db.ecount.findOne()
{
  "_id" : 1977,
  "retracted" : 3,
  "updated_at" : ISODate("2011-11-15T03:58:10.729Z"),
  "total" : 260517,
  "year" : 1977
}
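One analysis these records make possible is a per-year retraction rate. The sketch below works on plain hashes shaped like ecount documents, so it runs without a database. The 1977 values are the ones shown above; the second record and the helper name rate_per_100k are made up for illustration.

```ruby
# Records shaped like documents in the ecount collection.
# Only the 1977 values come from the slide; the 1978 record is invented.
records = [
  { "_id" => 1977, "year" => 1977, "total" => 260517, "retracted" => 3 },
  { "_id" => 1978, "year" => 1978, "total" => 100000, "retracted" => 5 } # illustrative
]

# Hypothetical helper: retraction notices per 100,000 publications.
def rate_per_100k(record)
  return 0.0 if record["total"].zero?
  record["retracted"] * 100_000.0 / record["total"]
end

records.each do |r|
  puts format("%d: %.2f per 100,000", r["year"], rate_per_100k(r))
end
```

Normalising by the yearly total matters because the number of publications indexed in PubMed changes from year to year.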
Saving to a database collection: entries
#!/usr/bin/ruby

require "rubygems"
require "mongo"
require "crack"

db = Mongo::Connection.new.db("pubmed")
col = db.collection('entries')
col.drop

xmlfile = "#{ENV['HOME']}/Dropbox/projects/pubmed/retractions/data/retract.xml"
xml = Crack::XML.parse(File.read(xmlfile))

xml['PubmedArticleSet']['PubmedArticle'].each do |article|
  article['_id'] = article['MedlineCitation']['PMID']
  col.save(article)
end

puts "Saved #{col.count} articles."
entries collection
{
  "_id" : "22106469",
  "PubmedData" : {
    "PublicationStatus" : "ppublish",
    "ArticleIdList" : {
      "ArticleId" : "22106469"
    },
    "History" : {
      "PubMedPubDate" : [
        {
          "Minute" : "0",
          "Month" : "11",
          "PubStatus" : "entrez",
          "Day" : "23",
          "Hour" : "6",
          "Year" : "2011"
        },
        {
          "Minute" : "0",
          "Month" : "11",
          "PubStatus" : "pubmed",
          "Day" : "23",
          "Hour" : "6",
          "Year" : "2011"
        },
        ...
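Because each article is stored as one nested document, reading a value means walking the hash. A minimal sketch using a cut-down copy of the document above; the helper name history_date is hypothetical. It assumes that an element which occurs only once arrives as a Hash rather than an Array, as "ArticleId" above suggests.

```ruby
require "date"

# A cut-down document mirroring the fields shown on the slide.
entry = {
  "_id" => "22106469",
  "PubmedData" => {
    "PublicationStatus" => "ppublish",
    "History" => {
      "PubMedPubDate" => [
        { "PubStatus" => "entrez", "Year" => "2011", "Month" => "11", "Day" => "23" },
        { "PubStatus" => "pubmed", "Year" => "2011", "Month" => "11", "Day" => "23" }
      ]
    }
  }
}

# Hypothetical helper: return the Date recorded for a PubStatus, or nil.
def history_date(entry, status)
  dates = entry.dig("PubmedData", "History", "PubMedPubDate")
  # Assumption: a single PubMedPubDate arrives as a Hash, so wrap it.
  dates = [dates] if dates.is_a?(Hash)
  hit = Array(dates).find { |d| d["PubStatus"] == status }
  hit && Date.new(hit["Year"].to_i, hit["Month"].to_i, hit["Day"].to_i)
end

puts history_date(entry, "pubmed")
```

Note that all values are strings, so they need converting before any date arithmetic.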
Saving to a database collection: timeline
#!/usr/bin/ruby

require "rubygems"
require "mongo"
require "date"

db = Mongo::Connection.new.db('pubmed')
entries = db.collection('entries')
timeline = db.collection('timeline')

dates = entries.find.map { |entry| entry['MedlineCitation']['DateCreated'] }
dates.map! { |d| Date.parse("#{d['Year']}-#{d['Month']}-#{d['Day']}") }
dates.sort!
data = (dates.first..dates.last).inject(Hash.new(0)) { |h, date| h[date] = 0; h }
dates.each { |date| data[date] += 1 }
data = data.sort
data.map! { |e| ["Date.UTC(#{e[0].year},#{e[0].month - 1},#{e[0].day})", e[1]] }

data.each do |date|
  timeline.save({"_id" => date[0].gsub(".", "_"), "date" => date[0], "count" => date[1]})
end

puts "Saved #{timeline.count} dates in timeline."
timeline collection
> db.timeline.findOne()
{
  "_id" : "Date_UTC(1977,7,12)",
  "date" : "Date.UTC(1977,7,12)",
  "count" : 1
}
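The timeline script does two things that are easy to miss: it gives every day in the range a count, including days with no notices, and it subtracts one from the month because JavaScript's Date.UTC numbers months from 0. So "Date.UTC(1977,7,12)" above is 12 August 1977. A database-free sketch of the same steps, with made-up input dates:

```ruby
require "date"

# Illustrative input: one Date per retraction notice (values are made up).
dates = [Date.new(2011, 11, 21), Date.new(2011, 11, 23), Date.new(2011, 11, 23)].sort

# Start every day in the range at zero, then count the notices.
counts = (dates.first..dates.last).each_with_object({}) { |d, h| h[d] = 0 }
dates.each { |d| counts[d] += 1 }

# JavaScript's Date.UTC numbers months from 0, hence month - 1.
series = counts.sort.map do |date, n|
  ["Date.UTC(#{date.year},#{date.month - 1},#{date.day})", n]
end

series.each { |point| p point }
```

Zero-filling keeps the chart from drawing a straight line across gaps between days that did have notices.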
Sinatra: minimal example
require "rubygems"
require "sinatra"

get "/" do
  "Hello World"
end

# ruby myapp.rb
# http://localhost:4567
Highcharts: minimal example code
var chart = new Highcharts.Chart({
  chart: {
    renderTo: 'container',
    defaultSeriesType: 'line'
  },
  xAxis: {
    categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
  },
  series: [{
    data: [29.9, 71.5, 106.4, 129.2, 144.0, 176.0,
           135.6, 148.5, 216.4, 194.1, 95.6, 54.4]
  }]
});

// <div id="container" style="height: 400px"></div>
Highcharts: minimal example result
Web Application Overview
|---config.ru
|---Gemfile
|---main.rb
|---public
|   |---javascripts
|   |   |---dark-blue.js
|   |   |---dark-green.js
|   |   |---exporting.js
|   |   |---gray.js
|   |   |---grid.js
|   |   |---highcharts.js
|   |   |---jquery-1.4.2.min.js
|   |---stylesheets
|       |---main.css
|---Rakefile
|---statistics.rb
|---views
    |---about.haml
    |---byyear.haml
    |---date.haml
    |---error.haml
    |---index.haml
    |---journal.haml
    |---journals.haml
    |---layout.haml
    |---test.haml
    |---total.haml
Sinatra Application Code - main.rb
# main.rb
configure do
  # a bunch of config stuff goes here
  # DB = connection to MongoDB database
  # timeline
  timeline = DB.collection('timeline')
  set :data, timeline.find.to_a.map { |e| [e['date'], e['count']] }
end

# views
get "/" do
  haml :index
end
Sinatra Views - index.haml
%h3 PubMed Retraction Notices - Timeline
%p Last update: #{options.updated_at}

%div#container(style="margin-left: auto; margin-right: auto; width: 800px;")

:javascript
  $(function () {
    new Highcharts.Chart({
      chart: {
        renderTo: 'container',
        defaultSeriesType: 'area',
        width: 800,
        height: 600,
        zoomType: 'x',
        marginTop: 80
      },
      legend: { enabled: false },
      title: { text: 'Retractions by date' },
      xAxis: { type: 'datetime' },
      yAxis: { title: { text: 'Retractions' } },
      series: [{
        data: #{options.data.inspect.gsub(/"/,"")}
      }],
      // more stuff goes here...
    });
  });
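The series line in the view relies on a small trick: the dates are stored as strings that happen to be JavaScript expressions, so stripping the double quotes from Ruby's inspect output leaves text that JavaScript can evaluate. A standalone sketch of that one step; the first pair is the timeline document shown earlier, the second is made up.

```ruby
# Pairs shaped like the :data setting built in main.rb.
# The second pair is invented for illustration.
data = [["Date.UTC(1977,7,12)", 1], ["Date.UTC(1977,7,13)", 0]]

# inspect gives a Ruby array literal; removing the double quotes leaves
# text that JavaScript reads as an array of [timestamp, count] pairs.
js = data.inspect.gsub(/"/, "")

puts js
# prints [[Date.UTC(1977,7,12), 1], [Date.UTC(1977,7,13), 0]]
```

This only works because the stored strings contain no quotes or other characters that would need escaping; for arbitrary strings a JSON encoder would be the safer route.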
Deployment: Heroku + MongoHQ
Heroku.com - free application hosting (for small apps)
Almost as simple as:
$ git remote add heroku [email protected]:appname.git
$ git push heroku master
MongoHQ.com - free MongoDB database hosting (up to 16 MB)
“Final” product
Application - http://pmretract.heroku.com
Code - http://github.com/neilfws/PubMed