open data bay area: interesting problems in academic data
Mendeley Talk
Interesting Problems in Academic Data
William Gunn
Head of Academic Outreach, Mendeley
@mrgunn
Academic data
• Not data like the Climate Corp.
• Not data like Facebook or Twitter
• Metadata! (not the NSA kind)
• Information about the scholarly outputs of academic researchers
Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
http://nexus.od.nih.gov/all/2012/02/13/age-distribution-of-nih-principal-investigators-and-medical-school-faculty/
The bog we’re stuck in
• Conservative culture
• Technophobia in humanities
• Policy issues
• Academic incentives
– beholden to publishing companies for status
– data & code aren’t citable
– data & code aren’t easily shareable
• extra work for no extra credit, reproducibility issues
Time to market
• http://freethedata.org
• Gene patent reform
• Portable Legal Consent
Beholdenness
• Currently, you’re judged almost, if not entirely, on what you publish
• Journals that are cited more (high impact factor, IF) count for more.
• In China, you get a salary bonus if you publish in a high IF journal -> gaming and predatory behavior
• In the US, it’s where you work
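For readers unfamiliar with the IF mentioned above, here is a minimal Python sketch of the standard two-year impact factor: citations received in a given year to items the journal published in the previous two years, divided by the number of citable items it published in those two years. The function name and the example numbers are hypothetical and only illustrate the arithmetic.

```python
def two_year_impact_factor(citations: int, citable_items: int) -> float:
    """Two-year impact factor for a journal in year Y.

    citations:     citations received in year Y to items published in Y-1 and Y-2
    citable_items: number of citable items published in Y-1 and Y-2
    """
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


if __name__ == "__main__":
    # Invented numbers: 500 citations to 200 citable items.
    print(two_year_impact_factor(500, 200))
```

Note that the result is a property of the journal as a whole; it says nothing about how often any single article in it was cited.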
Altmetrics
• Re-building the reputation system for academia
• Collect data about more kinds of outputs
• Make data and code first class objects
• Attribute impact to the object, not the top-level container (journal or institution)
• We don’t know what the data mean, yet.
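As a rough illustration of attributing impact to the object rather than the container, the Python sketch below compares a journal-level average with per-article counts. The journal names, article identifiers, and citation counts are invented, and `journal_mean` and `item_counts` are hypothetical helper names, not part of any altmetrics service.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (journal, article_id, citations). All values invented.
ARTICLES = [
    ("Journal A", "a1", 120),
    ("Journal A", "a2", 3),
    ("Journal A", "a3", 0),
    ("Journal B", "b1", 40),
    ("Journal B", "b2", 38),
]


def journal_mean(articles):
    """Container-level metric: average citations per article in each journal."""
    by_journal = defaultdict(list)
    for journal, _, cites in articles:
        by_journal[journal].append(cites)
    return {journal: mean(counts) for journal, counts in by_journal.items()}


def item_counts(articles):
    """Item-level metric: each article keeps its own citation count."""
    return {article_id: cites for _, article_id, cites in articles}


if __name__ == "__main__":
    print(journal_mean(ARTICLES))
    print(item_counts(ARTICLES))
```

In this made-up data, Journal A averages 41 citations per article and Journal B averages 39, even though two of Journal A's three articles were barely cited. The per-item view keeps that difference visible.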
Increasing age of grant awardees
• No one is really working on this problem
• Grant agencies don’t know how to make data-driven decisions, because they don’t have enough good data.
• This is a hard problem.
Making data and code citable
• DataCite
• CrossRef
• CC4 has new additions to make things work better
– clearer CC-BY attribution guidelines
– Handles sui generis database rights (for EU)
– makes it easier for publisher and user
Reproducibility Issues
• Code isn’t written to a high standard of quality
• Analyses are hard to re-run
• Code is hard to share
• Data sets are hard to re-use
– rights issues and provenance and context
– Does that dataset mean what I think it means?
Author Disambiguation
• If you want item-level credit to accrue to researchers, you need to be able to tell them apart
– Y. Wang had 3,926 publications in 2011
– ORCID is working on this, but it’s a hard problem.
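A small Python sketch of why name strings alone cannot separate authors: grouping by "initial + surname" merges distinct people, while grouping by a persistent identifier keeps them apart. All names, titles, and identifiers below are invented. The `author_id` field only stands in for something like an ORCID iD and does not follow the real ORCID format.

```python
from collections import defaultdict

# Hypothetical publication records; every value here is invented.
PUBLICATIONS = [
    {"title": "Paper 1", "author": "Yan Wang", "author_id": "id-001"},
    {"title": "Paper 2", "author": "Yu Wang", "author_id": "id-002"},
    {"title": "Paper 3", "author": "Y. Wang", "author_id": "id-001"},
    {"title": "Paper 4", "author": "Ying Wang", "author_id": "id-003"},
]


def initial_surname(name):
    """Reduce a name to 'first initial + surname', as many citation lists do."""
    parts = name.replace(".", "").split()
    return f"{parts[0][0]}. {parts[-1]}"


def group_by(records, key_fn):
    """Group publication titles under whatever key key_fn returns."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["title"])
    return dict(groups)


# Keyed by name string: all four papers collapse under "Y. Wang".
by_name = group_by(PUBLICATIONS, lambda r: initial_surname(r["author"]))

# Keyed by persistent identifier: three distinct researchers.
by_id = group_by(PUBLICATIONS, lambda r: r["author_id"])

if __name__ == "__main__":
    print(by_name)
    print(by_id)
```

The hard part in practice is assigning the identifier in the first place, which this sketch takes as given.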
Recommender Systems
• Better data for funding agencies and publishers and researchers
– Mendeley hosted a Recommender Systems Workshop, active on this problem
• Moving from descriptive to predictive stats about research.
Non-problems
• Social networks for researchers
– Researchers use Mendeley, Twitter, LinkedIn, FB to some degree
• a place for people to comment on articles