open data bay area: interesting problems in academic data

Interesting Problems in Academic Data

William Gunn
Head of Academic Outreach
Mendeley
@mrgunn

Academic data

• Not data like the Climate Corp.
• Not data like Facebook or Twitter
• Metadata! (not the NSA kind)
• Information about the scholarly outputs of academic researchers

Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.

http://nexus.od.nih.gov/all/2012/02/13/age-distribution-of-nih-principal-investigators-and-medical-school-faculty/

The bog we’re stuck in

• Conservative culture
• Technophobia in humanities
• Policy issues
• Academic incentives
– beholden to publishing companies for status
– data & code aren’t citable
– data & code aren’t easily shareable
• extra work for no extra credit, reproducibility issues

Time to market

• http://freethedata.org

• Gene patent reform

• Portable Legal Consent


Beholdenness

• Currently, you’re judged almost, if not entirely, on what you publish

• Journals that are cited more (high impact factor, IF) are worth more.

• In China, you get a salary bonus if you publish in a high IF journal -> gaming and predatory behavior

• In the US, it’s where you work

Altmetrics

• Re-building the reputation system for academia

• Collect data about more kinds of outputs

• Make data and code first class objects

• Attribute impact to the object, not the top-level container (journal or institution)

• We don’t know what the data mean, yet.
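To make the "object, not container" point concrete, here is a minimal sketch. The citation counts are made up for illustration (they are not from any real journal), and the function name is hypothetical. It only shows how a journal-style average can misrepresent most of the items inside the container.

```python
from statistics import median

# Hypothetical citation counts for five articles in one journal.
# These values are illustrative assumptions, not real data.
citations = {
    "article-a": 120,
    "article-b": 3,
    "article-c": 1,
    "article-d": 0,
    "article-e": 1,
}


def container_level_score(counts):
    """Journal-style average: every article inherits the same number."""
    return sum(counts.values()) / len(counts)


# Container-level view: one number for the whole journal.
journal_score = container_level_score(citations)

# Item-level view: each article keeps its own count.
typical_item = median(citations.values())
below_average = sorted(k for k, v in citations.items() if v < journal_score)

print(f"container-level average: {journal_score}")
print(f"median item: {typical_item}")
print(f"items below the container average: {below_average}")
```

With these assumed counts, one heavily cited article pulls the average far above what four of the five articles actually received, which is the distortion item-level metrics try to avoid.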

Increasing age of grant awardees

• No one is really working on this problem

• Grant agencies don’t know how to make data-driven decisions, because they don’t have enough good data.

• This is a hard problem.

Making data and code citable

• DataCite
• CrossRef

• CC4 has new additions to make things work better
– clearer CC-BY attribution guidelines
– handles sui generis database rights (for EU)
– makes it easier for publisher and user

Reproducibility Issues

• Code isn’t written to a high standard

• Analyses are hard to re-run
• Code is hard to share
• Data sets are hard to re-use
– rights issues and provenance and context
– Does that dataset mean what I think it means?

Author Disambiguation

• If you want item-level credit to accrue to researchers, you need to be able to tell them apart
– Y. Wang had 3,926 publications in 2011

– ORCID is working on this, but it’s a hard problem.

https://orcid.org/0000-0002-3555-2054
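An ORCID iD like the one above is 16 characters, and the last one is a check character computed with ISO 7064 MOD 11-2. The sketch below validates that check character. The function names are hypothetical, and it only catches mistyped iDs; it does not tell two researchers named Y. Wang apart, which is the hard part.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Accept a bare iD or a full orcid.org URL and verify its check character."""
    # Keep only the part after the last slash, then drop the hyphens.
    compact = orcid.rsplit("/", 1)[-1].replace("-", "").upper()
    if len(compact) != 16:
        return False
    if any(ch not in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("https://orcid.org/0000-0002-3555-2054"))
print(is_valid_orcid("0000-0002-3555-2055"))
```

The first call uses the iD from the slide; the second changes only the final character, so the check fails.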

Recommender Systems

• Better data for funding agencies and publishers and researchers
– Mendeley hosted a Recommender Systems Workshop, active on this problem

• Moving from descriptive to predictive stats about research.

Non-problems

• Social networks for researchers
– Researchers use Mendeley, Twitter, LinkedIn, FB to some degree
• a place for people to comment on articles
