beyond trust and reliability: reusing data in collaborative cancer epidemiology research

BEYOND TRUST AND RELIABILITY: REUSING DATA IN COLLABORATIVE CANCER EPIDEMIOLOGY RESEARCH

@betsyrolland; @ducktopian

Betsy Rolland
Computer Supported Collaboration Laboratory
Department of Human Centered Design & Engineering
University of Washington
Public Health Sciences Division
Fred Hutchinson Cancer Research Center

Charlotte P. Lee
Computer Supported Collaboration Laboratory
Department of Human Centered Design & Engineering
University of Washington

Background & Motivation

“Big Data” is everywhere
Little attention to “Small Data”
  When combined, have similar potential to Big Data
  Can be difficult to find
  Documentation is informal, spotty or nonexistent
  Original staff have moved on or forgotten
Increasing calls for sharing but little knowledge of how eventual recipients will use the data

What is Data Reuse?

Data sharing: “… the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection.” (Borgman 2011)

Proposed definition of “data reuse”: the work done by the recipient of those shared data. Involves:
  identification of a dataset of interest,
  receipt of the dataset, and
  appropriate use of the data for analysis.

Previous Research

Researchers need to establish trust and reliability before using shared data (Bietz 2009, Birnholtz 2003, Edwards 2011, Faniel 2010, Zimmerman 2007)

Data are highly social

Little research on data use practices after trust and reliability have been established

Research Question

RQ: How do cancer epidemiology postdoctoral researchers determine how to use a variable from an existing dataset appropriately for their own analyses?

Research Site

Fred Hutchinson Cancer Research Center (FHCRC) Seattle, Washington, USA

Methods

Interviews with diverse sample of postdoctoral researchers in cancer epidemiology at FHCRC:
  4 men, 7 women
  From several different fields, including MD/MPHs, behavioral epi and molecular epi
  Worked with different mentors
  Were at different points in their post-docs, ranging from 3 months to 2 years
Analyzed transcripts using grounded theory approach

Cancer Epidemiology

Epidemiology: study of disease risk at the population level

Population examples: post-menopausal women, prostate-cancer survivors, Asians

Focus on exposures: tobacco use, family medical history, diet, proximity to contaminants

Cancer-Epidemiology Datasets

Generally collected by questionnaire
Not standardized but some generally accepted practices
Asking similar questions over different populations

Culture of sharing

Findings

Typology of Information Needs

1. What datasets are available to use in my research?
2. Will this dataset help me answer my research question?
3. What else has been done with this dataset?
4. Where do I find the information I need to understand this study?
5. How was this dataset constructed?
6. How were these data constructed?
7. What do my variables of interest mean?
8. Am I using the data and the dataset correctly?
9. What have I done with this dataset?

Information Seeking Strategies

Conversations with mentors and study team
Available written sources (project websites, codebooks, study questionnaires, published manuscripts)
Ranged from simple to complex questions
  Why are so many participants missing tumor stage and how has that been handled in previous analyses?

Iterative and ongoing process

Two scenarios for further information seeking

1. Incorrect data usage
2. Scientific discovery

Scenario 1: frustrating waste of time
Scenario 2: healthy result of interesting scientific work
Both required post-docs to return to their information sources

Understanding the Construction of Variables

Question 6: How were these data constructed?
Data are:
  Social
  Constructed as a result of decisions, assumptions, actions
  Impossible to fully document
Interested not just in meaning but in construction history of variables of interest
  How was this variable coded?
  How was this question originally asked?
  Why was this variable coded and analyzed in a certain way?

How was this variable coded?

And that took forever to figure out … because then [the data manager] had to go back to the original code in which she had created the variable and reinterpret the code to break down exactly what had happened, and it was like all this looping. So things like that were frustrating. We had a lot of setbacks where you’re like, “So what does this variable really mean?” and every time you ask, “What does this variable really mean?” it’s never straightforward (Ginger, 55).

How was this question originally asked?

But so you’re using like a time between diagnosis and when they say they had a colonoscopy to infer maybe what that was about. … And you have to go back to the data dictionary, for sure, but I find myself going back to the questionnaire to see what I think the question really was asking, you know. Because the data dictionary… it’s just descriptive, it’s kind of what they thought they were asking… it’s like oh, the value can be one or two. One is yes and two is no. But when you look at the question, it’s have you ever had a colonoscopy, you know, more than two years ago or something? So there’s a difference. So there can be differences (Stewart, 162).

Conclusion

Data reuse is a difficult, time-consuming and iterative process requiring access to both written and human information sources.

Appropriate use and scientific integrity were of greatest importance.

Documentation will never be thorough enough to cover all future uses.

Implications

For CSCW:
  What is thorough documentation?
  Need targeted documentation
  Need ways to easily document and store study history without getting in the way of the science
  Support for collaborative information seeking
For Professional Practice:
  Incentives from funders for documentation and data curation

Acknowledgements

Thanks to our participants
Funding:
  Fred Hutchinson Cancer Research Center
  NIH award R03CA150036
Drs. John D. Potter and Polly Newcomb (FHCRC) for expertise in epidemiology

Thanks!

Questions?

Referenced Work

Bietz, M.J., & Lee, C.P. (2009). Collaboration in Metagenomics: Sequence Databases and the Organization of Scientific Work. In Proc. ECSCW 2009, Springer-Verlag, 243-262.

Birnholtz, J.P., & Bietz, M.J. (2003). Data at work: supporting sharing in science and engineering. In Proc. ACM SIGGROUP, ACM Press, 339-348.

Edwards, P.N., Batcheller, A.L., Mayernik, M.S., Borgman, C.L., & Bowker, G.C. (2011). Science friction: Data, metadata, and collaboration. Social Studies of Science, 41(5), 667-690.

Faniel, I.M., & Jacobsen, T.E. (2010). Reusing Scientific Data: How Earthquake Engineering Researchers Assess the Reusability of Colleagues’ Data. Computer Supported Cooperative Work, 19(3-4), 355-375.

Zimmerman, A. (2007). Not by metadata alone: the use of diverse forms of knowledge to locate data for reuse. International Journal on Digital Libraries, 7(1-2), 5-16.
