Beyond Trust and Reliability: Reusing Data in Collaborative Cancer Epidemiology Research
Post on 22-Feb-2016
@betsyrolland; @ducktopian
Betsy Rolland
Computer Supported Collaboration Laboratory
Department of Human Centered Design & Engineering
University of Washington
Public Health Sciences Division
Fred Hutchinson Cancer Research Center

Charlotte P. Lee
Computer Supported Collaboration Laboratory
Department of Human Centered Design & Engineering
University of Washington
Background & Motivation
- “Big Data” is everywhere
- Little attention to “Small Data”
  - When combined, have similar potential to Big Data
  - Can be difficult to find
  - Documentation is informal, spotty or nonexistent
  - Original staff: moved on or forgotten
- Increasing calls for sharing but little knowledge of how eventual recipients will use the data
What is Data Reuse?
- Data sharing: “… the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection.” (Borgman 2011)
- Proposed definition of “data reuse”: the work done by the recipient of those shared data. Involves
  - identification of a dataset of interest,
  - receipt of the dataset, and
  - appropriate use of the data for analysis.
Previous Research
- Researchers need to establish trust and reliability before using shared data (Bietz 2009, Birnholtz 2003, Edwards 2011, Faniel 2010, Zimmerman 2007)
- Data are highly social
- Little research on data use practices after trust and reliability have been established
Research Question
RQ: How do cancer epidemiology postdoctoral researchers determine how to use a variable from an existing dataset appropriately for their own analyses?
Research Site
Fred Hutchinson Cancer Research Center (FHCRC), Seattle, Washington, USA
Methods
- Interviews with diverse sample of postdoctoral researchers in cancer epidemiology at FHCRC:
  - 4 men, 7 women
  - From several different fields, including MD/MPHs, behavioral epi and molecular epi
  - Worked with different mentors
  - Were at different points in their post-docs, ranging from 3 months to 2 years
- Analyzed transcripts using grounded theory approach
Cancer Epidemiology
- Epidemiology: study of disease risk at the population level
- Population examples: post-menopausal women, prostate-cancer survivors, Asians
- Focus on exposures: tobacco use, family medical history, diet, proximity to contaminants
Cancer-Epidemiology Datasets
- Generally collected by questionnaire
- Not standardized but some generally accepted practices
- Asking similar questions over different populations
- Culture of sharing
Findings
Typology of Information Needs
1. What datasets are available to use in my research?
2. Will this dataset help me answer my research question?
3. What else has been done with this dataset?
4. Where do I find the information I need to understand this study?
5. How was this dataset constructed?
6. How were these data constructed?
7. What do my variables of interest mean?
8. Am I using the data and the dataset correctly?
9. What have I done with this dataset?
Information Seeking Strategies
- Conversations with mentors and study team
- Available written sources (project websites, codebooks, study questionnaires, published manuscripts)
- Ranged from simple to complex questions
  - Why are so many participants missing tumor stage and how has that been handled in previous analyses?
- Iterative and ongoing process
Two scenarios for further information seeking
1. Incorrect data usage
2. Scientific discovery
- Scenario 1: frustrating waste of time
- Scenario 2: healthy result of interesting scientific work
- Both required post-docs to return to their information sources
Understanding the Construction of Variables
- Question 6: How were these data constructed?
- Data are:
  - Social
  - Constructed as result of decisions, assumptions, actions
  - Impossible to fully document
- Interested not just in meaning but in construction history of variables of interest
  - How was this variable coded?
  - How was this question originally asked?
  - Why was this variable coded and analyzed in a certain way?
How was this variable coded?
And that took forever to figure out … because then [the data manager] had to go back to the original code in which she had created the variable and reinterpret the code to break down exactly what had happened, and it was like all this looping. So things like that were frustrating. We had a lot of setbacks where you’re like, “So what does this variable really mean?” and every time you ask, “What does this variable really mean?” it’s never straightforward (Ginger, 55).
How was this question originally asked?
But so you’re using like a time between diagnosis and when they say they had a colonoscopy to infer maybe what that was about. … And you have to go back to the data dictionary, for sure, but I find myself going back to the questionnaire to see what I think the question really was asking, you know. Because the data dictionary… it’s just descriptive, it’s kind of what they thought they were asking… it’s like oh, the value can be one or two. One is yes and two is no. But when you look at the question, it’s have you ever had a colonoscopy, you know, more than two years ago or something? So there’s a difference. So there can be differences (Stewart, 162).
Conclusion
- Data reuse is a difficult, time-consuming and iterative process requiring access to both written and human information sources.
- Appropriate use and scientific integrity were of greatest importance.
- Documentation will never be thorough enough to cover all future uses.
Implications
For CSCW:
- What is thorough documentation?
- Need targeted documentation
- Need ways to easily document and store study history without getting in the way of the science
- Support for collaborative information seeking
For Professional Practice:
- Incentives from funders for documentation and data curation
Acknowledgements
- Thanks to our participants
- Funding:
  - Fred Hutchinson Cancer Research Center
  - NIH award R03CA150036
- Drs. John D. Potter and Polly Newcomb (FHCRC) for expertise in epidemiology
Thanks!
Questions?
Referenced Work
Bietz, M.J., & Lee, C.P. (2009). Collaboration in Metagenomics: Sequence Databases and the Organization of Scientific Work. In Proc. ECSCW 2009, Springer-Verlag (2009), 243-262.
Birnholtz, J.P., & Bietz, M.J. (2003). Data at work: supporting sharing in science and engineering. In Proc. ACM SIGGROUP, ACM Press (2003), 339-348.
Edwards, P.N., Batcheller, A.L., Mayernik, M.S., Borgman, C.L., & Bowker, G.C. (2011). Science friction: Data, metadata, and collaboration. Social Studies of Science, 41(5), 667-690.
Faniel, I.M., & Jacobsen, T.E. (2010). Reusing Scientific Data: How Earthquake Engineering Researchers Assess the Reusability of Colleagues’ Data. Computer Supported Cooperative Work, 19(3-4), 355-375.
Zimmerman, A. (2007). Not by metadata alone: the use of diverse forms of knowledge to locate data for reuse. International Journal on Digital Libraries, 7(1-2), 5-16.