Collaborative Data Analysis with Taverna Workflows
DESCRIPTION
Microsoft eScience Workshop 2009 presentation on collaborative development of Taverna workflows for data analysis in FLOSS research.
TRANSCRIPT
Collaborative Data Analysis with Taverna Workflows
Andrea Wiggins, Kevin Crowston & James Howison
16 October 2009
FLOSS Phenomenon
Free/Libre Open Source Software
  Distributed collaboration to develop software
Typical social research topics
  Coordination and collaboration
  Growth and evolution (social and code)
  Code quality
  Motivation, leadership, success
  Culture and community
  Intellectual property and copyright
eScience Proof-of-Concept
Project to replicate published FLOSS research using eScience approaches
  Use existing shared data sets
  Develop workflows collaboratively
  Build library of reusable components
Selected several papers to replicate based on:
  Data availability
  Suitability of analytical approach
Research Replication
Study | Description
Conklin, 2004 | Examines distribution of project sizes for evidence of preferential attachment theory of growth in networks
Howison et al., 2006 | Examines dynamics of social networks of project communications over time
English & Schweik, 2007 | Classifies projects based on metrics for success and stage of project growth
FLOSS Research Data
Data sources include interviews, surveys, and ethnographic fieldwork
Digital “trace” data
  Archival, secondary, by-product of work
  Easy to get, but hard to use
Repositories
  Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
RoRs: Repositories of Repositories
  Data sources for research
RoRs: FLOSSmole
Public access to 300+ GB data
  300K+ projects from 8 repositories
  Politely scraped, then parsed
  Flat files & SQL datamarts
  Released monthly via SF & GC
5 TB allotment on TeraGrid @ SDSC
  Allows direct database access without compromising our humble server
RoRs: SRDA
SourceForge Research Data Archive
Gated researcher-only access to a 300 GB+ SQL db of monthly dumps from SourceForge
  Original obtuse structure, regular table deprecation, some limited documentation
Analysis Tool Requirements
Scalability
  Move analysis from small n’s to big(ger) n’s
Data meshing
  Reproducing research required analysis of data drawn from multiple RoRs
Collaborative analysis design
  Needed to tap into diverse skills from different contributors
Integrating Diverse Skills
Collaborator 1: Data Wrangler
  Expert with data sources and handling
  Multiple coding languages and great technical skills
Collaborator 2: Analyst
  Competent with R, but no other coding skills
  Good at debugging
Collaborator 3: PI
  Helps find solutions when all else fails
Taverna
Scientific workflow tool
  Free. Open. We like that.
  Responsive support from myGrid team, lively user community
Additional collaboration support via myExperiment
Combined features and flexibility met our needs
Work Process
1. Evaluate paper’s data, methods, and findings
2. Develop abstract workflow together, focusing on functionality
3. Split the work(flow) between data and analysis, specifying names and forms of inputs and outputs at the boundary
4. Independent individual development and testing, using dummy inputs
5. Integration of partial workflows
6. Test (debug, test…) and run
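Steps 3 and 4 of the work process can be illustrated with a minimal Python sketch. This is not the authors' Taverna workflow: the field names (`project_id`, `downloads`), the `BOUNDARY_FIELDS` contract, and both functions are hypothetical, chosen only to show how an agreed boundary lets the analysis half be tested on dummy inputs before the data half exists.

```python
# Hypothetical boundary contract agreed on in step 3: field name -> type.
BOUNDARY_FIELDS = {"project_id": int, "downloads": int}


def check_boundary(rows):
    """Raise ValueError if any row violates the agreed names and types."""
    for row in rows:
        if set(row) != set(BOUNDARY_FIELDS):
            raise ValueError(f"unexpected fields: {sorted(row)}")
        for name, expected in BOUNDARY_FIELDS.items():
            if not isinstance(row[name], expected):
                raise ValueError(f"{name} should be {expected.__name__}")
    return rows


def mean_downloads(rows):
    """Analysis half: uses only the fields named in the contract."""
    rows = check_boundary(rows)
    return sum(row["downloads"] for row in rows) / len(rows)


# Step 4: dummy inputs stand in for the real data-handling output.
dummy_rows = [
    {"project_id": 1, "downloads": 10},
    {"project_id": 2, "downloads": 30},
]
dummy_mean = mean_downloads(dummy_rows)
```

Because both halves check against the same contract, integration in step 5 mostly reduces to swapping the dummy rows for the real output of the data half.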
“Identifying success and tragedy of FLOSS Commons”
Replication of English & Schweik, 2007
  Classification of project success by stage of growth for 110K projects
  Requires data from 2 repositories, FLOSSmole & SRDA
Extension
  Parameterized all thresholds
  Tested additional criterion tests
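What "parameterized all thresholds" might look like can be sketched in Python. Everything here is an assumption made for illustration: the `Thresholds` fields, their default values, and the rules in `classify` are hypothetical and are not the criteria published by English & Schweik or the ones used in the replication workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs; every value is a parameter, not a constant."""
    min_releases_for_growth: int = 3
    min_downloads_for_success: int = 10
    max_idle_days: int = 365


def classify(releases, downloads, idle_days, t=Thresholds()):
    """Return (stage, outcome) for one project under thresholds t."""
    stage = "growth" if releases >= t.min_releases_for_growth else "initiation"
    if idle_days > t.max_idle_days:
        outcome = "tragedy"
    elif stage == "growth" and downloads >= t.min_downloads_for_success:
        outcome = "success"
    else:
        outcome = "indeterminate"
    return stage, outcome


# Same project, two threshold settings (illustrative values only).
default_result = classify(releases=5, downloads=200, idle_days=30)
strict_result = classify(5, 200, 30, Thresholds(min_downloads_for_success=1000))
```

Passing a different `Thresholds` object reruns the same classification under alternative cut-offs without touching the classification code.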
Classification Workflow
Key Strategies
Consciously worked to maintain transparency and modularity
  Copious code comments
  Metadata makes workflows self-explanatory
Designed components for reuse
  Particularly data handling and “shims”
Assigned the right work to the right people
  Created interdependencies, but reduced the initial learning curve
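A "shim" in this sense is a small adapter that reshapes one component's output into the form the next component expects. The Python sketch below is an assumption-laden illustration, not one of the authors' components: `make_shim` and all field names are hypothetical.

```python
def make_shim(field_map, converters=None):
    """Build a shim that renames fields and optionally converts values."""
    converters = converters or {}

    def shim(row):
        out = {}
        for source_name, target_name in field_map.items():
            value = row[source_name]
            convert = converters.get(target_name)
            out[target_name] = convert(value) if convert else value
        return out

    return shim


# Hypothetical field names for two differently shaped sources.
source_a_shim = make_shim(
    {"proj_unixname": "name", "dl_count": "downloads"},
    converters={"downloads": int},
)
source_b_shim = make_shim({"unix_group_name": "name", "downloads": "downloads"})

row_a = source_a_shim({"proj_unixname": "bibdesk", "dl_count": "42"})
row_b = source_b_shim({"unix_group_name": "bibdesk", "downloads": 42})
```

Both rows come out in the same shape, so a downstream analysis component can be reused unchanged for either source.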
Important Details
Used SVN for version management
  myExperiment can now manage this much more easily
Set up server caching on query results to speed up testing
Created an OWL ontology to map between RoRs, which significantly improved performance
  Nontrivial effort!
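The idea behind caching query results can be sketched as follows. The slides describe caching on the server; this sketch swaps in a simpler in-process memo keyed by the SQL text and its parameters, with an in-memory SQLite database as a stand-in. The `CachedQuery` class and the table and column names are hypothetical.

```python
import sqlite3


class CachedQuery:
    """Memoize read-only query results keyed by SQL text and parameters."""

    def __init__(self, connection):
        self.connection = connection
        self.cache = {}
        self.misses = 0

    def fetch(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.misses += 1  # only a miss reaches the database
            self.cache[key] = self.connection.execute(sql, params).fetchall()
        return self.cache[key]


# Stand-in database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)", [("alpha", 5), ("beta", 50)]
)

cached = CachedQuery(conn)
sql = "SELECT name FROM projects WHERE downloads >= ?"
first = cached.fetch(sql, (10,))
second = cached.fetch(sql, (10,))  # served from the cache
```

A cache like this only suits repeated test runs against data that does not change between runs; it never invalidates its entries.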
Extending Analyses
Replications implemented some of authors’ suggestions for future work
  Also implemented our own variations
  Easy to add and modify these after the initial replication development
Ran analyses on larger data sets than original studies
Workflow Re-use
Specifically designed workflows and components for re-use
  Components for sampling and analysis had no constants, only parameters
Effortful development for data handling paid off
  “Plug-and-play” components used in every subsequent workflow
  Shifts the challenge from data to research, where it should be!
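A component with "no constants, only parameters" might look like the Python sketch below. It is illustrative only: `sample_projects` and its arguments are hypothetical names, not one of the authors' Taverna components.

```python
import random


def sample_projects(project_ids, sample_size, seed):
    """Draw a reproducible random sample; nothing is hard-coded."""
    population = list(project_ids)
    if sample_size > len(population):
        raise ValueError("sample_size exceeds population")
    rng = random.Random(seed)  # a local generator keeps runs repeatable
    return rng.sample(population, sample_size)


# The caller supplies the population, the sample size, and the seed.
sample_one = sample_projects(range(1000), sample_size=5, seed=42)
sample_two = sample_projects(range(1000), sample_size=5, seed=42)
```

Exposing the seed as a parameter means a collaborator can reproduce a sample exactly, or change it deliberately, without editing the component.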
Challenges with Using Workflows
Software usability - continually improving
  Bugs wreaked havoc at times
Data handling
  Continually more challenging than expected
No existing web services, nor appropriate examples to emulate
  All bioscience, no social science
Barriers to Uptake
Little science issues
  Many paradigms: lack of agreement in research focus, theory, methods
  Lack of incentives to collaborate
Bimodal distribution of requisite skills
  “I can’t possibly do that! I can’t code!”
  “Why bother? I can code my own. You should too; just use Python.”
Students are more willing to experiment with tools and new approaches
Example: Recent Research
Estimating user base and potential user interest based on common release-and-download patterns
  Downloads a proxy for project success, a common dependent variable
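One simple way to relate downloads to releases is to ask what share of all downloads falls within a few days after a release. The Python sketch below is an assumption, not the analysis behind these slides: `share_within_window`, the window length, and the download counts are all made up for illustration.

```python
from datetime import date, timedelta


def share_within_window(daily_downloads, release_dates, window_days):
    """Fraction of all downloads that fall within window_days after a release."""
    total = sum(daily_downloads.values())
    if total == 0:
        return 0.0
    window = timedelta(days=window_days)
    in_window = 0
    for day, count in daily_downloads.items():
        if any(timedelta(0) <= day - release <= window
               for release in release_dates):
            in_window += count
    return in_window / total


# Made-up daily counts around a single made-up release date.
downloads = {
    date(2009, 1, 1): 100,
    date(2009, 1, 2): 50,
    date(2009, 1, 20): 50,
}
share = share_within_window(downloads, [date(2009, 1, 1)], window_days=7)
```

A high share suggests downloads are driven mostly by releases; a low share leaves room for other influences on download counts.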
“Normal” Download-Release Patterns
BibDesk
Users update fairly quickly after releases
External effects!
Taverna’s Download-Release Patterns
[Chart annotations: "1.3.2-RC1+2 presentations", "1.5.0", "?"]
Interpretation
Taverna is not a “normal” open source project
  Speaking tours, tutorials, articles, and other events influence downloads
Questions?
More:
  Poster on eScience for FLOSS
  floss.syr.edu
  www.myexperiment.org/groups/64