Collaborative Data Analysis with Taverna Workflows

Andrea Wiggins, Kevin Crowston & James Howison

16 October 2009

FLOSS Phenomenon

Free/Libre Open Source Software
  Distributed collaboration to develop software

Typical social research topics
  Coordination and collaboration
  Growth and evolution (social and code)
  Code quality
  Motivation, leadership, success
  Culture and community
  Intellectual property and copyright

eScience Proof-of-Concept

Project to replicate published FLOSS research using eScience approaches
  Use existing shared data sets
  Develop workflows collaboratively
  Build library of reusable components

Selected several papers to replicate based on:
  Data availability
  Suitability of analytical approach

Research Replication

| Study | Description |
| --- | --- |
| Conklin, 2004 | Examines distribution of project sizes for evidence of preferential attachment theory of growth in networks |
| Howison et al., 2006 | Examines dynamics of social networks of project communications over time |
| English & Schweik, 2007 | Classifies projects based on metrics for success and stage of project growth |

FLOSS Research Data

Data sources include interviews, surveys, and ethnographic fieldwork

Digital “trace” data
  Archival, secondary, by-product of work
  Easy to get, but hard to use

Repositories
  Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.

RoRs: Repositories of Repositories
  Data sources for research

RoRs: FLOSSmole

Public access to 300+ GB data
  300K+ projects from 8 repositories
  Politely scraped, then parsed
  Flat files & SQL datamarts
  Released monthly via SF & GC

5 TB allotment on TeraGrid @ SDSC
  Allows direct database access without compromising our humble server

RoRs: SRDA

SourceForge Research Data Archive

Gated researcher-only access to a 300 GB+ SQL db of monthly dumps from SourceForge
  Original obtuse structure, regular table deprecation, some limited documentation

Analysis Tool Requirements

Scalability
  Move analysis from small n’s to big(ger) n’s

Data meshing
  Reproducing research required analysis of data drawn from multiple RoRs

Collaborative analysis design
  Needed to tap into diverse skills from different contributors

Integrating Diverse Skills

Collaborator 1: Data Wrangler
  Expert with data sources and handling
  Multiple coding languages and great technical skills

Collaborator 2: Analyst
  Competent with R, but no other coding skills
  Good at debugging

Collaborator 3: PI
  Helps find solutions when all else fails

Taverna

Scientific workflow tool
  Free. Open. We like that.
  Responsive support from myGrid team, lively user community

Additional collaboration support via myExperiment

Combined features and flexibility met our needs

Work Process

1. Evaluate paper’s data, methods, and findings

2. Develop abstract workflow together, focusing on functionality

3. Split the work(flow) between data and analysis, specifying names and forms of inputs and outputs at the boundary

4. Independent individual development and testing, using dummy inputs

5. Integration of partial workflows

6. Test (debug, test…) and run
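Steps 3 to 5 of the work process can be illustrated with a minimal sketch. It assumes the boundary between the data side and the analysis side is a list of records with agreed field names; `ProjectRecord`, its fields, and the three functions are hypothetical names chosen for illustration, not components from the actual workflows.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical boundary agreed in step 3: the data side emits a list of
# ProjectRecord items, and the analysis side consumes exactly that form.
@dataclass(frozen=True)
class ProjectRecord:
    project_id: str
    downloads: int
    releases: int


def dummy_data_stage() -> list[ProjectRecord]:
    """Stand-in for the real data workflow, used during step 4."""
    return [
        ProjectRecord("p1", downloads=120, releases=3),
        ProjectRecord("p2", downloads=0, releases=0),
        ProjectRecord("p3", downloads=45, releases=1),
    ]


def analysis_stage(records: list[ProjectRecord]) -> dict[str, float]:
    """Analysis half, developed and tested against dummy inputs only."""
    if not records:
        raise ValueError("analysis_stage needs at least one record")
    return {
        "n_projects": float(len(records)),
        "mean_downloads": mean(r.downloads for r in records),
        "share_released": sum(r.releases > 0 for r in records) / len(records),
    }


def run_workflow(data_stage=dummy_data_stage) -> dict[str, float]:
    """Step 5: integration means swapping the dummy stage for the real one."""
    return analysis_stage(data_stage())
```

Because both halves depend only on the agreed record shape, the analyst can finish and test `analysis_stage` before the real data stage exists.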

“Identifying success and tragedy of FLOSS Commons”

Replication of English & Schweik, 2007
  Classification of project success by stage of growth for 110K projects
  Requires data from 2 repositories, FLOSSmole & SRDA

Extension
  Parameterized all thresholds
  Tested additional criteria
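A rough sketch of what "parameterized all thresholds" could look like in code follows. Every threshold value, field name, and class label here is a placeholder assumption for illustration; none of them are the criteria used in English & Schweik (2007) or in the replication workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values only; NOT the criteria from the replicated paper.
    min_releases_for_growth: int = 1
    min_releases_for_success: int = 3
    min_downloads_for_success: int = 10
    max_inactive_days: int = 365


def classify(releases: int, downloads: int, days_since_activity: int,
             t: Thresholds = Thresholds()) -> tuple[str, str]:
    """Return a (stage, outcome) pair using only values taken from `t`."""
    stage = "growth" if releases >= t.min_releases_for_growth else "initiation"
    if days_since_activity > t.max_inactive_days:
        outcome = "tragedy"
    elif (stage == "growth"
          and releases >= t.min_releases_for_success
          and downloads >= t.min_downloads_for_success):
        outcome = "success"
    else:
        outcome = "indeterminate"
    return stage, outcome
```

With every cut-off held in one `Thresholds` object, a sensitivity check amounts to calling `classify` again with, say, `Thresholds(min_releases_for_success=6)`.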

Classification Workflow

Key Strategies

Consciously worked to maintain transparency and modularity
  Copious code comments
  Metadata makes workflows self-explanatory

Designed components for reuse
  Particularly data handling and “shims”

Assigned the right work to the right people
  Created interdependencies, but reduced the initial learning curve
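As an illustration of a reusable “shim”, taken here to mean a small adapter that reshapes one component's output into the form the next component expects, the sketch below turns query rows (a list of dicts) into CSV text that an R-based analysis step could read. The function name and column names are hypothetical.

```python
import csv
import io


def rows_to_csv_shim(rows, columns):
    """Convert a list of dict rows into CSV text with a fixed column order.

    Missing values become empty fields; keys not listed in `columns`
    are dropped, so downstream components always see the same layout.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({c: row.get(c, "") for c in columns})
    return buffer.getvalue()
```

Keeping format conversion in its own component means neither the data component nor the analysis component needs to know about the other's internals.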

Important Details

Used SVN for version management
  myExperiment can now manage this much more easily

Set up server caching on query results to speed up testing

Created an OWL ontology to map between RoRs, which significantly improved performance
  Nontrivial effort!
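The caching idea can be sketched in a few lines. The slides describe server-side caching; the version below substitutes a simple client-side file cache keyed by the query text, which gives a similar benefit during repeated test runs. `cached_query` and the cache directory name are assumptions made for this example.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def cached_query(conn, sql, params=(), cache_dir=Path("query_cache")):
    """Run a query once, then serve identical queries from a JSON file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key covers both the SQL text and its parameters.
    key_source = json.dumps([sql, list(params)])
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    rows = [list(row) for row in conn.execute(sql, params).fetchall()]
    cache_file.write_text(json.dumps(rows), encoding="utf-8")
    return rows
```

A cache like this never expires on its own, so it suits testing against a fixed monthly dump better than querying live data.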

Extending Analyses

Replications implemented some of authors’ suggestions for future work
  Also implemented our own variations
  Easy to add and modify these after the initial replication development

Ran analyses on larger data sets than original studies

Workflow Re-use

Specifically designed workflows and components for re-use

Components for sampling and analysis had no constants, only parameters

Effortful development for data handling paid off
  “Plug-and-play” components used in every subsequent workflow

Shifts the challenge from data to research, where it should be!
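The "no constants, only parameters" rule might look like the following sampling component. This is a sketch under assumptions: the function name, argument names, and the use of a seed for reproducibility are illustrative choices, not details from the original workflows.

```python
import random
from typing import Optional, Sequence


def sample_projects(project_ids: Sequence[str], sample_size: int,
                    seed: Optional[int] = None) -> list[str]:
    """Draw a random sample of project IDs without replacement.

    Nothing is hard-coded: the population, the sample size, and the
    random seed all arrive as inputs, so the component can be reused
    unchanged in other workflows.
    """
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    rng = random.Random(seed)
    k = min(sample_size, len(project_ids))
    return rng.sample(list(project_ids), k)
```

Passing the same seed reproduces the same sample, which helps when a replication needs to be rerun or checked by a collaborator.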

Challenges with Using Workflows

Software usability - continually improving
  Bugs wreaked havoc at times

Data handling
  Continually more challenging than expected

No existing web services, nor appropriate examples to emulate
  All bioscience, no social science

Barriers to Uptake

Little science issues
  Many paradigms: lack of agreement in research focus, theory, methods
  Lack of incentives to collaborate

Bimodal distribution of requisite skills
  “I can’t possibly do that! I can’t code!”
  “Why bother? I can code my own. You should too; just use Python.”

Students are more willing to experiment with tools and new approaches

Example: Recent Research

Estimating user base and potential user interest based on common release-and-download patterns
  Downloads a proxy for project success, a common dependent variable

“Normal” Download-Release Patterns

BibDesk
  Users update fairly quickly after releases
  External effects!

Taverna’s Download-Release Patterns

[Chart labels: 1.3.2-RC1+2 presentations; 1.5.0; ?]

Interpretation

Taverna is not a “normal” open source project
  Speaking tours, tutorials, articles, and other events influence downloads
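One hedged way to check whether a download spike lines up with a release, rather than with an outside event such as a talk or tutorial, is to compare average daily downloads just after a release date with the average just before it. The measure below (`release_lift`) and its 7-day window are assumptions for illustration, not the method used in the research described above.

```python
from datetime import date, timedelta
from statistics import mean


def release_lift(daily_downloads: dict[date, int], release_day: date,
                 window_days: int = 7) -> float:
    """Ratio of mean daily downloads after a release to the mean before it.

    Days missing from `daily_downloads` count as zero downloads.
    """
    before = [daily_downloads.get(release_day - timedelta(days=i), 0)
              for i in range(1, window_days + 1)]
    after = [daily_downloads.get(release_day + timedelta(days=i), 0)
             for i in range(window_days)]
    baseline = mean(before)
    if baseline == 0:
        # No prior downloads: any activity is an unbounded lift.
        return float("inf") if mean(after) > 0 else 1.0
    return mean(after) / baseline
```

A lift near 1.0 around releases, combined with spikes on other dates, would be consistent with the interpretation that events other than releases drive downloads.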

Questions?

More:
  Poster on eScience for FLOSS
  floss.syr.edu
  www.myexperiment.org/groups/64
