Final VIPER presentation at BioVis 2013
Jessie Kennedy, Martin Graham (Edinburgh Napier University)
Trevor Paterson, Andy Law (The Roslin Institute, University of Edinburgh)
Visual Cleaning of Genotype Data
• VIPER is a visualisation for spotting areas of error (impossible inheritance) in pedigree genotype datasets
Background
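As a rough illustration of what "impossible inheritance" means for a single marker, the sketch below checks whether a child's two alleles can be drawn one from each parent. The function name and the genotype values are illustrative assumptions, not part of VIPER or its checking algorithm.

```python
def is_consistent(child, father, mother):
    """True if the child's two alleles can come one from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


# Genotypes for one marker, written as allele pairs (illustrative values only).
print(is_consistent(("G", "T"), ("G", "G"), ("T", "A")))  # True
print(is_consistent(("T", "C"), ("G", "G"), ("G", "T")))  # False: impossible inheritance
```

A real checker also has to cope with missing genotypes and with more than one generation; this sketch only covers a fully typed parent/child trio.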
[Figure: pedigree structure annotated with each individual's genotype for one marker (G | G, T | A, G | A, G | T, T | C). Many more markers, with similar data per marker.]
• The visualisation aggregated errors across markers and displayed them as offspring groups
– Along with ancillary tables and bar charts
• For it to be a useful biological tool, it needed to be extended into a data cleaning application
Background
• Data Wrangling
– Fixing unreliable or useless data
– General Purpose vs Specific Task
• General Purpose Tools
– Wrangler / Google Refine
– Tabular data
• Ours is a Specific Task
– Remove the errors as they break further analyses
– Fixing errors often creates new ones as our data is an inheritance graph of related data rather than a table
Background
• Error Visualisation Topics (in order of volume of work)
– Uncertainty visualisation – show bounds of reliability
– Missing data visualisation – is data present
  • Usually the bane of visualisation rather than the aim
– Correctness visualisation – is data right
Background
• We cover missing data and correctness. For us...
– Incorrect data – bad.
– Missing (incomplete) data – manageable.
• Cleaning ≠ Correcting
– Correction is preferable, but often impossible
• We clean by deleting erroneous data points and inferring data from ancestor individuals
– We swap wrong data for missing data
Data Cleaning
• Four basic masking operations
Data Cleaning - Operations
1. Mask markers
2. Mask individuals
3. Mask single data points
4. Break relationships
• Markers are independent of each other.
– Masking one marker doesn’t change the errors in any other markers
• Thus markers with lots of errors can be quickly removed with no side-effect
– Early version in VIPER hid errors (but didn’t do anything to the underlying data)
Data Cleaning - Markers
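Because markers are independent, masking noisy markers can be done in a single pass with no rechecking. The sketch below assumes a hypothetical table of per-marker error flags (the marker and individual names are made up) and shows that the counts for the remaining markers are unchanged after masking.

```python
# Hypothetical error flags: errors[marker][individual] is True where a checker
# found an impossible inheritance for that individual at that marker.
errors = {
    "m1": {"ind1": False, "ind2": True, "ind3": False},
    "m2": {"ind1": True, "ind2": True, "ind3": True},
    "m3": {"ind1": False, "ind2": False, "ind3": False},
}


def errors_per_marker(errors, masked=frozenset()):
    """Error count for every marker that has not been masked."""
    return {m: sum(flags.values()) for m, flags in errors.items() if m not in masked}


def noisy_markers(errors, max_errors):
    """Markers whose error count exceeds the chosen threshold."""
    return {m for m, n in errors_per_marker(errors).items() if n > max_errors}


masked = noisy_markers(errors, max_errors=2)
print(masked)                             # {'m2'}
print(errors_per_marker(errors, masked))  # {'m1': 1, 'm3': 0}
```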
• Wanted to adopt the same approach for individuals...
– But something odd happened.
– Removing individuals changes the error counts of other individuals
  • Because individuals inherit from each other
  • So, e.g., removing every individual with > 5 errors produced new individuals with > 5 errors.
Data Cleaning - Individuals
• Some errors turned out to simply drop from one generation to the next
– Literal “chase to the bottom”, lots of lost data
• In these situations it is often necessary to break a child/parent relationship across all markers in the pedigree
– Which is where the fourth masking operation originates
Data Cleaning - Individuals
www.napier.ac.uk/iidi
Masking - 1
[Figure: example pedigree with one genotype per individual (such as A/G, G/T, C/C)]
• Mask all errors
• Recheck for errors
• Repeat
Lose 50% of data
Masking - 2
[Figure: example pedigree with one genotype per individual (such as A/G, G/T, C/C)]
• Mask errors top down
• Recheck for errors
• Repeat
Lose 25% of data
Masking - 3
[Figure: example pedigree with one genotype per individual (such as A/G, G/T, C/C)]
• Mask errors top down + cut links
• Recheck for errors
• Repeat
Lose <20% of data
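The three mask/recheck/repeat strategies can be sketched on a toy single-marker pedigree. Everything here is an assumption made for illustration: the pedigree, the genotypes, the simplified inference of masked genotypes from both parents, and especially the rule for when to cut a link (cut when the flagged individual's children agree with its recorded genotype), which stands in for a judgement the analyst makes interactively. The fractions printed are for this toy data, not the figures from the slides.

```python
def inheritable(ind, pedigree, genotypes):
    """Genotypes `ind` could inherit from its parents (None = unconstrained)."""
    if ind not in pedigree:                      # founder, or link has been cut
        return None
    sides = [possible(p, pedigree, genotypes) for p in pedigree[ind]]
    if sides[0] is None or sides[1] is None:     # simplification: need both parents
        return None
    return {tuple(sorted((a, b)))
            for gf in sides[0] for gm in sides[1] for a in gf for b in gm}


def possible(ind, pedigree, genotypes):
    """Recorded genotype if present, else what can be inferred from ancestors."""
    g = genotypes[ind]
    return {tuple(sorted(g))} if g is not None else inheritable(ind, pedigree, genotypes)


def find_errors(pedigree, genotypes):
    """Individuals whose recorded genotype is impossible given their parents."""
    errors = set()
    for child in pedigree:
        g, allowed = genotypes[child], inheritable(child, pedigree, genotypes)
        if g is not None and allowed is not None and tuple(sorted(g)) not in allowed:
            errors.add(child)
    return errors


def generation(ind, pedigree):
    """0 for founders, otherwise one more than the deepest parent."""
    if ind not in pedigree:
        return 0
    return 1 + max(generation(p, pedigree) for p in pedigree[ind])


def clean(pedigree, genotypes, top_down=False, cut_links=False):
    """Mask, recheck, repeat; returns the fraction of genotypes lost."""
    pedigree, genotypes = dict(pedigree), dict(genotypes)   # work on copies
    while True:
        errors = find_errors(pedigree, genotypes)
        if not errors:
            break
        targets = errors
        if top_down:                             # act only on the oldest generation in error
            top = min(generation(i, pedigree) for i in errors)
            targets = {i for i in errors if generation(i, pedigree) == top}
        for ind in targets:
            children = [c for c, parents in pedigree.items() if ind in parents]
            if cut_links and children and not any(c in errors for c in children):
                del pedigree[ind]                # assumed rule: children agree, distrust the link
            else:
                genotypes[ind] = None            # swap wrong data for missing data
    return sum(g is None for g in genotypes.values()) / len(genotypes)


# Toy data: child -> (father, mother); genotypes as two-letter allele strings.
PEDIGREE = {"C": ("F1", "M1"), "D": ("C", "P"),
            "X": ("F2", "M2"), "Y": ("X", "Q"), "Z": ("Y", "R")}
GENOTYPES = {"F1": "AG", "M1": "AG", "C": "TT", "P": "GG", "D": "AG",
             "F2": "GG", "M2": "GG", "X": "TT", "Q": "GG",
             "Y": "GT", "R": "GG", "Z": "GT"}

print(clean(PEDIGREE, GENOTYPES))                                 # 5 of 12 genotypes lost
print(clean(PEDIGREE, GENOTYPES, top_down=True))                  # 4 of 12 lost
print(clean(PEDIGREE, GENOTYPES, top_down=True, cut_links=True))  # 1 of 12 lost
```

In this toy pedigree, masking everything at once throws away D even though D's error disappears once its parent C is masked, and the X, Y, Z branch shows the "chase to the bottom": masking X pushes the error down to Y and then Z, whereas cutting X's link to its recorded parents removes the error without losing any genotype.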
• Being careful not to use any other colours in the interface, we can see how cleaning is going (red vs blue)
• New masking interactions available through standard context menus (and through tables)
Representations
• With such a hypothetical / experimental method of cleaning errors, undo is a must
– Part of Shneiderman’s mantra
– Beyond single-step, branching history
Visual History
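A branching history differs from a single-step undo stack in that undoing and then trying something else keeps the abandoned attempt as a sibling branch. The sketch below is one minimal way to model that as a tree of states; the class names and the representation of a state as a set of masked items are assumptions for illustration, not VIPER's implementation.

```python
class HistoryNode:
    """One point in the cleaning history: a label and the set of masked items."""

    def __init__(self, label, masked, parent=None):
        self.label = label
        self.masked = frozenset(masked)
        self.parent = parent
        self.children = []


class BranchingHistory:
    def __init__(self):
        self.root = HistoryNode("start", ())
        self.current = self.root

    def apply(self, label, newly_masked):
        """Record a masking step as a new child of the current state."""
        node = HistoryNode(label, self.current.masked | set(newly_masked), self.current)
        self.current.children.append(node)
        self.current = node

    def undo(self):
        """Step back to the parent state; the abandoned branch is kept."""
        if self.current.parent is not None:
            self.current = self.current.parent


history = BranchingHistory()
history.apply("mask marker m2", {"m2"})
history.apply("mask individual ind7", {"ind7"})
history.undo()                                    # back to the state after masking m2
history.apply("cut link ind7-parents", {"link:ind7"})

print(sorted(history.current.masked))                             # ['link:ind7', 'm2']
print([c.label for c in history.current.parent.children])         # both branches survive
```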
• Genotype Checker vs VIPER+ interfaces
• Both run using the same underlying data checking algorithm
• Same dataset
• 11 Biologists/Geneticists/Bioinformaticians at The Roslin Institute
• Asked them to attempt a pair of representative tasks with both interfaces (split into 12 questions)
Experiment
Experiment - Objective
• Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration
[Chart: one value per participant (x-axis 1 to 11, y-axis 0 to 12); series: Genotype Checker, VIPER]
[Chart: one value per participant (x-axis 1 to 11, y-axis 0 to 8); series: Genotype Checker, VIPER]
Experiment - Subjective
Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer Genotype Checker (GC), Bold = Median
Question | 1 (VP) | 2 | 3 (No Pref) | 4 | 5 (GC)
Finding structural information on a pedigree | 7 | 1 | 2 | 1 | 0
Finding descendants of an individual | 8 | 2 | 0 | 1 | 0
Finding ancestors of an individual | 7 | 3 | 1 | 0 | 0
Finding error information on a single individual | 4 | 1 | 1 | 4 | 1
Finding error information on a single marker | 3 | 3 | 2 | 3 | 0
Distinguishing between different types of error | 7 | 2 | 2 | 0 | 0
Tracing errors to a shared parent | 8 | 0 | 2 | 1 | 0
Finding error information on a single family | 7 | 1 | 2 | 1 | 0
Comparing errors between related families (one shared parent) | 8 | 1 | 1 | 1 | 0
Masking errors | 1 | 2 | 4 | 3 | 1
Overall understanding of errors | 5 | 1 | 4 | 1 | 0
Overall ease of use | 5 | 2 | 3 | 0 | 1
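The median rating for each question can be recomputed from the response counts in the table. The sketch below is a small helper written for this purpose (the function name is an assumption); it relies on the total number of responses per row being odd, which holds here since every row sums to 11.

```python
def median_rating(counts):
    """Median of ratings 1..5 given counts[i] = number of responses with rating i + 1.

    Assumes an odd number of responses, so the median is a single rating.
    """
    middle = sum(counts) // 2 + 1          # 1-based position of the middle response
    running = 0
    for rating, n in enumerate(counts, start=1):
        running += n
        if running >= middle:
            return rating


# A few rows copied from the table above.
print(median_rating([7, 1, 2, 1, 0]))  # Finding structural information on a pedigree -> 1
print(median_rating([4, 1, 1, 4, 1]))  # Finding error information on a single individual -> 3
print(median_rating([3, 3, 2, 3, 0]))  # Finding error information on a single marker -> 2
print(median_rating([1, 2, 4, 3, 1]))  # Masking errors -> 3
```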
• A lot of incorrect/skipped answers in both scenarios
– GC 61/132 = 46%
– VP 45/132 = 34%
• These users were occasional users of cleaning software but it does show that Pedigree Cleaning is hard
• Excelitis – Biologists love Excel. The first move of many was to investigate the tables of error info rather than the main pedigree visualisation
Experiment - Observations
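The incorrect/skipped percentages quoted above can be reproduced with a line of arithmetic. The sketch assumes the denominator of 132 is the 11 participants times the 12 questions, which is consistent with the experiment description; the function name is made up for the example.

```python
def error_rate(wrong, participants=11, questions=12):
    """Share of incorrect or skipped answers out of all answers given."""
    return wrong / (participants * questions)


for interface, wrong in {"GC": 61, "VP": 45}.items():
    print(f"{interface}: {wrong}/132 = {error_rate(wrong):.0%}")
# GC: 61/132 = 46%
# VP: 45/132 = 34%
```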