Visualising errors in animal pedigree genotype data
DESCRIPTION
Presentation I gave at EuroVis 2011 on the VIPER project
TRANSCRIPT
VISUALISING ERRORS IN
ANIMAL PEDIGREE
GENOTYPE DATA
Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law
Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK
Pedigrees
Animal pedigrees are their family trees – who’s
whose father, mother etc
In animal breeding these pedigrees are strictly
controlled to maximise traits of value or
suppress unwanted ones
A genotype is the genetic make-up of an
animal
Pedigree + genotype = pedigree genotype
Not the whole genotype, use sets of markers
Marker type: SNP (Single Nucleotide
Polymorphism)
Each SNP has 2 alleles, one inherited from each parent
Pedigree Genotypes
Example individual:
Marker  Values
M1      C|T
M2      A|A
M3      A|G
...     ...
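As a minimal sketch of how such a record could be held in memory: the class name `Individual`, its fields, and the helper `parse_genotype` are assumptions made for illustration and are not taken from the VIPER software.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Genotype = Tuple[str, str]  # the two alleles recorded for one SNP marker


@dataclass
class Individual:
    """One animal: its recorded parents plus a genotype per marker (hypothetical layout)."""
    name: str
    sire: Optional[str] = None  # father's name, None if not recorded
    dam: Optional[str] = None   # mother's name, None if not recorded
    markers: Dict[str, Genotype] = field(default_factory=dict)


def parse_genotype(text: str) -> Genotype:
    """Split a value such as 'C|T' into its two alleles."""
    first, second = (part.strip() for part in text.split("|"))
    return (first, second)


# The example individual from the slide: three markers, two alleles each.
example = Individual(
    name="Example",
    markers={m: parse_genotype(v) for m, v in [("M1", "C|T"), ("M2", "A|A"), ("M3", "A|G")]},
)
print(example.markers["M1"])
```

Pedigree links are stored as names rather than object references here, so incomplete pedigrees can be represented by leaving a parent as None.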
But...
However, most large datasets have errors
Errors when recording pedigree
Technical errors, e.g. a wrongly detected marker
Misassigned samples
Also incomplete data
These errors make the data genetically
inconsistent
This makes them unusable for most downstream
analyses
?: C | ?
Example
Various possibilities here
Dad is Junior's father but the genotyping is
incorrect
Dad isn’t Junior’s father and the genotypes are
correct
Need to find/isolate/clean such data
Mum: A | A
Dad: G | G
Junior: A | C
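The inconsistency in this example can be checked mechanically. The sketch below assumes a simple single-marker Mendelian test (one allele must come from each parent); the function name `is_consistent` is hypothetical and this is not necessarily the check used in the project.

```python
def is_consistent(child, sire, dam):
    """Return True if the child's genotype at one marker can be explained by
    taking one allele from the sire and the other from the dam.

    Genotypes are pairs of allele letters, e.g. ("A", "C").
    A parent passed as None is treated as unknown and can supply any allele.
    """
    first, second = child

    def can_give(parent, allele):
        return parent is None or allele in parent

    # Either allele may have come from either parent, so try both assignments.
    return (can_give(sire, first) and can_give(dam, second)) or (
        can_give(sire, second) and can_give(dam, first)
    )


mum, dad, junior = ("A", "A"), ("G", "G"), ("A", "C")
print(is_consistent(junior, dad, mum))   # Junior's C cannot come from a G|G father
print(is_consistent(junior, None, mum))  # an unknown father could supply the C
```

A failed check only says that the trio is inconsistent; it cannot say whether the pedigree link or one of the genotypes is wrong, which is why the surrounding family context matters.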
Table Viewer
Current table-based viewer
Grid of markers x individuals; genotype values in
cells
Universally ‘bad’ markers or individuals stand out
Table Viewer
Expert biologists are needed to pinpoint the source of reported errors
But without a pedigree context to anchor the errors in, it’s impossible to do this
Previous Work
Multitude of pedigree viewers, but all have
issues with scalability or handling extra
(genotype) data
Voyage of Discovery
Mainly discovering representations that didn’t
work
Iterated through a number of different
representation styles that failed for various
reasons
Node-Link View
Can see that the pedigree clusters around a few males
But hard to follow edges/directions, loss of generational context
Hierarchical Node-Link View
Regain visual generation structure of pedigree
But plagued with more edge crossings than
before
Matrix View
Matrices are the main alternative to drawing node-link diagrams for relational information
We rejected having one overall matrix due to sparsity
Matrix View
One matrix per generation ‘gap’ (parents to offspring)
Rather than sources v sinks - sires v dams; offspring in cells
Allows sorting of parent genders by properties
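A rough sketch of such a per-generation matrix, stored sparsely as a dictionary keyed by (sire, dam). The input layout of (child, sire, dam) triples and both function names are assumptions for illustration, and sorting by offspring count is just one possible property to sort on.

```python
from collections import Counter, defaultdict


def generation_matrix(triples):
    """Build the sire x dam matrix for one generation gap.

    `triples` is an iterable of (child, sire, dam) names. Only filled cells
    are stored: the result maps (sire, dam) -> list of their offspring.
    """
    cells = defaultdict(list)
    for child, sire, dam in triples:
        cells[(sire, dam)].append(child)
    return dict(cells)


def parents_by_offspring_count(cells, index):
    """Order one parent gender by number of offspring, largest first.

    index 0 orders the sires (rows), index 1 orders the dams (columns).
    Ties are broken alphabetically so the ordering is stable.
    """
    counts = Counter()
    for key, children in cells.items():
        counts[key[index]] += len(children)
    return sorted(counts, key=lambda parent: (-counts[parent], parent))


triples = [("o1", "s1", "d1"), ("o2", "s1", "d1"), ("o3", "s1", "d2"), ("o4", "s2", "d3")]
cells = generation_matrix(triples)
print(parents_by_offspring_count(cells, 0))  # sires, most offspring first
print(parents_by_offspring_count(cells, 1))  # dams, most offspring first
```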
Sandwich View
Realised that in these matrices, either the rows
or columns will only have one filled cell each if
one of the parent genders is monogamous
In animal experiments this tends to be the
case: a female breeds with only one male per
generation
Each matrix can thus be replaced with a
compressed view
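The monogamy condition and the resulting compression can be sketched as follows. The (child, sire, dam) input and the function names are hypothetical, and raising an error when a dam has several mates is a simplification made here; it is not how the viewer itself handles that case.

```python
from collections import defaultdict


def is_monogamous(triples, parent="dam"):
    """True if every parent of the given gender has exactly one mate
    within this generation gap. `triples` holds (child, sire, dam) names."""
    mates = defaultdict(set)
    for _child, sire, dam in triples:
        if parent == "dam":
            mates[dam].add(sire)
        else:
            mates[sire].add(dam)
    return all(len(partners) == 1 for partners in mates.values())


def sandwich(triples):
    """Compress the sire x dam matrix into nested layers:
    sire -> dam -> offspring. Only valid when dams are monogamous."""
    if not is_monogamous(triples, "dam"):
        raise ValueError("a dam has more than one mate in this generation")
    layers = defaultdict(lambda: defaultdict(list))
    for child, sire, dam in triples:
        layers[sire][dam].append(child)
    return {sire: dict(dams) for sire, dams in layers.items()}


triples = [("o1", "s1", "d1"), ("o2", "s1", "d1"), ("o3", "s1", "d2"), ("o4", "s2", "d3")]
print(is_monogamous(triples, "dam"))   # each dam has a single sire
print(is_monogamous(triples, "sire"))  # s1 breeds with two dams
print(sandwich(triples))
```

Because each dam column has a single filled cell, the dams can be listed once under their sire without losing any information from the matrix.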
Sandwich View
The sandwich view is a specialised view of the
bipartite graph between two generations
With the top layer split into males/females and the
females pushed beneath the bottom layer
Sires
Offspring
Dams
Parents
Offspring
Connectors to repeated node representations if necessary
Sandwich View
Sandwich view of the relationships between
two adjacent generations
All the other pedigree views of full generations
involved tracing paths between
parents/offspring
Sires (Male Parents)
Dams (Female Parents)
Offspring
1 male has children
with multiple females
Sandwich View
Error Information
Colour is used to convey an individual’s error
status over all the markers in a data set
More errors = higher saturation
Parent – coloured by overall error count
Offspring drawn as hexagonal glyphs
‘Up’ triangle – incompatibilities with sire
‘Down’ triangle – incompatibilities with dam
Middle portion – markers exist that are not present
in either parent
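One way the three glyph segments and the saturation mapping could be computed is sketched below. The precise definitions used here are assumptions: an offspring marker counts as incompatible with a parent when they share no allele, and as "novel" when it carries an allele found in neither parent. The red hue and the linear scaling of saturation are likewise illustrative choices.

```python
import colorsys


def glyph_counts(child, sire, dam):
    """Count, over all markers, the errors feeding each glyph segment.

    Each argument maps marker name -> (allele, allele). Markers missing
    from a parent are skipped for that parent's test.
    """
    counts = {"sire": 0, "dam": 0, "novel": 0}
    for marker, alleles in child.items():
        sire_alleles = set(sire.get(marker, ()))
        dam_alleles = set(dam.get(marker, ()))
        if sire_alleles and not sire_alleles & set(alleles):
            counts["sire"] += 1   # 'up' triangle
        if dam_alleles and not dam_alleles & set(alleles):
            counts["dam"] += 1    # 'down' triangle
        if sire_alleles and dam_alleles and any(
            a not in sire_alleles | dam_alleles for a in alleles
        ):
            counts["novel"] += 1  # middle portion
    return counts


def error_colour(errors, max_errors, hue=0.0):
    """Map an error count to an RGB triple: more errors = higher saturation."""
    saturation = errors / max_errors if max_errors else 0.0
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


junior = {"M1": ("A", "C"), "M2": ("A", "A")}
dad = {"M1": ("G", "G"), "M2": ("A", "G")}
mum = {"M1": ("A", "A"), "M2": ("A", "A")}
print(glyph_counts(junior, dad, mum))
print(error_colour(0, 5), error_colour(5, 5))
```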
Aggregating offspring
Groups of siblings who share the same
parents can be aggregated under one glyph
Colouring now represents errors in all markers
over a group of individuals
Troublesome families & parents can be clearly
seen
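Aggregation of full siblings can be sketched as a group-by on the (sire, dam) pair. The record layout (child, sire, dam, error count) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_families(records):
    """Group offspring that share both parents and total their errors.

    `records` is an iterable of (child, sire, dam, error_count).
    Returns (sire, dam) -> {"members": [...], "errors": total}.
    """
    families = defaultdict(lambda: {"members": [], "errors": 0})
    for child, sire, dam, errors in records:
        family = families[(sire, dam)]
        family["members"].append(child)
        family["errors"] += errors
    return dict(families)


records = [("o1", "s1", "d1", 0), ("o2", "s1", "d1", 4), ("o3", "s1", "d2", 1)]
for parents, family in aggregate_families(records).items():
    print(parents, family)
```

The summed count can then drive the colouring of a single glyph for the whole sibling group.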
Error Information
Filtering
Error Filtering
The table view clearly showed
rogue markers and individuals, and these can be
filtered by a user in that application
To the sandwich view we add two complementary
histograms that perform the same purpose
Filtering
Error Filtering
Each histogram shows number of errors along the X axis
Number of individuals/markers with that number of errors on the Y axis
Typical pattern: A few individuals / markers have lots of errors, and the majority have a few or no errors
Mantra is to discard bad markers and look at bad individuals
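The histogram and the filtering it supports can be sketched as below. The input (a mapping from individual or marker name to its error count), the function names, and the threshold value are hypothetical.

```python
from collections import Counter


def error_histogram(error_counts):
    """Turn name -> error count into histogram bins:
    number of errors (X axis) -> how many names have that count (Y axis)."""
    return dict(sorted(Counter(error_counts.values()).items()))


def keep_at_most(error_counts, max_errors):
    """Names whose error count does not exceed the chosen threshold."""
    return {name for name, errors in error_counts.items() if errors <= max_errors}


# Typical pattern: most entries have few or no errors, one has many.
marker_errors = {"M1": 0, "M2": 0, "M3": 1, "M4": 0, "M5": 12}
print(error_histogram(marker_errors))
print(sorted(keep_at_most(marker_errors, 5)))
```

The same two functions apply to individuals as well as markers; only the interpretation differs, since bad markers are discarded while bad individuals are inspected.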
Sandwich view
Pic/Vid of full view (To Do)
Video
Conclusion
Developed new style of pedigree visualisation
Shows detailed errors at a family level
Shows overview of errors in an entire pedigree
Keeps offspring close to their parents for a
family-centric view
Future Work
Single marker views of errors
Making the sandwich into a club sandwich
Split the middle layer into multiple layers
e.g. by gender, to spot sex-related marker errors
Acknowledgements
Reviewers
BBSRC funded project