seqpig script language for large bioinformatic datasets
Post on 09-Jun-2015
SeqPig
A simple and scalable scripting language for large sequencing data sets in Hadoop
Arian Pasquali
June 6, 2014
/me
Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio
background
- engineering
- cloud computing
- data mining on big data
- social networks
study case
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K.
Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
http://www.ncbi.nlm.nih.gov/pubmed/24149054
but first, some background
● Real-world bioinformatics datasets are huge
● Gigabytes/Petabytes are hard to handle on a single computer
● In order to handle big data sets we have to master parallel programming models
Parallel programming models
some high-performance programming models
- Serial (doesn’t scale)
- MPI (expensive)
- MapReduce
  - Hadoop (cheap and scalable)
hadoop
Hadoop is an open source framework that enables you to run MapReduce programs.
It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.
http://hadoop.apache.org/
how MapReduce works on Hadoop
Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model
- easier to write programs than in other paradigms
- easier means cheaper
- runs on clusters of commodity hardware
- scales horizontally
- need more power? just add more nodes
an application: BLAST algorithm
MapReduce tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
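The task list above can be sketched in a single process. This is only an illustration of the load / map / partition / reduce / output flow, not Hadoop code and not BLAST: it counts k-mers in a few made-up reads, and every function name here (`load`, `map_kmers`, `partition`, `reduce_counts`, `run`) is a hypothetical helper chosen for the example.

```python
from collections import defaultdict


def load(records):
    """Load step: yield (read_id, sequence) pairs, normalised to upper case."""
    for read_id, seq in records:
        yield read_id, seq.upper()


def map_kmers(read_id, seq, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], 1


def partition(pairs, num_partitions=2):
    """Partition step: route each key to one bucket and group its values."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Sum of character codes: deterministic across runs, unlike hash() on str.
        index = sum(map(ord, key)) % num_partitions
        buckets[index][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Reduce step: merge all values of each key into a single count."""
    return {key: sum(values) for key, values in bucket.items()}


def run(records, k=3, num_partitions=2):
    """Chain the steps; on a cluster each bucket would go to a separate reducer."""
    pairs = (
        pair
        for read_id, seq in load(records)
        for pair in map_kmers(read_id, seq, k)
    )
    result = {}
    for bucket in partition(pairs, num_partitions):
        result.update(reduce_counts(bucket))
    return result


if __name__ == "__main__":
    # Output step: print the merged counts (toy input, not real sequencing data).
    reads = [("r1", "ACGTAC"), ("r2", "acgt")]
    for kmer, count in sorted(run(reads).items()):
        print(kmer, count)
```

Because every occurrence of a key lands in the same bucket, the buckets can be reduced independently, which is what lets the real framework spread the reduce work over many nodes.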
MapReduce is easier, but not trivial
Apache Pig tries to solve that
Apache Pig solves that. Under the hood it applies the MapReduce paradigm.
It hides the pitfalls of writing MapReduce code.
Pig version of the same code
Apache Pig in Bioinformatics
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
It can be easier
SeqPig
Scalable scripting language based on Apache Pig for large-scale sequence analysis
SeqPig
● a script language,
● a library,
● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner
http://seqpig.sourceforge.net/
SeqPig and data format support
Currently it supports
- BAM, SAM, FastQ, Qseq: input and output
- FASTA: input
possible use cases
● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read mapping quality statistics
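To make the first use case concrete, here is a minimal single-machine sketch of a format conversion, FASTQ to FASTA. It is plain Python rather than SeqPig, it assumes unwrapped four-line FASTQ records, and `fastq_to_fasta` is a hypothetical helper name; SeqPig would run the equivalent step in parallel on Hadoop.

```python
def fastq_to_fasta(lines):
    """Convert FASTQ text lines into FASTA text lines.

    Assumes each record is exactly four lines:
    '@id', sequence, '+...', quality string.
    """
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines")

    fasta = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # FASTA keeps only the identifier and the bases; quality is dropped.
        fasta.append(">" + header[1:])
        fasta.append(sequence)
    return fasta


if __name__ == "__main__":
    record = ["@read1", "ACGT", "+", "IIII"]
    print("\n".join(fastq_to_fasta(record)))
```

The conversion is lossy, since FASTA has no place for the per-base quality string.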
code example
-- filter_defs.pig provides the flag filters used below
run scripts/filter_defs.pig
-- load the alignments from a BAM file
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- split every read into its individual bases
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
-- count each (reference base, position, read base) combination
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
-- regroup by (reference base, position) and sort the read bases by count
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
    TMP2 = ORDER TMP1 BY bcount desc;
    GENERATE group.$0, group.$1, TMP2;
}
STORE base_stats into 'outputfile_readstats.txt';
results
A 0 {(A,19),(G,2)}
A 1 {(A,10)}
A 2 {(A,18)}
A 3 {(A,16)}
A 4 {(A,14)}
A 5 {(A,15)}
A 6 {(A,16),(G,2)}
...
A 98 {(A,7)}
A 99 {(A,14)}
C 0 {(C,6)}
C 1 {(C,11)}
C 2 {(C,9)}
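For readers who do not know Pig, the grouping logic that produces rows of this shape can be mimicked in plain Python. This is a sketch on invented toy observations, not the SeqPig implementation: `base_stats` and `format_row` are hypothetical names, and the input is assumed to be already split into (reference base, position, read base) tuples.

```python
from collections import Counter, defaultdict


def base_stats(observations):
    """Count read bases per (reference base, position), most frequent first."""
    # First grouping: count each (refbase, basepos, readbase) combination.
    counts = Counter(
        (refbase, basepos, readbase.upper())
        for refbase, basepos, readbase in observations
    )
    # Second grouping: collect the counted read bases per (refbase, basepos).
    grouped = defaultdict(list)
    for (refbase, basepos, readbase), bcount in counts.items():
        grouped[(refbase, basepos)].append((readbase, bcount))
    # Order each bag by count, descending.
    return {
        key: sorted(bag, key=lambda item: item[1], reverse=True)
        for key, bag in grouped.items()
    }


def format_row(key, bag):
    """Render one row as tab-separated text with a Pig-style bag."""
    refbase, basepos = key
    inner = ",".join(f"({readbase},{bcount})" for readbase, bcount in bag)
    return f"{refbase}\t{basepos}\t{{{inner}}}"


if __name__ == "__main__":
    # Toy data: three matching A bases, one mismatching g, two matching C bases.
    toy = [("A", 0, "A")] * 3 + [("A", 0, "g")] + [("C", 0, "C")] * 2
    for key, bag in sorted(base_stats(toy).items()):
        print(format_row(key, bag))
```

A bag with more than one entry, such as the `(G,2)` next to `(A,19)` in the results above, marks a position where some reads disagree with the reference base.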
results plotted
scalability test
● 61 GB dataset
● running some FastQC stats
* speed in minutes
related work
Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817
BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
some cloud computing solutions
Amazon AWS, general purpose
http://aws.amazon.com/
Mortar Data, focused on data science
http://www.mortardata.com/
CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/
cloudgene, mapreduce for bioinformatics
conclusions
Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science.
Neural networks in artificial intelligence and machine learning are one example.
Now, scalable approaches from data mining are helping bioinformatics move forward, faster and cheaper.