SeqPig: a script language for large bioinformatic datasets

SeqPig
A simple and scalable scripting language for large sequencing data sets in Hadoop

Arian Pasquali
June 6, 2014

Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio

background
- engineering
- cloud computing
- data mining on big data
- social networks

case study

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K.

Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.

http://www.ncbi.nlm.nih.gov/pubmed/24149054

but first, some background

● Real-world bioinformatics datasets are huge
● Gigabytes/Petabytes are hard to handle on a single computer
● In order to handle big data sets we have to master parallel programming models

Parallel programming models

some high-performance programming models
- Serial (doesn’t scale)
- MPI (expensive)
- MapReduce
  - Hadoop (cheap and scalable)

hadoop

Hadoop is an open source framework that enables you to run MapReduce programs.

It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.

http://hadoop.apache.org/

how mapreduce works on hadoop
Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model:
- easier to write programs than in other paradigms
- easier means cheaper
- runs on clusters of commodity hardware
- scales horizontally
  - need more power? just add more nodes

an application: BLAST algorithm

MapReduce tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
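The five steps above can be sketched in a single Python process. This is not BLAST and not Hadoop code: it is a toy k-mer count, chosen only because it is small enough to follow, and every function name and the input data are assumptions made for illustration.

```python
from collections import defaultdict

def load_data():
    # Toy in-memory input standing in for files on a cluster (assumption).
    return ["ACGTAC", "GTACGT"]

def map_sequence(seq, k=3):
    # Map step: emit a (k-mer, 1) pair for every window of length k.
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]

def partition(pairs, num_reducers=2):
    # Partition step: route each key to one reducer bucket.
    # A byte sum is used instead of hash() so runs are repeatable.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        idx = sum(key.encode()) % num_reducers
        buckets[idx][key].append(value)
    return buckets

def reduce_bucket(bucket):
    # Reduce step: merge all values that share a key.
    return {key: sum(values) for key, values in bucket.items()}

def run_job(k=3, num_reducers=2):
    pairs = []
    for seq in load_data():
        pairs.extend(map_sequence(seq, k))
    result = {}
    for bucket in partition(pairs, num_reducers):
        result.update(reduce_bucket(bucket))
    return result

# Output step: print the merged counts in a stable order.
kmer_counts = run_job()
for kmer in sorted(kmer_counts):
    print(kmer, kmer_counts[kmer])
```

On a real cluster the map and reduce functions would run on different nodes and the framework would handle partitioning and fault tolerance; here everything runs sequentially to keep the data flow visible.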

MapReduce is easier, but not trivial

Apache Pig tries to solve that

Apache Pig solves that. Under the hood it applies the MapReduce paradigm.
It hides the pitfalls of writing MapReduce code.

Pig version of the same code

Apache Pig in Bioinformatics
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

It can be easier

SeqPig
Scalable scripting language based on Apache Pig for large-scale sequence analysis

SeqPig

● a script language,
● a library,
● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner

http://seqpig.sourceforge.net/

SeqPig and data format support

Currently it supports
- BAM, SAM, FastQ and Qseq input and output
- FASTA input

possible use cases

● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read-mapping-quality statistics
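As an illustration of the first use case, here is a minimal single-machine sketch of a FASTQ to FASTA conversion in Python. It is not SeqPig code; the function name is hypothetical, and it assumes simple four-line FASTQ records (header, sequence, separator, qualities) with no line wrapping.

```python
def fastq_to_fasta(fastq_text):
    # Assumes simple four-line FASTQ records: @id, sequence, '+', qualities.
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, _qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        # FASTA keeps only the identifier and the sequence.
        out.append(">" + header[1:])
        out.append(seq)
    return "\n".join(out) + "\n"

example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
fasta = fastq_to_fasta(example)
print(fasta, end="")
```

The point of a tool like SeqPig is that the same kind of conversion can be expressed as a short script and distributed over a cluster instead of being run file by file on one machine.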

code example

run scripts/filter_defs.pig

A = load 'input.bam' using BamLoader('yes');

B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);

C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');

D = FOREACH C GENERATE FLATTEN($0);

base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;

base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);

base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;

base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);

base_stats = FOREACH base_stats_grouped {

    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;

    TMP2 = ORDER TMP1 BY bcount desc;

    GENERATE group.$0, group.$1, TMP2;

};

STORE base_stats into 'outputfile_readstats.txt';
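To make the grouping logic of the script easier to follow, here is a rough Python equivalent of its last few statements, run on a handful of invented (refbase, basepos, readbase) rows. The rows and the function name are assumptions for illustration only; the sketch does not read BAM files and does not use any SeqPig function.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows of (refbase, basepos, readbase), standing in for relation D.
rows = [
    ("A", 0, "a"), ("A", 0, "A"), ("A", 0, "G"),
    ("A", 1, "A"),
    ("C", 0, "C"), ("C", 0, "c"),
]

def base_stats(rows):
    # GROUP BY (refbase, basepos, readbase) and COUNT, after UPPER(readbase).
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in rows)
    # Regroup by (refbase, basepos), collecting (readbase, bcount) pairs.
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))
    # ORDER each bag by bcount descending; ties are broken by base letter,
    # which is an addition here to keep the output deterministic.
    return {
        key: sorted(bag, key=lambda item: (-item[1], item[0]))
        for key, bag in grouped.items()
    }

stats = base_stats(rows)
for (ref, pos), bag in sorted(stats.items()):
    print(ref, pos, bag)
```

Each printed line has the same shape as the results that follow: a reference base, a position in the read, and the observed read bases with their counts, most frequent first.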

results

A 0 {(A,19),(G,2)}

A 1 {(A,10)}

A 2 {(A,18)}

A 3 {(A,16)}

A 4 {(A,14)}

A 5 {(A,15)}

A 6 {(A,16),(G,2)}

A 98 {(A,7)}

A 99 {(A,14)}

C 0 {(C,6)}

C 1 {(C,11)}

C 2 {(C,9)}

results plotted

scalability test
● 61 GB dataset
● running some FastQC stats

* speed in minutes

related work

Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817

BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528

some cloud computing solutions

Amazon AWS, general purpose
http://aws.amazon.com/

Mortar Data, focused on data science
http://www.mortardata.com/

CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/

cloudgene, mapreduce for bioinformatics

conclusions

Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science. Neural networks in artificial intelligence and machine learning are an example.

Now, large-scale approaches from data mining are helping bioinformatics move forward, faster and cheaper.

thank you
hi@arianpasquali.com
