EBI is an Outstation of the European Molecular Biology Laboratory.
EMBOSS
European Molecular Biology Open Software Suite
Open-Bio Project Update 2010
Peter Rice
BOSC 2010: EMBOSS 12.04.23
A quick introduction
• Open source package for sequence analysis
• ANSI C source code
• GPL licensed applications, LGPL libraries
• 200+ applications
• 100+ third party applications in 15 associated packages
  • MIRA, MEME, HMMER, PHYLIP, etc.
• Project started 1996 at Sanger and HGMP
• Now based at EBI
• Release 1.0.0 15th July 2000
• Release 6.3.0 15th July 2010
• Funded by UK-BBSRC and EMBL-EBI
• Originally funded by the Wellcome Trust
• Additional funds from UK-MRC
Who do we serve?
• Expert software developers
  • Bioinformaticians
  • Computer scientists
• Expert users
  • Biology research community
  • Industry
• Scientific users
  • Biology research community
  • Industry
EMBOSS command line interface
• EMBOSS applications run from the command line
• This is not the only interface
  • There are over 100 interfaces and packaged systems available
  • Web: wEMBOSS
  • GUI: Jemboss
  • Web Services: SoapLab
  • Workflows: Galaxy, Taverna
  • Windows: mEMBOSS
• All applications have a command definition file (.acd)
  • Defines all inputs, outputs, and other options
  • Read at startup
  • Contains all command line options with descriptions
  • Template for any other interface
EMBOSS Update
• Release 6.3.0 as usual on 15th July 2010
• New support for NGS sequence formats
• Adaptor detection added to supermatcher
• Metadata and ontologies
• Full set of public data resources
• Three open source books: users, developers, admin
  • Cambridge University Press
NGS sequence formats
• SAM format: tab-delimited short read data
• BAM format: binary compressed SAM format
  • More work needed on remote access to mapped reads
• FASTQ short reads and quality scores
• OpenBio project collaboration on format standards
• Improved error detection (for all formats)
• Improved performance for input and output
• Indexing in dbxflat
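To make "tab-delimited" concrete, here is a minimal Python sketch that splits one SAM alignment line into the eleven mandatory columns of the SAM format, with any optional tags kept after them. This is not EMBOSS code (EMBOSS itself is ANSI C), and the read name, reference name, and values in the sample line are made up for illustration.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns, plus any optional TAG:TYPE:VALUE fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line on tabs and convert the numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=tuple(fields[11:]),
    )


# Hypothetical alignment line, invented for this sketch.
example = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)

# Bit 0x4 of FLAG marks an unmapped read.
is_mapped = not (record.flag & 0x4)
print(record.qname, record.rname, record.pos, is_mapped)
```

Header lines (those starting with `@`) are not handled here; a fuller reader would skip or parse them before the alignment lines.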
NGS sequence formats
• FASTQ joint effort with Bio* projects
• Definition of 3 conflicting FASTQ formats
• Agreement on standard parsing procedures

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
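The conflict between FASTQ variants is mostly about how the quality line is encoded. The variants usually distinguished are Sanger (Phred scores, ASCII offset 33), Solexa (Solexa-scaled scores, offset 64) and Illumina 1.3+ (Phred scores, offset 64). The Python sketch below reads simple four-line records and decodes the quality string of the first record from the slide. Treating that record as Sanger-encoded is an assumption made here, not something the slide states, and the function names are illustrative rather than part of EMBOSS or any Bio* library.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from unwrapped four-line records."""
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@"):
            raise ValueError(f"record {i // 4}: header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {i // 4}: third line must start with '+'")
        # The identifier after '+' is optional, but must match if present.
        if plus[1:] and plus[1:] != header[1:]:
            raise ValueError(f"record {i // 4}: '+' line does not match header")
        if len(seq) != len(qual):
            raise ValueError(f"record {i // 4}: sequence/quality length mismatch")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred scores at the given ASCII offset."""
    return [ord(ch) - offset for ch in qual]


# First record from the slide.
FASTQ_LINES = [
    "@EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    "+EAS54_6_R1_2_1_413_324",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
]

records = list(read_fastq(FASTQ_LINES))
name, seq, qual = records[0]
scores = phred_scores(qual)  # offset 33 is the assumed (Sanger) encoding
print(name, len(seq), scores[:3])
```

Counting lines rather than searching for `@` matters because `@` is itself a valid quality character, so a quality line can begin with it. Records with wrapped sequence or quality lines would need a more careful reader than this one.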
Other sequence formats
>AB036666 AB036666 Wolbachia sp. wKue genes
cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt
New data sources for EMBOSS
• BioMart access
  • As a sequence database, define sequence, identifier, etc.
  • Need to define a very large number of databases
• Ensembl access
  • Code from Michael Schuster
  • Ensembl SQL access code in library (access method soon)
  • Same issues as BioMart
• DAS 1.6 client access planned
• GMOD access planned
• BioSQL access planned
Data servers
• Defining individual sequence databases is tedious
• Many database definitions are similar
• Simplify (and extend) with server definitions:
  • SRS
  • MRS
  • BioMart
  • Ensembl
  • DAS 1.6
• Define server
• USA (Uniform Sequence Address) to give server:dbname:queryfield-value
• Database name and query field known to user
  • Or reported by a query to the server in an extended showdb
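As a rough illustration of the `server:dbname:queryfield-value` pattern, the Python sketch below splits such a string into its four parts. It only mirrors the pattern as written on the slide; it is not how EMBOSS parses a USA, and the server, database, field, and identifier in the example are hypothetical placeholders.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa: str) -> ServerQuery:
    """Split 'server:dbname:queryfield-value' into its components."""
    parts = usa.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected server:dbname:queryfield-value")
    server, dbname, query = parts
    # Split on the first hyphen only, so the value may itself contain hyphens.
    field, sep, value = query.partition("-")
    if not (server and dbname and field and sep and value):
        raise ValueError("expected server:dbname:queryfield-value")
    return ServerQuery(server, dbname, field, value)


# Hypothetical names, used only to exercise the pattern.
query = parse_server_usa("exampleserver:exampledb:id-ABC123")
print(query)
```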
New data sources for EMBOSS (2)
• Non-sequence data
• Cross-referenced resources from EMBL/UniProt/etc.
• Useful to return as:
  • Identifiers
  • Text for entries
  • HTML with markup
  • URLs for browsing
• Dbxref.dat
  • List of all known data resources
  • Standard names
  • Standard queries for sequence, text, HTML, etc.
  • Query by identifier and other fields
Ontologies
• Support for OBO format ontologies:
  • Gene Ontology
  • Sequence Ontology (used internally for features)
  • BioSapiens Ontology (used internally for features)
• Parsing and format validation
• Indexing with new dbx applications
• Indexing cross-references in EMBL/UniProt/etc.
• Navigation up, down, siblings, etc.
• Remote and local access
Ontologies: EDAM
• EMBRACE Datatypes And Methods
• OBO format (so far)
• All ACD files have relations attributes
  • “topic” for application (Immunological analysis)
  • “operation” for application (Epitope mapping)
  • “data” for inputs and outputs
    • Pure protein sequence
    • Sequence record
    • 1 or more
    • Sequence length
    • “Peptide immunogenicity report”
• Validation by acdvalid application
EDAM in ACD
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]

seqall: sequence [
  parameter: "Y"
  type: "proteinstandard"
  relations: "/edam/data/0001219 Pure protein sequence"
  relations: "/edam/data/0000849 Sequence record"
  relations: "/edam/data/0002178 1 or more"
]

integer: minlen [
  standard: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length of antigenic region"
  relations: "/edam/data/0001249 Sequence length"
]

report: outfile [
  parameter: "Y"
  rformat: "motif"
  multiple: "Y"
  taglist: "int:pos=Max_score_pos"
  relations: "/edam/data/0001534 Peptide immunogenicity report"
]
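Because each `relations` attribute is a quoted string holding an EDAM path and a label, the annotations are easy to pull out mechanically. The Python sketch below does this with two regular expressions over a shortened copy of the antigenic ACD text from the slide. It is an illustration only: it is not how acdvalid or the EMBOSS ACD parser works, and a regular expression this simple would fail on ACD files whose quoted values contain `]`.

```python
import re
from typing import Dict, List, Tuple

# Two blocks copied from the ACD excerpt on the slide.
ACD_TEXT = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]

integer: minlen [
  standard: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length of antigenic region"
  relations: "/edam/data/0001249 Sequence length"
]
'''

# "type: name [ ... ]" blocks, and the relations attributes inside them.
BLOCK = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.DOTALL)
RELATION = re.compile(r'relations:\s*"/edam/(\w+)/(\d+)\s+([^"]*)"')


def edam_relations(acd_text: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each ACD block name to its (branch, term id, label) annotations."""
    result: Dict[str, List[Tuple[str, str, str]]] = {}
    for _kind, name, body in BLOCK.findall(acd_text):
        result[name] = RELATION.findall(body)
    return result


relations = edam_relations(ACD_TEXT)
for name, annotations in relations.items():
    for branch, term_id, label in annotations:
        print(f"{name}: {branch} {term_id} ({label})")
```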
Ontologies: EDAM (2)
• SoapLab web services annotated with EDAM
• EDAM terms parsed from ACD files
• Web services have WSDL files
• SAWSDL annotation with EDAM terms
• Annotation can be used by BioCatalogue
  • www.biocatalogue.org
• Also can be used by EMBRACE registry
  • www.embraceregistry.net
Ontologies: NCBI Taxonomy
• Parsers for “.dmp” files
• Will add dbx indexing applications
• Local and remote access
• Navigation up, down, siblings (the usual suspects)
• Automatic cross references from sequence data
  • EMBL source line
  • UniProt OX lines
  • BioMart mart name (organism name)
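For a sense of what parsing a ".dmp" file involves: the NCBI taxonomy dump files separate fields with a tab, a vertical bar, and a tab, and end each row with a tab and a vertical bar. In `nodes.dmp` the first three columns are the taxon identifier, its parent identifier, and the rank. The Python sketch below reads those three columns and walks up and down the tree. It is not EMBOSS code, and the four-row table is a toy with made-up identifiers that keeps only the first three columns of the real file.

```python
from typing import Dict, List, Tuple

# Toy nodes.dmp excerpt; identifiers are invented and extra columns omitted.
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tgenus\t|\n"
    "100\t|\t10\t|\tspecies\t|\n"
    "101\t|\t10\t|\tspecies\t|\n"
)


def parse_dmp_line(line: str) -> List[str]:
    """Strip the row terminator, then split on the tab-bar-tab delimiter."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(text: str) -> Tuple[Dict[int, int], Dict[int, str]]:
    """Return (parent-by-taxon, rank-by-taxon) dictionaries."""
    parent: Dict[int, int] = {}
    rank: Dict[int, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = parse_dmp_line(line)
        tax_id = int(fields[0])
        parent[tax_id] = int(fields[1])
        rank[tax_id] = fields[2]
    return parent, rank


def lineage(tax_id: int, parent: Dict[int, int]) -> List[int]:
    """Navigate up: follow parent links until the root, which is its own parent."""
    path = [tax_id]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def children(tax_id: int, parent: Dict[int, int]) -> List[int]:
    """Navigate down: taxa whose parent is tax_id (the root is excluded)."""
    return sorted(t for t, p in parent.items() if p == tax_id and t != tax_id)


parent, rank = load_nodes(NODES_DMP)
print(lineage(100, parent))
print(children(10, parent))
```

A linear scan for children is fine for a toy table; for the full taxonomy an index of the kind the dbx applications would provide is the more practical route.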
EMBOSS Interfaces and wrappers
• Two releases in the past year
• Possibly three releases next year
• Too many for other projects to keep up
• So we are obliged to help, starting with:
  • SoapLab2
  • Jemboss
  • Galaxy
  • Pipeline Pilot
  • BioPerl
  • wEMBOSS and Explorer
  • G-language?
  • ... and anyone else who asks!
The EMBOSS Team
Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
Acknowledgements
• EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider
• RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
• LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
• Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
• National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
• Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas
• IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
• Open-Bio Foundation, SourceForge, ... and the British Antarctic Survey
http://emboss.sourceforge.net
http://emboss.open-bio.org/wiki