Essential Skills for Bioinformatics: Unix/Linux

Upload: buithu

Post on 14-Apr-2018

Page 1: Essential Skills for Bioinformatics: Unix/Linux - DGIST · • Data can remain compressed on the disk throughout processing and analyses. Most well-written bioinformatics tools can

Essential Skills for Bioinformatics: Unix/Linux

WORKING WITH COMPRESSED DATA

Overview

• Data compression, the process of condensing data so that it takes up less space (on disk drives, in memory, or across network transfer), is an indispensable technology in modern bioinformatics.

• For example, sequences from a recent Illumina HiSeq run:
  • example.fastq: 63,203,414,514 bytes (59 GB)
  • example.fastq.gz: 21,408,674,240 bytes (20 GB)

• The compression ratio (uncompressed size/compressed size) of this data is 2.95, which translates to a significant space saving of about 66%.
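The ratio and the space saving quoted above can be checked with a line of arithmetic. This is a minimal sketch that uses the two byte counts from the example; awk is used only because plain shell arithmetic has no floating-point division.

```shell
# Sizes in bytes, taken from the example above.
uncompressed=63203414514
compressed=21408674240

# ratio  = uncompressed / compressed
# saving = 1 - compressed / uncompressed, expressed as a percentage
summary=$(awk -v u="$uncompressed" -v c="$compressed" \
    'BEGIN { printf "ratio=%.2f saving=%.0f%%", u / c, (1 - c / u) * 100 }')
echo "$summary"
```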

• Data can remain compressed on the disk throughout processing and analyses. Most well-written bioinformatics tools can work natively with compressed data as input, without requiring us to decompress it to disk first.

• Using pipes and redirection, we can stream compressed data and write compressed files directly to the disk. Common Unix tools like cat and grep have variants that work with compressed data.

• While working with large datasets in bioinformatics can be challenging, using the compression tools in Unix and software libraries makes our lives much easier.

gzip

• The two most common compression systems used on Unix are gzip and bzip2.

• gzip is faster than bzip2.

• bzip2 has a higher compression ratio (the previous FASTQ file is only about 16 GB when compressed with bzip2).

• Generally, gzip is used in bioinformatics to compress most sizable files, while bzip2 is more common for long-term data archiving.

• gzip can compress data read from standard input. This is useful, as we can compress results directly from another bioinformatics program’s standard output.
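A minimal sketch of compressing a stream, assuming a hypothetical output file name (out.fasta.gz). Here printf stands in for a bioinformatics program that writes FASTA records to standard output.

```shell
workdir=$(mktemp -d)   # scratch directory so no real data is touched

# printf plays the role of a program writing results to standard output;
# gzip reads that stream from standard input and writes compressed bytes.
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' | gzip > "$workdir/out.fasta.gz"

# gzip -t tests the integrity of the compressed file without extracting it.
gzip -t "$workdir/out.fasta.gz" && echo "compressed output is valid"
```

The uncompressed intermediate never touches the disk, which matters when the stream is tens of gigabytes.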

• gzip can also compress files on disk in place, replacing the original uncompressed version with the compressed file (appending the extension .gz to the original filename).
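In-place compression might look like the sketch below. The file name tb1.fasta is borrowed from the gunzip slide; its contents here are made up for the demonstration.

```shell
workdir=$(mktemp -d)
printf '>tb1\nACGTACGTACGT\n' > "$workdir/tb1.fasta"   # made-up demo content

gzip "$workdir/tb1.fasta"   # replaces tb1.fasta with tb1.fasta.gz
ls "$workdir"               # only tb1.fasta.gz is listed
```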

gunzip

• We can decompress files in place with the command gunzip.

• Note that gunzip replaces the compressed file (e.g. tb1.fasta.gz) with the decompressed version (tb1.fasta).
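A short sketch of in-place decompression, using a small made-up tb1.fasta.gz created on the spot:

```shell
workdir=$(mktemp -d)
printf '>tb1\nACGT\n' | gzip > "$workdir/tb1.fasta.gz"   # made-up demo file

gunzip "$workdir/tb1.fasta.gz"   # tb1.fasta.gz is replaced by tb1.fasta
cat "$workdir/tb1.fasta"
```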

gzip -c

• Both gzip and gunzip can also write their results to standard output, leaving the input file untouched. This is enabled with the -c option.
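As an illustration of -c, the sketch below compresses and decompresses through standard output so that every input file is kept. The file names are placeholders.

```shell
workdir=$(mktemp -d)
printf '>tb1\nACGTACGT\n' > "$workdir/tb1.fasta"   # made-up demo file

gzip -c "$workdir/tb1.fasta" > "$workdir/tb1.fasta.gz"      # tb1.fasta is kept
gunzip -c "$workdir/tb1.fasta.gz" > "$workdir/copy.fasta"   # tb1.fasta.gz is kept

# cmp is silent and returns 0 when the two files are byte-identical.
cmp "$workdir/tb1.fasta" "$workdir/copy.fasta" && echo "round trip is lossless"
```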

gzip with multiple files
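The example for this slide did not survive in the transcript, so the following is only a plausible sketch of two common ways to put several files into one gzipped file. The FASTQ file names and contents are invented. The second approach relies on the fact that concatenated gzip streams decompress as a single stream.

```shell
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/in1.fastq"   # invented demo reads
printf '@r2\nTTGG\n+\nIIII\n' > "$workdir/in2.fastq"

# Option 1: concatenate first, then compress the combined stream.
cat "$workdir/in1.fastq" "$workdir/in2.fastq" | gzip > "$workdir/both.fastq.gz"

# Option 2: compress each file separately and append with >>.
gzip -c "$workdir/in1.fastq" >  "$workdir/appended.fastq.gz"
gzip -c "$workdir/in2.fastq" >> "$workdir/appended.fastq.gz"

# Both archives decompress to the same eight lines.
gunzip -c "$workdir/both.fastq.gz" | wc -l
gunzip -c "$workdir/appended.fastq.gz" | wc -l
```

Option 1 usually compresses slightly better, while option 2 lets new data be added without recompressing what is already there.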

Working with gzipped files

• The greatest advantage of gzip (and bzip2) is that many Unix and bioinformatics tools can work directly with compressed files.

• For example, we can search compressed files using grep’s analog for gzipped files, zgrep. Likewise, cat has zcat. If programs cannot handle compressed input, you can use zcat and pipe output directly to the standard input of another program.
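A brief sketch of both ideas on a small made-up FASTA file: zgrep searches the compressed file directly, and zcat feeds decompressed data to a program that only reads plain text (here, grep and tr).

```shell
workdir=$(mktemp -d)
printf '>chr1 demo\nACGT\n>chr2 demo\nGGCC\n' | gzip > "$workdir/demo.fa.gz"

# Count header lines without decompressing to disk.
n_headers=$(zgrep -c '^>' "$workdir/demo.fa.gz")
echo "headers: $n_headers"

# Stream the decompressed data into tools that expect plain text.
bases=$(zcat "$workdir/demo.fa.gz" | grep -v '^>' | tr -d '\n')
echo "sequence: $bases"
```

On some systems (notably macOS) zcat expects .Z files; gunzip -c is a portable substitute there.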

Creating a tar.gz archive
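The slide's example is missing from the transcript. A typical way to create a gzipped tar archive is sketched below; the directory name project and its files are invented.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project"
printf '>tb1\nACGT\n' > "$workdir/project/tb1.fasta"
printf 'demo notes\n' > "$workdir/project/README"

# -c create, -z compress with gzip, -f archive file name.
# -C makes tar change into $workdir first, so stored paths start at project/.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# -t lists the archive contents without extracting.
tar -tzf "$workdir/project.tar.gz"
```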

Extracting a tar.gz file
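Extraction is the mirror image of creation. In this sketch a small invented archive is built first so the example is self-contained, then unpacked into a separate directory.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project" "$workdir/restore"
printf '>tb1\nACGT\n' > "$workdir/project/tb1.fasta"
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# -x extract, -z decompress with gzip, -f archive file name.
# -C chooses the directory to extract into.
tar -xzf "$workdir/project.tar.gz" -C "$workdir/restore"
ls "$workdir/restore/project"
```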

CASE STUDY: REPRODUCIBLY DOWNLOADING DATA

GRCm38 mouse reference genome

• We usually download genomic resources like sequence and annotation files from remote servers over the Internet, which may change in the future. Furthermore, new versions of sequence and annotation data may be released, so it is imperative that we document everything about how data was acquired for full reproducibility.

• The human, mouse, zebrafish, and chicken genome releases are coordinated through the Genome Reference Consortium (https://www.ncbi.nlm.nih.gov/grc).

• The “GRC” prefix in GRCm38 refers to the Genome Reference Consortium.

• We can download GRCm38 from Ensembl using wget.

Compare checksum values

From ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/dna/CHECKSUMS
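The mechanics of comparing checksums can be sketched on a local stand-in file. Everything here is an assumption for illustration: demo.fa.gz replaces the real download, and the reference value is generated locally, whereas in practice it would be copied from the provider's published checksum file and computed with whichever tool the provider used.

```shell
workdir=$(mktemp -d)
printf '>MT demo\nGTTAATGTAGCTTAATAAC\n' | gzip > "$workdir/demo.fa.gz"   # stand-in for the download

# Record a reference checksum (in practice: the published value).
(cd "$workdir" && sha1sum demo.fa.gz > demo.sha1)

# -c recomputes the checksum and compares it with the recorded one.
(cd "$workdir" && sha1sum -c demo.sha1)
```

A mismatch makes sha1sum -c exit with a non-zero status, so the check can guard later steps in a script.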

Extract the FASTA headers
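FASTA header lines start with >, so one way to extract them from a gzipped file is zgrep with an anchored pattern. The file and its header text below are invented for the demonstration.

```shell
workdir=$(mktemp -d)
printf '>MT dna:chromosome demo header\nGTTAATGT\nAGCTTAAT\n' | gzip > "$workdir/demo.fa.gz"

# ^> matches only lines that begin with the FASTA header marker.
headers=$(zgrep '^>' "$workdir/demo.fa.gz")
echo "$headers"
```

Quoting the pattern matters: an unquoted > would be read by the shell as output redirection.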

Document README

• Document how and when we downloaded this file in README

• Copy the SHA-1 checksum values into README
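One possible way to produce such a README from the shell is sketched below. The layout of the file is an assumption, demo.fa.gz stands in for the downloaded genome file, and the wget line is a placeholder to be replaced by the command actually used.

```shell
workdir=$(mktemp -d)
printf '>MT demo\nGTTAATGT\n' | gzip > "$workdir/demo.fa.gz"   # stand-in for the download

# Group several commands so their combined output goes to one file.
{
    echo "demo.fa.gz was downloaded on $(date +%Y-%m-%d) with:"
    echo "    wget <URL of the file>   # placeholder: record the real command"
    echo
    echo "SHA-1 checksum:"
    (cd "$workdir" && sha1sum demo.fa.gz)
} > "$workdir/README"

cat "$workdir/README"
```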

UNIX DATA TOOLS

Overview

• Understanding how to use Unix data tools in bioinformatics is not only about learning what each tool does; it is also about mastering the practice of connecting tools together to create programs from Unix pipelines.

• By connecting data tools together with pipes, we can construct programs that parse, manipulate, and summarize data.

• Unix pipelines can be developed in shell scripts or as one-liners (tiny programs built by connecting Unix tools with pipes directly on the shell).
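To make the idea of a one-liner concrete, here is a small sketch that counts how many records each chromosome has in an invented BED-like file. Each tool does one job and the pipes connect them into a tiny program.

```shell
workdir=$(mktemp -d)
printf '#demo BED-like data\nchr1\t10\t20\nchr2\t5\t15\nchr1\t30\t40\n' > "$workdir/demo.bed"

# grep -v drops comment lines, cut keeps column 1,
# sort groups identical values, uniq -c counts each group.
counts=$(grep -v '^#' "$workdir/demo.bed" | cut -f 1 | sort | uniq -c)
echo "$counts"
```

Saving the same pipeline in a shell script, with a comment on what it produces, keeps the result documented.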

• Building more complex programs from small, modular tools capitalizes on the design and philosophy of Unix.

• The pipeline approach to building programs is a well-established tradition in Unix and bioinformatics because it is a fast way to solve problems, incredibly powerful, and adaptable to a variety of problems.

When to use the Unix pipeline approach

• The Unix one-liner approach is not appropriate for all problems. Many bioinformatics tasks are better accomplished through a custom, well-documented script.

• Knowing when to use a fast and simple engineering solution like a Unix pipeline and when to resort to writing a well-documented Python or R script takes experience.

• Unix pipelines:
  • Fast, low-level data manipulation toolkit to explore data, transform data between formats, and inspect data for potential problems.
  • Useful when we want to get a quick answer and keep moving forward with our project.
  • It is essential that everything that produces a result is documented. Storing pipelines in shell scripts is a good approach.

• Custom scripts using Python or R:
  • Useful for larger, more complex tasks, as these allow flexibility in checking input data, structuring programs, using data structures, and documenting code.

Inspecting and manipulating text data

• Many formats in bioinformatics are simple tabular plain-text files delimited by a character.

• The most common tabular plain-text file format used in bioinformatics is tab-delimited because most Unix tools treat tabs as delimiters by default.

• Tab-delimited file formats are also simple to parse with scripting languages like Python and Perl, and easy to load into R.

Tabular plain-text data formats

• The basic format:
  • Each row (known as a record) is kept on its own line
  • Each column (known as a field) is separated by some delimiter

• Three formats:
  • Tab-delimited
  • Comma-separated
  • Variable space-delimited

Tab-delimited

• The most commonly used in bioinformatics (e.g. BED, GTF/GFF, SAM, VCF).

• Columns of a tab-delimited file are separated by a single tab character (the escape code: \t).

• A common convention (not a standard) is to include metadata on the first few lines of a tab-delimited file. These metadata lines begin with #.

• Tabs in data are not allowed.

Comma-separated values (CSV)

• CSV is similar to tab-delimited, except the delimiter is a comma character.

• While not a common occurrence in bioinformatics, it is possible that the data stored in CSV format contain commas. Some variants just do not allow this, while others use quotes around entries that could contain commas.

Variable space-delimited

• In general, tab-delimited formats and CSV are better choices than variable space-delimited formats because it is quite common to encounter data containing spaces.

How lines are separated

• In Linux and OS X: use a single linefeed character (the escape code: \n) to separate lines.

• In Windows: use a DOS-style line separator of a carriage return and a linefeed character (\r\n).

• To convert DOS to Unix text format, use dos2unix.

• To convert Unix to DOS text format, use unix2dos.
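A sketch of converting a DOS-style file, assuming an invented two-line input. Since dos2unix is not installed everywhere, the example falls back to tr, which is a different technique (it deletes every carriage return, not only those at line ends) but gives the same result for ordinary text files.

```shell
workdir=$(mktemp -d)
printf 'chr1\t100\r\nchr2\t200\r\n' > "$workdir/dos.txt"   # DOS-style demo file

if command -v dos2unix > /dev/null 2>&1; then
    # -n writes the converted text to a new file and keeps the original.
    dos2unix -n "$workdir/dos.txt" "$workdir/unix.txt"
else
    # Fallback: strip all carriage returns.
    tr -d '\r' < "$workdir/dos.txt" > "$workdir/unix.txt"
fi

# od -c shows control characters explicitly; no \r should remain.
od -c "$workdir/unix.txt"
```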

Inspecting data with head and tail

• Many files in bioinformatics are much too long to inspect with cat. Running cat on a file a million lines long would quickly fill your shell.

• A better option is to take a look at the top of a file with head.

• We can control how many lines we see.
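For instance, head prints the first 10 lines by default and -n changes that number. The 100-line file below is generated only for the demonstration.

```shell
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.txt"   # demo file: one number per line

head "$workdir/numbers.txt"          # default: first 10 lines

first3=$(head -n 3 "$workdir/numbers.txt")   # only the first 3 lines
echo "$first3"
```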

• tail is designed to look at the end of a file. tail works just like head.
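The same -n option applies to tail, as this short sketch on a generated 100-line file shows:

```shell
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.txt"   # demo file: one number per line

last2=$(tail -n 2 "$workdir/numbers.txt")   # only the last 2 lines
echo "$last2"
```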

• We can also use tail to remove the header of a file. If -n is given a number x preceded by a + sign (e.g. +x), tail will start from the xth line, so tail -n +2 skips a one-line header.
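A minimal sketch of header removal on an invented three-column file:

```shell
workdir=$(mktemp -d)
printf 'chrom\tstart\tend\nchr1\t10\t20\nchr2\t5\t15\n' > "$workdir/demo.txt"

# +2 means "start output at line 2", which drops the one-line header.
body=$(tail -n +2 "$workdir/demo.txt")
echo "$body"
```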

• head is useful for taking a peek at data resulting from a Unix pipeline.

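• We will use grep’s results as the standard input for the next program in our pipeline, but first we want to check grep’s standard output to see if everything looks correct. When head exits, the programs upstream of it receive a SIGPIPE signal the next time they write to the closed pipe and terminate, so the whole pipeline stops early. When building complex pipelines that process large amounts of data, this is important because we do not have to wait for the full dataset to be processed.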
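The early-exit behavior can be seen with an input that never ends. In this sketch, yes stands in for a very large data source; the string it prints is invented.

```shell
# yes prints the same line forever, yet the pipeline finishes:
# once head has printed 3 lines and exited, grep and yes fail on their
# next write to the closed pipe and stop as well.
preview=$(yes 'chr1 demo' | grep 'chr1' | head -n 3)
echo "$preview"
```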

less

• less is a useful program for inspecting files and the output of pipes. It is a terminal pager, a program that allows us to view large amounts of text in our terminals one screen at a time.

• less has more features than, and is generally preferred over, the older terminal pager called more.

Shortcut      Action
Space bar     Next page
b             Previous page
g             First line
G             Last line
j             Down one line at a time
k             Up one line at a time
/<pattern>    Search down for string <pattern>
?<pattern>    Search up for string <pattern>

• less is useful in debugging our command-line pipelines. Just pipe the output of the command you want to debug to less. When you run the pipe, less will capture the output of the last command and pause so you can inspect it.

• less is crucial when iteratively building up a pipeline.

• A useful behavior of pipes is that the execution of a program with output piped to less will be paused when less has a full screen of data. When you pipe a program’s output to less and inspect it, less stops reading input from the pipe. The pipe will block and we can spend as much time as needed to inspect the output.