parsing


Upload: nhung

Post on 04-Jan-2016


Page 1: Parsing

1

Parsing

• Analyze text: split it into meaningful units, tokens

• Extract relevant information, disregard irrelevant information

• ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?

Page 2: Parsing

2

Blast

• Program package for finding similarities between biological sequences

• blastn compares DNA sequences with DNA sequences

• Input:
  – Fasta file with query sequences
  – Formatted Fasta file with database sequences
  – Sensitivity parameter (and more)

• Output:
  – Result of comparing each query to each database sequence

Page 3: Parsing

3

Example run

Query file: arachis.fasta
Database file: arabidopsis_nucleotides.fasta

Format the database: formatdb -i arabidopsis.fasta -p F -o T

Command:

/users/chili/usr/blast-2.2.13/bin/blastall -p blastn -e 0.000000002 -d arabidopsis.fasta -i arachis.fasta -o arachis_arab.bn

Page 4: Parsing

4

Example output – query with no match

BLASTN 2.2.6 [Apr-09-2003]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

..

Page 5: Parsing

5

Example output – query with matches

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

                                                       Score    E
Sequences producing significant alignments:            (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
gb|AV519674.1|AV519674 AV519674 Arabidopsis               68   1e-09
gb|AV557401.1|AV557401 AV557401 Arabidopsis               42   3e-05
gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis        43   1e-04

..

Page 6: Parsing

6

Example output – match alignment

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
          Length = 1009

 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..

General form of output:

Repetitions of (query, subject matches, alignments)

Page 7: Parsing

7

Extract information from blast output

• Extract the best hit for each query sequence

Query= CL69Contig1 (372 letters)..

                                                       Score    E
Sequences producing significant alignments:            (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
..

Page 8: Parsing

8

Algorithm

• Read blast output file line by line

• Introduce two states:
  1. Looking for next query
  2. Looking for hit list

• Return a dictionary mapping each query to its best hit
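The algorithm above can be sketched roughly as follows. This is not the blastparser.py shown on the later slides (its code is not part of this transcript); the names best_hits and read_best_hit and the shortened SAMPLE text are assumptions made for illustration, and the sketch assumes the output layout shown in the example slides.

```python
def read_best_hit(lines):
    """Consume lines after 'Searching' and return the first hit ID, or 'none'."""
    header_seen = False
    for line in lines:
        text = line.strip()
        if "No hits found" in text:            # case B: no hits
            return "none"
        if text.startswith("Sequences producing"):
            header_seen = True                 # case A: the hit list follows
        elif header_seen and text:
            return text.split()[0]             # first hit line is the best hit
    return "none"


def best_hits(lines):
    """Return a dictionary mapping each query ID to its best hit ID."""
    result = {}
    looking_for_query = True                   # state 1; False means state 2
    query_id = None
    lines = iter(lines)                        # shared with read_best_hit
    for line in lines:
        if looking_for_query:
            if line.startswith("Query="):      # the '=' matters, see slide 10
                query_id = line.split()[1]
                looking_for_query = False
        elif line.startswith("Searching"):
            result[query_id] = read_best_hit(lines)
            looking_for_query = True
    return result


# Shortened, hypothetical excerpt in the layout of the example output.
SAMPLE = """\
Query= CL69Contig1 (372 letters)

Searching..................................................done

                                                       Score    E
Sequences producing significant alignments:            (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10

Query: 67  gagctattaacaggtaagg 85

Query= CL5Contig1 (797 letters)

Searching..................................................done

 ***** No hits found ******
"""

hits = best_hits(SAMPLE.splitlines())
```

Because best_hits accepts any iterable of lines, an open file handle could be passed in directly so the output is read line by line.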

Page 9: Parsing

9

First state: Looking for next query

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
..

Look for a line starting with

Query=

(the = is important!)

Page 10: Parsing

10

Why we look for Query= and not just Query

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
          Length = 1009

 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus

Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..
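A small illustration of the point, using lines taken from the example output: a test for the prefix Query also matches the alignment lines, while Query= matches only the header of a new query. The variable names are chosen for this sketch.

```python
lines = [
    "Query= CL69Contig1 (372 letters)",
    "Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117",
    "Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826",
]

# Too loose: also picks up the 'Query:' line inside an alignment.
loose = [line for line in lines if line.startswith("Query")]

# Strict: only the line that introduces a new query.
strict = [line for line in lines if line.startswith("Query=")]
```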

Page 11: Parsing

11

Second state: Looking for hit list

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
..

Case A: hits were found

Page 12: Parsing

12

Case B: no hits were found

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
..

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

Page 13: Parsing

13

Second state: Looking for hit list

Query= CL69Contig1 (372 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis         70   3e-10
..

Query= CL5Contig1 (797 letters)

Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

Look for a line starting with Searching, then read a few more lines to distinguish case A from case B.


Page 14: Parsing

14

blastparser.py (part 1)

Find the query ID:

Query= CL69Contig1
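The code from this slide is not preserved in the transcript. As a rough sketch of one way to pull the query ID out of such a line, a regular expression can capture the ID and, when present, the sequence length. The names QUERY_RE and parse_query_line are hypothetical.

```python
import re

# 'Query=' followed by the ID, optionally followed by '(N letters)'.
QUERY_RE = re.compile(r"Query=\s*(\S+)(?:.*\(([\d,]+) letters\))?")


def parse_query_line(line):
    """Return (query_id, length) for a 'Query=' line, else None."""
    match = QUERY_RE.match(line)
    if match is None:
        return None
    length = match.group(2)
    if length is not None:
        length = int(length.replace(",", ""))  # tolerate '1,004' style numbers
    return match.group(1), length
```

For the line Query= CL69Contig1 (372 letters) this yields the ID CL69Contig1 and the length 372; a line from an alignment, which starts with Query: instead, yields None.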

Page 15: Parsing

15

blastparser.py (part 2)

Find the best match ID:

Searching..................................................done

                                                       Score    E
Sequences producing significant alignments:            (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

Find the best match ID:

Searching..................................................done

***** No hits found ******
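Once a hit line has been found, its fields can be separated by splitting from the right, since the description in the middle may contain spaces. The following is only a sketch under that assumption; parse_hit_line is a hypothetical helper, not code from the slides.

```python
def parse_hit_line(line):
    """Split a hit-list line into (hit_id, score_in_bits, e_value)."""
    # The last two whitespace-separated fields are the score and the E value.
    description, score, e_value = line.rsplit(None, 2)
    hit_id = description.split()[0]
    if e_value.startswith("e"):
        # Assumption: very small E values may be printed without a leading
        # digit (for example 'e-100'), which float() would reject.
        e_value = "1" + e_value
    return hit_id, float(score), float(e_value)


best = parse_hit_line(
    "gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis         70   3e-10"
)
```

Here best holds the ID gb|AV791675.1|AV791675 together with the score 70.0 and the E value 3e-10.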

Page 16: Parsing

16

Test

>>> from blastparser import parseBlastallOutput
>>> d = parseBlastallOutput("arachis_arab.bn")
>>> d["gi|30419745"]
'gb|BP625785.1|BP625785'
>>> d["gi|30419753"]
'none'

Page 17: Parsing

25

Evolutionary tree of life (animal kingdom)

• Huge hierarchy of groups and subgroups

• Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)

Two-pass parsing

Source: The origin and evolution of model organisms, Nature Genetics, Nov. 2002, vol. 3.

Page 18: Parsing

26

Abstract data structure to represent a general tree (not necessarily binary)

tree.py
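The tree.py listing itself is not preserved in this transcript. A minimal sketch of such a data structure, assuming each node stores a label, a list of sons, and a reference to its father, might look like this (the class and method names are assumptions):

```python
class Tree:
    """A node with a label and a (possibly empty) list of sons."""

    def __init__(self, label, sons=None):
        self.label = label
        self.father = None
        self.sons = []
        for son in sons or []:
            self.add_son(son)

    def add_son(self, son):
        """Attach son as the last son of this node."""
        son.father = self
        self.sons.append(son)

    def is_leaf(self):
        return not self.sons

    def size(self):
        """Number of nodes in the tree rooted at this node."""
        return 1 + sum(son.size() for son in self.sons)


beetles = Tree("Beetles", [Tree("C"), Tree("D"), Tree("E")])
insects = Tree("Insects", [beetles, Tree("Flies")])
```

Any number of sons is allowed per node, which is what distinguishes this from a binary tree.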

Page 19: Parsing

27

How can we write a tree to a sequential file?

(File format should be readable by other systems, so we can’t use cPickle)

– A tree is a labeled node containing a (possibly empty) list of other (sub)trees

– Write tree node using start and end tags: <N="Insects"> [sons] </N>

• Formally (context-free grammar):

T → <N="L">S</N>
S → λ | TS
L → string label

[Figure: example tree. The root Insects has sons Beetles and Flies; the leaves are labeled A to E.]

Page 20: Parsing

28

Recursive method: string representation of tree

tree.py

First obtain the string representation of the sons (empty string if there are no sons) by calling the function recursively, then create a string with start tag, label, sons' representation, and end tag.


.. <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N> ..
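The recursive method described on this slide can be sketched as below. The minimal Tree class and the function name to_string are assumptions; the slide's own tree.py code is not part of the transcript.

```python
class Tree:
    """Minimal stand-in for a tree node: a label and a list of sons."""

    def __init__(self, label, sons=None):
        self.label = label
        self.sons = list(sons) if sons else []


def to_string(tree):
    """Return the tagged string representation of tree."""
    # First the sons, recursively; a leaf contributes the empty string.
    sons_repr = "".join(to_string(son) for son in tree.sons)
    # Then wrap them in this node's start tag (with label) and end tag.
    return '<N="%s">%s</N>' % (tree.label, sons_repr)


beetles = Tree("Beetles", [Tree("C"), Tree("D"), Tree("E")])
text = to_string(beetles)
# text == '<N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N>'
```

The recursion follows the grammar directly: each call emits one T, and the join over the sons emits S.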

Page 21: Parsing

29

Larger tree – How can we read a tree from a sequential file?

<N="Terrestrial vertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N>

We need a parser!

part_of_the_tree_of_life.txt

Page 22: Parsing

30

Two-pass parsing

Complex parsing is often split in two passes:

1. Lexical analysis
   • Identify and assemble tokens: logical units of text

2. Structural analysis
   • Determine the structural hierarchy of the tokens

In our case, the tokens are the two kinds of tag: start tags such as <N="Kangaroo"> and the end tag </N>.

Page 23: Parsing

31

Lexical analysis

phylogenyparser.py (part 1)

Match either a start tag or an end tag

Define a group containing the start tag’s label

Search text from index pointer

Create token of right type

Move index pointer
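The annotations above describe code that is not preserved in this transcript. A sketch that follows the same five steps, assuming tokens are represented as (kind, label) pairs, could look like this; TAG_RE and tokenize are hypothetical names.

```python
import re

# Match either a start tag or an end tag; group 1 holds the start tag's label.
TAG_RE = re.compile(r'<N="([^"]*)">|</N>')


def tokenize(text):
    """Return the list of tag tokens found in text, in order."""
    tokens = []
    index = 0
    while True:
        match = TAG_RE.search(text, index)     # search from the index pointer
        if match is None:
            break
        if match.group(1) is not None:         # the start-tag branch matched
            tokens.append(("start", match.group(1)))
        else:                                  # the end-tag branch matched
            tokens.append(("end", None))
        index = match.end()                    # move the index pointer
    return tokens


tokens = tokenize('<N="Kangaroo"></N><N="Koala"></N>')
# tokens == [("start", "Kangaroo"), ("end", None),
#            ("start", "Koala"), ("end", None)]
```

Text between tags is simply skipped by the search, so this pass checks nothing about nesting; that is left to the structural pass.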

Page 24: Parsing

32

Structural analysis

phylogenyparser.py (part 2)

[Figure: structural analysis of the fragment .. <N="Kangaroo"></N><N="Koala"></N> .., with numbered steps 1 to 3 showing how current_node and new_node change as each tag is processed]

Real root will be first son of this node
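One plausible reading of the figure, sketched under assumptions: a start tag creates new_node, attaches it as a son of current_node, and then makes it the current node, while an end tag moves current_node back up to its father. A dummy node sits above the real root, which ends up as its first son. The Node class, build_tree, and the (kind, label) token format are hypothetical.

```python
class Node:
    """Tree node with a label, a father reference, and a list of sons."""

    def __init__(self, label, father=None):
        self.label = label
        self.father = father
        self.sons = []


def build_tree(tokens):
    """Build a tree from ('start', label) and ('end', None) tokens."""
    dummy = Node("dummy")                  # real root becomes its first son
    current_node = dummy
    for kind, label in tokens:
        if kind == "start":
            new_node = Node(label, father=current_node)
            current_node.sons.append(new_node)
            current_node = new_node        # descend into the new node
        else:
            if current_node is dummy:
                raise ValueError("end tag without matching start tag")
            current_node = current_node.father   # climb back up
    if current_node is not dummy or len(dummy.sons) != 1:
        raise ValueError("tags do not describe exactly one tree")
    root = dummy.sons[0]
    root.father = None                     # detach the real root
    return root


root = build_tree([
    ("start", "Marsupialia"),
    ("start", "Kangaroo"), ("end", None),
    ("start", "Koala"), ("end", None),
    ("end", None),
])
```

The dummy node removes the special case of the very first start tag, which otherwise would have no current node to attach to.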

Page 25: Parsing

33

[Figure: the parsed tree drawn as a diagram. The root Terrestrial vertebrates has sons Synapsida and Reptilia; the leaves are Kangaroo, Koala, Human, Gorilla, Chimpanzee, Walrus, Wolf, Elephant, Tyrannosaurus, Penguin, Owl, Lizard, Snake, and Turtle.]

Page 26: Parsing

34

phylogenyparser.py

Test program

Page 27: Parsing

35

Navigating in the tree

Name: Diapsida
Father: Reptilia
Siblings: Testudines
Sons: Archosauromorpha Lepidosauromorpha

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b
Number of sibling (0-0)? 0

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p
<N="Testudines"><N="Turtle"></N></N>

Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle

(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f

Name: Reptilia
Father: Terrestrial vertebrates
Siblings: Synapsida
Sons: Diapsida Testudines
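The four-line summary printed at each step of the session could be produced by something like the sketch below. It assumes each node keeps a reference to its father; Node, siblings, and describe are hypothetical names, and the interactive menu itself is left out.

```python
class Node:
    """Tree node with a label, a father reference, and a list of sons."""

    def __init__(self, label, father=None):
        self.label = label
        self.father = father
        self.sons = []
        if father is not None:
            father.sons.append(self)


def siblings(node):
    """All other sons of this node's father (empty for the root)."""
    if node.father is None:
        return []
    return [son for son in node.father.sons if son is not node]


def describe(node):
    """Return the Name/Father/Siblings/Sons summary for node."""
    father = node.father.label if node.father is not None else ""
    return "\n".join([
        "Name: " + node.label,
        "Father: " + father,
        "Siblings: " + " ".join(s.label for s in siblings(node)),
        "Sons: " + " ".join(s.label for s in node.sons),
    ])


reptilia = Node("Reptilia")
diapsida = Node("Diapsida", reptilia)
testudines = Node("Testudines", reptilia)
Node("Archosauromorpha", diapsida)
Node("Lepidosauromorpha", diapsida)
Node("Turtle", testudines)
summary = describe(diapsida)
```

Moving to a sibling, son, or father then amounts to replacing the current node with an entry of siblings(node), node.sons, or node.father.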

[Figure: the Reptilia subtree, with sons Diapsida (Archosauromorpha, Lepidosauromorpha) and Testudines (Turtle)]

Page 28: Parsing

36

.. on to the exercises