www.sciencesignaling.org/cgi/content/full/4/202/ra83/DC1

Supplementary Materials for

The SH2 Domain–Containing Proteins in 21 Species Establish the Provenance and Scope of Phosphotyrosine Signaling in Eukaryotes

Bernard A. Liu, Eshana Shah, Karl Jablonowski, Andrew Stergachis, Brett Engelmann,

Piers D. Nash*

*To whom correspondence should be addressed. E-mail: [email protected]

Published 6 December 2011, Sci. Signal. 4, ra83 (2011) DOI: 10.1126/scisignal.2002105

This PDF file includes:

Section S1. SH2 domain proteins in organisms with incomplete genomes.
Section S2. Analysis of the intron/exon code of human SH2 domains.
Section S3. Comparative analysis of ortholog and paralog predictions.
Fig. S1. An evolutionary time line for the organisms represented in this study.
Fig. S2. Evolutionary expansion of SH2 domains, tyrosine kinases, and PDZ domains.
Fig. S3. Splice site positions within the protein sequence alignment of human SH2 domains.
Fig. S4. The bead on a string representation for protein domains.
Fig. S5. Gene duplication and loss within SH2 domain families.
Fig. S6. Clustal alignments of the GRB2 and CRK families.
Fig. S7. Insertion of an intramolecular phosphorylation site for autoinhibition of CRK.
Fig. S8. Evolving new SH2 interactions through novel pTyr sites.
Fig. S9. Conservation of the pTyr ligand-binding pocket in SH2 domains.
Fig. S10. Tissue expression of human SH2 domain proteins.
Table S1. A complete list of organisms in this study.
Table S2. SH2 domains in Bikonta and Amoebozoa.
Table S5. Classification of SH2 domain family divergence.
Table S6. Defining SH2 families with Ensembl paralog predictions versus family organization.
Table S7. Ortholog predictions of human SH2 proteins to proteins in lower organisms.
Table S8. Ortholog predictions from C. elegans to H. sapiens.
References

Other Supplementary Material for this manuscript includes the following: (available at www.sciencesignaling.org/cgi/content/full/4/202/ra83/DC1)

Table S3. Comprehensive list of SH2 domain–containing proteins in Eukaryotes (Excel file).
Table S4. Comprehensive list of SH2 domain–containing proteins in organisms with incomplete genome annotations (Excel file).

Supplementary Text

Section S1. SH2 domain proteins in organisms with incomplete genomes

Several recently sequenced genomes are either incomplete or poorly annotated and, therefore, were not included in the original list of 21 organisms. However, these organisms provide additional insight into the evolution of Metazoa and afford a more complete picture of the evolution of phosphotyrosine (pTyr) signaling. To this end, seven additional genomes were examined: the choanoflagellate Monosiga ovata, the hydrozoan Hydra magnipapillata, the placozoan Trichoplax adhaerens, the sponges Suberites domuncula and Ephydatia fluviatilis, the amphioxus (lancelet) Branchiostoma floridae, and the African clawed frog Xenopus laevis. A complete list of SH2 domain proteins from these organisms and the SH2 family to which these proteins belong can be found in table S4.

The choanoflagellate Monosiga brevicollis is the first organism to contain the complete pTyr circuit with an extensive repertoire of protein tyrosine kinases (PTKs), protein tyrosine phosphatases (PTPs), and SH2 domain proteins (1). Here, we compared the genome of another choanoflagellate, Monosiga ovata. Although the genome remains incomplete, nine SH2 domain proteins are present in M. ovata (table S4). The difference between M. ovata and M. brevicollis in SH2 protein complement is large and suggests substantial variation within this lineage: M. brevicollis contains 112 SH2 domain proteins compared to the 9 in M. ovata. The nine proteins found in M. ovata all possess human orthologs and fall into one of the 38 SH2 families. As the genome of M. ovata becomes more completely annotated and other choanoflagellate genomes are sequenced, we may better understand whether the large expansion of SH2 domain-containing proteins in M. brevicollis is specific to this species or a more general phenomenon within the choanoflagellates. Such studies will presumably also provide a better understanding of the early expansion of pTyr signaling.

In the early metazoan divergence, the phylogenetic positions of several organisms remain unresolved. We analyzed three organisms that split prior to Eumetazoa, two of which belong to the phylum Porifera (sponges): Suberites domuncula and Ephydatia fluviatilis. In the freshwater sponge Ephydatia fluviatilis, 11 proteins possess SH2 domains whereas, to date, Suberites domuncula has only 6. The completion of the genome of the sponge Amphimedon queenslandica will provide further insight into pTyr signaling in Metazoa (2). The placozoan Trichoplax adhaerens is a simple free-living animal that may belong to the Eumetazoa clade or have an earlier origin (3). In Trichoplax, we found close to 27 SH2 domain proteins, nearly all of which have close homology to human SH2 domain proteins. Our analysis suggests that Trichoplax adhaerens has an SH2 protein complement similar to that of cnidarians, such as Nematostella vectensis and Hydra magnipapillata. Although this is a limited data set, it supports the hypothesis that placozoans reside within the eumetazoan clade.

The phylum Cnidaria consists of 5 classes that comprise a diverse set of organisms with various forms and functions (4). We analyzed two cnidarians, the anthozoan (sea anemone) Nematostella vectensis and the hydrozoan (hydra) Hydra magnipapillata. Although Hydra and Nematostella are very different types of cnidarians and have genomes of different sizes, their content of SH2 domain proteins is similar.

The amphioxus genome provides a closer look at the origins of vertebrates and early chordate evolution. Therefore, we examined the genome of the amphioxus Branchiostoma floridae (5). Branchiostoma lies within the Cephalochordata subphylum of Chordata, falling between the urochordate Ciona intestinalis and the vertebrate Danio rerio in this analysis. Although the genome of Branchiostoma is only partially annotated, we identified 34 SH2-containing proteins (table S4), only marginally fewer than the total in Ciona intestinalis or Strongylocentrotus purpuratus, which have 54 or 53, respectively (Fig. 1C). Branchiostoma SH2 domain proteins exhibit several unique domain combinations. We find SH2 domains linked to death domains, IQ, DNAJ, FN3, WW, and other domains in combinations that do not appear in humans (table S4). As Branchiostoma becomes fully annotated, this organism will provide insight into the diversity of domain combinations in SH2 domain proteins and help us to understand how SH2 domains may integrate into different cellular systems.

Section S2. Analysis of the intron/exon code of human SH2 domains

Our splice site analysis of the 121 human SH2 domains present in 111 human proteins shows that 114 contain at least one intron within the known SH2 domain secondary structure, extending from the βA to the βG strand (fig. S3, B and C; see fig. S3A for a view of the secondary structure elements of the SH2 domain). A previous report found that 9 of 30 SH2 domains contain the βC7 splice site (6). Consistent with this, we observed that 40 of 120 SH2 domains contain a phase 2 βC7 splice site. Another commonly conserved splice site, βA0, is present at the N terminus of the classically defined SH2 domain. The βA0 splice junction is found in 43 of the 121 SH2 domains, with approximately half of these also containing the βC7 splice site. The observation that SH2 domains, such as those found in Grb2 and Shb, share conserved splice sites may indicate a shared evolutionary heritage that is apparent in their gene structure (fig. 3C).
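The splice-site phases and proportions discussed in this section can be illustrated with a minimal Python sketch. It uses the standard definition of intron phase (the number of coding nucleotides of a codon that precede the intron) and the counts quoted in the text. The function names are hypothetical and the sketch does not reproduce the analysis pipeline used in this study.

```python
def intron_phase(cds_offset: int) -> int:
    """Phase of an intron that follows `cds_offset` coding nucleotides.

    Phase 0: the intron lies between two codons.
    Phase 1: the intron follows the first nucleotide of a codon.
    Phase 2: the intron follows the second nucleotide of a codon.
    """
    if cds_offset < 0:
        raise ValueError("cds_offset must be non-negative")
    return cds_offset % 3


def junction_residue(cds_offset: int) -> int:
    """1-based index of the residue whose codon is split by the junction
    (phase 1 or 2) or that immediately follows it (phase 0)."""
    if cds_offset < 0:
        raise ValueError("cds_offset must be non-negative")
    return cds_offset // 3 + 1


def fraction(count: int, total: int) -> float:
    """Proportion of SH2 domains carrying a given splice site."""
    return count / total


# Counts quoted in the text for the betaC7 and betaA0 splice sites.
previous_bc7 = fraction(9, 30)    # earlier report: 9 of 30 domains
observed_bc7 = fraction(40, 120)  # this analysis: 40 of 120 domains
observed_ba0 = fraction(43, 121)  # this analysis: 43 of 121 domains
```

Under these counts the βC7 site occurs in 30% of domains in the earlier report and in about 33% here, which is the sense in which the two observations agree.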

Section S3. Comparative analysis of ortholog and paralog predictions

The proper identification of orthologs and paralogs is critical in defining evolutionary relationships between closely related family members. Multiple approaches and tools have been developed to predict orthologs (Inparanoid and Ensembl) and paralogs (Ensembl) across multiple organisms (7-9). By comparing these different approaches, one can assess the benefits of using the different methods for accurate curation of orthologs and paralogs of SH2 proteins. In our analysis, SH2 families are a defined classification of human paralogs, which can be traced back in origin through ortholog predictions. Here, we determined members of a family using a hierarchical method of sequence alignments, intron-exon boundaries, and domain organization. To determine how our method compares to Ensembl’s paralog prediction approaches, we compared our list of 38 families and members within each family to Ensembl’s paralog algorithm. Using Ensembl’s paralog prediction algorithm we listed all proteins that were predicted as paralogs to each SH2 domain protein (table S6).

The Ensembl paralog prediction was successful at accurately predicting 16 of the 38 SH2 domain families, such as CBL, CRK, DAPP1, and SHC. However, several families, including ABL, TEC, FRK, and SRC, were clustered into the same group by Ensembl. This is likely due to shared domain organizations between these four families. Using our hierarchical method, these proteins were divided into select families based upon sequence alignments and intron/exon splicing patterns. Other families, such as CHN, GRB7, PLCG, and VAV, were predicted to include proteins that contain SH2 domains and other proteins that lack an SH2 domain. Therefore, using paralog predictions by Ensembl requires additional levels of filtering to remove proteins lacking SH2 domains and to further divide proteins using additional approaches, such as intron/exon splicing or sequence alignments.
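The family-by-family comparison described above can be sketched as a set comparison. The following minimal Python example uses three families with memberships as listed in table S6. The function and variable names are hypothetical, and the sketch only shows the bookkeeping, not the curation method itself.

```python
# Family membership from the hierarchical method (subset of table S6).
curated = {
    "CBL": {"Cbl", "CblB", "CblC"},
    "SH2D2": {"Sh2d2a", "Hsh2", "Sh2d7"},
    "GRB7": {"Grb7", "Grb10", "Grb14"},
}

# Ensembl paralog predictions for the same families (subset of table S6).
ensembl = {
    "CBL": {"Cbl", "CblB", "CblC"},
    "SH2D2": {"Sh2d2a", "Hsh2", "Sh2d4a", "Sh2d4b"},
    "GRB7": {"Grb7", "Grb10", "Grb14", "Raph1", "Apbb1ip"},
}


def compare_family(curated_members, predicted_members):
    """Classify a prediction as exact, or list the extra and missed members."""
    extra = predicted_members - curated_members    # predicted but not curated
    missed = curated_members - predicted_members   # curated but not predicted
    return {
        "exact": not extra and not missed,
        "extra": sorted(extra),
        "missed": sorted(missed),
    }


report = {family: compare_family(curated[family], ensembl[family])
          for family in curated}

for family, result in report.items():
    print(family, result)
```

Extra members flag candidates for filtering (for example, proteins lacking an SH2 domain), whereas missed members flag paralogs that the prediction did not recover.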

Both Ensembl and InParanoid can be used to predict orthologs across multiple organisms. In this study, we determined that the BLAST algorithm combined with matching domain organization was more reliable than Ensembl or InParanoid. Here, we performed two types of searches using the three methods. The first was a search for orthologs in lower organisms, such as the opossum, zebrafish, fruit fly, and roundworm, using the human protein as the search query (table S7). The second was a search for human orthologs of a particular C. elegans protein (table S8). These tests demonstrated that in multiple instances both Ensembl and InParanoid failed to detect orthologs in lower organisms (Gads, Grap) or poorly predicted orthologs (FRK). Gads, Grap, and Grb2 belong to the GRB2 family. Whereas BLAST, Ensembl, and InParanoid all predicted sem-5 as the C. elegans ortholog of human Grb2, Ensembl and InParanoid failed to identify sem-5 as an ortholog of Gads and Grap. Furthermore, Ensembl's ortholog algorithm in many instances considered SH2 proteins from other families to be orthologs. This is likely due to the clustering of multiple families into the same paralog group. Benchmarking of these three approaches suggests that BLAST provides an accurate curation of orthologs. Although both Ensembl and InParanoid are powerful tools, they occasionally miss orthologs of genes with duplicate copies. In addition, several genes with multiple paralogs were inaccurately predicted with these two approaches.
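The idea of combining BLAST hits with matching domain organization can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record, the `pick_ortholog` function, the candidate names, and the bit scores are all hypothetical, and the text does not specify score thresholds or tie-breaking rules. The query architecture in the example is the SH3-SH2-SH3 arrangement of the GRB2 family.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Hit(NamedTuple):
    subject: str               # candidate protein in the target organism
    bit_score: float           # BLAST bit score (higher is better)
    domains: Tuple[str, ...]   # ordered domain organization of the candidate


def pick_ortholog(query_domains: Sequence[str],
                  hits: Sequence[Hit]) -> Optional[Hit]:
    """Return the best-scoring hit whose domain organization matches the query.

    Hits with a different domain organization are discarded even if they
    score higher. Returns None when no hit matches.
    """
    matching = [h for h in hits if h.domains == tuple(query_domains)]
    if not matching:
        return None
    return max(matching, key=lambda h: h.bit_score)


# Hypothetical hits for an SH3-SH2-SH3 query; scores are made up.
hits = [
    Hit("candidate_A", 210.0, ("SH3", "SH2", "SH3")),
    Hit("candidate_B", 250.0, ("SH2", "SH3")),
    Hit("candidate_C", 95.0, ("SH3", "SH2", "SH3")),
]

best = pick_ortholog(("SH3", "SH2", "SH3"), hits)
print(best.subject if best else "N.P.")
```

In this toy case the highest-scoring hit is rejected because its domain organization differs from that of the query, which is the kind of filtering that sequence similarity alone would not provide.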

Supplemental Table Legends for tables supplied as Excel files

Table S3. Comprehensive list of SH2 domain-containing proteins in Eukaryotes. An Excel file in which each organism is detailed on a separate sheet listing the known SH2 domain-containing proteins. Information on gene name, gene ID, protein ID (accession number), alias, chromosome location (genomic position), domains, sequence, family, SH2 sequence, and the closest SH2 homolog is provided. Triple asterisks (***) denote information that remains undetermined.

Table S4. SH2 domain-containing proteins in organisms with incomplete genome annotations. Seven organisms that contain SH2 domain proteins, spanning Eukaryota, with genomes that are incomplete and insufficiently annotated include the choanoflagellate (Monosiga ovata), sponges (Suberites domuncula and Ephydatia fluviatilis), trichoplax (Trichoplax adhaerens), hydra (Hydra magnipapillata), Florida lancelet (Branchiostoma floridae), and African clawed frog (Xenopus laevis). Each organism (including Taxonomy ID) is represented as a separate tabbed table listing all SH2 domain-containing proteins identified to date. In each case the gene name, gene ID, protein ID (accession number), alias, chromosome location (genomic position), domains, sequence, family, SH2 sequence, and the closest SH2 homolog are provided. Triple asterisks (***) denote information that remains undetermined.

Figure S1. An evolutionary time line for the organisms represented in this study. The approximate divergence times, in millions of years ago (MYA), are indicated for the select model organisms shown (10). Branch lengths are not strictly proportional to time.

Figure S2. Evolutionary expansion of SH2 domains, tyrosine kinases, and PDZ domains. (A) The total number of genes in metazoan organisms that contain a tyrosine kinase or an SH2 domain graphed alongside genes that contain both an SH2 domain and a tyrosine kinase. (B) When genes containing both an SH2 domain and a tyrosine kinase were removed from the analysis, the rate of expansion remained the same with the exception of C. elegans. (C) Genes encoding tyrosine kinases and SH2 domains were compared to the number of genes that contain PDZ domains. (D and E) The percentage of genes encoding PDZ domains was plotted against the percentage of genes encoding SH2 domains (D) or tyrosine kinases (E) in the respective genomes. The Pearson correlations of genes with PDZ domains to genes with SH2 domains and tyrosine kinases are 0.77 and 0.64, respectively.
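For readers who wish to reproduce this kind of comparison, the Pearson correlation between two series of per-genome percentages can be computed as in the minimal Python sketch below. The function names are hypothetical and the input values are toy numbers chosen for illustration. They are not the data underlying panels D and E, so the sketch does not reproduce the reported coefficients of 0.77 and 0.64.

```python
from math import sqrt


def percent_of_genes(domain_genes: int, total_genes: int) -> float:
    """Percentage of genes in a genome that encode a given domain."""
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return 100.0 * domain_genes / total_genes


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy percentages for four hypothetical genomes (illustrative only).
pdz_percent = [0.2, 0.4, 0.6, 0.8]
sh2_percent = [0.1, 0.2, 0.3, 0.4]

r = pearson_r(pdz_percent, sh2_percent)  # perfectly linear toy data
```

Expressing each domain count as a percentage of the genes in its genome, as in panels D and E, keeps genomes of very different sizes on a comparable scale before the correlation is taken.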

Figure S3. Analysis of intron and exon splice sites of human SH2 domains. (A) The secondary structure elements (α, alpha helix; β, beta sheet) of the SH2 domain of SAP bound to a peptide ligand (red) are labeled (PDB: 1M27) (11).

Figure S3. (B) Splice site positions within the protein sequence alignment of human SH2 domains. The ClustalW alignment of all human SH2 domains was derived previously (12). The intron/exon splice sites, indicated with a black line (Phase 0), blue circle (Phase 1), or red circle (Phase 2), were collected from SMART (http://smart.embl-heidelberg.de) (13) and Ensembl (http://www.ensembl.org) (8). Splice sites were used to determine relationships between SH2 domains and to assist in the classification of SH2 families. The splice pattern for SH2D7 is not shown; it is identical to that of SH2D2A and HSH2D. SH2 domains that do not contain splice sites within the boundaries of the SH2 domain (light blue shade) are indicated with an asterisk (*). Amino acids on a highlighted color background represent residues with greater than 60% sequence identity. The colors indicate defined groups of amino acids determined by the Multiple Alignment Editor: green for bulky hydrophobic amino acids (W, I, L), yellow for glycine (G), blue for alanine (A), red for charged amino acids (R, E), light blue for aromatic amino acids (F, Y), grey for polar amino acids (S, T), and purple for histidine (H).
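The 60% identity cutoff used for highlighting can be illustrated with a minimal Python sketch that computes, for each alignment column, the percentage of sequences sharing the most common residue. This is a sketch under stated assumptions: the function names are hypothetical, gaps are counted against conservation (the denominator is the number of sequences), and the toy sequences are made up rather than taken from real SH2 domains. The alignment editor used for the figure may apply different rules.

```python
from collections import Counter


def column_identity(alignment, gap="-"):
    """For each column, return (most common residue, percent of sequences with it)."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residues = [c for c in column if c != gap]
        if not residues:
            result.append((gap, 0.0))  # column made only of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, 100.0 * count / len(column)))
    return result


def conserved_columns(alignment, threshold=60.0):
    """0-based indices of columns with identity strictly above the threshold."""
    return [i for i, (_, pct) in enumerate(column_identity(alignment))
            if pct > threshold]


# Toy alignment of five made-up sequences (not real SH2 sequences).
toy_alignment = ["WYHG", "WFHG", "WYKG", "WY-A", "WYHG"]
highlighted = conserved_columns(toy_alignment)
```

In the toy alignment the third column has exactly 60% identity and is therefore not highlighted, because the cutoff is "greater than 60%".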

Figure S3. (C) Hierarchical clustering of the human intron-exon splice sites. The intron-exon splice site boundary positions from 120 SH2 domains overlaid onto protein structural predictions were organized by hierarchical clustering and represented in barcode form with the boundary positions indicated along the secondary structure (α, alpha helix; β, beta sheet) of the SH2 domain (see fig. S3A). The splice site colors correspond to the different phases of the splice sites (Phase 0, dark blue; Phase 1, light blue; Phase 2, red).
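The barcode clustering can be sketched in a few lines of Python. The legend does not state which distance measure or linkage rule was used, so the choices here (Jaccard distance between sets of position and phase pairs, average linkage, and a distance cutoff) are assumptions made for illustration. The domain names, the `alphaB3` position label, and all function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two sets of (position, phase) splice sites."""
    if not a and not b:
        return 0.0  # two intron-free domains are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def average_linkage(c1, c2, barcodes):
    """Mean pairwise distance between the members of two clusters."""
    dists = [jaccard_distance(barcodes[x], barcodes[y]) for x in c1 for y in c2]
    return sum(dists) / len(dists)


def cluster_barcodes(barcodes, max_distance):
    """Agglomerative clustering: merge the closest pair of clusters until
    the closest pair is farther apart than `max_distance`."""
    clusters = [[name] for name in sorted(barcodes)]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], barcodes),
        )
        if average_linkage(clusters[i], clusters[j], barcodes) > max_distance:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Toy barcodes: each domain maps to its set of (position, phase) splice sites.
toy_barcodes = {
    "domain_1": {("betaA0", 0), ("betaC7", 2)},
    "domain_2": {("betaA0", 0), ("betaC7", 2)},
    "domain_3": {("betaC7", 2)},
    "domain_4": {("alphaB3", 1)},
}

groups = cluster_barcodes(toy_barcodes, max_distance=0.6)
```

With this cutoff the three toy domains that share the βC7 site group together, while the domain with an unrelated splice site stays separate.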


Figure S4. The bead on a string representation for protein domains. The full name of each protein domain is listed to the right of the bead.


Figure S5. Gene duplication and loss within SH2 domain families. The SH2 domain families are indicated with the family name and domain organization listed on the left. The families are organized according to the species in which they were first identified (bottom = earliest; top = more recent). The colors on the left correspond to the evolutionary time frame found in Fig. 3B. Listed on the right is the total number of gene copies present in each organism for a particular SH2 family.

Figure S6. Clustal alignments of the GRB2 and CRK families. (A) Clustal alignment of the GRB2 family. The protein sequences of the GRB2 family from Homo sapiens (HsGRB2, HsGRAP, HsGADS), Danio rerio (DrGrb2), Drosophila melanogaster (drk), and Caenorhabditis elegans (sem-5) were aligned using ClustalX. The colored numbers indicate the splice sites identified using Ensembl and SMART. The SH2 domain is highlighted in light blue. Human Gads (HsGADS) contains an extended protein segment (positions 160-275), inserted through intron/exon shuffling, between the SH2 domain and the second SH3 domain. (B) Clustal alignment of the CRK family. The protein sequences of the CRK family from Homo sapiens (HsCRK, HsCRKL), Danio rerio (DrCRK, DrCRKL), Drosophila melanogaster (DmCRK), and Caenorhabditis elegans (ced-2) were aligned using ClustalX (14). The splice sites, indicated by the colored numbers, were identified using Ensembl and SMART. The SH2 domain of CRK is highlighted in light blue. The intramolecular SH2 binding site of CRK is highlighted in orange (see fig. S7).

Figure S7. Insertion of an intramolecular phosphorylation site for autoinhibition of CRK. The Clustal sequence alignment (fig. S6B) reveals an intramolecular tyrosine phosphorylation site present in Homo sapiens (HsCRK, HsCRKL) and Danio rerio (DrCRK, DrCRKL), but absent in Drosophila melanogaster (DmCRK) and Caenorhabditis elegans (ced-2). The figure presents a potential model for the evolution of intramolecular regulation of CRK.


Figure S8. Evolving new SH2 interactions through novel pTyr sites. Within SH2 domain proteins are phosphotyrosine sites that allow SH2 domains of other proteins to bind. Here are several examples of different SH2 domain proteins with reported SH2 domain interactions in humans. For many of these interactions, the sites of phosphorylation may not be conserved resulting in ambiguity regarding the occurrence of interactions in certain species, as indicated by (?).

Figure S9. Conservation of the pTyr ligand-binding pocket in SH2 domains. The percent protein sequence conservation of the SH2 domains from Drosophila to human was mapped onto existing 3D structures using UCSF Chimera. SH2 domain structures with or without phosphotyrosine ligands for Abl (PDB: 1AB2), Cbl (PDB: 2CBL), Rasa1_C (PDB: 2GSB), Sap (PDB: 1M27), Shc (PDB: 1TCE), Syk_N&C (PDB: 2OQ1), and Vav (PDB: 2ROR) were downloaded from the RCSB Protein Data Bank. The Rasa1_N structure was modeled using SwissModel with the Rasa1_C SH2 domain structure as a template. The phosphotyrosine binding pocket of the SH2 domain is indicated with a black box. The position of the phosphotyrosine residue is noted in structures lacking a phosphotyrosine ligand.


Figure S10. Tissue expression of human SH2 domain proteins. (A) Tissue expression clustering of all human SH2 domain proteins. The SAGE expression data for 110 SH2 domain proteins was collected from Unigene (NCBI) (15) and then clustered using the GenePattern software (www.broadinstitute.org) (16). High expression is shown in dark red; dark blue/purple indicates low or no expression in the specific tissues. Circled in yellow are select SH2 genes that are highly expressed in the myeloid and lymphoid systems. (B) A close up view of the tissue expression pattern for the Src family kinases.

Table S1. A complete list of organisms in this study. Listed are the scientific or taxonomy names (genus and species), common names, and NCBI taxonomy ID.

| Prefix | Common Name | Genus species | Taxonomy ID |
|---|---|---|---|
| Hs | Human | Homo sapiens | 9606 |
| Mm | House Mouse | Mus musculus | 10090 |
| Md | Gray Short-tailed Opossum | Monodelphis domestica | 13616 |
| Xt | Western Clawed Frog | Xenopus tropicalis | 8364 |
| Xl | African Clawed Frog | Xenopus laevis | 8355 |
| Dr | Zebrafish | Danio rerio | 7955 |
| Bf | Florida lancelet | Branchiostoma floridae | 7739 |
| Ci | Sea Squirt | Ciona intestinalis | 7719 |
| Sp | Sea Urchin | Strongylocentrotus purpuratus | 7668 |
| Dm | Fruit Fly | Drosophila melanogaster | 7227 |
| Aa | Yellow Fever Mosquito | Aedes aegypti | 7159 |
| Ce | Roundworm | Caenorhabditis elegans | 6239 |
| Hm | Hydra | Hydra magnipapillata | 6085 |
| Nv | Sea Anemone | Nematostella vectensis | 45351 |
| Ta | Trichoplax | Trichoplax adhaerens | 10228 |
| Sd | Sponge | Suberites domuncula | 55567 |
| Ef | Freshwater Sponge | Ephydatia fluviatilis | 31330 |
| Mb | Choanoflagellate | Monosiga brevicollis MX1 | 431895 |
| Mo | Choanoflagellate | Monosiga ovata | 81526 |
| Sc | Budding yeast | Saccharomyces cerevisiae | 4932 |
| Dd | Slime mold | Dictyostelium discoideum AX4 | 352472 |
| Dp | Slime mold | Dictyostelium purpureum | 5786 |
| Eh | Amoeba | Entamoeba histolytica | 5759 |
| Ng | Amoeba-flagellate | Naegleria gruberi | 5762 |
| Pc | Oomycetes plant pathogen | Phytophthora capsici | 4784 |
| At | Thale cress | Arabidopsis thaliana | 3702 |
| Tt | Ciliated protozoan | Tetrahymena thermophila | 5911 |
| Tv | Parasitic protozoan | Trichomonas vaginalis | 5722 |

Table S2. SH2 domains in Bikonta and Amoebozoa.

| Organism (Tax ID) | Gene Name | Domains |
|---|---|---|
| Phytophthora capsici (4784) | estExt_fgenesh1_kg.C_220545 | SH2 |
| | gw1.66.3.1 | SH2 |
| | estExt_fgenesh1_pg.C_170140 | GTPase, SH2 |
| | estExt_Genewise1.C_40380* | S1 (missing SH2) |
| Naegleria gruberi (5762) | NAEGRDRAFT_58761 | SH2 |
| | NAEGRDRAFT_58280 | SH2 |
| | NAEGRDRAFT_72806 | SH2 |
| | NAEGRDRAFT_50018* | Tex, S1, SH2 |
| Arabidopsis thaliana (3702) | AT1G17040.1 | Concanavalin A-like lectin/glucanase, SH2 |
| | AT1G78540.1 | Concanavalin A-like lectin/glucanase, SH2 |
| | GTB1* | Tex, YqqFc, S1, SH2 |
| | AT1G63210* | Tex, YqqFc, S1, SH2 |
| Trichomonas vaginalis (5722) | TVAG_145250* | Tex, SH2 |
| Tetrahymena thermophila (5911) | TTHERM_01093550* | Tex, SH2, AIR1 |
| Entamoeba histolytica (294381) | EHI_045380 | STY Kinase, SH2 |
| | EHI_141930 | STY Kinase, SH2 |
| | EHI_128700 | STY Kinase, SH2 |
| | EHI_055990 | STY Kinase, SH2 |
| Dictyostelium discoideum (352472) | dstA, STATa | SH2 |
| | dstB | SH2 |
| | dstC, STATc | SH2 |
| | dstD | SH2 |
| | shkA, SHK1 | STY Kinase, SH2 |
| | shkB, SHK2 | STY Kinase, SH2 |
| | shkC, SHK3 | STY Kinase, SH2 |
| | shkD, SHK4 | STY Kinase, SH2 |
| | shkE, SHK5 | STY Kinase, SH2 |
| | spt6* | S1, SH2 |
| | Mss11p (DDB_0237765) | F-Box, SH2, Ank repeats |
| | DDBDRAFT_0206267 | SH2 |
| | DDB0238372 (DDB0168162) | SH2, RING |
| | DDB0238373 (DDB0217218) | SH2, RING |
| | DDB0187660 | SH2, LRR (2) |
| Dictyostelium purpureum (5786) | estExt_Genewise1Plus.C_2790002 | SH2 |
| | GID1.0048896 | SH2 |
| | GID1.0037847 | SH2 |
| | estExt_Genewise1.C_7500001 | SH2 |
| | GID1.0047996 | SH2 |
| | GID1.0044238 | F-Box, SH2, Ank repeats |
| | e_gw1.210.20.1 | SH2, RING |
| | GID1.0043328 | SH2, LRR (2) |
| | estExt_fgeneshDP_pg.C_60040 | STY Kinase, SH2 |
| | GID1.0049092 | STY Kinase, SH2 |
| | estExt_Genewise1Plus.C_1530046 | STY Kinase, SH2 |
| | GID1.0050363 | STY Kinase, SH2 |
| | GID1.0039116 | STY Kinase, SH2 |

(*) Spt6 orthologs

Table S5. Classification of SH2 domain family divergence. This classification is based upon the types of events that lead to diversification of each family during the periods of whole gene duplication between arthropods and mammals. The schema for these classifications is outlined in Figure 4.

| Class | Families |
|---|---|
| Class IA | ABL, SRC, SH2B, STAP, CHN, CRK, CSK, FPS, FRK, GRB7, NCK, PIK3R, PLCG, RIN, SH2D2A, SH2D4, SH2D3, SHC, SHB, PTPN11, SOCS, STAT, JAK, VAV |
| Class IB | GRB2, SLP76, TNS |
| Class IC | TEC, CBL, SHIP |
| Class II | SH2D1, SLAP, SYK |
| Class 0 | SH3BP2, DAPP1, RASA1, SH2D5, SPT6 |


Table S6. Defining SH2 families using Ensembl paralog predictions versus family organization. Proteins not predicted as a member of the defined family are highlighted in bold. Reason(s) behind why the highlighted proteins were not included are indicated in the comments sections.

Family Sequence alignment, intron-exon, and dDomain organization

Ensembl (Paralog) Comments

ABL Abl1, Abl2 Abl1, Abl2, Btk, Tec, Txk, Itk, Bmx, Srms, Ptk6, Lck, Src, Hck, Fgr, Frk, Lyn, Blk, Yes1, Fyn

Ensembl predicts TEC, SRC, and FRK families as a member of the ABL familiy based on domain organization

CBL Cbl, CblB, CblC Cbl, CblB, CblC CHN Chn1, Chn2 Chn1, Chn2, Gmip, Hmha1, Arhgap29,

Arhgap21, Arhgap23, Arhgap10, Arhgap15, Racgap1, Arhgap12, Arhgap9, Arhgap27, Ophn1, Arhgap26, Arhgap42

Additional proteins predicted but lack SH2 domains

CRK Crk, Crkl Crk, Crkl CSK Csk, Matk Csk, Matk, Fer, Fes, Dstyk Ensembl predicts FPS family and

Dstyk DAPP1 Dapp1 Dapp1 FPS Fes, Fer Fes, Fer, Dstyk, Csk Matk Ensembl predicts CSK family

and Dstyk FRK Frk, Srms, Brk Ptk6 (Brk), Abl1, Abl2, Btk, Tec, Txk,

Itk, Bmx, Lck, Src, Hck, Fgr, Frk, Lyn, Blk, Yes, Fyn, Srms

Ensembl predicts TEC, SRC, and ABL families based on domain organization

GRB2 Grb2, Gads, Grap Grb2, Gads, Grap2, Nck1, Nck2, Ac007952.1

NCK family is not a member of the GRB2 family

GRB7 Grb7, Grb10, Grb14 Grb7, Grb10, Grb14, Raph1, Apbb1ip Additional proteins predicted but lack SH2 domains

JAK Jak1, Jak2, Jak3, Tyk2

Jak1, Jak2, Jak3, Tyk2, Tnk2, Tnk1, Ptk2, Erbb4, Ptk2B, Syk, Erbb2, Erbb3, Egfr, Zap70

Additional proteins predicted but lack SH2 domains

NCK Nck1, Nck2 Nck1, Nck2, Grb2, Gads, Grap, Ac007952

GRB2 family is not a member of the NCK based on domain organization, sequence homology and splice patterns

PI3KR Pik3r1, Pik3r2, Pik3r3 Pik3r1, Pik3r2, Pik3r3 PLCG Plcg1, Plcg2 Plcg1, Plcg2, Plcz1, Plce1, Plcb3,

Plcb2, Plcb1, Plcb4, Plcl2, Plcd1, Plch1, Plcl1, Plcd3, Plch2, Plcd4

Additional proteins predicted but lack SH2 domains

PTPN | Ptpn6, Ptpn11 | Ptpn6, Ptpn11, Ptpn3, Ptprn, Ptpn21, Ptpn14, Ptpn20b, Ptpn2, Frmpd2, Ptpn18, Ptpn9, Ptpn4, Ptpn12, Ptpn20c, Ptpn22, Ptpn20a, Ptpn1, Ptpn13 | Additional proteins predicted but lack SH2 domains
RASA1 | Rasa1 | Rasa1, Dab2ip, Syngap1, Rasal2, Rasal3, Rasa4, Rasa3, Rasa4b, Rasa2, RasaL1 | Additional proteins predicted but lack SH2 domains
RIN | Rin1, Rin2, Rin3 | Rin1, Rin2, Rin3, RinL | Additional proteins predicted but lack SH2 domains
SH2B | Sh2b, Lnk, Aps | Sh2b, Lnk, Aps |
SH2D1 | Sh2d1a, Sh2d1b | Sh2d1a, Sh2d1b |
SH2D2 | Sh2d2a, Hsh2, Sh2d7 | Sh2d2a, Hsh2, Sh2d4a, Sh2d4b | Ensembl missed Sh2d7
SH2D3 | Sh2d3a, Bcar3, Sh2d3c | Sh2d3a, Bcar3, Sh2d3c |
SH2D4 | Sh2d4a, Sh2d4b | Sh2d4a, Sh2d4b, Sh2d2a, Hsh2 | Identical domain organization but different splice patterns
SH2D5 | Sh2d5 | Sh2d5 |
SH3BP2 | Sh3bp2 | Sh3bp2 |
SHB | Shb, Shd, She, Shf | Shb, Shd, She, Shf, RP11-613M10.9 | Additional proteins predicted but lack SH2 domains
SHC | Shc1, Shc2, Shc3, Shc4 | Shc1, Shc2, Shc3, Shc4 |
SHIP | Ship1, Ship2 | Ship1, Ship2, Inpp5e, Ocrl, Inpp5k, Inpp5b, Synj1, Inpp5j, Synj2 | Additional proteins predicted but lack SH2 domains
SLAP | Slap1, Slap2 | Slap1 (Sla), Slap2 (Sla2) |
SLP76 | Slnk, Slp76, Blnk, Mist | Slnk, Slp76, Blnk, Mist |
SOCS | Socs1, Socs2, Socs3, Socs4, Socs5, Socs6, Socs7, Cish | Socs1, Socs2, Socs3, Socs4, Socs5, Socs6, Socs7, Cish |
SPT6 | Supt6h | Supt6h |
SRC | Blk, Fgr, Fyn, Hck, Lck, Lyn, Src, Yes | Fgr, Fyn, Yes1, Frk, Abl2, Lck, Itk, Btk, Hck, Txk, Lyn, Srms, Ptk6 (Brk), Tec, Abl1, Bmx | Ensembl predicts TEC, FRK, and ABL families based on domain organization
STAP | Bks, Brdg1 | Bks, Brdg1 |
SYK | Syk, Zap70 | Syk, Zap70, Ptk2, Ptk2b, Jak1, Jak2, Erbb4, Tnk2, Jak3, Erbb2, Tnk1, Erbb3, Egfr, Tyk2 |
TEC | Bmx, Tec, Btk, Itk, Txk | Bmx, Tec, Btk, Itk, Txk, Abl1, Abl2, Srms, Ptk6, Lck, Src, Hck, Fgr, Frk, Lyn, Blk, Yes1, Fyn | Ensembl predicts FRK, SRC, and ABL families based on domain organization
TNS | Tns1, Tns3, Tenc1, Tns4 | Tns1, Tns3, Tenc1, Tns4, Gak, Tpte2, Pten, Tpte, Dnjc6 | Additional proteins predicted but lack SH2 domains
VAV | Vav1, Vav2, Vav3 | Vav1, Vav2, Vav3, Plekhg1, Plekhg2, Arhgef7, Tiam1, Arhgef4, Prex1, Spata13, Arhgef9, Prex2, Tiam2, Arhgef6 | Additional proteins predicted but lack SH2 domains
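The comparison summarized in Table S6 amounts to a set difference between the members of an SH2 family and the Ensembl paralog predictions for that family. The following is a minimal sketch of that bookkeeping for three rows copied from the table; the function name `compare` and the dictionary layout are illustrative assumptions, not part of the original analysis.

```python
# Family membership (by SH2 domain organization) versus Ensembl paralog
# predictions, copied from three rows of Table S6.
# Layout (an assumption for illustration): family -> (members, Ensembl paralogs)
FAMILIES = {
    "RIN": ({"Rin1", "Rin2", "Rin3"}, {"Rin1", "Rin2", "Rin3", "RinL"}),
    "SH2D2": ({"Sh2d2a", "Hsh2", "Sh2d7"}, {"Sh2d2a", "Hsh2", "Sh2d4a", "Sh2d4b"}),
    "SHC": ({"Shc1", "Shc2", "Shc3", "Shc4"}, {"Shc1", "Shc2", "Shc3", "Shc4"}),
}


def compare(family_members, ensembl_paralogs):
    """Return (missed, extra) as sorted lists.

    missed: family members that Ensembl did not predict as paralogs.
    extra:  Ensembl paralogs that are not members of the family.
    """
    missed = sorted(family_members - ensembl_paralogs)
    extra = sorted(ensembl_paralogs - family_members)
    return missed, extra


for name, (members, paralogs) in FAMILIES.items():
    missed, extra = compare(members, paralogs)
    print(name, "missed:", missed, "extra:", extra)
```

For the SH2D2 row this reports Sh2d7 as missed and Sh2d4a and Sh2d4b as extra, which corresponds to the comment "Ensembl missed Sh2d7" in the table.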


Table S7. Ortholog predictions of human SH2 proteins to proteins in lower organisms. NCBI BlastP, Ensembl, and InParanoid were used to identify orthologs in opossum, zebrafish, fruit fly (Drosophila), and worm. The far-left column indicates the human SH2 protein used to search the three databases for orthologs in each organism. Predictions that do not match those obtained with our method of Blast and domain organization are highlighted in bold. N.P. – No predictions. Rows are colored according to SH2 family (orange – SRC; blue – FRK; green – GRB2).

Organism – Organism (SH2 protein) | Blast (NCBI) | Ensembl (Ortholog) | InParanoid
Human – Opossum (Src) | Src (LOC100029391) | Src (ENSMODG00000000247) | Fyn (ENSMODG00000017905)
Human – Zebrafish (Src) | src | src (ENSDARG00000008107) | wu:fc54g0
Human – Drosophila (Src) | Src64B | N.P. | N.P.
Human – Worm (Src) | src-1 | src-1 (Y92H12A.1) | src-1
Human – Opossum (Fyn) | Fyn | Fyn (ENSMODG00000017905) | Fyn (ENSMODG00000017905)
Human – Zebrafish (Fyn) | Fyna, Fynb | Fynb (ENSDARG00000025319) | Fyna (ENSDARG00000011370)
Human – Drosophila (Fyn) | Src42A/Src64B | N.P. | N.P.
Human – Worm (Fyn) | src-2, src-1 | src-1 (Y92H12A.1) | src-1 (Y92H12A.1)
Human – Opossum (Brk) | LOC100025045 (PTK6-like) | Brk (ENSMODG00000016881) | Brk (ENSMODG00000016881)
Human – Zebrafish (Brk) | ptk6b, ptk6a | ptk6b (ENSDARG00000059956), si:ch73-340m8.2 (ENSDARG00000040258) | N.P.
Human – Drosophila (Brk) | Src42A | Src64B | N.P.
Human – Worm (Brk) | src-2 | abl-1 | N.P.
Human – Opossum (Frk) | LOC100022866 (Frk-like) | Frk (ENSMODG00000017852) | Frk (ENSMODG00000017852)
Human – Zebrafish (Frk) | LOC567549 (Frk-like) | Frk (si:ch1073-112f1.1) | N.P.
Human – Drosophila (Frk) | Src42A | Src42A (FBgn0004603) | Src42A
Human – Worm (Frk) | src-2 | src-2 (F49B2.5) | src-2
Human – Opossum (Srms) | LOC100025024 (Srms-like) | Srms (ENSMODG00000016887) | Srms (ENSMODG00000016887)
Human – Zebrafish (Srms) | zgc:194282 | si:ch73-340m8.2 (ENSDARG00000040258) | N.P.
Human – Drosophila (Srms) | Src42A | Src64B | N.P.
Human – Worm (Srms) | src-2 | abl-1 | N.P.
Human – Opossum (Grb2) | LOC100022894 (Grb2-like) | Grb2 | Grb2
Human – Zebrafish (Grb2) | grb2, zgc:103549 | zgc:103549 (ENSDARG00000014709), grb2 (ENSDARG00000038059) | Grb2, MCG:7307
Human – Drosophila (Grb2) | drk | drk (FBgn0004638) | drk
Human – Worm (Grb2) | sem-5 | sem-5 (C14F5.5) | sem-5
Human – Opossum (Gads) | LOC100011047 (Grap2-like) | Grap2 (ENSMODG00000008992) | Grap2
Human – Zebrafish (Gads) | grap2a, grap2b | grap2a, grap2b | zgc:11209, MGC:11209, MGC:6594, zgc:6594
Human – Drosophila (Gads) | drk | N.P. | N.P.
Human – Worm (Gads) | sem-5 | N.P. | N.P.
Human – Opossum (Grap) | LOC100011892, LOC100022894, LOC100011047 | ENSMODG00000000961 | ENSMODG00000000961
Human – Zebrafish (Grap) | grb2, zgc:103549, grap, zgc:109892 | Grap, zgc:109892 | N.P.
Human – Drosophila (Grap) | drk | N.P. | N.P.
Human – Worm (Grap) | sem-5 | N.P. | sem-5
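One way to screen rows of Table S7 for disagreement between a database prediction and the Blast result is a simple name comparison. The sketch below is only an illustration under stated assumptions: the helpers `names` and `status` are hypothetical, and matching on gene names (ignoring case and parenthesized identifiers) is a crude stand-in for the Blast and domain organization criterion used in this study. It will misjudge cells in which the same gene carries different labels, such as a LOC identifier in one column and a gene symbol in another.

```python
import re

NO_PREDICTION = "N.P."


def names(cell):
    """Split a table cell into a set of lower-cased gene names.

    Parenthesized identifiers such as '(FBgn0004603)' are dropped.
    A cell holding only 'N.P.' yields an empty set.
    """
    if cell.strip() == NO_PREDICTION:
        return set()
    stripped = re.sub(r"\([^)]*\)", "", cell)
    return {part.strip().lower() for part in stripped.split(",") if part.strip()}


def status(blast_cell, other_cell):
    """Classify a database cell relative to the Blast cell (name match only)."""
    blast, other = names(blast_cell), names(other_cell)
    if not other:
        return "no prediction"
    return "agrees" if other <= blast else "differs"


# (row label, Blast, Ensembl, InParanoid), copied from Table S7.
ROWS = [
    ("Human - Drosophila (Frk)", "Src42A", "Src42A (FBgn0004603)", "Src42A"),
    ("Human - Drosophila (Brk)", "Src42A", "Src64B", "N.P."),
    ("Human - Worm (Grap)", "sem-5", "N.P.", "sem-5"),
]

for label, blast, ensembl, inparanoid in ROWS:
    print(label, "| Ensembl:", status(blast, ensembl),
          "| InParanoid:", status(blast, inparanoid))
```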


Table S8. Ortholog prediction from C. elegans to H. sapiens. N.P. – No predictions.

C. elegans | Human Blast/Family | Ensembl (Human) | InParanoid (Human) | Comments
abl-1 | Abl1/ABL | Abl1, Btk, Txk, Ptk6, Bmx, Itk, Srms, Tec, Abl2 | Abl1 |
T25B9.5 | Abl1/ABL | Matk, Csk, Dstyk, Fer, Fes | N.P. |
sli-1 | Cbl-B/CBL | Cbl, CblB, CblC | Cbl |
chin-1 | Chn1/CHN | Chn1, Chn2 | Chn1, Chn2 |
ced-2 | Crkl/CRK | Crk, Crkl | Crk, Crkl |
csk-1 | Csk/CSK | Csk, Matk | Csk, Matk |
kin-24 | Fer/FPS | Matk, Csk, Dsty, Fer, Fes | N.P. |
src-2 | Frk/FRK | Frk | Frk |
sem-5 | Grb2/GRB2 | Grb2 | Grb2, Grap |
nck-1 | Nck2/NCK | Nck1, Nck2 | Nck1, Nck2 |
aap-1 | Pik3r1/PIK3R | Pik3r1, Pik3r2, Pik3r3 | Pik3r1, Pik3r2, Pik3r3 |
plc-3 | Plcg1/PLCG | Plcg1, Plcg2 | Plcg1, Plcg2 |
ptp-2 | Ptpn6/PTPN | Ptpn6, Ptpn11 | Ptpn6, Ptpn11 |
gap-3 | Rasa1/RASA1 | N.P. | N.P. | No predictions by either Ensembl or InParanoid
tag-333 | Rin2/RIN | Rin1, Rin2, Rin3, RinL | Rin1, Rin2, Rin3 |
F13B12.6 | Sh2d4a,b/SH2D4 | Sh2d2a, Hsh2, Sh2d4a, Sh2d4b | Sh2d4a, Sh2d4b |
Y87G2A.17 | Shb/SHB | N.P. | Shb, Shd, She, Shf |
K11E4.2 | Shc/SHC | N.P. | N.P. | No predictions by either Ensembl or InParanoid
shc-1 | Shc1/SHC | Shc1, Shc2, Shc3, Shc4 | Shc1, Shc2, Shc3, Shc4 |
shc-2 | Shc/SHC | N.P. | N.P. | No predictions by either Ensembl or InParanoid
F39B2.5 | Socs7/SOCS | N.P. | Socs6, Socs7 |
emb-5 | Supt6h/SPT6 | Supt6h | Supt6h |
src-1 | Fyn/SRC | Fgr, Fyn, Yes, Src | Fgr, Fyn, Yes, Src, Hck, Lck, Lyn, Blk |
K07A1.14 | Btk/TEC | N.P. | N.P. | No predictions by either Ensembl or InParanoid
vav-1 | Vav1,2,3/VAV | Vav2 | Vav1, Vav2, Vav3 |
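The rows of Table S8 can be grouped by which resource returned any human ortholog for a given C. elegans gene. The sketch below does this for six rows copied from the table; the function names `coverage` and `group_by_coverage` and the tuple layout are assumptions made for illustration only.

```python
NO_PREDICTION = "N.P."

# (C. elegans gene, Ensembl cell, InParanoid cell), copied from Table S8.
ROWS = [
    ("abl-1", "Abl1, Btk, Txk, Ptk6, Bmx, Itk, Srms, Tec, Abl2", "Abl1"),
    ("T25B9.5", "Matk, Csk, Dstyk, Fer, Fes", "N.P."),
    ("gap-3", "N.P.", "N.P."),
    ("Y87G2A.17", "N.P.", "Shb, Shd, She, Shf"),
    ("shc-2", "N.P.", "N.P."),
    ("vav-1", "Vav2", "Vav1, Vav2, Vav3"),
]


def coverage(ensembl_cell, inparanoid_cell):
    """Label a row by which resource returned at least one prediction."""
    has_ensembl = ensembl_cell.strip() != NO_PREDICTION
    has_inparanoid = inparanoid_cell.strip() != NO_PREDICTION
    if has_ensembl and has_inparanoid:
        return "both"
    if has_ensembl:
        return "Ensembl only"
    if has_inparanoid:
        return "InParanoid only"
    return "neither"


def group_by_coverage(rows):
    """Map each coverage label to the genes that fall under it, in row order."""
    groups = {}
    for gene, ensembl_cell, inparanoid_cell in rows:
        label = coverage(ensembl_cell, inparanoid_cell)
        groups.setdefault(label, []).append(gene)
    return groups


for label, genes in group_by_coverage(ROWS).items():
    print(f"{label}: {', '.join(genes)}")
```

Genes that land in the "neither" group correspond to the rows carrying the comment "No predictions by either Ensembl or InParanoid".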


References

1. W. A. Lim, T. Pawson, Phosphotyrosine signaling: evolving a new cellular communication system. Cell 142, 661-667 (2010). doi:10.1016/j.cell.2010.08.023

2. M. Srivastava, O. Simakov, J. Chapman, B. Fahey, M. E. Gauthier, T. Mitros, G. S. Richards, C. Conaco, M. Dacre, U. Hellsten, C. Larroux, N. H. Putnam, M. Stanke, M. Adamska, A. Darling, S. M. Degnan, T. H. Oakley, D. C. Plachetzki, Y. Zhai, M. Adamski, A. Calcino, S. F. Cummins, D. M. Goodstein, C. Harris, D. J. Jackson, S. P. Leys, S. Shu, B. J. Woodcroft, M. Vervoort, K. S. Kosik, G. Manning, B. M. Degnan, D. S. Rokhsar, The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466, 720-726 (2010). doi:10.1038/nature09201

3. M. Srivastava, E. Begovic, J. Chapman, N. H. Putnam, U. Hellsten, T. Kawashima, A. Kuo, T. Mitros, A. Salamov, M. L. Carpenter, A. Y. Signorovitch, M. A. Moreno, K. Kamm, J. Grimwood, J. Schmutz, H. Shapiro, I. V. Grigoriev, L. W. Buss, B. Schierwater, S. L. Dellaporta, D. S. Rokhsar, The Trichoplax genome and the nature of placozoans. Nature 454, 955-960 (2008). doi:10.1038/nature07191

4. R. E. Steele, C. N. David, U. Technau, A genomic view of 500 million years of cnidarian evolution. Trends Genet 27, 7-13 (2010). doi:10.1016/j.tig.2010.10.002

5. N. H. Putnam, T. Butts, D. E. Ferrier, R. F. Furlong, U. Hellsten, T. Kawashima, M. Robinson-Rechavi, E. Shoguchi, A. Terry, J. K. Yu, E. L. Benito-Gutierrez, I. Dubchak, J. Garcia-Fernandez, J. J. Gibson-Brown, I. V. Grigoriev, A. C. Horton, P. J. de Jong, J. Jurka, V. V. Kapitonov, Y. Kohara, Y. Kuroki, E. Lindquist, S. Lucas, K. Osoegawa, L. A. Pennacchio, A. A. Salamov, Y. Satou, T. Sauka-Spengler, J. Schmutz, I. T. Shin, A. Toyoda, M. Bronner-Fraser, A. Fujiyama, L. Z. Holland, P. W. Holland, N. Satoh, D. S. Rokhsar, The amphioxus genome and the evolution of the chordate karyotype. Nature 453, 1064-1071 (2008). doi:10.1038/nature06967

6. C. M. Manning, W. R. Mathews, L. P. Fico, J. R. Thackeray, Phospholipase C-gamma contains introns shared by src homology 2 domains in many unrelated proteins. Genetics 164, 433-442 (2003).

7. G. Ostlund, T. Schmitt, K. Forslund, T. Kostler, D. N. Messina, S. Roopra, O. Frings, E. L. Sonnhammer, InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 38, D196-203 (2010). doi:10.1093/nar/gkp931

8. T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik, M. Clamp, The Ensembl genome database project. Nucleic Acids Res 30, 38-41 (2002).

9. T. J. Hubbard, B. L. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates, F. Cunningham, T. Cutts, T. Down, S. C. Dyer, S. Fitzgerald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, J. Herrero, R. Holland, K. Howe, N. Johnson, A. Kahari, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, C. Melsopp, K. Megy, P. Meidl, B. Ouverdin, A. Parker, A. Prlic, S. Rice, D. Rios, M. Schuster, I. Sealy, J. Severin, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, M. Wood, T. Cox, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, P. Flicek, A. Kasprzyk, G. Proctor, S. Searle, J. Smith, A. Ureta-Vidal, E. Birney, Ensembl 2007. Nucleic Acids Res 35, D610-617 (2007). doi:10.1093/nar/gkl996

10. S. B. Hedges, The origin and evolution of model organisms. Nat Rev Genet 3, 838-849 (2002). doi:10.1038/nrg929

11. B. Chan, A. Lanyi, H. K. Song, J. Griesbach, M. Simarro-Grande, F. Poy, D. Howie, J. Sumegi, C. Terhorst, M. J. Eck, SAP couples Fyn to SLAM immune receptors. Nat Cell Biol 5, 155-160 (2003). doi:10.1038/ncb920

12. B. A. Liu, K. Jablonowski, M. Raina, M. Arce, T. Pawson, P. D. Nash, The human and mouse complement of SH2 domain proteins-establishing the boundaries of phosphotyrosine signaling. Mol Cell 22, 851-868 (2006).

13. C. P. Ponting, J. Schultz, F. Milpetz, P. Bork, SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res 27, 229-232 (1999).

14. M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, D. G. Higgins, Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-2948 (2007). doi:10.1093/bioinformatics/btm404

15. E. W. Sayers, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko, J. Ye, Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 37, D5-15 (2009). doi:10.1093/nar/gkn741

16. M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, J. P. Mesirov, GenePattern 2.0. Nat Genet 38, 500-501 (2006). doi:10.1038/ng0506-500