

HUMAN MUTATION 29(3), 351–360, 2008

DATABASES

The Human Intermediate Filament Database: Comprehensive Information on a Gene Family Involved in Many Human Diseases

Ildiko Szeverenyi,1 Andrew J. Cassidy,2 Cheuk Wang Chung,3 Bernett T.K. Lee,3 John E.A. Common,1

Stephen C. Ogg,1 Huijia Chen,1 Shu Yin Sim,1 Walter L.P. Goh,1 Kee Woei Ng,1 John A. Simpson,4

Li Lian Chee,1 Goi Hui Eng,1 Bin Li,1 Declan P. Lunny,1 Danny Chuon,3 Aparna Venkatesh,1

Kian Hoe Khoo,1 W.H. Irwin McLean,2 Yun Ping Lim,3 and E. Birgitte Lane1,4*

1Epithelial Biology Group, Institute of Medical Biology, Singapore; 2Epithelial Genetics Group, Human Genetics Unit, Division of Pathology and Neuroscience, University of Dundee, Ninewells Hospital and Medical School, Dundee, United Kingdom; 3Bioinformatics Institute, Singapore; 4College of Life Sciences, University of Dundee, Dundee, United Kingdom

Communicated by A. Jamie Cuticchia

We describe a revised and expanded database on human intermediate filament proteins, a major component of the eukaryotic cytoskeleton. The family of 70 intermediate filament genes (including those encoding keratins, desmins, and lamins) is now known to be associated with a wide range of diverse diseases, at least 72 distinct human pathologies, including skin blistering, muscular dystrophy, cardiomyopathy, premature aging syndromes, neurodegenerative disorders, and cataract. To date, the database catalogs 1,274 manually-curated pathogenic sequence variants and 170 allelic variants in intermediate filament genes from over 459 peer-reviewed research articles. Unrelated cases were collected from all of the six sequence homology groups and the sequence variations were described at cDNA and protein levels with links to the related diseases and reference articles. The mutations and polymorphisms are presented in parallel with data on protein structure, gene, and chromosomal location and basic information on associated diseases. Detailed statistics relating to the variant records in the database are displayed by homology group, mutation type, affected domain, associated diseases, and nucleic and amino acid substitutions. Multiple sequence alignment algorithms can be run from queries to determine DNA or protein sequence conservation. Literature sources can be interrogated within the database and external links are provided to public databases. The database is freely and publicly accessible online at www.interfil.org (last accessed 13 September 2007). Users can query the database by various keywords and the search results can be downloaded. It is anticipated that the Human Intermediate Filament Database (HIFD) will provide a useful resource to study human genome variations for basic scientists, clinicians, and students alike. Hum Mutat 29(3), 351–360, 2008. © 2007 Wiley-Liss, Inc.

KEY WORDS: database; intermediate filament; keratin; desmin; lamin; GFAP; neurofilament; epidermolysis bullosa; laminopathy

INTRODUCTION

Intermediate filaments are part of the eukaryotic cell cytoskeleton beside actin filaments and microtubules. The function of the flexible filament proteins appears to be predominantly to provide physical reinforcement for cells in tissues to resist normal wear and tear [Fuchs and Weber, 1994]. Intermediate filament proteins are encoded by 70 genes in the human genome [Hesse et al., 2001; Rogers et al., 2005, 2004], which are divided into six sequence homology groups (types I–VI). Type V filaments, the lamins, are exclusively nuclear and occur in all tissues, but all the others are cytoplasmic filaments showing distinct tissue restriction. Types I and II are the keratins (in epithelial cells), type III are mostly mesodermal (desmin in muscle cells, vimentin in mesenchymal, endothelial, and hemopoietic cells and others; glial fibrillary acidic protein [GFAP] in astroglial cells, and peripherin in subsets of neuronal cells). Type IV are neurofilament and related proteins (NF-L, NF-M, NF-H, nestin, synemin, syncoilin, α-internexin) and type VI are the beaded filament proteins of the eye lens (filensin and CP49).

Published online 21 November 2007 in Wiley InterScience (www.interscience.wiley.com).

DOI 10.1002/humu.20652

Received 4 June 2007; accepted revised manuscript 14 August 2007.

Grant sponsors: Agency for Science, Technology and Research (A*STAR), Dystrophic Epidermolysis Research Association (DebRA); Grant sponsor: Wellcome Trust; Grant number: 055090/A/98/Z; Grant sponsor: Cancer Research UK; Grant number: C26/A1461.

*Correspondence to: E. Birgitte Lane, Professor, Institute of Medical Biology, 8A Biomedical Grove, #06-06 Immunos, Singapore 138665. E-mail: [email protected]

© 2007 WILEY-LISS, INC.

All of the intermediate filament proteins share a central alpha-helical structure consisting of a long rod domain of highly conserved size, together with variably sized, less structured head (amino-terminal) and tail (carboxy-terminal) domains [Fuchs and Weber, 1994; Steinert and Parry, 1985]. The rod domain has a conserved sequence pattern of heptad repeats (with predominantly hydrophobic residues at the first and fourth position of every heptad) and surface patches of alternating charge, and these features drive the polypeptide chains into a coiled-coil dimer assembly, and thence rapidly [Herrmann and Aebi, 2004; Strelkov et al., 2002] into multistrand filaments approximately 10 nm wide. Expression of intermediate filament gene transcripts and the small number of their derived alternative splice products is highly tissue-specific, such that intermediate filament proteins have been valuable tools in diagnostic pathology for many years [Domagala et al., 1986; Osborn et al., 1986; Ramaekers et al., 1985; reviewed by Fuchs and Cleveland, 1998; Porter and Lane, 2003]. From studies on the human diseases it is now understood that intermediate filaments are essential for tissue function, as they provide cells with necessary resilience to maintain tissue structure [reviewed by DePianto and Coulombe, 2004].

The number of diseases associated with intermediate filaments is still growing (currently 72), with the first human diseases linked to mutation in genes encoding lamin B1 (LMNB1) and filensin (BFSP1) reported recently [Padiath et al., 2007; Ramachandran et al., 2007]. The number of intermediate filament mutations reported is expanding even faster [Burke and Stewart, 2006; Capell and Collins, 2006; Goldfarb et al., 2004; Lane and McLean, 2004; Lariviere and Julien, 2004; Omary et al., 2004]. This disease association has generated increased interest worldwide in the protein family, and, consequently, an increasing amount of data on these genes and proteins is appearing in the literature at an accelerating rate. Adding to this the recently published revised gene nomenclature of the largest group, the keratins [Schweizer et al., 2006], it is now timely to combine all the information on these proteins into one easily accessible source. Based on the prototype created 5 years ago at the University of Dundee, we present here a significantly expanded and updated database, the Human Intermediate Filament Database (HIFD), as a useful reference and analysis tool.

DATA SOURCES AND LIMITATIONS

The HIFD contains cataloged and manually curated data retrieved from publications describing sequence variants (mutations) and allelic variants (polymorphisms) identified in the genes encoding intermediate filaments in humans (Fig. 1). Over 460 peer-reviewed research articles have been mined to date, published between 1991 and 2007, extracted from PubMed (www.ncbi.nlm.nih.gov/entrez) by bibliographic searches using keywords such as intermediate filament, mutation, polymorphism, variant, disease, keratin, epidermolysis bullosa, laminopathy, desmin, and neurofilament, to identify primary data describing naturally-arising disease-associated human intermediate filament mutations. Reviews, and studies presenting experimentally induced mutation(s), were not included. The sequence and allelic variants were entered into the database as many times as they were found in unrelated cases. Multiple family members expressing the same variant sequence are only entered as one incidence, in contrast with other databases such as Universal Mutation Database LMNA (www.umd.be:2000). Similarly, when a mutation from the same patient has been described in more than one article, only the original description has been taken into account and registered in the database.
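The counting rule described above (one entry per unrelated case, keeping only the original description) can be sketched in Python. This is a minimal illustration and not the HIFD curation code: the record fields (`gene`, `variant`, `family_id`, `year`) are hypothetical, and the earliest publication year is assumed to stand in for "the original description".

```python
from typing import NamedTuple


class Report(NamedTuple):
    gene: str       # gene symbol (placeholder values in the usage note)
    variant: str    # variant description as published
    family_id: str  # hypothetical identifier for one unrelated family or patient
    year: int       # publication year of the report


def count_unrelated_cases(reports):
    """Keep one record per (gene, variant, family), preferring the earliest report.

    Relatives sharing a variant collapse into a single incidence, and a case
    described in several articles is kept only once.
    """
    earliest = {}
    for r in reports:
        key = (r.gene, r.variant, r.family_id)
        if key not in earliest or r.year < earliest[key].year:
            earliest[key] = r
    return sorted(earliest.values())
```

With placeholder input such as two reports of variant `v1` in family `famA` and one in family `famB`, the function returns two records: the same variant is counted once per unrelated family.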

FIGURE 1. Chromosomal location of human intermediate filament coding genes. Individual protein records can be accessed by clicking on the gene symbols (e.g., KRT14) arranged according to their chromosomal locations. Illustration of chromosomes was obtained from Ensembl (www.ensembl.org) and modified.

DATABASE CONTENT AND ORGANIZATION

The HIFD is a protein-centric human database listing sequence variants of intermediate filaments associated with various genetically inherited diseases. Known allelic variants (polymorphisms) for intermediate filament genes are also collected and displayed. Data recorded for the 70 intermediate filament genes include information on the primary protein plus the few recorded alternative splice products that are significantly expressed in normal human tissues. This amounts to 73 proteins in total including desmuslin (DMN) with two alternative splice products (synemin α and synemin β) and lamin A/C (LMNA) with three alternative splice products. The LMNA gene encodes three major proteins, lamins A, C1, and C2. Lamin AΔ10 is not included in the database as it has only been detected in human cell lines and not, so far, in normal tissues [Machiels et al., 1996]. At the time of writing there are altogether 1,235 sequence variants and 168 allelic variants in the database. Nearly all of the sequence variants listed are clearly implicated in one or more of 72 distinct human diseases described in the database.

HIFD is a relational database built using MySQL and arranged to facilitate easy access to the relevant information. The data is organized according to types of intermediate filament proteins and their related diseases. Details such as protein name and physicochemical characteristics, gene symbol, genomic location, sequence information, domain positions, and associated diseases are accessible directly through the individual protein information page for each intermediate filament protein. Each sequence variant and allelic variant has a unique database identification number (ID) and they are presented in the form of separate tables. Variations are mapped to cDNA sequences and the changes are described at DNA and protein levels with three-letter amino acid codes. The positions of the variants referred to in the papers were curated (and revised if necessary) to fit the consensus nomenclature of the Human Genome Variation Society (www.hgvs.org/mutnomen) [den Dunnen and Antonarakis, 2000, 2001]. Accordingly, the first coding nucleotide position is taken as the adenine of the initiation (ATG) codon in the cDNA reference sequence, and ATG is numbered as the first codon (encoding methionine) of the protein. Following current convention, methionine is taken as the first amino acid of the protein sequence, irrespective of whether it is subsequently cleaved. The variants were assigned to the integrated nucleic and amino acid sequences to provide a comprehensive view of the data set (Fig. 2A).
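The numbering convention just described fixes a simple arithmetic relationship between coding nucleotide positions and codon (amino acid) numbers. The following Python sketch illustrates that relationship only; the function names are illustrative and are not part of the HIFD or of any HGVS software.

```python
def codon_number(cdna_pos: int) -> int:
    """Codon that contains coding nucleotide `cdna_pos`.

    Position 1 is the A of the initiation ATG, so positions 1-3 form codon 1
    (methionine), positions 4-6 form codon 2, and so on.
    """
    if cdna_pos < 1:
        raise ValueError("coding positions start at 1 (the A of ATG)")
    return (cdna_pos - 1) // 3 + 1


def codon_span(codon: int) -> tuple[int, int]:
    """First and last coding nucleotide positions of a codon."""
    if codon < 1:
        raise ValueError("codon numbers start at 1 (the initiator methionine)")
    first = 3 * (codon - 1) + 1
    return first, first + 2
```

Under this convention, `codon_span(125)` gives `(373, 375)`, so a substitution at any of coding positions 373 to 375 is reported against amino acid 125.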

The genomic, mRNA and protein sequences were retrieved from the NCBI RefSeq (www.ncbi.nlm.nih.gov/projects/RefSeq) collection [Pruitt et al., 2007] and they are accessible for downloading in FASTA (text-based) format, while the variant records are downloadable in Excel (Microsoft, Redmond, WA) format. The short descriptions of diseases are organized according to their association with intermediate filament type and links are included to the respective protein information page, to references in the HIFD, as well as to the Online Mendelian Inheritance in Man (OMIM) database. The database statistics section reviews the records for variants according to sequence homology group, variant type, affected domain, associated diseases, and nucleic or amino acid substitution.

In addition to the data repository and visualization components, the HIFD also includes the frequently used CLUSTALW tool [Thompson et al., 1994], which has been incorporated within the web portal to run on our backend computer platforms. This resource allows multiple sequence comparison of orthologous nucleotide and protein sequences as obtained from HomoloGene (Fig. 2B) and can be visualized using the JalView tool [Clamp et al., 2004]. Hyperlinks to public databases such as HGNC (HUGO Gene Nomenclature Committee; www.genenames.org/index.html), OMIM (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM), NCBI Entrez Gene (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene), and the University of California, Santa Cruz (UCSC) Genome Browser (UCSC Genome Bioinformatics Group; http://genome.ucsc.edu) are all provided.

Information in HIFD is periodically updated to ensure continuity and accuracy. Contact details and an online feedback form are available for the user community to contribute their own comments and updates. During its "soft-launch" phase, the database has already benefited from suggestions of several local and international users who have started to use it for their research.

QUERYING THE DATABASE

The database is freely and publicly accessible online at www.interfil.org. Individual protein records can be accessed either by clicking onto the gene symbols displayed according to their chromosomal location (Fig. 1) or via the navigation bar. The major advantages of HIFD are that it allows the user to search at the different levels of information and the query results can be downloaded as Excel spreadsheets. The database is searchable by several keywords (e.g., protein name, gene symbol, cDNA and protein RefSeq accession numbers, chromosome number/location, associated disease, variant type, and publication records). The predetermined drop-down menus are displayed to help nonspecialist users in data query. Users can submit sequences directly into the database to search for similarities using the Basic Local Alignment Search Tool (BLAST) [Altschul et al., 1990]. Data can be retrieved by browsing through either the entire collection of intermediate filaments organized into six different types of homology group or the full bibliography containing curated references listed in alphabetical order and linked to PubMed.

ANALYSIS OF DATASETS

There are 73 intermediate filament proteins encoded within the human genome [Hesse et al., 2001; Rogers et al., 2005, 2004], and 34 of these have so far been associated with diseases in humans. Some of the proteins have been linked with multiple diseases, leading to a total of 72 intermediate filament-associated diseases in the database (Fig. 3).

To date, the single intermediate filament protein with the highest number of recorded unrelated cases of pathogenic mutations is lamin A, with a total of 333 sequence variants; 18 of the 20 described laminopathies are associated with the LMNA gene. This is followed by K14 with 152 recorded sequence variants, GFAP with 127, and K5 with 122 sequence variants. Since K5 and K14 are always coexpressed in the same cells as obligatory copolymers, mutations in either K5 or K14 cause the same pathology: combined reports of mutations in K5 and K14 give the next highest category of known intermediate filament pathologies (274 disease-associated sequence variants) after lamin A.

Of all the intermediate filament proteins known, 48% (35/73) have at least one sequence or allelic variant record. Most of the proteins with no recorded sequence variants are the keratins associated with the hair follicle, a dense and complex structure which is hard to analyze histologically and whose proteins are very insoluble and hard to distinguish from one another. Only four of the 31 known hair follicle-associated intermediate filament proteins have been linked to diseases. Assuming that the disease-associated K8 and K18 mutations can be classed as pathogenic or pathocontributory, then outside the hair keratins, only KRT15, KRT19, KRT7, VIM (vimentin), INA (internexin), NES (nestin), SYNC1 (syncoilin), and DMN (desmuslin) have not been linked to any human disease. It should be suspected that mutation in these genes may be either lethal or clinically asymptomatic.

FIGURE 2. Sequence alignment interface. A: Integrated sequence view. The screenshot captures part of the alignment of genomic, mRNA, and protein sequences of K14. Red and blue colored sequences denote untranslated and coding regions of mRNA, respectively. Nucleic acid variants are displayed in boxed blue background above the integrated view, while changes in the wild-type protein sequences are illustrated in single-letter amino acid code in boxed pink background below the sequence. The pop-up box with information on genomic variation, domains, unrelated patient cases, associated diseases, and references can be displayed by moving the mouse over a DNA or protein variant (red arrowheads). Hyperlinks to OMIM and PubMed are included in the pop-up box. B: Orthologous sequence alignment. Shown here is a screenshot of human, mouse, and rat K14 protein sequences, aligned and visualized in Jalview. The pop-up window shows the alignment of orthologous sequences with graphical display of protein conservation and with the consensus sequence below. Sequence variants are labeled with blue background with "mouse-over" option allowing for the display of data entry references.

FIGURE 3. Screenshot of the most highly represented human genetic diseases listed in the database (31 shown out of 72 in total). Bars represent numbers of pathogenic sequence variants reported. Proteins are labeled with different colors according to their homology groups. The "mouse-over" feature of the bar chart allows the user to identify the number of variants of the respective proteins associated with each disorder.

The database currently contains 1,274 sequence variants and 170 allelic variants collected from 459 peer-reviewed articles. The genomic variations are classified according to the changes in protein sequences: 76.7% missense mutations, 8.4% silent mutations, 5.2% deletions, 3.3% frameshift mutations, 1.9% nonsense mutations, 1.6% silent/deletion (a silent mutation in LMNA activates a cryptic splice site resulting in an in-frame deletion), 0.7% insertions, 0.6% insertion-deletions, and 0.07% duplications. There are residual uncertainties (1.4% unknown) when changes at protein level have not been described but several possibilities exist based on the DNA sequence variations.

We analyzed the nature of substitutions in our dataset, since this mutation type causes the majority of changes in the genetic information. By far the most frequent substitutions were C to T (C>T) and G to A (G>A); these substitutions are known to be classically associated with methylation events. Cooper and Youssoufian [1988] have described the contribution of methylation localized to the coding region of DNA to human genetic diseases in general. In the case of sequence variants listed in HIFD, transitions from C to T and G to A took place 2.8 times more frequently than in the opposite direction (T to C and A to G) (Fig. 4). The reason for this difference is that the spontaneous deamination of 5-methylcytosine in CpG dinucleotides creates thymine [Duncan and Miller, 1980], which cannot be repaired by the uracil-DNA glycosylase repair machinery. In contrast, deamination of unmethylated cytosine produces uracil, which is amenable to correction by uracil-DNA glycosylase [Duncan and Miller, 1980; Lindahl et al., 1997].
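The substitution classes discussed above can be made concrete with a small Python sketch. It is an illustration only, not the analysis code used for the database: the function names are hypothetical, and the only biology encoded is the standard definition of transitions (purine to purine, or pyrimidine to pyrimidine) and the fact that a C>T change on one strand reads as G>A on the other.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def matches_deamination_direction(ref: str, alt: str) -> bool:
    """True for C>T, or for G>A (C>T on the complementary strand)."""
    return (ref.upper(), alt.upper()) in {("C", "T"), ("G", "A")}
```

Both C>T and T>C are transitions under this classification; the asymmetry reported in the text concerns only the direction, which the second function isolates.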

FIGURE 4. Nucleotide substitution pattern in human intermediate filament genes. The "mouse-over" pop-up box shows the occurrence of a particular point mutation. Disease-causing mutations are predominantly C to T and G to A transitions at the DNA level.

At the protein level, the highest mutation frequency was found in arginine (Arg, R) codons, due to the high mutability of the CpG dinucleotide (see previous paragraph) in four out of six possible arginine codons. Most of the arginine variants were identified in types I, III, and V homology groups (Table 1). In K14, sequence variation of codon 125 (Arg) accounts for 43 out of 54 arginine mutations. This residue lies within one of the most highly conserved regions of type I keratins, the helix initiation motif, and its mutation has been associated with the most severe subtype of the skin-blistering disorder epidermolysis bullosa simplex [Lane and McLean, 2004; Letai et al., 1993]. ClustalW alignment of the K14 sequence with other type I proteins was used to locate the arginine equivalent to K14:R125. The alignment confirms that this is a mutation hotspot for the type I group, with all of the sequence variants for arginine in K9, K10, K12, K16, and K17 in the position equivalent to K14:R125 (Fig. 5A).

TABLE 1. Arginine Mutation Hotspots Within the Homology Groups

IF Arg (R)                K14:R125a   K14:R288a   K14:R416a   lamin A:R482b
Domain            Total   1A          2A          2B          TAIL
Homology group
Type I            133     122         0           1           NA
Type II           23      0           0           0           NA
Type III          87      20          30          8           NA
Type IV           1       0           0           0           0
Type V            152     0           7           11          48
Type VI           1       0           0           0           NA
Total             397     142 (36%)   37 (9%)     20 (5%)     48 (12%)

a Number of Arg mutations at corresponding residues of K14.
b Number of Arg mutations at corresponding residues of lamin A.
NA, not applicable.
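The statement that four of the six arginine codons contain a CpG dinucleotide can be checked directly from the standard genetic code. The Python sketch below is illustrative only (the function names are hypothetical); it lists the CpG-containing arginine codons and the amino acids that result from the two deamination-type transitions, C>T at the first codon position and G>A at the second.

```python
# The six arginine codons of the standard genetic code (DNA alphabet).
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")

# Standard-code translations of the codons reachable by one such transition.
PRODUCTS = {
    "TGT": "Cys", "TGC": "Cys", "TGA": "Ter", "TGG": "Trp",  # C>T at position 1
    "CAT": "His", "CAC": "His", "CAA": "Gln", "CAG": "Gln",  # G>A at position 2
}


def cpg_arginine_codons():
    """Arginine codons whose first two bases form a CpG dinucleotide."""
    return [codon for codon in ARG_CODONS if codon.startswith("CG")]


def deamination_products(codon: str):
    """Amino acids produced from a CGN arginine codon by C>T and by G>A."""
    if codon not in cpg_arginine_codons():
        raise ValueError(f"{codon} is not a CpG-containing arginine codon")
    c_to_t = "T" + codon[1:]
    g_to_a = codon[0] + "A" + codon[2]
    return PRODUCTS[c_to_t], PRODUCTS[g_to_a]
```

`cpg_arginine_codons()` returns the four CGN codons; AGA and AGG are the two arginine codons without a CpG. For CGN codons the transitions yield cysteine, tryptophan, or a stop codon (C>T) and histidine or glutamine (G>A).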

Alongside type I keratins, lamin A (LMNA) and glial fibrillary acidic protein (GFAP) have contributed most to the arginine sequence variants catalogued in the database (Table 1). The majority of the mutated Arg residues (100/152) to date are in the tail domain of lamin A, predominantly at position R482, which with 48 reports is a mutation hotspot. Mutation of this residue was described to cause Dunnigan-type lipodystrophy [Broers et al., 2006; Burke and Stewart, 2006; Cao and Hegele, 2000; Capell and Collins, 2006; Rogers et al., 2005; Shackleton et al., 2000; Speckman et al., 2000; Vigouroux et al., 2000]. Arginine mutation clusters are very common in GFAP: 31 of them are in the 1A domain at codons R79 (equivalent position to K14:R125) and R88 (corresponding to K14:R134), and 30 are in the 2A domain at residue R239 (corresponding to K14:R288). Together, only 12 Arg variants were entered for desmin and eight of them were to be found at R406 (corresponding to K14:R416).

FIGURE 5. ClustalW alignment of keratin proteins. These screenshots show the most prominent hotspot regions for pathogenic mutations in keratins, the helix boundary regions. Comparison of the amino acid sequence of (A) coil 1A (helix initiation region) of type I keratins and (B) coil 2B (helix termination region) of type II keratins. The most frequently mutated amino acids, Arg in type I (R) and Glu in type II (E1–4), are located at equivalent positions within each homology group. The relative positions of these amino acids in the corresponding sequences are listed in the adjacent table. The fractional numbers in the tables indicate the frequency with which this mutation occurred in this specific position, out of the total number of mutated residues (Arg or Glu) in that sequence. The reference positions (K14 for type I and K5 for type II) are underlined.

The helix initiation motif at the start of the rod domain interacts directly with the helix termination motif at the end of the rod domain of the next subunit in the filament and has an essential role in filament assembly [Herrmann and Aebi, 1998]. The second most frequently mutated residue in the database was glutamic acid (Glu, E), and the majority of these mutations lie within the helix termination motif (see Table 2). Analysis of the distribution of pathogenic sequence variants of glutamic acid revealed that 92 out of 142 mutations affecting glutamic acid were found in type II keratins in the helix termination motif, while 133 out of 397 of the Arg mutations were in a functionally reciprocal region in type I keratins in the helix initiation motif (Tables 1 and 2). Two functionally significant conserved glutamic acid residues were found in the helix termination motif of the type II group (Fig. 5B). The most common mutation in type II keratins is one of these, E477 in K5 (8/17 K5 substitutions) and its equivalent residue in other type II sequences. This position was described as a hotspot for K86 (formerly known as Hb6) in monilethrix patients and for K2 in association with ichthyosis bullosa of Siemens [Korge et al., 1998; Rothnagel et al., 1994]. The second most common Glu sequence variant is 11 residues farther toward the amino terminal, although interestingly this has not been found to be mutated in K5 or K6a/b. The database contains 16 glutamic acid mutations in GFAP; half of them are in the hotspot region equivalent position to K5:E477 and the other half are mostly in the 1B domain. There are 18 glutamic acid mutation records for lamin A, but they are scattered along the protein sequence.

Of the glutamic acid changes, 79% were changed to lysine, leading to altered charge in the protein, which would be predicted to have a destabilizing effect on the filament [Rothnagel et al., 1994]. The equivalent position to K5:466 is highly conserved in charge across the intermediate filament proteins and this is important for early filament assembly and stability [Wu et al., 2000].
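The size of the charge alteration caused by a glutamic acid to lysine change can be illustrated with a short Python sketch. This is a simplification offered as an assumption, not a stability model: side chains are assigned their nominal charge near neutral pH (Asp and Glu as -1, Lys and Arg as +1, all other residues, including His, as 0), and the function name is hypothetical.

```python
# Nominal side-chain charges near neutral pH (simplifying assumption).
SIDE_CHAIN_CHARGE = {"Asp": -1, "Glu": -1, "Lys": +1, "Arg": +1}


def charge_change(ref: str, alt: str) -> int:
    """Net change in nominal side-chain charge for a substitution ref -> alt.

    Residues are given as three-letter codes; unlisted residues count as 0.
    """
    return SIDE_CHAIN_CHARGE.get(alt, 0) - SIDE_CHAIN_CHARGE.get(ref, 0)
```

Under these assumptions `charge_change("Glu", "Lys")` is +2, a full charge reversal at the site, whereas a change from arginine to an uncharged residue such as cysteine gives -1.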

It is notable that to date there have been no sequence variants identified in K8 or K18 corresponding to the hotspots in helix 1A (K14:R125 equivalent) or helix 2B (K5:E477 equivalent), presumably because these highly disruptive mutations may be lethal in these keratins that are expressed very early in embryogenesis.

The majority of the disease-causing variations (primarily in types I–III) are found in the helix boundary motifs of the α-helical subdomains of 1A and 2B (375 and 234 records, respectively), which is in accordance with the highly conserved molecular architecture (i.e., low tolerance of sequence variation) of these regions across intermediate filaments [Strelkov et al., 2002], and the apparent functional importance of these regions for filament assembly. Another potential contributor to this bias, however, could be that end domain mutations in keratins have been systematically underreported, because of a tendency to sequence primarily the two major hotspots. High numbers of mutations in the tail domain have only been recorded in the lamins (200/232): lamins have several important functional motifs in their tail domains, including a nuclear localization signal (NLS), a region with immunoglobulin-like structure and a CaaX motif for membrane anchorage modifications. Sequence variation in the NLS results in aberrant filament assembly in the cytoplasm, and mutations affecting the Ig-like structure or the CaaX-dependent posttranslational processing have been directly associated with laminopathies [Krimm et al., 2002; Loewinger and McKeon, 1988].

Thus it is apparent that there are some very prominent hotspots of mutation, which are seen in some types of intermediate filaments but not all. Reasons for this may lie in the biology of the cell and tissue types in which these proteins are major structural components. Severe or catastrophic mutations, such as those in the rod ends of keratins, may only be compatible with life when they occur in tissues where cells are naturally replaced very frequently (such as in the epidermis).

DATA SUBMISSION

The growth of the HIFD is envisioned to be a collective effort from the scientific community. Users are encouraged to submit new variants even if they are unpublished, and it is anticipated that this facility will be increasingly used as the database grows. An online submission form is available for user submissions; it is a user-friendly, simplified interpretation of the recommendations of the Human Genome Variation Society. Each new variant will be allocated a unique identifier based upon intermediate filament type and date of acceptance, for publication or for unpublished incorporation into the database. For variants submitted to the database but not submitted for publication to a scientific journal, the identifier number will be allocated on the basis of acceptance by a curator committee: a senior member of staff from each of the collaborating research or clinical groups forms part of a committee to review unpublished variants submitted to the Database.
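As an illustration of the kind of check a submission form based on the Human Genome Variation Society recommendations might apply, the following is a minimal sketch, not the HIFD implementation. It assumes that only simple substitutions are handled, written at the cDNA level as `c.<position><ref>><alt>` and at the protein level with three-letter amino acid codes; the function name `parse_substitution` and the returned dictionary layout are hypothetical, and deletions, insertions, and other variant classes are deliberately left out.

```python
import re

# Three-letter amino acid codes accepted in protein-level descriptions.
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}

# Assumed patterns for simple substitutions only (hypothetical helper,
# not taken from the HIFD submission form).
CDNA_SUBSTITUTION = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")
PROTEIN_SUBSTITUTION = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def parse_substitution(description):
    """Parse a simple substitution; return a dict, or None if malformed."""
    match = CDNA_SUBSTITUTION.fullmatch(description)
    if match:
        position, ref, alt = match.groups()
        if ref == alt:  # a substitution must change the base
            return None
        return {"level": "cDNA", "position": int(position),
                "ref": ref, "alt": alt}

    match = PROTEIN_SUBSTITUTION.fullmatch(description)
    if match:
        ref, position, alt = match.groups()
        if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS or ref == alt:
            return None
        return {"level": "protein", "position": int(position),
                "ref": ref, "alt": alt}

    return None


# Illustrative strings only; they are not quoted from database records.
print(parse_substitution("c.373C>T"))
print(parse_substitution("p.Arg125Cys"))
print(parse_substitution("p.Xyz125Cys"))
```

A check of this kind would only catch malformed descriptions; whether a described change is consistent with the reference sequence would still need to be verified during curation.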

A web-based curation portal has been developed as part of the maintenance effort to enter new records and maintain updated information in the Database. The portal facilitates the curation workflow, allows easy data entry, and enforces basic quality control. Access to the curation portal is password protected and is available only to a group of curators.

CONCLUSION AND FUTURE PROSPECTS

Given the increasing interest in, and growing importance of, intermediate filament associated diseases, and the massive amount of related literature published each year, it is becoming progressively harder for any single researcher to cover the research field. The existence of a comprehensive database such as this, the HIFD, should be a considerable help to all workers in this and related fields. The HIFD is a freely available repository for pathogenic sequence variants and allelic variants of intermediate filament genes, equally useful for clinicians investigating phenotype–genotype correlations, population geneticists studying the distribution of mutations, cell and molecular biologists working on functional analyses, and students studying intermediate filament genes. Moreover, the patterns of mutation incidence that emerge from such a database can inform strategies in clinical genetics, for example, the necessity to sequence the whole coding region. In addition to the manually curated information, the HIFD integrates data from a variety of resources (e.g., HGNC, NCBI Entrez Gene, PubMed, OMIM, and RefSeq), allowing them to be accessed from a single web portal. We hope that this database succeeds in providing a resource that integrates analysis tools for current data with a user-friendly graphical interface.

TABLE 2. Glutamic Acid Mutation Hotspots Within the Homology Groups

IF               Glu (E)   K5:466a    K5:475a   K5:477a    K5:478a
Domain           Total     2B         2B        2B         2B
Homology group
Type I           5         1          0         0          0
Type II          92        20         4         58         3
Type III         18        1          1         5          2
Type IV          9         0          0         0          2
Type V           18        0          0         0          0
Type VI          0         0          0         0          0
Total            142       22 (15%)   5 (4%)    63 (44%)   7 (5%)

a Number of Glu mutations at corresponding residues of K5.
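The column totals and percentages reported in Table 2 can be recomputed from the per-group counts. The short sketch below does this using the counts transcribed from the table; the names `COUNTS`, `POSITIONS`, and `hotspot_shares` are illustrative and are not part of the HIFD.

```python
# Glu-mutation counts per homology group, transcribed from Table 2.
# Columns: total Glu mutations, then counts at residues equivalent to
# K5 positions 466, 475, 477 and 478 (all in helix 2B).
COUNTS = {
    "Type I":   (5, 1, 0, 0, 0),
    "Type II":  (92, 20, 4, 58, 3),
    "Type III": (18, 1, 1, 5, 2),
    "Type IV":  (9, 0, 0, 0, 2),
    "Type V":   (18, 0, 0, 0, 0),
    "Type VI":  (0, 0, 0, 0, 0),
}
POSITIONS = ("K5:466", "K5:475", "K5:477", "K5:478")


def hotspot_shares(counts):
    """Return (total, {position: (count, percent of all Glu mutations)})."""
    columns = list(zip(*counts.values()))  # transpose rows into columns
    total = sum(columns[0])
    shares = {}
    for name, column in zip(POSITIONS, columns[1:]):
        n = sum(column)
        shares[name] = (n, 100.0 * n / total)
    return total, shares


total, shares = hotspot_shares(COUNTS)
print(f"All Glu mutations: {total}")
for name, (n, pct) in shares.items():
    print(f"{name}: {n} ({pct:.0f}%)")
```

Each percentage is the share of all recorded Glu mutations across the six homology groups, so the residue equivalent to K5 E477 alone accounts for 63 of the 142 records.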

It is planned to keep the database current with online data submissions, frequent updates, and a rigorous curation process, to retain and increase its value as a resource for academic, clinical, and commercial users alike. The structure of the database allows for expansion to include further data categories (e.g., a list of antibodies against intermediate filament proteins) and new features, should they be required by the user community.

ACKNOWLEDGMENTS

We are especially grateful to David A.D. Parry, Massey University, New Zealand, for help with the protein domain annotation and to Michael A. Rogers, Deutsches Krebsforschungszentrum, Heidelberg, Germany, for providing updated reference sequences. We thank Sarina Tay for the web design, the many colleagues who have contributed data to the project, and users for their feedback and support.

REFERENCES

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215:403–410.

Broers JL, Ramaekers FC, Bonne G, Yaou RB, Hutchison CJ. 2006. Nuclear lamins: laminopathies and their role in premature ageing. Physiol Rev 86:967–1008.

Burke B, Stewart CL. 2006. The laminopathies: the functional architecture of the nucleus and its contribution to disease. Annu Rev Genomics Hum Genet 7:369–405.

Cao H, Hegele RA. 2000. Nuclear lamin A/C R482Q mutation in Canadian kindreds with Dunnigan-type familial partial lipodystrophy. Hum Mol Genet 9:109–112.

Capell BC, Collins FS. 2006. Human laminopathies: nuclei gone genetically awry. Nat Rev Genet 7:940–952.

Clamp M, Cuff J, Searle SM, Barton GJ. 2004. The Jalview Java alignment editor. Bioinformatics 20:426–427.

Cooper DN, Youssoufian H. 1988. The CpG dinucleotide and human genetic disease. Hum Genet 78:151–155.

den Dunnen JT, Antonarakis SE. 2000. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat 15:7–12.

den Dunnen JT, Antonarakis SE. 2001. Nomenclature for the description of human sequence variations. Hum Genet 109:121–124.

DePianto D, Coulombe PA. 2004. Intermediate filaments and tissue repair. Exp Cell Res 301:68–76.

Domagala W, Lubinski J, Weber K, Osborn M. 1986. Intermediate filament typing of tumor cells in fine needle aspirates by means of monoclonal antibodies. Acta Cytol 30:214–224.

Duncan BK, Miller JH. 1980. Mutagenic deamination of cytosine residues in DNA. Nature 287:560–561.

Fuchs E, Cleveland DW. 1998. A structural scaffolding of intermediate filaments in health and disease. Science 279:514–519.

Fuchs E, Weber K. 1994. Intermediate filaments: structure, dynamics, function, and disease. Annu Rev Biochem 63:345–382.

Goldfarb LG, Vicart P, Goebel HH, Dalakas MC. 2004. Desmin myopathy. Brain 127(Pt 4):723–734.

Herrmann H, Aebi U. 1998. Structure, assembly, and dynamics of intermediate filaments. Subcell Biochem 31:319–362.

Herrmann H, Aebi U. 2004. Intermediate filaments: molecular structure, assembly mechanism, and integration into functionally distinct intracellular scaffolds. Annu Rev Biochem 73:749–789.

Hesse M, Magin TM, Weber K. 2001. Genes for intermediate filament proteins and the draft sequence of the human genome: novel keratin genes and a surprisingly high number of pseudogenes related to keratin genes 8 and 18. J Cell Sci 114(Pt 14):2569–2575.

Korge BP, Healy E, Munro CS, Punter C, Birch-Machin M, Holmes SC, Darlington S, Hamm H, Messenger AG, Rees JL, Traupe H. 1998. A mutational hotspot in the 2B domain of human hair basic keratin 6 (hHb6) in monilethrix patients. J Invest Dermatol 111:896–899.

Krimm I, Ostlund C, Gilquin B, Couprie J, Hossenlopp P, Mornon JP, Bonne G, Courvalin JC, Worman HJ, Zinn-Justin S. 2002. The Ig-like structure of the C-terminal domain of lamin A/C, mutated in muscular dystrophies, cardiomyopathy, and partial lipodystrophy. Structure 10:811–823.

Lane EB, McLean WH. 2004. Keratins and skin disorders. J Pathol 204:355–366.

Lariviere RC, Julien JP. 2004. Functions of intermediate filaments in neuronal development and disease. J Neurobiol 58:131–148.

Letai A, Coulombe PA, McCormick MB, Yu QC, Hutton E, Fuchs E. 1993. Disease severity correlates with position of keratin point mutations in patients with epidermolysis bullosa simplex. Proc Natl Acad Sci USA 90:3197–3201.

Lindahl T, Karran P, Wood RD. 1997. DNA excision repair pathways. Curr Opin Genet Dev 7:158–169.

Loewinger L, McKeon F. 1988. Mutations in the nuclear lamin proteins resulting in their aberrant assembly in the cytoplasm. EMBO J 7:2301–2309.

Machiels BM, Zorenc AH, Endert JM, Kuijpers HJ, van Eys GJ, Ramaekers FC, Broers JL. 1996. An alternative splicing product of the lamin A/C gene lacks exon 10. J Biol Chem 271:9249–9253.

Omary MB, Coulombe PA, McLean WH. 2004. Intermediate filament proteins and their associated diseases. N Engl J Med 351:2087–2100.

Osborn M, van Lessen G, Weber K, Kloppel G, Altmannsberger M. 1986. Differential diagnosis of gastrointestinal carcinomas by using monoclonal antibodies specific for individual keratin polypeptides. Lab Invest 55:497–504.

Padiath QS, Saigoh K, Schiffmann R, Asahara H, Yamada T, Koeppen A, Hogan K, Ptacek LJ, Fu YH. 2007. Corrigendum: lamin B1 duplications cause autosomal dominant leukodystrophy. Nat Genet 39:276.

Porter RM, Lane EB. 2003. Phenotypes, genotypes and their contribution to understanding keratin function. Trends Genet 19:278–285.

Pruitt KD, Tatusova T, Maglott DR. 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35(Database issue):D61–D65.

Ramachandran RD, Perumalsamy V, Hejtmancik JF. 2007. Autosomal recessive juvenile onset cataract associated with mutation in BFSP1. Hum Genet 121:475–482.

Ramaekers FC, Vroom TM, Moesker O, Kant A, Scholte G, Vooijs GP. 1985. The use of antibodies to intermediate filament proteins in the differential diagnosis of lymphoma versus metastatic carcinoma. Histochem J 17:57–70.

Rogers MA, Winter H, Langbein L, Bleiler R, Schweizer J. 2004. The human type I keratin gene family: characterization of new hair follicle specific members and evaluation of the chromosome 17q21.2 gene domain. Differentiation 72:527–540.

Rogers MA, Edler L, Winter H, Langbein L, Beckmann I, Schweizer J. 2005. Characterization of new members of the human type II keratin gene family and a general evaluation of the keratin gene domain on chromosome 12q13.13. J Invest Dermatol 124:536–544.

Rothnagel JA, Traupe H, Wojcik S, Huber M, Hohl D, Pittelkow MR, Saeki H, Ishibashi Y, Roop DR. 1994. Mutations in the rod domain of keratin 2e in patients with ichthyosis bullosa of Siemens. Nat Genet 7:485–490.

Schweizer J, Bowden PE, Coulombe PA, Langbein L, Lane EB, Magin TM, Maltais L, Omary MB, Parry DA, Rogers MA, Wright MW. 2006. New consensus nomenclature for mammalian keratins. J Cell Biol 174:169–174.

Shackleton S, Lloyd DJ, Jackson SN, Evans R, Niermeijer MF, Singh BM, Schmidt H, Brabant G, Kumar S, Durrington PN, Gregory S, O’Rahilly S, Trembath RC. 2000. LMNA, encoding lamin A/C, is mutated in partial lipodystrophy. Nat Genet 24:153–156.

Speckman RA, Garg A, Du F, Bennett L, Veile R, Arioglu E, Taylor SI, Lovett M, Bowcock AM. 2000. Mutational and haplotype analyses of families with familial partial lipodystrophy (Dunnigan variety) reveal recurrent missense mutations in the globular C-terminal domain of lamin A/C. Am J Hum Genet 66:1192–1198.

Steinert PM, Parry DA. 1985. Intermediate filaments: conformity and diversity of expression and structure. Annu Rev Cell Biol 1:41–65.

Strelkov SV, Herrmann H, Geisler N, Wedig T, Zimbelmann R, Aebi U, Burkhard P. 2002. Conserved segments 1A and 2B of the intermediate filament dimer: their atomic structures and role in filament assembly. EMBO J 21:1255–1266.

Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680.

Vigouroux C, Magre J, Vantyghem MC, Bourut C, Lascols O, Shackleton S, Lloyd DJ, Guerci B, Padova G, Valensi P, Grimaldi A, Piquemal R, Touraine P, Trembath RC, Capeau J. 2000. Lamin A/C gene: sex-determined expression of mutations in Dunnigan-type familial partial lipodystrophy and absence of coding mutations in congenital and acquired generalized lipoatrophy. Diabetes 49:1958–1962.

Wu KC, Bryan JT, Morasso MI, Jang SI, Lee JH, Yang JM, Marekov LN, Parry DA, Steinert PM. 2000. Coiled-coil trigger motifs in the 1B and 2B rod domain segments are required for the stability of keratin intermediate filaments. Mol Biol Cell 11:3539–3558.
