Genome Analysis

Genome-Wide Identification of Arabidopsis Coiled-Coil Proteins and Establishment of the ARABI-COIL Database¹

Annkatrin Rose, Sankaraganesh Manikantan, Shannon J. Schraegle, Michael A. Maloy, Eric A. Stahlberg, and Iris Meier*

Department of Plant Biology and Plant Biotechnology Center, Ohio State University, 1060 Carmack Road, Columbus, Ohio 43210 (A.R., I.M.); and Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, Ohio 43212 (S.M., S.J.S., M.A.M., E.A.S.)

Increasing evidence demonstrates the importance of long coiled-coil proteins for the spatial organization of cellular processes. Although several protein classes with long coiled-coil domains have been studied in animals and yeast, our knowledge about plant long coiled-coil proteins is very limited. The repeat nature of the coiled-coil sequence motif often prevents the simple identification of homologs of animal coiled-coil proteins by generic sequence similarity searches. As a consequence, counterparts of many animal proteins with long coiled-coil domains, like lamins, golgins, or microtubule organization center components, have not been identified yet in plants. Here, all Arabidopsis proteins predicted to contain long stretches of coiled-coil domains were identified by applying the algorithm MultiCoil to a genome-wide screen. A searchable protein database, ARABI-COIL (http://www.coiled-coil.org/arabidopsis), was established that integrates information on number, size, and position of predicted coiled-coil domains with subcellular localization signals, transmembrane domains, and available functional annotations. ARABI-COIL serves as a tool to sort and browse Arabidopsis long coiled-coil proteins to facilitate the identification and selection of candidate proteins of potential interest for specific research areas. Using the database, candidate proteins were identified for Arabidopsis membrane-bound, nuclear, and organellar long coiled-coil proteins.
¹ This work was supported by the National Science Foundation 2010 Project (grant no. NSF 0209339 to I.M.).
* Corresponding author; e-mail [email protected]; fax 614–292–5379.
http://www.plantphysiol.org/cgi/doi/10.1104/pp.103.035626.
Plant Physiology, March 2004, Vol. 134, pp. 927–939, www.plantphysiol.org © 2004 American Society of Plant Biologists

The coiled-coil protein oligomerization motif consists of two or more amphipathic alpha helices that twist around each other in a supercoil (Burkhard et al., 2001). It was one of the earliest protein structures discovered, first described for the hair protein alpha keratin (Crick, 1952). Sequences with the capacity to form coiled-coils are characterized by a heptad repeat pattern in which residues in the first and fourth positions are hydrophobic, and residues in the fifth and seventh positions are predominantly charged or polar. The stability of the coiled-coil is derived from a characteristic packing of the hydrophobic side chains into a hydrophobic core ("knobs in holes"; Crick, 1952).

It has been estimated that approximately 10% of all proteins of an organism contain a coiled-coil motif (Liu and Rost, 2001). Roughly, coiled-coil proteins can be grouped into two classes: Short coiled-coil domains of six or seven heptad repeats, also called leucine zippers, are frequently found as homo- and heterodimerization motifs in transcription factors (Jakoby et al., 2002; Vinson et al., 2002). In contrast, long coiled-coil domains of several hundred amino acids are found in a number of functionally distinct proteins, which are often involved in attaching functional protein complexes to larger cellular structures, such as the Golgi, centrosomes, centromeres, or the nuclear envelope.

Some large coiled-coil proteins oligomerize into filaments or networks and themselves have a structural role. One of the three main classes of cytoskeletal proteins, the intermediate filament proteins, represents a well-characterized group of coiled-coil proteins (Strelkov et al., 2003). In addition, the cytoskeletal motor proteins myosin, dynein, and kinesin contain coiled-coil motifs (Schliwa and Woehlke, 2003).

In the past few years, the number of investigated long coiled-coil proteins from animals and yeast has rapidly grown. They include proteins involved in nuclear organization, such as lamins (Goldman et al., 2002; Holaska et al., 2002), NuMA (nuclear mitotic apparatus protein; Compton et al., 1992; Yang et al., 1992), or the SMC (structural maintenance of chromosomes) proteins (Hirano, 2000; Jessberger, 2002). A number of coiled-coil proteins have been characterized that associate with the kinetochore/centromere regions of chromosomes in vertebrates and are involved in assembling other proteins on the kinetochore (Liao et al., 1995; Sugata et al., 1999, 2000; Starr et al., 2000; Fukagawa et al., 2001).

Long coiled-coil proteins play a role in microtubule nucleation and spindle organization during cell division. For example, coiled-coil proteins are involved in the architecture of the spindle pole body, the nuclear envelope-embedded microtubule organization center in yeast (Saccharomyces cerevisiae). They are required for insertion of the spindle pole body into the nuclear envelope (Schramm et al., 2000; Le Masson et al., 2002) and for the precise spatial positioning of the outer plaque, central plaque, and inner plaque (Kilmartin et al., 1993; Chen et al., 1998; Souès and Adams, 1998; Schaerer et al., 2001). The vertebrate microtubule organization center, the centrosome, also contains a number of long coiled-coil proteins. They are involved in microtubule nucleation, scaffolding/bridging of other proteins, and the anchoring of signaling components such as calmodulin, protein kinase C, and protein kinase A (Fava et al., 1999; Takahashi et al., 1999; Witczak et al., 1999; Flory et al., 2000; Li et al., 2000; Takahashi et al., 2000; Moisoi et al., 2002; Sillibourne et al., 2002; Takahashi et al., 2002). In nematodes, the coiled-coil proteins PUMA1 (Esteban et al., 1998) and LIN-5 (Lorson et al., 2000) have been found to localize to the spindle apparatus in a cell cycle- and microtubule-dependent manner. PUMA1 might be part of a "centromeric matrix," whereas LIN-5 is thought to play a role in localizing or regulating a motor-protein complex and/or connecting the spindle apparatus with the cell cortex.

In the cytoplasm, long coiled-coil proteins are involved in the organization of and targeting to membrane systems. The golgin family comprises a group of coiled-coil peripheral or integral membrane proteins associated with the Golgi apparatus. They have been shown to function in a variety of membrane-membrane and membrane-cytoskeleton tethering events at the Golgi and are regulated by small GTPases of the Rab and Arl families (Barr and Short, 2003). It has been suggested that golgins and the related fruit fly (Drosophila melanogaster) protein Lva (Lava Lamp; Sisson et al., 2000) form a Golgi matrix that serves as the structural scaffold for the enzyme-containing membranes of the Golgi apparatus and may provide the means of partitioning the Golgi during mitosis (Seemann et al., 2000, 2002). A group of long coiled-coil proteins associated with both the centrosome and the Golgi are involved in anchoring both cyclic nucleotide phosphodiesterase and cAMP-dependent protein kinase A to the centrosome/Golgi, suggesting a role of these coiled-coil proteins in cAMP signal compartmentalization (Witczak et al., 1999; Diviani and Scott, 2001; Verde et al., 2001).

These examples serve to illustrate the emerging function of long coiled-coil proteins as anchors for the regulation of protein positioning in the cell, thus both separating and coordinating signaling pathways in a temporal and spatial manner and organizing cellular processes like cell division. In contrast to animals and yeast, only a handful of long coiled-coil proteins have been studied in plants. Besides the large families of myosins and kinesins (Reddy and Day, 2001a, 2001b; Smith, 2002), the homologs of the mammalian SMC proteins have been characterized in Arabidopsis (Mengiste et al., 1999; Hanin et al., 2000; Liu et al., 2002). In addition, a small number of apparently plant-specific coiled-coil proteins have been identified. The carrot (Daucus carota) coiled-coil protein NMCP1 (Nuclear Matrix Constituent Protein 1) is located at the nuclear rim during interphase and at the spindle poles in mitotic cells (Masuda et al., 1997). CIP1 (COP1-interactive protein 1), a cytoskeleton-associated coiled-coil protein, binds to the photomorphogenesis suppressor COP1 (Matsui et al., 1995). MFP1 is a DNA-binding protein associated with the thylakoids in plant chloroplasts (Jeong et al., 2003). PF2 is a large coiled-coil protein found in a screen for motility mutants in the alga Chlamydomonas reinhardtii (Rupp and Porter, 2003), where it is required for the assembly of the dynein regulatory complex. Besides these few examples, nothing is presently known about plant long coiled-coil proteins and their potential functions in the anchoring and structuring of cellular events.

In BLAST searches of the whole Arabidopsis genome for all animal and yeast proteins discussed above, significant homologies can only be found for the protein families of the SMC proteins and myosins, with E values typically below e−100, kinesins with E values in the e−50 to e−100 range, and for the nuclear pore complex protein Tpr (5e−78). In all other cases, the best hits for functionally very different proteins are the same three proteins from the Arabidopsis genome, indicating the difficulty in using sequence similarity algorithms to identify functional homologs of long coiled-coil proteins. The multiple heptad repeats in long coiled-coil domains cause a low and promiscuous sequence similarity between long coiled-coil proteins, which leads to meaningless results. This clearly demonstrates the need to use methods other than sequence comparison for the identification of plant long coiled-coil proteins potentially involved in the diverse cellular functions discussed above.

Although the heptad repeat pattern causes false hits in sequence similarity searches, it can be easily exploited by computational methods to predict coiled-coil domains in amino acid sequences (Parry, 1982; Lupas et al., 1991). More recently, the combination of coiled-coil prediction algorithms such as MultiCoil (Wolf et al., 1997) with whole-genome information has permitted the mining of all coiled-coil proteins of an organism. Using this approach on a total yeast genome translation, approximately 300 two-stranded and 250 three-stranded coiled-coils have been identified (Newman et al., 2000). Over one-half of these open reading frames represent proteins of unknown function. An investigation of a number of structural motifs in several whole genomes showed independently that the human (Homo sapiens), fruit fly, Caenorhabditis elegans, and yeast genomes contain roughly 10% coiled-coil proteins (Liu and Rost, 2001).

We report here the identification of all long coiled-coil proteins from Arabidopsis and the establishment of a novel searchable database, ARABI-COIL (http://www.coiled-coil.org/arabidopsis). In the future, as more fully annotated plant genomes such as rice (Oryza sativa) and C. reinhardtii become available, our analysis pipeline will be applied to these species as well, and the data will be added to the database.

RESULTS

Genome-Wide Screen for Coiled-Coil Proteins

Arabidopsis long coiled-coil proteins were identified using the algorithm MultiCoil (Wolf et al., 1997). MultiCoil is capable of predicting two-stranded and three-stranded coiled-coils with significantly fewer false positives than earlier prediction methods (Wolf et al., 1997). Figure 1 shows a comparison of MultiCoil performance with older prediction methods, using Arabidopsis MFP1 (Harder et al., 2000; Jeong et al., 2003) as an example. MultiCoil offers the highest stringency of the methods tested. The program is available as a Web resource allowing prediction of individual sequences online. With more than 25,000 sequences requiring analysis, single-sequence submission through the Web was not a tractable approach; therefore, the MultiCoil program was installed on a local multiprocessor system to run the Arabidopsis proteome sequence set. After confirming the consistency of results between the locally installed version of MultiCoil and the available Web resource with a small subset of test sequences, the entire Arabidopsis predicted proteome (http://www.ebi.ac.uk/proteome/ARATH/) was processed. Using a cutoff value of 20 amino acids minimum length for a coiled-coil domain and 0.5 for the probability score, 5.6% of all Arabidopsis sequences (about 1,500 proteins) were identified as coiled-coil proteins.
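The thresholding step just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the MultiCoil or ExtractProp code: it assumes that per-residue probability scores are already available as a list of floats, that a residue counts as coiled-coil when its score is at or above the cutoff (the text does not say whether the comparison is strict), and the function name `call_coiled_coil_domains` is hypothetical.

```python
from typing import List, Sequence, Tuple


def call_coiled_coil_domains(
    scores: Sequence[float],
    cutoff: float = 0.5,      # probability cutoff quoted in the text
    min_length: int = 20,     # minimum domain length quoted in the text
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of predicted domains.

    A domain is a maximal run of consecutive residues whose score is
    >= cutoff (assumed comparison) and whose length is >= min_length.
    """
    domains: List[Tuple[int, int]] = []
    run_start = None  # 0-based index where the current run began
    for i, score in enumerate(scores):
        if score >= cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended at index i - 1; keep it only if long enough.
            if i - run_start >= min_length:
                domains.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(scores) - run_start >= min_length:
        domains.append((run_start + 1, len(scores)))
    return domains


# Toy profile: a 25-residue run and a 10-residue run above the cutoff.
toy_scores = [0.1] * 10 + [0.9] * 25 + [0.2] * 5 + [0.8] * 10
print(call_coiled_coil_domains(toy_scores))
```

In this toy profile only the 25-residue run is reported, because the 10-residue run falls below the 20-residue minimum. A protein would count as a coiled-coil protein in this sketch if the returned list is not empty.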
Of these coiled-coil proteins, 386 (1.5% of the genome) were predicted to have coiled-coil domains of 50 or more amino acids in length.

Selection of Proteins with Long Coiled-Coil Domains

To focus on proteins potentially involved in structural aspects of the cells and to exclude shorter coiled-coil domains like leucine zippers, the output from the original MultiCoil run was further processed and filtered. A software package (ExtractProp Suite, see "Materials and Methods") was developed to automate the processing of data and selection of sequences. In this process, small gaps shorter than 25 amino acids between predicted coiled-coil domains were ignored and the domains treated as a single, larger coiled-coil (Fig. 1D). The relative consistency of the prediction between Arabidopsis and animal sequences was tested by comparing family members of the conserved SMC proteins. SMC proteins typically contain two clusters of coiled-coil domains separated by a central linker domain. Figure 1E shows that this domain distribution was observed for human SMC2 and its two Arabidopsis homologs.

Figure 1. Comparison of different algorithms for coiled-coil domain prediction. At3g16000 (AtMFP1, GenBank accession no. BAB02666; Harder et al., 2000; Jeong et al., 2003) is shown as an example. A to C, Probability scores plotted against the length of the protein in amino acids (aa) and bar diagram generated from plots. The dashed lines mark the cutoff score of 0.5. Coiled-coil domains are shown in gray in the bar diagram. A, COILS; B, PAIRCOIL; C, MultiCoil. D, Bar diagram generated after processing of MultiCoil data through ExtractProp ("Materials and Methods") to eliminate short gaps in coiled-coil domain predictions. E, Comparison of domain predictions by MultiCoil using the same score and length cutoff parameters for human and Arabidopsis structural maintenance of chromosomes 2 proteins. HsSMC2, NP_006435; AtSMC2, NP_190330; AtTTN3, NP_201047.

Because a high-stringency algorithm like MultiCoil often predicts long stretches of coiled-coil domains with significant intradomain gaps (as shown in Fig. 1), a filter was introduced to include only proteins with at least one coiled-coil domain of at least 70 amino acids, two domains and a minimal domain length of 50 amino acids, or three domains and a minimal domain length of 30 amino acids in the final data set. This strategy isolated 286 sequences with long or multiple coiled-coil domains while excluding 97% of the known Arabidopsis bZIP proteins (Jakoby et al., 2002). Table I shows the distribution of the maximum length of predicted coiled-coil domains per protein in the ARABI-COIL database. The total percentage of the residues per protein sequence predicted to be in a coiled-coil region is summarized in Table II.

Table I. Distribution of lengths of long coiled-coil domains in the ARABI-COIL database
CC, Coiled-coil.

Maximum CC Length    No. of Proteins
≥400                 7
250–399              12
150–249              57
100–149              67
<100ᵃ                143

ᵃ No. includes only proteins with maximum domain length/domain no. of at least 70/1, 50/2, or 30/3.

The coiled-coil property information presented and searchable in ARABI-COIL is summarized for a single protein example in Table III. It includes the predicted number of coiled-coil domains, length of the largest coiled-coil domain, percentages of the total sequence and the N-terminal, middle, and C-terminal one-third of the sequence predicted to be in a coiled-coil, and the highest prediction score over the whole sequence. The ARABI-COIL database search form allows for searches limited to a certain length of protein and/or coiled-coil domain and percentage coverage over the whole length and/or the N-terminal, middle, and C-terminal one-third of the sequence. A second output table summarizes the detailed positions of all predicted coiled-coil domains and the length of the longest intradomain gap for each given domain (Table IV). A graphical representation of the predicted coiled-coil structures was included (Fig. 1D). Links to National Center for Biotechnology Information (NCBI) GenBank sequence entries are provided in ARABI-COIL to retrieve the underlying sequence information for each database entry.

Table III. Coiled-coil features as presented in ARABI-COIL
Example prediction data for At3g16000 (GenBank accession no. BAB02666; compare with Figure 1, C and D, and Table IV).

Computed Coiled-Coil Property                            Value
No. of coiled-coil domains in sequence                   2
Length of largest coiled-coil domain                     338
% of entire sequence predicted as coiled-coil            70
% of N-terminal one-third predicted as coiled-coil       40
% of middle one-third predicted as coiled-coil           99
% of C-terminal one-third predicted as coiled-coil       72
Highest coiled-coil probability score                    1.0

Functional Categories of Arabidopsis Long Coiled-Coil Proteins

Only 10% of the 286 proteins in ARABI-COIL have been characterized so far by experimental data, with about one-half of these falling into the categories kinesin or myosin motors or SMC proteins. For a preliminary estimate of protein functions, annotations were assigned manually. They are based on available publications (references linked to PubMed are available in ARABI-COIL), annotations in NCBI RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/), The Arabidopsis Information Resource (http://www.arabidopsis.org/), The Institute for Genomic Research (http://www.tigr.org/tdb/e2k1/ath1/), and the Munich Information Center for Protein Sequences (http://mips.gsf.de/proj/thal/db/), and conserved domains outside of the coiled-coil domain. The ARABI-COIL database can be searched for keywords within these annotations. Figure 2 summarizes the functional annotations of the proteins in ARABI-COIL and shows that two main fractions of the annotated proteins are involved in either cytoskeletal or nuclear functions. The putative function of 66% of the sequences in ARABI-COIL remains unknown.
The percentage of uncharacterized ORFs increases with the percentage coverage, from 60% unknown proteins with less than 50% coiled-coil coverage to 86% unknown proteins with 50% or more coiled-coil coverage. Seventy-five percent of the proteins with unknown function matched known expressed sequence tags and were annotated as "expressed proteins." The remaining proteins without expressed sequence tag data were annotated as "hypothetical proteins." Table V lists all Arabidopsis proteins of at least 500 amino acids in length with a predicted coiled-coil coverage of more than 25% for which published data are presently available.

Figure 2. Putative functions of proteins in ARABI-COIL based on annotations. TFs, Transcription factors; DNA RMR, DNA recombination, modification, and repair.

Table II. Distribution of percentage coiled-coil coverage per protein in the ARABI-COIL database
CC, Coiled-coil.

CC Coverage (%)    No. of Proteins
≥80                2
79–60              13
59–40              48
39–20              112
<20                111

Table IV. Coiled-coil domain data as presented in ARABI-COIL
Example prediction data for At3g16000 (GenBank accession no. BAB02666; compare with Figure 1, C and D, and Table III).

Start    End    Property                    Value
145      482    Length of domain            338
145      482    Maximum intradomain gap     14
518      692    Length of domain            175
518      692    Maximum intradomain gap     0

Table V. Previously investigated long coiled-coil proteins in Arabidopsis
Proteins listed are of at least 500 amino acids in length with at least 25% predicted to form coiled-coils.

AGI Locus    Protein        Protein Length    Maximum CC Length    Total CC Coverage (%)    Putative Function                          References
At5g41790    CIP1           1,305             1,060                81                       Cytoskeleton, COP1 signal transduction     Matsui et al. (1995)
At3g16000    MFP1           727               338                  70                       Chloroplast DNA-binding protein            Jeong et al. (2003)
At1g77580    FPP1           629               236                  55                       Unknown                                    Gindullis et al. (2002)
At1g21810    FPP2           639               196                  50                       Unknown                                    Gindullis et al. (2002)
At3g05270    FPP3           603               208                  40                       Unknown                                    Gindullis et al. (2002)
At1g13220    NMCP1-like     1,128             200                  39                       Nuclear matrix                             Masuda et al. (1997)ᵃ
At1g19835    FPP4           1,024             201                  38                       Unknown                                    Gindullis et al. (2002)
At1g67230    NMCP1-like     1,166             198                  38                       Nuclear matrix                             Masuda et al. (1997)ᵃ
At2g18540    preproMP73     699               253                  36                       Storage protein                            Mitsuhashi et al. (2001)ᵃ
At3g12020    T21B14.15      956               109                  31                       Kinesin                                    Reddy and Day (2002b)
At1g47900    FPP6           1,054             110                  30                       Unknown                                    Gindullis et al. (2002)
At4g21270    KatA, ATK1     793               198                  30                       Kinesin                                    Marcus et al. (2002, 2003)
At4g27180    KatB, ATK2     744               138                  30                       Kinesin                                    Mitsui et al. (1994, 1996)
At4g38070    bHLH131        1,496             122                  30                       Transcription factor                       Heim et al. (2003)
At5g48600    SMC4           1,241             108                  30                       Condensin                                  Liu et al. (2002)
At1g68790    NMCP1-like     1,085             173                  29                       Nuclear matrix                             Masuda et al. (1997)ᵃ
At3g10180    F14P13.22      1,348             276                  29                       Kinesin                                    Reddy and Day (2002b)
At5g65770    NMCP1-like     1,042             122                  29                       Nuclear matrix                             Masuda et al. (1997)ᵃ
At3g47460    SMC2           1,171             141                  28                       Condensin                                  Liu et al. (2002)
At4g36120    FPP5           981               125                  28                       Unknown                                    Gindullis et al. (2002)
At5g54670    KatC, ATK3     746               94                   26                       Kinesin                                    Mitsui et al. (1994, 1996)
At5g61460    MIM1           1,057             76                   25                       DNA repair                                 Mengiste et al. (1999), Hanin et al. (2000)
At5g62410    TTN3           1,175             202                  25                       Condensin                                  Liu et al. (2002)

ᵃ Published data only available for homologs from other plant species (NMCP1, carrot; and preproMP73, pumpkin).
bHLH, basic Helix-Loop-Helix; CC, coiled-coil; CIP, COP1 Interactive Protein; MFP, MAR-binding filament-like Protein; MIM, hypersensitive to MMS, irradiation, and MMC; NMCP, nuclear matrix constituent protein; SMC, structural maintenance of chromosomes; TTN, TITAN.

The Arabidopsis genome appears to encode only one protein with a continuous coiled-coil domain of more than 1,000 amino acids. This protein, CIP1, has been characterized as a component of the cytoskeleton and functions as a binding site for the photomorphogenesis regulator COP1 (Matsui et al., 1995). Another characterized protein with a high coiled-coil coverage is AtMFP1, a DNA-binding chloroplast thylakoid protein (Jeong et al., 2003). Of the remaining proteins in Table V, six have been described as a family of filament-like proteins (FPPs; Gindullis et al., 2002), five contain a kinesin motor domain, and eight have functions suggesting their localization in the nucleus.

Putative Membrane-Associated Long Coiled-Coil Proteins in Arabidopsis

In addition to coiled-coil domain prediction, transmembrane domain prediction data from several programs (see Table VI) were incorporated in the database, including the number of predicted transmembrane domains in the ARAMEMNON database (http://aramemnon.botanik.uni-koeln.de; Schwacke et al., 2003). The ARABI-COIL search page allows for limiting searches to coiled-coil proteins with a certain number of predicted transmembrane domains in combination with specific coiled-coil properties. Cross-references to the more comprehensive details pages in ARAMEMNON, which include graphic comparisons of a larger number of transmembrane prediction programs, are provided with the output details for proteins with an entry in that database. Fourteen proteins were identified that are at least 500 amino acids long, have at least 25% coiled-coil coverage, and contain at least one transmembrane domain according to ARAMEMNON. Figure 3 shows a schematic representation of these proteins. Four proteins in this category have been characterized previously. AtMFP1 is a thylakoid membrane protein (Jeong et al., 2003). At3g12020 contains a kinesin motor domain, suggesting that it might function as a membrane-bound microtubule motor (Reddy and Day, 2001b). The Arabidopsis SMC2 homologs (Liu et al., 2002) are predicted to contain a transmembrane domain in their C-terminal domain. All novel proteins in this category contain a C-terminal predicted transmembrane domain.

Table VI. Programs used for sequence analysis and targeting prediction
cTP, Chloroplast targeting peptide; mTP, mitochondrial targeting peptide; NLS, nuclear localization signal; TMH, transmembrane helix; SP, signal peptide (ER, secretory pathway).

Program            Predicted Feature       URL                                                                Reference
MultiCoil 1.0      Coiled-coil             http://theory.lcs.mit.edu/multicoil                                Wolf et al. (1997)
ChloroP 1.1        cTP                     http://www.cbs.dtu.dk/services/ChloroP/                            Emanuelsson et al. (1999)
Predotar 0.5       cTP, mTP                http://www.inra.fr/predotar/
TargetP 1.01       cTP, mTP, SP, other     http://www.cbs.dtu.dk/services/TargetP/                            Nielsen et al. (1997a), Emanuelsson et al. (2000)
MitoProt II        mTP                     http://www.mips.biochem.mpg.de/cgi-bin/proj/medgen/mitofilter      Claros et al. (1996)
PredictNLS         NLS                     http://maple.bioc.columbia.edu/predictNLS/                         Cokol et al. (2000)
PSORT              NLS, TMH                http://psort.nibb.ac.jp/                                           Nakai and Horton (1999)
SignalP 2.0 HMM    SP                      http://www.cbs.dtu.dk/services/SignalP-2.0/                        Nielsen and Krogh (1998)
SignalP 2.0 NN     SP                      http://www.cbs.dtu.dk/services/SignalP-2.0/                        Nielsen et al. (1997b)
TMHMM 2.0          TMH                     http://www.cbs.dtu.dk/services/TMHMM/                              Sonnhammer et al. (1998), Krogh et al. (2001)
HMMTOP 1.1         TMH                     http://www.enzim.hu/hmmtop/                                        Tusnády and Simon (1998)
Several            TMH                     http://aramemnon.botanik.uni-koeln.de                              Schwacke et al. (2003)

Figure 3. Putative membrane proteins with high coiled-coil coverage. Proteins are of at least 500 amino acids in length and at least 25% coiled-coil coverage, sorted from top to bottom by decreasing percentage of coiled-coil coverage. Bar diagrams show the coiled-coil domain structure as represented in the ARABI-COIL database; gray boxes, coiled-coil domains; black boxes, transmembrane domains according to ARAMEMNON. Proteins belonging to gene families are boxed together. a, Proteins are characterized by published data (see Table V for comparison).

Long Coiled-Coil Proteins Are Predicted in All Cellular Compartments Investigated

The ARABI-COIL sequence set was further analyzed using a battery of programs to predict putative subcellular targeting of the proteins (Table VI). Two (NLSs) or three (N-terminal targeting signals) prediction scores were included in the database for each targeting signal. The ARABI-COIL search options allow limiting searches to coiled-coil proteins with a certain predicted localization in addition to transmembrane prediction and selected coiled-coil features. The results returned include all proteins for which at least one program gives a prediction for that location above a probability cutoff of 0.5. The reliability of the prediction scores is color-coded for easier reference on the online result details page by using yellow for lower probability (0.50–0.74) and red for higher probability (0.75–1.00). Table VII shows an example of the detailed prediction output, which also illustrates how predicting the localization of individual proteins can be ambiguous.

Table VII. Targeting signal predictions summarized in ARABI-COIL for At3g16000
The scores shown in the table were acquired using GenBank accession no. BAB02666. cTP, Chloroplast targeting peptide; mTP, mitochondrial targeting peptide; SP, signal peptide for secretory pathway.

Predicted Signal    Program        Score
cTP                 TargetP        0.78
cTP                 ChloroP        0.96ᵃ
cTP                 Predotar       0.88
mTP                 Predotar       0.15
mTP                 Mitoprot       0.72
mTP                 TargetP        0.16
SP                  TargetP        0.00
SP                  SignalP HMM    0.70
SP                  SignalP NN     0.46
NLS                 PredictNLS     1.00ᵇ
NLS                 PSORT          0.07
No signal           TargetP        0.06

ᵃ ChloroP prediction scores were normalized to a 0 to 1 scale.
ᵇ PredictNLS generates a 'yes' = 1/'no' = 0 prediction.

To summarize the predicted targeting for all proteins, the cross-program average of the scores for each type of targeting signal was computed, and probability values of 0.5 and higher were counted as positive. Figure 4 shows the computationally predicted distribution of the ARABI-COIL proteins in the cell using this method. Only proteins with an entry in ARAMEMNON were counted as transmembrane proteins for this analysis. The result shows that proteins with high coiled-coil coverage are predicted to be present in all compartments of the plant cell for which targeting signals were predicted computationally.

Figure 4. Summary of subcellular targeting predictions of proteins in ARABI-COIL. The mean values of all prediction programs used (see Table VI) were computed and localization predictions with a mean value above a probability cutoff score of 0.5 were counted as positive for that location. Proteins with mean values above cutoff for two or more compartments of the cell were labeled "unclear."

Putative Nuclear Long Coiled-Coil Proteins in Arabidopsis

About 10% of the annotations in ARABI-COIL suggest a nuclear function, and Figure 4 illustrates that 16% of the proteins in ARABI-COIL are predicted to be nuclear. The ARABI-COIL search functions were used to single out putative nuclear proteins of more than 500 amino acids in length with coiled-coil coverages above 25%. The resulting group of 37 proteins was manually checked for consistency of the predictions as described for Figure 4 to exclude proteins with only weak nuclear prediction or with ambiguous predictions ("unclear" in Fig. 4). The domain structures of the remaining 19 putative nuclear long coiled-coil proteins are summarized in Figure 5. The proteins with the highest predicted coiled-coil coverage are functionally uncharacterized so far. Three of the four Arabidopsis homologs of the carrot nuclear matrix protein NMCP1 (Masuda et al., 1997) are predicted as nuclear proteins. Other published proteins in the nuclear-predicted fraction include the putative transcription factor bHLH131 (Heim et al., 2003) and the condensin SMC4 protein (Liu et al., 2002).

Putative Organellar Long Coiled-Coil Proteins in Arabidopsis

Searching ARABI-COIL for proteins with N-terminal targeting signals such as mitochondrial or plastid targeting or secretory signal peptides identified 52 proteins matching the criteria used for Table V and Figure 3. Twenty-seven were predicted by at least one method to target to the chloroplasts, 23 to the mitochondria, and two to the secretory pathway. Disregarding proteins with cross-program average scores below the cutoff or strong ambiguous predictions ("unclear" in Fig. 4), the remaining proteins with clear targeting predictions are summarized in Figure 6. Of the eight proteins predicted to target to plastids, only the localization of AtMFP1 has been characterized experimentally (Jeong et al., 2003). None of the five proteins predicted as mitochondrial has been characterized. The only protein with a clear prediction to follow the secretory pathway shows significant similarity to the pumpkin (Cucurbita maxima) protein preproMP73, a protein targeted to storage vacuoles (Mitsuhashi et al., 2001).

Figure 6. Proteins with high coiled-coil coverage and putative N-terminal targeting signals. Proteins are of at least 500 amino acids in length and at least 25% coiled-coil coverage, sorted from top to bottom by decreasing percentage of coiled-coil coverage. Bar diagrams, Coiled-coil domain structure as represented in the ARABI-COIL database; gray boxes, coiled-coil domains; black boxes, transmembrane domains according to ARAMEMNON. Proteins belonging to gene families are boxed together. a, Proteins are characterized by published data (see Table V for comparison). cTP, Chloroplast targeting peptide; mTP, mitochondrial targeting peptide; SP, signal peptide for secretory pathway.

Putative Cytoplasmic Long Coiled-Coil Proteins in Arabidopsis

Of the proteins longer than 500 amino acids with at least 25% coiled-coil coverage, 29 fall into the group defined as cytoplasmic in Figure 4. These proteins are summarized in Figure 7. The cytoskeletal protein CIP1 (Matsui et al., 1995), having the longest continuous coiled-coil domain in Arabidopsis predicted by MultiCoil in our screen, falls into this group. Other proteins include members of the family of FPPs (Gindullis et al., 2002) and the kinesin family of KatA, KatB, and KatC (Mitsui et al., 1994, 1996; Marcus et al., 2002, 2003).

DISCUSSION

Increasing experimental evidence demonstrates the importance of long coiled-coil proteins for the spatial organization of cellular processes. Although several protein classes with long coiled-coil domains have been studied in animals and yeast, our knowledge about plant long coiled-coil proteins is very limited. The repeat nature of the coiled-coil sequence motif makes it almost impossible to identify homologs of animal coiled-coil proteins without highly conserved non-coiled-coil domains. As a consequence, counterparts of many animal proteins with long coiled-coil domains, like lamins, golgins, or microtubule organization center components, have not been identified yet in plants. The ARABI-COIL database was created to provide the research community with a tool to sort and browse Arabidopsis long coiled-coil proteins to facilitate the identification and selection of candidate proteins of potential interest for specific research areas.

Coiled-Coil Prediction and Selection Criteria

To predict coiled-coil structures based on amino acid sequence, several programs with differing performance rates are available. COILS and NEWCOILS (Lupas et al., 1991), based on Parry's algorithm (Parry, 1982), have become the standard for coiled-coil prediction and are used widely in published literature. However, COILS generates a high number of false positives by predicting non-coiled-coil alpha-helical regions as coiled-coil structures (Berger et al., 1995; Lupas, 1997). In tests on the PDB database of solved protein structures, two-thirds of the sequences predicted by COILS did not contain coiled-coils (Berger and Singh, 1997). Thus, this program would generate a high number of false hits if used for a genome-wide screen. The PAIRCOIL program takes pair-wise residue correlations within the heptad repeat into account and performs significantly better than COILS in avoiding false positives. However, PAIRCOIL often fails to predict antiparallel or multistranded coiled-coils (Lupas, 1997).

Figure 5. Putative nuclear proteins with high coiled-coil coverage. Proteins are of at least 500 amino acids in length and at least 25% coiled-coil coverage, sorted from top to bottom by decreasing percentage of coiled-coil coverage. Bar diagrams, Coiled-coil domain structure as represented in the ARABI-COIL database; gray boxes, coiled-coil domains; black boxes, transmembrane domains according to ARAMEMNON. Proteins belonging to gene families are boxed together. a, Proteins are characterized by published data (see Table V
MultiCoil, forcomparison). based on data of two-stranded and three-stranded 934 Downloaded from on January 4, 2019 - Published by www.plantphysiol.org Plant Physiol. Vol. 134, 2004 Copyright © 2004 American Society of Plant Biologists. All rights reserved. TheArabidopsisCoiled-CoilProteinDatabaseARABI-COIL In a genome-wide screen using the MultiCoil pro- gram,5.6%ofallArabidopsissequences(about1,500) were identified as coiled-coil proteins. This number is lower than those found for other eukaryotic ge- nomes(about10%;LiuandRost,2001).However,the older studies did not describe using a cutoff in length. Because MultiCoil has no internal length cut- off and the formation of coiled-coil structures re- quires a minimum number of residues, we believe thesettingofaminimaldomainsizemoresignificant than a high per-residue probability cutoff. Studies using synthetic peptides showed that a minimum length of three to four heptads or six to eight helical turns is required for peptides to form stable coiled- coils (Lumb et al., 1994; Su et al., 1994; Litowski and Hodges, 2001). The cutoff of 20 amino acids minimal length for a coiled-coil domain used in our primary screen allows for the formation of about six helical turns in the secondary structure of the protein. ThegoaloftheARABI-COILdatabasecreationwas to provide a searchable selection of proteins with high coiled-coil coverage and long coiled-coil do- mains putatively involved in structural functions. Many long coiled-coil domains, for example that of AtMFP1 (Fig. 1), contain small gaps and disruptions in the overall coiled-coil structure predicted by Mul- tiCoil.Toidentifythecompletelengthofthelongbut discontinuouscoiled-coildomainsofsuchproteins,a featurewasincludedtoignoresmallgapsoflessthan 25 amino acids between predicted coiled-coil struc- tures, thus fusing the predictions for these domains to a single larger coiled-coil as exemplified in Figure 1D. 
Subsequently, a subset of proteins containing long coiled-coil regions was selected while trying to exclude shorter coiled-coil motifs such as Leucine zippers. The criteria chosen succeeded in excluding 97% of the known Arabidopsis bZIPs (Jakoby et al., 2002),thusprovidingastringentselectionagainstthe inclusion of Leu-zipper-containing proteins. The Figure7. Putativecytosolicproteinswithhighcoiled-coilcoverage. bZIPfactorsincludedinARABI-COIL,suchasATB2, Proteinsareofatleast500aminoacidsinlengthandatleast25% contain unusually long coiled-coil domains for this coiled-coil coverage, sorted from top to bottom by decreasing per- protein family (Rook et al., 1998). However, Multi- centage of coiled-coil coverage. Bar diagrams, Coiled-coil domain Coilpredictiondataforshorterdomainsareavailable structure as represented in the ARABI-COIL database; gray boxes, and integrated into the ARABI-COIL database envi- coiled-coil domains. Proteins belonging to gene families are boxed ronment.Futureenhancementsofthedatabasecould together.a,Proteinsarecharacterizedbypublisheddata(seeTableV includemakingdataforthecurrentlyexcludedshort forcomparison). coiled-coil proteins available to users by offering a choice of additional selection parameter combina- coiled-coils, is capable of predicting both types of tionsthatincorporateproteinswithshorterdomains. structures while achieving a similar low rate of false predictions as PAIRCOIL (Wolf et al., 1997). There- fore,MultiCoilwasappliedastheprogramofchoice ARABI-COIL Search Functions and Prediction to define coiled-coil proteins from the Arabidopsis Data Interpretation genome for this analysis. A probability cutoff of 0.5 was used, which is the default suggested by the The search features provided to browse the data- program developers. 
Because MultiCoil is already base allow users to select for proteins of a certain more stringent than the older programs, using this coiled-coillengthandcoverage.Byprovidingcoiled- moderate cutoff leads to a prediction of coiled-coil coil percentages predicted for the N-terminal, mid- structuresthataremorecomparablewiththoseoften dle, and C-terminal domains of the protein, the da- found in the literature (see Fig. 1). tabase allows for a crude search for coiled-coil Plant Physiol. Vol. 134, 2004 Downloaded from on January 4, 2019 - Published by www.plantphysiol.org 935 Copyright © 2004 American Society of Plant Biologists. All rights reserved. Roseetal. domain configurations. This facilitates the identifica- diction scores that allows the user to compare and tionofproteinswithsimilarcoiled-coildomainstruc- evaluate the results from a number of prediction tures without detectable sequence homology. programs without having to submit the sequence to Theincorporationoftransmembraneandtargeting the individual prediction servers. However, experi- signal prediction data allows the user to specify mentaldatawillhavetoprovewhetherthepredicted searches for putative chloroplast, mitochondria, se- targeting actually occurs in the cell. cretory pathway, nuclear, and transmembrane pro- teins. This helps to identify subsets of coiled-coil proteins predicted to localize to a certain cell com- Future Directions of the ARABI-COIL Database partment that are of enhanced interest for further FutureenhancementsoftheARABI-COILdatabase functional studies. and Web site will include the incorporation of addi- However,thecomparisonoflocalizationprediction tional prediction data and adding the capability of results from different programs and with available BLASTsearchesagainstthesequencespopulatingthe experimental data shows that computationally re- database. 
As more fully annotated plant genomes trieved targeting predictions are ambiguous (Table become available, the ARABI-COIL database will VII; also see Emanuelsson and von Heijne, 2001; serveasatemplatefortheadditionofothergenomes, Schwacke et al., 2003). ARABI-COIL searches return enabling comparative analyses between different results if at least one of the incorporated predictions plantspecies.Flexibilityandexpandabilitywerefun- scores above the cutoff of 0.5, with the goal to pro- damental criteria for the underlying MySQL data- vide the user with the largest group possible from base and schema. The ability to add results from whichtoselectcandidatesforfurtheranalysis.These additional programs and sources is key to the suc- predictionresultsneedtobeevaluatedcriticallyona cessful viability of the database over the long term. case-by-case basis, which is aided by color-coding of Essentially, ARABI-COIL is a warehouse of anno- low-probability predictions (0.5–0.74, yellow) and tated and computed information, with relatively few high-probability predictions (0.75–1) on the results updatetransactionsrelativetothenumberofqueries. display. In general, the reliability of computational For increased availability to the scientific commu- targeting predictions is lower for plant sequences nity, the ARABI-COIL data will be made accessible than for non-plant sequences and varies from about through existing data mining and distribution tools, 85% overall correct predictions by TargetP to about such as for, example, The Arabidopsis Information 70% for PSORT (Emanuelsson et al., 2000). Predict- Resource (Rhee et al., 2003) and MOBY Central NLS works on the basis of a database of known NLS (Wilkinson and Links, 2002). 
motifs and their variations and was found to cor- rectly predict 43% of known nuclear proteins (Cokol et al., 2000), whereas PSORT searches for consensus Arabidopsis Coiled-Coil Proteins Identified patterns, thus potentially creating higher numbers Using ARABI-COIL including false-positive NLS predictions. Predotar The ARABI-COIL database was used to select frequently generates false positives by predicting groups of candidate proteins of at least 500 amino proteins with signal peptides as putative mitochon- acids in length and more than 25% coiled-coil cover- drial or chloroplast proteins. In some cases, this age in combination with other features that could be might reflect a true dual targeting to the ER and of potential interest for future research. The length organelle as has been observed for cytochrome b 5 cutoff for this analysis was chosen based on the (Zhao et al., 2003). MitoProt and ChloroP are less lengths of animal and yeast coiled-coil proteins with efficientthanPredotarindistinguishingbetweenmi- knownstructuralfunctionsinthecellthatrangefrom tochondrial and plastid targeting sequences and oc- about 600 amino acids (for example, lamin A/C, casionally predict high scores for both types of or- golgin-67) to more than 3,000 (for example, giantin). ganellar targeting sequences, as can be seen in the Several long coiled-coil proteins of unknown func- high MitoProt score for the example of the chloro- tion with transmembrane domains at the C terminus plast protein MFP1 in Table VII. Such predictions were identified (Fig. 3). This domain structure is could also reflect true dual targeting to both or- characteristicofasubgroupofanimalgolginsinclud- ganelles. Yeast mitochondrial targeting sequences ing golgin-84, golgin-67, giantin, and CASP (Bascom havebeenshowntotargetproteinstobothorganelles et al., 1999; Jakymiw et al., 2000; Misumi et al., 2001; in plants (Huang et al., 1990), and isolated plant Gillingham et al., 2002). 
Three of the identified Ara- mitochondria are capable of importing a range of bidopsis proteins contain similarity to golgins: chloroplast protein precursors (Cleary et al., 2002). At3g18480toCASPandAt1g18190andAt2g19950to Dual targeting is being observed for an increasing golgin-84 (Gillingham et al., 2002). Thus, the identi- number of plant proteins (Peeters and Small, 2001; fied Arabidopsis proteins are promising candidates Rudhe et al., 2002; Goggin et al., 2003), thus making forplantintegralmembranegolginsorproteinswith computational predictions difficult. Each ARABI- endosomal functions. No plant golgins have been COIL details page provides a normalized list of pre- characterized in the literature so far. 936 Downloaded from on January 4, 2019 - Published by www.plantphysiol.org Plant Physiol. Vol. 134, 2004 Copyright © 2004 American Society of Plant Biologists. All rights reserved.
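As an illustration of the selection criteria discussed under "Coiled-Coil Prediction and Selection Criteria," the following Python sketch fuses predicted coiled-coil segments separated by gaps of fewer than 25 residues and then applies the 500-residue and 25% coverage thresholds. It is a minimal sketch under stated assumptions, not the ARABI-COIL implementation: all function and constant names are hypothetical, segments are assumed to be 1-based inclusive residue ranges, the 20-residue minimum is assumed to be applied before fusing, and coverage is assumed to be computed from the fused domains (so bridged gap residues count as covered). The text variously states the thresholds as "at least" and "more than"; inclusive comparisons are used here.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # assumed 1-based inclusive (start, end) positions

MAX_GAP = 25         # gaps of fewer than 25 residues are ignored
MIN_DOMAIN = 20      # minimal coiled-coil domain length in the primary screen
MIN_PROTEIN = 500    # minimal protein length for the long coiled-coil subset
MIN_COVERAGE = 25.0  # minimal percent coiled-coil coverage


def merge_segments(segments: List[Segment], max_gap: int = MAX_GAP) -> List[Segment]:
    """Fuse segments whose separating gap is shorter than max_gap residues."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        # Gap length = number of residues strictly between the two segments.
        if merged and start - merged[-1][1] - 1 < max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_percent(segments: List[Segment], protein_length: int) -> float:
    """Percentage of residues lying inside the given (non-overlapping) segments."""
    covered = sum(end - start + 1 for start, end in segments)
    return 100.0 * covered / protein_length


def is_long_coiled_coil(protein_length: int, segments: List[Segment]) -> bool:
    """Apply the length and coverage thresholds to one protein."""
    # Assumption: short predictions are dropped before gaps are bridged.
    kept = [s for s in segments if s[1] - s[0] + 1 >= MIN_DOMAIN]
    merged = merge_segments(kept)
    return (protein_length >= MIN_PROTEIN
            and coverage_percent(merged, protein_length) >= MIN_COVERAGE)
```

With these assumptions, a 520-residue protein with predicted segments at 10–60, 70–200, and 260–300 would be reduced to two domains (10–200 and 260–300) and pass the filter, whereas the same segments in a 300-residue protein would not.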
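The handling of targeting predictions described for Figure 4 and for the search functions can likewise be sketched in Python. This is an illustration only: the function names, the compartment labels, and the return value "none" for proteins without any positive prediction are hypothetical, and the scores are assumed to be already normalized to the range 0 to 1. The text gives the cutoff both as "0.5 and higher" and as "above 0.5"; an inclusive comparison is assumed here.

```python
from statistics import mean
from typing import Dict, List

CUTOFF = 0.5  # probability cutoff stated in the text


def summarize_targeting(scores: Dict[str, List[float]], cutoff: float = CUTOFF) -> str:
    """Classify one protein from the cross-program mean score per compartment."""
    positive = [compartment for compartment, values in scores.items()
                if values and mean(values) >= cutoff]
    if len(positive) >= 2:
        return "unclear"  # mean above cutoff for two or more compartments
    if len(positive) == 1:
        return positive[0]
    return "none"         # hypothetical label: no targeting signal predicted


def search_hit(values: List[float], cutoff: float = CUTOFF) -> bool:
    """Search rule: a hit if at least one program reaches the cutoff."""
    return any(value >= cutoff for value in values)


def confidence_band(score: float) -> str:
    """Bands corresponding to the color-coding on the results display."""
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "low"
    return "below cutoff"
```

The contrast between `summarize_targeting` and `search_hit` reflects the difference described in the text: the Figure 4 summary averages across programs, whereas the database search deliberately returns the largest possible candidate group by accepting a single positive prediction.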