Swiss-Prot Protein Knowledgebase User Manual
Release XX.X
Amos Bairoch
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland
Telephone: +41-22-379 50 50
Fax: +41-22-379 58 58
Electronic mail address: amos.bairoch@isb-sib.ch
WWW server: http://www.expasy.org/sprot/
Rolf Apweiler
The EMBL Outstation - The European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: +44-1223-494 444
Fax: +44-1223-494 468
Electronic mail address: datalib@ebi.ac.uk
WWW server: http://www.ebi.ac.uk/swissprot/
Acknowledgements
This release of Swiss-Prot has been prepared by:
- Amos Bairoch, Andrea Auchincloss, Kristian Axelsen, Marie-Claude Blatter Garin,
Brigitte Boeckmann, Laurent Bollondi, Silvia Braconi Quintaje, Edouard deCastro, Danielle Coral, Elisabeth Coudert,
Sylvie Dethiollaz, Pavel Dobrokhotov, Severine Duvaud, Anne Estreicher, Livia Famiglietti,
Nathalie Farriol-Mathis, Stephanie Federico, Serenella Ferro, Elisabeth Gasteiger,
Raffaella Gatto Sebo, Alain Gateau, Alexandre Gattiker, Vivienne Gerritsen, Arnaud Gos,
Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Ivan Ivanyi,
Janet James, Eric Jain, Silvia Jimenez, Florence Jungo, Vivien Junker, Corinne Lachaize, Virginie Le Saux,
Tania Lima, Veronique Mangold, Xavier Martin, Karine Michoud, Madelaine Moinat, Anne Morgat, Isabelle Phan,
Sandrine Pilbout, Sylvain Poux, Nicole Redaschi, Sorogini Reynaud, Catherine Rivoire, Bernd Roechert, Claudia Sapsezian,
Michel Schneider, Christian Sigrist, Andre Stutz, Karin Sonesson, Shyamala Sundaram, Michael Tognolli,
Anne-Lise Veuthey and Lina Yip at the Swiss Institute of Bioinformatics (SIB) and
the Medical Biochemistry Department of the University of Geneva;
- Rolf Apweiler, Nicola Althorpe, Daniel Barrell, Kirsty Bates, David Binns, Margaret Biswas,
Sandra van den Broek, Paul Browne, Evelyn Camon, Michael Darsow, Sergio Contrino, Kirill Degtyarenko,
Alexander Fedetov, Astrid Fleischmann, Wolfgang Fleischmann, Gill Fraser, John Garavelli, Andre Hackmann,
Henning Hermjakob, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Maria Krestyaninova, Ernst Kretschmann,
Evguenia Kriventseva, Kati Laiho, Minna Lehvaslaiho, Chris Lewington,
Michele Magrane, Maria Jesus Martin, John Maslen, Rupinder Singh Mazara, Michelle McHale, Virginie Mittard,
Lorna Morris, Nicola Mulder, Claire O'Donovan, John O'Rourke, Sandra Orchard,
Manuela Pruess, Astrid Rakow, Kai Runte, Margaret Shore-Nye, Eleanor Whitfield,
Allyson Williams and Dan Wu at the European Bioinformatics Institute (EBI).
Copyright notice
Swiss-Prot is copyright. It is produced through a collaboration between the
Swiss Institute of Bioinformatics and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by non-profit
institutions as long as its content is in no way modified. Usage by and for
commercial entities requires a license agreement. For information about the
licensing scheme see:
http://www.isb-sib.ch/announce/
or send an email to license@isb-sib.ch.
The above copyright notice also applies to this user manual as well as to all other
Swiss-Prot documents.
How to submit data or updates/corrections to Swiss-Prot
To submit new sequence data to Swiss-Prot and for all queries regarding
submissions to Swiss-Prot one should contact:
- Swiss-Prot
- The EMBL Outstation - The European Bioinformatics Institute
- Wellcome Trust Genome Campus
- Hinxton
- Cambridge CB10 1SD
- United Kingdom
- Telephone: (+44 1223) 494 462
- Telefax: (+44 1223) 494 468
- E-mail: datasubs@ebi.ac.uk (for submissions);
datalib@ebi.ac.uk (for queries)
To submit updates and/or corrections to Swiss-Prot you can either use the E-mail
address: swiss-prot@expasy.org or the WWW address:
- http://www.expasy.org/sprot/update.html
Citation
If you want to cite Swiss-Prot in a publication, please use the following reference:
- Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O’Donovan C., Phan I., Pilbout S. and Schneider M.
- The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
- Nucleic Acids Res. 31:365-370(2003).
1. What is Swiss-Prot?
2. Conventions used in the database
- 2.1. General structure of the database
- 2.2. Classes of data
- 2.3. Structure of a sequence entry
3. The different line types
- 3.1 The ID line
- 3.2 The AC line
- 3.3 The DT line
- 3.4 The DE line
- 3.5 The GN line
- 3.6 The OS line
- 3.7 The OG line
- 3.8 The OC line
- 3.9 The OX line
- 3.10 The reference (RN, RP, RC, RX, RA, RT, RL) lines
- 3.11 The CC line
- 3.12 The DR line
- 3.13 The KW line
- 3.14 The FT line
- 3.15 The SQ line
- 3.16 The sequence data line
- 3.17 The // line
Appendix A: Feature table keys
- A.1 Change indicators
- A.2 Amino-acid modifications
- A.3 Regions
- A.4 Secondary structure
- A.5 Others
Appendix B: Amino-acid codes
Appendix C: Format differences between the Swiss-Prot and EMBL databases
- C.1 Generalities
- C.2 Differences in line types present in both databases
- C.3 Line types defined by Swiss-Prot but currently not used by EMBL
- C.4 Line types defined by EMBL but currently not used by Swiss-Prot
Appendix D: Swiss-Prot documentation files
Appendix E: FTP access to Swiss-Prot and TrEMBL
- E.1 Generalities
- E.2 Non-redundant database
- E.3 Weekly updates of Swiss-Prot documents
- E.4 Weekly updates of Swiss-Prot
Appendix F: Relationships between Swiss-Prot and some biomolecular databases
1. What is Swiss-Prot?
Swiss-Prot is an annotated protein sequence database. It was established in 1986 and
has been maintained collaboratively, since 1987, by the group of Amos Bairoch, first at the
Department of Medical Biochemistry of the University of Geneva and now at the Swiss
Institute of Bioinformatics (SIB) and the EMBL Data Library (now the EMBL Outstation -
The European Bioinformatics Institute (EBI)). The Swiss-Prot protein knowledgebase
consists of sequence entries. Sequence entries are composed of different line
types, each with their own format. For standardization purposes the format of
Swiss-Prot follows as closely as possible that of the EMBL Nucleotide
Sequence Database.
Swiss-Prot distinguishes itself from other protein sequence databases by four
distinct criteria:
a) Annotation
In Swiss-Prot, as in many sequence databases, two classes of data can be
distinguished: the core data and the annotation.
For each sequence entry the core data consists of:
- The sequence data;
- The citation information (bibliographical references);
- The taxonomic data (description of the biological source of the protein).
The annotation consists of the description of the following items:
- Function(s) of the protein;
- Posttranslational modification(s). For example carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.;
- Domains and sites. For example calcium-binding regions, ATP-binding
sites, zinc fingers, homeoboxes, SH2 and SH3 domains, kringle, etc.;
- Secondary structure. For example alpha helix, beta sheet, etc.;
- Quaternary structure. For example homodimer, heterotrimer, etc.;
- Similarities to other proteins;
- Disease(s) associated with any number of deficiencies in the protein;
- Sequence conflicts, variants, etc.
We try to include as much annotation information as possible in Swiss-Prot.
To obtain this information we use, in addition to the publications that
report new sequence data, review articles to periodically update the
annotations of families or groups of proteins. We also make use of external
experts, who have been recruited to send us their comments and updates
concerning specific groups of proteins.
We believe that having systematic recourse both to publications other
than those reporting the core data and to subject referees represents a
unique and beneficial feature of Swiss-Prot.
In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the
feature table (FT) and in the keyword lines (KW). Most comments are
classified by 'topics'; this approach permits the easy retrieval of
specific categories of data from the database.
b) Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which
correspond to different literature reports. In Swiss-Prot we try as much as possible to
merge all these data so as to minimize the redundancy of the database. If conflicts exist
between various sequencing reports, they are indicated in the feature table of the
corresponding entry.
c) Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration
between the three types of sequence-related databases (nucleic acid sequences, protein
sequences and protein tertiary structures) as well as with specialized data collections.
Swiss-Prot is currently cross-referenced to about 45 different databases.
Cross-references are provided in the form of pointers to information related to Swiss-Prot
entries and found in data collections other than Swiss-Prot. This extensive network of
cross-references allows Swiss-Prot to play a major role as a focal point of biomolecular
database interconnectivity.
d) Documentation
Swiss-Prot is distributed with a large number of index files and specialized
documentation files. Some of these files have been available for a long time (this user
manual, the release notes, the various indices for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new files.
The release notes contain an
up-to-date descriptive list of all distributed document files.
2. Conventions used in the database
The following sections describe the general conventions used in Swiss-Prot to achieve
uniformity of presentation. Experienced users of the EMBL Database can skip
these sections and directly refer to Appendix C,
which lists the minor differences in format between the two data collections.
2.1. General structure of the database
The Swiss-Prot protein sequence database is composed of sequence entries. Each
entry corresponds to a single contiguous sequence as contributed to the bank or reported
in the literature. In some cases, entries have been assembled from several papers that
report overlapping sequence regions. Conversely, a single paper can provide data for
several entries, e.g. when related sequences from different organisms are reported.
References to positions within a sequence are made using sequential numbering,
beginning with 1 at the N-terminal end of the sequence.
Except for initiator N-terminal methionine residues, which are not included in a sequence
when their absence from the mature sequence has been proven, the sequence data
correspond to the precursor form of a protein before posttranslational modifications and
processing.
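As a minimal sketch of this numbering convention (not part of the Swiss-Prot distribution), the hypothetical helper below converts a 1-based, inclusive position range into a Python slice. The sequence fragment and the two ranges are taken from the sample entry shown in section 2.3; the function name is an assumption made for illustration.

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a sequence.

    Hypothetical helper: Swiss-Prot positions start at 1 at the
    N-terminal end, whereas Python string indices start at 0.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range {start}..{end} for length {len(sequence)}")
    return sequence[start - 1:end]


# First 30 residues of the sample entry GRAA_HUMAN (blanks removed).
fragment = "MRNSYRFLASSLSVVVSLLLIPEDVCEKII"

signal = subsequence(fragment, 1, 26)    # feature SIGNAL 1 26
propep = subsequence(fragment, 27, 28)   # feature PROPEP 27 28
```

Only the start position needs adjusting: because Python slices exclude their upper bound, the inclusive end position can be used unchanged.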
2.2. Data classes
To make data available to users as quickly as possible after publication,
Swiss-Prot is distributed with a supplement called TrEMBL, where entries
are released before all their details are finalized. To distinguish
between fully annotated entries and those in TrEMBL, the 'class' of each
entry is indicated on the first (ID) line of the entry. The two defined
classes are:
STANDARD | Data which are complete and up to the standards laid down by the Swiss-Prot database. |
PRELIMINARY | Sequence entries which have not yet been annotated by the Swiss-Prot staff up to the standards laid down by Swiss-Prot. These entries are exclusively found in TrEMBL. |
2.3. Structure of a sequence entry
The entries in the Swiss-Prot database are structured so as to be usable by human
readers as well as by computer programs. The explanations, descriptions, classifications
and other comments are in ordinary English. Wherever possible, symbols familiar to
biochemists, protein chemists and molecular biologists are used.
Each sequence entry is composed of lines. Different types of lines, each with their own
format, are used to record the various data that make up the entry. A sample sequence
entry is shown below.
ID GRAA_HUMAN STANDARD; PRT; 262 AA.
AC P12544;
DT 01-OCT-1989 (Rel. 12, Created)
DT 01-OCT-1989 (Rel. 12, Last sequence update)
DT 28-FEB-2003 (Rel. 41, Last annotation update)
DE Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE 1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE (Fragmentin 1).
GN GZMA OR CTLA3 OR HFSP.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606;
RN [1]
RP SEQUENCE FROM N.A.
RC TISSUE=T-cell;
RX MEDLINE=88125000; PubMed=3257574;
RA Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT "Cloning and chromosomal assignment of a human cDNA encoding a T
RT cell- and natural killer cell-specific trypsin-like serine
RT protease.";
RL Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN [2]
RP SEQUENCE FROM N.A.
RC TISSUE=Blood;
RA Strausberg R.;
RL Submitted (OCT-2001) to the EMBL/GenBank/DDBJ databases.
RN [3]
RP SEQUENCE OF 1-23 FROM N.A.
RA Goralski T.J., Krensky A.M.;
RT "The upstream region of the human granzyme A locus contains both
RT positive and negative transcriptional regulatory elements.";
RL Submitted (NOV-1995) to the EMBL/GenBank/DDBJ databases.
RN [4]
RP SEQUENCE OF 29-53.
RX MEDLINE=88330824; PubMed=3047119;
RA Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT and the characterization of inhibitor and substrate specificity.";
RL J. Biol. Chem. 263:13215-13222(1988).
RN [5]
RP SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX MEDLINE=89009866; PubMed=3262682;
RA Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT "Characterization of three serine esterases isolated from human IL-2
RT activated killer cells.";
RL J. Immunol. 141:3142-3147(1988).
RN [6]
RP SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX MEDLINE=89035468; PubMed=3263427;
RA Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA Carrel S., Tschopp J.;
RT "Characterization of granzymes A and B isolated from granules of
RT cloned human cytotoxic T lymphocytes.";
RL J. Immunol. 141:3471-3477(1988).
RN [7]
RP 3D-STRUCTURE MODELING.
RX MEDLINE=89184501; PubMed=3237717;
RA Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA Weissman I.L., James M.N.G.;
RT "Comparative molecular model building of two serine proteinases from
RT cytotoxic T lymphocytes.";
RL Proteins 4:190-204(1988).
CC -!- FUNCTION: This enzyme is necessary for target cell lysis in cell-
CC mediated immune responses. It cleaves after Lys or Arg. May be
CC involved in apoptosis.
CC -!- CATALYTIC ACTIVITY: Hydrolysis of proteins, including fibronectin,
CC type IV collagen and nucleolin. Preferential cleavage: Arg-|-Xaa,
CC Lys-|-Xaa >> Phe-|-Xaa in small molecule substrates.
CC -!- SUBUNIT: Homodimer; disulfide-linked.
CC -!- SUBCELLULAR LOCATION: Cytoplasmic granules.
CC -!- SIMILARITY: Belongs to peptidase family S1. Granzyme subfamily.
CC --------------------------------------------------------------------------
CC This SWISS-PROT entry is copyright. It is produced through a collaboration
CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC the European Bioinformatics Institute. There are no restrictions on its
CC use by non-profit institutions as long as its content is in no way
CC modified and this statement is not removed. Usage by and for commercial
CC entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC or send an email to license@isb-sib.ch).
CC --------------------------------------------------------------------------
DR EMBL; M18737; AAA52647.1; -.
DR EMBL; BC015739; AAH15739.1; -.
DR EMBL; U40006; AAD00009.1; -.
DR PIR; A28943; A28943.
DR PIR; A30525; A30525.
DR PIR; A30526; A30526.
DR PIR; A31372; A31372.
DR PDB; 1HF1; 15-OCT-94.
DR MEROPS; S01.135; -.
DR Genew; HGNC:4708; GZMA.
DR MIM; 140050; -.
DR InterPro; IPR001254; Ser_protease_Try.
DR Pfam; PF00089; trypsin; 1.
DR SMART; SM00020; Tryp_SPc; 1.
DR PROSITE; PS50240; TRYPSIN_DOM; 1.
DR PROSITE; PS00134; TRYPSIN_HIS; 1.
DR PROSITE; PS00135; TRYPSIN_SER; 1.
KW Hydrolase; Serine protease; Zymogen; Signal; T-cell; Cytolysis;
KW Apoptosis; 3D-structure.
FT SIGNAL 1 26
FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT CHAIN 29 262 GRANZYME A.
FT ACT_SITE 69 69 CHARGE RELAY SYSTEM (BY SIMILARITY).
FT ACT_SITE 114 114 CHARGE RELAY SYSTEM (BY SIMILARITY).
FT ACT_SITE 212 212 CHARGE RELAY SYSTEM (BY SIMILARITY).
FT DISULFID 54 70 BY SIMILARITY.
FT DISULFID 148 218 BY SIMILARITY.
FT DISULFID 179 197 BY SIMILARITY.
FT DISULFID 208 234 BY SIMILARITY.
FT CARBOHYD 170 170 N-LINKED (GLCNAC...) (POTENTIAL).
FT STRAND 30 30
FT STRAND 33 34
FT TURN 37 38
FT TURN 41 42
FT STRAND 43 48
FT TURN 49 51
FT STRAND 52 60
FT TURN 61 62
FT STRAND 63 66
FT TURN 68 69
FT STRAND 76 80
FT STRAND 84 84
FT TURN 85 86
FT TURN 90 91
FT STRAND 93 102
FT TURN 104 105
FT HELIX 108 110
FT TURN 112 113
FT STRAND 116 120
FT STRAND 127 127
FT TURN 128 129
FT STRAND 130 130
FT STRAND 134 134
FT TURN 138 139
FT TURN 144 145
FT STRAND 147 152
FT STRAND 155 157
FT TURN 158 159
FT STRAND 160 162
FT STRAND 165 165
FT STRAND 167 173
FT HELIX 176 180
FT TURN 181 182
FT TURN 184 185
FT TURN 193 194
FT STRAND 195 199
FT TURN 201 202
FT STRAND 206 206
FT TURN 209 210
FT TURN 212 213
FT STRAND 215 218
FT TURN 219 220
FT STRAND 221 228
FT TURN 231 232
FT TURN 234 235
FT TURN 237 238
FT STRAND 241 245
FT TURN 246 249
FT HELIX 252 260
SQ SEQUENCE 262 AA; 28968 MW; DA87363A0D92BAF4 CRC64;
MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
TEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
GVYILLSKKH LNWIIMTIKG AV
//
Each line begins with a two-character line code, which indicates the type of data contained
in the line. The current line types and line codes and the order in which they appear in an
entry, are shown in the table below.
Line code | Content | Occurrence in an entry |
ID | Identification | Once; starts the entry |
AC | Accession number(s) | Once or more |
DT | Date | Three times |
DE | Description | Once or more |
GN | Gene name(s) | Optional |
OS | Organism species | Once or more |
OG | Organelle | Optional |
OC | Organism classification | Once or more |
OX | Taxonomy cross-reference(s) | Once or more |
RN | Reference number | Once or more |
RP | Reference position | Once or more |
RC | Reference comment(s) | Optional |
RX | Reference cross-reference(s) | Optional |
RA | Reference authors | Once or more |
RT | Reference title | Optional |
RL | Reference location | Once or more |
CC | Comments or notes | Optional |
DR | Database cross-references | Optional |
KW | Keywords | Optional |
FT | Feature table data | Optional |
SQ | Sequence header | Once |
| (blanks) sequence data | Once or more |
// | Termination line | Once; ends the entry |
As shown in the above table, some line types are found in all entries, others are optional.
Some line types occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//).
A detailed description of each line type is given in the next section of this document.
It must be noted that, with the exception of GN, all Swiss-Prot line types exist in the
EMBL Database. A description of the format differences between the Swiss-Prot and
EMBL databases is given in Appendix C of this
document.
The two-character line-type code that begins each line is always followed
by three blanks, so that the actual information begins with the sixth
character. Information is not extended beyond character position 75, with
one exception: CC lines that contain the 'DATABASE' topic (see section
3.11).
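The layout rules above (an entry runs from its ID line to a '//' line, and every line starts with a two-character code) can be sketched as follows. This is an illustrative reader under stated assumptions, not an official parser: the function names are hypothetical, and the data are taken as everything after the line code with surrounding blanks stripped, so that copies of an entry with altered spacing are also accepted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per entry; a '//' line ends an entry.

    Lines after the last '//' are ignored, since every entry must
    end with a terminator line.
    """
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)


def group_by_line_code(entry: Iterable[str]) -> Dict[str, List[str]]:
    """Map each two-character line code to the data of its lines.

    Sequence data lines start with blanks instead of a code; they end
    up under the key '  ' (two blanks).
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in entry:
        code, data = line[:2], line[2:].strip()
        grouped[code].append(data)
    return dict(grouped)


sample = [
    "ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.",
    "AC   P12544;",
    "DT   01-OCT-1989 (Rel. 12, Created)",
    "//",
]
entries = list(split_entries(sample))
fields = group_by_line_code(entries[0])
```

Grouping by line code keeps the lines of each type in their original order, which matters for multi-line items such as DE or OC.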
3. The different line types
3.1. The ID line
The ID (IDentification) line is always the first line of an entry. The general form of the ID line
is:
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
- 3.1.1. Entry name
The first item on the ID line is the entry name of the sequence. This name is a useful means
of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric
characters.
Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y,
where:
- X is a mnemonic code of at most 4 alphanumeric characters representing
the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for
Hemoglobin alpha chain and INS is for Insulin;
- The '_' sign serves as a separator;
- Y is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the protein. This code
is generally made of the first three letters of the genus and the first
two letters of the species.
Examples:
PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.
However, for species most commonly encountered in the database,
self-explanatory codes are used. There are 16 of those codes: BOVIN for Bovine,
CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for
Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea
(Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for
Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco
(Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), and
YEAST for Baker's yeast (Saccharomyces cerevisiae).
As it was not possible to apply the above rules to viruses, they were given
arbitrary, but generally easy-to-remember identification codes.
Examples of complete protein sequence entry names are: RL1_ECOLI for
Ribosomal protein L1 from Escherichia coli, and SODC_DROME for Superoxide
dismutase [Cu-Zn] from Drosophila melanogaster.
The names of all the presently-defined species identification codes are
listed in the Swiss-Prot document file
speclist.txt.
- 3.1.2. Data class
The second item on the ID line indicates the data class of the entry (see
section 2.2).
- 3.1.3. Molecule type
The third item on the ID line is a three-letter code that indicates the
type of molecule of the entry; in Swiss-Prot it is 'PRT' (for PRoTein).
- 3.1.4. Length of the molecule
The fourth and last item of the ID line is the length of the molecule,
which is the total number of amino acids in the sequence. This number
includes the positions reported to be present but which have not been
determined (coded as 'X'). The length is followed by the letter code 'AA'
(Amino Acids).
- 3.1.5. Examples of identification lines
Two examples of ID lines are shown below:
ID CYC_BOVIN STANDARD; PRT; 104 AA.
ID GIA2_GIALA STANDARD; PRT; 296 AA.
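The ID line layout described in this section can be checked with a regular expression. The sketch below is illustrative only: the names IdLine and parse_id_line are hypothetical, the pattern accepts only entry names that follow the X_Y convention described above, and the spacing between items is matched loosely.

```python
import re
from typing import NamedTuple


class IdLine(NamedTuple):
    entry_name: str
    protein_code: str   # the X part of X_Y (at most 4 characters)
    species_code: str   # the Y part of X_Y (at most 5 characters)
    data_class: str
    molecule_type: str
    length: int


# Whitespace between items is matched with \s+, an assumption made so
# that both aligned and single-spaced copies of a line are accepted.
ID_PATTERN = re.compile(
    r"ID\s+([A-Z0-9]{1,4})_([A-Z0-9]{1,5})\s+"
    r"(STANDARD|PRELIMINARY);\s+(PRT);\s+(\d+)\s+AA\."
)


def parse_id_line(line: str) -> IdLine:
    """Split an ID line into its four items and the two name parts."""
    match = ID_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    protein, species, data_class, molecule, length = match.groups()
    return IdLine(f"{protein}_{species}", protein, species,
                  data_class, molecule, int(length))


first = parse_id_line("ID CYC_BOVIN STANDARD; PRT; 104 AA.")
second = parse_id_line("ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.")
```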
3.2. The AC line
The AC (ACcession number) line lists the accession number(s) associated
with an entry. The format of the AC line is:
AC AC_number_1;[ AC_number_2;]...[ AC_number_N;]
An example of an accession number line is shown below:
AC P00321;
Semicolons separate the accession numbers and a semicolon terminates the
list. If necessary, more than one AC line can be used. Example:
AC Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894;
AC Q92895; Q93053; Q99605; O00713; O00714; O00715;
The purpose of accession numbers is to provide a stable way of identifying
entries from release to release. It is sometimes necessary for reasons of
consistency to change the names of the entries, for example, to ensure that
related entries have similar names. However, an accession number is always
conserved, and therefore allows unambiguous citation of Swiss-Prot entries.
Researchers who wish to cite entries in their publications should
always cite the first accession number. This is commonly referred to as
the 'primary accession number'.
Entries will have more than one accession number if they have been merged
or split. For example, when two entries are merged into one, the accession
numbers from both entries are stored in the AC line(s).
If an existing entry is split into two or more entries (a rare occurrence),
the original accession numbers are retained in all the derived entries and
a new primary accession number is added to all the entries.
An accession number is dropped only when the data to which it was assigned
have been completely removed from the database. Deleted accession numbers
are listed in the Swiss-Prot document file
deleteac.txt.
Accession numbers consist of 6 alphanumerical characters in the following
format:
1 | 2 | 3 | 4 | 5 | 6 |
[O,P,Q] | [0-9] | [A-Z, 0-9] | [A-Z, 0-9] | [A-Z, 0-9] | [0-9] |
Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1
and P4A123.
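The position table above translates directly into a regular expression. The following sketch, with hypothetical function names, validates accession numbers against that table and collects the numbers from one or more AC lines; the first number returned is the primary accession number.

```python
import re
from typing import List

# One character class per position, following the table above.
AC_PATTERN = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession: str) -> bool:
    """Check a single accession number against the six-position format."""
    return AC_PATTERN.fullmatch(accession) is not None


def parse_ac_lines(lines: List[str]) -> List[str]:
    """Collect accession numbers from one or more AC lines, in order."""
    accessions: List[str] = []
    for line in lines:
        data = re.sub(r"^AC\s+", "", line)
        accessions.extend(item.strip() for item in data.split(";") if item.strip())
    return accessions


accessions = parse_ac_lines([
    "AC Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894;",
    "AC Q92895; Q93053; Q99605; O00713; O00714; O00715;",
])
primary = accessions[0]
```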
3.3. The DT line
The DT (DaTe) lines show the date of creation and last modification of the
database entry. The format of the DT line is:
DT DD-MMM-YYYY (Rel. XX, Comment)
Where 'DD' is the day, 'MMM' the month, 'YYYY' the year, and 'XX' the
Swiss-Prot release number. The comment portion of the line indicates the action
taken on that date. There are always three DT lines in each entry, each
associated with a specific comment:
- The first DT line indicates when the entry first appeared in the database. The
associated comment is 'Created';
- The second DT line indicates when the sequence data was last modified. The
associated comment is 'Last sequence update';
- The third DT line indicates when data (see the note below) other than the sequence
was last modified. The associated comment is 'Last annotation update'.
Example of a block of DT lines:
DT 01-AUG-1988 (Rel. 08, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 16-OCT-2001 (Rel. 40, Last annotation update)
Whenever the sequence of an entry is updated there is always also an
annotation update. The date in the 3rd DT line is thus always at least
as recent as the one in the 2nd DT line.
Regarding the third DT line, one should note that such a line is not
updated when an entry is the target of what we call a 'global change'. A
global change is defined as any operation which involves changes in all
or most Swiss-Prot entries. These changes are announced in the
release notes and are
usually linked to formatting issues. As such global changes
take place at almost every release, we strongly advise users of the database
to completely reload Swiss-Prot at each release cycle.
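The DT conventions above (three lines, fixed comments, annotation date never older than the sequence date) can be verified mechanically. The sketch below is one possible way to do so; the names DtLine, parse_dt_line and parse_dt_block are hypothetical, and the month table assumes the uppercase three-letter English abbreviations seen in the examples.

```python
import re
from datetime import date
from typing import List, NamedTuple

# Assumption: months are always uppercase English abbreviations.
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

DT_PATTERN = re.compile(
    r"DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\(Rel\.\s+(\d+),\s+([^)]+)\)"
)

EXPECTED_COMMENTS = ("Created", "Last sequence update", "Last annotation update")


class DtLine(NamedTuple):
    when: date
    release: int
    comment: str


def parse_dt_line(line: str) -> DtLine:
    """Parse one DT line into its date, release number and comment."""
    match = DT_PATTERN.fullmatch(line.strip())
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"not a valid DT line: {line!r}")
    day, month, year, release, comment = match.groups()
    return DtLine(date(int(year), MONTHS[month], int(day)), int(release), comment)


def parse_dt_block(lines: List[str]) -> List[DtLine]:
    """Parse the three DT lines of an entry and check their consistency."""
    parsed = [parse_dt_line(line) for line in lines]
    if tuple(item.comment for item in parsed) != EXPECTED_COMMENTS:
        raise ValueError("expected DT comments: " + ", ".join(EXPECTED_COMMENTS))
    if parsed[2].when < parsed[1].when:
        raise ValueError("annotation update is older than sequence update")
    return parsed


block = parse_dt_block([
    "DT 01-AUG-1988 (Rel. 08, Created)",
    "DT 30-MAY-2000 (Rel. 39, Last sequence update)",
    "DT 16-OCT-2001 (Rel. 40, Last annotation update)",
])
```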
3.4. The DE line
The DE (DEscription) lines contain general descriptive information about
the sequence stored. This information is generally sufficient to identify
the protein precisely.
The format of the DE line is:
DE Description.
The description is given in ordinary English (using US spelling) and is
free-format.
In cases where more than one DE line is required, the
text is only divided between words and only the last DE line is terminated
by a period.
The description always starts with the proposed official name of the
protein. Synonyms are indicated between parentheses. Example:
DE Annexin V (Lipocortin V) (Endonexin II) (Calphobindin I) (CBP-I)
DE (Placental anticoagulant protein I) (PAP-I) (PP4) (Thromboplastin
DE inhibitor) (Vascular anticoagulant-alpha) (VAC-alpha) (Anchorin CII).
When a protein is known to be cleaved into multiple functional components,
the description starts with the name of the precursor protein, followed
by a section delimited by '[Contains: ...]'. All the individual components
are listed in that section and are separated by semi-colons (';'). Synonyms
are allowed at the level of the precursor and for each individual component.
Example:
DE Corticotropin-lipotropin precursor (Pro-opiomelanocortin) (POMC)
DE [Contains: NPP; Melanotropin gamma (Gamma-MSH); Corticotropin
DE (Adrenocorticotropic hormone) (ACTH); Melanotropin alpha (Alpha-MSH);
DE Corticotropin-like intermediary peptide (CLIP); Lipotropin beta (Beta-
DE LPH); Lipotropin gamma (Gamma-LPH); Melanotropin beta (Beta-MSH);
DE Beta-endorphin; Met-enkephalin].
When a protein is known to include multiple functional domains each of which is
described by a different name, the description starts with the name of
the overall protein, followed by a section delimited by '[Includes: ...]'.
All the domains are listed in that section and are separated by semi-colons
(';'). Synonyms are allowed at the level of the protein and for each
individual domain. Example:
DE CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate
DE synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);
DE Dihydroorotase (EC 3.5.2.3)].
When the complete sequence was not determined, the last information given on the DE
lines is '(Fragment)' or '(Fragments)'. Example:
DE Dihydrodipicolinate reductase (EC 1.3.1.26) (DHPR) (Fragment).
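The DE conventions above can be illustrated with a small parser. This is a sketch under stated assumptions rather than a complete implementation: all names are hypothetical, a line ending in '-' is assumed to have been broken inside a hyphenated word, component names are assumed not to contain ']', and the protein name is returned together with its synonyms instead of being split further.

```python
import re
from typing import List, NamedTuple

SECTION = re.compile(r"\s*\[(Contains|Includes): ([^\]]*)\]")
FRAGMENT = re.compile(r"\s*\(Fragments?\)$")


class Description(NamedTuple):
    name_and_synonyms: str
    contains: List[str]
    includes: List[str]
    is_fragment: bool


def join_de_lines(lines: List[str]) -> str:
    """Join DE lines into one string of text."""
    text = ""
    for line in lines:
        part = re.sub(r"^DE\s+", "", line).strip()
        # Assumption: a line ending in '-' was broken inside a hyphenated
        # word (as in 'Beta-' / 'LPH' above), so no blank is re-inserted.
        if text and not text.endswith("-"):
            text += " "
        text += part
    return text


def parse_description(lines: List[str]) -> Description:
    """Separate the name, the component sections and the fragment flag."""
    text = join_de_lines(lines)
    if text.endswith("."):
        text = text[:-1]
    fragment = FRAGMENT.search(text)
    if fragment:
        text = text[:fragment.start()]
    found = {"Contains": [], "Includes": []}
    for match in SECTION.finditer(text):
        found[match.group(1)] += [item.strip() for item in match.group(2).split(";")]
    name = SECTION.sub("", text).strip()
    return Description(name, found["Contains"], found["Includes"],
                       fragment is not None)


cad = parse_description([
    "DE CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate",
    "DE synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);",
    "DE Dihydroorotase (EC 3.5.2.3)].",
])
```

For the CAD example, cad.name_and_synonyms is 'CAD protein' and cad.includes holds the three domain names with their EC numbers.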
3.5. The GN line
The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored
protein sequence.
The format of the GN line is:
GN NAME1[ AND|OR NAME2...].
Examples:
GN ALB.
GN CRE-1.
It often occurs that more than one gene name has been assigned to an
individual locus, in which case all the synonyms will be listed. The word
'OR' separates the different designations. The first name in the list is
assumed to be the most correct (or most current) designation.
Example:
GN HNS OR HNSA OR DRDX OR OSMZ OR BGLY OR MSYA OR CUR OR PILG OR TOPS OR
GN B1237 OR Z2013 OR ECS1739.
In a few cases, multiple genes code for an identical protein sequence, in
which case all the different gene names will be listed. The word 'AND'
separates the designations. Example:
GN B23R AND C17L.
In some cases 'AND' and 'OR' are both present. Parentheses are then used as
shown in the following examples:
GN GVPA AND (GVPB OR GVPA2).
GN (HHT1 OR SPAC1834.04) AND (HHT2 OR SPBC8D2.04 OR PI060).
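The 'AND'/'OR' conventions above can be unpacked into a list of genes, each with its list of synonyms. The sketch below assumes the layout seen in the examples of this section: genes are joined by 'AND', synonyms of one gene by 'OR', and parentheses enclose only whole 'OR' groups. The function name is hypothetical.

```python
import re
from typing import List


def parse_gene_names(lines: List[str]) -> List[List[str]]:
    """Return one list of synonyms per gene, first synonym first.

    Assumption: no nesting beyond 'AND' of (optionally parenthesized)
    'OR' groups, as in the examples shown above.
    """
    text = " ".join(re.sub(r"^GN\s+", "", line).strip() for line in lines)
    if text.endswith("."):
        text = text[:-1]
    genes: List[List[str]] = []
    for group in text.split(" AND "):
        group = group.strip()
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes


both = parse_gene_names(
    ["GN (HHT1 OR SPAC1834.04) AND (HHT2 OR SPBC8D2.04 OR PI060)."]
)
```

Here both contains two genes: the first with the synonyms HHT1 and SPAC1834.04, the second with HHT2, SPBC8D2.04 and PI060.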
3.6. The OS line
The OS (Organism Species) line specifies the organism(s) which was (were)
the source of the stored sequence. In the rare case where all the species
information will not fit on a single line, more than one OS line is used.
The last OS line is terminated by a period.
The species designation
consists, in most cases, of the Latin genus and species designation
followed by the English name (in parentheses). For viruses, only the common
English name is given.
Examples of OS lines are shown here:
OS Escherichia coli.
OS Homo sapiens (Human).
OS Solanum melongena (Eggplant) (Aubergine).
OS Rous sarcoma virus (strain Schmidt-Ruppin).
If a Swiss-Prot entry reports the sequence of a protein identical in a
number of species, the name of these species will all be listed in the OS
lines of that entry. The species names are separated by commas; the last
species name is preceded by the word 'and'.
Here are examples of the OS lines for entries representing multiple
species:
OS Oncorhynchus nerka (Sockeye salmon), and
OS Oncorhynchus masou (Cherry salmon) (Masu salmon).
OS Mus musculus (Mouse),
OS Rattus norvegicus (Rat), and
OS Bos taurus (Bovine).
The names (official name, common name, synonym) concerning one species are
cut across lines when they do not fit into a single line:
OS Epizootic hemorrhagic disease virus (serotype 2 / strain Alberta)
OS (Ehdv-2).
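A possible way to recover the individual species from OS lines is sketched below. It is illustrative only: the function name is hypothetical, and it assumes that commas separating species never occur inside parentheses, so only commas outside parentheses are treated as separators.

```python
import re
from typing import List


def parse_species(lines: List[str]) -> List[str]:
    """Split the joined OS lines into one string per species."""
    text = " ".join(re.sub(r"^OS\s+", "", line).strip() for line in lines)
    if text.endswith("."):
        text = text[:-1]
    species: List[str] = []
    current: List[str] = []
    depth = 0  # current nesting level of parentheses
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            species.append("".join(current))
            current = []
        else:
            current.append(char)
    species.append("".join(current))
    # Drop the 'and' that precedes the last species name.
    return [re.sub(r"^and\s+", "", name.strip()) for name in species]


rodents_and_bovine = parse_species([
    "OS Mus musculus (Mouse),",
    "OS Rattus norvegicus (Rat), and",
    "OS Bos taurus (Bovine).",
])
```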
3.7. The OG line
The OG (OrGanelle) line indicates whether the gene coding for a protein originates from the
mitochondria, the chloroplast, the cyanelle, the nucleomorph or a plasmid.
The format of the OG line is:
OG Chloroplast.
OG Cyanelle.
OG Mitochondrion.
OG Nucleomorph.
OG Plasmid name.
Where 'name' is the name of the plasmid.
If a Swiss-Prot entry reports the sequence of a protein identical in a
number of plasmids, the names of these plasmids will all be listed in the OG
lines of that entry. The plasmid names are separated by commas; the last
plasmid name is preceded by the word 'and'. Plasmid names are never cut
across two lines. Examples:
OG Plasmid IncFIV R124, and Plasmid IncFI ColV3-K30.
OG Plasmid R6-5, Plasmid IncFII NR1, and
OG Plasmid IncFII R1-19 (R1 drd-19).
The Swiss-Prot document plasmid.txt
lists all the plasmid names that are used in the database in the context of the OG
line.
3.8. The OC line
The OC (Organism Classification) lines contain the taxonomic classification
of the source organism. The taxonomic classification used in Swiss-Prot is
that maintained at the NCBI (see
http://www.ncbi.nlm.nih.gov/Taxonomy/)
and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). The NCBI's taxonomy
reflects current phylogenetic knowledge. It is a sequence-based taxonomy as
much as possible and based on published authorities wherever possible.
Because of the inherent ambiguity of evolutionary classification and the
specific needs of database users (e.g. trying to track down the
phylogenetic history of a group of organisms or to elucidate the evolution
of a molecule), this taxonomy strives to accurately reflect current
phylogenetic knowledge. The NCBI's taxonomy is intended to be informative
and helpful; no claim is made that it is the best or the most exact.
The classification is listed top-down as nodes in a taxonomic tree in which the most
general grouping is given first. The classification may be distributed over several OC lines,
but nodes are not split or hyphenated between lines. Semicolons separate the individual
items and the list is terminated by a period.
The format of the OC line is:
OC Node[; Node...].
For example the classification lines for a human sequence would be:
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
If a protein is identical in more than one species, all the species names will be listed
in the OS lines (see 3.6)
but the OC lines will only contain the classification for the first species listed.
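As an illustration of the OC format, the following Python sketch joins the OC lines of an entry and splits them into nodes. The function name parse_oc_lines is hypothetical; the sketch relies only on the rules stated above (semicolon separators, a terminating period, nodes never split between lines).

```python
def parse_oc_lines(lines):
    """Return the taxonomic nodes of a run of OC lines, most general first."""
    # Drop the two-letter line code and join the lines into one string.
    text = " ".join(line[2:].strip() for line in lines)
    # The list is terminated by a period.
    if text.endswith("."):
        text = text[:-1]
    # Semicolons separate the individual nodes.
    return [node.strip() for node in text.split(";") if node.strip()]


lineage = parse_oc_lines([
    "OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.",
])
```

For an entry covering several species, the resulting list describes the first species of the OS lines only.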
The OX (Organism taxonomy cross-reference) line is used to indicate the
identifier of a specific organism in a taxonomic database. The number of
taxonomic codes is identical to the number of species given in the OS line(s).
There can be more than one OX line in an entry. The format of the OX line is:
OX Taxonomy_database_Qualifier=Taxonomic code[, Taxonomic code...];
Currently the cross-references are made to the taxonomy database of NCBI,
which is associated with the qualifier 'TaxID' and a one- to six-digit
taxonomic code.
Examples:
OX NCBI_TaxID=9606;
OX NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
OX 9031, 8355, 7227, 7213, 7108, 7130;
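A short Python sketch of how the OX lines of an entry might be read is given here. The function name parse_ox_lines is hypothetical, and the sketch assumes a single qualifier per entry, as in the examples above.

```python
import re


def parse_ox_lines(lines):
    """Return (qualifier, list of taxonomic codes) from a run of OX lines."""
    # Drop the two-letter line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    # Expected shape: Qualifier=code[, code...];
    match = re.fullmatch(r"(\w+)=([\d,\s]+);", text)
    if match is None:
        raise ValueError(f"unrecognised OX content: {text!r}")
    codes = [int(code) for code in match.group(2).split(",") if code.strip()]
    return match.group(1), codes


qualifier, codes = parse_ox_lines([
    "OX NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,",
    "OX 9031, 8355, 7227, 7213, 7108, 7130;",
])
```

Because the number of codes equals the number of species in the OS lines, comparing the two counts is a simple consistency check on an entry.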
The reference lines (RN, RP, RC, RX, RA, RT and RL) comprise the literature citations within Swiss-Prot. The
citations indicate the sources from which the data has been abstracted. The
reference lines for a given citation occur in a block, and are always in
the order RN, RP, RC, RX, RA, RT and RL. Within each such reference block,
the RN line occurs once, the RC, RX and RT lines occur zero or more
times, and the RP, RA and RL lines occur one or more times. If several
references are given, there will be a reference block for each.
An example of a complete reference is:
RN [1]
RP SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC STRAIN=Sprague-Dawley; TISSUE=Liver;
RX MEDLINE=91002678; PubMed=2207170;
RA Chan Y.-L., Paz V., Olvera J., Wool I.G.;
RT "The primary structure of rat ribosomal proteins: the amino acid
RT sequences of L27a and L28 and corrections in the sequences of S4 and
RT S12.";
RL Biochim. Biophys. Acta 1050:69-73(1990).
The formats of the individual lines are explained below.
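The ordering and occurrence rules for a reference block can be expressed as a small Python check. This is only a sketch under the rules stated above; the names split_reference_blocks and check_reference_block are hypothetical.

```python
ORDER = ["RN", "RP", "RC", "RX", "RA", "RT", "RL"]
REQUIRED = {"RN", "RP", "RA", "RL"}  # line types that occur at least once


def split_reference_blocks(lines):
    """Group reference lines into blocks, each starting at an RN line."""
    blocks = []
    for line in lines:
        code = line[:2]
        if code not in ORDER:
            continue  # not a reference line
        if code == "RN":
            blocks.append([])
        if not blocks:
            raise ValueError("reference line found before any RN line")
        blocks[-1].append(line)
    return blocks


def check_reference_block(block):
    """Return True if the block respects the order and occurrence rules."""
    codes = [line[:2] for line in block]
    ranks = [ORDER.index(code) for code in codes]
    in_order = ranks == sorted(ranks)        # RN, RP, RC, RX, RA, RT, RL
    one_rn = codes.count("RN") == 1          # RN occurs exactly once
    has_required = REQUIRED.issubset(codes)  # RP, RA and RL are present
    return in_order and one_rn and has_required
```

The RC, RX and RT lines need no separate test, since they may occur zero or more times.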
- 3.10.1. The RN line
The RN (Reference Number) line gives a sequential number to each reference citation in
an entry. This number is used to indicate the reference in comments and feature table
notes. The format of the RN line is:
RN [N]
where 'N' denotes the nth reference for this entry. The
reference number is always between square brackets.
- 3.10.2. The RP line
The RP (Reference Position) lines describe the extent of the work carried out by the
authors of the reference cited. The format of the RP line is:
RP COMMENT.
Typical examples of RP lines are shown below:
RP SEQUENCE FROM N.A.
RP SEQUENCE FROM N.A., AND SEQUENCE OF 21-35.
RP SEQUENCE OF 39-76; 95-118 AND 125-138, AND DISULFIDE BONDS.
RP REVISIONS TO 76-84 AND 129.
RP STRUCTURE BY NMR.
RP X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP CHARACTERIZATION.
RP MUTAGENESIS OF TYR-65.
RP REVIEW.
RP VARIANT ALA-1368.
RP VARIANTS TD ARG-597 AND ARG-1477, AND VARIANT FHA LEU-693 DEL.
RP SEQUENCE FROM N.A., SEQUENCE OF 1-22; 2-17; 240-256; 318-339 AND
RP 381-390, AND CHARACTERIZATION.
RP SEQUENCE FROM N.A., SEQUENCE OF 154-171; 302-308; 312-328; 377-384
RP AND 419-431, FUNCTION, SUBCELLULAR LOCATION, AND MUTAGENESIS OF
RP ARG-331; GLY-332 AND ARG-333.
- 3.10.3. The RC line
The RC (Reference Comment) lines are optional lines which are used to store comments
relevant to the reference cited. The format of the RC line is:
RC TOKEN1=Text; TOKEN2=Text; ...
Where the currently defined tokens are:
- PLASMID
- SPECIES
- STRAIN
- TISSUE
- TRANSPOSON
Examples of RC lines:
RC STRAIN=Sprague-Dawley; TISSUE=Liver;
RC STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;
RC SPECIES=Rat; STRAIN=Wistar; TISSUE=Brain;
RC SPECIES=A.thaliana; STRAIN=cv. Columbia;
RC PLASMID=IncFII R100;
The 'SPECIES' token is only used when an entry describes a sequence that is
identical in more than one species; similarly 'PLASMID' is only used if an
entry describes a sequence identical in more than one plasmid.
The Swiss-Prot document tisslist.txt lists all the tissues that are used in the
database in the context of the 'TISSUE' token.
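The token syntax of the RC line lends itself to a simple dictionary representation. The Python sketch below is illustrative only: the function name parse_rc_lines is hypothetical, and the sketch assumes that token values contain neither semicolons nor equals signs.

```python
KNOWN_TOKENS = {"PLASMID", "SPECIES", "STRAIN", "TISSUE", "TRANSPOSON"}


def parse_rc_lines(lines):
    """Return a dict mapping each RC token to its text."""
    # Drop the two-letter line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    tokens = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # empty piece after the final semicolon
        name, _, value = part.partition("=")
        if name not in KNOWN_TOKENS:
            raise ValueError(f"unknown RC token: {name!r}")
        tokens[name] = value
    return tokens


comment = parse_rc_lines(["RC STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;"])
```

Commas are left inside the value, because a token such as 'TISSUE' may list several items.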
- 3.10.4. The RX line
The RX (Reference cross-reference) line is an optional line which is used
to indicate the identifier assigned to a specific reference in a
bibliographic database. The format of the RX line is:
RX Bibliographic_db=IDENTIFIER[; Bibliographic_db=IDENTIFIER...];
Where the valid bibliographic database names and their associated
identifiers are:
Name | Identifier
MEDLINE | Eight-digit MEDLINE Unique Identifier (UI)
PubMed | PubMed Unique Identifier (PMID)
Examples of RX lines:
RX MEDLINE=97291283; PubMed=9145897;
RX PubMed=11792842;
- 3.10.5. The RA line
The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the
authors are included, and are listed in the order given in the paper. The names are listed
surname first followed by a blank, followed by initial(s) with periods. The authors' names
are separated by commas and terminated by a semicolon. Author names are not split
between lines. An example of the use of RA lines is shown below:
RA Galinier A., Bleicher F., Negre D., Perriere G., Duclos B.,
RA Cozzone A.J., Cortay J.-C.;
As many RA lines as necessary are included in each reference.
An author's initials can be followed by an abbreviation such as 'Jr' (for Junior), 'Sr'
(Senior), 'II', 'III' or 'IV' (2nd, 3rd and 4th). Example:
RA Nasoff M.S., Baker H.V. II, Wolf R.E. Jr.;
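The RA conventions can be sketched in Python as follows. The function names are hypothetical. Splitting a name into surname, initials and suffix uses a heuristic that is an assumption of this sketch, not a rule of the format: the initials are taken to be the first word after the surname that ends with a period.

```python
def parse_ra_lines(lines):
    """Return the author names found in a run of RA lines, in order."""
    # Drop the two-letter line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    # The list is terminated by a semicolon.
    if text.endswith(";"):
        text = text[:-1]
    return [name.strip() for name in text.split(",") if name.strip()]


def split_author(name):
    """Split 'Surname I.N. Suffix' into (surname, initials, suffix)."""
    tokens = name.split()
    for i, token in enumerate(tokens):
        # Heuristic: the first word after position 0 that ends with a period.
        if i > 0 and token.endswith("."):
            return " ".join(tokens[:i]), token, " ".join(tokens[i + 1:])
    return name, "", ""


authors = parse_ra_lines(["RA Nasoff M.S., Baker H.V. II, Wolf R.E. Jr.;"])
```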
- 3.10.6. The RT line
The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as
possible given the limitations of the computer character set. The format of the RT line is:
RT "Title";
Example of a set of RT lines:
RT "New insulin-like proteins with atypical disulfide bond pattern
RT characterized in Caenorhabditis elegans by comparative sequence
RT analysis and homology modeling.";
It should be noted that the format of the title is not always identical to
that displayed at the top of the published work:
- Major title words are not capitalized;
- The text of a title ends with either a period '.', a question mark '?' or an exclamation mark '!';
- Double quotation marks ' " ' in the text of the title are replaced by single quotation marks;
- Titles of articles published in a language other than English have been translated into English;
- Greek letters are written in full (alpha, beta, etc.).
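A title spread over several RT lines can be reassembled with a few lines of Python. The function name parse_rt_lines is hypothetical. The sketch joins lines with a single blank, which is an assumption: a title broken immediately after a hyphen would need special handling that is not shown here.

```python
def parse_rt_lines(lines):
    """Return the title contained in a run of RT lines, without the quotes."""
    # Drop the two-letter line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    # The title is enclosed in double quotes and followed by a semicolon.
    if not (text.startswith('"') and text.endswith('";')):
        raise ValueError(f"unrecognised RT content: {text!r}")
    return text[1:-2]


title = parse_rt_lines([
    'RT "New insulin-like proteins with atypical disulfide bond pattern',
    "RT characterized in Caenorhabditis elegans by comparative sequence",
    'RT analysis and homology modeling.";',
])
```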
- 3.10.7. The RL line
The RL (Reference Location) lines contain the conventional citation information for the
reference. In general, the RL lines alone are sufficient to find the paper in question.
- a) Journal citations
The RL line for a journal citation includes the journal abbreviation, the volume number, the
page range and the year. The format for such an RL line is:
RL Journal_abbrev Volume:First_page-Last_page(YYYY).
Journal names are abbreviated according to the conventions used by the
National Library of Medicine (NLM) and are based on the existing ISO and
ANSI standards. A list of the abbreviations currently in use is given in
the Swiss-Prot document file jourlist.txt.
An example of an RL line is:
RL J. Mol. Biol. 168:321-331(1983).
When a reference is made to a paper which is 'in press' at the time the
database is released, the page range, and possibly the volume number, are
indicated as '0' (zero). An example of such an RL line is shown
here:
RL Arch. Biochem. Biophys. 411:0-0(2003).
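The journal form of the RL line can be recognised with a regular expression. The Python sketch below is an illustration, not a complete RL parser: the names JOURNAL_RL and parse_journal_rl are hypothetical, and the other RL forms simply yield None.

```python
import re

# Journal_abbrev Volume:First_page-Last_page(YYYY).
JOURNAL_RL = re.compile(
    r"^(?P<journal>.+) (?P<volume>[^\s:]+):"
    r"(?P<first>\w+)-(?P<last>\w+)\((?P<year>\d{4})\)\.$"
)


def parse_journal_rl(line):
    """Return the fields of a journal RL line, or None for other RL forms."""
    match = JOURNAL_RL.match(line[2:].strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    # A page range of 0-0 marks a paper that was 'in press'.
    fields["in_press"] = fields["first"] == "0" and fields["last"] == "0"
    return fields


citation = parse_journal_rl("RL J. Mol. Biol. 168:321-331(1983).")
```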
- b) Book citations
A variation of the RL line format is used for papers found in books or
other types of publication, which are then cited using the following
format:
RL (In) Editor_1 I.[, Editor_2 I., Editor_X I.] (eds.);
RL Book_name, pp.[Volume:]First_page-Last_page, Publisher, City (YYYY).
Examples:
RL (In) Boyer P.D. (eds.);
RL The enzymes (3rd ed.), pp.11:397-547, Academic Press, New York (1975).
RL (In) Rich D.H., Gross E. (eds.);
RL Proceedings of the 7th american peptide symposium, pp.69-72,
RL Pierce Chemical Co., Rockford Il. (1981).
RL (In) Magnusson S., Ottesen M., Foltmann B., Dano K.,
RL Neurath H. (eds.);
RL Regulatory proteolytic enzymes and their inhibitors, pp.163-172,
RL Pergamon Press, New York (1978).
- c) Plant Gene Register and Worm Breeder's Gazette citations
The '(In)' prefix used for books (see above) is also used for references to the electronic
Plant Gene Register (http://www.tarweed.com/pgr/) as well as to the Worm Breeder's
Gazette (http://elegans.swmed.edu/wli/). Examples:
RL (In) Plant Gene Register PGR98-023.
RL (In) Worm Breeder's Gazette 15(3):34(1998).
- d) Unpublished results
RL lines for unpublished results follow the format shown in the next example:
RL Unpublished results, cited by:
RL Shelnutt J.A., Rousseau D.L., Dethmers J.K., Margoliash E.;
RL Biochemistry 20:6485-6497(1981).
- e) Unpublished observations
For unpublished observations the format of the RL line is:
RL Unpublished observations (MMM-YYYY).
Where 'MMM' is the month and 'YYYY' is the year.
We use the 'unpublished observations' RL line to cite communications by scientists to
Swiss-Prot of unpublished information concerning various aspects of a sequence entry.
- f) Thesis
For Ph.D. theses the format of the RL line is:
RL Thesis (Year), Institution_name, Country.
An example of such a line is given here:
RL Thesis (1977), University of Geneva, Switzerland.
- g) Patent applications
For patent applications the format of the RL line is:
RL Patent number Pat_num, DD-MMM-YYYY.
Where 'Pat_num' is the international publication number of the patent, 'DD' is the day, 'MMM'
is the month and 'YYYY' is the year. Example:
RL Patent number WO9010703, 20-SEP-1990.
- h) Submissions
The final form that an RL line can take is that used for submissions. The
format of such an RL line is:
RL Submitted (MMM-YYYY) to the Database_name.
Where 'MMM' is the month, 'YYYY' is the year and 'Database_name' is one of
the following:
- EMBL/GenBank/DDBJ databases
- Swiss-Prot data bank
- HIV data bank
- PDB data bank
- PIR data bank
Two examples of submission RL lines are given here:
RL Submitted (OCT-1995) to the EMBL/GenBank/DDBJ databases.
RL Submitted (JUL-1998) to the Swiss-Prot data bank.
The CC lines are free text comments on the entry, and are used to convey
any useful information. The comments always appear below the last reference
line and are grouped together in comment blocks; a block is made up of 1 or
more comment lines. The first line of a block starts with the characters
'-!-'.
The format of a comment block is:
CC -!- TOPIC: First line of a comment block;
CC second and subsequent lines of a comment block.
The comment blocks are arranged according to what we designate as 'topics'. The current
topics and their definitions are listed in the table below.
Topic | Description
ALTERNATIVE PRODUCTS | Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene or by the use of alternative initiation codons; see 3.11.1
BIOTECHNOLOGY | Description of the use of a specific protein in a biotechnological process
CATALYTIC ACTIVITY | Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION | Warning about possible errors and/or grounds for confusion
COFACTOR | Description of an enzyme cofactor
DATABASE | Description of a cross-reference to a network database/resource for a specific protein; see 3.11.2
DEVELOPMENTAL STAGE | Description of the developmentally-specific expression of a protein
DISEASE | Description of the disease(s) associated with a deficiency of a protein
DOMAIN | Description of the domain structure of a protein
ENZYME REGULATION | Description of an enzyme regulatory mechanism
FUNCTION | General description of the function(s) of a protein
INDUCTION | Description of the compound(s) which stimulate the synthesis of a protein
MASS SPECTROMETRY | Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods; see 3.11.3
MISCELLANEOUS | Any comment which does not belong to any of the other defined topics
PATHWAY | Description of the metabolic pathway(s) with which a protein is associated
PHARMACEUTICAL | Description of the use of a protein as a pharmaceutical drug
POLYMORPHISM | Description of polymorphism(s)
PTM | Description of a posttranslational modification
SIMILARITY | Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION | Description of the subcellular location of the mature protein
SUBUNIT | Description of the quaternary structure of a protein
TISSUE SPECIFICITY | Description of the tissue specificity of a protein
Note:
[1] For the 'CATALYTIC ACTIVITY' topic:
To describe the catalytic activity of an enzyme we have used, whenever
possible, the recommendations of the Nomenclature Committee of the
International Union of Biochemistry and Molecular Biology (IUBMB) as
published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New York
(1992).
Each Swiss-Prot entry contains a variable number of CC line topics. Most
topics can be present more than once in a given entry. Topics that
occur only once in an entry are: ALTERNATIVE PRODUCTS, COFACTOR,
DEVELOPMENTAL STAGE, ENZYME REGULATION, INDUCTION, SUBCELLULAR LOCATION,
SUBUNIT and TISSUE SPECIFICITY.
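The block structure of the CC lines, and the rule that some topics occur only once, can be sketched in Python as shown here. The function names are hypothetical. The sketch recognises the start of a block by the characters '-!-' and treats every other CC line as a continuation of the preceding block.

```python
from collections import Counter

SINGLE_TOPICS = {
    "ALTERNATIVE PRODUCTS", "COFACTOR", "DEVELOPMENTAL STAGE",
    "ENZYME REGULATION", "INDUCTION", "SUBCELLULAR LOCATION",
    "SUBUNIT", "TISSUE SPECIFICITY",
}


def split_comment_blocks(lines):
    """Return a list of (topic, text) pairs, one per comment block."""
    blocks = []
    for line in lines:
        content = line[2:].strip()  # drop the two-letter line code
        if content.startswith("-!-"):
            # The topic runs up to the first colon.
            topic, _, text = content[3:].partition(":")
            blocks.append([topic.strip(), text.strip()])
        elif blocks:
            # Second and subsequent lines of the current block.
            blocks[-1][1] = f"{blocks[-1][1]} {content}".strip()
    return [tuple(block) for block in blocks]


def repeated_single_topics(blocks):
    """Return the single-occurrence topics that appear more than once."""
    counts = Counter(topic for topic, _ in blocks)
    return sorted(t for t, n in counts.items() if n > 1 and t in SINGLE_TOPICS)
```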
- 3.11.1. Syntax of the topic 'ALTERNATIVE PRODUCTS'
The format of the CC line topic ALTERNATIVE PRODUCTS is:
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative promoter;
CC Comment=Free text;
CC Event=Alternative splicing; Named isoforms=n;
CC Comment=Optional free text;
CC Name=Isoform_1; Synonyms=Synonym_1[, Synonym_n];
CC IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC Sequence=Displayed;
CC Note=Free text;
CC Name=Isoform_n; Synonyms=Synonym_1[, Synonym_n];
CC IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC Note=Free text;
CC Event=Alternative initiation;
CC Comment=Free text;
The qualifiers are described in the table below:
Qualifier |
Description |
Event |
Biological process that results in the production of the alternative
forms (Alternative promoter, Alternative splicing, Alternative initiation).
Format: Event=controlled vocabulary;
Example: Event=Alternative splicing; |
Named isoforms |
Number of isoforms listed under the 'Name' qualifiers; currently given only for 'Event=Alternative splicing'.
Format: Named isoforms=number;
Example: Named isoforms=6; |
Comment |
Any comments concerning one or more isoforms;
optional for 'Alternative splicing';
in case of 'Alternative promoter' and 'Alternative initiation' there is always a
'Comment' of free text, which includes relevant information on the isoforms.
Format: Comment=free text;
Example: Comment=Experimental confirmation may be lacking for some isoforms; |
Name |
A common name for an isoform used in the literature or assigned by Swiss-Prot; currently only available for spliced isoforms.
Format: Name=common name;
Example: Name=Alpha; |
Synonyms |
Synonyms for an isoform as used in the literature; optional; currently only available for spliced isoforms.
Format: Synonyms=Synonym_1[, Synonym_n];
Example: Synonyms=B, KL5; |
IsoId |
Unique identifier for an isoform, consisting of the Swiss-Prot accession
number, followed by a dash and a number.
Format: IsoId=acc#-isoform_number[, acc#-isoform_number];
Example: IsoId=P05067-1; |
Sequence |
Information on the isoform sequence; the term 'Displayed' indicates that the sequence
is shown in the entry; a list of feature identifiers
(VSP_#) indicates that the isoform is annotated in the feature table; the FTIds enable programs to create
the sequence of a splice variant; if the accession number of the IsoId does not correspond to the accession
number of the current entry, this qualifier contains the term 'External';
'Not described' indicates that the sequence of the isoform is unknown.
Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described;
Example: Sequence=Displayed;
Example: Sequence=VSP_000013, VSP_000014;
Example: Sequence=External;
Example: Sequence=Not described;
|
Note |
Lists isoform-specific information; optional.
Format: Note=Free text;
Example: Note=No experimental confirmation available;
|
Example of the CC lines and the corresponding FT lines for an entry
with alternative splicing:
...
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=6;
CC Name=1;
CC IsoId=Q15746-4; Sequence=Displayed;
CC Name=2;
CC IsoId=Q15746-5; Sequence=VSP_000040;
CC Name=3A;
CC IsoId=Q15746-6; Sequence=VSP_000041, VSP_000043;
CC Name=3B;
CC IsoId=Q15746-7; Sequence=VSP_000040, VSP_000041, VSP_000042;
CC Name=4;
CC IsoId=Q15746-8; Sequence=VSP_000041, VSP_000042;
CC Name=del-1790;
CC IsoId=Q15746-9; Sequence=VSP_000044;
...
FT VARSPLIC 437 506 VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (IN
FT ISOFORM 2 AND ISOFORM 3B).
FT /FTId=VSP_000040;
FT VARSPLIC 1433 1439 DEVEVSD -> MKWRCQT (IN ISOFORM 3A,
FT ISOFORM 3B AND ISOFORM 4).
FT /FTId=VSP_000041;
FT VARSPLIC 1473 1546 GKFGQVFRLVEKKTRKVWAGKFFKAYSAKEKENIRQEISIM
FT NCLHHPKLVQCVDAFEEKANIVMVLEIVSGGEL -> L
FT (IN ISOFORM 4).
FT /FTId=VSP_000042;
FT VARSPLIC 1655 1705 MISSING (IN ISOFORM 3A AND ISOFORM 3B).
FT /FTId=VSP_000043;
FT VARSPLIC 1790 1790 MISSING (IN ISOFORM DEL-1790).
FT /FTId=VSP_000044;
...
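To illustrate how the FTIds listed in a 'Sequence' qualifier allow a program to create the sequence of a splice variant, here is a minimal Python sketch. The function name build_isoform, the toy sequence and the feature identifiers VSP_900001 and VSP_900002 are all hypothetical. The sketch assumes that each VARSPLIC feature has already been reduced to a (start, end, replacement) triple with 1-based inclusive positions, that 'MISSING' is represented by an empty replacement, and that the features of one isoform do not overlap.

```python
def build_isoform(displayed, features, vsp_ids):
    """Apply the listed VARSPLIC features to the displayed sequence."""
    # Apply features from the C-terminal end backwards, so that earlier
    # positions remain valid after each replacement.
    chosen = sorted((features[vsp] for vsp in vsp_ids), reverse=True)
    sequence = displayed
    for start, end, replacement in chosen:
        sequence = sequence[:start - 1] + replacement + sequence[end:]
    return sequence


# Toy data, not taken from any Swiss-Prot entry.
displayed = "MKTAYIAKQR"
features = {
    "VSP_900001": (3, 5, "G"),  # TAY -> G
    "VSP_900002": (8, 8, ""),   # MISSING
}
variant = build_isoform(displayed, features, ["VSP_900001", "VSP_900002"])
```

An isoform whose 'Sequence' qualifier is 'Displayed' corresponds to an empty list of features, in which case the displayed sequence is returned unchanged.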
- 3.11.2. Syntax of the topic 'DATABASE'
CC -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
Where:
- 'NAME' is the name of the database;
- 'NOTE' (optional) is a free text note;
- 'WWW' (optional) is the WWW address (URL) of the database;
- 'FTP' (optional) is the anonymous FTP address (including the directory name) where the database file(s) are stored.
Note: this is currently the only part of the entry where lines longer than 75 characters
can be found, because long URLs or FTP addresses are not split across multiple lines.
- 3.11.3. Syntax of the topic 'MASS SPECTROMETRY'
CC -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][; RANGE=XX-XX].
Where:
- 'MW=XXX' is the determined molecular weight (MW);
- 'MW_ERR=XX' (optional) is the accuracy or error range of the MW measurement;
- 'METHOD=XX' (optional) is the mass spectrometric method;
- 'RANGE=XX-XX' (optional) is used to indicate what part of the protein sequence entry
corresponds to the molecular weight. If this qualifier is not present, the MW value
corresponds to the full length of the protein sequence.
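The qualifiers of a 'MASS SPECTROMETRY' comment can be read into a small dictionary, as in the Python sketch below. The function name parse_mass_spectrometry and the lower-case keys of the result are hypothetical choices. The input is assumed to be the text of the comment after the topic name, already joined into one string.

```python
def parse_mass_spectrometry(text):
    """Parse the text of a MASS SPECTROMETRY comment into a dict."""
    # Remove only the terminating period; MW values may contain a period.
    if text.endswith("."):
        text = text[:-1]
    fields = {}
    for part in text.split(";"):
        key, _, value = part.strip().partition("=")
        fields[key] = value
    result = {"mw": float(fields["MW"])}  # MW is the only mandatory qualifier
    if "MW_ERR" in fields:
        result["mw_err"] = float(fields["MW_ERR"])
    if "METHOD" in fields:
        result["method"] = fields["METHOD"]
    if "RANGE" in fields:
        first, _, last = fields["RANGE"].partition("-")
        result["range"] = (int(first), int(last))
    return result


measurement = parse_mass_spectrometry("MW=8597.5; METHOD=Electrospray; RANGE=40-119.")
```

When no 'range' key is present, the molecular weight applies to the full length of the protein sequence.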
- 3.11.4. Examples for each comment line topic
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=3;
CC Comment=Additional isoforms seem to exist. Experimental
CC confirmation may be lacking for some isoforms;
CC Name=1; Synonyms=Aire-1;
CC IsoId=O43918-1; Sequence=Displayed;
CC Name=2; Synonyms=Aire-2;
CC IsoId=O43918-2; Sequence=VSP_004089;
CC Name=3; Synonyms=Aire-3;
CC IsoId=O43918-3; Sequence=VSP_004089, VSP_004090;
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative initiation;
CC Comment=2 isoforms, Caveolin-2 alpha and Caveolin-2 beta, are
CC produced by alternative initiation;
CC -!- BIOTECHNOLOGY: The effect of PG can be neutralized by introducing
CC an antisense PG gene by genetic manipulation. The Flavr Savr
CC tomato produced by Calgene (Monsanto) in such a manner has a
CC longer shelf life due to delayed ripening.
CC -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC liquefaction of starch-containing mashes and in the detergent
CC industry to remove starch. Sold under the name Termamyl by
CC Novozymes.
CC -!- CATALYTIC ACTIVITY: ATP + L-glutamate + NH(3) = ADP + phosphate +
CC L-glutamine.
CC -!- CATALYTIC ACTIVITY: (R)-2,3-dihydroxy-3-methylbutanoate + NADP(+)
CC = (S)-2-hydroxy-2-methyl-3-oxobutanoate + NADPH.
CC -!- CAUTION: Ref.2 sequence differs from that shown in positions 92 to
CC 165 due to frameshifts.
CC -!- CAUTION: It is uncertain whether Met-1 or Met-3 is the initiator.
CC -!- COFACTOR: Pyridoxal phosphate.
CC -!- COFACTOR: FAD and nonheme iron.
CC -!- DATABASE: NAME=CD40Lbase;
CC NOTE=European CD40L defect database (mutation db);
CC WWW="http://www.expasy.org/cd40lbase/";
CC FTP="ftp://ftp.expasy.org/databases/cd40lbase".
CC -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".
CC -!- DEVELOPMENTAL STAGE: Expressed early during conidial (dormant
CC spores) differentiation.
CC -!- DEVELOPMENTAL STAGE: Detected in embryonic skin (E12.5 and E14.5)
CC during the formation of hair follicles and at E15.5 in the enamel
CC knot of the developing tooth. Detected in the basal layer of the
CC epidermis and hair follicles of P2 mice.
CC -!- DISEASE: Defects in PHKA1 are linked to X-linked muscle
CC glycogenosis [MIM:311870]; a disease characterized by slowly
CC progressive, predominantly distal muscle weakness and atrophy.
CC -!- DISEASE: Defects in ABCD1 are the cause of recessive X-linked
CC adrenoleukodystrophy (X-ALD) [MIM:300100], a rare peroxisomal
CC metabolic disorder that occurs in boys and is characterized by
CC progressive multifocal demyelination of the central nervous system
CC and by adrenocortical insufficiency. It produces mental
CC deterioration, corticospinal tract dysfunction, and cortical
CC blindness. There is laboratory evidence of adrenal cortical
CC dysfunction. Different clinical manifestations exist like:
CC cerebral childhood ALD (CALD), adult cerebral ALD (ACALD),
CC adrenomyeloneuropathy (AMN) and "Addison disease only" (ADO)
CC phenotype.
CC -!- DOMAIN: Contains a coiled-coil domain essential for vesicular
CC transport and a dispensable C-terminal region.
CC -!- DOMAIN: The B chain is composed of two domains, each domain
CC consists of 3 homologous subdomains (alpha, beta, gamma).
CC -!- ENZYME REGULATION: The activity of this enzyme is controlled
CC by adenylation. The fully adenylated enzyme complex is inactive.
CC -!- ENZYME REGULATION: Activated by Gram-negative bacterial
CC lipopolysaccharides and chymotrypsin.
CC -!- FUNCTION: Binds to actin and affects the structure of the
CC cytoskeleton. At high concentrations, profilin prevents the
CC polymerization of actin, whereas it enhances it at low
CC concentrations. By binding to PIP2, it inhibits the formation of
CC IP3 and DG.
CC -!- FUNCTION: Inhibitor of fungal polygalacturonase. It is an
CC important factor for plant resistance to phytopathogenic fungi.
CC Substrate preference is polygalacturonase (PG) from A.niger >> PG
CC of F.oxysporum, A.solani or B.cinerea. Not active on PG from
CC F.moniliforme.
CC -!- INDUCTION: By heat shock, salt stress, oxidative stress, glucose
CC limitation and oxygen limitation.
CC -!- INDUCTION: By infection, plant wounding, or elicitor treatment of
CC cell cultures.
CC -!- MASS SPECTROMETRY: MW=24948; MW_ERR=6; METHOD=MALDI.
CC -!- MASS SPECTROMETRY: MW=8597.5; METHOD=Electrospray; RANGE=40-119.
CC -!- MISCELLANEOUS: Binds to bacitracin.
CC -!- MISCELLANEOUS: Called DUO because the encoded protein is closely
CC related to but shorter than TRIO.
CC -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.
CC -!- PATHWAY: Degradation of allantoin (purine catabolism); third step.
CC -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC Betaseron (Berlex) and Rebif (Serono). Used in the treatment of
CC multiple sclerosis (MS). Betaseron is a slightly modified form
CC of IFNB1 with two residue substitutions.
CC -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron). Used
CC in patients with renal cell carcinoma or metastatic melanoma.
CC -!- POLYMORPHISM: The allelic form of the enzyme with Gln-191
CC (Allozyme A) hydrolyzes paraoxon with a low turnover number and
CC the one with Arg-191 (Allozyme B) with a high turnover number.
CC -!- POLYMORPHISM: The two main alleles of HP are called HP1F (fast)
CC and HP1S (slow). The sequence shown here is that of the HP1S form.
CC -!- PTM: N-glycosylated and probably also O-glycosylated.
CC -!- PTM: A soluble short 95 kDa form may be released by proteolytic
CC cleavage from the long membrane-anchored form.
CC -!- SIMILARITY: Belongs to the annexin family.
CC -!- SIMILARITY: Contains 13 EGF-like domains.
CC -!- SUBCELLULAR LOCATION: Mitochondrial matrix.
CC -!- SUBCELLULAR LOCATION: Integral membrane protein. Inner membrane.
CC -!- SUBUNIT: Homotetramer.
CC -!- SUBUNIT: Disulfide-linked heterodimer of a light chain (L) and a
CC heavy chain (H). The light chain has the pharmacological activity,
CC while the N- and C-terminal of the heavy chain mediate channel
CC formation and toxin binding, respectively.
CC -!- TISSUE SPECIFICITY: Shoots, roots, and cotyledon from dehydrating
CC seedlings.
CC -!- TISSUE SPECIFICITY: Expressed at high levels in brain and ovary.
CC Lower levels in small intestine. In brain regions, detected in all
CC regions tested. Highest levels in the cerebellum and cerebral
CC cortex.
- 3.12.1. Definition
The DR (Database cross-Reference) lines are used as pointers to information related to
Swiss-Prot entries and found in data collections other than Swiss-Prot. The full list of
all databases to which Swiss-Prot is cross-referenced can be found in the document file
dbxref.txt.
For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in
the Protein Data Bank (PDB), there will be one DR line pointing to each of the corresponding
entries in PDB. For a sequence translated from a nucleotide sequence, there will
be one or more DR lines pointing to the relevant entries in the EMBL/GenBank/DDBJ database
that correspond to the DNA or RNA sequence(s) from which it was translated.
The format of the DR line is:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
Exceptions are cross-references to the EMBL/GenBank/DDBJ nucleotide
sequence database and PROSITE. The specific formats for these
cross-references are described in sections 3.12.6
and 3.12.7.
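The general DR format can be split into its items with a few lines of Python. This is only a sketch: the function name parse_dr_line is hypothetical, and the line used in the usage example is a placeholder, not a real cross-reference. Any items after the primary identifier are returned as a list, so that the cross-references with a different number of items (EMBL/GenBank/DDBJ and PROSITE) are not rejected.

```python
def parse_dr_line(line):
    """Split a DR line into database identifier, primary identifier and the rest."""
    content = line[2:].strip()  # drop the two-letter line code
    # Remove only the terminating period; identifiers may contain periods.
    if content.endswith("."):
        content = content[:-1]
    fields = [field.strip() for field in content.split(";")]
    if len(fields) < 2:
        raise ValueError(f"unrecognised DR content: {content!r}")
    return {"database": fields[0], "primary": fields[1], "other": fields[2:]}


# Placeholder line, for illustration only.
xref = parse_dr_line("DR EXAMPLEDB; PRIMARY_ID; SECONDARY_ID.")
```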
- 3.12.2. Database identifier
The first item on the DR line, the 'DATABASE_IDENTIFIER', is the
abbreviated name of the data collection to which reference is made. The
currently defined database identifiers are listed below.
Identifier | Database description | Reference
EMBL | Nucleotide sequence database of EMBL/EBI (see 3.12.6) | Nucleic Acids Res. 30:21-26(2002); PMID: 11752244
Aarhus/Ghent-2DPAGE | Human keratinocyte 2D gel protein database from Aarhus and Ghent universities | FEBS Lett. 430:64-72(1998); PMID: 9678596
ANU-2DPAGE | Australian National University 2-DE database | Proteomics 1:1149-1161(2001); PMID: 11990509
COMPLUYEAST-2DPAGE | 2-DE database at Universidad Complutense de Madrid | J. Chromatogr. B. Biomed. Sci. Appl. (in press)
DictyDb | Dictyostelium discoideum genome database | (In) Maeda Y., Inouye K. and Takeuchi I. (eds.); Dictyostelium. A Model System for Cell and Developmental Biology. pp.471-477, Universal Academic Press, Tokyo (1997)
ECO2DBASE | Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE) | Electrophoresis 20:2149-2159(1999); PMID: 9298644
EcoGene | Escherichia coli K12 genome database (EcoGene) | Nucleic Acids Res. 28:60-64(2000); PMID: 10592181
FlyBase | Drosophila genome database (FlyBase) | Nucleic Acids Res. 30:106-108(2002); PMID: 11752267
Genew | Human gene nomenclature database (Genew) | Nucleic Acids Res. 30:169-171(2002); PMID: 11752283
GlycoSuiteDB | Database of glycan structures (GlycoSuiteDB) | Nucleic Acids Res. 29:332-335(2001); PMID: 11125129
GO | Gene Ontology (GO) database | Genome Res. 12:1982-1991(2002); PMID: 12466303
Gramene | Comparative mapping resource for grains (Gramene) | Nucleic Acids Res. 30:103-105(2002); PMID: 11752266
HAMAP | Database of microbial protein families (HAMAP) | Comput. Biol. Chem. 27:49-58(2003)
HIV | HIV sequence database | Kuiken C.L. et al., In: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM.
HSC-2DPAGE | Harefield hospital 2D gel protein databases (HSC-2DPAGE) | Electrophoresis 18:471-479(1997); PMID: 9150926
HSSP | Homology-derived secondary structure of proteins database (HSSP) | Nucleic Acids Res. 27:244-247(1999); PMID: 9847191
InterPro | Integrated resource of protein families, domains and functional sites (InterPro) | Nucleic Acids Res. 31:315-318(2003); PMID: 12520011
Leproma | Mycobacterium leprae genome database (Leproma) | Lepr. Rev. 72:470-477(2001); PMID: 11826483
MaizeDB | Maize genome database (MaizeDB) | Polacco M., Chen S., Coe E., Hancock D.C., Kross H., Schroeder S. and Vargo C.; Maize Genetics Conference Abstracts 41 (1999)
Maize-2DPAGE | Maize genome 2D Electrophoresis database (Maize-2DPAGE) | Theor. Appl. Genet. 93:997-1005(1996)
MEROPS | Peptidase database (MEROPS) | Nucleic Acids Res. 30:343-346(2002); PMID: 11752332
MGD | Mouse genome database (MGD) | Nucleic Acids Res. 30:113-115(2002); PMID: 11752269
MIM | Mendelian Inheritance in Man Database (MIM) | Nucleic Acids Res. 30:52-55(2002); PMID: 11752252
MypuList | Mycoplasma pulmonis genome database (MypuList) |
PDB | 3D-macromolecular structure Protein Data Bank (PDB) | Nucleic Acids Res. 30:245-248(2002); PMID: 11752306
Pfam | Pfam protein domain database | Nucleic Acids Res. 30:276-280(2002); PMID: 11752314
PHCI-2DPAGE | Parasite host cell interaction 2D-PAGE database |
PhosSite | Phosphorylation Site Database for prokaryotic proteins | (In) Leslie M. (ed.); NetWatch. Science 294:1623-1623(2001)
PIR | Protein sequence database of the Protein Information Resource (PIR) | Nucleic Acids Res. 30:35-37(2002); PMID: 11752247
PMMA-2DPAGE | Purkyne Military Medical Academy 2D-PAGE database |
PRINTS | Protein Fingerprint database (PRINTS) | Nucleic Acids Res. 30:239-241(2002); PMID: 11752304
ProDom | ProDom protein domain database | Brief. Bioinform. 3:246-251(2002); PMID: 12230033
PROSITE | PROSITE protein domain and family database (see 3.12.7) | Nucleic Acids Res. 30:235-238(2002); PMID: 11752303
REBASE | Restriction enzymes and methylases database (REBASE) | Nucleic Acids Res. 29:268-269(2001); PMID: 11125108
SGD | Saccharomyces Genome Database (SGD) | Nucleic Acids Res. 30:69-72(2002); PMID: 11752257
Siena-2DPAGE | 2D-PAGE database from the Department of Molecular Biology, University of Siena, Italy |
SMART | Simple Modular Architecture Research Tool (SMART) | Nucleic Acids Res. 30:242-244(2002); PMID: 11752305
StyGene | Salmonella typhimurium LT2 genome database (StyGene) |
SubtiList | Bacillus subtilis 168 genome database (SubtiList) | Nucleic Acids Res. 30:62-65(2002); PMID: 11752255
SWISS-2DPAGE | 2D-PAGE database from the Geneva University Hospital (SWISS-2DPAGE) | Nucleic Acids Res. 28:286-288(2000); PMID: 10592248
TIGR | The bacterial databases of 'The Institute for Genomic Research' (TIGR) | Nucleic Acids Res. 29:159-164(2001); PMID: 11125077
TIGRFAMs | TIGR protein family database (TIGRFAMs) | Nucleic Acids Res. 29:41-43(2001); PMID: 11125044
TRANSFAC | Transcription factor database (TRANSFAC) | Nucleic Acids Res. 29:281-283(2001); PMID: 11125113
TubercuList | Mycobacterium tuberculosis H37Rv genome database (TubercuList) | FEBS Lett. 452:7-10(1999); PMID: 10376668
WormPep | Caenorhabditis elegans genome sequencing project protein database (WormPep) | Genomics 46:200-216(1997); PMID: 9417907
ZFIN | Zebrafish Information Network genome database (ZFIN) | Nucleic Acids Res. 29:87-90(2001); PMID: 11125057
- 3.12.3. The primary identifier
The second item on the DR line, the 'PRIMARY_IDENTIFIER', is an unambiguous
pointer to the information entry in the database to which the reference is
being made.
- For an ANU-2DPAGE, COMPLUYEAST-2DPAGE, DictyDb, EcoGene,
FlyBase, GlycoSuiteDB, GO, Gramene, HIV, HSC-2DPAGE, InterPro, MAIZE-2DPAGE,
MEROPS, MGD, MIM, Pfam, PHCI-2DPAGE, PhosSite, PIR, PMMA-2DPAGE, PRINTS, ProDom, REBASE,
SGD, Siena-2DPAGE, SMART, StyGene, SubtiList, SWISS-2DPAGE, TIGRFAMs, TRANSFAC or ZFIN
reference the primary identifier is the first accession number (also called
the Unique Identifier in some databases) of the entry to which reference is
being made.
- For a Genew reference the primary identifier is the unique identifier assigned by the HUGO Gene Nomenclature Committee.
- For a HAMAP reference the primary identifier is the unique identifier of a HAMAP protein family.
- For a PDB reference the primary identifier is the entry name.
- For an AARHUS/GHENT-2DPAGE or ECO2DBASE reference the primary
identifier is the alphanumeric designation of the protein spot.
- For a WormPep reference the primary identifier is the cosmid-derived name
given to the protein by the C.elegans genome sequencing project.
- For a MaizeDB reference the primary identifier is the 'Gene-product' accession ID.
- For a Leproma, MypuList, TIGR or TubercuList reference the primary identifier is the
genome Open Reading Frame (ORF) code.
- For an HSSP reference the primary identifier is the accession number of a Swiss-Prot
entry cross-referenced to a PDB entry whose structure is expected to be similar
to that of the entry in which the HSSP cross-reference is present.
- 3.12.4. The secondary identifier
The third item on the DR line, the 'SECONDARY_IDENTIFIER', is generally used to
complement the information given by the primary identifier.
- For an HIV, InterPro, Pfam, PIR, PRINTS, ProDom, REBASE, SMART or TIGRFAMs reference the
secondary identifier is the entry name.
- For a PDB reference the secondary identifier is the most recent date on which PDB revised the entry (last 'REVDAT' record - or 'PRELIMINARY').
- For a DictyDb, EcoGene, FlyBase, Genew, MGD, SGD, StyGene, SubtiList or ZFIN
reference the secondary identifier is the gene designation. If the gene designation
is not available, a dash ('-') is used.
- For a GO reference, the secondary identifier is a 1-letter abbreviation for one of the 3 ontology aspects,
separated from the GO term by a colon. If the term is longer than 46 characters, the first 43 characters
are indicated followed by 3 dots ('...'). The abbreviations for the 3 distinct aspects of the ontology are
P (biological Process), F (molecular Function), and C (cellular Component).
- For a HAMAP reference the secondary identifier indicates whether a domain is 'atypical' and/or 'fused'; otherwise a dash ('-') is stored in that field.
- For an ECO2DBASE reference the secondary identifier is the latest release
number or edition of the database that has been used to derive the cross-reference.
- For a SWISS-2DPAGE, HSC-2DPAGE or MAIZE-2DPAGE reference the
secondary identifier is the species or tissue of origin.
- For an AARHUS/GHENT-2DPAGE reference the secondary identifier is either 'IEF'
(for isoelectric focusing) or 'NEPHGE' (for non-equilibrium pH gradient electrophoresis).
- For a WormPep reference the secondary identifier is a number attributed by the
C.elegans genome-sequencing project to that protein.
- For an ANU-2DPAGE, COMPLUYEAST-2DPAGE, GlycoSuiteDB, Gramene,
Leproma, MaizeDB, MEROPS, MIM, MypuList, PHCI-2DPAGE, PhosSite, PMMA-2DPAGE,
Siena-2DPAGE, TIGR, TRANSFAC or TubercuList reference the secondary
identifier is not used and a dash ('-') is stored in that field.
- For an HSSP reference the secondary identifier is the entry name of the PDB
structure related to that of the entry in which the HSSP cross-reference is present.
- 3.12.5. The tertiary identifier
A limited number of DR lines possess a fourth item, the 'TERTIARY_IDENTIFIER',
which is generally used to give more detailed information.
- For the protein domain and family databases HAMAP, Pfam, ProDom, SMART and TIGRFAMs the tertiary
identifier is the number of hits found in the sequence.
- For GO the tertiary identifier is a 3-character GO evidence code. The meaning of the evidence codes is:
IDA=inferred from direct assay, IMP=inferred from mutant phenotype, IGI=inferred from genetic interaction,
IPI=inferred from physical interaction, IEP=inferred from expression pattern, TAS=traceable author statement,
NAS=non-traceable author statement, IC=inferred by curator, ISS=inferred from sequence or structural similarity.
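As a rough illustration of the GO conventions described above, the following Python sketch splits a GO cross-reference into its aspect, term and evidence code. The function name `parse_go_dr` and the layout of the returned dictionary are assumptions made for this example; they are not part of the Swiss-Prot format.

```python
# Aspect abbreviations and evidence codes as listed in the text above.
GO_ASPECTS = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}
GO_EVIDENCE = {
    "IDA": "inferred from direct assay",
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "IEP": "inferred from expression pattern",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IC": "inferred by curator",
    "ISS": "inferred from sequence or structural similarity",
}


def parse_go_dr(line):
    """Split a 'DR GO; ...' line into its parts (hypothetical helper)."""
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    # Assumes the GO term itself contains no semicolon.
    database, go_id, aspect_term, evidence = [f.strip() for f in body.split(";")]
    if database != "GO":
        raise ValueError("not a GO cross-reference")
    aspect, term = aspect_term.split(":", 1)
    return {
        "id": go_id,
        "aspect": GO_ASPECTS[aspect],
        "term": term,
        # Terms longer than 46 characters are cut and end in three dots.
        "truncated": term.endswith("..."),
        "evidence": GO_EVIDENCE[evidence],
    }
```

Applied to the GO example line from this section, the sketch should report the aspect 'molecular function', the term 'DNA binding' and the evidence 'traceable author statement'.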
Examples of complete DR lines are shown here:
DR Aarhus/Ghent-2DPAGE; 8006; IEF.
DR ANU-2DPAGE; Q9XEA8; -.
DR COMPLUYEAST-2DPAGE; P41797; -.
DR DictyDb; DD01047; myoA.
DR ECO2DBASE; G052.0; 6TH EDITION.
DR EcoGene; EG10054; araC.
DR FlyBase; FBgn0000055; Adh.
DR Genew; HGNC:12849; YWHAB.
DR GlycoSuiteDB; P05067; -.
DR GO; GO:0003677; F:DNA binding; TAS.
DR Gramene; Q06967; -.
DR HAMAP; MF_00120; -; 1.
DR HIV; K02012; GAG$BH5.
DR HSC-2DPAGE; P47985; HUMAN.
DR HSSP; P00438; 1PBE.
DR InterPro; IPR001254; Trypsin.
DR Leproma; ML0485; -.
DR MaizeDB; 25342; -.
DR Maize-2DPAGE; P80607; COLEOPTILE.
DR MEROPS; M41.001; -.
DR MGD; MGI:87920; Adfp.
DR MIM; 249900; -.
DR MypuList; MYPU_4900; -.
DR PDB; 3ADK; 16-APR-88.
DR Pfam; PF00017; SH2; 1.
DR PhosSite; P00955; -.
DR PHCI-2DPAGE; Q9Z8U5; -.
DR PIR; A00682; KIPGA.
DR PMMA-2DPAGE; P04179; -.
DR PRINTS; PR00237; GPCRRHODOPSN.
DR ProDom; PD000511; Aconitase; 1.
DR REBASE; 993; EcoRI.
DR SGD; S0000170; AAR2.
DR Siena-2DPAGE; P38008; -.
DR SMART; SM00370; LRR; 6.
DR StyGene; SG10312; proV.
DR SubtiList; BG10774; oppD.
DR SWISS-2DPAGE; P10599; HUMAN.
DR TIGR; MJ0125; -.
DR TIGRFAMs; TIGR00630; uvra; 1.
DR TRANSFAC; T00141; -.
DR TubercuList; Rv0001; -.
DR WormPep; ZK637.7; CE00437.
DR ZFIN; ZDB-GENE-980526-290; hoxb1b.
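The general DR layout shown in the examples above can be read with a few lines of Python. This is a minimal sketch, assuming that items are separated by semicolons and that the line ends with a period; the function name `parse_dr` and the choice to return `None` for an unused secondary identifier are assumptions of this example.

```python
def parse_dr(line):
    """Return (database, primary, secondary, tertiary) from a DR line.

    Sketch only: assumes ';' separates the items and '.' ends the line.
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # remove only the final period
    fields = [f.strip() for f in body.split(";")]
    if len(fields) < 3:
        raise ValueError("expected at least three items on a DR line")
    database, primary, secondary = fields[:3]
    # Only some databases supply a fourth (tertiary) item.
    tertiary = fields[3] if len(fields) > 3 else None
    # A dash marks a secondary identifier that is not used.
    if secondary == "-":
        secondary = None
    return database, primary, secondary, tertiary
```

The meaning of each returned item still depends on the database named in the first item, as described in sections 3.12.3 to 3.12.5.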
3.12.6. Cross-references to the nucleotide sequence database
The specific format for cross-references to the EMBL/GenBank/DDBJ nucleotide
sequence database is:
DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.
where 'PROTEIN_ID' stands for the 'Protein Sequence Identifier'. It is a string which is
stored, in nucleotide sequence entries, in a qualifier called '/protein_id' which is tagged
to every CDS in the nucleotide database. Example from EMBL:
FT CDS 302..2674
FT /protein_id="CAA03857.1"
FT /db_xref="SWISS-PROT:P26345"
FT /gene="recA"
FT /product="RecA protein"
The Protein Sequence Identifier (Protein_ID) consists of a stable ID
portion (8 characters: 3 letters followed by 5 numbers) plus a period
and a version number. The version number only changes when the
protein sequence coded by the CDS changes, while the stable part remains
unchanged. The Protein_ID effectively replaces what was previously known as
the 'PID'.
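The structure of the Protein_ID described above (stable part, period, version number) lends itself to a simple check. The sketch below assumes that the three letters are upper case, as in the examples of this section; the names `PROTEIN_ID_RE` and `split_protein_id` are hypothetical.

```python
import re

# 3 letters followed by 5 digits, a period, then the version number.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Return (stable_part, version) for a Protein_ID such as 'CAA03857.1'."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError("unexpected Protein_ID format: %r" % protein_id)
    return match.group(1), int(match.group(2))
```

Two Protein_IDs that share the stable part but differ in version would then refer to the same CDS before and after a change in the encoded protein sequence.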
The 'STATUS_IDENTIFIER' provides information about the relationship between
the sequence in the Swiss-Prot entry and the CDS in the corresponding EMBL
entry:
a) In most cases the translation of the EMBL nucleotide sequence CDS
results in the same sequence as shown in the corresponding Swiss-Prot entry
or the differences are mentioned in the Swiss-Prot feature (FT) lines as
CONFLICT, VARIANT or VARSPLIC and in the RP lines. The status identifier
is then a dash ('-').
Example:
DR EMBL; Y00312; CAA68412.1; -.
b) In some cases the translation of the EMBL nucleotide sequence CDS
results in a sequence different from the sequence shown in the
corresponding Swiss-Prot entry. When the differences are either not
mentioned in the Swiss-Prot feature (FT) lines as CONFLICT, VARIANT or
VARSPLIC (see Appendix A) and in the RP lines, or
simply do not meet the criteria for such annotation, the differences are
indicated as follows:
- 1. If the difference is due to a different start of the
sequence (i.e. Swiss-Prot believes that the start of the sequence is
upstream or downstream of the site annotated as the start of the sequence
in the EMBL database), the status identifier shows the comment 'ALT_INIT'.
Example:
DR EMBL; L29151; AAA99430.1; ALT_INIT.
- 2. If the difference is due to a different termination of the sequence
(i.e. Swiss-Prot believes that the termination of the sequence is upstream
or downstream of the site annotated as the end of the sequence in the EMBL
database), the status identifier shows the comment 'ALT_TERM'.
Example:
DR EMBL; L20562; AAA26884.1; ALT_TERM.
- 3. If the difference is due to frameshifts in the EMBL sequence, the status identifier
shows the comment 'ALT_FRAME'. Example:
DR EMBL; X56420; CAA39814.1; ALT_FRAME.
- 4. If the difference is not due to any of the cases mentioned above (e.g.
wrong intron-exon boundaries given in the EMBL entry) or to a mixture of
the cases mentioned above, the status identifier shows the comment
'ALT_SEQ'. Example:
DR EMBL; M28482; AAA26378.1; ALT_SEQ.
c) In some cases the nucleotide sequence of a complete CDS is divided into
exons present in different EMBL entries. We point to the exon-containing EMBL
entries by citing the Protein_ID as secondary identifier and adding the
comment 'JOINED' as the status identifier. These EMBL entries do not
contain a CDS feature but contain exons joined to a CDS feature which is
labeled with the given Protein_ID.
Example:
DR EMBL; M63397; AAA51662.1; -.
DR EMBL; M63395; AAA51662.1; JOINED.
DR EMBL; M63396; AAA51662.1; JOINED.
In the above example the Swiss-Prot sequence is derived from the CDS labeled with
the Protein_ID AAA51662. This CDS feature can be found in the EMBL entry M63397.
Exons belonging to this CDS are not only found in EMBL entry M63397, but also in the
EMBL entries M63395 and M63396.
d) In some cases there is no CDS feature key annotating a protein
translation in an EMBL entry and thus no Protein_ID for the CDS. Therefore
it is not possible for us to point to a Protein_ID as a secondary
identifier. In these cases we point to the relevant EMBL entries by
placing a dash ('-') in the position of the missing Protein_ID and
'NOT_ANNOTATED_CDS' in the position of the status identifier.
Example:
DR EMBL; AJ243418; -; NOT_ANNOTATED_CDS.
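The status identifiers introduced in cases a) to d) can be collected in a small lookup table. The following Python sketch parses an EMBL cross-reference and attaches a short explanation of its status; the wording of the explanations, the function name `parse_embl_dr` and the returned dictionary are assumptions of this example.

```python
# Status identifiers as described in cases a) to d) above.
EMBL_STATUS = {
    "-": "translation matches, or differences are annotated in FT/RP lines",
    "ALT_INIT": "different start of the sequence",
    "ALT_TERM": "different termination of the sequence",
    "ALT_FRAME": "frameshifts in the EMBL sequence",
    "ALT_SEQ": "other difference, or a mixture of the above",
    "JOINED": "entry holds exons joined to a CDS in another entry",
    "NOT_ANNOTATED_CDS": "no CDS feature, hence no Protein_ID",
}


def parse_embl_dr(line):
    """Parse 'DR EMBL; ACCESSION; PROTEIN_ID; STATUS.' (hypothetical helper)."""
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    fields = [f.strip() for f in body.split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference")
    _, accession, protein_id, status = fields
    if status not in EMBL_STATUS:
        raise ValueError("unknown status identifier: %r" % status)
    return {
        "accession": accession,
        # A dash replaces the Protein_ID when no CDS is annotated.
        "protein_id": None if protein_id == "-" else protein_id,
        "status": status,
        "meaning": EMBL_STATUS[status],
    }
```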
3.12.7. Cross-references to the PROSITE database
The specific format for cross-references to the PROSITE protein domain and family
database is:
DR PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE
pattern or profile entry; 'ENTRY_NAME' is the name of the entry and
'STATUS' is one of the following:
- n
- FALSE_NEG
- PARTIAL
- UNKNOWN_n
Where 'n' is the number of hits of the pattern or profile in that
particular protein sequence. The 'FALSE_NEG' status indicates that while
the pattern or profile did not detect the protein sequence, it is a member
of that particular family or domain. The 'PARTIAL' status indicates that
the pattern or profile did not detect the sequence because the sequence is not
complete and lacks the region on which the pattern/profile is based.
Finally, the 'UNKNOWN_n' status indicates that it is uncertain whether
the sequence is a member of the family or contains the domain described by the
pattern/profile.
Examples of PROSITE cross-references:
DR PROSITE; PS00107; PROTEIN_KINASE_ATP; 2.
DR PROSITE; PS00028; ZINC_FINGER_C2H2_1; 4.
DR PROSITE; PS00237; G_PROTEIN_RECEP_F1_1; FALSE_NEG.
DR PROSITE; PS00603; TK_CELLULAR_TYPE; PARTIAL.
DR PROSITE; PS50234; VWFA; UNKNOWN_1.
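The four forms of the PROSITE status item can be told apart as sketched below. The label 'HIT' for a plain hit count, and the function name `parse_prosite_status`, are assumptions introduced for this example only.

```python
import re


def parse_prosite_status(status):
    """Return (kind, hits) for the last item of a PROSITE DR line.

    kind is 'HIT' (hypothetical label for a plain count), 'UNKNOWN',
    'FALSE_NEG' or 'PARTIAL'; hits is None when no count is given.
    """
    if status in ("FALSE_NEG", "PARTIAL"):
        return status, None
    match = re.fullmatch(r"(UNKNOWN_)?(\d+)", status)
    if match is None:
        raise ValueError("unexpected PROSITE status: %r" % status)
    kind = "UNKNOWN" if match.group(1) else "HIT"
    return kind, int(match.group(2))
```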
The KW (KeyWord) lines provide information that can be used to generate indexes of the
sequence entries based on functional, structural, or other categories. The keywords chosen
for each entry serve as a subject reference for the sequence. The Swiss-Prot document
keywlist.txt lists all the
keywords that are used in the database. Often several KW lines are necessary for a single entry.
The format of the KW line is:
KW Keyword[; Keyword...].
More than one keyword may be listed on each KW line; semicolons separate the
keywords, and the last keyword is followed by a period. Keywords may consist of more
than one word (they may contain blanks), but are never split between lines. An example of a
KW line is:
KW Oxidoreductase; NADP; Acetylation.
The order of the keywords is not significant. The above example could also have been
written:
KW Acetylation; NADP; Oxidoreductase.
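Because keywords are separated by semicolons and never split between lines, the KW lines of an entry can be gathered into a single list. The sketch below assumes that every KW line except the last ends with a semicolon; that detail, and the function name `parse_kw_lines`, are assumptions of this example.

```python
def parse_kw_lines(lines):
    """Collect the keywords from all KW lines of one entry."""
    keywords = []
    for line in lines:
        if not line.startswith("KW"):
            continue  # ignore other line types
        body = line[2:].strip()
        if body.endswith("."):
            body = body[:-1]  # the last keyword is followed by a period
        # Empty pieces arise from a trailing ';' and are skipped.
        keywords.extend(k.strip() for k in body.split(";") if k.strip())
    return keywords
```

Since the order of the keywords is not significant, comparing the results as sets (for instance `set(parse_kw_lines(a)) == set(parse_kw_lines(b))`) is probably the more appropriate test of equality.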
The FT (Feature Table) lines provide a precise but simple means for the annotation of the
sequence data. The table describes regions or sites of interest in the sequence. In general
the feature table lists posttranslational modifications, binding sites, enzyme active sites,
local secondary structure or other characteristics reported in the cited references.
Sequence conflicts between references are also included in the feature table.
The FT lines have a fixed format. The column numbers allocated to each of the data items
within each FT line are shown in the following table (column numbers not referred to in the
table are always occupied by blanks).
Columns | Data item
1-2 | FT
6-13 | Key name
15-20 | 'From' endpoint
22-27 | 'To' endpoint
35-75 | Description
The key name and the endpoints are always on a single line, but the
description may require one or more additional lines. In this event,
the following line contains blanks in the columns 3-34, and the
description continues from column 35 onwards as in the line above.
Thus a blank key always denotes a continuation of the previous description.
An example of a feature table is shown below:
FT NON_TER 1 1
FT SIGNAL <1 10 BY SIMILARITY.
FT CHAIN 19 87 A-AGGLUTININ.
FT PROPEP 22 43 REMOVED BY A DIPEPTIDYLPEPTIDASE.
FT MOD_RES 41 41 AMIDATION (G-42 PROVIDE AMIDE GROUP)
FT (BY SIMILARITY).
FT DISULFID 110 115
FT CARBOHYD 251 251 N-LINKED (GLCNAC...).
FT CONFLICT 327 327 E -> R (IN REF. 2).
FT CONFLICT 378 509 MISSING (IN REF. 3).
The first item on each FT line is the key name, which is a fixed abbreviation (of up to 8
characters) with a defined meaning. A list of the currently defined key names can be found
in Appendix A of this document.
Following the key name are the 'FROM' and 'TO' endpoint specifications. These fields
designate (inclusively) the endpoints of the feature named in the key field. In general, these
fields simply contain residue numbers which indicate positions in the sequence as listed. Note
that these positions are always specified assuming a numbering of the listed sequence
from 1 to n; this numbering is not necessarily the same as that used in the original
reference(s). The following should be noted:
- If the 'FROM' and 'TO' specifications are identical, the feature
involves one single amino acid;
- When a feature is known to extend beyond the end(s) of the sequenced region, the
endpoint specification will be preceded by '<' for features which continue to the left
end (N-terminal direction) or by '>' for features which continue to the right end
(C-terminal direction);
- Unknown endpoints are denoted by '?'. Uncertain endpoints are denoted
by a '?' before the position, e.g. '?42'.
See also the notes concerning each of the key names in
Appendix A.
The remaining portion of the FT line is a description that contains additional information
about the feature. For example, for a posttranslationally modified residue (key MOD_RES)
the chemical nature of the modification is given, while for a sequence variation (key
VARIANT) the nature of the variation is indicated. This portion of the line is generally in free
form, and may be continued on additional lines when necessary.
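The fixed column layout and the endpoint conventions described above translate directly into string slicing. The following Python sketch assumes that a line still has its original column spacing (which the examples printed in this document may have lost); the function names and the flag strings are hypothetical.

```python
def parse_endpoint(field):
    """Interpret a 'FROM' or 'TO' field; returns (position or None, flag)."""
    text = field.strip()
    if text == "?":
        return None, "unknown"
    flags = {
        "<": "extends N-terminal",
        ">": "extends C-terminal",
        "?": "uncertain",
    }
    if text[:1] in flags:
        return int(text[1:]), flags[text[:1]]
    return int(text), "exact"


def parse_ft_line(line):
    """Slice one FT line using the column numbers given in the table."""
    line = line.ljust(75)                # pad so that the slices are safe
    key = line[5:13].strip()             # columns 6-13
    description = line[34:75].strip()    # columns 35-75
    if not key:
        # A blank key denotes a continuation of the previous description.
        return {"continuation": description}
    return {
        "key": key,
        "from": parse_endpoint(line[14:20]),   # columns 15-20
        "to": parse_endpoint(line[21:27]),     # columns 22-27
        "description": description,
    }
```

Positions are returned as 1-based residue numbers of the listed sequence, exactly as they appear on the line.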
Feature identifiers
Some features are associated with a unique and stable feature identifier
(FTId), which allows links to be constructed directly from position-specific
annotation in the feature table to specialized protein-related databases.
The FTId is always the last component of a feature in the description field.
The format of a feature with a feature identifier is:
FT KEYNAME x x Description.
FT /FTId=XXX_number.
where XXX is the 3-letter code for the specific feature key, separated by an underscore from a
6-digit number.
Feature identifiers currently exist for the feature keys CARBOHYD,
VARIANT and VARSPLIC. The format of the corresponding FTIds is shown
in the following table:
Key name | Format of the FTId | Availability
CARBOHYD | CAR_number | Currently only for residues attached to an oligosaccharide structure annotated in the GlycoSuiteDB database
VARIANT | VAR_number | Currently only for human protein sequences
VARSPLIC | VSP_number | Any sequence with a VARSPLIC feature
Examples of features with FTIds are given below:
FT CARBOHYD 251 251 N-LINKED (GLCNAC...).
FT /FTId=CAR_000070.
FT VARIANT 214 214 V -> I.
FT /FTId=VAR_009122.
FT VARSPLIC 33 83 TPDINPAWYTGRGIRPVGRFGRRRATPRDVTGLGQLSCLPL
FT DGRTKFSQRG -> SECLTYGKQPLTSFHPFTSQMPP (in
FT isoform 2).
FT /FTId=VSP_004370.
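A feature identifier can be recognised in a description with a regular expression that follows the format given above (3-letter code, underscore, 6-digit number). The names `FTID_RE`, `FTID_KEYS` and `parse_ftid` in this sketch are hypothetical.

```python
import re

# '/FTId=' followed by a 3-letter code, an underscore and a 6-digit number.
FTID_RE = re.compile(r"/FTId=([A-Z]{3})_(\d{6})\.?$")

# Feature keys that currently carry an FTId, according to the table above.
FTID_KEYS = {"CAR": "CARBOHYD", "VAR": "VARIANT", "VSP": "VARSPLIC"}


def parse_ftid(description):
    """Return (feature_key, ftid) or None if the text holds no FTId."""
    match = FTID_RE.search(description.strip())
    if match is None:
        return None
    prefix, number = match.groups()
    return FTID_KEYS.get(prefix), prefix + "_" + number
```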
The SQ (SeQuence header) line marks the beginning of the sequence data and gives a
quick summary of its content.
The format of the SQ line is:
SQ SEQUENCE XXXX AA; XXXXX MW; XXXXXXXXXXXXXXXX CRC64;
The line contains the length of the sequence in amino acids ('AA') followed by the molecular
weight ('MW') rounded to the nearest mass unit (Dalton) and the sequence 64-bit CRC
(Cyclic Redundancy Check) value ('CRC64').
The algorithm to compute the CRC64 is described in the ISO 3309 standard.
The generator polynomial is x^64 + x^4 + x^3 + x + 1.
Reference: Press W.H., Flannery B.P., Teukolsky S.A. and Vetterling W.T.
"Numerical recipes in C", 2nd ed., pp896-902, Cambridge University Press (1993).
(See http://www.ulib.org/webRoot/Books/Numerical_Recipes/bookcpdf/c20-3.pdf)
It should be noted that while, in theory, two different sequences could
have the same CRC64 value, the likelihood that this would happen is
extremely low.
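For illustration, a table-driven CRC64 using the generator polynomial given above might look like the sketch below. Several details are assumptions rather than statements from this manual: the bit-reflected processing order, the initial value of zero, the absence of a final inversion, and the rendering as 16 upper-case hexadecimal digits. The output has not been checked against the values stored in real SQ lines.

```python
# Bit-reversed representation of x^64 + x^4 + x^3 + x + 1 (assumed convention).
POLY = 0xD800000000000000


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right; apply the polynomial when the dropped bit was set.
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence):
    """Return a 16-digit hexadecimal CRC64 of an ASCII sequence string."""
    crc = 0  # assumed initial value
    for ch in sequence:
        crc = _TABLE[(crc ^ ord(ch)) & 0xFF] ^ (crc >> 8)
    return "%016X" % crc
```

A value computed this way could be compared with the CRC64 item of the SQ line to detect accidental changes to the sequence, provided the assumptions listed above match the convention actually used by the database.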
An example of an SQ line is shown here:
SQ SEQUENCE 486 AA; 55638 MW; D7862E867AD74383 CRC64;
The information in the SQ line can be used as a check on accuracy or for statistical
purposes. The word 'SEQUENCE' is present solely for readability.
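The three values on the SQ line can be extracted with a regular expression, for example to check the stated length against the sequence that follows. This is a minimal sketch; the pattern tolerates variable spacing, and the names `SQ_RE` and `parse_sq_line` are hypothetical.

```python
import re

SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+([0-9A-F]{16})\s+CRC64;$"
)


def parse_sq_line(line):
    """Return the length, molecular weight and CRC64 from an SQ line."""
    match = SQ_RE.match(line.strip())
    if match is None:
        raise ValueError("not a valid SQ line")
    return {
        "length": int(match.group(1)),      # number of amino acids
        "mol_weight": int(match.group(2)),  # in Dalton, rounded
        "crc64": match.group(3),            # 16 hexadecimal digits
    }
```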
The sequence data line has a line code consisting of two blanks rather than
the two-letter codes used until now. The sequence counts 60 amino acids per
line, in groups of 10 amino acids, beginning in position 6 of the line.
The characters used for the amino acids are the standard IUPAC one letter codes (see
Appendix B).
An example of sequence data lines is shown here:
MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV
NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI
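The layout of the sequence data lines (60 residues per line, groups of 10, starting in position 6) can be produced and read back as sketched here. The function names `format_sequence` and `read_sequence` are hypothetical.

```python
def format_sequence(sequence, per_line=60, group=10):
    """Lay out a sequence as Swiss-Prot style sequence data lines."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Five leading blanks: the sequence begins in position 6 of the line.
        lines.append(" " * 5 + " ".join(groups))
    return lines


def read_sequence(lines):
    """Rebuild the sequence by dropping all blanks from the data lines."""
    return "".join("".join(line.split()) for line in lines)
```

Reading back the lines written by `format_sequence` should return the original sequence unchanged.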
The // (terminator) line contains no data or comments and designates the end of an entry.
The definition of each of the key names used in the feature table is
explained here. It is likely that new key names will be added
progressively to this list. For each key a number of examples are presented.
A.1. Change indicators
CONFLICT - Different sources report differing sequences.
Examples of CONFLICT key feature lines:
FT CONFLICT 304 304 MISSING (IN REF. 3).
FT CONFLICT 62 62 D -> N (IN REF. 2 AND 3).
FT CONFLICT 3 6 STGD -> RRGN (IN REF. 3).
VARIANT - Authors report that sequence variants exist.
Examples of VARIANT key feature lines:
FT VARIANT 214 214 V -> I.
FT /FTId=VAR_009122.
FT VARIANT 240 240 D -> E (in strains Z3915 and Z3524).
FT VARIANT 1 1 Missing (in 20% of the chains).
Explicit links are present in human protein sequence entries to the Single
Nucleotide Polymorphism database (dbSNP) (Nucleic Acids Res. 29:308-311(2001);
PMID: 11125122).
The format of such links is:
FT VARIANT from to Description (in dbSNP:accession_number).
FT /FTId=VAR_number.
Example of a feature with a link to dbSNP:
FT VARIANT 65 65 T -> I (in dbSNP:1065419).
FT /FTId=VAR_012009.
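The dbSNP link format shown above can be picked out of a VARIANT description with a short regular expression. In this sketch the accession number is assumed to consist of digits only, as in the example; the names `DBSNP_RE` and `dbsnp_accessions` are hypothetical.

```python
import re

# Matches '(in dbSNP:accession_number)'; digits-only accession is assumed.
DBSNP_RE = re.compile(r"\(in dbSNP:(\d+)\)")


def dbsnp_accessions(description):
    """Return all dbSNP accession numbers cited in a VARIANT description."""
    return DBSNP_RE.findall(description)
```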
VARSPLIC - Description of sequence variants produced by alternative splicing.
Examples of VARSPLIC key feature lines:
FT VARSPLIC 653 672 VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
FT TRL (IN ISOFORM 2).
FT VARSPLIC 673 913 MISSING (IN ISOFORM 2).
MUTAGEN - Site which has been experimentally altered.
Examples of MUTAGEN key feature lines:
FT MUTAGEN 65 65 H->F: 100% ACTIVITY LOSS.
FT MUTAGEN 119 119 C->R,E,A: LOSS OF CADPR HYDROLASE
FT AND ADP-RIBOSYL CYCLASE ACTIVITY.
- A.2. Amino-acid modifications
MOD_RES - Posttranslational modification of a residue.
The chemical nature of the modification is given in the description. The general format of
the MOD_RES description field is:
FT MOD_RES xxx xxx MODIFICATION (COMMENT).
The most frequent modifications are listed below.
Modification | Description
ACETYLATION | N-terminal or other
AMIDATION | Generally at the C-terminal of a mature active peptide
BLOCKED | Undetermined N- or C-terminal blocking group
FORMYLATION | Of the N-terminal methionine
GAMMA-CARBOXYGLUTAMIC ACID | Of glutamate
HYDROXYLATION | Of asparagine, aspartate, proline or lysine
METHYLATION | Generally of lysine or arginine
PHOSPHORYLATION | Of serine, threonine, tyrosine, aspartate or histidine
PYRROLIDONE CARBOXYLIC ACID | N-terminal glutamine which has formed an internal cyclic lactam. This is also called 'pyro-Glu'
SULFATION | Generally of tyrosine
Examples of MOD_RES key feature lines:
FT MOD_RES 1 1 ACETYLATION.
FT MOD_RES 686 686 PHOSPHORYLATION (BY PKC).
FT MOD_RES 367 367 SULFATION (BY SIMILARITY).
FT MOD_RES 58 58 AMIDATION (G-59 PROVIDE AMIDE GROUP).
FT MOD_RES 20 20 METHYLATION (MONO- AND DI-).
CARBOHYD - Glycosylation site.
This key describes the occurrence of the attachment of a glycan (mono- or polysaccharide)
to a residue of the protein:
- The type of linkage (C-, N- or O-linked) to the protein is indicated.
- If the nature of the reducing terminal sugar is known, its abbreviation is shown
in parentheses. If three dots '...' follow the abbreviation, this indicates
an extension of the carbohydrate chain. Conversely, no dots means that a
monosaccharide is linked.
Examples of CARBOHYD key feature lines:
FT CARBOHYD 53 53 N-LINKED (GLCNAC...) (POTENTIAL).
FT CARBOHYD 18 18 O-LINKED (GLCNAC).
FT CARBOHYD 194 194 O-LINKED (GLC...) (BY SIMILARITY).
FT CARBOHYD 29 29 C-LINKED (MAN).
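The CARBOHYD conventions above (linkage type, sugar abbreviation, optional three dots) can be decoded as in the following sketch. It assumes upper-case descriptions as in the examples, and it treats a parenthesised 'POTENTIAL' or 'PROBABLE' directly after the linkage as a qualifier rather than a sugar; both the function name `parse_carbohyd` and that handling are assumptions.

```python
import re

CARBOHYD_RE = re.compile(r"^([CNO])-LINKED(?: \(([A-Z0-9]+)(\.\.\.)?\))?")


def parse_carbohyd(description):
    """Return the linkage type, sugar and chain extension of a CARBOHYD line."""
    match = CARBOHYD_RE.match(description)
    if match is None:
        raise ValueError("unexpected CARBOHYD description")
    linkage, sugar, dots = match.groups()
    if sugar in ("POTENTIAL", "PROBABLE"):
        sugar, dots = None, None  # a qualifier, not a sugar abbreviation
    return {
        "linkage": linkage,  # 'C', 'N' or 'O'
        "sugar": sugar,      # reducing terminal sugar, if known
        # True: the chain extends; False: monosaccharide; None: sugar unknown.
        "extended": None if sugar is None else dots is not None,
    }
```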
LIPID - Covalent binding of a lipid moiety
The chemical nature of the bound lipid moiety is given in the description. The general
format of the LIPID description field is:
FT LIPID xxx xxx NAME OF THE ATTACHED GROUP (COMMENT).
The attached groups that are currently defined are listed below.
Attached group |
Description |
MYRISTATE |
Myristate group attached through an amide bond to the N-terminal glycine
residue of the mature form of a protein [1,2] or to an internal lysine residue |
PALMITATE |
Palmitate group attached through a thioester bond to a cysteine residue
or through an ester bond to a serine or threonine residue [1,2] |
FARNESYL |
Farnesyl group attached through a thioether bond to a cysteine residue
[3,4] |
GERANYL-GERANYL |
Geranyl-geranyl group attached through a thioether bond to a cysteine
residue [3,4] |
GPI-ANCHOR |
Glycosyl-phosphatidylinositol (GPI) group linked to the alpha-carboxyl
group of the C-terminal residue of the mature form of a protein [5,6] |
N-ACYL DIGLYCERIDE |
N-terminal cysteine of the mature form of a prokaryotic lipoprotein with
an amide-linked fatty acid and a glyceryl group to which two fatty acids
are linked by ester linkages [7] |
S-ARCHAEOL |
N-terminal cysteine of the mature form of an archaeal lipoprotein with
an archaeol (2,3-di-O-phytanyl-sn-glycerol) lipid group linked by an ester
linkage [8] |
N-OCTANOATE |
n-octanoate group linked through an ester bond to a serine residue
|
References:
[1] Grand R.J.A. Biochem. J. 258:626-638(1989).
[2] McIlhinney R.A.J. Trends Biochem. Sci. 15:387-391(1990).
[3] Glomset J.A., Gelb M.H., Farnsworth C.C. Trends Biochem. Sci. 15:139-142(1990).
[4] Sinensky M., Lutz R.J. BioEssays 14:25-31(1992).
[5] Low M.G. FASEB J. 3:1600-1608(1989).
[6] Low M.G. Biochim. Biophys. Acta 988:427-454(1989).
[7] Hayashi S., Wu H.C. J. Bioenerg. Biomembr. 22:451-471(1990).
[8] Nishihara M., Utagawa M., Akutsu H., Koga Y. J. Biol. Chem. 267:12432-12435(1992).
Examples of LIPID key feature lines:
FT LIPID 2 2 MYRISTATE.
FT LIPID 5 5 PALMITATE.
FT LIPID 231 231 GPI-ANCHOR.
DISULFID - Disulfide bond.
The 'FROM' and 'TO' endpoints designate the two residues which are linked by an intra-chain
disulfide bond. If the 'FROM' and 'TO' endpoints are identical, the disulfide bond is an
interchain one and the description field indicates the nature of the cross-link. Examples of
DISULFID key feature lines:
FT DISULFID 23 84 PROBABLE.
FT DISULFID 29 29 INTERCHAIN (WITH C-8 OF SMALL CHAIN).
THIOLEST - Thiolester bond.
The 'FROM' and 'TO' endpoints designate the two residues which are linked by a thiolester
bond.
THIOETH - Thioether bond.
The 'FROM' and 'TO' endpoints designate the two residues which are linked by a thioether
bond.
SE_CYS - Selenocysteine
This key describes the occurrence of a selenocysteine in the sequence record. Examples:
FT SE_CYS 52 52
FT SE_CYS 16 16 POTENTIAL.
METAL - Binding site for a metal ion.
The description field indicates the nature of the metal. Examples of METAL key feature lines:
FT METAL 52 52 IRON (HEME AXIAL LIGAND).
FT METAL 28 28 COPPER (POTENTIAL).
BINDING - Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
The chemical nature of the group is given in the description field. Examples of BINDING key
feature lines:
FT BINDING 14 14 HEME (COVALENT).
FT BINDING 250 250 PYRIDOXAL PHOSPHATE.
- A.3. Regions
SIGNAL - Extent of a signal sequence (prepeptide).
TRANSIT - Extent of a transit peptide (mitochondrion, chloroplast, thylakoid,
cyanelle or microbody).
Examples of TRANSIT key feature lines:
FT TRANSIT 1 42 CHLOROPLAST.
FT TRANSIT 1 37 CYANELLE (BY SIMILARITY).
FT TRANSIT 1 25 MITOCHONDRION.
FT TRANSIT 1 34 MICROBODY (POTENTIAL).
FT TRANSIT ? 77 THYLAKOID (BY SIMILARITY).
PROPEP - Extent of a propeptide.
Examples of PROPEP key feature lines:
FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT PROPEP 550 574 REMOVED IN MATURE FORM.
CHAIN - Extent of a polypeptide chain in the mature protein.
Examples of CHAIN key feature lines:
FT CHAIN 21 119 BETA-2-MICROGLOBULIN.
FT CHAIN 41 180 FACTOR X LIGHT CHAIN.
PEPTIDE - Extent of a released active peptide.
Examples of PEPTIDE key feature lines:
FT PEPTIDE 13 107 NEUROPHYSIN 2.
FT PEPTIDE 235 239 MET-ENKEPHALIN.
DOMAIN - Extent of a domain of interest in the sequence.
The nature of the domain is given in the description field. Examples of DOMAIN key feature
lines:
FT DOMAIN 22 788 EXTRACELLULAR (POTENTIAL).
FT DOMAIN 745 939 RAS-GAP.
FT DOMAIN 128 195 COILED COIL (POTENTIAL).
CA_BIND - Extent of a calcium-binding region.
Example:
FT CA_BIND 759 770 EF-HAND 1 (POTENTIAL).
DNA_BIND - Extent of a DNA-binding region.
The nature of the DNA-binding region is given in the description field. Examples of
DNA_BIND key feature lines:
FT DNA_BIND 335 415 ETS-DOMAIN.
FT DNA_BIND 69 128 HOMEOBOX.
FT DNA_BIND 16 67 MYB 2.
FT DNA_BIND 135 200 TEA-DOMAIN.
NP_BIND - Extent of a nucleotide phosphate-binding region.
The nature of the nucleotide phosphate is indicated in the description field. Examples of
NP_BIND key feature lines:
FT NP_BIND 182 189 ATP.
FT NP_BIND 38 45 GTP (POTENTIAL).
FT NP_BIND 9 24 FAD (ADP PART).
TRANSMEM - Extent of a transmembrane region.
ZN_FING - Extent of a zinc finger region.
The zinc finger 'category' is indicated in the description field. Examples of ZN_FING key
feature lines:
FT ZN_FING 319 343 GATA-TYPE.
FT ZN_FING 559 579 C4-TYPE.
REPEAT - Extent of an internal sequence repetition.
Examples of REPEAT key feature lines:
FT REPEAT 225 307 1.
FT REPEAT 341 423 2.
FT REPEAT 455 537 3 (APPROXIMATE).
- A.4. Secondary structure
The feature table of sequence entries of proteins whose tertiary structure is known
experimentally contains the secondary structure information corresponding to that protein.
The secondary structure assignment is made according to DSSP (see Kabsch W., Sander
C.; Biopolymers, 22:2577-2637(1983)) and the information is extracted from the
coordinate data sets of the Protein Data Bank (PDB).
In the feature table only three types of secondary structure are specified: helices (key
HELIX), beta-strands (key STRAND) and turns (key TURN). Residues not specified in one of
these classes are in a 'loop' or 'random-coil' structure. Because the DSSP assignment
has more than the three common secondary structure classes, we have converted the
different DSSP assignments to HELIX, STRAND and TURN as shown in the table below.
DSSP code | DSSP definition | Swiss-Prot assignment
H | Alpha-helix | HELIX
G | 3(10) helix | HELIX
I | Pi-helix | HELIX
E | Hydrogen-bonded beta-strand (extended strand) | STRAND
B | Residue in an isolated beta-bridge | STRAND
T | H-bonded turn (3-turn, 4-turn or 5-turn) | TURN
S | Bend (five-residue bend centered at residue i) | Not specified
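The conversion table above can be written as a lookup and applied to a per-residue string of DSSP codes. The sketch below merges adjacent residues that map to the same Swiss-Prot key into one segment; that merging rule, the input representation (one character per residue, blank for no assignment) and the function name `dssp_to_features` are assumptions of this example.

```python
# DSSP codes and their Swiss-Prot assignment; 'S' (bend) is not specified.
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
    "S": None,
}


def dssp_to_features(dssp_string):
    """Collapse DSSP codes into (key, from, to) segments, 1-based inclusive."""
    features = []
    current, start = None, None
    # A trailing blank acts as a sentinel that closes the last segment.
    for pos, code in enumerate(dssp_string + " ", start=1):
        key = DSSP_TO_SWISSPROT.get(code)
        if key != current:
            if current is not None:
                features.append((current, start, pos - 1))
            current, start = key, pos
    return features
```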
One should be aware of the following facts:
a) Segment length. For helices (alpha and 3-10), the
residue just before and just after the helix as given by DSSP participates in
the helical hydrogen-bonding pattern with a single H-bond. For practical
purposes, one can extend the HELIX range by one residue on each side,
e.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of secondary structure
segments are less well defined for lower-resolution structures. A fluctuation
of one residue is common.
b) Missing segments. In low-resolution structures, badly formed helices or
strands may be omitted in the DSSP definition.
c) Special helices and strands. Helices of length three are 3-10 helices,
those of length four and longer are either alpha-helices or 3-10 helices
(pi helices are extremely rare). A strand of one residue corresponds to a
residue in an isolated beta-bridge. Such bridges can be structurally important.
d) Missing secondary structure. No secondary structure is currently given
in the feature table in the following cases:
- No sequence data in the PDB entry;
- Structure for which only C-alpha coordinates are in PDB;
- NMR structure with more than one coordinate data set;
- Model (i.e. theoretical) structure.
Examples:
FT HELIX 4 14
FT HELIX 21 35
FT TURN 36 36
FT HELIX 48 61
FT TURN 62 63
FT HELIX 66 83
FT TURN 84 86
- A.5. Others
ACT_SITE - Amino acid(s) involved in the activity of an enzyme.
Examples of ACT_SITE key feature lines:
FT ACT_SITE 193 193 ACCEPTS A PROTON DURING CATALYSIS.
FT ACT_SITE 1083 1083 CHARGE RELAY SYSTEM.
SITE - Any other interesting site on the sequence.
Examples of SITE key feature lines:
FT SITE 285 288 PREVENT SECRETION FROM ER.
FT SITE 759 760 CLEAVAGE (BY THROMBIN).
INIT_MET - Initiator methionine.
This feature key is usually associated with a zero value in the 'FROM' and 'TO' fields to
indicate that the initiator methionine has been cleaved off and is not shown in the
sequence:
FT INIT_MET 0 0
It is not used when the initiator methionine is not cleaved off except in
the event of internal alternative initiation sites. Example:
FT INIT_MET 44 44 FOR CYTOPLASMIC ISOFORM.
NON_TER - The residue at an extremity of the sequence is not the terminal residue.
If applied to position 1, this signifies that the first position is not the N-terminus of the
complete molecule. If applied to the last position, it means that this position is not the
C-terminus of the complete molecule. There is no description field for this key. Examples of
NON_TER key feature lines:
FT NON_TER 1 1
FT NON_TER 129 129
NON_CONS - Non-consecutive residues.
Indicates that two residues in a sequence are not consecutive and that there are a number
of unsequenced residues between them. Example of a NON_CONS key feature line:
FT NON_CONS 1683 1684
UNSURE - Uncertainties in the sequence.
Used to describe region(s) of a sequence for which the authors are unsure about the
sequence assignment. Example of an UNSURE key feature line:
FT UNSURE 12 12 OR Y.
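The conventions described above for the INIT_MET and NON_TER keys can be sketched in Python as follows. This is only an illustration under stated assumptions: fields are taken to be whitespace-separated as in the examples, the sequence length is supplied by the caller, and the function name terminal_status and the dictionary keys it returns are hypothetical.

```python
def terminal_status(ft_lines, seq_length):
    """Interpret INIT_MET and NON_TER lines for a sequence of seq_length residues."""
    status = {
        "init_met_removed": False,   # INIT_MET with 0 in both FROM and TO
        "n_term_incomplete": False,  # NON_TER applied to position 1
        "c_term_incomplete": False,  # NON_TER applied to the last position
    }
    for line in ft_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "FT":
            continue
        key, start, end = fields[1], fields[2], fields[3]
        if not (start.isdigit() and end.isdigit()):
            continue  # skip positions this sketch does not understand
        start, end = int(start), int(end)
        if key == "INIT_MET" and start == 0 and end == 0:
            status["init_met_removed"] = True
        elif key == "NON_TER" and start == 1:
            status["n_term_incomplete"] = True
        elif key == "NON_TER" and start == seq_length:
            status["c_term_incomplete"] = True
    return status


lines = ["FT INIT_MET 0 0", "FT NON_TER 129 129"]
print(terminal_status(lines, 129))
```

An INIT_MET line with non-zero positions (an internal alternative initiation site) is deliberately left unflagged by this sketch.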
The one-letter and three-letter codes for amino acids used in Swiss-Prot are those
adopted by the commission on Biochemical Nomenclature of the IUPAC-IUB (see the
reference listed below).
One-letter code   Three-letter code   Amino-acid name
A                 Ala                 Alanine
R                 Arg                 Arginine
N                 Asn                 Asparagine
D                 Asp                 Aspartic acid
C                 Cys                 Cysteine
Q                 Gln                 Glutamine
E                 Glu                 Glutamic acid
G                 Gly                 Glycine
H                 His                 Histidine
I                 Ile                 Isoleucine
L                 Leu                 Leucine
K                 Lys                 Lysine
M                 Met                 Methionine
F                 Phe                 Phenylalanine
P                 Pro                 Proline
S                 Ser                 Serine
T                 Thr                 Threonine
W                 Trp                 Tryptophan
Y                 Tyr                 Tyrosine
V                 Val                 Valine
B                 Asx                 Aspartic acid or Asparagine
Z                 Glx                 Glutamic acid or Glutamine
X                 Xaa                 Any amino acid
Reference
IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
Nomenclature and Symbolism for Amino Acids and Peptides. Recommendations 1983.
Eur. J. Biochem. 138:9-37(1984).
See also: http://www.chem.qmw.ac.uk/iupac/AminoAcid/
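For readers who need to convert between the two notations, the code table given in this appendix can be expressed as a simple lookup. The Python sketch below is illustrative only; the names THREE_TO_ONE, ONE_TO_THREE and to_one_letter are hypothetical and do not correspond to any Swiss-Prot tool.

```python
# Three-letter to one-letter codes, transcribed from the table in this appendix.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Asx": "B", "Glx": "Z", "Xaa": "X",
}

# Reverse mapping, built from the table above.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_codes):
    """Convert a list of three-letter codes into a one-letter string.

    Input is matched case-insensitively; an unknown code raises KeyError.
    """
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_codes)


print(to_one_letter(["Met", "Ala", "Asx", "Xaa"]))  # MABX
```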
Appendix C: Format differences between the Swiss-Prot and EMBL databases
- C.1. Generalities
The format of Swiss-Prot follows as closely as possible that of the EMBL
database. The general structure of an entry is identical in both databases.
The data classes are the same, except that Swiss-Prot does not make use of
the 'BACKBONE', 'UNREVIEWED' and 'UNANNOTATED' data
classes. Two line types used in Swiss-Prot are not used in the EMBL database
(see section C.3); conversely, Swiss-Prot does not
currently make use of every EMBL line type (see section C.4).
- C.2. Differences in line types present in both databases
- C.2.1 The ID line (IDentification)
Differences with the EMBL database ID line format are:
- The entry name can be up to 10 characters long (instead of 9 in EMBL) and can
begin with a numerical character;
- EMBL entry ID lines have an additional three-letter taxonomic division 'token'
inserted between the data class and the molecule type;
- The molecule type is listed as 'PRT' rather than 'DNA' or 'RNA';
- The length of the molecule is followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs).
- C.2.2 The AC line (ACcession number)
The format of this line type is identical to that defined by the EMBL
database. Swiss-Prot accession numbers do not overlap those used in the
EMBL/GenBank/DDBJ nucleotide sequence database. However, it should be noted
that there are differences in the format of the accession numbers themselves. In Swiss-Prot accession numbers
consist of 6 alphanumerical characters in the following format:
1         2       3            4            5            6
[O,P,Q]   [0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [0-9]
Examples: P01234; Q1AA12.
In EMBL, two different types of accession numbers co-exist:
1. Accession numbers with 6 alphanumerical characters,
   where the first character is any letter with the exception of O, P or Q and
   the five other characters are numbers (example: M23765);
2. Accession numbers with 8 alphanumerical characters, where the first two
   characters are letters and the following six characters are numbers
   (example: AB001084).
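The accession number formats described in this section can be checked with regular expressions. The following Python sketch is a minimal illustration that encodes only the rules stated above; the pattern names and the function classify_accession are hypothetical.

```python
import re

# Swiss-Prot: [O,P,Q], a digit, three alphanumerical characters, a digit.
SWISSPROT_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")
# EMBL, 6 characters: any letter except O, P or Q, followed by five digits.
EMBL_AC_6 = re.compile(r"[A-NR-Z][0-9]{5}")
# EMBL, 8 characters: two letters followed by six digits.
EMBL_AC_8 = re.compile(r"[A-Z]{2}[0-9]{6}")


def classify_accession(accession):
    """Return 'Swiss-Prot', 'EMBL' or 'unknown' for an accession number."""
    if SWISSPROT_AC.fullmatch(accession):
        return "Swiss-Prot"
    if EMBL_AC_6.fullmatch(accession) or EMBL_AC_8.fullmatch(accession):
        return "EMBL"
    return "unknown"


for ac in ("P01234", "Q1AA12", "M23765", "AB001084"):
    print(ac, classify_accession(ac))
```

Because the first character of a 6-character EMBL accession number is never O, P or Q, the two sets of patterns cannot match the same string.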
- C.2.3 The DT line (DaTe)
Differences with the EMBL database DT line format are:
- In EMBL there are two DT lines per entry whereas there are three in Swiss-Prot;
- In EMBL the format of the DT line that indicates when an entry was created
is identical to that defined in Swiss-Prot; but the two DT lines that convey
information relevant to the updating of an entry are replaced by a single line in
EMBL. This is shown in the example below.
DT lines in a Swiss-Prot entry:
DT 21-JUL-1986 (Rel. 01, Created)
DT 23-OCT-1986 (Rel. 02, Last sequence update)
DT 01-APR-1990 (Rel. 14, Last annotation update)
DT lines in an EMBL database entry:
DT 10-MAR-1990 (Rel. 22, Created)
DT 12-APR-1990 (Rel. 23, Last updated, Version 3)
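To illustrate the difference, the Python sketch below parses both kinds of DT line into a date, a release number and the remaining text. It is based solely on the examples shown above and assumes that every DT line follows that layout; the names DT_PATTERN and parse_dt_line are hypothetical.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

# Layout assumed from the examples: DT dd-MMM-yyyy (Rel. nn, text)
DT_PATTERN = re.compile(r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\(Rel\. (\d+), (.+)\)$")


def parse_dt_line(line):
    """Return (date, release number, event text) for a DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a DT line in the assumed layout: %r" % line)
    day, month, year, release, event = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), event


print(parse_dt_line("DT 23-OCT-1986 (Rel. 02, Last sequence update)"))
print(parse_dt_line("DT 12-APR-1990 (Rel. 23, Last updated, Version 3)"))
```

In the EMBL case the version number is left inside the event text ('Last updated, Version 3'); a caller that needs it separately would have to split that text further.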
- C.2.4 The DE line (DEscription)
- In Swiss-Prot the species of origin is not included in the description;
- In EMBL the last DE line is not terminated by a period.
- C.2.5 The OS line (Organism Species)
- In some cases the Swiss-Prot OS line includes more than one organism name
(when the relevant sequence is completely conserved in different species);
- In EMBL the last OS line is not terminated by a period.
- C.2.6 The OG line (OrGanelle)
- EMBL makes a distinction between 'Mitochondrion' and 'Kinetoplast', while
Swiss-Prot does not use the latter designation;
- EMBL makes a distinction between 'Chloroplast' and 'Plastid', while
Swiss-Prot does not use the latter designation;
- In EMBL the OG line is not terminated by a period.
- C.2.7 The RP and RC lines
- In EMBL, unlike Swiss-Prot, the RC line precedes the RP line;
- In EMBL the RC line is in free format and is generally not used.
- C.2.8 The RT line (Reference Title)
In EMBL the reference title is not terminated by a period, a question mark or an
exclamation mark.
- C.2.9 The FT line (Feature Table)
The format of this line is totally different from that currently defined for the EMBL database.
The format used in Swiss-Prot is similar to that which was used in older versions of the
EMBL database, prior to the introduction of the common EMBL/GenBank/DDBJ feature
table.
- C.2.10 The CC line (Comment)
The comment lines, which are free text and can appear anywhere in an EMBL entry, are
grouped together in the Swiss-Prot database. They are always listed below the last
reference line, and follow a precise syntax (see section
3.11).
- C.2.11 The SQ line (SeQuence header)
Although the general format and purpose of this line type are conserved, its
exact content differs from that of the EMBL database. The numerical length
of the sequence is listed, followed by 'AA' (Amino Acid) instead of
'BP' (Base Pairs). Rather than indicating the sequence composition
which, for protein sequences, would not fit in a single line, the molecular
weight and the 64-bit CRC (Cyclic Redundancy Check) value of the sequence
are indicated.
- C.3. Line types defined by Swiss-Prot but currently not used by EMBL
Presently, there are two line types in Swiss-Prot that are not used in
the EMBL database: the GN and OX lines.
- C.4. Line types defined by EMBL but currently not used by Swiss-Prot
There are three line types in the EMBL database that are not used in
Swiss-Prot:
- FH and XX. The FH and XX lines contain no data and are present in EMBL only to
improve readability of an entry when it is printed or displayed on a terminal screen.
These lines are not included in Swiss-Prot so as to keep it as compact as
possible and thereby facilitate its use on small computer systems.
- SV. The SV (Sequence Version) line contains an identifier specific to nucleic acid
sequences. It has no meaning in the context of Swiss-Prot.
Swiss-Prot is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indexes for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. The following table lists all
the documents that are currently available.
userman.txt     User manual
relnotes.txt    Release notes for the current release
shortdes.txt    Short description of entries in Swiss-Prot

jourlist.txt    List of cited journals
keywlist.txt    List of keywords
plasmid.txt     List of plasmids
speclist.txt    List of organism (species) identification codes
tisslist.txt    List of tissues
experts.txt     List of on-line experts for PROSITE and Swiss-Prot
dbxref.txt      List of databases cross-referenced in Swiss-Prot
submit.txt      Submission of sequence data to Swiss-Prot

acindex.txt     Accession number index
autindex.txt    Author index
citindex.txt    Citation index
keyindex.txt    Keyword index
speindex.txt    Species index
deleteac.txt    Deleted accession number index

7tmrlist.txt    List of 7-transmembrane G-linked receptor entries
aatrnasy.txt    List of aminoacyl-tRNA synthetases
allergen.txt    Nomenclature and index of allergen sequences
annbioch.txt    Swiss-Prot annotation: how is biochemical information assigned to sequence entries
arath.txt       Index of Arabidopsis thaliana entries and their corresponding gene designations
bacsu.txt       Index of Bacillus subtilis strain 168 chromosomal entries and their
                corresponding SubtiList cross-references
bloodgrp.txt    Blood group antigen proteins
bucai.txt       Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum) entries
bucap.txt       Index of Buchnera aphidicola (subsp. Schizaphis graminum) entries
calbican.txt    Index of Candida albicans entries and their corresponding gene designations
cdlist.txt      CD nomenclature for surface proteins of human leucocytes
celegans.txt    Index of Caenorhabditis elegans entries and their corresponding gene
                designations and WormPep cross-references
dicty.txt       Index of Dictyostelium discoideum entries and their corresponding gene
                designations and DictyDB cross-references
ec2dtosp.txt    Index of Escherichia coli Gene-protein database (ECO2DBASE) entries
                referenced in Swiss-Prot
ecoli.txt       Index of Escherichia coli strain K12 chromosomal entries and their
                corresponding EcoGene cross-references
embltosp.txt    Index of EMBL Nucleotide Sequence Database entries referenced in Swiss-Prot
extradom.txt    Nomenclature of extracellular domains
fly.txt         Index of Drosophila entries and their corresponding FlyBase cross-references
glycosid.txt    Classification of glycosyl hydrolase families and index of glycosyl
                hydrolase entries in Swiss-Prot
haein.txt       Index of Haemophilus influenzae strain Rd chromosomal entries
helpy.txt       Index of Helicobacter pylori strain 26695 chromosomal entries
hoxlist.txt     Vertebrate homeotic Hox proteins: nomenclature and index
humchr01.txt    Index of proteins encoded on human chromosome 1
humchr02.txt    Index of proteins encoded on human chromosome 2
humchr03.txt    Index of proteins encoded on human chromosome 3
humchr04.txt    Index of proteins encoded on human chromosome 4
humchr05.txt    Index of proteins encoded on human chromosome 5
humchr06.txt    Index of proteins encoded on human chromosome 6
humchr07.txt    Index of proteins encoded on human chromosome 7
humchr08.txt    Index of proteins encoded on human chromosome 8
humchr09.txt    Index of proteins encoded on human chromosome 9
humchr10.txt    Index of proteins encoded on human chromosome 10
humchr11.txt    Index of proteins encoded on human chromosome 11
humchr12.txt    Index of proteins encoded on human chromosome 12
humchr13.txt    Index of proteins encoded on human chromosome 13
humchr14.txt    Index of proteins encoded on human chromosome 14
humchr15.txt    Index of proteins encoded on human chromosome 15
humchr16.txt    Index of proteins encoded on human chromosome 16
humchr17.txt    Index of proteins encoded on human chromosome 17
humchr18.txt    Index of proteins encoded on human chromosome 18
humchr19.txt    Index of proteins encoded on human chromosome 19
humchr20.txt    Index of proteins encoded on human chromosome 20
humchr21.txt    Index of proteins encoded on human chromosome 21
humchr22.txt    Index of proteins encoded on human chromosome 22
humchrx.txt     Index of proteins encoded on human chromosome X
humchry.txt     Index of proteins encoded on human chromosome Y
humpvar.txt     Index of human proteins with sequence variants
initfact.txt    List and index of translation initiation factors
intein.txt      Index of intein-containing entries referenced in Swiss-Prot
metallo.txt     Classification of metallothioneins and index of the entries in Swiss-Prot
metja.txt       Index of Methanococcus jannaschii entries
mgdtosp.txt     Index of MGD entries referenced in Swiss-Prot
mimtosp.txt     Index of MIM entries referenced in Swiss-Prot
mycge.txt       Index of Mycoplasma genitalium strain G-37 chromosomal entries
mycpn.txt       Index of Mycoplasma pneumoniae strain M129 chromosomal entries
ngr234.txt      Table of predicted proteins in Rhizobium plasmid pNGR234a
nomlist.txt     List of nomenclature related references for proteins
pdbtosp.txt     Index of Protein Data Bank (PDB) entries referenced in Swiss-Prot
peptidas.txt    Classification of peptidase families and index of peptidase entries in Swiss-Prot
plastid.txt     List of chloroplast and cyanelle encoded proteins
pombe.txt       Index of Schizosaccharomyces pombe entries and their corresponding gene designations
restric.txt     List of restriction enzyme and methylase entries
ribosomp.txt    Index of ribosomal proteins classified by families on the basis of sequence similarities
ricpr.txt       Index of Rickettsia prowazekii strain Madrid E entries
salty.txt       Index of Salmonella typhimurium strain LT2 chromosomal entries and their
                corresponding StyGene cross-references
syny3.txt       Index of Synechocystis sp. strain PCC 6803 entries
upflist.txt     List of UPF (Uncharacterized Protein Families) and index of members
yeast.txt       Index of Saccharomyces cerevisiae entries in Swiss-Prot and their
                corresponding gene designations
yeast1.txt      Yeast chromosome I entries
yeast2.txt      Yeast chromosome II entries
yeast3.txt      Yeast chromosome III entries
yeast5.txt      Yeast chromosome V entries
yeast6.txt      Yeast chromosome VI entries
yeast7.txt      Yeast chromosome VII entries
yeast8.txt      Yeast chromosome VIII entries
yeast9.txt      Yeast chromosome IX entries
yeast10.txt     Yeast chromosome X entries
yeast11.txt     Yeast chromosome XI entries
yeast13.txt     Yeast chromosome XIII entries
yeast14.txt     Yeast chromosome XIV entries
- E.1 Generalities
Swiss-Prot is available for download on the following anonymous FTP
servers:
- E.2 Non-redundant database
On the ExPASy and EBI FTP servers we distribute files that make up a non-redundant
and complete protein sequence database consisting of three components:
1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as TrEMBL_New)
Every week three files are completely rebuilt. These files are named
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when
decompressed, produce ASCII files in Swiss-Prot format.
Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence
files useful for building the databases used by FASTA, BLAST and other
sequence similarity search programs. Please do not use these files for any
other purpose, as you will lose all annotations by using this stripped-down
format.
The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).
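The '.dat.gz' files do not need to be decompressed on disk before use. The Python sketch below reads a gzip-compressed file in Swiss-Prot format one entry at a time; it relies on the '//' termination line that ends each entry, and the function name iter_entries is hypothetical.

```python
import gzip


def iter_entries(path):
    """Yield each entry of a gzip-compressed flat file as a list of lines."""
    entry = []
    # The decompressed files are ASCII text, so they are opened in text mode.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line == "//":
                # Termination line: the current entry is complete.
                yield entry
                entry = []
            else:
                entry.append(line)
    if entry:
        # Tolerate a final entry without a termination line.
        yield entry
```

For example, sum(1 for _ in iter_entries("sprot.dat.gz")) would count the entries in a downloaded copy of the Swiss-Prot file without holding the whole file in memory.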
Additional notes:
- The Swiss-Prot file continuously grows as new annotated sequences are added.
- The TrEMBL file decreases in size as sequences are annotated and moved out of that
section into Swiss-Prot. Four times a year a new release of TrEMBL is
built at EBI; at this point the TrEMBL file increases in size, as it then includes all of the new
data (see next section) that have accumulated since the last release.
- The TrEMBL_New file starts as a very small file and grows in size until a new release of
TrEMBL is available.
- Swiss-Prot and TrEMBL share the same system of accession numbers. Therefore you
will not find any primary accession number duplicated between the two sections. A TrEMBL
entry (and its associated accession number(s)) can either move to Swiss-Prot as a new
entry or be merged with an existing Swiss-Prot entry. In the latter case, the accession
number(s) of that TrEMBL entry are added to that of the Swiss-Prot entry.
- TrEMBL_New does not have real accession numbers. However it was necessary to have an
'AC' line so as to be able to use it with different software products. This AC line contains a
temporary identifier which consists of the protein_ID (protein sequence identifier) of the
coding sequence in the parent nucleotide sequence.
- TrEMBL_New is quite messy! You will of course find new sequence entries but you will also
encounter sequences that are going to be used to update existing TrEMBL or Swiss-Prot
entries. None of the "cleaning" steps that are applied to produce a TrEMBL release are run
on TrEMBL_New nor are any of the computer-annotation software tools that are used to
enhance the information content of TrEMBL. TrEMBL_New is provided only so that users
can be sure not to miss any important new sequences when they run similarity searches.
- While these three files allow you to build what we call a 'non-redundant' database, this
description is not strictly accurate. Without going into a long explanation, we
can say that this is currently the best attempt at providing a complete selection of protein
sequence entries while trying to eliminate redundancies. While Swiss-Prot is completely
(well, 99.994%!) non-redundant, TrEMBL is far from being non-redundant, and the combination of
Swiss-Prot and TrEMBL is even less so.
- To describe to your users the version of the non-redundant database that you are providing
them with, you should use a statement of the form:
Swiss-Prot release 41.x of xx-yyy-2003;
TrEMBL release 23.x of xx-yyy-2003;
TrEMBL_New of xx-yyy-2003.
- E.3 Weekly updates of Swiss-Prot documents
Until now, the ExPASy FTP server provided the Swiss-Prot documents and
indexes only in the versions that were current at the time of the last
full release. All documents are now updated with every weekly
release of Swiss-Prot. They are available for FTP download from the
directory /databases/swiss-prot/updated_doc/.
- E.4 Weekly updates of Swiss-Prot
Weekly updates of Swiss-Prot are available by anonymous FTP. Three files
are generated at each update:
new_seq.dat     Contains all the new entries since the last full release;
upd_seq.dat     Contains the entries for which the sequence data has been updated since the
                last release;
upd_ann.dat     Contains the entries for which one or more annotation fields have been updated
                since the last release.
Important notes
- Although we try to follow a regular schedule, we do not promise to update these files every
week. In most cases two weeks may elapse between two updates.
- Instead of using the above files, you can, every week, download an updated copy of the
Swiss-Prot database. This file is available in the directory containing the non-redundant
database (see section E.2).
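For users who prefer to apply the three update files to a local copy of the last full release, the procedure might look like the Python sketch below. It is only an illustration under several assumptions: entries are keyed on the first accession number of the AC line, each entry ends with a '//' termination line, and the handling of entries that have been merged or deleted is left out. The function names are hypothetical.

```python
def split_entries(text):
    """Split flat-file text into entries (lists of lines) at '//' lines."""
    entries, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    return entries


def primary_accession(entry_lines):
    """Return the first accession number listed on the entry's AC line."""
    for line in entry_lines:
        if line.startswith("AC "):
            return line[2:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release, *update_texts):
    """Return a copy of release (accession -> entry) with the updates applied.

    Each update text stands for the content of new_seq.dat, upd_seq.dat or
    upd_ann.dat; an updated entry simply replaces the one it supersedes.
    """
    merged = dict(release)
    for text in update_texts:
        for entry in split_entries(text):
            merged[primary_accession(entry)] = entry
    return merged


release = {"P01234": ["ID   OLD_ENTRY", "AC   P01234;"]}
new_seq = "ID   NEW_ENTRY\nAC   Q1AA12;\n//\n"
upd_ann = "ID   UPDATED_ENTRY\nAC   P01234;\n//\n"
merged = apply_updates(release, new_seq, upd_ann)
print(sorted(merged))  # ['P01234', 'Q1AA12']
```

Downloading the complete weekly file, as described in the note above, avoids the bookkeeping that this sketch omits.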
Appendix F: Relationships between Swiss-Prot and some biomolecular databases
The current status of the relationships (cross-references) between Swiss-Prot and some
biomolecular databases is shown in the following schema: