
Swiss-Prot Protein Knowledgebase
User Manual

Release XX.X

Amos Bairoch
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: +41-22-379 50 50
Fax: +41-22-379 58 58
Electronic mail address: amos.bairoch@isb-sib.ch
WWW server: http://www.expasy.org/sprot/


Rolf Apweiler
The EMBL Outstation - The European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: +44-1223-494 444
Fax: +44-1223-494 468
Electronic mail address: datalib@ebi.ac.uk
WWW server: http://www.ebi.ac.uk/swissprot/


Acknowledgements

This release of Swiss-Prot has been prepared by:


Copyright notice

Swiss-Prot is copyright. It is produced through a collaboration between the Swiss Institute of Bioinformatics and the EMBL Outstation - the European Bioinformatics Institute. There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to license@isb-sib.ch.

The above copyright notice also applies to this user manual as well as to all other Swiss-Prot documents.


How to submit data or updates/corrections to Swiss-Prot

To submit new sequence data to Swiss-Prot, and for all queries regarding submissions, contact:

Swiss-Prot
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail: datasubs@ebi.ac.uk (for submissions); datalib@ebi.ac.uk (for queries)
To submit updates and/or corrections to Swiss-Prot you can either use the E-mail address: swiss-prot@expasy.org or the WWW address:

http://www.expasy.org/sprot/update.html


Citation

If you want to cite Swiss-Prot in a publication, please use the following reference:

Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O’Donovan C., Phan I., Pilbout S. and Schneider M.
The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.
Nucleic Acids Res. 31:365-370(2003).

Table of contents

1. What is Swiss-Prot?

2. Conventions used in the database
2.1.   General structure of the database
2.2.   Classes of data
2.3.   Structure of a sequence entry

3. The different line types
3.1    The ID line
3.2    The AC line
3.3    The DT line
3.4    The DE line
3.5    The GN line
3.6    The OS line
3.7    The OG line
3.8    The OC line
3.9    The OX line
3.10  The reference (RN, RP, RC, RX, RA, RT, RL) lines
3.11  The CC line
3.12  The DR line
3.13  The KW line
3.14  The FT line
3.15  The SQ line
3.16  The sequence data line
3.17  The // line
Appendix A: Feature table keys
A.1   Change indicators
A.2   Amino-acid modifications
A.3   Regions
A.4   Secondary structure
A.5   Others
Appendix B: Amino-acid codes
Appendix C: Format differences between the Swiss-Prot and EMBL databases
C.1   Generalities
C.2   Differences in line types present in both databases
C.3   Line types defined by Swiss-Prot but currently not used by EMBL
C.4   Line types defined by EMBL but currently not used by Swiss-Prot
Appendix D: Swiss-Prot documentation files
Appendix E: FTP access to Swiss-Prot and TrEMBL
E.1   Generalities
E.2   Non-redundant database
E.3   Weekly updates of Swiss-Prot documents
E.4   Weekly updates of Swiss-Prot
Appendix F: Relationships between Swiss-Prot and some biomolecular databases

1. What is Swiss-Prot?

Swiss-Prot is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively since 1987 by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The Swiss-Prot protein knowledgebase consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL Nucleotide Sequence Database.

Swiss-Prot distinguishes itself from other protein sequence databases by four distinct criteria:

a) Annotation
In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation.

For each sequence entry the core data consists of:

The annotation consists of the description of the following items:

We try to include as much annotation information as possible in Swiss-Prot. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

We believe that having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of Swiss-Prot.

In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by 'topics'; this approach permits the easy retrieval of specific categories of data from the database.

b) Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

c) Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. Swiss-Prot is currently cross-referenced to about 45 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot. This extensive network of cross-references allows Swiss-Prot to play a major role as a focal point of biomolecular database interconnectivity.

d) Documentation
Swiss-Prot is distributed with a large number of index files and specialized documentation files. Some of these files have been available for a long time (this user manual, the release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The release notes contain an up-to-date descriptive list of all distributed document files.

2. Conventions used in the database

The following sections describe the general conventions used in Swiss-Prot to achieve uniformity of presentation. Experienced users of the EMBL Database can skip these sections and directly refer to Appendix C, which lists the minor differences in format between the two data collections.

2.1. General structure of the database
The Swiss-Prot protein sequence database is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed to the bank or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions. Conversely, a single paper can provide data for several entries, e.g. when related sequences from different organisms are reported.

References to positions within a sequence are made using sequential numbering, beginning with 1 at the N-terminal end of the sequence.
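
The numbering convention above is 1-based and, in the feature table, inclusive at both ends, whereas many programming languages index from 0. The following is a minimal sketch of the conversion; the helper name `subsequence` is an assumption made for illustration and is not part of Swiss-Prot.

```python
def subsequence(sequence, start, end):
    """Return residues start..end using Swiss-Prot numbering.

    Positions are 1-based and inclusive, with position 1 being the
    N-terminal residue. (Hypothetical helper, for illustration only.)
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("positions must satisfy 1 <= start <= end <= length")
    # Python slices are 0-based and exclude the end index,
    # so only the start position needs to be shifted by one.
    return sequence[start - 1:end]


# First 30 residues of the sample entry GRAA_HUMAN shown in section 2.3.
n_terminus = "MRNSYRFLASSLSVVVSLLLIPEDVCEKII"
signal_peptide = subsequence(n_terminus, 1, 26)  # FT SIGNAL 1 26
propeptide = subsequence(n_terminus, 27, 28)     # FT PROPEP 27 28
```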

Except for initiator N-terminal methionine residues, which are not included in a sequence when their absence from the mature sequence has been proven, the sequence data correspond to the precursor form of a protein before posttranslational modifications and processing.

2.2. Data classes
To make data available to users as quickly as possible after publication, Swiss-Prot is distributed with a supplement called TrEMBL, where entries are released before all their details are finalized. To distinguish between fully annotated entries and those in TrEMBL, the 'class' of each entry is indicated on the first (ID) line of the entry. The two defined classes are:

STANDARD Data which are complete and up to the standards laid down by the Swiss-Prot database.
PRELIMINARY Sequence entries which have not yet been annotated by the Swiss-Prot staff up to the standards laid down by Swiss-Prot. These entries are exclusively found in TrEMBL.

2.3. Structure of a sequence entry
The entries in the Swiss-Prot database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.

Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. A sample sequence entry is shown below.


ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   28-FEB-2003 (Rel. 41, Last annotation update)
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE   (Fragmentin 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=T-cell;
RX   MEDLINE=88125000; PubMed=3257574;
RA   Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T
RT   cell- and natural killer cell-specific trypsin-like serine
RT   protease.";
RL   Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Blood;
RA   Strausberg R.;
RL   Submitted (OCT-2001) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   SEQUENCE OF 1-23 FROM N.A.
RA   Goralski T.J., Krensky A.M.;
RT   "The upstream region of the human granzyme A locus contains both
RT   positive and negative transcriptional regulatory elements.";
RL   Submitted (NOV-1995) to the EMBL/GenBank/DDBJ databases.
RN   [4]
RP   SEQUENCE OF 29-53.
RX   MEDLINE=88330824; PubMed=3047119;
RA   Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA   Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. Biol. Chem. 263:13215-13222(1988).
RN   [5]
RP   SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   MEDLINE=89009866; PubMed=3262682;
RA   Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. Immunol. 141:3142-3147(1988).
RN   [6]
RP   SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   MEDLINE=89035468; PubMed=3263427;
RA   Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA   Carrel S., Tschopp J.;
RT   "Characterization of granzymes A and B isolated from granules of
RT   cloned human cytotoxic T lymphocytes.";
RL   J. Immunol. 141:3471-3477(1988).
RN   [7]
RP   3D-STRUCTURE MODELING.
RX   MEDLINE=89184501; PubMed=3237717;
RA   Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA   Weissman I.L., James M.N.G.;
RT   "Comparative molecular model building of two serine proteinases from
RT   cytotoxic T lymphocytes.";
RL   Proteins 4:190-204(1988).
CC   -!- FUNCTION: This enzyme is necessary for target cell lysis in cell-
CC       mediated immune responses. It cleaves after Lys or Arg. May be
CC       involved in apoptosis.
CC   -!- CATALYTIC ACTIVITY: Hydrolysis of proteins, including fibronectin,
CC       type IV collagen and nucleolin. Preferential cleavage: Arg-|-Xaa,
CC       Lys-|-Xaa >> Phe-|-Xaa in small molecule substrates.
CC   -!- SUBUNIT: Homodimer; disulfide-linked.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic granules.
CC   -!- SIMILARITY: Belongs to peptidase family S1. Granzyme subfamily.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; M18737; AAA52647.1; -.
DR   EMBL; BC015739; AAH15739.1; -.
DR   EMBL; U40006; AAD00009.1; -.
DR   PIR; A28943; A28943.
DR   PIR; A30525; A30525.
DR   PIR; A30526; A30526.
DR   PIR; A31372; A31372.
DR   PDB; 1HF1; 15-OCT-94.
DR   MEROPS; S01.135; -.
DR   Genew; HGNC:4708; GZMA.
DR   MIM; 140050; -.
DR   InterPro; IPR001254; Ser_protease_Try.
DR   Pfam; PF00089; trypsin; 1.
DR   SMART; SM00020; Tryp_SPc; 1.
DR   PROSITE; PS50240; TRYPSIN_DOM; 1.
DR   PROSITE; PS00134; TRYPSIN_HIS; 1.
DR   PROSITE; PS00135; TRYPSIN_SER; 1.
KW   Hydrolase; Serine protease; Zymogen; Signal; T-cell; Cytolysis;
KW   Apoptosis; 3D-structure.
FT   SIGNAL        1     26
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   CHAIN        29    262       GRANZYME A.
FT   ACT_SITE     69     69       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    114    114       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    212    212       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   DISULFID     54     70       BY SIMILARITY.
FT   DISULFID    148    218       BY SIMILARITY.
FT   DISULFID    179    197       BY SIMILARITY.
FT   DISULFID    208    234       BY SIMILARITY.
FT   CARBOHYD    170    170       N-LINKED (GLCNAC...) (POTENTIAL).
FT   STRAND       30     30
FT   STRAND       33     34
FT   TURN         37     38
FT   TURN         41     42
FT   STRAND       43     48
FT   TURN         49     51
FT   STRAND       52     60
FT   TURN         61     62
FT   STRAND       63     66
FT   TURN         68     69
FT   STRAND       76     80
FT   STRAND       84     84
FT   TURN         85     86
FT   TURN         90     91
FT   STRAND       93    102
FT   TURN        104    105
FT   HELIX       108    110
FT   TURN        112    113
FT   STRAND      116    120
FT   STRAND      127    127
FT   TURN        128    129
FT   STRAND      130    130
FT   STRAND      134    134
FT   TURN        138    139
FT   TURN        144    145
FT   STRAND      147    152
FT   STRAND      155    157
FT   TURN        158    159
FT   STRAND      160    162
FT   STRAND      165    165
FT   STRAND      167    173
FT   HELIX       176    180
FT   TURN        181    182
FT   TURN        184    185
FT   TURN        193    194
FT   STRAND      195    199
FT   TURN        201    202
FT   STRAND      206    206
FT   TURN        209    210
FT   TURN        212    213
FT   STRAND      215    218
FT   TURN        219    220
FT   STRAND      221    228
FT   TURN        231    232
FT   TURN        234    235
FT   TURN        237    238
FT   STRAND      241    245
FT   TURN        246    249
FT   HELIX       252    260
SQ   SEQUENCE   262 AA;  28968 MW;  DA87363A0D92BAF4 CRC64;
     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
     KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
     TEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
     DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
     GVYILLSKKH LNWIIMTIKG AV
//
Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes and the order in which they appear in an entry, are shown in the table below.

Line code   Content                         Occurrence in an entry
ID          Identification                  Once; starts the entry
AC          Accession number(s)             Once or more
DT          Date                            Three times
DE          Description                     Once or more
GN          Gene name(s)                    Optional
OS          Organism species                Once or more
OG          Organelle                       Optional
OC          Organism classification         Once or more
OX          Taxonomy cross-reference(s)     Once or more
RN          Reference number                Once or more
RP          Reference position              Once or more
RC          Reference comment(s)            Optional
RX          Reference cross-reference(s)    Optional
RA          Reference authors               Once or more
RT          Reference title                 Optional
RL          Reference location              Once or more
CC          Comments or notes               Optional
DR          Database cross-references       Optional
KW          Keywords                        Optional
FT          Feature table data              Optional
SQ          Sequence header                 Once
(blanks)    Sequence data                   Once or more
//          Termination line                Once; ends the entry


As shown in the above table, some line types are found in all entries, others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).

A detailed description of each line type is given in the next section of this document. It must be noted that, with the exception of GN, all Swiss-Prot line types exist in the EMBL Database. A description of the format differences between the Swiss-Prot and EMBL databases is given in Appendix C of this document.

The two-character line-type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 75, with one exception: CC lines that contain the 'DATABASE' topic (see section 3.11).
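
Because the line code occupies the first two characters and the data start at the sixth, a reader for the flat file can be very small. The sketch below is one possible way to split lines and group them into entries; the function names `split_line` and `iter_entries` are assumptions chosen for illustration, not part of any Swiss-Prot software.

```python
def split_line(line):
    """Split one flat-file line into (line_code, data).

    The code is characters 1-2; the data begin at character 6.
    Sequence data lines have a blank line code ('  ').
    """
    return line[:2], line[5:].rstrip()


def iter_entries(lines):
    """Yield each entry as a list of (line_code, data) pairs.

    An entry ends at the '//' terminator line. Lines after the last
    terminator are ignored in this sketch.
    """
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(split_line(line))


example = [
    "ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.",
    "AC   P12544;",
    "//",
]
entries = list(iter_entries(example))
```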

3. The different line types

3.1. The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
3.1.1. Entry name

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric characters.

Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where:

Examples:

PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.

However, for species most commonly encountered in the database, self-explanatory codes are used. There are 16 of those codes: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), and YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were given arbitrary, but generally easy-to-remember identification codes.

Examples of complete protein sequence entry names are: RL1_ECOLI for ribosomal protein L1 from Escherichia coli, and SODC_DROME for Superoxide dismutase [Cu-Zn] from Drosophila melanogaster.

The names of all the presently-defined species identification codes are listed in the Swiss-Prot document file speclist.txt.

3.1.2. Data class

The second item on the ID line indicates the data class of the entry (see section 2.2).

3.1.3. Molecule type

The third item on the ID line is a three-letter code that indicates the type of molecule of the entry; in Swiss-Prot it is 'PRT' (for PRoTein).

3.1.4. Length of the molecule

The fourth and last item of the ID line is the length of the molecule, which is the total number of amino acids in the sequence. This number includes the positions reported to be present but which have not been determined (coded as 'X'). The length is followed by the letter code 'AA' (Amino Acids).

3.1.5. Examples of identification lines

Two examples of ID lines are shown below:


ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.

ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.
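
The four items of the ID line can be pulled apart with a regular expression. This is a sketch only: the pattern `ID_RE` and the function `parse_id` are illustrative names, and the pattern assumes that the data class and molecule type are single words, as in the examples above.

```python
import re

# ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID_RE = re.compile(r"^ID\s+(\S+)\s+(\w+);\s+(\w+);\s+(\d+)\s+AA\.$")


def parse_id(line):
    """Return the four items of an ID line as a dictionary."""
    match = ID_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a valid ID line: %r" % line)
    entry_name, data_class, molecule_type, length = match.groups()
    return {
        "entry_name": entry_name,
        "data_class": data_class,        # STANDARD or PRELIMINARY
        "molecule_type": molecule_type,  # 'PRT' in Swiss-Prot
        "length": int(length),           # number of amino acids
    }


record = parse_id("ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.")
```
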
3.2. The AC line

The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:

AC   AC_number_1;[ AC_number_2;]...[ AC_number_N;]
An example of an accession number line is shown below:


AC   P00321;
Semicolons separate the accession numbers and a semicolon terminates the list. If necessary, more than one AC line can be used. Example:


AC   Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894;
AC   Q92895; Q93053; Q99605; O00713; O00714; O00715;
The purpose of accession numbers is to provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of Swiss-Prot entries.

Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the 'primary accession number'.
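
As an illustration of the rules above, the accession numbers of an entry can be collected across all of its AC lines, the first one being the primary accession number. The function name `parse_ac` is an assumption made for this sketch.

```python
def parse_ac(lines):
    """Return all accession numbers from the AC line(s) of one entry.

    Accession numbers are separated by semicolons and the list is
    terminated by a semicolon; the first element is the primary
    accession number.
    """
    accessions = []
    for line in lines:
        for item in line[5:].split(";"):
            item = item.strip()
            if item:
                accessions.append(item)
    return accessions


ac_lines = [
    "AC   Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894;",
    "AC   Q92895; Q93053; Q99605; O00713; O00714; O00715;",
]
accessions = parse_ac(ac_lines)
primary = accessions[0]  # the number to cite
```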

Entries will have more than one accession number if they have been merged or split. For example, when two entries are merged into one, the accession numbers from both entries are stored in the AC line(s).

If an existing entry is split into two or more entries (a rare occurrence), the original accession numbers are retained in all the derived entries and a new primary accession number is added to all the entries.

An accession number is dropped only when the data to which it was assigned have been completely removed from the database. Deleted accession numbers are listed in the Swiss-Prot document file deleteac.txt.

Accession numbers consist of 6 alphanumerical characters in the following format:

 1  2  3  4  5  6
 [O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]


Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1 and P4A123.
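
The six-position format above translates directly into a regular expression. The following sketch checks whether a string has that shape; it says nothing about whether the accession number actually exists in the database. The names `AC_RE` and `is_valid_accession` are illustrative assumptions.

```python
import re

# Position:         1      2      3-5          6
AC_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(text):
    """Return True if text matches the accession number format."""
    return AC_RE.fullmatch(text) is not None


valid = [is_valid_accession(a) for a in ("P12345", "Q1AAA9", "O456A1", "P4A123")]
```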

3.3. The DT line

The DT (DaTe) lines show the date of creation and last modification of the database entry. The format of the DT line is:

DT   DD-MMM-YYYY (Rel. XX, Comment)
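Where 'DD' is the day, 'MMM' the month, 'YYYY' the year, and 'XX' the Swiss-Prot release number. The comment portion of the line indicates the action taken on that date. There are always three DT lines in each entry, each associated with a specific comment: 'Created' for the date the entry first appeared in the database, 'Last sequence update' for the date the sequence data were last modified, and 'Last annotation update' for the date data other than the sequence were last modified.

Example of a block of DT lines: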

DT   01-AUG-1988 (Rel. 08, Created)
DT   30-MAY-2000 (Rel. 39, Last sequence update)
DT   16-OCT-2001 (Rel. 40, Last annotation update)
Whenever the sequence of an entry is updated there is always also an annotation update. The date in the 3rd DT line is thus always at least as recent as the one in the 2nd DT line.
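
The DT format and the ordering rule just stated can be checked programmatically. The sketch below uses an explicit month table rather than locale-dependent date parsing; `DT_RE`, `MONTHS` and `parse_dt` are names invented for this example.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}

# DT   DD-MMM-YYYY (Rel. XX, Comment)
DT_RE = re.compile(r"^DT   (\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), (.+)\)$")


def parse_dt(line):
    """Return (date, release_number, comment) for one DT line."""
    match = DT_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a valid DT line: %r" % line)
    day, month, year, release, comment = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), comment


block = [
    "DT   01-AUG-1988 (Rel. 08, Created)",
    "DT   30-MAY-2000 (Rel. 39, Last sequence update)",
    "DT   16-OCT-2001 (Rel. 40, Last annotation update)",
]
created, sequence_update, annotation_update = [parse_dt(l) for l in block]

# The annotation date is never older than the sequence date.
consistent = annotation_update[0] >= sequence_update[0]
```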

Regarding the third DT line, one should note that such a line is not updated when an entry is the target of what we call a 'global change'. A global change is defined as any operation which involves changes in all or most Swiss-Prot entries. These changes are announced in the release notes and are usually linked to formatting issues. As such global changes take place at almost each release, we strongly advise users of the database to completely reload Swiss-Prot at each release cycle.

3.4. The DE line

The DE (DEscription) lines contain general descriptive information about the sequence stored. This information is generally sufficient to identify the protein precisely.

The format of the DE line is:
DE   Description.
The description is given in ordinary English (using US-spelling) and is free-format.

In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period.

The description always starts with the proposed official name of the protein. Synonyms are indicated between parentheses. Example:


DE   Annexin V (Lipocortin V) (Endonexin II) (Calphobindin I) (CBP-I)
DE   (Placental anticoagulant protein I) (PAP-I) (PP4) (Thromboplastin
DE   inhibitor) (Vascular anticoagulant-alpha) (VAC-alpha) (Anchorin CII).
When a protein is known to be cleaved into multiple functional components, the description starts with the name of the precursor protein, followed by a section delimited by '[Contains: ...]'. All the individual components are listed in that section and are separated by semi-colons (';'). Synonyms are allowed at the level of the precursor and for each individual component. Example:


DE   Corticotropin-lipotropin precursor (Pro-opiomelanocortin) (POMC)
DE   [Contains: NPP; Melanotropin gamma (Gamma-MSH); Corticotropin
DE   (Adrenocorticotropic hormone) (ACTH); Melanotropin alpha (Alpha-MSH);
DE   Corticotropin-like intermediary peptide (CLIP); Lipotropin beta (Beta-
DE   LPH); Lipotropin gamma (Gamma-LPH); Melanotropin beta (Beta-MSH);
DE   Beta-endorphin; Met-enkephalin].
When a protein is known to include multiple functional domains each of which is described by a different name, the description starts with the name of the overall protein, followed by a section delimited by '[Includes: ...]'. All the domains are listed in that section and are separated by semi-colons (';'). Synonyms are allowed at the level of the protein and for each individual domain. Example:


DE   CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate
DE   synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);
DE   Dihydroorotase (EC 3.5.2.3)].
When the complete sequence was not determined, the last information given on the DE lines is '(Fragment)' or '(Fragments)'. Example:


DE   Dihydrodipicolinate reductase (EC 1.3.1.26) (DHPR) (Fragment).
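
The DE conventions described in this section can be sketched in code as follows. This is a simplified illustration, not a complete parser: `parse_de` and `join_de` are hypothetical names, the rule of joining a line that ends in a hyphen without a space is an assumption inferred from the 'Beta-' / 'LPH' example above, and component names that themselves contain square brackets are not handled.

```python
import re

SECTION_RE = re.compile(r"\s*\[(Contains|Includes): (.*?)\]")


def join_de(lines):
    """Join the data of consecutive DE lines into one string."""
    text = ""
    for line in lines:
        part = line[5:].strip()
        # Assumption: a trailing hyphen means the word continues.
        if text and not text.endswith("-"):
            text += " "
        text += part
    return text


def parse_de(lines):
    """Split a description into name, components, domains, fragment flag."""
    text = join_de(lines)
    if text.endswith("."):
        text = text[:-1]
    fragment = re.search(r"\s*\(Fragments?\)$", text)
    if fragment:
        text = text[:fragment.start()]
    sections = {"Contains": [], "Includes": []}
    for label, body in SECTION_RE.findall(text):
        sections[label].extend(item.strip() for item in body.split(";"))
    return {
        "name": SECTION_RE.sub("", text).strip(),  # name plus synonyms
        "contains": sections["Contains"],
        "includes": sections["Includes"],
        "fragment": fragment is not None,
    }


cad = parse_de([
    "DE   CAD protein [Includes: Glutamine-dependent carbamoyl-phosphate",
    "DE   synthase (EC 6.3.5.5); Aspartate carbamoyltransferase (EC 2.1.3.2);",
    "DE   Dihydroorotase (EC 3.5.2.3)].",
])
```
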
3.5. The GN line

The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored protein sequence.

The format of the GN line is:

GN   NAME1[ AND|OR NAME2...].
Examples:


GN   ALB.

GN   CRE-1.
It often occurs that more than one gene name has been assigned to an individual locus, in which case all the synonyms will be listed. The word 'OR' separates the different designations. The first name in the list is assumed to be the most correct (or most current) designation. Example:


GN   HNS OR HNSA OR DRDX OR OSMZ OR BGLY OR MSYA OR CUR OR PILG OR TOPS OR
GN   B1237 OR Z2013 OR ECS1739.
In a few cases, multiple genes code for an identical protein sequence, in which case all the different gene names will be listed. The word 'AND' separates the designations. Example:


GN   B23R AND C17L.
In some cases 'AND' and 'OR' are both present. Parentheses are then used as shown in the following examples:


GN   GVPA AND (GVPB OR GVPA2).

GN   (HHT1 OR SPAC1834.04) AND (HHT2 OR SPBC8D2.04 OR PI060).
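
One way to represent the 'AND'/'OR' structure shown above is a list of genes, each gene being a list of synonymous names. The sketch below assumes that 'AND' only appears at the top level and 'OR' only inside a group, as in the examples in this section; `parse_gn` is a hypothetical helper name.

```python
def parse_gn(lines):
    """Return a list of genes; each gene is a list of synonyms.

    'AND' separates different genes, 'OR' separates synonyms of one
    gene. Assumes parentheses only ever enclose a group of synonyms.
    """
    text = " ".join(line[5:].strip() for line in lines)
    if text.endswith("."):
        text = text[:-1]
    genes = []
    for part in text.split(" AND "):
        part = part.strip()
        if part.startswith("(") and part.endswith(")"):
            part = part[1:-1]
        genes.append(part.split(" OR "))
    return genes


genes = parse_gn(["GN   (HHT1 OR SPAC1834.04) AND (HHT2 OR SPBC8D2.04 OR PI060)."])
```

The first synonym of each gene is the one the manual describes as the most correct or most current designation.
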
3.6. The OS line

The OS (Organism Species) line specifies the organism(s) which was (were) the source of the stored sequence. In the rare case where all the species information will not fit on a single line, more than one OS line is used. The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and species designation followed by the English name (in parentheses). For viruses, only the common English name is given.

Examples of OS lines are shown here:


OS   Escherichia coli.

OS   Homo sapiens (Human).

OS   Solanum melongena (Eggplant) (Aubergine).

OS   Rous sarcoma virus (strain Schmidt-Ruppin).
If a Swiss-Prot entry reports the sequence of a protein identical in a number of species, the name of these species will all be listed in the OS lines of that entry. The species names are separated by commas, the last species name is preceded by the word 'and'. Here are examples of the OS lines for entries representing multiple species:


OS   Oncorhynchus nerka (Sockeye salmon), and
OS   Oncorhynchus masou (Cherry salmon) (Masu salmon).


OS   Mus musculus (Mouse),
OS   Rattus norvegicus (Rat), and
OS   Bos taurus (Bovine).
The names (official name, common name, synonym) concerning one species are cut across lines when they do not fit into a single line:


OS   Epizootic hemorrhagic disease virus (serotype 2 / strain Alberta)
OS   (Ehdv-2).
3.7. The OG line

The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, the cyanelle, the nucleomorph or a plasmid.

The format of the OG line is:

OG   Chloroplast.
OG   Cyanelle.
OG   Mitochondrion.
OG   Nucleomorph.
OG   Plasmid name.
Where 'name' is the name of the plasmid.

If a Swiss-Prot entry reports the sequence of a protein identical in a number of plasmids, the names of these plasmids will all be listed in the OG lines of that entry. The plasmid names are separated by commas, the last plasmid name is preceded by the word 'and'. Plasmid names are never cut across two lines. Examples:


OG   Plasmid IncFIV R124, and Plasmid IncFI ColV3-K30.


OG   Plasmid R6-5, Plasmid IncFII NR1, and
OG   Plasmid IncFII R1-19 (R1 drd-19).
The Swiss-Prot document plasmid.txt lists all the plasmid names that are used in the database in the context of the OG line.

3.8. The OC line

The OC (Organism Classification) lines contain the taxonomic classification of the source organism. The taxonomic classification used in Swiss-Prot is that maintained at the NCBI (see http://www.ncbi.nlm.nih.gov/Taxonomy/) and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). The NCBI's taxonomy reflects current phylogenetic knowledge. It is a sequence-based taxonomy as much as possible and based on published authorities wherever possible. Because of the inherent ambiguity of evolutionary classification and the specific needs of database users (e.g. trying to track down the phylogenetic history of a group of organisms or to elucidate the evolution of a molecule), this taxonomy strives to accurately reflect current phylogenetic knowledge. The NCBI's taxonomy is intended to be informative and helpful; no claim is made that it is the best or the most exact.

The classification is listed top-down as nodes in a taxonomic tree in which the most general grouping is given first. The classification may be distributed over several OC lines, but nodes are not split or hyphenated between lines. Semicolons separate the individual items and the list is terminated by a period.

The format of the OC line is:

OC   Node[; Node...].
For example the classification lines for a human sequence would be:


OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
If a protein is identical in more than one species, all the species names will be listed in the OS lines (see 3.6) but the OC lines will only contain the classification for the first species listed.
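
Since nodes are never split between lines, the OC lines of an entry can simply be joined and split on semicolons. A minimal sketch follows; the function name `parse_oc` is an assumption made for this example.

```python
def parse_oc(lines):
    """Return the taxonomic nodes of an entry, most general first."""
    text = " ".join(line[5:].strip() for line in lines)
    if text.endswith("."):
        text = text[:-1]  # the list is terminated by a period
    return [node.strip() for node in text.split(";") if node.strip()]


lineage = parse_oc([
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.",
])
```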

3.9. The OX line

The OX (Organism taxonomy cross-reference) line is used to indicate the identifier of a specific organism in a taxonomic database. The number of taxonomic codes is identical to the number of species given in the OS line(s). There can be more than one OX line in an entry. The format of the OX line is:
OX   Taxonomy_database_Qualifier=Taxonomic code[, Taxonomic code...];
Currently the cross-references are made to the taxonomy database of NCBI, which is associated with the qualifier 'TaxID' and a one- to six-digit taxonomic code.

Examples:

OX   NCBI_TaxID=9606;


OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
OX   9031, 8355, 7227, 7213, 7108, 7130;
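
The OX format lends itself to a very small parser. The Python sketch below is only an illustration: the name 'parse_ox' is hypothetical, and the code assumes that the data start in column 6 and that the taxonomic codes are plain integers, as in the examples above.

```python
def parse_ox(lines):
    """Return (qualifier, [taxonomic codes]) from the OX lines of an entry."""
    # An entry may contain several OX lines; join them before splitting.
    text = " ".join(line[5:].strip() for line in lines if line.startswith("OX"))
    if not text.endswith(";"):
        raise ValueError("the OX list must be terminated by a semicolon")
    qualifier, _, codes = text[:-1].partition("=")
    # int() tolerates the blank that follows each comma.
    return qualifier, [int(code) for code in codes.split(",")]


single = parse_ox(["OX   NCBI_TaxID=9606;"])
multiple = parse_ox([
    "OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,",
    "OX   9031, 8355, 7227, 7213, 7108, 7130;",
])
```

Since the number of taxonomic codes is identical to the number of species in the OS line(s), the length of the returned list can be compared with the number of species as a simple consistency check.
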
3.10. The reference (RN, RP, RC, RX, RA, RT, RL) lines

These lines comprise the literature citations within Swiss-Prot. The citations indicate the sources from which the data has been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RC, RX, RA, RT and RL. Within each such reference block, the RN line occurs once, the RC, RX and RT lines occur zero or more times, and the RP, RA and RL lines occur one or more times. If several references are given, there will be a reference block for each.

An example of a complete reference is:


RN   [1]
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RX   MEDLINE=91002678; PubMed=2207170;
RA   Chan Y.-L., Paz V., Olvera J., Wool I.G.;
RT   "The primary structure of rat ribosomal proteins: the amino acid
RT   sequences of L27a and L28 and corrections in the sequences of S4 and
RT   S12.";
RL   Biochim. Biophys. Acta 1050:69-73(1990).
The formats of the individual lines are explained below.
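
Before turning to the individual lines, the block structure described above can be sketched in Python. This is a minimal illustration, not Swiss-Prot code: the names 'split_reference_blocks', 'is_well_ordered' and 'has_required_lines' are invented for the example.

```python
REF_ORDER = ["RN", "RP", "RC", "RX", "RA", "RT", "RL"]


def split_reference_blocks(lines):
    """Group reference lines into blocks; every RN line opens a new block."""
    blocks = []
    for line in lines:
        code = line[:2]
        if code not in REF_ORDER:
            continue  # not a reference line
        if code == "RN":
            blocks.append([])
        if not blocks:
            raise ValueError("reference line found before the first RN line")
        blocks[-1].append(line)
    return blocks


def is_well_ordered(block):
    """True if the line codes follow the order RN, RP, RC, RX, RA, RT, RL."""
    ranks = [REF_ORDER.index(line[:2]) for line in block]
    return ranks == sorted(ranks)


def has_required_lines(block):
    """True if RN occurs exactly once and RP, RA and RL at least once."""
    codes = [line[:2] for line in block]
    return codes.count("RN") == 1 and all(c in codes for c in ("RP", "RA", "RL"))


reference = [
    "RN   [1]",
    "RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.",
    "RC   STRAIN=Sprague-Dawley; TISSUE=Liver;",
    "RX   MEDLINE=91002678; PubMed=2207170;",
    "RA   Chan Y.-L., Paz V., Olvera J., Wool I.G.;",
    'RT   "The primary structure of rat ribosomal proteins: the amino acid',
    "RT   sequences of L27a and L28 and corrections in the sequences of S4 and",
    'RT   S12.";',
    "RL   Biochim. Biophys. Acta 1050:69-73(1990).",
]
blocks = split_reference_blocks(reference)
```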

3.10.1. The RN line

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

RN   [N]
where 'N' denotes the nth reference for this entry. The reference number is always between square brackets.

3.10.2. The RP line

The RP (Reference Position) lines describe the extent of the work carried out by the authors of the reference cited. The format of the RP line is:

RP   COMMENT.
Typical examples of RP lines are shown below:


RP   SEQUENCE FROM N.A.

RP   SEQUENCE FROM N.A., AND SEQUENCE OF 21-35.

RP   SEQUENCE OF 39-76; 95-118 AND 125-138, AND DISULFIDE BONDS.

RP   REVISIONS TO 76-84 AND 129.

RP   STRUCTURE BY NMR.

RP   X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).

RP   CHARACTERIZATION.

RP   MUTAGENESIS OF TYR-65.

RP   REVIEW.

RP   VARIANT ALA-1368.

RP   VARIANTS TD ARG-597 AND ARG-1477, AND VARIANT FHA LEU-693 DEL.


RP   SEQUENCE FROM N.A., SEQUENCE OF 1-22; 2-17; 240-256; 318-339 AND
RP   381-390, AND CHARACTERIZATION.


RP   SEQUENCE FROM N.A., SEQUENCE OF 154-171; 302-308; 312-328; 377-384
RP   AND 419-431, FUNCTION, SUBCELLULAR LOCATION, AND MUTAGENESIS OF
RP   ARG-331; GLY-332 AND ARG-333.

3.10.3. The RC line

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is:

RC   TOKEN1=Text; TOKEN2=Text; ...
Where the currently defined tokens are:

PLASMID
SPECIES
STRAIN
TISSUE
TRANSPOSON

Examples of RC lines:


RC   STRAIN=Sprague-Dawley; TISSUE=Liver;

RC   STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;

RC   SPECIES=Rat; STRAIN=Wistar; TISSUE=Brain;

RC   SPECIES=A.thaliana; STRAIN=cv. Columbia;

RC   PLASMID=IncFII R100;
The 'SPECIES' token is only used when an entry describes a sequence that is identical in more than one species; similarly 'PLASMID' is only used if an entry describes a sequence identical in more than one plasmid.

The Swiss-Prot document tisslist.txt lists all the tissues that are used in the database in the context of the 'TISSUE' token.
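
The token syntax can be read with a few lines of Python. The sketch below is an assumption-laden illustration: 'parse_rc' and 'RC_TOKENS' are hypothetical names, and the code assumes that each token appears at most once per reference and that values contain no semicolons.

```python
RC_TOKENS = {"PLASMID", "SPECIES", "STRAIN", "TISSUE", "TRANSPOSON"}


def parse_rc(lines):
    """Return a dict mapping each RC token to its text."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RC"))
    comments = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # empty piece after the final semicolon
        token, _, value = item.partition("=")
        if token not in RC_TOKENS:
            raise ValueError("unknown RC token: %r" % token)
        comments[token] = value
    return comments


cattle = parse_rc(["RC   STRAIN=Holstein; TISSUE=Lymph node, and Mammary gland;"])
plant = parse_rc(["RC   SPECIES=A.thaliana; STRAIN=cv. Columbia;"])
```

Note that the value of a token may itself contain commas (as in the 'TISSUE' example), so only the semicolon is treated as a separator.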

3.10.4. The RX line

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

RX   Bibliographic_db=IDENTIFIER[; Bibliographic_db=IDENTIFIER...];
Where the valid bibliographic database names and their associated identifiers are:

 Name  Identifier
 MEDLINE  Eight-digit MEDLINE Unique Identifier (UI)
 PubMed  PubMed Unique Identifier (PMID)

Examples of RX lines:


RX   MEDLINE=97291283; PubMed=9145897;

RX   PubMed=11792842;
3.10.5. The RA line

The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank, followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:


RA   Galinier A., Bleicher F., Negre D., Perriere G., Duclos B.,
RA   Cozzone A.J., Cortay J.-C.;
As many RA lines as necessary are included in each reference.

An author's initials can be followed by an abbreviation such as 'Jr' (for Junior), 'Sr' (Senior), 'II', 'III' or 'IV' (2nd, 3rd and 4th). Example:


RA   Nasoff M.S., Baker H.V. II, Wolf R.E. Jr.;
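
Because author names are never split between lines, the RA lines of a reference can be joined and split on commas. The Python sketch below illustrates this; 'parse_ra' is a hypothetical name, and the code assumes that the data start in column 6.

```python
def parse_ra(lines):
    """Return the authors of a reference in the order given in the paper."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RA"))
    if not text.endswith(";"):
        raise ValueError("the author list must be terminated by a semicolon")
    # Suffixes such as 'Jr.' or 'II' stay attached to the author they follow,
    # because they are separated from the initials by a blank, not a comma.
    return [author.strip() for author in text[:-1].split(",") if author.strip()]


authors = parse_ra([
    "RA   Galinier A., Bleicher F., Negre D., Perriere G., Duclos B.,",
    "RA   Cozzone A.J., Cortay J.-C.;",
])
with_suffixes = parse_ra(["RA   Nasoff M.S., Baker H.V. II, Wolf R.E. Jr.;"])
```
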
3.10.6. The RT line

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set. The format of the RT line is:

RT   "Title";
Example of a set of RT lines:


RT   "New insulin-like proteins with atypical disulfide bond pattern
RT   characterized in Caenorhabditis elegans by comparative sequence
RT   analysis and homology modeling.";
It should be noted that the format of the title is not always identical to that displayed at the top of the published work.

3.10.7. The RL line

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range and the year. The format for such an RL line is:

RL   Journal_abbrev Volume:First_page-Last_page(YYYY).
Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the Swiss-Prot document file jourlist.txt.

An example of an RL line is:


RL   J. Mol. Biol. 168:321-331(1983).
When a reference is made to a paper which is 'in press' at the time the database is released, the page range, and possibly the volume number, are indicated as '0' (zero). An example of such an RL line is shown here:


RL   Arch. Biochem. Biophys. 411:0-0(2003).
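
The journal citation format can be matched with a regular expression. The following Python sketch is illustrative only: 'JOURNAL_RL' and 'parse_journal_rl' are invented names, and the pattern assumes single-line citations with purely numeric page numbers, as in the examples above.

```python
import re

# Journal abbreviation, volume, page range and four-digit year.
JOURNAL_RL = re.compile(
    r"^RL   (?P<journal>.+) (?P<volume>[^\s:]+):"
    r"(?P<first_page>\d+)-(?P<last_page>\d+)\((?P<year>\d{4})\)\.$"
)


def parse_journal_rl(line):
    """Return the parts of a journal citation, or None for other RL formats."""
    match = JOURNAL_RL.match(line)
    if match is None:
        return None
    citation = match.groupdict()
    # Papers 'in press' carry a page range of 0-0.
    citation["in_press"] = (
        citation["first_page"] == "0" and citation["last_page"] == "0"
    )
    return citation


published = parse_journal_rl("RL   J. Mol. Biol. 168:321-331(1983).")
in_press = parse_journal_rl("RL   Arch. Biochem. Biophys. 411:0-0(2003).")
```
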
b) Book citations

A variation of the RL line format is used for papers found in books or other types of publication, which are then cited using the following format:

RL   (In) Editor_1 I.[, Editor_2 I., Editor_X I.] (eds.);
RL   Book_name, pp.[Volume:]First_page-Last_page, Publisher, City (YYYY).
Examples:


RL   (In) Boyer P.D. (eds.);
RL   The enzymes (3rd ed.), pp.11:397-547, Academic Press, New York (1975).


RL   (In) Rich D.H., Gross E. (eds.);
RL   Proceedings of the 7th american peptide symposium, pp.69-72,
RL   Pierce Chemical Co., Rockford Il. (1981).


RL   (In) Magnusson S., Ottesen M., Foltmann B., Dano K.,
RL   Neurath H. (eds.);
RL   Regulatory proteolytic enzymes and their inhibitors, pp.163-172,
RL   Pergamon Press, New York (1978).
c) Plant Gene Register and Worm Breeder's Gazette citations

The '(In)' prefix used for books (see above) is also used for references to the electronic Plant Gene Register (http://www.tarweed.com/pgr/) as well as to the Worm Breeder's Gazette (http://elegans.swmed.edu/wli/). Examples:


RL   (In) Plant Gene Register PGR98-023.

RL   (In) Worm Breeder's Gazette 15(3):34(1998).
d) Unpublished results

RL lines for unpublished results follow the format shown in the next example:


RL   Unpublished results, cited by:
RL   Shelnutt J.A., Rousseau D.L., Dethmers J.K., Margoliash E.;
RL   Biochemistry 20:6485-6497(1981).
e) Unpublished observations

For unpublished observations the format of the RL line is:

RL   Unpublished observations (MMM-YYYY).
Where 'MMM' is the month and 'YYYY' is the year.

We use the 'unpublished observations' RL line to cite communications by scientists to Swiss-Prot of unpublished information concerning various aspects of a sequence entry.

f) Thesis

For Ph.D. theses the format of the RL line is:

RL   Thesis (Year), Institution_name, Country.
An example of such a line is given here:


RL   Thesis (1977), University of Geneva, Switzerland.
g) Patent applications

For patent applications the format of the RL line is:

RL   Patent number Pat_num, DD-MMM-YYYY.
Where 'Pat_num' is the international publication number of the patent, 'DD' is the day, 'MMM' is the month and 'YYYY' is the year. Example:


RL   Patent number WO9010703, 20-SEP-1990.
h) Submissions

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

RL   Submitted (MMM-YYYY) to the Database_name.
Where 'MMM' is the month, 'YYYY' is the year and 'Database_name' is one of the following:

EMBL/GenBank/DDBJ databases
Swiss-Prot data bank
HIV data bank
PDB data bank
PIR data bank
Two examples of submission RL lines are given here:


RL   Submitted (OCT-1995) to the EMBL/GenBank/DDBJ databases.

RL   Submitted (JUL-1998) to the Swiss-Prot data bank.
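
The RL formats a) to h) can be told apart by the way the first RL line of a reference begins. The Python sketch below shows one possible way to do this; the names 'classify_rl', 'RL_PREFIXES' and 'parse_submission' are hypothetical, and any RL line that matches none of the listed prefixes is simply assumed to be a journal citation.

```python
import re

# Order matters: the two '(In)' special cases must be tested before books.
RL_PREFIXES = [
    ("(In) Plant Gene Register", "plant gene register"),
    ("(In) Worm Breeder's Gazette", "worm breeder's gazette"),
    ("(In)", "book"),
    ("Unpublished results", "unpublished results"),
    ("Unpublished observations", "unpublished observations"),
    ("Thesis", "thesis"),
    ("Patent number", "patent"),
    ("Submitted", "submission"),
]

SUBMISSION = re.compile(
    r"^RL   Submitted \((?P<month>[A-Z]{3})-(?P<year>\d{4})\) "
    r"to the (?P<database>.+)\.$"
)


def classify_rl(first_line):
    """Return the kind of citation announced by the first RL line."""
    text = first_line[5:].strip()
    for prefix, kind in RL_PREFIXES:
        if text.startswith(prefix):
            return kind
    return "journal"  # assumption: everything else is a journal citation


def parse_submission(line):
    """Return month, year and database of a submission, or None."""
    match = SUBMISSION.match(line)
    return match.groupdict() if match else None


submission = parse_submission(
    "RL   Submitted (OCT-1995) to the EMBL/GenBank/DDBJ databases."
)
```
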
3.11. The CC line

The CC lines are free text comments on the entry, and are used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks; a block is made up of 1 or more comment lines. The first line of a block starts with the characters '-!-'.

The format of a comment block is:

CC   -!- TOPIC: First line of a comment block;
CC       second and subsequent lines of a comment block.
The comment blocks are arranged according to what we designate as 'topics'. The current topics and their definitions are listed in the table below.

Topic                 Description
ALTERNATIVE PRODUCTS  Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene or by the use of alternative initiation codons; see 3.11.1
BIOTECHNOLOGY         Description of the use of a specific protein in a biotechnological process
CATALYTIC ACTIVITY    Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION               Warning about possible errors and/or grounds for confusion
COFACTOR              Description of an enzyme cofactor
DATABASE              Description of a cross-reference to a network database/resource for a specific protein; see 3.11.2
DEVELOPMENTAL STAGE   Description of the developmentally-specific expression of a protein
DISEASE               Description of the disease(s) associated with a deficiency of a protein
DOMAIN                Description of the domain structure of a protein
ENZYME REGULATION     Description of an enzyme regulatory mechanism
FUNCTION              General description of the function(s) of a protein
INDUCTION             Description of the compound(s) which stimulate the synthesis of a protein
MASS SPECTROMETRY     Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods; see 3.11.3
MISCELLANEOUS         Any comment which does not belong to any of the other defined topics
PATHWAY               Description of the metabolic pathway(s) with which a protein is associated
PHARMACEUTICAL        Description of the use of a protein as a pharmaceutical drug
POLYMORPHISM          Description of polymorphism(s)
PTM                   Description of a posttranslational modification
SIMILARITY            Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION  Description of the subcellular location of the mature protein
SUBUNIT               Description of the quaternary structure of a protein
TISSUE SPECIFICITY    Description of the tissue specificity of a protein

Note:
[1] For the 'CATALYTIC ACTIVITY' topic: To describe the catalytic activity of an enzyme we have used, whenever possible, the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New York (1992).

Each Swiss-Prot entry contains a variable number of CC line topics. Most topics can be present more than once in a given entry. Topics that occur only once in an entry are: ALTERNATIVE PRODUCTS, COFACTOR, DEVELOPMENTAL STAGE, ENZYME REGULATION, INDUCTION, SUBCELLULAR LOCATION, SUBUNIT and TISSUE SPECIFICITY.
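
The comment block layout described above (a '-!-' marker on the first line of each block, followed by continuation lines) can be read as follows. This Python sketch is a hedged illustration: 'parse_cc_blocks' is a hypothetical name, and the code assumes that the topic name never contains a colon and that continuation lines can be joined with a single blank.

```python
def parse_cc_blocks(lines):
    """Return a list of (topic, text) pairs, one per comment block."""
    blocks = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        text = line[5:]
        if text.startswith("-!-"):
            # First line of a block: 'TOPIC: start of the comment'.
            topic, _, rest = text[3:].strip().partition(":")
            blocks.append((topic, [rest.strip()]))
        elif blocks:
            # Second and subsequent lines belong to the block opened last.
            blocks[-1][1].append(text.strip())
    return [(topic, " ".join(p for p in parts if p)) for topic, parts in blocks]


cc_lines = [
    "CC   -!- CAUTION: Ref.2 sequence differs from that shown in positions 92 to",
    "CC       165 due to frameshifts.",
    "CC   -!- SUBUNIT: Homotetramer.",
]
comments = parse_cc_blocks(cc_lines)
```

The result is kept as a list of pairs rather than a dictionary because most topics can be present more than once in a given entry.
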
3.11.1. Syntax of the topic 'ALTERNATIVE PRODUCTS'
The format of the CC line topic ALTERNATIVE PRODUCTS is:
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative promoter;
CC         Comment=Free text;
CC       Event=Alternative splicing; Named isoforms=n;
CC         Comment=Optional free text;
CC       Name=Isoform_1; Synonyms=Synonym_1[, Synonym_n];
CC         IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC         Sequence=Displayed;
CC         Note=Free text;
CC       Name=Isoform_n; Synonyms=Synonym_1[, Synonym_n];
CC         IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Free text;
CC       Event=Alternative initiation;
CC         Comment=Free text;
The qualifiers are described in the table below:

Qualifier Description
Event Biological process that results in the production of the alternative forms (Alternative promoter, Alternative splicing, Alternative initiation).
Format: Event=controlled vocabulary;
Example: Event=Alternative splicing;
Named isoforms Number of isoforms listed under 'Name'; currently only used with 'Event=Alternative splicing'.
Format: Named isoforms=number;
Example: Named isoforms=6;
Comment Any comments concerning one or more isoforms; optional for 'Alternative splicing'; in case of 'Alternative promoter' and 'Alternative initiation' there is always a 'Comment' of free text, which includes relevant information on the isoforms.
Format: Comment=free text;
Example: Comment=Experimental confirmation may be lacking for some isoforms;
Name A common name for an isoform used in the literature or assigned by Swiss-Prot; currently only available for spliced isoforms.
Format: Name=common name;
Example: Name=Alpha;
Synonyms Synonyms for an isoform as used in the literature; optional; currently only available for spliced isoforms.
Format: Synonyms=Synonym_1[, Synonym_n];
Example: Synonyms=B, KL5;
IsoId Unique identifier for an isoform, consisting of the Swiss-Prot accession number, followed by a dash and a number.
Format: IsoId=acc#-isoform_number[, acc#-isoform_number];
Example: IsoId=P05067-1;
Sequence Information on the isoform sequence; the term 'Displayed' indicates that the sequence is shown in the entry; a list of feature identifiers (VSP_#) indicates that the isoform is annotated in the feature table; the FTIds enable programs to create the sequence of a splice variant; if the accession number of the IsoId does not correspond to the accession number of the current entry, this topic contains the term 'External'; 'Not described' indicates that the sequence of the isoform is unknown.
Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described;
Example: Sequence=Displayed;
Example: Sequence=VSP_000013, VSP_000014;
Example: Sequence=External;
Example: Sequence=Not described;
Note Lists isoform-specific information; optional.
Format: Note=Free text;
Example: Note=No experimental confirmation available;

Example of the CC lines and the corresponding FT lines for an entry with alternative splicing:
...
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=6;
CC       Name=1;
CC         IsoId=Q15746-4; Sequence=Displayed;
CC       Name=2;
CC         IsoId=Q15746-5; Sequence=VSP_000040;
CC       Name=3A;
CC         IsoId=Q15746-6; Sequence=VSP_000041, VSP_000043;
CC       Name=3B;
CC         IsoId=Q15746-7; Sequence=VSP_000040, VSP_000041, VSP_000042;
CC       Name=4;
CC         IsoId=Q15746-8; Sequence=VSP_000041, VSP_000042;
CC       Name=del-1790;
CC         IsoId=Q15746-9; Sequence=VSP_000044;
...
FT   VARSPLIC    437    506       VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT                                RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (IN
FT                                ISOFORM 2 AND ISOFORM 3B).
FT                                /FTId=VSP_000040;
FT   VARSPLIC   1433   1439       DEVEVSD -> MKWRCQT (IN ISOFORM 3A,
FT                                ISOFORM 3B AND ISOFORM 4).
FT                                /FTId=VSP_000041;
FT   VARSPLIC   1473   1546       GKFGQVFRLVEKKTRKVWAGKFFKAYSAKEKENIRQEISIM
FT                                NCLHHPKLVQCVDAFEEKANIVMVLEIVSGGEL -> L
FT                                (IN ISOFORM 4).
FT                                /FTId=VSP_000042;
FT   VARSPLIC   1655   1705       MISSING (IN ISOFORM 3A AND ISOFORM 3B).
FT                                /FTId=VSP_000043;
FT   VARSPLIC   1790   1790       MISSING (IN ISOFORM DEL-1790).
FT                                /FTId=VSP_000044;
...
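
As noted in the description of the 'Sequence' qualifier, the FTIds enable programs to create the sequence of a splice variant. The Python sketch below shows one way this could be done. It is not the procedure used by Swiss-Prot: the function name 'apply_varsplic', the ten-residue sequence and the two feature identifiers are invented for the example, and the code assumes 1-based inclusive positions and non-overlapping features.

```python
def apply_varsplic(sequence, features):
    """Apply (start, end, replacement) edits to the displayed sequence.

    Positions are 1-based and inclusive; a 'MISSING' feature is represented
    by an empty replacement string.
    """
    result = sequence
    # Work from the C-terminal end backwards so that an edit never shifts
    # the coordinates of the features that still have to be applied.
    for start, end, replacement in sorted(features, reverse=True):
        result = result[:start - 1] + replacement + result[end:]
    return result


# Hypothetical displayed sequence and features, for illustration only.
displayed = "MKTAYIAKQR"
features_by_ftid = {
    "VSP_900001": (3, 5, "G"),  # TAY -> G
    "VSP_900002": (8, 9, ""),   # MISSING
}
# Identifiers as they would be listed in 'Sequence=' for one isoform.
isoform_ftids = ["VSP_900001", "VSP_900002"]
isoform = apply_varsplic(displayed, [features_by_ftid[f] for f in isoform_ftids])
```

An isoform whose 'Sequence' qualifier is 'Displayed' corresponds to an empty feature list, in which case the displayed sequence is returned unchanged.
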
3.11.2. Syntax of the topic 'DATABASE'
CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].
Where:
Note: this is currently the only part of the entry where lines longer than 75 characters can be found, because long URLs or FTP addresses are not reformatted into multiple lines.
3.11.3. Syntax of the topic 'MASS SPECTROMETRY'
CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][; RANGE=XX-XX].
Where:
3.11.4. Examples for each comment line topic

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=3;
CC         Comment=Additional isoforms seem to exist. Experimental
CC         confirmation may be lacking for some isoforms;
CC       Name=1; Synonyms=Aire-1;
CC         IsoId=O43918-1; Sequence=Displayed;
CC       Name=2; Synonyms=Aire-2;
CC         IsoId=O43918-2; Sequence=VSP_004089;
CC       Name=3; Synonyms=Aire-3;
CC         IsoId=O43918-3; Sequence=VSP_004089, VSP_004090;

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation;
CC         Comment=2 isoforms, Caveolin-2 alpha and Caveolin-2 beta, are
CC         produced by alternative initiation;


CC   -!- BIOTECHNOLOGY: The effect of PG can be neutralized by introducing
CC       an antisense PG gene by genetic manipulation. The Flavr Savr
CC       tomato produced by Calgene (Monsanto) in such a manner has a
CC       longer shelf life due to delayed ripening.

CC   -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC       liquefaction of starch-containing mashes and in the detergent
CC       industry to remove starch. Sold under the name Termamyl by
CC       Novozymes.


CC   -!- CATALYTIC ACTIVITY: ATP + L-glutamate + NH(3) = ADP + phosphate +
CC       L-glutamine.

CC   -!- CATALYTIC ACTIVITY: (R)-2,3-dihydroxy-3-methylbutanoate + NADP(+)
CC       = (S)-2-hydroxy-2-methyl-3-oxobutanoate + NADPH.


CC   -!- CAUTION: Ref.2 sequence differs from that shown in positions 92 to
CC       165 due to frameshifts.

CC   -!- CAUTION: It is uncertain whether Met-1 or Met-3 is the initiator.


CC   -!- COFACTOR: Pyridoxal phosphate.

CC   -!- COFACTOR: FAD and nonheme iron.


CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="http://www.expasy.org/cd40lbase/";
CC       FTP="ftp://ftp.expasy.org/databases/cd40lbase".

CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC       WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".


CC   -!- DEVELOPMENTAL STAGE: Expressed early during conidial (dormant
CC       spores) differentiation.

CC   -!- DEVELOPMENTAL STAGE: Detected in embryonic skin (E12.5 and E14.5)
CC       during the formation of hair follicles and at E15.5 in the enamel
CC       knot of the developing tooth. Detected in the basal layer of the
CC       epidermis and hair follicles of P2 mice.


CC   -!- DISEASE: Defects in PHKA1 are linked to X-linked muscle
CC       glycogenosis [MIM:311870]; a disease characterized by slowly
CC       progressive, predominantly distal muscle weakness and atrophy.

CC   -!- DISEASE: Defects in ABCD1 are the cause of recessive X-linked
CC       adrenoleukodystrophy (X-ALD) [MIM:300100], a rare peroxisomal
CC       metabolic disorder that occurs in boys and is characterized by
CC       progressive multifocal demyelination of the central nervous system
CC       and by adrenocortical insufficiency. It produces mental
CC       deterioration, corticospinal tract dysfunction, and cortical
CC       blindness. There is laboratory evidence of adrenal cortical
CC       dysfunction. Different clinical manifestations exist like:
CC       cerebral childhood ALD (CALD), adult cerebral ALD (ACALD),
CC       adrenomyeloneuropathy (AMN) and "Addison disease only" (ADO)
CC       phenotype.


CC   -!- DOMAIN: Contains a coiled-coil domain essential for vesicular
CC       transport and a dispensable C-terminal region.

CC   -!- DOMAIN: The B chain is composed of two domains, each domain
CC       consists of 3 homologous subdomains (alpha, beta, gamma).


CC   -!- ENZYME REGULATION: The activity of this enzyme is controlled
CC       by adenylation. The fully adenylated enzyme complex is inactive.

CC   -!- ENZYME REGULATION: Activated by Gram-negative bacterial
CC       lipopolysaccharides and chymotrypsin.


CC   -!- FUNCTION: Binds to actin and affects the structure of the
CC       cytoskeleton. At high concentrations, profilin prevents the
CC       polymerization of actin, whereas it enhances it at low
CC       concentrations. By binding to PIP2, it inhibits the formation of
CC       IP3 and DG.

CC   -!- FUNCTION: Inhibitor of fungal polygalacturonase. It is an
CC       important factor for plant resistance to phytopathogenic fungi.
CC       Substrate preference is polygalacturonase (PG) from A.niger >> PG
CC       of F.oxysporum, A.solani or B.cinerea. Not active on PG from
CC       F.moniliforme.


CC   -!- INDUCTION: By heat shock, salt stress, oxidative stress, glucose
CC       limitation and oxygen limitation.

CC   -!- INDUCTION: By infection, plant wounding, or elicitor treatment of
CC       cell cultures.


CC   -!- MASS SPECTROMETRY: MW=24948; MW_ERR=6; METHOD=MALDI.

CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=Electrospray; RANGE=40-119.


CC   -!- MISCELLANEOUS: Binds to bacitracin.

CC   -!- MISCELLANEOUS: Called DUO because the encoded protein is closely
CC       related to but shorter than TRIO.


CC   -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.

CC   -!- PATHWAY: Degradation of allantoin (purine catabolism); third step.


CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment of
CC       multiple sclerosis (MS). Betaseron is a slightly modified form
CC       of IFNB1 with two residue substitutions.

CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron). Used
CC       in patients with renal cell carcinoma or metastatic melanoma.


CC   -!- POLYMORPHISM: The allelic form of the enzyme with Gln-191
CC       (Allozyme A) hydrolyzes paraoxon with a low turnover number and
CC       the one with Arg-191 (Allozyme B) with a high turnover number.

CC   -!- POLYMORPHISM: The two main alleles of HP are called HP1F (fast)
CC       and HP1S (slow). The sequence shown here is that of the HP1S form.


CC   -!- PTM: N-glycosylated and probably also O-glycosylated.

CC   -!- PTM: A soluble short 95 kDa form may be released by proteolytic
CC       cleavage from the long membrane-anchored form.


CC   -!- SIMILARITY: Belongs to the annexin family.

CC   -!- SIMILARITY: Contains 13 EGF-like domains.


CC   -!- SUBCELLULAR LOCATION: Mitochondrial matrix.

CC   -!- SUBCELLULAR LOCATION: Integral membrane protein. Inner membrane.


CC   -!- SUBUNIT: Homotetramer.

CC   -!- SUBUNIT: Disulfide-linked heterodimer of a light chain (L) and a
CC       heavy chain (H). The light chain has the pharmacological activity,
CC       while the N- and C-terminal of the heavy chain mediate channel
CC       formation and toxin binding, respectively.


CC   -!- TISSUE SPECIFICITY: Shoots, roots, and cotyledon from dehydrating
CC       seedlings.

CC   -!- TISSUE SPECIFICITY: Expressed at high levels in brain and ovary.
CC       Lower levels in small intestine. In brain regions, detected in all
CC       regions tested. Highest levels in the cerebellum and cerebral
CC       cortex.
3.12. The DR line

3.12.1. Definition

The DR (Database cross-Reference) lines are used as pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot. The full list of all databases to which Swiss-Prot is cross-referenced can be found in the document file dbxref.txt.

For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Protein Data Bank (PDB) there will be one DR line pointing to each of the corresponding entries in PDB. For a sequence translated from a nucleotide sequence there will be DR line(s) pointing to the relevant entry(ies) in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated.

The format of the DR line is:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
Exceptions are cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database and PROSITE. The specific formats for these cross-references are described in sections 3.12.6 and 3.12.7.

3.12.2. Database identifier

The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name of the data collection to which reference is made. The currently defined database identifiers are listed below.

Identifier Database description Reference
EMBL Nucleotide sequence database of EMBL/EBI (see 3.12.6) Nucleic Acids Res. 30:21-26(2002); PMID: 11752244
Aarhus/Ghent-2DPAGE Human keratinocyte 2D gel protein database from Aarhus and Ghent universities FEBS Lett. 430:64-72(1998); PMID: 9678596
ANU-2DPAGE Australian National University 2-DE database Proteomics 1:1149-1161(2001); PMID: 11990509
COMPLUYEAST-2DPAGE 2-DE database at Universidad Complutense de Madrid J. Chromatogr. B. Biomed. Sci. Appl. (in press)
DictyDb Dictyostelium discoideum genome database (In) Maeda Y., Inouyea K. and Takeuchi I. (eds.); Dictyostelium. A Model System for Cell and Developmental Biology. pp.471-477, Universal Academic Press, Tokyo (1997)
ECO2DBASE Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE) Electrophoresis 20:2149-2159(1999); PMID: 9298644
EcoGene Escherichia coli K12 genome database (EcoGene) Nucleic Acids Res. 28:60-64(2000); PMID: 10592181
FlyBase Drosophila genome database (FlyBase) Nucleic Acids Res. 30:106-108(2002); PMID: 11752267
Genew Human gene nomenclature database (Genew) Nucleic Acids Res. 30:169-171(2002); PMID: 11752283
GlycoSuiteDB Database of glycan structures (GlycoSuiteDB) Nucleic Acids Res. 29:332-335(2001); PMID: 11125129
GO Gene Ontology (GO) database Genome Res. 12:1982-1991(2002); PMID: 12466303
Gramene Comparative mapping resource for grains (Gramene) Nucleic Acids Res. 30:103-105(2002); PMID: 11752266
HAMAP Database of microbial protein families (HAMAP) Comput. Biol. Chem. 27:49-58(2003)
HIV HIV sequence database Kuiken C.L. et al., In: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM.
HSC-2DPAGE Harefield hospital 2D gel protein databases (HSC-2DPAGE) Electrophoresis 18:471-479(1997); PMID: 9150926
HSSP Homology-derived secondary structure of proteins database (HSSP) Nucleic Acids Res. 27:244-247(1999); PMID: 9847191
InterPro Integrated resource of protein families, domains and functional sites (InterPro) Nucleic Acids Res. 31:315-318(2003); PMID: 12520011
Leproma Mycobacterium leprae genome database (Leproma) Lepr. Rev. 72:470-477(2001); PMID: 11826483
MaizeDB Maize genome database (MaizeDB) Polacco M., Chen S., Coe E., Hancock D.C., Kross H., Schroeder S. and Vargo C.; Maize Genetics Conference Abstracts 41 (1999)
Maize-2DPAGE Maize genome 2D Electrophoresis database (Maize-2DPAGE) Theor. Appl. Genet. 93:997-1005(1996)
MEROPS Peptidase database (MEROPS) Nucleic Acids Res. 30:343-346(2002); PMID: 11752332
MGD Mouse genome database (MGD) Nucleic Acids Res. 30:113-115(2002); PMID: 11752269
MIM Mendelian Inheritance in Man Database (MIM) Nucleic Acids Res. 30:52-55(2002); PMID: 11752252
MypuList Mycoplasma pulmonis genome database (MypuList)  
PDB 3D-macromolecular structure Protein Data Bank (PDB) Nucleic Acids Res. 30:245-248(2002); PMID: 11752306
Pfam Pfam protein domain database Nucleic Acids Res. 30:276-280(2002); PMID: 11752314
PHCI-2DPAGE Parasite host cell interaction 2D-PAGE database  
PhosSite Phosphorylation Site Database for prokaryotic proteins (In) Leslie M. (ed.); NetWatch. Science 294:1623-1623(2001) 
PIR Protein sequence database of the Protein Information Resource (PIR) Nucleic Acids Res. 30:35-37(2002); PMID: 11752247
PMMA-2DPAGE Purkyne Military Medical Academy 2D-PAGE database  
PRINTS Protein Fingerprint database (PRINTS) Nucleic Acids Res. 30:239-241(2002); PMID: 11752304
ProDom ProDom protein domain database Brief. Bioinform. 3:246-251(2002); PMID: 12230033
PROSITE PROSITE protein domain and family database (see 3.12.7) Nucleic Acids Res. 30:235-238(2002); PMID: 11752303
REBASE Restriction enzymes and methylases database (REBASE) Nucleic Acids Res. 29:268-269(2001); PMID: 11125108
SGD Saccharomyces Genome Database (SGD) Nucleic Acids Res. 30:69-72(2002); PMID: 11752257
Siena-2DPAGE 2D-PAGE database from the Department of Molecular Biology, University of Siena, Italy  
SMART Simple Modular Architecture Research Tool (SMART) Nucleic Acids Res. 30:242-244(2002); PMID: 11752305
StyGene Salmonella typhimurium LT2 genome database (StyGene)  
SubtiList Bacillus subtilis 168 genome database (SubtiList) Nucleic Acids Res. 30:62-65(2002); PMID: 11752255
SWISS-2DPAGE 2D-PAGE database from the Geneva University Hospital (SWISS-2DPAGE) Nucleic Acids Res. 28:286-288(2000); PMID: 10592248
TIGR The bacterial databases of 'The Institute of Genome Research' (TIGR) Nucleic Acids Res. 29:159-164(2001); PMID: 11125077
TIGRFAMs TIGR protein family database (TIGRFAMs) Nucleic Acids Res. 29:41-43(2001); PMID: 11125044
TRANSFAC Transcription factor database (TRANSFAC) Nucleic Acids Res. 29:281-283(2001); PMID: 11125113
TubercuList Mycobacterium tuberculosis H37Rv genome database (TubercuList) FEBS Lett. 452:7-10(1999); PMID: 10376668
WormPep Caenorhabditis elegans genome sequencing project protein database (WormPep) Genomics 46:200-216(1997); PMID: 9417907
ZFIN Zebrafish Information Network genome database (ZFIN) Nucleic Acids Res. 29:87-90(2001); PMID: 11125057


3.12.3. The primary identifier

The second item on the DR line, the 'PRIMARY_IDENTIFIER', is an unambiguous pointer to the information entry in the database to which the reference is being made.

3.12.4. The secondary identifier

The third item on the DR line, the 'SECONDARY_IDENTIFIER', is generally used to complement the information given by the primary identifier.

3.12.5. The tertiary identifier

A limited number of DR lines possess a fourth item, the 'TERTIARY_IDENTIFIER', which is generally used to give more detailed information.

Examples of complete DR lines are shown here:


DR   Aarhus/Ghent-2DPAGE; 8006; IEF.

DR   ANU-2DPAGE; Q9XEA8; -.

DR   COMPLUYEAST-2DPAGE; P41797; -.

DR   DictyDb; DD01047; myoA.

DR   ECO2DBASE; G052.0; 6TH EDITION.

DR   EcoGene; EG10054; araC.

DR   FlyBase; FBgn0000055; Adh.

DR   Genew; HGNC:12849; YWHAB.

DR   GlycoSuiteDB; P05067; -.

DR   GO; GO:0003677; F:DNA binding; TAS.

DR   Gramene; Q06967; -.

DR   HAMAP; MF_00120; -; 1.

DR   HIV; K02012; GAG$BH5.

DR   HSC-2DPAGE; P47985; HUMAN.

DR   HSSP; P00438; 1PBE.

DR   InterPro; IPR001254; Trypsin.

DR   Leproma; ML0485; -.

DR   MaizeDB; 25342; -.

DR   Maize-2DPAGE; P80607; COLEOPTILE.

DR   MEROPS; M41.001; -.

DR   MGD; MGI:87920; Adfp.

DR   MIM; 249900; -.

DR   MypuList; MYPU_4900; -.

DR   PDB; 3ADK; 16-APR-88.

DR   Pfam; PF00017; SH2; 1.

DR   PhosSite; P00955; -.

DR   PHCI-2DPAGE; Q9Z8U5; -.

DR   PIR; A00682; KIPGA.

DR   PMMA-2DPAGE; P04179; -.

DR   PRINTS; PR00237; GPCRRHODOPSN.

DR   ProDom; PD000511; Aconitase; 1.

DR   REBASE; 993; EcoRI.

DR   SGD; S0000170; AAR2.

DR   Siena-2DPAGE; P38008; -.

DR   SMART; SM00370; LRR; 6.

DR   StyGene; SG10312; proV.

DR   SubtiList; BG10774; oppD.

DR   SWISS-2DPAGE; P10599; HUMAN.

DR   TIGR; MJ0125; -.

DR   TIGRFAMs; TIGR00630; uvra; 1.

DR   TRANSFAC; T00141; -.

DR   TubercuList; Rv0001; -.

DR   WormPep; ZK637.7; CE00437.

DR   ZFIN; ZDB-GENE-980526-290; hoxb1b.
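
The general DR line layout described in sections 3.12.3 to 3.12.5 can be illustrated with a short parsing sketch. It is only an illustration, not part of the Swiss-Prot distribution: the function name parse_dr_line and the dictionary keys are hypothetical, and the sketch assumes that items are separated by "; " and that the line ends with a single period.

```python
def parse_dr_line(line):
    """Split a DR line into the database name and its identifiers.

    Assumption: items are separated by '; ' and the line is terminated
    by one period, as in the examples of this manual.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop only the terminating period
    fields = [field.strip() for field in body.split("; ")]
    # Hypothetical key names; the fourth item is present on few DR lines.
    names = ("database", "primary", "secondary", "tertiary")
    return dict(zip(names, fields))


for example in ("DR   Pfam; PF00017; SH2; 1.", "DR   MIM; 249900; -."):
    print(parse_dr_line(example))
```

Only the final period is removed, so identifiers that contain a period themselves (such as 'G052.0' in the ECO2DBASE example) are kept intact.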
3.12.6. Cross-references to the nucleotide sequence database

The specific format for cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database is:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.
where 'PROTEIN_ID' stands for the 'Protein Sequence Identifier'. It is a string which is stored, in nucleotide sequence entries, in a qualifier called '/protein_id' which is tagged to every CDS in the nucleotide database. Example from EMBL:

FT   CDS 302..2674
FT   /protein_id="CAA03857.1"
FT   /db_xref="SWISS-PROT:P26345"
FT   /gene="recA"
FT   /product="RecA protein"
The Protein Sequence Identifier (Protein_ID) consists of a stable ID portion (8 characters: 3 letters followed by 5 numbers) plus a period and a version number. The version number only changes when the protein sequence coded by the CDS changes, while the stable part remains unchanged. The Protein_ID effectively replaces what was previously known as the 'PID'.
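
As an illustration of the Protein_ID structure described above, the following sketch separates the stable part from the version number. The function name split_protein_id is hypothetical, and the regular expression simply encodes the "3 letters, 5 digits, period, version" rule stated in this section.

```python
import re

# Stable part: 3 letters followed by 5 digits; then a period and a version.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}[0-9]{5})\.([0-9]+)$")


def split_protein_id(protein_id):
    """Return (stable_id, version) for a Protein_ID such as 'CAA03857.1'."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError("not a Protein_ID: %r" % protein_id)
    return match.group(1), int(match.group(2))


print(split_protein_id("CAA03857.1"))
```

Two Protein_IDs with the same stable part but different version numbers then refer to the same CDS with a changed protein sequence.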

The 'STATUS_IDENTIFIER' provides information about the relationship between the sequence in the Swiss-Prot entry and the CDS in the corresponding EMBL entry:

a) In most cases the translation of the EMBL nucleotide sequence CDS results in the same sequence as shown in the corresponding Swiss-Prot entry or the differences are mentioned in the Swiss-Prot feature (FT) lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines. The status identifier is then a dash ('-').

Example:


DR   EMBL; Y00312; CAA68412.1; -.
b) In some cases the translation of the EMBL nucleotide sequence CDS results in a sequence different from the one shown in the corresponding Swiss-Prot entry. When the differences are not mentioned in the Swiss-Prot feature (FT) lines as CONFLICT, VARIANT or VARSPLIC (see Appendix A) and in the RP lines, or simply do not meet the criteria for such annotation, they are indicated as follows:

1. If the difference is due to a different start of the sequence (i.e. Swiss-Prot believes that the start of the sequence is upstream or downstream of the site annotated as the start of the sequence in the EMBL database), the status identifier shows the comment 'ALT_INIT'. Example:


DR   EMBL; L29151; AAA99430.1; ALT_INIT.
2. If the difference is due to a different termination of the sequence (i.e. Swiss-Prot believes that the termination of the sequence is upstream or downstream of the site annotated as the end of the sequence in the EMBL database), the status identifier shows the comment 'ALT_TERM'. Example:


DR   EMBL; L20562; AAA26884.1; ALT_TERM.
3. If the difference is due to frameshifts in the EMBL sequence, the status identifier shows the comment 'ALT_FRAME'. Example:

DR   EMBL; X56420; CAA39814.1; ALT_FRAME.
4. If the difference is not due to any of the cases mentioned above (e.g. wrong intron-exon boundaries given in the EMBL entry) or to a mixture of the cases mentioned above, the status identifier shows the comment 'ALT_SEQ'. Example:


DR   EMBL; M28482; AAA26378.1; ALT_SEQ.
c) In some cases the nucleotide sequence of a complete CDS is divided into exons present in different EMBL entries. We point to the exon-containing EMBL entries by citing the Protein_ID as secondary identifier and adding the comment 'JOINED' as the status identifier. These EMBL entries do not contain a CDS feature but contain exons joined to a CDS feature which is labeled with the given Protein_ID.

Example:


DR   EMBL; M63397; AAA51662.1; -.
DR   EMBL; M63395; AAA51662.1; JOINED.
DR   EMBL; M63396; AAA51662.1; JOINED.
In the above example the Swiss-Prot sequence is derived from the CDS labeled with the Protein_ID AAA51662. This CDS feature can be found in the EMBL entry M63397. Exons belonging to this CDS are not only found in EMBL entry M63397, but also in the EMBL entries M63395 and M63396.

d) In some cases there is no CDS feature key annotating a protein translation in an EMBL entry and thus no Protein_ID for the CDS. Therefore it is not possible for us to point to a Protein_ID as a secondary identifier. In these cases we point to the relevant EMBL entries by including a dash ('-') in the position of the missing Protein_ID and 'NOT_ANNOTATED_CDS' as the status identifier.

Example:


DR   EMBL; AJ243418; -; NOT_ANNOTATED_CDS.
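
The status identifiers of cases a) to d) can be summarized in a small lookup table. The sketch below is an illustration only: the name parse_embl_dr and the short descriptions in EMBL_STATUS are paraphrases introduced here, not official wording, and a missing Protein_ID is returned as None.

```python
# Short paraphrases of the status identifiers described in this section.
EMBL_STATUS = {
    "-": "translation agrees, or differences are annotated in the entry",
    "ALT_INIT": "different start of the sequence",
    "ALT_TERM": "different termination of the sequence",
    "ALT_FRAME": "frameshift(s) in the EMBL sequence",
    "ALT_SEQ": "other difference, or a mixture of the above",
    "JOINED": "exons joined to a CDS annotated in another entry",
    "NOT_ANNOTATED_CDS": "no CDS feature, hence no Protein_ID",
}


def parse_embl_dr(line):
    """Return (accession, protein_id or None, status) for an EMBL DR line."""
    fields = line[5:].rstrip().rstrip(".").split("; ")
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference: %r" % line)
    _, accession, protein_id, status = fields
    if status not in EMBL_STATUS:
        raise ValueError("unknown status identifier: %r" % status)
    # A dash in the Protein_ID position means that no Protein_ID exists.
    return accession, (None if protein_id == "-" else protein_id), status


accession, protein_id, status = parse_embl_dr("DR   EMBL; L29151; AAA99430.1; ALT_INIT.")
print(accession, protein_id, EMBL_STATUS[status])
```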
3.12.7. Cross-references to the PROSITE database

The specific format for cross-references to the PROSITE protein domain and family database is:

DR   PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE pattern or profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is one of the following:
n
FALSE_NEG
PARTIAL
UNKNOWN_n
Where 'n' is the number of hits of the pattern or profile in that particular protein sequence. The 'FALSE_NEG' status indicates that while the pattern or profile did not detect the protein sequence, it is a member of that particular family or domain. The 'PARTIAL' status indicates that the pattern or profile did not detect the sequence because the sequence is not complete and lacks the region on which the pattern/profile is based. Finally, the 'UNKNOWN_n' status indicates that it is uncertain whether the sequence is a member of the family or contains the domain described by the pattern/profile.

Examples of PROSITE cross-references:


DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 2.

DR   PROSITE; PS00028; ZINC_FINGER_C2H2_1; 4.

DR   PROSITE; PS00237; G_PROTEIN_RECEP_F1_1; FALSE_NEG.

DR   PROSITE; PS00603; TK_CELLULAR_TYPE; PARTIAL.

DR   PROSITE; PS50234; VWFA; UNKNOWN_1.
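
The four possible forms of the PROSITE 'STATUS' field can be told apart as sketched below. The function name parse_prosite_status and the label 'HIT' for a plain number are hypothetical and are used here only for illustration.

```python
def parse_prosite_status(status):
    """Return (kind, number_of_hits) for the STATUS item of a PROSITE DR line.

    'HIT' is a label invented for this sketch; it marks a plain number n.
    """
    if status in ("FALSE_NEG", "PARTIAL"):
        return status, None
    if status.startswith("UNKNOWN_"):
        return "UNKNOWN", int(status[len("UNKNOWN_"):])
    return "HIT", int(status)


for status in ("2", "FALSE_NEG", "PARTIAL", "UNKNOWN_1"):
    print(status, parse_prosite_status(status))
```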
3.13. The KW line

The KW (KeyWord) lines provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories. The keywords chosen for each entry serve as a subject reference for the sequence. The Swiss-Prot document keywlist.txt lists all the keywords that are used in the database. Often several KW lines are necessary for a single entry.

The format of the KW line is:

KW   Keyword[; Keyword...].
More than one keyword may be listed on each KW line; semicolons separate the keywords, and the last keyword is followed by a period. Keywords may consist of more than one word (they may contain blanks), but are never split between lines. An example of a KW line is:


KW   Oxidoreductase; NADP; Acetylation.
The order of the keywords is not significant. The above example could also have been written:

KW   Acetylation; NADP; Oxidoreductase.
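
Because the order of the keywords is not significant, programs comparing entries may prefer to treat them as a set. The following sketch collects the keywords of one or more KW lines; the function name parse_kw_lines is hypothetical, and the sketch assumes that keywords never contain a semicolon.

```python
def parse_kw_lines(lines):
    """Collect the keywords found on the KW lines of one entry."""
    keywords = []
    for line in lines:
        if not line.startswith("KW   "):
            continue
        # Semicolons separate keywords; the last keyword ends with a period.
        for item in line[5:].strip().rstrip(".").split(";"):
            item = item.strip()
            if item:
                keywords.append(item)
    return keywords


first = parse_kw_lines(["KW   Oxidoreductase; NADP; Acetylation."])
second = parse_kw_lines(["KW   Acetylation; NADP; Oxidoreductase."])
print(set(first) == set(second))
```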
3.14. The FT line

The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists posttranslational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included in the feature table.

The FT lines have a fixed format. The column numbers allocated to each of the data items within each FT line are shown in the following table (column numbers not referred to in the table are always occupied by blanks).

Columns   Data item
1-2       FT
6-13      Key name
15-20     'From' endpoint
22-27     'To' endpoint
35-75     Description


The key name and the endpoints are always on a single line, but the description may require one or more additional lines. In this event, the following line contains blanks in the columns 3-34, and the description continues from column 35 onwards as in the line above. Thus a blank key always denotes a continuation of the previous description.

An example of a feature table is shown below:


FT   NON_TER       1      1

FT   SIGNAL       <1     10       BY SIMILARITY.

FT   CHAIN        19     87       A-AGGLUTININ.

FT   PROPEP       22     43       REMOVED BY A DIPEPTIDYLPEPTIDASE.

FT   MOD_RES      41     41       AMIDATION (G-42 PROVIDE AMIDE GROUP)
FT                                (BY SIMILARITY).

FT   DISULFID    110    115

FT   CARBOHYD    251    251       N-LINKED (GLCNAC...).

FT   CONFLICT    327    327       E -> R (IN REF. 2).

FT   CONFLICT    378    509       MISSING (IN REF. 3).
The first item on each FT line is the key name, which is a fixed abbreviation (of up to 8 characters) with a defined meaning. A list of the currently defined key names can be found in Appendix A of this document.

Following the key name are the 'FROM' and 'TO' endpoint specifications. These fields designate (inclusively) the endpoints of the feature named in the key field. In general, these fields simply contain residue numbers which indicate positions in the sequence as listed. Note that these positions are always specified assuming a numbering of the listed sequence from 1 to n; this numbering is not necessarily the same as that used in the original reference(s). The following should be noted:

See also the notes concerning each of the key names in Appendix A.

The remaining portion of the FT line is a description that contains additional information about the feature. For example, for a posttranslationally modified residue (key MOD_RES) the chemical nature of the modification is given, while for a sequence variation (key VARIANT) the nature of the variation is indicated. This portion of the line is generally in free form, and may be continued on additional lines when necessary.
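
The fixed column layout of the FT line lends itself to simple slicing. The sketch below is an illustration under stated assumptions, not a reference parser: the name parse_ft_lines and the dictionary keys are hypothetical, endpoints are kept as text (they are not always plain numbers, as the '<1' in the example above shows), and continuation lines are joined to the previous description with a single blank.

```python
def parse_ft_lines(lines):
    """Parse FT lines using the column positions given in section 3.14."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        key = line[5:13].strip()            # columns 6-13
        description = line[34:75].strip()   # columns 35-75
        if key:
            features.append({
                "key": key,
                "from": line[14:20].strip(),  # columns 15-20
                "to": line[21:27].strip(),    # columns 22-27
                "description": description,
            })
        elif features and description:
            # A blank key denotes a continuation of the previous description.
            previous = features[-1]
            previous["description"] = (previous["description"] + " " + description).strip()
    return features


sample = [
    "FT   MOD_RES      41     41       AMIDATION (G-42 PROVIDE AMIDE GROUP)",
    "FT                                (BY SIMILARITY).",
    "FT   DISULFID    110    115",
]
for feature in parse_ft_lines(sample):
    print(feature)
```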

Feature identifiers
Some features are associated with a unique and stable feature identifier (FTId), which makes it possible to construct links directly from position-specific annotation in the feature table to specialized protein-related databases. The FTId is always the last component of a feature in the description field. The format of a feature with a feature identifier is:
FT   KEYNAME     x      x         Description.
FT                                /FTId=XXX_number.
where XXX is the 3-letter code for the specific feature key, separated by an underscore from a 6-digit number. Feature identifiers currently exist for the feature keys CARBOHYD, VARIANT and VARSPLIC. The format of the corresponding FTIds is shown in the following table:

Key name   Format of the FTId   Availability
CARBOHYD   CAR_number           Currently only for residues attached to an oligosaccharide structure annotated in the GlycoSuiteDB database
VARIANT    VAR_number           Currently only for human protein sequences
VARSPLIC   VSP_number           Any sequence with a VARSPLIC feature

Examples of features with FTIds are given below:

FT   CARBOHYD    251    251       N-LINKED (GLCNAC...).
FT                                /FTId=CAR_000070.

FT   VARIANT     214    214       V -> I.
FT                                /FTId=VAR_009122.

FT   VARSPLIC     33     83       TPDINPAWYTGRGIRPVGRFGRRRATPRDVTGLGQLSCLPL
FT                                DGRTKFSQRG -> SECLTYGKQPLTSFHPFTSQMPP (in
FT                                isoform 2).
FT                                /FTId=VSP_004370.
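
A feature identifier can be recognized in a description with a regular expression, as in the sketch below. The names FTID_RE, FTID_KEYS and extract_ftid are hypothetical; the pattern encodes only the three prefixes and the 6-digit number mentioned in this section, and it expects a description whose continuation lines have already been joined.

```python
import re

# 3-letter code, underscore, 6-digit number, terminating period.
FTID_RE = re.compile(r"/FTId=((CAR|VAR|VSP)_[0-9]{6})\.")
FTID_KEYS = {"CAR": "CARBOHYD", "VAR": "VARIANT", "VSP": "VARSPLIC"}


def extract_ftid(description):
    """Return (key_name, ftid) if the description carries an FTId, else None."""
    match = FTID_RE.search(description)
    if match is None:
        return None
    return FTID_KEYS[match.group(2)], match.group(1)


print(extract_ftid("V -> I. /FTId=VAR_009122."))
```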

3.15. The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content.

The format of the SQ line is:

SQ   SEQUENCE XXXX AA; XXXXX MW; XXXXXXXXXXXXXXXX CRC64;
The line contains the length of the sequence in amino acids ('AA') followed by the molecular weight ('MW') rounded to the nearest mass unit (Dalton) and the sequence 64-bit CRC (Cyclic Redundancy Check) value ('CRC64').

The algorithm to compute the CRC64 is described in the ISO 3309 standard. The generator polynomial is x^64 + x^4 + x^3 + x + 1.
Reference: Press W.H., Flannery B.P., Teukolsky S.A. and Vetterling W.T. "Numerical recipes in C", 2nd ed., pp896-902, Cambridge University Press (1993). (See http://www.ulib.org/webRoot/Books/Numerical_Recipes/bookcpdf/c20-3.pdf)
It should be noted that while, in theory, two different sequences could have the same CRC64 value, the likelihood that this would happen is extremely low.

An example of an SQ line is shown here:


SQ   SEQUENCE   486 AA;  55638 MW;  D7862E867AD74383 CRC64;
The information in the SQ line can be used as a check on accuracy or for statistical purposes. The word 'SEQUENCE' is present solely for readability.
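
The three values on the SQ line can be extracted as sketched below, for instance to check the stated length against the sequence data that follows. The names SQ_RE and parse_sq_line and the dictionary keys are hypothetical; the pattern assumes one or more blanks before each value, as in the example above.

```python
import re

SQ_RE = re.compile(
    r"^SQ   SEQUENCE\s+([0-9]+) AA;\s+([0-9]+) MW;\s+([0-9A-F]{16}) CRC64;$"
)


def parse_sq_line(line):
    """Return the length, molecular weight and CRC64 given on an SQ line."""
    match = SQ_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    return {
        "length": int(match.group(1)),
        "mol_weight": int(match.group(2)),
        "crc64": match.group(3),
    }


print(parse_sq_line("SQ   SEQUENCE   486 AA;  55638 MW;  D7862E867AD74383 CRC64;"))
```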

3.16. The sequence data line

The sequence data line has a line code consisting of two blanks rather than the two-letter codes used until now. The sequence counts 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line.

The characters used for the amino acids are the standard IUPAC one letter codes (see Appendix B).

An example of sequence data lines is shown here:


     MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV
     NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI
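
The layout of the sequence data lines (60 residues per line, in groups of 10, starting in position 6) can be reproduced and reversed with a few lines of code. The function names read_sequence and format_sequence are hypothetical and the sketch is given for illustration only.

```python
def read_sequence(lines):
    """Join sequence data lines (line code: two blanks) into one string."""
    return "".join("".join(line.split()) for line in lines if line.startswith("  "))


def format_sequence(sequence):
    """Lay out a sequence as 60 residues per line in groups of 10."""
    lines = []
    for start in range(0, len(sequence), 60):
        chunk = sequence[start:start + 60]
        groups = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        # Five leading blanks, so that the data begin in position 6.
        lines.append("     " + " ".join(groups))
    return lines


sequence = read_sequence([
    "     MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV",
    "     NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI",
])
print(len(sequence))
for line in format_sequence(sequence):
    print(line)
```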
3.17. The // line

The // (terminator) line contains no data or comments and designates the end of an entry.

Appendix A: Feature table keys

The definition of each of the key names used in the feature table is explained here. It is likely that new key names will be added progressively to this list. For each key a number of examples are presented.

A.1. Change indicators
CONFLICT - Different sources report differing sequences.

Examples of CONFLICT key feature lines:


FT   CONFLICT    304    304       MISSING (IN REF. 3).

FT   CONFLICT     62     62       D -> N (IN REF. 2 AND 3).

FT   CONFLICT      3      6       STGD -> RRGN (IN REF. 3).
VARIANT - Authors report that sequence variants exist.

Examples of VARIANT key feature lines:


FT   VARIANT     214    214       V -> I.
FT                                /FTId=VAR_009122.

FT   VARIANT     240    240       D -> E (in strains Z3915 and Z3524).

FT   VARIANT       1      1       Missing (in 20% of the chains).
Explicit links are present in human protein sequence entries to the Single Nucleotide Polymorphism database (dbSNP) (Nucleic Acids Res. 29:308-311(2001); PMID: 11125122). The format of such links is:
FT   VARIANT    from     to       Description (in dbSNP:accession_number).
FT                                /FTId=VAR_number.
Example of a feature with a link to dbSNP:
FT   VARIANT      65     65       T -> I (in dbSNP:1065419).
FT                                /FTId=VAR_012009.
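
The description of a VARIANT feature can be taken apart as sketched below. The names used (SUBSTITUTION_RE, DBSNP_RE, parse_variant_description and the dictionary keys) are hypothetical, and the patterns cover only the forms shown in the examples of this section: an 'X -> Y' substitution and an optional '(in dbSNP:accession_number)' link.

```python
import re

SUBSTITUTION_RE = re.compile(r"^([A-Z]+) -> ([A-Z]+)")
DBSNP_RE = re.compile(r"\(in dbSNP:([0-9]+)\)")


def parse_variant_description(description):
    """Extract the substitution and the dbSNP accession, where present."""
    substitution = SUBSTITUTION_RE.match(description)
    dbsnp = DBSNP_RE.search(description)
    return {
        "original": substitution.group(1) if substitution else None,
        "variant": substitution.group(2) if substitution else None,
        "dbsnp": dbsnp.group(1) if dbsnp else None,
    }


print(parse_variant_description("T -> I (in dbSNP:1065419)."))
```

Descriptions such as 'Missing (in 20% of the chains).' do not match the substitution pattern, so both residue fields are returned as None.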
VARSPLIC - Description of sequence variants produced by alternative splicing.

Examples of VARSPLIC key feature lines:


FT   VARSPLIC    653    672       VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
FT                                TRL (IN ISOFORM 2).
FT   VARSPLIC    673    913       MISSING (IN ISOFORM 2).
MUTAGEN - Site which has been experimentally altered.

Examples of MUTAGEN key feature lines:


FT   MUTAGEN      65     65       H->F: 100% ACTIVITY LOSS.

FT   MUTAGEN     119    119       C->R,E,A: LOSS OF CADPR HYDROLASE
FT                                AND ADP-RIBOSYL CYCLASE ACTIVITY.

A.2. Amino-acid modifications
MOD_RES - Posttranslational modification of a residue.

The chemical nature of the modification is given in the description. The general format of the MOD_RES description field is:

FT   MOD_RES     xxx    xxx       MODIFICATION (COMMENT).
The most frequent modifications are listed below.

Modification Description
ACETYLATION N-terminal or other
AMIDATION Generally at the C-terminal of a mature active peptide
BLOCKED Undetermined N- or C-terminal blocking group
FORMYLATION Of the N-terminal methionine
GAMMA-CARBOXYGLUTAMIC ACID Of glutamate
HYDROXYLATION Of asparagine, aspartate, proline or lysine
METHYLATION Generally of lysine or arginine
PHOSPHORYLATION Of serine, threonine, tyrosine, aspartate or histidine
PYRROLIDONE CARBOXYLIC ACID N-terminal glutamine which has formed an internal cyclic lactam. This is also called 'pyro-Glu'
SULFATION Generally of tyrosine


Examples of MOD_RES key feature lines:


FT   MOD_RES       1      1       ACETYLATION.

FT   MOD_RES     686    686       PHOSPHORYLATION (BY PKC).

FT   MOD_RES     367    367       SULFATION (BY SIMILARITY).

FT   MOD_RES      58     58       AMIDATION (G-59 PROVIDE AMIDE GROUP).

FT   MOD_RES      20     20       METHYLATION (MONO- AND DI-).
CARBOHYD - Glycosylation site.

This key describes the occurrence of the attachment of a glycan (mono- or polysaccharide) to a residue of the protein:

Examples of CARBOHYD key feature lines:


FT   CARBOHYD     53     53       N-LINKED (GLCNAC...) (POTENTIAL).

FT   CARBOHYD     18     18       O-LINKED (GLCNAC).

FT   CARBOHYD    194    194       O-LINKED (GLC...) (BY SIMILARITY).

FT   CARBOHYD     29     29       C-LINKED (MAN).
LIPID - Covalent binding of a lipid moiety

The chemical nature of the bound lipid moiety is given in the description. The general format of the LIPID description field is:

FT   LIPID       xxx    xxx       NAME OF THE ATTACHED GROUP (COMMENT).
The attached groups that are currently defined are listed below.

Attached group Description
MYRISTATE Myristate group attached through an amide bond to the N-terminal glycine residue of the mature form of a protein [1,2] or to an internal lysine residue
PALMITATE Palmitate group attached through a thioether bond to a cysteine residue or through an ester bond to a serine or threonine residue [1,2]
FARNESYL Farnesyl group attached through a thioether bond to a cysteine residue [3,4]
GERANYL-GERANYL Geranyl-geranyl group attached through a thioether bond to a cysteine residue [3,4]
GPI-ANCHOR Glycosyl-phosphatidylinositol (GPI) group linked to the alpha-carboxyl group of the C-terminal residue of the mature form of a protein [5,6]
N-ACYL DIGLYCERIDE N-terminal cysteine of the mature form of a prokaryotic lipoprotein with an amide-linked fatty acid and a glyceryl group to which two fatty acids are linked by ester linkages [7]
S-ARCHAEOL N-terminal cysteine of the mature form of an archaeal lipoprotein with an archaeol (2,3-di-O-phytanyl-sn-glycerol) lipid group linked by an ester linkage [8]
N-OCTANOATE n-octanoate group linked through an ester bond to a serine residue

References:

[1] Grand R.J.A.
Biochem. J. 258:626-638(1989).
[2] McIlhinney R.A.J.
Trends Biochem. Sci. 15:387-391(1990).
[3] Glomset J.A., Gelb M.H., Farnsworth C.C.
Trends Biochem. Sci. 15:139-142(1990).
[4] Sinensky M., Lutz R.J.
BioEssays 14:25-31(1992).
[5] Low M.G.
FASEB J. 3:1600-1608(1989).
[6] Low M.G.
Biochim. Biophys. Acta 988:427-454(1989).
[7] Hayashi S., Wu H.C.
J. Bioenerg. Biomembr. 22:451-471(1990).
[8] Nishihara M., Utagawa M., Akutsu H., Koga Y.
J. Biol. Chem. 267:12432-12435(1992).

Examples of LIPID key feature lines:


FT   LIPID         2      2       MYRISTATE.

FT   LIPID         5      5       PALMITATE.

FT   LIPID       231    231       GPI-ANCHOR.
DISULFID - Disulfide bond.

The 'FROM' and 'TO' endpoints designate the two residues which are linked by an intra-chain disulfide bond. If the 'FROM' and 'TO' endpoints are identical, the disulfide bond is an interchain one and the description field indicates the nature of the cross-link. Examples of DISULFID key feature lines:


FT   DISULFID     23     84       PROBABLE.

FT   DISULFID     29     29       INTERCHAIN (WITH C-8 OF SMALL CHAIN).
THIOLEST - Thiolester bond.

The 'FROM' and 'TO' endpoints designate the two residues which are linked by a thiolester bond.

THIOETH - Thioether bond.

The 'FROM' and 'TO' endpoints designate the two residues which are linked by a thioether bond.

SE_CYS - Selenocysteine

This key describes the occurrence of a selenocysteine in the sequence record. Examples:


FT   SE_CYS       52     52

FT   SE_CYS       16     16       POTENTIAL.
METAL - Binding site for a metal ion.

The description field indicates the nature of the metal. Examples of METAL key feature lines:


FT   METAL        52     52       IRON (HEME AXIAL LIGAND).

FT   METAL        28     28       COPPER (POTENTIAL).
BINDING - Binding site for any chemical group (co-enzyme, prosthetic group, etc.).

The chemical nature of the group is given in the description field. Examples of BINDING key feature lines:


FT   BINDING      14     14       HEME (COVALENT).

FT   BINDING     250    250       PYRIDOXAL PHOSPHATE.

A.3. Regions
SIGNAL - Extent of a signal sequence (prepeptide).

TRANSIT - Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle or microbody).

Examples of TRANSIT key feature lines:


FT   TRANSIT       1     42       CHLOROPLAST.

FT   TRANSIT       1     37       CYANELLE (BY SIMILARITY).

FT   TRANSIT       1     25       MITOCHONDRION.

FT   TRANSIT       1     34       MICROBODY (POTENTIAL).

FT   TRANSIT       ?     77       THYLAKOID (BY SIMILARITY).
PROPEP - Extent of a propeptide.

Examples of PROPEP key feature lines:


FT   PROPEP       27     28       ACTIVATION PEPTIDE.

FT   PROPEP      550    574       REMOVED IN MATURE FORM.
CHAIN - Extent of a polypeptide chain in the mature protein.

Examples of CHAIN key feature lines:


FT   CHAIN        21    119       BETA-2-MICROGLOBULIN.

FT   CHAIN        41    180       FACTOR X LIGHT CHAIN.
PEPTIDE - Extent of a released active peptide.

Examples of PEPTIDE key feature lines:


FT   PEPTIDE      13    107       NEUROPHYSIN 2.

FT   PEPTIDE     235    239       MET-ENKEPHALIN.
DOMAIN - Extent of a domain of interest in the sequence.

The nature of the domain is given in the description field. Examples of DOMAIN key feature lines:


FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).

FT   DOMAIN      745    939       RAS-GAP.

FT   DOMAIN      128    195       COILED COIL (POTENTIAL).
CA_BIND - Extent of a calcium-binding region.

Example:

FT   CA_BIND     759    770       EF-HAND 1 (POTENTIAL).
DNA_BIND - Extent of a DNA-binding region.

The nature of the DNA-binding region is given in the description field. Examples of DNA_BIND key feature lines:


FT   DNA_BIND    335    415       ETS-DOMAIN.

FT   DNA_BIND     69    128       HOMEOBOX.

FT   DNA_BIND     16     67       MYB 2.

FT   DNA_BIND    135    200       TEA-DOMAIN.
NP_BIND - Extent of a nucleotide phosphate-binding region.

The nature of the nucleotide phosphate is indicated in the description field. Examples of NP_BIND key feature lines:


FT   NP_BIND     182    189       ATP.

FT   NP_BIND      38     45       GTP (POTENTIAL).

FT   NP_BIND       9     24       FAD (ADP PART).
TRANSMEM - Extent of a transmembrane region.

ZN_FING - Extent of a zinc finger region.

The zinc finger 'category' is indicated in the description field. Examples of ZN_FING key feature lines:


FT   ZN_FING     319    343       GATA-TYPE.

FT   ZN_FING     559    579       C4-TYPE.
REPEAT - Extent of an internal sequence repetition.

Examples of REPEAT key feature lines:


FT   REPEAT      225    307       1.
FT   REPEAT      341    423       2.
FT   REPEAT      455    537       3 (APPROXIMATE).

A.4. Secondary structure
The feature table of sequence entries of proteins whose tertiary structure is known experimentally contains the secondary structure information corresponding to that protein. The secondary structure assignment is made according to DSSP (see Kabsch W., Sander C.; Biopolymers, 22:2577-2637(1983)) and the information is extracted from the coordinate data sets of the Protein Data Bank (PDB).

In the feature table only three types of secondary structure are specified: helices (key HELIX), beta-strands (key STRAND) and turns (key TURN). Residues not specified in one of these classes are in a 'loop' or 'random-coil' structure. Because the DSSP assignment has more than the three common secondary structure classes, we have converted the different DSSP assignments to HELIX, STRAND and TURN as shown in the table below.

DSSP code DSSP definition Swiss-Prot assignment
H  Alpha-helix  HELIX
G  3(10) helix  HELIX
I  Pi-helix  HELIX
E  Hydrogen-bonded beta-strand (extended strand)  STRAND
B  Residue in an isolated beta-bridge  STRAND
T  H-bonded turn (3-turn, 4-turn or 5-turn)  TURN
S  Bend (five-residue bend centered at residue i)  Not specified
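
The conversion table above can be written as a simple mapping. The sketch below also shows one possible way of collapsing a per-residue string of DSSP codes into feature segments. The names DSSP_TO_SWISSPROT and dssp_to_features are hypothetical, and merging adjacent residues that map to the same key (for example H followed by G) into one segment is an assumption made here, not a rule stated in this manual.

```python
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
    # 'S' (bend) and all other codes are not specified in the feature table.
}


def dssp_to_features(dssp_codes, start=1):
    """Collapse per-residue DSSP codes into (key, from, to) segments."""
    features = []
    current, segment_start = None, None
    for offset, code in enumerate(dssp_codes):
        key = DSSP_TO_SWISSPROT.get(code)  # None for unspecified codes
        position = start + offset
        if key != current:
            if current is not None:
                features.append((current, segment_start, position - 1))
            current, segment_start = key, position
    if current is not None:
        features.append((current, segment_start, start + len(dssp_codes) - 1))
    return features


print(dssp_to_features("-HHHHGG-TT-EEB"))
```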


One should be aware of the following facts:

a) Segment length. For helices (alpha and 3-10), the residue just before and just after the helix as given by DSSP participates in the helical hydrogen-bonding pattern with a single H-bond. For practical purposes, one can extend the HELIX range by one residue on each side, e.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of secondary structure segments are less well defined for lower-resolution structures. A fluctuation of one residue is common.
b) Missing segments. In low-resolution structures, badly formed helices or strands may be omitted in the DSSP definition.
c) Special helices and strands. Helices of length three are 3-10 helices, those of length four and longer are either alpha-helices or 3-10 helices (pi helices are extremely rare). A strand of one residue corresponds to a residue in an isolated beta-bridge. Such bridges can be structurally important.
d) Missing secondary structure. No secondary structure is currently given in the feature table in the following cases:

  • No sequence data in the PDB entry;
  • Structure for which only C-alpha coordinates are in PDB;
  • NMR structure with more than one coordinate data set;
  • Model (i.e. theoretical) structure.


Examples:


FT   HELIX         4     14
FT   HELIX        21     35
FT   TURN         36     36
FT   HELIX        48     61
FT   TURN         62     63
FT   HELIX        66     83
FT   TURN         84     86

A.5. Others
ACT_SITE - Amino acid(s) involved in the activity of an enzyme.

Examples of ACT_SITE key feature lines:


FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.

FT   ACT_SITE   1083   1083       CHARGE RELAY SYSTEM.
SITE - Any other interesting site on the sequence.

Examples of SITE key feature lines:


FT   SITE        285    288       PREVENT SECRETION FROM ER.

FT   SITE        759    760       CLEAVAGE (BY THROMBIN).
INIT_MET - Initiator methionine.

This feature key is usually associated with a zero value in the 'FROM' and 'TO' fields to indicate that the initiator methionine has been cleaved off and is not shown in the sequence:


FT   INIT_MET      0      0
It is not used when the initiator methionine is not cleaved off except in the event of internal alternative initiation sites. Example:


FT   INIT_MET     44     44       FOR CYTOPLASMIC ISOFORM.
NON_TER - The residue at an extremity of the sequence is not the terminal residue.

If applied to position 1, this signifies that the first position is not the N-terminus of the complete molecule. If applied to the last position, it means that this position is not the C-terminus of the complete molecule. There is no description field for this key. Examples of NON_TER key feature lines:


FT   NON_TER       1      1

FT   NON_TER     129    129
NON_CONS - Non-consecutive residues.

Indicates that two residues in a sequence are not consecutive and that there are a number of unsequenced residues between them. Example of a NON_CONS key feature line:


FT   NON_CONS   1683   1684
UNSURE - Uncertainties in the sequence

Used to describe region(s) of a sequence for which the authors are unsure about the sequence assignment.


FT   UNSURE       12     12       OR Y.
Appendix B: Amino-acid codes

The one-letter and three-letter codes for amino acids used in Swiss-Prot are those adopted by the commission on Biochemical Nomenclature of the IUPAC-IUB (see the reference listed below).

One-letter code Three-letter code Amino-acid name
A Ala   Alanine
R Arg   Arginine
N Asn   Asparagine
D Asp   Aspartic acid
C Cys   Cysteine
Q Gln   Glutamine
E Glu   Glutamic acid
G Gly   Glycine
H His   Histidine
I Ile   Isoleucine
L Leu   Leucine
K Lys   Lysine
M Met   Methionine
F Phe   Phenylalanine
P Pro   Proline
S Ser   Serine
T Thr   Threonine
W Trp   Tryptophan
Y Tyr   Tyrosine
V Val   Valine
B Asx   Aspartic acid or Asparagine
Z Glx   Glutamic acid or Glutamine
X Xaa   Any amino acid


Reference

IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
Nomenclature and Symbolism for Amino Acids and Peptides. Recommendations 1983.
Eur. J. Biochem. 138:9-37(1984).

See also: http://www.chem.qmw.ac.uk/iupac/AminoAcid/

Appendix C: Format differences between the Swiss-Prot and EMBL databases

C.1. Generalities

The format of Swiss-Prot follows as closely as possible that of the EMBL database. The general structure of an entry is identical in both databases. The data classes are the same, except that Swiss-Prot does not make use of the 'BACKBONE', 'UNREVIEWED' and 'UNANNOTATED' data classes. One line type used in Swiss-Prot does not exist in the EMBL database (see section C.3); conversely Swiss-Prot does not currently make use of every EMBL line type (see section C.4).

C.2. Differences in line types present in both databases

C.2.1 The ID line (IDentification)

Differences with the EMBL database ID line format are:

C.2.2 The AC line (ACcession number)

The format of this line type is identical to that defined by the EMBL database. Swiss-Prot accession numbers do not overlap those used in the EMBL/GenBank/DDBJ nucleotide sequence database. However, it should be noted that there are differences in the format of the accession numbers themselves. In Swiss-Prot, accession numbers consist of 6 alphanumerical characters in the following format:

 1        2      3           4           5           6
 [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [0-9]


Examples: P01234; Q1AA12.

In EMBL, two different types of accession numbers co-exist:

1. Accession numbers with 6 alphanumerical characters, where the first character is any letter with the exception of O, P or Q and the five other characters are numbers (example: M23765);
2. Accession numbers with 8 alphanumerical characters, where the first two characters are letters and the following six characters are numbers (example: AB001084).
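
The accession number formats described in this section translate directly into regular expressions. The sketch below is an illustration only; the names SWISSPROT_AC, EMBL_AC_6, EMBL_AC_8 and classify_accession are hypothetical, and the patterns encode nothing beyond the rules given above.

```python
import re

# Swiss-Prot: [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
SWISSPROT_AC = re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")
# EMBL: one letter other than O, P or Q followed by five digits ...
EMBL_AC_6 = re.compile(r"^[A-NR-Z][0-9]{5}$")
# ... or two letters followed by six digits.
EMBL_AC_8 = re.compile(r"^[A-Z]{2}[0-9]{6}$")


def classify_accession(accession):
    """Return 'Swiss-Prot', 'EMBL' or None for an accession number."""
    if SWISSPROT_AC.match(accession):
        return "Swiss-Prot"
    if EMBL_AC_6.match(accession) or EMBL_AC_8.match(accession):
        return "EMBL"
    return None


for accession in ("P01234", "Q1AA12", "M23765", "AB001084"):
    print(accession, classify_accession(accession))
```

Since the first character of a Swiss-Prot accession number is always O, P or Q, the two 6-character patterns can never match the same string.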


C.2.3 The DT line (DaTe)

Differences with the EMBL database DT line format are:

DT lines in a Swiss-Prot entry:

DT   21-JUL-1986 (Rel. 01, Created)
DT   23-OCT-1986 (Rel. 02, Last sequence update)
DT   01-APR-1990 (Rel. 14, Last annotation update)
DT lines in an EMBL database entry:

DT   10-MAR-1990 (Rel. 22, Created)
DT   12-APR-1990 (Rel. 23, Last updated, Version 3)
C.2.4 The DE line (DEscription)

C.2.5 The OS line (Organism Species)

C.2.6 The OG line (OrGanelle)

C.2.7 The RP and RC lines

C.2.8 The RT line (Reference Title)

In EMBL the reference title is not terminated by a period, a question mark or an exclamation mark.

C.2.9 The FT line (Feature Table)

The format of this line is totally different from that currently defined for the EMBL database. The format used in Swiss-Prot is similar to that which was used in older versions of the EMBL database, prior to the introduction of the common EMBL/GenBank/DDBJ feature table.

C.2.10 The CC line (Comment)

The comment lines, which are free text and can appear anywhere in an EMBL entry, are grouped together in the Swiss-Prot database. They are always listed below the last reference line, and follow a precise syntax (see section 3.11).

C.2.11 The SQ line (SeQuence header)

Although the rough format and purpose of this line type is conserved, its exact content differs from that of the EMBL database. The numerical length of the sequence is listed, followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs). Rather than indicating the sequence composition which, for protein sequences, would not fit in a single line, the molecular weight and the 64-bit CRC (Cyclic Redundancy Check) value of the sequence are indicated.

C.3. Line types defined by Swiss-Prot but currently not used by EMBL

Presently, there are two line types in Swiss-Prot that are not used in the EMBL database: the GN and OX lines.

C.4. Line types defined by EMBL but currently not used by Swiss-Prot

There are three line types in the EMBL database that are not used in Swiss-Prot:

Appendix D: Swiss-Prot documentation files

Swiss-Prot is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indexes for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files, and updating and modifying existing files. The following table lists all the documents that are currently available.

userman.txt User manual
relnotes.txt Release notes for the current release
shortdes.txt Short description of entries in Swiss-Prot
   
jourlist.txt List of cited journals
keywlist.txt List of keywords
plasmid.txt List of plasmids
speclist.txt List of organism (species) identification codes
tisslist.txt List of tissues
experts.txt List of on-line experts for PROSITE and Swiss-Prot
dbxref.txt List of databases cross-referenced in Swiss-Prot
submit.txt Submission of sequence data to Swiss-Prot
   
acindex.txt Accession number index
autindex.txt Author index
citindex.txt Citation index
keyindex.txt Keyword index
speindex.txt Species index
deleteac.txt Deleted accession number index
   
7tmrlist.txt List of 7-transmembrane G-linked receptor entries
aatrnasy.txt List of aminoacyl-tRNA synthetases
allergen.txt Nomenclature and index of allergen sequences
annbioch.txt Swiss-Prot annotation: how is biochemical information assigned to sequence entries
arath.txt Index of Arabidopsis thaliana entries and their corresponding gene designations
bacsu.txt Index of Bacillus subtilis strain 168 chromosomal entries and their corresponding SubtiList cross-references
bloodgrp.txt Blood group antigen proteins
bucai.txt Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum) entries
bucap.txt Index of Buchnera aphidicola (subsp. Schizaphis graminum) entries
calbican.txt Index of Candida albicans entries and their corresponding gene designations
cdlist.txt CD nomenclature for surface proteins of human leucocytes
celegans.txt Index of Caenorhabditis elegans entries and their corresponding gene designations and WormPep cross-references
dicty.txt Index of Dictyostelium discoideum entries and their corresponding gene designations and DictyDB cross-references
ec2dtosp.txt Index of Escherichia coli Gene-protein database (ECO2DBASE) entries referenced in Swiss-Prot
ecoli.txt Index of Escherichia coli strain K12 chromosomal entries and their corresponding EcoGene cross-references
embltosp.txt Index of EMBL Nucleotide Sequence Database entries referenced in Swiss-Prot
extradom.txt Nomenclature of extracellular domains
fly.txt Index of Drosophila entries and their corresponding FlyBase cross-references
glycosid.txt Classification of glycosyl hydrolase families and index of glycosyl hydrolase entries in Swiss-Prot
haein.txt Index of Haemophilus influenzae strain Rd chromosomal entries
helpy.txt Index of Helicobacter pylori strain 26695 chromosomal entries
hoxlist.txt Vertebrate homeotic Hox proteins: nomenclature and index
humchr01.txt Index of proteins encoded on human chromosome 1
humchr02.txt Index of proteins encoded on human chromosome 2
humchr03.txt Index of proteins encoded on human chromosome 3
humchr04.txt Index of proteins encoded on human chromosome 4
humchr05.txt Index of proteins encoded on human chromosome 5
humchr06.txt Index of proteins encoded on human chromosome 6
humchr07.txt Index of proteins encoded on human chromosome 7
humchr08.txt Index of proteins encoded on human chromosome 8
humchr09.txt Index of proteins encoded on human chromosome 9
humchr10.txt Index of proteins encoded on human chromosome 10
humchr11.txt Index of proteins encoded on human chromosome 11
humchr12.txt Index of proteins encoded on human chromosome 12
humchr13.txt Index of proteins encoded on human chromosome 13
humchr14.txt Index of proteins encoded on human chromosome 14
humchr15.txt Index of proteins encoded on human chromosome 15
humchr16.txt Index of proteins encoded on human chromosome 16
humchr17.txt Index of proteins encoded on human chromosome 17
humchr18.txt Index of proteins encoded on human chromosome 18
humchr19.txt Index of proteins encoded on human chromosome 19
humchr20.txt Index of proteins encoded on human chromosome 20
humchr21.txt Index of proteins encoded on human chromosome 21
humchr22.txt Index of proteins encoded on human chromosome 22
humchrx.txt Index of proteins encoded on human chromosome X
humchry.txt Index of proteins encoded on human chromosome Y
humpvar.txt Index of human proteins with sequence variants
initfact.txt List and index of translation initiation factors
intein.txt Index of intein-containing entries referenced in Swiss-Prot
metallo.txt Classification of metallothioneins and index of the entries in Swiss-Prot
metja.txt Index of Methanococcus jannaschii entries
mgdtosp.txt Index of MGD entries referenced in Swiss-Prot
mimtosp.txt Index of MIM entries referenced in Swiss-Prot
mycge.txt Index of Mycoplasma genitalium strain G-37 chromosomal entries
mycpn.txt Index of Mycoplasma pneumoniae strain M129 chromosomal entries
ngr234.txt Table of predicted proteins in Rhizobium plasmid pNGR234a
nomlist.txt List of nomenclature related references for proteins
pdbtosp.txt Index of Protein Data Bank (PDB) entries referenced in Swiss-Prot
peptidas.txt Classification of peptidase families and index of peptidase entries in Swiss-Prot
plastid.txt List of chloroplast and cyanelle encoded proteins
pombe.txt Index of Schizosaccharomyces pombe entries and their corresponding gene designations
restric.txt List of restriction enzyme and methylase entries
ribosomp.txt Index of ribosomal proteins classified by families on the basis of sequence similarities
ricpr.txt Index of Rickettsia prowazekii strain Madrid E entries
salty.txt Index of Salmonella typhimurium strain LT2 chromosomal entries and their corresponding StyGene cross-references
syny3.txt Index of Synechocystis sp. strain PCC 6803 entries
upflist.txt List of UPF (Uncharacterized Protein Families) and index of members
yeast.txt Index of Saccharomyces cerevisiae entries in Swiss-Prot and their corresponding gene designations
yeast1.txt Yeast chromosome I entries
yeast2.txt Yeast chromosome II entries
yeast3.txt Yeast chromosome III entries
yeast5.txt Yeast chromosome V entries
yeast6.txt Yeast chromosome VI entries
yeast7.txt Yeast chromosome VII entries
yeast8.txt Yeast chromosome VIII entries
yeast9.txt Yeast chromosome IX entries
yeast10.txt Yeast chromosome X entries
yeast11.txt Yeast chromosome XI entries
yeast13.txt Yeast chromosome XIII entries
yeast14.txt Yeast chromosome XIV entries

Appendix E: FTP access to Swiss-Prot and TrEMBL

E.1   Generalities
Swiss-Prot is available for download on the following anonymous FTP servers:

Organization Swiss Institute of Bioinformatics (SIB)
Address ftp.expasy.org, au.expasy.org, bo.expasy.org, ca.expasy.org, cn.expasy.org, kr.expasy.org, tw.expasy.org, us.expasy.org
Directory /databases/swiss-prot/

Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk
Directory /pub/databases/swissprot/
E.2   Non-redundant database
On the ExPASy and EBI FTP servers we distribute files that make up a non-redundant and complete protein sequence database consisting of three components:

1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as TrEMBL_New)

Every week three files are completely rebuilt. These files are named: sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their '.gz' extension, these are gzip-compressed files which, when decompressed, produce ASCII files in Swiss-Prot format.

Three other files are also available (sprot.fas.gz, trembl.fas.gz and trembl_new.fas.gz) which are compressed 'fasta' format sequence files useful for building the databases used by FASTA, BLAST and other sequence similarity search programs. Please do not use these files for any other purpose, as you will lose all annotations by using this stripped-down format.

The files for the non-redundant database are stored in the directory '/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server (ftp.ebi.ac.uk).
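
Once one of the '.dat.gz' files has been downloaded, it can be read entry by entry without decompressing it on disk first. The following Python sketch is given as an illustration only; it assumes that each entry ends with a '//' termination line, and the function name iter_entries is hypothetical.

```python
import gzip


def iter_entries(source):
    """Yield each entry of a gzip-compressed flat file as a list of lines.

    'source' may be a file name (for example a local copy of sprot.dat.gz)
    or an open binary file object. Entries are assumed to be terminated
    by a line containing only '//'.
    """
    entry = []
    with gzip.open(source, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line == "//":
                yield entry
                entry = []
            else:
                entry.append(line)
```

A call such as 'for entry in iter_entries("sprot.dat.gz"): ...' then processes one entry at a time, so the whole database never needs to be held in memory.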

Additional notes:

E.3   Weekly updates of Swiss-Prot documents
Until recently, the ExPASy FTP server offered the Swiss-Prot documents and indexes only in their versions as of the last full release. All documents are now updated with every weekly release of Swiss-Prot and are available for FTP download from the directory /databases/swiss-prot/updated_doc/.

E.4   Weekly updates of Swiss-Prot
Weekly updates of Swiss-Prot are available by anonymous FTP. Three files are generated at each update:

new_seq.dat Contains all the new entries since the last full release;
upd_seq.dat Contains the entries for which the sequence data has been updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation fields have been updated since the last release.
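
One possible way of applying the three update files to a local copy of the last full release is sketched below in Python. This is an assumption about usage, not a documented procedure: entries are matched on the first accession number of their AC line, entries from the update files replace or extend the local ones, and deleted or merged entries are not handled. The function names primary_accession and apply_updates are hypothetical.

```python
def primary_accession(entry_lines):
    """Return the first accession number listed on the entry's AC line."""
    for line in entry_lines:
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(local, *update_sets):
    """Return a copy of 'local' with updated and new entries merged in.

    'local' maps primary accession numbers to entries (lists of lines).
    Each item of 'update_sets' is an iterable of entries, for example those
    read from new_seq.dat, upd_seq.dat and upd_ann.dat.
    """
    merged = dict(local)
    for entries in update_sets:
        for entry in entries:
            merged[primary_accession(entry)] = entry
    return merged
```

Because the update files cover all changes since the last full release, this sketch is meant to be applied to the full release itself rather than to a copy that has already been updated.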

Important notes

Appendix F: Relationships between Swiss-Prot and some biomolecular databases

The current status of the relationships (cross-references) between Swiss-Prot and some biomolecular databases is shown in the following schema:
