Instructions for MultiIdent Protein Identification Software
1.1 Introduction:
MultiIdent is designed for the identification of proteins from 2-D gels.
There are many properties of a protein that can be used to aid in its
identification. These include:
- protein isoelectric point (estimated from a 2-D gel)
- protein molecular mass (estimated from a 2-D gel or by mass spectrometry)
- species of origin of the protein
- protein amino acid composition
- protein sequence data (sequence tag)
- peptide masses generated by enzymatic or chemical digestion followed by
mass spectrometry of generated peptides.
The following computer program is designed to take different combinations of
the above data to achieve unambiguous protein identification. The program
will eventually work in a modular fashion, allowing any combination of the above
data to be used to find the correct protein identity.
However, the current version of the software works by first generating a
list of best-matching proteins with amino acid composition, and then using
other protein parameters (e.g. protein pI, Mw, sequence tag, peptide mass data) to
query this list and identify the protein of interest.
In the following set of notes, each section gives the instructions you should
follow for filling in the form on the computer, background notes relevant
to that part of the form, and important notes and cautions.
2.1 Using MultiIdent
In the program there are a number of fields that must be filled in, as
well as some fields that are optional. We will describe all of the fields
from the top to the bottom of the form, to give some idea of the best way to use
this program for your needs.
2.1.1 E-mail address
Please supply your full e-mail address.
This program runs in batch mode. You send your protein information to the
ExPASy server via the World-Wide Web, which then undertakes the
matching and returns the results to you by e-mail. It operates in batch
mode as this allows you to send many searches and then go and have a cup of
coffee whilst you wait for the results!
2.2 Unknown Protein Fields:
2.2.1 Name of Protein
Here you should supply a name or code number of the protein you are
working with. If you are going to do a number of different matches, you
should give each match a different name or code.
2.2.2 Protein pI
Here you should specify the pI of your protein, if known. This should be
estimated from a 2-D gel. You should also specify the confidence you have
in your pI estimate. This is done by selecting the appropriate number in
the "range" box.
For bacterial proteins separated in IPG gradients, we find that a range of
± 0.25 around the estimate is usually sufficient. For eukaryotic proteins,
this should be increased to ± 0.5 units if the proteins are thought to be
unmodified, but ± 1 unit if proteins are thought to carry charge-modifying
modifications like sialic acid. If you do not know the pI of the protein,
you should specify a pI of 7.0 ± 10. Similarly, if you have only a
vague idea of the protein pI, use a very large range. Even a large range of
pI can increase the power of your search.
2.2.3 Protein Mw
Here you should specify the estimated mass of your protein, in Daltons.
This can be estimated from a 2-D gel, or from mass spectrometry (MS) of the
entire protein. You should specify the confidence that you have in your Mw
estimate, in percent terms.
For bacterial proteins, a range of ± 20% around the Mw estimate is usually
sufficient. For cytoplasmic eukaryotic proteins this range is also usually
sufficient, but secreted eukaryotic proteins often carry post-translational
modifications that require a range of ± 30% or more to be inclusive. If
masses have been determined with MS, the ranges used can be much smaller (e.g. ± 2%).
However, note that if MS has been used to determine the size of a
glycoprotein or other heavily modified protein, the measured mass may be considerably
larger than the mass of the unmodified polypeptide predicted in the database.
Note! If you want to search only with a Mw range and not a pI range,
specify the Mw and range of interest, and then put 7.0 ± 10 units for the
pI. If you want to search only within a pI range and not a Mw range,
specify the pI and range of interest, and then a suitably large mass window
such as 100,000 Da ± 100%.
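The pI and Mw windows described above can be sketched as two simple range checks. This is a minimal illustration, not the program's actual code; the function names are hypothetical, and it assumes the pI range is given in pH units and the Mw range as a percentage of the estimate.

```python
def in_pi_window(theoretical_pi, pi, pi_range):
    """True if the theoretical pI lies within pi ± pi_range (pH units)."""
    return abs(theoretical_pi - pi) <= pi_range


def in_mw_window(theoretical_mw, mw, percent):
    """True if the theoretical Mw lies within mw ± percent% (Daltons)."""
    return abs(theoretical_mw - mw) <= mw * percent / 100.0


# A pI of 7.0 ± 10 accepts every pI on the 0-14 scale, so only the
# Mw window (here 40,000 Da ± 20%) restricts this hypothetical search.
print(in_pi_window(4.2, 7.0, 10) and in_mw_window(45000, 40000, 20))
```

Setting one window very wide, as recommended in the note above, makes that check pass for every entry, so the search is restricted by the other parameter only.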
2.2.4 Species of Interest
Here you should enter one or more search terms used on the OS and OC
(Organism Species and Organism Classification) lines of Swiss-Prot to specify
the species of origin of the protein or the species which you would like to
search against. If you wish to match against all proteins from all species
in the database, specify "ALL". If you have doubts about the exact
search terms to use, consult the Swiss-Prot list of species,
or the NCBI taxonomy browser.
We define "single species matching" as the case where you have, for example, proteins
from E. coli which you then match against only the E. coli proteins
in the database. This is a good approach to use when the organism you are
working with is molecularly well defined, or ideally, the subject of a
genome project. If the source of your proteins is not molecularly well
defined it is best to do "cross-species matching". Thus, for example, if
you are working with proteins from Candida albicans you may wish to either
match your proteins against ALL proteins in the database, or against the
fully sequenced yeast Saccharomyces cerevisiae. Note that when
cross-species matching, protein pI is frequently poorly conserved, but
protein mass is generally very well conserved. You should thus take this
into consideration when setting your pI and Mw ranges. The choice of
Swiss-Prot protein OS and OC lines enables
matching against certain groups of species, phyla, or families as
desired. This is useful if working on, for example, cats, as one can
match against all proteins in the database described for mammals.
2.2.5 Keyword
If you wish to perform your search in Swiss-Prot
only, you can enter a keyword appearing on the
KW lines of Swiss-Prot to further restrict your search. For example, a keyword
of "PLASMA" could be used in conjunction with the OC term "MAMMALIA"
to see if a user-entered protein matches well
with any mammalian plasma proteins in the database. If you are in
doubt about the exact term to use, consult the list of
keywords used in Swiss-Prot.
However, the use of a keyword is not allowed for searches in TrEMBL, as
the annotation in TrEMBL is done automatically and is therefore incomplete and not always correct.
2.3 Amino Acid Composition Fields:
Currently, protein amino acid (AA) composition is central to the
identification approach used by this program. The AA composition of the
protein of interest is matched against the theoretical AA compositions of
proteins in the Swiss-Prot/TrEMBL database in order to generate lists of best-matching
proteins, ranked by a score. The score is calculated from the
Euclidean distance between the AA compositions of the protein of interest
and a database entry. For more information about how to interpret this score, see
the AACompIdent documentation.
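As a rough illustration of the distance underlying the score, the sketch below computes the plain Euclidean distance between two compositions given in molar percent. The function name is hypothetical, and any scaling or transformation that the program applies to turn this distance into its reported score is not reproduced here.

```python
import math


def composition_distance(observed, theoretical):
    """Euclidean distance between two molar-percent compositions.

    Both arguments map amino acid codes to molar percent and must
    cover the same set of amino acids.
    """
    if observed.keys() != theoretical.keys():
        raise ValueError("compositions must cover the same amino acids")
    return math.sqrt(
        sum((observed[aa] - theoretical[aa]) ** 2 for aa in observed)
    )


# Toy two-residue example: differences of 3 and 4 give a distance of 5.
print(composition_distance({"A": 10.0, "G": 5.0}, {"A": 7.0, "G": 9.0}))
```

A smaller distance means a closer match between the protein of interest and the database entry.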
Note that whenever proteins are documented in Swiss-Prot to be
processed to produce a mature form, this program will calculate the
theoretical AA composition of that form for that entry. Thus signal
sequences, propeptides, and transit peptides will not be considered for the
AA composition calculations, nor will they be included in the pI and Mw calculations for these
proteins. Similarly, if a certain database entry
produces a number of mature polypeptides, the composition of each of these
will be calculated and documented separately.
However, if such processing events are documented in TrEMBL, MultiIdent does not use this information
to process TrEMBL
proteins into mature chains or peptides, i.e. AA composition,
pI and Mw are always computed for the whole sequence.
The matching of AA composition data against the database(s) is done in 3
ways, to produce 3 lists of best-matching proteins for each of the selected databases. The first list is where
the AA composition of the unknown protein is matched only against proteins
from the species of interest, but without considering pI or Mw ranges. The
second list is where the AA composition data is matched against all proteins
in the database, again without considering pI or Mw ranges. The third list
is where the AA composition data is matched against proteins from the
species of interest, but only those that fall in the specified pI and Mw range.
2.3.1 Amino Acid Composition of Unknown Protein
Currently, this is an essential field (i.e. no data, no results!). You
should enter the empirically determined AA composition of the protein of
interest. The composition should have been calculated in molar percent from
the results from your AA analysis system. Your data should have one decimal
place and add up to 100. Note that Asx (B) corresponds to the mixture of Asn and Asp, which is
quantitated as Asp during AA analysis. Similarly, Glx (Z) corresponds to the
mixture of Gln and Glu, which is quantitated as Glu during AA analysis.
Caution! There are several different "constellations" or sets of AAs that
are available for matching. It is critical that you use the constellation
that has the same set of AAs for which you have quantitative analysis data.
If you do not enter data for a certain AA in the form, the program will
assume you have quantitated that AA but found a molar percentage of 0.00. This is
different from the case where you have not quantitated the AA and have not included it in your
percent composition calculations. Most AA analysis systems (e.g. Fmoc, Picotag,
ninhydrin, but not OPA) will produce
data compatible with Constellation 2.
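The effect of Asx/Glx merging and of the chosen set of quantitated AAs can be sketched as follows. This is an illustration only: the function name and the `quantitated` argument are hypothetical stand-ins for a constellation, and the exact AA sets of the program's constellations are not reproduced here.

```python
from collections import Counter

# Asn is quantitated together with Asp (Asx, B) and Gln together
# with Glu (Glx, Z) during AA analysis.
MERGED = {"N": "B", "D": "B", "Q": "Z", "E": "Z"}


def molar_percent(sequence, quantitated):
    """Molar percent composition over the quantitated residue codes only.

    Residues outside `quantitated` are left out of the total, which is
    not the same as reporting them with a value of 0.
    """
    counts = Counter(MERGED.get(aa, aa) for aa in sequence.upper())
    total = sum(counts[aa] for aa in quantitated)
    if total == 0:
        raise ValueError("sequence contains none of the quantitated residues")
    return {aa: round(100.0 * counts[aa] / total, 1) for aa in quantitated}


# Same toy sequence, two different sets of quantitated residues.
print(molar_percent("NDQEAAAAGG", ["B", "Z", "A", "G"]))
print(molar_percent("NDQEAAAAGG", ["A", "G"]))
```

The two calls give different percentages for Ala and Gly, which is why the constellation used for matching must cover exactly the AAs that were quantitated.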
2.3.2 Amino Acid Composition of Calibration Protein
We have found that use of a known calibration
protein can dramatically decrease
the error of any particular AA analysis. This calibration protein should,
ideally, have been hydrolysed, extracted, and finally derivatised and
subjected to chromatography in the same batch as the protein of interest.
Ideally, it should be the same type of sample (e.g. PVDF-bound if unknown
proteins are from blots of 2-D gels).
If you specify data for a calibration protein, a set of factors will be
generated that mathematically represent the difference between the observed
and expected composition of the calibration protein. These factors will
then be automatically applied to the composition of the unknown protein (see
2.3.1) before it is matched against the composition of proteins in the
database in the identification procedure.
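These notes do not give the formula for the calibration factors. The sketch below therefore assumes the simplest plausible scheme: one multiplicative factor per AA (expected divided by observed for the calibration protein), applied to the unknown protein and followed by renormalisation to 100. The function names and the scheme itself are assumptions, not the program's documented method.

```python
def calibration_factors(observed_cal, expected_cal):
    """Per-residue factors expected/observed for the calibration protein.

    Residues observed at 0 are skipped (assumption) to avoid dividing
    by zero; they are left uncorrected later.
    """
    return {
        aa: expected_cal[aa] / observed_cal[aa]
        for aa in observed_cal
        if observed_cal[aa] > 0
    }


def apply_calibration(unknown, factors):
    """Scale the unknown composition and renormalise it to 100 percent."""
    corrected = {aa: value * factors.get(aa, 1.0) for aa, value in unknown.items()}
    total = sum(corrected.values())
    return {aa: 100.0 * value / total for aa, value in corrected.items()}


# Toy two-residue example with hypothetical values.
factors = calibration_factors({"A": 50.0, "G": 50.0}, {"A": 25.0, "G": 75.0})
print(apply_calibration({"A": 40.0, "G": 60.0}, factors))
```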
2.3.2.1 Calibration Protein ID
If you have used a calibration protein, put the Swiss-Prot ID code for the
protein here. If you do not wish to perform calibration, enter NULL for
this field.
2.3.2.2 Composition of Calibration Protein
Enter the AA composition of your calibration protein. The composition
should have been calculated in molar percent from the results from your AA
analysis system. Your data should have one decimal place, if possible, and
add up to a total of 100.
2.3.3 Details on the Lists of Proteins to be Generated and Printed
Choose the number of best-matched proteins by AA composition
to be generated by the computer
and kept in memory. The size of this list is important if you are also going to
use peptide mass fingerprinting data for protein identification
(see 2.5). Then specify how many
of the top-ranked proteins from this list you actually want to see in the
results that will be sent to you by e-mail. This number should always be
equal to or smaller than the number specified for the size of the list.
If you want to do protein identification by AA composition alone, the
default values of 20 and 20 will be adequate. For sequence tag
identification you may wish to use 40 and 40 if a sequence tag is not
found in the top 20 best-matching proteins. If you will be using the program for AA
composition and peptide mass fingerprinting identification, we recommend
using values of 100 or 200 for list size, and printing out the 20 top-ranked
proteins.
2.3.4 Reset and Perform Buttons
The RESET button resets all fields in the entire form and returns them to
default values. Use this if you have made a terrible mess and need to start
again! To submit the data to the ExPASy computer for matching, press the
PERFORM button.
Note! If you want to include sequence tag or peptide mass data in your
search, DO NOT press this button until you have filled in all
appropriate fields as described below.
2.4 Sequence Tag Fields:
Note! If you want to use sequence tag data as part of your identification
strategy, you MUST first click in the check box!
We define a "sequence tag"
as 3 to 6 residues of protein sequence, which may contain one or more 'X'
values for unknown AAs.
Sequence tag data is very specific. For example, there are 8000 different
combinations of 3 AA sequence tags, and 160000 different combinations of 4
AA sequence tags. Tag data can come from Edman degradation, carboxy- or
amino-peptidase experiments, or from MS sequencing.
In this program, sequence tag data as described above is matched against
the entire sequences of all best-matching proteins generated by AA
composition. If the tag sequence is found in a protein, the protein is
marked with an asterisk (*) in the list of results. Correspondingly, if
the user has selected to match all possible permutations of the tag against
the sequence, and if a permutation of the specified 3 to 6 amino acids
has been found, the protein is marked with a plus (+) in the results.
Note also that in the list of
results, the protein text description will be replaced by the protein N-,
C-terminal or internal sequence (see 2.4.2), and the tag, if found, will be displayed in
lower case letters.
2.4.1 Tag
Enter the sequence tag from your unknown protein, using the single letter
AA code. You can enter a maximum of 6 consecutive residues. Currently, the
use of B, Z, or J residues is not possible. 3 or more residues of
sequence should be entered, and unknown amino acids can be written
as an 'X'.
2.4.2 Display N- or C-Terminus or an appropriate internal subsequence
You can specify if you would like to see either the predicted N- or
C-terminus of all the best-matching proteins printed in the results. Alternatively,
you can choose to display an (internal) subsequence containing your tag. Your
choice of terminus should correspond with the source of your sequence tag
data (see 2.4.1).
2.4.3 Tag Permutations
If you have generated your tag data with amino- or carboxy-peptidases, or
with tandem MS, you may not be sure of the correct order of your sequence.
If this is the case, you can check the tag permutation button. In this
case, if the original tag sequence is found in a protein it is shown with an
asterisk (*); however, if only a permutation of the tag sequence is found in a
protein, it will be shown by a plus (+).
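The tag matching and marking rules of section 2.4 can be sketched as below. This is a minimal illustration with hypothetical function names; it treats 'X' as matching any single residue and returns the marker that would be shown next to a protein.

```python
import re
from itertools import permutations


def tag_to_regex(tag):
    """Turn a tag into a regular expression; 'X' matches any residue."""
    return "".join("." if aa == "X" else re.escape(aa) for aa in tag.upper())


def match_tag(tag, sequence, allow_permutations=False):
    """Return '*' for the tag as entered, '+' for a permutation, '' otherwise."""
    if not 3 <= len(tag) <= 6:
        raise ValueError("a sequence tag has 3 to 6 residues")
    sequence = sequence.upper()
    if re.search(tag_to_regex(tag), sequence):
        return "*"
    if allow_permutations:
        # At most 6! = 720 orderings, so trying them all is cheap.
        for perm in set(permutations(tag.upper())):
            if re.search(tag_to_regex("".join(perm)), sequence):
                return "+"
    return ""


print(match_tag("AXC", "MKAGCL"))                           # tag found as entered
print(match_tag("CGA", "MKAGCL", allow_permutations=True))  # only a permutation found
```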
2.5 Peptide Mass Fingerprinting Fields:
Note! If you want to use peptide mass data as part of your identification
strategy, you MUST click in the check box!
Peptides can be generated from proteins by enzymatic or chemical cleavage
and their masses measured by MS. This part of the identification program
accepts peptide data generated by these means, to help identify your protein
of interest. Currently, MultiIdent first generates a list of best-matching
proteins by AA composition. The number of proteins in this list can be
specified in the AA composition part of the form (see 2.3.3). The program
then takes all proteins in the list, generates their theoretical
peptides by "cutting" them with the enzyme of choice, and calculates the
theoretical masses of the generated fragments. Finally, MultiIdent matches the masses of the
theoretical peptides with those from your protein. Best-matching proteins
from the above list are then ranked according to the number of hits they have with
the empirically determined peptides.
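The procedure described above (cut, calculate masses, count hits, rank) can be sketched as follows. Several points are assumptions made for illustration: the function names are hypothetical, trypsin is modelled with the common rule "cut after K or R unless followed by P" with no missed cleavages, and monoisotopic residue masses are used because these notes do not say which mass type the program uses. The zero-width `re.split` requires Python 3.7 or later.

```python
import re

# Monoisotopic residue masses in Da (mass type is an assumption).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(sequence):
    """Cut after K or R unless followed by P; no missed cleavages."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence.upper()) if p]


def peptide_mass(peptide):
    """Theoretical [M] mass of one unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_hits(observed, sequence, tolerance):
    """Number of observed masses matching any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed
    )


def rank_by_hits(observed, candidates, tolerance):
    """Sort (hits, name) pairs for a dict of name -> sequence, best first."""
    scored = [(count_hits(observed, seq, tolerance), name)
              for name, seq in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical candidate list, standing in for the AA composition list.
print(rank_by_hits([277.1], {"P1": "MKAGCR", "P2": "GGGGK"}, 0.5))
```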
2.5.1 List of Peptide Masses
Enter the masses of peptides generated from your protein of interest.
Enter as many peptides as possible, separating them by a space or a new
line, and specifying masses to one decimal place. Note that you
can copy a list of peptides from Excel or other applications and paste them
directly into this field. Users should avoid using peptide masses known to
be from autodigestion of an enzyme (e.g. trypsin!), or other artifactual peaks (e.g.
matrix peaks).
2.5.2 Masses as [M] or [M+H]+
You can enter the masses of your peptides as [M] or as [M+H]+;
however, you must select the appropriate button. If you select the [M+H]+ button,
all peptide masses calculated from the database will have one proton
(mass of 1 unit) added before matching with user-specified peptides.
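The conversion between the two conventions is a single addition or subtraction, sketched below with hypothetical function names. The constant uses the proton mass of 1.00728 Da, which the text above rounds to 1 mass unit.

```python
PROTON = 1.00728  # Da; described in the text as a mass of 1 unit


def to_mh(neutral_mass):
    """Convert a neutral peptide mass [M] to the singly protonated [M+H]+."""
    return neutral_mass + PROTON


def to_neutral(mh_mass):
    """Convert an [M+H]+ mass back to the neutral mass [M]."""
    return mh_mass - PROTON


print(round(to_mh(933.3), 1))
```

Entering [M+H]+ values with the [M] button selected (or the reverse) shifts every comparison by about 1 Da, which matters when the mass tolerance is small.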
2.5.3 Select an Enzyme
Specify the enzyme or chemical reagent that you used to generate your
peptides (see the corresponding section in the PeptideMass instructions
for the available enzymes
and their cleavage rules).
2.5.4 Modifications to Cysteines in Peptides
Usually proteins will be subject to reduction and alkylation before they
are used to generate peptides. This ensures that all disulfide bonds in a
protein are broken. If the protein has not been treated in this manner,
check "with all cysteines in reduced form". If the protein has been reduced
and alkylated, specify "with all cysteines treated with", followed by the
reagent used for alkylation. You have a choice of iodoacetamide, iodoacetic
acid, 4-vinylpyridine, and acrylamide. The program will modify the
theoretical masses of Cys-containing peptides accordingly,
before matching with user-specified peptides. Note that in
proteins prepared by polyacrylamide gel electrophoresis, it is common for
cysteines to react with free acrylamide monomers.
2.5.5 Modifications to Methionines in Peptides
You can request for all methionines in theoretical peptides to be
oxidised. If this option is selected, the program will modify the
theoretical masses of all Met-containing peptides accordingly,
before matching with user-specified peptides. Note that
proteins prepared by gel electrophoresis often show this modification.
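The cysteine and methionine adjustments described in 2.5.4 and 2.5.5 amount to adding a fixed mass per modified residue. The sketch below uses standard monoisotopic mass shifts for each reagent; the function name, the dictionary keys, and the choice of monoisotopic rather than average masses are assumptions made for illustration.

```python
# Monoisotopic mass added per cysteine by each treatment (Da).
CYS_SHIFT = {
    "reduced": 0.0,
    "iodoacetamide": 57.02146,     # carbamidomethyl
    "iodoacetic acid": 58.00548,   # carboxymethyl
    "4-vinylpyridine": 105.05785,  # pyridylethyl
    "acrylamide": 71.03711,        # propionamide
}
MET_OXIDATION = 15.99491  # one oxygen per oxidised methionine (Da)


def modified_mass(peptide, base_mass, cys_reagent="reduced", oxidise_met=False):
    """Adjust an unmodified peptide mass for Cys alkylation and Met oxidation."""
    mass = base_mass + peptide.count("C") * CYS_SHIFT[cys_reagent]
    if oxidise_met:
        mass += peptide.count("M") * MET_OXIDATION
    return mass


# Hypothetical peptide with two Cys and one Met, base mass 400.0 Da.
print(modified_mass("ACMC", 400.0, "iodoacetamide", oxidise_met=True))
```

Peptides without Cys or Met are returned unchanged, so only the masses of Cys- or Met-containing peptides move.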
2.5.6 Mass Tolerance
You should specify the mass tolerance to be used around your peptides
during matching.
Tolerance works quite simply. If you have a peptide mass of 934.3 Da and
specify a tolerance of ± 2 Da, a hit will be registered with a protein if
one of its peptides in the database has a mass between 932.3 and 936.3 Da.
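The same rule written as a one-line check, using the 934.3 Da example from the text (the function name is hypothetical):

```python
def is_hit(observed_mass, theoretical_mass, tolerance):
    """True if the theoretical mass lies within observed_mass ± tolerance."""
    return abs(observed_mass - theoretical_mass) <= tolerance


print(is_hit(934.3, 933.0, 2.0))  # 933.0 lies inside 932.3 to 936.3
print(is_hit(934.3, 931.0, 2.0))  # 931.0 lies outside the window
```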
There are two important things to be considered in the choice of mass
tolerance. The first is the accuracy of your MS. Both MALDI and ES machines
are now capable of achieving single decimal point mass resolution; however,
this may depend on the care that has been taken in machine calibration and use of
internal standards. We recommend the use of a tolerance of ± 2 to 4 Da for
data from MALDI machines, and for ES machines, a tolerance of ± 1 or 2 Da.
The second thing to consider is if you are doing single species matching or
cross-species matching (see 2.2.4). With single species matching, you need
not allow any extra tolerance, as you are identifying your peptides by
direct comparison with those in the database. However, if you are doing
cross-species matching, you should allow extra mass tolerance. Thus, for
example, if there is a single AA substitution in a peptide from cat that you
are comparing with the same peptide from human, a tolerance window of ± 5 or
6 Da will cover many possible single AA substitutions in that peptide that do not
affect the cleavage sites.
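To get a feeling for how much extra tolerance a single AA substitution needs, the sketch below lists the residue pairs whose masses differ by no more than a given number of Daltons. It is an illustration only: the function name is hypothetical, monoisotopic residue masses are assumed, and it does not check whether a substitution creates or removes a cleavage site.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (mass type is an assumption).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitutions_within(tolerance):
    """Residue pairs (alphabetical order) differing by at most `tolerance` Da."""
    return [
        (a, b)
        for a, b in combinations(sorted(RESIDUE_MASS), 2)
        if abs(RESIDUE_MASS[a] - RESIDUE_MASS[b]) <= tolerance
    ]


pairs = substitutions_within(6.0)
print(len(pairs), "of 190 possible residue pairs differ by at most 6 Da")
```

Substitutions between residues of very different size (e.g. Gly and Trp) fall well outside such a window, so extra tolerance helps with some cross-species differences but not all.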
2.5.7 Scores for Peptide Mass Fingerprinting
The score or hitrate for peptide mass fingerprinting is simply
the number of peptides that matched with those from an entry in the Swiss-Prot
database.
In the output, you will see the masses of peptides that have matched with
those for a particular entry, as well as the name of the chain/peptide,
if the entry has resulted from a cleavage of a larger precursor molecule.
2.6 Results
Here you can specify which lists of results you wish the program to send
to you by e-mail.
For the result of the AA composition search, you can choose to have the lists of
- closest Swiss-Prot entries (in terms of AA composition, pI and Mw not
considered) for selected species and keyword,
- closest Swiss-Prot entries (in terms of AA composition, pI and Mw not
considered) for selected keyword and any species,
- closest Swiss-Prot entries for selected species and keyword, having pI and Mw
values in the specified range,
or any combination of them by clicking in the appropriate box(es).
For the selected list(s), you can specify whether you wish to see the AA composition scores,
the peptide mass fingerprinting scores, and/or
the integrated scores.
The default is to have AA composition results, peptide mass fingerprinting results
and a list of integrated scores for the closest Swiss-Prot entries from
the specified species and keyword, having pI and Mw in the specified ranges.
See above sections for
notes concerning the different types of output and their interpretation.
Note! If you haven't clicked the check box for peptide mass
fingerprinting, you will not receive results even if you ask for them here!
You will receive only the AACompIdent output!
Last modified 22/Jan/2003 by ELG