Instructions for MultiIdent Protein Identification Software

1.1 Introduction:

MultiIdent is designed for the identification of proteins from 2-D gels. There are many properties of a protein that can be used to aid in its identification. These include:

protein isoelectric point (estimated from a 2-D gel)
protein molecular mass (estimated from a 2-D gel or by mass spectrometry)
species of origin of the protein
protein amino acid composition
protein sequence data (sequence tag)
peptide masses generated by enzymatic or chemical digestion followed by mass spectrometry of generated peptides.

The following computer program is designed to take different combinations of the above data to achieve unambiguous protein identification. The program will eventually work in a modular fashion, allowing any combinations of the above data to be used to find the correct protein identity. However, the current version of the software works by first generating a list of best-matching proteins with amino acid composition, and then using other protein parameters (e.g. protein pI, Mw, sequence tag, peptide mass data) to query this list and identify the protein of interest.

In the following notes, the instructions for filling in each field of the form come first under that field's heading. Background notes relevant to that part of the form are marked with DOC. Important notes and cautions are marked with "Note!" or "Caution!".

2.1 Using MultiIdent

In the program there are a number of fields that must be filled in, as well as some fields that are optional. We will describe all of the fields from the top to the bottom of the form, to give some idea of the best way to use this program for your needs.

2.1.1 E-mail address

Please supply your full e-mail address.

DOC This program runs in batch mode. You send your protein information to the ExPASy server via the World-Wide Web, which then undertakes the matching and returns the results to you by e-mail. It operates in batch mode as this allows you to send many searches and then go and have a cup of coffee whilst you wait for the results!

2.2 Unknown Protein Fields:

2.2.1 Name of Protein

Here you should supply a name or code number of the protein you are working with. If you are going to do a number of different matches, you should give each match a different name or code.

2.2.2 Protein pI

Here you should specify the pI of your protein, if known. This should be estimated from a 2-D gel. You should also specify the confidence you have in your pI estimate. This is done by selecting the appropriate number in the "range" box.

DOC For bacterial proteins separated in IPG gradients, we find that a range of ± 0.25 around the estimate is usually sufficient. For eukaryotic proteins, this should be increased to ± 0.5 units if the proteins are thought to be unmodified, but ± 1 unit if the proteins are thought to carry charge-modifying modifications like sialic acid. If you do not know the pI of the protein, you should specify a pI of 7.0 ± 10. Similarly, if you have only a vague idea of the protein pI, use a very large range. Even a large range of pI can increase the power of your search.

2.2.3 Protein Mw

Here you should specify the estimated mass of your protein, in Daltons. This can be estimated from a 2-D gel, or from mass spectrometry (MS) of the entire protein. You should specify the confidence that you have in your Mw estimate, in percent terms.

DOC For bacterial proteins, a range of ± 20% around the Mw estimate is usually sufficient. For cytoplasmic eukaryotic proteins this range is also usually sufficient, but secreted eukaryotic proteins often carry post-translational modifications that require a range of ± 30% or more to be inclusive. If masses have been determined with MS, the ranges used can be much smaller (e.g. ± 2%). However, note that if MS has been used to determine the size of a glycoprotein or other heavily modified protein, the measured mass may be considerably larger than the mass of the unmodified polypeptide predicted in the database.

Note! If you want to search only with a Mw range and not a pI range, specify the Mw and range of interest, and then put 7.0 ± 10 units for the pI. If you want to search only within a pI range and not a Mw range, specify the pI and range of interest, and then a suitably large mass window such as 100,000 Da ± 100%.
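The pI and Mw windows described above can be sketched as a simple filter. This is only an illustration, not MultiIdent's own code: the function name `in_window` and its parameters are hypothetical, and the Mw range is assumed to be a percentage of the user's own Mw estimate.

```python
def in_window(entry_pi, entry_mw, pi, pi_range, mw, mw_range_percent):
    """Return True if a database entry falls inside both search windows.

    Assumption: the Mw range is taken as a percentage of the user's estimate.
    """
    pi_ok = abs(entry_pi - pi) <= pi_range
    mw_ok = abs(entry_mw - mw) <= mw * mw_range_percent / 100.0
    return pi_ok and mw_ok


# Mw-only search: a pI of 7.0 +/- 10 accepts any realistic pI.
print(in_window(4.2, 30000, pi=7.0, pi_range=10, mw=32000, mw_range_percent=20))

# pI-only search: 100,000 Da +/- 100% accepts masses from 0 to 200,000 Da.
print(in_window(5.1, 150000, pi=5.0, pi_range=0.25, mw=100000, mw_range_percent=100))
```

Under this reading, a window of 100,000 Da ± 100% would still exclude proteins larger than 200,000 Da, so a larger central mass may be appropriate for very large proteins.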

2.2.4 Species of Interest

Here you should enter one or more search terms used on the OS and OC (Organism species and Organism classification) lines of Swiss-Prot to specify species of origin of the protein or the species which you would like to search against. If you wish to match against all proteins from all species in the database, specify "ALL". If you have doubts about the exact search terms to use, consult the Swiss-Prot list of species, or the NCBI taxonomy browser.

DOC We define "single species matching" where you, for example, have proteins from E. coli which you then match against only the E. coli proteins in the database. This is a good approach to use when the organism you are working with is molecularly well defined, or ideally, the subject of a genome project. If the source of your proteins is not molecularly well defined it is best to do "cross-species matching". Thus, for example, if you are working with proteins from Candida albicans you may wish to either match your proteins against ALL proteins in the database, or against the fully sequenced yeast Saccharomyces cerevisiae. Note that when cross-species matching, protein pI is frequently poorly conserved, but protein mass is generally very well conserved. You should thus take this into consideration when setting your pI and Mw ranges. The choice of Swiss-Prot protein OS and OC lines enables matching against certain groups of species, phyla, or families as desired. This is useful if working on, for example, cats, as one can match against all proteins in the database described for mammals.

2.2.5 Keyword

If you wish to perform your search in Swiss-Prot only, you can enter a keyword appearing on the KW lines of Swiss-Prot to further restrict your search. For example, a keyword of "PLASMA" could be used in conjunction with the OC term "MAMMALIA" to see if a user-entered protein matches well with any mammalian plasma proteins in the database. If you are in doubt about the exact term to use, consult the list of keywords used in Swiss-Prot.

However, the use of a keyword is not allowed for searches in TrEMBL, as the annotation in TrEMBL is done automatically and is therefore incomplete and not always correct.

2.3 Amino Acid Composition Fields:

Currently, protein amino acid (AA) composition is central to the identification approach used by this program. The AA composition of the protein of interest is matched against the theoretical AA compositions of proteins in the Swiss-Prot/TrEMBL database in order to generate lists of best-matching proteins, ranked by a score. The score is calculated from the Euclidean distance between the AA compositions of the protein of interest and a database entry. For more information about how to interpret this score, see the AACompIdent documentation.
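As a rough illustration of ranking by Euclidean distance, the sketch below scores toy compositions given in molar percent. The function names and the example entries are hypothetical, and the exact scaling that turns the distance into the reported score is described in the AACompIdent documentation and is not reproduced here.

```python
import math


def composition_distance(query, entry):
    """Plain Euclidean distance between two compositions (molar percent)."""
    return math.sqrt(sum((query[aa] - entry.get(aa, 0.0)) ** 2 for aa in query))


def rank_by_composition(query, database, list_size=20):
    """Return the list_size closest entries as (distance, entry_id) pairs."""
    scored = sorted(
        (composition_distance(query, comp), entry_id)
        for entry_id, comp in database.items()
    )
    return scored[:list_size]


# Toy data with only three amino acids; real compositions cover a full constellation.
query = {"A": 10.0, "G": 20.0, "L": 70.0}
database = {
    "ENTRY_1": {"A": 13.0, "G": 24.0, "L": 63.0},
    "ENTRY_2": {"A": 10.0, "G": 20.0, "L": 70.0},
}
print(rank_by_composition(query, database))
```

A smaller distance means a closer match, so the best candidates appear first in the list.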

Note that whenever proteins are documented in Swiss-Prot to be processed to produce a mature form, this program will calculate the theoretical AA composition of that form for that entry. Thus signal sequences, propeptides, and transit peptides will not be considered for the AA composition calculations, neither will they be included for pI and Mw calculations for these proteins. Similarly, if a certain database entry produces a number of mature polypeptides, the composition of each of these will be calculated and documented separately.

However, if such processing events are documented in TrEMBL, MultiIdent does not use this information to process TrEMBL proteins into mature chains or peptides, i.e. AA composition, pI and Mw are always computed for the whole sequence.

DOC The matching of AA composition data against the database(s) is done in 3 ways, to produce 3 lists of best-matching proteins for each of the selected databases. The first list is where the AA composition of the unknown protein is matched only against proteins from the species of interest, but without considering pI or Mw ranges. The second list is where the AA composition data is matched against all proteins in the database, again without considering pI or Mw ranges. The third list is where the AA composition data is matched against proteins from the species of interest, but only those that fall in the specified pI and Mw range.
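The three lists can be pictured as three filters over the same scored entries. The sketch below assumes each entry already carries a composition score where lower means closer. The field names (`id`, `species`, `pi`, `mw`, `score`) and the function `three_lists` are hypothetical, and keyword restriction is left out.

```python
def three_lists(entries, species, pi, pi_range, mw, mw_percent, list_size=20):
    """Build the three ranked lists from entries with precomputed scores."""

    def ranked(subset):
        # Assumption: a lower score means a closer composition match.
        return [e["id"] for e in sorted(subset, key=lambda e: e["score"])][:list_size]

    same_species = [e for e in entries if e["species"] == species]
    in_window = [
        e for e in same_species
        if abs(e["pi"] - pi) <= pi_range
        and abs(e["mw"] - mw) <= mw * mw_percent / 100.0
    ]
    # List 1: species only. List 2: all species. List 3: species plus pI/Mw window.
    return ranked(same_species), ranked(entries), ranked(in_window)


entries = [
    {"id": "P1", "species": "ECOLI", "pi": 5.1, "mw": 30000, "score": 5},
    {"id": "P2", "species": "ECOLI", "pi": 9.0, "mw": 31000, "score": 2},
    {"id": "P3", "species": "HUMAN", "pi": 5.0, "mw": 29000, "score": 1},
]
list1, list2, list3 = three_lists(entries, "ECOLI", 5.0, 0.25, 30000, 20)
print(list1, list2, list3)
```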

2.3.1 Amino Acid Composition of Unknown Protein

Currently, this is an essential field (i.e. no data, no results!). You should enter the empirically determined AA composition of the protein of interest. The composition should have been calculated in molar percent from the results from your AA analysis system. Your data should have one decimal place and add up to 100. Note that Asx (B) corresponds to the mixture of Asn and Asp, which is quantitated as Asp during AA analysis. Similarly, Glx (Z) corresponds to the mixture of Gln and Glu, which is quantitated as Glu during AA analysis.

Caution! There are several different "constellations" or sets of AAs that are available for matching. It is critical that you use the constellation that has the same set of AAs for which you have quantitative analysis data. If you do not enter data for a certain AA in the form, the program will assume you have quantitated that AA but found a molar percentage of 0.00. This is different to if you have not quantitated the AA and not included it in your percent composition calculations. Most AA analysis systems (e.g. Fmoc, Picotag, ninhydrin, but not OPA) will produce data compatible with Constellation 2.
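The difference between "quantitated as zero" and "not quantitated" can be seen in a small sketch that computes a theoretical composition over a chosen set of AAs. The constellation used below is a hypothetical four-member set chosen for brevity, not one of the constellations offered on the form, and `molar_percent` is an illustrative name.

```python
from collections import Counter


def molar_percent(sequence, constellation):
    """Composition in molar percent over the AAs of one constellation.

    Asp and Asn are pooled as Asx (B), Glu and Gln as Glx (Z). Residues
    outside the constellation are left out of the percentage calculation.
    """
    merged = sequence.upper().translate(str.maketrans("DNEQ", "BBZZ"))
    counts = Counter(aa for aa in merged if aa in constellation)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no residues from this constellation")
    return {aa: round(100.0 * counts[aa] / total, 1) for aa in constellation}


# Hypothetical constellation for illustration only.
example_constellation = ("B", "Z", "G", "A")
print(molar_percent("DNEQGGAAWC", example_constellation))
```

Here Trp and Cys are not part of the constellation, so they do not dilute the other percentages. Entering 0.0 for them in a constellation that does include them would describe a different protein.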

2.3.2 Amino Acid Composition of Calibration Protein

We have found that use of a known calibration protein can dramatically decrease the error of any particular AA analysis. This calibration protein should, ideally, have been hydrolysed, extracted, and finally derivatised and subjected to chromatography in the same batch as the protein of interest. Ideally, it should be the same type of sample (e.g. PVDF-bound if unknown proteins are from blots of 2-D gels). If you specify data for a calibration protein, a set of factors will be generated that mathematically represent the difference between the observed and expected composition of the calibration protein. These factors will then be automatically applied to the composition of the unknown protein (see 2.3.1) before it is matched against the composition of proteins in the database in the identification procedure.
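The source does not spell out how the calibration factors are computed. One plausible reading, shown here purely as an assumption, is a per-residue ratio of expected to observed molar percent, applied to the unknown protein and followed by renormalisation to 100. All names below are hypothetical.

```python
def calibration_factors(observed, expected):
    """Per-AA correction factors (assumed form: expected / observed)."""
    return {aa: expected[aa] / observed[aa] for aa in observed if observed[aa] > 0}


def apply_calibration(unknown, factors):
    """Scale the unknown composition and renormalise it to 100 molar percent."""
    corrected = {aa: value * factors.get(aa, 1.0) for aa, value in unknown.items()}
    total = sum(corrected.values())
    return {aa: 100.0 * value / total for aa, value in corrected.items()}


# Toy calibration protein: Ala is under-recovered, Gly over-recovered.
observed = {"A": 40.0, "G": 60.0}
expected = {"A": 50.0, "G": 50.0}
factors = calibration_factors(observed, expected)

# An unknown with the same analytical bias is pulled back toward 50/50.
print(apply_calibration({"A": 40.0, "G": 60.0}, factors))
```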

2.3.2.1 Calibration Protein ID

If you have used a calibration protein, put the Swiss-Prot ID code for the protein here. If you do not wish to perform calibration, enter NULL for this field.

2.3.2.2 Composition of Calibration Protein

Enter the AA composition of your calibration protein. The composition should have been calculated in molar percent from the results from your AA analysis system. Your data should have one decimal place, if possible, and add up to a total of 100.

2.3.3 Details on the Lists of Proteins to be Generated and Printed

Choose the number of best-matched proteins by AA composition to be generated by the computer and kept in memory. The size of this list is important if you are also going to use peptide mass fingerprinting data for protein identification (see 2.5). Then specify how many of the top-ranked proteins from this list you actually want to see in the results that will be sent to you by e-mail. This number should always be equal to or smaller than the number specified for the size of the list.

DOC If you want to do protein identification by AA composition alone, the default values of 20 and 20 will be adequate. For sequence tag identification you may wish to use 40 and 40 if a sequence tag is not found in the top 20 best-matching proteins. If you will be using the program for AA composition and peptide mass fingerprinting identification, we recommend using values of 100 or 200 for list size, and printing out the 20 top-ranked proteins.

2.3.4 Reset and Perform Buttons

The RESET button resets all fields in the entire form and returns them to default values. Use this if you have made a terrible mess and need to start again! To submit the data to the ExPASy computer for matching, press the PERFORM button.

Note! If you want to include sequence tag or peptide mass data in your search, DO NOT press this button until you have filled in all appropriate fields as described below.

2.4 Sequence Tag Fields:

Note! If you want to use sequence tag data as part of your identification strategy, you MUST first click in the check box!

DOC We define a "sequence tag" as 3 to 6 residues of protein sequence, which may contain one or more 'X' values for unknown AAs. Sequence tag data is very specific. For example, with 20 amino acids there are 8,000 (20 × 20 × 20) different 3-AA sequence tags, and 160,000 different 4-AA sequence tags. Tag data can come from Edman degradation, carboxy- or amino-peptidase experiments, or from MS sequencing.

DOC In this program, sequence tag data as described above is matched against the entire sequences of all best-matching proteins generated by AA composition. If the tag sequence is found in a protein, the protein is marked with an asterisk (*) in the list of results. Correspondingly, if the user has selected to match all possible permutations of the tag against the sequence, and if a permutation of the specified 3 to 6 amino acids has been found, the protein is marked with a plus (+) in the results. Note also that in the list of results, the protein text description will be replaced by the protein N-, C-terminal or internal sequence (see 2.4.2), and the tag, if found, will be displayed in lower case letters.

2.4.1 Tag

Enter the sequence tag from your unknown protein, using the single letter AA code. You can enter a maximum of 6 consecutive residues. Currently, the use of B, Z, or J residues is not possible. 3 or more residues of sequence should be entered, and unknown amino acids can be written as an 'X'.

2.4.2 Display N- or C-Terminus or an appropriate internal subsequence

You can specify if you would like to see either the predicted N- or C-terminus of all the best-matching proteins printed in the results. Alternatively, you can choose to display an (internal) subsequence containing your tag. Your choice of terminus should correspond with the source of your sequence tag data (see 2.4.1).

2.4.3 Tag Permutations

If you have generated your tag data with amino- or carboxy-peptidases, or with tandem MS, you may not be sure of the correct order of your sequence. If this is the case, you can check the tag permutation button. A protein in which the original tag sequence is found is then marked with an asterisk (*), whereas a protein in which a permutation of the tag sequence is found is marked with a plus (+).
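Tag matching with 'X' wildcards and optional permutations can be sketched with regular expressions. This is an illustration of the rules described in sections 2.4 to 2.4.3, not the program's actual implementation, and the function names are hypothetical.

```python
import re
from itertools import permutations


def tag_pattern(tag):
    """Compile a tag into a regex in which 'X' matches any single residue."""
    return re.compile("".join("." if c == "X" else re.escape(c) for c in tag))


def tag_mark(tag, sequence, allow_permutations=False):
    """Return '*' for an exact tag hit, '+' for a permutation hit, else ''."""
    tag = tag.upper()
    if not 3 <= len(tag) <= 6 or not tag.isalpha() or set(tag) & set("BZJ"):
        raise ValueError("tag must be 3 to 6 letters and must not contain B, Z or J")
    seq = sequence.upper()
    if tag_pattern(tag).search(seq):
        return "*"
    if allow_permutations:
        # At most 6! = 720 orderings, so brute force is cheap.
        for ordering in set(permutations(tag)):
            if tag_pattern("".join(ordering)).search(seq):
                return "+"
    return ""


print(tag_mark("MKV", "AAMKVLL"))
print(tag_mark("MXV", "AAMKVLL"))
print(tag_mark("KMV", "AAMKVLL", allow_permutations=True))
```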

2.5 Peptide Mass Fingerprinting Fields:

Note! If you want to use peptide mass data as part of your identification strategy, you MUST click in the check box!

DOC Peptides can be generated from proteins by enzymatic or chemical cleavage and their masses measured by MS. This part of the identification program accepts peptide data generated by these means, to help identify your protein of interest. Currently, MultiIdent first generates a list of best-matching proteins by AA composition. The number of proteins in this list can be specified in the AA composition part of the form (see 2.3.3). The program then takes all proteins in the list, calculates their theoretical peptides by "cutting" them with the enzyme of choice, and calculating the theoretical masses of generated fragments. Finally, MultiIdent matches the masses of the theoretical peptides with those from your protein. Best-matching proteins from the above list are then ranked according to the number of hits they have with the empirically determined peptides.
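The "cutting" step can be illustrated with a minimal in-silico digest. The sketch assumes a common trypsin rule (cleave after K or R, but not before P), no missed cleavages, and standard monoisotopic residue masses. None of these choices is taken from the source, which defers to the PeptideMass instructions for the actual cleavage rules.

```python
import re

# Standard monoisotopic residue masses in Da (assumption: the source does not
# say whether monoisotopic or average masses are used).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def digest_trypsin(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence.upper()) if p]


def peptide_mass(peptide):
    """Neutral mass [M] of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_masses(sequence):
    """Masses of all peptides from a complete tryptic digest."""
    return [peptide_mass(p) for p in digest_trypsin(sequence)]


print(digest_trypsin("MKTAYRPGGK"))
print([round(m, 4) for m in theoretical_masses("GKAR")])
```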

2.5.1 List of Peptide Masses

Enter the masses of peptides generated from your protein of interest. Enter as many peptides as possible, separating them by a space or a new line, and specifying masses to one decimal place. Note that you can copy a list of peptides from Excel or other applications and paste them directly into this field. Users should avoid using peptide masses known to be from autodigestion of an enzyme (e.g. trypsin), or other artifactual peaks (e.g. matrix peaks).

2.5.2 Masses as [M] or [M+H]⁺

You can enter the masses of your peptides as [M] or as [M+H]⁺, however you must select the appropriate button. If you select the [M+H]⁺ button, all peptide masses calculated from the database will have one proton (mass of 1 unit) added before matching with user-specified peptides.
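The effect of the [M] / [M+H]⁺ choice on the theoretical masses can be sketched as follows. The function name is hypothetical. The default proton mass of 1.0 follows the "mass of 1 unit" wording above; a more precise value (about 1.00728 Da) can be passed instead.

```python
def theoretical_for_matching(neutral_masses, user_masses_are_mh, proton=1.0):
    """Shift theoretical [M] masses to [M+H]+ when the user entered [M+H]+ values."""
    if user_masses_are_mh:
        return [m + proton for m in neutral_masses]
    return list(neutral_masses)


print(theoretical_for_matching([1000.0, 1500.5], user_masses_are_mh=True))
print(theoretical_for_matching([1000.0, 1500.5], user_masses_are_mh=False))
```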

2.5.3 Select an Enzyme

Specify the enzyme or chemical reagent that you used to generate your peptides (see the corresponding section in the PeptideMass instructions for the available enzymes and their cleavage rules).

2.5.4 Modifications to Cysteines in Peptides

Usually proteins will be subject to reduction and alkylation before they are used to generate peptides. This ensures that all disulfide bonds in a protein are broken. If the protein has not been treated in this manner, check "with all cysteines in reduced form". If the protein has been reduced and alkylated, specify "with all cysteines treated with", followed by the reagent used for alkylation. You have a choice of iodoacetamide, iodoacetic acid, 4-vinylpyridine, and acrylamide. The program will modify the theoretical masses of Cys-containing peptides accordingly, before matching with user-specified peptides. Note that in proteins prepared by polyacrylamide gel electrophoresis, it is common for cysteines to react with free acrylamide monomers.

2.5.5 Modifications to Methionines in Peptides

You can request that all methionines in theoretical peptides be oxidised. If this option is selected, the program will modify the theoretical masses of all Met-containing peptides accordingly, before matching with user-specified peptides. Note that proteins prepared by gel electrophoresis often show this modification.
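The cysteine and methionine options amount to adding a fixed mass shift per modified residue. The sketch below uses standard monoisotopic mass shifts for each reagent. These values are not taken from the source, which does not state the numbers the program applies, and the names are hypothetical.

```python
# Monoisotopic mass added per cysteine, in Da (standard values, assumed here).
CYS_SHIFT = {
    "reduced": 0.0,
    "iodoacetamide": 57.02146,     # carbamidomethylation
    "iodoacetic acid": 58.00548,   # carboxymethylation
    "4-vinylpyridine": 105.05785,  # pyridylethylation
    "acrylamide": 71.03711,        # propionamide adduct
}
MET_OXIDATION = 15.99491  # one oxygen per methionine


def modified_mass(peptide, base_mass, cys_reagent="reduced", oxidise_met=False):
    """Apply the Cys and Met shifts to the unmodified mass of a peptide."""
    mass = base_mass + peptide.count("C") * CYS_SHIFT[cys_reagent]
    if oxidise_met:
        mass += peptide.count("M") * MET_OXIDATION
    return mass


print(round(modified_mass("ACMK", 1000.0, "iodoacetamide", oxidise_met=True), 5))
```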

2.5.6 Mass Tolerance

You should specify the mass tolerance to be used around your peptides during matching.

DOC Tolerance works quite simply. If you have a peptide mass of 934.3 Da and specify a tolerance of ± 2 Da, a hit will be registered with a protein if one of its peptides in the database has a mass between 932.3 and 936.3 Da. There are two important things to be considered in the choice of mass tolerance. Firstly, the accuracy of your MS. Both MALDI and ES machines are now capable of achieving single decimal point mass resolution; however, this may depend on the care that has been taken in machine calibration and use of internal standards. We recommend the use of a tolerance of ± 2 to 4 Da for data from MALDI machines, and for ES machines, a tolerance of ± 1 or 2 Da. The second thing to consider is if you are doing single species matching or cross-species matching (see 2.2.4). With single species matching, you need not allow any extra tolerance, as you are identifying your peptides by direct comparison with those in the database. However, if you are doing cross-species matching, you should allow extra mass tolerance. Thus, for example, if there is a single AA substitution in a peptide from cat that you are comparing with the same peptide from human, a tolerance window of ± 5 or 6 Da will cover many possible single AA substitutions in that peptide that do not affect the cleavage sites.
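The tolerance test from the 934.3 Da example can be written in a few lines. The function names are illustrative only.

```python
def tolerance_window(observed_mass, tolerance):
    """Lower and upper bounds of the acceptance window around an observed mass."""
    return observed_mass - tolerance, observed_mass + tolerance


def is_hit(observed_mass, theoretical_mass, tolerance):
    """True if the theoretical mass lies within +/- tolerance of the observed mass."""
    return abs(observed_mass - theoretical_mass) <= tolerance


low, high = tolerance_window(934.3, 2.0)
print(round(low, 1), round(high, 1))
print(is_hit(934.3, 933.0, 2.0), is_hit(934.3, 936.5, 2.0))
```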

2.5.7 Scores for Peptide Mass Fingerprinting

The score or hitrate for peptide mass fingerprinting is simply the number of peptides that matched with those from an entry in the Swiss-Prot database. In the output, you will see the masses of peptides that have matched with those for a particular entry, as well as the name of the chain/peptide, if the entry has resulted from a cleavage of a larger precursor molecule.
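The hit count can be sketched as below. It assumes that each user-entered peptide counts at most once per database entry, which is one reasonable reading of "the number of peptides that matched". The function names and candidate entries are hypothetical.

```python
def count_hits(observed_masses, theoretical_masses, tolerance):
    """Number of observed peptides matching at least one theoretical peptide."""
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical_masses)
        for obs in observed_masses
    )


def rank_by_hits(observed_masses, candidates, tolerance):
    """Rank candidate entries by hit count, highest first."""
    scored = [
        (count_hits(observed_masses, masses, tolerance), entry_id)
        for entry_id, masses in candidates.items()
    ]
    return sorted(scored, key=lambda pair: -pair[0])


observed = [934.3, 1200.0, 1500.5]
candidates = {
    "ENTRY_A": [933.0, 1501.0],
    "ENTRY_B": [1200.4],
}
print(rank_by_hits(observed, candidates, tolerance=2.0))
```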

2.6 Results

Here you can specify which lists of results you wish the program to send to you by e-mail. For the result of the AA composition search, you can choose to have the lists of

closest Swiss-Prot entries (in terms of AA composition, pI and Mw not considered) for selected species and keyword,
closest Swiss-Prot entries (in terms of AA composition, pI and Mw not considered) for selected keyword and any species,
closest Swiss-Prot entries for selected species and keyword, having pI and Mw values in the specified range,

or any combination of them by clicking in the appropriate box(es). For the selected list(s), you can specify whether you wish to see the AA composition scores, the peptide mass fingerprinting scores, and/or the integrated scores. The default is to have AA composition results, peptide mass fingerprinting results and a list of integrated scores for the closest Swiss-Prot entries from the specified species and keyword, having pI and Mw in the specified ranges. See above sections for notes concerning the different types of output and their interpretation.

Note! If you haven't clicked the check box for peptide mass fingerprinting, you will not receive results even if you ask for them here! You will receive only the AACompIdent output!

Last modified 22/Jan/2003 by ELG