PeptIdent Documentation

PeptIdent is a tool that allows the identification of proteins using peptide mass fingerprinting data.

Experimentally measured peptide masses are compared with the theoretical peptide masses calculated for all proteins in a protein sequence database.

The PeptIdent tool calculates the theoretical peptides of all proteins in the Swiss-Prot/TrEMBL databases by "cutting" them with the enzyme of choice, and calculating the theoretical masses of generated fragments. These peptides and their masses are stored in a precomputed index. PeptIdent matches the masses of your experimentally observed peptides with all peptide masses in this index. Best-matching database proteins are ranked by the number of hits they have with the observed experimental peptides.
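The digestion and mass calculation described above can be sketched as follows. This is a minimal illustration, not PeptIdent's actual implementation: the cleavage rule (cut after K or R unless the next residue is P) is a commonly used simplification for trypsin and is an assumption here, the function names `tryptic_digest` and `peptide_mass` are hypothetical, the masses are standard monoisotopic residue masses, and the example sequence is invented.

```python
# Standard monoisotopic residue masses in Dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-terminus)


def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass [M] of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Hypothetical sequence, for illustration only.
for pep in tryptic_digest("AKRPGKML"):
    print(pep, round(peptide_mass(pep), 4))
```

A precomputed index, as used by PeptIdent, would store such peptide/mass pairs for every database protein once, instead of recomputing them for each query.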

Isoelectric point, molecular weight and a species (or group of species) can be specified in order to restrict the number of candidate proteins and reduce false positive matches.

PeptIdent makes extensive use of the annotations in Swiss-Prot/TrEMBL and, unlike other peptide mass fingerprinting identification programs, takes into account post-translational modifications as documented in Swiss-Prot. Accordingly, PeptIdent removes signal sequences and/or propeptides, as documented in the FT lines of the Swiss-Prot feature table, before computing pI, Mw and peptide masses for each of the resulting mature forms. The program returns not only a list of likely protein identifications, but also any hits with peptides that are known to carry any of more than 20 different types of discrete post-translational modifications. The program thus offers a degree of protein characterization as part of the identification procedure.

The mass effects of several chemical protein modifications, such as oxidation of methionine or acrylamide adducts of cysteine residues, or desired alkylation products of cysteine residues can also be considered by the program.

Note! The PeptIdent tool does not consider any protein glycosylation apart from O-GlcNAc and C-mannosylation on tryptophan. N-linked and larger O-linked sugar structures are generally of unpredictable mass.

Note! PeptIdent does not do any de novo prediction of post-translational modifications on proteins. All modified peptides shown in the results will be the verification of an event documented in Swiss-Prot. However, PeptIdent can match peptides whose modifications are documented in Swiss-Prot as «potential» or «by similarity», and thus allows predicted post-translational modifications to be validated. (See the document Swiss-Prot annotation: how is biochemical information assigned to sequence entries for the use of the terms «potential», «by similarity» or «probable» in the annotations.)

The FindMod, GlycoMod and FindPept tools can be used after protein identification with PeptIdent. They allow:

the de novo prediction and discovery of protein post-translational modifications, as well as the prediction of potentially mutated amino acid sequences (FindMod),
the prediction of possible oligosaccharide structures occurring on proteins (GlycoMod), and
the identification of peptides that potentially result from unspecific cleavage (FindPept).

PeptIdent results are displayed on-line or can be sent to you by e-mail, in the form of an HTML table. The result file contains direct links to FindMod, GlycoMod and FindPept, which further characterize matching proteins by predicting potential post-translational modifications and finding potential single amino acid substitutions; to PeptideMass; and to BioGraph for the graphical representation of the theoretical spectrum.

1. Name of the unknown protein

In the « Name of the unknown protein:» text box, supply a name or a code number for the query protein. If you run numerous searches, give a different name to each unknown protein. This is helpful if you want to archive your query and identify it later.

2. Select database

Under «Database:», select the database(s) to use for the search. You have the choice of:

Consult the Swiss-Prot/TrEMBL page, the SPTR README file or the Swiss-Prot/TrEMBL references for more information about these databases.
When searching Swiss-Prot, all alternative splice isoforms described in Swiss-Prot feature tables are included in the search (e.g. Isoform 12S of O43184). For splice isoforms, no processing or post-translational modifications documented in the Swiss-Prot feature table are taken into account, as they are usually documented for the primary isoform described in the original entry.

Note! Peptides with masses >6000 Dalton are not indexed and therefore not considered in the search.

Note! Annotation in the TrEMBL database is done automatically; it is therefore incomplete and not always correct. Where available, TrEMBL annotation is used in the same way as Swiss-Prot annotation to process the proteins into mature chains or peptides. TrEMBL results should therefore be interpreted with care.

Note! Some Swiss-Prot/TrEMBL entries contain ambiguous residues (X = any amino acid, B = Asx = Asp(D) or Asn(N), Z = Glx = Gln(Q) or Glu(E)). Examples of such entries are P19341 and O77721. As substitution of D by N or of Q by E induces mass differences of about 1 Dalton, it is not possible to compute exact masses for peptides containing one or more residues B, Z or X. Those peptides are therefore not included in the index.

3. pI and pI range

In the « pI:» box, specify the pI of the protein of interest, if known. This should be estimated from a 2-D gel. You can also specify the confidence you have in your pI estimation by selecting the appropriate number under « pI range: ».

If no number is specified, an unrestricted pI range (from 0 upward) is assumed. This covers all pI values and returns all proteins within the specified Mw region, regardless of the pI.

Note! For bacterial proteins separated in IPG gradients, a range of ± 0.25 around the estimate is usually sufficient. For eukaryotic proteins, increase this range to ± 0.5 units if the proteins are thought to be unmodified. If there is a high probability that the eukaryotic protein carries charge-modifying modifications, such as sialic acid, the range should be changed to ± 1.

Note! If you only have a vague idea of the protein pI, use a very large range. Even using a pI with a large range can increase the power of your search.

In the « Mw» box, specify the estimated mass of your protein, in Dalton. This can be estimated from a 2-D gel, or from mass spectrometry (MS) of the entire protein. You should also specify the confidence that you have in your Mw estimate, in percent, by selecting the appropriate number in the « within Mw range (in percent): » box. This limits the search to proteins within the specified molecular weight range. If no number is specified, an unrestricted molecular weight range (from 0 upward) is assumed, which means that the program searches the whole database, including proteins with a size of more than 2'000'000 Dalton (human Titin, heart isoform, Q10466).

Note! For bacterial proteins larger than 20 kDa, a range of ± 20% around the Mw estimate is usually sufficient. For small proteins, allow a ± 40% range. For cytoplasmic eukaryotic proteins these ranges are also usually sufficient, but secreted eukaryotic proteins often carry post-translational modifications that require ranges of ± 40% and ± 100% or more, respectively, to be inclusive. If masses have been determined with MS, the ranges used can be much smaller. However, note that if MS has been used to determine the size of a glycoprotein or other heavily modified protein, the measured mass may be considerably larger than the mass of the unmodified polypeptide predicted in the database.

Note! If you only have a vague idea of the protein Mw, you can use a large range. However, as the proteins are ranked by the number of matching peptide masses, very large proteins are likely to obtain a high score and appear at the top of the list. Eliminating proteins with high molecular weight can reduce random matches. Whenever you have an idea about the Mw range, it is highly recommended to use this information in the identification to speed up searches and to reduce «false positives».
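As a rough illustration of how the pI and Mw restrictions narrow the candidate list, the sketch below keeps a protein only if its theoretical pI and Mw fall inside the user-specified windows. The function name `passes_filters` and its parameters are hypothetical and do not reflect PeptIdent's internal code; leaving a parameter unset is taken to mean "no restriction", as described above.

```python
def passes_filters(protein_pi, protein_mw,
                   pi=None, pi_range=None, mw=None, mw_percent=None):
    """Return True if a database protein lies inside the pI and Mw windows.

    pi_range is an absolute value (pI units); mw_percent is a percentage
    of the estimated Mw. Unset parameters impose no restriction.
    """
    if pi is not None and pi_range is not None:
        if abs(protein_pi - pi) > pi_range:
            return False
    if mw is not None and mw_percent is not None:
        if abs(protein_mw - mw) > mw * mw_percent / 100.0:
            return False
    return True


# Estimated pI 5.5 +/- 0.25 and Mw 50000 Da +/- 20% (40000 to 60000 Da).
print(passes_filters(5.7, 59000, pi=5.5, pi_range=0.25, mw=50000, mw_percent=20))
print(passes_filters(5.7, 61000, pi=5.5, pi_range=0.25, mw=50000, mw_percent=20))
```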

From the pull-down menu « Species to be searched: », select a species or taxonomic range to limit the search to proteins from the specified organism(s). In this case the peptide mass data is matched only against proteins from the specified organism(s), thus eliminating many irrelevant proteins from unrelated organisms.

To match your peptides against peptides from all species in the database, select "ALL". This option is not recommended without good reason, as it unnecessarily increases the search space and causes a significant number of unrelated false positive matches to appear.

Note! "Single species matching" means that you match proteins from, for example, E. coli only against the E. coli proteins in the database. This is a good approach to use when the organism you are working with is molecularly well defined, or ideally, the subject of a genome project.

Note! If the source of your proteins is not molecularly well defined, it is best to do "cross-species matching". Thus, for example, if you are working with proteins from Candida albicans, you may wish to either match your proteins against all proteins from fungi or against the fully sequenced yeast Saccharomyces cerevisiae.

Note that when cross-species matching, protein pI is frequently poorly conserved [ref], but protein mass is generally very well conserved. You should take this into consideration when setting your pI and Mw ranges.

Note! Peptide masses are not well conserved across species boundaries. The poor conservation of peptide mass data is expected, as a single amino acid substitution in any peptide can drastically change its mass [1].

Note! Apart from a number of model organisms (e.g. human, bovine, rat, mouse, E. coli, etc.), the pull-down menu also contains groups of species. This is useful if working on, for example, cats, as one can match against all proteins in the database described for mammals. If you are in doubt about the taxonomic classification, you can consult the NEWT Taxonomy Browser.

Note! Proteins that are 100% conserved between different species are merged into a single Swiss-Prot entry, e.g. UBIQ_HUMAN, CALM_HUMAN. In such entries, all source organisms are listed in the OS (Organism Species) lines, e.g. for actin, P03996:

OS   Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),
OS   Bos taurus (Bovine), and Oryctolagus cuniculus (Rabbit).

However, the OC (Organism Classification) lines will only contain the taxonomy of the first listed species:

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.

In such cases, a PeptIdent query with RODENTIA will miss the P03996 actin entry which describes the mouse actin sequence, but contains the organism classification for Homo sapiens.

Enter a list of peptide masses

Enter the experimentally measured peptide masses generated from the unknown protein in the « Enter a list of peptide masses... » text field, and separate them by spaces, tabs or new lines.

Note! You can copy a list of peptides from Excel or other applications and paste them directly into the text field.

Note! Avoid using peptide masses known to result from autodigestion of the enzyme (e.g. trypsin), or from other artefactual peaks (e.g. matrix peaks).
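Reading such a pasted list amounts to splitting on any whitespace. The following is a minimal sketch under that assumption; the function name `parse_mass_list` is hypothetical.

```python
def parse_mass_list(text):
    """Split a pasted list on spaces, tabs or new lines and convert to floats."""
    return [float(token) for token in text.split()]


# Masses separated by a mixture of spaces, tabs and new lines.
print(parse_mass_list("833.319 854.843\t863.419\n872.402"))
```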

Upload a .pkm, .dta or text file

If the peptide mass fingerprinting data is stored in a file of one of the formats listed below, you can also upload the file directly from your computer:

(1) Click on the « Browse...» button

(2) select the file containing the relevant peptide mass data and

(3) click on the « Open» button

The peptide masses will then be extracted automatically from this file.

.pkm format, produced by the Voyager software of Perseptive Biosystems or the GRAMS software

Center X Peak Y Left X Right X Time X Mass Difference Name
STD.Misc Height Left Y Right Y %Height,Width,%Area,%Quan,H/A
833.319 2189 833.260 833.378 0.016 0 0
854.843 5078 854.769 854.917 0.001 0 0
863.419 5108 863.064 863.775 0.001 0 0
872.402 12519 872.347 872.456 0.002 0 0
C 0.? 0 11417 11417
874.395 6730 874.331 874.460 0.002 0 0
887.786 5903 887.540 888.031 0.003 0 0
898.475 3329 898.416 898.534 0.006 0 0
904.366 7432 904.199 904.533 0.001 0 0
955.300 2598 955.229 955.371 0.011 0 0
973.845 16689 973.749 973.941 0.001 0 0

All lines before the line containing ‘H/A’ are ignored. After that, only lines which do not contain any capital letters in the first 20 characters are retained. From the retained lines, the first column is interpreted as the mass value, and the second column (if present) as the peak intensity. The intensity is only important if the BioGraph tool is used subsequently to the identification.
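The extraction rules for .pkm files stated above can be sketched as follows. This is only an illustration of those rules, not the parser used by PeptIdent; the function name `parse_pkm` is hypothetical, and the sketch assumes that the line containing 'H/A' is itself discarded (it begins with capital letters in any case).

```python
def parse_pkm(text):
    """Extract (mass, intensity) pairs from the text of a .pkm file."""
    peaks = []
    header_done = False
    for line in text.splitlines():
        if not header_done:
            # Skip everything up to and including the 'H/A' header line.
            if "H/A" in line:
                header_done = True
            continue
        if not line.strip():
            continue
        # Discard lines with a capital letter in the first 20 characters.
        if any(ch.isupper() for ch in line[:20]):
            continue
        fields = line.split()
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mass, intensity))
    return peaks


sample = "\n".join([
    "Center X Peak Y Left X Right X Time X Mass Difference Name",
    "STD.Misc Height Left Y Right Y %Height,Width,%Area,%Quan,H/A",
    "833.319 2189 833.260 833.378 0.016 0 0",
    "C 0.? 0 11417 11417",
    "874.395 6730 874.331 874.460 0.002 0 0",
])
print(parse_pkm(sample))
```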

The first line is considered as a comment and is ignored. All subsequent lines are interpreted to contain a mass and an intensity (if any), and mass values are taken into account if the corresponding intensity is > 0.
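A minimal sketch of the rule just described (comment line first, then mass and optional intensity per line) is shown below. The function name `parse_mass_intensity_file` is hypothetical, and keeping a mass that has no intensity column is an assumption, since the text only states that masses with an intensity > 0 are taken into account.

```python
def parse_mass_intensity_file(text):
    """Skip the first (comment) line, then read mass and optional intensity."""
    peaks = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if not fields:
            continue
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        # Assumption: a mass without an intensity column is kept.
        if intensity is None or intensity > 0:
            peaks.append((mass, intensity))
    return peaks


print(parse_mass_intensity_file("my spectrum\n500.1 10\n600.2 0\n700.3\n"))
```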

Any user-created files can be uploaded if they correspond to the following rules:

7. Charge state (ion mode)

You can enter the masses of your peptides as [M] (molecular mass data), [M-H]⁻ (negative ion mode, deprotonated molecular ions) or [M+H]⁺ (positive ion mode, protonated molecular ions). In each case, you must select the corresponding button.

If you select the [M+H]⁺ button, all peptide masses calculated from the database will have one proton (mass of 1 unit) added before matching with user-specified peptides, thus giving values for singly charged peptides as found in electrospray and MALDI-TOF mass spectrometers.

With [M-H]⁻ selected, all theoretical peptide masses will have one proton (one mass unit) removed before matching.
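The adjustment of theoretical masses for the selected ion mode can be illustrated as follows. The function name `adjust_for_ion_mode` and the mode strings are hypothetical, and the proton mass of 1.00728 Da is an assumption for this sketch: the text above only speaks of "one mass unit".

```python
PROTON = 1.00728  # assumed proton mass in Da; the text says "one mass unit"


def adjust_for_ion_mode(neutral_mass, mode):
    """Convert a theoretical mass [M] to the selected ion mode."""
    if mode == "[M+H]+":
        return neutral_mass + PROTON
    if mode == "[M-H]-":
        return neutral_mass - PROTON
    if mode == "[M]":
        return neutral_mass
    raise ValueError("unknown ion mode: " + mode)


print(adjust_for_ion_mode(1000.0, "[M+H]+"))
print(adjust_for_ion_mode(1000.0, "[M-H]-"))
```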

Specify whether the experimental mass values are « average» values or « monoisotopic» values by ticking the appropriate box. The theoretical masses of the peptides in the database will be calculated accordingly.

If the unknown protein has been reduced and alkylated, you should specify the reagent used for the alkylation under the " with cysteines treated with: " menu. This can be either iodoacetamide (forms carboxyamidomethyl cysteine, Cys_CAM), iodoacetic acid (forms carboxymethyl cysteine, Cys_CM) or 4-vinylpyridine (forms pyridyl-ethyl cysteine, Cys_PE).

If the protein has not been treated in this manner, you should select the option " nothing (in reduced form) " (default setting).

Note! Proteins are usually subject to reduction and alkylation before they are used to generate peptides. This ensures that all disulfide bonds in a protein are broken.

Note! The program will modify the theoretical masses of Cys-containing peptides accordingly, before matching with the experimental peptide masses.
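The effect of the selected alkylation reagent on a theoretical peptide mass can be sketched as below. The function name `alkylated_mass` is hypothetical, the mass shifts are standard monoisotopic values for the three modifications, and the sketch assumes that every cysteine in the peptide is modified.

```python
# Monoisotopic mass added per cysteine residue, in Dalton.
CYS_SHIFT = {
    "none": 0.0,           # "nothing (in reduced form)"
    "Cys_CAM": 57.02146,   # iodoacetamide
    "Cys_CM": 58.00548,    # iodoacetic acid
    "Cys_PE": 105.05785,   # 4-vinylpyridine
}


def alkylated_mass(peptide, unmodified_mass, reagent):
    """Add the reagent-specific shift once for each Cys in the peptide."""
    return unmodified_mass + peptide.count("C") * CYS_SHIFT[reagent]


# Hypothetical peptide with two cysteines and an unmodified mass of 1000 Da.
print(round(alkylated_mass("ACDCK", 1000.0, "Cys_CAM"), 5))
```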

It is common for proteins separated by polyacrylamide gel electrophoresis that reduced cysteines react with free acrylamide monomers. PeptIdent can therefore be used with the option «acrylamide adducts on cysteines». With this option selected, the program considers all cysteine residues in a peptide as potentially modified and forming propionamide cysteine, Cys_PAM.

11. Oxidized methionines

You can request that all methionines in theoretical peptides be oxidized to form methionine sulfoxide (MSO). If this option is selected, the program will modify the theoretical masses of all Met-containing peptides accordingly, before matching with user-specified peptides.

Note! Proteins separated by gel electrophoresis often show this modification.

General note about artefactual cysteine and methionine modifications:

If more than one cysteine or methionine residue can be found in the peptide, the masses for every possible number of modifications will be calculated. For example, if there are three methionine residues in a peptide, the masses of peptides having zero, one, two or three oxidized methionines will be calculated. The program can also account for post-translational modifications in conjunction with artefactual modifications in a peptide. This can be very useful. However, one should be aware that by computing all combinations of possible artefactual or post-translational modifications, a considerable amount of «noise» is added to the database of peptide fragments. If you think your peptides are not likely to carry artefactually modified cysteines or methionines, it is recommended not to select any of these modifications in order not to artificially increase the search space.
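The growth of the search space described above can be made concrete with a small sketch that enumerates every combination of oxidized methionines and acrylamide-modified cysteines for one peptide. The function name `variable_mod_masses` is hypothetical, and the mass shifts are standard monoisotopic values (MSO: +15.99491 Da, Cys_PAM: +71.03711 Da); this is not PeptIdent's own code.

```python
from itertools import product

MSO_SHIFT = 15.99491  # oxidation of Met to methionine sulfoxide (Da)
PAM_SHIFT = 71.03711  # acrylamide adduct on Cys, propionamide cysteine (Da)


def variable_mod_masses(peptide, base_mass, oxidize_met=True, acrylamide_cys=False):
    """Return (n_MSO, n_PAM, mass) for every possible modification count."""
    max_mso = peptide.count("M") if oxidize_met else 0
    max_pam = peptide.count("C") if acrylamide_cys else 0
    variants = []
    for n_mso, n_pam in product(range(max_mso + 1), range(max_pam + 1)):
        mass = base_mass + n_mso * MSO_SHIFT + n_pam * PAM_SHIFT
        variants.append((n_mso, n_pam, mass))
    return variants


# Hypothetical peptide with three methionines: 0, 1, 2 or 3 x MSO.
for n_mso, n_pam, mass in variable_mod_masses("MAMKM", 1000.0):
    print(n_mso, n_pam, round(mass, 5))
```

With both options enabled, a peptide with three methionines and two cysteines already yields 4 x 3 = 12 candidate masses instead of one, which is why unneeded modifications are best left unselected.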

In the « Mass tolerance: ±... » text box, enter the mass tolerance to be used around your peptides during matching. Select whether the mass tolerance is expressed as an absolute value (in Dalton) or as a relative value (in parts per million; ppm).

Note! If you have a peptide mass of 934.3 Da and specify a tolerance of ± 0.5 Da, a hit will be registered with a protein if one of its peptides in the database has a mass between 933.8 Da and 934.8 Da.

Note! If you have a peptide mass of 934.3 Da and specify a tolerance of ± 100 ppm, a hit will be registered with a protein if one of its peptides in the database has a mass between 934.207 Da and 934.393 Da.
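The two examples above follow from a simple window calculation, sketched here. The function name `mass_window` is hypothetical.

```python
def mass_window(mass, tolerance, unit):
    """Return the (lower, upper) matching window around an experimental mass."""
    if unit == "Da":
        delta = tolerance
    elif unit == "ppm":
        delta = mass * tolerance / 1e6  # relative tolerance scales with mass
    else:
        raise ValueError("unit must be 'Da' or 'ppm'")
    return mass - delta, mass + delta


for tol, unit in [(0.5, "Da"), (100, "ppm")]:
    low, high = mass_window(934.3, tol, unit)
    print(unit, round(low, 3), round(high, 3))
```

For 934.3 Da, a tolerance of 100 ppm corresponds to about 0.093 Da; the same 100 ppm around a 3000 Da peptide would correspond to 0.3 Da.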

Note! The mass tolerance should reflect the known accuracy of your mass spectrometer (MS). Both MALDI and ES machines are now capable of achieving mass accuracy to a single decimal place; however, this may depend on the care taken in machine calibration and the use of internal standards. We recommend a tolerance of 0.2 Da or 200 ppm, or better, whenever possible.

ESI-TOF mass spectrometers or MALDI-TOF instruments equipped with delayed extraction and ion reflectors are ideal for this, since most can deliver monoisotopic masses within ± 40 ppm when two-point internal calibration is used.

Less accurate peptide mass data will require a larger mass tolerance and will result in a lower accuracy of your search.

Note! Mass spectrometers typically have a mass dependent error associated with mass measurements, which cannot be uniformly expressed in Dalton. The use of ppm can therefore be more accurate.

13. Cleavage agent
Under «Enzyme:» you can specify the enzyme that you used to generate your peptides. See here for the cleavage rules.

Note! The current version of PeptIdent only supports tryptic cleavage.

14. Missed cleavage sites

In order to take into account partial cleavages, you can specify a maximum number (0, 1 or 2) of missed cleavage sites to be allowed.

If the maximum number of missed cleavages entered is 1, all concatenations of two adjoining peptides are also added to the list of theoretical peptides under consideration.

Note! If you are confident that your digest was complete, with no partial fragments present, choose the setting 0 (default setting). This will give maximum discrimination and keep the number of random matches low.

Note! If experience shows that your digest usually includes some peptides with missed cleavage sites, you should specify a setting of 1, rarely 2. However, keep in mind that each additional level of missed cleavages increases the number of calculated peptide masses to be matched against the experimental data and the number of random matches.
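The effect of the missed cleavage setting on the list of theoretical peptides can be sketched as follows. Starting from the fully cleaved peptides in sequence order, every run of up to (maximum + 1) adjoining peptides is added. The function name `with_missed_cleavages` and the example peptides are hypothetical.

```python
def with_missed_cleavages(peptides, max_missed):
    """Return (sequence, number_of_missed_cleavages) for all allowed peptides.

    `peptides` are the fully cleaved peptides in their order in the protein.
    """
    result = []
    for i in range(len(peptides)):
        for k in range(max_missed + 1):
            if i + k < len(peptides):
                result.append(("".join(peptides[i:i + k + 1]), k))
    return result


# Three fully cleaved peptides; with one missed cleavage allowed,
# the two concatenations of adjoining peptides are added.
print(with_missed_cleavages(["AK", "RPGK", "ML"], 1))
```

For a protein with n fully cleaved peptides, a setting of 1 adds n - 1 peptides and a setting of 2 adds a further n - 2, which illustrates why higher settings increase the number of random matches.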

15. Minimal number of peptide matches required

Under « Report only proteins with... » specify the minimum number of peptide mass hits you require a matching protein to show for it to be included in the result list. The default value is 4.

16. Maximum number

Limit the number of matching proteins displayed in the result report by selecting the appropriate value in the « Display a maximum of... » menu. The default value is 20.
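The two reporting options above (minimum number of peptide hits, maximum number of displayed proteins) can be sketched together. The function name `build_report` and the protein labels are hypothetical; the defaults of 4 and 20 are those stated in the text.

```python
def build_report(hit_counts, min_hits=4, max_proteins=20):
    """Keep proteins with enough hits, rank by hits, truncate the list.

    `hit_counts` maps a protein identifier to its number of matching peptides.
    """
    kept = [(ac, hits) for ac, hits in hit_counts.items() if hits >= min_hits]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_proteins]


# Hypothetical hit counts for four candidate proteins.
counts = {"A": 3, "B": 7, "C": 4, "D": 5}
print(build_report(counts))
print(build_report(counts, max_proteins=2))
```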

17. Print sequence information

You can specify if you would like the result to include, for each high-scoring protein, information about the sequence portion covered by the matching peptides. If this is selected, the protein sequence will be displayed, and all matching peptides will be highlighted in color and upper case letters.

18. Send result by e-mail

PeptIdent results are displayed on-line in your browser window or can be sent by e-mail.

If the results should be sent back to you by e-mail, tick the « Send the result by e-mail » box. In the « Your e-mail:» text field, enter the e-mail address (e.g. name@unknown.ch) to which the results should be sent. The e-mail option is recommended, in particular for queries with a high number of peptide masses or for searches against large sections of the database. This avoids timeouts («document contains no data»), which can occur with the on-line option: the browser interrupts the connection with the program if the search has not finished after a certain time (usually about 3 minutes).

Note! If you select the e-mail option, your information will be sent to the ExPASy computer, which then undertakes the matching and returns the results to you by e-mail. It operates in batch mode, which means several searches can be sent successively, without having to wait for the result of the preceding query.

Note! In batch mode, only a very limited number of requests can be treated simultaneously, and your query will be queued for processing. Usually, results are sent back within a few minutes. However, if the batch queue already contains a number of requests, it will take longer (even up to a few hours in the worst case) for your query to be returned. Please allow for a certain time and do not unnecessarily resubmit your request. If you do not receive any results, it is possible that you made an error when you specified your email address. If you think there is a problem with the server, contact the server administrator and specify the time of submission, details about your search parameters and whether you got any error messages. Do not forget to specify which of the ExPASy mirror sites you were using.

19. Start PeptIdent

Once you have filled in the form, click on the « Start PeptIdent » button to start the program.

If you have made a mistake and would like all fields to be reset to their default values, press the « Reset» button.

IV. Result Output

1. Summary of search parameters

The top part of the page provides a summary of all user-specified search parameters, as well as the date of the query, the database release number and current number of entries, and a button to perform a new PeptIdent search.

2. List of matching database entries

The second part of the result page contains a summary of the best-matching proteins from the database, each with a «quick jump» link to the detailed peptide information provided further down on the same page (see the following section). This summary provides the following information for each of the high-scoring proteins:

Score: The score or hit-rate for peptide mass fingerprinting is simply the number of peptides that match the theoretical peptides from a database entry divided by the total number of peptide masses specified for the search.

# peptide matches: Number of peptides that matched those from a database entry

AC: Swiss-Prot/TrEMBL accession number (AC)

ID: Swiss-Prot/TrEMBL entry name (ID)

Description: Name of the matching protein (Swiss-Prot/TrEMBL DE line). If the matching sequence results from cleavage of a larger precursor molecule, the name of the chain or peptide (Swiss-Prot/TrEMBL FT CHAIN or PEPTIDE line) will be displayed.

pI: theoretical isoelectric point of the matching protein.

Mw: theoretical molecular weight of the matching protein in Dalton.
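The score defined in the list above (matching peptides divided by the total number of peptide masses entered) can be sketched as follows. The function name `hit_rate` is hypothetical, and the sketch uses an absolute tolerance in Dalton for simplicity; it is an illustration, not PeptIdent's code.

```python
def hit_rate(user_masses, theoretical_masses, tolerance_da):
    """Return (number of matching user masses, score) for one database entry."""
    hits = 0
    for observed in user_masses:
        # A user mass counts once if any theoretical peptide is within tolerance.
        if any(abs(observed - theo) <= tolerance_da for theo in theoretical_masses):
            hits += 1
    return hits, hits / len(user_masses)


# Four experimental masses, two of which match within +/- 0.2 Da.
print(hit_rate([500.0, 600.0, 700.0, 800.0], [500.1, 700.3, 799.9], 0.2))
```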

3. Details of each match

user mass: Experimentally measured molecular mass of the peptide as provided by the user.

matching mass: Theoretical molecular mass of the matching peptide as calculated from the database entry. This mass may already include modification (post-translational or artefactual) – in this case the type of modification is detailed in the «modification» column.

Delta mass: Difference between experimentally measured and theoretical peptide masses. This mass difference is given in Da or ppm, depending on the mass tolerance unit specified by the user.

#MC: number of missed cleavage sites considered for the calculation of the theoretical molecular mass of the peptide.

modification: modification(s) of the peptide that were considered for the calculation of the peptide mass.

The following format is used to describe these modifications:

If only one (post-translational or artefactual) modification can be present, i.e. the peptide contains only one Met, or only one Cys, or only one PTM documented in Swiss-Prot:
Mod_type: position
(examples: «Cys_CM: 32», or «CARB: 17»)
If at least two (post-translational or artefactual) modifications are possible in the peptide:
Number of occurrences of the modification x Mod_type
(example: «2xCys_CM, 1xGLCNAC», or «1x METH», or «3xMSO»)

The abbreviations used for the different types of modifications are listed here.

position: sequence position of the matching peptide in the database entry. If the protein under consideration is the result of post-translational processing into a mature chain or peptide, the position information used corresponds to the numbering in the underlying Swiss-Prot/TrEMBL entry, i.e. to the numbering relative to the precursor sequence.

example:

The position given for the N-terminal tryptic peptide ANSFLEELRPGNVER of the mature chain ‘Protein C light chain (40-194)’ of PRTC_BOVIN (P00745) is 40-54, and not 1-15.

peptide: amino acid sequence of the matching peptide

Delta pI: difference between user-specified/estimated pI and theoretical pI of the matching protein in the database.

Delta Mw: difference between user-specified/estimated Mw and theoretical Mw of the matching protein in the database.

% of sequence covered: Percentage of the protein sequence covered by the matching peptides. To calculate this percentage, the number of amino acids contained in at least one of the matching peptides is divided by the length of the protein / mature peptide.

Sequence: The sequence of the matching protein, printed in lower case letters. The regions of the sequence that are matching the query peptides are highlighted in red and printed in capital letters. The sequences can start from positions higher than position 1. This reflects the removal of propeptides and signal sequences.
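The «% of sequence covered» value described above can be illustrated with a short sketch that counts each residue once, even when matching peptides overlap. The function name `sequence_coverage` and the example ranges are hypothetical; positions are taken as 1-based and inclusive, as in the «position» column.

```python
def sequence_coverage(protein_length, matched_ranges):
    """Percentage of residues contained in at least one matching peptide.

    `protein_length` is the length of the protein or mature chain;
    `matched_ranges` holds (start, end) positions, 1-based and inclusive.
    """
    covered = set()
    for start, end in matched_ranges:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length


# Two overlapping peptides (1-10 and 5-20) plus one at the C-terminus (91-100).
print(sequence_coverage(100, [(1, 10), (5, 20), (91, 100)]))
```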

V. Links

The PeptIdent output contains links to a number of related programs on ExPASy. These links are available both from the on-line result page and from the html file returned by email. Relevant input data and/or information about the matching database entry are automatically transferred to those programs.

PeptIdent

A new PeptIdent search can be launched directly from the result output. This allows you to:

Submit a second search with slightly modified parameters, i.e. with modified molecular weight or pI ranges, number of missed cleavages, taxonomic range etc.
Resubmit an archived query at a later stage, for later database releases. This is particularly useful if the initial identification was unsuccessful or ambiguous.

Note for users who wish to use the BioGraph visualization tool after a second search submitted via the «New Search» button: If peptide masses were initially specified through the upload of a file which also contained peak intensities (e.g. pkm or dta format), the peptide masses from the original file will appear pasted into the mass window of the new search form, where they can be modified if desired (e.g. peaks believed to be «noise» or caused by a contaminant or by autodigestion of trypsin can be removed). Peak intensities, however, as read from the pkm or dta file, are transferred from the first to the second PeptIdent query as hidden parameters. They are taken over from the original query and cannot be modified for a second search.

It is therefore not recommended to modify, remove or add a mass before resubmitting data from an earlier query if you intend to use the BioGraph link to visualize the theoretical spectrum. Any such change would affect only the mass value itself, whereas the peak intensities from the original spectrum would remain unchanged. If a mass is inserted or deleted, the intensities are shifted and no longer associated with their original mass values. For the BioGraph tool to work correctly, it is better to apply modifications directly to the mass/intensity file and to upload the file again.

Furthermore, direct links to a number of characterization / visualization tools are available for each matching candidate protein:

FindMod

The FindMod tool can be used to predict post-translational modifications or single amino acid substitutions. This is done by comparing experimental peptide masses that did not match with the protein against those calculated from the assigned protein sequence, seeking mass differences that may be due to post-translational modifications. A number of rules are applied, trying to determine whether a post-translational modification suggested by mass difference is likely to occur in the peptide under consideration.

GlycoMod

The GlycoMod tool can be used to predict the possible oligosaccharide structures that occur on proteins from their experimentally determined masses. The program can be used for free or derivatized oligosaccharides and for glycopeptides.

FindPept

The FindPept tool can be used to identify peptides that result from unspecific cleavage of proteins from their experimental masses, taking into account artefactual chemical modifications, post-translational modifications (PTM) and protease autolytic cleavage.

PeptideMass

The PeptideMass tool can be used to simulate a theoretical digest of the matching database protein.

BioGraph

The BioGraph tool allows a graphical and interactive representation of the PeptIdent results. A theoretical spectrum is displayed, in which peaks corresponding to matching and unmatching peptides are shown in different colors.

VI. Comments

The output of a peptide query contains a list of proteins ranked by the number of peptides shared with the unknown protein, where the correct identification for an unknown protein is likely to be that with the largest number of peptide «hits».

Confidence in identification is achieved by looking for a significant difference in the number of matching peptides between the top and second ranked protein, and a good sequence coverage of the top ranked protein with the experimentally determined peptides.

Peptide mass data can represent a starting point for the examination of protein modifications or processing: Peptides from a protein that do not match those from a database may carry post-translational or artefactual modifications, or may have undergone amino acid substitution or truncation.

Peptide mass fingerprinting will rarely find matches for all peaks in a spectrum. This is because some peptides, especially those that are large and/or very hydrophobic, are either not extracted or not quantitatively extracted from a gel or blot, and others are not ionised efficiently during mass spectrometry.

VII. References

[1] Wilkins M.R. and Williams K.L. (1997) Cross-species identification using amino acid composition: a theoretical evaluation. J. Theor. Biol. 186, 7-15.