ExPASy Home page | Site Map | Search ExPASy | Contact us | PROSITE |
Hosted by NCSC US | Mirror sites: | Canada | China | Korea | Switzerland | Taiwan |
Written by:
Inge Jonassen,
inge@ii.uib.no
Created: February 7th 1997
Pratt has integrated new pattern scoring mechanisms, including one taking into account the diversity of the sequences matching a pattern, and another taking into account the number of sequences matching each pattern.
$ tar -xvf Pratt2.1.tar
$ make
The patterns that can be found are a subset of the set of patterns
that can be described using Prosite notation.
A pattern that can be found by Pratt can be written in the form
The C parameters control how many sequences a pattern should match to
be considered by Pratt:
C%: sets the minimum percentage of the input sequences that should match a pattern.
If you set this to, say, 80, Pratt will only report patterns
matching at least 80% of the input sequences.
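As a rough illustration of what such a threshold means, the sketch below translates a pattern into a regular expression and measures the percentage of sequences it matches. This is not Pratt's own code: the function names (pattern_to_regex, percent_matching, passes_c_percent) and the test sequences are made up for this example, and only the notation elements used in this document (single letters, [..] symbols, x, x(n), x(i,j)) are handled.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as C-x(2,4)-[DE] into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if m:
            # Wildcard region: x, x(n) or x(i,j).
            low = m.group(1) or "1"
            high = m.group(2) or low
            parts.append(".{%s,%s}" % (low, high))
        else:
            # Single residue (C) or ambiguous symbol ([DE]); both are valid regex.
            parts.append(element)
    return "".join(parts)


def percent_matching(pattern, sequences):
    """Percentage of the sequences containing at least one match to the pattern."""
    regex = re.compile(pattern_to_regex(pattern))
    hits = sum(1 for seq in sequences if regex.search(seq))
    return 100.0 * hits / len(sequences)


def passes_c_percent(pattern, sequences, c_percent):
    """True if the pattern matches at least c_percent % of the sequences."""
    return percent_matching(pattern, sequences) >= c_percent


# Hypothetical input: three of these five sequences match C-x(2,4)-[DE].
sequences = ["ACAAD", "CGGGE", "CCCCC", "AACKKKKDA", "MCLD"]
print(percent_matching("C-x(2,4)-[DE]", sequences))
print(passes_c_percent("C-x(2,4)-[DE]", sequences, 80))
```

With these five sequences the pattern would be reported for a threshold of 60 but not for a threshold of 80.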
Allows the use of an alignment or a query sequence to restrict the pattern search.
If G is set to al or query, another option GF will appear
allowing the user to give the name of a file containing a
multiple sequence alignment (in Clustal W format), or a
query sequence in FastA format (without annotation).
Only patterns consistent with the alignment/matching the query
sequence will be considered.
Loosely a pattern is considered consistent with the alignment if
For more details see
I. Jonassen
Using the B options (BN,BI,BF) on the menu you can control which
pattern symbols will be used during the initial pattern
search and during the refinement phase.
In the pattern C-x(2)-[DE], C and [DE] are the symbols.
The pattern symbols that can be used are read from a file
if the BI option is set; otherwise a default set will be used.
The first 20 elements of the default set are the single amino acid
symbols, and it also contains a set of ambiguous symbols, each
containing amino acids that share some physico-chemical properties.
If BI is set, option BF will appear to allow you to give the
name of the file. In the file each symbol is given on a separate
line containing the letters that the symbol should match.
For instance the file could be:
1.1 References:
Inge Jonassen, John F. Collins, Desmond Higgins.
Protein Science 1995;4(8):1587-1595.
Inge Jonassen.
Submitted to CABIOS.
PROSITE test cases.
Inge Jonassen, Carsten Helgesen, Desmond Higgins.
Dept. of Informatics, Univ. of Bergen, Reports in Informatics no 116, Febr. 1996.
Postscript paper,
Test cases.
A. Brazma, I. Jonassen, E. Ukkonen, and J. Vilo.
In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (ISMB-96), AAAI Press 1996, p. 34-43.
More information is available.
1.2 Pattern terminology
Pratt is able to discover conserved patterns in the sequences.
A(1)-x(i1,j1)-A(2)-x(i2,j2)-...-A(p-1)-x(i(p-1),j(p-1))-A(p)
where
The product of flexibility for a pattern is the product of the
flexibilities of the flexible wildcard regions in the pattern,
if any, otherwise it is defined to be one.
A pattern component A(k) is an
A wildcard region x(ik,jk) is
Examples:
2. User manual.
2.1 Format of input sequences.
Make a file or a set of files containing the set of sequences to be
analysed. Currently, Pratt can read one of the formats:
2.2 Command line.
Command line:
Pratt <format> <filename> [options]
where <format> is one of
fasta
swissprot
and <filename> is
the name of a file containing the sequences in the given format
2.3 Using the menu to control your search.
When you run Pratt, it will give you a menu allowing you to set a variety of
parameters controlling:
Schematic figure of algorithm used in Pratt version 2.x
Overview of the pattern discovery algorithm.
The user inputs a set of unaligned sequences and the minimum
number of sequences to match a pattern.
(i):
During this phase, patterns are constrained to the pattern class
defined by the bounds set using the menu.
A pattern graph can be constructed either
from the shortest sequences in S (1),
from a special query sequence (2), or
from a multiple sequence alignment (3).
A search is done for the highest scoring patterns in the class
that can be derived from the pattern graph.
The block data structure is used to find all matches to each pattern.
(ii):
The highest scoring patterns found during this search are input to a
heuristic pattern refinement algorithm, where more ambiguous pattern
components (from a list given by the user in Pratt.sets) can be
added to the patterns found during phase (i).
The refinement phase is optional.
Sample run of Pratt version 2.1:
------------------------------------------------------------
Pratt version 2.1, Sept. 1996
Written by Inge Jonassen,
University of Bergen
Norway
email: inge@ii.uib.no
For more information, see
http://www.ii.uib.no/~inge/Pratt.html
------------------------------------------------------------
Please quote:
I.Jonassen, J.F.Collins, D.G.Higgins.
Protein Science 1995;4(8):1587-1595.
I.Jonassen
submitted to CABIOS
------------------------------------------------------------
Pratt version 2.1
Analysing 166 sequences from file snake
PATTERN CONSERVATION:
CM: min Nr of Seqs to Match 166
C%: min Percentage Seqs to Match 100.0
PATTERN RESTRICTIONS :
PP: pos in seq [off,complete,start] off
PL: max Pattern Length 50
PN: max Nr of Pattern Symbols 50
PX: max Nr of consecutive x's 5
FN: max Nr of flexible spacers 2
FL: max Flexibility 2
FP: max Flex.Product 10
BI: Input Pattern Symbol File off
BN: Nr of Pattern Symbols Initial Search 20
PATTERN SCORING:
S: Scoring [info,mdl,tree,dist,ppv] info
SEARCH PARAMETERS:
G: Pattern Graph from [seq,al,query] seq
E: Search Greediness 3
R: Pattern Refinement on
RG: Generalise ambiguous symbols off
OUTPUT:
OF: Output Filename snake.166.pat
OP: PROSITE Pattern Format on
ON: max number patterns 50
OA: max number Alignments 50
M: Print Patterns in sequences on
MR: ratio for printing 10
MV: print vertically off
X: eXecute program
Q: Quit
help: for on-line help
Command:
For instance the pattern A-x(2,3)-B is consistent with the alignment
ALVGB
AG-LB
ALD-B
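The example can be checked by hand with a few lines of Python. This sketch tests only what the example itself shows: that columns 0 and 4 hold A and B in every row, and that the number of residues between them (gaps removed) stays within the wildcard bounds 2 and 3. The full consistency criterion is not reproduced here, and the helper names are hypothetical.

```python
alignment = ["ALVGB", "AG-LB", "ALD-B"]


def spacer_lengths(rows, first_col, last_col):
    """Count the non-gap residues strictly between two alignment columns, per row."""
    return [len(row[first_col + 1:last_col].replace("-", "")) for row in rows]


def column_holds(rows, col, letters):
    """True if every row has one of the given letters in the given column."""
    return all(row[col] in letters for row in rows)


lengths = spacer_lengths(alignment, 0, 4)

# A-x(2,3)-B: A in column 0, B in column 4, and 2 to 3 residues in between.
fits = (
    column_holds(alignment, 0, "A")
    and column_holds(alignment, 4, "B")
    and all(2 <= n <= 3 for n in lengths)
)
print(lengths, fits)
```

The first row has three residues between A and B, and the other two rows have two each, which is why the wildcard x(2,3) is needed rather than a fixed x(3).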
Efficient discovery of conserved patterns using a pattern graph.
Submitted to CABIOS
C
DE
and only patterns with the symbols C and [DE] would
be considered. During the initial search, pattern symbols
corresponding to the first BN lines can be used.
Increasing BN will slow down the search and increase the
memory usage, but allow more ambiguous pattern symbols.
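A minimal sketch of how such a symbol file could be read, assuming exactly the layout described above (one symbol per line, each line holding the letters the symbol matches). The function names are hypothetical and this is not taken from the Pratt source.

```python
def read_symbols(lines):
    """Turn the lines of a symbol file into a list of letter sets, keeping file order."""
    return [frozenset(line.strip()) for line in lines if line.strip()]


def initial_search_symbols(symbols, bn):
    """Symbols available during the initial search: the first BN lines of the file."""
    return symbols[:bn]


def format_symbol(symbol):
    """Write a symbol the way it appears in a pattern: C or [DE]."""
    letters = "".join(sorted(symbol))
    return letters if len(letters) == 1 else "[" + letters + "]"


# The two-line example file from the text, given here as a list of lines.
symbols = read_symbols(["C\n", "DE\n"])
print([format_symbol(s) for s in symbols])
print([format_symbol(s) for s in initial_search_symbols(symbols, 1)])
```

With BN set to 1, only the symbol C would be available in the initial search for this file; [DE] could still be added during refinement.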
The P options are for controlling the patterns to be considered by Pratt. See also the F options for controlling flexibility.
x       - 1
x(10)   - 10
x(3,4)  - 4
Increasing PX will increase the time used by Pratt, and also slightly the memory required.
The F options control flexible wildcards in the patterns:
Using option FN you can set the maximum number of flexible wildcards (matching a variable number of arbitrary sequence symbols). For instance x(2,4) is a flexible wildcard, and the pattern
C-x(2,4)-[DE]-x(10)-F
contains one flexible wildcard.
Increasing FN will increase the time used by Pratt.
Using option FL you can set the maximum flexibility of a flexible wildcard (matching a variable number of arbitrary sequence symbols). For instance x(2,4) and x(10,12) have flexibility 2, and x(10) has flexibility 0. Increasing FL will increase the time used by Pratt.
Using option FP you can set an upper limit on the product of flexibilities for a pattern. This is related to the memory requirements of the search, and increasing the limit increases the memory usage. Some patterns and the corresponding product of flexibilities:
C-x(2,4)-[DE]-x(10)-F     - (4-2+1)*(10-10+1) = 3
C-x(2,4)-[DE]-x(10,14)-F  - (4-2+1)*(14-10+1) = 3*5 = 15
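The three quantities limited by FN, FL and FP can be computed from a pattern string as sketched below. The code follows the worked examples above: the flexibility of x(i,j) is taken as j-i, while each flexible wildcard contributes a factor (j-i+1) to the product. The function names are hypothetical; this is an illustration, not Pratt's implementation.

```python
import re

WILDCARD = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")


def wildcard_bounds(pattern):
    """Return (min, max) length for every wildcard region in the pattern."""
    bounds = []
    for element in pattern.split("-"):
        m = WILDCARD.fullmatch(element)
        if m:
            low = int(m.group(1) or 1)      # plain x means exactly one position
            high = int(m.group(2) or low)   # x(n) has equal lower and upper bound
            bounds.append((low, high))
    return bounds


def flexibility_summary(pattern):
    """Quantities that the FN, FL and FP limits are compared against."""
    flexible = [(lo, hi) for lo, hi in wildcard_bounds(pattern) if hi > lo]
    product = 1  # defined to be one when there is no flexible wildcard
    for lo, hi in flexible:
        product *= hi - lo + 1
    return {
        "flexible_wildcards": len(flexible),                                   # FN
        "max_flexibility": max((hi - lo for lo, hi in flexible), default=0),   # FL
        "flexibility_product": product,                                        # FP
    }


print(flexibility_summary("C-x(2,4)-[DE]-x(10)-F"))
print(flexibility_summary("C-x(2,4)-[DE]-x(10,14)-F"))
```

With the values shown in the sample menu (FN 2, FL 2, FP 10), the first pattern is within all three limits, while the second exceeds both FL (flexibility 4) and FP (product 15).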
For instance the pattern
C-x(4)-D
might be refined to
C-x-[ILV]-x-D-x(3)-[DEF]
If the RG option is switched on, then ambiguous symbols listed in the symbols file (or in the default symbol set; see help for option B) are used. If RG is off, only the letters needed to match the input sequences are included in the ambiguous pattern positions.
For example, if [ILV] is a listed allowed symbol, and [IL] is not, [IL] can be included in a pattern if RG is off, but if RG is on, the full symbol [ILV] will be included instead.
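The difference between the two settings can be sketched as follows. The text does not say how a listed symbol is chosen when several contain the observed letters, so picking the smallest one is an assumption made for this example, as is returning None when no listed symbol fits. The function name is hypothetical.

```python
def ambiguous_symbol(observed, listed_symbols, rg):
    """Choose the symbol for a refined position, given the letters seen there.

    observed       -- letters found at this position in the matching sequences
    listed_symbols -- allowed symbols, e.g. ["ILV", "FWY"]
    rg             -- True if the RG option is on
    """
    observed = frozenset(observed)
    if not rg:
        # RG off: use only the letters needed to match the input sequences.
        chosen = observed
    else:
        # RG on: use a listed symbol containing all observed letters.
        # Assumption: the smallest such symbol is preferred.
        candidates = [frozenset(s) for s in listed_symbols if observed <= frozenset(s)]
        if not candidates:
            return None  # assumption: no listed symbol covers these letters
        chosen = min(candidates, key=len)
    letters = "".join(sorted(chosen))
    return letters if len(letters) == 1 else "[" + letters + "]"


print(ambiguous_symbol("IL", ["ILV", "FWY"], rg=False))
print(ambiguous_symbol("IL", ["ILV", "FWY"], rg=True))
```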
The O options allow you to control the output from Pratt:
All parameters whose values can be set using the menu can also be set
from the command line. For parameters with a numerical value, for
example CM (minimum number of sequences to match a pattern), the value
of this parameter is set by including -CM <value> in
the command line. If this is the only parameter for which you wish to use
a non-default value, your command line might look something like
pratt fasta sequences -cm 20
if you wish to look for patterns matching at least 20 of the sequences
in the file sequences (given in FastA format).
For parameters that switch options on or off, you should include
on or off after the option on the command line.
For example if you do not want pattern refinement to be done, your
command line might look like
pratt fasta sequences -cm 20 -r off
Normally, if you set some parameters using the command line, Pratt will just start the search using default values for the parameters that you have not set. If you want to see the menu as well, include -menu in the command line.
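When Pratt is driven from a script, the command line can be assembled from a table of option values. The helper below is a hypothetical sketch that follows only the conventions described above: numerical options are given as the option name followed by the value, on/off options as the option name followed by on or off, and -menu is appended on request. Option names are written in lower case as in the example command lines; the function builds the argument list but does not run Pratt.

```python
def build_pratt_command(fmt, filename, options=None, show_menu=False):
    """Build an argument list such as ['pratt', 'fasta', 'sequences', '-cm', '20']."""
    command = ["pratt", fmt, filename]
    for name, value in (options or {}).items():
        if isinstance(value, bool):
            # Options that are switched on or off, e.g. R (pattern refinement).
            value = "on" if value else "off"
        command += ["-" + name.lower(), str(value)]
    if show_menu:
        command.append("-menu")
    return command


# The two example command lines from the text.
print(" ".join(build_pratt_command("fasta", "sequences", {"CM": 20})))
print(" ".join(build_pratt_command("fasta", "sequences", {"CM": 20, "R": False})))
```

The resulting list could be passed to a process launcher such as Python's subprocess.run, provided the pratt executable is on the search path.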