The SRS Query Language
|
|
In SRS, any retrieval command, any logical operation on sets obtained by previous queries, any link between sets from different databanks, or any combination of these can be expressed in the SRS query language. This chapter describes the language and gives examples of its use.
|
8.1 General Syntax
|
|
A query is an expression with operands and operators. In this chapter queries are sometimes shown as equations that assign an expression to a query name (as in "q = [sequence-des: kinase]" later in this chapter).
The query name will be associated with the resultant set of entries. Expressions can specify index searches, logical operations between sets, link operations, or a combination of all.
|
8.1.1 List of Operands
|
|
Name of a Databank
In SRS only one name for each databank is used (eg, "EMBL").
Set Name
Each query must be given a name (eg, "Q1") which can be later used to operate on the set that results from that query.
Index Search
A command to search in one or more indices of one or more libraries.
Expression
An expression is treated as an operand if it occurs within parentheses, for example, "(Q1&Q2)>EMBL". Parentheses can be nested to any degree.
Parent
A special operand that converts a set of subentries (for example, "Sequence" features) into a set of their parent entries (in this case, "Sequence" entries). This is achieved by linking the set of subentries to "parent" (eg, "Q1 > parent"). It is used only in the special case where the entry has subentries.
|
8.1.2 List of Operators
|
|
The following table shows a list of SRS query language operators.
TABLE 8.1 SRS Query Language Operators
Operator | Meaning
'|' | Logical OR
'&' | Logical AND
'!' | Logical AND NOT (in colloquial English, BUT NOT)
'>' | Link left
'<' | Link right
'>^' | Get subtree defined by left operand (hierarchical links)
'>_' | Get leaf entries of the subtree defined by left operand (hierarchical links)
The link operators have precedence over logical operators.
|
8.1.3 Searching in Indices
|
|
This command specifies a search in a single index or a group of indices of one or more databanks. An index search must specify, within square brackets, the databank or databank group name, the index or index group name, and a search expression. The two names must be separated by a hyphen ('-') and be separated from the search expression by a ':' (string search) or a '#' (range search). Both the field name (eg, "Description") and its abbreviation ("Des") can be used as the index name. All strings, including the search words, are treated as case insensitive.
For example, the query "[pir-description:elastase]" searches for the string "elastase" in the "Description" field of the protein databank PIR. The query "[pir-des:elastase]" is equivalent, because "des" is the short name for the same field.
As indicated above, there are two types of search: the string search and the range search. A string search expression starts with a colon (':') after the index name, whereas a numeric range search starts with a hash ('#'). Range searches can be performed only in indices of the types "num" and "real".
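The syntax rules above can be sketched as a small string builder. This is only an illustration in Python: the function name `index_search` and its parameters are hypothetical and not part of SRS, and the function does no validation of databank or index names.

```python
def index_search(databank: str, index: str, expression: str, numeric: bool = False) -> str:
    """Assemble an index search of the form [databank-index:expression].

    A '#' replaces the ':' when a numeric range search is requested.
    """
    separator = "#" if numeric else ":"
    return f"[{databank}-{index}{separator}{expression}]"


# String search in the Description index of PIR, using the short index name.
print(index_search("pir", "des", "elastase"))
# Range search in the date index of SWISS-PROT.
print(index_search("swissprot", "date", "19910101:19951231", numeric=True))
```

Because SRS treats all strings as case insensitive, the builder leaves the case of its arguments untouched.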
Searching Strings
A search expression is a single search word or several words separated by a logical operator. Parentheses may be used for grouping.
EXAMPLE 8.1 - Search strings
[embl-key:insulin]
[medline-aut:needleman,*!wunsch,*]
[embl-def:(acetylchol*&receptor)!muscarinic]
|
Wildcards are useful for searching a group of words or whenever it is unclear how a word is spelt in the databank. You can use two different types of wildcards:
TABLE 8.2 Wild Cards in SRS
* | Matches one or more characters of any value
? | Matches a single character of any value
Any number of wildcards can be placed anywhere in the search word. If you place one at the beginning you might have a slightly longer response time since then all words of the index have to be searched. But don't worry too much about that - it is still quite fast!
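To make the wildcard semantics concrete, the following Python sketch translates a search word into a regular expression. It follows the definitions in Table 8.2 ('*' as one or more characters, '?' as exactly one); the function name `wildcard_to_regex` is hypothetical, and the actual translation performed inside SRS may differ.

```python
import re


def wildcard_to_regex(word: str) -> re.Pattern:
    """Translate an SRS-style wildcard word into a compiled Python regex."""
    parts = []
    for ch in word:
        if ch == "*":
            parts.append(".+")           # one or more characters, as in Table 8.2
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    # Search words are case insensitive.
    return re.compile("".join(parts), re.IGNORECASE)


pattern = wildcard_to_regex("acetylchol*")
print(bool(pattern.fullmatch("Acetylcholine")))  # True
print(bool(pattern.fullmatch("acetylchol")))     # False: '*' needs at least one character
```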
Regular Expressions
Since search words with wildcards are translated into regular expressions, it is of course possible to enter regular expressions directly. They must appear within forward slashes ('/'). Some characters ("^$.[]()?+*") have a special meaning. They must be prefixed with a backslash ('\') if they are to be matched literally.
TABLE 8.4 Example Usage of Regular Expressions
finds all 3 character strings with a j at the beginning |
finds all 4 digit numbers with a five at the beginning |
finds the gene names "nifa", "nifb", "nifc", "nifd", "nife" |
finds both "muller" and "mueller" |
Note that searches with regular expressions can be slow since all words in the index have to be searched.
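The kinds of match described in the table can be tried out with Python's `re` module. The patterns below are assumptions written in Python regular expression syntax; they are not taken from the SRS documentation, and the dialect accepted by SRS may differ in detail. The word list and the helper `matching_words` are likewise hypothetical.

```python
import re


def matching_words(pattern: str, words: list[str]) -> list[str]:
    """Return the words matched by the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [word for word in words if regex.search(word)]


# A made-up slice of an index, for illustration only.
index_words = ["jak", "jun", "june", "5123", "512",
               "nifa", "nife", "niff", "muller", "mueller", "miller"]

print(matching_words(r"^j..$", index_words))        # 3 characters, starting with j
print(matching_words(r"^5[0-9]{3}$", index_words))  # 4 digits, starting with 5
print(matching_words(r"^nif[a-e]$", index_words))   # nifa to nife
print(matching_words(r"^mue?ller$", index_words))   # muller or mueller
```

Every pattern is tested against every word in the list, which mirrors the reason given above for regular expression searches being slow.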
Searching Numeric Ranges
In a numeric index (integers, reals) it is possible to search numeric ranges. The number index is only applicable where there is a one to one relationship of entry and value (eg, sequence length, creation date, resolution).
A range can be specified by a single value or two values separated by a colon (':') where the left value must be smaller than the right value. To exclude a boundary value from the range put a '!' in front. A missing number on the left means the minimum, and on the right the maximum value in the index.
The following are queries in an index of the sequence length.
TABLE 8.5 Numeric Range Examples
400 | selects all sequences with a length of exactly 400
400:500 | selects all sequences from 400 to 500 residues
400: | selects all sequences of 400 residues or longer
:500 | selects all sequences up to 500 residues
400:!500 | selects all sequences from 400 to 500 residues, excluding 500
: | is also valid and retrieves all sequences
Note that ranges can be combined by logical operators.
300:!500 | !600:700
or
300:700 ! 500:600
|
retrieve the same set of sequences, namely all sequences from 300 to 500, excluding 500, plus all sequences from 600 to 700 where the 600 is excluded.
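The equivalence of the two expressions can be checked with a small Python model of the range rules. This is a sketch under assumptions: `parse_range` and `select` are hypothetical helpers, only plain numbers are handled (not dates), and Python's set operators '|' and '-' stand in for the SRS operators '|' and '!'.

```python
def parse_range(text: str):
    """Return a predicate for a range such as '400', '400:500', '400:', ':500' or '400:!500'."""
    text = text.strip()
    if ":" not in text:
        value = float(text)
        return lambda x: x == value
    low_text, high_text = (part.strip() for part in text.split(":", 1))
    low_open = low_text.startswith("!")    # '!' excludes the boundary value
    high_open = high_text.startswith("!")
    low_text, high_text = low_text.lstrip("!"), high_text.lstrip("!")
    low = float(low_text) if low_text else float("-inf")    # missing left value: minimum
    high = float(high_text) if high_text else float("inf")  # missing right value: maximum

    def contains(x):
        above = x > low if low_open else x >= low
        below = x < high if high_open else x <= high
        return above and below

    return contains


lengths = range(0, 1001)  # made-up sequence lengths


def select(expression: str) -> set:
    contains = parse_range(expression)
    return {n for n in lengths if contains(n)}


first = select("300:!500") | select("!600:700")
second = select("300:700") - select("500:600")
print(first == second)  # True
```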
Searching Dates
It is possible to search dates using one of the two special formats that are recognized by the SRS query language. They are: "YYYYMMDD" or "DD-MMM-YYYY". Examples are:
"19990515"
or
"13-DEC-1999"
|
Dates can be used within ranges as normal numbers:
[swissprot-date#19910101:19951231]
[swissprot-date#13-DEC-1998:13-DEC-1999]
|
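A client program that prepares date ranges may want to normalise both formats first. The Python sketch below does this with a hypothetical helper, `parse_srs_date`; it assumes that the month abbreviations are the three-letter English ones, as in the "13-DEC-1999" example.

```python
from datetime import date

# Assumed English three-letter month abbreviations.
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def parse_srs_date(text: str) -> date:
    """Parse a date written as YYYYMMDD or DD-MMM-YYYY."""
    text = text.strip().strip('"')
    if "-" in text:
        day, month, year = text.split("-")
        return date(int(year), MONTHS[month.upper()], int(day))
    return date(int(text[0:4]), int(text[4:6]), int(text[6:8]))


print(parse_srs_date("19990515"))     # 1999-05-15
print(parse_srs_date("13-DEC-1999"))  # 1999-12-13
# Both bounds of a range can be compared once they are parsed.
print(parse_srs_date("13-DEC-1998") < parse_srs_date("13-DEC-1999"))  # True
```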
Searching Multiple Databanks
Instead of the databank name in a search expression, a list of names may be used. The list must be enclosed within curly braces and the names must be separated by spaces. The query:
[{swissprot swissnew sptrembl}-des:kinase]
|
searches for the word "kinase" in the Description index of SWISS-PROT, SWISSNEW and SPTREMBL. If many index queries over multiple databanks must be combined in a single query, it is convenient to define a group of databanks that can be used later in the query in place of a databank name. The example creates the group "dbs" with the 3 databanks SWISS-PROT, SWISSNEW and SPTREMBL within the first search command and uses the group name "dbs" in the second search command.
[dbs={swissprot swissnew sptrembl}-des:kinase]
&[dbs-org:human]
|
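When such queries are generated by a program, the group definition can be assembled mechanically. The following Python sketch is an illustration only: `databank_list` and `grouped_query` are hypothetical names, only string searches are covered, and the searches are combined with '&' as in the example above.

```python
def databank_list(names):
    """Format databank names as a curly-brace list separated by spaces."""
    return "{" + " ".join(names) + "}"


def grouped_query(group, names, searches):
    """Define the group in the first search and reuse its name in the others.

    searches is a list of (index, expression) pairs for string searches.
    """
    first_index, first_expression = searches[0]
    parts = [f"[{group}={databank_list(names)}-{first_index}:{first_expression}]"]
    parts += [f"[{group}-{index}:{expression}]" for index, expression in searches[1:]]
    return "&".join(parts)


print(grouped_query("dbs", ["swissprot", "swissnew", "sptrembl"],
                    [("des", "kinase"), ("org", "human")]))
```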
|
8.1.4 Using Logical Operators
|
|
The three logical operators OR ('|'), AND ('&') and BUTNOT ('!') can be used to combine search words in an index search, or sets in a query expression.
The following figure illustrates the effect of the three operators in an expression of the form "A operator B".
FIGURE 1. Logical Operators in SRS.
Parentheses are allowed within expressions, to any degree of nesting.
EXAMPLE 8.2 - Nesting queries with parentheses
[medline-authors:(wu* & rice*) !
(smith* | jones*)]
(q1 & q2) ! (q3 & q4)
|
Logical operations can only be performed between two sets of the same type. It is not possible to combine a set of entries and a set of subentries (see below). In those cases an additional link operation must be specified.
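The three operators behave like the usual set operations, which Python sets can model directly. In this sketch the entry identifiers are made up, and Python's '-' plays the role of the SRS BUTNOT operator '!'.

```python
# Made-up result sets of entry identifiers.
q1 = {"P1", "P2", "P3"}
q2 = {"P2", "P3", "P4"}
q3 = {"P3", "P5"}
q4 = {"P3", "P6"}

print(sorted(q1 | q2))  # OR: entries in either set
print(sorted(q1 & q2))  # AND: entries in both sets
print(sorted(q1 - q2))  # BUTNOT: entries in q1 that are not in q2

# The nested expression (q1 & q2) ! (q3 & q4) from Example 8.2.
nested = (q1 & q2) - (q3 & q4)
print(sorted(nested))
```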
Using Link Operators
The powerful link operators are unique in the SRS query language. They resemble the join in relational databank systems to some extent. The two link operators '<' and '>' combine two sets from different databanks.
The figure shows two databanks "A" and "B" where some entries in "A" have cross references to entries in "B" as indicated by red lines. These cross references are processed to build link indices which provide the basis for the link operation.
FIGURE 2. Cross-Referenced Data.
- A > B
- gives entries in "B" that are referenced by entries in "A".
- A < B
- gives those entries in "A" that reference entries in "B".
References are by nature not bidirectional, that is, there is no guarantee that entries in "B" that are referenced by entries in "A" have references back to "A". The link indices in SRS, however, can always be used bidirectionally. The two link expressions can now be seen as:
- A > B
- gives those entries in "B" that are linked to entries in "A".
- A < B
- gives those entries in "A" that are linked to entries in "B".
In this context it is irrelevant which databank contributed the information for building the link index.
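A link index can be pictured as a set of pairs, one per cross reference. The Python sketch below models the two operators on such a set; the identifiers and the function names `link_to` and `link_from` are hypothetical, and real SRS link indices are of course not Python sets.

```python
# Hypothetical link index between databanks A and B: (entry in A, entry in B).
LINKS = {("a1", "b1"), ("a1", "b2"), ("a3", "b2")}


def link_to(left, right, links):
    """Model of 'A > B': entries of the right set linked to entries of the left set."""
    return {b for a, b in links if a in left and b in right}


def link_from(left, right, links):
    """Model of 'A < B': entries of the left set linked to entries of the right set."""
    return {a for a, b in links if a in left and b in right}


A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}

print(sorted(link_to(A, B, LINKS)))    # ['b1', 'b2']
print(sorted(link_from(A, B, LINKS)))  # ['a1', 'a3']
```

Both functions read the same set of pairs, which corresponds to the bidirectional use of a single link index.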
Link queries to refine search results
[swissprot-des:kinase] > PDB
|
The index search retrieves all kinase sequences from the SWISS-PROT protein sequence databank which are then linked to the PDB databank of solved tertiary protein structures. The result is all the PDB entries with atomic coordinates for all kinases for which the tertiary structure has been determined.
In SRS all the pairwise links are combined to generate a network of databanks. Within this network a link can be performed from any databank to any other. If two databanks are not directly connected by a link then a series of links is performed. The following rules are applied to resolve links:
- The same pairwise link index is used for both directions.
- SRS tries to find the shortest possible way for linking entries from two libraries. Ideally this is the direct link as, eg, between EMBL and SWISS-PROT; if no direct link is available (eg, EMBL > PDB) then the optimal or cheapest succession of links is performed automatically (eg, EMBL > SWISS-PROT > PDB).
- If a set contains entries from different databanks (eg, from EMBL and SWISS-PROT) then the subsets of all libraries found in the set are linked independently.
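The first two rules can be illustrated with a breadth-first search over a toy databank network. This is a sketch, not the algorithm SRS uses: it assumes that every link costs the same, so "cheapest" reduces to "fewest links", and the list of pairwise links is a made-up subset chosen for the example.

```python
from collections import deque

# Made-up pairwise links, for illustration only.
PAIRWISE_LINKS = [("EMBL", "SWISS-PROT"), ("SWISS-PROT", "PDB"), ("SWISS-PROT", "PROSITE")]

# The same pairwise link serves both directions, so each pair is stored twice.
network = {}
for a, b in PAIRWISE_LINKS:
    network.setdefault(a, set()).add(b)
    network.setdefault(b, set()).add(a)


def link_path(network, start, goal):
    """Return the shortest chain of databanks from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(network.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(" > ".join(link_path(network, "EMBL", "PDB")))  # EMBL > SWISS-PROT > PDB
```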
Advanced queries using links:
The Enzyme databank contains all known reactions catalyzed by enzymes. The number of entries retrieved by the query is the number of different reactions catalyzed by proteins with known structure.
[swissprot-id: acha_human] > prosite > swissprot
|
The index search retrieves the SWISS-PROT entry "ACHA_HUMAN". This entry is then linked to the Prosite entry or entries that document the protein family of which "ACHA_HUMAN" is a member, in this case the family of neuronal acetylcholine receptors. The next link retrieves all SWISS-PROT entries that belong to that family. In effect, the entry "ACHA_HUMAN" is amplified to all members of the protein family or families it belongs to. The example shows that it is possible to navigate over the databank network with an explicit succession of links that is evaluated from left to right.
[swissprot-id:tdt_human] > prodom > pdb
|
This example shows how the amplification by homologous entries in the same databank can be used to find related information in another databank to which the single entry itself is not linked. The query retrieves a DNA polymerase from SWISS-PROT, expands it by related proteins as documented in Prodom (a databank of protein domains; Prosite or HSSP could be used instead) and links these entries to PDB. The result is the set of PDB tertiary protein structures that are homologous to the SWISS-PROT entry "TDT_HUMAN".
q = [sequence-des: kinase]
q ! (q < swissprot)
|
Most protein or DNA databanks overlap to a great extent, which creates a lot of redundancy. The annotation of equivalent entries in different databanks can differ considerably. This is useful for string searching, since the probability of finding a certain enzyme name is much greater if you search all sequence databanks. After the search the links come in handy to remove the overlap. The first query in the example searches for "kinase" in all sequence databanks (eg, EMBL, PIR, SWISS-PROT, GenBank). The second query removes the overlap. The goal is to have SWISS-PROT entries plus those in other databanks that do not have equivalents in SWISS-PROT. This is achieved by subtracting all entries from "q" that are linked to SWISS-PROT. SWISS-PROT entries in "q" are not linked to SWISS-PROT and won't be removed! Just change "SWISS-PROT" to "PIR" if you would rather have PIR entries.
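The subtraction can be modelled with ordinary sets. In the Python sketch below all identifiers are made up, and the dictionary `linked_to_swissprot` is a hypothetical stand-in for the link index between the other databanks and SWISS-PROT.

```python
# Made-up result of the search for "kinase" in all sequence databanks.
q = {"swissprot:KIN1", "pir:A001", "embl:X123", "pir:B002"}

# Hypothetical links from entries in other databanks to their SWISS-PROT equivalents.
linked_to_swissprot = {"pir:A001": "swissprot:KIN1", "embl:X123": "swissprot:KIN1"}

# Model of (q < swissprot): the entries of q that are linked to SWISS-PROT.
with_equivalent = {entry for entry in q if entry in linked_to_swissprot}

# Model of q ! (q < swissprot): everything else is kept.
non_redundant = q - with_equivalent
print(sorted(non_redundant))  # ['pir:B002', 'swissprot:KIN1']
```

The SWISS-PROT entry survives because it has no link to SWISS-PROT itself, and the PIR entry without an equivalent survives for the same reason.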
|
8.1.5 Entries and Subentries
|
|
Sets originating from the same databank may have different set types. Consider the two queries:
[swissprot-keywords: transmembrane]
[swissprot-features: transmem]
|
The first query retrieves all SWISS-PROT entries that have transmembrane segments, the second finds all transmembrane features contained in SWISS-PROT entries.
The second query will retrieve many more items, since most transmembrane proteins have more than one membrane-spanning segment. If you requested the sequences for the second set you would get the transmembrane segments and not the parent entry's sequence.
The first query returns a set of entries whereas the second returns a set of subentries. The "features" index has a special type: in all sequence databanks, searches in that index result in sets of subentries. Sets of entries and subentries cannot be combined with logical operators. Only the link operators may be used between them, ie, it is always possible to link subentries to their respective parent entries.
Performing links with sub-entries
[swissprot-org:human] > [swissprot-features:transmem]
|
returns all transmembrane segments found in human proteins.
[swissprot-org:human] < [swissprot-features:transmem]
|
returns all human proteins that have transmembrane segments.
Sometimes it is necessary to do an explicit conversion from subentries to entries. This can be done with a link to "parent".
Advanced links with sub-entries
[swissprot-features:transmem] > parent |
[swissprot-key:transmembrane]
|
returns all entries that have the "transmembrane" keyword or "transmem" sequence features. This may be necessary to ensure all entries with a certain property are retrieved - the annotation is often not complete!
[swissprot-features:transmem] > parent > pdb
|
finds all PDB entries of proteins that have transmembrane segments (even though the known structure may not yet include the transmembrane region).
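The difference between a set of subentries and a set of entries, and the effect of the link to "parent", can be modelled as follows. The `Feature` record, the identifiers and the coordinates in this Python sketch are all hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A subentry: one sequence feature belonging to a parent entry."""
    parent: str
    kind: str
    start: int
    end: int


# Made-up features and keyword hits, for illustration only.
features = [
    Feature("P1", "transmem", 10, 30),
    Feature("P1", "transmem", 50, 70),
    Feature("P2", "transmem", 5, 25),
    Feature("P3", "signal", 1, 20),
]
keyword_hits = {"P2", "P4"}  # entries carrying the "transmembrane" keyword

# Model of [swissprot-features:transmem]: a collection of subentries.
transmem = [f for f in features if f.kind == "transmem"]

# Model of "> parent": convert the subentries into their parent entries.
parents = {f.parent for f in transmem}

# Only now can the result be combined with a set of entries using OR.
combined = parents | keyword_hits
print(len(transmem), sorted(parents), sorted(combined))
```

Three subentries collapse to two parent entries, and the entry "P4" is found only through its keyword, which reflects the point about incomplete annotation.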
|
8.1.6 Storing Intermediate Results in Sets
|
|
If a query becomes very complicated it may be convenient to store an intermediate result within a query in a named set that can be used in later parts of the query. An example is an index query in EMBL that is then linked to both SWISS-PROT and SWISSNEW.
[embl-org:human]>SWISSPROT |
[embl-org:human]>SWISSNEW
|
The index query in EMBL has to be specified twice. It is possible to store the result of the index query in a set, called "temp" in the example. The assignment operation must be within parentheses.
(temp=[embl-org:human])>SWISSPROT | temp > SWISSNEW
|
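Python's assignment expression offers a close analogy: it also stores an intermediate value under a name in the middle of a larger expression. The sketch below is only an analogy under assumptions. The link tables, identifiers and helper functions are made up, and Python 3.8 or later is required for the ':=' operator.

```python
def link(entries, link_index):
    """Follow a hypothetical link index that maps an entry to its linked entries."""
    return {target for entry in entries for target in link_index.get(entry, ())}


# Made-up link tables, for illustration only.
TO_SWISSPROT = {"E1": {"SP1"}}
TO_SWISSNEW = {"E2": {"SN1"}}

calls = []  # records how often the index search is run


def search_embl_human():
    """Stand-in for the index query [embl-org:human]."""
    calls.append("search")
    return {"E1", "E2"}


# Analogue of (temp=[embl-org:human])>SWISSPROT | temp > SWISSNEW
result = link((temp := search_embl_human()), TO_SWISSPROT) | link(temp, TO_SWISSNEW)
print(sorted(result), len(calls))  # the search ran only once
```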