Profiles, Protein Motifs and Domains.

In this practical lesson we will go through several examples to illustrate the concepts of "PROFILES", "MOTIFS", "DOMAINS" and "FAMILIES". Each exercise centers on the analysis of a specific sequence.

The examples include results from database searches or from the application of several tools.

HOWEVER, the idea is that you repeat the various analyses yourself.

A selection of links to Databases and Tools
Exercise 1: PROFILES

swiss::YD33_MYCTU, from Mycobacterium tuberculosis, annotated as "Hypothetical protein Rv1333".

Exercise 2: FAMILIES

Protein from the gene gcsf, from Bos taurus, annotated as "Granulocyte colony-stimulating factor precursor (G-CSF)".

Exercise 3: DOMAINS

swiss::ICE9_HUMAN, from Homo sapiens; precursor of caspase-9.

Link to another list of Databases and Tools centered on Protein Families and Profiles.

A selection of links to Databases and Tools:

General Tools:

BLAST: EMBL, EBI, NCBI.

ClustalW: EBI, ch.EMBNET.org, crick.genes.nig.ac.jp, NPS@, GenomeNet.

Databases: SwissProt, EMBL; SRS-EBI, SRS-EMBL.

Pattern Tools and Databases:

PROSITE: Database of protein motifs expressed as patterns or profiles

http://us.expasy.org/prosite/

ScanProsite (several mirrors): Scans a sequence against PROSITE or a pattern against SWISS-PROT and TrEMBL

http://www.expasy.org/cgi-bin/scanprosite

http://us.expasy.org/tools/scanprosite/

http://kr.expasy.org/cgi-bin/scanprosite

http://tw.expasy.org/cgi-bin/scanprosite

http://ca.expasy.org/cgi-bin/scanprosite

http://cn.expasy.org/cgi-bin/scanprosite

PRATT: Generation of patterns (regular expressions) from a group of unaligned sequences.

http://www.ebi.ac.uk/pratt/
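
Since PROSITE patterns and the patterns produced by PRATT are essentially regular expressions, a pattern can be translated into a regular expression and scanned against a sequence by hand. The following is a minimal sketch of that idea, not the code used by ScanProsite. It handles only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors outside brackets). The function names and the toy sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")              # patterns are written with a final period
    parts = []
    for element in pattern.split("-"):         # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):            # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Optional repeat count, e.g. x(2) or x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                         # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # {P} means any residue except P
        # [ST] is already valid regex syntax; a single letter stands for itself
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlapping ones included."""
    regex = prosite_to_regex(pattern)
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence)]

# N-glycosylation site pattern, scanned against a made-up toy sequence
print(prosite_to_regex("N-{P}-[ST]-{P}."))     # N[^P][ST][^P]
print(scan("N-{P}-[ST]-{P}.", "AANGSAANPTA"))  # [(3, 'NGSA')]
```

Short patterns like this one match very often by chance, which is one reason why profiles are preferred for detecting distant relationships.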

Profile Tools and Databases:

NCBI PSI-BLAST: Automatic generation of profiles in iterative searches.

BLAST page at NCBI

ProfileScan: Scans a sequence to find matches to protein patterns and profiles in PROSITE and Pfam.

http://hits.isb-sib.ch/cgi-bin/PFSCAN?

MOTIF: Scans sequences to find motifs; databases to find sequences that match profiles or patterns; and generates profiles from sequences provided by the user.

http://motif.genome.ad.jp/

InterPro: Database of protein families defined by the presence of common motifs and domains, as described in several databases such as Pfam, SMART, PROSITE, and others.

http://www.ebi.ac.uk/InterProScan

Pfam: database of protein HMM profiles that define domain families.

Sanger Institute (UK)

St. Louis (USA)

Karolinska Institutet (Sweden)

Institut National de la Recherche Agronomique (France)

Bioaccelerator: allows the generation of profiles and the performance of several kinds of searches.

http://eta.embl-heidelberg.de:8000/

Meme: identifies conserved motifs in groups of sequences, and generates profiles that can be used to search related sequences.

http://meme.sdsc.edu/meme/website/intro.html

1. Profiles

YD33_MYCTU, from Mycobacterium tuberculosis, "Hypothetical protein Rv1333".

This exercise will show that profile-based methods are more sensitive than pairwise similarity searches such as those conducted with BLAST.

Let's assume that we are trying to predict the function of YD33_MYCTU.

A. First get the sequence with the SwissProt entry name (ID) YD33_MYCTU; its accession number is Q10644.

alternative 1: Open SRS at http://srs.ebi.ac.uk/; select the SwissProt database in the Library Page and search for "YD33_MYCTU". Choose "FastaSeqs" view and click on the button "view".

alternative 2 (faster): Open SwissProt at http://us.expasy.org/sprot/ and "quick search" for "YD33_MYCTU". At the end of the page click on "Q10644 in FASTA format".

You should obtain the sequence

>sw|Q10644|YD33_MYCTU Hypothetical protein Rv1333.
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
GWNCRPTADFGYSACAAAGVDVAVGTVGVGVGARAGALKGGVGTASATLQSGVTVGVLAV
VNAAGNVVDPATGLPWMADLVGEFALRAPPAEQIAALAQLSSPLGAFNTPFNTTIGVIAC
DAALSPAACRRIAIAAHDGLARTIRPAHTPLDGDTVFALATGAVAVPPEAGVPAALSPET
QLVTAVGAAAADCLARAVLAGVLNAQPVAGIPTYRDMFPGAFGS
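
If you prefer to handle the downloaded record with a script, the sketch below shows one way to read FASTA text and to split a header of the form shown above into database, accession number and entry name. It assumes the `sw|ACCESSION|ENTRY_NAME description` header layout of this example; other servers format headers differently. The function names are hypothetical, and only the first two sequence lines are included to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:             # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.replace(" ", ""))  # tolerate blocks separated by spaces
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_swissprot_header(header):
    """Split 'sw|ACCESSION|ENTRY_NAME description' into its parts (assumed layout)."""
    identifier, _, description = header.partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {"database": database, "accession": accession,
            "entry_name": entry_name, "description": description}

# Truncated copy of the record above (first two sequence lines only)
example = """>sw|Q10644|YD33_MYCTU Hypothetical protein Rv1333.
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
"""

header, sequence = parse_fasta(example)[0]
print(split_swissprot_header(header))
print(len(sequence), "residues read")
```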

B. Now search for homologous sequences using BLAST:

BLAST servers: EMBL, NCBI, EBI.

For EMBL WU-BLAST the initial parameters could be:

program=blastp
database=nrdb95
filter=SEG
descriptions=250
alignments=100
Click on "Submit Query"

Pre-computed results:

Questions

Which proteins are identified in the BLAST searches?
Do the annotations provide any information?
What does the "X" mean in the alignments?
Check whether different results are obtained using SEG as filter or not.

C. Since there is no obvious way to infer the function of the protein, we can go on to use PROFILES:

There are many possible strategies, but the most obvious one is to recover the protein sequences identified as similar with BLAST and generate a MSA with ClustalW. Then we would generate a Profile or an HMM profile, and we would use the profile to search again in protein sequence databases.
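
To make the "alignment, then profile, then search" idea concrete, here is a minimal sketch of how a simple log-odds profile (PSSM) could be derived from an ungapped alignment and slid along a target sequence. It is a teaching toy under strong assumptions: a uniform background frequency of 1/20, a flat pseudocount, no gaps and no sequence weighting. Real profile and HMM programs are considerably more elaborate. The alignment and the target sequence are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM (one dict per column) from equal-length aligned sequences.

    Assumes a uniform background of 1/20; characters that are not one of the
    20 standard residues (for example gaps) are ignored when counting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def best_window(pssm, sequence):
    """Slide the profile along the sequence without gaps.

    Returns (best score, 0-based start); non-standard residues score 0.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
        best = max(best, (score, start))
    return best

# Invented toy alignment and target sequence
toy_alignment = ["GDSGGP", "GDSGGA", "GDSGSP", "GNSGGP"]
pssm = build_pssm(toy_alignment)
score, start = best_window(pssm, "AAAGDSGGPAAA")
print(f"best window starts at position {start + 1} with score {score:.2f}")
```

Conserved columns contribute large positive scores and variable columns contribute less, which is what lets a profile pick up relatives that a single query sequence would miss.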

The easiest approach is to use PSI-BLAST, because the same program automatically performs an initial BLAST search, retrieves similar sequences, generates an alignment, constructs a profile, and performs a new search, this time using the profile. The newly identified sequences are retrieved and aligned, and the profile is updated with the new information. The process is iterated as many times as the user decides, or until no new sequences matching the profile are identified in the database.
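
The iteration described above can be sketched in a few lines. This is a toy illustration of the loop only, not PSI-BLAST itself: sequences are ungapped and of equal length, the profile is a simple log-odds matrix with a uniform background, and a fixed raw-score cutoff stands in for the E-value inclusion threshold. All names, sequences and the cutoff value are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(sequences, pseudocount=1.0):
    """Log-odds profile from equal-length sequences (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                        for aa in AMINO_ACIDS})
    return profile

def score(profile, sequence):
    """Sum of per-column log-odds scores (standard residues only)."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

def iterative_search(query, database, threshold, max_rounds=5):
    """Return the set of hit names found in each round until nothing new appears."""
    included = [query]            # round 1 uses a profile built from the query alone
    history = []
    for _ in range(max_rounds):
        profile = build_profile(included)
        hits = {name for name, seq in database.items()
                if score(profile, seq) >= threshold}
        if history and hits == history[-1]:
            break                 # converged: same hits as in the previous round
        history.append(hits)
        included = [query] + [database[name] for name in sorted(hits)]
    return history

# Invented toy data: two close relatives, one distant relative, one unrelated sequence
query = "ACDEFGHI"
database = {
    "close1": "ACDEFGHV",
    "close2": "ACDEYGHI",
    "distant": "ACNEYGKV",
    "unrelated": "WWPPMMTT",
}
for round_number, hits in enumerate(iterative_search(query, database, threshold=5.0), 1):
    print(round_number, sorted(hits))
```

In this toy the distant sequence scores below the cutoff against the query alone, but above it once the two close relatives have been folded into the profile. That is the behaviour to look for when you compare the rounds of a real PSI-BLAST run.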

Using PSI-BLAST at the NCBI:

BLAST is at http://www.ncbi.nlm.nih.gov/BLAST/. Choose "Protein BLAST / PSI- and PHI-BLAST".
Paste the sequence in "Search".
Descriptions=500
Alignments=100
Format for PSI-BLAST -> with inclusion threshold = 1e-05 (=0.00001)
Hit "BLAST!"

After a while click on "format" to display the results of the first round search.

Choose the proteins you want to use to construct the profile and select "Run PSI-BLAST iteration 2".

Here are the pre-computed results of the FIRST ROUND, the SECOND ROUND and the THIRD ROUND.

Questions:

Have we detected more possible homologues with PSI-BLAST than with BLAST?
Are the annotations clearer?
Can we make any hypothesis about the function?

D. Searching against the HMM profile libraries of Pfam

Now we will try to identify whether there is an HMM profile for this family of proteins in Pfam.

Using Pfam

Go to Pfam: http://www.sanger.ac.uk/Pfam/.
Paste the sequence.
E-value=10
Hit "Search Pfam".

The pre-computed results of the search in Pfam are here.

Questions:

How many domains does this protein have?
Where are they?
What are their functions? (Click on each of them to access their individual entries)
What can we say about the function of the protein?
How many proteins have the domain peptidase_S58?
Which other domains appear associated with the domain peptidase_S58? (Go to the Domain Organization box and click on "View Graphic")

2. Families

Protein coded by the gene gcsf of Bos taurus (Granulocyte colony-stimulating factor precursor)

In this exercise we will illustrate the importance of defining protein families in sequence analyses.

A. First, get the sequence:

Go to: http://srs.ebi.ac.uk/.
Choose one of the following databases to look for it: SwissProt, SpTrEMBL or TrEMBL (updates).
Click on "Query forms => Standard".
Search for: "GeneName=gcsf", "Organism name=Bos taurus". Click on "submit query".
And now the same as before: "FastaSeqs" and "View"

>sw|P35833|CSF3_BOVIN Granulocyte colony-stimulating factor precursor (G-CSF).
MKLMVLQLLLWHSALWTVHEATPLGPARSLPQSFLLKCLEQVRKIQADGAELQERLCAAH
KLCHPEELMLLRHSLGIPQAPLSSCSSQSLQLTSCLNQLHGGLFLYQGLLQALAGISPEL
APTLDTLQLDVTDFATNIWLQMEDLGAAPAVQPTQGAMPTFTSAFQRRAGGVLVASQLHR
FLELAYRGLRYLAEP

B. PSI-BLAST search.

Proceed as in the previous exercise.
Here are the pre-computed results for FIVE ROUNDS.
Try to understand how the results differ from one round to the next.

Questions:

What has PSI-BLAST shown us?
In the second round this appears: "Q90YI0 (Q90YI0) Interleukin-6 precursor". It has an identity of 20% with respect to the "query", but a significant e-value. Why?
What are the consequences of this last sequence being included in the profile?
Why in the last round do the proteins annotated as "interleukin..." have better e-values than the proteins annotated as "granulocyte..."?
All the proteins that have been identified have the same evolutionary origin, but do they have the same function?
Can you identify subfamilies?
How do you think the existence of subfamilies affects function prediction based on the identification of homologous proteins?
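
When thinking about the 20% identity figure quoted above, it helps to be explicit about how percent identity is computed from a pairwise alignment. The sketch below uses one common convention (identical residues divided by the number of columns in which both sequences have a residue). Other tools divide by the full alignment length or by the length of the shorter sequence, so reported values are not always comparable. The aligned strings are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue                  # gap columns are skipped in this convention
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0

# Invented toy alignments
print(percent_identity("ACDEFGHIKL", "ACDEYGHIKV"))  # 80.0
print(percent_identity("AC-DE", "ACWDQ"))            # 75.0
```

Percent identity treats every column equally, whereas a profile score weights each column by how conserved it is in the family. This is why a sequence with low overall identity to the query can still obtain a significant e-value against the profile.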

C. Pfam search.

Connect to Pfam.
Repeat the steps indicated in Exercise 1 (section D).
Pre-computed results are here

Questions:

Read the Pfam documentation for the protein domain. Which subfamilies are grouped here?
There is a remote relationship (e-values of 0.39 and 5.6) with the families IL-11 and IL-12. Could it be that the family IL6 documented in Pfam has a common ancestor with IL-11 and IL-12?
To address the previous question it would be necessary to build a multiple alignment to see whether the families are similar, or to compare their structures.

D. Search in InterPro.

Open a connection to InterProScan at: http://www.ebi.ac.uk/InterProScan/.

Paste the amino acid sequence of gcsf.

Enter your e-mail address (any expression with the character @ is OK)

And "Submit job".

Pre-computed results can be found here.

Questions:

With which InterPro entries are there similarities?
Is there a hierarchy?
Is InterPro of any use in determining to which subfamily gcsf belongs?

3. Domains

ICE9_HUMAN, from Homo sapiens; precursor of caspase-9.

With this example we will illustrate the importance of considering the multi-domain organization of many proteins.

A. First, you should get the sequence from SwissProt using the identifier ICE9_HUMAN:

>ICE9_HUMAN
MDEADRRLLR RCRLRLVEEL QVDQLWDALL SRELFRPHMI EDIQRAGSGS RRDQARQLII
DLETRGSQAL PLFISCLEDT GQDMLASFLR TNRQAAKLSK PTLENLTPVV LRPEIRKPEV
LRPETPRPVD IGSGGFGDVG ALESLRGNAD LAYILSMEPC GHCLIINNVN FCRESGLRTR
TGSNIDCEKL RRRFSSLHFM VEVKGDLTAK KMVLALLELA QQDHGALDCC VVVILSHGCQ
ASHLQFPGAV YGTDGCPVSV EKIVNIFNGT SCPSLGGKPK LFFIQACGGE QKDHGFEVAS
TSPEDESPGS NPEPDATPFQ EGLRTFDQLD AISSLPTPSD IFVSYSTFPG FVSWRDPKSG
SWYVETLDDI FEQWAHSEDL QSLLLRVANA VSVKGIYKQM PGCFNFLRKK LFFKTS

B. Then we will look for related sequences with EMBL-BLAST:

Pre-computed results are here.
Study the results to check what kind of proteins have been found.
Take a look at the graphic representation of the sequence matches.
Questions:

What can you conclude from the graphical representation of the alignments?
Could it be related to the presence of different domains?
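
If you want to reproduce the idea behind the graphical overview with numbers, you can count how many hits cover each position of the query and then list the contiguous covered blocks. The sketch below does this for a query of 416 residues, the length of the ICE9_HUMAN sequence shown above. The hit coordinates are hypothetical and chosen only to illustrate the calculation. They are not taken from the pre-computed BLAST results.

```python
def coverage_depth(query_length, hits):
    """Number of hits covering each query position; hits are 1-based inclusive ranges."""
    depth = [0] * query_length
    for start, end in hits:
        for position in range(start - 1, end):
            depth[position] += 1
    return depth

def covered_blocks(depth, min_depth=1):
    """Contiguous 1-based (start, end) blocks covered by at least min_depth hits."""
    blocks, block_start = [], None
    for index, value in enumerate(depth + [0]):   # trailing 0 closes an open block
        if value >= min_depth and block_start is None:
            block_start = index + 1
        elif value < min_depth and block_start is not None:
            blocks.append((block_start, index))
            block_start = None
    return blocks

# HYPOTHETICAL hit ranges on a 416-residue query (illustration only)
hypothetical_hits = [(1, 90), (5, 88), (150, 400), (160, 416)]
depth = coverage_depth(416, hypothetical_hits)
print(covered_blocks(depth))               # blocks covered by at least one hit
print(covered_blocks(depth, min_depth=2))  # blocks covered by at least two hits
```

Hits that pile up on separate stretches of the query, with little or nothing in between, are the pattern to compare against the graphic in the real results.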

C. Now we will search Pfam

The pre-computed result is here.
Questions:

How many domains does ICE9_HUMAN have?
What is the function of the various domains? (Click on each of them to access their individual entries).
Is there any reason why these domains should appear together?
For each of the domains, find out which other domains it is associated with (Go to the Domain Organization box and click on "View Graphic").
If you check the SwissProt entry for ICE9_HUMAN you will learn that this protein interacts with the protein APAF-1. If you search SRS with "description=apaf-1" + "organism name=Homo sapiens", you will find that the SwissProt entry for APAF-1 is APAF_HUMAN. From the SwissProt entry you can follow the link to the Pfam entry for APAF_HUMAN. Having done all this, can you work out how caspase-9 and APAF-1 interact?

Think a little bit:

Protein comparisons indicate that protein domains often appear combined with different partners (shuffled).
What are the implications of domain shuffling for protein function prediction?
What are the implications of domain shuffling for the construction of multiple sequence alignments?

D. Construction of a profile from a MSA, and identification of similar proteins by profile search.

To continue with the analysis of (part of) ICE9_HUMAN, we will use Bioaccelerator, a server that allows the generation of profiles from MSAs and the running of profile searches against databases. This is equivalent to what we could do with PSI-BLAST, but it is a better method in the sense that the user has complete control over the alignment that is used to construct the profile.

Since, from the previous exercise, it is obvious that to study multi-domain proteins it may be necessary to isolate the domains, we will concentrate only on the CARD domain.
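
Isolating a domain simply means cutting the corresponding residue range out of the full-length sequence. The sketch below does this for a label in the `NAME/start-end` style used in Pfam alignments, such as the ICE9_HUMAN/2-92 entry that appears in the sequence weights listing further down. The function name is hypothetical, and only the first 120 residues of ICE9_HUMAN are included to keep the example short.

```python
def extract_region(sequence, label):
    """Cut the region named by 'NAME/start-end' (1-based, inclusive) out of a sequence."""
    name, _, coordinates = label.partition("/")
    start, end = (int(value) for value in coordinates.split("-"))
    residues = "".join(sequence.split())   # drop the spaces between 10-residue blocks
    if not 1 <= start <= end <= len(residues):
        raise ValueError("coordinates fall outside the sequence")
    return name, residues[start - 1:end]

# First two lines (120 residues) of the ICE9_HUMAN sequence shown above
ice9_start = """
MDEADRRLLR RCRLRLVEEL QVDQLWDALL SRELFRPHMI EDIQRAGSGS RRDQARQLII
DLETRGSQAL PLFISCLEDT GQDMLASFLR TNRQAAKLSK PTLENLTPVV LRPEIRKPEV
"""

name, card = extract_region(ice9_start, "ICE9_HUMAN/2-92")
print(f">{name}/2-92")
print(card)
```

The resulting fragment can then be used on its own as the query for BLAST or as one row of a domain alignment, so that matches to the other domains of the protein do not interfere.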

Procedure:

1. We will obtain a MSA of the CARD domain, for example from the Pfam CARD entry (locate the "Alignment" box, select FASTA format and click on "Get alignment").

The alignment, visualized with Belvu, looks like this:

2. Now, we will use the FASTA formatted alignment to construct a profile or PSSM (position specific scoring matrix), with the program ProfileWeight.

Once the PSSM matrix (profile) has been generated, we will save it in a file.

A piece of information that can be obtained from the PSSM file is the weight that has been defined for each sequence. More divergent sequences have larger weights.

Sequence Weights:
   1 CED4_CAEEL/3-90      100
   2 RIK2_HUMAN/436-524    94
   3 CRAD_HUMAN/2-89       92
   4 ICE2_HUMAN/16-104     54
   5 ICE2_CHICK/8-96       62
   6 ICE9_HUMAN/2-92       83
   7 CED3_CAEVU/3-91       56
   8 CED3_CAEEL/3-91       58
   9 Q66677/22-110         89
  10 APAF_HUMAN/2-90       96
  11 ICEB_MOUSE/2-94       79
  12 ICE5_HUMAN/44-132     69
  13 ICED_BOVIN/2-91       62
  14 ICE4_HUMAN/2-91       65
  15 BIR2_MOUSE/437-525    58
...etc.
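
To see how a weighting scheme can give divergent sequences larger weights, here is a minimal sketch of position-based sequence weighting in the style of Henikoff and Henikoff. This is an assumption made for illustration: the scheme actually used by ProfileWeight is not stated here and may differ. In each column, a sequence receives 1/(k × n), where k is the number of different residue types in the column and n is the number of sequences sharing its residue. The contributions are summed over columns and, purely to mimic the look of the listing above, scaled so that the largest weight is 100. The alignment is invented.

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, scaled so that the largest weight is 100."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        kinds = len(counts)                      # k: distinct residue types in the column
        for index, residue in enumerate(column):
            # 1 / (k * n): rare residues in a column earn a larger share
            weights[index] += 1.0 / (kinds * counts[residue])
    top = max(weights)
    return [round(100 * weight / top) for weight in weights]

# Invented toy alignment: two identical sequences, one close variant, one divergent
toy_alignment = ["ACDEF", "ACDEF", "ACDEY", "GCNKW"]
print(position_based_weights(toy_alignment))  # [44, 44, 52, 100]
```

The two identical sequences share their weight, while the divergent one receives the largest value. This keeps a group of near-identical sequences from dominating the profile.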

3. Now we can use this profile to search a database of sequences. Click on the link "ProfileSearch".

Click on "Upload file" to load the PSSM profile.

Then, set the gap opening and gap extension penalties to 11 and 1, respectively.

4. Finally, launch the search.

The pre-computed results can be found here.

Questions:

Why do more divergent sequences have larger weights in the profile generated with ProfileWeight?

Compare the results of the profile search with a BLAST search using one of the sequences in the CARD MSA, ICE2_HUMAN:

>ICE2_HUMAN/16-104
HPHHQETLKKNRVVLAKQLLLSELLEHLLEKDIITLEMRELIQAKVGSFSQNVELLN
LLPKRGPQAFDAFCEALRETKQGHLEDMLLTT

Which method is capable of detecting more homologues?

Michael Tress
Protein Design Group