In this practical lesson we will go through several examples to illustrate the concepts of "PROFILES", "MOTIFS", "DOMAINS" and "FAMILIES". Each exercise is centered on the analysis of a specific sequence.
The examples include results from database searches or from the application of several tools.
However, the idea is that you repeat each of the analyses mentioned yourself.
A selection of links to Databases and Tools:
General Tools:
Pattern Tools and Databases:
- PROSITE: Database of protein motifs expressed as patterns or profiles
- ScanProsite (several mirrors): Scans a sequence against PROSITE or a pattern against SWISS-PROT and TrEMBL
Profile Tools and Databases:
- PRATT: Generation of patterns (regular expressions) from a group of unaligned sequences.
- NCBI PSI-BLAST: Automatic generation of profiles in iterative searches.
- ProfileScan: Scans a sequence to find matches to protein patterns and profiles in PROSITE and Pfam.
- MOTIF: Scans sequences to find motifs; databases to find sequences that match profiles or patterns; and generates profiles from sequences provided by the user.
- InterPro: Database of protein families defined by the presence of common motifs and domains, as described in several databases such as Pfam, SMART, PROSITE, and others.
- Pfam: Database of protein HMM profiles that define domain families. Mirrors:
  - Sanger Institute (UK)
  - St. Louis (USA)
  - Karolinska Institutet (Sweden)
  - Institut National de la Recherche Agronomique (France)
- Bioaccelerator: Allows the generation of profiles and the performance of several kinds of searches.
- MEME: Identifies conserved motifs in groups of sequences, and generates profiles that can be used to search for related sequences.
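Several of the tools above work with patterns written as regular expressions. As a rough illustration of what scanning a sequence against a pattern involves, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and reports the matches. The function names `prosite_to_regex` and `scan` are hypothetical, the sketch covers only the basic syntax (`x`, `[...]`, `{...}`, repetitions, and the `<` / `>` anchors outside brackets), and it is not how ScanProsite itself is implemented. The example pattern is the N-glycosylation site pattern, N-{P}-[ST]-{P}, and the sequence is a toy string.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: anchors inside square brackets are not handled.
    """
    pattern = pattern.strip().rstrip(".")  # drop the optional final period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                    # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {ABC} means any residue except A, B, C
        # [ABC] and single letters are already valid regex fragments.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("AANGTSAA", "N-{P}-[ST]-{P}"))
```

The lookahead in `scan` is a design choice that lets overlapping matches be reported; a plain `finditer` on the translated pattern would skip them.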
YD33_MYCTU, from Mycobacterium tuberculosis, "Hypothetical protein Rv1333".
You should obtain the following sequence:
>sw|Q10644|YD33_MYCTU Hypothetical protein Rv1333.
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
GWNCRPTADFGYSACAAAGVDVAVGTVGVGVGARAGALKGGVGTASATLQSGVTVGVLAV
VNAAGNVVDPATGLPWMADLVGEFALRAPPAEQIAALAQLSSPLGAFNTPFNTTIGVIAC
DAALSPAACRRIAIAAHDGLARTIRPAHTPLDGDTVFALATGAVAVPPEAGVPAALSPET
QLVTAVGAAAADCLARAVLAGVLNAQPVAGIPTYRDMFPGAFGS
Protein encoded by the gcsf gene of Bos taurus (Granulocyte colony-stimulating factor precursor).
>sw|P35833|CSF3_BOVIN Granulocyte colony-stimulating factor precursor (G-CSF).
MKLMVLQLLLWHSALWTVHEATPLGPARSLPQSFLLKCLEQVRKIQADGAELQERLCAAH
KLCHPEELMLLRHSLGIPQAPLSSCSSQSLQLTSCLNQLHGGLFLYQGLLQALAGISPEL
APTLDTLQLDVTDFATNIWLQMEDLGAAPAVQPTQGAMPTFTSAFQRRAGGVLVASQLHR
FLELAYRGLRYLAEP
ICE9_HUMAN, from Homo sapiens; precursor of caspase-9.
>ICE9_HUMAN
MDEADRRLLR RCRLRLVEEL QVDQLWDALL SRELFRPHMI EDIQRAGSGS RRDQARQLII
DLETRGSQAL PLFISCLEDT GQDMLASFLR TNRQAAKLSK PTLENLTPVV LRPEIRKPEV
LRPETPRPVD IGSGGFGDVG ALESLRGNAD LAYILSMEPC GHCLIINNVN FCRESGLRTR
TGSNIDCEKL RRRFSSLHFM VEVKGDLTAK KMVLALLELA QQDHGALDCC VVVILSHGCQ
ASHLQFPGAV YGTDGCPVSV EKIVNIFNGT SCPSLGGKPK LFFIQACGGE QKDHGFEVAS
TSPEDESPGS NPEPDATPFQ EGLRTFDQLD AISSLPTPSD IFVSYSTFPG FVSWRDPKSG
SWYVETLDDI FEQWAHSEDL QSLLLRVANA VSVKGIYKQM PGCFNFLRKK LFFKTS
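The sequences above are laid out in different ways (continuous lines, or blocks of ten residues separated by spaces), and some tools are stricter than others about the input they accept. A minimal sketch, assuming the records are pasted as FASTA-like text, that joins the lines of each record and removes internal whitespace. The function name `read_fasta` is hypothetical and the input shown is a toy record, not one of the exercise sequences.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA-like text into {header: sequence}, dropping all whitespace."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # ignore blank lines
        if line.startswith(">"):
            header = line[1:]              # header text without the '>' sign
            records[header] = []
        elif header is not None:
            # Remove spaces between residue blocks and normalise to upper case.
            records[header].append("".join(line.split()).upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


if __name__ == "__main__":
    toy = ">toy\nACDE FGHI\nKLMN\n"
    for name, seq in read_fasta(toy).items():
        print(name, len(seq), seq)
```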
One piece of information that can be obtained from the PSMM file is the weight assigned to each sequence. More divergent sequences receive larger weights, so that groups of very similar sequences do not dominate the profile.
Sequence Weights:
1 CED4_CAEEL/3-90 100
2 RIK2_HUMAN/436-524 94
3 CRAD_HUMAN/2-89 92
4 ICE2_HUMAN/16-104 54
5 ICE2_CHICK/8-96 62
6 ICE9_HUMAN/2-92 83
7 CED3_CAEVU/3-91 56
8 CED3_CAEEL/3-91 58
9 Q66677/22-110 89
10 APAF_HUMAN/2-90 96
11 ICEB_MOUSE/2-94 79
12 ICE5_HUMAN/44-132 69
13 ICED_BOVIN/2-91 62
14 ICE4_HUMAN/2-91 65
15 BIR2_MOUSE/437-525 58
...etc.
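To see why divergent sequences end up with larger weights, it can help to compute weights for a small alignment by hand. The sketch below uses the position-based weighting scheme of Henikoff and Henikoff, scaled so that the largest weight is 100 as in the listing above. This is an assumption made for illustration: the scheme actually used to produce the weights in the file is not stated here and may differ. The function name `position_based_weights` and the toy alignment are hypothetical, and gap characters would simply be counted as one more symbol.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights, scaled so the largest weight is 100.

    In each column a sequence receives 1 / (k * n), where k is the number of
    distinct symbols in the column and n is the number of sequences that share
    this sequence's symbol. Rare residues therefore contribute more weight.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)                    # distinct symbols in this column
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    top = max(weights)
    return [round(100 * w / top) for w in weights]


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDE", "ACDF", "GHIK"]
    for seq, weight in zip(toy_alignment, position_based_weights(toy_alignment)):
        print(seq, weight)
```

In the toy alignment the two identical sequences share the lowest weight, while the sequence that differs at every position receives the highest one.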
Michael Tress
Protein Design Group