Sequence-based Methods for Functional Assignment

Pairwise and Multiple Sequence Alignments and Similarity Searches

Introduction

Alignments are fundamental to most bioinformatics methods. Good alignments can be used to infer the common ancestry of a set of sequences, to make predictions about function and structure, and to identify conserved regions that may be of structural or functional importance.
In this part of the practical we will encounter some basic alignment techniques using server-based sequence database search methods and multiple sequence alignment tools.

Tools and Databases

BLAST Servers

EMBL WU-BLAST
NCBI BLAST

Servers for Aligning Two Sequences

EBI Pairwise Alignment - with Smith-Waterman (local) and Needleman-Wunsch (global) alignment tools
NCBI bl2seq
Lipman and Pearson's Align program
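
The difference between the global (Needleman and Wunsch) and local (Smith and Waterman) approaches used by the pairwise servers above can be illustrated with a small dynamic-programming sketch. It is not the code run by any of these servers: the function name alignment_score and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, whereas real tools use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    The scoring scheme is a toy one (assumption, not a real matrix).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: gaps at the start of either sequence are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local alignment: a negative-scoring prefix is discarded.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global score is read from the bottom-right cell, local score from the best cell.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))              # global score
    print(alignment_score("ACGT", "AGT", local=True))  # local score
```

The only differences between the two modes are the initialisation of the first row and column, the floor at zero, and where the final score is read from. A full implementation would also keep traceback pointers in order to recover the alignment itself.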

Multiple Alignment Servers

ClustalW server
T-Coffee server

Sequence Databases

UniProt - the most comprehensive database of catalogued protein sequences.
Swiss-Prot - a curated database of highly annotated protein sequences.
PIR Database - another curated sequence database.
TrEMBL - a computer-annotated database of sequences translated from the EMBL nucleotide database.

Multiple Sequence Alignment Viewer

BoxShade

Patterns and Profiles, Protein Motifs and Domains

Introduction

In general, protein sequences can be grouped into clusters. It is possible to identify groups of sequences that are maximally similar to one another and minimally similar to the members of other groups. These clusters of evolutionarily related protein sequences are called families.
The sequence information from alignments of these clusters or families can be combined in order to search for function or for more distantly related sequences (remote homologues). Evolutionary information from aligned homologous sequences combined in this way is known as a profile.
Another general property of sequences is that sequence similarity may be restricted to short stretches of sequence called domains and motifs.
The definition of these conserved sub-sequences depends on their size and function. Domains are stretches of sequence that appear as structural modules, often within many proteins. Motifs are short conserved sub-sequences that often correspond to active or functional sites. Motifs can be used to help predict function and in the identification of remote homologues.
In the second part of the practical we will go through several examples to illustrate the concepts of patterns, profiles, motifs, domains and families. Each of the exercises will be centered on the analysis of a specific sequence. The examples include results from database searches or from the application of a range of tools.
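
To make the idea of a profile more concrete, the sketch below builds a simple position-specific log-odds matrix from a toy alignment and slides it along a sequence. It is only an illustration under stated assumptions: the function names (build_profile, score_window, best_match), the uniform background frequency and the pseudocount of 1 are all hypothetical choices, and the profile methods used by real tools (sequence weighting, substitution-matrix-based pseudocounts, gap handling, HMMs) are considerably more elaborate.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")

    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    profile = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return profile


def score_window(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, segment))


def best_match(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window, or None."""
    width = len(profile)
    scored = [
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored) if scored else None


if __name__ == "__main__":
    toy_alignment = ["GKST", "GKSS", "GKTT"]  # made-up family, for illustration only
    profile = build_profile(toy_alignment)
    print(best_match(profile, "AAAGKSTAAA"))
```

Because every column contributes according to how often each residue was observed in the family, a sequence can score well against the profile without being closely similar to any single family member, which is what makes profiles useful for detecting remote homologues.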

Tools and Databases

InterPro - an integrated database of protein families, domains, motifs and functional sites.
Blocks - multiply aligned ungapped segments for the most highly conserved regions of proteins.
Motif - a server that scans databases to find motifs or patterns and that can generate sequence profiles.
Pfam - multiple sequence alignments and HMMs of protein domains and families.
PRINTS - database of groups of conserved motifs, or protein fingerprints.
ProDom - protein domain families automatically generated from Swiss-Prot and TrEMBL.
PROSITE - database of protein families and domains defined by functional sites, patterns and profiles.
SMART - Simple Modular Architecture Research Tool for the identification of domains.
COGs database - clusters of sequences determined by comparing sequences from whole genomes.
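
As an illustration of how a pattern of the kind stored in PROSITE can be used to scan a sequence for a motif, the sketch below translates a subset of the PROSITE pattern syntax into a Python regular expression. The helper names (prosite_to_regex, find_motif) and the test sequence are hypothetical, and only the basic elements are handled: single residues, x for any residue, [..] for allowed residues, {..} for excluded residues, repeat counts such as x(4) or x(2,4), and the < and > anchors outside brackets. The example pattern is the ATP/GTP-binding site motif A (P-loop).

```python
import re

# One pattern element: optional '<', a core, an optional repeat count, optional '>'.
_ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Translate a (subset of a) PROSITE pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # Single letters and [..] groups are already valid regex syntax.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    p_loop = "[AG]-x(4)-G-K-[ST]"
    print(prosite_to_regex(p_loop))
    print(find_motif("MGAGESGKSTLL", p_loop))  # made-up sequence
```

Short patterns like this one match many unrelated sequences by chance, so a hit is a hint about function rather than evidence on its own.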

References

Prediction of Protein Function Tutorial at the EBI

Identifying Domains and Motifs

Domains, motifs, and clusters in the protein universe - a paper by Liu and Rost

Michael Tress
Protein Design Group