Entrez Exercises

A. Simple Searching

Look for all "photosystem"-related sequences in the Nucleotide database (use the wildcard "*").
How many spinach sequences exist in the Nucleotide database?
And how many in the Protein, Structure, or Genome databases?
Display the FASTA view of the Protein entry AAD02267.
Display the graphics view of the Nucleotide entry corresponding to the Protein entry AAD02267.
How many non-EST Nucleotide sequences were published by someone named "Jones"?
How many potato polypeptides are included in the Structure database?
Search for all plant proteins with a molecular weight between 50,000 and 50,050 daltons (use the field range format "050000:050050[MOLWT]").
Search for all plant proteins with a sequence length from 300 to 310 amino acids.
How many spinach proteins are shorter than 50 amino acids?
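
The molecular weight exercise above depends on the fixed six-digit, zero-padded format described in the field table. A minimal Python sketch of how such a range term could be assembled follows; the helper name molwt_range is hypothetical and is not part of Entrez, and the function only builds the text to paste into the search box.

```python
def molwt_range(low: int, high: int) -> str:
    """Build a molecular weight range term with six-digit, zero-padded bounds.

    Hypothetical helper: it only formats the query text, it does not
    contact Entrez.
    """
    if not (0 <= low <= high <= 999_999):
        raise ValueError("bounds must satisfy 0 <= low <= high <= 999999")
    # :06d pads with leading zeros (digit 0, not letter O) to six digits.
    return f"{low:06d}:{high:06d}[MOLWT]"


print(molwt_range(50_000, 50_050))  # 050000:050050[MOLWT]
```

The printed string matches the range format quoted in the exercise.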

B. Refining your search

How many non-EST Nucleotide sequences were published by someone named "Smith" and have a sequence length of 3000 to 4000 nucleotides? How many of them were not published in 1999? Perform three independent searches and use "History" to combine them [((#1 AND #2) NOT #3)].
What are the differences among [((#1 AND #2) NOT #3)] / [(#1 AND (#2 NOT #3))] / [((#1 AND #2) OR #3)] / [((#1 NOT #2) AND #3)]?
Search for all plant rRNA sequences in the Nucleotide database (use "Limits" to restrict to rRNA in the "Molecule" pull-down menu).
Search for all arabidopsis mitochondrion genes (using "Limits").
Search for all arabidopsis chloroplast genes in the Nucleotide database updated in the last year.
Retrieve all genomic plant genes in the Nucleotide database with a publication date from 1990 to 1995.
Using "Index", obtain the number of tomato sequences in the Nucleotide database. How many non-tomato entries contain the word "tomato"?
Search for all the protein sequences from chloroplasts of spinach, tomato and potato.
Search for all the genomic sequences of protein kinases from arabidopsis.
Using "History", get all glucanase sequences from spinach, tomato and potato, excluding ESTs and patents.
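
The "History" combinations in this section can be reasoned about as set operations: AND as intersection, OR as union, NOT as difference. The following Python sketch uses three small made-up sets of record IDs standing in for searches #1, #2 and #3; the IDs are purely illustrative and do not come from Entrez.

```python
# Toy result sets standing in for History entries #1, #2 and #3.
# The integers are hypothetical record IDs, not real Entrez results.
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5, 6}
s3 = {1, 4, 6, 7}

combos = {
    "((#1 AND #2) NOT #3)": (s1 & s2) - s3,  # in #1 and #2, but not in #3
    "(#1 AND (#2 NOT #3))": s1 & (s2 - s3),  # in #1, and in #2 but not #3
    "((#1 AND #2) OR #3)": (s1 & s2) | s3,   # in both #1 and #2, or in #3
    "((#1 NOT #2) AND #3)": (s1 - s2) & s3,  # in #1 but not #2, and in #3
}

for expression, result in combos.items():
    print(expression, sorted(result))
```

Under plain set semantics the first two expressions always select the same records, while the last two generally differ from them and from each other.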

C. Link Out

Obtain the domain distribution of the protein sequence of "phosphoinositide specific phospholipase C" from Arabidopsis thaliana.

Field description

Search Field	Definition	Qualifier
Accession	Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB IDs but not the MMDB IDs.	[ACCN]
All Fields	Contains all terms from all searchable database fields in the database.	[ALL]
Author Name	Contains all authors from all references in the database records. The format is last name space first initial(s), without punctuation (e.g., marley jf).	[AUTH]
EC/RN Number	Number assigned by the Enzyme Commission or Chemical Abstract Service (CAS) to designate a particular enzyme or chemical, respectively.	[ECNO]
Feature Key	Contains the biological features assigned or annotated to the nucleotide sequences and defined in the DDBJ/EMBL/GenBank Feature Table (http://www.ncbi.nlm.nih.gov/collab/FT/index.html). Not available for the Protein or Structure databases.	[FKEY]
Filter	Contains predetermined or filtered subsets of the various databases. These subsets or filters are created by grouping records that are commonly linked to other Entrez databases or within the same database. For example, the PopSet database Filter index includes PopSet all, PopSet medline, PopSet nucleotide, and PopSet protein. The PopSet medline filter includes all PopSet records with links to PubMed; the PopSet nucleotide filter includes all PopSet records with links to the nucleotide database; and the PopSet protein filter includes all PopSet records with links to the protein database. The PopSet all filter includes all PopSet records. The Nucleotide database Filter index contains many more filters because its records are linked to numerous external resources.	[FILT]
Gene Name	Contains the standard and common names of genes found in the database records. This field is not available in the Structure database.	[GENE]
Issue	Contains the issue number of the journal in which the data were published.	[ISS]
Journal Name	Contains the name of the journal in which the data were published. Journal names are indexed in the database in abbreviated form (e.g., J Biol Chem). Journals are also indexed by their ISSNs. Browse the index if you do not know the ISSN or are not sure how a particular journal name is abbreviated.	[JOUR]
Keyword	Contains special index terms from the controlled vocabularies associated with the GenBank, EMBL, DDBJ, SWISS-PROT, PIR, PRF, or PDB databases. Browse the Keyword indexes of the individual databases to become familiar with these vocabularies. A Keyword index is not available in the Structure database.	[KYWD]
Modification Date	Contains the date on which the most recent modification to the record was indexed in Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). A year alone (e.g., 1999) retrieves all records modified in that year; a year and month (e.g., 1999/03) retrieves all records modified in that month that are indexed in Entrez.	[MDAT]
Molecular Weight	Molecular weight of a protein, in daltons (Da). Note that the molecular weight must be entered as a fixed six-digit field, padded with leading zeros (not the letter O), e.g., 002002 [MOLWT].	[MOLWT]
Organism	Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.	[ORGN]
Page Number	Contains the number of the first journal page of the article in which the data were published.	[PAGE]
Primary Accession	Contains the primary accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. A Primary Accession index is not available in the Structure database.	[PACC]
Properties	Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not available in the Structure database.	[PROP]
Protein Name	Contains the standard names of proteins found in database records. Common names may not be indexed in this field so it is best to also consider All Fields or Text Words. A Protein Name index is not available in the Structure database.	[PROT]
Publication Date	Contains the date on which records were released into Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). This is the date the entry first appeared in GenBank, as indexed in Entrez. A year alone (e.g., 1999) retrieves all records for that year; a year and month (e.g., 1999/03) retrieves all records released into GenBank in that month.	[PDAT]
SeqID String	Contains the special string identifier, similar to a FASTA identifier, for a given sequence. A SeqID String index is not available in the Structure database.	[SQID]
Sequence Length	Contains the total length of the sequence. Sequence Length indexes are not available in the Structure or PopSet databases.	[SLEN]
Substance Name	Contains the names of any chemicals associated with this record from the CAS registry and the MEDLINE Name of Substance field. Substance Name indexes are not available in the Genome or PopSet databases.	[SUBS]
Text Word	Contains all of the "free text" associated with a record.	[WORD]
Title Word	Includes only those words found in the definition line of a record. The definition line summarizes the biology of the sequence and is carefully constructed by database staff. A standard definition line will include the organism, product name, gene symbol, molecule type and whether it is a partial or complete cds. Title Word indexes are not available in the Structure or PopSet databases.	[TITL]
Uid	Contains the Medline unique identifier for records that contain published references that are linked to PubMed. The Uid index is not browsable.	[UID]
Volume	Contains the volume number of the journal in which the data were published.	[VOL]
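
The qualifiers in the table above are appended to a search term in square brackets, and individual terms can be joined with Boolean operators. The Python sketch below shows one way to assemble such query strings; the helper names field_term and field_range are hypothetical, and applying the low:high range form to [PDAT] is an assumption made by analogy with the [MOLWT] range in the exercises.

```python
def field_term(value: str, qualifier: str) -> str:
    """Attach a field qualifier to a search value (hypothetical helper)."""
    return f"{value}[{qualifier}]"


def field_range(low: str, high: str, qualifier: str) -> str:
    """Build a low:high range term for a field (hypothetical helper).

    Assumption: the range form shown for [MOLWT] is used here for other
    fields as well; check the Entrez help before relying on it.
    """
    return f"{low}:{high}[{qualifier}]"


terms = [
    field_term("arabidopsis", "ORGN"),
    field_range("1990", "1995", "PDAT"),
]
query = " AND ".join(terms)
print(query)  # arabidopsis[ORGN] AND 1990:1995[PDAT]
```

The resulting text is only a query string to type into the Entrez search box; nothing here sends a request to NCBI.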