
Computation of Related Articles

      The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths. To carry out such a computation, one must first define what a word is. For our purposes, a word is an unbroken string of letters and numerals that contains at least one letter of the alphabet. Words end at hyphens, spaces, newlines, and punctuation. A list of 310 common but uninformative words (also known as stop words) is eliminated from processing at this stage. Next, a limited amount of stemming of words is done, but no thesaurus is used in processing. Words from the abstract of a document are classified as text words. Words from titles are also classified as text words, but they appear a second time with a special marker designating them as title words. MeSH terms form a third category: a MeSH term with a subheading qualifier is entered twice, once without the qualifier and once with it. Likewise, a MeSH term that is starred (indicating a major concept in the document) is entered once without the star and once with it. These three categories of words (or phrases, in the case of MeSH) comprise the representation of a document; no other fields, such as author or journal, enter into the calculations.
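      To make this representation concrete, the following sketch builds the three categories of terms for a single citation. The stop-word list, the tokenizer, and the handling of MeSH qualifiers and stars are illustrative stand-ins for the actual NLM processing (stemming is omitted), not the production code.

        import re

        # Illustrative stand-in; the real list contains 310 stop words.
        STOP_WORDS = {"the", "of", "and", "in", "a", "an", "to", "is", "for", "with"}

        def tokenize(text):
            # A word is an unbroken run of letters and numerals containing at
            # least one letter; hyphens, spaces, and punctuation end a word.
            words = re.findall(r"[A-Za-z0-9]+", text.lower())
            return [w for w in words
                    if re.search(r"[a-z]", w) and w not in STOP_WORDS]

        def represent(title, abstract, mesh_terms):
            terms = []
            # Abstract words are text words.
            terms += [("text", w) for w in tokenize(abstract)]
            # Title words are text words and also appear again as title words.
            for w in tokenize(title):
                terms.append(("text", w))
                terms.append(("title", w))
            # A MeSH term with a qualifier or a star is entered both with
            # and without it.
            for mesh in mesh_terms:
                base = mesh.split("/")[0].rstrip("*")
                terms.append(("mesh", base))
                if mesh != base:
                    terms.append(("mesh", mesh))
            return terms

        print(represent("Aspirin and platelet aggregation",
                        "Aspirin inhibits platelet aggregation in vivo.",
                        ["Aspirin/pharmacology", "Platelet Aggregation*"]))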

      Having obtained the set of terms that represent each document, the next step is to recognize that not all words are of equal value. Each term is assigned a numerical weight based on information that the computer can obtain by automatic processing. Automatic processing is important because the number of different terms that have to be assigned weights is close to two million for this system. The weight or value of a term depends on three types of information: 1) the number of different documents in the database that contain the term; 2) an estimate of the importance of the term in producing relationships in the database; and 3) the number of times the term occurs in a particular document. The first two pieces of information are combined to produce a number called the global weight of the term, which is used in weighting the term throughout the database. The third piece of information pertains only to a particular document and is used to produce a number called the local weight of the term in that specific document. When a word occurs in two documents, its contribution is computed as the product of its global weight and its local weights in each of the two documents.
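      For example, a term shared by two documents contributes the product of its global weight and its two local weights; the sketch below simply spells out that combination (the weight values themselves are made up for illustration).

        def term_contribution(global_wt, local_wt_doc1, local_wt_doc2):
            # The term's value in each document is global * local; for a pair
            # of documents both local weights enter the product.
            return global_wt * local_wt_doc1 * local_wt_doc2

        # A rare term (high global weight) used once in each document:
        print(term_contribution(9.2, 1.0, 1.0))   # 9.2
        # A common term (low global weight) used more often in both documents:
        print(term_contribution(2.3, 2.6, 3.1))   # about 18.5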

      The global weight of a term is greater for less frequent terms. This is reasonable because a term that occurs in most of the documents tells one very little about any particular document. On the other hand, a term that occurs in only one hundred documents out of one million is very helpful in limiting the set of documents of interest, and a word that occurs in only ten documents is likely to be even more informative and receives an even higher weight. The second factor that enters into the computation of the global weight of a term is what we call the strength of the term. It is defined as the probability that a term occurring in one document will also occur in any other document closely related to the first. For a term of a given frequency, the higher the strength, the greater the global weight. For details of how the global weight is computed, we refer the interested reader to Wilbur and Yang, where section 3 is of particular relevance. The local weight of a term within a document is greater the more frequent the term is in that document.
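      The exact formulas are given in Wilbur and Yang; the sketch below is only meant to show the qualitative behavior described above, using a simple inverse-document-frequency style global weight and a frequency-based local weight in place of the real ones.

        import math

        def global_weight(doc_freq, n_docs):
            # Rarer terms get a higher global weight. The actual computation
            # also folds in the term's strength (Wilbur and Yang, section 3).
            return math.log(n_docs / doc_freq)

        def local_weight(count_in_doc):
            # More frequent within a document means a higher local weight,
            # here with diminishing returns.
            return 1.0 + math.log(count_in_doc)

        million = 1_000_000
        print(global_weight(10, million) > global_weight(100, million))       # True
        print(global_weight(100, million) > global_weight(100_000, million))  # True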

      The similarity between two documents is computed in two steps. The first step is to add up the weights (local wt1 * local wt2 * global wt) of all the terms the two documents have in common. This provides an indication of how related the two documents are. However, this preliminary score suffers from the problem that when a document is scored against both a long document and a short document, the long document will usually win out simply because of its length. To correct for this, we divide the preliminary score by the product of the lengths of the two documents. The resultant score is an example of a vector cosine score. Cosine scoring was originated by Gerard Salton and has a long history in text retrieval; the interested reader is referred to Salton, Automatic Text Processing (Reading, MA: Addison-Wesley, 1989) for further information on this topic. Our approach differs from others in the way we calculate the local and global weights for the individual terms.
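      A sketch of that two-step computation follows, assuming each document is represented as a mapping from term to local weight and the global weights are held in a separate mapping; the length normalization shown is one consistent reading of the description above, not the exact production formula.

        import math

        def doc_length(doc, global_wt):
            # "Length" of a document vector whose pairwise dot product is
            # sum(global * local1 * local2) over shared terms.
            return math.sqrt(sum(global_wt[t] * w * w for t, w in doc.items()))

        def similarity(doc1, doc2, global_wt):
            # Step 1: add up the weights of all terms the two documents share.
            shared = set(doc1) & set(doc2)
            score = sum(global_wt[t] * doc1[t] * doc2[t] for t in shared)
            # Step 2: divide by the product of the two document lengths so a
            # long document does not win just by being long (a cosine score).
            denom = doc_length(doc1, global_wt) * doc_length(doc2, global_wt)
            return score / denom if denom else 0.0

        global_wt = {"aspirin": 6.9, "platelet": 5.8, "aggregation": 5.3, "fever": 6.1}
        doc_a = {"aspirin": 2.0, "platelet": 1.0, "aggregation": 1.0}
        doc_b = {"aspirin": 1.0, "fever": 2.0}
        print(similarity(doc_a, doc_b, global_wt))   # about 0.40
        print(similarity(doc_a, doc_a, global_wt))   # 1.0 for identical documents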

      Once the similarity score of a document in relation to each of the other documents in the database has been computed, that document's neighbors are identified as the most similar (highest-scoring) documents found. These closely related documents are precomputed for each document in MEDLINE, so that when you push the "See Related Articles" button the system has only to retrieve this list. This enables a fast response time for such queries.
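      In outline, the precomputation and the button's lookup might look like the sketch below, which reuses the similarity function from the previous example; the document store and the number of neighbors kept per document are assumptions of the illustration.

        def precompute_neighbors(docs, global_wt, k=5):
            # docs maps each document identifier to its weighted term vector.
            neighbors = {}
            for pmid, vec in docs.items():
                scored = sorted(
                    ((similarity(vec, other, global_wt), other_id)
                     for other_id, other in docs.items() if other_id != pmid),
                    reverse=True)
                neighbors[pmid] = [other_id for _, other_id in scored[:k]]
            return neighbors

        # At query time, "See Related Articles" only needs: neighbors[pmid]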