Friday, 26 September 2008

Semantic similarity


Semantic similarity is a concept whereby a set of documents, or terms within term lists, are assigned a metric based on the likeness of their meaning or semantic content.

According to some authors, semantic similarity differs from semantic relatedness: relatedness includes relations such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all ask, "How much does term A have to do with term B?"

The answer to this question, as given by the many automatic measures of semantic similarity/relatedness, is usually a number, typically between -1 and 1 or between 0 and 1, where 1 signifies extremely high similarity/relatedness and 0 signifies little to none.
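As an illustration of where such ranges come from, here is a minimal sketch of cosine similarity, one widely used measure that is not specific to any system mentioned in this post. For vectors with only non-negative entries (such as word counts) the score falls between 0 and 1; with signed entries it can range from -1 to 1. The function name and the zero-vector convention are choices made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction, close to 1
    print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
    print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```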

An intuitive way of displaying terms according to their semantic similarity is to group closely related terms together and to space more distantly related ones farther apart. This is common, if sometimes subconscious, practice in mind maps and concept maps.

Concretely, this can be achieved in several ways: by defining a topological similarity, using ontologies to define a distance between words, or by statistical means, such as a vector space model that correlates words and textual contexts from a suitable text corpus (co-occurrence). A naive topological metric for terms arranged as nodes in a directed acyclic graph, such as a hierarchy, is the minimal number of edges separating the two term nodes.
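The naive edge-counting metric can be sketched with a breadth-first search. This is only an illustration: the toy hierarchy and the function name `edge_distance` are made up for the example, and edge direction is ignored when counting steps.

```python
from collections import deque

# Hypothetical toy hierarchy: each term maps to its direct parents.
TOY_HIERARCHY = {
    "animal": [],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "sparrow": ["bird"],
}


def edge_distance(parents, a, b):
    """Fewest edges separating terms a and b, ignoring edge direction."""
    # Turn the child -> parents map into an undirected adjacency map.
    neighbours = {}
    for child, direct_parents in parents.items():
        neighbours.setdefault(child, set())
        for parent in direct_parents:
            neighbours[child].add(parent)
            neighbours.setdefault(parent, set()).add(child)
    if a not in neighbours or b not in neighbours:
        raise KeyError("unknown term")
    # Breadth-first search visits nodes in order of increasing distance.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected


if __name__ == "__main__":
    print(edge_distance(TOY_HIERARCHY, "dog", "cat"))      # siblings
    print(edge_distance(TOY_HIERARCHY, "dog", "sparrow"))  # linked via "animal"
```

A smaller distance means more closely related terms, so a similarity score would need some decreasing transformation of this number, for example 1 / (1 + distance).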


1. Algorithmic Detection of Semantic Similarity
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata, namely topical directories, to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
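To make the idea of an information-theoretic, taxonomy-based measure concrete, here is a minimal sketch in the style of Lin's tree measure, 2·log p(lca) / (log p(a) + log p(b)), where p(t) is the fraction of pages stored in the subtree rooted at topic t. This illustrates only the tree-based kind of measure; it is not the graph-based measure the abstract proposes. The topic names and page counts are hypothetical.

```python
import math

# Hypothetical toy taxonomy: topic -> parent (the root maps to None).
PARENT = {
    "Top": None,
    "Science": "Top",
    "Arts": "Top",
    "Physics": "Science",
    "Biology": "Science",
    "Music": "Arts",
}
# Hypothetical number of pages filed directly under each topic.
PAGES = {"Top": 0, "Science": 2, "Arts": 2, "Physics": 4, "Biology": 4, "Music": 4}


def ancestors(topic):
    """The topic itself followed by its ancestors up to the root."""
    chain = []
    while topic is not None:
        chain.append(topic)
        topic = PARENT[topic]
    return chain


def subtree_probability(topic):
    """Fraction of all pages stored in the subtree rooted at topic."""
    total = sum(PAGES.values())
    inside = sum(n for t, n in PAGES.items() if topic in ancestors(t))
    return inside / total


def lowest_common_ancestor(a, b):
    """Deepest topic that lies above (or equals) both a and b."""
    above_b = set(ancestors(b))
    for t in ancestors(a):
        if t in above_b:
            return t
    raise ValueError("topics share no ancestor")


def tree_similarity(a, b):
    """Lin-style score; assumes every subtree holds at least one page."""
    denom = math.log(subtree_probability(a)) + math.log(subtree_probability(b))
    if denom == 0:
        return 1.0  # both topics are the root
    lca = lowest_common_ancestor(a, b)
    return 2 * math.log(subtree_probability(lca)) / denom


if __name__ == "__main__":
    print(tree_similarity("Physics", "Biology"))  # share the "Science" subtree
    print(tree_similarity("Physics", "Music"))    # share only the root
```

Two topics that meet only at the root score 0, because the root's subtree contains every page and so carries no information.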

2. Roget’s Thesaurus and Semantic Similarity
We implemented a system that measures semantic similarity using a computerized 1987 Roget's Thesaurus, and evaluated it by performing a few typical tests. We compare the results of these tests with those produced by WordNet-based similarity measures. One of the benchmarks is Miller and Charles’ list of 30 noun pairs to which human judges had assigned similarity measures. We correlate these measures with those computed by several NLP systems. The 30 pairs can be traced back to Rubenstein and Goodenough’s 65 pairs, which we have also studied. Our Roget’s-based system gets correlations of .878 for the smaller and .818 for the larger list of noun pairs; this is quite close to the .885 that Resnik obtained when he employed humans to replicate the Miller and Charles experiment. We further evaluate our measure by using Roget’s and WordNet to answer 80 TOEFL, 50 ESL and 300 Reader’s Digest questions: the correct synonym must be selected from a group of four words. Our system answers 78.75%, 82.00% and 74.33% of the questions correctly, respectively.
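Correlating a system's scores with human judgments, as in the noun-pair benchmarks above, can be sketched with a Pearson correlation coefficient. The abstract does not say which coefficient was used, so Pearson is an assumption here, and the scores below are invented placeholders rather than data from the benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for four word pairs (not taken from any benchmark).
    human_scores = [3.9, 3.1, 1.2, 0.4]
    system_scores = [16.0, 12.0, 6.0, 2.0]
    print(round(pearson(human_scores, system_scores), 3))
```

The two score lists do not need to share a scale, since the coefficient depends only on how the scores vary together.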

3. A new method to measure the semantic similarity of GO terms
Motivation: Although controlled biochemical or biological vocabularies, such as Gene Ontology (GO), address the need for consistent descriptions of genes in different data sources, there is still no effective method to determine the functional similarities of genes based on gene annotation information from heterogeneous data sources.
Results: To address this critical need, we proposed a novel method to encode a GO term's semantics (biological meanings) into a numeric value by aggregating the semantic contributions of its ancestor terms (including the term itself) in the GO graph and, in turn, designed an algorithm to measure the semantic similarity of GO terms. Based on the semantic similarities of GO terms used for gene annotation, we designed a new algorithm to measure the functional similarity of genes. The results of using our algorithm to measure the functional similarities of genes in pathways retrieved from the Saccharomyces Genome Database (SGD), and the outcomes of clustering these genes based on the similarity values obtained by our algorithm, are shown to be consistent with human perspectives. Furthermore, we developed a set of online tools for gene similarity measurement and knowledge discovery.
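The aggregation idea can be sketched as follows. This is one plausible reading of the abstract, not a confirmed reproduction of the authors' algorithm: the decay weights per relation type, the rule of keeping the largest contribution along any path, and the similarity formula over shared ancestors are all assumptions of this example, as are the toy terms.

```python
# Hypothetical toy graph: term -> list of (parent, relation type).
PARENTS = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("root", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
    "D": [("A", "is_a")],
}
# Assumed decay factor applied each time a contribution moves up one edge.
WEIGHT = {"is_a": 0.8, "part_of": 0.6}


def contributions(term):
    """Contribution of the term and of each of its ancestors to its meaning."""
    scores = {term: 1.0}  # the term contributes fully to itself
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in PARENTS[child]:
            value = WEIGHT[relation] * scores[child]
            # Keep the best value over all paths and propagate improvements.
            if value > scores.get(parent, 0.0):
                scores[parent] = value
                stack.append(parent)
    return scores


def semantic_value(term):
    """Single number encoding the term: the sum of all contributions."""
    return sum(contributions(term).values())


def term_similarity(a, b):
    """Share of the combined semantic value carried by common ancestors."""
    score_a, score_b = contributions(a), contributions(b)
    common = set(score_a) & set(score_b)
    shared = sum(score_a[t] + score_b[t] for t in common)
    return shared / (semantic_value(a) + semantic_value(b))


if __name__ == "__main__":
    print(contributions("C"))
    print(round(term_similarity("C", "D"), 4))
```

Under these assumptions a term compared with itself scores 1, and two terms that share only distant ancestors score close to 0.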
