Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., and Crockett, K. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans. on Knowl. and Data Eng. 18, 8 (Aug. 2006), 1138-1150. DOI= http://dx.doi.org/10.1109/TKDE.2006.130
ABSTRACT
Sentence similarity measures play an increasingly important role in
text-related research and applications in areas such as text mining,
Web page retrieval, and dialogue systems. Existing methods for
computing sentence similarity have been adopted from approaches used
for long text documents. These methods process sentences in a very
high-dimensional space and are consequently inefficient, require human
input, and are not adaptable to some application domains. This paper
focuses directly on computing the similarity between very short texts
of sentence length. It presents an algorithm that takes account of
semantic information and word order information implied in the
sentences. The semantic similarity of two sentences is calculated using
information from a structured lexical database and from corpus
statistics. The use of a lexical database enables our method to model
human common sense knowledge and the incorporation of corpus statistics
allows our method to be adaptable to different domains. The proposed
method can be used in a variety of applications that involve text
knowledge representation and discovery. Experiments on two sets of
selected sentence pairs demonstrate that the proposed method provides a
similarity measure that shows a significant correlation to human
intuition.
Points made:
- Sentence similarity based on path length and depth in the WordNet hierarchy.
- Word order similarity. The metric seems to work rather well.
- Data sets: Brown Corpus (content and statistical information), WordNet (semantics)
- The paper points to a dataset created for estimating sentence similarity
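To make the first two points concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes the word-level form exp(-alpha * l) * tanh(beta * h), where l is the path length and h the depth of the common subsumer in WordNet, and the word order form 1 - ||r1 - r2|| / ||r1 + r2||, as I recall them from the paper. The defaults alpha=0.2 and beta=0.45 are assumptions and should be checked against the paper. The function names are hypothetical, l and h are passed in directly rather than looked up in WordNet, and the order vectors use exact word matches only, which simplifies the paper's similarity-based matching.

```python
import math


def word_similarity(path_length, depth, alpha=0.2, beta=0.45):
    """Word similarity from WordNet path length and subsumer depth.

    Assumed form: exp(-alpha * l) * tanh(beta * h). The default alpha and
    beta are assumptions, not values confirmed by these notes.
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)


def word_order_similarity(tokens1, tokens2):
    """Word order similarity: 1 - ||r1 - r2|| / ||r1 + r2||.

    Simplification: a word's entry is its 1-based position in the sentence
    if it occurs there verbatim, else 0 (no WordNet-based matching).
    """
    # Joint word set, keeping first-seen order and dropping duplicates.
    joint = list(dict.fromkeys(tokens1 + tokens2))

    def order_vector(tokens):
        first_pos = {}
        for pos, tok in enumerate(tokens, start=1):
            first_pos.setdefault(tok, pos)
        return [first_pos.get(word, 0) for word in joint]

    r1 = order_vector(tokens1)
    r2 = order_vector(tokens2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 if total == 0 else 1.0 - diff / total


s1 = "a quick brown dog jumps over the lazy fox".split()
s2 = "a quick brown fox jumps over the lazy dog".split()
print(word_order_similarity(s1, s2))  # below 1.0: same words, different order
print(word_similarity(path_length=2, depth=5))
```

The two example sentences contain the same words, so a bag-of-words measure would treat them as identical; the word order term is what separates them.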
Overall, good to read. Provides good resources to


