Tuesday, 21 April 2009

[Paper] Sentence Similarity Based on Semantic Nets and Corpus Statistics


Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., and Crockett, K. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans. on Knowl. and Data Eng. 18, 8 (Aug. 2006), 1138-1150. DOI= http://dx.doi.org/10.1109/TKDE.2006.130

ABSTRACT



Sentence similarity measures play an increasingly important role in
text-related research and applications in areas such as text mining,
Web page retrieval, and dialogue systems. Existing methods for
computing sentence similarity have been adopted from approaches used
for long text documents. These methods process sentences in a very
high-dimensional space and are consequently inefficient, require human
input, and are not adaptable to some application domains. This paper
focuses directly on computing the similarity between very short texts
of sentence length. It presents an algorithm that takes account of
semantic information and word order information implied in the
sentences. The semantic similarity of two sentences is calculated using
information from a structured lexical database and from corpus
statistics. The use of a lexical database enables our method to model
human common sense knowledge and the incorporation of corpus statistics
allows our method to be adaptable to different domains. The proposed
method can be used in a variety of applications that involve text
knowledge representation and discovery. Experiments on two sets of
selected sentence pairs demonstrate that the proposed method provides a
similarity measure that shows a significant correlation to human
intuition.


Points made:

  • Sentence similarity based on path length and depth in the WordNet hierarchy.
  • Word order similarity. The metric seems to work rather well.
  • Data sets: Brown Corpus (content and statistical information), WordNet (semantics)
  • The paper points to a dataset created for estimating sentence similarity.
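
To make the first two points concrete, here is a minimal sketch of how I understand the two building blocks: a word similarity that decays with path length and saturates with the depth of the common subsumer, and a word order similarity computed from position vectors. The function names, the default values of alpha and beta, and the exact-match shortcut in the word order part are my own assumptions for illustration (the paper matches words by semantic similarity, and the tuned constants should be checked there). The path length and depth are passed in as plain numbers rather than looked up in WordNet.

```python
import math

# Illustrative defaults (assumption); check the paper for the tuned values.
ALPHA = 0.2   # how quickly similarity decays with path length
BETA = 0.45   # how quickly similarity saturates with subsumer depth


def word_similarity(path_length, subsumer_depth, alpha=ALPHA, beta=BETA):
    """Combine path length and depth: exp(-alpha * l) * tanh(beta * h)."""
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


def word_order_similarity(tokens1, tokens2):
    """1 - ||r1 - r2|| / ||r1 + r2|| over word position vectors.

    Simplification (assumption): a word only matches an identical word;
    words missing from a sentence get position 0.
    """
    # Joint word set, keeping first-seen order.
    joint = list(dict.fromkeys(tokens1 + tokens2))

    def order_vector(tokens):
        position = {}
        for index, token in enumerate(tokens, start=1):
            position.setdefault(token, index)  # first occurrence wins
        return [position.get(word, 0) for word in joint]

    r1 = order_vector(tokens1)
    r2 = order_vector(tokens2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


if __name__ == "__main__":
    print(word_similarity(path_length=2, subsumer_depth=6))
    s1 = "a quick brown dog jumps over the lazy fox".split()
    s2 = "a quick brown fox jumps over the lazy dog".split()
    print(word_order_similarity(s1, s2))
```

Two sentences with the same words in a different order get a word order similarity below 1, which is the information a pure bag-of-words comparison would lose.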

Overall, good to read and very detailed. Provides good resources on sentence similarity.


Sentiment Detection and Opinion Mining

Sentiment detection and opinion mining, the extraction of subjective opinions and assertions from web portals, is currently a hot topic. A good survey is provided by Pang and Lee. The survey addresses several aspects of this field. Most important for knowledge discovery and text mining is the question of how algorithms for analysing unstructured texts written by web users differ from standard text mining tasks such as text classification and named entity recognition. From the survey I have taken the following points:
  • A smaller number of classes compared to text classification (e.g. positive, ambivalent, negative vs. topic hierarchies)
  • Higher dependency on subjective writing style (e.g. sarcasm)
  • Higher dependency on common sense knowledge: sentiments can be expressed without sentiment words, by comparison to a very good or very bad situation (e.g. "it feels like driving a car at 360 km/h")
  • High degree of subjectivity: given the sentence above, some people may like the experience, some may not
  • Order effects might override frequency effects

Sentiment Tasks

  • Polarity Opinion Classification: Determine whether a piece of text is good or bad
  • Rating inference/ordinal regression: Determine the degree of goodness/badness
  • Subjectivity Detection: Detect whether a piece of text contains subjective/objective material
  • Joint Topic/Sentiment analysis

Facts
  • Machine learning using unigram models can achieve over 80% accuracy (Pang et al., "Thumbs up? Sentiment Classification using Machine Learning Techniques")
  • Templates are more stable across domains (compared to IE)
  • Finding correct keywords expressing sentiments seems to be hard ("Go read the book" in movie vs. book domain)
  • Unclear whether bigrams help or not
  • POS tagging can be considered a rough version of word sense disambiguation (WSD)
  • Syntax has been found to be useful (dependency trees)
  • Negations count (handled as a second feature, by transforming words, e.g. with a NOT tag, or by deeper modelling)
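
The word transformation mentioned in the last point can be sketched as follows: every word between a negation word and the next punctuation mark gets a NOT_ prefix, so that "like" and "NOT_like" become different unigram features. This is a minimal sketch under my own assumptions: the negation word list, the tokenizer, and the function names are illustrative and not taken from the survey.

```python
import re

# Small illustrative negation list (assumption, far from complete).
NEGATIONS = {"not", "no", "never", "isn't", "wasn't", "don't", "didn't", "can't"}
PUNCTUATION = set(".,:;!?")


def tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"[a-z']+|[.,:;!?]", text.lower())


def mark_negation(tokens):
    """Prefix NOT_ to every word between a negation word and the next punctuation."""
    marked = []
    negated = False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False          # punctuation closes the negation scope
            marked.append(token)
        elif token in NEGATIONS:
            negated = True           # the negation word itself stays unchanged
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


def unigram_presence(tokens):
    """Binary unigram features: the set of distinct non-punctuation tokens."""
    return {token for token in tokens if token not in PUNCTUATION}


if __name__ == "__main__":
    tokens = mark_negation(tokenize("I didn't like this movie, but the book is good."))
    print(tokens)
    print(sorted(unigram_presence(tokens)))
```

The resulting feature set can be fed to any standard classifier; the point of the transformation is only that negated and non-negated occurrences of a word no longer collapse into the same feature.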




