Tuesday, 5 May 2009

[Topic] Centroid based Classification

[1] claims a very large increase in accuracy from using a different centroid weighting scheme, which extracts the "discriminative" features. The increase is around 0.7-0.10 in F1 measure.

[2] analyzes centroid-based learning approaches in detail and compares k-NN, C4.5, and centroid-based classifiers. The success of centroid-based approaches is explained by decomposing the cosine similarity between a new item and a centroid into two terms: the average similarity of the new item to all documents of the class, and the length of the centroid, which reflects the similarity distribution within the class. The average similarity of a new item does not take term dependencies into account and therefore suffers from drawbacks similar to those of naive Bayes algorithms (overestimating positive term co-occurrences, underestimating negative ones); the second term (the centroid length) addresses the co-occurrence aspect (see Section 5 of the paper for details).
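To make the two terms concrete for myself, here is a minimal sketch of a centroid classifier on unit-length term vectors. The toy documents, labels, and all function names are my own assumptions for illustration and are not taken from [2]. With unit-length vectors, the dot product of a new item with the centroid equals its average similarity to the class documents, and the squared centroid length equals the average pairwise similarity among the class documents.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse term vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def dot(a, b):
    """Dot product of two sparse vectors (dicts term -> weight)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(docs):
    """Average of the given document vectors."""
    total = Counter()
    for d in docs:
        for t, w in d.items():
            total[t] += w
    return {t: w / len(docs) for t, w in total.items()}

def classify(x, centroids):
    """Return the label whose centroid has the highest cosine with unit vector x."""
    def cosine(c):
        length = math.sqrt(dot(c, c))
        return dot(x, c) / length if length else 0.0
    return max(centroids, key=lambda label: cosine(centroids[label]))

# Toy term-count data, purely illustrative.
train = {
    "sports": [{"ball": 2, "team": 1}, {"team": 2, "goal": 1}],
    "tech": [{"code": 2, "bug": 1}, {"code": 1, "server": 2}],
}
unit = {label: [normalize(d) for d in docs] for label, docs in train.items()}
centroids = {label: centroid(docs) for label, docs in unit.items()}

x = normalize({"team": 1, "goal": 1})
label = classify(x, centroids)

# The two terms of the decomposition for the "sports" class:
sports = unit["sports"]
avg_sim_to_docs = sum(dot(x, d) for d in sports) / len(sports)
avg_pairwise_sim = sum(dot(a, b) for a in sports for b in sports) / len(sports) ** 2
```

Dividing `avg_sim_to_docs` by the square root of `avg_pairwise_sim` gives exactly the cosine used in `classify`, which is the interpretation discussed in the paper.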

Furthermore, the paper provides statistical testing of classification results (resampled t-test and sign test).




[1] Guan, Hu and Zhou, Jingyu and Guo, Minyi (2009) A Class-Feature-Centroid Classifier for Text Categorization. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain.

http://www2009.eprints.org/21/

[2] Han, E. and Karypis, G. 2000. Centroid-Based Document Classification: Analysis and Experimental Results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (September 13 - 16, 2000). D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, Eds. Lecture Notes In Computer Science, vol. 1910. Springer-Verlag, London, 424-431.

http://portal.acm.org/citation.cfm?id=669671

Friday, 1 May 2009

[Paper] A Comparative Study of Two Short Text Semantic Similarity Measures

    Book Title  - Agent and Multi-Agent Systems: Technologies and Applications
    Chapter Title  - A Comparative Study of Two Short Text Semantic Similarity Measures
    First Page  - 172
    Last Page  - 181
    Copyright  - 2008
    Author  - James O’Shea
    Author  - Zuhair Bandar
    Author  - Keeley Crockett
    Author  - David McLean
    DOI  - 10.1007/978-3-540-78582-8_18
    Link  - http://www.springerlink.com/content/v0867641u342pm2

James O’Shea, Zuhair Bandar, Keeley Crockett and David McLean

(1)  Department of Computing and Mathematics, Manchester Metropolitan University, Chester St., Manchester, M1 5GD, United Kingdom
Abstract
This paper describes a comparative study of STASIS and LSA. These measures of semantic similarity can be applied to short texts for use in Conversational Agents (CAs). CAs are computer programs that interact with humans through natural language dialogue. Business organizations have spent large sums of money in recent years developing them for online customer self-service, but achievements have been limited to simple FAQ systems. We believe this is due to the labour-intensive process of scripting, which could be reduced radically by the use of short-text semantic similarity measures. “Short texts” are typically 10-20 words long but are not required to be grammatically correct sentences, for example spoken utterances and text messages. We also present a benchmark data set of 65 sentence pairs with human-derived similarity ratings. This data set is the first of its kind, specifically developed to evaluate such measures and we believe it will be valuable to future researchers.

Keywords  Natural Language - Semantic Similarity - Dialogue Management - User Modeling - Benchmark - Sentence


Important Points

  • Discussion and summary of different kinds of similarity (taxonomic, related, goal-derived and radial)
  • Introduction of a (small) test corpus and how it was created. This includes some discussion of how humans rate similarity.
  • Statement that co-occurrence measures also yield high similarity values for antonyms


Tuesday, 21 April 2009

[Paper] Sentence Similarity Based on Semantic Nets and Corpus Statistics


Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., and Crockett, K. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans. on Knowl. and Data Eng. 18, 8 (Aug. 2006), 1138-1150. DOI= http://dx.doi.org/10.1109/TKDE.2006.130

ABSTRACT



Sentence similarity measures play an increasingly important role in
text-related research and applications in areas such as text mining,
Web page retrieval, and dialogue systems. Existing methods for
computing sentence similarity have been adopted from approaches used
for long text documents. These methods process sentences in a very
high-dimensional space and are consequently inefficient, require human
input, and are not adaptable to some application domains. This paper
focuses directly on computing the similarity between very short texts
of sentence length. It presents an algorithm that takes account of
semantic information and word order information implied in the
sentences. The semantic similarity of two sentences is calculated using
information from a structured lexical database and from corpus
statistics. The use of a lexical database enables our method to model
human common sense knowledge and the incorporation of corpus statistics
allows our method to be adaptable to different domains. The proposed
method can be used in a variety of applications that involve text
knowledge representation and discovery. Experiments on two sets of
selected sentence pairs demonstrate that the proposed method provides a
similarity measure that shows a significant correlation to human
intuition.


Points made:

  • Sentence similarity based on path length and depth in the WordNet hierarchy.
  • Word order similarity. The metric seems to work rather well.
  • Data sets: Brown Corpus (content and statistical information), WordNet (semantics)
  • The paper points to a dataset created for estimating sentence similarity
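A rough sketch of the first two points, to fix the idea. The functional form (exponential decay in path length, tanh growth in depth) and the constants are written from my memory of the paper and should be treated as assumptions; the path length and depth are passed in as plain numbers instead of being looked up in WordNet, and the word order vector uses exact word matches only, which is a simplification.

```python
import math

ALPHA = 0.2   # assumed weight of the path length between two words
BETA = 0.45   # assumed weight of the depth of their common subsumer
DELTA = 0.85  # assumed weight of semantic vs. word order similarity

def word_similarity(path_length, depth, alpha=ALPHA, beta=BETA):
    """Similarity falls with path length and rises with subsumer depth."""
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)

def order_vector(sentence, joint_words):
    """1-based position in `sentence` of each joint word, 0 if absent."""
    position = {}
    for i, word in enumerate(sentence, start=1):
        position.setdefault(word, i)
    return [position.get(word, 0) for word in joint_words]

def word_order_similarity(s1, s2):
    """1 - |r1 - r2| / |r1 + r2| over the joint word set of both sentences."""
    joint = list(dict.fromkeys(s1 + s2))  # distinct words, first-seen order
    r1, r2 = order_vector(s1, joint), order_vector(s2, joint)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

def combine(semantic_sim, order_sim, delta=DELTA):
    """Weighted combination of semantic and word order similarity."""
    return delta * semantic_sim + (1.0 - delta) * order_sim

a = "a quick brown dog jumps over the lazy fox".split()
b = "a quick brown fox jumps over the lazy dog".split()
order_sim = word_order_similarity(a, b)
```

Both sentences contain the same words, so a bag-of-words measure cannot tell them apart, while `order_sim` drops below 1 because "dog" and "fox" swap positions.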

Overall, good to read and very detailed. Provides good resources on sentence similarity.


Sentiment Detection and Opinion Mining

Sentiment detection and opinion mining, i.e. extracting subjective opinions and assertions from web portals, is currently a hot topic. A good survey is provided by Pang and Lee; it addresses several aspects of the field. The most important question for knowledge discovery and text mining is how algorithms for analysing unstructured texts written by web users differ from standard text mining tasks such as text classification and named entity recognition. From the survey I have taken the following points:
  • A smaller number of classes compared to text classification (e.g. positive, ambivalent, negative vs. topic hierarchies)
  • Higher dependency on subjective writing style (e.g. sarcasm)
  • Higher dependency on common-sense knowledge: sentiments can be expressed without sentiment words, by comparison to a very good or very bad situation ("it feels like driving a car at 360 km/h")
  • High degree of subjectivity: given the above sentence, some people may like it, some may not
  • Order effects might override frequency effects

Sentiment Tasks

  • Polarity/Opinion Classification: Determine whether a piece of text is good or bad
  • Rating inference/ordinal regression: Determine the scale of goodness/badness
  • Subjectivity Detection: Detect whether a piece of text contains subjective/objective material
  • Joint Topic/Sentiment analysis

Facts
  • Machine learning using unigram models can achieve over 80% accuracy (Pang et al., Thumbs up? Sentiment Classification using Machine Learning Techniques)
  • Templates are more stable across domains (compared to IE)
  • Finding the correct keywords expressing sentiment seems to be hard ("Go read the book" in the movie vs. the book domain)
  • Unclear whether bigrams help or not
  • POS tagging can be considered a rough version of word sense disambiguation (WSD)
  • Syntax has been found to be useful (dependency trees)
  • Negation matters (handled as a second feature, by transforming words, e.g. with a NOT tag, or by deeper modelling)
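The word-transformation idea for negation can be sketched as follows: every token between a negation word and the next punctuation mark gets a NOT_ prefix, so that "like" and "NOT_like" become different unigram features. This is only one simple variant; the negation word list, the tokenizer regex, and the function name are my own assumptions.

```python
import re

NEGATIONS = {"not", "no", "never", "cannot"}   # assumed, deliberately small
CLAUSE_END = {".", ",", ";", ":", "!", "?"}

def tag_negation(text):
    """Prefix tokens that follow a negation word with NOT_ until punctuation."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    tagged, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False          # negation scope ends at punctuation
            tagged.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negated = True           # open a negation scope
            tagged.append(tok)
        else:
            tagged.append("NOT_" + tok if negated else tok)
    return tagged

features = tag_negation("I didn't like this movie, but the book is good.")
```

The resulting tokens can be fed into any unigram classifier unchanged; deeper modelling of negation scope (e.g. via a dependency tree) would replace the punctuation heuristic.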







Thursday, 26 February 2009

Recurrent Neural Networks for Robust Real-World Text Classification

Garen Arevian
2007 IEEE/WIC/ACM International Conference on Web Intelligence

ABSTRACT






This paper explores the application of recurrent neural networks for
the task of robust text classification of a real-world benchmarking
corpus. There are many well-established approaches which are used for
text classification, but they fail to address the challenge from a more
multi-disciplinary viewpoint such as natural language processing and
artificial intelligence. The results demonstrate that these recurrent
neural networks can be a viable addition to the many techniques used in
web intelligence for tasks such as context sensitive email
classification and web site indexing.

Noteworthy

  • Use of recurrent neural networks (Elman networks) with a context layer, which are able to take word order into account
  • Further references for NNs in text mining
  • Title-based semantic representation (at least pointers to prior literature on the topic)
  • Word order was not important
  • The claim that NNs can outperform other classifiers is very strong and does not hold in general
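As a reminder of how the context layer works, here is a minimal forward pass of an Elman-style network. The weights, the three-word vocabulary, and the layer sizes are hypothetical values chosen by me and have nothing to do with the setup in the paper; no training is shown.

```python
import math

def elman_step(x, context, w_in, w_ctx, bias):
    """One time step: hidden state from the current input and the context layer."""
    hidden = []
    for j in range(len(bias)):
        total = bias[j]
        total += sum(w_in[j][i] * x[i] for i in range(len(x)))
        total += sum(w_ctx[j][k] * context[k] for k in range(len(context)))
        hidden.append(math.tanh(total))
    return hidden

def encode(sequence, w_in, w_ctx, bias):
    """Feed a word sequence through the network and return the final state."""
    context = [0.0] * len(bias)      # context layer starts empty
    for x in sequence:
        # The hidden state is copied into the context layer for the next word.
        context = elman_step(x, context, w_in, w_ctx, bias)
    return context

# Hypothetical fixed weights: 2 hidden units, 3-word one-hot vocabulary.
W_IN = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
W_CTX = [[0.4, -0.7], [0.2, 0.5]]
BIAS = [0.0, 0.1]
VOCAB = {"dog": [1, 0, 0], "bites": [0, 1, 0], "man": [0, 0, 1]}

state_1 = encode([VOCAB[w] for w in "dog bites man".split()], W_IN, W_CTX, BIAS)
state_2 = encode([VOCAB[w] for w in "man bites dog".split()], W_IN, W_CTX, BIAS)
```

The two sentences contain the same words, yet `state_1` and `state_2` differ, because each hidden state depends on the previous one through the context layer. A classifier would put an output layer on top of the final state.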








Monday, 16 February 2009

Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability

Mark Sanderson, Justin Zobel

The paper is an excellent treatment of how to compare IR systems and how to interpret differences in MAP and other measures. A must-read for evaluation.


Abstract: The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.

Important Aspects Covered:

  • Brief introduction to statistical significance testing in IR (how and why)
  • Summary of results found by Zobel and Voorhees/Buckley: 8-9% MAP difference on 25 topics (conf = 95%), 5-6% MAP difference on 50 topics (conf = 95%)
  • Impact of significance testing on projecting MAP accuracy.
  • A large difference in MAP does not necessarily imply a statistically significant difference, especially with small topic sets (e.g. 25). In the worst case, the comparison must be significant and the MAP difference must be higher than 10%.
  • The t-test produces lower error rates than the sign and Wilcoxon tests
  • MAP is more reliable than P@10, but building a reliable P@10-only collection should be cheaper (from an assessor's point of view). However, the stability of shallow pool sizes is unclear and has not yet been tested.
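For reference, a small sketch of the quantities discussed above: precision at rank k, average precision for one topic (MAP is its mean over all topics), and the paired t statistic over per-topic scores of two systems. The function names and the toy numbers are my own; the t statistic still has to be compared against a t distribution with n - 1 degrees of freedom, and the sketch assumes the per-topic differences are not all identical.

```python
import math

def precision_at_k(relevance, k=10):
    """Fraction of relevant documents among the top k results (0/1 list)."""
    return sum(relevance[:k]) / k

def average_precision(relevance, num_relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-topic score differences between two systems."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)

# One topic: relevant documents retrieved at ranks 1 and 3, two relevant in total.
ranking = [1, 0, 1, 0, 0]
ap = average_precision(ranking, num_relevant=2)
p_at_5 = precision_at_k(ranking, k=5)

# Per-topic AP of two hypothetical systems on four topics.
t_stat = paired_t_statistic([0.5, 0.6, 0.7, 0.8], [0.4, 0.4, 0.6, 0.7])
```

Note that P@k only needs judgments for the top k documents of each run, whereas average precision needs the number of relevant documents per topic, which is where the difference in assessor effort comes from.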


Thursday, 29 January 2009

Mini How-to write a KRD/KDD Research Paper

I recently stumbled over the ACM SIGKDD 09 Call for Papers, which contains an excellent and comprehensive guide to writing a good research paper... at least for data-intensive domains ;)

You can find the link here. The important part is also cited below:

" In writing your paper, we suggest you try to address the following questions, credited to George Heilmeier:


  • What are you trying to do? Articulate your objectives using absolutely no jargon.
  • How is it done today, and what are the limits of current practice?
  • What's new in your approach and why do you think it will be successful?
  • Who cares?
  • If you're successful, what difference will it make?
  • What are the risks and the payoffs? (in other words, what are the limitations and strengths of your work)
  • What are the midterm and final "exams" to check for success? (in other
    words, what are the measures of evaluation and evidence of success)

In light of the above principles, we suggest
the following guidelines for the paper content. Note that the headings
and the structure below are meant to be general categories; please
exercise your discretion and creativity to make the paper as
comprehensible as possible to the readers and reviewers.


Abstract


Try to include the following:
  • Motivation: one or two sentences on the problem and its significance;
  • Results: a short paragraph on approach and results;
  • Availability: a link to code, data, and supplementary materials,
    or a statement why this is not possible.

Motivation & Significance


What is the problem and why is it important or significant?


Problem Statement


Formal definition of the problem with any preliminary concepts.


Prior Work & Limitations


What are the existing approaches, and their limitations?


Theory/Algorithm


  • Discuss the main theoretical or algorithmic ideas of the paper;
  • Mention the main theorems (if any), the intuition behind those, and their
    practical application. Move the proofs to the appendix, unless the
    proof itself is the main contribution;
  • Discuss your algorithmic solution (if any) at the conceptual level with
    pseudo-code, to convey the main ideas. Move minute (but
    practically important) implementation details to the appendix;
  • Discuss why you chose certain paths, and discuss unfruitful
    paths that you discarded. In other words, give both the
    theoretical and/or algorithmic "insights" into your work.

Experiments or other Evidence of Success


  • Complete parameter settings and data descriptions should be
    provided (including any links to public resources);
  • Clearly specify the experimental procedure, including evaluation
    measures;
  • Compare to prior solutions, or at least to "strawman" solutions;
  • Clearly discuss the results and what they mean;
  • Only include the most relevant experiments here, using the
    appendix to provide any additional results (say on minor parameter
    tuning of your method, etc).

Discussion and Future Work


Describe insights you gained, the limitations and applicability of
your work, and directions for future research. Every solution has
limitations, which should be explicitly mentioned.


References


Include the most relevant works, making sure all citations are complete
(including editors, publishers, page numbers, etc.).


APPENDIX


You should use the appendix for supporting details. For example, you
may use it to convey detailed technical/practical aspects of your
implementation. You may use the appendix for theorem proofs, or for
additional experimental results. Include pointers in the
main paper to relevant sections in the appendix.


The appendix is an integral part of the paper, since it will provide
details that are important for a proper appreciation of your work
(e.g., for replicating or extending it, or for comparison).
However, it should be possible on a first read-through to get a good
understanding of the paper's contribution from the main part alone.
Structuring the paper in this way provides a service to the reader,
by separating main ideas from technical details."


Sunday, 25 January 2009

Text classification datasets with splits

...can be found here. They cover the 20 Newsgroups data set and TDT2.

Friday, 23 January 2009

Statistical Machine Translation

I recently stumbled over a reasonably good survey on Statistical Machine Translation by Lopez [1]. Starting with IBM Models 3 and 4, it explains the critical steps of machine translation:
1. Selection of the translational model (e.g. transducers, synchronous context-free grammars)
2. Parametrization of the model, i.e. which parameters can be learned (e.g. fertility of words, word alignment)
3. Parameter estimation, i.e. how to estimate the parameter values (e.g. using generative or discriminative statistical models)
4. Decoding, i.e. translating new text with the selected and parametrized model
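To see steps 2 to 4 in miniature, here is a toy sketch based on IBM Model 1, which is much simpler than the Models 3 and 4 covered in the survey: the only parameters are word translation probabilities t(target | source), they are estimated with EM from sentence pairs, and "decoding" is reduced to picking the most probable target word for each source word. The three sentence pairs, the function names, and the word-by-word decoder are my own assumptions; there is no NULL word, no fertility, and no language model.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(target | source) with EM from (source_words, target_words) pairs."""
    cooccurring = {(f, e) for es, fs in pairs for e in es for f in fs}
    target_vocab = {f for f, _ in cooccurring}
    # Parametrization: one probability per co-occurring word pair, uniform start.
    t = {pair: 1.0 / len(target_vocab) for pair in cooccurring}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def translate(source_words, t):
    """Toy decoder: most probable target word for each source word."""
    targets = {f for f, _ in t}
    return [max(targets, key=lambda f: t.get((f, e), 0.0)) for e in source_words]

PAIRS = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = train_model1(PAIRS)
translation = translate(["das", "buch"], table)
```

Even without any alignment annotation, EM figures out that "das" goes with "the" because the pair co-occurs in two sentences, which in turn frees up "haus" for "house" and "buch" for "book".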

Overall, it contains some interesting, detailed insights into problems such as how to deal with sequences and the difference between discriminative and generative statistical models (see also CRF Introduction). Worth reading.

Open Source Resources:
[Moses] http://www.statmt.org/moses/
[Overview] http://opentranslation.aspirationtech.org/index.php/Open_Source_Translation_Tools


[1] Lopez, A. 2008. Statistical machine translation. ACM Comput. Surv. 40, 3 (Aug. 2008), 1-49. DOI= http://doi.acm.org/10.1145/1380584.1380586