Readings on Knowledge Relationship Discovery: May 2009

Tuesday, May 5, 2009

[Topic] Centroid based Classification

[1] claims a very large increase in accuracy from a different centroid weighting scheme, which extracts the "discriminative" features. The reported increase is around 0.07-0.10 in F1 measure.

[2] analyzes centroid-based learning approaches in detail and compares k-NN, C4.5 and centroid-based approaches. The success of centroid-based approaches is explained by interpreting the centroid-based cosine similarity as a ratio of two terms: the average similarity of a new item to all documents of a class, and the average similarity among the documents of that class (which determines the length of the centroid). The first term alone does not take term dependencies into account and suffers from drawbacks similar to those of Naive Bayes algorithms (overestimation of positive term co-occurrences, underestimation of negative term co-occurrences); the second term (= centroid length) addresses the co-occurrence aspect (see Section 5 of the paper for details).
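To make the interpretation concrete, here is a minimal sketch (not code from [2]) that computes the cosine similarity between a new item and a class centroid in two ways: directly, and as the ratio of the two terms described above. It assumes unit-length document vectors; the function names and the toy vectors are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(docs):
    """Component-wise mean of the document vectors of one class."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def centroid_cosine(x, docs):
    """Cosine between a unit-length item x and the class centroid."""
    c = centroid(docs)
    return dot(x, c) / math.sqrt(dot(c, c))

def decomposed_cosine(x, docs):
    """Same value, written as the ratio of the two terms."""
    n = len(docs)
    # Term 1: average similarity of x to every document in the class.
    avg_sim_to_docs = sum(dot(x, d) for d in docs) / n
    # Term 2: average pairwise similarity inside the class
    # (self-pairs included); this equals the squared centroid length.
    avg_pairwise_sim = sum(dot(a, b) for a in docs for b in docs) / (n * n)
    return avg_sim_to_docs / math.sqrt(avg_pairwise_sim)

# Toy term-frequency vectors (hypothetical data).
docs = [normalize(d) for d in ([2, 1, 0], [1, 1, 0], [3, 0, 1])]
x = normalize([1, 2, 0])
print(centroid_cosine(x, docs), decomposed_cosine(x, docs))
```

Both functions return the same number up to rounding. Dividing by the centroid length is what separates the centroid classifier from a plain "average similarity" score.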

The paper also provides statistical testing of the classification results (resampled t-test and sign test).
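As an illustration of the sign test, the sketch below compares two classifiers over paired scores (for example, one score per data set). It is a standard two-sided exact sign test, not necessarily the exact procedure used in [2]; the function name and the example scores are hypothetical.

```python
from math import comb

def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired scores.

    Counts how often A beats B and B beats A (ties are discarded) and
    asks how likely a split at least this lopsided would be if both
    classifiers were equally good (win probability 0.5).
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # only ties: no evidence either way
    k = min(wins_a, wins_b)
    # One-sided binomial tail P(X <= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical F1 scores of two classifiers on five data sets.
centroid_f1 = [0.81, 0.77, 0.90, 0.68, 0.85]
baseline_f1 = [0.78, 0.75, 0.88, 0.66, 0.80]
print(sign_test_p_value(centroid_f1, baseline_f1))
```

The sign test only looks at which classifier wins on each pair, not by how much, so it is robust to outliers but less powerful than a t-test when the score differences are roughly normal.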

[1] Guan, H., Zhou, J., and Guo, M. 2009. A Class-Feature-Centroid Classifier for Text Categorization. In 18th International World Wide Web Conference (April 20 - 24, 2009), Madrid, Spain.

http://www2009.eprints.org/21/

[2] Han, E. and Karypis, G. 2000. Centroid-Based Document Classification: Analysis and Experimental Results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (September 13 - 16, 2000). D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, Eds. Lecture Notes In Computer Science, vol. 1910. Springer-Verlag, London, 424-431.

http://portal.acm.org/citation.cfm?id=669671

Friday, May 1, 2009

[Paper] A Comparative Study of Two Short Text Semantic Similarity Measures

This paper describes a comparative study of STASIS and LSA. These measures of semantic similarity can be applied to short texts for use in Conversational Agents (CAs). CAs are computer programs that interact with humans through natural language dialogue. Business organizations have spent large sums of money in recent years developing them for online customer self-service, but achievements have been limited to simple FAQ systems. We believe this is due to the labour-intensive process of scripting, which could be reduced radically by the use of short-text semantic similarity measures. “Short texts” are typically 10-20 words long but are not required to be grammatically correct sentences, for example spoken utterances and text messages. We also present a benchmark data set of 65 sentence pairs with human-derived similarity ratings. This data set is the first of its kind, specifically developed to evaluate such measures and we believe it will be valuable to future researchers.
