[1] claims a very large increase in accuracy from a different centroid weighting scheme, one that extracts the "discriminative" features. The reported increase is around 0.7-0.10 in F1 measure.
[2] analyzes centroid-based learning approaches in detail and compares k-NN, C4.5 and centroid-based classifiers. It explains the success of centroid-based approaches by reading the cosine similarity between a new item and a class centroid as a ratio of two terms: the average similarity of the new item to all documents of the class, and the length of the centroid, which reflects the average pairwise similarity among the documents within that class. The first term does not take term dependencies into account and therefore suffers from drawbacks similar to those of naive Bayes algorithms (it overestimates positive term co-occurrences and underestimates negative ones). The second term, the centroid length, addresses the co-occurrence aspect (see Section 5 of the paper for details).
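To make this interpretation concrete, here is a minimal sketch (not code from the paper) that checks the decomposition numerically on toy data. It assumes unit-length document vectors and a centroid computed as their plain mean; the vectors and all function names are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so that dot products are cosines.
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def centroid(docs):
    # Component-wise mean of the document vectors.
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def centroid_cosine(x, docs):
    # Cosine between a unit-length x and the class centroid.
    c = centroid(docs)
    return dot(x, c) / math.sqrt(dot(c, c))

def decomposed_cosine(x, docs):
    # Same quantity, written as the ratio of the two terms discussed above.
    n = len(docs)
    # Term 1: average similarity of x to every document of the class.
    avg_sim_to_docs = sum(dot(x, d) for d in docs) / n
    # Term 2: squared centroid length = average pairwise similarity
    # among the class documents (self-similarities included).
    avg_pairwise = sum(dot(a, b) for a in docs for b in docs) / (n * n)
    return avg_sim_to_docs / math.sqrt(avg_pairwise)

# Hypothetical toy term-weight vectors for one class and one new item.
docs = [normalize(d) for d in ([2, 1, 0, 0], [1, 2, 1, 0], [3, 0, 1, 0])]
x = normalize([1, 1, 0, 1])

print(centroid_cosine(x, docs), decomposed_cosine(x, docs))
```

Both functions print the same value. A class whose documents are very similar to each other has a long centroid and is therefore penalized in the denominator, which is where the co-occurrence information enters.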
The paper also reports statistical significance tests on the classification results (resampled t-test and sign test).
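As a reminder of what a sign test does, here is a minimal sketch of the exact two-sided version. It assumes the inputs are simply the number of comparisons (for example data sets) each classifier wins, with ties dropped beforehand. The counts in the usage line are hypothetical, and how the paper sets up its own comparisons is not reproduced here.

```python
from math import comb

def sign_test(wins_a, wins_b):
    # Exact two-sided sign test: under the null hypothesis each
    # non-tied comparison is a fair coin flip between A and B.
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of seeing at most k wins for the weaker side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail and cap the p-value at 1.
    return min(1.0, 2 * tail)

# Hypothetical outcome: A beats B on 8 of 10 non-tied comparisons.
print(sign_test(8, 2))
```

The test only uses the direction of each difference, not its size, so it makes fewer assumptions than a t-test but is also less sensitive.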
[1] Guan, Hu and Zhou, Jingyu and Guo, Minyi (2009) A Class-Feature-Centroid Classifier for Text Categorization. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain.
http://www2009.eprints.org/21/
[2] Han, E. and Karypis, G. 2000. Centroid-Based Document Classification: Analysis and Experimental Results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (September 13 - 16, 2000). D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, Eds. Lecture Notes In Computer Science, vol. 1910. Springer-Verlag, London, 424-431.
http://portal.acm.org/citation.cfm?id=669671
