Thursday, January 29, 2009

Mini How-To: Writing a KRD/KDD Research Paper

I recently stumbled over the ACM SIGKDD 09 Call for Papers, which contains an excellent and comprehensive guide to writing a good research paper...at least for data-intensive domains ;)

You can find the link here. The important part is also quoted below:

" In writing your paper, we suggest you try to address the following questions, credited to George Heilmeier:


  • What are you trying to do? Articulate your objectives using absolutely no jargon.
  • How is it done today, and what are the limits of current practice?
  • What's new in your approach and why do you think it will be successful?
  • Who cares?
  • If you're successful, what difference will it make?
  • What are the risks and the payoffs? (in other words, what are the limitations and strengths of your work)
  • What are the midterm and final "exams" to check for success? (in other
    words, what are the measures of evaluation and evidence of success)

In light of the above principles, we suggest
the following guidelines for the paper content. Note that the headings
and the structure below are meant to be general categories; please
exercise your discretion and creativity to make the paper as
comprehensible as possible to the readers and reviewers.


Abstract


Try to include the following:
  • Motivation: one or two sentences on the problem and its significance;
  • Results: a short paragraph on approach and results;
  • Availability: a link to code, data, and supplementary materials,
    or a statement why this is not possible.

Motivation & Significance


What is the problem and why is it important or significant?


Problem Statement


Formal definition of the problem with any preliminary concepts.


Prior Work & Limitations


What are the existing approaches, and their limitations?


Theory/Algorithm


  • Discuss the main theoretical or algorithmic ideas of the paper;
  • Mention the main theorems (if any), the intuition behind those, and their
    practical application. Move the proofs to the appendix, unless the
    proof itself is the main contribution;
  • Discuss your algorithmic solution (if any) at the conceptual level with
    pseudo-code, to convey the main ideas. Move minute (but
    practically important) implementation details to the appendix;
  • Discuss why you chose certain paths, and discuss unfruitful
    paths that you discarded. In other words, give both the
    theoretical and/or algorithmic "insights" into your work.

Experiments or other Evidence of Success


  • Complete parameter settings and data descriptions should be
    provided (including any links to public resources);
  • Clearly specify the experimental procedure, including evaluation
    measures;
  • Compare to prior solutions, or at least to "strawman" solutions;
  • Clearly discuss the results and what they mean;
  • Only include the most relevant experiments here, using the
    appendix to provide any additional results (say on minor parameter
    tuning of your method, etc).

Discussion and Future Work


Describe insights you gained, the limitations and applicability of
your work, and directions for future research. Every solution has
limitations, which should be explicitly mentioned.


References


Include the most relevant works, making sure all citations are complete
(including editors, publishers, page numbers, etc.).


APPENDIX


You should use the appendix for supporting details. For example, you
may use it to convey detailed technical/practical aspects of your
implementation. You may use the appendix for theorem proofs, or for
additional experimental results. Include pointers in the
main paper to relevant sections in the appendix.


The appendix is an integral part of the paper, since it will provide
details that are important for a proper appreciation of your work
(e.g., for replicating or extending it, or for comparison).
However, it should be possible on a first read-through to get a good
understanding of the paper's contribution from the main part alone.
Structuring the paper in this way provides a service to the reader,
by separating main ideas from technical details."


Sunday, January 25, 2009

Text classification datasets with splits

...can be found here. They are the 20 Newsgroups data set and TDT2.

Friday, January 23, 2009

Statistical Machine Translation

I recently stumbled over a reasonably good survey on statistical machine translation by Lopez [1]. Starting with IBM Models 3 and 4, it explains the critical steps of machine translation:
1. Selection of the translation model (e.g. transducers, synchronous context-free grammars)
2. Parametrization of the model, i.e. which parameters can be learned (e.g. word fertility, word alignment)
3. Parameter estimation, i.e. how to estimate the values of those parameters (e.g. using generative or discriminative statistical models)
4. Decoding, i.e. translating new text with the selected and parametrized model
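
To make the four steps concrete, here is a minimal sketch in Python. It is not taken from the survey: it uses IBM Model 1, which is much simpler than the Models 3 and 4 mentioned above (no fertility, no distortion, no NULL word), and the "decoder" just picks the best target word for each source word without a language model or reordering. The function names `train_ibm_model1` and `decode_word_by_word` and the three-sentence toy corpus are made up for illustration.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=20):
    """Estimate t[(f, e)] ~ P(source word f | target word e) with EM.

    corpus is a list of (source_tokens, target_tokens) sentence pairs.
    """
    source_vocab = {f for fs, _ in corpus for f in fs}
    # Parametrization: one probability per co-occurring word pair, uniform start.
    t = {(f, e): 1.0 / len(source_vocab)
         for fs, es in corpus for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to any f
        for fs, es in corpus:
            for f in fs:
                # E-step: spread one unit of f over the target words
                # of the same sentence pair, weighted by the current t.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize the expected counts per target word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def decode_word_by_word(source_tokens, t):
    """Toy decoder: choose the highest-scoring target word per source word."""
    targets = sorted({e for _, e in t})
    output = []
    for f in source_tokens:
        best = max(targets, key=lambda e: t.get((f, e), 0.0))
        # Source words never seen in training are passed through unchanged.
        output.append(best if t.get((f, best), 0.0) > 0.0 else f)
    return output


# Hypothetical toy corpus (German -> English), assumed for illustration only.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm_model1(corpus)
translation = decode_word_by_word(["ein", "haus"], t)
```

In terms of the list above: the choice of Model 1 is step 1, the table `t` is step 2, the EM loop is step 3, and `decode_word_by_word` stands in for step 4. A real system such as Moses replaces every one of these pieces with something far more elaborate.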

Overall, it offers detailed insights into problems such as how to deal with sequences and the difference between discriminative and generative statistical models (see also CRF Introduction). Worth reading.

Open Source Resources:
[Moses] http://www.statmt.org/moses/
[Overview] http://opentranslation.aspirationtech.org/index.php/Open_Source_Translation_Tools


[1] Lopez, A. 2008. Statistical machine translation. ACM Comput. Surv. 40, 3 (Aug. 2008), 1-49. DOI= http://doi.acm.org/10.1145/1380584.1380586