Readings on Knowledge Relationship Discovery: 2011

Monday, 12 September 2011

Two new TEAM members

Mendeley and ELIKO have each recruited a new member for the TEAM project. Christian Prokopp has started at Mendeley, working on topic models and their impact. Honghan Wu has started at ELIKO and will focus on Linked Open Data and its use in recommender systems. Their details can be found on the TEAM Research Page.

We are happy to have both on board to move our research project forward!

Saturday, 3 September 2011

Kris Jack gives opening Keynote at the I-Know'11 Special Track on Recommendation, Data Sharing, and Research Practices in Science 2.0

Next week the I-Know'11 conference starts, where the TEAM team is organising the special track on Recommendation, Data Sharing, and Research Practices in Science 2.0. Kris Jack from Mendeley will give the opening keynote of the special track, titled "Mendeley: Crowd-Sourcing and Recommending Research on the Large Scale".

I am looking forward to hearing his talk.

Author Disambiguation Presentation @ TIR Workshop

This week I presented our work on author disambiguation at the TIR'11 workshop in Toulouse. The presentation outlined our findings on the need for very clean features and better model selection methods for disambiguating author names in the wild. The discussion raised the idea of using overlapping blocking methods and applying outlier detection afterwards. That may be a nicer approach to solving the model selection problem.

Of course, the presentation has also been uploaded to the Mendeley-based TEAM folder that we use for sharing documents in the project.
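
To make the overlapping blocking idea a bit more concrete, here is a minimal sketch in Python. It is not the method from the presentation: the two blocking keys (exact surname, and first initial plus a three-letter surname prefix) and the example names are assumptions chosen only for illustration. Each record lands in several blocks, and any two records sharing at least one block become a candidate pair. The outlier detection step discussed at the workshop would run on these candidates afterwards and is not shown.

```python
from collections import defaultdict


def blocking_keys(name):
    """Return several blocking keys for one author name (hypothetical scheme)."""
    parts = name.lower().replace(".", " ").split()
    if not parts:
        return set()
    first, surname = parts[0], parts[-1]
    return {
        "surname:" + surname,                        # exact surname
        "initial+prefix:" + first[0] + surname[:3],  # first initial + surname prefix
    }


def build_blocks(names):
    """Map each blocking key to the set of record indices that share it."""
    blocks = defaultdict(set)
    for idx, name in enumerate(names):
        for key in blocking_keys(name):
            blocks[key].add(idx)
    return blocks


def candidate_pairs(blocks):
    """Collect every pair of records that co-occur in at least one block."""
    pairs = set()
    for members in blocks.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Made-up example records, not data from the project.
names = ["John Smith", "J. Smith", "John Smithe", "Ada Wong"]
pairs = candidate_pairs(build_blocks(names))
```

Because the blocks overlap, the misspelled "John Smithe" still meets the two "Smith" records through the prefix key, although its exact-surname block contains only itself.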

Tuesday, 30 August 2011

All good things come to an end

My secondment has finally come to an end. Overall, it was a great experience. From my point of view, knowledge transfer worked perfectly in both directions. The results are quite nice: we could show that title extraction with support vector machines works reliably enough to feed subsequent de-duplication methods. Detailed results will follow soon in a publication and a deliverable. For de-duplication, the comparison between fingerprinting and inverted-index-based methods has been started and will now be continued by James Hammerton in Graz. Ago Luberg from ELIKO is also addressing similar topics, but in the different scenario of knowledge acquisition from the web. So, fortunately, my work on these topics will continue.

Overall, I see the TEAM project becoming more and more successful in establishing knowledge transfer among all participants.

Thursday, 23 June 2011

Near-Duplicate-Detection, Metadata Extraction, Visualisation and other things

Since there is so much interesting work to do during the secondment, I tend to forget to write blog posts about what I did. So this one sums up the last 1.5 months, in which I investigated near-duplicate detection and metadata extraction, and gave talks on the visualisation of bibliographic data and on metadata extraction.

Near-duplicate detection (NDD) is essential for the quality of a data set with many committers and sources, as is the case in our use case. Fingerprinting is the traditional approach to near-duplicate detection, but it lacks the flexibility of an inverted-index-based approach. So I tried NDD using inverted indices, in particular Lucene. Lucene is really lightning fast: title lookups on a set of 40 million metadata entries take less than 200 ms, while new titles can still be added in real time. Results in terms of accuracy are also promising, although a comparison with fingerprinting is still open. One particular question in the context of research paper management is how to recover from metadata extraction errors. Given a PDF and automatically extracted titles and authors, how well can we recover from errors? What accuracy can be achieved by metadata extraction using state-of-the-art methods like Conditional Random Fields?
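
As a rough illustration of the inverted-index lookup described above, here is a small in-memory sketch in Python. It does not use Lucene and is not the implementation from the secondment: the class `TitleIndex`, the tokenizer, the token-set Jaccard score and the threshold of 0.6 are all assumptions made for this example. It only shows the principle that the postings lists narrow the search to titles sharing at least one token with the query, and that new titles can be added at any time.

```python
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase the title and split it on anything that is not alphanumeric."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return cleaned.split()


class TitleIndex:
    """Tiny in-memory inverted index over title tokens (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of titles containing it
        self.token_sets = {}              # id -> set of tokens of that title

    def add(self, doc_id, title):
        tokens = set(tokenize(title))
        self.token_sets[doc_id] = tokens
        for tok in tokens:
            self.postings[tok].add(doc_id)

    def lookup(self, title, threshold=0.6):
        """Return (doc_id, score) pairs whose token Jaccard score >= threshold."""
        query = set(tokenize(title))
        if not query:
            return []
        shared = Counter()
        for tok in query:
            for doc_id in self.postings.get(tok, ()):
                shared[doc_id] += 1
        hits = []
        for doc_id, count in shared.items():
            union = len(query | self.token_sets[doc_id])
            score = count / union
            if score >= threshold:
                hits.append((doc_id, score))
        return sorted(hits, key=lambda hit: -hit[1])


index = TitleIndex()
index.add(1, "Near-Duplicate Detection with Inverted Indices")
index.add(2, "Conditional Random Fields for Metadata Extraction")

# A noisy variant of title 1, as it might come out of metadata extraction.
hits = index.lookup("Near duplicate detection using inverted indices")
```

The noisy query shares five of its six tokens with title 1 and none with title 2, so only title 1 passes the threshold.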

Using Conditional Random Fields, it was possible to extract titles relying only on layout features such as font size and position, with a recall of approx. 0.8 and a precision of 0.7. Compared to state-of-the-art tools like ParsCit, which achieve similar accuracy but use more domain knowledge, that is quite good. Experiments also showed that metadata extraction, and hence near-duplicate search, depends on the domain and the type of journal. Without re-training, ParsCit achieves great performance on IEEE computer science papers, but fails on medical papers from the BMJ. With the approach developed using layout information, we can automatically adapt to different journal types and fields, which improves accuracy and recall. I guess that is worth a publication. Working that out will take the rest of my secondment.
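
For readers less familiar with the two measures quoted above, the following Python sketch shows how precision and recall for a "title" label can be computed at token level. Whether the experiments were scored per token or per extracted title is not stated here, so the token-level view is an assumption, and the label sequences are made up for illustration; they are not the experimental data.

```python
def precision_recall(gold, predicted, target="title"):
    """Token-level precision and recall for one label (assumed evaluation scheme)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == target and g == target)  # correctly labelled
    fp = sum(1 for g, p in pairs if p == target and g != target)  # wrongly labelled as title
    fn = sum(1 for g, p in pairs if p != target and g == target)  # missed title tokens
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for ten tokens: five belong to the title, five do not.
gold = ["title"] * 5 + ["other"] * 5
predicted = ["title"] * 4 + ["other", "title"] + ["other"] * 4

precision, recall = precision_recall(gold, predicted)
```

In this toy example one title token is missed and one non-title token is wrongly labelled, so four of five predictions are right (precision) and four of five title tokens are found (recall).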

Thursday, 5 May 2011

Near Duplicate Search on Sparse Metadata & Learn2Rank

In the first month we focused on applying learning to rank to sparse metadata and on investigating the applicability of inverted-index-based metrics and lookups for near-duplicate search. Results in both directions seem promising. In particular, inverted indices over simple word grams or bi-word grams (word bigrams) provide efficient near-duplicate search facilities.
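
A small Python sketch of the difference between single-word grams and bi-word grams as index terms follows. The helper names and the Jaccard measure are assumptions for illustration, not the metrics actually used in the experiments.

```python
def word_ngrams(text, n):
    """Return the set of n-word grams of a whitespace-tokenised, lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two made-up titles with the same words in a different order.
a = "learning to rank for tag recommendation"
b = "tag recommendation for learning to rank"

unigram_sim = jaccard(word_ngrams(a, 1), word_ngrams(b, 1))
bigram_sim = jaccard(word_ngrams(a, 2), word_ngrams(b, 2))
```

Single-word grams ignore word order, so the two titles look identical, while bi-word grams keep some order information and give a lower score. Which of the two is preferable for near-duplicate search depends on how noisy the metadata is.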

Next, learning to rank will find its way into tag recommendation in particular and recommendation in general. The basic question: will learning to rank outperform traditional recommendation approaches? Let's see.

Thursday, 7 April 2011

Starting my research stay at Mendeley

Finally, I have started my research stay at Mendeley. We have settled in, and I have got my first impressions of how everything works here: the city, the people and, of course, my colleagues at Mendeley.

The talks I had with people here helped to narrow down the research topics towards learning to rank and stochastic machine learning, and to identify possible application areas. Next week I will give a talk on that particular topic, and maybe I will have first results. I am looking forward to presenting the possible research ideas and their applications, as well as the research topics my group in Graz is currently addressing.
