Since there is so much interesting work to do at the secondment, I tend to forget to write blog posts about what I did. So this one sums up the last 1.5 months, in which I investigated topics like near-duplicate detection and metadata extraction, and gave talks on the visualisation of bibliographic data and on metadata extraction.
Near-duplicate detection (NDD) is essential for the quality of a data set with many committers and sources, as is the case in our use case. Fingerprinting is the traditional approach to near-duplicate detection, but it does not have the flexibility of an inverted-index-based approach. So I tried NDD using inverted indices, in particular Lucene. Lucene is really lightning fast: title lookups on a set of 40 million metadata entries take less than 200 ms, while new titles can still be added in real time. Results in terms of accuracy are also promising, although a comparison with fingerprinting is still open. One particular question in the context of research paper management is how to recover from metadata extraction errors. Given a PDF and automatically extracted titles and authors, how well can we recover from errors? What accuracy can be achieved by metadata extraction using state-of-the-art methods like Conditional Random Fields?
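To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is not Lucene and not the code used in these experiments. The class `TitleIndex`, the word-level tokenizer, the Jaccard score and the threshold of 0.6 are all assumptions chosen for illustration. The point is only that the index narrows the search to titles sharing at least one token with the query, and that new titles can be added at any time.

```python
import re
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


class TitleIndex:
    """Toy inverted index over titles (hypothetical, for illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of titles containing it
        self.docs = {}                    # id -> token set of the title

    def add(self, doc_id, title):
        """Index a new title; can be called at any time."""
        tokens = tokenize(title)
        self.docs[doc_id] = tokens
        for token in tokens:
            self.postings[token].add(doc_id)

    def lookup(self, title, threshold=0.6):
        """Return (id, score) pairs of indexed titles similar to `title`."""
        query = tokenize(title)
        # Candidate set: every title sharing at least one token with the query.
        candidates = set()
        for token in query:
            candidates |= self.postings.get(token, set())
        hits = []
        for doc_id in candidates:
            tokens = self.docs[doc_id]
            # Jaccard similarity of the two token sets (an assumed scoring choice).
            score = len(query & tokens) / len(query | tokens)
            if score >= threshold:
                hits.append((doc_id, score))
        return sorted(hits, key=lambda hit: -hit[1])


index = TitleIndex()
index.add(1, "Near Duplicate Detection with Lucene")
index.add(2, "Conditional Random Fields for Metadata Extraction")
# A title with a hyphen and a truncated last word, as an extractor might produce.
print(index.lookup("Near-duplicate detection with Lucen"))
```

A real setup would let the search engine handle tokenization, scoring and persistence. The sketch only shows why an inverted index tolerates small extraction errors that would change an exact fingerprint.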
Using Conditional Random Fields, it was possible to extract titles with a recall of approx. 0.8 and a precision of 0.7 while relying only on layout features like font size, position etc. Compared to state-of-the-art tools like ParsCit, which achieve similar accuracy but use more domain knowledge, that is quite good. Experiments also showed that metadata extraction, and hence near-duplicate search, depends on the domain and the type of journal. Without re-training, ParsCit achieves great performance on IEEE Computer Science papers, but fails on medical papers from the BMJ. With the approach developed using layout information, we can automatically adapt to different journal types and fields, which allows us to improve accuracy and recall. I guess that is worth a publication. Working that out will take the rest of my secondment.
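As a rough illustration of what "layout features only" could mean, the following Python sketch turns the text lines of a page into one feature dictionary per line, the kind of input a sequence labeller such as a CRF usually consumes. The `TextLine` fields, the feature names and the choice to normalise by the page's median font size are assumptions made for this example, not the feature set used in the experiments. No CRF library is called here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    """One extracted line of a PDF page (hypothetical structure)."""
    text: str
    font_size: float    # in points
    y_top: float        # distance from the top of the page, in points
    page_height: float  # in points


def layout_features(line, median_font_size):
    """Describe a line by layout alone; the text content is ignored."""
    return {
        # Titles tend to be set larger than the body text.
        "rel_font_size": round(line.font_size / median_font_size, 2),
        # Relative vertical position: 0.0 is the top of the page, 1.0 the bottom.
        "rel_y": round(line.y_top / line.page_height, 2),
        "in_top_third": line.y_top < line.page_height / 3,
    }


def page_features(lines):
    """Feature sequence for one page, ready to hand to a sequence labeller."""
    body_size = median(line.font_size for line in lines)
    return [layout_features(line, body_size) for line in lines]


page = [
    TextLine("A Title Set in Large Type", 18.0, 80.0, 800.0),
    TextLine("First line of the abstract", 10.0, 300.0, 800.0),
    TextLine("Second line of the abstract", 10.0, 320.0, 800.0),
]
for features in page_features(page):
    print(features)
```

Because the features are relative to the page itself, they do not encode journal-specific vocabulary, which is one plausible reason such a model could carry over between fields.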