Corpora: Cooperation needed to develop Dutch IR test collection

From: Djoerd Hiemstra
Date: Tue May 09 2000 - 17:42:18 MET DST


    APOLOGIES for multiple copies of this message.

    Dear NLP/IR researcher,

    We are planning to set up a TREC style information retrieval (IR) test
    collection with Dutch data. The collection will consist of newspaper
    articles, 40 queries and relevance judgements of real users, and will
    be available as a benchmark for future evaluations of Dutch NLP and IR
    techniques for tasks like retrieval or filtering. TREC style IR
    collections are created by judging only a part (the pool) of the
    document collection for relevance (see: for more
    information). Essential for creating a reasonable IR test collection
    is therefore that many different IR systems, following different
    approaches to IR, contribute to this document pool. Your cooperation is
    needed to make this a success. We plan to include this evaluation in
    the Cross-Language Evaluation Forum (CLEF). CLEF is the follow-up
    of the successful Cross-Language track of TREC and will
    start this year with a document collection consisting of French,
    English, German and Italian documents. The CLEF organisation intends
    to extend the collection next year with new languages. We hope Dutch
    will be one of those. If you are interested in research on a Dutch
    collection we encourage you to participate, thereby improving the
    quality of the test collection. Next year's evaluation will start in
    May 2001 and results have to be submitted in July 2001. Participation
    in the monolingual Dutch task is relatively simple and could be done
    by students as a design project.
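
    The pooling step described above can be sketched as follows. This is
    only a minimal illustration under stated assumptions, not the official
    TREC or CLEF procedure: the function name build_pool, the layout of the
    runs dictionary and the depth cut-off are all hypothetical choices made
    for this example.

```python
from typing import Dict, List, Set

# Hypothetical layout: system name -> query id -> ranked list of document ids.
Runs = Dict[str, Dict[str, List[str]]]


def build_pool(runs: Runs, depth: int = 100) -> Dict[str, Set[str]]:
    """Collect, per query, the union of the top-`depth` documents of every run.

    Only the documents in this pool would be shown to the assessors;
    documents outside the pool are left unjudged.
    """
    pool: Dict[str, Set[str]] = {}
    for ranking_by_query in runs.values():
        for query_id, ranked_docs in ranking_by_query.items():
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool


# Two made-up systems that retrieve partly different documents for query "q1".
example_runs: Runs = {
    "system_a": {"q1": ["d1", "d2", "d3"]},
    "system_b": {"q1": ["d2", "d4", "d5"]},
}
example_pool = build_pool(example_runs, depth=2)
print(sorted(example_pool["q1"]))  # ['d1', 'd2', 'd4']
```

    In this toy example the second system adds d4 to the pool, a document
    the first system did not rank in its top two. The more systems with
    different approaches contribute, the more relevant documents the pool
    is likely to contain.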

    Chances for success are heavily dependent on the number of groups
    interested in Dutch. The following people already informally expressed
    their interest in working with a Dutch test collection:
      - Keith van Rijsbergen (University of Glasgow)
      - Arjen de Vries (CWI, Amsterdam)
      - Wessel Kraaij (TNO-TPD, Delft)
      - Djoerd Hiemstra (University of Twente)

    Expressing your interest at this point will not commit you to anything,
    but it will help us in showing that there is enough interest from research
    institutes and companies in Dutch as a language to develop and evaluate IR
    systems for. Also, if you express your interest, we will keep you informed
    of any new developments.

    Note: For this year's CLEF (2000), Dutch translations of the topics
      will be included in the official topics release of 8 May 2000. So,
      already for this year, participants are able to study basic problems
      with handling Dutch (like compound analysis) and use or develop
      resources like stop lists, stemmers, taggers, parsers, translation
      dictionaries, etc. as a preparation for CLEF 2001. We encourage
      interested groups to participate already this year.
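
      As an illustration of the compound analysis problem mentioned above,
      here is a minimal sketch of a dictionary-based compound splitter. It
      is not a resource provided by the organisers: the function name
      split_compound, the tiny lexicon and the handling of only the linking
      "s" are assumptions made for this example, and a real system would
      need a much larger lexicon and more linking elements.

```python
from typing import List, Optional, Set


def split_compound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Split `word` into known lexicon entries, or return None if impossible.

    Tries every prefix that is in the lexicon and recursively splits the
    remainder, optionally skipping a linking "s" between the parts.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        tails = [word[i:]]
        if word[i] == "s":
            # Also try the remainder without a linking "s".
            tails.append(word[i + 1:])
        for tail in tails:
            if len(tail) < min_len:
                continue
            parts = split_compound(tail, lexicon, min_len)
            if parts is not None:
                return [head] + parts
    return None


# A tiny, made-up lexicon for demonstration purposes only.
toy_lexicon = {"voetbal", "wedstrijd", "station", "plein"}
print(split_compound("voetbalwedstrijd", toy_lexicon))  # ['voetbal', 'wedstrijd']
print(split_compound("stationsplein", toy_lexicon))     # ['station', 'plein']
```

      Indexing the parts next to the full compound lets a query such as
      "wedstrijd" also match documents that only contain
      "voetbalwedstrijd".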

      The easiest way to work on Dutch in CLEF 2000 is by doing a
      bilingual task, e.g. using Dutch queries to retrieve English
      documents, but the full task (i.e. Dutch --> X) is also possible. If
      necessary, we are willing to provide and point out resources and/or
      software to groups that are lacking those. For more information,
      please contact the people below.

    Best regards,

    Wessel Kraaij        Djoerd Hiemstra
    <>                   <>
    TNO-TPD              University of Twente
