Corpora: Listing of historical corpora (besides ICAME and Spanish)

From: Mark Davies
Date: Wed May 24 2000 - 20:57:50 MET DST

    I'm trying to create a listing of HISTORICAL corpora for languages other than
    Spanish (I already have that list), beyond those already on the ICAME
    CD-ROM, which includes The Helsinki Corpus of English Texts, The Helsinki
    Corpus of Older Scots, Corpus of Early English Correspondence, The
    Newdigate Newsletters, Lampeter Corpus, Innsbruck Computer-Archive of
    Machine-Readable English Texts (ICAMET) [see]

    Here's a listing of what I have so far:

    Language / Name / URL / Approx. time period / Approx. size

    1) English / Penn-Helsinki Parsed Corpus of Middle English / 1150-1500 / 1,200,000 words [based on the
    Helsinki corpus]

    2) English / Penn-Helsinki Parsed Corpus of Old English / Info at / 850-1150 / 420,000 words
    [based on the Helsinki corpus]

    3) French / ARTFL (Trésor de la langue française) / / 1600 > /
    115,000,000 words

    4) Swedish / Projektet Källtext / ???? /
    2,000,000 words

    5) German / Projekt Gutenberg / /
    Mostly 1900s, but a few earlier / 300 texts (# words ??)

    6) Portuguese / Tycho Brahe Parsed Corpus of Historical Portuguese / / c1600-1900 / Goal of 1,000,000 words

    7) Chinese / Historical Corpora for Synchronic and Diachronic Linguistics
    Studies / / Pre-Qin to
    Chang dynasties (time period??) / 17,000,000 characters
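
    For anyone who wants to process a listing like this automatically, here is
    a minimal Python sketch. It assumes each entry sits on a single line and
    that fields are separated by a free-standing slash (whitespace before it),
    so slashes inside a URL are not treated as separators. The names
    CorpusEntry and parse_entry are hypothetical and only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class CorpusEntry:
    """One record: Language / Name / URL / Approx. time period / Approx. size."""
    language: str
    name: str
    url: str
    period: str
    size: str


def parse_entry(line: str) -> CorpusEntry:
    """Parse a single-line entry into its five fields (assumed format)."""
    # Drop a leading item number such as "3) ".
    line = re.sub(r"^\s*\d+\)\s*", "", line)
    # Split only on a slash preceded by whitespace and followed by
    # whitespace or end of line; slashes inside URLs stay intact.
    parts = [p.strip() for p in re.split(r"\s+/(?=\s|$)", line)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return CorpusEntry(*parts)


# Entry 3 from the listing above, joined onto one line; its URL field is empty.
entry = parse_entry(
    "3) French / ARTFL (Trésor de la langue française) / / 1600 > / 115,000,000 words"
)
print(entry.language, "|", entry.period, "|", entry.size)
```

    Entries with a missing field (for example, no URL slot at all) would raise
    ValueError here, which makes incomplete records easy to spot.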

    As can be seen, I haven't yet identified many HISTORICAL corpora for
    German, Dutch, Norwegian, Icelandic, Italian, Romanian, Hungarian, Finnish,
    any of the Slavic languages, or any of the other European languages. In
    addition, the only non-European language for which I can find anything is
    Chinese. (Also, I know that there are/must be nice collections of classical
    Greek and Latin in electronic form and on the Web [due to the large number
    of classical texts], but I haven't compiled a list of these yet.)

    At any rate, if anyone has information on other historical corpora for the
    desired languages, I'd appreciate your sending me a URL for the
    resources. I will be creating a webpage with links to the historical
    corpora and will announce this on CORPORA in about a week, when I've
    received feedback from others.

    Thanks in advance for your help.

    Mark Davies

    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

