Corpora: Summary: 'Help on Frequencies'

From: Pascual Cantos
Date: Mon Oct 15 2001 - 10:39:33 MET DST

    Dear List Members,

    Last week I posted a message asking for help (hints, comments,
    literature, etc.) on frequencies of occurrence (see original message below).

    Thanks a million to the people who answered:

    Tony Berber Sardinha
    Raphael Salkie
    Adam Kilgarriff
    William Mann
    Daniel Walker
    Linda Bawcom
    Jerome Richalot

    This is a summary of the comments, literature and websites suggested:


    William Mann:
    Remember that in the very early reports on the Brown Corpus (the grandfather
    of all), the word "jabberwocky" showed up with fairly high frequency.

    Daniel Walker:
    > What do frequencies exactly tell?
    Well, frequencies can give an idea of how likely some event is. A nice
    analogy is the linguistic notion of markedness: the more likely a
    linguistic phenomenon, the less marked it is, and vice versa. More
    generally, statistics provide a well-founded way to incorporate empirical
    evidence into linguistic studies.
    > And more interesting, what do they hide?
    > How misleading/erroneous can they be?
    > How far can we rely on them?
    It's hard to make inferences about infrequent events. This is both a good
    and a bad thing. For example, sentences which would fail a grammaticality
    judgement may be infrequent, providing empirical support for native
    intuition. On the other hand, most of language is infrequent (this is
    similar to Chomsky's notion of Poverty of Stimulus), which means it can be
    very difficult to collect examples of interesting phenomena. Most texts
    have a bias towards some domain and can be misleading. For example, just
    because the bilingual proceedings of the Canadian parliament translate
    'House' as 'Chambre' 75% of the time doesn't necessarily indicate that
    'House' rarely means 'maison'. The limitations of statistics in linguistics
    vary according to what you're measuring and how you measure it. There are
    well-established techniques for making cut-off and significance decisions,
    but there is also a need for experimentation and maybe even art.
    > What other features/aspects/measures should also be considered?
    > Are there ways/techniques to "correct" frequency indices, statistically?
    > I would most appreciate ideas, comments and literature on this issue.
    There are many interesting and useful statistics that one can take from
    some body of text, and many techniques can be used to "correct" or smooth
    counts. I would suggest reading "The linguist's guide to statistics" by
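
    As one possible illustration of what "smoothing" counts can mean, here is
    a minimal sketch of add-one (Laplace) smoothing in Python. It is not taken
    from any of the replies; the function name, the toy tokens and the
    vocabulary are all assumptions made for the example. The idea is that a
    word never seen in the sample (here 'maison') still receives a small
    non-zero estimate instead of zero.

```python
from collections import Counter

def laplace_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed relative frequencies over a fixed vocabulary.

    Hypothetical helper for illustration: tokens outside `vocab` are ignored
    so that the estimates sum to 1 over `vocab`.
    """
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Invented toy data: 'maison' never occurs in the sample.
tokens = "chambre chambre chambre salle salle".split()
vocab = {"chambre", "salle", "maison"}
probs = laplace_probs(tokens, vocab)

for word in sorted(probs, key=probs.get, reverse=True):
    print(f"{word}: {probs[word]:.3f}")
```

    Add-one is only the simplest option and is known to be crude for large
    vocabularies; it is shown here because it makes the principle visible.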

    Linda Bawcom:
    John Sinclair (1991), Corpus, Concordance, Collocation, says: 'Any instance
    of language depends on its surrounding context. The details of choice shown
    in any segment of a text depend, some of them, on choices made elsewhere in
    the text, and so no example is ever complete unless it is a whole text'. (p.
    Michael Hoey also said at the TESOL Spain conference (1997?): 'Wordlists
    homogenize the heterogeneous'.
    So, for me, the frequency of a word is only the first step. That is, it is
    interesting in itself, but it is not that important (unless one is
    compiling a dictionary such as COBUILD). For me (as a teacher), the most
    important thing is context: how the word 'collocates', 'colligates' or
    'co-occurs'. That is, if I am a language learner, and a lazy one at that
    (which I am!), and my teacher tells me that two words are synonyms, I am
    going to learn only one of them.
    What I have seen is that 1) (as for relying on a corpus) what you get out
    of a corpus depends very much on the corpus; you have to be very careful
    about its purpose. 2) You cannot classify whole sets of words, as
    language-learning textbooks do (e.g. ways of looking, ways of touching),
    without giving a context.
    An example: I am looking (for a presentation) at the difference between
    'tal vez' and 'quizá'. What I have seen is that 'quizá' is followed by
    'por eso' or 'para' plus some reason 8 times more often than 'tal vez' is,
    and also that both have a negation in their contexts in almost half of the
    instances; I don't know why.
    Now you, as a native speaker, no doubt already knew this. But I was
    surprised.
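
    A comparison of this kind (how often a node word is followed by a given
    item) can be sketched roughly as below. This is only an illustration and
    not the procedure actually used for the 'tal vez' / 'quizá' study: the
    function name, the window size and the toy token list are assumptions,
    and a multi-word node such as 'tal vez' would first have to be joined
    into a single token.

```python
def followed_by_rate(tokens, node, collocate, window=3):
    """Share of occurrences of `node` that have `collocate` among the
    next `window` tokens. Hypothetical helper for illustration only."""
    hits = 0
    total = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            total += 1
            if collocate in tokens[i + 1 : i + 1 + window]:
                hits += 1
    return hits / total if total else 0.0

# Invented toy sequence, not real corpus data.
toy = "quizá por eso vino quizá para verte vino quizá no vino".split()
rate = followed_by_rate(toy, "quizá", "por")
print(f"'quizá' followed by 'por' within 3 tokens: {rate:.2f}")
```

    Running the same function for each node word over the same corpus gives
    two rates that can then be compared.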

    Jerome Richalot:
    "Statistics for corpus linguistics" by Michael P. Oakes (Edin. Textbooks in
    empirical linguistics, EUP) seems like a good place to start. It goes beyond
    raw frequencies and purely descriptive statistics into inferential
    statistics.
    Chapter 1 starts with a quote (de Haan and van Hout 1986) referring to
    descriptive statistics as "the useful loss of information." This I
    understand, at least insofar as one should indeed be aware that some
    information is lost through purely descriptive statistics. I just wonder
    how "useful" it is!


    Charniak, Eugene. 1993. Statistical Language Learning. Cambridge, MA: MIT
    Press

    Manning, Christopher, and Hinrich Schütze. 1999. Foundations of Statistical
    Natural Language Processing. Cambridge, MA: MIT Press

    Jurafsky, Dan, and James Martin. 2000. Speech and Language Processing: An
    Introduction to Natural Language Processing, Computational Linguistics
    and Speech Recognition. Upper Saddle River, NJ: Prentice Hall

    John Sinclair (1991) Corpus, Concordance, Collocation

    Michael Hoey (1997) 'Wordlists homogenize the heterogeneous' Conference,
    TESOL Spain.

    Michael P. Oakes (1997) Statistics for corpus linguistics (Edin. Textbooks in
    empirical linguistics, EUP)

    Adam Kilgarriff (1997) Putting Frequencies into the Dictionary.
    International Journal of Lexicography 10(2): 135-155

    Adam Kilgarriff (forthcoming) Comparing Corpora. International Journal of
    Corpus Linguistics

    Adam Kilgarriff and Raphael Salkie:
    Corpus similarity and homogeneity via word frequency. In M. Gellerstam et
    al (eds), EURALEX '96 Proceedings (Göteborg, Göteborg University, 1996),

    "The linguist's guide to statistics" by Krenn,

    Many corpus-based applications in foreign language materials and dictionary
    making, among others, rely mostly on raw frequencies (absolute and/or
    relative frequencies) of word forms, lemmas, bigrams, etc. Frequency
    indices are taken into account in order to decide whether an item should be
    considered or not.
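
    For readers unfamiliar with the terms, the following minimal Python
    sketch shows the difference between absolute and relative frequency. The
    function name, the `per` scaling parameter and the sample sentence are
    assumptions for illustration; real applications would work on tokenised
    and possibly lemmatised corpus text.

```python
from collections import Counter

def frequency_table(tokens, per=1_000_000):
    """Map each word form to (absolute count, count per `per` tokens)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (c, c / n * per) for w, c in counts.items()}

# Invented six-token sample; `per=1000` keeps the numbers readable.
sample = "the cat sat on the mat".split()
table = frequency_table(sample, per=1000)

for word, (absolute, relative) in sorted(table.items()):
    print(f"{word}: {absolute} occurrence(s), {relative:.1f} per 1000 tokens")
```

    Scaling to a common base makes counts from corpora of different sizes
    comparable, but it says nothing about how evenly a word is spread across
    the texts.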

    And here are my questions:
    What do frequencies exactly tell?
    And more interesting, what do they hide?
    How misleading/erroneous can they be?
    How far can we rely on them?
    What other features/aspects/measures should also be considered?
    Are there ways/techniques to "correct" frequency indices, statistically?

    I would most appreciate ideas, comments and literature on this issue.
    I also promise to send a summary of all mails received.

    Regards, and a million thanks



    Dr. Pascual Cantos Gómez

    Departamento de Filología Inglesa
    Universidad de Murcia
    C/. Santo Cristo, 1
    30071 Murcia (Spain)

    Tel.: +34 968 364365
    Fax: +34 968 363185

    This archive was generated by hypermail 2b29 : Mon Oct 15 2001 - 11:12:26 MET DST