Corpora: summary: corpus evidence that runs counter to intuition?

From: Sebastian Hoffmann
Date: Wed Oct 24 2001 - 14:02:29 MET DST

    Dear corpus list subscribers

    I'd like to thank all those people who have sent replies to my
    question about corpus evidence that runs counter to intuition. A
    number of people have asked me to write a summary - so here's what I
    received (in chronological order):

    Yorick Wilks mentioned that Paul Jacobs (in his information
    extraction presentation) used to point out that, in English,
    'television' nearly always means 'the medium' in corpus counts and
    not 'the TV set', which is what most of us would (unreflectingly)
    assume.

    Eric Atwell wrote that many people (probably especially
    non-linguists) are surprised by the high frequency of the
    four-letter word fuck in the spoken part of the BNC; he thinks that
    this is "evidence that it is far more natural and normal in ordinary
    speech than many expect".

    I had a look in the BNC and here's some data:
    In its various forms, fuck occurs 2,814 times in the spoken part of
    BNC World Edition (272 instances per million words, pmw). It is
    uttered almost three times as often by male speakers as by female
    speakers (359pmw vs. 137pmw). So yes, it IS frequent. ;-) But it may
    also be important to note that 985 of the spoken instances occur in
    one single file (KDA), which is a collection of conversations
    between aircraft engineers. In this file, the frequency is a
    whopping 13000 instances pmw! Other
    files with high numbers of (fuck|fucks|fucking|fucked) are KD9 (110
    instances, 7909pmw), KE1 (141 instances, 6713pmw), KDN (231
    instances, 4986pmw), KCU (164 instances, 3045pmw), KP4 (105
    instances, 2547pmw), G01 (104 instances, 2548pmw), and FP6 (100
    instances, 2520pmw). Thus, these 8 files alone cover 1940 - or 69 per
    cent - of all relevant instances.
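
    The arithmetic behind these figures can be checked with a short
    sketch. The per-file counts and the 2,814 total are the ones quoted
    above; the helper names (per_million, implied_words) are made up for
    this illustration, and the spoken word count it derives is only an
    estimate backed out of the quoted 272pmw rate, not an official BNC
    figure.

```python
# Hits per file as quoted above (spoken part of BNC World Edition).
FILE_HITS = {
    "KDA": 985, "KD9": 110, "KE1": 141, "KDN": 231,
    "KCU": 164, "KP4": 105, "G01": 104, "FP6": 100,
}
TOTAL_HITS = 2814  # all spoken instances, as quoted above


def per_million(hits: int, words: float) -> float:
    """Normalise a raw count to instances per million words (pmw)."""
    return hits / words * 1_000_000


def implied_words(hits: int, pmw: float) -> float:
    """Back out the word count implied by a raw count and its pmw rate."""
    return hits / pmw * 1_000_000


top_hits = sum(FILE_HITS.values())
share = top_hits / TOTAL_HITS
# Estimate only: derived from the quoted rate, not taken from BNC documentation.
spoken_words = implied_words(TOTAL_HITS, 272)

print(top_hits)                                       # 1940
print(round(share * 100))                             # 69
print(round(per_million(TOTAL_HITS, spoken_words)))   # 272
```

    Normalising to pmw is what makes a small file such as KDA comparable
    with the spoken component as a whole, which is why a single file can
    dominate the raw total.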

    Jasper Holmes notes that intuition often fails in connection with
    well-known 'grammar errors'. For example, around half of the cases
    of NONE, NEITHER, NO+np (as subject of a present tense verb) in the
    ICE-GB corpus appeared with plural verbal agreement (your grammar
    book will tell you they are singular). Similarly for examples like _a
    bag of letters_.

    John McKenny wrote:
    What springs to mind for me is the use of 'would' to talk about past
    habits. e.g. "when I was young I would go to Mass each morning with
    my grandmother". Before the advent of COBUILD this was considered to
    be literary and less colloquial than "used to". This was my intuition
    and the general intuition of the EFL community - witness countless ELT
    textbooks and grammars. I taught "used to" to countless
    pre-intermediate students leaving "would" for advanced students.
    COBUILD turned this upside down, I think.

    Philip Resnik pointed out that "Talke Macfarland has done some very
    interesting corpus-based work on passive cognate object
    constructions, showing that corpus evidence contradicts some
    introspection-based claims in the literature about grammaticality."

    John Williams wrote that "In the Bank of English, by far the most
    frequent meaning of 'bash' (any part of speech) is 'party' whereas I
    think most native speakers would intuitively go for 'hit, beat up'
    (informal). This could be explained by the large news media
    component of the BofE ('bash = party' is very much a 'media' word)
    or maybe it's 'really' the most frequent meaning (whatever that
    means)."

    Guy Aston replied to this statement and pointed out that the BNC
    does not support this claim. He writes:
    On a rough count, out of 272 occurrences of "bash", 97 are verbs
    meaning "hit" and 19 are forms of the delexicalised "have a bash", as
    well as 6 other nominal uses meaning "a hit". 59 are proper nouns
    (characters called "Bash"), leaving only 80ish as nouns meaning
    "party". And then the verb bash also has other forms ...

    John Williams mentioned two further points:
    The large news component [in the Bank of English] also explains
    things like the main verb collocates of 'radio station' being things
    like 'capture' or 'take over', rather than the more intuitive 'listen
    to' or 'tune into'.
    And also there are the well-known cases like 'give', where the
    delexicalized meanings ('give a smile', etc) are more frequent than
    'hand over, present'; and 'see = understand' which is more frequent
    than 'see = perceive with eyes'.

    Again, I checked the BNC and looked for verb collocates of radio
    station and radio stations (which together occur 509 times in the
    BNC) within a window of -3 to +3. The result is ranked by
    log-likelihood value and the lemmatization is based on the Lancaster
    scheme provided with the BNC World Edition. Only node-collocate pairs
    which occurred at least 3 times were considered for the calculation.
    Sorry for the formatting - hope you can make sense of this table.

    No.  Lemma                 n  n coll.  n texts  log-likelihood value
      1  be_VERB         3244400       64       43             93.790871
      2  broadcast_VERB      970        7        6             89.244032
      3  occupy_VERB        4379        6        6             56.542749
      4  have_VERB       1319155       30       23             49.442746
      5  own_VERB           6372        5        5             41.556727
      6  play_VERB         37632        6        5             31.023832
      7  report_VERB       18747        5        5             30.875074
      8  run_VERB          39201        6        6             30.547817
      9  use_VERB         105881        8        5             29.948009
     10  seize_VERB         2505        3        3             27.448364
     11  say_VERB         318281       11       10             25.577838
     12  establish_VERB    17397        4        4             23.526801
     13  was_VERB         883602       16       14             20.391751
     14  operate_VERB      10179        3        3             19.103241
     15  hear_VERB         34747        3        2             11.959554
     16  take_VERB        173956        5        5             10.003759
     17  call_VERB         52265        3        3              9.669488
     18  get_VERB         213722        5        5              8.305941
     19  could_VERB       160161        4        4              7.064107
     20  know_VERB        178522        4        4              6.362790
     21  give_VERB        125302        3        3              5.087929
     22  go_VERB          227069        4        4              4.879994
     23  do_VERB          538558        6        6              3.632087
     24  will_VERB        329392        4        4              2.835328
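
    For readers unfamiliar with the ranking statistic, here is a minimal
    sketch of a log-likelihood (G2) score over a 2x2 contingency table.
    It is one common way of setting up the calculation and is not
    claimed to be the exact procedure behind the table: the function
    names, the way the window is turned into a contingency table, and
    the round corpus size of 100,000,000 tokens are all assumptions, so
    the scores it produces are not expected to match the values listed.

```python
import math


def log_likelihood(o11: float, o12: float, o21: float, o22: float) -> float:
    """G2 statistic for a 2x2 table of observed counts."""
    observed = ((o11, o12), (o21, o22))
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    total = rows[0] + rows[1]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


def collocate_score(k: int, node_freq: int, coll_freq: int,
                    corpus_size: int, span: int = 6) -> float:
    """Score a collocate seen k times inside the windows around a node.

    Assumption: every node occurrence contributes `span` window slots
    (3 to the left plus 3 to the right), and the rest of the corpus is
    the comparison set.
    """
    window_tokens = node_freq * span
    o11 = k                                # collocate inside a window
    o12 = window_tokens - k                # other tokens inside windows
    o21 = coll_freq - k                    # collocate outside the windows
    o22 = corpus_size - window_tokens - coll_freq + k
    return log_likelihood(o11, o12, o21, o22)


# Counts for broadcast_VERB from the table; the corpus size is an assumed
# round figure, not the exact BNC token count.
score = collocate_score(k=7, node_freq=509, coll_freq=970,
                        corpus_size=100_000_000)
print(score)
```

    A table whose observed counts equal the expected ones scores zero,
    and the score grows as the collocate turns up in the window more
    often than its overall frequency predicts. That is why a rare verb
    such as broadcast can rank close to a very frequent one such as be.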

    Since my calculation is based on single word collocates, I cannot
    give any information about "the more intuitive [verb - preposition
    combinations] 'listen to' or 'tune into'" - but in any case, the
    verbs listen and tune are not found in the above table... ;-)
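
    The counting step described above (verb collocates in a window of
    -3 to +3, keeping only pairs seen at least 3 times) can be sketched
    as follows. The token list is a toy example invented for this
    illustration, not BNC data; the (lemma, tag) tuple format, the tag
    names and the function name verb_collocates are hypothetical, and
    measuring the window from the edges of the two-word node is an
    assumption.

```python
from collections import Counter


def verb_collocates(tokens, node=("radio", "station"), span=3, min_count=3):
    """Count verb lemmas within `span` tokens of each node occurrence."""
    counts = Counter()
    width = len(node)
    for i in range(len(tokens) - width + 1):
        lemmas = tuple(lemma for lemma, _ in tokens[i:i + width])
        if lemmas != node:
            continue
        left = tokens[max(0, i - span):i]            # up to 3 tokens before
        right = tokens[i + width:i + width + span]   # up to 3 tokens after
        for lemma, tag in left + right:
            if tag == "VERB":
                counts[f"{lemma}_VERB"] += 1
    # Drop node-collocate pairs below the frequency threshold.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy lemmatised sentence, repeated so that the threshold of 3 is reached.
sentence = [
    ("rebel", "SUBST"), ("seize", "VERB"), ("the", "ART"),
    ("radio", "SUBST"), ("station", "SUBST"), ("and", "CONJ"),
    ("broadcast", "VERB"), ("a", "ART"), ("statement", "SUBST"),
]
result = verb_collocates(sentence * 3)
print(result)
```

    Because matching is done on lemmas, 'radio station' and 'radio
    stations' are counted together. A verb-preposition combination such
    as 'listen to' would show up only as the single collocate listen.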

    Bob Krovetz wrote:
    In my work on morphology I would sometimes come across examples that
    made sense, but I wouldn't have thought of them beforehand. I studied
    corpus data in order to decide which morphological variants I should
    reduce to a root for purposes of information retrieval (this is
    called "stemming"). I tried to avoid any groupings that would create
    ambiguity. For example, I didn't reduce "gravitation" to "gravity"
    because "gravity" can also mean "serious" (the gravity of the crime),
    which is the predominant meaning of "gravity" in legal text. So
    should "accelerators" be reduced to "accelerator"? I found that
    "accelerator" refers to either a car accelerator or a nuclear
    particle accelerator in newspaper text. But "accelerators" referred
    only to nuclear particle accelerators. We just don't talk about
    more than one car accelerator. It's possible to do so, but very
    unlikely (at least within newspaper text). I'm not saying that
    "accelerator" is limited to those two meanings either - those were
    just the ones I found in the corpora I studied.

    I'm still looking for more examples - please keep them coming and
    I'll post a second summary...

    Sebastian Hoffmann
    University of Zurich


    This archive was generated by hypermail 2b29 : Wed Oct 24 2001 - 14:15:07 MET DST