Corpora: Log Likelihood Ratio and multi-word units

From: Cameron Smart
Date: Thu Dec 20 2001 - 02:10:14 MET

    Apologies if this question either betrays a fundamental misunderstanding on
    my part or is old hat.

    If one uses the log-likelihood ratio (or, similarly, chi-square) to
    establish a significant difference in the use of a certain word between two
    corpora, then as far as I understand it is calculated from a contingency
    table built, for each corpus, from the frequency of the word, the frequency
    of all other words, and the total number of words in the corpus.
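
    To make the single-word case concrete, here is a minimal sketch of the
    log-likelihood ratio (G2) on a 2x2 table of word vs. all other words, for
    corpus A vs. corpus B. The function name and the example counts are
    illustrative assumptions, not taken from any particular tool.

```python
import math


def log_likelihood_ratio(freq_a, total_a, freq_b, total_b):
    """G2 for a 2x2 table: (word, other words) x (corpus A, corpus B).

    freq_a, freq_b   -- occurrences of the word in each corpus
    total_a, total_b -- total number of word tokens in each corpus
    """
    # Observed counts: rows are corpora, columns are word / other words.
    observed = [
        [freq_a, total_a - freq_a],
        [freq_b, total_b - freq_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [freq_a + freq_b, grand_total - (freq_a + freq_b)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Hypothetical counts: 50 hits in 1,000 tokens vs. 10 hits in 1,000 tokens.
score = log_likelihood_ratio(50, 1000, 10, 1000)
print(round(score, 2))
```

    With one degree of freedom, a score above 3.84 corresponds to p < 0.05
    under the usual chi-square approximation.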

    However, how is this applied if we want to establish a significant
    difference in the use of a multi-word unit (such as a two-word prepositional
    phrase) between two corpora? The frequency of the multi-word unit is easy
    enough, but what does "frequency of other words" become? Indeed, can the
    log-likelihood ratio be used in this case at all? If not, what alternatives
    are there?
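
    One possible reading, offered purely as an assumption and not as a
    settled answer, is to count in units of the same length as the phrase: for
    a two-word unit, every adjacent word pair is one observation, so "other
    words" becomes "all other two-word sequences". The sketch below uses
    made-up token lists and hypothetical function names.

```python
import math


def count_bigram(tokens, bigram):
    """Return (occurrences of bigram, total number of adjacent word pairs)."""
    pairs = list(zip(tokens, tokens[1:]))
    return pairs.count(tuple(bigram)), len(pairs)


def g2(freq_a, total_a, freq_b, total_b):
    """Log-likelihood ratio on a 2x2 table (target unit vs. all other units)."""
    grand_total = total_a + total_b
    hits = freq_a + freq_b
    cells = [
        (freq_a, total_a * hits / grand_total),
        (total_a - freq_a, total_a * (grand_total - hits) / grand_total),
        (freq_b, total_b * hits / grand_total),
        (total_b - freq_b, total_b * (grand_total - hits) / grand_total),
    ]
    # Zero observed cells contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)


# Made-up toy corpora, already tokenised.
corpus_a = "in front of the house in front of the car".split()
corpus_b = "the house and the car are behind the wall".split()

freq_a, total_a = count_bigram(corpus_a, ("in", "front"))
freq_b, total_b = count_bigram(corpus_b, ("in", "front"))
print(freq_a, total_a, freq_b, total_b, g2(freq_a, total_a, freq_b, total_b))
```

    Under this assumption the table keeps the same shape as in the single-word
    case; only the unit of counting changes from words to two-word windows.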

    Thanks in advance for any comments.

    Cameron Smart
    Hong Kong Polytechnic University
