I am happy to announce the availability of two new resources for
research on the statistical analysis of word cooccurrences.
1) An On-Line Repository of Association Measures
Statistical association measures, applied to cooccurrence frequency data
collected in a contingency table, are the most widely used tool for the
analysis of word combinations and the extraction of collocations from text
corpora. Over the years, many different association measures have been
suggested (mutual information, t-score, the chi-squared test, and Dunning's
log-likelihood, to name but a few) and used in various applications.
This on-line resource aims to be a comprehensive repository of association
measures, including an explanation of the theoretical background of each
measure, references, some implementation notes, and explicit equations in
terms of observed and expected frequencies.
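As a rough illustration of what "equations in terms of observed and expected frequencies" means, the sketch below computes the four measures named above from a 2x2 contingency table. It uses common textbook forms (pointwise mutual information, t-score, Pearson's chi-squared without continuity correction, and the log-likelihood ratio); the repository may define variants that differ in detail. The function names and the O11..O22 argument order are assumptions made for this example and are not taken from the repository or the UCS toolkit.

```python
import math


def expected_frequencies(o11, o12, o21, o22):
    """Expected cell counts under independence: E_ij = R_i * C_j / N."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row totals
    c1, c2 = o11 + o21, o12 + o22   # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


def association_scores(o11, o12, o21, o22):
    """Textbook association scores for one word pair.

    Assumes o11 > 0 and non-zero row and column totals
    (hypothetical helper, not part of any existing library).
    """
    observed = (o11, o12, o21, o22)
    expected = expected_frequencies(*observed)
    e11 = expected[0]

    mutual_information = math.log2(o11 / e11)
    t_score = (o11 - e11) / math.sqrt(o11)
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with an observed count of zero contribute nothing (0 * log 0 = 0).
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return {
        "MI": mutual_information,
        "t-score": t_score,
        "chi-squared": chi_squared,
        "log-likelihood": log_likelihood,
    }


if __name__ == "__main__":
    # Toy counts invented for illustration only.
    for name, value in association_scores(20, 80, 80, 820).items():
        print(f"{name:15s} {value:.4f}")
```

All four scores are functions of the same observed and expected counts, which is why a single table of O and E values is enough to compare measures side by side.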
2) The UCS Toolkit (version 0.3)
The UCS toolkit is a collection of libraries and scripts for the statistical
analysis of cooccurrence data. It can be thought of as a simple and highly
specialised database, storing data sets of word pairs and frequency
information in a tabular format in plain (compressed) text files. The data
sets can be viewed, printed, manipulated in various ways, annotated with
association scores, ranked, and sorted. In addition, there are some library
functions for the graphical evaluation of association measures in a
collocation extraction task.
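The general workflow of annotating a tabular data set with association scores and ranking it can be sketched as follows. This is not the UCS file format or its API: the column names (l1, l2, f, f1, f2, N), the helper functions, and the toy counts are all assumptions invented for this example, which uses only the Python standard library.

```python
import csv
import io
import math

# Invented toy data: word pair, joint frequency f, marginal
# frequencies f1 and f2, and sample size N (tab-separated).
SAMPLE = (
    "l1\tl2\tf\tf1\tf2\tN\n"
    "strong\ttea\t20\t100\t100\t1000\n"
    "the\tof\t50\t500\t400\t1000\n"
    "heavy\train\t30\t60\t50\t1000\n"
)


def t_score(f, f1, f2, n):
    """t-score from joint and marginal frequencies (assumes f > 0)."""
    expected = f1 * f2 / n
    return (f - expected) / math.sqrt(f)


def rank_pairs(text, measure):
    """Annotate each row with a score and sort by it, best first."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        counts = [int(row[key]) for key in ("f", "f1", "f2", "N")]
        row["score"] = measure(*counts)
    return sorted(rows, key=lambda row: row["score"], reverse=True)


if __name__ == "__main__":
    for row in rank_pairs(SAMPLE, t_score):
        print(f'{row["l1"]}\t{row["l2"]}\t{row["score"]:.3f}')
```

Passing the measure as a function keeps the ranking step independent of any particular association measure, so a different score can be swapped in without changing the table handling.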
The UCS toolkit provides reference implementations for all association
measures listed in the on-line repository above. It is open source software,
based on the freely available Perl (www.perl.com) and R (www.r-project.org)
languages, and should work on most modern Unix-like operating systems
(with experimental support for Windows under the Cygwin emulation layer).
For more information and downloads, please turn to
or go to
and click on "Software".
Best Wishes and a Happy Easter,