There's a recent publication by Mikio Yamamoto and Kenneth W. Church
(Computational Linguistics 27(1), 1-30) which shows efficient ways to compute
a number of interesting quantities over all substrings in a corpus.
Very nice work.
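To give a feel for why a sorted index of suffixes helps here, the following is a minimal sketch only, not the algorithm from the paper. It assumes a naive suffix array construction that would be far too slow for a large corpus, and the function names build_suffix_array and term_frequency are made up for illustration. The idea shown is that all occurrences of a substring sit next to each other in the sorted suffix order, so its frequency falls out of two binary searches.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Illustration only; this is much too slow for a 100 million character corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def term_frequency(text, sa, pattern):
    # Count occurrences of pattern by locating the contiguous block of
    # suffixes that start with it.
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


text = "to be or not to be"
sa = build_suffix_array(text)
print(term_frequency(text, sa, "to be"))
```

Once the array exists, any substring of any length can be looked up without recounting, which is what makes "all substrings" tractable in principle.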
On Tue, Aug 27, 2002 at 05:12:33PM +0200, Stefan Evert wrote:
> Hi there!
> > I am currently working on extraction of statistically significant n-gram
> > (1<n<6) strings of alpha-numerical characters from a 100 million character
> > corpus, and I intend to apply different significance tests (MI, t-score,
> > log-likelihood etc.) on these strings. I'm testing Ted Pedersen's N-gram
> > Statistics Package, which seems to be able to perform these tasks; however,
> > it hasn't produced any results after one week of running.
> That's very probably because it's written in Perl and eating up lots
> of memory. I don't think there's a way around C/C++ for problems of
> that size (at the moment, at least).
> I always thought of NSP as a tool for counting N-grams of _tokens_
> rather than character. Apparently, you can change its definition of
> token, but that means using a trivial regular expression to chop
> single characters from your 100 million input corpus. Which is going
> to take ages.
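For comparison, character N-grams can be counted by slicing the text directly instead of tokenising it with a regular expression. This is a minimal sketch under assumptions: the function name char_ngram_counts is hypothetical, "alpha-numerical" is taken to mean Python's str.isalnum(), and the whole text is assumed to fit in memory, which may well not hold for a 100 million character corpus.

```python
from collections import Counter


def char_ngram_counts(text, n_min=2, n_max=5):
    # One Counter per N-gram length; the defaults 2..5 correspond to 1 < n < 6.
    counts = {n: Counter() for n in range(n_min, n_max + 1)}
    for n, counter in counts.items():
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Keep only strings made up entirely of alphanumeric characters.
            if gram.isalnum():
                counter[gram] += 1
    return counts


counts = char_ngram_counts("abab ab")
print(counts[2].most_common(2))
```

The memory used grows with the number of distinct N-grams, so this shows the counting step only and says nothing about whether it scales.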
> > I have a couple of queries regarding n-gram extraction:
> > 1. I'd like to ask if members of the list are aware of similar software
> > capable of accomplishing the above-mentioned tasks reliably and
> I'm afraid I don't know of any such tools. Technically, counting
> N-grams produces a very simplistic statistical language model (the
> kind used to generate random poetry), so perhaps you can dig up
> something in that area.
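To make the language-model remark concrete, here is a minimal sketch of a character N-gram model used to generate random text. Everything in it is an assumption for illustration: the function names train_char_model and generate, the choice of n=3, and the toy training string. The seed context must be n-1 characters long.

```python
import random
from collections import defaultdict


def train_char_model(text, n=3):
    # Map each (n-1)-character context to the list of characters that
    # followed it; repeated followers keep their relative frequency.
    model = defaultdict(list)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context].append(nxt)
    return model


def generate(model, seed_context, length, rng):
    # Extend the seed one character at a time by sampling a follower
    # of the current context; stop early if the context was never seen.
    out = seed_context
    k = len(seed_context)
    while len(out) < length:
        followers = model.get(out[-k:])
        if not followers:
            break
        out += rng.choice(followers)
    return out


corpus = "the cat sat on the mat. the cat ate the rat."
model = train_char_model(corpus, n=3)
print(generate(model, "th", 30, random.Random(0)))
```

Every N-gram in the generated string occurs somewhere in the training text, which is all such a model knows.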
> On the other hand, if you aren't tied to Windows (i.e. you have
> access to a Linux or Solaris computer), there's the IMS Corpus
> which isn't quite as outdated as that web page suggests. Although it
> isn't obvious from the online materials, the Corpus Workbench could be
> abused (with the help of a simple Perl script) to do what you want (at
> the price of wasting lots of disk space). Kind of a last resort, I
> > 2. And a statistical question. As I need to compute association scores for
> > trigrams, tetragrams, and pentagrams as well, I plan to split them into
> > bigrams consisting of a string of words plus one word [n-1]+ and
> > compute association scores for them.
> > Does anyone know if this is the right thing to do from a statistical point
> > of view?
> Again, I don't know of any well-founded discussion of association
> scores for N-grams in the literature. I consider it an intriguing
> problem and plan to do some work in this area when I've finished my
> thesis on bigram associations.
> The most systematic approach to N-grams I've come across is
> J.F. da Silva; G.P. Lopes. "A Local Maxima method and Fair Dispersion
> Normalization for extracting multi-word units from corpora." MOL 6,
> which can be downloaded from the first author's homepage at
> Their approach is based on breaking up N-grams into pairs of [n-1]+
> words, too, but I must say that I'm not really convinced this is the
> right way to go.
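For readers who want to see what the split amounts to, here is a minimal sketch that treats each N-gram as a pseudo-bigram of its first n-1 words plus its last word and computes pointwise MI and a t-score for that pair. It is an illustration under assumptions, not the method of the cited paper and not an endorsement of the split: the function names are hypothetical, the sample size is taken to be the number of N-gram positions, and the marginal counts are taken from the plain (n-1)-gram and unigram frequencies.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    # Frequency of every contiguous n-token sequence, keyed by tuple.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pseudo_bigram_scores(tokens, n):
    full = ngram_counts(tokens, n)
    prefix = ngram_counts(tokens, n - 1)
    unigram = Counter(tokens)
    total = sum(full.values())  # assumed sample size: number of n-gram positions
    scores = {}
    for gram, observed in full.items():
        # Expected frequency if the (n-1)-word prefix and the last word
        # were independent.
        expected = prefix[gram[:-1]] * unigram[gram[-1]] / total
        mi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[gram] = (mi, t_score)
    return scores


tokens = "a b c a b c a b d".split()
for gram, (mi, t_score) in pseudo_bigram_scores(tokens, 3).items():
    print(gram, round(mi, 3), round(t_score, 3))
```

Note that the split is asymmetric: scoring the first word against the remaining n-1 words would in general give different values, which is one reason to be cautious about this approach.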
> Moral: Early to rise and early to bed
> makes a male healthy and wealthy and dead.
> C.E.R.T. Marbach (CQP Emergency Response Team)
> http://www.ims.uni-stuttgart.de/~evert firstname.lastname@example.org
--
=================================================================
Dr. Chris Brew, Assistant Professor of Computational Linguistics
Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
Tel: +614 292 5420 Fax: +614 292 8833
Email: email@example.com
=================================================================