As I mentioned in a related message last week, I'm in the process of
creating a list of 1, 2, and 3-grams (maybe 4 and 5-grams too) in a 100
million word corpus of Spanish.
What I'm looking for is a program that will allow me to create these lists
of n-grams more efficiently than what I have presently. I need a solution
that has the following features:
** PC-based (DOS or Windows)
** Output in non-proprietary ASCII format
** Can easily handle input files as large as 1,000,000 words (hopefully,
** Can be run in "batch file" mode, i.e. without human intervention,
process a list of 40 different 1,000,000 word input files, and return 40
output files with the lists of n-grams.
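To make the batch requirement concrete, the kind of processing described above could be sketched in Python roughly as follows. This is only an illustrative sketch, not an existing tool: the function names (ngram_counts, write_counts, process_folder), the ".txt" extension, the Latin-1 encoding, and the simple lowercase "runs of word characters" tokenization are all assumptions made for the example.

```python
import collections
import pathlib
import re


def ngram_counts(text, n):
    """Count n-grams in text (assumed tokenization: lowercase runs of word characters)."""
    tokens = re.findall(r"\w+", text.lower())
    return collections.Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def write_counts(counts, out_path, encoding="latin-1"):
    """Write one 'frequency<TAB>n-gram' line per entry, most frequent first."""
    with open(out_path, "w", encoding=encoding) as out:
        for gram, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            out.write(f"{freq}\t{gram}\n")


def process_folder(in_dir, out_dir, sizes=(1, 2, 3), encoding="latin-1"):
    """Process every .txt file in in_dir unattended; one plain-text output per file and n."""
    out_root = pathlib.Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    for src in sorted(pathlib.Path(in_dir).glob("*.txt")):
        text = src.read_text(encoding=encoding)
        for n in sizes:
            # Hypothetical naming scheme: <input name>.<n>gram.txt
            write_counts(ngram_counts(text, n), out_root / f"{src.stem}.{n}gram.txt", encoding)
```

Under these assumptions, a single call such as process_folder("corpus", "ngrams") would walk through all the input files in a folder and leave the plain-text lists in another folder, with no manual conversion step.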
I've been using WordSmith, which can be run in "batch file" mode, and which
has been quite useful. The problem with WordSmith, however, is that it
exports the lists of n-grams in a proprietary format, which then have to
be converted manually -- one by one -- to standard ASCII files. In
addition, it doesn't much like input files much larger than about one
I already know that there are some very nice Unix/Linux-based solutions,
but I'm really looking for something that is PC-based, since my students
will also be using something like this in the near future, and all we have
here are PCs :-(.
In addition, I've seen reference to Perl scripts that can be run on a PC,
such as the <bigram-generate.prl> script that comes with the Brill tagger,
and which can be run with Windows ActivePerl. While I may very well end up
using this or a similar Perl script, I'm also very interested in
Thanks in advance for your help. I'll post a summary if there is interest.
Mark Davies, Associate Professor, Spanish Linguistics
4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
309-438-7975 (voice) / 309-438-8083 (fax)
** Corpus design and use / Web-database programming and optimization **
** Historical and dialectal Spanish and Portuguese syntax / Distance