This may not be much help to you, but I wrote an article on phonological
frequency effects in English in which I speculated (but on pretty solid
grounds, I think) that at least for English, the division between
relatively common and relatively uncommon words occurs at text frequencies
of about 5 per million words of running text. Now, it is well known that
relatively frequent words in written texts, on the whole, occur even more
frequently in spoken texts; conversely, relatively infrequent words occur
even less frequently in spoken texts. I suggested that the spoken and
written frequencies should be about equal (that is, the crossover point
of the two frequency graphs) at that frequency of 5/M.
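To make the 5/M figure concrete, here is a minimal sketch in Python (not
from the article) that converts raw counts to occurrences per million
running words and sorts words on either side of the threshold. The
function names and the toy counts are hypothetical, and treating a word
at exactly 5/M as "common" is an assumption made only for illustration.

```python
# Suggested dividing line between common and uncommon words (from the text).
THRESHOLD_PER_MILLION = 5.0

def per_million(count, corpus_size):
    """Convert a raw count into occurrences per million running words."""
    return count * 1_000_000 / corpus_size

def classify(counts, corpus_size, threshold=THRESHOLD_PER_MILLION):
    """Label each word 'common' or 'uncommon' relative to the threshold.

    Assumption: a word at exactly the threshold is labelled 'common'.
    """
    labels = {}
    for word, count in counts.items():
        rate = per_million(count, corpus_size)
        labels[word] = "common" if rate >= threshold else "uncommon"
    return labels

# Toy, made-up counts for a corpus of one million running words.
toy_counts = {"the": 60000, "berserk": 1}
print(classify(toy_counts, corpus_size=1_000_000))
```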
That, of course, is not the end of the story. Apart from the
genre skewing of frequencies I just mentioned, there are also types of
words which are found ONLY in speech (particularly: interjections
[markers, hesitation markers, etc.] and a few 'emotive' words like
_berserk_ [to my knowledge, found in NO published frequency count]).
While these factors are of course important, one would expect their
importance to diminish (except, perhaps, for interjections) insofar as you
are processing HUGE corpora, since a lowered probability multiplied by a
huge input still yields occurrences.
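The arithmetic behind that last point can be sketched as follows. This is
an illustration only: the function names are hypothetical, the rate of 0.1
per million is a made-up example value, and the "at least one occurrence"
estimate assumes occurrences follow a Poisson distribution, which real
words only roughly do.

```python
import math

def expected_occurrences(rate_per_million, corpus_size):
    """Expected number of tokens of a word with the given per-million rate."""
    return rate_per_million * (corpus_size / 1_000_000)

def chance_of_at_least_one(rate_per_million, corpus_size):
    """Probability the word turns up at least once.

    Assumption: occurrences are Poisson-distributed with the expected count
    as the mean; this ignores the burstiness of real text.
    """
    mean = expected_occurrences(rate_per_million, corpus_size)
    return 1.0 - math.exp(-mean)

# A rare word (0.1 per million, a made-up rate) in corpora of growing size.
for size in (1_000_000, 100_000_000, 1_000_000_000):
    print(size,
          expected_occurrences(0.1, size),
          round(chance_of_at_least_one(0.1, size), 3))
```

The same low rate that makes a word unlikely to appear in a
million-word sample makes it close to certain to appear in a sample a
thousand times larger.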
I hope some of these reflections are useful for you. The
reference, such as it is:
Fidelholtz, James L. 1975. Word frequency and vowel reduction in
English. _Chicago linguistic society. Regional meeting. Papers_
11.200-213. [be sure to check the footnotes carefully]
On Mon, 23 Feb 1998, Sameh-al-ansary wrote:
>Date: Mon, 23 Feb 1998 20:57:58 +0200
>From: Sameh-al-ansary <firstname.lastname@example.org>
>To: CORPORA <CORPORA@HD.UIB.NO>
>Subject: Corpora: Papers and corpus software
>Resent-Date: Mon, 23 Feb 1998 20:29:23 +0100
>Dear everyone :
> I am writing my Ph.D thesis in corpus linguistics. I am in need of papers regarding the corpus-based differences between spoken and written language. If anyone has published a paper concerning any comparative difference between spoken and written language, their structural and typological differences, or any other difference, please let me know.
> Can anyone tell me where I can find software for tagging and processing a corpus?
>email : email@example.com
James L. Fidelholtz e-mail: firstname.lastname@example.org
Área de Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO