It was very interesting for me to read the "What is a corpus"
A real problem exists: what is a corpus, is it balanced and/or
If we take as an example a corpus of proverbs - who
can say that it is a corpus and not an archive, a set, or a dump of
proverbs? We can find many interesting things in a dump - but
what is the value of our findings? If we did no pre-processing
(filtering) when creating our set of proverbs, then what is the
value of the following statement: "There are no Italian proverbs about
unlucky marriages"?
Such a statement is reliable or scientific only for a representative
proverb corpus. Otherwise - "dump as input - dump as output (dust to
dust)". Is there a quasi-logical procedure for deciding whether a given
collection (dump) of textual data is a representative corpus? This is the
starting point of all subsequent activity - is it a scientific one or
YS Vladimir Rykov, PhD in Computational Linguistics
www.blkbox.com/~gigawatt/rykov.html
Linguistic Institute
WWW.GOL.RU/~iling
1/12 B.Kislovsky per., Moscow, 103009
KREMLIN WALL IS WHERE YOU MAKE IT !!!
Please - do NOT send Internet (attached, multimedia etc) files - we can read ASCII files ONLY
Please - send us *.html, *.doc, other non-ASCII files to the addr: ILING@GOL.RU with RE: For Rykov