If you subscribe to the UK's Guardian International you can build a good
newspaper corpus for free by registering for the email edition. You'll be
sent the full text of the following sections each week:
international-news, us-news, uk-news, features, culture, and sport. With
around 50,000 words an issue you will soon accumulate a useful set of texts
(already blocked into quite useful thematic groups).
--
Dr Christopher Tribble
Sri Lanka    21 Wijerama Mawatha, Colombo 7
             TEL +94 75 332 309
UK           122, Queen Alexandra Mansions, Judd Street
             London WC1 H 9DQ
             TEL +44 171 833 4271
UK Mailing   c/o FCO (Sri Lanka)
             The British Council, Sri Lanka
             King Charles Street, London SW1A 2AH
E-mail       email@example.com
Home Page    http://ourworld.compuserve.com/homepages/Christopher_Tribble

> -----Original Message-----
> From: firstname.lastname@example.org [mailto:email@example.com]On
> Behalf Of VASUPRADA KANDRAKONTA(98MCMT04)
> Sent: Friday, November 05, 1999 8:00 PM
> To: corpus list
> Subject: Corpora: Part of Speech Tagging<unknown-words>
>
>
> Hi everybody,
> I'm doing a project in POS tagging.For this I'm using the statistical
> methods. I've built a Hidden Markov Model using the SUSANNE corpus and am
> using the Viterbi Algorithm to find out the best tag sequence.But I have a
> problem of sparse data. Can anyone tell me what should be done with the
> unknown words<words not found in the corpus>. One method is to use the
> features like word endings and capital letter starting. But what about the
> state transition matrix.
> If anyone knows any literature on the net about this, please let me know.
>
> I'm in a plan to upgrade my system,using a corpus of larger size.The
> corpus I'm using right now is of size 1,30,000words. Can anyone tell me
> where I can get a downloadable corpus(free of cost).
>
> Thankyou,
> Vasuprada Kandrakota
> Dept. of Computer Science,
> University of Hyderabad,
> Hyderabad-INDIA 500 046
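
On the word-endings idea in the quoted question, here is a minimal sketch of
one way it might be set up. It is not taken from any particular tagger: the
function names, the three-character suffix limit, and the toy training pairs
are all assumptions made for illustration, and capitalisation is left out to
keep it short. Note that in a standard HMM the state transition matrix is
defined over tags only, so unknown words affect just the emission side. The
distribution returned here is P(tag | suffix); to use it as an emission score
it would still need dividing by the tag prior P(tag).

```python
from collections import Counter, defaultdict

def train_suffix_counts(tagged_words, max_len=3):
    """Map each lowercase word ending (1..max_len chars) to a Counter of tags."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            counts[w[-n:]][tag] += 1
    return counts

def guess_tags(word, suffix_counts, tag_totals, max_len=3):
    """Return P(tag | longest seen suffix), backing off to overall tag frequencies."""
    w = word.lower()
    # Try the longest suffix first, then progressively shorter ones.
    for n in range(min(max_len, len(w)), 0, -1):
        seen = suffix_counts.get(w[-n:])
        if seen:
            total = sum(seen.values())
            return {t: c / total for t, c in seen.items()}
    # No suffix was ever seen: fall back to the plain tag distribution.
    total = sum(tag_totals.values())
    return {t: c / total for t, c in tag_totals.items()}

# Toy training data, invented for illustration (not from SUSANNE).
train = [("running", "VBG"), ("walking", "VBG"), ("quickly", "RB"),
         ("slowly", "RB"), ("dog", "NN"), ("building", "NN")]
suffix_counts = train_suffix_counts(train)
tag_totals = Counter(tag for _, tag in train)

jumping = guess_tags("jumping", suffix_counts, tag_totals)
unseen = guess_tags("xyz", suffix_counts, tag_totals)
```

With the toy data above, "jumping" matches the "ing" ending and gets most of
its probability mass on VBG, while "xyz" matches no ending and falls back to
the overall tag frequencies.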