At 09:48 16/06/2003 +0100, Silvia Bernardini wrote:
>We are about to start the categorization of a corpus of Italian newspaper
>text into a set of broad topics (sports, internal affairs, arts, business,
>etc). We plan to follow a standard supervised machine learning approach,
>tagging a subset of the corpus manually, and following the usual
>We would like to find information about other projects concerning the
>categorization of newspaper text -- in particular, we are interested in
>the topic sets that have been used in similar projects. For example, if
>somebody has the list of topics used in the AP text cat collection, and
>could send us a copy, that would be extremely useful.
A European news categorization project was NAMIC.
Text categorization test collections for your problem are (in English):
* Reuters Corpus, Volume 1
(http://about.reuters.com/researchandstandards/corpus/) (use this one; it is
much bigger and more challenging).
You can take topics from them.
You can also use the sections of newspapers as topics.
For information on TC and resources for Italian, contact the Istituto di
Linguistica Computazionale - Consiglio Nazionale Ricerche
(http://www.ilc.cnr.it/indexflash.html) and Fabrizio Sebastiani
(http://faure.iei.pi.cnr.it/~fabrizio/), from the Istituto di Scienza e
Tecnologia dell'Informazione - Consiglio Nazionale Ricerche.
>Also, some of our prospective users are interested in a categorization
>scheme that goes beyond topics, further categorizing documents across
>topics into a small set of genres such as *comments* and *news*. This
>seems to be a harder task, and we would be interested in work that pursued
>More in general, we would be grateful for any sort of advice/information
>that seems relevant (e.g., pointers to other text cat work on Italian,
>Thanks a lot!
>Silvia Bernardini, Marco Baroni & Alessandra Volpi
>SSLMIT, University of Bologna at Forli'
Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid
28670 - Villaviciosa de Odon - MADRID
Spanish law guarantees privacy in electronic communications. This
electronic transmission is strictly confidential and intended solely for
the addressee. If you are not the intended addressee, you are kindly
requested not to disclose nor to copy this transmission and to notify us as
soon as possible.
This archive was generated by hypermail 2b29 : Mon Jun 16 2003 - 10:54:58 MET DST