Dear Colleagues,
Those interested in a new-generation part-of-speech tagger are welcome
to send their comments to firstname.lastname@example.org, especially if
they have purchasing or collaboration intentions (or hints about such).
Thank you.
Best regards,
Vladimir V. Gojol
Senior Software Engineer
Institutul National de Informatica
Bucuresti, Romania
I created a part-of-speech tagger with an unusual capacity for dealing
with large contexts, especially for German. I used Negra (seemingly the
best-known German corpus, with a freely obtainable licence). The tagger
currently reputed to be the most accurate for German is perhaps TnT. It
reports an error rate of 3.4% on this corpus. But I have found a
systematic error in Negra: all occurrences of the auxiliary verbs are
tagged as auxiliary (VAFIN), though in 50% of the cases they function as
main verbs (VVFIN). I corrected a part of the corpus (ca. 40,000 tokens).
In this more correct environment (where the error rate of TnT should
probably be around 4.5%), my tagger gets 1.7%.
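As an illustration only (it is not part of the tagger), the error rates quoted in this message can be read as the share of tokens whose predicted tag differs from the corpus tag. A minimal sketch, assuming the gold and predicted tags are given as parallel lists; the function name `error_rate` and the toy sentence are hypothetical:

```python
def error_rate(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag differs from the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute an error rate over zero tokens")
    # Count positions where the two tag sequences disagree.
    errors = sum(1 for g, p in zip(gold_tags, predicted_tags) if g != p)
    return errors / len(gold_tags)


# Hypothetical toy sentence "Er hat ein Buch", with "hat" marked as a main
# verb (VVFIN) in the gold data, following the correction described above.
gold = ["PPER", "VVFIN", "ART", "NN"]
pred = ["PPER", "VAFIN", "ART", "NN"]
print(f"{error_rate(gold, pred):.1%}")  # prints 25.0%
```

With this definition, a change in how the corpus tags the auxiliary verbs changes the measured error rate of any tagger evaluated against it.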
On another German corpus (I call it X), with comparable contents
(newspaper articles) and tagset, but with an attached external lexicon
(i.e. not extracted from the corpus), the result is 2.4%.
I also used Susanne (the only English corpus I could get for free). The
reported result for TnT is 3.8%. Mine is 2.8%. On the "A" texts, which,
being journalistic, are the most comparable to those in Negra, it's 2.3%.
Initially I had used a Romanian corpus, with a result of 0.9% (compared
to 1.7%, 2.5% and 4.2% respectively obtained by the Xerox, Birmingham and
[...] taggers).
The speed is comparable to that of TnT and adjustable by a parameter,
in inverse proportion to the accuracy (but without affecting it much).
The incremental operating mode and the segmentation of the data
structures allow running on computers with very little memory.
There is the advantage of an intuitive output (no hostile binary
matrix), in a form analogous to the input of some expert systems.
Special facilities exist, such as virtual tags, or context
[...]tion (making it possible to obtain the minimal set of contexts
characteristic of a linguistic style, useful not only for maximum
accuracy and speed), etc.
All is built on two essentially new concepts: organicity and context
propagation. I haven't published anything about them, to preserve their
appeal. The accuracy, comparable to that of manual tagging, led me to
find errors in the corpora used: 98 in Negra, 36 in Susanne; Prof. G.
[...] replied gratefully, saying that it is the first time somebody has
reported more than 2 errors, and that my findings make necessary a new
version of [...]
The handling of very large contexts could even modify the current
design, by cancelling some unnatural decisions (motivated only by the
incapacity of the existing taggers to see beyond a 3-token neighborhood),
such as those concerning the auxiliary verbs, participles etc., thus
removing a burden from the subsequent stages of text processing.
It is written in C (Linux). Demos for German (Negra or X) and
English (Susanne) are available.
This archive was generated by hypermail 2b29 : Wed Feb 16 2000 - 17:30:05 MET