Indeed, the problem is common to so many NLP applications that it
seems reasonable to expect that someone out there must already have
worked on it and packaged up the result as a code 'module', to save
others the trouble. Yet if you examine the code of a great many NLP
applications, you find that people typically develop their own
solution each time.
So, to start the ball rolling, here's a Perl regular expression
(in extended /x syntax, so whitespace and comments are ignored)
for detecting sentences, suggested by one of my colleagues:

    .+?          # match (non-greedy) anything ...
    [.!?]        # ... followed by any one of !?.
    [")]?        # ... and optionally " or )
    (?=          # with lookahead that it is followed by ...
      (?:        # either ...
        \s+      # some whitespace ...
        ["(]?    # maybe a " or ( ...
        [A-Z]    # and capital letter
      |          # or ...
        \s*$     # optional whitespace, followed by end of string
      )          # end of alternatives
    )            # end of lookahead

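For anyone who wants to try the pattern outside Perl, here is a minimal sketch of the same expression transcribed into Python's re module. The function name split_sentences, the sample text, and the choice of the DOTALL flag (so that "." can span line breaks) are my own assumptions for illustration and are not part of the original suggestion.

```python
import re

# Transcription of the Perl pattern into Python's verbose regex syntax.
# Assumption: DOTALL is used so that a sentence may span line breaks.
SENTENCE_RE = re.compile(
    r"""
    .+?          # match (non-greedy) anything ...
    [.!?]        # ... followed by any one of . ! ?
    [")]?        # ... and optionally " or )
    (?=          # with lookahead that it is followed by ...
      (?:        # either ...
        \s+      # some whitespace ...
        ["(]?    # maybe a " or ( ...
        [A-Z]    # and a capital letter
      |          # or ...
        \s*$     # optional whitespace, followed by end of string
      )
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def split_sentences(text):
    """Return the successive matches of SENTENCE_RE, stripped of outer whitespace."""
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]


sample = 'Hello world. This is a test! "Is it?" Yes.'
print(split_sentences(sample))
# ['Hello world.', 'This is a test!', '"Is it?"', 'Yes.']
```

One thing the sketch makes easy to see is how the pattern treats abbreviations: a string such as "Dr. Smith arrived. He sat." comes back as three pieces, because "Dr." is followed by whitespace and a capital letter. Any trailing text without a final terminator is simply not matched.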
Can anyone suggest a better algorithm/solution? It doesn't have to be
in Perl or any other particular language: pseudocode will do fine.
Also, does anyone know of any established test sets for evaluating
such algorithms? If people want to reply directly to me then I'll
summarise to the list.
(NB - I plan also to submit this question to a Perl mailing list, but
right now the experiences of the corpora community are of greater
interest to me.)

Dr TG Rose
Speech and Language Group
Canon Research Centre Europe Ltd
Occam Road, Surrey Research Park, Guildford, Surrey, UK GU2 5YJ
email: email@example.com  tel: +44 1483 448807  fax: +44 1483 448845