*** Apologies for multiple copies ***
ACL 2003 Workshop on Patent Corpus Processing
12 July 2003, Sapporo, Japan
CALL FOR PAPERS
The goal of this workshop is to foster research and development of the
technology for patent corpus processing, by providing a forum in which
researchers and practitioners can exchange and share their ideas,
approaches, perspectives, and experiences from their work in progress.
The processing of intellectual property (IP) documents, including
patents, is important in the scientific, business, and law
communities. Much of the focus for patent and IP processing has been
in the database and information retrieval communities, but not in the
computational linguistics (CL) and natural language processing (NLP)
communities.
In 2000, the first ACM SIGIR 2000 Workshop on Patent Retrieval was
held. At this workshop, patent retrieval systems in use at the EPO
(European Patent Office) and JAPIO (Japanese Patent Information
Organization) were introduced, and a number of issues related to
patent retrieval (e.g., producing ontologies, cross-language
retrieval, and evaluation methods) were raised and discussed.
In 2001-2002, the NTCIR workshop (organized by the National Institute
of Informatics, Japan), a TREC-style evaluation forum for research and
development on IR/NLP, ran a patent retrieval task for the first
time. Two years of Japanese patents (approximately 7M
documents published in 1998-1999; 18GB) were used to evaluate
mono/cross-lingual patent retrieval systems. In addition,
approximately 17M Japanese/English parallel patent abstracts were used
to evaluate the effectiveness of extracting translation lexicons.
Areas of Interest
Patent corpora have a number of interesting characteristics for which
various CL/NLP techniques show promise in improving the quality of
patent processing.
* multilinguality: the same/similar contents (i.e., inventions) are
filed in different languages, for which machine translation,
cross/multi-lingual retrieval, and translation extraction alleviate
problems in accessing information in foreign languages.
* scalability: a huge amount of corpus data is available and periodically
produced, for which text summarization and natural language generation
help produce understandable coherent condensed contents.
* complexity: since patents consist of overwhelmingly long sentences,
parsing/chunking techniques help produce readable shorter fragments.
* classification: patents are manually categorized based on a specific
classification system, such as IPC (international patent
classification), which can be used for statistical classification
* novelty/temporality/dynamism: new terms and concepts associated with
inventions are periodically created, for which term extraction and
ontology construction techniques help update lexical resources for
* document structures: unlike newspaper articles, patents are
structured with a number of specific fields (e.g., titles, abstracts,
and claims). While conventional text segmentation techniques rely
mainly on linguistic contents (e.g., lexical chains), structure
analysis techniques (e.g., ones related to XML) are also crucial in
the context of CL/NLP.
* applications: the above techniques can directly contribute to a
number of applications, such as patent retrieval systems.
We invite both research papers and project papers associated with, but
not limited to, the rudiments of patent corpus processing listed
above. We also invite papers addressing applications and user
Submission deadline: 10 April 2003
Acceptance notification: 12 May 2003
Final version deadline: 30 May 2003
Workshop date: 12 July 2003
Makoto Iwayama, Tokyo Institute of Technology / Hitachi Ltd., Japan
Atsushi Fujii, University of Tsukuba, Japan
Atsushi Fujii, email@example.com
University of Tsukuba, Japan