Corpora: PhD thesis

From: nicolas turenne (
Date: Thu Dec 21 2000 - 10:55:55 MET


    My PhD thesis, defended in Strasbourg (France) on
    24 November 2000 under the title:

    Statistical Learning from Texts for Concept
    Extraction from a Domain. Application to Textual
    Information Filtering.

    is available at the web URL:

    under the field "knowledge acquisition".
    (Please accept our apologies if you receive this
    message more than once.)

    The goal of this dissertation is to build an
    automatic, approximate representation of the
    meaning of a document. We try to adapt automatic
    indexing techniques to a non-indexed document
    base. Classical techniques are based on vector
    models: each document is represented by a set of
    features, and a distance is defined between
    documents. Access to relevant documents relies on
    estimating the similarity between their features.
    The domain described by the documents is
    structured into semantic fields by term
    clustering. These techniques can be improved by
    making them able to process non-indexed
    documents. Adapting linguistic knowledge and
    analysing the relations revealed by term
    co-occurrences should improve the results.
    The growing number of electronic documents leads
    to the storage of large, significant samples of
    reusable data. Techniques for describing relations
    between terms stem from mathematical methods
    usually applied to structured, non-textual data.
    Coupling specific knowledge about the data with a
    methodology adapted to textual data should improve
    classification results. We try to justify three
    things: first, taking linguistic phenomena into
    account so as to reduce the biases of descriptive
    statistics on term occurrences; second, using a
    method based on graph pattern extraction, which is
    intended to retrieve conceptual relations between
    terms; third, making the results of automatic
    processing easier to interpret through a consensus
    labelling of the theme represented by a class.
    Interpreting classes remains difficult because of
    the multiple points of view or links a user can
    imagine between terms. More accurate classes
    should facilitate interpretation, driven by a
    3-level thesaurus, which may be assigned to a
    conceptual structuring of the terms of a domain.
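
The co-occurrence and graph pattern steps described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the method of the thesis: the toy corpus, the `min_count` threshold and the choice of triangles (3-term cliques) as the graph pattern are all hypothetical, and documents are assumed to be already reduced to lists of terms.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how many documents each unordered term pair shares."""
    counts = Counter()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

def build_graph(counts, min_count=2):
    """Link two terms seen together in at least min_count documents."""
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def triangles(graph):
    """Return every 3-term clique (an assumed, very simple graph pattern)."""
    found = set()
    for a, neighbours in graph.items():
        for b, c in combinations(sorted(neighbours), 2):
            if c in graph.get(b, set()):
                found.add(tuple(sorted((a, b, c))))
    return sorted(found)

# Hypothetical toy corpus: each document is a list of extracted terms.
docs = [
    ["corpus", "term", "cluster"],
    ["corpus", "term", "cluster", "filter"],
    ["term", "cluster", "profile"],
    ["corpus", "cluster", "term"],
    ["filter", "profile"],
]
graph = build_graph(cooccurrence_counts(docs))
candidate_classes = triangles(graph)
print(candidate_classes)
```

Each triangle is a candidate class of mutually co-occurring terms. Raising `min_count` yields fewer but more strongly linked classes, at the cost of dropping rare terms.
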
    The widespread use of the Internet increases the
    exchange of electronic documents between users of
    different websites. The development of software
    systems that handle what is called "workflow" in
    intranets improves the flow of documents between
    people and services. A system that can
    automatically learn user profiles and exploit this
    knowledge to disseminate information becomes
    essential. We try to match a user's interests with
    classes of terms.
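
Matching a user's interests with classes of terms could look like the following Python sketch. It assumes, purely for illustration, that a profile is a bag of terms taken from documents the user found relevant, that each class is a labelled set of terms, and that cosine similarity (the classical vector model measure) is the matching score; the class labels and terms are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_classes(profile_terms, term_classes):
    """Rank labelled term classes by similarity to a user's term profile."""
    profile = Counter(profile_terms)  # term frequencies as weights
    scores = {
        label: cosine(profile, dict.fromkeys(terms, 1.0))
        for label, terms in term_classes.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical classes and profile, for illustration only.
term_classes = {
    "clustering": ["term", "cluster", "corpus"],
    "filtering": ["filter", "profile", "user"],
}
profile_terms = ["filter", "user", "filter", "corpus"]
ranking = rank_classes(profile_terms, term_classes)
best_label = ranking[0][0]
print(ranking)
```

New documents whose terms fall mostly in the top-ranked class would then be candidates for dissemination to that user.
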

    FIELD: Computer Science, Artificial Intelligence.

    KEYWORDS: Terminology, Artificial Intelligence,
    Corpus Processing, Lexicometry, Morphosyntactic
    Schemes, Graph Patterns, Semi-Automatic Extraction
    of Concepts, Term Clustering, Document Filtering,
    Automatic Learning, User Profile, Statistical Data
    Analysis, Information Retrieval.
