Re: Corpora: protein name list

From: Philip Resnik
Date: Thu Nov 01 2001 - 17:46:06 MET

    > I am collecting protein name list for bioinformatics research.
    > Does anyone know of public protein name list?

    You might find GenBank useful.
    In particular, there is a protein database "compiled from a variety of
    sources, including SwissProt, PIR, PRF, PDB" (see the information on the
    GenBank site). Also, the SWISS-PROT database can be downloaded; see the
    "downloading" link on the SWISS-PROT site. Either of these
    should provide a source from which a protein name list could be
    extracted.
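
As a rough illustration of pulling a name list out of a SWISS-PROT download, here is a minimal Python sketch. It assumes the flat-file layout in which each entry ends with a `//` line and the protein description sits on lines starting with `DE`. The function names (`iter_entries`, `protein_name`, `extract_names`) and the two sample records are made up for illustration, and the rule for trimming parenthesized synonyms is a simplification that will mis-handle names that themselves contain parentheses.

```python
import re
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the lines of each entry; an entry is terminated by a '//' line."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a final entry with no terminator
        yield entry


def protein_name(entry: List[str]) -> str:
    """Return the main name found on an entry's DE (description) lines."""
    de = " ".join(l[5:].strip() for l in entry if l.startswith("DE   "))
    # Newer releases use structured descriptions: 'RecName: Full=...;'
    match = re.search(r"RecName:\s*Full=([^;{]+)", de)
    if match:
        return match.group(1).strip()
    # Older releases use free text: 'MAIN NAME (SYNONYM) (SYNONYM).'
    # Assumption: everything from the first ' (' onward is a synonym.
    return de.split(" (", 1)[0].rstrip(". ").strip()


def extract_names(lines: Iterable[str]) -> List[str]:
    """Collect one name per entry, skipping entries without a DE line."""
    names = [protein_name(e) for e in iter_entries(lines)]
    return [n for n in names if n]


# Hand-written sample records (not real database entries).
SAMPLE = """\
ID   EXAMPLE1_HUMAN          Reviewed;         100 AA.
DE   RecName: Full=Example protein one;
DE   Contains:
DE     RecName: Full=Example chain;
//
ID   EXAMPLE2_BOVIN STANDARD;      PRT;   200 AA.
DE   EXAMPLE PROTEIN TWO (SOME SYNONYM)
DE   (EP-2).
//
"""

if __name__ == "__main__":
    for name in extract_names(SAMPLE.splitlines()):
        print(name)
```

For a real download one would pass the open file object to `extract_names` instead of the sample string, and probably keep the synonyms as well, since they are often what appears in running text.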

    You might also be interested in a nice paper by Vasileios
    Hatzivassiloglou, Pablo A. Duboue and Andrey Rzhetsky: Disambiguating
    Proteins, Genes, and RNA in Text: A Machine Learning Approach, in
    Proceedings of the 9th International Conference on Intelligent Systems
    for Molecular Biology, Tivoli Gardens, Denmark, July 21--25, 2001.
    They apply supervised learning techniques to disambiguation of textual
    references, which you might find important since many appearances of
    items on a protein name list might actually be references to the
    related gene, etc. (I myself am exploring the use of named-entity
    tagging techniques for similar purposes.)
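
To make the disambiguation idea concrete, here is a toy Python sketch of supervised classification from context words. It is a plain naive Bayes classifier with add-one smoothing, chosen only because it is short; it is not a reconstruction of the method or features used in the paper, and the class name `ContextClassifier` and the four training examples are invented for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


class ContextClassifier:
    """Naive Bayes over the words surrounding an ambiguous name."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def train(self, examples: List[Tuple[List[str], str]]) -> None:
        """Count context words per label from (words, label) pairs."""
        for words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, words: List[str]) -> Optional[str]:
        """Return the label with the highest log-probability, or None if untrained."""
        total = sum(self.label_counts.values())
        best_label: Optional[str] = None
        best_score = -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            # Add-one smoothing so unseen words do not zero out a label.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training contexts; real work would need annotated text.
TRAINING = [
    (["expression", "of", "the", "gene"], "gene"),
    (["mutations", "in", "the", "locus"], "gene"),
    (["binds", "to", "the", "receptor"], "protein"),
    (["phosphorylation", "of", "the", "kinase"], "protein"),
]

clf = ContextClassifier()
clf.train(TRAINING)

if __name__ == "__main__":
    print(clf.classify(["binds", "receptor"]))
    print(clf.classify(["expression", "gene"]))
```

In practice the context window, the features, and the learner all matter a good deal, so this should be read only as the shape of the approach.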

    I hope this helps. I'd be grateful if you'd post or forward any
    useful replies you receive!



      Philip Resnik, Assistant Professor
      Department of Linguistics and Institute for Advanced Computer Studies

      1401 Marie Mount Hall         UMIACS phone: (301) 405-6760
      University of Maryland        Linguistics phone: (301) 405-8903
      College Park, MD 20742 USA    Fax: (301) 314-2644 / (301) 405-7104
