Biological Text Mining
Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211 Geneva, Switzerland
Link to corpora produced by SIB
Genome research has spawned unprecedented volumes of data, but the characterisation of DNA and protein sequences has not kept pace with the rate of data acquisition. For anyone trying to learn more about a given sequence, the worldwide collection of abstracts and papers remains the ultimate information source. The goal of the BioMinT project is to develop a generic text mining tool that:
interprets different types of queries
retrieves relevant documents from the biological literature
extracts the required information
outputs the result as a database slot filler or as a structured report
The tool will thus provide two essential research support services: (1) Curator's assistant: accelerate, by partially automating, the annotation and update of bio-databases; and (2) Researcher's assistant: generate readable reports in response to queries from biological researchers.
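The four steps listed above (query interpretation, document retrieval, information extraction, and output) can be pictured as a simple pipeline. The sketch below is only an illustration of that flow under strong simplifying assumptions: it is not BioMinT's implementation, and the corpus, the function names (`interpret_query`, `retrieve`, `extract`, `report`), the keyword matching, and the single regular expression are all hypothetical stand-ins for much richer components.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


# Toy corpus with invented sentences (hypothetical, for illustration only).
CORPUS = [
    Document("d1", "Protein ABC1 is localized in the nucleus."),
    Document("d2", "We describe a new sequencing protocol."),
    Document("d3", "Protein XYZ2 is localized in the mitochondrion."),
]

# One hand-written pattern standing in for a real extraction module.
LOCALIZATION = re.compile(r"Protein (\w+) is localized in the (\w+)")


def tokenize(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def interpret_query(query):
    """Step 1: reduce a free-text query to a set of keywords."""
    return tokenize(query)


def retrieve(keywords, corpus):
    """Step 2: keep the documents that contain every keyword."""
    return [doc for doc in corpus if keywords <= tokenize(doc.text)]


def extract(docs):
    """Step 3: pull (protein, location) facts out of the retrieved documents."""
    facts = []
    for doc in docs:
        match = LOCALIZATION.search(doc.text)
        if match:
            facts.append(
                {"source": doc.doc_id, "protein": match.group(1), "location": match.group(2)}
            )
    return facts


def report(facts):
    """Step 4: render the extracted facts as a short plain-text report."""
    return "\n".join(
        f"{fact['protein']}: {fact['location']} (from {fact['source']})" for fact in facts
    )


facts = extract(retrieve(interpret_query("protein localized"), CORPUS))
print(report(facts))
```

In this toy setting the list of dictionaries returned by `extract` plays the role of database slot fillers, while `report` produces the readable form intended for researchers.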
As a user of the annotation assistant prototype to be developed, SIB is involved in particular in:
defining program specifications: it will work actively on user requirement analyses, data corpus production, and domain-specific knowledge collection
furnishing sets of document corpora for each stage of tool development and validation
exploring domain-specific knowledge resources that could be exploited in the development
testing the preliminary prototype: evaluation will focus on the performance (precision and recall) of the IR/IE modules as well as on the efficiency and usability of the overall system. Users will perform test runs and provide critical feedback for revising and improving the prototype before its integration into the final prototype
providing user requirements, corpora, and validation tests for the update module, and implementing a module for the systematic update of annotations in Swiss-Prot database entries
participating in the validation phases and test runs of the generic prototype, especially the final validation test, by providing direct feedback from database curators.
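The prototype evaluation mentioned above relies on precision and recall. As a reminder of how these two measures are computed for a retrieval module, here is a minimal sketch using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the document identifiers are hypothetical and are not taken from the BioMinT evaluation protocol.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against a gold-standard set."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: what a module returned vs. what curators judged relevant.
retrieved_docs = ["doc1", "doc2", "doc3", "doc4"]
relevant_docs = ["doc2", "doc3", "doc5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved documents are relevant (precision 0.50), and two of the three relevant documents were found (recall about 0.67).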
SIB will exploit the resulting BioMinT prototype internally. The tool will add value to protein sequences by including in the Swiss-Prot database descriptive information extracted from the scientific literature and from other web knowledge sources. In particular, BioMinT is expected to increase the relevance and completeness of the added information and to speed up the annotation procedure so that it can handle the volume of incoming sequences.