Biological Text Mining


Swiss Institute of Bioinformatics

CMU, 1 rue Michel Servet, CH-1211 Geneva, Switzerland




Abstract

Genome research has spawned unprecedented volumes of data, but characterisation of DNA and protein sequences has not kept pace with the rate of data acquisition. For anyone trying to learn more about a given sequence, the worldwide collection of abstracts and papers remains the ultimate information source. The goal of the BioMinT project is to develop a generic text mining tool that:

The tool will thus provide two essential research support services: (1) Curator's assistant: accelerate, by partially automating, the annotation and update of bio-databases; and (2) Researcher's assistant: generate readable reports in response to queries from biological researchers.






The Swiss Institute of Bioinformatics as partner of the BioMinT project

As a user of the annotation assistant prototype that will be developed, the SIB is in particular involved in:


SIB will exploit the resulting BioMinT prototype internally. This tool will add value to a protein sequence by including in the Swiss-Prot database descriptive information extracted from the scientific literature as well as from other web knowledge sources. In particular, BioMinT is expected to increase the relevance and completeness of the added information and to speed up the annotation procedure in order to handle the volume of new incoming sequences.





Project partners

SIB Swiss Institute of Bioinformatics, Switzerland (project page)
ÖFAI, the Austrian Research Institute for Artificial Intelligence, Austria (project page)
UNIGE, University of Geneva, Switzerland (project page)
UMAN, The University of Manchester, School of Biological Sciences, UK (project page)
PharmaDM, Belgium (project page)
CNTS Universiteit Antwerpen/Universitaire Instelling Antwerpen, Belgium



Corpora produced by SIB

Within the BioMinT project, the Swiss Institute of Bioinformatics is responsible for supplying the benchmark environment for training and evaluating the Information Retrieval (IR) and Information Extraction (IE) components of the BioMinT text-mining tool (deliverables 1.3, 1.4). For this purpose, sets of sentences were extracted from Medline abstracts for different topics (a topic being a type of information used to fill a Swiss-Prot entry). For each topic, a list of abstracts was drawn using the RP line, which describes the work carried out by the authors of the reference cited in a Swiss-Prot entry. Sentences considered to contain relevant information with respect to that topic were extracted manually: two persons were involved in this task, and they reached mutual agreement on the criteria for assessing sentence relevance. In general, the extracted sentences contain the information used by Swiss-Prot curators, and when several sentences describe the same information (e.g. in the title and in the abstract body), all were extracted. The format is the following: each line contains one sentence, preceded by the Pubmed_ID of the abstract. Sentences with double quotes are from the abstract title.
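The line format described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the text does not say which separator stands between the Pubmed_ID and the sentence, so whitespace (a tab or spaces) is assumed here, and "sentences with double quotes" is taken to mean sentences enclosed in double quotes. The names CorpusSentence, parse_line and parse_corpus are hypothetical and not part of any BioMinT deliverable.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class CorpusSentence(NamedTuple):
    pubmed_id: str      # identifier of the Medline abstract
    text: str           # sentence text, without enclosing quotes
    from_title: bool    # True if the sentence was quoted (abstract title)


# Assumption: a numeric Pubmed_ID, then whitespace, then the sentence.
LINE_RE = re.compile(r"^\s*(\d+)\s+(.*\S)\s*$")


def parse_line(line: str) -> Optional[CorpusSentence]:
    """Parse one corpus line; return None for blank or malformed lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pubmed_id, text = match.groups()
    # Assumption: title sentences are enclosed in double quotes.
    from_title = len(text) >= 2 and text.startswith('"') and text.endswith('"')
    if from_title:
        text = text[1:-1]
    return CorpusSentence(pubmed_id, text, from_title)


def parse_corpus(lines: Iterable[str]) -> List[CorpusSentence]:
    """Parse all lines of a corpus file, skipping lines that do not match."""
    return [s for s in map(parse_line, lines) if s is not None]
```

Applied to a file opened with open(path, encoding="utf-8"), parse_corpus returns one record per sentence, so title and body sentences of the same abstract can be grouped by pubmed_id. If the actual files use a different separator or quoting style, only LINE_RE and the quote check need to change.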





TOPIC                                   # PUBMED_IDS            # EXTRACTED SENTENCES
                                        (derived from RP line
                                        information)

ALTERNATIVE PRODUCTS
  Alternative initiation                40                      64
  Alternative promoter                  10                      48
  Alternative splicing                  744                     993
  Undetermined                          4                       642
BINDING                                                         NA
  Metal                                 80
  Binding (other)                       327
  Calcium                               56
  DNA binding                           102
  NP-binding                            22
BOND
  Disulfid                              514                     600
  Crosslink                             28
  Undetermined                          6
ENZYME REGULATION                       103                     NA
DEVELOPMENTAL STAGE                     711                     1253
DISEASE                                 269                     NA
FUNCTION                                                        NA
  Function                              4382
  Enzyme                                38
  Pathway                               31
  Cofactor                              103
INDUCTION                               416                     794
MASS SPECTROMETRY                       739                     NA
METHOD 3D                               6923                    NA
MUTAGEN                                 2355                    NA
POST-TRANSLATIONAL MODIFICATIONS (PTM)
  Carbohydrate                          467                     439
  Phosphorylation                       907                     1283
  Acetylation                           97
  Amidation                             61
  Hydroxylation                         16
  Methylation                           47
  Pyrrolidone carboxylic acid           13
  Sulfation                             40
  Myristate                             44
  GPI anchor                            32
  Undetermined                          48
REGION                                                          NA
  Domain                                333
  Transmembrane                         6
  Zinc finger                           12
  Repeat                                32
  Active site                           229
  Other sites                           71
RNA EDITING                             113                     NA
SIMILARITY                              123                     NA
SUBCELLULAR LOCALISATION                1835                    2263
SUBUNIT                                 572                     697
TISSUE SPECIFICITY                      2623                    3683
VARIANT                                                         NA
  Variant                               7014
  Polymorphism                          52
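
For the single-category topics that report both counts, the figures above can be turned into a rough density measure: the average number of extracted sentences per PubMed ID. The short sketch below only does that arithmetic on the counts listed in the table. It assumes each PubMed ID corresponds to one abstract, and the name sentences_per_abstract is hypothetical.

```python
# (number of PubMed IDs, number of extracted sentences), copied from the table.
counts = {
    "Developmental stage": (711, 1253),
    "Induction": (416, 794),
    "Subcellular localisation": (1835, 2263),
    "Subunit": (572, 697),
    "Tissue specificity": (2623, 3683),
}


def sentences_per_abstract(topic_counts):
    """Return the mean number of extracted sentences per PubMed ID, by topic."""
    return {
        topic: n_sentences / n_ids
        for topic, (n_ids, n_sentences) in topic_counts.items()
    }


if __name__ == "__main__":
    for topic, ratio in sentences_per_abstract(counts).items():
        print(f"{topic}: {ratio:.2f} sentences per PubMed ID")
```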