Structural Biology Programme

Biological Text Mining Unit

Head of Unit: Martín Krallinger

Research highlights

The Biological Text Mining Unit began its work in January 2017, with the aim of providing consultancy, guidance and technical assistance on cognitive computing and text-mining technologies applied to clinical and biomedical documents, particularly those relevant to precision medicine approaches. The Unit has focused on fostering collaborative efforts to address the needs of healthcare providers (including Hospital Virgen del Rocío and Hospital XII de Octubre), Spanish national health-related agencies (including the Spanish Medical Agency, the Spanish National Library of Health and the Spanish Royal Medical Academy) and academic research groups working on artificial intelligence and language technologies.

Clinical text mining is a strategic and innovative research area with considerable potential to drive the uptake of cognitive computing and big data text analytics in health.

To address the integration and discoverability of existing resources, we have examined the design requirements, annotation formats and interoperability standards of biomedical language technology infrastructures, and have constructed a registry of language processing components and a medical document repository. The Unit has coordinated community evaluation challenges (BioCreative and IberEval) with the aim of benchmarking existing tools, identifying cutting-edge computational strategies and enabling the development of statistical machine learning models by providing access to Gold Standard annotated training datasets through an evaluation platform. These community assessments included tasks on the technical evaluation of systems for the automatic recognition of biomedical named entity mentions in running text (genes, proteins, chemical compounds/drugs, cell lines, diseases, anatomical terms and mutations) and for the automatic extraction of drug-target associations. Moreover, we have applied text-mining methodologies to concrete use cases, including the implementation of semantic search engines for toxicology and for cancer, and the use of disease and gene concept recognition to assist in the discovery of inverse comorbidities between cancer and neurodegenerative diseases.