Structural Biology Programme

Biological Text Mining Unit

Head of Unit: Martín Krallinger
Research highlights

Over the last decades, cancer research has produced an enormous wealth of data. Several large-scale research initiatives have generated vast amounts of well-characterized, structured -omics data that are readily accessible through public data portals. Nevertheless, only a rather small portion of basic research output in oncology ends up in large data repositories. Most results are published as unstructured natural language text, such as the scientific literature or clinical documents. Traditional search engines do not allow key information generated by basic research, e.g. sequence variant characterizations, to be linked directly to patient-derived information contained in clinical documents. Understanding the molecular bases of disease and identifying new treatments demand better integration of information across unstructured document repositories, as well as its combination with annotated, structured biomedical databases. Text mining and natural language processing technologies, which are applied to all kinds of research topics and to the biomedical field in particular, are generating promising results in overcoming the problems of information access, interpretation and exposure. Moreover, predictive text mining components underpin knowledge discovery and hypothesis generation pipelines, as well as health-related semantic search engines.

We apply state-of-the-art information extraction tools and machine learning approaches to mine genomic and medical information from heterogeneous types of documents. This work relies on the development of critical pre-processing and document harmonization modules; the identification and analysis of statistical features for the recognition of biomedical entities and relations; and the creation of annotated resources or Gold Standard data (corpora) for the generation and evaluation of statistical models and text mining components. For the construction of Gold Standard corpora, we are developing annotation process standards, guidelines and annotated text resources that serve, in the context of community assessments including the BioCreative challenges, as an infrastructure to promote the development and comparative evaluation of biomedical text mining tools.

At the technical level, interoperability standards constitute an often-neglected key aspect of the efficient construction and adaptation of text mining pipelines tuned to particular biomedical application domains. In cooperation with the OpenMinTeD project, and in line with other projects of the PITL, we are examining the development of interoperability specifications that can improve the re-use and construction of text mining workflows incorporating resources from multiple language processing platforms.

Our unit is working on the development and specification of a prototype for phenotypic electronic health records that integrates a genomic analysis system adapted to particular cancers.