Construction of a Full Text Corpus for Biomedical Text Mining

Overview

abstract

There is a demonstrated community need for an annotated corpus consisting of the full texts of biomedical journal articles. There are many reasons to believe that the rate-limiting factor impeding progress in biomedical language processing today is the lack of the right kind of expertly annotated data. An annotated corpus is a collection of texts in which information about meaning or structure is associated with particular textual elements. Annotated corpora are a critical component of biomedical natural language processing (NLP) research in two ways. First, most contemporary approaches to language processing rely at least in part on machine learning or statistical models. Such systems must be "trained" on sets of examples with known outputs, so annotated corpora provide the training data vital to the construction of modern NLP systems. Second, annotated corpora provide the gold standard against which approaches to particular text mining tasks are evaluated. Because of these central roles in training and testing language processing systems, the quality of the design and construction of annotated corpora places fundamental limits on what can be accomplished with such systems. Although valuable work has been done on annotating abstracts, there are important differences between abstracts and full-text articles from a text mining perspective, and annotation of full-text journal articles has been negligible. Workers in both the biological community (especially model organism database curation) and the text mining community have independently pointed out the importance of processing the full text of scientific publications if the biomedical world is to fully utilize text mining. We propose to build a large, fully annotated corpus consisting of full texts of biomedical journal articles.
Additionally, previous biomedical corpus annotation efforts have often used ad hoc ontologies, which has limited the utility of the resulting corpora outside of the groups that created them. We will ensure community acceptability by annotating with respect to community-consensus ontologies such as the Gene Ontology and the UMLS. Because annotation involves expensive human labor, efficiency is a key issue in creating corpora. For this reason, we propose to build a team that includes the builder of the largest semantically annotated corpus to date, one of the pioneers of the model organism databases, and an already-assembled cadre of experienced linguistic and domain-expert annotators.

sponsor award id

G08LM009639

Time

start date

2007-09-15

end date

2010-09-14