Colorado PROFILES, The Colorado Clinical and Translational Sciences Institute (CCTSI)
Keywords
Last Name
Institution

Contact Us
If you have any questions or feedback please contact us.

Construction of a Full Text Corpus for Biomedical Text Mining


Collapse Biography 

Collapse Overview 
Collapse abstract
There is a demonstrated community need for an annotated corpus consisting of the full texts of biomedical journal articles. There are many reasons to believe that the rate-limiting factor impeding progress in biomedical language processing today is the lack of availability of the right kind of expertly annotated data. An annotated corpus is a collection of texts with information about the meaning or structure associated with particular textual elements. Annotated corpora are a critical component of biomedical natural language processing research in two ways. First, most contemporary approaches to language processing rely at least in part on machine learning or statistical models. Such systems must be "trained" on sets of examples with known outputs, so annotated corpora provide the training data vital to the construction of modern NLP systems. Second, annotated corpora provide the gold standard by which various approaches to particular text mining tasks are evaluated. Due to their central roles in training and testing language processing systems, the quality of the design and operational creation of annotated corpora place fundamental limits on what can be accomplished with such systems. Although there has been valuable work done on annotating abstracts, there are important differences between abstracts and full-text articles from a text mining perspective, and annotation of full-text journal articles has been negligible. Workers in both the biological (especially model organism database curation) community and the text mining community have independently pointed out the importance of processing the full text of scientific publications if the biomedical world is to be able to fully utilize text mining. We propose to build a large, fully annotated corpus consisting of full texts of biomedical journal articles. Additionally, previous biomedical corpus annotation efforts have often utilized ad hoc ontologies that have limited their utility outside of the groups that created them. We will ensure community acceptability by annotating with respect to community-consensus ontologies such as the Gene Ontology and the UMLS. Since the task involves expensive human labor, efficiency is a key issue in creating corpora. For this reason, we propose to build a team that includes the builder of the largest semantically annotated corpus to date, one of the pioneers of the model organism databases, and an already-assembled cadre of experienced linguistic and domain-expert annotators.


Collapse sponsor award id
G08LM009639

Collapse Time 
Collapse start date
2007-09-15
Collapse end date
2010-09-14

Copyright © 2024 The Regents of the University of Colorado, a body corporate. All rights reserved. (Harvard PROFILES RNS software version: 2.11.1)