Term Identification Methods for Consumer Health Vocabulary Development
Qing T Zeng1, PhD; Tony Tse2, PhD; Guy Divita2,3, MS; Alla Keselman2,4, PhD; Jon Crowell1, MS; Allen C Browne2, MS; Sergey Goryachev1, MS; Long Ngo1, PhD
1Decision Systems Group, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA; 2LHNCBC, National Library of Medicine, NIH, DHHS, Bethesda, MD, USA; 3Management Systems Designers, Inc, Fairfax, VA, USA; 4Aquilent, Inc, Laurel, MD, USA
Summary
Background: The development of consumer health information applications such as health education websites has motivated research on consumer health vocabulary (CHV). Term identification is a critical task in vocabulary development. Because of the heterogeneity and ambiguity of consumer expressions, term identification for CHV is more challenging than for professional health vocabularies.
Objective: For the development of a CHV, we explored several term identification methods, including collaborative human review and automated term recognition methods.
Methods: A set of criteria was established to ensure consistency in the collaborative review, which analyzed 1893 strings. Using the results from the human review, we tested two automated methods: the C-value formula and a logistic regression model.
Results: The study identified 753 consumer terms and found the logistic regression model to be highly effective for CHV term identification (area under the receiver operating characteristic curve = 95.5%).
Conclusions: The collaborative human review and logistic regression methods were effective for identifying terms for CHV development.
Keywords
Consumer health information; vocabulary; natural language processing
DOI
http://dx.doi.org/10.2196/jmir.9.1.e4