A Hybrid Language Framework for Ontology-Based Clinical Concept Extraction

Document Type

Article

Publication Date

2-26-2026

Publication Title

Journal of Healthcare Informatics Research

Volume

10

Issue

2

Pages

299-316

Publisher Name

Springer Nature Switzerland AG

Publisher Location

Cham, Switzerland

Abstract

This study presents a hybrid ontology-based framework for clinical concept extraction from narrative EHR discharge summaries using large language models (LLMs) and standardized biomedical terminologies. The framework integrates multiple NLP components in sequence: SparkNLP for chunk detection and named entity recognition (NER), SentenceBERT embeddings for semantic similarity candidate generation, zero-shot inference with LLaMA3-8B and Mistral-7B for concept selection, and UMLS REST API normalization to CUIs and SNOMED CT terms. This coordinated integration of linguistic, semantic, and ontological modules forms a flexible architecture rather than a single-model comparison. We applied the framework to ten MIMIC-III discharge summaries spanning Chief Complaint, Brief Hospital Course, and History of Present Illness sections. Clinicians labeled extracted concepts as correct, partial, incorrect, missing, or spurious to assess model performance. LLaMA3-8B achieved the highest F1 score (0.77) and lowest false positive rate (3.04%), outperforming both Mistral-7B and cTAKES. While cTAKES demonstrated high precision, it had low recall and a significantly higher FPR (29.95%), indicating frequent misclassification. Mistral-7B offered faster processing for shorter notes, while LLaMA3-8B delivered higher accuracy for more detailed sections.

LLMs outperformed traditional rule-based systems by handling context, modifiers, abbreviations, and multi-word expressions more effectively. Prompt refinement and semantic-similarity embeddings improved extraction quality. SparkNLP supported chunking but introduced errors related to spacing and abbreviation handling. We present a flexible, context-aware framework for clinical concept extraction using LLMs that offers key advantages over rule-based tools. Future work should incorporate full ontology mapping, integrate assertion detection, and validate performance across diverse clinical datasets and domain-adapted LLMs.
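The abstract reports precision, recall, and F1 derived from clinician labels (correct, partial, incorrect, missing, spurious) but does not state the scoring formula. The following is a minimal sketch of one common convention for such five-way labels (MUC-style scoring with half credit for partial matches). The class name `LabelCounts`, the function `score`, the `partial_weight` parameter, and the example counts are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class LabelCounts:
    """Counts of clinician labels for one system (hypothetical container)."""
    correct: int
    partial: int
    incorrect: int
    missing: int
    spurious: int


def score(counts: LabelCounts, partial_weight: float = 0.5) -> dict:
    """Compute precision, recall, and F1 from five-way label counts.

    Assumption: partial matches earn `partial_weight` credit (0.5 here);
    the article may have used a different convention.
    """
    # Credit earned by the system's extractions.
    credit = counts.correct + partial_weight * counts.partial
    # Everything the system produced (denominator for precision).
    predicted = counts.correct + counts.partial + counts.incorrect + counts.spurious
    # Everything the reference contains (denominator for recall).
    expected = counts.correct + counts.partial + counts.incorrect + counts.missing

    precision = credit / predicted if predicted else 0.0
    recall = credit / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the study.
    example = LabelCounts(correct=6, partial=2, incorrect=1, missing=3, spurious=1)
    print(score(example))
```

Setting `partial_weight` to 0.0 or 1.0 gives strict or lenient scoring respectively, which is one way to see how sensitive a reported F1 is to the treatment of partial matches.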

Comments

Author Posting © The Author(s), 2026. This article is posted here by permission of Springer Nature Switzerland AG for personal use and non-commercial redistribution. This article was published open access in Journal of Healthcare Informatics Research, Vol. 10, Iss. 2 (February 2026), https://doi.org/10.1007/s41666-026-00232-0.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
