A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models
Contribution in conference proceedings
Publication date:
2026
Citation:
(2026). A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models. Retrieved from https://hdl.handle.net/10446/322945
Abstract:
Unstructured clinical narratives are a major source of phenotypic evidence for rare-disease diagnosis and genomic
variant interpretation. However, their free-text nature, often multilingual, heterogeneous in format,
and inconsistent in terminology, makes automated phenotype extraction and interoperability with downstream
genomic pipelines difficult. This creates a practical bottleneck for scalable and reproducible phenotype curation
in medical genetics, where manual review is time-consuming and prone to variability. To address this
problem, we propose a robust, open-source, and fully local pipeline for automatically extracting and standardizing
patient phenotypes from medical reports while preserving data privacy. The pipeline integrates: (i)
OCR-based digitization and an LLM-based translation module to produce an English version of the report; (ii)
a GPT-oss–based phenotype extractor using structured, few-shot prompting to identify phenotypes relevant
to the index patient; and (iii) a fuzzy standardization stage that combines lexical similarity with embedding-based
semantic matching to map extracted phenotypes to Human Phenotype Ontology (HPO) concepts. Our
multi-stage design improves robustness to real-world documentation issues, including multilingual acronyms,
variable report structure, spelling errors, and synonym variability, and it ensures privacy compliance by keeping
all computation on local infrastructure. We demonstrate the pipeline end-to-end on a representative clinical
report, showing that it extracts patient-relevant phenotypes and produces HPO-aligned, machine-readable
outputs suitable for downstream genomic analyses. This work provides a practical foundation for privacy-preserving,
scalable phenotype curation in clinical genetics and supports future integration and evaluation on
larger clinical datasets.
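
As an illustration of stage (iii), the following is a minimal sketch of how a lexical score and a semantic score might be blended to map an extracted phrase to an HPO concept. It is not the authors' implementation. The function names, the weight `alpha`, the `threshold`, and the three-term HPO subset are all assumptions made for this example, and a character-trigram cosine stands in for the embedding-based similarity so that the sketch runs with the Python standard library only.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Tiny illustrative subset of HPO labels; a real pipeline would load the full
# ontology, including synonyms. This subset is an assumption for the example.
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
}


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def char_trigrams(text: str) -> Counter:
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def embedding_similarity_stub(a: str, b: str) -> float:
    """Cosine similarity of trigram counts.

    Hypothetical stand-in for a sentence-embedding model: it captures
    spelling overlap only, not true synonymy.
    """
    va, vb = char_trigrams(a), char_trigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_hpo(phrase: str, alpha: float = 0.5, threshold: float = 0.5):
    """Return (hpo_id, score) for the best match, or (None, score) if the
    blended score stays below the threshold."""
    best_id, best_score = None, 0.0
    for hpo_id, label in HPO_TERMS.items():
        score = (alpha * lexical_similarity(phrase, label)
                 + (1 - alpha) * embedding_similarity_stub(phrase, label))
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    for phrase in ["seizures", "microcefaly", "xyz"]:
        print(phrase, "->", map_to_hpo(phrase))
```

In this sketch the blended score tolerates a plural form or a spelling error, while the threshold rejects phrases that resemble no term. Replacing `embedding_similarity_stub` with a locally hosted embedding model would be needed to match true synonyms, which trigram overlap cannot capture.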
CRIS type:
1.4.01 Contributions in conference proceedings - Conference presentations
Author list:
Bombarda, Andrea; Saletta, Martina; Bellini, Matteo; Goisis, Lucrezia; Iascone, Maria; Cazzaniga, Paolo; Savo, Domenico Fabio
Book title:
Proceedings of the 19th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF