A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models
Contribution in conference proceedings
Publication date:
2026
Citation:
(2026). A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models. Retrieved from https://hdl.handle.net/10446/322945
Abstract:
Unstructured clinical narratives are a major source of phenotypic evidence for rare-disease diagnosis and genomic variant interpretation. However, their free-text nature, often multilingual, heterogeneous in format, and inconsistent in terminology, makes automated phenotype extraction and interoperability with downstream genomic pipelines difficult. This creates a practical bottleneck for scalable and reproducible phenotype curation in medical genetics, where manual review is time-consuming and prone to variability. To address this problem, we propose a robust, open-source, and fully local pipeline for automatically extracting and standardizing patient phenotypes from medical reports while preserving data privacy. The pipeline integrates: (i) OCR-based digitization and an LLM-based translation module to produce an English version of the report; (ii) a GPT-oss–based phenotype extractor using structured, few-shot prompting to identify phenotypes relevant to the index patient; and (iii) a fuzzy standardization stage that combines lexical similarity with embedding-based semantic matching to map extracted phenotypes to Human Phenotype Ontology (HPO) concepts. Our multi-stage design improves robustness to real-world documentation issues, including multilingual acronyms, variable report structure, spelling errors, and synonym variability, and it ensures privacy compliance by keeping all computation on local infrastructure. We demonstrate the pipeline end-to-end on a representative clinical report, showing that it extracts patient-relevant phenotypes and produces HPO-aligned, machine-readable outputs suitable for downstream genomic analyses. This work provides a practical foundation for privacy-preserving, scalable phenotype curation in clinical genetics and supports future integration and evaluation on larger clinical datasets.
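The standardization stage in (iii) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the lexical and semantic scores are combined, so the weight `alpha`, the cut-off `threshold`, and the function names are assumptions. A character-trigram vector stands in for a real embedding model so that the example runs without external dependencies, and the four HPO terms are a toy subset of the ontology.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Toy subset of HPO terms for illustration; a real run would load the full ontology.
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
    "HP:0001249": "Intellectual disability",
}


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trigram_vector(text: str) -> Counter:
    """Character-trigram counts: an assumed stand-in for a sentence embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def map_to_hpo(phrase: str, alpha: float = 0.5, threshold: float = 0.6):
    """Return (hpo_id, score) for the best match, or (None, score) below threshold.

    alpha weights the lexical score against the vector score; both alpha and
    threshold are hypothetical values chosen for this sketch.
    """
    best_id, best_score = None, 0.0
    query_vec = trigram_vector(phrase)
    for hpo_id, label in HPO_TERMS.items():
        score = alpha * lexical_similarity(phrase, label) + (1 - alpha) * cosine(
            query_vec, trigram_vector(label)
        )
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    # A misspelled phrase, a plural form, and an unrelated string.
    for phrase in ["microcefaly", "seizures", "xyz"]:
        print(phrase, "->", map_to_hpo(phrase))
```

In this sketch the misspelled "microcefaly" still resolves to the Microcephaly concept because both scores tolerate small character-level differences. True synonym variability would need a real embedding model in place of `trigram_vector`.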
CRIS type:
1.4.01 Contributions in conference proceedings - Conference presentations
Authors:
Bombarda, Andrea; Saletta, Martina; Bellini, Matteo; Goisis, Lucrezia; Iascone, Maria; Cazzaniga, Paolo; Savo, Domenico Fabio
Book title:
Proceedings of the 19th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF