A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models

Contribution in conference proceedings
Publication date:
2026
Citation:
(2026). A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models. Retrieved from https://hdl.handle.net/10446/322945
Abstract:
Unstructured clinical narratives are a major source of phenotypic evidence for rare-disease diagnosis and genomic variant interpretation. However, their free-text nature, often multilingual, heterogeneous in format, and inconsistent in terminology, makes automated phenotype extraction and interoperability with downstream genomic pipelines difficult. This creates a practical bottleneck for scalable and reproducible phenotype curation in medical genetics, where manual review is time-consuming and prone to variability. To address this problem, we propose a robust, open-source, and fully local pipeline for automatically extracting and standardizing patient phenotypes from medical reports while preserving data privacy. The pipeline integrates: (i) OCR-based digitization and an LLM-based translation module to produce an English version of the report; (ii) a GPT-oss–based phenotype extractor using structured, few-shot prompting to identify phenotypes relevant to the index patient; and (iii) a fuzzy standardization stage that combines lexical similarity with embedding-based semantic matching to map extracted phenotypes to Human Phenotype Ontology (HPO) concepts. Our multi-stage design improves robustness to real-world documentation issues, including multilingual acronyms, variable report structure, spelling errors, and synonym variability, and it ensures privacy compliance by keeping all computation on local infrastructure. We demonstrate the pipeline end-to-end on a representative clinical report, showing that it extracts patient-relevant phenotypes and produces HPO-aligned, machine-readable outputs suitable for downstream genomic analyses. This work provides a practical foundation for privacy-preserving, scalable phenotype curation in clinical genetics and supports future integration and evaluation on larger clinical datasets.
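As a rough illustration of stage (iii), the sketch below blends a lexical similarity score with a vector-based similarity score to map a free-text phenotype to an HPO label. It is not the authors' implementation. The function names, the `alpha` weight, the `threshold` value, and the four-term `HPO_TERMS` excerpt are all hypothetical, and the character-trigram vectors are only a dependency-free stand-in for the embedding model a real pipeline would use.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Tiny illustrative excerpt of HPO labels (assumption: a real run would
# load the full ontology, including synonyms).
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
    "HP:0000365": "Hearing impairment",
}


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trigram_vector(text: str) -> Counter:
    """Character-trigram counts; a stand-in for a learned embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def standardize(phenotype: str, alpha: float = 0.5, threshold: float = 0.6):
    """Return (hpo_id, score) for the best match, or (None, score) if too weak."""
    best_id, best_score = None, 0.0
    query_vec = trigram_vector(phenotype)
    for hpo_id, label in HPO_TERMS.items():
        score = (alpha * lexical_similarity(phenotype, label)
                 + (1 - alpha) * cosine(query_vec, trigram_vector(label)))
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    for raw in ["microcefaly", "seizures", "xyzzy"]:
        print(raw, "->", standardize(raw))
```

Blending the two scores lets a misspelling such as "microcefaly" still reach its intended label, while the threshold keeps unrelated strings unmapped. Swapping `trigram_vector` for a locally hosted sentence-embedding model would bring the sketch closer to the semantic matching the abstract describes.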
CRIS type:
1.4.01 Contributions in conference proceedings - Conference presentations
Authors:
Bombarda, Andrea; Saletta, Martina; Bellini, Matteo; Goisis, Lucrezia; Iascone, Maria; Cazzaniga, Paolo; Savo, Domenico Fabio
University authors:
BOMBARDA Andrea
CAZZANIGA Paolo
SAVO Domenico Fabio
Link to the full record:
https://aisberg.unibg.it/handle/10446/322945
Book title:
Proceedings of the 19th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF
Project:
ANTHEM - AdvaNced Technologies for Human-centrEd Medicine
Research

Sectors (2)

PE6_7 - Artificial intelligence, intelligent systems, natural language processing - (2024)

Sector IINF-05/A - Information processing systems