EyeAI

EyeAI combines DERIVA-ML with GPU-based notebooks to create a fully traceable pipeline for glaucoma detection from fundus photographs, spanning data ingest, feature extraction, model training, and clinical review. By treating each dataset, feature set, workflow, and execution as a first-class, versioned artefact, the project shows how DERIVA-ML delivers continuous FAIRness for real-world medical AI.

Why DERIVA-ML underpins EyeAI

Challenge	DERIVA-ML Capability
Multimodal, noisy clinical data	Unified domain + ML schemas; data checked and versioned on ingest
Rapid feature & model iteration	Model-driven UI updates automatically when schema changes
Reproducible, auditable runs	Six-artefact data model (Dataset, Feature, Workflow, Execution, Asset, CV) captures every run
Sensitive patient data	Fine-grained Globus Auth roles (reader / writer / curator) with embargo support

Impact to date¹

>70 datasets, 130 executions, 142 trained models stored with DOIs
13 clinicians provided independent labels; inter-rater metrics computed automatically
Label-efficiency study trained 142 models across 36 data subsets, with 71 retrains done by editing only execution configs
Optic-nerve detector re-trained in one day after performance drift, demonstrating end-to-end reproducibility

Typical workflow

Ingest — fundus JPEGs and metadata uploaded; Minid + BDBag generated.
Feature engineering — VGG19 model crops optic-nerve region, stores SVG as Image Annotation Feature.
Model training — VGG19 and RETFound classifiers trained via notebook; parameters, ROC figure, and model file auto-uploaded as Execution Assets.
Review — Chaise UI lets ophthalmologists inspect predictions and trace back to raw images.

Lessons for other ML health projects

Continuous FAIRness (“all data, all the time”) prevents data cascades.
Controlled vocabularies plus Minid-based datasets make label auditing straightforward.
Workflows stay in plain notebooks; DERIVA-ML adds the provenance layer without locking teams into new tools.

¹ Figures from Li Z. et al., “From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience,” arXiv 2506.16051 (2025) and Li Z. et al., “DERIVA-ML: Continuous FAIRness for Reproducible Machine Learning Models,” arXiv 2407.01608 (2024).

← Back to all projects