AI and Knowledge Graphs for Invoice Verification

Sep 7, 2025

Macro-F1: 0.83 on tariff prediction
Execution accuracy: 0.91 for Text2Cypher translation
Accuracy on confident predictions: 96.5% for rejection reasons
Master's Thesis Grade: 6 / 6

TL;DR

Swiss medical invoice verification is complex and error-prone, relying heavily on manual checks.
Built AI pipelines combining knowledge graphs and LLMs for tariff prediction, rule translation, and explainable rejection reasoning.
Achieved strong results: Macro-F1 0.83 on tariff codes, 0.91 execution accuracy for rule translation, and 96.5% accuracy on GraphRAG rejection predictions where the model was confident.
Produced working prototypes and defended the EPFL Master's thesis with a grade of 6 / 6.

At a glance

Role: Master’s thesis student
Timeline: Mar–Sep 2025 (6 months)
Context: EPFL Master’s Thesis at ELCA Informatik AG, Zurich, Switzerland
Users/Stakeholders: Health insurers, claim auditors, IT integrators
Scope: Data → Modeling → Pipelines → Evaluation

Problem

Swiss healthcare invoice verification is a complex and multi-faceted task. Auditors face three recurring challenges:

Incomplete or placeholder tariff codes — invoices often include code “999” as a placeholder, requiring manual lookup of the correct tariff.
Complex treatment limitation rules — the official Tariff Information System (TIS) encodes these rules in natural language, which must be consistently interpreted and applied.
Justifying rejections — when invoices are non-compliant, insurers must provide transparent and reproducible rejection reasons, which today often depend on past experience and handwritten notes.

Together, these challenges make invoice verification slow, costly, and inconsistent. The core question is: can AI systems reduce manual workload while keeping decisions accurate, explainable, and ethically sound across all three use cases?

Solution overview

I designed three pipelines:

Tariff Prediction: models suggest plausible replacements for missing tariff codes (a tariff code is an overarching grouping of medical treatments).
Text2Cypher Translation: automatically converts treatment limitation rules into Cypher queries for execution on a Neo4j graph.
Invoice Verification (GraphRAG): combines historical rejection reasons and Text2Cypher outputs to deliver explainable rejection decisions.

An accompanying ethical analysis addressed privacy, fairness, transparency, and sustainability.

Architecture

The system is built around a central knowledge graph storing invoices, treatments, tariffs, and rejection reasons. Three pipelines operate on top of this graph:

Tariff Prediction: predicts missing codes using ML models.
Text2Cypher: enforces treatment limitations via LLM → Cypher translation.
Invoice Verification: combines embeddings + Text2Cypher rule checks + LLM reasoning for rejection suggestions.

Data

Shared foundation: Anonymized Swiss medical invoices represented in a Neo4j graph schema (patients, invoice positions, tariffs, treatments, rejection reasons).
Tariff Prediction dataset: 4,500 invoices with correct tariff codes, sampled to mitigate extreme class imbalance.
Text2Cypher dataset: Treatment limitation texts crawled from the Tariff Information System (TIS), paired with custom-created ground-truth Cypher queries. Also tested on a Neo4j benchmark dataset.
Invoice Verification dataset: Invoices with rejection reasons clustered into 60 categories, used as proxy labels for retrieval quality and evaluation.
Privacy/Ethics: All data anonymized by Sumex; additional entity recognition stripped residual PII.

Method

1. Tariff Prediction

Goal: Replace placeholder tariff code 999.
Approaches:
- MLP on graph embeddings (FastRP with Optuna tuning, SMOTE-Tomek oversampling).
- Fine-tuned DistilBERT on treatment hints.
- Hybrid ensemble (random forest combining MLP + LLM outputs).
Evaluation: Macro-F1 score (chosen because of class imbalance); the hybrid ensemble performed best with 83.0% Macro-F1.
Key insight: Treatment hints (text) carried richer signals than embeddings alone. SMOTE-Tomek reduced the effect of class imbalance.
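
To illustrate why Macro-F1 was preferred over plain accuracy here, the sketch below computes both on a tiny made-up example. The labels "A" and "B" are hypothetical stand-ins for a frequent and a rare tariff code, not thesis data, and the function is a plain-Python illustration rather than the evaluation code used in the project.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy data: the model always predicts the majority code.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy)         # 0.8
print(round(score, 3))  # 0.444
```

Accuracy looks acceptable because the majority class dominates, while Macro-F1 drops sharply because the rare class is never predicted correctly.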

2. Text2Cypher Translation

Goal: Translate TIS treatment limitation rules into executable Cypher queries.
Pipeline:
1. Prompt LLM with graph schema + few-shot examples.
2. Generate Cypher query.
3. Execute query against Neo4j.
4. If failure → run error correction loop.
Evaluation: Compared random, semantic similarity-based, and hand-crafted few-shot selection; best setup achieved 91% execution accuracy.
Safety: Unverifiable rules flagged for manual review to minimize false rejections.
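
The generate, execute, and correct steps above can be sketched as a small retry loop. This is a minimal illustration under stated assumptions: `generate` and `execute` are hypothetical callables standing in for the LLM call and the Neo4j query execution, the retry limit of 3 is arbitrary, and treating an exhausted retry budget as "needs manual review" is an assumption about how the safety rule could be wired in, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TranslationResult:
    query: Optional[str]
    rows: Optional[List[dict]]
    attempts: int
    needs_manual_review: bool


def translate_rule(
    rule_text: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    execute: Callable[[str], List[dict]],
    max_attempts: int = 3,
) -> TranslationResult:
    """Generate a query, run it, and feed any error back into the next attempt."""
    query, error = None, None
    for attempt in range(1, max_attempts + 1):
        # The previous query and its error message are passed back for correction.
        query = generate(rule_text, query, error)
        try:
            rows = execute(query)
        except Exception as exc:
            error = str(exc)
            continue
        return TranslationResult(query, rows, attempt, False)
    # No attempt executed successfully: hand the rule to a human.
    return TranslationResult(query, None, max_attempts, True)


# Hypothetical stand-ins so the sketch runs without an LLM or a database.
def fake_generate(rule_text, previous_query, error):
    return "MATCH (n) RETURN n" if error else "MATCH (n RETURN n"


def fake_execute(query):
    if query.count("(") != query.count(")"):
        raise ValueError("syntax error: unbalanced parentheses")
    return [{"n": 1}]


def always_fail(query):
    raise ValueError("syntax error")


demo = translate_rule("hypothetical rule text", fake_generate, fake_execute)
failed = translate_rule("hypothetical rule text", fake_generate, always_fail)
```

Passing the two callables in keeps the loop testable without network access; in a real setup `execute` would wrap a database session.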

3. Automatic Invoice Verification

Goal: Suggest rejection reasons in an explainable way.
Pipeline:
1. Compute invoice embeddings (FastRP). Retrieve top-25 similar invoices with known rejection reasons.
2. Fetch exceeded TIS limitations via Text2Cypher.
3. Provide both to an LLM for reasoning; model outputs ranked rejection reason suggestions.
Evaluation:
- Overall accuracy: 54.5%.
- When the model was confident, accuracy reached 96.5%.
Limitations: Low recall of rejection reason clusters; suggestions made when the model was “unsure” were often unhelpful.
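
The retrieval step (step 1 of the pipeline) can be sketched as a nearest-neighbour lookup over embedding vectors. The sketch assumes cosine similarity as the similarity measure, which the write-up does not specify, and uses hypothetical two-dimensional vectors and reason labels in place of real FastRP embeddings and rejection reasons.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(query_vec, labeled_invoices, k=25):
    """Return the k (similarity, reason) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), reason) for vec, reason in labeled_invoices]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def reason_counts(neighbors):
    """Tally rejection reasons among retrieved neighbours, most frequent first."""
    return Counter(reason for _, reason in neighbors).most_common()


# Hypothetical embeddings and reason labels, for illustration only.
history = [
    ([1.0, 0.0], "duplicate position"),
    ([0.9, 0.1], "duplicate position"),
    ([0.0, 1.0], "limitation exceeded"),
]
neighbors = top_k_similar([1.0, 0.0], history, k=2)
counts = reason_counts(neighbors)
```

The tallied reasons, together with any exceeded limitations from Text2Cypher, would then form the context handed to the LLM in step 3.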

Experiments & Results

Benchmarks

Task	Metric	Score (%)
Tariff Prediction	Macro-F1	83.0
Text2Cypher	Execution Acc.	91.0
Invoice Verification	Accuracy (confident predictions)	96.5
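
The 96.5% figure is a selective metric: it counts only cases where the model reported being confident, whereas overall accuracy counts every case. The sketch below shows how the two relate on made-up records. How confidence was determined in the thesis is not described here, so `is_confident` is simply taken as a given flag, and the toy numbers are not thesis results.

```python
def selective_accuracy(records):
    """records: list of (is_correct, is_confident) booleans.

    Returns (overall accuracy, accuracy on confident cases, coverage),
    where coverage is the share of cases flagged as confident.
    """
    total = len(records)
    if total == 0:
        return 0.0, 0.0, 0.0
    confident = [correct for correct, is_conf in records if is_conf]
    overall = sum(correct for correct, _ in records) / total
    confident_acc = sum(confident) / len(confident) if confident else 0.0
    coverage = len(confident) / total
    return overall, confident_acc, coverage


# Hypothetical toy records: two confident and correct, three unsure.
toy = [(True, True), (True, True), (False, False), (True, False), (False, False)]
overall, confident_acc, coverage = selective_accuracy(toy)
print(overall, confident_acc, coverage)  # 0.6 1.0 0.4
```

Reporting coverage alongside the confident-case accuracy makes clear how many invoices the high-accuracy regime actually applies to.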

Evaluation protocol: an 80/10/10 train/validation/test split for tariff prediction, a held-out custom dataset for rule translation, and prototype testing on real rejection scenarios.
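
An 80/10/10 split can be sketched as a seeded shuffle followed by slicing. This is a generic illustration: whether the thesis split was stratified or purely random is not stated, the seed is arbitrary, and the integer IDs are hypothetical placeholders for invoices.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of items reproducibly and cut it into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding surprises
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical invoice IDs, sized like the tariff prediction dataset.
train, val, test = split_80_10_10(range(4500))
print(len(train), len(val), len(test))  # 3600 450 450
```

With heavy class imbalance, a stratified variant that splits each class separately would keep rare tariff codes represented in all three parts.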

Error analysis

Tariff prediction still struggles with rare codes.
Long compositional rules can fail in Text2Cypher when unseen operators occur.
Rejection reasoning remains weak when the model is uncertain.

Impact

Showed that knowledge graph + LLM hybrid pipelines can reduce manual workload in healthcare invoice verification.
Transparent reasoning via Cypher queries improves trust compared to black-box models.
Provided prototypes for ELCA and Sumex to evaluate in production environments.

What I learned

How to integrate symbolic reasoning with neural models (GraphRAG).
Practical handling of extreme class imbalance in structured healthcare data.
The importance of evaluation beyond metrics (fairness, transparency).
Collaborating with industry partners under confidentiality and performance constraints.

Future Work

Improve tariff prediction on tail classes.
Increase Text2Cypher dataset size.
Deploy prototype in sandbox environments for real-world validation.
Extend ethical considerations into deployment phase.
