Luca Engel

Agentic Python Fixer

Oct 15, 2025

Cover image generated with OpenAI's GPT-5 model.

TL;DR

At a glance

Problem

Most LLMs can generate Python, but they lack iterative self-correction. Given a failing test suite, models often hallucinate, repeat mistakes, or produce partial patches.

Key challenges:

The question: Can an agentic framework autonomously debug and patch code until tests pass, while remaining safe and backend-agnostic?

Solution overview

I designed an agentic bug-fixing system around the ReAct (Reason + Act) framework, implemented with LangGraph:

  1. Thought Agent

    • Reads the failing code, error messages, and test output.
    • Explains what went wrong and sketches a concrete fix strategy.
  2. Patch Agent

    • Converts the plan into structured JSON patches (line/region operations).
    • Keeps edits minimal and localized rather than rewriting whole files, which reduces hallucinations (a hypothetical patch format is sketched after this list).
  3. Docker Sandbox

    • Applies patches and runs tests (e.g., pytest) in an isolated container.
    • Returns stdout/stderr, exit codes, and failing assertions back to the agents.
  4. LangGraph Orchestration

    • Maintains state across iterations: code version, error history, attempts.
    • Implements a ReAct-style loop: think → act → observe → repeat until success or the attempt budget is exhausted (a minimal wiring sketch follows this list).
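To make the Patch Agent's output concrete, here is a minimal sketch of a structured line-patch format and how it could be applied. The schema (the op, start, end, and new_code fields) and the apply_patch helper are illustrative assumptions for this post, not the project's actual format.

```python
import json

# Hypothetical patch schema: a list of line-range operations emitted by the
# Patch Agent. Field names ("op", "start", "end", "new_code") are illustrative.
example_patch = json.loads("""
[
  {"op": "replace", "start": 12, "end": 12,
   "new_code": "    return sorted(values, reverse=True)"}
]
""")


def apply_patch(source: str, patch: list[dict]) -> str:
    """Apply 1-indexed, inclusive line-range operations to source text."""
    lines = source.splitlines()
    # Apply edits bottom-up so earlier line numbers stay valid after each edit.
    for op in sorted(patch, key=lambda o: o["start"], reverse=True):
        new_lines = [] if op["op"] == "delete" else op["new_code"].splitlines()
        lines[op["start"] - 1 : op["end"]] = new_lines
    return "\n".join(lines) + "\n"
```

Restricting the model to small, localized operations like these also makes each attempt cheap to review and easy to revert.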
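Below is a minimal sketch of how the think → act → observe loop could be wired with LangGraph. It assumes a custom sandbox image ("fixer-sandbox") with pytest preinstalled, reuses apply_patch from the previous sketch, and uses hypothetical helpers call_llm / call_llm_json for the model calls; the real project's node names, prompts, and state fields may differ.

```python
import os
import subprocess
from typing import TypedDict

from langgraph.graph import StateGraph, END


class FixState(TypedDict):
    code: str          # current candidate source
    test_output: str   # stdout/stderr from the last sandbox run
    plan: str          # Thought Agent's latest analysis
    attempts: int      # iterations used so far
    passed: bool       # did the last test run succeed?


MAX_ATTEMPTS = 5  # illustrative iteration budget


def thought_node(state: FixState) -> dict:
    # Thought Agent: an LLM call that reads the code and test output and
    # writes a short repair plan. `call_llm` is a hypothetical helper.
    plan = call_llm(
        f"Code:\n{state['code']}\n\nTest output:\n{state['test_output']}\n"
        "Explain the failure and propose a concrete fix."
    )
    return {"plan": plan}


def patch_node(state: FixState) -> dict:
    # Patch Agent: an LLM call that emits JSON line patches, applied with
    # apply_patch from the previous sketch. `call_llm_json` is hypothetical.
    patch = call_llm_json(f"Emit line patches for this plan:\n{state['plan']}")
    return {"code": apply_patch(state["code"], patch)}


def sandbox_node(state: FixState) -> dict:
    # Write the candidate into a mounted workdir, then run pytest inside an
    # isolated container. "fixer-sandbox" is an assumed image with pytest.
    os.makedirs("workdir", exist_ok=True)
    with open("workdir/solution.py", "w") as f:
        f.write(state["code"])
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{os.path.abspath('workdir')}:/app",
         "fixer-sandbox", "python", "-m", "pytest", "/app", "-q"],
        capture_output=True, text=True, timeout=120,
    )
    return {
        "test_output": result.stdout + result.stderr,
        "passed": result.returncode == 0,
        "attempts": state["attempts"] + 1,
    }


def should_continue(state: FixState) -> str:
    # Stop on success or when the attempt budget is exhausted.
    return "done" if state["passed"] or state["attempts"] >= MAX_ATTEMPTS else "retry"


graph = StateGraph(FixState)
graph.add_node("think", thought_node)
graph.add_node("patch", patch_node)
graph.add_node("test", sandbox_node)
graph.set_entry_point("think")
graph.add_edge("think", "patch")
graph.add_edge("patch", "test")
graph.add_conditional_edges("test", should_continue, {"retry": "think", "done": END})
app = graph.compile()
```

Because all shared context lives in the typed state object, swapping the model backend only changes the two LLM helpers, and the observe step never sees anything outside the container's stdout/stderr and exit code.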

Architecture

Data

Method

Experiments & Results

Quantitative results

Model          Stratified subset        pass@1
Qwen 1.7B      35% of HumanEvalFix      0.50
Qwen 1.7B      50% of HumanEvalFix      0.40
GPT-4o mini    35% of HumanEvalFix      0.50
GPT-4o mini    50% of HumanEvalFix      0.45
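For context on the metric: assuming one end-to-end agentic run per task (as pass@1 suggests), pass@1 is simply the fraction of tasks whose final patched program passes every test in its suite. A minimal sketch, with an illustrative helper name:

```python
def pass_at_1(solved: list[bool]) -> float:
    """Fraction of tasks whose final patch passes all tests (one attempt each)."""
    return sum(solved) / len(solved)

# Example: 2 of 4 tasks fixed -> pass@1 = 0.50
print(pass_at_1([True, False, True, False]))
```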

Observations:

Analysis (high level)

Impact

What I learned

Future Work

References