Agentic Python Fixer
Oct 15, 2025
Cover image generated with OpenAI's GPT-5 model.
Highlights
- Qwen 1.7B pass@1: 0.50 (stratified 35% of HumanEvalFix)
- Qwen 1.7B pass@1: 0.40 (stratified 50% of HumanEvalFix)
- GPT-4o mini pass@1: 0.50 (stratified 35% of HumanEvalFix)
- GPT-4o mini pass@1: 0.45 (stratified 50% of HumanEvalFix)
- Secure sandboxed execution via Docker
TL;DR
- Built an agentic Python fixer that autonomously repairs failing code through reason–act loops.
- Uses two cooperating agents (Thought + Patch) under the ReAct paradigm, orchestrated by LangGraph.
- Runs code fixes safely in a Docker sandbox, isolating untrusted user and LLM code.
- On a stratified 35% subset of HumanEvalFix, both Qwen 1.7B and GPT-4o mini reach a pass@1 of 0.50.
At a glance
- Role: Research engineer (solo project)
- Timeline: Oct 2025 (5 days)
- Context: Independent research prototype
- Users/Stakeholders: LLM researchers, agentic framework developers, AI debugging tools
- My scope: System architecture → agent design → evaluation → analysis
Problem
Most LLMs can generate Python, but they lack iterative self-correction. Given a failing test suite, models often hallucinate, repeat mistakes, or produce partial patches.
Key challenges:
- Reason about compiler/runtime errors and test failures.
- Propose targeted edits instead of full rewrites.
- Execute and iterate safely, without exposing the host to arbitrary code.
The question: Can an agentic framework autonomously debug and patch code until tests pass, while remaining safe and backend-agnostic?
Solution overview
I designed an agentic bug-fixing system around the ReAct (Reason + Act) framework, implemented with LangGraph:
- Thought Agent
  - Reads the failing code, error messages, and test output.
  - Explains what went wrong and sketches a concrete fix strategy.
- Patch Agent
  - Converts the plan into structured JSON patches (line/region operations; see the patch sketch after this list).
  - Keeps edits minimal and localized rather than rewriting whole files, which reduces hallucinations.
- Docker Sandbox
  - Applies patches and runs tests (e.g., pytest) in an isolated container.
  - Returns stdout/stderr, exit codes, and failing assertions back to the agents.
- LangGraph Orchestration
  - Maintains state across iterations: code version, error history, attempts.
  - Implements a ReAct-style loop: think → act → observe → repeat until success or budget exhaustion.
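To make the Patch Agent's output concrete, here is a minimal sketch of a line-based JSON patch and how it could be applied. The schema, the `replace_lines` operation name, and the `apply_patch` helper are illustrative assumptions, not the project's verbatim format.

```python
import json

# Hypothetical patch format: a list of line-range operations emitted by the Patch Agent.
# Each operation replaces an inclusive, 1-indexed line range with new source lines.
EXAMPLE_PATCH = json.loads("""
{
  "operations": [
    {"op": "replace_lines", "start": 3, "end": 3,
     "new_lines": ["    return a + b  # was: return a - b"]}
  ]
}
""")

def apply_patch(source: str, patch: dict) -> str:
    """Apply line-range operations to buggy source code and return the patched code."""
    lines = source.splitlines()
    # Apply operations bottom-up so earlier line numbers stay valid after each edit.
    for op in sorted(patch["operations"], key=lambda o: o["start"], reverse=True):
        if op["op"] == "replace_lines":
            lines[op["start"] - 1 : op["end"]] = op["new_lines"]
    return "\n".join(lines) + "\n"

buggy = 'def add(a, b):\n    """Add two numbers."""\n    return a - b\n'
print(apply_patch(buggy, EXAMPLE_PATCH))
```

Keeping patches as explicit line-range operations also makes rollback trivial: the previous code version is just kept in the agent state.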
Architecture
- Nodes: Thought, Patch, and Sandbox. After every Patch step, the Sandbox node runs the tests; the results feed back to the Thought node (wiring sketched below).
- Backends: Pluggable LLMs (Qwen 0.6B locally, GPT-4o mini via OpenAI, Qwen 1.7B via HF endpoint).
- Safety: All user/LLM code and tests are executed in the Docker sandbox only.
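Below is a minimal sketch of how such a loop can be wired up with LangGraph's `StateGraph`. The state fields, node bodies, and the five-iteration budget are assumptions for illustration; the LLM calls and sandbox invocation are stubbed out.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph

# Illustrative state schema; the field names are assumptions, not the project's exact ones.
class FixerState(TypedDict):
    code: str            # current version of the program under repair
    test_output: str     # stdout/stderr from the last sandbox run
    history: List[str]   # previous reasoning, patches, and errors, to avoid cycling
    passed: bool         # did the last test run succeed?
    attempts: int        # iteration counter against the budget

def thought_node(state: FixerState) -> dict:
    # Call the backend LLM to explain the failure and sketch a fix strategy (omitted).
    return {"history": state["history"] + ["<reasoning>"]}

def patch_node(state: FixerState) -> dict:
    # Call the backend LLM to emit a structured JSON patch and apply it (omitted).
    return {"code": state["code"], "attempts": state["attempts"] + 1}

def sandbox_node(state: FixerState) -> dict:
    # Run the tests inside the Docker sandbox (omitted) and record the observation.
    return {"test_output": "...", "passed": False}

def should_continue(state: FixerState) -> str:
    # Stop on success or when the iteration budget is exhausted; otherwise loop again.
    return "done" if state["passed"] or state["attempts"] >= 5 else "retry"

builder = StateGraph(FixerState)
builder.add_node("thought", thought_node)
builder.add_node("patch", patch_node)
builder.add_node("sandbox", sandbox_node)
builder.set_entry_point("thought")
builder.add_edge("thought", "patch")
builder.add_edge("patch", "sandbox")
builder.add_conditional_edges("sandbox", should_continue, {"retry": "thought", "done": END})
graph = builder.compile()
```

Swapping backends then only requires changing the LLM client used inside the Thought and Patch nodes; the graph itself stays the same.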
Data
- Benchmark: HumanEvalFix — Python coding tasks with seeded bugs and reference fixes.
- Subsets:
  - Stratified subset @ 35% of tasks (diverse bug types).
  - Stratified subset @ 50% of tasks (larger sample).
- Signals captured:
  - Test pass/fail per iteration.
  - Final pass@1 (success within a single full run of the agent; see the sketch after this list).
  - Number of iterations until success or timeout.
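Because each task gets exactly one full agent run, pass@1 here reduces to the fraction of tasks whose tests pass at the end of that run. A small bookkeeping sketch (the record fields and task IDs are illustrative, not actual benchmark results):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskResult:
    task_id: str
    passed: bool        # did the final patched code pass all tests?
    iterations: int     # ReAct iterations until success or timeout

def pass_at_1(results: List[TaskResult]) -> float:
    """With one agent run per task, pass@1 is simply the success rate."""
    return sum(r.passed for r in results) / len(results)

# Toy example with two tasks, one solved in two iterations and one timed out.
results = [
    TaskResult("HumanEvalFix/0", True, 2),
    TaskResult("HumanEvalFix/1", False, 5),
]
print(f"pass@1 = {pass_at_1(results):.2f}")
```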
Method
- Agent loop (ReAct):
  - The Thought Agent inspects error messages and proposes a reasoning chain.
  - The Patch Agent emits a JSON patch object (where/what to change).
  - The Sandbox applies the patch and executes the tests (see the sandbox sketch after this list).
  - The observation (logs + failures) is fed back to the Thought Agent.
- LangGraph:
  - Encodes this loop as a graph of nodes with typed messages.
  - Keeps history (previous patches, errors) to avoid cycling.
- Model backends:
  - Qwen 0.6B (local) for fully on-device runs.
  - Qwen 1.7B via a Hugging Face Inference API endpoint.
  - GPT-4o mini as a stronger API-based model.
- Evaluation:
  - For each model and subset, run the agent once per task and compute pass@1.
  - Compare performance across models and dataset fractions.
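For the sandbox step, the sketch below shows one way to run the test suite inside a throwaway Docker container and capture the observation. The image name, resource limits, and mount layout are assumptions; the project's actual sandbox configuration may differ.

```python
import subprocess
from pathlib import Path

def run_tests_in_sandbox(workdir: Path, timeout: int = 60) -> dict:
    """Run pytest inside a throwaway Docker container and return the observation
    (exit code, stdout, stderr) that gets fed back to the Thought Agent."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                    # untrusted code gets no network access
        "--memory", "512m", "--cpus", "1",      # cap resources
        "-v", f"{workdir.resolve()}:/app:ro",   # mount the patched code read-only
        "fixer-sandbox:latest",                 # assumed image with Python + pytest preinstalled
        "pytest", "-q", "-p", "no:cacheprovider", "/app",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"exit_code": -1, "stdout": "", "stderr": f"sandbox timed out after {timeout}s"}
```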
Experiments & Results
Quantitative results
| Model | Stratified subset | pass@1 |
|---|---|---|
| Qwen 1.7B | 35% of HumanEvalFix | 0.50 |
| Qwen 1.7B | 50% of HumanEvalFix | 0.40 |
| GPT-4o mini | 35% of HumanEvalFix | 0.50 |
| GPT-4o mini | 50% of HumanEvalFix | 0.45 |
Observations:
- Both Qwen 1.7B and GPT-4o mini reach 50% pass@1 on the smaller stratified subset.
- GPT-4o mini holds up slightly better when scaling to 50% of the benchmark (0.45 vs 0.40).
- Many successful repairs happen within a small number of iterations, indicating the ReAct loop and structured patches are effective.
Analysis (high level)
- Failures often occur when:
  - The model misinterprets the bug's root cause.
  - The Patch Agent edits the wrong region even though the Thought Agent reasoned correctly.
- Successes tend to involve:
  - Clear error messages (type errors, wrong return values) and concise functions.
  - Localized logic issues (off-by-one errors, missing conditionals, incorrect operators).
Impact
- Shows that agentic, iterative debugging can fix a substantial fraction of real benchmark bugs with modest models.
- Demonstrates a practical pattern for safe execution of LLM-generated code via Docker.
- Provides a reusable evaluation harness for HumanEvalFix-style automatic repair across different backends.
What I learned
- Agentic loops benefit heavily from structured patch formats—they make reasoning and rollback simpler than raw text diffs.
- LangGraph makes it natural to encode multi-step workflows with explicit state, instead of ad-hoc loops.
- Safety and isolation aren’t optional when letting an LLM execute arbitrary code.
- Even strong models still need good tooling and state management to be reliable debuggers.
Future Work
- Add a critic/verifier agent to sanity-check patches before execution.
- Support multi-file projects and more complex build/test setups.
- Explore cost vs. performance trade-offs by varying iteration limits and model sizes.
References
- Yao et al. (2023): ReAct: Synergizing Reasoning and Acting in Language Models.
- Chen et al. (2021): Evaluating Large Language Models Trained on Code (HumanEval).