Agentic Python Fixer
Oct 15, 2025
Cover image generated with OpenAI's GPT-5 model.
Highlights
- Qwen 1.7B pass@1: 0.50 (stratified 35% of HumanEvalFix)
- Qwen 1.7B pass@1: 0.40 (stratified 50% of HumanEvalFix)
- GPT-4o mini pass@1: 0.50 (stratified 35% of HumanEvalFix)
- GPT-4o mini pass@1: 0.45 (stratified 50% of HumanEvalFix)
- Secure sandboxed execution via Docker
TL;DR
- Built an agentic Python fixer that autonomously repairs failing code through reason–act loops.
- Uses two cooperating agents (Thought + Patch) under the ReAct paradigm, orchestrated by LangGraph.
- Runs code fixes safely in a Docker sandbox, isolating untrusted user and LLM code.
- On a stratified 35% subset of HumanEvalFix, both Qwen 1.7B and GPT-4o mini reach a pass@1 of 0.50.
At a glance
- Role: Research engineer (solo project)
- Timeline: Oct 2025 (5 days)
- Context: Independent research prototype
- Users/Stakeholders: LLM researchers, agentic framework developers, AI debugging tools
- My scope: System architecture → agent design → evaluation → analysis
Problem
Most LLMs can generate Python, but they lack iterative self-correction. Given a failing test suite, models often hallucinate, repeat mistakes, or produce partial patches.
Key challenges:
- Reason about compiler/runtime errors and test failures.
- Propose targeted edits instead of full rewrites.
- Execute and iterate safely, without exposing the host to arbitrary code.
The question: Can an agentic framework autonomously debug and patch code until tests pass, while remaining safe and backend-agnostic?
Solution overview
I designed an agentic bug-fixing system around the ReAct (Reason + Act) framework, implemented with LangGraph:
- Thought Agent
  - Reads the failing code, error messages, and test output.
  - Explains what went wrong and sketches a concrete fix strategy.
- Patch Agent
  - Converts the plan into structured JSON patches (line/region operations; see the patch sketch after this list).
  - Keeps edits minimal and localized rather than rewriting whole files, which reduces hallucinations.
- Docker Sandbox
  - Applies patches and runs tests (e.g., pytest) in an isolated container.
  - Returns stdout/stderr, exit codes, and failing assertions back to the agents.
- LangGraph Orchestration
  - Maintains state across iterations: code version, error history, attempts.
  - Implements a ReAct-style loop: think → act → observe → repeat until success or budget exhaustion.
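To make the Patch Agent's output concrete, here is a minimal sketch of a line-based JSON patch and how it could be applied. The schema, the `replace_lines` operation name, and the `apply_patch` helper are illustrative assumptions, not the project's verbatim format.

```python
import json

# Hypothetical patch format: a list of line-range operations emitted by the Patch Agent.
# Each operation replaces an inclusive, 1-indexed line range with new source lines.
EXAMPLE_PATCH = json.loads("""
{
  "operations": [
    {"op": "replace_lines", "start": 3, "end": 3,
     "new_lines": ["    return a + b  # was: return a - b"]}
  ]
}
""")

def apply_patch(source: str, patch: dict) -> str:
    """Apply line-range operations to buggy source code and return the patched code."""
    lines = source.splitlines()
    # Apply operations bottom-up so earlier line numbers stay valid after each edit.
    for op in sorted(patch["operations"], key=lambda o: o["start"], reverse=True):
        if op["op"] == "replace_lines":
            lines[op["start"] - 1 : op["end"]] = op["new_lines"]
    return "\n".join(lines) + "\n"

buggy = 'def add(a, b):\n    """Add two numbers."""\n    return a - b\n'
print(apply_patch(buggy, EXAMPLE_PATCH))
```

Keeping patches as explicit line-range operations also makes rollback trivial: the previous code version is just kept in the agent state.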
Architecture
- Nodes: Thought, Patch, and Sandbox. After every Patch step, the Sandbox node runs the tests; the results feed back to the Thought node (wiring sketched below).
- Backends: Pluggable LLMs (Qwen 0.6B locally, GPT-4o mini via OpenAI, Qwen 1.7B via HF endpoint).
- Safety: All user/LLM code and tests are executed in the Docker sandbox only.
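Below is a minimal sketch of how such a loop can be wired up with LangGraph's `StateGraph`. The state fields, node bodies, and the five-iteration budget are assumptions for illustration; the LLM calls and sandbox invocation are stubbed out.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph

# Illustrative state schema; the field names are assumptions, not the project's exact ones.
class FixerState(TypedDict):
    code: str            # current version of the program under repair
    test_output: str     # stdout/stderr from the last sandbox run
    history: List[str]   # previous reasoning, patches, and errors, to avoid cycling
    passed: bool         # did the last test run succeed?
    attempts: int        # iteration counter against the budget

def thought_node(state: FixerState) -> dict:
    # Call the backend LLM to explain the failure and sketch a fix strategy (omitted).
    return {"history": state["history"] + ["<reasoning>"]}

def patch_node(state: FixerState) -> dict:
    # Call the backend LLM to emit a structured JSON patch and apply it (omitted).
    return {"code": state["code"], "attempts": state["attempts"] + 1}

def sandbox_node(state: FixerState) -> dict:
    # Run the tests inside the Docker sandbox (omitted) and record the observation.
    return {"test_output": "...", "passed": False}

def should_continue(state: FixerState) -> str:
    # Stop on success or when the iteration budget is exhausted; otherwise loop again.
    return "done" if state["passed"] or state["attempts"] >= 5 else "retry"

builder = StateGraph(FixerState)
builder.add_node("thought", thought_node)
builder.add_node("patch", patch_node)
builder.add_node("sandbox", sandbox_node)
builder.set_entry_point("thought")
builder.add_edge("thought", "patch")
builder.add_edge("patch", "sandbox")
builder.add_conditional_edges("sandbox", should_continue, {"retry": "thought", "done": END})
graph = builder.compile()
```

Swapping backends then only requires changing the LLM client used inside the Thought and Patch nodes; the graph itself stays the same.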
Data
- Benchmark: HumanEvalFix — Python coding tasks with seeded bugs and reference fixes.
- Subsets:
  - Stratified subset @ 35% of tasks (diverse bug types).
  - Stratified subset @ 50% of tasks (larger sample).
- Signals captured:
  - Test pass/fail per iteration.
  - Final pass@1 (success within a single full run of the agent; see the sketch after this list).
  - Number of iterations until success or timeout.
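Because each task gets exactly one full agent run, pass@1 here reduces to the fraction of tasks whose tests pass at the end of that run. A small bookkeeping sketch (the record fields and task IDs are illustrative, not actual benchmark results):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskResult:
    task_id: str
    passed: bool        # did the final patched code pass all tests?
    iterations: int     # ReAct iterations until success or timeout

def pass_at_1(results: List[TaskResult]) -> float:
    """With one agent run per task, pass@1 is simply the success rate."""
    return sum(r.passed for r in results) / len(results)

# Toy example with two tasks, one solved in two iterations and one timed out.
results = [
    TaskResult("HumanEvalFix/0", True, 2),
    TaskResult("HumanEvalFix/1", False, 5),
]
print(f"pass@1 = {pass_at_1(results):.2f}")
```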
Method
- Agent loop (ReAct):
  - The Thought Agent inspects error messages and proposes a reasoning chain.
  - The Patch Agent emits a JSON patch object (where/what to change).
  - The Sandbox applies the patch and executes the tests (see the sandbox sketch after this list).
  - The observation (logs + failures) is fed back to the Thought Agent.
- LangGraph:
  - Encodes this loop as a graph of nodes with typed messages.
  - Keeps history (previous patches, errors) to avoid cycling.
- Model backends:
  - Qwen 0.6B (local) for fully on-device runs.
  - Qwen 1.7B via a Hugging Face Inference API endpoint.
  - GPT-4o mini as a stronger API-based model.
- Evaluation:
  - For each model and subset, run the agent once per task and compute pass@1.
  - Compare performance across models and dataset fractions.
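For the sandbox step, the sketch below shows one way to run the test suite inside a throwaway Docker container and capture the observation. The image name, resource limits, and mount layout are assumptions; the project's actual sandbox configuration may differ.

```python
import subprocess
from pathlib import Path

def run_tests_in_sandbox(workdir: Path, timeout: int = 60) -> dict:
    """Run pytest inside a throwaway Docker container and return the observation
    (exit code, stdout, stderr) that gets fed back to the Thought Agent."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                    # untrusted code gets no network access
        "--memory", "512m", "--cpus", "1",      # cap resources
        "-v", f"{workdir.resolve()}:/app:ro",   # mount the patched code read-only
        "fixer-sandbox:latest",                 # assumed image with Python + pytest preinstalled
        "pytest", "-q", "-p", "no:cacheprovider", "/app",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"exit_code": -1, "stdout": "", "stderr": f"sandbox timed out after {timeout}s"}
```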
Experiments & Results
Quantitative results
| Model | Stratified subset | pass@1 |
|---|---|---|
| Qwen 1.7B | 35% of HumanEvalFix | 0.50 |
| Qwen 1.7B | 50% of HumanEvalFix | 0.40 |
| GPT-4o mini | 35% of HumanEvalFix | 0.50 |
| GPT-4o mini | 50% of HumanEvalFix | 0.45 |
Observations:
- Both Qwen 1.7B and GPT-4o mini reach 50% pass@1 on the smaller stratified subset.
- GPT-4o mini holds up slightly better when scaling to 50% of the benchmark (0.45 vs 0.40).
- Many successful repairs happen within a small number of iterations, indicating the ReAct loop and structured patches are effective.
Analysis (high level)
- Failures often occur when:
  - The model misinterprets the bug's root cause.
  - The Patch Agent edits the wrong region even though the Thought Agent reasoned correctly.
- Successes tend to involve:
  - Clear error messages (type errors, wrong return values) and concise functions.
  - Localized logic issues (off-by-one errors, missing conditionals, incorrect operators).
Impact
- Shows that agentic, iterative debugging can fix a substantial fraction of real benchmark bugs with modest models.
- Demonstrates a practical pattern for safe execution of LLM-generated code via Docker.
- Provides a reusable evaluation harness for HumanEvalFix-style automatic repair across different backends.
What I learned
- Agentic loops benefit heavily from structured patch formats—they make reasoning and rollback simpler than raw text diffs.
- LangGraph makes it natural to encode multi-step workflows with explicit state, instead of ad-hoc loops.
- Safety and isolation aren’t optional when letting an LLM execute arbitrary code.
- Even strong models still need good tooling and state management to be reliable debuggers.
Future Work
- Add a critic/verifier agent to sanity-check patches before execution.
- Support multi-file projects and more complex build/test setups.
- Explore cost vs. performance trade-offs by varying iteration limits and model sizes.
References
- Yao et al. (2023): ReAct: Synergizing Reasoning and Acting in Language Models.
- Chen et al. (2021): Evaluating Large Language Models Trained on Code (HumanEval).