Last Updated: April 12, 2026
What Is GEPA and Why Should AI Engineers Care?
GEPA (Genetic-Pareto) is an open-source framework for optimizing AI prompts, code, agent architectures, and virtually any other text parameter using LLM-based reflection and Pareto-efficient evolutionary search. Created by researchers from UC Berkeley, Stanford, Databricks, and MIT, and integrated into DSPy as dspy.GEPA, it takes a fundamentally different approach to optimization. Instead of relying on reinforcement learning or gradient-based methods, GEPA uses an LLM to read full execution traces, including error messages, profiling data, and reasoning logs, to diagnose exactly why a candidate failed and propose targeted fixes. The results speak for themselves: 90x cheaper than RL, 35x faster, and already used in 50+ production deployments at companies like Shopify, Databricks, and OpenAI.
How Does GEPA Compare to Reinforcement Learning for Prompt Optimization?
Traditional reinforcement learning methods like GRPO require massive compute budgets: on the order of 5,000 to 25,000+ evaluations to converge on a good solution. GEPA achieves equal or better results with just 100 to 500 evaluations, which translates to roughly a 35x speedup in wall-clock optimization time.
But the cost story is even more dramatic. Databricks used GEPA to make open-source models beat Claude Opus 4.1 at 90x lower cost. Let that sink in. A carefully optimized open-source model, guided by GEPA's reflective evolution, outperformed one of the most expensive frontier models available, for a fraction of the price.
Shopify CEO Tobi Lütke put it plainly: "Both DSPy and GEPA are currently severely under hyped."
When the CEO of a $100B+ company calls your framework underhyped, you know something significant is happening.
The core difference is philosophical. RL treats your prompt or code as a black box and nudges it with reward signals. GEPA treats it as text that a language model can read, understand, and improve by reasoning about failures. It is the difference between trial-and-error and actual diagnosis.
How Does GEPA's Reflective Text Evolution Work?
GEPA operates through a five-step cycle that mirrors how a skilled engineer would debug and improve a system:
1. Select. Pick a candidate from the current Pareto frontier, which is the set of solutions that represent the best trade-offs across multiple objectives like accuracy, cost, and latency.
2. Execute. Run the candidate on a minibatch of examples. This is not just about getting a score. GEPA captures full execution traces including error messages, profiling data, reasoning logs, and any diagnostic output the evaluator produces.
3. Reflect. An LLM reads those execution traces and diagnoses why the candidate failed on specific examples. This is the magic step. Instead of just knowing "you got 6 out of 10 wrong," GEPA understands "you failed on examples requiring multi-step arithmetic because the prompt does not establish intermediate verification."
4. Mutate. Using the reflection insights plus an accumulated log of lessons learned across all previous iterations, the LLM generates an improved candidate. This is not random mutation. It is informed, targeted improvement.
5. Accept. If the new candidate improves on at least one objective without regressing on others, it joins the pool and updates the Pareto front.
GEPA also supports a powerful system-aware merge operation. This takes two Pareto-optimal candidates, analyzes their respective strengths, and combines them into a single solution that inherits the best of both. Think of it as intelligent crossover guided by LLM reasoning rather than random genetic recombination.
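The accept step and the Pareto front it maintains can be captured in a few lines. The sketch below is a deliberate simplification, not GEPA's actual implementation: it tracks just two objectives (accuracy, higher is better; cost, lower is better) and the candidate fields are made up for illustration.

```python
# Illustrative sketch of GEPA's accept step: a candidate joins the pool
# only if no existing candidate dominates it, and any candidates it now
# dominates are pruned from the front. Simplified model, not real GEPA.

def dominates(a, b):
    """True if candidate a is at least as good as b on every objective
    and strictly better on at least one."""
    at_least_as_good = a["accuracy"] >= b["accuracy"] and a["cost"] <= b["cost"]
    strictly_better = a["accuracy"] > b["accuracy"] or a["cost"] < b["cost"]
    return at_least_as_good and strictly_better

def accept(front, candidate):
    """Return the updated Pareto front after offering `candidate`."""
    if any(dominates(existing, candidate) for existing in front):
        return front  # rejected: an existing candidate beats it everywhere
    # Accepted: drop anything the newcomer now dominates.
    return [c for c in front if not dominates(candidate, c)] + [candidate]

front = [
    {"prompt": "v1", "accuracy": 0.70, "cost": 1.0},
    {"prompt": "v2", "accuracy": 0.60, "cost": 0.4},
]
front = accept(front, {"prompt": "v3", "accuracy": 0.75, "cost": 0.9})  # dominates v1
front = accept(front, {"prompt": "v4", "accuracy": 0.50, "cost": 0.8})  # dominated by v2
print([c["prompt"] for c in front])  # -> ['v2', 'v3']
```

Note that v2 survives despite its lower accuracy: it is the cheapest option on the front, which is exactly the kind of trade-off a single-metric optimizer would throw away.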
What Is Actionable Side Information and Why Does It Matter?
Actionable Side Information (ASI) is the key conceptual innovation in GEPA. In traditional machine learning, gradients tell you which direction to adjust your parameters. But when your "parameters" are natural language prompts or source code, numerical gradients do not exist.
ASI fills that gap. It is diagnostic feedback from evaluators that tells the optimization process not just whether a candidate failed, but why and how to fix it. This could be:
- Error messages from code execution
- Profiling data showing which steps took longest
- Reasoning traces that reveal logical fallacies
- Structured evaluator feedback identifying specific failure modes
ASI is the text-optimization analogue of a gradient. It converts a scalar "you scored 72%" into rich, directional information that the reflecting LLM can use to make precise, informed edits.
This is why GEPA converges so fast. Every iteration produces not just a score, but a detailed diagnosis that compounds across runs. The accumulated lesson log means GEPA gets smarter over time, avoiding repeated mistakes and building on prior insights.
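Concretely, an ASI-producing evaluator returns a diagnosis alongside the score. Here is a minimal sketch; the task (numeric QA) and the branching logic are invented for illustration, not taken from GEPA itself:

```python
# Sketch of an evaluator that emits Actionable Side Information (ASI):
# a score plus a textual diagnosis the reflecting LLM can act on.
# The task and failure categories here are hypothetical examples.

def evaluate_with_asi(candidate_output: str, expected: str):
    score = 1.0 if candidate_output.strip() == expected.strip() else 0.0
    if score == 1.0:
        feedback = "Correct."
    elif not candidate_output.strip():
        feedback = "Empty output: the prompt likely never asks for a final answer."
    elif any(ch.isalpha() for ch in candidate_output):
        feedback = (f"Got {candidate_output!r}, expected {expected!r}: "
                    "output mixes prose with the numeric answer; "
                    "instruct the model to emit the number alone.")
    else:
        feedback = (f"Got {candidate_output!r}, expected {expected!r}: "
                    "wrong value; add an intermediate verification step.")
    return {"score": score, "feedback": feedback}

print(evaluate_with_asi("The answer is 42", "42")["feedback"])
```

A scalar-only metric would report 0.0 for all three failure cases above; the feedback string is what lets the reflecting LLM make a different, targeted edit for each one.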
What Results Has GEPA Achieved in Production?
The production evidence for GEPA is remarkable. Here are the headline numbers:
- 32% to 89% ARC-AGI agent accuracy through architecture discovery. GEPA did not just tune a prompt. It discovered entirely new agent architectures that nearly tripled accuracy.
- 46.6% to 56.6% on AIME 2025 for GPT-4.1 Mini. That is a 10 percentage point jump on one of the hardest math benchmarks, achieved purely through prompt optimization.
- 55% to 82% coding agent resolve rate on Jinja tasks via auto-learned skills. The agent literally taught itself new capabilities.
- 40.2% cloud scheduling cost savings, beating hand-crafted expert heuristics. GEPA found scheduling policies that human experts had not considered.
- 50+ production deployments across Shopify, Databricks, Dropbox, OpenAI, Pydantic, and MLflow.
These are not toy benchmarks. These are real systems, real workloads, and real cost savings. The ARC-AGI result is particularly striking because it demonstrates GEPA's ability to optimize not just text but architectural decisions about how an AI agent should be structured.
What Are the Core Use Cases for GEPA?
Prompt optimization. This is the most intuitive application. GEPA can take any prompt and evolve it to maximize task performance. On AIME 2025 math problems, it boosted GPT-4.1 Mini by 10 points. On HotpotQA multi-hop retrieval, it discovers prompt strategies that guide the model through complex reasoning chains more effectively than human-written prompts.
Code optimization. GEPA treats source code as text, which means it can optimize it the same way it optimizes prompts. The reflective mechanism reads error messages, profiling data, and test failures to propose precise code changes. This is not just "rewrite this function." It is "here is why your function times out on inputs larger than 1000 elements, and here is the targeted fix."
Agent architecture discovery. This is where GEPA truly shines. Instead of manually designing agent workflows, GEPA can explore different architectural choices and discover configurations that dramatically outperform human designs. The 32% to 89% ARC-AGI jump came from architecture discovery, not prompt tweaking.
Cloud scheduling policies. GEPA optimized scheduling heuristics to achieve 40.2% cost savings, beating policies crafted by domain experts. This demonstrates that GEPA works on optimization problems well outside traditional NLP tasks.
Coding agent skill learning. GEPA enabled a coding agent to automatically learn new skills, taking its resolve rate on Jinja tasks from 55% to 82%. The agent did not just get better at what it already knew. It acquired genuinely new capabilities.
The optimize_anything API. GEPA exposes a generic optimize_anything interface that lets you evolve any text artifact in your system. Prompts, code, configuration files, system messages, tool descriptions. If it is text and you can evaluate it, GEPA can optimize it.
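The shape of such a generic interface can be sketched as a loop over evaluate, reflect, and mutate. This is an illustrative skeleton only, not the actual gepa API (consult the project's documentation for real function names and signatures); the mutate function below is a trivial stand-in for GEPA's LLM reflection:

```python
# Illustrative shape of an "optimize anything" loop: any text artifact
# plus an evaluator that returns (score, feedback). The mutate step is
# a placeholder for LLM reflection; names here are hypothetical.

def optimize_anything(seed_text, evaluate, mutate, budget=10):
    best_text = seed_text
    best_score, _ = evaluate(best_text)
    for _ in range(budget):
        _, feedback = evaluate(best_text)
        candidate = mutate(best_text, feedback)  # LLM reflection in real GEPA
        cand_score, _ = evaluate(candidate)
        if cand_score > best_score:
            best_text, best_score = candidate, cand_score
    return best_text, best_score

# Toy demo: evolve an instruction string toward containing "verify".
def evaluate(text):
    score = text.count("verify")
    return score, "missing the word 'verify'" if score == 0 else "ok"

def mutate(text, feedback):
    return text + " verify" if "missing" in feedback else text

best, score = optimize_anything("check each step.", evaluate, mutate, budget=3)
print(best, score)  # -> check each step. verify 1
```

The point of the sketch is the contract: as long as the artifact is text and the evaluator returns feedback alongside a score, the same loop applies to prompts, code, and configuration alike.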
How Do You Use GEPA with DSPy?
GEPA is integrated directly into DSPy as dspy.GEPA, and it is the recommended optimization approach for AI pipelines. Here is what makes the integration powerful:
- Drop-in replacement. If you are already using DSPy's optimizers like BootstrapFewShot or MIPRO, switching to GEPA is straightforward.
- Multi-objective optimization. GEPA's Pareto-efficient search means you can simultaneously optimize for accuracy, cost, latency, or any other measurable objective.
- Rich evaluation feedback. DSPy's metric system naturally produces the kind of diagnostic information that feeds GEPA's reflection mechanism.
- Accumulated learning. GEPA builds a lesson log across optimization runs, so your pipeline gets progressively smarter even across separate optimization sessions.
The typical workflow looks like this: define your DSPy program, specify your evaluation metric with detailed feedback, and let GEPA evolve your prompts and few-shot examples. The framework handles candidate selection, execution, reflection, mutation, and Pareto-front management automatically.
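The metric side of that workflow can be sketched as follows. The scoring logic is plain Python so it runs standalone; the dspy wiring at the bottom is shown as comments, assumes dspy is installed, and the exact keyword arguments should be verified against your installed DSPy version:

```python
# A GEPA-style feedback metric: returns a score plus textual feedback
# rather than a bare number, so the reflection step has something to
# reason about. The QA task and field names here are illustrative.

def qa_metric(gold_answer: str, predicted_answer: str):
    correct = gold_answer.strip().lower() == predicted_answer.strip().lower()
    if correct:
        return {"score": 1.0, "feedback": "Exact match."}
    return {
        "score": 0.0,
        "feedback": (f"Expected {gold_answer!r} but got {predicted_answer!r}. "
                     "Check whether the prompt asks for a short final answer."),
    }

print(qa_metric("Paris", "paris"))  # -> {'score': 1.0, 'feedback': 'Exact match.'}

# Hypothetical dspy wiring (not executed here; verify names against
# your DSPy version):
# def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
#     result = qa_metric(gold.answer, pred.answer)
#     return dspy.Prediction(score=result["score"], feedback=result["feedback"])
#
# optimizer = dspy.GEPA(metric=metric, auto="light")
# optimized = optimizer.compile(program, trainset=trainset, valset=valset)
```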
For teams already invested in the DSPy ecosystem, dspy.GEPA represents the state of the art in automated prompt and pipeline optimization.
Why Is GEPA a Game-Changer for AI Cost Optimization?
The economics of AI in 2026 are defined by a tension: frontier models deliver incredible performance but at eye-watering cost. Every API call to Claude Opus or GPT-4 at scale adds up fast. GEPA fundamentally changes this equation.
By making open-source and smaller models competitive with frontier models through intelligent optimization, GEPA shifts the value from "which model do you use" to "how well have you optimized your system." Databricks proved this: open-source models, optimized with GEPA, beat Claude Opus 4.1 at 90x lower cost.
This has massive implications:
- Startups can build production-grade AI systems without burning venture capital on API bills.
- Enterprise teams can deploy AI internally without budget battles over frontier model access.
- AI consultancies can deliver better results at lower cost, passing savings to clients.
- Open-source advocates get a concrete, measurable argument for why open models matter.
The 35x speedup matters too. RL-based optimization can take days or weeks. GEPA converges in hours. That means faster iteration cycles, more experiments, and quicker time to production.
What Makes GEPA Different from Other Prompt Optimization Tools?
Most prompt optimization approaches fall into a few camps. Gradient-based methods require differentiable models and cannot handle discrete text. RL methods like GRPO work but need thousands of evaluations and treat prompts as black boxes. Manual prompt engineering relies on human intuition and does not scale.
GEPA occupies a unique position by combining three ideas that have not been brought together before:
Reflective diagnosis. The LLM does not just try random variations. It reads execution traces and reasons about failures like a human engineer would.
Pareto-efficient search. Instead of optimizing a single metric, GEPA maintains a frontier of solutions that represent the best trade-offs across multiple objectives simultaneously.
Evolutionary compounding. Lessons accumulate across iterations. Every reflection builds on prior insights, creating a compounding knowledge effect that accelerates convergence.
The system-aware merge operation adds another dimension. By intelligently combining the strengths of two strong candidates, GEPA can discover solutions that neither parent could reach alone. This is guided recombination, not random crossover.
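The "inherit the best of both" idea behind the merge can be sketched mechanically. Real GEPA uses LLM reasoning to combine candidates, not a score-table lookup; the sketch below, with made-up module names and scores, only illustrates the shape of the operation:

```python
# Sketch in the spirit of GEPA's system-aware merge: given two
# candidates, each a mapping of pipeline module -> prompt with
# per-module scores, keep whichever parent's prompt scored better
# on each module. Simplified illustration, not GEPA's mechanism.

def merge(parent_a, parent_b):
    merged = {}
    for module in parent_a["prompts"]:
        a_better = (parent_a["module_scores"][module]
                    >= parent_b["module_scores"][module])
        source = parent_a if a_better else parent_b
        merged[module] = source["prompts"][module]
    return merged

parent_a = {
    "prompts": {"retrieve": "A-retrieve", "answer": "A-answer"},
    "module_scores": {"retrieve": 0.9, "answer": 0.5},
}
parent_b = {
    "prompts": {"retrieve": "B-retrieve", "answer": "B-answer"},
    "module_scores": {"retrieve": 0.6, "answer": 0.8},
}
print(merge(parent_a, parent_b))
# -> {'retrieve': 'A-retrieve', 'answer': 'B-answer'}
```

The merged child pairs parent A's stronger retrieval prompt with parent B's stronger answering prompt, a combination neither parent contained on its own.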
How Can You Get Started with GEPA?
Getting started with GEPA is straightforward since it is fully open-source and integrated into DSPy.
Step 1. Install DSPy with GEPA support. The framework is available through the standard DSPy package.
Step 2. Define your task using DSPy's module system. This could be a simple prompt, a multi-step pipeline, or a complex agent architecture.
Step 3. Create an evaluation metric that returns rich diagnostic feedback, not just a score. The more detailed your feedback, the better GEPA's reflection mechanism works. This is where Actionable Side Information comes in.
Step 4. Run dspy.GEPA as your optimizer. Configure your objectives, evaluation budget, and any constraints.
Step 5. Review the Pareto front of optimized candidates and select the one that best fits your deployment constraints.
The framework handles the heavy lifting. Your job is to define what "good" looks like and provide evaluation feedback that helps GEPA understand why something is good or bad.
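Step 5 reduces to a small selection problem: among the Pareto-optimal candidates, pick the cheapest one that clears your accuracy bar. A sketch, with made-up candidate data:

```python
# Sketch of step 5: from a Pareto front of optimized candidates, pick
# the cheapest one meeting a minimum accuracy requirement. The
# candidate names and numbers below are illustrative.

def pick_candidate(front, min_accuracy):
    eligible = [c for c in front if c["accuracy"] >= min_accuracy]
    if not eligible:
        raise ValueError("No candidate meets the accuracy bar; relax it "
                         "or increase the optimization budget.")
    return min(eligible, key=lambda c: c["cost_per_1k_calls"])

pareto_front = [
    {"name": "lean",     "accuracy": 0.81, "cost_per_1k_calls": 0.9},
    {"name": "balanced", "accuracy": 0.88, "cost_per_1k_calls": 2.1},
    {"name": "max-acc",  "accuracy": 0.93, "cost_per_1k_calls": 6.4},
]
print(pick_candidate(pareto_front, min_accuracy=0.85)["name"])  # -> balanced
```

Raising the bar to 0.90 would flip the choice to the expensive candidate; this is the "cost is a design choice" trade-off made explicit.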
What Does the Future Hold for Reflective Optimization?
GEPA represents a broader shift in how we think about AI optimization. The industry has been focused on making models bigger and more capable. But there is an equally important frontier: making our use of those models smarter and more efficient.
Reflective optimization points toward a future where:
- Every AI system is continuously optimized, not just at design time but in production.
- Open-source models are genuinely competitive with frontier models because the optimization layer closes the gap.
- AI engineering becomes more like software engineering, with principled debugging, testing, and optimization cycles.
- Cost is a design choice, not a constraint. You pick your performance target and GEPA finds the cheapest way to hit it.
With 50+ production deployments already live and adoption accelerating through the DSPy ecosystem, GEPA is moving from research curiosity to industry standard. The companies already using it (Shopify, Databricks, Dropbox, OpenAI) are not exactly known for adopting unproven technology.
Frequently Asked Questions
What is GEPA and how does it optimize AI prompts?
GEPA (Genetic-Pareto) is an open-source framework that optimizes AI prompts, code, and agent architectures using LLM-based reflection and Pareto-efficient evolutionary search. Instead of relying on reinforcement learning or gradient-based methods, GEPA uses an LLM to read full execution traces and diagnose why a candidate failed, then proposes targeted fixes. This approach is 90x cheaper and 35x faster than traditional RL methods.
How much cheaper is GEPA compared to reinforcement learning?
Databricks used GEPA to make open-source models beat Claude Opus 4.1 at 90x lower cost. GEPA requires only 100 to 500 evaluations compared to 5,000 to 25,000+ for GRPO-based reinforcement learning, making it roughly 35x faster as well.
Who is using GEPA in production?
GEPA has 50+ production deployments across major companies including Shopify, Databricks, Dropbox, OpenAI, Pydantic, and MLflow. Shopify CEO Tobi Lütke stated that both DSPy and GEPA are currently severely underhyped. It is integrated into DSPy as dspy.GEPA, the recommended approach for AI pipeline optimization.
What is Actionable Side Information in GEPA?
Actionable Side Information (ASI) is diagnostic feedback from evaluators that serves as the text-optimization analogue of a gradient. Instead of computing numerical gradients like in traditional ML, GEPA extracts rich diagnostic signals from execution traces and uses them to guide the evolution of better prompts and code.
Can GEPA optimize more than just prompts?
Yes. GEPA can optimize any text artifact including prompts, code, agent architectures, cloud scheduling policies, and coding agent skills. Its optimize_anything API allows you to evolve any text parameter in your system. Results range from 32% to 89% accuracy improvements on ARC-AGI to 40.2% cloud scheduling cost savings.



