We Ran DeepSWE at 1M Context vs 262K. The Results Surprised Us.
Last Updated: June 11, 2026
After getting DeepSeek V4 Flash running at 1M context on our dual DGX Spark setup, we had a hypothesis: more context means better automated coding. The 262K baseline had been our ceiling, and we assumed that tasks bumping against that limit were leaving performance on the table.
So we ran the DeepSWE benchmark (datacurve-ai) twice: once at 1M context, once at 262K. Same model, same hardware, same task list, although one task was silently dropped from the 1M run. The 1M run finished roughly 3x faster. But the results were exactly the same: zero passing patches in both runs.
Here is what happened, why the speed improvement is real but misleading, and what this taught us about local LLM agent performance.
What We Tested
We ran DeepSWE via Pier v0.2.1 with mini-swe-agent v2.3.0 on our dual DGX Spark cluster (DeepSeek V4 Flash, 284B MoE, FP8, tensor parallelism across two nodes connected by 200Gbps RoCE).
Five real-world software engineering tasks:
- prometheus-typed-label-sorting (Go): Multi-domain typed comparison for Prometheus label sorting
- skrub-duration-encoding (Python): Duration encoder for timedelta64/Duration columns
- anko-typed-variable-bindings (Go): Typed variable declarations for the Anko scripting language
- numba-stencil-boundary-modes (Python): Boundary modes for Numba's @stencil decorator
- python-statemachine-state-data-s (Python): State data handling (only ran at 262K, silently dropped at 1M)
The key difference between runs was the vLLM configuration. The 1M run used max-model-len=1000000, gpu-memory-utilization=0.82, and max-num-seqs=6. The 262K baseline used max-model-len=262144, gpu-memory-utilization=0.90, and default concurrency.
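To make the comparison concrete, the two configurations can be written side by side. The sketch below builds `vllm serve` argument lists from the values above. The model path is a hypothetical placeholder, and the tensor-parallel size of 2 is an assumption based on the two-node setup, not a copy of our launch script.

```python
# Sketch of the two vLLM launch configurations described above.
# MODEL_ID is a hypothetical placeholder, and --tensor-parallel-size 2 is an
# assumption based on the two-node setup; neither comes from the real launch script.
MODEL_ID = "path/to/deepseek-v4-flash"

CONFIGS = {
    "1M": {"max-model-len": 1_000_000, "gpu-memory-utilization": 0.82, "max-num-seqs": 6},
    "262K": {"max-model-len": 262_144, "gpu-memory-utilization": 0.90},  # max-num-seqs left at default
}


def vllm_args(name: str) -> list[str]:
    """Build a `vllm serve` argument list for one configuration."""
    args = ["vllm", "serve", MODEL_ID, "--tensor-parallel-size", "2"]
    for flag, value in CONFIGS[name].items():
        args += [f"--{flag}", str(value)]
    return args


def config_diff(a: str, b: str) -> dict:
    """Return every setting that differs between two configurations."""
    keys = sorted(set(CONFIGS[a]) | set(CONFIGS[b]))
    return {
        k: (CONFIGS[a].get(k, "default"), CONFIGS[b].get(k, "default"))
        for k in keys
        if CONFIGS[a].get(k) != CONFIGS[b].get(k)
    }


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, " ".join(vllm_args(name)))
    print(config_diff("1M", "262K"))
```

The diff lists three settings, which is worth keeping in mind: any speed difference between the runs cannot be pinned on a single one of them without further testing.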
The Headline Numbers
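Total runtime: 3 hours 11 minutes (1M) vs 10 hours 2 minutes (262K). That is a 68% reduction, or roughly 3.15x faster. Keep in mind that the 262K run included a fifth task that never ran at 1M, so the per-task comparisons below are the fairer measure.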
Mean reward: 0.0 in both runs. Not a single passing patch.
Peak context observed: 146,546 tokens (1M) vs 131,123 tokens (262K). Both well under the 262K limit.
The key point is that the 1M context window made everything run dramatically faster but produced zero improvement in code quality. Every patch failed verification in both runs. The extra 738K tokens of context headroom went completely unused.
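The headline arithmetic is easy to check. This short sketch recomputes the speedup, the runtime reduction, and the extra headroom from the rounded hour-and-minute values above, so the last digit may differ from figures computed on exact timestamps.

```python
# Recompute the headline figures from the rounded values reported above.
runtime_1m_min = 3 * 60 + 11     # 3 h 11 min
runtime_262k_min = 10 * 60 + 2   # 10 h 2 min

speedup = runtime_262k_min / runtime_1m_min
reduction = 1 - runtime_1m_min / runtime_262k_min

# Tokens of context window added by moving from 262,144 to 1,000,000.
extra_headroom = 1_000_000 - 262_144

print(f"speedup: {speedup:.2f}x, reduction: {reduction:.0%}")
print(f"extra headroom: {extra_headroom:,} tokens")
```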
Per-Task Breakdown
Prometheus Typed Label Sorting (Go)
1M run: 125 steps, 26 minutes agent execution, 0.0 reward
262K run: 136 steps, 2 hours 8 minutes agent execution, 0.0 reward
This was the most dramatic speedup: 4.9x faster agent execution at 1M. The 1M run used fewer steps and fewer tokens, suggesting the model converged on its implementation faster. But neither patch passed verification.
Interestingly, an earlier benchmark report estimated this task would blow past 262K to roughly 645K tokens. The actual peak context was only 114K tokens in the 262K run and 103K tokens in the 1M run. That estimate was wrong.
Skrub Duration Encoding (Python)
1M run: 206 steps, 36 minutes agent execution, verifier timeout
262K run: 156 steps, 2 hours 4 minutes agent execution, verifier timeout
This was one of two tasks where the 1M run used substantially more steps (206 vs 156). The faster per-step execution let the agent iterate deeper and explore more approaches, but it still could not produce a passing patch. Both runs generated patches of nearly identical size (1,341 vs 1,343 lines) and both hit the 30-minute verifier timeout. The generated code likely creates an infinite loop or pathological computation during test execution.
Anko Typed Variable Bindings (Go)
1M run: 113 steps, 14 minutes agent execution, 0.0 reward
262K run: 114 steps, 48 minutes agent execution, 0.0 reward
Nearly identical agent behavior (113 vs 114 steps) but 3.4x faster execution. The 1M run completed in 14 minutes, the fastest task in either benchmark. The Anko codebase is well-structured and the model clearly understood the target files. But the precision required for type checking across edge cases (nil assignment, interface satisfaction, zero values) remained beyond V4 Flash's capability.
Numba Stencil Boundary Modes (Python)
1M run: 169 steps, 44 minutes agent execution, 0.0 reward
262K run: 120 steps, 1 hour 19 minutes agent execution, 0.0 reward
Like skrub, the faster per-step execution enabled more iteration (169 vs 120 steps, a 41% increase). On this task the 1M run consumed 72% more input tokens than the 262K run, the deepest exploration of any task. Yet the patch size was nearly identical and reward remained 0.0. The Numba stencil codebase is notoriously complex (LLVM IR generation, type inference, lowering), and even with dramatically more exploration budget, V4 Flash could not get the boundary mode semantics exactly right.
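The per-task figures can be put in one table and compared directly. The sketch below uses only the step counts and agent execution times listed above for the four tasks that ran in both configurations; the dictionary layout and function name are ours.

```python
# Per-task figures as reported above: (steps, agent minutes) for each run.
TASKS = {
    "prometheus": {"1M": (125, 26), "262K": (136, 128)},
    "skrub":      {"1M": (206, 36), "262K": (156, 124)},
    "anko":       {"1M": (113, 14), "262K": (114, 48)},
    "numba":      {"1M": (169, 44), "262K": (120, 79)},
}


def summarize(task: str) -> dict:
    """Derive wall-clock speedup, step change, and per-step latency for one task."""
    steps_1m, min_1m = TASKS[task]["1M"]
    steps_262k, min_262k = TASKS[task]["262K"]
    return {
        "wall_speedup": min_262k / min_1m,
        "step_change": steps_1m / steps_262k - 1,
        "sec_per_step_1m": min_1m * 60 / steps_1m,
        "sec_per_step_262k": min_262k * 60 / steps_262k,
    }


for task in TASKS:
    s = summarize(task)
    print(
        f"{task:<11} {s['wall_speedup']:.1f}x faster, steps {s['step_change']:+.0%}, "
        f"{s['sec_per_step_262k']:.1f}s -> {s['sec_per_step_1m']:.1f}s per step"
    )
```

Separating wall-clock speedup from per-step latency matters here: numba comes out at only about 1.8x in wall-clock terms because it took 41% more steps, even though each step was much faster.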
Why Was the 1M Run 3x Faster?
This is the most interesting finding, and it has nothing to do with context window size.
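We changed three settings at once and did not isolate them, so treat the following as our working explanation rather than a measured result. The speedup most plausibly comes from how vLLM manages memory and schedules requests under each configuration:
- Lower GPU memory utilization (0.82 vs 0.90): gpu-memory-utilization caps the fraction of GPU memory vLLM claims for weights, activations, and KV cache. The lower value leaves vLLM a smaller KV cache, not a larger one, so any benefit has to come from reduced memory pressure on the rest of the system rather than from extra cache capacity.
- max-model-len=1M: vLLM's paged KV cache uses a fixed block size, and the number of blocks is set by available memory, not by max-model-len. Raising the limit therefore does not by itself give sequences in the 100K-150K range (which is all we ever saw) more or larger blocks, which is consistent with the window size not being the cause.
- Explicit max-num-seqs=6: Capping concurrent sequences limits how many long conversations compete for KV cache blocks at once, which should reduce preemption and recomputation when the cache fills.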
Average per-step latency dropped from 42.2 seconds to 11.5 seconds, a 3.7x improvement. This includes both LLM inference time and tool execution time (file reads, shell commands, tests). Since tool execution should be constant between runs, the true LLM speedup is even larger.
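The claim that the LLM-only speedup exceeds 3.7x follows from simple subtraction. The sketch below shows how the estimate moves under different assumed tool times; we did not measure tool time separately, so the values tried here are hypothetical.

```python
# Per-step latency = LLM time + tool time. If tool time is the same in both
# runs, subtracting it from both latencies gives the LLM-only speedup.
# The tool-time values tried below are hypothetical; they were not measured.
STEP_262K_S = 42.2  # average seconds per step, 262K run
STEP_1M_S = 11.5    # average seconds per step, 1M run


def llm_speedup(tool_s: float) -> float:
    """LLM-only speedup for an assumed constant tool time per step (seconds)."""
    if not 0 <= tool_s < STEP_1M_S:
        raise ValueError("tool time must be non-negative and below the faster step latency")
    return (STEP_262K_S - tool_s) / (STEP_1M_S - tool_s)


for tool_s in (0.0, 1.0, 2.0, 4.0):
    print(f"tool time {tool_s:.0f}s/step -> LLM speedup {llm_speedup(tool_s):.1f}x")
```

With zero tool time the result is the raw 3.7x; every second of constant tool time per step pushes the LLM-only figure higher.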
In summary: the config changes that were meant to support a larger context window appear to have made vLLM's memory management and scheduling much more efficient for moderate-length sequences. The same gpu-memory-utilization and max-num-seqs values applied at 262K would likely produce similar speed gains, though we have not run that control.
The Context Window Was Never the Bottleneck
This was our biggest surprise. We assumed some tasks were bumping against the 262K ceiling. They were not.
No task exceeded 150K tokens of context in either run. The highest peak was numba at 146,546 tokens with the 1M window, still only 56% of the 262K limit. The 1M window was roughly 7 to 10 times larger than any task's peak context, and 85% of it sat empty even at that highest peak.
In our runs, mini-swe-agent kept conversations under roughly 150K tokens. The conversation grows with each step but appears to be truncated or summarized before it approaches any model limit, so whether the model supports 262K or 1M, the agent's usage stayed well below either threshold.
The earlier estimate that prometheus would hit 645K tokens conflated cumulative token counts (summing all input tokens across all API calls) with the peak context size of a single API call. These are fundamentally different things.
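The distinction is easiest to see with numbers. In the sketch below, the per-call prompt sizes are made up for illustration (they are not from our logs); the point is only that the sum and the maximum answer different questions.

```python
# Each agent step resends the conversation so far, so per-call prompt sizes
# overlap heavily. Summing them measures total work processed; the maximum
# measures how much context window any single call actually needed.
# These per-call values are invented for illustration, not taken from our logs.
prompt_tokens_per_call = [20_000, 45_000, 70_000, 95_000, 110_000, 103_000]

cumulative = sum(prompt_tokens_per_call)
peak = max(prompt_tokens_per_call)

print(f"cumulative input tokens: {cumulative:,}")
print(f"peak single-call context: {peak:,}")
print(f"fits in a 262,144-token window: {peak <= 262_144}")
```

Here the cumulative count exceeds 262,144 after only six calls, while no individual call comes close to the limit. Only the peak determines whether a larger window is needed.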
More Iterations Do Not Mean Better Results
Two tasks (skrub and numba) saw substantially more agent steps at 1M (+32% and +41% respectively). The faster per-step execution let the model try more approaches, explore more code, and iterate more aggressively.
Yet the patches were virtually identical in size to those from the 262K runs, and the reward was still 0.0.
The model's ceiling on these tasks is a reasoning precision limit, not an exploration budget limit. V4 Flash can execute the agent loop competently: explore, understand, implement, test, submit. But it cannot produce code precise enough to pass hidden test suites for complex tasks. Giving it more iterations is like giving a chess player more time on a position they do not understand. They will think longer but make the same move.
Token Economics
Both runs cost $0.00 in API fees because they ran entirely on local hardware. But the token counts tell an interesting story.
1M run: 41.7M input tokens, 205K output tokens across 4 tasks
262K run: 48.8M input tokens, 246K output tokens across 5 tasks
The input/output ratio of roughly 200:1 is consistent with DeepSWE's agent pattern. Each step sends a massive context prompt (file contents plus conversation history) and receives a relatively terse tool-call response. This is why per-step latency matters so much: the model is processing huge prompts hundreds of times per task.
At cloud API pricing (roughly $3/M input tokens for GPT-4 class), the 1M run would have cost about $125 in input tokens alone. The 262K run would have been about $146. Running locally, both were free after the hardware investment.
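For readers who want to plug in their own prices, the cost and ratio figures reduce to a few lines. The sketch uses the token totals reported above and the rough $3 per million input tokens assumed in the text; output-token pricing is left out, as it is above.

```python
# Input-token cost at an assumed cloud price, using the totals reported above.
PRICE_PER_M_INPUT_USD = 3.00  # rough GPT-4-class price assumed in the text

RUNS = {
    "1M":   {"input_m": 41.7, "output_k": 205},
    "262K": {"input_m": 48.8, "output_k": 246},
}


def input_cost_usd(run: str) -> float:
    """Input-token cost in USD for one run at the assumed price."""
    return RUNS[run]["input_m"] * PRICE_PER_M_INPUT_USD


def io_ratio(run: str) -> float:
    """Input tokens per output token for one run."""
    return RUNS[run]["input_m"] * 1_000_000 / (RUNS[run]["output_k"] * 1_000)


for run in RUNS:
    print(f"{run}: ${input_cost_usd(run):.0f} input cost, {io_ratio(run):.0f}:1 input/output")
```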
What This Means for Local LLM Agents
1. Tune your vLLM config, not just your context window
The biggest performance gain came from the vLLM configuration changes, not from context window size. If you are running agent workloads on local hardware, experiment with gpu-memory-utilization and max-num-seqs before cranking up max-model-len. You might get the speed benefits without sacrificing memory for unused context headroom.
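One way to separate the effect of each setting is a small grid that varies them independently. The sketch below only enumerates flag combinations using the values from our two runs; it does not launch anything, and some combinations may not fit in memory on a given setup.

```python
from itertools import product

# Values taken from the two runs; None means "leave vLLM's default in place".
MAX_MODEL_LEN = [262_144, 1_000_000]
GPU_MEM_UTIL = [0.82, 0.90]
MAX_NUM_SEQS = [None, 6]


def sweep():
    """Yield one list of vLLM flags per combination of the three settings."""
    for max_len, util, seqs in product(MAX_MODEL_LEN, GPU_MEM_UTIL, MAX_NUM_SEQS):
        flags = [f"--max-model-len={max_len}", f"--gpu-memory-utilization={util}"]
        if seqs is not None:
            flags.append(f"--max-num-seqs={seqs}")
        yield flags


grid = list(sweep())
for flags in grid:
    print(" ".join(flags))
```

Of the eight combinations, our two runs cover only two. The 262K configuration with gpu-memory-utilization=0.82 and max-num-seqs=6 is the obvious one to try first, since it tests whether the speedup survives without the larger window.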
2. Context window size is task-dependent
For DeepSWE tasks with mini-swe-agent, 262K was already overkill. But for other use cases like processing entire codebases, analyzing long documents, or running multi-hour agent sessions without summarization, the 1M window could be genuinely useful. Do not assume bigger is always better. Measure your actual peak context usage.
3. Model precision is the real bottleneck for code generation
V4 Flash is one of the most capable open models available. It understands codebases, generates plausible implementations, and iterates on failures. But passing hidden test suites for complex real-world software engineering tasks requires a level of precision that it did not reach on any of our tasks, and we suspect the same holds for other current open models. This is not a context problem or a speed problem. It is a reasoning problem.
4. Local inference makes iteration free
The ability to run nearly 42 million input tokens through a 284B parameter model at zero marginal cost changes how you think about experimentation. We ran two full benchmark suites totaling 90 million input tokens and 450K output tokens without thinking about cost. That freedom to iterate is the real value of local inference, not the context window size.
Credits
This benchmark builds on the dual DGX Spark deployment recipe by tonyd2wild (github.com/tonyd2wild/deepseek-v4-flash-2x-spark-1m) and uses the DeepSWE benchmark framework by datacurve-ai with Pier v0.2.1 and mini-swe-agent v2.3.0.
Frequently Asked Questions
Does 1M context improve automated coding results?
In our testing, no. Running DeepSWE benchmarks with mini-swe-agent, no task ever exceeded 150K context tokens regardless of whether the model supported 262K or 1M. The agent manages its own context window and stays well below either limit. The 1M context provided zero improvement in task completion.
Why was the 1M run faster if context size did not matter?
The speed improvement came from vLLM configuration differences, not from the window size itself. The 1M config uses lower GPU memory utilization (0.82 vs 0.90) and an explicit concurrency limit (max-num-seqs=6), which we believe reduces memory pressure and scheduling contention for moderate-length sequences. These same parameters could likely be applied at 262K for similar speed gains.
Can local LLMs pass DeepSWE benchmarks?
In our experience with DeepSeek V4 Flash (284B MoE, FP8 quantized), no. The model executes the agent loop competently but cannot produce code precise enough to pass hidden test suites for complex real-world tasks. The bottleneck is reasoning precision, not context window, speed, or iteration budget.
What is the real value of local inference for agent workloads?
Zero marginal cost on iteration. We ran 90 million input tokens across two benchmark suites without thinking about API costs. That freedom to experiment aggressively, run long agent sessions, and try different configurations is transformative for research and development workflows.
Should I use 1M or 262K context for agent workloads?
Measure your actual peak context usage first. If your agent stays under 150K tokens (as mini-swe-agent did in our runs), 262K is sufficient and wastes less GPU memory. If you are processing entire codebases or running unsummarized multi-hour sessions, 1M may genuinely help. But the vLLM throughput tuning (lower gpu-memory-utilization, explicit max-num-seqs) is likely more impactful than raw context size.

