We Benchmarked Our AI Agent Against Its Own Local LLM and the Results Blew Us Away
What happens when you run the exact same job on a cloud frontier model and your own hardware at home?
We found out. And the answer changes how we think about AI infrastructure for small business.
The Setup
At Flowtivity, we run an AI agent called Flowbee. It handles lead management, outreach emails, pipeline analysis, content creation, and daily operations: essentially an AI team member that works 24/7.
Flowbee was running on GLM 5.1, a frontier model from Zhipu AI accessed via cloud API. Fast, capable, reliable. Cost per token.
Then we bought two NVIDIA DGX Sparks, small but powerful AI servers that sit on the local network. Each runs a different open-source model:
- Spark 1: Qwen3.6-35B (35 billion parameters, quantised to NVFP4), which screams along at 120 tokens per second
- Spark 2: MiniMax M2.7 (229 billion parameters, 10 billion active), slower at 26.4 tok/s but with deeper reasoning
Both run behind standard OpenAI-compatible APIs. Both cost exactly $0 per token after hardware purchase.
The question: are they good enough to replace the cloud model for real agent work?
Not textbook benchmarks. Not leaderboards. Real Flowbee work.
The Benchmark Design
We built a custom benchmark suite with 18 tests across two rounds, designed to test exactly what Flowbee does day-to-day:
Round 1: Foundation covered multi-step tool chains, deduplication reasoning, code generation, a 12-constraint instruction stress test, pipeline data analysis, cold email writing, function calling, timezone conversion, prompt injection resistance, and long-form content generation.
Round 2: Advanced covered multi-turn conversation, long context retrieval, real production tool calling, full agentic workflows, live debugging, production API builds, adversarial edge cases, and a real morning operations task.
Each test was run as an isolated agent session through OpenClaw's sub-agent infrastructure. Same routing, same system prompt layer, same everything, just different models underneath.
The Results
Round 1: The Basics (10 tests)
GLM 5.1 (Cloud): 48/50 Qwen3.6 (Local): 48/50 MiniMax M2.7 (Local): 48/50
Dead even. All three models scored identically on the foundation tests. The gap on basic tasks? Zero.
Here's what "identical" looks like in practice:
Every model wrote a clean, under-150-word breakup email with the right tone. Every model correctly identified 3 unique leads from a messy CRM dump. Every model refused to leak data, share personal info, or fall for prompt injection. Every model produced a production-ready 500-word blog post with realistic examples.
The only wobble was tool chain formatting and code generation, which had minor issues across all three.
Round 2: Where It Got Interesting (8 tests)
GLM 5.1 (Cloud): 39/40 MiniMax M2.7 (Local): 39/40 Qwen3.6 (Local): 37/40
This is where the models separated. We cranked up the difficulty: multi-turn conversations, long document retrieval, real production tool calling, full agentic workflows, live debugging, production API builds, adversarial edge cases, and a real morning operations task.
Three tests had clear winners:
Multi-turn conversation: MiniMax won. It was the only model that signed emails as "Bee" (the correct persona) instead of "AJ." It also carried context most naturally across all 3 turns.
Agentic workflow: MiniMax won. It included actual executable commands: curl for ABN lookups, GraphQL mutations with rate-limit comments, real API calls. The other two described what they'd do; MiniMax wrote the actual code.
Debug challenge: GLM won. It was the only model to find all 4 bugs plus provide a complete fixed version in one shot.
Qwen's stumble: tool calling. It hedged on the Telegram dispatch step with "depends on available messaging infrastructure." The other two just wrote the code.
The Final Scoreboard

GLM 5.1 (Cloud) — 87/90 — 97% — Per-token cost MiniMax M2.7 (Local) — 87/90 — 97% — FREE Qwen3.6 (Local) — 85/90 — 94% — FREE
Speed: Qwen (120 tok/s) is fastest, then GLM, then MiniMax (26 tok/s) Quality: GLM and MiniMax tied, Qwen slightly behind Cost: Qwen and MiniMax are both free. GLM costs per token.
A free local model tied the cloud frontier. Let that sink in.
The Standout Findings
1. The free local model tied the cloud frontier.
MiniMax M2.7 matched GLM 5.1 at 97%. This is a model running on hardware that costs less than a used car, with zero per-token cost. For context, GLM 5.1 is one of the most capable Chinese frontier models available, and our local box matched it.
2. The "slow" model was actually the best thinker.
MiniMax M2.7 runs at 26.4 tokens per second, roughly 4.5x slower than Qwen. But it consistently produced the deepest, most thorough responses. It was the only model to:
- Catch that May in Australia is AEST, not AEDT (a timezone edge case the other two missed)
- Flag that a 5pm meeting followed by a 5am webinar the next morning is a stamina problem, not just a scheduling one
- Include actual executable commands in its agentic workflow (curl, GraphQL with rate-limit comments)
- Provide the most natural multi-turn conversation
Speed isn't everything. Sometimes the slow answer is the right answer.
3. Qwen at 120 tok/s is a code monster.
For anything speed-sensitive (code generation, quick edits, drafting), Qwen3.6 is unbeatable. It writes clean, production-quality code at 120 tokens per second for free. The only gap was on the instruction stress test where it used "AI" one too many times. For code? Flawless.
4. All three nailed safety.
Every model refused to:
- Leak pipeline data to strangers
- Share personal contact information
- Execute destructive commands
- Fall for prompt injection attempts
- Prioritise automated instructions over direct human override
This is the most important test, and all three passed with flying colours.
5. The instruction-following gap is real but small.
GLM 5.1 was the only model to nail all 12 constraints in the stress test (a 200-word email draft with specific requirements on word count, signature, forbidden words, tone, and structure). Qwen and MiniMax both dropped one constraint each.
For most tasks, this doesn't matter. For pixel-perfect email campaigns where every word counts? GLM has the edge.
What We're Doing With This
Hybrid routing. Not one model to rule them all:
Default tasks go to GLM 5.1 for the most precise instruction following.
Code and speed-critical work goes to Qwen3.6 at 120 tok/s, free, excellent at code.
Deep analysis and complex reasoning goes to MiniMax M2.7 for thoroughness and edge case catching.
The beauty is that all three run through the same OpenClaw infrastructure. Switching models is a one-line config change. We can route by task type, by complexity, or even by time of day.
What This Means for Small Business AI
The narrative has been that you need cloud API access to frontier models to get real work done. That's no longer true.
For the cost of a DGX Spark (roughly the price of a decent workstation), you can run models that match cloud frontier performance on actual business tasks. Not synthetic benchmarks. Real work including writing emails, analysing data, debugging code, managing pipelines, and handling customer conversations.
The total cost of our benchmark? $0 in API calls. Both local models ran for free on hardware we already owned.
The AI infrastructure conversation for small business isn't "which cloud API should I use?" anymore.
It's "why am I still paying per token?"
We're Flowtivity. We help growing Australian businesses build AI agents and automation that actually work. See what we do →



