On April 15, 2026, Alibaba's Qwen team released Qwen3.6-35B-A3B, and if you care about running AI agents on your own hardware, this is the most important model release of the year. Here is why.
What Makes Qwen3.6-35B-A3B Different
The key is in the name: 35B total parameters, but only A3B (approximately 3 billion) activated per token during inference. This is a Mixture of Experts (MoE) architecture where the model contains 256 specialised sub-networks called experts, but only routes each token through 8 of them plus 1 shared expert.
The practical effect: you get the reasoning capacity of a much larger model while paying the compute cost of something closer to a 3B model. For self-hosted AI agents that run 24/7, this changes the economics entirely.
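The top-k gating at the heart of an MoE layer is simple enough to sketch in a few lines of Python. This is a toy illustration of the routing step only: the scores below are fake, a real router is a learned layer, and Qwen's exact gating details may differ.

```python
import math

def route_token(expert_scores, top_k=8):
    """Toy MoE gate: keep the top_k highest-scoring experts for one
    token and softmax-normalise their scores into mixture weights."""
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the selected scores only (standard top-k gating).
    exps = [math.exp(expert_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 256 candidate experts, 8 routed per token, matching the counts above.
fake_logits = [((i * 37) % 97) / 10.0 for i in range(256)]
selected = route_token(fake_logits, top_k=8)
print(len(selected))  # 8 -- every other expert is skipped for this token
```

Every token has the full 35B parameters available in aggregate, but only the selected experts (about 3B parameters' worth) run in each forward pass, which is where the speed comes from.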
On SWE-bench Verified, the standard benchmark for real-world coding tasks, Qwen3.6-35B-A3B scores 73.4, beating Qwen3.5-35B-A3B at 70.0 and Gemma4-31B at 52.0. On Terminal-Bench 2.0, which tests agents completing real tasks inside terminal environments, it scores 51.5, the highest among all compared models. This is not a toy model. It is competitive with proprietary models that cost thousands per month at scale.
Why This Matters for OpenClaw and Self-Hosted Agents
OpenClaw, the open-source personal AI assistant platform, already supports local LLMs through Ollama integration. You can point OpenClaw at a local Ollama instance, and it discovers available models and routes conversations through them. The gateway handles the agent loop, tool execution, memory management, multi-channel messaging (Telegram, WhatsApp, Discord, Slack, and 15+ more), skills, and cron scheduling. The model is just the brain.
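Under the hood this is plain HTTP against Ollama's local API. Here is a minimal sketch of the request shape; the model tag and prompt are illustrative, and OpenClaw's gateway builds these calls for you:

```python
import json

# Ollama listens on this endpoint by default.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(model, messages):
    """Serialise a chat request in the shape Ollama's /api/chat expects."""
    return json.dumps({"model": model, "messages": messages, "stream": False})

body = build_chat_request(
    "qwen3.6:35b-a3b",
    [{"role": "user", "content": "Summarise today's unread emails."}],
)
# POST `body` to OLLAMA_CHAT_URL with Content-Type: application/json;
# the reply text comes back in the response's message.content field.
print(body)
```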
Until now, the problem with self-hosted agents has been the cost-quality trade-off. Small local models (7B-14B) are cheap to run but struggle with complex multi-step agent tasks. Large models (70B+) deliver quality but need expensive GPU hardware or rack servers. The MoE architecture in Qwen3.6-35B-A3B splits that difference:
- The 3B active parameter count means inference is fast and memory-efficient
- The 35B total parameter count gives the model deep reasoning capacity
- The native 262K token context window (extensible to 1M with YaRN) is long enough for complex agent sessions with full conversation history, tool outputs, and workspace context
For an OpenClaw agent that runs 24/7, handling emails, managing leads, writing content, and coordinating across messaging channels, this means you can run a capable agent on consumer hardware with zero API costs after the initial hardware investment.
OpenClaw's sub-agent system adds another dimension. The main orchestrator agent can delegate tasks to specialised workers running on different models. You might run Qwen3.6-35B-A3B as the primary local agent for routine tasks, and route complex reasoning to a cloud API only when needed. This hybrid approach dramatically reduces your monthly AI bill.
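A hybrid routing policy can be as simple as a heuristic over expected task difficulty. The sketch below is illustrative only; the backend names and the step-count threshold are made up, not OpenClaw's actual delegation logic:

```python
LOCAL_MODEL = "qwen3.6:35b-a3b"   # free after the hardware cost
CLOUD_MODEL = "cloud-api-model"   # pay per token, reserved for hard tasks

def pick_backend(task, estimated_steps):
    """Keep routine work on the local model; escalate long multi-step
    reasoning chains to the cloud API."""
    return CLOUD_MODEL if estimated_steps > 10 else LOCAL_MODEL

print(pick_backend("triage inbox", estimated_steps=2))
print(pick_backend("plan a product launch end-to-end", estimated_steps=25))
```

In practice the threshold would come from task metadata or a cheap classifier pass, but even a crude rule like this keeps the bulk of token traffic off the metered API.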
Hardware Requirements: Exactly What You Need
Here is the hardware breakdown for running Qwen3.6-35B-A3B, based on the model's specs and community testing:
Budget Tier: Single consumer GPU
- GPU: NVIDIA RTX 3090 or RTX 4090 (24GB VRAM)
- RAM: 32GB system RAM
- Storage: 40GB free (model is ~19GB at Q4 quantization, ~70GB at FP16)
- Quantization: Q4_K_M (4-bit) recommended for 24GB VRAM
- Speed: ~15-25 tokens/second
- Cost: ~$800-1,600 (used RTX 3090 to new RTX 4090)
- Run with:
ollama run qwen3.6:35b-a3b
This is the sweet spot for most self-hosted agents. The RTX 3090 is available used for $700-900 and delivers enough VRAM to run the model at Q4 quantization with room for the KV cache and context window.
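The storage figures fall out of simple arithmetic: parameter count times bits per weight. A back-of-the-envelope check, assuming Q4_K_M averages roughly 4.5 bits per weight (KV cache and runtime overhead come on top):

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone (KV cache is extra)."""
    return params_billion * bits_per_weight / 8  # bits -> bytes, in GB

q4 = weight_size_gb(35, 4.5)    # ~4.5 bits/weight for Q4_K_M
fp16 = weight_size_gb(35, 16)
print(round(q4, 1), round(fp16, 1))  # 19.7 70.0
```

This is also why 24GB of VRAM is the comfortable floor for the Q4 build: ~20GB of weights plus a few GB of KV cache for a long agent context.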
Mid Tier: Mac Studio or dual GPU
- Mac Studio M4 Max with 64GB or 128GB unified memory
- Or dual RTX 3090/4090 (48GB total VRAM)
- Quantization: Q8 or FP8 for better quality
- Speed: ~20-40 tokens/second
- Cost: ~$2,000-4,000
Apple Silicon's unified memory is ideal for MoE models. The full model weights sit in memory, but only the active experts consume GPU compute. A Mac Studio M4 Max with 64GB can run the model at higher quantization with excellent throughput.
Production Tier: Server deployment
- GPU: NVIDIA A100 80GB or H100 80GB
- Or: 2x RTX 4090 (48GB VRAM) with tensor parallelism
- Framework: vLLM or SGLang for production serving
- Quantization: FP16 (unquantized) or FP8
- Speed: 50+ tokens/second, supports multiple concurrent sessions
- Cost: ~$5,000-25,000+ depending on configuration
For teams running multiple agents simultaneously, vLLM with tensor parallelism across multiple GPUs delivers production-grade throughput. The Hugging Face model page includes specific vLLM deployment commands.
Bare Minimum: CPU-only with KTransformers
- CPU: Modern multi-core (16+ cores recommended)
- RAM: 48GB+ system RAM
- Framework: KTransformers (CPU-GPU heterogeneous deployment)
- Speed: ~3-8 tokens/second
- Cost: $0 if you have a decent desktop
The Qwen team specifically recommends KTransformers for resource-constrained environments. It offloads parts of the model to CPU while keeping active experts on GPU. Slow, but functional.
The Multimodal Bonus
Qwen3.6-35B-A3B is not just a text model. It includes a vision encoder that handles images, documents, video, and spatial reasoning natively. On MMMU (multimodal understanding), it scores 81.7, outperforming Claude Sonnet 4.5 at 79.6. For an OpenClaw agent that receives photos via Telegram or processes document attachments from emails, this means you do not need a separate vision model.
Thinking Preservation: Built for Agent Workflows
One feature that specifically benefits long-running agents is Thinking Preservation. By default, reasoning traces (the model's internal chain-of-thought) are discarded after each response. Qwen3.6 can retain these traces across conversation turns, which improves decision consistency in multi-step agent workflows.
For an OpenClaw agent executing a complex task like "research these 10 leads, build prototypes, draft outreach emails, and update the CRM," preserved thinking means the model maintains context about why it made earlier decisions. This reduces redundant reasoning and improves KV cache efficiency.
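Mechanically, preservation just means the reasoning trace travels with the message history instead of being stripped before the next call. A schematic sketch of the idea; the `reasoning` field name is illustrative, and the real key depends on the chat template and serving framework:

```python
def add_turn(history, role, content, reasoning=None, preserve_thinking=True):
    """Append a turn; with preservation on, the model's reasoning trace
    stays in the history rather than being discarded after the reply."""
    turn = {"role": role, "content": content}
    if reasoning and preserve_thinking:
        turn["reasoning"] = reasoning  # dropped by default in most stacks
    history.append(turn)

history = []
add_turn(history, "user", "Research lead #1 and score it")
add_turn(history, "assistant", "Lead #1 scores 8/10.",
         reasoning="Scored high because the company matches our target size.")
add_turn(history, "user", "Now draft the outreach email")
# The next model call can see *why* the earlier score was given:
print(sum("reasoning" in t for t in history))  # 1
```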
Cost Comparison: Local vs Cloud API
Running Qwen3.6-35B-A3B locally with OpenClaw compared to equivalent cloud API usage:
- Local (RTX 3090, one-off): $800 hardware + ~$15/month electricity. Unlimited tokens. Zero marginal cost per agent session.
- Claude Sonnet API: $3 per million input tokens, $15 per million output tokens. A busy agent processing 50K tokens/day costs ~$100-300/month.
- GPT-4o API: $2.50 per million input tokens, $10 per million output tokens. Similar monthly costs.
A self-hosted agent running on Qwen3.6-35B-A3B pays for the hardware in 3-6 months of avoided API costs. After that, it is essentially free to run.
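You can sanity-check that payback window yourself from the figures above ($800 GPU, ~$15/month electricity, $100-300/month of avoided API spend):

```python
def breakeven_months(hardware_cost, electricity_per_month, api_bill_per_month):
    """Months until the one-time hardware cost is recovered by the
    net monthly saving versus a cloud API bill."""
    net_saving = api_bill_per_month - electricity_per_month
    return hardware_cost / net_saving

print(round(breakeven_months(800, 15, 300), 1))  # busy agent: ~2.8 months
print(round(breakeven_months(800, 15, 100), 1))  # light usage: ~9.4 months
```

Mid-range usage lands squarely in the 3-6 month bracket quoted above.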
What This Model Cannot Do (Yet)
Honest limitations:
- Tool use reliability is not yet at the level of Claude or GPT-4o. Complex multi-tool chains may need retry logic.
- Long context performance degrades at the extremes of the 262K window. For most agent sessions (10-50K tokens), this is not an issue.
- English-centric fine-tuning means performance on other languages, while decent, is not as strong as the multilingual proprietary models.
- No streaming tool calls yet in some frameworks. Check vLLM/SGLang compatibility for your specific use case.
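For the retry logic mentioned above, a plain wrapper is often enough. A generic sketch, not tied to OpenClaw or any particular framework:

```python
import time

def with_retries(fn, attempts=3, backoff_seconds=0.0):
    """Call fn(); on failure, retry up to `attempts` times in total,
    then re-raise the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as err:  # narrow this to your tool-call error type
            last_error = err
            time.sleep(backoff_seconds)
    raise last_error

# Simulate a tool call that emits malformed output twice, then succeeds.
calls = {"count": 0}
def flaky_tool():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("malformed tool-call JSON")
    return "ok"

print(with_retries(flaky_tool))  # ok
```

In a real deployment you would add exponential backoff and only retry on the specific parse or validation errors the model actually produces.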
How to Get Started
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Pull the model:
ollama pull qwen3.6:35b-a3b
- Install OpenClaw:
npm install -g openclaw@latest
- Run onboarding and point it at your local Ollama instance:
openclaw onboard
- Connect a channel (Telegram is fastest) and start chatting with your agent
The whole setup takes about 30 minutes on a machine with an RTX 3090 or better.
The Bottom Line
Qwen3.6-35B-A3B is the first open-weight model that makes self-hosted AI agents genuinely practical for daily business operations. The MoE architecture means you get strong reasoning and coding ability at a fraction of the compute cost of equivalent dense models. Combined with OpenClaw's agent platform (multi-channel messaging, skills, memory, cron, sub-agents), you can build a 24/7 autonomous assistant that runs on consumer hardware with zero ongoing API costs.
For solo developers and small teams who have been watching the AI agent space but balking at cloud API bills, this is your moment. The hardware pays for itself in months. The model is Apache 2.0 licensed for commercial use. And the agent infrastructure is free and open-source.
Frequently Asked Questions
What does A3B mean in Qwen3.6-35B-A3B?
A3B means approximately 3 billion parameters are activated per token during inference, out of 35 billion total parameters. This Mixture of Experts design gives you the reasoning capacity of a larger model at the compute cost of a much smaller one.
Can I run Qwen3.6-35B-A3B on a single GPU?
Yes. On an RTX 3090 or RTX 4090 (24GB VRAM), the model runs at Q4 quantization with room for context. You get roughly 15-25 tokens per second, which is fast enough for interactive agent sessions.
How does Qwen3.6-35B-A3B compare to Claude Sonnet for agent tasks?
On coding benchmarks like SWE-bench, Qwen3.6-35B-A3B is competitive. For complex multi-tool agent workflows, Claude Sonnet still has an edge in reliability. The advantage of Qwen3.6 is zero marginal cost per session and full data privacy.
Is OpenClaw free?
Yes. OpenClaw is open-source under the MIT license. Combined with a local model like Qwen3.6-35B-A3B running on Ollama, your only cost is the hardware and electricity.
What context length does Qwen3.6-35B-A3B support?
The native context length is 262,144 tokens (262K), extensible to over 1 million tokens using YaRN scaling. This is more than enough for complex agent sessions with full conversation history and tool outputs.