On April 15, 2026, Alibaba's Qwen team released Qwen3.6-35B-A3B, and if you care about running AI agents on your own hardware, this is the most important model release of the year. Here is why.
What Makes Qwen3.6-35B-A3B Different
The key is in the name: 35B total parameters, but only A3B (approximately 3 billion) activated per token during inference. This is a Mixture of Experts (MoE) architecture where the model contains 256 specialised sub-networks called experts, but only routes each token through 8 of them plus 1 shared expert.
The practical effect: you get the reasoning capacity of a much larger model while paying the compute cost of something closer to a 3B model. For self-hosted AI agents that run 24/7, this changes the economics entirely.
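The top-k gating at the heart of an MoE layer is simple enough to sketch in a few lines of Python. This is a toy illustration of the routing step only: the scores below are fake, a real router is a learned layer, and Qwen's exact gating details may differ.

```python
import math

def route_token(expert_scores, top_k=8):
    """Toy MoE gate: keep the top_k highest-scoring experts for one
    token and softmax-normalise their scores into mixture weights."""
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the selected scores only (standard top-k gating).
    exps = [math.exp(expert_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 256 candidate experts, 8 routed per token, matching the counts above.
fake_logits = [((i * 37) % 97) / 10.0 for i in range(256)]
selected = route_token(fake_logits, top_k=8)
print(len(selected))  # 8 -- every other expert is skipped for this token
```

Every token has the full 35B parameters available in aggregate, but only the selected experts (about 3B parameters' worth) run in each forward pass, which is where the speed comes from.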
On SWE-bench Verified, the standard benchmark for real-world coding tasks, Qwen3.6-35B-A3B scores 73.4, beating Qwen3.5-35B-A3B at 70.0 and Gemma4-31B at 52.0. On Terminal-Bench 2.0, which tests agents completing real tasks inside terminal environments, it scores 51.5, the highest among all compared models. This is not a toy model. It is competitive with proprietary models that cost thousands per month at scale.
Why This Matters for OpenClaw and Self-Hosted Agents
OpenClaw, the open-source personal AI assistant platform, already supports local LLMs through Ollama integration. You can point OpenClaw at a local Ollama instance, and it discovers available models and routes conversations through them. The gateway handles the agent loop, tool execution, memory management, multi-channel messaging (Telegram, WhatsApp, Discord, Slack, and 15+ more), skills, and cron scheduling. The model is just the brain.
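Under the hood this is plain HTTP against Ollama's local API. Here is a minimal sketch of the request shape; the model tag and prompt are illustrative, and OpenClaw's gateway builds these calls for you:

```python
import json

# Ollama listens on this endpoint by default.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(model, messages):
    """Serialise a chat request in the shape Ollama's /api/chat expects."""
    return json.dumps({"model": model, "messages": messages, "stream": False})

body = build_chat_request(
    "qwen3.6:35b-a3b",
    [{"role": "user", "content": "Summarise today's unread emails."}],
)
# POST `body` to OLLAMA_CHAT_URL with Content-Type: application/json;
# the reply text comes back in the response's message.content field.
print(body)
```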
Until now, the problem with self-hosted agents has been the cost-quality trade-off. Small local models (7B-14B) are cheap to run but struggle with complex multi-step agent tasks. Large models (70B+) deliver quality but need expensive GPU hardware or rack servers. The MoE architecture in Qwen3.6-35B-A3B splits that difference:
- The 3B active parameter count means inference is fast and memory-efficient
- The 35B total parameter count gives the model deep reasoning capacity
- The native 262K token context window (extensible to 1M with YaRN) is long enough for complex agent sessions with full conversation history, tool outputs, and workspace context
For an OpenClaw agent that runs 24/7, handling emails, managing leads, writing content, and coordinating across messaging channels, this means you can run a capable agent on consumer hardware with zero API costs after the initial hardware investment.
OpenClaw's sub-agent system adds another dimension. The main orchestrator agent can delegate tasks to specialised workers running on different models. You might run Qwen3.6-35B-A3B as the primary local agent for routine tasks, and route complex reasoning to a cloud API only when needed. This hybrid approach dramatically reduces your monthly AI bill.
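A hybrid routing policy can be as simple as a heuristic over expected task difficulty. The sketch below is illustrative only; the backend names and the step-count threshold are made up, not OpenClaw's actual delegation logic:

```python
LOCAL_MODEL = "qwen3.6:35b-a3b"   # free after the hardware cost
CLOUD_MODEL = "cloud-api-model"   # pay per token, reserved for hard tasks

def pick_backend(task, estimated_steps):
    """Keep routine work on the local model; escalate long multi-step
    reasoning chains to the cloud API."""
    return CLOUD_MODEL if estimated_steps > 10 else LOCAL_MODEL

print(pick_backend("triage inbox", estimated_steps=2))
print(pick_backend("plan a product launch end-to-end", estimated_steps=25))
```

In practice the threshold would come from task metadata or a cheap classifier pass, but even a crude rule like this keeps the bulk of token traffic off the metered API.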
Hardware Requirements: Exactly What You Need
Here is the hardware breakdown for running Qwen3.6-35B-A3B, based on the model's specs and community testing:
Budget Tier: Single consumer GPU
- GPU: NVIDIA RTX 3090 or RTX 4090 (24GB VRAM)
- RAM: 32GB system RAM
- Storage: 40GB free (model is ~19GB at Q4 quantization, ~70GB at FP16)
- Quantization: Q4_K_M (4-bit) recommended for 24GB VRAM
- Speed: ~15-25 tokens/second
- Cost: ~$800-1,600 (used RTX 3090 to new RTX 4090)
- Run with:
ollama run qwen3.6:35b-a3b
This is the sweet spot for most self-hosted agents. The RTX 3090 is available used for $700-900 and delivers enough VRAM to run the model at Q4 quantization with room for the KV cache and context window.
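The storage figures fall out of simple arithmetic: parameter count times bits per weight. A back-of-the-envelope check, assuming Q4_K_M averages roughly 4.5 bits per weight (KV cache and runtime overhead come on top):

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone (KV cache is extra)."""
    return params_billion * bits_per_weight / 8  # bits -> bytes, in GB

q4 = weight_size_gb(35, 4.5)    # ~4.5 bits/weight for Q4_K_M
fp16 = weight_size_gb(35, 16)
print(round(q4, 1), round(fp16, 1))  # 19.7 70.0
```

This is also why 24GB of VRAM is the comfortable floor for the Q4 build: ~20GB of weights plus a few GB of KV cache for a long agent context.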
Mid Tier: Mac Studio or dual GPU
- Mac Studio M4 Max with 64GB or 128GB unified memory
- Or dual RTX 3090/4090 (48GB total VRAM)
- Quantization: Q8 or FP8 for better quality
- Speed: ~20-40 tokens/second
- Cost: ~$2,000-4,000
Apple Silicon's unified memory is ideal for MoE models. The full model weights sit in memory, but only the active experts consume GPU compute. A Mac Studio M4 Max with 64GB can run the model at higher quantization with excellent throughput.
Production Tier: Server deployment
- GPU: NVIDIA A100 80GB or H100 80GB
- Or: 2x RTX 4090 (48GB VRAM) with tensor parallelism
- Framework: vLLM or SGLang for production serving
- Quantization: FP16 (unquantized) or FP8
- Speed: 50+ tokens/second, supports multiple concurrent sessions
- Cost: ~$5,000-25,000+ depending on configuration
For teams running multiple agents simultaneously, vLLM with tensor parallelism across multiple GPUs delivers production-grade throughput. The Hugging Face model page includes specific vLLM deployment commands.
Bare Minimum: CPU-only with KTransformers
- CPU: Modern multi-core (16+ cores recommended)
- RAM: 48GB+ system RAM
- Framework: KTransformers (CPU-GPU heterogeneous deployment)
- Speed: ~3-8 tokens/second
- Cost: $0 if you have a decent desktop
The Qwen team specifically recommends KTransformers for resource-constrained environments. It offloads parts of the model to CPU while keeping active experts on GPU. Slow, but functional.
The Multimodal Bonus
Qwen3.6-35B-A3B is not just a text model. It includes a vision encoder that handles images, documents, video, and spatial reasoning natively. On MMMU (multimodal understanding), it scores 81.7, outperforming Claude Sonnet 4.5 at 79.6. For an OpenClaw agent that receives photos via Telegram or processes document attachments from emails, this means you do not need a separate vision model.
Thinking Preservation: Built for Agent Workflows
One feature that specifically benefits long-running agents is Thinking Preservation. By default, reasoning traces (the model's internal chain-of-thought) are discarded after each response. Qwen3.6 can retain these traces across conversation turns, which improves decision consistency in multi-step agent workflows.
For an OpenClaw agent executing a complex task like "research these 10 leads, build prototypes, draft outreach emails, and update the CRM," preserved thinking means the model maintains context about why it made earlier decisions. This reduces redundant reasoning and improves KV cache efficiency.
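Mechanically, preservation just means the reasoning trace travels with the message history instead of being stripped before the next call. A schematic sketch of the idea; the `reasoning` field name is illustrative, and the real key depends on the chat template and serving framework:

```python
def add_turn(history, role, content, reasoning=None, preserve_thinking=True):
    """Append a turn; with preservation on, the model's reasoning trace
    stays in the history rather than being discarded after the reply."""
    turn = {"role": role, "content": content}
    if reasoning and preserve_thinking:
        turn["reasoning"] = reasoning  # dropped by default in most stacks
    history.append(turn)

history = []
add_turn(history, "user", "Research lead #1 and score it")
add_turn(history, "assistant", "Lead #1 scores 8/10.",
         reasoning="Scored high because the company matches our target size.")
add_turn(history, "user", "Now draft the outreach email")
# The next model call can see *why* the earlier score was given:
print(sum("reasoning" in t for t in history))  # 1
```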
Cost Comparison: Local vs Cloud API
Running Qwen3.6-35B-A3B locally with OpenClaw compared to equivalent cloud API usage:
- Local (RTX 3090, one-off): $800 hardware + ~$15/month electricity. Unlimited tokens. Zero marginal cost per agent session.
- Claude Sonnet API: $3 per million input tokens, $15 per million output tokens. A busy agent processing 50K tokens/day costs ~$100-300/month.
- GPT-4o API: $2.50 per million input tokens, $10 per million output tokens. Similar monthly costs.
A self-hosted agent running on Qwen3.6-35B-A3B pays for the hardware in 3-6 months of avoided API costs. After that, it is essentially free to run.
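You can sanity-check that payback window yourself from the figures above ($800 GPU, ~$15/month electricity, $100-300/month of avoided API spend):

```python
def breakeven_months(hardware_cost, electricity_per_month, api_bill_per_month):
    """Months until the one-time hardware cost is recovered by the
    net monthly saving versus a cloud API bill."""
    net_saving = api_bill_per_month - electricity_per_month
    return hardware_cost / net_saving

print(round(breakeven_months(800, 15, 300), 1))  # busy agent: ~2.8 months
print(round(breakeven_months(800, 15, 100), 1))  # light usage: ~9.4 months
```

Mid-range usage lands squarely in the 3-6 month bracket quoted above.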
What This Model Cannot Do (Yet)
Honest limitations:
- Tool use reliability is not yet at the level of Claude or GPT-4o. Complex multi-tool chains may need retry logic.
- Long context performance degrades at the extremes of the 262K window. For most agent sessions (10-50K tokens), this is not an issue.
- English-centric fine-tuning means performance on other languages, while decent, is not as strong as the multilingual proprietary models.
- No streaming tool calls yet in some frameworks. Check vLLM/SGLang compatibility for your specific use case.
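For the retry logic mentioned above, a plain wrapper is often enough. A generic sketch, not tied to OpenClaw or any particular framework:

```python
import time

def with_retries(fn, attempts=3, backoff_seconds=0.0):
    """Call fn(); on failure, retry up to `attempts` times in total,
    then re-raise the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as err:  # narrow this to your tool-call error type
            last_error = err
            time.sleep(backoff_seconds)
    raise last_error

# Simulate a tool call that emits malformed output twice, then succeeds.
calls = {"count": 0}
def flaky_tool():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("malformed tool-call JSON")
    return "ok"

print(with_retries(flaky_tool))  # ok
```

In a real deployment you would add exponential backoff and only retry on the specific parse or validation errors the model actually produces.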
How to Get Started
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Pull the model:
ollama pull qwen3.6:35b-a3b
- Install OpenClaw:
npm install -g openclaw@latest
- Run onboarding and point it at your local Ollama instance:
openclaw onboard
- Connect a channel (Telegram is fastest) and start chatting with your agent
The whole setup takes about 30 minutes on a machine with an RTX 3090 or better.
The Bottom Line
Qwen3.6-35B-A3B is the first open-weight model that makes self-hosted AI agents genuinely practical for daily business operations. The MoE architecture means you get strong reasoning and coding ability at a fraction of the compute cost of equivalent dense models. Combined with OpenClaw's agent platform (multi-channel messaging, skills, memory, cron, sub-agents), you can build a 24/7 autonomous assistant that runs on consumer hardware with zero ongoing API costs.
For solo developers and small teams who have been watching the AI agent space but balking at cloud API bills, this is your moment. The hardware pays for itself in months. The model is Apache 2.0 licensed for commercial use. And the agent infrastructure is free and open-source.
Frequently Asked Questions
What does A3B mean in Qwen3.6-35B-A3B?
A3B means approximately 3 billion parameters are activated per token during inference, out of 35 billion total parameters. This Mixture of Experts design gives you the reasoning capacity of a larger model at the compute cost of a much smaller one.
Can I run Qwen3.6-35B-A3B on a single GPU?
Yes. On an RTX 3090 or RTX 4090 (24GB VRAM), the model runs at Q4 quantization with room for context. You get roughly 15-25 tokens per second, which is fast enough for interactive agent sessions.
How does Qwen3.6-35B-A3B compare to Claude Sonnet for agent tasks?
On coding benchmarks like SWE-bench, Qwen3.6-35B-A3B is competitive. For complex multi-tool agent workflows, Claude Sonnet still has an edge in reliability. The advantage of Qwen3.6 is zero marginal cost per session and full data privacy.
Is OpenClaw free?
Yes. OpenClaw is open-source under the MIT license. Combined with a local model like Qwen3.6-35B-A3B running on Ollama, your only cost is the hardware and electricity.
What context length does Qwen3.6-35B-A3B support?
The native context length is 262,144 tokens (262K), extensible to over 1 million tokens using YaRN scaling. This is more than enough for complex agent sessions with full conversation history and tool outputs.