What does Flowtivity do?

Flowtivity is an AI and automation consultancy that helps businesses redesign how work gets done. We combine AI education, custom workflow automation, and AI agent development so businesses become faster, lighter, and easier to run.

How much does Flowtivity cost?

Flowtivity offers a free website scan for automation readiness. Simple AI tools start from $250-600/month. Custom automation projects start from $1K setup + $400-1K/month. Comprehensive implementations from $9K-50K+.

What industries does Flowtivity serve?

Flowtivity serves SMB and mid-market businesses across healthcare, professional services, construction, finance, retail, and education sectors on the Gold Coast and across Australia.

Does Flowtivity offer free tools?

Yes. Flowtivity Insights is a free AI website scanner that provides an automation readiness score (0-100), AEO score (0-100), and personalized AI opportunity report. No sign-up required.

How long does DeepSeek V4 Flash take to start on dual DGX Sparks?

Approximately 9 minutes from docker run to first API response. Weight loading takes about 3 minutes and CUDA graph capture takes about 4 minutes. Once warm, the model responds at 41 tok/s consistently.

We Ran DeepSeek V4 Flash at 1M Context on Two NVIDIA DGX Sparks. Here is What Happened.

Last Updated: June 10, 2026

Two NVIDIA DGX Sparks. A 200Gbps QSFP direct link. DeepSeek V4 Flash, a 284-billion parameter mixture-of-experts model. And one ambitious goal: 1 million tokens of context running on-prem, with no cloud dependency.

We pulled it off. Here's the full recipe, real benchmarks, and everything that broke along the way.

What We Built

We deployed DeepSeek V4 Flash (284B MoE, FP8 quantized) across two NVIDIA DGX Sparks using tensor parallelism (TP=2) connected via 200Gbps RoCEv2 over a direct QSFP cable. The result: 41 tokens per second at 1 million token context length, with 100% tool-call accuracy and coherent output across every test.

The key point is that dual-node tensor parallelism transforms the DGX Spark from a capable single-GPU inference box into a genuine platform for running frontier-scale models at production-relevant speeds. One Spark alone runs this model at 12-15 tok/s with 131K context. Two Sparks together deliver 3x the speed and 7.6x the context window.

Why This Matters

Most discussions about running 100B+ models focus on cloud APIs or massive multi-GPU servers. The DGX Spark changes the equation. It's a compact Grace-Blackwell system with ~120GB of unified memory per node. But for truly large models like DeepSeek V4 Flash (284B parameters), a single Spark isn't enough for both the model weights and a useful context window.

By connecting two Sparks with a 200Gbps QSFP link and using vLLM's tensor parallelism, we split the model across both GPUs while keeping latency manageable. This is edge-scale AI infrastructure with no data center, no cloud egress costs, no API rate limits.

In summary: if you have two DGX Sparks and a QSFP cable, you can run one of the world's most capable open models at 1M context entirely on-premises.

How to Run Your AI Agent for Free

This is the question most people are really asking. Yes, you can run a fully capable AI agent like OpenClaw or Hermes on local hardware with zero API costs. We tested it, and it works surprisingly well.

Because our dual-Spark deployment exposes an OpenAI-compatible API endpoint, any tool that supports the OpenAI API format can point at it. That includes OpenClaw (our preferred agent framework) and Hermes-style agents. You simply set the base URL to your local endpoint instead of api.openai.com.

The setup:

Deploy the model (using the recipe from tonyd2wild's repo)
Point your agent's API base URL to your local endpoint (e.g. http://your-head-node-ip:8000/v1)
Set the model name to deepseek-v4-flash (or whatever alias you configured)
Use any non-empty string as the API key (we use local)

What works:

Tool and function calling at 100% accuracy
Long-context tasks like full codebase analysis and multi-document summarization
Reasoning and step-by-step problem solving
Streaming responses for real-time agent interaction

What to watch for:

At 41 tok/s, it's slower than GPT-4 class cloud APIs, but fast enough for most agent workflows
No built-in moderation or safety filters, so you handle that in your agent layer
The model needs ~9 minutes to start up, so plan for that on reboots
6 concurrent sequences max at 1M context, so scale your parallelism accordingly

The key insight is that "free" doesn't mean "limited." With 1M context and full tool calling, a local DeepSeek V4 Flash deployment gives you capabilities that would cost hundreds of dollars per month through cloud APIs. For teams running agents 24/7 (monitoring, automation, research), the hardware pays for itself quickly.

Our Hardware Setup

Head node (spark-f889) + Worker node (spark-e39e)

SoC: NVIDIA DGX Spark (GB10, Grace-Blackwell, sm_121) on both nodes
Unified memory: ~120GB per node (CPU + GPU shared)
Interconnect: 200Gbps QSFP direct link with RoCEv2 (GID index 3)
Driver: 580.142 (matched on both)
Kernel: 6.17.0-1018-nvidia (matched on both)

The most important factor for dual-node inference is the interconnect. The 200Gbps QSFP link carries both RDMA traffic (NCCL tensor communication) and TCP control plane traffic on the same cable. This eliminates the need for a separate Ethernet network.

Model Configuration

What we deployed:

Model: deepseek-ai/DeepSeek-V4-Flash (284B MoE)
Quantization: FP8 (E4M3 block-scaled dense layers + MXFP4 MoE experts)
vLLM version: 0.21.1rc1.dev339
MoE backend: B12X MXFP4
Dense backend: DeepGEMM with UE8M0 scale format
Speculative decoding: MTP with 2 speculative tokens
KV cache: FP8, block size 256, Lightning Indexer
Tensor parallelism: 2 (one GPU per Spark)
Max context: 1,000,000 tokens
Max concurrent sequences: 6
GPU memory utilization: 0.82

Startup time: Approximately 9 minutes from docker run to first response. GPU memory settles at 79,179 MiB on both nodes, perfectly balanced.

KV cache allocation: 14.77 GiB per node, supporting 2,158,261 total tokens with 2.16x concurrency at 1M context.

Benchmark Results

Decode Speed

Cold start: 41.8 tok/s (1,024 completion tokens, 22 prompt tokens)

Warm decode: 41.1 tok/s (1,024 completion tokens, 16 prompt tokens)

These are stable, repeatable numbers. The model runs at approximately 41 tokens per second during sustained generation after initial warmup.

Context Window Scaling

We tested the full context range to validate the 1M claim:

64K context: 31.3 seconds to first token
256K context: 47.0 seconds to first token
512K context: 334.4 seconds to first token
800K context: 288.1 seconds to first token

All tests produced correct, coherent output. The model genuinely handles context lengths up to 1 million tokens.

Tool Calling

Parallel weather lookup with two cities: 2/2 correct (100%)

The model called get_weather with the right parameters for both Tokyo and New York simultaneously. Tool calling works reliably at any context length.

Reasoning

Bat-and-ball problem ($1.10 total, bat costs $1.00 more than ball): Correct. Answered $0.05 with step-by-step derivation.

The Trade-off

Upgrading from 262K to 1M context cost approximately 12% decode speed (46 tok/s down to 41 tok/s) for a 3.8x context increase (262K to 1M). That's an excellent trade. You barely feel the speed difference, but you gain the ability to process entire codebases, document collections, or conversation histories in a single prompt.

How It Compares

DeepSeek V4 Flash TP=2 (this setup) vs. our other models:

3x faster than single-Spark DeepSeek V4 Flash (12-15 tok/s to 41 tok/s)
55% faster than Step 3.7 Flash GGUF on llama.cpp (29.5 tok/s to 41 tok/s)
Comparable to AEON-27B in usability, but DS4 V4 Flash is a 284B MoE, far more capable for complex reasoning, coding, and long-context tasks

The most important comparison is single-Spark vs. dual-Spark for the same model. Going from one DGX Spark to two gives you triple the speed and nearly 8x the context. That's the difference between a toy demo and a production workload.

Credits and References

This deployment builds on the excellent work by tonyd2wild, who published the full launch recipe and Docker configuration for running DeepSeek V4 Flash across dual DGX Sparks. If you want to try this yourself, the complete setup guide and scripts are available at the GitHub repo.

Full recipe and launch scripts: github.com/tonyd2wild/deepseek-v4-flash-2x-spark-1m

What Broke (And How We Fixed It)

This wasn't plug-and-play. Here are the three issues that cost us the most time.

1. HuggingFace Cache Symlink Crash

What happened: The head node crashed with a Pydantic validation error. The Docker container couldn't resolve symlinks in the HuggingFace model cache because the default mount path (/cache/huggingface/) didn't match the absolute paths stored in the HF cache symlinks (/home/aj/.cache/huggingface/...).

The fix: Mount the cache at the same absolute path the symlinks expect:

# Correct approach: preserve absolute path
-v "$HOME/.cache/huggingface:$HOME/.cache/huggingface" \
-e HF_HOME=$HOME/.cache/huggingface

This took embarrassingly long to diagnose. The error message pointed at "invalid repository ID" when the real issue was symlink resolution inside the container.

2. QSFP Carries Both RDMA and TCP

What happened: The launch script assumed a separate Ethernet NIC for the TCP control plane (enp1s0f0np0). On our Sparks, that interface was DOWN. All traffic goes over the QSFP cable.

The fix: Route both RDMA and TCP over the QSFP interfaces:

-e NCCL_IB_HCA=rocep1s0f1
-e NCCL_SOCKET_IFNAME=enp1s0f1np1
-e GLOO_SOCKET_IFNAME=enp1s0f1np1
-e TP_SOCKET_IFNAME=enp1s0f1np1

3. SSH Username and Host Key Mismatch

What happened: The default script assumed a different SSH username and didn't handle first-connection host key verification.

The fix: Correct username to aj and add StrictHostKeyChecking=no for initial connections. Run ssh-keygen -R to clear stale entries.

Startup Timeline Breakdown

From docker run to first API response takes approximately 9 minutes:

Model validation: ~15 seconds (Pydantic config, quantization detection)
Weight loading: 181 seconds (148.66 GiB checkpoint, 46 safetensor shards)
MTP draft model: 29 seconds (39 parameters, shares embedding layers)
torch.compile (backbone): 16 seconds (Inductor graph)
torch.compile (eagle_head): 3 seconds (MTP head)
TileLang JIT kernels: ~15 seconds (custom CUDA kernels for MoE)
CUDA graph profiling: ~30 seconds
CUDA graph capture: ~4 minutes (multiple sizes for piecewise and full modes)
Uvicorn startup: ~5 seconds

If you see No available shared memory broadcast block found in 60 seconds during CUDA graph capture, that's normal, not an error.

Key Takeaways

For teams considering dual DGX Spark deployments:

The 200Gbps QSFP interconnect is sufficient for TP=2 inference at production speeds
RoCEv2 works well but requires careful NIC configuration. Don't assume defaults match your hardware
HuggingFace cache mounts must preserve absolute symlink paths inside containers
The 12% speed trade-off for 1M context is absolutely worth it
Boot persistence via systemd means the cluster survives restarts with zero manual intervention

The bottom line: Two DGX Sparks + one QSFP cable = a legitimate on-prem platform for frontier AI models at scale. No cloud. No API limits. No data leaving your network.

Frequently Asked Questions

Can a single DGX Spark run DeepSeek V4 Flash?

Yes, but with significant limitations. A single Spark runs DeepSeek V4 Flash at approximately 12-15 tok/s with a maximum context of 131K tokens using IQ2_XXS quantization. Dual-Spark tensor parallelism provides 3x the speed and 7.6x the context window.

How do I run an AI agent for free on local hardware?

Deploy a model like DeepSeek V4 Flash using vLLM with an OpenAI-compatible API endpoint, then point your agent framework (OpenClaw, Hermes, or any OpenAI-compatible client) to your local endpoint instead of a cloud API. You get full tool calling, long context, and zero per-token costs. The trade-off is slower speed and the upfront hardware investment.

What interconnect speed is needed for dual-node inference?

We used a 200Gbps QSFP direct link with RoCEv2, and it proved more than adequate. The link carries both NCCL RDMA traffic and TCP control plane on the same cable. You don't need InfiniBand. RoCEv2 over QSFP works reliably.

How much memory does 1M context require?

At 1M context with FP8 KV cache, each node allocates approximately 14.77 GiB for the KV cache. Total GPU memory settles at 79,179 MiB per node with 0.82 GPU memory utilization. This supports 2,158,261 total KV cache tokens with 2.16x concurrency.

Does tool calling work at full 1M context?

Yes. We tested parallel tool calling at multiple context lengths and achieved 100% accuracy. The model correctly structured function calls and parameters even at 800K+ prompt tokens.

How long does startup take?

Approximately 9 minutes from docker run to first API response. The bulk of that time is weight loading (~~3 minutes) and CUDA graph capture (~~4 minutes). Once warm, the model responds at 41 tok/s consistently.

We Ran DeepSeek V4 Flash at 1M Context on Two NVIDIA DGX Sparks. Here is What Happened.

We Ran DeepSeek V4 Flash at 1M Context on Two NVIDIA DGX Sparks. Here is What Happened.

What We Built

Why This Matters

How to Run Your AI Agent for Free

Our Hardware Setup

Model Configuration

Benchmark Results

Decode Speed

Context Window Scaling

Tool Calling

Reasoning

The Trade-off

How It Compares

Credits and References

What Broke (And How We Fixed It)

1. HuggingFace Cache Symlink Crash

2. QSFP Carries Both RDMA and TCP

3. SSH Username and Host Key Mismatch

Startup Timeline Breakdown

Key Takeaways

Frequently Asked Questions

Can a single DGX Spark run DeepSeek V4 Flash?

How do I run an AI agent for free on local hardware?

What interconnect speed is needed for dual-node inference?

How much memory does 1M context require?

Does tool calling work at full 1M context?

How long does startup take?

You might also like

The AI Employee Factory: Building Specialised AI Workers for Construction with OpenClaw

Token Value Per Watt: The AI Efficiency Methodology for Growing Businesses

AI's Biggest Winners Have the Lowest Margins

Want AI insights for your business?