
Running a 284B AI Model on Your Desk: Our Real-World DSpark Deployment Log

We deployed DeepSeek V4 Flash with DSpark speculative decoding on 2x NVIDIA DGX Spark boxes. 49 tok/s, 1M token context, 6-way concurrency, zero API bill. Real numbers, real bugs, real fixes.

3 July 2026 · 10 min read

Last Updated: July 3, 2026

We just finished migrating our local AI infrastructure to DSpark (DeepSpec speculative decoding) running DeepSeek V4 Flash on two NVIDIA DGX Spark units. This is a real deployment log - every number below was measured on production hardware, not a vendor demo. This is the kind of content that doesn't exist in marketing materials: what actually happens when you try to run a frontier-class model locally.


What Is DSpark and Why Does It Matter?

DSpark (DeepSpec speculative decoding) is a technique where a small draft model predicts multiple tokens ahead, the large model verifies them in a single forward pass, and accepted tokens deliver "free" throughput. The result: a 284-billion-parameter DeepSeek V4 Flash model runs at interactive speeds on two desktop-sized DGX Spark boxes - approximately 49 tokens per second on realistic agent traffic, with a 1-million-token context window and 6-way concurrency. All self-hosted, all free after hardware.

The key difference from older speculative decoding methods (like MTP): DSpark's draft path is tightly coupled to the target model's KV cache with a request-stable slot mapping that makes it safe under concurrent batching. The draft model shares embeddings and attention state with the target, so drafting adds nearly zero overhead.
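To make the draft-then-verify idea concrete, here is a minimal toy sketch of generic greedy speculative decoding. It is not DSpark's implementation: the "models" are plain Python functions (`toy_target` and `toy_draft` are made up for illustration), verification is looped rather than batched into one forward pass, and nothing here models the KV-cache coupling described above.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly emitted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # 2. The target checks each proposed position. A real engine scores all
    #    k positions in one batched forward pass; this toy loops instead.
    emitted: List[int] = []
    for guess in proposed:
        expected = target(tokens + emitted)
        if guess != expected:
            # First mismatch: keep the target's token, discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(guess)

    # 3. All k drafts accepted: the verify pass also yields one bonus token.
    emitted.append(target(tokens + emitted))
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify steps it took."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
        steps += 1
    return tokens[:len(prompt) + max_new], steps


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except right after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

The output always matches what the target would have produced alone; the draft only changes how many target passes are needed to get there.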


The Hardware: Two Black Boxes on a Desk

This is not a datacenter setup. It is two small black boxes connected by cables, sitting on a desk, serving a model larger than GPT-3.

  • Compute: 2x NVIDIA DGX Spark (GB10 Grace-Blackwell, sm_121) - one GPU per node
  • Memory: ~128 GiB unified CPU+GPU per node
  • Interconnect: 200 Gb/s QSFP RoCE/RDMA link between both nodes (two cables, both carrying traffic at 50/50 split)
  • What it runs: DeepSeek V4 Flash (284B MoE) split across both GPUs via tensor-parallel (TP=2)

Measured Performance: The Real Numbers

Speed

  • ~49 tok/s single-stream on realistic agent content (code, prose, multi-step reasoning)
  • Up to ~80 tok/s on predictable content (code completion, listing) where spec-decode acceptance is highest
  • ~182 tok/s aggregate at 6-concurrent (measured from the repo's validation suite)

For comparison: the same model on a single Spark without tensor-parallelism ran at 12-15 tok/s. At ~49 tok/s, DSpark with TP=2 is roughly 3-4x faster than that single-node baseline, a gain that combines tensor parallelism with the spec-decode layer.
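Why throughput swings so much with content type can be seen from a simplified textbook model of speculative decoding. The sketch below assumes each draft token is accepted independently with the same probability, and it ignores drafting overhead; both are simplifications, and the acceptance rates you plug in are illustrative inputs, not values measured on this deployment.

```python
def expected_tokens_per_step(accept_prob: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes k draft tokens per step, each accepted independently with
    probability accept_prob, plus the one token the target always
    contributes (a correction on mismatch, a bonus on full acceptance).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if accept_prob == 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_prob ** (k + 1)) / (1.0 - accept_prob)
```

Under this model, 3 draft tokens yield at most 4 tokens per pass when every draft is accepted, and fall back to 1 token per pass when none are, which is why predictable content benchmarks so much faster than diverse agent traffic.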

Context Window

  • 1,048,576 tokens (1M) - verified live via the /v1/models endpoint
  • ~2.0M-token KV pool - fits multiple long sessions concurrently
  • The headline feature: load an entire codebase, full transcripts, or long document sets into a single request. No RAG chunking, no "context too long" errors, no per-token cost for having a massive window.
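A check like the `/v1/models` verification mentioned above could be scripted along these lines. This is a sketch under assumptions: it relies on the server reporting a `max_model_len` field per model entry (as vLLM's OpenAI-compatible server does in the versions we are aware of), and the base URL you pass to `fetch_models` is whatever your deployment uses.

```python
import json
from urllib.request import urlopen

EXPECTED_CONTEXT = 1_048_576  # 1M tokens, the figure reported above


def context_lengths(models_payload: dict) -> dict:
    """Map each served model id to its advertised max_model_len (or None)."""
    return {
        entry["id"]: entry.get("max_model_len")
        for entry in models_payload.get("data", [])
    }


def fetch_models(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/v1/models and return the parsed JSON body."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


def check_context(base_url: str) -> bool:
    """True if at least one model is served and all advertise the expected window."""
    lengths = context_lengths(fetch_models(base_url))
    return bool(lengths) and all(v == EXPECTED_CONTEXT for v in lengths.values())
```

Keeping the parsing separate from the HTTP call makes the check easy to test against a saved response.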

Concurrency

  • 6 simultaneous requests (max_num_seqs=6)
  • Tested clean: 6 brand-new sessions firing their first prompts at the exact same moment. All 6 returned coherent, correct output.
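One way to reproduce a "same moment" cold-start test is to hold every worker at a barrier and release them together. The sketch below is a generic harness, not the script used for the test above; `send` is a hypothetical callable you supply that posts one prompt to your endpoint and returns the reply text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fire_simultaneously(send: Callable[[str], str],
                        prompts: List[str]) -> List[str]:
    """Send all prompts at (as near as possible) the same instant.

    Returns the replies in the same order as the prompts.
    """
    if not prompts:
        return []
    barrier = threading.Barrier(len(prompts))

    def worker(prompt: str) -> str:
        barrier.wait()  # block until every worker thread is ready
        return send(prompt)

    # One thread per prompt, so no worker is left queued behind the barrier.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(worker, prompts))
```

With six distinct first prompts, each reply can then be checked for coherence independently.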

Cost

  • Free. Self-hosted. No API bill. The only cost is electricity.

The Technical Stack (Bottom to Top)

The software stack that makes this work is entirely open source:

  • vLLM (v0.21.1rc1.dev339) with a custom GB10/DSpark overlay
  • B12X Mxfp4 MoE backend - the fused MoE kernel that makes the 284B model feasible on GB10 silicon
  • NVFP4 KV cache (nvfp4_ds_mla) - quantized KV that fits 1M context in ~12 GiB of pool
  • DSpark proposer - 3 speculative tokens, probabilistic drafting (post-garble-fix)
  • Lightning Indexer with FP8 indexer cache - sparse-MLA attention for DeepSeek V4
  • FlashInfer sampler - enabled after the garble bug fix
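For orientation, the settings named in this post map onto server arguments roughly as sketched below. Treat this as an illustration only: the flag names are standard vLLM CLI options as far as we know, but `nvfp4_ds_mla` comes from the custom overlay, the speculative `method` value is deliberately left as a caller-supplied parameter because its name is overlay-specific, and multi-node wiring for TP=2 is omitted entirely.

```python
import json
from typing import List


def build_serve_args(model: str, spec_method: str) -> List[str]:
    """Assemble a `vllm serve` argument list from the values in this post.

    model       -- path or id of the checkpoint (placeholder, supplied by you)
    spec_method -- name of the speculative-decoding method; overlay-specific
                   and NOT given in the post, so it must be supplied by you
    """
    speculative = {
        "method": spec_method,
        "num_speculative_tokens": 3,  # 3 draft tokens, per the garble fix
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",           # TP=2 across both Sparks
        "--max-model-len", str(1_048_576),       # 1M-token context
        "--max-num-seqs", "6",                   # 6-way concurrency
        "--kv-cache-dtype", "nvfp4_ds_mla",      # overlay-provided dtype
        "--speculative-config", json.dumps(speculative),
    ]
```

Building the arguments as a list rather than a shell string avoids quoting problems with the JSON value.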

The Bugs We Hit (And How We Fixed Them)

This is the section vendor demos never show you.

The Garble Bug

Symptom: The first configuration used 5 greedy draft tokens. Under concurrent agent traffic, it occasionally produced gibberish: repeated characters, Chinese character drift, leaked tool-schema XML, or looping text.

Root cause: A DSpark spec-decode cold-start mismatch. The greedy draft would commit to a bad token run on a new session's first prompt and poison the KV state for the rest of that session.

The fix: Three changes, all required:

  1. Drop to 3 probabilistic draft tokens (down from 5 greedy)
  2. Remove the repetition_penalty server override (which was also a spec-decode crash risk)
  3. Enable the FlashInfer sampler

Result: Gibberish eliminated, verified across 6 concurrent cold-starts with tool-call integrity checks. The trade-off: 3 draft tokens run at ~49 tok/s instead of a theoretical ~60 tok/s, but the configuration is stable. For production agents, stability wins.
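Checks of this kind can be automated with simple heuristics. The sketch below is hypothetical and far from a complete validator: it only flags long character runs and looping phrases, and verifies that a tool call's arguments parse as a JSON object with the required keys. It does not detect script drift or leaked schema XML, and the thresholds are arbitrary defaults.

```python
import json
import re
from typing import List, Set


def _max_consecutive_repeats(words: List[str], n: int) -> int:
    """Largest number of back-to-back repeats of any n-word phrase."""
    best = 1
    for start in range(len(words) - n + 1):
        gram = words[start:start + n]
        count, pos = 1, start + n
        while words[pos:pos + n] == gram:
            count += 1
            pos += n
        best = max(best, count)
    return best


def looks_garbled(text: str, max_char_run: int = 12,
                  max_phrase_repeats: int = 4) -> bool:
    """Heuristic: True for long single-character runs or looping phrases."""
    # A non-space character repeated max_char_run times or more in a row.
    if re.search(r"(\S)\1{%d,}" % (max_char_run - 1), text):
        return True
    words = text.split()
    # A 1- to 4-word phrase repeated back-to-back too many times.
    return any(
        _max_consecutive_repeats(words, n) >= max_phrase_repeats
        for n in range(1, 5)
    )


def tool_call_is_intact(arguments_json: str, required: Set[str]) -> bool:
    """True if the arguments parse as a JSON object with all required keys."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return False
    return isinstance(args, dict) and required <= set(args)
```

Running both checks on every reply from a concurrent cold-start test gives a pass/fail signal without reading each transcript by hand.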

The Throughput Measurement Trap

Symptom: An early benchmark showed "80 tok/s" and we got excited.

Root cause: The benchmark task was "count 1 to 200" - highly predictable content where spec-decode acceptance approaches 100%. Real agent traffic (diverse prose, novel code, reasoning chains) runs at ~49 tok/s. The streaming measurement methodology was also flawed.

Lesson: Always benchmark on diverse, realistic content. Do not trust numbers from predictable tasks.
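The post does not spell out what was wrong with the streaming measurement, so the following is only a sketch of one defensible methodology, not a reconstruction of the fix. It assumes you take the token count from the server's reported usage rather than counting stream chunks (a chunk may carry more than one token under speculative decoding), and that you separate decode speed from time spent waiting for the first token.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    request_sent: float     # seconds, e.g. from time.monotonic()
    first_token: float      # arrival of the first streamed token
    last_token: float       # arrival of the final streamed token
    completion_tokens: int  # from the server's usage report, not chunk count


def decode_tok_per_s(t: StreamTiming) -> float:
    """Decode-only throughput: excludes prefill / time-to-first-token."""
    window = t.last_token - t.first_token
    if window <= 0 or t.completion_tokens < 2:
        raise ValueError("need at least two tokens and a positive window")
    # The first token marks the start of the window, so it is not counted.
    return (t.completion_tokens - 1) / window


def end_to_end_tok_per_s(t: StreamTiming) -> float:
    """Throughput as a caller experiences it, including prefill."""
    total = t.last_token - t.request_sent
    if total <= 0:
        raise ValueError("non-positive duration")
    return t.completion_tokens / total
```

Reporting both figures, across a mix of prose, code, and reasoning prompts, makes it clear which number a headline "tok/s" refers to.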

The Engine Crash

Symptom: One RuntimeError: cancelled in the shared-memory comm path under heavy load.

Reality: Self-hosted means self-operated. You need monitoring, auto-restart, and a cloud fallback for resilience. This is the price of not paying OpenAI.
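A minimal watchdog for the monitoring and auto-restart part might look like the sketch below. It is an illustration, not our production tooling: `restart` is a hypothetical callable you provide (for example, one that restarts your service unit), and the HTTP probe assumes the server exposes a health endpoint that returns 200 when alive, as vLLM's `/health` route does.

```python
import time
from typing import Callable, Optional
from urllib.request import urlopen


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP errors
        return False


def watchdog(probe: Callable[[], bool],
             restart: Callable[[], None],
             max_failures: int = 3,
             interval_s: float = 10.0,
             max_checks: Optional[int] = None,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll probe(); call restart() after max_failures consecutive failures.

    Runs forever unless max_checks is set. Returns the number of restarts.
    """
    failures = restarts = checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval_s)
    return restarts
```

Requiring several consecutive failures before restarting avoids bouncing the engine on a single slow response under heavy load.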

The GPU Conflict

Symptom: The head node was already running a Step-3.7 llama.cpp server (102 GiB). DSpark needed 97 GiB and only 1.36 GiB was free.

Fix: Stop one, start the other. A DGX Spark's ~128 GiB of unified memory fits either DSpark (97 GiB) or the 102 GiB llama.cpp server, not both. This is a real operational constraint if you run multiple model tiers.
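The arithmetic behind this conflict is simple enough to turn into a pre-flight check. The helper names below are hypothetical, and the figures in the usage note are the ones quoted above.

```python
def fits(required_gib: float, free_gib: float, headroom_gib: float = 0.0) -> bool:
    """True if a workload needing required_gib fits in free_gib with headroom."""
    return required_gib + headroom_gib <= free_gib


def free_after_stopping(free_gib: float, stopped_workload_gib: float) -> float:
    """Memory expected to be free once another workload is stopped."""
    return free_gib + stopped_workload_gib
```

With 1.36 GiB free, `fits(97, 1.36)` is false; after stopping the 102 GiB server, `free_after_stopping(1.36, 102)` gives about 103 GiB, which is enough for the 97 GiB DSpark footprint.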


The Migration Story (Step by Step)

  1. Started on FP8/MTP - Prior deployment served DeepSeek V4 Flash with FP8 KV cache and MTP (2-token) speculative decoding at ~41 tok/s with 1M context. Solid but slower.

  2. Migrated to DSpark/NVFP4-KV - Built the Stage A/B/C runtime image. The NVFP4 KV path requires patching vLLM's cache dtype plumbing, attention probe, and the 584-byte padded envelope. Configured dual-Spark TP=2 over RoCE.

  3. Resolved GPU conflict - Stopped the existing Step-3.7 server to free unified memory for DSpark.

  4. Enabled dual QSFP cables - Assigned IPs, set MTU 9000, created RoCEv2 GIDs on both nodes, pointed NCCL at both HCAs. Verified 50/50 RDMA traffic split across both links. Decode is compute-bound so this did not change tok/s, but the engineering is correct.

  5. Fixed the measurement methodology - Re-benchmarked on diverse content to get accurate numbers.

  6. Applied the garble fix - The most critical fix in the whole migration. Verified across 6 concurrent cold-starts plus tool-call integrity validation.
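For step 4, pointing NCCL at both HCAs comes down to a handful of environment variables. The sketch below uses real NCCL variable names, but every value is a placeholder: HCA device names, the RoCEv2 GID index, and the interface name all depend on your system, and the balance check assumes you collect per-link byte counters yourself.

```python
from typing import Dict, List


def nccl_env(hcas: List[str], gid_index: int, socket_ifname: str) -> Dict[str, str]:
    """Environment variables telling NCCL which RDMA devices to use.

    hcas          -- RDMA device names, one per cable (system-specific)
    gid_index     -- GID table index of the RoCEv2 entry (system-specific)
    socket_ifname -- interface NCCL uses for its bootstrap traffic
    """
    if not hcas:
        raise ValueError("at least one HCA is required")
    return {
        "NCCL_IB_HCA": ",".join(hcas),
        "NCCL_IB_GID_INDEX": str(gid_index),
        "NCCL_SOCKET_IFNAME": socket_ifname,
    }


def is_balanced(bytes_a: int, bytes_b: int, tolerance: float = 0.05) -> bool:
    """True if traffic on two links is within tolerance of a 50/50 split."""
    total = bytes_a + bytes_b
    if total == 0:
        return False  # no traffic observed, so nothing has been verified
    return abs(bytes_a / total - 0.5) <= tolerance
```

Sampling the counters before and after a run and passing the deltas to `is_balanced` is one way to confirm both cables carry traffic.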


What This Actually Feels Like to Use

Loading a Whole Repo Into Context

With 1M tokens of context available at no marginal cost, you stop thinking about "will this fit." You paste the entire codebase, and the model reasons across files in a way that is impractical on metered cloud assistants, where every token in the context window is billed. This changes how you work.

Latency You Can Feel

Round-trip latency is ~1-2 ms over the LAN. Tool loops that make dozens of small calls, impractical on cloud where each round-trip costs money and adds latency, are near-instant and free. Agentic iteration becomes genuinely cheap.

A 284B Model on a Desk

This scale of model used to require an H100 cluster. The combination of NVFP4 KV cache, B12X MoE kernels, and DSpark speculative decoding collapsed the hardware requirement by roughly 100x.


The Failover Pattern We Use

Local-first, cloud-second. The local endpoint is free, has 1M context, and runs at ~49 tok/s. But it will go down for reboots, updates, and the occasional crash. Always configure a cloud fallback.

  • Primary: Local DSpark endpoint (free, 1M ctx, ~49 tok/s)
  • Secondary: Cloud API (always up, metered, usually 200K ctx limit)
  • API: OpenAI Chat Completions compatible - standard tools schema works
  • Tool calling: Verified, native, OpenAI-compatible
  • Streaming: Working
  • Vision: Not supported (text only)
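The local-first, cloud-second pattern can be sketched as a small router. This is an illustrative sketch rather than our actual client: `Backend` and its `call` field are hypothetical, each `call` is expected to raise on failure, and the context limits in the usage note are the figures listed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Backend:
    name: str
    max_context_tokens: int
    call: Callable[[str], str]  # sends the prompt, returns the reply, raises on error


def complete_with_failover(prompt: str, prompt_tokens: int,
                           backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try backends in order; return (backend name, reply) from the first success."""
    errors: List[str] = []
    for backend in backends:
        # Skip backends whose context window cannot hold the prompt.
        if prompt_tokens > backend.max_context_tokens:
            errors.append(f"{backend.name}: prompt exceeds context window")
            continue
        try:
            return backend.name, backend.call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{backend.name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Listing the local endpoint first with a 1,048,576-token limit and the cloud API second with its smaller limit makes one caveat explicit: a prompt that only fits the local window has no fallback when the local endpoint is down.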

Why This Matters for the Open-Source AI Stack

DSpark represents a genuine shift in what is possible with local AI infrastructure.

Frontier-class model, desktop hardware. A 284B MoE that needed an H100 cluster now runs on two DGX Sparks. The B12X MoE kernels plus NVFP4 KV cache plus DSpark spec-decode collapsed the hardware requirement dramatically.

1M context, free. The feature cloud providers charge a premium for costs nothing when you own the hardware. This changes how you build agents: load everything, do not retrieve.

Production-safe speculative decoding. Earlier spec-decode methods garbled text under concurrency. DSpark's request-stable KV slot mapping and probabilistic drafting made it reliable enough for real agent fleets.

The open-source stack caught up. vLLM, FlashInfer, and community patches built something that rivals closed serving platforms on niche hardware (GB10) with a model that is not even a year old.

The trade-offs are real: self-operated infrastructure, workload-dependent throughput, and bugs that need fixing. But the trajectory is clear. Local serving of frontier models is no longer experimental. It is a production option, and DSpark is the stack that made it viable.



This is a real deployment log from Flowtivity's local AI infrastructure. Every number was measured on production hardware. Written by AJ Awan, former EY management consultant and founder of Flowtivity, with research and editorial support from Flowbee, our AI growth agent.
