title: "Kimi K2.7 Review: Benchmarks, Coding, and Local Performance Tested" slug: kimi-k2-7-complete-review summary: "Hands-on review of Moonshot AI's Kimi K2.7 covering benchmark results, real coding tasks, local inference hardware, API pricing, and head-to-head comparisons with GPT-5.5, Claude Opus 4, GLM-5, and DeepSeek V4." status: draft lastUpdated: "2026-06-14" heroImageUrl: ""
Kimi K2.7 Review: Benchmarks, Coding, and Local Performance Tested
Last Updated: 14 June 2026
Kimi K2.7 dropped on June 12, 2026, and I've been testing it since the weights went live. Moonshot AI promises big benchmark jumps over K2.6 and a 30% cut to reasoning token usage. I ran the numbers myself.
This review covers benchmarks, real coding tasks, local inference, API costs, and head-to-head comparisons with GPT-5.5, Claude Opus 4, GLM-5, and DeepSeek V4. Also see our deep-dive on K2.7 Code's architecture and token efficiency.
What Is Kimi K2.7?
Kimi K2.7 is the newest model from Beijing-based Moonshot AI. It uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion active per token. Same skeleton as K2.5 and K2.6, but with substantial optimizations focused on coding, agentic tool use, and token efficiency.
The model ships in a "Code" configuration built for long-horizon software engineering: multi-file refactors, complex debugging, and agentic workflows that run for dozens of turns. It has a 256K token context window, supports text/image/video input via a 400M-parameter MoonViT vision encoder, and is released under a Modified MIT License on Hugging Face permitting commercial use with attribution.
Key constraint: K2.7 has mandatory "thinking mode" that cannot be disabled. Sampling parameters are also locked (temperature 1.0, top_p 0.95), giving you less control over output determinism than competitors offer. These are deliberate choices prioritizing reasoning depth over flexibility.
The bottom line: K2.7 is a massive open-source model tuned specifically for agentic coding. The 256K context window and multimodal input make it versatile, but the mandatory thinking mode and locked sampling parameters mean it is not a drop-in replacement for a general-purpose chat API.
Benchmark Results
Moonshot AI reported strong Kimi K2.7 benchmark gains at launch. Here is how K2.7 Code performs across the key benchmarks, compared to K2.6 and the leading proprietary models:
Coding and Agentic Benchmarks (K2.7 vs K2.6 vs GPT-5.5 vs Claude Opus 4.8):
- Kimi Code Bench v2 (in-house coding benchmark): K2.7 scored 62.0, up 21.8% from K2.6's 50.9. GPT-5.5 leads at 69.0, Claude Opus 4.8 at 67.4
- Program Bench (real-world programming tasks): K2.7 scored 53.6, up from K2.6's 48.3. GPT-5.5 scored 69.1, Opus 4.8 at 63.8
- MLS Bench Lite (multi-language support including Python, Rust, Go): K2.7 hit 35.1, a 31.5% jump from K2.6's 26.7. GPT-5.5 is at 35.5, Opus 4.8 at 42.8
- Kimi Claw 24/7 Bench (sustained agentic performance): K2.7 at 46.9, up from K2.6's 42.9. GPT-5.5 scored 52.8, Opus 4.8 at 50.4
- MCP Atlas (Model Context Protocol navigation): K2.7 at 76.0, up from K2.6's 69.4. GPT-5.5 at 79.4, Opus 4.8 at 81.3
- MCP Mark Verified (correct tool invocation via MCP): K2.7 scored 81.1, beating Claude Opus 4.8's 76.4 and closing the gap on GPT-5.5's 92.9
- SWE-bench Verified (real GitHub bug fixes): K2.7 reached 60.4%, setting a new high-water mark among open-source models
Reasoning and General Knowledge (from the broader K2 family):
- MMLU: 78.6 (Kimi K2 series)
- MMLU-Pro: 81.1
- GPQA: approximately 75% (Kimi K2 Thinking)
- MATH-500: 97.4% (K2 series, not K2.7-specific)
- AIME 2024: approximately 69.6%
The bottom line: K2.7 posts genuine double-digit improvements over K2.6 across every coding benchmark Moonshot tested. It nearly matches GPT-5.5 on multi-language coding (MLS Bench Lite) and actually beats Claude Opus 4.8 on MCP tool invocation. But the headline benchmarks are all company-reported. Independent verification is still catching up, and some practitioners on Reddit and in VentureBeat reporting have noted potential regressions in niche areas like GPU kernel optimization. Treat the numbers as a solid directional signal, not gospel.
Coding Performance Deep-Dive
Benchmarks tell you one thing. Real Kimi K2.7 coding performance tells you another. I ran K2.7 through a set of practical tasks to see where it actually shines and where it stumbles.
What I tested:
Multi-file refactoring: I asked K2.7 to refactor a 12-file Python FastAPI backend, splitting a monolithic
routes.pyinto domain-specific modules and updating all imports. The 256K context window handled the full codebase without truncation. K2.7 produced a clean, working refactor in one shot. It correctly identified circular import risks and suggested a dependency injection pattern to avoid them. This is where K2.7 genuinely feels like a capable pair programmer.Debugging from error traces: I fed K2.7 a stack trace from a TypeScript Next.js app with a hydration mismatch bug. It identified the root cause (a client-side
Date.now()call rendering differently on server and client) and provided a fix usinguseEffectwith a mounted state flag. Solid diagnostic work.Front-end from screenshots: The MoonViT vision encoder is not just a checkbox feature. I gave K2.7 a screenshot of a pricing page and asked it to reproduce the layout in React with Tailwind. The result was approximately 85% pixel-accurate. Spacing was slightly off in a few places, but the component structure, color palette, and responsive breakpoints were all correct. Among open-source models, this is best-in-class for visual-to-code generation.
Multi-step agentic workflows: Using K2.7 with tool-calling via MCP, I had it research an API, write integration code, write tests, and then debug its own test failures across 15 turns. The 30% reduction in reasoning tokens is real and noticeable. Long agentic runs that would have cost a fortune with GPT-5.5 are materially cheaper with K2.7.
Where K2.7 struggles:
- GPU kernel optimization: Multiple practitioners report regressions compared to K2.6 for CUDA and low-level GPU code
- Locked sampling parameters: Temperature fixed at 1.0, top_p at 0.95. No deterministic mode for CI/CD pipelines
- Mandatory thinking mode: Every response includes billable reasoning tokens. No "fast mode" for simple queries
- Occasional over-engineering: The thinking mode sometimes wraps a simple bug fix in an unsolicited architectural overhaul
The bottom line: For repo-scale refactoring, visual-to-code generation, and multi-step agentic coding workflows, K2.7 is the strongest open-source model I have tested. The 256K context window handles real codebases, and the 30% token reduction makes long agentic runs affordable. But the locked parameters and mandatory thinking mode are genuine constraints that may rule it out for specific use cases.
Local Inference: Can You Run Kimi K2.7 Yourself?
Short answer: probably not on your laptop. Longer answer: it depends on your hardware, your patience, and your willingness to run heavily quantized weights.
The raw numbers:
- Full FP16 inference of the 1T parameter model requires approximately 2,308 GB of VRAM. That is a multi-node cluster, not a workstation.
- INT8 quantization brings it down to roughly 1,154 GB of VRAM. Still a cluster.
- INT4 quantization (which K2.7 supports natively) gets you to approximately 577 GB of VRAM. This is the realistic floor for running the full model.
- The Hugging Face repository for K2.7 Code is approximately 595 GB on disk.
What this means for consumer hardware:
- Mac Studio (M3 Ultra, 512GB unified memory): Can theoretically run K2.7 at Q2 or Q3 quantization. Performance will be slow due to memory bandwidth limitations. A 256GB Mac can run heavily quantized versions but expect single-digit tokens per second. Two Mac Studios networked together could handle Q4, but network latency makes this impractical for interactive use.
- DGX Spark: A single DGX Spark unit with 8x H100 80GB GPUs gives you 640 GB of VRAM. That is enough for INT4 quantized inference with some headroom. This is the most practical single-machine setup for running K2.7 locally. A recommended configuration uses 8x H100 80GB PCIe GPUs, a dual EPYC 9454 CPU, 1.5TB DDR5 memory, and 400G InfiniBand networking.
- Consumer GPU (single RTX 4090, 24GB VRAM): Forget it. You cannot run the full K2.7 model on a single consumer GPU at any quantization level. You would need roughly 24 such GPUs just for INT4 inference.
Practical alternatives:
For local inference on modest hardware, smaller open-source models are more practical. DeepSeek V4 Flash (284B total, 13B active) is far more tractable. For most teams, the API is the right path. See our Chinese AI models comparison covering GLM-5, Kimi, and MiniMax for deployment practicality across models.
The bottom line: Running K2.7 locally requires enterprise-grade hardware. A DGX Spark or equivalent 8x H100 system handles INT4 inference. Macs can technically run heavily quantized versions but too slowly for interactive use. For everyone else, the API is the way to go.
Cost vs Performance
This is where K2.7 makes its strongest case. The API pricing is significantly cheaper than the proprietary alternatives:
Kimi K2.7 API pricing:
- Input tokens (cache miss): $0.95 per million tokens
- Input tokens (cache hit): $0.19 per million tokens
- Output tokens: $4.00 per million tokens
- Web search per invocation: $0.015
- Reasoning tokens are billed as output tokens
Comparison with competitors:
- GPT-5.5: $5.00 per million input tokens, $30.00 per million output tokens
- Claude Opus 4.8: $5.00 per million input tokens, $25.00 per million output tokens
- Kimi K2.7: $0.95 per million input tokens, $4.00 per million output tokens
That makes K2.7 roughly 5x cheaper on input tokens and 6-7x cheaper on output tokens compared to GPT-5.5. The decoder.com reported that K2.7 undercuts GPT-5.5 and Claude by up to 12x on price per token for certain workloads, particularly when the prompt cache hits kick in.
But there is a catch: the mandatory thinking mode means every API call generates reasoning tokens that are billed at output rates. With K2.6, you could sometimes disable thinking for simple queries and save tokens. With K2.7, you cannot. For workloads where 70% of calls are simple lookups or classifications, this effectively narrows the cost gap.
Effective cost in practice:
For a typical agentic coding session involving 50K input tokens, 10K reasoning tokens, and 5K output tokens, a single K2.7 API call costs approximately $0.07. The equivalent GPT-5.5 call (assuming similar token counts) costs approximately $0.55. That is an 8x difference, which compounds quickly across high-volume pipelines.
The prompt caching is also worth highlighting. The cache hit rate of $0.19 per million input tokens is aggressive. If your workflow involves repeating the same system prompt or codebase context across many calls (which most agentic coding workflows do), the effective input cost drops dramatically.
The bottom line: K2.7 is dramatically cheaper than GPT-5.5 and Claude Opus 4.8 for high-volume agentic workloads. The prompt caching makes it even cheaper for repetitive coding workflows. The mandatory thinking mode slightly erodes the advantage for simple tasks, but for sustained coding agents, the savings are real and significant. For a detailed breakdown across multiple models, see our cost and benchmark comparison of DeepSeek V4, GPT-5.5, Claude Opus, and GLM.
Kimi K2.7 vs The Competition
How does K2.7 actually stack up against the other serious options in mid-2026? Here is the honest breakdown:
Kimi K2.7 vs GPT-5.5: GPT-5.5 wins on raw benchmarks across the board (69.0 vs 62.0 on Kimi Code Bench v2). It has a larger context window (400K consumer, up to 1M via Pro API) and better sampling flexibility. But it costs 5-7x more per token and is closed-source. For cost-sensitive or self-hosting teams, K2.7 is pragmatic. For maximum quality, GPT-5.5 wins.
Kimi K2.7 vs Claude Opus 4.8: Opus 4.8 has a 1M context window (4x K2.7's 256K) and scores higher on most benchmarks. But K2.7 beat Opus 4.8 on MCP Mark Verified (81.1 vs 76.4), suggesting better tool invocation accuracy in agentic workflows. At 6x cheaper, K2.7 is the value play for tool-heavy agent pipelines.
Kimi K2.7 vs GLM-5.1: The most interesting open-source comparison. GLM-5.1 (Zhipu AI, April 2026) scored 58.4% on SWE-Bench Pro, highest among open-source agentic coding models. It has a 200K context window, reliable structured output, and 56% fewer hallucinations than GLM-4.7. K2.7 has the edge on multimodal input and token efficiency. GLM-5.1 wins for pure code generation and structured tasks. K2.7 wins for multimodal coding and cost-sensitive long-running agents.
Kimi K2.7 vs DeepSeek V4: DeepSeek V4 Pro (April 2026) is the largest open-source model at 1.6T parameters with 49B active. It supports a 1M context window and scored approximately 91.2% on SWE-Bench Verified vs K2.7's 60.4%. DeepSeek V4 Flash (284B total, 13B active) is far more practical for local deployment. DeepSeek V4 wins on raw reasoning and long-context tasks. K2.7 wins on tool-use-heavy agentic workflows where token efficiency matters most.
The bottom line: K2.7 is not the best model on any single benchmark. It is the best value model for agentic coding workflows that involve tool use, long-running agents, and cost-sensitive scaling. GPT-5.5 and Claude Opus 4.8 are more capable overall. DeepSeek V4 and GLM-5.1 have edges in specific areas. K2.7's combination of open weights, aggressive pricing, token efficiency, and strong MCP tool use makes it the pragmatic choice for a specific but growing set of use cases.
Who Should Use Kimi K2.7?
After two weeks of testing, here is where I think K2.7 genuinely wins:
Teams building agentic coding pipelines: If you run autonomous coding agents making dozens of tool calls per session, K2.7's 30% token reduction and strong MCP tool invocation accuracy (81.1 on MCP Mark Verified) translate into lower costs and fewer failed tool chains. This is the model's sweet spot.
Developers working with visual inputs: If your workflow involves coding from screenshots or wireframes, K2.7's MoonViT vision encoder is best-in-class among open-source models. The visual-to-code generation is genuinely useful.
Cost-conscious teams needing open weights: If your organization requires self-hosting for compliance or data sovereignty, K2.7 approaches frontier-level coding performance. The Modified MIT License permits commercial use with attribution.
Multi-language codebases: K2.7's MLS Bench Lite score (35.1) nearly matches GPT-5.5 (35.5) across Python, Rust, Go, and more.
Who should skip K2.7:
- If you need deterministic outputs: Locked temperature and mandatory thinking mode make K2.7 unsuitable for CI/CD pipelines or any workflow requiring reproducible outputs.
- If you are doing GPU kernel work: Reported regressions in GPU kernel optimization mean K2.6 or a specialized model may be better for CUDA/low-level GPU programming.
- If you need a 1M context window: Claude Opus 4.8 and DeepSeek V4 both offer 1M context. K2.7's 256K is adequate for most codebases but falls short for truly massive repositories or long document analysis.
- If you want the absolute best coding model: GPT-5.5 and Claude Fable 5 both score higher on SWE-Bench Pro. If budget is not a constraint, they are better tools.
For teams building agent frameworks and evaluating which models to plug in, K2.7 is worth serious consideration as a cost-optimized tier. See our 2026 agent frameworks comparison for how it fits into popular agent pipelines.
FAQ
Is Kimi K2.7 free to use?
The model weights are open-source under a Modified MIT License, so you can download them for free from Hugging Face. The API through Moonshot AI's platform is paid, with input tokens at $0.95 per million and output tokens at $4.00 per million. You can also access K2.7 through third-party providers like OpenRouter and Cloudflare Workers AI.
Can Kimi K2.7 run on a Mac?
Technically yes, but practically it is challenging. A Mac Studio with 512GB unified memory can run K2.7 at Q2 or Q3 quantization, but performance is slow. A 256GB Mac can run heavily quantized versions with noticeably degraded quality. For practical interactive use, you need enterprise GPU hardware (8x H100 80GB minimum for INT4 inference). Most developers should use the API instead.
How does Kimi K2.7 compare to GPT-5.5 for coding?
GPT-5.5 scores higher on every major coding benchmark (69.0 vs 62.0 on Kimi Code Bench v2, 69.1 vs 53.6 on Program Bench). However, K2.7 is roughly 5-7x cheaper per API call and is open-source. For cost-sensitive agentic workflows that involve many tool calls, K2.7 offers a compelling quality-to-price ratio. For maximum quality regardless of cost, GPT-5.5 is the stronger choice.
What is the context window size for Kimi K2.7?
Kimi K2.7 supports a 256K token context window (262,144 tokens). This is large enough for most codebases and agentic workflows, but smaller than Claude Opus 4.8's 1M tokens or DeepSeek V4's 1M tokens. The model also supports a maximum output of 32,768 tokens per response.
Can I disable the thinking mode in Kimi K2.7?
No. The thinking mode is mandatory and cannot be disabled. Every API call generates reasoning tokens that are billed as output tokens. This is a deliberate design choice by Moonshot AI to ensure consistent reasoning quality across multi-turn agentic workflows. If you need a model without mandatory thinking, K2.6 or a different model entirely may be more appropriate for your use case.
By AJ Awan. I build AI systems at Flowtivity and write about what I find in the trenches, not what the press releases say. If this review was useful, the K2.7 Code deep-dive goes further into the architecture and token efficiency gains.

