Last Updated: April 12, 2026
Every developer who has joined a new company knows the feeling. You stare at a codebase with 200,000 lines of code, dozens of services, and architectural decisions that made sense two years ago but nobody remembers why. Your AI coding assistant tries to help, but it greps through files one by one, burning tokens and missing the big picture.
Graphify, an open source project by Safi Shamsi, solves this problem differently. Instead of searching files, it builds a knowledge graph from your entire project and lets your AI assistant navigate by structure, not keywords. The result is 71.5x fewer tokens per query, persistent knowledge across sessions, and honest transparency about what was found versus guessed.
After testing it across multiple projects, here is why Graphify represents a meaningful shift in how developers will interact with AI coding assistants.
What Is Graphify and Why Does It Matter?
Graphify is a skill for AI coding assistants that reads your files and builds a queryable knowledge graph. Type /graphify . in Claude Code, Codex, Cursor, OpenClaw, or any of seven other supported platforms, and it processes your entire folder: code, documentation, PDFs, screenshots, whiteboard photos, diagrams, and even video and audio files.
The output is not just another search index. It is a structured graph where nodes are concepts, functions, classes, or documents, and edges are the relationships between them. You get an interactive HTML visualisation, a queryable JSON file, and a plain-language audit report that highlights "god nodes" (highly connected concepts), surprising connections, and suggested questions to explore.
The 71.5x token reduction comes from a simple insight. Instead of feeding raw files into your AI assistant every time you ask a question, Graphify distils the structure into a compact graph. Your assistant reads the one-page report first, then navigates the graph for specific details. This is the difference between reading every page of a textbook versus checking the index.
How Does Graphify Build the Knowledge Graph?
Graphify runs in three distinct passes, each handling a different type of content.
Pass 1: Deterministic AST extraction. Code files are parsed using tree-sitter AST analysis. This extracts classes, functions, imports, call graphs, docstrings, and rationale comments without any LLM involvement. It is fast, deterministic, and costs zero API tokens. Graphify supports 20 programming languages including Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, Ruby, C#, Kotlin, Scala, PHP, Swift, Lua, Zig, PowerShell, Elixir, Objective-C, and Julia.
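Graphify's own Pass 1 uses tree-sitter grammars across all 20 languages, but the core idea, walking a syntax tree to pull out structure deterministically with no LLM, can be sketched with Python's standard library ast module. This is a simplified stand-in, not Graphify's actual implementation:

```python
import ast

SOURCE = '''
class PaymentService:
    """Handles payment processing."""
    def charge(self, amount):
        return validate(amount)

def validate(amount):
    """Check the amount is positive."""
    return amount > 0
'''

def extract_structure(source: str) -> dict:
    """Walk the syntax tree and record classes, functions, and call edges."""
    tree = ast.parse(source)
    structure = {"classes": [], "functions": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            structure["classes"].append((node.name, ast.get_docstring(node)))
        elif isinstance(node, ast.FunctionDef):
            structure["functions"].append((node.name, ast.get_docstring(node)))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # a direct function call becomes a call-graph edge
            structure["calls"].append(node.func.id)
    return structure

print(extract_structure(SOURCE))
```

Everything here is deterministic: run it twice on the same file and you get byte-identical output, which is what makes this pass free of API tokens.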
Pass 2: Multimodal transcription. Video and audio files are transcribed locally using faster-whisper with a domain-aware prompt derived from the corpus itself. The key innovation is that transcription prompts are informed by the "god nodes" identified in Pass 1, so the transcription is tuned to the actual terminology and concepts in your project. Transcripts are cached by SHA256 hash, so re-runs only process changed files.
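The caching behaviour is simple to picture: hash each file's bytes and skip any file whose hash is already recorded. The sketch below illustrates the idea only; the cache layout is hypothetical, not Graphify's actual format:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash file contents so edits are detected regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_needing_work(paths, cache_file: Path):
    """Return only files whose content hash differs from the cached one."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    changed = []
    for path in paths:
        digest = sha256_of(path)
        if cache.get(str(path)) != digest:
            changed.append(path)
            cache[str(path)] = digest
    cache_file.write_text(json.dumps(cache))
    return changed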
Pass 3: LLM-powered concept extraction. Claude subagents run in parallel over documents, papers, images, and transcripts to extract concepts, relationships, and design rationale. Every relationship is tagged with one of three confidence levels: EXTRACTED (found directly in source), INFERRED (reasonable inference with a confidence score), or AMBIGUOUS (flagged for human review). This transparency is critical. You always know what Graphify found versus what it guessed.
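The three-tier tagging can be modelled as a small data structure. Field names and the review rule below are illustrative, not Graphify's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

EXTRACTED, INFERRED, AMBIGUOUS = "EXTRACTED", "INFERRED", "AMBIGUOUS"

@dataclass
class Edge:
    source: str
    target: str
    relation: str
    tag: str                            # EXTRACTED, INFERRED, or AMBIGUOUS
    confidence: Optional[float] = None  # only meaningful for INFERRED

    def needs_review(self) -> bool:
        """Flag anything the pipeline could not confirm from source."""
        return self.tag == AMBIGUOUS or (
            self.tag == INFERRED and (self.confidence or 0.0) < 0.5
        )

edges = [
    Edge("auth_service", "token_store", "calls", EXTRACTED),
    Edge("billing", "auth_service", "similar_error_handling", INFERRED, 0.7),
    Edge("legacy_api", "auth_service", "possibly_deprecated_by", AMBIGUOUS),
]
print([e.relation for e in edges if e.needs_review()])
```

Keeping the tag on the edge itself, rather than in a separate report, means any downstream query can decide how much to trust a relationship at lookup time.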
The results are merged into a NetworkX graph and clustered using Leiden community detection. Clustering is graph-topology-based, meaning it uses edge density rather than embeddings. The semantic similarity edges that the LLM extracts are already in the graph, so they influence community detection directly. No separate embedding step, no vector database needed.
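The "topology, not embeddings" point is worth making concrete. A crude stand-in for community detection is to group nodes by connectivity with union-find over the edge list; Leiden goes much further by optimising edge density within groups, but the input is the same, just edges:

```python
def connected_communities(edges):
    """Crude topology-based grouping: union-find over the edge list.
    (Leiden refines this by maximising edge density within each group.)"""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

EDGES = [("auth", "tokens"), ("tokens", "sessions"), ("billing", "invoices")]
print(connected_communities(EDGES))
```

Note that no node text or vectors are consulted at any point; the clustering is driven entirely by which edges exist, which is why no embedding step or vector database is needed.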
What Outputs Does Graphify Produce?
Graphify generates four outputs in a graphify-out/ directory.
graph.html is an interactive visualisation where you can click nodes, search for concepts, and filter by community cluster. This is genuinely useful for onboarding. A new developer can open the HTML file and see the entire architecture at a glance, including which modules are tightly coupled and which stand alone.
GRAPH_REPORT.md is a one-page plain-language summary. It lists god nodes (the concepts everything connects to), surprising connections (relationships you might not expect), and suggested questions to explore. This is what your AI assistant reads before answering questions, replacing the need to grep through raw files.
graph.json is the persistent, queryable graph. You can query it weeks later without re-reading the original files. It is designed to be traversed hop-by-hop by an LLM, not pasted into a prompt all at once. Think surgical queries, not dump-and-pray.
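A hop-by-hop query in this spirit needs nothing beyond the standard library. The JSON shape below is hypothetical, not Graphify's actual graph.json schema, but it shows what "surgical" traversal looks like:

```python
import json
from collections import deque

GRAPH_JSON = json.dumps({
    "edges": [
        {"source": "login", "target": "api_gateway", "relation": "calls"},
        {"source": "api_gateway", "target": "user_service", "relation": "calls"},
        {"source": "user_service", "target": "session_store", "relation": "writes"},
    ]
})

def neighbors_within(graph: dict, start: str, hops: int) -> set:
    """BFS outward from a node, stopping after a fixed number of hops."""
    adjacency = {}
    for edge in graph["edges"]:
        adjacency.setdefault(edge["source"], []).append(edge["target"])
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand past the hop budget
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

graph = json.loads(GRAPH_JSON)
print(neighbors_within(graph, "login", 2))
```

The hop budget is the point: an LLM pulls in two hops of neighbourhood around the node it cares about instead of the whole file tree.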
cache/ contains SHA256 hashes so that re-runs only process files that have changed. For a large project, the first run might take a few minutes. Subsequent runs are near-instant.
Which AI Coding Platforms Does Graphify Support?
This is where Graphify stands out. It works across 11 major AI coding platforms, each with platform-specific integration hooks.
Claude Code gets the deepest integration. Graphify writes a CLAUDE.md section and installs a PreToolUse hook that fires before every Glob and Grep call. If a knowledge graph exists, Claude sees a reminder to read the graph report first, so it navigates by structure instead of searching raw files.
Codex uses AGENTS.md and a PreToolUse hook in .codex/hooks.json. Cursor writes a .cursor/rules/graphify.mdc file with alwaysApply: true, which Cursor includes in every conversation automatically. Gemini CLI installs a BeforeTool hook in settings.json. OpenCode uses a plugin system.
OpenClaw, Aider, Factory Droid, and Trae write rules to AGENTS.md since these platforms do not yet support tool hooks. GitHub Copilot CLI copies the skill to ~/.copilot/skills/.
The practical implication: if you switch between coding assistants (many developers do), your knowledge graph works everywhere. Build it once with /graphify . and every platform benefits.
How Does the Always-On Integration Work?
After building a graph, you run one command to make your assistant always use it. For Claude Code, that is graphify claude install. For OpenClaw, graphify claw install. And so on.
The always-on hook surfaces the GRAPH_REPORT.md summary before every query. Your assistant reads the one-page overview of god nodes, communities, and connections before it starts searching files. This covers most everyday questions: "Where is the authentication logic?" "What depends on the payment service?" "Why does this module exist?"
For deeper queries, three slash commands go further.
/graphify query traverses the raw graph.json hop by hop.
/graphify path traces exact paths between two nodes.
/graphify explain surfaces edge-level detail, including relation type, confidence score, and source location.
Think of it this way: the always-on hook gives your assistant a map. The slash commands let it navigate that map precisely.
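Path tracing in the spirit of /graphify path reduces to shortest-path search over the edge list. A sketch with a hypothetical edge format, not the tool's internals:

```python
from collections import deque

EDGES = [
    ("login", "api_gateway"),
    ("api_gateway", "user_service"),
    ("api_gateway", "payment_service"),
    ("user_service", "session_store"),
]

def trace_path(edges, start, goal):
    """Breadth-first search: returns the shortest chain of nodes, or None."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # no route between the two nodes

print(trace_path(EDGES, "login", "session_store"))
```

An answer like login to api_gateway to user_service to session_store is exactly the kind of structural fact that grep cannot produce from keyword matches.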
Why Is Transparency About Confidence Levels Important?
Most AI tools present their output as fact. Graphify does not. Every relationship in the graph carries a confidence tag.
EXTRACTED means the relationship was found directly in source code or documentation. A function call, an import statement, a docstring reference. These are deterministic facts.
INFERRED means the LLM identified a relationship that is reasonable but not explicitly stated. Each inferred relationship comes with a confidence score. For example, if two modules share similar error-handling patterns, Graphify might infer a semantic similarity with a 0.7 confidence score.
AMBIGUOUS means the relationship is flagged for human review. This happens when the LLM detects a potential connection but lacks enough context to be confident.
This three-tier system matters because it prevents a common failure mode with AI tools: confident wrong answers. When you see an INFERRED relationship with a 0.6 confidence score, you know to verify it before relying on it. When you see EXTRACTED, you can trust it.
What Problem Does Graphify Solve for Teams?
Andrej Karpathy keeps a /raw folder where he drops papers, tweets, screenshots, and notes. Many developers do something similar. The problem is that this raw material becomes a dumping ground. It grows, it accumulates, and it becomes harder to extract value from over time.
Graphify turns that dumping ground into structured knowledge. Papers, screenshots, whiteboard photos, code, documentation, video recordings of architecture meetings. Graphify processes all of it, extracts the concepts and relationships, and connects them into one navigable graph.
For teams, this solves three concrete problems.
Onboarding takes weeks instead of months. A new developer opens graph.html and sees the entire architecture. They read GRAPH_REPORT.md and understand which modules matter most and why. They can ask their AI assistant questions and get answers grounded in the actual structure of the codebase.
Knowledge survives attrition. When a senior developer leaves, their understanding of why certain architectural decisions were made often leaves with them. Graphify captures design rationale from docstrings, comments, and documentation and connects it to the relevant code. The knowledge persists in the graph.
Cross-functional understanding improves. Product managers can look at the interactive graph and understand which services connect to which features. QA engineers can trace the path from a user action through the call graph to the database. The graph becomes a shared language.
How Does Graphify Compare to Traditional Code Search?
Traditional code search (grep, ripgrep, IDE search) finds text matches. It answers "where does this string appear?" Graphify answers "how do these concepts relate?" These are fundamentally different questions.
Consider a practical example. You want to understand how authentication works in a microservices architecture.
With grep, you search for "auth", "token", "session", and "login" across hundreds of files. You get thousands of matches. You read through them one by one, building a mental model. This takes hours and burns thousands of tokens if you are using an AI assistant.
With Graphify, you read the graph report and see that authentication is a god node connected to three services: the API gateway, the user service, and the session store. You see the community cluster that groups all auth-related code. You click through the interactive graph to trace the exact path from login request to token validation. This takes minutes.
The 71.5x token reduction is not marketing. It is the measured difference between feeding raw files into the context window and navigating a compact, structured graph.
What Are the Limitations?
Graphify is not perfect, and being honest about limitations matters.
It requires Claude as the LLM backend. The concept extraction pass uses Claude subagents. If you do not have access to Claude, you lose the most powerful part of the pipeline. The AST pass and transcription pass work independently, but the relationship extraction needs Claude.
OpenClaw and Aider use sequential extraction. On platforms that support parallel subagents (Claude Code, Codex, Factory Droid), Graphify runs concept extraction in parallel, which is significantly faster. On OpenClaw and Aider, extraction runs sequentially, which is slower for large projects.
Large codebases take time on the first run. The initial graph build processes every file. For a 200,000-line codebase, this can take several minutes. Subsequent runs are fast due to SHA256 caching, but the first run requires patience.
The graph quality depends on code quality. If your codebase has no docstrings, no comments, and no documentation, the AST pass still extracts structure (classes, functions, imports). But the concept extraction pass has less to work with, so inferred relationships may have lower confidence scores. Graphify works best when developers write for humans, not just compilers.
Why Does This Matter for the AI Agent Ecosystem?
Graphify fits into a broader pattern in the AI tooling ecosystem: tools that give AI agents structured knowledge rather than raw data to search through. This is the same pattern we see with personal knowledge brains like GBrain, prompt optimisation tools like GEPA, and agent frameworks like OpenClaw.
The common thread is that AI agents are most valuable when they can navigate structured knowledge. A developer asking "why does this service exist?" gets a better answer from a knowledge graph than from a grep search. A business asking "what AI opportunities exist in my operations?" gets a better answer from a structured analysis than from a chatbot.
Graphify represents the infrastructure layer that makes AI coding assistants genuinely useful for understanding, not just writing, code. As the open source AI ecosystem continues to mature, tools like Graphify that bridge the gap between raw files and structured knowledge will become essential.
Getting Started with Graphify
Installation is straightforward.
pip install graphifyy && graphify install
Then open your AI coding assistant and run:
/graphify .
For always-on integration, run the install command for your platform:
graphify claude install # Claude Code
graphify codex install # Codex
graphify claw install # OpenClaw
graphify cursor install # Cursor
graphify gemini install # Gemini CLI
Add a .graphifyignore file to exclude folders you do not want in the graph (same syntax as .gitignore). Re-run /graphify . when you want to update the graph. The cache ensures only changed files are reprocessed.
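Gitignore-style matching can be approximated with fnmatch for illustration. Real gitignore semantics are richer (negation, anchoring, directory-only patterns), and this sketch is not Graphify's matcher; the patterns shown are hypothetical:

```python
from fnmatch import fnmatch

# hypothetical .graphifyignore contents
IGNORE_PATTERNS = ["node_modules/*", "*.log", "build/*"]

def is_ignored(path: str, patterns) -> bool:
    """Match a relative path against gitignore-like glob patterns."""
    return any(fnmatch(path, pattern) for pattern in patterns)

paths = ["src/main.py", "node_modules/lodash/index.js", "debug.log"]
kept = [p for p in paths if not is_ignored(p, IGNORE_PATTERNS)]
print(kept)
```

Excluding vendored dependencies and build output keeps the graph focused on code your team actually wrote, which also keeps god-node detection meaningful.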
The official repository is safishamsi/graphify on GitHub, and the PyPI package is graphifyy.
Frequently Asked Questions
What is Graphify?
Graphify is an open source AI coding assistant skill that builds knowledge graphs from your codebase, documentation, papers, images, and videos. It runs as a slash command in 11 major coding platforms including Claude Code, Codex, Cursor, OpenClaw, and Gemini CLI. The output is an interactive graph visualisation, a queryable JSON file, and a plain-language audit report that helps your AI assistant navigate your codebase by structure instead of keyword search.
How does Graphify reduce token usage by 71.5x?
Graphify distils your raw files into a structured knowledge graph that your AI assistant can navigate efficiently. Instead of reading hundreds of files to answer a question, the assistant reads a one-page summary (GRAPH_REPORT.md) and then queries the graph for specific details. This is analogous to checking an index instead of reading every page of a textbook, resulting in dramatically fewer tokens consumed per query.
Does Graphify work without an internet connection?
The AST extraction pass works entirely offline with no LLM needed. The transcription pass uses local faster-whisper models. However, the concept extraction pass requires Claude API access, which needs an internet connection. You can use the structural knowledge from Passes 1 and 2 offline, but the full relationship graph requires the LLM pass.
Which programming languages does Graphify support?
Graphify supports 20 programming languages via tree-sitter AST analysis: Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, Ruby, C#, Kotlin, Scala, PHP, Swift, Lua, Zig, PowerShell, Elixir, Objective-C, and Julia. The AST pass extracts classes, functions, imports, call graphs, docstrings, and rationale comments from all supported languages.
How is Graphify different from code search tools like grep?
Grep finds text matches in files. Graphify finds conceptual relationships across your entire project. Grep answers "where does this string appear?" Graphify answers "how do these modules relate and why?" Graphify also processes non-code content like PDFs, images, screenshots, and videos, building a unified knowledge graph that spans all your project materials.