When building an AI system — a RAG pipeline, a custom GPT, an LLM fine-tuning dataset, or an agent framework — one of the earliest and most consequential decisions is format. What format should your documents be in when they enter the system?
The two most common choices are markdown and JSON. Both work. Both have clear strengths. But they serve fundamentally different purposes, and choosing the wrong one for your content type wastes tokens, degrades retrieval quality, and produces worse AI outputs.
This guide breaks down when to use each, why, and how the best production systems combine both.
The short answer: Use markdown for documents (articles, reports, manuals, policies). Use JSON for structured data (API responses, database records, configuration). Use both when your system handles both content types.
## Head-to-Head Comparison
| Dimension | Markdown | JSON |
|---|---|---|
| Token efficiency | Excellent — minimal syntax overhead | Medium — brackets, quotes, colons add up |
| Structure | Semantic — headings, lists, emphasis | Explicit — key-value pairs, nesting |
| Human readability | Excellent — readable by anyone | Technical — requires familiarity |
| Parsing | Line-based, regex-friendly | Standard JSON.parse() |
| Best content type | Documents, articles, guides, prose | Structured data, API responses, records |
| LLM comprehension | Excellent for prose and documents | Excellent for structured data |
| RAG chunking | Natural — split on headings | Harder — split on logical record boundaries |
| Embedding quality | High for document content | High for structured records |
| Schema validation | None (free-form) | JSON Schema available |
| AI native format | LLMs output markdown by default | LLMs can output JSON with prompting |
The table tells a clear story: markdown wins for documents, JSON wins for structured data. The rest of this guide explains why — with token counts, RAG benchmarks, and practical guidance.
## Token Efficiency: Why It Matters More Than You Think
Tokens are the fundamental unit of LLM processing. Every token costs money (API pricing), consumes context window space (how much the model can see at once), and affects processing speed (fewer tokens = faster responses). When your format adds tokens that carry no content value, you're paying for noise.
### Side-by-Side Token Comparison
Example: A product description
Markdown version (~35 tokens):
```markdown
## Widget Pro
Fast, reliable, **affordable**. Ships in 2 days.
- Lightweight aluminum design
- 5-year manufacturer warranty
- Available in 3 colors
```
JSON version (~75 tokens):
```json
{
  "product": {
    "name": "Widget Pro",
    "tagline": "Fast, reliable, affordable. Ships in 2 days.",
    "features": [
      "Lightweight aluminum design",
      "5-year manufacturer warranty",
      "Available in 3 colors"
    ]
  }
}
```
The JSON version uses roughly twice as many tokens for identical information. The extra tokens come from braces, brackets, quotes, colons, and key names — none of which add content value for a document-style description.
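The gap is easy to measure. Here is a minimal Python sketch comparing the two versions of the example above; it uses a rough four-characters-per-token heuristic as a stand-in for a real tokenizer, so treat the absolute numbers as approximations:

```python
import json

# The same product description in both formats (from the example above).
markdown_doc = """## Widget Pro
Fast, reliable, **affordable**. Ships in 2 days.
- Lightweight aluminum design
- 5-year manufacturer warranty
- Available in 3 colors"""

json_doc = json.dumps({
    "product": {
        "name": "Widget Pro",
        "tagline": "Fast, reliable, affordable. Ships in 2 days.",
        "features": [
            "Lightweight aluminum design",
            "5-year manufacturer warranty",
            "Available in 3 colors",
        ],
    }
}, indent=2)

def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

print(approx_tokens(markdown_doc), approx_tokens(json_doc))
```

The heuristic is crude, but the direction is clear: the JSON wrapper makes the same content noticeably larger even before punctuation gets tokenized piece by piece.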
Example: A data table
Markdown version (~50 tokens):
```markdown
| Model | Parameters | Context | Cost |
|-------|-----------|---------|------|
| GPT-4o | 200B+ | 128K | $5/$15 |
| Claude 3.5 | Unknown | 200K | $3/$15 |
| Gemini 1.5 | Unknown | 1M | $3.50/$10.50 |
```
JSON version (~95 tokens):
```json
[
  {"model": "GPT-4o", "parameters": "200B+", "context": "128K", "cost": "$5/$15"},
  {"model": "Claude 3.5", "parameters": "Unknown", "context": "200K", "cost": "$3/$15"},
  {"model": "Gemini 1.5", "parameters": "Unknown", "context": "1M", "cost": "$3.50/$10.50"}
]
```
For tabular data, JSON repeats key names for every record. Markdown states column headers once. At scale, the difference is dramatic.
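When structured records genuinely belong in a prompt as a table, a converter only needs to state each key once. A minimal sketch (the records and field names here are illustrative):

```python
import json

records = [
    {"model": "GPT-4o", "context": "128K", "cost": "$5/$15"},
    {"model": "Claude 3.5", "context": "200K", "cost": "$3/$15"},
]

def to_markdown_table(rows: list[dict]) -> str:
    # Headers come from the first record's keys and are stated exactly once.
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

table = to_markdown_table(records)
# JSON repeats every key per record; the markdown table names each key once.
assert json.dumps(records).count("model") == len(records)
print(table)
```

The more records you have, the more the per-record key repetition in JSON costs you, while the markdown header row stays constant.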
### The Cost Impact
For a knowledge base with 1,000 documents:
- Markdown: ~500K tokens total
- JSON: ~1.1M tokens total
- Savings: 55% fewer tokens with markdown
At OpenAI's GPT-4o pricing ($5 per million input tokens), that 600K-token difference works out to $3.00 saved per full pass through the knowledge base. For RAG systems making thousands of retrievals per day, the savings compound quickly.
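The arithmetic behind that figure, as a quick sanity check:

```python
# Figures from the 1,000-document knowledge-base estimate above.
tokens_markdown = 500_000
tokens_json = 1_100_000
price_per_million_tokens = 5.00  # GPT-4o input pricing, USD

def pass_cost(tokens: int) -> float:
    # Cost of sending this many input tokens through the model once.
    return tokens / 1_000_000 * price_per_million_tokens

savings = pass_cost(tokens_json) - pass_cost(tokens_markdown)
print(round(savings, 2))  # → 3.0
```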
## Structure and LLM Comprehension

### How Markdown Structures Content

Markdown uses lightweight semantic markers that map directly to document structure:

- `#` headings create hierarchy — the model understands that content under `## Methods` belongs to the Methods section
- `-` lists create enumeration — features, steps, requirements
- `**bold**` creates emphasis — key terms, important concepts
- `|` tables organize data within documents
- Code fences isolate code from prose
These markers are minimal (1-3 characters each) but carry strong semantic signal. LLMs understand markdown structure natively because they've been trained on billions of markdown documents from GitHub, documentation sites, and knowledge bases.
### How JSON Structures Content
JSON uses explicit key-value pairs with strict syntax:
- Keys define meaning (`"title"`, `"body"`, `"sections"`)
- Values hold content (strings, arrays, objects)
- Nesting creates hierarchy through object composition
- Arrays create ordered collections
- Schema can enforce and validate structure
JSON structure is unambiguous and machine-parseable, but the structural syntax itself — every brace, bracket, quote, colon, and comma — consumes tokens without adding content.
### Which Structure Do LLMs Understand Better?
For documents: Markdown. LLMs are trained on massive amounts of markdown — GitHub READMEs, documentation sites, wikis, forums. They understand that ## Results introduces a results section, that - item is a list entry, and that **important** signals emphasis. This understanding is baked into their training, not something you need to prompt for.
For structured data: JSON. LLMs understand JSON well because they're trained on code, API documentation, and configuration files. When your content is inherently structured data — product records, user profiles, transaction logs — JSON's explicit key-value format is the right representation.
The key insight: LLMs process both formats well, but they process each format best for its natural content type. Forcing documents into JSON or data into markdown degrades comprehension.
## RAG Performance: Where Format Choice Has the Biggest Impact
In RAG (Retrieval-Augmented Generation) systems, format choice directly affects three critical metrics: chunking quality, embedding accuracy, and retrieval precision.
### Chunking
Effective chunking — splitting documents into semantically coherent pieces for vector storage — is the foundation of RAG performance.
Markdown advantage:
Headers create natural, semantic chunk boundaries. Split at ## or ### to get topic-coherent chunks where each chunk maps to a specific section or concept. The heading itself serves as a built-in summary for the chunk.
```markdown
## Installation
[300 words about installing the software]

## Configuration
[400 words about configuring settings]

## Troubleshooting
[350 words about common problems]
```
Three natural chunks, each semantically coherent, each self-describing via its heading.
JSON challenge:
JSON has no inherent section boundaries for document content. If your document is stored as a single JSON string, you're back to arbitrary character-based splitting. If it's structured as nested objects, splitting requires understanding the schema to avoid breaking logical units.
For structured records (arrays of objects), each object is a natural chunk — this works well for product catalogs and user records, but poorly for long-form prose.
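The heading-based strategy takes only a few lines of code. A minimal sketch that splits a markdown document into chunks at `##` boundaries (the document text here is illustrative):

```python
import re

doc = """## Installation
Steps to install the software.

## Configuration
How to configure settings.

## Troubleshooting
Common problems and fixes."""

def chunk_by_heading(markdown: str, level: int = 2) -> list[str]:
    # Split immediately *before* each heading of the given level, using a
    # zero-width lookahead so the heading stays attached to its own chunk.
    marker = "#" * level + " "
    pattern = rf"(?m)^(?={re.escape(marker)})"
    return [chunk.strip() for chunk in re.split(pattern, markdown) if chunk.strip()]

chunks = chunk_by_heading(doc)
print(len(chunks))  # → 3
```

Each resulting chunk begins with its own heading, which doubles as the built-in summary mentioned above.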
### Embedding Quality
Vector embeddings represent the semantic meaning of text chunks. Noise in the text pollutes the embedding.
Markdown produces cleaner embeddings for documents:
- Pure content with minimal syntax noise
- Heading context enriches the semantic representation ("## Pricing" + pricing content = focused embedding)
- Lists and tables are semantically clear
JSON has a trade-off:
- Key names add useful context (`"title": "..."` adds meaning)
- But structural syntax (`{`, `}`, `[`, `]`, `:`, `"`) adds noise to embeddings
- Works well for individual records, poorly for long-form content
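One common middle ground, if you do need to embed JSON records: flatten each record into plain `key: value` text first, keeping the informative key names while dropping the structural syntax. A hypothetical sketch (the record and field names are invented for the example):

```python
record = {"name": "Widget Pro", "price": "$49", "colors": ["red", "blue", "black"]}

def record_to_text(rec: dict) -> str:
    # Keep the informative key names, drop the braces, brackets, and quotes
    # that would otherwise add noise to the embedding.
    parts = []
    for key, value in rec.items():
        if isinstance(value, list):
            value = ", ".join(map(str, value))
        parts.append(f"{key}: {value}")
    return "; ".join(parts)

text = record_to_text(record)
print(text)  # → name: Widget Pro; price: $49; colors: red, blue, black
```

The flattened string embeds cleanly while preserving the field semantics; the original JSON stays the source of truth for exact lookups.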
### Retrieval Accuracy
Benchmarks from RAG practitioners consistently show:
- Markdown chunks retrieve more accurately for document-based queries
- Heading-based chunks match user questions better than arbitrarily split text
- Up to 35% improvement in retrieval precision when using markdown vs raw PDF text
- JSON records retrieve well for field-specific, structured queries
The format you choose for your knowledge base isn't a minor implementation detail — it's a primary driver of how well your AI system answers questions.
## When to Use Markdown for LLMs
Use markdown when your content is:
- Documents — reports, articles, guides, manuals, policies, handbooks
- Long-form prose with sections, subsections, and logical flow
- Knowledge base articles for RAG systems and custom GPTs
- Training data from books, papers, documentation, or web content
- Conversational context for ChatGPT, Claude, or Gemini interactions
Real-world examples:
- Company policy documents for an internal HR chatbot
- Technical documentation for a developer support RAG system
- Research papers for an AI-powered literature review tool
- Product manuals for a customer support knowledge base
- Blog content for a content analysis or SEO pipeline
How to get your documents into markdown:
Most source documents are in PDF, Word, HTML, or other formats. Convert them to clean markdown before ingestion using Craft Markdown — it handles PDF, Word, HTML, Excel, CSV, JSON, and more, all in your browser with no file uploads.
## When to Use JSON for LLMs
Use JSON when your content is:
- Structured records — user profiles, product catalogs, transaction logs
- API responses being processed or analyzed by AI
- Configuration data or application metadata
- Tabular data that's inherently key-value (not prose)
- LLM output format for function calling, tool use, and structured responses
Real-world examples:
- Product catalog for an e-commerce AI shopping assistant
- User profile data for a personalized recommendation system
- API response parsing in an AI-powered monitoring dashboard
- Structured output from LLM function calling and agent tools
- Database query results for a natural language to SQL system
## The Hybrid Approach: What Production Systems Actually Do
The best production AI systems don't choose one format — they use both, matching format to content type.
### The Architecture
Documents → Markdown. Convert PDFs, Word docs, HTML pages, and other document content to markdown. Store in your vector database with heading-based chunking. Use for conceptual questions, policy lookups, troubleshooting, and any query that requires understanding prose.
Data → JSON. Keep structured data — user records, product information, transaction history, configuration — in JSON format. Use for precise field-level lookups, filtering, and any query that needs specific data points.
Combined retrieval. Your RAG system searches markdown chunks for conceptual, open-ended questions ("How do I configure SSO?") and JSON records for data lookups ("What's the price of Widget Pro in the enterprise tier?").
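The routing step can start out very simple. This sketch uses a naive keyword heuristic purely for illustration; production systems typically route with an LLM classifier or intent model, and the signal words here are invented for the example:

```python
def route_query(query: str) -> str:
    # Naive heuristic: queries asking for specific field values go to the
    # JSON data store; everything else goes to the markdown document index.
    data_signals = ("price", "cost", "how many", "count", "salary")
    q = query.lower()
    return "data" if any(signal in q for signal in data_signals) else "documents"

print(route_query("How do I configure SSO?"))              # → documents
print(route_query("What's the price of Widget Pro?"))      # → data
```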
### Example: Enterprise Knowledge Base
```
/knowledge-base/
  /documents/                ← Markdown
    hr-policies.md
    engineering-handbook.md
    product-specs.md
    meeting-notes-2026-q1.md
  /data/                     ← JSON
    employees.json
    products.json
    pricing.json
    office-locations.json
```
The AI assistant retrieves from /documents/ when asked "What's our remote work policy?" and from /data/ when asked "How many employees are in the London office?"
This hybrid approach maximizes both retrieval quality and token efficiency by using the right format for the right content type.
Convert your documents to AI-ready markdown →
## Common Mistakes to Avoid

### Mistake 1: Storing Documents as JSON Strings
```json
{
  "title": "Company Handbook",
  "content": "# Welcome to Acme Corp\n\nThis handbook covers...\n\n## Code of Conduct\n\n..."
}
```
This is the worst of both worlds. You pay the JSON overhead for the wrapper, and the actual content is just a markdown string stuffed into a JSON value. Extract the markdown and store it directly.
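The fix is mechanical: parse the wrapper once and keep only the markdown. A minimal Python sketch (the field names follow the example above; in practice you would write the result out as a standalone `.md` file for ingestion):

```python
import json

wrapped = json.dumps({
    "title": "Company Handbook",
    "content": "# Welcome to Acme Corp\n\n## Code of Conduct\n\nBe excellent to each other.",
})

# Parse the JSON wrapper once and keep only the markdown payload.
record = json.loads(wrapped)
markdown = record["content"]
print(markdown.splitlines()[0])  # → # Welcome to Acme Corp
```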
### Mistake 2: Converting Structured Data to Markdown Prose
If your source data is an array of product records with 15 fields each, don't convert it to a markdown table or narrative prose. JSON's explicit key-value structure is the right representation for structured data.
### Mistake 3: Using One Format for Everything
"We standardized on JSON for our entire knowledge base" — this means your document content is either stored as JSON strings (see Mistake 1) or restructured into key-value pairs that lose the natural flow and section hierarchy. Use the right format for the right content.
### Mistake 4: Ignoring Token Costs at Scale
For a prototype with 10 documents, format choice barely matters. For a production system with 10,000 documents and thousands of daily queries, the 40-60% token overhead of JSON for document content translates to real money and real context window pressure.
## Key Takeaways
Markdown wins for documents. Articles, reports, guides, manuals, policies — any content that's primarily prose with sections and structure. Up to 55% fewer tokens than JSON.
JSON wins for structured data. Records, API responses, catalogs, configurations — any content that's inherently key-value with explicit fields. Schema validation and explicit structure are valuable here.
Token efficiency translates to cost savings. At scale, markdown's minimal syntax overhead saves real money on API costs and fits more content into context windows.
RAG performance depends on format choice. Heading-based markdown chunking outperforms arbitrary text splitting by up to 35% in retrieval accuracy. JSON records chunk naturally at object boundaries.
The best systems use both. Documents in markdown, data in JSON. Match the format to the content type, not to organizational habit or tooling convenience.
Don't force documents into JSON or data into markdown. Use the right format for the right content, and your AI system will produce better results with lower costs.
Convert your documents to AI-ready markdown — free and private →
## Frequently Asked Questions

### Can LLMs process both markdown and JSON equally well?
Yes, modern LLMs handle both formats natively. GPT-4o, Claude, Gemini, Llama, and Mistral all understand markdown structure and JSON syntax. The choice should be based on your content type, not model capability. Use markdown for documents and JSON for structured data.
### What about YAML — is it better than JSON for LLMs?
YAML is more readable than JSON and slightly more token-efficient (no quotes around keys, no braces). It's a solid choice for configuration files and structured metadata. But for document content, markdown is still the best choice. For structured data in production systems, JSON's strict parsing and schema validation make it more reliable than YAML.
### Should I convert JSON knowledge bases to markdown?
Only if your JSON contains long-form document content stored as strings. If your JSON is structured data (product records, user profiles), keep it as JSON — that's its strength. If you have articles or guides stored as JSON objects with a "content" field containing prose, extract that content as standalone markdown files.
### What format does ChatGPT output by default?
ChatGPT, Claude, and Gemini all output markdown by default — headings, bold text, lists, code blocks, tables. This is a strong signal that LLMs are optimized for markdown processing. They don't output JSON unless specifically prompted to (via function calling or explicit instructions).
### How much do I actually save by using markdown instead of JSON for documents?
Typically 40-60% fewer tokens for document content. For a 1,000-document knowledge base like the example above, that can mean roughly 600K fewer tokens per full retrieval pass. At $5 per million input tokens (GPT-4o), the savings compound with every query your system handles.
### Does markdown support schema validation like JSON Schema?
No. Markdown is free-form by design — there's no schema enforcement. This is fine for documents where flexibility is a feature. For structured data where field validation matters (required fields, data types, enums), JSON with JSON Schema is the better choice. This is another reason to use both formats in your system.
### What about XML — where does it fit?
XML is essentially a more verbose version of JSON for most AI use cases. It has higher token overhead than both markdown and JSON, and LLMs don't process it as naturally. Unless you're working with XML-native systems (SOAP APIs, legacy enterprise), convert XML content to markdown (for documents) or JSON (for structured data) before AI processing. Craft Markdown handles XML to markdown conversion →.