AI Guide • 2026

Markdown vs JSON for LLMs — Which Format Should You Use?

Both formats work with AI systems, but they serve fundamentally different purposes. Here's when to use each — and why the best systems use both.

When building an AI system — a RAG pipeline, a custom GPT, an LLM fine-tuning dataset, or an agent framework — one of the earliest and most consequential decisions is format. What format should your documents be in when they enter the system?

The two most common choices are markdown and JSON. Both work. Both have clear strengths. But they serve fundamentally different purposes, and choosing the wrong one for your content type wastes tokens, degrades retrieval quality, and produces worse AI outputs.

This guide breaks down when to use each, why, and how the best production systems combine both.

The short answer: Use markdown for documents (articles, reports, manuals, policies). Use JSON for structured data (API responses, database records, configuration). Use both when your system handles both content types.


Head-to-Head Comparison

| Dimension | Markdown | JSON |
|---|---|---|
| Token efficiency | Excellent — minimal syntax overhead | Medium — brackets, quotes, colons add up |
| Structure | Semantic — headings, lists, emphasis | Explicit — key-value pairs, nesting |
| Human readability | Excellent — readable by anyone | Technical — requires familiarity |
| Parsing | Line-based, regex-friendly | Standard JSON.parse() |
| Best content type | Documents, articles, guides, prose | Structured data, API responses, records |
| LLM comprehension | Excellent for prose and documents | Excellent for structured data |
| RAG chunking | Natural — split on headings | Harder — split on logical record boundaries |
| Embedding quality | High for document content | High for structured records |
| Schema validation | None (free-form) | JSON Schema available |
| AI-native format | LLMs output markdown by default | LLMs can output JSON with prompting |

The table tells a clear story: markdown wins for documents, JSON wins for structured data. The rest of this guide explains why — with token counts, RAG benchmarks, and practical guidance.


Token Efficiency: Why It Matters More Than You Think

Tokens are the fundamental unit of LLM processing. Every token costs money (API pricing), consumes context window space (how much the model can see at once), and affects processing speed (fewer tokens = faster responses). When your format adds tokens that carry no content value, you're paying for noise.

Side-by-Side Token Comparison

Example: A product description

Markdown version (~35 tokens):

## Widget Pro

Fast, reliable, **affordable**. Ships in 2 days.

- Lightweight aluminum design
- 5-year manufacturer warranty
- Available in 3 colors

JSON version (~75 tokens):

{
  "product": {
    "name": "Widget Pro",
    "tagline": "Fast, reliable, affordable. Ships in 2 days.",
    "features": [
      "Lightweight aluminum design",
      "5-year manufacturer warranty",
      "Available in 3 colors"
    ]
  }
}

The JSON version uses roughly 2x more tokens for identical information. The extra tokens come from braces, brackets, quotes, colons, and key names — none of which add content value for a document-style description.
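You can see the overhead directly by serializing the same content both ways. A minimal Python sketch, using character counts as a rough stand-in for token counts (syntax characters like braces, quotes, and colons typically become their own tokens, so the ratio is similar):

```python
import json

# The same product description in both formats.
markdown = """## Widget Pro

Fast, reliable, **affordable**. Ships in 2 days.

- Lightweight aluminum design
- 5-year manufacturer warranty
- Available in 3 colors
"""

data = {
    "product": {
        "name": "Widget Pro",
        "tagline": "Fast, reliable, affordable. Ships in 2 days.",
        "features": [
            "Lightweight aluminum design",
            "5-year manufacturer warranty",
            "Available in 3 colors",
        ],
    }
}
as_json = json.dumps(data, indent=2)

# Character counts as a rough proxy for tokens: the JSON version
# carries the same content plus braces, quotes, colons, and key names.
print(len(markdown), len(as_json))
```

For an exact comparison against a specific model, run both strings through that model's tokenizer instead of `len()`.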

Example: A data table

Markdown version (~50 tokens):

| Model | Parameters | Context | Cost |
|-------|-----------|---------|------|
| GPT-4o | 200B+ | 128K | $5/$15 |
| Claude 3.5 | Unknown | 200K | $3/$15 |
| Gemini 1.5 | Unknown | 1M | $3.50/$10.50 |

JSON version (~95 tokens):

[
  {"model": "GPT-4o", "parameters": "200B+", "context": "128K", "cost": "$5/$15"},
  {"model": "Claude 3.5", "parameters": "Unknown", "context": "200K", "cost": "$3/$15"},
  {"model": "Gemini 1.5", "parameters": "Unknown", "context": "1M", "cost": "$3.50/$10.50"}
]

For tabular data, JSON repeats key names for every record. Markdown states column headers once. At scale, the difference is dramatic.
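The repetition is easy to demonstrate in code. A minimal sketch that renders a list of records as a markdown pipe table, stating each column header exactly once (the records here are a trimmed version of the table above):

```python
def records_to_markdown_table(records):
    """Render a list of dicts as a markdown pipe table.

    Column headers are stated once; a JSON serialization would
    repeat every key name for every record.
    """
    headers = list(records[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for rec in records:
        lines.append("| " + " | ".join(str(rec[h]) for h in headers) + " |")
    return "\n".join(lines)

models = [
    {"model": "GPT-4o", "context": "128K", "cost": "$5/$15"},
    {"model": "Claude 3.5", "context": "200K", "cost": "$3/$15"},
]
print(records_to_markdown_table(models))
```

With 1,000 records instead of 2, the markdown version still spells out each key name once, while JSON spells out each one 1,000 times.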

The Cost Impact

For a knowledge base with 1,000 documents:

  • Markdown: ~500K tokens total
  • JSON: ~1.1M tokens total
  • Savings: 55% fewer tokens with markdown

At OpenAI's GPT-4o pricing ($5 per million input tokens), that's $3.00 saved per full pass through the knowledge base ($2.50 versus $5.50). For RAG systems making thousands of retrievals per day, the savings compound quickly.
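The arithmetic is simple enough to sketch as a small cost helper. The prices and token counts are the figures from the text above, not live API pricing:

```python
PRICE_PER_MILLION = 5.00  # GPT-4o input pricing in USD, per the text above


def pass_cost(tokens, price_per_million=PRICE_PER_MILLION):
    """Cost in USD of sending `tokens` input tokens through the model."""
    return tokens / 1_000_000 * price_per_million


markdown_cost = pass_cost(500_000)    # $2.50 per full pass
json_cost = pass_cost(1_100_000)      # $5.50 per full pass
print(f"savings per full pass: ${json_cost - markdown_cost:.2f}")  # $3.00
```

Multiply by retrievals per day to estimate the ongoing difference for your own workload.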


Structure and LLM Comprehension

How Markdown Structures Content

Markdown uses lightweight semantic markers that map directly to document structure:

  • # headings create hierarchy — the model understands that content under ## Methods belongs to the Methods section
  • - lists create enumeration — features, steps, requirements
  • **bold** creates emphasis — key terms, important concepts
  • | tables organize data within documents
  • Code fences isolate code from prose

These markers are minimal (1-3 characters each) but carry strong semantic signal. LLMs understand markdown structure natively because they've been trained on billions of markdown documents from GitHub, documentation sites, and knowledge bases.

How JSON Structures Content

JSON uses explicit key-value pairs with strict syntax:

  • Keys define meaning ("title", "body", "sections")
  • Values hold content (strings, arrays, objects)
  • Nesting creates hierarchy through object composition
  • Arrays create ordered collections
  • Schema can enforce and validate structure

JSON structure is unambiguous and machine-parseable, but the structural syntax itself — every brace, bracket, quote, colon, and comma — consumes tokens without adding content.

Which Structure Do LLMs Understand Better?

For documents: Markdown. LLMs are trained on massive amounts of markdown — GitHub READMEs, documentation sites, wikis, forums. They understand that ## Results introduces a results section, that - item is a list entry, and that **important** signals emphasis. This understanding is baked into their training, not something you need to prompt for.

For structured data: JSON. LLMs understand JSON well because they're trained on code, API documentation, and configuration files. When your content is inherently structured data — product records, user profiles, transaction logs — JSON's explicit key-value format is the right representation.

The key insight: LLMs process both formats well, but they process each format best for its natural content type. Forcing documents into JSON or data into markdown degrades comprehension.


RAG Performance: Where Format Choice Has the Biggest Impact

In RAG (Retrieval-Augmented Generation) systems, format choice directly affects three critical metrics: chunking quality, embedding accuracy, and retrieval precision.

Chunking

Effective chunking — splitting documents into semantically coherent pieces for vector storage — is the foundation of RAG performance.

Markdown advantage:

Headers create natural, semantic chunk boundaries. Split at ## or ### to get topic-coherent chunks where each chunk maps to a specific section or concept. The heading itself serves as a built-in summary for the chunk.

## Installation
[300 words about installing the software]

## Configuration
[400 words about configuring settings]

## Troubleshooting
[350 words about common problems]

Three natural chunks, each semantically coherent, each self-describing via its heading.
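A heading-based splitter can be sketched in a few lines of Python. This is an illustrative regex approach, not a production chunker (for instance, it does not skip heading-like lines inside code fences):

```python
import re


def chunk_by_headings(markdown, level=2):
    """Split markdown at headings of the given level (## by default).

    Each chunk keeps its heading, so every chunk is self-describing.
    """
    pattern = re.compile(rf"^(#{{{level}}} .+)$", re.MULTILINE)
    # split() with a capture group yields:
    # [preamble, heading1, body1, heading2, body2, ...]
    pieces = pattern.split(markdown)
    return [
        (pieces[i] + pieces[i + 1]).strip()
        for i in range(1, len(pieces), 2)
    ]


doc = """## Installation
Install the software.

## Configuration
Configure settings.

## Troubleshooting
Fix common problems.
"""
for chunk in chunk_by_headings(doc):
    print(chunk.splitlines()[0])
```

Each resulting chunk starts with its heading, which doubles as a built-in summary when the chunk is embedded and retrieved.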

JSON challenge:

JSON has no inherent section boundaries for document content. If your document is stored as a single JSON string, you're back to arbitrary character-based splitting. If it's structured as nested objects, splitting requires understanding the schema to avoid breaking logical units.

For structured records (arrays of objects), each object is a natural chunk — this works well for product catalogs and user records, but poorly for long-form prose.
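Record-boundary chunking for a JSON array can be sketched like this; the catalog and its field names (`sku`, `name`, `price`) are hypothetical:

```python
import json

catalog = json.loads("""[
  {"sku": "W-100", "name": "Widget Pro", "price": 49.0},
  {"sku": "W-200", "name": "Widget Max", "price": 89.0}
]""")

# For arrays of records, each object is a natural chunk: serialize
# one record per chunk instead of splitting on character counts.
chunks = [json.dumps(record, sort_keys=True) for record in catalog]
```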

Embedding Quality

Vector embeddings represent the semantic meaning of text chunks. Noise in the text pollutes the embedding.

Markdown produces cleaner embeddings for documents:

  • Pure content with minimal syntax noise
  • Heading context enriches the semantic representation ("## Pricing" + pricing content = focused embedding)
  • Lists and tables are semantically clear

JSON has a trade-off:

  • Key names add useful context ("title": "..." adds meaning)
  • But structural syntax ({, }, [, ], :, ") adds noise to embeddings
  • Works well for individual records, poorly for long-form content

Retrieval Accuracy

Benchmarks from RAG practitioners consistently show:

  • Markdown chunks retrieve more accurately for document-based queries
  • Heading-based chunks match user questions better than arbitrarily split text
  • Up to 35% improvement in retrieval precision when using markdown vs raw PDF text
  • JSON records retrieve well for field-specific, structured queries

The format you choose for your knowledge base isn't a minor implementation detail — it's a primary driver of how well your AI system answers questions.


When to Use Markdown for LLMs

Use markdown when your content is:

  • Documents — reports, articles, guides, manuals, policies, handbooks
  • Long-form prose with sections, subsections, and logical flow
  • Knowledge base articles for RAG systems and custom GPTs
  • Training data from books, papers, documentation, or web content
  • Conversational context for ChatGPT, Claude, or Gemini interactions

Real-world examples:

  • Company policy documents for an internal HR chatbot
  • Technical documentation for a developer support RAG system
  • Research papers for an AI-powered literature review tool
  • Product manuals for a customer support knowledge base
  • Blog content for a content analysis or SEO pipeline

How to get your documents into markdown:

Most source documents are in PDF, Word, HTML, or other formats. Convert them to clean markdown before ingestion using Craft Markdown — it handles PDF, Word, HTML, Excel, CSV, JSON, and more, all in your browser with no file uploads.


When to Use JSON for LLMs

Use JSON when your content is:

  • Structured records — user profiles, product catalogs, transaction logs
  • API responses being processed or analyzed by AI
  • Configuration data or application metadata
  • Tabular data that's inherently key-value (not prose)
  • LLM output format for function calling, tool use, and structured responses

Real-world examples:

  • Product catalog for an e-commerce AI shopping assistant
  • User profile data for a personalized recommendation system
  • API response parsing in an AI-powered monitoring dashboard
  • Structured output from LLM function calling and agent tools
  • Database query results for a natural language to SQL system

The Hybrid Approach: What Production Systems Actually Do

The best production AI systems don't choose one format — they use both, matching format to content type.

The Architecture

  1. Documents → Markdown. Convert PDFs, Word docs, HTML pages, and other document content to markdown. Store in your vector database with heading-based chunking. Use for conceptual questions, policy lookups, troubleshooting, and any query that requires understanding prose.

  2. Data → JSON. Keep structured data — user records, product information, transaction history, configuration — in JSON format. Use for precise field-level lookups, filtering, and any query that needs specific data points.

  3. Combined retrieval. Your RAG system searches markdown chunks for conceptual, open-ended questions ("How do I configure SSO?") and JSON records for data lookups ("What's the price of Widget Pro in the enterprise tier?").
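The routing step can be sketched naively with keyword matching. Production systems typically route with a small classifier or an LLM call; the keyword list below is purely illustrative:

```python
def route_query(query, data_fields=("price", "employee", "location", "tier")):
    """Naive router: send field-lookup queries to the JSON store,
    everything else to the markdown document index.

    Real systems usually route with a trained classifier or an LLM;
    this keyword heuristic only illustrates the two-store architecture.
    """
    q = query.lower()
    if any(field in q for field in data_fields):
        return "json_records"
    return "markdown_chunks"


print(route_query("How do I configure SSO?"))          # markdown_chunks
print(route_query("What's the price of Widget Pro?"))  # json_records
```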

Example: Enterprise Knowledge Base

/knowledge-base/
  /documents/          ← Markdown
    hr-policies.md
    engineering-handbook.md
    product-specs.md
    meeting-notes-2026-q1.md
  /data/               ← JSON
    employees.json
    products.json
    pricing.json
    office-locations.json

The AI assistant retrieves from /documents/ when asked "What's our remote work policy?" and from /data/ when asked "How many employees are in the London office?"

This hybrid approach maximizes both retrieval quality and token efficiency by using the right format for the right content type.

Convert your documents to AI-ready markdown →


Common Mistakes to Avoid

Mistake 1: Storing Documents as JSON Strings

{
  "title": "Company Handbook",
  "content": "# Welcome to Acme Corp\n\nThis handbook covers...\n\n## Code of Conduct\n\n..."
}

This is the worst of both worlds. You pay the JSON overhead for the wrapper, and the actual content is just a markdown string stuffed into a JSON value. Extract the markdown and store it directly.
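The fix takes a few lines: parse the wrapper once, pull out the markdown string, and store it as a standalone .md file. This sketch mirrors the handbook example above (the field names and content are illustrative):

```python
import json

wrapped = """{
  "title": "Company Handbook",
  "content": "# Welcome to Acme Corp\\n\\nThis handbook covers onboarding.\\n\\n## Code of Conduct\\n\\nBe excellent."
}"""

record = json.loads(wrapped)
# Pull the markdown out of the JSON wrapper and store it directly;
# prepend the title as a top-level heading if one isn't already there.
markdown = record["content"]
if not markdown.startswith("# "):
    markdown = f"# {record['title']}\n\n{markdown}"
print(markdown.splitlines()[0])  # → # Welcome to Acme Corp
```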

Mistake 2: Converting Structured Data to Markdown Prose

If your source data is an array of product records with 15 fields each, don't convert it to a markdown table or narrative prose. JSON's explicit key-value structure is the right representation for structured data.

Mistake 3: Using One Format for Everything

"We standardized on JSON for our entire knowledge base" — this means your document content is either stored as JSON strings (see Mistake 1) or restructured into key-value pairs that lose the natural flow and section hierarchy. Use the right format for the right content.

Mistake 4: Ignoring Token Costs at Scale

For a prototype with 10 documents, format choice barely matters. For a production system with 10,000 documents and thousands of daily queries, the 40-60% token overhead of JSON for document content translates to real money and real context window pressure.


Key Takeaways

  1. Markdown wins for documents. Articles, reports, guides, manuals, policies — any content that's primarily prose with sections and structure. Up to 55% fewer tokens than JSON.

  2. JSON wins for structured data. Records, API responses, catalogs, configurations — any content that's inherently key-value with explicit fields. Schema validation and explicit structure are valuable here.

  3. Token efficiency translates to cost savings. At scale, markdown's minimal syntax overhead saves real money on API costs and fits more content into context windows.

  4. RAG performance depends on format choice. Heading-based markdown chunking outperforms arbitrary text splitting by up to 35% in retrieval accuracy. JSON records chunk naturally at object boundaries.

  5. The best systems use both. Documents in markdown, data in JSON. Match the format to the content type, not to organizational habit or tooling convenience.

Don't force documents into JSON or data into markdown. Use the right format for the right content, and your AI system will produce better results with lower costs.

Convert your documents to AI-ready markdown — free and private →


Frequently Asked Questions

Can LLMs process both markdown and JSON equally well?

Yes, modern LLMs handle both formats natively. GPT-4o, Claude, Gemini, Llama, and Mistral all understand markdown structure and JSON syntax. The choice should be based on your content type, not model capability. Use markdown for documents and JSON for structured data.

What about YAML — is it better than JSON for LLMs?

YAML is more readable than JSON and slightly more token-efficient (no quotes around keys, no braces). It's a solid choice for configuration files and structured metadata. But for document content, markdown is still the best choice. For structured data in production systems, JSON's strict parsing and schema validation make it more reliable than YAML.

Should I convert JSON knowledge bases to markdown?

Only if your JSON contains long-form document content stored as strings. If your JSON is structured data (product records, user profiles), keep it as JSON — that's its strength. If you have articles or guides stored as JSON objects with a "content" field containing prose, extract that content as standalone markdown files.

What format does ChatGPT output by default?

ChatGPT, Claude, and Gemini all output markdown by default — headings, bold text, lists, code blocks, tables. This is a strong signal that LLMs are optimized for markdown processing. They don't output JSON unless specifically prompted to (via function calling or explicit instructions).

How much do I actually save by using markdown instead of JSON for documents?

Typically 40-60% fewer tokens for document content. For a 1,000-document knowledge base, that can mean roughly 600K fewer tokens per full retrieval pass (500K versus 1.1M). At $5 per million input tokens (GPT-4o), the savings compound with every query your system handles.

Does markdown support schema validation like JSON Schema?

No. Markdown is free-form by design — there's no schema enforcement. This is fine for documents where flexibility is a feature. For structured data where field validation matters (required fields, data types, enums), JSON with JSON Schema is the better choice. This is another reason to use both formats in your system.

What about XML — where does it fit?

XML is essentially a more verbose version of JSON for most AI use cases. It has higher token overhead than both markdown and JSON, and LLMs don't process it as naturally. Unless you're working with XML-native systems (SOAP APIs, legacy enterprise), convert XML content to markdown (for documents) or JSON (for structured data) before AI processing. Craft Markdown handles XML to markdown conversion →.

Convert your documents to AI-ready markdown

PDF, Word, HTML, CSV, JSON, Excel, and more. Free, private, instant. The right format for better AI results.

Open the Converter