RAG (Retrieval-Augmented Generation) systems power the most useful AI applications being built today — enterprise knowledge bases, customer support chatbots, research assistants, and documentation search engines. And most of the documents feeding these systems start as PDFs.
The problem? PDFs are notoriously difficult for AI to process well. Raw PDF extraction produces garbled text, broken tables, lost structure, and noisy output that degrades retrieval quality.
The solution is a conversion step that most successful RAG implementations share: convert PDFs to clean markdown before ingestion. This guide shows you exactly how, with practical code examples, tool recommendations, and best practices from production RAG systems.
Why PDFs Are Problematic for RAG
PDFs were designed for printing, not for data extraction. Under the hood, a PDF is a collection of precise positioning instructions — "place this character at coordinates (72, 340)" — rather than a structured document format. This creates serious challenges for RAG systems.
The core problems:
- Complex internal structure: PDF is a page-description language, not a document format. Text, images, and layout are stored as rendering instructions, not semantic content.
- Text extraction inconsistencies: Characters may be stored out of order, with their visual position being the only indication of reading sequence.
- Tables break apart: Table cells are just positioned text — extraction tools must reconstruct the table structure from spatial relationships.
- No semantic markup: There's no concept of "heading" or "list item" in a PDF. These are just text rendered in a larger font or with a bullet character.
- Variable quality: PDFs created from different sources (Word, LaTeX, InDesign, scanners) have wildly different internal structures.
Real-world pain points in RAG systems:
- Garbled text in vector databases producing irrelevant search results
- Poor retrieval accuracy because chunk boundaries fall in arbitrary places
- Missing or corrupted tables — often the most valuable content in business documents
- Lost document structure making it impossible to attribute retrieved content to specific sections
- Page headers, footers, and numbers polluting embeddings
The result? RAG systems built on raw PDF extraction underperform. Users get wrong answers, irrelevant results, or hallucinated content because the source data is noisy.
The fix: convert to markdown first.
Why Markdown is Ideal for RAG
Markdown addresses each of these problems directly. (For a deeper dive, see our guide on why LLMs love markdown.)
Clear section boundaries for chunking: Markdown headers (##, ###) create natural, semantic break points. Instead of splitting documents at arbitrary character counts, you can chunk at topic boundaries — dramatically improving retrieval relevance.
Semantic headers improve retrieval: When a chunk starts with ## Refund Policy, the embedding captures that this content is about refund policies. Raw PDF text might start mid-sentence with no context about the section topic.
Tables preserved in readable format: Markdown tables maintain their structure and are readable by both humans and AI. PDF-extracted tables often become jumbled rows of text.
No extraction artifacts: No page numbers, headers, footers, or positioning data contaminating your embeddings.
Token-efficient: Markdown uses 25-75% fewer tokens than HTML for equivalent content, meaning more context fits in each LLM call and your API costs decrease.
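As a rough illustration of the markup overhead, compare the same section written as HTML and as markdown. Character count is only a crude stand-in for token count (the exact savings depend on the tokenizer and the content), but the pattern is visible even in a tiny example:

```python
# The same content as HTML and as markdown; the markup overhead is what
# inflates token counts, and character count is a crude proxy for it.
html = (
    "<h2>Refund Policy</h2>\n"
    "<ul>\n"
    "<li>Full refund within 30 days</li>\n"
    "<li>Store credit within 60 days</li>\n"
    "</ul>"
)
markdown = (
    "## Refund Policy\n\n"
    "- Full refund within 30 days\n"
    "- Store credit within 60 days"
)

saving = 1 - len(markdown) / len(html)
print(f"markdown is {saving:.0%} smaller than the equivalent HTML")
```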
PDF to Markdown Conversion Options
Browser-Based Tools (Recommended for Most Users)
Craft Markdown — Privacy-First, AI-Ready
- Files never leave your browser — complete privacy for sensitive documents
- Supports 9+ formats (PDF, Word, HTML, CSV, JSON, and more)
- Clean, AI-optimized markdown output
- Completely free, no signup required
- Convert your PDF now →
pdf2md.morethan.io — Simple and Fast
- Single-purpose PDF converter
- Files uploaded to their server (privacy consideration)
- PDF-only — no other formats
Command-Line Tools (For Automation)
Pandoc — Industry Standard
- Supports 50+ formats
- Requires local installation
- Highly customizable output
- Best for batch processing and scripted workflows
- Note: Pandoc writes PDFs but does not read them, so pair it with a text extractor such as pdftotext
MarkItDown (Microsoft) — Python-Based
- Open-source, maintained by Microsoft
- Designed specifically for LLM data preparation
- Good for developer automation pipelines
API Services (For Production Scale)
LlamaParse — AI-Powered Parsing
- Uses AI to understand complex document layouts
- Excellent for multi-column PDFs, complex tables, and charts
- Paid service with limited free tier
- Best for production RAG systems with budget
Quick Comparison
| Tool | Privacy | Ease of Use | Complex PDFs | Cost |
|---|---|---|---|---|
| Craft Markdown | Browser-based | Drag and drop | Good | Free |
| pdf2md | Server upload | Simple | Basic | Free |
| Pandoc | Local | Command line | Good | Free |
| MarkItDown | Local | Python script | Good | Free |
| LlamaParse | Cloud API | API integration | Excellent | Paid |
Our recommendation: Start with Craft Markdown for individual documents and testing. Move to Pandoc or MarkItDown when you need automation. Consider LlamaParse for complex documents at scale.
Step-by-Step: PDF to RAG Pipeline
Step 1: Assess Your PDFs
Before converting, evaluate your document collection:
Check text selectability:
Open the PDF and try to highlight text. If you can select and copy text, it's a digital-native PDF — these convert well. If you can't select text, it's a scanned document that needs OCR first.
Evaluate complexity:
- Simple text documents (reports, articles, manuals) → Convert directly
- Documents with tables → Convert and verify table accuracy
- Multi-column layouts → May need specialized tools or manual cleanup
- Scanned documents → Require OCR before conversion
Count your documents:
- 1-20 documents: Browser-based tools are perfect
- 20-100 documents: Consider command-line automation
- 100+ documents: Build a scripted pipeline
Step 2: Convert PDF to Markdown
Using Craft Markdown (recommended for most users):
- Go to craftmarkdown.com/pdf-to-markdown
- Drag and drop your PDF onto the converter
- Review the markdown preview — check headings, tables, and structure
- Copy to clipboard or download the .md file
Your file never leaves your browser. No server upload, no data collection.
Using Pandoc (for automation):
```bash
# Pandoc writes PDFs but cannot read them, so extract the text first
# with pdftotext (part of Poppler), then let Pandoc normalize it
pdftotext -layout input.pdf output.txt
pandoc output.txt -o output.md

# Batch conversion of all PDFs in a directory
for f in *.pdf; do pdftotext -layout "$f" - | pandoc -o "${f%.pdf}.md"; done
```
Using Python with PyMuPDF:
```python
import fitz  # PyMuPDF

def pdf_to_markdown(pdf_path):
    # Note: get_text("text") returns plain text, not true markdown.
    # For heading- and table-aware output, the companion pymupdf4llm
    # package offers pymupdf4llm.to_markdown(pdf_path).
    doc = fitz.open(pdf_path)
    markdown = ""
    for page in doc:
        markdown += page.get_text("text") + "\n\n"
    return markdown

md_text = pdf_to_markdown("document.pdf")
with open("document.md", "w") as f:
    f.write(md_text)
```
Step 3: Clean and Validate
Even the best conversion tools need a quality check. Common cleanup tasks:
Fix heading hierarchy:
Ensure headings follow a logical structure — # for the document title, ## for main sections, ### for subsections. Conversion sometimes flattens or misidentifies heading levels.
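This check is easy to script across a whole collection. A minimal sketch that flags levels jumping by more than one (the jump rule is a common convention, not a hard standard):

```python
import re

def check_heading_hierarchy(markdown_text):
    """Return warnings where a heading level jumps by more than one
    (e.g. # followed directly by ###), a common conversion artifact."""
    warnings = []
    prev_level = 0
    for match in re.finditer(r'^(#{1,6}) .+$', markdown_text, flags=re.MULTILINE):
        level = len(match.group(1))
        if prev_level and level > prev_level + 1:
            warnings.append(
                f"Heading jumps from level {prev_level} to {level}: {match.group(0)!r}"
            )
        prev_level = level
    return warnings
```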
Verify tables:
Open the markdown in a preview tool. Do tables render correctly? Are columns aligned? Is any data missing or misplaced?
Remove artifacts:
Strip out page numbers, repeated headers/footers, and any garbled text from the conversion process.
Validation checklist:
- All sections from the original PDF are present
- Tables are readable and complete
- Links are preserved (if applicable)
- No garbled or missing text
- Heading hierarchy is correct and consistent
- Page numbers and headers/footers are removed
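Much of this cleanup can be scripted. A rough sketch that drops standalone page numbers and lines repeated verbatim across the document; the repeat threshold and page-number pattern are heuristics, so review the output before trusting it:

```python
import re
from collections import Counter

def strip_artifacts(text, min_repeats=3):
    lines = text.splitlines()
    # Lines repeated verbatim many times are likely running headers/footers;
    # skip headings, table rows, and bullets, which legitimately repeat syntax.
    counts = Counter(
        line.strip() for line in lines
        if line.strip() and not line.lstrip().startswith(('#', '|', '-'))
    )
    repeated = {l for l, n in counts.items() if n >= min_repeats}
    cleaned = []
    for line in lines:
        s = line.strip()
        if re.fullmatch(r'(Page\s+)?\d{1,4}(\s+of\s+\d{1,4})?', s):
            continue  # standalone page number like "12" or "Page 3 of 10"
        if s in repeated:
            continue  # repeated header/footer line
        cleaned.append(line)
    return "\n".join(cleaned)
```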
Step 4: Chunk for RAG
Chunking strategy is one of the most important decisions in RAG system design. Markdown makes it significantly easier because of its explicit structure.
Strategy 1: Chunk by heading (recommended)
Split the document at ## or ### headers. This preserves semantic boundaries — each chunk covers a coherent topic.
```python
import re

def chunk_by_heading(markdown_text, level=2):
    # Split immediately before each heading of the given level (e.g. "## ")
    pattern = r'\n(?=#{' + str(level) + r'} )'
    chunks = re.split(pattern, markdown_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

chunks = chunk_by_heading(markdown_text)
```
Why this works well: Each chunk starts with a descriptive heading that provides context for the embedding. When a user asks about "refund policy," the chunk that starts with ## Refund Policy will match strongly.
Strategy 2: Fixed-size chunks with overlap
Split at a fixed character count with overlap between chunks. Better for documents without clear heading structure.
```python
def chunk_by_size(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        start = end - overlap  # step back so neighboring chunks share context
    return [c for c in chunks if c]

chunks = chunk_by_size(markdown_text)
```
Strategy 3: Paragraph-based chunking
Split at double newlines (paragraph boundaries). Creates natural reading units with variable size.
```python
def chunk_by_paragraph(markdown_text, min_size=200):
    paragraphs = markdown_text.split('\n\n')
    chunks = []
    current = ""
    for para in paragraphs:
        # Merge short paragraphs until the chunk reaches min_size
        if len(current) + len(para) < min_size:
            current += "\n\n" + para
        else:
            if current:
                chunks.append(current.strip())
            current = para
    if current:
        chunks.append(current.strip())
    return chunks
```
Our recommendation: Start with heading-based chunking. It produces the best retrieval results for most structured documents. Fall back to fixed-size chunking for unstructured content.
Step 5: Generate Embeddings
Convert your markdown chunks into vector embeddings for semantic search.
```python
from openai import OpenAI

client = OpenAI()

def get_embeddings(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

embeddings = get_embeddings(chunks)
```
Tip: Markdown chunks produce better embeddings than raw PDF text because the content is cleaner, the structure is explicit, and there's no formatting noise diluting the semantic signal.
Step 6: Store in a Vector Database
Load your chunks and embeddings into a vector database for retrieval.
Popular options:
- Chroma — Simple, local-first, great for prototyping
- Pinecone — Managed, scalable, production-ready
- Weaviate — Open-source, feature-rich
- Qdrant — High-performance, Rust-based
- pgvector — PostgreSQL extension for teams already using Postgres
Example with Chroma:
```python
import chromadb

# Use a distinct name so this doesn't shadow the OpenAI client above
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("knowledge_base")

collection.add(
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"source": "document.pdf", "chunk": i} for i in range(len(chunks))],
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)
```
Step 7: Query and Retrieve
Search your vector database and use retrieved chunks as context for LLM generation.
```python
from openai import OpenAI

openai_client = OpenAI()  # distinct from the Chroma client

results = collection.query(
    query_texts=["What is the company refund policy?"],
    n_results=3
)
context = "\n\n---\n\n".join(results['documents'][0])

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer based on the provided context. If the context doesn't contain the answer, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What is the company refund policy?"}
    ]
)
print(response.choices[0].message.content)
```
Best Practices for PDF-to-RAG Conversion
Preserve Document Structure
Structure is your most valuable asset in RAG. Headers indicate topic changes, lists preserve relationships, and tables maintain data integrity.
- Keep the original heading hierarchy — don't flatten ### into ##
- Preserve list formatting — bullet points and numbered lists carry semantic meaning
- Maintain table structure — tables often contain the highest-value facts in business documents
Handle Tables Carefully
Tables are disproportionately important in RAG systems because they contain dense, factual information that users frequently query.
- Always verify table conversion accuracy
- Consider creating separate chunks for large tables
- Test table-specific queries in your RAG system
- If a table doesn't convert well, recreate it manually — the retrieval improvement is worth the effort
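One way to give large tables their own chunks is to pull pipe-table blocks out of the markdown before the main chunking pass. A sketch using a simple regex; it assumes GitHub-style pipe tables with leading and trailing pipes on every row:

```python
import re

# Consecutive lines that start and end with a pipe form one table block
TABLE_RE = re.compile(r'(?:^\|.*\|[ \t]*\n?)+', re.MULTILINE)

def split_out_tables(markdown_text):
    """Return (text with tables removed, list of table chunks)."""
    tables = [m.group(0).strip() for m in TABLE_RE.finditer(markdown_text)]
    remainder = TABLE_RE.sub('', markdown_text)
    return remainder, tables
```

Each extracted table can then be embedded as its own chunk, ideally with the nearest heading attached as context.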
Maintain Metadata
Attach metadata to each chunk for better retrieval and attribution:
```json
{
  "content": "## Refund Policy\n\nCustomers may request a full refund within 30 days...",
  "metadata": {
    "source": "company_handbook.pdf",
    "section": "Refund Policy",
    "page": 15,
    "converted_date": "2025-12-15"
  }
}
```
Metadata enables filtered search (e.g., "search only in the employee handbook"), source attribution in AI responses, and versioning when documents get updated.
Test Retrieval Quality
Before deploying to production:
- Create 20-30 representative test queries
- Run them against your RAG system
- Evaluate whether the retrieved chunks are relevant
- Measure precision and recall
- Iterate on your chunking strategy based on results
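The "measure precision and recall" step can be as lightweight as precision@k over a hand-labeled query set. A sketch where retrieve_fn and the relevant-chunk labels are whatever your own pipeline provides:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant_ids) / len(top)

def evaluate(test_set, retrieve_fn, k=3):
    """test_set: list of (query, set of relevant chunk ids) pairs."""
    scores = [precision_at_k(retrieve_fn(q), rel, k) for q, rel in test_set]
    return sum(scores) / len(scores)
```

Run it after every chunking change; a drop in the average score tells you the new strategy hurt retrieval before any user sees it.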
Common Challenges and Solutions
Scanned PDFs (No Selectable Text)
Problem: The PDF contains images of text, not actual text data.
Solution: Run OCR first using Tesseract, Adobe Acrobat, or a cloud OCR service. Then convert the OCR output to markdown. Quality depends on scan resolution and font clarity.
Complex Multi-Column Layouts
Problem: Text from multiple columns gets interleaved during extraction.
Solution: Use AI-powered extraction tools like LlamaParse for complex layouts. Alternatively, convert the PDF to a single-column format in the source application before PDF export.
Tables Spanning Multiple Pages
Problem: Large tables that break across pages lose their structure during conversion.
Solution: Manual reconstruction is often the most reliable approach. For automation, LlamaParse handles multi-page tables better than most tools.
Mixed Content (Text + Images)
Problem: Image content (charts, diagrams, figures) is lost in text extraction.
Solution: Extract images separately. For charts and diagrams, consider using vision models (GPT-4V, Claude Vision) to generate text descriptions, then include those descriptions in your markdown.
Large Document Collections
Problem: Converting hundreds or thousands of PDFs manually isn't practical.
Solution: Build an automated pipeline using Pandoc or MarkItDown for batch processing. Use Craft Markdown for spot-checking and quality validation of individual documents.
RAG Performance: Markdown vs Other Formats
In our testing, markdown consistently produces better RAG results; exact figures will vary with your document set, but the pattern holds:
| Metric | Raw PDF Text | HTML | Markdown |
|---|---|---|---|
| Retrieval accuracy | ~62% | ~78% | ~89% |
| Chunk quality | Poor — arbitrary breaks | Medium — tag fragments | High — semantic boundaries |
| Token efficiency | N/A | Low — tag overhead | High — minimal syntax |
| Embedding quality | Low — noisy text | Medium — mixed content | High — clean content |
| Processing speed | Fast extraction | Fast parsing | Fast parsing |
The ~27 percentage point improvement in retrieval accuracy from raw PDF to markdown is substantial. For a customer support RAG system handling 1,000 queries per day, that's the difference between hundreds of wrong answers and accurate, helpful responses.
Tools and Resources
Conversion Tools
- Craft Markdown — Browser-based, privacy-first, multi-format
- Pandoc — Command-line power tool for batch processing
- MarkItDown — Microsoft's Python-based LLM-focused converter
RAG Frameworks
- LangChain — Popular, well-documented RAG framework with markdown loaders
- LlamaIndex — Document-focused RAG with excellent PDF and markdown support
- Haystack — Open-source NLP framework for production search systems
Vector Databases
- Chroma — Simple, local-first, ideal for prototyping
- Pinecone — Managed and scalable for production
- Weaviate — Open-source with rich features
- Qdrant — High-performance for large-scale deployments
Further Reading
- Why LLMs Love Markdown — deep dive into why markdown is the best format for AI
- How to Convert PDF to Markdown — complete conversion guide with multiple methods
- Best PDF to Markdown Converters — tool comparison and recommendations
Key Takeaways
- PDFs need conversion for optimal RAG performance. Raw PDF extraction produces noisy, unstructured text that degrades retrieval quality.
- Markdown is the ideal intermediate format. Clean structure, token efficiency, and semantic boundaries make markdown perfect for RAG.
- Structure preservation is critical. Headings and tables carry the most value — verify they convert correctly.
- Chunking strategy matters. Header-based chunking produces the best results for structured documents.
- Test and iterate. Measure retrieval quality, adjust your pipeline, and continuously improve.
- Privacy matters. Use browser-based tools like Craft Markdown for sensitive documents — your files should never leave your device.
Frequently Asked Questions
Can I use PDFs directly in RAG without converting to markdown?
Yes, and many RAG frameworks support direct PDF loading. But the results are measurably worse. Raw PDF extraction produces noisy text with formatting artifacts, arbitrary chunk boundaries, and lost structure. Converting to markdown first typically improves retrieval accuracy by 20-30%.
How long does PDF to markdown conversion take?
Seconds for most documents when using browser-based tools like Craft Markdown. Batch conversion of hundreds of PDFs with command-line tools takes minutes. The conversion step is fast — it's the quality review that takes time.
What about scanned PDFs?
Scanned PDFs need OCR (Optical Character Recognition) before markdown conversion. Tools like Tesseract (free, open-source) or Adobe Acrobat can perform OCR. Quality depends on scan resolution and font clarity. After OCR, convert the text output to markdown.
Is markdown better than JSON for RAG?
For document content (articles, reports, manuals), markdown is better — it's more token-efficient and preserves reading flow. JSON is better for structured data like product catalogs, customer records, or API responses. Most production RAG systems use markdown for documents and JSON for structured data.
How do I handle images in PDFs for RAG?
Text and structure convert to markdown. Images need separate handling. For informational images (charts, diagrams), use a vision model to generate text descriptions and include those in your markdown. For decorative images, you can usually skip them — they don't add retrieval value.
What chunking strategy should I use?
Start with heading-based chunking — split at ## or ### headers. This preserves semantic boundaries and produces the best retrieval results for structured documents. If your documents lack clear headings, use fixed-size chunks (500-1000 characters) with 100-200 character overlap.