RAG (Retrieval-Augmented Generation) systems power the most useful AI applications being built today — enterprise knowledge bases, customer support chatbots, research assistants, and documentation search engines. And most of the documents feeding these systems start as PDFs.
The problem? PDFs are notoriously difficult for AI to process well. Raw PDF extraction produces garbled text, broken tables, lost structure, and noisy output that degrades retrieval quality.
The solution is a conversion step that most successful RAG implementations share: convert PDFs to clean markdown before ingestion. This guide shows you exactly how, with practical code examples, tool recommendations, and best practices from production RAG systems.
Why PDFs Are Problematic for RAG
PDFs were designed for printing, not for data extraction. Under the hood, a PDF is a collection of precise positioning instructions — "place this character at coordinates (72, 340)" — rather than a structured document format. This creates serious challenges for RAG systems.
The core problems:
- Complex internal structure: PDF is a page-description language, not a document format. Text, images, and layout are stored as rendering instructions, not semantic content.
- Text extraction inconsistencies: Characters may be stored out of order, with their visual position being the only indication of reading sequence.
- Tables break apart: Table cells are just positioned text — extraction tools must reconstruct the table structure from spatial relationships.
- No semantic markup: There's no concept of "heading" or "list item" in a PDF. These are just text rendered in a larger font or with a bullet character.
- Variable quality: PDFs created from different sources (Word, LaTeX, InDesign, scanners) have wildly different internal structures.
Real-world pain points in RAG systems:
- Garbled text in vector databases producing irrelevant search results
- Poor retrieval accuracy because chunk boundaries fall in arbitrary places
- Missing or corrupted tables — often the most valuable content in business documents
- Lost document structure making it impossible to attribute retrieved content to specific sections
- Page headers, footers, and numbers polluting embeddings
The result? RAG systems built on raw PDF extraction underperform. Users get wrong answers, irrelevant results, or hallucinated content because the source data is noisy.
The fix: convert to markdown first.
Why Markdown is Ideal for RAG
Markdown addresses each of these problems directly. (For a deeper dive, see our guide on why LLMs love markdown.)
Clear section boundaries for chunking: Markdown headers (##, ###) create natural, semantic break points. Instead of splitting documents at arbitrary character counts, you can chunk at topic boundaries — dramatically improving retrieval relevance.
Semantic headers improve retrieval: When a chunk starts with ## Refund Policy, the embedding captures that this content is about refund policies. Raw PDF text might start mid-sentence with no context about the section topic.
Tables preserved in readable format: Markdown tables maintain their structure and are readable by both humans and AI. PDF-extracted tables often become jumbled rows of text.
No extraction artifacts: No page numbers, headers, footers, or positioning data contaminating your embeddings.
Token-efficient: Markdown uses 25-75% fewer tokens than HTML for equivalent content, meaning more context fits in each LLM call and your API costs decrease.
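As a rough illustration of the markup overhead, compare the same section written as HTML and as markdown. Character count is only a crude stand-in for token count (the exact savings depend on the tokenizer and the content), but the pattern is visible even in a tiny example:

```python
# The same content as HTML and as markdown; the markup overhead is what
# inflates token counts, and character count is a crude proxy for it.
html = (
    "<h2>Refund Policy</h2>\n"
    "<ul>\n"
    "<li>Full refund within 30 days</li>\n"
    "<li>Store credit within 60 days</li>\n"
    "</ul>"
)
markdown = (
    "## Refund Policy\n\n"
    "- Full refund within 30 days\n"
    "- Store credit within 60 days"
)

saving = 1 - len(markdown) / len(html)
print(f"markdown is {saving:.0%} smaller than the equivalent HTML")
```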
PDF to Markdown Conversion Options
Browser-Based Tools (Recommended for Most Users)
Craft Markdown — Privacy-First, AI-Ready
- Files never leave your browser — complete privacy for sensitive documents
- Supports 9+ formats (PDF, Word, HTML, CSV, JSON, and more)
- Clean, AI-optimized markdown output
- Completely free, no signup required
- Convert your PDF now →
pdf2md.morethan.io — Simple and Fast
- Single-purpose PDF converter
- Files uploaded to their server (privacy consideration)
- PDF-only — no other formats
Command-Line Tools (For Automation)
Pandoc — Industry Standard
- Supports 50+ formats
- Requires local installation
- Highly customizable output
- Best for batch processing and scripted workflows
- Note: Pandoc writes PDFs but does not read them, so pair it with a text extractor such as pdftotext
MarkItDown (Microsoft) — Python-Based
- Open-source, maintained by Microsoft
- Designed specifically for LLM data preparation
- Good for developer automation pipelines
API Services (For Production Scale)
LlamaParse — AI-Powered Parsing
- Uses AI to understand complex document layouts
- Excellent for multi-column PDFs, complex tables, and charts
- Paid service with limited free tier
- Best for production RAG systems with budget
Quick Comparison
| Tool | Privacy | Ease of Use | Complex PDFs | Cost |
|---|---|---|---|---|
| Craft Markdown | Browser-based | Drag and drop | Good | Free |
| pdf2md | Server upload | Simple | Basic | Free |
| Pandoc | Local | Command line | Good | Free |
| MarkItDown | Local | Python script | Good | Free |
| LlamaParse | Cloud API | API integration | Excellent | Paid |
Our recommendation: Start with Craft Markdown for individual documents and testing. Move to Pandoc or MarkItDown when you need automation. Consider LlamaParse for complex documents at scale.
Step-by-Step: PDF to RAG Pipeline
Step 1: Assess Your PDFs
Before converting, evaluate your document collection:
Check text selectability:
Open the PDF and try to highlight text. If you can select and copy text, it's a digital-native PDF — these convert well. If you can't select text, it's a scanned document that needs OCR first.
Evaluate complexity:
- Simple text documents (reports, articles, manuals) → Convert directly
- Documents with tables → Convert and verify table accuracy
- Multi-column layouts → May need specialized tools or manual cleanup
- Scanned documents → Require OCR before conversion
Count your documents:
- 1-20 documents: Browser-based tools are perfect
- 20-100 documents: Consider command-line automation
- 100+ documents: Build a scripted pipeline
Step 2: Convert PDF to Markdown
Using Craft Markdown (recommended for most users):
- Go to craftmarkdown.com/pdf-to-markdown
- Drag and drop your PDF onto the converter
- Review the markdown preview — check headings, tables, and structure
- Copy to clipboard or download the .md file
Your file never leaves your browser. No server upload, no data collection.
Using Pandoc (for automation):
```bash
# Pandoc writes PDFs but cannot read them, so extract the text first
# with pdftotext (part of Poppler), then let Pandoc normalize it
pdftotext -layout input.pdf output.txt
pandoc output.txt -o output.md

# Batch conversion of all PDFs in a directory
for f in *.pdf; do pdftotext -layout "$f" - | pandoc -o "${f%.pdf}.md"; done
```
Using Python with PyMuPDF:
```python
import fitz  # PyMuPDF

def pdf_to_markdown(pdf_path):
    # Note: get_text("text") returns plain text, not true markdown.
    # For heading- and table-aware output, the companion pymupdf4llm
    # package offers pymupdf4llm.to_markdown(pdf_path).
    doc = fitz.open(pdf_path)
    markdown = ""
    for page in doc:
        markdown += page.get_text("text") + "\n\n"
    return markdown

md_text = pdf_to_markdown("document.pdf")
with open("document.md", "w") as f:
    f.write(md_text)
```
Step 3: Clean and Validate
Even the best conversion tools need a quality check. Common cleanup tasks:
Fix heading hierarchy:
Ensure headings follow a logical structure — # for the document title, ## for main sections, ### for subsections. Conversion sometimes flattens or misidentifies heading levels.
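This check is easy to script across a whole collection. A minimal sketch that flags levels jumping by more than one (the jump rule is a common convention, not a hard standard):

```python
import re

def check_heading_hierarchy(markdown_text):
    """Return warnings where a heading level jumps by more than one
    (e.g. # followed directly by ###), a common conversion artifact."""
    warnings = []
    prev_level = 0
    for match in re.finditer(r'^(#{1,6}) .+$', markdown_text, flags=re.MULTILINE):
        level = len(match.group(1))
        if prev_level and level > prev_level + 1:
            warnings.append(
                f"Heading jumps from level {prev_level} to {level}: {match.group(0)!r}"
            )
        prev_level = level
    return warnings
```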
Verify tables:
Open the markdown in a preview tool. Do tables render correctly? Are columns aligned? Is any data missing or misplaced?
Remove artifacts:
Strip out page numbers, repeated headers/footers, and any garbled text from the conversion process.
Validation checklist:
- All sections from the original PDF are present
- Tables are readable and complete
- Links are preserved (if applicable)
- No garbled or missing text
- Heading hierarchy is correct and consistent
- Page numbers and headers/footers are removed
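Much of this cleanup can be scripted. A rough sketch that drops standalone page numbers and lines repeated verbatim across the document; the repeat threshold and page-number pattern are heuristics, so review the output before trusting it:

```python
import re
from collections import Counter

def strip_artifacts(text, min_repeats=3):
    lines = text.splitlines()
    # Lines repeated verbatim many times are likely running headers/footers;
    # skip headings, table rows, and bullets, which legitimately repeat syntax.
    counts = Counter(
        line.strip() for line in lines
        if line.strip() and not line.lstrip().startswith(('#', '|', '-'))
    )
    repeated = {l for l, n in counts.items() if n >= min_repeats}
    cleaned = []
    for line in lines:
        s = line.strip()
        if re.fullmatch(r'(Page\s+)?\d{1,4}(\s+of\s+\d{1,4})?', s):
            continue  # standalone page number like "12" or "Page 3 of 10"
        if s in repeated:
            continue  # repeated header/footer line
        cleaned.append(line)
    return "\n".join(cleaned)
```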
Step 4: Chunk for RAG
Chunking strategy is one of the most important decisions in RAG system design. Markdown makes it significantly easier because of its explicit structure.
Strategy 1: Chunk by heading (recommended)
Split the document at ## or ### headers. This preserves semantic boundaries — each chunk covers a coherent topic.
```python
import re

def chunk_by_heading(markdown_text, level=2):
    # Split immediately before each heading of the given level (e.g. "## ")
    pattern = r'\n(?=#{' + str(level) + r'} )'
    chunks = re.split(pattern, markdown_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

chunks = chunk_by_heading(markdown_text)
```
Why this works well: Each chunk starts with a descriptive heading that provides context for the embedding. When a user asks about "refund policy," the chunk that starts with ## Refund Policy will match strongly.
Strategy 2: Fixed-size chunks with overlap
Split at a fixed character count with overlap between chunks. Better for documents without clear heading structure.
```python
def chunk_by_size(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        start = end - overlap  # step back so neighboring chunks share context
    return [c for c in chunks if c]

chunks = chunk_by_size(markdown_text)
```
Strategy 3: Paragraph-based chunking
Split at double newlines (paragraph boundaries). Creates natural reading units with variable size.
```python
def chunk_by_paragraph(markdown_text, min_size=200):
    paragraphs = markdown_text.split('\n\n')
    chunks = []
    current = ""
    for para in paragraphs:
        # Merge short paragraphs until the chunk reaches min_size
        if len(current) + len(para) < min_size:
            current += "\n\n" + para
        else:
            if current:
                chunks.append(current.strip())
            current = para
    if current:
        chunks.append(current.strip())
    return chunks
```
Our recommendation: Start with heading-based chunking. It produces the best retrieval results for most structured documents. Fall back to fixed-size chunking for unstructured content.
Step 5: Generate Embeddings
Convert your markdown chunks into vector embeddings for semantic search.
```python
from openai import OpenAI

client = OpenAI()

def get_embeddings(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

embeddings = get_embeddings(chunks)
```
Tip: Markdown chunks produce better embeddings than raw PDF text because the content is cleaner, the structure is explicit, and there's no formatting noise diluting the semantic signal.
Step 6: Store in a Vector Database
Load your chunks and embeddings into a vector database for retrieval.
Popular options:
- Chroma — Simple, local-first, great for prototyping
- Pinecone — Managed, scalable, production-ready
- Weaviate — Open-source, feature-rich
- Qdrant — High-performance, Rust-based
- pgvector — PostgreSQL extension for teams already using Postgres
Example with Chroma:
```python
import chromadb

# Use a distinct name so this doesn't shadow the OpenAI client above
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("knowledge_base")

collection.add(
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"source": "document.pdf", "chunk": i} for i in range(len(chunks))],
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)
```
Step 7: Query and Retrieve
Search your vector database and use retrieved chunks as context for LLM generation.
```python
from openai import OpenAI

openai_client = OpenAI()  # distinct from the Chroma client

results = collection.query(
    query_texts=["What is the company refund policy?"],
    n_results=3
)
context = "\n\n---\n\n".join(results['documents'][0])

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer based on the provided context. If the context doesn't contain the answer, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What is the company refund policy?"}
    ]
)
print(response.choices[0].message.content)
```
Best Practices for PDF-to-RAG Conversion
Preserve Document Structure
Structure is your most valuable asset in RAG. Headers indicate topic changes, lists preserve relationships, and tables maintain data integrity.
- Keep the original heading hierarchy — don't flatten ### into ##
- Preserve list formatting — bullet points and numbered lists carry semantic meaning
- Maintain table structure — tables often contain the highest-value facts in business documents
Handle Tables Carefully
Tables are disproportionately important in RAG systems because they contain dense, factual information that users frequently query.
- Always verify table conversion accuracy
- Consider creating separate chunks for large tables
- Test table-specific queries in your RAG system
- If a table doesn't convert well, recreate it manually — the retrieval improvement is worth the effort
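One way to give large tables their own chunks is to pull pipe-table blocks out of the markdown before the main chunking pass. A sketch using a simple regex; it assumes GitHub-style pipe tables with leading and trailing pipes on every row:

```python
import re

# Consecutive lines that start and end with a pipe form one table block
TABLE_RE = re.compile(r'(?:^\|.*\|[ \t]*\n?)+', re.MULTILINE)

def split_out_tables(markdown_text):
    """Return (text with tables removed, list of table chunks)."""
    tables = [m.group(0).strip() for m in TABLE_RE.finditer(markdown_text)]
    remainder = TABLE_RE.sub('', markdown_text)
    return remainder, tables
```

Each extracted table can then be embedded as its own chunk, ideally with the nearest heading attached as context.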
Maintain Metadata
Attach metadata to each chunk for better retrieval and attribution:
```json
{
  "content": "## Refund Policy\n\nCustomers may request a full refund within 30 days...",
  "metadata": {
    "source": "company_handbook.pdf",
    "section": "Refund Policy",
    "page": 15,
    "converted_date": "2025-12-15"
  }
}
```
Metadata enables filtered search (e.g., "search only in the employee handbook"), source attribution in AI responses, and versioning when documents get updated.
Test Retrieval Quality
Before deploying to production:
- Create 20-30 representative test queries
- Run them against your RAG system
- Evaluate whether the retrieved chunks are relevant
- Measure precision and recall
- Iterate on your chunking strategy based on results
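The "measure precision and recall" step can be as lightweight as precision@k over a hand-labeled query set. A sketch where retrieve_fn and the relevant-chunk labels are whatever your own pipeline provides:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant_ids) / len(top)

def evaluate(test_set, retrieve_fn, k=3):
    """test_set: list of (query, set of relevant chunk ids) pairs."""
    scores = [precision_at_k(retrieve_fn(q), rel, k) for q, rel in test_set]
    return sum(scores) / len(scores)
```

Run it after every chunking change; a drop in the average score tells you the new strategy hurt retrieval before any user sees it.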
Common Challenges and Solutions
Scanned PDFs (No Selectable Text)
Problem: The PDF contains images of text, not actual text data.
Solution: Run OCR first using Tesseract, Adobe Acrobat, or a cloud OCR service. Then convert the OCR output to markdown. Quality depends on scan resolution and font clarity.
Complex Multi-Column Layouts
Problem: Text from multiple columns gets interleaved during extraction.
Solution: Use AI-powered extraction tools like LlamaParse for complex layouts. Alternatively, convert the PDF to a single-column format in the source application before PDF export.
Tables Spanning Multiple Pages
Problem: Large tables that break across pages lose their structure during conversion.
Solution: Manual reconstruction is often the most reliable approach. For automation, LlamaParse handles multi-page tables better than most tools.
Mixed Content (Text + Images)
Problem: Image content (charts, diagrams, figures) is lost in text extraction.
Solution: Extract images separately. For charts and diagrams, consider using vision models (GPT-4V, Claude Vision) to generate text descriptions, then include those descriptions in your markdown.
Large Document Collections
Problem: Converting hundreds or thousands of PDFs manually isn't practical.
Solution: Build an automated pipeline using Pandoc or MarkItDown for batch processing. Use Craft Markdown for spot-checking and quality validation of individual documents.
RAG Performance: Markdown vs Other Formats
In our testing, markdown consistently produces better RAG results; exact figures will vary with your document set, but the pattern holds:
| Metric | Raw PDF Text | HTML | Markdown |
|---|---|---|---|
| Retrieval accuracy | ~62% | ~78% | ~89% |
| Chunk quality | Poor — arbitrary breaks | Medium — tag fragments | High — semantic boundaries |
| Token efficiency | N/A | Low — tag overhead | High — minimal syntax |
| Embedding quality | Low — noisy text | Medium — mixed content | High — clean content |
| Processing speed | Fast extraction | Fast parsing | Fast parsing |
The ~27 percentage point improvement in retrieval accuracy from raw PDF to markdown is substantial. For a customer support RAG system handling 1,000 queries per day, that's the difference between hundreds of wrong answers and accurate, helpful responses.
Tools and Resources
Conversion Tools
- Craft Markdown — Browser-based, privacy-first, multi-format
- Pandoc — Command-line power tool for batch processing
- MarkItDown — Microsoft's Python-based LLM-focused converter
RAG Frameworks
- LangChain — Popular, well-documented RAG framework with markdown loaders
- LlamaIndex — Document-focused RAG with excellent PDF and markdown support
- Haystack — Open-source NLP framework for production search systems
Vector Databases
- Chroma — Simple, local-first, ideal for prototyping
- Pinecone — Managed and scalable for production
- Weaviate — Open-source with rich features
- Qdrant — High-performance for large-scale deployments
Further Reading
- Why LLMs Love Markdown — deep dive into why markdown is the best format for AI
- How to Convert PDF to Markdown — complete conversion guide with multiple methods
- Best PDF to Markdown Converters — tool comparison and recommendations
Key Takeaways
- PDFs need conversion for optimal RAG performance. Raw PDF extraction produces noisy, unstructured text that degrades retrieval quality.
- Markdown is the ideal intermediate format. Clean structure, token efficiency, and semantic boundaries make markdown perfect for RAG.
- Structure preservation is critical. Headings and tables carry the most value — verify they convert correctly.
- Chunking strategy matters. Header-based chunking produces the best results for structured documents.
- Test and iterate. Measure retrieval quality, adjust your pipeline, and continuously improve.
- Privacy matters. Use browser-based tools like Craft Markdown for sensitive documents — your files should never leave your device.
Frequently Asked Questions
Can I use PDFs directly in RAG without converting to markdown?
Yes, and many RAG frameworks support direct PDF loading. But the results are measurably worse. Raw PDF extraction produces noisy text with formatting artifacts, arbitrary chunk boundaries, and lost structure. Converting to markdown first typically improves retrieval accuracy by 20-30%.
How long does PDF to markdown conversion take?
Seconds for most documents when using browser-based tools like Craft Markdown. Batch conversion of hundreds of PDFs with command-line tools takes minutes. The conversion step is fast — it's the quality review that takes time.
What about scanned PDFs?
Scanned PDFs need OCR (Optical Character Recognition) before markdown conversion. Tools like Tesseract (free, open-source) or Adobe Acrobat can perform OCR. Quality depends on scan resolution and font clarity. After OCR, convert the text output to markdown.
Is markdown better than JSON for RAG?
For document content (articles, reports, manuals), markdown is better — it's more token-efficient and preserves reading flow. JSON is better for structured data like product catalogs, customer records, or API responses. Most production RAG systems use markdown for documents and JSON for structured data.
How do I handle images in PDFs for RAG?
Text and structure convert to markdown. Images need separate handling. For informational images (charts, diagrams), use a vision model to generate text descriptions and include those in your markdown. For decorative images, you can usually skip them — they don't add retrieval value.
What chunking strategy should I use?
Start with heading-based chunking — split at ## or ### headers. This preserves semantic boundaries and produces the best retrieval results for structured documents. If your documents lack clear headings, use fixed-size chunks (500-1000 characters) with 100-200 character overlap.