Overview

The PdfConverter class converts PDF documents to Markdown and renders extracted tables as properly formatted Markdown tables. It takes a hybrid approach: pdfplumber handles form-style documents, and pdfminer handles text-heavy documents.

Dependencies

pip install 'markitdown[pdf]'
Requires: pdfminer.six, pdfplumber

Accepted Formats

MIME Types
  • application/pdf
  • application/x-pdf
Extensions
  • .pdf

Methods

accepts()

def accepts(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool
Returns True if the file has a .pdf extension or PDF MIME type.
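
The check described above can be sketched without the library. This is a minimal illustration, not the converter's actual code: `looks_like_pdf` and the two constants are hypothetical names, and the prefix match on the MIME type is an assumption so that values carrying parameters still match.

```python
from typing import Optional

# Assumed constants, copied from the "Accepted Formats" lists.
ACCEPTED_MIME_TYPE_PREFIXES = ("application/pdf", "application/x-pdf")
ACCEPTED_FILE_EXTENSIONS = (".pdf",)


def looks_like_pdf(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical stand-in for the accepts() decision."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    # Either signal is enough: a matching extension or a matching MIME type.
    if ext in ACCEPTED_FILE_EXTENSIONS:
        return True
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Because either signal is sufficient, a stream with no filename can still be accepted when its MIME type is known.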

convert()

def convert(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult
Converts a PDF file to Markdown.
Returns: DocumentConverterResult containing the converted Markdown text.
Raises: MissingDependencyException if the PDF dependencies are not installed.

Features

Intelligent Extraction Strategy

The converter analyzes each page and chooses between two extraction methods:
  1. Form-Style Extraction - For documents with structured layouts (forms, invoices, tables)
    • Analyzes word positions to detect column structures
    • Extracts borderless tables by detecting aligned text
    • Generates properly formatted Markdown tables with headers and separators
  2. Text Extraction - For plain text documents (reports, articles)
    • Falls back to pdfminer for better text spacing and readability
    • Used when fewer than 20% of pages are form-style
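
The 20% rule can be expressed as a small decision function. This is a sketch under assumptions: `choose_strategy` and `FORM_PAGE_THRESHOLD` are hypothetical names, the per-page form-style flags are taken as given, and treating an empty document as plain text is a guess.

```python
from typing import Sequence

# From the text: below 20% form-style pages, fall back to text extraction.
FORM_PAGE_THRESHOLD = 0.20


def choose_strategy(page_is_form_style: Sequence[bool]) -> str:
    """Return "form" or "text" given one form-style flag per page."""
    if not page_is_form_style:
        return "text"  # assumption: nothing to analyze, use the plain path
    ratio = sum(page_is_form_style) / len(page_is_form_style)
    return "text" if ratio < FORM_PAGE_THRESHOLD else "form"
```

For example, a ten-page report with one form-like page stays on the text path, while two such pages reach the threshold and switch to form-style extraction.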

Table Detection

The converter detects tables by:
  • Analyzing word positions and alignment across rows
  • Identifying consistent column structures (3-20 columns)
  • Detecting header rows and data rows
  • Filtering out multi-column text layouts (scientific papers)
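
To make the alignment idea concrete, here is a rough sketch that clusters word x-positions into columns and applies the 3-20 column range. The function names, the fixed 5-point tolerance, and the rule that every row must fill at least three cells are assumptions made for illustration. The real detector applies further checks, such as header detection and filtering of multi-column text, that are not shown here.

```python
from typing import List

MIN_COLUMNS, MAX_COLUMNS = 3, 20  # column range quoted in the text


def cluster_positions(xs: List[float], tolerance: float) -> List[float]:
    """Group sorted x-positions; a gap larger than tolerance starts a new column."""
    clusters: List[List[float]] = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Represent each column by its left-most position.
    return [cluster[0] for cluster in clusters]


def looks_like_table(rows: List[List[float]], tolerance: float = 5.0) -> bool:
    """rows holds, per text row, the left x-coordinate of each cell."""
    columns = cluster_positions([x for row in rows for x in row], tolerance)
    if not MIN_COLUMNS <= len(columns) <= MAX_COLUMNS:
        return False
    # Assumed consistency rule: at least two rows, each filling 3+ cells.
    return len(rows) >= 2 and all(len(row) >= MIN_COLUMNS for row in rows)
```

Rows whose cells start near the same x-positions collapse into a small, stable set of columns, which is the signal the bullets above describe.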

Special Formatting

MasterFormat Numbering - Automatically merges a partial number that sits on its own line with the text line that follows it:
.1
The intent of this Request for Proposal...
Becomes:
.1 The intent of this Request for Proposal...
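
A merge of this kind could look roughly like the following. It is a sketch, not the body of `_merge_partial_numbering_lines()`: the regular expression for a "partial number" (a dot followed by digits, alone on a line) and the choice to skip blank lines in between are assumptions.

```python
import re
from typing import List

# Assumed pattern: a line holding only a partial number such as ".1" or ".12".
PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text: str) -> str:
    """Join a lone partial number with the next non-empty line."""
    lines = text.split("\n")
    merged: List[str] = []
    i = 0
    while i < len(lines):
        current = lines[i].strip()
        if PARTIAL_NUMBER.match(current):
            # Look ahead past blank lines for the text that belongs to the number.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{current} {lines[j].strip()}")
                i = j + 1
                continue
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)
```

A partial number with nothing after it is left untouched, so the function never drops content.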

Example Usage

from markitdown.converters import PdfConverter
from markitdown._stream_info import StreamInfo

converter = PdfConverter()

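# Open in binary mode: the converter expects a BinaryIO stream.
with open("document.pdf", "rb") as f:
    # Describe the stream so accepts() can match on the extension.
    stream_info = StreamInfo(extension=".pdf")

    if converter.accepts(f, stream_info):
        result = converter.convert(f, stream_info)
        print(result.markdown)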

Output Example

For a PDF with tables:
| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| Value 1  | Value 2  | Value 3  |
| Value 4  | Value 5  | Value 6  |

Regular paragraph text appears as plain markdown.

| Header A | Header B |
| -------- | -------- |
| Data A   | Data B   |
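
For readers curious how rows turn into the padded layout above, here is a self-contained sketch. It is not the library's `_to_markdown_table()`: the function name, the treatment of the first row as the header, and the minimum separator width of three dashes are assumptions. Cells containing a pipe character are not escaped in this sketch.

```python
from typing import List


def to_markdown_table(rows: List[List[str]]) -> str:
    """Render rows as a Markdown table, using the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [[cell.strip() for cell in row] + [""] * (width - len(row)) for row in rows]
    col_widths = [max(3, *(len(row[c]) for row in padded)) for c in range(width)]

    def fmt(cells: List[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, col_widths)) + " |"

    header, *body = padded
    separator = "| " + " | ".join("-" * w for w in col_widths) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])
```

Calling it with `[["Header A", "Header B"], ["Data A", "Data B"]]` reproduces the second table shown above.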

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_pdf_converter.py:495

Key Functions

  • _extract_form_content_from_words() - Detects and extracts form-style tables
  • _extract_tables_from_words() - Extracts traditional bordered tables
  • _merge_partial_numbering_lines() - Post-processes MasterFormat numbering
  • _to_markdown_table() - Formats extracted tables as Markdown

Algorithm Highlights

  1. Adaptive Tolerance - Column boundaries calculated using 70th percentile of gaps
  2. Column Clustering - Groups nearby x-positions into columns (configurable tolerance)
  3. Table Quality Validation - Ensures detected tables have consistent structure
  4. Hybrid Fallback - Automatically switches extraction method if needed
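
The adaptive tolerance idea can be illustrated on a single line of words. This is one plausible reading of "70th percentile of gaps", not the converter's code: the nearest-rank percentile, the `split_into_cells` helper, and the rule that uniform spacing produces no split are all assumptions.

```python
import math
from typing import List, Tuple

Span = Tuple[float, float]  # (x0, x1) of one word on a line


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_into_cells(words: List[Span], pct: float = 70.0) -> List[List[Span]]:
    """Split a line into cells wherever a gap reaches the percentile threshold."""
    if not words:
        return []
    words = sorted(words)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(words, words[1:])]
    if not gaps:
        return [words]
    threshold = percentile(gaps, pct)
    if threshold <= min(gaps):
        return [words]  # uniform spacing: no gap stands out as a column break
    cells: List[List[Span]] = [[words[0]]]
    for gap, word in zip(gaps, words[1:]):
        if gap >= threshold:
            cells.append([word])  # wide gap: start a new cell
        else:
            cells[-1].append(word)  # narrow gap: same cell
    return cells
```

Deriving the threshold from the line's own gaps means tight and loose layouts are handled without a hand-tuned constant, which is presumably the motivation for an adaptive tolerance.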

Performance Considerations

  • PDF is read into memory (BytesIO) for compatibility with both libraries
  • Pages are processed sequentially
  • Large PDFs may require significant memory
  • Complex forms with 15+ columns use adaptive extraction
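
The first point, buffering the PDF in memory, can be sketched as follows. `buffer_stream` is a hypothetical helper, and restoring the caller's stream position is an assumption that requires a seekable stream.

```python
import io
from typing import BinaryIO


def buffer_stream(file_stream: BinaryIO) -> io.BytesIO:
    """Copy the remaining bytes into memory so several parsers can read them."""
    start = file_stream.tell()
    data = file_stream.read()
    file_stream.seek(start)  # leave the caller's stream where it was
    return io.BytesIO(data)
```

The cost is that the whole file is held in memory at once, which is why large PDFs can be expensive. The in-memory copy can be rewound with `seek(0)` between parsers.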