PdfConverter

Overview

The PdfConverter class converts PDF documents to Markdown, with advanced features for extracting tables into properly formatted Markdown tables. It uses a hybrid approach combining pdfplumber for form-style documents and pdfminer for text-heavy documents.

Dependencies

pip install "markitdown[pdf]"

Requires: pdfminer.six, pdfplumber

Accepted Formats

MIME Types

application/pdf
application/x-pdf

Extensions

.pdf

Methods

accepts()

def accepts(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool

Returns True if the file has a .pdf extension or PDF MIME type.
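
As a rough illustration of the check described above, the sketch below tests an extension and MIME type against the accepted formats listed earlier. The function name looks_like_pdf and the prefix-based MIME match are assumptions made for this example; they are not taken from the converter's source.

```python
from typing import Optional

# Accepted formats, copied from the lists in this page.
ACCEPTED_MIME_TYPE_PREFIXES = ("application/pdf", "application/x-pdf")
ACCEPTED_FILE_EXTENSIONS = (".pdf",)


def looks_like_pdf(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical stand-in for the accepts() check."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_FILE_EXTENSIONS:
        return True
    # Prefix matching is an assumption; it tolerates trailing parameters
    # such as "application/pdf; charset=binary".
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Either signal is enough on its own in this sketch, which mirrors the "extension or MIME type" wording above.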

convert()

def convert(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult

Converts a PDF file to Markdown.

Returns: DocumentConverterResult with converted Markdown text

Raises: MissingDependencyException if PDF dependencies are not installed
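
The "raise if dependencies are missing" behavior can be pictured with a small, generic sketch. Both MissingDependencyError and require_modules are hypothetical names invented for this example; the real converter raises its own MissingDependencyException and may detect missing packages differently.

```python
import importlib.util


class MissingDependencyError(Exception):
    """Hypothetical stand-in for markitdown's MissingDependencyException."""


def require_modules(module_names):
    """Raise if any top-level module in module_names cannot be found."""
    missing = [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
    if missing:
        raise MissingDependencyError(
            "Missing optional dependencies: " + ", ".join(missing)
        )
```

For this converter, the names to check would be pdfminer and pdfplumber, the import names of the two required packages.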

Features

Intelligent Extraction Strategy

The converter checks each page for a form-style layout and then chooses between two extraction methods:

Form-Style Extraction - For documents with structured layouts (forms, invoices, tables)
- Analyzes word positions to detect column structures
- Extracts borderless tables by detecting aligned text
- Generates Markdown tables with header and separator rows

Text Extraction - For plain-text documents (reports, articles)
- Falls back to pdfminer for better text spacing and readability
- Used when fewer than 20% of pages are form-style
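
The 20% rule can be sketched as a simple ratio test. The function name choose_extraction, the per-page boolean input, and the handling of an empty document or of exactly 20% are all assumptions; only the threshold itself comes from the text above.

```python
FORM_PAGE_THRESHOLD = 0.20  # from the text: below 20% form-style pages, use pdfminer


def choose_extraction(form_page_flags):
    """Return "pdfminer" or "pdfplumber" given one bool per page (True = form-style)."""
    if not form_page_flags:
        # Assumption: a document with no pages falls back to plain text.
        return "pdfminer"
    form_ratio = sum(form_page_flags) / len(form_page_flags)
    # Assumption: exactly 20% counts as form-style.
    return "pdfminer" if form_ratio < FORM_PAGE_THRESHOLD else "pdfplumber"
```

With one form-style page out of six the ratio is about 17%, so this sketch picks pdfminer; with one out of five it picks pdfplumber.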

Table Detection

The converter detects tables by:

Analyzing word positions and alignment across rows
Identifying consistent column structures (3-20 columns)
Detecting header rows and data rows
Filtering out multi-column text layouts (scientific papers)
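
A minimal sketch of the alignment and column-count checks listed above, assuming each text row is reduced to the x-start positions of its words. The function names, the fixed tolerance of 5.0, and the min_rows rule are hypothetical; only the 3-20 column range comes from this page.

```python
MIN_COLUMNS, MAX_COLUMNS = 3, 20  # column range quoted in the text


def cluster_positions(xs, tolerance):
    """Group sorted x-positions; a new cluster starts when the gap exceeds tolerance."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


def looks_like_table(rows, tolerance=5.0, min_rows=2):
    """rows: one list of word x-start positions per text row."""
    columns = cluster_positions([x for row in rows for x in row], tolerance)
    if not MIN_COLUMNS <= len(columns) <= MAX_COLUMNS:
        return False

    def columns_hit(row):
        # Map each word to the index of its nearest column center.
        return {
            min(range(len(columns)), key=lambda i: abs(columns[i] - x))
            for x in row
        }

    # Count rows that place words in at least MIN_COLUMNS distinct columns.
    aligned = sum(1 for row in rows if len(columns_hit(row)) >= MIN_COLUMNS)
    return aligned >= min_rows
```

A two-column page of running text yields only two clusters here and is rejected by the column-count check, which is one simple way to approximate the multi-column filter.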

Special Formatting

MasterFormat Numbering - Automatically merges a partial number that sits on its own line with the text line that follows it:

.1
The intent of this Request for Proposal...

Becomes:

.1 The intent of this Request for Proposal...
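
The merge shown above could be implemented along these lines. This is only a sketch: the regular expression, the function name merge_partial_numbering, and the choice to skip blank lines between the number and its text are assumptions, not the behavior of _merge_partial_numbering_lines().

```python
import re

# A line holding only a partial number such as ".1" or ".12" (assumed pattern).
PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text: str) -> str:
    lines = text.split("\n")
    merged = []
    i = 0
    while i < len(lines):
        current = lines[i].strip()
        if PARTIAL_NUMBER.match(current):
            # Skip blank lines between the number and its text (an assumption).
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{current} {lines[j].strip()}")
                i = j + 1
                continue
        # Not a partial number, or nothing follows it: keep the line as-is.
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)
```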

Example Usage

from markitdown.converters import PdfConverter
from markitdown._stream_info import StreamInfo

converter = PdfConverter()

# Open in binary mode; the converter expects a BinaryIO stream.
with open("document.pdf", "rb") as f:
    # Describe the stream so accepts() can match on the extension.
    stream_info = StreamInfo(extension=".pdf")

    if converter.accepts(f, stream_info):
        result = converter.convert(f, stream_info)
        print(result.markdown)

Output Example

For a PDF with tables:

| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| Value 1  | Value 2  | Value 3  |
| Value 4  | Value 5  | Value 6  |

Regular paragraph text appears as plain markdown.

| Header A | Header B |
| -------- | -------- |
| Data A   | Data B   |
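
Tables in the shape shown above can be produced by a small formatter. The sketch below is an illustration only: to_markdown_table is a hypothetical helper, not the library's _to_markdown_table(), and it does not escape pipe characters inside cells.

```python
def to_markdown_table(rows):
    """Format rows (first row = header) as a padded Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [
        [str(cell).strip() for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    # Each column is as wide as its longest cell, with a minimum of 3 dashes.
    col_widths = [max(3, *(len(row[i]) for row in padded)) for i in range(width)]

    def fmt(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, col_widths)) + " |"

    header, *body = padded
    separator = "| " + " | ".join("-" * w for w in col_widths) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])
```

Calling it with [["Header A", "Header B"], ["Data A", "Data B"]] reproduces the second table in the example output.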

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_pdf_converter.py:495

Key Functions

_extract_form_content_from_words() - Detects and extracts form-style tables
_extract_tables_from_words() - Extracts traditional bordered tables
_merge_partial_numbering_lines() - Post-processes MasterFormat numbering
_to_markdown_table() - Formats extracted tables as Markdown

Algorithm Highlights

Adaptive Tolerance - Column boundaries calculated using 70th percentile of gaps
Column Clustering - Groups nearby x-positions into columns (configurable tolerance)
Table Quality Validation - Ensures detected tables have consistent structure
Hybrid Fallback - Automatically switches extraction method if needed
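
To make the first two items concrete, here is one way a percentile-based tolerance and column clustering could fit together. It is a sketch under stated assumptions: the gaps are taken between consecutive sorted x-positions, the percentile uses the nearest-rank definition, and both function names are hypothetical. The source does not specify any of these details.

```python
import math


def adaptive_tolerance(x_positions, pct=70):
    """Nearest-rank percentile of the gaps between consecutive sorted x-positions."""
    xs = sorted(x_positions)
    gaps = sorted(b - a for a, b in zip(xs, xs[1:]))
    if not gaps:
        return 0.0
    rank = max(1, math.ceil(pct / 100 * len(gaps)))
    return gaps[rank - 1]


def cluster_columns(x_positions, tolerance):
    """Merge x-positions whose gap to the previous position is within tolerance."""
    columns = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] <= tolerance:
            columns[-1].append(x)
        else:
            columns.append([x])
    # Report each column by the mean x-position of its members.
    return [round(sum(c) / len(c), 1) for c in columns]


# Word x-starts from four rows of a three-column layout, with slight jitter.
xs = [49, 50, 51, 52, 199, 200, 201, 202, 349, 350, 351, 352]
tolerance = adaptive_tolerance(xs)
columns = cluster_columns(xs, tolerance)
```

Because most gaps in aligned data are small jitter, the 70th percentile lands on a jitter-sized value, and only the few large gaps between columns exceed it.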

Performance Considerations

The PDF is read fully into memory (BytesIO) so that both libraries can read the same stream
Pages are processed sequentially
Large PDFs may require significant memory
Complex forms with 15+ columns use adaptive extraction
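
The in-memory buffering mentioned in the first item can be sketched as follows. buffer_stream and run_both are hypothetical helpers written for this example, and the two reader callables stand in for the pdfplumber and pdfminer calls; they are not real library functions.

```python
import io


def buffer_stream(file_stream):
    """Read the whole stream once so it can be rewound for each library."""
    return io.BytesIO(file_stream.read())


def run_both(pdf_bytes, first_reader, second_reader):
    """Call two readers on the same buffer, rewinding before each one."""
    results = []
    for reader in (first_reader, second_reader):
        pdf_bytes.seek(0)  # each reader starts from the beginning
        results.append(reader(pdf_bytes))
    return results
```

The trade-off is the one noted above: the whole file lives in memory for the duration of the conversion, which matters for large PDFs.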
