Overview
The PdfConverter class converts PDF documents to Markdown, with advanced features for extracting tables into properly formatted Markdown tables. It uses a hybrid approach combining pdfplumber for form-style documents and pdfminer for text-heavy documents.
Dependencies
pdfminer.six, pdfplumber
Accepted Formats
application/pdf, application/x-pdf
.pdf
Methods
accepts()
Returns: True if the file has a .pdf extension or a PDF MIME type.
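The acceptance check can be sketched as a standalone function. This is a minimal illustration rather than the converter's actual method: the name `looks_like_pdf`, its parameters, and the prefix match on the MIME type are all assumptions.

```python
from typing import Optional

# Values taken from the Accepted Formats section.
ACCEPTED_MIME_TYPES = ("application/pdf", "application/x-pdf")
ACCEPTED_EXTENSIONS = (".pdf",)


def looks_like_pdf(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True for a .pdf extension or a PDF MIME type (hypothetical helper)."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match (an assumption) so parameters such as "; charset=binary" pass.
    return mime.startswith(ACCEPTED_MIME_TYPES)
```

Either signal alone is enough, so a file with a missing or wrong extension is still accepted when its MIME type identifies it as a PDF.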
convert()
Returns: DocumentConverterResult with the converted Markdown text
Raises: MissingDependencyException if PDF dependencies are not installed
Features
Intelligent Extraction Strategy
The converter analyzes each page and selects the optimal extraction method:
- Form-Style Extraction - For documents with structured layouts (forms, invoices, tables)
  - Analyzes word positions to detect column structures
  - Extracts borderless tables by detecting aligned text
  - Generates properly formatted Markdown tables with headers and separators
- Text Extraction - For plain text documents (reports, articles)
  - Falls back to pdfminer for better text spacing and readability
  - Used when less than 20% of pages are form-style
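The 20% rule can be illustrated with a small decision function. This is a sketch under stated assumptions: `choose_strategy`, its parameters, and the treatment of an empty document are hypothetical, and the per-page form-style classification is taken as already computed.

```python
from typing import Sequence


def choose_strategy(page_is_form_style: Sequence[bool], threshold: float = 0.2) -> str:
    """Pick "text" or "form" extraction from per-page flags (hypothetical helper)."""
    if not page_is_form_style:
        # Assumption: with no pages to inspect, plain text extraction is the default.
        return "text"
    form_ratio = sum(page_is_form_style) / len(page_is_form_style)
    # Text extraction is used only when strictly less than 20% of pages are form-style.
    return "text" if form_ratio < threshold else "form"
```

For example, a six-page report with a single form-style page falls below the threshold and is treated as a text document, while a three-page invoice with two form-style pages is not.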
Table Detection
The converter detects tables by:
- Analyzing word positions and alignment across rows
- Identifying consistent column structures (3-20 columns)
- Detecting header rows and data rows
- Filtering out multi-column text layouts (scientific papers)
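A rough sketch of the column-consistency part of this check follows. It assumes rows have already been split into cells; the function name `looks_like_table` and the `min_agreement` ratio are hypothetical, and the filtering of multi-column text layouts is not covered.

```python
from collections import Counter
from typing import Sequence


def looks_like_table(
    rows: Sequence[Sequence[str]],
    min_cols: int = 3,
    max_cols: int = 20,
    min_agreement: float = 0.6,  # hypothetical share of rows that must agree
) -> bool:
    """Return True when most rows share a column count in the 3-20 range."""
    if len(rows) < 2:
        # A header row plus at least one data row is needed.
        return False
    counts = Counter(len(row) for row in rows)
    common_cols, frequency = counts.most_common(1)[0]
    if not (min_cols <= common_cols <= max_cols):
        return False
    return frequency / len(rows) >= min_agreement
```

The lower bound of three columns is what keeps ordinary two-column text from being read as a table in this sketch.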
Special Formatting
MasterFormat Numbering - Automatically merges partial numbering patterns.
Example Usage
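A minimal usage sketch, assuming the package's top-level `MarkItDown` class routes .pdf files to PdfConverter. The wrapper name `pdf_to_markdown` is hypothetical; `convert()` and `text_content` are used as exposed by the `markitdown` package.

```python
import sys


def pdf_to_markdown(path: str) -> str:
    """Convert a PDF file to Markdown text via the MarkItDown entry point."""
    # Imported lazily so this module loads even when markitdown is not installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)
    return result.text_content


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python example.py path/to/document.pdf
    print(pdf_to_markdown(sys.argv[1]))
```

If the PDF dependencies are missing, the conversion is expected to fail with MissingDependencyException, as noted under convert().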
Output Example
For a PDF with tables, the output contains Markdown tables with header and separator rows.
Implementation Details
Source Location
~/workspace/source/packages/markitdown/src/markitdown/converters/_pdf_converter.py:495
Key Functions
- _extract_form_content_from_words() - Detects and extracts form-style tables
- _extract_tables_from_words() - Extracts traditional bordered tables
- _merge_partial_numbering_lines() - Post-processes MasterFormat numbering
- _to_markdown_table() - Formats extracted tables as Markdown
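The two post-processing steps named above can be approximated as follows. These are illustrative stand-ins, not the library's code: `merge_partial_numbering` and `to_markdown_table` are hypothetical names, and the sketch assumes that "partial numbering" means a marker such as `.1` stranded on its own line ahead of the text it labels.

```python
import re
from typing import List, Sequence

# Assumption: a partial number is a line holding only a dot and digits, e.g. ".1".
PARTIAL_NUMBER = re.compile(r"^\.\d+$")


def merge_partial_numbering(text: str) -> str:
    """Join a stranded marker like ".1" with the next non-empty line."""
    lines = text.split("\n")
    merged: List[str] = []
    i = 0
    while i < len(lines):
        current = lines[i].strip()
        if PARTIAL_NUMBER.match(current):
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1  # skip blank lines between the marker and its text
            if j < len(lines):
                merged.append(f"{current} {lines[j].strip()}")
                i = j + 1
                continue
        merged.append(lines[i])
        i += 1
    return "\n".join(merged)


def to_markdown_table(rows: Sequence[Sequence[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    padded = [
        [cell.strip().replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)
```

Short rows are padded with empty cells and literal pipe characters are escaped, so every emitted line has the same number of columns.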
Algorithm Highlights
- Adaptive Tolerance - Column boundaries are calculated using the 70th percentile of gaps
- Column Clustering - Groups nearby x-positions into columns (configurable tolerance)
- Table Quality Validation - Ensures detected tables have consistent structure
- Hybrid Fallback - Automatically switches extraction method if needed
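The first two items can be sketched in a few lines. All names here are hypothetical, the percentile uses plain linear interpolation, and the clustering is a simple single-linkage pass over sorted x-positions; how the converter actually combines the two steps is not stated, so they are kept separate.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def adaptive_tolerance(gaps: Sequence[float]) -> float:
    """Use the 70th percentile of the observed gaps as the tolerance."""
    return percentile(gaps, 70)


def cluster_columns(x_positions: Sequence[float], tolerance: float) -> List[List[float]]:
    """Group sorted x-positions; a gap larger than `tolerance` starts a new column."""
    clusters: List[List[float]] = []
    for x in sorted(x_positions):
        # Compare against the last member of the current cluster (single linkage).
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters
```

A percentile-based tolerance scales with the document's own spacing, which is one plausible reason to prefer it over a fixed pixel threshold.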
Performance Considerations
- PDF is read into memory (BytesIO) for compatibility with both libraries
- Pages are processed sequentially
- Large PDFs may require significant memory
- Complex forms with 15+ columns use adaptive extraction
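The in-memory buffering can be sketched as below. The helper names are hypothetical, and handing each library its own fresh BytesIO over the same bytes is an assumption made here to avoid shared read positions; the converter may instead reuse a single buffer. `pdfplumber.open()` and `pdfminer.high_level.extract_text()` both accept file-like objects.

```python
import io
from typing import BinaryIO, Tuple


def read_pdf_bytes(file_stream: BinaryIO) -> bytes:
    """Read the whole stream into memory once (memory cost grows with file size)."""
    return file_stream.read()


def extract_with_both(file_stream: BinaryIO) -> Tuple[int, str]:
    """Return (page count from pdfplumber, plain text from pdfminer)."""
    # Imported lazily so the module loads without the PDF dependencies.
    import pdfminer.high_level
    import pdfplumber

    data = read_pdf_bytes(file_stream)
    # Each library gets its own seekable view of the same bytes.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        page_count = len(pdf.pages)
    text = pdfminer.high_level.extract_text(io.BytesIO(data))
    return page_count, text
```

Because the full file is held in memory, very large PDFs pay that cost up front regardless of which extraction path is chosen.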