Converters Overview

Architecture

MarkItDown uses a modular converter architecture where each converter is responsible for handling specific file types. All converters inherit from the DocumentConverter base class and implement two key methods:

accepts() - Determines if the converter can handle a given file based on MIME type, extension, or URL
convert() - Performs the actual conversion to Markdown

Base Classes

DocumentConverter

The abstract base class for all converters. Key Methods:

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool:
    """Determine if this converter can handle the document."""

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult:
    """Convert the document to Markdown."""

DocumentConverterResult

The result object returned by all converters. Properties:

markdown (str) - The converted Markdown text
title (Optional[str]) - Optional document title extracted from the file
text_content (str) - Deprecated alias for markdown
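One way a deprecated alias like `text_content` can be kept in sync with `markdown` is a read-only property. The class below is a simplified stand-in written for this example, under that assumption; it is not the library's implementation.

```python
from typing import Optional


class DocumentConverterResult:
    """Simplified stand-in for the result object described above."""

    def __init__(self, markdown: str, *, title: Optional[str] = None):
        self.markdown = markdown
        self.title = title

    @property
    def text_content(self) -> str:
        # Deprecated alias: always mirrors the current value of `markdown`.
        return self.markdown


result = DocumentConverterResult("# Report\n\nBody text.", title="Report")
print(result.title)                            # prints: Report
print(result.text_content == result.markdown)  # prints: True
```

New code should read `markdown`; `text_content` is shown only to illustrate the alias.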

Available Converters

Document Formats

PDF

Convert PDF files with table extraction support

DOCX

Convert Word documents preserving styles and tables

PPTX

Convert PowerPoint presentations with images and charts

XLSX/XLS

Convert Excel spreadsheets to Markdown tables

Media Formats

Images

Extract metadata and generate AI descriptions for images

Audio

Transcribe audio files and extract metadata

Web & Markup

HTML

Convert HTML to clean Markdown

Other Converters

Plain text, CSV, Jupyter notebooks, ZIP archives, EPUB, RSS, Wikipedia, YouTube, Bing search, Outlook messages, and Azure Document Intelligence

Converter Selection

MarkItDown automatically selects the appropriate converter based on:

File Extension - Primary method (e.g., .pdf, .docx)
MIME Type - Secondary method from HTTP headers or file detection
URL Pattern - For special web converters (Wikipedia, YouTube, Bing)
Content Inspection - Some converters peek at file content to confirm format

Converters are tried in registration order until one accepts the file.
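The selection process described above can be sketched as a loop over registered converters. Everything here is a simplified stand-in for illustration: `StreamInfo`, the two converter classes, and `select_converter` are hypothetical names, and the real dispatch logic may differ. The PDF stand-in also shows content inspection that restores the stream position after peeking.

```python
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    extension: Optional[str] = None
    mimetype: Optional[str] = None
    url: Optional[str] = None


class CsvConverter:
    def accepts(self, file_stream, stream_info, **kwargs):
        # Extension first, MIME type as a fallback.
        return stream_info.extension == ".csv" or stream_info.mimetype == "text/csv"


class PdfConverter:
    def accepts(self, file_stream, stream_info, **kwargs):
        if stream_info.extension == ".pdf":
            return True
        # Content inspection: peek at the magic bytes, then rewind so the
        # next converter (or convert()) sees the stream from the same position.
        position = file_stream.tell()
        header = file_stream.read(5)
        file_stream.seek(position)
        return header == b"%PDF-"


def select_converter(converters, file_stream, stream_info):
    """Return the first converter, in list order, that accepts the input."""
    for candidate in converters:
        if candidate.accepts(file_stream, stream_info):
            return candidate
    return None


registered = [CsvConverter(), PdfConverter()]

unlabeled_pdf = io.BytesIO(b"%PDF-1.7 ...")
chosen = select_converter(registered, unlabeled_pdf, StreamInfo())
print(type(chosen).__name__)  # prints: PdfConverter
```

Because the first accepting converter wins, broad catch-all converters would need to sit after more specific ones in such a list.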

Common Options

Many converters accept these optional parameters via **kwargs:

llm_client (OpenAI client) - OpenAI-compatible client for AI-powered features (image captioning, etc.)
llm_model (str) - Model name to use with the LLM client (e.g., "gpt-4o")
llm_prompt (str) - Custom prompt for LLM operations
exiftool_path (str) - Path to the exiftool binary for metadata extraction
keep_data_uris (bool, default: False) - Embed images as base64 data URIs instead of file references
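To show how optional parameters can flow through `**kwargs`, here is a minimal sketch of a helper that only produces an image caption when both `llm_client` and `llm_model` are supplied. `describe_image`, the default prompt text, and `fake_client` are hypothetical; in particular, `fake_client` is a plain callable used so the example runs offline and does not reflect the OpenAI client's actual interface.

```python
from typing import Any, Optional

# Assumed default prompt, used only in this example.
DEFAULT_PROMPT = "Write a detailed caption for this image."


def describe_image(image_bytes: bytes, **kwargs: Any) -> Optional[str]:
    """Hypothetical helper: caption an image if an LLM client and model are given."""
    client = kwargs.get("llm_client")
    model = kwargs.get("llm_model")
    if client is None or model is None:
        # AI features are optional; skip them when not configured.
        return None
    prompt = kwargs.get("llm_prompt") or DEFAULT_PROMPT
    return client(model=model, prompt=prompt, image=image_bytes)


def fake_client(*, model: str, prompt: str, image: bytes) -> str:
    """Offline stand-in for a real LLM call."""
    return f"[{model}] {prompt} ({len(image)} bytes)"


print(describe_image(b"abc"))  # prints: None
print(describe_image(b"abc", llm_client=fake_client, llm_model="gpt-4o"))
# prints: [gpt-4o] Write a detailed caption for this image. (3 bytes)
```

Reading options with `kwargs.get()` lets a converter ignore parameters it does not use, so the same keyword arguments can be passed to every converter.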

Error Handling

Converters may raise these exceptions:

MissingDependencyException - Required dependency not installed
FileConversionException - Conversion failed for a supported file type
UnsupportedFormatException - No converter available for the file type
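A caller might handle these three cases separately, as in the sketch below. The exception classes are stand-ins defined locally with the same names so the snippet runs on its own (their shared base class is an assumption), and `convert_or_explain` and `broken_convert` are hypothetical helpers.

```python
class ConversionError(Exception):
    """Assumed common base for the stand-in exceptions below."""


class MissingDependencyException(ConversionError):
    pass


class FileConversionException(ConversionError):
    pass


class UnsupportedFormatException(ConversionError):
    pass


def convert_or_explain(convert, source: str) -> str:
    """Run `convert(source)` and turn each failure mode into a readable message."""
    try:
        return convert(source)
    except MissingDependencyException as exc:
        return f"Missing dependency for {source}: {exc}"
    except UnsupportedFormatException:
        return f"No converter available for {source}"
    except FileConversionException as exc:
        return f"Conversion failed for {source}: {exc}"


def broken_convert(source: str) -> str:
    # Simulates a supported file type whose conversion fails.
    raise FileConversionException("unexpected end of file")


print(convert_or_explain(broken_convert, "report.docx"))
# prints: Conversion failed for report.docx: unexpected end of file
```

Distinguishing the cases lets an application suggest installing an extra package, skip an unsupported file, or report a corrupt input.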
