DocumentConverter

The DocumentConverter class is the abstract base class for all document converters in MarkItDown. Custom converters must inherit from this class and implement the accepts() and convert() methods.

Overview

A converter’s lifecycle consists of two phases:

Acceptance - The accepts() method determines if the converter can handle a given file
Conversion - The convert() method performs the actual conversion to Markdown
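
The two phases can be pictured as a simple dispatch loop. The sketch below is illustrative only: `Info`, `TextConverter`, and `dispatch` are hypothetical stand-ins written for this example, not MarkItDown's actual classes or internals, and the converter returns a plain string rather than a DocumentConverterResult.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class Info:
    """Hypothetical stand-in for StreamInfo (only the field used here)."""
    extension: Optional[str] = None


class TextConverter:
    """Hypothetical converter that follows the accepts()/convert() contract."""

    def accepts(self, file_stream: BinaryIO, stream_info: Info, **kwargs) -> bool:
        # Phase 1: decide from metadata alone, without touching the stream.
        return stream_info.extension == ".txt"

    def convert(self, file_stream: BinaryIO, stream_info: Info, **kwargs) -> str:
        # Phase 2: consume the stream and produce Markdown text.
        return file_stream.read().decode("utf-8")


def dispatch(converters, file_stream: BinaryIO, stream_info: Info) -> Optional[str]:
    """Return the output of the first converter that accepts the stream."""
    for converter in converters:
        if converter.accepts(file_stream, stream_info):
            return converter.convert(file_stream, stream_info)
    return None  # no converter accepted the input


accepted = dispatch([TextConverter()], io.BytesIO(b"hello"), Info(extension=".txt"))
rejected = dispatch([TextConverter()], io.BytesIO(b"hello"), Info(extension=".bin"))
```

Because accepts() may be called on several converters before one of them converts the stream, each call has to leave the stream as it found it.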

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> bool

Determine if the converter can handle the given document.

file_stream (BinaryIO, required): The file-like object to check. Must support seek(), tell(), and read() methods.
stream_info (StreamInfo, required): Metadata about the file, including mimetype, extension, charset, URL, and filename.
kwargs (dict): Additional keyword arguments that may be used by the converter.

Returns (bool): True if the converter can handle this document, False otherwise.

Important Notes

The accepts() method must NOT change the file stream position. If you need to read from the stream, save the position first and restore it before returning.

def accepts(self, file_stream, stream_info, **kwargs):
    # Save current position
    cur_pos = file_stream.tell()
    
    # Read some bytes to check file signature
    data = file_stream.read(100)
    
    # IMPORTANT: Restore position before returning
    file_stream.seek(cur_pos)
    
    return data.startswith(b"EXPECTED_SIGNATURE")
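
One way to confirm that a check leaves the stream untouched is to compare tell() before and after the call on an in-memory stream. This is a minimal sketch: `accepts_by_signature` is a hypothetical free function written for this example, and the PDF header is used only as a familiar signature.

```python
import io


def accepts_by_signature(file_stream, signature: bytes) -> bool:
    """Peek at the next bytes of the stream, then restore the position."""
    cur_pos = file_stream.tell()
    try:
        return file_stream.read(len(signature)) == signature
    finally:
        # Runs on both the normal return path and when read() raises.
        file_stream.seek(cur_pos)


stream = io.BytesIO(b"%PDF-1.7 rest of file")
before = stream.tell()
matched = accepts_by_signature(stream, b"%PDF-")
after = stream.tell()
```

If `before` and `after` differ, the next converter in line would start reading from the wrong offset.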

Decision Criteria

The accepts() method typically checks:

Extension: stream_info.extension (e.g., .pdf, .docx)
MIME type: stream_info.mimetype (e.g., application/pdf)
URL: stream_info.url (for site-specific converters like Wikipedia, YouTube)
Filename: stream_info.filename (for well-known files like Dockerfile, Makefile)
File content: Read file signature/magic bytes (with position reset)
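
The metadata checks above can be combined in a single function. The following is a sketch under stated assumptions: `Info` is a hypothetical stand-in for StreamInfo, the CSV extension and MIME types are chosen only for illustration, and it assumes that any metadata field may be None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Info:
    """Hypothetical stand-in for StreamInfo (only the fields used here)."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


ACCEPTED_EXTENSIONS = {".csv"}
ACCEPTED_MIME_PREFIXES = ("text/csv", "application/csv")


def accepts_csv(info: Info) -> bool:
    # Normalize missing fields to empty strings before comparing.
    extension = (info.extension or "").lower()
    mimetype = (info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    # startswith() tolerates parameters such as "text/csv; charset=utf-8".
    return mimetype.startswith(ACCEPTED_MIME_PREFIXES)


by_extension = accepts_csv(Info(extension=".CSV"))
by_mimetype = accepts_csv(Info(mimetype="text/csv; charset=utf-8"))
no_metadata = accepts_csv(Info())
```

Checking metadata first keeps accepts() cheap; reading from the stream is only needed when the metadata is missing or inconclusive.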

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> DocumentConverterResult

Convert the document to Markdown.

file_stream (BinaryIO, required): The file-like object to convert. Must support seek(), tell(), and read() methods.
stream_info (StreamInfo, required): Metadata about the file.
kwargs (dict): Additional options for the converter. Common options include:

llm_client: LLM client for AI-powered conversion
llm_model: Model name for LLM
exiftool_path: Path to exiftool binary
url: Source URL (deprecated, use stream_info.url)
file_extension: File extension (deprecated, use stream_info.extension)
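
Since these options arrive as ordinary keyword arguments, a converter can read them with kwargs.get() and treat each one as optional. A minimal sketch, in which `read_options` and the returned dictionary keys are hypothetical names used only for illustration:

```python
from typing import Any, Dict


def read_options(**kwargs: Any) -> Dict[str, Any]:
    """Collect optional settings, falling back to None when absent."""
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")
    return {
        # Only use the LLM path when both the client and the model name are given.
        "use_llm": llm_client is not None and llm_model is not None,
        "exiftool_path": kwargs.get("exiftool_path"),
    }


defaults = read_options()
with_llm = read_options(llm_client=object(), llm_model="example-model")
```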

Returns (DocumentConverterResult): The conversion result containing the Markdown text and optional metadata.

Exceptions

The convert() method may raise:

FileConversionException - When the converter recognizes the format but conversion fails
MissingDependencyException - When a required dependency is not installed
Other exceptions for unexpected errors

Creating a Custom Converter

Basic Example

from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class JsonConverter(DocumentConverter):
    """Convert JSON files to Markdown containing a fenced JSON block."""
    
    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
        # Accept JSON files based on extension or mimetype.
        # .jsonl is not accepted: json.loads() cannot parse a multi-line JSON Lines file.
        return (
            stream_info.extension == ".json" or
            stream_info.mimetype == "application/json"
        )
    
    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        import json
        
        # Read and parse JSON
        content = file_stream.read()
        data = json.loads(content)
        
        # Convert to Markdown
        markdown = f"# JSON Data\n\n```json\n{json.dumps(data, indent=2)}\n```"
        
        return DocumentConverterResult(
            markdown=markdown,
            title="JSON Document"
        )

Advanced Example with Content Inspection

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class CustomBinaryConverter(DocumentConverter):
    """Convert files with custom binary format."""
    
    MAGIC_BYTES = b"CUSTOM\x00\x01"
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # First check extension
        if stream_info.extension == ".custom":
            return True
        
        # Fall back to magic byte detection
        cur_pos = file_stream.tell()
        try:
            header = file_stream.read(len(self.MAGIC_BYTES))
            return header == self.MAGIC_BYTES
        finally:
            # CRITICAL: Restore stream position
            file_stream.seek(cur_pos)
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Read entire file
        content = file_stream.read()
        
        # Parse custom format
        # ... parsing logic ...
        
        # Generate Markdown
        markdown = "# Custom Format Document\n\n"
        markdown += "Content goes here..."
        
        return DocumentConverterResult(
            markdown=markdown,
            title="Custom Document"
        )

URL-Based Converter

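from urllib.parse import urlparse

from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class GitHubConverter(DocumentConverter):
    """Special converter for GitHub URLs."""
    
    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
        # Accept based on the URL's host name rather than a substring match,
        # so that a URL such as "https://example.com/?q=github.com" is rejected
        if stream_info.url is None:
            return False
        host = urlparse(stream_info.url).hostname or ""
        return host == "github.com" or host.endswith(".github.com")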
    
    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        # Read HTML content
        html = file_stream.read().decode('utf-8')
        
        # Extract and format GitHub-specific content
        # ... extraction logic ...
        
        markdown = f"# GitHub: {stream_info.url}\n\n"
        # ... add content ...
        
        return DocumentConverterResult(
            markdown=markdown,
            title="GitHub Page"
        )

Registering Custom Converters

After creating a converter, register it with MarkItDown:

from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT

md = MarkItDown()
md.register_converter(
    JsonConverter(),
    priority=PRIORITY_SPECIFIC_FILE_FORMAT  # 0.0 - tried before generic converters
)

result = md.convert("data.json")

Priority Guidelines

Choose the appropriate priority for your converter:

PRIORITY_SPECIFIC_FILE_FORMAT (0.0) - For specific formats (.docx, .pdf) or specific sites (Wikipedia, GitHub)
PRIORITY_GENERIC_FILE_FORMAT (10.0) - For catch-all converters (text/*, generic HTML)

Lower priority values are tried first. Converters with the same priority are tried in reverse registration order (most recently registered first).
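
The ordering rule can be reproduced with a front-inserted list and a stable sort. This sketch only illustrates the rule stated above; `registrations` and `register` are hypothetical names, and it is not presented as MarkItDown's internal implementation.

```python
registrations = []  # each entry: (priority, name)


def register(name: str, priority: float) -> None:
    # Newest entries go to the front of the list.
    registrations.insert(0, (priority, name))


register("generic_text", 10.0)
register("pdf", 0.0)
register("my_json", 0.0)  # registered last, same priority as "pdf"

# sorted() is stable, so ties keep their front-inserted (newest-first) order.
try_order = [name for _, name in sorted(registrations, key=lambda r: r[0])]
```

Here `my_json` is tried before `pdf` because it was registered later at the same priority, and both are tried before `generic_text`.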

Best Practices

Stream Position Management

Always restore the file stream position in accepts() if you read from it.

def accepts(self, file_stream, stream_info, **kwargs):
    cur_pos = file_stream.tell()
    try:
        # Read operations here
        data = file_stream.read(100)
        # ... check data and return the result ...
        return data.startswith(b"EXPECTED_SIGNATURE")
    finally:
        file_stream.seek(cur_pos)  # Always restore, even if read() raises

Error Handling

Raise appropriate exceptions:

from markitdown._exceptions import FileConversionException, MissingDependencyException

def convert(self, file_stream, stream_info, **kwargs):
    try:
        import optional_library
    except ImportError as e:
        # Chain the original error so the traceback shows what failed to import
        raise MissingDependencyException(
            "optional_library is required for this converter"
        ) from e
    
    try:
        # Conversion logic
        pass
    except Exception as e:
        raise FileConversionException(f"Conversion failed: {e}") from e

Metadata Handling

Return complete metadata when available:

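def convert(self, file_stream, stream_info, **kwargs):
    markdown_content = file_stream.read().decode(stream_info.charset or "utf-8")
    
    # Extract title from content if possible.
    # extract_title() is a placeholder for your own logic.
    title = extract_title(markdown_content)
    
    return DocumentConverterResult(
        markdown=markdown_content,
        title=title  # Helps users identify content
    )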

See Also

Core

Converters

Exceptions