The DocumentConverter class is the abstract base class for all document converters in MarkItDown. Custom converters must inherit from this class and implement the accepts() and convert() methods.

Overview

A converter’s lifecycle consists of two phases:
  1. Acceptance - The accepts() method determines if the converter can handle a given file
  2. Conversion - The convert() method performs the actual conversion to Markdown
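
The two phases can be pictured as a small dispatch loop. The sketch below is a simplified illustration, not MarkItDown's implementation: `UpperCaseTextConverter` and `run_lifecycle` are hypothetical names, and a bare `extension` argument stands in for the full `StreamInfo` object to keep the example self-contained.

```python
import io
from typing import BinaryIO, List, Optional


class UpperCaseTextConverter:
    """Hypothetical stand-in converter, not a MarkItDown class."""

    def accepts(self, file_stream: BinaryIO, extension: Optional[str]) -> bool:
        # Phase 1: decide from metadata only, leaving the stream untouched.
        return extension == ".txt"

    def convert(self, file_stream: BinaryIO, extension: Optional[str]) -> str:
        # Phase 2: consume the stream and produce the output text.
        return file_stream.read().decode("utf-8").upper()


def run_lifecycle(
    converters: List[UpperCaseTextConverter],
    file_stream: BinaryIO,
    extension: Optional[str],
) -> Optional[str]:
    """Return the result of the first converter that accepts the input."""
    for converter in converters:
        start = file_stream.tell()
        accepted = converter.accepts(file_stream, extension)
        # accepts() is expected to leave the position where it found it.
        assert file_stream.tell() == start, "accepts() moved the stream position"
        if accepted:
            return converter.convert(file_stream, extension)
    return None  # No converter accepted the input.


result = run_lifecycle([UpperCaseTextConverter()], io.BytesIO(b"hello"), ".txt")
print(result)
```

Keeping acceptance cheap and side-effect free is what lets a caller probe several converters against the same stream before committing to one.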

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> bool
Determine if the converter can handle the given document.
file_stream
BinaryIO
required
The file-like object to check. Must support seek(), tell(), and read() methods.
stream_info
StreamInfo
required
Metadata about the file including mimetype, extension, charset, URL, and filename.
kwargs
dict
Additional keyword arguments that may be used by the converter.
accepts
bool
Returns True if the converter can handle this document, False otherwise.

Important Notes

The accepts() method must NOT change the file stream position. If you need to read from the stream, save the position first and restore it before returning.
def accepts(self, file_stream, stream_info, **kwargs):
    # Save current position
    cur_pos = file_stream.tell()
    
    # Read some bytes to check file signature
    data = file_stream.read(100)
    
    # IMPORTANT: Restore position before returning
    file_stream.seek(cur_pos)
    
    return data.startswith(b"EXPECTED_SIGNATURE")
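
To check that a signature test really leaves the stream untouched, the same pattern can be exercised against an in-memory stream. This is a minimal sketch using only the standard library; `accepts_by_signature` is a hypothetical helper, and the `%PDF-` prefix is used purely as a familiar example signature.

```python
import io
from typing import BinaryIO


def accepts_by_signature(file_stream: BinaryIO, signature: bytes) -> bool:
    """Compare the bytes at the current position without moving the position."""
    cur_pos = file_stream.tell()
    try:
        return file_stream.read(len(signature)) == signature
    finally:
        # Runs on normal return and on exceptions alike.
        file_stream.seek(cur_pos)


stream = io.BytesIO(b"%PDF-1.7 rest of file")
matched = accepts_by_signature(stream, b"%PDF-")
position_after = stream.tell()

# A later convert() call still sees the file from the original position.
first_bytes = stream.read(5)
```

Wrapping the read in `try`/`finally` means the position is restored even if the read raises.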

Decision Criteria

The accepts() method typically checks:
  • Extension: stream_info.extension (e.g., .pdf, .docx)
  • MIME type: stream_info.mimetype (e.g., application/pdf)
  • URL: stream_info.url (for site-specific converters like Wikipedia, YouTube)
  • Filename: stream_info.filename (for well-known files like Dockerfile, Makefile)
  • File content: Read file signature/magic bytes (with position reset)
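
The metadata checks above can be combined in a single predicate. The sketch below assumes that any metadata field may be missing (`None`), as the URL check in the GitHub example suggests. `FakeStreamInfo` is a hypothetical stand-in that only mirrors the field names listed above, and `accepts_csv` is an illustrative name rather than a MarkItDown function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in exposing the same field names as StreamInfo."""

    mimetype: Optional[str] = None
    extension: Optional[str] = None
    filename: Optional[str] = None
    url: Optional[str] = None


ACCEPTED_EXTENSIONS = {".csv"}
ACCEPTED_MIME_PREFIXES = ("text/csv", "application/csv")


def accepts_csv(stream_info: FakeStreamInfo) -> bool:
    # Normalize missing values to "" and compare case-insensitively.
    extension = (stream_info.extension or "").lower()
    mimetype = (stream_info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match so parameters such as "; charset=utf-8" do not matter.
    return mimetype.startswith(ACCEPTED_MIME_PREFIXES)
```

Checking the cheap metadata fields first means the stream only has to be read when the extension and MIME type are inconclusive.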

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> DocumentConverterResult
Convert the document to Markdown.
file_stream
BinaryIO
required
The file-like object to convert. Must support seek(), tell(), and read() methods.
stream_info
StreamInfo
required
Metadata about the file.
kwargs
dict
Additional options for the converter. Common options include:
  • llm_client: LLM client for AI-powered conversion
  • llm_model: Model name for LLM
  • exiftool_path: Path to exiftool binary
  • url: Source URL (deprecated, use stream_info.url)
  • file_extension: File extension (deprecated, use stream_info.extension)
result
DocumentConverterResult
The conversion result containing the Markdown text and optional metadata.
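
Because these options arrive through `**kwargs`, a converter should treat each one as optional. The following sketch shows one way to read them defensively. `select_options` is a hypothetical helper, and the rule that the LLM path needs both `llm_client` and `llm_model` is an assumption made for illustration.

```python
from typing import Any, Dict


def select_options(**kwargs: Any) -> Dict[str, Any]:
    """Collect optional settings, tolerating any that are absent."""
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")
    return {
        # Assumption: the LLM path is only usable when both are supplied.
        "use_llm": llm_client is not None and llm_model is not None,
        # None means the caller did not configure an exiftool binary.
        "exiftool_path": kwargs.get("exiftool_path"),
    }


options = select_options(llm_model="example-model")
```

Using `kwargs.get()` rather than indexing avoids a `KeyError` when a caller omits an option.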

Exceptions

The convert() method may raise:
  • FileConversionException - When the converter recognizes the format but conversion fails
  • MissingDependencyException - When a required dependency is not installed
  • Other exceptions for unexpected errors

Creating a Custom Converter

Basic Example

from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class JsonConverter(DocumentConverter):
    """Convert JSON files to a pretty-printed Markdown code block."""
    
    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
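        # Accept JSON files based on extension or mimetype.
        # JSON Lines (.jsonl) is not accepted: json.loads() in convert()
        # parses a single document, not one document per line.
        return (
            stream_info.extension == ".json" or
            stream_info.mimetype == "application/json"
        )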
    
    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        import json
        
        # Read and parse JSON
        content = file_stream.read()
        data = json.loads(content)
        
        # Convert to Markdown
        markdown = f"# JSON Data\n\n```json\n{json.dumps(data, indent=2)}\n```"
        
        return DocumentConverterResult(
            markdown=markdown,
            title="JSON Document"
        )

Advanced Example with Content Inspection

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class CustomBinaryConverter(DocumentConverter):
    """Convert files with custom binary format."""
    
    MAGIC_BYTES = b"CUSTOM\x00\x01"
    
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # First check extension
        if stream_info.extension == ".custom":
            return True
        
        # Fall back to magic byte detection
        cur_pos = file_stream.tell()
        try:
            header = file_stream.read(len(self.MAGIC_BYTES))
            return header == self.MAGIC_BYTES
        finally:
            # CRITICAL: Restore stream position
            file_stream.seek(cur_pos)
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Read entire file
        content = file_stream.read()
        
        # Parse custom format
        # ... parsing logic ...
        
        # Generate Markdown
        markdown = "# Custom Format Document\n\n"
        markdown += "Content goes here..."
        
        return DocumentConverterResult(
            markdown=markdown,
            title="Custom Document"
        )

URL-Based Converter

from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class GitHubConverter(DocumentConverter):
    """Special converter for GitHub URLs."""
    
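    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
        # Accept based on the URL's host rather than a substring of the
        # whole URL, so "https://example.com/?q=github.com" is not matched
        if stream_info.url is None:
            return False
        from urllib.parse import urlparse
        host = urlparse(stream_info.url).hostname or ""
        return host == "github.com" or host.endswith(".github.com")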
    
    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        # Read HTML content
        html = file_stream.read().decode('utf-8')
        
        # Extract and format GitHub-specific content
        # ... extraction logic ...
        
        markdown = f"# GitHub: {stream_info.url}\n\n"
        # ... add content ...
        
        return DocumentConverterResult(
            markdown=markdown,
            title="GitHub Page"
        )

Registering Custom Converters

After creating a converter, register it with MarkItDown:
from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT

md = MarkItDown()
md.register_converter(
    JsonConverter(),
    priority=PRIORITY_SPECIFIC_FILE_FORMAT  # 0.0 - tried before generic converters
)

result = md.convert("data.json")

Priority Guidelines

Choose the appropriate priority for your converter:
  • PRIORITY_SPECIFIC_FILE_FORMAT (0.0) - For specific formats (.docx, .pdf) or specific sites (Wikipedia, GitHub)
  • PRIORITY_GENERIC_FILE_FORMAT (10.0) - For catch-all converters (text/*, generic HTML)
Lower priority values are tried first. Converters with the same priority are tried in reverse registration order (most recently registered first).
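
The ordering rule can be reproduced with a front-insert followed by a stable sort. The sketch below is one way to obtain the documented behavior and is not necessarily how MarkItDown implements it. The `registry`, `register`, and `try_order` names are hypothetical, and the two constants simply mirror the priority values listed above.

```python
from typing import List, Tuple

PRIORITY_SPECIFIC = 0.0   # mirrors PRIORITY_SPECIFIC_FILE_FORMAT
PRIORITY_GENERIC = 10.0   # mirrors PRIORITY_GENERIC_FILE_FORMAT

registry: List[Tuple[float, str]] = []


def register(name: str, priority: float) -> None:
    # Newest registrations go to the front of the list.
    registry.insert(0, (priority, name))


def try_order() -> List[str]:
    # sorted() is stable, so equal priorities keep their newest-first order.
    return [name for _, name in sorted(registry, key=lambda item: item[0])]


register("plain_text", PRIORITY_GENERIC)
register("pdf", PRIORITY_SPECIFIC)
register("html", PRIORITY_GENERIC)
register("wikipedia", PRIORITY_SPECIFIC)

order = try_order()
print(order)
```

Here `wikipedia` is tried before `pdf` because it was registered later at the same priority, and both come before the generic converters.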

Best Practices

Stream Position Management

Always restore the file stream position in accepts() if you read from it.
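def accepts(self, file_stream, stream_info, **kwargs):
    cur_pos = file_stream.tell()
    try:
        # Read operations here
        data = file_stream.read(100)
        # ... check data and return the result as a bool ...
        return data.startswith(b"EXPECTED_SIGNATURE")
    finally:
        file_stream.seek(cur_pos)  # Always restore, even on error or early return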

Error Handling

Raise appropriate exceptions:
from markitdown import FileConversionException, MissingDependencyException

def convert(self, file_stream, stream_info, **kwargs):
    try:
        import optional_library
    except ImportError as e:
        # Chain the original error so the missing module stays visible
        raise MissingDependencyException(
            "optional_library is required for this converter"
        ) from e
    
    try:
        # Conversion logic
        pass
    except Exception as e:
        raise FileConversionException(f"Conversion failed: {e}") from e

Metadata Handling

Return complete metadata when available:
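def convert(self, file_stream, stream_info, **kwargs):
    # build_markdown() and extract_title() are placeholders for your own logic
    markdown_content = build_markdown(file_stream)
    
    # Extract title from the converted content if possible
    title = extract_title(markdown_content)
    
    return DocumentConverterResult(
        markdown=markdown_content,
        title=title  # Helps users identify content
    )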
