The DocumentConverter class is the abstract base class for all document converters in MarkItDown. Custom converters must inherit from this class and implement the accepts() and convert() methods.
Overview
A converter’s lifecycle consists of two phases:
- Acceptance - The accepts() method determines whether the converter can handle a given file
- Conversion - The convert() method performs the actual conversion to Markdown
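The two phases can be pictured as a simple dispatch loop. The sketch below is not MarkItDown's actual implementation: `UppercaseTextConverter`, `run_first_accepting`, and the `SimpleNamespace` stand-in for StreamInfo are hypothetical names, used only so the example runs without the library installed.

```python
import io
from types import SimpleNamespace


class UppercaseTextConverter:
    """Hypothetical converter with the same two-method shape as DocumentConverter."""

    def accepts(self, file_stream, stream_info, **kwargs):
        # Phase 1: decide from metadata only, so the stream is never touched.
        return stream_info.extension == ".txt"

    def convert(self, file_stream, stream_info, **kwargs):
        # Phase 2: consume the stream and produce the output text.
        return file_stream.read().decode("utf-8").upper()


def run_first_accepting(converters, file_stream, stream_info):
    """Try each converter in order and return the first accepted conversion."""
    for converter in converters:
        start = file_stream.tell()
        accepted = converter.accepts(file_stream, stream_info)
        # accepts() must leave the stream where it found it.
        assert file_stream.tell() == start, "accepts() moved the stream position"
        if accepted:
            return converter.convert(file_stream, stream_info)
    return None


info = SimpleNamespace(extension=".txt", mimetype="text/plain")
result = run_first_accepting([UppercaseTextConverter()], io.BytesIO(b"hello"), info)
print(result)
```

Keeping acceptance cheap and side-effect free is what lets a dispatcher like this probe several converters against the same stream.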
Methods
accepts()
def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> bool
Determine if the converter can handle the given document.
file_stream: The file-like object to check. Must support seek(), tell(), and read() methods.
stream_info: Metadata about the file, including mimetype, extension, charset, URL, and filename.
**kwargs: Additional keyword arguments that may be used by the converter.
Returns True if the converter can handle this document, False otherwise.
Important Notes
The accepts() method must NOT change the file stream position. If you need to read from the stream, save the position first and restore it before returning.
def accepts(self, file_stream, stream_info, **kwargs):
    # Save current position
    cur_pos = file_stream.tell()
    # Read some bytes to check file signature
    data = file_stream.read(100)
    # IMPORTANT: Restore position before returning
    file_stream.seek(cur_pos)
    return data.startswith(b"EXPECTED_SIGNATURE")
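One way to catch a violation of this rule early is a small check that runs an acceptance function against an in-memory stream and compares positions before and after. The helper name `leaves_position_unchanged` and the two sample functions are illustrative assumptions, not part of MarkItDown.

```python
import io


def leaves_position_unchanged(accepts_fn, payload: bytes, offset: int = 0) -> bool:
    """Run accepts_fn on an in-memory stream and report whether the position survived."""
    stream = io.BytesIO(payload)
    stream.seek(offset)
    accepts_fn(stream)
    return stream.tell() == offset


def good_accepts(stream):
    # Reads the signature, then seeks back even if read() raises.
    pos = stream.tell()
    try:
        return stream.read(4) == b"%PDF"
    finally:
        stream.seek(pos)


def bad_accepts(stream):
    # Reads the signature but forgets to seek back.
    return stream.read(4) == b"%PDF"


print(leaves_position_unchanged(good_accepts, b"%PDF-1.7"))
print(leaves_position_unchanged(bad_accepts, b"%PDF-1.7"))
```

A check like this fits naturally into a unit test for each custom converter.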
Decision Criteria
The accepts() method typically checks:
- Extension: stream_info.extension (e.g., .pdf, .docx)
- MIME type: stream_info.mimetype (e.g., application/pdf)
- URL: stream_info.url (for site-specific converters like Wikipedia, YouTube)
- Filename: stream_info.filename (for well-known files like Dockerfile, Makefile)
- File content: Read file signature/magic bytes (with position reset)
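The metadata checks above can be combined into one predicate. This is a minimal sketch that assumes any of the fields may be None and compares them case-insensitively. The accepted extensions, MIME prefix, and filenames are illustrative choices, and `SimpleNamespace` stands in for a real StreamInfo.

```python
from types import SimpleNamespace

# Illustrative values only; pick the ones that match your format.
BUILD_FILENAMES = {"makefile", "gnumakefile"}
BUILD_EXTENSIONS = {".mk", ".mak"}
BUILD_MIME_PREFIXES = ("text/x-makefile",)


def accepts_makefile(stream_info) -> bool:
    """Metadata-only acceptance: extension, then MIME type, then filename."""
    extension = (stream_info.extension or "").lower()
    mimetype = (stream_info.mimetype or "").lower()
    filename = (stream_info.filename or "").lower()
    if extension in BUILD_EXTENSIONS:
        return True
    # str.startswith accepts a tuple of candidate prefixes.
    if mimetype.startswith(BUILD_MIME_PREFIXES):
        return True
    return filename in BUILD_FILENAMES


named = SimpleNamespace(extension=None, mimetype=None, filename="Makefile")
typed = SimpleNamespace(extension=".MK", mimetype=None, filename="rules.MK")
other = SimpleNamespace(extension=".pdf", mimetype="application/pdf", filename="a.pdf")
print(accepts_makefile(named), accepts_makefile(typed), accepts_makefile(other))
```

Because none of these checks read the stream, no position bookkeeping is needed.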
convert()
def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any
) -> DocumentConverterResult
Convert the document to Markdown.
file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
stream_info: Metadata about the file, including mimetype, extension, charset, URL, and filename.
**kwargs: Additional options for the converter. Common options include:
- llm_client: LLM client for AI-powered conversion
- llm_model: Model name for LLM
- exiftool_path: Path to exiftool binary
- url: Source URL (deprecated, use stream_info.url)
- file_extension: File extension (deprecated, use stream_info.extension)
Returns the conversion result containing the Markdown text and optional metadata.
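Since every option arrives through **kwargs, a converter usually reads them defensively. The sketch below shows one possible pattern: prefer the stream_info fields, fall back to the deprecated keyword arguments, and treat missing options as None. `resolve_options` is a hypothetical helper and `SimpleNamespace` stands in for StreamInfo.

```python
from types import SimpleNamespace


def resolve_options(stream_info, **kwargs):
    """Collect the options a convert() body might need into one dict."""
    return {
        # Prefer stream_info; fall back to the deprecated keyword arguments.
        "url": stream_info.url or kwargs.get("url"),
        "extension": stream_info.extension or kwargs.get("file_extension"),
        # Optional features: None means the caller did not configure them.
        "llm_client": kwargs.get("llm_client"),
        "llm_model": kwargs.get("llm_model"),
        "exiftool_path": kwargs.get("exiftool_path"),
    }


info = SimpleNamespace(url=None, extension=".pdf")
options = resolve_options(info, url="https://example.com/a.pdf", file_extension=".txt")
print(options["url"], options["extension"], options["llm_client"])
```

Using kwargs.get() rather than indexing keeps the converter working when a caller passes none of the optional arguments.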
Exceptions
The convert() method may raise:
- FileConversionException - When the converter recognizes the format but conversion fails
- MissingDependencyException - When a required dependency is not installed
- Other exceptions for unexpected errors
Creating a Custom Converter
Basic Example
from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class JsonConverter(DocumentConverter):
    """Render JSON and JSON Lines files as a fenced Markdown code block."""

    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
        # Accept JSON files based on extension or mimetype
        return (
            stream_info.extension in [".json", ".jsonl"] or
            stream_info.mimetype == "application/json"
        )

    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        import json
        # Read and parse JSON (.jsonl holds one JSON value per line)
        content = file_stream.read().decode("utf-8")
        if stream_info.extension == ".jsonl":
            data = [json.loads(line) for line in content.splitlines() if line.strip()]
        else:
            data = json.loads(content)
        # Convert to Markdown
        markdown = f"# JSON Data\n\n```json\n{json.dumps(data, indent=2)}\n```"
        return DocumentConverterResult(
            markdown=markdown,
            title="JSON Document"
        )
Advanced Example with Content Inspection
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class CustomBinaryConverter(DocumentConverter):
    """Convert files with custom binary format."""

    # Single backslashes so the literal contains the actual 0x00 and 0x01 bytes
    MAGIC_BYTES = b"CUSTOM\x00\x01"

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # First check extension
        if stream_info.extension == ".custom":
            return True
        # Fall back to magic byte detection
        cur_pos = file_stream.tell()
        try:
            header = file_stream.read(len(self.MAGIC_BYTES))
            return header == self.MAGIC_BYTES
        finally:
            # CRITICAL: Restore stream position
            file_stream.seek(cur_pos)

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Read entire file
        content = file_stream.read()
        # Parse custom format
        # ... parsing logic ...
        # Generate Markdown
        markdown = "# Custom Format Document\n\n"
        markdown += "Content goes here..."
        return DocumentConverterResult(
            markdown=markdown,
            title="Custom Document"
        )
URL-Based Converter
from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any

class GitHubConverter(DocumentConverter):
    """Special converter for GitHub URLs."""

    def accepts(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> bool:
        # Accept based on URL pattern
        if stream_info.url is None:
            return False
        return "github.com" in stream_info.url

    def convert(self, file_stream: BinaryIO, stream_info, **kwargs: Any) -> DocumentConverterResult:
        # Read HTML content
        html = file_stream.read().decode('utf-8')
        # Extract and format GitHub-specific content
        # ... extraction logic ...
        markdown = f"# GitHub: {stream_info.url}\n\n"
        # ... add content ...
        return DocumentConverterResult(
            markdown=markdown,
            title="GitHub Page"
        )
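A substring test such as "github.com" in url also matches URLs that only mention the domain in a path or query string. If that matters for your converter, a stricter variant can compare the parsed hostname instead. The sketch below uses the standard library's urllib.parse. `is_github_url` is a hypothetical helper, and treating subdomains as matches is an assumption you may not want.

```python
from urllib.parse import urlparse


def is_github_url(url) -> bool:
    """Return True only when the URL's host is github.com or one of its subdomains."""
    if not url:
        return False
    # hostname is None when the URL has no scheme or network location.
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")


print(is_github_url("https://github.com/microsoft/markitdown"))
print(is_github_url("https://example.com/?ref=github.com"))
```

Inside accepts(), this would replace the substring check with `return is_github_url(stream_info.url)`.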
Registering Custom Converters
After creating a converter, register it with MarkItDown:
from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT

md = MarkItDown()
md.register_converter(
    JsonConverter(),
    priority=PRIORITY_SPECIFIC_FILE_FORMAT  # 0.0 - tried before generic converters
)
result = md.convert("data.json")
Priority Guidelines
Choose the appropriate priority for your converter:
- PRIORITY_SPECIFIC_FILE_FORMAT (0.0) - For specific formats (.docx, .pdf) or specific sites (Wikipedia, GitHub)
- PRIORITY_GENERIC_FILE_FORMAT (10.0) - For catch-all converters (text/*, generic HTML)
Lower priority values are tried first. Converters with the same priority are tried in reverse registration order (most recently registered first).
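The ordering rule can be reproduced with a plain list and a stable sort. This is a toy model of the behavior described above, not MarkItDown's internal code. The `register` and `resolution_order` functions and the converter names are made up for illustration.

```python
PRIORITY_SPECIFIC = 0.0   # mirrors PRIORITY_SPECIFIC_FILE_FORMAT
PRIORITY_GENERIC = 10.0   # mirrors PRIORITY_GENERIC_FILE_FORMAT

registry = []


def register(name, priority):
    # Newest registrations go to the front of the list.
    registry.insert(0, (priority, name))


def resolution_order():
    # sorted() is stable, so entries with equal priority stay newest-first.
    return [name for _, name in sorted(registry, key=lambda entry: entry[0])]


register("plain_text", PRIORITY_GENERIC)
register("pdf", PRIORITY_SPECIFIC)
register("html", PRIORITY_GENERIC)
register("wikipedia", PRIORITY_SPECIFIC)

print(resolution_order())  # ['wikipedia', 'pdf', 'html', 'plain_text']
```

In this model a converter registered later at the same priority shadows an earlier one, which is how a custom converter can take precedence over a built-in converter for the same format.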
Best Practices
Stream Position Management
Always restore the file stream position in accepts() if you read from it.
def accepts(self, file_stream, stream_info, **kwargs):
    cur_pos = file_stream.tell()
    try:
        # Read operations here
        data = file_stream.read(100)
        # ... check data ...
        return data.startswith(b"EXPECTED_SIGNATURE")
    finally:
        file_stream.seek(cur_pos)  # Always restore
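If several converters repeat this save-and-restore pattern, it can be factored into a context manager. The sketch below uses contextlib from the standard library. `preserve_position` is a hypothetical helper, not something MarkItDown provides, and the PNG signature is just an example.

```python
import io
from contextlib import contextmanager


@contextmanager
def preserve_position(stream):
    """Restore the stream's position on exit, even if the body raises."""
    pos = stream.tell()
    try:
        yield stream
    finally:
        stream.seek(pos)


def accepts(file_stream, stream_info=None, **kwargs):
    # Peek at the first four bytes; the position is restored on return.
    with preserve_position(file_stream) as stream:
        return stream.read(4) == b"\x89PNG"


png = io.BytesIO(b"\x89PNG\r\n\x1a\n")
print(accepts(png), png.tell())
```

The return statement inside the with block still triggers the restore, because the context manager's finally clause runs as the block exits.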
Error Handling
Raise appropriate exceptions:
from markitdown._exceptions import FileConversionException, MissingDependencyException

def convert(self, file_stream, stream_info, **kwargs):
    try:
        import optional_library
    except ImportError as e:
        raise MissingDependencyException(
            "optional_library is required for this converter"
        ) from e
    try:
        # Conversion logic
        pass
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise FileConversionException(f"Conversion failed: {e}") from e
Return complete metadata when available:
def convert(self, file_stream, stream_info, **kwargs):
    # Extract title from content if possible
    # (extract_title and markdown_content are placeholders for converter-specific logic)
    title = extract_title(file_stream)
    return DocumentConverterResult(
        markdown=markdown_content,
        title=title  # Helps users identify content
    )
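How a title is extracted depends on the format. As one illustration, a converter that has already produced Markdown could reuse the first level-one heading as the title. This `extract_title` is a hypothetical helper that takes the Markdown text, unlike the placeholder above, which takes the stream.

```python
import re

# A level-one heading: "#", at least one space or tab, then the heading text.
_H1_PATTERN = re.compile(r"^#[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def extract_title(markdown_text: str):
    """Return the text of the first level-one heading, or None if there is none."""
    match = _H1_PATTERN.search(markdown_text)
    return match.group(1) if match else None


print(extract_title("Intro line\n# Quarterly Report \n\n## Summary\n"))
print(extract_title("## Only a subsection\n"))
```

Returning None when no heading is found leaves the title unset rather than inventing one.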