The MarkItDown class is the primary interface for converting various document formats to Markdown. It manages converter registration, file type detection, and the conversion process.
Constructor
MarkItDown(
    *,
    enable_builtins: Union[None, bool] = None,
    enable_plugins: Union[None, bool] = None,
    **kwargs
)
Create a new MarkItDown instance.
enable_builtins
bool | None
default: None (treated as True)
Enable built-in converters. When None or True, built-in converters are automatically registered.
enable_plugins
bool | None
default: None (treated as False)
Enable plugin converters. When True, converters from installed plugins are registered.
The following options are passed through **kwargs:
requests_session
Custom requests session for HTTP operations. If not provided, a default session is created with appropriate Accept headers.
llm_client
LLM client instance for converters that support AI-powered conversion.
llm_model
Model name to use with the LLM client.
llm_prompt
Custom prompt to use with LLM-based converters.
exiftool_path
Path to the exiftool binary for image metadata extraction. If not provided, common system paths are searched.
style_map
Custom style map for DOCX conversion.
docintel_endpoint
Azure Document Intelligence endpoint URL. When provided, enables the Document Intelligence converter.
docintel_credential
Credentials for Azure Document Intelligence.
docintel_file_types
File types to process with Document Intelligence.
docintel_api_version
API version for the Document Intelligence service.
Example
from markitdown import MarkItDown
# Basic usage with defaults
md = MarkItDown()
# With custom configuration
md = MarkItDown(
    enable_builtins=True,
    enable_plugins=False,
    exiftool_path="/usr/local/bin/exiftool"
)
# With Azure Document Intelligence
# (`credential` is assumed to be an Azure credential object created beforehand)
md = MarkItDown(
    docintel_endpoint="https://your-resource.cognitiveservices.azure.com/",
    docintel_credential=credential
)
Methods
convert()
def convert(
    source: Union[str, requests.Response, Path, BinaryIO],
    *,
    stream_info: Optional[StreamInfo] = None,
    **kwargs: Any
) -> DocumentConverterResult
Convert a document from various source types to Markdown.
source
str | requests.Response | Path | BinaryIO
required
The source to convert. Can be:
- Local file path (str or Path)
- URL or URI string (http://, https://, file://, or a data: URI)
- requests.Response object
- Binary file-like object (BinaryIO)
stream_info
StreamInfo | None
Optional metadata about the source. If not provided, MarkItDown attempts to infer it.
Returns
DocumentConverterResult
The conversion result containing the Markdown text and optional metadata.
Example
# Convert local file
result = md.convert("document.pdf")
print(result.markdown)
# Convert URL
result = md.convert("https://example.com/file.docx")
# Convert with explicit stream info
from markitdown import StreamInfo
result = md.convert(
    "file.txt",
    stream_info=StreamInfo(charset="utf-8", mimetype="text/plain")
)
# Convert binary stream
with open("file.pdf", "rb") as f:
    result = md.convert(f)
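When no stream_info is supplied, some metadata has to be guessed from the source itself. The sketch below is not MarkItDown's detection logic; it only illustrates the general idea with the standard library by guessing a MIME type from a file name. The helper name guess_stream_hints and the shape of the returned dictionary are hypothetical.

```python
import mimetypes
from pathlib import Path


def guess_stream_hints(path):
    """Guess basic metadata for a file path from its name alone.

    Hypothetical helper for illustration; a real detector may also
    inspect the file contents.
    """
    p = Path(path)
    # guess_type returns (mimetype, encoding); either may be None
    mimetype, _encoding = mimetypes.guess_type(p.name)
    return {
        "filename": p.name,
        "extension": p.suffix.lower() or None,
        "mimetype": mimetype,
    }


hints = guess_stream_hints("reports/summary.pdf")
no_hints = guess_stream_hints("README")
print(hints)
print(no_hints)
```

A name without an extension yields no MIME type at all, which is the situation where passing an explicit stream_info is most useful.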
convert_local()
def convert_local(
    path: Union[str, Path],
    *,
    stream_info: Optional[StreamInfo] = None,
    file_extension: Optional[str] = None,  # Deprecated
    url: Optional[str] = None,  # Deprecated
    **kwargs: Any
) -> DocumentConverterResult
Convert a local file to Markdown.
path
str | Path
required
Path to the local file to convert.
stream_info
StreamInfo | None
Optional metadata about the file.
Example
result = md.convert_local("/path/to/document.docx")
print(f"Title: {result.title}")
print(result.markdown)
convert_stream()
def convert_stream(
    stream: BinaryIO,
    *,
    stream_info: Optional[StreamInfo] = None,
    file_extension: Optional[str] = None,  # Deprecated
    url: Optional[str] = None,  # Deprecated
    **kwargs: Any
) -> DocumentConverterResult
Convert a binary stream to Markdown.
stream
BinaryIO
required
Binary file-like object to convert. Must support the read() method. If the stream is not seekable, it is loaded into memory.
stream_info
StreamInfo | None
Optional metadata about the stream. Used for format detection.
Example
import io
# Convert in-memory bytes
data = b"PDF content..."
stream = io.BytesIO(data)
result = md.convert_stream(stream)
# With stream info
from markitdown import StreamInfo
result = md.convert_stream(
    stream,
    stream_info=StreamInfo(extension=".pdf", mimetype="application/pdf")
)
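The note that a non-seekable stream is loaded into memory can be pictured with a small standard-library sketch. This is an illustration of the technique rather than the library's code; ensure_seekable and OneShotReader are hypothetical names introduced only for the demonstration.

```python
import io


def ensure_seekable(stream):
    """Return a seekable binary stream with the same content.

    Seekable streams are returned unchanged; anything else is read
    fully into an in-memory io.BytesIO buffer.
    """
    if hasattr(stream, "seekable") and stream.seekable():
        return stream
    return io.BytesIO(stream.read())


class OneShotReader(io.RawIOBase):
    """A tiny non-seekable stream used only for this demonstration."""

    def __init__(self, data):
        super().__init__()
        self._buffer = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._buffer.readinto(b)


raw = OneShotReader(b"%PDF-1.7 example bytes")
buffered = ensure_seekable(raw)
header = buffered.read(4)
buffered.seek(0)  # rewinding now works, so the start can be read again
```

The trade-off is memory: the whole payload is held in RAM, so very large non-seekable inputs may be better written to a temporary file first.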
convert_url()
def convert_url(
    url: str,
    *,
    stream_info: Optional[StreamInfo] = None,
    file_extension: Optional[str] = None,
    mock_url: Optional[str] = None,
    **kwargs: Any
) -> DocumentConverterResult
Convert a URL to Markdown. This is an alias for convert_uri().
url
str
required
URL to convert (http://, https://, file://, or a data: URI).
stream_info
StreamInfo | None
Optional metadata override.
mock_url
str | None
Pretend the content came from this URL instead (for converter routing).
Example
# Convert web page
result = md.convert_url("https://wikipedia.org/wiki/Python")
# Convert file URI
result = md.convert_url("file:///path/to/document.pdf")
# Convert data URI
result = md.convert_url("data:text/plain;base64,SGVsbG8gV29ybGQ=")
convert_uri()
def convert_uri(
    uri: str,
    *,
    stream_info: Optional[StreamInfo] = None,
    file_extension: Optional[str] = None,
    mock_url: Optional[str] = None,
    **kwargs: Any
) -> DocumentConverterResult
Convert a URI to Markdown. Supports the http://, https://, file://, and data: schemes.
uri
str
required
URI to convert. Supported schemes:
- http:// and https:// - fetches content via HTTP
- file:// - reads a local file
- data: - decodes a data URI
stream_info
StreamInfo | None
Optional metadata override.
mock_url
str | None
Mock the request as if it came from a different URL.
Example
# HTTP URI
result = md.convert_uri("https://example.com/doc.pdf")
# File URI
result = md.convert_uri("file:///home/user/document.docx")
# Data URI
result = md.convert_uri(
    "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"
)
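To make the scheme handling concrete, here is a minimal standard-library sketch of dispatching on a URI scheme and decoding a data URI. It is not MarkItDown's implementation; classify_uri and parse_data_uri are hypothetical helpers, and the returned labels are invented for the example.

```python
import base64
from urllib.parse import unquote_to_bytes, urlparse


def parse_data_uri(uri):
    """Split a data URI into (mimetype, attributes, payload bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI is missing the ',' separator")
    parts = header.split(";")
    is_base64 = parts[-1] == "base64"
    if is_base64:
        parts = parts[:-1]
    mimetype = parts[0] or None
    attributes = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    # Payload is either base64 or percent-encoded text
    data = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return mimetype, attributes, data


def classify_uri(uri):
    """Name the handling path for a URI, keyed on its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "fetch over HTTP"
    if scheme == "file":
        return "read local file"
    if scheme == "data":
        return "decode data URI"
    raise ValueError(f"unsupported URI scheme: {scheme!r}")


mimetype, attributes, data = parse_data_uri(
    "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"
)
print(mimetype, attributes, data)
```

The mimetype and charset recovered from the data URI header are the kind of metadata that could otherwise be supplied through stream_info.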
convert_response()
def convert_response(
    response: requests.Response,
    *,
    stream_info: Optional[StreamInfo] = None,
    file_extension: Optional[str] = None,
    url: Optional[str] = None,
    **kwargs: Any
) -> DocumentConverterResult
Convert an HTTP response to Markdown.
response
requests.Response
required
HTTP response object from the requests library.
stream_info
StreamInfo | None
Optional metadata override. By default, metadata is extracted from response headers.
Example
import requests
response = requests.get("https://example.com/document.pdf")
result = md.convert_response(response)
print(result.markdown)
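As an illustration of what extracting metadata from response headers can involve, the sketch below parses a Content-Type header value into a MIME type and charset using only the standard library. It is an assumption about the general approach, not the library's code, and parse_content_type is a hypothetical helper.

```python
from email.message import Message


def parse_content_type(header_value):
    """Return (mimetype, charset) parsed from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    mimetype = msg.get_content_type()
    charset = msg.get_content_charset()  # None when no charset parameter is present
    return mimetype, charset


html_info = parse_content_type("text/html; charset=UTF-8")
pdf_info = parse_content_type("application/pdf")
print(html_info)
print(pdf_info)
```

When a server sends a missing or misleading Content-Type, passing stream_info explicitly is the way to override what the headers suggest.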
register_converter()
def register_converter(
    converter: DocumentConverter,
    *,
    priority: float = PRIORITY_SPECIFIC_FILE_FORMAT
) -> None
Register a custom document converter.
converter
DocumentConverter
required
The converter instance to register.
priority
float
default: PRIORITY_SPECIFIC_FILE_FORMAT
Converter priority. Lower values are tried first. Use:
- PRIORITY_SPECIFIC_FILE_FORMAT (0.0) for specific formats
- PRIORITY_GENERIC_FILE_FORMAT (10.0) for generic/catch-all converters
Example
from markitdown import (
    MarkItDown,
    DocumentConverter,
    DocumentConverterResult,
    PRIORITY_SPECIFIC_FILE_FORMAT,
)

class MyCustomConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # Claim only files with the ".custom" extension
        return stream_info.extension == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        # Conversion logic
        return DocumentConverterResult(markdown="# Custom content")

md = MarkItDown()
md.register_converter(
    MyCustomConverter(),
    priority=PRIORITY_SPECIFIC_FILE_FORMAT
)
enable_builtins()
def enable_builtins(**kwargs) -> None
Enable and register built-in converters. Built-in converters are enabled by default, so call this method only when the instance was created with enable_builtins=False, and call it only once.
**kwargs
Configuration options passed to converters (llm_client, exiftool_path, etc.).
Example
# Create instance with builtins disabled
md = MarkItDown(enable_builtins=False)
# Enable later with configuration
md.enable_builtins(
    exiftool_path="/usr/local/bin/exiftool",
    llm_client=my_llm_client
)
enable_plugins()
def enable_plugins(**kwargs) -> None
Enable and register converters provided by installed plugins. Plugins are disabled by default. Call this method only once, and only when plugins were not already enabled at construction.
**kwargs
Configuration options passed to plugin converters.
Example
# Create instance with plugins disabled
md = MarkItDown(enable_plugins=False)
# Enable later
md.enable_plugins()
Constants
PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
Priority value for converters that handle specific file formats (e.g., .docx, .pdf, .xlsx) or specific websites (e.g., Wikipedia, YouTube).
PRIORITY_GENERIC_FILE_FORMAT = 10.0
Priority value for near catch-all converters that handle generic mimetypes (e.g., text/*, application/zip, text/html).
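The effect of the two priority values can be shown with a toy registry. This is a simplified stand-in, not the MarkItDown registry: ToyRegistry, Registration, and the extension-only accepts callback are hypothetical, and the tie-breaking between equal priorities is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


@dataclass
class Registration:
    name: str
    accepts: Callable[[str], bool]  # receives a file extension such as ".pdf"
    priority: float


class ToyRegistry:
    """Hypothetical stand-in for a converter registry, for illustration only."""

    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, name, accepts, priority=PRIORITY_SPECIFIC_FILE_FORMAT):
        self._registrations.append(Registration(name, accepts, priority))

    def pick(self, extension: str) -> Optional[str]:
        # sorted() is stable: equal priorities keep registration order here,
        # which is an assumption of this sketch, not documented behavior.
        for reg in sorted(self._registrations, key=lambda r: r.priority):
            if reg.accepts(extension):
                return reg.name
        return None


registry = ToyRegistry()
registry.register("plain-text", lambda ext: True, PRIORITY_GENERIC_FILE_FORMAT)
registry.register("pdf", lambda ext: ext == ".pdf")

pdf_choice = registry.pick(".pdf")  # the specific converter is tried first
log_choice = registry.pick(".log")  # only the catch-all accepts this one
print(pdf_choice, log_choice)
```

Even though the catch-all was registered first, the lower priority value of the specific converter puts it ahead in the search order.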
Built-in Converters
When enable_builtins=True (default), the following converters are automatically registered:
- PlainTextConverter - Plain text files (priority 10.0)
- HtmlConverter - HTML documents (priority 10.0)
- ZipConverter - ZIP archives (priority 10.0)
- RssConverter - RSS feeds
- WikipediaConverter - Wikipedia pages
- YouTubeConverter - YouTube videos
- BingSerpConverter - Bing search results
- DocxConverter - Microsoft Word documents
- XlsxConverter - Excel spreadsheets (.xlsx)
- XlsConverter - Excel spreadsheets (.xls)
- PptxConverter - PowerPoint presentations
- PdfConverter - PDF documents
- ImageConverter - Image files with OCR
- AudioConverter - Audio files with transcription
- IpynbConverter - Jupyter notebooks
- OutlookMsgConverter - Outlook email messages
- EpubConverter - EPUB ebooks
- CsvConverter - CSV files
- DocumentIntelligenceConverter - Azure Document Intelligence (when endpoint provided)
Error Handling
The convert() methods may raise the following exceptions:
- FileConversionException - Converter attempted conversion but failed
- UnsupportedFormatException - No converter can handle the format
- MissingDependencyException - Required dependency not installed
from markitdown import MarkItDown, UnsupportedFormatException
md = MarkItDown()
try:
    result = md.convert("document.xyz")
except UnsupportedFormatException:
    print("This file format is not supported")
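When converting many files, it can be convenient to collect failures instead of stopping at the first one. The helper below is a hedged sketch of that pattern: convert_many is a hypothetical function, and the fake converter and exception exist only so the example runs without any files or dependencies.

```python
def convert_many(convert, sources, recoverable=(Exception,)):
    """Convert each source, collecting failures instead of stopping.

    `convert` is any callable taking one source; `recoverable` is the
    tuple of exception types to catch. With MarkItDown you would pass
    md.convert and the three exception classes listed above.
    """
    results, failures = {}, {}
    for source in sources:
        try:
            results[source] = convert(source)
        except recoverable as exc:
            failures[source] = exc
    return results, failures


# Stand-in exception and converter so the sketch runs on its own.
class FakeUnsupported(Exception):
    pass


def fake_convert(source):
    if not source.endswith(".txt"):
        raise FakeUnsupported(source)
    return f"# {source}"


results, failures = convert_many(
    fake_convert, ["a.txt", "b.xyz"], recoverable=(FakeUnsupported,)
)
print(sorted(results), sorted(failures))
```

Catching a narrow tuple of exception types keeps genuine programming errors visible while still letting a batch run to completion.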