PlainTextConverter
Converts plain text files, JSON, and Markdown to Markdown (passthrough with encoding detection).
- MIME Types: text/*, application/json, application/markdown
- Extensions: .txt, .text, .md, .markdown, .json, .jsonl
Features
- Automatic character encoding detection using charset_normalizer
- Respects stream_info.charset if provided
- Handles files with any charset (UTF-8, Latin-1, etc.)
Example
from markitdown.converters import PlainTextConverter
from markitdown._stream_info import StreamInfo
converter = PlainTextConverter()
with open("README.md", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".md"))
print(result.markdown) # Original content
Source
_plain_text_converter.py:33
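The charset handling listed under Features can be illustrated with a small standard-library sketch. It is not the converter's implementation: `decode_text` is a hypothetical helper, and a fixed fallback chain (UTF-8, then Latin-1) stands in for the statistical detection that charset_normalizer performs.

```python
from typing import Optional


def decode_text(data: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode bytes, preferring a declared charset over guessed ones."""
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)  # the caller's hint is tried first
    candidates += ["utf-8", "latin-1"]  # simplified stand-in for real detection
    for name in candidates:
        try:
            return data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # Not reached in practice: latin-1 maps every byte to a character.
    return data.decode("utf-8", errors="replace")
```

A declared charset plays the role of stream_info.charset here: it is tried first, and the guesses only apply when it is missing or fails.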
CsvConverter
Converts CSV files to Markdown tables.
- MIME Types: text/csv, application/csv
- Extensions: .csv
Features
- First row treated as header
- Automatic column alignment
- Handles missing cells
- Character encoding detection
Example
from markitdown.converters import CsvConverter
from markitdown._stream_info import StreamInfo
converter = CsvConverter()
with open("data.csv", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".csv"))
print(result.markdown)
Input CSV:
Name,Age,City
Alice,30,NYC
Bob,25,SF
Output:
| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
Source
_csv_converter.py:15
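The header and missing-cell behavior described above can be sketched with the standard csv module. `csv_to_markdown` is a hypothetical name, and padding short rows with empty cells is an assumption about how missing cells are handled, not a statement about the converter's source.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in body:
        # Pad short rows with empty cells and drop extras so each row matches the header.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Applied to the sample input above, this sketch produces the same four table lines as the documented output.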
IpynbConverter
Converts Jupyter Notebook (.ipynb) files to Markdown.
- MIME Types: application/json (if contains nbformat)
- Extensions: .ipynb
Features
- Markdown cells preserved as-is
- Code cells wrapped in ```python blocks
- Raw cells wrapped in ``` blocks
- Extracts title from first H1 heading or notebook metadata
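The cell handling listed above can be sketched with the json module alone. `notebook_to_markdown` is a hypothetical helper written for illustration; it assumes the standard notebook layout (a top-level `cells` list whose entries carry `cell_type` and `source`) and leaves out title extraction.

```python
import json

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


def notebook_to_markdown(nb_json: str) -> str:
    """Join notebook cells into Markdown, fencing code and raw cells."""
    notebook = json.loads(nb_json)
    parts = []
    for cell in notebook.get("cells", []):
        # "source" may be stored as one string or as a list of lines.
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        kind = cell.get("cell_type")
        if kind == "markdown":
            parts.append(source)  # preserved as-is
        elif kind == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
        elif kind == "raw":
            parts.append(f"{FENCE}\n{source}\n{FENCE}")
    return "\n\n".join(parts)
```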
Example
from markitdown.converters import IpynbConverter
from markitdown._stream_info import StreamInfo
converter = IpynbConverter()
with open("analysis.ipynb", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".ipynb"))
print(result.markdown)
print(f"Title: {result.title}")
Output:
# Data Analysis
This notebook analyzes customer data.
```python
import pandas as pd
df = pd.read_csv('data.csv')
df.head()
```
Results
The analysis shows…
Source
_ipynb_converter.py:15
ZipConverter
Converts ZIP archives by extracting and converting all contained files.
Accepted Formats
- MIME Types: application/zip
- Extensions: .zip
Constructor
def __init__(self, *, markitdown: MarkItDown)
Requires: MarkItDown instance to process extracted files
Features
- Recursively converts all files in archive
- Each file presented under a ## File: path/to/file.ext heading
- Unsupported files silently skipped
- Maintains directory structure in headings
Example
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("archive.zip")
print(result.markdown)
Output:
Content from the zip file `archive.zip`:
## File: docs/readme.txt
Welcome to the project...
## File: data/report.xlsx
## Sheet1
| Column1 | Column2 |
| --- | --- |
| Data1 | Data2 |
Source
_zip_converter.py:22
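The per-file headings and the skipping of unsupported files can be sketched with the standard zipfile module. `zip_to_markdown` and the `convert` callback are hypothetical: the callback stands in for the MarkItDown instance the real converter requires, and returning None is this sketch's way of marking a file as unsupported.

```python
import io
import zipfile
from typing import Callable, Optional


def zip_to_markdown(data: bytes, convert: Callable[[str, bytes], Optional[str]]) -> str:
    """Convert each archive member and place it under a '## File:' heading."""
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories only appear through member paths
            text = convert(info.filename, archive.read(info))
            if text is None:
                continue  # unsupported member: skipped without an error
            sections.append(f"## File: {info.filename}\n\n{text}")
    return "\n\n".join(sections)


def text_only(name: str, payload: bytes) -> Optional[str]:
    """Toy converter for the sketch: handles .txt members and nothing else."""
    return payload.decode("utf-8") if name.endswith(".txt") else None
```

Because the heading uses the member's full path, the archive's directory structure stays visible in the output.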
EpubConverter
Converts EPUB ebooks to Markdown, preserving chapter structure and metadata.
- MIME Types: application/epub, application/epub+zip, application/x-epub+zip
- Extensions: .epub
Features
- Extracts metadata (title, authors, publisher, description, etc.)
- Converts chapters in spine order
- Preserves chapter structure
- Extends HtmlConverter for content conversion
Example
from markitdown.converters import EpubConverter
from markitdown._stream_info import StreamInfo
converter = EpubConverter()
with open("book.epub", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".epub"))
print(result.markdown)
Output:
**Title:** The Great Novel
**Authors:** Jane Doe, John Smith
**Publisher:** Example Press
**Date:** 2024-01-15
**Description:** A gripping tale of...
# Chapter 1
It was a dark and stormy night...
Source
_epub_converter.py:26
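"Spine order" refers to the reading order declared in the EPUB package (OPF) document, which can differ from the order of files in the manifest. A minimal sketch of resolving that order, assuming a standard OPF layout; `spine_hrefs` is a hypothetical helper, not part of markitdown.

```python
import xml.etree.ElementTree as ET
from typing import List

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml: str) -> List[str]:
    """Return chapter file paths in the reading order given by the spine."""
    root = ET.fromstring(opf_xml)
    # The manifest maps item ids to file paths.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists item ids in reading order.
    return [
        manifest[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.get("idref") in manifest
    ]
```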
RssConverter
Converts RSS and Atom feeds to Markdown.
- MIME Types: application/rss+xml, application/atom+xml, text/xml, application/xml
- Extensions: .rss, .atom, .xml
Features
- Supports RSS 2.0 and Atom formats
- Extracts feed title, description, and items
- HTML content in descriptions converted to Markdown
- Preserves publication dates
Example
from markitdown.converters import RssConverter
from markitdown._stream_info import StreamInfo
converter = RssConverter()
with open("feed.rss", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".rss"))
print(result.markdown)
Output:
# Tech News Blog
Daily technology news and updates
## New AI Breakthrough Announced
Published on: Wed, 15 Feb 2024 10:30:00 GMT
Researchers have developed a new...
## Cloud Computing Trends
Published on: Tue, 14 Feb 2024 14:00:00 GMT
The latest trends in cloud infrastructure...
Source
_rss_converter.py:29
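The output layout above (feed title, description, then one section per item) can be sketched for RSS 2.0 with the standard XML parser. `rss_to_markdown` is a hypothetical helper; it covers RSS 2.0 only and copies item descriptions verbatim, so the HTML-to-Markdown step and Atom support are left out.

```python
import xml.etree.ElementTree as ET


def rss_to_markdown(xml_text: str) -> str:
    """Render an RSS 2.0 feed as Markdown headings and paragraphs."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document")
    lines = ["# " + channel.findtext("title", default="").strip()]
    description = channel.findtext("description", default="").strip()
    if description:
        lines.append(description)
    for item in channel.findall("item"):
        lines.append("## " + item.findtext("title", default="").strip())
        published = item.findtext("pubDate")
        if published:
            lines.append("Published on: " + published.strip())  # date kept as written
        body = item.findtext("description")
        if body:
            lines.append(body.strip())
    return "\n".join(lines)
```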
WikipediaConverter
Specialized converter for Wikipedia pages, extracting main article content.
- MIME Types: text/html, application/xhtml
- Extensions: .html, .htm
- URL Pattern: https://*.wikipedia.org/*
Features
- Extracts only main content (#mw-content-text)
- Removes navigation, sidebars, and footer
- Preserves article title
- Extends HtmlConverter
Example
import io
import requests
from markitdown.converters import WikipediaConverter
from markitdown._stream_info import StreamInfo
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)
converter = WikipediaConverter()
# Wrap the response bytes in a file-like object, as the other URL-based examples do
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)
Source
_wikipedia_converter.py:20
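Because this converter is selected by URL as well as by MIME type, the StreamInfo must carry the page URL. A rough sketch of a check equivalent to the documented pattern https://*.wikipedia.org/*; `looks_like_wikipedia` is a hypothetical helper, and the converter's own matching logic may differ.

```python
from urllib.parse import urlparse


def looks_like_wikipedia(url: str) -> bool:
    """True for https URLs whose host is a subdomain of wikipedia.org."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Comparing the parsed host avoids matching "wikipedia.org" elsewhere in the URL.
    return parts.scheme == "https" and host.endswith(".wikipedia.org")
```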
YouTubeConverter
Extracts YouTube video metadata, description, and transcript.
- URL Pattern: https://www.youtube.com/watch?v=*
- MIME Types: text/html, application/xhtml
Dependencies
Optional: youtube-transcript-api for transcripts
Features
- Extracts video title, views, keywords, runtime
- Retrieves description from page metadata
- Downloads transcript (if available)
- Supports multiple languages
Parameters
- youtube_transcript_languages: Languages to try for the transcript (e.g., ["en", "es"]). Defaults to detected languages.
Example
import io
import requests
from markitdown.converters import YouTubeConverter
from markitdown._stream_info import StreamInfo
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
response = requests.get(url)
converter = YouTubeConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html"),
    youtube_transcript_languages=["en"]
)
print(result.markdown)
Output:
# YouTube
## Video Title
### Video Metadata
- **Views:** 1234567890
- **Runtime:** PT3M33S
### Description
Official music video...
### Transcript
We're no strangers to love. You know the rules and so do I...
Source
_youtube_converter.py:37
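The URL pattern above implies that a video ID is present in the `v` query parameter. A small sketch of pulling it out with the standard library; `youtube_video_id` is a hypothetical helper, and it only recognizes the documented watch-page form.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from a watch-page URL, or None if the URL does not match."""
    parts = urlparse(url)
    if parts.scheme != "https" or parts.hostname != "www.youtube.com":
        return None
    if parts.path != "/watch":
        return None
    values = parse_qs(parts.query).get("v")
    return values[0] if values else None
```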
BingSerpConverter
Extracts organic search results from Bing search results pages.
- URL Pattern: https://www.bing.com/search?q=*
- MIME Types: text/html, application/xhtml
Features
- Extracts search query
- Parses organic results (.b_algo class)
- Decodes redirect URLs
- Converts HTML descriptions to Markdown
Example
import io
import requests
from markitdown.converters import BingSerpConverter
from markitdown._stream_info import StreamInfo
url = "https://www.bing.com/search?q=python+programming"
response = requests.get(url)
converter = BingSerpConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)
Output:
## A Bing search for 'python programming' found the following results:
[Python.org](https://www.python.org)
Welcome to Python.org - The official home of the Python programming language...
[Python Tutorial](https://docs.python.org/tutorial/)
The Python Tutorial — Python 3.12 documentation...
Source
_bing_serp_converter.py:23
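The search query in the output heading comes from the `q` parameter of the results URL. A sketch of that step using urllib.parse; `bing_query` and `results_heading` are hypothetical helpers, and result parsing and redirect decoding are not shown.

```python
from urllib.parse import parse_qs, urlparse


def bing_query(url: str) -> str:
    """Return the decoded search query from a Bing results URL ('' if absent)."""
    values = parse_qs(urlparse(url).query).get("q", [""])
    return values[0]  # parse_qs has already turned '+' and %XX escapes into text


def results_heading(url: str) -> str:
    """Build the heading line in the style of the sample output above."""
    return f"## A Bing search for '{bing_query(url)}' found the following results:"
```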
OutlookMsgConverter
Converts Outlook .msg email files to Markdown.
Dependencies
pip install markitdown[outlook]
Requires: olefile
- MIME Types: application/vnd.ms-outlook
- Extensions: .msg
Features
- Extracts email headers (From, To, Subject)
- Retrieves email body content
- Handles text encoding (UTF-16, UTF-8)
Example
from markitdown.converters import OutlookMsgConverter
from markitdown._stream_info import StreamInfo
converter = OutlookMsgConverter()
with open("message.msg", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".msg"))
print(result.markdown)
Output:
# Email Message
**From:** john.doe@example.com
**To:** jane.smith@example.com
**Subject:** Project Update
## Content
Hi Jane,
Here's the latest update on the project...
Source
_outlook_msg_converter.py:24
DocumentIntelligenceConverter
Uses Azure Document Intelligence (formerly Form Recognizer) for advanced OCR and document understanding.
Dependencies
pip install markitdown[az-doc-intel]
Requires: azure-ai-documentintelligence, azure-identity
Accepted Formats
- DOCX, PPTX, XLSX (without OCR)
- PDF, JPEG, PNG, BMP, TIFF (with OCR)
- HTML
Constructor
def __init__(
    self,
    *,
    endpoint: str,
    api_version: str = "2024-07-31-preview",
    credential: AzureKeyCredential | TokenCredential | None = None,
    file_types: List[DocumentIntelligenceFileType] = [...]
)
Parameters:
- endpoint (str, required): Azure Document Intelligence endpoint URL.
- api_version (str, default "2024-07-31-preview"): API version to use.
- credential (AzureKeyCredential | TokenCredential | None): Authentication credential. If omitted, falls back to DefaultAzureCredential() or the AZURE_API_KEY environment variable.
- file_types (List[DocumentIntelligenceFileType]): File types to accept, taken from the DocumentIntelligenceFileType enum.
Features
- Advanced OCR with high resolution support
- Formula extraction from documents
- Font style detection
- Layout analysis
- Native Markdown output
Example
from markitdown.converters import DocumentIntelligenceConverter
from markitdown._stream_info import StreamInfo
from azure.core.credentials import AzureKeyCredential
converter = DocumentIntelligenceConverter(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-api-key")
)
with open("scanned_document.pdf", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".pdf"))
print(result.markdown)
File Types
from markitdown.converters import DocumentIntelligenceFileType
# Available file types:
DocumentIntelligenceFileType.DOCX
DocumentIntelligenceFileType.PPTX
DocumentIntelligenceFileType.XLSX
DocumentIntelligenceFileType.HTML
DocumentIntelligenceFileType.PDF
DocumentIntelligenceFileType.JPEG
DocumentIntelligenceFileType.PNG
DocumentIntelligenceFileType.BMP
DocumentIntelligenceFileType.TIFF
Analysis Features
For OCR-supported formats (PDF, images):
- FORMULAS: Extract mathematical formulas
- OCR_HIGH_RESOLUTION: High-quality OCR
- STYLE_FONT: Font style information
Source
_doc_intel_converter.py:130
Authentication
Supports multiple authentication methods:
- API Key (via AzureKeyCredential)
- Environment Variable (AZURE_API_KEY)
- Managed Identity (via DefaultAzureCredential)
# Using environment variable
import os
from markitdown.converters import DocumentIntelligenceConverter
os.environ["AZURE_API_KEY"] = "your-key"
converter = DocumentIntelligenceConverter(endpoint="https://...")
# Using managed identity
from azure.identity import DefaultAzureCredential
converter = DocumentIntelligenceConverter(
    endpoint="https://...",
    credential=DefaultAzureCredential()
)
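How the three authentication methods might be chosen between can be sketched without the Azure SDK. `resolve_credential` is a hypothetical helper that returns a label instead of a real credential object, and the precedence shown (explicit credential, then AZURE_API_KEY, then DefaultAzureCredential) is an assumption for illustration, not taken from the converter's source.

```python
import os
from typing import Any, Mapping, Optional, Tuple


def resolve_credential(
    credential: Optional[Any] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Any]:
    """Pick an authentication method; the ordering is assumed, see the note above."""
    env = os.environ if environ is None else environ
    if credential is not None:
        return ("explicit", credential)  # the caller passed a credential object
    api_key = env.get("AZURE_API_KEY")
    if api_key:
        return ("api_key", api_key)  # would be wrapped in AzureKeyCredential
    return ("default_azure_credential", None)  # e.g. managed identity
```

Passing `environ` explicitly keeps the sketch testable without touching the process environment.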