Other Converters - MarkItDown

PlainTextConverter

Passes plain text, JSON, and Markdown files through as Markdown, decoding them with automatic encoding detection.

Accepted Formats

MIME Types: text/*, application/json, application/markdown
Extensions: .txt, .text, .md, .markdown, .json, .jsonl

Features

Automatic character encoding detection using charset_normalizer
Respects stream_info.charset if provided
Handles files with any charset (UTF-8, Latin-1, etc.)
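
The precedence in the list above (an explicit charset first, detection second) can be sketched with the standard library alone. The helper name `decode_text` and the UTF-8-then-Latin-1 fallback are assumptions made for illustration; the converter itself delegates detection to charset_normalizer rather than a fixed fallback chain.

```python
from typing import Optional


def decode_text(data: bytes, charset: Optional[str] = None) -> str:
    """Decode bytes, preferring an explicit charset over guessing.

    Hypothetical helper: a crude stand-in for charset_normalizer.
    """
    if charset:
        # A caller-supplied charset (cf. stream_info.charset) wins outright.
        return data.decode(charset)
    try:
        # No hint given: try UTF-8 first.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        return data.decode("latin-1")
```

Calling `decode_text(b"caf\xe9")` falls through to Latin-1 because the lone `\xe9` byte is not valid UTF-8, while `decode_text(data, "utf-16")` skips guessing entirely.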

Example

from markitdown.converters import PlainTextConverter
from markitdown._stream_info import StreamInfo

converter = PlainTextConverter()

with open("README.md", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".md"))
    print(result.markdown)  # Original content

Source

_plain_text_converter.py:33

CsvConverter

Converts CSV files to Markdown tables.

Accepted Formats

MIME Types: text/csv, application/csv
Extensions: .csv

Features

First row treated as header
Automatic column alignment
Handles missing cells
Character encoding detection

Example

from markitdown.converters import CsvConverter
from markitdown._stream_info import StreamInfo

converter = CsvConverter()

with open("data.csv", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".csv"))
    print(result.markdown)

Input CSV:

Name,Age,City
Alice,30,NYC
Bob,25,SF

Output:

| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
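
To make the header and missing-cell behaviour concrete, here is a minimal standard-library sketch of the same transformation. `csv_to_markdown` is a hypothetical name, and truncating rows that are longer than the header is an assumption rather than documented behaviour.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, using the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Pad short rows with empty cells; drop extras beyond the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Name,Age,City\nAlice,30,NYC\nBob,25\n"))
```

The last row has no City value, so the sketch emits an empty third cell instead of a ragged table row.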

Source

_csv_converter.py:15

IpynbConverter

Converts Jupyter Notebook (.ipynb) files to Markdown.

Accepted Formats

MIME Types: application/json (only if the content contains nbformat)
Extensions: .ipynb

Features

Markdown cells preserved as-is
Code cells wrapped in ```python blocks
Raw cells wrapped in ``` blocks
Extracts title from first H1 heading or notebook metadata
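
The cell handling listed above can be approximated with the `json` module. This is a sketch under stated assumptions, not the converter's code: `notebook_to_markdown` is a hypothetical name, and letting a title in the notebook metadata override the first H1 heading is an assumed precedence.

```python
import json

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def notebook_to_markdown(raw: str):
    """Return (markdown, title) for a notebook given as a JSON string."""
    notebook = json.loads(raw)
    parts = []
    title = None
    for cell in notebook.get("cells", []):
        # nbformat stores source as a list of lines or as a single string.
        source = "".join(cell.get("source", []))
        kind = cell.get("cell_type")
        if kind == "markdown":
            parts.append(source)
            # Remember the first level-1 heading as a candidate title.
            if title is None:
                for line in source.splitlines():
                    if line.startswith("# "):
                        title = line[2:].strip()
                        break
        elif kind == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
        elif kind == "raw":
            parts.append(f"{FENCE}\n{source}\n{FENCE}")
    # Assumption: a title stored in notebook metadata wins over the heading.
    title = notebook.get("metadata", {}).get("title", title)
    return "\n\n".join(parts), title


demo = json.dumps({
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "source": ["# Data Analysis\n", "Intro text."]},
        {"cell_type": "code", "source": ["import pandas as pd"]},
    ],
})
markdown, title = notebook_to_markdown(demo)
print(title)  # Data Analysis
```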

Example

from markitdown.converters import IpynbConverter
from markitdown._stream_info import StreamInfo

converter = IpynbConverter()

with open("analysis.ipynb", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".ipynb"))
    print(result.markdown)
    print(f"Title: {result.title}")

Output:

# Data Analysis

This notebook analyzes customer data.

```python
import pandas as pd
df = pd.read_csv('data.csv')
df.head()
```

## Results

The analysis shows…

Source

_ipynb_converter.py:15

ZipConverter

Converts ZIP archives by extracting and converting all contained files.

Accepted Formats

MIME Types: application/zip
Extensions: .zip

Constructor

def __init__(self, *, markitdown: MarkItDown)

Requires: MarkItDown instance to process extracted files

Features

Recursively converts all files in archive
Each file presented under ## File: path/to/file.ext heading
Unsupported files silently skipped
Maintains directory structure in headings
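
The per-file headings and silent skipping described above can be illustrated with the `zipfile` module. In this sketch `zip_to_markdown`, `convert_member`, and `text_only` are hypothetical names; the callback stands in for the MarkItDown instance that the real converter receives.

```python
import io
import zipfile
from typing import Callable, Optional


def zip_to_markdown(
    data: bytes, convert_member: Callable[[str, bytes], Optional[str]]
) -> str:
    """Concatenate per-file Markdown under '## File: <path>' headings.

    convert_member is a hypothetical callback: it returns Markdown,
    or None for files it cannot convert.
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # directory entry, nothing to convert
            markdown = convert_member(name, archive.read(name))
            if markdown is None:
                continue  # unsupported file: skip silently
            sections.append(f"## File: {name}\n\n{markdown}")
    return "\n\n".join(sections)


def text_only(name: str, payload: bytes) -> Optional[str]:
    # Toy converter: handles .txt files and rejects everything else.
    return payload.decode("utf-8") if name.endswith(".txt") else None


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/readme.txt", "Welcome to the project...")
    archive.writestr("bin/tool.exe", b"\x00\x01")

print(zip_to_markdown(buffer.getvalue(), text_only))
```

Only `docs/readme.txt` appears in the result; the binary entry is dropped without raising an error.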

Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.markdown)

Output:

Content from the zip file `archive.zip`:

## File: docs/readme.txt

Welcome to the project...

## File: data/report.xlsx

## Sheet1
| Column1 | Column2 |
| --- | --- |
| Data1 | Data2 |

Source

_zip_converter.py:22

EpubConverter

Converts EPUB ebooks to Markdown, preserving chapter structure and metadata.

Accepted Formats

MIME Types: application/epub, application/epub+zip, application/x-epub+zip
Extensions: .epub

Features

Extracts metadata (title, authors, publisher, description, etc.)
Converts chapters in spine order
Preserves chapter structure
Extends HtmlConverter for content conversion
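
"Spine order" refers to the reading order declared in the EPUB package document. The sketch below shows one way to resolve it with the standard library, following the EPUB container layout (`META-INF/container.xml` pointing at an OPF file). The function name `spine_documents` is hypothetical, and the converter's own parsing may differ in detail.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub: bytes) -> list:
    """Return archive paths of the content documents in spine (reading) order."""
    with zipfile.ZipFile(io.BytesIO(epub)) as archive:
        # META-INF/container.xml points at the package (OPF) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))

    # The manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    # The spine lists item ids in reading order.
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", NS)
        if ref.attrib["idref"] in manifest
    ]
```

Each returned path can then be read from the archive and handed to an HTML-to-Markdown step, one chapter at a time.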

Example

from markitdown.converters import EpubConverter
from markitdown._stream_info import StreamInfo

converter = EpubConverter()

with open("book.epub", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".epub"))
    print(result.markdown)

Output:

**Title:** The Great Novel
**Authors:** Jane Doe, John Smith
**Publisher:** Example Press
**Date:** 2024-01-15
**Description:** A gripping tale of...

# Chapter 1

It was a dark and stormy night...

Source

_epub_converter.py:26

RssConverter

Converts RSS and Atom feeds to Markdown.

Accepted Formats

MIME Types: application/rss+xml, application/atom+xml, text/xml, application/xml
Extensions: .rss, .atom, .xml

Features

Supports RSS 2.0 and Atom formats
Extracts feed title, description, and items
HTML content in descriptions converted to Markdown
Preserves publication dates
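
As a rough illustration of the RSS 2.0 half of this, the sketch below walks a feed with `xml.etree.ElementTree` and emits a feed header followed by one section per item. The name `rss_to_markdown` is hypothetical, Atom is not covered, and item descriptions are kept verbatim rather than converted from HTML.

```python
import xml.etree.ElementTree as ET


def rss_to_markdown(xml_text: str) -> str:
    """Render an RSS 2.0 feed as Markdown: feed header, then one section per item."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: missing <channel>")

    lines = []
    title = channel.findtext("title")
    if title:
        lines.append(f"# {title}")
    description = channel.findtext("description")
    if description:
        lines.append(description)

    for item in channel.findall("item"):
        lines.append("")  # blank line between sections
        lines.append(f"## {item.findtext('title', default='')}")
        published = item.findtext("pubDate")
        if published:
            lines.append(f"Published on: {published}")
        summary = item.findtext("description")
        if summary:
            # Kept verbatim here; no HTML-to-Markdown step in this sketch.
            lines.append(summary)
    return "\n".join(lines)


feed = """<rss version="2.0"><channel>
<title>Tech News Blog</title>
<description>Daily technology news and updates</description>
<item><title>Cloud Computing Trends</title>
<pubDate>Wed, 14 Feb 2024 14:00:00 GMT</pubDate>
<description>The latest trends in cloud infrastructure...</description></item>
</channel></rss>"""
print(rss_to_markdown(feed))
```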

Example

from markitdown.converters import RssConverter
from markitdown._stream_info import StreamInfo

converter = RssConverter()

with open("feed.rss", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".rss"))
    print(result.markdown)

Output:

# Tech News Blog
Daily technology news and updates

## New AI Breakthrough Announced
Published on: Thu, 15 Feb 2024 10:30:00 GMT
Researchers have developed a new...

## Cloud Computing Trends
Published on: Wed, 14 Feb 2024 14:00:00 GMT
The latest trends in cloud infrastructure...

Source

_rss_converter.py:29

WikipediaConverter

Specialized converter for Wikipedia pages, extracting main article content.

Accepted Formats

MIME Types: text/html, application/xhtml
Extensions: .html, .htm
URL Pattern: https://*.wikipedia.org/*

Features

Extracts only main content (#mw-content-text)
Removes navigation, sidebars, and footer
Preserves article title
Extends HtmlConverter

Example

import io

import requests
from markitdown.converters import WikipediaConverter
from markitdown._stream_info import StreamInfo

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)

converter = WikipediaConverter()
# Converters read from a binary stream, so wrap the downloaded bytes.
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)

Source

_wikipedia_converter.py:20

YouTubeConverter

Extracts YouTube video metadata, description, and transcript.

Accepted Formats

URL Pattern: https://www.youtube.com/watch?v=*
MIME Types: text/html, application/xhtml
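
The URL pattern above keys on the `v` query parameter. A small sketch of pulling that identifier out with `urllib.parse` follows. The helper name `youtube_video_id` is hypothetical, and using the ID for a later transcript lookup is an assumption about how such a converter would proceed.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the v= parameter of a youtube.com/watch URL, or None if absent."""
    parsed = urlparse(url)
    if parsed.netloc != "www.youtube.com" or parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


print(youtube_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(youtube_video_id("https://www.youtube.com/playlist?list=abc"))    # None
```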

Dependencies

Optional: youtube-transcript-api for transcripts

Features

Extracts video title, views, keywords, runtime
Retrieves description from page metadata
Downloads transcript (if available)
Supports multiple languages

Parameters

youtube_transcript_languages (list): Languages to try for the transcript (e.g., ["en", "es"]). Defaults to detected languages.

Example

import io

import requests
from markitdown.converters import YouTubeConverter
from markitdown._stream_info import StreamInfo

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
response = requests.get(url)

converter = YouTubeConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html"),
    youtube_transcript_languages=["en"]
)
print(result.markdown)

Output:

# YouTube

## Video Title

### Video Metadata
- **Views:** 1234567890
- **Runtime:** PT3M33S

### Description
Official music video...

### Transcript
We're no strangers to love. You know the rules and so do I...

Source

_youtube_converter.py:37

BingSerpConverter

Extracts organic search results from Bing search results pages.

Accepted Formats

URL Pattern: https://www.bing.com/search?q=*
MIME Types: text/html, application/xhtml

Features

Extracts search query
Parses organic results (.b_algo class)
Decodes redirect URLs
Converts HTML descriptions to Markdown
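
Redirect decoding is the least obvious item in this list, so here is a hedged sketch of one way it can work. It assumes the real target is URL-safe base64 in a `u` query parameter, after a two-character prefix such as "a1". That link shape, the `/ck/a` path in the demo, and the name `decode_bing_redirect` are assumptions for illustration.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_bing_redirect(href: str) -> str:
    """Return the target of a redirect link, or href unchanged if undecodable.

    Assumption: the target is URL-safe base64 in the `u` parameter, after a
    two-character prefix such as "a1".
    """
    values = parse_qs(urlparse(href).query).get("u")
    if not values or len(values[0]) <= 2:
        return href
    payload = values[0][2:]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    try:
        return base64.urlsafe_b64decode(payload).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return href


target = "https://www.python.org"
token = "a1" + base64.urlsafe_b64encode(target.encode()).decode().rstrip("=")
print(decode_bing_redirect(f"https://www.bing.com/ck/a?u={token}"))  # https://www.python.org
```

Links without a `u` parameter are returned untouched, so the helper is safe to apply to every result link.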

Example

import io

import requests
from markitdown.converters import BingSerpConverter
from markitdown._stream_info import StreamInfo

url = "https://www.bing.com/search?q=python+programming"
response = requests.get(url)

converter = BingSerpConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)

Output:

## A Bing search for 'python programming' found the following results:

[Python.org](https://www.python.org)
Welcome to Python.org - The official home of the Python programming language...

[Python Tutorial](https://docs.python.org/tutorial/)
The Python Tutorial — Python 3.12 documentation...

Source

_bing_serp_converter.py:23

OutlookMsgConverter

Converts Outlook .msg email files to Markdown.

Dependencies

pip install markitdown[outlook]

Requires: olefile

Accepted Formats

MIME Types: application/vnd.ms-outlook
Extensions: .msg

Features

Extracts email headers (From, To, Subject)
Retrieves email body content
Handles text encoding (UTF-16, UTF-8)

Example

from markitdown.converters import OutlookMsgConverter
from markitdown._stream_info import StreamInfo

converter = OutlookMsgConverter()

with open("message.msg", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".msg"))
    print(result.markdown)

Output:

# Email Message

**From:** john.doe@example.com
**To:** jane.smith@example.com
**Subject:** Project Update

## Content

Hi Jane,

Here's the latest update on the project...

Source

_outlook_msg_converter.py:24

DocumentIntelligenceConverter

Uses Azure Document Intelligence (formerly Form Recognizer) for advanced OCR and document understanding.

Dependencies

pip install markitdown[az-doc-intel]

Requires: azure-ai-documentintelligence, azure-identity

Accepted Formats

DOCX, PPTX, XLSX (without OCR)
PDF, JPEG, PNG, BMP, TIFF (with OCR)
HTML

Constructor

def __init__(
    self,
    *,
    endpoint: str,
    api_version: str = "2024-07-31-preview",
    credential: AzureKeyCredential | TokenCredential | None = None,
    file_types: List[DocumentIntelligenceFileType] = [...]
)

Parameters:

endpoint (str, required): Azure Document Intelligence endpoint URL
api_version (str, default: "2024-07-31-preview"): API version to use
credential (AzureKeyCredential | TokenCredential): Authentication credential. Defaults to DefaultAzureCredential() or the AZURE_API_KEY environment variable.
file_types (list): File types to accept. Values from the DocumentIntelligenceFileType enum.

Features

Advanced OCR with high resolution support
Formula extraction from documents
Font style detection
Layout analysis
Native Markdown output

Example

from markitdown.converters import DocumentIntelligenceConverter
from markitdown._stream_info import StreamInfo
from azure.core.credentials import AzureKeyCredential

converter = DocumentIntelligenceConverter(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-api-key")
)

with open("scanned_document.pdf", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".pdf"))
    print(result.markdown)

File Types

from markitdown.converters import DocumentIntelligenceFileType

# Available file types:
DocumentIntelligenceFileType.DOCX
DocumentIntelligenceFileType.PPTX
DocumentIntelligenceFileType.XLSX
DocumentIntelligenceFileType.HTML
DocumentIntelligenceFileType.PDF
DocumentIntelligenceFileType.JPEG
DocumentIntelligenceFileType.PNG
DocumentIntelligenceFileType.BMP
DocumentIntelligenceFileType.TIFF

Analysis Features

For OCR-supported formats (PDF, images):

FORMULAS - Extract mathematical formulas
OCR_HIGH_RESOLUTION - High-quality OCR
STYLE_FONT - Font style information

Source

_doc_intel_converter.py:130

Authentication

Supports multiple authentication methods:

API Key (via AzureKeyCredential)
Environment Variable (AZURE_API_KEY)
Managed Identity (via DefaultAzureCredential)

# Using environment variable
import os
from markitdown.converters import DocumentIntelligenceConverter

os.environ["AZURE_API_KEY"] = "your-key"
converter = DocumentIntelligenceConverter(endpoint="https://...")

# Using managed identity
from azure.identity import DefaultAzureCredential
converter = DocumentIntelligenceConverter(
    endpoint="https://...",
    credential=DefaultAzureCredential()
)
