PlainTextConverter

Converts plain text files, JSON, and Markdown to Markdown (passthrough with encoding detection).

Accepted Formats

  • MIME Types: text/*, application/json, application/markdown
  • Extensions: .txt, .text, .md, .markdown, .json, .jsonl

Features

  • Automatic character encoding detection using charset_normalizer
  • Respects stream_info.charset if provided
  • Handles files with any charset (UTF-8, Latin-1, etc.)
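
The charset handling listed above could be sketched roughly as follows. This is a standard-library stand-in rather than the converter's own code: `decode_text` and its fallback list are hypothetical, and a fixed list of candidate encodings takes the place of `charset_normalizer` detection.

```python
from typing import Optional, Sequence


def decode_text(
    data: bytes,
    charset: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "latin-1"),
) -> str:
    """Decode bytes, preferring an explicitly supplied charset.

    Hypothetical helper: a fixed list of candidate encodings stands in
    for the charset_normalizer detection step described above.
    """
    if charset:
        # Mirrors "respects stream_info.charset if provided".
        return data.decode(charset)
    for encoding in fallbacks:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so this is only reached when the
    # caller supplies a fallback list that omits it.
    raise ValueError("none of the fallback encodings could decode the data")


print(decode_text("café".encode("latin-1")))
```

Trying UTF-8 before Latin-1 matters in this sketch because Latin-1 never fails, so it has to come last.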

Example

from markitdown.converters import PlainTextConverter
from markitdown._stream_info import StreamInfo

converter = PlainTextConverter()

with open("README.md", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".md"))
    print(result.markdown)  # Original content

Source

_plain_text_converter.py:33

CsvConverter

Converts CSV files to Markdown tables.

Accepted Formats

  • MIME Types: text/csv, application/csv
  • Extensions: .csv

Features

  • First row treated as header
  • Automatic column alignment
  • Handles missing cells
  • Character encoding detection

Example

from markitdown.converters import CsvConverter
from markitdown._stream_info import StreamInfo

converter = CsvConverter()

with open("data.csv", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".csv"))
    print(result.markdown)

Input CSV:
Name,Age,City
Alice,30,NYC
Bob,25,SF

Output:
| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
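
The table above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the converter's implementation: `csv_to_markdown` is a hypothetical name, and both padding short rows and dropping cells beyond the header width are assumptions about what "handles missing cells" means.

```python
import csv
import io
from typing import List


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table (hypothetical helper)."""
    rows: List[List[str]] = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in body:
        # Assumption: pad short rows with empty cells and drop any cells
        # beyond the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Name,Age,City\nAlice,30,NYC\nBob,25,SF\n"))
```

Treating the first row as the header fixes the column count, which is what lets every later row be normalized to the same width.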

Source

_csv_converter.py:15

IpynbConverter

Converts Jupyter Notebook (.ipynb) files to Markdown.

Accepted Formats

  • MIME Types: application/json (if contains nbformat)
  • Extensions: .ipynb

Features

  • Markdown cells preserved as-is
  • Code cells wrapped in ```python blocks
  • Raw cells wrapped in ``` blocks
  • Extracts title from first H1 heading or notebook metadata
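
The per-cell rules above might look like this when applied to raw notebook JSON. It is a sketch under assumptions: `notebook_to_markdown` is a hypothetical helper, cells are joined with blank lines, and title extraction is left out.

```python
import json
from typing import List

FENCE = "`" * 3  # built programmatically so the fence marker never appears literally


def notebook_to_markdown(raw: str) -> str:
    """Apply the cell rules listed above to notebook JSON (hypothetical helper)."""
    notebook = json.loads(raw)
    parts: List[str] = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat stores "source" either as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        kind = cell.get("cell_type")
        if kind == "markdown":
            parts.append(text)  # preserved as-is
        elif kind == "code":
            parts.append(f"{FENCE}python\n{text}\n{FENCE}")
        elif kind == "raw":
            parts.append(f"{FENCE}\n{text}\n{FENCE}")
    # Assumption: cells are separated by a blank line.
    return "\n\n".join(parts)


example = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro"]},
        {"cell_type": "code", "source": "x = 1"},
    ],
}
print(notebook_to_markdown(json.dumps(example)))
```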

Example

from markitdown.converters import IpynbConverter
from markitdown._stream_info import StreamInfo

converter = IpynbConverter()

with open("analysis.ipynb", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".ipynb"))
    print(result.markdown)
    print(f"Title: {result.title}")

Output:
# Data Analysis

This notebook analyzes customer data.

```python
import pandas as pd
df = pd.read_csv('data.csv')
df.head()
```

## Results

The analysis shows…

Source

_ipynb_converter.py:15

ZipConverter

Converts ZIP archives by extracting and converting all contained files.

Accepted Formats

  • MIME Types: application/zip
  • Extensions: .zip

Constructor

def __init__(self, *, markitdown: MarkItDown)
Requires: MarkItDown instance to process extracted files

Features

  • Recursively converts all files in archive
  • Each file presented under ## File: path/to/file.ext heading
  • Unsupported files silently skipped
  • Maintains directory structure in headings
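
The archive walk described above can be sketched with the standard `zipfile` module. `zip_to_markdown`, `convert_member`, and `text_only` are hypothetical: the callback stands in for the MarkItDown instance and returns `None` for files it cannot handle. The sketch does not recurse into nested archives.

```python
import io
import zipfile
from typing import Callable, Optional


def zip_to_markdown(
    data: bytes,
    name: str,
    convert_member: Callable[[str, bytes], Optional[str]],
) -> str:
    """Walk a ZIP archive and emit one section per converted member."""
    out = [f"Content from the zip file `{name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            markdown = convert_member(info.filename, archive.read(info))
            if markdown is None:
                continue  # unsupported files are skipped without an error
            # The full member path keeps the directory structure visible.
            out.append(f"## File: {info.filename}")
            out.append(markdown)
    return "\n\n".join(out)


def text_only(path: str, payload: bytes) -> Optional[str]:
    """Example callback: convert .txt members and skip everything else."""
    return payload.decode("utf-8") if path.endswith(".txt") else None


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/readme.txt", "Welcome to the project...")
    archive.writestr("img/logo.bin", b"\x00\x01")
print(zip_to_markdown(buffer.getvalue(), "archive.zip", text_only))
```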

Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.markdown)

Output:
Content from the zip file `archive.zip`:

## File: docs/readme.txt

Welcome to the project...

## File: data/report.xlsx

## Sheet1
| Column1 | Column2 |
| --- | --- |
| Data1 | Data2 |

Source

_zip_converter.py:22

EpubConverter

Converts EPUB ebooks to Markdown, preserving chapter structure and metadata.

Accepted Formats

  • MIME Types: application/epub, application/epub+zip, application/x-epub+zip
  • Extensions: .epub

Features

  • Extracts metadata (title, authors, publisher, description, etc.)
  • Converts chapters in spine order
  • Preserves chapter structure
  • Extends HtmlConverter for content conversion
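
"Spine order" refers to the reading order declared in the EPUB package document (the OPF file). A minimal sketch of resolving that order is shown below. `spine_hrefs` is a hypothetical helper, and it assumes the OPF XML has already been read out of the archive.

```python
import xml.etree.ElementTree as ET
from typing import List

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml: str) -> List[str]:
    """Return chapter file paths in reading (spine) order."""
    root = ET.fromstring(opf_xml)
    # The manifest maps item ids to file paths inside the archive.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    ordered = []
    # The spine lists item ids in the order they should be read.
    for ref in root.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(ref.get("idref"))
        if href is not None:
            ordered.append(href)
    return ordered


sample_opf = (
    '<package xmlns="http://www.idpf.org/2007/opf">'
    '<manifest>'
    '<item id="c2" href="ch2.xhtml"/>'
    '<item id="c1" href="ch1.xhtml"/>'
    '</manifest>'
    '<spine><itemref idref="c1"/><itemref idref="c2"/></spine>'
    '</package>'
)
print(spine_hrefs(sample_opf))
```

Note that the manifest order and the spine order can differ, which is why the chapters are looked up through the spine instead of being read in manifest order.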

Example

from markitdown.converters import EpubConverter
from markitdown._stream_info import StreamInfo

converter = EpubConverter()

with open("book.epub", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".epub"))
    print(result.markdown)

Output:
**Title:** The Great Novel
**Authors:** Jane Doe, John Smith
**Publisher:** Example Press
**Date:** 2024-01-15
**Description:** A gripping tale of...

# Chapter 1

It was a dark and stormy night...

Source

_epub_converter.py:26

RssConverter

Converts RSS and Atom feeds to Markdown.

Accepted Formats

  • MIME Types: application/rss+xml, application/atom+xml, text/xml, application/xml
  • Extensions: .rss, .atom, .xml

Features

  • Supports RSS 2.0 and Atom formats
  • Extracts feed title, description, and items
  • HTML content in descriptions converted to Markdown
  • Preserves publication dates
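
For the RSS 2.0 case, the extraction listed above could be approximated with `xml.etree.ElementTree`. This sketch makes several assumptions: `rss_to_markdown` is a hypothetical helper, Atom feeds are not handled, and item descriptions are emitted as plain text without HTML-to-Markdown conversion.

```python
import xml.etree.ElementTree as ET


def rss_to_markdown(xml_text: str) -> str:
    """Render an RSS 2.0 feed as Markdown (no HTML conversion)."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: missing <channel>")
    lines = [f"# {channel.findtext('title', default='').strip()}"]
    description = channel.findtext("description", default="").strip()
    if description:
        lines.append(description)
    for item in channel.findall("item"):
        lines.append("")  # blank line between entries
        lines.append(f"## {item.findtext('title', default='').strip()}")
        published = item.findtext("pubDate")
        if published:
            # The date string is kept exactly as the feed provides it.
            lines.append(f"Published on: {published.strip()}")
        summary = item.findtext("description")
        if summary:
            lines.append(summary.strip())
    return "\n".join(lines)


feed = (
    "<rss version='2.0'><channel>"
    "<title>Tech News Blog</title>"
    "<description>Daily technology news and updates</description>"
    "<item><title>Cloud Computing Trends</title>"
    "<description>The latest trends in cloud infrastructure...</description>"
    "</item></channel></rss>"
)
print(rss_to_markdown(feed))
```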

Example

from markitdown.converters import RssConverter
from markitdown._stream_info import StreamInfo

converter = RssConverter()

with open("feed.rss", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".rss"))
    print(result.markdown)

Output:
# Tech News Blog
Daily technology news and updates

## New AI Breakthrough Announced
Published on: Thu, 15 Feb 2024 10:30:00 GMT
Researchers have developed a new...

## Cloud Computing Trends
Published on: Wed, 14 Feb 2024 14:00:00 GMT
The latest trends in cloud infrastructure...

Source

_rss_converter.py:29

WikipediaConverter

Specialized converter for Wikipedia pages, extracting main article content.

Accepted Formats

  • MIME Types: text/html, application/xhtml
  • Extensions: .html, .htm
  • URL Pattern: https://*.wikipedia.org/*

Features

  • Extracts only main content (#mw-content-text)
  • Removes navigation, sidebars, and footer
  • Preserves article title
  • Extends HtmlConverter

Example

import io

import requests
from markitdown.converters import WikipediaConverter
from markitdown._stream_info import StreamInfo

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)

converter = WikipediaConverter()
# convert() takes a binary stream, as in the other examples on this page
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)

Source

_wikipedia_converter.py:20

YouTubeConverter

Extracts YouTube video metadata, description, and transcript.

Accepted Formats

  • URL Pattern: https://www.youtube.com/watch?v=*
  • MIME Types: text/html, application/xhtml
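
The URL pattern above can be checked, and the video ID pulled out, with `urllib.parse`. This is only an illustration: `youtube_video_id` is a hypothetical helper, and how the converter itself matches URLs is not shown in this reference.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the v= parameter of a YouTube watch URL, or None if it does not match."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "www.youtube.com":
        return None
    if parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


print(youtube_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```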

Dependencies

Optional: youtube-transcript-api for transcripts

Features

  • Extracts video title, views, keywords, runtime
  • Retrieves description from page metadata
  • Downloads transcript (if available)
  • Supports multiple languages

Parameters

  • youtube_transcript_languages (list): Languages to try for the transcript (e.g., ["en", "es"]). Defaults to detected languages.

Example

import io

import requests
from markitdown.converters import YouTubeConverter
from markitdown._stream_info import StreamInfo

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
response = requests.get(url)

converter = YouTubeConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html"),
    youtube_transcript_languages=["en"]
)
print(result.markdown)

Output:
# YouTube

## Video Title

### Video Metadata
- **Views:** 1234567890
- **Runtime:** PT3M33S

### Description
Official music video...

### Transcript
We're no strangers to love. You know the rules and so do I...

Source

_youtube_converter.py:37

BingSerpConverter

Extracts organic search results from Bing search results pages.

Accepted Formats

  • URL Pattern: https://www.bing.com/search?q=*
  • MIME Types: text/html, application/xhtml

Features

  • Extracts search query
  • Parses organic results (.b_algo class)
  • Decodes redirect URLs
  • Converts HTML descriptions to Markdown
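
As an illustration of the redirect-decoding step, the sketch below recovers a target URL from a wrapped link. The link format is an assumption not stated in this reference: the target is taken to sit in a `u` query parameter as a two-character prefix followed by URL-safe base64 with the padding stripped. `decode_redirect` is a hypothetical helper.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_redirect(href: str) -> str:
    """Recover the target of a redirect link, or return href unchanged.

    Assumed format: the "u" query parameter holds a two-character prefix
    followed by URL-safe base64 without padding.
    """
    values = parse_qs(urlparse(href).query).get("u")
    if not values or len(values[0]) <= 2:
        return href
    payload = values[0][2:]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    try:
        return base64.urlsafe_b64decode(payload).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Not decodable under the assumed format; leave the link untouched.
        return href


print(decode_redirect("https://www.python.org/"))
```

Returning the original link on any failure keeps the result usable even when a link does not follow the assumed format.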

Example

import io

import requests
from markitdown.converters import BingSerpConverter
from markitdown._stream_info import StreamInfo

url = "https://www.bing.com/search?q=python+programming"
response = requests.get(url)

converter = BingSerpConverter()
result = converter.convert(
    io.BytesIO(response.content),
    StreamInfo(url=url, mimetype="text/html")
)
print(result.markdown)

Output:
## A Bing search for 'python programming' found the following results:

[Python.org](https://www.python.org)
Welcome to Python.org - The official home of the Python programming language...

[Python Tutorial](https://docs.python.org/tutorial/)
The Python Tutorial — Python 3.12 documentation...

Source

_bing_serp_converter.py:23

OutlookMsgConverter

Converts Outlook .msg email files to Markdown.

Dependencies

pip install markitdown[outlook]
Requires: olefile

Accepted Formats

  • MIME Types: application/vnd.ms-outlook
  • Extensions: .msg

Features

  • Extracts email headers (From, To, Subject)
  • Retrieves email body content
  • Handles text encoding (UTF-16, UTF-8)

Example

from markitdown.converters import OutlookMsgConverter
from markitdown._stream_info import StreamInfo

converter = OutlookMsgConverter()

with open("message.msg", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".msg"))
    print(result.markdown)

Output:
# Email Message

**From:** john.doe@example.com
**To:** jane.smith@example.com
**Subject:** Project Update

## Content

Hi Jane,

Here's the latest update on the project...

Source

_outlook_msg_converter.py:24

DocumentIntelligenceConverter

Uses Azure Document Intelligence (formerly Form Recognizer) for advanced OCR and document understanding.

Dependencies

pip install markitdown[az-doc-intel]
Requires: azure-ai-documentintelligence, azure-identity

Accepted Formats

  • DOCX, PPTX, XLSX (without OCR)
  • PDF, JPEG, PNG, BMP, TIFF (with OCR)
  • HTML

Constructor

def __init__(
    self,
    *,
    endpoint: str,
    api_version: str = "2024-07-31-preview",
    credential: AzureKeyCredential | TokenCredential | None = None,
    file_types: List[DocumentIntelligenceFileType] = [...]
)
Parameters:
  • endpoint (str, required): Azure Document Intelligence endpoint URL.
  • api_version (str, default: "2024-07-31-preview"): API version to use.
  • credential (AzureKeyCredential | TokenCredential): Authentication credential. If omitted, the converter uses the AZURE_API_KEY environment variable when it is set and DefaultAzureCredential() otherwise.
  • file_types (list): File types to accept. Values come from the DocumentIntelligenceFileType enum.

Features

  • Advanced OCR with high resolution support
  • Formula extraction from documents
  • Font style detection
  • Layout analysis
  • Native Markdown output

Example

from markitdown.converters import DocumentIntelligenceConverter
from markitdown._stream_info import StreamInfo
from azure.core.credentials import AzureKeyCredential

converter = DocumentIntelligenceConverter(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("your-api-key")
)

with open("scanned_document.pdf", "rb") as f:
    result = converter.convert(f, StreamInfo(extension=".pdf"))
    print(result.markdown)

File Types

from markitdown.converters import DocumentIntelligenceFileType

# Available file types:
DocumentIntelligenceFileType.DOCX
DocumentIntelligenceFileType.PPTX  
DocumentIntelligenceFileType.XLSX
DocumentIntelligenceFileType.HTML
DocumentIntelligenceFileType.PDF
DocumentIntelligenceFileType.JPEG
DocumentIntelligenceFileType.PNG
DocumentIntelligenceFileType.BMP
DocumentIntelligenceFileType.TIFF

Analysis Features

For OCR-supported formats (PDF, images):
  • FORMULAS - Extract mathematical formulas
  • OCR_HIGH_RESOLUTION - High-quality OCR
  • STYLE_FONT - Font style information

Source

_doc_intel_converter.py:130

Authentication

Supports multiple authentication methods:
  1. API Key (via AzureKeyCredential)
  2. Environment Variable (AZURE_API_KEY)
  3. Managed Identity (via DefaultAzureCredential)

# Using environment variable
import os
from markitdown.converters import DocumentIntelligenceConverter

os.environ["AZURE_API_KEY"] = "your-key"
converter = DocumentIntelligenceConverter(endpoint="https://...")

# Using managed identity
from azure.identity import DefaultAzureCredential
converter = DocumentIntelligenceConverter(
    endpoint="https://...",
    credential=DefaultAzureCredential()
)