HtmlConverter

Overview

The HtmlConverter class converts HTML and XHTML documents to Markdown. It uses BeautifulSoup for parsing and a custom Markdownify implementation for conversion. It is also used as a base class by other converters (DOCX, PPTX, XLSX/XLS, EPUB) that produce HTML as an intermediate format.

Dependencies

Included in base install: beautifulsoup4, markdownify

Accepted Formats

MIME Types

text/html
application/xhtml*

Extensions

.html
.htm

Class Definition

class HtmlConverter(DocumentConverter):
    """Converts HTML content to Markdown."""

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool

Returns True for HTML/XHTML files based on extension or MIME type.
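
The check described above can be sketched roughly as follows. This is an illustration rather than the library's source: the constant names and the standalone function looks_like_html are hypothetical, and it assumes the MIME type is matched by prefix, which is what the application/xhtml* pattern suggests.

```python
from typing import Optional

# Hypothetical constants mirroring the Accepted Formats lists above.
ACCEPTED_MIME_PREFIXES = ("text/html", "application/xhtml")
ACCEPTED_EXTENSIONS = (".html", ".htm")


def looks_like_html(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True when the extension or MIME type indicates HTML/XHTML."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # str.startswith accepts a tuple of prefixes.
    return mime.startswith(ACCEPTED_MIME_PREFIXES)
```

Under these assumptions, either signal is enough on its own, so a stream with no extension but a MIME type of application/xhtml+xml would still be accepted.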

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult

Converts HTML from a file stream to Markdown.

Parameters:

file_stream (BinaryIO, required) - Binary stream of the HTML file
stream_info (StreamInfo, required) - Metadata about the file (must include charset if not UTF-8)

Returns: DocumentConverterResult with Markdown and extracted title

convert_string()

def convert_string(
    self,
    html_content: str,
    *,
    url: Optional[str] = None,
    **kwargs
) -> DocumentConverterResult

Convenience method that converts an HTML string directly to Markdown.

Parameters:

html_content (str, required) - HTML content as a string
url (str, optional) - URL of the HTML content (used in StreamInfo)

Returns: DocumentConverterResult with Markdown
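
One plausible way for a string-based convenience method to reuse a stream-based convert() is to encode the string and wrap it in an in-memory stream. The sketch below shows only that wrapping step. FakeStreamInfo and wrap_html_string are hypothetical names introduced for illustration, and the field set is an assumption based on the metadata mentioned on this page.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional, Tuple


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo with only the fields used here."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


def wrap_html_string(
    html_content: str, url: Optional[str] = None
) -> Tuple[BinaryIO, FakeStreamInfo]:
    """Encode the string as UTF-8 and pair it with matching metadata."""
    stream = io.BytesIO(html_content.encode("utf-8"))
    info = FakeStreamInfo(
        mimetype="text/html", extension=".html", charset="utf-8", url=url
    )
    return stream, info
```

The resulting pair has the same shape as the arguments convert() expects, which is why a string entry point can stay a thin wrapper.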

Features

HTML Cleaning

Before conversion, the HTML is cleaned:

Script Removal - All <script> tags removed
Style Removal - All <style> tags removed
Body Extraction - Only <body> content processed (if present)
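
To make the effect of script and style removal concrete, here is a small sketch that uses only the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique for illustration, not how the converter itself works. VisibleTextCollector and visible_text are hypothetical names, and the sketch does not perform body extraction.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextCollector(HTMLParser):
    """Collect text while skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text of the document with script/style contents dropped."""
    collector = VisibleTextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)
```

Without this cleaning step, CSS rules and JavaScript source would leak into the Markdown output as plain text.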

Converted Elements

Headings - <h1>-<h6> → # headings
Paragraphs - <p> → Paragraphs with blank lines
Lists - <ul>, <ol> → Markdown lists
Links - <a> → [text](url)
Images - <img> → ![alt](src)
Tables - <table> → Markdown tables
Code - <code>, <pre> → Inline/block code
Emphasis - <em>, <strong> → *italic*, **bold**
Blockquotes - <blockquote> → > quotes
Horizontal Rules - <hr> → ---
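
The tag-to-Markdown mapping can be illustrated with a deliberately tiny converter that handles only headings, paragraphs, and emphasis. This is a sketch built on the standard library's html.parser, not the markdownify-based implementation the converter uses. TinyMarkdownSketch, heading_level, and tiny_convert are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


def heading_level(tag: str) -> int:
    """Return 1-6 for h1-h6, or 0 for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return 0


class TinyMarkdownSketch(HTMLParser):
    """Map a handful of tags to Markdown: h1-h6, p, strong, em."""

    INLINE = {"strong": "**", "em": "*"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        level = heading_level(tag)
        if level:
            # ATX style: one '#' per heading level.
            self.parts.append("#" * level + " ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "p" or heading_level(tag):
            # Block elements are separated by a blank line.
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def tiny_convert(html: str) -> str:
    """Convert a small HTML fragment to Markdown using the rules above."""
    parser = TinyMarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

A real converter also has to handle nesting, whitespace normalization, escaping, links, lists, and tables, which is why HtmlConverter delegates this work to a Markdownify subclass.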

Title Extraction

Extracted from <title> tag and returned in DocumentConverterResult.title.
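
For readers without BeautifulSoup at hand, title extraction can be approximated with the standard library. The sketch below is an illustration under that assumption, not the converter's own code. TitleFinder and find_title are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleFinder(HTMLParser):
    """Capture the text of the first <title> element, if any."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def find_title(html: str) -> Optional[str]:
    """Return the document title, or None when there is no <title> text."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title
```

As with DocumentConverterResult.title, callers should be prepared for None when the document has no title.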

Character Encoding

Respects stream_info.charset, defaults to UTF-8:

encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
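
The reason a non-UTF-8 file needs an explicit charset can be shown with plain bytes decoding. This sketch covers only the defaulting rule from the snippet above. It does not reproduce BeautifulSoup's own encoding handling, and decode_html is a hypothetical helper.

```python
from typing import Optional


def decode_html(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode bytes with the declared charset, defaulting to UTF-8."""
    encoding = "utf-8" if charset is None else charset
    return raw.decode(encoding)


latin1_bytes = "café".encode("latin-1")

# With the correct charset the text round-trips.
print(decode_html(latin1_bytes, "latin-1"))

# With the UTF-8 default, strict decoding of the same bytes fails.
try:
    decode_html(latin1_bytes)
except UnicodeDecodeError as exc:
    print(f"decode failed: {exc.reason}")
```

In practice this means callers should populate stream_info.charset whenever the source encoding is known and is not UTF-8.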

Example Usage

From File

from markitdown.converters import HtmlConverter
from markitdown._stream_info import StreamInfo

converter = HtmlConverter()

with open("page.html", "rb") as f:
    stream_info = StreamInfo(
        extension=".html",
        charset="utf-8"
    )
    result = converter.convert(f, stream_info)
    print(result.markdown)
    print(f"Title: {result.title}")

From String

html = """
<!DOCTYPE html>
<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Welcome</h1>
    <p>This is a <strong>sample</strong> HTML page.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""

converter = HtmlConverter()
result = converter.convert_string(html, url="https://example.com/page.html")
print(result.markdown)

Output:

# Welcome

This is a **sample** HTML page.

- Item 1
- Item 2

With Tables

html_table = """
<table>
    <thead>
        <tr><th>Name</th><th>Age</th><th>City</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice</td><td>30</td><td>NYC</td></tr>
        <tr><td>Bob</td><td>25</td><td>SF</td></tr>
    </tbody>
</table>
"""

result = converter.convert_string(html_table)
print(result.markdown)

Output:

| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
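
The shape of that output follows a simple rule: a header row, a separator row with one --- per column, then one row per record. The sketch below renders the same table from plain Python lists. render_markdown_table is a hypothetical helper, not part of the converter, and it assumes cells contain no pipe characters or newlines.

```python
from typing import List, Sequence


def render_markdown_table(
    header: Sequence[str], rows: Sequence[Sequence[str]]
) -> str:
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines: List[str] = []
    lines.append("| " + " | ".join(header) + " |")
    # The separator row needs one cell per header column.
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


print(
    render_markdown_table(
        ["Name", "Age", "City"],
        [["Alice", "30", "NYC"], ["Bob", "25", "SF"]],
    )
)
```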

Usage as Base Class

Many converters extend HtmlConverter to leverage HTML-to-Markdown conversion:

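import mammoth

class DocxConverter(HtmlConverter):
    def __init__(self):
        super().__init__()
        # Internal HtmlConverter instance that performs the HTML-to-Markdown step
        self._html_converter = HtmlConverter()

    def convert(self, file_stream, stream_info, **kwargs):
        # Convert DOCX to HTML using mammoth
        html_content = mammoth.convert_to_html(file_stream).value

        # Convert the intermediate HTML to Markdown
        return self._html_converter.convert_string(html_content, **kwargs)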

Used by:

DocxConverter - Word documents via mammoth
PptxConverter - Tables from PowerPoint
XlsxConverter / XlsConverter - Excel tables via pandas HTML
EpubConverter - EPUB content files

Implementation Details

Source Location

~/workspace/source/packages/markitdown/src/markitdown/converters/_html_converter.py:20

Conversion Pipeline

Parse HTML

soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

Clean Content

for script in soup(["script", "style"]):
    script.extract()

Extract Body

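body_elm = soup.find("body")
# Fall back to the whole document when there is no <body>
target = body_elm if body_elm is not None else soup
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(target)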

Extract Title

title = None if soup.title is None else soup.title.string

Custom Markdownify

Uses _CustomMarkdownify class (from _markdownify.py) which extends the markdownify library with custom conversion rules for better Markdown output quality.

Advanced Options

The converter accepts options passed to _CustomMarkdownify:

result = converter.convert(
    file_stream,
    stream_info,
    heading_style="ATX",      # Use # for headings (default)
    bullets="-",              # Use - for bullets
    code_language="",         # Default language for code blocks
    strip=["script", "style"] # Additional tags to strip
)

Use Cases

Web Scraping

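import requests
from markitdown.converters import HtmlConverter

url = "https://example.com/article"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors

converter = HtmlConverter()
result = converter.convert_string(
    response.text,
    url=url
)
print(result.markdown)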

Documentation Conversion

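from pathlib import Path
from markitdown.converters import HtmlConverter
from markitdown._stream_info import StreamInfo

converter = HtmlConverter()

# Convert HTML docs to Markdown, writing each .md file next to its source
for html_file in Path("docs").glob("**/*.html"):
    with open(html_file, "rb") as f:
        result = converter.convert(
            f,
            StreamInfo(extension=".html")
        )
    md_file = html_file.with_suffix(".md")
    md_file.write_text(result.markdown, encoding="utf-8")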

Email Conversion

# Convert HTML email to readable Markdown
html_email = get_email_html()  # Placeholder: supply the HTML body from your email client
result = converter.convert_string(html_email)
print(result.markdown)

Limitations

Complex CSS layouts not preserved
JavaScript-rendered content not processed
Forms and interactive elements lost
Nested tables may have formatting issues
Some HTML5 semantic elements treated as divs
Embedded media (video, audio) becomes links only
