Overview

The HtmlConverter class converts HTML and XHTML documents to Markdown. It uses BeautifulSoup for parsing and a custom Markdownify implementation for conversion. This converter is also used as a base class for other converters (DOCX, PPTX, EPUB) that produce HTML as an intermediate format.

Dependencies

Included in base install: beautifulsoup4, markdownify

Accepted Formats

MIME Types
  • text/html
  • application/xhtml*
Extensions
  • .html
  • .htm

Class Definition

class HtmlConverter(DocumentConverter):
    """Converts HTML content to Markdown."""

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool
Returns True for HTML/XHTML files based on extension or MIME type.
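The check described above can be sketched as follows. This is a minimal illustration built only from the Accepted Formats list, not the library's code; the function name looks_like_html and its two string arguments are assumptions standing in for the fields of StreamInfo.

```python
from typing import Optional

# Values taken from the "Accepted Formats" list above.
ACCEPTED_MIME_PREFIXES = ("text/html", "application/xhtml")
ACCEPTED_EXTENSIONS = (".html", ".htm")


def looks_like_html(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical stand-in for the decision accepts() makes."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match, so "application/xhtml+xml" satisfies "application/xhtml*".
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_PREFIXES)


print(looks_like_html(".htm", None))                   # True
print(looks_like_html(None, "application/xhtml+xml"))  # True
print(looks_like_html(".pdf", "application/pdf"))      # False
```

Either signal is enough on its own, which is why a stream with a missing extension can still be accepted through its MIME type.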

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult
Converts HTML from a file stream to Markdown.
Parameters:
  • file_stream (BinaryIO, required) - Binary stream of the HTML file
  • stream_info (StreamInfo, required) - Metadata about the file (must include charset if not UTF-8)
Returns: DocumentConverterResult with Markdown and extracted title

convert_string()

def convert_string(
    self,
    html_content: str,
    *,
    url: Optional[str] = None,
    **kwargs
) -> DocumentConverterResult
Convenience method that converts an HTML string directly to Markdown.
Parameters:
  • html_content (str, required) - HTML content as a string
  • url (str, optional) - URL of the HTML content (used in StreamInfo)
Returns: DocumentConverterResult with Markdown
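One plausible way such a convenience method can work is to encode the string as UTF-8 bytes and hand it to convert() together with matching metadata. The sketch below shows only that wrapping step; StreamInfoSketch and wrap_html_string are hypothetical names, and the field set is an assumption based on the attributes this page mentions (extension, charset, url) plus a MIME type.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional, Tuple


@dataclass
class StreamInfoSketch:
    """Hypothetical stand-in for StreamInfo; field names are assumptions."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


def wrap_html_string(
    html_content: str, url: Optional[str] = None
) -> Tuple[BinaryIO, StreamInfoSketch]:
    """Encode the string as UTF-8 and describe it so a stream-based convert() could read it."""
    stream = io.BytesIO(html_content.encode("utf-8"))
    info = StreamInfoSketch(
        mimetype="text/html", extension=".html", charset="utf-8", url=url
    )
    return stream, info


stream, info = wrap_html_string("<p>caf\u00e9</p>", url="https://example.com/page.html")
print(stream.read().decode(info.charset))
print(info.url)
```

Because the string is always encoded as UTF-8 here, the charset recorded in the metadata is known in advance and the caller never has to supply it.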

Features

HTML Cleaning

Before conversion, the HTML is cleaned:
  1. Script Removal - All <script> tags removed
  2. Style Removal - All <style> tags removed
  3. Body Extraction - Only <body> content processed (if present)
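The three cleaning steps can be illustrated without BeautifulSoup. The sketch below swaps in Python's standard html.parser to show the same idea: drop <script> and <style> subtrees, then keep only what sits inside <body> when one exists. The names _Cleaner and clean_html are hypothetical, and this is not how the converter itself is implemented.

```python
from html.parser import HTMLParser


class _Cleaner(HTMLParser):
    """Re-emits markup while skipping <script> and <style> subtrees."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.all_parts = []   # everything except skipped subtrees
        self.body_parts = []  # the same, restricted to content inside <body>
        self.saw_body = False
        self._in_body = False
        self._skip_depth = 0

    def _emit(self, text: str) -> None:
        if self._skip_depth:
            return
        self.all_parts.append(text)
        if self._in_body:
            self.body_parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "body":
            self.saw_body = self._in_body = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to track.
        if tag not in self.SKIPPED and tag != "body":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "body":
            self._in_body = False
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def clean_html(html: str) -> str:
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    # Fall back to the whole document when there is no <body>.
    parts = cleaner.body_parts if cleaner.saw_body else cleaner.all_parts
    return "".join(parts)


page = (
    "<html><head><style>p{}</style><title>T</title></head>"
    "<body><p>Hi</p><script>alert(1)</script></body></html>"
)
print(clean_html(page))  # <p>Hi</p>
```

Comments and the doctype are dropped as a side effect, since the sketch does not re-emit them.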

Converted Elements

  • Headings - <h1>-<h6> → # headings
  • Paragraphs - <p> → Paragraphs with blank lines
  • Lists - <ul>, <ol> → Markdown lists
  • Links - <a> → [text](url)
  • Images - <img> → ![alt](src)
  • Tables - <table> → Markdown tables
  • Code - <code>, <pre> → Inline/block code
  • Emphasis - <em>, <strong> → *italic*, **bold**
  • Blockquotes - <blockquote> → > quotes
  • Horizontal Rules - <hr> → ---
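To make a few of these mappings concrete, here is a deliberately tiny converter written with the standard library only. It is a toy stand-in, not _CustomMarkdownify or markdownify: it covers headings, paragraphs, list items, emphasis, and links, and ignores everything else in the list above. TinyMarkdown and html_to_markdown are hypothetical names.

```python
import re
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter: headings, <p>, <li>, <strong>/<b>, <em>/<i>, and <a> only."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self) -> None:
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            # <h2> becomes "## ", and so on.
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in ("p", "ul", "ol"):
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a" and self._href is not None:
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Collapse runs of whitespace the way a browser would.
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    # Keep at most one blank line between blocks.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


sample = (
    "<h1>Welcome</h1>"
    '<p>A <strong>bold</strong> and <em>italic</em> <a href="https://example.com">link</a>.</p>'
    "<ul><li>Item 1</li><li>Item 2</li></ul>"
)
print(html_to_markdown(sample))
```

Tables, images, code blocks, blockquotes, and nested lists need considerably more state than this, which is the work the real converter hands to its Markdownify subclass.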

Title Extraction

The page title is read from the <title> tag and returned in DocumentConverterResult.title. It is None when the document has no <title>.
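The behavior can be mimicked with the standard library to show what "title or None" means in practice. The sketch below uses html.parser rather than BeautifulSoup, so edge cases may differ from the converter; _TitleFinder and extract_title are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleFinder(HTMLParser):
    """Collects the text of the first <title> element, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> Optional[str]:
    finder = _TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title


print(extract_title("<html><head><title>Sample Page</title></head><body></body></html>"))
print(extract_title("<p>No title here</p>"))
```

Callers should therefore treat the title as optional, for example by falling back to a file name when it is None.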

Character Encoding

Respects stream_info.charset, defaults to UTF-8:
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
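The reason the charset matters can be seen with plain byte decoding. This sketch applies the same "declared charset, else UTF-8" rule using strict decoding. BeautifulSoup may treat from_encoding more leniently, so take it as an illustration of the fallback rule only; decode_html is a hypothetical helper.

```python
from typing import Optional


def decode_html(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode HTML bytes with the declared charset, falling back to UTF-8."""
    encoding = "utf-8" if charset is None else charset
    return raw.decode(encoding)


latin1_bytes = "<p>caf\u00e9</p>".encode("latin-1")

# With the charset declared, the accented character survives.
print(decode_html(latin1_bytes, charset="latin-1"))

# Without it, the UTF-8 default cannot interpret the Latin-1 byte.
try:
    decode_html(latin1_bytes)
except UnicodeDecodeError:
    print("charset is required for non-UTF-8 input")
```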

Example Usage

From File

from markitdown.converters import HtmlConverter
from markitdown._stream_info import StreamInfo

converter = HtmlConverter()

with open("page.html", "rb") as f:
    stream_info = StreamInfo(
        extension=".html",
        charset="utf-8"
    )
    result = converter.convert(f, stream_info)
    print(result.markdown)
    print(f"Title: {result.title}")

From String

html = """
<!DOCTYPE html>
<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Welcome</h1>
    <p>This is a <strong>sample</strong> HTML page.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""

converter = HtmlConverter()
result = converter.convert_string(html, url="https://example.com/page.html")
print(result.markdown)
Output:
# Welcome

This is a **sample** HTML page.

- Item 1
- Item 2

With Tables

html_table = """
<table>
    <thead>
        <tr><th>Name</th><th>Age</th><th>City</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice</td><td>30</td><td>NYC</td></tr>
        <tr><td>Bob</td><td>25</td><td>SF</td></tr>
    </tbody>
</table>
"""

result = converter.convert_string(html_table)
print(result.markdown)
Output:
| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
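The shape of that output is simple enough to reproduce by hand, which helps when checking converted tables in tests. The helper below is a hypothetical sketch of the pipe-table layout shown above, not the converter's table logic.

```python
from typing import List, Sequence


def rows_to_markdown(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a header and rows in the pipe-table layout shown above."""
    lines: List[str] = [
        "| " + " | ".join(header) + " |",
        # One "---" separator cell per header column.
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


print(rows_to_markdown(
    ["Name", "Age", "City"],
    [["Alice", "30", "NYC"], ["Bob", "25", "SF"]],
))
```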

Usage as Base Class

Several converters build on HtmlConverter for the HTML-to-Markdown step:
import mammoth

class DocxConverter(HtmlConverter):
    def __init__(self):
        super().__init__()
        # Plain instance used for the HTML-to-Markdown step
        self._html_converter = HtmlConverter()

    def convert(self, file_stream, stream_info, **kwargs):
        # Convert DOCX to HTML using mammoth
        html_content = mammoth.convert_to_html(file_stream).value

        # Delegate to the plain HtmlConverter instance
        return self._html_converter.convert_string(html_content, **kwargs)
Used by:
  • DocxConverter - Word documents via mammoth
  • PptxConverter - Tables from PowerPoint
  • XlsxConverter / XlsConverter - Excel tables via pandas HTML
  • EpubConverter - EPUB content files

Implementation Details

Source Location

~/workspace/source/packages/markitdown/src/markitdown/converters/_html_converter.py:20

Conversion Pipeline

  1. Parse HTML
    soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
    
  2. Clean Content
    for script in soup(["script", "style"]):
        script.extract()
    
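  3. Extract Body
    body_elm = soup.find("body")
    # Fall back to the whole document when there is no <body>
    root = body_elm if body_elm is not None else soup
    webpage_text = _CustomMarkdownify(**kwargs).convert_soup(root)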
    
  4. Extract Title
    title = None if soup.title is None else soup.title.string
    

Custom Markdownify

Uses _CustomMarkdownify class (from _markdownify.py) which extends the markdownify library with custom conversion rules for better Markdown output quality.

Advanced Options

The converter accepts options passed to _CustomMarkdownify:
result = converter.convert(
    file_stream,
    stream_info,
    heading_style="ATX",      # Use # for headings (default)
    bullets="-",              # Use - for bullets
    code_language="",         # Default language for code blocks
    strip=["script", "style"] # Additional tags to strip
)

Use Cases

Web Scraping

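import requests
from markitdown.converters import HtmlConverter

url = "https://example.com/article"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors

converter = HtmlConverter()
result = converter.convert_string(
    response.text,
    url=url
)
print(result.markdown)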

Documentation Conversion

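from pathlib import Path

from markitdown.converters import HtmlConverter
from markitdown._stream_info import StreamInfo

converter = HtmlConverter()

# Convert HTML docs to Markdown, writing each .md next to its source
for html_file in Path("docs").glob("**/*.html"):
    with open(html_file, "rb") as f:
        result = converter.convert(
            f,
            StreamInfo(extension=".html")
        )
    md_file = html_file.with_suffix(".md")
    md_file.write_text(result.markdown, encoding="utf-8")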

Email Conversion

# Convert HTML email to readable Markdown
html_email = get_email_html()  # Placeholder: supply the message's HTML body yourself
converter = HtmlConverter()
result = converter.convert_string(html_email)
print(result.markdown)

Limitations

  • Complex CSS layouts not preserved
  • JavaScript-rendered content not processed
  • Forms and interactive elements lost
  • Nested tables may have formatting issues
  • Some HTML5 semantic elements treated as divs
  • Embedded media (video, audio) becomes links only