Overview
The HtmlConverter class converts HTML and XHTML documents to Markdown. It uses BeautifulSoup for parsing and a custom Markdownify implementation for conversion. This converter is also used as a base class for other converters (DOCX, PPTX, EPUB) that produce HTML as an intermediate format.
Dependencies
Included in base install: beautifulsoup4, markdownify
Accepted MIME Types
text/html
application/xhtml*
Class Definition
class HtmlConverter(DocumentConverter):
    """Converts HTML content to Markdown."""
Methods
accepts()
def accepts(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool
Returns True for HTML/XHTML files based on extension or MIME type.
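The check described above can be sketched with the standard library alone. This is an illustration under assumptions, not the converter's actual code: `SketchStreamInfo`, `looks_like_html`, and both constant tuples are hypothetical names, the MIME prefixes are taken from the list in this page, and the extension list (`.html`, `.htm`) is a guess.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: prefixes mirror the MIME types listed in this page.
ACCEPTED_MIME_TYPE_PREFIXES = ("text/html", "application/xhtml")
# Assumption: the real extension list is not stated in this page.
ACCEPTED_FILE_EXTENSIONS = (".html", ".htm")


@dataclass
class SketchStreamInfo:
    """Hypothetical stand-in for StreamInfo with only the fields used here."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None


def looks_like_html(info: SketchStreamInfo) -> bool:
    """Return True when the extension or the MIME type suggests HTML/XHTML."""
    mimetype = (info.mimetype or "").lower()
    extension = (info.extension or "").lower()
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True
    # str.startswith accepts a tuple, so "application/xhtml+xml" matches too.
    return mimetype.startswith(ACCEPTED_MIME_TYPE_PREFIXES)
```

Matching on a prefix rather than an exact string is what lets a pattern such as application/xhtml* cover variants like application/xhtml+xml.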
convert()
def convert(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult
Converts HTML from a file stream to Markdown.
Parameters:
- file_stream: Binary stream of the HTML file
- stream_info: Metadata about the file (must include charset if not UTF-8)
Returns: DocumentConverterResult with Markdown and extracted title
convert_string()
def convert_string(
    self,
    html_content: str,
    *,
    url: Optional[str] = None,
    **kwargs
) -> DocumentConverterResult
Convenience method to convert HTML string directly to Markdown.
Parameters:
- html_content: HTML content as a string
- url: Optional URL for the HTML content (used in StreamInfo)
Returns: DocumentConverterResult with Markdown
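One plausible way such a convenience method can work is to wrap the string in a byte stream and hand it to convert() together with matching metadata. The sketch below only illustrates that wrapping. `SketchStreamInfo` and `SketchConverter` are hypothetical stand-ins, their field names are assumptions, and the toy convert() merely decodes the stream instead of producing Markdown.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class SketchStreamInfo:
    """Hypothetical stand-in for StreamInfo; field names are assumptions."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


class SketchConverter:
    """Toy converter that only demonstrates the string-to-stream wrapping."""

    def convert(self, file_stream: BinaryIO, stream_info: SketchStreamInfo) -> str:
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
        # A real converter would parse and convert here; this sketch just decodes.
        return file_stream.read().decode(encoding)

    def convert_string(self, html_content: str, *, url: Optional[str] = None) -> str:
        # Encode the string as UTF-8 and describe the stream with matching metadata.
        stream = io.BytesIO(html_content.encode("utf-8"))
        info = SketchStreamInfo(
            mimetype="text/html", extension=".html", charset="utf-8", url=url
        )
        return self.convert(stream, info)


round_trip = SketchConverter().convert_string(
    "<p>café</p>", url="https://example.com/page.html"
)
```

Declaring the charset explicitly in the metadata keeps the encode and decode steps symmetric, so non-ASCII text survives the round trip.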
Features
HTML Cleaning
Before conversion, the HTML is cleaned:
- Script Removal - All <script> tags removed
- Style Removal - All <style> tags removed
- Body Extraction - Only <body> content processed (if present)
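The three cleaning steps above can be tried in isolation with BeautifulSoup. This is a minimal sketch on a made-up input string; the variable names and the fallback to the whole document when no <body> exists are assumptions about how "if present" is handled.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title><style>p { color: red; }</style></head>"
    "<body><p>Hi</p><script>alert(1)</script></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Remove every <script> and <style> element, wherever it appears.
for tag in soup(["script", "style"]):
    tag.extract()

# Keep only the <body> when there is one (assumed fallback: the whole document).
body = soup.find("body")
root = body if body is not None else soup

cleaned = str(root)
```

Calling soup([...]) is shorthand for soup.find_all([...]), and extract() detaches each matched element from the tree.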
Converted Elements
- Headings - <h1>-<h6> → # headings
- Paragraphs - <p> → Paragraphs with blank lines
- Lists - <ul>, <ol> → Markdown lists
- Links - <a> → [text](url)
- Images - <img> → ![alt](src)
- Tables - <table> → Markdown tables
- Code - <code>, <pre> → Inline/block code
- Emphasis - <em>, <strong> → *italic*, **bold**
- Blockquotes - <blockquote> → > quotes
- Horizontal Rules - <hr> → ---
Title Extraction
The title is extracted from the <title> tag and returned in DocumentConverterResult.title.
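As a small illustration of title extraction, the sketch below reads the <title> element with BeautifulSoup and returns None when the document has none. `extract_title` is a hypothetical helper written for this example, not part of the converter's API.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(html: str) -> Optional[str]:
    """Return the text of the <title> element, or None if there is no title."""
    soup = BeautifulSoup(html, "html.parser")
    return None if soup.title is None else soup.title.string


with_title = extract_title(
    "<html><head><title>Sample Page</title></head><body></body></html>"
)
without_title = extract_title("<p>No head here</p>")
```

Callers should therefore be prepared for a missing title rather than assume a string is always returned.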
Character Encoding
Respects stream_info.charset, defaults to UTF-8:
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
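To see why the charset matters, the sketch below feeds Latin-1 bytes through the same fallback expression. `parse_with_charset` is a hypothetical helper that takes the charset directly instead of a StreamInfo object, and the sample strings are made up.

```python
import io
from typing import Optional

from bs4 import BeautifulSoup


def parse_with_charset(file_stream: io.BytesIO, charset: Optional[str]) -> BeautifulSoup:
    # Same fallback as above: use UTF-8 unless a charset was supplied.
    encoding = "utf-8" if charset is None else charset
    return BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)


# Bytes encoded as Latin-1 decode correctly once the charset is declared.
latin1_bytes = "<p>café</p>".encode("latin-1")
decoded = parse_with_charset(io.BytesIO(latin1_bytes), "latin-1").p.string

# With no charset given, UTF-8 input is handled by the default.
utf8_bytes = "<p>naïve</p>".encode("utf-8")
default_decoded = parse_with_charset(io.BytesIO(utf8_bytes), None).p.string
```

When the bytes are not UTF-8 and no charset is declared, the parser has to guess, which is why the metadata should carry the charset in that case.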
Example Usage
From File
from markitdown.converters import HtmlConverter
from markitdown._stream_info import StreamInfo
converter = HtmlConverter()
with open("page.html", "rb") as f:
    stream_info = StreamInfo(
        extension=".html",
        charset="utf-8"
    )
    result = converter.convert(f, stream_info)
print(result.markdown)
print(f"Title: {result.title}")
From String
html = """
<!DOCTYPE html>
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> HTML page.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</body>
</html>
"""
converter = HtmlConverter()
result = converter.convert_string(html, url="https://example.com/page.html")
print(result.markdown)
Output:
# Welcome
This is a **sample** HTML page.
- Item 1
- Item 2
With Tables
html_table = """
<table>
<thead>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
</thead>
<tbody>
<tr><td>Alice</td><td>30</td><td>NYC</td></tr>
<tr><td>Bob</td><td>25</td><td>SF</td></tr>
</tbody>
</table>
"""
result = converter.convert_string(html_table)
print(result.markdown)
Output:
| Name | Age | City |
| --- | --- | --- |
| Alice | 30 | NYC |
| Bob | 25 | SF |
Usage as Base Class
Many converters extend HtmlConverter to leverage HTML-to-Markdown conversion:
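import mammoth

class DocxConverter(HtmlConverter):
    def __init__(self):
        super().__init__()
        # Helper instance that performs the HTML-to-Markdown step
        self._html_converter = HtmlConverter()

    def convert(self, file_stream, stream_info, **kwargs):
        # Convert DOCX to HTML using mammoth
        html_content = mammoth.convert_to_html(file_stream).value
        # Delegate HTML-to-Markdown conversion to the HtmlConverter helper
        return self._html_converter.convert_string(html_content, **kwargs)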
Used by:
DocxConverter - Word documents via mammoth
PptxConverter - Tables from PowerPoint
XlsxConverter / XlsConverter - Excel tables via pandas HTML
EpubConverter - EPUB content files
Implementation Details
Source Location
~/workspace/source/packages/markitdown/src/markitdown/converters/_html_converter.py:20
Conversion Pipeline
- Parse HTML
  soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
- Clean Content
  for script in soup(["script", "style"]):
      script.extract()
- Extract Body
  body_elm = soup.find("body")
  # Fall back to the whole document when no <body> is present
  root = body_elm if body_elm is not None else soup
  webpage_text = _CustomMarkdownify(**kwargs).convert_soup(root)
- Extract Title
  title = None if soup.title is None else soup.title.string
Custom Markdownify
The converter uses the _CustomMarkdownify class (from _markdownify.py), which extends the markdownify library's converter with custom conversion rules for better Markdown output quality.
Advanced Options
The converter accepts options passed to _CustomMarkdownify:
result = converter.convert(
    file_stream,
    stream_info,
    heading_style="ATX",  # Use # for headings (default)
    bullets="-",  # Use - for bullets
    code_language="",  # Default language for code blocks
    strip=["script", "style"]  # Additional tags to strip
)
Use Cases
Web Scraping
import requests
url = "https://example.com/article"
response = requests.get(url)
converter = HtmlConverter()
result = converter.convert_string(
    response.text,
    url=url
)
print(result.markdown)
Documentation Conversion
from pathlib import Path

# Convert HTML docs to Markdown, writing each .md file next to its source
for html_file in Path("docs").glob("**/*.html"):
    with open(html_file, "rb") as f:
        result = converter.convert(
            f,
            StreamInfo(extension=".html")
        )
    md_file = html_file.with_suffix(".md")
    md_file.write_text(result.markdown, encoding="utf-8")
Email Conversion
# Convert HTML email to readable Markdown
html_email = get_email_html() # From email client
result = converter.convert_string(html_email)
print(result.markdown)
Limitations
- Complex CSS layouts not preserved
- JavaScript-rendered content not processed
- Forms and interactive elements lost
- Nested tables may have formatting issues
- Some HTML5 semantic elements treated as divs
- Embedded media (video, audio) becomes links only