Overview

The DocxConverter class converts Microsoft Word .docx files to Markdown, preserving style information (headings, bold, italic) and table structures. It extends HtmlConverter and works in two stages: mammoth converts the DOCX to HTML, and an HtmlConverter then converts that HTML to Markdown.

Dependencies

pip install 'markitdown[docx]'
Requires: mammoth

Accepted Formats

MIME Types
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document
Extensions
  • .docx

Class Definition

class DocxConverter(HtmlConverter):
    """Converts DOCX files to Markdown.
    
    Style information (e.g., headings) and tables are preserved where possible.
    """

Constructor

def __init__(self):
    super().__init__()
    self._html_converter = HtmlConverter()
Initializes the converter with an internal HtmlConverter instance.

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool
Returns True if the file has a .docx extension or the DOCX MIME type.
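
The check described above can be sketched without installing markitdown. This is only an illustration: FakeStreamInfo is a hypothetical stand-in for StreamInfo, and the case-insensitive comparison and prefix match on the MIME type are assumptions, not behavior confirmed by this page.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the "Accepted Formats" section.
DOCX_MIME_PREFIXES = [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
DOCX_EXTENSIONS = [".docx"]


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo with only the two fields used here."""

    extension: Optional[str] = None
    mimetype: Optional[str] = None


def looks_like_docx(info: FakeStreamInfo) -> bool:
    """Return True when the extension or the MIME type indicates DOCX."""
    # Assumption: both values are compared case-insensitively.
    extension = (info.extension or "").lower()
    mimetype = (info.mimetype or "").lower()
    if extension in DOCX_EXTENSIONS:
        return True
    # Assumption: a prefix match, so parameters after the MIME type are tolerated.
    return any(mimetype.startswith(prefix) for prefix in DOCX_MIME_PREFIXES)
```

Either signal alone is enough in this sketch, so a stream with no file name can still be accepted when its MIME type is known.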

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult
Converts a DOCX file to Markdown.
Parameters:
  • file_stream (BinaryIO, required) - Binary stream of the DOCX file.
  • stream_info (StreamInfo, required) - Metadata about the file (extension, MIME type, etc.).
  • style_map (str, optional) - Mammoth style map, passed as a keyword argument through **kwargs. Maps DOCX styles to specific HTML elements, which in turn determine the Markdown output.
Returns: DocumentConverterResult containing the converted Markdown.
Raises: MissingDependencyException if mammoth is not installed.

Features

Preserved Elements

  • Headings - Heading 1 through Heading 6 styles converted to Markdown headings (# through ######)
  • Text Formatting - Bold, italic, underline, strikethrough
  • Lists - Bulleted and numbered lists
  • Tables - Converted to Markdown tables with proper alignment
  • Links - Hyperlinks preserved
  • Images - Image references maintained (see options)

Pre-Processing

The converter applies pre-processing to the DOCX file before conversion (via pre_process_docx()) to normalize structure and improve conversion quality.

Example Usage

Basic Conversion

from markitdown import StreamInfo  # public import; avoids the private _stream_info module
from markitdown.converters import DocxConverter

converter = DocxConverter()

with open("document.docx", "rb") as f:
    stream_info = StreamInfo(extension=".docx")
    result = converter.convert(f, stream_info)
    print(result.markdown)

With Style Map

# Custom style mapping (reuses converter and StreamInfo from the basic example)
style_map = """
p[style-name='Section Heading'] => h2:fresh
p[style-name='Subsection Heading'] => h3:fresh
"""

with open("document.docx", "rb") as f:
    stream_info = StreamInfo(extension=".docx")
    result = converter.convert(f, stream_info, style_map=style_map)
    print(result.markdown)

Output Example

For an input DOCX containing headings, a paragraph, a table, and a bulleted list, the output looks like this:
# Document Title

This is a paragraph with **bold** and *italic* text.

## Section Heading

| Column 1 | Column 2 | Column 3 |
| --- | --- | --- |
| Data 1 | Data 2 | Data 3 |
| Data 4 | Data 5 | Data 6 |

- Bullet point 1
- Bullet point 2
  - Nested item

Conversion Pipeline

  1. Pre-process - Apply DOCX structure normalization
  2. Mammoth Conversion - Convert DOCX to HTML using Mammoth
  3. HTML to Markdown - Convert HTML to Markdown using HtmlConverter
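
The three steps above can be sketched as a simple chain of functions. This is a toy illustration, not the converter's actual code: run_docx_pipeline and the three fake_* stages are hypothetical names, and the stand-ins operate on plain bytes so the sketch runs without mammoth or markitdown installed.

```python
import io
from typing import BinaryIO, Callable


def run_docx_pipeline(
    file_stream: BinaryIO,
    pre_process: Callable[[BinaryIO], BinaryIO],
    docx_to_html: Callable[[BinaryIO], str],
    html_to_markdown: Callable[[str], str],
) -> str:
    """Chain the three stages: pre-process, DOCX to HTML, HTML to Markdown."""
    normalized = pre_process(file_stream)  # step 1: pre-processing
    html = docx_to_html(normalized)        # step 2: Mammoth's role
    return html_to_markdown(html)          # step 3: HtmlConverter's role


# Toy stand-ins for the real stages (assumptions for illustration only).
def fake_pre_process(stream: BinaryIO) -> BinaryIO:
    return io.BytesIO(stream.read().strip())


def fake_docx_to_html(stream: BinaryIO) -> str:
    return "<h1>" + stream.read().decode("utf-8") + "</h1>"


def fake_html_to_markdown(html: str) -> str:
    return html.replace("<h1>", "# ").replace("</h1>", "")


markdown = run_docx_pipeline(
    io.BytesIO(b"  Document Title  "),
    fake_pre_process,
    fake_docx_to_html,
    fake_html_to_markdown,
)
print(markdown)  # prints "# Document Title"
```

Keeping each stage behind its own callable mirrors the separation in the list: the HTML produced in step 2 is the only thing step 3 ever sees.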

Style Map Reference

Mammoth style maps allow fine-grained control over conversion:
p[style-name='Quote'] => blockquote
r[style-name='Code'] => code
p[style-name='Normal'] => p:fresh
See Mammoth documentation for complete style map syntax.
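
When a project has many rules, it can help to keep them in a dictionary and join them into the newline-separated string that style_map expects. The helper below is a hypothetical convenience, not part of markitdown or Mammoth; it only concatenates strings in the "matcher => target" form shown above.

```python
from typing import Mapping


def build_style_map(rules: Mapping[str, str]) -> str:
    """Join {docx_matcher: html_target} pairs into a newline-separated style map."""
    return "\n".join(f"{matcher} => {target}" for matcher, target in rules.items())


style_map = build_style_map(
    {
        "p[style-name='Quote']": "blockquote",
        "r[style-name='Code']": "code",
    }
)
print(style_map)
# p[style-name='Quote'] => blockquote
# r[style-name='Code'] => code
```

The resulting string can be passed as the style_map keyword argument to convert(). The helper does not validate the rules, so a malformed matcher is only detected once Mammoth parses the map.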

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_docx_converter.py:31

Inheritance

  • Extends HtmlConverter for Markdown conversion
  • Leverages existing HTML-to-Markdown infrastructure

Limitations

  • Only supports .docx format (Office 2007+), not legacy .doc files
  • Complex formatting may not convert perfectly
  • Embedded objects (equations, SmartArt) have limited support
  • Comments and tracked changes are not preserved
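
Because legacy .doc files are not supported, it can be useful to inspect a file's contents before conversion instead of trusting its extension. The sketch below is a hypothetical pre-check, not something the converter does itself. It relies on two facts: .docx files are ZIP archives that conventionally contain word/document.xml, and legacy Office files use the OLE compound file format, which begins with a fixed 8-byte signature.

```python
import io
import zipfile
from typing import BinaryIO

# Signature of OLE compound files (legacy .doc, but also .xls and .ppt).
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def classify_word_file(stream: BinaryIO) -> str:
    """Return 'docx', 'ole-compound', or 'unknown' for a seekable binary stream."""
    stream.seek(0)
    header = stream.read(8)
    stream.seek(0)
    if header == OLE_MAGIC:
        return "ole-compound"
    if zipfile.is_zipfile(stream):
        stream.seek(0)
        with zipfile.ZipFile(stream) as archive:
            # Conventional location of the main document part in a .docx file.
            has_document = "word/document.xml" in archive.namelist()
        stream.seek(0)  # leave the stream ready for the real converter
        if has_document:
            return "docx"
    stream.seek(0)
    return "unknown"
```

A result of "ole-compound" only says the file is in the legacy container format; telling a .doc apart from an .xls would require reading the container's directory, which this sketch does not attempt.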