DocxConverter

Overview

The DocxConverter class converts Microsoft Word .docx files to Markdown, preserving style information (headings, bold, italic) and table structures. It extends HtmlConverter: mammoth first converts the DOCX to HTML, and that HTML is then converted to Markdown.

Dependencies

pip install 'markitdown[docx]'

Requires: mammoth

Accepted Formats

MIME Types - application/vnd.openxmlformats-officedocument.wordprocessingml.document
Extensions - .docx

Class Definition

class DocxConverter(HtmlConverter):
    """Converts DOCX files to Markdown.
    
    Style information (e.g., headings) and tables are preserved where possible.
    """

Constructor

def __init__(self):
    super().__init__()
    self._html_converter = HtmlConverter()

Initializes the converter with an internal HtmlConverter instance.

Methods

accepts()

def accepts(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool

Returns True if the file has a .docx extension or the DOCX MIME type.
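The acceptance rule above can be sketched as a standalone function. This is a minimal illustration, not the library's code: the name looks_like_docx is hypothetical, and the case-insensitive comparison and prefix match on the MIME type are assumptions.

```python
from typing import Optional

# Values taken from the "Accepted Formats" section of this page.
DOCX_MIME_PREFIX = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
DOCX_EXTENSION = ".docx"


def looks_like_docx(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True when either the extension or the MIME type indicates DOCX.

    Hypothetical helper: either signal alone is enough, and missing values
    (None) are treated as "no information".
    """
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    return ext == DOCX_EXTENSION or mime.startswith(DOCX_MIME_PREFIX)
```

Because either signal is sufficient, a stream with no filename but a correct MIME type is still accepted.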

convert()

def convert(
    self,
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult

Converts a DOCX file to Markdown.

Parameters:

file_stream (BinaryIO, required) - Binary stream of the DOCX file
stream_info (StreamInfo, required) - Metadata about the file (extension, MIME type, etc.)
style_map (str, optional) - Mammoth style map to customize conversion. Allows mapping DOCX styles to specific HTML/Markdown elements.

Returns: DocumentConverterResult with converted Markdown

Raises: MissingDependencyException if mammoth is not installed
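The missing-dependency behavior follows a common optional-import pattern, sketched below under stated assumptions. MissingDependencyError and require_module are hypothetical stand-ins written for this illustration; they are not markitdown's own exception class or code.

```python
import importlib
from types import ModuleType


class MissingDependencyError(Exception):
    """Hypothetical stand-in for markitdown's MissingDependencyException."""


def require_module(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name, or raise an error that tells the user how to fix it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original ImportError so the root cause stays visible.
        raise MissingDependencyError(
            f"{module_name!r} is required for this conversion. "
            f"Install it with: {install_hint}"
        ) from exc
```

A converter built this way might call require_module("mammoth", "pip install 'markitdown[docx]'") at the start of its conversion routine, so the error surfaces only when a DOCX file is actually converted.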

Features

Preserved Elements

Headings - H1-H6 styles converted to Markdown headings
Text Formatting - Bold, italic, underline, strikethrough
Lists - Bulleted and numbered lists
Tables - Converted to Markdown tables with proper alignment
Links - Hyperlinks preserved
Images - Image references maintained (see options)

Pre-Processing

The converter applies pre-processing to the DOCX file before conversion (via pre_process_docx()) to normalize structure and improve conversion quality.
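This page does not describe which transformations pre_process_docx() applies. The sketch below only illustrates the general shape such a step can take: a DOCX file is a ZIP archive of XML parts, so a pre-processor can copy the archive while rewriting selected parts. The function rewrite_docx_parts and its transform callback are hypothetical names, not part of markitdown.

```python
import io
import zipfile
from typing import BinaryIO, Callable


def rewrite_docx_parts(
    source: BinaryIO,
    transform: Callable[[str, bytes], bytes],
) -> io.BytesIO:
    """Copy a DOCX (ZIP) archive, passing every .xml part through transform.

    Hypothetical helper: transform receives the part name and its raw bytes
    and returns the bytes to store. Non-XML parts (e.g. images) are copied
    unchanged.
    """
    output = io.BytesIO()
    with zipfile.ZipFile(source, "r") as zin, zipfile.ZipFile(
        output, "w", zipfile.ZIP_DEFLATED
    ) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.endswith(".xml"):
                data = transform(item.filename, data)
            zout.writestr(item.filename, data)
    output.seek(0)  # rewind so the next stage can read from the start
    return output
```

Returning a fresh in-memory stream leaves the caller's original file untouched, and the result can be handed straight to the next conversion stage.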

Example Usage

Basic Conversion

from markitdown.converters import DocxConverter
from markitdown._stream_info import StreamInfo

converter = DocxConverter()

with open("document.docx", "rb") as f:
    stream_info = StreamInfo(extension=".docx")
    result = converter.convert(f, stream_info)
    print(result.markdown)

With Style Map

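from markitdown.converters import DocxConverter
from markitdown._stream_info import StreamInfo

converter = DocxConverter()

# Custom style mapping: map named Word paragraph styles to HTML headings
style_map = """
p[style-name='Section Heading'] => h2:fresh
p[style-name='Subsection Heading'] => h3:fresh
"""

with open("document.docx", "rb") as f:
    stream_info = StreamInfo(extension=".docx")
    result = converter.convert(f, stream_info, style_map=style_map)
    print(result.markdown)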

Output Example

For an input DOCX containing headings, a paragraph, a table, and a bulleted list, the converter produces Markdown along these lines:

# Document Title

This is a paragraph with **bold** and *italic* text.

## Section Heading

| Column 1 | Column 2 | Column 3 |
| --- | --- | --- |
| Data 1 | Data 2 | Data 3 |
| Data 4 | Data 5 | Data 6 |

- Bullet point 1
- Bullet point 2
  - Nested item

Conversion Pipeline

1. Pre-process - Apply DOCX structure normalization
2. Mammoth Conversion - Convert DOCX to HTML using Mammoth
3. HTML to Markdown - Convert HTML to Markdown using HtmlConverter
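The three stages form a simple data flow: binary stream in, normalized binary stream, HTML string, Markdown string. The sketch below expresses that flow with injectable stage functions so it runs without any third-party packages. The function convert_docx and its parameter names are hypothetical and do not mirror the converter's internals.

```python
import io
from typing import BinaryIO, Callable


def convert_docx(
    file_stream: BinaryIO,
    pre_process: Callable[[BinaryIO], BinaryIO],
    docx_to_html: Callable[[BinaryIO], str],
    html_to_markdown: Callable[[str], str],
) -> str:
    """Run the three pipeline stages in order and return Markdown text."""
    normalized = pre_process(file_stream)   # stage 1: normalize the DOCX
    html = docx_to_html(normalized)         # stage 2: DOCX -> HTML
    return html_to_markdown(html)           # stage 3: HTML -> Markdown


# Stub stages for illustration only. With mammoth installed, stage 2 could be
# `lambda s: mammoth.convert_to_html(s).value`.
markdown = convert_docx(
    io.BytesIO(b"title"),
    pre_process=lambda s: io.BytesIO(s.read().upper()),
    docx_to_html=lambda s: "<h1>" + s.read().decode() + "</h1>",
    html_to_markdown=lambda h: "# " + h[len("<h1>"):-len("</h1>")],
)
print(markdown)  # prints "# TITLE"
```

Keeping the stages separate makes each one replaceable and testable on its own, which is the main benefit of routing DOCX through HTML rather than converting directly.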

Style Map Reference

Mammoth style maps allow fine-grained control over conversion:

p[style-name='Quote'] => blockquote
r[style-name='Code'] => code
p[style-name='Normal'] => p:fresh

See Mammoth documentation for complete style map syntax.
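Each rule in a style map is one line of the form "matcher => target"; Mammoth ignores blank lines and lines starting with #. The helper below splits a style map into (matcher, target) pairs, which can be handy for logging or sanity-checking a map before passing it to convert(). It is a hypothetical inspection aid, not Mammoth's parser, and it does not validate the matcher or target grammar.

```python
from typing import List, Tuple


def split_style_map(style_map: str) -> List[Tuple[str, str]]:
    """Split a style map into (matcher, target) pairs, one per rule line."""
    rules: List[Tuple[str, str]] = []
    for raw_line in style_map.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        matcher, arrow, target = line.partition("=>")
        if not arrow:
            raise ValueError(f"Style map rule is missing '=>': {line!r}")
        rules.append((matcher.strip(), target.strip()))
    return rules


rules = split_style_map(
    """
    # Map custom Word styles to semantic HTML
    p[style-name='Quote'] => blockquote
    r[style-name='Code'] => code
    """
)
for matcher, target in rules:
    print(f"{matcher} maps to {target}")
```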

Implementation Details

Source Location

~/workspace/source/packages/markitdown/src/markitdown/converters/_docx_converter.py:31

Inheritance

Extends HtmlConverter for Markdown conversion
Leverages existing HTML-to-Markdown infrastructure

Limitations

Only supports .docx format (Office 2007+), not legacy .doc files
Complex formatting may not convert perfectly
Embedded objects (equations, SmartArt) have limited support
Comments and tracked changes are not preserved
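Because legacy .doc files are not supported, callers may want to detect them before attempting conversion. A .docx file is a ZIP archive, while a legacy .doc file is an OLE compound file, and the two start with different byte signatures. The sketch below checks those signatures; sniff_word_container is a hypothetical helper, and a "zip" result only identifies the container, not that the archive is a valid Word document.

```python
import io
from typing import BinaryIO

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # OLE compound file header


def sniff_word_container(stream: BinaryIO) -> str:
    """Return "zip", "ole", or "unknown" based on the leading bytes.

    The stream position is restored afterwards so the caller can still
    read the file from where it was.
    """
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)
    if header.startswith(ZIP_MAGIC):
        return "zip"   # candidate .docx
    if header.startswith(OLE_MAGIC):
        return "ole"   # likely a legacy .doc (or another OLE document)
    return "unknown"


print(sniff_word_container(io.BytesIO(OLE_MAGIC + b"legacy data")))  # prints "ole"
```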
