Overview
The DocxConverter class converts Microsoft Word .docx files to Markdown, preserving style information (headings, bold, italic) and table structures. It extends HtmlConverter and uses mammoth to convert DOCX to HTML, then converts that HTML to Markdown.
Dependencies
mammoth
Accepted Formats
application/vnd.openxmlformats-officedocument.wordprocessingml.document
.docx
Class Definition
Constructor
Initializes the converter and its underlying HtmlConverter instance.
Methods
accepts()
Returns True if the file has a .docx extension or the DOCX MIME type.
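The acceptance rule above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the function name `accepts_docx` and the prefix match on the MIME type are assumptions, while the extension and MIME type values come from the Accepted Formats list.

```python
from typing import Optional

# Values taken from the "Accepted Formats" list in this document.
ACCEPTED_MIME_TYPE_PREFIXES = [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
ACCEPTED_FILE_EXTENSIONS = [".docx"]


def accepts_docx(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True when either the extension or the MIME type marks a DOCX file.

    Hypothetical helper: the real accepts() method receives file metadata,
    which is reduced here to two optional strings.
    """
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_FILE_EXTENSIONS:
        return True
    # Assumption: a prefix match tolerates parameters such as "; charset=...".
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Either signal is sufficient, so a stream with no filename can still be accepted when its MIME type is known.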
convert()
Parameters:
- Binary stream of the DOCX file
- Metadata about the file (extension, MIME type, etc.)
- Optional Mammoth style map that customizes conversion by mapping DOCX styles to specific HTML/Markdown elements
Returns: DocumentConverterResult with the converted Markdown
Raises: MissingDependencyException if mammoth is not installed
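The missing-dependency behaviour follows a common optional-import pattern, sketched below under stated assumptions. `MissingDependencyError` and `require_module` are hypothetical names used for illustration only; they stand in for the library's own exception and whatever guard it actually uses.

```python
import importlib


class MissingDependencyError(Exception):
    """Hypothetical stand-in for the library's MissingDependencyException."""


def require_module(name: str):
    """Import an optional dependency, or raise a descriptive error if it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Chain the original ImportError so the root cause stays visible.
        raise MissingDependencyError(
            f"The '{name}' package is required for DOCX conversion but is not installed."
        ) from exc
```

Under this pattern, a caller would run something like `require_module("mammoth")` at the start of the conversion, so the failure surfaces with a clear message instead of a bare ImportError.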
Features
Preserved Elements
- Headings - H1-H6 styles converted to Markdown headings
- Text Formatting - Bold, italic, underline, strikethrough
- Lists - Bulleted and numbered lists
- Tables - Converted to Markdown tables with proper alignment
- Links - Hyperlinks preserved
- Images - Image references maintained (see options)
Pre-Processing
The converter applies pre-processing to the DOCX file before conversion (via pre_process_docx()) to normalize structure and improve conversion quality.
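This document does not describe what pre_process_docx() changes, so the following is only a generic sketch of how such a stage can work. It relies on the fact that a .docx file is a ZIP archive of XML parts: the archive is copied into memory and one part is passed through a caller-supplied transform. The helper name `rewrite_docx_part` and its parameters are hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, Callable


def rewrite_docx_part(
    docx_stream: BinaryIO,
    transform: Callable[[str], str],
    part_name: str = "word/document.xml",
) -> io.BytesIO:
    """Copy a DOCX archive into memory, applying `transform` to one XML part."""
    output = io.BytesIO()
    with zipfile.ZipFile(docx_stream) as src, zipfile.ZipFile(
        output, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == part_name:
                # Only the selected part is rewritten; every other part is copied as-is.
                data = transform(data.decode("utf-8")).encode("utf-8")
            dst.writestr(item, data)
    output.seek(0)  # rewind so the next stage can read from the start
    return output
```

Returning a fresh in-memory stream keeps the original input untouched and lets the next stage consume the result like any other binary stream.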
Example Usage
Basic Conversion
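A minimal sketch of a basic conversion, assuming the markitdown package and its mammoth dependency are installed and that conversion goes through the top-level MarkItDown entry point rather than by instantiating DocxConverter directly. The wrapper name `docx_to_markdown` is hypothetical.

```python
def docx_to_markdown(path: str) -> str:
    """Convert a .docx file to Markdown text via the MarkItDown entry point."""
    # Imported lazily so this module still loads where markitdown is absent.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)  # a .docx input is routed to the DOCX converter
    return result.text_content
```

Calling `docx_to_markdown("report.docx")` (a placeholder path) would return the Markdown as a string.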
With Style Map
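A Mammoth style map is plain text with one `source => target` rule per line. The sketch below builds such a string from a mapping; the helper `build_style_map` and the style names ("Section Title", "Subsection Title", "Code") are illustrative assumptions, not styles that any particular document is guaranteed to contain.

```python
from typing import Mapping


def build_style_map(rules: Mapping[str, str]) -> str:
    """Join DOCX-matcher -> HTML-path pairs into Mammoth's line-based syntax."""
    return "\n".join(f"{docx} => {html}" for docx, html in rules.items())


# Hypothetical rules: map custom paragraph and run styles to HTML elements.
STYLE_RULES = {
    "p[style-name='Section Title']": "h1:fresh",
    "p[style-name='Subsection Title']": "h2:fresh",
    "r[style-name='Code']": "code",
}

style_map = build_style_map(STYLE_RULES)
```

The resulting string is the kind of value the optional style map parameter of convert() is described as accepting, assuming it is forwarded to Mammoth unchanged.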
Output Example
Input DOCX with a heading, a paragraph, and a table.
Conversion Pipeline
- Pre-process - Apply DOCX structure normalization
- Mammoth Conversion - Convert DOCX to HTML using Mammoth
- HTML to Markdown - Convert HTML to Markdown using HtmlConverter
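The three stages above compose as a simple function chain. The sketch below takes each stage as a callable so that it runs without mammoth installed. `run_docx_pipeline` and the stand-in stages are hypothetical, and the real converter wires these steps together internally.

```python
import io
from typing import BinaryIO, Callable


def run_docx_pipeline(
    docx_stream: BinaryIO,
    pre_process: Callable[[BinaryIO], BinaryIO],
    docx_to_html: Callable[[BinaryIO], str],
    html_to_markdown: Callable[[str], str],
) -> str:
    """Run the stages in order: pre-process, DOCX to HTML, HTML to Markdown."""
    normalized = pre_process(docx_stream)  # 1. Pre-process
    html = docx_to_html(normalized)        # 2. Mammoth conversion
    return html_to_markdown(html)          # 3. HTML to Markdown


# Stand-in stages so the sketch is runnable on its own.
demo = run_docx_pipeline(
    io.BytesIO(b"fake docx bytes"),
    pre_process=lambda stream: stream,
    docx_to_html=lambda stream: "<h1>Title</h1>",
    html_to_markdown=lambda html: html.replace("<h1>", "# ").replace("</h1>", ""),
)
print(demo)  # prints "# Title"
```

With Mammoth available, the second stage could be `lambda s: mammoth.convert_to_html(s).value`, since `convert_to_html` returns a result object whose `value` holds the generated HTML.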
Style Map Reference
Mammoth style maps allow fine-grained control over conversion.
Implementation Details
Source Location
~/workspace/source/packages/markitdown/src/markitdown/converters/_docx_converter.py:31
Inheritance
- Extends HtmlConverter for Markdown conversion
- Leverages existing HTML-to-Markdown infrastructure
Limitations
- Only supports .docx format (Office 2007+), not legacy .doc files
- Complex formatting may not convert perfectly
- Embedded objects (equations, SmartArt) have limited support
- Comments and tracked changes are not preserved