Overview

The AudioConverter class converts audio files to Markdown by extracting metadata (via exiftool) and transcribing speech to text (via the speech_recognition library). It supports common audio formats, including WAV, MP3, M4A, and MP4.

Dependencies

pip install markitdown  # Base install
Required: None
Optional: exiftool (external binary, for metadata extraction), SpeechRecognition library (for transcription)

Accepted Formats

MIME Types
  • audio/x-wav
  • audio/mpeg
  • video/mp4
Extensions
  • .wav
  • .mp3
  • .m4a
  • .mp4

Class Definition

class AudioConverter(DocumentConverter):
    """Converts audio files to markdown.
    
    Extracts metadata (if exiftool installed) and transcribes
    speech (if speech_recognition installed).
    """

Methods

accepts()

def accepts(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> bool
Returns True for supported audio file extensions or MIME types.
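
The check described above can be sketched as a standalone function. This is a minimal illustration built from the Accepted Formats lists, not the converter's actual code: the name looks_like_audio, the lowercasing, and the prefix match on MIME types are all assumptions.

```python
from typing import Optional

# Values taken from the "Accepted Formats" lists.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".mp4"}
ACCEPTED_MIME_PREFIXES = ("audio/x-wav", "audio/mpeg", "video/mp4")


def looks_like_audio(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical stand-in for the check that accepts() performs."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match (an assumption) so MIME parameters do not break the check.
    return mime.startswith(ACCEPTED_MIME_PREFIXES)
```

Either signal is enough on its own, so a stream with a known extension but no MIME type is still accepted.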

convert()

def convert(
    file_stream: BinaryIO,
    stream_info: StreamInfo,
    **kwargs: Any,
) -> DocumentConverterResult
Converts an audio file to Markdown with metadata and transcript.
Parameters:
  • file_stream (BinaryIO, required) - Binary stream of the audio file
  • stream_info (StreamInfo, required) - Metadata about the file (extension, MIME type)
  • exiftool_path (str) - Path to the exiftool binary for metadata extraction
Returns: DocumentConverterResult with metadata and transcript
Note: Transcription requires the speech_recognition library. If it is not installed, only metadata is extracted.

Features

Metadata Extraction

If exiftool is available, extracts these audio metadata fields:
  • Title - Track title
  • Artist - Artist/performer name
  • Author - Author name
  • Band - Band/group name
  • Album - Album title
  • Genre - Music genre
  • Track - Track number
  • DateTimeOriginal - Original recording date
  • CreateDate - File creation date
  • NumChannels - Number of audio channels (mono/stereo)
  • SampleRate - Sample rate in Hz
  • AvgBytesPerSec - Average data rate in bytes per second
  • BitsPerSample - Bit depth
Note: The Duration field is excluded because its value may be incorrect when the file is read from memory.
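
To illustrate how the field list above shapes the output, here is a minimal sketch that filters a metadata dictionary down to the listed fields and renders them as "Field: value" lines. The render_metadata helper and the sample dictionary are hypothetical; only the field names and their order come from the list.

```python
from typing import Any, Mapping

# Field names and order copied from the list above; Duration is left out on purpose.
METADATA_FIELDS = [
    "Title", "Artist", "Author", "Band", "Album", "Genre", "Track",
    "DateTimeOriginal", "CreateDate", "NumChannels", "SampleRate",
    "AvgBytesPerSec", "BitsPerSample",
]


def render_metadata(raw: Mapping[str, Any]) -> str:
    """Return one 'Field: value' line per known field present in raw."""
    lines = [f"{name}: {raw[name]}" for name in METADATA_FIELDS if name in raw]
    return "\n".join(lines)


# Made-up sample data: Duration is present but not rendered.
sample = {"SampleRate": 44100, "Duration": "0:42:10", "Title": "Episode 42"}
print(render_metadata(sample))
```

Fields missing from the input are skipped, which matches the behavior described under Error Handling.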

Audio Transcription

When speech_recognition is installed:
  1. Detects audio format from extension/MIME type
  2. Transcribes speech to text
  3. Adds transcript under "### Audio Transcript:" heading
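
The last step can be sketched as follows. This is an illustration under assumptions, not the converter's code: append_transcript is a hypothetical helper, and the blank line between the metadata and the heading is inferred from the example output in this page.

```python
from typing import Optional

TRANSCRIPT_HEADING = "### Audio Transcript:"  # heading text from step 3


def append_transcript(metadata_md: str, transcript: Optional[str]) -> str:
    """Add the transcript section, or return the metadata unchanged."""
    if not transcript or not transcript.strip():
        return metadata_md.strip()
    parts = [metadata_md.strip(), f"{TRANSCRIPT_HEADING}\n{transcript.strip()}"]
    # Drop an empty metadata part so the output never starts with blank lines.
    return "\n\n".join(p for p in parts if p)
```

When no transcript is available, the result is the metadata alone, with no empty heading left behind.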

Example Usage

Metadata Only

from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

with open("podcast.mp3", "rb") as f:
    stream_info = StreamInfo(
        extension=".mp3",
        mimetype="audio/mpeg"
    )
    result = converter.convert(f, stream_info)
    print(result.markdown)
Output:
Title: Episode 42: The Future of AI
Artist: Tech Talks Podcast
Album: Season 3
Genre: Podcast
DateTimeOriginal: 2024-02-15
NumChannels: 2
SampleRate: 44100
BitsPerSample: 16

With Transcription

# Requires: pip install SpeechRecognition
from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

with open("interview.wav", "rb") as f:
    stream_info = StreamInfo(extension=".wav")
    result = converter.convert(f, stream_info)
    print(result.markdown)
Output:
Title: Interview with Dr. Smith
Artist: News Radio
DateTimeOriginal: 2024-02-20
NumChannels: 1
SampleRate: 48000

### Audio Transcript:
Welcome to our show today. We're joined by Dr. Smith, who will discuss 
recent advances in renewable energy technology. Dr. Smith, thank you for 
being here. Can you tell us about your latest research?
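
Output like the above has two parts: "Key: value" metadata lines, then the transcript under its heading. A minimal sketch for splitting them apart again, assuming the heading text appears exactly as shown (split_audio_markdown is a hypothetical helper, not part of markitdown):

```python
from typing import Dict, Tuple

TRANSCRIPT_HEADING = "### Audio Transcript:"


def split_audio_markdown(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split converter output into (metadata dict, transcript text)."""
    head, _, transcript = markdown.partition(TRANSCRIPT_HEADING)
    metadata: Dict[str, str] = {}
    for line in head.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    return metadata, transcript.strip()
```

Splitting at the heading first keeps colons inside the transcript from being mistaken for metadata fields.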

Custom exiftool Path

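from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

with open("music.m4a", "rb") as f:
    stream_info = StreamInfo(extension=".m4a")
    result = converter.convert(
        f,
        stream_info,
        exiftool_path="/opt/homebrew/bin/exiftool"
    )
    print(result.markdown)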

Implementation Details

Source Location

packages/markitdown/src/markitdown/converters/_audio_converter.py:23

Format Detection

The converter maps the file extension or MIME type to an audio format for transcription:
if stream_info.extension == ".wav" or stream_info.mimetype == "audio/x-wav":
    audio_format = "wav"
elif stream_info.extension == ".mp3" or stream_info.mimetype == "audio/mpeg":
    audio_format = "mp3"
elif stream_info.extension in [".mp4", ".m4a"] or stream_info.mimetype == "video/mp4":
    audio_format = "mp4"
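
The branch logic above can also be written as a small function with an explicit fallback. This is a sketch rather than the converter's code: detect_audio_format is a hypothetical name, and both the lowercasing and the None return for unknown input are assumptions.

```python
from typing import Optional


def detect_audio_format(
    extension: Optional[str], mimetype: Optional[str]
) -> Optional[str]:
    """Return "wav", "mp3", "mp4", or None when nothing matches."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext == ".wav" or mime == "audio/x-wav":
        return "wav"
    if ext == ".mp3" or mime == "audio/mpeg":
        return "mp3"
    if ext in (".mp4", ".m4a") or mime == "video/mp4":
        return "mp4"
    return None  # assumption: unknown input means "skip transcription"
```

Returning None makes the "no match" case explicit, so a caller can skip transcription instead of hitting an undefined variable.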

Transcription Pipeline

Transcription uses the transcribe_audio() helper function (from _transcribe_audio.py):
transcript = transcribe_audio(file_stream, audio_format=audio_format)
This function:
  1. Uses the speech_recognition library
  2. Supports multiple speech recognition engines
  3. Returns text transcript or None if transcription fails

Error Handling

  • Missing speech_recognition raises MissingDependencyException (caught silently)
  • Transcription failures are silent (no transcript section added)
  • Metadata extraction failures are silent (field not included)
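
The silent-failure behavior listed above amounts to treating any transcription error as "no transcript". A minimal sketch of that pattern follows. The names transcript_or_none and missing_dependency are hypothetical, and ImportError stands in for the library's own exception type.

```python
from typing import Callable, Optional


def transcript_or_none(transcribe: Callable[[], Optional[str]]) -> Optional[str]:
    """Run transcribe() and turn any failure into None."""
    try:
        text = transcribe()
    except Exception:
        # Covers both a missing dependency and a failed recognition request.
        return None
    return text.strip() if text and text.strip() else None


def missing_dependency() -> str:
    # Simulates the optional library being absent.
    raise ImportError("speech_recognition is not installed")


print(transcript_or_none(missing_dependency))
```

A caller can then add the transcript section only when the result is not None. The trade-off is that real errors, such as a network failure, are indistinguishable from silence in the audio.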

Use Cases

Podcast Indexing

from pathlib import Path

from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

for audio_file in Path("podcasts").glob("*.mp3"):
    with audio_file.open("rb") as f:
        result = converter.convert(
            f,
            StreamInfo(extension=".mp3")
        )
    # Save the transcript in the current directory for search indexing
    out_path = Path(f"{audio_file.stem}_transcript.md")
    out_path.write_text(result.markdown, encoding="utf-8")

Meeting Notes

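from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

# Transcribe meeting recording
with open("team_meeting_2024-02-15.wav", "rb") as f:
    result = converter.convert(
        f,
        StreamInfo(extension=".wav")
    )
    print(result.markdown)  # Review and edit transcript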

Music Library Management

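from pathlib import Path

from markitdown.converters import AudioConverter
from markitdown._stream_info import StreamInfo

converter = AudioConverter()

# Extract metadata for music organization
for music_file in Path("library").glob("**/*.mp3"):
    with music_file.open("rb") as f:
        result = converter.convert(
            f,
            StreamInfo(extension=".mp3"),
            exiftool_path="/usr/local/bin/exiftool"
        )
    # Parse "Key: value" lines for a database, stopping at the transcript
    metadata = {}
    for line in result.markdown.split("\n"):
        if line.startswith("### Audio Transcript:"):
            break
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()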

Limitations

  • Transcription accuracy depends on audio quality and speech clarity
  • Supports only WAV, MP3, M4A, and MP4
  • Large audio files may require significant processing time
  • Some transcription engines require an internet connection
  • Metadata extraction requires the external exiftool binary
  • No speaker identification or timestamp markers
  • Music/background noise may interfere with transcription
  • Non-English speech may have limited support

Transcription Engines

The speech_recognition library supports multiple engines:
  • Google Speech Recognition (default, requires internet)
  • Sphinx (offline, less accurate)
  • Google Cloud Speech
  • Microsoft Azure Speech
  • IBM Speech to Text
  • Whisper (OpenAI)
Refer to the speech_recognition documentation for configuration details.