Overview
The AudioConverter class converts audio files to Markdown by extracting metadata (via exiftool) and transcribing speech to text (via the speech_recognition library). It supports common audio formats including WAV, MP3, M4A, and MP4.
Dependencies
Optional:
exiftool (external), SpeechRecognition library
Accepted Formats
MIME types: audio/x-wav, audio/mpeg, video/mp4
File extensions: .wav, .mp3, .m4a, .mp4
Class Definition
Methods
accepts()
Returns True for supported audio file extensions or MIME types.
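As a rough illustration of that check, the sketch below tests an extension and a MIME type against the values listed under Accepted Formats. The function name `accepts_audio`, the constant names, the case-insensitive comparison, and the prefix match on MIME types are assumptions made for this example, not details confirmed by this page.

```python
from typing import Optional

# Values taken from the "Accepted Formats" section.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".mp4"}
ACCEPTED_MIME_PREFIXES = ("audio/x-wav", "audio/mpeg", "video/mp4")


def accepts_audio(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical stand-in for accepts(): True for a supported extension or MIME type."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    # str.startswith accepts a tuple, so one call covers every prefix.
    return ext in ACCEPTED_EXTENSIONS or mime.startswith(ACCEPTED_MIME_PREFIXES)
```

Either signal is enough on its own, so a stream with a known extension but no MIME type (or the reverse) would still be accepted under these assumptions.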
convert()
Parameters:
- Binary stream of the audio file
- Metadata about the file (extension, MIME type)
- Path to the exiftool binary for metadata extraction
Returns:
- DocumentConverterResult with metadata and transcript
Note: Transcription requires the speech_recognition library. If it is not installed, only metadata is extracted.
Features
Metadata Extraction
If exiftool is available, the converter extracts these audio metadata fields:
- Title - Track title
- Artist - Artist/performer name
- Author - Author name
- Band - Band/group name
- Album - Album title
- Genre - Music genre
- Track - Track number
- DateTimeOriginal - Original recording date
- CreateDate - File creation date
- NumChannels - Number of audio channels (mono/stereo)
- SampleRate - Sample rate in Hz
- AvgBytesPerSec - Average byte rate (bytes per second)
- BitsPerSample - Bit depth
The Duration field is excluded because it may be incorrect when read from memory.
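The field selection can be pictured as a whitelist filter over whatever exiftool returns. The sketch below assumes the metadata arrives as a dictionary and that each field is rendered as a `Name: value` line. Both are assumptions for illustration, as are the names `METADATA_FIELDS` and `format_metadata`.

```python
from typing import Any, Mapping

# Field names from the list in this section; Duration is deliberately absent.
METADATA_FIELDS = [
    "Title", "Artist", "Author", "Band", "Album", "Genre", "Track",
    "DateTimeOriginal", "CreateDate", "NumChannels", "SampleRate",
    "AvgBytesPerSec", "BitsPerSample",
]


def format_metadata(metadata: Mapping[str, Any]) -> str:
    """Render the whitelisted fields that are present as 'Name: value' lines."""
    lines = [f"{name}: {metadata[name]}" for name in METADATA_FIELDS if name in metadata]
    return "\n".join(lines)
```

Fields that exiftool does not report are skipped, and anything outside the whitelist (such as Duration) never reaches the output.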
Audio Transcription
When speech_recognition is installed:
- Detects audio format from extension/MIME type
- Transcribes speech to text
- Adds transcript under "### Audio Transcript:" heading
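The first and last steps above can be sketched without any audio library. The mapping from extensions and MIME types to format names (`wav`, `mp3`, `mp4`) and the blank line before the heading are assumptions for this example, and `detect_audio_format` and `append_transcript` are hypothetical helper names.

```python
from typing import Optional


def detect_audio_format(extension: Optional[str], mimetype: Optional[str]) -> Optional[str]:
    """Map an extension or MIME type to a format name (assumed mapping)."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext == ".wav" or mime == "audio/x-wav":
        return "wav"
    if ext == ".mp3" or mime == "audio/mpeg":
        return "mp3"
    if ext in (".m4a", ".mp4") or mime == "video/mp4":
        return "mp4"
    return None  # unknown format: skip transcription


def append_transcript(markdown: str, transcript: Optional[str]) -> str:
    """Add the transcript section only when a non-empty transcript exists."""
    if not transcript:
        return markdown.strip()
    return (markdown + "\n\n### Audio Transcript:\n" + transcript).strip()
```

When no transcript is available, the metadata text is returned unchanged apart from trimmed whitespace, so no empty transcript section is added.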
Example Usage
Metadata Only
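A minimal sketch, assuming the markitdown package exposes a top-level `MarkItDown` class with a `convert()` method whose result carries the Markdown in `text_content`. None of these names appear on this page, so treat them as assumptions. The helper name `audio_to_markdown` and the path `recording.wav` are placeholders.

```python
def audio_to_markdown(path: str) -> str:
    """Convert an audio file and return the generated Markdown text."""
    # Imported lazily so that defining this helper does not require the package.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return result.text_content


# Example call (placeholder path):
# print(audio_to_markdown("recording.wav"))
```

Without speech_recognition installed, the returned text would be expected to contain only the metadata lines.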
With Transcription
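When a transcript is present, it follows the "### Audio Transcript:" heading in the output. The sketch below separates the two parts of an already converted result. It assumes the output is available as a plain string, and `split_transcript` is a hypothetical helper name.

```python
from typing import Optional, Tuple

TRANSCRIPT_HEADING = "### Audio Transcript:"


def split_transcript(markdown: str) -> Tuple[str, Optional[str]]:
    """Return (metadata_text, transcript); transcript is None when the heading is absent."""
    before, heading, after = markdown.partition(TRANSCRIPT_HEADING)
    if not heading:
        return markdown.strip(), None
    return before.strip(), after.strip()
```

A `None` transcript is a convenient signal that transcription was skipped or failed, since both cases leave the heading out.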
Custom exiftool Path
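A minimal sketch, assuming the `MarkItDown` constructor accepts an `exiftool_path` keyword and forwards it to the converter. That keyword, the `convert()` call, and the `text_content` attribute are assumptions not confirmed by this page. The helper name and both example paths are placeholders.

```python
from typing import Optional


def audio_to_markdown_with_exiftool(path: str, exiftool_path: Optional[str] = None) -> str:
    """Convert an audio file using a specific exiftool binary, if one is given."""
    # Imported lazily so that defining this helper does not require the package.
    from markitdown import MarkItDown

    # Assumption: exiftool_path is accepted as a constructor keyword.
    converter = MarkItDown(exiftool_path=exiftool_path)
    return converter.convert(path).text_content


# Example call (placeholder paths):
# audio_to_markdown_with_exiftool("track.mp3", "/usr/local/bin/exiftool")
```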
Implementation Details
Source Location
~/workspace/source/packages/markitdown/src/markitdown/converters/_audio_converter.py:23
Format Detection
The converter maps file extensions to audio formats for transcription.
Transcription Pipeline
Transcription uses the transcribe_audio() helper function (from _transcribe_audio.py):
- Uses the speech_recognition library
- Supports multiple speech recognition engines
- Returns text transcript or None if transcription fails
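The sketch below shows one way such a helper could be written with the SpeechRecognition package. It is not the contents of _transcribe_audio.py. It assumes WAV input only (other formats would first need converting), uses the Google Web Speech engine, and `transcribe_wav` is a hypothetical name.

```python
from typing import BinaryIO, Optional


def transcribe_wav(stream: BinaryIO) -> Optional[str]:
    """Transcribe a WAV stream; return None when no transcript can be produced."""
    # Lazy import: the dependency is only needed when transcription runs.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(stream) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        text = recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        # Unintelligible speech or an unreachable service both yield no transcript.
        return None
    return text.strip() or None
```

Returning `None` for both failure modes matches the behaviour described above, at the cost of hiding why transcription failed.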
Error Handling
- Missing speech_recognition raises MissingDependencyException (caught silently)
- Transcription failures are silent (no transcript section added)
- Metadata extraction failures are silent (field not included)
Use Cases
Podcast Indexing
Meeting Notes
Music Library Management
Limitations
- Transcription accuracy depends on audio quality and speech clarity
- Only supports formats: WAV, MP3, M4A, MP4
- Large audio files may require significant processing time
- Transcription requires an internet connection (for some engines)
- Metadata extraction requires the external exiftool binary
- No speaker identification or timestamp markers
- Music/background noise may interfere with transcription
- Non-English speech may have limited support
Transcription Engines
The speech_recognition library supports multiple engines:
- Google Speech Recognition (default, requires internet)
- Sphinx (offline, less accurate)
- Google Cloud Speech
- Microsoft Azure Speech
- IBM Speech to Text
- Whisper (OpenAI)
See the speech_recognition documentation for configuration.