StreamInfo

The StreamInfo class stores metadata about a file stream, including mimetype, extension, charset, and source information. It’s used throughout MarkItDown to help converters identify and process files correctly.

Overview

StreamInfo is a frozen, keyword-only dataclass: instances are immutable, and every field must be passed by name. Each field is optional and defaults to None; which ones are populated depends on how the stream was opened and what information is available.

@dataclass(kw_only=True, frozen=True)
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None
    local_path: Optional[str] = None
    url: Optional[str] = None

Fields

mimetype

mimetype: Optional[str]

MIME type of the file (e.g., "application/pdf", "text/html", "image/png").

Detected from:

HTTP Content-Type header
File extension guessing
Magika content analysis
Data URI scheme

extension

extension: Optional[str]

File extension including the leading dot (e.g., ".pdf", ".docx", ".html").

Detected from:

Local file path
URL path
HTTP Content-Disposition header filename
MIME type guessing
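The sources listed above can be approximated with the standard library. The sketch below is not MarkItDown's own detection code; the helper names extension_from_path and extension_from_url are hypothetical, and mimetypes stands in for whatever guessing MarkItDown performs internally.

```python
import mimetypes
import os
from typing import Optional
from urllib.parse import urlparse


def extension_from_path(path: str) -> Optional[str]:
    """Return the extension with its leading dot, or None if there is none."""
    ext = os.path.splitext(path)[1]
    return ext or None


def extension_from_url(url: str) -> Optional[str]:
    """Look only at the URL path, so query strings and fragments are ignored."""
    return extension_from_path(urlparse(url).path)


local_ext = extension_from_path("/tmp/report.pdf")
url_ext = extension_from_url("https://example.com/docs/report.docx?download=1#top")
no_ext = extension_from_path("README")

# The mapping between extensions and MIME types works in both directions.
guessed_mimetype = mimetypes.guess_type("report.pdf")[0]
guessed_extension = mimetypes.guess_extension("application/pdf")
```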

charset

charset: Optional[str]

Character encoding (e.g., "utf-8", "iso-8859-1", "windows-1252").

Detected from:

HTTP Content-Type header
Data URI attributes
charset_normalizer analysis for text files
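To make the header-based case concrete, here is a minimal sketch of splitting a Content-Type value into a MIME type and a charset using the standard library's email.message module. The function name parse_content_type is hypothetical, and MarkItDown may parse the header differently.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Return (mimetype, charset); charset is None when the header omits it."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_type(), msg.get_content_charset()


html = parse_content_type("text/html; charset=ISO-8859-1")
pdf = parse_content_type("application/pdf")
```

Binary formats such as PDF normally carry no charset parameter, which is one reason the field is often None.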

filename

filename: Optional[str]

Original filename (e.g., "report.pdf", "data.json").

Detected from:

Local file path (basename)
HTTP Content-Disposition header
URL path (if it looks like a file)
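The sketch below shows one way each of these sources could yield a filename. The helper names are hypothetical, and the rule used for "looks like a file" (the last path segment has an extension) is an assumption rather than MarkItDown's documented heuristic.

```python
import os
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import unquote, urlparse


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition header."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def filename_from_url(url: str) -> Optional[str]:
    """Return the last URL path segment if it has an extension (assumed heuristic)."""
    name = posixpath.basename(unquote(urlparse(url).path))
    return name if os.path.splitext(name)[1] else None


from_path = os.path.basename("/tmp/report.pdf")
from_header = filename_from_content_disposition('attachment; filename="report.pdf"')
from_url = filename_from_url("https://example.com/files/data.json?v=2")
not_a_file = filename_from_url("https://example.com/articles/latest/")
```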

local_path

local_path: Optional[str]

Full path to the file on disk if read from local filesystem.

Only set when using:

convert_local()
convert() with a local path
convert_uri() with file:// scheme

url

url: Optional[str]

Source URL if the file was fetched from the web.

Set when using:

convert_url() / convert_uri() with HTTP(S) URLs
convert_response() (from response.url)
Manually via stream_info parameter

Methods

copy_and_update()

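def copy_and_update(self, *args, **kwargs) -> StreamInfo

Return a new StreamInfo with updated fields. Because the dataclass is frozen, the original instance is never modified. Non-None fields from positional StreamInfo arguments override existing values, and keyword arguments are applied last.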

args (StreamInfo): One or more StreamInfo instances to merge. Later arguments take precedence.

kwargs (dict): Individual field values to update (mimetype, extension, etc.).

Returns (StreamInfo): New StreamInfo instance with merged values.

Example

from markitdown import StreamInfo

# Create base info
base = StreamInfo(extension=".txt", charset="utf-8")

# Update with additional info
updated = base.copy_and_update(
    mimetype="text/plain",
    url="https://example.com/file.txt"
)

print(updated.extension)  # .txt (from base)
print(updated.charset)    # utf-8 (from base)
print(updated.mimetype)   # text/plain (new)
print(updated.url)        # https://example.com/file.txt (new)

Merging Multiple StreamInfo Objects

info1 = StreamInfo(extension=".pdf", charset="utf-8")
info2 = StreamInfo(mimetype="application/pdf", url="https://example.com")

# Merge multiple sources
merged = info1.copy_and_update(info2)

print(merged.extension)   # .pdf (from info1)
print(merged.charset)     # utf-8 (from info1)
print(merged.mimetype)    # application/pdf (from info2)
print(merged.url)         # https://example.com (from info2)

Usage Examples

Basic Creation

from markitdown import StreamInfo

# Create with known metadata
info = StreamInfo(
    extension=".docx",
    mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    charset="utf-8"
)

With Conversion Methods

from markitdown import MarkItDown, StreamInfo

md = MarkItDown()

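# Provide metadata when the format is ambiguous.
# "upload.bin" is a placeholder path; any binary file-like object works.
with open("upload.bin", "rb") as file_stream:
    result = md.convert_stream(
        file_stream,
        stream_info=StreamInfo(
            extension=".pdf",
            mimetype="application/pdf"
        )
    )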

Override Detected Metadata

# Force specific handling even if detection differs
result = md.convert(
    "document.dat",  # Unknown extension
    stream_info=StreamInfo(
        extension=".docx",
        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
)

In Custom Converters

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class CustomConverter(DocumentConverter):
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # Check multiple fields
        if stream_info.extension == ".custom":
            return True
        
        if stream_info.mimetype == "application/x-custom":
            return True
        
        # URL-based routing
        if stream_info.url and "custom.com" in stream_info.url:
            return True
        
        return False
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Use metadata in conversion
        title = stream_info.filename or "Untitled"
        source = stream_info.url or stream_info.local_path or "Unknown"
        
        markdown = f"# {title}\n\nSource: {source}\n\n"
        # ... conversion logic ...
        
        return DocumentConverterResult(markdown=markdown, title=title)

Metadata Detection

Automatic Detection Flow

MarkItDown builds StreamInfo automatically from whatever the source provides, so the populated fields depend on the entry point:

md = MarkItDown()

# 1. From local path
result = md.convert("document.pdf")
# StreamInfo:
#   extension: ".pdf" (from filename)
#   filename: "document.pdf" (from path)
#   local_path: "/full/path/to/document.pdf"
#   mimetype: "application/pdf" (guessed from extension)

# 2. From HTTP response
result = md.convert_url("https://example.com/doc.docx")
# StreamInfo:
#   url: "https://example.com/doc.docx"
#   filename: "doc.docx" (from URL or Content-Disposition)
#   extension: ".docx" (from filename)
#   mimetype: "application/vnd..." (from Content-Type header)
#   charset: "utf-8" (from Content-Type header)

# 3. From data URI
result = md.convert_uri("data:text/plain;charset=utf-8,Hello")
# StreamInfo:
#   mimetype: "text/plain" (from URI)
#   charset: "utf-8" (from URI)
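For the data URI case, the metadata sits in the part of the URI before the first comma. The sketch below shows how that header could be split into a MIME type, a charset, and the decoded payload. The function name split_data_uri is hypothetical and this is not MarkItDown's parser.

```python
import base64
from typing import Optional, Tuple
from urllib.parse import unquote_to_bytes


def split_data_uri(uri: str) -> Tuple[Optional[str], Optional[str], bytes]:
    """Return (mimetype, charset, data) for a data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")

    parts = header.split(";")
    is_base64 = parts[-1] == "base64"
    if is_base64:
        parts.pop()

    mimetype = parts[0] or None
    attributes = {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        attributes[key.strip().lower()] = value.strip()

    data = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return mimetype, attributes.get("charset"), data


plain = split_data_uri("data:text/plain;charset=utf-8,Hello")
encoded = split_data_uri("data:;base64,SGVsbG8=")
```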

Magika Enhancement

MarkItDown uses Magika to analyze file content and enhance metadata:

import io

# Binary stream with no metadata (pdf_bytes holds the raw bytes of a PDF, loaded elsewhere)
stream = io.BytesIO(pdf_bytes)

# MarkItDown uses Magika to detect format
result = md.convert_stream(stream)
# StreamInfo enriched with:
#   mimetype: "application/pdf" (detected by Magika)
#   extension: ".pdf" (from Magika's format database)
#   charset: set only for text content, e.g. "utf-8" (via charset_normalizer)
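Magika itself is a trained model, so its internals are not reproduced here. As a much simpler illustration of the same idea (deriving mimetype and extension from content alone), the sketch below checks the leading bytes against two well-known file signatures. The names SIGNATURES and sniff are hypothetical.

```python
from typing import Optional, Tuple

# Leading-byte signatures and the metadata they imply.
SIGNATURES = (
    (b"%PDF-", "application/pdf", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png", ".png"),
)


def sniff(data: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (mimetype, extension) for a known signature, else (None, None)."""
    for magic, mimetype, extension in SIGNATURES:
        if data.startswith(magic):
            return mimetype, extension
    return None, None


pdf_guess = sniff(b"%PDF-1.7\n% example bytes")
unknown_guess = sniff(b"hello world")
```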

Common Patterns

Check Available Metadata

class MyConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info: StreamInfo, **kwargs):
        # Prefer extension (most reliable)
        if stream_info.extension:
            return stream_info.extension in [".md", ".markdown"]
        
        # Fall back to mimetype
        if stream_info.mimetype:
            return stream_info.mimetype == "text/markdown"
        
        # No metadata available
        return False
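Note that the pattern above stops at the extension whenever one is present, so a stream with extension ".txt" and mimetype "text/markdown" is rejected. A variant that consults both fields, tolerates None, and ignores case might look like the sketch below. The helper name looks_like_markdown and the two constants are hypothetical, and "text/x-markdown" is included only as an example of an alternative type.

```python
from typing import Optional

ACCEPTED_EXTENSIONS = (".md", ".markdown")
ACCEPTED_MIME_PREFIXES = ("text/markdown", "text/x-markdown")


def looks_like_markdown(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Accept if either field matches; missing fields are treated as empty."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match, so parameters such as "; charset=utf-8" do not matter.
    return mime.startswith(ACCEPTED_MIME_PREFIXES)
```

Inside accepts(), this would be called as looks_like_markdown(stream_info.extension, stream_info.mimetype).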

Handle Missing Metadata

def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
    # Provide defaults for missing metadata
    title = stream_info.filename or "Untitled Document"
    
    if stream_info.local_path:
        source = f"File: {stream_info.local_path}"
    elif stream_info.url:
        source = f"URL: {stream_info.url}"
    else:
        source = "Unknown source"
    
    markdown = f"# {title}\n\n*Source: {source}*\n\n"
    # ...

Build StreamInfo Incrementally

# Start with empty
info = StreamInfo()

# Add URL info
if url:
    info = info.copy_and_update(url=url)

# Add detected format
if detected_extension:
    info = info.copy_and_update(extension=detected_extension)

# Add encoding
if charset:
    info = info.copy_and_update(charset=charset)

# Use final info
result = md.convert_stream(stream, stream_info=info)

See Also

Core

Converters

Exceptions