The StreamInfo class stores metadata about a file stream: its MIME type, extension, charset, and source (filename, local path, or URL). MarkItDown passes it to converters so they can identify and process files correctly.

Overview

StreamInfo is a frozen, keyword-only dataclass whose fields are all optional. Each field defaults to None and stays None when that information is not available for the way the stream was opened.
from dataclasses import dataclass
from typing import Optional

@dataclass(kw_only=True, frozen=True)
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None
    local_path: Optional[str] = None
    url: Optional[str] = None

Fields

mimetype

mimetype: Optional[str]
MIME type of the file (e.g., "application/pdf", "text/html", "image/png").
Detected from:
  • HTTP Content-Type header
  • File extension guessing
  • Magika content analysis
  • Data URI scheme
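
As a rough illustration of the "file extension guessing" source, the standard library's `mimetypes` module can map a filename or extension to a MIME type. The helper name `guess_mimetype` is hypothetical, and MarkItDown's own guessing logic may differ in detail.

```python
import mimetypes
from typing import Optional


def guess_mimetype(name_or_ext: str) -> Optional[str]:
    """Guess a MIME type from a filename or a bare extension (hypothetical helper)."""
    # guess_type expects a filename or URL, so give a bare extension a dummy stem.
    name = "file" + name_or_ext if name_or_ext.startswith(".") else name_or_ext
    mimetype, _encoding = mimetypes.guess_type(name)
    return mimetype


pdf_type = guess_mimetype(".pdf")
html_type = guess_mimetype("index.html")
# Unregistered extensions yield None, which matches StreamInfo's "unknown" value.
unknown_type = guess_mimetype(".zzz-unknown")
```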

extension

extension: Optional[str]
File extension including the leading dot (e.g., ".pdf", ".docx", ".html").
Detected from:
  • Local file path
  • URL path
  • HTTP Content-Disposition header filename
  • MIME type guessing
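
The sources listed above can be approximated with the standard library. This is a sketch only: the three helper names are hypothetical, and the library's actual precedence between sources is not shown here.

```python
import mimetypes
import os
from typing import Optional
from urllib.parse import urlparse


def extension_from_path(path: str) -> Optional[str]:
    """Return the extension with its leading dot, or None if there is none."""
    ext = os.path.splitext(path)[1]
    return ext or None


def extension_from_url(url: str) -> Optional[str]:
    # Only the path component matters; query strings and fragments are ignored.
    return extension_from_path(urlparse(url).path)


def extension_from_mimetype(mimetype: str) -> Optional[str]:
    # Reverse lookup: MIME type to a conventional extension.
    return mimetypes.guess_extension(mimetype)


local_ext = extension_from_path("/tmp/report.pdf")
url_ext = extension_from_url("https://example.com/docs/guide.html?v=2#top")
no_ext = extension_from_url("https://example.com/")
mime_ext = extension_from_mimetype("application/pdf")
```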

charset

charset: Optional[str]
Character encoding (e.g., "utf-8", "iso-8859-1", "windows-1252").
Detected from:
  • HTTP Content-Type header
  • Data URI attributes
  • charset_normalizer analysis for text files
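
For the HTTP case, a Content-Type header carries both the MIME type and, optionally, the charset. One way to split them with the standard library is shown below; `parse_content_type` is a hypothetical helper, not a MarkItDown function.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (mimetype, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Both accessors lowercase their result; charset is None when absent.
    return msg.get_content_type(), msg.get_content_charset()


html_type, html_charset = parse_content_type("text/html; charset=ISO-8859-1")
pdf_type, pdf_charset = parse_content_type("application/pdf")
```

A missing charset maps naturally onto StreamInfo's None default, leaving room for content-based detection to fill it in later.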

filename

filename: Optional[str]
Original filename (e.g., "report.pdf", "data.json").
Detected from:
  • Local file path (basename)
  • HTTP Content-Disposition header
  • URL path (if it looks like a file)
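
A minimal sketch of the two remote sources follows. Both helper names are hypothetical, and the "looks like a file" rule used here (the last path segment has an extension) is an assumption, not necessarily the check MarkItDown applies.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import unquote, urlparse


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def filename_from_url(url: str) -> Optional[str]:
    """Use the last URL path segment as a filename if it has an extension."""
    name = posixpath.basename(unquote(urlparse(url).path))
    # Assumed heuristic: a segment without an extension is treated as a page, not a file.
    return name if posixpath.splitext(name)[1] else None


from_header = filename_from_content_disposition('attachment; filename="report.pdf"')
no_header_name = filename_from_content_disposition("inline")
from_url = filename_from_url("https://example.com/files/data.json?dl=1")
from_page_url = filename_from_url("https://example.com/articles/")
```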

local_path

local_path: Optional[str]
Full path to the file on disk, if it was read from the local filesystem.
Only set when using:
  • convert_local()
  • convert() with a local path
  • convert_uri() with file:// scheme

url

url: Optional[str]
Source URL if the file was fetched from the web.
Set when using:
  • convert_url() / convert_uri() with HTTP(S) URLs
  • convert_response() (from response.url)
  • Manually via stream_info parameter

Methods

copy_and_update()

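def copy_and_update(self, *args, **kwargs) -> StreamInfo
Create a copy of the StreamInfo with updated fields; the original instance is left unchanged. Non-None fields of the StreamInfo instances passed as positional arguments override existing values, and keyword arguments then set individual fields.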
Parameters:
  • args (StreamInfo): One or more StreamInfo instances to merge. Later arguments take precedence.
  • kwargs (dict): Individual field values to update (mimetype, extension, etc.).
Returns:
  • StreamInfo: New StreamInfo instance with merged values.

Example

from markitdown import StreamInfo

# Create base info
base = StreamInfo(extension=".txt", charset="utf-8")

# Update with additional info
updated = base.copy_and_update(
    mimetype="text/plain",
    url="https://example.com/file.txt"
)

print(updated.extension)  # .txt (from base)
print(updated.charset)    # utf-8 (from base)
print(updated.mimetype)   # text/plain (new)
print(updated.url)        # https://example.com/file.txt (new)

Merging Multiple StreamInfo Objects

info1 = StreamInfo(extension=".pdf", charset="utf-8")
info2 = StreamInfo(mimetype="application/pdf", url="https://example.com")

# Merge multiple sources
merged = info1.copy_and_update(info2)

print(merged.extension)   # .pdf (from info1)
print(merged.charset)     # utf-8 (from info1)
print(merged.mimetype)    # application/pdf (from info2)
print(merged.url)         # https://example.com (from info2)

Usage Examples

Basic Creation

from markitdown import StreamInfo

# Create with known metadata
info = StreamInfo(
    extension=".docx",
    mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    charset="utf-8"
)

With Conversion Methods

from markitdown import MarkItDown, StreamInfo

md = MarkItDown()

# Provide metadata when the format is ambiguous.
# file_stream is assumed to be a binary stream opened elsewhere.
result = md.convert_stream(
    file_stream,
    stream_info=StreamInfo(
        extension=".pdf",
        mimetype="application/pdf"
    )
)

Override Detected Metadata

# Force specific handling even if detection differs
result = md.convert(
    "document.dat",  # Unknown extension
    stream_info=StreamInfo(
        extension=".docx",
        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
)

In Custom Converters

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any

class CustomConverter(DocumentConverter):
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # Check multiple fields
        if stream_info.extension == ".custom":
            return True
        
        if stream_info.mimetype == "application/x-custom":
            return True
        
        # URL-based routing
        if stream_info.url and "custom.com" in stream_info.url:
            return True
        
        return False
    
    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Use metadata in conversion
        title = stream_info.filename or "Untitled"
        source = stream_info.url or stream_info.local_path or "Unknown"
        
        markdown = f"# {title}\n\nSource: {source}\n\n"
        # ... conversion logic ...
        
        return DocumentConverterResult(markdown=markdown, title=title)

Metadata Detection

Automatic Detection Flow

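MarkItDown builds StreamInfo automatically from whatever the source provides. The comments below show the fields typically populated for each kind of source:
from markitdown import MarkItDown

md = MarkItDown()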

# 1. From local path
result = md.convert("document.pdf")
# StreamInfo:
#   extension: ".pdf" (from filename)
#   filename: "document.pdf" (from path)
#   local_path: "/full/path/to/document.pdf"
#   mimetype: "application/pdf" (guessed from extension)

# 2. From HTTP response
result = md.convert_url("https://example.com/doc.docx")
# StreamInfo:
#   url: "https://example.com/doc.docx"
#   filename: "doc.docx" (from URL or Content-Disposition)
#   extension: ".docx" (from filename)
#   mimetype: "application/vnd..." (from Content-Type header)
#   charset: "utf-8" (from Content-Type header)

# 3. From data URI
result = md.convert_uri("data:text/plain;charset=utf-8,Hello")
# StreamInfo:
#   mimetype: "text/plain" (from URI)
#   charset: "utf-8" (from URI)
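
To show where the data URI values in case 3 come from, here is a small standalone parser for the `data:[<mimetype>][;attr=value][;base64],<payload>` layout. `parse_data_uri` is a hypothetical helper written for this illustration and is not MarkItDown's parser.

```python
import base64
from typing import Dict, Optional, Tuple
from urllib.parse import unquote_to_bytes


def parse_data_uri(uri: str) -> Tuple[Optional[str], Dict[str, str], bytes]:
    """Split a data URI into (mimetype, attributes, decoded payload bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI is missing the ',' separator")

    parts = header.split(";")
    # A trailing "base64" token marks the payload encoding; it is not an attribute.
    is_base64 = parts[-1].lower() == "base64"
    if is_base64:
        parts = parts[:-1]

    mimetype = parts[0] if parts and parts[0] else None
    attributes: Dict[str, str] = {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        attributes[key.lower()] = value

    data = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return mimetype, attributes, data


mimetype, attributes, data = parse_data_uri("data:text/plain;charset=utf-8,Hello")
charset = attributes.get("charset")
```

The mimetype and charset values recovered this way correspond to the two StreamInfo fields listed for the data URI case; filename, local_path, and url have no counterpart in a data URI.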

Magika Enhancement

MarkItDown uses Magika to analyze file content and enhance metadata:
import io

# Binary stream with no metadata.
# pdf_bytes is assumed to hold the raw bytes of a PDF loaded elsewhere.
stream = io.BytesIO(pdf_bytes)

# MarkItDown uses Magika to detect format
result = md.convert_stream(stream)
# StreamInfo enriched with:
#   mimetype: "application/pdf" (detected by Magika)
#   extension: ".pdf" (from Magika's format database)
#   charset: "utf-8" (for text files, via charset_normalizer)

Common Patterns

Check Available Metadata

class MyConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info: StreamInfo, **kwargs):
        # Normalize case; either field may be None
        extension = (stream_info.extension or "").lower()
        mimetype = (stream_info.mimetype or "").lower()

        # Prefer extension (most reliable)
        if extension in (".md", ".markdown"):
            return True

        # Fall back to mimetype, also when the extension is present but unrecognized
        if mimetype == "text/markdown":
            return True

        # No usable metadata
        return False

Handle Missing Metadata

def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
    # Provide defaults for missing metadata
    title = stream_info.filename or "Untitled Document"
    
    if stream_info.local_path:
        source = f"File: {stream_info.local_path}"
    elif stream_info.url:
        source = f"URL: {stream_info.url}"
    else:
        source = "Unknown source"
    
    markdown = f"# {title}\n\n*Source: {source}*\n\n"
    # ...

Build StreamInfo Incrementally

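# url, detected_extension, charset, stream, and md are assumed to be
# defined by the surrounding code.

# Start with an empty StreamInfo
info = StreamInfo()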

# Add URL info
if url:
    info = info.copy_and_update(url=url)

# Add detected format
if detected_extension:
    info = info.copy_and_update(extension=detected_extension)

# Add encoding
if charset:
    info = info.copy_and_update(charset=charset)

# Use final info
result = md.convert_stream(stream, stream_info=info)
