The StreamInfo class stores metadata about a file stream, including mimetype, extension, charset, and source information. It’s used throughout MarkItDown to help converters identify and process files correctly.
Overview
StreamInfo is a frozen dataclass that contains optional metadata fields. All fields can be None, depending on how the stream was opened and what information is available.
@dataclass(kw_only=True, frozen=True)
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None
    local_path: Optional[str] = None
    url: Optional[str] = None
Fields
mimetype
MIME type of the file (e.g., "application/pdf", "text/html", "image/png").
Detected from:
- HTTP Content-Type header
- File extension guessing
- Magika content analysis
- Data URI scheme
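As a rough illustration of the "file extension guessing" source, the following sketch uses Python's standard `mimetypes` module. The helper name `guess_mimetype_from_name` is hypothetical, and this is not necessarily how MarkItDown performs the lookup internally.

```python
import mimetypes
from typing import Optional


def guess_mimetype_from_name(name: str) -> Optional[str]:
    """Return a MIME type guessed from the file name's extension, or None."""
    mimetype, _encoding = mimetypes.guess_type(name)
    return mimetype


pdf_type = guess_mimetype_from_name("report.pdf")
unknown_type = guess_mimetype_from_name("report.unknown-ext")
```

An unrecognized extension yields None, which matches the way every StreamInfo field may be left unset.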
extension
File extension including the leading dot (e.g., ".pdf", ".docx", ".html").
Detected from:
- Local file path
- URL path
- HTTP Content-Disposition header filename
- MIME type guessing
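The URL-path and MIME-type sources listed above can be approximated with the standard library alone. This is a minimal sketch under that assumption; the helper names `extension_from_url` and `extension_from_mimetype` are hypothetical and are not part of MarkItDown.

```python
import mimetypes
import posixpath
from typing import Optional
from urllib.parse import urlparse


def extension_from_url(url: str) -> Optional[str]:
    """Return the extension (with leading dot) of the URL's path, or None."""
    path = urlparse(url).path  # drops the query string and fragment
    _root, ext = posixpath.splitext(path)
    return ext or None


def extension_from_mimetype(mimetype: str) -> Optional[str]:
    """Return an extension guessed from a MIME type, or None if unknown."""
    return mimetypes.guess_extension(mimetype)
```

Both helpers return the extension with its leading dot, the same convention the `extension` field uses.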
charset
Character encoding (e.g., "utf-8", "iso-8859-1", "windows-1252").
Detected from:
- HTTP Content-Type header
- Data URI attributes
- charset_normalizer analysis for text files
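To make the Content-Type source concrete, here is one way to split such a header into a MIME type and a charset with the standard library's `email.message.Message`. The function name `parse_content_type` is hypothetical, and MarkItDown may parse the header differently.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (mimetype, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the charset and returns None if absent.
    return msg.get_content_type(), msg.get_content_charset()


html = parse_content_type("text/html; charset=ISO-8859-1")
pdf = parse_content_type("application/pdf")
```

A header without a charset parameter leaves the charset as None, so a later stage such as content analysis can still fill it in.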
filename
Original filename (e.g., "report.pdf", "data.json").
Detected from:
- Local file path (basename)
- HTTP Content-Disposition header
- URL path (if it looks like a file)
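The first two filename sources can be sketched with the standard library as follows. The helper names are hypothetical and the snippet only illustrates the idea; it is not MarkItDown's own code.

```python
import os
from email.message import Message
from typing import Optional


def filename_from_path(path: str) -> Optional[str]:
    """Return the basename of a local path, or None if the path has none."""
    return os.path.basename(path) or None


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Return the filename parameter of a Content-Disposition header, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()
```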
local_path
Full path to the file on disk if read from local filesystem.
Only set when using:
- convert_local()
- convert() with a local path
- convert_uri() with file:// scheme
url
Source URL if the file was fetched from the web.
Set when using:
- convert_url() / convert_uri() with HTTP(S) URLs
- convert_response() (from response.url)
- Manually via stream_info parameter
Methods
copy_and_update()
def copy_and_update(*args, **kwargs) -> StreamInfo
Creates a copy of the StreamInfo with updated fields. Non-None values from the arguments override existing values.
Parameters:
- *args: One or more StreamInfo instances to merge. Later arguments take precedence.
- **kwargs: Individual field values to update (mimetype, extension, etc.).
Returns: A new StreamInfo instance with the merged values.
Example
from markitdown import StreamInfo
# Create base info
base = StreamInfo(extension=".txt", charset="utf-8")
# Update with additional info
updated = base.copy_and_update(
    mimetype="text/plain",
    url="https://example.com/file.txt"
)
print(updated.extension) # .txt (from base)
print(updated.charset) # utf-8 (from base)
print(updated.mimetype) # text/plain (new)
print(updated.url) # https://example.com/file.txt (new)
Merging Multiple StreamInfo Objects
info1 = StreamInfo(extension=".pdf", charset="utf-8")
info2 = StreamInfo(mimetype="application/pdf", url="https://example.com")
# Merge multiple sources
merged = info1.copy_and_update(info2)
print(merged.extension) # .pdf (from info1)
print(merged.charset) # utf-8 (from info1)
print(merged.mimetype) # application/pdf (from info2)
print(merged.url) # https://example.com (from info2)
Usage Examples
Basic Creation
from markitdown import StreamInfo
# Create with known metadata
info = StreamInfo(
    extension=".docx",
    mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    charset="utf-8"
)
With Conversion Methods
from markitdown import MarkItDown, StreamInfo
md = MarkItDown()
# Provide metadata when format is ambiguous
result = md.convert_stream(
    file_stream,
    stream_info=StreamInfo(
        extension=".pdf",
        mimetype="application/pdf"
    )
)
# Force specific handling even if detection differs
result = md.convert(
    "document.dat",  # Unknown extension
    stream_info=StreamInfo(
        extension=".docx",
        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
)
In Custom Converters
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from typing import BinaryIO, Any
class CustomConverter(DocumentConverter):
    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        # Check multiple fields
        if stream_info.extension == ".custom":
            return True
        if stream_info.mimetype == "application/x-custom":
            return True
        # URL-based routing
        if stream_info.url and "custom.com" in stream_info.url:
            return True
        return False

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # Use metadata in conversion
        title = stream_info.filename or "Untitled"
        source = stream_info.url or stream_info.local_path or "Unknown"
        markdown = f"# {title}\n\nSource: {source}\n\n"
        # ... conversion logic ...
        return DocumentConverterResult(markdown=markdown, title=title)
Automatic Detection Flow
MarkItDown automatically builds StreamInfo through multiple detection stages:
md = MarkItDown()
# 1. From local path
result = md.convert("document.pdf")
# StreamInfo:
# extension: ".pdf" (from filename)
# filename: "document.pdf" (from path)
# local_path: "/full/path/to/document.pdf"
# mimetype: "application/pdf" (guessed from extension)
# 2. From HTTP response
result = md.convert_url("https://example.com/doc.docx")
# StreamInfo:
# url: "https://example.com/doc.docx"
# filename: "doc.docx" (from URL or Content-Disposition)
# extension: ".docx" (from filename)
# mimetype: "application/vnd..." (from Content-Type header)
# charset: "utf-8" (from Content-Type header)
# 3. From data URI
result = md.convert_uri("data:text/plain;charset=utf-8,Hello")
# StreamInfo:
# mimetype: "text/plain" (from URI)
# charset: "utf-8" (from URI)
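For the data URI case, the MIME type and charset sit in the part of the URI before the first comma. The following sketch pulls them out by hand. The function name `parse_data_uri_header` is hypothetical, and this simplified parser is an assumption for illustration rather than MarkItDown's own logic.

```python
from typing import Optional, Tuple


def parse_data_uri_header(uri: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (mimetype, charset) from the header part of a data URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, _payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI is missing the comma before the payload")
    parts = header.split(";")
    mimetype = parts[0] or None  # the media type may be omitted
    attributes = {}
    for part in parts[1:]:
        # Flags such as "base64" carry no "=" and are ignored here.
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.lower()] = value
    return mimetype, attributes.get("charset")


plain = parse_data_uri_header("data:text/plain;charset=utf-8,Hello")
bare = parse_data_uri_header("data:;base64,SGVsbG8=")
```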
Magika Enhancement
MarkItDown uses Magika to analyze file content and enhance metadata:
import io
# Binary stream with no metadata
stream = io.BytesIO(pdf_bytes)
# MarkItDown uses Magika to detect format
result = md.convert_stream(stream)
# StreamInfo enriched with:
# mimetype: "application/pdf" (detected by Magika)
# extension: ".pdf" (from Magika's format database)
# charset: "utf-8" (for text files, via charset_normalizer)
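The idea of content-based detection can be illustrated with a far simpler technique than Magika: checking a file's leading "magic bytes". The sketch below is not Magika and not MarkItDown's code. It only shows how a format can be recognized from a stream that carries no metadata, and how the stream position is restored afterwards so a converter can still read from the start. The name `looks_like_pdf` is hypothetical.

```python
import io
from typing import BinaryIO

PDF_MAGIC = b"%PDF-"  # PDF files begin with this byte sequence


def looks_like_pdf(stream: BinaryIO) -> bool:
    """Peek at the first bytes of a seekable stream without consuming it."""
    start = stream.tell()
    try:
        header = stream.read(len(PDF_MAGIC))
    finally:
        stream.seek(start)  # leave the stream where it was found
    return header == PDF_MAGIC


pdf_stream = io.BytesIO(b"%PDF-1.7\n...")
text_stream = io.BytesIO(b"plain text")
```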
Common Patterns
class MyConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info: StreamInfo, **kwargs):
        # Prefer extension (most reliable)
        if stream_info.extension:
            return stream_info.extension in [".md", ".markdown"]
        # Fall back to mimetype
        if stream_info.mimetype:
            return stream_info.mimetype == "text/markdown"
        # No metadata available
        return False

    def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
        # Provide defaults for missing metadata
        title = stream_info.filename or "Untitled Document"
        if stream_info.local_path:
            source = f"File: {stream_info.local_path}"
        elif stream_info.url:
            source = f"URL: {stream_info.url}"
        else:
            source = "Unknown source"
        markdown = f"# {title}\n\n*Source: {source}*\n\n"
        # ...
Build StreamInfo Incrementally
# Start with empty
info = StreamInfo()
# Add URL info
if url:
    info = info.copy_and_update(url=url)
# Add detected format
if detected_extension:
    info = info.copy_and_update(extension=detected_extension)
# Add encoding
if charset:
    info = info.copy_and_update(charset=charset)
# Use final info
result = md.convert_stream(stream, stream_info=info)
See Also