Document model

Search Toolkit represents every ingested source with a single, unified document model. Two classes carry the data through the whole pipeline:

Document: the full output of an extractor for one source.
DocumentChunk: a retrievable slice of that document, optionally carrying an embedding.

Both are tied together by a deterministic identity derived from a source_id and a locator. The same identity is mirrored on the retrieval side by SearchResultChunk, so a chunk keeps the same id from ingestion all the way through search.

Information

Earlier preview releases exposed a separate page representation between documents and chunks. This has been removed. Extractors now produce DocumentChunk objects directly.

Document

A Document is what an extractor produces for a single source. Its id is computed automatically from source_id, so you rarely set it by hand.

class Document:
    id: str                       # deterministic, computed from source_id
    source_id: str                # stable identifier of the source
    content: str                  # full extracted text
    chunks: list[DocumentChunk]   # the document's chunks
    metadata: DocumentMetadata    # extensible, immutable metadata

DocumentChunk

A DocumentChunk is the unit that gets indexed and retrieved. Its id is computed from source_id + locator, and parent_ref points back to the id of the Document it belongs to.

class DocumentChunk:
    id: str                          # deterministic, computed from source_id + locator
    source_id: str                   # same source_id as the parent document
    locator: str                     # semantic position within the source
    start_offset: int                # inclusive character offset
    end_offset: int                  # exclusive character offset
    parent_ref: str | None           # id of the parent Document
    chunk_type: ChunkType            # content, image_annotation, or summary
    content: str                     # the chunk text
    metadata: DocumentChunkMetadata  # extensible, immutable metadata
    embedding: list[float] | None    # populated once embedded

Identity: source_id, locator, and deterministic ids

source_id

source_id is the stable identifier of the source document, such as a file path, a URL, or a custom scheme like arxiv:1706.03762. It is set on File.source_id (defaulting to the file path or name), and extractors stamp it onto the resulting document and every one of its chunks.

Because identity flows from source_id rather than from where the bytes happen to live, you can decouple a document's identity from its storage location: re-ingesting the same logical source produces the same ids even if the file has moved.

locator

locator describes the semantic position of a chunk within its source. The built-in formats are:

char:{start}-{end}: a character range.
page:{n}:char:{start}-{end}: a character range on a known page, for paginated sources.

When a chunk is not plain content, its type prefixes the locator, for example summary:char:0-512 or image_annotation:page:2:char:0-128.
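
The locator formats above can be illustrated with a small helper. This is a minimal sketch, not the toolkit's own code: build_locator and the ChunkType enum shown here are hypothetical stand-ins, and only the string formats come from the description above.

```python
from enum import Enum
from typing import Optional


class ChunkType(str, Enum):
    # Hypothetical stand-in for the toolkit's ChunkType; the values mirror
    # the three chunk types named in this document.
    CONTENT = "content"
    IMAGE_ANNOTATION = "image_annotation"
    SUMMARY = "summary"


def build_locator(
    start: int,
    end: int,
    page: Optional[int] = None,
    chunk_type: ChunkType = ChunkType.CONTENT,
) -> str:
    """Format a locator string following the built-in formats (illustrative helper)."""
    locator = f"char:{start}-{end}"
    if page is not None:
        # Paginated sources put the page number in front of the character range.
        locator = f"page:{page}:{locator}"
    if chunk_type is not ChunkType.CONTENT:
        # Non-content chunks carry their type as a prefix.
        locator = f"{chunk_type.value}:{locator}"
    return locator
```

Under these assumptions, a summary chunk covering the first 512 characters and an image annotation on page 2 produce the two example locators given above.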

Deterministic ids

Ids are UUIDv5 values (name-based UUIDs computed with SHA-1) derived from the identity fields, not random values:

Document.id = hash of source_id
DocumentChunk.id = hash of source_id + locator
parent_ref = hash of source_id (so a chunk always resolves back to its document)

Because ids are derived from source_id and locator rather than generated at random, re-ingesting the same source overwrites the same records instead of creating duplicates, which makes indexing idempotent.
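
To make the derivation concrete, here is a minimal sketch using Python's standard uuid module. The namespace UUID and the way source_id and locator are combined are assumptions made for illustration, since this document does not state them, so the values produced here will not necessarily match the ids the toolkit generates.

```python
import uuid

# Assumption: placeholder namespace; the toolkit's actual namespace is not documented here.
NAMESPACE = uuid.NAMESPACE_URL


def document_id(source_id: str) -> str:
    """Derive a document id from source_id alone."""
    return str(uuid.uuid5(NAMESPACE, source_id))


def chunk_id(source_id: str, locator: str) -> str:
    """Derive a chunk id from source_id and locator."""
    # Assumption: the two fields are joined with a separator before hashing.
    return str(uuid.uuid5(NAMESPACE, f"{source_id}|{locator}"))


def parent_ref_for(source_id: str) -> str:
    """A chunk's parent_ref equals the id of the document it belongs to."""
    return document_id(source_id)


if __name__ == "__main__":
    source = "arxiv:1706.03762"
    first = chunk_id(source, "char:0-512")
    again = chunk_id(source, "char:0-512")
    # Same inputs always give the same id, which is what makes re-ingestion idempotent.
    print(first == again)
    print(parent_ref_for(source) == document_id(source))
```

Whatever the exact namespace and encoding, the property that matters is that the same source_id and locator always yield the same id.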

Chunk types

chunk_type distinguishes the kinds of chunks a document can contain:

| Type | Description |
| --- | --- |
| `content` | A slice of the document's main text. |
| `image_annotation` | Text describing an image (for example an OCR caption). |
| `summary` | A generated summary of the document or a section. |

Metadata

DocumentMetadata and DocumentChunkMetadata are extensible, so you can add your own keys, and immutable once created. The toolkit ships typed subtypes for common cases, such as DocumentFileMetadata (filename, filepath) and PagedDocumentChunkMetadata (page_number).

A metadata key must not collide with a model field name (for example, you cannot put source_id in metadata), since that would be ambiguous when the chunk is persisted.
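
The collision rule can be sketched as a simple pre-check on custom keys. This is an illustrative helper, not the toolkit's validation logic: validate_metadata_keys is a hypothetical name, and the reserved set is just the DocumentChunk field names listed earlier.

```python
# Field names of DocumentChunk as listed in this document.
RESERVED_CHUNK_FIELDS = frozenset({
    "id", "source_id", "locator", "start_offset", "end_offset",
    "parent_ref", "chunk_type", "content", "metadata", "embedding",
})


def validate_metadata_keys(metadata: dict) -> None:
    """Raise ValueError if any custom metadata key shadows a model field name."""
    collisions = sorted(set(metadata) & RESERVED_CHUNK_FIELDS)
    if collisions:
        raise ValueError(
            f"metadata keys collide with model fields: {', '.join(collisions)}"
        )
```

Calling validate_metadata_keys({"author": "Vaswani"}) passes silently, while validate_metadata_keys({"source_id": "x"}) raises, because source_id would then appear twice on the persisted chunk.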

Search results

On the retrieval side, SearchResultChunk carries the same identity contract: id, source_id, locator, parent_ref, and chunk_type. Every result maps straight back to the chunk that was ingested.
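
Because every result carries parent_ref, results can be grouped by document without any extra lookup. The sketch below uses a hypothetical Result dataclass that holds only the identity fields listed above. It is not the real SearchResultChunk class, which may carry additional fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Result:
    # Hypothetical stand-in holding only the shared identity fields.
    id: str
    source_id: str
    locator: str
    parent_ref: str
    chunk_type: str


def group_by_document(results: Iterable[Result]) -> Dict[str, List[str]]:
    """Map each parent document id to the locators of its retrieved chunks."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for result in results:
        grouped[result.parent_ref].append(result.locator)
    return dict(grouped)
```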