Embeddings Configuration

amsdal_ml uses OpenAI's embedding models to convert text into vector representations for semantic search. This page covers how to configure the embedding system.

OpenAI Embedder

The default embedder uses OpenAI's text-embedding-3-small model. Provide your API key through the AMSDAL_ML_OPENAI_API_KEY environment variable:

export AMSDAL_ML_OPENAI_API_KEY='sk-...'

Embedding Parameters

Env Variable	Default	Description
`AMSDAL_ML_EMBED_MODEL_NAME`	`text-embedding-3-small`	OpenAI embedding model to use
`AMSDAL_ML_EMBED_DIMENSIONS`	`1536`	Vector dimensions (must match the model)

Available Models

Model	Dimensions	Description
`text-embedding-3-small`	1536	Fast, cost-effective (default)
`text-embedding-3-large`	3072	Higher quality, more expensive
`text-embedding-ada-002`	1536	Legacy model

To use a different model:

export AMSDAL_ML_EMBED_MODEL_NAME='text-embedding-3-large'
export AMSDAL_ML_EMBED_DIMENSIONS=3072
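
Because the two variables are set separately, it is easy to change the model and forget the dimensions. The following is a minimal sketch of a pre-flight check that mirrors the "Available Models" table. `read_embedding_settings` and `MODEL_DIMENSIONS` are hypothetical helpers written for this page, not part of amsdal_ml.

```python
import os

# Dimensions per model, copied from the "Available Models" table.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def read_embedding_settings(env=None):
    """Read the embedding settings and check that the dimensions fit the model.

    Hypothetical helper for illustration; amsdal_ml does its own config loading.
    """
    env = os.environ if env is None else env
    model = env.get("AMSDAL_ML_EMBED_MODEL_NAME", "text-embedding-3-small")
    dimensions = int(env.get("AMSDAL_ML_EMBED_DIMENSIONS", "1536"))
    expected = MODEL_DIMENSIONS.get(model)
    # Model names missing from the table are passed through unchecked.
    if expected is not None and dimensions != expected:
        raise ValueError(
            f"{model} is listed with {expected} dimensions, "
            f"but AMSDAL_ML_EMBED_DIMENSIONS is {dimensions}"
        )
    return model, dimensions
```

Calling `read_embedding_settings()` with no arguments reads the real process environment. Passing a plain dict instead makes the check easy to exercise in a test.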

How Embeddings Work

When you index an AMSDAL model object, the system:

1. Walks the object recursively (default depth 2), collecting field values and related objects
2. Generates text: structured facts about the object (field names + values)
3. Splits the text into chunks (default ≤800 tokens and ≤7 sentences per chunk; sentences shorter than 4 words are dropped), respecting sentence boundaries
4. Embeds each chunk via the OpenAI Embeddings API
5. Stores the embeddings as EmbeddingModel records in the database, linked to the source object
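
The walk and text-generation steps can be pictured with a small sketch. It uses nested dicts in place of AMSDAL objects, and it assumes that "depth 2" means the source object plus its directly related objects. Both the `collect_facts` helper and that reading of the depth limit are assumptions made for illustration, not the library's implementation.

```python
def collect_facts(obj, max_depth=2, prefix="", depth=0):
    """Flatten a nested dict into 'path: value' fact strings.

    Hypothetical stand-in: dicts play the role of model objects, and the
    root object counts as the first level of depth.
    """
    facts = []
    for name, value in obj.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # Follow a related object only while below the depth limit.
            if depth + 1 < max_depth:
                facts.extend(collect_facts(value, max_depth, f"{path}.", depth + 1))
        else:
            facts.append(f"{path}: {value}")
    return facts


order = {"id": 7, "customer": {"name": "Ada", "address": {"city": "London"}}}
facts = collect_facts(order)
# facts == ["id: 7", "customer.name: Ada"]; the address is one level too deep.
```

Raising `max_depth` to 3 in this sketch would also yield `customer.address.city: London`, at the cost of more text to chunk and embed.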

EmbeddingModel

Each chunk is stored as one EmbeddingModel record with the following fields:

Field	Description
`data_object_class`	Source model class name
`data_object_id`	Source object ID
`chunk_index`	Chunk number (0-based)
`raw_text`	The text that was embedded
`embedding`	Vector embedding (VectorField)
`tags`	List of string tags for filtering
`ml_metadata`	Optional metadata (type `Any`, default `None`)
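
To show how chunks map onto these fields, here is a sketch that uses a plain dataclass as a stand-in for the real model. `ChunkRecord` and `build_records` are hypothetical names, the ID is assumed to be a string, and the embedding is a plain list of floats rather than a VectorField.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkRecord:
    """Plain stand-in for an EmbeddingModel record (not the real class)."""

    data_object_class: str
    data_object_id: str  # assumed to be a string for this sketch
    chunk_index: int
    raw_text: str
    embedding: list[float]
    tags: list[str] = field(default_factory=list)
    ml_metadata: Any = None


def build_records(class_name, object_id, chunks, vectors, tags=()):
    """Pair each chunk with its vector, numbering chunks from 0."""
    if len(chunks) != len(vectors):
        raise ValueError("expected one vector per chunk")
    return [
        ChunkRecord(class_name, object_id, index, text, vector, list(tags))
        for index, (text, vector) in enumerate(zip(chunks, vectors))
    ]
```

Every record produced for one source object shares the same `data_object_class` and `data_object_id`, so `chunk_index` is what tells the chunks apart and preserves their order.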

Custom Embedder

You can provide a custom embedder by subclassing Embedder and implementing both the synchronous embed method and its asynchronous counterpart, aembed:

from amsdal_ml.ml_ingesting.embedders.embedder import Embedder

class MyEmbedder(Embedder):
    def embed(self, text: str) -> list[float]:
        # Return embedding for the text
        ...

    async def aembed(self, text: str) -> list[float]:
        # Async version
        ...
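
As one possible way to fill in the two methods, the sketch below is a deterministic toy embedder that needs no network access, which can be handy in tests. It is written as a standalone class so it runs without amsdal_ml installed; in a real project it would inherit from Embedder. The `HashEmbedder` name, the `dimensions` constructor argument, and the hashing scheme are assumptions for illustration, and the vectors carry no semantic meaning.

```python
import asyncio
import hashlib
import math


class HashEmbedder:
    """Deterministic toy embedder; a real one would subclass Embedder."""

    def __init__(self, dimensions: int = 8) -> None:
        # Hypothetical argument: the base class constructor is not shown here.
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        # Hash each whitespace-separated token into a bucket and count hits.
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # L2-normalise so every non-empty text maps to a unit vector.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    async def aembed(self, text: str) -> list[float]:
        # Run the synchronous version in a worker thread (Python 3.9+).
        return await asyncio.to_thread(self.embed, text)
```

Delegating `aembed` to `embed` keeps the two code paths identical. An embedder that calls a remote service would more likely use that service's async client directly.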

Chunking Strategy

The default chunking strategy splits text by sentences, accumulating up to 800 tokens and at most 7 sentences per chunk, and drops sentences shorter than 4 words. This ensures:

- Chunks respect natural sentence boundaries
- Each chunk is small enough for the embedding model's context window
- Related facts stay together within a chunk
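
The accumulation rule can be sketched as follows. This is a simplified illustration, not the code inside DefaultChunkStrategy: `chunk_sentences` is a hypothetical function, sentences are split with a naive regular expression, and a whitespace word count stands in for a real tokenizer.

```python
import re


def chunk_sentences(text, max_tokens=800, max_sentences=7, min_words=4):
    """Group sentences into chunks without ever splitting a sentence."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        tokens = len(sentence.split())  # word count approximates token count
        if tokens < min_words:
            continue  # drop very short sentences
        over_budget = current_tokens + tokens > max_tokens
        # Close the chunk when either limit would be exceeded. A single
        # sentence longer than max_tokens still becomes its own chunk.
        if current and (over_budget or len(current) >= max_sentences):
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The order has id seven. Ok. "
    "The customer is named Ada Lovelace. The city of delivery is London."
)
chunks = chunk_sentences(text, max_sentences=2)
# Two chunks: the first holds two sentences, and "Ok." is dropped as too short.
```

Closing a chunk only between sentences is what keeps each fact intact, since a field name and its value always sit in the same sentence.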

Chunking parameters (walk depth, max chunks, tokens per chunk) are fixed by DefaultChunkStrategy. To change them, pass a custom ChunkStrategy to DefaultIngesting:

from amsdal_ml.ml_ingesting.chunk_strategy import ChunkStrategy, ChunkParams


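class BigChunks(ChunkStrategy):
    """Walk one level deeper and allow larger chunks than the defaults."""

    def get_params(self, obj) -> ChunkParams:
        # Defaults described above: depth 2 and 800 tokens per chunk.
        return ChunkParams(max_depth=3, max_chunks=20, max_tokens_per_chunk=1200)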