Embeddings Configuration¶
amsdal_ml uses OpenAI's embedding models to convert text into vector representations for semantic search. This page covers how to configure the embedding system.
OpenAI Embedder¶
The default embedder uses OpenAI's text-embedding-3-small model. Set your API key:
```bash
export OPENAI_API_KEY='sk-...'
```
Embedding Parameters¶
| Env Variable | Default | Description |
|---|---|---|
| `EMBED_MODEL_NAME` | `text-embedding-3-small` | OpenAI embedding model to use |
| `EMBED_DIMENSIONS` | `1536` | Vector dimensions (must match the model) |
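To illustrate how the two variables resolve to their defaults, here is a minimal sketch. `EmbedSettings` and `load_embed_settings` are hypothetical names used only for this example; amsdal_ml reads its settings internally and its actual loading code may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedSettings:
    """Hypothetical container for the two settings in the table above."""

    model_name: str
    dimensions: int


def load_embed_settings(env=None) -> EmbedSettings:
    """Read the embedding settings, using the documented defaults when unset."""
    env = os.environ if env is None else env
    return EmbedSettings(
        model_name=env.get("EMBED_MODEL_NAME", "text-embedding-3-small"),
        # Environment values are strings, so the dimension count is converted.
        dimensions=int(env.get("EMBED_DIMENSIONS", "1536")),
    )


if __name__ == "__main__":
    print(load_embed_settings())
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration without touching the real environment.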
Available Models¶
| Model | Dimensions | Description |
|---|---|---|
| `text-embedding-3-small` | 1536 | Fast, cost-effective (default) |
| `text-embedding-3-large` | 3072 | Higher quality, more expensive |
| `text-embedding-ada-002` | 1536 | Legacy model |
To use a different model:
```bash
export EMBED_MODEL_NAME='text-embedding-3-large'
export EMBED_DIMENSIONS=3072
```
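Because the two variables must agree, a quick check before indexing can catch a mismatch early. The sketch below is an illustration only: `KNOWN_DIMENSIONS` and `check_dimensions` are hypothetical names, and the values are copied from the Available Models table rather than taken from amsdal_ml.

```python
# Dimensions copied from the "Available Models" table (assumption: the table is current).
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_dimensions(model_name: str, dimensions: int) -> None:
    """Raise ValueError when the configured dimensions do not match the model."""
    expected = KNOWN_DIMENSIONS.get(model_name)
    if expected is None:
        # Unknown model: there is nothing to compare against, so accept it.
        return
    if dimensions != expected:
        raise ValueError(
            f"{model_name} expects {expected} dimensions, got {dimensions}"
        )


if __name__ == "__main__":
    check_dimensions("text-embedding-3-large", 3072)  # passes silently
```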
How Embeddings Work¶
When you index an AMSDAL model object, the system:
- Walks the object recursively (default depth 2) collecting field values and related objects
- Generates text — structured facts about the object (field names + values)
- Splits text into chunks (default ≤800 tokens and ≤7 sentences per chunk; sentences shorter than 4 words are dropped), respecting sentence boundaries
- Embeds each chunk via the OpenAI Embeddings API
- Stores embeddings as `EmbeddingModel` records in the database, linked to the source object
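The first two steps (walking the object and turning fields into facts) can be sketched roughly as follows. This is not the library's implementation: it uses nested dicts in place of AMSDAL objects, `collect_facts` is a hypothetical helper, and the exact meaning of "depth" and the wording of the generated facts are assumptions.

```python
def collect_facts(obj: dict, max_depth: int = 2, prefix: str = "", depth: int = 0) -> list[str]:
    """Flatten a nested dict into 'field is value.' sentences.

    Nested dicts stand in for related objects. At most `max_depth` levels
    of nesting are followed; anything deeper is skipped.
    """
    facts = []
    for name, value in obj.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            if depth < max_depth:
                facts.extend(collect_facts(value, max_depth, f"{path}.", depth + 1))
        else:
            facts.append(f"{path} is {value}.")
    return facts


if __name__ == "__main__":
    order = {
        "id": 7,
        "customer": {
            "name": "Ada",
            "address": {"city": "Oslo", "geo": {"lat": 59.9}},
        },
    }
    for fact in collect_facts(order):
        print(fact)
```

With the default depth of 2, the `geo` dict sits three levels down and is left out, which keeps the generated text focused on the object and its nearest relations.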
EmbeddingModel¶
Each chunk is stored as an EmbeddingModel record with:
| Field | Description |
|---|---|
| `data_object_class` | Source model class name |
| `data_object_id` | Source object ID |
| `chunk_index` | Chunk number (0-based) |
| `raw_text` | The text that was embedded |
| `embedding` | Vector embedding (`VectorField`) |
| `tags` | List of string tags for filtering |
| `ml_metadata` | Optional metadata (type `Any`, default `None`) |
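To make the record shape concrete, the sketch below mirrors the fields in a plain dataclass and shows how `chunk_index` and `tags` might be used to reassemble one object's chunks. `EmbeddingRecord` and `chunks_for` are hypothetical stand-ins; the real `EmbeddingModel` is an AMSDAL model queried through the database, not an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EmbeddingRecord:
    """Plain-Python stand-in for an EmbeddingModel row (not the real class)."""

    data_object_class: str
    data_object_id: str
    chunk_index: int
    raw_text: str
    embedding: list[float]
    tags: list[str] = field(default_factory=list)
    ml_metadata: Any = None


def chunks_for(records, object_class, object_id, tag=None):
    """Return one source object's chunks in chunk order, optionally filtered by tag."""
    matches = [
        r
        for r in records
        if r.data_object_class == object_class
        and r.data_object_id == object_id
        and (tag is None or tag in r.tags)
    ]
    return sorted(matches, key=lambda r: r.chunk_index)


if __name__ == "__main__":
    records = [
        EmbeddingRecord("Order", "7", 1, "second chunk", [0.0, 1.0], tags=["sales"]),
        EmbeddingRecord("Order", "7", 0, "first chunk", [1.0, 0.0], tags=["sales"]),
    ]
    print([r.raw_text for r in chunks_for(records, "Order", "7")])
```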
Custom Embedder¶
You can provide a custom embedder by subclassing Embedder:
```python
from amsdal_ml.ml_ingesting.embedders.embedder import Embedder


class MyEmbedder(Embedder):
    def embed(self, text: str) -> list[float]:
        # Return the embedding vector for the text
        ...

    async def aembed(self, text: str) -> list[float]:
        # Async version of embed()
        ...
```
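As an illustration of what the two methods might contain, here is a toy embedder that needs no network access, which can be handy in tests. It is written as a standalone class so the snippet runs without amsdal_ml installed; in a real project it would subclass `Embedder` as shown above. `HashingEmbedder` is a hypothetical name, and its vectors carry no semantic meaning.

```python
import asyncio
import hashlib
import math


class HashingEmbedder:
    """Toy embedder: hashes each word into one of `dimensions` buckets."""

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # Normalize to unit length; an empty text stays an all-zero vector.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    async def aembed(self, text: str) -> list[float]:
        # Run the synchronous version in a worker thread to avoid blocking the loop.
        return await asyncio.to_thread(self.embed, text)


if __name__ == "__main__":
    embedder = HashingEmbedder()
    print(embedder.embed("hello world"))
    print(asyncio.run(embedder.aembed("hello world")))
```

The output length should match `EMBED_DIMENSIONS`, since the stored vectors are expected to have that size.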
Chunking Strategy¶
The default chunking splits text into sentences and accumulates them into chunks of at most 800 tokens and 7 sentences. This ensures:
- Chunks respect natural sentence boundaries
- Each chunk is small enough for the embedding model's context window
- Related facts stay together within a chunk
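The accumulation logic can be sketched as below. This is a simplified illustration, not the code in `DefaultChunkStrategy`: `chunk_sentences` is a hypothetical function, sentences are split with a basic regular expression, and tokens are approximated as whitespace-separated words, whereas the library presumably counts real model tokens.

```python
import re


def chunk_sentences(text, max_tokens=800, max_sentences=7, min_words=4):
    """Group sentences into chunks bounded by token and sentence counts."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        if len(words) < min_words:
            continue  # drop very short sentences
        n_tokens = len(words)  # crude estimate: one token per word
        over_budget = current_tokens + n_tokens > max_tokens
        if current and (over_budget or len(current) >= max_sentences):
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    text = (
        "The order has id seven. Short one. "
        "The customer name is Ada Lovelace. The shipping city is Oslo today."
    )
    for chunk in chunk_sentences(text, max_tokens=10):
        print(chunk)
```

Note that in this sketch a single sentence longer than `max_tokens` still becomes its own oversized chunk, because sentences are never split.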
Chunking parameters (walk depth, max chunks, tokens per chunk) are fixed by DefaultChunkStrategy. To change them, pass a custom ChunkStrategy to DefaultIngesting:
```python
from amsdal_ml.ml_ingesting.chunk_strategy import ChunkStrategy, ChunkParams


class BigChunks(ChunkStrategy):
    def get_params(self, obj) -> ChunkParams:
        # Override walk depth, chunk count limit, and tokens per chunk
        return ChunkParams(max_depth=3, max_chunks=20, max_tokens_per_chunk=1200)
```