Semantic Search

After ingesting data, use the retriever to find relevant content by meaning rather than exact keyword match.

Basic Usage

from amsdal_ml.ml_retrievers.openai_retriever import OpenAIRetriever

retriever = OpenAIRetriever()

# Sync
results = retriever.similarity_search('customers with overdue payments', k=5)

# Async (call from inside an async function)
results = await retriever.asimilarity_search('customers with overdue payments', k=5)

for chunk in results:
    print(f'{chunk.object_class}#{chunk.object_id} (distance: {chunk.distance:.3f})')
    print(chunk.raw_text)

RetrievalChunk

Each result is a RetrievalChunk with:

Field	Type	Description
`object_class`	`str`	Source model class name
`object_id`	`str`	Source object ID
`chunk_index`	`int`	Chunk number within the object
`raw_text`	`str`	The text that matched
`distance`	`float`	Cosine distance (lower = more similar)
`tags`	`list[str]`	Tags attached to this embedding
`metadata`	`dict`	Additional metadata
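
Because one source object can contribute several chunks, a common post-processing step is to keep only the closest chunk per object. The sketch below is illustrative only: `ChunkSketch` is a hypothetical stand-in that mirrors the fields in the table above (it is not the library's `RetrievalChunk` class), and `best_chunk_per_object` is a helper invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkSketch:
    """Hypothetical stand-in mirroring the documented RetrievalChunk fields."""

    object_class: str
    object_id: str
    chunk_index: int
    raw_text: str
    distance: float
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def best_chunk_per_object(chunks):
    """Keep the lowest-distance chunk for each (object_class, object_id) pair."""
    best = {}
    for chunk in chunks:
        key = (chunk.object_class, chunk.object_id)
        if key not in best or chunk.distance < best[key].distance:
            best[key] = chunk
    # Lower cosine distance means more similar, so sort ascending.
    return sorted(best.values(), key=lambda c: c.distance)
```

The helper only reads `object_class`, `object_id`, and `distance`, so it should work equally on real results as long as they expose those attributes.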

Tag Filtering

Use tags to narrow the search to specific subsets of your data:

# Only search within customer embeddings
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers'],
)

# Exclude certain tags
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    exclude_tags=['archived'],
)

# Combine both
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers', 'active'],
    exclude_tags=['test-data'],
)

Default Tags

Set default include/exclude tags via environment variables, separating multiple tags with commas:

export RETRIEVER_INCLUDE_TAGS_DEFAULT='production'
export RETRIEVER_EXCLUDE_TAGS_DEFAULT='test-data,archived'
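
To make the comma-separated format concrete, here is a minimal sketch of how such values could be turned into tag lists. Both `parse_tag_list` and `resolve_tags` are hypothetical helpers written for this example, and the rule that explicitly passed tags take precedence over the defaults is an assumption, not documented behavior.

```python
import os


def parse_tag_list(raw):
    """Split 'a,b, c' into ['a', 'b', 'c'], ignoring blanks and stray commas."""
    if not raw:
        return []
    return [tag.strip() for tag in raw.split(',') if tag.strip()]


def resolve_tags(explicit, env_name, environ=None):
    """Return explicit tags if given, else the default read from the environment.

    ASSUMPTION: explicit arguments win over environment defaults.
    """
    env = os.environ if environ is None else environ
    if explicit is not None:
        return list(explicit)
    return parse_tag_list(env.get(env_name))
```

Passing a plain dict as `environ` keeps the helper easy to test without touching the real process environment.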

How It Works

1. The query text is embedded using the same OpenAI model used for ingestion.
2. A cosine distance search runs against stored EmbeddingModel records.
3. The top max(k × 5, 100) candidates are fetched, then filtered by tags.
4. Results are sorted by distance and trimmed to k items.

The retriever constructor also accepts a max_context_tokens parameter, which is reserved for future use.
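
The flow described above (score by cosine distance, over-fetch, filter by tags, trim to k) can be sketched in plain Python. This is an in-memory illustration, not the library's implementation: the `(vector, text, tags)` record shape and all function names are hypothetical, and the tag semantics (a record passes if it carries at least one include tag and none of the exclude tags) are an assumption.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0.0 for the same direction, up to 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def candidate_limit(k):
    """Over-fetch size described in the text: max(k * 5, 100)."""
    return max(k * 5, 100)


def search(query_vec, records, k, include_tags=None, exclude_tags=None):
    """records: iterable of (vector, text, tags) tuples (a hypothetical shape)."""
    include = set(include_tags or [])
    exclude = set(exclude_tags or [])

    # Score every record and keep only the closest candidates.
    scored = sorted(
        ((cosine_distance(query_vec, vec), text, tags) for vec, text, tags in records),
        key=lambda row: row[0],
    )
    candidates = scored[: candidate_limit(k)]

    # ASSUMPTION: any include tag is enough; any exclude tag rejects the record.
    kept = [
        row
        for row in candidates
        if (not include or include & set(row[2])) and not (exclude & set(row[2]))
    ]

    # Candidates are already ordered by distance, so trimming yields the top k.
    return kept[:k]
```

One consequence of filtering after the fetch, at least under this reading: if few of the fetched candidates carry the requested tags, a search can return fewer than k results even when more matching records exist further down the ranking.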

Configuration

Env Variable	Default	Description
`RETRIEVER_DEFAULT_K`	`8`	Number of results to return
`RETRIEVER_INCLUDE_TAGS_DEFAULT`	—	Default include tags
`RETRIEVER_EXCLUDE_TAGS_DEFAULT`	—	Default exclude tags

Custom Retriever

Subclass DefaultRetriever to use a different embedding provider:

from amsdal_ml.ml_retrievers.default_retriever import DefaultRetriever

class MyRetriever(DefaultRetriever):
    def _embed_query(self, text: str) -> list[float]:
        # Return embedding vector for the text
        ...

    async def _aembed_query(self, text: str) -> list[float]:
        # Async version
        ...
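
To show what the two hooks are expected to return, here is a toy, dependency-free embedder with the same method names. `HashingEmbedderSketch` is hypothetical and deliberately does not subclass `DefaultRetriever`, so it runs on its own; the hashing scheme is a placeholder, not a meaningful semantic embedding. In a real subclass, the query embedding presumably has to come from the same model that produced the stored embeddings.

```python
import asyncio
import hashlib
import math


class HashingEmbedderSketch:
    """Hypothetical toy embedder: hashes tokens into a fixed-size unit vector."""

    def __init__(self, dimensions=64):
        self.dimensions = dimensions

    def _embed_query(self, text):
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # Map each token to a stable bucket using a cryptographic hash.
            digest = hashlib.sha256(token.encode('utf-8')).digest()
            index = int.from_bytes(digest[:4], 'big') % self.dimensions
            vector[index] += 1.0
        # L2-normalise so cosine distance depends on direction only.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    async def _aembed_query(self, text):
        # Run the sync version in a worker thread to avoid blocking the event loop.
        return await asyncio.to_thread(self._embed_query, text)
```

Delegating the async hook to the sync one keeps both paths consistent; a provider with a native async client would call that client directly instead.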

Register via config:

export ML_RETRIEVER_CLASS='myapp.retrievers.MyRetriever'
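
The value is a dotted import path, so the module must be importable from the application's Python path. As a rough illustration of how such a path is commonly resolved (an assumption about the general pattern, not a description of how AMSDAL loads the class), a hypothetical `load_class` helper could look like this:

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition('.')
    if not module_path:
        raise ValueError(f'Expected a dotted path, got {dotted_path!r}')
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

A typo in the module part raises `ModuleNotFoundError`, and a typo in the class name raises `AttributeError`, which helps when diagnosing a misconfigured value.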