Semantic Search

After ingesting data, use the retriever to find relevant content by meaning rather than exact keyword match.

Basic Usage

from amsdal_ml.ml_retrievers.openai_retriever import OpenAIRetriever

retriever = OpenAIRetriever()

# Sync
results = retriever.similarity_search('customers with overdue payments', k=5)

# Async (call from inside a coroutine)
results = await retriever.asimilarity_search('customers with overdue payments', k=5)

for chunk in results:
    print(f'{chunk.object_class}#{chunk.object_id} (distance: {chunk.distance:.3f})')
    print(chunk.raw_text)

RetrievalChunk

Each result is a RetrievalChunk with:

Field         Type       Description
object_class  str        Source model class name
object_id     str        Source object ID
chunk_index   int        Chunk number within the object
raw_text      str        The text that matched
distance      float      Cosine distance (lower = more similar)
tags          list[str]  Tags attached to this embedding
metadata      dict       Additional metadata
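
Because results are returned per chunk, the same source object can appear more than once. The following is a minimal sketch of collapsing results to the best chunk per object. Chunk is a local stand-in that mirrors the fields listed above (it is not the library's RetrievalChunk class), and best_chunk_per_object is a hypothetical helper, not part of amsdal_ml:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Local stand-in mirroring the RetrievalChunk fields (not the library class)."""
    object_class: str
    object_id: str
    chunk_index: int
    raw_text: str
    distance: float
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def best_chunk_per_object(chunks):
    """Keep only the lowest-distance chunk for each (object_class, object_id) pair."""
    best = {}
    for chunk in chunks:
        key = (chunk.object_class, chunk.object_id)
        if key not in best or chunk.distance < best[key].distance:
            best[key] = chunk
    # Lower distance means more similar, so sort ascending.
    return sorted(best.values(), key=lambda c: c.distance)
```

Since the helper only reads object_class, object_id, and distance, it should work on real search results as well, assuming they expose those attributes as described above.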

Tag Filtering

Use tags to narrow the search to specific subsets of your data:

# Only search within customer embeddings
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers'],
)

# Exclude certain tags
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    exclude_tags=['archived'],
)

# Combine both
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers', 'active'],
    exclude_tags=['test-data'],
)

Default Tags

Set default include/exclude tags via environment variables:

export RETRIEVER_INCLUDE_TAGS_DEFAULT='production'
export RETRIEVER_EXCLUDE_TAGS_DEFAULT='test-data,archived'
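
The values above suggest a comma-separated format. As a rough sketch of how such a value could be turned into a tag list: parse_tag_list is a hypothetical helper, and the library's own parsing (whitespace handling, empty values) may differ.

```python
import os


def parse_tag_list(value):
    """Split a comma-separated string into tags, dropping blanks and surrounding spaces."""
    if not value:
        return []
    return [tag.strip() for tag in value.split(',') if tag.strip()]


# Assumed behaviour for illustration only, not the library's code.
include_default = parse_tag_list(os.environ.get('RETRIEVER_INCLUDE_TAGS_DEFAULT'))
exclude_default = parse_tag_list(os.environ.get('RETRIEVER_EXCLUDE_TAGS_DEFAULT'))
```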

How It Works

  1. The query text is embedded using the same OpenAI model used for ingestion
  2. A cosine distance search runs against stored EmbeddingModel records
  3. The top max(k × 5, 100) candidates are fetched, then filtered by tags
  4. Results are sorted by distance and trimmed to k items

Note: The retriever constructor also accepts a max_context_tokens parameter, which is reserved for future use.
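
The search steps above can be sketched in plain Python over an in-memory list. This is an illustration under stated assumptions, not the library's implementation: the records structure and the search function are hypothetical, the query is assumed to be embedded already, and a record is assumed to pass the filter when it carries any of the include tags and none of the exclude tags (the exact matching rule is not specified here).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less similar.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(query_vec, records, k, include_tags=None, exclude_tags=None):
    """records: list of dicts with 'vector' and 'tags' keys (hypothetical store)."""
    # Score every stored record by cosine distance to the query vector.
    scored = sorted(
        ({**rec, 'distance': cosine_distance(query_vec, rec['vector'])} for rec in records),
        key=lambda r: r['distance'],
    )
    # Over-fetch a candidate pool of max(k * 5, 100), then filter by tags.
    candidates = scored[: max(k * 5, 100)]
    include, exclude = set(include_tags or []), set(exclude_tags or [])
    kept = [
        r for r in candidates
        # Assumption: pass if the record has ANY include tag and NO exclude tag.
        if (not include or include & set(r['tags'])) and not (exclude & set(r['tags']))
    ]
    # Sort by distance and trim to k.
    return sorted(kept, key=lambda r: r['distance'])[:k]
```

One consequence of filtering after the candidate fetch, if the steps work as described: with very restrictive tags, a search may return fewer than k results even when more matching records exist outside the candidate pool.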

Configuration

Env Variable                    Default  Description
RETRIEVER_DEFAULT_K             8        Number of results to return
RETRIEVER_INCLUDE_TAGS_DEFAULT  (none)   Default include tags
RETRIEVER_EXCLUDE_TAGS_DEFAULT  (none)   Default exclude tags

Custom Retriever

Subclass DefaultRetriever to use a different embedding provider:

from amsdal_ml.ml_retrievers.default_retriever import DefaultRetriever

class MyRetriever(DefaultRetriever):
    def _embed_query(self, text: str) -> list[float]:
        # Return embedding vector for the text
        ...

    async def _aembed_query(self, text: str) -> list[float]:
        # Async version
        ...

Register the class via an environment variable:

export ML_RETRIEVER_CLASS='myapp.retrievers.MyRetriever'
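
The value is a dotted import path to the class. How amsdal_ml resolves it is not described here; a common pattern for this kind of setting, shown purely as an assumption with a hypothetical load_class helper, looks like this:

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition('.')
    if not module_path:
        raise ValueError(f'expected a dotted path, got {dotted_path!r}')
    # Import the module part, then look the class up by name.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Under this pattern, the module (myapp.retrievers in the example above) has to be importable from the process that runs the retriever, so a quick check such as load_class('myapp.retrievers.MyRetriever') in a shell can help catch path typos early.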