Semantic Search
After ingesting data, use the retriever to find relevant content by meaning rather than exact keyword match.
Basic Usage
from amsdal_ml.ml_retrievers.openai_retriever import OpenAIRetriever
retriever = OpenAIRetriever()
# Sync
results = retriever.similarity_search('customers with overdue payments', k=5)
# Async (must run inside a coroutine)
results = await retriever.asimilarity_search('customers with overdue payments', k=5)

for chunk in results:
    print(f'{chunk.object_class}#{chunk.object_id} (distance: {chunk.distance:.3f})')
    print(chunk.raw_text)
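Because `await` is only valid inside a coroutine, the async call needs an event loop when used from a plain script. The following is a minimal sketch of one way to drive it with `asyncio.run`. `StubRetriever` and the `search` helper are hypothetical stand-ins so the example runs without `amsdal_ml` installed; in real code an `OpenAIRetriever` instance would take the stub's place.

```python
import asyncio
from types import SimpleNamespace


class StubRetriever:
    """Hypothetical stand-in for OpenAIRetriever so the sketch runs offline."""

    async def asimilarity_search(self, query, k=5):
        # Return canned chunks instead of querying an embedding store.
        canned = [
            SimpleNamespace(object_class='Customer', object_id='42',
                            distance=0.123, raw_text='Invoice 30 days overdue'),
            SimpleNamespace(object_class='Customer', object_id='7',
                            distance=0.456, raw_text='Payment reminder sent'),
        ]
        return canned[:k]


async def search(retriever, query, k=5):
    # `await` is only valid inside a coroutine such as this one.
    results = await retriever.asimilarity_search(query, k=k)
    return [f'{c.object_class}#{c.object_id} (distance: {c.distance:.3f})'
            for c in results]


# asyncio.run creates the event loop, runs the coroutine, and closes the loop.
lines = asyncio.run(search(StubRetriever(), 'customers with overdue payments'))
```

Inside an application that already runs an event loop (for example an async web handler), call `await retriever.asimilarity_search(...)` directly instead of `asyncio.run`.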
RetrievalChunk
Each result is a `RetrievalChunk` with the following fields:
| Field | Type | Description |
|---|---|---|
| `object_class` | `str` | Source model class name |
| `object_id` | `str` | Source object ID |
| `chunk_index` | `int` | Chunk number within the object |
| `raw_text` | `str` | The text that matched |
| `distance` | `float` | Cosine distance (lower = more similar) |
| `tags` | `list[str]` | Tags attached to this embedding |
| `metadata` | `dict` | Additional metadata |
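Since one object can yield several chunks, a search may return the same `object_id` more than once. The sketch below shows one possible way to collapse results to the closest chunk per object. The `Chunk` dataclass is only an illustrative stand-in that mirrors the fields listed above, not the library's own class, and `best_chunk_per_object` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Illustrative stand-in mirroring the RetrievalChunk fields listed above."""
    object_class: str
    object_id: str
    chunk_index: int
    raw_text: str
    distance: float
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def best_chunk_per_object(chunks):
    """Keep only the closest (lowest-distance) chunk for each source object."""
    best = {}
    for chunk in chunks:
        key = (chunk.object_class, chunk.object_id)
        if key not in best or chunk.distance < best[key].distance:
            best[key] = chunk
    # Return the survivors ordered from most to least similar.
    return sorted(best.values(), key=lambda c: c.distance)


chunks = [
    Chunk('Customer', '42', 0, 'Invoice overdue', 0.30),
    Chunk('Customer', '42', 1, 'Second reminder sent', 0.12),
    Chunk('Order', '9', 0, 'Paid in full', 0.25),
]
top = best_chunk_per_object(chunks)
```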
Tag Filtering
Use tags to narrow the search to specific subsets of your data:
# Only search within customer embeddings
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers'],
)

# Exclude certain tags
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    exclude_tags=['archived'],
)

# Combine both
results = await retriever.asimilarity_search(
    'payment history',
    k=10,
    include_tags=['customers', 'active'],
    exclude_tags=['test-data'],
)
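To reason about how the two arguments might interact, here is a small sketch of a tag predicate. The source does not say whether `include_tags` requires any or all of the listed tags, so this assumes any-of matching and that exclusions take precedence; `passes_tag_filter` is a hypothetical helper, not part of the retriever API, and the library's actual behaviour should be verified.

```python
def passes_tag_filter(tags, include_tags=None, exclude_tags=None):
    """Assumed semantics: any-of match for include, exclusions always win."""
    tag_set = set(tags)
    # Reject as soon as any excluded tag is present.
    if exclude_tags and tag_set & set(exclude_tags):
        return False
    # When include tags are given, require at least one of them.
    if include_tags and not tag_set & set(include_tags):
        return False
    return True


active_customer = passes_tag_filter(
    ['customers', 'active'], include_tags=['customers', 'active'],
    exclude_tags=['test-data'],
)
archived_customer = passes_tag_filter(
    ['customers', 'archived'], include_tags=['customers'],
    exclude_tags=['archived'],
)
```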
Default Tags
Set default include/exclude tags via environment variables, separating multiple tags with commas:
export RETRIEVER_INCLUDE_TAGS_DEFAULT='production'
export RETRIEVER_EXCLUDE_TAGS_DEFAULT='test-data,archived'
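For illustration, a comma-separated value like the one above could be turned into a tag list as follows. This is a sketch of the assumed format only; `tags_from_env` is a hypothetical helper, and the library's own parsing (for example its handling of whitespace) may differ.

```python
import os


def tags_from_env(name):
    """Split a comma-separated env variable into a list of tags (assumed format)."""
    raw = os.environ.get(name, '')
    # Drop surrounding whitespace and ignore empty entries such as a trailing comma.
    return [tag.strip() for tag in raw.split(',') if tag.strip()]


os.environ['RETRIEVER_EXCLUDE_TAGS_DEFAULT'] = 'test-data,archived'
exclude_default = tags_from_env('RETRIEVER_EXCLUDE_TAGS_DEFAULT')
```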
How It Works
- The query text is embedded using the same OpenAI model used for ingestion
- A cosine distance search runs against stored `EmbeddingModel` records
- The top `max(k × 5, 100)` candidates are fetched, then filtered by tags
- Results are sorted by distance and trimmed to `k` items
- The retriever constructor accepts a `max_context_tokens` parameter (reserved for future use)
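The steps above can be sketched in plain Python over an in-memory list. This is not the library's implementation: the record layout (`id`, `vector`, `tags`), the `search` function, and the any-of include semantics are assumptions made for the example, and a real store would run the distance search in the database rather than in a loop.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(query_vec, records, k, include_tags=None, exclude_tags=None):
    # Rank every stored record by distance to the query vector.
    ranked = sorted(
        ((cosine_distance(query_vec, r['vector']), r) for r in records),
        key=lambda pair: pair[0],
    )
    # Over-fetch so that tag filtering still leaves enough results.
    candidates = ranked[:max(k * 5, 100)]
    kept = []
    for distance, record in candidates:
        tags = set(record['tags'])
        if exclude_tags and tags & set(exclude_tags):
            continue
        if include_tags and not tags & set(include_tags):
            continue
        kept.append((record['id'], distance))
    # Candidates are already sorted by distance, so trimming keeps the closest k.
    return kept[:k]


records = [
    {'id': 'a', 'vector': [1.0, 0.0], 'tags': ['customers']},
    {'id': 'b', 'vector': [0.0, 1.0], 'tags': ['customers']},
    {'id': 'c', 'vector': [1.0, 1.0], 'tags': ['archived']},
]
hits = search([1.0, 0.0], records, k=2, exclude_tags=['archived'])
```

Fetching more candidates than `k` before filtering matters because tag filters are applied after the distance search: with a small candidate pool, a restrictive filter could leave fewer than `k` results even when more matches exist further down the ranking.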
Configuration
| Env Variable | Default | Description |
|---|---|---|
| `RETRIEVER_DEFAULT_K` | `8` | Number of results to return |
| `RETRIEVER_INCLUDE_TAGS_DEFAULT` | (none) | Default include tags |
| `RETRIEVER_EXCLUDE_TAGS_DEFAULT` | (none) | Default exclude tags |
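As a rough sketch of how a default such as `RETRIEVER_DEFAULT_K` could be resolved, the snippet below prefers an explicit `k`, then the environment variable, then the documented default of 8. The precedence order and the `resolve_k` helper are assumptions for illustration; the source does not describe how the library resolves the value.

```python
import os


def resolve_k(k=None, env=os.environ):
    """Use the explicit k if given, else RETRIEVER_DEFAULT_K, else 8."""
    if k is not None:
        return k
    # Environment values are strings, so convert before use.
    return int(env.get('RETRIEVER_DEFAULT_K', '8'))


explicit = resolve_k(3)
from_env = resolve_k(env={'RETRIEVER_DEFAULT_K': '12'})
fallback = resolve_k(env={})
```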
Custom Retriever
Subclass DefaultRetriever to use a different embedding provider:
from amsdal_ml.ml_retrievers.default_retriever import DefaultRetriever

class MyRetriever(DefaultRetriever):
    def _embed_query(self, text: str) -> list[float]:
        # Return embedding vector for the text
        ...

    async def _aembed_query(self, text: str) -> list[float]:
        # Async version
        ...
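The two method bodies are left open above. While prototyping, they could delegate to something like the toy functions below, which build a deterministic hashed bag-of-words vector with no external provider. `toy_embed` and `atoy_embed` are hypothetical names, and this is a placeholder technique, not a substitute for a real embedding model. Note that cosine distance is only defined between vectors of equal length, so the query embedding must have the same dimensionality as the stored ones.

```python
import asyncio
import hashlib
import math


def toy_embed(text, dim=8):
    """Deterministic hashed bag-of-words embedding (illustrative only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket index and count it there.
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        vec[digest[0] % dim] += 1.0
    # Normalise to unit length; leave an all-zero vector untouched.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


async def atoy_embed(text, dim=8):
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(toy_embed, text, dim)


vec = toy_embed('overdue payments')
avec = asyncio.run(atoy_embed('overdue payments'))
```

In a subclass, `_embed_query` could return `toy_embed(text)` and `_aembed_query` could return `await atoy_embed(text)`; with a real provider, the same split applies, with the async method calling the provider's async client where one exists.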
Register via config:
export ML_RETRIEVER_CLASS='myapp.retrievers.MyRetriever'