Classifiers
Classifiers categorise an input. Some classifiers use generative AI models to do this, while others use quantitative techniques like vector similarity search.
azure_openai_json_classification
azure_openai_json_classification(classifier, description=None, filename=None, content=None, metadata=None, image_url=None, image_detail='high', max_characters=200000, classification_target=None, field_prompt=None, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5, openai_client=None)
Classify an artefact (e.g. a file or page) into a category from a provided list of classifiers. Uses an LLM provided by the Azure OpenAI service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| classifier | list[dict[str, Any]] | A list of candidate codes. Each is a dict with a | required |
| filename | str \| None | Title of the artefact. | None |
| description | str \| None | Description of the artefact. | None |
| content | str \| None | Main body content of the artefact. | None |
| metadata | dict \| None | Metadata properties about the artefact. | None |
| image_url | str \| None | A URL to an image of the artefact. Must be a signed or public URL that the OpenAI service can access. | None |
| image_detail | str | Vision fidelity for the image: "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto". | 'high' |
| max_characters | int | Character limit before content will be truncated. | 200000 |
| classification_target | str \| None | The name of the field/axis being classified (e.g. "Discipline", "Uniclass Systems"). Establishes for the model what dimension the candidate codes represent; without it the model must infer the intent of an opaque code list, a common driver of plausible hallucination. Ignored when | None |
| field_prompt | str \| None | Optional field-level guidance applied to the whole decision (distinct from each candidate's own | None |
| system_prompt | str \| None | Override the default prompt with custom instructions. | None |
| api_key | str \| None | API key for Azure resource. If not provided, will default to environment variable | None |
| azure_endpoint | str \| None | Your Azure endpoint, including the resource, e.g. https://example-resource.azure.openai.com/. If not provided, will default to environment variable | None |
| api_version | str \| None | API version for Azure resource. | None |
| model | str \| None | Model deployment name within the Azure resource. If not provided, will default to environment variable | None |
| max_retries | int | Number of retry attempts to the Azure OpenAI API service. | 5 |
| openai_client | AzureOpenAIClient \| None | Optional AzureOpenAIClient instance to reuse. If not provided, a new client will be created. Providing a shared client instance improves performance in concurrent scenarios. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | JSON-formatted dictionary containing: |
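The parameters above can be illustrated with a minimal sketch of how a caller might assemble arguments before calling the function. The helper `prepare_classification_inputs`, the candidate keys `code` and `description`, and the choice to truncate by keeping the first `max_characters` characters are all assumptions made for illustration; they are not confirmed by this reference. The actual call is left commented out because the import path and Azure credentials are not given here.

```python
from __future__ import annotations

from typing import Any

# The three fidelity values listed for image_detail in the table above.
ALLOWED_IMAGE_DETAIL = {"high", "low", "auto"}


def prepare_classification_inputs(
    classifier: list[dict[str, Any]],
    content: str | None = None,
    image_detail: str = "high",
    max_characters: int = 200000,
    classification_target: str | None = None,
) -> dict[str, Any]:
    """Build keyword arguments for a classification call (hypothetical helper)."""
    if not classifier:
        raise ValueError("classifier must contain at least one candidate")
    if image_detail not in ALLOWED_IMAGE_DETAIL:
        raise ValueError(f"image_detail must be one of {sorted(ALLOWED_IMAGE_DETAIL)}")
    # Assumption: truncation keeps the first max_characters characters.
    truncated = content[:max_characters] if content is not None else None
    return {
        "classifier": classifier,
        "content": truncated,
        "image_detail": image_detail,
        "max_characters": max_characters,
        "classification_target": classification_target,
    }


# Hypothetical candidates: the "code" and "description" keys are assumed.
candidates = [
    {"code": "AR", "description": "Architectural drawings and specifications"},
    {"code": "ST", "description": "Structural engineering documents"},
]

kwargs = prepare_classification_inputs(
    candidates,
    content="General arrangement plan, level 02 " * 3,
    max_characters=40,
    classification_target="Discipline",
)
print(len(kwargs["content"]))

# result = azure_openai_json_classification(**kwargs)  # needs Azure credentials
```

Validating `image_detail` and trimming `content` on the caller's side is optional; it simply surfaces mistakes before a request is sent.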
azure_openai_json_tag
azure_openai_json_tag(tags, content, metadata=None, image_url=None, image_detail='high', column_prompt=None, max_characters=200000, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5)
Apply zero or more tags from a structured candidate list to a document.
A tag column is a one-to-many classifier: any number of candidates may be
selected. Each candidate carries an id (stable key), title (label),
optional description (LLM context) and optional prompt. The LLM
selects by id and returns reasoning + certainty per applied tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tags | list[Any] | Candidate tags. Either structured dicts ({id, title, description?, prompt?}) or plain strings (coerced to {id, title}). | required |
| content | str | Main body content of the document. | required |
| metadata | dict \| None | Metadata properties about the document. | None |
| image_url | str \| None | A signed/public URL to an image of the document. | None |
| image_detail | str | Vision fidelity for the image: "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto". | 'high' |
| column_prompt | str \| None | Optional column-level guidance applied to the whole selection for this tag column (distinct from each candidate's own | None |
| max_characters | int | Character limit before content is truncated. | 200000 |
| system_prompt | str \| None | Override the default system prompt. | None |
| api_key | str \| None | API key for the Azure OpenAI resource. | None |
| azure_endpoint | str \| None | Endpoint URL for the Azure OpenAI resource. | None |
| api_version | str \| None | API version for the Azure OpenAI resource. | None |
| model | str \| None | Model deployment name within the Azure resource. | None |
| temperature | float \| None | Sampling temperature. | None |
| max_retries | int | Number of retry attempts to the Azure OpenAI service. | 5 |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | A list of applied tags, each a dict: [...] (or [...] validates a strict [...] Returns an empty list if nothing applies or the response is malformed. |
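The candidate handling described for `tags` can be sketched as follows. This is a minimal illustration, not the library's code: `normalise_tags` and `keep_known_tags` are hypothetical helpers, using the string itself as both `id` and `title` is an assumption, and the `id` key on returned tags is assumed from the statement that the LLM selects by id.

```python
from __future__ import annotations

from typing import Any


def normalise_tags(tags: list[Any]) -> list[dict[str, Any]]:
    """Coerce plain strings to {id, title}; copy structured dicts unchanged."""
    normalised: list[dict[str, Any]] = []
    for tag in tags:
        if isinstance(tag, str):
            # Assumption: a plain string serves as both id and title.
            normalised.append({"id": tag, "title": tag})
        elif isinstance(tag, dict) and "id" in tag and "title" in tag:
            normalised.append(dict(tag))
        else:
            raise ValueError(f"Unsupported tag candidate: {tag!r}")
    return normalised


def keep_known_tags(
    applied: list[dict[str, Any]], candidates: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Drop applied tags whose id is not among the candidate ids."""
    known = {candidate["id"] for candidate in candidates}
    return [tag for tag in applied if tag.get("id") in known]


candidates = normalise_tags(
    ["Fire safety", {"id": "t1", "title": "Drainage", "description": "Below-ground pipework"}]
)
# Hypothetical model output; "t9" is not a candidate and is filtered out.
applied = [{"id": "t1", "certainty": 0.9}, {"id": "t9", "certainty": 0.4}]
print(keep_known_tags(applied, candidates))
```

Filtering by id after the call is one way to guard against a tag that was never offered; whether the function already does this internally is not stated here.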
is_valid_classifier_code
is_valid_classifier_code(code, allowed_values)
Return whether a selected classifier code is a legitimate option.
A code is valid when it matches one of the classifier's allowed_values
codes, or is a known non-selection sentinel (e.g. "-", "UNCERTAIN",
"ERROR"). It is invalid only when a code was returned that is not in the
allowed list — i.e. a hallucinated option. With no allowed values to check
against, the code is treated as valid (there is nothing to validate).
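The rule described above can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the library's source: the function name `is_valid_code_sketch` is hypothetical, each allowed value is assumed to be either a plain string or a dict with a `code` key, and the sentinel set contains only the three examples named in the description (the real set may be larger).

```python
from __future__ import annotations

from typing import Any

# Sentinels named in the description; the real set may include others.
NON_SELECTION_SENTINELS = {"-", "UNCERTAIN", "ERROR"}


def is_valid_code_sketch(code: str, allowed_values: list[Any] | None) -> bool:
    """Return True unless the code is absent from a non-empty allowed list."""
    if not allowed_values:
        # Nothing to validate against, so the code is treated as valid.
        return True
    if code in NON_SELECTION_SENTINELS:
        return True
    # Assumption: allowed values are dicts with a "code" key or plain strings.
    allowed_codes = {
        value["code"] if isinstance(value, dict) else value
        for value in allowed_values
    }
    return code in allowed_codes


allowed = [{"code": "AR"}, {"code": "ST"}]
print(is_valid_code_sketch("AR", allowed))
print(is_valid_code_sketch("XX", allowed))
```

Only the second call reports an invalid code, because "XX" was returned but never offered.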
vector_similarity_search
vector_similarity_search(query, storage_location='./storage', parquet_name='vectors')
Perform vector cosine similarity search between an input query and a set of embeddings stored in a local parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The input string for which the similarity search needs to be performed. | required |
| storage_location | str | Storage location of the Parquet embeddings files. | './storage' |
| parquet_name | str | The name of the Parquet embeddings files. | 'vectors' |
Returns:
| Type | Description |
|---|---|
| dict[Any, Any] | A dictionary object of the most similar match. |
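The core comparison behind a cosine similarity search can be sketched in plain Python. This sketch deliberately skips two steps the real function performs (embedding the query string and reading the Parquet file) and works on vectors already in memory. The row layout with `text` and `embedding` keys, the returned `score` key, and the helper names are assumptions for illustration only.

```python
from __future__ import annotations

import math
from typing import Any, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_vector: Sequence[float], rows: list[dict[str, Any]]
) -> dict[str, Any]:
    """Return the stored row closest to the query, with its similarity score."""
    if not rows:
        raise ValueError("rows must not be empty")
    scored = [(cosine_similarity(query_vector, row["embedding"]), row) for row in rows]
    best_score, best_row = max(scored, key=lambda pair: pair[0])
    return {"text": best_row["text"], "score": best_score}


# Hypothetical stored embeddings (two dimensions keep the example readable).
rows = [
    {"text": "fire door schedule", "embedding": [0.9, 0.1]},
    {"text": "drainage layout", "embedding": [0.0, 1.0]},
]
result = most_similar([1.0, 0.0], rows)
print(result["text"])
```

Cosine similarity compares direction rather than magnitude, so two embeddings of different length still score close to 1.0 when they point the same way.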