Classifiers

Classifiers categorise an input. Some classifiers use generative AI models to do this, while others use quantitative techniques like vector similarity search.

azure_openai_json_classification

azure_openai_json_classification(classifier, description=None, filename=None, content=None, metadata=None, image_url=None, image_detail='high', max_characters=200000, classification_target=None, field_prompt=None, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5, openai_client=None)

Classify an artefact (e.g. a file or page) into a category from a provided list of classifiers. Uses an LLM provided by the Azure OpenAI service.

Parameters:

Name Type Description Default
classifier list[dict[str, Any]]

A list of candidate codes. Each is a dict with a code and a human-readable label (code_description, or the legacy title / description aliases); an optional guidance / prompt / examples field gives the model extra context. Candidates are normalised to the output schema's key names before prompting.

required
filename str | None

Title of the artefact.

None
description str | None

Description of the artefact.

None
content str | None

Main body content of the artefact.

None
metadata dict | None

Metadata properties about the artefact.

None
image_url str | None

A URL to an image of the artefact. Must be a signed or public URL that the OpenAI service can access.

None
image_detail str

Vision fidelity for the image — "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto".

'high'
max_characters int

Character limit beyond which content is truncated.

200000
classification_target str | None

The name of the field/axis being classified (e.g. "Discipline", "Uniclass Systems"). Establishes for the model what dimension the candidate codes represent — without it the model must infer the intent of an opaque code list, a common driver of plausible hallucination. Ignored when system_prompt is given.

None
field_prompt str | None

Optional field-level guidance applied to the whole decision (distinct from each candidate's own guidance). Injected as a guidance section. Ignored when system_prompt is given.

None
system_prompt str | None

Custom instructions that replace the default prompt. When given, classification_target and field_prompt are ignored.

None
api_key str | None

API key for the Azure resource. If not provided, defaults to the environment variable AZURE_FOUNDRY_API_KEY.

None
azure_endpoint str | None

Your Azure endpoint, including the resource, e.g. https://example-resource.azure.openai.com/. If not provided, defaults to the environment variable AZURE_OPENAI_ENDPOINT.

None
api_version str | None

API version for Azure resource.

None
model str | None

Model deployment name within the Azure resource. If not provided, defaults to the environment variable AZURE_OPENAI_DEPLOYMENT.

None
temperature float | None

Sampling temperature.

None
max_retries int

Number of retry attempts to the Azure OpenAI API service.

5
openai_client AzureOpenAIClient | None

Optional AzureOpenAIClient instance to reuse. If not provided, a new client will be created. Providing a shared client instance improves performance in concurrent scenarios.

None
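
The classifier parameter accepts several spellings for a candidate's label. The following is a minimal sketch of how such candidates might be normalised to code / code_description before prompting. The helper name normalise_candidates and the precedence order of the label aliases are assumptions made for illustration; the library's own normalisation may differ.

```python
from typing import Any

# Label keys in an assumed order of precedence (the order is an assumption).
LABEL_KEYS = ("code_description", "title", "description")
# Optional context keys named in the parameter description, copied through unchanged.
CONTEXT_KEYS = ("guidance", "prompt", "examples")


def normalise_candidates(classifier: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: map label aliases onto code / code_description."""
    normalised = []
    for candidate in classifier:
        # Pick the first non-empty label among the accepted aliases.
        label = next((candidate[k] for k in LABEL_KEYS if candidate.get(k)), None)
        if "code" not in candidate or label is None:
            raise ValueError(f"candidate needs a code and a label: {candidate!r}")
        entry = {"code": candidate["code"], "code_description": label}
        entry.update({k: candidate[k] for k in CONTEXT_KEYS if k in candidate})
        normalised.append(entry)
    return normalised


candidates = normalise_candidates([
    {"code": "A", "code_description": "Architecture"},
    {"code": "S", "title": "Structural", "guidance": "Beams, columns, foundations."},
])
print(candidates)
```

Raising on a candidate without a code or label is a design choice of this sketch, so that malformed candidates fail before any request is sent.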

Returns:

Type Description
dict[str, Any]

JSON-formatted dictionary containing:

  1. code: selected classification code.
  2. code_description: title / description of the classification code.
  3. certainty: confidence score, either "low", "medium" or "high".
  4. explanation: concise explanation of why this code was chosen.
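
A caller will usually want to act on certainty and confirm that the returned code is one of the candidates. The sketch below shows one way to do that. The classifier is passed in as a callable so the example runs offline: fake_classify is a stand-in for azure_openai_json_classification, and the needs_review key and the "only high certainty passes" policy are assumptions of this example, not part of the library.

```python
from typing import Any, Callable


def classify_with_review_flag(
    classify: Callable[..., dict[str, Any]],
    classifier: list[dict[str, Any]],
    content: str,
) -> dict[str, Any]:
    """Call a classifier and flag results that are not confident, known codes."""
    result = classify(
        classifier=classifier,
        content=content,
        classification_target="Discipline",
    )
    allowed = {candidate["code"] for candidate in classifier}
    # Assumed policy: anything outside the candidate list, or below "high", gets reviewed.
    result["needs_review"] = (
        result.get("code") not in allowed or result.get("certainty") != "high"
    )
    return result


def fake_classify(**kwargs: Any) -> dict[str, Any]:
    """Offline stand-in returning the documented result shape."""
    return {
        "code": "S",
        "code_description": "Structural",
        "certainty": "medium",
        "explanation": "Mentions beams and load paths.",
    }


disciplines = [
    {"code": "A", "code_description": "Architecture"},
    {"code": "S", "code_description": "Structural"},
]
outcome = classify_with_review_flag(fake_classify, disciplines, "Beam schedule for level 2")
print(outcome["code"], outcome["needs_review"])
```

In real use, the first argument would be azure_openai_json_classification itself, with credentials supplied through its parameters or the environment variables listed above.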

azure_openai_json_tag

azure_openai_json_tag(tags, content, metadata=None, image_url=None, image_detail='high', column_prompt=None, max_characters=200000, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5)

Apply zero or more tags from a structured candidate list to a document.

A tag column is a one-to-many classifier: any number of candidates may be selected. Each candidate carries an id (stable key), title (label), optional description (LLM context) and optional prompt. The LLM selects by id and returns an explanation and a certainty for each applied tag.

Parameters:

Name Type Description Default
tags list[Any]

Candidate tags. Either structured dicts ({id, title, description?, prompt?}) or plain strings (coerced to {id, title}).

required
content str

Main body content of the document.

required
metadata dict | None

Metadata properties about the document.

None
image_url str | None

A signed/public URL to an image of the document.

None
image_detail str

Vision fidelity for the image — "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto".

'high'
column_prompt str | None

Optional column-level guidance applied to the whole selection for this tag column (distinct from each candidate's own prompt). Injected as a guidance section in the system prompt.

None
max_characters int

Character limit before content is truncated.

200000
system_prompt str | None

Override the default system prompt.

None
api_key str | None

API key for the Azure OpenAI resource.

None
azure_endpoint str | None

Endpoint URL for the Azure OpenAI resource.

None
api_version str | None

API version for the Azure OpenAI resource.

None
model str | None

Model deployment name within the Azure resource.

None
temperature float | None

Sampling temperature.

None
max_retries int

Number of retry attempts to the Azure OpenAI service.

5
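
The tags parameter accepts both structured dicts and plain strings. As a rough sketch of the coercion described for plain strings, the hypothetical helper below turns each string into an {id, title} dict. Using the string as both id and title is an assumption; the library may derive the id differently.

```python
from typing import Any


def coerce_tags(tags: list[Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: turn plain-string tags into {id, title} dicts."""
    coerced = []
    for tag in tags:
        if isinstance(tag, str):
            # Assumption: the string doubles as both the stable id and the label.
            coerced.append({"id": tag, "title": tag})
        elif isinstance(tag, dict) and "id" in tag and "title" in tag:
            coerced.append(dict(tag))  # copy so the caller's dict is not mutated
        else:
            raise TypeError(f"unsupported tag candidate: {tag!r}")
    return coerced


tags = coerce_tags([
    "Fire safety",
    {"id": "acoustic", "title": "Acoustics", "description": "Noise and vibration."},
])
print(tags)
```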

Returns:

Type Description
list[dict[str, Any]]

A list of applied tags, each a dict: {id, title, description, certainty, explanation}. certainty is normalised to lowercase "high" | "medium" | "low" (or None if the model returned an out-of-range value); the backend validates a strict Literal, so callers must not emit other casings. Returns an empty list if nothing applies or the response is malformed.
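
The certainty handling described above can be illustrated with a short sketch. normalise_certainty is a hypothetical name, and stripping surrounding whitespace is an assumption beyond what the description states.

```python
from typing import Optional

VALID_CERTAINTIES = {"high", "medium", "low"}


def normalise_certainty(value: object) -> Optional[str]:
    """Hypothetical helper: lowercase a certainty value, or None if out of range."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()  # whitespace stripping is an assumption
    return cleaned if cleaned in VALID_CERTAINTIES else None


print(normalise_certainty("High"))
print(normalise_certainty("very high"))
```

Downstream code that receives a tag should therefore be prepared for certainty to be None as well as one of the three lowercase strings.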

is_valid_classifier_code

is_valid_classifier_code(code, allowed_values)

Return whether a selected classifier code is a legitimate option.

A code is valid when it matches one of the classifier's allowed_values codes, or is a known non-selection sentinel (e.g. "-", "UNCERTAIN", "ERROR"). It is invalid only when a code was returned that is not in the allowed list — i.e. a hallucinated option. With no allowed values to check against, the code is treated as valid (there is nothing to validate).
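
The rules above can be sketched as follows. This is an illustrative re-implementation, not the library's source: the function name, the exact sentinel set, and the assumption that allowed_values holds either dicts with a code key or plain strings are all hypothetical.

```python
from typing import Any, Optional

# Sentinels named in the description; the real set may contain more (assumption).
NON_SELECTION_SENTINELS = {"-", "UNCERTAIN", "ERROR"}


def sketch_is_valid_classifier_code(code: str, allowed_values: Optional[list[Any]]) -> bool:
    """Illustrative version of the validity rules described above."""
    if not allowed_values:
        return True  # nothing to validate against
    if code in NON_SELECTION_SENTINELS:
        return True  # a non-selection is not a hallucinated option
    # Assumption: allowed values are dicts with a "code" key, or plain strings.
    allowed_codes = {
        value.get("code") if isinstance(value, dict) else value
        for value in allowed_values
    }
    return code in allowed_codes


allowed = [{"code": "A"}, {"code": "S"}]
print(sketch_is_valid_classifier_code("S", allowed))
print(sketch_is_valid_classifier_code("Z", allowed))
```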

vector_similarity_search

vector_similarity_search(query, storage_location='./storage', parquet_name='vectors')

Perform vector cosine similarity search between an input query and a set of embeddings stored in a local parquet file.

Parameters:

Name Type Description Default
query str

The input string to run the similarity search for.

required
storage_location str

Storage location of the Parquet embeddings file.

'./storage'
parquet_name str

The name of the Parquet embeddings file.

'vectors'

Returns:

Type Description
dict[Any, Any]

A dictionary object of the most similar match.
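
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine similarity search over in-memory vectors. It omits embedding the query and reading the Parquet file, and the row keys (text, vector, score) and helper names are assumptions for illustration, not the library's schema.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vector: list[float], rows: list[dict]) -> dict:
    """Return the stored row whose vector is closest to the query vector."""
    scored = [
        {**row, "score": cosine_similarity(query_vector, row["vector"])}
        for row in rows
    ]
    return max(scored, key=lambda row: row["score"])


# Toy three-dimensional "embeddings"; real embeddings have many more dimensions.
rows = [
    {"text": "structural beam", "vector": [1.0, 0.0, 0.0]},
    {"text": "fire door", "vector": [0.0, 1.0, 0.0]},
]
best = most_similar([0.9, 0.1, 0.0], rows)
print(best["text"])
```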