Classifiers

Classifiers categorise an input. Some classifiers use generative AI models to do this, while others use quantitative techniques like vector similarity search.

azure_openai_json_classification

azure_openai_json_classification(classifier, description=None, filename=None, content=None, metadata=None, image_url=None, image_detail='high', max_characters=200000, classification_target=None, field_prompt=None, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5, openai_client=None)

Classify an artefact (e.g. a file or page) into a category from a provided list of classifiers. Uses an LLM provided by the Azure OpenAI service.

Parameters:

Name	Type	Description	Default
`classifier`	`list[dict[str, Any]]`	A list of candidate codes. Each is a dict with a `code` and a human-readable label (`code_description`, or the legacy `title` / `description` aliases); an optional `guidance` / `prompt` / `examples` field gives the model extra context. Candidates are normalised to the output schema's key names before prompting.	required
`filename`	`str \| None`	Title of the artefact.	`None`
`description`	`str \| None`	Description of the artefact.	`None`
`content`	`str \| None`	Main body content of the artefact.	`None`
`metadata`	`dict \| None`	Metadata properties about the artefact.	`None`
`image_url`	`str \| None`	A URL to an image of the artefact. Must be a signed or public URL that the OpenAI service can access.	`None`
`image_detail`	`str`	Vision fidelity for the image — "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto".	`'high'`
`max_characters`	`int`	Character limit before content will be truncated.	`200000`
`classification_target`	`str \| None`	The name of the field/axis being classified (e.g. "Discipline", "Uniclass Systems"). Establishes for the model what dimension the candidate codes represent — without it the model must infer the intent of an opaque code list, a common driver of plausible hallucination. Ignored when `system_prompt` is given.	`None`
`field_prompt`	`str \| None`	Optional field-level guidance applied to the whole decision (distinct from each candidate's own `guidance`). Injected as a guidance section. Ignored when `system_prompt` is given.	`None`
`system_prompt`	`str \| None`	Override the default prompt with custom instructions.	`None`
`api_key`	`str \| None`	API key for the Azure resource. If not provided, defaults to the environment variable `AZURE_FOUNDRY_API_KEY`.	`None`
`azure_endpoint`	`str \| None`	Your Azure endpoint, including the resource, e.g. https://example-resource.openai.azure.com/. If not provided, defaults to the environment variable `AZURE_OPENAI_ENDPOINT`.	`None`
`api_version`	`str \| None`	API version for Azure resource.	`None`
`model`	`str \| None`	Model deployment name within the Azure resource. If not provided, defaults to the environment variable `AZURE_OPENAI_DEPLOYMENT`.	`None`
`temperature`	`float \| None`	Sampling temperature.	`None`
`max_retries`	`int`	Number of retry attempts to the Azure OpenAI API service.	`5`
`openai_client`	`AzureOpenAIClient \| None`	Optional AzureOpenAIClient instance to reuse. If not provided, a new client will be created. Providing a shared client instance improves performance in concurrent scenarios.	`None`

Returns:

Type	Description
`dict[str, Any]`	JSON-formatted dictionary containing: `code`: selected classification code. `code_description`: title / description of the classification code. `certainty`: confidence score, either "low", "medium" or "high". `explanation`: concise explanation of why this code was chosen.
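
The candidate shape and return shape described above can be illustrated with a minimal sketch. It is not the library's implementation: `normalise_candidate` and `needs_review` are hypothetical helpers, the alias priority order is an assumption, and the example result is written by hand in the documented shape rather than produced by a model.

```python
from typing import Any

# Label keys in assumed priority order: the current name first, then legacy aliases.
LABEL_KEYS = ("code_description", "title", "description")
# Optional per-candidate context fields named in the parameter table.
CONTEXT_KEYS = ("guidance", "prompt", "examples")


def normalise_candidate(candidate: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map one candidate onto `code` / `code_description`."""
    if "code" not in candidate:
        raise ValueError("candidate is missing 'code'")
    label = next((candidate[k] for k in LABEL_KEYS if candidate.get(k)), None)
    if label is None:
        raise ValueError(f"candidate {candidate['code']!r} has no label")
    normalised = {"code": candidate["code"], "code_description": label}
    # Carry over any extra context unchanged (assumption: keys are kept as-is).
    for key in CONTEXT_KEYS:
        if candidate.get(key):
            normalised[key] = candidate[key]
    return normalised


def needs_review(result: dict[str, Any]) -> bool:
    """Flag results whose documented `certainty` is anything other than "high"."""
    return result.get("certainty") != "high"


candidates = [
    {"code": "A", "code_description": "Architecture"},
    {"code": "S", "title": "Structural", "guidance": "Includes foundations."},
]
normalised = [normalise_candidate(c) for c in candidates]

# A hand-written result in the documented return shape, not real model output.
example_result = {
    "code": "S",
    "code_description": "Structural",
    "certainty": "medium",
    "explanation": "The drawing title refers to pile foundations.",
}
print(normalised)
print(needs_review(example_result))
```

Routing anything below "high" certainty to manual review is only one possible policy; callers may prefer a different threshold.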

azure_openai_json_tag

azure_openai_json_tag(tags, content, metadata=None, image_url=None, image_detail='high', column_prompt=None, max_characters=200000, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, temperature=None, max_retries=5)

Apply zero or more tags from a structured candidate list to a document.

A tag column is a one-to-many classifier: any number of candidates may be selected. Each candidate carries an id (stable key), a title (label), an optional description (LLM context) and an optional prompt. The LLM selects by id and returns an explanation and a certainty for each applied tag.

Parameters:

Name	Type	Description	Default
`tags`	`list[Any]`	Candidate tags. Either structured dicts ({id, title, description?, prompt?}) or plain strings (coerced to {id, title}).	required
`content`	`str`	Main body content of the document.	required
`metadata`	`dict \| None`	Metadata properties about the document.	`None`
`image_url`	`str \| None`	A signed/public URL to an image of the document.	`None`
`image_detail`	`str`	Vision fidelity for the image — "high" (default; full tiling, best for reading text), "low" (downscaled to 512x512, cheap) or "auto".	`'high'`
`column_prompt`	`str \| None`	Optional column-level guidance applied to the whole selection for this tag column (distinct from each candidate's own `prompt`). Injected as a guidance section in the system prompt.	`None`
`max_characters`	`int`	Character limit before content is truncated.	`200000`
`system_prompt`	`str \| None`	Override the default system prompt.	`None`
`api_key`	`str \| None`	API key for the Azure OpenAI resource.	`None`
`azure_endpoint`	`str \| None`	Endpoint URL for the Azure OpenAI resource.	`None`
`api_version`	`str \| None`	API version for the Azure OpenAI resource.	`None`
`model`	`str \| None`	Model deployment name within the Azure resource.	`None`
`temperature`	`float \| None`	Sampling temperature.	`None`
`max_retries`	`int`	Number of retry attempts to the Azure OpenAI service.	`5`

Returns:

Type	Description
`list[dict[str, Any]]`	A list of applied tags, each a dict: `{id, title, description, certainty, explanation}`. `certainty` is normalised to lowercase `"high" \| "medium" \| "low"` (or `None` if the model returned an out-of-range value); the backend validates a strict `Literal`, so callers must not emit other casings. Returns an empty list if nothing applies or the response is malformed.
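
Two behaviours described above, coercing plain-string tags and normalising `certainty`, can be sketched as follows. Both helpers are hypothetical and only approximate the documented behaviour; in particular, using the string itself as both `id` and `title` is an assumption.

```python
from typing import Any, Optional

ALLOWED_CERTAINTY = {"high", "medium", "low"}


def coerce_tag(tag: Any) -> dict[str, Any]:
    """Hypothetical helper: turn a plain string into a structured candidate."""
    if isinstance(tag, str):
        # Assumption: a plain string serves as both the stable id and the label.
        return {"id": tag, "title": tag}
    if isinstance(tag, dict) and "id" in tag and "title" in tag:
        return dict(tag)
    raise TypeError(f"unsupported tag candidate: {tag!r}")


def normalise_certainty(value: Any) -> Optional[str]:
    """Lowercase a certainty value, or return None when it is out of range."""
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ALLOWED_CERTAINTY:
            return lowered
    return None


print(coerce_tag("urgent"))
print(normalise_certainty("HIGH"))
print(normalise_certainty("very high"))
```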

is_valid_classifier_code

is_valid_classifier_code(code, allowed_values)

Return whether a selected classifier code is a legitimate option.

A code is valid when it matches one of the classifier's allowed_values codes, or is a known non-selection sentinel (e.g. "-", "UNCERTAIN", "ERROR"). It is invalid only when a code was returned that is not in the allowed list — i.e. a hallucinated option. With no allowed values to check against, the code is treated as valid (there is nothing to validate).
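
The rules above can be sketched as a small function. This is an illustration under assumptions, not the library's code: `is_valid_code_sketch` is a hypothetical name, the sentinel set contains only the three examples given (the real set may be larger), and `allowed_values` is assumed to hold either dicts with a `code` key or plain strings.

```python
from typing import Any, Iterable, Optional

# Only the sentinels named in the description; the real list may be longer.
NON_SELECTION_SENTINELS = {"-", "UNCERTAIN", "ERROR"}


def is_valid_code_sketch(code: str, allowed_values: Optional[Iterable[Any]]) -> bool:
    """Return False only for a code that is neither allowed nor a sentinel."""
    allowed_codes = set()
    for item in allowed_values or []:
        # Assumed shapes: {"code": ...} dicts or plain code strings.
        value = item.get("code") if isinstance(item, dict) else item
        if value is not None:
            allowed_codes.add(value)
    if not allowed_codes:
        # Nothing to validate against, so the code is accepted.
        return True
    return code in NON_SELECTION_SENTINELS or code in allowed_codes


allowed = [{"code": "A"}, {"code": "S"}]
print(is_valid_code_sketch("A", allowed))          # allowed code
print(is_valid_code_sketch("UNCERTAIN", allowed))  # sentinel
print(is_valid_code_sketch("Z", allowed))          # hallucinated option
print(is_valid_code_sketch("Z", []))               # nothing to check against
```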

vector_similarity_search

vector_similarity_search(query, storage_location='./storage', parquet_name='vectors')

Perform a cosine similarity search between an input query and a set of embeddings stored in a local Parquet file.

Parameters:

Name	Type	Description	Default
`query`	`str`	The input string for which the similarity search needs to be performed.	required
`storage_location`	`str`	Storage location of the Parquet embeddings files.	`'./storage'`
`parquet_name`	`str`	The name of the Parquet embeddings files.	`'vectors'`

Returns:

Type	Description
`dict[Any, Any]`	A dictionary object of the most similar match.
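
For readers unfamiliar with cosine similarity search, the core ranking step might look like the sketch below. It assumes the query has already been embedded and the stored rows are already loaded into memory; the embedding model, the Parquet schema, and the `label` / `vector` / `similarity` keys are all hypothetical and not taken from the library.

```python
import math
from typing import Any, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector: Sequence[float], rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Return the stored row closest to the query, with its similarity score."""
    if not rows:
        raise ValueError("no stored embeddings to search")
    best = max(rows, key=lambda row: cosine_similarity(query_vector, row["vector"]))
    return {**best, "similarity": cosine_similarity(query_vector, best["vector"])}


# Toy two-dimensional embeddings standing in for rows read from a Parquet file.
stored_rows = [
    {"label": "a", "vector": [1.0, 0.0]},
    {"label": "b", "vector": [0.0, 1.0]},
]
match = most_similar([0.9, 0.1], stored_rows)
print(match["label"], round(match["similarity"], 4))
```

A linear scan like this is adequate for small local files; larger collections would usually call for a vectorised or indexed search.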