Classifiers

Classifiers categorise an input. Some classifiers use generative AI models to do this, while others use quantitative techniques like vector similarity search.

azure_openai_json_classification

azure_openai_json_classification(classifier, description=None, filename=None, content=None, metadata=None, image_url=None, max_characters=200000, system_prompt=None, api_key=None, azure_endpoint=None, api_version=None, model=None, max_retries=5, openai_client=None)

Classify an artefact (e.g. a file or page) into a category from a provided list of classifiers. Uses an LLM provided by the Azure OpenAI service.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `list[dict[str, Any]]` | A list of classifiers. | required |
| `filename` | `str \| None` | Title of the artefact. | `None` |
| `description` | `str \| None` | Description of the artefact. | `None` |
| `content` | `str \| None` | Main body content of the artefact. | `None` |
| `metadata` | `dict \| None` | Metadata properties about the artefact. | `None` |
| `image_url` | `str \| None` | A URL to an image of the artefact. Must be a signed or public URL that the OpenAI service can access. | `None` |
| `max_characters` | `int` | Character limit beyond which content is truncated. | `200000` |
| `system_prompt` | `str \| None` | Override the default prompt with custom instructions. | `None` |
| `api_key` | `str \| None` | API key for the Azure resource. If not provided, defaults to the environment variable AZURE_OPENAI_API_KEY. | `None` |
| `azure_endpoint` | `str \| None` | Your Azure endpoint, including the resource, e.g. https://example-resource.openai.azure.com/. If not provided, defaults to the environment variable AZURE_OPENAI_ENDPOINT. | `None` |
| `api_version` | `str \| None` | API version for the Azure resource. | `None` |
| `model` | `str \| None` | Model deployment name within the Azure resource. If not provided, defaults to the environment variable AZURE_OPENAI_DEPLOYMENT. | `None` |
| `max_retries` | `int` | Number of retry attempts made against the Azure OpenAI API service. | `5` |
| `openai_client` | `AzureOpenAIClient \| None` | Optional AzureOpenAIClient instance to reuse. If not provided, a new client is created. Sharing one client instance improves performance in concurrent scenarios. | `None` |
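
The fallback and truncation behaviour described for `api_key`, `azure_endpoint`, `model` and `max_characters` can be pictured with a small sketch. This is not the library's implementation: the helper names `resolve_setting` and `truncate_content` are hypothetical, and cutting the content from the start of the string is an assumption, since the documentation only states that content is truncated at the limit.

```python
import os


def resolve_setting(value, env_var):
    """Return the explicit value if one was passed, else the environment variable.

    Hypothetical helper: mirrors the documented "if not provided, defaults to
    environment variable ..." behaviour.
    """
    if value is not None:
        return value
    return os.environ.get(env_var)


def truncate_content(content, max_characters=200000):
    """Keep at most max_characters characters; None passes through unchanged.

    Assumption: truncation keeps the beginning of the content.
    """
    if content is None:
        return None
    return content[:max_characters]


# An explicit argument wins over the environment.
endpoint = resolve_setting("https://example-resource.openai.azure.com/",
                           "AZURE_OPENAI_ENDPOINT")

# With no argument, the value is read from the environment (None if unset).
deployment = resolve_setting(None, "AZURE_OPENAI_DEPLOYMENT")

# Content longer than the limit is shortened before being sent.
short = truncate_content("x" * 10, max_characters=4)
print(len(short))
```

Resolving settings this way keeps credentials out of source code while still letting a caller override them per call.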

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | JSON-formatted dictionary containing the keys listed below. |

  1. code: selected classification code.
  2. title: title or description of the classification code.
  3. certainty: confidence level, one of "low", "medium" or "high".
  4. explanation: concise explanation of why this code was chosen.
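
Because the result comes from an LLM, it can be worth checking its shape before relying on it. The sketch below is one possible check against the four documented keys. The function name `check_classification` and the example values (`"INV"`, `"Invoice"`) are illustrative assumptions, not output from the library.

```python
from typing import Any

REQUIRED_KEYS = ("code", "title", "certainty", "explanation")
CERTAINTY_LEVELS = ("low", "medium", "high")


def check_classification(result: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a result dictionary (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in result]
    if "certainty" in result and result["certainty"] not in CERTAINTY_LEVELS:
        problems.append(f"unexpected certainty: {result['certainty']!r}")
    return problems


# Illustrative result with the documented keys (values are made up).
example = {
    "code": "INV",
    "title": "Invoice",
    "certainty": "high",
    "explanation": "The content lists line items and a payment due date.",
}

problems = check_classification(example)
print(problems or "result has the documented shape")
```

A caller might route results with `certainty` of "low" to manual review rather than discarding them.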

vector_similarity_search

vector_similarity_search(query, storage_location='./storage', parquet_name='vectors')

Perform vector cosine similarity search between an input query and a set of embeddings stored in a local parquet file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `query` | `str` | The input string for which the similarity search is performed. | required |
| `storage_location` | `str` | Storage location of the Parquet embeddings files. | `'./storage'` |
| `parquet_name` | `str` | The name of the Parquet embeddings files. | `'vectors'` |

Returns:

| Type | Description |
| --- | --- |
| `dict[Any, Any]` | A dictionary object of the most similar match. |
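
To make the idea of cosine similarity search concrete, here is a minimal pure-Python sketch. It assumes the query has already been turned into an embedding vector and that the stored embeddings are held in memory as a list of dictionaries; the real function takes a string and reads from a Parquet file, and how it embeds the query is not documented here. The names `cosine_similarity`, `most_similar`, and the record keys `code`, `vector` and `score` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, records):
    """Return a copy of the record whose vector is closest to the query, plus its score."""
    best = max(records, key=lambda r: cosine_similarity(query_vector, r["vector"]))
    return {**best, "score": cosine_similarity(query_vector, best["vector"])}


# Stand-in for embeddings loaded from storage (toy two-dimensional vectors).
records = [
    {"code": "A", "vector": [1.0, 0.0]},
    {"code": "B", "vector": [0.0, 1.0]},
    {"code": "C", "vector": [0.7, 0.7]},
]

match = most_similar([0.9, 0.1], records)
print(match["code"], round(match["score"], 3))
```

Cosine similarity compares direction rather than magnitude, so two embeddings of different lengths still score as close when they point the same way.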