Processors

Base class for processing units.

ProcessingUnit

Bases: ABC

Abstract base class for processing units.

Template Method pattern:

  • run() orchestrates the workflow
  • Subclasses implement _setup(), _process(), _teardown()
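
The Template Method arrangement described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchUnit` and `UppercaseUnit` are hypothetical names, and the hook signatures, return values, and the try/finally around teardown are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchUnit(ABC):
    """Hypothetical stand-in for ProcessingUnit; hook signatures are assumed."""

    def run(self, document: str, **kwargs: Any) -> dict[str, Any]:
        # run() fixes the order of the steps; subclasses supply the steps.
        self._setup(document, **kwargs)
        try:
            return self._process(document, **kwargs)
        finally:
            # Assumption: teardown runs even if processing raises.
            self._teardown()

    @abstractmethod
    def _setup(self, document: str, **kwargs: Any) -> None: ...

    @abstractmethod
    def _process(self, document: str, **kwargs: Any) -> dict[str, Any]: ...

    @abstractmethod
    def _teardown(self) -> None: ...


class UppercaseUnit(SketchUnit):
    """Toy subclass that records which hooks were called."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def _setup(self, document: str, **kwargs: Any) -> None:
        self.calls.append("setup")

    def _process(self, document: str, **kwargs: Any) -> dict[str, Any]:
        self.calls.append("process")
        return {"text": document.upper()}

    def _teardown(self) -> None:
        self.calls.append("teardown")


unit = UppercaseUnit()
result = unit.run("drawing register")
```

Callers only ever invoke `run()`; the order of the three hooks is decided once in the base class.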

__init__

__init__(use_persisted_results=False, persist_results=True)

Initialize processing unit.

Parameters:

Name Type Description Default
use_persisted_results bool

If True, use existing results from blob storage instead of reprocessing

False
persist_results bool

If True, save results to blob storage after processing

True
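
One way the two flags might interact is sketched below. A plain dict stands in for blob storage, and `run_with_persistence` is a hypothetical helper; falling back to processing when no persisted result exists is an assumption, not documented behaviour.

```python
from typing import Any, Callable

# Assumption: a plain dict stands in for blob storage in this sketch.
_store: dict[str, dict[str, Any]] = {}


def run_with_persistence(
    key: str,
    process: Callable[[], dict[str, Any]],
    use_persisted_results: bool = False,
    persist_results: bool = True,
) -> dict[str, Any]:
    """Return stored results when allowed, otherwise process and optionally save."""
    if use_persisted_results and key in _store:
        return _store[key]
    results = process()
    if persist_results:
        _store[key] = results
    return results


calls: list[int] = []


def expensive() -> dict[str, Any]:
    # Counts invocations so the effect of the flags is visible.
    calls.append(1)
    return {"pages": 3}


first = run_with_persistence("doc-1", expensive)
second = run_with_persistence("doc-1", expensive, use_persisted_results=True)
```

With the defaults, the first call processes and saves; the second call reuses the saved result instead of reprocessing.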

run abstractmethod

run(document_version, **kwargs)

Execute the processing unit.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required

Other Parameters:

Name Type Description
session Optional[Session]

Session for accessing the metadata specification

**kwargs Any

Other processor-specific arguments

Azure connection processing unit for establishing session and connector in DAG workflow.

HoppaConnector

Bases: ProcessingUnit

Processing unit for establishing connection to Hoppa session on MS Azure.

This unit handles the initial connection setup as the first step in a DAG, creating both the session and connector that other units will use.

It also selects the document to process from the list of documents in the session.

run

run(**kwargs)

Execute the processing unit workflow.

HoppaConnector doesn't need document_version or session as inputs since it creates them.

Other Parameters:

Name Type Description
organization str

Organization identifier for the session

workspace str

Workspace identifier within the organization

session_id str

Unique session identifier

user_id str

User identifier for authentication and logging

include_blob (str, Optional)

A blob_name that should be included when indexing document versions. All other blobs will be ignored. If not provided, the method returns the first DocumentVersion in the session.

Returns:

Type Description
dict[str, Any]

Processing results containing 'session' and 'document_version'
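
The include_blob selection rule can be illustrated with a small sketch. `pick_blob` is a hypothetical helper and the list of blob names stands in for the session's documents; raising an error when nothing matches is an assumption.

```python
from typing import Optional


def pick_blob(blob_names: list[str], include_blob: Optional[str] = None) -> str:
    """Pick the blob to process from a session's blob names (hypothetical helper)."""
    if not blob_names:
        raise ValueError("session contains no documents")
    if include_blob is None:
        # No filter given: fall back to the first document in the session.
        return blob_names[0]
    # A filter was given: every other blob is ignored.
    matches = [name for name in blob_names if name == include_blob]
    if not matches:
        # Assumption: a missing blob is treated as an error.
        raise ValueError(f"{include_blob!r} not found in session")
    return matches[0]


blobs = ["a.pdf", "b.dwg", "c.docx"]
default_choice = pick_blob(blobs)
filtered_choice = pick_blob(blobs, include_blob="b.dwg")
```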

FilePreprocessor

Bases: ProcessingUnit

Processing unit for extracting content from files.

Handles:

  • File stream loading
  • ZIP file extraction
  • Content extraction using general_purpose_read
  • Basic file type and size metrics

Note: This processor doesn't use result caching since file content is always processed fresh.

run

run(document_version, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required

Other Parameters:

Name Type Description
chunk_pages (bool, Optional)

If True, extracted file content will be chunked by page and each chunk saved separately. Default is False.

page_limit (int, Optional)

Applies to PDF only. Number of pages to extract from the document. Setting to 0 will extract all pages. Default is 10.

clean_markdown (bool, Optional)

Applies to PDF only. Determines whether to convert HTML content (e.g. tables) to pure markdown. Default is True.

Returns: Processing results
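
The page_limit semantics (0 means all pages, default 10) can be expressed as a small helper. `pages_to_extract` is a hypothetical function written for illustration; rejecting negative limits is an assumption.

```python
def pages_to_extract(total_pages: int, page_limit: int = 10) -> range:
    """Return the zero-based page indices to extract (hypothetical helper)."""
    if page_limit < 0:
        # Assumption: negative limits are invalid.
        raise ValueError("page_limit must be >= 0")
    if page_limit == 0:
        # 0 is the documented sentinel for "extract every page".
        return range(total_pages)
    # Never ask for more pages than the document has.
    return range(min(page_limit, total_pages))


default_pages = list(pages_to_extract(25))
all_pages = list(pages_to_extract(25, page_limit=0))
short_doc = list(pages_to_extract(3))
```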

MetadataExtractor

Bases: ProcessingUnit

Processing unit for extracting metadata from various file types.

Handles:

  • Images: downsample to reduce LLM token consumption and extract metadata
  • PDFs: screenshot the first page to give the LLM awareness of the page layout. If run() is called with all_pages set to True, a thumbnail image is generated for each page.
  • Office documents: Extract embedded metadata
  • CAD files: Convert and extract metadata
  • Smart result merging that preserves existing metadata
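
One plausible reading of result merging that preserves existing metadata is sketched below: keys produced by the current run are updated, and everything already stored is kept. `merge_results` and the sample keys are hypothetical; the unit's actual merge rules may differ.

```python
from typing import Any


def merge_results(existing: dict[str, Any], new: dict[str, Any]) -> dict[str, Any]:
    """Merge new results over existing ones without dropping untouched keys."""
    merged = dict(existing)  # copy so the stored results are not mutated
    merged.update(new)       # new values win only for keys the run produced
    return merged


existing = {"author": "J. Smith", "thumbnail": "page-1.png"}
new = {"thumbnail": "page-1-v2.png", "page_count": 12}
merged = merge_results(existing, new)
```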

run

run(document_version, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required

Other Parameters:

Name Type Description
all_pages (bool, Optional)

If True, thumbnail images will be generated for all PDF pages. Default is False.

is_zipfile (bool, Optional)

If True, the parent DocumentVersion is a zip file and the DocumentVersion.signed_url property points at the zip file, not the extracted file. The content will be extracted and uploaded to a temporary Azure Blob Storage location so that third-party services (e.g. Autodesk Design Automation) can access it.

Returns: Processing results

Description query unit for creating file summaries using OpenAI.

DescriptionQuery

Bases: ProcessingUnit

Processing unit for generating file descriptions using OpenAI.

run

run(document_version, **kwargs)

Other Parameters:

Name Type Description
system_prompt str

Instructions for LLM when generating the description. Replaces default system prompt.

LLM analysis unit for direct language model-based analysis as fallback.

LLMClassifier

Bases: ProcessingUnit

Processing unit for direct LLM-based analysis.

Handles:

  • Direct analysis using language models when search/reranking fails
  • Hierarchical analysis with parent code support
  • Processing multiple analysers that need fallback analysis
  • Integration with existing hierarchical_list_classification function
  • Smart result merging that preserves existing classifiers

run

run(document_version, session=None, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required
session Session | None

Optional session for reading query definitions from storage.

None

Other Parameters:

Name Type Description
property_id str | list

The id (or a list of ids) of the MetadataProperties containing the query definition(s).

direct_classifiers (dict[dict | MetadataProperty], Optional)

A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields.

Returns: Processing results
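
The shape of direct_classifiers can be checked before calling run(). The sketch below is illustrative only: `missing_fields` is a hypothetical helper, and keying the outer dictionary by an arbitrary label is an assumption, since the documentation only states that each classifier needs 'code' and 'description'.

```python
REQUIRED_FIELDS = ("code", "description")


def missing_fields(direct_classifiers: dict[str, dict]) -> dict[str, list[str]]:
    """Map each classifier label to the required fields it lacks (hypothetical helper)."""
    problems: dict[str, list[str]] = {}
    for label, classifier in direct_classifiers.items():
        missing = [field for field in REQUIRED_FIELDS if field not in classifier]
        if missing:
            problems[label] = missing
    return problems


# Assumption: outer keys are arbitrary labels chosen by the caller.
classifiers = {
    "drawing": {"code": "DR", "description": "Drawing"},
    "report": {"code": "RP"},  # lacks 'description'
}
problems = missing_fields(classifiers)
```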

LLMTagger

Bases: ProcessingUnit

Processing unit for tagging of documents using Azure OpenAI LLM.

run

run(document_version, session=None, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required
session Session | None

Optional session for reading tag definitions from storage. If not provided, a list of tags must be supplied for the ProcessingUnit to run.

None

Other Parameters:

Name Type Description
property_id str | list

The id (or a list of ids) of the MetadataProperties containing the query definition(s). Default is 'tags'.

direct_tags (list[str], Optional)

A set of tags passed directly as a list.

Returns:

Type Description
dict[str, str]

Processing results

Embedding classification unit for similarity-based analysis using vector embeddings.

RerankerClassifier

Bases: ProcessingUnit

Classifier processing unit for embedding-based classification.

Handles:

  • Vector similarity matching between search term response and classification codes
  • Usually only used to process one classifier at a time, but can handle multiple
  • Only one query response can be passed
  • Hierarchical analysis support with parent codes
  • Configurable similarity thresholds

run

run(document_version, session=None, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required
session Session | None

Optional session for reading query definitions from storage. If not provided, a list of classifiers must be supplied for the ProcessingUnit to run.

None

Other Parameters:

Name Type Description
property_id str | list

The id (or a list of ids) of the MetadataProperties containing the query definition(s).

direct_classifiers (dict[dict | MetadataProperty], Optional)

A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields.

query_response str

A free-text response (generated by an earlier workflow step) to match against one of the MetadataProperty classifier options. If not provided, the unit attempts to use the MetadataProperty 'query' definition to generate a response. If the MetadataProperty has no defined query, the unit reverts to LLM classification.

similarity_threshold float

A top-ranked classifier with a similarity score greater than or equal to this threshold will automatically be picked. Must be <= 1.

Returns: Processing results
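
The similarity_threshold rule can be illustrated with plain cosine similarity over toy vectors. This is a sketch under stated assumptions: `cosine` and `pick_code` are hypothetical helpers, the two-dimensional vectors stand in for real embeddings, and returning None to signal "no automatic pick" is an assumption about how a fallback might be triggered.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_code(
    query_vec: list[float],
    code_vecs: dict[str, list[float]],
    similarity_threshold: float,
) -> Optional[str]:
    """Return the top-ranked code if its score meets the threshold, else None."""
    if similarity_threshold > 1:
        raise ValueError("similarity_threshold must be <= 1")
    ranked = sorted(
        ((cosine(query_vec, vec), code) for code, vec in code_vecs.items()),
        reverse=True,
    )
    if not ranked:
        return None
    top_score, top_code = ranked[0]
    # Greater than or equal to the threshold counts as an automatic pick.
    return top_code if top_score >= similarity_threshold else None


codes = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
confident = pick_code([0.9, 0.1], codes, similarity_threshold=0.8)
ambiguous = pick_code([1.0, 1.0], codes, similarity_threshold=0.8)
```

In the second call both codes score about 0.71, below the threshold, so nothing is picked automatically.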

UniclassClassifier

Bases: ProcessingUnit

Classifier processing unit for Uniclass classification of file contents.

Handles:

  • Extracting or generating content lists from files
  • Classifying each content entry against Uniclass
  • Supporting custom filter parameters
  • Processing multiple content entries in parallel
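
Processing content entries in parallel might look like the sketch below, which uses a thread pool from the standard library. `classify_entry` is a placeholder for the real per-entry classification call, and the choice of threads and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_entry(entry: str) -> str:
    # Placeholder: the real unit would classify the entry against Uniclass here.
    return f"classified:{entry}"


def classify_all(entries: list[str], max_workers: int = 4) -> list[str]:
    """Classify entries concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(classify_entry, entries))


entries = ["upstand beam", "fire door", "roof membrane"]
results = classify_all(entries)
```

Threads suit this kind of work when each entry spends most of its time waiting on a network call.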

run

run(document_version, **kwargs)

Execute the processing unit workflow.

Parameters:

Name Type Description Default
document_version DocumentVersion

The document to process

required

Other Parameters:

Name Type Description
property_id str

The id (or a list of ids) of the MetadataProperties to save results to.

filter (str, Optional)

An OData query that filters by Uniclass table (subsystem). See Examples below.

n (int, Optional)

Upper limit of Uniclass codes to generate. Default is 5.

Returns:

Type Description
dict[str, str]

Processing results

Examples:

Classify to the Products table only.

>>> processing_unit.run(text="In-situ reinforced concrete upstand beam", filter="subsystem eq Products")