Processors
Base class for processing units.
ProcessingUnit
Bases: ABC
Abstract base class for processing units.
Template Method pattern:
- run() orchestrates the workflow
- Subclasses implement _setup(), _process(), _teardown()
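The Template Method pattern above can be sketched as follows. This is a minimal illustration, not the actual implementation (the real `run()` also handles result persistence and processor-specific keyword arguments); the subclass name and toy behaviour are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any


class UnitSketch(ABC):
    """Illustrative Template Method skeleton."""

    def run(self, document: Any, **kwargs: Any) -> dict[str, Any]:
        # run() fixes the order of the steps; subclasses fill them in.
        self._setup()
        try:
            return self._process(document, **kwargs)
        finally:
            self._teardown()

    @abstractmethod
    def _setup(self) -> None: ...

    @abstractmethod
    def _process(self, document: Any, **kwargs: Any) -> dict[str, Any]: ...

    @abstractmethod
    def _teardown(self) -> None: ...


class UpperCaser(UnitSketch):
    """Toy subclass: 'processes' a string by upper-casing it."""

    def _setup(self) -> None:
        self.calls: list[str] = []

    def _process(self, document: Any, **kwargs: Any) -> dict[str, Any]:
        self.calls.append("process")
        return {"result": str(document).upper()}

    def _teardown(self) -> None:
        self.calls.append("teardown")
```

Because `run()` owns the orchestration, every subclass gets the same setup/teardown guarantees for free, including teardown on failure.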
__init__
__init__(use_persisted_results=False, persist_results=True)
Initialize processing unit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_persisted_results | bool | If True, use existing results from blob storage instead of reprocessing | False |
| persist_results | bool | If True, save results to blob storage after processing | True |
run
abstractmethod
run(document_version, **kwargs)
Execute the processing unit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| session | Optional[Session] | Session for accessing the metadata specification |
| **kwargs | Any | Other processor-specific arguments |
Azure connection processing unit for establishing session and connector in DAG workflow.
HoppaConnector
Bases: ProcessingUnit
Processing unit for establishing connection to Hoppa session on MS Azure.
This unit handles the initial connection setup as the first step in a DAG, creating both the session and connector that other units will use.
It also handles the logic for retrieving the requested document from the list of documents in the session.
run
run(**kwargs)
Execute the processing unit workflow.
HoppaConnector doesn't need document_version or session as inputs since it creates them.
Other Parameters:
| Name | Type | Description |
|---|---|---|
| organization | str | Organization identifier for the session |
| workspace | str | Workspace identifier within the organization |
| session_id | str | Unique session identifier |
| user_id | str | User identifier for authentication and logging |
| include_blob | (str, Optional) | A blob_name that should be included when indexing document versions. All other blobs will be ignored. If not provided, the method returns the first DocumentVersion in the session. |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Processing results containing 'session' and 'document_version' |
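Because HoppaConnector produces the 'session' and 'document_version' that downstream units consume, a DAG can be pictured as threading each unit's result dict into the next unit's keyword arguments. A self-contained sketch with hypothetical stand-in units, not the real wiring:

```python
from typing import Any, Callable

Unit = Callable[..., dict[str, Any]]


def run_pipeline(units: list[Unit], **initial: Any) -> dict[str, Any]:
    """Thread each unit's result dict into the next unit's kwargs."""
    state = dict(initial)
    for unit in units:
        state.update(unit(**state))
    return state


# Hypothetical stand-ins for the real processing units.
def connector(organization: str, session_id: str, **_: Any) -> dict[str, Any]:
    # First step: create the session and pick a document version.
    return {"session": f"{organization}/{session_id}", "document_version": "v1"}


def preprocessor(document_version: str, **_: Any) -> dict[str, Any]:
    return {"content": f"text of {document_version}"}
```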
FilePreprocessor
Bases: ProcessingUnit
Processing unit for extracting content from files.
Handles:
- File stream loading
- ZIP file extraction
- Content extraction using general_purpose_read
- Basic file type and size metrics
Note: This processor doesn't use result caching since file content is always processed fresh.
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| chunk_pages | (bool, Optional) | If True, extracted file content will be chunked by page and each chunk saved separately. Default is False. |
| page_limit | (int, Optional) | Applies to PDF only. Number of pages to extract from the document. Setting to 0 will extract all pages. Default is 10. |
| clean_markdown | (bool, Optional) | Applies to PDF only. Determines whether to convert HTML content (e.g. tables) to pure markdown. Default is True. |
Returns: Processing results
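The ZIP-extraction step this unit performs can be illustrated with the standard library. A simplified sketch; the real unit reads the archive from a blob stream rather than raw bytes.

```python
import io
import zipfile


def extract_member(zip_bytes: bytes, member: str) -> bytes:
    """Read a single file out of an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(member)
```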
MetadataExtractor
Bases: ProcessingUnit
Processing unit for extracting metadata from various file types.
Handles:
- Images: downsample to reduce LLM token consumption and extract metadata
- PDFs: screenshot first page to provide the LLM with awareness of page layout. If the `run()` method is called with the argument `all_pages` set to `True`, a thumbnail image will be generated for each page.
- Office documents: Extract embedded metadata
- CAD files: Convert and extract metadata
- Smart result merging that preserves existing metadata
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| all_pages | (bool, Optional) | If True, thumbnail images will be generated for all PDF pages. Default is False. |
| is_zipfile | (bool, Optional) | If True, the parent DocumentVersion is a zipfile and the DocumentVersion.signed_url property points at a zip file, not the extracted file. The content will be extracted and uploaded to a temporary Azure Blob Storage location so that 3rd party services (e.g. Autodesk Design Automation) can access it. |
Returns: Processing results
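The image downsampling mentioned above can be sketched as naive nearest-neighbour sampling over a pixel grid. Illustrative only; the real unit operates on actual image files, and the sampling strategy here is an assumption.

```python
def downsample(pixels: list[list[int]], factor: int) -> list[list[int]]:
    """Keep every `factor`-th pixel in each dimension to shrink the image
    (and therefore the number of tokens an LLM must consume)."""
    return [row[::factor] for row in pixels[::factor]]
```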
Description query unit for creating file summaries using OpenAI.
DescriptionQuery
Bases: ProcessingUnit
Processing unit for generating file descriptions using OpenAI.
run
run(document_version, **kwargs)
Other Parameters:
| Name | Type | Description |
|---|---|---|
| system_prompt | str | Instructions for the LLM when generating the description. Replaces the default system prompt. |
LLM analysis unit for direct language model-based analysis as fallback.
LLMClassifier
Bases: ProcessingUnit
Processing unit for direct LLM-based analysis.
Handles:
- Direct analysis using language models when search/reranking fails
- Hierarchical analysis with parent code support
- Processing multiple analysers that need fallback analysis
- Integration with existing hierarchical_list_classification function
- Smart result merging that preserves existing classifiers
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading query definitions from storage. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). |
| direct_classifiers | (dict[dict \| MetadataProperty], Optional) | A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields. |
Returns: Processing results
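The shape `direct_classifiers` expects can be checked like this. An illustrative validator, not part of the library; only the 'code' and 'description' fields are documented as required.

```python
def validate_classifiers(classifiers: dict[str, dict]) -> list[str]:
    """Return the keys of entries missing a required field."""
    required = {"code", "description"}
    return [key for key, entry in classifiers.items() if not required <= entry.keys()]
```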
LLMTagger
Bases: ProcessingUnit
Processing unit for tagging of documents using Azure OpenAI LLM.
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading tag definitions from storage. If not provided, a list of tags must be provided for the ProcessingUnit to run. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). Default is 'tags'. |
| direct_tags | (list[str], Optional) | A set of tags passed directly as a list. |
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Processing results |
Embedding classification unit for similarity-based analysis using vector embeddings.
RerankerClassifier
Bases: ProcessingUnit
Classifier processing unit for embedding-based classification.
Handles:
- Vector similarity matching between search term response and classification codes
- Usually only used to process one classifier at a time, but can handle multiple
- Only one query response can be passed
- Hierarchical analysis support with parent codes
- Configurable similarity thresholds
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading query definitions from storage. If not provided, a list of classifiers must be provided for the ProcessingUnit to run. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). |
| direct_classifiers | (dict[dict \| MetadataProperty], Optional) | A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields. |
| query_response | str | A free-text response (generated by an earlier workflow step) to match against one of the MetadataProperty classifier options. If not provided, the unit will attempt to use the MetadataProperty 'query' definition to generate a response. If the MetadataProperty doesn't have a defined query, it will revert to LLM classification. |
| similarity_threshold | float | A top-ranked classifier with a similarity score greater than or equal to this threshold will automatically be picked. Must be <= 1. |
Returns: Processing results
UniclassClassifier
Bases: ProcessingUnit
Classifier processing unit for Uniclass classification of file contents.
Handles:
- Extracting or generating content lists from files
- Classifying each content entry against Uniclass
- Supporting custom filter parameters
- Processing multiple content entries in parallel
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | Pass a document for processing. | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str | The id (or a list of ids) of the MetadataProperties to save results to. |
| filter | (str, Optional) | An OData query to filter the Uniclass table (subsystem). See examples below. |
| n | (int, Optional) | Upper limit of Uniclass codes to generate. Default is 5. |
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Processing results |
Examples:
Classify to the Products table only.
>>> processing_unit.run(text="In-situ reinforced concrete upstand beam", filter="subsystem eq Products")