Processors
Base class for processing units.
ProcessingUnit
Bases: ABC
Abstract base class for processing units.
Template Method pattern:
- run() orchestrates the workflow
- Subclasses implement _setup(), _process(), _teardown()
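The Template Method pattern above can be sketched as follows. This is a minimal illustration, not the actual implementation (the real `run()` also handles result persistence and processor-specific keyword arguments); the subclass name and toy behaviour are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any


class UnitSketch(ABC):
    """Illustrative Template Method skeleton."""

    def run(self, document: Any, **kwargs: Any) -> dict[str, Any]:
        # run() fixes the order of the steps; subclasses fill them in.
        self._setup()
        try:
            return self._process(document, **kwargs)
        finally:
            self._teardown()

    @abstractmethod
    def _setup(self) -> None: ...

    @abstractmethod
    def _process(self, document: Any, **kwargs: Any) -> dict[str, Any]: ...

    @abstractmethod
    def _teardown(self) -> None: ...


class UpperCaser(UnitSketch):
    """Toy subclass: 'processes' a string by upper-casing it."""

    def _setup(self) -> None:
        self.calls: list[str] = []

    def _process(self, document: Any, **kwargs: Any) -> dict[str, Any]:
        self.calls.append("process")
        return {"result": str(document).upper()}

    def _teardown(self) -> None:
        self.calls.append("teardown")
```

Because `run()` owns the orchestration, every subclass gets the same setup/teardown guarantees for free, including teardown on failure.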
__init__
__init__(use_persisted_results=False, persist_results=True)
Initialize processing unit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_persisted_results | bool | If True, use existing results from blob storage instead of reprocessing | False |
| persist_results | bool | If True, save results to blob storage after processing | True |
run
abstractmethod
run(document_version, **kwargs)
Execute the processing unit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| session | Optional[Session] | Session for accessing the metadata specification |
| **kwargs | Any | Other processor-specific arguments |
Azure connection processing unit for establishing session and connector in DAG workflow.
HoppaConnector
Bases: ProcessingUnit
Processing unit for establishing connection to Hoppa session on MS Azure.
This unit handles the initial connection setup as the first step in a DAG, creating both the session and connector that other units will use.
It also handles the logic for retrieving the requested document from the list of documents in the session.
run
run(**kwargs)
Execute the processing unit workflow.
HoppaConnector doesn't need document_version or session as inputs since it creates them.
Other Parameters:
| Name | Type | Description |
|---|---|---|
| organization | str | Organization identifier for the session |
| workspace | str | Workspace identifier within the organization |
| session_id | str | Unique session identifier |
| user_id | str | User identifier for authentication and logging |
| include_blob | (str, Optional) | A blob_name that should be included when indexing document versions. All other blobs will be ignored. If not provided, the method returns the first DocumentVersion in the session. |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Processing results containing 'session' and 'document_version' |
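Because HoppaConnector produces the 'session' and 'document_version' that downstream units consume, a DAG can be pictured as threading each unit's result dict into the next unit's keyword arguments. A self-contained sketch with hypothetical stand-in units, not the real wiring:

```python
from typing import Any, Callable

Unit = Callable[..., dict[str, Any]]


def run_pipeline(units: list[Unit], **initial: Any) -> dict[str, Any]:
    """Thread each unit's result dict into the next unit's kwargs."""
    state = dict(initial)
    for unit in units:
        state.update(unit(**state))
    return state


# Hypothetical stand-ins for the real processing units.
def connector(organization: str, session_id: str, **_: Any) -> dict[str, Any]:
    # First step: create the session and pick a document version.
    return {"session": f"{organization}/{session_id}", "document_version": "v1"}


def preprocessor(document_version: str, **_: Any) -> dict[str, Any]:
    return {"content": f"text of {document_version}"}
```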
FilePreprocessor
Bases: ProcessingUnit
Processing unit for extracting content from files.
Handles:
- File stream loading
- ZIP file extraction
- Content extraction using general_purpose_read
- Basic file type and size metrics
Note: This processor doesn't use result caching since file content is always processed fresh.
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| chunk_pages | (bool, Optional) | If True, extracted file content will be chunked by page and each chunk saved separately. Default is False. |
| page_limit | (int, Optional) | Applies to PDF only. Number of pages to extract from the document. Setting to 0 will extract all pages. Default is 10. |
| clean_markdown | (bool, Optional) | Applies to PDF only. Determines whether to convert HTML content (e.g. tables) to pure markdown. Default is True. |
Returns: Processing results
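The ZIP-extraction step this unit performs can be illustrated with the standard library. A simplified sketch; the real unit reads the archive from a blob stream rather than raw bytes.

```python
import io
import zipfile


def extract_member(zip_bytes: bytes, member: str) -> bytes:
    """Read a single file out of an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(member)
```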
MetadataExtractor
Bases: ProcessingUnit
Processing unit for extracting metadata from various file types.
Handles:
- Images: downsample to reduce LLM token consumption and extract metadata
- PDFs: screenshot first page to provide the LLM with awareness of page layout. If the `run()` method is called with the argument `all_pages` set to `True`, a thumbnail image will be generated for each page.
- Office documents: Extract embedded metadata
- CAD files: Convert and extract metadata
- Smart result merging that preserves existing metadata
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| all_pages | (bool, Optional) | If True, thumbnail images will be generated for all PDF pages. Default is False. |
| is_zipfile | (bool, Optional) | If True, the parent DocumentVersion is a zipfile and the DocumentVersion.signed_url property points at a zip file, not the extracted file. The content will be extracted and uploaded to a temporary Azure Blob Storage location so that 3rd party services (e.g. Autodesk Design Automation) can access it. |
Returns: Processing results
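The image downsampling mentioned above can be sketched as naive nearest-neighbour sampling over a pixel grid. Illustrative only; the real unit operates on actual image files, and the sampling strategy here is an assumption.

```python
def downsample(pixels: list[list[int]], factor: int) -> list[list[int]]:
    """Keep every `factor`-th pixel in each dimension to shrink the image
    (and therefore the number of tokens an LLM must consume)."""
    return [row[::factor] for row in pixels[::factor]]
```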
Description query unit for creating file summaries using OpenAI.
DescriptionQuery
Bases: ProcessingUnit
Processing unit for generating file descriptions using OpenAI.
run
run(document_version, **kwargs)
Other Parameters:
| Name | Type | Description |
|---|---|---|
| system_prompt | str | Instructions for the LLM when generating the description. Replaces the default system prompt. |
LLM analysis unit for direct language model-based analysis as fallback.
LLMClassifier
Bases: ProcessingUnit
Processing unit for direct LLM-based analysis.
Handles:
- Direct analysis using language models when search/reranking fails
- Hierarchical analysis with parent code support
- Processing multiple analysers that need fallback analysis
- Integration with existing hierarchical_list_classification function
- Smart result merging that preserves existing classifiers
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading query definitions from storage. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). |
| direct_classifiers | (dict[dict \| MetadataProperty], Optional) | A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields. |
Returns: Processing results
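The shape `direct_classifiers` expects can be checked like this. An illustrative validator, not part of the library; only the 'code' and 'description' fields are documented as required.

```python
def validate_classifiers(classifiers: dict[str, dict]) -> list[str]:
    """Return the keys of entries missing a required field."""
    required = {"code", "description"}
    return [key for key, entry in classifiers.items() if not required <= entry.keys()]
```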
LLMTagger
Bases: ProcessingUnit
Processing unit for tagging of documents using Azure OpenAI LLM.
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading tag definitions from storage. If not provided, a list of tags must be provided for the ProcessingUnit to run. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). Default is 'tags'. |
| direct_tags | (list[str], Optional) | A set of tags passed directly as a list. |
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Processing results |
Embedding classification unit for similarity-based analysis using vector embeddings.
RerankerClassifier
Bases: ProcessingUnit
Classifier processing unit for embedding-based classification.
Handles:
- Vector similarity matching between search term response and classification codes
- Usually only used to process one classifier at a time, but can handle multiple
- Only one query response can be passed
- Hierarchical analysis support with parent codes
- Configurable similarity thresholds
run
run(document_version, session=None, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | The document to process | required |
| session | Session \| None | Optional session for reading query definitions from storage. If not provided, a list of classifiers must be provided for the ProcessingUnit to run. | None |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str \| list | The id (or a list of ids) of the MetadataProperties containing the query definition(s). |
| direct_classifiers | (dict[dict \| MetadataProperty], Optional) | A set of classifiers passed directly as a dictionary. Each classifier dictionary must have 'code' and 'description' fields. |
| query_response | str | A free-text response (generated by an earlier workflow step) to match against one of the MetadataProperty classifier options. If not provided, the unit will attempt to use the MetadataProperty 'query' definition to generate a response. If the MetadataProperty doesn't have a defined query, it will revert to LLM classification. |
| similarity_threshold | float | A top-ranked classifier with a similarity score greater than or equal to this threshold will automatically be picked. Must be <= 1. |
Returns: Processing results
UniclassClassifier
Bases: ProcessingUnit
Classifier processing unit for Uniclass classification of file contents.
Handles:
- Extracting or generating content lists from files
- Classifying each content entry against Uniclass
- Supporting custom filter parameters
- Processing multiple content entries in parallel
run
run(document_version, **kwargs)
Execute the processing unit workflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_version | DocumentVersion | Pass a document for processing. | required |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| property_id | str | The id (or a list of ids) of the MetadataProperties to save results to. |
| filter | (str, Optional) | An OData query to filter the Uniclass table (subsystem). See examples below. |
| n | (int, Optional) | Upper limit of Uniclass codes to generate. Default is 5. |
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Processing results |
Examples:
Classify to the Products table only.
>>> processing_unit.run(text="In-situ reinforced concrete upstand beam", filter="subsystem eq Products")