Bindings

Bindings are how Workbench connects to remotely-stored documents for analysis.

DocumentVersion

Bases: InformationContainer

Document version container representing a top-level document.

Represents a document version as a specialized information container that serves as the root node for organizing document-related data including metadata, results, and document sheets containing sub-components.

Attributes:

Name	Type	Description
`file_name`	`str`	Name of the document file
`id`	`str`	Unique identifier for the document version
`directory`	`str`	Directory path where children should be stored
`source`	`str`	Source system or origin of the document
`web_url`	`str`	Web URL for accessing the document in source system
`attributes`	`Dict[Any, Any]`	Source-specific properties and metadata
`file_type`	`str`	File extension (lowercase, without dot)
`metadata`	`InformationContainer`	Container for document metadata
`results`	`InformationContainer`	Container for analysis results
`sheets`	`List[DocumentVersionSheet]`	List of document sheets

__init__

__init__(signed_url, id=None, file_name=None, directory=None, source='<UNKNOWN>', web_url=None, attributes=None, url_generator=None, **url_params)

Initialize a document version.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	Signed URL for accessing the document	required
`id`	`str`	Unique identifier in the directory. Defaults to file_name if not provided.	`None`
`file_name`	`str`	Document file name. Extracted from URL if not provided.	`None`
`directory`	`str`	Directory path for storing children. Can be combined with id to form a unique surrogate key. Defaults to empty string.	`None`
`source`	`str`	Source system identifier. Defaults to '<UNKNOWN>'.	`'<UNKNOWN>'`
`web_url`	`str`	Web URL in source system. Defaults to None.	`None`
`attributes`	`Dict[Any, Any]`	Source-specific properties. Defaults to empty dict.	`None`
`url_generator`	`callable`	Function to regenerate expired URLs. Defaults to None.	`None`
`**url_params`	`Any`	Additional parameters passed to parent InformationContainer.	`{}`
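
The defaulting rules in the table above can be sketched as follows. This is a minimal illustration rather than the library's implementation: `resolve_document_fields` is a hypothetical helper, and it assumes the file name is the last path segment of the signed URL, with the query string ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def resolve_document_fields(signed_url, id=None, file_name=None, directory=None):
    """Hypothetical helper mirroring the documented defaults of DocumentVersion."""
    if file_name is None:
        # Assumption: the file name is the last path segment of the URL;
        # the query string (where a signature would normally live) is ignored.
        file_name = PurePosixPath(unquote(urlparse(signed_url).path)).name
    if id is None:
        # Documented: id defaults to file_name.
        id = file_name
    # Documented: directory defaults to an empty string.
    directory = directory or ""
    # Documented: file_type is the lowercase extension without the dot.
    file_type = PurePosixPath(file_name).suffix.lstrip(".").lower()
    return {
        "id": id,
        "file_name": file_name,
        "directory": directory,
        "file_type": file_type,
    }


fields = resolve_document_fields(
    "https://example.invalid/container/workspace-directory/Report.PDF?sig=abc"
)
```

Here `fields["file_type"]` is `"pdf"` and `fields["id"]` falls back to the extracted file name, `"Report.PDF"`.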

add_sheet

add_sheet()

Add a new sheet to this document version.

Creates a new DocumentVersionSheet instance and appends it to the sheets list.

Returns:

Name	Type	Description
`DocumentVersionSheet`	`DocumentVersionSheet`	The newly created sheet that was added to the document.

bind_metadata

bind_metadata(signed_url, content=None, headers=None)

Bind metadata to this document version.

Creates an InformationContainer for metadata and optionally writes content to storage if content has been provided.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	The signed URL for accessing the metadata storage.	required
`content`	`Any`	The metadata content to write to storage. Defaults to None.	`None`
`headers`	`Dict[str, str]`	HTTP headers for accessing the storage. Defaults to None.	`None`

Returns:

Type	Description
`None`	None

bind_results

bind_results(signed_url, content=None, headers=None)

Bind results to this document version.

Creates an InformationContainer for results and optionally writes content to storage if content has been provided.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	The signed URL for accessing the results storage.	required
`content`	`Any`	The results content to write to storage. Defaults to None.	`None`
`headers`	`Dict[str, str]`	HTTP headers for accessing the storage. Defaults to None.	`None`

Returns:

Type	Description
`None`	None

get

get(max_retries=3, backoff_factor=1.5, num_pages=10, max_download_size=400 * 1024 * 1024)

Fetch data from the signed URL with exponential backoff retry logic.

For PDFs, the method can extract only the first `num_pages` pages to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) markers that occur before the final bytes of the file, which allows pages to be extracted from PDFs substantially larger than the file size limit.

Parameters:

Name	Type	Description	Default
`max_retries`	`int`	Maximum number of retry attempts. Defaults to 3.	`3`
`backoff_factor`	`float`	Exponential backoff multiplier for retry delays. Defaults to 1.5.	`1.5`
`num_pages`	`int`	Number of pages to extract (applies only to PDFs). Defaults to 10.	`10`
`max_download_size`	`int`	Maximum bytes to attempt to download. Defaults to 400MB.	`400 * 1024 * 1024`

Returns:

Name	Type	Description
`bytes`	`bytes`	The downloaded document content.

Raises:

Type	Description
`RuntimeError`	If all retry attempts fail.
`ValueError`	If the file size exceeds max_download_size.
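
The retry behaviour described above can be sketched with a small standalone function. This is an illustration under stated assumptions, not the method's actual code: `fetch_with_backoff` and its `fetch` and `sleep` parameters are hypothetical, the delay schedule `backoff_factor ** attempt` is assumed, transient failures are assumed to surface as `OSError`, and the size check is applied after the download for simplicity.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, backoff_factor=1.5,
                       max_download_size=400 * 1024 * 1024, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    `fetch` is a hypothetical zero-argument callable returning bytes.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            data = fetch()
        except OSError as exc:  # assumption: transient errors are OSError subclasses
            last_error = exc
            if attempt < max_retries - 1:
                # Assumed delay schedule: backoff_factor ** attempt seconds.
                sleep(backoff_factor ** attempt)
            continue
        if len(data) > max_download_size:
            # Size violations are not retried; they are reported immediately.
            raise ValueError(
                f"download of {len(data)} bytes exceeds {max_download_size}"
            )
        return data
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error


calls = {"n": 0}
delays = []


def flaky_fetch():
    """Fail twice with a transient error, then return some bytes."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return b"%PDF-1.7"


# Record the delays instead of actually sleeping.
content = fetch_with_backoff(flaky_fetch, sleep=delays.append)
```

Passing `delays.append` as `sleep` keeps the example fast and makes the waiting schedule visible: two failures produce delays of 1.0 and 1.5 seconds before the third attempt succeeds.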

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type	Description
`Dict[str, Any]`	Dictionary representation of the DocumentVersion object.

DocumentVersionSheet

Child class of DocumentVersion. Exposes InformationContainer-based attributes and methods for reading and writing file content.

Attributes:

Name	Type	Description
`metadata`	`InformationContainer`	Container for document metadata
`results`	`InformationContainer`	Container for analysis results
`chunks`	`List[InformationContainer]`	A list of text chunks extracted from the file sheet.
`images`	`List[InformationContainer]`	A list of images extracted from the file sheet.
`rows`	`List[InformationContainer]`	A list of table rows extracted from the file sheet.

__init__

__init__()

Constructor for the class. When working with DocumentVersion, call the add_sheet method of that class instead of constructing a sheet directly.

bind_chunk

bind_chunk(signed_url, content=None, headers=None)

Bind a chunk object to the sheet. The chunk is appended to the end of the 'chunks' list.

Info

AzureBlobSession binding expects chunk blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*chunks\.[json|txt], e.g.:

"workspace-directory/file001.pdf0_0chunk.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	A signed or public URL to the chunk object	required
`content`	`Any`	If content is provided, then the method will POST the content to the signed_url. If no content is provided then the `InformationContainer` will simply be bound to the sheet to allow it to be accessed in future.	`None`
`headers`	`dict`	Request headers (e.g. 'content-type') to be included when interacting with the `InformationContainer`.	`None`

bind_image

bind_image(signed_url, content=None, headers=None)

Bind an image object to the sheet. The image is appended to the end of the 'images' list.

Info

AzureBlobSession binding expects image blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*[thumbnail|image]\.[bmp|webp], e.g.:

"workspace-directory/file001.pdf0_0image.webp"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	A signed or public URL to the image object	required
`content`	`Any`	If content is provided, then the method will POST the content to the signed_url. If no content is provided then the `InformationContainer` will simply be bound to the sheet to allow it to be accessed in future.	`None`
`headers`	`dict`	Request headers (e.g. 'content-type') to be included when interacting with the `InformationContainer`.	`None`

bind_metadata

bind_metadata(signed_url, content=None, headers=None)

Bind a metadata object to the sheet.

Info

AzureBlobSession binding expects metadata blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_metadata\.json, e.g.:

"workspace-directory/file001.pdf0_metadata.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	A signed or public URL to the metadata object	required
`content`	`Any`	If content is provided, then the method will POST the content to the signed_url. If no content is provided then the `InformationContainer` will simply be bound to the sheet to allow it to be accessed in future.	`None`
`headers`	`dict`	Request headers (e.g. 'content-type') to be included when interacting with the `InformationContainer`.	`None`

bind_results

bind_results(signed_url, content=None, headers=None)

Bind a results object to the sheet.

Info

AzureBlobSession binding expects result blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_results\.json, e.g.:

"workspace-directory/file001.pdf0_results.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	A signed or public URL to the results object	required
`content`	`Any`	If content is provided, then the method will POST the content to the signed_url. If no content is provided then the `InformationContainer` will simply be bound to the sheet to allow it to be accessed in future.	`None`
`headers`	`dict`	Request headers (e.g. 'content-type') to be included when interacting with the `InformationContainer`.	`None`
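
The metadata and results suffix conventions documented above can be illustrated with a pair of helpers. Both `sheet_blob_name` and `matches_convention` are hypothetical names introduced for this sketch, and the sheet index is assumed to be a non-negative decimal integer; the actual matching logic used by AzureBlobSession may differ.

```python
import re


def sheet_blob_name(directory, doc_id, sheet_index, kind):
    """Build a blob name following the documented metadata/results convention."""
    if kind not in ("metadata", "results"):
        raise ValueError("kind must be 'metadata' or 'results'")
    return f"{directory}/{doc_id}{sheet_index}_{kind}.json"


def matches_convention(blob_name, directory, doc_id, kind):
    """Check a blob name against the convention for any sheet index."""
    # Assumption: sheet_index is a non-negative decimal integer.
    pattern = rf"{re.escape(directory)}/{re.escape(doc_id)}\d+_{kind}\.json"
    return re.fullmatch(pattern, blob_name) is not None


name = sheet_blob_name("workspace-directory", "file001.pdf", 0, "results")
is_valid = matches_convention(name, "workspace-directory", "file001.pdf", "results")
```

With these inputs `name` reproduces the documented example, "workspace-directory/file001.pdf0_results.json". Validating a name before requesting a signed URL is one way to catch blobs that would otherwise not be bound when the session initializes.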

bind_row

bind_row(signed_url, content=None, headers=None)

Bind a row object to the sheet. The row is appended to the end of the 'rows' list.

Info

AzureBlobSession binding expects row blobs to adopt the suffix convention {doc_version.directory}/csv.*\.json, e.g.:

"workspace-directory/csv/table1.csvrow0.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	A signed or public URL to the row object	required
`content`	`Any`	If content is provided, then the method will POST the content to the signed_url. If no content is provided then the `InformationContainer` will simply be bound to the sheet to allow it to be accessed in future.	`None`
`headers`	`dict`	Request headers (e.g. 'content-type') to be included when interacting with the `InformationContainer`.	`None`

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type	Description
`Dict[str, Any]`	A dictionary representation of the DocumentVersionSheet.

InformationContainer

Base class for information containers with secure URL access.

Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration and retry logic for network operations.

Attributes:

Name	Type	Description
`_signed_url`	`str`	The current signed URL for accessing the container
`headers`	`dict`	HTTP headers to use for requests
`_url_generator`	`callable`	Function to regenerate expired URLs
`_url_expires_at`	`datetime`	When the current URL expires

signed_url `property` `writable`

signed_url

Get a valid signed URL, regenerating if necessary.

Returns:

Name	Type	Description
`str`	`str`	A valid signed URL for accessing the container
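
A self-refreshing URL property of this kind might look like the sketch below. `ExpiringUrl` is a hypothetical stand-in, not the library class, and the assumption that the generator returns a `(url, expiry)` pair is made purely for illustration; the real `url_generator` contract is not specified here.

```python
from datetime import datetime, timedelta, timezone


class ExpiringUrl:
    """Hypothetical stand-in showing a URL property that refreshes itself."""

    def __init__(self, signed_url=None, url_generator=None, expires_at=None):
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._url_expires_at = expires_at

    @property
    def signed_url(self):
        expired = (
            self._url_expires_at is not None
            and datetime.now(timezone.utc) >= self._url_expires_at
        )
        if (self._signed_url is None or expired) and self._url_generator is not None:
            # Assumption: the generator returns a (url, expiry) pair.
            self._signed_url, self._url_expires_at = self._url_generator()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value):
        self._signed_url = value


def regenerate():
    """Hypothetical generator issuing a URL that is valid for one hour."""
    expiry = datetime.now(timezone.utc) + timedelta(hours=1)
    return "https://example.invalid/blob?sig=new", expiry


container = ExpiringUrl(
    signed_url="https://example.invalid/blob?sig=old",
    url_generator=regenerate,
    expires_at=datetime.now(timezone.utc) - timedelta(seconds=1),
)
current_url = container.signed_url  # the old URL has expired, so it is regenerated
```

Because the refresh happens inside the getter, callers never need to check expiry themselves; they simply read `signed_url` immediately before each request.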

__init__

__init__(signed_url=None, headers=None, url_generator=None)

Initialize an information container.

Parameters:

Name	Type	Description	Default
`signed_url`	`str`	Initial signed URL for the container. Defaults to None.	`None`
`headers`	`dict`	HTTP headers for requests. Defaults to Azure Blob Storage headers.	`None`
`url_generator`	`callable`	Function to regenerate expired URLs. Defaults to None.	`None`

get

get(max_retries=3, backoff_factor=1.5)

Fetch data from the signed URL with retry logic.

Retrieves data from the container using exponential backoff retry logic for handling transient network errors.

Parameters:

Name	Type	Description	Default
`max_retries`	`int`	Maximum number of retry attempts.	`3`
`backoff_factor`	`float`	Multiplier for retry delay.	`1.5`

Returns:

Name	Type	Description
`bytes`	`bytes`	The raw content from the container

Raises:

Type	Description
`RuntimeError`	If all retry attempts fail

set

set(data, max_retries=3, backoff_factor=1.5)

Write data to the container with retry logic.

Writes data to the resource at the signed URL using exponential backoff retry logic for handling transient network errors.

Parameters:

Name	Type	Description	Default
`data`	`Any`	The data to be written. Can be binary or text depending on the resource. If headers specify JSON content type, the data will be JSON-encoded.	required
`max_retries`	`int`	Maximum number of retry attempts.	`3`
`backoff_factor`	`float`	Multiplier for retry delay.	`1.5`

Returns:

Name	Type	Description
`int`	`int`	HTTP status code from the successful write operation

Raises:

Type	Description
`RuntimeError`	If all retry attempts fail
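
The content-type-dependent encoding mentioned for `data` can be sketched as below. `encode_payload` is a hypothetical helper, and the rule that any content type containing "json" triggers JSON encoding is an assumption; the library may inspect the headers differently.

```python
import json


def encode_payload(data, headers=None):
    """Hypothetical helper: serialize data according to the content type header."""
    headers = headers or {}
    # Header names are case-insensitive, so normalize before the lookup.
    content_type = next(
        (value for key, value in headers.items() if key.lower() == "content-type"),
        "",
    )
    if "json" in content_type.lower():
        return json.dumps(data).encode("utf-8")
    if isinstance(data, str):
        return data.encode("utf-8")
    return data  # assumed to be bytes already


json_body = encode_payload({"pages": 3}, {"Content-Type": "application/json"})
text_body = encode_payload("plain text", {"content-type": "text/plain"})
```

Setting the JSON content type in `headers` when binding a container is therefore what allows plain Python dictionaries to be passed as `data`.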

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type	Description
`Dict[str, Any]`	Dictionary representation of the InformationContainer object.

Serializable

Base class providing serialization capabilities.

This class provides a simple to_dict() method that converts object attributes to a dictionary format for serialization purposes.

to_dict

to_dict()

Convert object attributes to dictionary format.

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing all object attributes.
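
One way such a `to_dict()` could work, including the recursive handling that the subclasses describe, is sketched below. `SketchSerializable`, `Sheet`, and `Document` are hypothetical classes, and including every instance attribute (private ones too) is an assumption of this sketch.

```python
class SketchSerializable:
    """Hypothetical base class; not the library's Serializable."""

    def to_dict(self):
        # vars(self) yields the instance attributes as a name -> value mapping.
        return {name: _convert(value) for name, value in vars(self).items()}


def _convert(value):
    # Nested serializable objects, lists, and dicts are converted recursively.
    if isinstance(value, SketchSerializable):
        return value.to_dict()
    if isinstance(value, (list, tuple)):
        return [_convert(item) for item in value]
    if isinstance(value, dict):
        return {key: _convert(item) for key, item in value.items()}
    return value


class Sheet(SketchSerializable):
    def __init__(self, index):
        self.index = index


class Document(SketchSerializable):
    def __init__(self, file_name):
        self.file_name = file_name
        self.sheets = [Sheet(0), Sheet(1)]


result = Document("file001.pdf").to_dict()
```

The resulting dictionary contains only built-in types, so it can be handed directly to `json.dumps`.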

Session

Bases: ABC

Abstract base class for data processing sessions.

Provides a framework for applying data processing operations to a scoped set of information containers. Sessions are scoped to specific data partitions for security and prevent cross-talk between different data sources.

All concrete implementations must provide methods to:

Parse information management standards (blueprints for processing)
Parse workflows for data processing pipelines
Parse and catalogue information containers

Attributes:

Name	Type	Description
`organization`	`str`	Organization identifier for the session
`workspace`	`str`	Workspace identifier within the organization
`session_id`	`str`	Unique session identifier
`directory`	`str`	Directory path for the session data. Automatically set as "{workspace}/{session_id}".
`user_id`	`str`	User identifier for authentication and logging
`workflow`	`Dict | None`	Parsed workflow configuration
`classifiers`	`Dict`	Document classifiers configuration
`attributes`	`List`	Document attributes configuration
`tags`	`List[str]`	Document tags configuration
`prompts`	`Dict`	Custom prompts for AI operations
`document_versions`	`List[DocumentVersion]`	Documents in the session
`initialized`	`bool`	Whether the session has been initialized (document versions have been indexed)

__init__

__init__(organization, workspace, session_id, user_id)

Initialize the session by getting the information standard and workflow from storage and parsing the contents.

Parameters:

Name	Type	Description	Default
`organization`	`str`	Organization identifier	required
`workspace`	`str`	Workspace identifier within the organization	required
`session_id`	`str`	Unique session identifier	required
`user_id`	`str`	User identifier for authentication and logging	required

initialize

initialize()

Initialize the session by parsing files and metadata.

Manually triggers the expensive parsing operations to populate the session with document versions. This is separated from the constructor to allow for lazy initialization.

Note

This method is idempotent - calling it multiple times will not re-parse files if the session is already initialized.
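
The lazy, idempotent initialization described above can be sketched as follows. `SketchSession` and `InMemorySession` are hypothetical classes that keep only the initialization logic; a real concrete session would also implement `parse_standard` and `parse_workflow` and would return DocumentVersion objects rather than plain file names.

```python
from abc import ABC, abstractmethod


class SketchSession(ABC):
    """Hypothetical reduction of the Session pattern to its initialization logic."""

    def __init__(self, workspace, session_id):
        # Documented: directory is "{workspace}/{session_id}".
        self.directory = f"{workspace}/{session_id}"
        self.document_versions = []
        self.initialized = False

    def initialize(self):
        if self.initialized:
            return  # already parsed; repeated calls are no-ops
        self.document_versions = self.parse_files()
        self.initialized = True

    @abstractmethod
    def parse_files(self):
        """Return the document versions found in the session."""


class InMemorySession(SketchSession):
    """Hypothetical concrete session backed by a list of file names."""

    def __init__(self, workspace, session_id, file_names):
        super().__init__(workspace, session_id)
        self._file_names = file_names
        self.parse_calls = 0

    def parse_files(self):
        self.parse_calls += 1
        return list(self._file_names)


session = InMemorySession("workspace-directory", "session-001", ["file001.pdf"])
session.initialize()
session.initialize()  # second call does not re-parse
```

After both calls `session.parse_calls` is still 1, which is the property that makes it safe to call `initialize()` defensively before any operation that needs the document list.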

parse_files `abstractmethod`

parse_files()

Parse and catalogue files in the session.

Must be implemented by concrete session classes to discover and parse all files available in the session.

Returns:

Type	Description
`List[DocumentVersion]`	List of document versions found in the session

parse_standard `abstractmethod`

parse_standard()

Parse the information management standard for this session.

Must be implemented by concrete session classes to parse and return the information management standard components.

Returns:

Type	Description
`Tuple[Dict, List, List[str], Dict]`	A tuple containing, in order: `classifiers` (dictionary of available classifiers), `attributes` (list of session attributes), `tags` (list of available tags), and `prompts` (dictionary of configured prompts).

parse_workflow `abstractmethod`

parse_workflow()

Parse the workflow configuration for this session.

Must be implemented by concrete session classes to parse and return the workflow configuration that defines data processing pipelines.

Returns:

Type	Description
`Dict`	Workflow configuration dictionary

to_dict

to_dict()

Convert session object to dictionary format.

Recursively converts the session and all its nested objects into dictionaries for serialization purposes.

Returns:

Type	Description
`Dict[str, Any]`	Dictionary representation of the Session object.
