Bindings

Bindings are how Workbench connects to remotely-stored documents for analysis.

DocumentVersion

Bases: InformationContainer

Document version container representing a top-level document.

Represents a document version as a specialized information container that serves as the root node for organizing document-related data including metadata, results, and document sheets containing sub-components.

Attributes:

Name Type Description
file_name str

Name of the document file

id str

Unique identifier for the document version

directory str

Directory path where children should be stored

source str

Source system or origin of the document

web_url str

Web URL for accessing the document in source system

attributes Dict[Any, Any]

Source-specific properties and metadata

file_type str

File extension (lowercase, without dot)

metadata InformationContainer

Container for document metadata

results InformationContainer

Container for analysis results

sheets List[DocumentVersionSheet]

List of document sheets

__init__

__init__(signed_url, id=None, file_name=None, directory=None, source='<UNKNOWN>', web_url=None, attributes=None, url_generator=None, **url_params)

Initialize a document version.

Parameters:

Name Type Description Default
signed_url str

Signed URL for accessing the document

required
id str

Unique identifier in the directory. Defaults to file_name if not provided.

None
file_name str

Document file name. Extracted from URL if not provided.

None
directory str

Directory path for storing children. Can be combined with id to form a unique surrogate key. Defaults to empty string.

None
source str

Source system identifier. Defaults to '<UNKNOWN>'.

'<UNKNOWN>'
web_url str

Web URL in source system. Defaults to None.

None
attributes Dict[Any, Any]

Source-specific properties. Defaults to empty dict.

None
url_generator callable

Function to regenerate expired URLs. Defaults to None.

None
**url_params Any

Additional parameters passed to parent InformationContainer.

{}
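The defaulting rules in the table above can be sketched as follows. This is a minimal illustration, not Workbench's implementation: `resolve_document_defaults` is a hypothetical helper, and the way the library extracts a file name from the URL is an assumption (here, the last path segment, percent-decoded, with the query string ignored).

```python
import posixpath
from urllib.parse import unquote, urlparse


def resolve_document_defaults(signed_url, id=None, file_name=None, directory=None):
    """Hypothetical helper mirroring the documented defaults; not library code."""
    if file_name is None:
        # Assumption: the file name is the last path segment of the URL,
        # ignoring the query string that carries the signature.
        file_name = posixpath.basename(unquote(urlparse(signed_url).path))
    if id is None:
        id = file_name  # documented: id defaults to file_name
    if directory is None:
        directory = ""  # documented: directory defaults to an empty string
    # Documented: file_type is the lowercase extension without the dot.
    file_type = posixpath.splitext(file_name)[1].lstrip(".").lower()
    return {
        "id": id,
        "file_name": file_name,
        "directory": directory,
        "file_type": file_type,
    }


info = resolve_document_defaults(
    "https://example.invalid/container/Report%20Q1.PDF?sig=abc"
)
print(info)
```

Because `directory` and `id` together can form a surrogate key, passing an explicit `id` is the safer choice when two files in the same directory could share a file name.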

add_sheet

add_sheet()

Add a new sheet to this document version.

Creates a new DocumentVersionSheet instance and appends it to the sheets list.

Returns:

Name Type Description
DocumentVersionSheet DocumentVersionSheet

The newly created sheet that was added to the document.

bind_metadata

bind_metadata(signed_url, content=None, headers=None)

Bind metadata to this document version.

Creates an InformationContainer for metadata and optionally writes content to storage if content has been provided.

Parameters:

Name Type Description Default
signed_url str

The signed URL for accessing the metadata storage.

required
content Any

The metadata content to write to storage. Defaults to None.

None
headers Dict[str, str]

HTTP headers for accessing the storage. Defaults to None.

None

Returns:

Type Description
None

None

bind_results

bind_results(signed_url, content=None, headers=None)

Bind results to this document version.

Creates an InformationContainer for results and optionally writes content to storage if content has been provided.

Parameters:

Name Type Description Default
signed_url str

The signed URL for accessing the results storage.

required
content Any

The results content to write to storage. Defaults to None.

None
headers Dict[str, str]

HTTP headers for accessing the storage. Defaults to None.

None

Returns:

Type Description
None

None

get

get(max_retries=3, backoff_factor=1.5, num_pages=10, max_download_size=400 * 1024 * 1024)

Fetch data from the signed URL with exponential backoff retry logic.

For PDFs, the method can extract only the first N pages to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) markers that appear before the final bytes of the file, which allows pages to be extracted from PDFs substantially larger than the download size limit.

Parameters:

Name Type Description Default
max_retries int

Maximum number of retry attempts. Defaults to 3.

3
backoff_factor float

Exponential backoff multiplier for retry delays. Defaults to 1.5.

1.5
num_pages int

Number of pages to extract (applies only to PDFs). Defaults to 10.

10
max_download_size int

Maximum number of bytes to attempt to download. Defaults to 400 MiB (400 * 1024 * 1024 bytes).

400 * 1024 * 1024

Returns:

Name Type Description
bytes bytes

The downloaded document content.

Raises:

Type Description
RuntimeError

If all retry attempts fail.

ValueError

If the file size exceeds max_download_size.
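The retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `fetch_with_backoff` is a hypothetical helper, `max_retries` is treated as the total number of attempts, and the delay before each retry is assumed to be `backoff_factor ** attempt` seconds.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, backoff_factor=1.5, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_retries - 1:
                # Assumed schedule: 1 s, 1.5 s, 2.25 s, ... with the default factor.
                sleep(backoff_factor ** attempt)
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error


# Usage with a stand-in fetcher that fails twice, then succeeds.
calls = {"n": 0}
delays = []  # records the requested delays instead of actually sleeping


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return b"%PDF-1.7"


data = fetch_with_backoff(flaky_fetch, sleep=delays.append)
print(data, delays)
```

Injecting `sleep` keeps the example fast to run and makes the delay schedule easy to inspect.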

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type Description
Dict[str, Any]

Dictionary representation of the DocumentVersion object.

DocumentVersionSheet

Child class of DocumentVersion. Exposes InformationContainer based attributes and methods for reading / writing file content.

Attributes:

Name Type Description
metadata InformationContainer

Container for document metadata

results InformationContainer

Container for analysis results

chunks List[InformationContainer]

A list of text chunks extracted from the file sheet.

images List[InformationContainer]

A list of images extracted from the file sheet.

rows List[InformationContainer]

A list of table rows extracted from the file sheet.

__init__

__init__()

Constructor for the class. When working with a DocumentVersion, create sheets by calling its add_sheet method.

bind_chunk

bind_chunk(signed_url, content=None, headers=None)

Bind a chunk object to the sheet. Will append the chunk to the end of the 'chunks' list.

Info

AzureBlobSession binding expects chunk blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*chunks\.[json|txt], e.g.:

"workspace-directory/file001.pdf0_0chunk.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name Type Description Default
signed_url str

A signed or public URL to the chunk object

required
content Any

If content is provided, then the method will POST the content to the signed_url. If no content is provided then the InformationContainer will simply be bound to the sheet to allow it to be accessed in future.

None
headers dict

Request headers (e.g. 'content-type') to be included when interacting with the InformationContainer.

None

bind_image

bind_image(signed_url, content=None, headers=None)

Bind an image object to the sheet. Will append the image to the end of the 'images' list.

Info

AzureBlobSession binding expects image blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*[thumbnail|image]\.[bmp|webp], e.g.:

"workspace-directory/file001.pdf0_0image.webp"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name Type Description Default
signed_url str

A signed or public URL to the image object

required
content Any

If content is provided, then the method will POST the content to the signed_url. If no content is provided then the InformationContainer will simply be bound to the sheet to allow it to be accessed in future.

None
headers dict

Request headers (e.g. 'content-type') to be included when interacting with the InformationContainer.

None

bind_metadata

bind_metadata(signed_url, content=None, headers=None)

Bind a metadata object to the sheet.

Info

AzureBlobSession binding expects metadata blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_metadata\.json, e.g.:

"workspace-directory/file001.pdf0_metadata.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name Type Description Default
signed_url str

A signed or public URL to the metadata object

required
content Any

If content is provided, then the method will POST the content to the signed_url. If no content is provided then the InformationContainer will simply be bound to the sheet to allow it to be accessed in future.

None
headers dict

Request headers (e.g. 'content-type') to be included when interacting with the InformationContainer.

None
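The metadata naming convention above can be checked before uploading. The sketch below is an illustration only: both helpers are hypothetical, and it assumes `sheet_index` is a non-negative integer written in decimal, as in the documented example.

```python
import re


def sheet_metadata_blob_name(directory, doc_id, sheet_index):
    """Compose a blob name following the documented metadata convention."""
    return f"{directory}/{doc_id}{sheet_index}_metadata.json"


def is_sheet_metadata_blob(name, directory, doc_id):
    """Check a blob name against the convention for any sheet index."""
    # re.escape keeps dots and dashes in the directory and id literal.
    pattern = re.escape(f"{directory}/{doc_id}") + r"\d+_metadata\.json"
    return re.fullmatch(pattern, name) is not None


name = sheet_metadata_blob_name("workspace-directory", "file001.pdf", 0)
print(name)
print(is_sheet_metadata_blob(name, "workspace-directory", "file001.pdf"))
```

The same approach applies to the other sheet-level blobs by swapping the suffix part of the pattern.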

bind_results

bind_results(signed_url, content=None, headers=None)

Bind a results object to the sheet.

Info

AzureBlobSession binding expects result blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_results\.json, e.g.:

"workspace-directory/file001.pdf0_results.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name Type Description Default
signed_url str

A signed or public URL to the results object

required
content Any

If content is provided, then the method will POST the content to the signed_url. If no content is provided then the InformationContainer will simply be bound to the sheet to allow it to be accessed in future.

None
headers dict

Request headers (e.g. 'content-type') to be included when interacting with the InformationContainer.

None

bind_row

bind_row(signed_url, content=None, headers=None)

Bind a row object to the sheet. Will append the row to the end of the 'rows' list.

Info

AzureBlobSession binding expects row blobs to adopt the suffix convention {doc_version.directory}/csv.*\.json, e.g.:

"workspace-directory/csv/table1.csvrow0.json"

Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.

Parameters:

Name Type Description Default
signed_url str

A signed or public URL to the row object

required
content Any

If content is provided, then the method will POST the content to the signed_url. If no content is provided then the InformationContainer will simply be bound to the sheet to allow it to be accessed in future.

None
headers dict

Request headers (e.g. 'content-type') to be included when interacting with the InformationContainer.

None

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type Description
Dict[str, Any]

A dictionary representation of the DocumentVersionSheet.

InformationContainer

Base class for information containers with secure URL access.

Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration and retry logic for network operations.

Attributes:

Name Type Description
_signed_url str

The current signed URL for accessing the container

headers dict

HTTP headers to use for requests

_url_generator callable

Function to regenerate expired URLs

_url_expires_at datetime

When the current URL expires

signed_url property writable

signed_url

Get a valid signed URL, regenerating if necessary.

Returns:

Name Type Description
str str

A valid signed URL for accessing the container
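One way a property can regenerate an expired URL on access is sketched below. This is not the InformationContainer implementation: `ExpiringUrl` is a stand-in class, and the assumption that the generator returns a `(url, expires_at)` pair is made purely for illustration, since the real callable's signature is not documented here.

```python
from datetime import datetime, timedelta, timezone


class ExpiringUrl:
    """Stand-in for the documented behaviour; not the InformationContainer class."""

    def __init__(self, signed_url=None, url_generator=None, expires_at=None):
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._url_expires_at = expires_at

    @property
    def signed_url(self):
        expired = (
            self._url_expires_at is not None
            and datetime.now(timezone.utc) >= self._url_expires_at
        )
        if (self._signed_url is None or expired) and self._url_generator is not None:
            # Assumption: the generator returns a (url, expires_at) pair.
            self._signed_url, self._url_expires_at = self._url_generator()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value):
        self._signed_url = value


def make_url():
    """Hypothetical generator that issues a URL valid for one hour."""
    expires = datetime.now(timezone.utc) + timedelta(hours=1)
    return "https://example.invalid/blob?sig=fresh", expires


stale = ExpiringUrl(
    "https://example.invalid/blob?sig=old",
    url_generator=make_url,
    expires_at=datetime.now(timezone.utc) - timedelta(seconds=1),
)
fresh_url = stale.signed_url  # triggers regeneration because the URL has expired
print(fresh_url)
```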

__init__

__init__(signed_url=None, headers=None, url_generator=None)

Initialize an information container.

Parameters:

Name Type Description Default
signed_url str

Initial signed URL for the container. Defaults to None.

None
headers dict

HTTP headers for requests. Defaults to Azure Blob Storage headers.

None
url_generator callable

Function to regenerate expired URLs. Defaults to None.

None

get

get(max_retries=3, backoff_factor=1.5)

Fetch data from the signed URL with retry logic.

Retrieves data from the container using exponential backoff retry logic for handling transient network errors.

Parameters:

Name Type Description Default
max_retries int

Maximum number of retry attempts.

3
backoff_factor float

Multiplier for retry delay.

1.5

Returns:

Name Type Description
bytes bytes

The raw content from the container

Raises:

Type Description
RuntimeError

If all retry attempts fail

set

set(data, max_retries=3, backoff_factor=1.5)

Write data to the container with retry logic.

Writes data to the resource at the signed URL using exponential backoff retry logic for handling transient network errors.

Parameters:

Name Type Description Default
data Any

The data to be written. Can be binary or text depending on the resource. If headers specify JSON content type, the data will be JSON-encoded.

required
max_retries int

Maximum number of retry attempts.

3
backoff_factor float

Multiplier for retry delay.

1.5

Returns:

Name Type Description
int int

HTTP status code from the successful write operation

Raises:

Type Description
RuntimeError

If all retry attempts fail
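The content-type-dependent encoding described for `data` can be sketched as follows. `encode_payload` is a hypothetical helper, and the exact header matching the library performs is an assumption; the sketch only shows the general idea of JSON-encoding when the headers ask for JSON.

```python
import json


def encode_payload(data, headers=None):
    """Hypothetical helper: pick a request body encoding from the headers."""
    content_type = ""
    for key, value in (headers or {}).items():
        if key.lower() == "content-type":  # header names are case-insensitive
            content_type = value.lower()
    if "application/json" in content_type:
        return json.dumps(data).encode("utf-8")
    if isinstance(data, str):
        return data.encode("utf-8")
    if isinstance(data, (bytes, bytearray, memoryview)):
        return bytes(data)
    raise TypeError(f"cannot encode {type(data).__name__} without a JSON content type")


body = encode_payload({"label": "invoice"}, {"Content-Type": "application/json"})
print(body)
```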

to_dict

to_dict()

Recursively converts custom objects into dictionaries.

Returns:

Type Description
Dict[str, Any]

Dictionary representation of the InformationContainer object.

Serializable

Base class providing serialization capabilities.

This class provides a simple to_dict() method that converts object attributes to a dictionary format for serialization purposes.

to_dict

to_dict()

Convert object attributes to dictionary format.

Returns:

Type Description
Dict[str, Any]

Dictionary containing all object attributes.
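A recursive attribute-to-dictionary conversion of the kind described for the `to_dict` methods on this page might look like the sketch below. `SerializableSketch`, `Sheet`, and `Document` are stand-in classes written for this example; the real Serializable may handle more types or filter attributes differently.

```python
class SerializableSketch:
    """Stand-in base class; the real Serializable may differ in detail."""

    def to_dict(self):
        return {name: _convert(value) for name, value in vars(self).items()}


def _convert(value):
    """Convert nested objects, lists, and dicts; leave plain values untouched."""
    if hasattr(value, "to_dict"):
        return value.to_dict()
    if isinstance(value, (list, tuple)):
        return [_convert(item) for item in value]
    if isinstance(value, dict):
        return {key: _convert(item) for key, item in value.items()}
    return value


class Sheet(SerializableSketch):
    def __init__(self, index):
        self.index = index


class Document(SerializableSketch):
    def __init__(self):
        self.file_name = "file001.pdf"
        self.sheets = [Sheet(0), Sheet(1)]


doc_dict = Document().to_dict()
print(doc_dict)
```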

Session

Bases: ABC

Abstract base class for data processing sessions.

Provides a framework for applying data processing operations to a scoped set of information containers. Sessions are scoped to specific data partitions for security and to prevent cross-talk between different data sources.

All concrete implementations must provide methods to:
  • Parse information management standards (blueprints for processing)
  • Parse workflows for data processing pipelines
  • Parse and catalogue information containers

Attributes:

Name Type Description
organization str

Organization identifier for the session

workspace str

Workspace identifier within the organization

session_id str

Unique session identifier

directory str

Directory path for the session data. Automatically set as "{workspace}/{session_id}".

user_id str

User identifier for authentication and logging

workflow Dict | None

Parsed workflow configuration

classifiers Dict

Document classifiers configuration

attributes List

Document attributes configuration

tags List[str]

Document tags configuration

prompts Dict

Custom prompts for AI operations

document_versions List[DocumentVersion]

Documents in the session

initialized bool

Whether the session has been initialized (document versions have been indexed)

__init__

__init__(organization, workspace, session_id, user_id)

Initialize the session by getting the information standard and workflow from storage and parsing the contents.

Parameters:

Name Type Description Default
organization str

Organization identifier

required
workspace str

Workspace identifier within the organization

required
session_id str

Unique session identifier

required
user_id str

User identifier for authentication and logging

required

initialize

initialize()

Initialize the session by parsing files and metadata.

Manually triggers the expensive parsing operations to populate the session with document versions. This is separated from the constructor to allow for lazy initialization.

Note

This method is idempotent - calling it multiple times will not re-parse files if the session is already initialized.
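The idempotency guard described in the note can be sketched with a simple flag. `LazySession` is a stand-in written for this example, and `parse_calls` exists only to make the behaviour observable; neither is part of the library.

```python
class LazySession:
    """Stand-in illustrating the initialized guard; not the Session class."""

    def __init__(self):
        self.document_versions = []
        self.initialized = False
        self.parse_calls = 0  # instrumentation for this example only

    def parse_files(self):
        self.parse_calls += 1
        # Placeholder strings stand in for DocumentVersion objects.
        return ["file001.pdf", "file002.pdf"]

    def initialize(self):
        if self.initialized:
            return  # already indexed; skip the expensive parse
        self.document_versions = self.parse_files()
        self.initialized = True


session = LazySession()
session.initialize()
session.initialize()  # second call is a no-op
print(session.parse_calls, session.document_versions)
```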

parse_files abstractmethod

parse_files()

Parse and catalogue files in the session.

Must be implemented by concrete session classes to discover and parse all files available in the session.

Returns:

Type Description
List[DocumentVersion]

List of document versions found in the session

parse_standard abstractmethod

parse_standard()

Parse the information management standard for this session.

Must be implemented by concrete session classes to parse and return the information management standard components.

Returns:

Type Description
Tuple[Dict, List, List[str], Dict]

A tuple containing:

  • classifiers: Dictionary of available classifiers
  • attributes: List of session attributes
  • tags: List of available tags
  • prompts: Dictionary of configured prompts

parse_workflow abstractmethod

parse_workflow()

Parse the workflow configuration for this session.

Must be implemented by concrete session classes to parse and return the workflow configuration that defines data processing pipelines.

Returns:

Type Description
Dict

Workflow configuration dictionary
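The abstract contract can be illustrated with a small stand-in hierarchy. Both classes below are hypothetical and hard-code their return values; they only show the shape of the three abstract methods, the documented "{workspace}/{session_id}" directory format, and how the `parse_standard` tuple maps onto session attributes.

```python
from abc import ABC, abstractmethod


class SessionSketch(ABC):
    """Stand-in for the abstract contract described above; not library code."""

    def __init__(self, organization, workspace, session_id, user_id):
        self.organization = organization
        self.workspace = workspace
        self.session_id = session_id
        self.user_id = user_id
        self.directory = f"{workspace}/{session_id}"  # documented format
        # Documented: the constructor parses the standard and the workflow.
        self.classifiers, self.attributes, self.tags, self.prompts = self.parse_standard()
        self.workflow = self.parse_workflow()

    @abstractmethod
    def parse_files(self): ...

    @abstractmethod
    def parse_standard(self): ...

    @abstractmethod
    def parse_workflow(self): ...


class InMemorySession(SessionSketch):
    """Hypothetical concrete session backed by hard-coded values."""

    def parse_files(self):
        return []

    def parse_standard(self):
        # Order matters: classifiers, attributes, tags, prompts.
        return {"doc_type": ["invoice", "contract"]}, [], ["confidential"], {}

    def parse_workflow(self):
        return {"steps": []}


session = InMemorySession("acme", "ws1", "s42", "user-7")
print(session.directory, session.tags)
```

Leaving any of the three methods unimplemented makes the subclass abstract, so Python refuses to instantiate it.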

to_dict

to_dict()

Convert session object to dictionary format.

Recursively converts the session and all its nested objects into dictionaries for serialization purposes.

Returns:

Type Description
Dict[str, Any]

Dictionary representation of the Session object.