Bindings
Bindings are how Workbench connects to remotely stored documents for analysis.
DocumentVersion
Bases: InformationContainer
Document version container representing a top-level document.
Represents a document version as a specialized information container that serves as the root node for organizing document-related data including metadata, results, and document sheets containing sub-components.
Attributes:
Name | Type | Description |
---|---|---|
file_name | str | Name of the document file |
id | str | Unique identifier for the document version |
directory | str | Directory path where children should be stored |
source | str | Source system or origin of the document |
web_url | str | Web URL for accessing the document in source system |
attributes | Dict[Any, Any] | Source-specific properties and metadata |
file_type | str | File extension (lowercase, without dot) |
metadata | InformationContainer | Container for document metadata |
results | InformationContainer | Container for analysis results |
sheets | List[DocumentVersionSheet] | List of document sheets |
__init__
__init__(signed_url, id=None, file_name=None, directory=None, source='<UNKNOWN>', web_url=None, attributes=None, url_generator=None, **url_params)
Initialize a document version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | Signed URL for accessing the document | required |
id | str | Unique identifier in the directory. Defaults to file_name if not provided. | None |
file_name | str | Document file name. Extracted from URL if not provided. | None |
directory | str | Directory path for storing children. Can be combined with id to form a unique surrogate key. Defaults to empty string. | None |
source | str | Source system identifier. Defaults to '<UNKNOWN>'. | '<UNKNOWN>' |
web_url | str | Web URL in source system. Defaults to None. | None |
attributes | Dict[Any, Any] | Source-specific properties. Defaults to empty dict. | None |
url_generator | callable | Function to regenerate expired URLs. Defaults to None. | None |
**url_params | Any | Additional parameters passed to parent InformationContainer. | {} |
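The defaulting rules in the table above (file name taken from the URL, id falling back to the file name, lowercase file type without the dot) can be illustrated with a small standalone sketch. The helper `derive_identity` is hypothetical and is not part of the library; it only mirrors the documented defaults, and the real constructor may differ in detail.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def derive_identity(signed_url, id=None, file_name=None, directory=None):
    """Hypothetical helper mirroring the documented __init__ defaults."""
    if file_name is None:
        # Use the last path segment of the URL; the query string (signature) is ignored.
        file_name = PurePosixPath(unquote(urlparse(signed_url).path)).name
    if id is None:
        # The id falls back to the file name.
        id = file_name
    # The directory falls back to an empty string.
    directory = directory or ""
    # File extension, lowercase and without the leading dot.
    file_type = PurePosixPath(file_name).suffix.lstrip(".").lower()
    return {
        "id": id,
        "file_name": file_name,
        "directory": directory,
        "file_type": file_type,
    }


info = derive_identity("https://example.invalid/ws/Report.PDF?sig=abc")
print(info)
```

Combining `directory` and `id` then gives the surrogate key mentioned in the table, for example `f"{info['directory']}/{info['id']}"`.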
add_sheet
add_sheet()
Add a new sheet to this document version.
Creates a new DocumentVersionSheet instance and appends it to the sheets list.
Returns:
Name | Type | Description |
---|---|---|
DocumentVersionSheet | DocumentVersionSheet | The newly created sheet that was added to the document. |
bind_metadata
bind_metadata(signed_url, content=None, headers=None)
Bind metadata to this document version.
Creates an InformationContainer for metadata and optionally writes content to storage if content has been provided.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | The signed URL for accessing the metadata storage. | required |
content | Any | The metadata content to write to storage. Defaults to None. | None |
headers | Dict[str, str] | HTTP headers for accessing the storage. Defaults to None. | None |
Returns:
Type | Description |
---|---|
None | None |
bind_results
bind_results(signed_url, content=None, headers=None)
Bind results to this document version.
Creates an InformationContainer for results and optionally writes content to storage if content has been provided.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | The signed URL for accessing the results storage. | required |
content | Any | The results content to write to storage. Defaults to None. | None |
headers | Dict[str, str] | HTTP headers for accessing the storage. Defaults to None. | None |
Returns:
Type | Description |
---|---|
None | None |
get
get(max_retries=3, backoff_factor=1.5, num_pages=10, max_download_size=400 * 1024 * 1024)
Fetch data from the signed URL with exponential backoff retry logic.
For PDFs, this method can extract only the first num_pages pages to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) termination characters that occur before the last bytes, which allows pages to be extracted from PDFs substantially larger than the file size limit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_retries | int | Maximum number of retry attempts. Defaults to 3. | 3 |
backoff_factor | float | Exponential backoff multiplier for retry delays. Defaults to 1.5. | 1.5 |
num_pages | int | Number of pages to extract (applies only to PDFs). Defaults to 10. | 10 |
max_download_size | int | Maximum bytes to attempt to download. Defaults to 400 MiB. | 400 * 1024 * 1024 |
Returns:
Name | Type | Description |
---|---|---|
bytes | bytes | The downloaded document content. |
Raises:
Type | Description |
---|---|
RuntimeError | If all retry attempts fail. |
ValueError | If the file size exceeds max_download_size. |
to_dict
to_dict()
Recursively converts custom objects into dictionaries.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary representation of the DocumentVersion object. |
DocumentVersionSheet
Child class of DocumentVersion. Exposes InformationContainer-based attributes and methods for reading and writing file content.
Attributes:
Name | Type | Description |
---|---|---|
metadata | InformationContainer | Container for document metadata |
results | InformationContainer | Container for analysis results |
chunks | List[InformationContainer] | A list of text chunks extracted from the file sheet. |
images | List[InformationContainer] | A list of images extracted from the file sheet. |
rows | List[InformationContainer] | A list of table rows extracted from the file sheet. |
__init__
__init__()
Constructor for the class. When working with DocumentVersion, call the add_sheet method of that class.
bind_chunk
bind_chunk(signed_url, content=None, headers=None)
Bind a chunk object to the sheet. Will append the chunk to the end of the 'chunks' list.
Info
AzureBlobSession binding expects chunk blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*chunks\.[json|txt], e.g.:
"workspace-directory/file001.pdf0_0chunk.json"
Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | A signed or public URL to the chunk object | required |
content | Any | If content is provided, then the method will POST the content to the signed_url. If no content is provided then the | None |
headers | dict | Request headers (e.g. 'content-type') to be included when interacting with the | None |
bind_image
bind_image(signed_url, content=None, headers=None)
Bind an image object to the sheet. Will append the image to the end of the 'images' list.
Info
AzureBlobSession binding expects image blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_.*[thumbnail|image]\.[bmp|webp], e.g.:
"workspace-directory/file001.pdf0_0image.webp"
Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | A signed or public URL to the image object | required |
content | Any | If content is provided, then the method will POST the content to the signed_url. If no content is provided then the | None |
headers | dict | Request headers (e.g. 'content-type') to be included when interacting with the | None |
bind_metadata
bind_metadata(signed_url, content=None, headers=None)
Bind a metadata object to the sheet.
Info
AzureBlobSession binding expects metadata blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_metadata\.json, e.g.:
"workspace-directory/file001.pdf0_metadata.json"
Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | A signed or public URL to the metadata object | required |
content | Any | If content is provided, then the method will POST the content to the signed_url. If no content is provided then the | None |
headers | dict | Request headers (e.g. 'content-type') to be included when interacting with the | None |
bind_results
bind_results(signed_url, content=None, headers=None)
Bind a results object to the sheet.
Info
AzureBlobSession binding expects result blobs to adopt the suffix convention {doc_version.directory}/{doc_version.id}{sheet_index}_results\.json, e.g.:
"workspace-directory/file001.pdf0_results.json"
Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | A signed or public URL to the results object | required |
content | Any | If content is provided, then the method will POST the content to the signed_url. If no content is provided then the | None |
headers | dict | Request headers (e.g. 'content-type') to be included when interacting with the | None |
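The sheet-level naming conventions for metadata and results blobs can be checked before binding. The sketch below is an illustration only: `sheet_blob_name` and `matches_convention` are hypothetical helpers, not library functions, and the pattern is reconstructed from the conventions and examples quoted in the Info notes above rather than taken from the AzureBlobSession implementation.

```python
import re


def sheet_blob_name(directory, doc_id, sheet_index, kind):
    """Build a blob name such as 'workspace-directory/file001.pdf0_metadata.json'."""
    if kind not in ("metadata", "results"):
        raise ValueError("kind must be 'metadata' or 'results'")
    return f"{directory}/{doc_id}{sheet_index}_{kind}.json"


def matches_convention(blob_name, directory, doc_id, kind):
    """Return True if blob_name looks like a sheet-level metadata/results blob."""
    # Assumed pattern: {directory}/{id}{sheet_index}_{kind}.json with a numeric sheet index.
    pattern = re.escape(f"{directory}/{doc_id}") + r"\d+_" + re.escape(kind) + r"\.json"
    return re.fullmatch(pattern, blob_name) is not None


name = sheet_blob_name("workspace-directory", "file001.pdf", 0, "metadata")
print(name, matches_convention(name, "workspace-directory", "file001.pdf", "metadata"))
```

A check like this is mainly useful when generating signed URLs yourself, since non-conforming names are not picked up when the session is initialized.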
bind_row
bind_row(signed_url, content=None, headers=None)
Bind a row object to the sheet. Will append the row to the end of the 'rows' list.
Info
AzureBlobSession binding expects row blobs to adopt the suffix convention {doc_version.directory}/csv.*\.json, e.g.:
"workspace-directory/csv/table1.csvrow0.json"
Signed URLs not conforming to this convention won't be properly identified and bound to the session when initializing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | A signed or public URL to the row object | required |
content | Any | If content is provided, then the method will POST the content to the signed_url. If no content is provided then the | None |
headers | dict | Request headers (e.g. 'content-type') to be included when interacting with the | None |
to_dict
to_dict()
Recursively converts custom objects into dictionaries.
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dictionary representation of the DocumentVersionSheet. |
InformationContainer
Base class for information containers with secure URL access.
Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration and retry logic for network operations.
Attributes:
Name | Type | Description |
---|---|---|
_signed_url | str | The current signed URL for accessing the container |
headers | dict | HTTP headers to use for requests |
_url_generator | callable | Function to regenerate expired URLs |
_url_expires_at | datetime | When the current URL expires |
signed_url
property
writable
signed_url
Get a valid signed URL, regenerating if necessary.
Returns:
Name | Type | Description |
---|---|---|
str | str | A valid signed URL for accessing the container |
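A writable property that regenerates an expired URL on access might look like the sketch below. The class `ExpiringUrl` is a hypothetical stand-in, and the assumption that the generator returns a `(url, expires_at)` pair is made for illustration only; the source does not specify the callable's signature.

```python
from datetime import datetime, timedelta, timezone


class ExpiringUrl:
    """Hypothetical stand-in showing lazy regeneration of a signed URL."""

    def __init__(self, signed_url=None, url_generator=None, expires_at=None):
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._url_expires_at = expires_at

    @property
    def signed_url(self):
        now = datetime.now(timezone.utc)
        expired = self._url_expires_at is not None and now >= self._url_expires_at
        # Regenerate only when a generator exists and the URL is missing or expired.
        if self._url_generator is not None and (self._signed_url is None or expired):
            # Assumption: the generator returns (url, expires_at).
            self._signed_url, self._url_expires_at = self._url_generator()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value):
        self._signed_url = value


def fresh_url():
    # Placeholder generator; a real one would request a new signature.
    return "https://example.invalid/fresh", datetime.now(timezone.utc) + timedelta(hours=1)


container = ExpiringUrl(url_generator=fresh_url)
print(container.signed_url)
```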
__init__
__init__(signed_url=None, headers=None, url_generator=None)
Initialize an information container.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signed_url | str | Initial signed URL for the container. Defaults to None. | None |
headers | dict | HTTP headers for requests. Defaults to Azure Blob Storage headers. | None |
url_generator | callable | Function to regenerate expired URLs. Defaults to None. | None |
get
get(max_retries=3, backoff_factor=1.5)
Fetch data from the signed URL with retry logic.
Retrieves data from the container using exponential backoff retry logic for handling transient network errors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_retries | int | Maximum number of retry attempts. | 3 |
backoff_factor | float | Multiplier for retry delay. | 1.5 |
Returns:
Name | Type | Description |
---|---|---|
bytes | bytes | The raw content from the container |
Raises:
Type | Description |
---|---|
RuntimeError | If all retry attempts fail |
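The retry behaviour described for get can be sketched generically. The function `get_with_backoff` below is hypothetical, and the delay formula `backoff_factor ** attempt` is an assumption; the source names the parameters but does not state how the delay is computed.

```python
import time


def get_with_backoff(fetch, max_retries=3, backoff_factor=1.5, sleep=time.sleep):
    """Call fetch() up to max_retries times, sleeping longer after each failure."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower network errors
            last_error = exc
            if attempt < max_retries - 1:
                # Assumed delay schedule: 1.0 s, 1.5 s, 2.25 s, ...
                sleep(backoff_factor ** attempt)
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error


# Usage with a fake fetcher that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return b"payload"


print(get_with_backoff(flaky_fetch, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.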
set
set(data, max_retries=3, backoff_factor=1.5)
Write data to the container with retry logic.
Writes data to the resource at the signed URL using exponential backoff retry logic for handling transient network errors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Any | The data to be written. Can be binary or text depending on the resource. If headers specify JSON content type, the data will be JSON-encoded. | required |
max_retries | int | Maximum number of retry attempts. | 3 |
backoff_factor | float | Multiplier for retry delay. | 1.5 |
Returns:
Name | Type | Description |
---|---|---|
int | int | HTTP status code from the successful write operation |
Raises:
Type | Description |
---|---|
RuntimeError | If all retry attempts fail |
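The content-type-dependent encoding described for the data parameter could work roughly as follows. `encode_payload` is a hypothetical helper written for illustration; how the library actually inspects headers and encodes data is not stated in the source.

```python
import json


def encode_payload(data, headers=None):
    """Return the bytes to send, JSON-encoding when the headers declare JSON."""
    content_type = ""
    for key, value in (headers or {}).items():
        # Header names are case-insensitive.
        if key.lower() == "content-type":
            content_type = value.lower()
    if "json" in content_type:
        return json.dumps(data).encode("utf-8")
    if isinstance(data, str):
        # Plain text is sent as UTF-8 bytes.
        return data.encode("utf-8")
    # Binary data is passed through unchanged.
    return data


print(encode_payload({"label": "invoice"}, {"Content-Type": "application/json"}))
```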
to_dict
to_dict()
Recursively converts custom objects into dictionaries.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary representation of the InformationContainer object. |
Serializable
Base class providing serialization capabilities.
This class provides a simple to_dict() method that converts object attributes to a dictionary format for serialization purposes.
to_dict
to_dict()
Convert object attributes to dictionary format.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing all object attributes. |
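The recursive conversion that the to_dict methods on this page describe can be sketched with a small base class. `SerializableSketch`, `Leaf`, and `Node` are hypothetical names used for illustration; the library's own implementation may treat private attributes or other types differently.

```python
def to_plain(value):
    """Recursively convert objects exposing to_dict(), dicts, and lists."""
    if hasattr(value, "to_dict"):
        return value.to_dict()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


class SerializableSketch:
    """Hypothetical stand-in for a serializable base class."""

    def to_dict(self):
        # vars(self) holds the instance attributes.
        return {name: to_plain(value) for name, value in vars(self).items()}


class Leaf(SerializableSketch):
    def __init__(self, name):
        self.name = name


class Node(SerializableSketch):
    def __init__(self, label, children):
        self.label = label
        self.children = children


print(Node("root", [Leaf("a"), Leaf("b")]).to_dict())
```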
Session
Bases: ABC
Abstract base class for data processing sessions.
Provides a framework for applying data processing operations to a scoped set of information containers. Sessions are scoped to specific data partitions for security and to prevent cross-talk between different data sources.
All concrete implementations must provide methods to:
- Parse information management standards (blueprints for processing)
- Parse workflows for data processing pipelines
- Parse and catalogue information containers
Attributes:
Name | Type | Description |
---|---|---|
organization | str | Organization identifier for the session |
workspace | str | Workspace identifier within the organization |
session_id | str | Unique session identifier |
directory | str | Directory path for the session data. Automatically set as "{workspace}/{session_id}". |
user_id | str | User identifier for authentication and logging |
workflow | Dict \| None | Parsed workflow configuration |
classifiers | Dict | Document classifiers configuration |
attributes | List | Document attributes configuration |
tags | List[str] | Document tags configuration |
prompts | Dict | Custom prompts for AI operations |
document_versions | List[DocumentVersion] | Documents in the session |
initialized | bool | Whether the session has been initialized (document versions have been indexed) |
__init__
__init__(organization, workspace, session_id, user_id)
Initialize the session by getting the information standard and workflow from storage and parsing the contents.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
organization | str | Organization identifier | required |
workspace | str | Workspace identifier within the organization | required |
session_id | str | Unique session identifier | required |
user_id | str | User identifier for authentication and logging | required |
initialize
initialize()
Initialize the session by parsing files and metadata.
Manually triggers the expensive parsing operations to populate the session with document versions. This is separated from the constructor to allow for lazy initialization.
Note
This method is idempotent - calling it multiple times will not re-parse files if the session is already initialized.
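Lazy, idempotent initialization of this kind can be sketched with a minimal abstract base class. `SessionSketch` and `InMemorySession` are hypothetical and model only parse_files, the directory attribute, and the initialized flag; the real Session also parses the standard and workflow.

```python
from abc import ABC, abstractmethod


class SessionSketch(ABC):
    """Hypothetical, reduced model of a session with lazy initialization."""

    def __init__(self, organization, workspace, session_id, user_id):
        self.organization = organization
        self.workspace = workspace
        self.session_id = session_id
        self.user_id = user_id
        # Documented convention: "{workspace}/{session_id}".
        self.directory = f"{workspace}/{session_id}"
        self.document_versions = []
        self.initialized = False

    def initialize(self):
        # Idempotent: the expensive parse runs at most once.
        if self.initialized:
            return
        self.document_versions = self.parse_files()
        self.initialized = True

    @abstractmethod
    def parse_files(self):
        """Discover and return the documents in the session."""


class InMemorySession(SessionSketch):
    def __init__(self, *args, files=()):
        super().__init__(*args)
        self._files = list(files)
        self.parse_calls = 0

    def parse_files(self):
        self.parse_calls += 1
        return list(self._files)


session = InMemorySession("org", "ws", "s1", "user", files=["a.pdf"])
session.initialize()
session.initialize()
print(session.directory, session.parse_calls)
```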
parse_files
abstractmethod
parse_files()
Parse and catalogue files in the session.
Must be implemented by concrete session classes to discover and parse all files available in the session.
Returns:
Type | Description |
---|---|
List[DocumentVersion] | List of document versions found in the session |
parse_standard
abstractmethod
parse_standard()
Parse the information management standard for this session.
Must be implemented by concrete session classes to parse and return the information management standard components.
Returns:
Type | Description |
---|---|
Tuple[Dict, List, List[str], Dict] | A tuple containing: |
parse_workflow
abstractmethod
parse_workflow()
Parse the workflow configuration for this session.
Must be implemented by concrete session classes to parse and return the workflow configuration that defines data processing pipelines.
Returns:
Type | Description |
---|---|
Dict | Workflow configuration dictionary |
to_dict
to_dict()
Convert session object to dictionary format.
Recursively converts the session and all its nested objects into dictionaries for serialization purposes.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary representation of the Session object. |