Bindings
Bindings are how Workbench connects to remotely-stored documents for analysis.
AzureBlobSession
Bases: Session
get_processing_status
get_processing_status()
Check if all files in the session have completed processing.
Returns:
| Type | Description |
|---|---|
| str | |
results_to_df
results_to_df()
Convert all document version results to a pandas DataFrame:
- Inserting results into the correct dataframe columns according to their field names
- Merging user edits and AI edits into their correct cell positions
Returns:
| Type | Description |
|---|---|
| DataFrame | Each row in the dataframe is a document version. |
set_file_count
set_file_count(new_file_count)
Set the file count in the session properties JSON.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_file_count | int | The count of files | required |
Returns:
| Type | Description |
|---|---|
| bool | True if operation successful, else false. |
set_processing_status
set_processing_status(new_status)
Set the processing status in the session properties JSON.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_status | str | The status to set. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if operation successful, else false. |
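The setters above report success as a boolean instead of raising. A minimal sketch of that pattern, assuming the session properties are held as a JSON string: `SessionPropertiesSketch`, the `processing_status` and `file_count` keys, and the status values are hypothetical and are not taken from Workbench.

```python
import json


class SessionPropertiesSketch:
    """Hypothetical stand-in for a session whose properties live in a JSON document."""

    def __init__(self) -> None:
        # Assumed key names and initial values, for illustration only.
        self.properties_json = json.dumps(
            {"processing_status": "pending", "file_count": 0}
        )

    def _update(self, key: str, value) -> bool:
        try:
            properties = json.loads(self.properties_json)
            properties[key] = value
            self.properties_json = json.dumps(properties)
        except (TypeError, ValueError):
            return False  # the stored JSON is left untouched on failure
        return True

    def set_file_count(self, new_file_count: int) -> bool:
        return self._update("file_count", new_file_count)

    def set_processing_status(self, new_status: str) -> bool:
        return self._update("processing_status", new_status)

    def get_processing_status(self) -> str:
        return json.loads(self.properties_json)["processing_status"]


session = SessionPropertiesSketch()
if session.set_file_count(12) and session.set_processing_status("complete"):
    print(session.get_processing_status())
```

Checking the returned boolean before reading the status back keeps a failed write from being mistaken for a completed one.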
AuditableMixin
Bases: ABC
Abstract mix-in class requiring audit-tracking capabilities.
DocumentVersion
Bases: AuditableMixin, SignedEntity
bind_chunk
abstractmethod
bind_chunk(content=None, sheet_index=0, chunk_index=0)
Short-hand method to upload a text chunk to a sheet. If content is not provided, initializes an empty sheet chunk.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the chunk. |
bind_image
abstractmethod
bind_image(content=None, sheet_index=0, image_index=0)
Short-hand method to upload an image to a sheet. If content is not provided, initializes an empty sheet image.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the image. |
bind_row
abstractmethod
bind_row(content=None, sheet_index=0, row_index=0)
Short-hand method to upload a row to a sheet. If content is not provided, initializes an empty sheet row.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the row. |
bind_sheet
abstractmethod
bind_sheet(sheet_index=0)
Short-hand method to add a sheet of the relevant concrete class (e.g. AzureDocumentVersionSheet) to the DocumentVersion at the specified index. If a concrete object already exists at the specified sheet_index, it is not overwritten.
content
abstractmethod
content()
Short-hand method to get all the text content from each sheet.
get
get(max_retries=3, backoff_factor=1.5, page_limit=10, max_download_size=400 * 1024 * 1024)
Fetch data from the signed URL with exponential backoff retry logic.
For PDFs, only the first page_limit pages can be extracted to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) termination characters before the last bytes, allowing extraction of pages from PDFs substantially larger than the file size limit.
For ZIP files, extracts the file matching self.file_name from the archive.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_retries | int | Maximum number of retry attempts. Defaults to 3. | 3 |
| backoff_factor | float | Exponential backoff multiplier for retry delays. Defaults to 1.5. | 1.5 |
| page_limit | int | Number of pages to extract (applies only to PDFs). Defaults to 10. | 10 |
| max_download_size | int | Maximum bytes to attempt to download. Defaults to 400 MiB. | 400 * 1024 * 1024 |
Returns:
| Name | Type | Description |
|---|---|---|
| bytes | bytes | The downloaded document content. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If all retry attempts fail. |
| ValueError | If the file size exceeds max_download_size. |
| FileNotFoundError | If the specified file is not found in a ZIP archive. |
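The retry behaviour described for get() can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `fetch_with_backoff`, the `fetch` and `sleep` callables, the choice of `OSError` as the retryable error, and the delay formula `backoff_factor ** attempt` are all hypothetical. Only the `max_retries` and `backoff_factor` names, their defaults, and the RuntimeError on exhaustion come from the documentation above.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], bytes],
    max_retries: int = 3,
    backoff_factor: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError as exc:  # assumed: I/O errors are treated as retryable
            last_error = exc
            if attempt < max_retries - 1:
                # Assumed delay schedule: 1.0 s, 1.5 s, 2.25 s, ...
                sleep(backoff_factor ** attempt)
    raise RuntimeError(f"All {max_retries} attempts failed") from last_error


# Usage: a fetch that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"n": 0}
delays = []


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return b"%PDF-1.7"


data = fetch_with_backoff(flaky_fetch, sleep=delays.append)
```

Passing `sleep` as a parameter is a design choice for testability; it lets the delay schedule be inspected without waiting.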
pad_sheets
pad_sheets(sheet_number)
Pad the sheet list with None until the desired index is reached.
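A minimal sketch of this kind of padding, written as a standalone function over a plain list. The function signature and the assumption that padding stops once `sheet_number` is a valid index are illustrative; the actual method operates on the DocumentVersion itself.

```python
def pad_sheets(sheets: list, sheet_number: int) -> list:
    """Append None placeholders until sheets[sheet_number] is a valid index."""
    while len(sheets) <= sheet_number:
        sheets.append(None)
    return sheets


sheets = ["sheet-0"]
pad_sheets(sheets, 3)
# Index 3 can now be assigned without an IndexError.
sheets[3] = "sheet-3"
```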
EntityMixin
Bases: SerializableMixin
Combined mixin providing data access, caching, serialization, and a dictionary-like interface.
__contains__
__contains__(key)
Check if a key exists using 'in' operator.
__getitem__
__getitem__(key)
Allow dictionary-style access to data attributes.
This works with cached data and automatically converts nested objects.
get
abstractmethod
get()
Fetch data from the underlying data store. Implemented in concrete classes.
keys
keys()
Return all public attribute names, properties, and data keys.
set
abstractmethod
set(data=None)
Persist data to the underlying data store. Implemented in concrete classes.
MetadataSpecification
Bases: AuditableMixin, EntityMixin
excel_from_classifiers
excel_from_classifiers()
Export classifiers to an Excel workbook with:
- Summary sheet with all classifiers and their top-level properties
- Individual sheets for each classifier's picklist options
Returns:
| Type | Description |
|---|---|
| BytesIO | Excel workbook as bytes that can be downloaded or sent via API. |
turtle_from_classifiers
turtle_from_classifiers()
Convert the information standard classifiers dictionary into Turtle, a textual syntax for RDF triples that can be imported into other systems.
Returns:
| Type | Description |
|---|---|
| str | A string-formatted Turtle definition. |
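To give a sense of what Turtle output looks like, here is a small sketch that serializes a flat dictionary as `rdfs:label` triples. Everything in it is an assumption made for illustration: the `classifiers_to_turtle` function, the shape of the input dictionary, the `ex:` prefix, and the example base IRI. The real classifier structure and vocabulary are not documented here.

```python
def classifiers_to_turtle(
    classifiers: dict,
    base: str = "https://example.invalid/classifiers#",  # hypothetical base IRI
) -> str:
    """Emit one rdfs:label triple per classifier key (keys assumed to be simple identifiers)."""
    lines = [
        f"@prefix ex: <{base}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for key, label in classifiers.items():
        # Escape backslashes and double quotes inside the string literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'ex:{key} rdfs:label "{escaped}" .')
    return "\n".join(lines) + "\n"


turtle = classifiers_to_turtle({"document_type": "Document type"})
print(turtle)
```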
Results
Bases: EntityMixin
add
abstractmethod
add(id, name, value, method='workflow', certainty=None, explanation=None)
Adds a new PropertyValue to the result object. Does not write results to storage; call set() to persist them.
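The split between staging with add() and persisting with set() can be sketched with an in-memory stand-in. `InMemoryResults`, its `pending` and `stored` lists, and the dictionary used in place of a PropertyValue are hypothetical; only the add() signature and the set(data=None) signature are taken from this page.

```python
class InMemoryResults:
    """Hypothetical stand-in: add() stages values, set() persists them."""

    def __init__(self) -> None:
        self.pending = []  # values staged by add()
        self.stored = []   # stand-in for the storage backend

    def add(self, id, name, value, method="workflow", certainty=None, explanation=None):
        # A plain dict is used here in place of a PropertyValue object.
        self.pending.append(
            {
                "id": id,
                "name": name,
                "value": value,
                "method": method,
                "certainty": certainty,
                "explanation": explanation,
            }
        )

    def set(self, data=None):
        # Persist the staged values, or the explicitly supplied data.
        items = list(self.pending) if data is None else list(data)
        self.stored.extend(items)
        self.pending.clear()


results = InMemoryResults()
results.add("f1", "Document type", "Invoice", certainty=0.92)
stored_before_set = len(results.stored)  # nothing has been persisted yet
results.set()
```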
SerializableMixin
Mixin providing data serialization and a dictionary-like interface.
__contains__
__contains__(key)
Check if a key exists using 'in' operator.
__getitem__
__getitem__(key)
Allow dictionary-style access to data attributes.
This works with cached data and automatically converts nested objects.
__setitem__
__setitem__(key, value)
Allow dictionary-style setting of data attributes.
items
items()
Return public key-value pairs.
keys
keys()
Return all public attribute names, properties, and data keys.
to_dict
to_dict()
Convert object attributes to dictionary format.
Recursively converts the object and all its properties into dictionaries for serialization purposes. Only includes public attributes and properties.
values
values()
Return all public attribute values, property values, and data values.
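The dictionary-like interface described above can be sketched as follows. `DictLikeSketch` and the `Doc` and `Folder` classes are hypothetical stand-ins, and the rule used for "public" (no leading underscore, plus properties found on the class hierarchy) is an assumption about how such a mixin might behave, not a description of SerializableMixin's internals.

```python
class DictLikeSketch:
    """Hypothetical stand-in for a mixin exposing public attributes like a dict."""

    def _public(self) -> dict:
        # Instance attributes whose names do not start with an underscore.
        data = {k: v for k, v in vars(self).items() if not k.startswith("_")}
        # Public properties defined anywhere in the class hierarchy.
        for cls in type(self).__mro__:
            for name, attr in vars(cls).items():
                if isinstance(attr, property) and not name.startswith("_"):
                    data.setdefault(name, getattr(self, name))
        return data

    def __contains__(self, key) -> bool:
        return key in self._public()

    def __getitem__(self, key):
        return self._public()[key]

    def __setitem__(self, key, value) -> None:
        setattr(self, key, value)

    def keys(self):
        return list(self._public().keys())

    def values(self):
        return list(self._public().values())

    def items(self):
        return list(self._public().items())

    def to_dict(self) -> dict:
        # Recurse into nested sketch objects; leave plain values untouched.
        return {
            key: value.to_dict() if isinstance(value, DictLikeSketch) else value
            for key, value in self._public().items()
        }


class Doc(DictLikeSketch):
    def __init__(self, name: str) -> None:
        self.name = name
        self._cache = {}  # private, so hidden from the dict-like view

    @property
    def extension(self) -> str:
        return self.name.rsplit(".", 1)[-1]


class Folder(DictLikeSketch):
    def __init__(self, doc: Doc) -> None:
        self.doc = doc


doc = Doc("report.pdf")
doc["name"] = "notes.txt"  # dictionary-style assignment
folder = Folder(doc)
print(folder.to_dict())
```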
Session
Bases: AuditableMixin, SerializableMixin
__call__
__call__()
Initialize the session and return all public Session properties.
Returns:
| Name | Type | Description |
|---|---|---|
| list | list[tuple[str, Any]] | List of (key, value) tuples for all public Session properties. Includes properties inherited from SerializableMixin and AuditableMixin. |
Example
    >>> session_data = session()
    >>> for key, value in session_data:
    ...     print(f"{key}: {value}")
flat
flat()
Generator that yields (id, object) tuples for the document hierarchy.
Returns a flat view of the nested document structure, yielding each object with its id as the key. This does not modify the Session; it only provides an iterable view.
Yields:
| Type | Description |
|---|---|
| tuple[str, DocumentVersion \| DocumentVersionSheet \| SheetItem] | (id, object) pairs for DocumentVersion objects, DocumentVersionSheet objects, and SheetItem objects from the chunks, images, and rows lists. |
Example
    >>> for obj_id, obj in session.flat():
    ...     print(f"{obj_id}: {type(obj).__name__}")
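The flattening itself can be sketched as a generator over a nested hierarchy. The `Item`, `Sheet`, and `Version` dataclasses below are hypothetical stand-ins for SheetItem, DocumentVersionSheet, and DocumentVersion, and the traversal order and the skipping of None placeholders are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Item:
    id: str


@dataclass
class Sheet:
    id: str
    chunks: list = field(default_factory=list)
    images: list = field(default_factory=list)
    rows: list = field(default_factory=list)


@dataclass
class Version:
    id: str
    sheets: list = field(default_factory=list)


def flat(versions: list) -> Iterator[tuple]:
    """Yield (id, object) pairs without modifying the hierarchy."""
    for version in versions:
        yield version.id, version
        for sheet in version.sheets:
            if sheet is None:  # assumed: padded slots are skipped
                continue
            yield sheet.id, sheet
            for item in (*sheet.chunks, *sheet.images, *sheet.rows):
                yield item.id, item


versions = [
    Version(
        "v1",
        [
            Sheet("v1/s0", chunks=[Item("v1/s0/c0")], rows=[Item("v1/s0/r0")]),
            None,
        ],
    )
]
ids = [obj_id for obj_id, _ in flat(versions)]
```

Because `flat` is a generator, objects are produced lazily, and `dict(flat(versions))` gives an id-keyed lookup when one is needed.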
SignedEntity
Bases: EntityMixin, ABC
Abstract base class for entities with secure URL access.
Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration.
signed_url
property
writable
signed_url
Get a valid signed URL, regenerating if necessary.
__init__
__init__(signed_url=None, url_generator=None)
Initialize an information container.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| signed_url | str \| None | Initial signed URL for the container. | None |
| url_generator | Callable[[], str] \| None | Function to regenerate expired URLs. | None |
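How a `url_generator` callback might support automatic regeneration can be sketched as below. `SignedUrlHolder`, the `ttl_seconds` and `clock` parameters, and the fixed-lifetime expiry check are assumptions made for illustration; how SignedEntity actually decides that a URL has expired is not documented here.

```python
import time
from typing import Callable, Optional


class SignedUrlHolder:
    """Hypothetical stand-in for a container with a self-refreshing signed URL."""

    def __init__(
        self,
        signed_url: Optional[str] = None,
        url_generator: Optional[Callable[[], str]] = None,
        ttl_seconds: float = 3600.0,  # assumed: every URL lives for a fixed time
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._ttl = ttl_seconds
        self._clock = clock
        self._issued_at = clock() if signed_url else None

    @property
    def signed_url(self) -> str:
        expired = (
            self._issued_at is None
            or self._clock() - self._issued_at >= self._ttl
        )
        if self._signed_url is None or expired:
            if self._url_generator is None:
                raise RuntimeError("No valid signed URL and no url_generator")
            self._signed_url = self._url_generator()
            self._issued_at = self._clock()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value: str) -> None:
        self._signed_url = value
        self._issued_at = self._clock()


# Usage with a fake clock so that expiry can be simulated without waiting.
now = [0.0]
counter = iter(range(1, 100))
holder = SignedUrlHolder(
    url_generator=lambda: f"https://example.invalid/blob?sig={next(counter)}",
    ttl_seconds=60,
    clock=lambda: now[0],
)
first = holder.signed_url      # generated on first access
again = holder.signed_url      # still valid, so reused
now[0] = 61.0
refreshed = holder.signed_url  # past the lifetime, so regenerated
```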
get
abstractmethod
get()
Fetch data from the signed URL or return cached body.
Subclasses should implement this method with their own signature and logic as needed.