Bindings
Bindings are how Workbench connects to remotely-stored documents for analysis.
AzureBlobSession
Bases: Session
get_processing_status
get_processing_status()
Check if all files in the session have completed processing.
Returns:
| Type | Description |
|---|---|
| str | |
refresh_status
refresh_status()
Reconcile cached session summary from a LIST + pointer-JSON reads.
Cheap path: one LIST to enumerate blobs and classify each by artefact-suffix presence (status, row_count, per-file status and row_count all fall out here).
Follow-up cost: for SharePoint/Autodesk source files whose display
name isn't already cached in session_properties.json, download the
pointer JSON to read name. Subsequent refreshes skip this for
unchanged files, so it amortises to zero.
Writes status, file_count, row_count, sources, document_summaries and status_last_updated_at back to session_properties.json.
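The cheap path described above can be sketched as a single pass over blob names. This is only an illustration: the artefact suffixes, the `<file_id>/<artefact>` layout, the plain-string status values and the `classify_blobs` helper are all hypothetical and are not taken from Workbench's actual naming scheme.

```python
from collections import defaultdict

# Hypothetical artefact suffixes; the real binding's naming scheme may differ.
PROCESSING_SUFFIXES = (".chunk.json", ".image.json", ".row.json")
ROW_SUFFIX = ".row.json"
RESULTS_SUFFIX = ".results.json"


def classify_blobs(blob_names):
    """Derive per-file status and row_count from one listing of blob names.

    Assumes (hypothetically) that blobs are laid out as "<file_id>/<artefact>".
    """
    files = defaultdict(lambda: {"processing": False, "results": False, "row_count": 0})
    for name in blob_names:
        file_id, _, artefact = name.partition("/")
        entry = files[file_id]  # touching the entry registers the file
        if artefact.endswith(RESULTS_SUFFIX):
            entry["results"] = True
        elif artefact.endswith(PROCESSING_SUFFIXES):
            entry["processing"] = True
            if artefact.endswith(ROW_SUFFIX):
                entry["row_count"] += 1

    summary = {}
    for file_id, entry in files.items():
        if entry["results"]:
            status = "COMPLETED"
        elif entry["processing"]:
            status = "PROCESSING"
        else:
            status = "DRAFT"
        summary[file_id] = {"status": status, "row_count": entry["row_count"]}
    return summary
```

Because everything is derived from names alone, no blob body needs to be downloaded on this path; only the display-name lookup for SharePoint/Autodesk files adds a read.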
results_to_df
results_to_df()
Convert all document version results to a Pandas dataframe:
- Inserting results to the correct dataframe columns according to their field names
- Merging user edits and AI edits into their correct cell positions
Returns:
| Type | Description |
|---|---|
| DataFrame | Each row in the dataframe is a document version. |
set_file_count
set_file_count(new_file_count)
Set the file count in the session properties JSON.
Also updates the in-memory cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_file_count | int | The count of files | required |
Returns:
| Type | Description |
|---|---|
| bool | True if operation successful, else false. |
set_processing_status
set_processing_status(new_status)
Set the processing status in the session properties JSON.
Also updates the in-memory cache and status timestamp so that
subsequent reads of session.status reflect the new value without
needing a full refresh_status().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_status | str | The status to set. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if operation successful, else false. |
AuditableMixin
Bases: ABC
Abstract mix-in class requiring audit-tracking capabilities.
DocumentSummary
Bases: TypedDict
Shape returned by DocumentVersion.summary().
DocumentSummaryLite
Bases: TypedDict
Cheap per-document record exposed on the session summary.
Derivable from a LIST + at most one pointer-JSON read per previously unseen SharePoint/Autodesk file. Used by the workspace-list endpoint so the sessions table can render without triggering Session.initialize().
DocumentVersion
Bases: AuditableMixin, SignedEntity
child_count
property
child_count
Number of sheets bound to this document.
file_name
property
file_name
Storage-facing file name. Defaults to the id with a .json suffix when none is present.
row_count
abstractmethod
property
row_count
Total number of row artefacts across all sheets.
size
abstractmethod
property
size
Size in bytes of the source file, or None if unavailable.
For externally-hosted files (SharePoint, Autodesk) this should come from the source API payload (connection_details), not from any local config blob used to persist session state.
status
abstractmethod
property
status
Processing status of this document derived from artefact presence.
- DRAFT: no processing artefacts present
- PROCESSING: at least one chunk/image/row artefact present but no results
- COMPLETED: a results artefact exists for this document or one of its sheets
bind_chunk
abstractmethod
bind_chunk(content=None, sheet_index=0, chunk_index=0)
Short-hand method to upload a text chunk to a sheet. If content is not provided, initializes an empty sheet chunk.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the chunk. |
bind_image
abstractmethod
bind_image(content=None, sheet_index=0, image_index=0)
Short-hand method to upload an image to a sheet. If content is not provided, initializes an empty sheet image.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the image. |
bind_row
abstractmethod
bind_row(content=None, sheet_index=0, row_index=0)
Short-hand method to upload a row to a sheet. If content is not provided, initializes an empty sheet row.
Returns:
| Name | Type | Description |
|---|---|---|
| signed_url | str | Signed URL to read/write the row. |
bind_sheet
abstractmethod
bind_sheet(sheet_index=0)
Short-hand method to add a sheet of the relevant concrete class (e.g. AzureDocumentVersionSheet) to the DocumentVersion at the specified index. If a concrete object already exists at the specified sheet_index, it is not overwritten.
content
abstractmethod
content()
Short-hand method to get all the text content from each sheet.
get
get(max_retries=3, backoff_factor=1.5, page_limit=10, max_download_size=400 * 1024 * 1024)
Fetch data from the signed URL with exponential backoff retry logic.
For PDFs, this method can extract only the first N pages (see page_limit) to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) termination characters before the last bytes, which allows pages to be extracted from PDFs substantially larger than the file size limit.
For ZIP files, extracts the file matching self.file_name from the archive.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_retries | int | Maximum number of retry attempts. Defaults to 3. | 3 |
| backoff_factor | float | Exponential backoff multiplier for retry delays. Defaults to 1.5. | 1.5 |
| page_limit | int | Number of pages to extract (applies only to PDFs). Defaults to 10. | 10 |
| max_download_size | int | Maximum bytes to attempt to download. Defaults to 400MB. | 400 * 1024 * 1024 |
Returns:
| Name | Type | Description |
|---|---|---|
| bytes | bytes | The downloaded document content. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If all retry attempts fail. |
| ValueError | If the file size exceeds max_download_size. |
| FileNotFoundError | If the specified file is not found in a ZIP archive. |
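The retry behaviour of get() can be illustrated with a small stand-alone helper. This is a sketch under stated assumptions, not the library's implementation: `fetch_with_backoff` and its injectable `sleep` argument are hypothetical, the delay schedule (`backoff_factor ** attempt` seconds) is assumed, and `max_retries` is treated here as the total number of attempts.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, backoff_factor=1.5, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    The delay schedule (backoff_factor ** attempt seconds) is an assumption.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_retries - 1:
                sleep(backoff_factor ** attempt)
    raise RuntimeError(f"All {max_retries} attempts failed") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would normally leave it at `time.sleep`.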
pad_sheets
pad_sheets(sheet_number)
Pad the sheet list with None until the desired index is reached.
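As a rough illustration of the padding idea, the following free function extends a list with None placeholders until a given index becomes valid. The name `pad_to_index` and the use of a plain list are assumptions; the real method operates on the DocumentVersion's own sheet collection.

```python
def pad_to_index(sheets, sheet_number):
    """Append None to `sheets` until index `sheet_number` is valid."""
    while len(sheets) <= sheet_number:
        sheets.append(None)
    return sheets
```

After padding, a sheet can be assigned directly at `sheets[sheet_number]` without an IndexError, and existing entries are left untouched.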
summary
summary()
Return a serialisable summary of this document version.
Implemented on the ABC in terms of the abstract properties, so every binding produces the same shape without the caller needing to reach into the concrete class.
EntityMixin
Bases: SerializableMixin
Combined mixin providing data access, caching, serialization, and a dictionary-like interface.
__contains__
__contains__(key)
Check if a key exists using 'in' operator.
__getitem__
__getitem__(key)
Allow dictionary-style access to data attributes.
This works with cached data and automatically converts nested objects.
get
abstractmethod
get()
Fetch data from the underlying data store - implement in concrete classes
keys
keys()
Return all public attribute names, properties, and data keys.
set
abstractmethod
set(data=None)
Persist data to the underlying data store - implement in concrete classes
MetadataSpecification
Bases: AuditableMixin, EntityMixin
excel_from_classifiers
excel_from_classifiers()
Export classifiers to Excel workbook with:
- Summary sheet with all classifiers and their top-level properties
- Individual sheets for each classifier's picklist options
Returns:
| Type | Description |
|---|---|
| BytesIO | Excel workbook as bytes that can be downloaded or sent via API. |
turtle_from_classifiers
turtle_from_classifiers()
Convert the information standard classifiers dictionary into Turtle, a textual syntax for RDF triples that can be imported into other systems.
Returns:
| Type | Description |
|---|---|
| str | A Turtle definition as a string. |
Results
Bases: EntityMixin
add
abstractmethod
add(id, name, value, method='workflow', certainty=None, explanation=None, format_valid=True)
Adds a new PropertyValue to the result object. Does not write results to storage; call set() to persist them.
SerializableMixin
Mixin providing data serialization and a dictionary-like interface.
__contains__
__contains__(key)
Check if a key exists using 'in' operator.
__getitem__
__getitem__(key)
Allow dictionary-style access to data attributes.
This works with cached data and automatically converts nested objects.
__setitem__
__setitem__(key, value)
Allow dictionary-style setting of data attributes.
items
items()
Return public key-value pairs.
keys
keys()
Return all public attribute names, properties, and data keys.
to_dict
to_dict()
Convert object attributes to dictionary format.
Recursively converts the object and all its properties into dictionaries for serialization purposes. Only includes public attributes and properties.
values
values()
Return all public attribute values, property values, and data values.
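The dictionary-like behaviour described for this mixin can be approximated in a few lines. The sketch below is hypothetical: `DictLikeSketch` and `Point` are illustrative names, and the rule that "public" means "does not start with an underscore" is an assumption rather than something the source states.

```python
class DictLikeSketch:
    """Hypothetical stand-in for a dictionary-like mixin; not Workbench's class."""

    def keys(self):
        # Public names only: instance attributes plus properties on the class.
        names = [k for k in vars(self) if not k.startswith("_")]
        for cls in type(self).__mro__:
            for k, v in vars(cls).items():
                if isinstance(v, property) and not k.startswith("_") and k not in names:
                    names.append(k)
        return names

    def __contains__(self, key):
        return key in self.keys()

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        return getattr(self, key)

    def __setitem__(self, key, value):
        setattr(self, key, value)

    def items(self):
        return [(k, self[k]) for k in self.keys()]

    def to_dict(self):
        # Recurse into nested dictionary-like objects and sequences.
        def convert(value):
            if isinstance(value, DictLikeSketch):
                return value.to_dict()
            if isinstance(value, (list, tuple)):
                return [convert(v) for v in value]
            return value

        return {k: convert(v) for k, v in self.items()}


class Point(DictLikeSketch):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self._secret = "hidden"  # private, so excluded from keys()

    @property
    def norm1(self):
        return abs(self.x) + abs(self.y)
```

With this in place, `"x" in point`, `point["norm1"]` and `point.to_dict()` all work on a `Point` instance, while `_secret` stays out of the public view.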
Session
Bases: AuditableMixin, SerializableMixin
document_summaries
abstractmethod
property
document_summaries
Cheap per-document records, refreshed in lock-step with status.
One record per source file in the session. Populated by refresh_status() and persisted alongside the other cached summary fields. Reads never block on initialize().
file_count
abstractmethod
property
file_count
Number of source files in the session (cheap / cached).
row_count
abstractmethod
property
row_count
Total number of row artefacts across every file in the session (cheap / cached).
sources
abstractmethod
property
sources
Distinct storage providers present in this session (e.g. ["sharepoint", "csv"]).
status
abstractmethod
property
status
Rolled-up processing status for the session.
Cheap to read: the value is backed by a cached summary maintained by the concrete binding. Callers who need a guaranteed-fresh value should call refresh_status() first.
status_last_updated_at
abstractmethod
property
status_last_updated_at
When the cached status/counts were last reconciled from storage.
__call__
__call__()
Initialize the session and return all public Session properties.
Returns:
| Name | Type | Description |
|---|---|---|
| list | list[tuple[str, Any]] | List of (key, value) tuples for all public Session properties. Includes properties inherited from SerializableMixin and AuditableMixin. |
Example

    >>> session_data = session()
    >>> for key, value in session_data:
    ...     print(f"{key}: {value}")
flat
flat()
Generator that yields (id, object) tuples for the document hierarchy.
Returns a flat view of the nested document structure, yielding each object with its id as the key. This does not modify the Session, it only provides an iterable view.
Yields:
| Type | Description |
|---|---|
| tuple[str, DocumentVersion \| DocumentVersionSheet \| SheetItem] | (id, object) pairs for DocumentVersion objects, DocumentVersionSheet objects, and SheetItem objects from the chunks, images, and rows lists. |
Example

    >>> for obj_id, obj in session.flat():
    ...     print(f"{obj_id}: {type(obj).__name__}")
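A flattening generator of this kind can be sketched with stand-in classes. Everything here is illustrative: `Doc`, `Sheet` and `Item` are hypothetical substitutes for DocumentVersion, DocumentVersionSheet and SheetItem, and the depth-first order and the skipping of None placeholders are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Item:  # stands in for SheetItem
    id: str


@dataclass
class Sheet:  # stands in for DocumentVersionSheet
    id: str
    chunks: list = field(default_factory=list)
    images: list = field(default_factory=list)
    rows: list = field(default_factory=list)


@dataclass
class Doc:  # stands in for DocumentVersion
    id: str
    sheets: list = field(default_factory=list)


def flat(documents):
    """Yield (id, object) pairs depth-first without mutating the hierarchy."""
    for doc in documents:
        yield doc.id, doc
        for sheet in doc.sheets:
            if sheet is None:  # skip padding placeholders
                continue
            yield sheet.id, sheet
            for item in (*sheet.chunks, *sheet.images, *sheet.rows):
                yield item.id, item
```

Because it is a generator, callers can stop early or wrap it in `dict(...)` when they want an id-to-object lookup.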
refresh_status
abstractmethod
refresh_status()
Reconcile status, file_count, row_count and sources from underlying storage.
Implementations should derive these from a single low-cost enumeration of the session's storage (e.g. a blob LIST), not from a full initialize(). The result is persisted so that subsequent reads of the cached properties reflect the new values.
summary
summary()
Return a serialisable summary of this session.
Implemented on the ABC in terms of the abstract properties so every binding produces the same shape.
SessionStatus
Bases: StrEnum
Canonical status values for a Session or a DocumentVersion.
- DRAFT: source file(s) present but no processing artefacts observed.
- PROCESSING: at least one processing artefact exists but no final results.
- COMPLETED: results artefact exists for every file.
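One way to read these definitions is as a roll-up over per-file statuses. The sketch below is an interpretation, not the library's code: the `Status` class uses a `str` + `Enum` mixin for portability where the source derives from StrEnum, the member values are assumed to equal their names, and treating a mix of completed and unfinished files as PROCESSING (and an empty session as DRAFT) is an assumption.

```python
from enum import Enum


class Status(str, Enum):
    # Member values are assumed; the source only names the members.
    DRAFT = "DRAFT"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"


def roll_up(file_statuses):
    """Combine per-file statuses into one session-level status."""
    statuses = list(file_statuses)
    # COMPLETED only when there is at least one file and every file has results.
    if statuses and all(s == Status.COMPLETED for s in statuses):
        return Status.COMPLETED
    # Any artefact anywhere means work has started.
    if any(s != Status.DRAFT for s in statuses):
        return Status.PROCESSING
    return Status.DRAFT
```

Because the members are also strings, values read back from a JSON file compare equal to the enum members without conversion.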
SessionSummary
Bases: TypedDict
Shape returned by Session.summary().
SignedEntity
Bases: EntityMixin, ABC
Abstract base class for entities with secure URL access.
Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration.
signed_url
property
writable
signed_url
Get a valid signed URL, regenerating if necessary.
__init__
__init__(signed_url=None, url_generator=None)
Initialize an information container.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| signed_url | str \| None | Initial signed URL for the container. | None |
| url_generator | Callable[[], str] \| None | Function to regenerate expired URLs. | None |
get
abstractmethod
get()
Fetch data from the signed URL or return cached body.
Subclasses should implement this method with their own signature and logic as needed.
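The automatic URL regeneration can be pictured with a small holder class. This is a sketch only: the source does not say how expiry is detected, so the fixed `ttl_seconds` lifetime and the injectable `clock` are assumptions, and `SignedUrlHolder` is a hypothetical name.

```python
import time


class SignedUrlHolder:
    """Hypothetical holder that regenerates its signed URL after a fixed lifetime."""

    def __init__(self, signed_url=None, url_generator=None,
                 ttl_seconds=300, clock=time.monotonic):
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._ttl = ttl_seconds  # assumed lifetime of a URL
        self._clock = clock
        self._issued_at = clock() if signed_url else None

    @property
    def signed_url(self):
        # Regenerate when no URL was ever issued or the assumed lifetime has passed.
        expired = self._issued_at is None or self._clock() - self._issued_at >= self._ttl
        if expired and self._url_generator is not None:
            self.signed_url = self._url_generator()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value):
        # Writable property: setting a URL also resets its issue time.
        self._signed_url = value
        self._issued_at = self._clock()
```

Callers simply read `holder.signed_url` each time they need to make a request, so an expired URL is never used as long as a generator was supplied.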