Bindings

Bindings are how Workbench connects to remotely-stored documents for analysis.

AzureBlobSession

Bases: Session

get_processing_status

get_processing_status()

Check if all files in the session have completed processing.

Returns:

Type Description
str

completed if all files have status 'completed', error if any file has status 'failed', processing if no errors and any file has status 'processing', else draft.
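The precedence described above can be sketched as a small pure function. This is an illustration only, not the binding's code: the name `roll_up_status` and the use of plain strings are hypothetical, and treating a session with no files as `draft` is an assumption the text does not settle.

```python
from collections.abc import Iterable


def roll_up_status(file_statuses: Iterable[str]) -> str:
    """Hypothetical helper: reduce per-file statuses to one session status."""
    statuses = list(file_statuses)
    # Assumption: a session with no files is reported as "draft".
    if not statuses:
        return "draft"
    # Every file finished: the session is complete.
    if all(s == "completed" for s in statuses):
        return "completed"
    # Any failure outranks files that are still in progress.
    if any(s == "failed" for s in statuses):
        return "error"
    if any(s == "processing" for s in statuses):
        return "processing"
    return "draft"
```

Note that a mix of completed and untouched files falls through to `draft` under this reading.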

refresh_status

refresh_status()

Reconcile the cached session summary from one LIST call plus pointer-JSON reads.

Cheap path: a single LIST enumerates the blobs and classifies each one by artefact-suffix presence. The session-level status and row_count, and the per-file status and row_count, all fall out of this pass.

Follow-up cost: for SharePoint/Autodesk source files whose display name isn't already cached in session_properties.json, download the pointer JSON to read name. Subsequent refreshes skip this for unchanged files, so it amortises to zero.

Writes status, file_count, row_count, sources, document_summaries and status_last_updated_at back to session_properties.json.

results_to_df

results_to_df()

Convert all document version results into a pandas DataFrame by:

  1. Inserting results to the correct dataframe columns according to their field names
  2. Merging user edits and AI edits into their correct cell positions

Returns:

Type Description
DataFrame

Each row in the dataframe is a document version.

set_file_count

set_file_count(new_file_count)

Set the file count in the session properties JSON.

Also updates the in-memory cache.

Parameters:

Name Type Description Default
new_file_count int

The count of files

required

Returns:

Type Description
bool

True if the operation succeeded, otherwise False.

set_processing_status

set_processing_status(new_status)

Set the processing status in the session properties JSON.

Also updates the in-memory cache and status timestamp so that subsequent reads of session.status reflect the new value without needing a full refresh_status().

Parameters:

Name Type Description Default
new_status str

The status to set.

required

Returns:

Type Description
bool

True if the operation succeeded, otherwise False.

AuditableMixin

Bases: ABC

Abstract mix-in class requiring audit-tracking capabilities.

DocumentSummary

Bases: TypedDict

Shape returned by DocumentVersion.summary().

DocumentSummaryLite

Bases: TypedDict

Cheap per-document record exposed on the session summary.

Derivable from a LIST + at most one pointer-JSON read per previously unseen SharePoint/Autodesk file. Used by the workspace-list endpoint so the sessions table can render without triggering Session.initialize().

DocumentVersion

Bases: AuditableMixin, SignedEntity

child_count property

child_count

Number of sheets bound to this document.

file_name property

file_name

Storage-facing file name. Defaults to the id with a .json suffix when none is present.

row_count abstractmethod property

row_count

Total number of row artefacts across all sheets.

size abstractmethod property

size

Size in bytes of the source file, or None if unavailable.

For externally-hosted files (SharePoint, Autodesk) this should come from the source API payload (connection_details), not from any local config blob used to persist session state.

status abstractmethod property

status

Processing status of this document derived from artefact presence.

  • DRAFT: no processing artefacts present
  • PROCESSING: at least one chunk/image/row artefact present but no results
  • COMPLETED: a results artefact exists for this document or one of its sheets
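The three rules above amount to a precedence check on which artefact kinds exist. A minimal sketch follows. The function name and the kind labels (`"chunk"`, `"image"`, `"row"`, `"results"`) are hypothetical, and concrete bindings may detect artefacts differently (for example by blob suffix).

```python
def derive_document_status(artefact_kinds: set[str]) -> str:
    """Hypothetical helper: map the artefact kinds present to a status."""
    # A results artefact wins, whatever else is present.
    if "results" in artefact_kinds:
        return "COMPLETED"
    # Intermediate artefacts without results mean work is under way.
    if artefact_kinds & {"chunk", "image", "row"}:
        return "PROCESSING"
    # Nothing but the source file has been observed.
    return "DRAFT"
```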

bind_chunk abstractmethod

bind_chunk(content=None, sheet_index=0, chunk_index=0)

Short-hand method to upload a text chunk to a sheet. If content is not provided, initializes an empty sheet chunk.

Returns:

Name Type Description
signed_url str

Signed URL to read/write the chunk.

bind_image abstractmethod

bind_image(content=None, sheet_index=0, image_index=0)

Short-hand method to upload an image to a sheet. If content is not provided, initializes an empty sheet image.

Returns:

Name Type Description
signed_url str

Signed URL to read/write the image.

bind_row abstractmethod

bind_row(content=None, sheet_index=0, row_index=0)

Short-hand method to upload a row to a sheet. If content is not provided, initializes an empty sheet row.

Returns:

Name Type Description
signed_url str

Signed URL to read/write the row.

bind_sheet abstractmethod

bind_sheet(sheet_index=0)

Short-hand method to add a sheet of the relevant concrete class (e.g. AzureDocumentVersionSheet) to the DocumentVersion at the specified index. If a concrete object already exists at the specified sheet_index, it is not overwritten.

content abstractmethod

content()

Short-hand method to get all the text content from each sheet.

get

get(max_retries=3, backoff_factor=1.5, page_limit=10, max_download_size=400 * 1024 * 1024)

Fetch data from the signed URL with exponential backoff retry logic.

For PDFs, only the first N pages can be extracted to reduce memory usage. The PDF page extractor recognizes end-of-file (EOF) markers that occur before the final bytes, so pages can be extracted from PDFs substantially larger than the download size limit.

For ZIP files, extracts the file matching self.file_name from the archive.
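The ZIP case can be sketched with the standard library's `zipfile` module. This is an illustration, not the method's actual code: `extract_from_zip` is a hypothetical helper, and matching on the final path component of each archive member is an assumption.

```python
import io
import zipfile


def extract_from_zip(archive_bytes: bytes, file_name: str) -> bytes:
    """Hypothetical helper: return the archive member named file_name."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member in archive.namelist():
            # Assumption: compare against the last path component only,
            # so "docs/a.txt" matches a file_name of "a.txt".
            if member.rsplit("/", 1)[-1] == file_name:
                return archive.read(member)
    raise FileNotFoundError(f"{file_name!r} not found in ZIP archive")
```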

Parameters:

Name Type Description Default
max_retries int

Maximum number of retry attempts. Defaults to 3.

3
backoff_factor float

Exponential backoff multiplier for retry delays. Defaults to 1.5.

1.5
page_limit int

Number of pages to extract (applies only to PDFs). Defaults to 10.

10
max_download_size int

Maximum number of bytes to attempt to download. Defaults to 400 MiB (400 * 1024 * 1024 bytes).

400 * 1024 * 1024

Returns:

Name Type Description
bytes bytes

The downloaded document content.

Raises:

Type Description
RuntimeError

If all retry attempts fail.

ValueError

If the file size exceeds max_download_size.

FileNotFoundError

If the specified file is not found in a ZIP archive.
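The retry behaviour controlled by max_retries and backoff_factor can be sketched as follows. This is a minimal illustration under assumptions: `fetch_with_backoff` and its `fetch` and `sleep` parameters are hypothetical, and the delay schedule (`backoff_factor ** attempt` seconds) is one common choice, not necessarily the one used here.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], bytes],
    max_retries: int = 3,
    backoff_factor: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Hypothetical helper: call fetch() until it succeeds or retries run out."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            # No point sleeping after the final attempt.
            if attempt < max_retries - 1:
                # Assumed schedule: 1 s, 1.5 s, 2.25 s, ... for factor 1.5.
                sleep(backoff_factor ** attempt)
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.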

pad_sheets

pad_sheets(sheet_number)

Pad the sheet list with None until the desired index is reached.
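A sketch of the padding idea, written as a free function over a plain list. The name `pad_to_index` is hypothetical, and so is the reading that sheet_number is a zero-based index that must become addressable. The real method operates on the DocumentVersion's own sheet list.

```python
def pad_to_index(sheets: list, sheet_number: int) -> list:
    """Hypothetical helper: extend sheets with None so sheets[sheet_number] exists."""
    # Assumption: sheet_number is zero-based and inclusive.
    while len(sheets) <= sheet_number:
        sheets.append(None)
    return sheets
```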

summary

summary()

Return a serialisable summary of this document version.

Implemented on the ABC in terms of the abstract properties, so every binding produces the same shape without the caller needing to reach into the concrete class.

EntityMixin

Bases: SerializableMixin

Combined mixin providing data access, caching, serialization, and a dictionary-like interface.

__contains__

__contains__(key)

Check if a key exists using 'in' operator.

__getitem__

__getitem__(key)

Allow dictionary-style access to data attributes.

This works with cached data and automatically converts nested objects.

get abstractmethod

get()

Fetch data from the underlying data store. Implement in concrete classes.

keys

keys()

Return all public attribute names, properties, and data keys.

set abstractmethod

set(data=None)

Persist data to the underlying data store. Implement in concrete classes.

MetadataSpecification

Bases: AuditableMixin, EntityMixin

excel_from_classifiers

excel_from_classifiers()

Export classifiers to Excel workbook with:

  1. Summary sheet with all classifiers and their top-level properties
  2. Individual sheets for each classifier's picklist options

Returns:

Type Description
BytesIO

Excel workbook as bytes that can be downloaded or sent via API.

turtle_from_classifiers

turtle_from_classifiers()

Convert the information standard classifiers dictionary into Turtle, a textual syntax for RDF triples that can be imported into other systems.

Returns:

Type Description
str

The Turtle definition as a string.
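For readers unfamiliar with Turtle, the sketch below shows roughly what such output looks like. Everything in it is assumed for illustration: the function name, the shape of the classifiers dictionary (key mapped to a dict with a `label`), the `ex:` namespace URL and the choice of `rdfs:label`. The real method's vocabulary and structure may differ.

```python
def classifiers_to_turtle(
    classifiers: dict[str, dict[str, str]],
    base: str = "http://example.org/classifier/",  # placeholder namespace
) -> str:
    """Hypothetical helper: emit one labelled resource per classifier."""
    lines = [
        f"@prefix ex: <{base}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for key, props in classifiers.items():
        # Escape backslashes and quotes so the label is a valid string literal.
        label = props.get("label", key).replace("\\", "\\\\").replace('"', '\\"')
        # Assumption: each key is already a valid Turtle local name.
        lines.append(f"ex:{key} a ex:Classifier ;")
        lines.append(f'    rdfs:label "{label}" .')
    return "\n".join(lines) + "\n"
```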

Results

Bases: EntityMixin

add abstractmethod

add(id, name, value, method='workflow', certainty=None, explanation=None, format_valid=True)

Adds a new PropertyValue to the result object. This does not write the results to storage; call set() to persist them.
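The two-step pattern (add in memory, then persist) can be illustrated with a stand-in class. `InMemoryResults` is hypothetical and uses plain dictionaries in place of PropertyValue objects and the real data store. That set() with no argument persists everything added so far is also an assumption.

```python
class InMemoryResults:
    """Hypothetical stand-in for a Results binding, backed by plain dicts."""

    def __init__(self):
        self.pending = {}  # values added but not yet persisted
        self.storage = {}  # stand-in for the real data store

    def add(self, id, name, value, method="workflow",
            certainty=None, explanation=None, format_valid=True):
        # Only the in-memory object changes here.
        self.pending[id] = {
            "name": name, "value": value, "method": method,
            "certainty": certainty, "explanation": explanation,
            "format_valid": format_valid,
        }

    def set(self, data=None):
        # Assumption: with no argument, persist everything added so far.
        self.storage.update(self.pending if data is None else data)
        return True


results = InMemoryResults()
results.add("p1", "Title", "Spec A", certainty=0.9)
# Nothing is persisted until set() is called.
results.set()
```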

SerializableMixin

Mixin providing data serialization and a dictionary-like interface.

__contains__

__contains__(key)

Check if a key exists using 'in' operator.

__getitem__

__getitem__(key)

Allow dictionary-style access to data attributes.

This works with cached data and automatically converts nested objects.

__setitem__

__setitem__(key, value)

Allow dictionary-style setting of data attributes.

items

items()

Return public key-value pairs.

keys

keys()

Return all public attribute names, properties, and data keys.

to_dict

to_dict()

Convert object attributes to dictionary format.

Recursively converts the object and all its properties into dictionaries for serialization purposes. Only includes public attributes and properties.

values

values()

Return all public attribute values, property values, and data values.
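The dictionary-like interface listed above can be sketched as a small mixin. This is a simplified illustration, not the library's implementation: `DictLikeMixin` and `Point` are hypothetical names, and the sketch covers instance attributes only, whereas the documented methods also include properties and data keys.

```python
class DictLikeMixin:
    """Hypothetical mixin: dictionary-style access over public attributes."""

    def keys(self):
        # Public names only: skip anything starting with an underscore.
        return [k for k in vars(self) if not k.startswith("_")]

    def values(self):
        return [getattr(self, k) for k in self.keys()]

    def items(self):
        return [(k, getattr(self, k)) for k in self.keys()]

    def __contains__(self, key):
        return key in self.keys()

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        return getattr(self, key)

    def __setitem__(self, key, value):
        setattr(self, key, value)

    def to_dict(self):
        # Recurse into nested objects that also offer to_dict().
        return {
            k: v.to_dict() if hasattr(v, "to_dict") else v
            for k, v in self.items()
        }


class Point(DictLikeMixin):
    def __init__(self, x, y):
        self.x, self.y, self._cache = x, y, {}
```

With this in place, `"x" in Point(1, 2)` is true while the private `_cache` attribute stays hidden from keys() and to_dict().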

Session

Bases: AuditableMixin, SerializableMixin

document_summaries abstractmethod property

document_summaries

Cheap per-document records, refreshed in lock-step with status.

One record per source file in the session. Populated by refresh_status() and persisted alongside the other cached summary fields. Reads never block on initialize().

file_count abstractmethod property

file_count

Number of source files in the session (cheap / cached).

row_count abstractmethod property

row_count

Total number of row artefacts across every file in the session (cheap / cached).

sources abstractmethod property

sources

Distinct storage providers present in this session (e.g. ["sharepoint", "csv"]).

status abstractmethod property

status

Rolled-up processing status for the session.

Cheap to read — backed by a cached summary maintained by the concrete binding. Callers who need a guaranteed-fresh value should call refresh_status() first.

status_last_updated_at abstractmethod property

status_last_updated_at

When the cached status/counts were last reconciled from storage.

__call__

__call__()

Initialize the session and return all public Session properties.

Returns:

Name Type Description
list list[tuple[str, Any]]

List of (key, value) tuples for all public Session properties. Includes properties inherited from SerializableMixin and AuditableMixin.

Example

    session_data = session()
    for key, value in session_data:
        print(f"{key}: {value}")

flat

flat()

Generator that yields (id, object) tuples for the document hierarchy.

Returns a flat view of the nested document structure, yielding each object with its id as the key. This does not modify the Session, it only provides an iterable view.

Yields:

Type Description
tuple[str, DocumentVersion | DocumentVersionSheet | SheetItem]

Tuple[str, object]: (id, object) pairs for:

  • DocumentVersion objects
  • DocumentVersionSheet objects
  • SheetItem objects from the chunks, images, and rows lists

Example

    for obj_id, obj in session.flat():
        print(f"{obj_id}: {type(obj).__name__}")
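The flattening itself can be sketched as a nested generator. The dataclasses below are hypothetical stand-ins for DocumentVersion, DocumentVersionSheet and SheetItem. The traversal order (document, then each sheet, then its chunks, images and rows) and the skipping of empty sheet slots are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Item:  # stand-in for SheetItem
    id: str


@dataclass
class Sheet:  # stand-in for DocumentVersionSheet
    id: str
    chunks: list = field(default_factory=list)
    images: list = field(default_factory=list)
    rows: list = field(default_factory=list)


@dataclass
class Document:  # stand-in for DocumentVersion
    id: str
    sheets: list = field(default_factory=list)


def flat(documents):
    """Yield (id, object) pairs without modifying the hierarchy."""
    for doc in documents:
        yield doc.id, doc
        for sheet in doc.sheets:
            if sheet is None:  # assumed: padded, empty slots are skipped
                continue
            yield sheet.id, sheet
            for item in (*sheet.chunks, *sheet.images, *sheet.rows):
                yield item.id, item
```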

refresh_status abstractmethod

refresh_status()

Reconcile status, file_count, row_count and sources from underlying storage.

Implementations should derive these from a single low-cost enumeration of the session's storage (e.g. a blob LIST), not from a full initialize(). The result is persisted so that subsequent reads of the cached properties reflect the new values.

summary

summary()

Return a serialisable summary of this session.

Implemented on the ABC in terms of the abstract properties so every binding produces the same shape.

SessionStatus

Bases: StrEnum

Canonical status values for a Session or a DocumentVersion.

  • DRAFT: source file(s) present but no processing artefacts observed.
  • PROCESSING: at least one processing artefact exists but no final results.
  • COMPLETED: results artefact exists for every file.

SessionSummary

Bases: TypedDict

Shape returned by Session.summary().

SignedEntity

Bases: EntityMixin, ABC

Abstract base class for entities with secure URL access.

Represents a container that holds information with temporary signed URL access for reading and writing data. Provides automatic URL regeneration.

signed_url property writable

signed_url

Get a valid signed URL, regenerating if necessary.

__init__

__init__(signed_url=None, url_generator=None)

Initialize an information container.

Parameters:

Name Type Description Default
signed_url str | None

Initial signed URL for the container.

None
url_generator Callable[[], str] | None

Function to regenerate expired URLs.

None
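A sketch of how a signed_url property might regenerate an expired URL through url_generator. The class name, the `ttl_seconds` and `clock` parameters, and the time-based expiry check are all assumptions made for illustration. The source does not say how expiry is detected.

```python
import time


class SignedUrlHolder:
    """Hypothetical sketch of a container with a self-refreshing signed URL."""

    def __init__(self, signed_url=None, url_generator=None,
                 ttl_seconds=900.0, clock=time.monotonic):
        self._signed_url = signed_url
        self._url_generator = url_generator
        self._ttl = ttl_seconds  # assumed lifetime of a signed URL
        self._clock = clock
        self._issued_at = clock() if signed_url else None

    @property
    def signed_url(self):
        expired = (
            self._issued_at is None
            or self._clock() - self._issued_at >= self._ttl
        )
        # Regenerate only when needed and only if a generator was supplied.
        if (self._signed_url is None or expired) and self._url_generator:
            self.signed_url = self._url_generator()
        return self._signed_url

    @signed_url.setter
    def signed_url(self, value):
        self._signed_url = value
        self._issued_at = self._clock()
```

Injecting the clock makes the expiry path easy to exercise in tests without waiting.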

get abstractmethod

get()

Fetch data from the signed URL or return cached body.

Subclasses should implement this method with their own signature and logic as needed.