Overview

Workbench adopts conventions that govern how data inputs and outputs are represented and how operations are performed.

Becoming familiar with these concepts will allow you to quickly assemble analysis workflows and apply them to your own data.

Object-oriented programming

Workbench applies object-oriented programming (OOP) principles to model objects using pre-defined classes that bundle data (attributes) with code (methods). For example, a page extracted from a PDF can be represented as an instance of the DocumentVersionSheet class, and the page's metadata can be retrieved by calling the get() method on its metadata attribute.

from workbench.bindings import DocumentVersionSheet

...

DocumentVersionSheet.metadata.get()

This guide introduces some of the common classes you will interact with when using Workbench.

Information containers

Workbench adopts the ISO 19650 principle of the information container. The InformationContainer class connects to a blob in a blob storage system (like Azure Blob Storage or AWS S3) and provides a basic mechanism for accessing input data and writing outputs back to storage.

What is an information container?

According to ISO 19650, an information container is a unique file.

For more information on information container management through metadata assignment, refer to the UK BIM Framework's Guidance Part C: Facilitating the common data environment (workflow and technical solutions).

To initialise an information container you must provide a URL and (optionally) request headers to read from and write to the blob.

from workbench.bindings import InformationContainer

signed_url = "https://example.com/path-to-file"
info_cont = InformationContainer(signed_url)

The InformationContainer class has several sub-classes and related classes that introduce new attributes and methods to represent data at different levels of decomposition:

Class | Purpose
DocumentVersion | Root-node information container that represents an intact file (such as a PDF or an Excel workbook). Can have zero or many child DocumentVersionSheet objects that contain information containers derived from the parent.
DocumentVersionSheet | Represents a page or sheet in a DocumentVersion. Stores lists of information container text content, image / graphical content and rows (tabular content) that have been extracted from the DocumentVersion.
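
The parent/child relationship between these classes can be pictured with a small, self-contained sketch. This is not the Workbench implementation: the dataclasses below are hypothetical stand-ins, and the attribute names sheets, text, images and rows are assumptions chosen to mirror the descriptions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InformationContainer:
    """Hypothetical stand-in: holds only the URL of the blob it points to."""
    url: str


@dataclass
class DocumentVersionSheet(InformationContainer):
    """Stand-in for one page or sheet extracted from a parent document."""
    text: List[str] = field(default_factory=list)    # assumed attribute name
    images: List[str] = field(default_factory=list)  # assumed attribute name
    rows: List[dict] = field(default_factory=list)   # assumed attribute name


@dataclass
class DocumentVersion(InformationContainer):
    """Stand-in for an intact file with zero or many child sheets."""
    file_name: str = ""
    sheets: List[DocumentVersionSheet] = field(default_factory=list)  # assumed name


# Build a two-page document and walk its children
doc = DocumentVersion(url="https://example.com/path-to-file", file_name="report.pdf")
doc.sheets.append(DocumentVersionSheet(url=doc.url + "#page=1", text=["Introduction"]))
doc.sheets.append(DocumentVersionSheet(url=doc.url + "#page=2", rows=[{"item": "pump", "qty": 2}]))

for number, sheet in enumerate(doc.sheets, start=1):
    print(f"Sheet {number}: {len(sheet.text)} text block(s), {len(sheet.rows)} row(s)")
```

Because both stand-ins inherit from the same base class, anything written against an InformationContainer (here, just the url attribute) works at either level of decomposition.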

For more information, read this guide on Documents.

Sessions

Instead of instantiating and interacting with InformationContainer objects individually (which could be inefficient and prone to error), Workbench provides session classes for indexing and analysing documents in batches. Instantiating a session class exposes several notable attributes, amongst others:

Attribute | Purpose
session.document_versions | A list of DocumentVersion objects that belong to the session.
session.workflow | A series of analysis steps and their dependencies, otherwise known as a directed acyclic graph (DAG). For more information on workflows, read this guide.
session.classifiers, session.tags, session.attributes | Metadata field definitions that can be applied to the document versions by running a workflow.
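
To make the idea of a workflow as a DAG concrete, here is a minimal sketch using only Python's standard library (graphlib, Python 3.9+). It does not use the Workbench API: the step names and the execution_order helper are hypothetical, and how session.workflow stores its steps is not shown here.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on
workflow = {
    "extract_text": set(),
    "classify": {"extract_text"},
    "tag": {"extract_text"},
    "summarise": {"classify", "tag"},
}


def execution_order(steps):
    """Return the steps in an order that respects every dependency."""
    # static_order() yields each step only after all of its dependencies
    return list(TopologicalSorter(steps).static_order())


order = execution_order(workflow)
print(order)

# A graph with a cycle is not a DAG, so no valid order exists
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print(f"Cycle detected: {has_cycle}")
```

The "acyclic" requirement is what guarantees that such an order exists: if two steps depended on each other, neither could run first.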

Session bindings are the fastest way to begin any analysis, and are a big part of what makes Workbench uniquely easy to use. They abstract away much of the complexity of document indexing from other sources and provide a consistent schema for working with other Workbench classes and methods.

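import json

# Print key document information.
# Assumes `session` is an existing Workbench session binding.
for doc_version in session.document_versions:
    print(f'Document name: {doc_version.file_name}')
    try:
        # Retrieve metadata from storage
        metadata = doc_version.metadata.get()
        print(json.dumps(metadata, indent=4))
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl+C still works
        print(f'No metadata found for {doc_version.file_name}')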

Before we explain how to create a session it's helpful to understand how sessions are managed to protect your data.

Organisational hierarchy

The Hoppa ecosystem adopts a three-tier hierarchy for scoping user permissions and managing information:

  • Organization - All users must belong to at least one organization. Data can only reside within one organization.
  • Workspace - Bucket for managing information related to a particular project, system or location. Allows templating of information standards and workflows for re-use. Belongs to an organization.
  • Session - Bucket for running analysis and storing results. Belongs to a workspace.
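
The three tiers nest inside one another, which can be sketched as plain data structures. The classes below are illustrative only and are not part of Workbench or the Hoppa API; the attribute names, the session_paths helper and the example names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Session:
    """Illustrative bucket for running analysis and storing results."""
    name: str


@dataclass
class Workspace:
    """Illustrative bucket for a project, system or location."""
    name: str
    sessions: List[Session] = field(default_factory=list)


@dataclass
class Organization:
    """Illustrative top tier: every workspace belongs to exactly one."""
    name: str
    workspaces: List[Workspace] = field(default_factory=list)


def session_paths(org):
    """List every session as an 'organization/workspace/session' path."""
    return [
        f"{org.name}/{ws.name}/{s.name}"
        for ws in org.workspaces
        for s in ws.sessions
    ]


# Hypothetical example data
org = Organization(
    name="acme",
    workspaces=[
        Workspace(name="bridge-project", sessions=[Session("survey"), Session("handover")]),
        Workspace(name="depot", sessions=[Session("audit")]),
    ],
)

for path in session_paths(org):
    print(path)
```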

Creating a session

Sessions must be created using the Hoppa web app before they can be used in Workbench. These video guides show how to connect to third-party document management systems and add files to the session. For information on binding Workbench to your session, read this guide on Session bindings.