Jobs

Jobs are classes that encapsulate other Workbench functions and introduce in-built state management to write outputs back to storage.

BasicMetadataExtraction

Bases: Job

Applies basic transformations to common file types to prepare for analysis.

  • Images: downsample to reduce LLM token consumption.
  • PDFs: screenshot first page to provide LLM with awareness of page layout.
  • All other file types: Extract embedded metadata.
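
The dispatch described in the list above can be pictured as a lookup on file type. The sketch below is illustrative only: the function name `choose_transformation`, the extension sets and the returned labels are assumptions made for this example and are not part of Workbench.

```python
from pathlib import Path

# Hypothetical extension sets; the real job may detect file types differently.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"}
PDF_EXTENSIONS = {".pdf"}


def choose_transformation(file_name: str) -> str:
    """Return a label for the transformation applied to a file (illustrative)."""
    suffix = Path(file_name).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "downsample"
    if suffix in PDF_EXTENSIONS:
        return "screenshot_first_page"
    # Everything else falls through to embedded metadata extraction.
    return "extract_embedded_metadata"
```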

run

run(doc_version, session=None)

Execute the job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_version | DocumentVersion | The DocumentVersion to extract content from. | required |
| session | Session | Session binding to write results to cloud storage. If not provided then results will be returned directly in the method call. | None |

Returns:

| Type | Description |
| --- | --- |
| Optional[Dict[str, Any]] | A dictionary of extracted metadata, if no Session binding is passed to the method. |

Note

For some file types (such as images and PDFs), a low-resolution thumbnail image is also generated and bound to the first sheet in the DocumentVersion. See the user guide on Getting & setting content with DocumentVersions.

The extracted metadata dictionary is also bound to the DocumentVersion.
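
The "write when bound, return otherwise" contract described for `run` can be sketched with stand-in objects. Everything here is hypothetical: `InMemorySession`, its `write` method and `run_job` exist only for this example, and returning `None` when a session is bound is an assumption inferred from the `Optional` return type.

```python
from typing import Any, Dict, Optional


class InMemorySession:
    """Hypothetical stand-in for a Session binding; keeps results in a dict."""

    def __init__(self) -> None:
        self.storage: Dict[str, Dict[str, Any]] = {}

    def write(self, key: str, results: Dict[str, Any]) -> None:
        self.storage[key] = results


def run_job(
    doc_id: str, session: Optional[InMemorySession] = None
) -> Optional[Dict[str, Any]]:
    """Mimic the documented contract of a job's run method (illustrative)."""
    results = {"file_type": "pdf", "pages": 3}  # placeholder metadata
    if session is not None:
        # Session bound: persist the results instead of returning them.
        session.write(doc_id, results)
        return None
    # No session: hand the results straight back to the caller.
    return results
```

Called without a session, `run_job("doc-1")` returns the dictionary. Called with one, the dictionary ends up in `session.storage` and the call returns `None`.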

Document2Uniclass

Bases: Job

Takes a DocumentVersion, analyses the content to generate a list of up to ten key document themes, then classifies each list entry to Uniclass.

run

run(doc_version)

Execute the job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_version | DocumentVersion | The document version to analyse. | required |

Returns:

| Type | Description |
| --- | --- |
| List[Dict[str, Any]] | A list of Uniclass classifications. For more details on the response schema, see classify_uniclass. |

To do

This job will not bind analysis results back to the session.

DetailedAnalysis

Bases: Job

Applies the session analysis specification to a DocumentVersion:

  • Classify: Categorise the DocumentVersion into predefined sets
  • Search: Question answering on the document content
  • Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)

Note

Passes the document content and the document summary in the classification prompt. This can improve accuracy and consistency versus the StandardAnalysis job, but will result in higher token consumption.

run

run(doc_version, session=None, classifiers=None, attributes=None, tags=None, prompts=None)

Execute the job.

If a Session binding is provided, the method writes results back to cloud storage. Otherwise, results are returned directly from the method call.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.

Warning

The job expects either a Session binding, so that the analysis specification can be accessed, or the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

If any specification parameters are passed in the method call, the job assumes those are the only steps to execute. For example, it will not combine 'classifiers' passed in the method call with 'attributes' and 'tags' loaded from the Session binding.
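
The precedence rule in this warning can be sketched as a small resolver. This is a minimal illustration, not the library's implementation: the function name `resolve_specification` and the plain-dict `session_spec` argument are assumptions standing in for the Session binding.

```python
from typing import Any, Dict, List, Optional


def resolve_specification(
    session_spec: Optional[Dict[str, Any]] = None,
    classifiers: Optional[List[Dict[str, Any]]] = None,
    attributes: Optional[List[Dict[str, Any]]] = None,
    tags: Optional[List[str]] = None,
) -> Optional[Dict[str, Any]]:
    """Pick the specification to run, or None if the job should exit."""
    passed = {"classifiers": classifiers, "attributes": attributes, "tags": tags}
    explicit = {name: value for name, value in passed.items() if value is not None}
    if explicit:
        # Anything passed in the call wins outright; the session is not consulted.
        return explicit
    if session_spec is not None:
        # Nothing passed in the call: take all three steps from the session.
        return {name: session_spec.get(name) for name in passed}
    # Neither source is available, so there is nothing to run.
    return None
```

With a session specification that defines all three steps, calling `resolve_specification(spec, tags=["HSE"])` yields only `{"tags": ["HSE"]}`: the session's classifiers and attributes are ignored for that call.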

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
| session | Session | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
| classifiers | List[Dict[str, Any]] | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. | None |
| attributes | List[Dict[str, Any]] | A list of attributes. Schema matches the Session binding 'attributes' attribute. | None |
| tags | List[str] | A list of tags. Schema matches the Session binding 'tags' attribute. | None |
| prompts | Dict[str, str] | A dictionary of prompts that will be passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute. | None |

Returns:

| Type | Description |
| --- | --- |
| Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |

RerankerClassification

Bases: Job

Applies the session classification specification to a DocumentVersion.

This job is specialized for very long or complex classification option sets. Unlike other classification jobs, it includes an interim step that queries for the classification as free text and then matches the answer to a classifier code using OpenAI chat and reranker models.

Unlike StandardAnalysis and DetailedAnalysis, this job does NOT analyse for search terms or tags in the specification.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.
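
The overwrite-and-retain behaviour can be illustrated with an ordinary dictionary update. This is a sketch under assumptions: `insert_results` is a hypothetical helper, the result keys are invented for the example, and the overwrite is assumed to apply to top-level keys.

```python
from typing import Any, Dict


def insert_results(existing: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `existing` with `new` inserted key by key."""
    merged = dict(existing)
    merged.update(new)  # same key: overwritten; other keys: retained
    return merged


# Invented example data: an earlier run stored a classification and tags.
previous = {"classification": {"discipline": "Structural"}, "tags": ["HSE"]}
latest = {"classification": {"discipline": "Civil"}}
merged = insert_results(previous, latest)
```

Here `merged` carries the new classification while the earlier `tags` entry is left untouched.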

Warning

The job expects either a Session binding, so that the analysis specification can be accessed, or the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

run

run(doc_version, session=None, classifiers=None, similarity_threshold=0.85)

Execute the job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
| session | Session | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
| classifiers | List | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. If a classifier doesn't have a 'query' attribute then the Job will default to basic classification using LLMs. | None |
| similarity_threshold | float | Any nearest neighbours below this similarity threshold will automatically be discounted. Increasing this threshold will reduce the likelihood of classification to an unsuitable category but may exclude some suitable categories from being selected. | 0.85 |

Returns:

| Type | Description |
| --- | --- |
| Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |
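
The effect of `similarity_threshold` can be shown with a simple filter over candidate matches. This is a minimal sketch: `filter_candidates`, the `(code, similarity)` tuple shape and the sample codes are assumptions for illustration, not the job's internal data structures.

```python
from typing import List, Tuple


def filter_candidates(
    candidates: List[Tuple[str, float]], similarity_threshold: float = 0.85
) -> List[Tuple[str, float]]:
    """Keep (code, similarity) pairs at or above the threshold, best first."""
    kept = [c for c in candidates if c[1] >= similarity_threshold]
    return sorted(kept, key=lambda c: c[1], reverse=True)


# Invented nearest-neighbour scores for three classifier codes.
neighbours = [("B-20", 0.86), ("A-10", 0.91), ("C-30", 0.80)]
default_matches = filter_candidates(neighbours)       # drops C-30
strict_matches = filter_candidates(neighbours, 0.90)  # drops B-20 as well
```

Raising the threshold from 0.85 to 0.90 removes the borderline candidate, which is the trade-off the parameter description refers to.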

StandardAnalysis

Bases: Job

Applies the session analysis specification to a DocumentVersion:

  • Classify: Categorise the DocumentVersion into predefined sets
  • Search: Question answering on the document content
  • Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)

Note

Passes only the document summary in the classification prompt. This reduces token consumption versus the DetailedAnalysis job, but can result in lower accuracy and consistency.

run

run(doc_version, metadata={}, session=None, classifiers=None, attributes=None, tags=None, prompts=None)

Execute the job.

If a Session binding is provided, the method writes results back to cloud storage. Otherwise, results are returned directly from the method call.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.

Warning

The job expects either a Session binding, so that the analysis specification can be accessed, or the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

If any specification parameters are passed in the method call, the job assumes those are the only steps to execute. For example, it will not combine 'classifiers' passed in the method call with 'attributes' and 'tags' loaded from the Session binding.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
| metadata | Dict[str, Any] | Allows metadata to be passed directly to the method, overriding metadata bound to the DocumentVersion. | {} |
| session | AzureBlobSession | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
| classifiers | List[Dict[str, Any]] | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. | None |
| attributes | List[Dict[str, Any]] | A list of attributes. Schema matches the Session binding 'attributes' attribute. | None |
| tags | List[str] | A list of tags. Schema matches the Session binding 'tags' attribute. | None |
| prompts | Dict[str, Any] | A dictionary of prompts that will be passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute. | None |

Returns:

| Type | Description |
| --- | --- |
| Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |

VectorizeCodePart

Bases: Job

A job that takes a classification category from an analysis specification and creates a vector embedding for each picklist item using the Azure OpenAI Ada model.

A DataFrame containing the original picklist items plus their embeddings is saved to local storage as a parquet file and returned directly in the method call.

run

run(standard_code_part, mode, save_directory='./storage', file_name='vectors')

Runs the vectorization job.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| standard_code_part | List[Dict[str, Any]] | A list of dictionaries containing 'code', 'description', 'prompt'. | required |
| mode | str | The mode of vectorization ('description', 'prompt', or 'both'). | required |
| save_directory | str | Local storage location to save the output parquet file. | './storage' |
| file_name | str | Root name of the output parquet file, without file extension. | 'vectors' |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The resulting DataFrame with embeddings. |
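
How `mode`, `save_directory` and `file_name` might interact can be sketched without calling an embedding model. The helpers `texts_to_embed` and `output_path` are hypothetical, and joining the two fields with a space for 'both' is an assumption; the job may combine or embed them differently.

```python
import os
from typing import Any, Dict, List


def texts_to_embed(standard_code_part: List[Dict[str, Any]], mode: str) -> List[str]:
    """Select the text that would be embedded for each picklist item."""
    if mode not in {"description", "prompt", "both"}:
        raise ValueError(f"Unsupported mode: {mode!r}")
    if mode == "both":
        # Assumption: 'both' joins the two fields into a single string.
        return [f"{item['description']} {item['prompt']}" for item in standard_code_part]
    return [item[mode] for item in standard_code_part]


def output_path(save_directory: str = "./storage", file_name: str = "vectors") -> str:
    """Build the parquet path from the directory and the extension-free root name."""
    return os.path.join(save_directory, f"{file_name}.parquet")


# Invented picklist entry with the three documented keys.
items = [{"code": "X1", "description": "Drawings", "prompt": "Is this a drawing?"}]
```

For the sample entry, `texts_to_embed(items, "both")` produces one combined string per item, and `output_path()` resolves to a `vectors.parquet` file inside the default storage directory.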