Jobs

Jobs are classes that encapsulate other Workbench functions and introduce in-built state management to write outputs back to storage.

BasicMetadataExtraction

Bases: Job

Applies basic transformations to common file types to prepare for analysis.

Images: downsample to reduce LLM token consumption.
PDFs: screenshot first page to provide LLM with awareness of page layout.
All other file types: extract embedded metadata.

run

run(doc_version, session=None)

Execute the job.

Parameters:

Name	Type	Description	Default
`doc_version`	`DocumentVersion`	The DocumentVersion to extract content from.	required
`session`	`Session`	Session binding to write results to cloud storage. If not provided then results will be returned directly in the method call.	`None`

Returns:

Type	Description
`Optional[Dict[str, Any]]`	A dictionary of extracted metadata, if no Session binding is passed to the method.

Note

For some file types (such as images and PDFs), a low-resolution thumbnail image is also generated and bound to the first sheet in the DocumentVersion. See the user guide on Getting & setting content with DocumentVersions.

The extracted metadata dictionary is also bound to the DocumentVersion.
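
The return contract described above (write to storage when a Session binding is supplied, otherwise return the results) can be sketched as follows. This is a minimal illustration only: `FakeSession`, `run_job` and the metadata keys are hypothetical stand-ins, not part of the Workbench API, and it assumes `None` is returned when a session is passed, as the `Optional` return type suggests.

```python
from typing import Any, Dict, Optional


class FakeSession:
    """Hypothetical stand-in for a Session binding; records what is written."""

    def __init__(self) -> None:
        self.written: Dict[str, Dict[str, Any]] = {}

    def write(self, key: str, payload: Dict[str, Any]) -> None:
        self.written[key] = payload


def run_job(doc_id: str, session: Optional[FakeSession] = None) -> Optional[Dict[str, Any]]:
    """Write results to the session if one is given, otherwise return them."""
    metadata = {"doc_id": doc_id, "file_type": "pdf"}  # placeholder extraction result
    if session is not None:
        session.write(doc_id, metadata)
        return None  # assumption: nothing is returned when a session is bound
    return metadata


direct = run_job("doc-1")                        # no session: results come back
session = FakeSession()
via_session = run_job("doc-2", session=session)  # session: results are stored
```

The same optional-session pattern applies to the other jobs on this page that accept a `session` argument.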

Document2Uniclass

Bases: Job

Takes a DocumentVersion, analyses the content to generate a list of up to ten key document themes, then classifies each list entry to Uniclass.

run

run(doc_version)

Execute the job.

Parameters:

Name	Type	Description	Default
`doc_version`	`DocumentVersion`	The DocumentVersion to analyse.	required

Returns:

Type	Description
`List[Dict[str, Any]]`	A list of Uniclass classifications. For more details on the response schema, see classify_uniclass.

To do

This job does not currently bind analysis results back to the session.

DetailedAnalysis

Bases: Job

Applies the session analysis specification to a DocumentVersion:

Classify: Categorise the DocumentVersion into predefined sets
Search: Question answering on the document content
Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)

Note

Passes the document content and the document summary in the classification prompt. This can improve accuracy and consistency versus the StandardAnalysis job, but will result in higher token consumption.

run

run(doc_version, session=None, classifiers=None, attributes=None, tags=None, prompts=None)

Execute the job.

If a Session binding is provided, the method writes results back to cloud storage. Otherwise, results are returned directly from the method call.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.
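
The overwrite behaviour described above resembles an in-place dictionary update. A minimal sketch, assuming the results dictionary behaves like a plain Python `dict`; the keys and values shown are hypothetical:

```python
from typing import Any, Dict

# Hypothetical results already bound to a DocumentVersion.
existing_results: Dict[str, Any] = {
    "classification": {"discipline": "Structural"},
    "summary": "Foundation design report.",
}

# Hypothetical results produced by a new job run.
new_results: Dict[str, Any] = {
    "classification": {"discipline": "Civil"},
    "tags": ["HSE"],
}

# Keys present in both are overwritten; all other existing keys are retained.
existing_results.update(new_results)
```

After the update, `classification` holds the new value, `summary` is untouched and `tags` has been added.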

Warning

The job expects a Session binding to be passed so the analysis specification can be accessed, OR for the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

If any specification parameter is passed in the method call, the job assumes those are the only steps that need to be executed and does not load the remaining parameters from the Session binding. For example, the method will not combine 'classifiers' from the method call with 'attributes' and 'tags' from the Session binding.

Parameters:

Name	Type	Description	Default
`doc_version`	`DocumentVersion`	The DocumentVersion to analyse.	required
`session`	`Session`	Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call.	`None`
`classifiers`	`List[Dict[str, Any]]`	A list of classifiers. Schema matches the Session binding 'classifiers' attribute.	`None`
`attributes`	`List[Dict[str, Any]]`	A list of attributes. Schema matches the Session binding 'attributes' attribute.	`None`
`tags`	`List[str]`	A list of tags. Schema matches the Session binding 'tags' attribute.	`None`
`prompts`	`Dict[str, str]`	A dictionary of prompts that will be passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute.	`None`

Returns:

Type	Description
`Optional[Dict[str, Any]]`	A dictionary of results, if no Session binding is passed to the method.
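
The precedence rules in the warning above can be sketched as a small resolver. This is an illustration under stated assumptions, not the job's actual implementation: `resolve_specification` is a hypothetical helper, and the specification is modelled as a plain dictionary.

```python
from typing import Any, Dict, List, Optional

Spec = Dict[str, Optional[List[Any]]]


def resolve_specification(
    session_spec: Optional[Spec],
    classifiers: Optional[List[Any]] = None,
    attributes: Optional[List[Any]] = None,
    tags: Optional[List[Any]] = None,
) -> Optional[Spec]:
    """Pick the analysis specification from the call OR the session, never both."""
    call_spec: Spec = {"classifiers": classifiers, "attributes": attributes, "tags": tags}
    if any(value is not None for value in call_spec.values()):
        # Method-call arguments win outright; nothing is merged from the session.
        return call_spec
    if session_spec is not None:
        return session_spec
    return None  # neither source is available, so the job would exit


session_spec: Spec = {
    "classifiers": [{"name": "discipline"}],
    "attributes": [{"name": "author"}],
    "tags": ["HSE"],
}
only_classifiers = resolve_specification(session_spec, classifiers=[{"name": "stage"}])
from_session = resolve_specification(session_spec)
nothing = resolve_specification(None)
```

In the first call only the classification step would run, because `attributes` and `tags` are not read from the session once any specification parameter is passed.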

RerankerClassification

Bases: Job

Applies the session classification specification to a DocumentVersion.

This job is specialized for very long or complex classification option sets. Unlike other classification jobs, it includes an interim step that runs a free-text query for the classification and then matches the answer to a classifier code using OpenAI chat and reranker models.

Unlike StandardAnalysis and DetailedAnalysis this job does NOT analyse for search terms or tags in the specification.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.

Warning

The job expects a Session binding to be passed so the analysis specification can be accessed, OR for the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

run

run(doc_version, session=None, classifiers=None, similarity_threshold=0.85)

Execute the job.

Parameters:

Name	Type	Description	Default
`doc_version`	`DocumentVersion`	The DocumentVersion to analyse.	required
`session`	`Session`	Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call.	`None`
`classifiers`	`List`	A list of classifiers. Schema matches the Session binding 'classifiers' attribute. If a classifier doesn't have a 'query' attribute then the job will default to basic classification using LLMs.	`None`
`similarity_threshold`	`float`	Nearest neighbours below this similarity threshold are automatically discounted. Increasing the threshold reduces the likelihood of classification to an unsuitable category but may exclude some suitable categories from being selected.	`0.85`

Returns:

Type	Description
`Optional[Dict[str, Any]]`	A dictionary of results, if no Session binding is passed to the method.
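
The effect of `similarity_threshold` can be illustrated with a small filter over candidate embeddings. This is a sketch only: it assumes cosine similarity as the measure, and `filter_candidates`, the candidate codes and the two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_candidates(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    similarity_threshold: float = 0.85,
) -> List[Tuple[str, float]]:
    """Discard candidates below the threshold and sort the rest, best first."""
    scored = [(code, cosine_similarity(query, vec)) for code, vec in candidates.items()]
    kept = [(code, score) for code, score in scored if score >= similarity_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


query_vector = [1.0, 0.0]
candidate_vectors = {"A": [1.0, 0.1], "B": [1.0, 1.0], "C": [0.0, 1.0]}

strict = filter_candidates(query_vector, candidate_vectors)                            # default 0.85
relaxed = filter_candidates(query_vector, candidate_vectors, similarity_threshold=0.7)
```

With the default threshold only the closest candidate survives; lowering it to 0.7 admits a second, weaker match, which shows the trade-off described for the parameter.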

StandardAnalysis

Bases: Job

Applies the session analysis specification to a DocumentVersion:

Classify: Categorise the DocumentVersion into predefined sets
Search: Question answering on the document content
Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)

Note

Passes only the document summary in the classification prompt. This reduces token consumption versus the DetailedAnalysis job, but can result in lower accuracy and consistency.

run

run(doc_version, metadata={}, session=None, classifiers=None, attributes=None, tags=None, prompts=None)

Execute the job.

If a Session binding is provided, the method writes results back to cloud storage. Otherwise, results are returned directly from the method call.

The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.

Warning

The job expects a Session binding to be passed so the analysis specification can be accessed, OR for the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.

If any specification parameter is passed in the method call, the job assumes those are the only steps that need to be executed and does not load the remaining parameters from the Session binding. For example, the method will not combine 'classifiers' from the method call with 'attributes' and 'tags' from the Session binding.

Parameters:

Name	Type	Description	Default
`doc_version`	`DocumentVersion`	The DocumentVersion to analyse.	required
`metadata`	`Dict[str, Any]`	Allows metadata to be passed directly to the method, overriding metadata bound to the `DocumentVersion`.	`{}`
`session`	`AzureBlobSession`	Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call.	`None`
`classifiers`	`List[Dict[str, Any]]`	A list of classifiers. Schema matches the Session binding 'classifiers' attribute.	`None`
`attributes`	`List[Dict[str, Any]]`	A list of attributes. Schema matches the Session binding 'attributes' attribute.	`None`
`tags`	`List[str]`	A list of tags. Schema matches the Session binding 'tags' attribute.	`None`
`prompts`	`Dict[str, Any]`	A dictionary of prompts that will be passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute.	`None`

Returns:

Type	Description
`Optional[Dict[str, Any]]`	A dictionary of results, if no Session binding is passed to the method.

VectorizeCodePart

Bases: Job

A job that takes a classification category from an analysis specification and creates a vector embedding for each picklist item using the Azure OpenAI Ada model.

A dataframe containing the original picklist items plus their embeddings is saved to local storage as a parquet file and returned directly in the method call.

run

run(standard_code_part, mode, save_directory='./storage', file_name='vectors')

Runs the vectorization job.

Parameters:

Name	Type	Description	Default
`standard_code_part`	`List[Dict[str, Any]]`	A list of dictionaries containing 'code', 'description', 'prompt'.	required
`mode`	`str`	The mode of vectorization ('description', 'prompt', or 'both').	required
`save_directory`	`str`	Local storage location to save the output parquet file.	`'./storage'`
`file_name`	`str`	Root name of the output parquet file, without file extension.	`'vectors'`

Returns:

Type	Description
`DataFrame`	The resulting DataFrame with embeddings.
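
The sketch below shows one plausible reading of the `mode` and file-naming parameters. It is an assumption-laden illustration, not the job's actual logic: `text_to_embed` and `output_path` are hypothetical helpers, and treating `'both'` as a concatenation of description and prompt is a guess at the behaviour.

```python
from pathlib import Path
from typing import Any, Dict


def text_to_embed(item: Dict[str, Any], mode: str) -> str:
    """Choose which picklist fields form the text sent for embedding."""
    if mode == "description":
        return item["description"]
    if mode == "prompt":
        return item["prompt"]
    if mode == "both":
        # Assumption: 'both' concatenates the two fields into one string.
        return f"{item['description']} {item['prompt']}"
    raise ValueError(f"Unsupported mode: {mode!r}")


def output_path(save_directory: str = "./storage", file_name: str = "vectors") -> Path:
    """Build the parquet path; file_name is given without an extension."""
    return Path(save_directory) / f"{file_name}.parquet"


# Hypothetical picklist entry with the three documented keys.
item = {"code": "Ss_20", "description": "Structural systems", "prompt": "Frames and walls"}

texts = {mode: text_to_embed(item, mode) for mode in ("description", "prompt", "both")}
path = output_path()
```

Validating `mode` up front, as the final branch does, avoids paying for embedding calls on a misconfigured run.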