Jobs
Jobs are classes that encapsulate other Workbench functions and introduce in-built state management to write outputs back to storage.
BasicMetadataExtraction
Bases: Job
Applies basic transformations to common file types to prepare for analysis.
- Images: downsample to reduce LLM token consumption.
- PDFs: screenshot the first page to give the LLM awareness of the page layout.
- All other file types: extract embedded metadata.
run
run(doc_version, session=None)
Execute the job.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_version | DocumentVersion | The DocumentVersion to extract content from. | required |
session | Session | Session binding to write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | A dictionary of extracted metadata, if no Session binding is passed to the method. |
Note
For some file types (like images and PDFs) a low-resolution thumbnail image is also generated and bound to the first sheet in the DocumentVersion. See user guide on Getting & setting content with DocumentVersions.
The extracted metadata dictionary is also bound to the DocumentVersion.
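Example
A minimal usage sketch, assuming the import path shown and that `doc_version` and `session` have already been created elsewhere; adjust both to match your installation.

```python
from workbench.jobs import BasicMetadataExtraction  # assumed import path

job = BasicMetadataExtraction()

# With a Session binding, results are written back to cloud storage.
job.run(doc_version, session=session)

# Without a Session binding, the extracted metadata dictionary is returned directly.
metadata = job.run(doc_version)
print(metadata)
```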
Document2Uniclass
Bases: Job
Takes a DocumentVersion, analyses the content to generate a list of up to ten key document themes, then classifies each list entry to Uniclass.
run
run(doc_version)
Execute the job.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
Returns:
Type | Description |
---|---|
List[Dict[str, Any]] | A list of Uniclass classifications. For more details on the response schema, see classify_uniclass. |
To do
This job will not bind analysis results back to the session.
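Example
A minimal usage sketch, assuming the import path shown and an existing `doc_version`; results are returned directly and are not bound back to the session.

```python
from workbench.jobs import Document2Uniclass  # assumed import path

job = Document2Uniclass()

# Returns a list of Uniclass classifications for the document's key themes.
classifications = job.run(doc_version)
for entry in classifications:
    print(entry)
```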
DetailedAnalysis
Bases: Job
Applies the session analysis specification to a DocumentVersion:
- Classify: Categorise the DocumentVersion into predefined sets
- Search: Question answering on the document content
- Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)
Note
Passes the document content and the document summary in the classification prompt.
This can improve accuracy and consistency versus the StandardAnalysis job, but will result in higher token consumption.
run
run(doc_version, session=None, classifiers=None, attributes=None, tags=None, prompts=None)
Execute the job.
If a Session binding is provided then the method will write results back to cloud storage. Otherwise, results will be returned directly to the caller.
The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.
Warning
The job expects either a Session binding to be passed so the analysis specification can be accessed, OR the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.
If any specification parameters are passed in the method call then the job assumes these define the complete analysis specification. For example, the method will not load 'classifiers' from the method call and 'attributes' and 'tags' from the Session binding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
session | Session | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
classifiers | List[Dict[str, Any]] | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. | None |
attributes | List[Dict[str, Any]] | A list of attributes. Schema matches the Session binding 'attributes' attribute. | None |
tags | List[str] | A list of tags. Schema matches the Session binding 'tags' attribute. | None |
prompts | Dict[str, str] | A dictionary of prompts passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute. | None |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |
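Example
A minimal usage sketch, assuming the import path shown and existing `doc_version` and `session` objects; the classifier dictionary is illustrative only and should follow your Session 'classifiers' schema.

```python
from workbench.jobs import DetailedAnalysis  # assumed import path

job = DetailedAnalysis()

# Option 1: read the full analysis specification from the Session binding and
# write results back to cloud storage.
job.run(doc_version, session=session)

# Option 2: pass part of the specification directly. Only the steps passed here
# are executed; 'attributes' and 'tags' are NOT loaded from the Session binding.
results = job.run(
    doc_version,
    classifiers=[{"name": "discipline", "options": ["Civil", "Electrical"]}],  # illustrative schema
)
```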
RerankerClassification
Bases: Job
Applies the session classification specification to a DocumentVersion.
This job is specialised for very long or complex classification option sets. Unlike other classification jobs, it includes an interim step that generates a free-text query for the classification and then matches it to a classifier code using OpenAI chat and reranker models.
Unlike StandardAnalysis and DetailedAnalysis, this job does NOT analyse for search terms or tags in the specification.
The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.
Warning
The job expects either a Session binding to be passed so the analysis specification can be accessed, OR the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.
run
run(doc_version, session=None, classifiers=None, similarity_threshold=0.85)
Execute the job.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
session | Session | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
classifiers | List | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. If a classifier doesn't have a 'query' attribute then the Job will default to basic classification using LLMs. | None |
similarity_threshold | float | Any nearest neighbours below this similarity threshold will automatically be discounted. Increasing this threshold will reduce the likelihood of classification to an unsuitable category but may exclude some suitable categories from being selected. | 0.85 |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |
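Example
A minimal usage sketch, assuming the import path shown and existing `doc_version` and `session` objects.

```python
from workbench.jobs import RerankerClassification  # assumed import path

job = RerankerClassification()

# Use the classification specification from the Session binding, with a stricter
# similarity threshold to reduce matches to unsuitable categories.
job.run(doc_version, session=session, similarity_threshold=0.9)

# Without a Session binding, pass classifiers directly (built to match the
# Session 'classifiers' schema) and receive the results dictionary back.
results = job.run(doc_version, classifiers=my_classifiers)  # my_classifiers assumed to exist
```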
StandardAnalysis
Bases: Job
Applies the session analysis specification to a DocumentVersion:
- Classify: Categorise the DocumentVersion into predefined sets
- Search: Question answering on the document content
- Tag: Apply one or more labels to flag key concepts (e.g. HSE, COSHH)
Note
Passes only the document summary in the classification prompt.
This reduces token consumption versus the DetailedAnalysis job, but can result in lower accuracy and consistency.
run
run(doc_version, metadata={}, session=None, classifiers=None, attributes=None, tags=None, prompts=None)
Execute the job.
If a Session binding is provided then the method will write results back to cloud storage. Otherwise, results will be returned directly to the caller.
The job will insert results into the DocumentVersion results dictionary, meaning any previous results at the same key will be overwritten. All other results will be retained.
Warning
The job expects either a Session binding to be passed so the analysis specification can be accessed, OR the analysis specification to be passed directly in the method call. If neither is provided then the job will exit.
If any specification parameters are passed in the method call then the job assumes these define the complete analysis specification. For example, the method will not load 'classifiers' from the method call and 'attributes' and 'tags' from the Session binding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_version | DocumentVersion | The DocumentVersion to analyse. | required |
metadata | Dict[str, Any] | Allows metadata to be passed directly to the method, overriding metadata bound to the DocumentVersion. | {} |
session | AzureBlobSession | Session binding to read the analysis specification and write results to cloud storage. If not provided then results will be returned directly in the method call. | None |
classifiers | List[Dict[str, Any]] | A list of classifiers. Schema matches the Session binding 'classifiers' attribute. | None |
attributes | List[Dict[str, Any]] | A list of attributes. Schema matches the Session binding 'attributes' attribute. | None |
tags | List[str] | A list of tags. Schema matches the Session binding 'tags' attribute. | None |
prompts | Dict[str, Any] | A dictionary of prompts passed as overrides to each of the analysis steps. Schema matches the Session binding 'prompts' attribute. | None |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | A dictionary of results, if no Session binding is passed to the method. |
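Example
A minimal usage sketch, assuming the import path shown and existing `doc_version` and `session` objects; the metadata dictionary is illustrative.

```python
from workbench.jobs import StandardAnalysis  # assumed import path

job = StandardAnalysis()

# Run against the Session analysis specification, overriding the metadata bound
# to the DocumentVersion with values supplied at call time.
job.run(doc_version, metadata={"title": "Pump datasheet"}, session=session)

# Without a Session binding, pass part of the specification directly; only the
# passed steps are executed and the results dictionary is returned.
results = job.run(doc_version, tags=["HSE"])
```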
VectorizeCodePart
Bases: Job
A job that takes a classification category from an analysis specification and creates a vector embedding for each picklist item using the Azure OpenAI Ada model.
A dataframe containing the original picklist items plus their embeddings is saved to local storage as a parquet file and returned directly in the method call.
run
run(standard_code_part, mode, save_directory='./storage', file_name='vectors')
Runs the vectorization job.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
standard_code_part | List[Dict[str, Any]] | A list of dictionaries containing 'code', 'description', 'prompt'. | required |
mode | str | The mode of vectorization ('description', 'prompt', or 'both'). | required |
save_directory | str | Local storage location to save the output parquet file. | './storage' |
file_name | str | Root name of the output parquet file, without file extension. | 'vectors' |
Returns:
Type | Description |
---|---|
DataFrame | The resulting DataFrame with embeddings. |
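Example
A minimal usage sketch, assuming the import path shown; the picklist entries are illustrative and should match your analysis specification.

```python
from workbench.jobs import VectorizeCodePart  # assumed import path

picklist = [
    {"code": "A1", "description": "Drawings", "prompt": "Technical drawings and plans"},
    {"code": "A2", "description": "Specifications", "prompt": "Technical specification documents"},
]

job = VectorizeCodePart()

# Embeds each picklist item and writes a parquet file (e.g. ./storage/vectors.parquet),
# returning the DataFrame with embeddings directly.
df = job.run(picklist, mode="description", save_directory="./storage", file_name="vectors")
print(df.head())
```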