Data
azure_doc_intel_read
azure_doc_intel_read(blob, credential=AzureKeyCredential(os.environ['AZURE_AI_API_KEY']), endpoint=os.environ['AZURE_AI_ENDPOINT'], max_pages=10, max_retries=3, initial_delay=2.0, chunk_pages=False, clean_markdown=True)
Reads and analyzes text from a document using Azure's prebuilt layout model.
Uses Azure Document Intelligence and retries transient errors with exponential backoff for robust document text extraction. Supports various document formats, including PDF and images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| blob | bytes | The document content to be analyzed (e.g., PDF or image bytes). | required |
| credential | AzureKeyCredential | Azure authentication credential. Defaults to a credential created from the AZURE_AI_API_KEY environment variable. | AzureKeyCredential(environ['AZURE_AI_API_KEY']) |
| endpoint | str | Azure AI endpoint URL. Defaults to the value of the AZURE_AI_ENDPOINT environment variable. | environ['AZURE_AI_ENDPOINT'] |
| max_pages | int | Maximum number of document pages to extract. | 10 |
| max_retries | int | Maximum number of retries for transient errors. | 3 |
| initial_delay | float | Initial delay in seconds for exponential backoff. | 2.0 |
| chunk_pages | bool | If False, returns the content as a single string; if True, returns a dictionary with the page as key and its content as value. | False |
| clean_markdown | bool | Determines whether to convert HTML content (e.g. tables) to pure markdown. | True |
Returns:
| Type | Description |
|---|---|
| str \| dict | The extracted document content as markdown text: a single string, or a dictionary keyed by page when chunk_pages is True. |
Raises:
| Type | Description |
|---|---|
| ServiceRequestError | When Azure service request fails after all retries. |
| ServiceResponseError | When Azure service response is invalid after all retries. |
| HttpResponseError | When HTTP request fails after all retries. |
Examples:
>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> markdown_text = azure_doc_intel_read(content, max_pages=5)
>>> print(markdown_text)
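The retry behaviour controlled by max_retries and initial_delay can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `call_with_backoff`, `TransientError`, the injectable `sleep` argument, and the doubling factor are all hypothetical names and choices made for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable service error."""


def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with doubling delays."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= 2  # assumed growth factor of 2
    raise AssertionError("unreachable when max_retries >= 0")


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
delays: List[float] = []
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # prints: ok [2.0, 4.0]
```

Passing the sleep function as an argument keeps the sketch testable without real waiting.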
extract_rows_from_bill_of_quantities
extract_rows_from_bill_of_quantities(content, model=os.environ['AZURE_OPENAI_DEPLOYMENT'], max_workers=None, document_name=None)
Extracts quantity and cost estimates from a bill of quantities file in PDF or Office (Excel, Word) formats.
Maps all line items into a general-purpose, normalised schema, retaining page numbering and other metadata for downstream processing.
This function is optimized for performance using a two-phase approach:
1. Serial phase: Sequentially processes all pages to build context and identify pages containing estimate tables.
2. Parallel phase: Concurrently extracts tables from the identified pages using ThreadPoolExecutor for significant speedup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str | The extracted file content. Expects pages to be delimited by | required |
| model | str | Model deployment name within your Azure resource. If not provided, defaults to the AZURE_OPENAI_DEPLOYMENT environment variable. | environ['AZURE_OPENAI_DEPLOYMENT'] |
| max_workers | int \| None | Maximum number of worker threads for parallel table extraction. If None, defaults to min(32, (cpu_count or 1) + 4). | None |
| document_name | str \| None | Optional document filename to help identify asset types. If provided, will be included in Phase 1 AI analysis. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | The extracted tables as a single Pandas DataFrame. |
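The two-phase approach can be illustrated with a small sketch. It assumes nothing about the real implementation beyond what is stated above: `looks_like_estimate_table` and `extract_table` are hypothetical stand-ins for the AI-backed steps, and rows are plain dictionaries rather than a DataFrame.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple


def looks_like_estimate_table(page: str, context: List[str]) -> bool:
    """Hypothetical phase 1 check; `context` holds the pages seen so far."""
    return "|" in page


def extract_table(page_number: int, page: str) -> List[Dict[str, object]]:
    """Hypothetical phase 2 extractor: one row per pipe-delimited line."""
    return [
        {
            "page": page_number,
            "cells": [cell.strip() for cell in line.strip().strip("|").split("|")],
        }
        for line in page.splitlines()
        if "|" in line
    ]


def two_phase_extract(
    pages: List[str], max_workers: Optional[int] = None
) -> List[Dict[str, object]]:
    # Phase 1 (serial): walk pages in order so each decision sees earlier pages.
    context: List[str] = []
    selected: List[Tuple[int, str]] = []
    for number, page in enumerate(pages, start=1):
        if looks_like_estimate_table(page, context):
            selected.append((number, page))
        context.append(page)

    # Phase 2 (parallel): extract the selected pages concurrently.
    if max_workers is None:
        max_workers = min(32, (os.cpu_count() or 1) + 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_page = pool.map(lambda item: extract_table(*item), selected)
        # pool.map yields results in input order, so page order is preserved.
        return [row for rows in per_page for row in rows]
```

Threads suit this shape of work when phase 2 is dominated by waiting on network calls, which is an assumption about the real workload.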
extract_rows_from_invoice
extract_rows_from_invoice(content, model=os.environ['AZURE_OPENAI_DEPLOYMENT'], max_workers=None)
Extracts line items from HTML-formatted invoice content. Preserves existing table headers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str | The extracted file content. Expects pages to be delimited by | required |
| model | str | Model deployment name within your Azure resource. If not provided, defaults to the AZURE_OPENAI_DEPLOYMENT environment variable. | environ['AZURE_OPENAI_DEPLOYMENT'] |
| max_workers | int \| None | Maximum number of worker threads for parallel table extraction. If None, defaults to min(32, (cpu_count or 1) + 4). | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | The extracted rows as a single Pandas DataFrame. |
extract_text_from_doc
extract_text_from_doc(doc_bytes)
Extract text content from a Microsoft Word 97-2003 DOC file.
Extracts text from legacy Word documents using OLE file parsing. Handles encoding issues and cleans up control characters while preserving line breaks and document structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The DOC file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | Extracted text content with cleaned formatting, or None if extraction fails or file is invalid. |
Raises:
| Type | Description |
|---|---|
| OleFileError | If the DOC file is not a valid OLE file. |
| UnicodeDecodeError | If text encoding cannot be determined. |
Examples:
>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_doc(content)
>>> if text:
...     print(text)
extract_text_from_docm
extract_text_from_docm(doc_bytes)
Extract text content from a Microsoft Word DOCM file.
Extracts text from a macro-enabled Word document by parsing the internal XML structure and removing XML tags to obtain plain text content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The DOCM file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted plain text content with XML tags removed and normalized whitespace. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the DOCM file is corrupted or not a valid ZIP archive. |
| KeyError | If the required word/document.xml file is not found. |
| UnicodeDecodeError | If the XML content encoding is not supported. |
Examples:
>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docm(content)
>>> print(text)
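The tag-stripping approach described above can be sketched with the standard library alone. This is an illustration, not the function's actual code: `docm_plain_text` is a hypothetical name, and UTF-8 decoding plus the handling of paragraph ends are assumptions.

```python
import io
import re
import zipfile
from html import unescape


def docm_plain_text(doc_bytes: bytes) -> str:
    """Read word/document.xml from the archive and strip its XML tags."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # archive.read raises KeyError if the main document part is missing.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Turn paragraph ends into spaces so adjacent paragraphs do not fuse.
    xml = xml.replace("</w:p>", " ")
    text = re.sub(r"<[^>]+>", "", xml)  # drop every remaining tag
    text = unescape(text)  # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()  # normalise whitespace
```

A regular expression is enough here only because the goal is plain text; it makes no attempt to preserve tables or document structure.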
extract_text_from_docx
extract_text_from_docx(docx_bytes)
Extract text content from a Microsoft Word DOCX file.
Extracts text from all paragraphs and tables in the document, preserving the document structure and table formatting. Tables are converted to pipe-separated format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| docx_bytes | bytes | The DOCX file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content with paragraphs and tables separated by double newlines. Table cells are separated by pipe characters. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the DOCX file is corrupted or not a valid ZIP archive. |
| ValueError | If the file is not a valid DOCX format. |
Examples:
>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docx(content)
>>> print(text)
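To make the output shape concrete, the sketch below walks the document body in order and renders paragraphs and tables as described. It is a standard-library approximation, not the function's implementation, which may rely on a dedicated DOCX library; `docx_blocks` and the ` | ` cell separator are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element: ET.Element) -> str:
    """Concatenate every w:t text run below `element`."""
    return "".join(node.text or "" for node in element.iter(f"{W}t"))


def docx_blocks(docx_bytes: bytes) -> str:
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{W}body")
    blocks = []
    for child in list(body) if body is not None else []:
        if child.tag == f"{W}p":  # top-level paragraph
            blocks.append(_text(child))
        elif child.tag == f"{W}tbl":  # top-level table: one line per row
            rows = [
                " | ".join(_text(cell) for cell in row.findall(f"{W}tc"))
                for row in child.findall(f"{W}tr")
            ]
            blocks.append("\n".join(rows))
    # Separate non-empty blocks with a blank line.
    return "\n\n".join(block for block in blocks if block)
```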
extract_text_from_hpg
extract_text_from_hpg(doc_bytes)
Convert HPGL/HPG plotter file to readable text.
Converts Hewlett-Packard Graphics Language (HPGL) files to text by first converting them to PDF format and then using Azure Document Intelligence for OCR text extraction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The HPGL/HPG file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content from the plotter file via OCR. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the HPGL file format is not supported. |
| ServiceRequestError | If Azure Document Intelligence service fails. |
Examples:
>>> with open('drawing.hpg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_hpg(content)
>>> print(text)
extract_text_from_mcdx
extract_text_from_mcdx(doc_bytes)
Extract text content from a Mathcad Prime MCDX file.
Processes an MCDX file by extracting text from all XAMLPackage files contained within the mathcad/xaml/ directory. Each XAMLPackage is processed individually and their text content is combined.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The MCDX file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Combined text content from all XAMLPackages, separated by carriage return and newline characters. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the MCDX file is corrupted or not a valid ZIP archive. |
| KeyError | If required XAMLPackage files are not found in the archive. |
| XMLSyntaxError | If XAMLPackage XML content is invalid. |
Examples:
>>> with open('calculation.mcdx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_mcdx(content)
>>> print(text)
extract_text_from_msg
extract_text_from_msg(doc_bytes)
Convert a Microsoft Outlook MSG file to readable text.
Extracts email content including subject, sender, recipient, date, and body from a Microsoft Outlook MSG file format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The MSG file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Formatted email content with headers and body text. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the file is not a valid MSG format. |
| AttributeError | If required email properties are missing. |
Examples:
>>> with open('email.msg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_msg(content)
>>> print(text)
Subject: Meeting Tomorrow
From: john@example.com
To: jane@example.com
Date: 2024-01-15
Body: Let's meet at 2pm...
extract_text_from_odt
extract_text_from_odt(doc_bytes)
Extract text content from an OpenDocument Text (ODT) file.
Processes an ODT document by extracting text from all paragraph elements, preserving the document structure and formatting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The ODT file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content with paragraphs separated by newlines. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the ODT file is corrupted or not a valid ZIP archive. |
| XMLSyntaxError | If the ODT document contains invalid XML. |
Examples:
>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_odt(content)
>>> print(text)
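An ODT file is a ZIP archive whose main content lives in content.xml, with paragraphs stored as text:p elements. The sketch below shows paragraph extraction with the standard library only; `odt_paragraphs` is a hypothetical name and the real function may use a different XML parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument "text" namespace, which holds the paragraph element text:p.
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def odt_paragraphs(doc_bytes: bytes) -> str:
    """Join the text of every text:p element in content.xml with newlines."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() also collects text nested in child elements such as text:span.
    paragraphs = ("".join(p.itertext()) for p in root.iter(f"{TEXT_NS}p"))
    return "\n".join(paragraphs)
```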
extract_text_from_pptx
extract_text_from_pptx(pptx_bytes)
Extract text content from PowerPoint PPTX/PPSX/PPTM files.
Extracts text from all slides in the presentation by parsing the XML structure directly. Handles text elements and tables, preserving slide organization and table structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pptx_bytes | bytes | The PowerPoint file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content with slide separators and table formatting. Tables are converted to pipe-separated format. Returns error message if extraction fails. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the PowerPoint file is corrupted or not a valid ZIP archive. |
| XMLSyntaxError | If slide XML content is invalid. |
Note
Works with PowerPoint XML formats (PPTX, PPSX, PPTM) but not older binary PowerPoint 1997-2003 documents (PPT).
Examples:
>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_pptx(content)
>>> print(text)
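Parsing the slide XML directly can look roughly like the sketch below. It covers text runs only (no table handling), orders slides by the number in their file name rather than by the presentation's slide list, and uses a made-up separator line; `pptx_slide_text` and that separator are assumptions, not the function's actual output format.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text runs are stored in a:t elements.
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_slide_text(pptx_bytes: bytes) -> str:
    sections = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        # Sort numerically so slide10 comes after slide2.
        numbered = sorted(
            (int(match.group(1)), name)
            for name in archive.namelist()
            if (match := SLIDE.match(name))
        )
        for number, name in numbered:
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(f"{A}t") if node.text]
            # Hypothetical slide separator for illustration.
            sections.append(f"--- Slide {number} ---\n" + "\n".join(runs))
    return "\n\n".join(sections)
```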
extract_text_from_rtf
extract_text_from_rtf(doc_bytes)
Convert RTF formatted bytes to plain text.
Converts Rich Text Format (RTF) documents to plain text by decoding the bytes and using RTF parsing libraries to extract readable content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | RTF formatted data as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Plain text content extracted from RTF with formatting removed. |
Raises:
| Type | Description |
|---|---|
| UnicodeDecodeError | If RTF content cannot be decoded properly. |
Examples:
>>> with open('document.rtf', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_rtf(content)
>>> print(text)
extract_text_from_xer
extract_text_from_xer(doc_bytes)
Extract task information from a Primavera P6 XER schedule file.
Parses an XER file and extracts tasks grouped by WBS (Work Breakdown Structure), returning the data as markdown tables. Only includes relevant task information, stripping out unnecessary metadata to provide a clean, readable output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The XER file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Markdown-formatted text with tasks organized by WBS hierarchy. Each WBS section contains a table with task details; the example output lists Activity ID, Activity Name, Start, Finish, Duration, Status and % Complete. |
Raises:
| Type | Description |
|---|---|
| CorruptXerFile | If the XER file is corrupted or missing required tables. |
| UnicodeDecodeError | If the file encoding cannot be determined. |
Examples:
>>> with open('schedule.xer', 'rb') as f:
...     content = f.read()
>>> markdown = extract_text_from_xer(content)
>>> print(markdown)
## Project: Construction Schedule
1.0 Site Preparation
| Activity ID | Activity Name | Start | Finish | Duration | Status | % Complete |
|---|---|---|---|---|---|---|
| A1000 | Mobilization | 2024-01-15 | 2024-01-20 | 5 | Complete | 100 |
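For readers unfamiliar with the format, an XER export is tab-delimited text in which `%T` starts a table, `%F` lists its field names, and `%R` carries one row. The sketch below parses those records and groups tasks by WBS name. It illustrates the idea only: `parse_xer_tables` and `tasks_by_wbs` are hypothetical helpers, the two-column output is a simplification, and the real function may use a dedicated XER library.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]


def parse_xer_tables(text: str) -> Dict[str, List[Row]]:
    """Parse %T/%F/%R records into {table name: [row dicts]}."""
    tables: Dict[str, List[Row]] = defaultdict(list)
    current, fields = None, []
    for line in text.splitlines():
        parts = line.split("\t")
        if parts[0] == "%T":  # new table
            current, fields = parts[1], []
        elif parts[0] == "%F":  # field names for the current table
            fields = parts[1:]
        elif parts[0] == "%R" and current:  # data row
            tables[current].append(dict(zip(fields, parts[1:])))
    return dict(tables)


def tasks_by_wbs(tables: Dict[str, List[Row]]) -> Dict[str, List[str]]:
    """Group TASK rows under their PROJWBS name as markdown table rows."""
    names = {row["wbs_id"]: row["wbs_name"] for row in tables.get("PROJWBS", [])}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for task in tables.get("TASK", []):
        section = names.get(task["wbs_id"], "Unknown WBS")
        grouped[section].append(f"| {task['task_code']} | {task['task_name']} |")
    return dict(grouped)
```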
extract_text_from_xls
extract_text_from_xls(doc_bytes, clean_markdown=True)
Extract text content from an XLSX/XLSM spreadsheet file.
Extracts text from all worksheets in the spreadsheet, preserving the sheet structure and cell organization. Each sheet is processed individually with sheet names as headers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_bytes | bytes | The XLSX/XLSM file content as bytes. | required |
| clean_markdown | bool | If True, returns pipe-separated text format. If False, returns HTML tables. Defaults to True. | True |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content. Format depends on the clean_markdown parameter: pipe-separated text if True, HTML tables if False. |
Raises:
| Type | Description |
|---|---|
| BadZipFile | If the file is not a valid ZIP archive. |
| ValueError | If the file is not a valid Excel format. |
Examples:
>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xls(content) # Markdown format
>>> html = extract_text_from_xls(content, clean_markdown=False) # HTML format
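The effect of clean_markdown on a single sheet can be pictured with the sketch below. It starts from rows that have already been read and only shows the two rendering modes; `render_sheet`, the heading markup, and the exact separators are assumptions rather than the function's real output.

```python
from html import escape
from typing import List


def render_sheet(name: str, rows: List[List[object]], clean_markdown: bool = True) -> str:
    """Render one sheet as pipe-separated text or as an HTML table."""
    cells = [["" if value is None else str(value) for value in row] for row in rows]
    if clean_markdown:
        body = "\n".join(" | ".join(row) for row in cells)
        return f"{name}\n{body}"  # sheet name as a header line
    html_rows = "".join(
        "<tr>" + "".join(f"<td>{escape(value)}</td>" for value in row) + "</tr>"
        for row in cells
    )
    return f"<h2>{escape(name)}</h2><table>{html_rows}</table>"
```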
extract_text_from_xmcd
extract_text_from_xmcd(xmcd_bytes)
Extract and process text from a Mathcad XMCD file.
Processes an XMCD file by parsing its XML content and extracting text elements, filtering out empty lines and limiting output to the first 100 text elements.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| xmcd_bytes | bytes | The XMCD file content as bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Extracted text content separated by carriage return and newline characters. |
Raises:
| Type | Description |
|---|---|
| XMLSyntaxError | If the XMCD file contains invalid XML. |
| UnicodeDecodeError | If the file encoding is not supported. |
Examples:
>>> with open('calculation.xmcd', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xmcd(content)
>>> print(text)
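The filtering and 100-element limit can be sketched as follows. This version treats every non-empty text node in the XML as a "text element", which is an assumption; the real function may select specific elements. `xmcd_text` and its `limit` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def xmcd_text(xmcd_bytes: bytes, limit: int = 100) -> str:
    """Collect non-empty text nodes, keep the first `limit`, join with CRLF."""
    root = ET.fromstring(xmcd_bytes)
    # Strip each text node and drop those that are empty after stripping.
    lines = [text.strip() for text in root.itertext() if text.strip()]
    return "\r\n".join(lines[:limit])
```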
general_purpose_read
general_purpose_read(blob, filetype, chunk_pages=False, num_pages=10, clean_markdown=True)
Extract text content from various file formats using appropriate parsers.
Analyzes the file type and routes to the most appropriate extraction method. Supports a wide range of document formats including Office documents, PDFs, images, and specialized formats like Mathcad and CAD files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| blob | bytes \| str | The document content as bytes, or string for plain text. | required |
| filetype | str | The file extension indicating the document type. | required |
| chunk_pages | bool | If True, returns a dict with page numbers as keys (only supported for certain formats). If False, returns the concatenated text as a single string (default behavior). | False |
| num_pages | int | Applies to PDF only. Number of pages to extract from the document. Setting to 0 will extract all pages. Default is 10. | 10 |
| clean_markdown | bool | Applies to PDF only. Determines whether to convert HTML content (e.g. tables) to pure markdown. | True |
Returns: Extracted text content from the document. Returns empty string for compressed archives or if extraction fails.
Raises:
| Type | Description |
|---|---|
| Exception | Various exceptions depending on the file type and extraction method used. Logs errors and re-raises the exception. |
Examples:
>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> text = general_purpose_read(content, 'pdf')
>>> print(text)
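Routing by file type is commonly done with a lookup table from extension to handler. The sketch below shows that pattern under stated assumptions: the handlers, the `HANDLERS` table, the extension normalisation, and the ValueError for unknown types are all hypothetical and say nothing about how general_purpose_read is actually written.

```python
from typing import Callable, Dict, Union

Blob = Union[bytes, str]


def _read_plain_text(blob: Blob) -> str:
    """Return strings unchanged and decode bytes as UTF-8."""
    return blob if isinstance(blob, str) else blob.decode("utf-8", errors="replace")


def _read_archive(blob: Blob) -> str:
    """Compressed archives yield an empty string, as documented above."""
    return ""


# Hypothetical routing table: normalised extension -> handler.
HANDLERS: Dict[str, Callable[[Blob], str]] = {
    "txt": _read_plain_text,
    "md": _read_plain_text,
    "zip": _read_archive,
}


def route_by_filetype(blob: Blob, filetype: str) -> str:
    key = filetype.lower().lstrip(".")  # accept ".TXT" as well as "txt"
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"Unsupported file type: {filetype!r}")
    return handler(blob)
```

A table like this keeps each parser independent, so supporting another format means adding one entry rather than another branch.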