Data

azure_doc_intel_read

azure_doc_intel_read(blob, credential=AzureKeyCredential(os.environ['AZURE_AI_API_KEY']), endpoint=os.environ['AZURE_AI_ENDPOINT'], max_pages=10, max_retries=3, initial_delay=2.0, chunk_pages=False, clean_markdown=True)

Reads and analyzes text from a document using Azure's prebuilt layout model.

Uses Azure Document Intelligence, retrying transient failures with exponential backoff for robust text extraction. Supports various document formats, including PDF and images.

Parameters:

Name Type Description Default
blob bytes

The document content to be analyzed (e.g., PDF or image bytes).

required
credential AzureKeyCredential

Azure authentication credential. Defaults to a credential created from the AZURE_AI_API_KEY environment variable.

AzureKeyCredential(environ['AZURE_AI_API_KEY'])
endpoint str

Azure AI endpoint URL. Defaults to the value of the AZURE_AI_ENDPOINT environment variable.

environ['AZURE_AI_ENDPOINT']
max_pages int

Maximum number of document pages to extract.

10
max_retries int

Maximum number of retries for transient errors.

3
initial_delay float

Initial delay in seconds for exponential backoff.

2.0
chunk_pages bool

If False, returns the content as a single string. If True, returns a dictionary with the page as key and the page content as value.

False
clean_markdown bool

Determines whether to convert HTML content (e.g. tables) to pure markdown.

True

Returns:

Type Description
str | dict

The extracted document content as markdown text: a single string, or a dictionary keyed by page when chunk_pages is True.

Raises:

Type Description
ServiceRequestError

When Azure service request fails after all retries.

ServiceResponseError

When Azure service response is invalid after all retries.

HttpResponseError

When HTTP request fails after all retries.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> markdown_text = azure_doc_intel_read(content, max_pages=5)
>>> print(markdown_text)
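
The retry behaviour controlled by max_retries and initial_delay can be pictured with the minimal sketch below. It is not the library's implementation: the helper name retry_with_backoff, the injectable sleep argument, and the doubling factor are assumptions made for illustration.

```python
import time


def retry_with_backoff(call, max_retries=3, initial_delay=2.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Run call(), retrying failures with a delay that doubles each time.

    Hypothetical helper: the doubling factor is an assumption, not taken
    from the documented function.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            # Out of retries: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2
```

With the defaults, a call that keeps failing would wait 2.0, 4.0 and 8.0 seconds before the final error is raised. Passing a custom sleep callable makes the schedule easy to inspect without actually waiting.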

extract_rows_from_bill_of_quantities

extract_rows_from_bill_of_quantities(content, model=os.environ['AZURE_OPENAI_DEPLOYMENT'], max_workers=None, document_name=None)

Extracts quantity and cost estimates from a bill of quantities file in PDF or Office (Excel, Word) formats.

Maps all line items into a general-purpose, normalised schema, retaining page numbering and other metadata for downstream processing.

This function is optimized for performance using a two-phase approach:

1. Serial phase: sequentially processes all pages to build context and identify pages containing estimate tables.
2. Parallel phase: concurrently extracts tables from the identified pages using ThreadPoolExecutor for a significant speedup.

Parameters:

Name Type Description Default
content str

The extracted file content. Expects pages to be delimited by <!-- PageBreak --> and tables to be delimited by HTML <td/> tags.

required
model str

Model deployment name within your Azure resource. If not provided, defaults to the AZURE_OPENAI_DEPLOYMENT environment variable.

environ['AZURE_OPENAI_DEPLOYMENT']
max_workers int | None

Maximum number of worker threads for parallel table extraction. If None, defaults to min(32, (cpu_count or 1) + 4).

None
document_name str | None

Optional document filename to help identify asset types. If provided, will be included in Phase 1 AI analysis.

None

Returns:

Type Description
DataFrame

The extracted tables as a single Pandas DataFrame.
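
The two-phase flow described above can be sketched roughly as follows. This is an illustration only: the helper names, the 1-based page numbering, the "<td" substring check, and the stand-in worker count_cells (which replaces the model call) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_BREAK = "<!-- PageBreak -->"


def split_pages(content):
    """Split content on the page-break marker into {page_number: text}."""
    pages = content.split(PAGE_BREAK)
    return {number: text.strip() for number, text in enumerate(pages, start=1)}


def find_table_pages(pages):
    """Serial phase: walk pages in order and keep those holding table cells."""
    return [number for number, text in pages.items() if "<td" in text]


def count_cells(text):
    """Stand-in worker; a real worker would extract rows from the page."""
    return text.count("<td")


def extract_parallel(content, worker=count_cells, max_workers=None):
    """Parallel phase: run the worker on each table page in a thread pool."""
    pages = split_pages(content)
    table_pages = find_table_pages(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda number: worker(pages[number]), table_pages))
    return dict(zip(table_pages, results))
```

Leaving max_workers as None hands the choice of pool size to ThreadPoolExecutor itself.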

extract_rows_from_invoice

extract_rows_from_invoice(content, model=os.environ['AZURE_OPENAI_DEPLOYMENT'], max_workers=None)

Extracts line items from HTML-formatted invoice content. Preserves existing table headers.

Parameters:

Name Type Description Default
content str

The extracted file content. Expects pages to be delimited by <!-- PageBreak --> and tables to be delimited by HTML <td/> tags.

required
model str

Model deployment name within your Azure resource. If not provided, defaults to the AZURE_OPENAI_DEPLOYMENT environment variable.

environ['AZURE_OPENAI_DEPLOYMENT']
max_workers int | None

Maximum number of worker threads for parallel table extraction. If None, defaults to min(32, (cpu_count or 1) + 4).

None

Returns:

Type Description
DataFrame

The extracted rows as a single Pandas DataFrame.
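
As a rough picture of what preserving existing table headers could look like, the sketch below reads an HTML table with the standard library and keys each row by the first row's cells. It involves no model call and is not the function's implementation; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_with_headers(html):
    """Return one dict per body row, keyed by the table's own header cells."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

A list of dicts like this can be passed directly to pandas.DataFrame if a DataFrame is wanted.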

extract_text_from_doc

extract_text_from_doc(doc_bytes)

Extract text content from a Microsoft Word 97-2003 DOC file.

Extracts text from legacy Word documents using OLE file parsing. Handles encoding issues and cleans up control characters while preserving line breaks and document structure.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOC file content as bytes.

required

Returns:

Type Description
str | None

Extracted text content with cleaned formatting, or None if extraction fails or file is invalid.

Raises:

Type Description
OleFileError

If the DOC file is not a valid OLE file.

UnicodeDecodeError

If text encoding cannot be determined.

Examples:

>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_doc(content)
>>> if text:
...     print(text)

extract_text_from_docm

extract_text_from_docm(doc_bytes)

Extract text content from a Microsoft Word DOCM file.

Extracts text from a macro-enabled Word document by parsing the internal XML structure and removing XML tags to obtain plain text content.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOCM file content as bytes.

required

Returns:

Type Description
str

Extracted plain text content with XML tags removed and normalized whitespace.

Raises:

Type Description
BadZipFile

If the DOCM file is corrupted or not a valid ZIP archive.

KeyError

If the required word/document.xml file is not found.

UnicodeDecodeError

If the XML content encoding is not supported.

Examples:

>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docm(content)
>>> print(text)
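
The approach described for DOCM files (open the ZIP container, read word/document.xml, drop the tags, normalize whitespace) can be sketched with the standard library alone. The function name docm_text_sketch, the UTF-8 decoding, and the regular expressions are assumptions, not the library's code.

```python
import io
import re
import zipfile


def docm_text_sketch(doc_bytes):
    """Return the plain text found in word/document.xml of a DOCM/DOCX archive."""
    # zipfile raises BadZipFile for non-ZIP input and KeyError for a missing member.
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Replace every tag with a space so adjacent runs do not fuse together.
    text = re.sub(r"<[^>]+>", " ", xml)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```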

extract_text_from_docx

extract_text_from_docx(docx_bytes)

Extract text content from a Microsoft Word DOCX file.

Extracts text from all paragraphs and tables in the document, preserving the document structure and table formatting. Tables are converted to pipe-separated format.

Parameters:

Name Type Description Default
docx_bytes bytes

The DOCX file content as bytes.

required

Returns:

Type Description
str

Extracted text content with paragraphs and tables separated by double newlines. Table cells are separated by pipe characters.

Raises:

Type Description
BadZipFile

If the DOCX file is corrupted or not a valid ZIP archive.

ValueError

If the file is not a valid DOCX format.

Examples:

>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docx(content)
>>> print(text)
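
The output layout described above (blocks separated by double newlines, table cells separated by pipes) might be assembled as in this sketch. The helper join_blocks, its input shape, and the exact " | " separator are assumptions for illustration; the real function reads the blocks from the DOCX itself.

```python
def join_blocks(blocks):
    """Join paragraphs (str) and tables (list of rows) into one text string."""
    parts = []
    for block in blocks:
        if isinstance(block, str):
            # A paragraph is emitted unchanged.
            parts.append(block)
        else:
            # A table becomes one pipe-separated line per row.
            parts.append("\n".join(" | ".join(row) for row in block))
    # Paragraphs and tables are separated by a blank line.
    return "\n\n".join(parts)
```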

extract_text_from_hpg

extract_text_from_hpg(doc_bytes)

Convert HPGL/HPG plotter file to readable text.

Converts Hewlett-Packard Graphics Language (HPGL) files to text by first converting them to PDF format and then using Azure Document Intelligence for OCR text extraction.

Parameters:

Name Type Description Default
doc_bytes bytes

The HPGL/HPG file content as bytes.

required

Returns:

Type Description
str

Extracted text content from the plotter file via OCR.

Raises:

Type Description
ValueError

If the HPGL file format is not supported.

ServiceRequestError

If Azure Document Intelligence service fails.

Examples:

>>> with open('drawing.hpg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_hpg(content)
>>> print(text)

extract_text_from_mcdx

extract_text_from_mcdx(doc_bytes)

Extract text content from a Mathcad Prime MCDX file.

Processes an MCDX file by extracting text from all XAMLPackage files contained within the mathcad/xaml/ directory. Each XAMLPackage is processed individually and their text content is combined.

Parameters:

Name Type Description Default
doc_bytes bytes

The MCDX file content as bytes.

required

Returns:

Type Description
str

Combined text content from all XAMLPackages, separated by carriage return and newline characters.

Raises:

Type Description
BadZipFile

If the MCDX file is corrupted or not a valid ZIP archive.

KeyError

If required XAMLPackage files are not found in the archive.

XMLSyntaxError

If XAMLPackage XML content is invalid.

Examples:

>>> with open('calculation.mcdx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_mcdx(content)
>>> print(text)

extract_text_from_msg

extract_text_from_msg(doc_bytes)

Convert a Microsoft Outlook MSG file to readable text.

Extracts email content including subject, sender, recipient, date, and body from a Microsoft Outlook MSG file format.

Parameters:

Name Type Description Default
doc_bytes bytes

The MSG file content as bytes.

required

Returns:

Type Description
str

Formatted email content with headers and body text.

Raises:

Type Description
ValueError

If the file is not a valid MSG format.

AttributeError

If required email properties are missing.

Examples:

>>> with open('email.msg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_msg(content)
>>> print(text)
Subject: Meeting Tomorrow
From: john@example.com
To: jane@example.com
Date: 2024-01-15

Body: Let's meet at 2pm...

extract_text_from_odt

extract_text_from_odt(doc_bytes)

Extract text content from an OpenDocument Text (ODT) file.

Processes an ODT document by extracting text from all paragraph elements, preserving the document structure and formatting.

Parameters:

Name Type Description Default
doc_bytes bytes

The ODT file content as bytes.

required

Returns:

Type Description
str

Extracted text content with paragraphs separated by newlines.

Raises:

Type Description
BadZipFile

If the ODT file is corrupted or not a valid ZIP archive.

XMLSyntaxError

If the ODT document contains invalid XML.

Examples:

>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_odt(content)
>>> print(text)

extract_text_from_pptx

extract_text_from_pptx(pptx_bytes)

Extract text content from PowerPoint PPTX/PPSX/PPTM files.

Extracts text from all slides in the presentation by parsing the XML structure directly. Handles text elements and tables, preserving slide organization and table structure.

Parameters:

Name Type Description Default
pptx_bytes bytes

The PowerPoint file content as bytes.

required

Returns:

Type Description
str

Extracted text content with slide separators and table formatting. Tables are converted to pipe-separated format. Returns error message if extraction fails.

Raises:

Type Description
BadZipFile

If the PowerPoint file is corrupted or not a valid ZIP archive.

XMLSyntaxError

If slide XML content is invalid.

Note

Works with PowerPoint XML formats (PPTX, PPSX, PPTM) but not older binary PowerPoint 1997-2003 documents (PPT).

Examples:

>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_pptx(content)
>>> print(text)

extract_text_from_rtf

extract_text_from_rtf(doc_bytes)

Convert RTF formatted bytes to plain text.

Converts Rich Text Format (RTF) documents to plain text by decoding the bytes and using RTF parsing libraries to extract readable content.

Parameters:

Name Type Description Default
doc_bytes bytes

RTF formatted data as bytes.

required

Returns:

Type Description
str

Plain text content extracted from RTF with formatting removed.

Raises:

Type Description
UnicodeDecodeError

If RTF content cannot be decoded properly.

Examples:

>>> with open('document.rtf', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_rtf(content)
>>> print(text)

extract_text_from_xer

extract_text_from_xer(doc_bytes)

Extract task information from a Primavera P6 XER schedule file.

Parses an XER file and extracts tasks grouped by WBS (Work Breakdown Structure), returning the data as markdown tables. Only includes relevant task information, stripping out unnecessary metadata to provide a clean, readable output.

Parameters:

Name Type Description Default
doc_bytes bytes

The XER file content as bytes.

required

Returns:

Type Description
str

Markdown-formatted text with tasks organized by WBS hierarchy. Each WBS section contains a table with task details including:

  • Activity ID (task_code)
  • Activity Name
  • Start Date
  • Finish Date
  • Duration (days)
  • Status
  • % Complete

Raises:

Type Description
CorruptXerFile

If the XER file is corrupted or missing required tables.

UnicodeDecodeError

If the file encoding cannot be determined.

Examples:

>>> with open('schedule.xer', 'rb') as f:
...     content = f.read()
>>> markdown = extract_text_from_xer(content)
>>> print(markdown)
## Project: Construction Schedule
1.0 Site Preparation
| Activity ID | Activity Name | Start | Finish | Duration | Status | % Complete |
| --- | --- | --- | --- | --- | --- | --- |
| A1000 | Mobilization | 2024-01-15 | 2024-01-20 | 5 | Complete | 100 |
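
For orientation, grouping tasks by WBS could look like the sketch below. It assumes the tab-delimited XER layout in which %T names a table, %F lists its fields and %R carries one record, and it assumes the PROJWBS and TASK field names shown. The helper names, the heading level, and the reduced two-column table are illustrative choices, not the library's output format.

```python
from collections import defaultdict


def parse_xer_tables(text):
    """Parse XER text into {table_name: [record_dict, ...]}."""
    tables = defaultdict(list)
    name, fields = None, []
    for line in text.splitlines():
        parts = line.split("\t")
        if parts[0] == "%T":
            name = parts[1]
        elif parts[0] == "%F":
            fields = parts[1:]
        elif parts[0] == "%R" and name:
            tables[name].append(dict(zip(fields, parts[1:])))
    return tables


def tasks_by_wbs_markdown(text):
    """Render one markdown table of tasks per WBS element."""
    tables = parse_xer_tables(text)
    wbs_names = {w["wbs_id"]: w["wbs_name"] for w in tables["PROJWBS"]}
    grouped = defaultdict(list)
    for task in tables["TASK"]:
        grouped[task["wbs_id"]].append(task)
    lines = []
    for wbs_id, tasks in grouped.items():
        lines.append(f"### {wbs_names.get(wbs_id, wbs_id)}")
        lines.append("| Activity ID | Activity Name |")
        lines.append("| --- | --- |")
        lines.extend(f"| {t['task_code']} | {t['task_name']} |" for t in tasks)
    return "\n".join(lines)
```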

extract_text_from_xls

extract_text_from_xls(doc_bytes, clean_markdown=True)

Extract text content from an XLSX/XLSM spreadsheet file.

Extracts text from all worksheets in the spreadsheet, preserving the sheet structure and cell organization. Each sheet is processed individually with sheet names as headers.

Parameters:

Name Type Description Default
doc_bytes bytes

The XLSX/XLSM file content as bytes.

required
clean_markdown bool

If True, returns pipe-separated text format. If False, returns HTML tables. Defaults to True.

True

Returns:

Type Description
str

Extracted text content. Format depends on the clean_markdown parameter:

  • True: Sheet names and cell values separated by pipe characters (|), with each row on a new line.
  • False: HTML tables with sheet names as headers.

Raises:

Type Description
BadZipFile

If the file is not a valid ZIP archive.

ValueError

If the file is not a valid Excel format.

Examples:

>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xls(content)  # Markdown format
>>> html = extract_text_from_xls(content, clean_markdown=False)  # HTML format
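
The effect of the clean_markdown switch can be illustrated with a small renderer over already-parsed sheets. Everything here is an assumption made for the example: the render_sheets name, the {sheet_name: rows} input, the h2 heading tag, and the exact separators.

```python
from html import escape


def render_sheets(sheets, clean_markdown=True):
    """Render {sheet_name: rows} as pipe-separated text or as HTML tables."""
    parts = []
    for name, rows in sheets.items():
        if clean_markdown:
            # Sheet name as a header line, then one pipe-separated line per row.
            lines = [name] + ["|".join(str(cell) for cell in row) for row in rows]
            parts.append("\n".join(lines))
        else:
            # Sheet name as a heading, then one <tr> per row.
            body = "".join(
                "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
                for row in rows
            )
            parts.append(f"<h2>{escape(name)}</h2><table>{body}</table>")
    return "\n\n".join(parts)
```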

extract_text_from_xmcd

extract_text_from_xmcd(xmcd_bytes)

Extract and process text from a Mathcad XMCD file.

Processes an XMCD file by parsing its XML content and extracting text elements, filtering out empty lines and limiting output to the first 100 text elements.

Parameters:

Name Type Description Default
xmcd_bytes bytes

The XMCD file content as bytes.

required

Returns:

Type Description
str

Extracted text content separated by carriage return and newline characters.

Raises:

Type Description
XMLSyntaxError

If the XMCD file contains invalid XML.

UnicodeDecodeError

If the file encoding is not supported.

Examples:

>>> with open('calculation.xmcd', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xmcd(content)
>>> print(text)
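
A minimal sketch of the described steps (parse the XML, collect text elements, skip empty ones, stop at 100, join with carriage return and newline) is shown below. It uses the standard library's xml.etree.ElementTree, which raises ParseError rather than the XMLSyntaxError listed above; the function name and the limit parameter are assumptions.

```python
import xml.etree.ElementTree as ET


def xmcd_text_sketch(xmcd_bytes, limit=100):
    """Return the first `limit` non-empty text pieces, joined by CRLF."""
    root = ET.fromstring(xmcd_bytes)
    pieces = []
    for text in root.itertext():
        text = text.strip()
        if not text:
            # Skip whitespace-only text nodes.
            continue
        pieces.append(text)
        if len(pieces) == limit:
            break
    return "\r\n".join(pieces)
```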

general_purpose_read

general_purpose_read(blob, filetype, chunk_pages=False, num_pages=10, clean_markdown=True)

Extract text content from various file formats using appropriate parsers.

Analyzes the file type and routes to the most appropriate extraction method. Supports a wide range of document formats including Office documents, PDFs, images, and specialized formats like Mathcad and CAD files.

Parameters:

Name Type Description Default
blob bytes | str

The document content as bytes, or string for plain text.

required
filetype str

The file extension indicating the document type.

required
chunk_pages bool

If True, returns a dict with page numbers as keys (only supported for certain formats). If False, returns the concatenated text as a single string (default behavior).

False
num_pages int

Applies to PDF only. Number of pages to extract from the document. Setting to 0 will extract all pages. Default is 10.

10
clean_markdown bool

Applies to PDF only. Determines whether to convert HTML content (e.g. tables) to pure markdown.

True

Returns: Extracted text content from the document. Returns empty string for compressed archives or if extraction fails.

Raises:

Type Description
Exception

Various exceptions depending on the file type and extraction method used. Logs errors and re-raises the exception.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> text = general_purpose_read(content, 'pdf')
>>> print(text)
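
Routing by file type can be pictured as a lookup from extension to parser, as in the sketch below. The handler table, the archive extension set, and the ValueError for unknown types are assumptions for illustration; the real function covers many more formats and may handle unknown types differently.

```python
def read_plain_text(blob):
    """Return text unchanged, decoding bytes as UTF-8 when needed."""
    return blob if isinstance(blob, str) else blob.decode("utf-8", errors="replace")


# Hypothetical registry; a fuller one would map 'docx', 'pdf', 'xer' and so on.
HANDLERS = {"txt": read_plain_text, "md": read_plain_text}
ARCHIVE_TYPES = {"zip", "7z", "rar", "gz", "tar"}


def route_by_filetype(blob, filetype, handlers=HANDLERS):
    """Pick a parser from the file extension and apply it to the content."""
    ext = filetype.lower().lstrip(".")
    if ext in ARCHIVE_TYPES:
        # Compressed archives yield an empty string rather than text.
        return ""
    handler = handlers.get(ext)
    if handler is None:
        raise ValueError(f"unsupported file type: {filetype!r}")
    return handler(blob)
```

Keeping the mapping in a dictionary means a new format only needs one new entry, with no change to the routing logic.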