Data

azure_doc_intel_read

azure_doc_intel_read(blob, credential=AzureKeyCredential(os.environ['AZURE_AI_API_KEY']), endpoint=os.environ['AZURE_AI_ENDPOINT'], max_pages=10, max_retries=3, initial_delay=2.0, chunk_pages=False)

Reads and analyzes text from a document using Azure's prebuilt layout model.

Calls Azure Document Intelligence and retries transient failures with exponential backoff. Supports several document formats, including PDF and images.

Parameters:

Name	Type	Description	Default
`blob`	`bytes`	The document content to be analyzed (e.g., PDF or image bytes).	required
`credential`	`AzureKeyCredential`	Azure authentication credential. Defaults to a credential created from the AZURE_AI_API_KEY environment variable.	`AzureKeyCredential(environ['AZURE_AI_API_KEY'])`
`endpoint`	`str`	Azure AI endpoint URL. Defaults to the value of the AZURE_AI_ENDPOINT environment variable.	`environ['AZURE_AI_ENDPOINT']`
`max_pages`	`int`	Maximum number of document pages to extract.	`10`
`max_retries`	`int`	Maximum number of retries for transient errors.	`3`
`initial_delay`	`float`	Initial delay in seconds for exponential backoff.	`2.0`
`chunk_pages`	`bool`	If False, returns the content as a single string. If True, returns a dictionary keyed by page, with each page's content as the value.	`False`

Returns:

Type	Description
`str | dict`	The extracted document content as markdown text: a single string, or a dictionary of per-page content when `chunk_pages` is True.
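
The two return shapes can be illustrated with a minimal sketch. `assemble_pages` is a hypothetical helper, not part of this module, and both the 1-based page keys and the blank-line separator are assumptions rather than documented behavior.

```python
def assemble_pages(pages, max_pages=10, chunk_pages=False):
    """Combine per-page text into one string or a page-keyed dict.

    Hypothetical helper: the page numbering (starting at 1) and the
    blank-line separator are assumptions, not documented behavior.
    """
    kept = pages[:max_pages]  # honor the max_pages limit
    if chunk_pages:
        # One entry per page, keyed by page number.
        return {number: text for number, text in enumerate(kept, start=1)}
    # Single string with pages separated by a blank line.
    return "\n\n".join(kept)
```

Callers that need per-page citations would likely pass `chunk_pages=True`; everything else can treat the result as plain markdown text.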

Raises:

Type	Description
`ServiceRequestError`	When Azure service request fails after all retries.
`ServiceResponseError`	When Azure service response is invalid after all retries.
`HttpResponseError`	When HTTP request fails after all retries.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> markdown_text = azure_doc_intel_read(content, max_pages=5)
>>> print(markdown_text)
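
The retry behavior controlled by `max_retries` and `initial_delay` might look roughly like the sketch below. It assumes the delay doubles after each failed attempt and that `max_retries` counts retries after the first call; `TransientError` and `call_with_backoff` are hypothetical names standing in for the Azure exceptions and the internal retry loop.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient service error."""


def call_with_backoff(func, max_retries=3, initial_delay=2.0, sleep=time.sleep):
    """Call func, retrying on TransientError with a doubling delay.

    Assumption: max_retries counts retries after the first attempt, so
    func is called at most max_retries + 1 times.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= 2  # exponential backoff: 2.0, 4.0, 8.0, ...
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.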

extract_text_from_doc

extract_text_from_doc(doc_bytes)

Extract text content from a Microsoft Word 97-2003 DOC file.

Extracts text from legacy Word documents using OLE file parsing. Handles encoding issues and cleans up control characters while preserving line breaks and document structure.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The DOC file content as bytes.	required

Returns:

Type	Description
`Optional[str]`	Extracted text content with cleaned formatting, or None if extraction fails or file is invalid.

Raises:

Type	Description
`OleFileError`	If the DOC file is not a valid OLE file.
`UnicodeDecodeError`	If text encoding cannot be determined.

Examples:

>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_doc(content)
>>> if text:
...     print(text)

extract_text_from_docm

extract_text_from_docm(doc_bytes)

Extract text content from a Microsoft Word DOCM file.

Extracts text from a macro-enabled Word document by parsing the internal XML structure and removing XML tags to obtain plain text content.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The DOCM file content as bytes.	required

Returns:

Type	Description
`str`	Extracted plain text content with XML tags removed and normalized whitespace.

Raises:

Type	Description
`BadZipFile`	If the DOCM file is corrupted or not a valid ZIP archive.
`KeyError`	If the required word/document.xml file is not found.
`UnicodeDecodeError`	If the XML content encoding is not supported.

Examples:

>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docm(content)
>>> print(text)
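
The tag-stripping approach described above can be sketched with the standard library alone. `docm_to_text` is a hypothetical name, and the exact regular expressions and UTF-8 decoding are assumptions; the real function may differ in detail.

```python
import html
import io
import re
import zipfile


def docm_to_text(doc_bytes: bytes) -> str:
    """Read word/document.xml from the archive and strip its XML tags."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # ZipFile.read raises KeyError if the member is missing.
        xml = archive.read("word/document.xml").decode("utf-8")
    xml = re.sub(r"</w:p>", " ", xml)   # keep paragraphs apart
    text = re.sub(r"<[^>]+>", "", xml)  # drop all remaining tags
    text = html.unescape(text)          # turn &amp; and friends into characters
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace
```

Removing tags without inserting spaces keeps words intact when Word splits them across several runs.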

extract_text_from_docx

extract_text_from_docx(docx_bytes)

Extract text content from a Microsoft Word DOCX file.

Extracts text from all paragraphs and tables in the document, preserving the document structure and table formatting. Tables are converted to pipe-separated format.

Parameters:

Name	Type	Description	Default
`docx_bytes`	`bytes`	The DOCX file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content with paragraphs and tables separated by double newlines. Table cells are separated by pipe characters.

Raises:

Type	Description
`BadZipFile`	If the DOCX file is corrupted or not a valid ZIP archive.
`ValueError`	If the file is not a valid DOCX format.

Examples:

>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docx(content)
>>> print(text)
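
To make the output layout concrete, here is a minimal standard-library sketch that walks the document body in order and renders tables as pipe-separated rows. It is an illustration, not this module's implementation (which may rely on a DOCX library); `docx_to_text` is a hypothetical name and the `" | "` cell separator with surrounding spaces is an assumption.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate the text of all w:t runs below an element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def docx_to_text(docx_bytes: bytes) -> str:
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    blocks = []
    for child in root.find(W + "body"):
        if child.tag == W + "p":            # top-level paragraph
            text = _text(child)
            if text.strip():
                blocks.append(text)
        elif child.tag == W + "tbl":        # top-level table
            rows = []
            for row in child.findall(W + "tr"):
                cells = [_text(cell).strip() for cell in row.findall(W + "tc")]
                rows.append(" | ".join(cells))
            blocks.append("\n".join(rows))
    # Paragraphs and tables are separated by double newlines.
    return "\n\n".join(blocks)
```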

extract_text_from_hpg

extract_text_from_hpg(doc_bytes)

Convert HPGL/HPG plotter file to readable text.

Converts Hewlett-Packard Graphics Language (HPGL) files to text by first converting them to PDF format and then using Azure Document Intelligence for OCR text extraction.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The HPGL/HPG file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content from the plotter file via OCR.

Raises:

Type	Description
`ValueError`	If the HPGL file format is not supported.
`ServiceRequestError`	If Azure Document Intelligence service fails.

Examples:

>>> with open('drawing.hpg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_hpg(content)
>>> print(text)

extract_text_from_mcdx

extract_text_from_mcdx(doc_bytes)

Extract text content from a Mathcad Prime MCDX file.

Processes an MCDX file by extracting text from all XAMLPackage files contained within the mathcad/xaml/ directory. Each XAMLPackage is processed individually and their text content is combined.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The MCDX file content as bytes.	required

Returns:

Type	Description
`str`	Combined text content from all XAMLPackages, separated by carriage return and newline characters.

Raises:

Type	Description
`BadZipFile`	If the MCDX file is corrupted or not a valid ZIP archive.
`KeyError`	If required XAMLPackage files are not found in the archive.
`XMLSyntaxError`	If XAMLPackage XML content is invalid.

Examples:

>>> with open('calculation.mcdx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_mcdx(content)
>>> print(text)
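
The archive walk described above might be sketched as follows. The `.XamlPackage` file extension, the sorted processing order, and the `package_to_text` callback are all assumptions made for illustration; how each package is actually converted to text is not shown here.

```python
import io
import zipfile


def mcdx_to_text(doc_bytes: bytes, package_to_text) -> str:
    """Combine the text of every XAMLPackage under mathcad/xaml/.

    package_to_text is a hypothetical callback that turns the raw bytes
    of one XAMLPackage into text.
    """
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        names = sorted(
            name
            for name in archive.namelist()
            if name.lower().startswith("mathcad/xaml/")
            and name.lower().endswith(".xamlpackage")  # assumed extension
        )
        if not names:
            raise KeyError("no XAMLPackage files found under mathcad/xaml/")
        parts = [package_to_text(archive.read(name)) for name in names]
    # Packages are joined with carriage return + newline, as documented.
    return "\r\n".join(parts)
```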

extract_text_from_msg

extract_text_from_msg(doc_bytes)

Convert a Microsoft Outlook MSG file to readable text.

Extracts email content including subject, sender, recipient, date, and body from a Microsoft Outlook MSG file format.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The MSG file content as bytes.	required

Returns:

Type	Description
`str`	Formatted email content with headers and body text.

Raises:

Type	Description
`ValueError`	If the file is not a valid MSG format.
`AttributeError`	If required email properties are missing.

Examples:

>>> with open('email.msg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_msg(content)
>>> print(text)
Subject: Meeting Tomorrow
From: john@example.com
To: jane@example.com
Date: 2024-01-15

Body: Let's meet at 2pm...

extract_text_from_odt

extract_text_from_odt(doc_bytes)

Extract text content from an OpenDocument Text (ODT) file.

Processes an ODT document by extracting text from all paragraph elements, preserving the document structure and formatting.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The ODT file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content with paragraphs separated by newlines.

Raises:

Type	Description
`BadZipFile`	If the ODT file is corrupted or not a valid ZIP archive.
`XMLSyntaxError`	If the ODT document contains invalid XML.

Examples:

>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_odt(content)
>>> print(text)
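
As a rough illustration of paragraph extraction, the sketch below reads `content.xml` from the ODT archive and collects every `text:p` element. `odt_to_text` is a hypothetical name and the use of `xml.etree.ElementTree` is an assumption; the module may use a different XML parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument text namespace, in ElementTree's {uri}tag form.
TEXT = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def odt_to_text(doc_bytes: bytes) -> str:
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() also picks up text inside nested spans.
    paragraphs = ["".join(p.itertext()) for p in root.iter(TEXT + "p")]
    return "\n".join(paragraphs)
```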

extract_text_from_pptx

extract_text_from_pptx(pptx_bytes)

Extract text content from PowerPoint PPTX/PPSX/PPTM files.

Extracts text from all slides in the presentation by parsing the XML structure directly. Handles text elements and tables, preserving slide organization and table structure.

Parameters:

Name	Type	Description	Default
`pptx_bytes`	`bytes`	The PowerPoint file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content with slide separators and table formatting. Tables are converted to pipe-separated format. Returns an error message if extraction fails.

Raises:

Type	Description
`BadZipFile`	If the PowerPoint file is corrupted or not a valid ZIP archive.
`XMLSyntaxError`	If slide XML content is invalid.

Note

Works with PowerPoint XML formats (PPTX, PPSX, PPTM) but not older binary PowerPoint 1997-2003 documents (PPT).

Examples:

>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_pptx(content)
>>> print(text)
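
Direct XML parsing of slides could look like the sketch below. It is a simplification under stated assumptions: slides are ordered by the number in their file name (the true presentation order is defined elsewhere in the package), tables are not handled, and the `--- Slide N ---` separator is a made-up format. `pptx_to_text` is a hypothetical name.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML main namespace, which holds the a:p and a:t text elements.
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_text(pptx_bytes: bytes) -> str:
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        sections = []
        # Numeric sort so that slide10 comes after slide2.
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            lines = [
                "".join(t.text or "" for t in para.iter(A + "t"))
                for para in root.iter(A + "p")
            ]
            lines = [line for line in lines if line.strip()]
            sections.append(f"--- Slide {number} ---\n" + "\n".join(lines))
    return "\n\n".join(sections)
```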

extract_text_from_rtf

extract_text_from_rtf(doc_bytes)

Convert RTF formatted bytes to plain text.

Converts Rich Text Format (RTF) documents to plain text by decoding the bytes and using RTF parsing libraries to extract readable content.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	RTF formatted data as bytes.	required

Returns:

Type	Description
`str`	Plain text content extracted from RTF with formatting removed.

Raises:

Type	Description
`UnicodeDecodeError`	If RTF content cannot be decoded properly.

Examples:

>>> with open('document.rtf', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_rtf(content)
>>> print(text)

extract_text_from_xls

extract_text_from_xls(doc_bytes)

Extract text content from an XLSX/XLSM spreadsheet file.

Extracts text from all worksheets in the spreadsheet, preserving the sheet structure and cell organization. Each sheet is processed individually with sheet names as headers.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The XLSX/XLSM file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content with sheet names and cell values separated by pipe characters (|), with each row on a new line.

Raises:

Type	Description
`BadZipFile`	If the file is not a valid ZIP archive.
`ValueError`	If the file is not a valid Excel format.

Examples:

>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xls(content)
>>> print(text)
Sheet: Sheet1
Header1 | Header2 | Header3
Value1 | Value2 | Value3

extract_text_from_xmcd

extract_text_from_xmcd(xmcd_bytes)

Extract and process text from a Mathcad XMCD file.

Processes an XMCD file by parsing its XML content and extracting text elements, filtering out empty lines and limiting output to the first 100 text elements.

Parameters:

Name	Type	Description	Default
`xmcd_bytes`	`bytes`	The XMCD file content as bytes.	required

Returns:

Type	Description
`str`	Extracted text content separated by carriage return and newline characters.

Raises:

Type	Description
`XMLSyntaxError`	If the XMCD file contains invalid XML.
`UnicodeDecodeError`	If the file encoding is not supported.

Examples:

>>> with open('calculation.xmcd', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xmcd(content)
>>> print(text)
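
The filter-and-limit step might be sketched as below. Which XML elements count as "text elements" is not specified here, so this version assumes every non-empty text node qualifies; `xmcd_to_text` is a hypothetical name.

```python
import itertools
import xml.etree.ElementTree as ET


def xmcd_to_text(xmcd_bytes: bytes, limit: int = 100) -> str:
    """Join the first `limit` non-empty text nodes with CRLF.

    Assumption: every non-empty text node in the XML counts as a
    text element.
    """
    root = ET.fromstring(xmcd_bytes)
    non_empty = (text.strip() for text in root.itertext() if text.strip())
    return "\r\n".join(itertools.islice(non_empty, limit))
```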

general_purpose_read

general_purpose_read(blob, filetype, chunk_pages=False)

Extract text content from various file formats using appropriate parsers.

Analyzes the file type and routes to the most appropriate extraction method. Supports a wide range of document formats including Office documents, PDFs, images, and specialized formats like Mathcad and CAD files.

Parameters:

Name	Type	Description	Default
`blob`	`bytes`	The document content as bytes, or as a string for plain text.	required
`filetype`	`str`	The file extension indicating the document type.	required
`chunk_pages`	`bool`	If True, returns a dict with page numbers as keys (only supported for certain formats). If False, returns the concatenated text as a single string.	`False`

Returns:

Type	Description
`str | dict`	Extracted text content from the document. Returns empty string for compressed archives or if extraction fails.

Raises:

Type	Description
`Exception`	Any exception raised by the underlying extraction method for the given file type. The error is logged and then re-raised.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> text = general_purpose_read(content, 'pdf')
>>> print(text)
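
Routing by file type can be pictured as a lookup table from extension to extractor. The sketch below is illustrative only: `route_by_filetype`, the `handlers` mapping, the set of archive extensions, and the `ValueError` for unknown types are all assumptions, not the behavior of `general_purpose_read` itself.

```python
# Assumed set of archive extensions; the real list is not documented here.
ARCHIVE_TYPES = {"zip", "7z", "rar", "gz", "tar"}


def route_by_filetype(blob, filetype, handlers):
    """Pick an extractor from `handlers` based on the file extension.

    handlers is a hypothetical {extension: callable} mapping, e.g.
    {"docx": extract_text_from_docx, "rtf": extract_text_from_rtf}.
    """
    ext = filetype.lower().lstrip(".")  # accept "PDF", ".pdf", "pdf"
    if ext in ARCHIVE_TYPES:
        return ""  # compressed archives yield an empty string
    if isinstance(blob, str):
        return blob  # plain text passed as a string needs no parsing
    handler = handlers.get(ext)
    if handler is None:
        raise ValueError(f"unsupported file type: {filetype!r}")
    return handler(blob)
```

A table-driven dispatch keeps each format's parser independent, so supporting a new format only means registering one more entry.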