Data

azure_doc_intel_read

azure_doc_intel_read(blob, credential=AzureKeyCredential(os.environ['AZURE_AI_API_KEY']), endpoint=os.environ['AZURE_AI_ENDPOINT'], max_pages=10, max_retries=3, initial_delay=2.0, chunk_pages=False)

Reads and analyzes text from a document using Azure's prebuilt layout model.

Uses Azure Document Intelligence and retries transient errors with exponential backoff. Supports several document formats, including PDF and images.

Parameters:

Name Type Description Default
blob bytes

The document content to be analyzed (e.g., PDF or image bytes).

required
credential AzureKeyCredential

Azure authentication credential. Defaults to credential created from AZURE_AI_API_KEY environment variable.

AzureKeyCredential(os.environ['AZURE_AI_API_KEY'])
endpoint str

Azure AI endpoint URL. Defaults to value from AZURE_AI_ENDPOINT environment variable.

os.environ['AZURE_AI_ENDPOINT']
max_pages int

Maximum number of document pages to extract.

10
max_retries int

Maximum number of retries for transient errors.

3
initial_delay float

Initial delay in seconds for exponential backoff.

2.0
chunk_pages bool

If False, returns the content as a single string. If True, returns a dictionary with page numbers as keys and page content as values.

False

Returns:

Type Description
str | dict

The extracted document content as markdown text: a single string, or a dictionary keyed by page when chunk_pages is True.

Raises:

Type Description
ServiceRequestError

When Azure service request fails after all retries.

ServiceResponseError

When Azure service response is invalid after all retries.

HttpResponseError

When HTTP request fails after all retries.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> markdown_text = azure_doc_intel_read(content, max_pages=5)
>>> print(markdown_text)
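
The retry behavior controlled by max_retries and initial_delay can be pictured with a minimal sketch. This is not the library's implementation: `call_with_backoff`, `retry_on`, and the injectable `sleep` argument are hypothetical names, and the doubling of the delay after each failed attempt is an assumption (the documentation only says the backoff is exponential).

```python
import time


def call_with_backoff(func, max_retries=3, initial_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with a growing delay.

    Hypothetical helper; assumes the delay doubles after every failed attempt.
    """
    delay = initial_delay
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= 2
```

With the defaults, a call that keeps failing would wait 2, 4, and 8 seconds before the final error is raised. Passing `sleep` as a parameter keeps the helper testable without real waiting.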

extract_text_from_doc

extract_text_from_doc(doc_bytes)

Extract text content from a Microsoft Word 97-2003 DOC file.

Extracts text from legacy Word documents using OLE file parsing. Handles encoding issues and cleans up control characters while preserving line breaks and document structure.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOC file content as bytes.

required

Returns:

Type Description
Optional[str]

Extracted text content with cleaned formatting, or None if extraction fails or file is invalid.

Raises:

Type Description
OleFileError

If the DOC file is not a valid OLE file.

UnicodeDecodeError

If text encoding cannot be determined.

Examples:

>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_doc(content)
>>> if text:
...     print(text)
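
The cleanup step mentioned above (removing control characters while keeping line breaks) might look roughly like the following. `clean_control_chars` is a hypothetical helper, and the choice to keep tabs and to treat carriage returns as line breaks is an assumption rather than documented behavior.

```python
import re

# Control characters other than tab (\x09) and newline (\x0a).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_control_chars(text):
    """Drop control characters but keep line breaks (hypothetical helper)."""
    # Treat CRLF and bare CR as line breaks before stripping the rest.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL_CHARS.sub("", text)
```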

extract_text_from_docm

extract_text_from_docm(doc_bytes)

Extract text content from a Microsoft Word DOCM file.

Extracts text from a macro-enabled Word document by parsing the internal XML structure and removing XML tags to obtain plain text content.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOCM file content as bytes.

required

Returns:

Type Description
str

Extracted plain text content with XML tags removed and whitespace normalized.

Raises:

Type Description
BadZipFile

If the DOCM file is corrupted or not a valid ZIP archive.

KeyError

If the required word/document.xml file is not found.

UnicodeDecodeError

If the XML content encoding is not supported.

Examples:

>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docm(content)
>>> print(text)
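
A sketch of the approach described above, using only the standard library. It is an illustration, not the function's actual source: `docm_xml_to_text` is a hypothetical name, UTF-8 is assumed for the XML, and XML entities such as `&amp;` are left as they are.

```python
import io
import re
import zipfile


def docm_xml_to_text(doc_bytes):
    """Read word/document.xml from the archive and strip its tags."""
    # zipfile.BadZipFile is raised here if the bytes are not a ZIP archive.
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # KeyError is raised if the member is missing.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Replace every tag with a space, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", xml)
    return re.sub(r"\s+", " ", text).strip()
```

The two exceptions noted in the comments correspond to the BadZipFile and KeyError entries in the Raises table.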

extract_text_from_docx

extract_text_from_docx(docx_bytes)

Extract text content from a Microsoft Word DOCX file.

Extracts text from all paragraphs and tables in the document, preserving the document structure and table formatting. Tables are converted to pipe-separated format.

Parameters:

Name Type Description Default
docx_bytes bytes

The DOCX file content as bytes.

required

Returns:

Type Description
str

Extracted text content with paragraphs and tables separated by double newlines. Table cells are separated by pipe characters.

Raises:

Type Description
BadZipFile

If the DOCX file is corrupted or not a valid ZIP archive.

ValueError

If the file is not a valid DOCX format.

Examples:

>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docx(content)
>>> print(text)
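
To make the output layout concrete, here is a small sketch of how paragraphs and tables could be joined. `render_blocks` and its input shape are hypothetical, and the exact separator (` | ` with surrounding spaces) is an assumption; parsing the DOCX itself is out of scope here.

```python
def render_blocks(blocks):
    """Join paragraphs and tables with blank lines (hypothetical helper).

    Each block is either a paragraph string or a table given as a list of
    rows, where every row is a list of cell strings.
    """
    parts = []
    for block in blocks:
        if isinstance(block, str):
            parts.append(block)
        else:
            # One line per table row, cells separated by pipes.
            rows = [" | ".join(cell.strip() for cell in row) for row in block]
            parts.append("\n".join(rows))
    return "\n\n".join(parts)
```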

extract_text_from_hpg

extract_text_from_hpg(doc_bytes)

Convert an HPGL/HPG plotter file to readable text.

Converts Hewlett-Packard Graphics Language (HPGL) files to text by first converting them to PDF format and then using Azure Document Intelligence for OCR text extraction.

Parameters:

Name Type Description Default
doc_bytes bytes

The HPGL/HPG file content as bytes.

required

Returns:

Type Description
str

Extracted text content from the plotter file via OCR.

Raises:

Type Description
ValueError

If the HPGL file format is not supported.

ServiceRequestError

If Azure Document Intelligence service fails.

Examples:

>>> with open('drawing.hpg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_hpg(content)
>>> print(text)

extract_text_from_mcdx

extract_text_from_mcdx(doc_bytes)

Extract text content from a Mathcad Prime MCDX file.

Processes an MCDX file by extracting text from all XAMLPackage files contained within the mathcad/xaml/ directory. Each XAMLPackage is processed individually, and the extracted text is combined.

Parameters:

Name Type Description Default
doc_bytes bytes

The MCDX file content as bytes.

required

Returns:

Type Description
str

Combined text content from all XAMLPackages, separated by carriage return and newline characters.

Raises:

Type Description
BadZipFile

If the MCDX file is corrupted or not a valid ZIP archive.

KeyError

If required XAMLPackage files are not found in the archive.

XMLSyntaxError

If XAMLPackage XML content is invalid.

Examples:

>>> with open('calculation.mcdx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_mcdx(content)
>>> print(text)

extract_text_from_msg

extract_text_from_msg(doc_bytes)

Convert a Microsoft Outlook MSG file to readable text.

Extracts email content, including subject, sender, recipient, date, and body, from a Microsoft Outlook MSG file.

Parameters:

Name Type Description Default
doc_bytes bytes

The MSG file content as bytes.

required

Returns:

Type Description
str

Formatted email content with headers and body text.

Raises:

Type Description
ValueError

If the file is not a valid MSG format.

AttributeError

If required email properties are missing.

Examples:

>>> with open('email.msg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_msg(content)
>>> print(text)
Subject: Meeting Tomorrow
From: john@example.com
To: jane@example.com
Date: 2024-01-15

Body: Let's meet at 2pm...

extract_text_from_odt

extract_text_from_odt(doc_bytes)

Extract text content from an OpenDocument Text (ODT) file.

Processes an ODT document by extracting text from all paragraph elements, preserving the document structure and formatting.

Parameters:

Name Type Description Default
doc_bytes bytes

The ODT file content as bytes.

required

Returns:

Type Description
str

Extracted text content with paragraphs separated by newlines.

Raises:

Type Description
BadZipFile

If the ODT file is corrupted or not a valid ZIP archive.

XMLSyntaxError

If the ODT document contains invalid XML.

Examples:

>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_odt(content)
>>> print(text)
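
Paragraph extraction from an ODT file can be sketched with the standard library alone. This is an illustration under assumptions, not the function's source: `odt_paragraphs` is a hypothetical name, and it uses `xml.etree.ElementTree`, which raises `ParseError` on invalid XML rather than the XMLSyntaxError listed above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of text elements in an OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(doc_bytes):
    """Return the text of every <text:p> element, one per line."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        # itertext() also collects text inside nested spans.
        paragraphs.append("".join(p.itertext()))
    return "\n".join(paragraphs)
```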

extract_text_from_pptx

extract_text_from_pptx(pptx_bytes)

Extract text content from PowerPoint PPTX/PPSX/PPTM files.

Extracts text from all slides in the presentation by parsing the XML structure directly. Handles text elements and tables, preserving slide organization and table structure.

Parameters:

Name Type Description Default
pptx_bytes bytes

The PowerPoint file content as bytes.

required

Returns:

Type Description
str

Extracted text content with slide separators and table formatting. Tables are converted to pipe-separated format. Returns an error message if extraction fails.

Raises:

Type Description
BadZipFile

If the PowerPoint file is corrupted or not a valid ZIP archive.

XMLSyntaxError

If slide XML content is invalid.

Note

Works with PowerPoint XML formats (PPTX, PPSX, PPTM) but not with the older binary PowerPoint 97-2003 format (PPT).

Examples:

>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_pptx(content)
>>> print(text)
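
Parsing slide XML directly could look like the sketch below. Several details are assumptions: `pptx_slide_text` is a hypothetical name, the `--- Slide N ---` separator is invented for illustration, slides are ordered by file number rather than by the order recorded in the presentation, and table handling is omitted.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) in slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml")


def pptx_slide_text(pptx_bytes):
    """Collect the text runs of each slide, one section per slide."""
    sections = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Sort numerically so slide10 comes after slide2.
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(f"{{{A_NS}}}t") if t.text]
            sections.append(f"--- Slide {number} ---\n" + "\n".join(runs))
    return "\n\n".join(sections)
```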

extract_text_from_rtf

extract_text_from_rtf(doc_bytes)

Convert RTF formatted bytes to plain text.

Converts Rich Text Format (RTF) documents to plain text by decoding the bytes and using RTF parsing libraries to extract readable content.

Parameters:

Name Type Description Default
doc_bytes bytes

RTF formatted data as bytes.

required

Returns:

Type Description
str

Plain text content extracted from RTF with formatting removed.

Raises:

Type Description
UnicodeDecodeError

If RTF content cannot be decoded properly.

Examples:

>>> with open('document.rtf', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_rtf(content)
>>> print(text)

extract_text_from_xls

extract_text_from_xls(doc_bytes)

Extract text content from an XLSX/XLSM spreadsheet file.

Extracts text from all worksheets in the spreadsheet, preserving the sheet structure and cell organization. Each sheet is processed individually with sheet names as headers.

Parameters:

Name Type Description Default
doc_bytes bytes

The XLSX/XLSM file content as bytes.

required

Returns:

Type Description
str

Extracted text content with sheet names and cell values separated by pipe characters (|), with each row on a new line.

Raises:

Type Description
BadZipFile

If the file is not a valid ZIP archive.

ValueError

If the file is not a valid Excel format.

Examples:

>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xls(content)
>>> print(text)
Sheet: Sheet1
Header1 | Header2 | Header3
Value1 | Value2 | Value3
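
The output shape above can be reproduced with a short formatting sketch. `render_sheets` and its input (a mapping from sheet name to rows of cell values) are hypothetical, and how consecutive sheets are separated is an assumption; reading the workbook itself is not shown.

```python
def render_sheets(sheets):
    """Format sheets as 'Sheet: <name>' headers followed by pipe-separated rows.

    sheets maps a sheet name to a list of rows (hypothetical input shape).
    """
    lines = []
    for name, rows in sheets.items():
        lines.append(f"Sheet: {name}")
        for row in rows:
            # Render empty cells (None) as empty strings.
            lines.append(" | ".join("" if v is None else str(v) for v in row))
    return "\n".join(lines)
```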

extract_text_from_xmcd

extract_text_from_xmcd(xmcd_bytes)

Extract and process text from a Mathcad XMCD file.

Processes an XMCD file by parsing its XML content and extracting text elements, filtering out empty lines and limiting output to the first 100 text elements.

Parameters:

Name Type Description Default
xmcd_bytes bytes

The XMCD file content as bytes.

required

Returns:

Type Description
str

Extracted text content separated by carriage return and newline characters.

Raises:

Type Description
XMLSyntaxError

If the XMCD file contains invalid XML.

UnicodeDecodeError

If the file encoding is not supported.

Examples:

>>> with open('calculation.xmcd', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xmcd(content)
>>> print(text)
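
The filtering and truncation described above can be sketched as follows. This is an assumption-laden illustration: `xmcd_text` is a hypothetical name, every text node in the XML is treated as a "text element", and the standard-library parser is used, which raises `ParseError` rather than XMLSyntaxError.

```python
import xml.etree.ElementTree as ET


def xmcd_text(xmcd_bytes, limit=100):
    """Join the first `limit` non-empty text nodes with CRLF."""
    root = ET.fromstring(xmcd_bytes)
    lines = []
    for chunk in root.itertext():
        chunk = chunk.strip()
        if chunk:  # skip whitespace-only nodes
            lines.append(chunk)
        if len(lines) == limit:
            break
    return "\r\n".join(lines)
```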

general_purpose_read

general_purpose_read(blob, filetype, chunk_pages=False)

Extract text content from various file formats using appropriate parsers.

Analyzes the file type and routes to the most appropriate extraction method. Supports a wide range of document formats including Office documents, PDFs, images, and specialized formats like Mathcad and CAD files.

Parameters:

Name Type Description Default
blob bytes

The document content as bytes, or string for plain text.

required
filetype str

The file extension indicating the document type.

required
chunk_pages bool

If True, returns a dict with page numbers as keys (only supported for certain formats). If False, returns the text as a single concatenated string (default).

False

Returns:

Type Description
str | dict

Extracted text content from the document. Returns empty string for compressed archives or if extraction fails.

Raises:

Type Description
Exception

Various exceptions depending on the file type and extraction method used. Logs errors and re-raises the exception.

Examples:

>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> text = general_purpose_read(content, 'pdf')
>>> print(text)
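
The routing step can be illustrated with a small dispatch table. This is a sketch, not the function's source: `read_by_extension`, the `handlers` mapping, the set of archive extensions, and the ValueError for unknown types are all assumptions made for the example.

```python
ARCHIVE_EXTENSIONS = {"zip", "gz", "7z"}  # assumed list of compressed formats


def read_by_extension(blob, filetype, handlers):
    """Route blob to the handler registered for its extension.

    handlers maps lowercase extensions (without the dot) to callables that
    take bytes and return text. Both are hypothetical.
    """
    ext = filetype.lower().lstrip(".")
    if ext in ARCHIVE_EXTENSIONS:
        return ""  # compressed archives yield no text
    if ext == "txt":
        # Plain text may arrive as str or as bytes.
        if isinstance(blob, bytes):
            return blob.decode("utf-8", errors="replace")
        return blob
    handler = handlers.get(ext)
    if handler is None:
        raise ValueError(f"unsupported file type: {filetype}")
    return handler(blob)
```

In practice the mapping might pair extensions with the extractors documented on this page, for example `{"docx": extract_text_from_docx, "odt": extract_text_from_odt}`.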