Data
azure_doc_intel_read
azure_doc_intel_read(blob, credential=AzureKeyCredential(os.environ['AZURE_AI_API_KEY']), endpoint=os.environ['AZURE_AI_ENDPOINT'], max_pages=10, max_retries=3, initial_delay=2.0, chunk_pages=False)
Reads and analyzes text from a document using Azure's prebuilt layout model.
Uses Azure Document Intelligence and retries transient errors with exponential backoff for robust text extraction. Supports various document formats, including PDF and images.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blob | bytes | The document content to be analyzed (e.g., PDF or image bytes). | required |
credential | AzureKeyCredential | Azure authentication credential. Defaults to a credential created from the AZURE_AI_API_KEY environment variable. | AzureKeyCredential(environ['AZURE_AI_API_KEY']) |
endpoint | str | Azure AI endpoint URL. Defaults to the value of the AZURE_AI_ENDPOINT environment variable. | environ['AZURE_AI_ENDPOINT'] |
max_pages | int | Maximum number of document pages to extract. | 10 |
max_retries | int | Maximum number of retries for transient errors. | 3 |
initial_delay | float | Initial delay in seconds for exponential backoff. | 2.0 |
chunk_pages | bool | If False, returns the content as a single string. If True, returns a dictionary with the page as key and that page's content as value. | False |
Returns:
Type | Description |
---|---|
str \| dict | The extracted document content as markdown text: a single string, or a dictionary keyed by page when chunk_pages is True. |
Raises:
Type | Description |
---|---|
ServiceRequestError | When the Azure service request fails after all retries. |
ServiceResponseError | When the Azure service response is invalid after all retries. |
HttpResponseError | When the HTTP request fails after all retries. |
Examples:
>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> markdown_text = azure_doc_intel_read(content, max_pages=5)
>>> print(markdown_text)
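The interaction of max_retries and initial_delay can be pictured with a minimal sketch of exponential backoff. This is not the library's implementation: the helper name call_with_backoff, the doubling factor, and the use of ConnectionError as a stand-in for the Azure exception types are all assumptions made for illustration.

```python
import time


def call_with_backoff(func, max_retries=3, initial_delay=2.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a retryable error wait, double the delay, and retry.

    Hypothetical helper: the doubling factor and the exception tuple are
    assumptions, not taken from azure_doc_intel_read itself.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= 2  # each wait is twice as long as the previous one


# Demonstration with a function that fails twice before succeeding.
# Waits are recorded instead of slept so the example runs instantly.
waits = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_backoff(flaky, sleep=waits.append)
```

With the defaults shown, a call that keeps failing would wait three times, each wait twice as long as the last, before the final error propagates to the caller.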
extract_text_from_doc
extract_text_from_doc(doc_bytes)
Extract text content from a Microsoft Word 97-2003 DOC file.
Extracts text from legacy Word documents using OLE file parsing. Handles encoding issues and cleans up control characters while preserving line breaks and document structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The DOC file content as bytes. | required |
Returns:
Type | Description |
---|---|
Optional[str] | Extracted text content with cleaned formatting, or None if extraction fails or file is invalid. |
Raises:
Type | Description |
---|---|
OleFileError | If the DOC file is not a valid OLE file. |
UnicodeDecodeError | If text encoding cannot be determined. |
Examples:
>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_doc(content)
>>> if text:
...     print(text)
extract_text_from_docm
extract_text_from_docm(doc_bytes)
Extract text content from a Microsoft Word DOCM file.
Extracts text from a macro-enabled Word document by parsing the internal XML structure and removing XML tags to obtain plain text content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The DOCM file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted plain text content with XML tags removed and normalized whitespace. |
Raises:
Type | Description |
---|---|
BadZipFile | If the DOCM file is corrupted or not a valid ZIP archive. |
KeyError | If the required word/document.xml file is not found. |
UnicodeDecodeError | If the XML content encoding is not supported. |
Examples:
>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docm(content)
>>> print(text)
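The approach described for DOCM files (open the ZIP container, read word/document.xml, strip the tags) can be sketched with the standard library alone. The helper name docm_to_text, the UTF-8 decoding, and the choice to replace each tag with a space are assumptions for illustration, not the function's actual code.

```python
import html
import io
import re
import zipfile


def docm_to_text(doc_bytes: bytes) -> str:
    """Hypothetical sketch: strip XML tags from word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # Raises KeyError if the archive has no word/document.xml.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Replace every tag with a space so adjacent paragraphs do not fuse.
    # A word split across several runs would also gain a space here.
    text = re.sub(r"<[^>]+>", " ", xml)
    text = html.unescape(text)  # turn &amp; and similar entities into characters
    return re.sub(r"\s+", " ", text).strip()


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        "<w:document><w:body>"
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>R&amp;D notes</w:t></w:r></w:p>"
        "</w:body></w:document>",
    )
sample_text = docm_to_text(buffer.getvalue())
```

Passing bytes that are not a ZIP archive makes zipfile raise BadZipFile, which matches the first entry in the Raises table.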
extract_text_from_docx
extract_text_from_docx(docx_bytes)
Extract text content from a Microsoft Word DOCX file.
Extracts text from all paragraphs and tables in the document, preserving the document structure and table formatting. Tables are converted to pipe-separated format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
docx_bytes | bytes | The DOCX file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content with paragraphs and tables separated by double newlines. Table cells are separated by pipe characters. |
Raises:
Type | Description |
---|---|
BadZipFile | If the DOCX file is corrupted or not a valid ZIP archive. |
ValueError | If the file is not a valid DOCX format. |
Examples:
>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_docx(content)
>>> print(text)
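The output shape described above (blocks separated by double newlines, table cells joined with pipes) can be illustrated with a standard-library sketch that walks word/document.xml in document order. The helper names are hypothetical and the real function may well rely on a dedicated DOCX library instead; only the WordprocessingML namespace is taken from the file format itself.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node):
    """Concatenate the text of every w:t run below node."""
    return "".join(t.text or "" for t in node.iter(W + "t"))


def docx_to_text(docx_bytes: bytes) -> str:
    """Hypothetical sketch: paragraphs and tables in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    blocks = []
    for child in root.find(W + "body"):
        if child.tag == W + "p":
            text = _text(child)
            if text.strip():  # assumption: empty paragraphs are skipped
                blocks.append(text)
        elif child.tag == W + "tbl":
            rows = [
                " | ".join(_text(cell) for cell in row.findall(W + "tc"))
                for row in child.findall(W + "tr")
            ]
            blocks.append("\n".join(rows))
    return "\n\n".join(blocks)


# Minimal in-memory document: one paragraph followed by a 1x2 table.
NS = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
SAMPLE_XML = (
    f"<w:document {NS}><w:body>"
    "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
    "<w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE_XML)
docx_sample = docx_to_text(buffer.getvalue())
```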
extract_text_from_hpg
extract_text_from_hpg(doc_bytes)
Convert HPGL/HPG plotter file to readable text.
Converts Hewlett-Packard Graphics Language (HPGL) files to text by first converting them to PDF format and then using Azure Document Intelligence for OCR text extraction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The HPGL/HPG file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content from the plotter file via OCR. |
Raises:
Type | Description |
---|---|
ValueError | If the HPGL file format is not supported. |
ServiceRequestError | If Azure Document Intelligence service fails. |
Examples:
>>> with open('drawing.hpg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_hpg(content)
>>> print(text)
extract_text_from_mcdx
extract_text_from_mcdx(doc_bytes)
Extract text content from a Mathcad Prime MCDX file.
Processes an MCDX file by extracting text from all XAMLPackage files contained within the mathcad/xaml/ directory. Each XAMLPackage is processed individually and their text content is combined.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The MCDX file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Combined text content from all XAMLPackages, separated by carriage return and newline characters. |
Raises:
Type | Description |
---|---|
BadZipFile | If the MCDX file is corrupted or not a valid ZIP archive. |
KeyError | If required XAMLPackage files are not found in the archive. |
XMLSyntaxError | If XAMLPackage XML content is invalid. |
Examples:
>>> with open('calculation.mcdx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_mcdx(content)
>>> print(text)
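The container handling described above can be sketched as follows: every file under mathcad/xaml/ is read from the ZIP archive and the per-package results are joined with carriage return and newline. How a single XAMLPackage is turned into text is not covered by this reference, so the sketch takes that step as a caller-supplied function; extract_package and both helper names are hypothetical.

```python
import io
import zipfile


def iter_xaml_packages(doc_bytes: bytes):
    """Yield (name, bytes) for each file under mathcad/xaml/ in the archive."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: every file in this directory is a XAMLPackage.
            if name.startswith("mathcad/xaml/") and not name.endswith("/"):
                yield name, archive.read(name)


def mcdx_to_text(doc_bytes: bytes, extract_package) -> str:
    """Combine per-package text with CRLF; extract_package is hypothetical."""
    parts = [extract_package(data) for _, data in iter_xaml_packages(doc_bytes)]
    return "\r\n".join(part for part in parts if part)


# In-memory archive with two fake packages and one unrelated file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mathcad/xaml/a.XamlPackage", b"one")
    archive.writestr("mathcad/xaml/b.XamlPackage", b"two")
    archive.writestr("mathcad/worksheet.xml", b"<ignored/>")
mcdx_sample = mcdx_to_text(buffer.getvalue(), lambda data: data.decode("utf-8"))
```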
extract_text_from_msg
extract_text_from_msg(doc_bytes)
Convert a Microsoft Outlook MSG file to readable text.
Extracts email content including subject, sender, recipient, date, and body from a Microsoft Outlook MSG file format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The MSG file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Formatted email content with headers and body text. |
Raises:
Type | Description |
---|---|
ValueError | If the file is not a valid MSG format. |
AttributeError | If required email properties are missing. |
Examples:
>>> with open('email.msg', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_msg(content)
>>> print(text)
Subject: Meeting Tomorrow
From: john@example.com
To: jane@example.com
Date: 2024-01-15
Body: Let's meet at 2pm...
extract_text_from_odt
extract_text_from_odt(doc_bytes)
Extract text content from an OpenDocument Text (ODT) file.
Processes an ODT document by extracting text from all paragraph elements, preserving the document structure and formatting.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The ODT file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content with paragraphs separated by newlines. |
Raises:
Type | Description |
---|---|
BadZipFile | If the ODT file is corrupted or not a valid ZIP archive. |
XMLSyntaxError | If the ODT document contains invalid XML. |
Examples:
>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_odt(content)
>>> print(text)
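Paragraph extraction from an ODT file can be sketched with the standard library: an ODT document is a ZIP archive whose content.xml holds text:p elements. The helper name odt_to_text and the decision to skip empty paragraphs are assumptions for illustration; the real function may use a different parser.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# OpenDocument text namespace (the "text:" prefix in content.xml).
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def odt_to_text(doc_bytes: bytes) -> str:
    """Hypothetical sketch: one output line per non-empty text:p element."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() also picks up text inside nested spans.
    paragraphs = ("".join(p.itertext()) for p in root.iter(TEXT_NS + "p"))
    return "\n".join(p for p in paragraphs if p.strip())


# Minimal in-memory ODT body with two paragraphs.
CONTENT_XML = (
    '<office:document-content '
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "<office:body><office:text>"
    "<text:p>First</text:p>"
    "<text:p>Second <text:span>line</text:span></text:p>"
    "</office:text></office:body>"
    "</office:document-content>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", CONTENT_XML)
odt_sample = odt_to_text(buffer.getvalue())
```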
extract_text_from_pptx
extract_text_from_pptx(pptx_bytes)
Extract text content from PowerPoint PPTX/PPSX/PPTM files.
Extracts text from all slides in the presentation by parsing the XML structure directly. Handles text elements and tables, preserving slide organization and table structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pptx_bytes | bytes | The PowerPoint file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content with slide separators and table formatting. Tables are converted to pipe-separated format. Returns an error message if extraction fails. |
Raises:
Type | Description |
---|---|
BadZipFile | If the PowerPoint file is corrupted or not a valid ZIP archive. |
XMLSyntaxError | If slide XML content is invalid. |
Note
Works with PowerPoint XML formats (PPTX, PPSX, PPTM) but not older binary PowerPoint 1997-2003 documents (PPT).
Examples:
>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_pptx(content)
>>> print(text)
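Parsing the slide XML directly can be sketched as below: slide parts live at ppt/slides/slideN.xml inside the ZIP container, and visible text sits in DrawingML a:t runs grouped by a:p paragraphs. The helper name, the separator line format, and ordering slides by file number are assumptions; this sketch also emits table cell text as plain lines instead of pipe-separated rows.

```python
import io
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace (the "a:" prefix in slide XML).
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_text(pptx_bytes: bytes) -> str:
    """Hypothetical sketch: text of each slide, ordered by slide file number."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Numeric sort so slide10.xml comes after slide2.xml.
        for number, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            lines = [
                "".join(t.text or "" for t in p.iter(A + "t"))
                for p in root.iter(A + "p")
            ]
            lines = [line for line in lines if line.strip()]
            # The separator format is an assumption made for this sketch.
            parts.append(f"--- Slide {number} ---\n" + "\n".join(lines))
    return "\n\n".join(parts)


def _slide_xml(text: str) -> str:
    """Build a minimal slide part containing one text paragraph."""
    return (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        f"<p:cSld><p:spTree><p:sp><p:txBody><a:p><a:r><a:t>{text}</a:t></a:r></a:p>"
        "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
    )


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/slides/slide10.xml", _slide_xml("Ten"))
    archive.writestr("ppt/slides/slide2.xml", _slide_xml("Two"))
    archive.writestr("ppt/slides/slide1.xml", _slide_xml("One"))
pptx_sample = pptx_to_text(buffer.getvalue())
```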
extract_text_from_rtf
extract_text_from_rtf(doc_bytes)
Convert RTF formatted bytes to plain text.
Converts Rich Text Format (RTF) documents to plain text by decoding the bytes and using RTF parsing libraries to extract readable content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | RTF formatted data as bytes. | required |
Returns:
Type | Description |
---|---|
str | Plain text content extracted from RTF with formatting removed. |
Raises:
Type | Description |
---|---|
UnicodeDecodeError | If RTF content cannot be decoded properly. |
Examples:
>>> with open('document.rtf', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_rtf(content)
>>> print(text)
extract_text_from_xls
extract_text_from_xls(doc_bytes)
Extract text content from an XLSX/XLSM spreadsheet file.
Extracts text from all worksheets in the spreadsheet, preserving the sheet structure and cell organization. Each sheet is processed individually with sheet names as headers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The XLSX/XLSM file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content with sheet names and cell values separated by pipe characters (\|), with each row on a new line. |
Raises:
Type | Description |
---|---|
BadZipFile | If the file is not a valid ZIP archive. |
ValueError | If the file is not a valid Excel format. |
Examples:
>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xls(content)
>>> print(text)
Sheet: Sheet1
Header1 | Header2 | Header3
Value1 | Value2 | Value3
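The output layout shown in the example can be reproduced with a small formatting helper. It covers only the text layout, not the reading of the workbook; format_sheet is a hypothetical name, and rendering empty cells as empty strings is an assumption.

```python
def format_sheet(name, rows):
    """Render one worksheet as a header line plus pipe-separated rows.

    Hypothetical helper: rows is any iterable of cell-value sequences.
    """
    lines = [f"Sheet: {name}"]
    for row in rows:
        # Assumption: empty cells (None) become empty strings.
        lines.append(" | ".join("" if value is None else str(value) for value in row))
    return "\n".join(lines)


sheet_text = format_sheet(
    "Sheet1",
    [
        ["Header1", "Header2", "Header3"],
        ["Value1", "Value2", "Value3"],
    ],
)
```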
extract_text_from_xmcd
extract_text_from_xmcd(xmcd_bytes)
Extract and process text from a Mathcad XMCD file.
Processes an XMCD file by parsing its XML content and extracting text elements, filtering out empty lines and limiting output to the first 100 text elements.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
xmcd_bytes | bytes | The XMCD file content as bytes. | required |
Returns:
Type | Description |
---|---|
str | Extracted text content separated by carriage return and newline characters. |
Raises:
Type | Description |
---|---|
XMLSyntaxError | If the XMCD file contains invalid XML. |
UnicodeDecodeError | If the file encoding is not supported. |
Examples:
>>> with open('calculation.xmcd', 'rb') as f:
...     content = f.read()
>>> text = extract_text_from_xmcd(content)
>>> print(text)
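The filtering and the 100-element cap can be sketched with the standard library's XML parser. The helper name xmcd_to_text is hypothetical, and the sketch assumes that empty entries are dropped before the limit is applied; the reference does not say which comes first, nor which elements the real function inspects.

```python
import xml.etree.ElementTree as ET


def xmcd_to_text(xmcd_bytes: bytes, limit: int = 100) -> str:
    """Hypothetical sketch: first `limit` non-empty text values, CRLF-joined."""
    root = ET.fromstring(xmcd_bytes)
    texts = []
    for element in root.iter():
        value = (element.text or "").strip()
        if not value:
            continue  # skip elements with no text or whitespace only
        texts.append(value)
        if len(texts) == limit:
            break
    return "\r\n".join(texts)


# Element names below are made up for the demonstration.
SAMPLE = (
    b"<worksheet><regions>"
    b"<region><text><p>Beam design</p></text></region>"
    b"<region><text><p>   </p></text></region>"
    b"<region><text><p>Load = 5 kN</p></text></region>"
    b"</regions></worksheet>"
)
xmcd_sample = xmcd_to_text(SAMPLE)

# A document with 150 text elements is cut off after the first 100.
many = b"<w>" + b"".join(b"<p>x</p>" for _ in range(150)) + b"</w>"
capped_lines = xmcd_to_text(many).split("\r\n")
```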
general_purpose_read
general_purpose_read(blob, filetype, chunk_pages=False)
Extract text content from various file formats using appropriate parsers.
Analyzes the file type and routes to the most appropriate extraction method. Supports a wide range of document formats including Office documents, PDFs, images, and specialized formats like Mathcad and CAD files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
blob | bytes | The document content as bytes, or string for plain text. | required |
filetype | str | The file extension indicating the document type. | required |
chunk_pages | bool | If True, returns a dict with page numbers as keys (only supported for certain formats). If False, returns the concatenated text as a single string (default behavior). | False |
Returns:
Type | Description |
---|---|
str \| dict | Extracted text content from the document. Returns empty string for compressed archives or if extraction fails. |
Raises:
Type | Description |
---|---|
Exception | Various exceptions depending on the file type and extraction method used. Logs errors and re-raises the exception. |
Examples:
>>> with open('document.pdf', 'rb') as f:
...     content = f.read()
>>> text = general_purpose_read(content, 'pdf')
>>> print(text)
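Routing by file extension is commonly done with a lookup table that maps each extension to an extractor. The sketch below shows that pattern under stated assumptions: the handler table, the archive set, the route name, and the ValueError for unknown extensions are all hypothetical and are not taken from general_purpose_read. In a real table the entries would point at the extractors documented above.

```python
def read_plain_text(blob):
    """Return text unchanged, decoding bytes as UTF-8 when needed."""
    if isinstance(blob, bytes):
        return blob.decode("utf-8", errors="replace")
    return blob


# Hypothetical routing table; a real one would also map e.g. "docx" or
# "pptx" to the corresponding extract_text_from_* function.
HANDLERS = {"txt": read_plain_text, "md": read_plain_text}

# Assumed set of archive extensions that yield an empty string.
ARCHIVES = {"zip", "7z", "gz"}


def route(blob, filetype, handlers=HANDLERS):
    """Normalize the extension and dispatch to the matching handler."""
    extension = filetype.lower().lstrip(".")
    if extension in ARCHIVES:
        return ""  # compressed archives are not extracted
    handler = handlers.get(extension)
    if handler is None:
        # Assumption: the sketch rejects unknown types explicitly.
        raise ValueError(f"unsupported file type: {filetype!r}")
    return handler(blob)


routed_text = route(b"plain notes", ".TXT")
routed_archive = route(b"", "zip")
```

A table like this keeps the dispatcher short and makes adding a format a one-line change.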