PDF

fixed_chunking

fixed_chunking(doc_bytes, split_by='page', chunk_size=1)

Split a PDF document into chunks either by pages or character count.

This function provides flexible chunking of PDF documents for processing large files in smaller segments. It supports both page-based and character-based chunking strategies.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The complete PDF file as a byte array.	required
`split_by`	`str`	The chunking strategy. Valid options: "page" (split by pages, returning PDF byte chunks) or "char" (split by character count, returning text string chunks).	`'page'`
`chunk_size`	`int`	The size of each chunk. In "page" mode, the number of pages per chunk; in "char" mode, the number of characters per chunk.	`1`

Returns:

Type	Description
`Union[List[bytes], List[str]]`	If split_by="page", a list of PDF chunks as byte arrays. If split_by="char", a list of text chunks as strings.

Raises:

Type	Description
`ValueError`	If split_by is not "page" or "char".
`FileDataError`	If the PDF data is corrupted or invalid.
`MemoryError`	If the PDF is too large to process or chunk_size is too small for character-based chunking.

Examples:

>>> with open('large_document.pdf', 'rb') as f:
...     pdf_data = f.read()
>>> # Split into 5-page chunks
>>> page_chunks = fixed_chunking(pdf_data, split_by="page", chunk_size=5)
>>> # Split into 1000-character text chunks
>>> text_chunks = fixed_chunking(pdf_data, split_by="char", chunk_size=1000)
>>> print(f"Created {len(page_chunks)} page chunks")

Note

For page-based chunking, each chunk is a valid PDF document that can be processed independently. For character-based chunking, the entire document text is extracted first, which may be memory-intensive for large files.
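
The character-based strategy described above can be illustrated with a minimal sketch. It assumes the document text has already been extracted into a single string; `chunk_text` is a hypothetical helper written for illustration, not part of this module, and the real function may differ in how it handles edge cases.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper: illustrates fixed-size character chunking only.
    """
    if chunk_size < 1:
        # Assumption: non-positive sizes are rejected; the documented
        # function does not state how it treats them.
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the final chunk
    # may be shorter than the others.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("abcdefghij", chunk_size=4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Because the whole text must exist as one string before slicing, memory use grows with document size, which is the cost the note above refers to.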

table_extractor

table_extractor(pdf_bytes, start_page=0, end_page=None)

Extract tables from specified pages of a PDF document.

Processes pages within the specified range and returns all discovered tables as structured lists.

Parameters:

Name	Type	Description	Default
`pdf_bytes`	`bytes`	The PDF file as a byte array.	required
`start_page`	`int`	The first page to scan (0-based index).	`0`
`end_page`	`Optional[int]`	The page index at which scanning stops (0-based, exclusive), so the last page scanned is end_page - 1. If None, processes from start_page to the end of the document.	`None`

Returns:

Type	Description
`List[List[List[str]]]`	A list of tables. Each table is a list of rows, and each row is a list of cell values as strings. Returns an empty list if no tables are found.

Raises:

Type	Description
`FileDataError`	If the PDF data is corrupted or invalid.
`IndexError`	If start_page is greater than the number of pages in the document.
`MemoryError`	If the PDF is too large to process in memory.

Examples:

>>> with open('report_with_tables.pdf', 'rb') as f:
...     pdf_data = f.read()
>>> tables = table_extractor(pdf_data, start_page=2, end_page=5)
>>> for i, table in enumerate(tables):
...     print(f"Table {i+1}: {len(table)} rows, {len(table[0]) if table else 0} columns")
...     for row in table[:3]:  # Print first 3 rows
...         print(row)

Note

The function automatically handles page range validation to prevent errors when end_page exceeds the actual number of pages. Table detection accuracy depends on the PDF structure and may vary with different document layouts.
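
The page range handling described above might look roughly like the following sketch. `resolve_page_range` is a hypothetical helper, and its exact behavior (clamping `end_page` to the page count, raising `IndexError` only when `start_page` exceeds the page count) is an assumption drawn from the parameter and error descriptions, not from the actual implementation. Negative indices are not covered.

```python
from typing import Optional


def resolve_page_range(start_page: int, end_page: Optional[int], page_count: int) -> range:
    """Return the 0-based page indices to scan (hypothetical helper)."""
    if start_page > page_count:
        # Mirrors the documented IndexError condition.
        raise IndexError(
            f"start_page {start_page} exceeds page count {page_count}"
        )
    # end_page is exclusive; None means "through the last page".
    # Clamp so that an oversized end_page does not cause an error.
    stop = page_count if end_page is None else min(end_page, page_count)
    # Guard against end_page < start_page by yielding an empty range.
    return range(start_page, max(stop, start_page))


print(list(resolve_page_range(2, 5, page_count=10)))   # [2, 3, 4]
print(list(resolve_page_range(0, 99, page_count=3)))   # [0, 1, 2]
```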

toc_finder

toc_finder(text)

Extract and format the Table of Contents section from PDF text using AI.

This function uses pattern matching to locate the Table of Contents section in extracted PDF text, then employs Azure OpenAI to clean and format the content into readable Markdown format.

Args:
    text: The extracted text content from a PDF document, typically
        from the first few pages where TOC is usually located.

Returns:
    The extracted and formatted TOC text in Markdown format,
        or None if no Table of Contents section is found.

Raises:
    Exception: If Azure OpenAI client initialization or API call fails.
    ValueError: If the extracted text cannot be processed by the AI model.

Examples:
    >>> pdf_text = "Table of Contents\n\nIntroduction...\nMethods..."
    >>> toc_markdown = toc_finder(pdf_text)
    >>> if toc_markdown:
    ...     print(toc_markdown)
    # 1. Introduction
    # 2. Methods
    # ...

Note: The function searches for "table of contents" or "contents" (case-insensitive) and extracts up to 2000 characters following the match for AI processing. Requires Azure OpenAI credentials to be properly configured.
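
The pattern-matching step that precedes the AI call can be sketched as below. This is an illustration under stated assumptions: `locate_toc_text`, `TOC_PATTERN`, and `MAX_TOC_CHARS` are hypothetical names, the regular expression is one plausible reading of the search described in the note, and the slice is assumed to begin immediately after the matched heading. The Azure OpenAI formatting step is deliberately left out.

```python
import re
from typing import Optional

# Assumption: the longer phrase is listed first so that "Table of Contents"
# is consumed as a whole when both alternatives match at the same position.
TOC_PATTERN = re.compile(r"table of contents|contents", re.IGNORECASE)
MAX_TOC_CHARS = 2000  # Limit stated in the note above.


def locate_toc_text(text: str) -> Optional[str]:
    """Return up to MAX_TOC_CHARS characters after the TOC heading, or None."""
    match = TOC_PATTERN.search(text)
    if match is None:
        return None
    start = match.end()
    # Slicing past the end of the string is safe in Python.
    return text[start:start + MAX_TOC_CHARS]


raw = locate_toc_text("Table of Contents\n\nIntroduction...\nMethods...")
print(repr(raw))  # '\n\nIntroduction...\nMethods...'
```

Returning None when no heading is found lets a caller skip the model call entirely, which matches the documented return value for documents without a Table of Contents.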