PDF

fixed_chunking

fixed_chunking(doc_bytes, split_by='page', chunk_size=1)

Split a PDF document into chunks, either by page or by character count.

This function provides flexible chunking of PDF documents for processing large files in smaller segments. It supports both page-based and character-based chunking strategies.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_bytes | bytes | The complete PDF file as a byte array. | required |
| split_by | str | The chunking strategy. "page" splits by pages and returns PDF byte chunks; "char" splits by character count and returns text string chunks. | 'page' |
| chunk_size | int | The size of each chunk: the number of pages per chunk in "page" mode, or the number of characters per chunk in "char" mode. | 1 |

Returns:

| Type | Description |
| --- | --- |
| Union[List[bytes], List[str]] | If split_by="page": a list of PDF chunks as byte arrays. If split_by="char": a list of text chunks as strings. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If split_by is not "page" or "char". |
| FileDataError | If the PDF data is corrupted or invalid. |
| MemoryError | If the PDF is too large to process or chunk_size is too small for character-based chunking. |

Examples:

>>> with open('large_document.pdf', 'rb') as f:
...     pdf_data = f.read()
>>> # Split into 5-page chunks
>>> page_chunks = fixed_chunking(pdf_data, split_by="page", chunk_size=5)
>>> # Split into 1000-character text chunks
>>> text_chunks = fixed_chunking(pdf_data, split_by="char", chunk_size=1000)
>>> print(f"Created {len(page_chunks)} page chunks")

Note

For page-based chunking, each chunk is a valid PDF document that can be processed independently. For character-based chunking, the entire document text is extracted first, which may be memory-intensive for large files.
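
The two strategies come down to simple index arithmetic. The sketch below is illustrative only and does not touch a real PDF: `page_ranges` and `char_chunks` are hypothetical helpers, not part of this module. It assumes that the final chunk holds whatever remains when the total is not a multiple of chunk_size, and that a chunk_size below 1 is rejected; the documentation above states neither.

```python
from typing import List, Tuple


def page_ranges(page_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, stop) page index pairs, stop exclusive (hypothetical helper)."""
    if chunk_size < 1:
        # Assumption: the real function may handle this case differently.
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def char_chunks(text: str, chunk_size: int) -> List[str]:
    """Slice already-extracted text into fixed-size pieces (hypothetical helper)."""
    if chunk_size < 1:
        # Assumption: the real function may handle this case differently.
        raise ValueError("chunk_size must be at least 1")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 12-page document in 5-page chunks: the last chunk holds the 2 leftover pages.
print(page_ranges(12, 5))
# Ten characters in 4-character chunks: the last chunk is shorter.
print(char_chunks("abcdefghij", 4))
```

Because character chunks are cut at fixed offsets, a chunk boundary can fall in the middle of a word or sentence.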

table_extractor

table_extractor(pdf_bytes, start_page=0, end_page=None)

Extract tables from specified pages of a PDF document.

Processes pages within the specified range and returns all discovered tables as structured lists.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_bytes | bytes | The PDF file as a byte array. | required |
| start_page | int | The first page to scan (0-based index). | 0 |
| end_page | Optional[int] | The page index at which scanning stops (0-based, exclusive), so the last page scanned is end_page - 1. If None, scanning runs from start_page to the end of the document. | None |

Returns:

| Type | Description |
| --- | --- |
| List[List[List[str]]] | A list of tables. Each table is a list of rows, and each row is a list of cell values as strings. The list is empty if no tables are found. |
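
The nested return shape can be easier to work with once each row is keyed by column name. The sketch below uses a hand-written sample in the documented shape rather than a real call to table_extractor. `table_to_records` is a hypothetical helper, and it assumes the first row of each table is a header row, which the function does not guarantee.

```python
from typing import Dict, List

Table = List[List[str]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Treat the first row as a header and map each later row onto it."""
    if not table:
        return []
    header, *rows = table
    # zip() stops at the shorter sequence, so extra cells in a long row are
    # dropped and a short row simply omits the missing keys.
    return [dict(zip(header, row)) for row in rows]


# Hand-written sample: tables -> rows -> cell strings.
tables = [
    [["Name", "Qty"], ["Bolt", "40"], ["Nut", "12"]],
    [],  # a table with no rows, handled defensively
]
for table in tables:
    print(table_to_records(table))
```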

Raises:

| Type | Description |
| --- | --- |
| FileDataError | If the PDF data is corrupted or invalid. |
| IndexError | If start_page is greater than the number of pages in the document. |
| MemoryError | If the PDF is too large to process in memory. |

Examples:

>>> with open('report_with_tables.pdf', 'rb') as f:
...     pdf_data = f.read()
>>> tables = table_extractor(pdf_data, start_page=2, end_page=5)
>>> for i, table in enumerate(tables):
...     columns = len(table[0]) if table else 0  # guard against a table with no rows
...     print(f"Table {i+1}: {len(table)} rows, {columns} columns")
...     for row in table[:3]:  # Print first 3 rows
...         print(row)

Note

The function automatically handles page range validation to prevent errors when end_page exceeds the actual number of pages. Table detection accuracy depends on the PDF structure and may vary with different document layouts.
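
One plausible reading of this page range handling is sketched below. `resolve_page_range` is a hypothetical helper, not the library's code. It assumes an oversized end_page is clamped to the page count, and it raises IndexError only when start_page is greater than the page count, as the Raises table describes.

```python
from typing import Optional


def resolve_page_range(start_page: int, end_page: Optional[int], page_count: int) -> range:
    """Return the 0-based page indices to scan (hypothetical helper)."""
    if start_page > page_count:
        raise IndexError(
            f"start_page {start_page} exceeds the document's {page_count} pages"
        )
    # Compare against None explicitly so that end_page=0 is not mistaken for "no limit".
    stop = page_count if end_page is None else min(end_page, page_count)
    return range(start_page, stop)


print(list(resolve_page_range(2, 5, 10)))    # end_page is exclusive
print(list(resolve_page_range(2, 50, 4)))    # oversized end_page is clamped
print(list(resolve_page_range(1, None, 3)))  # None runs to the last page
```

Under these assumptions, a start_page equal to the page count yields an empty range rather than an error.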

toc_finder

toc_finder(text)

Extract and format the Table of Contents section from PDF text using AI.

This function uses pattern matching to locate the Table of Contents section
in extracted PDF text, then employs Azure OpenAI to clean and format the
content into readable Markdown format.

Args:
    text: The extracted text content from a PDF document, typically
        from the first few pages where TOC is usually located.

Returns:
    The extracted and formatted TOC text in Markdown format,
        or None if no Table of Contents section is found.

Raises:
    Exception: If Azure OpenAI client initialization or API call fails.
    ValueError: If the extracted text cannot be processed by the AI model.

Examples:
    >>> pdf_text = "Table of Contents\n1. Introduction...\n2. Methods..."
    >>> toc_markdown = toc_finder(pdf_text)
    >>> if toc_markdown:
    ...     print(toc_markdown)
    # 1. Introduction
    # 2. Methods
    # ...

Note

The function searches for "table of contents" or "contents" (case-insensitive) and extracts up to 2000 characters following the match for AI processing. Azure OpenAI credentials must be configured before calling it.
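
The pattern-matching step that runs before the AI call might look roughly like the sketch below. `locate_toc_window` and the exact regular expression are assumptions made for illustration, not the function's actual code. The sketch also assumes the 2000-character window starts immediately after the matched heading. The Azure OpenAI formatting step is left out.

```python
import re
from typing import Optional

# Assumed pattern: either heading, matched case-insensitively.
TOC_HEADING = re.compile(r"table of contents|contents", re.IGNORECASE)
WINDOW = 2000  # characters kept after the heading, per the note above


def locate_toc_window(text: str, window: int = WINDOW) -> Optional[str]:
    """Return the text that follows the first TOC heading, or None (hypothetical helper)."""
    match = TOC_HEADING.search(text)
    if match is None:
        return None
    # Slicing past the end of the string is safe; it just returns less text.
    return text[match.end():match.end() + window]


sample = "Preface\nTable of Contents\n1. Introduction\n2. Methods\n"
print(repr(locate_toc_window(sample)))
print(locate_toc_window("A document with no such heading"))
```

A bare "contents" match can also fire on ordinary body text, which is one reason to pass only the first few pages of the document.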