Metadata

extract_filetype_description

extract_filetype_description(file_extension)

Extract a descriptive name for a file type based on its extension.

Looks up the file extension in a YAML configuration file to return a human-readable description of the file type.

Parameters:

Name	Type	Description	Default
`file_extension`	`str`	The file extension (e.g., 'pdf', 'docx', 'jpg').	required

Returns:

Type	Description
`str`	A formatted description in the format 'EXT. Description.' where EXT is the uppercase extension and Description is the human-readable file type description.

Raises:

Type	Description
`FileNotFoundError`	If the file_extensions.yml file is not found.
`KeyError`	If the file extension is not found in the configuration.

Examples:

>>> extract_filetype_description('pdf')
'PDF. Portable Document Format.'

>>> extract_filetype_description('docx')
'DOCX. Microsoft Word Document.'
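
The lookup-and-format step can be sketched as follows. This is a minimal illustration, not the library's implementation: the `FILE_TYPES` dictionary is a hypothetical stand-in for the parsed contents of file_extensions.yml (whose real layout is not shown here), and `describe_extension` is an illustrative name.

```python
# Hypothetical stand-in for the parsed contents of file_extensions.yml;
# the real file's layout is not shown in this documentation.
FILE_TYPES = {
    "pdf": "Portable Document Format",
    "docx": "Microsoft Word Document",
}


def describe_extension(file_extension: str) -> str:
    """Return 'EXT. Description.' for a known extension (illustrative helper)."""
    # A missing key raises KeyError, mirroring the documented behaviour.
    description = FILE_TYPES[file_extension]
    return f"{file_extension.upper()}. {description}."


print(describe_extension("pdf"))
```

Letting the `KeyError` propagate keeps unknown extensions visible to the caller instead of silently returning a placeholder.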

extract_metadata_doc

extract_metadata_doc(doc_bytes)

Extract core properties from a Word 97-2003 DOC document.

Extracts metadata from legacy Word documents using OLE file parsing. Handles both summary and document summary properties.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The DOC document content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing available metadata properties. Values are represented as strings using repr() for consistency.

Raises:

Type	Description
`OleFileError`	If the document is not a valid OLE file.
`AttributeError`	If metadata extraction fails.

Examples:

>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_doc(content)
>>> print(metadata)
{'title': "'My Document'", 'author': "'John Doe'"}

extract_metadata_docm

extract_metadata_docm(doc_bytes)

Extract metadata from a DOCX or DOCM file using direct XML parsing.

Alternative approach to extract metadata by directly parsing the XML structure of Office documents. Extracts both core and custom properties from the document's internal XML files.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The DOCX or DOCM document content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing core and custom properties with non-None values. Datetime strings are formatted as DD-MMM-YYYY.

Raises:

Type	Description
`BadZipFile`	If the document is corrupted or not a valid ZIP archive.
`KeyError`	If required XML files are not found in the document.
`XMLSyntaxError`	If the document contains invalid XML.

Examples:

>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docm(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe'}
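
Direct XML parsing of this kind can be sketched with the standard library alone. The example below is an assumption-laden illustration rather than the function's actual code: it builds a tiny in-memory package containing only docProps/core.xml, uses the XML tag names as keys (the real function's key names may differ), and the helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime

# Minimal core-properties part, as found at docProps/core.xml in OOXML files.
CORE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>My Document</dc:title>
  <dc:creator>John Doe</dc:creator>
  <dcterms:created>2024-01-15T09:30:00Z</dcterms:created>
</cp:coreProperties>"""


def build_sample_package() -> bytes:
    """Create an in-memory ZIP holding only the core-properties part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", CORE_XML)
    return buffer.getvalue()


def read_core_properties(doc_bytes: bytes) -> dict:
    """Parse docProps/core.xml into a flat dict (illustrative helper)."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # ZipFile.read raises KeyError when the member is missing.
        root = ET.fromstring(archive.read("docProps/core.xml"))
    properties = {}
    for element in root:
        if element.text is None:
            continue  # keep only non-None values
        # Strip the '{namespace}' prefix that ElementTree adds to tag names.
        name = element.tag.split("}", 1)[-1]
        value = element.text
        if name in ("created", "modified"):
            # Reformat W3CDTF timestamps as DD-MMM-YYYY.
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            value = parsed.strftime("%d-%b-%Y")
        properties[name] = value
    return properties


print(read_core_properties(build_sample_package()))
```

Note that `%b` is locale-dependent; the month abbreviation is English only under the default C locale.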

extract_metadata_docx

extract_metadata_docx(doc_bytes)

Extract both core and custom properties from a Word DOCX or DOCM document.

Extracts document metadata including core properties (title, author, dates) and custom properties defined by the user. Handles both DOCX and DOCM formats.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The DOCX or DOCM document content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type	Description
`BadZipFile`	If the document is corrupted or not a valid ZIP archive.
`ValueError`	If the document is not a valid DOCX/DOCM format.
`XMLSyntaxError`	If the document contains invalid XML.

Examples:

>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docx(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe', 'Created': '15-Jan-2024'}
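
How custom properties might be read and merged with core properties is sketched below. It assumes the standard OOXML layout of docProps/custom.xml; the helper names, the sample property names, and the merge rule (custom values override core values with the same key) are illustrative assumptions, not confirmed behaviour of `extract_metadata_docx`.

```python
import xml.etree.ElementTree as ET

# Sample custom-properties part in the standard OOXML layout.
CUSTOM_XML = """<Properties
    xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
    xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Project">
    <vt:lpwstr>Apollo</vt:lpwstr>
  </property>
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Reviewed">
    <vt:bool>true</vt:bool>
  </property>
</Properties>"""


def parse_custom_properties(custom_xml: str) -> dict:
    """Map each custom property's name to the text of its typed child."""
    root = ET.fromstring(custom_xml)
    properties = {}
    for prop in root:
        # Each <property> holds one typed child such as <vt:lpwstr>.
        value_element = prop[0] if len(prop) else None
        properties[prop.get("name")] = (
            value_element.text if value_element is not None else None
        )
    return properties


def merge_properties(core: dict, custom: dict) -> dict:
    """Combine both dicts and drop entries whose value is None."""
    combined = {**core, **custom}
    return {key: value for key, value in combined.items() if value is not None}


core = {"Title": "My Document", "Author": "John Doe", "Subject": None}
print(merge_properties(core, parse_custom_properties(CUSTOM_XML)))
```

The typed values are kept as raw strings here; converting `vt:bool` or `vt:filetime` to Python types would be a separate step.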

extract_metadata_iam

extract_metadata_iam(iam_bytes, file_name)

Extract referenced files from an Autodesk Inventor assembly (IAM) file.

Analyzes the assembly file content to identify all referenced part (IPT) and assembly (IAM) files. Filters out self-references and duplicates.

Parameters:

Name	Type	Description	Default
`iam_bytes`	`bytes`	The IAM assembly file content as bytes.	required
`file_name`	`str`	The name of the current assembly file.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing 'AssemblyPartReferences' key with a list of unique referenced file names.

Raises:

Type	Description
`UnicodeDecodeError`	If the file content cannot be decoded as UTF-16.
`ValueError`	If the file format is not recognized.

Examples:

>>> with open('assembly.iam', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_iam(content, 'assembly.iam')
>>> print(metadata)
{'AssemblyPartReferences': ['part1.ipt', 'part2.ipt', 'subassembly.iam']}
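
The filtering of self-references and duplicates can be sketched as below. The real IAM layout is a proprietary binary format, so this example makes strong assumptions: it treats the content as UTF-16-LE text, finds names with a simple regular expression, and runs on synthetic bytes. `find_references` and `REFERENCE_PATTERN` are hypothetical names.

```python
import re

# Assumed pattern: a run of word characters or hyphens ending in .ipt or .iam.
REFERENCE_PATTERN = re.compile(r"[\w\-]+\.(?:ipt|iam)\b", re.IGNORECASE)


def find_references(iam_bytes: bytes, file_name: str) -> dict:
    """Collect unique referenced IPT/IAM names, excluding the file itself."""
    # Strict decoding raises UnicodeDecodeError on invalid UTF-16 input.
    text = iam_bytes.decode("utf-16-le")
    seen = set()
    references = []
    for name in REFERENCE_PATTERN.findall(text):
        key = name.lower()
        if key == file_name.lower() or key in seen:
            continue  # skip self-references and duplicates
        seen.add(key)
        references.append(name)
    return {"AssemblyPartReferences": references}


# Synthetic content: names separated by NUL characters, with one duplicate.
names = ["assembly.iam", "part1.ipt", "part2.ipt", "part1.ipt", "subassembly.iam"]
sample = "\x00".join(names).encode("utf-16-le")
print(find_references(sample, "assembly.iam"))
```

Comparing lowercase names avoids treating `Part1.ipt` and `part1.ipt` as different files, which matches how Windows paths behave.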

extract_metadata_image

extract_metadata_image(image_bytes)

Extract metadata from an image (PNG, JPEG, TIFF, etc.).

Parameters:

Name	Type	Description	Default
`image_bytes`	`bytes`	The image content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	A dictionary with EXIF metadata (if available), image size, and format.

extract_metadata_ipt

extract_metadata_ipt(file_url, metadata_url, image_url)

Submit an IPT or IAM file to the Autodesk Design Automation API for metadata extraction.

Processes Autodesk Inventor part (IPT) or assembly (IAM) files through the Design Automation API to extract metadata and generate thumbnails. Uses a custom deployed extraction script.

Parameters:

Name	Type	Description	Default
`file_url`	`str`	Signed URL where the IPT/IAM file can be downloaded.	required
`metadata_url`	`str`	Signed URL where the output metadata JSON will be uploaded.	required
`image_url`	`str`	Signed URL where the output thumbnail BMP will be uploaded.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing metadata properties with tag names as keys and text content as values. The thumbnail image is not returned directly; it is uploaded to `image_url`.

Raises:

Type	Description
`ConnectionError`	If the Design Automation API is unreachable.
`ValueError`	If the URLs are malformed or invalid.
`AuthenticationError`	If API authentication fails.

Note

The API returns two outputs:

- metadata: JSON containing all core metadata and bill of materials
- image: BMP thumbnail in isometric projection

Examples:

>>> metadata = extract_metadata_ipt(
...     'https://storage.example.com/file.ipt',
...     'https://storage.example.com/metadata.json',
...     'https://storage.example.com/thumbnail.bmp'
... )
>>> print(metadata)

extract_metadata_odt

extract_metadata_odt(doc_bytes)

Extract core properties from an OpenDocument Text (ODT) document.

Extracts metadata from ODT files by parsing the document's metadata elements. Handles various metadata fields defined in the OpenDocument standard.

Parameters:

Name	Type	Description	Default
`doc_bytes`	`bytes`	The ODT document content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing metadata properties with tag names as keys and text content as values.

Raises:

Type	Description
`BadZipFile`	If the ODT file is corrupted or not a valid ZIP archive.
`XMLSyntaxError`	If the ODT document contains invalid XML.

Examples:

>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_odt(content)
>>> print(metadata)
{'title': 'My Document', 'creator': 'John Doe'}
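
Reading the metadata elements of an ODT file can be sketched with the standard library. An ODT file is a ZIP package whose metadata lives in meta.xml under the `office:meta` element; the helper names and the in-memory sample below are illustrative assumptions rather than the function's actual code.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Minimal meta.xml using the OpenDocument and Dublin Core namespaces.
META_XML = f"""<office:document-meta
    xmlns:office="{OFFICE_NS}"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <dc:title>My Document</dc:title>
    <dc:creator>John Doe</dc:creator>
  </office:meta>
</office:document-meta>"""


def build_sample_odt() -> bytes:
    """Create an in-memory ZIP holding only meta.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META_XML)
    return buffer.getvalue()


def read_odt_metadata(doc_bytes: bytes) -> dict:
    """Return {tag name: text} for each child of <office:meta>."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    meta = root.find(f"{{{OFFICE_NS}}}meta")
    if meta is None:
        return {}
    # Drop the '{namespace}' prefix so keys are plain tag names.
    return {
        child.tag.split("}", 1)[-1]: child.text
        for child in meta
        if child.text is not None
    }


print(read_odt_metadata(build_sample_odt()))
```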

extract_metadata_pptx

extract_metadata_pptx(ppt_bytes)

Extract metadata from a PPTX, PPSX, or PPTM presentation.

Extracts core, custom, and application properties from PowerPoint files using direct XML parsing. Handles all PowerPoint XML-based formats.

Parameters:

Name	Type	Description	Default
`ppt_bytes`	`bytes`	The PowerPoint presentation content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Combined dictionary of core, custom, and application properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type	Description
`BadZipFile`	If the presentation is corrupted or not a valid ZIP archive.
`XMLSyntaxError`	If the presentation contains invalid XML.
`ValueError`	If the file is not a valid PowerPoint format.

Example

>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_pptx(content)
>>> print(metadata)

extract_metadata_svg

extract_metadata_svg(svg_bytes)

Extract metadata from an SVG file.

Extracts width, height, title, description, and other metadata elements from Scalable Vector Graphics (SVG) files.

Parameters:

Name	Type	Description	Default
`svg_bytes`	`bytes`	The SVG file content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing SVG metadata including dimensions, title, description, and format information.

Raises:

Type	Description
`UnicodeDecodeError`	If the SVG file encoding is not supported.
`XMLSyntaxError`	If the SVG contains invalid XML.

Example

>>> with open('image.svg', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_svg(content)
>>> print(metadata)
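
One way to pull dimensions, title, and description out of an SVG is sketched below. The returned key names (`format`, `width`, `height`, `title`, `description`) and the helper name are assumptions for illustration; the documented function's exact keys are not listed above.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

SAMPLE_SVG = (
    b'<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80">'
    b"<title>Logo</title><desc>Company logo</desc></svg>"
)


def read_svg_metadata(svg_bytes: bytes) -> dict:
    """Collect basic metadata from the root <svg> element (illustrative helper)."""
    root = ET.fromstring(svg_bytes)
    metadata = {
        "format": "SVG",
        "width": root.get("width"),
        "height": root.get("height"),
        # <title> and <desc> are direct children in the SVG namespace.
        "title": root.findtext(f"{{{SVG_NS}}}title"),
        "description": root.findtext(f"{{{SVG_NS}}}desc"),
    }
    # Drop attributes and elements the file does not define.
    return {key: value for key, value in metadata.items() if value is not None}


print(read_svg_metadata(SAMPLE_SVG))
```

Width and height are kept as strings because SVG allows units such as `120px` or `10cm` in these attributes.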

extract_metadata_xlsx

extract_metadata_xlsx(xls_bytes)

Extract both core and custom properties from an XLSX or XLSM spreadsheet.

Extracts metadata from Excel files by parsing the internal XML structure. Handles both core properties and custom properties defined by users.

Parameters:

Name	Type	Description	Default
`xls_bytes`	`bytes`	The XLSX or XLSM spreadsheet content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type	Description
`BadZipFile`	If the spreadsheet is corrupted or not a valid ZIP archive.
`XMLSyntaxError`	If the spreadsheet contains invalid XML.
`ValueError`	If the file is not a valid Excel format.

Example

>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_xlsx(content)
>>> print(metadata)

extract_metadata_zip

extract_metadata_zip(zip_bytes)

Extract metadata from a ZIP archive.

Analyzes the contents of a ZIP file to extract information about the archived files including file count and file listing.

Parameters:

Name	Type	Description	Default
`zip_bytes`	`bytes`	The ZIP file content as bytes.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary containing archive metadata with 'archive_item_count' and 'archive' keys, or empty dict if extraction fails.

Raises:

Type	Description
`BadZipFile`	If the file is not a valid ZIP archive.
`ValueError`	If the ZIP file is corrupted.

Example

>>> with open('archive.zip', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_zip(content)
>>> print(metadata)
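
A file count and listing of this kind can be produced with the standard `zipfile` module, as in the sketch below. The key names come from the Returns description; the value shapes (an integer count and a list of member names), the empty-dict fallback on `BadZipFile`, and the helper name are assumptions.

```python
import io
import zipfile


def summarize_zip(zip_bytes: bytes) -> dict:
    """Return the member count and listing, or {} if the bytes are not a ZIP."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        # Assumed fallback: report nothing rather than raise.
        return {}
    return {"archive_item_count": len(names), "archive": names}


# Build a small archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "hello")
    archive.writestr("data/values.csv", "a,b\n1,2\n")

print(summarize_zip(buffer.getvalue()))
```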