Metadata

extract_filetype_description

extract_filetype_description(file_extension)

Extract a descriptive name for a file type based on its extension.

Looks up the file extension in a YAML configuration file to return a human-readable description of the file type.

Parameters:

Name Type Description Default
file_extension str

The file extension (e.g., 'pdf', 'docx', 'jpg').

required

Returns:

Type Description
str

A formatted description in the format 'EXT. Description.' where EXT is the uppercase extension and Description is the human-readable file type description.

Raises:

Type Description
FileNotFoundError

If the file_extensions.yml file is not found.

KeyError

If the file extension is not found in the configuration.

Examples:

>>> extract_filetype_description('pdf')
'PDF. Portable Document Format.'
>>> extract_filetype_description('docx')
'DOCX. Microsoft Word Document.'
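
The lookup and formatting described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the in-memory `FILE_EXTENSIONS` mapping is a hypothetical stand-in for the parsed contents of file_extensions.yml, and `describe_extension` is an illustrative name.

```python
from typing import Mapping

# Hypothetical stand-in for the parsed contents of file_extensions.yml.
FILE_EXTENSIONS: Mapping[str, str] = {
    "pdf": "Portable Document Format",
    "docx": "Microsoft Word Document",
}


def describe_extension(
    file_extension: str, table: Mapping[str, str] = FILE_EXTENSIONS
) -> str:
    """Return 'EXT. Description.' for a known extension."""
    # An unknown extension raises KeyError, matching the documented behaviour.
    description = table[file_extension]
    return f"{file_extension.upper()}. {description}."
```

Callers that prefer a fallback over an exception can wrap the call in `try`/`except KeyError`.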

extract_metadata_doc

extract_metadata_doc(doc_bytes)

Extract core properties from a Word 97-2003 DOC document.

Extracts metadata from legacy Word documents using OLE file parsing. Handles both summary and document summary properties.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOC document content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing available metadata properties. Values are represented as strings using repr() for consistency.

Raises:

Type Description
OleFileError

If the document is not a valid OLE file.

AttributeError

If metadata extraction fails.

Examples:

>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_doc(content)
>>> print(metadata)
{'title': "'My Document'", 'author': "'John Doe'"}
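
Because the values are repr() strings, callers may want to turn them back into Python objects. The sketch below shows one possible way to do that with the standard library; `decode_repr_values` is a hypothetical helper and not part of this module.

```python
import ast
from typing import Any, Dict


def decode_repr_values(metadata: Dict[str, str]) -> Dict[str, Any]:
    """Convert repr() strings back to Python literals where possible."""
    decoded: Dict[str, Any] = {}
    for key, value in metadata.items():
        try:
            # Handles literals such as "'My Document'", "42" or "b'raw'".
            decoded[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Non-literal reprs (for example datetime objects) stay as strings.
            decoded[key] = value
    return decoded
```

`ast.literal_eval` only evaluates literals, so it is safe to apply to untrusted metadata.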

extract_metadata_docm

extract_metadata_docm(doc_bytes)

Extract metadata from a DOCX or DOCM file using direct XML parsing.

Alternative approach to extract metadata by directly parsing the XML structure of Office documents. Extracts both core and custom properties from the document's internal XML files.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOCX or DOCM document content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing core and custom properties with non-None values. Datetime strings are formatted as DD-MMM-YYYY.

Raises:

Type Description
BadZipFile

If the document is corrupted or not a valid ZIP archive.

KeyError

If required XML files are not found in the document.

XMLSyntaxError

If the document contains invalid XML.

Examples:

>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docm(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe'}
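
The direct XML parsing approach can be illustrated with the standard library alone. This sketch assumes the core properties live in `docProps/core.xml`, which is where Office Open XML packages normally store them; `read_core_properties` is a hypothetical name, and the keys it returns are the raw XML tag names rather than the capitalised keys shown in the example above. It uses `xml.etree.ElementTree`, which raises `ParseError` on invalid XML instead of `XMLSyntaxError`.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict


def read_core_properties(doc_bytes: bytes) -> Dict[str, str]:
    """Read non-empty core properties from an Office Open XML package."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # ZipFile.read raises KeyError if the part is missing.
        core_xml = archive.read("docProps/core.xml")
    root = ET.fromstring(core_xml)
    properties: Dict[str, str] = {}
    for element in root:
        # Tags look like '{namespace}title'; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if element.text:
            properties[local_name] = element.text
    return properties
```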

extract_metadata_docx

extract_metadata_docx(doc_bytes)

Extract both core and custom properties from a Word DOCX or DOCM document.

Extracts document metadata including core properties (title, author, dates) and custom properties defined by the user. Handles both DOCX and DOCM formats.

Parameters:

Name Type Description Default
doc_bytes bytes

The DOCX or DOCM document content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type Description
BadZipFile

If the document is corrupted or not a valid ZIP archive.

ValueError

If the document is not a valid DOCX/DOCM format.

XMLSyntaxError

If the document contains invalid XML.

Examples:

>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docx(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe', 'Created': '15-Jan-2024'}
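
The DD-MMM-YYYY format shown for 'Created' corresponds to the `strftime` pattern `%d-%b-%Y`. The helper below is a small sketch of that conversion; `format_property_date` is a hypothetical name, and accepting ISO 8601 strings with a trailing `Z` is an assumption for illustration. Note that `%b` depends on the active locale.

```python
from datetime import datetime
from typing import Union


def format_property_date(value: Union[datetime, str]) -> str:
    """Format a datetime (or a UTC timestamp string) as DD-MMM-YYYY."""
    if isinstance(value, str):
        # Assumed input shape: '2024-01-15T10:30:00Z'.
        value = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return value.strftime("%d-%b-%Y")
```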

extract_metadata_iam

extract_metadata_iam(iam_bytes, file_name)

Extract referenced files from an Autodesk Inventor assembly (IAM) file.

Analyzes the assembly file content to identify all referenced part (IPT) and assembly (IAM) files. Filters out self-references and duplicates.

Parameters:

Name Type Description Default
iam_bytes bytes

The IAM assembly file content as bytes.

required
file_name str

The name of the current assembly file.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing 'AssemblyPartReferences' key with a list of unique referenced file names.

Raises:

Type Description
UnicodeDecodeError

If the file content cannot be decoded as UTF-16.

ValueError

If the file format is not recognized.

Examples:

>>> with open('assembly.iam', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_iam(content, 'assembly.iam')
>>> print(metadata)
{'AssemblyPartReferences': ['part1.ipt', 'part2.ipt', 'subassembly.iam']}
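
The filtering of self-references and duplicates might look like the sketch below. It is an illustration only: the regular expression, the case-insensitive comparison, and the name `find_references` are assumptions, and the real function may locate references differently.

```python
import re
from typing import Any, Dict, List

# Assumed pattern: a run of word characters or hyphens followed by .ipt or .iam.
REFERENCE_PATTERN = re.compile(r"[\w\-]+\.(?:ipt|iam)\b", re.IGNORECASE)


def find_references(iam_bytes: bytes, file_name: str) -> Dict[str, Any]:
    """Collect unique IPT/IAM file names, excluding the assembly itself."""
    # Raises UnicodeDecodeError if the content is not valid UTF-16.
    text = iam_bytes.decode("utf-16")
    references: List[str] = []
    for name in REFERENCE_PATTERN.findall(text):
        is_self = name.lower() == file_name.lower()
        if not is_self and name not in references:
            references.append(name)  # preserves first-seen order
    return {"AssemblyPartReferences": references}
```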

extract_metadata_image

extract_metadata_image(image_bytes)

Extract metadata from an image (PNG, JPEG, TIFF, etc.).

Parameters:

Name Type Description Default
image_bytes bytes

The image content as bytes.

required

Returns:

Type Description
Dict[str, Any]

A dictionary with EXIF metadata (if available), image size, and format.

extract_metadata_ipt

extract_metadata_ipt(file_url, metadata_url, image_url)

Submit an IPT or IAM file to the Autodesk Design Automation API for metadata extraction.

Processes Autodesk Inventor part (IPT) or assembly (IAM) files through the Design Automation API to extract metadata and generate thumbnails. Uses a custom deployed extraction script.

Parameters:

Name Type Description Default
file_url str

Signed URL where the IPT/IAM file can be downloaded.

required
metadata_url str

Signed URL where the output metadata JSON will be uploaded.

required
image_url str

Signed URL where the output thumbnail BMP will be uploaded.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing metadata properties with tag names as keys and text content as values. The thumbnail image is not returned directly; it is uploaded to image_url.

Raises:

Type Description
ConnectionError

If the Design Automation API is unreachable.

ValueError

If the URLs are malformed or invalid.

AuthenticationError

If API authentication fails.

Note

The API returns two outputs:

- metadata: JSON containing all core metadata and the bill of materials
- image: BMP thumbnail in isometric projection

Examples:

>>> metadata = extract_metadata_ipt(
...     'https://storage.example.com/file.ipt',
...     'https://storage.example.com/metadata.json',
...     'https://storage.example.com/thumbnail.bmp'
... )
>>> print(metadata)

extract_metadata_odt

extract_metadata_odt(doc_bytes)

Extract core properties from an OpenDocument Text (ODT) document.

Extracts metadata from ODT files by parsing the document's metadata elements. Handles various metadata fields defined in the OpenDocument standard.

Parameters:

Name Type Description Default
doc_bytes bytes

The ODT document content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing metadata properties with tag names as keys and text content as values.

Raises:

Type Description
BadZipFile

If the ODT file is corrupted or not a valid ZIP archive.

XMLSyntaxError

If the ODT document contains invalid XML.

Examples:

>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_odt(content)
>>> print(metadata)
{'title': 'My Document', 'creator': 'John Doe'}
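
A minimal sketch of this kind of parsing, using only the standard library, is shown below. It assumes the metadata sits under the `office:meta` element of `meta.xml`, as the OpenDocument package layout specifies; `read_odt_meta` is a hypothetical name, and `xml.etree.ElementTree` raises `ParseError` rather than `XMLSyntaxError`.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict

OFFICE_NS = {"office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0"}


def read_odt_meta(doc_bytes: bytes) -> Dict[str, str]:
    """Map each metadata tag name in meta.xml to its text content."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    meta = root.find("office:meta", OFFICE_NS)
    if meta is None:
        return {}
    properties: Dict[str, str] = {}
    for element in meta:
        # Strip the '{namespace}' prefix so keys are plain tag names.
        tag = element.tag.rsplit("}", 1)[-1]
        if element.text:
            properties[tag] = element.text
    return properties
```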

extract_metadata_pptx

extract_metadata_pptx(ppt_bytes)

Extract metadata from a PPTX, PPSX, or PPTM presentation.

Extracts core, custom, and application properties from PowerPoint files using direct XML parsing. Handles all PowerPoint XML-based formats.

Parameters:

Name Type Description Default
ppt_bytes bytes

The PowerPoint presentation content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Combined dictionary of core, custom, and application properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type Description
BadZipFile

If the presentation is corrupted or not a valid ZIP archive.

XMLSyntaxError

If the presentation contains invalid XML.

ValueError

If the file is not a valid PowerPoint format.

Examples:

>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_pptx(content)
>>> print(metadata)

extract_metadata_svg

extract_metadata_svg(svg_bytes)

Extract metadata from an SVG file.

Extracts width, height, title, description, and other metadata elements from Scalable Vector Graphics (SVG) files.

Parameters:

Name Type Description Default
svg_bytes bytes

The SVG file content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing SVG metadata including dimensions, title, description, and format information.

Raises:

Type Description
UnicodeDecodeError

If the SVG file encoding is not supported.

XMLSyntaxError

If the SVG contains invalid XML.

Examples:

>>> with open('image.svg', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_svg(content)
>>> print(metadata)
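
To make the extracted fields concrete, the sketch below reads dimensions, title, and description from an SVG with the standard library. The key names in the returned dictionary and the function name `read_svg_metadata` are assumptions for illustration; the actual keys returned by extract_metadata_svg are not documented here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SVG_NS = "{http://www.w3.org/2000/svg}"


def read_svg_metadata(svg_bytes: bytes) -> Dict[str, str]:
    """Collect basic metadata from the root <svg> element and its children."""
    root = ET.fromstring(svg_bytes)
    metadata: Dict[str, str] = {"format": "SVG"}
    for attribute in ("width", "height", "viewBox"):
        value = root.get(attribute)
        if value is not None:
            metadata[attribute] = value
    for tag in ("title", "desc"):
        # Only direct children of <svg> describe the image as a whole.
        element = root.find(f"{SVG_NS}{tag}")
        if element is not None and element.text:
            metadata[tag] = element.text.strip()
    return metadata
```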

extract_metadata_xlsx

extract_metadata_xlsx(xls_bytes)

Extract both core and custom properties from an XLSX or XLSM spreadsheet.

Extracts metadata from Excel files by parsing the internal XML structure. Handles both core properties and custom properties defined by users.

Parameters:

Name Type Description Default
xls_bytes bytes

The XLSX or XLSM spreadsheet content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format.

Raises:

Type Description
BadZipFile

If the spreadsheet is corrupted or not a valid ZIP archive.

XMLSyntaxError

If the spreadsheet contains invalid XML.

ValueError

If the file is not a valid Excel format.

Examples:

>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_xlsx(content)
>>> print(metadata)
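
Custom properties in Office Open XML packages are normally stored in `docProps/custom.xml`, where each `property` element carries a `name` attribute and a single typed child holding the value. The sketch below reads them as strings under that assumption; `read_custom_properties` is a hypothetical name, and treating a missing part as "no custom properties" is a design choice of this sketch.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict


def read_custom_properties(xls_bytes: bytes) -> Dict[str, str]:
    """Return user-defined properties as a name-to-text mapping."""
    with zipfile.ZipFile(io.BytesIO(xls_bytes)) as archive:
        if "docProps/custom.xml" not in archive.namelist():
            return {}  # the part is optional
        root = ET.fromstring(archive.read("docProps/custom.xml"))
    properties: Dict[str, str] = {}
    for prop in root:
        name = prop.get("name")
        # The value lives in the first child, e.g. <vt:lpwstr> or <vt:i4>.
        value = next(iter(prop), None)
        if name and value is not None and value.text is not None:
            properties[name] = value.text
    return properties
```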

extract_metadata_zip

extract_metadata_zip(zip_bytes)

Extract metadata from a ZIP archive.

Analyzes the contents of a ZIP file to extract information about the archived files including file count and file listing.

Parameters:

Name Type Description Default
zip_bytes bytes

The ZIP file content as bytes.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing archive metadata with 'archive_item_count' and 'archive' keys, or an empty dict if extraction fails.

Raises:

Type Description
BadZipFile

If the file is not a valid ZIP archive.

ValueError

If the ZIP file is corrupted.

Examples:

>>> with open('archive.zip', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_zip(content)
>>> print(metadata)
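
The shape of the returned dictionary can be sketched with the standard `zipfile` module. This assumes that 'archive' holds the list of member names, which the documentation above does not state explicitly; `summarize_zip` is a hypothetical name.

```python
import io
import zipfile
from typing import Any, Dict


def summarize_zip(zip_bytes: bytes) -> Dict[str, Any]:
    """Return the member count and member names of a ZIP archive."""
    # zipfile.BadZipFile is raised if the bytes are not a ZIP archive.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
    return {"archive_item_count": len(names), "archive": names}
```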