Metadata
extract_filetype_description
extract_filetype_description(file_extension)
Extract a descriptive name for a file type based on its extension.
Looks up the file extension in a YAML configuration file to return a human-readable description of the file type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_extension | str | The file extension (e.g., 'pdf', 'docx', 'jpg'). | required |
Returns:
Type | Description |
---|---|
str | A formatted description in the format 'EXT. Description.' where EXT is the uppercase extension and Description is the human-readable file type description. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the file_extensions.yml file is not found. |
KeyError | If the file extension is not found in the configuration. |
Examples:
>>> extract_filetype_description('pdf')
'PDF. Portable Document Format.'
>>> extract_filetype_description('docx')
'DOCX. Microsoft Word Document.'
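The lookup-and-format step can be sketched as follows. This is a minimal illustration, not the library's implementation: the `describe_extension` helper, the `FILE_TYPE_DESCRIPTIONS` mapping, and the normalization of case and a leading dot are all assumptions, and a plain dict stands in for the parsed YAML file, whose layout is not shown here.

```python
# Hypothetical stand-in for the parsed contents of file_extensions.yml;
# the real file's layout is not shown in this reference.
FILE_TYPE_DESCRIPTIONS = {
    "pdf": "Portable Document Format",
    "docx": "Microsoft Word Document",
}


def describe_extension(file_extension, descriptions=FILE_TYPE_DESCRIPTIONS):
    """Return 'EXT. Description.' or raise KeyError for unknown extensions."""
    # Assumption: tolerate upper-case input and a leading dot.
    key = file_extension.lower().lstrip(".")
    return f"{key.upper()}. {descriptions[key]}."


print(describe_extension("pdf"))
```

An unknown extension falls through to the dict lookup and raises `KeyError`, which matches the behavior listed under Raises.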
extract_metadata_doc
extract_metadata_doc(doc_bytes)
Extract core properties from a Word 97-2003 DOC document.
Extracts metadata from legacy Word documents using OLE file parsing. Handles both summary and document summary properties.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The DOC document content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing available metadata properties. Values are represented as strings using repr() for consistency. |
Raises:
Type | Description |
---|---|
OleFileError | If the document is not a valid OLE file. |
AttributeError | If metadata extraction fails. |
Examples:
>>> with open('document.doc', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_doc(content)
>>> print(metadata)
{'title': "'My Document'", 'author': "'John Doe'"}
extract_metadata_docm
extract_metadata_docm(doc_bytes)
Extract metadata from a DOCX or DOCM file using direct XML parsing.
Alternative approach to extract metadata by directly parsing the XML structure of Office documents. Extracts both core and custom properties from the document's internal XML files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The DOCX or DOCM document content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing core and custom properties with non-None values. Datetime strings are formatted as DD-MMM-YYYY. |
Raises:
Type | Description |
---|---|
BadZipFile | If the document is corrupted or not a valid ZIP archive. |
KeyError | If required XML files are not found in the document. |
XMLSyntaxError | If the document contains invalid XML. |
Examples:
>>> with open('document.docm', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docm(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe'}
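To illustrate what "direct XML parsing" of an Office document can look like, here is a minimal sketch that reads the core properties part with the standard library only. The helper name `read_core_properties` is hypothetical, the keys are the raw XML tag names rather than the display names shown in the example above, and the standard-library parser raises `xml.etree.ElementTree.ParseError` instead of `XMLSyntaxError`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_core_properties(doc_bytes):
    """Return {local tag name: text} from docProps/core.xml (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        # ZipFile.read raises KeyError when the part is missing.
        core_xml = archive.read("docProps/core.xml")
    root = ET.fromstring(core_xml)
    props = {}
    for child in root:
        # Tags look like '{namespace}title'; keep only the local part.
        name = child.tag.rsplit("}", 1)[-1]
        if child.text is not None:
            props[name] = child.text
    return props


# Build a tiny stand-in archive so the sketch runs without a real document.
CORE_XML = (
    "<cp:coreProperties "
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>My Document</dc:title><dc:creator>John Doe</dc:creator>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", CORE_XML)

sample_core = read_core_properties(buffer.getvalue())
print(sample_core)
```

Mapping tag names such as `creator` to display names such as `Author` would be a separate step that this sketch leaves out.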
extract_metadata_docx
extract_metadata_docx(doc_bytes)
Extract both core and custom properties from a Word DOCX or DOCM document.
Extracts document metadata including core properties (title, author, dates) and custom properties defined by the user. Handles both DOCX and DOCM formats.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The DOCX or DOCM document content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format. |
Raises:
Type | Description |
---|---|
BadZipFile | If the document is corrupted or not a valid ZIP archive. |
ValueError | If the document is not a valid DOCX/DOCM format. |
XMLSyntaxError | If the document contains invalid XML. |
Examples:
>>> with open('document.docx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_docx(content)
>>> print(metadata)
{'Title': 'My Document', 'Author': 'John Doe', 'Created': '15-Jan-2024'}
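The post-processing described under Returns (dropping None values and rendering datetimes as DD-MMM-YYYY) might look roughly like this. The `normalize_properties` helper and the sample input are illustrative assumptions; `%b` yields English month abbreviations only under the default C locale.

```python
from datetime import datetime


def normalize_properties(raw):
    """Drop None values and format datetimes as DD-MMM-YYYY (hypothetical helper)."""
    result = {}
    for key, value in raw.items():
        if value is None:
            continue  # only non-None values are kept
        if isinstance(value, datetime):
            value = value.strftime("%d-%b-%Y")
        result[key] = value
    return result


sample_normalized = normalize_properties(
    {
        "Title": "My Document",
        "Author": "John Doe",
        "Subject": None,
        "Created": datetime(2024, 1, 15, 9, 30),
    }
)
print(sample_normalized)
```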
extract_metadata_iam
extract_metadata_iam(iam_bytes, file_name)
Extract referenced files from an Autodesk Inventor assembly (IAM) file.
Analyzes the assembly file content to identify all referenced part (IPT) and assembly (IAM) files. Filters out self-references and duplicates.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iam_bytes | bytes | The IAM assembly file content as bytes. | required |
file_name | str | The name of the current assembly file. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing 'AssemblyPartReferences' key with a list of unique referenced file names. |
Raises:
Type | Description |
---|---|
UnicodeDecodeError | If the file content cannot be decoded as UTF-16. |
ValueError | If the file format is not recognized. |
Examples:
>>> with open('assembly.iam', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_iam(content, 'assembly.iam')
>>> print(metadata)
{'AssemblyPartReferences': ['part1.ipt', 'part2.ipt', 'subassembly.iam']}
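The filtering of self-references and duplicates can be sketched as below. How the real function locates references inside the binary file is not documented here, so the regular expression, the little-endian UTF-16 decoding, and the `find_assembly_references` name are all assumptions made for illustration.

```python
import re

# Assumed pattern: a run of word characters or hyphens followed by .ipt or .iam.
REFERENCE_PATTERN = re.compile(r"[\w\-]+\.(?:ipt|iam)", re.IGNORECASE)


def find_assembly_references(iam_bytes, file_name):
    """Return unique IPT/IAM names found in the content, excluding file_name."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-16 (little-endian).
    text = iam_bytes.decode("utf-16-le")
    references = []
    seen = {file_name.lower()}  # pre-seeding the set drops self-references
    for match in REFERENCE_PATTERN.findall(text):
        if match.lower() not in seen:
            seen.add(match.lower())
            references.append(match)
    return {"AssemblyPartReferences": references}


# Synthetic content standing in for a real assembly file.
sample_iam = (
    "part1.ipt\npart2.ipt\nassembly.iam\npart1.ipt\nsubassembly.iam"
).encode("utf-16-le")
sample_refs = find_assembly_references(sample_iam, "assembly.iam")
print(sample_refs)
```

Comparing lower-cased names treats `Part1.ipt` and `part1.ipt` as the same file, which is a design choice of this sketch rather than documented behavior.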
extract_metadata_image
extract_metadata_image(image_bytes)
Extract metadata from an image (PNG, JPEG, TIFF, etc.).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_bytes | bytes | The image content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing EXIF metadata (if available), image size, and format. |
extract_metadata_ipt
extract_metadata_ipt(file_url, metadata_url, image_url)
Submit an IPT file to the Autodesk Design Automation API for metadata extraction.
Processes Autodesk Inventor part (IPT) or assembly (IAM) files through the Design Automation API to extract metadata and generate thumbnails. Uses a custom deployed extraction script.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_url | str | Signed URL where the IPT/IAM file can be downloaded. | required |
metadata_url | str | Signed URL where the output metadata JSON will be uploaded. | required |
image_url | str | Signed URL where the output thumbnail BMP will be uploaded. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing metadata properties with tag names as keys and text content as values. The thumbnail image is not returned directly; it is uploaded to image_url. |
Raises:
Type | Description |
---|---|
ConnectionError | If the Design Automation API is unreachable. |
ValueError | If the URLs are malformed or invalid. |
AuthenticationError | If API authentication fails. |
Note
The API returns two outputs:
- metadata: JSON containing all core metadata and the bill of materials
- image: BMP thumbnail in isometric projection
Examples:
>>> metadata = extract_metadata_ipt(
... 'https://storage.example.com/file.ipt',
... 'https://storage.example.com/metadata.json',
... 'https://storage.example.com/thumbnail.bmp'
... )
>>> print(metadata)
extract_metadata_odt
extract_metadata_odt(doc_bytes)
Extract core properties from an OpenDocument Text (ODT) document.
Extracts metadata from ODT files by parsing the document's metadata elements. Handles various metadata fields defined in the OpenDocument standard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc_bytes | bytes | The ODT document content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing metadata properties with tag names as keys and text content as values. |
Raises:
Type | Description |
---|---|
BadZipFile | If the ODT file is corrupted or not a valid ZIP archive. |
XMLSyntaxError | If the ODT document contains invalid XML. |
Examples:
>>> with open('document.odt', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_odt(content)
>>> print(metadata)
{'title': 'My Document', 'creator': 'John Doe'}
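An ODT file is a ZIP archive whose `meta.xml` part holds the metadata elements, so the "tag names as keys, text content as values" shape can be reproduced with a short standard-library sketch. The `read_odt_meta` helper is hypothetical and the real function may treat nested or attribute-only elements differently.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def read_odt_meta(doc_bytes):
    """Return {local tag name: text} for the children of <office:meta>."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    meta = root.find(f"{{{OFFICE_NS}}}meta")
    if meta is None:
        return {}
    # Strip the '{namespace}' prefix and skip elements without text.
    return {child.tag.rsplit("}", 1)[-1]: child.text for child in meta if child.text}


# Minimal stand-in for a real ODT archive.
META_XML = (
    "<office:document-meta "
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<office:meta><dc:title>My Document</dc:title>"
    "<dc:creator>John Doe</dc:creator></office:meta>"
    "</office:document-meta>"
)
odt_buffer = io.BytesIO()
with zipfile.ZipFile(odt_buffer, "w") as archive:
    archive.writestr("meta.xml", META_XML)

sample_odt = read_odt_meta(odt_buffer.getvalue())
print(sample_odt)
```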
extract_metadata_pptx
extract_metadata_pptx(ppt_bytes)
Extract metadata from a PPTX, PPSX, or PPTM presentation.
Extracts core, custom, and application properties from PowerPoint files using direct XML parsing. Handles all PowerPoint XML-based formats.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ppt_bytes | bytes | The PowerPoint presentation content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Combined dictionary of core, custom, and application properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format. |
Raises:
Type | Description |
---|---|
BadZipFile | If the presentation is corrupted or not a valid ZIP archive. |
XMLSyntaxError | If the presentation contains invalid XML. |
ValueError | If the file is not a valid PowerPoint format. |
Examples:
>>> with open('presentation.pptx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_pptx(content)
>>> print(metadata)
extract_metadata_svg
extract_metadata_svg(svg_bytes)
Extract metadata from an SVG file.
Extracts width, height, title, description, and other metadata elements from Scalable Vector Graphics (SVG) files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
svg_bytes | bytes | The SVG file content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing SVG metadata including dimensions, title, description, and format information. |
Raises:
Type | Description |
---|---|
UnicodeDecodeError | If the SVG file encoding is not supported. |
XMLSyntaxError | If the SVG contains invalid XML. |
Examples:
>>> with open('image.svg', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_svg(content)
>>> print(metadata)
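As a rough picture of the kind of data involved, the sketch below pulls width, height, title, and description out of an SVG with the standard library. The `read_svg_metadata` name and the result keys are assumptions; the actual function's key names and extra fields are not documented here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def read_svg_metadata(svg_bytes):
    """Collect dimensions, title, and description from an SVG (hypothetical helper)."""
    root = ET.fromstring(svg_bytes)
    info = {"width": root.get("width"), "height": root.get("height")}
    title = root.find(f"{SVG_NS}title")
    desc = root.find(f"{SVG_NS}desc")
    if title is not None:
        info["title"] = title.text
    if desc is not None:
        info["description"] = desc.text
    # Omit attributes and elements that are absent or empty.
    return {key: value for key, value in info.items() if value is not None}


SAMPLE_SVG = (
    b'<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">'
    b"<title>Logo</title><desc>Company logo</desc></svg>"
)
sample_svg_meta = read_svg_metadata(SAMPLE_SVG)
print(sample_svg_meta)
```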
extract_metadata_xlsx
extract_metadata_xlsx(xls_bytes)
Extract both core and custom properties from an XLSX or XLSM spreadsheet.
Extracts metadata from Excel files by parsing the internal XML structure. Handles both core properties and custom properties defined by users.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
xls_bytes | bytes | The XLSX or XLSM spreadsheet content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Combined dictionary of core and custom properties with non-None values. Datetime objects are converted to DD-MMM-YYYY format. |
Raises:
Type | Description |
---|---|
BadZipFile | If the spreadsheet is corrupted or not a valid ZIP archive. |
XMLSyntaxError | If the spreadsheet contains invalid XML. |
ValueError | If the file is not a valid Excel format. |
Examples:
>>> with open('spreadsheet.xlsx', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_xlsx(content)
>>> print(metadata)
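User-defined properties in Office Open XML files live in the `docProps/custom.xml` part, where each `<property>` element wraps a single typed value. A minimal sketch of reading them follows; `read_custom_properties` is a hypothetical helper, and returning an empty dict when the part is absent is an assumption, not documented behavior.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def read_custom_properties(xlsx_bytes):
    """Return {property name: text value} from docProps/custom.xml."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        if "docProps/custom.xml" not in archive.namelist():
            return {}  # assumption: no custom part means no custom properties
        root = ET.fromstring(archive.read("docProps/custom.xml"))
    props = {}
    for prop in root.findall(f"{CUSTOM_NS}property"):
        # Each <property> holds one typed child such as <vt:lpwstr>.
        value = next(iter(prop), None)
        if value is not None and value.text is not None:
            props[prop.get("name")] = value.text
    return props


# Minimal stand-in for a real workbook archive.
CUSTOM_XML = (
    "<Properties "
    'xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" '
    'xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Project">'
    "<vt:lpwstr>Apollo</vt:lpwstr></property></Properties>"
)
xlsx_buffer = io.BytesIO()
with zipfile.ZipFile(xlsx_buffer, "w") as archive:
    archive.writestr("docProps/custom.xml", CUSTOM_XML)

sample_custom = read_custom_properties(xlsx_buffer.getvalue())
print(sample_custom)
```

Values come back as strings in this sketch; converting typed values (dates, numbers, booleans) would need an extra step keyed on the child element's tag.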
extract_metadata_zip
extract_metadata_zip(zip_bytes)
Extract metadata from a ZIP archive.
Analyzes the contents of a ZIP file to extract information about the archived files including file count and file listing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
zip_bytes | bytes | The ZIP file content as bytes. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary containing archive metadata with 'archive_item_count' and 'archive' keys, or empty dict if extraction fails. |
Raises:
Type | Description |
---|---|
BadZipFile | If the file is not a valid ZIP archive. |
ValueError | If the ZIP file is corrupted. |
Examples:
>>> with open('archive.zip', 'rb') as f:
...     content = f.read()
>>> metadata = extract_metadata_zip(content)
>>> print(metadata)
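The result shape described under Returns can be sketched with the standard `zipfile` module. The `summarize_zip` name is hypothetical, the `'archive'` value is assumed to be a list of member names, and the sketch follows the "empty dict if extraction fails" wording by catching `BadZipFile` rather than propagating it.

```python
import io
import zipfile


def summarize_zip(zip_bytes):
    """Return item count and member names, or {} for an unreadable archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return {}
    return {"archive_item_count": len(names), "archive": names}


# Build a small archive in memory so the sketch is self-contained.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("a.txt", "alpha")
    archive.writestr("docs/b.txt", "beta")

sample_zip_meta = summarize_zip(zip_buffer.getvalue())
print(sample_zip_meta)
```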