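openxmllib: a Python library for processing OpenXML documents
openxmllib provides a Python library for processing OpenXML documents. It requires lxml.
The Office Open XML formats use the Open Packaging Conventions, as does the XML Paper Specification (XPS). The two formats nevertheless differ in several important ways. XPS is a paginated, fixed-layout document format introduced with the Microsoft Windows Vista operating system, whereas Office Open XML is a fully editable file format for Office Word 2007, Office Excel 2007, and Office PowerPoint 2007. Although both rely on XML and ZIP compression, they differ considerably in file format design and intended use.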
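Because an Office Open XML file is a ZIP archive of XML parts, its layout can be inspected with the Python standard library alone. The sketch below is independent of openxmllib: ``list_package_parts`` and ``build_demo_package`` are hypothetical helper names, and the small in-memory archive only imitates the layout of a .docx file (it is not a valid Word document).

```python
import io
import zipfile


def list_package_parts(source):
    """Return the sorted part names stored in a ZIP-based package.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def build_demo_package():
    """Build a tiny in-memory ZIP that imitates the layout of a .docx file.

    Assumption: these three part names are typical of a Word document;
    the XML bodies are placeholders, not real content.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("docProps/core.xml", "<coreProperties/>")
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_package_parts(build_demo_package()):
        print(name)
```

Passing the path of a real file, for example ``list_package_parts('office.docx')``, should list that document's parts in the same way.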
The following examples show the main features::
>>> import openxmllib
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # Raises a ValueError for unsupported office files.
>>> doc.mimeType
'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
>>> doc.coreProperties # Keys may depend on application
{'title': u'blah...', u'creator': u'John Doe', ...}
>>> doc.extendedProperties # Keys may depend on application
{'Words': u'312', 'Application': u'Your favorite word processor', ...}
>>> doc.customProperties # May return an empty mapping
{'My property': u'My value', ...}
>>> doc.allProperties # Merges core+extended+custom properties (see above)
{...}
>>> doc.indexableText(include_properties=False)
u'all the words of that document body'
>>> doc.indexableText(include_properties=True)
u'all the words of that document body and all properties values'
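To give a rough idea of where values such as ``title`` and ``creator`` come from, here is a minimal standard-library sketch that reads the core properties part of a package. It is not openxmllib's implementation (which relies on lxml). It assumes the conventional part name ``docProps/core.xml``, whereas a robust reader would locate the part through the package relationships. ``read_core_properties`` and ``build_demo_core_package`` are hypothetical names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the core properties live at their conventional location.
CORE_PROPERTIES_PART = "docProps/core.xml"


def read_core_properties(source):
    """Return {local tag name: text} for the children of the core part.

    Returns an empty dict when the package has no core properties part.
    """
    with zipfile.ZipFile(source) as package:
        if CORE_PROPERTIES_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PROPERTIES_PART))
    properties = {}
    for child in root:
        # Tags look like "{namespace}title"; keep only the local name.
        local_name = child.tag.rsplit("}", 1)[-1]
        properties[local_name] = (child.text or "").strip()
    return properties


def build_demo_core_package():
    """Build an in-memory package holding only a small core properties part."""
    core_xml = (
        "<cp:coreProperties"
        ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>blah</dc:title>"
        "<dc:creator>John Doe</dc:creator>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(CORE_PROPERTIES_PART, core_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_core_properties(build_demo_core_package()))
```

As with ``doc.coreProperties``, the keys found in a real file depend on the application that wrote it.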
Standard ``mimetypes`` package extensions ::
>>> import mimetypes
>>> mimetypes.guess_type('somedoc.docx')
('application/vnd.openxmlformats-officedocument.wordprocessingml.document', None)
>>> mimetypes.guess_type('somecalc.xlsx')
('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', None)
>>> mimetypes.guess_type('someslides.pptx')
('application/vnd.openxmlformats-officedocument.presentationml.presentation', None)
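How openxmllib registers these types internally is not shown here. As a hedged sketch, a comparable effect can be obtained without the library by calling the standard ``mimetypes.add_type`` function. ``OPENXML_TYPES`` and ``register_openxml_types`` are hypothetical names, and recent Python versions may already know some of these extensions.

```python
import mimetypes

# MIME types copied from the examples above, keyed by file extension.
OPENXML_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def register_openxml_types():
    """Register the OpenXML extensions with the standard mimetypes module."""
    for extension, mime_type in OPENXML_TYPES.items():
        mimetypes.add_type(mime_type, extension)


if __name__ == "__main__":
    register_openxml_types()
    for filename in ("somedoc.docx", "somecalc.xlsx", "someslides.pptx"):
        print(filename, mimetypes.guess_type(filename))
```

Note that ``guess_type`` returns a ``(type, encoding)`` tuple, so the MIME type itself is the first element.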
Document factory signatures::
>>> # We have the path for the office file
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # We have a file object for the office file
>>> fh = open('office.docx', 'rb')
>>> doc = openxmllib.openXmlDocument(file_=fh)
>>> # We have the URL for the office file
>>> doc = openxmllib.openXmlDocument(url='http://domain.tld/office.docx')
>>> # We have the raw data of the office file
>>> import mimetypes
>>> # guess_type() returns a (type, encoding) tuple; keep only the type
>>> docx_mimetype = mimetypes.guess_type('office.docx')[0]
>>> body = open('office.docx', 'rb').read()
>>> doc = openxmllib.openXmlDocument(data=body, mime_type=docx_mimetype)
Note that if you're not running a Python application, you may get the indexable
text from a document with the `openxmlinfo.py` console utility. Just type::