Analyzing OOXML
===============

OOXML files can be thought of at six levels of abstraction:

-  As a file
-  As a Zip archive
-  As a set of Parts (a Part has a *name* and an array of data bytes
   called a *stream*)
-  As a set of Parts with data in well defined formats (usually XML)
-  As a graph of Parts, connected by Relationships, and having
   associated metadata (specifically Content-Type)
-  As a document with Features and CoreProperties

Each abstraction level completely contains all the levels beneath it;
there are no leaks. That is, all Features are implemented as
Parts & Relationships, all Parts are contained in the Zip file, etc. This
concept will be explained more below.

Different types of analysis can be done at each of these levels.

File
----

At its simplest, an OOXML file is a single file; that is, a filename and
a sequence of bytes. OfficeDissector's Document class takes this file
and exposes its deeper content:

::

    import officedissector
    doc = officedissector.doc.Document('test/fraunhoferlibrary/Artikel.docx') # Returns a Document object

Note that the filename's extension is signficant and affects the
behavior of Microsoft Office. Member variables ``doc.type``,
``doc.is_macro_enabled`` and ``doc.is_template`` expose the meaning of
the filename extension.

Zip file
--------

An OOXML file is a Zip archive (each member of this archive is called a
Part, described below). Method `:func:officedissector.doc.zip` provides a Zip object which
provides access at this level of abstraction. However, most analysis is
best performed at the deeper levels of abstraction below.

Part
----

Parts are the heart of OOXML. **The entire Document is defined by its
Parts** (other than the limited information provided by the filename's
extension and Zip metadata described above). Even the Parts' own
metadata is stored in Parts (as will be described below). Thus,
**analyzing OOXML is about analyzing Parts**.

``Document.parts`` provides a list of all Parts in the document.

At their simplest level, Parts have only two properties: a \_a *name*
(exposed through ``Part.name``) and an array of data bytes called a
*stream* (exposed through ``Part.stream()``).

::

    for p in doc.parts:
        print p.name # p.name returns a String
        print p.stream().read(10) # p.stream() returns a File-like object

XML Parts
---------

Most Parts are in XML format. OfficeDissector will parse the XML for
you, using the ``Part.xml()`` method, or perform XPath queries
(recommended), using the ``Part.xpath()`` method.

Content-Type
------------

Relationshos
------------

Features and CoreProperties
---------------------------

