python-openxml · jrherskovic-mda · Apr 29, 2024 · Nov 6, 2023 · Apr 27, 2024 · Apr 27, 2024
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -3,6 +3,21 @@
 Release History
 ---------------
 
+1.1.2 (2024-05-01)
+++++++++++++++++++
+
+- Fix #1383 Revert lxml<=4.9.2 pin that breaks Python 3.12 install
+- Fix #1385 Support use of Part._rels by python-docx-template
+- Add support and testing for Python 3.12
+
+1.1.1 (2024-04-29)
+++++++++++++++++++
+
+- Fix #531, #1146 Index error on table with misaligned borders
+- Fix #1335 Tolerate invalid float value in bottom-margin
+- Fix #1337 Do not require typing-extensions at runtime
+
+
 1.1.0 (2023-11-03)
 ++++++++++++++++++
 

diff --git a/Makefile b/Makefile
@@ -30,7 +30,8 @@ build:
 	$(BUILD)
 
 clean:
-	find . -type f -name \*.pyc -exec rm {} \;
+	# find . -type f -name \*.pyc -exec rm {} \;
+	fd -e pyc -I -x rm
 	rm -rf dist *.egg-info .coverage .DS_Store
 
 cleandocs:

diff --git a/docs/index.rst b/docs/index.rst
@@ -74,6 +74,7 @@ User Guide
    user/install
    user/quickstart
    user/documents
+   user/tables
    user/text
    user/sections
    user/hdrftr

diff --git a/docs/user/tables.rst b/docs/user/tables.rst
@@ -0,0 +1,202 @@
+.. _tables:
+
+Working with Tables
+===================
+
+Word provides sophisticated capabilities to create tables. As usual, this power comes with
+additional conceptual complexity.
+
+This complexity becomes most apparent when *reading* tables, in particular from documents drawn from
+the wild where there is limited or no prior knowledge as to what the tables might contain or how
+they might be structured.
+
+These are some of the important concepts you'll need to understand.
+
+
+Concept: Simple (uniform) tables
+--------------------------------
+
+::
+
+  +---+---+---+
+  | a | b | c |
+  +---+---+---+
+  | d | e | f |
+  +---+---+---+
+  | g | h | i |
+  +---+---+---+
+
+The basic concept of a table is intuitive enough. You have *rows* and *columns*, and at each (row,
+column) position is a different *cell*. It can be described as a *grid* or a *matrix*. Let's call
+this concept a *uniform table*. A relational database table and a Pandas dataframe are both examples
+of a uniform table.
+
+The following invariants apply to uniform tables:
+
+* Each row has the same number of cells, one for each column.
+* Each column has the same number of cells, one for each row.
+
+
+Complication 1: Merged Cells
+----------------------------
+
+::
+
+  +---+---+---+   +---+---+---+
+  |   a   | b |   |   | b | c |
+  +---+---+---+   + a +---+---+
+  | c | d | e |   |   | d | e |
+  +---+---+---+   +---+---+---+
+  | f | g | h |   | f | g | h |
+  +---+---+---+   +---+---+---+
+
+While very suitable for data processing, a uniform table lacks expressive power desireable for
+tables intended for a human reader.
+
+Perhaps the most important characteristic a uniform table lacks is *merged cells*. It is very common
+to want to group multiple cells into one, for example to form a column-group heading or provide the
+same value for a sequence of cells rather than repeat it for each cell. These make a rendered table
+more *readable* by reducing the cognitive load on the human reader and make certain relationships
+explicit that might easily be missed otherwise.
+
+Unfortunately, accommodating merged cells breaks both the invariants of a uniform table:
+
+* Each row can have a different number of cells.
+* Each column can have a different number of cells.
+
+This challenges reading table contents programatically. One might naturally want to read the table
+into a uniform matrix data structure like a 3 x 3 "2D array" (list of lists perhaps), but this is
+not directly possible when the table is not known to be uniform.
+
+
+Concept: The layout grid
+------------------------
+
+::
+
+  + - + - + - +
+  |   |   |   |
+  + - + - + - +
+  |   |   |   |
+  + - + - + - +
+  |   |   |   |
+  + - + - + - +
+
+In Word, each table has a *layout grid*.
+
+- The layout grid is *uniform*. There is a layout position for every (layout-row, layout-column)
+  pair.
+- The layout grid itself is not visible. However it is represented and referenced by certain
+  elements and attributes within the table XML
+- Each table cell is located at a layout-grid position; i.e. the top-left corner of each cell is the
+  top-left corner of a layout-grid cell.
+- Each table cell occupies one or more whole layout-grid cells. A merged cell will occupy multiple
+  layout-grid cells. No table cell can occupy a partial layout-grid cell.
+- Another way of saying this is that every vertical boundary (left and right) of a cell aligns with
+  a layout-grid vertical boundary, likewise for horizontal boundaries. But not all layout-grid
+  boundaries need be occupied by a cell boundary of the table.
+
+
+Complication 2: Omitted Cells
+-----------------------------
+
+::
+
+      +---+---+   +---+---+---+
+      | a | b |   | a | b | c |
+  +---+---+---+   +---+---+---+
+  | c | d |           | d |
+  +---+---+       +---+---+---+
+      | e |       | e | f | g |
+      +---+       +---+---+---+
+
+Word is unusual in that it allows cells to be omitted from the beginning or end (but not the middle)
+of a row. A typical practical example is a table with both a row of column headings and a column of
+row headings, but no top-left cell (position 0, 0), such as this XOR truth table.
+
+::
+
+      +---+---+
+      | T | F |
+  +---+---+---+
+  | T | F | T |
+  +---+---+---+
+  | F | T | F |
+  +---+---+---+
+
+In `python-docx`, omitted cells in a |_Row| object are represented by the ``.grid_cols_before`` and
+``.grid_cols_after`` properties. In the example above, for the first row, ``.grid_cols_before``
+would equal ``1`` and ``.grid_cols_after`` would equal ``0``.
+
+Note that omitted cells are not just "empty" cells. They represent layout-grid positions that are
+unoccupied by a cell and they cannot be represented by a |_Cell| object. This distinction becomes
+important when trying to produce a uniform representation (e.g. a 2D array) for an arbitrary Word
+table.
+
+
+Concept: `python-docx` approximates uniform tables by default
+-------------------------------------------------------------
+
+To accurately represent an arbitrary table would require a complex graph data structure. Navigating
+this data structure would be at least as complex as navigating the `python-docx` object graph for a
+table. When extracting content from a collection of arbitrary Word files, such as for indexing the
+document, it is common to choose a simpler data structure and *approximate* the table in that
+structure.
+
+Reflecting on how a relational table or dataframe represents tabular information, a straightforward
+approximation would simply repeat merged-cell values for each layout-grid cell occupied by the
+merged cell::
+
+
+  +---+---+---+      +---+---+---+
+  |   a   | b |  ->  | a | a | b |
+  +---+---+---+      +---+---+---+
+  |   | d | e |  ->  | c | d | e |
+  + c +---+---+      +---+---+---+
+  |   | f | g |  ->  | c | f | g |
+  +---+---+---+      +---+---+---+
+
+This is what ``_Row.cells`` does by default. Conceptually::
+
+  >>> [tuple(c.text for c in r.cells) for r in table.rows]
+  [
+    (a, a, b),
+    (c, d, e),
+    (c, f, g),
+  ]
+
+Note this only produces a uniform "matrix" of cells when there are no omitted cells. Dealing with
+omitted cells requires a more sophisticated approach when maintaining column integrity is required::
+
+  #     +---+---+
+  #     | a | b |
+  # +---+---+---+
+  # | c | d |
+  # +---+---+
+  #     | e |
+  #     +---+
+
+  def iter_row_cell_texts(row: _Row) -> Iterator[str]:
+      for _ in range(row.grid_cols_before):
+          yield ""
+      for c in row.cells:
+          yield c.text
+      for _ in range(row.grid_cols_after):
+          yield ""
+
+  >>> [tuple(iter_row_cell_texts(r)) for r in table.rows]
+  [
+    ("",  "a", "b"),
+    ("c", "d", ""),
+    ("",  "e", ""),
+  ]
+
+
+Complication 3: Tables are Recursive
+------------------------------------
+
+Further complicating table processing is their recursive nature. In Word, as in HTML, a table cell
+can itself include one or more tables.
+
+These can be detected using ``_Cell.tables`` or ``_Cell.iter_inner_content()``. The latter preserves
+the document order of the table with respect to paragraphs also in the cell.
diff --git a/features/steps/cell.py b/features/steps/cell.py