Write in Markdown, Deliver as DOCX
How to Automate Corporate Word Documents with Pandoc
Published 25/04/2025, last edited 26/05/2025
1 Introduction
Microsoft Word is a bloated, bug-ridden mess of a word processor. Despite the supposed ease of use promised by its WYSIWYG philosophy, most users never actually learn how to use it properly. Manually formatting paragraphs instead of applying styles, or inserting empty paragraphs to create vertical spacing instead of adjusting paragraph style settings, are just two of the many frequently committed sins.
To make matters worse, Word’s native DOCX format is ill-suited to version control due to its binary nature. Consequently, I consider there to be only one defensible reason to use Microsoft Word:
- Your employer requires it, and you are otherwise happy with your job.
Some might argue for adding the following to the above list:
- A journal requires submission in DOCX format.[1]
- Your collaborators aren’t comfortable using text-based formats like Markdown or LaTeX.[2]
- Someone sends you a DOCX file.[3]
[1]: Find a better journal.
[2]: Find different collaborators or teach them your ways. (OK, fine, I realise that is not always practical…)
[3]: Use LibreOffice to view it instead.
To avoid the need to interact with Word and reap the full benefits of source control, you can author your documents in Markdown instead, and automatically generate DOCX files using pandoc. For corporate documents, it is usually important to match the exact layout and style of some existing DOCX template. This article explains how to accomplish this.
2 Preliminaries
Pandoc supports a variety of text-based input formats, such as various Markdown flavours, LaTeX, DocBook, reStructuredText, MediaWiki markup, and many others. The supported Markdown flavours include
- markdown (pandoc’s own Markdown flavour),
- markdown_mmd (MultiMarkdown),
- markdown_phpextra (PHP Markdown Extra), and
- markdown_strict (John Gruber’s original Markdown).
I favour pandoc’s own Markdown flavour due to it supporting mathematical equations and references, both of which are essential for writing technical documentation.
To customise the DOCX output, pandoc offers two distinct mechanisms.
The first of these is the --reference-doc option, which allows the provision of a DOCX file whose content will be ignored, but whose styles and document properties (including margins, page size, header, and footer) are used in the created DOCX. The second is the --template option, enabling the user to supply an Office Open XML (OOXML)[4] template that is used by pandoc to structure the contents of the created DOCX. OOXML is the document format used internally in DOCX files to represent document contents and was accepted as an ISO/IEC standard in 2008, the approval process having been marked by significant controversy[5]. The DOCX file format uses Open Packaging Conventions (OPC) to package the contents of a document. This means that a DOCX file is just a ZIP archive containing various XML files as well as media contained in the document, such as images.
[4]: Not to be confused with OpenOffice XML.
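Since a DOCX file is a ZIP archive, its parts can be listed with nothing more than Python's standard library. The following is a minimal sketch: the function names `list_docx_parts` and `make_toy_package` are made up for illustration, and the toy package only mimics typical DOCX part names (it is not a file Word would open).

```python
import io
import zipfile


def list_docx_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of all parts stored in a DOCX (ZIP) package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def make_toy_package() -> bytes:
    """Build a tiny ZIP with DOCX-like part names (NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/styles.xml", "<styles/>")
        package.writestr("word/media/image1.png", b"")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_docx_parts(make_toy_package()):
        print(name)
```

To inspect a real file, pass its bytes instead, e.g. `list_docx_parts(open("reference-doc.docx", "rb").read())`.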
According to InfoWorld,
“OOXML was opposed by many on grounds it was unneeded, as software makers could use OpenDocument Format (ODF), a less complicated office software format that was already an international standard.
“The debate became so embittered that IBM, which backs ODF, threatened in September to consider leaving standards bodies that allowed dominant companies such as Microsoft to wield what it perceived as undue influence. Microsoft was accused of leaning on countries in order to secure enough votes for OOXML to pass.”
Ars Technica reported that Standards Norway, the Norwegian technical standards organisation, lost 13 of its 23-person technical committee as a result of the organisation’s decision to vote in favour of the standard after a public consultation. It had been found that a significant number of the responses were identical submissions authored by Microsoft. In their letter of resignation, the members wrote:
“Standard Norway chose to defy their own technical committee and vote yes to a specification that is immature, useless, and unworthy of being called an ISO standard.”
Some more information on the OOXML specification can be found in Rob Weir’s article How to hire Guillaume Portes, which makes for an entertaining read.
The default reference-doc.docx can be obtained by
running
pandoc -o reference-doc.docx --print-default-data-file reference.docx
and the default template.openxml by running
pandoc -o template.openxml --print-default-template=openxml
Once reference-doc.docx and
template.openxml have been created, the DOCX can be created
as follows:
pandoc --reference-doc=reference-doc.docx --template=template.openxml input.md -o output.docx
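If you would rather drive pandoc from a script, the same invocation can be assembled in Python. This is only a sketch: `build_pandoc_command` and `compile_docx` are hypothetical helper names, the default file names are the ones used above, and pandoc is assumed to be on your PATH.

```python
import subprocess


def build_pandoc_command(
    source: str,
    output: str,
    reference_doc: str = "reference-doc.docx",
    template: str = "template.openxml",
) -> list[str]:
    """Assemble the pandoc argument list shown above (nothing is executed)."""
    return [
        "pandoc",
        f"--reference-doc={reference_doc}",
        f"--template={template}",
        source,
        "-o",
        output,
    ]


def compile_docx(source: str, output: str) -> None:
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    subprocess.run(build_pandoc_command(source, output), check=True)
```

Calling `compile_docx("input.md", "output.docx")` then has the same effect as the command line above.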
I wrote some tools to make generation of DOCX files using pandoc easier, which can be found in this repository. Amongst other things, the repository contains several PowerShell 5.1 scripts to enable effective editing and versioning of reference-doc.docx. Using these requires Windows 10 or 11.[6][7]
[6]: My operating system of choice is Linux-based, but having to use Windows is another one of those consequences that come with being employed.
[7]: I may add equivalent Unix-compatible scripts to the repo in the future. If that is something you desire, feel free to open an issue.
The repository also contains the script compile.ps1, which executes pandoc with a bunch of options you are likely going to need. The script can be used as follows:
.\compile.ps1 input.md
Furthermore, the repository also contains a reference-doc and a template.openxml, which I recommend using as a starting point.[8]
[8]: The repository’s reference-doc in particular contains some improvements over pandoc’s default reference-doc; see Sec. 3.1.
3 Editing reference-doc.docx
In the above-mentioned repository, the reference-doc is stored in its unzipped form, to enable effective version control. To edit reference-doc.docx, run edit-reference-doc.ps1, which turns the subdirectory reference-doc/ back into a ZIP archive, writes it to reference-doc.docx, and opens the file in MS Word for editing. The script apply-reference-doc-edits.ps1 performs the reverse action: it unzips reference-doc.docx, formats the contained XML files, and writes the files back to the subdirectory reference-doc/. It also removes a bunch of superfluous[9] XML elements and attributes which tend to change on every save and which would make effective version control difficult. This approach enables you to examine precisely the changes that are made to reference-doc.docx on each save. It is good practice to only check in the changes that are clearly related to what you were intending to change. You can experiment with editing the XML files by hand to learn about their structure and purpose. This will inevitably lead to you creating an invalid DOCX[10], but this can easily be fixed with git restore reference-doc.
[9]: These scripts have been tested fairly extensively, so the removal of the XML elements and attributes deemed superfluous shouldn’t break your reference-doc.docx. However, if you end up with an invalid DOCX, consider disabling this step during formatting to see if this fixes your problem. Equally, if I have missed some superfluous elements or attributes that should be removed, feel free to raise a pull request!
The script also discards any changes to the files /docProps/app.xml, /docProps/core.xml, /word/settings.xml, and /word/glossary/settings.xml, as they contain metadata (such as the number of words in the document, or the time when the document was last saved), which we are not interested in version controlling. You may on occasion find that those files need to be updated (e.g. if you want to change the compatibility settings of your DOCX). In that case you can either edit these files manually, or comment out the code in apply-reference-doc-edits.ps1 that prevents changes to these files from being applied to reference-doc/.
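The zip/unzip round trip at the heart of this workflow can be sketched in a few lines of Python. To be clear, this is not a port of the repository's PowerShell scripts: `pack_docx` and `unpack_docx` are hypothetical names, and the sketch omits the XML formatting, the removal of superfluous elements, and the discarding of metadata changes described above.

```python
import zipfile
from pathlib import Path


def pack_docx(source_dir: Path, docx_path: Path) -> None:
    """Zip the contents of an unpacked DOCX directory into a .docx file."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Part names inside the package always use forward slashes.
                package.write(path, path.relative_to(source_dir).as_posix())


def unpack_docx(docx_path: Path, target_dir: Path) -> None:
    """Extract all parts of a .docx file into a directory."""
    with zipfile.ZipFile(docx_path) as package:
        package.extractall(target_dir)
```

For example, `pack_docx(Path("reference-doc"), Path("reference-doc.docx"))` would rebuild the archive before editing, and `unpack_docx` would write the edited parts back to a directory.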
3.1 Compatibility settings
Of special note is the <w:compat> element
contained in /word/settings.xml. It should look something
like this:
<w:compat>
  <w:compatSetting w:name="compatibilityMode" w:uri="http://schemas.microsoft.com/office/word" w:val="15" />
  <w:compatSetting w:name="overrideTableStyleFontSizeAndJustification" w:uri="http://schemas.microsoft.com/office/word" w:val="1" />
  <w:compatSetting w:name="enableOpenTypeFeatures" w:uri="http://schemas.microsoft.com/office/word" w:val="1" />
  <w:compatSetting w:name="doNotFlipMirrorIndents" w:uri="http://schemas.microsoft.com/office/word" w:val="1" />
  <w:compatSetting w:name="differentiateMultirowTableHeaders" w:uri="http://schemas.microsoft.com/office/word" w:val="1" />
  <w:compatSetting w:name="useWord2013TrackBottomHyphenation" w:uri="http://schemas.microsoft.com/office/word" w:val="0" />
</w:compat>
If this element is not present in /word/settings.xml, Word assumes that your DOCX file is an older file format, which has formatting implications. In particular, it affects the way tables are aligned in the document. Without this element, Word appears to align tables horizontally so that the left page margin is aligned with the text in the table’s first column, rather than with the table’s left border. The latter looks considerably better.
[Figure: /word/settings.xml without <w:compat> element.]
[Figure: /word/settings.xml with above <w:compat> element.]
This <w:compat> element was added to the reference-doc in the above-mentioned repository.
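A quick way to check whether a given settings.xml declares a compatibility mode is to parse it and look for the compatSetting in question. The sketch below assumes the standard WordprocessingML namespace for the `w:` prefix; the function name `compatibility_mode` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace bound to the "w:" prefix in WordprocessingML parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def compatibility_mode(settings_xml: str) -> Optional[str]:
    """Return the w:val of the compatibilityMode setting, or None if absent."""
    root = ET.fromstring(settings_xml)
    for setting in root.iter(f"{{{W_NS}}}compatSetting"):
        if setting.get(f"{{{W_NS}}}name") == "compatibilityMode":
            return setting.get(f"{{{W_NS}}}val")
    return None
```

A return value of `None` means the element is missing, which is the situation described above in which Word treats the file as an older format.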
3.2 How to edit styles
There are four kinds of styles in Word: character styles, paragraph styles, table styles, and list styles[11]. Character styles are represented by a double-storey lowercase “a” and are applied to individual sections of text within a paragraph. They can be used to change e.g. the font or font style of text (bold, italic, font colour, …). Paragraph styles are represented by the pilcrow sign (¶) and contain all the properties of character styles, but also various additional settings one might apply on a paragraph level such as paragraph margins, line spacing, numbering, indentation, and others. Table styles can be used to apply pre-defined formatting to a table, such as the presence, location, and style of borders, special formatting for the first or last column or row, alignment of both the table itself and the contents of cells, and so forth. To further complicate things unnecessarily, there is actually a fifth kind of style, termed “Linked (paragraph and character)”. Linked styles are needed to create a paragraph style that is based on a character style, but this shouldn’t be necessary in most cases.
[11]: I haven’t yet had the need to edit list styles, so I can’t comment intelligently on this type of style, I’m afraid.
Pandoc’s default reference-doc.docx contains all the styles that are used by pandoc to style the text in the output document. These styles are listed in the manual[12]. To modify them, assuming you are using the above-mentioned repository, run ./edit-reference-doc.ps1 to create reference-doc.docx and automatically open it in Word. To edit styles:[13]
- Click on text that is formatted in the style you want to edit (this step is optional).
- Click on the small button in the lower right corner of the styles pane.
- In the small window that opens subsequently, click on the button with the green tick.
- In the larger window that opens subsequently, select the style you would like to edit (if you completed the first step, it should already be selected), and click the ‘Modify’ button.
- Make edits to the style in the ‘Modify Style’ window and confirm the changes with ‘OK’, or discard with ‘Cancel’.
[12]: Note that the “Source Code” style is created dynamically by pandoc and is not present in reference-doc.docx; see this issue. To modify the style of Code elements, edit the “Verbatim Char” character style instead.
[13]: Given how many clicks it takes to edit a style, I’m not surprised most Word users never work out how to do this and instead opt to apply custom formatting to elements individually.
Once you are happy with your style edits, save
reference-doc.docx, and apply the changes to the
subdirectory reference-doc/ by running
.\apply-reference-doc-edits.ps1. You can then examine the
changes to the XML. If you only changed the styles of the document, the
only file with any modifications should be
reference-doc/word/styles.xml (unless you also changed a
paragraph style’s numbering settings, in which case there will also be
changes to reference-doc/word/numbering.xml; see Sec. 3.4).
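To get an overview of which styles a styles.xml defines, you can group the style IDs by their kind. The sketch below relies on each `<w:style>` element carrying a `w:type` attribute (paragraph, character, table, or numbering) and a `w:styleId`; the function name `styles_by_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in WordprocessingML parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def styles_by_type(styles_xml: str) -> dict[str, list[str]]:
    """Group the style IDs found in a styles.xml document by style type."""
    root = ET.fromstring(styles_xml)
    grouped: dict[str, list[str]] = {}
    for style in root.iter(f"{{{W_NS}}}style"):
        kind = style.get(f"{{{W_NS}}}type", "unknown")
        style_id = style.get(f"{{{W_NS}}}styleId", "")
        grouped.setdefault(kind, []).append(style_id)
    return grouped
```

Running this on reference-doc/word/styles.xml before and after an edit in Word gives a quick sanity check that no styles were added or removed by accident.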
3.3 Custom styles
To create a custom style (e.g. a custom heading style that is to be
used for front matter headings, or appendix headings which are numbered
‘A/B/C’ rather than ‘1/2/3’), create a new style in Word with the
appropriate settings. In pandoc’s Markdown, this custom style can be
applied using fenced Divs if
it’s a custom paragraph style, or bracketed
Spans if it’s a character style by specifying the
custom-style attribute:
:::{custom-style="Paragraph Style"}
Appendix
:::
Paragraph with [styled text]{custom-style="Character Style"}
3.4 Numbering headings
To number headings, two options are available:
- Letting pandoc take care of the numbering via the --number-sections option, or
- Modifying heading styles in reference-doc.docx so that they are numbered.
The second option requires more work, but allows fine-grained control over numbering schemes (arabic, roman, and other numbering settings). To associate a numbering style with a heading, in the ‘Modify Style’ window, click ‘Format’ in the lower left hand corner, select ‘Numbering…’ and choose the desired number format. You can also ‘Define [a] New Number Format…’. Once the numbering scheme has been defined, it seems like there is no option in Word to edit it. The only way to change a specified numbering scheme is to click ‘Define New Number Format…’, and apply all settings all over again… To increase (or decrease) the space between the number and the heading, select the ‘Paragraph…’ setting and under ‘Indentation’, select ‘Hanging’ (in the drop-down list under ‘Special:’) and specify the desired indentation.
3.5 Centering captions, images, and tables
To center captions, edit the Caption linked style. Both
the Table Caption and the Image Caption
paragraph styles are based on the Caption style, so any
settings you change for Caption also affect
Table Caption and Image Caption (provided
those styles don’t override the specific setting).
To center images, edit the Figure paragraph style. The
Captioned Figure style is based on the Figure
style, and will therefore be affected by changes to the
Figure style.
To center tables, edit the Table style. In the ‘Modify
Style’ window, ensure that ‘Whole table’ is selected in the dropdown
under ‘Formatting’. Then click ‘Format’ in the bottom left hand corner,
and select ‘Table Properties’. Under ‘Alignment’, select ‘Center’. This
will center all tables. Special formatting applied to other parts of the
table (e.g. the ‘Header row’) can prevent tables from being centered (it
took me a while to discover this — my tables still weren’t centered, and
it turned out that left table alignment was specified for the ‘Header
row’, and this setting took precedence over the alignment applied to the
‘Whole table’…).
4 Editing template.openxml
As mentioned in Sec. 2,
pandoc uses the OOXML template supplied via the --template
option to create the output document. The template contains various
pandoc template variables, including the $body$ variable,
which is interpolated by pandoc with the document content during
compilation. The pandoc-docx-tools
repository also contains pandoc’s default template,
template.openxml. This template is passed to pandoc by the
script compile.ps1 to create the output document. One does
not need to be proficient in OOXML to edit template.openxml,
as inspiration can be found by looking at the contents of existing DOCX
files that have the kind of content one would also like to be present in
the output document. To do this, unzip the existing DOCX file and open
/word/document.xml in a text editor. Readability of the XML
is greatly improved if it is formatted prior to inspection. This can be
done using the format-openxml.ps1 PowerShell script, which
optionally takes as first argument the path to the contents of the
unzipped DOCX file.
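If PowerShell is not available, a rough equivalent of the formatting step can be had from Python's standard library. This is a generic pretty-printer, not a reimplementation of format-openxml.ps1, and `format_xml` is a made-up name. Bear in mind that re-indenting inserts whitespace between elements, so treat the output as something to read rather than something to zip back into a DOCX.

```python
from xml.dom import minidom


def format_xml(xml_text: str) -> str:
    """Return the XML re-indented with two spaces per nesting level."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


if __name__ == "__main__":
    print(format_xml("<a><b/><c/></a>"))
```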
4.1 Paragraphs, runs, and text
In OOXML, paragraphs are represented by the <w:p>
element. A special style can be applied to the paragraph via the
<w:pPr> and <w:pStyle> elements.
The smallest unit in OOXML that can be styled (with a character style)
is called a run and is represented by the
<w:r> element. Actual text is contained in the
<w:t> element. A non-empty paragraph must contain
both <w:r> and <w:t> elements.
Consider the following example:
<w:p>
  <w:pPr>
    <w:pStyle w:val="MyStyle" />
  </w:pPr>
  <w:r>
    <w:t>Text.</w:t>
  </w:r>
</w:p>
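When generating such paragraphs programmatically, the main thing to get right is escaping. The helper below is a sketch under my own assumptions: `paragraph_xml` is a hypothetical name, and the `xml:space="preserve"` attribute is an addition of mine to keep leading and trailing spaces in the text intact.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def paragraph_xml(text: str, style_id: Optional[str] = None) -> str:
    """Build a <w:p> element with one run, optionally with a paragraph style."""
    properties = ""
    if style_id:
        # quoteattr() adds the surrounding quotes and escapes the value.
        properties = f"<w:pPr><w:pStyle w:val={quoteattr(style_id)}/></w:pPr>"
    return (
        f"<w:p>{properties}"
        f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r>'
        "</w:p>"
    )
```

`paragraph_xml("Text.", "MyStyle")` yields the same structure as the example above, just without the line breaks and indentation.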
4.2 Page breaks
In OOXML, a page break looks like this:
<w:p>
  <w:r>
    <w:br w:type="page" />
  </w:r>
</w:p>
4.3 Inserting images in the template
When a DOCX contains an image, the corresponding OpenXML in /word/document.xml looks something like this:[14]
<w:p>
  <w:r>
    <w:drawing>
      <wp:inline>
        <wp:extent cx="990000" cy="792000"/>
        <wp:docPr id="1" name="Picture 1"/>
        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
          <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
              <pic:nvPicPr>
                <pic:cNvPr id="0" name="image1.jpg"/>
                <pic:cNvPicPr/>
              </pic:nvPicPr>
              <pic:blipFill>
                <a:blip r:embed="rId8"/>
                <a:stretch>
                  <a:fillRect/>
                </a:stretch>
              </pic:blipFill>
              <pic:spPr>
                <a:xfrm>
                  <a:off x="0" y="0"/>
                  <a:ext cx="990000" cy="792000"/>
                </a:xfrm>
                <a:prstGeom prst="rect">
                  <a:avLst/>
                </a:prstGeom>
              </pic:spPr>
            </pic:pic>
          </a:graphicData>
        </a:graphic>
      </wp:inline>
    </w:drawing>
  </w:r>
</w:p>
[14]: Word is actually very particular about this and doesn’t tolerate the removal of some elements that might seem superfluous to the minimalist OpenXML author. The <a:stretch> element is the only element that can be removed without invalidating the document; however, this causes incorrect scaling behaviour of images.
The actual image is referenced by the <a:blip> element, whose r:embed attribute holds a relationship ID (rId8 in this case). This rId is defined in /word/_rels/document.xml.rels (below is only an excerpt of that file):
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png" />
The actual image file resides at /word/media/image1.png, as defined in the Relationship above. To insert an image in the OOXML template, we can reuse the image rIds from our reference-doc.docx.[15] This requires the image to first be inserted into reference-doc.docx. After running .\apply-reference-doc-edits.ps1, we can inspect the contents of reference-doc/word/_rels/document.xml.rels to determine the rId of the image we inserted. This rId can then be used in the OOXML snippet above (in place of rId8 in the <a:blip r:embed="rId8"/> element) to insert the image into the template.
[15]: This functionality has only been implemented in pandoc very recently and requires pandoc version 3.7 or up.
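Rather than reading document.xml.rels by eye, the image relationships can be pulled out with a short script. This sketch assumes the standard OPC relationships namespace and the image relationship type shown in the excerpt above; `image_relationships` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Default namespace of OPC relationship parts (*.rels files).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def image_relationships(rels_xml: str) -> dict[str, str]:
    """Map each image relationship ID (rId) to its target path."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{RELS_NS}}}Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }
```

Feeding it the text of reference-doc/word/_rels/document.xml.rels returns the rId to use in the `r:embed` attribute for each embedded image.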
To change the size of the image, edit the cx and cy properties of both the <wp:extent/> and the <a:ext> elements. The unit of cx and cy is the English Metric Unit (EMU). There are 360,000 EMUs in a centimetre, and 914,400 EMUs in an inch.[16]
[16]: The use of the EMU (a unit invented by the OOXML authors) allows both metric and imperial quantities to be represented as integers, thereby preventing round-off in calculations that start with either imperial or metric units. If this was the only criterion, one might have chosen to define a unit of which there are 50 in a centimetre and 127 in an inch; however, by using a sufficiently small unit, floating point errors can be avoided by using integer arithmetic instead, which helps with precise alignment of graphical elements. In addition, 360,000 and 914,400 are both divisible by 2, 3, 4, 5, and 6! Quite clever actually. However, I would argue that this problem was more elegantly solved in the SVG format, where coordinates are defined in user units, and the relationship between user units and actual units is given by the SVG’s width and height attributes.
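The conversion is simple enough to do by hand, but a few helper functions avoid slips. The function names below are made up for illustration; the two constants are the conversion factors stated above.

```python
EMU_PER_CM = 360_000
EMU_PER_INCH = 914_400


def cm_to_emu(cm: float) -> int:
    """Convert centimetres to EMUs, rounded to the nearest integer."""
    return round(cm * EMU_PER_CM)


def inches_to_emu(inches: float) -> int:
    """Convert inches to EMUs, rounded to the nearest integer."""
    return round(inches * EMU_PER_INCH)


def emu_to_cm(emu: int) -> float:
    """Convert EMUs back to centimetres."""
    return emu / EMU_PER_CM


if __name__ == "__main__":
    # The extent used in the image snippet above: 2.75 cm wide, 2.2 cm tall.
    print(cm_to_emu(2.75), cm_to_emu(2.2))
```

So the cx="990000" and cy="792000" values in the snippet above correspond to an image 2.75 cm wide and 2.2 cm tall.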
5 Appendices
5.1 HTML-like paragraph spacing
By default, paragraph margins in Word behave like they do in HTML. This means that if paragraph A is followed by paragraph B, and paragraph A has a bottom margin[17] of 10pt, and paragraph B has a top margin of 20pt, then the spacing between paragraphs A and B is 20pt, the greater of the two margins. This behaviour might be unintuitive if you have not previously worked with HTML/CSS and can be changed so that the spacing is the sum of the two margins instead. To change the setting, select File -> Options (bottom left), and go to the ‘Advanced’ tab. Scroll all the way down to ‘Layout options for:’ and change the “Don’t use HTML paragraph auto spacing” tickbox according to preference. This prepends a corresponding element to the <w:compat> element’s contents in /word/settings.xml:
<w:compat>
  <w:doNotUseHTMLParagraphAutoSpacing />
  ...
[17]: In Word, top and bottom margins of paragraph styles are changed by clicking on ‘Format’ in the lower left hand corner of the ‘Modify Style’ window, selecting ‘Paragraph…’, and changing the ‘Before:’ and ‘After:’ values in the ‘Spacing’ section.
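The two behaviours described above boil down to taking either the larger of the two margins or their sum. As a small illustration (the function and parameter names are my own, and this merely restates the rule from the text rather than anything Word exposes):

```python
def paragraph_gap(
    space_after_pt: float,
    space_before_pt: float,
    collapse_margins: bool = True,
) -> float:
    """Vertical gap between two consecutive paragraphs, in points.

    collapse_margins=True models the HTML-like behaviour (larger margin wins);
    collapse_margins=False models the alternative, where the margins add up.
    """
    if collapse_margins:
        return max(space_after_pt, space_before_pt)
    return space_after_pt + space_before_pt
```

With the numbers from the example, `paragraph_gap(10, 20)` gives 20, and `paragraph_gap(10, 20, collapse_margins=False)` gives 30.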
Being used to working with HTML/CSS, I should note that I found the paragraph spacing settings in Word rather limiting and inflexible. In particular, I would like the ability to set the spacing of a paragraph based on what kind of paragraph precedes or succeeds it. For example, if I wanted third-level headings that directly follow a second-level heading to have no top margin (but to have a non-zero top margin otherwise), I would add the following CSS:
h3 {
  margin-top: 1em;
}
h2 + h3 {
  margin-top: 0;
}
In OOXML, this doesn’t appear to be possible (or if it is, that functionality is not exposed in Word).