So you think it might be a good idea to use a Word document with placeholder text, and have some application create reports by replacing those placeholders with real data?
Something like this?
Date: [[DATE]]
User: [[USERNAME]]
I would ask, for your own sake, that you not do that and use something else instead. But if you are dead set on it, let me share my experience…
First, what is a Microsoft .docx file? While you may be able to avoid some of the following complexities with a fancy library, if you are doing this from "scratch" then this section is vital.
A docx file is basically a fancy zip file. In fact, you can take a look inside one if you rename the file and change the extension to .zip instead of .docx. I've done this many times in my journey to debug and test.
Side note: you'd be surprised how many files out there are just renamed zip files…
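If you want to check this in code rather than by renaming, here is a small sketch in plain standard C++. It only looks at the first four bytes for the usual zip signature ("PK" followed by bytes 0x03 0x04); the function names are my own, and a real validity check would need an actual zip library.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns true if the bytes start with the ZIP local-file-header
// signature: 'P', 'K', 0x03, 0x04.
bool startsWithZipSignature(const std::string& bytes) {
    static const std::string kSignature("PK\x03\x04", 4);
    return bytes.compare(0, kSignature.size(), kSignature) == 0;
}

// Hypothetical helper: reads the first four bytes of a file and checks them.
// Returns false if the file cannot be opened or is shorter than four bytes.
bool fileLooksLikeZip(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char header[4] = {};
    in.read(header, sizeof header);
    return in.gcount() == 4 && startsWithZipSignature(std::string(header, 4));
}
```

Calling fileLooksLikeZip on a .docx should return true, the same as for a plain .zip. It says nothing about whether the archive actually contains a word/document.xml.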
Here is a docx "sample" with its extension renamed to zip.
Here is a great resource on the inner workings of a docx from officeopenxml.com by Daniel Dick.
The "heart" of a Word document (at least one that is primarily text) lives in the word/document.xml file. This will be filled with hundreds of XML tags, the most important of which are the body tag, <w:body>, and the "paragraph" tags, <w:p>. Paragraphs in Word files are the basic blocks of content. They contain several other tags, such as <w:pPr>, which holds the paragraph properties (the xPr notation is common across Word tags), and the <w:r> tags, which are "runs", basically another container for text elements. The actual text is stored in <w:t> tags.
Here's an example taken from the officeopenxml site:
<w:p>
  <w:pPr>
    <w:jc w:val="center"/>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>This is text.</w:t>
  </w:r>
</w:p>
This is a simple example of what a centered piece of text in Word can look like in its XML form. I will hold off on explaining all the different tags; there are good references on the aforementioned website, and plenty of other places too. ChatGPT and other LLMs aren't bad for this, and neither are the resources from Microsoft themselves.
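To make the structure a bit more concrete, here is a minimal sketch that pulls the visible text out of a fragment like the one above by scanning for <w:t> elements. It uses plain string searching instead of a real XML parser, so treat it as an illustration only: collectRunText is a name I made up, and it does not decode entities such as &amp;.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Concatenates the contents of every <w:t>...</w:t> element in the given XML.
// Look-alike tags such as <w:tab/> or <w:tbl> are skipped.
std::string collectRunText(const std::string& xml) {
    std::string out;
    std::size_t pos = 0;
    while ((pos = xml.find("<w:t", pos)) != std::string::npos) {
        const std::size_t afterName = pos + 4;
        if (afterName >= xml.size()) break;
        const char next = xml[afterName];
        if (next != '>' && next != ' ') {  // <w:tab/>, <w:tbl>, <w:t/>, ...
            pos = afterName;
            continue;
        }
        const std::size_t open = xml.find('>', afterName);
        if (open == std::string::npos) break;
        if (xml[open - 1] == '/') {        // self-closing, nothing to collect
            pos = open + 1;
            continue;
        }
        const std::size_t close = xml.find("</w:t>", open);
        if (close == std::string::npos) break;
        out.append(xml, open + 1, close - open - 1);
        pos = close + 6;                   // length of "</w:t>"
    }
    return out;
}
```

Run on the centered paragraph above, this gives back just the text inside its single <w:t>. In real code a proper parser is the safer choice.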
One thing I chose to omit from the example is the dreaded w:rsid attributes. These are enabled by default in almost every Microsoft Word version and are used for tracking editing sessions. They are one of the first major roadblocks you will face when it comes to parsing Word documents, because a placeholder you typed as one piece of text can end up split across several runs. Your best bet is to just detect this in code and implement a solution to recover the original placeholder text. The way I did it was to check for rsid attributes on w:r elements and merge the runs to reform the placeholder text. More on that later…
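For a feel of what detecting them can look like, here is a rough sketch using std::regex on the raw XML text. The w:rsid markers show up as attributes such as w:rsidR="00DD06FC" on a run or paragraph. Both function names are hypothetical, and running a regex over XML is a shortcut I am assuming is acceptable for a quick check. Note that stripping the attributes does not rejoin runs that Word has already split; that still needs the merging described later.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Matches attributes like  w:rsidR="00DD06FC"  or  w:rsidRDefault="00702A9D".
static const std::regex kRsidAttribute(
    R"rx(\s+w:rsid[A-Za-z]*="[0-9A-Fa-f]*")rx");

// True if the XML text carries at least one rsid attribute.
bool hasRsidAttributes(const std::string& xml) {
    return std::regex_search(xml, kRsidAttribute);
}

// Returns a copy of the XML text with every rsid attribute removed.
std::string stripRsidAttributes(const std::string& xml) {
    return std::regex_replace(xml, kRsidAttribute, "");
}
```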
You might think to yourself, "can't we just disable the tracking information in the documents?" Technically yes, but there is no obvious setting to turn this off. There used to be one in certain old versions of Word (over 10 years ago?). The way I found to disable this for individual documents is to open up VBA (Microsoft Visual Basic for Applications) in Word and write a macro that auto-executes every time the document is opened. In it you can set StoreRSIDOnSave = False on the document. Having this macro run will fix the problem. But when you want to distribute this document or have other people use the template code, the problem resurfaces…
Microsoft Word has its own concept of "templates", which are dotm files. By default every Word document inherits the properties of a dotm file called "Normal.dotm". This file lives in your AppData\Roaming\Microsoft\Templates folder. A global version of the fix described in the previous paragraph is to modify this file and attach the macro to it. This will auto-execute the macro for basically every Word document. If you wanted to avoid this massive global change, you could create your own .dotm and set all your future "template" docx files to inherit from this custom .dotm. Good luck convincing other users to do this.
In my project, I used Qt and C++, so here was my solution to the problem:
#include <QDomDocument>
#include <QString>

// Concatenates the text of consecutive w:r siblings, starting at startElement,
// until every placeholder opened so far has been closed with "]]".
// lastMerged is set to the last run whose text was consumed.
QString mergeTaggedText(const QDomElement& startElement, QDomElement& lastMerged) {
    QString mergedText;
    QDomElement currentElement = startElement;
    int checkAfter = -1;
    while (!currentElement.isNull()) {
        QDomElement tElement = currentElement.firstChildElement("w:t");
        if (!tElement.isNull()) {
            mergedText += tElement.text();
            lastMerged = currentElement;
            bool breakCondition = false;
            if (checkAfter == -1) {
                breakCondition = mergedText.contains("]]");
                if (breakCondition) {
                    checkAfter = mergedText.indexOf("]]") + 2;
                }
            } else {
                breakCondition = mergedText.mid(checkAfter).contains("]]");
            }
            if (breakCondition) {
                int endPos = mergedText.lastIndexOf("]]");
                int checkPos = mergedText.indexOf("[", endPos);
                if (checkPos == -1) {
                    break;
                }
                // We have been struck by lightning:
                // We found a starting tag after an ending tag... continue merging
                checkAfter = endPos + 2;
            }
        }
        currentElement = currentElement.nextSiblingElement("w:r");
    }
    return mergedText;
}

// Walks the tree and collapses every group of runs that together form a
// placeholder into a single w:r element.
void processElementToRemoveRsids(QDomElement& element) {
    QDomElement child = element.firstChildElement();
    while (!child.isNull()) {
        QDomElement nextChild = child.nextSiblingElement();
        if (child.nodeName() == "w:r") {
            QDomElement tElement = child.firstChildElement("w:t");
            if (!tElement.isNull() && tElement.text().contains("[")) {
                QDomElement lastMerged = child;
                QString mergedText = mergeTaggedText(child, lastMerged);
                // We could add a check here to see if we contain at least 1 full tag, but merging
                // these RSID elements is low risk, and it's probably safe to even merge every
                // single one in the document anyway...
                // Create a new w:r element with merged text, keeping the first run's properties
                QDomDocument doc = element.ownerDocument();
                QDomElement newRElement = doc.createElement("w:r");
                QDomElement rPrElement = child.firstChildElement("w:rPr");
                if (!rPrElement.isNull()) {
                    newRElement.appendChild(rPrElement.cloneNode());
                }
                QDomElement newTElement = doc.createElement("w:t");
                newTElement.appendChild(doc.createTextNode(mergedText));
                newRElement.appendChild(newTElement);
                // Insert the merged element in front of the runs it replaces
                element.insertBefore(newRElement, child);
                // Remove old elements - from the first merged run through the last one
                QDomElement toRemove = child;
                while (!toRemove.isNull()) {
                    QDomElement following = toRemove.nextSiblingElement("w:r");
                    const bool wasLast = (toRemove == lastMerged);
                    element.removeChild(toRemove);
                    if (wasLast) {
                        break;
                    }
                    toRemove = following;
                }
                // The old nextChild may have just been removed, so resume after the merged run
                nextChild = newRElement.nextSiblingElement();
            }
        } else {
            processElementToRemoveRsids(child);
        }
        child = nextChild;
    }
}
My tags were in the form [[TAG_NAME]], so this logic specifically looks for the starting character [. As you can see, it's not perfect, but it worked with every malformed document.xml I threw at it.
Including the following mess (with a tag of [[ACCOUNT_NUM]]):
<w:r>
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:t>Account: [[ACC</w:t>
</w:r>
<w:r w:rsidR="00DD06FC">
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:t>O</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:t>UNT_N</w:t>
</w:r>
<w:r w:rsidR="00702A9D">
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:t>UM]</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:t>]</w:t>
</w:r>
The special case in the mergeTaggedText function even protects against the following scenario:
<w:t>]] [</w:t>
i.e. when a new tag begins in the same <w:r> in which the previous tag closes.
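If the Qt types get in the way of seeing the idea, here is a stripped-down sketch of the same stopping rule using only std::string. It assumes the text of each run has already been pulled out into a vector; mergeRunTexts and MergeResult are illustrative names and not part of my project code. The rule: keep appending runs until the merged text contains a "]]" with no "[" after the last one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct MergeResult {
    std::string text;       // concatenated run text
    std::size_t nextIndex;  // index of the first run that was NOT consumed
};

// Merges run texts starting at `start` until every placeholder opened so far
// has been closed, or until the runs are exhausted.
MergeResult mergeRunTexts(const std::vector<std::string>& runs, std::size_t start) {
    assert(start <= runs.size());
    MergeResult result{std::string(), start};
    while (result.nextIndex < runs.size()) {
        result.text += runs[result.nextIndex++];
        const std::size_t lastClose = result.text.rfind("]]");
        if (lastClose == std::string::npos) {
            continue;  // no closing marker yet, keep merging
        }
        if (result.text.find('[', lastClose) == std::string::npos) {
            break;     // nothing opens after the last "]]", so we are done
        }
        // A new tag starts after the last "]]" (the "]] [" case): keep merging.
    }
    return result;
}
```

Feeding it the five run texts from the [[ACCOUNT_NUM]] mess above consumes all five and yields "Account: [[ACCOUNT_NUM]]". With runs like "[[A]] [" followed by "[B]]" it keeps going past the first closing marker, which is the lightning-strike case.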
If you find a better solution (ideally something that doesn't even require code) please share. There are dozens of us out there struggling with this issue.
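Once a placeholder sits in a single run, the substitution itself is the easy part. Below is a minimal sketch of it in standard C++, assuming you replace directly in the serialized XML string, which means the values have to be XML-escaped first (a DOM API like the one I used handles the escaping of text nodes for you). fillPlaceholders and escapeXml are names I made up for the example, and unknown placeholders are simply left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Escapes the characters that are not allowed in XML text content.
std::string escapeXml(const std::string& value) {
    std::string out;
    for (const char c : value) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Replaces every [[NAME]] that has an entry in `values` with its escaped value.
std::string fillPlaceholders(const std::string& text,
                             const std::map<std::string, std::string>& values) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t open = text.find("[[", pos);
        const std::size_t close =
            open == std::string::npos ? open : text.find("]]", open + 2);
        if (close == std::string::npos) {   // no complete placeholder left
            out.append(text, pos, std::string::npos);
            break;
        }
        out.append(text, pos, open - pos);
        const std::string name = text.substr(open + 2, close - open - 2);
        const auto it = values.find(name);
        if (it != values.end()) {
            out += escapeXml(it->second);
        } else {
            out.append(text, open, close + 2 - open);  // unknown: keep as-is
        }
        pos = close + 2;
    }
    return out;
}
```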
Another thing you might want to do with your template file is insert images into it. This is different from replacing text or just going through the document.xml file, again because of the anatomy of the docx file. Images live in the word/media folder in the docx structure. So in order to actually insert an image you would have to:
1. Add the image file to the word/media directory.
2. Add a relationship for it to the document.xml.rels file in the word/_rels folder. "Rels", or Relationships, are files that map connections between different parts of the document. This includes things like images and styles. Here's what a rel tag might look like for adding an image named image1 to your document:
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
3. Reference the image in document.xml to actually have it appear. This isn't as simple as just a tag with the id in it; you need specific tags and XML namespaces surrounding it… here's a link from our favorite resource that dives into it a bit more. Basically you can find example XML of an existing image in a Word document and replace the id with your id and it should work… at least it has for me.
Another special case is the headers and footers. These live separately from document.xml, in their own XML files: headerN.xml and footerN.xml, where N is a number. You can modify text in your headers and footers by modifying the text in these header and footer XML files.
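Both the image relationship and the header/footer files come down to knowing the part names inside the archive, so here is a small sketch of two helpers in standard C++. The function names are hypothetical, and isPlaceholderPart assumes the header and footer parts follow the headerN.xml / footerN.xml naming described above and sit in the word folder next to document.xml.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True for archive entries that can hold placeholder text:
// the main body plus any numbered header or footer part.
bool isPlaceholderPart(const std::string& entryName) {
    static const std::regex kTextPart(
        R"(word/(document|header[0-9]+|footer[0-9]+)\.xml)");
    return std::regex_match(entryName, kTextPart);
}

// Builds the <Relationship> entry for an image stored under word/media.
// `id` must be unique within document.xml.rels (for example "rId7").
std::string imageRelationship(const std::string& id, const std::string& fileName) {
    return "<Relationship Id=\"" + id + "\" "
           "Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/image\" "
           "Target=\"media/" + fileName + "\"/>";
}
```

Looping over the zip entries and running the placeholder logic on everything isPlaceholderPart accepts covers the body, headers, and footers in one pass. imageRelationship("rId7", "image1.png") reproduces the rel tag shown in the image steps.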
Despite my facetious comments at the start of this article, using Word documents as templates isn't a huge deal once you wrap your head around how a Word document actually works. Of course, some things that end up being huge pains in the butt (RSIDs, looking at you) really should have better support on Microsoft's end. While there could easily be a third-party tool out there that strips these annoying things out of your documents, it would be nice to just be able to turn them off within Word. But I find myself being upset about lots of seemingly inane things that Microsoft does with Office and Windows these days.
About Me
Hi, I'm Dimitar. I graduated from UIUC with my B.S. in Computer Science in 2022. I work full-time as a software engineer. My interests in the field are all over the place, ranging from cyber security, efficient programming, and operating systems to database systems and more. I find most CS topics interesting to talk about and wish I had all the time and energy to explore them all deeply.
Outside of tech, I like biking, boxing, video games (waay too much Skyrim), and writing (hence this blog!)