In a recent thread in the System iNetwork Forums, someone asked how to produce Microsoft Word documents from IBM i. He wanted to create a document in Word but then insert data from a DB2 for i database into the document with an RPG program.
I advised that he should investigate the .docx format that became Word's standard format starting with Office 2007. I knew that it was based on XML, and so you should be able to create it from any programming language, but I didn't elaborate, because I hadn't done it myself. Now I've decided it's time to experiment with it. In this article, I tell you what I discovered, and I demonstrate how to insert data into a Word document from RPG.
Microsoft Word and the .docx Format
With the release of Microsoft Office 2007, Microsoft changed its file formats dramatically. Word, Excel, and PowerPoint all have new native formats to store their data in. They still support the old formats, of course, but their "standard" format is the XML ones.
Microsoft Word is no exception. The older format uses the extension .doc to denote that it's a Word document, and the new format uses the extension .docx, which denotes that it's an XML variant of the Word document. Because I know that XML is plain text, and I know I can read and write XML from RPG, I should be able to work with the XML format, right?
Turns out, it's a little more complex than that. A .docx file isn't a single XML file; it's actually a .zip file that contains a whole directory structure. Within that structure are many XML files.
To experiment, I opened Microsoft Word and created the following Word document:
Notice that I've put placeholders where I wanted to insert data from my RPG program. For the date, I put ====DATE====. I figure that my RPG program can search the XML document for that string and replace it with the actual date. Likewise, I have placeholders like ====RECIP==== and ====TITLE==== for their corresponding fields from my RPG program. I chose the = character because it has no special meaning in XML, it works across all character sets, and it's unlikely that four consecutive = characters would appear in a normal business letter.
I saved this document to my PC as ACME.docx. I made certain to use the Office 2007 .docx format, since I know that's a zipped XML document. I used FTP in binary mode to upload this document to an IFS directory on my system that runs IBM i.
Next, I used the InfoZip Utility in PASE to unzip the Word document. (Actually, my first attempt used 7-Zip, which worked well for unzipping, but when I zipped up my result, it didn't work. Apparently, 7-Zip creates .zip files that Microsoft Word doesn't understand. InfoZip doesn't have as many features as 7-Zip but seems to be compatible with Word.) From QShell, I typed the following command:
unzip ACME.docx
And the command's output looked like this:
Archive: ACME.docx
inflating: [Content_Types].xml
creating: _rels/
inflating: _rels/.rels
creating: docProps/
inflating: docProps/core.xml
inflating: docProps/app.xml
inflating: docProps/custom.xml
creating: word/
creating: word/_rels/
inflating: word/_rels/document.xml.rels
inflating: word/_rels/settings.xml.rels
inflating: word/document.xml
inflating: word/footnotes.xml
inflating: word/endnotes.xml
inflating: word/header1.xml
creating: word/theme/
inflating: word/theme/theme1.xml
inflating: word/settings.xml
inflating: word/webSettings.xml
inflating: word/styles.xml
inflating: word/numbering.xml
inflating: word/fontTable.xml
At this point, you may be wondering how that worked! After all, it was a .docx file, not a .zip file! How was I able to unzip it? In truth, it was a zip file. That's what today's Word documents are—they're .zip files that contain a particular directory structure. The files inside that structure are XML documents that contain the layout of the Word document.
I was amazed at how much data is stored inside a Word document. The files that contain the phrase "rels" are relationship documents that describe how the files relate to one another. Most of the others, including styles.xml, fontTable.xml, settings.xml, and theme1.xml are XML documents that describe how the document looks. What are the fonts? How is everything laid out? For now, I'm content to let Word figure all that out.
The only file I'm interested in is the document.xml file in the word subdirectory. It contains the actual document, including my ==== placeholders. If I load it up into my RPG program, I should be able to find those placeholders, insert my own text, save it back to disk, and re-zip it.
The Document.xml File
The document.xml file is, of course, an XML file. You can open it and look at its contents, and you'll see that it contains the text you typed into Word. I opened mine with the Firefox web browser, since Firefox formats XML nicely on the screen, making it easy to read. Here's an excerpt from the document.xml file:
<w:document>
<w:body>
.
.
<w:p w:rsidR="00BD0BBB" w:rsidRPr="00BD0BBB" w:rsidRDefault="001E5AC9" w:rsi
dP="00BD0BBB">
<w:pPr>
<w:pStyle w:val="Date"/>
</w:pPr>
<w:r>
<w:t>====DATE====</w:t>
</w:r>
</w:p>
.
.
As you can see, it's an XML file, but what are all the elements in it? What does <w:Pr> do, for example? What is a w:rsidR="00BD0BBB"? I certainly don't know. Fortunately, I don't have to worry too much about them. I just need to replace ====DATE==== with data from my RPG program, and then I can save the rest of it back to disk unchanged.
So I did that. I wrote an RPG program that follows these steps:
- Calls InfoZip to unzip the .docx file into a temporary directory.
- Reads the document.xml file into a character string in my RPG program.
- Uses the %SCAN and %REPLACE BIFs to replace my placeholders with data from my program.
- Saves the document.xml file back to the IFS.
- Calls InfoZip to .zip the XML files again, creating a new .docx file.
One Tricky Problem
It partially worked. All the fields were replaced except my ====STATE==== and ====POSTAL=== fields. For some reason, they didn't get replaced! It took a while, but I eventually found the problem. In my document.xml, I was expecting to find this:
<w:t>====CITY====, ====STATE==== ====POSTAL====</w:t>
However, I didn't find that. Instead, I found this:
<w:t>====CITY====, ====STATE===</w:t></w:r><w:proofErr
w:type="gramStart"/><w:r><w:t>= =</w:t></w:r>
<w:proofErr w:type="gramEnd"/><w:r><w:t>===POSTAL====</w:t>
It appears that Word decided that my placeholders were bad grammar, so it inserted "proofErr" tags to show me where my grammar error started and ended. Because it happened to be in the middle of my ====STATE==== and ====POSTAL==== placeholders, my RPG program couldn't find the strings and therefore failed to replace them properly.
Once I finally realized this, I went into Word and disabled its spelling and grammar checking and tried again. This time, it worked!
The RPG Code
How does the RPG code work? It works by calling the QCMDEXC API to invoke QShell, and it uses QShell to unzip the .docx file.
// ------------------------------------------------------
// Extract the DOCX Template to a temporary
// directory and mark the document.xml file w/CCSID 1208
// ------------------------------------------------------
cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin +
&& mkdir "' + TMPDIR + '" +
&& cd "' + TMPDIR + '" +
&& unzip "' + Template + '" +
&& setccsid 1208 word/document.xml'')';
QCMDEXC(cmd: %len(cmd));
I found I had to set the CCSID of document.xml to 1208 (UTF-8) in order for the IFS APIs to perform proper translation of the data when my program reads it in. In the preceding code, I used QShell's setccsid utility to do this. The CHGATR CL command is another good way to change the CCSID, but since I was already in QShell, I opted for QShell's command.
Now that my .docx file has been unzipped, I use the IFS APIs to load it into a variable in my RPG program.
D buf s 65535a
D vbuf s 65535a varying
.
.
// ------------------------------------------------------
// Load the document.xml file from the template
// into an RPG variable.
// ------------------------------------------------------
IfsPath = TMPDIR + '/word/document.xml';
fd = open( IfsPath
: O_RDONLY + O_CCSID + O_TEXTDATA
: 0
: 0 );
if (fd = -1);
// handle error here
endif;
len = read(fd: %addr(buf): %size(buf));
callp close(fd);
vbuf = %subst(buf:1:len);
Since I decided I wanted my code to remain V5R4 compatible, I used an alphanumeric field that's only 65535 bytes long. It was more than large enough for my simple Word document. However, it's easy to imagine a situation where I might want to handle larger documents. In IBM i 6.1, you can change the size of buf and vbuf to much larger sizes, up to 16MB. I leave that change as an exercise for the reader.
Because I'm working with V5R4 code, I can't use RPG's new Scan and Replace (%SCANRPL) BIF, either, so I wrote myself a subprocedure to perform scanning and replacing.
P scanrpl B
D PI
D vbuf 65535a varying
D oldval 100a varying const
D newval 100a varying const
D pos s 10i 0
/free
pos = %scan( oldval: vbuf );
dow pos > 0;
vbuf = %replace( newval: vbuf: pos: %len(oldval));
pos = %scan( oldval: vbuf: pos+%len(newval) );
enddo;
/end-free
P E
With this procedure, I can easily scan for my placeholders and replace them with data from my RPG program.
D WordRepl_fields_t...
D ds qualified
D date 10a varying
D recip 30a varying
D recipnm 30a varying
D title 30a varying
D company 30a varying
D address 30a varying
D city 20a varying
D state 2a
D postal 10a varying
D my ds likeds(WordRepl_fields_t)
.
.
scanrpl( vbuf : '====DATE====' : my.date );
scanrpl( vbuf : '====RECIP====' : my.recip );
scanrpl( vbuf : '====RECIPNM====' : my.recipnm );
scanrpl( vbuf : '====TITLE====' : my.title );
scanrpl( vbuf : '====COMPANY====' : my.company );
scanrpl( vbuf : '====ADDRESS====' : my.address );
scanrpl( vbuf : '====CITY====' : my.city );
scanrpl( vbuf : '====STATE====' : my.state );
scanrpl( vbuf : '====POSTAL====' : my.postal );
Prior to the preceding code, I set values for recipient, title, company, et al., in the my data structure. I just use simple variable assignment to hard-code these values in my RPG program. However, in a real-world program, you'd probably want to get this information either from the user or from a database file.
The result is that the placeholders are replaced with data from variables in my program. Now my vbuf variable contains the final XML document, with the data already filled in. I need to write it out to the IFS using the IFS APIs:
fd = open( IfsPath
: O_TRUNC + O_WRONLY + O_TEXTDATA + O_CCSID
: 0
: 0 );
if (fd = -1);
// handle error here
endif;
callp write(fd: %addr(vbuf)+VARPREF: %len(vbuf));
callp close(fd);
And the final step is to create a new .docx file by zipping up my temporary directory. I used QShell and InfoZip to do this.
cmd = 'QSH CMD(''export PATH=$PATH:/usr/local/bin +
&& cd "' + TMPDIR + '" +
&& zip -r "' + NewDoc + '" *'')';
QCMDEXC(cmd: %len(cmd));
The Result
When I open my new .docx file with Microsoft Word, it looks like this:
Code Download
Download the RPG code described in this article.
Download a copy of InfoZip that has been compiled for PASE.