Monday, August 18, 2008

Working With Sources (2) Microsoft Office - Word

This is the second in a series of posts on how to access various source formats. This time we'll look at Microsoft Word's .doc and .docx formats.

Don’t forget that in some cases formatting loss may be acceptable. Before going any further, consider whether the formatting matters to the content of the document. The simplest test is to select all, copy, paste into an AppleTrans document, and see whether the result is reasonable. For documents without generated text (such as tables of contents), headers, footers, or custom layouts using text boxes, the style support in AppleTrans should be able to handle most formatting.

There are a number of alternatives when you wish to preserve formatting. The first is to export the document to a file format that preserves the extra formatting as text, and see how the result looks. One such format is the older Office XML format, listed as "Word XML Document (.xml)" in the "Save as…" dialog.

For files in the new OOXML format (.docx), it is possible to work directly on the file without exporting it. Rename the file in the Finder, adding a ".zip" extension, then decompress it. The resulting folder contains a number of files, but the main text flow of the document is in the word/document.xml file. Note that textboxes are in drawings/drawing1.xml and similar files. To segment properly, see my post on segmenting Office XML. After translating the text in that file, zip the contents of the main folder (for example, by selecting them and using the compress action in the context menu), so that [Content_Types].xml sits at the top level of the archive rather than inside a subfolder. Then rename the archive with the ".docx" extension.
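The rename, unzip, edit, and rezip round trip can also be scripted, since a .docx file is an ordinary zip archive. What follows is only a rough sketch using Python's standard library, not part of any AppleTrans workflow. The function names (read_main_part, text_runs, write_with_new_main_part) are my own inventions for illustration. It reads word/document.xml, lists the text inside the w:t elements, and writes a copy of the package with a replacement document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements found in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MAIN_PART = "word/document.xml"


def read_main_part(docx_path):
    """Return the raw bytes of word/document.xml from a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read(MAIN_PART)


def text_runs(document_xml):
    """List the text of every w:t element, in document order."""
    root = ET.fromstring(document_xml)
    return [node.text or "" for node in root.iter("{%s}t" % W_NS)]


def write_with_new_main_part(src_path, dst_path, new_document_xml):
    """Copy every part of src_path to dst_path, swapping in new_document_xml.

    Parts keep their original names, so nothing ends up nested in an
    extra top-level folder.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == MAIN_PART:
                data = new_document_xml
            else:
                data = src.read(item.filename)
            dst.writestr(item.filename, data)
```

A typical use would be to call read_main_part on the source file, translate the XML it returns in whatever tool you prefer, and pass the translated bytes to write_with_new_main_part with a new file name, leaving the original untouched. Note that text_runs shows the text only; formatting often splits one sentence across several w:t elements, which is why the segmenting step still matters.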

A final option involves using WildCAT. Please see the (future) post about that in the sidebar.
