Monday, August 18, 2008

Working With Sources (3) Microsoft Office - Excel

This is the third in a series of posts on how to access various source formats. This time we'll look at accessing the Microsoft Excel .xls and .xlsx formats.

Don’t forget that in some cases formatting loss may be acceptable. Consider whether the formatting is important to the content of the document before going any further. The simplest test is to select all, copy, paste into an AppleTrans document, and see if this yields reasonable results. For Excel documents this will likely not be the case. Many people use Excel to lay out text and, when the grid doesn't work, add text boxes floating on top of the mess. These documents usually rely on the position of the text to communicate how the pieces of text relate to each other, so you will need to preserve it.

Fortunately, Excel supports a number of text-based formats, so it should be easy to find one that preserves the formatting and layout while giving direct access to the text. In the "Save as…" dialog, choosing "Excel 2004 XML Spreadsheet" will also preserve embedded images. For simpler data, CSV (comma-separated values) or tab-separated values are also viable options.
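If you go the CSV route, a short script can pull out just the text that needs translating. This is only a sketch, assuming Python is available; the function name unique_text_cells and the sample data are made up for illustration. It collects each distinct text cell once and skips cells that are plain numbers.

```python
import csv
import io

def unique_text_cells(csv_text, delimiter=","):
    """Collect distinct, non-empty, non-numeric cells in first-seen order."""
    seen = {}
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        for cell in row:
            value = cell.strip()
            if not value:
                continue
            try:
                float(value)  # cells that parse as numbers need no translation
            except ValueError:
                seen.setdefault(value, None)
    return list(seen)

# Hypothetical sample: a quoted cell containing a comma, plus a repeated label.
sample = 'Item,Price\n"Widget, large",9.50\nItem,3\n'
cells = unique_text_cells(sample)
```

For a tab-separated export, pass delimiter="\t" instead.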

The .xlsx file format that is the default in Office 2007/2008 is also a good choice because it allows direct access to the underlying text while preserving layout. It may also provide a shortcut: the format uses a shared strings table, so you can translate a string once and have the translation appear everywhere that string is used in the file.
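To illustrate why the shared strings table is a shortcut, here is a minimal sketch in Python. It assumes the table follows the usual SpreadsheetML layout (an sst element containing si entries whose text lives in t elements); the function name and the tiny sample table are hypothetical. Replacing the text of one entry is enough, because cells refer to the entry by index rather than storing their own copy.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by xl/sharedStrings.xml
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

def translate_shared_strings(xml_text, translations):
    """Return the shared strings XML with every known string replaced."""
    ET.register_namespace("", NS)  # keep the default namespace on output
    root = ET.fromstring(xml_text)
    for t in root.iter(f"{{{NS}}}t"):
        if t.text in translations:
            t.text = translations[t.text]
    return ET.tostring(root, encoding="unicode")

# Hypothetical table: "Total" is used twice in the workbook (count=3)
# but stored only once (uniqueCount=2).
sample = (
    f'<sst xmlns="{NS}" count="3" uniqueCount="2">'
    "<si><t>Total</t></si><si><t>Price</t></si></sst>"
)
result = translate_shared_strings(sample, {"Total": "Gesamt"})
```

This does a whole-string lookup only. Entries with rich text split the string across several t elements, so those would need to be handled by hand or segmented first.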

Rename the .xlsx file in the Finder, adding a ".zip" extension, and decompress it. Inside the resulting folder, look for xl/sharedStrings.xml, then check each sheet file (found in xl/worksheets/) for strings that are not in that file. Note that text boxes are in xl/drawings/drawing1.xml and similar files. To segment properly, see my post on segmenting Office XML. After translating, compress the contents of the folder (not the folder itself, so that [Content_Types].xml sits at the top level of the archive) and rename the archive with the ".xlsx" extension.
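The unpack and repack steps can also be scripted, which avoids the renaming. The following is a rough sketch using Python's standard zipfile module; the function names are made up, and the demo builds a small stand-in archive rather than a real workbook. Note that repack_xlsx stores paths relative to the folder, so the parts end up at the top level of the archive.

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_xlsx(xlsx_path, dest_dir):
    """Extract the workbook and list the parts most likely to hold text."""
    with zipfile.ZipFile(xlsx_path) as zf:
        zf.extractall(dest_dir)
        names = zf.namelist()
    return [
        n for n in names
        if n == "xl/sharedStrings.xml"
        or n.startswith("xl/worksheets/")
        or n.startswith("xl/drawings/")
    ]

def repack_xlsx(src_dir, xlsx_path):
    """Zip the folder's contents (not the folder itself) into a new .xlsx."""
    src = Path(src_dir)
    with zipfile.ZipFile(xlsx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(src).as_posix())

# Demo with a stand-in archive; a real workbook has many more parts.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "book.xlsx"
    with zipfile.ZipFile(original, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("xl/sharedStrings.xml", "<sst/>")
        zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    text_parts = unpack_xlsx(original, tmp / "unpacked")
    # ... translate the files listed in text_parts here ...
    repack_xlsx(tmp / "unpacked", tmp / "translated.xlsx")
    with zipfile.ZipFile(tmp / "translated.xlsx") as zf:
        repacked_names = sorted(zf.namelist())
```

Work on a copy of the file, and open the repacked workbook in Excel to confirm it still loads before delivering it.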
