Monday, August 25, 2008

Segmenting (1) Microsoft Office OOXML

In previous posts on getting at source text, I outlined how to get at Microsoft Office formats by saving to the new OOXML formats (.docx, .xlsx, .pptx) and then unzipping the package.
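
As a quick reminder of the unzipping step, here is a minimal Python sketch. It builds a tiny stand-in package in memory so that it runs anywhere. The part name word/document.xml is the main document part of a .docx; the sample content and the file name report.docx in the comment are made up for illustration.

```python
import io
import zipfile

# Build a tiny stand-in package in memory; a real .docx has many more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr(
        "word/document.xml",
        "<w:document><w:t>Hello</w:t></w:document>",
    )

# For a file on disk you would open it by path instead,
# e.g. zipfile.ZipFile("report.docx") (hypothetical file name).
with zipfile.ZipFile(buffer) as package:
    part_names = package.namelist()
    document_xml = package.read("word/document.xml").decode("utf-8")

print(part_names)
print(document_xml)
```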

I'd like to expand on those articles by talking about segmentation. When you open the OOXML file parts (the XML files inside the package), basically all of the text you will want to access is in <t></t> tags. Most of the time these tags have a single-letter namespace prefix, such as <w:t>, but not always. The HTML segmenting rules in AppleTrans can get at the text you want, but you may find that they also segment things you don't want. The HTML rule is also very complex, which makes segmenting slower than it needs to be.

To address these issues we will add a segment rule specifically for Microsoft Office OOXML.

First, go find AppleTrans in the Finder and select Show Package Contents from the context menu. Navigate to Contents/Resources/English.lproj/SegmentRules.plist. This file contains the rules for segmenting. If you think you will edit this file fairly regularly, you might find it useful to make an alias and save it somewhere for easy access.

The SegmentRules.plist file is a standard Apple property list file. If you have the developer tools installed you can use the handy Property List Editor application to do the following steps, but a plain text editor will do fine as well.
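
If you prefer to inspect the rules from a script, Python's plistlib can read the same format. The sketch below is only an illustration: it parses an inline stand-in rather than the real file, and it assumes the file is a dictionary of named rules, each holding Prefix, Segment and Suffix strings (the layout of the rules shown later in this post). Note how the &lt; and &gt; entities in the file decode to plain angle brackets in the pattern.

```python
import plistlib

# Stand-in for SegmentRules.plist. The real file's overall layout is an
# assumption here; only the single rule below is taken from this post.
SAMPLE = rb"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>XML Plist</key>
    <dict>
        <key>Prefix</key>
        <string>&lt;string&gt;</string>
        <key>Segment</key>
        <string></string>
        <key>Suffix</key>
        <string>&lt;\/string&gt;</string>
    </dict>
</dict>
</plist>
"""

rules = plistlib.loads(SAMPLE)

# Print each rule name with its decoded prefix and suffix patterns.
for name, rule in rules.items():
    print(name, "| prefix:", rule["Prefix"], "| suffix:", rule["Suffix"])
```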

Segmentation rules are basically just groups of regular expressions. I haven't tested all the possibilities thoroughly, but the matching appears to be non-greedy (meaning that when several substrings could match a pattern, the shortest one is taken).
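
To show what greedy versus non-greedy means in practice, here is a small comparison using Python's re module. This illustrates the general regex behaviour only; AppleTrans uses its own regex engine, which I am not reproducing here.

```python
import re

sample = "<w:t>Hello</w:t><w:t>World</w:t>"

# Greedy: .* runs on to the last closing tag, so both runs merge into one match.
greedy = re.findall(r"<w:t>(.*)</w:t>", sample)

# Non-greedy: .*? stops at the first closing tag, giving one match per run.
lazy = re.findall(r"<w:t>(.*?)</w:t>", sample)

print(greedy)  # ['Hello</w:t><w:t>World']
print(lazy)    # ['Hello', 'World']
```

The non-greedy result is the one you want for segmentation, since each <t> element becomes its own segment.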

To make our pattern we can borrow the existing rule that looks the most like the one we want. The XML Plist rule matches strings in an XML file by looking for segments in <string></string> tags. The rule is as follows:


    <key>XML Plist</key>
    <dict>
        <key>Prefix</key>
        <string>&lt;string&gt;</string>
        <key>Segment</key>
        <string></string>
        <key>Suffix</key>
        <string>&lt;\/string&gt;</string>
    </dict>


Copy that rule and change the values to match the OOXML tags you want:


    <key>Microsoft Office XML</key>
    <dict>
        <key>Prefix</key>
        <string>&lt;(.:)?t&gt;</string>
        <key>Segment</key>
        <string></string>
        <key>Suffix</key>
        <string>&lt;\/(.:)?t&gt;</string>
    </dict>


Save the file. The next time you open AppleTrans, you should be able to select the Microsoft Office XML rule and have it segment all of the text. This will still take a while with a lot of text, but it should be faster than the HTML rule.
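
If you want to sanity-check the patterns outside AppleTrans, the following Python sketch approximates the rule. It assumes a segment is whatever lies between the prefix and the suffix, matched non-greedily, which is my reading of how the rules behave rather than something I have confirmed. The sample XML strings, the segments helper and the PREFIX_WITH_ATTRS variant are all made up for illustration.

```python
import re

# Prefix and Suffix from the rule, with the XML entities decoded.
PREFIX = r"<(.:)?t>"
SUFFIX = r"<\/(.:)?t>"

def segments(xml, prefix=PREFIX, suffix=SUFFIX):
    """Return the text found between each prefix/suffix pair (non-greedy)."""
    pattern = re.compile(prefix + r"(?P<segment>.*?)" + suffix, re.DOTALL)
    return [m.group("segment") for m in pattern.finditer(xml)]

word_xml = (
    '<w:p><w:r><w:t>First run</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> second run</w:t></w:r></w:p>'
)
sheet_xml = "<si><t>Cell text</t></si>"

print(segments(word_xml))   # ['First run']
print(segments(sheet_xml))  # ['Cell text']

# Hypothetical variant, untested in AppleTrans: also accept attributes
# on the opening tag, such as xml:space="preserve".
PREFIX_WITH_ATTRS = r"<(.:)?t(\s[^>]*)?>"
print(segments(word_xml, prefix=PREFIX_WITH_ATTRS))  # ['First run', ' second run']
```

In Python, at least, the prefix as written only matches opening tags with no attributes, so a run like <w:t xml:space="preserve"> is skipped. If you notice text missing after segmenting, that is one thing to check.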
