Monday, August 18, 2008

Working With Sources (5) PDF - the good

After documents that can be translated as text or rich text and Microsoft Office documents, the next most common file format in my translation work is Portable Document Format (.pdf). There are, for our purposes, three kinds of PDF file: files containing useful text (the good), files containing useless text (the bad) and photo/scan files (the ugly). The one good thing about PDF documents is that, in general, there is no easy way to edit a PDF file, so the client does not expect the formatting to be reproduced.

This time we will look at how to access the good. These kinds of files are often generated from documents that are ready to print, such as posters, brochures, investor relations materials and so on.

The simple way to know if this is the kind of document you are dealing with is to open it in Preview and try selecting the text with the text selection tool.



Being able to select text doesn't by itself qualify it as a "good" file. Try selecting an entire paragraph and pasting it into an AppleTrans document. If the text comes out in a generally readable form you're most of the way there.

If you've come across a "good" file there are a number of ways to get the text out in a nice way. Let me introduce a few.

The first technique is to just open the document in the latest version of Adobe Reader, which selects text in columns properly by default. With it installed you can often get away with a simple select all, copy and paste.

This is also possible with Preview for those who don't want to install extra software: copy and paste text in order into a text file. Note that you can select multiple lines and then join them afterwards (for example by running a script or regular expression over them). The advantage of Adobe Reader is that it selects text in columns intelligently. A multi-column document selected in Preview will interweave the lines.
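As a rough illustration of the "join them afterwards" step, here is a minimal Python sketch. It assumes the pasted text keeps a blank line between paragraphs and that every other line break is just a wrap from the PDF layout. The function name join_wrapped_lines and the separator parameter are made up for this example; they are not part of Preview or AppleTrans.

```python
import re


def join_wrapped_lines(text, separator=" "):
    """Join hard-wrapped lines into paragraphs.

    Blank lines are treated as paragraph breaks; every other line break
    is replaced by `separator`. Pass separator="" for languages that are
    written without spaces between words.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into paragraphs wherever one or more blank lines occur.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and trim stray spaces at the line edges.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        joined.append(separator.join(lines))
    return "\n\n".join(joined)
```

This does not repair words hyphenated across a line break, and it cannot untangle lines that Preview has interwoven from two columns, so the result still needs a read-through.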

Tip: When a document has text boxes, columns or other odd layouts, hold Option while dragging with Preview's text selection tool to make selecting the text easier.

Here are some methods suggested by Elmars:


Another rather easy way to extract text from a "good" PDF file for translation with AppleTrans uses the free Macintosh command-line utility pdftotext.

It has its own installer, which places the program executable in /usr/local/bin/.

Open a terminal and type:

/usr/local/bin/pdftotext -htmlmeta *.pdf *.html

Replace the * with the name of your PDF file.

Tip: Spaces in the filename must be preceded by a backslash (\).

Using the -htmlmeta flag helps to retain the encoding in the resulting simple HTML file. Open it with TextEdit and resave as .txt or .rtf for use with AppleTrans. Of course, you will still have to check and arrange the text before translation.

The advantage of this method is that, because pdftotext is a command-line tool, it can easily be used for batch processing.
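For batch processing, something along these lines might work. It is only a sketch: it assumes pdftotext is installed at the location mentioned above, and the function names build_command and convert_folder are invented for this example.

```python
import subprocess
from pathlib import Path
from typing import List

# Install location given in the post; adjust if yours differs.
PDFTOTEXT = "/usr/local/bin/pdftotext"


def build_command(pdf_path: Path) -> List[str]:
    """Build the pdftotext call for one PDF, writing name.html next to name.pdf."""
    html_path = pdf_path.with_suffix(".html")
    return [PDFTOTEXT, "-htmlmeta", str(pdf_path), str(html_path)]


def convert_folder(folder: str) -> List[Path]:
    """Run pdftotext on every .pdf file in `folder`; return the HTML paths."""
    outputs = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        command = build_command(pdf_path)
        # check=True raises an error if pdftotext reports a failure.
        subprocess.run(command, check=True)
        outputs.append(Path(command[-1]))
    return outputs
```

Calling convert_folder with the path of a folder converts every PDF in it. Because the arguments are passed as a list rather than typed into a shell, filenames with spaces need no backslashes here.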

Furthermore, OpenOffice 3 for Mac (soon to be made official; in the meantime, download the release candidate) has a plugin called pdfimport that converts PDF files to OpenOffice Draw files.

In the converted Draw file, each text line is in a separate text box. If you open the Draw file (.odg) as a zip file, all the text can be found in the content.xml file. After translating with AppleTrans and inserting the new content.xml into the Draw file, you can open the .odg file and export it as PDF.
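To get a quick look at what is inside content.xml without unzipping by hand, a short Python sketch like the following may help. It assumes the text sits in the standard OpenDocument text:p paragraph elements; the function name extract_draw_text is made up for this example. It only reads the text and does not write a translated content.xml back into the Draw file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_draw_text(odg_file):
    """Return the text of each non-empty text:p paragraph in content.xml.

    `odg_file` may be a path to an .odg file or an open binary file object.
    """
    with zipfile.ZipFile(odg_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    lines = []
    for paragraph in root.iter("{%s}p" % TEXT_NS):
        # itertext() also collects text held in nested spans.
        text = "".join(paragraph.itertext()).strip()
        if text:
            lines.append(text)
    return lines
```

Since each text line ends up in its own text box, the list this returns will usually contain one entry per line of the original layout rather than one per paragraph.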


If you know of another good way, please write it down in the comments for everyone else, and I'll update this page as necessary.

2 comments:

Steven DeWitt said...

Hi all,

I've been using AppleTrans for several years for various tasks and was pleased to come across this blog. Don't forget about the PDF filter in the Files section of the Yahoo! AppleTrans group; it works very well.

Cheers,

Steven

jen grasso said...

Sorry, I couldn't find a contact so I'm writing regarding a question. Is it possible to import existing memories into AppleTrans? When I found AppleTrans through the Apple site, I also saw other recommended programs - AppleGlot, various different glossaries, etc. I'm new to CAT software but have been translating the 'old school' way for a while so I'm looking to ease the process. I'd rather have some libraries to work with than start from scratch with all my new projects if possible.
Sorry if this is the wrong place to ask such a thing, I'm just short of finding answers online.