By Ken Fox
You know what really grinds my gears? When I open a PDF file containing what appears to be digitally formatted text and find that it is non-copyable and non-searchable. Searching, copying and pasting text are essential functions of digital communications – so the idea that a text is born digital and therefore ASCII (American Standard Code for Information Interchange) encoded, and that somebody wittingly or unwittingly should then remove that functionality, leads to much weeping and wailing and gnashing of teeth on my part.
Well, just last week I was sent a large PDF document with more than 70 pages of text. So I opened it in Adobe Acrobat, tried to execute a search for a key term, and found that it was (you guessed it) another one of those documents that had signs of ASCII-formatted text in its ancestry but had, through the manipulations of some kind of monster, been reduced to the mere semblance of text, no more searchable than a stack of paper.
So naturally I commenced with my usual process of wailing and gnashing, but after a few minutes of that, in near desperation, I got a notion that – just maybe – if I “select all” and paste it into a text editor, some hitherto-hidden ASCII-encoded text might appear. Worth a try, right?
So I hit control-A, and THIS happened:
“Why yes,” I said out loud, “in fact I WOULD like to run text recognition to make the text on this page accessible – THANKS for asking!”
I clicked Yes.
Then I got asked for some settings, which I ignored and just clicked OK – opting for the default option in my excitement.
Adobe Acrobat then leapt through my document, systematically performing the miracle of breathing life into the dead letters at the rate of about a page a second – slightly faster for the “born digital” main portion, and a bit slower for some appendices that bore the stigmata of pre-digital technology.
The result was perfectly copyable, pastable, searchable text in the main body of the document. As for the typewritten appendices, Acrobat almost flawlessly converted them into digital text as well, while maintaining the visual features of the original typed text. Basically, the document looked identical to how it had looked prior to the procedure but was now digitally functional. The only letters and numbers that resisted the resurrection were data from a single table with a very small typeface – those few characters remained a heretical community of graphics in the midst of a near-universal mass conversion.
Optical character recognition (OCR) technology has come a long way in a few short years.
Now if you work anywhere in the legal industry (or do any kind of office work), then there is a good chance you have been able to follow right along, and to some of you, this is already old news and why am I boring you. But if there are any among you who don’t know what I’m talking about with text that can be searched and copied – you need to learn a few tricks that will make your life a whole lot easier. Begin with learning these commands, which work on almost all text-editing software:
Ctrl-F … Find text in document
Ctrl-A … Select All
Ctrl-X … Cut selected text
Ctrl-C … Copy selected text
Ctrl-V … Paste the last text you cut or copied
Ctrl-Z … Undo last operation
Ctrl-Y … Redo undone operation
Ctrl-H … Find all identified text in document and replace with other text
You can use point-and-click menus for these operations as well, but I find the keyboard shortcuts easier. These features, and many others, are now standard practice in office work – so learning them will not get you ahead so much as get you caught up with the rest of us.
And if you ever come across a text, especially a longish one, for which the above commands do not work, try to keep the weeping, wailing and tooth-gnashing to a minimum. And when you are done with that, wipe the tears off your keyboard and try the simple operation described above. Failing that, try something else. And if all else fails, ask your friend in IT to perform a miracle. Because there is no reason to tolerate text in a digital file that cannot function as digital text.
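And for that friend in IT, here is one rough way to tell in advance which pages of a document are mere pictures of text. It is only a sketch, not what Acrobat does under the hood. It assumes you have already pulled the text out of each page with some PDF library (pypdf's `extract_text()` is one option), and the function name `pages_needing_ocr` and the 20-character threshold are my own inventions for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text looks empty.

    page_texts: one string per page, as returned by a PDF library's
    text extraction (assumed here, e.g. pypdf's page.extract_text()).
    min_chars: hypothetical threshold; pages with fewer visible
    characters are treated as image-only and flagged for OCR.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; treat None as empty.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


# Made-up sample: pages 2 and 3 have no usable text layer.
sample = [
    "This agreement is made between the parties on the date set out below.",
    "",
    "   \n  ",
    "Appendix A: schedule of typed exhibits and related correspondence.",
]
print(pages_needing_ocr(sample))  # prints [2, 3]
```

Any page that lands on the list is a candidate for text recognition. Pages with tiny type, like my stubborn table, may still need a human eye afterwards.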