OCR, those three arsey initials. They menace us in different ways.

Historically, the OCR is the Organically Moderated and Cooled Reactor, an early nuclear reactor, one of which operated in Canada (also home of the Ottawa Central Railway) between 1965 and 1985. The inherent advantage of the OCR’s high negative temperature coefficient is offset by the fact that it also increases reactor control difficulty. For example, since the coolant and moderator are one and the same, the entry of relatively cold, dense coolant into the core will increase moderation, slowing down more neutrons, and cause reactivity to increase. The resulting power increase would rapidly be quelled by the effect of the negative temperature coefficient but may cause the reactor to shut down prematurely. It doesn’t take a genius to see why it never caught on, even in Ohio.

During my adolescence, OCR betokened the newly formed Oxford, Cambridge and RSA examination board, which tested me periodically to see if I had accidentally absorbed anything of value the teachers might have said. When the school system finally spat me out, I hoped for a letter of congratulation, or at least of apology. Nothing.

And now we come to the greatest bugbear of them all: Optical Character Recognition. You’ve probably encountered OCR somewhere online. Here’s the principle: a page of text is scanned as an image file, and a program identifies the characters and converts the image into searchable, manipulable text. It’s ingenious, when it works.
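For the curious, here is a toy illustration of that principle, and of how it goes wrong. It is only a sketch under generous assumptions: the 3x5 bitmaps, the three-letter alphabet and the names `TEMPLATES`, `distance`, `recognize` and `read_line` are all invented for this example, and real OCR programs are far more sophisticated than nearest-template matching. Each "scanned" glyph is simply compared with every known letter shape, and the closest match wins.

```python
# Toy OCR: classify tiny bitmaps by nearest template.
# Everything here is hypothetical and purely illustrative;
# it is not how any particular OCR product works.
TEMPLATES = {
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "C": ["###", "#..", "#..", "#..", "###"],
    "R": ["##.", "#.#", "##.", "#.#", "#.#"],
}


def distance(a, b):
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def read_line(glyphs):
    """Recognize a sequence of glyph bitmaps and join them into text."""
    return "".join(recognize(g) for g in glyphs)


# A 'C' with two stray specks of ink on its right-hand side.
smudged_c = ["###", "#.#", "#.#", "#..", "###"]

clean = read_line([TEMPLATES["O"], TEMPLATES["C"], TEMPLATES["R"]])
smudged = read_line([TEMPLATES["O"], smudged_c, TEMPLATES["R"]])

print(clean)    # the pristine scan
print(smudged)  # the same word after the smudge
```

The pristine scan comes back as OCR. Two stray pixels put the smudged C nearer to the O template than to its own, so the second line comes back as OOR, with no warning that anything is amiss.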

I first encountered OCR about twelve years ago, when the Music department at my school obtained a scanner for use with the music notation program Sibelius. The thought was that you could scan in a page of music and it would be magically transformed into a MIDI sequence. In reality, I suspect what generally happened was that music that should have sounded like this:

ended up sounding like this:

Other OCR failures I have known: Gramophone magazine’s website used to contain a phenomenally useful (if not flawless) searchable database of reviews called Gramofile, than which no greater online resource for reviews of classical music recordings has ever existed. A few years ago, it was abandoned in favour of the Gramophone Archive, a complete online archive of the magazine. A laudable project, but to date not an unmitigatedly successful one. The outcry against Gramofile’s obliteration continues. The new archive appears to be improving, but one still encounters statements like

In terra pox was commissioned by the Swiss Radio to mark the end of the Second World War

and references, for example, to Fauré’s song

Rave d’amour, Op 5 No 2

It’s very big on the funky house scene, apparently.

The Library of America has very high production values, so it is a surprise to find in an otherwise exemplary edition of Saul Bellow’s Novels, 1956-1964, a description of

the frazzle-faced Mr Penis

This is in fact Mr Perls, a minor character from Bellow’s Seize the Day. His name is rendered correctly in all other instances, and presumably the error has arisen from inaccurate OCR, given the visual similarity of ‘rl’ and ‘ni’. Curious that nobody should have picked up on it before publication, though, and perhaps it has been corrected in later printings. Of course, it’s just the kind of thing that would happen to Bellow’s hapless protagonist, Tommy Wilhelm. He sweats out a book, finally gets it published, and there turns out to be a genital typo on page 34.

The reason I’m writing about OCR at all is that last month I encountered an academic article about OCR that had itself been poorly converted. It is ‘Optical formula recognition based on structural features’ by Xue-dong Tian of Hebei University in Baoding, China, and if you happen to be a member of a subscribing institution you may peruse it here.

Viewed as text, the scanned article’s abstract suffers badly. The OCR clearly can’t cope with Times New Roman Bold. It begins:

Automatic wognition of Pormnlas is one of the key parts in an OCR system.

An unusual opening gambit. I’m having trouble wognising a couple of the words.

It cvuld be really useful to be able to re-use knowledge in SeientEc boob which are not adable in electronic form.

Still lost, but I’m intrigued by the SeientEc boob. Let’s give it one more chance.

First seplrh and process COM& components to gain the symbol components, and then recognize the symbol. Atkr that, analym the structure of formula on the ba$is of the refognition result and the geometry features.

By novv my interect has gonc entire1y. That’s tje probkm witj 0CR. It gets on jour nervos.

