A command-line tool for running OCR on PDFs: Simon Willison


This tool is on the techier end of the spectrum of things we talk about around here, but I suspect it might be of interest to those dealing with large, legacy PDF files:

Simon is, among other gurudoms, a Python guru, and makes stuff so cool that even a JavaScript addict like myself always checks it out! (I mean, he is also a JavaScript guru. Anyway.)

The idea here is that you upload a PDF to a “bucket” (essentially a cloud-based directory) on Amazon S3, and then use Amazon’s separate service (“Textract”) to do the OCR. I was surprised to see that the description on the page claims that Textract can recognize handwriting.
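
For anyone curious what that upload-then-OCR flow looks like in code, here is a rough Python sketch using boto3, the AWS SDK for Python. To be clear, this is my own illustration and not how s3-ocr itself is written. The function names, the polling interval, and the bucket and file names are all made up for the example, and it assumes your AWS credentials are already configured and that the PDF is allowed to be read by Textract.

```python
import time


def start_ocr_job(bucket, key, textract):
    """Ask Textract to OCR a PDF that already sits in S3; returns the job ID."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def wait_for_pages(job_id, textract, poll_seconds=5):
    """Poll until the job leaves IN_PROGRESS, then collect every result page."""
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended as {result['JobStatus']}")
    pages = [result]
    # Long documents come back in chunks linked by NextToken.
    while "NextToken" in result:
        result = textract.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        pages.append(result)
    return pages


def lines_of_text(pages):
    """Keep only the LINE blocks, which hold one recognized line of text each."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def ocr_pdf(local_path, bucket, key):
    """Upload a local PDF to S3, run Textract on it, return the text lines."""
    import boto3  # imported here so the helpers above work without AWS set up

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    s3.upload_file(local_path, bucket, key)
    job_id = start_ocr_job(bucket, key, textract)
    return lines_of_text(wait_for_pages(job_id, textract))
```

Calling something like `ocr_pdf("index.pdf", "my-hypothetical-bucket", "index.pdf")` would give you back a list of strings, one per recognized line. As I understand it, the point of Simon's tool is that you don't have to write or babysit any of this yourself.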

As always, Simon’s writeup on how to use his s3-ocr tool is very clear. If anyone here ends up giving it a try, a post-mortem would be very interesting.

The before/after example is not perfect (to my eyes, it’s not as good as Transkribus), but it’s pretty compelling:

In. In J a ... the Joe 14
Lalpa spinosa, Eggt bud development. of
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining & mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
forby, Presidents address
pongida, difficulties of classification
tage, american adjustable concentric
ttlese staining & mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, & havi cula crassinervis 265
falicylic acid u movorcopy
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
tap, circulation of the
Jars, structure of the genus Brisinga
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process

See also le Tweet.