Apropos of not much, I ran across this PDF to HTML converter and I have to say it’s pretty amazing.
Tools like this often float around in the scammier ends of the internet, but this one really seems to be well done. The HTML it outputs is pretty good, and it will even extract fonts (!).
You upload a PDF, and it gives you a zip file to download. When you open the .zip you get a directory structure something like this:
Of course, if the data in the PDF is wacky in the first place, it can’t fix that, and it doesn’t do OCR. But in the fortunate instance that you have a pretty well-structured PDF, getting an HTML version can be very helpful, because it’s muuuch easier to process than a PDF.
If the text is well-structured, the index.html file even gives you a pagination/search interface.
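To give a sense of what "easier to process" can look like, here is a minimal sketch that pulls the visible text out of an HTML string using only Python's standard library. It isn't tied to this converter's output: the names `TextExtractor` and `extract_text` are made up for illustration, and I'm assuming you just want plain text with scripts and styles skipped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

You would read one of the HTML files from the unzipped directory into a string and pass it to `extract_text`. The exact file names depend on what the converter produces, so check the zip contents rather than trusting a guess.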
I mean, in an ideal world we wouldn’t have to do this at all because things wouldn’t be PDF-only in the first place, but the is the !