Q. I have a PDF that consists only of images. Can VisibleThread parse it?
[Updated Mar 2019]
A. VisibleThread Docs does not analyze PDFs that consist only of images. You will get a zero word count.
This is because:
- Extracting text from images is an inexact science. Consider handwriting versus typed text in an image: in either case, results vary considerably with the legibility of the text and the clarity of the scan. They also depend on the OCR (Optical Character Recognition) technology used.
- Many third-party conversion utilities can help, and many of our customers already have one (e.g. Adobe Professional).
So, if you have PDFs consisting only of scanned images, what are your options?
- Use a third-party utility to convert the images to text first, then upload the result. Many OCR utilities exist, but VisibleThread cannot vouch for their reliability: our tests indicate that accuracy can be mixed.
The best way to find available utilities is to search the web for these words (or similar): ‘image conversion pdf to word OCR’
- If possible, request the PDF in text form from the issuing authority, and then upload that to VT Docs.
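
Before choosing either option, it can help to confirm that a PDF really has no text layer. The following is a minimal sketch, not part of VisibleThread, and it assumes Python with the third-party pypdf library installed. The function names are hypothetical, and the word count here is a simple whitespace split, which may differ from how VT Docs counts words.

```python
def count_words(page_texts):
    """Total number of whitespace-separated words across all pages."""
    return sum(len(text.split()) for text in page_texts)


def looks_image_only(page_texts):
    """True when no page yielded any extractable text."""
    return count_words(page_texts) == 0


def extract_page_texts(pdf_path):
    """Return the embedded text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party library, imported only when needed

    reader = PdfReader(pdf_path)
    # extract_text() reads the existing text layer only; it does not perform OCR,
    # so pages that are purely scanned images come back empty.
    return [page.extract_text() or "" for page in reader.pages]
```

For example, looks_image_only(extract_page_texts("scan.pdf")) returning True suggests the file (here a placeholder name) needs OCR conversion, or a text version from the issuer, before upload.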