When you scan identical PDF and Word documents, the results can sometimes differ significantly.
The difference in scores has two causes:
1. The PDF document contains more words than the Word document.
2. The PDF document contains fewer but longer sentences.
Before explaining why this is the case, here is some background information:
We use different techniques to parse (read) Word documents and PDF documents.
Parsing a Word document is the more straightforward of the two. Word documents are text-based, and the internal structure of the format marks where words, sentences, and paragraphs begin and end.
PDF documents are not like this.
When we parse a PDF document, we see only streams of characters and position references. Unlike MS Word, the format does not encode what is a heading, a paragraph, and so on. We therefore use heuristic analysis (based on language learning) to determine where paragraphs and sentences begin and end. Our analysis algorithms are trained on technical publications in the English language.
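To make the idea of "characters and position references" more concrete, here is a minimal sketch of one common way to rebuild words from positioned characters: start a new word whenever the horizontal gap between two characters exceeds a threshold. This is an illustration only, not our actual parsing algorithm. The function name, the threshold values, and the coordinates are all assumptions invented for the example.

```python
from typing import List, Optional, Tuple

# Each glyph is (character, x_start, x_end) on a single line of text.
# All coordinates below are invented for illustration.
Glyph = Tuple[str, float, float]


def group_into_words(glyphs: List[Glyph], gap_threshold: float) -> List[str]:
    """Start a new word whenever the horizontal gap between two
    consecutive glyphs is larger than gap_threshold."""
    words: List[str] = []
    current = ""
    previous_end: Optional[float] = None
    for char, x_start, x_end in glyphs:
        if previous_end is not None and x_start - previous_end > gap_threshold:
            words.append(current)
            current = ""
        current += char
        previous_end = x_end
    if current:
        words.append(current)
    return words


# "An easy", with slightly loose spacing between the "a" and "s" of "easy".
glyphs = [
    ("A", 0.0, 5.0), ("n", 5.5, 10.0),
    ("e", 14.0, 18.0), ("a", 18.5, 22.5), ("s", 24.0, 28.0), ("y", 28.5, 32.5),
]

print(group_into_words(glyphs, gap_threshold=2.0))  # ['An', 'easy']
print(group_into_words(glyphs, gap_threshold=1.0))  # ['An', 'ea', 'sy']
```

The same characters produce different words depending on the threshold, which is why spacing decisions made by the tool that generated the PDF can change what a parser sees.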
Occasionally you will see anomalies where we slice the words or sentences in the PDF incorrectly. However, many of the 'irregularities' in the output are caused by elements beyond our control, such as the configuration of the tool that was used to generate the PDF.
We do regularly review our PDF parsing algorithms to handle as many of these scenarios as we can.
The PDF document contains more words
Take the following example:
'An easy way to claim'
In the underlying PDF document this text may be stored as:
'A n ea sy way to c la i m'.
Counting each space-separated fragment, our PDF parser would therefore see this as 10 words rather than 5. It's difficult to say exactly why the text may be structured like that in a PDF document, and it should be considered on a case-by-case basis.
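The effect on the word count can be reproduced with a few lines of Python. This sketch assumes that a word is simply any run of characters separated by whitespace. That is an assumption made for illustration; the real parser may apply further rules and arrive at a different figure.

```python
# The text as the author wrote it, and as it may be stored in the PDF.
intended = "An easy way to claim"
stored = "A n ea sy way to c la i m"


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (a simplifying assumption)."""
    return len(text.split())


print(count_words(intended))  # 5
print(count_words(stored))    # 10
```

Because the fragments are shorter than the original words, the word count rises even though the visible text on the page is identical.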
The PDF document contains fewer but longer sentences
This issue is caused by how we determine the beginning/end of a sentence.
As explained above, this is straightforward with MS Word because paragraph and section markers are encoded in the document. With PDF documents, we must try to infer where sections and paragraphs begin and end.
As an example, take a section heading followed by a paragraph:
Section 1
This is the first paragraph.
In the Word document, we see this as 2 sentences. However, when we look at the underlying text in the PDF document, it appears as:
Section 1 This is the first paragraph.
Our PDF processor reads this as 1 long sentence because it cannot find any clues indicating where the text should be split into multiple sentences.
This issue occurs at every section within the document. In the underlying PDF, the section heading actually appears as belonging to the first sentence in the section.
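The difference can be sketched in Python. The splitter below is a deliberately naive stand-in that breaks text after '.', '!' or '?', and it is not our production algorithm. The Word case is modelled as a list of paragraphs that are already separated, and the PDF case as one flat string. Both representations are assumptions made for the example.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Word: the format tells us the heading and the paragraph are separate blocks.
word_paragraphs = ["Section 1", "This is the first paragraph."]

# PDF: the heading and the paragraph arrive as one run of text.
pdf_text = "Section 1 This is the first paragraph."

word_sentences = [s for p in word_paragraphs for s in split_sentences(p)]
pdf_sentences = split_sentences(pdf_text)

print(len(word_sentences), word_sentences)
# 2 ['Section 1', 'This is the first paragraph.']
print(len(pdf_sentences), pdf_sentences)
# 1 ['Section 1 This is the first paragraph.']
```

With no paragraph boundary and no punctuation after the heading, the splitter has nothing to split on, so the PDF yields fewer and longer sentences than the Word document.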