
How does VisibleThread delineate content when splitting out/shredding documents?

 

We use format-specific algorithms to split a document into contiguous pieces of content. The exact algorithm depends on whether the document we analyze is MS Word or PDF.

MS Word documents come in two native formats: .docx, which is XML-based, and .doc, which is binary. Microsoft introduced the newer .docx format in MS Word 2007, and it is the more reliable of the two for content extraction and delineation. In both cases, we extract what MS Word itself considers paragraphs of text.
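To illustrate why .docx lends itself to paragraph extraction, here is a minimal sketch using only the Python standard library. It is not VisibleThread's implementation. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`, with each paragraph stored as a `w:p` element. The function name `docx_paragraphs` is a hypothetical name chosen for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the non-empty paragraph texts from a .docx file.

    `source` is a file path or a binary file-like object.
    Illustrative only: headers, footers, footnotes and comments live in
    other parts of the archive and are not read here, and paragraphs
    nested inside text boxes may be reported more than once.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every w:t piece.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs("proposal.docx")` (a hypothetical file name) would return one string per paragraph that Word itself defined, with no guessing about where a paragraph starts or ends.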

In the case of PDFs, the underlying format derives from PostScript, a page-description language. Unlike MS Word, the format itself codifies no implicit knowledge of what is a heading, a paragraph, and so on. As such, we use heuristic analysis (based on language learning) to determine the beginning and end points of paragraphs. Our analysis algorithms are trained on technical publications in the English language.
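To give a feel for what heuristic paragraph detection involves, the sketch below groups positioned text lines into paragraphs based on vertical spacing alone. This is an assumption-laden illustration, not VisibleThread's algorithm, which is described above as language-based and trained on technical publications. The `Line` type, the `group_paragraphs` function, and the `gap_factor` threshold are all hypothetical, and the sketch assumes the lines and their positions have already been extracted from the PDF in reading order.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of text with its position on the page (hypothetical type)."""
    text: str
    top: float     # distance from the top of the page
    height: float  # height of the line


def group_paragraphs(lines, gap_factor=0.6):
    """Group lines into paragraphs using vertical whitespace only.

    A new paragraph starts when the gap above a line exceeds
    `gap_factor` times the previous line's height. The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    paragraphs = []
    current = []
    prev = None
    for line in lines:
        if prev is not None:
            gap = line.top - (prev.top + prev.height)
            if gap > gap_factor * prev.height:
                paragraphs.append(" ".join(current))
                current = []
        current.append(line.text.strip())
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A spacing-only rule like this breaks down on multi-column layouts, tables, and tightly set headings, which is one reason a production approach needs richer signals than geometry.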

As with MS Word, the intent of the extraction is to isolate paragraphs, meaning the blocks of text that a human reader would recognize as a paragraph.

Regardless of the delineation mechanism, all text in both MS Word and PDF documents is analyzed.
