
Challenges Converting PDF Files


If you are seeing poor parsing results from a PDF, the cause is almost certainly that the PDF looks great on screen but is internally corrupt.

There is a simple way to find out whether a PDF is corrupt. Open the file in the free Adobe Acrobat Reader software, choose File -> Save as Text, and save the file. Then open the saved file in a text editor such as Notepad or UltraEdit. If the PDF is corrupt, the damage will almost certainly be obvious at a glance.

Here is an example of a PDF that looks great but is internally corrupt:

You can see below that the saved text is a mess, with one word per line and some characters replaced by digits:

Summary


.
6+
years
experience
in 
fast-paced 
agile 
environments 
managing 
mul9ple 
projects 
. 
Strong 
communica9on 
skills 
with 
ability 
to 
ar9culate, 

[OMITTING THE REMAINDER]
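
The two symptoms visible above (one word per line, and digits appearing in the middle of words) can also be checked automatically once the text has been saved from Acrobat Reader. The following is only a rough sketch in Python, not part of any Sovren product: the function names and the thresholds are assumptions made for illustration, and legitimate terms such as "i18n" or "B2B" will also match the digit pattern, so treat the result as a hint rather than a verdict.

```python
import re

# A digit sandwiched between letters (as in "mul9ple") suggests that a
# character or ligature was mapped to the wrong code during extraction.
DIGIT_INSIDE_WORD = re.compile(r"[A-Za-z]\d[A-Za-z]")


def corruption_signals(text):
    """Return two heuristic ratios for text saved from a PDF."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    words = text.split()
    if not lines or not words:
        return {"single_word_line_ratio": 0.0, "digit_inside_word_ratio": 0.0}
    single_word_lines = sum(1 for line in lines if len(line.split()) == 1)
    suspicious_words = sum(1 for word in words if DIGIT_INSIDE_WORD.search(word))
    return {
        "single_word_line_ratio": single_word_lines / len(lines),
        "digit_inside_word_ratio": suspicious_words / len(words),
    }


def looks_corrupt(text, line_threshold=0.8, digit_threshold=0.02):
    """Flag text when either ratio crosses its threshold.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    signals = corruption_signals(text)
    return (
        signals["single_word_line_ratio"] >= line_threshold
        or signals["digit_inside_word_ratio"] >= digit_threshold
    )
```

To use it, read the file saved via File -> Save as Text into a string and pass that string to looks_corrupt. A True result means the text shows the same pattern as the example above and the PDF is worth inspecting by hand.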

This problem is NOT fixable. It is not caused by Sovren software. It is caused by a PDF that looks great but is internally corrupt. To learn more about why and how PDFs can be corrupt, read the articles below: