The attached Word document is ESSENTIAL to understanding this project, as it contains very important images. I will ask whether you have read the attached brief before accepting your bid. This is a short description of the project; please read the attached document for the whole story.
We need a Solr search engine built from old, multi-page PDFs. All of the indexed documents will be PDFs, and many will need to go through OCR first. We will probably use something like Foxit to do the image-to-text conversion. We know the output will be messy, but the text will only be used in the indexing process. When a user does a search, s/he will access the PDF directly.
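Since the OCR output will be messy and is only ever fed to the indexer, a light cleanup pass before indexing can cut down on index noise. The sketch below is purely illustrative (the class name and cleanup rules are our own assumptions, not a spec): it strips control characters left behind by OCR and collapses runs of whitespace.

```java
// Illustrative sketch only: normalize messy OCR output before it is indexed.
// The user never sees this text, so lossy cleanup is acceptable here.
public class OcrTextCleaner {

    /** Strip non-printable characters and collapse whitespace runs. */
    public static String clean(String rawOcrText) {
        return rawOcrText
                .replaceAll("\\p{Cntrl}", " ") // control chars from OCR -> spaces
                .replaceAll("\\s+", " ")       // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String messy = "TANK\u0000  CLOSURE\n\n  FORM   C-144 ";
        System.out.println(clean(messy)); // prints "TANK CLOSURE FORM C-144"
    }
}
```

Real OCR cleanup would likely need more rules (de-hyphenation, dropping garbage lines), but since only the index consumes this text, even a crude pass like this is useful.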
Note: All of our work is in Java. This will be running on a large Linux server.
This project is not that simple, though. Let's take a look at this example > [url removed, login to view]
We will want to index this 30-page document. But it contains more than one form (unique section). State Oil & Gas sites will often put an entire wellbore's files in a single PDF; 20 years of paperwork can be sitting in one file. If we index as-is and return results with a 30- to 100-page PDF attached, the user will never be able to find the single mention of their search string after opening such a long file.
For this reason, we need to break the 30+ page PDF into individual pages, OCR each one, and index each page separately. When doing a search, the user is actually searching individual pages. We tell the user we found the queried text on page 19 of the PDF. S/he clicks to get the full 30 pages, but knows to go to page 19. We may even load the PDF in a frame and keep a header at the top that reminds the user to look on page 19. And there may be multiple mentions of the search query in a single PDF file.
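One way to wire this up (a sketch under our own naming assumptions, not a required design) is to key each per-page index document with an id that encodes both the source PDF and the page number. A search hit then maps straight back to "open file X, go to page N":

```java
// Sketch: per-page document ids for the page-level index.
// The "#page=N" suffix is convenient because many browser PDF viewers
// honor it as an open parameter, jumping directly to that page.
public class PageDocId {

    /** Build the index id for one page, e.g. ("wellbore-123.pdf", 19) -> "wellbore-123.pdf#page=19". */
    public static String toId(String pdfName, int pageNumber) {
        return pdfName + "#page=" + pageNumber;
    }

    /** Recover the page number from a hit's id, for the "look on page 19" header. */
    public static int pageOf(String docId) {
        return Integer.parseInt(docId.substring(docId.lastIndexOf("#page=") + "#page=".length()));
    }

    /** Recover the original PDF name, which is what the user actually opens. */
    public static String pdfOf(String docId) {
        return docId.substring(0, docId.lastIndexOf("#page="));
    }
}
```

With ids shaped like this, the results page can link the original full PDF while still showing the per-page hint, and multiple hits in one PDF simply become multiple page-level documents.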
A lot of it will be nasty-looking. The documentation goes back 50+ years, to the typewriter era.
If this all seems pretty impossible, you would be right. In fact, we believe the OCR will be so incomplete in places that we cannot even show a snippet (10-20 words) of text on the search results page, because it will be nonsensical. But this is OK. If we can OCR 70% of the data from these PDFs, that's 70% we didn't have yesterday. And no one will ever see the OCR text to complain about how incomplete it is…
Why are we going to all this effort? We plan on using Solr to build a metadata engine around these documents. We are less interested in the content of each page and more interested in the page type, e.g. the fact that a particular wellbore even has a C-144 form. We'd like to get as much data as we can, but we realize we won't be able to get it all.
The end user will probably do very little “free text” searching of Solr. Instead, we will process 10,000 of our own search phrases (tokenization and algorithms), e.g. “Tank Closure” or “C-144”, and build a table of all the document types that are inside the PDFs for each wellbore. We may tell a user that wellbore [Removed by Freelancer.com Admin - please see Section 13 of our Terms and Conditions]
Now it starts to make sense why we are breaking apart all the PDFs for OCR and indexing. We may store pages 1, 2, 3, 4, and 5 in a database row for wellbore [Removed by Freelancer.com Admin - please see Section 13 of our Terms and Conditions]
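The batch step above amounts to folding per-page phrase hits into a per-wellbore table. Here is a minimal sketch of that fold (the record fields and table shape are our own assumptions, not the real schema): wellbore -> document type -> sorted list of pages where the phrase was found.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: turn page-level phrase hits into the per-wellbore metadata table.
// The OCR text itself never leaves this step; only the table reaches the user.
public class DocTypeTable {

    /** One per-page hit: e.g. phrase "C-144" found on page 3 of a wellbore's PDF. */
    public record PageHit(String wellbore, String docType, int page) {}

    /** Fold hits into wellbore -> docType -> sorted page numbers. */
    public static Map<String, Map<String, List<Integer>>> build(List<PageHit> hits) {
        Map<String, Map<String, List<Integer>>> table = new TreeMap<>();
        for (PageHit hit : hits) {
            table.computeIfAbsent(hit.wellbore(), w -> new TreeMap<>())
                 .computeIfAbsent(hit.docType(), t -> new ArrayList<>())
                 .add(hit.page());
        }
        table.values().forEach(byType -> byType.values().forEach(Collections::sort));
        return table;
    }
}
```

A row of this table is exactly what would be stored in the database for a wellbore, e.g. "C-144 on pages 1-5", so the user-facing query never touches the index at all.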
We cannot stress this enough: the user never sees the OCR text or the broken-apart PDFs. That would be way too confusing. Instead, we will direct the user to open the original PDF and go to page 6 or page 1 or page 27 and read further about a tank closure for this particular wellbore.
Expect 10-15 million PDFs. If this work is good, we have many more follow-on projects that we would LOVE for you to work on.
OK! That should be enough to communicate the main purpose of this project. Please read the attached document, which has more detailed information about the entire project.