Word Doc Data Extraction & PHP/MySQL Database Script w/ Search
Summary: Extract formatted text and images from Word docs that are laid out as tables, and load them into a searchable database.
This is a unique project in its specifications, so please read carefully before bidding.
On a server, I will have an indefinite number of Microsoft Word files in a specified folder, matching a specified file mask (say, test*.doc). The number and content of the files will change over time.
Each Word file consists of a Word table with three rows and two columns, laid out the same way on every page of the document. The Word files will be indeterminate in length but will always conform to this format.
Each cell of the document will contain a mix of formatted text and, occasionally, pictures. The content of the cells is essentially a question bank, with one cell holding a question and another cell holding its answer. The pages are laid out for duplex printing, which means the question and answer are on different pages (and reversed from left to right), but there is a consistent relationship in the layout.
Odd numbered pages have answers and even numbered pages have questions.
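A rough sketch of how the duplex pairing might work, written in Python only to illustrate the logic (the deliverable itself is PHP). It assumes each page has already been parsed into a grid of cell contents, that page 1 (answers) and page 2 (questions) make up one sheet, and that a question sits in the mirrored column of its answer. The exact relationship should be confirmed against the sample files; `pair_cells` is a made-up helper name.

```python
from typing import List, Tuple

# One page = 3 rows x 2 columns of cell content (e.g. HTML strings).
Page = List[List[str]]


def pair_cells(pages: List[Page]) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs from pages given in document order.

    Assumption: pages[0] is page 1 (answers), pages[1] is page 2 (questions),
    and so on, with columns mirrored between the two sides of a sheet.
    """
    pairs = []
    # Walk the pages two at a time: one sheet = answer page + question page.
    # A trailing unpaired page is ignored.
    for i in range(0, len(pages) - 1, 2):
        answers, questions = pages[i], pages[i + 1]
        for row in range(len(questions)):
            cols = len(questions[row])
            for col in range(cols):
                question = questions[row][col]
                answer = answers[row][cols - 1 - col]  # mirrored column
                if question.strip() or answer.strip():  # skip empty slots
                    pairs.append((question, answer))
    return pairs
```

With this reading, the top-left question on page 2 is paired with the top-right answer on page 1.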
You will need to program some sort of script that will, once a day, scan any new or changed documents (including all of the files matching the file mask the first time it is run) and input the questions and answers (paired together and extracted together based on the duplex printing cell pattern) into a MySQL database, preserving the formatting information, layout and images, if present.
The database table used to store the content would have columns for a unique ID of the question/answer combination, the filename it was extracted from, a date/timestamp of when the information was captured, the question, the answer, and a boolean of active/inactive.
The database/script would need a mechanism to detect changed or new files on subsequent runs. This should be based on a checksum of the file, not a date-modified check. If a file has changed, all previous question/answer entries associated with that file would be marked as "inactive" in the table (this is the active/inactive boolean column).
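The checksum-based change detection could look roughly like the following. This is a Python sketch of the logic only, using an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere. The `file_checksums` table, the choice of SHA-256, and all function names are assumptions, not part of the spec.

```python
import glob
import hashlib
import os
import sqlite3


def file_checksum(path: str) -> str:
    """SHA-256 of the file contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    """Create the Q/A table plus an assumed helper table for checksums."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS qa (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT NOT NULL,
            captured_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            question TEXT NOT NULL,
            answer TEXT NOT NULL,
            active INTEGER NOT NULL DEFAULT 1
        );
        CREATE TABLE IF NOT EXISTS file_checksums (
            filename TEXT PRIMARY KEY,
            checksum TEXT NOT NULL
        );
    """)


def changed_files(conn, folder, mask):
    """Return (path, checksum) for files that are new or whose checksum differs."""
    changed = []
    for path in sorted(glob.glob(os.path.join(folder, mask))):
        checksum = file_checksum(path)
        row = conn.execute(
            "SELECT checksum FROM file_checksums WHERE filename = ?",
            (os.path.basename(path),),
        ).fetchone()
        if row is None or row[0] != checksum:
            changed.append((path, checksum))
    return changed


def replace_entries(conn, path, checksum, pairs):
    """Deactivate old rows for the file, insert fresh pairs, store the checksum."""
    name = os.path.basename(path)
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute("UPDATE qa SET active = 0 WHERE filename = ?", (name,))
        conn.executemany(
            "INSERT INTO qa (filename, question, answer) VALUES (?, ?, ?)",
            [(name, q, a) for q, a in pairs],
        )
        # SQLite syntax; in MySQL this would be REPLACE INTO or
        # INSERT ... ON DUPLICATE KEY UPDATE.
        conn.execute(
            "INSERT OR REPLACE INTO file_checksums (filename, checksum) VALUES (?, ?)",
            (name, checksum),
        )
```

On the first run every file matching the mask counts as new, because no checksum has been stored yet. That matches the requirement that all files are processed the first time.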
The interface for this database needs to be limited and secure. Basically, we need a "Google-like" full-text search of the questions and answers. A search would return the question/answer pairs that match the query. The matching results could come from multiple Word files.
The results should be returned to an HTML page, in a table, with full formatting applied and any images embedded, with a layout exactly like the Word doc from which they were extracted, except that the question and answer will be side by side in the table.
Access to this interface will be controlled by a username and password.
We need the ability to set usernames and passwords for it, as well as per-user limits on the number of queries per day and on the maximum number of results shown for a query (not just per page, but the total that will ever be shown to the user for that query). Results should be returned sorted by relevance.
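The per-user limits could be enforced along these lines. This is again a Python sketch of the logic only, with hypothetical names (`Clearance`, `apply_clearance`). It assumes the results are already sorted by relevance, and that a limit of `None` means "unlimited", as would apply to an unrestricted master account.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clearance:
    queries_per_day: Optional[int]  # None = no daily limit
    max_results: Optional[int]      # None = no cap on results shown


def apply_clearance(clearance: Clearance,
                    queries_used_today: int,
                    ranked_results: List[dict]) -> List[dict]:
    """Refuse the query if the daily quota is spent, else cap the result list.

    The cap is applied to the full ranked list before any pagination, so the
    user can never page past max_results.
    """
    if (clearance.queries_per_day is not None
            and queries_used_today >= clearance.queries_per_day):
        raise PermissionError("daily query limit reached")
    if clearance.max_results is None:
        return list(ranked_results)
    return list(ranked_results[:clearance.max_results])
```

Capping before pagination is what makes the limit a hard ceiling rather than a page size.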
However, certain portions of the "question" field will need to be suppressed, but still searchable. This will be based on text patterns in the question cells that we will show after you accept the project.
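One way to keep suppressed text searchable is to store and search the full question but strip the matching portions only when rendering. The Python sketch below illustrates that idea. The double-bracket pattern is purely a placeholder, since the real patterns will be supplied later, and the substring match stands in for whatever full-text search the database provides.

```python
import re

# Placeholder pattern only: the real text patterns are to be supplied by the
# client after the project is accepted.
SUPPRESS_PATTERN = re.compile(r"\[\[.*?\]\]", re.DOTALL)


def matches(query: str, question: str, answer: str) -> bool:
    """Naive stand-in for full-text search; runs against the unsuppressed text."""
    needle = query.lower()
    return needle in question.lower() or needle in answer.lower()


def render_question(question: str, is_master: bool) -> str:
    """Return the question as displayed: masked for normal users only."""
    if is_master:
        return question
    return SUPPRESS_PATTERN.sub("", question)
```

Because the stored question includes formatting, the real patterns may need to account for markup inside the suppressed span.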
Also, a master user account will need to be available that has no restrictions on the number of queries or results, and no suppression of the above-mentioned content.
This project, while unique, should be straightforward for someone who knows what they're doing. Samples of target files will be provided upon project start.