PDF, RTF, TXT to HTML Converter
We need a Perl or Python module that converts PDF, RTF and TXT files to HTML files.
Images and pictures (WMF etc.) can be ignored; we are only interested in the text itself and its logical layout: paragraphs, bullets, lists, tables etc.
TXT is translated to paragraphs only: a single \n means a line break,
and \n followed by one or more empty lines means a new paragraph.
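A rough sketch of the TXT rule in Python, for illustration only. It assumes that a single \n becomes a `<br>` and that a run of empty lines separates `<p>` paragraphs; those tag names are an assumption, and `txt_to_html_body` is a hypothetical helper name, not part of any existing module.

```python
import html
import re


def txt_to_html_body(text):
    """Turn plain text into HTML paragraphs (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline followed by one or more empty (or whitespace-only) lines
    # separates two paragraphs.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    paragraphs = []
    for block in blocks:
        if not block.strip():
            continue
        # Escape &, < and > so the text cannot be read as markup.
        lines = [html.escape(line) for line in block.split("\n")]
        # Assumption: a single newline inside a paragraph becomes <br>.
        paragraphs.append("<p>" + "<br>\n".join(lines) + "</p>")
    return "\n".join(paragraphs)
```

The output is only the body fragment; wrapping it in a full HTML page is left out of the sketch.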
Do not use Word's object model; the same goes for Adobe Acrobat.
The module has a simple interface: a Convert function that takes the filename and a directory and returns the filename of the .htm file.
Example of how it will be used (if you use Perl):
my $convertor = ModuleName->new;
my $file = $convertor->Convert('[url removed, login to view]', 'c:\\documents');
print "*** Error: " . $convertor->GetLastError() unless $file;
where $file will be '[url removed, login to view]' if everything is OK.
If the conversion fails, the return value will be 0.
The error string should then be returned by the GetLastError() function.
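For anyone bidding in Python, the same interface could look roughly like the sketch below. It is an illustration under assumptions, not a specification: the names `Converter`, `convert` and `get_last_error` are made up here, the directory argument is assumed to be where the .htm file is written, and only a placeholder TXT path is implemented (it dumps the escaped text into a `<pre>` block). PDF and RTF handling is left out entirely.

```python
import html
import os


class Converter:
    """Hypothetical Python counterpart of the Perl interface above."""

    def __init__(self):
        self._last_error = ""

    def get_last_error(self):
        """Return the error message of the most recent failed conversion."""
        return self._last_error

    def convert(self, filename, directory):
        """Return the path of the written .htm file, or 0 on failure."""
        try:
            ext = os.path.splitext(filename)[1].lower()
            if ext != ".txt":
                # PDF and RTF back ends are deliberately not part of this sketch.
                raise ValueError("unsupported file type: " + ext)
            with open(filename, "r", encoding="utf-8") as src:
                text = src.read()
            base = os.path.splitext(os.path.basename(filename))[0]
            # Assumption: the directory argument is the output directory.
            out_path = os.path.join(directory, base + ".htm")
            with open(out_path, "w", encoding="utf-8") as dst:
                # Placeholder output; real paragraph handling would go here.
                dst.write("<html><body><pre>" + html.escape(text)
                          + "</pre></body></html>")
            self._last_error = ""
            return out_path
        except (OSError, ValueError) as exc:
            # UnicodeDecodeError is a ValueError, so bad encodings land here too.
            self._last_error = str(exc)
            return 0
```

A caller checks the return value and asks for the error text only when it is 0, exactly as in the Perl example.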
The module should handle UTF-8 encoding as well as 8-bit encodings (UTF-16 is a bonus if you offer it).
The code should run unattended, and you should create a log file with all the errors.
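One possible way to meet the encoding requirement for TXT input, sketched in Python. The detection order (UTF-16 byte order mark, then UTF-8, then an 8-bit fallback) and the default fallback code page `cp1252` are assumptions made for this sketch, and `decode_text` is a hypothetical helper name. UTF-16 without a byte order mark is not detected here.

```python
import codecs


def decode_text(data, fallback="cp1252"):
    """Decode raw file bytes to str (hypothetical helper)."""
    # A UTF-16 byte order mark is unambiguous, so check for it first.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    try:
        # "utf-8-sig" also strips a leading UTF-8 BOM if one is present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat the bytes as a single-byte encoding.
        # Assumption: cp1252 unless the caller names another code page.
        return data.decode(fallback, errors="replace")
```

The file would be opened in binary mode and its bytes passed to this function before any HTML is generated.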
We should get all the source code, documented, and we get all copyrights: we can do whatever we want with the code, including changing it and reselling it, or eating it ..:-)
The module and all of its dependencies should be compatible with MS Windows. It must be based only on open source code; no special modules that cost money or limit our ability to distribute the code are allowed!
We want simple code that is easy to maintain.
We have several other modules we need, so if you do well on this one you may get others too.