I have a software product that reads online text and builds a detailed profile from it. Each profile is then compared with other profiles so that recommendations can be served.
The profiling engine is a single-server Java application that runs on Tomcat and exposes a REST API.
Up to now, the content to be profiled has reached my server via full-text RSS feeds or XML files (for which I then write a custom parser in Java).
I now have a project where I will receive a high volume of URLs (around 80,000 arriving over the course of a day) and will need to 'scrape' the text from these pages before passing it to the profiling engine.
For this development, operational speed is very important: the 'scraper' needs to be fast enough to handle the expected volume, but also accurate enough that most of the page 'junk' does not adversely affect the resulting profile.
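To put the volume in perspective, here is a rough back-of-envelope calculation in Java. Only the 80,000 URLs per day figure comes from this brief. The 8-hour arrival window and the 8 parallel workers are purely illustrative assumptions, and the class and method names are hypothetical.

```java
// Back-of-envelope throughput estimate; all names here are illustrative.
final class ThroughputEstimate {

    /** Average arrival rate in URLs per second when the daily volume arrives within windowHours. */
    public static double urlsPerSecond(long urlsPerDay, double windowHours) {
        return urlsPerDay / (windowHours * 3600.0);
    }

    /** Time budget per page in milliseconds when `workers` pages are processed in parallel. */
    public static double budgetMillisPerPage(double urlsPerSecond, int workers) {
        return workers / urlsPerSecond * 1000.0;
    }

    public static void main(String[] args) {
        // 80,000 URLs/day is from the brief; the windows and worker count are assumptions.
        double evenRate = urlsPerSecond(80_000, 24.0); // arrivals spread evenly over the day
        double busyRate = urlsPerSecond(80_000, 8.0);  // assumed: all arrivals within 8 hours
        System.out.printf("Even spread: %.2f URLs/s%n", evenRate);
        System.out.printf("8-hour window: %.2f URLs/s%n", busyRate);
        System.out.printf("Budget per page with 8 workers: %.0f ms%n",
                budgetMillisPerPage(busyRate, 8));
    }
}
```

Spread evenly, the load averages just under one URL per second, so the fetch, extract, and profile steps together have a time budget on the order of a second per page per worker. Bursty arrivals would tighten that budget.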
Ideally the web scraper will take the page 'title' and 'article' text and use these for profiling.
However, there will not be a standard format for these pages and so the web scraper needs to be fairly generic too.
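As a starting point for discussion, here is a minimal sketch of the kind of generic extraction I have in mind, using only the Java standard library. It is a naive heuristic, not a finished design: it takes the 'title' element, discards obvious non-content elements, and keeps text blocks that have enough words and few links. The class name, the link-density threshold, and the minimum word count are all hypothetical. A real solution would more likely build on a proper HTML parser or a boilerplate-removal library, since regular expressions are fragile on malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive title/article extractor; all names and thresholds are illustrative assumptions.
final class NaiveArticleExtractor {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Elements whose content is assumed never to be article text.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "<(script|style|head|nav|header|footer|aside|form|noscript)\\b[^>]*>.*?</\\1\\s*>", FLAGS);
    // Tags treated as boundaries between candidate text blocks.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "</?(p|div|br|li|ul|ol|td|tr|table|h[1-6]|section|article|blockquote)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern ANCHOR = Pattern.compile("<a\\b[^>]*>(.*?)</a\\s*>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final double MAX_LINK_DENSITY = 0.3; // assumed threshold

    /** Strips remaining tags, decodes two common entities, and collapses whitespace. */
    private static String clean(String fragment) {
        String text = TAG.matcher(fragment).replaceAll(" ");
        text = text.replace("&nbsp;", " ").replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the cleaned contents of the first title element, or "" if there is none. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? clean(m.group(1)) : "";
    }

    /** Keeps blocks with at least minWords words and a low share of link text. */
    public static String extractArticle(String html, int minWords) {
        String body = COMMENT.matcher(html).replaceAll(" ");
        body = NON_CONTENT.matcher(body).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(body)) {
            String text = clean(block);
            if (text.isEmpty()) {
                continue;
            }
            int words = text.split(" ").length;
            int linkChars = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkChars += clean(m.group(1)).length();
            }
            double linkDensity = (double) linkChars / text.length();
            if (words >= minWords && linkDensity < MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head><body>"
                + "<nav><a href=\"/\">Home</a></nav>"
                + "<p>This paragraph is long enough to count as article text here.</p>"
                + "</body></html>";
        System.out.println(extractTitle(html));
        System.out.println(extractArticle(html, 5));
    }
}
```

The title and article strings could then be passed to the profiling engine. The thresholds would need tuning against real pages, because link-heavy articles or very short paragraphs would be dropped by this heuristic.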
Please get in contact if you feel you can achieve this, but you must have experience in this field.