We need one public gov't website scraped. It's a simple scrape; nothing special like a CAPTCHA, password, etc. The gov't site is updated every time there is new information. The Scraping Session (Screen-Scraper .sss), written in Java, would need to detect new information and write it out in TSV format.
The scrape needs to:
(1) Give each record a unique ID, compared against the db (MySQL) for duplicates
(2) Write the scraped data in TSV format (approx. 10 fields and 1 image)
(3) Use resilient extractor patterns
(4) Include Java code that is commented/documented
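To illustrate requirements (1) and (2), here is a minimal sketch in plain Java. It is not Screen-Scraper API code: the class name `TsvSketch` and its methods are hypothetical, and the in-memory `Set` only stands in for the MySQL duplicate check, which in production would be a lookup or a unique-key constraint on the ID column.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: de-duplicates records by ID and formats them as TSV rows.
final class TsvSketch {
    // Assumption: stands in for "SELECT ... WHERE id = ?" against MySQL.
    private final Set<Long> seenIds = new HashSet<>();

    // Tabs and line breaks inside a field would corrupt the row, so replace them with a space.
    static String clean(String field) {
        if (field == null) {
            return "";
        }
        return field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    // One TSV row: the unique ID first, then the cleaned fields, tab-separated.
    static String toTsvLine(long id, List<String> fields) {
        String joined = fields.stream()
                .map(TsvSketch::clean)
                .collect(Collectors.joining("\t"));
        return id + "\t" + joined;
    }

    // Returns the TSV row for a new ID, or null if the ID was already recorded.
    String recordIfNew(long id, List<String> fields) {
        if (!seenIds.add(id)) {
            return null; // duplicate, skip
        }
        return toTsvLine(id, fields);
    }
}
```

The returned row can be appended to the output file as-is; a null return means the record was already captured and should be skipped.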
The unique ID is incremental, and it is what you use to get to each details page.
The extractor patterns are simple.
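Because the ID is incremental, checking for new information can be sketched as walking forward from the last ID already stored. This is only an illustration under assumptions: the URL prefix is a placeholder (the real site address is not given here), the `exists` check stands in for requesting the details page and testing whether the extractor pattern matched, and tolerating a few skipped IDs via `maxMisses` is our guess at how the site behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical helper: finds IDs published since the last run.
final class IdWalker {
    // Placeholder URL, not the real site.
    static final String DETAILS_URL_PREFIX = "https://example.gov/details?id=";

    static String detailsUrl(long id) {
        return DETAILS_URL_PREFIX + id;
    }

    // Probes IDs after lastKnownId and stops after maxMisses consecutive
    // IDs with no details page. The miss counter resets on every hit.
    static List<Long> findNewIds(long lastKnownId, int maxMisses, LongPredicate exists) {
        List<Long> found = new ArrayList<>();
        int misses = 0;
        for (long id = lastKnownId + 1; misses < maxMisses; id++) {
            if (exists.test(id)) {
                found.add(id);
                misses = 0;
            } else {
                misses++;
            }
        }
        return found;
    }
}
```

Running this on a short schedule, with `lastKnownId` read from the database, is one way to catch records before they disappear from the site.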
(1) The scraper must check for new information (scrapable data) within a short period, or the data will no longer be available.
(2) Sometimes the data exists but the image doesn't yet. Here is the challenge: sometimes the image will never exist, at which point we still need to keep the scraped data (i.e. retry, and if the image still does not exist after a set number of tries, keep the scraped data without it).
(3) It may seem like a simple site to scrape at first glance, but please don't underestimate it or leave it until the last day before the project is due, as it has to be production ready when you submit it.
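The image behaviour in point (2) could be handled roughly like this. It is a sketch, not a required design: `ImageRetrySketch`, `maxAttempts` and `delayMillis` are names and parameters we made up, and the `imageLookup` supplier stands in for whatever request checks whether the image exists yet.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: retries the image lookup a bounded number of times.
final class ImageRetrySketch {

    // Returns the image reference if it shows up within maxAttempts lookups,
    // otherwise Optional.empty(), meaning "keep the scraped data without an image".
    static Optional<String> fetchImage(Supplier<Optional<String>> imageLookup,
                                       int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> image = imageLookup.get();
            if (image.isPresent()) {
                return image;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // wait before the next try
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break; // stop retrying and fall through to "no image"
                }
            }
        }
        return Optional.empty();
    }
}
```

Either way the data row gets written; an empty result just means the image column is left blank. If the data itself can expire quickly, it may be safer to write the row first and fill in the image on a later pass.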
(1) Please only bid if you have experience with [url removed, login to view]
Project Due Date:
3-4 days after bid acceptance
This is my first post here with [url removed, login to view], so please bear with me as I learn the ropes. I work for an attorney firm that specializes in clients in direct marketing, so I will have more projects similar to this. We need this right away and production ready, as it is an integral part of a larger pilot program we are launching.
Thanks for reading this. Look forward to the bids.