We require a simple but effective URL-detection website/service (PHP) to be built.
The service will take a pre-formatted CSV file containing a list of businesses with the following base-level fields:
Business Name, Category, Street Address, Town/City Name, Postcode (ZIP), State, Phone Number
"Category" will be a basic description of the business's line of trade (e.g. Hairdressers), which may be a useful additional signal to help narrow detection results.
The service will take this list and, for each entry, create a list of up to 30 'candidate' URLs that may belong to that business:
1. Attempt to directly 'guess' the most likely domain name from the business name, using a few simple rules such as de-pluralising ('florists' > 'florist') and removing common words like 'limited'.
2. Connect to the top three search engines (Google, Bing, Yahoo) and run several searches (the name alone, then the name combined with the phone number, city name and other combinations), capturing the top 5-10 returned URLs that most likely match the business.
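The two candidate-generation steps above could be sketched as follows. This is a minimal illustration in Python (the delivered service would implement the same logic in PHP); the stop-word list, TLD list and query combinations are all assumptions to be tuned against real data.

```python
import re

# Assumed stop-words to strip from business names; the real list
# would be tuned against the sample record sets.
STOP_WORDS = {"limited", "ltd", "pty", "the", "and", "co"}

def guess_domains(name, tlds=(".com", ".com.au", ".co.uk")):
    """Step 1: guess likely domain names from a business name."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    words = [w for w in words if w not in STOP_WORDS]
    # De-pluralise simple trailing-s forms: 'florists' -> 'florist'
    singular = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    candidates = []
    for parts in (words, singular):
        for joined in ("".join(parts), "-".join(parts)):
            for tld in tlds:
                url = "http://www." + joined + tld
                if url not in candidates:
                    candidates.append(url)
    return candidates

def build_queries(name, phone, city):
    """Step 2: query combinations to submit to each search engine."""
    return [
        name,
        f'"{name}" {city}',
        f'"{name}" {phone}',
        f'"{name}" {city} {phone}',
    ]
```

In practice the guessed domains would be probed with an HTTP HEAD/GET request, and the queries submitted via each engine's search API or results page before merging everything into the candidate list.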
Then, for each candidate URL, the service must connect, scrape the website, and try to match the supplied fields (Name, Address, Postcode, Phone Number, etc.) against the text appearing somewhere on the page. Each value detected on the candidate site adds to a 'confidence' score indicating how likely it is that the particular URL belongs to the business in question.
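The confidence scoring might look like the following sketch (Python for brevity; the service itself would be PHP). The per-field weights are purely illustrative assumptions; phone and postcode are weighted higher here because they are less ambiguous than a name or city.

```python
import re

def normalize(text):
    """Lower-case and collapse punctuation/whitespace for loose matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", " ", text.lower())).strip()

# Illustrative weights (sum to 100); to be tuned against real results.
WEIGHTS = {"name": 25, "street": 20, "city": 10, "postcode": 20, "phone": 25}

def confidence(business, page_text):
    """Score 0-100: how many supplied fields appear in the scraped page."""
    haystack = normalize(page_text)
    score = 0
    for field, weight in WEIGHTS.items():
        value = business.get(field, "")
        if value and normalize(value) in haystack:
            score += weight
    return score
```

A full match on every field yields 100; a page that lists the name and address but no phone number would score proportionally lower.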
Intelligent parsing of both the incoming data and the scrape results will yield the most effective outcomes. For example, we may supply separate PO Box and physical address details for multi-address checks/matching, and phone numbers will be provided with bracketed area codes that can optionally be stripped for additional phone-number checks.
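As one example of that input parsing, the bracketed-area-code handling could generate several variants of each phone number so a match still succeeds when a site omits the area code or formats the number differently. A minimal sketch (Python, illustrative only):

```python
import re

def phone_variants(raw):
    """Variants of a number like '(02) 9555 1234' for matching:
    brackets removed, area code stripped, and digits-only forms."""
    variants = {raw}
    # Brackets removed but area code kept: '02 9555 1234'
    variants.add(re.sub(r"[()]", "", raw).strip())
    # Bracketed area code stripped entirely: '9555 1234'
    variants.add(re.sub(r"^\(\d+\)\s*", "", raw))
    # Digits-only form for loose comparison: '0295551234'
    variants.add(re.sub(r"\D", "", raw))
    return variants
```

Each variant would then be checked against the (similarly normalised) page text when computing the confidence score.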
The results (returned or saved to disk) will be a list of the businesses, each with all of its candidate URLs and the confidence rating attributed to each URL.
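One possible shape for that saved output, assuming a flat CSV with one row per (business, candidate URL) pair; the exact columns and format are open to discussion:

```python
import csv
import io

def write_results(rows):
    """Serialise results as CSV: one row per (business, candidate URL) pair."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Business Name", "Candidate URL", "Confidence"])
    for business, url, score in rows:
        writer.writerow([business, url, score])
    return buf.getvalue()
```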
Based on the results of your diligent work, you may consequently win a second, follow-on project: expanding this into a much larger ($3k-$5k) and comprehensive interface that allows interactive viewing of the returned candidate-URL data, selection of the site to scrape, and complex parsing of the scraped data (images, text blocks, external feeds, etc.). Progression to that phase depends on how well this first stage is built and the accuracy/depth of results it achieves.
FYI, there will be a simple 'blacklist' of URLs that the detection (through search engines) should ignore; we will load it with a list of common internet directories that can easily be mistaken for a business's website home page (because they list the business details as part of their directory).
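The blacklist check amounts to a domain filter over the candidate list. A short sketch (Python; the directory domains shown are placeholder examples, with the real list loaded from the supplied blacklist file):

```python
from urllib.parse import urlparse

# Placeholder directory domains; the real list comes from the blacklist file.
BLACKLIST = {"yelp.com", "yellowpages.com", "facebook.com"}

def filter_candidates(urls, blacklist=BLACKLIST):
    """Drop candidate URLs whose host matches a blacklisted directory domain."""
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Match the bare domain and any subdomain (www.yelp.com, m.yelp.com, ...)
        if any(host == d or host.endswith("." + d) for d in blacklist):
            continue
        kept.append(url)
    return kept
```

Matching on the host rather than the raw URL string avoids false negatives when a directory link carries the business name in its path.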
You will deliver this project in your own hosting environment and allow us to upload some sample record sets (50-100 records) to assess the tool's accuracy prior to commitment and payment, at which point you will release the full source code for us to host in our own environment.