I will post the current script to the PMB shortly.
The current script uses the Yahoo API to run a search string, gets a list of URLs, then scrapes meta data from those URLs. It then outputs the data to the screen.
I want the script modified to read its search string from a database table of search strings (please provide a script to create the needed table). Each string should be searched only once, then flagged in another field as already searched, so that the next run moves on to the next search string. The script will be run from cron, and each run should search only ONE search string. Note that the search strings in the database may be more advanced searches such as 'powered by WordPress intitle:"shopping cart software" inurl:shopping'
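To illustrate the one-string-per-run behavior described above, here is a minimal sketch. It is only an assumption about how this could look: the table name `search_strings`, the columns `query` and `searched`, and the function `next_search_string` are hypothetical, and SQLite stands in for the real database so the example runs on its own.

```python
import sqlite3

# Hypothetical schema. SQLite is used here as a stand-in; in MySQL the id
# column would typically be INT AUTO_INCREMENT PRIMARY KEY.
SEARCH_SCHEMA = """
CREATE TABLE IF NOT EXISTS search_strings (
    id       INTEGER PRIMARY KEY,
    query    TEXT    NOT NULL,
    searched INTEGER NOT NULL DEFAULT 0
)
"""

def next_search_string(conn):
    """Return the oldest unsearched query and flag it as searched.

    Returns None when every search string has already been used.
    """
    row = conn.execute(
        "SELECT id, query FROM search_strings "
        "WHERE searched = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE search_strings SET searched = 1 WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]
```

Each cron run would call `next_search_string` once and exit if it returns `None`, so no search string is ever searched twice.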
I want the extracted URLs added to a MySQL table (please provide a script to create the needed table). Before scraping the new URLs, the script should check that each URL has not already been scraped; if it has, delete it or mark it so it doesn't get scraped again.
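One possible way to keep already-scraped URLs from being scraped again is a unique constraint on the URL column plus a `scraped` flag. The sketch below is an assumption, not the required design: the table `urls` and the helper names are hypothetical, and SQLite again stands in for MySQL (where `INSERT OR IGNORE` would be written `INSERT IGNORE`).

```python
import sqlite3

# Hypothetical schema; the UNIQUE constraint is what rejects duplicate URLs.
URL_SCHEMA = """
CREATE TABLE IF NOT EXISTS urls (
    id      INTEGER PRIMARY KEY,
    url     TEXT    NOT NULL UNIQUE,
    scraped INTEGER NOT NULL DEFAULT 0
)
"""

def add_urls(conn, urls):
    """Store newly extracted URLs, silently skipping any already in the table."""
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()

def pending_urls(conn):
    """Return the URLs that have not been scraped yet, oldest first."""
    rows = conn.execute("SELECT url FROM urls WHERE scraped = 0 ORDER BY id")
    return [r[0] for r in rows]

def mark_scraped(conn, url):
    """Flag a URL so later runs do not scrape it again."""
    conn.execute("UPDATE urls SET scraped = 1 WHERE url = ?", (url,))
    conn.commit()
```

With this layout a URL returned by a later search is ignored on insert, and only rows with `scraped = 0` are handed to the scraper.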
The data I want scraped is as follows: the post title (not the page title) and the first 400 characters of the actual blog post.
Before data is output to a file, I want the script to check whether there is Google AdSense code on the page or whether the blog post has fewer than 1,500 characters; if either is true, disregard the page and do not add its data to the output files. I want to be able to comment out both of these checks in case I change my mind on a project.
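The two checks could be sketched roughly as below. The marker strings are an assumption about what AdSense code usually contains (a heuristic, not a guaranteed detector), and the names `SKIP_ADSENSE_PAGES`, `SKIP_SHORT_POSTS`, and `should_keep` are hypothetical. Each check sits behind its own switch so it can be turned off, or its two lines commented out, per project.

```python
MIN_POST_CHARS = 1500

# Switches: set to False (or comment out the matching check) to disable.
SKIP_ADSENSE_PAGES = True
SKIP_SHORT_POSTS = True

# Assumed heuristic: substrings commonly found in AdSense embed code.
ADSENSE_MARKERS = ("googlesyndication.com", "google_ad_client")

def should_keep(page_html, post_text):
    """Return True if the page passes every enabled check."""
    html = page_html.lower()
    if SKIP_ADSENSE_PAGES and any(marker in html for marker in ADSENSE_MARKERS):
        return False
    if SKIP_SHORT_POSTS and len(post_text) < MIN_POST_CHARS:
        return False
    return True
```

A post of exactly 1,500 characters is kept, since only posts with fewer than 1,500 are to be disregarded.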
I want two output files:
The first output file needs to be in CSV format with these field names: link (the URL), title (the blog post title), description (the first 400 characters). Each scraped URL needs to be its own record.
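A minimal sketch of the CSV output, assuming each scraped page arrives as a dict with hypothetical keys `link`, `title`, and `post_text`. The `csv` module takes care of quoting commas and quotation marks inside titles or descriptions.

```python
import csv
import io

FIELDNAMES = ["link", "title", "description"]
DESCRIPTION_CHARS = 400

def write_csv(records, out):
    """Write one CSV row per scraped URL to the open text file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for rec in records:
        writer.writerow({
            "link": rec["link"],
            "title": rec["title"],
            # Only the first 400 characters of the post are kept.
            "description": rec["post_text"][:DESCRIPTION_CHARS],
        })
```

When writing to a real file rather than a buffer, open it with `newline=""` so the `csv` module controls the line endings.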
The second output file needs to be a formatted text file like this (note that I removed the HTML tags, but will provide them later):
first 400 characters
I'll want to be able to comment out either of the output files depending on what I'm doing so only one will run.