The initial phase of this project is exploratory and simple. We would like a script that queries six (6) different user-generated content sites to collect a sample of the links posted on those sites, structured per user and per category. The sample should cover 200 user profiles per site.
The script should:
- query the site's API, when one is available, and/or
- collect the data from HTML listings through screen scraping. Note that the scraper must also be able to detect a Next link on the page, follow that link, and extract data from all pages returned. A number of examples in Python will be supplied.
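To illustrate the Next-link requirement, here is a minimal sketch in Python using only the standard library. It is not one of the examples to be supplied. The names `LinkParser` and `scrape_all_pages`, the `max_pages` limit, and the assumption that the pagination anchor's visible text is exactly "Next" are all hypothetical; the real sites may mark pagination differently. The `fetch` argument is any function that takes a URL and returns the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow 'Next' links from start_url and return every other link found.

    max_pages and the 'seen' set guard against endless pagination loops.
    """
    collected, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        next_url = None
        for href, text in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if text.lower() == "next":     # assumption: anchor text is "Next"
                next_url = absolute
            else:
                collected.append(absolute)
        url = next_url
    return collected
```

Passing `fetch` in as a parameter keeps the pagination logic independent of how pages are downloaded, so the same loop can be exercised against saved HTML before it is pointed at a live site.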
The data will then be inserted into the MySQL database. We have a set of references and examples for web scraping in a number of languages, particularly Python. The choice of language and form of development is largely open. Caching and time-outs should be implemented to limit the load on the servers queried.
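As one possible reading of the caching and time-out requirement, the sketch below (Python, standard library only) keeps an in-memory cache so that no URL is requested twice, waits a minimum interval between requests, and applies a time-out to each request. The class name `PoliteFetcher`, the default values of `min_delay` and `timeout`, and the injectable `opener` parameter are assumptions made for illustration, not part of the specification.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages with a cache, a delay between requests, and a time-out."""

    def __init__(self, min_delay=2.0, timeout=10.0, opener=None):
        self.min_delay = min_delay   # seconds to wait between real requests
        self.timeout = timeout       # seconds before a request is abandoned
        self._opener = opener or self._http_get
        self._cache = {}             # url -> body already downloaded
        self._last_request = None    # monotonic time of the last real request

    def _http_get(self, url, timeout):
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    def fetch(self, url):
        # Serve repeated requests from the cache without touching the server.
        if url in self._cache:
            return self._cache[url]
        # Sleep if the previous real request was made too recently.
        if self._last_request is not None:
            wait = self.min_delay - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        body = self._opener(url, self.timeout)
        self._last_request = time.monotonic()
        self._cache[url] = body
        return body
```

The cache here lasts only for one run of the script; a cache stored on disk or in the database would be needed to avoid repeated downloads across runs.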
Additional questions are welcome. The examples and information regarding the websites to be queried will be supplied separately.