Data Extraction / Transformation
We are looking to build an Open Access archive of freely available scholarly journals. [url removed, login to view] gives a good explanation of the content and subject area this project relates to.
A. Create a harvesting engine in your own choice of programming language (parallel processing has produced the best results) that can:
1.) Crawl specific Internet sites (targets); we will help with the target choices. OAI is one method some sites support
2.) If not crawling, read from an input file to glean the data, which some sites supply
3.) Ensure the data is accurate and test URLs for correctness
4.) Dump the defined data to a text-delimited file format
5.) Transfer the data to us via FTP
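For step 1, a minimal sketch of what OAI-based harvesting could look like, assuming a target that exposes a standard OAI-PMH ListRecords endpoint returning Dublin Core metadata (the endpoint URL and field mapping here are illustrative, not confirmed targets):

```python
# Sketch: fetch and parse one page of an OAI-PMH ListRecords response.
# The base URL passed to fetch_list_records is a placeholder assumption.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def fetch_list_records(base_url, metadata_prefix="oai_dc"):
    """Fetch one page of ListRecords from an OAI-PMH endpoint."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def parse_records(xml_text):
    """Yield one dict of Dublin Core fields per record in the response."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(f"{OAI}record"):
        meta = rec.find(f"{OAI}metadata")
        if meta is None:  # deleted records carry no metadata block
            continue
        fields = {}
        for el in meta.iter():
            if el.tag.startswith(DC):
                fields.setdefault(el.tag[len(DC):], []).append(el.text or "")
        yield fields
```

A full harvester would also follow the `resumptionToken` in each response to page through the complete record set.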
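For step 3, one way to test URLs for correctness is a cheap syntactic check followed by a parallel HEAD request for each survivor; this also illustrates the kind of parallelism mentioned above. The thread count and acceptance criteria below are assumptions, not requirements:

```python
# Sketch: validate harvested URLs syntactically, then probe them
# concurrently. Worker count is an illustrative default.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
import urllib.request

def looks_valid(url):
    """Reject URLs that are syntactically broken before hitting the network."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_reachable(url, timeout=10):
    """HEAD the URL; True if the server answers with a 2xx or 3xx status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

def verify_urls(urls, workers=16):
    """Return {url: ok}; threads are enough since the work is I/O-bound."""
    candidates = [u for u in urls if looks_valid(u)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(zip(candidates, pool.map(is_reachable, candidates)))
    return {u: results.get(u, False) for u in urls}
```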
B. Work with us to find new resources and refresh existing sources on a monthly basis.
C. Provide new and updated data feeds continually
D. Provide your own platform to run the harvests; a multi-core processor should be sufficient
E. The data provided will be article-level data for each journal. The detail data will need these output fields:
"Publisher", "Journal Title", “Article Title”, "ISSN", "Alternate ISSN", "Journal Year", "JournalVol","JournalIssue", "HTML URL", "PDF URL", "Start Page", "End Page"