I need an application that can do the following:
The input will be a .txt file.
The .txt file will contain thousands of URLs of pages from one specific website. The site has a lot of pages, and each page has a section where users can post comments. Only this section will be checked for language.
The application will go to each URL and check whether the majority of the content is in a specific language. If yes, it saves the URL in another .txt file. If the page's language doesn't match the required one, the URL is also saved, in a separate .txt file.
It doesn't have to be 100% accurate.
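
To make the input/output part concrete, here is a rough Python sketch of the sorting step as I picture it. The function name `sort_urls`, the file paths and the `is_match` callback are placeholders I made up for illustration. `is_match` stands in for whatever fetches the page and decides on the language, so this only shows the file handling and is not a finished solution.

```python
from typing import Callable, Tuple


def sort_urls(input_path: str, match_path: str, other_path: str,
              is_match: Callable[[str], bool]) -> Tuple[int, int]:
    """Read one URL per line and write it to one of two output files.

    `is_match(url)` is a hypothetical callback that returns True when the
    page behind the URL is mostly in the required language.
    Returns (number of matching URLs, number of other URLs).
    """
    matched = 0
    other = 0
    with open(input_path, encoding="utf-8") as src, \
            open(match_path, "w", encoding="utf-8") as yes_file, \
            open(other_path, "w", encoding="utf-8") as no_file:
        for line in src:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the input file
            if is_match(url):
                yes_file.write(url + "\n")
                matched += 1
            else:
                no_file.write(url + "\n")
                other += 1
    return matched, other
```

Something like `sort_urls("urls.txt", "match.txt", "other.txt", check_page)` would then produce the two result files, where `check_page` is whatever language check we agree on.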
Please send me a message describing how you plan to approach this project.
It's important that the application can process at least 1 URL per 10 seconds (faster is even better).
I'm not really sure how to solve this, so feel free to share ideas.
One idea: another .txt file contains 100 or more of the most common words of the chosen language (I can provide these),
and the application checks whether a certain percentage of them appears in the web page.
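
A minimal Python sketch of this word-list idea, under my own assumptions: I measure the share of the page's words that are on the top-word list (the other reading, the share of list words found on the page, would work similarly), and the 0.2 threshold is a guess that would need tuning on real pages. The names `top_word_ratio` and `is_target_language` are only for illustration.

```python
import re


def top_word_ratio(text: str, top_words: set) -> float:
    """Return the fraction of words in `text` that are in `top_words`."""
    # Split into runs of letters (Unicode-aware), ignoring digits and symbols.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in top_words)
    return hits / len(words)


def is_target_language(text: str, top_words: set, threshold: float = 0.2) -> bool:
    """Assumed rule: the text counts as the target language when at least
    `threshold` of its words are on the top-word list."""
    return top_word_ratio(text, top_words) >= threshold
```

The word list would be loaded once from the .txt file into a lowercase set, so each page check is just one pass over its words.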
It's very important that the application only checks a specific region of the web page (the comment section) but that should be easy to solve.
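
As an illustration of limiting the check to the comment section, here is a sketch using only Python's standard `html.parser`. It assumes the comments sit inside one element with a known `id`. The value "comments" is a made-up placeholder, since the real page structure would have to be looked up. It also assumes reasonably well-formed HTML, because unclosed tags would throw off the depth counting.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CommentSectionExtractor(HTMLParser):
    """Collect the text inside the element whose id equals `section_id`."""

    def __init__(self, section_id: str):
        super().__init__()
        self.section_id = section_id
        self.depth = 0      # 0 means we are outside the comment section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1             # nested element inside the section
        elif dict(attrs).get("id") == self.section_id:
            self.depth = 1              # entering the comment section

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1             # reaches 0 when the section closes

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_comment_text(html: str, section_id: str = "comments") -> str:
    parser = CommentSectionExtractor(section_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Only the returned text would then be passed to the language check, so navigation, headers and footers do not influence the result.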
I have another option that will make the job easier.
Currently I have a bot that can save the URL plus the fetched text into an MS Access database.
Also, there are already language detection libraries available on the net.
So basically all the program has to do is go through the database and add a check mark or a tag to each URL.
Please adjust your bid for this option; in the long run it's better for me!
Attached is a sample MS Access database.
Your application must analyse the text in the "text" field and then put the language tag in the "Language" field.
You can use external, freely available language detection libraries, or even the Google Translate API.
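
To show what I mean by tagging, here is a rough Python sketch of the database loop. It is written against a generic DB-API connection using `?` placeholders, so it is not tied to Access. The table name `pages` and the `url` key column are my assumptions (only the "text" and "Language" fields are fixed), and `detect` is a placeholder for whichever language detection library or API ends up being used.

```python
from typing import Callable


def tag_languages(conn, detect: Callable[[str], str], table: str = "pages") -> int:
    """Fill the "Language" field for every row, based on the "text" field.

    `conn` is any DB-API connection that uses `?` placeholders.
    `detect(text)` is a hypothetical function returning a language tag.
    `table` and the `url` column are assumed names, not confirmed ones.
    Returns the number of rows that were tagged.
    """
    cur = conn.cursor()
    # Square brackets quote the field names, since "text" may be a reserved word.
    cur.execute(f"SELECT [url], [text] FROM [{table}]")
    rows = cur.fetchall()
    for url, text in rows:
        tag = detect(text or "")  # treat empty (NULL) text as an empty string
        cur.execute(
            f"UPDATE [{table}] SET [Language] = ? WHERE [url] = ?",
            (tag, url),
        )
    conn.commit()
    return len(rows)
```

For the real Access file, the connection would presumably come from an ODBC driver (for example via the pyodbc package), and the actual table and key names would be taken from the attached sample.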