I need an application that can do the following:
The input will be a .txt file.
The .txt file will contain thousands of URLs of website pages. (It's one specific website that has a lot of pages. Each page has a section where users can post comments, and exactly this section will be checked for language.)
The application will go to each URL and check whether the majority of the content is in a specific language. If yes, it saves the URL in another .txt file. If a page is in a language that doesn't match the required one, its URL will be saved in a second, separate .txt file.
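To make the flow above concrete, here is a minimal Python sketch of reading the URL list, sorting each URL into one of two lists, and writing the lists back out. The function names (`read_urls`, `fetch_html`, `sort_urls`, `write_urls`) are hypothetical, and the third `failed` list for pages that could not be downloaded is an assumption that is not part of the requirements above. The language check itself is passed in as a function, so any method can be plugged in.

```python
from pathlib import Path
from urllib.request import urlopen


def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (falls back to UTF-8)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def sort_urls(urls, fetch_text, is_match):
    """Split URLs into (matching, other, failed) lists.

    fetch_text(url) returns the text to check; is_match(text) returns
    True when the text is in the required language.
    `failed` is an assumption: it collects URLs whose download raised.
    """
    matching, other, failed = [], [], []
    for url in urls:
        try:
            text = fetch_text(url)
        except Exception:
            failed.append(url)
            continue
        if is_match(text):
            matching.append(url)
        else:
            other.append(url)
    return matching, other, failed


def write_urls(path, urls):
    """Write one URL per line to a text file."""
    Path(path).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
```

A run could then look like `matching, other, failed = sort_urls(read_urls("urls.txt"), fetch_html, my_language_check)`, followed by one `write_urls` call per output file. The file name and `my_language_check` are placeholders.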
It doesn't have to be 100% accurate.
Please send me a message describing how you plan to approach this project.
It's important that the application can process at least 1 URL per 10 seconds (if faster is possible, even better).
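For a sense of scale: at exactly 1 URL per 10 seconds, a list of 5,000 URLs (an example figure) would take about 14 hours. Most of that time is usually spent waiting on the network, so one possible way to go faster is to check several URLs in parallel. The sketch below uses Python's standard `ThreadPoolExecutor`. The function name `classify_concurrently`, the `check_url` callback and the default of 8 workers are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_concurrently(urls, check_url, max_workers=8):
    """Run check_url(url) for every URL using a pool of worker threads.

    Returns a dict mapping each URL to the result of check_url, or to
    None if check_url raised an exception for that URL.
    """
    def safe_check(url):
        try:
            return check_url(url)
        except Exception:
            return None  # keep going if a single page fails

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        results = list(pool.map(safe_check, urls))
    return dict(zip(urls, results))
```

Keeping `max_workers` small avoids putting too much load on the website, since all URLs point to the same site.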
I'm not really sure how to solve this so feel free to share ideas.
One idea would be to have another .txt file containing 100 or more of the most common words of the chosen language (I can provide these),
and the application checks whether a certain percentage of them appears on the web page.
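The word-list idea could be sketched roughly like this in Python. This version takes one possible reading of "a certain percentage": the share of words in the text that are on the top-words list. The function names and the 0.2 threshold are assumptions and would need tuning on real pages.

```python
import re


def load_top_words(lines):
    """Build a lowercase word set from lines holding one word each."""
    return {line.strip().lower() for line in lines if line.strip()}


def top_word_ratio(text, top_words):
    """Return the fraction of words in `text` that are in `top_words`."""
    # [^\W\d_]+ matches runs of letters, including accented ones
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in top_words)
    return hits / len(tokens)


def is_target_language(text, top_words, threshold=0.2):
    """True when enough of the words are common words of the language.

    The threshold of 0.2 is an assumed starting value, not a tested one.
    """
    return top_word_ratio(text, top_words) >= threshold
```

With a real word file, the set could be loaded with `load_top_words(open("top_words.txt", encoding="utf-8"))`, where the file name is a placeholder. Very short comment sections give unreliable ratios, which fits the requirement that the result doesn't have to be 100% accurate.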
It's very important that the application only checks a specific region of the web page (the comment section), but that should be easy to solve.
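As one possible way to limit the check to the comment section, the sketch below uses Python's built-in `html.parser` to collect only the text inside an element with a given CSS class. The class name `comments` is an assumption, since the real website's markup is not known here, and the names `CommentSectionExtractor` and `extract_comment_text` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CommentSectionExtractor(HTMLParser):
    """Collects text found inside elements carrying `target_class`."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside the comment section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_comment_text(html, target_class="comments"):
    """Return the text inside the comment section as a single string."""
    parser = CommentSectionExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This assumes the comments are present in the downloaded HTML. If the website loads them with JavaScript after the page opens, a different approach would be needed.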