I am trying to migrate our deduplication tool, which currently runs on our own infrastructure, to Amazon EC2.
The use case: I have a dataset of 5,000+ articles (up to 100,000 on some projects) from various data sources, and I need to find potential duplicates and flag them in our database for manual review by our users.
Currently the .NET app grabs the table from the project's MSSQL database, sends it over to our own MSSQL database, and kicks off a number of stored procedures that search for potential duplicates (based on full and partial matches of publication, article name, author, etc.). The results are then returned to the .NET app, and the articles are flagged with a potential-duplicate status. We have the code/stored procedures for doing the dedupe in either MSSQL or MySQL.
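To make the matching step concrete, here is a rough Python sketch of the kind of full/partial comparison described above. It is not the actual stored-procedure logic: the field names (`id`, `publication`, `title`, `author`), the `0.85` threshold, and the use of `difflib.SequenceMatcher` as the partial-match measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; the real table columns may differ.
FIELDS = ("publication", "title", "author")


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_potential_duplicates(articles, threshold=0.85):
    """Compare every pair of articles and label it 'full' or 'partial'.

    threshold is an assumed cut-off for partial matches, not a tuned value.
    """
    matches = []
    for left, right in combinations(articles, 2):
        scores = [similarity(left[f], right[f]) for f in FIELDS]
        if all(score == 1.0 for score in scores):
            matches.append((left["id"], right["id"], "full"))
        elif min(scores) >= threshold:
            matches.append((left["id"], right["id"], "partial"))
    return matches
```

Comparing every pair is quadratic in the number of articles, so at the upper end of the dataset sizes this sketch would need some form of pre-grouping (for example, only comparing articles within the same publication) before it would be practical.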
I am looking to have the duplicate search run on Amazon.