Keyword Parser/Cleaner Script
I would like a script that parses a keyword list and cleans out certain patterns. The inputs are a master list, a kill list, and syntax flags. The outputs are a cleaned list and a dirty list. The script must handle files with millions of lines. Each list has one keyword/keyphrase per line, terminated by \n. The master list contains raw keywords/keyphrases and is currently 10M lines. The kill list consists primarily of undesirable keywords/keyphrases and is currently around 10K lines. I would also like features that help sort out bad data, such as detecting excessive repeating characters/numbers/spaces, filtering special characters, enforcing a maximum keyphrase length, and anything else that would help clean up a keyword list. Please let me know if you have any questions or ideas.
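To make the request concrete, here is a rough Python sketch of the kind of streaming filter described above. It is only an illustration under assumptions that are not part of the spec: the names (`load_kill_list`, `is_dirty`, `classify`, `clean_file`), the thresholds (`MAX_LEN`, `MAX_REPEAT`), the ASCII-only definition of "special characters", and the choice to match kill-list entries against whole phrases and individual words are all hypothetical and open to discussion.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical thresholds; the posting does not specify exact values.
MAX_LEN = 80      # assumed maximum keyphrase length, in characters
MAX_REPEAT = 3    # assumed: more than 3 identical characters in a row is "excessive"

# Any character followed by MAX_REPEAT or more copies of itself (4+ in a row).
_REPEAT_RE = re.compile(r"(.)\1{%d,}" % MAX_REPEAT)
# Assumption: anything outside lowercase ASCII letters, digits, and space is "special".
_SPECIAL_RE = re.compile(r"[^a-z0-9 ]")


def load_kill_list(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase set of kill entries, skipping blank lines."""
    return {ln.strip().lower() for ln in lines if ln.strip()}


def is_dirty(phrase: str, kill: Set[str]) -> bool:
    """Return True if the phrase fails any of the assumed cleaning rules."""
    p = phrase.strip().lower()
    if not p or len(p) > MAX_LEN:
        return True
    if "  " in p:  # assumed: two or more consecutive spaces is excessive
        return True
    if _REPEAT_RE.search(p) or _SPECIAL_RE.search(p):
        return True
    # Assumed kill-list semantics: exact phrase match, or any single word match.
    return p in kill or any(word in kill for word in p.split())


def classify(lines: Iterable[str], kill: Set[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (phrase, dirty) pairs one line at a time, without loading the input."""
    for raw in lines:
        phrase = raw.rstrip("\n")
        yield phrase, is_dirty(phrase, kill)


def clean_file(master_path: str, kill_path: str,
               clean_path: str, dirty_path: str) -> None:
    """Stream the master list once, writing each line to the clean or dirty file."""
    with open(kill_path, encoding="utf-8") as f:
        kill = load_kill_list(f)  # about 10K entries fits comfortably in memory
    with open(master_path, encoding="utf-8") as src, \
         open(clean_path, "w", encoding="utf-8") as ok, \
         open(dirty_path, "w", encoding="utf-8") as bad:
        for phrase, dirty in classify(src, kill):
            (bad if dirty else ok).write(phrase + "\n")
```

Reading the master list line by line keeps memory use flat no matter how large the file is, and holding the kill list in a set makes each lookup constant-time. The syntax flags could switch individual rules on or off, but how they should be passed in is left open here.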