Looking for a PySpark developer to help build a PySpark application for demonstration purposes. The goal is to read three small JSON files (each under 1 MB) into a single DataFrame, do some preprocessing, and then compute metrics based on regular-expression searches. The ETL task itself is easy and I am able to achieve it. However, the end goal is described below, and it also serves as the acceptance criteria for this project.
The developer should be able to write code with the following points in mind:
Well-structured, object-oriented, maintainable code.
Unit tests for the different components.
Documentation, comments, and proper exception handling.
The solution is deployable and can be run both locally and on a cluster.
Config management (separate folders and files such as [login to view URL], [login to view URL], etc., rather than a single Python script).
Logging and alerting.
Data quality checks (like input/output dataset validation).