My aim is to decentralize the global control of the Resource Manager in the YARN framework by adding another layer, the Rack Unit Resource Manager (RU_RM) layer. In this layer, the compute nodes on each rack are controlled by that rack's own RU_RM, instead of a single Resource Manager controlling all the compute nodes in the network. I believe this will improve response and turnaround time for each job and will eliminate the single point of failure that causes jobs to halt under the existing global Resource Manager.
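To make the per-rack idea concrete, here is a minimal sketch of what such a layer could look like. It is not YARN code: the class `RackUnitRM`, the function `build_layer`, the notion of "free slots" per node, and the first-fit allocation rule are all hypothetical names and simplifications chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RackUnitRM:
    """Hypothetical per-rack resource manager (not a YARN class)."""
    rack_id: str
    # node name -> number of free container slots (assumed resource model)
    free_slots: dict = field(default_factory=dict)

    def allocate(self, slots_needed):
        """First-fit: return the first node on this rack with enough free slots."""
        for node, free in self.free_slots.items():
            if free >= slots_needed:
                self.free_slots[node] = free - slots_needed
                return node
        return None  # this rack cannot satisfy the request


def build_layer(topology):
    """Create one RU_RM per rack from a {rack: {node: free_slots}} mapping."""
    return {rack: RackUnitRM(rack, dict(nodes)) for rack, nodes in topology.items()}


# Example topology (made-up values).
layer = build_layer({"rack1": {"n1": 2, "n2": 4}, "rack2": {"n3": 1}})
chosen = layer["rack1"].allocate(3)  # handled by rack1's RU_RM only
```

Each request is served by the RU_RM of one rack, so the failure of one RU_RM in this sketch affects only the nodes on its own rack.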
The second idea is to ensure that each RU_RM holds the resources for which it is directly responsible and also keeps backup copies of the resources of the RU_RM preceding/succeeding it, so that if any RU_RM fails, its predecessor or successor can continue managing the compute nodes in that rack until the failed RU_RM recovers. This second idea is derived from the work of Melliar-Smith & Louise M. Moser: O-Ring: A fault tolerance and Load Balancing Architecture in a P2P systems.
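The neighbour-backup idea can be sketched as a ring in which each manager mirrors the node lists of the managers on either side of it. This is only an illustration under my own assumptions, not the O-Ring algorithm itself: the names `RingRM`, `replicate`, and `manager_for` are hypothetical, and preferring the successor over the predecessor during takeover is an arbitrary choice.

```python
class RingRM:
    """Hypothetical RU_RM placed on a logical ring (illustrative only)."""

    def __init__(self, rack_id, nodes):
        self.rack_id = rack_id
        self.own = set(nodes)   # nodes this RU_RM is directly responsible for
        self.backup = {}        # neighbour rack_id -> copy of its node set
        self.alive = True


def replicate(ring):
    """Give every RU_RM a backup copy of its predecessor's and successor's nodes."""
    n = len(ring)
    for i, rm in enumerate(ring):
        for j in (i - 1, i + 1):
            neighbour = ring[j % n]
            if neighbour is not rm:
                rm.backup[neighbour.rack_id] = set(neighbour.own)


def manager_for(ring, rack_id):
    """Return the RU_RM currently able to manage the given rack, or None."""
    n = len(ring)
    idx = next(i for i, rm in enumerate(ring) if rm.rack_id == rack_id)
    # Try the owner first, then the successor, then the predecessor (assumed order).
    for offset in (0, 1, -1):
        candidate = ring[(idx + offset) % n]
        if candidate.alive and (offset == 0 or rack_id in candidate.backup):
            return candidate
    return None


ring = [RingRM("A", ["a1", "a2"]), RingRM("B", ["b1"]), RingRM("C", ["c1"])]
replicate(ring)
ring[1].alive = False                 # RU_RM for rack B fails
stand_in = manager_for(ring, "B")     # a neighbour holding B's backup takes over
```

Once the failed RU_RM is marked alive again, `manager_for` returns it as the owner, which matches the "until such RU_RM recovers" behaviour described above.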
However, considering the time limit and the type of project, I have an alternative model in case my main work turns out to require too much effort.