Consider a world with a 2x2 grid (see attachment).
The cells S1, S2, S3, S4 are the states.
In each state the agent can choose one of the following actions: up, down, left, right.
State S1 is the terminal state. From any other state, the agent moves to the adjacent cell determined by the action.
For example: if we are in S3 and choose the action "Right", the agent moves to S4 with probability 1 and receives reward -1.
If the selected action would drive the agent outside the grid, it hits a wall and bounces to the opposite state with reward -2. For example, choosing "Right" at S4 results in the agent moving left, to S3.
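The dynamics above can be sketched in code. This is a minimal sketch, assuming the layout S1, S2 on the top row and S3, S4 on the bottom row (an assumption inferred from the "S3, Right, to S4" and "S4, Right, bounces to S3" examples; the attachment is not available here):

```python
# Assumed grid layout (not confirmed by the text, inferred from the examples):
#   S1 S2
#   S3 S4
# Coordinates are (row, col); S1 is the terminal state.
POS = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
STATE = {v: k for k, v in POS.items()}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward). A normal move costs -1; an action
    that would leave the grid bounces the agent to the opposite cell
    with reward -2, as described in the problem statement."""
    r, c = POS[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 2 and 0 <= nc < 2:
        return STATE[(nr, nc)], -1
    # wall hit: move in the opposite direction instead
    return STATE[(r - dr, c - dc)], -2
```

With this layout, `step("S4", "right")` returns `("S3", -2)`, matching the wall-bounce example.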
Initially, Q(s, a) = 0 for every state s and action a.
We run the every-visit Monte Carlo algorithm with exploring starts for a single episode of 3 steps.
What will the agent's policy be after the episode, and why?
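To make the question concrete, here is a sketch of the every-visit Monte Carlo update for one episode. The grid layout, the undiscounted return (gamma = 1), and the particular 3-step episode used below (S3 -Right-> S4 -Right-> S3 -Up-> S1) are all assumptions for illustration; the exercise does not fix which episode is sampled:

```python
import random
from collections import defaultdict

# Assumed layout: S1 S2 / S3 S4; S1 is terminal.
POS = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
STATE = {v: k for k, v in POS.items()}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ACTIONS = list(MOVES)

def step(state, action):
    r, c = POS[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 2 and 0 <= nc < 2:
        return STATE[(nr, nc)], -1
    return STATE[(r - dr, c - dc)], -2  # wall bounce

def generate_episode(start_state, start_action, max_steps=3):
    """Exploring starts: the first state-action pair is chosen freely.
    Subsequent actions here are uniformly random (an assumption; the
    exercise does not specify the behaviour policy)."""
    episode, s, a = [], start_state, start_action
    for _ in range(max_steps):
        s2, r = step(s, a)
        episode.append((s, a, r))
        if s2 == "S1":  # terminal state reached
            break
        s, a = s2, random.choice(ACTIONS)
    return episode

def every_visit_mc(episode, gamma=1.0):
    """Every-visit MC: average the return G that follows *every*
    occurrence of (s, a) in the episode, not just the first one."""
    Q = defaultdict(float)        # Q(s, a) = 0 initially
    returns = defaultdict(list)
    G = 0.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        returns[(s, a)].append(G)
    for sa, gs in returns.items():
        Q[sa] = sum(gs) / len(gs)
    return Q

# Hypothetical 3-step episode consistent with the dynamics:
episode = [("S3", "right", -1), ("S4", "right", -2), ("S3", "up", -1)]
Q = every_visit_mc(episode)
# Returns (gamma = 1): G(S3,up) = -1, G(S4,right) = -3, G(S3,right) = -4.
# The greedy policy in each state picks argmax_a Q(s, a); unvisited
# actions keep Q = 0, so ties among them are broken arbitrarily.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)])
          for s in ("S2", "S3", "S4")}
```

The key point the exercise is probing: after a single episode, Q-values are updated only for the state-action pairs that were actually visited, so the greedy policy is informed only in those states and can still prefer unvisited actions whose Q-value remains at the initial 0.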