Here are notes on the simulations:
For Markov decision processes (MDPs), convergence of Q-learning is guaranteed if each state/action pair is visited infinitely often.  There are two different algorithms: one for deterministic and one for nondeterministic MDPs.
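As a reference point, here is a minimal sketch of the tabular Q-learning update for the nondeterministic case, where the learning rate is decayed per state/action pair (the device the convergence proofs rely on).  The names, table layout, and the gamma value are illustrative assumptions, not taken from the simulation code.

```python
from collections import defaultdict

# Q-table and per-(state, action) visit counts; layout is illustrative.
Q = defaultdict(float)          # Q[(state, action)] -> estimated return
visits = defaultdict(int)       # visit counts used to decay the learning rate

GAMMA = 0.9                     # discount factor (assumed value)

def update_q(state, action, reward, next_state, actions):
    """One tabular Q-learning update (nondeterministic form).

    The learning rate alpha = 1 / (1 + visits) is decayed per
    state/action pair, which is what guarantees convergence when
    every pair is visited infinitely often.
    """
    visits[(state, action)] += 1
    alpha = 1.0 / (1 + visits[(state, action)])
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + GAMMA * best_next - Q[(state, action)])
```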

This version of the repeater placement problem is nondeterministic and not an MDP.   Because the robots are trained using three separate mazes, a robot can be in a grid square where a good move cannot be determined from the state information.  For example, a certain state in maze 1 may require the robot to move up, while the same state in maze 2 requires a down-left move.  If solutions exist for all the mazes in which the ambiguous states can be avoided, then it is possible for Q-learning to find them even though robot training is not an MDP.

There are two primary reasons for nondeterminacy in this problem.  First, one robot does not know how the other robot will move.  Second, the robots use "inertia": when the Q-learning table has equal maximum values for several actions in a state, a robot continues to make the same move on successive steps in that state (sketched below).  See "Tuning Q-Learning Parameters with a Genetic Algorithm" for details.
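The following is one way the inertia rule could be folded into epsilon-greedy action selection.  It is a sketch of the idea only; the function name, arguments, and tie-breaking details are assumptions, not the simulation's actual code.

```python
import random

def choose_action(Q, state, actions, prev_action, epsilon, rng=random):
    """Epsilon-greedy selection with "inertia" on ties.

    When several actions share the maximum Q-value, prefer the action
    taken on the previous step if it is among them; otherwise break
    the tie at random.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)           # exploratory move
    best_value = max(Q[(state, a)] for a in actions)
    best_actions = [a for a in actions if Q[(state, a)] == best_value]
    if prev_action in best_actions:
        return prev_action                   # inertia: repeat the same move
    return rng.choice(best_actions)
```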

Several ad hoc modifications were made to Q-learning to improve convergence.  First, the maximum number of iterations for each trial changes depending on the average number of steps needed to solve the mazes in the recent past.  For example, if the average number of steps to solve the three mazes is 170, the algorithm halts any trial at 200 steps to prevent long trials that do not solve the maze.  This technique reduces simulation time and avoids positive rewards for poor solutions.
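A sketch of how such an adaptive cap might be computed is shown below.  The fixed margin and floor are guesses for illustration; the notes only give 170 average steps leading to a 200-step cap.

```python
def trial_step_cap(recent_step_counts, margin=30, floor=50):
    """Cap on the number of steps allowed in one trial.

    Uses the running average of steps needed to solve the mazes
    recently, plus a margin.  The margin and floor values here are
    illustrative, not taken from the simulation.
    """
    if not recent_step_counts:
        return 10_000                        # effectively uncapped at the start
    avg = sum(recent_step_counts) / len(recent_step_counts)
    return max(int(avg) + margin, floor)
```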

Second, the value of epsilon, the probability that the robot will take an exploratory move, is reduced as the average number of steps to solve the mazes decreases.  During initial exploration, exploratory moves are helpful because they let the robots find solutions to the mazes.  But as the robots converge to a solution, a single random move might cause divergence.  The art of configuring Q-learning in this situation is finding a good function for setting epsilon that still allows the system to converge.
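One possible epsilon schedule of this kind is sketched below.  The linear form, the bounds, and the "best known steps" reference point are all assumptions made for illustration; the notes do not specify the actual function.

```python
def epsilon_from_progress(avg_steps, best_known_steps, eps_max=0.2, eps_min=0.01):
    """Shrink epsilon as the recent average solution length approaches
    the best solution length seen so far.

    Linear interpolation is used here only to illustrate the idea of
    tying exploration to convergence progress.
    """
    if avg_steps <= best_known_steps:
        return eps_min
    ratio = best_known_steps / avg_steps     # approaches 1 as solutions improve
    return eps_min + (eps_max - eps_min) * (1.0 - ratio)
```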

Both the maximum trial iterations and the value of epsilon can increase or decrease as needed, depending on the average of recent solution rates.