The Pond is a simulator for studying learning agents.  It is an object-oriented Java framework for building experiments involving agents, food, and inanimate objects.  The system was built to study learning in environments that impede both learning and survival, and to study methods of agent cooperation in environments where agents explore unknown territory.

The key features of the Pond are

  • Object-oriented design that allows new types of environments, agents, food, and inanimate objects to be created
  • Virtual or grid location of Pond objects
  • Genetic algorithm (GA) that can be used to tune learning parameters
  • Reinforcement learning agent
  • PondProject class that provides a convenient way to build simulation runs
  • Scaled arithmetic for learning parameters


The primary classes in the system are Pond, PondEnvironment, PondObject, PondProject, and agent classes such as AgentQ.

There are two main ways to run the simulation package:  program mode and PondProject mode.  In program mode, a program is written that instantiates Pond, PondEnvironment, and PondObject objects and calls the appropriate methods to run the simulation.  In PondProject mode, a simple program instantiates and configures a PondProject object to run the simulation or GA.  Once a PondProject object is created, the default parameters for the simulation run can be overridden using the SetValue() method.
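As a rough sketch of what PondProject mode might look like, the stand-in class below mirrors the configuration pattern described above.  The parameter names, defaults, and internals are illustrative assumptions; only the SetValue() override step comes from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the real PondProject class; actual parameter
// names and method signatures in the Pond framework may differ.
class PondProject {
    private final Map<String, String> params = new HashMap<>();

    PondProject() {
        // Assumed defaults for a simulation run.
        params.put("numAgents", "50");
        params.put("steps", "10000");
    }

    // Mirrors the SetValue() method used to override default parameters.
    void SetValue(String key, String value) {
        params.put(key, value);
    }

    String getValue(String key) {
        return params.get(key);
    }

    void run() {
        System.out.println("Running simulation with " + params);
    }
}

public class PondProjectDemo {
    public static void main(String[] args) {
        PondProject project = new PondProject();
        project.SetValue("numAgents", "100");  // override a default parameter
        project.run();
    }
}
```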

Survival Experiments

The Pond was used to study reinforcement learning agents in an environment with impediments to survival and learning.  In these experiments, there are Q-learning agents (class AgentQ), red veggies, and blue veggies.  In the static version of the Pond, the red veggies give nutrients and the blue veggies remove nutrients when they are eaten by agents.  In the dynamic version, the roles of the blue and red veggies change over time.

Agents learn to make survival decisions based on a reward they generate internally from their energy level.  Agents have access only to a 4-level health value while making moves, whereas eating veggies affects a finer-grained energy level.  Learning is also affected by the way veggies behave after having been eaten: both types of veggies have no nutritional effect on agents that eat them during the growth period that follows being eaten.  The effect of this low-resolution feedback and the veggie growth delay is that the agent reward system is nondeterministic.  But, because the feedback is statistically related to the effect of eating the two types of veggies, the agents can learn which veggies to eat.  Another impediment to learning and survival is that there is not enough food to support the initial number of agents.
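The learning loop described above can be sketched as a standard tabular Q-learning update.  The state and action encoding here is an assumption for illustration (the 4-level health value as the state, the choice of veggie type as the action); the actual AgentQ implementation may structure things differently.

```java
// Sketch of the tabular Q-learning update an AgentQ-style agent might use.
// States and actions are illustrative: the 4-level health value serves as
// the state, and the action is which type of veggie to eat.
public class QLearningSketch {
    static final int STATES = 4;   // coarse 4-level health signal
    static final int ACTIONS = 2;  // e.g., eat a red veggie vs. a blue veggie
    static double[][] q = new double[STATES][ACTIONS];
    static double alpha = 0.1;     // learning rate
    static double gamma = 0.9;     // discount factor

    // Standard Q-learning update: move Q(s,a) toward the internally
    // generated reward plus the discounted best value of the next state.
    static void update(int s, int a, double reward, int sNext) {
        double best = Math.max(q[sNext][0], q[sNext][1]);
        q[s][a] += alpha * (reward + gamma * best - q[s][a]);
    }

    public static void main(String[] args) {
        // One example step: reward derived from a change in health level.
        update(2, 0, 1.0, 3);
        System.out.println(q[2][0]);
    }
}
```

Because the reward is only statistically related to which veggie was eaten, many such updates are needed before the table reliably favors the nutritious veggie.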

In the dynamic version of the Pond where the roles of the red and blue veggies change, learning is more difficult, but the agents can learn to survive.

The paper, "Tuning Q-Learning Parameters with a Genetic Algorithm," describes a set of Pond experiments in both static and dynamic environments.  A GA was used to tune Q-learning parameters to increase the survival rate of agents.

Future Direction

To complete the Pond simulator, grid location mode must be implemented.  Virtual mode reduces the complexity of agents because they do not have to plan their moves in a coordinate system.  However, grid mode will allow more flexibility and the ability to study agents that plan physical moves.

Once grid mode is completed, I plan to use the Pond to study cooperating agents that explore an unknown terrain.  In particular, I want to investigate ways in which the outcome of exploratory actions by one agent can be shared with other agents in a way that leads to quicker learning.

This site © copyright 2004 by Ben E. Cline.