Solutions
Solution objects represent the solution computed by a planning algorithm, and they can be used by the agents to decide how to behave in an uncertain environment with limited resource availability. For example, a solution can be a policy describing the action to execute depending on the environment state. Other examples are collections of policies and finite-state controllers. The toolbox provides generic data structures which represent such solutions, and below we discuss them in more detail for both Markov Decision Processes and Partially Observable Markov Decision Processes.
CMDPAlgorithms compute a CMDPSolution, which represents the solution that can be used by the agents to choose their actions. A CMDPSolution object provides the getActions(t, joint state) method, which returns an array containing an action for each agent given a time step and the joint state of the agents. The joint state is simply an array containing the individual states of the agents. An overview is provided in the figure below, in which an arrow indicates that a class implements an interface.
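For example, a simulation loop could query the solution once per time step. The sketch below is illustrative only: it assumes the interface lives in the solutions package (as the class names later on this page suggest), that states and actions are encoded as int, and that getActions returns an int[] with one action per agent; the stepEnvironment helper is hypothetical.

```java
import solutions.CMDPSolution;

public class ExecuteCMDPSolution {

    // Illustrative sketch: execute a CMDPSolution for a fixed horizon.
    // Assumptions (not confirmed by the documentation): states and actions are
    // encoded as int, and getActions returns an int[] with one action per agent.
    public static void run(CMDPSolution solution, int numAgents, int horizon) {
        int[] jointState = new int[numAgents];  // assumed initial joint state (all zeros)

        for (int t = 0; t < horizon; t++) {
            // Documented API: one action per agent, given the time step and the joint state
            int[] jointAction = solution.getActions(t, jointState);

            // Hypothetical environment step: a real simulator would sample successor states
            jointState = stepEnvironment(jointState, jointAction);
        }
    }

    // Placeholder transition function, only here to keep the sketch self-contained
    private static int[] stepEnvironment(int[] jointState, int[] jointAction) {
        return jointState;
    }
}
```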
A specific implementation of a CMDPSolution is provided by CMDPSolutionPolicyBased, which uses policies to decide which actions need to be executed by the agents. It contains an MDPAgentSolutionPolicyBased object for each agent. This is visualized by the 1..n relationship in the figure.
An MDPAgentSolutionPolicyBased object can be an individual policy or a set of policies with associated probabilities:
- Individual policy: both MDPPolicyDeterministic and MDPPolicyStochastic implement the MDPAgentSolutionPolicyBased interface.
- Set of policies: in this case it is possible to use the MDPPolicySet class, which implements the MDPAgentSolutionPolicyBased interface. An MDPPolicySet object contains multiple policies, which can be either an MDPPolicyDeterministic or an MDPPolicyStochastic object.
After solving a planning problem, the algorithm obtains individual policies or sets of policies for each agent. Finally, it creates a CMDPSolutionPolicyBased object, which is eventually returned. Below we provide a few additional details about deterministic and stochastic policies.
Class: solutions.MDPPolicyDeterministic
The getAction(t,s) method returns the action to be executed in state s at time t.
Class: solutions.MDPPolicyStochastic
The getAction(t,s) method samples an action from the distribution represented by the stochastic policy, and it returns this action. Calling getAction(t,s) multiple times for the same t and s may give different actions due to the stochastic nature of the policy.
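The sketch below illustrates the difference between the two policy types by querying both for the same time step and state. It assumes integer-encoded states and actions and that both classes live in the solutions package; only the getAction(t,s) method is taken from the documentation above.

```java
import solutions.MDPPolicyDeterministic;
import solutions.MDPPolicyStochastic;

public class PolicyQueryExample {

    // Illustrative sketch, assuming states and actions are encoded as int.
    public static void compare(MDPPolicyDeterministic detPolicy,
                               MDPPolicyStochastic stochPolicy,
                               int t, int s) {
        // Deterministic: repeated calls for the same (t, s) return the same action
        System.out.println("Deterministic action: " + detPolicy.getAction(t, s));
        System.out.println("Deterministic action: " + detPolicy.getAction(t, s));

        // Stochastic: each call samples from the action distribution, so results may differ
        for (int i = 0; i < 5; i++) {
            System.out.println("Sampled action: " + stochPolicy.getAction(t, s));
        }
    }
}
```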
CPOMDPAlgorithms compute a CPOMDPSolution, which represents the solution that can be used by the agents to choose their actions. A CPOMDPSolution object provides the getActions(t, joint belief) method, which returns an array containing an action for each agent given a time step and the joint belief of the agents. The joint belief is simply an array containing the individual beliefs of the agents. An overview is provided in the figure below, in which an arrow indicates that a class implements an interface.
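Execution then mirrors the CMDP case, except that the solution is queried with beliefs rather than states. In the sketch below we assume, purely for illustration, that each individual belief is a double[] probability distribution over that agent's states (the toolbox may use a dedicated belief class instead), and the updateBeliefs helper stands in for a belief update based on the received observations.

```java
import solutions.CPOMDPSolution;

public class ExecuteCPOMDPSolution {

    // Illustrative sketch, assuming beliefs are plain double[] distributions over
    // an agent's states; the toolbox may represent beliefs with its own class.
    public static void run(CPOMDPSolution solution, double[][] jointBelief, int horizon) {
        for (int t = 0; t < horizon; t++) {
            // Documented API: one action per agent, given the time step and the joint belief
            int[] jointAction = solution.getActions(t, jointBelief);

            // Hypothetical: execute the actions, observe, and update each agent's belief
            jointBelief = updateBeliefs(jointBelief, jointAction);
        }
    }

    // Placeholder belief update, only here to keep the sketch self-contained
    private static double[][] updateBeliefs(double[][] jointBelief, int[] jointAction) {
        return jointBelief;
    }
}
```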
A specific implementation of a CPOMDPSolution is provided by CPOMDPSolutionPolicyBased, which uses policies to decide which actions need to be executed by the agents. It contains a POMDPAgentSolutionPolicyBased object for each agent. This is visualized by the 1..n relationship in the figure.
A POMDPAgentSolutionPolicyBased object can be an individual policy or a set of policies with associated probabilities:
- Individual policy: POMDPPolicyFSC, POMDPPolicyGraph and POMDPPolicyVector implement the POMDPAgentSolutionPolicyBased interface.
- Set of policies: in this case it is possible to use the POMDPPolicySet class, which implements the POMDPAgentSolutionPolicyBased interface. A POMDPPolicySet object contains multiple policies, each of which can be a POMDPPolicyFSC, POMDPPolicyGraph or POMDPPolicyVector object.
After solving a planning problem, the algorithm obtains individual policies or sets of policies for each agent. Finally, it creates a CPOMDPSolutionPolicyBased object, which is eventually returned. Below we provide a few additional details about the vector-based and controller-based policy representations.
Class: solutions.POMDPPolicyVector
The getAction(b,t) method returns the action to be executed in belief b at time t. The policy is represented by a set of alpha vectors for each time step.
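To illustrate what this representation means: an alpha vector assigns a value to each state and is associated with an action, and the action chosen for a belief is the action of the vector maximizing the inner product with that belief. The sketch below shows this standard selection rule in general; it is not the internal code of POMDPPolicyVector, and the AlphaVector class is defined here only for illustration.

```java
import java.util.List;

public class AlphaVectorSelection {

    // One alpha vector: a value for each state, plus the action it corresponds to.
    static class AlphaVector {
        final double[] values;
        final int action;

        AlphaVector(double[] values, int action) {
            this.values = values;
            this.action = action;
        }
    }

    // Standard alpha-vector action selection: return the action of the vector
    // with the highest inner product with the belief b.
    public static int selectAction(List<AlphaVector> vectorsAtTimeT, double[] b) {
        double bestValue = Double.NEGATIVE_INFINITY;
        int bestAction = -1;
        for (AlphaVector alpha : vectorsAtTimeT) {
            double value = 0.0;
            for (int s = 0; s < b.length; s++) {
                value += alpha.values[s] * b[s];
            }
            if (value > bestValue) {
                bestValue = value;
                bestAction = alpha.action;
            }
        }
        return bestAction;
    }
}
```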
Class: solutions.POMDPPolicyGraph
The getAction(b,t) method returns the action to be executed based on the current state of the finite-state controller. The update(a,o) method implements the transition of controller states, and reset() sets the current state to the initial state of the controller.
For more details about this policy representation we refer to: Walraven, E., & Spaan, M. T. J. (2018). Column Generation Algorithms for Constrained POMDPs. Journal of Artificial Intelligence Research, 62, 489–533.
Class: solutions.POMDPPolicyFSC
The getAction(b,t) method returns the action to be executed based on the current state of the finite-state controller. The update(a,o) method implements the transition of controller states, and reset() sets the current state to the initial state of the controller.
For more details about this policy representation we refer to: Poupart, P., Malhotra, A., Pei, P., Kim, K.E., Goh, B., & Bowling, M. (2015). Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (pp. 3342–3348).
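Both controller-based policies, POMDPPolicyGraph and POMDPPolicyFSC, are executed in the same way: reset the controller at the start of an episode, request an action at every step, and feed the executed action and received observation back via update. The sketch below illustrates this loop for POMDPPolicyFSC; only reset, getAction and update are taken from the documentation, while the integer encoding of actions and observations, the double[] belief and the executeAndObserve helper are assumptions made for illustration.

```java
import solutions.POMDPPolicyFSC;

public class ExecuteControllerPolicy {

    // Illustrative sketch, assuming integer-encoded actions/observations and a
    // double[] belief; only reset, getAction and update come from the documentation.
    public static void runEpisode(POMDPPolicyFSC policy, double[] belief, int horizon) {
        policy.reset();  // start in the initial controller state

        for (int t = 0; t < horizon; t++) {
            int action = policy.getAction(belief, t);     // action for the current controller state
            int observation = executeAndObserve(action);  // hypothetical environment call
            policy.update(action, observation);           // move to the next controller state
        }
    }

    // Placeholder environment interaction, only here to keep the sketch self-contained
    private static int executeAndObserve(int action) {
        return 0;  // a real environment would return a sampled observation
    }
}
```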
The ConstrainedPlanningToolbox has been developed by the Algorithmics group at Delft University of Technology, The Netherlands. Please visit our website for more information.