
Controlling online search for solving Markov Decision Processes via sampling
Laurent Peret, Frederick Garcia
Markov Decision Processes (MDPs) have become a standard over the last ten years for solving problems of sequential decision making under uncertainty. The usual goal in this framework is the computation of an optimal policy, which defines the optimal action for each state of the system. For complex MDPs, exact computation of optimal policies is often infeasible. Several approaches have been developed by the Reinforcement Learning community to compute near-optimal policies for complex MDPs by means of function approximation and simulation. In this paper, we investigate the problem of refining near-optimal policies via online search techniques, tackling the local problem of finding an optimal action for a single current state of the system. Kearns, Mansour & Ng (Machine Learning 2002) consider such an online approach based on sampling: at each step, a randomly sampled lookahead tree is developed to compute the optimal action for the current state. We propose here search control strategies for constructing such trees. Their purpose is to provide good "anytime" profiles: they quickly select a good action with high probability and then smoothly increase the probability of selecting the optimal action.
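The sampling-based online approach the abstract refers to can be illustrated with a minimal sketch of sparse-sampling lookahead in the style of Kearns, Mansour & Ng: from the current state, a tree of sampled successor states is developed to depth `depth`, with `width` samples per state-action pair, and the action with the best estimated value at the root is returned. The names (`simulator`, `actions`, `depth`, `width`) are illustrative assumptions, not the paper's notation, and the paper's own control strategies for growing the tree are not shown here.

```python
def sparse_sampling_action(state, simulator, actions, depth, width, gamma):
    """Choose an action for `state` by building a randomly sampled lookahead
    tree (sketch only; parameter names are illustrative assumptions).

    `simulator(s, a)` is assumed to sample a transition, returning
    `(next_state, reward)`.
    """

    def q_estimate(s, a, h):
        # Estimate Q(s, a) at horizon h by averaging over `width`
        # sampled successors of the pair (s, a).
        total = 0.0
        for _ in range(width):
            next_s, reward = simulator(s, a)
            total += reward + gamma * v_estimate(next_s, h - 1)
        return total / width

    def v_estimate(s, h):
        if h == 0:
            # Leaf of the sampled tree: truncate the horizon
            # (a heuristic value could be plugged in here instead).
            return 0.0
        return max(q_estimate(s, a, h) for a in actions)

    # Greedy action at the root of the sampled lookahead tree.
    return max(actions, key=lambda a: q_estimate(state, a, depth))
```

The tree has roughly `(width * |actions|) ** depth` leaves, which is why controlling how the search effort is spent, as the paper proposes, matters for good anytime behavior.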
Keywords: Markov decision processes, reinforcement learning, sampling, heuristic search
Citation: Laurent Peret, Frederick Garcia: Controlling online search for solving Markov Decision Processes via sampling. In R. López de Mántaras and L. Saitta (eds.): ECAI 2004, Proceedings of the 16th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2004, pp. 530-534.