On Generation of Representations for Reinforcement Learning

Dean's Office - Faculty of Informatics

Start date: 4 September 2012

End date: 5 September 2012

You are cordially invited to attend the PhD Dissertation Defense of Yi SUN on Tuesday, September 4th, 2012 at 09h00 in room A24 (Red building).

Creating autonomous agents that learn to act from sequential interactions has long been perceived as one of the ultimate goals of Artificial Intelligence (AI). Reinforcement Learning (RL), a subfield of Machine Learning (ML), addresses important aspects of this objective. This dissertation investigates a particular problem encountered in RL called representation generation. Two related subproblems are considered, namely basis generation and model learning, on which we present three original studies.

In the first study, we consider a particular basis generation method called online kernel sparsification (OKS). OKS was originally proposed for recursive least squares regression and was then quickly extended to RL. Despite the popularity of the method, important theoretical questions remain open. In particular, it was unclear how quickly the size of the OKS dictionary, or equivalently the number of basis functions constructed, grows with the number of data points. Characterizing this growth rate is crucial for understanding OKS, both its computational complexity and, perhaps more importantly, the generalization properties of the resulting linear regressor or value function estimator. We investigate this problem using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Based on this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we prove that under certain technical conditions, the size of the dictionary always grows sub-linearly in the number of data points and, as a consequence, the kernel linear regressor or value function estimator constructed from the resulting dictionary is consistent.
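
To make the notion of dictionary growth concrete, the following is a minimal sketch of one common form of online kernel sparsification, the approximate-linear-dependence test. It is an illustration under stated assumptions, not the dissertation's implementation: the Gaussian kernel, the scalar inputs, the threshold value, and the names `gaussian_kernel` and `oks_dictionary` are all hypothetical choices made for this example.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel on scalars; the kernel choice is an assumption."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))

def oks_dictionary(samples, kernel=gaussian_kernel, threshold=0.1):
    """Select a dictionary with an approximate-linear-dependence test."""
    dictionary = []  # retained samples, one basis function each
    k_inv = []       # inverse of the dictionary's kernel Gram matrix
    for x in samples:
        n = len(dictionary)
        k_vec = [kernel(d, x) for d in dictionary]
        # a = K^{-1} k: coefficients of the best approximation of x's
        # feature vector by the features of the current dictionary.
        a = [sum(k_inv[i][j] * k_vec[j] for j in range(n)) for i in range(n)]
        # delta: squared residual of that approximation in feature space.
        delta = kernel(x, x) - sum(ai * ki for ai, ki in zip(a, k_vec))
        if delta > threshold:
            # x is not well approximated, so add it to the dictionary and
            # update K^{-1} with the standard block-inverse formula.
            new_inv = [
                [k_inv[i][j] + a[i] * a[j] / delta for j in range(n)]
                + [-a[i] / delta]
                for i in range(n)
            ]
            new_inv.append([-a[j] / delta for j in range(n)] + [1.0 / delta])
            k_inv = new_inv
            dictionary.append(x)
    return dictionary

# Hypothetical data: 1000 evenly spaced points in [0, 1).
samples = [i / 1000.0 for i in range(1000)]
dictionary = oks_dictionary(samples)
print(len(dictionary), "of", len(samples), "samples kept")
```

In this sketch a sample enters the dictionary only when its feature-space residual exceeds the threshold, so repeated or nearby samples are discarded. The number of retained samples is the quantity whose growth rate the study characterizes.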

In the second study, we turn to a different class of basis generation methods, which make use of the reward information. Previous approaches in this setting construct a series of basis functions that, in sufficient number, can eventually represent the value function. In contrast, we show theoretically that there is a single, ideal basis function whose addition to the set of basis functions immediately reduces the error to zero without changing the existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving value function estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to that of existing methods, especially in the case of large discount factors.
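
The claim about the ideal basis function can be checked numerically on a small example. The sketch below is only an illustration under simplifying assumptions: it uses a hypothetical 3-state Markov reward process with a known transition matrix, computes the exact Bellman error instead of a sampled TD error, and evaluates value functions by simple fixed-point iteration. The names `evaluate` and `bellman_error` and all numbers are invented for this example.

```python
def evaluate(P, r, gamma, sweeps=2000):
    """Value function of a Markov reward process, v = r + gamma * P v,
    computed by repeated application of the Bellman operator."""
    n = len(r)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v

def bellman_error(P, r, gamma, v_hat):
    """Bellman error of an estimate: r + gamma * P v_hat - v_hat."""
    n = len(r)
    return [r[s] + gamma * sum(P[s][t] * v_hat[t] for t in range(n)) - v_hat[s]
            for s in range(n)]

# Hypothetical 3-state chain (all numbers are assumptions for illustration).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
r = [0.0, 0.0, 1.0]
gamma = 0.9

# Primary learner: one constant basis function with an arbitrary weight.
phi = [1.0, 1.0, 1.0]
w = 2.0
v_hat = [w * f for f in phi]

# Secondary learner: treat the Bellman error as a reward and evaluate it.
err = bellman_error(P, r, gamma, v_hat)
new_basis = evaluate(P, err, gamma)

# Add the new basis function with weight 1, leaving w unchanged.
corrected = [vh + nb for vh, nb in zip(v_hat, new_basis)]
v_true = evaluate(P, r, gamma)
max_gap = max(abs(c - v) for c, v in zip(corrected, v_true))
print("largest remaining error:", max_gap)
```

The reason this works is a one-line identity: the value function of the Bellman error equals the true value function minus the current estimate, so adding it with unit weight recovers the true value function up to numerical precision.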

In the last study, we focus on the problem of model learning, especially intelligent learning of the transition model of the environment. The problem is investigated under a Bayesian framework, where the learning is performed through probabilistic inference, and the learning progress is measured using Shannon information gain. In this setting, we show that the problem can be formulated as an RL problem, where the reward is given by the immediate information gain from performing the next action. This shows that the model-learning problem can in principle be solved using algorithms developed for RL. In particular, we show theoretically that if the environment is an MDP, then near-optimal model learning can be achieved following this approach.
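
As a rough illustration of information gain used as a reward signal, the sketch below computes the expected one-step information gain of each action in a toy setting. It rests on assumptions that are not taken from the dissertation: the agent holds a prior over a finite set of candidate models (here two, invented for the example), information gain is measured as the KL divergence from prior to posterior, and only the immediate reward is computed, with no planning over future gains.

```python
import math

def posterior(prior, likelihoods):
    """Bayes update of a distribution over candidate models."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]

def information_gain(prior, post):
    """KL divergence KL(post || prior), in nats."""
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0.0)

def expected_gain(prior, models, action):
    """Expected information gain from taking `action` once."""
    total = 0.0
    for outcome in range(len(models[0][action])):
        lik = [m[action][outcome] for m in models]
        p_outcome = sum(p * l for p, l in zip(prior, lik))
        if p_outcome > 0.0:
            total += p_outcome * information_gain(prior, posterior(prior, lik))
    return total

# Two hypothetical candidate models: each maps an action to a distribution
# over two possible next states. Action "a" looks the same under both
# models, action "b" does not.
models = [
    {"a": [0.5, 0.5], "b": [0.9, 0.1]},
    {"a": [0.5, 0.5], "b": [0.1, 0.9]},
]
prior = [0.5, 0.5]

gain_a = expected_gain(prior, models, "a")
gain_b = expected_gain(prior, models, "b")
print("expected gain of a:", gain_a)
print("expected gain of b:", gain_b)
```

Action "a" cannot distinguish the two candidate models, so it earns no reward, while action "b" does. An RL algorithm that maximizes such rewards over time would therefore favor informative actions.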

Dissertation Committee:

  • Prof. Jürgen Schmidhuber, Università della Svizzera italiana/IDSIA, Switzerland (Research Advisor)
  • Prof. Rolf Krause, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Kai Hormann, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Marcus Hutter, Australian National University, Australia (External Member)
  • Prof. Richard S. Sutton, University of Alberta, Canada (External Member)
  • Prof. Marco Wiering, University of Groningen, The Netherlands (External Member)