Agnostic Architecture for Heterogeneous Multi-Environment Reinforcement Learning

Korea University
NeurIPS 2023 Foundation Models for Decision Making Workshop

*Corresponding Author

TL;DR

We propose an agnostic neural network architecture combined with a distributed training algorithm for heterogeneous multi-environment reinforcement learning. The architecture consists of a multi-modal policy and multi-modal value functions and minimizes the use of environment-specific layers. We show that multi-environment training is possible with this architecture.


Abstract

In new environments, training a Reinforcement Learning (RL) agent from scratch can be inefficient. The computational and temporal costs can be significantly reduced if the agent can learn across diverse environments and effectively perform transfer learning. However, learning across multiple environments is challenging because different RL problems have different state and action spaces. Padding or naive parameter-sharing with environment-specific layers for different state-action spaces are possible solutions for multi-environment training, but these techniques do not scale well. In this work, we present a flexible, environment-agnostic architecture designed for learning across multiple environments simultaneously without padding or environment-specific embeddings, while enabling transfer learning to new environments. We also propose training algorithms for this architecture that support both online and offline RL. Our experiments demonstrate that a single agent can be trained across heterogeneous environments and that parameter-sharing with environment-specific layers is not effective for transfer learning.

Method

(1) Agnostic Architecture


Figure 1. Basic Building Block (Left). Policy and Value Networks (Right)

We want to train one multi-modal agent on multiple heterogeneous RL environments, each with its own state and action space. Training in this setting normally requires techniques such as padding or parametrized embeddings, because the input and output sizes of a neural network are fixed. To address this issue, we develop an agnostic architecture that performs sequence-to-sequence mapping. The architecture requires no padding, masking, parametrized embeddings, or modified RL interface. A minimal sketch of the idea is shown below.
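The following sketch is our illustration of the sequence-to-sequence idea, not the authors' released code: each scalar state dimension becomes one input token and each action dimension one output token, so a single set of weights can serve environments with different state and action sizes. All class and parameter names are hypothetical, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class AgnosticPolicySketch(nn.Module):
    """Sequence-to-sequence policy: any state_dim in, any action_dim out."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)              # shared embedding for one scalar state dim
        self.action_query = nn.Parameter(torch.zeros(1, 1, d_model))  # repeated per action dim
        self.seq2seq = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.mean_head = nn.Linear(d_model, 1)          # one action mean per output token

    def forward(self, state: torch.Tensor, action_dim: int) -> torch.Tensor:
        # state: (batch, state_dim) for any state_dim; no padding or masking needed.
        tokens = self.embed(state.unsqueeze(-1))                        # (batch, state_dim, d_model)
        queries = self.action_query.expand(state.shape[0], action_dim, -1)
        decoded = self.seq2seq(tokens, queries)                         # (batch, action_dim, d_model)
        return self.mean_head(decoded).squeeze(-1)                      # (batch, action_dim)

# The same weights handle environments of different sizes:
policy = AgnosticPolicySketch()
cartpole_action = policy(torch.randn(8, 4), action_dim=1)    # e.g. 4-dim state, 1-dim action
walker_action = policy(torch.randn(8, 17), action_dim=6)     # e.g. 17-dim state, 6-dim action
```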

(2) Training Procedure


Figure 2. Distributed Training for Multiple Environments
(Edited from DD-PPO, Erik Wijmans, et al.)

The existing Gym interface does not support parallel batch processing for heterogeneous RL environments. We therefore combine the DD-PPO algorithm with our architecture for efficient parallel training. The core idea is to assign a copy of the model and a subset of the environments to each worker (CPU/GPU) and to synchronize gradients across workers during training, as in the sketch below.
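The sketch below shows this setup in the spirit of DD-PPO, using PyTorch DistributedDataParallel: each worker owns a model replica and its own environment, and gradients are all-reduced during the backward pass. The environment list, loss, and launch details are placeholders, not the authors' exact training code (the real objective is the PPO clipped surrogate loss), and it assumes continuous action spaces.

```python
import gymnasium as gym
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(model_fn, env_ids_per_rank, steps=1000):
    # Launched with torchrun; "nccl" backend on GPU clusters.
    dist.init_process_group("gloo")
    rank = dist.get_rank()
    env = gym.make(env_ids_per_rank[rank])           # each rank gets its own (heterogeneous) env
    model = DDP(model_fn())                          # gradient sync happens inside backward()
    optim = torch.optim.Adam(model.parameters(), lr=3e-4)

    obs, _ = env.reset()
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        action = model(obs_t, action_dim=env.action_space.shape[0])
        obs, reward, terminated, truncated, _ = env.step(action.squeeze(0).detach().numpy())

        # Placeholder objective for illustration only; DD-PPO uses the PPO clipped loss.
        loss = -torch.as_tensor(reward, dtype=torch.float32) * action.sum()
        optim.zero_grad()
        loss.backward()                              # DDP all-reduces gradients across workers here
        optim.step()
        if terminated or truncated:
            obs, _ = env.reset()
```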

Experiments

(1) Online Multi-Environment Training


Figure 3. Results of multi-environment training in 16 heterogeneous environments
(Classic Control and Mujoco)

In Figure 3, we observe that in most environments the agnostic agent approaches the performance of the single-environment agent, and in a few environments it outperforms it.




Figure 4. Results of multi-environment training in 16 heterogeneous environments
(Mujoco and Atari)

To train the Atari environments together with Mujoco, we used the agnostic encoder for the Mujoco environments and a ResNet encoder for the Atari environments (a sketch of this per-modality routing is shown below). As Figure 4 shows, when Atari and Mujoco were trained simultaneously, performance on Atari did not improve quickly.
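The following is a hypothetical sketch of routing observations to the appropriate encoder by modality: image observations (Atari) pass through a small CNN standing in for the ResNet encoder, while flat vector observations (Mujoco) pass through the shared per-dimension token embedding. Module names and layer sizes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.vector_embed = nn.Linear(1, d_model)          # one token per state dimension
        self.image_encoder = nn.Sequential(                # stand-in for the ResNet encoder
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),      # 4 stacked Atari frames
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        if obs.dim() == 4:                                 # (batch, channels, H, W) -> Atari
            return self.image_encoder(obs).unsqueeze(1)    # a single image token
        return self.vector_embed(obs.unsqueeze(-1))        # (batch, state_dim, d_model) tokens

router = ModalityRouter()
atari_tokens = router(torch.randn(2, 4, 84, 84))   # -> (2, 1, 64)
mujoco_tokens = router(torch.randn(2, 17))         # -> (2, 17, 64)
```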



(2) Offline Multi-Environment Training

Table 1. Performance of DD-IQL


Table 1 shows the results of adapting the offline RL algorithm IQL to our architecture and training it on the D4RL dataset (DD-IQL). The multi-environment agent achieves about 89% of the performance of the single-environment agent. For reference, the standard IQL objectives that this setup optimizes are sketched below.
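The sketch below is our paraphrase of the standard IQL losses (expectile value regression, TD learning for Q, and advantage-weighted policy regression), which in the DD-IQL setting would be averaged across environment workers as in the online case. It is not the authors' code; all names are illustrative.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # L2^tau(u) = |tau - 1(u < 0)| * u^2, an asymmetric squared error
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff.pow(2)).mean()

def iql_losses(q, v, q_target, v_next, rewards, dones, log_probs,
               gamma: float = 0.99, tau: float = 0.7, beta: float = 3.0):
    # Value function: expectile regression toward the target Q-values
    value_loss = expectile_loss(q_target.detach() - v, tau)
    # Q function: TD regression toward r + gamma * V(s'), no max over actions
    td_target = rewards + gamma * (1.0 - dones) * v_next.detach()
    q_loss = (q - td_target).pow(2).mean()
    # Policy: advantage-weighted regression (clipped exponential weights)
    adv = (q_target - v).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    policy_loss = -(weights * log_probs).mean()
    return value_loss, q_loss, policy_loss
```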




(3) Pretraining and Transfer Learning


In these experiments, we compare the performance of a baseline and the agnostic agent. The baseline is a setup similar to DeepMind's GATO, with environment-specific encoder and decoder layers. For transfer learning, the baseline has to add new layers for the unseen environment and train them from scratch; the sketch below illustrates this contrast.
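The following hypothetical sketch contrasts the two setups: the baseline keeps a dictionary of environment-specific encoder and decoder heads, so every new environment requires fresh, untrained layers, whereas the agnostic agent (sketched earlier) reuses the same weights for any state-action size. Names and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class EnvSpecificBaseline(nn.Module):
    """GATO-like baseline: shared trunk plus per-environment heads."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.d_model = d_model
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.encoders = nn.ModuleDict()   # one input head per environment
        self.decoders = nn.ModuleDict()   # one output head per environment

    def add_env(self, env_id: str, state_dim: int, action_dim: int):
        # A new environment needs new, randomly initialized heads trained from scratch.
        self.encoders[env_id] = nn.Linear(state_dim, self.d_model)
        self.decoders[env_id] = nn.Linear(self.d_model, action_dim)

    def forward(self, env_id: str, state: torch.Tensor) -> torch.Tensor:
        h = self.encoders[env_id](state).unsqueeze(1)     # (batch, 1, d_model)
        h = self.trunk(h)
        return self.decoders[env_id](h.squeeze(1))        # (batch, action_dim)
```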


Figure 5. Pretraining


Figure 6. Transfer Learning

In pretraining, both the baseline and the agnostic agent come close to the performance of agents trained on a single environment. In transfer learning, however, the baseline does not learn well. We suspect that this is because its environment-specific layers overfit to the environments seen during pretraining.