A toolbox aimed at speeding up research on bargaining in MARL (cooperation problems in multi-agent reinforcement learning).


marltoolbox: facilitate and speed up research on bargaining in MARL.



Overview

This toolbox contains algorithms, environments, evaluation tools, and helper functions to conduct research on bargaining in MARL.

This toolbox relies on the Ray/Tune/RLLib framework to provide the basic RL components and research functionalities.

Additional benefits of building on the Ray/Tune/RLLib framework (see the minimal Tune sketch after this list):

  • use components from RLLib with extensive configuration options (e.g. a PPO policy or a prioritized replay buffer)
  • track your experiments, log easily to TensorBoard, and run hyperparameter searches
  • stay agnostic to the deep learning framework
  • create new algorithms using the very simple Tune API or the RLLib API
  • use the RLLib API to take advantage of a fully customizable training pipeline
  • create distributed algorithms (e.g. by using the policy factory of RLLib)
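
To make these points concrete, here is a minimal, self-contained Tune sketch showing experiment tracking, TensorBoard logging, and a hyperparameter search. The training function and the metric name are toy placeholders (not part of marltoolbox), and the calls assume the Ray 1.x Tune API:

```python
# Minimal Tune sketch: hyperparameter search with automatic TensorBoard logging.
# The training function below is a toy placeholder, not a marltoolbox component.
from ray import tune


def my_training_function(config):
    score = 0.0
    for _ in range(100):
        # Replace this toy update with your real training step.
        score += config["lr"]
        # Everything reported here is tracked by Tune and logged to TensorBoard
        # (by default under ~/ray_results).
        tune.report(mean_reward=score)


analysis = tune.run(
    my_training_function,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},  # hyperparameter search
    num_samples=3,  # e.g. 3 seeds per grid point, run in parallel
)
print(analysis.get_best_config(metric="mean_reward", mode="max"))
```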

Philosophy: Implement when needed. Improve at each new use. Keep it simple. Keep it flexible. Keep the maintenance cost low.

Support: We actively support researchers by adding tools that they find relevant for research on bargaining in MARL.

Get started

How to use this toolbox

Introduction

marltoolbox is a toolbox that you should fork/clone and customize for yourself. You can create new experiments by starting from the existing examples, and you should edit or inherit from any functionality that doesn't exactly fit your needs. This repository is intended as a toolbox that can be shared within a research team; it is not intended to be used in production.
marltoolbox is not a framework that provides a simple API to run experiments in a few lines of code (that is a feature of RLLib).

RLLib is built on top of Tune, and Tune is built on top of Ray. This toolbox, marltoolbox, is built to work with RLLib, but it also lets you fall back to Tune alone when needed, at the cost of some functionality.

To speed up research, we advise taking advantage of the functionalities of Tune and RLLib.

a) Read the README of the Ray project (which includes Tune and RLLib):

Ray README (<5 min)

b) Read this quick introduction to Tune

Tune's key concepts (< 5 min)

c) Read this quick introduction to RLLib

RLlib in 60 seconds (< 5 min)

d) Introduction to this toolbox:

Without any local installation, you can work through two tutorials that introduce marltoolbox together with Tune and RLLib.
Please use Google Colab to run them:

Advanced introduction

To explore `Tune` further:
- [`Tune` documentation](https://docs.ray.io/en/latest/tune/user-guide.html)
- [`Tune` tutorials](https://github.com/ray-project/tutorial/tree/master/tune_exercises)
- [`Tune` examples](https://docs.ray.io/en/master/tune/examples/index.html#tune-general-examples)

To explore `RLLib` further:
- [a simple tutorial](https://colab.research.google.com/github/ray-project/tutorial/blob/master/rllib_exercises/rllib_exercise02_ppo.ipynb) where `RLLib` is used to train a PPO algorithm
- [`RLLib` documentation](https://docs.ray.io/en/master/rllib-toc.html)
- [`RLLib` tutorials](https://github.com/ray-project/tutorial/tree/master/rllib_exercises)
- [`RLLib` examples](https://github.com/ray-project/ray/tree/master/rllib/examples)

To explore the toolbox `marltoolbox` further, take a look at [our examples](https://github.com/longtermrisk/marltoolbox/tree/master/marltoolbox/examples).

Toolbox installation

The installation has been tested on Ubuntu 18.04 LTS (preferred) and 20.04 LTS.
It requires less than 20 GB of space, including all the dependencies such as PyTorch.

(Optional) Connect to your virtual machine (VM) on Google Cloud Platform (GCP)

```bash
gcloud compute ssh {replace-by-instance-name}
```
(Usually optional) Do some basic upgrades and install some basic requirements (e.g. needed on a new VM)

```bash
sudo apt update
sudo apt upgrade
sudo apt-get install build-essential
# Run this command a second time (especially needed with Ubuntu 20.04 LTS)
sudo apt-get install build-essential
```
(Optional) Use a virtual environment

```bash
# If needed, install conda:
## Follow the instructions at https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
## For example:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Enter. Enter... yes. Enter. yes.
exit
# Connect again to the VM or open a new terminal
gcloud compute ssh {replace-by-instance-name}
# Check your conda installation
conda list

# Create a virtual environment:
conda create -y -n marltoolbox python=3.8.5
conda activate marltoolbox
pip install --upgrade pip
```

Install the toolbox: marltoolbox

```bash
# Install dependencies

## For RLLib
conda install -y psutil

## (optional) To be able to use most of the gym environments
sudo apt-get install -y libglu1-mesa-dev libgl1-mesa-dev libosmesa6-dev xvfb ffmpeg curl patchelf libglfw3 libglfw3-dev cmake zlib1g zlib1g-dev swig

# Install marltoolbox
git clone https://github.com/longtermrisk/marltoolbox.git
cd marltoolbox

# Here are different installation instructions to support different algorithms

## Default install
pip install -e .

## If you are planning to use LOLA, then run instead:
conda install -y python=3.6
pip install -e .[lola]
```

Test the installation

```bash
# Check that RLLib is working

## Use RLLib built-in training functionalities
rllib train --run=PPO --env=CartPole-v0 --torch
## Ctrl+C to stop the training

# Check that the toolbox is working
python ./marltoolbox/examples/rllib_api/pg_ipd.py
## You should get the status TERMINATED

# Visualize the logs
tensorboard --logdir ~/ray_results

## If working on GCP: forward the connection from the Virtual Machine (VM) to your machine.
## Run this command on your local machine from another terminal (not in the VM):
gcloud compute ssh {replace-by-instance-name} -- -NfL 6006:localhost:6006
## Then go to your browser and visit the URL http://localhost:6006/
```

(Optional) Install additional deep learning libraries (PyTorch CPU only is installed by default)

```bash
# Install PyTorch with GPU support
# Check the cuda version
nvidia-smi
# Look for "CUDA Version: XX.X"
# With the right cuda version:
conda install pytorch torchvision cudatoolkit=[cuda version like 10.2] -c pytorch

# Check the PyTorch installation and whether your GPU is available to PyTorch
python
import torch
torch.__version__
torch.cuda.is_available()
exit()

# Install Tensorflow
pip install tensorflow
```

Training models

Probably the greatest value of using RLLib/Tune and this toolbox is that you can use the provided environments, policies, and other components of marltoolbox and RLLib (like a PPO agent) anywhere, e.g. even without using Tune or RLLib for anything else.
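
For instance, an environment from the toolbox can be instantiated and stepped directly, following RLLib's MultiAgentEnv convention. The sketch below is only illustrative: the module path, class name, and config keys are assumptions, so check marltoolbox/envs and marltoolbox/examples for the actual ones.

```python
# Sketch: using a toolbox environment on its own, without Tune or RLLib.
# The import path, class name, and config keys are assumptions; the env is
# expected to follow RLLib's MultiAgentEnv dict-based interface.
from marltoolbox.envs.matrix_sequential_social_dilemma import IteratedPrisonersDilemma

env_config = {"players_ids": ["player_row", "player_col"], "max_steps": 20}
env = IteratedPrisonersDilemma(env_config)

observations = env.reset()
done = {"__all__": False}
while not done["__all__"]:
    # Multi-agent envs take and return dicts keyed by player id.
    # Action 0 is assumed here to be a valid discrete action (e.g. cooperate).
    actions = {player_id: 0 for player_id in env_config["players_ids"]}
    observations, rewards, done, info = env.step(actions)
    print(rewards)
```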

Yet we recommend using Tune and, if possible, RLLib. There are mainly three ways to run experiments with Tune or RLLib. They support increasingly many functionalities but also rely on increasingly constrained APIs.

**Tune function API (the least constrained, not recommended)**

- **Constraints:** With the `Tune` function API, you only need to provide the training function. [See the `Tune` documentation](https://docs.ray.io/en/master/tune/key-concepts.html).
- **Best used:** If you want to very quickly run some code from an external repository.
- **Functionalities:** Running several seeds in parallel and comparing their results. Easily plotting values to TensorBoard and visualizing the plots live. Tracking your experiments and hyperparameters. Hyperparameter search. Early stopping.
**Tune class API (very few constraints, recommended)**

- **Constraints:** You need to provide a Trainer class with, at minimum, a setup method and a step method. [See the `Tune` documentation](https://docs.ray.io/en/master/tune/key-concepts.html).
- **Best used:** If you want to run some code from an external repository and you need checkpoints. Helpers in this toolbox (`marltoolbox.utils.policy.get_tune_policy_class`) will also allow you to transform this (already trained) class into frozen `RLLib` policies. This is useful to produce evaluations against other `RLLib` algorithms or when using experimentation tools from `marltoolbox.utils`.
- **Additional functionalities:** Cleaner format. Checkpoints. Allows conversion to the `RLLib` policy API: the trained agents can be converted to the `RLLib` policy API for evaluation only, which lets you use functionalities that rely on the `RLLib` API (but not training).
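
For concreteness, here is a minimal sketch of such a Trainer class written against the Ray 1.x `tune.Trainable` API. The model, metric name, and checkpoint format are toy placeholders, not marltoolbox components:

```python
# Minimal sketch of the Tune class (Trainable) API with checkpointing.
import os
import pickle

from ray import tune


class MyTrainable(tune.Trainable):
    def setup(self, config):
        # Build your model, environment, optimizer, etc. here.
        self.lr = config["lr"]
        self.weights = 0.0

    def step(self):
        # One training iteration; the returned dict is logged to TensorBoard.
        self.weights += self.lr
        return {"mean_reward": self.weights}

    def save_checkpoint(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint.pkl")
        with open(path, "wb") as f:
            pickle.dump(self.weights, f)
        return path

    def load_checkpoint(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            self.weights = pickle.load(f)


tune.run(
    MyTrainable,
    config={"lr": 0.01},
    stop={"training_iteration": 10},
    checkpoint_freq=5,
    checkpoint_at_end=True,
)
```

The resulting checkpoints are what helpers such as `marltoolbox.utils.policy.get_tune_policy_class` can later turn into frozen `RLLib` policies for evaluation.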
**RLLib API (quite constrained, recommended)**

- **Constraints:** You need to use the `RLLib` API (trainer, policy, callbacks, etc.). For information, `RLLib` trainer classes are specific implementations of the `Tune` class API (just above). [See the `RLLib` documentation](https://docs.ray.io/en/master/rllib-toc.html).
- **Best used:** If you are creating a new training setup or policy from scratch, if you want a seamless integration with all `RLLib` components, or if you need distributed training.
- **Additional functionalities:** Easily using all components from `RLLib` (models, environments, algorithms, exploration, schedulers, preprocessing, etc.). Using the customizable trainer and policy factories from `RLLib`.
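
And here is a minimal sketch of the RLLib API itself, building a trainer object directly (Ray/RLLib 1.x; CartPole is only a stand-in environment):

```python
# Minimal RLLib sketch: build a built-in trainer and step it manually.
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(config={
    "env": "CartPole-v0",   # stand-in single-agent environment
    "framework": "torch",
    "num_workers": 1,
})

for i in range(5):
    result = trainer.train()  # one training iteration
    print(i, result["episode_reward_mean"])

checkpoint_path = trainer.save()  # RLLib handles checkpointing for you
ray.shutdown()
```

In practice you would usually launch such a trainer through `tune.run("PPO", config=...)` to keep Tune's experiment tracking and hyperparameter search.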

Some usages

**Fall back to the Tune APIs when using the RLLib API is too costly**

If the setup you want to train already exists, already has a training loop, and the cost of converting it to `RLLib` is too high, then with minimal changes you can use `Tune`.

**When is the conversion cost to `RLLib` too high?**
- If the algorithm has a complex, unusual dataflow
- If the algorithm has an unusual training process
  - like `LOLA`: performing "virtual" opponent updates
  - like `LTFT`: nested algorithms
- If you don't need to change the algorithm
- If you don't plan to run the algorithm against policies from `RLLib`
- If you do not plan to work much with the algorithm, and thus do not want to invest time in converting it to `RLLib`
- Some of the points above, and you are only starting to use `RLLib`
- etc.

###### Tutorials:
- Tutorial_Basics_How_to_use_the_toolbox.ipynb

###### Examples:
You can find such examples in `marltoolbox.examples.tune_class_api` and in `marltoolbox.examples.tune_function_api`.
**Using components directly provided by RLLib or marltoolbox**

###### Tutorials:
- Tutorial_Basics_How_to_use_the_toolbox.ipynb

###### a) Examples using the `Tune` class API:
- Using an A3C policy: `amd.py` with `use_rllib_policy = True` (toolbox example)
- Using (custom or not) environments:
  - IPD and coin game environments: `amd.py` (toolbox example)
  - Asymmetric coin game environment: `lola_pg_official.py` (toolbox example)

###### b) Examples using the `RLLib` API:
- IPD environments: `pg_ipd.py` (toolbox example)
- Coin game environment: `ppo_coin_game.py` (toolbox example)
- APEX_DDPG and the water world environment: [`multi_agent_independent_learning.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py)
- MADDPG and the two step game environment: [`two_step_game.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py)
- Policy Gradient (PG) and the rock paper scissors environment: [`rock_paper_scissors_multiagent.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py) (in the `run_same_policy` function)
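
A recurring pattern behind the environment examples above is to register an environment with Tune and point an RLLib trainer at it. Below is a hedged sketch of that pattern for the Ray/RLLib 1.x API; `MyCoinGame`, its import path, its spaces, and the config keys are hypothetical placeholders, not the toolbox's actual classes:

```python
# Sketch: plugging a custom multi-agent environment into an RLLib trainer
# via Tune's environment registry. MyCoinGame is a hypothetical placeholder.
import ray
from ray import tune
from ray.tune.registry import register_env

from my_project.envs import MyCoinGame  # hypothetical import

ray.init()
# The registered name can then be used as the "env" value of any RLLib config.
register_env("my_coin_game", lambda env_config: MyCoinGame(env_config))

tune.run(
    "PPO",
    config={
        "env": "my_coin_game",
        "env_config": {"max_steps": 20},  # forwarded to MyCoinGame's constructor
        "framework": "torch",
        "multiagent": {
            "policies": {
                # (policy class, obs space, action space, per-policy config);
                # None means "use the trainer's default policy class".
                "player_1": (None, MyCoinGame.OBSERVATION_SPACE, MyCoinGame.ACTION_SPACE, {}),
                "player_2": (None, MyCoinGame.OBSERVATION_SPACE, MyCoinGame.ACTION_SPACE, {}),
            },
            # Assumes the env's agent ids are "player_1" and "player_2".
            "policy_mapping_fn": lambda agent_id: agent_id,
        },
    },
    stop={"training_iteration": 5},
)
```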
**Customizing existing algorithms from RLLib**

###### Examples:
- Customize a policy's postprocessing (processing after `env.step`) and its trainer: `inequity_aversion.py` (toolbox example)
- Change the loss function of the Policy Gradient (PG) Policy: [`rock_paper_scissors_multiagent.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py) (in the `run_with_custom_entropy_loss` function)
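
The loss-customization example above relies on RLLib's `with_updates` pattern: derive a new policy class that swaps in a custom loss, then a new trainer class that uses it. A hedged sketch of that pattern against the Ray/RLLib 1.x API (the entropy coefficient and loss details are illustrative, not a drop-in recipe):

```python
# Sketch: swapping the loss of RLLib's Policy Gradient (PG) torch policy.
import torch
from ray.rllib.agents.pg import PGTrainer
from ray.rllib.agents.pg.pg_torch_policy import PGTorchPolicy


def entropy_regularized_pg_loss(policy, model, dist_class, train_batch):
    # Same signature as RLLib's built-in torch loss functions.
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch["actions"])
    # Vanilla policy-gradient loss plus a small entropy bonus.
    return torch.mean(-log_probs * train_batch["advantages"]
                      - 0.01 * action_dist.entropy())


# New policy class with the custom loss, and a new trainer class that uses it.
MyPGPolicy = PGTorchPolicy.with_updates(loss_fn=entropy_regularized_pg_loss)
MyPGTrainer = PGTrainer.with_updates(
    name="EntropyPG",
    get_policy_class=lambda config: MyPGPolicy,
)
```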
**Creating and using new custom policies in RLLib**

In `RLLib`, customizing a policy allows you to change its training and evaluation logic.

###### Examples:
- Hardcoded random Policy: [`multi_agent_custom_policy.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_custom_policy.py)
- Hardcoded fixed Policy: [`rock_paper_scissors_multiagent.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py) (in the `run_heuristic_vs_learned` function)
- Policy with nested Policies: `ltft_with_various_env.py` (toolbox example)
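
As a minimal companion to these examples, here is a hand-coded, non-learning policy written against the Ray/RLLib 1.x `Policy` base class; the class name and the "always play action 0" behavior are illustrative, not taken from the toolbox:

```python
# Sketch: a hardcoded RLLib policy that neither learns nor keeps weights.
from ray.rllib.policy.policy import Policy


class AlwaysCooperatePolicy(Policy):
    """Always plays action 0 (assumed here to mean 'cooperate')."""

    def compute_actions(self, obs_batch, state_batches=None,
                        prev_action_batch=None, prev_reward_batch=None,
                        info_batch=None, episodes=None, **kwargs):
        # Return one action per observation, plus empty RNN states
        # and an empty extra-outputs dict.
        return [0 for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        # This policy does not learn.
        return {}

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass
```

Such a policy can then be listed in a trainer's `multiagent.policies` config to play against learning agents.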
**Using custom dataflows in RLLib (custom Trainer or Trainer's execution_plan)**

###### Examples:
- Training 2 different policies with 2 different Trainers (less complex, but less sample efficient than the second method below): [`multi_agent_two_trainers.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_two_trainers.py)
- Training 2 different policies with a custom Trainer (more complex, more sample efficient): [`two_trainer_workflow.py`](https://github.com/ray-project/ray/blob/master/rllib/examples/two_trainer_workflow.py)
**Using experimentation tools from the toolbox**

###### Tutorials:
- Evaluations_Level_1_best_response_and_self_play_and_cross_play.ipynb

###### Examples:
- Training a level 1 best response: `l1br_amtft.py` (toolbox example)
- Evaluating same-play and cross-play performances: `amtft_various_env.py` (toolbox example)

Main contents of the toolbox

**Environments**
- various matrix social dilemmas
- various coin games
- bargaining with alternating offers ([Emergent Communication through Negotiation](https://arxiv.org/abs/1804.03980))
**Algorithms**
- AMD ([Adaptive Mechanism Design](https://arxiv.org/abs/1806.04067))
- amTFT ([Approximate Markov Tit-For-Tat](https://arxiv.org/abs/1707.01068))
- LTFT ([Learning Tit-For-Tat](https://longtermrisk.org/files/toward_cooperation_learning_games_oct_2020.pdf), **simplified version**)
- LOLA-Exact, LOLA-PG, LOLA-DICE
- supervised learning
- population: this policy plays an episode by sampling a policy from a population of similar policies
- hierarchical: a base policy class which allows the use of nested algorithms
**Utils**
- exploration
  - SoftQ with temperature schedule
  - SoftQ with clustering of the Q values
- log: callbacks to log values from environments and policies
- lvl1_best_response: helper functions to train level 1 exploiters
- policy: helpers to transform a trained Tune Trainer into frozen RLLib policies
- postprocessing: helpers to compute welfare functions and add this data to the evaluation batches (the batches sampled by the evaluation workers)
- restore: helpers to load a checkpoint for a chosen policy only (instead of for all existing policies, as RLLib does)
- rollout: a rollout runner function which can be called from inside an RLLib policy
- self_and_cross_perf: a helper to evaluate performance in self-play and cross-play. "Self-play": playing against agents from the same training run. "Cross-play": playing against agents from different training runs.
- plot: helpers to plot results
**Scripts**
- aggregate_and_plot_tensorboard_data: a script to aggregate the logged values from several seeds (into mean, std, etc.) and to create summary plots

TODO and wishlist

**Improvements**
- Add unit tests for the algorithms
- Refactor the algorithms to make them more readable
- Use the logger everywhere
- Add and improve docstrings
- Set good hyper-parameters in the custom examples
- Report all results directly in Weights&Biases (saving download time from the VM)

**New algorithms**
- Multi-agent adversarial IRL
- Multi-agent generative adversarial imitation learning
- Model-based RL like PETS, MPC
- Opponent modeling like k-level
- Capability to use algorithms from OpenSpiel, like MCTS

**New functionalities**
- Reward uncertainty
- Full / partial observability of opponent actions
- (partial) Parameter transparency
- Easy benchmarking with metrics specific to MARL
- (more on) Exploitability evaluation
- Performance against a suite of other MARL algorithms

**New environments**
- Capability to use environments from OpenSpiel
- (iterated) Ultimatum game (including variants)

Issues

Upgrade to RLLib 1.3 and improvements around amTFT

opened on 2022-01-21 13:26:20 by Manuscrit

Upgrade to use RLLib master (almost v1.3).

opened on 2021-04-15 13:30:09 by Manuscrit

- Remove the use of lock_replay during training (must not use it in LTFT).
- Create the submodule marltoolbox.utils.log.
- Move the methods that summarize a model into a helper class.
- Use before_init_loss instead of after_init (policy class factory arg).

Add quick and long end-to-end test for alt_offers

opened on 2021-03-23 09:38:09 by Manuscrit

To add in https://github.com/longtermrisk/marltoolbox/tree/master/tests/marltoolbox/examples: the quick test in test_end_to_end.py, the long test in manual_test_end_to_end.py.

