# RL CIRL Research

shaktikshri, updated 🕥 2022-12-08 06:10:17

Project Description is available at my university page under Proactive Troubleshooting [Aug2018-Present].

I will update this README as I get more time; the individual files, however, are extensively documented.

`dqn.py`

This file gives a template for constructing a Deep Q-Learning network. You can specify the number of hidden units, but for the moment it supports only one hidden layer.
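A minimal sketch of what a one-hidden-layer Q-network computes, written in plain Python so it runs without a deep learning framework. The class name `TinyQNetwork`, the ReLU activation, and the helper `td_target` are assumptions made for illustration; they are not taken from `dqn.py`.

```python
import math
import random


class TinyQNetwork:
    """One-hidden-layer Q-network: state -> one Q-value per discrete action."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(n_inputs)
        # Hidden layer weights: one row of n_inputs weights per hidden unit.
        self.w1 = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        scale = 1.0 / math.sqrt(n_hidden)
        # Output layer weights: one row of n_hidden weights per action.
        self.w2 = [[rng.uniform(-scale, scale) for _ in range(n_hidden)]
                   for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def q_values(self, state):
        # ReLU hidden layer (an assumed choice of activation).
        hidden = [max(0.0, sum(w * x for w, x in zip(row, state)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Linear output layer: one Q-value per action.
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]

    def greedy_action(self, state):
        q = self.q_values(state)
        return max(range(len(q)), key=q.__getitem__)


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Standard Q-learning regression target: r + gamma * max_a' Q(s', a')."""
    return reward if done else reward + gamma * max(next_q_values)
```

The number of hidden units is the only architectural knob here, which mirrors the single-hidden-layer restriction described above.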

`IRL/irl_finite_space.py`

This file implements finite-space IRL as put forth by Andrew Ng and Stuart Russell in *Algorithms for Inverse Reinforcement Learning*. I used a PuLP-based linear program solver, though many people prefer the cvxopt package. For a sample reward structure the following result was obtained,

`dqn_sin_stability.py`

This is the most recent code I am stuck on (among many others :P). It should ideally be a DQN implementation that stabilizes a sine function, or a general function. Given continuous values of a noisy sine output, the agent should choose a noise correction scheme that smoothly approximates the sine function, or whichever function is under consideration. Both the noise and the correction can take continuous real values, which makes this problem non-trivial.
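To make the problem statement concrete, here is a hypothetical toy environment: the agent observes a noisy sine value and replies with a real-valued correction. Everything in it (the class name `NoisySineEnv`, uniform noise, the negative absolute residual as reward) is an assumption for illustration and not the setup used in `dqn_sin_stability.py`.

```python
import math
import random


class NoisySineEnv:
    """Hypothetical 1-D task: observe sin(x) + noise, act with a real-valued correction."""

    def __init__(self, n_steps=100, noise_scale=0.5, seed=0):
        self.n_steps = n_steps
        self.noise_scale = noise_scale
        self.rng = random.Random(seed)
        self.t = 0
        self.noise = 0.0

    def _observe(self):
        # One full sine period is spread over n_steps samples.
        x = 2.0 * math.pi * self.t / self.n_steps
        self.noise = self.rng.uniform(-self.noise_scale, self.noise_scale)
        return math.sin(x) + self.noise

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, correction):
        # Residual left after the correction; zero means the noise was cancelled.
        residual = self.noise + correction
        reward = -abs(residual)
        self.t += 1
        done = self.t >= self.n_steps
        next_obs = 0.0 if done else self._observe()
        return next_obs, reward, done
```

Because `correction` is a real number, a DQN with a finite set of actions can only pick from a discretised grid of corrections, which is one way to see why the task is hard for it.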

`acla_with_approxq.py`

Lately I realized that the function stabilization problem I was trying to handle couldn't be solved without considering a continuous action space. As a first step towards the Continuous Actor Critic Learning Algorithm (CACLA) proposed by van Hasselt and Wiering in *Reinforcement Learning in Continuous Action Spaces*, I tried the ACLA algorithm, which as you'd expect is just a slight variation of the DQN form. The file `acla_with_approxq.py` is an implementation of ACLA with value function approximation. I tried out the following configurations:

1. Actor and Critic with experience replays [This training was by far the slowest I ever saw: 180 episodes of progress over 8 hours on an AMD Ryzen Threadripper 2990WX 128 GB]
2. Combinations of fixing Q targets in Critic and Actor
3. Using dropouts in Actor
4. Combinations of stochastic/batch updates of critic/actor

Out of all these, updating the Critic with experience replay without target scaling and the Actor in batch mode with target scaling gave the best performance. The environment was CartPole-v1. Episodes of 500 timesteps were reached within 100 episodes; a further learning rate decay for both actor and critic is expected to speed up this convergence. The actor still has a high variance even though I used a one-step TD error as the advantage. However, this variance seems to die off once convergence is achieved, something I'm still trying to explain to myself.
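The one-step TD error used as the advantage, and the way ACLA uses only its sign, can be sketched as follows. This is a simplified tabular-style illustration; the function names and the learning rate are assumptions, and the actual file uses function approximation rather than a single stored preference.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def acla_actor_target(delta):
    """ACLA looks only at the sign of delta, not its magnitude."""
    return 1.0 if delta > 0 else 0.0


def update_preference(preference, delta, lr=0.1):
    """Move the taken action's preference one step towards its sign-based target."""
    return preference + lr * (acla_actor_target(delta) - preference)
```

A positive TD error reinforces the action just taken and a non-positive one discourages it, regardless of how large the error is.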

`acla_with_mc_returns.py`

This file implements an actor-critic learning algorithm with Monte Carlo estimates of the returns. Several experiments were performed, and the results were consistent with the stochastic behavior of the gradients. The stochastic parameter updates worked best with SGD using learning rate scheduling and Nesterov accelerated gradient. However, full-batch gradient descent beat SGD by a large margin and converged within 500 episodes on CartPole-v1.
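For reference, Monte Carlo return estimates for one finished episode can be computed with a single backward pass over the rewards. This is a generic sketch; the function names and the default discount factor are assumptions and not taken from `acla_with_mc_returns.py`.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(returns, values):
    """Advantage estimate per step: Monte Carlo return minus the critic's value."""
    return [g - v for g, v in zip(returns, values)]
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` gives `[1.75, 1.5, 1.0]`.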

`cacla.py`

So finally I was able to write the Continuous Actor Critic Learning Algorithm. I benchmarked it against a continuous CartPole environment. The training started pretty low on enthusiasm, but to my surprise the algorithm hit 1189 timesteps in the 420th episode! The number of updates towards an action was derived from `δ_t`, the TD(0) error at time t, and `variance_t`, the running variance of the TD(0) error. Gaussian exploration was used. The Actor was trained in full-batch mode; the Critic used experience replay with fixed targets updated every `copy_epochs` episodes. It is worth appreciating the reduction in the Actor's variance over time and the corresponding increase in the timesteps.
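The sketch below shows one way the pieces named above fit together, following the variance-scaled variant of CACLA as I recall it from van Hasselt and Wiering's paper: the actor moves towards the explored action only after a positive TD error, and the number of such moves is the TD error divided by the running standard deviation, rounded up. The exact rule in `cacla.py` may differ; the function names, `beta`, and `lr` are assumptions, and the actor is reduced to a single scalar output.

```python
import math
import random


def gaussian_explore(mean_action, sigma, rng=random):
    """Sample an exploratory action around the actor's output."""
    return rng.gauss(mean_action, sigma)


def update_running_variance(var, delta, beta=0.001):
    """Exponential running estimate of the TD-error variance."""
    return (1.0 - beta) * var + beta * delta * delta


def n_actor_updates(delta, var):
    """Unusually large positive TD errors earn more actor updates (var must be > 0)."""
    if delta <= 0:
        return 0  # CACLA leaves the actor untouched after a non-positive TD error
    return math.ceil(delta / math.sqrt(var))


def cacla_actor_step(actor_output, action, delta, var, lr=0.01):
    """Move the actor's output towards the explored action n_actor_updates times."""
    for _ in range(n_actor_updates(delta, var)):
        actor_output += lr * (action - actor_output)
    return actor_output
```

Note that the update is made in action space (towards the action that turned out better than expected) rather than along a policy gradient, which is what distinguishes CACLA from most actor-critic methods.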

## Preliminary Results

I ran an experiment in which a noisy sine function was to be stabilised. The noise came from another time-dependent function with 4 unique levels. The environment is defined in `env_definition.py`. This can be interpreted as function filtering from a convolution. The file `experiment_sin_stability.py` gives the continuous actor critic learning algorithm we used here. The following results were obtained,

The hits + partial hits figure is the percentage of points in the domain where the algorithm brought the noisy curve within a [-0.1, +0.1] deviation of the actual value after correction. A maximum of 75.5% was observed.
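A minimal sketch of such a metric, assuming it simply counts the points whose corrected value lies within the tolerance band around the true value. How the experiment distinguishes hits from partial hits is not stated here, so this version does not separate them; the function name `hit_rate` is hypothetical.

```python
def hit_rate(corrected, target, tolerance=0.1):
    """Percentage of points whose corrected value is within +/- tolerance of the target."""
    if len(corrected) != len(target):
        raise ValueError("corrected and target must have the same length")
    if not target:
        raise ValueError("need at least one point")
    hits = sum(1 for c, y in zip(corrected, target) if abs(c - y) <= tolerance)
    return 100.0 * hits / len(target)
```

With `target = [0.0, 1.0, -1.0, 0.5]` and `corrected = [0.05, 1.2, -1.0, 0.45]`, three of the four points fall inside the band, so the function returns `75.0`.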

The original curve is the one observed with convolution; the corrected one is the one the agent outputs after correction.

## Issues

### Bump certifi from 2019.6.16 to 2022.12.7

opened on 2022-12-08 06:10:17 by dependabot[bot]

Bumps certifi from 2019.6.16 to 2022.12.7.

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/shaktikshri/adaptiveSystems/network/alerts).

### Bump lief from 0.9.0 to 0.12.2

opened on 2022-10-06 20:01:41 by dependabot[bot]

Bumps lief from 0.9.0 to 0.12.2.

Release notes

Sourced from lief's releases.

## 0.12.2

No release notes provided.

## 0.12.0

Changelog is here: https://lief-project.github.io/doc/latest/changelog.html#march-25-2022

## 0.11.5

No release notes provided.

## 0.11.4

No release notes provided.

## 0.11.3

No release notes provided.

## 0.10.1

No release notes provided.

## 0.10.0

No release notes provided.


### Bump joblib from 0.13.2 to 1.2.0

opened on 2022-09-30 19:47:47 by dependabot[bot]

Bumps joblib from 0.13.2 to 1.2.0.

Changelog

Sourced from joblib's changelog.

## Release 1.2.0

• Fix a security issue where `eval(pre_dispatch)` could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

• Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

• Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

• Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with `mmap_mode != None` as the resulting `numpy.memmap` object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

• Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

• Vendor loky 3.3.0 which fixes several bugs including:

• robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

• avoiding leaking worker processes in case of nested loky parallel calls;

• reliability spawn the correct number of reusable workers.

## Release 1.1.0

• Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

• Fix joblib.Memory bug with the `ignore` parameter when the cached function is a decorated function.

... (truncated)


### Bump nbconvert from 5.5.0 to 6.5.1

opened on 2022-08-23 17:52:10 by dependabot[bot]

Bumps nbconvert from 5.5.0 to 6.5.1.

Release notes

Sourced from nbconvert's releases.

## Release 6.5.1

No release notes provided.

## New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.5...6.5

## New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.2...6.4.3

## New Contributors

... (truncated)


### Bump mistune from 0.8.4 to 2.0.3

opened on 2022-07-29 22:56:50 by dependabot[bot]

Bumps mistune from 0.8.4 to 2.0.3.

Release notes

Sourced from mistune's releases.

## Version 2.0.2

Fix `escape_url` via lepture/mistune#295

## Version 2.0.1

Fix XSS for image link syntax.

## Version 2.0.0

First release of Mistune v2.

## Version 2.0.0 RC1

In this release, we have a Security Fix for harmful links.

## Version 2.0.0 Alpha 1

This is the first release of v2. An alpha version for users to have a preview of the new mistune.

Changelog

Sourced from mistune's changelog.

## Changelog

Here is the full history of mistune v2.

Version 2.0.4

Released on Jul 15, 2022

• Fix url plugin in `<a>` tag
• Fix `*` formatting

Version 2.0.3

Released on Jun 27, 2022

• Fix `table` plugin
• Security fix for CVE-2022-34749

Version 2.0.2

Released on Jan 14, 2022

• Fix `escape_url`

Version 2.0.1

Released on Dec 30, 2021

• XSS fix for image link syntax.

Version 2.0.0

Released on Dec 5, 2021

• This is the first non-alpha release of mistune v2.

Version 2.0.0rc1

Released on Feb 16, 2021

Version 2.0.0a6

... (truncated)


### Bump distributed from 2.1.0 to 2021.10.0

opened on 2022-07-15 21:59:18 by dependabot[bot]

Bumps distributed from 2.1.0 to 2021.10.0.

##### Shakti Kumar

reinforcement-learning inverse-reinforcement-learning cirl deep-reinforcement-learning