Code for A Hierarchical Model for Data-to-Text Generation (Rebuffel, Soulier, Scoutheeten, Gallinari; ECIR 2020)

KaijuML, updated 🕥 2022-12-08 03:20:10

Data-to-Text-Hierarchical PWC

Code for A Hierarchical Model for Data-to-Text Generation (Rebuffel, Soulier, Scoutheeten, Gallinari; ECIR 2020); most of this code is based on OpenNMT.

UPDATE 11/03/2021: The original checkpoints used to produce results from the paper are officialy lost. However, I still have the actual model outputs, which are now included in this repo. Simply unzip outputs.zip.

Furthermore, Radmil Raychev and Craig Thomson from the University of Aberdeen are currently working with this repo, and have agreed to share their checkpoints, namely htransformer.tar.gz2.
(Note that this file is not downloadable by command line, still looking for a better alternative)

Once it's downloaded, simply tar -xvf htransformer.tar.gz2.
You'll find the data used to train the model, as well as *.cfg files and *.pt checkpoints. Note that the data is from SportSett, which contains some additional info (such as day of the week for instance).
(Also see Thomson et al. for more info regarding the impact of additional data on system performances.)

Requirements

You will need a recent python to use it as is, especially OpenNMT. However I guess files could be tweaked to work with older pythons. Please note that at the time of writting, torch==1.1.0 can be problematic with very recent version of python. I suggest running the code with python=3.6

Full requirements can be found in requirements.txt. Note that they are not really all required, it's the full pip freeze of a clean conda virtual env, you can probably make it work with less.

Beyond standard packages included in miniconda, usefull packages are torch==1.1.0 torchtext==0.4 and some others required to make onmt work (PyYAML and configargparse for example). I also use more_itertools to format the dataset.

Dataset

The dataset used in the paper can be downloaded here. More specifically, you just need to download the RotoWire dataset:

bash cd data wget https://github.com/harvardnlp/boxscore-data/raw/master/rotowire.tar.bz2 tar -xvjf rotowire.tar.bz2 cd ..

You'll need to format the dataset so that it can be preprocessed by OpenNMT.

python data/make-dataset.py --folder data/

At this stage, your repository should look like this:

. ├── onmt/ # Most of the heavy-lifting is done by onmt ├── data/ # Dataset is here │ ├── rotowire/ # Raw data stored here ├ ├── make-dataset.py # formating script ├ ├── train_input.txt ├ ├── train_output.txt │ └── ... └── ...

Experiments

Before any code run, we build an experiment folder to keep things contained

python create-experiment.py --name exp-1

At this stage, your repository should look like this:

. ├── onmt # Most of the heavy-lifting is done by onmt ├── experiments # Experiments are stored here │ └── exp-1 │ │ ├── data ├ │ ├── gens │ │ └── models ├── data # Dataset is here └── ...

Preprocessing

Before training models via OpenNMT, you must preprocess the data. I've handled all useful parameters with a config file. Please check it out if you want to tweak things, I have tried to include comments on each command. For futher info you can always check out the OpenNMT preprocessing doc

python preprocess.py --config preprocess.cfg

At this stage, your repository should look like this:

├── onmt # Most of the heavy-lifting is done by onmt ├── experiments # Experiments are stored here │ └── exp-1 │ │ ├── data │ │ │ ├── data.train.0.pt │ │ │ ├── data.valid.0.pt │ │ │ ├── data.vocab.pt │ │ │ ├── preprocess-log.txt ├ │ ├── gens │ │ └── models ├── data # Dataset is here └── ...

Training

To train a hierarchical model on Rotowire you can run:

python train.py --config train.cfg

To train with different parameters than used in the paper, please refer to my comments in the config file, or check OpenNMT train doc.

This config file runs the training for 100 000 steps, however we manually stopped the training at 30 000.

Translating

Before translating, we average the last checkpoints. If you did anything different from previous commands, please change the first few line of average-checkpoints.py.

You can average checkpoints by running:

python average-checkpoints.py --folder exp-1 --steps 31000 32000 33000 34000 35000

Now you can simply translate the test input by running:

python translate.py --config translate.cfg

Evaluation

RG, CS and CO metrics were originaly (see here) coded in Lua and python2. Because of compatibility issues with modern hardware, I have re-implemented RG in PyTorch and python3:

Follow instructions at: KaijuML/rotowire-rg-metric.

You can evaluate the BLEU score using SacreBLEU from Post, 2018. See the repo for installation, it should be a breeze with pip.

You can get the BLEU score by running:

cat experiments/exp-1/gens/test/predictions.txt | sacrebleu --force data/test_output.txt

(Note that --force is not required as it doesn't change the score computation, it just suppresses a warning because this dataset is already tokenized)

Alternatively you can use any prefered method for BLEU computation. I have also checked scoring models with NLTK and scores were virtually the same.

Issues

Bump certifi from 2019.11.28 to 2022.12.7

opened on 2022-12-08 03:20:09 by dependabot[bot]

Bumps certifi from 2019.11.28 to 2022.12.7.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).

Bump nbconvert from 5.6.1 to 6.5.1

opened on 2022-08-23 17:58:15 by dependabot[bot]

Bumps nbconvert from 5.6.1 to 6.5.1.

Release notes

Sourced from nbconvert's releases.

Release 6.5.1

No release notes provided.

6.5.0

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.5...6.5

6.4.3

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.2...6.4.3

6.4.0

What's Changed

New Contributors

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).

Bump mistune from 0.8.4 to 2.0.3

opened on 2022-07-29 23:00:19 by dependabot[bot]

Bumps mistune from 0.8.4 to 2.0.3.

Release notes

Sourced from mistune's releases.

Version 2.0.2

Fix escape_url via lepture/mistune#295

Version 2.0.1

Fix XSS for image link syntax.

Version 2.0.0

First release of Mistune v2.

Version 2.0.0 RC1

In this release, we have a Security Fix for harmful links.

Version 2.0.0 Alpha 1

This is the first release of v2. An alpha version for users to have a preview of the new mistune.

Changelog

Sourced from mistune's changelog.

Changelog

Here is the full history of mistune v2.

Version 2.0.4


Released on Jul 15, 2022
  • Fix url plugin in <a> tag
  • Fix * formatting

Version 2.0.3

Released on Jun 27, 2022

  • Fix table plugin
  • Security fix for CVE-2022-34749

Version 2.0.2


Released on Jan 14, 2022

Fix escape_url

Version 2.0.1

Released on Dec 30, 2021

XSS fix for image link syntax.

Version 2.0.0


Released on Dec 5, 2021

This is the first non-alpha release of mistune v2.

Version 2.0.0rc1

Released on Feb 16, 2021

Version 2.0.0a6


</tr></table> 

... (truncated)

Commits
  • 3f422f1 Version bump 2.0.3
  • a6d4321 Fix asteris emphasis regex CVE-2022-34749
  • 5638e46 Merge pull request #307 from jieter/patch-1
  • 0eba471 Fix typo in guide.rst
  • 61e9337 Fix table plugin
  • 76dec68 Add documentation for renderer heading when TOC enabled
  • 799cd11 Version bump 2.0.2
  • babb0cf Merge pull request #295 from dairiki/bug.escape_url
  • fc2cd53 Make mistune.util.escape_url less aggressive
  • 3e8d352 Version bump 2.0.1
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).

Bump numpy from 1.17.4 to 1.22.0

opened on 2022-06-21 23:51:40 by dependabot[bot]

Bumps numpy from 1.17.4 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

  • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
  • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
  • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
  • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
  • A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).

Bump notebook from 6.1.5 to 6.4.12

opened on 2022-06-16 23:35:29 by dependabot[bot]

Bumps notebook from 6.1.5 to 6.4.12.

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).

Bump ipython from 7.10.2 to 7.16.3

opened on 2022-01-21 20:11:48 by dependabot[bot]

Bumps ipython from 7.10.2 to 7.16.3.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/KaijuML/data-to-text-hierarchical/network/alerts).
Clément Rebuffel

Data-to-Text Generation - PhD Student

GitHub Repository

nlg natural-language-generation rotowire hierarchical-models