Code for Politeness Transfer: A Tag and Generate Approach

tag-and-generate, updated 🕥 2022-12-08 06:21:10

Tagger and Generator

Dataset preparation: tag-and-generate/tagger-generator/tag-and-generate-data-prep

Training, inference, evaluation: tag-and-generate/tagger-generator/tag-and-generate-train


Walkthrough

We will now present an example of training the politeness transfer system from scratch. The process has five steps: * Step 1: Getting the code * Step 2: Getting the training data * Step 3: Preparing parallel data for training * Step 4: Training the tagger and generator * Step 5: Running inference

Step 1: Getting the code

We begin by cloning this repo:

sh git clone https://github.com/tag-and-generate/tagger-generator.git The cloned folder contains: i) tag-and-generate-data-prep the codebase used for creating the parallel tag and generate dataset, and ii) tag-and-generate-train, the training code.

Each of these folders has a requirements.txt file that can be used to download the dependencies.

Next, let's create a folder inside tagger-generator to save all the datasets/tags:

sh cd tagger-generator mkdir data

Step 2: Getting the training data.

The training data in a ready to use format is located here.

Download the zip file to the data folder created above and extract politeness.tsv.

sh unzip politeness_processed.zip head politeness.tsv txt|style|split -----|-----|----- forwarded by tana jones / hou / ect on 09/28/2000|P_2|train the clickpaper approvals for 9/27/00 are attached below .|P_7|train "hello everyone : please let me know if you have a subscription to "" telerate "" ?"|P_7|train we are being billed for this service and i do not know who is using it .|P_0|train

As we can see, the data is in the tsv format and has the right header.

You can also use gdown to directly download the file:

sh gdown --id 1E9GHwmVM9DL9-KiaIaG5lm_oagLWe908

Now that we have the codebase and the dataset, let's start by creating the parallel data required for training the models. Let's do a listing of the folder so far to make sure we are on the same page:

sh (dl) [email protected]:~/tagger-generator$ ls data LICENSE README.md tag-and-generate-data-prep tag-and-generate-train So, we are in the repo (tagger-generator), and see the two code folders (tag-and-generate-data-prep and tag-and-generate-train), as well as the data folder (data). Further, the data folder has the politeness.tsv file that we just downloaded: sh (dl) [email protected]:~/tagger-generator$ ls data/ politeness_processed.zip politeness.tsv

Step 3: Preparing parallel data for training

We prepare the parallel data using tag-and-generate-data-prep:

sh cd tag-and-generate-data-prep python src/run.py --data_pth ../data/politeness.tsv --outpath ../data/ --style_0_label P_9 --style_1_label P_0 --is_unimodal True More details on these options are located in tag-and-generate/tagger-generator/tag-and-generate-data-prep. In summary, we specify the input file, the label for the style of interest (P_9) and a neutral/contrastive style (P_0). Importantly, we specify --is_unimodal True. This option ensures that the parallel data is created as per the unimodal style setting (Figure 3 in the paper).

After data-prep finishes, we see several files in ../data/. The important files are described below:

  • P_9_tags.json: these are the politeness tags or phrases identified as polite phrases:

"thank you" "thank" "looking forward" "glad" "be interested"

  • The data prep code creates two sets of training files: one for the tagger and another for the generator. To understand these, let's take a sample sentence please get back to me if you have any additional concerns . and look at how it is represented in different files:

    • entagged_parallel.train.en (input to the tagger):
      • back to me have concerns .
    • entagged_parallel.train.tagged (output of the tagger):
      • [P_90] back to me [P_91] have [P_92] concerns .
    • engenerated_parallel.train.en (input to the generator):
      • [P_90] back to me [P_91] have [P_92] concerns .
    • engenerated_parallel.train.generated (output of the generator)
      • please get back to me if you have any additional concerns .

    Here, P_9 is the style tag, and the number after the style tag captures the position of the tag in the sentence.

With the data files ready, we are ready to run training.

Step 4: Training the tagger and generator

All the training and inference related scripts/code is present in tag-and-generate-train, so let's cd to it.

sh cd tag-and-generate-train

In order to prepare the files for training, we first process them using BPE.

sh bash scripts/prepare_bpe.sh tagged ../data/ bash scripts/prepare_bpe.sh generated ../data/

We can now start training the tagger and generator:

sh nohup bash scripts/train_tagger.sh tagged politeness ../data/ > tagger.log & nohup bash scripts/train_generator.sh generated politeness ../data/ > generator.log &

politeness is a user-defined handle that we will use during inference.

After the training finishes, the best models (given by validation perplexity) are stored in models:

sh (dl) [email protected]:~/tagger-generator/tag-and-generate-train$ ls models/politeness/bpe/ en-generated-generator.pt en-tagged-tagger.pt

For our run, at the end of 5 epochs, the validation perplexity was 1.26 for the tagger, and 1.76 for the generator.

Step 5: Running inference

Let's test out the trained models on some sample sentences:

```sh (dl) [email protected]:~/tagger-generator/tag-and-generate-train$ cat > input.txt send me the text files. look into this issue.

bash scripts/inference.sh input.txt sample tagged generated politeness P_9 P_9 ../data/ 3 ```

Here sample is a unique identifier for the inference job, and politeness is the identifier we used for the training job. P_9 is the style tag (kept the same for unimodal jobs). (Please see the README at tag-and-generate/tagger-generator/tag-and-generate-train for more details).

The final and intermediate outputs are located in experiments folder:

sh (dl) [email protected]:~/tagger-generator/tag-and-generate-train$ ls experiments/sample_* experiments/sample_generator_input experiments/sample_tagged experiments/sample_output experiments/sample_tagger_input

Let's look at the final output:

sh (dl) [email protected]:~/tagger-generator/tag-and-generate-train$ cat experiments/sample_output please send me the text files. we would like to look into this issue. Not bad!

We hope this walkthrough is helpful in understanding and using the codebase. Here are some additional helpful links:

Issues

Bump certifi from 2019.6.16 to 2022.12.7 in /tag-and-generate-data-prep

opened on 2022-12-08 06:21:09 by dependabot[bot]

Bumps certifi from 2019.6.16 to 2022.12.7.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).

Bump certifi from 2019.6.16 to 2022.12.7 in /tag-and-generate-train

opened on 2022-12-08 06:21:09 by dependabot[bot]

Bumps certifi from 2019.6.16 to 2022.12.7.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).

Bump pillow from 6.2.0 to 9.3.0 in /tag-and-generate-data-prep

opened on 2022-11-22 04:41:43 by dependabot[bot]

Bumps pillow from 6.2.0 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

  • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

  • Initialize libtiff buffer when saving #6699 [radarhere]

  • Inline fname2char to fix memory leak #6329 [nulano]

  • Fix memory leaks related to text features #6330 [nulano]

  • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

  • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

  • Fixed set_variation_by_name offset #6445 [radarhere]

  • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

  • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

  • Added ExifTags enums #6630 [radarhere]

  • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

  • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

  • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

  • Added GPS TIFF tag info #6661 [radarhere]

  • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

  • Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).

Bump pillow from 6.2.0 to 9.3.0 in /tag-and-generate-train

opened on 2022-11-22 04:41:40 by dependabot[bot]

Bumps pillow from 6.2.0 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

  • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

  • Initialize libtiff buffer when saving #6699 [radarhere]

  • Inline fname2char to fix memory leak #6329 [nulano]

  • Fix memory leaks related to text features #6330 [nulano]

  • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

  • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

  • Fixed set_variation_by_name offset #6445 [radarhere]

  • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

  • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

  • Added ExifTags enums #6630 [radarhere]

  • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

  • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

  • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

  • Added GPS TIFF tag info #6661 [radarhere]

  • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

  • Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).

Bump joblib from 0.14.0 to 1.2.0 in /tag-and-generate-train

opened on 2022-09-30 20:05:53 by dependabot[bot]

Bumps joblib from 0.14.0 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

  • Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

  • Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

  • Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

  • Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

  • Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

  • Vendor loky 3.3.0 which fixes several bugs including:

    • robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

    • avoiding leaking worker processes in case of nested loky parallel calls;

    • reliability spawn the correct number of reusable workers.

Release 1.1.0

  • Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

  • Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

... (truncated)

Commits
  • 5991350 Release 1.2.0
  • 3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)
  • cea26ff CI test the future loky-3.3.0 branch (#1338)
  • 8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)
  • 067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)
  • ac4ebd5 MAINT add back pytest warnings plugin (#1337)
  • a23427d Test child raises parent exits cleanly more reliable on macos (#1335)
  • ac09691 [MAINT] various test updates (#1334)
  • 4a314b1 Vendor loky 3.2.0 (#1333)
  • bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).

Bump joblib from 0.14.0 to 1.2.0 in /tag-and-generate-data-prep

opened on 2022-09-30 19:01:22 by dependabot[bot]

Bumps joblib from 0.14.0 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

  • Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

  • Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

  • Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

  • Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

  • Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

  • Vendor loky 3.3.0 which fixes several bugs including:

    • robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

    • avoiding leaking worker processes in case of nested loky parallel calls;

    • reliability spawn the correct number of reusable workers.

Release 1.1.0

  • Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

  • Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

... (truncated)

Commits
  • 5991350 Release 1.2.0
  • 3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)
  • cea26ff CI test the future loky-3.3.0 branch (#1338)
  • 8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)
  • 067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)
  • ac4ebd5 MAINT add back pytest warnings plugin (#1337)
  • a23427d Test child raises parent exits cleanly more reliable on macos (#1335)
  • ac09691 [MAINT] various test updates (#1334)
  • 4a314b1 Vendor loky 3.2.0 (#1333)
  • bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/tag-and-generate/tagger-generator/network/alerts).
Politeness Transfer: A Tag and Generate Approach
GitHub Repository