We will now present an example of training the politeness transfer system from scratch. The process has five steps:

* Step 1: Getting the code
* Step 2: Getting the training data
* Step 3: Preparing parallel data for training
* Step 4: Training the tagger and generator
* Step 5: Running inference
We begin by cloning this repo:
sh
git clone https://github.com/tag-and-generate/tagger-generator.git
The cloned folder contains: i) tag-and-generate-data-prep, the codebase used for creating the parallel tag-and-generate dataset, and ii) tag-and-generate-train, the training code.
Each of these folders has a requirements.txt file that can be used to install the dependencies.
Next, let's create a folder inside tagger-generator to save all the datasets/tags:
sh
cd tagger-generator
mkdir data
The training data in a ready-to-use format is located here.
Download the zip file to the data folder created above and extract politeness.tsv.
sh
unzip politeness_processed.zip
head politeness.tsv
txt|style|split
-----|-----|-----
forwarded by tana jones / hou / ect on 09/28/2000|P_2|train
the clickpaper approvals for 9/27/00 are attached below .|P_7|train
"hello everyone : please let me know if you have a subscription to "" telerate "" ?"|P_7|train
we are being billed for this service and i do not know who is using it .|P_0|train
As we can see, the data is in the tsv format and has the right header.
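The header check can also be done programmatically. The following is a minimal sketch, assuming the file is tab-separated with exactly the columns txt, style and split shown above; the function name summarize_tsv and the in-memory sample are illustrative and not part of the repo.

```python
import csv
import io
from collections import Counter

# Column names taken from the header shown in the `head` output above.
EXPECTED_COLUMNS = ["txt", "style", "split"]


def summarize_tsv(handle):
    """Check the header, then count rows per (split, style) pair."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return Counter((row["split"], row["style"]) for row in reader)


# Tiny in-memory stand-in for data/politeness.tsv, built from rows shown above.
sample = io.StringIO(
    "txt\tstyle\tsplit\n"
    "forwarded by tana jones / hou / ect on 09/28/2000\tP_2\ttrain\n"
    "the clickpaper approvals for 9/27/00 are attached below .\tP_7\ttrain\n"
    "we are being billed for this service and i do not know who is using it .\tP_0\ttrain\n"
)

counts = summarize_tsv(sample)
for (split, style), n in sorted(counts.items()):
    print(split, style, n)
```

To run the same check on the real file, pass a handle opened with `open("data/politeness.tsv", newline="", encoding="utf-8")` instead of the in-memory sample.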
You can also use gdown to directly download the file:
sh
gdown --id 1E9GHwmVM9DL9-KiaIaG5lm_oagLWe908
Now that we have the codebase and the dataset, let's start by creating the parallel data required for training the models. Let's do a listing of the folder so far to make sure we are on the same page:
sh
(dl) [email protected]:~/tagger-generator$ ls
data LICENSE README.md tag-and-generate-data-prep tag-and-generate-train
So, we are in the repo (tagger-generator), and see the two code folders (tag-and-generate-data-prep and tag-and-generate-train), as well as the data folder (data).
Further, the data folder has the politeness.tsv file that we just downloaded:
sh
(dl) [email protected]:~/tagger-generator$ ls data/
politeness_processed.zip politeness.tsv
We prepare the parallel data using tag-and-generate-data-prep:
sh
cd tag-and-generate-data-prep
python src/run.py --data_pth ../data/politeness.tsv --outpath ../data/ --style_0_label P_9 --style_1_label P_0 --is_unimodal True
More details on these options are located in tag-and-generate/tagger-generator/tag-and-generate-data-prep. In summary, we specify the input file, the label for the style of interest (P_9) and a neutral/contrastive style (P_0). Importantly, we specify --is_unimodal True. This option ensures that the parallel data is created as per the unimodal style setting (Figure 3 in the paper).
After data-prep finishes, we see several files in ../data/.
The important files are described below:
"thank you"
"thank"
"looking forward"
"glad"
"be interested"
The data prep code creates two sets of training files: one for the tagger and another for the generator.
To understand these, let's take a sample sentence, "please get back to me if you have any additional concerns .", and look at how it is represented in different files:
* entagged_parallel.train.en (input to the tagger): back to me have concerns .
* entagged_parallel.train.tagged (output of the tagger): [P_90] back to me [P_91] have [P_92] concerns .
* engenerated_parallel.train.en (input to the generator): [P_90] back to me [P_91] have [P_92] concerns .
* engenerated_parallel.train.generated (output of the generator): please get back to me if you have any additional concerns .
Here, P_9 is the style tag, and the number after the style tag captures the position of the tag in the sentence (0, 1 and 2 in the example above).
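To make the relationship between the tagger input and the tagged output concrete, here is a small sketch that reproduces the example sentence. It assumes the style-bearing phrases have already been located as token spans; the function mask_style_spans and the hand-written spans are hypothetical and are not how the data-prep code finds the phrases.

```python
def mask_style_spans(tokens, spans, style="P_9"):
    """Build the tagger input and tagged output from marked style spans.

    spans: non-empty, non-overlapping (start, end) token index pairs,
    with end exclusive, marking the phrases that carry the style.
    """
    span_end = {start: end for start, end in spans}
    stripped, tagged = [], []
    i, tag_index = 0, 0
    while i < len(tokens):
        if i in span_end:
            # Replace the whole style phrase with one positional tag.
            tagged.append(f"[{style}{tag_index}]")
            tag_index += 1
            i = max(span_end[i], i + 1)  # always advance, even on a bad span
        else:
            stripped.append(tokens[i])
            tagged.append(tokens[i])
            i += 1
    return " ".join(stripped), " ".join(tagged)


sentence = "please get back to me if you have any additional concerns ."
# Hand-picked spans for "please get", "if you" and "any additional".
spans = [(0, 2), (5, 7), (8, 10)]
tagger_input, tagger_output = mask_style_spans(sentence.split(), spans)
print(tagger_input)
print(tagger_output)
```

The generator is then trained on the reverse direction: the tagged string is its input and the original sentence is its target.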
With the data files prepared, we can start training.
All the training and inference related scripts are in tag-and-generate-train, so let's cd to it.
sh
cd ../tag-and-generate-train
In order to prepare the files for training, we first process them using BPE.
sh
bash scripts/prepare_bpe.sh tagged ../data/
bash scripts/prepare_bpe.sh generated ../data/
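For readers unfamiliar with BPE (byte-pair encoding), the toy sketch below shows the core idea of learning merge rules: repeatedly merge the most frequent adjacent symbol pair. It is only an illustration; the walkthrough does not say which BPE implementation scripts/prepare_bpe.sh relies on, and the function learn_bpe and the word list are made up for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower"], num_merges=2)
print(merges)
```

Frequent character sequences end up as single symbols, which keeps the vocabulary small while still covering rare words as sequences of subword units.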
We can now start training the tagger and generator:
sh
nohup bash scripts/train_tagger.sh tagged politeness ../data/ > tagger.log &
nohup bash scripts/train_generator.sh generated politeness ../data/ > generator.log &
politeness is a user-defined handle that we will use during inference.
After the training finishes, the best models (given by validation perplexity) are stored in models:
sh
(dl) [email protected]:~/tagger-generator/tag-and-generate-train$ ls models/politeness/bpe/
en-generated-generator.pt en-tagged-tagger.pt
For our run, at the end of 5 epochs, the validation perplexity was 1.26 for the tagger, and 1.76 for the generator.
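As a reminder of what these numbers mean, perplexity is the exponential of the average per-token negative log-likelihood, so values close to 1 correspond to a very low loss. The sketch below converts between the two, assuming the loss is measured in nats (natural log); whether the training scripts log it that way is an assumption, and the helper names are illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def nll_per_token(ppl):
    """Invert the definition: average loss in nats for a given perplexity."""
    return math.log(ppl)


# Validation perplexities reported above for the two models.
for name, ppl in [("tagger", 1.26), ("generator", 1.76)]:
    print(f"{name}: perplexity {ppl} ~ {nll_per_token(ppl):.3f} nats/token")
```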
Let's test out the trained models on some sample sentences:
```sh
(dl) [email protected]:~/tagger-generator/tag-and-generate-train$ cat > input.txt
send me the text files.
look into this issue.

bash scripts/inference.sh input.txt sample tagged generated politeness P_9 P_9 ../data/ 3
```
Here, sample is a unique identifier for the inference job, and politeness is the identifier we used for the training job. P_9 is the style tag (kept the same for unimodal jobs). Please see the README at tag-and-generate/tagger-generator/tag-and-generate-train for more details.
The final and intermediate outputs are located in the experiments folder:
sh
(dl) [email protected]:~/tagger-generator/tag-and-generate-train$ ls experiments/sample_*
experiments/sample_generator_input experiments/sample_tagged
experiments/sample_output experiments/sample_tagger_input
Let's look at the final output:
sh
(dl) [email protected]:~/tagger-generator/tag-and-generate-train$ cat experiments/sample_output
please send me the text files.
we would like to look into this issue.
Not bad!
We hope this walkthrough is helpful in understanding and using the codebase.