Basic Utilities for PyTorch Natural Language Processing (NLP)

PetrochukM, updated πŸ•₯ 2022-07-16 23:44:23


PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.


Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs

Installation 🐾

Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:

```python
pip install pytorch-nlp
```

Or install the latest code via:

```python
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
```

Docs

The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.

Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

1. Load your Data 🐿

Load the IMDB dataset, for example:

```python
from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```

Load a custom dataset, for example:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```

Don't worry, we'll handle caching for you!

2. Text to Tensor

Tokenize and encode your text as a tensor.

For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
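If you also need to map tensors back to text, text encoders expose a `decode` method that reverses `encode`. A minimal sketch continuing the example above (the exact output string is assumed):

```python
# Decode the first encoded example back into a string.
# WhitespaceEncoder joins the recovered tokens with spaces.
print(encoder.decode(encoded_data[0]))  # expected: "now this ain't funny"
```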

3. Tensor to Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!
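Per the release notes further below, `stack_and_pad_tensors` returns a named tuple holding the padded tensor and the original sequence lengths, so each collated batch can be unpacked like this minimal sketch (the field order is an assumption):

```python
# Unpack the first collated batch from the example above.
# Sequences in a batch are padded to the length of the longest one;
# the original lengths are returned alongside the padded tensor.
padded, lengths = batches[0]
print(padded.shape)  # e.g. torch.Size([2, 3]): two sequences padded to length 3
print(lengths)       # original lengths before padding, e.g. tensor([2, 3])
```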

4. Training and Inference

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.
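As a rough, hypothetical illustration only (not the repository's SNLI example code), here is a minimal PyTorch training loop over padded batches like the ones built above; the model, targets, and loss are placeholder assumptions:

```python
import torch

# Hypothetical toy model: map each scalar time step to a single score.
model = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(3):
    for padded, lengths in batches:  # `batches` from the collation example above
        inputs = padded.unsqueeze(-1)       # (batch, seq_len, 1)
        targets = torch.zeros_like(inputs)  # placeholder targets
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
```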

Last But Not Least

PyTorch-NLP has a couple more NLP-focused utility packages to support you! πŸ€—

Deterministic Functions

Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go, like so:

```python
import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```

This will always print:

```text
Random: 224899943
Numpy: 843828735
Torch: 843828736
```

Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:

```python
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
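From here, a common follow-up (not part of the original snippet) is to load the weight matrix into a standard `torch.nn.Embedding` layer:

```python
# Initialize an embedding layer from the pre-trained weight matrix built above.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights, freeze=False)
token_ids = encoder.encode("now this ain't funny")
vectors = embedding(token_ids)  # one 100-dimensional GloVe vector per token
```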

Neural Network Layers

For example, from the neural network package, apply the state-of-the-art LockedDropout:

```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to input_
dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```

Metrics

Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9
```

Help ❓

Maybe looking at the longer examples in examples/ will help.

Need more help? We are happy to answer your questions via Gitter Chat

Contributing

We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.

Contributing Guide

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.

Related Work

torchtext

torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both torchtext and PyTorch-NLP provide pre-trained word vectors, datasets, iterators and text encoders; PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object oriented with external coupling while PyTorch-NLP is object oriented with low coupling.

AllenNLP

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

Authors

Citing

If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:

```
@misc{pytorch-nlp,
  author = {Petrochuk, Michael},
  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}
```

Issues

SpacyWordSplitter: module not found - AllenNLP v1.5.0

opened on 2023-02-03 19:26:25 by vrunm

I have been using AllenNLP v1.5.0 for migrating old code. What is the version of spacy that works well with AllenNLP v1.5.0? I have tried using spacy 2.1.0 but still ran into some import issues.

Also, I got an import error: `from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter` raises `ModuleNotFoundError: No module named 'allennlp.data.tokenizers.word_splitter'`.

Also allennlp.data.tokenizers.word_splitter no longer exists in the latest version of AllenNLP. It has been replaced with SpacyTokenizer. So what would be the correct version of Spacy to use for a tokenizer in AllenNLP v1.5.0?

torchnlp ERROR: No matching distribution found for torch==1.0.0

opened on 2023-01-19 18:56:08 by skr3178

torchnlp ERROR: No matching distribution found for torch==1.0.0

docs: fix simple typo, experessed -> expressed

opened on 2022-07-16 23:44:23 by timgates42

There is a small typo in torchnlp/encoders/text/subword_text_tokenizer.py.

Should read expressed rather than experessed.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

fix issue 105, 120

opened on 2022-01-22 00:08:57 by pazamass

Issues #105 and #120: pass the language as an argument rather than as kwargs, setting the default language value to 'en_core_web_sm'.

PackagesNotFoundError in anaconda

opened on 2021-12-24 08:34:55 by SihanLiuEcho

I wanted to install this package in anaconda with "conda install torchnlp", but it came out with a "PackagesNotFoundError". How can I install it in anaconda?

Error in SpacyEncoder when language argument is passed

opened on 2021-11-15 20:50:39 by enaserianhanzaei

Expected Behavior

from torchnlp.encoders.text import SpacyEncoder
encoder = SpacyEncoder(["This ain't funny.", "Don't?"], language='en')

Actual Behavior

TypeError: __init__() got an unexpected keyword argument 'language'

Releases

Python 3.5 Support, Sampler Pipelining, Finer Control of Random State, New Corporate Sponsor 2019-11-04 04:45:44

Major Updates

  • Updated my README emoji game to be more ambiguous while maintaining fun and heartwarming vibe. πŸ•
  • Support for Python 3.5
  • Extensive rewrite of README to focus on new users and building an NLP pipeline.
  • Support for Pytorch 1.2
  • Added `torchnlp.random` for finer grain control of random state building on PyTorch's `fork_rng`. This module controls the random state of `torch`, `numpy` and `random`.

    ```python
    import random
    import numpy
    import torch

    from torchnlp.random import fork_rng

    with fork_rng(seed=123):  # Ensure determinism
        print('Random:', random.randint(1, 2**31))
        print('Numpy:', numpy.random.randint(1, 2**31))
        print('Torch:', int(torch.randint(1, 2**31, (1,))))
    ```

  • Refactored `torchnlp.samplers` enabling pipelining. For example:

    ```python
    from torchnlp.samplers import DeterministicSampler
    from torchnlp.samplers import BalancedSampler

    data = ['a', 'b', 'c'] + ['c'] * 100
    sampler = BalancedSampler(data, num_samples=3)
    sampler = DeterministicSampler(sampler, random_seed=12)
    print([data[i] for i in sampler])  # ['c', 'b', 'a']
    ```

  • Added `torchnlp.samplers.balanced_sampler` for balanced sampling extending Pytorch's `WeightedRandomSampler`.
  • Added `torchnlp.samplers.deterministic_sampler` for deterministic sampling based on `torchnlp.random`.
  • Added `torchnlp.samplers.distributed_batch_sampler` for distributed batch sampling.
  • Added `torchnlp.samplers.oom_batch_sampler` to sample large batches first in order to force an out-of-memory error.
  • Added `torchnlp.utils.lengths_to_mask` to help create masks from a batch of sequences.
  • Added `torchnlp.utils.get_total_parameters` to measure the number of parameters in a model.
  • Added `torchnlp.utils.get_tensors` to measure the size of an object in number of tensor elements. This is useful for dynamic batch sizing and for `torchnlp.samplers.oom_batch_sampler`.

    ```python
    import torch

    from torchnlp.utils import get_tensors

    random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
    tensors = get_tensors(random_object_)
    assert len(tensors) == 2
    ```

  • Added a corporate sponsor to the library: https://wellsaidlabs.com/

Minor Updates

  • Fixed snli example (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Updated .gitignore to support Python's virtual environments (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Removed requests and pandas dependency. There are only two dependencies remaining. This is useful for production environments. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Added LazyLoader to reduce dependency requirements. (https://github.com/PetrochukM/PyTorch-NLP/commit/4e84780a8a741d6a90f2752edc4502ab2cf89ecb)
  • Removed unused torchnlp.datasets.Dataset class in favor of basic Python dictionary lists and pandas. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Support for downloading tar.gz files and unpacking them faster. (https://github.com/PetrochukM/PyTorch-NLP/commit/eb61fee854576c8a57fd9a20ee03b6fcb89c493a)
  • Rename itos and stoi to index_to_token and token_to_index respectively. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Fixed batch_encode, batch_decode, and enforce_reversible for torchnlp.encoders.text (https://github.com/PetrochukM/PyTorch-NLP/pull/69)
  • Fix FastText vector downloads (https://github.com/PetrochukM/PyTorch-NLP/pull/72)
  • Fixed documentation for LockedDropout (https://github.com/PetrochukM/PyTorch-NLP/pull/73)
  • Fixed bug in weight_drop (https://github.com/PetrochukM/PyTorch-NLP/pull/76)
  • stack_and_pad_tensors now returns a named tuple for readability (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Added torchnlp.utils.split_list in favor of torchnlp.utils.resplit_datasets. This is enabled by the modularity of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Deprecated torchnlp.utils.datasets_iterator in favor of Python's itertools.chain. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Deprecated torchnlp.utils.shuffle in favor of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
  • Support for encoding larger datasets following the fix for this issue (https://github.com/PetrochukM/PyTorch-NLP/issues/85).
  • Added torchnlp.samplers.repeat_sampler following up on this issue: https://github.com/pytorch/pytorch/issues/15849

Release 0.4.0 - Encoder rewrite, variable sequence collate support, reduced memory usage, doctests, removed SRU 2019-04-03 02:06:45

Major updates

  • Rewrote encoders to better support more generic encoders like a LabelEncoder. Furthermore, added broad support for batch_encode, batch_decode and enforce_reversible.
  • Rearchitected default reserved tokens to ensure configurability while still providing the convenience of good defaults.
  • Added support to collate sequences with torch.utils.data.dataloader.DataLoader. For example:

    ```python3
    from functools import partial
    from torchnlp.utils import collate_tensors
    from torchnlp.encoders.text import stack_and_pad_tensors

    collate_fn = partial(collate_tensors, stack_tensors=stack_and_pad_tensors)
    torch.utils.data.dataloader.DataLoader(*args, collate_fn=collate_fn, **kwargs)
    ```

  • Added doctest support ensuring the documented examples are tested.
  • Removed SRU support, it's too heavy of a module to support. Please use https://github.com/taolei87/sru instead. Happy to accept a PR with a better tested and documented SRU module!
  • Updated version requirements to support Python 3.6 and 3.7, dropping support for Python 3.5.
  • Updated version requirements to support PyTorch 1.0+.
  • Merged https://github.com/PetrochukM/PyTorch-NLP/pull/66 reducing the memory requirements for pre-trained word vectors by 2x.

Minor Updates

  • Formatted the code base with YAPF.
  • Fixed pandas and collections warnings.
  • Added invariant assertion to Encoder via enforce_reversible. For example: `encoder = Encoder().enforce_reversible()`, ensuring `Encoder.decode(Encoder.encode(object)) == object`
  • Fixed the accuracy metric for PyTorch 1.0.

0.3.7.post1 2018-11-29 22:38:53

Minor release fixing some issues and bugs.

0.3.0 2018-05-06 17:51:32

Release 0.3.0

Major Features And Improvements

  • Upgraded to PyTorch 0.4.0
  • Added Byte-Pair Encoding (BPE) pre-trained subword embeddings in 275 languages (see the sketch after this list)
  • Refactored download scripts to torchnlp.downloads
  • Enabled the Spacy encoder to run in multiple languages.
  • Added a boolean aligned option to FastText supporting MUSE (Multilingual Unsupervised and Supervised Embeddings)
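As a rough sketch of how those BPE subword embeddings might be used; the `BPEmb` class name, its constructor arguments, and the lookup token here are assumptions, not confirmed by these notes:

```python
from torchnlp.word_to_vector import BPEmb  # assumed class name

# Load hypothetical 25-dimensional English byte-pair-encoding embeddings.
vectors = BPEmb(language='en', dim=25)
print(vectors['now'].shape)  # a 25-dimensional subword vector
```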

Bug Fixes and Other Changes

  • Create non-existent cache dirs for torchnlp.word_to_vector.
  • Add set operation to torchnlp.datasets.Dataset with support for slices, columns and rows
  • Updated biggest_batches_first in torchnlp.samplers to be more efficient at approximating memory than Pickle
  • Enabled torch.utils.pad_tensor and torch.utils.pad_batch to support N dimensional tensors
  • Updated to sacremoses to fix NLTK moses dependency for torch.text_encoders
  • Added __getitem__() for _PretrainedWordVectors. For example:

        from torchnlp.word_to_vector import FastText
        vectors = FastText()
        tokenized_sentence = ['this', 'is', 'a', 'sentence']
        vectors[tokenized_sentence]
  • Added __contains__ for _PretrainedWordVectors. For example:

    ```
    >>> from torchnlp.word_to_vector import FastText
    >>> vectors = FastText()

    >>> 'the' in vectors
    True

    >>> 'theqwe' in vectors
    False
    ```

Initial Release 2018-04-08 21:34:04

Michael Petrochuk

World Record Holder • Deep Learning (DL) Engineer & Researcher • CTO @ https://wellsaidlabs.com

GitHub Repository Homepage

pytorch nlp natural-language-processing pytorch-nlp torchnlp data-loader embeddings word-vectors python deep-learning dataset metrics neural-network sru machine-learning