PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.
Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs
Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:

```bash
pip install pytorch-nlp
```
Or to install the latest code via:

```bash
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
```
The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.
Within an NLP data pipeline, you'll want to implement these basic steps:
Load the IMDB dataset, for example:
```python
from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```
Load a custom dataset, for example:
```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

# Download and extract the archive unless it is already cached on disk
download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```
Don't worry, we'll handle caching for you!
Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks text into tokens whenever it encounters a whitespace character.
```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
With your loaded and encoded data in hand, you'll want to batch your dataset.
```python
import torch

from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```
PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack` and `default_collate` to support sequential inputs of varying lengths!
With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.
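To make that step concrete, here is a minimal sketch of a gradient-descent training loop. The tiny classifier, the dummy batches, and all sizes below are illustrative assumptions rather than part of PyTorch-NLP; the linked SNLI example shows a full pipeline.

```python
import torch
from torch import nn

# A tiny illustrative classifier; vocabulary size, embedding size and number of
# classes are assumptions for this sketch only.
model = nn.Sequential(nn.EmbeddingBag(num_embeddings=1000, embedding_dim=50), nn.Linear(50, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Dummy padded batches of token indices and labels, standing in for the batches built above.
dummy_batches = [(torch.randint(0, 1000, (2, 7)), torch.randint(0, 2, (2,))) for _ in range(10)]

for tokens, labels in dummy_batches:
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), labels)  # Forward pass and loss
    loss.backward()                        # Backpropagate gradients
    optimizer.step()                       # Gradient descent update
```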
PyTorch-NLP has a couple more NLP-focused utility packages to support you!
Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with `fork_rng` and you'll be good to go, like so:
```python
import random

import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```
This will always print:

```text
Random: 224899943
Numpy: 843828735
Torch: 843828736
```
Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:
```python
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
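From here, a natural follow-up is to load those weights into an embedding layer. Below is a small sketch using PyTorch's `torch.nn.Embedding.from_pretrained`; the `freeze=False` choice (keep fine-tuning the vectors) is just one option, not something the library prescribes.

```python
from torch import nn

# Initialize an embedding layer from the GloVe-initialized weights built above;
# freeze=False keeps the vectors trainable during fine-tuning.
embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
```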
For example, from the neural network package, apply the state-of-the-art `LockedDropout`:
```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```
Compute common NLP metrics such as the BLEU score.
```python
from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the Moses multi-bleu perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9
```
For longer examples, take a look at examples/.
Need more help? We are happy to answer your questions via Gitter Chat.
We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.
Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.
torchtext and PyTorch-NLP differ in their architecture and feature set; otherwise, they are similar. Both torchtext and PyTorch-NLP provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object-oriented with external coupling while PyTorch-NLP is object-oriented with low coupling.
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.
If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:
@misc{pytorch-nlp,
author = {Petrochuk, Michael},
title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}
- Added `torchnlp.random` for finer grain control of random state, building on PyTorch's `fork_rng`. This module controls the random state of `torch`, `numpy` and `random`.

```python
import random

import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```
- Refactored `torchnlp.samplers` enabling pipelining. For example:
```python
from torchnlp.samplers import DeterministicSampler
from torchnlp.samplers import BalancedSampler

data = ['a', 'b', 'c'] + ['c'] * 100
sampler = BalancedSampler(data, num_samples=3)
sampler = DeterministicSampler(sampler, random_seed=12)
print([data[i] for i in sampler])  # ['c', 'b', 'a']
```
- Added `torchnlp.samplers.balanced_sampler` for balanced sampling extending PyTorch's `WeightedRandomSampler`.
- Added `torchnlp.samplers.deterministic_sampler` for deterministic sampling based on `torchnlp.random`.
- Added `torchnlp.samplers.distributed_batch_sampler` for distributed batch sampling.
- Added `torchnlp.samplers.oom_batch_sampler` to sample large batches first in order to force an out-of-memory error.
- Added `torchnlp.utils.lengths_to_mask` to help create masks from a batch of sequences (a short sketch follows this list).
- Added `torchnlp.utils.get_total_parameters` to measure the number of parameters in a model.
- Added `torchnlp.utils.get_tensors` to measure the size of an object in number of tensor elements. This is useful for dynamic batch sizing and for `torchnlp.samplers.oom_batch_sampler`.
```python
import torch

from torchnlp.utils import get_tensors

random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
tensors = get_tensors(random_object_)
assert len(tensors) == 2
```
- Added a corporate sponsor to the library: https://wellsaidlabs.com/
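As a rough sketch of the `lengths_to_mask` helper mentioned above, assuming the documented list-of-lengths call form (check the API docs for the exact signature):

```python
from torchnlp.utils import lengths_to_mask

# Build a boolean mask for a padded batch whose sequences have lengths 1, 2 and 3;
# row i is True for the first `lengths[i]` positions and False for the padding.
mask = lengths_to_mask([1, 2, 3])  # torch.BoolTensor of shape (3, 3)
```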
- `snli` example (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Updated `.gitignore` to support Python's virtual environments (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Removed the `requests` and `pandas` dependency. There are only two dependencies remaining. This is useful for production environments. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Added `LazyLoader` to reduce dependency requirements. (https://github.com/PetrochukM/PyTorch-NLP/commit/4e84780a8a741d6a90f2752edc4502ab2cf89ecb)
- Removed the `torchnlp.datasets.Dataset` class in favor of basic Python dictionary lists and `pandas`. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Support for `tar.gz` files and unpacking them faster. (https://github.com/PetrochukM/PyTorch-NLP/commit/eb61fee854576c8a57fd9a20ee03b6fcb89c493a)
- Renamed `itos` and `stoi` to `index_to_token` and `token_to_index` respectively. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- `batch_encode`, `batch_decode`, and `enforce_reversible` for `torchnlp.encoders.text` (https://github.com/PetrochukM/PyTorch-NLP/pull/69)
- `FastText` vector downloads (https://github.com/PetrochukM/PyTorch-NLP/pull/72)
- `LockedDropout` (https://github.com/PetrochukM/PyTorch-NLP/pull/73)
- `weight_drop` (https://github.com/PetrochukM/PyTorch-NLP/pull/76)
- `stack_and_pad_tensors` now returns a named tuple for readability (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Added `torchnlp.utils.split_list` in favor of `torchnlp.utils.resplit_datasets`. This is enabled by the modularity of `torchnlp.random`. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Deprecated `torchnlp.utils.datasets_iterator` in favor of Python's `itertools.chain`. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- Deprecated `torchnlp.utils.shuffle` in favor of `torchnlp.random`. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
- `torchnlp.samplers.repeat_sampler`, following up on this issue: https://github.com/pytorch/pytorch/issues/15849
- `LabelEncoder`. Furthermore, added broad support for `batch_encode`, `batch_decode` and `enforce_reversible` (a short `LabelEncoder` sketch follows this changelog block).
- `torch.utils.data.dataloader.DataLoader`. For example:
```python3
from functools import partial

from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

collate_fn = partial(collate_tensors, stack_tensors=stack_and_pad_tensors)
torch.utils.data.dataloader.DataLoader(*args, collate_fn=collate_fn, **kwargs)
```
- Added doctest support ensuring the documented examples are tested.
- Removed SRU support, it's too heavy of a module to support. Please use https://github.com/taolei87/sru instead. Happy to accept a PR with a better tested and documented SRU module!
- Updated version requirements to support Python 3.6 and 3.7, dropping support for Python 3.5.
- Updated version requirements to support PyTorch 1.0+.
- Merged https://github.com/PetrochukM/PyTorch-NLP/pull/66 reducing the memory requirements for pre-trained word vectors by 2x.
- `pandas` and `collections` warnings.
- `Encoder` via `enforce_reversible`. For example:

```python3
encoder = Encoder().enforce_reversible()
```

Ensuring `Encoder.decode(Encoder.encode(object)) == object`.
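As promised above, here is a rough sketch of how `LabelEncoder` can be combined with the `batch_encode` and `batch_decode` support; the label names are hypothetical and the constructor defaults should be checked against the documentation:

```python
from torchnlp.encoders import LabelEncoder

# Build a label encoder from a sample of (hypothetical) labels.
encoder = LabelEncoder(['positive', 'negative', 'positive'])

encoded = encoder.batch_encode(['positive', 'negative'])  # 1-D tensor of label indices
decoded = encoder.batch_decode(encoded)                   # ['positive', 'negative']
```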
Minor release fixing some issues and bugs.
- `torchnlp.downloads`
- `torchnlp.word_to_vector`.
- `set` operation to `torchnlp.datasets.Dataset` with support for slices, columns and rows.
- `biggest_batches_first` in `torchnlp.samplers` to be more efficient at approximating memory than Pickle.
- `torch.utils.pad_tensor` and `torch.utils.pad_batch` to support N-dimensional tensors.
- `torch.text_encoders`
- `__getitem()__` for `_PretrainedWordVectors`. For example:

```python
from torchnlp.word_to_vector import FastText

vectors = FastText()
tokenized_sentence = ['this', 'is', 'a', 'sentence']
vectors[tokenized_sentence]
```

- `__contains__` for `_PretrainedWordVectors`. For example:

```python
>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()
>>> 'the' in vectors
True
>>> 'theqwe' in vectors
False
```