Utility for analyzing Transformer-based representations of language.


minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models


This repo is a wrapper around the transformers library from Hugging Face :hugs:

Installation

Install from PyPI using:

```
pip install minicons
```

Supported Functionality

  • Extract word representations from contextualized word embeddings
  • Score sequences using language model scoring techniques, including masked language models following Salazar et al. (2020) — see the sketch after this list.
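
The masked LM scoring of Salazar et al. (2020) is a pseudo-log-likelihood: each token is masked in turn, and the log-probability of the original token at that position is accumulated. Below is a minimal sketch of that computation written directly against the transformers API, for intuition only; minicons wraps an equivalent (batched) computation, and the helper name `pll` here is ours, not part of the library:

```py
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

def pll(sentence):
    # Pseudo-log-likelihood (Salazar et al., 2020): mask one token at a
    # time and sum the log-probability of the original token there.
    ids = tokenizer(sentence, return_tensors='pt')['input_ids'][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pll("The keys to the cabinet are on the table."))
```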

Examples

  1. Extract word representations from contextualized word embeddings:

```py
from minicons import cwe

model = cwe.CWE('bert-base-uncased')

context_words = [("I went to the bank to withdraw money.", "bank"),
                 ("i was at the bank of the river ganga!", "bank")]

print(model.extract_representation(context_words, layer = 12))

'''
tensor([[ 0.5399, -0.2461, -0.0968,  ..., -0.4670, -0.5312, -0.0549],
        [-0.8258, -0.4308,  0.2744,  ..., -0.5987, -0.6984,  0.2087]],
       grad_fn=<...>)
'''

# if model is seq2seq:
model = cwe.EncDecCWE('t5-small')

print(model.extract_representation(context_words))

'''
(last layer, by default)
tensor([[-0.0895,  0.0758,  0.0753,  ...,  0.0130, -0.1093, -0.2354],
        [-0.0695,  0.1142,  0.0803,  ...,  0.0807, -0.1139, -0.2888]])
'''
```
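
As a quick follow-up, the two context-sensitive "bank" vectors can be compared directly. A hedged sketch, assuming extract_representation returns one row per (context, word) pair in input order:

```py
import torch

# Cosine similarity between the two 'bank' representations extracted above;
# a lower value reflects the two distinct senses of "bank".
reps = model.extract_representation(context_words, layer = 12)
print(torch.nn.functional.cosine_similarity(reps[0], reps[1], dim = 0).item())
```

Note that `model` is reassigned to the T5 encoder at the end of the block above, so run this with whichever model you mean to compare.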

  2. Compute sentence acceptability measures (surprisals) using Word Prediction Models:

```py
from minicons import scorer

mlm_model = scorer.MaskedLMScorer('bert-base-uncased', 'cpu')
ilm_model = scorer.IncrementalLMScorer('distilgpt2', 'cpu')
s2s_model = scorer.Seq2SeqScorer('t5-base', 'cpu')

stimuli = ["The keys to the cabinet are on the table.",
           "The keys to the cabinet is on the table."]

# use sequence_score with different reduction options:
# Sequence Surprisal - lambda x: -x.sum(0).item()
# Sequence Log-probability - lambda x: x.sum(0).item()
# Sequence Surprisal, normalized by number of tokens - lambda x: -x.mean(0).item()
# Sequence Log-probability, normalized by number of tokens - lambda x: x.mean(0).item()
# and so on...

print(ilm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))

'''
[39.879737854003906, 42.75846481323242]
'''

# MLM scoring, inspired by Salazar et al., 2020
print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))

'''
[13.962685585021973, 23.415111541748047]
'''

# Seq2seq scoring

# Blank source sequence, target sequence specified in stimuli
print(s2s_model.sequence_score(stimuli, source_format = 'blank'))

# Source sequence is the same as the target sequence in stimuli
print(s2s_model.sequence_score(stimuli, source_format = 'copy'))

'''
[-7.910910129547119, -7.835635185241699]
[-10.555519104003906, -9.532546997070312]
'''
```
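
The scorers also expose per-token scores through model.token_score() (see Recent Updates below). A small usage sketch, mirroring the MaskedLMScorer call shown in the issue further down and assuming the same keyword arguments apply to incremental models:

```py
# Per-token surprisals, in bits, for the same stimuli.
print(ilm_model.token_score(stimuli, surprisal = True, base_two = True))
```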

Tutorials

Recent Updates

  • November 6, 2021: MLM scoring has been fixed! You can now use model.token_score() and model.sequence_score() with MaskedLMScorers as well!
  • June 4, 2022: Added support for Seq2seq models. Thanks to Aaron Mueller 🥳

Citation

If you use minicons, please cite the following paper:

```tex
@article{misra2022minicons,
    title={minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models},
    author={Kanishka Misra},
    journal={arXiv preprint arXiv:2203.13112},
    year={2022}
}
```

Issues

incorrect tokenizers

opened on 2022-11-23 16:21:16 by fivehills

Hi,

minicons does not seem to produce word-level tokens when BERT-based pre-trained models are used. It would be desirable to generate surprisal values for word forms as they occur in real text rather than for their split (subword) forms. For example, "symbolised" is split into ('symbol', 9.485310554504395) and ('##ised', 6.920506000518799), whereas I want the surprisal value for the whole word ("symbolised"). The package also seems to produce misleading surprisal values for some real words, particularly long words with a suffix or prefix, because such words are split into several subword units. I am not sure how to solve this problem.

Many thanks!

```python

In [13]: model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')

In [14]: ge_sen = ["Janus symbolisierte häufig Veränderungen und Übergänge, wie den Wechsel von einer Bedingung zur anderen, von einer Perspektive zur anderen und das Heranwachsen junger Menschen zum Erwachsenenalter."]

In [15]: model.token_score(ge_sen, surprisal = True, base_two = True)
Out[15]: [[('Jan', 7.411351680755615), ('##us', 6.953413963317871), ('symbol', 8.663262367248535), ('##isierte', 8.227853775024414), ('häufig', 9.369148254394531), ('Veränderungen', 4.863248348236084), ('und', 3.3478829860687256), ('Über', 3.3023200035095215), ('##gänge', 0.40428581833839417), (',', 0.048578906804323196), ('wie', 1.878091812133789), ('den', 5.769808769226074), ('Wechsel', 3.2879366874694824), ('von', 0.016336975619196892), ('einer', 0.016496576368808746), ('Bed', 0.0244187843054533), ('##ingu', 0.09460146725177765), ('##ng', 0.018612651154398918), ('zur', 0.9586092829704285), ('anderen', 1.2600054740905762), (',', 0.3100062906742096), ('von', 0.013392632827162743), ('einer', 0.025651555508375168), ('Pers', 0.007922208867967129), ('##pektive', 0.03971010446548462), ('zur', 0.8729674220085144), ('anderen', 1.6451447010040283), ('und', 2.9337639808654785), ('das', 0.1244136244058609), ('Hera', 1.1853374242782593), ('##n', 1.9540393352508545), ('##wachsen', 0.006810512859374285), ('junge', 2.0289151668548584), ('##r', 0.007776367478072643), ('Menschen', 3.1449434757232666), ('zum', 5.088050365447998), ('Er', 0.001235523377545178), ('##wachsenen', 0.01289732288569212), ('##alter', 0.12524327635765076), ('.', 0.02648257650434971)]]

In [16]: en_sen = ["Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood."]

In [17]: en_sen
Out[17]: ['Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood.']

In [18]: model.token_score(en_sen, surprisal = True, base_two = True)
Out[18]: [[('Jan', 7.161930084228516), ('##us', 4.905619144439697), ('often', 5.8594160079956055), ('symbol', 9.485310554504395), ('##ised', 6.920506000518799), ('changes', 4.574926853179932), ('and', 3.2199747562408447), ('transition', 5.44439697265625), ('##s', 0.018392512574791908), (',', 0.02080027014017105), ('such', 0.04780016839504242), ('as', 0.013945729471743107), ('moving', 7.4285569190979), ('from', 0.008073553442955017), ('one', 0.2561193108558655), ('condition', 18.707305908203125), ('to', 0.014606142416596413), ('another', 0.8214359283447266), (',', 0.7367089986801147), ('from', 0.06036728620529175), ('one', 0.4734668731689453), ('perspective', 13.356915473937988), ('to', 0.06987723708152771), ('another', 0.7075008749961853), (',', 0.08287912607192993), ('and', 2.1124203205108643), ('young', 6.065392017364502), ('people', 3.042752742767334), ('growing', 4.334306716918945), ('into', 4.379203796386719), ('adult', 1.3680847883224487), ('##hood', 0.2171218991279602), ('.', 0.06372988969087601)]]

In [19]: sp_sen = ["Jano suele simbolizar los cambios y las transiciones, como el paso de una condición a otra, de una perspectiva a otra, y el crecimiento de los jóvenes hacia la edad adulta."]

In [20]: model.token_score(sp_sen, surprisal = True, base_two = True)
Out[20]: [[('Jan', 11.449429512023926), ('##o', 7.180861949920654), ('suele', 7.2584357261657715), ('simbol', 4.928884983062744), ('##izar', 0.018150361254811287), ('los', 0.03109721466898918), ('cambios', 3.5657286643981934), ('y', 6.550257682800293), ('las', 0.04733512923121452), ('trans', 3.946718454360962), ('##iciones', 0.35458695888519287), (',', 0.0718887448310852), ('como', 0.6874077916145325), ('el', 0.009603511542081833), ('paso', 5.542746067047119), ('de', 0.015120714902877808), ('una', 0.010806013830006123), ('condición', 15.124244689941406), ('a', 0.03922305256128311), ('otra', 0.5000980496406555), (',', 0.5917675495147705), ('de', 0.05246984213590622), ('una', 0.015467431396245956), ('perspectiva', 11.579629898071289), ('a', 0.05401906371116638), ('otra', 0.39069506525993347), (',', 0.024377508088946342), ('y', 1.9166929721832275), ('el', 0.006273927167057991), ('crecimiento', 6.725331783294678), ('de', 0.011221524327993393), ('los', 0.7993561029434204), ('jóvenes', 4.965604305267334), ('hacia', 3.6372487545013428), ('la', 0.27643802762031555), ('edad', 0.262629896402359), ('adulta', 0.3033374845981598), ('.', 0.03442404791712761)]]

In [21]: ru_sen = ["Янус часто символизировал изменения и переходы, такие как переход от одного состояния к другому, от одной перспективы к другой, а также молодых людей, вступающих во взрослую жизнь."]

In [22]: model.token_score(ru_sen, surprisal = True, base_two = True)
Out[22]: [[('Ян', 7.062388896942139), ('##ус', 7.699002742767334), ('часто', 10.491772651672363), ('символ', 1.846983551979065), ('##из', 0.5921100974082947), ('##ировал', 7.98089599609375), ('изменения', 9.341201782226562), ('и', 1.0752657651901245), ('пер', 0.0009851165814325213), ('##еход', 0.024955371394753456), ('##ы', 0.4438115358352661), (',', 0.036848314106464386), ('такие', 1.1838680505752563), ('как', 0.00436423160135746), ('пер', 0.006749975029379129), ('##еход', 0.0007649788167327642), ('от', 0.038956135511398315), ('одного', 0.11314807087182999), ('состояния', 9.765267372131348), ('к', 0.005316327791661024), ('другому', 1.0975052118301392), (',', 0.7671073079109192), ('от', 0.021667061373591423), ('одной', 1.209750771522522), ('пер', 0.016496576368808746), ('##спект', 0.001849157502874732), ('##ивы', 0.4393042325973511), ('к', 0.001108944183215499), ('другой', 1.3383814096450806), (',', 0.024982888251543045), ('а', 0.026218410581350327), ('также', 1.0678788423538208), ('молодых', 8.310323715209961), ('людей', 0.6578267812728882), (',', 0.008406511507928371), ('в', 0.719817578792572), ('##ступ', 0.6718330383300781), ('##ающих', 0.12413019686937332), ('во', 0.23966126143932343), ('в', 0.04291310906410217), ('##з', 0.00735535379499197), ('##рос', 0.2500462532043457), ('##лу', 0.015621528029441833), ('##ю', 0.00034396530827507377), ('жизнь', 3.4308295249938965), ('.', 0.15315811336040497)]]
```
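
One possible workaround for the word-level question above, sketched here (this is not a minicons API, and `merge_wordpieces` is a hypothetical helper): because surprisal is a negative log-probability and log-probabilities add, the surprisals of a word's subword pieces can simply be summed, folding WordPiece continuations ('##...') back into the preceding token:

```python
def merge_wordpieces(token_scores):
    # Fold '##' continuation pieces into the preceding token, summing
    # surprisals (log-probabilities add, so surprisals add too).
    words = []
    for token, score in token_scores:
        if token.startswith('##') and words:
            prev_token, prev_score = words[-1]
            words[-1] = (prev_token + token[2:], prev_score + score)
        else:
            words.append((token, score))
    return words

# e.g. ('symbol', 9.49) and ('##ised', 6.92) merge to ('symbolised', ~16.41)
print(merge_wordpieces(model.token_score(en_sen, surprisal = True, base_two = True)[0]))
```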

Kanishka

PhD Student @Purdue University working on Cognitive Science and Natural Language Understanding.
