This repo is a wrapper around the `transformers` library from Hugging Face 🤗.

Install from PyPI using:
```
pip install minicons
```
```py
from minicons import cwe

model = cwe.CWE('bert-base-uncased')

context_words = [("I went to the bank to withdraw money.", "bank"),
                 ("i was at the bank of the river ganga!", "bank")]

print(model.extract_representation(context_words, layer = 12))

'''
tensor([[ 0.5399, -0.2461, -0.0968,  ..., -0.4670, -0.5312, -0.0549],
        [-0.8258, -0.4308,  0.2744,  ..., -0.5987, -0.6984,  0.2087]],
       grad_fn=
'''

# Encoder-decoder models work too; representations come from the
# last layer, by default:
model = cwe.EncDecCWE('t5-small')

print(model.extract_representation(context_words))

'''
tensor([[-0.0895,  0.0758,  0.0753,  ...,  0.0130, -0.1093, -0.2354],
        [-0.0695,  0.1142,  0.0803,  ...,  0.0807, -0.1139, -0.2888]])
'''
```
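A quick way to see that the two extracted "bank" vectors encode different senses is cosine similarity. The sketch below uses made-up 4-dimensional stand-ins (seeded with the leading values printed above) for the 768-dimensional rows `extract_representation` returns; the `cosine` helper is illustrative and not part of minicons.

```python
import math

# Hypothetical helper (not part of minicons): cosine similarity of two vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-d stand-ins for the 768-d rows returned by extract_representation;
# in practice you would compare the two tensor rows themselves.
bank_money = [0.5399, -0.2461, -0.0968, -0.4670]
bank_river = [-0.8258, -0.4308, 0.2744, -0.5987]

print(cosine(bank_money, bank_river))  # well below 1: the two uses diverge
```

On real BERT representations you would expect a moderate but clearly sub-identical similarity between the financial and riverbank senses.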
```py
from minicons import scorer

mlm_model = scorer.MaskedLMScorer('bert-base-uncased', 'cpu')
ilm_model = scorer.IncrementalLMScorer('distilgpt2', 'cpu')
s2s_model = scorer.Seq2SeqScorer('t5-base', 'cpu')

stimuli = ["The keys to the cabinet are on the table.",
           "The keys to the cabinet is on the table."]

print(ilm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))
'''
[39.879737854003906, 42.75846481323242]
'''

print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))
'''
[13.962685585021973, 23.415111541748047]
'''

print(s2s_model.sequence_score(stimuli, source_format = 'blank'))
print(s2s_model.sequence_score(stimuli, source_format = 'copy'))
'''
[-7.910910129547119, -7.835635185241699]
[-10.555519104003906, -9.532546997070312]
'''
```
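The `reduction` argument controls how per-token log-probabilities are collapsed into one sequence score. Here is a minimal sketch of what `lambda x: -x.sum(0).item()` computes, using plain Python floats in place of the torch tensor minicons actually passes; the log-probability values are made up.

```python
import math

# Made-up per-token log-probabilities for a 4-token sentence.
token_logprobs = [-2.1, -0.3, -4.7, -1.5]

# reduction = lambda x: -x.sum(0).item() collapses them into the
# sentence's total negative log-likelihood (lower = more probable).
neg_log_likelihood = -sum(token_logprobs)  # ≈ 8.6

# A common alternative reduction is per-token perplexity:
perplexity = math.exp(neg_log_likelihood / len(token_logprobs))

print(neg_log_likelihood, perplexity)
```

This is why the ungrammatical sentence above gets the larger score under this reduction: a higher negative log-likelihood means the model found it less probable.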
`model.token_score()` and `model.sequence_score()` work with `MaskedLMScorer`s as well!

If you use `minicons`, please cite the following paper:
```tex
@article{misra2022minicons,
    title={minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models},
    author={Kanishka Misra},
    journal={arXiv preprint arXiv:2203.13112},
    year={2022}
}
```
Hi,

minicons does not seem to tokenize at the word level when "bert"-style pre-trained models are used. It would be desirable to get surprisal values for real-life word forms rather than for their split pieces. For example, "symbolised" is split into ('symbol', 9.485310554504395) and ('##ised', 6.920506000518799), but I want the surprisal value for the actual word ("symbolised"). I am not sure how to solve this problem. The package also seems to produce misleading surprisal values for some real words, particularly long words with a prefix or suffix, because such words are split into several units.

Many thanks!
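One possible workaround (a sketch, not a minicons feature): since the reported surprisals are on a log scale, summing the pieces of a word yields a word-level value. The helper below merges WordPiece "##" continuations back into whole words; `word_surprisals` is a hypothetical name, and treating summation as the right aggregation for MLM pseudo-log-likelihood scores is an assumption.

```python
# Hypothetical helper (not part of minicons): merge WordPiece "##" pieces
# back into whole words, summing their surprisals. Summing assumes the
# per-piece surprisals behave like additive log-probabilities.
def word_surprisals(token_scores):
    words = []
    for token, surprisal in token_scores:
        if token.startswith("##") and words:
            prev_token, prev_surprisal = words[-1]
            words[-1] = (prev_token + token[2:], prev_surprisal + surprisal)
        else:
            words.append((token, surprisal))
    return words

# The pieces of "symbolised" taken from the English session below:
pairs = [("symbol", 9.485310554504395), ("##ised", 6.920506000518799)]
print(word_surprisals(pairs))  # one entry: ('symbolised', summed surprisal)
```

Applying this to the full output of `model.token_score(...)` would give one (word, surprisal) pair per whitespace-level word for WordPiece tokenizers; SentencePiece models mark continuations differently, so the prefix check would need adjusting there.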
```python
In [13]: model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')
In [14]: ge_sen=["Janus symbolisierte häufig Veränderungen und Übergänge, wie den Wechsel von einer Bedingung zur anderen, von einer Perspektive zur anderen und das Heranwachsen junger Menschen zum Erwachsenenalter."]
In [15]: model.token_score(ge_sen, surprisal = True, base_two = True) Out[15]: [[('Jan', 7.411351680755615), ('##us', 6.953413963317871), ('symbol', 8.663262367248535), ('##isierte', 8.227853775024414), ('häufig', 9.369148254394531), ('Veränderungen', 4.863248348236084), ('und', 3.3478829860687256), ('Über', 3.3023200035095215), ('##gänge', 0.40428581833839417), (',', 0.048578906804323196), ('wie', 1.878091812133789), ('den', 5.769808769226074), ('Wechsel', 3.2879366874694824), ('von', 0.016336975619196892), ('einer', 0.016496576368808746), ('Bed', 0.0244187843054533), ('##ingu', 0.09460146725177765), ('##ng', 0.018612651154398918), ('zur', 0.9586092829704285), ('anderen', 1.2600054740905762), (',', 0.3100062906742096), ('von', 0.013392632827162743), ('einer', 0.025651555508375168), ('Pers', 0.007922208867967129), ('##pektive', 0.03971010446548462), ('zur', 0.8729674220085144), ('anderen', 1.6451447010040283), ('und', 2.9337639808654785), ('das', 0.1244136244058609), ('Hera', 1.1853374242782593), ('##n', 1.9540393352508545), ('##wachsen', 0.006810512859374285), ('junge', 2.0289151668548584), ('##r', 0.007776367478072643), ('Menschen', 3.1449434757232666), ('zum', 5.088050365447998), ('Er', 0.001235523377545178), ('##wachsenen', 0.01289732288569212), ('##alter', 0.12524327635765076), ('.', 0.02648257650434971)]]
In [16]: en_sen=["Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood."]

In [17]: en_sen
Out[17]: ['Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood.']
In [18]: model.token_score(en_sen, surprisal = True, base_two = True) Out[18]: [[('Jan', 7.161930084228516), ('##us', 4.905619144439697), ('often', 5.8594160079956055), ('symbol', 9.485310554504395), ('##ised', 6.920506000518799), ('changes', 4.574926853179932), ('and', 3.2199747562408447), ('transition', 5.44439697265625), ('##s', 0.018392512574791908), (',', 0.02080027014017105), ('such', 0.04780016839504242), ('as', 0.013945729471743107), ('moving', 7.4285569190979), ('from', 0.008073553442955017), ('one', 0.2561193108558655), ('condition', 18.707305908203125), ('to', 0.014606142416596413), ('another', 0.8214359283447266), (',', 0.7367089986801147), ('from', 0.06036728620529175), ('one', 0.4734668731689453), ('perspective', 13.356915473937988), ('to', 0.06987723708152771), ('another', 0.7075008749961853), (',', 0.08287912607192993), ('and', 2.1124203205108643), ('young', 6.065392017364502), ('people', 3.042752742767334), ('growing', 4.334306716918945), ('into', 4.379203796386719), ('adult', 1.3680847883224487), ('##hood', 0.2171218991279602), ('.', 0.06372988969087601)]]
In [19]: sp_sen=["Jano suele simbolizar los cambios y las transiciones, como el paso de una condición a otra, de una perspectiva a otra, y el crecimiento de los jóvenes hacia la edad adulta."]
In [20]: model.token_score(sp_sen, surprisal = True, base_two = True) Out[20]: [[('Jan', 11.449429512023926), ('##o', 7.180861949920654), ('suele', 7.2584357261657715), ('simbol', 4.928884983062744), ('##izar', 0.018150361254811287), ('los', 0.03109721466898918), ('cambios', 3.5657286643981934), ('y', 6.550257682800293), ('las', 0.04733512923121452), ('trans', 3.946718454360962), ('##iciones', 0.35458695888519287), (',', 0.0718887448310852), ('como', 0.6874077916145325), ('el', 0.009603511542081833), ('paso', 5.542746067047119), ('de', 0.015120714902877808), ('una', 0.010806013830006123), ('condición', 15.124244689941406), ('a', 0.03922305256128311), ('otra', 0.5000980496406555), (',', 0.5917675495147705), ('de', 0.05246984213590622), ('una', 0.015467431396245956), ('perspectiva', 11.579629898071289), ('a', 0.05401906371116638), ('otra', 0.39069506525993347), (',', 0.024377508088946342), ('y', 1.9166929721832275), ('el', 0.006273927167057991), ('crecimiento', 6.725331783294678), ('de', 0.011221524327993393), ('los', 0.7993561029434204), ('jóvenes', 4.965604305267334), ('hacia', 3.6372487545013428), ('la', 0.27643802762031555), ('edad', 0.262629896402359), ('adulta', 0.3033374845981598), ('.', 0.03442404791712761)]]
In [21]: ru_sen=["Янус часто символизировал изменения и переходы, такие как переход от одного состояния к другому, от одной перспективы к другой, а также молодых людей, вступающих во взрослую жизнь."]
In [22]: model.token_score(ru_sen, surprisal = True, base_two = True) Out[22]: [[('Ян', 7.062388896942139), ('##ус', 7.699002742767334), ('часто', 10.491772651672363), ('символ', 1.846983551979065), ('##из', 0.5921100974082947), ('##ировал', 7.98089599609375), ('изменения', 9.341201782226562), ('и', 1.0752657651901245), ('пер', 0.0009851165814325213), ('##еход', 0.024955371394753456), ('##ы', 0.4438115358352661), (',', 0.036848314106464386), ('такие', 1.1838680505752563), ('как', 0.00436423160135746), ('пер', 0.006749975029379129), ('##еход', 0.0007649788167327642), ('от', 0.038956135511398315), ('одного', 0.11314807087182999), ('состояния', 9.765267372131348), ('к', 0.005316327791661024), ('другому', 1.0975052118301392), (',', 0.7671073079109192), ('от', 0.021667061373591423), ('одной', 1.209750771522522), ('пер', 0.016496576368808746), ('##спект', 0.001849157502874732), ('##ивы', 0.4393042325973511), ('к', 0.001108944183215499), ('другой', 1.3383814096450806), (',', 0.024982888251543045), ('а', 0.026218410581350327), ('также', 1.0678788423538208), ('молодых', 8.310323715209961), ('людей', 0.6578267812728882), (',', 0.008406511507928371), ('в', 0.719817578792572), ('##ступ', 0.6718330383300781), ('##ающих', 0.12413019686937332), ('во', 0.23966126143932343), ('в', 0.04291310906410217), ('##з', 0.00735535379499197), ('##рос', 0.2500462532043457), ('##лу', 0.015621528029441833), ('##ю', 0.00034396530827507377), ('жизнь', 3.4308295249938965), ('.', 0.15315811336040497)]]
```