Geometry-aware Multilingual Embeddings

anoopkunchukuttan, updated 🕥 2022-12-08 02:29:28

Geometry-aware Multilingual Embedding

Code for learning multilingual embeddings using the method reported in:

Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach. Transaction of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.

Environment Setup

Do the following steps in order: 1. Clone the repository 2. Create a python virtual environment without Tensorflow (if TF is present Pymanopt gives out of memory errors).
3. pip install numpy scipy ipdb 4. pip install git+https://github.com/pymanopt/pymanopt.git --upgrade 5. In Pymanopt code(located at C:\Anaconda\envs\ENVRNMT_NAME\Lib\site-packages\pymanopt\tools\autodiff for Windows or the Linux equivalent), at line 46,49,101,104 add a parameter to the call of theano.function, allow_input_downcast=True 6. conda install theano pygpu 7. In Users\USER_NAME make a file .theanorc.txt with following content:

    [global]
    device = cuda
    floatX = float32
  1. Install cupy based on your CUDA version
  2. Two GPUs are needed

Note: While using this setup with Pymanopt, make sure to import cupy before importing theano, as sometimes theano throws an error that it is unable to find the correct CUDA version. However, the use of Cupy before this fixes the issue.

Datasets

The datasets can be downloaded by running the following commands in vecmap_data/ and muse_data/

./get_vecmap_data.sh
./get_muse_data.sh

Reproducing Results

The results that have been reported in Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach can be reproduced by running the following scripts: * Results of the GeoMM algorithm reported in Table 1, 2, and 6:

    ./geomm_results.sh
  • Results of the GeoMM-Multi algorithm reported in Table 1, 2, and 6:
    ./geomm_multi_results.sh
    
    • Results of the GeoMM-Semi algorithm reported in Table 7:

      ./geomm_semi_results.sh

Note: Since our code makes use of CUDA and FP32 precision, it may not be possible to reproduce our results exactly, due to minor numerical variations in GPU operations. However, the effect on the final results is negligible, as we have observed the variations usually lie within an error margin of 0.1 or 0.2.

Note: Added geomm_optimized.py which can replace geomm.py in all use-cases. Reduces time-taken for en-es pair from 188.5 second to 6.5 second.

GeoMM Embeddings

We provide GeoMM bilingual and multilingual embeddings. These are normalized embeddings in the latent space, . The embeddings are made available under the following license: Creative Commons Attribution-NonCommercial 4.0 International License.

MUSE Dataset

These embeddings have been trained jointly using en-XX MUSE bilingual dictionaries and Wikipedia FastText embeddings.

||||||| |---|---|---|---|---|---| | de | en | es | fr | ru | zh |

VecMap Dataset

These embeddings have been trained jointly using en-XX bilingual dictionaries and embeddings from the VecMap dataset.

|||||| |---|---|---|---|---| | de | en | es | fi | it |

English-Indian language bilingual embeddings

These bilingual embeddings have been trained using the CommonCrawl+Wikipedia FastText Embeddings and the MUSE bilingual dictionaries.

|||| |---|---|---| | en-hi | en-bn | en-ta |

Acknowledgements

The data-processing part of our code was taken from Mikel Artetxe's Vecmap Repository.

References

Please cite Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach if you found the resources in this repository useful.

@article{jawanpuria2018learning,
  title={Learning multilingual word embeddings in latent metric space: a geometric approach},
  author={Jawanpuria, Pratik and Balgovind, Arjun and Kunchukuttan, Anoop and Mishra, Bamdev},
  journal={Transaction of the Association for Computational Linguistics (TACL)},
  volume={7},
  pages={107--120},
  year={2019}
}

Issues

Bump certifi from 2018.4.16 to 2022.12.7

opened on 2022-12-08 02:29:28 by dependabot[bot]

Bumps certifi from 2018.4.16 to 2022.12.7.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).

Bump pillow from 6.2.0 to 9.3.0

opened on 2022-11-22 04:30:19 by dependabot[bot]

Bumps pillow from 6.2.0 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

  • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

  • Initialize libtiff buffer when saving #6699 [radarhere]

  • Inline fname2char to fix memory leak #6329 [nulano]

  • Fix memory leaks related to text features #6330 [nulano]

  • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

  • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

  • Fixed set_variation_by_name offset #6445 [radarhere]

  • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

  • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

  • Added ExifTags enums #6630 [radarhere]

  • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

  • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

  • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

  • Added GPS TIFF tag info #6661 [radarhere]

  • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

  • Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).

Bump pyspark from 2.2.1 to 3.2.2

opened on 2022-11-11 07:47:43 by dependabot[bot]

Bumps pyspark from 2.2.1 to 3.2.2.

Commits
  • 78a5825 Preparing Spark release v3.2.2-rc1
  • ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases
  • 001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image d...
  • 9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check
  • bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot....
  • c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarInte...
  • 32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecu...
  • be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check
  • 1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec
  • 3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like func...
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).

Bump mako from 1.0.7 to 1.2.2

opened on 2022-09-16 17:56:40 by dependabot[bot]

Bumps mako from 1.0.7 to 1.2.2.

Release notes

Sourced from mako's releases.

1.2.2

Released: Mon Aug 29 2022

bug

  • [bug] [lexer] Fixed issue in lexer where the regexp used to match tags would not correctly interpret quoted sections individually. While this parsing issue still produced the same expected tag structure later on, the mis-handling of quoted sections was also subject to a regexp crash if a tag had a large number of quotes within its quoted sections.

    References: #366

1.2.1

Released: Thu Jun 30 2022

bug

  • [bug] [tests] Various fixes to the test suite in the area of exception message rendering to accommodate for variability in Python versions as well as Pygments.

    References: #360

misc

  • [performance] Optimized some codepaths within the lexer/Python code generation process, improving performance for generation of templates prior to their being cached. Pull request courtesy Takuto Ikuta.

    References: #361

1.2.0

Released: Thu Mar 10 2022

changed

  • [changed] [py3k] Corrected "universal wheel" directive in setup.cfg so that building a wheel does not target Python 2.

    References: #351

  • [changed] [py3k] The bytestring_passthrough template argument is removed, as this flag only applied to Python 2.

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).

Bump nbconvert from 5.3.1 to 6.5.1

opened on 2022-08-23 17:31:23 by dependabot[bot]

Bumps nbconvert from 5.3.1 to 6.5.1.

Release notes

Sourced from nbconvert's releases.

Release 6.5.1

No release notes provided.

6.5.0

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.5...6.5

6.4.3

What's Changed

New Contributors

Full Changelog: https://github.com/jupyter/nbconvert/compare/6.4.2...6.4.3

6.4.0

What's Changed

New Contributors

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).

Bump mistune from 0.8.3 to 2.0.3

opened on 2022-07-29 22:49:38 by dependabot[bot]

Bumps mistune from 0.8.3 to 2.0.3.

Release notes

Sourced from mistune's releases.

Version 2.0.2

Fix escape_url via lepture/mistune#295

Version 2.0.1

Fix XSS for image link syntax.

Version 2.0.0

First release of Mistune v2.

Version 2.0.0 RC1

In this release, we have a Security Fix for harmful links.

Version 2.0.0 Alpha 1

This is the first release of v2. An alpha version for users to have a preview of the new mistune.

Version 0.8.4

  • Support an escaped pipe char in a table cell. #150
  • Fix ordered and unordered list. #152
  • Fix spaces between = in HTML tags
  • Add max_recursive_depth for list and blockquote.
  • Fix fences code block.
Changelog

Sourced from mistune's changelog.

Changelog

Here is the full history of mistune v2.

Version 2.0.4


Released on Jul 15, 2022
  • Fix url plugin in <a> tag
  • Fix * formatting

Version 2.0.3

Released on Jun 27, 2022

  • Fix table plugin
  • Security fix for CVE-2022-34749

Version 2.0.2


Released on Jan 14, 2022

Fix escape_url

Version 2.0.1

Released on Dec 30, 2021

XSS fix for image link syntax.

Version 2.0.0


Released on Dec 5, 2021

This is the first non-alpha release of mistune v2.

Version 2.0.0rc1

Released on Feb 16, 2021

Version 2.0.0a6


</tr></table> 

... (truncated)

Commits
  • 3f422f1 Version bump 2.0.3
  • a6d4321 Fix asteris emphasis regex CVE-2022-34749
  • 5638e46 Merge pull request #307 from jieter/patch-1
  • 0eba471 Fix typo in guide.rst
  • 61e9337 Fix table plugin
  • 76dec68 Add documentation for renderer heading when TOC enabled
  • 799cd11 Version bump 2.0.2
  • babb0cf Merge pull request #295 from dairiki/bug.escape_url
  • fc2cd53 Make mistune.util.escape_url less aggressive
  • 3e8d352 Version bump 2.0.1
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/anoopkunchukuttan/geomm/network/alerts).
Anoop Kunchukuttan

I work on Machine Learning and NLP. I am interested in multilingual computing, Indian language NLP and machine translation.

GitHub Repository

multilingual nlp translation bilingual-word-embedding word-embedding