Document ranking via sentence modeling using BERT

castorini, updated 🕥 2022-12-08 05:13:28

Birch

DOI

Document ranking via sentence modeling using BERT

Note: The results in the arXiv paper Simple Applications of BERT for Ad Hoc Document Retrieval have been superseded by the results in the EMNLP'19 paper [Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval]. To reproduce the results in the arXiv paper, please follow the instructions here instead.

Environment & Data

```

Set up environment

pip install virtualenv virtualenv -p python3.5 birch_env source birch_env/bin/activate

Install dependencies

pip install Cython # jnius dependency pip install -r requirements.txt

For inference, the Python-only apex build can also be used

git clone https://github.com/NVIDIA/apex cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Set up Anserini (last reproduced with commit id: 5da46f610435be6364700bc5a6144253ed3f3b59)

git clone https://github.com/castorini/anserini.git cd anserini && mvn clean package appassembler:assemble cd eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..

Download data and models

wget https://zenodo.org/record/3381673/files/emnlp_bert4ir_v2.tar.gz tar -xzvf emnlp_bert4ir_v2.tar.gz ```

Experiment Names: - mb_robust04, mb_core17, mb_core18 - car_mb_robust04, car_mb_core17, car_mb_core18 - msmarco_mb_robust04, msmarco_mb_core17, msmarco_mb_core18 - robust04, car_core17, car_core18 - msmarco_robust04, msmarco_core17, msmarco_core18

Training

For BERT(MB):

export CUDA_VISIBLE_DEVICES=0; experiment=mb; \ nohup python -u src/main.py --mode training --experiment ${experiment} --collection mb \ --local_model models/bert-large-uncased.tar.gz \ --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 \ --data_path data --predict_path data/predictions/predict.${experiment} \ --model_path models/saved.${experiment} --eval_steps 1000 \ --device cuda --output_path logs/out.${experiment} > logs/${experiment}.log 2>&1 &

For BERT(CAR -> MB) and BERT(MS MARCO -> MB):

export CUDA_VISIBLE_DEVICES=0; experiment=<car_mb, msmarco_mb>; \ nohup python -u src/main.py --mode training --experiment ${experiment} --collection mb \ --local_model <models/pytorch_msmarco.tar.gz, models/pytorch_car.tar.gz> \ --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 \ --data_path data --predict_path data/predictions/predict.${experiment} \ --model_path models/saved.${experiment} --eval_steps 1000 \ --device cuda --output_path logs/out.${experiment} > logs/${experiment}.log 2>&1 &

Inference

For BERT(MB), BERT(CAR -> MB) and BERT(MS MARCO -> MB):

export CUDA_VISIBLE_DEVICES=0; experiment=<experiment_name>; \ nohup python -u src/main.py --mode inference --experiment ${experiment} --collection <robust04, core17, core18> \ --load_trained --model_path <models/saved.mb_1, models/saved.car_mb_1, models/saved.msmarco_mb_1> \ --batch_size 4 --data_path data --predict_path data/predictions/predict.${experiment} \ --device cuda --output_path logs/out.${experiment} > logs/${experiment}.log 2>&1 &

For BERT(CAR) and BERT(MS MARCO):

export CUDA_VISIBLE_DEVICES=0; experiment=<experiment_name; \ nohup python -u src/main.py --mode inference --experiment ${experiment} --collection <robust04, core17, core18> \ --local_model <models/pytorch_msmarco.tar.gz, models/pytorch_car.tar.gz> \ --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 4 \ --data_path data --predict_path data/predictions/predict.${experiment} \ --device cuda --output_path logs/out.${experiment} > logs/${experiment}.log 2>&1 &

Note that this step takes a long time. If you don't want to evaluate the pretrained models, you may skip to the next step and evaluate with our predictions under data/predictions.

Retrieve sentences from top candidate documents

python src/utils/split_docs.py --collection <robust04, core17, core18> \ --index <path/to/index> --data_path data --anserini_path <path/to/anserini/root>

Evaluation

experiment=<experiment_name> collection=<robust04, core17, core18> anserini_path=<path/to/anserini/root> index_path=<path/to/lucene/index> data_path=<path/to/data/root>

BM25+RM3 Baseline

``` ./eval_scripts/baseline.sh ${collection} ${index_path} ${anserini_path} ${data_path}

./eval_scripts/eval.sh baseline ${collection} ${anserini_path} ${data_path} ```

Sentence Evidence

```

Tune hyperparameters (if you do not have apex working, run this script with an additional "NOAPEX" param at the end)

./eval_scripts/train.sh ${experiment} ${collection} ${anserini_path}

Run experiment

./eval_scripts/test.sh ${experiment} ${collection} ${anserini_path}

Evaluate with trec_eval

./eval_scripts/eval.sh ${experiment} ${collection} ${anserini_path} ${data_path} ```

Issues

Bump certifi from 2019.3.9 to 2022.12.7

opened on 2022-12-08 05:13:28 by dependabot[bot]

Bumps certifi from 2019.3.9 to 2022.12.7.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/castorini/birch/network/alerts).

Bump numpy from 1.16.2 to 1.22.0

opened on 2022-06-21 22:06:26 by dependabot[bot]

Bumps numpy from 1.16.2 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

  • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
  • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
  • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
  • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
  • A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/castorini/birch/network/alerts).

Bump ipython from 7.6.0 to 7.16.3

opened on 2022-01-21 19:55:25 by dependabot[bot]

Bumps ipython from 7.6.0 to 7.16.3.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/castorini/birch/network/alerts).

robust04_cv.py

opened on 2021-06-10 02:07:11 by wangjiajia5889758

Can you provide the code 'robust04_cv.py' for us?

Bump urllib3 from 1.24.2 to 1.26.5

opened on 2021-06-01 23:46:48 by dependabot[bot]

Bumps urllib3 from 1.24.2 to 1.26.5.

Release notes

Sourced from urllib3's releases.

1.26.5

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Fixed deprecation warnings emitted in Python 3.10.
  • Updated vendored six library to 1.16.0.
  • Improved performance of URL parser when splitting the authority component.

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.4

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.3

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Fixed bytes and string comparison issue with headers (Pull #2141)

  • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.2

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

1.26.1

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

1.26.0

:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

  • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

  • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt-in explicitly by setting ssl_version=ssl.PROTOCOL_TLSv1_1 (Pull #2002) Starting in urllib3 v2.0: Connections that receive a DeprecationWarning will fail

  • Deprecated Retry options Retry.DEFAULT_METHOD_WHITELIST, Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST and Retry(method_whitelist=...) in favor of Retry.DEFAULT_ALLOWED_METHODS, Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT, and Retry(allowed_methods=...) (Pull #2000) Starting in urllib3 v2.0: Deprecated options will be removed

... (truncated)

Changelog

Sourced from urllib3's changelog.

1.26.5 (2021-05-26)

  • Fixed deprecation warnings emitted in Python 3.10.
  • Updated vendored six library to 1.16.0.
  • Improved performance of URL parser when splitting the authority component.

1.26.4 (2021-03-15)

  • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

1.26.3 (2021-01-26)

  • Fixed bytes and string comparison issue with headers (Pull #2141)

  • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)

1.26.2 (2020-11-12)

  • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

1.26.1 (2020-11-11)

  • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

1.26.0 (2020-11-10)

  • NOTE: urllib3 v2.0 will drop support for Python 2. Read more in the v2.0 Roadmap <https://urllib3.readthedocs.io/en/latest/v2-roadmap.html>_.

  • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

  • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning

... (truncated)

Commits
  • d161647 Release 1.26.5
  • 2d4a3fe Improve performance of sub-authority splitting in URL
  • 2698537 Update vendored six to 1.16.0
  • 07bed79 Fix deprecation warnings for Python 3.10 ssl module
  • d725a9b Add Python 3.10 to GitHub Actions
  • 339ad34 Use pytest==6.2.4 on Python 3.10+
  • f271c9c Apply latest Black formatting
  • 1884878 [1.26] Properly proxy EOF on the SSLTransport test suite
  • a891304 Release 1.26.4
  • 8d65ea1 Merge pull request from GHSA-5phf-pp7p-vc2r
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/castorini/birch/network/alerts).

Bump pygments from 2.4.2 to 2.7.4

opened on 2021-03-29 20:19:27 by dependabot[bot]

Bumps pygments from 2.4.2 to 2.7.4.

Release notes

Sourced from pygments's releases.

2.7.4

  • Updated lexers:

    • Apache configurations: Improve handling of malformed tags (#1656)

    • CSS: Add support for variables (#1633, #1666)

    • Crystal (#1650, #1670)

    • Coq (#1648)

    • Fortran: Add missing keywords (#1635, #1665)

    • Ini (#1624)

    • JavaScript and variants (#1647 -- missing regex flags, #1651)

    • Markdown (#1623, #1617)

    • Shell

      • Lex trailing whitespace as part of the prompt (#1645)
      • Add missing in keyword (#1652)
    • SQL - Fix keywords (#1668)

    • Typescript: Fix incorrect punctuation handling (#1510, #1511)

  • Fix infinite loop in SML lexer (#1625)

  • Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)

  • Limit recursion with nesting Ruby heredocs (#1638)

  • Fix a few inefficient regexes for guessing lexers

  • Fix the raw token lexer handling of Unicode (#1616)

  • Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!

  • Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)

  • Fix incorrect MATLAB example (#1582)

Thanks to Google's OSS-Fuzz project for finding many of these bugs.

2.7.3

... (truncated)

Changelog

Sourced from pygments's changelog.

Version 2.7.4

(released January 12, 2021)

  • Updated lexers:

    • Apache configurations: Improve handling of malformed tags (#1656)

    • CSS: Add support for variables (#1633, #1666)

    • Crystal (#1650, #1670)

    • Coq (#1648)

    • Fortran: Add missing keywords (#1635, #1665)

    • Ini (#1624)

    • JavaScript and variants (#1647 -- missing regex flags, #1651)

    • Markdown (#1623, #1617)

    • Shell

      • Lex trailing whitespace as part of the prompt (#1645)
      • Add missing in keyword (#1652)
    • SQL - Fix keywords (#1668)

    • Typescript: Fix incorrect punctuation handling (#1510, #1511)

  • Fix infinite loop in SML lexer (#1625)

  • Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)

  • Limit recursion with nesting Ruby heredocs (#1638)

  • Fix a few inefficient regexes for guessing lexers

  • Fix the raw token lexer handling of Unicode (#1616)

  • Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!

  • Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)

  • Fix incorrect MATLAB example (#1582)

Thanks to Google's OSS-Fuzz project for finding many of these bugs.

Version 2.7.3

(released December 6, 2020)

... (truncated)

Commits
  • 4d555d0 Bump version to 2.7.4.
  • fc3b05d Update CHANGES.
  • ad21935 Revert "Added dracula theme style (#1636)"
  • e411506 Prepare for 2.7.4 release.
  • 275e34d doc: remove Perl 6 ref
  • 2e7e8c4 Fix several exponential/cubic complexity regexes found by Ben Caller/Doyensec
  • eb39c43 xquery: fix pop from empty stack
  • 2738778 fix coding style in test_analyzer_lexer
  • 02e0f09 Added 'ERROR STOP' to fortran.py keywords. (#1665)
  • c83fe48 support added for css variables (#1633)
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/castorini/birch/network/alerts).
Castorini

Deep learning for natural language processing and information retrieval at the University of Waterloo

GitHub Repository