OpenNIR

An end-to-end neural ad-hoc ranking pipeline.

This project is maintained by Georgetown-IR-Lab

Vocabularies

Overview

Vocabularies let you swap out the underlying text representation of models.

Word Vectors

OpenNIR includes three ways to handle out-of-vocabulary terms:

There are four sources of word vectors implemented:

Contextualized Vectors (config/bert)

OpenNIR has a BERT implementation to provide contextualized representations. Out-of-vocabulary terms are handled by the WordPiece tokenizer (i.e., they are split into subwords). This vocabulary also outputs the CLS representation, which can be a useful signal for ranking.
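To illustrate how subword splitting avoids out-of-vocabulary terms, here is a minimal sketch of greedy longest-match-first tokenization in the style of WordPiece. It is not OpenNIR's or BERT's actual code: the names `wordpiece_tokenize`, `encode`, and `TOY_VOCAB` are hypothetical, and the toy vocabulary stands in for the roughly 30,000-entry vocabulary shipped with a real BERT model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"rank", "un", "##rank", "##ing", "##able"}


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the subwords in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(wordpiece_tokenize("unrankable", TOY_VOCAB))
print(encode("Ranking xyz", TOY_VOCAB))
```

In this sketch, "unrankable" is never seen as a whole word but is still representable as known pieces. The `[CLS]` token prepended by `encode` marks the position whose final-layer output BERT exposes as the CLS representation mentioned above.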

There are two encoding strategies:

The pretrained model weights can be configured with vocab.bert_base (default bert-base-uncased). This accepts any value supported by the HuggingFace transformers library for the BERT model (see here), and the following: