An end-to-end neural ad-hoc ranking pipeline.
Vocabularies let you swap out the underlying text representation of models.
OpenNIR includes three ways to handle out-of-vocabulary (OOV) terms:

- `vocab=wordvec` throws an error for OOV terms.
- `vocab=wordvec_unk` uses a single `UNK` token to represent all OOV terms.
- `vocab=wordvec_hash` uses multiple `UNK` tokens, assigned via the hash value of the OOV term.

There are four sources of word vectors implemented:
- `vocab.source=fasttext` - FastText vectors. Variants: `wiki-news-300d-1M`, `crawl-300d-2M`
- `vocab.source=glove` - GloVe vectors. Variants: `cc-42b-300d`, `cc-840b-300d`
- `vocab.source=convknrm` - vectors from the ConvKNRM experiments. Variants: `knrm-bing`, `knrm-sogou`, `convknrm-bing`, `convknrm-sogou`
- `vocab.source=bionlp` - vectors trained on PubMed. Variant: `vocab.variant=pubmed-pmc`
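
These options are combined as `key=value` pairs on the command line. A minimal sketch, assuming the `scripts/pipeline.sh` entry point and the `config/conv_knrm` and `config/robust` configuration files from the repository (substitute your own model and dataset configs):

```bash
# Hash-based OOV handling with 840B-token GloVe vectors.
# config/conv_knrm and config/robust are illustrative; adjust to your setup.
bash scripts/pipeline.sh config/conv_knrm config/robust \
    vocab=wordvec_hash \
    vocab.source=glove \
    vocab.variant=cc-840b-300d
```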
`config/bert`
OpenNIR has a BERT implementation to provide contextualized representations. Out-of-vocabulary terms are handled by the WordPiece tokenizer (i.e., split into subwords). This vocabulary also outputs a CLS representation, which can be a useful signal for ranking.
There are two encoding strategies:

- `vocab.encoding=joint` - the query and document are modeled in the same sequence
- `vocab.encoding=sep` - the query and document are modeled independently

The pretrained model weights can be configured with `vocab.bert_base` (default `bert-base-uncased`).
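
For example, a BERT-based run might be configured as below (a sketch assuming the `config/vanilla_bert` configuration file from the repository; the exact config names may differ in your version):

```bash
# Joint query-document encoding with the default pretrained weights.
# config/vanilla_bert and config/robust are illustrative; adjust to your setup.
bash scripts/pipeline.sh config/vanilla_bert config/robust \
    vocab.encoding=joint \
    vocab.bert_base=bert-base-uncased
```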
`vocab.bert_base` accepts any value supported by the HuggingFace transformers library for the BERT model (see here), and the following: