Vocabularies let you swap out the underlying text representation of models.
OpenNIR includes three ways to handle out-of-vocabulary terms:
- vocab=wordvec throws an error for OOV.
- vocab=wordvec_unk uses a single UNK token to represent all OOV.
- vocab=wordvec_hash uses multiple UNK tokens, assigned via the hash value of the OOV term (see the sketch after this list).
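As a rough illustration of the hash-bucket strategy, here is a minimal sketch. It is not OpenNIR's actual implementation; the bucket count, the hash function, and the `token_id` helper are assumptions for the example:

```python
import zlib

NUM_UNK_BUCKETS = 1000  # assumed bucket count, not OpenNIR's real setting

def token_id(term, vocab, vocab_size):
    """Map a term to an id; OOV terms hash into one of several UNK buckets."""
    if term in vocab:
        return vocab[term]
    # A stable hash of the term selects a bucket beyond the known-vocab ids,
    # so distinct OOV terms usually receive distinct (trainable) embeddings.
    bucket = zlib.crc32(term.encode("utf-8")) % NUM_UNK_BUCKETS
    return vocab_size + bucket

vocab = {"neural": 0, "ranking": 1}
print(token_id("neural", vocab, len(vocab)))     # known term -> 0
print(token_id("xylograph", vocab, len(vocab)))  # OOV -> hashed UNK bucket id
```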
There are four sources of word vectors implemented:

- vocab.source=fasttext FastText vectors. Variants: wiki-news-300d-1M, crawl-300d-2M
- vocab.source=glove GloVe vectors. Variants: cc-42b-300d, cc-840b-300d
- vocab.source=convknrm Vectors from the ConvKNRM experiments. Variants: knrm-bing, knrm-sogou, convknrm-bing, convknrm-sogou
- vocab.source=bionlp Trained on PubMed. vocab.variant=pubmed-pmc

config/bert

OpenNIR has a BERT implementation to provide contextualized representations. Out-of-vocabulary terms are handled by the WordPiece tokenizer (i.e., split into subwords). This vocabulary also outputs a CLS representation, which can be a useful signal for ranking.
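To make the WordPiece and CLS behavior concrete, here is a small sketch using the HuggingFace transformers library directly (OpenNIR wraps this internally; the input strings and the exact subword splits shown are illustrative):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# An out-of-vocabulary term is split into WordPiece subwords rather
# than being mapped to a single UNK token.
print(tokenizer.tokenize("xylograph"))  # e.g. ['xy', '##log', '##raph']

# The hidden state at position 0 is the CLS representation.
inputs = tokenizer("neural ranking", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # shape: [1, 768]
```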
There are two encoding strategies:
- vocab.encoding=joint - the query and document are modeled in the same sequence (see the sketch after this list)
- vocab.encoding=sep - the query and document are modeled independently
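The difference between the two strategies can be sketched with a HuggingFace tokenizer. This is a conceptual illustration, not OpenNIR's internal API; the query and document strings are made up:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
query = "neural ranking"
document = "OpenNIR is an end-to-end neural ad-hoc ranking pipeline."

# joint: query and document share one sequence ([CLS] query [SEP] doc [SEP]);
# token_type_ids mark which segment each token belongs to.
joint = tokenizer(query, document)

# sep: query and document are encoded as independent sequences.
q_enc = tokenizer(query)
d_enc = tokenizer(document)

print(tokenizer.convert_ids_to_tokens(joint["input_ids"])[:6])
```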
The pretrained model weights can be configured with vocab.bert_base (default: bert-base-uncased). This accepts any value supported by the HuggingFace transformers library for the BERT model (see here), and the following: