Vocabularies let you swap out the underlying text representation of models.
OpenNIR includes three ways to handle out-of-vocabulary (OOV) terms, selected via the `vocab` setting (see the example after this list):

- `vocab=wordvec` throws an error for OOV terms.
- `vocab=wordvec_unk` uses a single `UNK` token to represent all OOV terms.
- `vocab=wordvec_hash` uses multiple `UNK` tokens, assigned via the hash value of the OOV term.
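For example, the OOV strategy is passed as a key=value pair on the command line like any other OpenNIR setting. This is a minimal sketch; the `config/conv_knrm` and `config/antique` pipeline configs are illustrative, so substitute the configs you are actually running:

```bash
# Run a pipeline that buckets OOV terms into hashed UNK tokens.
scripts/pipeline.sh config/conv_knrm config/antique vocab=wordvec_hash
```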
There are four sources of word vectors implemented (example below):

- `vocab.source=fasttext` FastText vectors. Variants:
- `vocab.source=glove` GloVe vectors. Variants:
- `vocab.source=convknrm` Vectors from the ConvKNRM experiments. Variants:
- `vocab.source=bionlp` Trained on PubMed.
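As a sketch, a vector source can be combined with an OOV strategy in a single invocation (again, the pipeline config names are placeholders):

```bash
# ConvKNRM ranker using GloVe vectors, with hashed UNK buckets for OOV terms.
scripts/pipeline.sh config/conv_knrm config/antique vocab=wordvec_hash vocab.source=glove
```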
OpenNIR has a BERT implementation to provide contextualized representations. Out-of-vocabulary terms are handled by the WordPiece tokenizer (i.e., split into subwords). This vocabulary also outputs the CLS representation, which can be a useful signal for ranking.
There are two encoding strategies (see the sketch after this list):

- `vocab.encoding=joint` - the query and document are modeled in the same sequence
- `vocab.encoding=sep` - the query and document are modeled independently
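For instance, a BERT-based ranker can select its encoding strategy on the command line. This sketch assumes OpenNIR's Vanilla BERT pipeline config (`config/vanilla_bert`); adjust to your setup:

```bash
# Vanilla BERT ranker, modeling the query and document jointly in one sequence.
scripts/pipeline.sh config/vanilla_bert config/antique vocab.encoding=joint
```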
The pretrained model weights can be configured with `vocab.bert_base`.
This accepts any value supported by the HuggingFace transformers library for the BERT model (see the transformers documentation),
and the following: