Objective criteria for text search results and some surprising results
The COVID-19 Open Research Dataset can help researchers and the health community in the fight against a global pandemic. The Vespa team is contributing by releasing a search app based on the dataset. Since the data comes with no reliable labels to tell a good search result from a bad one, we propose an objective criterion for evaluating search results that does not rely on human-annotated labels. We use this criterion to run experiments and evaluate the value delivered by term-matching and semantic signals. We then show that the semantic signals deliver poor results, even when using a fine-tuned version of a model specifically designed for scientific text.

Released by the Allen Institute for AI, the COVID-19 Open Research Dataset (CORD-19) contains over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. It was released to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. And it did exactly that.
As soon as it was released, a Kaggle challenge, a dataset explorer, fine-tuned embedding models and an effort to collect labelled data followed.
Given my recent experience with labels containing strong term-matching bias in the MS MARCO dataset, and the fact that we at vespa.ai wanted to move fast to build a search app around the CORD-19 dataset, I decided to spend some time thinking about how I could compare different matching criteria and ranking functions without labelled data.
Objective criteria for text search
The goal was to have an objective criterion and to move away from the “it looks good enough” criterion so commonly used when reliable labels are not available. My proposal is simple: we use the title of an article as the query and consider the associated abstract to be the relevant document for that query.

This criterion is simple, scales to massive amounts of data since it does not rely on human annotation, and it makes sense. Think of it this way: if you use the title as a query and a given method is not able to retrieve the correct abstract and include it in the top 100 of the resulting list, we have a very sub-optimal ranking function for the context of a CORD-19 search app.
Results
Some of the results obtained are summarized in this section. We report three metrics: the percentage of documents matched by the query, the recall at the top 100 positions, and the mean reciprocal rank (MRR) over the top 100 documents returned.
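As a rough illustration of how the recall and MRR metrics could be computed under the title-as-query criterion, here is a minimal sketch. It assumes each query has exactly one relevant document (the abstract that belongs to the title) and that a ranked list of document ids is already available for each query. The function name, the input layout and the toy ids are hypothetical and are not taken from the actual experiment code.

```python
def evaluate(ranked_lists, relevant_ids, k=100):
    """Return (recall@k, MRR@k) when every query has one relevant document.

    ranked_lists: one list of document ids per query, best match first.
    relevant_ids: the id of the single relevant abstract for each query.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_ids):
        top_k = ranked[:k]
        if relevant in top_k:
            hits += 1
            # Positions are 1-based, so rank 1 contributes 1.0, rank 2 contributes 0.5.
            reciprocal_rank_sum += 1.0 / (top_k.index(relevant) + 1)
    n = len(relevant_ids)
    return hits / n, reciprocal_rank_sum / n


# Toy example with three title queries (hypothetical document ids).
ranked_lists = [
    ["d1", "d7", "d3"],  # relevant abstract d1 is at rank 1
    ["d4", "d2", "d9"],  # relevant abstract d2 is at rank 2
    ["d5", "d6", "d8"],  # relevant abstract d3 was not retrieved
]
relevant_ids = ["d1", "d2", "d3"]
recall, mrr = evaluate(ranked_lists, relevant_ids, k=100)
```

In this toy case two of the three abstracts are found, and a query whose abstract is missing from the top k contributes zero to the MRR.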
Term-matching
Table 1 shows the results obtained by ranking documents with the term-matching BM25 score. The first row shows the result when we only match documents whose abstracts contain every word in the title (AND operator). This is far too restrictive: it matches only a small fraction of the documents (0.01%) and therefore misses many relevant abstracts, leading to poor recall and MRR (20% and 19% respectively).

The second row matches all the documents whose abstracts contain at least one word of the title (OR operator). This is far too broad, matching almost all the documents in the corpus (89%), but it leads to good recall and MRR (94% and 80% respectively).
A middle ground is obtained when using the Vespa weakAND operator. It skips many documents based on a simple-to-compute term-matching equation, so it matches only 19% of the corpus while retaining recall and MRR (95% and 77% respectively) comparable to those obtained with the more expensive OR operator. We can also tune how many documents to retrieve with weakAND. We set it to 1,000 documents in this case to compare with the nearest neighbor operator we use for the semantic search experiments.
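To make the difference between the AND and OR matching criteria concrete, here is a toy sketch that computes the fraction of abstracts matched by a title under each criterion. It is only an illustration on whitespace-tokenized text: it is not how Vespa implements these operators, it does not model weakAND or BM25, and the function names and example texts are made up.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def match_fraction(title, abstracts, mode):
    """Fraction of abstracts matched by the title under AND or OR semantics."""
    query_terms = tokenize(title)
    matched = 0
    for abstract in abstracts:
        terms = tokenize(abstract)
        if mode == "and":
            # Every query term must appear in the abstract.
            is_match = query_terms <= terms
        elif mode == "or":
            # At least one query term must appear in the abstract.
            is_match = bool(query_terms & terms)
        else:
            raise ValueError("mode must be 'and' or 'or'")
        matched += is_match
    return matched / len(abstracts)


# Hypothetical mini-corpus.
abstracts = [
    "coronavirus spike protein structure",
    "protein folding in yeast",
    "influenza vaccine trial",
    "bat coronavirus genome",
]
title = "coronavirus spike protein"
and_fraction = match_fraction(title, abstracts, "and")
or_fraction = match_fraction(title, abstracts, "or")
```

Even on four documents the pattern from the experiments shows up: AND matches a single abstract, while OR matches every abstract that shares any word with the title.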
Semantic search
The first row of Table 2 reports the result obtained with semantic search. For this experiment we decided to use the scibert-nli model, which is a fine-tuned version of AllenAI’s SciBERT. We had high hopes for this model since it is a fine-tuned version of a model designed to work with scientific text.

However, the results did not meet our expectations. We retrieved on average around 14% of the corpus with the Vespa nearestNeighbor operator set to retrieve 1,000 documents. This means that we retrieve at least 1,000 documents based on the distance between the title and abstract embeddings, where the embeddings are constructed by the scibert-nli model. The ranking function was set to be the dot product between the title and abstract embeddings. This setup led to the worst recall and MRR of our experiments (17% and 8% respectively).
The first thought that crossed my mind when I looked at the results was that something was wrong with the code. To sanity-check this, I decided to run the same experiment but now using the abstract as the query. The task then becomes the abstract trying to retrieve itself. If the setup is correct, the results should be (near) perfect, since the distance between an embedding and itself should be approximately zero (apart from rounding errors).
The second row of Table 2 reports this sanity-check experiment and validates our setup, obtaining perfect recall and near-perfect MRR. This at least removes the chance that there was something completely wrong with the match phase, ranking function and experiment setup when applied to embeddings. So the poor performance of the semantic model continues to strike us as an odd and surprising result.
Remarks
We tried our best to clean the data so that only meaningful titles and abstracts were included in the experiments and the semantic search experiments would not be treated unfairly. We excluded many articles that clearly had a wrong title and/or abstract, such as “Author index” or “Subject index”. The cleaning reduced the number of documents considered from 44,000 to around 30,000.
After that we created the title and abstract embeddings with no additional pre-processing step as we believe this is how most people are going to use it:
title_embedding = model(title)
abstract_embedding = model(abstract)
We are of course open to suggestions on how to construct the embeddings from the text via the fine-tuned model if there is any indication that it would significantly improve the results. All the embeddings are normalized (L2-norm) to have length 1.
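The following sketch illustrates L2 normalization and dot-product ranking on tiny made-up vectors, and why an embedding used as its own query should land at the top of the list. It uses plain Python lists instead of the real scibert-nli embeddings, assumes non-zero vectors, and all names and values are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector so that its L2 norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_embedding, doc_embeddings):
    """Return document ids sorted by descending dot product with the query."""
    scores = {
        doc_id: dot(query_embedding, embedding)
        for doc_id, embedding in doc_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 3-dimensional "abstract embeddings", normalized to length 1.
doc_embeddings = {
    "a": l2_normalize([3.0, 4.0, 0.0]),
    "b": l2_normalize([1.0, 0.0, 0.0]),
    "c": l2_normalize([0.0, 0.0, 2.0]),
}

# Sanity check in the spirit of the abstract-retrieves-itself experiment:
# use the embedding of document "a" as the query.
query = doc_embeddings["a"]
ranking = rank_by_dot_product(query, doc_embeddings)
self_score = dot(query, query)
```

For unit-length vectors the dot product equals the cosine similarity, so a document's score against itself is 1 up to rounding and no other document can score higher.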
We also combined term-matching and semantic signals but got no significant improvements over the pure term-matching setup.
Conclusions
Table 3 summarizes the results discussed here. The clear winner so far has been the weakAND + BM25 combination. The results obtained with semantic search were so disappointing that they deserve further investigation. It is important to highlight that we are using and evaluating the semantic model in a search context. The (poor) performance reported here does not necessarily generalize to other semantic tasks.

Having objective criteria to evaluate search results that do not depend on human-annotated data is important not only for cases that have no explicit labels, such as the CORD-19 dataset. It is also useful when dealing with datasets that have biased labels, such as the MS MARCO dataset, which is biased towards term-matching signals, likely due to its data collection design.



