Learning from unlabelled data with COVID-19 Open Research Dataset

Objective criteria for text search results and some surprising results

The COVID-19 Open Research Dataset can help researchers and the health community in the fight against a global pandemic. The Vespa team is contributing by releasing a search app based on the dataset. Since the data comes with no reliable labels to tell a good search result from a bad one, we propose an objective criterion for evaluating search results that does not rely on human-annotated labels. We use this criterion to run experiments and evaluate the value delivered by term-matching and semantic signals. We then show that the semantic signals deliver poor results even when using a fine-tuned version of a model specifically designed for scientific text.

Released by the Allen Institute for AI, the COVID-19 Open Research Dataset (CORD-19) contains over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. It was released to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. And it did exactly that.

As soon as it was released, there was a Kaggle challenge, a dataset explorer, fine-tuned embedding models and an effort to collect labelled data.

Given my latest experience with labels containing strong term-matching bias in the MS MARCO dataset and the fact that we at vespa.ai wanted to move fast to build a search app around the CORD-19 dataset, I decided to spend some time thinking about how I could compare different matching criteria and ranking functions without labelled data.

Objective criteria for text search

The goal was to have an objective criterion and to move away from the “it looks good enough” criterion so commonly used when reliable labels are not available. My proposal is simple: we can use the title of the article as a query and consider the associated abstract as the relevant document for the query.

This criterion is simple, scales to massive amounts of data since it does not rely on human annotation, and makes sense. Think of it this way: if you use the title as a query and a given method is not able to retrieve the correct abstract and include it in the top 100 of the resulting list, we have a very sub-optimal ranking function for the context of a CORD-19 search app.

Results

Some of the results obtained are summarized in this section. We report three important metrics: the percentage of documents matched by the query, the recall at the top 100 positions, and the mean reciprocal rank (MRR) considering the top 100 documents returned.
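As a minimal sketch of how the last two metrics can be computed under the title-as-query setup, consider the snippet below. The function names and the toy document ids are hypothetical and are not taken from the actual experiment code; the only assumption is that each query is identified by the id of the article whose title was used.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], k: int = 100) -> float:
    """Fraction of queries whose own abstract appears in the top k results.

    `rankings` maps the id of the article used as query (its title) to the
    ranked list of document ids returned for that query.
    """
    hits = sum(1 for qid, ranked in rankings.items() if qid in ranked[:k])
    return hits / len(rankings)


def mrr_at_k(rankings: Dict[str, List[str]], k: int = 100) -> float:
    """Mean reciprocal rank of the correct abstract (0 if it is not in the top k)."""
    total = 0.0
    for qid, ranked in rankings.items():
        top = ranked[:k]
        if qid in top:
            total += 1.0 / (top.index(qid) + 1)
    return total / len(rankings)


# Toy example with made-up ids: each key is the article whose title was the query.
toy_rankings = {
    "a": ["a", "b", "c"],  # correct abstract at rank 1
    "b": ["c", "b", "a"],  # correct abstract at rank 2
    "c": ["a", "b"],       # correct abstract not retrieved
}
recall = recall_at_k(toy_rankings, k=100)
mrr = mrr_at_k(toy_rankings, k=100)
```

In this toy case two of the three queries retrieve their own abstract, and the reciprocal ranks are 1, 1/2 and 0.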

Term-matching

Table 1 shows results obtained by ranking documents with the term-matching signal BM25 score. The first row shows the result when we only match documents with abstracts that contain every word in the title (AND operator). This is far too restrictive: it matches only a small fraction of documents (0.01%) and therefore misses many relevant abstracts, leading to poor recall and MRR metrics (20% and 19% respectively).

Table 1: Key result metrics involving term-matching.

The second row matches all the documents with abstracts that contain at least one word of the title (OR operator). This is far too broad, matching almost all the documents in the corpus (89%), but it leads to good recall and MRR metrics (94% and 80% respectively).

A middle ground is obtained when using the Vespa weakAND operator. It skips many documents based on a simple-to-compute term-matching equation, so it matches only 19% of the corpus while retaining recall and MRR metrics (95% and 77% respectively) comparable to those obtained with the more expensive OR operator. We can also tune how many documents to retrieve with weakAND. We set it to 1,000 documents in this case to compare with the nearest neighbor operator we use for the semantic search experiments.
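To make the difference between the AND and OR match criteria concrete, here is a small sketch in plain Python. It is not Vespa code and does not reproduce weakAND; the tokenizer, the function names and the four toy abstracts are assumptions made purely for illustration.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Very naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def match_fraction(query: str, abstracts: List[str], mode: str) -> float:
    """Fraction of abstracts matched by the query under AND or OR semantics."""
    q = tokenize(query)
    if mode == "and":
        # every query term must be present in the abstract
        matched = [a for a in abstracts if q <= tokenize(a)]
    elif mode == "or":
        # at least one query term must be present in the abstract
        matched = [a for a in abstracts if q & tokenize(a)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return len(matched) / len(abstracts)


# Made-up abstracts, only meant to show the two matching rules.
toy_abstracts = [
    "coronavirus spike protein structure",
    "protein folding in yeast",
    "influenza transmission in schools",
    "spike protein binding in coronavirus infection",
]
title = "coronavirus spike protein"
and_fraction = match_fraction(title, toy_abstracts, "and")
or_fraction = match_fraction(title, toy_abstracts, "or")
```

Even on this tiny corpus, OR matches a document that shares only the word "protein" with the title, which is the behaviour that makes it match most of a real corpus.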

Semantic search

The first row of Table 2 reports the result obtained with semantic search. For this experiment we decided to use the scibert-nli model, which is a fine-tuned version of AllenAI’s SciBERT. We had high hopes for this model since it is a fine-tuned version of a model designed to work with scientific text.

Table 2: Key metrics for the semantic search results

However, the results were not on par with our expectations. We retrieved on average around 14% of the corpus with the Vespa nearestNeighbor operator set to retrieve 1,000 documents. This means that we retrieve at least 1,000 documents based on the distance between the title and abstract embeddings, where the embeddings are constructed by the scibert-nli model. The ranking function was set to be the dot-product between the title and abstract embeddings. This setup led to the worst recall and MRR of our experiments (17% and 8% respectively).
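The ranking step described above can be sketched as an exhaustive dot-product search in plain Python. This is only a stand-in for illustration and not how the Vespa nearestNeighbor operator is implemented; the function names and the two-dimensional toy vectors are made up.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k documents with the highest dot-product score, best first."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Made-up 2-dimensional unit vectors standing in for abstract embeddings.
toy_abstract_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
toy_title_embedding = [0.8, 0.6]

ranked = top_k_by_dot_product(toy_title_embedding, toy_abstract_embeddings, k=2)
```

The position of the correct abstract in a list like `ranked` is what feeds the recall and MRR numbers reported in this section.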

The first thought that crossed my mind when I looked at the results was that something was wrong with the code. So, to sanity-check this, I decided to run the same experiment but now using the abstract as the query. The task then becomes the abstract trying to retrieve itself. If the setup is correct, the results should be (near) perfect since the distance between an embedding and itself should be approximately zero (apart from rounding errors).

The second row of Table 2 reports this sanity-check experiment and validates our setup, obtaining a perfect recall and near-perfect MRR. This at least removes the chance that there was something completely wrong with the match phase, ranking function and experiment setup implementation when applied to embeddings. So, the poor performance of the semantic model continues to strike us as an odd and surprising result.

Remarks

We tried our best to clean the data to have only meaningful titles and abstracts included in the experiments so that the semantic search experiments would not be unfairly treated. We excluded many articles that clearly had a wrong title and/or abstract such as “Author index” or “Subject index”. The cleaning reduced the number of documents considered from 44,000 to around 30,000.

After that we created the title and abstract embeddings with no additional pre-processing step as we believe this is how most people are going to use it:

title_embedding = model(title)
abstract_embedding = model(abstract)

We are of course open to suggestions on how to construct the embeddings from the text via the fine-tuned model if there is indication that it would significantly improve the results. All the embeddings are normalized (L2-norm) to have length 1.
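For reference, the L2 normalization mentioned above can be sketched as follows. The helper name and the toy vector are hypothetical; in practice the input would be the vector returned by the model.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


raw_embedding = [3.0, 4.0]          # made-up vector with length 5
unit = l2_normalize(raw_embedding)  # approximately [0.6, 0.8]

# The dot product of a unit vector with itself is 1 (up to rounding),
# which is the largest value any pair of unit vectors can reach.
self_similarity = sum(x * x for x in unit)
```

With unit-length vectors the dot-product ranking is equivalent to ranking by cosine similarity, and an embedding always scores highest against itself, which is why the abstract-retrieves-itself sanity check should come out (near) perfect.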

We also combined term-matching and semantic signals but got no significant improvements over the pure term-matching setup.

Conclusions

Table 3 summarizes the results discussed here. The clear winner so far has been the weakAND + BM25 combination. The results obtained with semantic search were so disappointing that they deserve further investigation. It is important to highlight that we are using and evaluating the semantic model in a search context. The (poor) performance reported here does not necessarily generalize to other semantic tasks.

Table 3: Key result metrics involving term-matching and semantic search

Having objective criteria to evaluate search results that do not depend on human-annotated data is important not only for cases that have no explicit labels, such as the CORD-19 dataset. It is also useful when dealing with datasets that have biased labels, such as the MS MARCO dataset, which is biased towards term-matching signals, likely due to its data collection design.

Why you should NOT use MS MARCO to evaluate semantic search

And likely not many other widely used datasets either

If we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This piece shows that the MS MARCO dataset is more biased towards those signals than we expected and that the same issues are likely present in many other datasets due to similar data collection designs.

MS MARCO is a collection of large scale datasets released by Microsoft with the intent of helping advance deep learning research related to search. It was our first choice when we decided to create a tutorial showing how to set up a text search application with Vespa. It was getting a lot of attention from the community, in great part due to the intense competition around leaderboards. Besides, being a large and challenging annotated corpus of documents, it checked all the boxes at the time.

We followed up the first basic search tutorial with a blog post and a tutorial on how to use ML in Vespa to improve the text search application. So far so good. Our first issue came when we were writing the third tutorial on how to use (pre-trained) semantic embeddings and approximate nearest neighbor search to improve the application. At this point we started to realize that maybe the full-text ranking MS MARCO dataset was not the best way to go.

After looking more closely at the data, we started to realize that the dataset was highly biased towards term-matching signals. And by that I mean, much more than we expected.

But we know it is biased …

Before we go on to the data, we must say that we expected bias in the dataset. According to the MS MARCO dataset paper, they built the dataset by:

  1. Sampling queries from Bing’s search logs.
  2. Filtering out non-question queries.
  3. Retrieving relevant documents for each question using Bing from its large-scale web index.
  4. Automatically extracting relevant passages from those documents.
  5. Having human editors annotate passages that contain useful and necessary information for answering the questions.

Looking at steps 3 and 4 (and maybe 5), it is not surprising to find bias in the dataset. And to be fair, I think the bias is recognized as an issue in the literature. The surprise was the degree of the bias that we observed and how this might affect experiments involving semantic search.

Semantic embeddings setup

Our main goal was to illustrate how we can create out-of-the-box semantic aware text search applications by using term-matching and semantic signals. This combined with Vespa’s ability to perform Approximate Nearest Neighbor search would allow users to build such applications at scale.

In the results presented next we use BM25 scores as our term-matching signal and the sentence BERT model to generate embeddings to represent the semantic signal. Similar results were obtained with simpler term-matching signals and other semantic models like Universal Sentence Encoder. More details and code can be found in the tutorial.

Combining signals

We started with a reasonable baseline involving only term-matching signals. Next, we got promising results when we used only semantic signals in the application, just to sanity-check the setup and to confirm that there was indeed relevant information contained in the embeddings. After that, the obvious follow-up was to combine both signals.

Vespa offers a lot of possibilities here as we can combine term-matching and semantic signals, both in the match phase and in the ranking phase. In the match phase, we can use the nearestNeighbor operator for the semantic vectors and the multitude of operators usually used for term-matching, such as the usual AND and OR operators to combine query tokens or useful approximations like weakAND. In the ranking phase, we can use well known ranking features such as BM25 and the Vespa tensor evaluation framework to do whatever we want with input signals such as the semantic embeddings.

It was when we started to experiment with all these possibilities that we began to question the usefulness of the MS MARCO dataset for this type of experiment. The main point was that, although the semantic signals were doing a decent job in isolation, the improvements would disappear when term-matching signals were taken into account.

We were expecting a significant intersection between term-matching and semantic signals since both should contain information about query-document relevance. However, the semantic signals need to complement the term-matching signals for them to be valuable, given that they are more expensive to store and compute. This means that they should match relevant documents that would not otherwise be matched by term-matching signals.
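One simple way to express this complementarity requirement is as a set difference per query. The sketch below is only an illustration: the function name and the document ids are hypothetical, and the three sets would in practice come from the relevance labels and from the two match operators.

```python
from typing import Set


def only_semantic(
    relevant: Set[str], term_matched: Set[str], semantic_matched: Set[str]
) -> Set[str]:
    """Relevant documents retrieved by the semantic operator but missed by term-matching."""
    return (relevant & semantic_matched) - term_matched


# Made-up ids for a single query.
relevant = {"d1", "d2", "d3"}
term_matched = {"d1", "d2", "d7", "d9"}
semantic_matched = {"d2", "d3", "d8"}

added_by_semantic = only_semantic(relevant, term_matched, semantic_matched)
```

If this set is empty for almost every query, the semantic signal adds no relevant documents in the match phase, no matter how informative the embeddings are on their own.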

However, this was not the case, as far as we could see. So, we decided to look more closely at the data.

Term-matching bias

To better investigate what was going on, we collected query-document data from Vespa about both relevant and random documents. For example, the next graph shows the empirical distribution of the sum of dot-products between the query and title embeddings and between the query and body embeddings. The blue histogram shows the distribution for random (and therefore likely non-relevant to the queries) documents. The red histogram shows the same information but now conditioned on the fact that the documents are relevant to the queries.

Empirical distribution of embedding’s dot-product scores. Given a set of queries, blue represents random (non-relevant) documents and red represents relevant documents.

As expected, we got much higher scores on average for relevant documents. Great. Now, let’s look at a similar graph for the BM25 scores. The results are similar but much more extreme in this case. Relevant documents have much higher BM25 scores, to the point where almost no relevant document has low enough signal to be excluded from being retrieved by term-matching signals. This means that, after accounting for term-matching, there are almost no relevant documents left to be matched by semantic signals. This is true even if the semantic embeddings are informative.

Empirical distribution of BM25 scores. Given a set of queries, blue represents random (non-relevant) documents and red represents relevant documents.

In such a scenario, the best we can hope for is that both signals are positively correlated for relevant documents, showing that both carry information about query-document relevance. This seems indeed to be the case in the scatter plot below that visually shows a much stronger correlation between BM25 scores and embedding scores for the relevant documents (red) than between the scores of the general population (black).
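The visual comparison above could be backed by a per-group correlation coefficient. The following sketch computes the Pearson correlation in plain Python; the helper name is hypothetical and the scores are made-up numbers for illustration only, not the data behind the plot.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally sized samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up (BM25 score, embedding dot-product score) samples.
relevant_bm25 = [10.0, 12.0, 14.0, 16.0]
relevant_embedding = [0.5, 0.6, 0.7, 0.8]
random_bm25 = [1.0, 2.0, 3.0, 4.0]
random_embedding = [0.2, 0.1, 0.2, 0.1]

corr_relevant = pearson(relevant_bm25, relevant_embedding)
corr_random = pearson(random_bm25, random_embedding)
```

Computing the two coefficients separately for relevant and random documents mirrors the red versus black comparison in the scatter plot.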

Scatter plot of embedding’s dot-product scores versus BM25 scores. Given a set of queries, black represents random (non-relevant) documents and red represents relevant documents.

Remarks and conclusion

At this point, a reasonable observation would be that we are talking about pre-trained embeddings and that we could get better results if we fine-tuned the embeddings to the specific application at hand. This might very well be the case but there are at least two important considerations to be taken into account: cost and overfitting. The resource/cost consideration is important but also the more obvious of the two. You either have the money to pursue it or not. If you do, you should still check whether the improvement you get is worth the cost.

The main issue, in this case, relates to overfitting. It is not easy to avoid overfitting when using big and complex models such as Universal Sentence Encoder and sentence BERT. Even if we use the entire MS MARCO dataset, which is considered a big and important recent development to help advance the research around NLP tasks, we only have around 3 million documents and 300 thousand labeled queries to work with. This is not necessarily big relative to such massive models.

Another important observation is that BERT-related architectures have dominated the MS MARCO leaderboards for quite some time. Anna Rogers wrote a good piece about some of the challenges involved in the current trend of using leaderboards to measure model performance in NLP tasks. The big takeaway is that we should be careful when interpreting those results as it becomes hard to understand if the performance comes from architecture innovation or excessive resources (read overfitting) being deployed to solve the task.

But despite all those remarks, the most important point here is that if we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This might be an obvious conclusion, but what is not obvious to us at this moment is where to find those datasets, since the bias reported here is likely present in many other datasets due to similar data collection designs.