October | 2020 | Thiago G. Martins

A pyvespa library overview: Connect, query, collect data and evaluate query models.

Vespa is, in my opinion, the fastest, most scalable and most advanced search engine currently available. It has a native tensor evaluation framework, can perform approximate nearest neighbor search and can deploy the latest developments in NLP modeling, such as BERT models.

This post will give you an overview of the Vespa python API available through the pyvespa library. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.

We are going to connect to the CORD-19 search app and use it as an example here. You can later use your own application to replicate the following steps. Future posts will go deeper into each topic described in this overview tutorial.

You can also run the steps contained here from Google Colab.

Install

Warning: The library is under active development and backward-incompatible changes may occur. Feedback and contributions are welcome.

The library is available at PyPI and therefore can be installed with pip.

!pip install pyvespa

Connect to a running Vespa application

We can connect to a running Vespa application by creating an instance of Vespa with the appropriate url. The resulting app will then be used to communicate with the application.

from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")

Define a Query model

Easily define matching and ranking criteria

When building a search application, we usually want to experiment with different query models. A Query model consists of a match phase and a ranking phase. The matching phase will define how to match documents based on the query sent and the ranking phase will define how to rank the matched documents. Both phases can get quite complex and being able to easily express and experiment with them is very valuable.

In the example below we define the match phase to be the Union of the WeakAnd and the ANN operators. The WeakAnd will match documents based on query terms while the Approximate Nearest Neighbor (ANN) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa.

from vespa.query import Union, WeakAnd, ANN
from random import random

match_phase = Union(
    WeakAnd(hits = 10), 
    ANN(
        doc_vector="title_embedding", 
        query_vector="title_vector", 
        # Placeholder embedding: ignores the query text and returns a random 768-dimensional vector.
        embedding_model=lambda x: [random() for _ in range(768)],
        hits = 10,
        label="title"
    )
)
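To make the idea of a union of term and semantic matching more concrete, here is a minimal pure-Python sketch. It only illustrates the concept and is not how Vespa or pyvespa implement WeakAnd or ANN: the toy documents, the two-dimensional embeddings and the helper names term_match and nearest_neighbors are all invented for this example, and the nearest neighbor search is exact brute force rather than approximate.

```python
import math

# Toy corpus: ids, titles and tiny 2-d "embeddings" (all invented for illustration).
docs = {
    "a": {"title": "remdesivir treatment trial", "embedding": [0.9, 0.1]},
    "b": {"title": "coronavirus origin in bats", "embedding": [0.1, 0.9]},
    "c": {"title": "antiviral drug efficacy", "embedding": [0.8, 0.2]},
}


def term_match(query, docs):
    """Ids of documents that share at least one term with the query."""
    terms = set(query.lower().split())
    return {
        doc_id
        for doc_id, doc in docs.items()
        if terms & set(doc["title"].lower().split())
    }


def nearest_neighbors(query_vector, docs, hits):
    """Ids of the `hits` documents closest to the query vector (exact search)."""
    ranked = sorted(
        docs, key=lambda doc_id: math.dist(query_vector, docs[doc_id]["embedding"])
    )
    return set(ranked[:hits])


term_hits = term_match("remdesivir", docs)
semantic_hits = nearest_neighbors([1.0, 0.0], docs, hits=2)

# The union keeps a document if either operator matched it.
matched = term_hits | semantic_hits
```

In this toy setup document "c" shares no term with the query and is matched only through its embedding, which is the kind of document that pure term matching would miss.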

We then define the ranking to be done by the bm25 rank-profile that is already defined in the application schema. We set list_features=True to be able to collect ranking-features later in this tutorial. After defining the match_phase and the rank_profile we can instantiate the Query model.

from vespa.query import Query, RankProfile

rank_profile = RankProfile(name="bm25", list_features=True)

query_model = Query(match_phase=match_phase, rank_profile=rank_profile)

Query the vespa app

Send queries via the query API. See the query page for more examples.

We can use the query_model that we just defined to issue queries to the application via the query method.

query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?", 
    query_model=query_model
)

We can see the number of documents that were retrieved by Vespa:

query_result.number_documents_retrieved

And the number of documents that were returned to us:

len(query_result.hits)
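For orientation, these two numbers correspond to different parts of the JSON returned by the Vespa query API: the total number of matched documents is reported under root.fields.totalCount, while the hits actually returned are listed under root.children. The sketch below reads both from a hand-written sample response. The ids, titles and relevance values are invented, the summarize helper is hypothetical, and how pyvespa maps the response onto its result object is an assumption here rather than something taken from the library source.

```python
# Hand-written sample shaped like a Vespa query response (all values invented).
sample_response = {
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {"totalCount": 1046},
        "children": [
            {"id": "doc-1", "relevance": 20.1, "fields": {"title": "sample title 1"}},
            {"id": "doc-2", "relevance": 18.7, "fields": {"title": "sample title 2"}},
        ],
    }
}


def summarize(response):
    """Return (documents matched, hits returned) for a query response dict."""
    root = response.get("root", {})
    total = root.get("fields", {}).get("totalCount", 0)
    hits = root.get("children", [])
    return total, len(hits)


retrieved, returned = summarize(sample_response)
```

The number of matched documents is usually much larger than the number of hits returned, since only the top-ranked hits are sent back to the client.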

Labelled data

How to structure labelled data

We often need to either evaluate query models or to collect data to improve query models through ML. In both cases we usually need labelled data. Let’s create some labelled data to illustrate their expected format and their usage in the library.

Each data point contains a query_id, a query and relevant_docs associated with the query.

labelled_data = [
    {
        "query_id": 0, 
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1, 
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]

Non-relevant documents are assigned "score": 0 by default. Relevant documents are assigned "score": 1 by default if the field is missing from the labelled data. Both defaults can be changed through arguments of the appropriate methods.
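The default for relevant documents can be pictured with a small helper that fills in a missing score. This is only a sketch of the behaviour described above: with_default_scores is a hypothetical function written for this post, not part of pyvespa, and the sample data point is made up.

```python
def with_default_scores(labelled_data, default_score=1):
    """Return a copy of labelled_data in which every relevant doc has a score."""
    completed = []
    for data_point in labelled_data:
        docs = [
            {"id": doc["id"], "score": doc.get("score", default_score)}
            for doc in data_point["relevant_docs"]
        ]
        # Copy the data point so the input list is left untouched.
        completed.append({**data_point, "relevant_docs": docs})
    return completed


sample = [
    {
        "query_id": 0,
        "query": "example query",
        "relevant_docs": [{"id": 0}, {"id": 3, "score": 2}],
    }
]
completed = with_default_scores(sample)
```

The first document had no score and receives the default of 1, while the explicit score of 2 on the second document is preserved.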

Collect training data

Collect training data to analyse and/or improve ranking functions. See the collect training data page for more examples.

We can collect training data with the collect_training_data method according to a specific Query model. Below we will collect two documents for each query in addition to the relevant ones.

training_data_batch = app.collect_training_data(
    labelled_data = labelled_data,
    id_field = "id",
    query_model = query_model,
    number_additional_docs = 2,
    fields = ["rankfeatures"]
)

Many rank features are returned by default. We can select some of them to inspect:

training_data_batch[
    [
        "document_id", "query_id", "label", 
        "textSimilarity(title).proximity", 
        "textSimilarity(title).queryCoverage", 
        "textSimilarity(title).score"
    ]
]

Evaluating a query model

Define metrics and evaluate query models. See the evaluation page for more examples.

We will define the following evaluation metrics:

% of documents retrieved per query
recall @ 10 per query
MRR @ 10 per query

from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]
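To make clear what these metrics measure, here is a pure-Python sketch of their textbook definitions for a single query. The function names are made up for this example and the pyvespa classes may differ in details such as edge-case handling. Averaging the reciprocal rank over all queries gives the MRR.

```python
def match_ratio(number_retrieved, number_docs_in_corpus):
    """Fraction of the corpus that was matched by the query."""
    if number_docs_in_corpus == 0:
        return 0.0
    return number_retrieved / number_docs_in_corpus


def recall_at(k, ranked_ids, relevant_ids):
    """Fraction of the relevant documents found among the top k hits."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank_at(k, ranked_ids, relevant_ids):
    """1 / position of the first relevant hit in the top k, or 0 if none."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Invented ranking for one query: document ids in the order they were returned.
ranked = [7, 3, 9, 0]
relevant = [0, 3]

recall = recall_at(10, ranked, relevant)
reciprocal_rank = reciprocal_rank_at(10, ranked, relevant)
```

For this invented ranking both relevant documents appear in the top 10, so the recall is 1.0, and the first relevant document sits at position 2, so the reciprocal rank is 0.5.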

Evaluate:

evaluation = app.evaluate(
    labelled_data = labelled_data,
    eval_metrics = eval_metrics, 
    query_model = query_model, 
    id_field = "id",
)
evaluation

Your first step to improve the cord19 search application.

This is the first in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.

You can also run the steps contained here from Google Colab.

The team behind vespa.ai has built and open-sourced a CORD-19 search engine. Thanks to advanced Vespa features such as Approximate Nearest Neighbor search and Transformers support via ONNX, it comes with the most advanced NLP methodology applied to search that is currently available.

Our first step is to download relevance judgments to be able to evaluate current query models deployed in the application and to train better ones to replace those already there.

Download the data

The files used in this section can be found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgements data. Do not worry about what they are just yet; we will explore them soon.

!wget https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
!wget https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt

Parse the data

Topics

The topics file is in XML format. We can parse it and store the result in a dictionary called topics. We want to extract a query, a question and a narrative from each topic.

import xml.etree.ElementTree as ET

topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text        
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text

There are a total of 50 topics. For example, we can see the first topic below:

topics["1"]

{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
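The same parsing logic can be tried without downloading anything by feeding it a small inline XML string. The sketch below assumes the structure the loop above relies on (topic elements with a number attribute and query, question and narrative children); the root element name topics and the helper name parse_topics are assumptions, and the topic text is copied from the output above.

```python
import xml.etree.ElementTree as ET

# Inline sample mimicking the assumed layout of the topics file.
sample_xml = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
</topics>
"""


def parse_topics(xml_text):
    """Map each topic number to its query, question and narrative text."""
    root = ET.fromstring(xml_text.strip())
    return {
        topic.attrib["number"]: {
            field: topic.findtext(field)
            for field in ("query", "question", "narrative")
        }
        for topic in root.findall("topic")
    }


sample_topics = parse_topics(sample_xml)
```

One difference from the loop above: findtext returns None when a child element is missing, so every topic always has all three keys.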

Each topic has many relevance judgements associated with it.

Relevance judgements

We can load the relevance judgement data directly into a pandas DataFrame.

import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.

relevance_data.head()

We are going to remove two rows that have relevancy equal to -1, which I am assuming is an error.

relevance_data[relevance_data.relevancy == -1]

relevance_data = relevance_data[relevance_data.relevancy >= 0]

The plot below shows that there are quite a few relevance judgments for each topic and the number of relevant documents varies quite a lot across topics.

import plotly.express as px

fig = px.histogram(relevance_data, x="topic_id", color = "relevancy")
fig.show()

Next we will discuss how we can use this data to evaluate and improve the cord19 search app.
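As a rough preview of that step, the sketch below shows one possible way to turn topics and relevance judgements into a list of data points with query_id, query and relevant_docs fields. It is only an illustration under several assumptions: the helper name to_labelled_data is made up, the judgement rows and cord_uid values are invented placeholders rather than real data, plain tuples are used instead of a DataFrame to keep the example self-contained, and treating any relevancy above 0 as relevant and using the topic query as the query text are choices made for this example only.

```python
# Topic 1 as parsed earlier; only the query is needed here.
sample_topics = {"1": {"query": "coronavirus origin"}}

# Invented judgement rows: (topic_id, round_id, cord_uid, relevancy).
judgments = [
    (1, 1, "uid-aaa", 2),
    (1, 1, "uid-bbb", 0),
    (1, 2, "uid-ccc", 1),
    (1, 2, "uid-ddd", -1),
]


def to_labelled_data(topics, judgments):
    """Group the relevant judgements by topic into labelled data points."""
    relevant = {}
    for topic_id, _round_id, cord_uid, relevancy in judgments:
        # Skip irrelevant (0) and invalid (-1) judgements.
        if relevancy > 0:
            relevant.setdefault(str(topic_id), []).append(
                {"id": cord_uid, "score": relevancy}
            )
    return [
        {"query_id": topic_id, "query": topics[topic_id]["query"], "relevant_docs": docs}
        for topic_id, docs in relevant.items()
    ]


labelled = to_labelled_data(sample_topics, judgments)
```

Only the two rows with a positive relevancy end up in relevant_docs; the irrelevant row and the row with relevancy -1 are dropped.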

Thiago G. Martins

Data Scientist at Yahoo!


How to connect and interact with search applications from python