Relevancy tuning for searchmysite.net

5 Dec 2020

Introduction to relevancy tuning for searchmysite.net

This post contains details of the most recent round of relevancy tuning for searchmysite.net. I’ve decided to dedicate a whole post to the subject, given how important but under-appreciated the topic is.

It is surprisingly difficult to find much good information about relevancy tuning on the internet, unless it is hidden away on some hard-to-find personal websites somewhere. For most of the big sites the scoring algorithm is opaque, perhaps to try to retain an advantage in the game of cat and mouse with the Search Engine Optimisation practitioners. This of course isn’t a concern for searchmysite.net, with its model designed to both keep spam out and remove the financial incentive for spam.

There is the excellent Relevant Search book by Doug Turnbull and John Berryman, and I see that there are now some books on applying ML and AI techniques to search which I’d like to read at some point. Note that some of these AI/ML techniques are applied at the “re-ranking” stage (i.e. fine tuning the ranking of a set of already ranked results) because they are more computationally expensive, so you have to get the basic ranking about right first. Note also that many AI/ML techniques will rely on capturing as much information about the user as possible, which is something that this privacy-aware site is not planning on doing (it’ll be interesting to see quite how good a search which knows nothing about the user will ever be able to get).

As per previous posts, I began with some incredibly simple relevancy tuning just to get things going. By the time there were a few hundred sites, I introduced “indexed inlinks” to try to act a little like the PageRank algorithm. Since then, a few hundred more sites have been added, plus there has been some useful user feedback, so it was time to review again. It really is a never-ending piece of work, so this post isn’t documenting anything like the end-state, just some more of the journey.

Background

Feedback on existing relevancy tuning

One of the difficulties with relevancy tuning is knowing what you’re aiming for, i.e. knowing what good results look like. Ideally you’d engage some active users to identify relevancy signals (i.e. the sorts of data that would be useful to capture for the relevancy tuning), perhaps build an interface for A/B testing, build a dataset of test queries and ideal results, etc. But that wasn’t possible here at this time. I did however get two pieces of useful user feedback:

A search for “hobbies” returned an effectively blank page entitled “Hobbies” at the very top of the results. When I say blank page, it wasn’t blank in the HTML sense, because it had a header and footer which contained words that the search engine was indexing, but was blank in terms of having a title but no content.
A search for “mastodon” didn’t return some of the more well known results as near the top as you’d expect. These pages were the ones you might have seen before joining Mastodon, or one of the “Introduction to Mastodon” pages you’d typically receive a link to when joining a Mastodon server.

So these were a couple of good cases to get started with.

A short summary of Solr’s scoring model

It’s worth a quick recap of Solr’s base scoring model first. It defaults to BM25, which computes inverse document frequency (idf) * term frequency (tf), where idf is:

log(1 + (N - n + 0.5) / (n + 0.5))

and tf is:

freq / (freq + k1 * (1 - b + b * dl / avgdl))

and where N is the total number of documents with field, n is the number of documents containing term, freq is the occurrences of term within document, k1 the term saturation parameter, b the length normalization parameter, dl the length of field, and avgdl the average length of field.
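As a rough illustration, the two formulas above can be written out in a few lines of Python. This is only a sketch for experimenting with the numbers, not Solr’s actual implementation. The defaults k1=1.2 and b=0.75 are the usual Lucene BM25 defaults and are an assumption here, and the example figures at the bottom are made up.

```python
import math


def bm25_idf(N: int, n: int) -> float:
    """Inverse document frequency: higher when fewer documents contain the term."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Term frequency component: saturates as freq grows, shrinks for long fields."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(boost: float, N: int, n: int, freq: float, dl: float, avgdl: float) -> float:
    """Score for one field, computed as boost * idf * tf."""
    return boost * bm25_idf(N, n) * bm25_tf(freq, dl, avgdl)


# Made-up figures: the same term in a short field and in a long field.
rare_in_short_field = bm25_score(boost=1.0, N=1000, n=2, freq=1, dl=3, avgdl=6)
rare_in_long_field = bm25_score(boost=1.0, N=1000, n=2, freq=1, dl=60, avgdl=6)
print(rare_in_short_field > rare_in_long_field)
```

The thing to notice is that a rare term in a very short field scores highly on both idf and tf at once.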

If you specify Query Fields (qf), Solr calculates a separate score for each of the fields in the qf where there is a match (i.e. title if the term is in the title, description if the term is in the description etc.), and then uses the highest of these.

If you specify a Boost Query (bq) and/or Boost Function (bf), it calculates the score for the bq and/or bf, and adds it to the qf score if a qf is specified or the base score if qf is not specified. In this way bq and bf are said to be additive.

The earlier relevancy tuning configuration

As per searchmysite.net: Building a simple search for non-commercial websites, the initial formula was:

      <str name="qf">title^1.5 tags^1.2 description^1.2 url^1.2 author^1.1 body</str>
      <str name="pf">title^1.5 tags^1.2 description^1.2 url^1.2 author^1.1 body</str>
      <str name="bq">is_home:true^2.5 contains_adverts:false^15</str>

This was little more than a placeholder to get started, given there were only a handful of sites being indexed so it wasn’t clear what good results would be or how to test anything particularly sophisticated.

As per searchmysite.net update: Seeding and scaling, this was updated to:

      <str name="qf">title^1.5 tags^1.3 description^1.3 url^1.3 author^1.1 body</str>
      <str name="pf">title^1.5 tags^1.3 description^1.3 url^1.3 author^1.1 body</str>
      <str name="bq">contains_adverts:false^18 owner_verified:true^1.8</str>
      <str name="bf">sum(1,log(indexed_inlinks_count))</str>

The “blank” page problem

Existing scores from Solr’s debugQuery and debug.explain.structured

In the case of the “blank” page, running a debugQuery with the most recent qf, pf, bq and bf set as per above, and removing the parts of the debugQuery we’re not interested in, the explain was:

        "value":8.09684,
        "description":"sum of:",
        "details":[{
            "value":8.061661,
            "description":"max of:",
            "details":[{
                "value":8.061661,
                "description":"weight(title:hobbies in 3) [SchemaSimilarity], result of:",
                "details":[{
                    "value":8.061661,
                    "description":"score(freq=1.0), computed as boost * idf * tf from:",
                    "details":[{
                        "value":1.5,
                        "description":"boost"},
                      {
                        "value":9.458829,
                        "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details":[{
                            "value":3,
                            "description":"n, number of documents containing term"},
                          {
                            "value":44872,
                            "description":"N, total number of documents with field"}]},
                      {
                        "value":0.56819296,
                        "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details":[{
                            "value":1.0,
                            "description":"freq, occurrences of term within document"},
                            ...
                          {
                            "match":true,
                            "value":3.0,
                            "description":"dl, length of field"},
                          {
                            "match":true,
                            "value":5.870008,
                            "description":"avgdl, average length of field"}]}]}]},
              {
                "value":4.6403437,
                "description":"weight(body:hobbies in 3) [SchemaSimilarity], result of:",
                ...
          {
            "value":0.03517885,
            "description":"weight(contains_adverts:F in 3) [SchemaSimilarity], result of:",
            ...
          {
            "value":0.0,
            "description":"FunctionQuery(sum(const(1),log(int(indexed_inlinks_count)))), product of:",
                "match":true,
                "value":"-Infinity",
                "description":"sum(const(1),log(int(indexed_inlinks_count)=0))"},
              {
                "value":1.0,
                "description":"boost"}]}]},

This shows it was getting a really high inverse document frequency score (9.458829) for title, because there are only 3 documents out of 44,872 documents with “hobbies” in the title, and it was also getting a very high term frequency score (0.56819296) for title, because the title only had 3 words one of which was “hobbies”. Multiplying the 1.5 boost by idf and tf gave 8.061661 for “hobbies” in title. Running the same formula for body gave a lower score (4.6403437), so it used the value 8.061661 for the qf because it takes the “max of” these. To that it added 0.03517885 for the bq score (given the result didn’t contain adverts), and added another 0.0 for the bf score (given there were no indexed inlinks), giving a total score of 8.09684.

In contrast, the second top result had a total of only 5.588503. The highest scoring field for qf was description with 4.5533237, comprising 1.3 boost * 7.8957644 for idf * 0.44359946 for tf. Note that even without the boost and the bq and bf scores, both the idf and the tf were still quite a bit lower, because the term wasn’t quite so rare in the description and the description had a higher percentage of words which were not the keyword. The final score of 5.588503 was the 4.5533237, plus another 0.03517885 for lack of adverts, and 1.0 because there was one indexed inlink.
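The arithmetic for both results can be checked with a short Python sketch. It takes the boost, idf, tf, bq and bf values straight from the explain output quoted above, and the helper names are just illustrative. For the second result only the description score was quoted, so it is passed in as the sole field score.

```python
def field_score(boost: float, idf: float, tf: float) -> float:
    """Per-field score, computed as boost * idf * tf."""
    return boost * idf * tf


def total_score(field_scores: list[float], bq: float, bf: float) -> float:
    """Highest field score ("max of"), plus the additive bq and bf scores."""
    return max(field_scores) + bq + bf


# The "blank" page: title beats body, no indexed inlinks so bf contributes 0.0.
blank_title = field_score(1.5, 9.458829, 0.56819296)
blank_body = 4.6403437
blank_total = total_score([blank_title, blank_body], bq=0.03517885, bf=0.0)

# The second result: description was the highest scoring field, one indexed inlink.
second_description = field_score(1.3, 7.8957644, 0.44359946)
second_total = total_score([second_description], bq=0.03517885, bf=1.0)

print(round(blank_total, 3), round(second_total, 3))
```

Even with a full extra point from the bf, the second result stays well behind the “blank” page.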

The problem was, in short, that the keyword was a rarely occurring term, which appeared with very high frequency in the “problem” document title, making it look like a good match. No amount of tuning the existing parameters could compensate for the fact that this document was going to score highly. Even if you boosted the description and indexed inlinks so the second result came top, the “bad” result would still linger near the top, and such a change might make a greater number of other results worse. An alternative approach was therefore required.

Adding a new content field

The first step was to look at what was being indexed and searched. The fields being searched were title, tags, description, url, author and body. Body was worth looking at further given that is the field that contains the “blank” page. The body field is built from the whole <body> tag, with tags removed and the content of script and style tags also removed. The body field therefore still contained a lot of text for the navigation and footer.

So the first change was to introduce a new field which tries to strip that out. I called the field “content”. It is built from the <body> tag, with the contents of any <nav>, <header>, and <footer> tags removed, and using just the contents of the <main> or <article> tags if present.
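To give a feel for how such a field could be built, here is a minimal sketch using only Python’s standard library html.parser. It is not the indexer’s actual code, and the class and function names are hypothetical. It skips <head> as well so that the page title doesn’t leak into the text, which is an assumption of this sketch.

```python
from html.parser import HTMLParser

# Text inside these tags is ignored entirely.
SKIP_TAGS = {"head", "nav", "header", "footer", "script", "style"}
# If any of these tags are present, only their text is used.
PREFERRED_TAGS = {"main", "article"}


class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.preferred_depth = 0
        self.all_text = []
        self.preferred_text = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in PREFERRED_TAGS:
            self.preferred_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in PREFERRED_TAGS and self.preferred_depth > 0:
            self.preferred_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth > 0:
            return
        self.all_text.append(text)
        if self.preferred_depth > 0:
            self.preferred_text.append(text)


def extract_content(html: str) -> str:
    """Return the text of <main>/<article> if present, else the remaining body text."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.preferred_text or parser.all_text)


blank_page = ("<html><head><title>Hobbies</title></head><body>"
              "<header><h1>Hobbies</h1></header><nav>Home About</nav>"
              "<footer>Contact</footer></body></html>")
print(repr(extract_content(blank_page)))  # ''
```

A page whose only text sits in the header, navigation and footer ends up with an empty content value.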

Although the “blank” page had a big <h1> comprising the word “Hobbies”, that was inside a <header> tag, so was removed. In fact, the content field was completely empty in the case of this page. The solution was therefore to give pages with an empty content field a much lower score, e.g. via if(exists(content),1,0.5) which gives a score of 1 if there is a content field and 0.5 if not.

This not only solved this particular issue, but will also be beneficial in other cases where the content field is blank. Searching just the “content” rather than the whole body will also help with another issue I’d previously noticed, which is that searching for terms which commonly appear in headers and footers, e.g. searching for the word “search”, didn’t yield especially useful results. So good all round. The use of the body field should probably be phased out, given it is taking up a fair bit of storage and most likely isn’t going to be used anywhere any more.

The “well known results” problem

From indexed inlinks to index inlink domains

The problem statement is that the “well known” pages about Mastodon should come at the top of the results for a search for “mastodon”. The “well known” pages are the ones you might have read before joining Mastodon, or been given a link to when joining a Mastodon server.

In theory, the solution should be to give the pages with more links to them a greater prominence in the results. Unfortunately, looking at the data in this case, the “well known” pages didn’t have many “indexed inlinks”, i.e. links from other indexed sites. That might be because the links to those well-known pages are on non-indexed sites, e.g. other social media platforms, and Mastodon itself. Unfortunately I can’t see a “natural” solution to this problem, i.e. one that doesn’t involve “artificial” solutions such as admins manually boosting or all users contributing to a collaborative search (neither of which should be completely ruled out at some future point BTW).

However, it did inspire looking more closely at the “indexed inlinks”, which I still believe should be one of the main search signals at this stage.

One of the things I noticed was that some sites had links to other sites in their page templates, e.g. in the footer, so if they had 100s of pages on their site it meant the destination ended up with 100s of indexed inlinks. It turned out that this didn’t have as big a negative impact as you might expect because (i) using log(indexed_inlinks_count) significantly scaled down large values, and (ii) the bf is additive, i.e. added to the base score, so in some cases its effect was “drowned out” by high qf scores.

The first step in remediating this was to introduce the new fields indexed_inlink_domains and indexed_inlink_domains_count, which (as their names suggest) just count the unique domains that link to a page rather than all the pages that link to a page. The idea is that indexed_inlink_domains_count will be a much more valuable signal than indexed_inlinks_count. This would change sum(1,log(indexed_inlinks_count)) to sum(1,log(indexed_inlink_domains_count)).
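The difference between the two counts is easy to see in a small Python sketch. The field names are the ones above, but the function and the example URLs are hypothetical, and taking the domain from the URL’s network location is an assumption about how a domain might be identified.

```python
from urllib.parse import urlparse


def inlink_fields(inlink_urls: list[str]) -> dict:
    """Derive page-level and domain-level inlink fields from the linking page URLs."""
    domains = sorted({urlparse(url).netloc.lower() for url in inlink_urls})
    return {
        "indexed_inlinks_count": len(inlink_urls),
        "indexed_inlink_domains": domains,
        "indexed_inlink_domains_count": len(domains),
    }


# One site linking from a footer on three pages, plus one link from another site.
fields = inlink_fields([
    "https://blog.example.com/post-1",
    "https://blog.example.com/post-2",
    "https://blog.example.com/post-3",
    "https://other.example.org/links",
])
print(fields["indexed_inlinks_count"], fields["indexed_inlink_domains_count"])  # 4 2
```

A footer link repeated across 100s of pages would still only add one to the domain count.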

From additive scoring to multiplicative scoring

Given indexed_inlink_domains_count should be a consistently good signal, it would be good to ensure it is applied consistently. However, as noted in searchmysite.net update: Seeding and scaling and above, the bf is additive and so skews results in ways that can be more difficult to predict, e.g. if the qf or base scores for the first two results are 9 for the first result and 5.5 for the second result then adding a bf score of 3 to the second result won’t change the rankings, but if the qf or base scores for the first two results are 8 and 5.5 then adding a bf score of 3 to the second result would make it become the first result.
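The example in the previous paragraph can be played through in Python. The first two calls use the figures above. The comparison with a multiplier of 1.5, and the base scores scaled up by ten, are made-up figures added to show how an additive score gets drowned out by high base scores while a multiplier does not.

```python
def additive_order(first_base: float, second_base: float, second_bf: float) -> list[str]:
    """Ranking when a bf score is added to the second result."""
    scores = {"first": first_base, "second": second_base + second_bf}
    return sorted(scores, key=scores.get, reverse=True)


def multiplicative_order(first_base: float, second_base: float, second_boost: float) -> list[str]:
    """Ranking when the second result's base score is multiplied by a boost."""
    scores = {"first": first_base, "second": second_base * second_boost}
    return sorted(scores, key=scores.get, reverse=True)


print(additive_order(9, 5.5, 3))   # ['first', 'second']
print(additive_order(8, 5.5, 3))   # ['second', 'first']

# Same proportions, ten times the base scores: the additive 3 no longer matters...
print(additive_order(80, 55, 3))            # ['first', 'second']
# ...whereas a multiplier gives the same ranking at either scale.
print(multiplicative_order(8, 5.5, 1.5))    # ['second', 'first']
print(multiplicative_order(80, 55, 1.5))    # ['second', 'first']
```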

The solution here is to use boost rather than bf, because the boost score multiplies the base score rather than adds to it. We also need to change the formula from sum(1,log(indexed_inlink_domains_count)) to sum(1, log(sum(1, indexed_inlink_domains_count))) to ensure that at the very least (i.e. if the indexed_inlink_domains_count is 0) the base score is multiplied by 1 (i.e. kept the same) rather than multiplied by 0 or negative infinity (which would make the result completely disappear). Given it should be such a strong signal, I’ve also amplified the indexed_inlink_domains_count a little: sum(1, log(sum(1, product(indexed_inlink_domains_count, 1.8)))).
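A quick Python sketch shows why the inner sum(1, ...) matters once the value is used as a multiplier. The helper names are hypothetical. solr_log mimics a base-10 log that yields negative infinity at 0, as seen in the explain output earlier, since Python’s own math.log10(0) raises an error instead.

```python
import math


def solr_log(x: float) -> float:
    """Base-10 log returning -inf at 0 rather than raising, as in the explain output."""
    return math.log10(x) if x > 0 else float("-inf")


def old_inlink_factor(count: int) -> float:
    """sum(1, log(count)): fine when added, unusable as a multiplier when count is 0."""
    return 1 + solr_log(count)


def new_inlink_factor(count: int, amplify: float = 1.8) -> float:
    """sum(1, log(sum(1, product(count, 1.8)))): never drops below 1."""
    return 1 + solr_log(1 + count * amplify)


for count in (0, 1, 5, 100):
    print(count, old_inlink_factor(count), round(new_inlink_factor(count), 3))
```

With no inlink domains the new factor is exactly 1, so the base score is left unchanged.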

Similar logic can be applied to the contains_adverts and owner_verified fields - if a result contains adverts we want to consistently downrank, and if it is owner verified we want to consistently uprank, again taking care not to multiply the base score by 0 or negative infinity.

The new relevancy tuning formula

So, putting all of this together, here’s the new formula:

      <str name="qf">title^1.1 description^1.05 author^1.05 tags url content</str>
      <str name="pf">title^1.1 description^1.05 author^1.05 tags url content</str>
      <str name="boost">product( sum(1, log(sum(1, product(indexed_inlink_domains_count, 1.8)))), if(contains_adverts,0.5,1), if(owner_verified,1.1,1), if(exists(content),1,0.5) )</str>

This tunes down the qf and pf field boosts a little, completely removes the additive bq and bf boosts, and puts the key additional signals in a multiplicative boost. The multiplicative boost also multiplies rather than adds its components.

If there are no indexed inlink domains, no adverts, it is not owner verified, and there is a content field, the score is multiplied by 1 * 1 * 1 * 1, i.e. 1. If there are 5 indexed inlink domains, the score will be multiplied by 2, i.e. 1 + log(1 + 5 * 1.8) = 1 + log(10), given Solr’s log function is base 10. If there are adverts, the score will be multiplied by 0.5. And so on.
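For anyone wanting to try other combinations, the boost formula translates to Python fairly directly. This is a sketch of the arithmetic only, not something Solr runs. Treating a missing or empty content value as “does not exist” is an assumption about how exists(content) behaves for such documents.

```python
import math
from typing import Optional


def boost(indexed_inlink_domains_count: int,
          contains_adverts: bool,
          owner_verified: bool,
          content: Optional[str]) -> float:
    """Multiplier applied to the base score, mirroring the boost formula above."""
    inlinks = 1 + math.log10(1 + indexed_inlink_domains_count * 1.8)
    adverts = 0.5 if contains_adverts else 1
    verified = 1.1 if owner_verified else 1
    has_content = 1 if content else 0.5
    return inlinks * adverts * verified * has_content


print(boost(0, False, False, "some text"))            # 1.0
print(round(boost(5, False, False, "some text"), 3))  # 2.0
print(boost(0, True, False, "some text"))             # 0.5
print(boost(0, False, False, None))                   # 0.5
```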

A note on owner_verified

One thing I mentioned in the “Improving the relevancy tuning” section of searchmysite.net update: Seeding and scaling is worth repeating and clarifying: A higher placing in the results is not one of the benefits of verifying ownership of your site, e.g. it is not mentioned on Quick Add, nor is there any plan to list it as a benefit anywhere. The only reason the owner_verified field is in the boost formula is that it might help improve the ranking. If it turns out it doesn’t improve it, it can be removed. The absolute priority in the relevancy tuning is to improve the quality of results, and nothing else.

Results

Looking at the results from the queries in the original feedback:

“hobbies” - I think the results here are better, with the “blank” page gone, and more hobby related content higher.
“mastodon” - I think these are improved too, e.g. returning the site of a developer who has developed a Mastodon client which was not on the first page of results before.

Also, looking at the most popular single term queries from the access logs (in alphabetical order):

linux
openbsd
rust
test

And the most popular phrases (also in alphabetical order):

brewing beer
minimal css

I think results are also mostly the same or slightly better, although in one case (a search for “test”) a test page with almost no content is returned. This isn’t picked up by the “blank” page change because the actual <article> tag includes the words “test” and “testing”, making it score highly on both the inverse document frequency and the term frequency. Hopefully as more content is indexed, and more pages are boosted with the likes of indexed inlink domains, these edge cases will fall down the results naturally.

Next steps

There are lots of ideas for other possible future changes, e.g. in the short to medium term:

Recency - adjusting the scores so pages last modified beyond a certain time, e.g. 2 years, are downranked on a scale according to age. This of course could be a small amount to allow other factors such as indexed inlink domains to compensate. Part of the issue with this is that only around 45% of the pages which are currently being indexed return a last modified date, so it might be necessary to do something more complicated like keep track of version history to calculate last modified date.
Page length - downrank pages shorter than a certain length, e.g. 140 characters. This would help with the “test” and “testing” case above, and might also help lower the micro-blogging style content which can sometimes be of less long term interest.
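Purely as a thought experiment, these two ideas could become extra multipliers alongside the existing ones. Nothing here is implemented. The 2 year and 140 character thresholds come from the ideas above, while the function names, the penalty sizes and the floor are invented for illustration.

```python
from typing import Optional


def recency_factor(age_years: Optional[float],
                   threshold_years: float = 2.0,
                   yearly_penalty: float = 0.05,
                   floor: float = 0.5) -> float:
    """1.0 up to the threshold (or if the age is unknown), then a gentle linear decline."""
    if age_years is None or age_years <= threshold_years:
        return 1.0
    return max(floor, 1.0 - yearly_penalty * (age_years - threshold_years))


def length_factor(content: Optional[str], min_chars: int = 140, penalty: float = 0.5) -> float:
    """Downrank pages whose content is shorter than min_chars."""
    return 1.0 if len(content or "") >= min_chars else penalty


print(recency_factor(1.0), recency_factor(None), length_factor("test testing"))  # 1.0 1.0 0.5
```

Returning 1.0 when the age is unknown avoids penalising the pages which don’t return a last modified date.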

And of course, in the longer term, there are lots of other possibilities, e.g. the advanced AI/ML, possibly also capturing clicks on links to feed in popularity, maybe some kind of collaborative search, or some other ideas, all subject to being able to do so in a privacy aware way of course.

But before spending further time on relevancy tuning, it would be great to get more user feedback on whether these, or other changes, would be worthwhile.