Some of the challenges of building an internet search engine

One of the objectives - indexing just good quality content

It is a truth universally acknowledged that most of the modern internet is rubbish. Even Google, which has reportedly seen over a trillion unique URLs[1], only keeps around "hundreds of billions"[2] of pages in its search index, suggesting even they chuck most of it out.

The approach taken here is to index just "the good stuff", rather than to try to index the whole internet with all of its garbage - better quality content in the index should be a factor which helps improve results quality. There are a number of techniques used to try to achieve this, primarily:

  • It only indexes user-submitted personal and independent sites, to try to make it more of a non-commercial and community effort.
  • There is a moderation layer, plus annual review process, to keep a form of curation.
  • Pages containing adverts are detected and heavily downranked, to try to eliminate the incentive for spam.
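As a rough illustration of the advert-detection idea - this is a hypothetical heuristic, not the actual implementation, and the host list is just an example - one simple approach is to look for well-known ad-serving hosts in the page source:

```python
# Hypothetical advert-detection heuristic (illustrative only): flag a
# page if its HTML references a known ad-serving host.
AD_HOSTS = (
    "pagead2.googlesyndication.com",  # Google AdSense
    "amazon-adsystem.com",            # Amazon ads
)

def looks_like_it_has_adverts(html: str) -> bool:
    """Return True if the page source references a known ad-serving host."""
    return any(host in html for host in AD_HOSTS)

print(looks_like_it_has_adverts(
    '<script src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>'
))  # True
```

A real detector would doubtless need a much longer host list and smarter matching, but even a crude signal like this is enough to downrank on.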

A less obvious potential quality issue, however, is what happens when indexing of a good site fails. You don't want the stale content remaining in the index, where it might be returned as a result and lead to a wasted click. To that end, I have looked through all the recent failures in the indexing log and tried to determine the cause of each. This post contains a summary of my findings.

The challenges of keeping the search index clean

Roughly 5% of all sites were not being indexed due to indexing errors. Of these indexing errors, the breakdown is as follows:

Reason               %
Site offline         62.0%
robots.txt blocked   19.0%
Cloudflare blocked    5.5%
Home page moved       5.5%
Other site issue      5.5%
Indexing issue        2.5%

To be honest, I had hoped that most of the indexing errors would be issues with my indexing code which I could simply fix to get the sites back in the index. However, it turns out that there was only one such issue[3]. Almost all of the rest were more fundamental problems.

Sites going offline

The number one source of indexing errors, by a significant margin, was simply sites becoming unreachable. There are a number of subcategories - the domain having expired, the site returning an HTTP error like 404 Not Found or 500 Internal Server Error, and so on - but the net result is the same: the site is effectively dead and no-one can see it.
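As a rough illustration, these unreachability subcategories could be bucketed something like this (the function and category names are illustrative, not the actual indexing code):

```python
# Illustrative only: map a fetch outcome (an HTTP status code, or an
# exception raised before any response arrived) onto the coarse
# failure categories used in the table above.
def classify_fetch(status=None, exc=None):
    if exc is not None:
        # Expired domains, DNS failures, refused connections etc. all
        # fail before any HTTP response is received
        return "Site offline"
    if status == 200:
        return "OK"
    if status in (404, 410):
        return "Home page moved"
    if status is not None and status >= 500:
        return "Site offline"
    return "Other site issue"

print(classify_fetch(exc=OSError("Name or service not known")))  # Site offline
print(classify_fetch(status=503))                                # Site offline
```

The point is that several superficially different failures (expired domain, dead server, 5xx errors) all collapse into the same "Site offline" bucket, which is why that category dominates.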

I had an early indicator of this problem when I originally seeded the search engine, using data from the Indie Map (see update: Seeding and scaling). Nearly 20% of that list had disappeared completely by the time I got to it.

Since that initial load, there have been several hundred more site submissions, stretching back well over a year, and many of those have since disappeared too. Sites on some platforms seem to be among the most short-lived, so I haven't been approving many of those recently.

I guess that is one of the pitfalls of focussing on personal and independent websites - these are the sorts of sites where the owners are more likely to lose interest or even run out of money for hosting.

robots.txt blocking indexing

The second most common source of indexing errors is the site being blocked by robots.txt, e.g. via a blanket:

User-agent: *
Disallow: /

I was a bit surprised by how many of these there were to be honest. There have even been cases where the owners have gone to the trouble of verifying their site ownership via Verified Add, but blocked the indexing (presumably unwittingly) via their robots.txt.

Part of the problem is that so much internet traffic now comes from bots which are generally doing more harm than good. Ironically, though, some of the most troublesome bots simply ignore the robots.txt directives, so the blanket User-agent: * with Disallow: / blocks the good bots but not the bad ones.
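For what it's worth, a compliant crawler can evaluate such a policy with Python's standard library; this sketch shows that the blanket rule above refuses every well-behaved user agent:

```python
from urllib.robotparser import RobotFileParser

# Parse the blanket "block everything" robots.txt shown above
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# A well-behaved crawler checks before fetching - and is refused...
print(rp.can_fetch("SearchMySiteBot/1.0", "https://example.com/"))  # False
# ...while a bad bot simply never makes this check at all.
```

Which is exactly the asymmetry: the policy only constrains crawlers that choose to honour it.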

Cloudflare blocking indexing

I had thought that this was going to be a bigger problem, given (for example) comments in “Cloudflare’s CAPTCHA replacement with FIDO2/WebAuthn is a bad idea”.

But it is a problem nonetheless. It is related to the robots.txt issue above: because bad bots ignore the robots.txt directives, Cloudflare are trying to use their position to block the bad bots at a lower level. As with many security-related issues, though, an "allow list" is stronger than a "deny list", which means new search engines such as mine are blocked by default.

I’ve logged Issue #46, "Indexing of some sites is blocked by Cloudflare", to request addition to Cloudflare’s "allow list", and hope to follow up in due course.

Home page moved & other site issues

The home page moving is a kind of link rot. Users submit a "home page", which is the point at which indexing starts, but if that link changes and a redirect isn’t set up, indexing will fail. For example, if someone submits their blog’s home page and subsequently moves the blog without setting up a redirect from the old to the new location, the indexer won’t be able to find the new location. Ideally people would submit their domain root as the home page, because that shouldn’t change, although one site has even changed domain.

Other site issues include very intermittent availability. I imagine some sites might be run on home servers, and/or might not have full monitoring and alerting.

I’m not sure there’s much I can do about these issues at this stage.

Partial solution

I’ve recently implemented better handling of multiple failed reindexes. This disables indexing for a site after two consecutive failed indexing attempts, and also deletes any already-indexed documents for that site.

This should be a reasonable partial solution for now. I say partial because it doesn’t periodically check whether indexing has become possible again, e.g. because the site has come back online or the robots.txt has been updated. The hope is that if site owners do make the changes to allow indexing again, they will resubmit, see the appropriate error message, and use the Contact form to let me know - which should be good enough for now. Another option would be for me to periodically re-enable disabled sites manually, and let the system automatically disable them again if they still can’t be indexed.
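The disable-after-two-consecutive-failures behaviour can be sketched roughly as follows (the data structures are illustrative, not the real schema):

```python
# Rough sketch of the reindex-failure handling described above: two
# consecutive failures disable indexing and purge the stale documents.
def record_reindex(site: dict, succeeded: bool) -> None:
    if succeeded:
        site["consecutive_failures"] = 0  # a success resets the counter
        return
    site["consecutive_failures"] = site.get("consecutive_failures", 0) + 1
    if site["consecutive_failures"] >= 2:
        site["indexing_enabled"] = False
        site["documents"] = []  # remove stale content from the index

site = {"indexing_enabled": True, "documents": ["page1", "page2"]}
record_reindex(site, succeeded=False)  # first failure: still enabled
record_reindex(site, succeeded=False)  # second failure: disabled, purged
print(site)  # {'indexing_enabled': False, 'documents': [], 'consecutive_failures': 2}
```

Note that a single success resets the counter, so only genuinely persistent failures trip the disable.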

Anyway, I thought it was worth a blog post to get across that building an internet search engine isn’t that easy (these are just some of the challenges), but also to reflect on some of the wider issues we have with the modern web.

  1. “[in 2008] our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web” according to ↩︎

  2. “The Google Search index contains hundreds of billions of web pages” according to ↩︎

  3. A small number of sites which didn’t like the original user agent string “indexer (+”, but did like the new user agent string “Mozilla/5.0 (compatible; SearchMySiteBot/1.0; +”. ↩︎