searchmysite.net blog

Five year anniversary

Today marks the 5th anniversary of the launch of searchmysite.net, and it is also a year since the last blog entry (which was the Four year retrospective), so now seems a good time to summarise progress.

The original objective of searchmysite.net was the ambitious “search just the good stuff” (it even indexed Wikipedia until that was stopped to cut costs). That was changed to “search non-commercial sites”, but few people seemed to know what “non-commercial sites” were, so it became “search personal and independent sites”. However, in order to give the site more focus, and in line with the Unix philosophy of trying to “do one thing and do it well”, independent sites are now being removed. I’m really sorry to see some of my favourite independent sites go, but searching all good personal sites is hopefully a much more achievable goal for a spare-time project than searching all good personal and independent sites.

If someone wants to set up an instance just for independent sites I’d be happy to send the list for an initial import. That might even provide an opportunity to resurrect the original idea of building one front-end to pull together results from multiple search instances, each indexing its own specialist niche, thereby allowing a federated search to be built across all the small islands of “good stuff” that do remain if you know where to find them.
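As a rough illustration of the federated front-end idea, here is a minimal sketch that interleaves ranked results from several search instances and drops duplicates. Everything in it is an assumption for illustration: the `federated_search` function, the two stand-in backends and their example URLs are hypothetical, and round-robin interleaving is just one simple way to merge results, not something searchmysite.net implements.

```python
from itertools import zip_longest
from typing import Callable, Iterable

# Hypothetical type: a backend takes a query and returns result URLs, best first.
Backend = Callable[[str], list[str]]


def federated_search(query: str, backends: Iterable[Backend], limit: int = 10) -> list[str]:
    """Interleave ranked results from several backends, dropping duplicates."""
    ranked_lists = [backend(query) for backend in backends]
    merged: list[str] = []
    seen: set[str] = set()
    # zip_longest walks rank 1 of every backend, then rank 2, and so on,
    # padding shorter lists with None.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


# Stand-in backends with made-up results; real ones would query each
# search instance over HTTP.
def personal_sites(query: str) -> list[str]:
    return ["https://alice.example/post", "https://bob.example/notes"]


def independent_sites(query: str) -> list[str]:
    return [
        "https://zine.example/issue-1",
        "https://alice.example/post",
        "https://coop.example/",
    ]


results = federated_search("static site generators", [personal_sites, independent_sites])
print(results)
```

Interleaving by rank avoids having to compare relevance scores across instances, which would not be directly comparable if each instance had its own index and scoring.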

Usage levels: Hitting the Hacker News home page, then returning to normal

The usage stats from the past 12 months have been dominated by the spike in March 2025:

[Chart: searchmysite.net analytics, July 2024 - July 2025]

That’s from the post “Search My Site – open-source search engine for personal and independent websites”, which made the Hacker News (HN) home page on 25 Mar 2025. This generated some useful feedback, my favourite being “I like this, thank you! I just lost an hour of time to the exact sort of random but considered personal websites that I think made the Web great in the first place”, which I think sums up what searchmysite.net is trying to achieve: making “surfing the web” an enjoyable and at times enriching leisure activity once more.

Compared to the blog post “Almost all searches on my independent search engine are now from SEO spam bots”, which was also popular on HN 3 years earlier, more people did seem to get the point of searchmysite.net this time around (previously a lot of people saw a search box and expected it to be a whole-internet search). One interesting difference is that 3 years ago I had loads of direct emails, but this time there were no emails whatsoever (and only 1 direct message on Mastodon).

The server also handled the load just fine.

Usage has now largely returned to normal (approx 30 real users a day and approx 1000 a month), and unfortunately the activity in March didn’t lead to new subscriptions.

Milestone: 3000th site indexed

There were however over 200 new sites submitted in the space of just a few days, leading to another milestone - the 3000th site was indexed on 25 Mar 2025. This is good because more content should lead to better search results.

For reference, the 1000th site was on 12 Mar 2022 and the 2000th site was on 16 Dec 2023, which is a reminder that the user-submitted-sites model is something of a slow grower.
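To put numbers on the pace, here is a small sketch that works out the gap between the three milestone dates mentioned above (1000th, 2000th and 3000th site). The `days_between_milestones` helper is just an illustrative name, and the sites-per-day figure is a rough average of net indexed sites, not a count of submissions.

```python
from datetime import date

# Milestone dates as given in this post: (site count, date reached).
MILESTONES = [
    (1000, date(2022, 3, 12)),
    (2000, date(2023, 12, 16)),
    (3000, date(2025, 3, 25)),
]


def days_between_milestones(milestones):
    """Return (from_count, to_count, days) for each consecutive pair."""
    return [
        (prev_n, next_n, (next_d - prev_d).days)
        for (prev_n, prev_d), (next_n, next_d) in zip(milestones, milestones[1:])
    ]


gaps = days_between_milestones(MILESTONES)
for prev_n, next_n, days in gaps:
    rate = (next_n - prev_n) / days
    print(f"{prev_n} -> {next_n}: {days} days, about {rate:.2f} sites per day")
```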

It is also worth noting that although there are (currently) 3,271 sites indexed, there are actually (currently) 4,327 sites in the database, and a part of that difference is sites which were indexed and then later deindexed. As discussed in “Some of the challenges of building an internet search”, the top 2 reasons for sites being deindexed are sites going offline (currently 187), and sites starting to block indexing via their robots.txt (currently 15). It will be interesting to see if the robots.txt blocking increases over time with sites trying to block the big AI crawlers.
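For anyone curious what a robots.txt block looks like from the crawler's side, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is not necessarily how searchmysite.net performs the check, and the `is_indexing_allowed` helper, the example robots.txt and the bot names `ExampleAIBot` and `ExampleSearchBot` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_indexing_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given robots.txt content lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt that blocks one named crawler entirely and only
# restricts /private/ for everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

ai_bot_ok = is_indexing_allowed(EXAMPLE_ROBOTS, "ExampleAIBot", "https://example.com/")
search_bot_ok = is_indexing_allowed(EXAMPLE_ROBOTS, "ExampleSearchBot", "https://example.com/blog/")
print(ai_bot_ok, search_bot_ok)
```

A site that only names specific AI crawlers, as in this example, would still be indexable by a small search engine; a blanket `Disallow: /` under `User-agent: *` would not.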

Further work to disable embeddings and LLM

In 2023 vector embeddings and a self-hosted LLM were deployed to experiment with “retrieval augmented generation”. Unfortunately the results, at least with the small self-hosted LLM, were not that great. It also added to running costs and complexity, which is a particular concern given that most days around 99.99% of traffic is just bots rather than real people. So the functionality was never fully released. Further work has been undertaken in 2025 to remove this.

I’m still hoping that at some point in future the size and speed of self-hostable LLMs will improve such that this will be a viable option and democratise AI/ML, but until then it seems AI/ML is still just for the big budget projects.

Running costs have remained fairly low thanks to the move to Hetzner’s cheaper ARM servers, but the number of subscriptions has fallen to the lowest levels since the first year:

Year                  Expenses   Income
Jul 2020 – Jun 2021    £351.57   £57.25
Jul 2021 – Jun 2022    £456.04   £80.51
Jul 2022 – Jun 2023     £94.51  £137.32
Jul 2023 – Jun 2024    £174.79  £125.06
Jul 2024 – Jun 2025    £138.39   £68.46
Totals:               £1215.30  £468.60

So unfortunately it is falling short of covering the running costs.
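The year-by-year picture can be worked out from the rows of the table above. This is only a small sketch of the arithmetic: the figures are copied from the table, while the `YEARS` list and the variable names are just illustrative. `Decimal` is used so the pounds and pence don't pick up floating-point rounding.

```python
from decimal import Decimal

# (period, expenses, income) in GBP, copied from the yearly rows of the table.
YEARS = [
    ("Jul 2020 – Jun 2021", Decimal("351.57"), Decimal("57.25")),
    ("Jul 2021 – Jun 2022", Decimal("456.04"), Decimal("80.51")),
    ("Jul 2022 – Jun 2023", Decimal("94.51"), Decimal("137.32")),
    ("Jul 2023 – Jun 2024", Decimal("174.79"), Decimal("125.06")),
    ("Jul 2024 – Jun 2025", Decimal("138.39"), Decimal("68.46")),
]

# Net position per year: positive means income covered the running costs.
for period, expenses, income in YEARS:
    net = income - expenses
    status = "covered" if net >= 0 else "shortfall"
    print(f"{period}: {net:+.2f} ({status})")

# Periods in which income met or exceeded expenses.
covered_periods = [period for period, expenses, income in YEARS if income >= expenses]

# Shortfall for the most recent year (expenses minus income).
latest_shortfall = YEARS[-1][1] - YEARS[-1][2]
print("Covered:", covered_periods)
print("Latest shortfall: £" + str(latest_shortfall))
```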

All major dependencies are now up-to-date

Finally, with releases 1.4.19 and 1.4.20 earlier in July 2025, all the major dependencies are now on their latest versions:

  • Solr 9.8.1
  • Scrapy 2.13.3
  • Postgres 17.5
  • Apache httpd 2.4.63
  • Flask 3.1.1
  • Bootstrap 5.3.7

I suspect keeping everything up-to-date over that period of time was easier with these technologies than it would have been if I had used whatever the flavour-of-the-week was 5 years ago:-)