Three year retrospective

Today is the 3 year anniversary [1] of the project (the open source search engine for the indieweb / small web / digital gardens, which heavily downranks pages with adverts and aims to pay its running costs via a listing fee and search-as-a-service), so it seems like a good opportunity for a quick recap of progress so far and some hints at what is likely ahead.

Progress so far

Highlights from the past 3 years

  • It was fully open sourced in Dec 2020.
  • I’ve had good feedback from real users, so I know it is still helping some real people find real content that is difficult to find elsewhere, and even that the search-as-a-service is proving useful [2].
  • It now has nearly 1,800 sites listed (although if 0.01% of the world’s population has an actively updated personal web site, that still means it has just 0.2% of actively updated personal web sites).
  • The English-language Wikipedia was added to the index and, although it was later removed for cost reasons, that does show the system can handle tens of millions of documents, in addition to hundreds of thousands of (admittedly mostly spam bot) searches a day.
  • Every results view now has an RSS output, so you can e.g. subscribe to search queries for your favourite topics or even make an RSS feed for a site which doesn’t already have one.
  • A number of alternative search engines have since been launched, both commercial ones with big funding and independent ones, which does suggest there is real interest in new approaches to search.
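
The coverage estimate in the list above (nearly 1,800 listed sites against 0.01% of the world's population) can be checked with a quick calculation. This is only a rough sketch: the world population figure of 8 billion is my assumption, not a number from the project.

```python
# Rough check of the coverage estimate from the highlights list.
WORLD_POPULATION = 8_000_000_000   # assumption: approximate 2023 world population
SHARE_WITH_PERSONAL_SITE = 0.0001  # the 0.01% guess used in the text
LISTED_SITES = 1_800               # "nearly 1,800 sites listed"


def coverage_percent(listed, population, share):
    """Return listed sites as a percentage of the estimated personal sites."""
    estimated_sites = population * share
    return 100 * listed / estimated_sites


if __name__ == "__main__":
    pct = coverage_percent(LISTED_SITES, WORLD_POPULATION, SHARE_WITH_PERSONAL_SITE)
    print(f"Estimated coverage: {pct:.1f}%")
```

With these inputs the estimate is about 800,000 actively updated personal sites, which is where the figure of roughly 0.2% comes from.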

Usage levels

Stats have been fairly static, averaging around 10 real users a day for most of the time, although for the past 2.5 weeks there have been at least 30 users a day, which is good (let’s hope that continues).

[Chart: analytics, May 2022 - Jul 2023]

Sadly, over 99.6% of search queries are still from SEO spam bots:

Day               Real searches   Spam bot searches   % spam
Sat 8 Jul 2023    75              79947               99.91%
Sun 9 Jul 2023    57              134957              99.96%
Mon 10 Jul 2023   93              134257              99.93%
Tue 11 Jul 2023   34              134202              99.97%
Wed 12 Jul 2023   66              19111               99.66%
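
For anyone wanting to reproduce the spam percentage column, here is a small sketch. It assumes the percentage is spam bot searches divided by total searches (real plus spam), which matches the figures in the table; the function name is just illustrative.

```python
# Daily search counts copied from the table: (day, real searches, spam bot searches).
DAILY_COUNTS = [
    ("Sat 8 Jul 2023", 75, 79947),
    ("Sun 9 Jul 2023", 57, 134957),
    ("Mon 10 Jul 2023", 93, 134257),
    ("Tue 11 Jul 2023", 34, 134202),
    ("Wed 12 Jul 2023", 66, 19111),
]


def spam_percent(real, spam):
    """Spam bot searches as a percentage of all searches that day."""
    return 100 * spam / (real + spam)


if __name__ == "__main__":
    for day, real, spam in DAILY_COUNTS:
        print(f"{day}: {spam_percent(real, spam):.2f}% spam")
```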

Running costs vs paid listings

Some good news is that running costs have fallen significantly following the migration from AWS to Hetzner and the number of paid listings has been increasing:

Year                  Expenses   Income
Jul 2020 – Jun 2021   £351.57    £57.25
Jul 2021 – Jun 2022   £456.04    £80.51
Jul 2022 – Jun 2023   £94.51     £137.32
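
To make the trend explicit, the net position for each year can be worked out from the table. This is a simple sketch using the figures above; "net" here just means income minus expenses, and ignores anything not listed in the table.

```python
from decimal import Decimal

# (year, expenses in GBP, income in GBP), copied from the table.
YEARS = [
    ("Jul 2020 - Jun 2021", Decimal("351.57"), Decimal("57.25")),
    ("Jul 2021 - Jun 2022", Decimal("456.04"), Decimal("80.51")),
    ("Jul 2022 - Jun 2023", Decimal("94.51"), Decimal("137.32")),
]


def net_by_year(rows):
    """Return (year, income minus expenses) for each row."""
    return [(year, income - expenses) for year, expenses, income in rows]


if __name__ == "__main__":
    for year, net in net_by_year(YEARS):
        print(f"{year}: net {net:+} GBP")
```

On these figures, the most recent year is the first in which income exceeded expenses.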

Future plans

There are a couple of big changes planned for the near future. I’d hoped to have them ready for the 3 year anniversary, but didn’t quite make it (remember this is a spare-time side-project).

Redesign and first major outside open source contribution

I don’t want to spoil the surprise on this one so won’t go into too much detail, but it is quite exciting so I will mention it here: there’s a big redesign in the works from the first major outside open source contributor (clue: “retro futurism”).

Vector search and possibly even chat-with-your-website functionality

As the first phase of the vector search rollout, pages are now having embeddings created (using Hugging Face’s Sentence Transformers) and indexed (via Apache Solr’s dense vector search). Next step is the query interface.
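
As a rough illustration of what the query side might look like, the sketch below builds a query string for Solr's knn query parser from an embedding. It is not the project's actual implementation: the field name `content_vector` and the function name are hypothetical, and the embedding is passed in as a plain list so the example has no dependencies. In practice the vector would come from encoding the user's query with the same Sentence Transformers model used at index time.

```python
def build_knn_query(vector, field="content_vector", top_k=10):
    """Build a Solr knn query parser string for a dense vector field.

    `field` is a hypothetical field name; it must match a DenseVectorField
    in the Solr schema with the same dimension as `vector`.
    """
    if not vector:
        raise ValueError("vector must not be empty")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    values = ", ".join(repr(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


if __name__ == "__main__":
    # Toy 3-dimensional vector; real sentence embeddings have hundreds of dimensions.
    print(build_knn_query([0.1, 0.25, -0.5], top_k=3))
```

The resulting string would be sent as the `q` parameter of a normal Solr select request, and the top matches returned by similarity to the query vector.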

I’ve also started experimenting with privateGPT-style chat-with-your-website functionality, although it remains to be seen how viable it would be to roll out given the resource constraints and the SEO spam bot issue. Still, it might help attract more interest in the project, in particular more contributors.

So hopefully some more good things to come.

Closing thoughts

Evolving concerns

When I launched the project in 2020, one of my biggest concerns was that an advert-laden rip-off would be set up, and that the copy would be search-engine-optimised to get lots more traffic than the advert-free original. Now that this project has become more established, this is less of a concern.

Then at the start of 2022 my big concern became the onslaught of SEO spam bot traffic. It was partly a fear that all the time and money I was spending on the project was benefitting the SEO spammers way more than it was benefitting the small number of genuine users, and furthermore that the SEO spammers were performing nefarious black-hat SEO operations which would damage the sites in the index, e.g. looking for links with vulnerabilities they could exploit to automatically post backlinks, or for original content to copy to link farms. That is still a concern, but no longer the main one.

Finally, at the end of 2022, we had the arrival of ChatGPT, and the threat of massive volumes of AI-generated blogspam. A YC-funded company even appeared at the end of the year promising to “instantly create briefs or text that’s modeled after high-ranking content and sounds human-made”. Trying to keep all of that joyless and meaningless junk out of the system is only part of the fear. The other part is that the AI systems need human-produced content for training, and as the internet fills up with AI-generated content the human-made content will become more difficult to find, making this project a potential target for the bots to use to improve themselves. Again it is the fear of trying to help but ending up making things worse, like trying to connect up the pockets of human resistance against Skynet in The Terminator films, and in doing so inadvertently leading the machines to the humans. Maybe it is better if it remains “under the radar”, passed on by word-of-mouth between those in-the-know.

Raising the prospect of advert-free search engines

On a more positive note, given the relative success of the paid listing model, this project does now show that it is possible to run a sustainable search engine without resorting to advertising, some kind of pay-per-search subscription, or charitable donations. This was one of the original objectives outlined in the first blog post on 18 Jul 2020, so if nothing else it has succeeded in that objective.

Looking through some of the support forums for the big search engines, filled with unanswered pleas for help from businesses who have seen unexplained dramatic falls in the number of search engine referrals, I’m convinced that tens of millions of site owners would pay a small fee for official support and likely tens of thousands would pay a higher fee for enterprise-level support. That would be more than enough to pay the running costs for a large search engine, with enough margin of error to make a small profit most of the time.

The large search engines could then completely eliminate advertising, thereby removing the conflict of interest a certain Sergey Brin and Lawrence Page warned of in their “The Anatomy of a Large-Scale Hypertextual Web Search Engine” paper from 1998: “Advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers”.

  1. The first submission by a real user was on 17 Jul 2020, which I usually take as the anniversary date. Other possible dates would be 13 Jun 2020 when I made the first commit, 18 Jul 2020 when I published the first blog post, or 31 Jul 2020 when I did the Show HN.

  2. Actual feedback on 3 Jul 2023 was that the paid search-as-a-service was “certainly satisfactory”. If I were into wearing t-shirts with slogans, I’d get one printed with “certainly satisfactory” in large letters :-)