searchmysite.net blog

searchmysite.net is now open source

Introduction

I used the quote “talk is cheap, show me the code” in the introduction to my first post on searchmysite.net. Well, here it is: https://github.com/searchmysite/searchmysite.net/.

Why aren’t other search engines open source?

Pretty much every other search engine treats its inner workings as a closely guarded secret. This is to stop spammers from figuring out how to game the system and increase the ranking of their results to earn a greater share of the advertising revenue. However, this isn’t a concern for searchmysite.net, because its operating model is designed both to keep spam out and to remove the financial incentive for spam in the first place:

  • Only user-submitted sites are indexed, with a moderation layer on top. At the moment this instance focusses on personal and independent websites, which by their nature are less likely to be spammy, but as this instance grows and/or other instances are set up, the idea is to retain a community-based approach to content curation and moderation.

  • Spam content exists to make money, and that money is made when people click through to pages carrying adverts. However, this system detects pages with adverts and heavily downranks them, so fewer people are likely to click on them (a sketch of the idea follows this list). It’s a fairly simple idea, but quite powerful, and not something other advertising-funded search engines can do. The funding model for searchmysite.net is instead the listing fee, which gives site owners additional benefits, e.g. access to search-as-a-service features like being able to trigger reindexing on demand. In this way the incentive model is designed to be sustainable and to remain aligned with user needs rather than in conflict with them.
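
To make the advert downranking concrete, here’s a minimal sketch, assuming a simple marker-based detector at index time and a Solr boost query at query time; the ad-network list, the contains_adverts field name and the boost value are illustrative assumptions, not necessarily what searchmysite.net actually does.

```python
# Minimal sketch, not the actual searchmysite.net implementation:
# detect common ad-network markers at index time, store the result in a
# boolean field, and let the search engine penalise it at query time.
AD_MARKERS = ["googlesyndication.com", "doubleclick.net", "adsbygoogle"]  # illustrative list

def contains_adverts(page_html: str) -> bool:
    """Return True if the page appears to serve adverts."""
    return any(marker in page_html for marker in AD_MARKERS)

# The indexed document then carries e.g. contains_adverts=true, and a
# Solr boost query can downrank such pages at query time, e.g.:
#   bq=contains_adverts:false^10
```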

What open source licence is it?

It is licensed under the GNU Affero General Public License (AGPL). In short, this means anyone is free to use the code unmodified as long as they publicly acknowledge the source, and if they make any form of change they must make their changes available under the AGPL too, even when the modified code is only ever run as a network service.

This licence was chosen following a number of high profile cases where certain tech giants have made open source systems available as “cloud based services” and profited immensely from doing so, but not returned anything to the people and community that created those systems¹. Profit in itself isn’t a problem, especially if it is in return for a useful service, but abusing the people who make the profit possible is. The AGPL licence has been designed to prevent such abuse.

Why is it on GitHub?

Those who have read other posts on my personal site will be aware I’ve been using GitLab pretty much exclusively. Two of the original factors in GitLab’s favour, i.e. that they offered private repos for free and had a continuous integration environment, aren’t major differentiators any more. However, I do still prefer GitLab’s philosophy, e.g. the GitLab documentation pretty much exclusively guides you through making changes via standard git commands, while GitHub often steers you towards its bespoke web interface and GitHub-specific gh commands.

Unfortunately though, it seems open source projects still get much more visibility on GitHub, so the difficult decision has been made to move the project from GitLab to GitHub. It shouldn’t be that big a deal though, because it can always be moved somewhere else later.

What are the future plans?

There’s a backlog of issues, in no particular order, at https://github.com/searchmysite/searchmysite.net/issues.

In some ways one of the most important items on the list is #6 Build a community, because that could help ensure future plans are aligned with what users actually want. I’ve set up a GitHub Discussions forum as a start, but if necessary a separate forum or chat group could be set up.

Personally, I’m also quite interested in #10 Index wikipedia. In some ways it would be an exception, because it would need a custom indexer (it shouldn’t use the spider) and special configuration (it would clearly need a page limit far higher than the default of 50). In other ways it would still fit within the model: the category would be “independent website”, it wouldn’t be owner verified, and it would be subject to exactly the same ranking formula. I think that would be a reasonable compromise: the majority of paid listing fees would subsidise the creation of one (or more) unpaid listings which make the service much more useful for all. It would start turning searchmysite.net from a niche search engine into one that people could consider using on a fairly frequent basis. A sketch of what such a custom indexer might look like follows.
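
Purely as an illustration of what a custom (non-spider) indexer might look like, here’s a sketch that pulls article extracts from the public MediaWiki API and posts them to Solr; the Solr URL, core name and field names are assumptions rather than the project’s actual schema.

```python
# Illustrative sketch of a custom (non-spider) Wikipedia indexer using
# the public MediaWiki API. The Solr URL, core name and field names are
# assumptions, not searchmysite.net's actual schema.
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/content/update/json/docs"  # assumed core name

def fetch_article(title: str) -> dict:
    """Fetch a plain-text extract of one Wikipedia article."""
    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "prop": "extracts", "explaintext": 1,
        "format": "json", "titles": title,
    })
    page = next(iter(resp.json()["query"]["pages"].values()))
    return {
        "url": "https://en.wikipedia.org/wiki/" + title.replace(" ", "_"),
        "title": page["title"],
        "content": page.get("extract", ""),
        "owner_verified": False,                # per the model described above
        "site_category": "independent-website", # assumed category value
    }

def index_article(title: str) -> None:
    """Post one article to Solr and commit it."""
    doc = fetch_article(title)
    requests.post(SOLR_UPDATE_URL, json=doc, params={"commit": "true"})

index_article("Open-source software")
```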

There have also been some fairly excited discussions about this project potentially being the start of a new approach to federated search, with various groups setting up separate instances to index their own interesting parts of the internet (e.g. a developer focussed one indexing developer Q&A and documentation sites), and a front-end deployed to each instance which is (or can be made) aware of all the other instances. Kind of like the best bits of Searx (a metasearch engine, in this case each instance providing a metasearch over the other instances) and YaCy (distributed indexing) but without the drawbacks. That said, while it could be amazing, it would be a non-trivial undertaking to do properly, with challenges including:

  • Integration between instances would have to be performed at a fairly low level for the best results, both in terms of equivalence in relevancy scoring and responsiveness of the results. This is sometimes referred to as index-time merging rather than query-time merging² (a naive query-time merge is sketched after this list), and would sit at e.g. the SolrCloud or Cross Data Centre Replication level rather than at the front-end and API level.
  • Distinct and non-overlapping spheres of interest would have to be very clearly defined, to avoid unnecessary repeat indexing load on small sites, and prevent duplicated results in the combined interface.
  • It would most likely work better if there were a small number of high quality and well maintained instances, rather than lots of noisy and unreliable instances, which starts taking you back to a more (although not necessarily fully) centralised approach.
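
For contrast with the index-time merging described above, here’s a minimal sketch of the naive query-time alternative mentioned in the first bullet (and footnote 2); the instance list, API path and response shape are all assumptions.

```python
# Illustrative query-time merge across instances: the simple interim
# approach, as opposed to the index-time merging described above.
# The instance URLs, API path and response shape are all assumptions.
import requests

INSTANCES = [
    "https://searchmysite.net",
    "https://dev.example.org",  # hypothetical developer-focussed instance
]

def federated_search(query: str, max_results: int = 10) -> list:
    """Query every known instance and naively merge the results."""
    merged = []
    for base in INSTANCES:
        try:
            resp = requests.get(f"{base}/api/v1/search",
                                params={"q": query}, timeout=5)
            merged.extend(resp.json().get("results", []))  # assumed response shape
        except requests.RequestException:
            continue  # an unreliable instance shouldn't break the whole search
    # Naive merge: relevancy scores from independently tuned instances aren't
    # directly comparable, which is exactly why index-time merging is
    # preferred above.
    return sorted(merged, key=lambda r: r.get("score", 0), reverse=True)[:max_results]
```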

Realistically, implementing significant changes such as these would need either a number of volunteers or enough people paying the listing fee to fund a small paid team. And for any significant increase in the number of sites indexed to be sustainable, it would need enough people paying the listing fee to cover the upgraded infrastructure.

In the meantime, it would be best to focus on one thing and to do it really well. At the moment that one thing is searching personal and independent websites.

How can I contribute?

See the contributor guidelines for how you can help. It would be awesome if you could. Search isn’t easy, and you will be up against multi-billion dollar giants and multi-million dollar startups. But just remember, unlike this project, they’re all after the same thing - a cut of the multi-billion dollar advertising market.


  1. See e.g. https://stratechery.com/2019/aws-mongodb-and-the-economic-realities-of-open-source/. ↩︎

  2. Of course, as an interim solution, you could bypass results merging altogether, e.g. by having the results from each instance on separate tabs. ↩︎