Lots of new web feed (RSS and Atom) related functionality

A quick summary of the new web feed for all search results

All search results pages (including Newest Pages and Browse Sites) now have a web feed icon in the top middle next to the results count (in between the Filters and Sort by). Clicking this takes you to an OpenSearch Atom format web feed[1] for that query.

This allows you to, for example:

  • Subscribe to new posts about your favourite topics. For example, if you are interested in seeing new posts about Stable Diffusion, search for “stable diffusion” (with double-quotes), change Sort by to “Published date (newest first)”, (optionally) set filters like Language and In web feed, then copy and paste the web feed link into your feed reader.
  • Create feeds for sites which don’t provide feeds. For example, use Browse Sites to get to the site you want (use e.g. Sort by Domain if necessary), click the Domain link to return all results from that domain, set Filters and Sort by if necessary, and use the web feed link.
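As a sketch of what a feed reader does with one of these links, here's how the entries of an Atom feed can be parsed with Python's standard library. The sample feed below is purely illustrative, not real output from the site:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def parse_atom_entries(xml_text):
    """Return (title, link, summary) for each entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title", default="")
        summary = entry.findtext(ATOM_NS + "summary", default="")
        link_el = entry.find(ATOM_NS + "link")
        link = link_el.get("href", "") if link_el is not None else ""
        entries.append((title, link, summary))
    return entries

# Illustrative sample of the kind of feed a search results page might return
sample = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Search results for "stable diffusion"</title>
  <entry>
    <title>Post about Stable Diffusion</title>
    <link href="https://example.com/post"/>
    <summary>Headline and summary only.</summary>
  </entry>
</feed>"""
```

Note the namespace prefix on every element lookup: Atom elements live in the `http://www.w3.org/2005/Atom` namespace, which trips up a lot of naive parsers.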

Note that this functionality:

  • Isn’t aiming to replace feed readers, given that the point of those is that they allow you to curate your own lists of feeds. However, the hope is that it will be a useful feed in itself, which might in turn lead to the discovery of new individual feeds for you to add to your feed reader.
  • Purposefully contains just the headline and summary, so you will still need to click through to the source site for the full article. This is so the source site retains control of its content, has visibility of who is accessing it, etc.

A bit of background to the new web feed functionality

I got some good feedback earlier in the year from various people referencing RSS feeds, e.g. someone who said they used the site primarily to find RSS feeds to add to their RSS feed reader. That got me thinking about web feeds in general.

I have a theory that web feeds are not promoted more because they are difficult to monetise. Instead of web feeds, people have been encouraged to move to the activity feeds curated by the social media platforms, which of course are liberally peppered with the all-important advertisements.

It’s like so many other things on the internet that have moved to progressively worse alternatives so Big Tech can make bigger profits. For example, we had a whole generation persuaded to use SMS messages in place of email, despite SMS being worse than email in almost every way (e.g. 160 character limit, no concept of multiple destinations let alone cc or bcc, no ability to attach files, etc.), primarily because SMS cost money and marketers claimed higher engagement rates. And now we have a whole new generation being brought up to prefer various incompatible messaging platforms in place of SMS, so now I need a phone filled with messaging apps (most of which have no desktop version), just so the advertising companies can harvest more personal data than ever (even messaging apps with allegedly end-to-end encryption will surface valuable metadata).

Anyway, something useful which isn’t monetisable, and so is not widely promoted, sounds like a perfect fit for a project like this one. So I got to thinking about how I could use web feeds to make the site better, and vice versa.

A summary of some of the other new web feed search and aggregation functionality

The first changes were to:

  • Auto-discover a site’s web feed, and expose that feed link on the Browse page. See #64 and the later improvements at #77. Funny story - the improvements to the feed detection led me to discover feeds on my own site that I didn’t even know I had[2].
  • Allow site owners to specify their own web feed link, in case the auto-discovered link is missing or incorrect.
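Feed auto-discovery conventionally works by scanning a page’s <head> for <link rel="alternate"> tags with a feed MIME type. Here’s a minimal sketch using Python’s html.parser (illustrative only, not the actual implementation behind #64 and #77):

```python
from html.parser import HTMLParser

# MIME types conventionally used to advertise feeds in <link rel="alternate">
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags with a feed MIME type."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}  # attribute values may be None
        if a.get("rel", "").lower() == "alternate" \
                and a.get("type", "").lower() in FEED_TYPES \
                and a.get("href"):
            self.feeds.append(a["href"])

def discover_feeds(page_html):
    finder = FeedLinkFinder()
    finder.feed(page_html)
    return finder.feeds
```

A page can advertise several feeds this way (e.g. one per tag, as Hugo does), which is exactly why identifying the primary feed is harder than it looks.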

I then realised that the feed could be useful for other purposes, e.g.

  • Start the crawl from a web feed in addition to the home page. See #54.
  • Make the Newest Pages even more useful by implementing a more frequent incremental index, i.e. so new posts can be identified more quickly. See #34.
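The incremental-index idea boils down to: read the feed’s entry links, and crawl only the ones not already indexed, which is far cheaper than re-crawling a whole site. An illustrative sketch (the function names are mine, not from the project):

```python
import xml.etree.ElementTree as ET

def rss_item_links(xml_text):
    """Return the <link> of each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [item.findtext("link", default="") for item in root.iter("item")]

def urls_to_crawl(feed_xml, already_indexed):
    """Feed entry URLs not yet indexed, in feed order - the candidates
    for a quick incremental crawl between full re-crawls."""
    seen = set(already_indexed)
    return [u for u in rss_item_links(feed_xml) if u and u not in seen]
```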

I’ve also added a new field to indicate whether a page appears within a feed (see #71). This may in future (when fully populated) be used to:

  • Boost the relevancy score of pages which are in web feeds, to help e.g. raise content pages above landing/listing pages. See #73.
  • Filter Newest Posts to show just posts in web feeds, to make that page more of a feed itself. See #74.

Some potential issues with web feeds

One of the things I have found though, to be honest, is that web feeds are a bit of a mess. I’ve seen all sorts of issues, e.g. RSS (i.e. XML) returned as Content-Type: text/html, multi-megabyte feeds that exceed my max-file-size limit, missing data (e.g. only 49% of pages have a last modified date, and just 14% a published date[3]), etc. And of course there’s the issue that many sites have a lot of feeds, some of which have limited value (e.g. feeds for specific tags), which the site owner might not even know about, all with no clear way of identifying the primary feed.
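A defensive fetcher can’t trust the declared Content-Type, so one common workaround is to sniff the body for feed markers and enforce a size cap before parsing. An illustrative sketch (the 5 MB limit is an assumption, not the project’s actual setting):

```python
MAX_FEED_BYTES = 5 * 1024 * 1024  # assumed cap, not the project's actual limit

XML_CONTENT_TYPES = {
    "application/rss+xml", "application/atom+xml",
    "application/xml", "text/xml",
}

def looks_like_feed(body: bytes, content_type: str) -> bool:
    """Decide whether a response is plausibly a feed, without trusting the
    declared Content-Type (feeds are often served as text/html)."""
    if len(body) > MAX_FEED_BYTES:
        return False  # reject multi-megabyte feeds up front
    declared = content_type.split(";")[0].strip().lower()
    if declared in XML_CONTENT_TYPES:
        return True
    # Fall back to sniffing the first bytes of the body for feed markers
    head = body.lstrip()[:512].lower()
    return head.startswith(b"<?xml") or b"<rss" in head or b"<feed" in head
```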

It also remains to be seen whether making it easier to extract data in a structured format will have any negative consequences. At the start of the project I was reluctant to expose all the data as a feed because I was concerned that someone would suck out all the content and spit it out onto an advert-laden rip-off site, made more popular than the original by blog spammers and their black hat SEO. That is of course still a risk, but after nearly 2.5 years I’d hope the site has enough of a presence to ensure it remains more popular than any rip-offs. It has to be noted, though, that between 99.67% and 99.93% of all searches[4] are still coming from spammers with their automated SEO searches. Furthermore, since the last update on the automated SEO searches issue, the SEO software they use has figured out how to follow the pagination links, so I now get 100 page requests for every automated SEO query in place of one[5], and they also now bypass some of the measures I put in place to try to block them (some of which have had to be disabled anyway to allow the new OpenSearch Atom feed to function).

Even if there are negative consequences though, I hope that they will be outweighed by positive benefits.

  1. If you’re not too familiar with web (aka RSS/Atom) feeds, here are a couple of useful pages: About Feeds and What using RSS feeds feels like. ↩︎

  2. I use Hugo which automatically generates feeds for all tag and category pages. ↩︎

  3. Source: Some checks on already indexed data to see how useful the filters would (or wouldn’t) be for #4. ↩︎

  4. 24 Oct 2022, total searches 1020, real searches 2 (99.80% spam). 25 Oct 2022, total searches 2147, real searches 7 (99.67% spam). 26 Oct 2022, total searches 1449, real searches 1 (99.93% spam). 27 Oct 2022, total searches 970, real searches 2 (99.79% spam). 28 Oct 2022, total searches 1602, real searches 5 (99.68% spam). ↩︎

  5. I can assure the “SEO practitioners” that the site is never likely to contain one full page of useful results for highly specialised queries like “pest control” in some suburb of some small US city, or “full mouth dental implants” somewhere I’ve never heard of, or whatever they’re trying to SEO optimise, let alone 100 pages. ↩︎