Almost all searches on my independent search engine are now from SEO spam bots
Introduction
searchmysite.net was launched nearly 2 years ago to help people discover all the great original content on personal and independent websites which is so hard to find now that the major search engines have become swamped by SEO spam. It employed a number of novel techniques to try to avoid the same fate, such as having a community-driven curation and moderation layer, heavily downranking pages containing adverts to reduce the incentive for SEO spam, and aiming to pay the running costs from its search as a service rather than by selling user data to advertisers.
That makes it especially tragic to report that nearly all the traffic to the site is now from SEO spam bots, presumably searching for all that elusive SEO spam-free content. Here are the stats for recent requests to the search page[1]:
Day | Real users | Spam bots | %age spam |
---|---|---|---|
Sun 01 May 2022 | 3 | 36,534 | 99.992% |
Mon 02 May 2022 | 2 | 35,086 | 99.994% |
Tue 03 May 2022 | 7 | 39,501 | 99.982% |
Wed 04 May 2022 | 3 | 36,388 | 99.992% |
Thu 05 May 2022 | 1 | 37,529 | 99.997% |
Fri 06 May 2022 | 9 | 37,994 | 99.976% |
Sat 07 May 2022 | 12 | 58,648 | 99.980% |
Sun 08 May 2022 | 13 | 70,107 | 99.981% |
Mon 09 May 2022 | 6 | 43,471 | 99.986% |
Tue 10 May 2022 | 1 | 157,587 | 99.999% |
Wed 11 May 2022 | 0 | 162,783 | 100.000% |
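The percentage column is simply spam bot requests divided by total requests to the search page. As a minimal sketch (the function name `spam_percentage` is made up for illustration and is not part of the searchmysite.net code), a few of the rows can be reproduced like this:

```python
def spam_percentage(real_users: int, spam_bots: int) -> float:
    """Return the share of requests made by spam bots, as a percentage."""
    total = real_users + spam_bots
    if total == 0:
        # No requests at all, so there is no meaningful share to report.
        return 0.0
    return 100.0 * spam_bots / total


# Three rows copied from the table: (day, real users, spam bots).
rows = [
    ("Sun 01 May 2022", 3, 36_534),
    ("Tue 10 May 2022", 1, 157_587),
    ("Wed 11 May 2022", 0, 162_783),
]

for day, real, bots in rows:
    print(f"{day}: {spam_percentage(real, bots):.3f}% spam")
# Sun 01 May 2022: 99.992% spam
# Tue 10 May 2022: 99.999% spam
# Wed 11 May 2022: 100.000% spam
```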
There are two parts to this problem:
- Usage by spam bots has increased dramatically.
- Usage by real users is almost non-existent.
Usage by spam bots has increased dramatically
I’ve always had some activity from bots[2], but it has been manageable. However, in mid-April 2022, bot activity started to increase dramatically.
I didn’t notice at first because the web analytics only shows real users, and the unusual activity could only be seen by looking at the server logs. I initially suspected that it was another search engine scraping results and showing them on their results page, because the IP addresses, user agents and search queries were all different. I then started to wonder if it was a DDoS attack, as the scale of the problem and the impact it was having on the servers (and therefore running costs) started to become apparent.
After some deeper investigation, I noticed that most of the search queries followed a similar pattern, e.g. containing “Powered by” and the name of some blog or forum tool, followed by the search query. It turns out that these search patterns are “scraping footprints”. SEO practitioners combine them with their own search terms to find URLs to target, which implies that searchmysite.net has been listed as a search engine in one or more SEO tools like ScrapeBox, GSA SEO or SEnuke.
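To make the pattern concrete, here is a rough sketch of how such queries could be flagged when going through logged search terms. It is not the filtering actually used on searchmysite.net. The tool names in `FOOTPRINT_TOOLS` and the function `looks_like_footprint` are assumptions for illustration only, since all that is known here is that the queries contained “Powered by” followed by the name of a blog or forum tool.

```python
import re

# Hypothetical list of blog/forum tool names; the real footprints seen in
# the logs are not listed here, so treat these purely as placeholders.
FOOTPRINT_TOOLS = ["WordPress", "phpBB", "vBulletin"]

# Match "powered by <tool>" anywhere in the query, ignoring case.
FOOTPRINT_RE = re.compile(
    r"powered by\s+(?:" + "|".join(re.escape(t) for t in FOOTPRINT_TOOLS) + r")\b",
    re.IGNORECASE,
)


def looks_like_footprint(query: str) -> bool:
    """Return True if the search query contains a known scraping footprint."""
    return FOOTPRINT_RE.search(query) is not None


print(looks_like_footprint('"Powered by WordPress" gardening tips'))  # True
print(looks_like_footprint("personal blogs about gardening"))         # False
```

A fixed list like this would only catch footprints that are already known, so at best it is a first-pass filter rather than a complete defence.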
It is hard to imagine any legitimate white-hat SEO technique requiring these search results, so I have to assume they are for black-hat SEO operations, e.g. looking for quality URLs on which to attempt to post backlinks automatically, or for original content to copy to link farms, or for good sources of email addresses, etc.
I sincerely hope that none of the sites in searchmysite.net have been adversely affected by this. I have been doing my best to block the SEO spammer requests since discovering the issue[3], but I am concerned it could be a losing battle, and one with nothing to gain by winning.
Usage by real users is almost non-existent
Most of the tiny number of real users have come from links posted to places like Hacker News, and there is almost no organic traffic from other search engines. I wrote about this in Progress update Q1 & Q2 2021, including stats on the top traffic sources, along with details of all the white-hat SEO operations I had performed to try to improve ranking within the search engines. If you were into conspiracy theories you could claim that the major search engines were trying to stifle the competition, but a more realistic explanation is simply that searchmysite.net is being drowned out by SEO spam.
Little has changed since then. In fact, in the searchmysite.net retrospective and future plans published the following year, I noted that there had been multiple weeks where not one single real person had visited a single blog entry for the whole week[4].
Conclusion: Is there anybody out there?
I’m not one for conspiracy theories, but it’s hard not to think that there is some truth to The Dead Internet Theory, i.e. the belief that the Internet is now empty and devoid of real people, and that everything is just bots talking to bots to generate content and clicks in order to get a share of the all-important and ever-growing advertising revenue[5].
I don’t think we’re quite at that stage yet though. I know there are still some real people doing great things on the internet, whether other real people know of their existence or not. However, I am genuinely concerned that my project to try to connect these people may have had unintended negative consequences, like in those post-apocalyptic stories where the robots have taken over the earth and well-meaning attempts to link pockets of human resistance are doomed to draw the robots to them too.
This time I’m really not sure what the solution is.
UPDATE (16 May 2022): See also the discussion on Hacker News at https://news.ycombinator.com/item?id=31395231.
1. The stats show requests to /search, with real users identified in the web analytics solution and cross-referenced against the log files, and the spam bots pulled from the log files. Requests from my devices haven’t been filtered out, so many of the unusually high number of real users on Sat 7 May and Sun 8 May (12 and 13 respectively) will actually be me testing v1.0.14, released that weekend, on various devices.
2. e.g. periodic requests for files like /wp-login.php, or people running the masscan port scanner, and of course other search engines spidering my search engine (these are usually identifiable by their user agents, even if some ignore robots.txt).
3. See https://github.com/searchmysite/searchmysite.net/issues/55. In summary, if they break through the current reverse proxy level protection, options include an invisible ReCAPTCHA (but given I sometimes get 160,000 requests a day I’d be well over the 1,000,000 a month free tier limit), requiring JavaScript as per the web analytics or some Cross Site Request Forgery style protection (but those would place much more load on the servers), or CloudFlare (but the searchmysite.net spider is still currently blocked by CloudFlare as per Some of the challenges of building an internet search).
4. Actually since January 2022 (when the retrospective was published) I’ve had more and longer stretches with zero page views anywhere on the blog, e.g. Tue 29 Mar 2022 to Wed 6 Apr 2022 inclusive. If I’d had a decent number of real users visiting and never returning I could reasonably conclude that updating the blog wasn’t the most productive use of my time and effort, but without any real users in the first place it is hard to gauge whether people like it or not.
5. Digital advertising spend is expected to reach US$571.16Bn in 2022 and US$785.08Bn in 2025 according to https://www.emarketer.com/content/worldwide-digital-ad-spending-year-end-update. It is hard to imagine spending that amount of money, let alone when it is essentially funding the destruction of the internet as we know it.