Web analytics on searchmysite.net

29 Nov 2020

Introduction

Keen-eyed observers may have noticed that the Analytics section of the Privacy Policy has recently been updated. I thought it would be worth a short post with further information.

The original web analytics solution

Hopefully it goes without saying that some form of web analytics is useful even for a privacy aware site like searchmysite.net, because you really need to know how a site is being used when looking at certain issues and enhancements, scaling the infrastructure etc.

When I originally launched searchmysite.net (see searchmysite.net: Building a simple search for non-commercial websites) I used the GoAccess log file based web analytics solution. I like the idea of log file based web analytics for a privacy aware site because it doesn’t require JavaScript or cookies or anything else user-facing. However, log file based solutions can be a little awkward to set up, especially in a dockerised environment:

The official httpd docker image (which searchmysite.net uses) defaults to sending its logs to /proc/self/fd/1 (STDOUT) rather than an actual log file, for better integration with docker logs¹.
House-keeping activities such as log file rotation can complicate things.
If the service becomes popular enough to warrant spreading the load over multiple web servers, there is the additional challenge of aggregating all the log files (unless of course the logs are captured in a single SSL-terminating reverse proxy).
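As a rough illustration of the aggregation problem in the last point, here is a minimal Python sketch that merges access logs from several web servers into one chronological stream. It is not something searchmysite.net runs. The function names `log_time` and `merge_logs` are made up for this example, and it assumes every line uses Apache's Common or Combined Log Format timestamp and that each individual log is already in time order.

```python
import heapq
import re
from datetime import datetime

# Matches the timestamp field of Apache's Common/Combined Log Format,
# e.g. [29/Nov/2020:10:15:32 +0000]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def log_time(line):
    """Return the request time of one access log line as an aware datetime.

    Assumption: every line contains a well-formed timestamp. A line without
    one raises an exception rather than being silently skipped.
    """
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in log line: {line!r}")
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*logs):
    """Lazily merge several time-ordered logs into one time-ordered stream.

    Each argument is an iterable of log lines (an open file works too).
    heapq.merge only reads one line per input at a time, so it copes with
    large files, but it relies on each input already being sorted.
    """
    return heapq.merge(*logs, key=log_time)
```

Even with something like this in place, rotation and collecting the files from each server still have to be handled separately, which is part of why the log file approach gets awkward as a site grows.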

I know there are solutions like ELK Stack (Elasticsearch, Logstash, Kibana) for managing data collection, log parsing and visualisation, but implementing those is a project in itself, and more than the simple web analytics required at this stage.

The new privacy aware analytics solution

Fortunately there are a growing number of privacy aware web analytics solutions, perhaps helped by the spread of GDPR and PECR cookie consent popups².

I decided to look at Plausible for a number of reasons:

It is completely cookie-free.
It is open source.
There is an option to serve the JavaScript from your own domain so my “No files are downloaded from third parties, so there is no opportunity for third parties to track your use of this site” claim would remain true.
It has a self-hosted option.

I also really like Plausible’s business model, which is to be a sustainable open-source project. Although there is a free self-hosted solution, they aim to pay their running costs and salaries via the premium managed solution, i.e. they have a concrete plan to remain a viable company without having to depend on advertising, investor cash, or charity. This sounds just like the sort of thing I wrote about in What went wrong with the internet (and how can it be fixed)?, which is of course what inspired the creation of searchmysite.net.

Anyway, I’m using the self-hosted Plausible solution for now, primarily for reasons of cost (searchmysite.net is self-funded and hasn’t switched on the listing fee yet). Setup was relatively straightforward, and there is a good and growing community of users. I’ve been running it for about a week and it seems to be working well so far.

In the interests of transparency, I am still keeping the web server log files for a short period. One of the reasons for this is to get details of the most popular search queries, which are missing from many analytics solutions. Information on popular searches could be useful for relevancy tuning, which will be the topic for my next post.
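To give an idea of how popular searches could be pulled out of the access logs, here is a minimal Python sketch. It is not the script used on searchmysite.net. The `/search/` path, the `q` parameter and the function name `popular_queries` are all assumptions made for the example, as is the choice to lower-case queries so that differently capitalised searches are counted together.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Pulls the requested URL out of the quoted request field of an Apache
# Common/Combined Log Format line, e.g. "GET /search/?q=blog HTTP/1.1"
REQUEST = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def popular_queries(lines, path="/search/", param="q"):
    """Count search terms found in access log lines.

    `path` and `param` are hypothetical defaults; change them to match
    however the search page is actually addressed.
    """
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # not a request line this sketch understands
        url = urlsplit(match.group(1))
        if url.path != path:
            continue  # some other page, not a search
        # parse_qs decodes both "+" and percent-encoding, and drops blanks
        for query in parse_qs(url.query).get(param, []):
            counts[query.strip().lower()] += 1
    return counts
```

Calling `popular_queries(open("logs/access.log")).most_common(20)` would then list the twenty most frequent searches, assuming the log has been redirected to a file as described in the first footnote.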

¹ See View logs for a container or service. My workaround was to reconfigure the logging to send to an actual file (logs/access.log) and set up a volume mount to be able to access the logs from the host.
² I sometimes wonder if the point behind the General Data Protection Regulation (GDPR) and Privacy and Electronic Communications Regulations (PECR) legislation was to try to encourage sites to cut back on the number of tracking cookies, but what seems to have happened is that the big sites have put their effort into developing the consent functionality instead, and most users have got used to simply clicking “accept all”.