Unifying Sync and Backup
For a little over three years I have been using a single Subversion repository to keep all my files synced across multiple computers, much as someone would use Dropbox or Google Drive. This article is the story of how I got here, why it works, and why it's not as crazy an idea as it probably sounds.
Background
The story starts when I was still a Windows user, and relying on Google Drive's desktop client to keep files backed up and synced between my laptop and desktop computer. This was adequate for some time, but it was rather rough around the edges, leading to occasional frustrations.
For instance, if I shut down my desktop computer too soon after making a change, the change wouldn't finish uploading, and would therefore be missing on my laptop. Worse, because every individual edit is synced atomically rather than grouped into logical units, I would sometimes not notice that a few files on my laptop were out of date until after making many conflicting changes. Small textual changes, in the context of computer programming, are often significant semantic changes. Tools like Google Drive make the idealistic assumption that everything will always stay in sync, so when such small synchronisation errors do happen, there are no good facilities for handling them, which can make the problem much worse; in my case, the previous version from my desktop computer would overwrite the changes on my laptop entirely.
There is also no good way to tell Google Drive to exclude certain files which come and go easily, such as compilation artefacts. The most extreme example occurred when programming in OCaml, where I noticed some extremely strange inconsistencies in the compiled binaries. Sometimes I would compile the code and get a binary which simply would not run; on other occasions I would make a change to the code, recompile, and find that the change was not part of the binary I ran. Because at the time I had only been programming in OCaml for a few months, and was working with unusual compilation settings while trying to write my own standard library, I assumed I had done something wrong. On the contrary, it turned out that because my build system created and deleted compilation artefacts so quickly, Google Drive was overwriting the latest artefacts with stale copies immediately after they were created. This was most starkly visible when I opened the file explorer, cleared all the compilation artefacts with my makefile, and then watched Google Drive restore them within a second, not recognising the deletion as intentional.
Soon after this experience, I switched to Linux for reasons entirely unrelated to the above. When I did so, there were only a small handful of applications I had used on Windows which weren't available, and Google Drive's syncing client was one of them. So I did the natural thing and tried a third-party syncing client called Insync, hoping that it would not be prone to the same problems as the official one. I was wrong. Insync, very frustratingly, treats filesystem renames as deletions and recreations. After renaming a large number of folders in bulk, I ended up with a huge number of inconsistencies and duplicate files spread across similarly named folders. This experience did not exactly instil confidence that my data was safe from loss.
The Solution
So then I had a crazy idea: why not just use version control? It handles syncing gracefully because conflict resolution is explicit; differing local copies can coexist until they are resolved when pushing to the server.
I had imagined a workflow like the following: when sitting down at a computer, download all the changes from the server, then work on whatever it is I was doing, and when done, push the changes to the server, ready to be pulled from another computer. If I ever forgot to push, I wouldn't have the latest changes on the other machine, but I could still safely work on the affected files, since Subversion would explicitly notify me about conflicts when I later tried to reconcile the two copies.
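For concreteness, the day-to-day routine amounts to only a few commands. The repository URL and working-copy path below are placeholders, not the ones I actually use:

    # One-time setup on each computer: check out a working copy of the repository.
    svn checkout https://svn.example.home/repos/files ~/files

    # Start of a session: pull down whatever was committed elsewhere.
    cd ~/files
    svn update

    # ... work on files as usual ...

    # End of a session: review changes, schedule any new files, and push everything up.
    svn status
    svn add --force .
    svn commit -m "Sync from desktop"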
An obvious, although unintended, benefit of using version control is that I now have a full history of all of my files. If I want to see what my computer's filesystem looked like on August 10th, 2022, I can just do that, without any difficulty or complication. I am also no longer tempted to keep things "just in case", which keeps my filesystem much better organised and less prone to digital hoarding. I can just delete a file; if in six months I realise that I needed it after all, it can be pulled from history.
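Both of these are one-liners in Subversion; the URL, file names, and revision number below are illustrative only:

    # Check out the tree as it looked on a given date (date revisions use curly braces).
    svn checkout -r {2022-08-10} https://svn.example.home/repos/files ~/files-2022-08-10

    # Resurrect a deleted file: find the revision that removed it, then copy it
    # back from just before that revision.
    svn log -v --search old-notes.txt ^/
    svn copy ^/notes/old-notes.txt@1234 old-notes.txt
    svn commit -m "Restore old-notes.txt from history"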
This also implicitly allows for a very natural solution to the problem of backup. I have my Subversion server running at home (previously on a Raspberry Pi, now on an old Framework laptop mainboard running on its own), and a cron job periodically runs a full backup to Backblaze B2. This means I have two local copies of the current version of my files, a full history of my files on a server at home, and a full backup of that history in the cloud. This solution is also less expensive than standard online file storage services: all of the software involved is free, a Raspberry Pi's only ongoing cost is power, which is very low (and it is a cheap computer to start with), and my Backblaze bill is consistently only a few dollars a month, with hundreds of gigabytes being stored.
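The backup side is nothing more than a small script run from cron. Something along the following lines works; the paths, the rclone remote name, and the bucket are illustrative rather than my exact setup:

    #!/bin/sh
    # svn-backup.sh -- nightly snapshot of the repository, pushed to Backblaze B2.
    set -e

    # Take a consistent snapshot (hotcopy is safe to run while the server is live).
    rm -rf /tmp/svn-backup
    svnadmin hotcopy /srv/svn/files /tmp/svn-backup

    # Upload the snapshot (assumes an rclone remote called "b2" pointing at B2).
    rclone sync /tmp/svn-backup b2:my-svn-bucket/files

    # Crontab entry to run it at 03:00 every night:
    # 0 3 * * * /usr/local/bin/svn-backup.sh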
Why Subversion
One matter I have yet to address is why I chose Subversion over another version control system, such as Git, Mercurial, or CVS. There are three reasons for this.
Because I intended to run the server locally on a Raspberry Pi, I needed something that was easy to set up and for which I could easily find help online if something went wrong. Git and Subversion are the most popular version control systems, and are far ahead of any alternatives in this respect. Setting up an Apache server for a Subversion repository is an afternoon project the first time, and now that I have written instructions for myself for the process, it takes half an hour at worst. I also ran into zero issues on my first dry run to see how it works.
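Part two walks through the details, but to give a sense of how small the job is, the whole server boils down to something like the following on a Debian-based system (the paths and username here are illustrative):

    # Install Apache, the Subversion module, and Subversion itself.
    sudo apt install apache2 libapache2-mod-svn subversion

    # Create the repository and hand it to the web server user.
    sudo svnadmin create /srv/svn/files
    sudo chown -R www-data:www-data /srv/svn/files

    # Minimal mod_dav_svn configuration with basic authentication.
    cat <<'EOF' | sudo tee /etc/apache2/mods-available/dav_svn.conf
    <Location /svn>
        DAV svn
        SVNPath /srv/svn/files
        AuthType Basic
        AuthName "Subversion repository"
        AuthUserFile /etc/apache2/dav_svn.passwd
        Require valid-user
    </Location>
    EOF

    # Create a login, enable the modules, and restart Apache.
    sudo htpasswd -c /etc/apache2/dav_svn.passwd myuser
    sudo a2enmod dav dav_svn
    sudo systemctl restart apache2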
Given that most of my computer's files would be within the version control repository, I needed a system which could handle a huge number of files, some of them very large, without lagging on commits, updates, checkouts, and so on. Subversion is simply far better than Git at handling and compressing large binary files, which is why it is often chosen by big game development studios that need to store large assets. A 200-gigabyte Subversion repository is much closer to Subversion's typical use case than a 200-gigabyte Git repository is to Git's.
I use Git for tracking many of my projects, and I want the .git folder to be treated like any other folder from the perspective of this main file repository. This is an important one as my main repository will contain other Git repositories nested within it. Git submodules are simply an unacceptable solution for this; if you don't know why then just try using Git submodules...
The Downsides
Have you ever read a blog post where the author describes some technology or program they've used for over a year and has only positive things to say about it? Whenever I do, I become extremely skeptical. Everything has a downside, and using anything for long enough will expose it. This is not one of those blog posts, and I will not be a blind cheerleader. I have two main words of caution about this approach for anyone wishing to try it.
If you want to be able to commit and download files from anywhere, and want to host your server at home rather than paying for an online out-of-the-box solution, the setup time and ongoing maintenance will be higher. You will need to set up dynamic DNS or pay for a static IP address, and you have to worry about security and logins yourself rather than piggybacking off your home Wi-Fi network's security. I didn't want to fiddle with any of this, so the trade-off I live with is that if I forget to pull the files from my desktop onto my laptop before leaving for university or work, I have to go without them for the day.
Generally I update the files before working and commit them when I'm done, on each computer. If I ever forget to do this, Subversion notifies me of the conflict, I resolve it, and everything is fine. Except for the one time I made a commit in a Git repository which fell under my main Subversion repository, and then made a separate commit on my other computer, without syncing the Subversion repositories in between. Subversion then identified conflicts inside the .git folder, and as you may imagine, conflicts in this situation are nothing short of painful. Ultimately what I ended up doing was temporarily setting up a remote repository for the Git repository, pushing from one computer, and then dealing with the conflict when pulling from the other. I could then sync the two Git repositories through the remote. I deleted the repository on the first computer and re-cloned it from the web, then completely deleted it on my other computer and recovered it from the first through Subversion.
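Roughly, the untangling looked like this; the remote URL, branch name, and paths are placeholders rather than the actual ones involved:

    # On computer A: publish the diverged Git repository to a temporary remote.
    cd ~/files/projects/myproject
    git remote add temp https://git.example.com/me/myproject.git
    git push temp main

    # On computer B: pull A's commits, resolve the merge, and push the result back.
    cd ~/files/projects/myproject
    git remote add temp https://git.example.com/me/myproject.git
    git pull temp main        # fix conflicts here, then commit the merge
    git push temp main

    # With the Git histories reconciled, let Subversion settle on one copy of .git:
    # keep the repository on one machine and commit it,
    svn commit -m "Reconcile nested Git repository"
    # and on the other machine discard the local copy and take the committed one.
    svn revert -R projects/myproject
    svn update projects/myproject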
If you are interested in the technical details of how such a Subversion server is set up, continue reading part two.