
RelFS: A Hypothetical Tag-Based File System by Nayuki


A Reddit thread mentioned the article "Designing better file organization around tags, not hierarchies" by Nayuki, which was last updated in early 2017.

Since I have some scientific and hands-on background on the topic of file management by tags, I would like to comment on this article. Unfortunately, Nayuki does not provide any comment feature on his web page. Hence, I'm publishing this comment on my blog.

The article by Nayuki is written in the style of a classic white paper for a research conference or a scientific journal. It consists of the following sections:

I "printed" the web page into a PDF file which resulted in 29 pages of A4 with my page settings. Therefore, it's a lengthy paper that requires some time to read carefully. Nevertheless, it offers good content for readers who are into file management in general and tag-based approaches in particular. I really do recommend it. If you continue reading my comments now, you should have read the article beforehand in order to get the most out of my comments as I won't repeat most content from the article I'm referring to.

The Status of the Project

First of all, I've skimmed through the list of blog articles by Nayuki starting with 2017 and was not able to find anything related. Looking at his list of GitHub repositories, I found the Relational File System (RelFS) that currently holds three commits from December 2017. The one document in this repository summarizes the main parts of the RelFS concept.

The only GitHub issue is a question about the status of the project, which was not answered by Nayuki. However, he responded with a thumbs-up to the answer of another person who basically wrote "this repo is to track changes/updates to that design". According to these observations, RelFS is a one-man concept that has not been updated for a couple of years and has not attracted any peers yet. At least, we now have a name for this hypothetical tag-based file system: RelFS. :-)

Context

Nayuki is trying to come up with some fresh ideas on the issue of how to organize files. He's starting from scratch, not caring about implementation issues or compatibility with existing systems, at least for the most part.

What I found interesting was his notion that he feels limited by the fact that every file in today's file systems needs to have a unique file name within one single directory. (Read section "Folder vs. Directory" of this article if you want to learn about the difference between "folders" and "directories" from my perspective.) Interestingly, I myself was never worried about that specific limitation of file systems. Maybe I accepted this as part of my way of judging things as a tech-savvy person wearing blinkers. Maybe I was never irritated by it because I tend to use long, descriptive file names according to my file name convention.

Nayuki does not want to name his files. He sympathizes with computer-generated file names and concepts where names are defined by and directly derived from the content similar to the inner workings of Git, restic, IPFS or others.
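
To illustrate what a content-derived name means, here is a minimal Python sketch. It assumes a plain SHA-256 digest over the file bytes and a hypothetical helper called content_name; Git, restic and IPFS each use their own schemes, so this is not the exact format of any of those tools.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Return a name derived only from the content (hex SHA-256 digest).

    Hypothetical helper for illustration; not the ID format of Git, restic or IPFS.
    """
    return hashlib.sha256(data).hexdigest()


a = content_name(b"hello world\n")
b = content_name(b"hello world\n")
c = content_name(b"hello world!\n")

print(a == b)  # identical content yields the identical name
print(a == c)  # any change in content yields a different name
```

The user never chooses a name here, and two copies of the same content automatically end up under the same identifier.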

Escaping the limitations of single-classification, he clearly goes for multi-classification using labels, tags or meta-data in general, getting rid of not only file names but also directory paths.

For most parts of the article, I do have the very same or similar opinion compared to Nayuki. However, I do want to express some of my concerns with his approach in the following sections. The order of my comments does not reflect the order of topics mentioned in the article.

Immutable Files

The highest impact on the rest of the concept results from his definition of a file:

Define a "file" to be an immutable, finite sequence of bytes.

Unfortunately, I do have many issues with this self-limitation. Although most of the files I curate on my personal storage devices are immutable (not modified from the time they entered my system until now, and probably never will be), the files that do get changed over time by adding, removing or changing content tend to have a higher value to me.

Later on, Nayuki addresses the issue in the open questions section of his article when he refers to mutable files with the following words:

I think the easiest way to deal with this is to exclude it from my scope. My goal is to design a system that can organize and access a set of independent, timeless facts. The ability to deal with mutable files would only dilute the impact of this system, and introduce multitudes of technical issues regarding data modeling, interoperability, semantics, etc.
Thus, probably the best solution is to have a traditional hierarchical file system for managing mutable files. Once a user is satisfied, she will choose to import the finished files into the immutable tag-oriented data store.

In other words, even with a brand new concept for this hypothetical RelFS, Nayuki needs a traditional hierarchical file system (HFS) in parallel in order to work on files.

At this point, this new concept of RelFS is limited to the archive section of the personal data storage only. Maintaining both systems will not solve the issues that got listed in the motivation part of the article. Quite the contrary: with two systems in place, the user is confronted with a fragmentation issue she did not have before. Where is my file? Was it considered mutable by my past self (HFS) or do I have to use a completely different retrieval workflow to locate it in the archive of the immutable files (RelFS)? I don't see any advantage to the common user here.

Furthermore, in a different section, Nayuki mentions that applications need to have specific parts that deal with storing and retrieving files. Those parts would have to support both of these completely different storage patterns as well. This is a very large issue for the practical implementation of such a system. I cannot think of any solution that the average user could be convinced of.

Personal versus Collaborative

With the introduction of the concept of tagging as the one and only retrieval tool for RelFS, Nayuki needs to discuss some aspects of applying tags to files. He specifically mentions Danbooru-style boards where retrieval is independent of file names, collaborative tagging is applied and tags themselves can be annotated.

It might be a subtle comment I want to give at this point. To me, there is a huge difference between personal and collaborative information management. This is particularly true for social tagging (or folksonomy) in contrast to a personal set of tags.

Independent of my personal recommendation of curating a limited set of tags within a controlled vocabulary, there are some fundamental differences between many persons tagging one item and one person applying keywords to the same item. Those aspects should be discussed separately and not mixed together. Unfortunately, this article mixes those concepts multiple times. For example when adding the interesting concepts of shared tag cores:

One extension of using tag cores is that we can create a public vocabulary of tags with universally accepted meanings.

Whenever something like this is introduced, many different questions arise that need to be addressed as well. For example, who is responsible for curating tags or tag relationships? What does the process look like when a universally accepted meaning for a keyword needs to be created? I highly doubt that this is possible at all.

In addition to those social questions, there are some technical implications as well. Where are those tags stored? How can those tags be synchronized among all people using RelFS? What about pushing changes because of a changed common agreement? This is a really messy topic to discuss.

Another aspect of sharing points of view on tags comes up when files are transferred from one person to another:

Selecting metadata
If you choose to copy a photo from one storage device to another, which tags pertaining to this photo will also be copied? This question does not have a consistent answer in conventional HFSes (there are many conflicting ad hoc semantics among implementations). There doesn’t appear to be a universal answer in a tag-based system either. There are at least two aspects to the problem. Which types of tags get copied – name tags, title tags, timestamp tags, derivation tags? How deeply should higher-order tags be copied – tag cores, tag implications, notes about tags, etc.?

This is almost impossible to define for one single file that has multiple tags assigned. When a large set of different files is transferred, this becomes a more or less impossible task if personal tagging preferences are to be taken into account.

Storage Devices and Permissions

In order to discuss the idea of using storage devices as a tool to separate access permissions, I need to quote this whole sequence:

Conventional HFSes manage private files by the mechanism of directory and file permissions (ownership, read/write, etc.). However this leaks information because there is always some point up the hierarchy where an unauthorized user knows of the existence of private files. For example, it might be the case that the user directory /home/john is readable by everyone, and the subdirectory /home/john/private is the starting point where only John has read access. I propose to manage private files by storage attachment: if you want to see private files, you need to be able to be to attach a particular storage device (enforced out of band through OS-level permissions or by encryption). If you have sensitive financial documents stored on a separate device (even if it’s an HDD partition), you can unmount it when you don’t need it, so that a rogue program won’t be able to access the sensitive data.

I highly doubt that this is working in practice. Not in our modern multi-threaded world. Without a permission system in place, you don't have any control at all.

This means that for example, a password manager app can be the only one that accesses a privileged storage device containing user passwords.

If you are using storage devices as a supplement for access permissions, you will end up with many different storage devices. At least dozens. On my side, probably even more, when I think of it in detail.

Storage devices are equal to physical storage devices, at least according to the article. Even if I added logical devices as an additional option to this concept, I would still end up with many different storage devices, each having a specific and pre-defined storage size. The overall free space would be fragmented into many parts. As a geek, I do not want to deal with such a situation. Although I might still have a couple of unused gigabytes, I would not be able to use them up properly when those gigabytes are scattered over many different devices. Any non-geek person would certainly not be able to manage this in an efficient way.
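
The fragmentation effect can be made concrete with a small Python sketch. The device names and sizes are made-up assumptions for illustration only.

```python
# Hypothetical free space (in GB) on four separate storage devices.
free_gb_per_device = {"finance": 2, "passwords": 1, "mail": 3, "photos": 2}


def fits_somewhere(file_gb, free):
    """A file has to fit on one single device; free space cannot be pooled."""
    return any(space >= file_gb for space in free.values())


total_free = sum(free_gb_per_device.values())
print(total_free)                             # 8 GB free in total
print(fits_somewhere(5, free_gb_per_device))  # False: no single device has 5 GB
```

Although 8 GB are free overall, a 5 GB file cannot be stored anywhere, because the largest single device only offers 3 GB.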

Your email client software should be designed to list email messages from all available storage devices.

In my opinion, we do have a mismatch on responsibility here. I want this to be handled by the operating system and not by each individual software application. The underlying operating system has to provide consistent access to the storage layer.

Even worse, with dozens or hundreds of different storage devices, the UI proposal with "one icon per storage device" in every file open/save dialog does not scale at all. This would end in a big mess for the user.

And this gets even worse when you think of the fact that we still need a traditional hierarchical file system (HFS) in parallel to RelFS, as mentioned before.

Levels of Detail

Nayuki clearly has a broad technological background. This is obvious to the reader when Nayuki switches from explaining high-level concepts down to discussing implementation details and vice versa. In my opinion, this is not good for any kind of reader, with or without background in technology. The overall goal was to introduce a concept, ignoring implementation details. At various points, Nayuki is violating this pattern and changes to questions of implementation details.

On the one hand, he writes about tags, immutable files, not relying on file paths and so forth. On the other hand, he mentions details of concepts from RDBMS, IPFS, and implications for typing schemes and their verification.

I was also a bit irritated by the granularity of his concept. Most of the time, he writes about files just as we are using them these days: "a file and nothing but a file". He even mentions file extensions like PDF that should be added to files just like any other tag. In one section, he goes further. He thinks that RelFS files should be more fine-grained than this: each message within a conversation should be a file of its own, and so should each single temperature measurement within a series of measurements.

As much as I sympathize with Information-Centric Systems, the reader most likely cannot follow here. What is supposed to be file-level? The whole article I'm writing just now? Each section of the article? Or even each paragraph or each sentence? There might be a fruitful discussion hidden in this notion. When it comes to real-life situations where information can only be retrieved by assigned tags, I doubt that an excessively fine granularity turns out to be of high value to the average user.

Semantics

Although Nayuki obviously has many touch-points with it, he never mentions the concept of semantic triples or RDF. He clearly wants to put things into relation, for example when he mentions relations between tags. And within one single tag, he wants to express different meanings for the very same word and so forth.

Well, this is the classic domain of semantic triples, where a subject is related to an object by a specific predicate. This is the only viable way of defining relations such that a structured retrieval process can be supported. This way, a user is able to search for "flower" and the system is able to "understand" that "tulip" is an instance of the class "flower", so that files tagged with "tulip" appear in the set of search results although they weren't tagged with "flower" in particular.
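
The flower example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples, the predicate names "is_a" and "tagged", and the file names are all hypothetical, and a real system would rather use an RDF store and a query language such as SPARQL.

```python
# Each triple is (subject, predicate, object). All names are hypothetical.
TRIPLES = {
    ("tulip", "is_a", "flower"),
    ("rose", "is_a", "flower"),
    ("flower", "is_a", "plant"),
    ("photo1.jpg", "tagged", "tulip"),
    ("photo2.jpg", "tagged", "rose"),
    ("photo3.jpg", "tagged", "car"),
}


def narrower_terms(term):
    """Return the term itself plus everything that is transitively 'is_a' term."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in TRIPLES:
            if pred == "is_a" and obj == current and subj not in found:
                found.add(subj)
                frontier.append(subj)
    return found


def search(term):
    """Return all files tagged with the term or with any narrower term."""
    terms = narrower_terms(term)
    return {subj for subj, pred, obj in TRIPLES if pred == "tagged" and obj in terms}


print(sorted(search("flower")))  # ['photo1.jpg', 'photo2.jpg']
```

Neither photo carries the tag "flower" itself; both are found only because the "is_a" triples are followed during retrieval.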

Multiple notions of this article reminded me of those concepts. They were never mentioned except in the descriptions of the citation of TagFS and indirectly with Tagsistant. Therefore, it gave me the impression that the wheel is being re-invented here and there.

Retrieval Process

When using a system like RelFS, one of the biggest benefits would be a smooth file retrieval process. Having the advantage of multi-classification, the user doesn't need to remember storage paths. Instead, she would be able to retrieve information by filtering using tags.

If the proposed file system only supported simple tags, then tag queries are baked into the API. But with complex and custom tags, how do we express queries regarding files, field values, and references? It is likely that the full power of the relational database model is required to express useful queries.

I agree with this statement. If there are only tags and no storage paths, I'd need a sophisticated search functionality. At least when navigation using my TagTrees is not an additional retrieval option.

From my experience with many different levels of computer users, I don't think that using SQL would be acceptable to the large majority of people. And what use is a nice tag-based system when retrieval tasks require advanced technological skills?
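
To give an impression of what even a simple tag query looks like in SQL, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (one row per file and tag), the file IDs and the tags are assumptions made up for this illustration; the RelFS article does not prescribe such a schema.

```python
import sqlite3

# Hypothetical schema: one row per (file, tag) pair in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_tags (file_id TEXT, tag TEXT);
    INSERT INTO file_tags VALUES
        ('a1', 'invoice'), ('a1', '2017'),
        ('b2', 'invoice'), ('b2', '2016'),
        ('c3', 'photo'),   ('c3', '2017');
""")


def files_with_all_tags(tags):
    """Return IDs of files that carry every tag in the list (tags must be distinct)."""
    placeholders = ",".join("?" for _ in tags)
    query = f"""
        SELECT file_id FROM file_tags
        WHERE tag IN ({placeholders})
        GROUP BY file_id
        HAVING COUNT(DISTINCT tag) = ?
        ORDER BY file_id
    """
    rows = conn.execute(query, (*tags, len(tags))).fetchall()
    return [row[0] for row in rows]


print(files_with_all_tags(["invoice", "2017"]))  # ['a1']
```

Even this basic "has both tags" filter needs GROUP BY and HAVING, which is hardly something the average user would type into a search box.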

Containers

The article discusses the obvious requirement of handling a set of files as one entity. Nayuki introduces the concept of containers or bundles where selected files can be handled accordingly.

From my point of view, those containers also need to be nestable in order to be of value. For example, when you choose to have a fine granularity (a document container that consists of multiple sub-files including image files) and need to transfer several of those document containers as one entity, they form another container. This aspect of nested containers was not mentioned directly.

Anyhow, I could not think of any property of those containers that they do not share with classic directories. You would still need some way of expressing "paths" pointing to those different levels of nesting. However, getting rid of those paths or directories was part of the motivation in the first place. This rather obvious contradiction is also not mentioned in the article.

Other Open Questions

In general, the section on open questions lists some still unsolved issues with this concept.

To be honest, I did not understand the part on cyclic references. I did not get the issue and I was more than irritated by the proposed solution.

A good hash algorithm is a crucial part of the whole concept. Potential issues are mentioned at multiple spots. A very bold statement is:

256 bits should be more than enough space to prevent hash collisions for all of human history going forward.

Considering the level of granularity that was already mentioned ("each single temperature measurement is a file") and taking into account that a very large number of files get shared and referenced on the internet, this is a necessary thing to have. I'm not good with estimations on large numbers. This section on Wikipedia claims that SHA-256 has a "Security (in bits) against collision attacks" of 128, which means that an attacker needs on the order of two to the power of 128 hash computations to find a collision. According to this discussion or that discussion, the likelihood of accidental collisions for SHA-256 is more or less non-existent. To be precise, we have to assume that the SHA-256 hash algorithm has no flaw that a malicious person can use to provoke collisions. For situations without malicious math geniuses finding future flaws, those hash sums are safe. This had better be true because:

In order to not deal with all this messiness of changing algorithms, the best hope is to choose a hash algorithm that will never be broken in the future.
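
For a rough feeling for the numbers involved, the following Python sketch computes an upper bound on the probability of an accidental collision. It assumes an ideal hash function with uniformly distributed 256-bit outputs, and the workload of 10**18 files is a made-up figure for illustration. It says nothing about deliberate attacks that exploit a flaw in the algorithm.

```python
from fractions import Fraction


def collision_probability_bound(num_files, hash_bits=256):
    """Upper bound for any accidental collision among num_files items.

    Union bound: (number of file pairs) / (number of possible hash values),
    assuming an ideal, uniformly distributed hash function.
    """
    pairs = num_files * (num_files - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Hypothetical workload: 10**18 files, e.g. very many tiny measurement files.
p = collision_probability_bound(10**18)
print(float(p))  # roughly 4e-42
```

Even with such an enormous number of fine-grained files, the chance of an accidental collision stays far below anything of practical concern.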

Being the Party Pooper

After all this criticism, please remember what I wrote before: For most parts of the article, I do have the very same or similar opinion compared to Nayuki. It's a very good article that reflects major pain points of "modern" computer systems that hardly anybody seems to recognize.

I just do not think that a realization of this concept will result in something that works in practice and provides substantial progress for the average user.

In summary, I don't think that a concept like RelFS will ever hit the average desktop in our lifetime. Too many clever concepts of tag-based file systems never made it beyond the research paper level. I can provide even more papers than the RelFS article lists in its references.

My Approach

I already blogged about the opinion that our software ecosystems are more or less frozen and real innovation happens in hardware only. From my point of view, real innovation in software is far from having a realistic chance. Backward-compatibility as well as "it's good enough" provide some kind of lock-in effect for our computer systems. The average computer user does not even use Ctrl-C and Ctrl-V to access the clipboard [Lane 2005: Hidden costs of graphical user interfaces]. Therefore, we also have a severe issue with computer literacy. This is not the best situation to discuss improvements on a conceptual level when we do not even take advantage of the most basic possibilities to maximize efficiency that we have nowadays.

My personal approach is to mitigate the downsides of current systems instead of proposing a green-field concept without chances of making it to the desktop.

Read about and try out the tools I've developed myself using multi-classification. I've contributed to the Personal Information Management (PIM) research field as well (a bit) with respect to navigation (in contrast to search) and also provided some ideas based on tag-based navigation.

Non-academic contributions from the lessons learned from my research activities:

If you would like to take part in the discussion, please leave a comment below.
