Files, Folders and Streams
File systems have been around for a long time. Even before the advent of computers, most organizations stored their paper files in manila folders that were, in turn, stored in filing cabinets. As computing moved beyond the card-and-deck stage, the need to persist code and data on magnetic media forced the introduction of files as abstractions, with the idea being that a file consisted of a chain of linked blocks, read into memory until a block was reached whose pointer had no corresponding block of content.
Folders in turn were locations in memory that held a title and date (that is to say, metadata). The first (or control) block of the file would then contain a pointer to the “folder” that contained it. Such folders could also incorporate pointers to other folder blocks. If a folder didn’t have a link to a parent, then it was considered a root folder. This gave rise to the file-folder paradigm: all files were contained within some nested folder of arbitrary depth, creating a tree structure with the notion that a file could then be located along a file path from the file to the root node.
This bottom-up structure made traversing to the root very fast, but the simple act of getting a directory listing meant traversing the tree in the opposite direction. This was a considerably slower process, as you had to search for all files that had the folder as a parent. To facilitate this, most operating systems created a reverse lookup table that held folder locations and pointers from each folder to its children (files or folders). This lookup table was a very early database index. When a new file or folder was created, the index was updated. Should the table get corrupted (as it did with depressing regularity), the file system would then have to traverse the file blocks in reverse order to reconstruct the index, a process that could take minutes or even hours.
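The asymmetry described above can be sketched in a few lines of Python. This is an illustrative toy, not any real file system’s layout: each record holds only a parent pointer, walking to the root is cheap, a directory listing without an index requires a full scan, and rebuilding the reverse index means visiting every record.

```python
# Hypothetical parent-pointer layout: each file/folder record stores only
# a link to its parent.
records = {
    "/":     {"parent": None},
    "docs":  {"parent": "/"},
    "a.txt": {"parent": "docs"},
    "b.txt": {"parent": "docs"},
    "img":   {"parent": "/"},
}

def path_to_root(name):
    """Walk parent pointers upward -- fast, no index needed."""
    chain = []
    while name is not None:
        chain.append(name)
        name = records[name]["parent"]
    return chain

def list_folder_scan(folder):
    """Directory listing without an index: scan every record."""
    return sorted(n for n, r in records.items() if r["parent"] == folder)

# The reverse lookup table, rebuilt by traversing all records --
# this is the reconstruction step a corrupted index forces.
index = {}
for name, rec in records.items():
    index.setdefault(rec["parent"], []).append(name)

print(path_to_root("a.txt"))     # ['a.txt', 'docs', '/']
print(list_folder_scan("docs"))  # ['a.txt', 'b.txt']
print(sorted(index["/"]))        # ['docs', 'img']
```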
File records, consequently, contained a very minimal amount of information. Because file systems could contain millions of files, most changed comparatively slowly once the basic form was established.
One of the biggest shifts in computing today is the transition from hierarchies (trees) to graphs. Trees perforce require the creation of specialized indexes to search across a tree’s breadth. In a semantic knowledge graph, on the other hand, there is one master index (the triple store) that can be traversed as readily top-down as bottom-up.
Moreover, all trees are graphs, but not all graphs are trees. This means that it is possible to create relationships between arbitrary nodes, not just between parent and child. These would be considered symbolic links in a file system and were assigned parsimoniously because they necessitated very specialized treatment (they could create circular graphs). In a semantic graph, a symbolic link is simply a link, and circularity is a known problem with known solutions.
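The point about symbolic links and circularity can be made concrete with a toy triple set. The predicates here (`contains`, `linksTo`) are illustrative stand-ins, not a standard vocabulary: parent-to-child and child-to-parent are just two query patterns over the same edges, a symlink is one more triple, and a visited set is the standard fix for cycles.

```python
# A toy edge set standing in for a triple store.
triples = {
    ("/", "contains", "docs"),
    ("docs", "contains", "a.txt"),
    ("/", "contains", "img"),
    ("img", "linksTo", "docs"),  # a "symlink": just another edge
}

def children(node):
    """Top-down: match (node, contains, ?child)."""
    return sorted(o for s, p, o in triples if s == node and p == "contains")

def parent(node):
    """Bottom-up: match (?parent, contains, node) -- same edges, reversed pattern."""
    return next((s for s, p, o in triples if o == node and p == "contains"), None)

def reachable(start):
    """Graph walk with a visited set -- the standard answer to circularity."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack += [o for s, p, o in triples if s == n]
    return seen

print(children("/"))    # ['docs', 'img']
print(parent("a.txt"))  # 'docs'
```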
The Benefits of a Semantic File System
Once you realize that a file system is a linked structure, it’s a short hop from there to the idea that you could in fact model a file system in RDF; the question comes down to why you would want to. It turns out that there are quite a number of good reasons to build a file system on top of RDF, creating what could be described as a Semantic File System (SFS):
- File Abstraction. If a file is treated as a resource, then there exists a URI (an identifier) that serves as the control block for that resource. This control block could in turn point to a file stream in another filesystem (such as on the drive that contains the knowledge graph), an external resource on the web, or a literal contained in the graph itself (either as text or as Base64-encoded binary). From the perspective of the graph, the file URI is an abstraction.
- Very Fast Search. If you have a path fragment, you can retrieve all file resources that have that fragment in a very straightforward SPARQL query, without having to know the full path or needing to traverse the entire graph. This makes for much faster file retrieval.
- Aliasing. You can also assign multiple potential path aliases to the same resource. This means that the same file can be referenced in multiple ways, and that you can walk up the file path by truncating it and then looking for the resource with the corresponding truncated “directory” name.
- Categorization. You can apply a category to a given file then create a virtual folder for that category that contains all of the files and folders in that category, regardless of where they are in the system.
- Faceting. This comes in handy especially when applying multiple categories to the same virtual folder – find all images about cats that are also about diet.
- Auto-categorization. If your system does autoclassification, then simply by adding the file into the “file system”, the file will be auto-classified and will automatically show up in a magical folder. This is a powerful tool for organization that significantly reduces the overhead of curating content. It also means that if you have any rules that apply to a particular category, the rule will be performed transparently.
- Packaging. File systems tend to be at the heart of most packaging schemes. By encoding file representations (whether inline or virtual) as RDF, you can take advantage of the graph stack to copy only necessary files, build dependency graphs, and switch on or off specific configurations with minimal (or even no scripting) involved.
- ACLs and Security. This similarly applies to permissions and security, with the possibility that Access Control can be determined dynamically based on arbitrary rules and constraints (such as those derived from SHACL validation). This can be a real boon for managing access to private information, as static controls can become too rigid and complex when you have thousands or more different assets, especially when these become time-dependent.
- Managing Content Type and Serialization. Typically file systems rely upon a complex web of MIME types, file name extensions, and usage rules to determine how a specific file is retrieved or rendered. With a semantic system, this information can be encapsulated through a content-type profile in RDF that can either take a user parameter or gracefully degrade to alternative serializations. For instance, a mostly tabular dataset expressed as a CSV could be returned as JSON, XML, or possibly transformed into Turtle or an Excel document should the user request it.
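Several of the benefits above (fragment search, aliasing, categorization, faceting) reduce to simple pattern matches over triples. The following sketch models them over a toy triple set; the predicates (`path`, `category`) and file identifiers are hypothetical, and in a real SFS these would be SPARQL queries (fragment search corresponding roughly to a `FILTER(CONTAINS(...))` clause).

```python
# Toy triple set for one image with two path aliases and two categories.
triples = {
    ("file:1", "path", "/pets/cats/fluffy.jpg"),
    ("file:1", "path", "/images/fluffy.jpg"),  # alias: a second path
    ("file:1", "category", "cats"),
    ("file:1", "category", "diet"),
    ("file:2", "path", "/pets/dogs/rex.jpg"),
    ("file:2", "category", "dogs"),
}

def by_fragment(fragment):
    """All file resources whose path contains a fragment -- no tree walk."""
    return sorted({s for s, p, o in triples if p == "path" and fragment in o})

def virtual_folder(*categories):
    """Faceting: files tagged with every listed category."""
    tagged = {s for s, p, _ in triples if p == "category"}
    return sorted(
        s for s in tagged
        if all((s, "category", c) in triples for c in categories)
    )

print(by_fragment("fluffy"))           # ['file:1'] -- found via either alias
print(virtual_folder("cats", "diet"))  # ['file:1'] -- "images about cats that are also about diet"
```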
Implementing a Semantic File System
In practice, triple stores with a scripting interface layer (TopQuadrant’s EDG, AllegroGraph, and Stardog, at last count) could effectively implement a Semantic File System, with Neo4j and a few others outside the RDF space also able to create something analogous. Going from the SFS to a CLI primarily involves defining enough of an API foundation to retrieve lists of URLs representing the results of queries, change folders, create or delete files or folders, and so forth. Significantly, this can also be tied into named RDF graphs, each of which provides a handle to a graph for specific operations.
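As a rough illustration of that API surface, the sketch below models the CLI operations as small queries and updates against a quad set, where the first element of each quad is a named-graph handle. Every name here is hypothetical; a real implementation would issue SPARQL queries and updates against the store.

```python
class SFS:
    """Toy semantic file system: quads of (graph, subject, predicate, object)."""

    def __init__(self):
        self.quads = set()
        self.cwd = "/"

    def mkdir(self, graph, name, parent="/"):
        """Create a folder: assert a containment triple in a named graph."""
        self.quads.add((graph, parent, "contains", name))

    def touch(self, graph, name, folder):
        """Create a file: same mechanism, different node type in a real system."""
        self.quads.add((graph, folder, "contains", name))

    def ls(self, graph, folder=None):
        """List children: a pattern match, not a tree traversal."""
        folder = folder or self.cwd
        return sorted(o for g, s, p, o in self.quads
                      if g == graph and s == folder and p == "contains")

    def cd(self, folder):
        self.cwd = folder

    def rm(self, graph, name):
        """Delete: retract every quad pointing at the resource."""
        self.quads = {q for q in self.quads if not (q[0] == graph and q[3] == name)}

fs = SFS()
fs.mkdir("work", "docs")
fs.touch("work", "notes.txt", "docs")
fs.cd("docs")
print(fs.ls("work"))  # ['notes.txt']
```

The named-graph handle is what makes operations like packaging or access control scoped: the same resource can appear in several graphs, each governed separately.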
At the moment, however, the only wide-scale SFS is the still highly experimental Solid Project implementation, based upon Tim Berners-Lee’s efforts in 2017 to create a freely available distributed data system (via Inrupt.com) and a corresponding API that could be used to store and access files within a knowledge graph. The abstraction is powerful: such virtual files could be images, videos, text or binary files, links to external files, process pipelines, data feeds, or service calls hidden beneath the abstraction layer of the SFS. Moreover, as each Solid node (or Pod) has a public web address, it’s not difficult to imagine Solid as a protocol for federation, something that has been desired for some time but has been remarkably difficult to implement at the proprietary layer.
The Semantic File System, or SFS, is a powerful potential application for knowledge graphs that moves beyond the base triple store space and instead points toward a truly distributed file system spanning any number of different kinds of servers. It treats files (and folders) as malleable abstractions; it can look hierarchical while recognizing that a file system has the potential to be not just a taxonomy but a tool that reaches into the hypergraph realm of ontologies. A Semantic File System sees metadata as more than a tiny set of fixed attributes, and as such can be much more powerful in managing identity and controlling access. Finally, Semantic File Systems are an integral part of a CLI for a distributed data fabric.