File systems have been around for a long time. Even before the advent of computers, most organizations stored their paper files in manila folders that were, in turn, stored in filing cabinets. This set the stage for the folder and file metaphor, but it’s important to understand that under the covers, we’re a long way from manila folders and aluminum drawers.
Files, Folders, and Streams
As computing moved beyond the card and deck stage, the need to persist code and data on magnetic media forced the introduction of files as abstractions: a file consisted of linked blocks pulled into memory, one after another, until a pointer was encountered that had no corresponding block of content.
Folders, in turn, were locations in memory that held a title and date (that is, metadata). The file’s first (or control) block would then contain a pointer to the “folder” that contained it. Such folders could also incorporate pointers to other folder blocks; if a folder didn’t have a link to a parent, it was considered a root folder. This gave rise to the file-folder paradigm – all files were contained within some nested folder of arbitrary depth, creating a tree structure in which any file could be located along a path from the file up to the root node.
This bottom-up structure made traversing to the root very fast, but the seemingly simple process of getting a directory listing meant traversing the tree in the opposite direction. This was considerably slower, as you had to search for all files with the folder as a parent. To speed this up, most operating systems created a reverse lookup table that held folder locations and pointers from each folder to its children (files or folders). This lookup table was a very early database index. When a new file or folder was created, the index was updated. Should the table get corrupted (as it did with depressing regularity), the file system would have to traverse the file blocks in reverse order to reconstruct the index, a process that could take minutes or even hours.
File records, consequently, contained a very minimal amount of information, because file systems could contain millions of files, and the record format changed slowly once the basic form was established.
One of the biggest shifts in computing today is the transition from hierarchies (trees) to graphs. Trees perforce require the creation of specialized indexes to search across a tree’s breadth. In a semantic knowledge graph, on the other hand, there is one master index (the triple store) that can be traversed as readily top-down as bottom-up.
Moreover, all trees are graphs, but not all graphs are trees. This means that it is possible to create relationships between arbitrary nodes. In a file system, these would be symbolic links, which were assigned parsimoniously because they necessitated very specialized treatment (they could create circular graphs). In a semantic graph, a symbolic link is simply a link, and circularity is a known problem with known solutions.
The Benefits of a Semantic File System
Once you realize that a file system is a linked structure, it’s a short hop from there to the idea that you could model a file system in RDF; the question comes down to why you would want to. It turns out that there are quite a number of good reasons to build a file system on top of RDF – what could be described as a Semantic File System (SFS):
- File Abstraction. If a file is treated as a resource, then a URI (an identifier) exists that serves as the control block for that resource. This control block could point to a file stream in another file system (such as on the drive that contains the knowledge graph), an external resource on the web, or a literal contained in the graph itself (either as text or as Base64-encoded binary). From the perspective of the graph, the file URI is an abstraction (see the modeling sketch after this list).
- Very Fast Search. If you have a path fragment, you can retrieve all file resources containing that fragment with a very straightforward SPARQL query, without knowing the full path or needing to traverse the entire graph (see the query sketch after this list). This makes for much faster file retrieval.
- Aliasing. You can also assign multiple path aliases to the same resource, which means the same file can be referenced in multiple ways. You can walk up the file path by truncating the path and then looking for the resource with the corresponding truncated “directory” name.
- Categorization. You can apply a category to a given file and then create a virtual folder that contains all of the files and folders in that category, regardless of where they are in the system.
- Faceting. This comes in handy especially when applying multiple categories to the same virtual folder – for instance, find all images about cats that are also about diet.
- Auto-categorization. If your system does autoclassification, then simply by adding the file into the “file system”, the file will be auto-classified and will automatically show up in a magical folder. This is a powerful tool for organizations, as it significantly reduces the overhead of curating content. It also means that if you have any rules that apply to a particular category, the rule will be applied transparently.
- Packaging. File systems tend to be at the heart of most packaging schemes. By encoding file representations (whether inline or virtual) as RDF, you can use the graph stack to copy only necessary files, build dependency graphs, and switch specific configurations on or off with minimal (or even no) scripting involved.
- ACLs and Security. This similarly applies to permissions and security, with access control potentially being determined dynamically based on arbitrary rules and constraints (such as those derived from SHACL validation). This can be a real boon for managing access to private information: static controls become too rigid and complex when you have thousands or more distinct assets, especially when access becomes time-dependent.
- Managing Content Type and Serialization. Typically, file systems rely upon a complex web of MIME types, file name extensions, and usage rules to determine how a specific file is retrieved or rendered. A semantic system can encapsulate this information through a content-type profile in RDF that can either take a user parameter or gracefully degrade to alternative serializations. For instance, a mostly tabular dataset expressed as CSV could be returned as JSON or XML, or transformed into Turtle or an Excel document, should the user request it.
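To make the file-as-resource idea concrete, here is a minimal modeling sketch in Python using rdflib. The sfs: vocabulary, the example URIs, and the property names are all hypothetical stand-ins; a real SFS would define (or reuse) its own ontology for these properties.

```python
# A minimal sketch of a file resource modeled in RDF with rdflib.
# The sfs: vocabulary below is hypothetical.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SFS = Namespace("http://example.org/sfs#")

g = Graph()
g.bind("sfs", SFS)

file_uri = URIRef("http://example.org/files/report-2023")

# The URI acts as the "control block"; properties point at the path,
# aliases, categories, and the actual content location.
g.add((file_uri, RDF.type, SFS.File))
g.add((file_uri, SFS.path, Literal("/projects/acme/reports/report-2023.csv")))
g.add((file_uri, SFS.alias, Literal("/reports/latest.csv")))
g.add((file_uri, SFS.category, URIRef("http://example.org/categories/finance")))
# Content can live elsewhere -- on disk, on the web, or inline as a literal.
g.add((file_uri, SFS.contentLocation, URIRef("file:///data/report-2023.csv")))
g.add((file_uri, SFS.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))
```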
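Continuing from the graph g built in the previous sketch, here is what the path-fragment search and the faceted virtual folder might look like as SPARQL queries. Again, the sfs: vocabulary and the category URIs are hypothetical.

```python
# Path-fragment search: find files whose path contains a fragment,
# without knowing the full path or walking a tree.
fragment_query = """
PREFIX sfs: <http://example.org/sfs#>
SELECT ?file ?path WHERE {
    ?file a sfs:File ;
          sfs:path ?path .
    FILTER(CONTAINS(?path, "reports"))
}
"""

# Faceted virtual folder: intersect two categories -- e.g., all
# resources that are about both cats and diet.
facet_query = """
PREFIX sfs: <http://example.org/sfs#>
SELECT ?file WHERE {
    ?file a sfs:File ;
          sfs:category <http://example.org/categories/cats> ;
          sfs:category <http://example.org/categories/diet> .
}
"""

for row in g.query(fragment_query):
    print(row.file, row.path)
```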
Implementing a Semantic File System
When you get down to it, a file system is another form of taxonomy – a way of organizing resources within a system. It is not a full operating system, though once you have the components necessary to create a semantic file system, it becomes possible from there to make a command line interface (CLI) that could invoke functions that are, in turn, tied into the semantic underpinnings of such a system. This would look a lot like the Node.js (JavaScript) or Python command line interface and could become critical for creating complex pipeline scripts.
This means that triple stores with a scripting interface layer – TopQuadrant’s EDG, AllegroGraph, and Stardog, at last count – could effectively implement a Semantic File System (SFS), with Neo4j and a few others outside the RDF space also able to create something analogous. Going from the SFS to a CLI primarily involves defining enough of an API foundation to retrieve lists of URLs representing the results of queries, change folders, create or delete files or folders, and so forth; a rough sketch of such a layer appears below. Significantly, this can also be tied into named RDF graphs, each of which provides a handle to a graph for specific operations.
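As a back-of-the-envelope sketch, here is how a couple of shell-style commands might translate into SPARQL operations against a triple store. The sfs: vocabulary and the in-memory rdflib Graph are stand-ins for a real triple store endpoint; the URI-minting scheme in touch() is purely illustrative.

```python
# Sketch of a CLI layer over an SFS: each command becomes a SPARQL
# query or a set of triple insertions. All vocabulary is hypothetical.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SFS = Namespace("http://example.org/sfs#")
g = Graph()
g.bind("sfs", SFS)

def ls(folder_path: str):
    """List resources whose path starts with folder_path."""
    query = """
    PREFIX sfs: <http://example.org/sfs#>
    SELECT ?path WHERE {
        ?file sfs:path ?path .
        FILTER(STRSTARTS(?path, ?prefix))
    }
    """
    return [str(row.path) for row in
            g.query(query, initBindings={"prefix": Literal(folder_path)})]

def touch(path: str):
    """Create a new (empty) file resource at the given path."""
    file_uri = URIRef("http://example.org/files/" +
                      path.strip("/").replace("/", "-"))
    g.add((file_uri, RDF.type, SFS.File))
    g.add((file_uri, SFS.path, Literal(path)))
    return file_uri

touch("/projects/acme/readme.txt")
print(ls("/projects/acme/"))
```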
At the moment, however, the only wide-scale SFS is the still highly experimental Solid Project implementation, based upon Tim Berners-Lee’s efforts starting in 2017 to create a freely available distributed data system (via Inrupt.com) and a corresponding API that could be used to store and access files within a knowledge graph. The abstraction is powerful: such virtual files could be images, videos, text or binary files, links to external files, process pipelines, data feeds, or service calls hidden beneath the abstraction layer of the SFS. Moreover, as each Solid node (or Pod) has a public web address, it’s not difficult to imagine Solid as a protocol for federation, something that has been desired for some time but has been remarkably difficult to implement at the proprietary layer.
Echoing File Systems Such As SharePoint
The RSS or Atom feed is one of the most powerful (and frequently overlooked) tools in the programmer’s toolbox. Systems such as SharePoint or Atlassian Confluence create a virtual file system that appears to be the traditional folder/file arrangement on the surface but more than likely maps to a flat file repository or database on the back end. In many cases, these systems expose RSS feeds that will return a certain number of file- or folder-like objects with associated metadata using either XML or JSON (the functionality to do this may need to be enabled by an administrator, especially for SharePoint).
Atom feeds contain metadata to distinguish between file and folder nodes in SharePoint or Confluence, and the URL returned for a folder node is typically the base address needed for retrieving the corresponding RSS or Atom feed. This can be used to write a spider that recursively walks through the available files and folders externally, gathering the metadata and URLs and using them to build an emulation of that site with relatively simple JavaScript or Python apps, storing the information as triples. A sketch of such a spider follows.
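Here is a minimal sketch of that spidering approach in Python, assuming the feedparser library for feed handling and the same hypothetical sfs: vocabulary as earlier. The feed URL and the folder test are placeholders; a real SharePoint or Confluence site needs its own feed discovery and authentication logic.

```python
# Sketch of a feed spider: fetch an Atom/RSS feed, record each entry as
# triples, and recurse into anything that looks like a folder.
import feedparser
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SFS = Namespace("http://example.org/sfs#")

def spider(feed_url: str, graph: Graph, depth: int = 0, max_depth: int = 3):
    if depth > max_depth:  # guard against runaway recursion
        return
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        if "link" not in entry:
            continue
        node = URIRef(entry.link)
        graph.add((node, RDF.type, SFS.File))
        graph.add((node, SFS.title, Literal(entry.get("title", ""))))
        if "updated" in entry:
            graph.add((node, SFS.modified, Literal(entry.updated)))
        # Placeholder folder test: real feeds flag folders in their metadata.
        if entry.get("title", "").endswith("/"):
            spider(entry.link, graph, depth + 1, max_depth)

g = Graph()
g.bind("sfs", SFS)
spider("https://example.org/sites/demo/feed.xml", g)  # placeholder feed URL
```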
The real benefit of making echo file systems, however, comes in the ability to associate richer metadata with each file. Spider processes can download the associated documents, then use NLP to extract entities from those documents, or use ChatGPT-like auto-summarizers to create overviews and category tags for the content that can be displayed or searched. Additionally, such spiders can read image EXIF metadata or use image or video categorizers to identify content. Finally, thumbnails can be retrieved or generated.
The advantage of this approach is that you can add semantic filtering, querying, and proxy (aliased) file names without having to make significant changes to your base SharePoint or similar systems. What’s being stored is not the actual content but rather intelligent metadata proxies. The principal downside is that the new file system will always be a little out of sync with the existing one, though intelligent caching can lessen that significantly.
Summary
The Semantic File System, or SFS, is a powerful potential application for knowledge graphs that moves beyond the base triple store and points toward what could be a truly distributed file system spanning any number of different kinds of servers. It treats files (and folders) as malleable abstractions: the system looks hierarchical, but a semantic file system has the potential to be not just a taxonomy but a tool that reaches into the hypergraph realm of ontologies. A Semantic File System treats metadata as more than a tiny set of fixed attributes and can be far more powerful in managing identity and controlling access. Finally, Semantic File Systems are integral to building a CLI for a distributed data fabric.