So, as you may know, I am ardently pursuing a new file system architecture that incorporates a few key strategies for organizing files. One of these strategies is using unique IDs, generated by a hash function, for each file so that an app can request a file without relying on a namespace (folder structure). The reasoning for this is outlined in some of my previous posts. It helps make our software and our systems more resilient by avoiding the brittle nature of hard-coded file paths in the filesystem. In this way, a program can use a file and let the file system kernel module within the operating system determine where that file should come from. This provides performance benefits (especially for larger files) when a copy exists both on a faster internal drive and on a removable drive that the user carries between computers.
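To make that concrete, here is a minimal sketch in Python of what "ask for a file by its content-derived ID and let the system pick the fastest copy" could look like. The tier list, the `guid_for` helper, and `open_by_guid` are all hypothetical stand-ins for illustration, not the actual kernel-module interface.

```python
import hashlib
from pathlib import Path

# Hypothetical storage locations, fastest first. In the real design this
# resolution would live inside the file system, not inside the app.
STORAGE_TIERS = [Path("/mnt/fast_internal"), Path("/mnt/removable")]

def guid_for(data: bytes) -> str:
    """Content-derived GUID: here, simply the SHA-512 hex digest of the bytes."""
    return hashlib.sha512(data).hexdigest()

def open_by_guid(guid: str):
    """Return a handle to the first copy found, preferring faster media."""
    for tier in STORAGE_TIERS:
        candidate = tier / guid
        if candidate.exists():
            return candidate.open("rb")
    raise FileNotFoundError(f"no copy of {guid} on any attached storage")
```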
Collisions
For a hash-based system to work, we need to make sure that the globally unique identifiers (GUIDs) used in a content-addressable storage medium are actually unique. If they are not, the system will look at a file you downloaded from the internet, or one you are trying to save, compare it to a different file on the local system that happens to have the same ID, and then ignore the new file because it thinks you already have it on disk. Not very smart, right?
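As a toy illustration (plain Python, not the real module), a content-addressed store that trusts the ID completely will silently discard a new file whenever its ID matches something already on disk, which is exactly why the IDs must never collide:

```python
store = {}  # GUID -> file contents (stands in for the local disk)

def save(guid: str, data: bytes) -> None:
    if guid in store:
        # The system assumes identical ID means identical content and
        # ignores the incoming bytes. If two *different* files ever hash
        # to the same GUID, one of them is silently lost right here.
        return
    store[guid] = data
```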
Side Note about volatile files
As a side note, a user may want an explicit copy of a file that he or she intends to edit. That particular desire is not what a GUID scheme is made for, so the system would not use a GUID for files that are in a volatile period of their lifespan. While a file handle is open on a file for editing, the file's GUID is revoked and the file is put into a namespace for volatile files. The file may still be accessed using a locally-unique ID, which is not the same as a GUID. The "volatile" period ends when the user says it does, or when the system decides it does after a period of inactivity.
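A rough sketch of that lifecycle, assuming a hypothetical `TrackedFile` wrapper (the names and the uuid-based local ID are my own stand-ins, and the policy for ending the volatile period is left out):

```python
import hashlib
import uuid

class TrackedFile:
    """Toy model of a file that loses its GUID while it is being edited."""

    def __init__(self, data: bytes):
        self.data = data
        self.guid = hashlib.sha512(data).hexdigest()  # stable, content-derived
        self.local_id = None                          # only set while volatile

    def open_for_editing(self) -> None:
        # Revoke the GUID and hand out a locally-unique ID instead.
        self.guid = None
        self.local_id = uuid.uuid4().hex

    def close_and_settle(self) -> None:
        # Once the volatile period ends, re-derive the GUID from the
        # (possibly changed) contents and drop the local ID.
        self.guid = hashlib.sha512(self.data).hexdigest()
        self.local_id = None
```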
Let’s do some Q and A with ChatGPT…
Q: What are the collision characteristics of the latest hashing algorithms? Is SHA-2 or SHA-3 a perfect hashing algorithm?
ChatGPT says:
Collision Characteristics of SHA-2 and SHA-3
- SHA-2 (SHA-224, SHA-256, SHA-384, SHA-512)
  - Collision Resistance: No known practical collisions, but theoretical weaknesses exist in SHA-1, prompting the adoption of SHA-2 and SHA-3.
  - Preimage & Second Preimage Resistance: No feasible attacks have been found.
  - Security Concerns: While SHA-2 remains strong, researchers pushed for SHA-3 as an alternative in case future advances break SHA-2.
- SHA-3 (Keccak-based)
  - Collision Resistance: SHA-3 is designed to be more resistant to collision attacks than SHA-2.
  - Preimage & Second Preimage Resistance: Stronger against future cryptographic attacks due to its sponge construction.
  - Security Concerns: No known practical vulnerabilities.
Is SHA-2 or SHA-3 a Perfect Hashing Algorithm?
No, neither SHA-2 nor SHA-3 is a perfect hashing algorithm because:
- Perfect Hashing requires a function that guarantees no collisions in a fixed set of inputs, typically for use in hash tables.
- Cryptographic Hashing (SHA-2, SHA-3) aims for collision resistance, but due to the pigeonhole principle, collisions must exist when hashing an infinite set of inputs to a finite output.
Thus, while SHA-2 and SHA-3 minimize the probability of collisions, they do not meet the definition of a perfect hash function, which guarantees zero collisions within a specific dataset.
Existing hashing algorithms are not “perfect hashing algorithms”. A hash that never collides over arbitrary inputs is considered impossible to achieve, because of the nature of creating a fixed-length signature from a larger chunk of data of indeterminate length. For example, in common English we use the term “ref” to refer to a referee, and we also use “ref” to refer to a reference. We didn’t jumble the letters up; we simply abbreviated each longer word down to a fixed length of three, and doing so caused a “collision” between different terms. We don’t want this. If I’m talking about referees and the system tells me it already saved my file, “refs”, about referees, when the file “refs” it actually has on disk is about references, we are not talking about the same thing!
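The abbreviation analogy translates directly into code. Here is a quick experiment (a sketch in Python, brute-forcing short labels) showing that once you truncate a hash down to a handful of characters, the pigeonhole principle makes collisions show up almost immediately, which is exactly the “ref” vs. “ref” problem:

```python
import hashlib
import itertools

def truncated_hash(s: str, n_hex_chars: int = 4) -> str:
    """Keep only the first few hex characters, like abbreviating 'referee' to 'ref'."""
    return hashlib.sha256(s.encode()).hexdigest()[:n_hex_chars]

seen = {}
for i in itertools.count():
    word = f"word-{i}"
    h = truncated_hash(word)
    if h in seen:
        # With only 16^4 = 65,536 possible truncated IDs, a collision
        # typically appears after a few hundred inputs.
        print(f"collision after {i} tries: {seen[h]!r} and {word!r} both map to {h!r}")
        break
    seen[h] = word
```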
Have I said “refer” enough in one sentence for your tastes? If not, please refer to this ChatGPT prompt. After you use the link to generate your own Refer poem, submit it here and read mine! Afterward, you can view all submissions here.
Back to what we were talking about: we don’t want hashing collisions! Older hashing algorithms like MD5 and SHA-1 have had real collisions demonstrated. As of right now, SHA-2 has no known collisions, and in order to understand why that might be the case, we need to think about some physics, later in this article…
The Physics and Math
To determine whether a hash-based GUID is appropriate for a file system, where you really don’t want any hash collisions causing your computer to delete your data or ignore it when you ask it to save, we have to consider some physics and do some math.
The first holistic check is: can the hash function represent as many unique IDs as there are atoms in the universe? This is a REALLY REALLY good heuristic for whether it might do what we need.
Q: How long is a SHA2 hash?
A: SHA-2 family hashes come in four main lengths: 224, 256, 384, and 512 bits.
Q: How many unique IDs can the 512 bit one [SHA-2 algorithm] produce?
A: It can produce 2^{512} unique IDs, which is approximately 1.34×10^{154}.
Q: How many atoms are there in the universe?
A: Common estimates place the number of atoms in the observable universe at around 10^{80}.
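Checking those figures directly is easy, since Python integers are arbitrary-precision; this little snippet just reproduces the numbers above:

```python
id_space = 2 ** 512   # number of distinct SHA-512 outputs
atoms = 10 ** 80      # rough estimate of atoms in the observable universe

print(f"SHA-512 ID space: {id_space:.2e}")                                  # ~1.34e+154
print(f"IDs available per atom in the universe: {id_space // atoms:.2e}")   # ~1.34e+74
```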
Next, I needed to use a current benchmark of the total number of files floating around. While not a guarantee, the Internet Archive tends to hoard information more than any other service known to man. It is a non-profit funded by donations and grants, and from what I can tell it keeps finding the budget to grow, so it makes a reasonable yardstick.
Q: How many unique files have been indexed by the Internet Archive?
A: A rough estimate is on the order of 8×10^{11} web pages archived (Wayback Machine). For overall items, it’s somewhere in the range of 10^{9} to 10^{11}.
Ah, but this still isn’t a good measuring stick for a file system. A file system must be used well into the future. We want a file system that solves the problem FOR ALL TIME, not “for now”. So we need to get an approximation of how many files are being added each year. I wish I could get a proper regression analysis to know if this is an exponential growth rate, but a linear estimate will have to do until I revisit this aspect of the file system blueprints.
Q: What has the growth rate been on internet archive’s number of files?
A: They add billions of new web captures per year. Exact rates vary, but the archive has grown from a few billion pages in the early 2000s to more than 800 billion captures today, an overall long-term growth rate of billions of items each year.
Q: How many years would it take for it to reach half the total amount of unique IDs for SHA-2 512?
A: Half of all possible SHA-512 outputs is about 2^{511} ≈ 6.7×10^{153}.
If the Internet Archive grows by about 10^{11} items per year (very rough estimate):
(6.7×10^{153}) / 10^{11} = 6.7×10^{142} years.
That’s vastly longer than the age of the universe (∼10^{10} years).
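Written out as a sanity check (the 10^{11} items/year figure is the rough estimate from above, and the last two lines add the standard birthday-bound approximation, which is the quantity that actually governs the odds of an accidental collision):

```python
half_space = 2 ** 511        # half of all possible SHA-512 outputs
items_per_year = 10 ** 11    # rough Internet Archive growth estimate

years_to_half = half_space / items_per_year
print(f"years to mint half of the ID space: {years_to_half:.1e}")  # ~6.7e+142

# Birthday-bound check: probability of ANY collision among n random IDs
# is roughly n^2 / 2^(bits + 1). Even 10^21 files (10^11 per year for the
# ~10^10-year age of the universe) gives a vanishingly small probability.
n = 10 ** 21
p_collision = n ** 2 / 2 ** 513
print(f"collision probability with {n:.0e} files: ~{p_collision:.1e}")  # ~3.7e-113
```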
OK, so this seems like a reasonable strategy for a file-system. GOOD TAH KNOW!