A hash function is a mathematical process that maps a bitstring of any length to a bitstring of a defined length. This process must also be easy to compute. A one-way hash function must be preimage resistant and second preimage resistant while a secure cryptographic or ideal hash function must be preimage resistant, second preimage resistant, and collision resistant – we'll cover what these terms mean in the following lessons of this chapter. The hash functions we'll be looking at in this course are the secure ideal hash functions found in the BSV system: SHA-256 and RIPEMD-160. For the remainder of this course, when you see the term hash function, it means secure cryptographic or ideal hash function.
The output values of hash functions are often referred to as message digests, hash values, checksums, or hashes. We will use these terms interchangeably throughout this course.
You can think of a message digest as analogous to a fingerprint for digital information. In the physical world, a fingerprint is an identifier unique to each individual person such that no two people ever share the same fingerprint. A hash function takes some input message (which can be almost anything as all digital information is essentially a binary string of ones and zeros), and produces a unique, compressed, fixed-length, digital fingerprint for that input message.
As a result of their unique properties, hash functions are a very important tool in many areas of information technology:
Hash Tables
Digital signatures
Keyed hashes
One-way transformations
Key derivation
These operations and techniques form the foundation of most digital storage and communication technologies. However, the two fundamental uses of hash functions are: maintaining data integrity, and data storage & access efficiency. Since message digests are digital fingerprints, they are very useful for making sure data hasn’t been tampered with.
For example, Bob wants to download the most recent version of Debian Linux so he goes to debian.org and downloads an .iso file for the version of Debian Linux he would like to use. However, he wants to be sure he’s downloaded the correct file before running it to minimize the risk the file has malware embedded in it. Before running the .iso file, Bob checks the website for a checksum or message digest corresponding to the version of the file he downloaded, and he finds the following listed:
Seeing the listed checksums were generated using the SHA-512 hash function, Bob then runs his newly downloaded .iso file through the SHA-512 hash function on his computer to make sure the output matches one of the provided hashes:
And, sure enough, the output checksum matches one of the given checksums on the Debian website, so Bob can be reasonably sure the file he has downloaded is safe to open.
In addition to data integrity, using hash functions as a tagging or identification mechanism in data storage or in a data structure can provide efficiency gains. Message digests of hash functions are a fixed length but still unique, and hash functions can take almost anything as an input value. Many data structures found in information systems utilize hash functions to identify or index data.
Fundamentally, all computer systems use hash tables (also called associative arrays, maps, or dictionaries) within their memory storage architectures to make memory access as efficient as possible. Hash tables are such an important data structure that all other data structures can be constructed using them; there are even some programming languages like AWK that use them exclusively. Hash tables work a lot like arrays; the difference being hash tables allow for anything that can be hashed to be used as an index value creating a key-value mapping.
For example, given an array of size 10, the contents of the array are stored by accessing the array at indexes from 0..9:
Whereas hash tables use a hash of data modulus the length of the array (i.e. the remainder of the message digest divided by the length of the array) to derive the index the data will be stored in the array.
Notice the output of the hash table is not in order like the array was, and that's because the index of the value is determined by the hash of its key. This is what makes hash tables so efficient. Because the index of the value is provided when fetching the data, retrieval is fast, yet almost any kind of input can be used to derive the index.
BSV Blockchain makes extensive use of Merkle Trees. A Merkle Tree is a binary tree structure (meaning its branches terminate in two nodes) that's made entirely of hashes. For example, the leaf nodes at the bottom layer of the Merkle trees used in BSV are the hash values of the raw data from bitcoin transactions. And the successive layers of the tree leading back to the root node are constructed by concatenating the two hashes from the previous layer, and hashing them together to form a new hash. This has the effect of connecting the tree together to be represented securely by one single root node value. This structure makes it very easy to find data at the bottom of the tree, as well as to prove specific data exists in the tree - which is particularly important for scaling the BSV system efficiently.
See our course Bitcoin Primitives: Merkle Trees
Although hash tables and data structures like Merkle Trees are where hash functions find their greatest use, the most widely recognized use of hash functions is in cryptography. In addition to encryption and key-exchanges, digital signatures make extensive use of hash functions. Like a physical signature, a digital signature associates an identity to a document or message. To ensure the security of digital signatures, well tested algorithms are used to both construct and verify them.
In BSV, digital signatures are most commonly used for Pay-to-Public-Key-Hash (P2PKH) transactions – the most common type of transaction template -- where the ownership of funds are transferred from the sending party to the receiving party. The two other common uses of digital signatures in BSV are for signing arbitrary messages and using an address as part of the verification process, and MinerID which is a message included in the first transaction of a block that acts as an authentication mechanism in case another miner (node) tries to act dishonestly while using the ID of a competing miner.
With that said, it's helpful to recognize that even though hash functions and cryptographic digital signatures are used in BSV, it is not a cryptographic system, and it's actually incorrect to use the term “cryptocurrency” when referring to Bitcoin or a system that follows the Bitcoin architectural design like BSV. This is a really important distinction to be aware of because as we'll see in chapter 7, it's fundamental to how the BSV system operates and how it's secured.
Since hash functions are a type of cryptographic primitive -- i.e., some of the building blocks used to encrypt messages and data -- the concepts of hashing and encryption are often conflated. However, they are not the same. The three main differences between hashing and encryption are as follows:
Hash functions don't care about data loss: A hash function produces a fixed-length output in clear text because it’s nothing more than a unique digital fingerprint and data loss is not a concern. In contrast, encryption encodes and preserves input messages.
Hash functions are not reversible: A hash function is a one-way function that does not need to be reversed. In fact, it's important it's not reversible. In contrast, it's crucial the intended recipient of an encrypted message is able to reverse the encryption process in order to decrypt the message so they can read it.
Hash functions have a collision risk: A hash function has a collision risk. Due to the fixed length of message digests, it's possible for two different input values to produce the same digest. BSV uses hash functions like SHA-256 and RIPEMD-160 because they greatly reduce the risk of collisions, yet, although they're probabilistically insignificant, they can still theoretically happen. In contrast, since encryption preserves 100% of the input message, each encrypted message produces a unique output so there's no collision risk.
Hash functions are computationally infeasible to reverse -- in the same sense as trying to unscramble an egg or put toothpaste back into its tube. Said in another way, hash functions are one-way 'trap-door' functions; they make it easy to find a message digest from an input, but so difficult to find an input from a message digest, that it would either take longer than human beings are capable of waiting (sometimes longer than the age of the universe) or the cost of marshalling the necessary computing power to reverse them is so high, it's not worth doing it.
To be considered preimage resistant, it should take hash attempts (where m is the length of the output in bits) to find the corresponding input message for a particular message digest. If a preimage can be found using fewer than hash attempts, the hash function is no longer considered preimage resistant and is no longer secure.
The message digest of a hash function is consistent given the same input value. In other words, hash functions are deterministic -- the same input produces the same output hash every time. Furthermore, if even a single bit of the input message is changed, the resulting output hash is significantly different, unique, and preimage resistant.
It’s computationally infeasible to find two different input values that produce the same output value for a hash function. Although the fixed length of their output means a hash function has a collision risk, an ideal hash function produces message digests with a length large enough that the risk of finding a collision is so small, it’s safe to ignore in most situations.
Ideally, a hash function with an output value of bitlength should require at least operations to find a collision. In a birthday attack, given a set of m equally probable values, roughly random samples are needed to find a single collision. This means if a collision is found in fewer than random samples, the collision resistance property of the hash function is “broken”, and it’s no longer considered secure.
This is particularly important with respect to BSV's proof-of-work process.
The fundamental issue with constructing a hash function is the fact that it must be able to take an input of arbitrary length and compress it into an output of a fixed length. If input is an arbitrary length, it's difficult to achieve second preimage resistance. Also, for a compression function to be collision resistant, it must take fixed length inputs that are longer than its output. One way to remedy this issue is to use a preprocessing step to pad the input to a fixed length rather than compress it directly. Once the input has been preprocessed, it can be compressed consistently.
In 1979, both Ralph Merkle and Ivan Damgård independently proved that so long as an appropriate padding scheme is used in the preprocessing step, and the compression function used is collision resistant (implying preiamge and second preimage resistance), it follows that the hash function itself is also collision resistant. The most popular hash functions in use today are based on this Merkle-Damgård method of construction.
Merkle-Damgård hash functions have three parts: input and preprocessing, compression, and final value construction and output. The preprocessing step pads the input message so its congruent to a fixed bitlength, e.g. 512 or 1024, and then appends the length of the input itself (this is called Merkle-Damgård strengthening). The compression step uses bitwise logical functions to compute and mutate the processed input in rounds using chaining variables to link each round. Finally, the chaining variables are concatenated and outputted most commonly in hexadecimal format.
The first two Merkle-Damgård hash functions to gain widespread adoption were Ron Rivest's MD4 and MD5. First specified in 1992 in the Internet Engineering Task Force (IETF) Request For Comments (RFCs) 1320 and 1321, respectively, MD4 and MD5 were the first hash functions to use a 448 congruent to 512 message block padding scheme with the remaining 64 bits left for Merkle-Damgård strengthening.
MD4 enjoyed short-lived success, but Rivest realized shortly after its release that it was likely to be insecure, so he designed MD5 shortly thereafter. Despite Hans Dobbertin's announcement of a collision found for MD5's compression function in 1996, MD5 enjoyed long-lived success. While cryptographers were recommending SHA-1 as a replacement to MD5 as early as 1996, MD5 continued to be used well into the early 2000s.
However, in 2004, Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu developed an analytical collision attack that could reportedly be performed within an hour on an IBM p690 cluster. This marked the end for MD5 for use in key derviation and digital signatures. In 2008, CMU Software Engineering Institute announced MD5 was "cryptographically broken and unsuitable for further use".
Even so, the groundwork laid by MD4 and MD5 provided a practical framework for newer generations of hash functions to follow; including BSV's SHA-256 and RIPEMD-160 which both use 512 bit message blocks and 32 bit words.
As already mentioned, the two hash functions found in the BSV system are SHA-256 and RIPEMD-160. For reasons explored further in Chapter 7, Bitcoin uses SHA-256 and RIPEMD-160 in double hash forms: SHA-256(SHA-256) and RIPEMD-160(SHA-256) – often abstracted and referred to as HASH-256 and HASH-160, respectively, following the convention set out in the original bitcoin code.
However, the following table displays the full range of hash functions commonly found in the greater BSV ecosystem; including SHA-512 and the HMACs of SHA-256 and SHA-512 which are often utilized by wallet implementations:
SHA-256
32 Bytes
Generates unique 256-bit value from input string
1. Proof-of-work algorithm 2. Address Creation
RIPEMD-160
20 Bytes
Generates unique 160-bit value from an input string
Address creation
HASH-256
32 Bytes
SHA256 hash of a SHA256 hash
1. Blocks 2. Transactions
HASH-160
20 Bytes
RIPEMD160 hash of a SHA256 hash
Address creation
SHA512
64 Bytes
Generates unique 512-bit value from input string
Wallet encryption (AES)
SHA256HMAC/SHA512HMAC
32 Bytes
HMAC Prevents length extension attacks and can be used with any hash function
Address Creation