1 of 4

What are Hash Functions?

A hash function is a mathematical process that maps a bitstring of any length to a bitstring of a defined length. This process must also be easy to compute. A one-way hash function must be preimage resistant and second preimage resistant while a secure cryptographic or ideal hash function must be preimage resistant, second preimage resistant, and collision resistant – we'll cover what these terms mean in the following lessons of this chapter. The hash functions we'll be looking at in this course are the secure ideal hash functions found in the BSV system: SHA-256 and RIPEMD-160. For the remainder of this course, when you see the term hash function, it means secure cryptographic or ideal hash function.

The output values of hash functions are often referred to as message digests, hash values, checksums, or hashes. We will use these terms interchangeably throughout this course.

You can think of a message digest as analogous to a fingerprint for digital information. In the physical world, a fingerprint is an identifier unique to each individual person such that no two people ever share the same fingerprint. A hash function takes some input message (which can be almost anything as all digital information is essentially a binary string of ones and zeros), and produces a unique, compressed, fixed-length, digital fingerprint for that input message.

As a result of their unique properties, hash functions are a very important tool in many areas of information technology:

Hash Tables
Digital signatures
Keyed hashes
One-way transformations
Key derivation

These operations and techniques form the foundation of most digital storage and communication technologies. However, the two fundamental uses of hash functions are: maintaining data integrity, and data storage & access efficiency. Since message digests are digital fingerprints, they are very useful for making sure data hasn’t been tampered with.

Data Integrity

For example, Bob wants to download the most recent version of Debian Linux so he goes to debian.org and downloads an .iso file for the version of Debian Linux he would like to use. However, he wants to be sure he’s downloaded the correct file before running it to minimize the risk the file has malware embedded in it. Before running the .iso file, Bob checks the website for a checksum or message digest corresponding to the version of the file he downloaded, and he finds the following listed:

c685b85cf9f248633ba3cd2b9f9e781fa03225587e0c332aef2063f6877a1f0622f56d44cf0690087b0ca36883147ecb5593e3da6f965968402cdbdf12f6dd74  debian-11.2.0-amd64-netinst.iso 
f2da0996196f19585b464e48bfb6b8e4938eb596667d92a5ebd428e1a88a1a115c00f1d052f350eca44fa08f42f0500c63351763dfb47f1e1783f917590c175d  debian-edu-11.2.0-amd64-netinst.iso 
9b5b0475fbb3235ebb7da71415f10921b02131b7debc9325403f85f9f6798a3e902d6257831a7ec9c7aef3256817fb76c4f01bb5d035bfcdb3dc24da24fa1bda  debian-mac-11.2.0-amd64-netinst.iso

Seeing the listed checksums were generated using the SHA-512 hash function, Bob then runs his newly downloaded .iso file through the SHA-512 hash function on his computer to make sure the output matches one of the provided hashes:

openssl sha512 ~/Downloads/debian-11.2.0-amd64-netinst 

//Output: SHA512(debian-11.2.0-amd64-netinst.iso)= c685b85cf9f248633ba3cd2b9f9e781fa03225587e0c332aef2063f6877a1f0622f56d44cf0690087b0ca36883147ecb5593e3da6f965968402cdbdf12f6dd74

And, sure enough, the output checksum matches one of the given checksums on the Debian website, so Bob can be reasonably sure the file he has downloaded is safe to open.

Efficient Data Storage and Retrieval

In addition to data integrity, using hash functions as a tagging or identification mechanism in data storage or in a data structure can provide efficiency gains. Message digests of hash functions are a fixed length but still unique, and hash functions can take almost anything as an input value. Many data structures found in information systems utilize hash functions to identify or index data.

Fundamentally, all computer systems use hash tables (also called associative arrays, maps, or dictionaries) within their memory storage architectures to make memory access as efficient as possible. Hash tables are such an important data structure that all other data structures can be constructed using them; there are even some programming languages like AWK that use them exclusively. Hash tables work a lot like arrays; the difference being hash tables allow for anything that can be hashed to be used as an index value creating a key-value mapping.

For example, given an array of size 10, the contents of the array are stored by accessing the array at indexes from 0..9:

package main 

import ( 
	"fmt" 
) 

func main() { 
	var array [10]string = [10]string{"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"} 

	for i := 0; i < 9; i++ { 
		fmt.Printf("%s, ", array[i]) 
	}  

	fmt.Printf("%s\n", array[9]) 

} 

//Output: one, two, three, four, five, six, seven, eight, nine, ten

Whereas hash tables use a hash of data modulus the length of the array (i.e. the remainder of the message digest divided by the length of the array) to derive the index the data will be stored in the array.

package main 

import ( 
	"bytes" 
	"encoding/binary" 
	"fmt" 
	bk "github.com/libsv/go-bk/crypto" 
	"math/big" 
) 
 
type Cell struct { 
	key   string 
	value string 
	next  *Cell 
} 

type HashTable struct { 
	Cells []*Cell 
} 
 
func main() { 
	hashTable := NewHashTable() 
	hashTable.Insert("0", "zero") 
	hashTable.Insert("1", "one") 
	hashTable.Insert("2", "two") 
	hashTable.Insert("3", "three") 
	hashTable.Insert("4", "four") 
	hashTable.Insert("5", "five") 
	hashTable.Insert("6", "six") 
	hashTable.Insert("7", "seven") 
	hashTable.Insert("8", "eight") 
	hashTable.Insert("9", "nine") 
  
	fmt.Printf("%s\n", hashTable) 
 
} 

func (hashTable *HashTable) String() string { 
 
	var out bytes.Buffer  

	for _, cell := range hashTable.Cells {  

		if cell != nil {  

			fmt.Fprintf(&out, "%s ", cell.value) 

			if cell.next != nil { 
				for nextCell := cell.next; nextCell != nil; nextCell = nextCell.next { 
					fmt.Fprintf(&out, "%s ", nextCell) 
				} 
			} 
		} 
	} 

  	return out.String() 
} 
 
func (cell *Cell) String() string { 

	var out bytes.Buffer 

	for ; cell != nil; cell = cell.next { 
  
		fmt.Fprintf(&out, "%s", cell.value) 
  
	}  

	return out.String() 

}   

func NewHashTable() *HashTable { 
	return &HashTable{Cells: make([]*Cell, 50)} 

}   

func GetIndex(key string) (index int) {   

	bInt := new(big.Int) 
	bInt.SetBytes(bk.Sha256d([]byte(key))) 

	return int(binary.BigEndian.Uint64(bInt.Bytes()) % 50) 
} 

func (hashTable *HashTable) Insert(key string, value string) { 

	index := GetIndex(key) 

	if hashTable.Cells[index] == nil { 

		hashTable.Cells[index] = &Cell{key: key, value: value} 

	} else {  

		head := hashTable.Cells[index] 

		for ; head != nil; head = head.next { 

			if head.key == key { 
				head.value = value 
			} 
			return 
		} 
 
		head.next = &Cell{key: key, value: value} 
	} 
}   

func (hashTable *HashTable) Get(key string) (string, bool) { 

	index := GetIndex(key)   

	if hashTable.Cells[index] != nil {  

		head := hashTable.Cells[index] 

		for ; ; head = head.next { 
 
			if head.key == key { 
				return head.value, true 
			}  

			if head.next == nil { 
				break 
			} 
		} 

	} 

	return "", false 
}  

//Output: three four five nine eight zero six seven one two

Notice the output of the hash table is not in order like the array was, and that's because the index of the value is determined by the hash of its key. This is what makes hash tables so efficient. Because the index of the value is provided when fetching the data, retrieval is fast, yet almost any kind of input can be used to derive the index.

Merkle Trees

BSV Blockchain makes extensive use of Merkle Trees. A Merkle Tree is a binary tree structure (meaning its branches terminate in two nodes) that's made entirely of hashes. For example, the leaf nodes at the bottom layer of the Merkle trees used in BSV are the hash values of the raw data from bitcoin transactions. And the successive layers of the tree leading back to the root node are constructed by concatenating the two hashes from the previous layer, and hashing them together to form a new hash. This has the effect of connecting the tree together to be represented securely by one single root node value. This structure makes it very easy to find data at the bottom of the tree, as well as to prove specific data exists in the tree - which is particularly important for scaling the BSV system efficiently.

See our course Bitcoin Primitives: Merkle Trees

Digital Signatures

Although hash tables and data structures like Merkle Trees are where hash functions find their greatest use, the most widely recognized use of hash functions is in cryptography. In addition to encryption and key-exchanges, digital signatures make extensive use of hash functions. Like a physical signature, a digital signature associates an identity to a document or message. To ensure the security of digital signatures, well tested algorithms are used to both construct and verify them.

In BSV, digital signatures are most commonly used for Pay-to-Public-Key-Hash (P2PKH) transactions – the most common type of transaction template -- where the ownership of funds are transferred from the sending party to the receiving party. The two other common uses of digital signatures in BSV are for signing arbitrary messages and using an address as part of the verification process, and MinerID which is a message included in the first transaction of a block that acts as an authentication mechanism in case another miner (node) tries to act dishonestly while using the ID of a competing miner.

With that said, it's helpful to recognize that even though hash functions and cryptographic digital signatures are used in BSV, it is not a cryptographic system, and it's actually incorrect to use the term “cryptocurrency” when referring to Bitcoin or a system that follows the Bitcoin architectural design like BSV. This is a really important distinction to be aware of because as we'll see in chapter 7, it's fundamental to how the BSV system operates and how it's secured.

The Differences Between Hashing and Encryption

Since hash functions are a type of cryptographic primitive -- i.e., some of the building blocks used to encrypt messages and data -- the concepts of hashing and encryption are often conflated. However, they are not the same. The three main differences between hashing and encryption are as follows:

Hash functions don't care about data loss: A hash function produces a fixed-length output in clear text because it’s nothing more than a unique digital fingerprint and data loss is not a concern. In contrast, encryption encodes and preserves input messages.
Hash functions are not reversible: A hash function is a one-way function that does not need to be reversed. In fact, it's important it's not reversible. In contrast, it's crucial the intended recipient of an encrypted message is able to reverse the encryption process in order to decrypt the message so they can read it.
Hash functions have a collision risk: A hash function has a collision risk. Due to the fixed length of message digests, it's possible for two different input values to produce the same digest. BSV uses hash functions like SHA-256 and RIPEMD-160 because they greatly reduce the risk of collisions, yet, although they're probabilistically insignificant, they can still theoretically happen. In contrast, since encryption preserves 100% of the input message, each encrypted message produces a unique output so there's no collision risk.

The Three Important Properties of Hash Functions

1. Preimage Resistance: One-way

Hash functions are computationally infeasible to reverse -- in the same sense as trying to unscramble an egg or put toothpaste back into its tube. Said in another way, hash functions are one-way 'trap-door' functions; they make it easy to find a message digest from an input, but so difficult to find an input from a message digest, that it would either take longer than human beings are capable of waiting (sometimes longer than the age of the universe) or the cost of marshalling the necessary computing power to reverse them is so high, it's not worth doing it.

To be considered preimage resistant, it should take $2^m$ hash attempts (where m is the length of the output in bits) to find the corresponding input message for a particular message digest. If a preimage can be found using fewer than $2^m$ hash attempts, the hash function is no longer considered preimage resistant and is no longer secure.

2 .Second Preimage Resistance: Deterministic

The message digest of a hash function is consistent given the same input value. In other words, hash functions are deterministic -- the same input produces the same output hash every time. Furthermore, if even a single bit of the input message is changed, the resulting output hash is significantly different, unique, and preimage resistant.

3. Collision Resistance

It’s computationally infeasible to find two different input values that produce the same output value for a hash function. Although the fixed length of their output means a hash function has a collision risk, an ideal hash function produces message digests with a length large enough that the risk of finding a collision is so small, it’s safe to ignore in most situations.

Ideally, a hash function with an output value of bitlength $m$ should require at least $2^{m/2}$ operations to find a collision. In a birthday attack, given a set of m equally probable values, roughly $2^{m/2}$ random samples are needed to find a single collision. This means if a collision is found in fewer than $2^{m/2}$ random samples, the collision resistance property of the hash function is “broken”, and it’s no longer considered secure.

This is particularly important with respect to BSV's proof-of-work process.

The Hash Functions Found in BSV

Merkle-Damgård Hash Functions

The fundamental issue with constructing a hash function is the fact that it must be able to take an input of arbitrary length and compress it into an output of a fixed length. If input is an arbitrary length, it's difficult to achieve second preimage resistance. Also, for a compression function to be collision resistant, it must take fixed length inputs that are longer than its output. One way to remedy this issue is to use a preprocessing step to pad the input to a fixed length rather than compress it directly. Once the input has been preprocessed, it can be compressed consistently.

In 1979, both Ralph Merkle and Ivan Damgård independently proved that so long as an appropriate padding scheme is used in the preprocessing step, and the compression function used is collision resistant (implying preiamge and second preimage resistance), it follows that the hash function itself is also collision resistant. The most popular hash functions in use today are based on this Merkle-Damgård method of construction.

Merkle-Damgård hash functions have three parts: input and preprocessing, compression, and final value construction and output. The preprocessing step pads the input message so its congruent to a fixed bitlength, e.g. 512 or 1024, and then appends the length of the input itself (this is called Merkle-Damgård strengthening). The compression step uses bitwise logical functions to compute and mutate the processed input in rounds using chaining variables to link each round. Finally, the chaining variables are concatenated and outputted most commonly in hexadecimal format.

MD4 and MD5

The first two Merkle-Damgård hash functions to gain widespread adoption were Ron Rivest's MD4 and MD5. First specified in 1992 in the Internet Engineering Task Force (IETF) Request For Comments (RFCs) 1320 and 1321, respectively, MD4 and MD5 were the first hash functions to use a 448 congruent to 512 message block padding scheme with the remaining 64 bits left for Merkle-Damgård strengthening.

MD4 enjoyed short-lived success, but Rivest realized shortly after its release that it was likely to be insecure, so he designed MD5 shortly thereafter. Despite Hans Dobbertin's announcement of a collision found for MD5's compression function in 1996, MD5 enjoyed long-lived success. While cryptographers were recommending SHA-1 as a replacement to MD5 as early as 1996, MD5 continued to be used well into the early 2000s.

However, in 2004, Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu developed an analytical collision attack that could reportedly be performed within an hour on an IBM p690 cluster. This marked the end for MD5 for use in key derviation and digital signatures. In 2008, CMU Software Engineering Institute announced MD5 was "cryptographically broken and unsuitable for further use".

Even so, the groundwork laid by MD4 and MD5 provided a practical framework for newer generations of hash functions to follow; including BSV's SHA-256 and RIPEMD-160 which both use 512 bit message blocks and 32 bit words.

BSV's Hash Functions

As already mentioned, the two hash functions found in the BSV system are SHA-256 and RIPEMD-160. For reasons explored further in Chapter 7, Bitcoin uses SHA-256 and RIPEMD-160 in double hash forms: SHA-256(SHA-256) and RIPEMD-160(SHA-256) – often abstracted and referred to as HASH-256 and HASH-160, respectively, following the convention set out in the original bitcoin code.

However, the following table displays the full range of hash functions commonly found in the greater BSV ecosystem; including SHA-512 and the HMACs of SHA-256 and SHA-512 which are often utilized by wallet implementations:

Hash Function

Output Length

Description

Example Application in BSV

SHA-256

32 Bytes

Generates unique 256-bit value from input string

1. Proof-of-work algorithm 2. Address Creation

RIPEMD-160

20 Bytes

Generates unique 160-bit value from an input string

Address creation

HASH-256

32 Bytes

SHA256 hash of a SHA256 hash

1. Blocks 2. Transactions

HASH-160

20 Bytes

RIPEMD160 hash of a SHA256 hash

Address creation

SHA512

64 Bytes

Generates unique 512-bit value from input string

Wallet encryption (AES)

SHA256HMAC/SHA512HMAC

32 Bytes

HMAC Prevents length extension attacks and can be used with any hash function

Address Creation