This article is one part of a longer deep dive into Leios, Cardano’s proposed consensus upgrade. The full post is available at https://314pool.com/post/leios. We’ve broken it into smaller, standalone reads to make it more accessible for the Cardano Cube audience.
Now, rather than explaining Leios immediately, I’ve found it’s most helpful to build up to Leios with a sequence of protocols, each of which naively attempts to try to address a problem with the previous one. Then, we will show how that protocol is vulnerable to some kind of attack. If you’d prefer to just read a complete description of Leios, rather than following how you might arrive at that design yourself, feel free to skip to the next section! You can come back later if or when you get curious about why certain decisions were made.
Leios 0 - Design
To understand the first step we’ll take, let’s look at that first phase of the protocol a bit closer, where we propagate transactions through the network to fill up the mempools.
While propagating transactions in the mempool, there’s no requirement that we “wait” for a previous block. This means we can effectively utilize bandwidth because transactions are relatively small and can be propagated at any time, meaning they can fill in periods of silence to fully saturate the connections between two nodes. This means that the maximum throughput of propagating transactions, so long as we’re not trying to reach consensus and sequence them into blocks, is tremendous.
A decent lower-bound estimate on the average bandwidth available between high-quality network nodes is 100 megabits per second. If we take a pessimistic view, and half that to 50 Mbps, and assume our average transaction is 350 bytes, then we can propagate around 17850 transactions per second!
And, let’s observe that we’re effectively downloading each block twice: First, we download the individual pieces of some future block by downloading the transactions as they are propagated across the network before they are included in a block; and then when a block is produced, the block is composed of individual pieces we’ve very likely downloaded before!
So, let’s change the format of our blocks: instead of a block containing the full bytes of each transaction, a block just consists of the transaction hashes it includes. Now, using the same Praos protocol, which produces a block up to 88 kilobytes large every 20 seconds on average, we can reference ~2750 transactions, for a throughput of 137.5 transactions per second. And, since we have (likely) already downloaded and validated those transactions to include them in our mempool, we have to do less work to validate the block. We can likely make the blocks bigger and more frequent, relying on most of the work to happen in real-time, while we download the transactions, rather than right when we must produce the block.
Now, instead of a node that hits 95% CPU each time a block is made, but is quiet the rest of the time, we can fully utilize our resources and consistently do useful work at all times.
Leios 0 - The Attack
Unfortunately, the key weakness in the protocol above is the word “likely”.
If you download a block from a peer, and you haven’t seen one of the transactions it references, you have to wait until you do. Perhaps you can request that from a peer, but then in the worst case, we just revert back to downloading the transactions after we’ve been told about the block, which is equivalent to if they had just been included in the block in the first place, which destroys our security guarantees.
Worse, suppose a malicious block producer includes references to transactions that no one has seen! They could generate 32 random bytes, claim that was a transaction that had been propagated to them, and share a block containing this “transaction ID” throughout the network, halting the production of every other node while they tried to download a transaction that didn’t even exist!
This is a data-availability problem: when downloading and validating a block from a peer, and deciding which fork is the current tip, a block-producing node needs some way to have high confidence that even if they haven’t seen and validated a specific transaction yet, they will see the transaction soon, and it will be valid.
Leios 0.1 - Design
To fix this, we can borrow an idea from a related Cardano project called Mithril.
In Mithril, stakepool operators opt in to “vote” on certain payloads, such as a checkpoint of the ledger state. Their vote carries a certain weight, depending on their stake. Thus, building on the assumption that a majority of the stake in the network is honest, once a certain majority of the stake of the network has voted on a specific payload, you can safely rely on the integrity of that payload.
Cardano projects like cardano-up
by Blink Labs, or dolos
by TxPipe use this property to quickly bootstrap a Cardano node without having to synchronize and validate the whole history of the blockchain from scratch.
In Leios 0.1, we’re going to allow stake pool operators to “vote” on transactions that are available and valid. We gossip these votes throughout the network, and each node gathers votes until it’s seen at least 70% of the network vote on the transaction. Then, when we produce the Praos block, we include only the transaction IDs of any transactions that have reached the threshold, and a compact “certificate” proving we’ve seen a sufficient number of votes, using something like ALBA Arguments.
Now, when we receive the block, we just check the certificate for the transactions we haven’t yet seen, and we can have very high confidence that the whole block will be available and valid before we’ve downloaded every piece of it. This allows our blocks to be small, propagate quickly, and include lots of transactions that were produced in parallel.
Leios 0.1 - The Flaw
If I was narrating this blog post, it’s at this point that I would insert a record scratch.
Unfortunately, the design above performs even worse than Praos! What went wrong?
In practice, this process of voting on transactions is very slow (we need to propagate votes from hundreds of stake pools to reach the required threshold), and the certificates are very large (several kilobytes). Suddenly, we can only fit a few transactions in a block, most of our network traffic is eaten up by propagating millions of these voting messages, and the time it takes for any given transaction to make it into a block is on the order of several minutes, if not longer!
Clearly, “transactions” is not the right level of granularity for making these attestations and certificates.
Leios 0.2 - Design
If transactions are not the right granularity for these certificates, let’s instead group transactions into large batches. We’ll gather dozens or even hundreds of transactions, and propagate them around the network as a group. The block producers will cast “votes” on these blocks, sharing those votes with their peers.
We can also produce these batches in parallel around the network, to better utilize our bandwidth.
Like before, once any node observes votes from a majority of the stake, they can produce a compact certificate. Then, when a node becomes eligible to produce a Praos block, they can include a reference to that batch, and the certificate proving its availability and validity.
To differentiate between these two different groupings, we introduce the following names for them:
- Input Blocks: large batches of transactions, that represent the input to the system
- Ranking Blocks: blocks produced via the Praos consensus protocol, which determine some ordering for transactions
Leios 0.2 - The Shortcoming
Unfortunately, this only gets us so far. If we try to make these input blocks too big, then we have ruined the concurrency properties that let us saturate the bandwidth of the network. And if we try to produce too many in parallel, then only a few will make it into the Ranking blocks because of the size of the certificates.
We would like to produce input blocks that are at the right granularity for parallelized generation, such that we don’t lose all of our efficiency to scheduling overhead, while also reducing the overhead introduced by voting and including a certificate.
It’s only half tongue-in-cheek that I say: this is computer science, so the problem is solved by adding another layer!
Leios 0.3 - Design
To address the above, we introduce yet another type of block, specifically for this “endorsement” and voting stage. We call these, unimaginatively, Endorsement Blocks.
Now, the protocol generates mid-sized “input blocks” that aggregate many transactions. These are large enough to represent a meaningful amount of work for a single core to validate so that when we validate them in parallel, the scheduling overhead lost to multithreading is minimal, but small enough that they can be generated concurrently across the whole network, and efficiently saturate the network bandwidth between nodes. This is a parameter that can be tuned, but you can think in the 100 kilobytes to 350 kilobytes range, representing several hundred average-sized transactions.
These are produced at a high enough rate to saturate most of the network bandwidth, and concurrently across the whole network. Again, this is a parameter that can be tuned, but just to build a mental model, you can picture the network producing 5 input blocks per second.
To ensure these blocks are produced at a useful rate, and resistant to censorship, we use a similar “Poisson Schedule” slot lottery as in Praos, but at a much higher frequency.
Then, using another, separate slot lottery, nodes on the network are also eligible to produce endorsement blocks. These aggregate the IDs of all the input blocks seen by that node and attest to the validity and availability of the nodes. As this block is propagated around the network, other nodes validate that, yes indeed, these transactions exist and are valid, and vote with their stake weight on that validity.
Endorsement blocks in this design are produced roughly as frequently as Ranking blocks.
Like in previous designs, once an Endorsement block reaches a supermajority threshold of stake votes, then the next time a Ranking block is produced, the slot leader includes the single endorsement block and the single compact voting certificate. Because this block now canonically decides the ordering, when other nodes download this block, they can validate the certificate and have high confidence that, even if they haven’t downloaded all the transactions yet, they will soon, and they will be valid with high probability. That confidence means that, when it comes time to produce the next ranking block, even if it’s just a short time later, they can do so, even if they haven’t yet reconstructed the full ledger state. This, ultimately, keeps network forks very low and maintains all of the great properties of Praos.
Leios 0.3 - Suboptimality
The design described above is very close to the full Leios protocol, but it still ultimately under-utilizes the network. The network needs to have enough time to propagate input blocks, include them in an endorsement block, circulate votes on endorsement blocks, and finally include the certificate in a Ranking block. Given the extra delay, the time between ranking blocks has stretched from an average of 20 seconds, up to 75 to 150 seconds.
We’ve replaced one “quiet period” with another.
Leios 0.4 - Design
Building on the previous protocol, we can increase the density by copy-pasting the protocol with a technique known as “pipelining”.
We can run multiple instances of the protocol described above in parallel but shifted in time. At any given point in time, there is one copy of the protocol that is generating input blocks, another copy that is generating endorsement blocks, another that is voting on that endorsement block, and another that is producing a ranking block.
In this way, we restore the regular, 20-second heartbeat of Ranking blocks, each packed with the full networking capacity that the network of Cardano nodes can support.
Leios 0.4 - Chain Quality
There’s one final problem in the design above (at least, one final problem that we have a clear solution for), and one final improvement to the protocol I’d like to present before we arrive at the version of Leios described in the research paper.
The main remaining problem has to do with chain quality. Chain quality is a property described in the original Ouroboros Praos paper, and it represents the number of honest blocks within a given window of time. A high chain quality means that there are relatively short windows of time between any two honest blocks, while a lower chain quality means that those windows of time starts to grow. Chain quality drops, for example, when someone fails to produce a scheduled block, or produces a block that maliciously censors transactions, for example. The Ouroboros Praos protocol places bounds and guarantees on the chain quality of the network.
However, when chain quality drops in Ouroboros Leios, it has an outsized impact. If a ranking block isn’t produced for a specific pipeline, then the endorsement block from that pipeline never makes it on-chain, and all of the transactions it references are lost. Similarly, when a malicious stake pool operator is elected to produce a block they can choose an endorsement block that censors transactions they don’t like, or refuse to select an endorsement block at all.
Leios 0.5
To resolve this is fairly simple: endorsement blocks are allowed to reference previous endorsement blocks, in case they were missed.
When reconstructing the sequence of transactions afterward, we first follow these recursive references to find any endorsement blocks that haven’t yet been seen on-chain.
This ensures that, if we miss a ranking block, the bounds on chain quality that Ouroboros Praos gives us will ensure that the missed endorsement block is eventually included.
Next up: How Leios Works – A Pipelined Protocol for Scalability