# Cowsay's BFT Dev Kit
Abstract
$$BFT\ Prototyping\ Made\ Simple$$
[View codebase](https://github.com/sgdxbc/bft-kit).
## Introduction
Cowsay's BFT Development Kit provides prototype implementations and evaluation scaffolding for a selected collection of distributed protocols related to tolerating Byzantine faults. The development starts with a list of well-known atomic broadcast protocols (or, approximately, blockchains), and will later be extended to include distributed hash tables and transactional stores.
|Module|Paper|Official Implementation|
|-|-|-|
|`pbft`|[PBFT (OSDI'99)](https://pmg.csail.mit.edu/papers/osdi99.pdf)|[BFT project](https://pmg.csail.mit.edu/bft/#sw)|
|`hotstuff`|[HotStuff (PODC'19)](https://dl.acm.org/doi/pdf/10.1145/3293611.3331591) [(arxiv)](https://arxiv.org/pdf/1803.05069)|[hot-stuff/libhotstuff](https://github.com/hot-stuff/libhotstuff)|
The prototypes prioritize being *canonical*, *intuitive* and *comprehensive*. The implementations faithfully reflect the intentions of the paper texts. We don't invent homemade optimizations, we don't introduce extra presumptions, and we don't prefer obscurity (or simply more lines of code) for extreme performance, e.g. minimal data copies. The goal is to produce *reference implementations* that are easy to reason about and "manageable" to mess around with by applying experimental modifications.
The evaluation scaffolding proposes a standard methodology to study the performance implications of the prototypes. It contains reusable components for framing the prototypes in simulated or realistic environments, based on idiomatic practices from my previous experience. Realizing and benchmarking your next state-of-the-art network protocol idea becomes easy: implement the core protocol in a few hundred lines of code, import or copy the rest of the heavy lifting from the common modules, make a copy of the examples, binaries and configurations with some tweaks, and you are good to start looking at performance metrics.
### Features
One codebase for all protocols. You only need to suffer the pain of setting up a development environment once (as a standard Rust codebase, probably not actually painful!) and everything you need is up and working. Having all implementations share the same setup and set of peripherals (e.g. programming language, IO solution, cryptographic libraries, etc.) enables perfect apples-to-apples comparisons.
Ergonomics for rapid development and evaluation iteration. The kit includes useful building blocks for benchmarking protocol performance under various conditions, featuring solutions for orchestrating configurable concurrent clients to perform RPCs and aggregate SLO metrics, in both closed-loop and open-loop fashions. The codebase includes configured certificates for QUIC endpoints and protocol-layer signature schemes. The secret keys are all ready.
The secret key for QUIC endpoints is (deliberately) stored in clear text at `src/crypto/cert/key_pair.pem` and embedded into the compiled artifacts during building. Do not use it in production (like everything else in the codebase). If you would like to use another private key, run the `rotate-cert` example from the project root folder to override it.
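Assuming `rotate-cert` is invoked as a regular Cargo example, that would be
$ cargo run --example rotate-cert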
Thorough and consistent logging. All slow-path entrances are annotated with `WARN` logs; all fast-path logs are at `TRACE` level.
Network transport with QUIC. Messages are sent over ephemeral streams that multiplex the underlying connections, enjoying the desirable properties of the QUIC protocol: reliability (i.e. automatic retransmission), while being free from head-of-line blocking by allowing the (streams of the) messages to be delivered out of order.
**Technology Stack.** Built with the greatest from the ecosystem. The dev kit uses `tokio` for IO transport and timers, `quinn` for the QUIC network layer, `bincode` for serialization/deserialization, `sha2` and `secp256k1` (`threshold_crypto` and `givre`) for hashing and (threshold) digital signatures, `tracing` and `tracing-subscriber` for logging, and `hdrhistogram` for metrics statistics.
## Quick Start: Ways to Run PBFT
We show various ways to run the same implementation of PBFT (`pbft::Replica`), each tailored to a different stage of the development lifecycle.
**Testing.** The implementation interacts with an abstract transport layer and is completely free from I/O (see the design practices below for details). The tests are based on a simulated transport (implemented in the `common::testing` module), introducing minimal additional code dependencies for testing the prototype. Execute
$ cargo test --lib -- pbft::tests
to run all PBFT tests. You may want to pass environment variables `RUST_LOG` and `RUST_BACKTRACE` as well.
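For example, to get full trace logs and a backtrace from a run (assuming the usual `EnvFilter` syntax of `tracing-subscriber`):
$ RUST_LOG=trace RUST_BACKTRACE=1 cargo test --lib -- pbft::tests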
The tests mostly check for liveness, so as long as all tests pass, if a protocol gets stuck in a real deployment, the network transport is probably to blame rather than the protocol implementation. This dev kit is not good at ensuring correctness; you may still want to look into approaches like model checking for that (it may be added to the codebase in the future). Also, the simulated transports may not deliver the messages in a realistic order. The simulated order is just a *reasonable* one that the protocol prototype should be able to work with.
Utilizing `proptest` for simulation is desirable, but I'm a little worried about whether the randomized order will be reasonable or how to enforce that.
**Loopback examples.** The `src/pbft/transport.rs` provides a network transport to run PBFT clients and replicas in asynchronous tasks (or coroutines). The `pbft` example runs multiple client and replica tasks in the same process with loopback network. The configurations are hardcoded in the example.
**Manual cluster-scale deployment.** The `pbft-client` and `pbft` binaries run *a group of closed-loop clients* and a replica, respectively, according to a set of configuration files. Execute
$ cargo run -r --bin pbft -- configs/replica0.conf
to run PBFT replica #0 configured with `configs/replica0.conf`; it will further discover `common.conf` and `network.conf` in the same folder as the replica configuration, which is `configs` in this case. Then, either on the same machine or on another machine with identical compiled binaries and configurations (a network file system would be convenient), execute the same command with `configs/replica1.conf` to run replica #1, configured by that file plus the same other two configurations as replica #0. Repeat two more times for replica #2 and replica #3.
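That is, the remaining replicas are started with
$ cargo run -r --bin pbft -- configs/replica1.conf
$ cargo run -r --bin pbft -- configs/replica2.conf
$ cargo run -r --bin pbft -- configs/replica3.conf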
Wait a while until the replicas report "replica ready" in the logs. Execute
$ cargo run -r --bin pbft-client -- configs/client.conf
to run PBFT clients configured by `configs/client.conf` if it exists (which is not the case currently). The `common.conf` and `network.conf` are discovered the same way as for the replicas. The clients will exit after reporting the performance statistics. It is possible to rerun clients against the same replica network, but starting with a fresh network is recommended. Send Ctrl-C to one replica to exit it, and the other replicas will exit (kind of crash, actually) after their connections to the exited one close (unexpectedly). Alternatively, if all replicas are running on the same machine, execute
$ pkill -INT pbft
to shut down all of them.
The (conventional) taxonomy of configurations. `replicaN.conf` and `client.conf` exclusively configure replica #N or a client process, and at the same time locate the other two configurations. `common.conf` and `network.conf` hold items shared by all participants. The two are not essentially different; `network.conf` is just conventionally dedicated to storing network addresses.
The configuration layout is relatively flexible: the binaries merge the three relevant configurations into one single configuration store, then extract all kinds of configuration structures from it. The above description of which configuration is for what is just convention. Also, the extraction is not sensitive to extra entries, to allow reusing the same set of configurations across systems that accept different configurations (as long as there are no name conflicts).
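A hypothetical sketch of this "merge, then extract" flow; the type, keys, and file contents below are illustrative stand-ins, not the dev kit's actual configuration format or API:

```rust
// Illustration only: one store built from all three configuration files,
// with extraction that ignores entries it does not know about.
use std::collections::HashMap;

#[derive(Default)]
struct ConfigStore(HashMap<String, String>);

impl ConfigStore {
    // Merge entries from one configuration file into the single store.
    fn merge(mut self, entries: &[(&str, &str)]) -> Self {
        for (key, value) in entries {
            self.0.insert(key.to_string(), value.to_string());
        }
        self
    }

    // Extraction picks what it knows and skips everything else, so the same
    // set of configurations can be reused across systems with different needs.
    fn extract_replica_id(&self) -> Option<u8> {
        self.0.get("replica_id")?.parse().ok()
    }
}

fn main() {
    let store = ConfigStore::default()
        .merge(&[("replica_id", "0")]) // from replica0.conf
        .merge(&[("faulty_count", "1")]) // from common.conf
        .merge(&[("replica0_addr", "10.0.0.1:5000")]); // from network.conf
    assert_eq!(store.extract_replica_id(), Some(0));
}
```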
Notes on the network file system. The above description applies to a development codebase that lives in a network file system. However, building inside a network file system may not be desirable. Alternatively, you can put the development codebase outside the network file system's mount root, and execute
$ uv run ./scripts/cp_artifacts.py $DIR
to build the binaries and copy them into `$DIR` under the network file system.
**Automated deployment.** Work in progress.
## Design Practices
If you only want to experiment with this codebase, this section will not be useful. Keep reading if you are wondering "why the hell does he write the codebase in such a way".
The short (but annoying) answer is: [all attempts to write it in alternative forms](https://github.com/neatsys-failures) ended up unsatisfying.
**Abstract over receiver actions instead of runtime context, or [sans I/O] (specialized to higher-level protocols).** In a codebase that consists of protocol state machines and the "outside world", e.g. network transport, simulated event queue, etc., it is tempting to let the receivers (i.e. protocol state machines) and the runtime context hold references to each other. The runtime calls into receivers to process arriving messages, and the receivers call the runtime context in the middle of processing to make any *effect*, mostly sending messages and setting timers. Projects following this practice include the good old [specpaxos] (replicating which was my starting point long ago) and [Demikernel].
[sans I/O]: https://sans-io.readthedocs.io/
[specpaxos]: https://github.com/UWSysLab/specpaxos
[Demikernel]: https://github.com/microsoft/demikernel
However, this is not the optimal solution, at least not for *simplicity*. In Rust in particular, cyclic (mutable) references between receivers and the runtime cannot be elegantly expressed. If we don't want to involve `dyn` (I won't elaborate on the inconveniences it causes, for brevity, but Niko has [a series of blogs](https://smallcultfollowing.com/babysteps/series/dyn-async-traits/) that is kind of on them), the receiver and runtime types will lock into each other through generic parameters. The root cause of these troubles is interleaving protocol logic with calls to runtime context methods.
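A minimal illustration of that lock-in (illustrative only, not code from this dev kit):

```rust
// The runtime context is abstracted as a trait over the receiver type...
trait Transport<R> {
    fn send_message(&mut self, dest: usize, payload: Vec<u8>);
    fn set_timer(&mut self, period_ms: u64);
    fn run(&mut self, receiver: &mut R); // ...and it calls back into the receiver.
}

struct Replica<T> {
    transport: T,
    // ...protocol state
}

// The bound names the receiver type back through the transport's parameter,
// and it must be repeated on every impl: the two types lock into each other.
impl<T: Transport<Replica<T>>> Replica<T> {
    fn receive_message(&mut self, payload: &[u8]) {
        // Protocol logic interleaved with runtime-context calls.
        self.transport.send_message(0, payload.to_vec());
        self.transport.set_timer(1000);
    }
}
```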
Sans I/O has recently become the preferred practice for implementing protocols. Different from its originally proposed setup, where protocols interact with the outside world by receiving and sending binary buffers, in this dev kit we would like to design a rather higher-level interface, e.g. for better testing ergonomics. The receiving buffer is replaced by calling into two methods, `receive_message` and `tick`, and the sending buffer is replaced by an "action" buffer where actions are like "send a message to each one of these replicas", etc. (Alternatively, `tick` could be an active `SetTimer` action, but with that I was struggling to design a proper abstraction for canceling timers.)
The `common::ReplicaAction` type is an enumeration of common actions of BFT replicas. It may look like simply applying the command design pattern to materialize the calls to runtimes and realize *inversion of control*. However, note that `SendTo` and `Finalize` will probably be "called" on different components: the network transport and the service state machine driven by ordered commands. The enumeration is an aggregation of the replica's *intents*, not some outside component's *capabilities*.
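A condensed sketch of this interface shape: the method and variant names follow the text above, while the field types are simplified stand-ins rather than the dev kit's real definitions.

```rust
// Aggregation of a replica's intents; interpreted by different components.
enum ReplicaAction {
    SendTo { replica_id: u8, message: Vec<u8> }, // interpreted by the network transport
    Finalize { op: Vec<u8> },                    // interpreted by the service state machine
}

struct Replica {
    actions: Vec<ReplicaAction>, // the "action" buffer replacing a sending buffer
    // ...protocol state
}

impl Replica {
    // Replaces the receiving buffer: the outside world pushes messages in...
    fn receive_message(&mut self, message: Vec<u8>) {
        // ...and the protocol records intents instead of performing any I/O.
        self.actions.push(ReplicaAction::SendTo { replica_id: 0, message });
    }

    // Driven periodically by the runtime, in place of an active SetTimer action.
    fn tick(&mut self) {}

    // The runtime (or a test harness) drains and interprets the intents.
    fn take_actions(&mut self) -> Vec<ReplicaAction> {
        std::mem::take(&mut self.actions)
    }
}
```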
The downside of this approach is the mandatory copying/moving of outgoing messages. If the protocols held a runtime context and could pass it a reference to some data they own, the runtime might be able to serialize the data inline and avoid taking ownership of it. The copying/moving should not be a heavy hitter for performance in most expected cases, and even if we implement protocols dealing with bulk data in the future, a shallow copy achieved by e.g. utilizing `bytes` is straightforward. On the other hand, even if the message is immediately serialized, the runtime usually cannot finish sending inline, for example if it utilizes an asynchronous sending primitive (and we probably don't want to color our protocols with `async`); then a move (of either the data or the serialized bytes) becomes unavoidable.
**Disaggregated protocol implementation.** While the previous practice is about separating protocol state machines from the outside "one big world state", this one argues that we should further extract the protocol *core* from the rest of it. A neat practice for code structure, inspired by [libhotstuff].
[libhotstuff]: https://github.com/hot-stuff/libhotstuff
The protocol core should be decoupled from message passing. It does not touch the wire message format directly. Instead, it passes its *intents* to the wrapping "protocol", which deals with assembling/disassembling wire messages, translating *core actions* into protocol actions and received messages into *core events*.
The goal is not to isolate the IO complexity from the protocol cores: the sans I/O practice already nails that. The goal of this disaggregation is to further isolate the *imperfect network condition* from the core protocol logic. From the protocol core's perspective, there are no lost messages, nor even slow/reordered ones: the core events it receives are always about things it is currently interested in, and every issued core action will always take effect. This enables very concise protocol cores, which can usually fit into a single (relatively lengthy) method like `pbft::ReplicaCore::handle(..)` (close to 100 lines at the time of writing). The correctness of these cores is easy to reason about, and they can match the paper text as closely as possible.
Besides network asynchrony, protocol cores can also opt out of cryptography and certain detailed message flows. The most significant case is collecting quorum certificates. To do so, the protocol core simply produces a single "now watch for a set of particular matching messages" core action, and handles a single "here are the messages" core event afterward. Actually, the initial motivation for applying this practice was implementing the collection of quorum certificates with threshold signatures in HotStuff. As it turns out, state-of-the-art threshold signature protocols (not algorithms yet!) require one additional "sign request" broadcast before the (partial) signing can be conducted. This is a technical detail that is completely unrelated to how HotStuff works, but it requires modifying the message format, affects how the messages are produced and consumed, and would sneak in all around the protocol code without the disaggregation. With a clear separation, however, we can keep it out of the protocol core.
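A hypothetical sketch of that exchange between a core and its wrapping protocol; the names and shapes here are illustrative, not the dev kit's actual API.

```rust
// The core declares what it wants as intents...
enum CoreAction {
    // ...e.g. "now watch for a quorum of matching votes for this view/sequence"...
    WatchQuorum { view: u64, seq: u64 },
    // ...plus other intents the wrapper turns into wire messages.
    Broadcast(Vec<u8>),
}

// ...and the wrapper hands back a single event once the quorum is assembled,
// hiding signatures, duplicates, and extra round trips (such as the "sign
// request" broadcast of a threshold signature protocol) from the core.
enum CoreEvent {
    QuorumReady { view: u64, seq: u64, votes: Vec<Vec<u8>> },
}

#[derive(Default)]
struct ReplicaCore {
    actions: Vec<CoreAction>,
}

impl ReplicaCore {
    fn handle(&mut self, event: CoreEvent) {
        match event {
            // From the core's perspective there are no lost, slow, or reordered
            // messages: it only sees events it is currently interested in.
            CoreEvent::QuorumReady { view, seq, votes: _ } => {
                self.actions
                    .push(CoreAction::Broadcast(format!("commit {view} {seq}").into_bytes()));
            }
        }
    }
}
```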
Note on the protocol states. While the protocol core should be self-contained, the wrapping protocol code is not necessarily so. The wrapping protocol owns the protocol core and can read the core states directly, e.g. for discarding outdated messages. (While possible, you probably don't want to write into the core states from the wrapping protocol, which would violate about every software engineering practice. Instead, pass an event to the core and let it update itself.)
Disaggregation is only for isolating the engineering details, not for making the engineering details independent (or even reusable). Disaggregation should not incur more data copying, and the wrapping protocol only (additionally) stores those (usually ephemeral) states that the core is not interested in, e.g. to tolerate network reordering. At the same time, make sure whatever the core *is* interested in is indeed kept by the core. Do not turn it into a "puppet style" where the core produces actions that contain reference-like indexes into the wrapping protocol's states in order to manipulate them "remotely", just because the wrapping protocol also needs them or they are "foreign" opaque states to the core, like signatures. Instead, store them in the core and pass them by value.
**Defaults to specialized.** Well, this one is half convention, half takeaway from my development experience. Every line of code in this codebase, when it is written down, is *specialized* code that works with only one protocol or service. When there's a need to implement similar functionality in more scenarios, first write mostly or even fully duplicated code, i.e. copy it around, then identify the *reusable* snippets and extract them into commonly shared modules.
Do not attempt to write generic code in its first instantiation. It probably will not fit (or only awkwardly fit) into the following use sites.
Furthermore, this practice leads to code modules that can be more easily used. In contrast, the "extremely composable" approach, which only writes orthogonal atomic snippets and constructs the final binaries by accumulating tons of them, may result in amazing code (if you wonder how that would look, [here](https://github.com/neatsys/neatworks/blob/augustus/src/pbft/replica.rs) is a PBFT replica I have written), but the code will also be really painful to develop and iterate on. I wrote a [blog] about it.
[blog]: https://zhuanlan.zhihu.com/p/690959110
There's never "one code fits all". When the code does not fit, write (seemingly redundant) code rather than adjusting the "all" to make it fit.