4.6 Congestion Control
This specification provides a solution to deal with network congestion in the IOTA network. The congestion control algorithm described here decides which messages should be processed and gossiped to a node's neighbors, and in what order.
4.6.1 Introduction
Every network has to deal with its intrinsically limited resources in terms of bandwidth and node capabilities (CPU and storage). In this document, we present a congestion control algorithm to regulate the influx of messages into the network with the goal of maximizing throughput (messages/bytes per second) and minimizing delays. Furthermore, the following requirements must be satisfied:
- Consistency. If a message is written by one honest node, it shall be written by all honest nodes within some delay bound.
- Fairness. Nodes can obtain a share of the available throughput depending on their access Mana. Throughput is shared in such a way that an attempt to increase the allocation of any node necessarily results in the decrease in the allocation of some other node with an equal or smaller allocation (max-min fairness).
- Security. Malicious nodes shall be unable to interfere with either of the above requirements.
Further information can be found in our paper Access Control for Distributed Ledgers in the Internet of Things: A Networking Approach.
4.6.1.1 Dependencies
The Congestion Control specification depends on the following specifications:
- Section 2.4 - Data Flow
- Section 4.2 - Timestamp
- Section 5.3 - Mana
4.6.1.2 Proposal
In this specification, we present the congestion control algorithm that shall be implemented by all IOTA nodes. Nodes cannot gain any advantage by not following the protocol. Conversely, they may eventually be identified as malicious nodes and banned. Our algorithm has three core components:
- A scheduling algorithm which ensures fair access for all nodes according to their access Mana.
- A TCP-inspired algorithm for decentralized rate setting to efficiently utilize the available bandwidth while preventing large delays.
- A blacklisting policy to ban malicious nodes.
4.6.2 Congestion Control Algorithm
4.6.2.1 Outbox Management
Once a message has successfully passed the message parser checks and is solid, it is enqueued into the outbox for scheduling (see Section 2.4 - Data Flow). The outbox is logically split into several queues, each one corresponding to a different node issuing messages. In this section, we describe how messages are enqueued into and dequeued from the outbox.
The enqueuing mechanism includes the following components:
- Classification. The mechanism identifies the queue to which the message belongs, according to the node ID of the message issuer.
- Message enqueuing. The message is enqueued, the queue is sorted by message timestamps in increasing order, and counters are updated (e.g., the counter for the total number of bytes in the queue).
- Message drop. In some circumstances, due to network congestion or ongoing attacks, some messages shall be dropped to guarantee bounded delays and to isolate an attacker's messages. Specifically, a node shall drop messages in two situations (see the sketch after this list):
  - since buffers have a limited size, if the total number of bytes in all queues exceeds a certain threshold, new incoming messages are dropped;
  - to guarantee the security of the network, if a certain queue exceeds a given threshold, new incoming packets from that specific node ID are dropped.
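The two drop conditions can be summarized in a short sketch. This is a minimal illustration in Python, not the reference implementation: `Message`, `try_enqueue`, and the numeric values are hypothetical placeholders, while `MAX_BUFFER` and `MAX_QUEUE` mirror the global constants of Table 4.6.1.

```python
from dataclasses import dataclass

MAX_BUFFER = 10_000_000  # total outbox capacity in bytes (placeholder value)
MAX_QUEUE = 10.0         # max access Mana-scaled per-node queue length (placeholder)

@dataclass
class Message:
    node_id: str
    size: int        # weight: message size in bytes
    timestamp: float

def try_enqueue(queues, total_bytes, msg, mana):
    """Return (accepted, new_total_bytes); drop msg if either threshold is hit."""
    # Drop condition 1: the shared outbox buffer is full.
    if total_bytes + msg.size > MAX_BUFFER:
        return False, total_bytes
    # Drop condition 2: the issuer's Mana-scaled virtual queue is too long.
    node_queue = queues.setdefault(msg.node_id, [])
    node_bytes = sum(m.size for m in node_queue)
    if (node_bytes + msg.size) / mana[msg.node_id] >= MAX_QUEUE:
        return False, total_bytes
    # Classification + enqueue: the issuer's queue stays sorted by timestamp.
    node_queue.append(msg)
    node_queue.sort(key=lambda m: m.timestamp)
    return True, total_bytes + msg.size
```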
The dequeue mechanism includes the following components:
- Queue selection. A queue is selected according to a round-robin scheduling algorithm. In particular, we use a modified version of the deficit round robin (DRR) algorithm, described in Section 4.6.2.2 - Scheduler.
- Message dequeuing. The first message of the queue is dequeued, and the list of active nodes is updated.
- Scheduler management. Scheduler counters and pointers are updated.
4.6.2.2 Scheduler
The most critical task is the scheduling algorithm, which must guarantee that, for an honest node `node`, the following requirements are met:
- `node`'s messages will not accumulate indefinitely at any node (i.e., starvation is avoided), so the consistency requirement is ensured.
- `node`'s fair share (according to its access Mana) of the network resources is allocated to it, guaranteeing the fairness requirement.
- Malicious nodes sending above their allowed rate will not interrupt `node`'s throughput, fulfilling the security requirement.
We remind the reader that the above requirements are described in Section 4.6.1 - Introduction.
Although nodes in our setting are capable of more complex and customised behaviour than a typical router in a packet-switched network, our scheduler must still be lightweight and scalable due to the potentially large number of nodes requiring differentiated treatment. It is estimated that over 10,000 nodes operate on the Bitcoin network, and we expect that an even greater number of nodes are likely to be present in the IoT setting. For this reason, we adopt a scheduler based on Deficit Round Robin (DRR) (the Linux implementation of the FQ-CoDel packet scheduler, which is based on DRR, supports anywhere up to 65535 separate queues).
The DRR scans all non-empty queues in sequence. When a non-empty queue is selected, its priority counter (called deficit) is incremented by a certain value (called quantum). The deficit counter then gives the maximum number of bytes that can be sent in this turn: if the deficit counter is greater than the weight of the message at the head of the queue, this message can be scheduled and the counter is decremented by this weight. In our implementation, the quantum is proportional to the node's access Mana, and we add a cap on the maximum deficit that a node can accumulate in order to keep the network latency low. It is also important to mention that the weight of a message can be assigned in such a way that specific messages are prioritized (low weight) or penalized (large weight); by default, in our mechanism the weight is proportional to the message size measured in bytes. The weight of a message is set by the function `WorkCalculator()`, and additional details can be found in Section 2.4 - Data Flow.
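To make the deficit/quantum mechanics concrete, here is a hedged Python sketch of one DRR scan, with the quantum proportional to access Mana and the deficit capped. `Msg`, `drr_round`, and all numeric values are hypothetical; `MAX_DEFICIT` and the weight-as-size convention mirror the text above.

```python
from dataclasses import dataclass

@dataclass
class Msg:
    size: int  # weight: message size in bytes by default

MAX_DEFICIT = 500.0  # cap on the accumulated deficit (placeholder value)

def work_calculator(msg):
    # Stand-in for the spec's WorkCalculator(): weight = size in bytes.
    return msg.size

def drr_round(queues, deficits, mana):
    """One scan over the non-empty per-node queues (illustrative)."""
    for node_id, queue in queues.items():
        if not queue:
            continue
        # Add the quantum (proportional to access Mana), capped at MAX_DEFICIT.
        deficits[node_id] = min(deficits[node_id] + mana[node_id], MAX_DEFICIT)
        head = queue[0]
        if deficits[node_id] >= work_calculator(head):
            # Enough credit accumulated: schedule the head and pay its weight.
            deficits[node_id] -= work_calculator(head)
            return queue.pop(0)
    return None  # no queue had enough deficit this turn

queues = {"A": [Msg(200)], "B": [Msg(800)]}
deficits = {"A": 0.0, "B": 0.0}
mana = {"A": 300.0, "B": 100.0}
print(drr_round(queues, deficits, mana))  # A accumulates 300 >= 200, so Msg(200) is scheduled
```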
A fundamental remark: the network manager sets a desired maximum (fixed) rate at which messages are scheduled. The parameter `SCHEDULING_RATE` expresses this as the time needed to schedule one unit of weight (see above): every message is scheduled after a delay equal to the weight (size, by default) of the latest scheduled message times `SCHEDULING_RATE`. This rate mostly depends on the desired degree of decentralization: e.g., a larger scheduling rate leads to higher throughput but would leave behind slower devices, which would fall out of sync.
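A worked example of this delay computation, with a purely illustrative value for `SCHEDULING_RATE`:

```python
SCHEDULING_RATE = 0.0001  # placeholder: seconds of delay per unit of weight

def next_scheduling_delay(last_weight):
    # Delay until the next scheduling opportunity: weight of the latest
    # scheduled message times SCHEDULING_RATE (cf. Pause() in the pseudocode).
    return last_weight * SCHEDULING_RATE

print(next_scheduling_delay(500))  # a 500-byte message delays the next slot by about 0.05 s
```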
4.6.2.3 Rate Setting
If all nodes always had messages to issue, i.e., if nodes were continuously willing to issue new messages, the problem of rate setting would be very straightforward: nodes could simply operate at a fixed, assured rate, sharing the total throughput according to the percentage of access Mana owned. The scheduling algorithm would ensure that this rate is enforceable, and that increasing delays or dropped messages are only experienced by misbehaving nodes. However, it is unrealistic that all nodes will always have messages to issue, and we would like nodes to better utilise network resources without causing excessive congestion or violating any requirement.
We propose a rate setting algorithm inspired by TCP: each node employs additive increase, multiplicative decrease (AIMD) rules to update its issuance rate in response to congestion events. In the case of distributed ledgers, all message traffic passes through all nodes, contrary to the traffic typically found in packet-switched networks and other traditional network architectures. Under these conditions, local congestion at a node is all that is required to indicate congestion elsewhere in the network. This observation is crucial, as it presents an opportunity for a congestion control algorithm based entirely on local traffic.
Our rate setting algorithm outlines the AIMD rules employed by each node to set its issuance rate. Rate updates for a node `node` take place each time a new message is scheduled, provided `node` has a non-empty set of its own messages not yet scheduled. Node `node` sets its own local additive-increase variable `localIncrease(node)` based on its access Mana and on a global increase rate parameter `RATE_SETTING_INCREASE`. An appropriate choice of `RATE_SETTING_INCREASE` ensures a conservative global increase rate which does not cause problems even when many nodes increase their rates simultaneously. After a multiplicative decrease (governed by the global parameter `RATE_SETTING_DECREASE`), nodes wait `RATE_SETTING_PAUSE` seconds, during which no further updates are made, to allow the reduced rate to take effect and to prevent multiple successive decreases. At each update, `node` checks how many of its own messages are in its outbox queue, and responds with a multiplicative decrease if this number is above a threshold `backoff(node)`, which is proportional to `node`'s access Mana. If the number of `node`'s messages in the outbox is below the threshold, `node`'s issuance rate is incremented by its local increase variable `localIncrease(node)`.
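The following Python sketch condenses these AIMD rules; it is illustrative rather than normative. `rate_update` and `BACKOFF` are hypothetical (here `BACKOFF` is a fixed threshold on the Mana-scaled backlog, a simplification of the Mana-proportional `backoff(node)` above), while the `RATE_SETTING_*` names mirror Table 4.6.1 with placeholder values.

```python
RATE_SETTING_INCREASE = 0.1   # global additive increase (placeholder)
RATE_SETTING_DECREASE = 1.5   # global multiplicative decrease, > 1 (placeholder)
RATE_SETTING_PAUSE = 2        # updates skipped after a decrease (placeholder)
BACKOFF = 1.0                 # hypothetical Mana-scaled backlog threshold

def rate_update(own_rate, own_queue_len, own_mana, total_mana, pause_updates):
    """One AIMD update, run each time a message is scheduled."""
    if pause_updates > 0:
        # A recent decrease is still taking effect: skip this update.
        return own_rate, pause_updates - 1
    if own_queue_len / own_mana > BACKOFF:
        # Congestion signal: our own backlog is too long -> multiplicative decrease.
        return own_rate / RATE_SETTING_DECREASE, RATE_SETTING_PAUSE
    # No congestion: additive increase proportional to our share of access Mana.
    return own_rate + RATE_SETTING_INCREASE * own_mana / total_mana, 0
```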
4.6.2.4 Message Blocking and Blacklisting
If an incoming message would cause the total outbox buffer size to exceed its maximum capacity `MAX_BUFFER`, that message is dropped. In our analysis, we set buffers to be large enough to accommodate the traffic from all honest nodes.
Furthermore, to mitigate spamming actions from malicious nodes, we add an additional constraint: if `node`'s access Mana-scaled queue length (i.e., queue length divided by the node's access Mana) exceeds a given threshold `MAX_QUEUE`, any new incoming packet from `node` will be dropped, and the node is blacklisted. The attacker is blacklisted for a time `BLACKLIST_TIME`, during which no messages issued by `node` can be added to the outbox. Please note that it is still possible to receive messages from the attacker through solidification requests, which is important in order to guarantee the consistency requirement. Finally, when a node is blacklisted, the blacklister does not increase its own rate for a time `RATE_SETTING_QUARANTINE`, to avoid errors in the perception of the current congestion level.
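A small sketch of the blacklist bookkeeping, under the assumption that blacklist entries are timestamps (as in the pseudocode of Section 4.6.3.4); `blacklist` and `is_blacklisted` are hypothetical helpers and the duration is a placeholder.

```python
import time

BLACKLIST_TIME = 60.0  # placeholder: blacklist duration in seconds

blacklisted = {}  # nodeID -> timestamp of blacklisting (absent or 0 = not blacklisted)

def blacklist(node_id, now=None):
    # Record the moment the node exceeded its Mana-scaled queue threshold.
    blacklisted[node_id] = now if now is not None else time.time()

def is_blacklisted(node_id, now=None):
    """True while the node's blacklist window is still open."""
    now = now if now is not None else time.time()
    ts = blacklisted.get(node_id, 0)
    return ts != 0 and (now - ts) <= BLACKLIST_TIME
```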
4.6.3 Algorithmic Details
4.6.3.1 Protocol Parameters
In line with the previous section, all nodes know the global variables described in Table 4.6.1.
Parameter | Type | Description |
---|---|---|
`SCHEDULING_RATE` | integer | clock time interval between subsequent executions of the function `Schedule()` |
`RATE_SETTING_INCREASE` | float | global additive increase parameter |
`RATE_SETTING_DECREASE` | float | global multiplicative decrease parameter (larger than 1) |
`RATE_SETTING_PAUSE` | integer | waiting time before `ownID`'s next rate update after a backoff |
`RATE_SETTING_QUARANTINE` | integer | waiting time before `ownID`'s next rate update after blacklisting another node |
`MAX_BUFFER` | integer | maximum buffer size (in bytes) |
`MAX_QUEUE` | float | maximum access Mana-scaled queue length |
`MAX_DEFICIT` | float | maximum cap for the accumulated deficit |
`MAX_RATE` | float | maximum rate at which a node is allowed to issue messages |
`MIN_MANA` | integer | minimum amount of access Mana needed to issue messages |
`BLACKLIST_TIME` | integer | time interval during which no messages from a blacklisted node are added to the outbox |
Table 4.6.1: Global constants.
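These constants translate naturally into a node-level configuration object. A minimal sketch; the class name and all numeric values are placeholders, not normative defaults:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CongestionControlParams:
    # Placeholder values for illustration only; real deployments tune these.
    scheduling_rate: int = 100_000        # ns of delay per unit of weight
    rate_setting_increase: float = 0.075
    rate_setting_decrease: float = 1.5    # must be larger than 1
    rate_setting_pause: int = 2           # updates skipped after a backoff
    rate_setting_quarantine: int = 300    # updates skipped after blacklisting
    max_buffer: int = 10_000_000          # bytes
    max_queue: float = 10.0               # Mana-scaled queue length
    max_deficit: float = 500.0
    max_rate: float = 1000.0              # messages per second
    min_mana: int = 1
    blacklist_time: int = 60              # seconds
```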
4.6.3.2 Local Variables
Local variables are described in Table 4.6.2.
Variable | Type | Description |
---|---|---|
`ownRate` | float | issuance rate of `ownID` according to the rate setter |
`activeNode` | list | updated list of nodes having at least one message in the outbox queue (more details in Section 4.6.3.5 - Implementation) |
`bufferQueue` | queue | actual outbox queue from which messages are scheduled (more details in Section 4.6.3.5 - Implementation) |
`nextID` | nodeID | pointer to the specific queue from which the next message can be scheduled |
`mana` | list | contains the up-to-date (at the time the list is used) value of the access Mana for a given `nodeID`. The way in which Mana is updated is out of the scope of this specification; further information can be found in Section 5.3 - Mana |
`backoff` | float | local threshold for the rate setter's backoff |
`localIncrease` | float | local additive increase parameter |
`blacklisted` | list | list of timestamps indicating when a given `nodeID` was blacklisted; if a node is not blacklisted, its entry is 0 |
`pauseUpdates` | integer | time interval during which the rate setter is not updated |
`messageWorker` | list | list of messages that `ownNode` issued but that are not yet part of the outbox |
Table 4.6.2: Local variables.
4.6.3.3 Built-in Functions
Pseudocodes introduced in the next section will use the built-in functions described in Table 4.6.3.
Function | Description |
---|---|
`Len(x)` | measures the length of a data structure `x` |
`Append(x, y)` | adds a new element `y` to list `x` |
`WorkCalculator(x)` | provides the weight of message `x` |
`Sort(x, y)` | sorts list `x` by metric `y` |
`Parents(x)` | gives the list of parents of message `x` |
`CurrentTime()` | returns the current time computed with the local clock |
`Execute(x)` | processes and gossips message `x` |
`Pause(x)` | stops execution of a function for `x` time units |
Table 4.6.3: Built-in functions.
4.6.3.4 Pseudocode
The congestion control algorithm follows the solidification step in the data flow: when a new message `msg` arrives at the scheduler, the function `Enqueue(msg)` is triggered in order to properly add `msg` to the outbox. At regular intervals (given by `SCHEDULING_RATE` times `WorkCalculator(x)`, where `x` is the latest scheduled message), the function `Schedule()` picks a new message to be gossiped to neighbors and added to the local ledger. Simultaneously, `RateSetting()` adjusts the message generation rate of `ownID` according to the network congestion.
`Enqueue(msg)`
The function `Enqueue(msg)` adds a new message `msg` to the outbox and updates the list of active nodes accordingly. The checks on the blacklisting condition (`blacklisted[nodeID] == 0`) and on the buffer size (`Len(bufferQueue) < MAX_BUFFER`) may be moved to the parser checks for optimization purposes.
```
### upon arrival of a new message msg (having passed solidification) ###
FUNCTION Enqueue(msg)
    nodeID = msg.nodeID
    IF mana[nodeID] > MIN_MANA AND (blacklisted[nodeID] == 0 OR CurrentTime() - blacklisted[nodeID] > BLACKLIST_TIME)
        blacklisted[nodeID] = 0
        IF Len(bufferQueue) < MAX_BUFFER
            IF activeNode[nodeID] != NULL
                # other messages from nodeID are already in the queue
                nodeQueue = activeNode[nodeID]
                IF (Len(nodeQueue) + Len(msg)) / mana[nodeID] < MAX_QUEUE
                    # append msg and keep the queue sorted by timestamp
                    Append(bufferQueue, msg)
                    Sort(bufferQueue, timestamp)
                ELSE
                    # blacklist nodeID and pause rate setting updates
                    blacklisted[nodeID] = CurrentTime()
                    pauseUpdates = Max(pauseUpdates, RATE_SETTING_QUARANTINE)
            ELSE
                # no other messages from nodeID are present in the buffer
                Append(activeNode, nodeID)
                activeNode[nodeID].deficit = MAX_DEFICIT
                Append(bufferQueue, msg)
```
`RateSetting()`
The function `RateSetting()` updates the rate `ownRate` at which messages can be issued by the node. The maximum value that `ownRate` can reach is `MAX_RATE`. At bootstrap, `ownRate` is initialized to the proportion of access Mana owned by the node times `RATE_SETTING_INCREASE`.
```
FUNCTION RateSetting()
    # update the issuance rate if there was no recent backoff
    IF ownRate < MAX_RATE
        # check the Mana-scaled length of ownID's queue in the outbox
        IF Len(bufferQueue[ownID]) / mana[ownID] > backoff
            ownRate = ownRate / RATE_SETTING_DECREASE
            pauseUpdates = Max(pauseUpdates, RATE_SETTING_PAUSE)
        ELSE
            ownRate += RATE_SETTING_INCREASE * mana[ownID] / Sum(mana)
```
`Schedule()`
At regular intervals, i.e., every `SCHEDULING_RATE` times `WorkCalculator(x)`, where `x` is the latest scheduled message, the function `Schedule()` selects the next message to gossip and process, provided at least one message exists in `bufferQueue`. Otherwise, it returns `NULL` and the scheduling slot is missed. The local variable `pauseUpdates` is initialized to 0.
```
FUNCTION msg = Schedule()
    WHILE TRUE
        IF Len(bufferQueue) > 0
            # msg is the message at the head of the currently selected queue
            msg = bufferQueue[nextID].head
            # point to a nodeID with enough deficit, having a valid message
            WHILE activeNode[nextID].deficit < WorkCalculator(msg) OR msg.timestamp > CurrentTime() OR Parents(msg) have not been scheduled
                # add the quantum (proportional to access Mana), capped at MAX_DEFICIT
                activeNode[nextID].deficit += mana[nextID]
                IF activeNode[nextID].deficit > MAX_DEFICIT
                    activeNode[nextID].deficit = MAX_DEFICIT
                nextID++
                msg = bufferQueue[nextID].head
            activeNode[nextID].deficit -= WorkCalculator(msg)
            IF activeNode[nextID].deficit < 0
                activeNode[nextID].deficit = 0
            # remove the scheduled message from its queue
            Remove(bufferQueue[nextID].head)
            # update the list of active nodes
            IF Len(bufferQueue[nextID]) == 0
                Remove(activeNode[nextID])
            # update own rate setting
            IF pauseUpdates > 0
                pauseUpdates -= 1
            ELSE IF Len(messageWorker) > 0
                RateSetting()
            Execute(msg)
            Pause(SCHEDULING_RATE * WorkCalculator(msg))
```
4.6.3.5 Implementation
In this section, we describe the main architectural components used to handle the outbox queue, namely `activeNode` and `bufferQueue` (see Image 4.6.4). The scope of this section is to provide insight into how to efficiently implement the above pseudocode.
- `activeNode`: a list of the node IDs of the nodes having at least one message in the outbox queue. Each node ID in the list points to its oldest message in the outbox buffer.
- `bufferQueue`: the actual outbox queue. It is possible to build overlapping virtual queues (indicated by colors in the figure) to represent the different per-node queues. This data structure has a limited fixed size `MAX_BUFFER`, and messages (in each queue) are sorted by timestamp.
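One way to realize this layout is a single physical buffer indexed by node ID, so that the per-node "virtual" queues and the `activeNode` pointers share the same storage. A hedged Python sketch, assuming the hypothetical names `Outbox` and `Message`:

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Message:
    node_id: str
    size: int        # weight in bytes
    timestamp: float

class Outbox:
    """One physical buffer with per-node virtual queues (illustrative)."""
    def __init__(self, max_buffer):
        self.max_buffer = max_buffer   # MAX_BUFFER, in bytes
        self.total_bytes = 0
        # activeNode analogue: node IDs with >= 1 queued message; each entry
        # holds the node's DRR deficit and its timestamp-sorted virtual queue.
        self.active = OrderedDict()

    def enqueue(self, msg):
        if self.total_bytes + msg.size > self.max_buffer:
            return False               # shared buffer full: drop
        entry = self.active.setdefault(msg.node_id, {"deficit": 0.0, "queue": []})
        entry["queue"].append(msg)
        entry["queue"].sort(key=lambda m: m.timestamp)  # oldest first
        self.total_bytes += msg.size
        return True

    def oldest(self, node_id):
        """The message each activeNode entry points to: that node's oldest."""
        entry = self.active.get(node_id)
        return entry["queue"][0] if entry and entry["queue"] else None
```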
Other information about hardware implementations of similar scheduling algorithms can be found in the literature.
Image 4.6.4: Proposed data structure for the implementation of the congestion control algorithm.
4.6.4 Optional and Future Optimizations
4.6.4.1 Synchronization
When the network has a high level of congestion, it may be difficult for an out-of-sync node to synchronize, as most of its scheduling rate is consumed by new messages. Hence, it is desirable to have a mechanism that allows messages to be scheduled faster under special conditions, so that the node can catch up with the rest of the network.
Specifically, consider the following two scenarios:
- The node is bootstrapping.
- The node's `syncStatus` flag is set to `FALSE` (see Section 4.2 - Timestamp).
In either of these scenarios, the node is very far behind the rest of the network. In this case, we suggest bypassing the DRR scheduler and scheduling solid old messages in FIFO order at the largest possible rate the node can process. We repeat that this feature is optional: while it reduces the time needed to synchronize, it is not strictly needed for the correct functioning of the congestion control algorithm.
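A minimal sketch of such a FIFO bypass, assuming timestamp-sorted per-node queues as in Section 4.6.3.5; `next_for_sync` is a hypothetical helper:

```python
def next_for_sync(queues):
    """FIFO bypass: return the oldest solid message across all virtual queues."""
    heads = [q[0] for q in queues.values() if q]  # each queue is timestamp-sorted
    return min(heads, key=lambda m: m.timestamp, default=None)
```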
4.6.4.2 Adaptive Minimum Access Mana
Nodes must hold a sufficient amount of access Mana (larger than `MIN_MANA`) to be able to successfully issue new messages. We are currently investigating a way to adapt this threshold over time, depending on the current traffic congestion of the network.
4.6.4.3 Dynamic Scheduling Rate
In the current proposal, the throughput is preset by the network manager. This value takes into account nodes' hardware as well as bandwidth capacity. Hardware improvements or protocol optimizations will not result in a performance improvement unless the network manager changes the throughput parameter `SCHEDULING_RATE`. We are currently investigating a way to dynamically adapt the throughput according to network and protocol characteristics, based on the neighbors' health state.