Because the refinery runs in the background, removing a peer from the
white- or greylist on upgrade can create an index error when separate
threads execute this code at the same time:
refinery:
upgrades node to whitelist, removes from greylist
upgrade_host:
upgrades node to anchorlist, removes from greylist
Leaving a "safe" peer on the grey- or whitelist is not a problem. The
only impact is that we risk selecting a peer from the whitelist that we
are already connected to as an anchor, but p2p checks exist to protect
against this.
Equally, if we remove from the greylist or whitelist in upgrade_host(),
this can happen:
upgrade_host:
upgrades node to anchorlist, removes from greylist
protocol_addr:
recv Addr
do we have this node? no, add to greylist
refinery: promotes to whitelist, etc
The above scenario shows that removing the host in this case is redundant.
Previously we were shuffling hosts that we select to avoid trying to
connect to them in a deterministic order. However, this contradicts the
protocol of ordering hostlists by last_seen.
Instead, we should try to connect to addresses stored at the top of the
hostlists first, as they are most likely to be active.
The problem of multiple slots competing for the same peer should be
solved by the various locking checks in check_address_with_lock().
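The last_seen ordering can be sketched as follows (a minimal illustration;
`HostEntry`, `candidates` and the field names are assumptions, not the
project's actual types):

```rust
// Sketch: order the hostlist by last_seen, newest first, so that
// connection attempts start with the peers most likely to be active.

struct HostEntry {
    addr: String,
    last_seen: u64, // unix timestamp of last activity
}

/// Return candidate hosts ordered by freshness (most recent first).
fn candidates(mut hosts: Vec<HostEntry>) -> Vec<HostEntry> {
    hosts.sort_by(|a, b| b.last_seen.cmp(&a.last_seen));
    hosts
}

fn main() {
    let hosts = vec![
        HostEntry { addr: "tcp://a:2625".into(), last_seen: 100 },
        HostEntry { addr: "tcp://b:2625".into(), last_seen: 300 },
        HostEntry { addr: "tcp://c:2625".into(), last_seen: 200 },
    ];
    let ordered = candidates(hosts);
    // The most recently seen peer is tried first.
    assert_eq!(ordered[0].addr, "tcp://b:2625");
}
```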
Lilith now periodically pings nodes on its whitelist, updating their
last_seen field if they are active, otherwise downgrading them to
greylist.
This is to prevent Lilith from sharing inactive peers with other hosts
when it shares its whitelist.
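The ping-and-downgrade loop could look roughly like this (a sketch under
assumed names; `refresh_whitelist`, `ping_node` and the `Host` struct are
illustrative, not Lilith's actual API):

```rust
// Sketch: periodically ping every whitelisted peer. Responsive peers get
// a refreshed last_seen; unresponsive peers are downgraded to the
// greylist so we never share dead entries with other hosts.

struct Host {
    addr: String,
    last_seen: u64,
}

fn refresh_whitelist(
    whitelist: &mut Vec<Host>,
    greylist: &mut Vec<Host>,
    now: u64,
    ping_node: impl Fn(&str) -> bool,
) {
    let mut i = 0;
    while i < whitelist.len() {
        if ping_node(&whitelist[i].addr) {
            // Peer is alive: update its last_seen field.
            whitelist[i].last_seen = now;
            i += 1;
        } else {
            // Peer is unreachable: downgrade it to the greylist.
            let host = whitelist.remove(i);
            greylist.push(host);
        }
    }
}

fn main() {
    let mut wl = vec![
        Host { addr: "tcp://alive:2625".into(), last_seen: 10 },
        Host { addr: "tcp://dead:2625".into(), last_seen: 10 },
    ];
    let mut gl = Vec::new();
    // Pretend only the "alive" peer responds to pings.
    refresh_whitelist(&mut wl, &mut gl, 99, |a| a.contains("alive"));
    assert_eq!(wl.len(), 1);
    assert_eq!(wl[0].last_seen, 99);
    assert_eq!(gl[0].addr, "tcp://dead:2625");
}
```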
This commit introduces a new Session method called downgrade_host().
It gets called on two occasions:
* if we receive a stop signal on a channel (Inbound, Outbound, Manual sessions)
* if we cannot establish a connection (Outbound and Manual sessions)
This commit deprecates the "rejected" vector inside Outbound session
that prevented us from instantly reconnecting to an inactive host.
We can remove it in light of 4bf43ec521.
Hostlist sharing now works as follows:
* Start node
* Wait 5 seconds
* Start greylist refinery
* Send whitelisted entries to peers
This commit forces all whitelisted peers through the greylist each time
we start a node, thus ensuring that whitelisted peers are in fact
reachable.
Otherwise, our whitelist risks getting polluted with unreachable
peers. While the greylist has a refinery process to deal with this, the
whitelist has no such process.
downgrade_host() is costly (it reads through many vectors) and
redundant: it duplicates behavior that should be handled by the greylist
refinery.
What we need is a way to avoid instantly reconnecting to hosts that we
have just failed to connect to.
This commit introduces a simple fix: we write to a vector called
`rejected` and avoid trying to reconnect to its entries once a
connection has failed.
However, this is not ideal, as it means we will never reconnect to that
host (see TODO).
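A minimal sketch of this mechanism (names illustrative; as noted above,
entries are never cleared, so a rejected host is never retried):

```rust
// Sketch: record addresses that failed to connect in a `rejected`
// vector and skip them during future candidate selection.

struct OutboundSession {
    rejected: Vec<String>,
}

impl OutboundSession {
    /// Called when a connection attempt to `addr` fails.
    fn on_connect_failure(&mut self, addr: &str) {
        if !self.rejected.contains(&addr.to_string()) {
            self.rejected.push(addr.to_string());
        }
    }

    /// An address is only a candidate if it has not been rejected.
    fn is_candidate(&self, addr: &str) -> bool {
        !self.rejected.contains(&addr.to_string())
    }
}

fn main() {
    let mut session = OutboundSession { rejected: Vec::new() };
    session.on_connect_failure("tcp://down:2625");
    assert!(!session.is_candidate("tcp://down:2625"));
    assert!(session.is_candidate("tcp://up:2625"));
}
```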
Previously, we checked whether an entry was in any of the hostlists
(hostlist_contains) prior to storing it in the greylist.
This created a logic error, which we have now fixed.
The impact is that we will now likely have duplicate entries between the
greylist and the other lists; however, this shouldn't be a problem.
Make a standalone function that removes the address from the white- or
greylist and adds it to the anchorlist. Call it inside the Outbound and
Manual sessions.
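Such a helper might look like this (a sketch with a hypothetical name and
plain Vec<String> lists; the real session code operates on richer
hostlist types):

```rust
// Sketch: move an address out of the white- and greylist and into the
// anchorlist, in one standalone function shared by Outbound and Manual
// sessions.

fn upgrade_to_anchorlist(
    addr: &str,
    whitelist: &mut Vec<String>,
    greylist: &mut Vec<String>,
    anchorlist: &mut Vec<String>,
) {
    // Drop the address from the lower-priority lists.
    whitelist.retain(|a| a != addr);
    greylist.retain(|a| a != addr);
    // Add it to the anchorlist if it is not already there.
    if !anchorlist.iter().any(|a| a == addr) {
        anchorlist.push(addr.to_string());
    }
}

fn main() {
    let mut whitelist = vec!["tcp://peer:2625".to_string()];
    let mut greylist = vec!["tcp://peer:2625".to_string()];
    let mut anchorlist = Vec::new();
    upgrade_to_anchorlist("tcp://peer:2625", &mut whitelist, &mut greylist, &mut anchorlist);
    assert!(whitelist.is_empty());
    assert!(greylist.is_empty());
    assert_eq!(anchorlist, vec!["tcp://peer:2625".to_string()]);
}
```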
This commit also removes a redundant Result<()> from the
[hostlist]_store_or_upgrade() method and updates its usage.
We also add various debug statements and clean up.
* Don't fetch addresses with a lock in a nested way: first fetch the
address, then apply the locking checks
* Delete the blocking function get_active_connect_count. Instead, just use
slots rather than active connections to avoid deadlock on p2p channels.
TODO: there seems to be another deadlock that occurs more rarely. From
preliminary investigations, it doesn't seem to be related to p2p channels.
If we don't have any hosts on the anchorlist, choose from the whitelist.
If we don't have any on the whitelist, choose from the greylist.
The same principle applies to whitelist_fetch_with_schemes() (each
method has a different selection priority).
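The fallback order can be sketched as (illustrative names, simplified to
plain slices of addresses):

```rust
// Sketch: selection priority falls through anchorlist -> whitelist ->
// greylist. A method like whitelist_fetch_with_schemes would apply a
// different priority order using the same pattern.

fn fetch_address<'a>(
    anchorlist: &'a [String],
    whitelist: &'a [String],
    greylist: &'a [String],
) -> Option<&'a String> {
    anchorlist
        .first()
        .or_else(|| whitelist.first())
        .or_else(|| greylist.first())
}

fn main() {
    let anchor: Vec<String> = vec![];
    let white = vec!["tcp://w:2625".to_string()];
    let grey = vec!["tcp://g:2625".to_string()];
    // Anchorlist is empty, so we fall back to the whitelist.
    assert_eq!(fetch_address(&anchor, &white, &grey).unwrap().as_str(), "tcp://w:2625");
}
```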