Merge pull request #1 from libp2p/feat/readme-jupyterhub

add readme for jupyterhub workflow
Yusef Napora
2020-04-29 17:55:40 -04:00
committed by GitHub


# Shared Jupyterhub Workflow
This doc describes how we (the gossipsub-hardening team at Protocol Labs) have been running the tests in this repo.
## Connecting to the Jupyterhub server
We have an EC2 instance running Ubuntu 18.04, with [The Littlest Jupyterhub][tljh] installed. It doesn't
have a persistent domain or SSL cert, so we've been connecting to it using an SSH tunnel.
The current incantation is:
```shell
ssh -A -L 8666:localhost:80 jupyter-protocol@ec2-3-122-216-37.eu-central-1.compute.amazonaws.com
```
This will open a shell as the `jupyter-protocol` user and forward port 8666 on your local machine to port 80 on the
remote machine.
If your ssh key isn't authorized, ping @yusefnapora (or someone else with access, if I'm sleeping or something)
to get added to the `authorized_keys` file.
The `-A` flag enables ssh-agent forwarding, which will let you pull from this repo while you're shelled in, assuming
your SSH key is linked to your github account & you have read access to this repo. Note that the agent forwarding
doesn't seem to work if you're inside a tmux session on the remote host. There's probably a way to
get it working, but I've just been doing `git pull` outside of tmux.
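If you want to confirm agent forwarding is actually working before you pull, a quick sanity check (standard ssh/GitHub commands, nothing specific to this setup):
```shell
# list the keys available through the forwarded agent
ssh-add -l

# ask GitHub to authenticate you; prints a greeting on success
ssh -T git@github.com
```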
Once the tunnel is up, you can go to `http://localhost:8666`, where you'll be asked to sign in. Sign in as
user `protocol` with an empty password.
## Server Environment
There are some things specific to the environment that are worth mentioning.
The testground daemon is running inside a tmux session owned by the `jupyter-protocol` user. Running `tmux attach`
while shelled in should open it for you - you may have to switch to a different pane - it's generally running in
the first pane.
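If you don't use tmux much, the basics are enough here (standard tmux commands, nothing specific to this server):
```shell
tmux ls        # list sessions
tmux attach    # attach to the most recent session

# once attached: Ctrl-b then an arrow key moves between panes,
# and Ctrl-b d detaches without killing the session
```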
If for some reason testground isn't running (e.g. `ps aux | grep testground` comes up empty), you can start the
daemon with:
```shell
testground --vv daemon
```
The `testground` binary on the `$PATH` is a symlink to `~/repos/testground/testground`, so if you pull in changes
to testground and rebuild, the new binary should get picked up by the runner scripts, etc. automatically.
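Rebuilding after a pull looks roughly like this - a sketch that assumes the binary is built in the repo root with the standard Go toolchain, so adjust if the repo's own build instructions say otherwise:
```shell
cd ~/repos/testground
git pull
go build -o testground .   # rebuilds ./testground, which the symlink points at
```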
This repo is checked out to `~/repos/gossipsub-hardening`, and there's a symlink to it in `~/testground/plans`, so that
the daemon can find it and run our plans.
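If that symlink ever goes missing, recreating it is a one-liner (the plan directory name here is illustrative - match whatever name the daemon expects):
```shell
ln -s ~/repos/gossipsub-hardening ~/testground/plans/gossipsub-hardening
```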
## Cluster setup
The [testground/infra](https://github.com/testground/infra) repo is checked out at `~/repos/infra`. It contains
the scripts for creating and deleting the k8s cluster. The infra README has more detail and some helpful commands,
but here are some of the most relevant, plus some things to try if things break.
Before running any of the commands related to the cluster, you'll need to source some environment vars:
```shell
source ~/k8s-env.sh
```
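The exact contents of `~/k8s-env.sh` live on the server, but it boils down to exporting the variables the `kops` commands below rely on - something along these lines, with placeholder values:
```shell
# placeholder values - the real ones are in ~/k8s-env.sh on the server
export NAME=gossipsub.example.k8s.local
export KOPS_STATE_STORE=s3://example-kops-state-store
```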
To see the current status of the cluster:
```shell
kops validate cluster
```
If that command can't connect to the cluster VMs at all, it either means the cluster has been deleted,
or you need to export the kubectl config:
```shell
kops export kubecfg --state $KOPS_STATE_STORE --name=$NAME
```
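Once the config is exported, a quick way to confirm `kubectl` can actually talk to the cluster:
```shell
kubectl get nodes
```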
If `kops validate cluster` still can't connect to anything, someone probably deleted the cluster when they were
done with it. To create it:
```shell
cd ~/repos/infra/k8s
./install.sh cluster.yaml
```
This will take a few minutes, and the newly created cluster will only have 4 workers. To resize it:
```shell
kops edit ig nodes
```
and edit the `maxSize` and `minSize` params - set both to the desired node count. Then, apply the changes with
```shell
kops update cluster $NAME --yes
```
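To double-check the instance group without opening an editor, and to watch the new workers register as they boot:
```shell
kops get ig nodes -o yaml   # shows the current minSize/maxSize
kubectl get nodes -w        # watch nodes appear and go Ready
```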
After a few minutes, `kops validate cluster` should show all the instances up, and the cluster will be ready.
## Running Tests
Everything in the [main README](./README.md) should apply when running tests on the server, but you can ignore
the parts that tell you to run `jupyter notebook` manually.
When you log into the Jupyterhub server, you should see a file browser interface. Navigate to
`repos/gossipsub-hardening/scripts` and open the `Runner.ipynb` notebook.
There are a bunch of `config-*.json` files next to the runner notebook - these are a good starting point for
configuring test variations - the `config-1k.json` is the baseline test described in the report.
At the moment, some of the config json files may still be targeting the `feat/hardening` branch and will give an
error right away if you run them - change the branch in the Pubsub config panel to `master` and it should be all good.
If you want to target "vanilla" gossipsub (v1.0), you can set the branch to `release-v0.2` and uncheck the
"target hardened API" checkbox in the UI.
After a successful run, you should see the path to the analysis notebook printed. Navigate there with the Jupyter
file browser to run the analysis notebook and generate the charts, etc.
## Troubleshooting
Sometimes, especially if you're running with lots of instances, `weave` (the thing that manages the k8s data network)
will give up the ghost, and one or more test instances will get stuck and be unable to communicate with the others.
If you never see the `All networks initialized` message in the testground output, or if it takes several minutes to
get to `All networks initialized` after all instances are in the Running state, it's likely that you've hit this issue.
If weave has failed, you may see some weave pods stuck in a "not ready" state if you run
```shell
kops validate cluster
```
You can try forcibly removing the stuck weaves, although I can't find the magic commands for that at the moment.
What I've been doing instead is scaling the cluster down to a single worker and then back up, to start with a clean
slate. Scaling down to zero would probably be better, now that I think of it...
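If you do want to try the forcible removal, the usual approach with weave (not verified on this cluster, so treat it as a sketch) is to delete the stuck `weave-net` pods and let the DaemonSet recreate them:
```shell
# find the weave pods and see which ones aren't ready
kubectl -n kube-system get pods -l name=weave-net -o wide

# delete a stuck one; the DaemonSet schedules a fresh replacement
kubectl -n kube-system delete pod <weave-net-pod-name>
```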
If you do hit the weave issue, you can try lowering the # of connections for the attacker, having fewer attackers,
or packing the attacker peers less tightly into their containers by adjusting the number of attacker nodes and
the number of attack peers per container. Spreading the attackers out over more containers may help, but you may
also need to resize the cluster and add more worker VMs.
If you don't have enough resources, testground will fail right away and helpfully tell you what the limit is.
You can control the CPU and RAM allocated for each test container by editing `~/testground/.env.toml` and restarting
the testground daemon.
[tljh]: http://tljh.jupyter.org/en/latest/