One of the most difficult problems in running experiments across multiple remote compute resources is doing so securely. The simplest approach is to spawn the compute hosts and assign a floating IP address to each one. This way, you can access all of your hosts over the Internet.
What is wrong with the simplest approach?
Certain experiments may expose dashboards or other services, such as databases holding sensitive data and permissions, to the Internet. Other experiments require old, outdated, or unmaintained software, either for reproducibility or because community support has lapsed. When you expose a service to the Internet, you also expose it to hackers, who will find it and attempt to compromise it almost immediately. Our hardware is very powerful, which makes it an extremely valuable target. As has happened to multiple projects in the past, your server could be compromised by a malicious party, and your valuable time could be wasted as all your nodes have to be shut down in response. Such exposure can result from carelessness or even a simple misunderstanding. It’s better to eliminate the possibility altogether by reducing the attack surface, which means not exposing your entire project directly to the open Internet. We’ve talked about this before in our security best practices blog; we highly recommend checking it out for more security tips.
The other obvious issue with reserving a floating IP address for every host is that you have to reserve all those IP addresses! An unfortunate reality is that there is a finite number of public IP addresses available across the entire Internet. Back when the Internet was just beginning to take shape, the ~4.3 billion IPv4 addresses seemed like plenty to go around for the entire world. Now we know that is not enough, and the world has already run out! Chameleon owns and is able to lease a very small subset of the global IP address space, and we ask that you use these addresses sparingly and release them when you no longer need them. If your experiment requires multiple compute nodes, wouldn’t it be better to use a single public IP address instead of one for each node?
What is a bastion host and why should you use it?
If your resources are a precious prize sought after by hackers, then there needs to be an effective way to limit access to them. In medieval warfare, the inside of a fortress was protected by walls, guarded by a bastion. In securing your experiments, a bastion host functions in the same way. A bastion host is a single host which exposes itself to the Internet and allows other, more valuable or more vulnerable hosts to be protected behind it. The goal of the bastion host is to limit the attack surface by exposing only one highly secure service (SSH) to the Internet. That service also uses a single public IP address, and it gives you access to a secure tunnel over which you can communicate with your other nodes. Our save the planet blog post tells you exactly how.
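To make the idea concrete, here is a minimal sketch of reaching a worker through a bastion using OpenSSH's jump-host option (-J). The addresses, the login name "cc", and the helper name bastion_ssh_command are all assumptions chosen for illustration; substitute the values from your own lease. The function only builds the command, so nothing connects until you choose to run it.

```python
import shlex


def bastion_ssh_command(bastion_ip, worker_ip, user="cc", remote_command=None):
    """Build an ssh argv that reaches worker_ip by jumping through bastion_ip.

    Only the bastion needs a public (floating) IP; the worker is addressed
    by its private IP, which is resolved from the bastion's side.
    The default user "cc" is an assumption; use your image's login name.
    """
    argv = ["ssh", "-J", f"{user}@{bastion_ip}", f"{user}@{worker_ip}"]
    if remote_command:
        # Optional command to run on the worker instead of opening a shell.
        argv.append(remote_command)
    return argv


if __name__ == "__main__":
    # Hypothetical addresses: one floating IP, one private worker IP.
    cmd = bastion_ssh_command("192.0.2.10", "10.0.0.11", remote_command="hostname")
    print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would open the connection. Keeping command construction separate from execution makes it easy to inspect what will be run first.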
To help you set up your first bastion host environment, we’ve published an experiment pattern on Trovi which automatically deploys a basic bastion/worker cluster that can be easily tuned and extended for your experiment’s needs! At the end of the notebook, it provides a list of remote connections which can be used to securely dispatch jobs to all your workers, which is a great jumping off point for actual experimentation.
Are my capabilities limited behind a bastion host?
Not at all! Since you’re connecting to your workers via SSH, you have full shell access to your hosts. If you want to access a “web-facing” service running on your nodes, you can do that as well via an SSH tunnel. You can also simplify this process with excellent free tools like sshuttle. This is especially useful for accessing things like databases, or protocols that are insecure by default, such as VNC. There is basically no reason not to use an SSH tunnel unless you are researching something which specifically prevents you from doing it.
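As a rough illustration of such a tunnel, the sketch below builds an OpenSSH local port-forwarding command (-L) that makes a service on a worker reachable at a port on your own machine, with the traffic carried through the bastion. The IP addresses, the port number, the "cc" login name, and the helper name tunnel_command are hypothetical placeholders.

```python
def tunnel_command(bastion_ip, worker_ip, remote_port, local_port=None, user="cc"):
    """Build an ssh argv that forwards a local port to a service on a worker.

    -N tells ssh not to run a remote command (tunnel only).
    -L bind:local_port:host:remote_port forwards connections made to the
    local port over the SSH session; the bastion then connects to
    worker_ip:remote_port on the private network.
    """
    if local_port is None:
        local_port = remote_port
    # Bind to loopback so the forwarded port is not exposed to your own LAN.
    forward = f"127.0.0.1:{local_port}:{worker_ip}:{remote_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{bastion_ip}"]


if __name__ == "__main__":
    # Hypothetical example: a dashboard listening on port 8888 on a worker.
    print(" ".join(tunnel_command("192.0.2.10", "10.0.0.11", 8888)))
```

With a tunnel like this running, pointing a browser at the chosen port on localhost would reach the worker's service, and the service itself never needs to listen on a public address.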
Can I do it without any extra work?
We published a tutorial notebook so that you don’t have to figure out how to set this up on your own. The simple explanation is that your project has one public-facing host (the bastion host) which accepts incoming connections via SSH. The SSH protocol is then able to use that connection to tunnel securely to your worker nodes, which sit behind a private network. Neat! With this notebook, you can tweak parameters, copy, cut, and paste to implement a bastion host architecture in your own experiments with ease. Please try it out for yourself, and as always, feel free to reach out for help if anything goes wrong.
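If you would rather wire this up by hand, the same architecture can be described in an SSH client configuration file. The sketch below is not taken from the notebook; it is one possible way to generate such a configuration, using OpenSSH's ProxyJump directive. The host aliases, IP addresses, "cc" login name, and the helper name render_ssh_config are assumptions for illustration.

```python
def render_ssh_config(bastion_ip, workers, user="cc"):
    """Render ssh_config text for one bastion and any number of workers.

    workers maps a host alias (e.g. "worker-1") to that worker's private IP.
    Each worker entry uses ProxyJump so that "ssh worker-1" transparently
    hops through the bastion, the only host with a public IP.
    """
    lines = [
        "Host bastion",
        f"    HostName {bastion_ip}",
        f"    User {user}",
        "",
    ]
    for alias, private_ip in workers.items():
        lines += [
            f"Host {alias}",
            f"    HostName {private_ip}",
            f"    User {user}",
            "    ProxyJump bastion",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical cluster: one floating IP and two private worker IPs.
    workers = {"worker-1": "10.0.0.11", "worker-2": "10.0.0.12"}
    print(render_ssh_config("192.0.2.10", workers))
```

Appending the output to ~/.ssh/config would let you reach each worker by its alias, while still consuming only one public IP address for the whole cluster.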