Carving a little island on the internet - Part 2

Posted on Apr 20, 2024

This is the second part of the little selfhosting adventure I embarked on when creating this blog.

We’ve already covered how to create a website with Hugo, set up an instance at a cloud provider, then spin up nginx to serve that website. It’s now time to bring it to The Web™.

Oh and we’ll also make it a breeze to push content out.

Open the (HTTP) floodgates

The network configuration of our instance is spread over multiple components. One of them is the Virtual Cloud Network (or VCN). The subnet we’ve configured and the route table that is set up by default will remain as is, but we’ll have to work on the ingress rules. These define what incoming traffic is authorized. Out of the box, the main thing to note is that SSH traffic is allowed on port 22.

We will start by allowing HTTP and HTTPS traffic. For that we simply add two rules that allow TCP traffic from IP 0.0.0.0/0, which is the CIDR notation for “any IP”, on the standard HTTP and HTTPS ports, respectively 80 and 443.
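In the subnet’s security list, that boils down to two stateful ingress entries along these lines (a sketch of the relevant fields as they appear in the OCI console; the exact labels may differ slightly):

Source CIDR: 0.0.0.0/0    IP Protocol: TCP    Destination Port Range: 80
Source CIDR: 0.0.0.0/0    IP Protocol: TCP    Destination Port Range: 443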

Ideally we would only accept HTTPS traffic, but allowing HTTP will let us test our setup for now, until we grab a certificate.

The VCN is not the only place we need to configure: there’s also the instance’s own firewall. We’ll have to use iptables to add rules allowing traffic on certain ports[1]. As a disclaimer, I’m in no way an expert in this area, but what we’ll do is relatively basic.

We can run sudo iptables -L first to check which rules are in place. On our instance, some rules have already been configured, among which a couple noteworthy ones:

Chain INPUT (policy ACCEPT)
target     prot opt source      destination
ACCEPT     all  --  anywhere    anywhere        state RELATED,ESTABLISHED
...
ACCEPT     tcp  --  anywhere    anywhere        state NEW tcp dpt:ssh
REJECT     all  --  anywhere    anywhere        reject-with icmp-host-prohibited

First, we have to note that these rules are grouped into chains, and applied in the order they’re listed. In this case, we can see that the first rule allows all traffic under the condition that a connection has already been established, or that it’s a new connection associated with an existing one (as per the RELATED state). Extra rules might follow, but let’s stop at the next one I’ve listed: it allows TCP traffic, for all sources and destinations, permitting new connections on the destination port associated with ssh[2]. Then we have a catch-all REJECT rule which rejects everything. More specifically, and that’s where the ordering within the chain kicks in, “everything which hasn’t been explicitly ACCEPTed so far”.

Let’s add a rule to allow HTTP traffic in. Following the chain behaviour I’ve just described, we have to take one thing into consideration: the rule must come before the catch-all REJECT rule. Been there, done that, wondered why there was no traffic coming in despite adding the rule. This is a potential reason.
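To double-check where that REJECT rule sits before inserting anything, we can list the chain with rule indices (the -n flag simply skips DNS lookups):

sudo iptables -L INPUT --line-numbers -n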

Here we go:

sudo iptables -I INPUT 3 -m state --state NEW -p tcp --dport 80 -j ACCEPT

There’s a bit to unpack here. First we specify -I INPUT to insert into the INPUT chain. We then specify a number (here 3), which is the index at which we want to insert the rule. Remember, this has to be before the rejection rule[3].

Then we specify a criterion that the traffic has to match with -m. Here we match on the connection state, specifically NEW connections.

Next come the protocol we want to allow with -p tcp and the destination port with --dport 80.

Finally we specify where to “jump” with the -j option. We have the option of jumping to another chain, or to specific built-in targets which decide immediately what to do with the packet, such as the ACCEPT target which simply allows it.

We can do the same with port 443 to allow HTTPS traffic, and that should be sufficient for the firewall. If the default OUTPUT policy is not set to ACCEPT, or if there is a rule akin to the catch-all REJECT rule in the OUTPUT chain, we also need to allow the traffic for established connections to go out. We can do so with the following command:

sudo iptables -A OUTPUT -p tcp --sport 80 -m state --state ESTABLISHED -j ACCEPT

Here we specify with -A that we want to append the rule to the OUTPUT chain, and the other notable change is that we are now matching on the source port with --sport 80.
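For reference, the HTTPS counterparts would look something like this (a sketch; the index 4 assumes the HTTP rule inserted above still sits at position 3, right before the REJECT rule):

sudo iptables -I INPUT 4 -m state --state NEW -p tcp --dport 443 -j ACCEPT
sudo iptables -A OUTPUT -p tcp --sport 443 -m state --state ESTABLISHED -j ACCEPT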

Once the firewall is set up, and the ingress rules have been configured on the VCN, nginx should be able to receive incoming HTTP requests and serve the responses. Once again, we can test that from another computer with

curl -H 'Host: example.com' <instance-ip>:80

and we should get the HTML response in return[4].

Cool but I don’t want that lousy HTTP, I want s.e.c.u.r.i.t.y.

Setting up HTTP is pretty straightforward, but HTTPS is the standard today: it lets clients check they are actually communicating with the website they meant to reach, and it encrypts the traffic along the way. For a personal blog this might be a bit overkill, but it’s still better to serve over HTTPS.

This is a pretty complicated topic, but we’ll go through the basics of setting it up for this project.

In order to enable HTTPS, we need a handful of things: a domain name, and a certificate from a CA (certificate authority). The domain name is essentially an alias for our public IP, and the certificate is a cryptographic document that (very roughly) lets clients validate that the server they are communicating with is actually the one they want to access, and establish an encrypted connection with it.

I went with a simple domain name over at OVH, then relied on Let’s Encrypt for the certificate. They’re a non-profit that provides Domain Validation certificates for free, with a very easy process. Certificate issuance and server configuration can be automated with the Electronic Frontier Foundation’s Certbot tool.

On Ubuntu, we simply need to install Certbot through snapd[5] with the --classic flag, then symlink it to a location where it can be picked up via the PATH variable, e.g. /usr/bin/certbot.
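In practice, that boils down to something like the following (a sketch based on Certbot’s own install instructions; adjust the symlink source if your snap binaries live somewhere other than /snap/bin):

sudo snap install --classic certbot
sudo ln -s /snap/bin/certbot /usr/bin/certbot

Once that’s done, we get a certificate and install it with the following command: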

sudo certbot --nginx

and test that automatic renewal[6] will work with

sudo certbot renew --dry-run
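We can also check that a renewal timer is actually in place (assuming a systemd-based setup; the exact timer name depends on how Certbot was installed):

systemctl list-timers | grep -i certbot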

If all went well, certbot will have modified our nginx configuration, and we can now access the website through the HTTPS protocol.
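If we’re curious, we can peek at what Certbot changed. The resulting configuration ends up looking roughly like this (a sketch assuming a domain called example.com; the exact directives and certificate paths depend on the Certbot version and the original configuration):

server {
    server_name example.com;
    # ...the directives from the original server block stay here...

    listen 443 ssl;                                                    # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;   # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem; # managed by Certbot
}

server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;  # redirect plain HTTP to HTTPS
}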

Removing one manual step from the chain

We’ve got pretty much everything we need, and we could make do with what we have now. However, we’re developers, and by nature we are lazy to some degree. Lazy not as in “I don’t want to do this manually every time” but rather “I don’t want to risk screwing up each time I do that same manual task”.

Let’s add some automation. I’m using GitLab, so we’ll set up a GitLab CI/CD pipeline to deploy our website any time we push a commit onto the main branch of our repository. The concepts are fairly simple, and should be very easy to transpose to CircleCI or GitHub Actions, whatever suits your environment.

Before we write our .gitlab-ci.yml, we need to configure a few variables in our project’s settings. This is important to avoid committing any sensitive parameters, such as secrets or user logins, directly into the repository, which would seriously undermine the security of our project. To do so, we go to the project, then Settings > CI/CD, and expand the Variables section. Here we’ll create four variables:

  • DEPLOYMENT_TARGET_ADDRESS: the IP of our machine; since it might change in the future, it’s handy to only have to update it in the settings rather than commit a change
  • DEPLOYMENT_TARGET_USER_LOGIN: the user we’ll connect as on the remote host
  • SSH_PRIVATE_KEY: the private key we’ll use to ssh into the host, which should have its corresponding public key installed on the server
  • SSH_KNOWN_HOSTS: the known host information that we will store to ensure that we are connecting to the right host

The first two are set up as standard variables, meaning they will be substituted with the string they contain. For the latter two, we can use the special “File” type, which means the variable will be replaced by the path to a temporary file containing its content.

Once we’ve setup these variables, we can write our configuration file:

image: golang:latest

variables:
  DEPLOYMENT_TARGET_ADDRESS:
    description: "The remote target address to deploy to"
  DEPLOYMENT_TARGET_USER_LOGIN:
    description: "The username with which to log into the deployment target"
  SSH_PRIVATE_KEY:
    description: "The private key to access the remote"
  SSH_KNOWN_HOSTS:
    description: "The known host entry for the remote"

before_script:
  - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client git -y )'
  - 'which rsync || (apt-get update -y && apt-get install rsync -y )'
  - eval $(ssh-agent -s)
  - chmod 400 "$SSH_PRIVATE_KEY"
  - ssh-add "$SSH_PRIVATE_KEY"
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
  - cp "$SSH_KNOWN_HOSTS" ~/.ssh/known_hosts
  - chmod 644 ~/.ssh/known_hosts
  - wget https://github.com/gohugoio/hugo/releases/download/v0.124.1/hugo_0.124.1_Linux-64bit.tar.gz
  - tar -xzf hugo_0.124.1_Linux-64bit.tar.gz
  - cp hugo $GOPATH/bin/hugo
  - hugo version

workflow:
  rules:
    - if: $CI_COMMIT_TAG
      when: never
    - if: $CI_COMMIT_BRANCH == 'main'

build_and_deploy:
  script:
    - hugo
    - rsync -avz --delete public/ "$DEPLOYMENT_TARGET_USER_LOGIN@$DEPLOYMENT_TARGET_ADDRESS:~/www"

Let’s break it down.

After listing our variables, we set up the environment in a before_script so that we can run hugo and access our remote host. To do so we ensure that the ssh agent is installed, as well as rsync.

before_script:
  - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client git -y )'
  - 'which rsync || (apt-get update -y && apt-get install rsync -y )'

We then load our private key into the ssh agent to be able to access our remote, and set up the known_hosts file with our remote’s host key information, to make sure we can connect to the host.

  - eval $(ssh-agent -s)
  - chmod 400 "$SSH_PRIVATE_KEY"
  - ssh-add "$SSH_PRIVATE_KEY"
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
  - cp "$SSH_KNOWN_HOSTS" ~/.ssh/known_hosts
  - chmod 644 ~/.ssh/known_hosts
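As an aside, the value to store in SSH_KNOWN_HOSTS can be gathered beforehand from a machine we trust, for example with ssh-keyscan (replace <instance-ip> accordingly, and ideally double-check the keys against the ones on the server itself):

ssh-keyscan <instance-ip>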

Finally, we install Hugo from the project’s public GitHub releases.

  - wget https://github.com/gohugoio/hugo/releases/download/v0.124.1/hugo_0.124.1_Linux-64bit.tar.gz
  - tar -xzf hugo_0.124.1_Linux-64bit.tar.gz
  - cp hugo $GOPATH/bin/hugo
  - hugo version

We then configure the workflow rules so that tag pipelines never run and the pipeline only triggers when there is a commit on the main branch (anything that matches no rule doesn’t run either).

workflow:
  rules:
    - if: $CI_COMMIT_TAG
      when: never
    - if: $CI_COMMIT_BRANCH == 'main'

All that remains is to define a simple job, build_and_deploy, which renders our website with hugo and uses rsync to copy the website content over SSH to our remote host.

build_and_deploy:
  script:
    - hugo
    - rsync -avz --delete public/ "$DEPLOYMENT_TARGET_USER_LOGIN@$DEPLOYMENT_TARGET_ADDRESS:~/www"

We notably use the --delete option of rsync to remove files that are present on the server but not in the public folder created by Hugo, to keep things clean.
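When in doubt, the same command with the -n (--dry-run) flag, run from a local machine, gives a preview of what would be transferred and deleted without touching anything (placeholders to replace with your own values):

rsync -avzn --delete public/ <user>@<instance-ip>:~/www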

And voilà! Now, every time we push to the main branch, the pipeline will run and update the website’s content on the remote, where it is served immediately.

Wrapping it up

There’s a lot that I’ve only skimmed over, or haven’t even started to explain, in the steps we went through, but this gives a good picture of the basics needed to create a simple self-hosted setup.

I also haven’t mentioned some extra steps I took, notably related to security. For example, I’ve changed the port I use to ssh in to a custom one, in order to avoid 98% of the low-effort malicious traffic that’s out there. As is customary, I suppose, when attempting that for the first time, and despite trying to make sure everything was correctly configured, I of course locked myself out of my instance. I had misconfigured the firewall by adding the rule accepting traffic on my custom port after the catch-all REJECT rule, and I didn’t first make sure I could connect on the new port before changing the sshd config. I also didn’t realize at first that you can specify multiple ports to listen on in the sshd configuration.
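For the record, that last point looks something like this in /etc/ssh/sshd_config (a sketch; 2222 is just a placeholder for the custom port), and it makes it possible to keep the old port open until the new one is confirmed to work:

Port 22
Port 2222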

There’s also a lot of extra steps I could take to harden security and reduce the fragility of the setup, such as:

  • having a fixed public IP on the Oracle Cloud side which the VCN then routes to the instance; currently I use the instance’s public IP, which is subject to change if I need to kill the instance, or if Oracle reclaims it and forces me to get a new one
  • bundling all the instance setup steps into a nice script, which would greatly simplify setting up a new instance should the current one get bricked (which already happened once when I changed the ssh port and missed a step)

Finally, I have a couple of things in mind that I would like to add to this setup, such as analytics. I’ve toyed with self-hosted Plausible and Umami so far, but haven’t reached anything conclusive yet. The easy setups for Plausible and Umami rely on Docker, and the resource consumption I’ve observed seems too high for the small instance I have for now. I’ll probably need to look into a lighter setup if I want to use these.

It still has many flaws, but for now, I have my little internet island, from where I can easily toss little messages in a bottle into the sea.


  1. I know there are friendlier ways of configuring a firewall on Ubuntu, but this was an opportunity to get to know iptables a bit, and on top of that, some of them, such as ufw, are known to be somewhat tricky to use with Oracle Cloud instances. ↩︎

  2. We can always specify raw port numbers; when listing rules, iptables will display an alias if it’s a standard port, e.g. port 22 here shows up as ssh. ↩︎

  3. To keep it simple, it can just be the index of the rejection rule, pushing it down and inserting the new rule right before. One can check rule indices with sudo iptables -L --line-numbers to make sure. ↩︎

  4. I insist. iptables is an extremely powerful and complex tool. While this made my setup work, and I believe it’s rather sane, I’m in no way an expert and there’s an entire rabbit hole of complexity to dive into, so please do not underestimate it. ↩︎

  5. On the topic of Ubuntu not being great in low-resource environments, the fact that it comes with snap, which is sometimes the only package manager carrying certain packages, is frustrating. snapd sometimes goes a bit wild in terms of resource consumption, which is far from ideal on our modest instance. ↩︎

  6. The certbot package comes with a cron job or systemd timer which ensures that the certificates we were issued get renewed before they expire. This spares us from suddenly ending up with a website that isn’t accessible via HTTPS. ↩︎