Highly-available Chef cluster

Setting up a highly available Chef cluster in AWS

Avi Friedman
Innovid

--

Here at Innovid, a leading video marketing platform where we serve 1.3 million hours of video per day, we’ve used Chef to manage our AWS instances for years, but only recently transitioned it to a highly available cluster. To do so, we followed the comprehensive documentation provided by the folks at Chef.
Our cluster architecture was built with:

  • 3 backend nodes
  • 2 frontend nodes
Cluster topology (adapted from: https://docs.chef.io/install_server_ha.html)

After spinning up these 5 instances in different availability zones within a single VPC (in 3 different subnets, one per backend node; note that according to the chef-ha documentation, cross-region deployment is currently not supported), we attached them to 2 security groups (described here in YAML form, which we parse in our Troposphere+CloudFormation tool to create the security groups in AWS):

YAML with chef-cluster security group configuration

Note that we have 3 SGs in total: the third SG is for the ALB to which we connect our 2 frontends.
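For illustration, here’s a minimal sketch of equivalent rules using the AWS CLI directly instead of Troposphere (the security-group IDs are placeholders, and the backend ports (2379 for etcd, 5432 for PostgreSQL, 7331 for leaderl, 9200 for Elasticsearch) are taken from the Chef HA documentation; adjust to your own setup):

# Placeholder security-group IDs, replace with your own.
BACKEND_SG=sg-0aaa111
FRONTEND_SG=sg-0bbb222
ALB_SG=sg-0ccc333

# Backends talk to each other, and the frontends talk to the backends,
# on the ports from the Chef HA docs: etcd, PostgreSQL, leaderl, Elasticsearch.
for port in 2379 5432 7331 9200; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$BACKEND_SG" --protocol tcp --port "$port" --source-group "$BACKEND_SG"
  aws ec2 authorize-security-group-ingress \
    --group-id "$BACKEND_SG" --protocol tcp --port "$port" --source-group "$FRONTEND_SG"
done

# The ALB forwards HTTPS traffic to the 2 frontends.
aws ec2 authorize-security-group-ingress \
  --group-id "$FRONTEND_SG" --protocol tcp --port 443 --source-group "$ALB_SG"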

Important note: we started by creating specific egress (outbound) rules, as per the documentation. That worked for quite a while, but after a couple of months we started seeing issues with the cluster. After some digging, and great help from the Chef Slack community (which we highly recommend you join here), we found that the issues were caused by the clocks on the backend nodes drifting out of sync with each other (the Ubuntu NTP daemon couldn’t sync because it needs outbound UDP on port 123). We’ve since opened the instances to all outbound traffic.
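If you’d rather keep egress restricted than open it up entirely, the specific rule (plus a quick sanity check on each backend) would look roughly like this; the security-group ID is a placeholder:

# Allow outbound NTP (UDP 123) so the backend clocks can stay in sync.
aws ec2 authorize-security-group-egress \
  --group-id sg-0aaa111 --protocol udp --port 123 --cidr 0.0.0.0/0

# On each backend node, verify the clock is actually syncing.
ntpq -p
timedatectl status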

Next, we wrote the following script to create the cluster programmatically.
In order for it to work you’ll need:

  • All nodes must be in the same VPC (to allow access from each node to the rest). They can be in different subnets, as long as routing is enabled between the subnets.
  • A configured AWS CLI, which the script uses to extract instance info.
  • Nodes should have a “Name” tag, which the script uses to find each node and extract its IP (tag: Name=some-name). You can change the names in the script.
  • Chef server version > 12.14 (the script uses 12.15.8). Earlier versions require some extra .pem files to be copied from the first frontend to the second before configuring it; they would still work for the cluster of backends + the first frontend.

You can find the script below or here. You can change it to fit your needs, or even just use it as a step-by-step guide to create the cluster.
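To give a sense of what the script automates, here is a rough, abbreviated outline of the main steps (the tag name, IPs and file names are illustrative; the full procedure is in the Chef HA documentation):

# Look up a node's private IP by its Name tag (the script does this with the AWS CLI).
aws ec2 describe-instances --filters "Name=tag:Name,Values=chef-backend-1" \
  --query "Reservations[].Instances[].PrivateIpAddress" --output text

# On the first backend: bootstrap the cluster.
chef-backend-ctl create-cluster

# Copy /etc/chef-backend/chef-backend-secrets.json to the other 2 backends, then on each of them:
chef-backend-ctl join-cluster <first.backend.ip> -s /etc/chef-backend/chef-backend-secrets.json

# On one of the backends: generate a chef-server.rb for each frontend...
chef-backend-ctl gen-server-config <frontend.fqdn> -f chef-server.rb.<frontend.fqdn>

# ...copy it to the frontend as /etc/opscode/chef-server.rb, then on the frontend:
chef-server-ctl reconfigure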

Issues we came across

In an earlier version of chef-backend (we used 1.3.2) there was apparently some issue with adding and removing nodes.

We wanted to replace all the nodes in the cluster, so we started off by adding 3 more nodes, then wanted to start removing old nodes. As soon as we removed the nodes, the cluster went haywire. The Elasticsearch service on all nodes started failing with weird messages in the log:

2017-02-15_15:13:22.62732 [DEBUG][action.admin.cluster.health] [c1e85a03d38320f23e31e9c1c562f013] no known master node, scheduling a retry
2017-02-15_15:13:22.60354 [DEBUG][action.admin.cluster.health] [c1e85a03d38320f23e31e9c1c562f013] timed out while retrying [cluster:monitor/health] after failure (timeout [2s])
2017-02-15_15:13:22.60403 [WARN ][rest.suppressed          ] /_cluster/health Params: {timeout=2s}
2017-02-15_15:13:22.60404 MasterNotDiscoveredException[null]

and the Elasticsearch service showed as offline. After many attempts to understand what was going on, the help came (again) from the Chef community. A bug in Chef (which seems to have been fixed, since this hasn’t happened to us in the newer version) caused the value of discovery.zen.minimum_master_nodes (initially 1) in the Elasticsearch configuration to increase when nodes were added, but not to decrease when nodes were removed.
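If you suspect you’ve hit the same issue, you can check the value Elasticsearch is actually running with before applying the fix described below (9200 is the port the chef-backend Elasticsearch listens on):

# Look for discovery.zen.minimum_master_nodes under "persistent" in the output.
curl -s 'localhost:9200/_cluster/settings?pretty'

# Compare it with the number of backends that are actually joined.
sudo chef-backend-ctl cluster-status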

To fix that, we did the following:

  1. Rejoin the cluster from another node (to have a 4th node). This will probably fail, due to Elasticsearch not being able to connect; however, the node will still be a part of the cluster (you can check that 4 nodes exist with sudo chef-backend-ctl status).
  2. Run the following command on one of the nodes (this will reduce minimum_master_nodes back to 2):
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent" : {"discovery.zen.minimum_master_nodes" : 2}}'

Now, remove the 4th node:

chef-backend-ctl remove-node <4th.node.ip.address>

Note: we used the IP of the node (in our case, the private IP, but it depends on how you created the cluster). The node will not show up with its name in cluster-status because it is only partly joined.

Next, verify that the services are OK and that the 3 nodes are in a healthy state by running sudo chef-backend-ctl status.
