Deploying Node.js at Scale

Vlad Cealicu
CCData
4 min read · May 2, 2019

--

This is a more in-depth explanation of the How do we deploy? chapter from part three of our series on how and why our system works (The Commercial API Journey). I also want to give a quick overview of what we have learned over the five years we have been running Node.js at CryptoCompare.

Make sure you use a service like forever to keep your servers up

We have always used forever to run our Node.js servers, though we only recently started using specific PID file names. forever saved us many times while we were still learning to listen for error events on Redis, Redis Cluster, Postgres and our other external dependencies: an unhandled error event crashes the whole process, and forever restarts it.
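
As a minimal sketch of what those listeners look like (using the callback-style node-redis and pg clients; the hosts and names here are illustrative, not our production setup):

const redis = require('redis');
const { Pool } = require('pg');

const redisClient = redis.createClient({ host: '127.0.0.1', port: 6379 });
const pgPool = new Pool({ connectionString: process.env.DATABASE_URL });

// Without an 'error' listener, a dropped connection emits an error
// that crashes the whole process. Log it and keep serving instead.
redisClient.on('error', function (err) {
  console.error('redis error', err);
});

// The pg pool emits 'error' for idle clients that lose their connection.
pgPool.on('error', function (err) {
  console.error('postgres pool error', err);
});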

Don’t use more NPM packages than you really need

Even with one of the most used APIs in the cryptocurrency space, we don't rely on a lot of external NPM packages. We only use Redis, Redis Cluster, Postgres, async and forever. We ended up building our own code for registering endpoints, user authentication and rate limiting.
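
To give an idea of how small such a layer can be, here is a hedged sketch of a fixed-window rate limiter on top of Redis. This is the generic INCR/EXPIRE pattern, not our actual implementation; the key format and parameters are made up:

const redis = require('redis');
const client = redis.createClient();

// Fixed-window rate limiter: allow at most `limit` calls per `windowSeconds`
// for a given API key. The count lives in Redis, so all API servers share it.
function isAllowed(apiKey, limit, windowSeconds, callback) {
  // Every call in the same window hits the same key.
  const windowId = Math.floor(Date.now() / 1000 / windowSeconds);
  const key = 'rate:' + apiKey + ':' + windowId;
  client.incr(key, function (err, count) {
    if (err) return callback(err);
    // First hit in this window: make the key expire with the window.
    if (count === 1) client.expire(key, windowSeconds);
    callback(null, count <= limit);
  });
}

Calling isAllowed('some-key', 100, 60, ...) would then allow 100 calls per minute per key, with the counter resetting automatically when the Redis key expires.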

SSL termination + gzipping is best left to Nginx

When we started, we only had two load-balanced API servers, and we ran the gzipping and the SSL termination inside our Node.js app.

As our API usage grew, we noticed CPUs starting to spike. After ignoring/building around it for a few months, we decided to start looking at ways to save CPU cycles. We started by profiling our API servers and quickly noticed that a huge chunk of the CPU was used on gzipping and SSL termination. After a bit of exploratory work, we decided to put Nginx in front of all our API servers. By this point, we were serving around 50 calls a second from each API and we had around 10 one-core API servers that were quite high on CPU usage.

As soon as we added Nginx in front of them, average CPU usage dropped from 60%–70% to just 10%–20%. We were able to scale to over 1500 calls a second per server, with Node.js using around 60% CPU and Nginx at about 20%. We are now running 25 servers that each handle between 200 and 1000 calls a second, with plenty of CPU to spare and amazing response times.
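
For reference, the relevant part of an Nginx config looks roughly like this (a hedged sketch: the domain, certificate paths and upstream port are placeholders, not our production config):

server {
    listen 443 ssl;
    server_name api.example.com;

    # Nginx handles the TLS handshake, so Node.js never sees raw SSL traffic.
    ssl_certificate     /etc/nginx/ssl/api.example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/api.example.com.key;

    # Nginx compresses responses, so the app can stop gzipping.
    gzip on;
    gzip_types application/json;

    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}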

We measure the internal response time by simply scheduling a callback on the event loop and timing how long it takes to be executed.
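
A minimal sketch of that measurement (the five-second sampling interval and the logging are illustrative):

// Schedule a callback and measure how long the event loop takes to run it.
function measureEventLoopLag(callback) {
  const start = process.hrtime.bigint();
  setImmediate(function () {
    const lagMs = Number(process.hrtime.bigint() - start) / 1e6;
    callback(lagMs);
  });
}

// Sample the lag every 5 seconds; a busy server shows up as a growing number.
setInterval(function () {
  measureEventLoopLag(function (lagMs) {
    console.log('event loop lag: ' + lagMs.toFixed(2) + 'ms');
  });
}, 5000);

Because the callback only runs once everything already queued has been processed, the measured delay is a rough proxy for how backed up the server is.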

Sort out deploying with no downtime early on

It took us quite a while to get to deploying with no downtime; we only implemented the process a few months ago. It's not complicated, but we always knew we would be back up in under 100ms, so we never allocated a lot of resources to investigating it. It was only when paying customers started complaining about occasional errors that we started investigating.

The process is simple, but it relies on quite a few things:

  1. Running Nginx in front of your Node.js servers
  2. Being able to run your Node.js server on different ports
  3. Having forever set up with PID files named after the port your app is running on
  4. Using Ansible to deploy new changes

The way it works is:

1. Check which port is currently running by looking for the port_name.pid file in the .forever/pids/ folder:
function determinePort() {
  # If port one has no PID file, it is free: start there and stop port two.
  if [ ! -f "$log_path$pid_folder$portOne.pid" ]; then
    startPort=$portOne
    stopPort=$portTwo
  else
    startPort=$portTwo
    stopPort=$portOne
  fi
}

2. Deploy the new code into a folder corresponding to startPort (we deploy by deleting the folder and cloning the repo again; a sketch is below)
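
A hypothetical sketch of that step (code_path and repo_url are placeholder variables, not from our actual script):

# Wipe the folder for the new port and clone a fresh copy of the repo.
rm -rf "$code_path$startPort"
git clone "$repo_url" "$code_path$startPort"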

3. Start the service on the new port

# Start the new build under forever; logs and the PID file are named after the port.
$forever_path start \
  -c "node --max-old-space-size=$maxSpaceSize" \
  --pidFile $log_path$pid_folder$startPort.pid \
  -l $log_path$server_name_v2$extension_forever \
  -e $log_path$server_name_v2$extension_error \
  -a $file_path$server_name_v2$extension_js \
  port=$startPort $defaultParams_v2 $cloud

4. Write the two ports to a file; Ansible reads this to know what to tell Nginx to do

echo "$stopPort $startPort" > $basepath/../nginx_port_change

5. Keep checking that localhost:second_port_from_file returns a 200; give up and cancel the deployment if no 200 is received within 10 seconds (a sketch of this check follows below)
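
A hedged sketch of that check (in practice Ansible drives it; here startPort is the second port from the file, and the curl flags and one-second cadence are illustrative):

# Poll the new port for up to 10 seconds; cancel the deploy if it never returns 200.
for attempt in $(seq 1 10); do
  status=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:$startPort/")
  if [ "$status" = "200" ]; then
    break
  fi
  sleep 1
done
if [ "$status" != "200" ]; then
  echo "No 200 from port $startPort, cancelling deployment" >&2
  exit 1
fi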

6. Change the port in the Nginx config to second_port_from_file

7. Check that the new config is valid (nginx -t)

8. Reload the Nginx config (nginx -s reload)

9. Wait 10 seconds

10. Stop the old server

# Stop the previous instance via the PID recorded in its pid file.
$forever_path stop $(cat $log_path$pid_folder$first_port_from_file.pid)

If any step fails, we just cancel the deployment and investigate what went wrong.

This is a more in-depth explanation of the How do we deploy? chapter from part three of our series on how and why our system works (The Commercial API Journey). You might also like part one (The stack discovery) and part two (The API dissection).
