About a month ago I set up statistics for the official DDNet servers. My motivations for this are:
Monitor the servers more easily
Get notified about server problems
Have nice graphs to look at
The software was chosen mainly to keep resource usage low, a general
principle for DDNet, since we run on cheap VPSes all around the world and
have limited CPU and memory. In the rest of this post we will explore the
three major tools used, their purpose in our solution, and their
performance impact:
ServerStatus: Gathering live server statistics
RRDtool: Recording and graphing data
Nim: Favorite programming language for performance and readability
Gathering live server statistics with ServerStatus
We've been running BotoX's ServerStatus to get live server statistics for some time now. It works quite well for quickly noticing major server problems like high load or incoming (D)DoS attacks, provided that you keep an eye on it.
On a regular Saturday morning the end result looks like this:
At a quick glance you notice that the DDNet.tw server has high CPU usage (which is totally normal since it runs some hefty cron jobs every 20 minutes) and that DDNet RUS is receiving a small DoS attack of just 1.6 MB/s (unfortunately also totally normal). Apart from that everything looks fine.
ServerStatus footprint, calculated from Linux /proc statistics as follows (in Nim):
Server (9 clients)
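For readers who want to take a similar measurement themselves, here is a minimal Python sketch of reading a process footprint from /proc. It is not the Nim code used for the numbers in this post. The helper names `parse_proc_stat` and `parse_vm_rss_kb` are made up for illustration, and which counters to read (utime and stime from /proc/<pid>/stat, VmRSS from /proc/<pid>/status) is an assumption.

```python
import os

def parse_proc_stat(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split only the part after the last closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])

def parse_vm_rss_kb(status_text):
    """Return the resident set size in kB from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")

if __name__ == "__main__" and os.path.exists("/proc/self/stat"):
    # Measures this script itself; substitute the PID of the process
    # you actually want to inspect.
    pid = os.getpid()
    with open(f"/proc/{pid}/stat") as f:
        utime, stime = parse_proc_stat(f.read())
    with open(f"/proc/{pid}/status") as f:
        rss_kb = parse_vm_rss_kb(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    print(f"CPU time: {(utime + stime) / ticks_per_second:.2f} s")
    print(f"Resident memory: {rss_kb} kB")
```

Dividing the CPU time by the process's running time gives an average CPU usage percentage.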
Recording and graphing data with RRDtool
I haven't used RRDtool for about 7 years, but it's still an excellent tool for recording data into a fixed-size round-robin database. For us, three functions of RRDtool are important: create to create the database, update to add a new value to be aggregated into the database, and graph to render the database into a beautiful graph.
CPU, network and memory are the most important resources for me, so their usage should be recorded. Let us use network traffic as an example and create a database:
First we need to think about what data we want to record in the RRD:
1 sample = 30 seconds
1 day = 2880 samples = 6 * 480 pixels, each pixel is 03:00 min
7 days = 20160 samples = 42 * 480 pixels, each pixel is 21:00 min
49 days = 141120 samples = 147 * 960 pixels, each pixel is 73:30 min
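The arithmetic behind these numbers can be checked with a short Python sketch. The function name `archive_plan` is made up for illustration; the 30-second step and the pixel counts are taken from the list above.

```python
STEP_SECONDS = 30  # one sample every 30 seconds

def archive_plan(days, pixels):
    """Return (samples, samples_per_pixel, seconds_per_pixel) for a period."""
    samples = days * 24 * 3600 // STEP_SECONDS
    samples_per_pixel, remainder = divmod(samples, pixels)
    if remainder:
        raise ValueError("period does not divide evenly into pixels")
    return samples, samples_per_pixel, samples_per_pixel * STEP_SECONDS

for days, pixels in [(1, 480), (7, 480), (49, 960)]:
    samples, per_pixel, seconds = archive_plan(days, pixels)
    minutes, secs = divmod(seconds, 60)
    print(f"{days} day(s): {samples} samples = {per_pixel} * {pixels} pixels, "
          f"{minutes:02d}:{secs:02d} min per pixel")
```

Choosing periods that divide evenly means every pixel of a graph corresponds to exactly one consolidated value in the archive.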
Then we can use this to create the actual database file:
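A minimal sketch of what such a command might look like, built as an argument list in Python so it is easy to inspect. The three RRA lines follow directly from the plan above (6, 42 and 147 samples per row; 480, 480 and 960 rows). The GAUGE data source type, the 60-second heartbeat, the minimum of 0 and the xff of 0.5 are assumptions, not necessarily the settings used for DDNet. `rrd_create_args` is a hypothetical helper name.

```python
import shlex

def rrd_create_args(path, step=30):
    """Build an `rrdtool create` command line for the network database."""
    return [
        "rrdtool", "create", path,
        "--step", str(step),            # one sample every 30 seconds
        # DS:name:type:heartbeat:min:max (type and heartbeat are assumptions)
        "DS:network_rx:GAUGE:60:0:U",
        "DS:network_tx:GAUGE:60:0:U",
        # RRA:CF:xff:steps:rows
        "RRA:AVERAGE:0.5:6:480",        # 1 day, 3 minutes per row
        "RRA:AVERAGE:0.5:42:480",       # 7 days, 21 minutes per row
        "RRA:AVERAGE:0.5:147:960",      # 49 days, 73.5 minutes per row
    ]

args = rrd_create_args("ddnet.tw-net.rrd")
print(shlex.join(args))
# To actually create the file (requires rrdtool to be installed):
# import subprocess; subprocess.run(args, check=True)
```

With two data sources and 1920 rows in total, the archives hold 3840 values. At 8 bytes each that is about 30 KB, which fits the file size mentioned below.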
If you're curious about what exactly happens here, you can find more information in rrdcreate(1).
The resulting ddnet.tw-net.rrd file is just 32 KB in size and will forever stay that exact size. (All our databases together are just 1 MB.) New data in each round robin archive simply overwrites the oldest data. A disadvantage of RRDtool is that you need to think ahead and plan what data you want to store.
The next step is to put new data into our little database, which we should do every 30 seconds:
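As a sketch, an update call could be built like this in Python. `rrd_update_args` is a hypothetical helper; `N` is RRDtool's shorthand for the current time, and the values must be given in the order in which the data sources were defined (here assumed to be network_rx, then network_tx).

```python
import shlex

def rrd_update_args(path, network_rx, network_tx, timestamp="N"):
    """Build an `rrdtool update` command line with one sample per data source."""
    # Format is timestamp:value1:value2; "N" means "now".
    return ["rrdtool", "update", path, f"{timestamp}:{network_rx}:{network_tx}"]

args = rrd_update_args("ddnet.tw-net.rrd", 42, 1234)
print(shlex.join(args))
# To actually store the sample (requires rrdtool and an existing database):
# import subprocess; subprocess.run(args, check=True)
```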
Super simple! 42 is our network_rx value, 1234 the value for network_tx. These values are now aggregated using the AVERAGE consolidation function and then stored in their respective archives.
Once we have enough values we can finally create the graph, for example for 1 day:
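A minimal sketch of such a graph command, again built as an argument list in Python. The output file name, height, colors and legend labels are placeholders chosen for illustration. The width of 480 pixels matches the 480 rows of the 1-day archive, so each pixel shows one stored value. `rrd_graph_args` is a hypothetical helper name.

```python
import shlex

def rrd_graph_args(rrd_path, png_path, start="-1d", width=480, height=150):
    """Build an `rrdtool graph` command line for received and sent traffic."""
    return [
        "rrdtool", "graph", png_path,
        "--start", start,                       # last 24 hours
        "--width", str(width),
        "--height", str(height),
        # DEF:vname=file:data-source:consolidation-function
        f"DEF:rx={rrd_path}:network_rx:AVERAGE",
        f"DEF:tx={rrd_path}:network_tx:AVERAGE",
        # LINEwidth:vname#color:legend (colors are arbitrary)
        "LINE1:rx#0000FF:received",
        "LINE1:tx#FF0000:sent",
    ]

args = rrd_graph_args("ddnet.tw-net.rrd", "ddnet.tw-net-1d.png")
print(shlex.join(args))
# To actually render the PNG (requires rrdtool and an existing database):
# import subprocess; subprocess.run(args, check=True)
```

For the 7-day and 49-day graphs, changing `start` to `-7d` or `-49d` (and the width to 960 for the latter) would line up with the other two archives.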