As a first step toward a potential data-center-class capability, ADA North Pool has purchased both a Mellanox ConnectX-4 MCX4111A-XCAT 10 Gigabit Ethernet card and a Mellanox ConnectX-4 MCX4121A-ACAT 25 Gigabit Ethernet card. In theory, under optimal conditions, these cards can handle 75 million TCP packets per second.
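As a rough sanity check on packet-rate figures like this, the wire-level ceiling for a single Ethernet port can be worked out from the link speed and the minimum frame size. The sketch below assumes standard Ethernet framing (64-byte minimum frame, 8-byte preamble, 12-byte inter-frame gap); the function and constant names are my own and not from any vendor tool.

```python
# Per-frame overhead on the wire that is not part of the Ethernet frame itself.
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames
MIN_FRAME_BYTES = 64        # smallest legal Ethernet frame


def max_packets_per_second(link_bits_per_second: float,
                           frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Upper bound on frames per second for one port running at line rate."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bits_per_second / bits_on_wire


if __name__ == "__main__":
    for name, rate in (("10 GbE", 10e9), ("25 GbE", 25e9)):
        mpps = max_packets_per_second(rate) / 1e6
        print(f"{name}: {mpps:.2f} million packets/s")
```

With minimum-size frames this works out to roughly 14.88 million frames per second at 10 Gbit/s and 37.20 million at 25 Gbit/s per port. Vendor figures may be measured differently, so treat this only as a wire-level ceiling for comparison.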
I thought I would share my current screen setup on the server. I am using Terminator bash terminals, and hardware monitoring is your standard stuff like Glances, Htop and Iotop. I also use Chrony to keep my system clock accurate and tuned to keep my network settings optimized, both running as services on the computer. I had my own custom-made scripts for fork and bootstrap checking but replaced them with the great script Redoracle made. I also run Prometheus, Grafana and Nginx for monitoring and websites, plus a few security measures; one of them that I think it's fine to mention is Fail2ban. I have removed some information that could affect security and replaced it with a red bar.
Celebrating a milestone for ADA North Pool: we have come a long way since epochs 1 and 2, where we produced one block in each epoch! We have tweaked the connection settings even further and believe we have found a sweet spot that will give us even better operations.
To keep everything running smoothly we upgraded to the latest packages for our web server technology. We also added chronyd to make sure our server keeps the correct time (to within milliseconds), along with maintenance software that automates upgrades and keeps the system tuned and performing over time.
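For anyone wanting to check the clock offset themselves, here is a minimal sketch that reads the "System time" line from `chronyc tracking` and converts it to milliseconds. It assumes chrony is installed and that `chronyc` is on the PATH; the helper names `parse_offset_ms` and `read_offset_ms` are my own for illustration.

```python
import re
import subprocess
from typing import Optional

# Matches e.g. "System time     : 0.000020390 seconds slow of NTP time"
_SYSTEM_TIME = re.compile(
    r"System time\s*:\s*([0-9.]+)\s+seconds\s+(fast|slow)\s+of NTP time"
)


def parse_offset_ms(tracking_output: str) -> Optional[float]:
    """Return the clock offset in ms (positive = system clock is fast)."""
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        return None
    seconds = float(match.group(1))
    sign = 1.0 if match.group(2) == "fast" else -1.0
    return sign * seconds * 1000.0


def read_offset_ms() -> Optional[float]:
    """Run `chronyc tracking` and parse its output (requires chrony)."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return parse_offset_ms(result.stdout)
```

Calling `read_offset_ms()` on a machine running chronyd should return a small number if the clock is well synchronized; a value of `None` means the expected line was not found in the output.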
I compiled 0.8.5 myself because the pre-compiled binaries do not offer journald log support; you only get it if you compile the binaries yourself.
After reading several TCP networking guides and making tweaks to address whatever challenge came up at the moment, I realized this approach would be very time consuming. I am trying to set up as much automation as possible so the server is robust. In that regard we found https://tuned-project.org/ and it has already had dramatic results, cutting CPU usage in half and allowing us to double our network connectivity without any networking errors.
Had another long night where I tweaked the node cluster (fewer nodes, but more quality for each node, with more connections and higher file limits for each). I also optimized the server, painstakingly reducing one by one some metrics I know affect how quickly the server can read and process incoming connections, until I found an optimal point. After this the server seems more stable again (hoping this time it is going to last for a while!).
Now that the server is more stable I have had some time to add more “bells and whistles” type features. I now have not only hardware monitoring but also Jormungandr server monitoring, so users can see how the server is doing right here on the web page. I have added these as menu links. As always, feel free to contact me with encouragement or suggestions.
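Since every TCP connection uses one file descriptor, the open-file limit caps how many connections a node process can hold. As a small illustration (not the exact procedure used here), the limits that apply to a process can be inspected with Python's standard `resource` module on Linux; the helper names are my own.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit value, showing the 'no limit' sentinel as text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

The soft limit is the one actually enforced; a process may raise it up to the hard limit, while raising the hard limit itself needs administrator rights or a change in the service configuration.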
With very high input/output on our files, I have decided that NVMe SSD performance for file operations is more important than 4 terabytes of RAID, and we will likely switch over to full NVMe hardware if we are successful in the Testnet period. It also has the side benefit of 90% lower power usage per disk.
To improve system performance we have added several tweaks and some software that helps us handle the large connection loads needed to keep an “overview” of the network, which in turn helps our server avoid following forks. One such measure is Google's BBR congestion control algorithm, which has been tried and tested by a major corporation in high-stress situations. https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster
Hopefully epoch 7 will be a good one! Rest assured, we are still following along with dedication and keep improving network performance, looking under every stone we can find.
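To confirm that BBR is actually in use on a Linux server, the kernel exposes the active and available congestion control algorithms under `/proc/sys/net/ipv4/`. The following is a minimal sketch that reads those two files; the function names are my own, and it only reports status rather than changing any setting.

```python
from pathlib import Path

ACTIVE = Path("/proc/sys/net/ipv4/tcp_congestion_control")
AVAILABLE = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")


def parse_status(active_text: str, available_text: str) -> dict:
    """Summarize BBR status from the contents of the two /proc files."""
    active = active_text.split()
    current = active[0] if active else None
    return {
        "active": current,
        "bbr_available": "bbr" in available_text.split(),
        "bbr_active": current == "bbr",
    }


def read_text_or_empty(path: Path) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    status = parse_status(read_text_or_empty(ACTIVE), read_text_or_empty(AVAILABLE))
    print(status)
```

If `bbr_available` is false, the kernel module is most likely not loaded or the kernel is too old to include it; if it is available but not active, another algorithm such as cubic is still the default.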