After reading several TCP networking guides and applying tweaks ad hoc to whatever challenge appeared at the moment, I realized this approach would be very time-consuming. I am trying to set up as much automation as possible so the server stays robust. In that regard we found https://tuned-project.org/ and it has already had dramatic results, cutting CPU usage in half and allowing us to double our network connections without any networking errors.
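For anyone curious, switching profiles with tuned is essentially a one-liner; the snippet below is a sketch, not our exact setup (profile names such as network-throughput vary slightly between distributions, and the guard is only there so it degrades gracefully where tuned is not installed):

```shell
# Apply a network-oriented tuned profile (sketch; run as root).
# Assumes the tuned package is installed; profile names may differ
# on your distribution.
if command -v tuned-adm >/dev/null 2>&1; then
    tuned-adm list                        # show the available profiles
    tuned-adm profile network-throughput  # apply a throughput-oriented profile
    tuned-adm active                      # confirm which profile is active
else
    echo "tuned not installed"
fi
```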
Had another long night where I tweaked the node cluster (fewer nodes, but each of higher quality, with more connections and higher file limits) and also painstakingly optimized the server, reducing one by one the metrics I know affect how fast the server can read and process incoming connections, until I found an optimal point. After this the server seems more stable again (hoping this time it will last for a while!).
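For anyone tuning a similar setup, the per-process open-file limit is the usual knob for this; a minimal sketch (the 65535 value and the user name are illustrative, not our exact settings):

```shell
# Show the current soft limit on open file descriptors for this shell;
# each node process needs headroom here for its sockets and files.
ulimit -n

# Raise the soft limit for this session (cannot exceed the hard limit,
# hence the fallback so the snippet degrades gracefully):
ulimit -n 65535 2>/dev/null || echo "hard limit too low; raise it in limits.conf"

# Persistent version via /etc/security/limits.conf (illustrative values):
#   pooluser  soft  nofile  65535
#   pooluser  hard  nofile  65535
```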
Now that the server is more stable I have had some time to add more “bells and whistles” type features: in addition to hardware monitoring there is now Jormungandr server monitoring, so users can see how the server is doing right here on the web page. I have added both as menu links. As always, feel free to contact me with encouragement / suggestions.
With the very high input/output load on our files, I have decided that NVMe SSD performance for file I/O is more important than 4 terabytes of RAID storage, and we will likely switch over to all-NVMe hardware if we are successful in the Testnet period. It also has the side benefit of roughly 90% lower power usage per disk.
To improve system performance we have added several tweaks and also some software that helps us handle the large connection loads needed to keep an “overview” of the network, which in turn helps our server avoid following forks. One such measure is Google's BBR congestion control algorithm, which has been tried and tested by a major corporation under high-stress conditions. https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster
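On a reasonably modern Linux kernel (4.9 or newer, which ships the tcp_bbr module), enabling BBR comes down to two sysctl settings; a sketch of the kind of drop-in file involved (the file name is arbitrary):

```
# /etc/sysctl.d/90-bbr.conf -- enable Google's BBR congestion control
# (requires the tcp_bbr kernel module, available since Linux 4.9)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Apply with `sysctl --system` and verify with `sysctl net.ipv4.tcp_congestion_control`.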
Hopefully epoch 7 will be a good one! To be sure, we are still following along with dedication and keep improving network performance, leaving no stone unturned.
I had to pull an all-nighter: even though the new cluster of nodes is helping a lot with stability, it did not help when several forks suddenly propagated strongly through the network. To address this we also increased the network capacity of the main server, so it now sees most of the active nodes in the network. This allows the server to see which chain is the longest and correctly identify which path to take when forks occur.
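Raising a jormungandr node's network capacity is mainly a matter of its p2p connection limit in node_config.yaml; a sketch of the relevant excerpt (the value is illustrative, not our exact setting, and field availability depends on the jormungandr version):

```
# node_config.yaml (excerpt) -- illustrative value
p2p:
  max_connections: 256
```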
Also, when you look at the numbers, we had an amazing performance. We lost some blocks during the early period of forks, but around 4 AM (as I said, I pulled an all-nighter for this), after also bootstrapping to the nearest IOHK servers (3 servers are located in the EU) and implementing these changes, the pool performed admirably. Let's look at the math:
We know that 4,320 blocks are on offer each epoch (10% of the total 43,200 slots). We also know that currently only around half of these are actually produced by pools (around 2,600). In our case we have a stake of around 66 million delegated to us, which is 0.95% of the total stake. With 4,320 blocks on offer, at 0.95% we should receive about 41 blocks (4,320 × 0.0095 ≈ 41). Guess how many blocks ADA North Pool produced during the epoch? 41! So overall, given everything that happened in epoch 6, I am actually very pleased with the performance. Keep in mind that the performance metric is a short-term metric; you should also ask yourself how many blocks this pool is producing compared with its stake.
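The back-of-the-envelope calculation above, as a one-liner:

```shell
# expected blocks = slots per epoch x leader-slot fraction x our stake share
awk 'BEGIN { printf "%.1f\n", 43200 * 0.10 * 0.0095 }'   # prints 41.0
```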
Epoch 5 was rough on ADA North Pool, which only managed to produce around 60% of its blocks. Granted, it was rough for many other pool operators too, but ANP prides itself on trying, to the best of its ability, to produce blocks. To that end we have utilized our strong server hardware and added a cluster of nodes that will help keep the network stable and performing well. We have also made several more tweaks to the nginx web server to improve its performance so the site stays available.
First of all I would like to apologize that the webserver has been down for 24 hours. The site was simply not ready for the kind of load that came with the 0.8.2 ITN testnet, when all users of the Daedalus wallet wanted to access it. I have done a lot of optimization over these last hours and hope to come back much stronger because of it. Due to DNS propagation it will take 24-48 hours before users see these changes. As I said earlier, I am learning this as we go.
In that regard we also had some trouble with bootstrapping simply getting stuck due to the massive number of bootstrapping calls hitting the IOHK servers. To make sure this is not a problem in the future, we have added our own backup nodes plus private nodes in a pool alliance, and expanded our error-logging scripts to automate the handling of any incidents.
In short, I am still learning, but the pool and the website are getting stronger day by day.
As the 0.8.2 ITN had its first problem with the block height not moving up, ADA North Pool has added a temporary fail-safe in an automated script that resets the server if the block height stops advancing. Longer term we are working toward more advanced automated logging tools (probably through Graylog), but for the time being this is an easy workaround.
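The fail-safe can be sketched as two small shell functions. The details here are assumptions, not the exact script: it presumes jcli on the PATH, the node's REST API on 127.0.0.1:3100, and a systemd unit named jormungandr — adjust all three to your setup:

```shell
# height_stalled PREV CUR -> exit 0 (true) if the chain did NOT move up
height_stalled() {
    [ "$2" -le "$1" ]
}

# get_height: read lastBlockHeight from the node's REST stats
# (assumes jcli on PATH and the REST API listening on 127.0.0.1:3100)
get_height() {
    jcli rest v0 node stats get --host "http://127.0.0.1:3100/api" \
        | awk '$1 == "lastBlockHeight:" { gsub(/"/, "", $2); print $2 }'
}

# Intended use from cron every few minutes (sketch):
#   cur=$(get_height)
#   prev=$(cat /tmp/last_height 2>/dev/null || echo 0)
#   echo "$cur" > /tmp/last_height
#   height_stalled "$prev" "$cur" && systemctl restart jormungandr
```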
ADA North Pool had its first error that forced the service to reset. This took somewhat less than 1 second, so it did not in any way affect the pool's ability to produce blocks, but in line with transparency, and since we are in the testnet phase, it is reported here in the server status updates.
Another historic moment: ADA North Pool has now been lucky enough to be selected to produce a block in two epochs in a row. This epoch, too, our block was produced successfully, taking 0.0033 seconds from wake-up time to finished time.