Managing a large global cloud infrastructure with the high availability needed to back a 100% SLA requires staying on top of issues as they arise. Even with n+1 redundancy throughout the systems, problems need to be detected immediately and resolved quickly. For this purpose, we maintain a server monitoring network across our data centres using Zabbix.
With the number of new host deployments growing quickly, we needed to rethink our approach to monitoring it all. As such, we recently built an entirely new monitoring network based on the active agent-server model.
Masters of monitoring
Zabbix is state-of-the-art open source infrastructure monitoring software that we have used for a long time. It’s designed for real-time monitoring of millions of metrics collected from tens of thousands of servers, virtual machines and network devices.
Zabbix detects problem states within the incoming metric flow automatically, so there is no need to watch incoming metrics continuously. In addition, Zabbix boasts highly flexible definition options, root cause analysis, anomaly detection and trend prediction. It can define problem conditions and resolution conditions separately while operating on multiple severity levels.
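To illustrate what separate problem and resolution conditions buy you, here is a minimal Python sketch of a trigger with hysteresis. It is not Zabbix code: the class name, the thresholds and the sample values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class HysteresisTrigger:
    """Hypothetical trigger with distinct problem and recovery thresholds."""
    problem_above: float   # enter the problem state when a value exceeds this
    recover_below: float   # leave the problem state only below this value
    in_problem: bool = False

    def update(self, value: float) -> bool:
        """Feed one metric value and return True while in the problem state."""
        if not self.in_problem and value > self.problem_above:
            self.in_problem = True
        elif self.in_problem and value < self.recover_below:
            self.in_problem = False
        return self.in_problem


# Illustrative load values: the alert fires at 5.5 and clears only at 2.5.
trigger = HysteresisTrigger(problem_above=5.0, recover_below=3.0)
states = [trigger.update(v) for v in [2.0, 5.5, 4.0, 3.5, 2.5, 4.5]]
print(states)
```

Because the recovery threshold sits below the problem threshold, a value hovering around 5.0 does not make the alert flap on and off.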
Our previous system
With a history spanning over half a decade, we’ve certainly seen our fair share of changes. From the introduction of our signature storage technology MaxIOPS to the new but already familiar-feeling design language, we are no strangers to upgrades and improvements!
Our old monitoring network consisted of monitoring servers that periodically requested a live response from all other hosts in our infrastructure. The passive monitoring model was tried and tested and had served us nicely so far. But alas, even with all its features, it was time for an upgrade.
Distributed monitoring network
One of the main goals of the big revamp was to be able to scale effortlessly as our infrastructure grows. Implementing the new Zabbix installation introduced a number of improvements as well as challenges. To ensure reliability, the new system was developed in parallel with the old one, which it now runs alongside for redundancy. Running both monitoring systems side by side also allows us to compare the two as we work to improve our early issue detection.
As mentioned, our new Zabbix network is set up using an active monitoring model built on three main components: agents, proxies and servers.
Each monitored node has a lightweight Zabbix Agent installed on it. On first boot, the agent contacts a proxy, which checks it against known nodes and gives the agent a list of active checks based on auto-registration rules. Once configured, the agent actively sends the results of the requested checks to the proxy at regular intervals. Thanks to the agent’s small footprint, it can run without a noticeable impact on system resources.
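The exchange described above can be sketched as a small in-memory simulation in Python. This is only an illustration of the active model, not the Zabbix protocol or API: the `Proxy` and `Agent` classes, the metadata keywords, the host name and the collected values are all hypothetical, and the item keys are used simply as example labels.

```python
class Proxy:
    """Hypothetical proxy that hands out checks and accepts results."""

    def __init__(self, rules):
        self.rules = rules      # metadata keyword -> list of item keys
        self.known = {}         # hostname -> checks assigned at registration
        self.received = []      # (hostname, item key, value) tuples

    def request_checks(self, hostname, metadata):
        # Unknown hosts are registered by matching their metadata to rules.
        if hostname not in self.known:
            checks = []
            for keyword, keys in self.rules.items():
                if keyword in metadata:
                    checks.extend(keys)
            self.known[hostname] = checks
        return self.known[hostname]

    def submit(self, hostname, results):
        for key, value in results.items():
            self.received.append((hostname, key, value))


class Agent:
    """Hypothetical active agent: it contacts the proxy, never the reverse."""

    def __init__(self, hostname, metadata, collectors):
        self.hostname = hostname
        self.metadata = metadata
        self.collectors = collectors    # item key -> function returning a value
        self.checks = []

    def boot(self, proxy):
        self.checks = proxy.request_checks(self.hostname, self.metadata)

    def report(self, proxy):
        results = {k: self.collectors[k]() for k in self.checks if k in self.collectors}
        proxy.submit(self.hostname, results)


proxy = Proxy({"linux": ["agent.ping", "system.cpu.load"], "storage": ["vfs.fs.size"]})
agent = Agent(
    "node-01",
    "linux compute",
    {"agent.ping": lambda: 1, "system.cpu.load": lambda: 0.42},
)
agent.boot(proxy)      # first boot: register and fetch the check list
agent.report(proxy)    # called periodically in a real deployment
print(proxy.received)
```

The point of the design is that every connection is opened by the agent, so the proxy never has to poll the node.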
Due to the highly distributed nature of our data centres, we’ve opted to include proxy servers in each of our zones. All endpoints actively report their monitoring data to their local proxy, which then relays the information to the Zabbix servers.
Zabbix proxies collect monitoring data from the monitored devices and send the information to the Zabbix server, essentially working on behalf of the server. All collected data is buffered locally and then transferred to the Zabbix server the proxy belongs to. This allows proxies to retain data, for example, while a monitoring server is under maintenance and, in practice, to monitor the server itself.
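The buffering behaviour can be sketched in a few lines of Python. This is a simplified model rather than how the Zabbix proxy is implemented: `BufferingProxy`, `FakeServer` and the sample tuples are hypothetical, and a real proxy persists its buffer in a database instead of in memory.

```python
from collections import deque


class FakeServer:
    """Hypothetical stand-in for a monitoring server that can go offline."""

    def __init__(self):
        self.online = True
        self.stored = []

    def receive(self, batch):
        if not self.online:
            raise ConnectionError("server under maintenance")
        self.stored.extend(batch)


class BufferingProxy:
    """Keeps samples locally until the server accepts them."""

    def __init__(self, send):
        self.send = send        # callable that delivers a batch or raises
        self.buffer = deque()

    def collect(self, sample):
        self.buffer.append(sample)

    def flush(self):
        """Try to deliver everything buffered; return how many samples were sent."""
        if not self.buffer:
            return 0
        batch = list(self.buffer)
        try:
            self.send(batch)
        except ConnectionError:
            return 0            # delivery failed, keep the data for later
        self.buffer.clear()
        return len(batch)


server = FakeServer()
proxy = BufferingProxy(server.receive)

server.online = False                   # server goes down for maintenance
proxy.collect(("node-01", "cpu", 0.4))
proxy.collect(("node-02", "cpu", 0.7))
sent_while_down = proxy.flush()         # nothing is delivered, nothing is lost

server.online = True                    # maintenance is over
proxy.collect(("node-01", "cpu", 0.5))
sent_after_recovery = proxy.flush()     # the backlog goes out in one batch
print(sent_while_down, sent_after_recovery, len(server.stored))
```

The buffer is cleared only after a successful send, so an outage delays data instead of dropping it.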
Running proxies at each DC has been very beneficial for distributing the load on the Zabbix servers. On top of the added redundancy, the resource requirements on the monitoring servers are also reduced. It’s the ideal solution for cloud infrastructure monitoring in remote locations without local administrators.
While much of the monitoring is performed autonomously throughout our cloud infrastructure, the human aspect is still location bound. The monitoring servers work as the brain of the operation by gathering the data and presenting it in an easily readable format. In addition to visual alerts in the system, Zabbix also notifies the responsible personnel of events through other channels, such as a trusty old SMS.
The automation of new node registration is partly offloaded to the proxies, but the Zabbix server still needs to become aware of the additions as well. Depending on the intended use of each new node, Zabbix applies the correct template and adds the node to the correct host groups for monitoring. Strong encryption between all Zabbix components ensures secure communication for all of our ~150 000 monitored items with almost 20 000 triggers.
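As a rough illustration of how a node's intended use could decide its template and host groups, consider the Python sketch below. It does not reflect our actual rules or the Zabbix configuration format: the metadata strings, template names and group names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical auto-registration rule keyed on host metadata."""
    metadata_contains: str
    template: str
    host_groups: tuple


# Rules are checked in order and the first match wins.
RULES = [
    Rule("storage", "Template Storage Node", ("Storage", "All nodes")),
    Rule("compute", "Template Compute Node", ("Compute", "All nodes")),
]

# Fallback for nodes whose metadata matches no rule.
DEFAULT = Rule("", "Template Basic", ("Unclassified",))


def classify(host_metadata: str) -> Rule:
    """Pick the template and host groups for a newly registered node."""
    for rule in RULES:
        if rule.metadata_contains in host_metadata:
            return rule
    return DEFAULT


match = classify("role=compute zone=a")
print(match.template, match.host_groups)
```

A fallback rule keeps unclassified nodes visible in monitoring instead of silently skipping them.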
The new and improved monitoring system is already showing its benefits with automated registration. Furthermore, with our aim to continually grow, not only in our current data centres but by expanding to new regions as well, Zabbix will enable us to effortlessly scale up our monitoring in the future.
However, there is still more work to be done. As Zabbix is an important component of our promise of a 100% SLA, we look forward to utilising it fully. One of the advanced features of Zabbix is the ability to automatically resolve issues. Naturally, any actions that may affect the state of our users’ cloud servers require thorough testing and validation, but the potential is promising. Implementing a tireless machine administrator at scale will be challenging but also highly rewarding to both ourselves and our users!