Posted on 30.4.2019

Distributed monitoring for global cloud infrastructure


Managing a large global cloud infrastructure with high availability to provide a 100% SLA requires the ability to stay on top of issues as they arise. Even with advanced n+1 redundancy throughout the systems, problems need to be detected immediately and resolved quickly. For this purpose, we maintain a Zabbix-based server monitoring network across our data centres.

With the number of new host deployments growing quickly, we needed to rethink our approach to monitoring it all. As such, we recently built an entirely new monitoring network based on the active agent-server model.


Masters of monitoring

Zabbix is state-of-the-art open source infrastructure monitoring software that we have utilised for a long time. It’s designed for real-time monitoring of millions of metrics collected from tens of thousands of servers, virtual machines and network devices.

Zabbix can automatically detect problem states within the incoming metric flow, so there is no need to watch incoming metrics continuously. In addition, Zabbix boasts highly flexible definition options, root cause analysis, anomaly detection and trend prediction. It can separate problem conditions and resolution conditions while operating on multiple severity levels.
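To illustrate what separate problem and resolution conditions mean in practice, here is a minimal Python sketch of a trigger with hysteresis. It is not Zabbix's own implementation or trigger syntax; the class name, the thresholds and the sample values are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HysteresisTrigger:
    """Toy trigger with separate problem and recovery thresholds (hypothetical)."""

    problem_above: float  # enter PROBLEM when the value exceeds this
    recover_below: float  # return to OK only when the value drops below this
    in_problem: bool = False

    def update(self, value: float) -> str:
        # The state only changes when the matching condition is met, so a value
        # hovering between the two thresholds does not make the trigger flap.
        if not self.in_problem and value > self.problem_above:
            self.in_problem = True
        elif self.in_problem and value < self.recover_below:
            self.in_problem = False
        return "PROBLEM" if self.in_problem else "OK"


def evaluate(trigger: HysteresisTrigger, values: List[float]) -> List[str]:
    """Feed a series of metric values through the trigger, in order."""
    return [trigger.update(v) for v in values]


# Made-up CPU load samples, purely for illustration.
cpu_load = [2.0, 5.5, 4.0, 3.5, 2.5, 1.0]
states = evaluate(HysteresisTrigger(problem_above=5.0, recover_below=3.0), cpu_load)
print(states)
```

With these sample values, the trigger stays in the problem state while the load sits between the two thresholds and only recovers once it falls below the lower one.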

Our previous system

With a history of over half a decade, we’ve certainly seen our fair share of changes. From the introduction of our signature storage technology MaxIOPS to the new but already familiar-feeling design language, we are no strangers to upgrades and improvements!

Our old monitoring network consisted of monitoring servers that periodically requested a live response from all other hosts in our infrastructure. The passive monitoring model was tried and tested and had served us nicely so far. But alas, even with all its features, it was time for an upgrade.

Distributed monitoring network

One of the main goals of the big revamp was to be able to scale effortlessly as our infrastructure grows. Implementing the new Zabbix installation, which now tirelessly watches over our network, introduced a number of improvements as well as challenges. To ensure reliability, the new system was developed in parallel with the old one, which also gives us redundancy. Running both monitoring systems side by side also allows us to compare the two as we work to improve our early issue detection.

As mentioned, our new Zabbix network is set up using the active monitoring model, built on three main components:

Agents

Each monitored node has a lightweight Zabbix Agent installed on it. On the first bootup, the agent contacts a proxy, which checks it against known nodes and gives the agent a list of active checks based on auto registration rules. Once configured, the agent periodically sends the results of the requested checks to the proxy. Thanks to the agent’s small footprint, it can easily be run without a noticeable impact on system resources.
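The registration flow described above can be modelled roughly as follows. This is a simplified Python simulation, not the Zabbix agent protocol: the `Proxy` and `Agent` classes, the rule table and the host metadata are hypothetical, and the item key strings are used only as plain labels.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical auto registration rules: a keyword found in the host metadata
# maps to the checks the agent is told to run.
AUTO_REGISTRATION_RULES: Dict[str, List[str]] = {
    "compute": ["system.cpu.load", "vm.memory.size"],
    "storage": ["vfs.fs.size"],
}


class Proxy:
    """Toy proxy that registers agents and receives their results."""

    def __init__(self) -> None:
        self.known_nodes: Dict[str, List[str]] = {}
        self.received: List[Tuple[str, Dict[str, float]]] = []

    def register(self, hostname: str, metadata: str) -> List[str]:
        # Build the check list only the first time a node is seen.
        if hostname not in self.known_nodes:
            checks: List[str] = []
            for keyword, keys in AUTO_REGISTRATION_RULES.items():
                if keyword in metadata:
                    checks.extend(keys)
            self.known_nodes[hostname] = checks
        return self.known_nodes[hostname]

    def submit(self, hostname: str, results: Dict[str, float]) -> None:
        self.received.append((hostname, results))


class Agent:
    """Toy active agent: it initiates every exchange with the proxy."""

    def __init__(self, hostname: str, metadata: str,
                 collectors: Dict[str, Callable[[], float]]) -> None:
        self.hostname = hostname
        self.metadata = metadata
        self.collectors = collectors
        self.checks: List[str] = []

    def start(self, proxy: Proxy) -> None:
        # First bootup: ask the proxy which checks to run.
        self.checks = proxy.register(self.hostname, self.metadata)

    def report(self, proxy: Proxy) -> Dict[str, float]:
        # Run each requested check that this agent knows how to collect.
        results = {key: self.collectors[key]()
                   for key in self.checks if key in self.collectors}
        proxy.submit(self.hostname, results)
        return results


proxy = Proxy()
agent = Agent("node-1", "compute", {
    "system.cpu.load": lambda: 0.42,    # made-up sample values
    "vm.memory.size": lambda: 2048.0,
})
agent.start(proxy)
print(agent.report(proxy))
```

In a real deployment the report step would run on a schedule; here it is called once to keep the example short.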

Proxies

Due to the highly distributed nature of our data centres, we’ve opted to include proxy servers in each of our zones. All endpoints actively report their monitoring data to their local proxy, which then relays the information to the Zabbix servers.

Zabbix proxies collect monitoring data from the monitored devices and send the information to the Zabbix server, essentially working on behalf of the server. All collected data is buffered locally and then transferred to the Zabbix server the proxy belongs to. This allows proxies to retain data, for example, while a monitoring server is under maintenance and, in practice, to monitor the server itself.

Running proxies at each DC has been very beneficial for distributing the load of the Zabbix servers. On top of the added redundancy, the resource requirements on the monitoring servers are also reduced. It’s the ideal solution for cloud infrastructure monitoring in remote locations without local administrators.
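The buffering behaviour can be sketched with a simple store-and-forward queue. The Python below is only an illustration of the idea under our own assumptions: `BufferingProxy` and `FakeServer` are hypothetical names, and a real Zabbix proxy persists its buffer in a database rather than in memory.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class FakeServer:
    """Stand-in for a monitoring server that can be taken offline."""

    def __init__(self) -> None:
        self.online = True
        self.stored: List[Dict] = []

    def receive(self, sample: Dict) -> None:
        if not self.online:
            raise ConnectionError("server under maintenance")
        self.stored.append(sample)


class BufferingProxy:
    """Keeps collected samples locally until the server accepts them."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        self._send = send
        self._buffer: Deque[Dict] = deque()

    def collect(self, sample: Dict) -> None:
        self._buffer.append(sample)

    def flush(self) -> int:
        """Forward buffered samples in order; stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # keep the sample and retry on the next flush
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._buffer)


server = FakeServer()
proxy = BufferingProxy(server.receive)

server.online = False                      # simulate server maintenance
proxy.collect({"host": "node-1", "cpu": 0.4})
proxy.collect({"host": "node-2", "cpu": 0.7})
sent_while_down = proxy.flush()            # nothing gets through, nothing is lost

server.online = True
sent_after_recovery = proxy.flush()        # the backlog is delivered in order
print(sent_while_down, sent_after_recovery, proxy.pending)
```

A sample is removed from the buffer only after the send succeeds, so an outage delays delivery instead of dropping data.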

Servers

While much of the monitoring is performed autonomously throughout our cloud infrastructure, the human aspect is still location bound. The monitoring servers work as the brain of the operation by gathering the data and presenting it in an easily readable format. In addition to visual alerts in the system, Zabbix also informs the responsible personnel about events through other channels, such as a trusty old SMS.

The automation of new node registration is partly offloaded to the proxies, but the Zabbix server still needs to become aware of the additions as well. Depending on the intended use of each new node, Zabbix applies the correct template and adds the node to the correct host groups for monitoring. Strong encryption between all Zabbix components secures the communication for all of our ~150 000 monitored items and almost 20 000 triggers.
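As a rough illustration of how a node's intended use could decide its template and host groups, consider the following Python sketch. The role names, template names, group names and zone label are invented for the example and do not reflect our actual configuration or the Zabbix API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from a node's intended use to its monitoring setup.
ROLE_CONFIG: Dict[str, Dict] = {
    "compute": {"template": "Template Compute Host", "groups": ["Compute hosts"]},
    "storage": {"template": "Template Storage Host", "groups": ["Storage hosts"]},
}
# Fallback for nodes whose role is not recognised.
DEFAULT_CONFIG: Dict = {"template": "Template Basic Host", "groups": ["Unclassified"]}


@dataclass
class HostRecord:
    name: str
    template: str
    groups: List[str]


def register_host(name: str, role: str, zone: str) -> HostRecord:
    """Pick a template and host groups from the node's role and zone."""
    config = ROLE_CONFIG.get(role, DEFAULT_CONFIG)
    # Copy the list so the shared mapping is never modified.
    groups = list(config["groups"]) + [f"Zone {zone}"]
    return HostRecord(name=name, template=config["template"], groups=groups)


host = register_host("node-1", role="storage", zone="zone-a")
print(host)
```

Keeping the mapping in one table means a new node type only needs one new entry rather than changes to the registration logic.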

System summary

The new and improved monitoring system is already showing its benefits with automated registration. Furthermore, with our aim to continually grow, not only in our current data centres but by expanding to new regions as well, Zabbix will enable us to effortlessly scale up our monitoring in the future.

However, there is still more work to be done. As Zabbix is an important component of our promise of a 100% SLA, we look forward to utilising it fully. One of the advanced features of Zabbix is the ability to automatically resolve issues. Naturally, any actions that may affect the state of our users’ cloud servers require thorough testing and validation, but the potential is promising. Implementing a tireless machine administrator at scale will be challenging but also highly rewarding to both ourselves and our users!

Janne Ruostemaa

Editor-in-Chief
