Although we all have to deal with unexpected events, we also have tools to prevent them. As mentioned in the last post, log files must be accessible upfront, otherwise troubleshooting is compromised. Before any issue occurs, there's a lot we can do to be aware of what's going on, act proactively and keep the problem from becoming reality.
Most companies have already implemented a monitoring solution. Usually my sysadmin friends are the people in charge of such solutions. If you have this responsibility, you know how difficult it is to gather all the metrics, show them in fancy dashboards, and properly send alerts to the people who must react when there's evidence of trouble. Maybe, more often than you would like, you have to justify why some metric wasn't considered, or wasn't shown, or some alert wasn't sent. The bigger the monitoring service, the more likely this kind of situation is to happen.
Don't let the task of avoiding problems become a problem itself. You can use open source tools and get a monitoring server ready to do the job. Once it's up and running, you will be able to easily plug any other server into the monitoring service, with no need for an installed agent. In addition, you will be able to send alert notifications through instant messaging apps, like Slack, instead of by email.
The solution combines InfluxDB, a high performance time series database, Grafana, a time series analytics and monitoring tool, and Ansible, an agentless automation tool. With Ansible it's possible to constantly extract the servers' hardware metrics and store them in the InfluxDB database. With Grafana it's possible to connect to the InfluxDB database, show the metrics in dashboards, define thresholds and configure alerts. The solution can be checked out on GitHub, and the details are shown right below.
UPDATE: This Codeyourinfra solution has been refactored and migrated to .
The development environment
The monitored environment was reproduced using local virtual machines, one representing the monitoring server (monitor) and the other two representing servers that could be plugged into the monitoring service (server1 and server2). Vagrant was used to manage this development environment. With the Vagrantfile below, it's possible to smoothly turn on and provision the monitoring server by executing the command vagrant up monitor. Notice that the VMs server1 and server2 are also defined, but they can be booted up later, if you want to plug just one or both into the monitoring service.
Vagrant.configure("2") do |config|
  config.vm.box = "minimal/trusty64"
  config.vm.define "monitor" do |monitor|
    monitor.vm.hostname = "monitor.local"
    monitor.vm.network "private_network", ip: "192.168.33.10"
    monitor.vm.provision "ansible" do |ansible|
      ansible.playbook = "playbook-monitor.yml"
    end
  end
  (1..2).each do |i|
    config.vm.define "server#{i}" do |server|
      server.vm.hostname = "server#{i}.local"
      server.vm.network "private_network", ip: "192.168.33.#{i+1}0"
    end
  end
end
The monitoring server provisioning is done by Ansible, and it's divided into two basic parts: the installation of the tools (InfluxDB, Grafana and Ansible) and the configuration of the monitoring service. Notice that Ansible is used to install Ansible! The playbook-monitor.yml below shows that.
Besides, rather than putting all the tasks in a single big file, each tool's installation tasks were placed in a specific YML file, in order to keep the code clean, organized and easy to understand. The grouped tasks can then be dynamically included in the main playbook through the include_tasks statement.
---
- hosts: monitor
  become: yes
  gather_facts: no
  tasks:
    - name: Install apt-transport-https (required for the apt_repository task)
      apt:
        name: apt-transport-https
        update_cache: yes
      tags:
        - installation
    - name: Install InfluxDB
      include_tasks: influxdb-installation.yml
      tags:
        - installation
    - name: Install Grafana
      include_tasks: grafana-installation.yml
      tags:
        - installation
    - name: Install Ansible
      include_tasks: ansible-installation.yml
      tags:
        - installation
    - name: Configure monitoring
      include_tasks: monitoring-configuration.yml
      tags:
        - configuration
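For reference, each included installation file follows the usual apt repository pattern. Below is a minimal sketch of what influxdb-installation.yml could look like; the repository URL, key URL and package name are assumptions based on InfluxData's documented apt repository, not necessarily the exact content of the project's file.
---
# Illustrative sketch: install InfluxDB from InfluxData's apt repository
- name: Add the InfluxData repository key
  apt_key:
    url: https://repos.influxdata.com/influxdb.key
    state: present
- name: Add the InfluxDB apt repository
  apt_repository:
    repo: deb https://repos.influxdata.com/ubuntu trusty stable
    state: present
- name: Install InfluxDB
  apt:
    name: influxdb
    update_cache: yes
- name: Ensure InfluxDB is started
  service:
    name: influxdb
    state: started
    enabled: yes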
The monitoring service configuration
The monitoring service configuration is composed of a few steps, as shown in the monitoring-configuration.yml file below. First and foremost, the InfluxDB database, named monitor, is created. InfluxDB provides a very useful HTTP API, which can be used for a variety of database operations. For interacting with web services from Ansible, the uri module is the most suitable one. All the metrics extracted from the monitored servers are stored in the monitor database.
After that, the Grafana data source that connects to the InfluxDB database is created. That way Grafana is able to access all the stored metrics data. Like InfluxDB, Grafana has an HTTP API which allows making most, if not all, of the configuration through JSON-formatted content. Besides the data source creation, the Slack notification channel and the first dashboard are also created. Notice that, in order to consider the task successful when the playbook is executed again, and guarantee the idempotency, response statuses other than 200 are accepted as well.
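Just to give an idea of the JSON content involved, the monitor-datasource.json file read by the lookup below would define a Grafana data source pointing to the local InfluxDB. The sketch that follows is an assumption based on Grafana's data source API fields, not the project's actual file:
{
  "name": "monitor",
  "type": "influxdb",
  "access": "proxy",
  "url": "http://localhost:8086",
  "database": "monitor"
}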
The configured Slack notification channel points to a specific Slack channel. Of course you can use it, but I'm pretty sure you will want to create your own, and invite the troubleshooting guys to join. Don't forget to create an incoming webhook in your Slack workspace and replace the URL in the slack-notification-channel.json file with the generated webhook URL.
The initial dashboard shows the used memory percentage metric. Other metrics can be added to it, or you can create new dashboards, at your will. A threshold of 95% was defined, so you can visually see when the metric exceeds that limit. An alert was also defined, and a notification is sent to the configured Slack channel when the last five metric values are greater than or equal to the 95% limit. The alert also sends a notification when the server's health is reestablished.
With Ansible you can perform tasks on several servers at the same time. That's possible because everything is done through SSH from a master host, even if it's your own machine. Besides that, Ansible knows the target servers through the inventory file (/etc/ansible/hosts), where they are defined and also grouped. During the monitoring service configuration, the group monitored_servers is created in the inventory file. Once a server is in this group, it's automatically monitored. Plugging a server into the monitoring service is as simple as adding a line to the file. The first monitored server is the monitoring server itself (localhost).
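For illustration, after the initial configuration and after a server such as server1 is plugged in (as shown later in this post), the monitored_servers group in /etc/ansible/hosts would look roughly like this (server1's IP comes from the Vagrantfile, and the vagrant credentials are the ones used throughout the examples):
[monitored_servers]
192.168.33.20 ansible_user=vagrant ansible_ssh_pass=vagrant
localhost ansible_connection=local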
In order to prevent Ansible from checking the SSH key of the servers plugged into the monitoring service, it's necessary to disable that default behavior in the Ansible configuration file (/etc/ansible/ansible.cfg). This way Ansible won't have problems collecting metrics from any new server through SSH.
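The resulting entry in /etc/ansible/ansible.cfg is simply:
[defaults]
host_key_checking = False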
Finally, an Ansible playbook (playbook-get-metrics.yml) is used to connect to all the monitored servers and extract all the relevant metrics. It's placed in the /etc/ansible/playbooks directory and configured in cron to be executed every minute. Just to sum up: every minute the metrics are collected, stored, shown and, in case of evidence of trouble, an alert is sent. Isn't it awesome?
---
- name: Create the InfluxDB database
  uri:
    url: http://localhost:8086/query
    method: POST
    body: "q=CREATE DATABASE monitor"
- name: Create the Grafana datasource
  uri:
    url: http://localhost:3000/api/datasources
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','monitor-datasource.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 409
- name: Create the Slack notification channel
  uri:
    url: http://localhost:3000/api/alert-notifications
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','slack-notification-channel.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 500
- name: Create the Grafana dashboard
  uri:
    url: http://localhost:3000/api/dashboards/db
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','used_mem_pct-dashboard.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 412
- name: Add localhost to Ansible inventory
  blockinfile:
    path: /etc/ansible/hosts
    block: |
      [monitored_servers]
      localhost ansible_connection=local
- name: Disable SSH key host checking
  ini_file:
    path: /etc/ansible/ansible.cfg
    section: defaults
    option: host_key_checking
    value: False
- name: Create the Ansible playbooks directory if it doesn't exist
  file:
    path: /etc/ansible/playbooks
    state: directory
- name: Copy the playbook-get-metrics.yml
  copy:
    src: playbook-get-metrics.yml
    dest: /etc/ansible/playbooks/playbook-get-metrics.yml
    owner: root
    group: root
    mode: 0644
- name: Get metrics from monitored servers every minute
  cron:
    name: "get metrics"
    job: "ansible-playbook /etc/ansible/playbooks/playbook-get-metrics.yml"
Collecting the metrics
The playbook-get-metrics.yml file below is responsible for extracting from the monitored_servers all the important metrics and storing them in the monitor database. Initially the only extracted metric is the used memory percentage, but you can easily start extracting more metrics by adding tasks to the playbook.
Notice that InfluxDB's HTTP API write endpoint is used to store the metric in the monitor database. 192.168.33.10 is the IP address of the monitoring server and 8086 is the port InfluxDB listens on. The used memory percentage has the key used_mem_pct in the database, and you must choose an appropriate key for each metric you start to extract.
Ansible by default collects information about the target host. It's an initial step before the tasks' execution. The collected data is then available to be used by the tasks. The hostname (ansible_hostname) is one of those facts, essential to differentiate the server the metric is extracted from. By the way, the used memory percentage is calculated using two other facts gathered by Ansible: the used real memory in megabytes (ansible_memory_mb.real.used) and the total real memory in megabytes (ansible_memory_mb.real.total). If you want to see all of this data, execute the command ansible monitor -m setup -u vagrant -k -i hosts, and type vagrant when prompted for the SSH password. Notice that the information is JSON-formatted, and the values can be accessed through dot notation.
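As a purely hypothetical example, if server1 reported 369 MB of used real memory out of a 489 MB total, the body sent to InfluxDB would expand to the following line protocol point (value rounded here for readability):
used_mem_pct,host=server1 value=75.46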
---
- hosts: monitored_servers
  tasks:
    - name: Used memory percentage
      uri:
        url: http://192.168.33.10:8086/write?db=monitor
        method: POST
        body: "used_mem_pct,host={{ansible_hostname}} value={{ansible_memory_mb.real.used / ansible_memory_mb.real.total * 100}}"
        status_code: 204
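As an example of extending the playbook, a task like the one below could track the root filesystem usage as well, using the ansible_mounts facts. The used_disk_pct key and the task itself are illustrative assumptions, not part of the original solution:
    # Illustrative additional metric: used space on the root filesystem
    - name: Used root filesystem percentage
      uri:
        url: http://192.168.33.10:8086/write?db=monitor
        method: POST
        body: "used_disk_pct,host={{ansible_hostname}} value={{(1 - item.size_available / item.size_total) * 100}}"
        status_code: 204
      when: item.mount == '/'
      with_items: "{{ansible_mounts}}"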
Plugging a server into the monitoring service
You've probably already executed the command vagrant up monitor, in order to get the monitoring server up and running. If not, do it right now. It takes some time, depending on how fast your Internet connection is. You can follow the output and see each step of the server provisioning.
When it's finished, open your browser and access the Grafana web application by typing the URL http://192.168.33.10:3000. The user and the password to log in are the same: admin. Click on the used_mem_pct dashboard link, and take a look at the values concerning the monitoring server in the presented line chart. You may need to wait a few minutes until there are enough values to track.
Ok, you may now want to plug another server into the monitoring service, and see its values in the line chart too. So turn on server1, for example, by executing the command vagrant up server1. After that, execute the Ansible playbook below through the command ansible-playbook playbook-add-server.yml -u vagrant -k -i hosts. The -u argument defines the SSH user, the -k argument prompts for the password input (vagrant, too), and the -i argument points to the hosts inventory file, where the monitoring server is defined.
You will be prompted for the new server's IP address and the SSH credentials, in order to enable Ansible to connect to the server. That's enough to plug the server into the monitoring service, simply by inserting a line into the monitoring server's /etc/ansible/hosts file. The next time cron executes the playbook-get-metrics.yml, at most one minute later, server1 will also be considered a monitored server, so its metrics will be extracted, stored and shown in the dashboard too.
---
- hosts: monitor
  become: yes
  gather_facts: no
  vars_prompt:
    - name: "host"
      prompt: "Enter host"
      private: no
    - name: "user"
      prompt: "Enter user"
      private: no
    - name: "password"
      prompt: "Enter password"
      private: yes
  tasks:
    - name: Add the server into the monitored_servers group
      lineinfile:
        path: /etc/ansible/hosts
        insertafter: "[monitored_servers]"
        line: "{{host}} ansible_user={{user}} ansible_ssh_pass={{password}}"
Conclusion
Monitoring is key in high performance organizations. It's one of the pillars of DevOps. Better monitoring solutions shorten feedback cycles and foster continuous learning and continuous improvement.
Among the variety of monitoring solutions, the one just described aims to be cheap, flexible and easy to implement. Some benefits of its adoption are:
- the solution does not require installing an agent on every monitored server, taking advantage of Ansible's agentless nature;
- it stores all the metrics data in InfluxDB, a high performance time series database;
- it centralizes the data presentation and the alerts configuration in Grafana, a powerful data analytics and monitoring tool.
I hope this solution solves at least one of the pain points in your monitoring tasks. Experiment with it, improve it and share it at will.
Finally, if you want my help automating something, please give me more details and tell me your problem. It may be someone else's problem too.