Note: It seems my original post from last week didn’t get posted on lemmy.world from kbin (I can’t seem to find it) so I’m reposting it. Apologies to those who may have already seen this.
I’m looking to deploy some form of monitoring across my self-hosted servers and I’m a bit confused about the different options.
I have a small network of three machines that I would like to monitor. I am not looking for a solution that lets me monitor tens, hundreds, or thousands of nodes. Furthermore, I am more interested in being able to observe metrics for each node individually rather than in aggregate. Each of these machines performs a different task so aggregate metrics from these machines are not particularly meaningful. However, collecting all the metrics centrally so that I can have a single dashboard to view them all in one convenient place is definitely something I would like.
With that said, I have been trying to understand the different (popular) options that are available and I would like to hear what the community’s experience is with these options and if anybody has any advice on any of these in light of my requirements above.
Prometheus seems like the default go-to for monitoring. This would require deploying a node_exporter on each node, a prometheus service, and a grafana dashboard. That’s all fine, I can do that. However, from all that I’m reading it doesn’t seem like Prometheus is optimised for my use case of monitoring each node individually. I’m sure it’s possible, but I’m concerned that because this is not what it’s meant for, it would take me ages to set it up such that I’m happy with it.
Netdata seems like a comprehensive single-device monitoring solution. It also appears that it is possible to run your own registry to help with distributed monitoring. Not gonna lie, the netdata dashboard looks slick. An important additional advantage is that it comes packaged on Debian (all my machines run Debian). However, it looks like it does not store the metrics for very long. To solve that I could also set up InfluxDB and Grafana for long-term metrics. I could use Prometheus instead of InfluxDB in this arrangement, but I’m more likely to deploy a bunch of IoT devices than I am to deploy servers needing monitoring which means InfluxDB is a bit more future-proof for me as it could be reused for IoT data.
Cockpit is another single-device solution which additionally provides direct control of the system. The direct control is probably not so much of a plus as then I would never let Cockpit be accessible from outside my home network whereas I wouldn’t mind that so much for dashboards with read-only data (still behind some authentication of course). It’s also probably not built for monitoring specifically, but I included this in the list in case somebody has something interesting to say about it.
What’s everybody’s experience with the above solutions and does anybody have advice specific to my situation? I’m currently leaning towards netdata with my own registry at first, adding InfluxDB and Grafana later for long-term metrics.
I have been using Uptime Kuma for internal monitoring and Uptime Robot for external.
I like the combination and it seems like what you are looking for.
https://github.com/louislam/uptime-kuma https://uptimerobot.com
I use this solution also but for external I run Uptime Kuma on a $5 VPS that also stores my Unraid Backups
Smart. I am on Uptime Robot’s free tier just to monitor my public DNS and that NPM is up and routing to Nextcloud and ssh.
Seconded for simplicity. If OP is looking for complex statistics, it may not do the trick, but it’s about as straightforward and quick to set up as a monitoring solution can get.
Personally I opt for zabbix. But I’ve been working with it for years. Simple deployment, lots of support, just works.
Netdata is great and easily deployed via docker. I ran it bare metal before and was also pleased if that’s your preference.
Netdata when it works is pretty great, however it tends to eat up the RAM of whatever I put it on until the whole server stops responding. If they fixed whatever caused… that. I would totally still be using it.
Have you tried reducing how long netdata stores metrics? Even 24 hours of metrics use a lot of RAM to store. On small servers I’ll just limit it to 8 hours of metrics because I’m mostly interested in the alerts and live resource usage.
I’ve never had that with the few systems I’ve run it on
I’ve also never had that issue. It’s had quite a few updates since I started using it.
I am more interested in being able to observe metrics for each node individually rather than in aggregate.
This requirement makes me think netdata would be a good solution. In my current setup, each host has its own netdata dashboard and manages its own health checks/alarms. I have also enabled streaming which sends metrics from all hosts to a “parent/master” netdata instance from which I can see all metrics from all hosts without checking each dashboard individually.
However, it looks like it does not store the metrics for very long.
I still have to look into this. In the past it was certainly true, and you had to set up a prometheus instance to store (and downsample, who needs few-seconds resolution for one-year-old metrics) metrics for long-term archival. But looking at the documentation right now, it looks possible to store long-term metrics in the netdata DB itself, by moving old metrics to a lower-definition storage tier: https://learn.netdata.cloud/docs/configuring/optimizing-metrics-database/change-how-long-netdata-stores-metrics
An important additional advantage is that it comes packaged on Debian (all my machines run Debian).
Same. However I install and update it from their third-party APT repository - it’s one of the rare cases where I prefer upstream releases to Debian stable packages, the last few upstream releases have been really nice (for example I’m not sure the new tiered retention system is available in the v1.37.1 Debian stable package)
My automated installation procedure (ansible role) is here if you’re interested (start at tasks/main.yml and follow the import_tasks).
Thanks a lot for these tips! Especially about using the upstream deb.
check_mk is what I use at home and at work, it’s a fork of nagios/icinga, works with agents, nagios plugins, or snmp, and if somehow you can’t find what you want to monitor, writing custom checks is as easy as writing a bash script
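To give an idea of how small a custom check can be, here is a minimal sketch of a local check written in Python rather than bash. It assumes the usual local-check line layout (status code, quoted service name, metrics, free text) and that the script is dropped into the agent's local check directory. The service name, metric name, and thresholds are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a Checkmk-style local check reporting root disk usage."""
import shutil


def local_check_line(status: int, service: str, metrics: dict, text: str) -> str:
    # One line per service: <status> "<service name>" <metrics> <text>
    # Multiple metrics are joined with "|"; "-" means no metrics.
    perf = "|".join(f"{name}={value}" for name, value in metrics.items()) or "-"
    return f'{status} "{service}" {perf} {text}'


def disk_status(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    # 0 = OK, 1 = WARN, 2 = CRIT (thresholds are illustrative assumptions)
    if used_pct >= crit:
        return 2
    if used_pct >= warn:
        return 1
    return 0


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    used_pct = round(usage.used / usage.total * 100, 1)
    print(
        local_check_line(
            disk_status(used_pct),
            "Root disk usage",          # hypothetical service name
            {"used_pct": used_pct},     # hypothetical metric name
            f"{used_pct}% of / used",
        )
    )
```

The agent runs the script on each check cycle and the server turns every printed line into a service, so anything you can print in that layout can be monitored.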
I’m also using checkmk and have been happy with it. I had been using zabbix prior but found it to be cumbersome and sluggish.
I opted for checkmk as well and don’t want to switch. It’s got a good default for Linux monitoring and it will tell me about random things to fix after reboots, or that memory/disc is getting low so I can fix it quickly.
When monitoring 15 virtual machines on one physical host, the default of checking every minute for all machines raised the temp over 80 degrees Celsius on the physical machine and triggered a warning. Checking every five minutes is more than I need, so I went with that change.
I have a little 4 core/ 8gb ram VM running my work instance that monitors over a thousand clients on 60s check intervals, you may want to look into your config. I honestly have no idea what could cause 15 machines to cost that much computationally
Sorry, 70 degrees, not 80. The load was fine. It’s a machine to test things, but I kept using checkmk since I really liked it. All on one server, both monitor server and all clients. It’s an old workstation - it runs around 60 degrees normally.
That said, it could very much be a config issue, I installed with the ansible role and left most everything as default. A very easy installation, and with ansible very easy to add new hosts to monitor as well. I’m up to 36 now, including some docker containers.
I switched back to 1 minute to test, and it warned for temp within 20 minutes, going from 60 degrees to hovering around 70. Load went from 2 to 3.5, threads from 1k to 1.2k, all on the physical side. There’s also a small change in IO that seems to be the checkmk server writing more to disk - the cpu on that host is only slightly higher.
I’m guessing that the temp going over is hardware related, a better fan might fix that issue.
I don’t know if the load/thread increase is reasonable, but given the amount of checks done in the agent I’m perfectly OK with giving those resources to have all the data points checkmk collects available. It’s helped a lot being able to go into details to see what’s going on, checkmk makes that so easy.
That’s odd. I’m currently monitoring 17 vms on one host along with a handful of physical devices. Nothing like the issues you’ve encountered has happened.
I’m a fan of Zabbix. I’ve used it in a datacenter environment but it’s much easier to configure than Icinga/Nagios and not as hacky as Prometheus/grafana.
I’ve used nagios, check_mk, zabbix and currently using prometheus + grafana. I suggest prometheus + grafana. But you may want to use netdata as the exporter instead of node_exporter. Or both.
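On the per-node concern: with node_exporter every series carries an instance label, so looking at one machine at a time is mostly a matter of filtering on it. A rough sketch of what that looks like against the Prometheus HTTP query API, assuming Prometheus on localhost:9090 and node_exporter on its default port 9100. The host names are hypothetical.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"            # assumption: default Prometheus port
NODES = ["nas:9100", "router:9100", "media:9100"]   # hypothetical hosts, default exporter port


def per_node_query(metric: str, instance: str) -> str:
    # PromQL selector restricted to a single node via the instance label
    return f'{metric}{{instance="{instance}"}}'


def query_url(base: str, promql: str) -> str:
    # Instant-query endpoint of the Prometheus HTTP API
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    for node in NODES:
        # node_load1 is the 1-minute load average exposed by node_exporter
        print(query_url(PROMETHEUS_URL, per_node_query("node_load1", node)))
```

In Grafana the same idea is usually done with a dashboard variable on instance, so one dashboard can be switched between the three machines.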
Thanks for your reply! Out of curiosity, what made you go with Prometheus over zabbix and check_mk in the end? Those two seem to be heavily recommended.
nagios (and check_mk) are plain old tech. Newer ones have been built with lessons learned. zabbix I don’t like because configuration is in a database. prometheus is nice because it’s performant and configuration is in a file (which can be version controlled in git and deployed with e.g. ansible). Data in database, config in plain text files.
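As an illustration of the config-in-a-file point, the whole scrape setup for three nodes fits in a few lines of YAML that can be generated or templated and kept in git. A small sketch that renders such a file, where the host names, the role label, and the 60s interval are all assumptions for the example:

```python
def render_scrape_config(targets: dict, interval: str = "60s") -> str:
    """Render a minimal prometheus.yml with one node_exporter job.

    targets maps a host name to a free-form role label (both hypothetical).
    """
    lines = [
        "global:",
        f"  scrape_interval: {interval}",
        "scrape_configs:",
        "  - job_name: node",
        "    static_configs:",
    ]
    for host, role in targets.items():
        lines += [
            f"      - targets: ['{host}:9100']",  # 9100 is node_exporter's default port
            "        labels:",
            f"          role: {role}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_scrape_config({"nas": "storage", "router": "network", "media": "plex"}))
```

The output is plain text, so a diff in git shows exactly which host or label changed.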
I’m running netdata on each of my servers and it has every feature I need. If u choose netdata, make sure not to install the nightly builds since they get updated all the time and sometimes break features. One annoying thing with netdata is you have to pay a subscription for the option to disable individual alert types. I have a nearly full hard drive and there’s an alert for that which won’t go away. Same thing for temporary inbound packet drops, which seem to happen every time one particular Plex user forcibly transcodes content (they’re old and remote and won’t change their Plex client settings 😡). They send you an email for each error.
you have to pay a subscription for the option to disable individual alert types
Never heard of that. You can disable individual alarms by setting to: silent in the relevant health.d configuration file (e.g. health.d/disks.conf). This is exactly what I do for packet drops.
Ooh sweet thnx 4 the tip - will try this later! I was trying via the web dashboard which I’m pretty sure requires a subscription
I use LibreNMS, which is a fork of Observium. It is primarily SNMP polling, so if you haven’t worked with SNMP before there can be a bit of a learning curve to get it set up. Once you get the basics working it’s pretty easy to add service monitoring, syslog collection, alerting and more. And since it’s SNMP you can monitor network hardware pretty easily as well as servers.
The dashboards aren’t as beautiful as some other options but there is lot to work with.
Currently setting up my own monitoring stack:
- Fluentbit to gather metrics/logs
- Fluentd to aggregate logs
- Elasticsearch as database
- Grafana for visualization
Hi! I’m also trying to navigate the monitoring solutions, I thought it would be easier… :( Maybe someone has a recommendation:
I’m looking for a lightweight tool for my personal home lab (Ionos VPS, 2GB RAM, 2 CPUs), so no need for scalability or big data, etc. I’m experimenting with some services (syncthing, silverbullet-md, wireguard) and there is not much RAM left for anything else. I’ve been reading about Prometheus+Grafana, but it sounds like overkill, like checkMk, Zabbix, Graphite, netdata…
I mostly need the status of the hardware (RAM+CPU) and the containers running. Ideally, I can see a few days of history in a web-based dashboard.
Currently I’m using Glances because it was easy to install and very lightweight but if I want to visualize the persisted data, I need something like Grafana, etc.
(sorry for the long comment I wasn’t sure if I should have to start a new post) Thanks a lot 🤓
Since i’m already running it otherwise, i’ve been running stuff through Home Assistant and using lovelace dashboards.