As in, when I watch YouTube tutorials, I often see YouTubers with a small widget on their desktop that gives them an overview of their RAM usage, security level, etc. What apps do you all use to track this?

  • Pesfreak92@alien.topB
    11 months ago

    Uptime Kuma and Grafana: Uptime Kuma to monitor whether a service is up and running, and Grafana to monitor host metrics like CPU, RAM, SSD usage, etc.

    • SadanielsVD@alien.topB
      11 months ago

      This. If you have more servers, you can also connect them all to a single UI where you can see all the info at once, using Netdata Cloud.

      • Spaceman_Splff@alien.topB
        11 months ago

        Just set this up yesterday. I used a parent node and have all my VMs point to it. Took about an hour to figure out.

        • scotrod@alien.topB
          11 months ago

          Hey, did you use the cloud functionality or not? I’m trying to go all local with a parent-child setup, but so far I haven’t managed to.

          • Spaceman_Splff@alien.topB
            11 months ago

            The parent is still visible to the cloud portal. My understanding is that all the data resides locally, but when you log in to their cloud portal, it connects to the parent to display the information. I’m still playing with it to confirm. My parent node shows all the child nodes on the local interface, but the cloud still shows them all too.

  • AstrologicalMob@alien.topB
    11 months ago

    I currently use the classic “Huh, seems slow (checks basic things like disk usage and process CPU/RAM usage), I’ll do a reboot to fix it for now”.

  • borouhin@alien.topB
    11 months ago

    Alerts are much more important than fancy dashboards. You won’t be staring at your dashboard 24/7 and you probably won’t be staring at it when bad things happen.

    Creating your alert set is not easy. Ideally, every problem you encounter should be preceded by a corresponding alert, and no alert should be a false positive (one that requires no action). So if you either have a problem without being alerted by your monitoring, or get an alert that requires no action, you should sit down and think carefully about what should be changed in your alerts.

    As for tools, I recommend Prometheus + Grafana. There’s no need for a separate Alertmanager, as many guides recommend; recent versions of Grafana have excellent built-in alerting. Don’t use the ready-made dashboards. Start from scratch, because you need to understand PromQL to set everything up efficiently. Start with a simple dashboard (and alerts!) just for generic server health (node exporter), then add exporters for your specific services, network devices (SNMP), remote hosts (blackbox), SSL certs, etc. Then write your own exporters for whatever you haven’t found :)
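
    A minimal sketch of what that starting point could look like in `prometheus.yml`, assuming node exporter and the blackbox exporter run on their default ports (9100 and 9115). The hostnames, the probed URL, and the 30s interval are placeholders for illustration, not details from the setup described above:

    ```yaml
    # prometheus.yml (sketch; hostnames, URL and interval are assumptions)
    global:
      scrape_interval: 30s

    scrape_configs:
      # Generic server health via node exporter
      - job_name: node
        static_configs:
          - targets:
              - server1.example.lan:9100

      # Probe a remote host through the blackbox exporter
      - job_name: blackbox_http
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://example.org
        relabel_configs:
          # Pass the listed target as the ?target= parameter of the probe
          - source_labels: [__address__]
            target_label: __param_target
          # Keep the probed URL as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # Send the actual scrape to the blackbox exporter itself
          - target_label: __address__
            replacement: blackbox.example.lan:9115
    ```

    For HTTPS targets the blackbox probe also reports certificate expiry (`probe_ssl_earliest_cert_expiry`), which is one way to cover the SSL cert checks.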

    • io-x@alien.topB
      11 months ago

      I was looking at Loki + Grafana. Is Prometheus a replacement for Loki in this setup, and is it preferred?

      • borouhin@alien.topB
        11 months ago

        No, they serve different purposes. Loki is for logs, Prometheus is for metrics. Grafana helps to visualize data from both.
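
        To illustrate, both can sit side by side as Grafana data sources. A sketch of a Grafana provisioning file, assuming Prometheus and Loki are reachable at the hypothetical hostnames below on their default ports (9090 and 3100):

        ```yaml
        # datasources.yml (sketch; hostnames are assumptions)
        apiVersion: 1
        datasources:
          # Metrics
          - name: Prometheus
            type: prometheus
            access: proxy
            url: http://prometheus:9090
            isDefault: true
          # Logs
          - name: Loki
            type: loki
            access: proxy
            url: http://loki:3100
        ```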

    • Michaelscarn69-@alien.topOPB
      11 months ago

      Thank you for this. I think I need a deeper understanding of Prometheus. I’ll look into it. You are awesome

    • atheken@alien.topB
      11 months ago

      One thing about using Prometheus alerting is that it’s one less link in the chain that can break, and you can also keep your alerting configs in source control. So it’s a little less “click-ops,” but easier to reproduce if you need to rebuild it at a later date.
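
      As a rough illustration, a rules file like this can live in a git repo and be loaded through `rule_files` in `prometheus.yml`. The file name, thresholds, durations, and severity labels are assumptions picked for the example; the metrics come from node exporter:

      ```yaml
      # alerts.yml (sketch; thresholds and labels are assumptions)
      groups:
        - name: node-health
          rules:
            # Fires when a scrape target has been unreachable for 5 minutes
            - alert: InstanceDown
              expr: up == 0
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "{{ $labels.instance }} is not being scraped"
            # Fires when the root filesystem has less than 10% free space
            - alert: RootDiskAlmostFull
              expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: "Less than 10% free on / ({{ $labels.instance }})"
      ```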

  • Mother_Construction2@alien.topB
    11 months ago

    I know that it needs a fix when my dad complains that he can’t watch TV and the rolling door doesn’t open in the morning.

  • HCharlesB@alien.topB
    11 months ago

    Checkmk (Raw, the free version). Some setup aspects are a bit annoying: it wants to monitor every last ZFS dataset, and it takes too long to ‘ignore’ them one by one. It does alert me to things that could cause issues, like the boot partition being almost full. I run it in a Docker container on my (primarily) file server.

    • djbon2112@alien.topB
      11 months ago

      I second CMK.

      A TICK stack is unwieldy, Grafana takes a lot of setup, and all of this assumes you both know what to monitor and get stats on it.

      CMK, by contrast, is plug and play. Install the server on a VM or host, install the agent on your other systems, and you’re good to go.

      • joshiegy@alien.topB
        11 months ago

        I’m running a TICK stack with a couple thousand servers, with way less CPU usage than Checkmk/Nagios or anything else from the previous millennium …

        • djbon2112@alien.topB
          11 months ago

          How do you solve the problem of runaway memory usage? Even monitoring a few dozen hosts, memory usage would grow to many GB and continue to grow indefinitely until it OOM’d, and from my reading Influx has no way to prevent this.

  • Theon@alien.topB
    11 months ago

    Netdata. I’ve been meaning to look into Grafana, but it always seemed way too complicated and heavy for my purposes. Maybe one day, though…

  • JoeB-@alien.topB
    11 months ago

    I use Telegraf + InfluxDB + Grafana for monitoring my home network and systems. Grafana has a learning curve for building panels and dashboards, but is incredibly flexible. I use it for more than server performance. I have a dual-monitor “kiosk” (old Mac mini) in my office displaying two Grafana dashboards. These are:

    Network/Power/Storage showing:

    • firewall block events & sources for last 12 hrs (from pfSense via Elasticsearch),
    • current UPS statuses and power usage for last 12 hrs (Telegraf apcupsd plugin -> InfluxDB),
    • WAN traffic for last 12 hrs (from pfSense via Telegraf -> InfluxDB),
    • current DHCP clients (custom Python script -> MySQL), and
    • current drive and RAID pool health (custom Python scripts -> MySQL)

    Server sensors and performance showing:

    • current status of important cron jobs (using Healthchecks -> Prometheus),
    • current server CPU usage and temps, and memory usage (Telegraf -> InfluxDB)
    • server host CPU usage and temps, and memory usage for last 3 hrs (Telegraf -> InfluxDB)
    • Proxmox VM CPU and memory usage for last 3 hrs (Proxmox -> InfluxDB)
    • Docker container CPU and memory usage for last 3 hrs (Telegraf Docker plugin -> InfluxDB)

    Netdata works really well for system performance for Linux and can be installed from the default repositories of major distributions.

    • daniel280187@alien.topB
      11 months ago

      Network/Power/Storage

      Pretty cool dashboards. I liked the DHCP clients info, does it also report DHCP reservations?

      Where do you do DHCP, on pfSense or somewhere else?

  • Dogeek@alien.topB
    11 months ago

    Oh lord, I have so much info to give! My setup runs on Kubernetes 1.28.2, so YMMV. My monitoring stack is:

    • Grafana – Dashboards
    • Alertmanager – Alerting
    • Prometheus – Time series Database
    • Loki – Logs database
    • Promtail – Log collector
    • Mimir – Long-term metrics storage
    • Tempo – Tracing, like Datadog APM but with Grafana. It lets you track requests through a network of services, which is invaluable for linking your reverse proxy to your apps, to your SSO, to your database…
    • SMTP Relay – A homemade SMTP relay that eases setting up mail alerts, allows me to push mail through mailjet using my domain
    • Node-exporter – exports metrics for the server
    • Exportarr – exports metrics for sonarr/radarr etc
    • pihole-exporter – exports pihole metrics for prometheus scraping
    • smart-exporter – exports S.M.A.R.T metrics (for HDD health)
    • ntfy – for notifications to my phone (other than mail)

    The rest is pretty much the same: if a service exports Prometheus metrics by default, I use that and write a ServiceMonitor and a Service manifest for it. It usually looks like this:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: traefik
      labels:
        app.kubernetes.io/component: traefik
        app.kubernetes.io/instance: traefik
        app.kubernetes.io/managed-by: kustomize
        app.kubernetes.io/name: traefik
        app.kubernetes.io/part-of: traefik
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: traefik-metrics
      endpoints:
      - port: metrics
        interval: 30s
        path: /metrics
        scheme: http
        tlsConfig:
          insecureSkipVerify: true
      namespaceSelector:
        matchNames:
        - traefik
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: traefik-metrics
      namespace: traefik
      labels:
        app.kubernetes.io/name: traefik-metrics
    spec:
      type: ClusterIP
      ports:
        - protocol: TCP
          name: metrics
          port: 8082
      selector:
        app.kubernetes.io/name: traefik
    

    If the app doesn’t include a Prometheus endpoint, I just find an existing exporter for it. Most popular apps have one, along with ready-made Grafana dashboards.

    For alerting, I create a PrometheusRule object with the Prometheus query and the message to alert me with. Depending on the severity, it’s either a mail (for medium-to-low severity incidents) or a phone notification (for high severity). I try to keep mails and notifications to a minimum: only load, CPU, RAM, and potential SMART errors trigger alerts.
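
    A sketch of what such a PrometheusRule might look like, using node exporter metrics. The object name, namespace, thresholds, and severity values are assumptions for illustration. Routing a given severity to mail or to a phone notification would be handled in the Alertmanager configuration, which is not shown, and depending on how the Prometheus operator is set up the object may also need labels matching its rule selector:

    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: node-alerts        # hypothetical name
      namespace: monitoring    # assumed namespace
    spec:
      groups:
        - name: node
          rules:
            # Memory in use above 90% for 10 minutes
            - alert: HighMemoryUsage
              expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
              for: 10m
              labels:
                severity: medium
              annotations:
                summary: "Memory usage above 90% on {{ $labels.instance }}"
            # Average CPU busy time above 90% for 15 minutes
            - alert: HighCpuUsage
              expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
              for: 15m
              labels:
                severity: high
              annotations:
                summary: "CPU usage above 90% on {{ $labels.instance }}"
    ```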