Hi all, I have been chasing answers for this for months on the Proxmox forums, Reddit, and the LevelOneTechs forums and haven’t gotten much guidance, unfortunately. Hoping Lemmy will be the magic solution!

Perhaps I couched my initial thread in too much detail, so after some digging I got to more focused questions for my follow-up (effectively what this thread is), but I still didn’t get much of a response!

In all this time, I have had one more thought I haven’t asked elsewhere. If I “Add Storage” for my ZFS pools in Proxmox, even though the content categories don’t really fit data storage (they are things like VM data, CT data, ISOs, and snippets), I could then attach these to a VM or CT and replicate them via “normal” Proxmox cluster replication. Is it OK to add such data pools to the storage section this way?
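
To make the distinction behind this idea concrete, here is a minimal sketch of how the two kinds of mount point entries can be told apart in a container config. It assumes the volume part of an "mpN" line is either a host path (bind mount), a "/dev/..." path (device), or "storage:volume" (storage-backed). The sample lines are copied from configs posted further down this thread, and classify_volume and parse_mount_points are hypothetical helper names, not Proxmox tooling. As far as I understand, only storage-backed volumes are managed by the Proxmox storage layer, which is what its snapshots and replication act on.

```python
# Hypothetical helper: classify mpN lines from a container config by volume type.
# Assumption: "/dev/..." is a device mount point, any other absolute path is a
# bind mount, and "storage:volume" is a storage-backed volume.
SAMPLE_CONFIG = """\
mp0: /spynet/NVR,mp=/mnt/NVR,replicate=0,shared=1
mp1: /holocron/Documents,mp=/mnt/Documents,replicate=0,shared=1
mp2: local-zfs:subvol-105-disk-1,mp=/mountpoint,replicate=0,size=8G
"""


def classify_volume(volume):
    """Return 'device', 'bind', 'storage-backed' or 'unknown' for a volume string."""
    if volume.startswith("/dev/"):
        return "device"
    if volume.startswith("/"):
        return "bind"
    if ":" in volume:
        return "storage-backed"
    return "unknown"


def parse_mount_points(config_text):
    """Map each mpN key to its volume, its kind, and its key=value options."""
    result = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Only lines of the form "mp<number>: ..." are mount point entries.
        if not sep or not key.startswith("mp") or not key[2:].isdigit():
            continue
        volume, *options = value.strip().split(",")
        entry = {"volume": volume, "kind": classify_volume(volume)}
        for option in options:
            name, _, option_value = option.partition("=")
            entry[name] = option_value
        result[key] = entry
    return result


mount_points = parse_mount_points(SAMPLE_CONFIG)
for name, entry in mount_points.items():
    print(name, entry["kind"], entry["volume"])
```

In this sample, mp0 and mp1 come out as bind mounts and mp2 as storage-backed, which matches the difference between my current config and the format suggested in the replies.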

Anyway, to the main course - the summary of what I’m seeking help on is below.

Long story short:

  • 3 nodes in a cluster, using ZFS.
    • CTs and VMs are replicated across nodes via GUI.
  • I want to replicate data in ZFS pools (which my CTs and VMs use - CTs through bind-mounts, VMs through NFS shares) to the other nodes.
    • Currently using Sanoid/Syncoid to make this happen from one node to two others via cronjob.

So three questions:

  1. If I do Sanoid/Syncoid on all three nodes (to each other) - is this stupid, and will it fail - or will each node recognize snapshots for a ZFS pool and incrementally update if needed (even if the snapshot was made on/by a different node)?
    • As a sub-question to this - and kind of the point of my overall thread and the previous one - is this even a sensible way to approach this, or is there a better way?
  2. For the GUI-based replication tasks, since I have CTs replicating to other systems, if I unchecked “skip replication” for the bind-mounted ZFS pools - would this accomplish the same thing? Or would it fail outright? I seem to remember needing to check this for some reason.
  3. Is PVE-zsync suitable for my situation? I see mention of no recursive syncing; I don’t fully know what that means or whether it’s a dealbreaker. I suppose if this is the correct choice, then I need to delete my current GUI-based CT/VM replication tasks?
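
For question 1, the part that matters is whether the sending and receiving side share a snapshot. A ZFS snapshot keeps its guid when it is sent to another pool, so a snapshot that was replicated is the "same" snapshot everywhere, while a snapshot created separately on another node is a different one even if the name matches. The sketch below is a simplified model of that idea, not Syncoid's actual code: the function names and guid values are made up for illustration, and it ignores the case where the target has changed since the common snapshot (which would need a rollback on the receiving side).

```python
# Simplified model, not Syncoid's actual code: each snapshot is a (name, guid)
# pair, listed oldest to newest. The guid values are made up for illustration.

def latest_common_snapshot(source, target):
    """Return the newest source snapshot whose guid also exists on the target, or None."""
    target_guids = {guid for _name, guid in target}
    for name, guid in reversed(source):
        if guid in target_guids:
            return (name, guid)
    return None


def plan_send(source, target):
    """Describe which kind of send is possible from source to target."""
    if not source:
        return "nothing to send"
    base = latest_common_snapshot(source, target)
    newest = source[-1]
    if base is None:
        return "full send of " + newest[0]
    if base == newest:
        return "already up to date"
    return "incremental send from " + base[0] + " to " + newest[0]


node_a = [("snap1", 101), ("snap2", 102), ("snap3", 103)]
node_b = [("snap1", 101), ("snap2", 102)]  # received from node_a earlier
node_c = [("snap1", 901)]                  # same name, but created locally on node_c

print(plan_send(node_a, node_b))
print(plan_send(node_a, node_c))
```

In this model node_b can be updated incrementally because it holds snapshots that originated on node_a, while node_c cannot, since its snapshot only shares a name. A full send into a dataset that already exists generally needs the target to be overwritten or recreated, which is why I want to avoid nodes snapshotting the same dataset independently.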

For those with immense patience, here was the original thread with more detail:

Hi all, so I set up three Proxmox servers (two identical, one “analogous”) - and the basics about the setup are as follows:

  • VMs and CTs are replicated every 15 minutes from one node to the other two.
  • One CT runs Cockpit for SMB shares - bind-mounted to the ZFS pools with the datasets that are SMB-shared.
    • I use this for accessing folders via GUI over the network from my PC.
  • One CT runs an NFS server (no special GUI, only CLI) - bind-mounted to the ZFS pools with the datasets that are NFS-shared (same as SMB-shared ones).
    • Apps that need to tap into data use NFS shares (such as Jellyfin, Immich, Frigate) provided by this CT.
  • Two VMs are of Debian, running Docker for my apps.
  • VMs and CTs are all stored on 2x2TB M.2 NVMe SSDs.
  • Data is stored in folders (per the NFS/SMB shares) on a 4x8TB ZFS pool with specific datasets like Media, Documents, etc. and a 1x4TB SSD ZFS “pool” for Frigate camera footage storage.

Due to having hardware passed through to the VMs (GPU and Google Coral TPU) and using hardware resource mappings (one node has an Nvidia RTX A2000, two have Nvidia RTX 3050s - they can all use the same mapped resource node ID and pass through without issue despite being different GPUs), I don’t have instant HA failover.

Additionally, as I am using ZFS with data on all three separate nodes, I understand that I have a “gap” window in the event of HA where the data on one of the other nodes may not be all the way up-to-date if a failover occurs before a replication.

So after all the above - this brings me to my question - what is the best way to replicate data that is not a VM or a CT, but raw data stored on those ZFS pools for the SMB/NFS shares - from one node to another?

I have been using Sanoid/Syncoid installed on one node itself, with cronjobs. I’m sure I’m not using it perfectly (boy did I have a “fun” time with the config files), and I did have a headache with retention and getting rid of ZFS snapshots in a timely manner to not fill up the drives needlessly - but it seems to be working.
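
To show what I mean by retention, here is a simplified count-based sketch: keep the newest N snapshots of each type and prune the rest. This is not Sanoid's actual pruning logic. The snapshot name format ("autosnap_<date>_<time>_<type>"), the KEEP policy, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Simplified count-based retention, not Sanoid's actual pruning logic.
# Assumption: snapshot names look like "autosnap_2024-05-01_13:00:00_hourly".
KEEP = {"hourly": 2, "daily": 1}  # hypothetical policy: how many of each type to keep


def parse_snapshot(name):
    """Split an assumed autosnap name into (timestamp, type)."""
    _prefix, date, time, kind = name.split("_")
    return datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M:%S"), kind


def snapshots_to_prune(names, keep):
    """Return the names that fall outside the newest-N window for their type."""
    by_kind = defaultdict(list)
    for name in names:
        timestamp, kind = parse_snapshot(name)
        by_kind[kind].append((timestamp, name))
    prune = []
    for kind, items in by_kind.items():
        items.sort(reverse=True)  # newest first
        # Types without a policy entry are kept entirely.
        limit = keep.get(kind, len(items))
        prune.extend(name for _ts, name in items[limit:])
    return sorted(prune)


names = [
    "autosnap_2024-05-01_11:00:00_hourly",
    "autosnap_2024-05-01_12:00:00_hourly",
    "autosnap_2024-05-01_13:00:00_hourly",
    "autosnap_2024-04-30_00:00:00_daily",
    "autosnap_2024-05-01_00:00:00_daily",
]
for name in snapshots_to_prune(names, KEEP):
    print("would prune:", name)
```

With this policy the oldest hourly and the older daily snapshot are the ones marked for pruning. The point is only that the counts decide how much space old snapshots keep pinned on the pool.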

I just set up the third node (the “analogous” one in specs), which I want to be the active “primary” node, and I need to copy data over from the current primary node. I just want to do it intelligently, and then have this node, in its new primary role, take over the replication of data to the other two nodes.

Would so very, very badly appreciate guidance from those more informed/experienced than me on such topics.

  • pr0927@lemmy.world (OP)
    8 days ago (edited)

    Ah, I see - this is effectively the same as the first image I shared, but via shell instead of GUI, right?

    For my NFS server CT, my config file is as follows currently, with bind-mounts:

    arch: amd64
    cores: 2
    hostname: bridge
    memory: 512
    mp0: /spynet/NVR,mp=/mnt/NVR,replicate=0,shared=1
    mp1: /holocron/Documents,mp=/mnt/Documents,replicate=0,shared=1
    mp2: /holocron/Media,mp=/mnt/Media,replicate=0,shared=1
    mp3: /holocron/Syncthing,mp=/mnt/Syncthing,replicate=0,shared=1
    net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.0.1,hwaddr=BC:24:11:62:C2:13,ip=192.168.0.82/24,type=veth
    onboot: 1
    ostype: debian
    rootfs: ctdata:subvol-101-disk-0,size=8G
    startup: order=2
    swap: 512
    lxc.apparmor.profile: unconfined
    lxc.cgroup2.devices.allow: a
    lxc.cap.drop:
    

    For full context, my list of ZFS pools (yes, I’m a Star Wars nerd):

    NAME                                    USED  AVAIL  REFER  MOUNTPOINT
    holocron                               13.1T  7.89T   163K  /holocron
    holocron/Documents                     63.7G  7.89T  52.0G  /holocron/Documents
    holocron/Media                         12.8T  7.89T  12.8T  /holocron/Media
    holocron/Syncthing                      281G  7.89T   153G  /holocron/Syncthing
    rpool                                  13.0G   202G   104K  /rpool
    rpool/ROOT                             12.9G   202G    96K  /rpool/ROOT
    rpool/ROOT/pve-1                       12.9G   202G  12.9G  /
    rpool/data                               96K   202G    96K  /rpool/data
    rpool/var-lib-vz                        104K   202G   104K  /var/lib/vz
    spynet                                 1.46T  2.05T    96K  /spynet
    spynet/NVR                             1.46T  2.05T  1.46T  /spynet/NVR
    virtualizing                           1.20T   574G   112K  /virtualizing
    virtualizing/ISOs                       620M   574G   620M  /virtualizing/ISOs
    virtualizing/backup                     263G   574G   263G  /virtualizing/backup
    virtualizing/ctdata                    1.71G   574G   104K  /virtualizing/ctdata
    virtualizing/ctdata/subvol-100-disk-0  1.32G  6.68G  1.32G  /virtualizing/ctdata/subvol-100-disk-0
    virtualizing/ctdata/subvol-101-disk-0   401M  7.61G   401M  /virtualizing/ctdata/subvol-101-disk-0
    virtualizing/templates                  120M   574G   120M  /virtualizing/templates
    virtualizing/vmdata                     958G   574G    96K  /virtualizing/vmdata
    virtualizing/vmdata/vm-200-disk-0      3.09M   574G    88K  -
    virtualizing/vmdata/vm-200-disk-1       462G   964G  72.5G  -
    virtualizing/vmdata/vm-201-disk-0      3.11M   574G   108K  -
    virtualizing/vmdata/vm-201-disk-1       407G   964G  17.2G  -
    virtualizing/vmdata/vm-202-disk-0      3.07M   574G    76K  -
    virtualizing/vmdata/vm-202-disk-1      49.2G   606G  16.7G  -
    virtualizing/vmdata/vm-203-disk-0      3.11M   574G   116K  -
    virtualizing/vmdata/vm-203-disk-1      39.6G   606G  7.11G  -
    

    So you’re saying to list the relevant four ZFS datasets in there but, instead of as bind-points, as virtual drives (as seen in the “rootfs” line)? Or rather, as “storage backed mount points” from here:

    https://pve.proxmox.com/wiki/Linux_Container#_storage_backed_mount_points

    Hopefully I’m on the right track!

    • ikidd@lemmy.world
      8 days ago

      Add a fresh disk from one of your ZFS backed storages instead of a mountpoint from a directory. When I do that I get a mount like:

      mp0: local-zfs:subvol-105-disk-1,mp=/mountpoint,replicate=0,size=8G

      Can cat me /etc/pve/storage.cfg so I can see where your mp’s are coming from? I’d expect to see them show up as storagename:dataset like mine unless you’re mounting them as directory type.

      • pr0927@lemmy.world (OP)
        8 days ago

        So currently I haven’t re-added any of the data-storing ZFS pools to the Datacenter storage section (wanted to understand what I’m doing before trying anything). Right now my storage.cfg reads as follows (without having added anything):

        zfspool: virtualizing
                pool virtualizing
                content images,rootdir
                mountpoint /virtualizing
                nodes chimaera,executor,lusankya
                sparse 0
        
        zfspool: ctdata
                pool virtualizing/ctdata
                content rootdir
                mountpoint /virtualizing/ctdata
                sparse 0
        
        zfspool: vmdata
                pool virtualizing/vmdata
                content images
                mountpoint /virtualizing/vmdata
                sparse 0
        
        dir: ISOs
                path /virtualizing/ISOs
                content iso
                prune-backups keep-all=1
                shared 0
        
        dir: templates
                path /virtualizing/templates
                content vztmpl
                prune-backups keep-all=1
                shared 0
        
        dir: backup
                path /virtualizing/backup
                content backup
                prune-backups keep-all=1
                shared 0
        
        dir: local
                path /var/lib/vz
                content snippets
                prune-backups keep-all=1
                shared 0
        

        Under my ZFS pools (same on each node), I have the following:

        The “holocron” pool is a RAIDZ1 combo of 4x8TB HDDs, “virtualizing” is RAID mirrored 2x2TB SSDs, and “spynet” is a single 4TB SSD (NVR storage).

        When you say to “add a fresh disk” - you just mean to add a resource to a CT/VM, right? I trip on the terminology at times, haha. And would it be wise to add the root ZFS pool (such as “holocron”) or to add specific datasets under it (such as “Media” or “Documents”)?

        I’m intending to create a test dataset under “holocron” to test this all out before I put my real data through any risk, of course.

        • ikidd@lemmy.world
          7 days ago (edited)

          Another advantage to always specifying your resources as zfs storage backed is snapshotting. I have one CT docker host and one VM docker host, both ZFS backed, and when I make any major changes or upgrades, I snapshot the guest, and sometimes clone it if I want to test it before pushing it to prod. I have had to revert them because I didn’t like what I did and it’s super simple then. Never having used directory resources, I don’t know how snapshotting works but I can’t imagine it’s the same since PM isn’t triggering the snapshot mechanism in ZFS if it’s a directory mountpoint.

          Backups via PBS are also snapshotted, so you don’t run the risk of having database-stored metadata out of sync with the storage.

        • ikidd@lemmy.world
          7 days ago

          Yes, add a resource and specify it as the ZFS storage you want. I’m mystified why your disk resources in your NFS CT show up as directories and not zfs:dataset unless you were specifying them as the host mountpoint instead of as their ZFS dataset. I think if it’s as a directory, you need to take care of the underlying replication, and if it’s specified as a dataset, then PM will. You could probably just fix those to be ZFS specified mountpoints and you’d be fine.

          I only specify one pool on a real zpool to Proxmox and work from there, I don’t make ZFS pools in PM out of the underlying datasets. I find that method confusing but I can see why you do it, and I can’t imagine it would mess anything up by doing it.

          • pr0927@lemmy.world (OP)
            6 days ago

            Oh I added the disk resources via shell (via nano) to that config for the NFS server CT, following some guide for bind-mounts. I guess that’s the wrong format and treated them like directories instead of ZFS pools?

            I’ll follow the formatting you’ve used (and, I think, what results “naturally” from adding such a ZFS storage dataset via the GUI).

            And yeah I don’t think replication works if it’s not ZFS, so I need to fix that.

            Per your other comment - agreed regarding the snapshotting - it’s already saved me on a Home Assistant VM I have running, so I’d love to have that properly working for the actual data in the ZFS pools too.

            Is it generally a best practice to only create the “root” ZFS pool and not these datasets within Proxmox (or any hypervisor)?

            Thanks so much for your assistance BTW, this has all been reassuring that I’m not in some lost fool land, haha.

            • ikidd@lemmy.world
              6 days ago

              IDK about “best practice”. I think what you have going works fine, it’ll just create new datasets under those datasets that you’re linking to instead of a level higher like I end up with. I just find it easier to find snapshots, etc without having to drill down. But if you have other properties you want to apply to just those datasets like compression, etc, that’s a perfectly cromulent thing to do.

              That makes sense about the NFS mounts, then, because the format is what I would call a bind mount as well. I think resetting them to the actual zfs dataset will have a better result for you.

              NP on assistance, it looks like you were having some issues and I’m happy to share what little I know to help. Feel free to ping me if you need more help with this or other issues.

              • pr0927@lemmy.world (OP)
                3 days ago

                It took me a while to find a moment to give it a go, but I created a test dataset under my main ZFS pool and added it to a CT - it did snapshots and replication fine.

                The one question I have is - for the bind-mounts, I didn’t have any size set - and they accurately show remaining disk space for the pool they are on.

                Here it seems I MUST give a size? Is that correct? I didn’t really want to allocate a smaller size for any given dataset, if possible. I saw something about storage-backed mount points, and adding them (via config file, versus GUI) and setting “size=0” - if this is of a ZFS dataset, would this “turn” it into a directory and prevent snapshotting or replicating to other nodes?

                One last question - when I’m adding anything to the Datacenter’s storage section, do I want to check availability for all nodes? Does that matter?

                Thanks again!

                • ikidd@lemmy.world
                  2 days ago

                  I would try setting the size=0. AFAIK, you are expected to put a size in, but PM might be smart enough to derive the actual remaining size as correct. I’ve never come at this from behind like you are, I’ve always built my storage out directly into ZFS storage as I built the CTs.

                  As to the storage questions, I’ve always done it as all nodes, but I also set up and name my pools exactly the same across all nodes, they are hardware alike. I don’t know what would happen if there wasn’t a pool named that in the other node, I would guess it would just fail out and that would mess up HA and probably replication.

              • pr0927@lemmy.world (OP)
                6 days ago

                Ah, makes sense. Will try to give this a go tomorrow when I have time/energy. Appreciate it, definitely will do!