Since TrueNAS Scale is Linux-based, I think this is as safe a place as any for me to rant a little about my current server issue that I briefly touched on here.
So I'm running a Dell R730XD, which essentially replaced both my Dell R710 and a Supermicro server that shares the same generation of Xeon chips as the R710. The Dell R710 was my VMware ESXi box and the Supermicro was my FreeNAS box, and the move to the R730XD was part of a consolidation project of mine. I don't like the direction VMware has gone since being acquired by Broadcom, getting rid of the free versions and jacking the costs up sky-high, so it was only natural to decide between TrueNAS (formerly FreeNAS) and Proxmox. Specifically, TrueNAS Scale and Proxmox are both Linux-based (Debian, I believe), whereas TrueNAS Core is FreeBSD-based.
Either way, the virtualization features of TrueNAS Scale and Proxmox interested me since I can still run my storage server all the same. The last unit was purely HDD based, but this new setup consists of both a larger HDD array and an SSD array to rival the old setup.
So here's a breakdown of what I've been running in the Dell R730XD, and why I chose TrueNAS Scale:
- 2 x Intel Xeon E5-2687W v4 Processors
- 192GB DDR4 ECC Memory
- Dell Intel X710 Quad Port 10GbE SFP+ Network Card
- Dell PERC H730P RAID Controller (Configured to HBA mode)
- Dell PowerEdge 12Gbps SAS HBA Controller (IT Mode) - (eSAS)
- Dell PLX PCI-e Switch Card (Connects the 4 x 2.5" U.2 NVMe drives on the front hotswap bay.)
- 2 x Dell EMC WD Ultrastar SS540 800GB Enterprise SAS SSD 2.5"
- 3 x Kioxia PM6-R 15.36TB 2.5" SAS SSD
- 1 x Samsung PM1643a 15.36TB 2.5" SAS SSD
- 4 x Intel Optane 905P 1.5TB 2.5" U.2 NVMe SSD
- NVIDIA Tesla P40 - 24GB GPU (I originally ran an EVGA NVIDIA Titan X - 12GB GPU in here.)
Then I have an HP Enterprise StorageWorks D2600 enclosure hooked up to the HBA controller, basically a SAS2 -> SAS3 connection, running:
- 6 x Western Digital Ultrastar DC HC550 18TB HDDs
Since the primary function of this server is to serve as my NAS, I opted for TrueNAS. It lets me keep all the 2.5" hotswap bays on the internal SAS controller, for both the SSD array for my storage and the mirrored SSDs for the operating system itself, so it just made more sense to go this route over Proxmox. Otherwise I would have had to get another storage controller and split the rear 2 x 2.5" SAS drives from the 24 x 2.5" bays up front, with PCI-e passthrough being a requirement for getting this stuff working.
The two 800GB SAS SSDs are mirrored for the install of TrueNAS Scale. The four Intel Optane drives are set up as two mirrored pairs and combined into their own pool for VMs. The four 15.36TB SSDs are likewise set up as two mirrored pairs in a single storage pool, for a total capacity of 30.72TB.
I then have the HDD array in RAIDZ2 for a 72TB capacity with the ability to lose two drives at once without data loss.
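For anyone who wants to sanity-check those capacity numbers, here's a quick Python sketch of the arithmetic. The function names are made up for illustration, and the figures are raw drive capacities only, before any ZFS overhead or TB vs TiB differences, so the usable space TrueNAS reports will be a bit lower.

```python
def mirror_pool_capacity(drive_tb: float, drives: int) -> float:
    """Raw capacity of a pool striped across 2-way mirrors.

    Each mirrored pair contributes the capacity of a single drive.
    """
    if drives < 2 or drives % 2:
        raise ValueError("2-way mirrors need an even number of drives")
    return drive_tb * (drives // 2)


def raidz_capacity(drive_tb: float, drives: int, parity: int) -> float:
    """Raw capacity of a single RAIDZ vdev: data drives = drives - parity."""
    if drives <= parity:
        raise ValueError("need more drives than parity drives")
    return drive_tb * (drives - parity)


# The pools described in this post (raw numbers, no ZFS overhead).
ssd_pool = mirror_pool_capacity(15.36, 4)    # two mirrored pairs of 15.36TB SSDs
optane_pool = mirror_pool_capacity(1.5, 4)   # two mirrored pairs of 1.5TB Optanes
hdd_pool = raidz_capacity(18, 6, parity=2)   # 6 x 18TB in RAIDZ2

print(f"SSD pool:    {ssd_pool:.2f} TB")
print(f"Optane pool: {optane_pool:.2f} TB")
print(f"HDD pool:    {hdd_pool:.2f} TB")
```

The RAIDZ2 line is just (6 - 2) x 18TB = 72TB, and the SSD pool is 2 x 15.36TB = 30.72TB.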
Storage-wise, everything's working out nicely. I have it configured to take snapshots of each dataset and replicate anything on the SSDs onto the HDDs, so in the event of anything going wrong with the drives or files, I have backups on hand within the same system. The snapshot feature is especially handy.
The secondary function of this server is to host both Emby and Plex, which is where the GPU comes into play. Then there's a VM for my Palworld server and other random game servers I want to run for me and my friends.
The tertiary function more or less was running Stable Diffusion, which was done in the same virtual machine I ran Plex and Emby on.
I had zero issues with isolating the Titan X and doing a PCI-e passthrough to the specific VM running Emby, Plex, and Stable Diffusion. The only things I had to do were apply an NVIDIA driver patch in the VM itself to unlock the artificial limit on NVENC streams needed for Emby and Plex, and give the VM access to all 6 CPU cores, especially for the 4K content.
Then here comes the problem...
I bought the NVIDIA Tesla P40 off my IT buddy, and after two days of struggling I've yet to get it to work in the VM. It loads and works flawlessly under TrueNAS itself, and I even installed an Emby docker container through the available apps on TrueNAS, which is a click of a button to install, and things transcode just fine with GPU acceleration. But the moment I isolate the GPU and try to pass it into the virtual machine, the VM just won't POST; the console remains a black screen like a dead computer. The logs don't show any specific errors either, just the same boot process as before. If I remove the graphics card PCI-e passthrough device, the VM boots no problem.
It's been a royal pain in the butt. Some people have run across similar issues with TrueNAS, Proxmox, and even some cases of VMware ESXi, but in what few threads I found on it, answers are scarce. I've expanded my search to the NVIDIA Tesla cards in general, including the baby P4 model. One person with a P4 had the exact same issue I did and got zero help from anyone, but found that disabling Ensure Display Device worked for him. Sadly that didn't resolve my issue.
I've even looked into some BIOS settings related to memory mapping, and the virtualization stuff is all enabled, which is how the IOMMU grouping and the original passthrough of the Titan X simply worked to begin with.
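If anyone wants to compare IOMMU groupings between cards, here's a rough Python sketch that reads them straight out of sysfs on the host. It assumes the standard Linux layout under /sys/kernel/iommu_groups (one numbered directory per group, each with a devices subdirectory); the function name is just something made up for this example, and it returns an empty result if the IOMMU isn't enabled.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} read from sysfs.

    Returns an empty dict if the directory does not exist
    (for example, when the IOMMU is disabled).
    """
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        # Skip anything that is not a numbered group directory.
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


# Print every group and the PCI addresses inside it.
for number, devices in sorted(list_iommu_groups().items()):
    print(f"group {number}: {', '.join(devices)}")
```

The GPU should ideally sit in a group by itself (plus its own audio function, if it has one), since everything in a group generally has to be passed through together.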
The thing is, I know you have to deal with GRID licensing with NVIDIA if you are going to use this GPU as a vGPU for multiple virtual machines, but I'm not trying to use it as a virtual GPU, just a direct passthrough. So I can't be sure if something funny is going on there, with the memory allocation, with TrueNAS Scale, some Linux issue, or what at this point, but it's been really mind-numbing.
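On the memory allocation angle, one thing that can at least be measured is how big the card's PCI memory regions (BARs) are compared to the Titan X. This is only a guess at a direction, not a confirmed cause. The Python sketch below parses the text of a sysfs resource file (on Linux, /sys/bus/pci/devices/<address>/resource, where each line is a start, end, and flags value in hex). The function name and the sample data are made up for illustration and are not readings from my P40.

```python
def parse_bar_sizes(resource_text):
    """Return [(region_index, size_in_bytes), ...] for each non-empty region.

    Expects the contents of a sysfs PCI `resource` file: one region
    per line, as three hex fields "start end flags".
    """
    sizes = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region
        sizes.append((index, end - start + 1))
    return sizes


# Made-up sample data for illustration, NOT real values from any card.
sample = """\
0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x0000002000000000 0x00000027ffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for index, size in parse_bar_sizes(sample):
    print(f"region {index}: {size / 2**20:.0f} MiB")
```

To use it on a real device, read the file for the GPU's PCI address on the host and pass the text in. If one card shows a much larger region than the other, that would be a reason to look harder at the BIOS and VM memory-mapping settings, though again that's an assumption to test rather than a diagnosis.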
