• 0 Posts
  • 35 Comments
Joined 1 year ago
Cake day: January 26th, 2025

  • Fairly significant factor when building really large systems. If we do the math, there end up being some relationships between:

    • disk speed
    • targets for ”resilver” time / risk acceptance
    • disk size
    • failure domain size (how many drives do you have per server)
    • network speed

    Basically, for a given risk acceptance and total system size there is usually a sweet spot for disk sizes.

    Say you want 16TB of usable space, and you want to be able to lose 2 drives from your array (fairly common requirement in small systems), then these are some options:

    • 3x16TB triple mirror
    • 4x8TB Raid6/RaidZ2
    • 6x4TB Raid6/RaidZ2

    The more drives you have, the better recovery speed you get and the less usable space you lose to replication. You also get more usable performance with more drives. Additionally, smaller drives are usually cheaper per TB (down to a limit).
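
    To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python for the three example layouts above. The 200 MB/s rebuild rate is an assumed placeholder, not a measured number; real resilver times depend on pool fullness, fragmentation and load.

```python
# Rough sketch: usable space and a naive resilver-time lower bound for the
# example layouts above. REBUILD_MBPS is an assumption, not a measured figure.

REBUILD_MBPS = 200  # assumed sustained rebuild throughput per drive, MB/s

layouts = [
    # (name, drive count, drive size in TB, redundancy drives)
    ("3x16TB triple mirror", 3, 16, 2),
    ("4x8TB RAIDZ2",         4,  8, 2),
    ("6x4TB RAIDZ2",         6,  4, 2),
]

for name, drives, size_tb, redundancy in layouts:
    raw_tb = drives * size_tb
    usable_tb = (drives - redundancy) * size_tb
    # Naive lower bound: time to rewrite one replacement drive end to end.
    hours = size_tb * 1e6 / REBUILD_MBPS / 3600
    print(f"{name}: raw {raw_tb} TB, usable {usable_tb} TB, "
          f"~{hours:.1f} h to rewrite one drive")
```

    All three give 16 TB usable with two-drive redundancy, but the smaller drives cut the naive rebuild time roughly in proportion to their size.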

    This means that 140TB drives become interesting if you are building large storage systems (probably at least a few PB), with low performance requirements (archives), but there we already have tape robots dominating.

    The other interesting use case is huge systems, a large number of petabytes, up into exabytes. More modern schemes for redundancy and caching mitigate some of the issues described above, but they are usually only relevant when building really large systems.

    tl;dr: arrays of 6-8 drives at 4-12TB are probably the sweet spot for most data hoarders.








  • Here I am, running separate Tailscale instances and a separate reverse proxy for like 15 different services, and that’s just one VM… All in all, probably 20-25 Tailscale instances on a single physical machine.

    Don’t think about Tailscale like a normal VPN. Just put it everywhere. Put it directly on your endpoints, don’t route. Then lock down all your services to the tailnet and shut down any open ports to the internet.
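
    As a minimal illustration of what “lock your services down to the tailnet” can mean in practice: bind the service only to the machine’s Tailscale address, so nothing listens on public interfaces. The address below is a placeholder; on a real host you’d look up your own, e.g. with `tailscale ip -4`.

```python
# Minimal sketch: serve only on the host's Tailscale address so the service
# is reachable from the tailnet but never exposed on public interfaces.
from http.server import HTTPServer, SimpleHTTPRequestHandler

TAILNET_ADDR = "100.101.102.103"  # placeholder; substitute your tailnet IP

server = HTTPServer((TAILNET_ADDR, 8080), SimpleHTTPRequestHandler)
print(f"Serving on {TAILNET_ADDR}:8080 (tailnet only)")
server.serve_forever()
```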


  • My NAS will stay on bare metal forever. Any added complication there is something I really don’t want. Passthrough of drives/PCIe devices works fine for most things, but I won’t use it for ZFS.

    As for services, I really hate using Docker images with a burning passion. I’m not trusting anyone else to make sure the container images are secure - I want security updates directly from my distribution’s repositories, I want them fully automated, and I want that inside any containers too. Having NixOS build and launch containers with systemd-nspawn solves some of it. The actual Docker daemon isn’t getting anywhere near my systems, but I do have one or two OCI images running. I will probably migrate to small per-service VMs once I get new hardware up and running.

    Additionally, I never found a source of container images I feel like I can trust long term. When I grab a package from Debian or RHEL, I know that package will keep working without any major changes to functionality or config until I upgrade to the next major. A container? How long will it get updates? How frequently? Will the config format or environment variables or mount points change? Will a threat actor assume control of the image? (Oh look, all the distros actually enforce GPG signatures in their repos!)

    So, what keeps me on bare metal? Keeping my ZFS pools safe. And then just keeping away from the OCI ecosystem in general, the grass is far greener inside the normal package repositories.


  • I mean, the passkey is still in there. It’s protected by convention. It’s a bearer token wrapped in a password manager, presented as a revolution.

    We have the technology; can we please pour the same amount of resources into what we’ve already had for decades? Passkeys solve the UX issue for ”normal people”, and that’s the selling point.





  • Lustre 2.16 got released recently, so in a year or so you may actually be able to run commercially supported Lustre with IPv6 support. Yay!

    After that, it’s only a matter of time before it’s finally possible to start testing supercomputers with IPv6! (And finally building a production system with IPv6 a few more years after that, when all the bugs have been squashed)

    Look at the Top500 list. Fucking everyone runs Lustre somewhere, and usually old versions. The US strategic nuclear weapons research is practically all on Lustre. My guess is most weather forecasting globally runs on Lustre. (Oh, and a shitton of AI of course.)

    Up until now, you were stuck with mounting your filesystem over IPv4 (well, kinda IPv4 over RDMA, ish). If you want commercial support for your hundreds of petabytes (you do), you still can’t migrate. And this isn’t a small indie project without testers, it’s commercially supported with billions in revenue, supporting compute hardware for even more money.

    My point with this rambling is that an open source project that is this widely deployed, this depended upon and this well funded still failed to roll out IPv6 support until now. The long tail of migrating the world to IPv6 hasn’t even begun yet; we are still in the early days. Soon someone will start looking at the widely deployed, depended upon and badly funded stuff.

    And maybe, if IPv6 hadn’t tried to change a bunch of extra stuff, we’d be further along. (Though, in the specific case of Lustre, I’ll gladly accuse DDN and Whamcloud of being incompetent…)


  • In the real world, addresses are an abstraction to provide knowledge needed to move something from point A to point B. We could use coordinates or refer to the exact office the recipient sits in, but we don’t. Actually, we usually try to keep it at a fairly high level of abstraction.

    The analogy is broken, because in the real world, we don’t want extremely exact addressing and transport without middlemen. We want abstract addresses, with transport routing partially to fully decoupled from the addressing scheme. GP provides a nice argument for IPv4.

    I know how NAT works, but we are working within the constraints of a very broken analogy here. Also yes, internal logistics can and will be the harbinger of unnecessary bureaucracy, especially when implemented correctly.


  • And yet, in the real world we actually use distribution centers and loading docks, we don’t go sending delivery boys point to point. At the receiving company’s loading docks, we can have staff specialise in internal delivery, and also maybe figure out if the package should go to someone’s office or a temporary warehouse or something. The receiver might be on vacation, and internal logistics will know how to figure out that issue.

    Meanwhile, the point-to-point delivery boy will fail to enter the building, then fail to find the correct office, then get rerouted to a private residence of someone on vacation (they need to sign personally of course), and finally we need another delivery boy to move the package to the loading dock where it should have gone in the first place.

    I get the ”let’s slaughter NAT” arguments, but this is an argument in favour of NAT. And in reality, we still need to have routing and firewalls. The exact same distribution network is still in use, but with fewer allowances for the recipient to manage internal delivery.

    Personal opinion: IPv6 should have been almost exactly the same as IPv4, just with more addresses and a clear path for transparent IPv6-to-IPv4 traffic without running dual stack (maybe a NAT?). IPv6 is too complex, too error-prone and too poorly supported to deploy without shooting yourself in the foot, even now, a few decades after its introduction.


  • ”the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s, but if this actually works in practice I don’t know.”

    Your math checks out, but only for some workloads. Other workloads scale out like shit, and then you want all your bandwidth concentrated. At some point you’ll also want to consider power draw:

    • One H200 is like 1500W when including support infrastructure like networking, motherboard, CPUs, storage, etc.
    • 58 consumer cards will be like 8 servers loaded with GPUs, at like 5kW each, so say 40kW in total.

    Now include power and cooling over a few years and do the same calculations.
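
    A quick sketch of that calculation in Python, using the wattages above; the electricity price, PUE factor and 3-year horizon are placeholder assumptions for illustration:

```python
# Back-of-the-envelope power cost over a multi-year horizon, using the
# wattages from the comment. Price per kWh, horizon and PUE are assumptions;
# cooling overhead is folded in via the PUE multiplier.

PRICE_PER_KWH = 0.15   # assumed electricity price, USD/kWh
YEARS = 3              # assumed depreciation horizon
PUE = 1.4              # assumed power usage effectiveness (cooling overhead)

HOURS = YEARS * 365 * 24

for name, watts in [("1x H200 node", 1_500), ("8 consumer-GPU servers", 40_000)]:
    kwh = watts / 1000 * HOURS * PUE
    cost = kwh * PRICE_PER_KWH
    print(f"{name}: ~{kwh:,.0f} kWh over {YEARS} years, ~${cost:,.0f} in power/cooling")
```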

    As for apples and oranges, this is why you can’t just look at the marketing numbers; you need to benchmark your workload yourself.


  • Well, a few issues:

    • For hosting or training large models you want high bandwidth between GPUs. PCIe is too slow; NVLink has literally an order of magnitude more bandwidth. See what Nvidia is doing with NVLink and AMD is doing with Infinity Fabric. Only available if you pay the premium, and if you need the bandwidth, you are most likely happy to pay.
    • Same thing as above, but with memory bandwidth. The HBM chips in an H200 will run circles around the GDDR garbage they hand out to the poor people with filthy consumer cards. By the way, your inference and training are most likely bottlenecked by memory bandwidth, not available compute.
    • Commercially supported cooling of gaming GPUs in rack servers? Lol. Good luck getting any reputable hardware vendor to sell you that, and definitely not at the power densities you want in a data center.
    • FP16 TFLOPS aren’t enough. Look at the 4-bit and 8-bit tensor numbers; that’s where the expensive silicon is used.
    • Nvidia’s licensing agreements basically prohibit gaming cards in servers. No one will sell you that at any scale.

    For fun, home use, research or small-time hacking? Sure, buy all the gaming cards you can. If you actually need support and have a commercial use case? Pony up. Either way, benchmark your workload, don’t look at marketing numbers.
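
    In that spirit, a minimal sketch of a memory-bandwidth probe (assuming PyTorch and a CUDA-capable GPU are available); the buffer size and iteration count are arbitrary choices:

```python
# Minimal sketch: measure effective device-memory copy bandwidth with PyTorch.
# A large on-device copy is usually a better proxy for inference throughput
# than the marketing TFLOP numbers.
import time
import torch

assert torch.cuda.is_available(), "needs a CUDA GPU"

N = 1 << 28                      # 256M fp16 elements, ~0.5 GB per buffer
src = torch.randn(N, device="cuda", dtype=torch.float16)
dst = torch.empty_like(src)

# Warm-up, then timed copies.
for _ in range(3):
    dst.copy_(src)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dst.copy_(src)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_moved = 2 * src.numel() * src.element_size() * iters  # read + write
print(f"~{bytes_moved / elapsed / 1e9:.0f} GB/s effective copy bandwidth")
```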

    Is it a scam? Of course, but you can’t avoid it.