KAYTUS ships KSManage Ultra for the AI factory data center

KAYTUS has introduced KSManage Ultra, a full-stack AI infrastructure management platform built for high-density AI racks. The product provides centralized visibility across GPUs, CPUs, memory, networking, power, and liquid cooling, plus a fault-isolation layer meant to flag risky nodes before they are assigned to a training or inference job. The launch is a direct answer to the operational gap that has emerged as AI rack density has moved past what a traditional data center management tool can handle.

AI racks now combine GPUs, high-bandwidth networking equipment, power distribution units, cooling distribution units, and direct-liquid cooling loops into one tightly integrated system, and the failure modes are no longer single-component failures. A bad PCIe link, a firmware mismatch, or a coolant leak on a single node can compromise a multi-day training run, and a typical BMC and IPMI stack is not built to correlate the system-level signal the operator actually needs. KSManage Ultra is KAYTUS's bet that the management plane for these racks will look very different from the management plane for a conventional server fleet, and that the first vendor to ship that plane at scale will own the procurement default for the next generation of AI factory buildouts.

KSManage Ultra's take on full-stack AI data center visibility

The product unifies in-band and out-of-band management into a single interface and ties together operating system, application, hardware health, power consumption, and temperature data with the liquid cooling telemetry that has historically sat in a separate operations stack. That correlation is the part that matters for AI workloads, because the failure mode is rarely a single GPU going bad. It is a memory channel degrading under sustained inference load, a PCIe link training at the wrong speed, a cooling loop running hotter than expected during a long pretraining run, or a firmware version drift between nodes that turns a checkpoint into a silent inconsistency. The platform watches the GPU, the memory, the PCIe links, the network, the firmware consistency, the cooling system, and the power infrastructure continuously, and isolates any node it flags as faulty or high risk before that node is scheduled onto a job.
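Firmware drift is the easiest of those failure modes to illustrate. The sketch below is a hypothetical Python example, not KSManage Ultra's API: the function name, the node names, the component names, and the version strings are all invented for illustration. It assumes the baseline is simply the most common version per component across the fleet, where a real deployment would more likely pin an approved version.

```python
from collections import Counter

def find_firmware_drift(versions_by_node):
    """Return {node: {component: (found, expected)}} for nodes off the baseline."""
    # Collect every version seen per component across the fleet.
    seen = {}
    for components in versions_by_node.values():
        for component, version in components.items():
            seen.setdefault(component, []).append(version)

    # Assumption: the most common version is the baseline for each component.
    baseline = {c: Counter(v).most_common(1)[0][0] for c, v in seen.items()}

    # Report every node whose version differs from that baseline.
    drift = {}
    for node, components in versions_by_node.items():
        for component, version in components.items():
            if version != baseline[component]:
                drift.setdefault(node, {})[component] = (version, baseline[component])
    return drift

# Made-up inventory: node-03 runs an older BMC firmware than its peers.
fleet = {
    "node-01": {"bmc": "2.4.1", "gpu_vbios": "96.00.89"},
    "node-02": {"bmc": "2.4.1", "gpu_vbios": "96.00.89"},
    "node-03": {"bmc": "2.3.9", "gpu_vbios": "96.00.89"},
}
print(find_firmware_drift(fleet))  # {'node-03': {'bmc': ('2.3.9', '2.4.1')}}
```

A check like this is cheap to run before every job, which is the point of doing it in the management plane rather than discovering the mismatch from an inconsistent checkpoint.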

The liquid cooling path is the one most enterprise data center teams have been asking for and not getting from incumbent vendors. KSManage Ultra monitors coolant systems at multiple levels, detects leaks early, automatically shuts down affected nodes, isolates the affected equipment from the rest of the cluster, and sends alerts to operators. The pattern is similar to the fire-suppression playbook that traditional data centers have used for years, but applied to direct-liquid cooling, where a single leak can take out a rack faster than any software failure. The platform correlates the cooling telemetry with the compute and network telemetry so an operator can tell the difference between a node that is overheating because of a bad fan and a node that is overheating because the workload itself is running hot, which is the kind of signal that prevents a false positive from killing a job.
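To make the correlation concrete, here is a minimal Python sketch of how cooling and compute telemetry might be combined into one decision. Everything in it is an assumption made for illustration: the field names, the thresholds, and the returned action strings are hypothetical, coolant flow stands in for whatever cooling-health signal a real system exposes, and none of it reflects KAYTUS's actual logic.

```python
from dataclasses import dataclass

@dataclass
class NodeSample:
    gpu_temp_c: float        # hottest GPU on the node
    gpu_util_pct: float      # average GPU utilization
    coolant_flow_lpm: float  # coolant flow through the node's cold plates
    leak_detected: bool      # leak sensor state

def classify(sample, temp_limit_c=85.0, min_flow_lpm=4.0, busy_util_pct=90.0):
    """Map one telemetry sample to an action. All thresholds are hypothetical."""
    # A leak overrides everything else: shut down, isolate, alert.
    if sample.leak_detected:
        return "leak: shut down, isolate, alert"
    if sample.gpu_temp_c < temp_limit_c:
        return "ok"
    # Hot with poor coolant flow points at the cooling path, not the job.
    if sample.coolant_flow_lpm < min_flow_lpm:
        return "cooling fault: isolate node"
    # Hot with healthy cooling and a saturated GPU points at the workload.
    if sample.gpu_util_pct >= busy_util_pct:
        return "workload heat: keep job running, notify"
    return "unexplained heat: flag for inspection"

print(classify(NodeSample(92.0, 40.0, 1.5, False)))
print(classify(NodeSample(92.0, 98.0, 6.0, False)))
```

Both samples report the same GPU temperature, but only the first one leads to isolation. A temperature alarm alone cannot make that distinction.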

The deployment story is the part enterprise buyers will care about most. KAYTUS claims KSManage Ultra reduces rack onboarding time from around 50 minutes to less than three minutes through batch scanning, automatic topology mapping, and template-based stress testing, driver installation, hardware configuration, and software deployment. The automation is more important than the headline number. A typical AI cluster rollout today is bottlenecked less by GPU supply than by the time it takes to bring a rack online, validate it, and integrate it with the existing scheduler, configuration management database, and observability stack, and that bottleneck is what is slowing enterprise AI factory buildouts. KSManage Ultra's open-API architecture integrates with scheduling platforms, configuration management databases, servers, networking equipment, and cooling systems, which is the part that lets the platform fit into a heterogeneous AI infrastructure footprint rather than forcing a customer to rip and replace.

How the platform isolates faulty nodes and prevents wasted compute

The fault-isolation path is built around the same principle as the leak detection path. Continuous health evaluation across GPUs, memory, PCIe links, networking, firmware consistency, cooling, and power infrastructure produces a per-node risk score, and the scheduler can be told to keep any flagged node off the next job. KAYTUS frames this as a utilization and availability win, because the alternative is to discover a hardware problem mid-job, when a training run is already producing checkpoints and the cost of restarting is measured in GPU-hours, not minutes. The same telemetry also lets the platform pre-position replacement parts and trigger a maintenance window before a node fails, which is a more honest prediction story than the AI-ops vendor pitches that have dominated the last two years.
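The gating step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's scoring model: the announcement does not say how the risk score is computed, so the sketch assumes each subsystem reports a value from 0.0 (healthy) to 1.0 (failing), takes the worst one as the node's score, and treats a missing signal as worst case. The function names, node names, and the 0.7 threshold are hypothetical.

```python
# Subsystems named in the announcement; the scoring scheme itself is assumed.
SIGNALS = ("gpu", "memory", "pcie", "network", "firmware", "cooling", "power")

def risk_score(health):
    """Worst-of score: one bad subsystem is enough to make the node risky."""
    # A signal that is not reported counts as 1.0 (worst case).
    return max(health.get(signal, 1.0) for signal in SIGNALS)

def schedulable(nodes, threshold=0.7):
    """Return the sorted names of nodes whose risk score is below the threshold."""
    return sorted(name for name, health in nodes.items()
                  if risk_score(health) < threshold)

healthy = {signal: 0.1 for signal in SIGNALS}
nodes = {
    "rack1-n01": healthy,
    "rack1-n02": {**healthy, "pcie": 0.9},  # degraded PCIe link
    "rack1-n03": {k: v for k, v in healthy.items() if k != "cooling"},  # no cooling data
}
print(schedulable(nodes))  # ['rack1-n01']
```

Taking the maximum rather than an average is a deliberate choice in this sketch: a node with six healthy subsystems and one failing PCIe link should not average its way onto a multi-day job.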

The open-API architecture is the second part of the bet. KAYTUS is positioning KSManage Ultra as a management plane that sits between the AI factory's heterogeneous compute fleet and the scheduler and observability stack above it, rather than as a vertical replacement for either. The platform exposes APIs to scheduling platforms, configuration management databases (CMDBs), and external cooling systems, and the design intent is that a customer can keep the scheduler, observability stack, and CMDB it already has and let KSManage Ultra supply the AI-specific hardware telemetry and fault-isolation logic. That positioning is closer to the BMC-plus-IPMI pattern of the last 20 years than to the all-in-one observability pitch that dominates the AI-ops space, and it is the right call for a buyer that already has a substantial operations team and does not want to retrain it on a new tool.

The AI factory framing is the third part of the bet. KAYTUS is explicitly using the AI factory language, the same framing NVIDIA, Microsoft, and the hyperscalers have all adopted, to position KSManage Ultra as the operations plane for the kind of GPU-dense, liquid-cooled, multi-thousand-node cluster that an AI factory represents. KAYTUS has been building toward this position for several years, with prior KSManage versions focusing on the in-rack and in-cluster layers; KSManage Ultra is the step that pulls cooling, power, and fault isolation into the same product surface. Our previous coverage of NVIDIA's expansion of its global AI infrastructure footprint with Vera Rubin and Sharon AI addresses the compute side of the AI factory story from the silicon and data center perspective, and the AI infrastructure resource page lays out the broader chips, cloud, and capacity choices that frame where KSManage Ultra fits in the buyer decision.

What the AI factory pitch means for enterprise AI buyers

The enterprise AI buyer takeaway is that the management plane is now part of the AI factory buildout decision, not a separate procurement that happens after the GPUs are in the rack. A buyer evaluating a multi-thousand-GPU deployment now has to think about how a leak on a single rack will be detected, how a firmware drift across a cluster will be caught, and how a bad node will be kept off the next job, and those questions do not have a good answer in the traditional BMC and IPMI stack. KSManage Ultra is one of the first vendor answers to make that explicit, and the bet is that the next 18 months of AI factory rollouts will treat the management plane as a first-class part of the procurement, not an afterthought. The full announcement is in the eeNews Europe coverage of the KAYTUS KSManage Ultra launch dated June 29, 2026.
