[Case Study] GPU Training Impossible Deadline Achieved
- TecAce Software
- Jul 12, 2023
- 4 min read

BACKGROUND
The customer was looking to start an LLM project but was under a tight deadline. Their plan was to purchase Nvidia’s DGX H100 HPC units in order to build a Large Language Model solution for their own customers.
The initial purchase was placed at the beginning of the year with a supplier that promised 4 nodes by mid-year and the remaining 4 nodes by year’s end. This meant full delivery of all 8 nodes would take close to 12 months.
CHALLENGES
Due to the high demand for SXM5 GPU HPC units, lengthy lead times were standard practice for most resellers. SXM5 is the newer form factor for H100 systems; most H100 availability to date has been limited to the PCIe variant of the H100 Tensor Core GPU.
As the graph below indicates, the PCIe variant of the H100 Tensor Core GPU delivers roughly a 5x speedup over its A100 predecessor, while the SXM5 variant delivers approximately a 9x speedup over the A100. That said, Nvidia’s control over production and limited availability after production are causing a fulfillment bottleneck, especially when ordering the DGX H100 SXM5 GPU HPC directly from Nvidia.
Our client originally ordered 8 units in total to cover their training needs, but the order had to be split in two: the first 4 units to be delivered by June and the remaining 4 nodes by December. The end user accepted this schedule only because they believed they had no alternatives.

SOLUTION
Once TecAce Software learned of this dilemma, we reached out to our partners to bridge the gap and minimize the client’s downtime. In a matter of days, TecAce was able to coordinate a cloud-based solution that would allow the customer to begin training their LLM.

High-level block diagram of HGX H100 8-GPU (Source: https://cirrascale.com/solutions-nvidia-hgx-h100.php#)
Finding a cloud service provider isn’t a novel idea, but what was required here was a bridge over a service gap of only a month or so. A commitment that short is almost impossible to find, since the cloud service provider must take on a large financial burden to support a client’s need that lasts for only a limited time.
Our partnership with this vendor provided the flexibility our customer needed to launch their project and exceed their clients’ expectations. In addition to the cloud services, TecAce was able to connect the customer with another resource: a manufacturer of high-performance compute servers that incorporates the SXM5 Tensor Core GPU, just like the Nvidia DGX platform.

H100 enables next-generation AI and HPC breakthroughs (Source: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/)
The majority of HPC suppliers are currently dedicated either to A100 devices or, where they offer the newer H100 technology, to the PCIe variant. The PCIe variant is still faster than the A100 GPU, but the SXM5 offers interconnect speeds of up to 900 GB/s versus 600 GB/s for the PCIe variant. When you are talking about training time, a reduction of up to 33% in data-transfer time makes a world of difference.
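The bandwidth comparison above can be checked with simple arithmetic. The sketch below uses the interconnect figures quoted in this case study; the amount of inter-GPU traffic is a hypothetical placeholder, chosen only to illustrate the ratio:

```python
# Interconnect bandwidths quoted above (GB/s).
sxm5_bw = 900   # H100 SXM5 NVLink
pcie_bw = 600   # H100 PCIe variant

# Hypothetical example: 1 TB of GPU-to-GPU traffic during a training run.
data_gb = 1_000
t_pcie = data_gb / pcie_bw   # transfer time on the PCIe variant (s)
t_sxm5 = data_gb / sxm5_bw   # transfer time on SXM5 (s)

# 900 GB/s is 1.5x the bandwidth, so transfer time drops by one third.
reduction = 1 - t_sxm5 / t_pcie
print(f"Transfer-time reduction: {reduction:.0%}")  # → 33%
```

Note this 33% applies to the communication portion of training, not the whole run; compute-bound phases are unaffected by interconnect speed.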

NVIDIA H100 GPU on new SXM5 Module (Source: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/)
When you ask anyone in the industry how long it will take for an AI model to be trained, the answer will vary in terms of hours, days, weeks, or even longer. It boils down to factors such as hardware, optimization, the number of layers in the neural network, the size of your dataset, and more.
All else being equal, hardware is where you can identify your bottleneck, and it is where adjustments yield the biggest reduction in training time.
With Nvidia controlling the market with their DGX H100 SXM5 GPU servers, TecAce was able to partner with the only other manufacturer providing hardware with the same technology. Through this partner relationship, we were not only able to secure the hardware for the second half of the deployment much sooner, we were also able to save our client a considerable amount of money.
To summarize, we coordinated and acquired both hardware and cloud solutions, helping the client start their project much sooner and reduce their spending, which freed up budget for other projects.
RESULT
The client was able to start training their model two months ahead of their original timeframe by utilizing the capabilities and services of the cloud solution. This may sound like a no-brainer, but keep in mind that the minimum commitment from a cloud service provider is typically 6 months; for a cluster of 8 nodes at the speeds required, that averages $300K per month, and most other service providers ask for a one- to three-year commitment.
That means a stop-gap would require, at minimum, a 6-month commitment valued at $1.8 million, which by itself approaches the cost of owning a 4-node cluster outright. TecAce did the research and found the right provider for both cloud services and hardware acquisition, all at more-than-competitive pricing, saving our customer both time and money.
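The commitment math above can be sketched in a few lines. The figures are the round numbers quoted in this case study (actual contract pricing varies); the length of the bridge gap is an assumption based on the one-to-two-month window described earlier:

```python
# Round figures quoted in the case study; real contracts will vary.
monthly_cost = 300_000        # ~$300K/month for an 8-node H100-class cluster
min_commitment_months = 6     # typical minimum cloud commitment
gap_months = 2                # assumed bridge the client actually needed

minimum_spend = monthly_cost * min_commitment_months
needed_spend = monthly_cost * gap_months

print(f"Minimum commitment: ${minimum_spend:,}")  # → $1,800,000
print(f"Spend for the gap:  ${needed_spend:,}")   # → $600,000
```

The difference between those two numbers is the waste a standard minimum commitment would have imposed, which is why a short-term arrangement mattered.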
TecAce’s experience with global projects added further value: understanding the cultural needs of different parts of the world, executing on those needs across cultural differences, and handling transactions so seamlessly that time zones were never a factor. All of this made the solution painless and profitable for our customer and partners.
A single relationship gives the customer a dedicated representative to assist in every aspect of their business continuity. Financially, it also allows the customer to leverage their entire purchasing budget, so that partners such as TecAce can offer the best fiscal terms without the customer having to negotiate at every turn of their business.