Because the GPUs work in parallel, the implications are different from those in a conventional network.
“If we were on video and a burst of errors occurs, TCP/IP does a fairly good job of bridging and broadcast,” said Gartner. “But in AI infrastructure, because the GPUs work in parallel, it is very sensitive to problems that could occur on a link. All these GPUs exchange information and are synchronized, and therefore, fundamentally, you must stop the workload, save at a checkpoint, and restart the workload. And that can cause a 40% reduction in the performance of the cluster.”
“It really suggests to our customers that they must focus much more on the reliability of the optics,” said Gartner.
Reliability tests reveal weaknesses
Cisco previously ran a reliability test for which it acquired 20 different optics from different suppliers, Gartner recalled. “They were 100G and 400G optics at the time,” and all complied with industry standards, and yet “none of these optics successfully completed our stress test,” he said.
Cisco's test environments vary conditions such as temperature, humidity, the voltage level the optics see from the host, and the signals coming from the host. “We do all these things in various combinations,” said Gartner.
Although the optics may technically comply with industry standards, “what we know is that if they were put in a stressful environment … they would not play,” he said, “and this is what we try to make our customers aware of.”