Ultra Ethernet Consortium wants to optimize networking for AI and HPC

A group of tech companies has kicked off a project to adapt the Ethernet standard to make it better suited for the demanding network requirements of AI and high performance computing (HPC) applications.

The Ultra Ethernet Consortium (UEC) intends to create a "complete Ethernet-based communication stack architecture" that will be as ubiquitous and cost-effective as Ethernet while offering the performance of a supercomputing interconnect.

Founding members of the consortium include companies heavily involved in HPC and networking, such as Intel, AMD, HPE, Arista, Broadcom, Cisco, Meta and Microsoft. The project itself is hosted within The Linux Foundation.

UEC chair Dr. J Metz told The Register the goal of the project is not to change Ethernet but to tune it to better accommodate the more demanding characteristics of both AI and HPC workloads.

"Ethernet is the base technology on top of which we build, since it's the industry's best example of long lasting, flexible and adaptable fundamental networking technology," he said.

"UEC's goal is to focus on how to best carry AI and HPC workload traffic on top of Ethernet. Of course, there have been a few attempts to do that before, but none has been designed from the ground up for highly demanding AI and HPC workloads and none has been open, easy to use and won broad adoption."

The project targets multiple layers of the networking stack, with working groups tasked with developing "specifications that enhance the performance, latency and management" of both the physical layer and link layer, as well as developing specifications for the transport layer and the software layer.

According to a whitepaper [PDF], networking is becoming increasingly critical for the training of AI models, which are ballooning in size; some have trillions of parameters and need to be trained on large compute clusters, and the network needs to be as efficient as possible in order to keep those clusters busy.

While AI workloads tend to be highly bandwidth-hungry, HPC also includes workloads that are more latency sensitive, and both requirements need to be met.

To satisfy these needs, the UEC has identified the following as desirable characteristics: flexible delivery order; modern congestion control mechanisms; multi-pathing and packet spraying; plus greater scalability and end-to-end telemetry.

According to the whitepaper, the rigid packet ordering used by older technologies limits efficiency by preventing out-of-order data from being delivered straight from the network to the application. Support for modern APIs that relax the packet ordering requirements is critical to cutting "tail latencies."
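To illustrate the idea of relaxed ordering, here is a minimal Python sketch in which each packet carries a byte offset, so the receiver can place its payload directly into the application buffer in whatever order packets arrive. The packet layout (`offset`, `payload`) and the function names are hypothetical and chosen for illustration only; they are not the UEC wire format or API.

```python
import random

def make_packets(message: bytes, mtu: int):
    """Split a message into (offset, payload) packets (hypothetical format)."""
    return [(off, message[off:off + mtu]) for off in range(0, len(message), mtu)]

def receive_any_order(packets, total_len: int) -> bytes:
    """Place each payload at its offset as it arrives; no reorder queue needed."""
    buf = bytearray(total_len)
    received = 0
    for offset, payload in packets:
        buf[offset:offset + len(payload)] = payload
        received += len(payload)
    if received != total_len:
        raise ValueError("message incomplete")
    return bytes(buf)

message = bytes(range(256)) * 4           # 1,024-byte test message
packets = make_packets(message, mtu=64)   # 16 packets
random.Random(0).shuffle(packets)         # simulate out-of-order arrival
result = receive_any_order(packets, len(message))
print(result == message)
```

Because no packet has to wait for an earlier one, a single delayed packet holds up only its own bytes rather than everything queued behind it, which is the effect on tail latency the whitepaper describes.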

Multi-pathing and packet spraying involve simultaneously sending packets along all available network paths between the source and destination to achieve the best performance.
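As a rough sketch of the difference, the Python below compares pinning a flow to a single path (as conventional per-flow load balancing does) with spraying the same flow's packets across every path. Round-robin path selection and the function names are assumptions made for illustration; a real implementation might pick paths randomly or adaptively.

```python
from collections import Counter
from itertools import cycle

def per_flow(num_packets: int, flow_id: int, num_paths: int) -> Counter:
    """Per-flow balancing: every packet of the flow lands on one hashed path."""
    return Counter({flow_id % num_paths: num_packets})

def spray(num_packets: int, num_paths: int) -> Counter:
    """Per-packet spraying: rotate one flow's packets across all paths."""
    paths = cycle(range(num_paths))
    return Counter(next(paths) for _ in range(num_packets))

pinned = per_flow(1000, flow_id=7, num_paths=4)
sprayed = spray(1000, num_paths=4)

print(max(pinned.values()))   # 1000: one path carries the whole flow
print(max(sprayed.values()))  # 250: load is spread evenly over four paths
```

The trade-off is that sprayed packets can arrive out of order, which is why spraying goes hand in hand with the relaxed ordering requirement.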

Network congestion in AI and HPC is chiefly an issue on the link between the switch and a receiving node if multiple senders are all targeting the same node. However, current algorithms to manage congestion do not meet all the needs of a network optimized for AI, the UEC claims.
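The arithmetic behind this "incast" pattern is simple, as the short Python sketch below shows. The sender count and the 400 Gbps link speed are illustrative figures, not numbers from the UEC, and the function name is hypothetical.

```python
def incast(senders: int, link_gbps: float):
    """Return offered load, oversubscription factor and per-sender fair share
    when `senders` nodes transmit at line rate to a single receiver."""
    offered = senders * link_gbps            # traffic arriving at the switch
    oversubscription = offered / link_gbps   # how far the last link is overrun
    fair_share = link_gbps / senders         # rate each sender can sustain
    return offered, oversubscription, fair_share

offered, oversub, share = incast(senders=8, link_gbps=400.0)
print(offered, oversub, share)  # 3200.0 8.0 50.0
```

In this example the final link is asked to carry eight times its capacity, so each sender must slow to one eighth of line rate, and the job of a congestion control scheme is to get senders to that rate quickly without dropping packets or leaving the link idle.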

Chiefly, it appears that the UEC intends to replace the RDMA over Converged Ethernet (RoCE) protocol with a new transport layer protocol that delivers the required characteristics. This Ultra Ethernet Transport will support multipath, packet-spraying delivery and efficient rate control algorithms, and will expose a simple API to AI and HPC workloads, or at least that is the intention.

HPE's involvement in the UEC is notable because it already has an HPC interconnect based on Ethernet. The Cray Slingshot technology is a "superset" of Ethernet, as described in detail by our colleagues over at The Next Platform, that keeps compatibility with standard Ethernet frames. It has featured in many of the supercomputer projects that HPE has been involved with in recent years, such as the Frontier exascale system.

HPE General Manager for High Performance Interconnects Mike Vildibill told us the company's motivation in backing UEC is driven by a desire to ensure that Slingshot operates within an open ecosystem.

"We would like for UEC-compliant NICs to experience some of the performance and scalability benefits of a Slingshot fabric," he said.

Development of Slingshot by HPE will continue into the future, Vildibill confirmed, but he reckons there will always be some third party NIC or SmartNIC that may have features which are not implemented on its Slingshot NIC.

“Therefore, UEC provides a mechanism to establish a robust ecosystem of third party NICs to ensure that we can support the wide range of customer requirements, while delivering some of Slingshot’s unique capabilities,” he said.

The UEC is in the early stages of development, and key technical concepts are still being identified and worked on. Dr Metz said the first ratified drafts will likely be ready by the end of 2023 or early 2024, and the first standards-based products are also expected next year. ®