{"version":"1.0","type":"rich","provider_name":"Acast","provider_url":"https://acast.com","height":250,"width":700,"html":"<iframe src=\"https://embed.acast.com/$/68470ba8d911dedd6501609c/69fa7cc41353c87e11f7c17d?\" frameBorder=\"0\" width=\"700\" height=\"250\"></iframe>","title":"Episode 18 - Why AI needs a new kind of supercomputer network","description":"<p>Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.</p><p><br></p><p><strong>Chapters</strong></p><p>00:00 Intro</p><p>00:39 Greg and Mark's paths to OpenAI</p><p>04:34 Why training AI stresses networks differently</p><p>10:05 Bottlenecks, failures, and the cost of waiting</p><p>15:19 How Multipath Reliable Connection works</p><p>18:59 A protocol to route around failures</p><p>25:05 Why OpenAI is making MRC an open standard</p><p>35:09 Could AI compute move to space?</p>","author_name":"OpenAI"}