Here’s the problem: we are always under pressure to reduce the time it takes to develop a new model, while datasets only grow in size. Running a training job on a single node is pretty easy, but nobody wants to wait hours and then run it again, only to realize that it wasn’t right to begin with.