

When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a precompiled GPU kernel implementation that the executor dispatches to.

XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization. For example, consider an optimization XLA does in the context of a simple TensorFlow computation:

```python
def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)
```

Run without XLA, the graph launches three kernels: one for the multiplication, one for the addition and one for the reduction. However, XLA can optimize the graph so that it computes the result in a single kernel launch. It does this by "fusing" the addition, multiplication and reduction into a single GPU kernel. Moreover, this fused operation does not write out the intermediate values produced by y*z and x+y*z to memory; instead it "streams" the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.

Fusion is XLA's single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.
