GPU Performance

Tips for maximizing GPU compute throughput in Morph.


Operation Fusion

The compiler automatically fuses compatible operations into a single kernel. Here a multiply followed by an add becomes one fused multiply-add (FMA):

gpu {
    result is (a * b) + c; // fused into a single FMA kernel
}
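To make the effect of fusion concrete, here is a minimal conceptual sketch in plain Python (not Morph, and not running on a GPU). The function names `unfused` and `fused` are illustrative assumptions, not part of any Morph API. The sketch models only the difference between two passes with a temporary buffer and one pass without.

```python
# Conceptual model of operation fusion; plain Python lists stand in for tensors.

def unfused(a, b, c):
    """Two passes: the multiply step writes a temporary, the add step reads it back."""
    tmp = [x * y for x, y in zip(a, b)]        # pass 1: multiply into a temporary
    return [t + z for t, z in zip(tmp, c)]     # pass 2: add

def fused(a, b, c):
    """One pass: multiply and add per element, with no temporary buffer."""
    return [x * y + z for x, y, z in zip(a, b, c)]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]

# Both versions compute the same values here; only the number of passes differs.
assert unfused(a, b, c) == fused(a, b, c)
```

On real hardware a fused multiply-add rounds once instead of twice, so a fused kernel's result can differ from the unfused result in the last bit. The Python model above does not reproduce that.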

Readback Optimization

Set MORPHLANG_GPU_READBACK_OPT=1 to avoid unnecessary GPU→CPU transfers. With this set, data stays on the GPU until it is explicitly needed on the CPU.
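The idea of deferring readback can be illustrated with a small Python model. Everything here is hypothetical: `DeviceTensor`, `map`, and `to_host` are stand-ins invented for this sketch and say nothing about how the Morph runtime implements the optimization. The sketch only counts simulated GPU→CPU copies for two usage patterns.

```python
class DeviceTensor:
    """Hypothetical stand-in for data resident on the GPU."""

    readbacks = 0  # number of simulated GPU->CPU copies

    def __init__(self, values):
        self._values = list(values)

    def map(self, fn):
        # Simulated GPU operation: the result stays "on the device".
        return DeviceTensor(fn(v) for v in self._values)

    def to_host(self):
        # Simulated readback: the only place a copy is counted.
        DeviceTensor.readbacks += 1
        return list(self._values)


def eager(xs):
    """Copy the result back to the host after every operation."""
    t = DeviceTensor(xs)
    for _ in range(3):
        t = DeviceTensor(t.map(lambda v: v * 2).to_host())
    return t.to_host()


def deferred(xs):
    """Keep intermediates on the device and read back once at the end."""
    t = DeviceTensor(xs)
    for _ in range(3):
        t = t.map(lambda v: v * 2)
    return t.to_host()
```

In this model `eager` performs four readbacks for three operations, while `deferred` performs one. The real saving depends on the workload and should be confirmed with metrics.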


Metrics

Enable timing metrics:

MORPHLANG_GPU_DUMP_METRICS=1 morph run
# Or set the same variable when running your own binary or test driver; metrics hook into the GPU runtime, not a specific CLI subcommand.

When the GPU stack is exercised, the output includes per-block dispatch time, memory usage, and kernel count (see the tensor/runtime tests for MORPHLANG_GPU_* usage).
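When the program is launched from a script or test harness, the variables can be set on the child process rather than in the shell. The following Python sketch shows one way to do that. The helper names `gpu_env` and `run_with_metrics` are assumptions made for this example, and the sketch does not assume where the metrics are written, so it captures both stdout and stderr.

```python
import os
import subprocess


def gpu_env(dump_metrics=True, readback_opt=True, base=None):
    """Return a copy of the environment with the MORPHLANG_GPU_* variables set."""
    env = dict(os.environ if base is None else base)
    if dump_metrics:
        env["MORPHLANG_GPU_DUMP_METRICS"] = "1"
    if readback_opt:
        env["MORPHLANG_GPU_READBACK_OPT"] = "1"
    return env


def run_with_metrics(cmd):
    """Run cmd (for example ["morph", "run"]) with metrics enabled.

    Returns the CompletedProcess; inspect .stdout and .stderr for the metrics,
    since this sketch does not assume which stream the runtime writes to.
    """
    return subprocess.run(cmd, env=gpu_env(), capture_output=True, text=True)
```

Building a fresh dictionary leaves the parent process environment untouched, so enabling metrics for one run does not affect later runs.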


Best Practices

  1. Minimize data transfers: keep data on the GPU between operations
  2. Batch operations: group related work into a single gpu block
  3. Use large tensors: for small data, GPU overhead outweighs the speedup
  4. Profile first: use the metrics output to find the actual bottlenecks
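The "use large tensors" advice follows from a simple cost model: the GPU pays a fixed overhead per dispatch, so it only wins once the per-element saving covers that overhead. The Python sketch below illustrates the break-even calculation. All constants are invented placeholders, not measurements of Morph or of any particular GPU; substitute the dispatch times reported by the metrics output.

```python
import math


def gpu_time_us(n, launch_us=50.0, per_elem_us=0.0021):
    """Modelled GPU time: fixed dispatch overhead plus per-element transfer and compute."""
    return launch_us + n * per_elem_us


def cpu_time_us(n, per_elem_us=0.01):
    """Modelled CPU time: per-element compute only, no fixed overhead."""
    return n * per_elem_us


def break_even_elements(launch_us=50.0, gpu_per_elem_us=0.0021, cpu_per_elem_us=0.01):
    """Smallest element count at which the modelled GPU time is no worse than CPU time."""
    saving = cpu_per_elem_us - gpu_per_elem_us
    if saving <= 0:
        return math.inf  # the GPU never catches up under this model
    return math.ceil(launch_us / saving)
```

With these placeholder constants the break-even point is a few thousand elements; below it the fixed overhead dominates and the CPU path is faster in the model.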

Next Steps