GPU Performance
Tips for maximizing GPU compute throughput in Morph.
Operation Fusion
The compiler automatically fuses compatible operations:
gpu {
    result is (a * b) + c; // multiply and add are fused into a single FMA kernel
}
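To make the benefit of fusion concrete, here is a minimal Python sketch of the idea. It is a conceptual model only, not Morph compiler output; the function names `unfused_multiply_add` and `fused_multiply_add` are hypothetical. The unfused version makes two passes and materializes an intermediate buffer, while the fused version does the same work in one pass.

```python
def unfused_multiply_add(a, b, c):
    # Pass 1: elementwise multiply writes a full intermediate buffer.
    tmp = [x * y for x, y in zip(a, b)]
    # Pass 2: elementwise add reads that intermediate back in.
    return [t + z for t, z in zip(tmp, c)]


def fused_multiply_add(a, b, c):
    # Single pass: multiply and add per element, with no intermediate buffer.
    return [x * y + z for x, y, z in zip(a, b, c)]


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]

# Both versions compute the same values here; only the memory traffic differs.
assert unfused_multiply_add(a, b, c) == fused_multiply_add(a, b, c)
```

On real hardware, a fused multiply-add also rounds once instead of twice, so results can differ from the unfused form in the last bit. This sketch does not model that.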
Readback Optimization
Set MORPHLANG_GPU_READBACK_OPT=1 to avoid unnecessary GPU→CPU transfers. Data stays on the GPU until explicitly needed on the CPU.
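The "stay on the GPU until needed" behavior can be pictured with a small Python model. This is an illustrative sketch under stated assumptions, not the Morph runtime: `DeviceBuffer`, `map`, and `to_host` are hypothetical names, and the "device" is simulated with an ordinary list. The point it shows is that chained operations cause no readback, and only the explicit host request triggers a transfer.

```python
class DeviceBuffer:
    """Toy stand-in for GPU-resident data (hypothetical, not a Morph API)."""

    transfers = 0  # number of simulated GPU-to-CPU copies

    def __init__(self, values):
        self._values = list(values)

    def map(self, fn):
        # Stays "on device": returns a new DeviceBuffer, no copy to the host.
        return DeviceBuffer(fn(v) for v in self._values)

    def to_host(self):
        # The only place a simulated readback happens.
        DeviceBuffer.transfers += 1
        return list(self._values)


buf = DeviceBuffer([1, 2, 3])
result = buf.map(lambda v: v * 2).map(lambda v: v + 1)  # two ops, zero readbacks
host = result.to_host()  # one explicit readback at the end
```

Without deferral, each intermediate result would be copied back to the CPU, so the two chained operations above would cost two transfers instead of one.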
Metrics
Enable timing metrics:
MORPHLANG_GPU_DUMP_METRICS=1 morph run
# Alternatively, set the same variable when running your own binary or test driver.
# Metrics hook into the GPU runtime, not a specific CLI subcommand.
When the GPU stack is exercised, the output includes per-block dispatch time, memory usage, and kernel count. See the tensor/runtime tests for examples of MORPHLANG_GPU_* usage.
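Because the metrics are driven by an environment variable rather than a CLI flag, any launcher can enable them. The following Python sketch shows one way a test driver might do so. The helper names `morph_gpu_env` and `run_with_metrics` are hypothetical; only the two `MORPHLANG_GPU_*` variable names come from this page.

```python
import os
import subprocess


def morph_gpu_env(dump_metrics=True, readback_opt=False, base=None):
    """Build an environment dict with the MORPHLANG_GPU_* variables set."""
    # Copy the base environment so the caller's environment is not mutated.
    env = dict(os.environ if base is None else base)
    if dump_metrics:
        env["MORPHLANG_GPU_DUMP_METRICS"] = "1"
    if readback_opt:
        env["MORPHLANG_GPU_READBACK_OPT"] = "1"
    return env


def run_with_metrics(command):
    """Run a command (list of arguments) with GPU metrics enabled."""
    return subprocess.run(command, env=morph_gpu_env(), check=False)


# Example (assumes a `morph` executable is on PATH):
# run_with_metrics(["morph", "run"])
```

Passing the variables through `env` keeps them scoped to the child process, so other commands started from the same script are unaffected.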
Best Practices
- Minimize data transfers: keep data on the GPU between operations.
- Batch operations: group related work into a single gpu block rather than many small ones.
- Use large tensors: for small data, the GPU overhead is not worth it.
- Profile first: use the metrics output to find the actual bottlenecks.
Next Steps
- Concurrency — CPU parallelism