GPU Performance

Tips for maximizing GPU compute throughput in Morph.

Operation Fusion

The compiler automatically fuses compatible operations:

gpu {
    result is (a * b) + c;    // fused into a single FMA (fused multiply-add) kernel
}
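To make the benefit of fusion concrete, here is a minimal Python sketch of the idea, not Morph's actual compiler output. The function names are hypothetical. The unfused version makes two passes and materializes an intermediate buffer; the fused version computes the same result in one pass.

```python
def unfused_multiply_add(a, b, c):
    """Two passes: conceptually two kernels with an intermediate buffer."""
    # Pass 1: materialize the product a * b in a temporary list.
    product = [x * y for x, y in zip(a, b)]
    # Pass 2: read the temporary back and add c.
    return [p + z for p, z in zip(product, c)]


def fused_multiply_add(a, b, c):
    """One pass: conceptually a single kernel with no intermediate buffer."""
    return [x * y + z for x, y, z in zip(a, b, c)]


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]

# Both versions agree on these inputs; the fused one touches memory less.
assert unfused_multiply_add(a, b, c) == fused_multiply_add(a, b, c)
```

On real hardware a fused multiply-add typically rounds once instead of twice, so fused and unfused results may differ in the last bit for some inputs.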

Readback Optimization

Set MORPHLANG_GPU_READBACK_OPT=1 to avoid unnecessary GPU→CPU transfers. Data stays on the GPU until explicitly needed on the CPU.
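The following Python sketch is a conceptual model of deferred readback, not the Morph runtime. The `DeviceTensor` class and its methods are hypothetical; a plain list stands in for GPU memory, and a counter records each simulated GPU→CPU transfer.

```python
class DeviceTensor:
    """Toy stand-in for data that lives on the GPU (hypothetical class)."""

    readbacks = 0  # number of simulated GPU-to-CPU transfers so far

    def __init__(self, values):
        # In this model the list represents device-resident memory.
        self._device_values = list(values)

    def map(self, fn):
        # The operation runs "on the device"; the result stays there.
        return DeviceTensor(fn(v) for v in self._device_values)

    def to_host(self):
        # The only place where data is copied back to the CPU.
        DeviceTensor.readbacks += 1
        return list(self._device_values)


# Two chained operations, then a single explicit readback at the end.
chained = DeviceTensor([1, 2, 3]).map(lambda v: v * 2).map(lambda v: v + 1)
host_values = chained.to_host()
```

In this model, chaining operations costs no transfers; only the final `to_host()` call does. Reading back after every step would instead cost one transfer per operation.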

Metrics

Enable timing metrics:

MORPHLANG_GPU_DUMP_METRICS=1 morph run
# or: set the same variable and run your binary or test driver; the metrics hook into the GPU runtime, not a specific CLI subcommand

Output includes per-block dispatch time, memory usage, and kernel count when the GPU stack is exercised (see tensor/runtime tests for MORPHLANG_GPU_* usage).
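If you launch runs from a script, the variable can be set for the child process only. The Python sketch below assumes nothing about Morph beyond the environment variable names on this page; the helper names `gpu_metrics_env` and `run_with_metrics` are hypothetical.

```python
import os
import subprocess


def gpu_metrics_env(base_env=None, readback_opt=False):
    """Return a copy of the environment with GPU metrics dumping enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MORPHLANG_GPU_DUMP_METRICS"] = "1"
    if readback_opt:
        # Optionally enable the readback optimization for the same run.
        env["MORPHLANG_GPU_READBACK_OPT"] = "1"
    return env


def run_with_metrics(command, readback_opt=False):
    """Run `command` (a list of arguments) with metrics enabled."""
    env = gpu_metrics_env(readback_opt=readback_opt)
    # check=False: leave handling of a non-zero exit code to the caller.
    return subprocess.run(command, env=env, check=False)


# Example (not executed here): run_with_metrics(["morph", "run"])
```

Copying the environment rather than modifying `os.environ` keeps the setting out of the parent process and out of later, unrelated runs.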

Best Practices

Minimize data transfers: keep data on the GPU between operations.
Batch operations: group related work into a single gpu block.
Use large tensors: for small data, GPU dispatch and transfer overhead can outweigh the speedup.
Profile first: use the metrics output to find the actual bottlenecks.
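The advice about large tensors can be illustrated with a simple cost model. The Python sketch below assumes a fixed per-dispatch GPU overhead and constant throughput on each device. The model and every number in it are hypothetical placeholders, not Morph measurements, so substitute values from your own metrics.

```python
def cpu_time(n, cpu_rate):
    """Seconds to process n elements on the CPU at cpu_rate elements/s."""
    return n / cpu_rate


def gpu_time(n, gpu_rate, fixed_overhead):
    """Seconds on the GPU: fixed dispatch/transfer overhead plus compute."""
    return fixed_overhead + n / gpu_rate


def break_even_elements(cpu_rate, gpu_rate, fixed_overhead):
    """Element count at which GPU and CPU times are equal in this model.

    Solves n / cpu_rate = fixed_overhead + n / gpu_rate for n.
    Returns None when the GPU is not faster per element, since the
    GPU never catches up in that case.
    """
    if gpu_rate <= cpu_rate:
        return None
    return fixed_overhead / (1.0 / cpu_rate - 1.0 / gpu_rate)


# Hypothetical figures: 1e9 elements/s on the CPU, 1e10 elements/s on the
# GPU, and 100 microseconds of fixed overhead per dispatch.
n_star = break_even_elements(cpu_rate=1e9, gpu_rate=1e10, fixed_overhead=1e-4)
```

Below `n_star` elements the CPU wins in this model, and above it the GPU wins. Batching work into one gpu block helps for the same reason: the fixed overhead is paid once rather than once per small operation.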

Next Steps

Concurrency — CPU parallelism
