Have you benchmarked grouped GEMM vs. batched GEMM for your use case? Let’s discuss below ⬇️

Enter cuBLASLt grouped GEMM: a game changer for batched, variable-sized matmul operations.

📖 NVIDIA cuBLASLt Developer Guide → Grouped GEMM section

🔍 The grouped GEMM interface allows you to execute a list of independent matrix multiplications in a single kernel launch, drastically reducing launch latency and improving GPU utilization.
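To give a feel for the shape of the API, here is a minimal sketch using the grouped-batched entry point from the plain cuBLAS API (cublasSgemmGroupedBatched, available since cuBLAS 12.5). The group shapes are made up for illustration, and the device-side residency of the matrix pointer arrays is an assumption carried over from cublasSgemmBatched — verify both against the guide linked above before benchmarking.

```cpp
// Sketch: five independent SGEMMs in two shape groups, one grouped call.
// Assumes cuBLAS >= 12.5; data init, error checks, and frees omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Group 0: three 128x128x64 GEMMs; group 1: two 256x64x128 GEMMs.
    const int group_count = 2;
    std::vector<int> group_size = {3, 2};
    std::vector<cublasOperation_t> transa(group_count, CUBLAS_OP_N);
    std::vector<cublasOperation_t> transb(group_count, CUBLAS_OP_N);
    std::vector<int>   m = {128, 256}, n = {128, 64}, k = {64, 128};
    std::vector<int>   lda = {128, 256}, ldb = {64, 128}, ldc = {128, 256};
    std::vector<float> alpha(group_count, 1.0f), beta(group_count, 0.0f);

    // One A/B/C buffer per problem; collect the device pointers on the host.
    const int total = group_size[0] + group_size[1];
    std::vector<float*> hA(total), hB(total), hC(total);
    for (int g = 0, p = 0; g < group_count; ++g)
        for (int i = 0; i < group_size[g]; ++i, ++p) {
            cudaMalloc(&hA[p], sizeof(float) * m[g] * k[g]);
            cudaMalloc(&hB[p], sizeof(float) * k[g] * n[g]);
            cudaMalloc(&hC[p], sizeof(float) * m[g] * n[g]);
        }

    // Assumption: like cublasSgemmBatched, the arrays of matrix pointers live in
    // device memory, while the per-group metadata arrays stay on the host.
    float **dA, **dB, **dC;
    cudaMalloc(&dA, sizeof(float*) * total);
    cudaMalloc(&dB, sizeof(float*) * total);
    cudaMalloc(&dC, sizeof(float*) * total);
    cudaMemcpy(dA, hA.data(), sizeof(float*) * total, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float*) * total, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), sizeof(float*) * total, cudaMemcpyHostToDevice);

    // One launch covers all five problems; the baseline would be five
    // separate cublasSgemm calls (or per-shape cublasSgemmBatched calls).
    cublasStatus_t st = cublasSgemmGroupedBatched(
        handle, transa.data(), transb.data(),
        m.data(), n.data(), k.data(),
        alpha.data(), (const float* const*)dA, lda.data(),
        (const float* const*)dB, ldb.data(),
        beta.data(), dC, ldc.data(),
        group_count, group_size.data());
    printf("grouped GEMM status: %d\n", (int)st);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}
```

Note the key difference from cublasSgemmBatched: every group carries its own m/n/k, leading dimensions, and scalars, so problems of different sizes no longer force separate launches. For mixed or lower precision there is a corresponding cublasGemmGroupedBatchedEx variant.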