Notes on GPU MODE Lecture 1

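Profilers

Computing performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to optimize.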

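Why use a profiler?

CUDA is asynchronous: a kernel launch returns before the GPU has finished, so the Python time module can't measure GPU work reliably
Profilers are more powerful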
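The asynchrony point can be illustrated without a GPU. The sketch below is only an analogy: a worker thread stands in for the GPU, `fake_kernel` is a hypothetical stand-in for a kernel, and waiting on the future plays the role of `torch.cuda.synchronize()`. A timer stopped right after the launch sees almost none of the real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s):
    """Hypothetical stand-in for GPU work: just sleeps for duration_s seconds."""
    time.sleep(duration_s)
    return duration_s


def measure(duration_s=0.05):
    """Return (launch_only, synced) wall-clock times in seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        # submit() returns immediately, much like an asynchronous kernel launch
        future = pool.submit(fake_kernel, duration_s)
        launch_only = time.perf_counter() - t0  # what a naive timer reports

        # Block until the work is done, analogous to a synchronize call
        future.result()
        synced = time.perf_counter() - t0
    return launch_only, synced


launch_only, synced = measure()
print(f"timer stopped right after launch: {launch_only * 1e3:.3f} ms")
print(f"timer stopped after waiting:      {synced * 1e3:.3f} ms")
```

The first number reflects only the cost of handing the work off, which is why GPU timing needs events or an explicit synchronization point.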

Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

The autograd profiler uses torch.cuda.Event() to measure performance.

The PyTorch profiler uses profile(), a context manager in torch.profiler, to analyze performance.
You can export the result as a .json file and load it into chrome://tracing/ to visualize it.
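An exported trace can also be inspected without a browser. The sketch below assumes the file follows the Chrome Trace Event format, with a top-level "traceEvents" list whose complete events have "ph" set to "X" and carry a "dur" field in microseconds. Whether a given PyTorch export matches this exactly is an assumption, and `load_trace` and `summarize_trace` are hypothetical helper names.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Read a trace file from disk (assumed to be plain JSON)."""
    with open(path) as f:
        return json.load(f)


def summarize_trace(trace, top=5):
    """Sum the duration of complete events per event name, largest first."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event that carries its own duration in "dur"
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Tiny hand-written trace in the assumed shape, for illustration only
sample = {"traceEvents": [
    {"ph": "X", "name": "aten::mean", "ts": 0, "dur": 120},
    {"ph": "X", "name": "aten::randn", "ts": 130, "dur": 300},
    {"ph": "X", "name": "aten::mean", "ts": 500, "dur": 80},
    {"ph": "M", "name": "process_name"},  # metadata event, ignored
]}

summary = summarize_trace(sample)
print(summary)
```

For a real file, `summarize_trace(load_trace(path))` would give the same kind of per-operator totals.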

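Demo

The lecture provides a simple program that shows how to use the autograd profiler to analyze three ways of squaring a tensor:

With torch.square()
With the ** operator
With the * operator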

import torch

def time_pytorch_function(func, input):
    # CUDA is async, so the Python time module can't be used
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # Wait for the GPU to finish before reading the timer
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

# b, square_2, and square_3 must be defined beforehand; they are not shown in these notes
time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

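The findings:

In all three cases, the total CUDA time is lower than the total CPU time.
The * operator dispatches to aten::multiply rather than aten::pow, and the former is faster. This may be because multiplication is used far more often than pow, so more developer effort has gone into optimizing it.
The differences in CUDA time are tiny. In terms of CPU time, torch.square is the slowest.
aten::square calls aten::pow.
All three methods launch a kernel named native::vectorized_elementwise_kernel.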
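To put numbers on "tiny" and "slowest", the totals reported above can be compared directly. This is plain arithmetic on the T4 figures from these notes, not a new measurement.

```python
# Totals reported above, in milliseconds (NVIDIA T4)
results = {
    "torch.square": {"cpu": 10.577, "cuda": 3.266},
    "a * a": {"cpu": 5.417, "cuda": 3.276},
    "a ** 2": {"cpu": 6.183, "cuda": 3.274},
}

# Relative spread between the slowest and fastest CUDA totals
cuda_times = [r["cuda"] for r in results.values()]
cuda_spread_pct = (max(cuda_times) - min(cuda_times)) / min(cuda_times) * 100

# CPU total of each method relative to the fastest one
fastest_cpu = min(r["cpu"] for r in results.values())
cpu_ratio = {name: r["cpu"] / fastest_cpu for name, r in results.items()}

print(f"CUDA spread: {cuda_spread_pct:.2f}%")
for name, ratio in cpu_ratio.items():
    print(f"{name}: {ratio:.2f}x the fastest CPU time")
```

The CUDA totals differ by roughly 0.3%, while the CPU total for torch.square is close to twice that of a * a, so the gap between the methods comes from the CPU side rather than from the kernel itself.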

Integrating CUDA kernels into PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that turns decorated Python functions into machine code that runs on the CPU and GPU
Use Triton

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension via load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).

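import os
import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source must be defined beforehand: strings holding the
# C++ declaration and the CUDA implementation of square_matrix

# Assumption: the build directory has to exist before load_inline is called
os.makedirs('./load_inline_cuda', exist_ok=True)

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))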
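The call above relies on cpp_source and cuda_source, which these notes do not show. Below is a minimal sketch of what they might contain, assuming a float32, contiguous 2D input. The kernel name `square_matrix_kernel` and the 16x16 block shape are illustrative choices, not taken from the notes.

```python
# Hypothetical CUDA source: an element-wise square over a 2D tensor.
# Assumes the input is a contiguous float32 tensor that lives on the GPU.
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads that fall outside the matrix
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks(
        (width + threads_per_block.x - 1) / threads_per_block.x,
        (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""

# The C++ side only needs the declaration of the function listed in `functions`
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"
```

The name in the C++ declaration has to match the entry passed in functions, here 'square_matrix', since that is the symbol exposed to Python.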

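Hands-on

Using the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs for more reliable results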

import torch

# Method 1: use `torch.mean()`
def mean_all_by_torch(input_tensor):
    return torch.mean(input_tensor)

# Method 2: use `mean()` of the tensor
def mean_all_by_tensor(input_tensor):
    return input_tensor.mean()

# Method 3: use `torch.sum()` and `tensor.numel()`
def mean_all_by_combination(input_tensor):
    return torch.sum(input_tensor) / input_tensor.numel()

def time_pytorch_function(func, input_tensor, warmup=5, runs=100):
    # Warmup
    for _ in range(warmup):
        func(input_tensor)

    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(runs):
        start.record()
        func(input_tensor)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    # Average time per run, in milliseconds
    return sum(times) / len(times)

input_tensor = torch.randn(10000, 10000).cuda()

print("torch.mean() time:", time_pytorch_function(mean_all_by_torch, input_tensor))
print("tensor.mean() time:", time_pytorch_function(mean_all_by_tensor, input_tensor))
print("manual mean time:", time_pytorch_function(mean_all_by_combination, input_tensor))

with torch.profiler.profile() as prof:
    mean_all_by_torch(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

with torch.profiler.profile() as prof:
    mean_all_by_tensor(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

with torch.profiler.profile() as prof:
    mean_all_by_combination(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Using the PyTorch profiler on the mean operation

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        mean_tensor = torch.mean(torch.randn(10000, 10000).cuda())

# Write a trace that can be opened in chrome://tracing/
prof.export_chrome_trace("mean_trace.json")

Implementing torch.mean() in Triton

import triton
import triton.language as tl
import torch

@triton.jit
def mean_kernel(
    x_ptr,          # pointer to input tensor
    output_ptr,     # pointer to output tensor (one partial result per block)
    n_elements,     # total number of elements
    BLOCK_SIZE: tl.constexpr,  # number of elements per block
):
    pid = tl.program_id(0)

    # Load the whole block at once instead of looping over single elements;
    # the mask zeroes out positions past the end of the input
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)

    # Dividing by the global element count means the per-block results
    # add up to the overall mean
    block_mean = tl.sum(x, axis=0) / n_elements

    # Store result
    tl.store(output_ptr + pid, block_mean)

# Wrapper function
def triton_mean(x: torch.Tensor) -> torch.Tensor:
    x = x.contiguous().view(-1)
    n_elements = x.numel()

    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)

    output = torch.empty(grid[0], device=x.device, dtype=x.dtype)

    mean_kernel[grid](
        x_ptr=x,
        output_ptr=output,
        n_elements=n_elements,
        BLOCK_SIZE=BLOCK_SIZE,
    )

    # Final reduction over the per-block partial results
    return output.sum()

# Example usage:
if __name__ == "__main__":
    # Create test tensor
    x = torch.randn(1000000, device='cuda')

    # Compare results
    torch_mean = torch.mean(x)
    triton_mean_result = triton_mean(x)

    print(f"PyTorch mean: {torch_mean}")
    print(f"Triton mean: {triton_mean_result}")
    print(f"Difference: {abs(torch_mean - triton_mean_result)}")
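The reduction scheme used by the kernel can be checked on the CPU. The sketch below is a pure-Python emulation, not Triton code: each loop iteration plays the role of one program instance, and `blockwise_mean` is a hypothetical helper name. Each block's sum is divided by the total element count, so adding the partial results gives the overall mean.

```python
import random
import statistics


def cdiv(a, b):
    """Ceiling division: how many blocks of size b are needed to cover a items."""
    return (a + b - 1) // b


def blockwise_mean(values, block_size=1024):
    """Emulate the kernel: per-block partial results, then a final sum."""
    n = len(values)
    num_blocks = cdiv(n, block_size)
    partials = []
    for pid in range(num_blocks):  # one iteration per program instance
        start = pid * block_size
        # Slicing clips the last block, like the bounds check in the kernel
        block = values[start:start + block_size]
        # Divide by the GLOBAL count, not the block length
        partials.append(sum(block) / n)
    return sum(partials), num_blocks


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
result, num_blocks = blockwise_mean(data)
reference = statistics.fmean(data)

print(f"blocks: {num_blocks}")
print(f"blockwise mean: {result}")
print(f"reference mean: {reference}")
```

The two means agree up to floating-point rounding, and the last block is only partially filled, which is the case the bounds check exists to handle.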