Skip to content

Commit

Permalink
【Hackathon No.13】为 Paddle 新增 triu_indices API (#207)
Browse files Browse the repository at this point in the history
* 【Hackathon No.34】为 Paddle 优化 poisson op 在 GPU 上的计算性能

* 【Hackathon No.13】为 Paddle 新增 triu_indices API

* 【Hackathon No.13】为 Paddle 新增 triu_indices API

* modify doc
  • Loading branch information
Rayman96 authored Aug 15, 2022
1 parent 32b5841 commit 50abd7d
Showing 1 changed file with 365 additions and 0 deletions.
365 changes: 365 additions & 0 deletions rfcs/APIs/20220813_api_design_for_triu_indices.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,365 @@
# paddle.tril_indices设计文档


|API名称 | paddle.triu_indices |
|---|---|
|提交作者 | Rayman的团队 |
|提交时间 | 2022-08-13 |
|版本号 | V1.0 |
|依赖飞桨版本 | develop |
|文件名 | 20220813_api_design_for_triu_indices.md |


# 一、概述
## 1、相关背景
`triu_indices` 能获取一个2维矩阵的上三角元素的索引,其输出 Tensor 的 shape 为$[2, N]$,相当于有两行,第一行为 上三角元素的行索引,第二行为下三角元素的列索引。调用方式与`tril_indices(rows, cols, offset)`对应。offset的范围为$[-rows+1,cols-1]$。

## 2、功能目标

在Paddle框架中增加`paddle.triu_indices`这个API。

## 3、意义

Paddle将提供高效的`triu_indices`API供用户直接调用。

# 二、飞桨现状
飞桨目前没有提供`triu_indices`这个API,且无法通过API组合的方式间接实现其功能。

相关接口:
1. 飞桨目前提供了triu函数,输入矩阵和对角线的参数,返回矩阵上三角部分,其余部分元素为零。调用接口为`paddle.triu(input, diagonal=0, name=None)`
[源码](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/creation.py#L674)
[文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/triu_cn.html#triu)

2. 飞桨提供了tril_indices函数,与期望实现的triu_indices类似,但其返回的是下三角元素索引[源码](https://github.com/PaddlePaddle/Paddle/blob/6d31dc937704380efe2dee97716c3da47b7060f1/python/paddle/tensor/creation.py#L1721)

```python
import paddle

# example 1, default offset value
data1 = paddle.tril_indices(4,4,0)
print(data1)
# [[0, 1, 1, 2, 2, 2, 3, 3, 3, 3],
# [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]]
# example 2, positive offset value
data2 = paddle.tril_indices(4,4,2)
print(data2)
# [[0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
# [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]]
# example 3, negative offset value
data3 = paddle.tril_indices(4,4,-1)
print(data3)
# [[ 1, 2, 2, 3, 3, 3],
# [ 0, 0, 1, 0, 1, 2]]
```

# 三、业内方案调研
PyTorch和Numpy中都有triu_indices这个API
## PyTorch

### 实现解读
pytorch 中接口配置为: [在线文档](https://pytorch.org/docs/stable/generated/torch.triu_indices.html?highlight=triu_indices#torch.triu_indices)

```python
torch.triu_indices(row, col, offset=0, *, dtype=torch.long, device='cpu', layout=torch.strided) → Tensor
```

在PyTorch中,triu_indices是由C++和CUDA实现的,其中CPU核心代码为:

```c++
Tensor triu_indices_cpu(
int64_t row, int64_t col, int64_t offset, c10::optional<ScalarType> dtype_opt,
c10::optional<Layout> layout_opt, c10::optional<Device> device_opt, c10::optional<bool> pin_memory_opt) {
if (!dtype_opt.has_value()) {
dtype_opt = ScalarType::Long;
}

check_args(row, col, layout_opt);

auto triu_size = row * col - get_tril_size(row, col, offset - 1);

// create an empty Tensor with correct size
auto result = at::native::empty_cpu({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt);

AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "triu_indices", [&]() -> void {
// fill the Tensor with correct values
scalar_t* result_data = result.data_ptr<scalar_t>();
int64_t i = 0;
// not typing std::max with scalar_t as it could be an unsigned type
// NOTE: no need to check if the returned value of std::max overflows
// scalar_t, as i and triu_size act as a guard.
scalar_t c = std::max<int64_t>(0, offset), r = 0;
while (i < triu_size) {
result_data[i] = r;
result_data[triu_size + i++] = c;

// move to the next column and check if (r, c) is still in bound
c += 1;
if (c >= col) {
r += 1;
// not typing std::max with scalar_t as it could be an unsigned type
// NOTE: not necessary to check if c is less than col or overflows here,
// because i and triu_size act as a guard.
c = std::max<int64_t>(0, r + offset);
}
}
});

return result;
}
```
CPU端的代码主要逻辑是:
1. 对入参进行检查
2. 复用get_tril_size()函数,通过总维度减去下三角区域得到需要的tensor维度
3. 创建空的Tensor并赋值得到正确的输出。
GPU核心代码为:
```c++
template <typename scalar_t>
__global__
void triu_indices_kernel(scalar_t * tensor,
int64_t col_offset,
int64_t m_first_row,
int64_t col,
int64_t rectangle_size,
int64_t triu_size) {
int64_t linear_index = blockIdx.x * blockDim.x + threadIdx.x;
if (linear_index < triu_size) {
int64_t r, c;
if (linear_index < rectangle_size) {
// the coordinate is within the top rectangle
r = linear_index / col;
c = linear_index % col;
} else {
// the coordinate falls in the bottom trapezoid
get_coordinate_in_triu_trapezoid(
m_first_row, linear_index - rectangle_size, r, c);
r += rectangle_size / col;
}
c += col_offset;
tensor[linear_index] = r;
tensor[linear_index + triu_size] = c;
}
}
// Some Large test cases for the fallback binary search path is disabled by
// default to speed up CI tests and to avoid OOM error. When modifying the
// implementation, please enable them in test/test_cuda.py and make sure they
// pass on your local server.
Tensor triu_indices_cuda(
int64_t row, int64_t col, int64_t offset, c10::optional<ScalarType> dtype_opt,
c10::optional<Layout> layout_opt, c10::optional<Device> device_opt, c10::optional<bool> pin_memory_opt) {
check_args(row, col, layout_opt);
auto triu_size = row * col - get_tril_size(row, col, offset - 1);
auto tensor = empty_cuda({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt);
if (triu_size > 0) {
// # of triu elements in the first row
auto m_first_row = offset > 0 ?
std::max<int64_t>(col - offset, 0) : // upper bounded by col
col;
// size of the top rectangle
int64_t rectangle_size = 0;
if (offset < 0) {
rectangle_size = std::min<int64_t>(row, -offset) * col;
}
dim3 dim_block = cuda::getApplyBlock();
dim3 dim_grid;
// using triu_size instead of tensor.numel(), as each thread takes care of
// two elements in the tensor.
TORCH_CHECK(
cuda::getApplyGrid(triu_size, dim_grid, tensor.get_device()),
"unable to get dim grid");
AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "triu_indices_cuda", [&] {
triu_indices_kernel<<<
dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>(
tensor.data_ptr<scalar_t>(),
std::max<int64_t>(0, offset),
m_first_row,
col,
rectangle_size,
triu_size);
C10_CUDA_KERNEL_LAUNCH_CHECK();
});
}
return tensor;
}
```

上述CUDA代码是计算逻辑的GPU端实现,整体上未进行特殊的优化,计算逻辑清晰简洁。


### 使用示例

```python
>>> import torch
>>> a = torch.triu_indices(3, 3)
>>> a
tensor([[0, 0, 0, 1, 1, 2],
[0, 1, 2, 1, 2, 2]])

>>> a = torch.triu_indices(4, 3, -1)
>>> a
tensor([[0, 0, 0, 1, 1, 1, 2, 2, 3],
[0, 1, 2, 0, 1, 2, 1, 2, 2]])

>>> a = torch.triu_indices(4, 3, 1)
>>> a
tensor([[0, 0, 1],
[1, 2, 2]])
```

## NumPy

### 实现解读

调用接口为`numpy.triu_indices(n,k=0,m=None)`,n为矩阵行数,m为矩阵列数(可选),k为偏移,正数向右上方向偏移,
返回二维数组为指定元素的行列

```python
def triu_indices(n, k=0, m=None):
tri_ = ~tri(n, m, k=k - 1, dtype=bool)

return tuple(broadcast_to(inds, tri_.shape)[tri_]
for inds in indices(tri_.shape, sparse=True))
```
上述代码调用函数`tri()`获得一个n*m维矩阵,其下上角元素为True,其余元素为False. 其底层通过umath库实现。
后通过indices()函数取出此矩阵的行列下标,用broadcast_to()函数展开坐标.

### 使用示例

```python
>>> import numpy as np
>>> iu1 = np.triu_indices(4)
>>> iu2 = np.triu_indices(4, 2)

Here is how they can be used with a sample array:

>>> a = np.arange(16).reshape(4, 4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

Both for indexing:

>>> a[iu1]
array([ 0, 1, 2, ..., 10, 11, 15])

And for assigning values:

>>> a[iu1] = -1
>>> a
array([[-1, -1, -1, -1],
[ 4, -1, -1, -1],
[ 8, 9, -1, -1],
[12, 13, 14, -1]])

These cover only a small part of the whole array (two diagonals right
of the main one):

>>> a[iu2] = -10
>>> a
array([[ -1, -1, -10, -10],
[ 4, -1, -1, -10],
[ 8, 9, -1, -1],
[ 12, 13, 14, -1]])

```
使用triu_indices接口可以取出指定的对角线元素行列坐标,从而修改矩阵中指定的对角线元素值

# 四、对比分析
`numpy.triu_indices``torch.triu_indices`功能相同,但实现方式略有不同
pytorch中根据定义直接计算需要输出的下标,而numpy中使用一系列的函数巧妙地进行输出

分析numpy的实现巧妙,但是中间变量占用空间大,在规模大时会影响性能,且不支持GPU加速。

另外`paddle.tril_indices`是实现逻辑与pytorch类似,故`paddle.triu_indices`主体参考pytorch的实现思路保持代码一致性。

# 五、设计思路与实现方案

## 命名与参数设计
API设计为`paddle.triu_indices(rows, cols, offset,dtype=None)`,产生一个2行x列的二维数组存放指定上三角区域的坐标,第一行为行坐标,第二行为列坐标

参数类型要求:

- `rows``cols``offset`的类型是`int`
- 输出`Tensor`的dtype默认参数为None时使用'int64',否则以用户输入为准

## 底层OP设计

`paddle/fluid/operators/triu_indices_op.cc`添加triu_indices算子的描述,

`paddle/phi/infermeta/nullary.h`中声明形状推断的函数原型,在`paddle/phi/infermeta/nullary.cc`中实现。

```c++
void TriuIndicesInferMeta(const int& rows,
const int& cols,
const int& offset,
MetaTensor* out);
```

`paddle/phi/kernels/triu_indices_kernel.h`中声明核函数的原型

```c++
template <typename Context>
void TriuIndicesKernel( const Context& dev_ctx,
const int& rows,
const int& cols,
const int& offset,
DataType dtype,
DenseTensor* out);
```

分别在 `paddle/phi/kernels/cpu/triu_indices_kernel.cc``paddle/phi/kernels/gpu/triu_indices_kernel.cu`注册和实现核函数
实现逻辑借鉴pytorch直接计算下标。
CPU实现逻辑:计算输出数组大小,开辟空间,遍历每个位置赋值行列坐标。
GPU实现逻辑:计算输出数组大小,计算每个block负责的原始行列,按照输出数组大小进行平均的任务划分,实现每个block的赋值kernel。(目前pytorch版本的逻辑在device端实际存在一定的线程束分化,如有时间可以尝试进行优化。)

## python API实现方案

`python/paddle/tensor/creation.py`中增加`triu_indices`函数,并添加英文描述

```python
def triu_indices(row, col, offset=0, dtype='int64'):
# ...
# 参数检查
# ...
# 增加算子
# ...
return out
```
## 单测及文档填写
` python/paddle/fluid/tests/unittests/`中添加`test_triu_indices.py`文件进行单测,测试代码使用numpy计算结果后对比,与numpy对齐
` docs/api/paddle/`中添加中文API文档

# 六、测试和验收的考量

- 输入合法性及有效性检验;

- 对比与Numpy的结果的一致性:
不同情况
$(m>n || n>m || offset \in \{1-rows , cols-1\} || offset \notin \{1-rows , cols-1\})$

- CPUGPU测试。

# 七、可行性分析和排期规划
已完成主体开发,8.21前完成单元测试并提交

# 八、影响面
triu_indices是独立API,不会对其他API产生影响。

# 名词解释

# 附件及参考资料

0 comments on commit 50abd7d

Please sign in to comment.