Gradient from taichi ndarray does not match torch for euclidean distance (latest nightly) #8101

Closed
oliver-batchelor opened this issue May 29, 2023 · 6 comments

@oliver-batchelor
Contributor

oliver-batchelor commented May 29, 2023

Hi,

Below is an example that uses a Taichi kernel on torch tensors with autograd, on the most recent nightly (taichi-nightly-1.7.0.post20230529), the latest torch from pip (2.0.1), Python 3.10.

The outputs (distances) are the same in both cases.

Gradients from torch (x.grad):

tensor([[-0.5774, -0.5774, -0.5774],
        [-0.5774, -0.5774, -0.5774]]) 

Gradients from taichi (x.grad):

tensor([[-1.1547, -1.1547, -1.1547],
        [-1.1547, -1.1547, -1.1547]]) 
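
For reference, the analytic gradient of ||x - y|| with respect to x is (x - y) / ||x - y||; here each row of x - y is [-6, -6, -6] with norm 6*sqrt(3), so each component should be -1/sqrt(3) ≈ -0.5774. The torch value matches that, and the taichi value is exactly double. A quick sanity check, separate from the script below:

  import math
  # each component of the expected gradient: -6 / (6 * sqrt(3)) = -1/sqrt(3)
  print(-1 / math.sqrt(3))   # -0.5773502691896258  (matches torch)
  print(-2 / math.sqrt(3))   # -1.1547005383792515  (matches taichi, i.e. doubled)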

A couple of other problems:

  • Running this with CPU tensors (and arch=ti.cpu) causes a memory allocation crash.
  • There does not seem to be an error if I don't call .contiguous() on grad_output - the result just seems (even more) different.

I am only moderately confident I'm using this the right way, so please excuse me if I'm abusing the system somehow!

import taichi as ti
import torch
from taichi.types import ndarray
device=torch.device('cuda', 0)

@ti.kernel
def pairwise_distances_kernel(x:ndarray(ti.math.vec3), y:ndarray(ti.math.vec3), distances:ndarray(ti.f32)):
  # euclidean distance between corresponding points; the top-level loop is parallelized by taichi
  for i in range(x.shape[0]):
    distances[i] = (x[i] - y[i]).norm()
  
class PointDistance(torch.autograd.Function):
  @staticmethod
  def forward(ctx, x:torch.Tensor, y:torch.Tensor):
      distances = torch.zeros((x.shape[0],), device=x.device, dtype=torch.float32)
      pairwise_distances_kernel(x, y, distances)
      ctx.save_for_backward(x, y, distances)
      return distances

  @staticmethod
  def backward(ctx, grad_output):
      x, y, distances = ctx.saved_tensors
      # pass the upstream gradient to taichi through the .grad attribute
      distances.grad = grad_output.contiguous()
      pairwise_distances_kernel.grad(x, y, distances)
      return x.grad, y.grad

def inputs():
  x = torch.arange(0, 6).reshape(2, 3).float().requires_grad_(True)
  y = torch.arange(6, 12).reshape(2, 3).float().requires_grad_(True)  
  return x, y

def test_torch():
  x, y = inputs() 
  distances = torch.norm(x - y, p=2, dim=1)
  loss = distances.sum()
  loss.backward()
  print(x.grad, y.grad)

def test_taichi():
  x, y = inputs()
  distances = PointDistance.apply(x, y)
  loss = distances.sum()
  loss.backward()
  print(x.grad, y.grad)

if __name__ == "__main__":
  ti.init(arch=ti.cuda, debug=True, log_level=ti.DEBUG)
  test_torch()
  test_taichi()

@ailzhang
Contributor

ailzhang commented Jun 2, 2023

@oliver-batchelor The reason Taichi returns a doubled result is this part:


  @staticmethod
  def backward(ctx, grad_output):
      x, y, distances = ctx.saved_tensors
      distances.grad = grad_output.contiguous()      
      pairwise_distances_kernel.grad(x, y, distances)
      return x.grad, y.grad

After running the pairwise_distances_kernel.grad() kernel, the gradients have already been accumulated on x.grad and y.grad. If you print them at that point you'll see the values are actually correct.

But when you return x.grad and y.grad to torch, it accumulates them into the inputs' gradients again. That's why the results you saw are doubled.

The root cause is that torch expects the backward function to have no side effects on gradients (it should only return them), whereas running the Taichi grad kernel does have a side effect.
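
To see this accumulation behaviour in isolation, here is a minimal sketch that has nothing to do with taichi (the Doubler name is made up just for illustration): a backward that both writes to the leaf's .grad and returns that same tensor ends up with doubled gradients.

  import torch

  class Doubler(torch.autograd.Function):
      @staticmethod
      def forward(ctx, x):
          ctx.save_for_backward(x)
          return x.sum()

      @staticmethod
      def backward(ctx, grad_output):
          x, = ctx.saved_tensors
          # side effect: write the gradient straight onto the leaf ...
          x.grad = torch.ones_like(x) * grad_output
          # ... and also return it, so autograd accumulates it a second time
          return x.grad

  x = torch.zeros(3, requires_grad=True)
  Doubler.apply(x).backward()
  print(x.grad)   # tensor([2., 2., 2.]) instead of the expected ones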

So there are two ways to work around it: undo the Taichi side effect before returning, or just return zero gradients to torch.
E.g.

  # This may work for this example, but manipulating .grad directly is dangerous, so the solution below is recommended.
  @staticmethod
  def backward(ctx, grad_output):
      x, y, distances = ctx.saved_tensors
      distances.grad = grad_output.contiguous()
      x.grad.fill_(0)
      y.grad.fill_(0)
      pairwise_distances_kernel.grad(x, y, distances)
      x_grad, y_grad = x.grad.clone(), y.grad.clone()
      x.grad.fill_(0)
      y.grad.fill_(0)
      print(x_grad, y_grad)
      return x_grad, y_grad
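
(The fill_(0) calls before and after the kernel do different jobs: the ones before clear any previously accumulated values so the taichi grad kernel starts from zero, and the ones after clear the side effect so that torch's accumulation of the returned clones does not count the gradient twice.)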

or

  # recommended
  @staticmethod
  def backward(ctx, grad_output):
      x, y, distances = ctx.saved_tensors
      distances.grad = grad_output.contiguous()      
      pairwise_distances_kernel.grad(x, y, distances)
      return torch.zeros_like(x), torch.zeros_like(y)
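
(Returning zeros relies on x and y being leaf tensors: the taichi grad kernel has already written the correct gradients into x.grad and y.grad, so accumulating the returned zeros on top of them is a no-op.)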

@oliver-batchelor
Contributor Author

oliver-batchelor commented Jun 7, 2023

Ah ha, thanks!

It did seem a little weird to me to be returning .grad, because normally the .grad attribute is None at that point.

I must admit I just followed the recipe from here (which I assume also has doubled gradients) - but that wouldn't necessarily cause big problems, just a doubled learning rate... depending on what you use it for.

https://github.com/taichi-dev/taichi-nerfs/blob/main/modules/hash_encoder.py

@bobcao3
Collaborator

bobcao3 commented Jun 14, 2023

We might want to fix the taichi-nerfs one to ensure hyperparameter compatibility, hmmm. But I will mark this one as resolved.

@oliver-batchelor
Contributor Author

From further testing, both of these methods seem flawed - I'll create some test cases, but in essence:

  1. With the first method, the direct predecessor node(s) in the graph always seem to end up with zero grad in this case.
  2. With the second method, the direct predecessor nodes have the correct grad, but the grad further up the graph is all zero.

It seems it's not safe to mess with .grad, nor is it safe to return zeros!

@oliver-batchelor
Contributor Author

On second glance I think method 1 actually does the right thing: normally the direct predecessor node won't have a grad anyway unless retain_grad is set to True, so it matches the pytorch behaviour here. The only difference is that in this case it will have zeros as grad rather than None, which is what it would be under pytorch autograd.
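
For reference, the retain_grad behaviour in question is just standard pytorch (nothing taichi-specific):

  import torch

  a = torch.ones(3, requires_grad=True)
  b = a * 2.0              # intermediate (non-leaf) node
  b.sum().backward()
  print(a.grad)            # tensor([2., 2., 2.]) - leaves get .grad populated
  print(b.grad)            # None - non-leaf .grad is not kept by default
  # (recent torch versions also warn about accessing a non-leaf .grad;
  #  calling b.retain_grad() before backward() would keep b.grad too)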

@oliver-batchelor
Contributor Author

Is it possible to create another attribute on the tensor object, like taichi_grad or some such, so that we don't have to mess with the torch .grad and thus avoid all these problems?
