GPU memory grows without limit when training a model with MindSpore #302

Open

Veteranback opened this issue Aug 16, 2024 · 0 comments

Comments


Veteranback commented Aug 16, 2024

I am migrating a model from PyTorch to MindSpore. As far as I can tell, MindSpore requires you to compute the gradients yourself and then apply the update via opt(grad). During actual training, however, I found that the program's GPU memory usage grows without limit; my personal guess is that this is caused by the gradients not being updated.
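
For reference, the canonical functional training step from the MindSpore 2.x tutorials looks roughly like the sketch below (a minimal sketch for illustration; `net`, `loss_fn`, and `opt` are placeholder names, not from my code):

    import mindspore

    def forward_fn(data, label):
        logits = net(data)
        loss = loss_fn(logits, label)
        return loss, logits  # loss comes first; extras are auxiliary outputs

    # Differentiate only w.r.t. the optimizer's parameters; with has_aux=True,
    # only the first output (the loss) contributes to the gradient.
    grad_fn = mindspore.value_and_grad(forward_fn, None, opt.parameters, has_aux=True)

    def train_step(data, label):
        (loss, _), grads = grad_fn(data, label)
        opt(grads)  # apply the gradients
        return loss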

The model's forward (construct) function is as follows:

    def construct(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            detect_labels=None,
            correct_labels=None
    ):
        # Encode the inputs with BERT and take the last hidden states.
        hidden_states = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids)[0]
        detect_outputs = self.tag_detect_projection_layer(hidden_states)
        correct_outputs = self.tag_label_projection_layer(hidden_states)

        result = {
            "detect_outputs": detect_outputs,
            "correct_outputs": correct_outputs,
            "detect_loss": None,
            "correct_loss": None,
            "loss": None,
        }

        # Compute losses only for the labels that were provided.
        loss = None
        if detect_labels is not None and correct_labels is not None:
            detect_loss = self._detect_criterion(
                detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1))
            correct_loss = self._correct_criterion(
                correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1))
            loss = detect_loss + correct_loss
            result["detect_loss"] = detect_loss
            result["correct_loss"] = correct_loss
        elif detect_labels is not None:
            loss = self._detect_criterion(
                detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1))
        elif correct_labels is not None:
            loss = self._correct_criterion(
                correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1))

        result["loss"] = loss
        return result

The functions defined for training are as follows:

    def forward_fn(self, batch_data):
        detect_labels = batch_data[3]
        correct_labels = batch_data[4]
        # Run the model; it returns the result dict built in construct().
        output = self.model(batch_data[0],
                            batch_data[1],
                            batch_data[2],
                            detect_labels,
                            correct_labels)
        return output

    self.optimizer = AdamW(self.model.trainable_params(), lr=args.learning_rate, eps=args.adam_epsilon)
    # Differentiate forward_fn w.r.t. the optimizer's parameters.
    grad_fn = mindspore.value_and_grad(self.forward_fn, None, self.optimizer.parameters, has_aux=False)
    for epoch in range(1, self.epochs + 1):
        for step, batch_data in enumerate(self.train_loader):
            output, grad = grad_fn(batch_data)
            loss = output['loss'].mean()
            mindspore.ops.clip_by_norm(x=grad, max_norm=self.args.max_grad_norm)
            self.optimizer(grad)
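
For comparison, mapping the tutorial pattern onto my loop would look roughly like the sketch below (a hedged sketch under the assumption that only the scalar loss should be differentiated; note that MindSpore ops are functional rather than in-place, so the clipped gradients returned by clip_by_norm have to be captured and passed on):

    def forward_fn(self, batch_data):
        output = self.model(batch_data[0], batch_data[1], batch_data[2],
                            batch_data[3], batch_data[4])
        # Return the scalar loss first; the full output dict rides along as aux.
        return output['loss'].mean(), output

    grad_fn = mindspore.value_and_grad(self.forward_fn, None,
                                       self.optimizer.parameters, has_aux=True)
    for epoch in range(1, self.epochs + 1):
        for step, batch_data in enumerate(self.train_loader):
            (loss, output), grads = grad_fn(batch_data)
            # clip_by_norm is not in-place: keep its return value.
            grads = mindspore.ops.clip_by_norm(grads, max_norm=self.args.max_grad_norm)
            self.optimizer(grads)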

Could someone tell me whether there is a problem with my code?
