From 0d9f88d0fe67119d864d4aaa5d88feb8cac57484 Mon Sep 17 00:00:00 2001
From: Pushpedra
Date: Thu, 12 Oct 2023 18:15:43 +0530
Subject: [PATCH 1/4] SparseGPT

---
 summaries/sparsegpt.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
 create mode 100644 summaries/sparsegpt.md

diff --git a/summaries/sparsegpt.md b/summaries/sparsegpt.md
new file mode 100644
index 0000000..349ad9c
--- /dev/null
+++ b/summaries/sparsegpt.md
@@ -0,0 +1,38 @@
+# SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
+
+## summary:
+
+SparseGPT is a post-training pruning method for compressing large language models such as GPT-3 efficiently and accurately. The method can be used to prune large language models in one-shot with minimal accuracy loss. For example, you can use SparseGPT to prune OPT-175B to 50% sparsity with a 0.13 decrease in perplexity. Thus, 100 billion weights from the model can be ignored at inference time, increasing the model's throughput while reducing latency. SparseGPT can be applied to a GPT model with 175B parameters on a single GPU in a couple of hours with negligible accuracy loss.
+
+## contribution:
+Large language models (LLMs) solve natural language processing problems with astounding accuracy. However, these models are enormous and require a lot of space, cost, and computation power to deploy. For example, the GPT-175B model has 175 billion parameters, requiring 320GB of storage and at least 5 A100 GPUs with 80GB of memory each for inference. This makes deployment expensive and viable only for large organizations, and out of reach for small organizations and individuals. SparseGPT addresses this by compressing such models in one shot so that they become cheaper to store and serve.
+
+## remark:
+Post-training compression is usually done by splitting the full-model compression problem into layer-wise subproblems, whose solution quality is measured in terms of the ℓ2-error between the output, for given inputs X_ℓ, of the uncompressed layer with weights W_ℓ and that of the compressed one.
+
+## methodology:
+The SparseGPT algorithm works as follows, given a fixed pruning mask:
+- Prune weights in each column of the weight matrix incrementally, using a sequence of inverse Hessians.
+
+- Update the remainder of the weights in each row, located to the right of the column being processed.
+
+SparseGPT is local because it performs weight updates after each pruning step, maintaining the input-output relationship of each layer. The heavy parameterization of GPT models makes it possible to perform these updates without any global gradient information. The cost of the reconstruction process consists of computing the initial Hessian, iterating through the inverse Hessian sequence, and pruning.
+
+The pseudocode is interpreted as:
+1. Create a pruning mask M with zeros and ones.
+2. Construct a matrix E to store the block quantization errors.
+3. Calculate the inverse Hessian sequence information.
+4. Loop over the blocks while updating the pruning mask and the error.
+5. Select the largest weights for each block and set their corresponding values in the pruning mask to 1.
+6. Prune the remaining weights and set their corresponding values in the pruning mask to 0.
+7. Compute the pruning error for each weight in the block and set the error in the E matrix.
+8. Freeze the weights that were not pruned.
+9. Update the weights in the block using the Cholesky decomposition of the inverse Hessian matrix.
+10. Update the weights not updated in the previous loop after processing all the blocks.
+11. Set pruned weights to 0 by element-wise multiplying the weight matrix with the pruning mask.
+
+## two cents:
+SparseGPT solves the row-Hessian challenge by reusing Hessian information across rows with distinct pruning masks, leading to an accurate and efficient algorithm.
+
+With this method, large-scale generative pretrained Transformer-family models can be compressed to high sparsity via weight pruning in one shot, without any retraining, at low loss of accuracy as measured by both perplexity and zero-shot performance.
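
To make the pseudocode in the patch above concrete, here is a minimal NumPy sketch of the blocked prune-and-update loop it describes. The function name `sparsegpt_prune`, the per-block quantile used to pick the mask, and the `percdamp`/`blocksize` defaults are illustrative assumptions for this sketch, not the paper's reference implementation; `X` is assumed to hold calibration inputs of shape `(cols, n_samples)`.

```python
import numpy as np

def sparsegpt_prune(W, X, sparsity=0.5, blocksize=128, percdamp=0.01):
    """One-shot unstructured pruning of a linear layer with a blocked,
    SparseGPT-style update. W: (rows, cols) weights, X: (cols, n_samples) inputs."""
    W = W.astype(np.float64)          # work on a float copy of the weights
    rows, cols = W.shape

    # Hessian of the layer-wise reconstruction error, with dampening for stability.
    H = X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper Cholesky factor of the inverse Hessian; its rows give the update directions.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    mask = np.ones_like(W, dtype=bool)  # True = keep the weight
    for i in range(0, cols, blocksize):
        j_end = min(i + blocksize, cols)
        E = np.zeros((rows, j_end - i))  # pruning errors accumulated for this block

        # Steps 5-6: choose weights to prune in this block by saliency w^2 / [Hinv]_cc^2.
        saliency = W[:, i:j_end] ** 2 / np.diag(Hinv)[i:j_end] ** 2
        mask[:, i:j_end] = saliency > np.quantile(saliency, sparsity)

        for j in range(i, j_end):
            # Step 7: error incurred by zeroing the pruned weights of column j.
            err = (W[:, j] / Hinv[j, j]) * ~mask[:, j]
            E[:, j - i] = err
            # Step 9: compensate the not-yet-processed columns inside the block.
            W[:, j:j_end] -= np.outer(err, Hinv[j, j:j_end])

        # Step 10: lazy batch update of all columns to the right of the block.
        W[:, j_end:] -= E @ Hinv[i:j_end, j_end:]

    # Step 11: zero out the pruned weights.
    return W * mask
```

For a layer with weights `W` of shape `(rows, cols)` and calibration inputs `X` of shape `(cols, n_samples)`, `sparsegpt_prune(W, X, sparsity=0.5)` returns a copy of `W` with roughly half of its weights set to zero while approximately preserving the layer output `W @ X`.
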
From 26dcbb872a6f885635a692068db464cecb20d817 Mon Sep 17 00:00:00 2001
From: Pushpedra
Date: Sun, 29 Oct 2023 11:07:18 +0530
Subject: [PATCH 2/4] message

---
 summaries/sparsegpt.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/summaries/sparsegpt.md b/summaries/sparsegpt.md
index 349ad9c..02c0dc6 100644
--- a/summaries/sparsegpt.md
+++ b/summaries/sparsegpt.md
@@ -1,5 +1,6 @@
 # SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
-
+## AUTHOR:
+Elias Frantar, Dan Alistarh
 ## summary:
 
 SparseGPT is a post-training pruning method for compressing large language models such as GPT-3 efficiently and accurately. The method can be used to prune large language models in one-shot with minimal accuracy loss. For example, you can use SparseGPT to prune OPT-175B to 50% sparsity with a 0.13 decrease in perplexity. Thus, 100 billion weights from the model can be ignored at inference time, increasing the model's throughput while reducing latency. SparseGPT can be applied to a GPT model with 175B parameters on a single GPU in a couple of hours with negligible accuracy loss.
@@ -35,4 +36,6 @@
 SparseGPT solves the row-Hessian challenge by reusing Hessian information across rows with distinct pruning masks, leading to an accurate and efficient algorithm.
 
 With this method, large-scale generative pretrained Transformer-family models can be compressed to high sparsity via weight pruning in one shot, without any retraining, at low loss of accuracy as measured by both perplexity and zero-shot performance.
 
+## RESOURCE
+https://paperswithcode.com/paper/massive-language-models-can-be-accurately
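
Written out, the layer-wise subproblem that the remark section refers to is the following objective (notation reconstructed from the description above: W_ℓ are the dense layer weights, X_ℓ the layer inputs, M_ℓ the sparsity mask, Ŵ_ℓ the remaining weights, and ⊙ element-wise multiplication):

```math
\operatorname*{argmin}_{\text{mask } \mathbf{M}_\ell,\; \widehat{\mathbf{W}}_\ell}
\left\lVert \mathbf{W}_\ell \mathbf{X}_\ell - \left(\mathbf{M}_\ell \odot \widehat{\mathbf{W}}_\ell\right) \mathbf{X}_\ell \right\rVert_2^2
```
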
From 082525bdd21906d6763cb4762793023bbab90220 Mon Sep 17 00:00:00 2001
From: Pushpendra Mishra <133359466+love-mishra@users.noreply.github.com>
Date: Thu, 28 Dec 2023 10:10:02 +0530
Subject: [PATCH 3/4] Update sparsegpt.md

---
 summaries/sparsegpt.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/summaries/sparsegpt.md b/summaries/sparsegpt.md
index 02c0dc6..7cb2c8d 100644
--- a/summaries/sparsegpt.md
+++ b/summaries/sparsegpt.md
@@ -1,14 +1,14 @@
 # SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot
 ## AUTHOR:
 Elias Frantar, Dan Alistarh
-## summary:
+## Summary:
 
 SparseGPT is a post-training pruning method for compressing large language models such as GPT-3 efficiently and accurately. The method can be used to prune large language models in one-shot with minimal accuracy loss. For example, you can use SparseGPT to prune OPT-175B to 50% sparsity with a 0.13 decrease in perplexity. Thus, 100 billion weights from the model can be ignored at inference time, increasing the model's throughput while reducing latency. SparseGPT can be applied to a GPT model with 175B parameters on a single GPU in a couple of hours with negligible accuracy loss.
 
-## contribution:
+## Contribution:
 Large language models (LLMs) solve natural language processing problems with astounding accuracy. However, these models are enormous and require a lot of space, cost, and computation power to deploy. For example, the GPT-175B model has 175 billion parameters, requiring 320GB of storage and at least 5 A100 GPUs with 80GB of memory each for inference. This makes deployment expensive and viable only for large organizations, and out of reach for small organizations and individuals. SparseGPT addresses this by compressing such models in one shot so that they become cheaper to store and serve.
 
-## remark:
+## Remark:
 Post-training compression is usually done by splitting the full-model compression problem into layer-wise subproblems, whose solution quality is measured in terms of the ℓ2-error between the output, for given inputs X_ℓ, of the uncompressed layer with weights W_ℓ and that of the compressed one.
 
 ## methodology:
 The SparseGPT algorithm works as follows, given a fixed pruning mask:
 - Prune weights in each column of the weight matrix incrementally, using a sequence of inverse Hessians.
@@ -31,7 +31,7 @@ The pseudocode is interpreted as:
 10. Update the weights not updated in the previous loop after processing all the blocks.
 11. Set pruned weights to 0 by element-wise multiplying the weight matrix with the pruning mask.
 
-## two cents:
+## Two cents:
 SparseGPT solves the row-Hessian challenge by reusing Hessian information across rows with distinct pruning masks, leading to an accurate and efficient algorithm.
 
 With this method, large-scale generative pretrained Transformer-family models can be compressed to high sparsity via weight pruning in one shot, without any retraining, at low loss of accuracy as measured by both perplexity and zero-shot performance.

From ae1b99b76613c205adfd892acc84a802ea6e6db9 Mon Sep 17 00:00:00 2001
From: Pushpendra Mishra <133359466+love-mishra@users.noreply.github.com>
Date: Tue, 21 May 2024 20:01:12 +0530
Subject: [PATCH 4/4] Update sparsegpt.md

---
 summaries/sparsegpt.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/summaries/sparsegpt.md b/summaries/sparsegpt.md
index 7cb2c8d..fe0f913 100644
--- a/summaries/sparsegpt.md
+++ b/summaries/sparsegpt.md
@@ -19,16 +19,27 @@ The SparseGPT algorithm works as follows, given a fixed pruning mask:
 SparseGPT is local because it performs weight updates after each pruning step, maintaining the input-output relationship of each layer. The heavy parameterization of GPT models makes it possible to perform these updates without any global gradient information. The cost of the reconstruction process consists of computing the initial Hessian, iterating through the inverse Hessian sequence, and pruning.
 
 The pseudocode is interpreted as:
+
 1. Create a pruning mask M with zeros and ones.
+
 2. Construct a matrix E to store the block quantization errors.
+
 3. Calculate the inverse Hessian sequence information.
+
 4. Loop over the blocks while updating the pruning mask and the error.
+
 5. Select the largest weights for each block and set their corresponding values in the pruning mask to 1.
+
 6. Prune the remaining weights and set their corresponding values in the pruning mask to 0.
+
 7. Compute the pruning error for each weight in the block and set the error in the E matrix.
+
 8. Freeze the weights that were not pruned.
+
 9. Update the weights in the block using the Cholesky decomposition of the inverse Hessian matrix.
+
 10. Update the weights not updated in the previous loop after processing all the blocks.
+
 11. Set pruned weights to 0 by element-wise multiplying the weight matrix with the pruning mask.
 
 ## Two cents: