Wanda (Pruning by Weights and Activations) is a technique for compressing Large Language Models without retraining. It scores each weight by the product of its magnitude and the norm of the corresponding input activation (measured on a small calibration set), then removes the lowest-scoring weights, largely preserving model performance while reducing compute.
Proposed in 'A Simple and Effective Pruning Approach for Large Language Models' (Sun et al., 2023).
This makes it practical to run very large models on consumer hardware with minimal accuracy loss.
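The scoring rule can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: it scores each weight of a linear layer as |W_ij| times the L2 norm of input feature j over a calibration batch, then zeroes the lowest-scoring fraction within each output row (per-row comparison, as described in the paper). The function name and array shapes are assumptions for the example.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Illustrative Wanda-style pruning of one linear layer.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration input activations.
    sparsity: fraction of weights to zero within each output row.
    """
    # L2 norm of each input feature across the calibration samples.
    act_norm = np.linalg.norm(X, axis=0)            # shape: (in_features,)
    # Wanda score: weight magnitude times input activation norm.
    score = np.abs(W) * act_norm                    # shape: (out, in)
    # Indices of the lowest-scoring weights in each output row.
    k = int(W.shape[1] * sparsity)
    prune_idx = np.argsort(score, axis=1)[:, :k]
    # Zero those weights; the rest are kept unchanged (no retraining).
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, prune_idx, 0.0, axis=1)
    return W_pruned
```

Comparing scores within each output row, rather than across the whole layer, is the per-output grouping the paper uses; the activation-norm factor is what distinguishes Wanda from plain magnitude pruning.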