Techniques

Quantization

Also known as: quantisation

In one line

Compressing a model by using lower-precision numbers, so it runs faster and fits on smaller hardware.

What does Quantization mean?

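Quantization stores a model's weights, and sometimes its activations, in fewer bits than the 16-bit or 32-bit floating-point numbers typically used in training. Quantizing a 70B-parameter model from 16-bit to 4-bit shrinks its weights about 4x, from roughly 140 GB to roughly 35 GB, usually with modest quality loss. That is enough to bring it within reach of a high-memory laptop, and to bring smaller models within reach of edge devices.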
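To make "lower-precision numbers" concrete, here is a minimal sketch of one simple scheme: symmetric round-to-nearest quantization with a single shared scale. It is an illustration only, not how any particular library or model format does it; the function names `quantize` and `dequantize` and the sample weights are made up for this example.

```python
def quantize(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest step, then clamp to the allowed range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -1.40, 0.07]     # made-up values for illustration
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale)
print("max error:", max_error)
```

Only the small integer codes and the scale need to be stored. The gap between `weights` and `restored` is the source of the quality loss: fewer bits means coarser steps and larger rounding error.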

Example: running Llama 3.1 8B in 4-bit quantized form on a MacBook.
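The size savings follow from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below assumes nominal parameter counts (70 billion and 8 billion) and counts weight storage only. Real 4-bit formats also store scales and other metadata, so they use slightly more than 4 bits per weight, and running a model needs extra memory beyond the weights. The helper name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; actual models differ slightly.
for name, params in [("70B", 70e9), ("8B", 8e9)]:
    fp16 = weight_size_gb(params, 16)
    int4 = weight_size_gb(params, 4)
    print(f"{name}: {fp16:.0f} GB at 16-bit, {int4:.0f} GB at 4-bit, "
          f"{fp16 / int4:.0f}x smaller")
```

By this estimate an 8B model drops from about 16 GB to about 4 GB of weights, which is why it can fit in the memory of an ordinary laptop.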

Related terms

Large language model (LLM): A neural network trained on huge text collections to predict the next word. It is the engine behind ChatGPT, Claude and Gemini.

Inference: The act of running a trained model to get an answer, as opposed to training it.

Open-weight model: A model whose weights are freely downloadable, so anyone can run or modify it.