Scaling Transformers
Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput. In practice, scaling is controlled by several coupled variables: number of parameters, number of training tokens, model dimension, depth, attention heads, sequence length, batch size, optimizer state, hardware memory, and inference latency. A transformer can be scaled in many ways, but useful scaling is constrained by compute, memory, data quality, and optimization stability. What...
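To make the coupling between these variables concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the text above and rests on several labeled assumptions: a decoder-only model with a 4x MLP expansion, biases and normalization parameters ignored, the common rule of thumb of roughly 6 FLOPs per parameter per training token, and about 16 bytes per parameter for weights, gradients, and Adam state under mixed precision. The names `TransformerConfig`, `param_count`, `training_flops`, and `training_state_bytes` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical config holding the shape variables that drive model size."""
    d_model: int        # model (hidden) dimension
    n_layers: int       # depth
    vocab_size: int     # size of the token embedding table
    mlp_ratio: int = 4  # assumed MLP expansion factor


def param_count(cfg: TransformerConfig) -> int:
    """Approximate parameter count, ignoring biases and normalization layers."""
    # Attention: Q, K, V, and output projections, each d_model x d_model.
    # With d_head = d_model / n_heads, the head count does not change this term.
    attn = 4 * cfg.d_model ** 2
    # MLP: up-projection and down-projection.
    mlp = 2 * cfg.mlp_ratio * cfg.d_model ** 2
    # Token embeddings (assumed tied with the output layer).
    embed = cfg.vocab_size * cfg.d_model
    return cfg.n_layers * (attn + mlp) + embed


def training_flops(n_params: int, n_tokens: int) -> int:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def training_state_bytes(n_params: int, bytes_per_param: int = 16) -> int:
    """Assumed memory for weights, gradients, and Adam state (no activations)."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    cfg = TransformerConfig(d_model=1024, n_layers=24, vocab_size=50_000)
    n = param_count(cfg)
    print(f"parameters:      {n:,}")
    print(f"training FLOPs:  {training_flops(n, 10_000_000_000):.3e}")
    print(f"training state:  {training_state_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions, doubling `d_model` roughly quadruples the per-layer parameter count, while doubling `n_layers` only doubles it. Width and depth are therefore not interchangeable ways to spend a fixed memory budget. Activation memory, which grows with sequence length and batch size, is deliberately left out of this sketch.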