Is VRAM the only bottleneck, or is processing power also insufficient to run top models on a single GPU?
From what I understand, VRAM is usually the primary bottleneck when working with these models. For example, running the 405B-parameter LLaMA 3.1 model or the 671B-parameter DeepSeek R1 can require hundreds of GBs of VRAM, which is why setups with multiple GPUs (like a dozen A100s or H100s) are common.
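To show where I'm getting "hundreds of GBs" from, here is the rough math I'm doing for the weights alone (ignoring KV cache and overhead); the bytes-per-parameter figures for each precision are my assumptions:

```python
# Rough memory footprint of model weights only (no KV cache, activations,
# or framework overhead). Bytes-per-parameter values are assumed for
# common precisions.
def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    return num_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

for name, params_b in [("LLaMA 3.1 405B", 405), ("DeepSeek R1 671B", 671)]:
    for precision, bpp in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_memory_gb(params_b, bpp):,.0f} GB")
```

Even at FP8, DeepSeek R1's weights alone come out to roughly 671 GB by this estimate, which is far beyond any single consumer card.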
Is processing power (TFLOPS, CUDA core count, etc.) also insufficient to run these models on a single top-end consumer GPU, even if it were equipped with hundreds of GBs of VRAM?
For instance, if an RTX 5090 had 1 TB of VRAM but otherwise identical hardware, roughly how many tokens per second could it achieve when running the DeepSeek R1 671B parameter model?
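Here is my own napkin math for that hypothetical, assuming single-stream decoding is memory-bandwidth bound (each token requires reading the active weights from VRAM once); the bandwidth figure, DeepSeek R1's active-parameter count per token (it's an MoE model), and the FP8 assumption are all my guesses, so please correct them if they're off:

```python
# Napkin estimate of an upper bound on single-stream decode speed, assuming
# decoding is memory-bandwidth bound. All numbers below are assumptions:
#   - RTX 5090 memory bandwidth: ~1800 GB/s (GDDR7)
#   - DeepSeek R1 active parameters per token (MoE routing): ~37B
#   - FP8 weights: 1 byte per parameter
bandwidth_gb_s = 1800     # assumed memory bandwidth of an RTX 5090
active_params_b = 37      # assumed active parameters read per generated token
bytes_per_param = 1       # assumed FP8 quantization

gb_read_per_token = active_params_b * bytes_per_param
tokens_per_second = bandwidth_gb_s / gb_read_per_token
print(f"Bandwidth-bound upper bound: ~{tokens_per_second:.0f} tokens/s")
```

By that logic I'd expect something on the order of a few dozen tokens per second as a ceiling, with compute mattering much less for batch-size-1 decoding. Is that the right way to think about it, or would compute (or something else) become the limiting factor in practice?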