Consider this hypothetical scenario: if you were given $100,000 to build a PC/server to run open-source LLMs like LLaMA 3 for single-user purposes, what would you build?
Depends on what you’re doing with it, but prompt/context processing is much faster on Nvidia GPUs than on Apple chips. The gap narrows if you reuse the same prompt prefix every time, since a cached prefix doesn’t need to be reprocessed.
The time to first token is a lot faster on datacenter GPUs, especially as context length increases, and consumer GPUs don’t have enough VRAM.