Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster
The article covers a comparison of llama.cpp's RPC backend and Exo's new RDMA tensor setting on a cluster of 4 Mac Studios, reaching 28.3 tokens per second (t/s) with the Kimi K2 model.
Why it matters
This testing provides insights into the performance of large language models on a Mac Studio cluster, which could be relevant for researchers and developers working on AI applications.
Key Points
- Tested llama.cpp RPC vs Exo's new RDMA Tensor setting
- Used a cluster of 4 Mac Studios (2x 512GB, 2x 256GB)
- Achieved 28.3 tokens per second with the Kimi K2 model
- Lack of a tool like llama-bench in Exo makes direct comparisons difficult
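Without a llama-bench equivalent, a throughput number like 28.3 t/s typically comes down to a wall-clock measurement: tokens generated divided by elapsed time. A minimal generic sketch (not Exo's or llama.cpp's actual code; the token count and timing below are illustrative):

```python
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    return n_tokens / elapsed_s


# Illustrative numbers: 283 tokens decoded in 10 seconds -> 28.3 t/s
start = time.monotonic()
# ... generation would happen here ...
elapsed = 10.0  # stand-in for time.monotonic() - start
print(round(tokens_per_second(283, elapsed), 1))  # → 28.3
```

Note this only captures decode speed; prompt-processing (prefill) speed is a separate figure, which is part of why the author found comparisons without a benchmarking tool difficult.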
Details
The author tested the performance of llama.cpp's RPC backend against Exo's new RDMA Tensor setting on a cluster of 4 Mac Studios provided by Apple: two with 512GB of RAM and two with 256GB. The testing focused on the Kimi K2 model, which reached a throughput of 28.3 tokens per second. However, the author noted that Exo lacks a tool comparable to llama-bench, which makes direct comparisons across context sizes and prompt-processing speeds more difficult.
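For reference, the llama.cpp side of such a setup follows the pattern documented for its RPC backend: build with RPC enabled, run `rpc-server` on each worker, and point `llama-bench` (or `llama-cli`) at the workers. A sketch under assumed values, where the IP addresses, port, and model filename are placeholders, not details from the article:

```shell
# Build llama.cpp with the RPC backend enabled
cmake -B build -DGGML_RPC=ON && cmake --build build --config Release

# On each worker Mac Studio, start an RPC server (port chosen arbitrarily)
build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the head node, run llama-bench against the workers
# (placeholder IPs and model path)
build/bin/llama-bench -m kimi-k2.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052,192.168.1.12:50052 \
  -p 512 -n 128
```

`llama-bench` reports prompt-processing and generation throughput separately for each configuration, which is the kind of controlled comparison the author found missing on the Exo side.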