Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster

The article covers a test of llama.cpp's RPC backend against Exo's new RDMA tensor setting on a cluster of four Mac Studios, reaching 28.3 tokens per second (t/s) with the Kimi K2 Thinking model.

💡 Why it matters

These results show what distributed inference on a Mac Studio cluster can deliver for very large models, which is relevant for researchers and developers running LLMs on Apple-silicon hardware.

Key Points

  • Tested llama.cpp RPC vs Exo's new RDMA tensor setting
  • Used a cluster of 4 Mac Studios (2x 512GB, 2x 256GB)
  • Achieved 28.3 tokens per second with the Kimi K2 model
  • Lack of a tool like llama-bench in Exo makes direct comparisons difficult

Details

The author tested the performance of llama.cpp's RPC backend and Exo's new RDMA tensor setting on a cluster of four Mac Studios provided by Apple: two with 512GB of RAM and two with 256GB. Running the Kimi K2 model, the cluster reached a decode throughput of 28.3 tokens per second. The author noted, however, that Exo lacks a tool like llama-bench, which makes direct comparisons across context sizes and prompt-processing speeds more difficult.
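Since Exo has no llama-bench equivalent, a throughput figure like 28.3 t/s has to be derived by timing the decode stream directly. A minimal sketch of that calculation, assuming you can record a timestamp as each token arrives (the `decode_throughput` helper and the synthetic timestamps are illustrative, not part of either project's API):

```python
import time


def decode_throughput(token_times):
    """Decode tokens per second, given one timestamp per generated token."""
    if len(token_times) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = token_times[-1] - token_times[0]
    # Intervals between tokens, not token count, so the first token
    # (which includes prompt processing) doesn't skew the decode rate.
    return (len(token_times) - 1) / elapsed


# Synthetic example: 283 token intervals spread evenly over 10 seconds.
timestamps = [i / 28.3 for i in range(284)]
print(f"{decode_throughput(timestamps):.1f} t/s")  # → 28.3 t/s
```

In a real run you would call `time.monotonic()` in the token callback of whichever client you use; separating prompt processing (time to first token) from decode throughput is exactly the distinction llama-bench makes that the author misses in Exo.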


AI Curator - Daily AI News Curation
