MiMo-V2-Flash - SGLang - mtp triton attention
The article presents test results for MiMo-V2-Flash served with SGLang (MTP, Triton attention) on a 4x 6000 Blackwell workstation card setup, reporting context, prompt, output, end-to-end speed, and accuracy length for different input sizes.
Why it matters
The article gives concrete throughput figures for a multi-GPU workstation setup, showing how end-to-end speed changes as the input grows from short to long contexts.
Key Points
- Testing results for a 4x 6000 Blackwell workstation card setup
- Metrics include context, prompt, output, end-to-end speed, and accuracy length
- Tested input sizes range from 4K to 100K tokens
Details
The article reports performance metrics for a system running on 4x 6000 Blackwell workstation cards, tested across input sizes from 4K to 100K tokens. The reported metrics are context, prompt, output, end-to-end speed, and accuracy length. End-to-end speed falls as the input size grows, from 100.2 tokens/second for 4K inputs to 24.5 tokens/second for 100K inputs, roughly a 4x drop. The accuracy length stays relatively stable, ranging from 2.24 to 2.50.
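The size of the slowdown can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the two end-to-end speeds quoted above, and the names `e2e_speed` and `slowdown` are hypothetical, not taken from the article or from SGLang.

```python
# End-to-end speeds quoted in the summary, in tokens/second.
# Only the two endpoints are given; intermediate input sizes are not listed.
e2e_speed = {"4K": 100.2, "100K": 24.5}


def slowdown(speeds: dict[str, float], small: str, large: str) -> float:
    """Return how many times slower the large-input run is than the small one."""
    return speeds[small] / speeds[large]


ratio = slowdown(e2e_speed, "4K", "100K")
retained = e2e_speed["100K"] / e2e_speed["4K"]

print(f"Slowdown from 4K to 100K inputs: {ratio:.2f}x")
print(f"Share of 4K throughput retained at 100K: {retained:.1%}")
```

A 25x increase in input size costs about a 4x drop in end-to-end speed, which is the sense in which the slowdown is sublinear in input length for these two data points.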