LFM2 2.6B-Exp on Android: 40+ TPS and 32K context
I'm hugely impressed with LiquidAI's new LFM2 2.6B-Exp model, which performs at GPT-4 levels across a wide variety of benchmarks (many, though perhaps not quite most), with reasoning support as well. Try the cloud version here: https://playground.liquid.ai/chat?model=cmjdu187p00013b6o7tttjvlw

LFM2 uses a hybrid design (gated convolutions and grouped query attention), so it has a tiny KV cache footprint. This makes it capable of super smart, high-speed, long-context local inference on phones. I'm using https://huggingface.co/LiquidAI/LFM2-2.6B-Exp-GGUF with llama.cpp:

- Download LFM2-2.6B-Exp-Q4_K_M.gguf (~1.6GB).
- Get PocketPal AI or Maid from the Google Play Store or GitHub[1][2]. Or better, install Termux and compile llama.cpp with OpenCL support to utilize your phone's GPU (there is a tutorial for Adreno support). Get Termux from F-Droid or GitHub, NOT the Google Play Store -- the Play Store version is outdated and will fail to compile current llama.cpp code.
- Import the local model file and use the sampler settings recommended by Liquid AI:
  - Temperature: 0.3
  - Min-P: 0.15
  - Repetition Penalty: 1.05

Those values support the <think> tag for reasoning. If passing --jinja on the command line (optionally with --reasoning-format none to show all the reasoning tokens) doesn't get you reasoning, this system prompt will: "You are a helpful AI assistant. You always reason before responding, using the following format: <think> your internal reasoning </think> your external response."

PocketPal has GPU support on iOS via Apple's Metal API, but I don't have an iPhone, so I can't vouch for whether it achieves the 40+ tokens/second you can get with the Termux method, compiling llama.cpp with GGML_OPENCL=ON on Android.

submitted by /u/Competitive_Travel16
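As a rough sketch of the Termux route described above (the `-DGGML_OPENCL=ON` flag comes from the post; the package names and remaining CMake invocations are assumptions based on common llama.cpp build instructions, and Adreno OpenCL setup may require extra steps on some devices):

```shell
# Inside Termux (installed from F-Droid or GitHub, NOT the Play Store):
pkg update && pkg install git cmake clang

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure with the OpenCL backend so inference can run on the phone's GPU.
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release -j
```

If the OpenCL backend fails to initialize at runtime, llama.cpp falls back to CPU inference, which is noticeably slower but still works.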
Why it matters
This news highlights the impressive performance and efficiency of a large language model that can run on mobile devices, which could have significant implications for the deployment of advanced AI capabilities in real-world applications.
Key Points
- LFM2 uses a hybrid design with gated convolutions and grouped query attention, allowing for a small KV cache footprint
- The model can achieve over 40 tokens per second (TPS) with 32K context on Android devices
- The article provides instructions for downloading the model and running it on Android using Termux and llama.cpp with OpenCL support
- PocketPal AI and Maid apps are mentioned as options for running the model on Android
Details
The LFM2 2.6B-Exp model from LiquidAI is capable of performing at GPT-4 levels across a wide variety of benchmarks, with the added capability of reasoning. The model uses a hybrid design with gated convolutions and grouped query attention, which allows for a small KV cache footprint. This makes the model well-suited for high-speed, long-context inference on Android devices. The article provides instructions for downloading the model and running it on Android using the Termux app and llama.cpp with OpenCL support to utilize the phone's GPU. The author claims that this setup can achieve over 40 tokens per second (TPS) and 32K context on Android. The article also mentions PocketPal AI and Maid as alternative apps for running the model on Android.
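The recommended sampler settings map onto llama.cpp's CLI flags roughly as follows (a sketch: the sampler values, context size, and `--jinja`/`--reasoning-format` options come from the post, while the binary path and the assumption that the GGUF file sits in the current directory are illustrative):

```shell
# Interactive chat with Liquid AI's recommended sampler settings and 32K context.
./build/bin/llama-cli \
  -m LFM2-2.6B-Exp-Q4_K_M.gguf \
  --temp 0.3 \
  --min-p 0.15 \
  --repeat-penalty 1.05 \
  -c 32768 \
  --jinja
```

Adding `--reasoning-format none` leaves the model's <think> reasoning tokens visible in the output rather than stripping them.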