LFM2 2.6B-Exp on Android: 40+ TPS and 32K context
I'm hugely impressed with LiquidAI's new LFM2 2.6B-Exp model, which performs at GPT-4 levels across a wide variety of benchmarks (many, though perhaps not quite most), with reasoning support as well. Try the cloud version here: https://playground.liquid.ai/chat?model=cmjdu187p00013b6o7tttjvlw

LFM2 uses a hybrid design (gated convolutions and grouped query attention), so it has a tiny KV cache footprint. This makes it capable of super smart, high-speed, long-context local inference on phones. I'm using https://huggingface.co/LiquidAI/LFM2-2.6B-Exp-GGUF with llama.cpp:

- Download LFM2-2.6B-Exp-Q4_K_M.gguf (~1.6GB).
- Get PocketPal AI or Maid from the Google Play Store or GitHub[1][2]. Or better, install Termux and compile llama.cpp with OpenCL support to utilize your phone's GPU (there is a tutorial for Adreno support). Get Termux from F-Droid or GitHub, NOT the Google Play Store -- the Play Store version is outdated and will fail to compile current llama.cpp code.
- Import the local model file and use the sampler settings recommended by Liquid AI:
  - Temperature: 0.3
  - Min-P: 0.15
  - Repetition Penalty: 1.05

Those values support the <think> tag for reasoning. If passing --jinja on the command line (optionally with --reasoning-format none to show all the reasoning tokens) doesn't get you reasoning, this system prompt will: "You are a helpful AI assistant. You always reason before responding, using the following format: <think> your internal reasoning </think> your external response."

PocketPal has GPU support on iOS via Apple's Metal API, but I don't have an iPhone, so I can't vouch for whether it achieves the 40+ tokens/second you can get with the Termux method, compiling llama.cpp with GGML_OPENCL=ON on Android.

submitted by /u/Competitive_Travel16
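As a rough sketch of the Termux route described above (the `-DGGML_OPENCL=ON` flag comes from the post; the package names and remaining CMake invocations are assumptions based on common llama.cpp build instructions, and Adreno OpenCL setup may require extra steps on some devices):

```shell
# Inside Termux (installed from F-Droid or GitHub, NOT the Play Store):
pkg update && pkg install git cmake clang

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure with the OpenCL backend so inference can run on the phone's GPU.
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release -j
```

If the OpenCL backend fails to initialize at runtime, llama.cpp falls back to CPU inference, which is noticeably slower but still works.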
Why it matters
This news highlights the impressive performance and efficiency of a large language model that can run on mobile devices, which could have significant implications for the deployment of advanced AI capabilities in real-world applications.
Key Points
- LFM2 uses a hybrid design with gated convolutions and grouped query attention, allowing for a small KV cache footprint
- The model can achieve over 40 tokens per second (TPS) with 32K context on Android devices
- The article provides instructions for downloading the model and running it on Android using Termux and llama.cpp with OpenCL support
- PocketPal AI and Maid apps are mentioned as options for running the model on Android
Details
The LFM2 2.6B-Exp model from LiquidAI is capable of performing at GPT-4 levels across a wide variety of benchmarks, with the added capability of reasoning. The model uses a hybrid design with gated convolutions and grouped query attention, which allows for a small KV cache footprint. This makes the model well-suited for high-speed, long-context inference on Android devices. The article provides instructions for downloading the model and running it on Android using the Termux app and llama.cpp with OpenCL support to utilize the phone's GPU. The author claims that this setup can achieve over 40 tokens per second (TPS) and 32K context on Android. The article also mentions PocketPal AI and Maid as alternative apps for running the model on Android.
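The recommended sampler settings map onto llama.cpp's CLI flags roughly as follows (a sketch: the sampler values, context size, and `--jinja`/`--reasoning-format` options come from the post, while the binary path and the assumption that the GGUF file sits in the current directory are illustrative):

```shell
# Interactive chat with Liquid AI's recommended sampler settings and 32K context.
./build/bin/llama-cli \
  -m LFM2-2.6B-Exp-Q4_K_M.gguf \
  --temp 0.3 \
  --min-p 0.15 \
  --repeat-penalty 1.05 \
  -c 32768 \
  --jinja
```

Adding `--reasoning-format none` leaves the model's <think> reasoning tokens visible in the output rather than stripping them.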