Microbenchmarking Apple M1 Max - MLX vs GGUF - LLM Qwen 2.5
MLX is an ML framework targeted at Apple Silicon. It provides noticeable performance gains compared to running the more common GGUF models on the same Apple Silicon hardware. The MLX project describes MLX as:
MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
A notable difference from MLX and other frameworks is the unified memory model. Arrays in MLX live in shared memory. Operations on MLX arrays can be performed on any of the supported device types without transferring data.

LM Studio added support for Apple Silicon MLX models in 2024. I totally ignored it until I saw a 2025/02 Reddit post in the /r/ollama subreddit. I wanted to run their microbenchmark on my Mac to get a feel for the possible performance difference. The performance improvement is exciting. I am holding off on really jumping into MLX until Ollama supports it, something they are working on as of 2025/02.
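The unified memory model is easy to see in code. Here is a minimal sketch based on the example in the MLX README (it assumes mlx is installed via pip install mlx; the array sizes are arbitrary):

```python
import mlx.core as mx

# Arrays are allocated in unified memory, visible to both CPU and GPU.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same arrays can be used by operations on either device,
# with no explicit host/device transfers.
c = mx.add(a, b, stream=mx.cpu)
d = mx.matmul(a, b, stream=mx.gpu)

# MLX evaluates lazily; force both computations to run.
mx.eval(c, d)
```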
Test Bed
Machine
Apple M1 Max 10-Core CPU; 64GB Unified Memory; 1TB Solid State Drive; 32-Core GPU/16-Core Neural Engine
Software
LLM execution platforms:
- Ollama for GGUF models
- llm (LM Studio) for MLX models
Test Results
| Model | Type | App | tokens/sec | tokens/sec | App | Type | Model |
|---|---|---|---|---|---|---|---|
| mlx-community/Qwen2.5-7B-Instruct-4bit | MLX | llm | 63.7 | 40.75 | Ollama | GGUF | qwen2.5:7b |
| mlx-community/Qwen2.5-14B-Instruct-4bit | MLX | llm | 27.8 | 21.7 | Ollama | GGUF | qwen2.5:14b |
| mlx-community/Qwen2.5-32B-Instruct-4bit | MLX | llm | 10.92 | 8.5 | Ollama | GGUF | qwen2.5:32b |
Commands
Ollama
I used Ollama to test the GGUF models; I already had it installed for use with a VS Code assistant. Setting the "stream" option to false causes the API to return a single JSON object, including the timing statistics, instead of streaming the output token by token. This assumes the model has already been downloaded and is available: I installed Ollama and pulled each model with ollama pull <model> before running the tests.
curl http://localhost:11434/api/generate -d '{"model":"qwen2.5:32b","prompt":"write a 500 word short story", "stream":false}'
Result Contains
- eval_count
- eval_duration
Example
- "eval_count":735,"eval_duration":86329000000
- 735 * 10 ^ 9 / 86329000000
- 8.51 tokens per second
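To skip the by-hand arithmetic, the same request can be scripted. A minimal sketch, assuming Ollama is serving on its default port (11434) and the model has already been pulled; ollama_tokens_per_sec is a hypothetical helper name, not part of the Ollama API:

```python
import json
import urllib.request

def ollama_tokens_per_sec(model, prompt="write a 500 word short story"):
    """Run one non-streaming generation and compute tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_duration is in nanoseconds, so scale the token count by 10^9.
    return stats["eval_count"] * 1e9 / stats["eval_duration"]

print(f'{ollama_tokens_per_sec("qwen2.5:32b"):.2f} tokens/sec')
```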
llm
llm mlx download-model mlx-community/Qwen2.5-7B-Instruct-4bit
llm -m mlx-community/Qwen2.5-7B-Instruct-4bit 'write a 500 word short story'
llm logs -c --json
Result Contains
"generation_tps": 63.73...
References
- https://github.com/itsmostafa/inference-speed-tests
- https://github.com/ollama/ollama
- https://github.com/ollama/ollama/blob/main/docs/api.md
- https://github.com/lmstudio-ai/mlx-engine
- https://github.com/ml-explore/mlx