Microbenchmarking Apple M1 Max - MLX vs GGUF - LLM Qwen 2.5


MLX is an ML framework targeted at Apple Silicon.  It provides noticeable performance gains when compared to the standard GGUF-based tooling running on Apple Silicon.  The MLX project describes itself as:  

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
A notable difference from MLX and other frameworks is the unified memory model. Arrays in MLX live in shared memory. Operations on MLX arrays can be performed on any of the supported device types without transferring data.
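
To make the unified memory point concrete, here is a minimal MLX sketch (assuming the mlx Python package is installed; the array names and shapes are just illustrative) in which the same arrays are used by operations dispatched to the CPU and the GPU without any copies:

    # Minimal sketch of MLX's unified memory model (assumes `pip install mlx`).
    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))

    # The same arrays are visible to both devices; each op just picks a device/stream.
    c_cpu = mx.add(a, b, stream=mx.cpu)   # run on the CPU
    c_gpu = mx.add(a, b, stream=mx.gpu)   # run on the GPU, no data transfer needed

    mx.eval(c_cpu, c_gpu)                 # MLX is lazy, so force evaluation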

LM Studio added support for Apple Silicon MLX models in 2024. I totally ignored it until I saw a 2025/02 Reddit post in the /r/ollama subreddit.  I wanted to run their microbenchmark on my Mac to get a feel for the possible performance difference.  The performance improvement is exciting.  I am holding off on really jumping into MLX until Ollama supports it, something they were working on as of 2025/02.

Test Bed

The general consensus is that later Apple M-series processors benefit even more than the M1, which has great memory bandwidth but is slightly under-spec'd on the compute side to make full use of that bandwidth.

Machine

Apple M1 Max 10-Core CPU; 64GB Unified Memory; 1TB Solid State Drive; 32-Core GPU/16-Core Neural Engine

Software

LLM execution platforms.

  • Ollama for GGUF models
  • llm (with its MLX plugin) for MLX models

Test Results

MLX Model (run with llm) | MLX tokens/sec | GGUF Model (run with Ollama) | GGUF tokens/sec
mlx-community/Qwen2.5-7B-Instruct-4bit | 63.7 | qwen2.5:7b | 40.75
mlx-community/Qwen2.5-14B-Instruct-4bit | 27.8 | qwen2.5:14b | 21.7
mlx-community/Qwen2.5-32B-Instruct-4bit | 10.92 | qwen2.5:32b | 8.5
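
For a rough sense of the relative gain, this small Python snippet computes the MLX-over-GGUF speedup from the measurements in the table:

    # Relative MLX vs GGUF throughput from the measured tokens/sec above.
    results = {
        "7b":  (63.7, 40.75),
        "14b": (27.8, 21.7),
        "32b": (10.92, 8.5),
    }
    for size, (mlx_tps, gguf_tps) in results.items():
        print(f"{size}: MLX is {mlx_tps / gguf_tps:.2f}x the GGUF rate")
    # Prints roughly 1.56x, 1.28x, and 1.28x respectively.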

Commands

Ollama

I used Ollama to test the GGUF models.  I already had it installed for use with VSCode assist. Setting "stream":false here causes the API to return a single JSON object containing the generated text and the timing statistics, rather than streaming the output. This assumes the model has already been downloaded and is available; I installed Ollama and downloaded the models using ollama pull <model> prior to running the tests.

curl http://localhost:11434/api/generate -d '{"model":"qwen2.5:32b","prompt":"write a 500 word short story", "stream":false}'

Result Contains


The tokens/sec value is not returned directly.  It must be calculated from two fields in the response, where eval_duration is reported in nanoseconds:
  • eval_count
  • eval_duration
tokens/sec = eval_count * 10^9 / eval_duration

Example

  1. "eval_count":735,"eval_duration":86329000000
  2. 735 * 10 ^ 9 / 86329000000
  3. 8.51 tokens per second
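
Putting the request and the calculation together, a minimal Python sketch along these lines (assuming Ollama is listening on its default port and the model has already been pulled) reproduces the tokens/sec numbers in the table:

    # Sketch: call the local Ollama generate API and derive tokens/sec.
    import json
    import urllib.request

    payload = {
        "model": "qwen2.5:32b",
        "prompt": "write a 500 word short story",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # eval_duration is reported in nanoseconds, so scale by 10^9.
    tps = result["eval_count"] * 1e9 / result["eval_duration"]
    print(f"{tps:.2f} tokens/sec")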

llm

I used llm for the MLX models because the community MLX effort centers around tooling that llm plugs into directly. For llm I had to run an extra command to pull the performance statistics. I installed llm using brew and added MLX support through llm's plugin installer.

Once installed, these are the commands I used for the benchmarks.

llm mlx download-model mlx-community/Qwen2.5-7B-Instruct-4bit

llm -m mlx-community/Qwen2.5-7B-Instruct-4bit 'write a 500 word short story'

llm logs -c --json

Result Contains

      "generation_tps": 63.73...

Revision History

Created 2025 02
