Microbenchmarking Apple M1 Max - MLX vs GGUF - LLM Qwen 2.5


MLX is an ML framework targeted at Apple Silicon.  It provides noticeable performance gains when compared to the standard GGUF-based tooling running on Apple Silicon.  The MLX project describes itself as:  

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
A notable difference from MLX and other frameworks is the unified memory model. Arrays in MLX live in shared memory. Operations on MLX arrays can be performed on any of the supported device types without transferring data.
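
To make the unified memory point concrete, here is a minimal MLX sketch (assuming the mlx Python package is installed; the array names and shapes are just illustrative) in which the same arrays are used by operations dispatched to the CPU and the GPU without any copies:

    # Minimal sketch of MLX's unified memory model (assumes `pip install mlx`).
    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))

    # The same arrays are visible to both devices; each op just picks a device/stream.
    c_cpu = mx.add(a, b, stream=mx.cpu)   # run on the CPU
    c_gpu = mx.add(a, b, stream=mx.gpu)   # run on the GPU, no data transfer needed

    mx.eval(c_cpu, c_gpu)                 # MLX is lazy, so force evaluation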

LM Studio added support for Apple Silicon MLX models in 2024. I totally ignored it until I saw a 2025/02 Reddit post in the /r/ollama subreddit.  I wanted to run their microbenchmark on my Mac to get a feel for the possible performance difference.  The performance improvement is exciting.  I am holding off on really jumping into MLX until Ollama supports it, something they were working on as of 2025/02.

Test Bed

The general consensus is that later Apple M-series processors benefit even more than the M1, which has great memory bandwidth but is slightly under-spec'd on the compute side to make full use of that bandwidth.

Machine

Apple M1 Max 10-Core CPU; 64GB Unified Memory; 1TB Solid State Drive; 32-Core GPU/16-Core Neural Engine

Software

LLM execution platforms.

  • Ollama for GGUF models
  • llm (with its MLX plugin) for MLX models

Test Results

MLX Model (run with llm) | MLX tokens/sec | GGUF Model (run with Ollama) | GGUF tokens/sec
mlx-community/Qwen2.5-7B-Instruct-4bit | 63.7 | qwen2.5:7b | 40.75
mlx-community/Qwen2.5-14B-Instruct-4bit | 27.8 | qwen2.5:14b | 21.7
mlx-community/Qwen2.5-32B-Instruct-4bit | 10.92 | qwen2.5:32b | 8.5
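
For a rough sense of the relative gain, this small Python snippet computes the MLX-over-GGUF speedup from the measurements in the table:

    # Relative MLX vs GGUF throughput from the measured tokens/sec above.
    results = {
        "7b":  (63.7, 40.75),
        "14b": (27.8, 21.7),
        "32b": (10.92, 8.5),
    }
    for size, (mlx_tps, gguf_tps) in results.items():
        print(f"{size}: MLX is {mlx_tps / gguf_tps:.2f}x the GGUF rate")
    # Prints roughly 1.56x, 1.28x, and 1.28x respectively.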

Commands

Ollama

I used Ollama to test the GGUF models.  I already had it installed for use with VSCode assist. Setting "stream":false here causes the API to return a single JSON object containing the generated text and the timing statistics, rather than streaming the output. This assumes the model has already been downloaded and is available; I installed Ollama and downloaded the models using ollama pull <model> prior to running the tests.

curl http://localhost:11434/api/generate -d '{"model":"qwen2.5:32b","prompt":"write a 500 word short story", "stream":false}'

Result Contains


The tokens/sec value is not returned directly.  It must be calculated from two fields in the response, where eval_duration is reported in nanoseconds:
  • eval_count
  • eval_duration
tokens/sec = eval_count * 10^9 / eval_duration

Example

  1. "eval_count":735,"eval_duration":86329000000
  2. 735 * 10 ^ 9 / 86329000000
  3. 8.51 tokens per second
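
Putting the request and the calculation together, a minimal Python sketch along these lines (assuming Ollama is listening on its default port and the model has already been pulled) reproduces the tokens/sec numbers in the table:

    # Sketch: call the local Ollama generate API and derive tokens/sec.
    import json
    import urllib.request

    payload = {
        "model": "qwen2.5:32b",
        "prompt": "write a 500 word short story",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # eval_duration is reported in nanoseconds, so scale by 10^9.
    tps = result["eval_count"] * 1e9 / result["eval_duration"]
    print(f"{tps:.2f} tokens/sec")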

llm

I used llm for the MLX models because the community MLX effort centers around tooling that llm plugs into directly. For llm I had to run an extra command to pull the performance statistics. I installed llm using brew and added MLX support through llm's plugin installer.

Once installed, these are the commands I used for the benchmarks.

llm mlx download-model mlx-community/Qwen2.5-7B-Instruct-4bit

llm -m mlx-community/Qwen2.5-7B-Instruct-4bit 'write a 500 word short story'

llm logs -c --json

Result Contains

      "generation_tps": 63.73...

Revision History

Created 2025 02
