Coercing NVIDIA bfloat16 LLM models to run on Tesla GPUs that only support float16
Newer LLM models are built around the bfloat16 data type, which trades precision and range differently than the older float16. My Tesla-vintage GPU supports the older float16, not the newer bfloat16. You can coerce the model from bfloat16 to float16.
Disclaimer: The difference in precision can result in errors. Your mileage and accuracy may vary depending on the model.
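Before editing anything, it is worth confirming that the card really lacks bfloat16 support. Here is a minimal PyTorch sketch of that check, assuming a working CUDA install; nothing in it is specific to any particular model.

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"compute capability: {major}.{minor}")
    # Native bfloat16 generally needs Ampere (compute capability 8.0) or newer.
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device visible")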
Hacking config.json for FP16
The basic steps are:
- Download the model, either manually or as part of an attempted run.
- Find the model on disk. It typically lives in the model cache directory:
- ~/.cache/nvidia/nvidia-nims/ngc/<some_path>/config.json
- ~/.cache/nim/ngc/<some_path>/config.json
- ~/.cache/nvidia-nims/ngc/hub/<some-model>/snapshots/config.json
- Edit the config.json found for the model.
- Find the line "torch_dtype": "bfloat16".
- Change the value of torch_dtype to one supported by the card, float16 in my case: "torch_dtype": "float16".
Note that the property appears as torch.dtype on the command line but as torch_dtype in the JSON files.
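The edit itself is a one-line change, so vi works fine, but here is a minimal Python sketch of the same edit. The path is the Mistral cache location from the sample config.json below; substitute your own model's snapshot directory.

import json
from pathlib import Path

# Example path from the sample config.json below; adjust for your model.
config_path = Path.home() / ".cache/nim/ngc/hub/models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json"

config = json.loads(config_path.read_text())
print("before:", config["torch_dtype"])

if config["torch_dtype"] == "bfloat16":
    config["torch_dtype"] = "float16"
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    print("after:", config["torch_dtype"])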
Sample config.json
You can see "torch_dtype": "bfloat16" partway down the file.
(base) joe@hp-z820:~$ cat ./.cache/nim/ngc/hub/models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json
{
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.42.0.dev0",
"use_cache": true,
"vocab_size": 32768
}
Now we can walk through a few model definition cards to see whether the models fit on our card and are candidates for dtype coercion. Some of the model cards say float16, but the models download with bfloat16 configuration files. We can overcome that with the cache edit.
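A quick way to see which downloads are bfloat16 candidates is to scan the cache for config.json files and report each one's torch_dtype. A minimal sketch, assuming the cache roots listed earlier; adjust the list for your setup.

import json
from pathlib import Path

# Cache roots mentioned earlier in this post; add or remove entries as needed.
cache_roots = [
    Path.home() / ".cache/nim/ngc",
    Path.home() / ".cache/nvidia-nims/ngc",
    Path.home() / ".cache/nvidia/nvidia-nims/ngc",
]

for root in cache_roots:
    if not root.exists():
        continue
    for config_path in root.rglob("config.json"):
        try:
            dtype = json.loads(config_path.read_text()).get("torch_dtype", "not set")
        except (OSError, json.JSONDecodeError):
            continue
        print(f"{dtype:10} {config_path}")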
NIM Anywhere Models
The NIM Anywhere project ships with a sample service that uses a model that expects to run in bfloat16.
I was able to load the LLM by changing torch_dtype and validated the configuration with a few simple queries. I didn't do any precision verification testing.
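For reference, the queries were nothing more elaborate than a chat completion against the NIM's OpenAI-compatible endpoint. A sketch, assuming the service is listening on localhost:8000 and serving the llama3-8b model name shown below; both are assumptions and will vary with your deployment.

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed host and port
    json={
        "model": "meta/llama3-8b-instruct",        # assumed served model name
        "messages": [{"role": "user", "content": "In one sentence, what is bfloat16?"}],
        "max_tokens": 100,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])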
Model meta-llama/Meta-Llama-3-8B-Instruct
- Model: meta-llama/Meta-Llama-3-8B-Instruct
- config.json location: ~/.cache/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json
- torch_dtype: bfloat16
- model weight size: 15GB
Modifications to run on Tesla GPUs
- torch_dtype: float16
Hybrid RAG Ungated Models
These are the six local models supported by the NVIDIA AI Workbench Hybrid RAG example. Two are immediately available. The other four are gated, behind a Hugging Face token wall.
nvidia/Llama3-ChatQA-1.5-8b
- Model: nvidia/Llama3-ChatQA-1.5-8b
- Model location: /data/models--nvidia--Llama3-ChatQA-1.5-8B
- config.json location: /data/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/config.json
- torch_dtype: float16
Phi-3-mini-128k-instruct
- Model: Phi-3-mini-128k-instruct
- Model location: /data/models--microsoft--Phi-3-mini-128k-instruct/
- config.json location: /data/models--microsoft--Phi-3-mini-128k-instruct/snapshots/d548c233192db00165d842bf8edff054bb3212f8/config.json
- torch_dtype in config.json: bfloat16
Hybrid RAG Gated Models Supported in AI Workbench
Most of these models are bfloat16. The first one is FP16, so it should be runnable without modification on a Tesla GPU. That currently doesn't work in AI Workbench. I think this is because NVIDIA erroneously sets the torch_dtype to bf16 no matter what the model card says.
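One way to check that theory is to compare the upstream Hugging Face config.json with whatever got cached locally. A sketch, assuming huggingface_hub is installed and a Hugging Face token is configured for the gated repo; the local /data path below is an illustrative guess at the AI Workbench cache location, not a value I verified.

import json
from pathlib import Path
from huggingface_hub import hf_hub_download

repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"                      # gated: needs accepted license + token
local_dir = Path("/data/models--meta-llama--Meta-Llama-3-8B-Instruct")  # hypothetical local cache path

upstream_config = json.loads(Path(hf_hub_download(repo_id=repo_id, filename="config.json")).read_text())
print("upstream torch_dtype:", upstream_config.get("torch_dtype"))

for cached in local_dir.rglob("config.json"):
    cached_dtype = json.loads(cached.read_text()).get("torch_dtype")
    print(f"cached   torch_dtype: {cached_dtype}  ({cached})")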
Llama-2-7b-chat-hf
* Model: Llama-2-7b-chat-hf
* torch_dtype in config.json: not investigated
Meta-Llama-3-8b-Instruct
* Model: Meta-Llama-3-8b-Instruct
* torch_dtype in config.json: bfloat16
Mistral-7B-Instruct-v0.2
* Model: Mistral-7B-Instruct-v0.2
* torch_dtype in config.json: bfloat16