Coercing NVIDIA bfloat16 LLM models to run on Tesla GPUs that only support float16

Newer LLMs are built around the bfloat16 data type, which trades mantissa precision for a wider exponent range than the older float16. My Tesla-vintage GPU supports the older float16, not bfloat16. In many cases you can coerce the model from bfloat16 to float16.
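If you are not sure what your card supports, PyTorch can report it directly. A minimal sketch, assuming only that PyTorch with CUDA support is installed:

import torch

# Pre-Ampere cards (compute capability below 8.0) have no native bfloat16
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("bfloat16 supported:", torch.cuda.is_bf16_supported())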

Disclaimer: The difference in precision can introduce errors. Your mileage and accuracy may vary depending on the model.

Hacking config.json for FP16

The basic steps are:
  1. Download the model, either manually or as part of an attempted run.
  2. Find the location of the model on disk. It is typically under one of these model cache paths:
    1. ~/.cache/nvidia/nvidia-nims/ngc/<some_path>/config.json
    2. ~/.cache/nim/ngc/<some_path>/config.json
    3. ~/.cache/nvidia-nims/ngc/hub/<some-model>/snapshots/config.json
  3. Edit the config.json found for the model.
  4. Find the line "torch_dtype": "bfloat16".
  5. Change the value of torch_dtype to one supported by the card, float16 in my case: "torch_dtype": "float16".
The property is spelled torch.dtype on the command line but torch_dtype in the JSON files.
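The change is a one-line edit in any text editor, but it is easy to script if the cache gets re-pulled. A minimal Python sketch, using the Mistral snapshot path from my cache as an example; substitute whichever config.json you found in step 2:

import json
from pathlib import Path

# Substitute the config.json located in step 2
config_path = (Path.home() / ".cache/nim/ngc/hub"
               / "models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json")

config = json.loads(config_path.read_text())
print("torch_dtype before:", config.get("torch_dtype"))   # expect "bfloat16"

# Coerce the declared dtype to one the Tesla card supports
config["torch_dtype"] = "float16"
config_path.write_text(json.dumps(config, indent=2) + "\n")
print("torch_dtype after:", config["torch_dtype"])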

Sample config.json

You can see "torch_dtype": "bfloat16" partway down the file.

(base) joe@hp-z820:~$ cat ./.cache/nim/ngc/hub/models--nim--mistralai--mistral-7b-instruct-v03/snapshots/hf/config.json
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.0.dev0",
  "use_cache": true,
  "vocab_size": 32768
}
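Editing the cached file is what the NIM container reads, but if you load the same checkpoint yourself with Hugging Face transformers you can override the declared dtype at load time instead of touching config.json. A minimal sketch, not the NIM path, assuming the transformers, torch, and accelerate packages and using the Mistral checkpoint as an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype here overrides the "torch_dtype" value baked into config.json
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # coerce to fp16 for pre-Ampere cards
    device_map="auto",          # requires the accelerate package
)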


Now we can walk through a few model definition cards to see whether they fit on our card and are candidates for dtype coercion. Some of the model cards say float16 but download with bfloat16 configuration files. We can overcome that with the cache edit described above.

NIM Anywhere Models

The NIM Anywhere project ships with a sample service that uses a model that expects to run in bfloat16.
I was able to load the LLM model by changing the torch_dtype and validated the configuration with a few simple queries. I didn't do any precision verification testing.
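A chat completion against the NIM endpoint is enough for that kind of smoke test. A minimal sketch with Python requests, assuming the container listens on the default port 8000 and advertises the model name shown; adjust the URL and model name for your deployment:

import requests

# Assumed defaults for a local NIM container
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "meta/llama3-8b-instruct",  # model name as the NIM advertises it
    "messages": [{"role": "user", "content": "In one sentence, what is bfloat16?"}],
    "max_tokens": 100,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])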

Model meta-llama/Meta-Llama-3-8B-Instruct

  • Model: meta-llama/Meta-Llama-3-8B-Instruct
  • config.json location: ~/.cache/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json
  • torch_dtype: bfloat16
  • model weight size: 15GB

Modifications to run on Tesla GPUs
  • torch_dtype: float16




Hybrid RAG Ungated Models

These are the six local models supported by the NVIDIA AI Workbench Hybrid RAG example. Two are immediately available. The other four are gated behind a Hugging Face token wall.

nvidia/Llama3-ChatQA-1.5-8b

  • Model: nvidia/Llama3-ChatQA-1.5-8b
  • Model Location: /data/models--nvidia--Llama3-ChatQA-1.5-8B
  • config.json location: /data/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/config.json
  • torch_dtype: float16


Phi-3-mini-128k-instruct

  • Model: Phi-3-mini-128k-instruct
  • Model Location: /data/models--microsoft--Phi-3-mini-128k-instruct/
  • config.json location: /data/models--microsoft--Phi-3-mini-128k-instruct/snapshots/d548c233192db00165d842bf8edff054bb3212f8/config.json
  • torch_dtype in config.json: bfloat16




Hybrid RAG Gated Models Supported in AI Workbench

Most of these models are bfloat16. The first one is FP16, so it would be runnable without modification on a Tesla GPU. This currently doesn't work in AI Workbench; I think this is because NVIDIA erroneously sets the torch_dtype to bf16 no matter what the model card says.
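If you are pulling these gated models yourself with the Hugging Face tooling, you authenticate with a read token first. A minimal sketch using the huggingface_hub package; the token value is a placeholder:

from huggingface_hub import login

# Placeholder - create a read token at https://huggingface.co/settings/tokens
login(token="hf_xxxxxxxxxxxxxxxxxxxxxxxx")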

Llama-2-7b-chat-hf

* Model: Llama-2-7b-chat-hf
* torch_dtype in config.json: not investigated



Meta-Llama-3-8B-Instruct

* Model: Meta-Llama-3-8B-Instruct
* torch_dtype in config.json: bfloat16


Mistral-7B-Instruct-v0.1

* Model: Mistral-7B-Instruct-v0.1
* torch_dtype in config.json: bfloat16



Mistral-7B-Instruct-v0.2

* Model: Mistral-7B-Instruct-v0.2
* torch_dtype in config.json: bfloat16




Other

Revision History

Created 2024-08
