Rocking an older Titan RTX 24GB as my local AI Code assist on Windows 11, Ollama and VS Code
This is about using a Turing-generation NVIDIA Titan RTX GPU to run code-assist LLMs locally for use in VS Code. This slightly older card has 24GB of VRAM, making it a great local LLM host. The Titan RTX is a two-slot, dual-fan card that currently sells for about the same price as a refurbished Ampere NVIDIA RTX 3090 Ti 24GB.
There are a bunch of ways to host code-assist LLMs. We are using an early release of Ollama as our LLM service and the continue.dev extension as the language service inside VS Code.
This was tested on an 8-core AMD Ryzen with 64GB of RAM and the Titan RTX.
Related blog articles and videos
Several related blogs and videos that cover VSCode and local LLMs
- Blog Get AI code assist VSCode with local LLMs using Ollama and the Continue.dev extension - Mac
- Get AI code assist VSCode with local LLMs using LM Studio and the Continue.dev extension - Windows
- Rocking an older Titan RTX 24GB as my local AI Code assist on Windows 11, Ollama and VS Code
- YouTube Video Using local Large Language Models for AI code assist in Visual Studio Code
- YouTube Video See tabAutoComplete AI assist and Chat AI assist relying on local LLMs in 4 minutes
Install Ollama
Download and install Ollama. The early-release Windows version of Ollama runs from the command line as a daemon and adds a widget to the Windows 11 tray.
Ollama is a simple tool that lets you run models locally, assuming you have the required hardware. It can run in server mode, providing local API endpoints for tools like the VS Code AI-assist extensions. I needed to run three models for the full-featured experience. The 24GB of VRAM gives me some room to play.
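If you want a quick check that the install worked before pulling anything, the CLI responds from any PowerShell prompt; something like:

# Confirm the CLI is installed and reachable from the shell
ollama --version

# List the available subcommands (pull, list, serve, run, ...)
ollama help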
Pulling Models
Browse the Ollama site to find a couple of models you want, then get them with ollama pull <model>. I wanted to try the most popular models with a couple of languages and, after some experimenting, ended up with:
ollama pull nomic-embed-text
ollama pull deepseek-coder-v2
ollama pull codellama
ollama pull starcoder2
ollama pull codeqwen
ollama pull gemma2
ollama pull codegemma
ollama pull codestral
ollama pull phi3
ollama pull phi3.5
Selecting Models
Our VSCode plugin needs models for three functions: tabAutoComplete, Chat, and embedding. We have lots of choices.
PS C:\Users\joe> ollama ls
NAME                        ID              SIZE      MODIFIED
phi3.5:latest               3b387c8dd9b7    2.2 GB    14 minutes ago
gemma2:latest               ff02c3702f32    5.4 GB    15 minutes ago
phi3:latest                 4f2222927938    2.2 GB    15 minutes ago
codestral:latest            fcc0019dcee9    12 GB     16 minutes ago
codeqwen:latest             df352abf55b1    4.2 GB    16 minutes ago
starcoder2:latest           f67ae0f64584    1.7 GB    18 minutes ago
codellama:latest            8fdf8f752f6e    3.8 GB    19 minutes ago
codegemma:latest            0c96700aaada    5.0 GB    19 minutes ago
deepseek-coder-v2:latest    8577f96d693e    8.9 GB    21 minutes ago
nomic-embed-text:latest     0a109f422b47    274 MB    23 minutes ago
This will look bad because Blogger doesn't support long lines and doesn't have scrolling areas for code:-(
Run the ollama server
Ollama can be run as a server, making all of the downloaded models available, or it can be run with a single model. Note that the first ollama serve attempt below fails with a bind error; that usually means an Ollama server (for example the tray app) is already listening on port 11434, so either quit it first or just use the instance that is already running.
PS C:\Users\joe> ollama serve
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
PS C:\Users\joe> ollama serve
2024/08/22 20:20:00 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\joe\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\joe\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-22T20:20:00.136-04:00 level=INFO source=images.go:782 msg="total blobs: 47"
time=2024-08-22T20:20:00.139-04:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-22T20:20:00.142-04:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)"
time=2024-08-22T20:20:00.143-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v6.1 cpu cpu_avx]"
time=2024-08-22T20:20:00.143-04:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-22T20:20:00.312-04:00 level=INFO source=gpu.go:288 msg="detected OS VRAM overhead" id=GPU-c7ac10e0-547d-6d91-75d5-cdd64259f9f2 library=cuda compute=7.5 driver=12.6 name="NVIDIA TITAN RTX" overhead="928.0 MiB"
time=2024-08-22T20:20:00.314-04:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-c7ac10e0-547d-6d91-75d5-cdd64259f9f2 library=cuda compute=7.5 driver=12.6 name="NVIDIA TITAN RTX" total="24.0 GiB" available="22.8 GiB"
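With the server listening, you can poke the local REST API directly to confirm it is up. Here is a rough sketch using PowerShell's Invoke-RestMethod against the default 127.0.0.1:11434 endpoint; substitute any model you actually pulled for codeqwen:

# List the models the server knows about
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/tags

# Request a quick non-streaming completion from one of the pulled models
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/generate -Method Post -ContentType "application/json" -Body '{"model": "codeqwen", "prompt": "Write hello world in Dart", "stream": false}'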
VSCode Extension: Chat and auto-complete functions
The continue.dev extension has two paradigms for interacting with the LLM, chat and auto-complete, plus an embeddings model used for codebase indexing. Each of the three can use a different LLM if you have enough GPU VRAM. continue.dev supports remote and local endpoints, which means you can use a mix of local and remote models. I'm only going to show locally deployed models going forward.
You want the models to fit into VRAM because you don't want them to fall back to the CPU, which is slow. A quick way to verify where a model is running is shown at the end of this section.
- Chat: Chat happens in a dedicated pane where you can type or copy/paste your question. The extension lets you select any of the configured LLMs. Each conversation in the chat window is bound to a single LLM definition; you can change the LLM for a new conversation with the drop-down list in the chat pane. The following image was captured from a locally running Ollama instance with four available models. The extension's chat function understands that Ollama can host multiple downloaded models, and its UI lets you select which model you wish to chat with for any given interaction.
- Autocomplete: Tab-based auto-complete that works inline while you are coding. It just works once it is configured correctly. There can be only one auto-complete LLM, and it has its own Ollama configuration section.
I have enough VRAM that I can locally run separate tabAutocomplete and chat LLMs.
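Because performance falls off badly when a model spills out of VRAM, it is worth checking where Ollama placed each loaded model. A quick sketch of the check, run while a chat or completion request is in flight:

# Show the models currently loaded by the server; the PROCESSOR column reports GPU vs CPU placement
ollama ps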
Install and Configure the Continue.dev extension
Install the continue extension. Edit the configuration by clicking on the gear item at the bottom of the extension pane.
~/.continue/config.json
This is the configuration file for the Continue Visual Studio Code extension. We must edit a few different areas to enable and configure chat, auto-complete, and embeddings.
models: The models for chat
We can change models across chat sessions. This configuration says to use the local Ollama and auto-detect all the models currently available to that Ollama instance. You can see the model selector above.
"models": [
{
"title": "Ollama",
"provider": "ollama",
"model": "AUTODETECT"
}
],
tabAutocompleteModel: The model for inline suggestions and auto-complete
This configuration says to use this specific model for tab auto-complete. The model is available on the local Ollama server just like the models in the models list.
"tabAutocompleteModel": {
"title": "Ollama",
"provider": "ollama",
"model": "codeqwen"
},
Embeddings
I explicitly configured the embedding model, which Continue uses when indexing your codebase for context.
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text"
}
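If you want to confirm the embedding model answers before Continue starts indexing, you can call Ollama's embeddings endpoint directly; a minimal sketch against the default local port:

# Ask the local server for an embedding vector for a test string
Invoke-RestMethod -Uri http://127.0.0.1:11434/api/embeddings -Method Post -ContentType "application/json" -Body '{"model": "nomic-embed-text", "prompt": "hello world"}'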
Complete sample config.json file
This is my complete config.json for my Windows machine with 24GB of graphics VRAM.
{
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama",
    "provider": "ollama",
    "model": "codeqwen"
  },
  "contextProviders": [
    { "name": "code", "params": {} },
    { "name": "docs", "params": {} },
    { "name": "diff", "params": {} },
    { "name": "terminal", "params": {} },
    { "name": "problems", "params": {} },
    { "name": "folder", "params": {} },
    { "name": "codebase", "params": {} }
  ],
  "slashCommands": [
    { "name": "edit", "description": "Edit selected code" },
    { "name": "comment", "description": "Write comments for the selected code" },
    { "name": "share", "description": "Export the current chat session to markdown" },
    { "name": "cmd", "description": "Generate a shell command" },
    { "name": "commit", "description": "Generate a git commit message" }
  ],
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
Refresh the workspace after reconfiguring the extension
Reload the VSCode Workspace after editing the config file. This forces an immediate configuration refresh.
Ctrl+Shift+P
Developer: Reload Window
The reload command is available in the command palette after pressing Ctrl+Shift+P.
Examples
A chat session
This chat is about a unit test. Zero prompt engineering has been done to get better answers.
Autocomplete in action
Autocomplete will provide suggestions while you are typing in a code window. The light grey text in this image was provided by the LLM when I clicked on a Dart unit test file.
Memory usage with three separate models loaded
nvidia-smi lets us check the status of VRAM and the GPU. We can see here that the loaded models consume almost 22GB of VRAM.
PS C:\Windows\System32> nvidia-smi
Thu Aug 22 20:04:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70 Driver Version: 560.70 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA TITAN RTX WDDM | 00000000:08:00.0 Off | N/A |
| 40% 38C P8 5W / 280W | 21824MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4484 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
| 0 N/A N/A 10752 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
| 0 N/A N/A 23996 C ...\cuda_v11.3\ollama_llama_server.exe N/A |
+-----------------------------------------------------------------------------------------+
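If you want to watch VRAM move as models load and then idle out (the server unloads them after the OLLAMA_KEEP_ALIVE timeout, 5 minutes by default in the log above), nvidia-smi can refresh itself on an interval; for example:

# Refresh the GPU and VRAM report every 5 seconds while exercising chat and auto-complete
nvidia-smi -l 5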
Don't grab models by the tag "latest"
The latest tag can bring down a completely different version or quantization on different days because it is simply the "latest", whatever was uploaded most recently.
Most of my work above was done using the latest version of a model. That's a mistake because you don't know which tuning or quantization you are actually using. I removed the models tagged latest and pulled explicitly by parameter count/size (an example follows the list below).
codegemma:2b                      926331004170    1.6 GB
codegemma:7b                      0c96700aaada    5.0 GB
codellama:13b                     9f438cb9cd58    7.4 GB
codellama:7b                      8fdf8f752f6e    3.8 GB
codeqwen:7b                       df352abf55b1    4.2 GB
codestral:22b                     fcc0019dcee9    12 GB
deepseek-coder-v2:16b             8577f96d693e    8.9 GB
gemma2:27b                        53261bc9c192    15 GB
gemma2:2b                         8ccf136fdd52    1.6 GB
gemma2:9b                         ff02c3702f32    5.4 GB
granite-code:20b                  31d8bc61e506    11 GB
granite-code:20b-instruct         31d8bc61e506    11 GB
granite-code:3b                   63bedbdffbf0    2.0 GB
granite-code:3b-base-f16          22067b08f26e    7.0 GB
granite-code:3b-instruct          63bedbdffbf0    2.0 GB
granite-code:8b                   998bce918de0    4.6 GB
granite-code:8b-base-f16          7c8fde9dfb87    16 GB
granite-code:8b-instruct          998bce918de0    4.6 GB
phi3.5:3.8b                       3b387c8dd9b7    2.2 GB
phi3.5:3.8b-mini-instruct-fp16    570b68409ede    7.6 GB
starcoder2:15b                    20cdb0f709c2    9.1 GB
starcoder2:3b                     f67ae0f64584    1.7 GB
starcoder2:7b                     0679cedc1189    4.0 GB
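The swap itself is just a remove and a re-pull; for example, using codeqwen (any of the models above works the same way):

# Drop the ambiguous latest tag and pull a specific parameter-count variant instead
ollama rm codeqwen:latest
ollama pull codeqwen:7b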
Fini
That's it. At this point, you should have AI assist in your editors and in the chat pane. All of this capability without leaving the friendly environment of your local network!
Revision History
Created: 2024 08