Get AI code assist VSCode with local LLMs using LM Studio and the Continue.dev extension - Windows
This article covers running AI code assist for VSCode entirely locally, replacing Copilot or another hosted service. You might run local models to guarantee that none of your code ends up on external servers, or simply because you do not want to maintain an ongoing AI subscription.
You want a big GPU; an 8GB card is the practical minimum and it feels tiny for this workload.
Related blog articles and videos
Several related blog articles and videos cover VSCode and local LLMs:
- Blog Get AI code assist VSCode with local LLMs using Ollama and the Continue.dev extension - Mac
- Get AI code assist VSCode with local LLMs using LM Studio and the Continue.dev extension - Windows
- Rocking an older Titan RTX 24GB as my local AI Code assist on Windows 11, Ollama and VS Code
- YouTube Video Using local Large Language Models for AI code assist in Visual Studio Code
- YouTube Video See tabAutoComplete AI assist and Chat AI assist relying on local LLMs in 4
Installing LM Studio
LM Studio is a GUI program that lets you run models locally, assuming you have the required hardware. Unlike Ollama, which is a command line program, LM Studio is driven entirely from its GUI. I'm using LM Studio because Ollama is still in beta on Windows.
Download and install LM Studio, run the installer, and then launch the program.
Selecting Models
We want models that are tuned for coding. LM Studio has many coding oriented models in the catalog. The search results will tell you which ones are appropriate for your GPU / CPU.
You should be judicious with the size of the models. You want them to run on the GPU and fit in video memory. You can use the same model for auto-completion and chat. Select the model best suited for your normal use case if you can only load one.
On Windows, you will be constrained by your NVIDIA GPU's VRAM. You will need at least 8GB of VRAM for a single model plus an embedding model. With 12-16GB you can run better models or run separate chat and auto-complete models. On a Mac, the GPU can use a large share of system memory because Apple Silicon treats main memory as VRAM, so you will want at least 32GB of RAM.
Note: I used a single model for both auto-complete and chat because of the VRAM limitations on my Windows machine.
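If you want a rough feel for whether a model will fit before downloading it, a back-of-the-envelope estimate is enough: quantized weights take roughly parameters times bits-per-weight divided by eight, plus some overhead for the KV cache and runtime. This is a minimal sketch of that arithmetic, not anything LM Studio reports; the overhead figure is an assumed ballpark.

# Rough sketch: estimate VRAM needed for a quantized model.
# The 1 GB overhead for KV cache and runtime is an assumed ballpark, not a measured value.
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# A 7B model at 4-bit quantization is roughly 3.5 GB of weights plus overhead,
# which is why a 7B Q4 model plus an embedding model fits on an 8GB card.
print(f"{estimate_vram_gb(7, 4):.1f} GB")  # ~4.5 GB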
Download models and run them in LM Studio
The LM Studio catalog included 21 different coding-oriented models at the time of this article. I selected CodeLlama-7B-Instruct solely because it had the most downloads. The My Models tab shows that I have one model downloaded. I also tried a couple of the DeepSeek Coder models.
According to the VS Code extension, CodeLlama is not optimized for auto-complete; you will get a warning when you use it that way.
We run the model by clicking on the Local Server tab and then selecting the model in the drop-down list at the top center. The current LM Studio model is shown in purple in this screenshot.
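You can confirm the server is actually up before touching VSCode. LM Studio's local server exposes an OpenAI-compatible API, by default on port 1234; this minimal sketch lists the models that server reports. The host and port are assumptions, so adjust them to match your Local Server settings.

# Minimal sketch: list the models LM Studio's local server reports.
# Assumes the default server address http://localhost:1234; change it if you picked a different port.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    models = json.load(resp)

for entry in models.get("data", []):
    print(entry["id"])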
Verify using the GPU
Use the nvidia-smi command to verify GPU usage.
Before Loading the model
PS C:\Users\joe> nvidia-smi
Thu Aug 22 07:53:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70 Driver Version: 560.70 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:08:00.0 Off | N/A |
| 0% 28C P8 4W / 240W | 129MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 22764 C ...\LM-Studio\app-0.2.31\LM Studio.exe N/A |
+-----------------------------------------------------------------------------------------+
After Loading the model
The DeepSeek-Coder-V2-Lite-Instruct model consumed about 5GB of VRAM. The embeddings require another 400MB-ish.
PS C:\Users\joe> nvidia-smi
Thu Aug 22 08:02:26 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.70 Driver Version: 560.70 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:08:00.0 Off | N/A |
| 0% 29C P8 4W / 240W | 4286MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7096 C ...\LM-Studio\app-0.2.31\LM Studio.exe N/A |
| 0 N/A N/A 22764 C ...\LM-Studio\app-0.2.31\LM Studio.exe N/A |
+-----------------------------------------------------------------------------------------+
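If you would rather watch VRAM usage while a model loads than rerun nvidia-smi by hand, a small polling loop works. This is a hypothetical convenience script, not part of LM Studio or the extension; it only wraps the same nvidia-smi query shown above.

# Hypothetical helper: poll nvidia-smi every few seconds and print VRAM usage. Stop with Ctrl+C.
import subprocess
import time

while True:
    used_total = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(used_total)  # e.g. "4286 MiB, 8192 MiB"
    time.sleep(5)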
VSCode Extension: Chat and auto-complete functions
The extension supports two modes for interacting with the LLM: chat and auto-complete. Ollama supports multiple simultaneously loaded models, while LM Studio supports a single active model. The VSCode extension copes with this because it discovers the available models by querying the server's API endpoint.
- Chat: Chat happens in a dedicated pane where you can type or copy/paste your question. The extension lets you select any of the configured LLMs. Each conversation in the chat window can only be bound to a single LLM definition.
- Autocomplete: Tab-based auto-complete that works inline as you type. It just works once it is configured correctly. The auto-complete LLM has its own configuration section. You can use the same LLM for both chat and auto-complete, and you will have to if you only have enough VRAM for one model.
If you only have enough VRAM for one model, then you must decide whether to optimize for auto-complete or for chat and pick the model that best fits that use.
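Before wiring up the extension, it can help to prove the loaded model answers a chat request at all. This sketch posts one question to the OpenAI-compatible chat endpoint LM Studio serves; the URL and model name here are assumptions, so use whatever your instance reports from /v1/models.

# Minimal sketch: one chat request against the LM Studio local server.
# The URL and model name are assumptions; match them to your own setup.
import json
import urllib.request

payload = {
    "model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)

print(answer["choices"][0]["message"]["content"])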
Install and Configure the Continue.dev extension.
The extension will recommend moving the chat pane to the right-hand side, away from the primary sidebar. This makes the chat window the secondary sidebar. That worked well for me.
The Continue extension may run a configuration wizard for you on first install. If you miss it, or need to reconfigure the extension later, you will need to edit config.json by hand.
I just go straight to the JSON editor: click on the gear icon at the bottom of the Continue extension view, or open the config file directly at ~/.continue/config.json.
Embeddings
You will want to enable embeddings in LM Studio and point the extension configuration at that model. Download the embedding model and enable it in LM Studio on the Local Server tab. The embedding model requires about 300 MB of VRAM.
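You can verify the embedding model responds the same way as the chat model. This sketch asks the LM Studio server for a single embedding; as before, the URL and model name are assumptions that should match what you enabled on the Local Server tab.

# Minimal sketch: request one embedding from the LM Studio local server.
# URL and model name are assumptions; match them to your configuration.
import json
import urllib.request

payload = {
    "model": "nomic-ai/nomic-embed-text-v1.5-GGUF",
    "input": "def reverse(s): return s[::-1]",
}
req = urllib.request.Request(
    "http://localhost:1234/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(len(result["data"][0]["embedding"]), "dimensions")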
~/.continue/config.json
This is the configuration file for the Continue Visual Studio Code extension. We must edit two different areas to enable and configure chat and auto-complete, plus a third area for embeddings.
models: The models for chat
This configuration says to use the local LM Studio instance and auto-detect all the models currently available to it. We could specify a single model, but auto-detect lets us pick a specific model for any individual chat session.
"models": [
{
"title": "LM Studio",
"provider": "lmstudio",
"model": "AUTODETECT"
}
],
You can see the models provided by this instance of LM Studio in the model selector in this screenshot.
tabAutocompleteModel: The model for inline suggestions and auto-complete
This configuration says to use this specific model for tab auto-complete. The model is available on the local LM Studio server just like the models in the models list.
"tabAutocompleteModel": {
"title": "LM Studio Codellama",
"provider": "lmstudio",
"model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
},
Embeddings
We need to configure the embeddingsProvider section because we are using something other than Ollama.
"embeddingsProvider": {
"provider": "lmstudio",
"model": "nomic-ai/nomic-embed-text-v1.5-GGUF"
}
Complete sample config.json file
This is my complete config.json with the two entries shown above. Everything else is exactly as it was created without modification.
{
"models": [
{
"title": "LM Studio",
"provider": "lmstudio",
"model": "AUTODETECT"
}
],
"customCommands": [
{
"name": "test",
"prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
"description": "Write unit tests for highlighted code"
}
],
"tabAutocompleteModel": {
"title": "LM Studio Codellama",
"provider": "lmstudio",
"model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
},
"contextProviders": [
{
"name": "code",
"params": {}
},
{
"name": "docs",
"params": {}
},
{
"name": "diff",
"params": {}
},
{
"name": "terminal",
"params": {}
},
{
"name": "problems",
"params": {}
},
{
"name": "folder",
"params": {}
},
{
"name": "codebase",
"params": {}
}
],
"slashCommands": [
{
"name": "edit",
"description": "Edit selected code"
},
{
"name": "comment",
"description": "Write comments for the selected code"
},
{
"name": "share",
"description": "Export the current chat session to markdown"
},
{
"name": "cmd",
"description": "Generate a shell command"
},
{
"name": "commit",
"description": "Generate a git commit message"
}
],
"embeddingsProvider": {
"provider": "lmstudio",
"model": "nomic-ai/nomic-embed-text-v1.5-GGUF"
}
}
Fini
That's it. At this point, you should have AI assist in your editors and in the chat pane. All of this capability without leaving the friendly environment of your local network!
Revision History
Created: 2024-08