Automation alone is often not enough for local tasks: a locally hosted model can simplify the user experience, both for solving personal problems and as a ready-made module in an information system. In this article, we will look at how to deploy a local LLM based on well-known AI models via Docker using Ollama. The instructions are suitable for Ubuntu, Windows (WSL2), or macOS.
If your local resources are not enough for the task, Serverspace provides the latest GPT models, which are just as easy to integrate via an API or use in a web dashboard.
Below is a step-by-step process for launching a language model in Docker using only the CPU.
1. Installing Docker
Linux (Ubuntu/Debian)
sudo apt install docker.io
sudo systemctl enable --now docker
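You can optionally confirm that the Docker service is running and allow your user to run Docker without sudo (a convenience step, not strictly required for the rest of the guide):
sudo systemctl status docker          # the service should be active (running)
sudo usermod -aG docker $USER         # optional: run docker without sudo (log out and back in afterwards)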
Windows/macOS
Download Docker Desktop from the official website and install it. After installation, verify that Docker is working, for example with:
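docker --version          # prints the installed Docker version
docker run hello-world    # starts a test container to confirm the engine works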
2. Running Ollama in Docker (on CPU)
Ollama is a local server for running LLMs (for example, LLaMA, Mistral, and others). Start the container with:
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama
The -v ollama_data:/root/.ollama parameter keeps downloaded models between container restarts. By default, Ollama uses the CPU if there are no GPU drivers inside the container.
Download a model and make sure it appears in the list:
docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama list
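As an optional sanity check, you can restart the container to confirm that the model survives thanks to the ollama_data volume, and open an interactive chat session directly in the terminal (type /bye to exit):
docker restart ollama
docker exec -it ollama ollama list        # the model should still be listed after the restart
docker exec -it ollama ollama run llama3  # interactive chat with the model in the terminal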
3. Connecting the web interface — Open WebUI
Run Open WebUI in a second container:
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:ollama
The interface will be available on: http://localhost:3000
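The first start can take a minute or two while the interface initializes. If the page does not open right away, you can watch the container logs (using the container name from the command above):
docker logs -f open-webui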
4. Alternative: Text Generation Web UI (on CPU)
Create a Dockerfile:
# Base image is assumed here (the original does not specify one); any recent Python image with pip should work
FROM python:3.10-slim

RUN apt-get update && apt-get install -y git build-essential
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
RUN git clone https://github.com/oobabooga/text-generation-webui
WORKDIR /text-generation-webui
RUN pip install -r requirements_cpu_only.txt
COPY start.sh /start.sh
RUN chmod +x /start.sh
ENTRYPOINT ["/start.sh"]
The startup script start.sh:
#!/bin/bash
cd /text-generation-webui                 # WORKDIR already points here; being explicit does not hurt
rm -rf models && ln -s /models models     # expose the mounted /models directory to the app
python server.py --listen --cpu
docker-compose.yml:
services:
  textgen:
    build: .
    container_name: textgen_cpu
    ports:
      - "7860:7860"
    volumes:
      - ./models:/models
Launch:
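Assuming the Dockerfile, start.sh, and docker-compose.yml are in the same directory, and ./models contains the model files you want to serve:
docker compose up -d --build    # use "docker-compose" on older Docker installations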
To communicate with the model, you can also use the API, which allows you to integrate it as a module into any information system. The endpoint for requests is:
http://localhost:11434/api/generate
To make an API request, you can use any HTTP tool or library, for example curl:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Tell me what quantum physics is in simple words",
"stream": false
}'
- model - the name of the model, for example llama3, mistral, gemma, phi, etc.
- prompt - the text of the request
- stream - if `true`, the response arrives in parts (convenient for chat); if `false`, the entire response comes at once
- options - additional generation parameters (see the example below)
Example with generation settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Write a short story about a robot.",
"stream": false,
"options": {
"temperature": 0.8,
"top_k": 40,
"top_p": 0.9,
"num_predict": 200
}
}'
- temperature - controls randomness; the higher the value, the more creative the output
- top_k - limits how many candidate tokens the next token is chosen from
- top_p - probability mass cutoff (nucleus sampling)
- num_predict - the maximum number of tokens to generate
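When stream is false, the reply arrives as a single JSON object whose generated text is in the response field. Assuming jq is installed, you can extract just the text, for example:
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Tell me what quantum physics is in simple words", "stream": false}' \
  | jq -r '.response'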
5. Optimization for CPU
Limiting the Docker container's resources:
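A minimal sketch, reusing the Ollama container from step 2: the --cpus and --memory flags cap how much of the host the container may consume, so the model does not starve other workloads on the same machine. The values here (4 cores, 8 GB of RAM) are placeholders to adjust to your hardware:
docker run -d \
  --name ollama \
  --cpus="4" \
  --memory="8g" \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama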