Automation alone is often not enough for local tasks: a locally hosted model can simplify the user experience, both for solving personal problems and as a ready-made module in an information system. In this article, we will look at how to deploy a local LLM based on well-known AI models via Docker using Ollama. The instructions are suitable for Ubuntu, Windows (WSL2), or macOS.
If your local resources are not enough for the task, Serverspace provides the latest GPT models, which are just as easy to integrate via an API or use in a web dashboard.
Below is a step-by-step process for launching a language model in Docker using only the CPU.
1. Installing Docker
Linux (Ubuntu/Debian)
sudo apt install docker.io
sudo systemctl enable --now docker
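You can optionally confirm that the Docker service is running and allow your user to run Docker without sudo (a convenience step, not strictly required for the rest of the guide):
sudo systemctl status docker          # the service should be active (running)
sudo usermod -aG docker $USER         # optional: run docker without sudo (log out and back in afterwards)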
Windows/macOS
Download Docker Desktop from the official website and install it. After installation, verify that Docker is working, for example with:
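docker --version          # prints the installed Docker version
docker run hello-world    # starts a test container to confirm the engine works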
2. Running Ollama in Docker (on CPU)
Ollama is a local server for running LLMs (for example, LLaMA, Mistral, and others). Start the container with:
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama
The -v ollama_data:/root/.ollama parameter keeps downloaded models between container restarts. By default, Ollama uses the CPU if there are no GPU drivers inside the container.
Download a model and make sure it appears in the list:
docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama list
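As an optional sanity check, you can restart the container to confirm that the model survives thanks to the ollama_data volume, and open an interactive chat session directly in the terminal (type /bye to exit):
docker restart ollama
docker exec -it ollama ollama list        # the model should still be listed after the restart
docker exec -it ollama ollama run llama3  # interactive chat with the model in the terminal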
3. Connecting the web interface — Open WebUI
Run Open WebUI in a second container:
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:ollama
The interface will be available on: http://localhost:3000
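The first start can take a minute or two while the interface initializes. If the page does not open right away, you can watch the container logs (using the container name from the command above):
docker logs -f open-webui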
4. Alternative: Text Generation Web UI (on CPU)
Create a Dockerfile:
# Base image is assumed here (the original does not specify one); any recent Python image with pip should work
FROM python:3.10-slim

RUN apt-get update && apt-get install -y git build-essential
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
RUN git clone https://github.com/oobabooga/text-generation-webui
WORKDIR /text-generation-webui
RUN pip install -r requirements_cpu_only.txt
COPY start.sh /start.sh
RUN chmod +x /start.sh
ENTRYPOINT ["/start.sh"]
The startup script start.sh:
#!/bin/bash
cd /text-generation-webui                 # WORKDIR already points here; being explicit does not hurt
rm -rf models && ln -s /models models     # expose the mounted /models directory to the app
python server.py --listen --cpu
docker-compose.yml:
services:
  textgen:
    build: .
    container_name: textgen_cpu
    ports:
      - "7860:7860"
    volumes:
      - ./models:/models
Launch:
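Assuming the Dockerfile, start.sh, and docker-compose.yml are in the same directory, and ./models contains the model files you want to serve:
docker compose up -d --build    # use "docker-compose" on older Docker installations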
To communicate with the model, you can also use the API, which allows you to integrate it as a module into any information system. The endpoint for requests is:
http://localhost:11434/api/generate
To make an API request, you can use any HTTP tool or library, for example curl:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Tell me what quantum physics is in simple words",
"stream": false
}'
- model - the name of the model, for example llama3, mistral, gemma, phi, etc.
- prompt - the text of the request
- stream - if `true`, the response arrives in parts (convenient for chat); if `false`, the entire response comes at once
- options - additional generation parameters (see the example below)
Example with generation settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3",
"prompt": "Write a short story about a robot.",
"stream": false,
"options": {
"temperature": 0.8,
"top_k": 40,
"top_p": 0.9,
"num_predict": 200
}
}'
- temperature - controls randomness; the higher the value, the more creative the output
- top_k - limits how many candidate tokens the next token is chosen from
- top_p - probability mass cutoff (nucleus sampling)
- num_predict - the maximum number of tokens to generate
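When stream is false, the reply arrives as a single JSON object whose generated text is in the response field. Assuming jq is installed, you can extract just the text, for example:
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Tell me what quantum physics is in simple words", "stream": false}' \
  | jq -r '.response'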
5. Optimization for CPU
Limiting the Docker container's resources:
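A minimal sketch, reusing the Ollama container from step 2: the --cpus and --memory flags cap how much of the host the container may consume, so the model does not starve other workloads on the same machine. The values here (4 cores, 8 GB of RAM) are placeholders to adjust to your hardware:
docker run -d \
  --name ollama \
  --cpus="4" \
  --memory="8g" \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama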