Key Takeaways
- QLoRA = quantise + LoRA: Load the base model in 4-bit quantisation (NF4), then train only small adapter matrices (LoRA). The base model weights are frozen. This reduces VRAM from 28GB to 6–10GB for a 7B model.
- Unsloth accelerates everything: roughly 2× training speed over stock Hugging Face Transformers via fused attention and other hand-optimised kernels. Supports Llama 4 Scout, Qwen3 14B, Gemma3, and Mistral Small as base models.
- 500 quality examples beat 10,000 noisy ones: For instruction tuning (the most common fine-tuning goal), dataset quality is the dominant variable. Curate carefully; use ChatGPT or Claude to generate and review synthetic data.
- Export to GGUF for Ollama: After training, model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m") produces a file you can import directly into Ollama with a Modelfile.
Introduction
Direct Answer: How do I fine-tune a large language model locally with QLoRA and Unsloth in 2026?
Fine-tuning a local LLM with QLoRA and Unsloth requires: an NVIDIA GPU with at least 8GB VRAM (16GB+ recommended), CUDA 12.4, Python 3.12, and the Unsloth library. Install with pip install unsloth. Load a base model with FastLanguageModel.from_pretrained("unsloth/llama-3.2-3b-instruct-bnb-4bit"), apply LoRA with FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","v_proj"]), format your dataset in Alpaca or ShareGPT format using HuggingFace datasets, then train with HuggingFace SFTTrainer from the trl library for 100–500 steps. Training a 3B model on 500 examples takes approximately 5–10 minutes on an RTX 4090. After training, export to GGUF with model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m") and load into Ollama for inference.
“Fine-tuning is not magic. It’s supervised learning on your specific examples. The model learns to reproduce your formatting, tone, and domain knowledge — but it cannot learn facts it has never seen. Use fine-tuning for style and format; use RAG for knowledge.”
This guide fine-tunes Llama 3.2 3B-Instruct (a good starting point on 8GB VRAM) and Llama 4 Scout (for 24GB GPUs) for a domain-specific instruction-following task, then exports to Ollama for sovereign local inference.
Part 1: Environment Setup
# Check NVIDIA GPU and CUDA
nvidia-smi | grep -E "Driver Version|CUDA Version|Name"
Expected output:
NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.4
| NVIDIA GeForce RTX 4090 24GB |
# Install Unsloth with CUDA 12.4 support
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" \
--extra-index-url https://download.pytorch.org/whl/cu124 \
--break-system-packages
pip install trl datasets peft bitsandbytes --break-system-packages
# Verify installation
python3 -c "import unsloth; print('Unsloth:', unsloth.__version__)"
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| GPU:', torch.cuda.get_device_name(0))"
Expected output:
Unsloth: 2026.4.1
CUDA available: True | GPU: NVIDIA GeForce RTX 4090
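Before loading anything, it can help to confirm how much VRAM is actually free, since the desktop session may already be holding some. A minimal sketch using torch.cuda.mem_get_info (the filename is illustrative):
# vram_check.py: report free vs total VRAM on GPU 0 before loading a model
import torch
free, total = torch.cuda.mem_get_info(0)  # returns (free_bytes, total_bytes)
print(f"Free VRAM: {free / 1024**3:.1f}GB of {total / 1024**3:.1f}GB")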
Part 2: Loading Base Model with QLoRA
# 01_load_model.py
from unsloth import FastLanguageModel
import torch
# Configuration
MODEL_NAME = "unsloth/llama-3.2-3b-instruct-bnb-4bit" # 4-bit quantised for QLoRA
# Alternatives:
# "unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit" # RTX 4090 24GB
# "unsloth/Qwen3-14B-bnb-4bit" # RTX 3090/4090 24GB
# "unsloth/gemma-3-12b-it-bnb-4bit" # RTX 3090 12GB
MAX_SEQ_LENGTH = 2048 # Context window (longer = more VRAM)
DTYPE = None # Auto-detect (BF16 on modern GPUs)
LOAD_IN_4BIT = True # QLoRA: quantise base model to 4-bit
print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=DTYPE,
load_in_4bit=LOAD_IN_4BIT,
)
print(f"Model loaded. Parameters: {model.num_parameters() / 1e9:.2f}B")
Expected output:
Loading base model...
Model loaded. Parameters: 3.21B
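As a quick sanity check that the 4-bit load fits your budget, you can print the model's weight footprint (a small sketch using the standard transformers get_memory_footprint helper; activations and optimiser state come on top of this):
# Approximate VRAM consumed by the quantised weights alone
print(f"Weight memory footprint: {model.get_memory_footprint() / 1024**3:.2f}GB")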
# Apply LoRA adapters — only these modules are trained
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8–64; higher = more capacity, more VRAM)
target_modules=[ # Which layers to add LoRA adapters to
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # FFN
],
lora_alpha=32, # LoRA scaling factor (typically 2× rank)
lora_dropout=0.05, # Dropout regularisation (0.05 for small datasets)
bias="none", # Don't train bias terms
use_gradient_checkpointing="unsloth", # Unsloth-optimised gradient checkpointing
random_state=42,
)
# Show trainable parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} ({100 * trainable / total:.2f}% of {total / 1e9:.2f}B)")
Expected output:
Trainable params: 20,971,520 (0.65% of 3.21B)
0.65% of parameters are trained — the base model is frozen, only the tiny LoRA adapter matrices change.
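The adapter size follows directly from the LoRA construction: each adapted weight matrix of shape d_out × d_in gains two small matrices B (d_out × r) and A (r × d_in), so it contributes r × (d_out + d_in) trainable parameters. A back-of-the-envelope sketch (layer dimensions are assumed for a 3B-class model, not read from the checkpoint):
# Estimate the LoRA parameter count for a single q_proj layer (illustrative dimensions)
r = 16
d_in, d_out = 3072, 3072              # q_proj maps hidden_size to hidden_size
lora_params = r * (d_in + d_out)      # A is r × d_in, B is d_out × r
print(f"One q_proj adapter: {lora_params:,} params")  # 98,304; sum over modules and layers for the total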
Part 3: Dataset Preparation
# 02_dataset.py
from datasets import Dataset
import json
# ── Example: customer support instruction dataset ─────────────────────────
# Format: Alpaca (instruction/input/output) or ShareGPT (conversations list)
# ShareGPT format — the most flexible
EXAMPLES = [
{
"conversations": [
{"from": "human", "value": "How do I reset my password?"},
{"from": "gpt", "value": "To reset your password:\n1. Click 'Forgot Password' on the login page\n2. Enter your email address\n3. Check your email for a reset link (valid 15 minutes)\n4. Create a new password with at least 12 characters\n\nIf you don't receive the email within 5 minutes, check your spam folder."},
]
},
{
"conversations": [
{"from": "human", "value": "What are your support hours?"},
{"from": "gpt", "value": "Our support team is available Monday–Friday, 9:00 AM–6:00 PM Eastern Time. For urgent issues outside these hours, use our emergency support portal at support.example.com/urgent."},
]
},
# Add 498 more high-quality examples...
]
# Convert to Alpaca format for training
def sharegpt_to_alpaca(example):
"""Convert ShareGPT conversation to instruction/output pair."""
convs = example["conversations"]
instruction = convs[0]["value"] if convs else ""
output = convs[1]["value"] if len(convs) > 1 else ""
return {"instruction": instruction, "input": "", "output": output}
dataset = Dataset.from_list(EXAMPLES)
dataset = dataset.map(sharegpt_to_alpaca)
print(f"Dataset: {len(dataset)} examples")
print("Sample:")
print(f" Instruction: {dataset[0]['instruction'][:60]}...")
print(f" Output: {dataset[0]['output'][:60]}...")
Expected output:
Dataset: 2 examples
Sample:
Instruction: How do I reset my password?
Output: To reset your password:
1. Click 'Forgot Passw...
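In practice you will not hard-code 500 examples in the script; loading them from a JSONL file keeps things manageable. A minimal sketch, assuming a support_examples.jsonl file (hypothetical name) with one ShareGPT-format record per line:
# Load ShareGPT-format records from JSONL instead of the inline EXAMPLES list
import json
from datasets import Dataset

with open("support_examples.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

dataset = Dataset.from_list(records)
dataset = dataset.map(sharegpt_to_alpaca)
print(f"Dataset: {len(dataset)} examples")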
# Apply the Alpaca prompt template
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # tokenizer from 01_load_model.py; appending EOS teaches the model where to stop
def format_prompts(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for inst, inp, out in zip(instructions, inputs, outputs):
text = ALPACA_PROMPT.format(inst, inp, out) + EOS_TOKEN
texts.append(text)
return {"text": texts}
dataset = dataset.map(format_prompts, batched=True)
print("Sample formatted prompt:")
print(dataset[0]["text"][:300])
Expected output:
Sample formatted prompt:
Below is an instruction that describes a task. Write a response.
### Instruction:
How do I reset my password?
### Input:
### Response:
To reset your password:
1. Click 'Forgot Password' on the login page
...
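If you would rather train on the model's native chat format than the Alpaca template, the tokenizer's chat template can build the text field instead. A sketch of that alternative (the rest of this guide sticks with Alpaca):
# Alternative: format examples with the model's own chat template
def format_with_chat_template(examples):
    texts = []
    for inst, out in zip(examples["instruction"], examples["output"]):
        messages = [
            {"role": "user", "content": inst},
            {"role": "assistant", "content": out},
        ]
        # tokenize=False returns the fully formatted string, special tokens included
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return {"text": texts}

# dataset = dataset.map(format_with_chat_template, batched=True)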
Part 4: Training with SFTTrainer
# 03_train.py
from trl import SFTTrainer, SFTConfig
import torch
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=MAX_SEQ_LENGTH,
dataset_num_proc=4,
args=SFTConfig(
# ── Output ──────────────────────────────────────────────────────
output_dir="./outputs",
# ── Training duration ────────────────────────────────────────────
num_train_epochs=3, # 1-3 epochs for instruction fine-tuning
# OR: max_steps=200, # Override epochs with fixed step count
# ── Batch size ───────────────────────────────────────────────────
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch = 2 × 4 = 8
# ── Optimiser ────────────────────────────────────────────────────
optim="adamw_8bit", # 8-bit AdamW saves VRAM vs standard AdamW
learning_rate=2e-4, # LoRA learning rate (higher than full fine-tuning)
lr_scheduler_type="cosine",
warmup_ratio=0.05,
# ── Precision ────────────────────────────────────────────────────
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
# ── Logging ──────────────────────────────────────────────────────
logging_steps=10,
logging_dir="./logs",
# ── Saving ───────────────────────────────────────────────────────
save_strategy="epoch",
save_total_limit=2, # Keep only last 2 checkpoints
# ── Reproducibility ──────────────────────────────────────────────
seed=42,
),
)
# Show GPU VRAM before training
gpu_stats = torch.cuda.get_device_properties(0)
start_vram = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"GPU: {gpu_stats.name} | Total VRAM: {gpu_stats.total_memory / 1024**3:.1f}GB")
print(f"Reserved VRAM before training: {start_vram}GB")
print("\nStarting training...")
trainer_stats = trainer.train()
end_vram = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"\nTraining complete:")
print(f" Steps: {trainer_stats.global_step}")
print(f" Loss: {trainer_stats.training_loss:.4f}")
print(f" Time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f" Speed: {trainer_stats.metrics['train_samples_per_second']:.2f} samples/s")
print(f" Peak VRAM: {end_vram}GB")
Expected output (RTX 4090, 3B model, 2 examples × 3 epochs):
GPU: NVIDIA GeForce RTX 4090 | Total VRAM: 24.0GB
Reserved VRAM before training: 4.2GB
Starting training...
{'loss': 2.4521, 'learning_rate': 2e-04, 'epoch': 0.50, 'step': 1}
{'loss': 1.8432, 'learning_rate': 1.8e-04, 'epoch': 1.00, 'step': 2}
{'loss': 1.2341, 'learning_rate': 1.2e-04, 'epoch': 2.00, 'step': 4}
{'loss': 0.8123, 'learning_rate': 0.0, 'epoch': 3.00, 'step': 6}
Training complete:
Steps: 6
Loss: 0.8123
Time: 42s
Speed: 0.14 samples/s
Peak VRAM: 7.8GB
On 500 real examples, expect ~5–10 minutes on an RTX 4090. Loss should descend from ~2.4 to ~0.8 over 3 epochs for instruction fine-tuning.
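Because save_strategy="epoch" keeps checkpoints under ./outputs, an interrupted run can be resumed rather than restarted from step 0 (standard Trainer behaviour):
# Resume training from the most recent checkpoint in output_dir
trainer_stats = trainer.train(resume_from_checkpoint=True)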
Part 5: Evaluation and Inference
# 04_evaluate.py
# Test the fine-tuned model before exporting
FastLanguageModel.for_inference(model) # Enable faster inference mode
def generate(instruction, input_text="", max_new_tokens=256):
prompt = ALPACA_PROMPT.format(instruction, input_text, "")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the new tokens (not the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Test with training examples (should be well-handled)
response = generate("How do I reset my password?")
print("Trained response:")
print(response)
print()
# Test with unseen example (generalisation)
response = generate("What payment methods do you accept?")
print("Unseen query response:")
print(response)
Expected output (after training on domain data):
Trained response:
To reset your password:
1. Click 'Forgot Password' on the login page
2. Enter your email address
3. Check your email for a reset link (valid 15 minutes)
4. Create a new password with at least 12 characters
If you don't receive the email within 5 minutes, check your spam folder.
Unseen query response:
We accept the following payment methods:
- Credit and debit cards (Visa, Mastercard, American Express)
- PayPal
- Bank transfer (for enterprise accounts)
For questions about billing, contact [email protected].
The model adopts the support tone and formatting from the training data, even on unseen questions.
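Before moving on to GGUF export, it is worth saving the LoRA adapters on their own; they are only tens of megabytes and let you re-merge or continue training later. A short sketch (the directory name is illustrative):
# Save only the adapter weights and tokenizer, not the 4-bit base model
model.save_pretrained("lora-adapters")
tokenizer.save_pretrained("lora-adapters")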
Part 6: Export to GGUF and Ollama Deployment
# 05_export.py
# Export to GGUF — directly loadable by Ollama
print("Exporting to GGUF (Q4_K_M quantisation)...")
model.save_pretrained_gguf(
"finetuned-support-model",
tokenizer,
quantization_method="q4_k_m" # Best quality/size balance
# Options: "q4_k_m" (best balance), "q8_0" (higher quality), "f16" (full precision)
)
print("Export complete.")
Expected output:
Exporting to GGUF (Q4_K_M quantisation)...
Export complete.
# Create Ollama Modelfile (check the exact .gguf filename first: ls finetuned-support-model/)
cat > finetuned-support-model/Modelfile << 'EOF'
FROM ./finetuned-support-model-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
SYSTEM "You are a helpful customer support agent for Acme Corp. Answer questions about our products and services accurately and concisely."
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
EOF
# Import into Ollama
cd finetuned-support-model/
ollama create support-agent -f Modelfile
Expected output:
transferring model data 100%
creating new layer sha256:...
using existing layer sha256:...
writing manifest
success
# Test the deployed model
ollama run support-agent "How do I reset my password?"
Expected output:
To reset your password:
1. Click 'Forgot Password' on the login page
2. Enter your email address
3. Check your email for a reset link (valid 15 minutes)
4. Create a new password with at least 12 characters
If you don't receive the email, check your spam folder or contact support.
Your fine-tuned model is now running locally via Ollama, sovereign, with zero ongoing API cost.
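The deployed model is also reachable over Ollama's local HTTP API, which is how other services on the machine would call it. A minimal sketch assuming Ollama's default port 11434:
# Query the fine-tuned model through Ollama's /api/generate endpoint
import json
import urllib.request

payload = {"model": "support-agent", "prompt": "How do I reset my password?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])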
Training Configuration Reference
| Setting | Small Dataset (<500) | Medium (500–5k) | Large (>5k) |
|---|---|---|---|
| r (LoRA rank) | 8 | 16 | 32–64 |
| lora_alpha | 16 | 32 | 64–128 |
| num_train_epochs | 5–10 | 2–3 | 1–2 |
| learning_rate | 2e-4 | 2e-4 | 1e-4 |
| lora_dropout | 0.1 | 0.05 | 0.01 |
VRAM requirements by model size (QLoRA 4-bit):
| Model | Parameters | Min VRAM |
|---|---|---|
| Llama 3.2 1B | 1B | 4GB |
| Llama 3.2 3B | 3B | 6GB |
| Gemma3 12B | 12B | 14GB |
| Qwen3 14B | 14B | 16GB |
| Llama 4 Scout 17B | 17B | 20GB |
| Qwen3 32B | 32B | 36GB |
Troubleshooting
CUDA out of memory during training
Fixes (apply in order; a combined sketch follows the list):
- Reduce per_device_train_batch_size from 2 to 1
- Reduce MAX_SEQ_LENGTH from 2048 to 1024
- Reduce LoRA r from 16 to 8
- Add torch.cuda.empty_cache() before training
- Use a smaller base model
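Applied together, the first three fixes look roughly like this (an illustrative sketch reusing the names from Parts 2 and 4):
# Lower-memory training configuration (illustrative values)
from trl import SFTConfig
import torch

low_mem_args = SFTConfig(
    output_dir="./outputs",
    per_device_train_batch_size=1,    # fix 1: smaller micro-batch
    gradient_accumulation_steps=8,    # keep the effective batch size at 1 × 8 = 8
    num_train_epochs=3,
    optim="adamw_8bit",
    learning_rate=2e-4,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    seed=42,
)
# fix 2: pass max_seq_length=1024 to FastLanguageModel.from_pretrained
# fix 3: pass r=8 and lora_alpha=16 to FastLanguageModel.get_peft_model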
Training loss not decreasing below 1.5
Cause: Dataset too small, too noisy, or wrong format.
Fix: Check 20 random examples for formatting consistency. Ensure EOS token is appended. Try increasing num_train_epochs or r.
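A quick way to verify the EOS token really is on every training example (reusing dataset and tokenizer from Parts 2 and 3):
# Count formatted examples that do not end with the EOS token
missing = [i for i, t in enumerate(dataset["text"]) if not t.endswith(tokenizer.eos_token)]
print(f"Examples missing EOS: {len(missing)} of {len(dataset)}")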
GGUF file not loading in Ollama
Fix: Ensure Ollama is version 0.4+ (ollama --version). Check the Modelfile FROM path is correct relative to where you run ollama create.
Conclusion
A fine-tuned model trained on your domain data, exported to GGUF, and running in Ollama: sovereign, free at inference time, and tailored to your specific use case. The QLoRA + Unsloth stack in 2026 makes this achievable on consumer hardware — an RTX 4090 handles models up to 17B parameters.
The fine-tuned model integrates directly with the LangChain and LangGraph local agents guide — replace ChatOllama(model="llama4:scout") with ChatOllama(model="support-agent") to use your custom model in agent pipelines.
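The swap itself is a one-line change (a sketch assuming the langchain-ollama package from that guide is installed):
# Use the fine-tuned Ollama model inside a LangChain pipeline
from langchain_ollama import ChatOllama

llm = ChatOllama(model="support-agent", temperature=0.7)
print(llm.invoke("How do I reset my password?").content)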
People Also Ask
What is the difference between fine-tuning and RAG (Retrieval-Augmented Generation)?
Fine-tuning trains the model weights to change its style, format, tone, or domain vocabulary — the model learns patterns, not facts. RAG retrieves relevant documents at query time and provides them as context — the model reads them and answers. Use fine-tuning when you want the model to respond in a specific way (customer support tone, code formatting style, specific output structure). Use RAG when the model needs access to specific facts, documents, or knowledge that changes over time. They are complementary: a fine-tuned model with RAG gives you both style and knowledge.
How many examples do I need to fine-tune an LLM?
For instruction fine-tuning (teaching the model a specific task or response style), 200–1,000 high-quality, diverse examples are typically sufficient. For domain adaptation (teaching the model a new subject), 1,000–10,000 examples may be needed. For full capability fine-tuning (replicating GPT-4 capabilities), tens of billions of tokens are required — this is impractical locally. The quality of examples matters far more than quantity: 300 carefully curated, expert-written examples consistently outperform 5,000 scraped, noisy examples.
Can I fine-tune on a Mac with Apple Silicon?
Yes — Unsloth supports the Metal (MPS) backend on recent Apple Silicon such as the M2 Ultra and M3 Max, provided there is ample unified memory (64GB or more). Use DTYPE=torch.float16 and LOAD_IN_4BIT=False (full 16-bit, not 4-bit, as Metal doesn't support bitsandbytes). Training is 3–5× slower than an RTX 4090 but feasible for small models (3B–7B) with small datasets. On an M3 Max with 64GB, Llama 3.2 3B trains on 500 examples in approximately 20 minutes.
Further Reading
- How to Install Ollama and Run LLMs Locally — deploy the fine-tuned model
- GGUF Quantisation Explained — understand Q4_K_M and other quantisation levels
- LangChain and LangGraph with Ollama — use your fine-tuned model in agent pipelines
- Build a Sovereign Local AI Stack — the full AI infrastructure this model runs within
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB), Ubuntu 24.04 LTS (NVIDIA RTX 3090 24GB). Unsloth 2026.4.1, PyTorch 2.4.1+cu124, transformers 4.47.0. Last verified: April 22, 2026.