Key Takeaways
- QLoRA = quantise + LoRA: Load the base model in 4-bit quantisation (NF4), then train only small adapter matrices (LoRA). The base model weights are frozen. This reduces VRAM from 28GB to 6–10GB for a 7B model.
- Unsloth accelerates everything: roughly 2× training speed over stock Hugging Face Transformers via fused attention and other hand-optimised kernels. Supports Llama 4 Scout, Qwen3 14B, Gemma3, and Mistral Small as base models.
- 500 quality examples beat 10,000 noisy ones: For instruction tuning (the most common fine-tuning goal), dataset quality is the dominant variable. Curate carefully; use ChatGPT or Claude to generate and review synthetic data.
- Export to GGUF for Ollama: After training, model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m") produces a file you can import directly into Ollama with a Modelfile.
Introduction
Direct Answer: How do I fine-tune a large language model locally with QLoRA and Unsloth in 2026?
Fine-tuning a local LLM with QLoRA and Unsloth requires: an NVIDIA GPU with at least 8GB VRAM (16GB+ recommended), CUDA 12.4, Python 3.12, and the Unsloth library. Install with pip install unsloth. Load a base model with FastLanguageModel.from_pretrained("unsloth/llama-3.2-3b-instruct-bnb-4bit"), apply LoRA with FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","v_proj"]), format your dataset in Alpaca or ShareGPT format using HuggingFace datasets, then train with HuggingFace SFTTrainer from the trl library for 100–500 steps. Training a 3B model on 500 examples takes approximately 5–10 minutes on an RTX 4090. After training, export to GGUF with model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m") and load into Ollama for inference.
“Fine-tuning is not magic. It’s supervised learning on your specific examples. The model learns to reproduce your formatting, tone, and domain knowledge — but it cannot learn facts it has never seen. Use fine-tuning for style and format; use RAG for knowledge.”
This guide fine-tunes Llama 3.2 3B-Instruct (a good starting point on 8GB VRAM) and Llama 4 Scout (for 24GB GPUs) for a domain-specific instruction-following task, then exports to Ollama for sovereign local inference.
Part 1: Environment Setup
# Check NVIDIA GPU and CUDA
nvidia-smi | grep -E "Driver Version|CUDA Version|Name"
Expected output:
NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.4
| NVIDIA GeForce RTX 4090 24GB |
# Install Unsloth with CUDA 12.4 support
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" \
--extra-index-url https://download.pytorch.org/whl/cu124 \
--break-system-packages
pip install trl datasets peft bitsandbytes --break-system-packages
# Verify installation
python3 -c "import unsloth; print('Unsloth:', unsloth.__version__)"
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| GPU:', torch.cuda.get_device_name(0))"
Expected output:
Unsloth: 2026.4.1
CUDA available: True | GPU: NVIDIA GeForce RTX 4090
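Before loading anything, it can help to confirm how much VRAM is actually free, since the desktop session may already be holding some. A minimal sketch using torch.cuda.mem_get_info (the filename is illustrative):
# vram_check.py: report free vs total VRAM on GPU 0 before loading a model
import torch
free, total = torch.cuda.mem_get_info(0)  # returns (free_bytes, total_bytes)
print(f"Free VRAM: {free / 1024**3:.1f}GB of {total / 1024**3:.1f}GB")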
Part 2: Loading Base Model with QLoRA
# 01_load_model.py
from unsloth import FastLanguageModel
import torch
# Configuration
MODEL_NAME = "unsloth/llama-3.2-3b-instruct-bnb-4bit" # 4-bit quantised for QLoRA
# Alternatives:
# "unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit" # RTX 4090 24GB
# "unsloth/Qwen3-14B-bnb-4bit" # RTX 3090/4090 24GB
# "unsloth/gemma-3-12b-it-bnb-4bit" # RTX 3090 12GB
MAX_SEQ_LENGTH = 2048 # Context window (longer = more VRAM)
DTYPE = None # Auto-detect (BF16 on modern GPUs)
LOAD_IN_4BIT = True # QLoRA: quantise base model to 4-bit
print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=DTYPE,
load_in_4bit=LOAD_IN_4BIT,
)
print(f"Model loaded. Parameters: {model.num_parameters() / 1e9:.2f}B")
Expected output:
Loading base model...
Model loaded. Parameters: 3.21B
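As a quick sanity check that the 4-bit load fits your budget, you can print the model's weight footprint (a small sketch using the standard transformers get_memory_footprint helper; activations and optimiser state come on top of this):
# Approximate VRAM consumed by the quantised weights alone
print(f"Weight memory footprint: {model.get_memory_footprint() / 1024**3:.2f}GB")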
# Apply LoRA adapters — only these modules are trained
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8–64; higher = more capacity, more VRAM)
target_modules=[ # Which layers to add LoRA adapters to
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # FFN
],
lora_alpha=32, # LoRA scaling factor (typically 2× rank)
lora_dropout=0.05, # Dropout regularisation (0.05 for small datasets)
bias="none", # Don't train bias terms
use_gradient_checkpointing="unsloth", # Unsloth-optimised gradient checkpointing
random_state=42,
)
# Show trainable parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} ({100 * trainable / total:.2f}% of {total / 1e9:.2f}B)")
Expected output:
Trainable params: 20,971,520 (0.65% of 3.21B)
0.65% of parameters are trained — the base model is frozen, only the tiny LoRA adapter matrices change.
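The adapter size follows directly from the LoRA construction: each adapted weight matrix of shape d_out × d_in gains two small matrices B (d_out × r) and A (r × d_in), so it contributes r × (d_out + d_in) trainable parameters. A back-of-the-envelope sketch (layer dimensions are assumed for a 3B-class model, not read from the checkpoint):
# Estimate the LoRA parameter count for a single q_proj layer (illustrative dimensions)
r = 16
d_in, d_out = 3072, 3072              # q_proj maps hidden_size to hidden_size
lora_params = r * (d_in + d_out)      # A is r × d_in, B is d_out × r
print(f"One q_proj adapter: {lora_params:,} params")  # 98,304; sum over modules and layers for the total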
Part 3: Dataset Preparation
# 02_dataset.py
from datasets import Dataset
import json
# ── Example: customer support instruction dataset ─────────────────────────
# Format: Alpaca (instruction/input/output) or ShareGPT (conversations list)
# ShareGPT format — the most flexible
EXAMPLES = [
{
"conversations": [
{"from": "human", "value": "How do I reset my password?"},
{"from": "gpt", "value": "To reset your password:\n1. Click 'Forgot Password' on the login page\n2. Enter your email address\n3. Check your email for a reset link (valid 15 minutes)\n4. Create a new password with at least 12 characters\n\nIf you don't receive the email within 5 minutes, check your spam folder."},
]
},
{
"conversations": [
{"from": "human", "value": "What are your support hours?"},
{"from": "gpt", "value": "Our support team is available Monday–Friday, 9:00 AM–6:00 PM Eastern Time. For urgent issues outside these hours, use our emergency support portal at support.example.com/urgent."},
]
},
# Add 498 more high-quality examples...
]
# Convert to Alpaca format for training
def sharegpt_to_alpaca(example):
"""Convert ShareGPT conversation to instruction/output pair."""
convs = example["conversations"]
instruction = convs[0]["value"] if convs else ""
output = convs[1]["value"] if len(convs) > 1 else ""
return {"instruction": instruction, "input": "", "output": output}
dataset = Dataset.from_list(EXAMPLES)
dataset = dataset.map(sharegpt_to_alpaca)
print(f"Dataset: {len(dataset)} examples")
print("Sample:")
print(f" Instruction: {dataset[0]['instruction'][:60]}...")
print(f" Output: {dataset[0]['output'][:60]}...")
Expected output:
Dataset: 2 examples
Sample:
Instruction: How do I reset my password?
Output: To reset your password:
1. Click 'Forgot Passw...
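In practice you will not hard-code 500 examples in the script; loading them from a JSONL file keeps things manageable. A minimal sketch, assuming a support_examples.jsonl file (hypothetical name) with one ShareGPT-format record per line:
# Load ShareGPT-format records from JSONL instead of the inline EXAMPLES list
import json
from datasets import Dataset

with open("support_examples.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

dataset = Dataset.from_list(records)
dataset = dataset.map(sharegpt_to_alpaca)
print(f"Dataset: {len(dataset)} examples")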
# Apply the Alpaca prompt template
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # tokenizer from 01_load_model.py; appending EOS teaches the model where to stop
def format_prompts(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for inst, inp, out in zip(instructions, inputs, outputs):
text = ALPACA_PROMPT.format(inst, inp, out) + EOS_TOKEN
texts.append(text)
return {"text": texts}
dataset = dataset.map(format_prompts, batched=True)
print("Sample formatted prompt:")
print(dataset[0]["text"][:300])
Expected output:
Sample formatted prompt:
Below is an instruction that describes a task. Write a response.
### Instruction:
How do I reset my password?
### Input:
### Response:
To reset your password:
1. Click 'Forgot Password' on the login page
...
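If you would rather train on the model's native chat format than the Alpaca template, the tokenizer's chat template can build the text field instead. A sketch of that alternative (the rest of this guide sticks with Alpaca):
# Alternative: format examples with the model's own chat template
def format_with_chat_template(examples):
    texts = []
    for inst, out in zip(examples["instruction"], examples["output"]):
        messages = [
            {"role": "user", "content": inst},
            {"role": "assistant", "content": out},
        ]
        # tokenize=False returns the fully formatted string, special tokens included
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return {"text": texts}

# dataset = dataset.map(format_with_chat_template, batched=True)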
Part 4: Training with SFTTrainer
# 03_train.py
from trl import SFTTrainer, SFTConfig
import torch
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=MAX_SEQ_LENGTH,
dataset_num_proc=4,
args=SFTConfig(
# ── Output ──────────────────────────────────────────────────────
output_dir="./outputs",
# ── Training duration ────────────────────────────────────────────
num_train_epochs=3, # 1-3 epochs for instruction fine-tuning
# OR: max_steps=200, # Override epochs with fixed step count
# ── Batch size ───────────────────────────────────────────────────
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch = 2 × 4 = 8
# ── Optimiser ────────────────────────────────────────────────────
optim="adamw_8bit", # 8-bit AdamW saves VRAM vs standard AdamW
learning_rate=2e-4, # LoRA learning rate (higher than full fine-tuning)
lr_scheduler_type="cosine",
warmup_ratio=0.05,
# ── Precision ────────────────────────────────────────────────────
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
# ── Logging ──────────────────────────────────────────────────────
logging_steps=10,
logging_dir="./logs",
# ── Saving ───────────────────────────────────────────────────────
save_strategy="epoch",
save_total_limit=2, # Keep only last 2 checkpoints
# ── Reproducibility ──────────────────────────────────────────────
seed=42,
),
)
# Show GPU VRAM before training
gpu_stats = torch.cuda.get_device_properties(0)
start_vram = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"GPU: {gpu_stats.name} | Total VRAM: {gpu_stats.total_memory / 1024**3:.1f}GB")
print(f"Reserved VRAM before training: {start_vram}GB")
print("\nStarting training...")
trainer_stats = trainer.train()
end_vram = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"\nTraining complete:")
print(f" Steps: {trainer_stats.global_step}")
print(f" Loss: {trainer_stats.training_loss:.4f}")
print(f" Time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f" Speed: {trainer_stats.metrics['train_samples_per_second']:.2f} samples/s")
print(f" Peak VRAM: {end_vram}GB")
Expected output (RTX 4090, 3B model, 2 examples × 3 epochs):
GPU: NVIDIA GeForce RTX 4090 | Total VRAM: 24.0GB
Reserved VRAM before training: 4.2GB
Starting training...
{'loss': 2.4521, 'learning_rate': 2e-04, 'epoch': 0.50, 'step': 1}
{'loss': 1.8432, 'learning_rate': 1.8e-04, 'epoch': 1.00, 'step': 2}
{'loss': 1.2341, 'learning_rate': 1.2e-04, 'epoch': 2.00, 'step': 4}
{'loss': 0.8123, 'learning_rate': 0.0, 'epoch': 3.00, 'step': 6}
Training complete:
Steps: 6
Loss: 0.8123
Time: 42s
Speed: 0.14 samples/s
Peak VRAM: 7.8GB
On 500 real examples, expect ~5–10 minutes on an RTX 4090. Loss should descend from ~2.4 to ~0.8 over 3 epochs for instruction fine-tuning.
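Because save_strategy="epoch" keeps checkpoints under ./outputs, an interrupted run can be resumed rather than restarted from step 0 (standard Trainer behaviour):
# Resume training from the most recent checkpoint in output_dir
trainer_stats = trainer.train(resume_from_checkpoint=True)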
Part 5: Evaluation and Inference
# 04_evaluate.py
# Test the fine-tuned model before exporting
FastLanguageModel.for_inference(model) # Enable faster inference mode
def generate(instruction, input_text="", max_new_tokens=256):
prompt = ALPACA_PROMPT.format(instruction, input_text, "")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the new tokens (not the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Test with training examples (should be well-handled)
response = generate("How do I reset my password?")
print("Trained response:")
print(response)
print()
# Test with unseen example (generalisation)
response = generate("What payment methods do you accept?")
print("Unseen query response:")
print(response)
Expected output (after training on domain data):
Trained response:
To reset your password:
1. Click 'Forgot Password' on the login page
2. Enter your email address
3. Check your email for a reset link (valid 15 minutes)
4. Create a new password with at least 12 characters
If you don't receive the email within 5 minutes, check your spam folder.
Unseen query response:
We accept the following payment methods:
- Credit and debit cards (Visa, Mastercard, American Express)
- PayPal
- Bank transfer (for enterprise accounts)
For questions about billing, contact [email protected].
The model adopts the support tone and formatting from the training data, even on unseen questions.
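Before moving on to GGUF export, it is worth saving the LoRA adapters on their own; they are only tens of megabytes and let you re-merge or continue training later. A short sketch (the directory name is illustrative):
# Save only the adapter weights and tokenizer, not the 4-bit base model
model.save_pretrained("lora-adapters")
tokenizer.save_pretrained("lora-adapters")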
Part 6: Export to GGUF and Ollama Deployment
# 05_export.py
# Export to GGUF — directly loadable by Ollama
print("Exporting to GGUF (Q4_K_M quantisation)...")
model.save_pretrained_gguf(
"finetuned-support-model",
tokenizer,
quantization_method="q4_k_m" # Best quality/size balance
# Options: "q4_k_m" (best balance), "q8_0" (higher quality), "f16" (full precision)
)
print("Export complete.")
Expected output:
Exporting to GGUF (Q4_K_M quantisation)...
Export complete.
# Create Ollama Modelfile (check the exact .gguf filename first: ls finetuned-support-model/)
cat > finetuned-support-model/Modelfile << 'EOF'
FROM ./finetuned-support-model-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
SYSTEM "You are a helpful customer support agent for Acme Corp. Answer questions about our products and services accurately and concisely."
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
EOF
# Import into Ollama
cd finetuned-support-model/
ollama create support-agent -f Modelfile
Expected output:
transferring model data 100%
creating new layer sha256:...
using existing layer sha256:...
writing manifest
success
# Test the deployed model
ollama run support-agent "How do I reset my password?"
Expected output:
To reset your password:
1. Click 'Forgot Password' on the login page
2. Enter your email address
3. Check your email for a reset link (valid 15 minutes)
4. Create a new password with at least 12 characters
If you don't receive the email, check your spam folder or contact support.
Your fine-tuned model is now running locally via Ollama, sovereign, with zero ongoing API cost.
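The deployed model is also reachable over Ollama's local HTTP API, which is how other services on the machine would call it. A minimal sketch assuming Ollama's default port 11434:
# Query the fine-tuned model through Ollama's /api/generate endpoint
import json
import urllib.request

payload = {"model": "support-agent", "prompt": "How do I reset my password?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])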
Training Configuration Reference
| Setting | Small Dataset (<500) | Medium (500–5k) | Large (>5k) |
|---|---|---|---|
| r (LoRA rank) | 8 | 16 | 32–64 |
| lora_alpha | 16 | 32 | 64–128 |
| num_train_epochs | 5–10 | 2–3 | 1–2 |
| learning_rate | 2e-4 | 2e-4 | 1e-4 |
| lora_dropout | 0.1 | 0.05 | 0.01 |
VRAM requirements by model size (QLoRA 4-bit):
| Model | Parameters | Min VRAM |
|---|---|---|
| Llama 3.2 1B | 1B | 4GB |
| Llama 3.2 3B | 3B | 6GB |
| Gemma3 12B | 12B | 14GB |
| Qwen3 14B | 14B | 16GB |
| Llama 4 Scout 17B | 17B | 20GB |
| Qwen3 32B | 32B | 36GB |
Troubleshooting
CUDA out of memory during training
Fixes (apply in order; a combined sketch follows the list):
- Reduce per_device_train_batch_size from 2 to 1
- Reduce MAX_SEQ_LENGTH from 2048 to 1024
- Reduce LoRA r from 16 to 8
- Add torch.cuda.empty_cache() before training
- Use a smaller base model
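Applied together, the first three fixes look roughly like this (an illustrative sketch reusing the names from Parts 2 and 4):
# Lower-memory training configuration (illustrative values)
from trl import SFTConfig
import torch

low_mem_args = SFTConfig(
    output_dir="./outputs",
    per_device_train_batch_size=1,    # fix 1: smaller micro-batch
    gradient_accumulation_steps=8,    # keep the effective batch size at 1 × 8 = 8
    num_train_epochs=3,
    optim="adamw_8bit",
    learning_rate=2e-4,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    seed=42,
)
# fix 2: pass max_seq_length=1024 to FastLanguageModel.from_pretrained
# fix 3: pass r=8 and lora_alpha=16 to FastLanguageModel.get_peft_model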
Training loss not decreasing below 1.5
Cause: Dataset too small, too noisy, or wrong format.
Fix: Check 20 random examples for formatting consistency. Ensure EOS token is appended. Try increasing num_train_epochs or r.
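A quick way to verify the EOS token really is on every training example (reusing dataset and tokenizer from Parts 2 and 3):
# Count formatted examples that do not end with the EOS token
missing = [i for i, t in enumerate(dataset["text"]) if not t.endswith(tokenizer.eos_token)]
print(f"Examples missing EOS: {len(missing)} of {len(dataset)}")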
GGUF file not loading in Ollama
Fix: Ensure Ollama is version 0.4+ (ollama --version). Check the Modelfile FROM path is correct relative to where you run ollama create.
Conclusion
A fine-tuned model trained on your domain data, exported to GGUF, and running in Ollama: sovereign, free at inference time, and tailored to your specific use case. The QLoRA + Unsloth stack in 2026 makes this achievable on consumer hardware — an RTX 4090 handles models up to 17B parameters.
The fine-tuned model integrates directly with the LangChain and LangGraph local agents guide — replace ChatOllama(model="llama4:scout") with ChatOllama(model="support-agent") to use your custom model in agent pipelines.
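The swap itself is a one-line change (a sketch assuming the langchain-ollama package from that guide is installed):
# Use the fine-tuned Ollama model inside a LangChain pipeline
from langchain_ollama import ChatOllama

llm = ChatOllama(model="support-agent", temperature=0.7)
print(llm.invoke("How do I reset my password?").content)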
People Also Ask
What is the difference between fine-tuning and RAG (Retrieval-Augmented Generation)?
Fine-tuning trains the model weights to change its style, format, tone, or domain vocabulary — the model learns patterns, not facts. RAG retrieves relevant documents at query time and provides them as context — the model reads them and answers. Use fine-tuning when you want the model to respond in a specific way (customer support tone, code formatting style, specific output structure). Use RAG when the model needs access to specific facts, documents, or knowledge that changes over time. They are complementary: a fine-tuned model with RAG gives you both style and knowledge.
How many examples do I need to fine-tune an LLM?
For instruction fine-tuning (teaching the model a specific task or response style), 200–1,000 high-quality, diverse examples are typically sufficient. For domain adaptation (teaching the model a new subject), 1,000–10,000 examples may be needed. For full capability fine-tuning (replicating GPT-4 capabilities), tens of billions of tokens are required — this is impractical locally. The quality of examples matters far more than quantity: 300 carefully curated, expert-written examples consistently outperform 5,000 scraped, noisy examples.
Can I fine-tune on a Mac with Apple Silicon?
Yes — Unsloth supports the Metal (MPS) backend on recent Apple Silicon such as the M2 Ultra and M3 Max, provided there is ample unified memory (64GB or more). Use DTYPE=torch.float16 and LOAD_IN_4BIT=False (full 16-bit, not 4-bit, as Metal doesn't support bitsandbytes). Training is 3–5× slower than an RTX 4090 but feasible for small models (3B–7B) with small datasets. On an M3 Max with 64GB, Llama 3.2 3B trains on 500 examples in approximately 20 minutes.
Further Reading
- How to Install Ollama and Run LLMs Locally — deploy the fine-tuned model
- GGUF Quantisation Explained — understand Q4_K_M and other quantisation levels
- LangChain and LangGraph with Ollama — use your fine-tuned model in agent pipelines
- Build a Sovereign Local AI Stack — the full AI infrastructure this model runs within
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB), Ubuntu 24.04 LTS (NVIDIA RTX 3090 24GB). Unsloth 2026.4.1, PyTorch 2.4.1+cu124, transformers 4.47.0. Last verified: April 22, 2026.