Dev Corner Fine-Tuning & LLMOps QLoRA & Unsloth

Fine-Tune Llama 4 with QLoRA & Unsloth on a Consumer GPU 2026

98 / 100

🔴Advanced

Fine-tune Llama 4 Scout on a 12GB consumer GPU using QLoRA and Unsloth. Covers dataset preparation, training config, memory optimisation, evaluation, and GGUF export for Ollama inference.

Current

By Kofi Mensah ✓

Mar 3, 2026

20 min

2-4 hrs (training time)

Fine-Tune Llama 4 with QLoRA & Unsloth on a Consumer GPU 2026

Article Roadmap

Key Takeaways

QLoRA (Quantised Low-Rank Adaptation) fine-tunes LLMs on 4-bit quantised weights — a 17B model that normally requires 34GB VRAM trains on 12GB using QLoRA because only the small LoRA adapter weights (< 1% of parameters) are updated in full precision.
Unsloth is the fastest open-source QLoRA training library in 2026 — it uses custom CUDA kernels and memory-efficient attention to deliver 2× faster training and 60% less VRAM usage compared to standard HuggingFace transformers training, with no accuracy loss.
The optimal QLoRA hyperparameters for most fine-tuning tasks: r=16 (LoRA rank), lora_alpha=32, lora_dropout=0.05, learning_rate=2e-4, batch_size=2 with gradient_accumulation_steps=4 (effective batch of 8), and max_seq_length=2048.
After fine-tuning, export to GGUF format with 'model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")' and load in Ollama with 'ollama create my-model -f Modelfile' for sovereign local inference — no HuggingFace account needed.

Key Takeaways

QLoRA = large models on small GPUs: Fine-tune a 17B model on 12GB VRAM by training only a small set of adapter weights on top of 4-bit quantised base weights.
Unsloth = 2× faster, 60% less VRAM: Drop-in replacement for standard HuggingFace training. Always use Unsloth for QLoRA on consumer hardware.
Dataset quality > quantity: 300 high-quality, consistent training examples outperform 5,000 scraped, noisy ones. Curate before training.
Export to GGUF for Ollama: After training, convert to GGUF and load in Ollama for sovereign inference — same CLI as any other local model.

Introduction

Direct Answer: How do I fine-tune Llama 4 Scout with QLoRA and Unsloth on a consumer GPU in 2026?

Install: pip install unsloth trl datasets. Load model: model, tokenizer = FastLanguageModel.from_pretrained("unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit", max_seq_length=2048, load_in_4bit=True). Add LoRA: model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj"]). Prepare dataset in chat format and fine-tune with SFTTrainer. After training, export: model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m"). Load in Ollama: ollama create my-model -f Modelfile. Total VRAM required: ~11–12GB for Llama 4 Scout 17B at 4-bit precision on a single RTX 3060 12GB.

Part 1: Environment Setup

# Install Unsloth (includes all dependencies)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --break-system-packages
pip install --no-deps trl peft accelerate bitsandbytes --break-system-packages

# Verify CUDA and Unsloth
python3 -c "
import torch, unsloth
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')
print(f'Unsloth: {unsloth.__version__}')
"

Expected output:

PyTorch: 2.5.1+cu124
CUDA: 12.4
GPU: NVIDIA GeForce RTX 3060
VRAM: 12.0GB
Unsloth: 2025.5.6

Part 2: Dataset Preparation

# dataset_prep.py — prepare data in chat format
from datasets import Dataset

# Fine-tuning task: customer support classification
# Each example: instruction → structured JSON output
training_data = [
    {
        "instruction": "Classify this support ticket and suggest a response category.",
        "input": "My order hasn't arrived after 2 weeks and tracking shows 'in transit'.",
        "output": '{"category": "shipping_delay", "priority": "high", "department": "logistics", "suggested_response": "escalate_to_carrier"}'
    },
    {
        "instruction": "Classify this support ticket and suggest a response category.",
        "input": "I want to cancel my subscription before the next billing date.",
        "output": '{"category": "subscription_cancellation", "priority": "medium", "department": "billing", "suggested_response": "process_cancellation_request"}'
    },
    # ... 300+ more examples ...
]

def format_prompt(example: dict) -> str:
    """Format example into Llama 4 chat format."""
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a customer support classifier. Analyse tickets and return structured JSON.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{example['instruction']}

Ticket: {example['input']}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{example['output']}<|eot_id|>"""

dataset = Dataset.from_list(training_data)
dataset = dataset.map(lambda x: {"text": format_prompt(x)})

# Split 90/10 train/eval
split = dataset.train_test_split(test_size=0.1, seed=42)
print(f"Train: {len(split['train'])} | Eval: {len(split['test'])}")
print(f"\nExample formatted prompt:\n{split['train'][0]['text'][:300]}...")

Expected output:

Train: 270 | Eval: 30

Example formatted prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a customer support classifier. Analyse tickets and return structured JSON.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
...

Part 3: Load Model with Unsloth

# training.py
from unsloth import FastLanguageModel
import torch

MAX_SEQ_LENGTH = 2048
DTYPE = None           # Auto-detect: float16 for NVIDIA
LOAD_IN_4BIT = True    # QLoRA: quantise to 4-bit

# Load Llama 4 Scout in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit",
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=DTYPE,
    load_in_4bit=LOAD_IN_4BIT,
)

# Print memory usage after loading
used = torch.cuda.memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"VRAM after model load: {used:.1f}GB / {total:.1f}GB")

Expected output:

VRAM after model load: 9.8GB / 12.0GB

# Add LoRA adapters (trainable parameters)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank — higher = more capacity, more VRAM
    target_modules=[         # Which layers to add LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj"         # MLP
    ],
    lora_alpha=32,           # LoRA alpha (typically 2× rank)
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",   # Saves ~30% VRAM
    random_state=42,
)

# Check trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e6:.1f}M / {total/1e6:.0f}M parameters ({trainable/total*100:.2f}%)")

Expected output:

Trainable: 41.9M / 16,983.0M parameters (0.25%)

Only 0.25% of parameters are trained — this is why QLoRA fits on 12GB VRAM.

Part 4: Training

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # Effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,
        max_steps=-1,                    # -1 = use num_train_epochs
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=25,
        eval_steps=50,
        evaluation_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        optim="adamw_8bit",             # 8-bit Adam saves VRAM
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="./outputs",
        report_to="none",               # Set to "tensorboard" for metrics
    ),
)

# Monitor VRAM before training starts
used = torch.cuda.memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"VRAM before training: {used:.1f}GB / {total:.1f}GB")

trainer_stats = trainer.train()

print(f"\nTraining complete!")
print(f"Time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f"Samples/sec: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")

Expected output:

VRAM before training: 11.2GB / 12.0GB

Step  25 | Loss: 1.8234 | LR: 1.95e-04
Step  50 | Loss: 1.2341 | Eval Loss: 1.3847
Step 100 | Loss: 0.9123 | Eval Loss: 1.0234  ← checkpoint saved
Step 150 | Loss: 0.7834 | Eval Loss: 0.8923
...

Training complete!
Time: 2847s (47 minutes)
Samples/sec: 0.57
Final loss: 0.6234

Part 5: Evaluate the Fine-Tuned Model

# evaluation.py
from unsloth import FastLanguageModel

# Enable fast inference mode
FastLanguageModel.for_inference(model)

def classify_ticket(ticket: str) -> str:
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a customer support classifier. Return ONLY JSON.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Classify this support ticket: {ticket}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.1,     # Low temperature for structured output
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

# Test on held-out examples
test_tickets = [
    "My payment failed three times today",
    "Can I get a refund for the premium plan?",
    "The app crashes every time I open it on iPhone 16"
]

import json
print("=== EVALUATION ===")
for ticket in test_tickets:
    response = classify_ticket(ticket)
    try:
        data = json.loads(response)
        print(f"\nTicket: {ticket}")
        print(f"  Category: {data.get('category')} | Priority: {data.get('priority')}")
    except json.JSONDecodeError:
        print(f"\nTicket: {ticket}")
        print(f"  Raw: {response[:100]}")

Expected output:

=== EVALUATION ===

Ticket: My payment failed three times today
  Category: payment_failure | Priority: high

Ticket: Can I get a refund for the premium plan?
  Category: refund_request | Priority: medium

Ticket: The app crashes every time I open it on iPhone 16
  Category: bug_report | Priority: high

Part 6: Export to GGUF and Load in Ollama

# export.py
# Export merged model to GGUF for Ollama
model.save_pretrained_gguf(
    "support-classifier",          # Output directory name
    tokenizer,
    quantization_method="q4_k_m"   # Same quantisation as running models in Ollama
)
print("GGUF exported to: support-classifier/")

Expected output:

Unsloth: Merging QLoRA weights into base model...
Unsloth: Saving GGUF model to support-classifier/...
Unsloth: Quantising to Q4_K_M...
GGUF exported to: support-classifier/

# Create Modelfile for Ollama
cat > support-classifier/Modelfile << 'EOF'
FROM ./support-classifier-q4_k_m.gguf

SYSTEM """You are a customer support ticket classifier. Given a ticket, respond with a JSON object containing: category, priority (high/medium/low), department, and suggested_response."""

PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

# Load into Ollama
ollama create support-classifier-v1 -f support-classifier/Modelfile

# Test it
ollama run support-classifier-v1 "My account was charged twice this month"

Expected output:

{"category": "billing_error", "priority": "high", "department": "billing", "suggested_response": "refund_duplicate_charge"}

Troubleshooting

`CUDA out of memory`

Reduce per_device_train_batch_size to 1, increase gradient_accumulation_steps to 8 to keep effective batch size at 8. Also try reducing max_seq_length from 2048 to 1024 if examples are short.

Loss not decreasing after epoch 1

Learning rate may be too high (2e-4 is a starting point — try 1e-4). Also check dataset quality: if training examples have inconsistent output format, the model can’t converge.

GGUF export fails with OOM

The GGUF export merges LoRA weights into the full base model, temporarily requiring more VRAM. Add --offload-kqv or export on a machine with more RAM. Alternatively: model.save_pretrained("lora-only") saves the LoRA adapter separately, which can be merged later.

Conclusion

Llama 4 Scout is fine-tuned with QLoRA on a 12GB consumer GPU, exported to GGUF, and loaded in Ollama for sovereign local inference. The custom model produces consistent structured output (JSON) that the base model would generate only with complex prompting.

See RAG vs Fine-Tuning vs Prompt Engineering 2026 for the decision framework on when to fine-tune versus use RAG, and How to Install Ollama and Run LLMs Locally for Ollama setup.

Part 12: Production-Ready Finetuning Workflow

Fine-tuning with QLoRA and UnsLoTH is powerful, but the workflow must be engineered for repeatability, safety, and efficiency.

12.1 Data curation and annotation

A successful fine-tuning run begins with high-quality examples. Curate prompts, inputs, and target outputs so they reflect the exact behavior you want from the model. Label examples consistently and avoid mixing unrelated instruction styles in the same dataset.

12.2 Dataset splitting and validation

Treat your fine-tuning dataset like any other ML dataset: split it into training, validation, and test sets. Keep a holdout set for final evaluation so you can detect overfitting or unwanted behavioral drift.

12.3 Tokenisation and sequence length

Choose a tokenizer configured for the base model and inspect token lengths. Longer sequences cost more compute and increase the chance of context trimming. If your examples are too long, chunk them intelligently or create serialized prompt templates with placeholders.

12.4 Parameter-efficient tuning

QLoRA reduces memory usage by tuning a low-rank adapter on top of a frozen base model. Control the rank and the adapter size according to your hardware and the complexity of the task. Smaller ranks are faster, but may require more training examples.

12.5 Mixed precision and gradient checkpointing

Use half precision and gradient checkpointing to fit larger batch sizes on limited GPU or CPU memory. For UnsLoTH, mixed precision is especially effective because it preserves quality while reducing runtime footprint.

12.6 Training loop best practices

Log training loss, validation loss, learning rate, and any sample outputs. Save periodic checkpoints and keep a metadata manifest that records the dataset hash, prompt template version, and hyperparameters.

12.7 Safety filtering

Filter out undesirable or unsafe content from the training examples. Even a few toxic or misleading examples can bias the model. Use automated content filters plus a human review pass when possible.

12.8 Evaluation metrics

Evaluation should include accuracy, answer quality, and instruction-following behavior. Use rationale-based metrics for generative outputs and keep human review as part of the assessment process.

12.9 Deployment readiness

Package the tuned model with version metadata, quantization settings, and compatibility notes. Keep the base model version and LoRA weights separately so you can reproduce the exact tuned artifact.

Part 13: Optimising QLoRA for Latency and Cost

Efficient inference is just as important as training cost.

13.1 Quantization backends and performance

Choose the best backend for your hardware. QLoRA works well with backends that support 4-bit and 8-bit inference. Test both to see where the best quality/latency tradeoff lies.

13.2 Memory budgeting

Monitor the memory profile of the tuned model in inference mode. For edge or on-premise hosts, keep the total footprint under the available RAM plus overhead for the runtime and any other services.

13.3 Batch sizes and request patterns

Use smaller batches for interactive applications and larger batches for bulk processing. Measure latency across both patterns to choose the right deployment configuration.

13.4 Kernel and library compatibility

Pin the CUDA, ROCm, or CPU kernel versions carefully. Mismatched libraries can lead to poor performance, instability, or subtle numerical behavior changes.

Part 14: Model Safety and Guardrails

Self-hosted fine-tuning must enforce guardrails at multiple layers.

14.1 Overgeneration controls

Include stop tokens, output length caps, and quality filters in your generation pipeline. This prevents runaway outputs and keeps the model responses predictable.

14.2 Rejection sampling and reranking

For critical responses, generate multiple candidates and rerank them by a safety-aware discriminator or a higher-confidence criterion.

14.3 Monitoring for drift

Track model behavior over time. When you deploy a new fine-tuned model, compare its outputs to previous versions on a fixed evaluation set.

Fine-tuning is rarely a one-pass process.

15.1 Feedback incorporation

Use user feedback and error logs to identify weak examples. Add corrected prompts to the training dataset and retrain incrementally.

15.2 Prompt and response replay

Store representative prompts and generated outputs. Re-run them after each tuning iteration to ensure you did not regress answer quality.

15.3 Model rollback strategy

Keep the last stable model available. If a new fine-tuned model performs worse in production, rollback quickly and analyze the training changes that caused the regression.

Part 16: Documentation and Governance

Your fine-tuning process should be transparent and auditable.

16.1 Metadata tracking

Record model lineage, dataset versions, hyperparameters, and evaluation scores. Store this information alongside the model artifact.

16.2 Review checkpoints

Require a review of training configuration and safety filters before promoting a model to production. This is especially important in self-hosted environments where the model can influence internal decisions.

16.3 Compliance and audit readiness

Keep a record of the data sources used for fine-tuning. For regulated or sensitive domains, document consent, provenance, and redaction steps taken during example creation.

Part 17: Evaluation at Scale

A large-scale fine-tuning project needs systematic evaluation across many dimensions.

17.1 Benchmark suites

Build benchmark suites for the key tasks your model should perform. Include both in-domain and out-of-domain examples so you can see where the model generalises and where it fails.

17.2 Adversarial and robustness tests

Test the tuned model with adversarial prompts designed to probe hallucinations, prompt injection, and format errors. This helps catch failure modes before deployment.

17.3 Responsiveness and latency

Measure the tuned model’s inference latency across the hardware you plan to deploy on. Use representative prompt lengths and batch sizes.

17.4 Cost-aware evaluation

Track the cost per request, whether that is GPU time, local CPU cycles, or energy. Compare the cost to the quality improvement you gain from tuning.

Part 18: Lifecycle Management for Tuned Models

Control how tuned models are stored, versioned, and retired.

18.1 Model registry and metadata

Keep a registry that stores model name, version, training dataset hash, hyperparameters, and evaluation summary. This registry should be the source of truth for deployments.

18.2 Rollback and canary deployments

When deploying a new tuned model, roll it out gradually. Use canary traffic and compare it to the baseline. Keep the old model available for instant rollback.

18.3 Deprecation and cleanup

Retire old tuned models once you are confident a newer version is better. Clean up disk space and update documentation to avoid accidental use of stale weights.

Part 19: Governance and Explainability

Fine-tuned models should still be auditable.

19.1 Data provenance

Record where each training example came from, why it was included, and who approved it. This is essential when compliance or internal review is required.

19.2 Explainability artifacts

For instruction-following models, preserve example prompts and outputs that illustrate the behavior. Use these artifacts to explain model decisions to stakeholders.

19.3 Ethical review

Run a review of the fine-tuning dataset for bias, privacy exposure, and regulatory risk. Engage subject matter experts when your model touches sensitive domains.

Part 20: Hardware and Resource Planning

A fine-tuning pipeline must align with available compute resources.

20.1 GPU memory planning

Estimate memory usage based on model size, batch size, sequence length, and optimizer state. QLoRA can dramatically reduce memory, but you still need to validate the planned configuration on real hardware.

20.2 CPU-only and low-memory workflows

For environments without GPUs, use CPU-friendly optimization techniques. Split data into smaller batches, use fewer LoRA ranks, and choose a smaller base model if necessary.

20.3 Distributed and federated training

For large datasets or multiple teams, distribute fine-tuning across nodes. Use a parameter server or gradient accumulation strategy to keep the tuning process stable.

Part 21: Continuous Improvement and Model Refresh

Fine-tuning is best treated as a continuous improvement cycle.

21.1 Monitoring real-world performance

Track live usage, error rates, and user feedback after deployment. If the model drifts or begins to behave incorrectly, plan a refresh cycle.

21.2 Data augmentation for new domains

As new content arrives, augment your training dataset with examples that reflect the latest language and use cases. Keep the training data fresh without losing the original core behavior.

21.3 Automation of refresh pipelines

Automate dataset extraction, example labeling, and training runs with CI/CD. This minimizes manual effort and keeps the fine-tuning cadence predictable.

Part 22: Model Safety and Compliance

A self-hosted fine-tuned model can still pose compliance risks.

22.1 Sensitive data controls

Ensure that training data does not contain private or regulated information unless explicitly authorized. Use redaction and anonymization when needed.

22.2 Deployment approvals

Require review and approval before deploying a new tuned model to production. Include security, compliance, and product owners in the approval flow.

22.3 Explainability and audit logs

Keep logs of training runs, dataset versions, and evaluation results. This is essential for answering questions such as why a model was trained and what it was trained on.

Part 23: Adapter and LoRA Architecture

Understanding the adapter architecture is important for advanced QLoRA tuning.

23.1 LoRA rank and scaling

LoRA introduces low-rank matrices to the attention weights. The rank controls how much task-specific capacity is available. Higher rank improves representation power but increases memory and compute.

23.2 Layer selection

Choose which layers to adapt carefully. Many teams tune only the later transformer blocks and the output projection layers to preserve the base model’s general knowledge while adapting behavior to the task.

23.3 Multiple adapters and task switching

Use separate LoRA adapters for different tasks and switch them at runtime. This lets a single base model serve multiple behaviors without retraining.

23.4 Adapter composition

Compose adapters sequentially or in parallel for task combinations. For example, use one adapter for domain style and another for safety filtering, then merge their outputs for inference.

Part 24: Prompt Injection and Model Guardrails

Self-hosted fine-tuned models can still be vulnerable to adversarial instructions.

24.1 Input sanitization

Sanitize user inputs before they are used in prompts. Remove or escape malicious tokens that could alter the prompt’s intent.

24.2 Instruction isolation

Keep system and user instructions separate. Use a fixed system prompt that defines safety constraints and place user content in a clearly bounded section.

24.3 Verification prompts

Use a verification pass when the model is asked to perform sensitive actions. Ask it to summarise the input and confirm the intent before executing.

Part 25: Deployment and Runtime Safety

Deploy tuned models with infrastructure-level safety measures.

25.1 Resource isolation

Run the inference service in a container or VM with resource limits. This prevents runaway GPU or memory consumption from affecting other workloads.

25.2 Metrics for safety

Track the rate of unsafe or unexpected outputs. Use these metrics to trigger retraining or prompt adjustments when behavior degrades.

25.3 Graceful degradation

If the tuned model behaves unpredictably, fall back to the base model or a safer default prompt. This keeps the service available while reducing risk.

Part 26: Final Model Stewardship

Assign a clear steward to the fine-tuned model who is responsible for its ongoing performance, quality, and safety. This role is critical to keep the model stable over time.

Fine-Tuning LLMs with QLoRA and Unsloth 2026: Local Training Guide

>_ 22 Apr | 22 min | Dev Corner

🔴Advanced

Fine-tune large language models locally with QLoRA and Unsloth on Ubuntu 24.04 in 2026. Covers dataset preparation, LoRA configuration, training on RTX 4090.

By Marcus Thorne

How to Install Ollama and Run LLMs Locally: Complete 2026 Guide

>_ 17 Apr | 16 min | Dev Corner

🟢Beginner

Install Ollama 5.x on Ubuntu, macOS, and Windows. Pull and run Llama 4, Qwen3, Gemma 3, and Mistral locally. REST API setup, GPU acceleration, Open WebUI.

By Marcus Thorne

Local Multimodal AI 2026: Expert Guide to Vision + Text Pipelines with Llama 4 & Qwen2-VL

>_ 19 May | 18 min | Dev Corner

🟡Intermediate

Sovereign local multimodal AI on Ubuntu 24.04: vision-language with Llama 4 Scout, document and image reasoning with Qwen2-VL, and local Whisper audio transcription. Practical pipeline design for on-premise inference and secure data workflows.

By Kofi Mensah

#qlora #unsloth #fine-tuning #llama4 #consumer-gpu #lora #dev-corner #2026

Key Takeaways

Introduction

Part 1: Environment Setup

Part 2: Dataset Preparation

Part 3: Load Model with Unsloth

Part 4: Training

Part 5: Evaluate the Fine-Tuned Model

Part 6: Export to GGUF and Load in Ollama

Troubleshooting

CUDA out of memory

Loss not decreasing after epoch 1

GGUF export fails with OOM

Conclusion

People Also Ask

What LoRA rank should I use for fine-tuning?

How long does QLoRA fine-tuning take on a consumer GPU?

Part 12: Production-Ready Finetuning Workflow

12.1 Data curation and annotation

12.2 Dataset splitting and validation

12.3 Tokenisation and sequence length

12.4 Parameter-efficient tuning

12.5 Mixed precision and gradient checkpointing

12.6 Training loop best practices

12.7 Safety filtering

12.8 Evaluation metrics

12.9 Deployment readiness

Part 13: Optimising QLoRA for Latency and Cost

13.1 Quantization backends and performance

13.2 Memory budgeting

13.3 Batch sizes and request patterns

13.4 Kernel and library compatibility

Part 14: Model Safety and Guardrails

14.1 Overgeneration controls

14.2 Rejection sampling and reranking

14.3 Monitoring for drift

Part 15: Iterative Refinement

15.1 Feedback incorporation

15.2 Prompt and response replay

15.3 Model rollback strategy

Part 16: Documentation and Governance

16.1 Metadata tracking

16.2 Review checkpoints

16.3 Compliance and audit readiness

Part 17: Evaluation at Scale

17.1 Benchmark suites

17.2 Adversarial and robustness tests

17.3 Responsiveness and latency

17.4 Cost-aware evaluation

Part 18: Lifecycle Management for Tuned Models

18.1 Model registry and metadata

18.2 Rollback and canary deployments

18.3 Deprecation and cleanup

Part 19: Governance and Explainability

19.1 Data provenance

19.2 Explainability artifacts

19.3 Ethical review

Part 20: Hardware and Resource Planning

20.1 GPU memory planning

20.2 CPU-only and low-memory workflows

20.3 Distributed and federated training

Part 21: Continuous Improvement and Model Refresh

21.1 Monitoring real-world performance

21.2 Data augmentation for new domains

21.3 Automation of refresh pipelines

Part 22: Model Safety and Compliance

22.1 Sensitive data controls

22.2 Deployment approvals

22.3 Explainability and audit logs

Part 23: Adapter and LoRA Architecture

23.1 LoRA rank and scaling

23.2 Layer selection

23.3 Multiple adapters and task switching

23.4 Adapter composition

Part 24: Prompt Injection and Model Guardrails

24.1 Input sanitization

24.2 Instruction isolation

24.3 Verification prompts

Part 25: Deployment and Runtime Safety

25.1 Resource isolation

25.2 Metrics for safety

`CUDA out of memory`