Fine-Tuning LLMs: When to Use It and When to Skip It

Name: Fine-Tuning LLMs Explained: What It Is, When to Use It, and When to Skip It
Uploaded: 2026-06-03
Duration: 60 min
Channel: Aaron Gallant

Top 3 takeaways

01

Usually, don't fine-tune

Start with model selection, prompting, and RAG. Fine-tuning earns its place mainly when you want to make a small model good enough, cut down on prompting, or keep data private and local.

02

The data is the work

Fine-tuning is supervised learning, so bad training data gets baked into the weights and quietly degrades the model. Curate a few thousand representative labeled pairs and always hold some back to catch overfitting.

03

PEFT and QLoRA make it cheap

Freezing most layers, LoRA rank reduction, and quantization let you fine-tune a 13B model on a laptop. From there you tune learning rate, epochs, and dropout, and judge the result on accuracy plus human eval.

Aaron Gallant

Lead Instructor, Gauntlet AI

Lead Instructor at Gauntlet AI since the beginning; 18 years in tech across software development, data science, data pipelines, and internal tooling, including five years at Google. For his day job he synthetically generates HIPAA-compliant medical research data with LLMs and works deep in fine-tuning, reinforcement learning, and synthetic data generation — including fine-tuning jobs run on-premises on real hardware — so he knows how these models behave under the hood. MS in Computer Science, University of Illinois Urbana-Champaign; BA in Political Science & Philosophy, University of Rochester.

Lesson notes

A written walkthrough of the lecture, covering the patterns, the code, and the things that trip people up.

What Fine-Tuning Actually Changes

Fine-tuning changes the model itself, not just how you use it.

Prompting, RAG, memory, and context engineering all happen during inference, where the model's underlying parameters remain unchanged. Fine-tuning modifies those parameters, creating a lasting change in how the model behaves.

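Doing this yourself requires an open-weight model, because you need the trained parameters in hand to continue training them. (Some vendors offer hosted fine-tuning of their closed models, but you never hold the resulting weights.) Open-weight is not the same as open-source: an open-weight release exposes the trained parameters but not necessarily the full training process or code.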
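The difference between steering at inference time and changing the parameters can be shown with a toy model. This is a minimal sketch, not an LLM: a one-feature linear model stands in for the network, and every name in it (predict, fine_tune_step, the learning rate) is a hypothetical chosen for illustration.

```python
# Toy stand-in for an LLM: a one-feature linear model y = w * x + b.
# All names and values here are hypothetical, for illustration only.

def predict(params, x):
    """Inference: reads the parameters, never writes them."""
    return params["w"] * x + params["b"]

def fine_tune_step(params, x, target, lr=0.1):
    """One gradient-descent step on squared error; returns new parameters."""
    error = predict(params, x) - target
    return {
        "w": params["w"] - lr * 2 * error * x,
        "b": params["b"] - lr * 2 * error,
    }

params = {"w": 1.0, "b": 0.0}
before = dict(params)

# "Prompting": vary the input as much as you like; the parameters stay put.
outputs = [predict(params, x) for x in (0.5, 2.0, 3.0)]
unchanged_by_inference = params == before

# "Fine-tuning": one training step on an (input, target) pair moves the parameters.
tuned = fine_tune_step(params, x=2.0, target=5.0)
changed_by_training = tuned != before
```

After the training step, the tuned parameters give a prediction for x=2.0 that sits closer to the target than the original parameters did, and that change persists for every later call.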

When to Fine-Tune

Most applications do not require fine-tuning.

Start by selecting the right model and improving your prompts. If the application depends on external knowledge, RAG is usually the better solution because building a knowledge base is much easier than creating high-quality training data.

Fine-tuning makes the most sense when you need a smaller model to perform a specialized task efficiently, run locally for privacy reasons, or consistently make decisions such as classification, fraud detection, or spam filtering.
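The guidance in this section can be written down as a small checklist. This is only a sketch of the reasoning above, with hypothetical function and parameter names; a real decision would weigh cost, latency, and data availability as well.

```python
def choose_approach(depends_on_external_knowledge,
                    needs_small_or_local_model,
                    is_repeated_decision_task):
    """Simplified reading of the guidance above; all names are hypothetical."""
    if depends_on_external_knowledge:
        # A knowledge base is easier to build than high-quality training data.
        return "rag"
    if needs_small_or_local_model or is_repeated_decision_task:
        # Specialized small models, local/private use, classification-style tasks.
        return "fine-tuning"
    # Default: pick the right model and keep improving the prompt.
    return "prompting"

spam_filter = choose_approach(False, False, True)
docs_chatbot = choose_approach(True, False, False)
general_assistant = choose_approach(False, False, False)
```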

The Fine-Tuning Process

Fine-tuning follows four steps:

Choose a base model.
Gather and clean training data.
Train the model.
Validate the results.

Most of the work happens in data preparation and evaluation. High-quality, representative data matters far more than large quantities of repetitive examples.

Always reserve a validation dataset that the model never trains on, so you can measure real performance and catch overfitting.
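One way to hold data back is sketched below, assuming the training data is a list of (text, label) pairs. The function name, the 20% validation fraction, and the example rows are all hypothetical. Exact duplicates are dropped before splitting so a repeated example cannot land on both sides of the split.

```python
import random

def split_examples(examples, validation_fraction=0.2, seed=0):
    """Deduplicate (text, label) pairs, shuffle, and hold out a validation slice."""
    unique = list(dict.fromkeys(examples))  # drops exact repeats, keeps first-seen order
    rng = random.Random(seed)               # fixed seed so the split is reproducible
    rng.shuffle(unique)
    n_val = max(1, int(len(unique) * validation_fraction))
    return unique[n_val:], unique[:n_val]   # (train, validation)

# Hypothetical labeled pairs for a spam classifier.
examples = [
    ("win a free prize now", "spam"),
    ("meeting moved to 3pm", "ham"),
    ("win a free prize now", "spam"),  # exact duplicate, removed before splitting
    ("claim your reward today", "spam"),
    ("lunch on friday?", "ham"),
    ("invoice attached for march", "ham"),
]
train, validation = split_examples(examples)
```

Train only on train, and touch validation only to measure the result.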

PEFT, LoRA, and QLoRA

Modern fine-tuning is practical because only a small portion of the model needs to be trained.

Parameter-Efficient Fine-Tuning (PEFT) freezes most of the model while updating a small number of parameters. LoRA (Low-Rank Adaptation), one PEFT method, goes further: it leaves the original weights frozen and trains only small low-rank matrices added alongside them. Quantization, separately, shrinks the model by storing weights at lower precision.
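A rough sketch of why LoRA cuts the training cost: instead of updating a full d_out x d_in weight matrix, it trains two thin factors of rank r whose product has the same shape, so the trainable count drops from d_out * d_in to r * (d_out + d_in). The layer size and rank below are hypothetical values picked for illustration, not figures from the lecture.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out = d_in = 4096  # hypothetical size of one weight matrix
rank = 8             # hypothetical LoRA rank

full_params = d_out * d_in                               # 16,777,216 if the whole matrix were trained
lora_params = lora_trainable_params(d_out, d_in, rank)   # 65,536 for the two factors
fraction = lora_params / full_params                     # 1/256, about 0.4%
```

Under these assumptions, the adapter for this one matrix trains about 0.4% of the parameters that full fine-tuning would.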

Combined as QLoRA, these techniques make it possible to fine-tune large language models on consumer hardware instead of expensive GPU clusters.
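The effect of quantization on memory is simple arithmetic. The sketch below counts weight storage only for a 13B-parameter model at 16-bit and 4-bit precision. It ignores activations, adapter weights, optimizer state, and quantization overhead, so real requirements are higher; the function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 13e9  # a 13B-parameter model

fp16_gb = weight_memory_gb(n_params, 16)  # 26.0 GB at 16 bits per weight
int4_gb = weight_memory_gb(n_params, 4)   # 6.5 GB at 4 bits per weight
```

Going from 16 bits to 4 bits per weight cuts weight storage by a factor of four, which is what brings a model of this size within reach of consumer hardware.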

Evaluation Matters More Than Training

Training a model is only part of the process. The real challenge is proving that it performs better.

Common metrics include accuracy and other classification metrics (such as precision and recall), perplexity, and alignment with human expectations, but the most valuable evaluation is still human judgment against clearly defined success criteria.
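For reference, the automatic metrics can be computed in a few lines. This is a minimal sketch with hypothetical labels and token probabilities. Perplexity here is the exponential of the average negative log-probability the model assigned to the correct tokens, so lower is better.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class; 0.0 when a denominator is empty."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical validation labels and model predictions.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

acc = accuracy(y_true, y_pred)                            # 4 of 5 correct
precision, recall = precision_recall(y_true, y_pred, "spam")

# A model that gives every correct token probability 0.25 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
```

In this example the model never flags a legitimate message (precision 1.0) but misses one of three spam messages (recall about 0.67), a distinction that accuracy alone hides.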

The lecture's main takeaway is that successful fine-tuning depends far more on data quality and evaluation than on the training algorithm itself. Models generate plausible outputs by design, making rigorous testing and human verification essential before deploying them in production.

FAQ

When should you fine-tune an LLM?

Fine-tune when the failure is about style, format, or domain behavior that prompting and retrieval cannot fix. If you just need the model to know more facts, use retrieval instead. Either way, try a better prompt first, since a good prompt often matches fine-tuning at a fraction of the cost.

What is fine-tuning, really?

You continue training on your own examples so the model adapts its behavior, which is a different thing from steering it at inference time with prompts or retrieved context.

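What are PEFT and QLoRA?

They are techniques for making fine-tuning cheap. PEFT (parameter-efficient fine-tuning) updates only a small set of weights while the rest stay frozen. QLoRA pairs LoRA, a PEFT method, with a quantized base model to cut memory. Together they make fine-tuning much cheaper than updating every weight in the model.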

How do you choose between fine-tuning, RAG, and prompting?

Use prompting for behavior you can specify, retrieval for missing knowledge, and fine-tuning for consistent style, format, or domain behavior that the other two cannot reliably produce.

Why can fine-tuning quietly waste time and budget?

It needs data prep, training runs, and evaluation, and it frequently lands where a better prompt would have, so the cost only pays off for the right job.

How much data do you need to fine-tune?

Less than people assume. With PEFT and QLoRA, a few thousand representative labeled pairs is a reasonable target, and quality and representativeness count far more than raw volume.

How do you know fine-tuning actually helped?

Evaluate the model against a held-out set before and after fine-tuning, and run a prompt-only baseline on the same set. If the prompt baseline matches the fine-tuned model, you did not need to fine-tune.
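A before-and-after comparison might look like the sketch below. The two predictors are hypothetical stand-ins (simple keyword rules) for a prompt-only baseline and a fine-tuned model, and the 5-point minimum gain is an assumed threshold, not a figure from the lecture.

```python
def held_out_accuracy(predict, held_out):
    """Accuracy of a predictor on (text, label) pairs it was never trained on."""
    correct = sum(1 for text, label in held_out if predict(text) == label)
    return correct / len(held_out)

def fine_tuning_helped(baseline, tuned, held_out, min_gain=0.05):
    """True only if the tuned model beats the baseline by at least min_gain."""
    gain = held_out_accuracy(tuned, held_out) - held_out_accuracy(baseline, held_out)
    return gain >= min_gain

# Hypothetical held-out examples.
held_out = [
    ("win a free prize", "spam"),
    ("free lunch on friday?", "ham"),
    ("claim your reward", "spam"),
    ("meeting at 3pm", "ham"),
]

# Stand-ins for real models: each maps a text to a label.
def prompt_baseline(text):
    return "spam" if "free" in text else "ham"

def tuned_model(text):
    return "spam" if any(word in text for word in ("prize", "reward")) else "ham"

baseline_acc = held_out_accuracy(prompt_baseline, held_out)  # 2 of 4 correct
tuned_acc = held_out_accuracy(tuned_model, held_out)         # 4 of 4 correct
helped = fine_tuning_helped(prompt_baseline, tuned_model, held_out)
```

If helped comes back False on a realistic held-out set, the prompt baseline was already good enough and the fine-tuning run did not earn its cost.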