Collin D. Johnson

MLX Fine-Tuning Kit

May 2026 · open-source kit

A reusable kit for fine-tuning open-source LLMs locally on Apple Silicon using MLX. Built around an M2 Max with 64GB unified memory, but works on any modern Mac with 16GB+. The repo bundles the templates, worked examples, and reference docs I copy into every new fine-tuning project.

View the repo on GitHub →

What's in it

The five-step workflow

Every project follows the same pattern, which is most of what the kit encodes:

  1. Define the behavior — Write a two-sentence before/after spec. If you can't, you're not ready to build the dataset.
  2. Build the dataset — 30 examples gets you a clear pattern, 100–300 gets real generalization. Quality over quantity. One bad example pollutes training.
  3. Baseline test — Run the base model on your test prompts before training. Skipping this is the #1 beginner mistake.
  4. Train — Run mlx_lm.lora --config lora_config.yaml. Watch train and val loss; if val starts climbing while train falls, you're overfitting.
  5. Evaluate — Compare side-by-side on prompts the model hasn't seen. Loss numbers help, but the real test is "do the outputs look like what I wanted?"
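Step 2 in practice: mlx-lm's LoRA trainer reads JSONL files (train.jsonl and valid.jsonl) from a data directory. Here's a minimal sketch of a split script, assuming the simple prompt/completion record format mlx-lm supports — the example records are made up, and the kit's prepare_data.py template is the source of truth for the exact format your config expects:

```python
import json
import random
from pathlib import Path

# Hypothetical examples; in practice, load your real before/after pairs.
examples = [
    {"prompt": f"Summarize ticket #{i} in one line.",
     "completion": f"Ticket #{i}: concise one-line summary."}
    for i in range(50)
]

def write_splits(examples, out_dir="data", valid_frac=0.1, seed=42):
    """Shuffle examples and write train.jsonl / valid.jsonl splits."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_frac))
    splits = {"valid.jsonl": shuffled[:n_valid],
              "train.jsonl": shuffled[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    return {name: len(rows) for name, rows in splits.items()}

counts = write_splits(examples)
print(counts)  # {'valid.jsonl': 5, 'train.jsonl': 45}
```

Shuffling with a fixed seed keeps the split reproducible, so reruns of the script don't quietly leak validation examples into training.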
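The overfitting check in step 4 can be made mechanical. This is purely illustrative — mlx_lm.lora prints train and val loss to the console, and the loss curves below are made up:

```python
def looks_overfit(train_losses, val_losses, window=3):
    """Flag overfitting: val loss rising while train loss keeps
    falling over the last `window` evaluation points."""
    if len(train_losses) < window + 1 or len(val_losses) < window + 1:
        return False
    train_falling = all(train_losses[i] > train_losses[i + 1]
                        for i in range(-window - 1, -1))
    val_rising = all(val_losses[i] < val_losses[i + 1]
                     for i in range(-window - 1, -1))
    return train_falling and val_rising

# Made-up loss curves: train keeps dropping, val turns back up.
train = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
val = [2.2, 1.8, 1.6, 1.7, 1.9, 2.0]
print(looks_overfit(train, val))  # True
```

When this fires, the usual fixes are fewer iterations, more data, or a smaller adapter.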

Quick start

git clone https://github.com/collindjohnson/mlx-finetune-kit
mkdir -p ~/AI/finetuning/my-new-project && cd ~/AI/finetuning/my-new-project
cp ../path/to/mlx-finetune-kit/templates/* .

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip && pip install mlx-lm huggingface_hub

# Edit prepare_data.py with your dataset, then:
python prepare_data.py        # Build train/valid splits
python test_model.py --base   # Baseline outputs (don't skip)
mlx_lm.lora --config lora_config.yaml   # Train (5–15 min)
python test_model.py --tuned  # See if it worked
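For orientation, here is a sketch of what lora_config.yaml can look like. Field names follow mlx-lm's published example config, but the kit's templates/ copy is the source of truth, and every value below is an illustrative assumption:

```yaml
# Illustrative only — check the kit's lora_config.yaml template.
model: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"  # any MLX-format model
train: true
data: "data"              # directory containing train.jsonl / valid.jsonl
batch_size: 4
iters: 600                # a few hundred iterations is often enough for LoRA
learning_rate: 1e-5
num_layers: 16            # how many layers get LoRA adapters
adapter_path: "adapters"  # where trained adapter weights are saved
```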

When not to fine-tune

Honest reminder: most "I want a custom AI" use cases are better served by better prompting, RAG, or few-shot examples in the prompt. Fine-tune when you need a consistent output format prompting can't reliably produce, you're running cheaply at scale, you need offline/private operation, or you want a small specialized model that beats a big general one on a narrow task.