
train — LoRA Fine-Tuning & TTS Training

Train a LoRA adapter for an image generation model (FLUX.2-klein, SDXL, Qwen) from an image/caption dataset, or train a Piper TTS voice model from an audio dataset.

Two Training Modes

The train command has two completely separate modes with different parameters, backends, and outputs:

| Mode | Backend | Dataset format | Output | See section |
| --- | --- | --- | --- | --- |
| Image (LoRA) | Diffusers + PEFT | image.jpg + image.txt | .safetensors | LoRA Fine-Tuning |
| Audio (TTS) | Piper (VITS) | metadata.csv + wavs/ | .ckpt (exportable to .onnx) | TTS Training |

The mode is auto-detected from --family / --backend flags or the dataset structure:

  • --family flux | --family sdxl | --family qwen → Image (LoRA)
  • --backend piper | --backend coqui | --backend f5-tts → Audio (TTS)
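The detection described above amounts to a small dispatch on flags and dataset layout. A sketch of the dataset half (illustrative only, not the tool's actual code; `detect_mode` is a hypothetical name):

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def detect_mode(dataset: str) -> str:
    """Guess the training mode from the dataset layout.

    An LJSpeech-style metadata.csv means TTS; image files mean LoRA.
    """
    root = Path(dataset)
    if (root / "metadata.csv").exists():
        return "audio"
    if any(p.suffix.lower() in IMAGE_EXTS for p in root.iterdir()):
        return "image"
    raise ValueError(f"cannot detect training mode for {dataset}")
```

Explicit `--family` / `--backend` flags would override this guess.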

Quick Reference

```bash
# Auto-detect from dataset: images → LoRA, audio CSV → TTS
datasety train --input ./dataset --output result

# Force TTS mode (audio parameters)
datasety train --input ./tts_dataset --output voice.ckpt --backend piper --model kontextox/piper-base-us --steps 500

# Force LoRA mode (image parameters)
datasety train --input ./images --output lora.safetensors --family flux --steps 500 --lr 1e-4 --lora-rank 16
```

Image vs Audio Parameters

Important: Parameters are mode-specific. Image parameters (like --lora-rank) do not apply to TTS training, and audio parameters (like --sample-rate) do not apply to LoRA training.

Image (LoRA) Parameters Only

These parameters are used when --family flux, --family sdxl, or --family qwen is set, or when the dataset contains image files:

| Option | Description | Default |
| --- | --- | --- |
| --family | Model family: flux, sdxl, qwen | auto-detected |
| --model | HuggingFace repo ID (base model) | black-forest-labs/FLUX.2-klein-base-4B |
| --output | Output .safetensors path | lora.safetensors |
| --steps | Number of training steps | 100 |
| --lr | Learning rate | 1e-4 |
| --lora-rank | LoRA rank (higher = more capacity, larger file) | 16 |
| --lora-alpha | LoRA alpha (controls effective learning rate scale) | 16.0 |
| --lora-dropout | LoRA dropout rate | 0.0 |
| --image-size | Training resolution (square crop; must be divisible by 32 for Qwen) | 512 |
| --device | Device: auto, cpu, cuda, mps | auto |
| --seed | Random seed | 42 |
| --save-every | Save checkpoint every N steps | end only |
| --resume | Resume from a .safetensors checkpoint | (none) |
| --validation-split | Fraction of dataset for validation (0.0–0.5) | (none) |
| --timestep-type | Timestep sampling: sigmoid, lognorm, linear | sigmoid |
| --caption-dropout | Probability of dropping the caption (unconditional training) | 0.05 |
| --gradient-checkpointing | Enable gradient checkpointing (saves VRAM) | off |
| --optimizer | Optimizer: adamw or adamw8bit (requires bitsandbytes) | adamw |
| --lr-scheduler | LR schedule: constant, cosine, linear | constant |
| --lr-warmup-steps | Linear warmup steps before target LR | 0 |
| --gradient-accumulation-steps | Accumulate gradients over N steps | 1 |
| --min-snr-gamma | Min-SNR-γ loss weighting for SDXL (recommended: 5.0) | disabled |
| --noise-offset | Per-channel noise offset for SDXL (recommended: 0.05–0.1) | 0.0 |

Audio (TTS) Parameters Only

These parameters are used when --backend piper (or coqui, f5-tts) is set, or when the dataset contains metadata.csv:

| Option | Description | Default |
| --- | --- | --- |
| --backend | TTS backend: piper (coqui, f5-tts planned) | piper |
| --model | Piper base model (HF repo ID or local path) | (required) |
| --output | Output directory for .ckpt checkpoints | (required) |
| --steps | Number of training epochs | 100 |
| --sample-rate | Audio sample rate in Hz | 22050 |
| --batch-size | Training batch size | 32 |
| --accelerator | PyTorch Lightning accelerator: auto, gpu, cpu | auto |
| --devices | Number of GPUs: auto, 1, 2, -1 (all GPUs) | auto |
| --test-text | Text for background inference on each new checkpoint | (none) |
| --seed | Random seed | 42 |

Shared Parameters (Both Modes)

| Option | Description | Default |
| --- | --- | --- |
| --input, -i | Dataset directory | required |
| --steps | Training steps (image) or epochs (audio) | 100 |

LoRA Fine-Tuning

Train a LoRA adapter for image generation models from a local dataset of image + caption pairs.

Supported model families: FLUX.2-klein (flow-matching), SDXL (DDPM), and Qwen (flow-matching, image-editing).

```bash
datasety train --input ./dataset --output lora.safetensors
```

Dataset Format

The input directory must contain image files alongside matching .txt caption files:

```
dataset/
  001.jpg
  001.txt     ← "ohwx person wearing a red jacket"
  002.png
  002.txt     ← "ohwx person smiling outdoors"
  ...
```

Images are center-cropped to a square and resized to --image-size (default 512 px). Use datasety resize, datasety caption, and the other preparation commands to build the dataset before training.
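A loader for this layout might look like the following (a sketch under the assumption that every image has a same-stem .txt caption; `load_pairs` is an illustrative name, not the tool's API):

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def load_pairs(dataset_dir: str) -> list:
    """Collect (image_path, caption) pairs; every image needs a matching .txt."""
    pairs = []
    for img in sorted(Path(dataset_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        txt = img.with_suffix(".txt")
        if not txt.exists():
            raise FileNotFoundError(f"missing caption for {img.name}")
        pairs.append((img, txt.read_text(encoding="utf-8").strip()))
    return pairs
```

Failing fast on a missing caption is usually preferable to silently training on fewer pairs.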

Base vs Distilled Models

Always use the base (undistilled) model for LoRA training.

| Model | Type | Use for |
| --- | --- | --- |
| black-forest-labs/FLUX.2-klein-4B | Step-distilled (4–8 steps) | Inference only |
| black-forest-labs/FLUX.2-klein-9B | Step-distilled (4–8 steps) | Inference only |
| black-forest-labs/FLUX.2-klein-base-4B | Base (undistilled) | LoRA training |
| black-forest-labs/FLUX.2-klein-base-9B | Base (undistilled) | LoRA training |

The tool will print a warning if you pass a distilled model.

Options

| Option | Description | Default |
| --- | --- | --- |
| --input, -i | Dataset directory (images + .txt captions) | required |
| --output, -o | Output LoRA .safetensors path | lora.safetensors |
| --model, -m | HuggingFace repo ID (base model) | black-forest-labs/FLUX.2-klein-base-4B |
| --family | Model family: flux, sdxl, qwen | auto-detected |
| --steps | Number of training steps | 100 |
| --lr | Learning rate | 1e-4 |
| --lora-rank | LoRA rank | 16 |
| --lora-alpha | LoRA alpha | 16.0 |
| --lora-dropout | LoRA dropout rate | 0.0 |
| --image-size | Training resolution (square crop) | 512 |
| --device | auto, cpu, cuda, mps | auto |
| --seed | Random seed | 42 |
| --save-every | Save checkpoint every N steps | end only |
| --resume | Resume from a LoRA checkpoint (.safetensors) | (none) |
| --validation-split | Fraction of dataset for validation (0.0–0.5) | (none) |
| --timestep-type | Timestep sampling: sigmoid, lognorm, linear | sigmoid |
| --caption-dropout | Probability of dropping the caption (unconditional) | 0.05 |
| --gradient-checkpointing | Enable gradient checkpointing (saves VRAM) | off |
| --optimizer | adamw or adamw8bit (requires bitsandbytes) | adamw |
| --lr-scheduler | LR schedule: constant, cosine, linear | constant |
| --lr-warmup-steps | Linear warmup steps before target LR | 0 |
| --gradient-accumulation-steps | Accumulate gradients over N steps | 1 |
| --min-snr-gamma | Min-SNR-γ loss weighting for SDXL (recommended: 5.0) | disabled |
| --noise-offset | Per-channel noise offset for SDXL (recommended: 0.05–0.1) | 0.0 |

Training Best Practices

The following defaults and opt-in settings reflect ai-toolkit recommendations for fast, stable LoRA convergence:

| Technique | Option | Status | Effect |
| --- | --- | --- | --- |
| Sigmoid timestep sampling | --timestep-type sigmoid | default | Biases sampling toward mid-timesteps, where most learning happens (vs. uniform) |
| Caption dropout | --caption-dropout 0.05 | default | 5% unconditional steps; improves CFG adherence |
| 8-bit optimizer | --optimizer adamw8bit | opt-in | Halves optimizer state memory; requires bitsandbytes |
| Cosine LR decay | --lr-scheduler cosine | opt-in | Prevents late-stage oscillation |
| Warmup | --lr-warmup-steps 50 | opt-in | Avoids large gradient steps early in training |
| Gradient accumulation | --gradient-accumulation-steps 4 | opt-in | Simulates a larger batch size without extra VRAM |
| Gradient checkpointing | --gradient-checkpointing | opt-in | Reduces VRAM ~30% at ~20% speed cost |
| Min-SNR-γ (SDXL only) | --min-snr-gamma 5.0 | opt-in | Stabilises DDPM loss across timesteps |
| Noise offset (SDXL only) | --noise-offset 0.05 | opt-in | Improves dark/bright image generation |
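Two of the defaults above are easy to picture in code. A minimal sketch of sigmoid timestep sampling and caption dropout (illustrative only; the real trainer operates on batched tensors):

```python
import math
import random

def sample_timestep(timestep_type: str = "sigmoid") -> float:
    """Draw a normalized timestep t in (0, 1).

    "sigmoid" squashes a standard-normal sample through a logistic function,
    concentrating mass around t = 0.5 (the mid-timesteps); "linear" is uniform.
    """
    if timestep_type == "sigmoid":
        return 1.0 / (1.0 + math.exp(-random.gauss(0.0, 1.0)))
    if timestep_type == "linear":
        return random.random()
    raise ValueError(f"unknown timestep type: {timestep_type}")

def maybe_drop_caption(caption: str, p: float = 0.05) -> str:
    """With probability p, train this step unconditionally (empty caption)."""
    return "" if random.random() < p else caption
```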

Examples

FLUX.2-klein LoRA with all best practices

```bash
datasety train \
    --input ./dataset \
    --output ./lora/flux_lora.safetensors \
    --model black-forest-labs/FLUX.2-klein-base-4B \
    --steps 500 \
    --lr 1e-4 \
    --lora-rank 16 \
    --timestep-type sigmoid \
    --caption-dropout 0.05 \
    --optimizer adamw8bit \
    --lr-scheduler cosine \
    --lr-warmup-steps 50 \
    --gradient-accumulation-steps 2 \
    --gradient-checkpointing
```

Prepare a dataset first, then train:

```bash
# 1. Prepare dataset
datasety resize -i ./raw -o ./dataset -r 512x512
datasety caption -i ./dataset -o ./dataset --trigger-word "ohwx person,"

# 2. Train LoRA on FLUX.2-klein-base-4B (~8 GB VRAM)
datasety train \
    --input ./dataset \
    --output ./lora/flux_lora.safetensors \
    --model black-forest-labs/FLUX.2-klein-base-4B \
    --steps 500 \
    --lr 1e-4 \
    --lora-rank 16

# 3. Use the trained LoRA with synthetic editing
datasety synthetic \
    --input-image photo.jpg \
    --output-image result.png \
    --prompt "ohwx person wearing sunglasses" \
    --lora ./lora/flux_lora.safetensors:0.8
```

SDXL LoRA

```bash
datasety train \
    --input ./dataset \
    --output sdxl_lora.safetensors \
    --model stabilityai/stable-diffusion-xl-base-1.0 \
    --family sdxl \
    --steps 500 \
    --lr 1e-4 \
    --lora-rank 16 \
    --image-size 1024
```

Qwen Image-Edit LoRA (~30 GB VRAM)

Train a LoRA for Qwen/Qwen-Image-Edit-2511 (or 2509). The dataset needs only image + caption pairs — the same image is used as both source and target (reconstruction training), which teaches the model to preserve identity/style. Image size must be a multiple of 32.

```bash
datasety train \
    --input ./dataset \
    --output qwen_lora.safetensors \
    --model Qwen/Qwen-Image-Edit-2511 \
    --steps 500 \
    --lr 5e-5 \
    --lora-rank 16 \
    --image-size 512
```
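The multiple-of-32 requirement mentioned above can be enforced with a small helper (illustrative; not part of the CLI):

```python
def snap_to_multiple(size: int, multiple: int = 32) -> int:
    """Round an image size down to the nearest multiple (Qwen requires /32)."""
    if size < multiple:
        raise ValueError(f"size {size} is smaller than {multiple}")
    return (size // multiple) * multiple
```

For example, 512 passes unchanged, while an odd size like 500 would need to be snapped down to 480.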

The resulting LoRA loads directly with --lora in datasety synthetic:

```bash
datasety synthetic \
    --input-image photo.jpg --output-image edited.png \
    --model Qwen/Qwen-Image-Edit-2511 \
    --lora qwen_lora.safetensors:0.8 \
    --prompt "ohwx person wearing a winter hat"
```

Quick test run (20 steps)

Verify the training loop works before a full run:

```bash
datasety train \
    --input ./dataset \
    --output test_lora.safetensors \
    --steps 20 \
    --save-every 10
```

Resume from checkpoint

```bash
datasety train \
    --input ./dataset \
    --output lora.safetensors \
    --resume lora_step200.safetensors \
    --steps 500
```

Training with validation

```bash
datasety train \
    --input ./dataset \
    --output lora.safetensors \
    --steps 500 \
    --validation-split 0.1    # 10% of images held out for validation loss
```

Save checkpoints during training

```bash
datasety train \
    --input ./dataset \
    --output lora.safetensors \
    --steps 1000 \
    --save-every 200    # saves lora_step200.safetensors, lora_step400.safetensors, ...
```

VRAM Requirements

| Model | VRAM | Notes |
| --- | --- | --- |
| FLUX.2-klein-base-4B | ~8 GB | Default; auto CPU offload if needed |
| FLUX.2-klein-base-9B | ~18 GB | Higher quality |
| SDXL | ~7 GB | Good for object/style LoRAs |
| Qwen/Qwen-Image-Edit-2511 | ~30 GB | Image-editing LoRA, flow-matching |

CPU offload is applied automatically when free VRAM is below the required amount.
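The offload decision amounts to a threshold check against the table above. A sketch (illustrative; in a real trainer the free-VRAM figure would come from something like `torch.cuda.mem_get_info()`, and the exact thresholds are the tool's internals):

```python
# Approximate requirements from the VRAM table above (GiB).
VRAM_REQUIRED_GIB = {
    "black-forest-labs/FLUX.2-klein-base-4B": 8,
    "black-forest-labs/FLUX.2-klein-base-9B": 18,
    "stabilityai/stable-diffusion-xl-base-1.0": 7,
    "Qwen/Qwen-Image-Edit-2511": 30,
}

def needs_cpu_offload(model_id: str, free_vram_bytes: int) -> bool:
    """True when free VRAM falls short of the model's approximate requirement."""
    required = VRAM_REQUIRED_GIB.get(model_id, 8) * 1024 ** 3
    return free_vram_bytes < required
```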

LoRA Parameters Guide

| Parameter | Recommended range | Effect |
| --- | --- | --- |
| --lora-rank | 4–64 | Higher = more capacity, larger file |
| --lora-alpha | Equal to rank (default) | Controls effective learning rate scale |
| --steps | 100–2000 | More steps = tighter fit (risk of overfitting) |
| --lr | 1e-5 to 1e-3 | Too high causes divergence; too low is slow |
| --image-size | 512 or 1024 | Match your target inference resolution |
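The alpha/rank interaction is worth spelling out: the merged weight is W + (alpha / rank) · B·A, so keeping alpha equal to the rank holds the effective scale at 1.0 no matter which rank you choose. A numpy sketch of the merge:

```python
import numpy as np

def apply_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray, alpha: float) -> np.ndarray:
    """Merged weight: W + (alpha / rank) * (B @ A).

    A has shape (rank, in_features), B has shape (out_features, rank).
    The alpha/rank factor is why alpha is usually set equal to the rank:
    the update's scale then stays constant as rank varies.
    """
    rank = A.shape[0]
    return W + (alpha / rank) * (B @ A)
```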

Output

The trained LoRA is saved as a .safetensors file in diffusers-compatible format — keys use the transformer. prefix for FLUX and Qwen models and unet. for SDXL, so the file loads directly with pipeline.load_lora_weights() and with --lora in datasety commands:

```bash
datasety synthetic -i ./images -o ./output \
    --prompt "ohwx person in a park" \
    --lora lora.safetensors:0.8
```

The LoRA weight (:0.8) controls blend strength; typical values are 0.6–1.0.
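The path:weight syntax can be split with a small helper (an illustrative sketch, not the CLI's actual parser):

```python
def parse_lora_spec(spec: str) -> tuple:
    """Split "path[:weight]" into (path, weight); weight defaults to 1.0.

    rpartition keeps Windows drive letters (e.g. "C:\\loras\\x.safetensors")
    from being mistaken for a weight suffix.
    """
    path, sep, tail = spec.rpartition(":")
    if sep and _is_float(tail):
        return path, float(tail)
    return spec, 1.0

def _is_float(s: str) -> bool:
    try:
        float(s)
        return True
    except ValueError:
        return False
```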

Target modules

| Model | Trainable modules | File size (rank 16) |
| --- | --- | --- |
| FLUX.2-klein-base-4B | to_q, to_k, to_v, to_out.0 | ~38 MB |
| FLUX.2-klein-base-9B | to_q, to_k, to_v, to_qkv_mlp_proj | ~77 MB |
| SDXL | to_q, to_k, to_v, to_out.0 | ~25 MB |
| Qwen/Qwen-Image-Edit-2511 | to_q, to_k, to_v, to_out.0, add_q_proj, add_k_proj, add_v_proj, to_add_out | ~45 MB |

The 9B FLUX model uses fused to_qkv_mlp_proj projections (single-transformer blocks). Qwen targets both image-stream and text-stream attention projections across 60 transformer blocks.
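The file sizes above follow from the LoRA parameter count: each targeted linear layer of shape (in, out) adds rank · (in + out) parameters across its A and B factors. A rough estimator (illustrative; the layer shapes in the test are made up, and real files carry a little extra metadata):

```python
def lora_file_size_mb(layers: list, rank: int = 16, bytes_per_param: int = 2) -> float:
    """Estimate LoRA file size in MiB.

    layers: list of (in_features, out_features) for each targeted linear.
    Each layer contributes rank * (in + out) params (A: rank x in, B: out x rank);
    bytes_per_param=2 assumes fp16/bf16 storage.
    """
    params = sum(rank * (fin + fout) for fin, fout in layers)
    return params * bytes_per_param / 1024 ** 2
```

This is why both a higher rank and a longer target-module list (as with Qwen's dual-stream attention) grow the file roughly linearly.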


TTS Training (Piper)

Train a Piper TTS model from a dataset produced by datasety audio. Outputs a .ckpt checkpoint that can be exported to .onnx for inference.

Dataset Format

The input directory must contain a Piper/LJSpeech-compatible dataset:

```
tts_dataset/
├── wavs/
│   ├── utt_0001.wav
│   ├── utt_0002.wav
│   └── ...
└── metadata.csv
```

metadata.csv holds one pipe-delimited filename|transcript entry per line:

```
utt_0001.wav|Hello world, this is a test.
utt_0002.wav|How are you doing today?
```
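Reading this format takes only a few lines (a sketch; taking the last field as the transcript also tolerates the three-column LJSpeech variant with a normalized-text column):

```python
import csv
from pathlib import Path

def load_metadata(dataset_dir: str) -> list:
    """Parse an LJSpeech-style metadata.csv into (wav_path, transcript) pairs."""
    root = Path(dataset_dir)
    rows = []
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            if not row:
                continue  # skip blank lines
            rows.append((root / "wavs" / row[0], row[-1]))
    return rows
```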

TTS-Specific Options

| Option | Description | Default |
| --- | --- | --- |
| --backend | TTS backend: piper (coqui, f5-tts planned) | piper |
| --model | Piper base model (HF repo ID or local path) | (required) |
| --sample-rate | Audio sample rate in Hz | 22050 |
| --batch-size | Training batch size | 32 |
| --accelerator | PyTorch Lightning accelerator: auto, gpu, cpu | auto |
| --devices | Number of GPUs: auto, 1, 2, -1 (all GPUs) | auto |
| --test-text | Text for background inference on each new checkpoint | (none) |

Piper Auto-Installer

On first run, datasety train automatically:

  1. Clones the kontextox/piper1-gpl repository to ~/.cache/datasety/
  2. Compiles the monotonic_align Cython extension
  3. Installs the Piper Python package and dependencies

No manual compilation required.

Multi-GPU Training

For dual-GPU setups (e.g., 2× L40S), PyTorch Lightning automatically enables Distributed Data Parallel (DDP):

```bash
datasety train \
    --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
    --steps 1000 \
    --batch-size 32 \
    --accelerator gpu \
    --devices 2
```

PyTorch Lightning auto-detects and utilizes all available GPUs without extra configuration.

Background Voice Watcher

Pass --test-text to spin up a background daemon that watches for new .ckpt files, exports each one to .onnx, and renders a .wav file from your test text. You can listen to the model learning in real time while the GPU keeps training:

```bash
datasety train \
    --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
    --steps 1000 \
    --test-text "Hello, this is a test of my new voice."
```

Examples

Basic TTS Training

```bash
datasety train \
    --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
    --steps 500
```

Multi-GPU Training with Voice Watcher

```bash
datasety train \
    --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
    --steps 1000 \
    --batch-size 32 \
    --accelerator gpu \
    --devices 2 \
    --test-text "The quick brown fox jumps over the lazy dog."
```

Resume Training

```bash
datasety train \
    --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
    --steps 2000
```

The trainer automatically resumes from the last checkpoint if one exists in the output directory.
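One way to implement such auto-resume is simply to pick the most recently written checkpoint in the output directory (a sketch; PyTorch Lightning's own checkpoint naming and versioning is more involved than this):

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the newest .ckpt in the output directory, or None if there is none.

    Sorting by modification time sidesteps parsing Lightning's
    "epoch=...-step=..." checkpoint filenames.
    """
    ckpts = sorted(Path(output_dir).glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None
```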

Released under the MIT License.