train — LoRA Fine-Tuning & TTS Training
Train a LoRA adapter for image generation models (FLUX.2-klein, SDXL, Qwen) from image + caption datasets, or train a Piper TTS voice model from an audio dataset.
Two Training Modes
The train command has two completely separate modes with different parameters, backends, and outputs:
| Mode | Backend | Dataset Format | Output | See Section |
|---|---|---|---|---|
| Image (LoRA) | Diffusers + PEFT | image.jpg + image.txt | .safetensors | LoRA Fine-Tuning |
| Audio (TTS) | Piper (VITS) | metadata.csv + wavs/ | .ckpt → .onnx | TTS Training |
The mode is auto-detected from --family / --backend flags or the dataset structure:
- --family flux / --family sdxl / --family qwen → Image (LoRA)
- --backend piper / --backend coqui / --backend f5-tts → Audio (TTS)
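The dataset-side detection can be sketched in a few lines. This is an illustration only; the function name and the exact precedence rules are assumptions, not the tool's actual implementation:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def detect_mode(dataset_dir: str) -> str:
    """Guess the training mode from the dataset layout:
    a metadata.csv marks a Piper/LJSpeech-style TTS dataset,
    while image files mark an image + caption LoRA dataset."""
    root = Path(dataset_dir)
    if (root / "metadata.csv").exists():
        return "audio"
    if any(p.suffix.lower() in IMAGE_EXTS for p in root.iterdir()):
        return "image"
    raise ValueError(f"cannot detect training mode in {dataset_dir}")
```

Explicit `--family` / `--backend` flags always take precedence over this guess.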
Quick Reference
# Auto-detect from dataset: images → LoRA, audio csv → TTS
datasety train --input ./dataset --output result
# Force TTS mode (audio parameters)
datasety train --input ./tts_dataset --output voice.ckpt --backend piper --model kontextox/piper-base-us --steps 500
# Force LoRA mode (image parameters)
datasety train --input ./images --output lora.safetensors --family flux --steps 500 --lr 1e-4 --lora-rank 16
Image vs Audio Parameters
Important: Parameters are mode-specific. Image parameters (like --lora-rank) do not apply to TTS training, and audio parameters (like --sample-rate) do not apply to LoRA training.
Image (LoRA) Parameters Only
These parameters are used when --family flux, --family sdxl, or --family qwen is set, or when the dataset contains image files:
| Option | Description | Default |
|---|---|---|
--family | Model family: flux, sdxl, qwen | auto-detected |
--model | HuggingFace repo ID (base model) | black-forest-labs/FLUX.2-klein-base-4B |
--output | Output .safetensors path | lora.safetensors |
--steps | Number of training steps | 100 |
--lr | Learning rate | 1e-4 |
--lora-rank | LoRA rank (higher = more capacity, larger file) | 16 |
--lora-alpha | LoRA alpha (controls effective learning rate scale) | 16.0 |
--lora-dropout | LoRA dropout rate | 0.0 |
--image-size | Training resolution (square crop, must be divisible by 32 for Qwen) | 512 |
--device | Device: auto, cpu, cuda, mps | auto |
--seed | Random seed | 42 |
--save-every | Save checkpoint every N steps | end only |
--resume | Resume from a .safetensors checkpoint | |
--validation-split | Fraction of dataset for validation (0.0–0.5) | |
--timestep-type | Timestep sampling: sigmoid, lognorm, linear | sigmoid |
--caption-dropout | Probability of dropping caption (unconditional training) | 0.05 |
--gradient-checkpointing | Enable gradient checkpointing (saves VRAM) | off |
--optimizer | Optimizer: adamw or adamw8bit (requires bitsandbytes) | adamw |
--lr-scheduler | LR schedule: constant, cosine, linear | constant |
--lr-warmup-steps | Linear warmup steps before target LR | 0 |
--gradient-accumulation-steps | Accumulate gradients over N steps | 1 |
--min-snr-gamma | Min-SNR-γ loss weighting for SDXL (recommended: 5.0) | disabled |
--noise-offset | Per-channel noise offset for SDXL (recommended: 0.05–0.1) | 0.0 |
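The curve selected by --lr-scheduler cosine with --lr-warmup-steps is the standard linear-warmup-then-cosine-decay schedule. A minimal sketch of what it computes (illustrative, not the tool's internal code; decay to zero at the final step is an assumption):

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_steps: int = 0) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear ramp up
    # cosine decay from base_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With --lr 1e-4 and --lr-warmup-steps 50, the LR ramps up over the first 50 steps, peaks at 1e-4, and decays smoothly for the rest of the run.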
Audio (TTS) Parameters Only
These parameters are used when --backend piper (or coqui, f5-tts) is set, or when the dataset contains metadata.csv:
| Option | Description | Default |
|---|---|---|
--backend | TTS backend: piper (coqui, f5-tts planned) | piper |
--model | Piper base model (HF repo ID or local path) | (required) |
--output | Output directory for .ckpt checkpoints | (required) |
--steps | Number of training epochs | 100 |
--sample-rate | Audio sample rate in Hz | 22050 |
--batch-size | Training batch size | 32 |
--accelerator | PyTorch Lightning accelerator: auto, gpu, cpu | auto |
--devices | Number of GPUs: auto, 1, 2, -1 (all GPUs) | auto |
--test-text | Background inference text to test checkpoints as they drop | |
--seed | Random seed | 42 |
Shared Parameters (Both Modes)
| Option | Description | Default |
|---|---|---|
--input, -i | Dataset directory | required |
--steps | Training steps (image) or epochs (audio) | 100 |
LoRA Fine-Tuning
Train a LoRA adapter for image generation models from a local dataset of image + caption pairs.
Supported model families: FLUX.2-klein (flow-matching), SDXL (DDPM), and Qwen (flow-matching, image-editing).
datasety train --input ./dataset --output lora.safetensors
Dataset Format
The input directory must contain image files alongside matching .txt caption files:
dataset/
001.jpg
001.txt ← "ohwx person wearing a red jacket"
002.png
002.txt ← "ohwx person smiling outdoors"
...
Images are center-cropped to a square and resized to --image-size (default 512 px). Use datasety resize, datasety caption, and the other preparation commands to build the dataset before training.
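Before a long run it can be worth sanity-checking that every image has a matching caption. A small standalone sketch (not part of datasety) that lists images missing a .txt file:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def check_pairs(dataset_dir: str) -> list:
    """Return the names of image files that have no matching .txt caption."""
    root = Path(dataset_dir)
    missing = []
    for img in sorted(root.iterdir()):
        if img.suffix.lower() in IMAGE_EXTS and not img.with_suffix(".txt").exists():
            missing.append(img.name)
    return missing
```

An empty result means every image is paired and the dataset is ready to train on.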
Base vs Distilled Models
Always use the base (undistilled) model for LoRA training.
| Model | Type | Use for |
|---|---|---|
black-forest-labs/FLUX.2-klein-4B | Step-distilled (4–8 steps) | Inference only |
black-forest-labs/FLUX.2-klein-9B | Step-distilled (4–8 steps) | Inference only |
black-forest-labs/FLUX.2-klein-base-4B | Base (undistilled) | LoRA training ✓ |
black-forest-labs/FLUX.2-klein-base-9B | Base (undistilled) | LoRA training ✓ |
The tool will print a warning if you pass a distilled model.
Options
| Option | Description | Default |
|---|---|---|
--input, -i | Dataset directory (images + .txt captions) | required |
--output, -o | Output LoRA .safetensors path | lora.safetensors |
--model, -m | HuggingFace repo ID (base model) | black-forest-labs/FLUX.2-klein-base-4B |
--family | Model family: flux, sdxl, qwen | auto-detected |
--steps | Number of training steps | 100 |
--lr | Learning rate | 1e-4 |
--lora-rank | LoRA rank | 16 |
--lora-alpha | LoRA alpha | 16.0 |
--lora-dropout | LoRA dropout rate | 0.0 |
--image-size | Training resolution (square crop) | 512 |
--device | auto, cpu, cuda, mps | auto |
--seed | Random seed | 42 |
--save-every | Save checkpoint every N steps | end only |
--resume | Resume from a LoRA checkpoint (.safetensors) | |
--validation-split | Fraction of dataset for validation (0.0-0.5) | |
--timestep-type | Timestep sampling: sigmoid, lognorm, linear | sigmoid |
--caption-dropout | Probability of dropping caption (unconditional) | 0.05 |
--gradient-checkpointing | Enable gradient checkpointing (saves VRAM) | off |
--optimizer | adamw or adamw8bit (requires bitsandbytes) | adamw |
--lr-scheduler | LR schedule: constant, cosine, linear | constant |
--lr-warmup-steps | Linear warmup steps before target LR | 0 |
--gradient-accumulation-steps | Accumulate gradients over N steps | 1 |
--min-snr-gamma | Min-SNR-γ loss weighting for SDXL (recommended: 5.0) | disabled |
--noise-offset | Per-channel noise offset for SDXL (recommended: 0.05–0.1) | 0.0 |
Training Best Practices
The following defaults reflect ai-toolkit recommendations for fast, stable LoRA convergence:
| Technique | Option | Recommended value | Effect |
|---|---|---|---|
| Sigmoid timestep sampling | --timestep-type sigmoid | default | Biases toward mid-timesteps where most learning happens (vs uniform) |
| Caption dropout | --caption-dropout 0.05 | default | 5% unconditional steps, improves CFG adherence |
| 8-bit optimizer | --optimizer adamw8bit | opt-in | Halves optimizer state memory; requires bitsandbytes |
| Cosine LR decay | --lr-scheduler cosine | opt-in | Prevents late-stage oscillation |
| Warmup | --lr-warmup-steps 50 | opt-in | Avoids early large gradient steps |
| Gradient accumulation | --gradient-accumulation-steps 4 | opt-in | Simulates larger batch size without extra VRAM |
| Gradient checkpointing | --gradient-checkpointing | opt-in | Reduces VRAM ~30% at ~20% speed cost |
| Min-SNR-γ (SDXL only) | --min-snr-gamma 5.0 | opt-in | Stabilises DDPM loss across timesteps |
| Noise offset (SDXL only) | --noise-offset 0.05 | opt-in | Improves dark/bright image generation |
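The Min-SNR-γ row above corresponds to the standard Min-SNR weighting for ε-prediction DDPM training: each timestep's loss is scaled by min(SNR, γ)/SNR, which caps the influence of low-noise timesteps with very large SNR. A one-line sketch (assuming ε-prediction; the tool's exact weighting may differ):

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Min-SNR-gamma loss weight: clamps per-timestep SNR so that
    low-noise timesteps (huge SNR) no longer dominate the loss."""
    return min(snr, gamma) / snr
```

With γ = 5.0, a timestep with SNR 100 gets weight 0.05 while a timestep with SNR 2 keeps weight 1.0, flattening the loss landscape across timesteps.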
Examples
FLUX.2-klein LoRA with all best practices
datasety train \
--input ./dataset \
--output ./lora/flux_lora.safetensors \
--model black-forest-labs/FLUX.2-klein-base-4B \
--steps 500 \
--lr 1e-4 \
--lora-rank 16 \
--timestep-type sigmoid \
--caption-dropout 0.05 \
--optimizer adamw8bit \
--lr-scheduler cosine \
--lr-warmup-steps 50 \
--gradient-accumulation-steps 2 \
--gradient-checkpointing
FLUX.2-klein LoRA (recommended)
Prepare a dataset first, then train:
# 1. Prepare dataset
datasety resize -i ./raw -o ./dataset -r 512x512
datasety caption -i ./dataset -o ./dataset --trigger-word "ohwx person,"
# 2. Train LoRA on FLUX.2-klein-base-4B (~8 GB VRAM)
datasety train \
--input ./dataset \
--output ./lora/flux_lora.safetensors \
--model black-forest-labs/FLUX.2-klein-base-4B \
--steps 500 \
--lr 1e-4 \
--lora-rank 16
# 3. Use the trained LoRA with synthetic editing
datasety synthetic \
--input-image photo.jpg \
--output-image result.png \
--prompt "ohwx person wearing sunglasses" \
--lora ./lora/flux_lora.safetensors:0.8
SDXL LoRA
datasety train \
--input ./dataset \
--output sdxl_lora.safetensors \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--family sdxl \
--steps 500 \
--lr 1e-4 \
--lora-rank 16 \
--image-size 1024
Qwen Image-Edit LoRA (~30 GB VRAM)
Train a LoRA for Qwen/Qwen-Image-Edit-2511 (or 2509). The dataset needs only image + caption pairs — the same image is used as both source and target (reconstruction training), which teaches the model to preserve identity/style. Image size must be a multiple of 32.
datasety train \
--input ./dataset \
--output qwen_lora.safetensors \
--model Qwen/Qwen-Image-Edit-2511 \
--steps 500 \
--lr 5e-5 \
--lora-rank 16 \
--image-size 512
The resulting LoRA loads directly with --lora in datasety synthetic:
datasety synthetic \
--input-image photo.jpg --output-image edited.png \
--model Qwen/Qwen-Image-Edit-2511 \
--lora qwen_lora.safetensors:0.8 \
--prompt "ohwx person wearing a winter hat"Quick test run (20 steps)
Verify the training loop works before a full run:
datasety train \
--input ./dataset \
--output test_lora.safetensors \
--steps 20 \
--save-every 10
Resume from checkpoint
datasety train \
--input ./dataset \
--output lora.safetensors \
--resume lora_step200.safetensors \
--steps 500
Training with validation
datasety train \
--input ./dataset \
--output lora.safetensors \
--steps 500 \
--validation-split 0.1   # 10% of images held out for validation loss
Save checkpoints during training
datasety train \
--input ./dataset \
--output lora.safetensors \
--steps 1000 \
--save-every 200   # saves lora_step200.safetensors, lora_step400.safetensors, ...
VRAM Requirements
| Model | VRAM | Notes |
|---|---|---|
| FLUX.2-klein-base-4B | ~8 GB | Default, auto CPU-offload if needed |
| FLUX.2-klein-base-9B | ~18 GB | Higher quality |
| SDXL | ~7 GB | Good for object/style LoRAs |
| Qwen/Qwen-Image-Edit-2511 | ~30 GB | Image-editing LoRA, flow-matching |
CPU offload is applied automatically when free VRAM is below the required amount.
LoRA Parameters Guide
| Parameter | Recommended range | Effect |
|---|---|---|
--lora-rank | 4–64 | Higher = more capacity, larger file |
--lora-alpha | Equal to rank (default) | Controls effective learning rate scale |
--steps | 100–2000 | More steps = more fitting (risk of overfitting) |
--lr | 1e-5 – 1e-3 | Too high causes divergence; too low is slow |
--image-size | 512 or 1024 | Match your target inference resolution |
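A useful mental model for the rank/alpha relationship in the table: PEFT scales the low-rank update by alpha/rank, so keeping --lora-alpha equal to --lora-rank gives a scale of 1.0, and raising alpha without raising rank effectively raises the learning rate. Sketch:

```python
def lora_scale(alpha: float, rank: int) -> float:
    """PEFT-style scaling factor applied to the low-rank update:
    W' = W + (alpha / rank) * (B @ A)."""
    return alpha / rank
```

This is why the default of rank 16 with alpha 16.0 behaves neutrally, while alpha 32.0 at rank 16 doubles the effective update magnitude.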
Output
The trained LoRA is saved as a .safetensors file in diffusers-compatible format — keys use the transformer. prefix for FLUX and Qwen models and unet. for SDXL, so the file loads directly with pipeline.load_lora_weights() and with --lora in datasety commands:
datasety synthetic -i ./images -o ./output \
--prompt "ohwx person in a park" \
--lora lora.safetensors:0.8
The LoRA weight (:0.8) controls blend strength; typical values are 0.6–1.0.
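The path:weight syntax can be handled with a simple split. A hypothetical sketch of such a parser (the fallback weight of 1.0 when :weight is omitted is an assumption, not documented behaviour):

```python
def parse_lora_spec(spec: str):
    """Split a --lora argument of the form 'path.safetensors[:weight]'
    into (path, weight), defaulting the weight to 1.0."""
    path, sep, weight = spec.rpartition(":")
    if sep:
        try:
            return path, float(weight)
        except ValueError:
            pass  # the colon was part of the path, not a weight
    return spec, 1.0
```

Note that rpartition splits on the last colon, so drive-letter paths like C:\lora.safetensors still parse as a bare path.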
Target modules
| Model | Trainable modules | File size (rank 16) |
|---|---|---|
| FLUX.2-klein-base-4B | to_q, to_k, to_v, to_out.0 | ~38 MB |
| FLUX.2-klein-base-9B | to_q, to_k, to_v, to_qkv_mlp_proj | ~77 MB |
| SDXL | to_q, to_k, to_v, to_out.0 | ~25 MB |
| Qwen/Qwen-Image-Edit-2511 | to_q, to_k, to_v, to_out.0, add_q_proj, add_k_proj, add_v_proj, to_add_out | ~45 MB |
The 9B FLUX model uses fused to_qkv_mlp_proj projections (single-transformer blocks). Qwen targets both image-stream and text-stream attention projections across 60 transformer blocks.
TTS Training (Piper)
Train a Piper TTS model from a dataset produced by datasety audio. Outputs a .ckpt checkpoint that can be exported to .onnx for inference.
Dataset Format
The input directory must contain a Piper/LJSpeech-compatible dataset:
tts_dataset/
├── wavs/
│ ├── utt_0001.wav
│ ├── utt_0002.wav
│ └── ...
└── metadata.csv
The metadata.csv uses the pipe-delimited LJSpeech format:
utt_0001.wav|Hello world, this is a test.
utt_0002.wav|How are you doing today?
TTS-Specific Options
| Option | Description | Default |
|---|---|---|
--backend | TTS backend: piper (coqui, f5-tts planned) | piper |
--model | Piper base model (HF repo ID or local path) | (required) |
--sample-rate | Audio sample rate in Hz | 22050 |
--batch-size | Training batch size | 32 |
--accelerator | PyTorch Lightning accelerator: auto, gpu, cpu | auto |
--devices | Number of GPUs: auto, 1, 2, -1 (all GPUs) | auto |
--test-text | Background inference text to test checkpoints as they drop |
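Before launching a long Piper run, the dataset described under Dataset Format can be sanity-checked in a few lines. A standalone sketch (not part of datasety) that flags rows with missing transcripts or missing audio:

```python
import csv
from pathlib import Path

def validate_metadata(dataset_dir: str) -> list:
    """Check an LJSpeech-style metadata.csv: every pipe-delimited row
    must carry a transcript and name a wav file that exists in wavs/."""
    root = Path(dataset_dir)
    problems = []
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter="|"), start=1):
            if len(row) < 2 or not row[1].strip():
                problems.append(f"line {lineno}: missing transcript")
            elif not (root / "wavs" / row[0]).exists():
                problems.append(f"line {lineno}: missing audio {row[0]}")
    return problems
```

An empty result means the dataset is structurally sound; datasety audio produces this layout directly.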
Piper Auto-Installer
On first run, datasety train automatically:
- Clones the kontextox/piper1-gpl repository to ~/.cache/datasety/
- Compiles the monotonic_align Cython extension
- Installs the Piper Python package and dependencies
No manual compilation required.
Multi-GPU Training
For dual-GPU setups (e.g., 2× L40S), PyTorch Lightning automatically enables Distributed Data Parallel (DDP):
datasety train \
--input ./tts_dataset \
--output ./voice_model \
--backend piper \
--model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
--steps 1000 \
--batch-size 32 \
--accelerator gpu \
--devices 2
PyTorch Lightning auto-detects and utilizes all available GPUs without extra configuration.
Background Voice Watcher
Pass --test-text to spin up a background daemon that watches for new .ckpt files, exports them to .onnx, and renders a .wav file using your test text. Listen to the model learning in real-time while the GPU keeps training:
datasety train \
--input ./tts_dataset \
--output ./voice_model \
--backend piper \
--model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
--steps 1000 \
--test-text "Hello, this is a test of my new voice."Examples
Basic TTS Training
datasety train \
--input ./tts_dataset \
--output ./voice_model \
--backend piper \
--model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
--steps 500
Multi-GPU Training with Voice Watcher
datasety train \
--input ./tts_dataset \
--output ./voice_model \
--backend piper \
--model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
--steps 1000 \
--batch-size 32 \
--accelerator gpu \
--devices 2 \
--test-text "The quick brown fox jumps over the lazy dog."Resume Training
datasety train \
--input ./tts_dataset \
--output ./voice_model \
--backend piper \
--model "rhasspy/piper-checkpoints:en/en_US/kristin/medium" \
--steps 2000
The trainer automatically resumes from the last checkpoint if one exists in the output directory.
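Picking "the last checkpoint" can be pictured as selecting the .ckpt with the highest step number in the output directory. A sketch of that selection (checkpoint naming varies between Piper/Lightning versions, so the embedded-number heuristic here is an assumption):

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the .ckpt whose filename embeds the highest number
    (e.g. epoch=199-step=2000.ckpt), or None if no checkpoints exist."""
    def key(p: Path) -> int:
        nums = re.findall(r"\d+", p.stem)
        return int(nums[-1]) if nums else -1  # last number = step count
    ckpts = sorted(Path(output_dir).glob("*.ckpt"), key=key)
    return ckpts[-1] if ckpts else None
```

Deleting the checkpoints (or pointing --output at a fresh directory) starts training from the base model instead.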