# Getting Started
## Installation
Install the core package (resize, align, shuffle, degrade):

```bash
pip install datasety
```

Add features as needed:
```bash
pip install datasety[caption]    # Florence-2 captioning
pip install datasety[synthetic]  # Image editing (FLUX, Qwen, SDXL, etc.)
pip install datasety[mask]       # Mask generation (SAM 3, SAM 2, CLIPSeg)
pip install datasety[filter]     # Content filtering (CLIP, NudeNet)
pip install datasety[character]  # Character dataset generation
pip install datasety[audio]      # TTS audio datasets (Whisper transcription)
pip install datasety[train]      # LoRA training (FLUX, SDXL)
pip install datasety[upload]     # Upload to HuggingFace Hub
pip install datasety[workflow]   # YAML/JSON workflow support
pip install datasety[all]        # Everything
```

Verify the installation:

```bash
datasety --version
datasety --help
```

## Quick Start
### Prepare a LoRA Training Dataset
```bash
# 1. Resize images to training resolution
datasety resize -i ./raw -o ./dataset -r 1024x1024

# 2. Generate captions with a trigger word
datasety caption -i ./dataset -o ./dataset --trigger-word "[trigger]"
```

### Use a Vision API for Captions
```bash
export OPENAI_API_KEY=your-key
datasety caption -i ./dataset -o ./dataset --llm-api --model gpt-5-nano
```

### Build a TTS Audio Dataset
```bash
# From a YouTube video
datasety audio --input "https://www.youtube.com/watch?v=..." \
  --output ./tts_dataset --language en --workers 4

# From a directory of audio files
datasety audio --input ./recordings/ --output ./dataset \
  --normalize-numbers --workers 4

# With phoneme map filtering (drops invalid segments automatically)
datasety audio --input ./video.mp4 --output ./dataset \
  --phoneme-map /path/to/piper/config.json

# Time-slicing from a URL
datasety audio --input "https://youtube.com/watch?v=...&start=50&end=90" \
  --output ./dataset
```

### Train a LoRA Adapter (Image Fine-Tuning)
```bash
# 1. Prepare dataset
datasety resize -i ./raw -o ./dataset -r 512x512
datasety caption -i ./dataset -o ./dataset --trigger-word "[trigger]"

# 2. Train LoRA on FLUX.2-klein-base-4B (~8 GB VRAM)
datasety train --input ./dataset \
  --output ./lora/flux_lora.safetensors \
  --model black-forest-labs/FLUX.2-klein-base-4B \
  --steps 500 --lr 1e-4 --lora-rank 16
```

### Train a TTS Voice Model (Audio)
```bash
# Train a Piper TTS model (auto-installs dependencies on first run)
datasety train --input ./tts_dataset \
  --output ./voice_model \
  --backend piper \
  --model kontextox/piper-base-us \
  --steps 500

# Multi-GPU training (2x L40S, etc.)
datasety train --input ./tts_dataset \
  --output ./voice_model \
  --backend piper \
  --model kontextox/piper-base-us \
  --steps 1000 \
  --accelerator gpu \
  --devices 2

# With real-time voice testing
datasety train --input ./tts_dataset \
  --output ./voice_model \
  --backend piper \
  --model kontextox/piper-base-us \
  --test-text "Hello, this is a test of my new voice."
```

Note: The `train` command has two completely separate modes, Image (LoRA) and Audio (TTS), each with different parameters. Use `--family flux/sdxl/qwen` for LoRA training, or `--backend piper` for TTS training. See the `train` docs for the full parameter reference.
The `audio` command supports local video/audio files, YouTube URLs, directories, and `.txt` lists. See the audio docs for full options.
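A `.txt` list lets one run cover many sources. A minimal sketch, assuming the list format is one path or URL per line (the file names are placeholders; check the audio docs for the exact format):

```shell
# inputs.txt: one source per line (assumed format -- verify against the audio docs)
cat > inputs.txt <<'EOF'
./recordings/session1.wav
./recordings/session2.wav
EOF

# Then point --input at the list:
# datasety audio --input inputs.txt --output ./dataset --workers 4
```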
The `caption` command's `--llm-api` mode supports custom providers via environment variables:

| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | API key | required for `--llm-api` |
| `OPENAI_BASE_URL` | Custom API endpoint | `https://api.openai.com/v1` |
| `OPENAI_MODEL` | Default model (when `--model` is omitted) | `gpt-5-nano` |
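Because the endpoint is configurable, any OpenAI-compatible server can stand in for the default. A sketch pointing at a locally hosted server; the URL, key, and model name here are placeholders, not values shipped with datasety:

```shell
# Point captioning at a local OpenAI-compatible endpoint
export OPENAI_API_KEY=local-key                   # placeholder: many local servers accept any value
export OPENAI_BASE_URL=http://localhost:8000/v1   # placeholder: your server's URL
export OPENAI_MODEL=my-local-model                # placeholder: used when --model is omitted

# datasety caption -i ./dataset -o ./dataset --llm-api
```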
## Run a Workflow

Create `datasety.yaml` in your project:
```yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 768x1024
      crop-position: center
  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "[trigger]"
  - command: mask
    args:
      input: ./dataset
      output: ./masks
      keywords: "face,hair"
```

Validate and execute:
```bash
datasety workflow --dry-run   # validate all steps
datasety workflow             # execute
```

## Upload to HuggingFace
Upload datasets or model adapters to HuggingFace Hub. The command auto-detects the type (audio, image, video, document, model, generic) and generates an HF-compliant README dataset card.
```bash
# Upload a TTS audio dataset
datasety upload --path ./tts_dataset --repo-id user/my-voice --type audio

# Upload a LoRA adapter
datasety upload --path ./lora_output --repo-id user/sdxl-lora --type model

# Dry-run first
datasety upload --path ./dataset --repo-id user/my-dataset --dry-run
```

Requires the `HF_TOKEN` env var or the `--token` argument. See the upload docs for full options.
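Setting the token once per shell session keeps it out of individual commands. A minimal sketch; the token value is a placeholder for a token created in your HuggingFace account settings:

```shell
# Export once per session; passing --token on the command line also works
export HF_TOKEN=hf_your_token_here

# datasety upload --path ./dataset --repo-id user/my-dataset
```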
## Commands Overview

### Image Processing

| Command | Description | Extra Deps |
|---|---|---|
| `resize` | Resize and crop to target resolution | -- |
| `caption` | Generate captions (Florence-2 or API) | `[caption]` |
| `audio` | Build TTS audio datasets from video/audio | `[audio]` |
| `align` | Align control/target image pairs | -- |
| `mask` | Text-prompted segmentation masks | `[mask]` |
| `filter` | Filter by content (CLIP or NudeNet) | `[filter]` |
| `inspect` | Dataset statistics and duplicate detection | -- |
| `degrade` | Degraded versions for upscale training | -- |
| `upload` | Upload datasets/models to HuggingFace Hub | -- |
### Generation

| Command | Description | Extra Deps |
|---|---|---|
| `synthetic` | Image editing with diffusion models | `[synthetic]` |
| `character` | Identity-preserving character datasets | `[character]` |
| `shuffle` | Random captions from text groups | -- |
### Automation

| Command | Description | Extra Deps |
|---|---|---|
| `sweep` | Parameter grid search for `synthetic` | `[workflow]` |
| `workflow` | Multi-step pipelines from YAML/JSON | `[workflow]` |
### Training

| Command | Description | Extra Deps |
|---|---|---|
| `train --family flux` | LoRA fine-tuning: FLUX.2-klein, SDXL, Qwen (images + captions → `.safetensors`) | `[train]` |
| `train --backend piper` | TTS training: Piper voice models (audio dataset → `.ckpt`/`.onnx`) | `[train]` |
## Common Patterns

All commands that process image directories share these options:

| Option | Description |
|---|---|
| `--input`, `-i` | Input directory |
| `--output`, `-o` | Output directory |
| `--input-image` | Single image mode (alternative to a directory) |
| `--device` | `auto`, `cpu`, `cuda`, or `mps` |
| `--dry-run` | Preview without making changes |
| `--recursive`, `-R` | Search the input directory recursively |