Getting Started

Installation

Install the core package (resize, align, shuffle, degrade):

bash
pip install datasety

Add features as needed:

bash
# Quote the extras so shells such as zsh do not treat [..] as a glob pattern
pip install "datasety[caption]"      # Florence-2 captioning
pip install "datasety[synthetic]"    # Image editing (FLUX, Qwen, SDXL, etc.)
pip install "datasety[mask]"         # Mask generation (SAM 3, SAM 2, CLIPSeg)
pip install "datasety[filter]"       # Content filtering (CLIP, NudeNet)
pip install "datasety[character]"    # Character dataset generation
pip install "datasety[audio]"        # TTS audio datasets (Whisper transcription)
pip install "datasety[video]"        # Video datasets (same deps as audio)
pip install "datasety[train]"        # LoRA training (FLUX, SDXL)
pip install "datasety[upload]"       # Upload to HuggingFace Hub
pip install "datasety[workflow]"     # YAML/JSON workflow support
pip install "datasety[all]"          # Everything

Verify the installation:

bash
datasety --version
datasety --help
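
If the datasety command is not found after installation, one common cause is that the Python environment pip installed into is not active, or its script directory is not on PATH. The POSIX shell sketch below wraps the verification step in that check. The helper names have_cmd and check_datasety are hypothetical and not part of datasety.

```shell
# have_cmd is a hypothetical helper, not part of datasety.
# It succeeds when the named command can be resolved by the shell.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check_datasety prints the installed version when the CLI is available,
# or prints a hint and returns non-zero when it is not.
check_datasety() {
    if have_cmd datasety; then
        datasety --version
    else
        echo "datasety not found on PATH; check your Python environment" >&2
        return 1
    fi
}
```

Call check_datasety at the top of a setup script to fail early with a readable message.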

Quick Start

Prepare a LoRA Training Dataset

bash
# 1. Resize images to training resolution
datasety resize -i ./raw -o ./dataset -r 1024x1024

# 2. Generate captions with a template
datasety caption -i ./dataset -o ./dataset --template "[trigger] {{caption}}"
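
The two steps above can be chained in a small script so that captioning only runs when resizing succeeded. This is a sketch under one assumption: that datasety exits with a non-zero status on failure, which this page does not state. The function name prepare_lora_dataset is hypothetical.

```shell
# prepare_lora_dataset RAW_DIR DATASET_DIR
# Hypothetical wrapper: runs resize, then caption, and stops at the first
# failure (assumes datasety returns a non-zero exit status on error).
prepare_lora_dataset() {
    raw_dir=$1
    dataset_dir=$2

    # 1. Resize images to training resolution
    datasety resize -i "$raw_dir" -o "$dataset_dir" -r 1024x1024 || return 1

    # 2. Generate captions with a template
    datasety caption -i "$dataset_dir" -o "$dataset_dir" \
        --template "[trigger] {{caption}}"
}
```

Usage: prepare_lora_dataset ./raw ./dataset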

Use a Vision API for Captions

bash
export OPENAI_API_KEY=your-key

datasety caption -i ./dataset -o ./dataset --llm-api --model gpt-5-nano

Build a TTS Audio Dataset

bash
# From a YouTube video (flat .wav/.txt pairs by default)
datasety audio --input "https://www.youtube.com/watch?v=..." \
    --output ./tts_dataset --language en --workers 4

# With LJSpeech/Piper format (metadata.csv + wavs/)
datasety audio --input ./video.mp4 --output ./dataset --metadata

# From a directory of audio files
datasety audio --input ./recordings/ --output ./dataset \
    --normalize-numbers --workers 4

# With phoneme map filtering (drops invalid segments, requires --metadata)
datasety audio --input ./video.mp4 --output ./dataset \
    --metadata --phoneme-map /path/to/piper/config.json

# Time-slicing from a URL
datasety audio --input "https://youtube.com/watch?v=...&start=50&end=90" \
    --output ./dataset

Build a Video Dataset

bash
# From a YouTube video
datasety video --input "https://www.youtube.com/watch?v=..." \
    --output ./video_dataset --language en

# From a local video file
datasety video --input ./interview.mp4 --output ./dataset

# With frame-accurate cuts (slower, default is fast stream-copy)
datasety video --input ./video.mp4 --output ./dataset --re-encode

# Directory of clips with vocal isolation for transcription
datasety video --input ./clips/ --output ./dataset --demucs

Train a LoRA Adapter (Image Fine-Tuning)

bash
# 1. Prepare dataset
datasety resize -i ./raw -o ./dataset -r 512x512
datasety caption -i ./dataset -o ./dataset --template "photo of sks person, {{caption}}"

# 2. Train LoRA on FLUX.2-klein-base-4B (~8 GB VRAM)
datasety train --input ./dataset \
    --output ./lora/flux_lora.safetensors \
    --model black-forest-labs/FLUX.2-klein-base-4B \
    --steps 500 --lr 1e-4 --lora-rank 16

Train a TTS Voice Model (Audio)

bash
# Train a Piper TTS model (auto-installs dependencies on first run)
datasety train --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model kontextox/piper-base-us \
    --steps 500

# Multi-GPU training (2x L40S, etc.)
datasety train --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model kontextox/piper-base-us \
    --steps 1000 \
    --accelerator gpu \
    --devices 2

# With real-time voice testing
datasety train --input ./tts_dataset \
    --output ./voice_model \
    --backend piper \
    --model kontextox/piper-base-us \
    --test-text "Hello, this is a test of my new voice."

Note: The train command has two separate modes, Image (LoRA) and Audio (TTS), each with its own parameters. Use --family with flux, sdxl, or qwen for LoRA training, or --backend piper for TTS training. See the train docs for the full parameter reference.

The audio command accepts local video/audio files, YouTube URLs, directories, and .txt lists as input. See the audio docs for full options.

Captioning with --llm-api supports custom providers via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| OPENAI_API_KEY | API key | required for --llm-api |
| OPENAI_BASE_URL | Custom API endpoint | https://api.openai.com/v1 |
| OPENAI_MODEL | Default model (when --model omitted) | gpt-5-nano |
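
As an illustration, pointing the caption command at a different provider might look like the sketch below. The endpoint URL and model name are placeholders, the wrapper function caption_via_api is hypothetical, and it is assumed that the provider exposes an OpenAI-style API. Only the three variable names and the caption flags come from this page.

```shell
# Placeholder values: replace them with your provider's details.
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="http://localhost:8000/v1"   # hypothetical local endpoint
export OPENAI_MODEL="my-vision-model"               # hypothetical model name

# caption_via_api DATASET_DIR
# Hypothetical wrapper. --model is left out on purpose so that the
# OPENAI_MODEL default from the table above applies.
caption_via_api() {
    datasety caption -i "$1" -o "$1" --llm-api
}
```

Usage: caption_via_api ./dataset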

Run a Workflow

Create datasety.yaml in your project:

yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 768x1024
      crop-position: center

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      template: "[trigger] {{caption}}"

  - command: mask
    args:
      input: ./dataset
      output: ./masks
      keywords: "face,hair"

Validate and execute:

bash
datasety workflow --dry-run    # validate all steps
datasety workflow              # execute
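
In a script, the two commands can be combined so the workflow only executes after validation passes. The sketch assumes that datasety workflow --dry-run exits with a non-zero status when validation fails, which this page does not confirm. The function name run_workflow_checked is hypothetical.

```shell
# run_workflow_checked
# Hypothetical wrapper: executes the workflow only if the dry run succeeds.
# Assumes the dry run returns a non-zero exit status on validation errors.
run_workflow_checked() {
    if datasety workflow --dry-run; then
        datasety workflow
    else
        echo "workflow validation failed; nothing was executed" >&2
        return 1
    fi
}
```

Run it from the directory that contains datasety.yaml.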

Upload to HuggingFace

Upload datasets or model adapters to HuggingFace Hub. The command auto-detects the type (audio, image, video, document, model, generic) and generates an HF-compliant README dataset card.

bash
# Upload a TTS audio dataset
datasety upload --path ./tts_dataset --repo-id user/my-voice --type audio

# Upload a LoRA adapter
datasety upload --path ./lora_output --repo-id user/sdxl-lora --type model

# Dry-run first
datasety upload --path ./dataset --repo-id user/my-dataset --dry-run

Requires the HF_TOKEN environment variable or the --token argument. See the upload docs for full options.
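
A script can check for the token before uploading and preview the upload first. The sketch below uses only the flags shown above. The function name upload_checked is hypothetical, and it is an assumption that a failed dry run exits with a non-zero status.

```shell
# upload_checked PATH REPO_ID
# Hypothetical wrapper: refuses to run without HF_TOKEN, previews the upload
# with --dry-run, then performs the real upload.
upload_checked() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set (or pass --token to datasety upload)" >&2
        return 1
    fi
    # Assumes a failed dry run returns a non-zero exit status.
    datasety upload --path "$1" --repo-id "$2" --dry-run || return 1
    datasety upload --path "$1" --repo-id "$2"
}
```

Usage: upload_checked ./dataset user/my-dataset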

Commands Overview

Image Processing

| Command | Description | Extra Deps |
| --- | --- | --- |
| resize | Resize and crop to target resolution | -- |
| caption | Generate captions (Florence-2 or API) | [caption] |
| audio | Build TTS audio datasets from video/audio | [audio] |
| video | Build video datasets from video files | [video] |
| align | Align control/target image pairs | -- |
| mask | Text-prompted segmentation masks | [mask] |
| filter | Filter by content (CLIP or NudeNet) | [filter] |
| degrade | Degraded versions for upscale training | -- |
| upload | Upload datasets/models to HuggingFace Hub | -- |

Generation

| Command | Description | Extra Deps |
| --- | --- | --- |
| synthetic | Image editing with diffusion models | [synthetic] |
| character | Identity-preserving character datasets | [character] |
| shuffle | Random captions from text groups | -- |

Automation

| Command | Description | Extra Deps |
| --- | --- | --- |
| sweep | Parameter grid search for synthetic | [workflow] |
| workflow | Multi-step pipelines from YAML/JSON | [workflow] |

Training

| Command | Description | Extra Deps |
| --- | --- | --- |
| train --family flux | LoRA fine-tuning: FLUX.2-klein, SDXL, Qwen (images + captions → .safetensors) | [train] |
| train --backend piper | TTS training: Piper voice models (audio dataset → .ckpt/.onnx) | [train] |

Common Patterns

All commands that process image directories share these options:

| Option | Description |
| --- | --- |
| --input, -i | Input directory |
| --output, -o | Output directory |
| --input-image | Single image mode (alternative to dir) |
| --device | auto, cpu, cuda, or mps |
| --dry-run | Preview without making changes |
| --recursive, -R | Search input directory recursively |
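
As a sketch of how these shared options combine, the wrapper below previews a recursive resize without writing any files. It relies on the statement above that image-directory commands share these options. The function name preview_resize is hypothetical.

```shell
# preview_resize RAW_DIR DATASET_DIR
# Hypothetical wrapper combining the shared options with resize: searches
# RAW_DIR recursively and previews the result without making changes.
preview_resize() {
    datasety resize --input "$1" --output "$2" -r 1024x1024 \
        --recursive --dry-run
}
```

Usage: preview_resize ./raw ./dataset. Drop --dry-run from the command once the preview looks right.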

Released under the MIT License.