Workflows

Workflows let you chain multiple datasety commands into a single pipeline defined in a YAML or JSON file, making dataset preparation reproducible.

Quick Start

Create a datasety.yaml file in your project directory:

yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024
  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "ohwx,"

Validate first, then run:

bash
datasety workflow --dry-run
datasety workflow
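
Conceptually, a dry run parses the workflow file and checks that every step names a command and provides an args mapping before anything executes. A minimal sketch of that shape check (not datasety's actual validator), shown on the JSON form of the same workflow:

```python
import json

def validate_workflow(text: str) -> list[str]:
    """Return a list of problems found in a workflow document (empty = OK)."""
    doc = json.loads(text)
    steps = doc.get("steps")
    if not isinstance(steps, list) or not steps:
        return ["'steps' must be a non-empty list"]
    problems = []
    for i, step in enumerate(steps):
        if "command" not in step:
            problems.append(f"step {i}: missing 'command'")
        if not isinstance(step.get("args"), dict):
            problems.append(f"step {i}: 'args' must be a mapping")
    return problems

workflow = """
{
  "steps": [
    {"command": "resize",
     "args": {"input": "./raw", "output": "./dataset", "resolution": "1024x1024"}},
    {"command": "caption",
     "args": {"input": "./dataset", "output": "./dataset", "trigger-word": "ohwx,"}}
  ]
}
"""
print(validate_workflow(workflow))  # []
```

The real `--dry-run` also validates per-command arguments; this sketch only checks the outer structure.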

File Format

See the workflow command reference for full format details.

Real-World Pipelines

Face/Person LoRA Training

The most common use case: prepare a face LoRA dataset from raw selfies or portrait photos. Resize to square, caption with a rare trigger word, and generate face masks so the trainer can focus loss on the subject.

yaml
# face-lora.yaml
# Input: ./raw/ containing 15-30 portrait photos (JPG/PNG from phone camera)
# Output: ./dataset/ with resized images, captions (.txt), and masks
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024
      crop-position: top

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "ohwx person,"

  - command: mask
    args:
      input: ./dataset
      output: ./dataset/masks
      keywords: "person,face,hair"
      model: clipseg
      threshold: 0.4
      padding: 10
      blur: 5
bash
datasety workflow -f face-lora.yaml --dry-run
datasety workflow -f face-lora.yaml
# Result: ./dataset/ has 001.jpg + 001.txt + masks/001.png for each image
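
What the masks buy you at training time: trainers that support masked loss typically use the mask as a per-pixel weight on the reconstruction error, so mistakes on the subject count and mistakes on the background are ignored. A conceptual numpy sketch (not any specific trainer's implementation):

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """MSE weighted by a 0..1 mask: background pixels (mask=0) contribute nothing."""
    weighted = mask * (pred - target) ** 2
    return float(weighted.sum() / np.maximum(mask.sum(), 1e-8))

# Toy 2x2 example: the only error falls on a masked-out pixel, so it is ignored.
pred   = np.array([[0.0, 1.0], [0.0, 0.0]])
target = np.array([[0.0, 0.0], [0.0, 0.0]])
mask   = np.array([[1.0, 0.0], [1.0, 1.0]])
print(masked_mse(pred, target, mask))  # 0.0
```

The `padding` and `blur` arguments soften the mask edge, which in this scheme means a gradual falloff of the loss weight around the subject.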

Accessory Augmentation

You have 20 photos of a person and want to expand the dataset with synthetic variations in which the subject wears different accessories. This helps the LoRA generalize beyond the reference photos.

yaml
# augment-accessories.yaml
# Input: ./dataset/ containing resized training images (from face LoRA step above)
# Output: ./augmented/ with synthetic edits, then re-captioned
steps:
  - command: synthetic
    args:
      input: ./dataset
      output: ./augmented/hats
      prompt: "the person is wearing a knitted beanie hat"
      steps: 4
      cfg-scale: 2.5
      seed: 42

  - command: synthetic
    args:
      input: ./dataset
      output: ./augmented/glasses
      prompt: "the person is wearing round sunglasses"
      steps: 4
      cfg-scale: 2.5
      seed: 42

  - command: synthetic
    args:
      input: ./dataset
      output: ./augmented/scarves
      prompt: "the person is wearing a red wool scarf"
      steps: 4
      cfg-scale: 2.5
      seed: 42

  - command: caption
    args:
      input: ./augmented/hats
      output: ./augmented/hats
      trigger-word: "ohwx person,"

  - command: caption
    args:
      input: ./augmented/glasses
      output: ./augmented/glasses
      trigger-word: "ohwx person,"

  - command: caption
    args:
      input: ./augmented/scarves
      output: ./augmented/scarves
      trigger-word: "ohwx person,"
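
The three synthetic + three caption steps differ only in prompt and output directory, so for longer accessory lists the workflow file can be generated rather than hand-written. A small sketch of that generation (a hypothetical helper, not part of datasety; dump `steps` with a YAML/JSON serializer to produce the file):

```python
# Generate the repetitive synthetic + caption step pairs for each accessory.
accessories = {
    "hats": "the person is wearing a knitted beanie hat",
    "glasses": "the person is wearing round sunglasses",
    "scarves": "the person is wearing a red wool scarf",
}

steps = []
for name, prompt in accessories.items():
    out = f"./augmented/{name}"
    steps.append({"command": "synthetic",
                  "args": {"input": "./dataset", "output": out, "prompt": prompt,
                           "steps": 4, "cfg-scale": 2.5, "seed": 42}})
for name in accessories:
    out = f"./augmented/{name}"
    steps.append({"command": "caption",
                  "args": {"input": out, "output": out,
                           "trigger-word": "ohwx person,"}})

print(len(steps))  # 6
```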

Product Photography LoRA

Prepare a dataset from product photos for an object LoRA. Product shots are often taller than wide, so we use a non-square portrait crop to preserve the product's proportions; and since product backgrounds are often white or cluttered, a mask step isolates the product itself.

yaml
# product-lora.yaml
# Input: ./product_photos/ containing product images (various sizes)
# Output: ./dataset/ ready for training
steps:
  - command: resize
    args:
      input: ./product_photos
      output: ./dataset
      resolution: 768x1024
      crop-position: center

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "sks product,"
      florence-2-large: true

  - command: mask
    args:
      input: ./dataset
      output: ./dataset/masks
      keywords: "product,object,item"
      model: clipseg
      threshold: 0.3

Upscale/Restore Training

Create a paired dataset for training an upscale or image restoration model. The degradation step creates realistic artifacts (JPEG compression, noise, blur) that the model learns to reverse.

yaml
# upscale-training.yaml
# Input: ./originals/ containing high-quality source images
# Output: ./dataset/ with control/ (degraded) and target/ (original) subdirs
steps:
  - command: resize
    args:
      input: ./originals
      output: ./resized
      resolution: 1024x1024

  - command: degrade
    args:
      input: ./resized
      output: ./dataset
      type:
        - jpeg
        - noise
        - blur
      chain: true
      intensity-range: "0.3-0.7"
      paired: true
      seed: 42

  - command: align
    args:
      target: ./dataset/target
      control: ./dataset/control

  - command: caption
    args:
      input: ./dataset/target
      output: ./dataset/target

Background Replacement

Generate inverted masks (everything except the subject), then use synthetic editing to change backgrounds. Useful for placing subjects in varied environments.

yaml
# background-swap.yaml
# Input: ./portraits/ containing people photos with plain backgrounds
# Output: Three sets of re-backgrounded images
steps:
  - command: resize
    args:
      input: ./portraits
      output: ./resized
      resolution: 1024x1024
      crop-position: center

  - command: synthetic
    args:
      input: ./resized
      output: ./bg_outdoor
      prompt: "the person is standing in a sunny park with trees and grass"
      steps: 4
      cfg-scale: 2.5
      seed: 100

  - command: synthetic
    args:
      input: ./resized
      output: ./bg_studio
      prompt: "professional studio portrait with soft lighting and gray backdrop"
      steps: 4
      cfg-scale: 2.5
      seed: 100

  - command: synthetic
    args:
      input: ./resized
      output: ./bg_urban
      prompt: "the person is standing on a city street with buildings"
      steps: 4
      cfg-scale: 2.5
      seed: 100

Inpainting Dataset

Create an inpainting training dataset with source images, masks, and captions. The masks mark regions to inpaint (e.g., accessories that should be removable).

yaml
# inpainting-dataset.yaml
# Input: ./photos/ containing images of people with accessories
# Output: ./dataset/ with images, masks for accessories, and captions
steps:
  - command: resize
    args:
      input: ./photos
      output: ./dataset
      resolution: 1024x1024
      crop-position: top

  - command: mask
    args:
      input: ./dataset
      output: ./dataset/masks
      keywords: "hat,glasses,sunglasses,scarf,necklace,earring"
      model: sam3
      threshold: 0.3
      padding: 5
      blur: 3

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      florence-2-large: true

Vision API Captioning with Custom Provider

Use a third-party OpenAI-compatible API for captioning when you want higher-quality descriptions than Florence-2. Works with OpenRouter, Together, or any compatible endpoint.

yaml
# api-caption.yaml
# Requires: OPENAI_API_KEY and OPENAI_BASE_URL env vars
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      llm-api: true
      model: gpt-5-nano
      trigger-word: "ohwx person,"
      prompt: "Describe this person's appearance, clothing, pose, expression, and setting in one detailed paragraph. Do not mention image quality or photography terms."
      temperature: 0.3
      max-tokens: 200
bash
# Run with OpenRouter
OPENAI_BASE_URL=https://openrouter.ai/api/v1 \
OPENAI_API_KEY=sk-or-... \
datasety workflow -f api-caption.yaml
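
Under the hood, an OpenAI-compatible vision request sends the image as a base64 data URL inside a chat message. A sketch of the request body datasety would POST to `$OPENAI_BASE_URL/chat/completions` (the standard chat-completions shape; exact field choices are an assumption, datasety builds this for you):

```python
import base64

def caption_request(image_bytes: bytes, prompt: str, model: str,
                    max_tokens: int = 200, temperature: float = 0.3) -> dict:
    """Build an OpenAI-compatible chat-completions body for image captioning."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

body = caption_request(b"\xff\xd8", "Describe this person.", "gpt-5-nano")
print(body["messages"][0]["content"][0]["type"])  # text
```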

Multi-Resolution Dataset

Some trainers benefit from images at multiple resolutions. This workflow outputs the same source at three common training sizes, then captions once from the highest-resolution (1024px) copies and writes the caption files alongside each size.

yaml
# multi-res.yaml
# Input: ./raw/ containing high-res source images (>= 2048px)
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset_512
      resolution: 512x512

  - command: resize
    args:
      input: ./raw
      output: ./dataset_768
      resolution: 768x768

  - command: resize
    args:
      input: ./raw
      output: ./dataset_1024
      resolution: 1024x1024

  - command: caption
    args:
      input: ./dataset_1024
      output: ./dataset_512
      trigger-word: "ohwx,"

  - command: caption
    args:
      input: ./dataset_1024
      output: ./dataset_768
      trigger-word: "ohwx,"

  - command: caption
    args:
      input: ./dataset_1024
      output: ./dataset_1024
      trigger-word: "ohwx,"

Sweep Then Train

Use sweep to find optimal generation parameters on a small sample, then apply the best settings to the full dataset.

bash
# Step 1: Test on 2-3 images to find the best steps + cfg-scale
mkdir ./sample && cp ./dataset/001.jpg ./dataset/002.jpg ./sample/

datasety sweep \
    -i ./sample -o ./sweep_results \
    -p "the person is wearing aviator sunglasses" \
    --steps 2,4,8 \
    --cfg-scale 1.5,2.5,3.5 \
    --seed 42 --run

# Step 2: Visually inspect ./sweep_results/steps4_cfg2.5/ etc.
# Pick the best combination, then apply to the full dataset:
yaml
# full-augment.yaml
steps:
  - command: synthetic
    args:
      input: ./dataset
      output: ./augmented
      prompt: "the person is wearing aviator sunglasses"
      steps: 4
      cfg-scale: 2.5
      seed: 42

  - command: caption
    args:
      input: ./augmented
      output: ./augmented
      trigger-word: "ohwx person,"
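
The sweep grid in step 1 is just the Cartesian product of the parameter lists: 3 steps values x 3 cfg-scale values = 9 output folders to inspect. A sketch of how the combinations map to folder names like steps4_cfg2.5 (the naming pattern is taken from the comment above, not verified against the CLI):

```python
from itertools import product

steps_values = [2, 4, 8]
cfg_values = [1.5, 2.5, 3.5]

# One output folder per (steps, cfg-scale) combination.
runs = [f"steps{s}_cfg{c}" for s, c in product(steps_values, cfg_values)]
print(len(runs), runs[4])  # 9 steps4_cfg2.5
```

Keep the sample small: the grid grows multiplicatively, so adding a third parameter with 3 values would mean 27 runs.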

Cyanotype Style LoRA (API dataset + two models)

Train a cyanotype photographic style LoRA — the 1842 UV contact-print process producing Prussian-blue and bleached-white prints — on images generated via the FLUX API. The finished LoRAs let you:

  • FLUX.2-klein-base-4B: add the cyanotype style to any img2img edit
  • Qwen-Image-Edit-2511: convert any photograph to a cyanotype print

This workflow was run end-to-end and produced working LoRAs. See examples/cyanotype_lora/ for the full output including trained weights, dataset, and inference results.

yaml
# cyanotype-lora.yaml
# Requires: OPENAI_API_KEY + OPENAI_BASE_URL (OpenRouter) + HF_TOKEN
steps:
  - command: character
    args:
      output: ./dataset/raw
      num-images: 30
      prompts-file: ./prompts.txt # 30 curated cyanotype subject prompts
      image-api: true
      model: black-forest-labs/flux.2-klein-4b
      api-aspect-ratio: "1:1"
      api-image-size: 1K
      output-format: png

  - command: resize
    args:
      input: ./dataset/raw
      output: ./dataset/prepared
      resolution: 512x512
      crop-position: center

  - command: caption
    args:
      input: ./dataset/prepared
      output: ./dataset/prepared
      trigger-word: "cyanotype,"
      llm-api: true
      model: google/gemini-2.5-flash
      prompt: >
        Describe this image's subject and composition in one sentence.
        Focus on WHAT is depicted, not the style or color.
      temperature: 0.3
      max-tokens: 80

  - command: train
    args:
      input: ./dataset/prepared
      output: ./lora/cyanotype_flux4b.safetensors
      model: black-forest-labs/FLUX.2-klein-base-4B
      steps: 500
      lr: 1e-4
      lora-rank: 16
      timestep-type: sigmoid
      caption-dropout: 0.05
      lr-scheduler: cosine
      lr-warmup-steps: 50
      validation-split: 0.1
      seed: 42

  - command: train
    args:
      input: ./dataset/prepared
      output: ./lora/cyanotype_qwen.safetensors
      model: Qwen/Qwen-Image-Edit-2511
      steps: 300
      lr: 5e-5
      lora-rank: 16
      timestep-type: sigmoid
      caption-dropout: 0.05
      lr-scheduler: cosine
      lr-warmup-steps: 30
      validation-split: 0.1
      seed: 42
bash
datasety workflow -f cyanotype-lora.yaml --dry-run
datasety workflow -f cyanotype-lora.yaml

# Apply trained LoRA — FLUX img2img
datasety synthetic --input-image photo.jpg --output-image out.png \
    --model black-forest-labs/FLUX.2-klein-base-4B \
    --lora ./lora/cyanotype_flux4b.safetensors:0.9 \
    --prompt "cyanotype, botanical specimen, Prussian blue and white" \
    --steps 20 --cfg-scale 3.5 --strength 0.75

# Apply trained LoRA — Qwen photo-to-cyanotype
datasety synthetic --input-image photo.jpg --output-image out.png \
    --model Qwen/Qwen-Image-Edit-2511 \
    --lora ./lora/cyanotype_qwen.safetensors:0.8 \
    --prompt "cyanotype, convert to cyanotype print style, Prussian blue and white" \
    --steps 40 --true-cfg-scale 4.0

Train a LoRA from a Character Dataset

Prepare a character dataset using character (LLM-generated prompts + FLUX.2), then train a LoRA adapter on the result.

Note: Training requires the base (undistilled) model. The character generation step uses the fast FP8 inference model; the train step loads the full base model.

yaml
# character-lora.yaml
# Input: Optional reference face image(s) at ./reference/
# Output: ./lora/character_lora.safetensors ready to use with --lora
steps:
  - command: character
    args:
      output: ./character_dataset
      num-images: 50
      llm-ollama: qwen3.5:4b
      model: black-forest-labs/FLUX.2-klein-4b-fp8
      character-description: "a young woman with short auburn hair and freckles"
      steps: 4
      seed: 42

  - command: train
    args:
      input: ./character_dataset
      output: ./lora/character_lora.safetensors
      model: black-forest-labs/FLUX.2-klein-base-4B
      steps: 500
      lr: 1e-4
      lora-rank: 16
      image-size: 512
bash
datasety workflow -f character-lora.yaml --dry-run
datasety workflow -f character-lora.yaml

# Use the trained LoRA for inference
datasety synthetic \
    --input-image photo.jpg \
    --output-image result.png \
    --prompt "ohwx person in a forest" \
    --lora ./lora/character_lora.safetensors:0.8

Shuffled Caption Augmentation

Generate randomized captions to add variety to a training dataset. Each image gets a randomly assembled caption from predefined text groups, which helps prevent the model from memorizing exact phrasings.

yaml
# shuffle-captions.yaml
# Input: ./raw/ containing images
# Generates randomized captions from text groups
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024

  - command: shuffle
    args:
      input: ./dataset
      output: ./dataset
      group:
        - "ohwx person,|a photo of ohwx,|ohwx,"
        - "looking at the camera|facing forward|in a relaxed pose|smiling"
        - "natural lighting|soft studio light|bright daylight|warm indoor lighting"
      seed: 42
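
Each caption is assembled by picking one alternative (the `|`-separated options) from every group, in group order, with a seeded RNG so runs are reproducible. A conceptual sketch of that assembly (not datasety's implementation; joining with spaces is an assumption):

```python
import random

groups = [
    "ohwx person,|a photo of ohwx,|ohwx,",
    "looking at the camera|facing forward|in a relaxed pose|smiling",
    "natural lighting|soft studio light|bright daylight|warm indoor lighting",
]

def shuffled_caption(rng: random.Random) -> str:
    """Pick one alternative from each '|'-separated group, preserving group order."""
    return " ".join(rng.choice(g.split("|")) for g in groups)

rng = random.Random(42)
for _ in range(3):
    print(shuffled_caption(rng))
```

With 3 x 4 x 4 alternatives this yields 48 possible captions, so exact phrasings rarely repeat across a small dataset.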

TTS Audio Dataset from YouTube

Build a TTS training dataset from a YouTube video or a directory of audio files. The audio command transcribes speech, slices audio at word boundaries, and outputs LJSpeech-compatible wavs/ + metadata.csv.

yaml
# tts-from-youtube.yaml
# Input: YouTube URL or local directory
# Output: ./tts_dataset/ with wavs/ and metadata.csv ready for Piper training
steps:
  - command: audio
    args:
      input: "https://www.youtube.com/watch?v=..."
      output: ./tts_dataset
      whisper-model: large-v3
      language: en
      normalize-numbers: true
      workers: 4

  - command: audio
    args:
      input: ./recordings/
      output: ./tts_dataset
      whisper-model: base
      language: uk
      normalize-numbers: true
      workers: 4
      resume: true
bash
datasety workflow -f tts-from-youtube.yaml --dry-run
datasety workflow -f tts-from-youtube.yaml

# Resume later (skips already-processed files)
datasety workflow -f tts-from-youtube.yaml

Tip: Use --workers 4 (or more) to transcribe multiple files in parallel. Use --normalize-numbers to expand digits like 123 into words so the TTS model pronounces them correctly.
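
The LJSpeech layout the audio command emits is simple: a wavs/ directory of clips plus a pipe-delimited metadata.csv pairing each clip id with its transcript. A sketch of writing and parsing that layout (illustrative only; the first row shows the number-normalized form that --normalize-numbers produces):

```python
rows = [
    ("clip_0001", "The number is one hundred twenty three."),
    ("clip_0002", "Welcome back to the channel."),
]

# metadata.csv: one 'id|transcript' line per audio file at wavs/<id>.wav
metadata = "\n".join(f"{clip_id}|{text}" for clip_id, text in rows)
print(metadata)

# Parse it back: split on the first '|' only, in case a transcript contains one.
parsed = [line.split("|", 1) for line in metadata.splitlines()]
print(parsed[0])  # ['clip_0001', 'The number is one hundred twenty three.']
```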

Upload TTS Dataset to HuggingFace

After building a TTS dataset, upload it to the HuggingFace Hub with an auto-generated dataset card:

yaml
# upload-tts.yaml
steps:
  - command: audio
    args:
      input: "https://www.youtube.com/watch?v=..."
      output: ./tts_dataset
      whisper-model: base
      language: en
      workers: 4

  - command: upload
    args:
      path: ./tts_dataset
      repo-id: your-username/my-voice-dataset
      type: audio
      private: true
bash
datasety workflow -f upload-tts.yaml --dry-run
datasety workflow -f upload-tts.yaml

Prepare and Upload LoRA Training Dataset

Resize, caption, and train a LoRA, then upload the adapter:

yaml
# train-and-upload.yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024

  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "ohwx person,"

  - command: train
    args:
      input: ./dataset
      output: ./lora/portrait_lora.safetensors
      model: black-forest-labs/FLUX.2-klein-base-4B
      steps: 500
      lora-rank: 16

  - command: upload
    args:
      path: ./lora/portrait_lora.safetensors
      repo-id: your-username/portrait-lora
      type: model
bash
datasety workflow -f train-and-upload.yaml --dry-run
datasety workflow -f train-and-upload.yaml

Released under the MIT License.