# datasety

Dataset - it's easy!

One tool for the full dataset pipeline: resize, caption, align, generate, mask, filter, degrade, train LoRA adapters, train TTS voices, upload to Hugging Face, and automate with workflows.

## Quick Install

bash
pip install datasety          # core
pip install "datasety[all]"   # everything (quoted so shells like zsh don't glob the brackets)

## Example Pipeline

bash
# 1. Resize raw photos
datasety resize -i ./raw -o ./dataset -r 1024x1024

# 2. Generate captions with a trigger word
datasety caption -i ./dataset -o ./dataset --trigger-word "[trigger]"

# 3. Generate face masks for focused training
datasety mask -i ./dataset -o ./masks -k "face,hair"
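
To run the three steps as one unit, they can be wrapped in a small shell function. This is a minimal sketch, not part of datasety: `run_pipeline` and its positional arguments are hypothetical names, and it assumes each `datasety` command exits with a non-zero status on failure so that the `&&` chain stops at the first broken step.

```bash
# Hypothetical helper: run resize -> caption -> mask, stopping at the first failure.
# Uses only the datasety commands and flags shown in the example above.
# Assumption: datasety returns a non-zero exit status when a step fails.
run_pipeline() {
  local raw_dir="${1:-./raw}"          # source photos
  local dataset_dir="${2:-./dataset}"  # resized images and captions
  local mask_dir="${3:-./masks}"       # generated masks

  datasety resize -i "$raw_dir" -o "$dataset_dir" -r 1024x1024 &&
  datasety caption -i "$dataset_dir" -o "$dataset_dir" --trigger-word "[trigger]" &&
  datasety mask -i "$dataset_dir" -o "$mask_dir" -k "face,hair"
}
```

Calling `run_pipeline` with no arguments uses the same directories as the example; `run_pipeline ./photos ./out ./out-masks` overrides them.
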

Or define it as a workflow:

yaml
# datasety.yaml
steps:
  - command: resize
    args: { input: ./raw, output: ./dataset, resolution: 1024x1024 }
  - command: caption
    args: { input: ./dataset, output: ./dataset, trigger-word: "[trigger]" }
  - command: mask
    args: { input: ./dataset, output: ./masks, keywords: "face,hair" }

bash
datasety workflow --dry-run    # validate
datasety workflow              # execute
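
The validate-then-execute sequence can also be combined so the workflow only runs when validation passes. This is a sketch under one assumption that the text above does not confirm: that `datasety workflow --dry-run` exits with a non-zero status when the workflow is invalid. `run_workflow` is a hypothetical helper name.

```bash
# Hypothetical helper: execute the workflow only if the dry run succeeds.
# Assumption: `datasety workflow --dry-run` exits non-zero on an invalid workflow.
run_workflow() {
  if datasety workflow --dry-run; then
    datasety workflow
  else
    echo "workflow validation failed; nothing was executed" >&2
    return 1
  fi
}
```

If the dry run does not signal problems through its exit status, check its output by hand before executing instead.
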

Released under the MIT License.