# audio

Build TTS (Text-to-Speech) audio datasets from video or audio files. Supports YouTube URLs, direct media URLs, local files, and directories of files (sorted by name). Outputs Piper/LJSpeech-compatible datasets with a `metadata.csv` and a `wavs/` directory.
## Usage

```bash
# YouTube video
datasety audio --input "https://www.youtube.com/watch?v=..." --output ./dataset

# Local video file
datasety audio --input ./video.mp4 --output ./dataset

# Directory of audio/video files (sorted by name: 1.mp3, 2.mp3, ...)
datasety audio --input ./clips/ --output ./dataset

# With vocal isolation (removes background noise/music)
datasety audio --input ./video.mp4 --output ./dataset --demucs

# Custom Whisper model size
datasety audio --input ./video.mp4 --output ./dataset --whisper-model large-v3 --language en
```

## Options
| Option | Description | Default |
|---|---|---|
| `--input`, `-i` | Input: local file, directory, `.txt` list, or YouTube/URL (append `?start=X&end=Y` for time-slicing) | (required) |
| `--output`, `-o` | Output directory for the dataset | (required) |
| `--sample-rate` | Output audio sample rate in Hz | 22050 |
| `--demucs` | Enable Demucs vocal isolation (removes background noise/music) | false |
| `--demucs-model` | Demucs model name | htdemucs |
| `--whisper-model` | Faster-Whisper model: tiny, base, small, medium, large-v3 | base |
| `--language` | Language code (e.g., en, es, fr). Auto-detected if omitted | (auto) |
| `--device` | Device: auto, cpu, cuda, mps | auto |
| `--min-duration` | Minimum segment duration in seconds | 1.5 |
| `--max-duration` | Maximum segment duration in seconds | 30.0 |
| `--merge-gap` | Merge segments closer than this many seconds | 0.0 (off) |
| `--vad` | Enable voice activity detection (VAD) to filter non-speech | false |
| `--normalize-numbers` | Expand digits into words (e.g., 123 -> one hundred twenty-three) | false |
| `--no-clean-text` | Disable special-character stripping | false |
| `--phoneme-map` | Path to a Piper `config.json` or `phonemes.json`; silently drops segments with unknown characters | (none) |
| `--workers` | Number of parallel file workers | 1 |
| `--keep-temp` | Keep temporary audio files at this path | (none) |
| `--resume` | Resume a previous run (skip existing chunks, append to CSV) | false |
| `--overwrite` | Overwrite existing output directory | false |
| `--dry-run` | Print pipeline steps without executing | false |
| `--verbose`, `-V` | Print detailed progress messages | false |
## Output

The command creates a dataset directory with the following structure:

```
output/
├── wavs/
│   ├── utt_0001.wav
│   ├── utt_0002.wav
│   └── ...
└── metadata.csv
```

The `metadata.csv` uses the LJSpeech/Piper format:

```
utt_0001.wav|Hello world, this is a test.
utt_0002.wav|How are you doing today?
```

## Examples
### YouTube Video

Extract speech from a YouTube video and create a TTS dataset:

```bash
datasety audio \
  --input "https://www.youtube.com/watch?v=dQw4w9WgXcQ" \
  --output ./tts_dataset \
  --whisper-model base \
  --language en
```

### Local Video with Vocal Isolation
For videos with background music/noise, enable Demucs to isolate vocals:

```bash
datasety audio \
  --input ./recording.mp4 \
  --output ./clean_dataset \
  --demucs \
  --demucs-model htdemucs
```

### Directory of Audio Files
Process a directory of audio/video files sorted by name. Useful when you have pre-recorded segments like 1.mp3, 2.mp3, etc.:

```bash
datasety audio \
  --input ./recordings/ \
  --output ./dataset \
  --language en
```

The files are sorted numerically, so `2.mp3` comes before `10.mp3`. Supported formats include MP3, WAV, FLAC, OGG, M4A, AAC, OPUS, WEBM, MP4, MKV, AVI, and MOV.
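This numeric ordering amounts to a natural-sort key. A minimal stdlib sketch of the idea (illustrative only, not datasety's actual code):

```python
import re
from pathlib import Path

def natural_key(path: Path):
    """Split a filename into text and integer runs so numeric parts
    compare as numbers: "2.mp3" sorts before "10.mp3"."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]

def sorted_inputs(directory, exts=(".mp3", ".wav", ".flac", ".mp4")):
    """List media files in a directory in natural (human) order."""
    return sorted(
        (p for p in Path(directory).iterdir() if p.suffix.lower() in exts),
        key=natural_key,
    )
```

A plain lexicographic `sorted()` would put `10.mp3` before `2.mp3`; splitting on digit runs avoids that.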
### Parallel Processing

When processing many files, use `--workers` to parallelize transcription across multiple files simultaneously:

```bash
datasety audio \
  --input ./videos/ \
  --output ./dataset \
  --workers 4
```

The transcription step is I/O-bound (waiting on model inference), so a ThreadPoolExecutor achieves parallelism without loading multiple models into memory. Chunk indices are assigned globally after all files finish, so `metadata.csv` is always in sorted order regardless of completion order.
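The ordering guarantee described above can be sketched as follows. This is a simplified sketch, and `transcribe_file` is a hypothetical stand-in for the real per-file pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(files, transcribe_file, workers=4):
    """Transcribe files in parallel, then number chunks globally in
    input order so metadata stays sorted regardless of which file
    finishes first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though work runs concurrently
        per_file_segments = list(pool.map(transcribe_file, files))

    metadata, idx = [], 0
    for segments in per_file_segments:
        for text in segments:
            idx += 1
            metadata.append(f"utt_{idx:04d}.wav|{text}")
    return metadata
```

The key point is that indices are assigned only after all futures resolve, never from inside worker threads.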
### High-Quality Transcription

Use a larger Whisper model for better transcription accuracy:

```bash
datasety audio \
  --input ./video.mp4 \
  --output ./hq_dataset \
  --whisper-model large-v3 \
  --language en
```

### Number Expansion for TTS
Expand numbers to words so the TTS model knows how to pronounce them:

```bash
datasety audio \
  --input ./video.mp4 \
  --output ./dataset \
  --normalize-numbers
```

### Non-English Languages
For non-English audio, always specify the language code for accurate transcription:

```bash
# Ukrainian
datasety audio \
  --input "https://www.youtube.com/watch?v=..." \
  --output ./dataset \
  --language uk

# Spanish
datasety audio \
  --input ./video.mp4 \
  --output ./dataset \
  --language es
```

### Enabling VAD for Noisy Audio
Voice Activity Detection (VAD) filters out non-speech audio. Enable it for videos with significant background noise or music:

```bash
datasety audio \
  --input ./noisy_video.mp4 \
  --output ./dataset \
  --vad
```

VAD merges continuous speech into fewer, longer segments. Leave it off (the default) for clean monologue where you want fine-grained segment boundaries.
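The merging behavior can be illustrated with a small sketch over `(start, end)` timestamps in seconds; the same idea underlies `--merge-gap`. This is illustrative only, not the actual VAD implementation:

```python
def merge_speech(segments, max_gap=0.5):
    """Merge speech (start, end) intervals separated by less than
    max_gap seconds into single, longer segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # Gap is small enough: extend the previous segment
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged
```

With a large `max_gap` you get few long segments; with `max_gap=0` every detected speech burst stays separate.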
### Time-Slicing from URLs

Extract only a specific segment from a video using `?start=X&end=Y` parameters:

```bash
datasety audio \
  --input "https://www.youtube.com/watch?v=...&start=50&end=90" \
  --output ./dataset
```

This works with both YouTube URLs and local files:

```bash
datasety audio \
  --input "./video.mp4?start=10.5&end=30" \
  --output ./dataset
```

### Text List Input
Pass a `.txt` file containing one URL/path per line to process multiple sources:

```
# sources.txt
https://youtube.com/watch?v=...&start=0&end=60
https://youtube.com/watch?v=...&start=60&end=120
./local_clip.mp4
```

```bash
datasety audio \
  --input sources.txt \
  --output ./dataset \
  --workers 4
```

### Phoneme Map Filtering
Pass a Piper `config.json` or `phonemes.json` to silently drop any audio segments whose transcribed text contains characters (unexpanded numbers, emojis, foreign letters) not in the phoneme map. This prevents training crashes:

```bash
datasety audio \
  --input ./video.mp4 \
  --output ./dataset \
  --phoneme-map /path/to/piper/config.json
```

The script automatically detects whether you passed a full Piper config (it extracts `phoneme_id_map`) or a direct phoneme map.
### Text Cleaning

The pipeline automatically cleans common Whisper transcription artifacts:

- Hallucination loops: removes repeated chaotic text from silent patches
- Punctuation: fixes stray spaces before commas/periods, collapses double spaces
- Hyphens: rejoins word-connecting hyphens (`где - то` → `где-то`)
- Apostrophes: reconnects separated apostrophes (`ім 'я` → `ім'я`)

Disable with `--no-clean-text` if you need to preserve special characters.
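The punctuation, hyphen, and apostrophe rules can be sketched with a few stdlib regexes. Illustrative only; the real cleaner (including hallucination-loop removal) is more involved:

```python
import re

def clean_text(text: str) -> str:
    """Apply the simple cleanup rules described above."""
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)    # no space before punctuation
    text = re.sub(r"(\w) - (\w)", r"\1-\2", text)   # "где - то" -> "где-то"
    text = re.sub(r"(\w) ?' ?(\w)", r"\1'\2", text) # "ім 'я" -> "ім'я"
    text = re.sub(r"  +", " ", text)                # collapse double spaces
    return text.strip()
```

Note that `\w` matches Unicode letters in Python 3, so the hyphen and apostrophe rules work for Cyrillic text as well.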
### Dry Run

Preview what would be processed without downloading or transcribing:

```bash
datasety audio \
  --input ./clips/ \
  --output ./dataset \
  --dry-run \
  --verbose
```

## Pipeline Steps
- Download (if remote): uses `yt-dlp` to download YouTube/URL media
- Extract: FFmpeg extracts audio as mono WAV at the target sample rate
- Isolate (optional): Demucs separates vocals from background
- Transcribe: Faster-Whisper identifies speech segments (VAD is off by default for cleaner segmentation; use `--vad` to enable)
- Slice: audio is cut into segments matching speech timestamps, filtered by min/max duration
- Normalize: text is cleaned (special characters stripped, numbers expanded if enabled)
- Export: audio chunks are saved to `wavs/`, metadata to `metadata.csv`
- Deduplicate: consecutive duplicate text entries are removed (prevents Whisper hallucinations from creating duplicate chunks)
## Requirements

- `ffmpeg` must be installed and on PATH
- Optional dependencies (install with `pip install datasety[audio]`):
  - `yt-dlp` - for YouTube/URL downloading
  - `demucs` - for vocal isolation
  - `faster-whisper` - for transcription
  - `soundfile` - for audio slicing
  - `num2words` - for number expansion
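A quick preflight check for these requirements could look like this. This is a hypothetical helper, not part of datasety; note that the pip package names use hyphens but the import names use underscores:

```python
import importlib.util
import shutil

OPTIONAL = ("yt_dlp", "demucs", "faster_whisper", "soundfile", "num2words")

def missing_requirements():
    """Return the names of requirements that are not available."""
    missing = []
    if shutil.which("ffmpeg") is None:  # must be on PATH
        missing.append("ffmpeg")
    # find_spec checks importability without actually importing
    missing += [m for m in OPTIONAL if importlib.util.find_spec(m) is None]
    return missing
```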
## Use with Piper Training

The output format is compatible with OHF-Voice/piper1-gpl:

```bash
piper-train fit \
  --data.voice_name "my_voice" \
  --data.csv_path /path/to/dataset/metadata.csv \
  --data.audio_dir /path/to/dataset/wavs/ \
  --model.sample_rate 22050 \
  --data.espeak_voice "en" \
  --data.cache_dir /path/to/cache/ \
  --data.batch_size 32
```