JP TTS Engine Comparison

23 voices across 6 engines · Rate 1-10, review aggregates, pipeline tracks moves · M4 Pro 64GB

最近、新しい本を読み始めました。物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。

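Test passage rendered by every engine: roughly 10 s of natural narrative JP. English gloss: "Recently I started reading a new book. The story begins in a quiet town, and the protagonist slowly comes to face their own past."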


Click 1-10 under each card to rate. Ratings persist locally (localStorage). Sort/filter at top, but listening order stays shuffled until you opt in.

Aggregated from your ratings. Updates live.
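The aggregation described above can be sketched roughly as follows. This is a minimal illustration, assuming each rating is stored as an (engine, voice, score) record. That layout, the function name, and the sample scores are assumptions made for the example, not data from this comparison.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_engine(ratings):
    """Return {engine: (mean_score, n_rated)} from (engine, voice, score) tuples.

    Scores outside 1-10 are rejected, mirroring the 1-10 rating buttons.
    """
    buckets = defaultdict(list)
    for engine, voice, score in ratings:
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range for {engine}/{voice}: {score}")
        buckets[engine].append(score)
    return {
        engine: (round(mean(scores), 2), len(scores))
        for engine, scores in buckets.items()
    }


# Placeholder ratings, invented for the example only.
sample = [
    ("Irodori-TTS v3", "clone", 8),
    ("Irodori-TTS v3", "caption", 7),
    ("Qwen3-TTS", "Ono_Anna", 6),
]
summary = aggregate_by_engine(sample)
```

Keeping the count next to the mean makes it easy to see when an engine's average rests on only one or two rated voices.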

Per-engine: what it is, how it works, license, Apple Silicon viability, the invocation that produced its sample.

Fish Audio S2 Pro · ⚠ Fish Audio Research License (NC) · MLX 8-bit (mlx-audio) · Fish Audio · Dual-AR + RL · 10M+ hours · 80+ langs · 44.1kHz · NEW this push

How it works

Dual-Autoregressive (Dual-AR) architecture with RL alignment for natural prosody and emotional richness. Sub-word level prosody/emotion tags ([whisper], [excited], [angry]). Native multi-speaker and multi-turn dialogue. Trained on 10M+ hours, 80+ languages.

Strong

The "Fish sound" that started this whole hunt. Strong JP. Apple-Silicon-native via mlx-audio (port of fishaudio/s2-pro, 906 downloads, last updated Mar 2026). Zero-shot cloning works.

Weak

The license is the original problem: it is non-commercial (NC). A personal blind test is fine; a commercial Manaoke release is not. The legal route to commercial Fish output is the hosted Fish Audio Plus tier at $11/mo.

Verdict

This is the ceiling reference. If something else in this A/B matches or beats it, you have a permissive answer to the Fish-quality question. If only Fish wins, the $132/yr subscription is the path of least resistance.

python fish_tts.py "..." --ref asset/voice_matsukaze.wav --ref-text "..."

Supertonic 3 · MIT code · OpenRAIL weights · ONNX native (no diffusion) · Supertone Inc (HYBE, Korea) · 99M params · 31 langs · 44.1kHz · released 2026-04-29

How it works

4-ONNX pipeline: duration_predictor → text_encoder → vector_estimator → vocoder. 99M params total — tiny vs 500M-2B competitors. No diffusion = no Apple Silicon MLX-diffusion gap. ONNX Runtime is mature on every platform.

Strong

Tiny and fast: faster than real time on Apple Silicon. 6 pre-trained voice styles (F1-F3 / M1-M3). Inline expression tags. 44.1kHz studio-grade output. 9,294 stars in 6 months. A Korean vendor that gives JP top-tier attention.

Weak / unknown

JP quality vs JP-native engines unknown. 99M is small — may underperform on pitch-accent micro-prosody. OpenRAIL has use-based restrictions (no illegal content, no impersonation without consent).

python supertonic_tts.py "..." --lang ja --voice F1 # F1-F3 / M1-M3

VoxCPM2 · Apache-2.0 · MLX (2nd-class) · OpenBMB · 2B params · tokenizer-free DiT · 48kHz

How it works

Tokenizer-free diffusion-autoregressive: LocEnc → TSLM → RALM → LocDiT. AudioVAE V2 reconstructs 48kHz. Trained on 2M+ hours multilingual. Five modes including Voice Design (instruct), clone, ultimate.

Strong

Multilingual coverage incl. JP. Apache-2.0. Voice Design via natural-language prompt. 48kHz output.

Weak on Apple Silicon

Diffusion-on-MLX undercooked — maintainer redirects performance-hungry users to nanovllm-voxcpm (CUDA fork). Clone mode produces "fluent-foreigner accent" (HF discussion #14).

JP accent fix (maintainer-confirmed)

Switch clone → Voice Design (instruct), lower CFG 2.0→1.5, chunk to ≤2 sentences. From VoxCPM issue #222.

python voxcpm_tts.py "..." --mode default --instruct "落ち着いた大人の朗読、自然な日本語" --cfg-value 1.5
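The accent fix above (Voice Design mode, CFG 1.5, at most 2 sentences per call) could be scripted along these lines. This is a sketch under stated assumptions: the sentence splitter is a simple regex on 。！？, and the helper names are hypothetical. Only `voxcpm_tts.py` and its `--mode`, `--instruct`, and `--cfg-value` flags come from the invocation shown above. The function only builds argument lists; it does not run anything.

```python
import re


def split_sentences(text):
    """Split Japanese text after each sentence-ending mark (。！？)."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_sentences(text, max_sentences=2):
    """Group sentences into chunks of at most max_sentences each."""
    sentences = split_sentences(text)
    return [
        "".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def voxcpm_commands(text, instruct, cfg_value=1.5):
    """Build one argument list per chunk, suitable for subprocess.run()."""
    return [
        ["python", "voxcpm_tts.py", chunk,
         "--mode", "default",
         "--instruct", instruct,
         "--cfg-value", str(cfg_value)]
        for chunk in chunk_sentences(text)
    ]


commands = voxcpm_commands(
    "一つ目。二つ目。三つ目。",
    "落ち着いた大人の朗読、自然な日本語",
)
```

Each resulting list can be passed to `subprocess.run(cmd, check=True)`. Concatenating the per-chunk audio afterward is left out here because the script's output options are not shown in this document.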

Irodori-TTS v3 · MIT · MLX native · Aratako (Chihiro Arata) · 500M params · RF-DiT + DACVAE · 48kHz · JP-only

How it works

Flow-matching DiT with cross-attention text/caption conditioning. Three condition signals: text, caption (Voice Design), ref_audio (clone). Speaker Inversion (PR #18, Apr 2026) addresses cross-sentence drift.

Strong

Built JP-first: JP creators specifically praise it for reducing 「ペラペラ外国人っぽさ」 (the "fluent but foreign-sounding" quality). MIT. 40+ emoji emotion markers (😭/🤧/😄). Small + fast on Apple Silicon.

Weak

Weak on kanji: pre-convert unusual kanji to hiragana. Long sentences (>30s) cause skipping, so chunk them. Narrow dynamic range (issue #9, unanswered 3+ weeks). Cross-sentence voice drift (Wooly-Fluffy #136; may be fixed by Speaker Inversion).

python irodori_tts.py "..." --caption "落ち着いた大人の朗読、自然な日本語"

Qwen3-TTS · Apache-2.0 · MLX native · Alibaba · 24kHz · 18 langs incl JP

How it works

Multilingual TTS with built-in voice catalog (Ono_Anna for JP) plus zero-shot clone. Natural-language instruct prompt for emotion/style.

Verdict

"Decent, not Fish-tier." Reliable Apache baseline. Use when known-working commercial-clean is enough.

python qwen_tts.py "..." --voice Ono_Anna --instruct "穏やかな大人の朗読"

Recommended next research moves

5 broad research agents (Twitter, YouTube, Reddit, Academic, Alt-routes) · done
4 GitHub deep-dive agents (Irodori, VoxCPM2, AivisSpeech, SBV2) · done
Gap-fill agent (demos, HF, Discord, CN, awesome-lists) · done
VoxCPM2 Voice Design + CFG 1.5 + chunked variants · done
Irodori v3 via mlx-audio (just merged 2026-05-18) · done
Supertonic 3 (Korean, 99M ONNX, all 6 voices) · done
Fish Audio S2 Pro via mlx-audio (default + Matsukaze clone) · done
Test Irodori v3 + Speaker Inversion knobs (speaker_kv_scale), the first public test of the post-PR-#18 configuration · next
VOICEVOX (the JP-standard baseline, robotic but a useful floor) · next
One month of Fish Audio Plus ($11) to compare hosted vs MLX-port quality · next
Step-Audio-EditX (Apache-2.0, JA-capable): skipped this round (China vendor) but architecturally interesting · future
If Supertonic 3 wins the A/B, look at training a custom voice via their Voice Builder · future

Quality levers by engine

VoxCPM2 — fix the accent

HF discussion #14 + VoxCPM issue #222 maintainer-confirmed:

Voice Design mode (--mode default --instruct "...") not clone
CFG 1.5 instead of default 2.0 (--cfg-value 1.5)
Chunk to ≤2 sentences per call
Avoid self-conditioning in Voice Design mode (produced 192s of hallucinated audio in our test)

Irodori v3 — play to strengths

Pre-convert unusual kanji to hiragana
Chunk anything >30s
Tune speaker_kv_scale for ref-adherence vs naturalness
Emoji emotion markers: 😭/🤧/😄
Accept the narrow dynamic range and correct it in the production mix afterward
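The kanji pre-conversion step could look something like the sketch below. It assumes a small hand-maintained reading table, and the `readings` dictionary and helper names are hypothetical. A full solution would more likely use a morphological analyzer, which is outside this sketch. The second helper lists any kanji still left in the text so that you can decide whether they are common enough to keep.

```python
import re

# CJK Unified Ideographs (basic block) plus the iteration mark 々.
KANJI = re.compile(r"[\u4e00-\u9fff々]")


def preconvert(text, readings):
    """Replace listed kanji words with their hiragana readings.

    Longer keys are applied first so that a compound is not broken up
    by a shorter entry it happens to contain.
    """
    for word in sorted(readings, key=len, reverse=True):
        text = text.replace(word, readings[word])
    return text


def remaining_kanji(text):
    """Return the distinct kanji still present, sorted by code point."""
    return sorted(set(KANJI.findall(text)))


# Hypothetical reading table; extend it with whatever the engine misreads.
readings = {"主人公": "しゅじんこう", "過去": "かこ"}
converted = preconvert("主人公はゆっくりと自分の過去と向き合っていきます。", readings)
leftover = remaining_kanji(converted)
```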

Supertonic 3 — what to try

F1 Mina is the recommended default voice
Quality steps 5-12 (default 8); higher = better but slower (still real-time-plus)
Speed 0.7-2.0
Inline tags: <laugh>, <breath>, <sigh>, and 7 more
For unknown-language text, pass lang="na"
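The ranges above can be checked before a render is launched. The sketch below is a hypothetical settings object, not Supertonic's API: the class and field names are invented for illustration, and the default speed of 1.0 is an assumption. Only the voice names, the 5-12 step range with default 8, and the 0.7-2.0 speed range come from the list above. Because this document does not show how steps or speed are passed to `supertonic_tts.py`, the sketch validates values and stops there.

```python
from dataclasses import dataclass

VOICES = ("F1", "F2", "F3", "M1", "M2", "M3")


@dataclass(frozen=True)
class SupertonicSettings:
    """Hypothetical container that validates the documented ranges."""
    voice: str = "F1"    # F1 is the recommended default voice
    steps: int = 8       # quality steps, 5-12
    speed: float = 1.0   # assumed default; allowed range 0.7-2.0
    lang: str = "ja"     # "na" for unknown-language text

    def __post_init__(self):
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice}")
        if not 5 <= self.steps <= 12:
            raise ValueError(f"steps must be within 5-12, got {self.steps}")
        if not 0.7 <= self.speed <= 2.0:
            raise ValueError(f"speed must be within 0.7-2.0, got {self.speed}")


default_settings = SupertonicSettings()
slow_high_quality = SupertonicSettings(voice="M2", steps=12, speed=0.9)
```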

Fish S2 Pro — research-only

License limits it to personal use only: no commercial Manaoke release
Use the in-A/B Fish renders as your quality ceiling
If something here ties or beats it, ship that instead
If only Fish wins on your ear: $11/mo Plus seat is the legal commercial path
Sub-word emotion tags: [whisper], [excited], [angry], etc.

Optimal config per use case

Manaoke karaoke

Local-only mandate. Default: Irodori v3 male clone (matsukaze ref) — JP-native consensus pick across April-May 2026 YouTube reviewers, MIT license, MLX 8-bit. Confirmed in Round-8 (2026-05-25).

Voice variety: VoxCPM2 Voice Design with character-instructs for non-narrator lines.

Avoid: AivisSpeech / SBV2 (transitive AGPL), all cloud TTS.

DBZ Narrator (personal)

Default: top-rated MIT/Apache engine from your blind A/B.

For personal viewing, AGPL is acceptable if needed, but Irodori (MIT) and Supertonic (MIT code, OpenRAIL weights) are cleaner defaults.
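Picking the top-rated MIT/Apache engine from blind A/B results can be sketched as a filter followed by a max. The license strings below follow the engine cards in this comparison; the mean scores are placeholders invented for the example, and the function name is hypothetical.

```python
PERMISSIVE = {"MIT", "Apache-2.0"}

# License per engine, as listed on the engine cards.
LICENSES = {
    "Fish Audio S2 Pro": "Fish Audio Research License (NC)",
    "Supertonic 3": "MIT",  # code is MIT; weights are OpenRAIL (use-based restrictions)
    "VoxCPM2": "Apache-2.0",
    "Irodori-TTS v3": "MIT",
    "Qwen3-TTS": "Apache-2.0",
}


def top_permissive(mean_scores):
    """Return the best-scoring engine whose license is MIT or Apache-2.0.

    Returns None when no permissively licensed engine has a score yet.
    """
    eligible = {
        engine: score
        for engine, score in mean_scores.items()
        if LICENSES.get(engine) in PERMISSIVE
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Placeholder means, not results from this A/B.
example_scores = {"Fish Audio S2 Pro": 9.0, "Irodori-TTS v3": 8.0, "Qwen3-TTS": 6.5}
pick = top_permissive(example_scores)
```

With these placeholder numbers Fish scores highest but is filtered out by its license, which is the situation the verdict for Fish describes.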

Mom's record narration

Local default: Irodori v3 with Voice Design instruct prompts for narration style. MIT license keeps personal-use simple.

Silo / screenplay reads (offline)

Air-gapped Silo machine: no cloud. Local-only mandate.

Bet on the top-rated local Apache/MIT engine. Chunk the script per character, giving each character a distinct caption/instruct.
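Per-character chunking might be planned as in the sketch below. It assumes a simple "NAME: line" screenplay layout with an ASCII colon, which is an assumption about the input and not something this comparison specifies. The character names, their captions, and the function name are hypothetical; the narrator caption reuses the prompt from the invocations above.

```python
def plan_reads(script, captions, default_caption):
    """Turn a "NAME: line" script into (caption, text) pairs, one per line.

    Lines without a known speaker prefix fall back to default_caption.
    """
    plan = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip() in captions:
            plan.append((captions[speaker.strip()], text.strip()))
        else:
            plan.append((default_caption, line))
    return plan


# Hypothetical cast and captions for illustration.
captions = {
    "NARRATOR": "落ち着いた大人の朗読、自然な日本語",
    "AKI": "明るく元気な若い女性",
}
script = """
NARRATOR: 物語は静かな町から始まります。
AKI: おはよう。
"""
plan = plan_reads(script, captions, captions["NARRATOR"])
```

Each (caption, text) pair then maps onto one engine call, for example the `--caption` argument of the Irodori invocation or the `--instruct` argument of the VoxCPM2 one.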

Architectural reality on Apple Silicon (2026)

Things that aren't changing soon.

Diffusion-on-MLX is undercooked. VoxCPM2 maintainer redirects performance-hungry users to nanovllm-voxcpm (CUDA). Apple's ml-explore/mlx team has zero in-flight TTS work in 2026.
AGPL-3.0 engines are CPU-only on Mac by maintainer choice. tsukumijima is explicit: 「GPU 対応は積極的には行っておりません」 ("We are not actively working on GPU support").
The actively Mac-friendly engines are RF-DiT (Irodori), Apache multilingual (VoxCPM2, Qwen3-TTS), 4-ONNX (Supertonic). Bet on these for local; bet on cloud for commercial polish.