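Dia Installation and Usage Guide (Mac)

Overview

This guide explains how to install and use the Dia TTS model on a Mac, particularly Apple Silicon (M1/M2/M3/M4). Dia officially supports CUDA GPU environments, but it can also run on a Mac by using Apple Silicon's Metal Performance Shaders (MPS) backend or by falling back to the CPU.

Note: Running on Apple Silicon is not officially supported, and processing can be significantly slower than on CUDA. If you just want a quick trial, try the Hugging Face Spaces demo first.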

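Prerequisites

Check the following before installing.

macOS: latest version recommended (Ventura 13.x or later)
Python: 3.10 or later
RAM: at least 16GB (32GB or more recommended, since the model is loaded into memory)
Storage: about 10GB or more of free space (including model files)
Homebrew: package manager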
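
The checklist above can be partly automated. The following is a minimal sketch using only the Python standard library; the function name check_prerequisites and its thresholds are illustrative assumptions, and it does not check RAM or Homebrew.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 10), min_free_gb=10.0, path="."):
    """Return basic environment facts plus pass/fail flags (hypothetical helper)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    system = platform.system()    # "Darwin" on macOS
    machine = platform.machine()  # "arm64" on Apple Silicon (native Python)
    return {
        "python_version": platform.python_version(),
        "python_ok": sys.version_info[:2] >= min_python,
        "system": system,
        "machine": machine,
        "is_apple_silicon": system == "Darwin" and machine == "arm64",
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

Note that a Python interpreter running under Rosetta reports x86_64 as the machine type, so is_apple_silicon can be False even on an M-series Mac.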

If Homebrew is not installed, install it with the command below.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Python and Git with Homebrew.

brew install python git

Installation

1. Clone the repository

git clone https://github.com/nari-labs/dia.git
cd dia

2. Create and activate a virtual environment

Using a virtual environment is strongly recommended so that the system Python environment stays clean.

python3 -m venv .venv
source .venv/bin/activate

Once the virtual environment is active, (.venv) appears at the start of the terminal prompt.

3. Install dependencies

pip install --upgrade pip
pip install -e .

Alternatively, if you use the uv package manager (faster installation):

pip install uv
uv run example/simple.py

4. Install Hugging Face Transformers (optional)

If you want to use Dia through the Transformers library, install the development version.

pip install git+https://github.com/huggingface/transformers.git

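Apple Silicon MPS Setup

On an Apple Silicon Mac, you can use Apple's Metal Performance Shaders (MPS) instead of CUDA. Some operations are not supported on MPS, however, so an environment variable must be set.

Enabling the MPS fallback

Set the environment variable below in the terminal before running.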

export PYTORCH_ENABLE_MPS_FALLBACK=1

Alternatively, add it at the very top of your Python script, before torch is imported.

import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
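
Because the flag is generally expected to be set before PyTorch is loaded, it can help to detect the wrong import order. The sketch below assumes a hypothetical helper named enable_mps_fallback; it sets the variable and reports whether torch had already been imported in the current process.

```python
import os
import sys


def enable_mps_fallback():
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1.

    Returns False if torch was already imported, in which case the flag
    may be too late to take effect in this process.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return not torch_already_imported


if not enable_mps_fallback():
    print("Warning: torch was imported before the fallback flag was set.")
```

Calling this helper as the first statement of a script, before any import that pulls in torch, keeps the ordering explicit.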

Device selection code

Use the following pattern in Python code to select a device automatically, in the order CUDA, MPS, CPU.

import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")
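
The same priority order can be isolated in a small pure function, which makes it testable on machines without a GPU. This is a sketch; choose_device and detect_device are hypothetical names, and returning "cpu" when torch is not installed is an assumption made for illustration.

```python
def choose_device(cuda_available, mps_available, mps_built):
    """Apply the CUDA > MPS > CPU priority to plain boolean flags."""
    if cuda_available:
        return "cuda"
    if mps_available and mps_built:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for the flags; report "cpu" if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return choose_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
        torch.backends.mps.is_built(),
    )


print(f"Using device: {detect_device()}")
```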

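Forcing float32 (to avoid MPS errors)

MPS does not support float64 operations, which can cause errors. In that case, explicitly set compute_dtype="float32" when loading the model.

import torch
from dia.model import Dia

# `device` comes from the device selection code above.
# The target device is passed to from_pretrained rather than via .to().
model = Dia.from_pretrained(
    "nari-labs/Dia-1.6B", compute_dtype="float32", device=torch.device(device)
)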

Basic Dialogue TTS Usage

Below is the most basic example, which synthesizes a dialogue between two speakers.

import os

# Enable the MPS fallback (Apple Silicon). Set this before importing torch.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import soundfile as sf
from dia.model import Dia

# Select the device
if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load the model (the first run downloads several GB of model files)
model = Dia.from_pretrained(
    "nari-labs/Dia-1.6B", compute_dtype="float32", device=torch.device(device)
)

# Write the dialogue text (the [S1] and [S2] tags mark the speakers)
text = """[S1] Hello! How are you doing today?
[S2] I'm doing great, thanks for asking! (laughs) How about you?
[S1] Pretty well! I've been working on a new project. (sighs) It's been a lot of work.
[S2] I hear you. (chuckles) Let me know if you need any help."""

# Generate the speech
output = model.generate(text)

# Save to an audio file
sf.write("output.wav", output, samplerate=44100)
print("Created output.wav.")

Example with non-verbal expressions

Non-verbal expression tags can make the generated speech sound more natural and expressive.

# Example using a variety of non-verbal expressions
text = """[S1] Did you hear the news? (gasps) It's incredible!
[S2] What happened? Tell me everything. (mumbles) I can't believe it.
[S1] They finally announced the product launch. (claps) We've been waiting for months!
[S2] (sighs) Finally. I was starting to lose hope. (chuckles) When does it ship?
[S1] Next week! (inhales) I'm so excited I can barely speak."""

output = model.generate(text)
sf.write("expressive_output.wav", output, samplerate=44100)

Main supported tags: (laughs), (chuckles), (sighs), (gasps), (coughs), (clears throat), (mumbles), (groans), (humming), (singing), (screams), (claps), (applause), (sneezes), (burps), (whistles), (inhales), (exhales), (beep), (sniffs)
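
Since tags outside this list may not be rendered as intended, a script can be scanned before generation. The following is a minimal sketch; find_unsupported_tags is a hypothetical helper, and SUPPORTED_TAGS simply mirrors the list in this guide rather than anything exported by the Dia package.

```python
import re

# Tags copied from the list in this guide (not read from the Dia package).
SUPPORTED_TAGS = {
    "laughs", "chuckles", "sighs", "gasps", "coughs", "clears throat",
    "mumbles", "groans", "humming", "singing", "screams", "claps",
    "applause", "sneezes", "burps", "whistles", "inhales", "exhales",
    "beep", "sniffs",
}


def find_unsupported_tags(script):
    """Return parenthesized phrases in the script that are not known tags."""
    found = re.findall(r"\(([^()]*)\)", script)
    return [tag for tag in found if tag.strip().lower() not in SUPPORTED_TAGS]


script = "[S1] Did you hear? (gasps) [S2] Tell me. (leans forward)"
print(find_unsupported_tags(script))
```

This check treats every parenthesized phrase as a tag, so ordinary parenthetical remarks in the dialogue are also reported.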

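Voice Cloning Usage

Voice cloning lets you synthesize new text in the voice of a specific speaker.

Requirements

Reference audio: a WAV file of 5 to 10 seconds (clean audio with no background noise)
Reference transcript: the exact text of what is said in the reference audio

Voice cloning code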

import os

# Set the MPS fallback before importing torch.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import soundfile as sf
from dia.model import Dia

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = Dia.from_pretrained(
    "nari-labs/Dia-1.6B", compute_dtype="float32", device=torch.device(device)
)

# Path to the reference audio file
audio_path = "reference_voice.wav"

# Text format: [reference audio transcript] + [new text to generate]
# Write what the reference audio says first, then the text to generate
text = "[S1] This is what the reference audio says. [S1] This is the new text I want to generate in the same voice."

# Generate with the reference audio as a prompt
output = model.generate(text, audio_prompt=audio_path)

sf.write("cloned_voice_output.wav", output, samplerate=44100)
print("Voice cloning complete: cloned_voice_output.wav")

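Cautions

The transcript must match the actual content of the reference audio exactly for the best results.
The generated text must use the same tag ([S1] or [S2]) as the speaker in the reference audio.
Reference audio that is too short (under 5 seconds) or too long (over 15 seconds) can degrade output quality.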
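
The length limits above can be checked before running the model. This sketch uses the standard-library wave module, which reads uncompressed PCM WAV files; the helper names and the 5 to 15 second defaults are illustrative and taken from the cautions in this guide.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path, min_seconds=5.0, max_seconds=15.0):
    """Return (is_within_range, duration_in_seconds) for a reference clip."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration
```

For example, check_reference_length("reference_voice.wav") returns a pair whose first element is False when the clip falls outside the range. WAV files in other encodings may need a library such as soundfile instead.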

Troubleshooting Common Errors

Error 1: MPS float64 type conversion error

RuntimeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64.

Fix: Specify compute_dtype="float32" when loading the model, and set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1.

export PYTORCH_ENABLE_MPS_FALLBACK=1

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float32")

Error 2: Module not found

ModuleNotFoundError: No module named 'dia'

Fix: Make sure the package was installed while the virtual environment was active.

source .venv/bin/activate
pip install -e .
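
To see which of the two causes applies (inactive virtual environment or missing package), the interpreter can be inspected directly. This is a small sketch using only the standard library; diagnose_import is a hypothetical helper name.

```python
import importlib.util
import sys


def diagnose_import(module_name="dia"):
    """Report whether a virtual environment is active and the module is visible."""
    return {
        # In a venv, sys.prefix points at the venv and differs from base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "module_found": importlib.util.find_spec(module_name) is not None,
        "python_executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in diagnose_import().items():
        print(f"{key}: {value}")
```

If in_virtualenv is False, activate .venv first; if it is True but module_found is False, rerun the install command inside the environment.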

Error 3: Out of memory (OOM)

RuntimeError: MPS backend out of memory

Fix: Use the CPU instead of MPS, or close other applications to free up memory.

device = "cpu"  # Use the CPU instead of MPS
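
The switch to the CPU can also be made automatic. The sketch below retries on the next device only when the error message mentions running out of memory; generate_with_cpu_fallback and the generate_on callable are hypothetical and not part of the Dia package, and the caller is assumed to supply a function that loads the model on the given device and generates audio.

```python
def generate_with_cpu_fallback(generate_on, text, devices=("mps", "cpu")):
    """Try each device in order, moving on only after an out-of-memory error.

    generate_on(device, text) is a caller-supplied callable (hypothetical).
    Returns a (device_used, result) pair.
    """
    if not devices:
        raise ValueError("devices must not be empty")
    last_error = None
    for device in devices:
        try:
            return device, generate_on(device, text)
        except RuntimeError as error:
            # Re-raise anything that is not an out-of-memory condition.
            if "out of memory" not in str(error).lower():
                raise
            last_error = error
    raise last_error
```

Matching on the error message is a heuristic based on the message shown above, so it may need adjusting for other PyTorch versions.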

Error 4: Model download failure

ConnectionError: HTTPSConnectionPool...

Fix: If access to the Hugging Face Hub is blocked, use a VPN or download the model in advance.

pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download('nari-labs/Dia-1.6B')"

Error 5: Generation is very slow

When running in CPU mode on Apple Silicon, generating one minute of audio can take tens of minutes. This is normal behavior.

Fix:

Test with short text first.
Try MPS mode (after setting PYTORCH_ENABLE_MPS_FALLBACK=1).
Use a cloud GPU environment (Google Colab, Vast.ai, etc.).
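
To judge whether a short test run is acceptably fast, the generation time can be compared with the length of the audio produced. This is a minimal sketch; timed and real_time_factor are hypothetical helpers, and the 44100 Hz default matches the sample rate used in the examples in this guide.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(num_samples, elapsed_seconds, sample_rate=44100):
    """Seconds of compute per second of audio; values above 1 are slower than real time."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds
```

With the model from the basic example, this could be used as output, elapsed = timed(model.generate, text) followed by real_time_factor(len(output), elapsed). One minute of audio that takes ten minutes to generate gives a factor of 10.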

Reference Links

GitHub repository: https://github.com/nari-labs/dia
Hugging Face model: nari-labs/Dia-1.6B
Online demo: Hugging Face Spaces
PyTorch MPS guide: PyTorch Apple Silicon support documentation
Nari Labs official site: https://dianarilabs.com
