# AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
## When to use AudioCraft

Use AudioCraft when you:

- Need to generate music from text descriptions
- Are creating sound effects and environmental audio
- Are building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer
Key features:
- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectrogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics
## Quick start

### Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```
### Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,       # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```
### Basic text-to-music (Transformers)

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
### Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```
## Core concepts

### Architecture overview

```
AudioCraft Architecture:

┌──────────────────────────────────────────────────────┐
│                  Text Encoder (T5)                   │
│                          │                           │
│                   Text Embeddings                    │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│               Transformer Decoder (LM)               │
│      Auto-regressively generates audio tokens        │
│     using efficient token interleaving patterns      │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                EnCodec Audio Decoder                 │
│        Converts tokens back to audio waveform        │
└──────────────────────────────────────────────────────┘
```
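To see these pieces on a loaded model, you can inspect the components the `MusicGen` wrapper carries. A minimal sketch, assuming the wrapper exposes the `lm` (transformer decoder) and `compression_model` (EnCodec) attributes audiocraft uses internally:

```python
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

# EnCodec defines the discrete token space the LM predicts:
# several parallel codebooks at ~50 token frames/second for 32 kHz audio.
print(model.sample_rate)                   # 32000
print(model.compression_model.frame_rate)  # token frames per second
print(type(model.lm).__name__)             # the autoregressive transformer
```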
### Model variants

| Model | Size | Description | Use Case |
|---|---|---|---|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |
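All variants load the same way; only the model ID changes (audiocraft resolves `facebook/<model-name>` from the Hugging Face Hub):

```python
from audiocraft.models import AudioGen, MusicGen

# Any MusicGen variant is selected by its model ID
model = MusicGen.get_pretrained('facebook/musicgen-melody-large')

# AudioGen has its own wrapper class
sfx_model = AudioGen.get_pretrained('facebook/audiogen-medium')
```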
### Generation parameters

| Parameter | Default | Description |
|---|---|---|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance strength |
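The two knobs that change results most are `temperature` (sampling diversity) and `cfg_coef` (how literally the prompt is followed). A quick way to build intuition is to sweep one at a time; a minimal sketch:

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

prompt = ["minimal techno with a driving bassline"]
for cfg in (1.0, 3.0, 7.0):
    # Higher cfg_coef follows the text more strictly; very high values
    # can sound less natural.
    model.set_generation_params(duration=8, temperature=1.0, cfg_coef=cfg)
    wav = model.generate(prompt)
    torchaudio.save(f"cfg_{cfg}.wav", wav[0].cpu(), sample_rate=model.sample_rate)
```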
## MusicGen usage

### Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds per pass
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```
### Melody-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```
### Stereo generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000] for 15 s at 32 kHz
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```
### Audio continuation

```python
import scipy.io.wavfile
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("continuation.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
## MusicGen-Style usage

### Style-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence (double CFG)
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers used for the style token (1-6)
    excerpt_length=3.0   # Length of the style excerpt, in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```
### Style-only generation (no text)

```python
# Generate matching the style reference without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only generation
)

wav = model.generate_with_style([None], style_audio, sr)
```
## AudioGen usage

### Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```
## EnCodec usage

### Audio compression

```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec (the 32 kHz codec used by MusicGen; it is mono)
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Downmix to mono and resample to the codec's rate
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)
if sr != 32000:
    wav = torchaudio.transforms.Resample(sr, 32000)(wav)

# Encode to discrete tokens
with torch.no_grad():
    codes, scale = model.encode(wav.unsqueeze(0))
    # codes shape: [batch, n_codebooks, frames]

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes, scale)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
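The compression ratio follows directly from the codec configuration. Assuming the published setup of the 32 kHz MusicGen codec (4 RVQ codebooks of 2048 entries at a 50 Hz frame rate), a back-of-the-envelope check:

```python
import math

n_codebooks = 4       # parallel RVQ codebooks
codebook_size = 2048  # entries per codebook -> log2(2048) = 11 bits per token
frame_rate = 50       # token frames per second

bitrate = n_codebooks * math.log2(codebook_size) * frame_rate
print(f"{bitrate / 1000:.1f} kbps")         # ~2.2 kbps
print(f"{32000 * 16 / 1000:.0f} kbps raw")  # vs. 512 kbps for 16-bit PCM at 32 kHz
```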
## Common workflows

### Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```
### Workflow 2: Sound design batch processing

```python
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```
### Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```
## Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```
## Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # One generation per prompt (slower)
```
## GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
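Actual usage varies with duration and batch size, so it is worth measuring on your own hardware. A minimal sketch using PyTorch's built-in CUDA memory counters:

```python
import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)

torch.cuda.reset_peak_memory_stats()
wav = model.generate(["ambient pad with soft strings"])

# Peak VRAM allocated by tensors during this generation
print(f"{torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```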
## Common issues

| Issue | Solution |
|---|---|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase cfg_coef, write more specific prompts |
| Generation too short | Check the duration setting (see the long-generation sketch below) |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant |
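For clips longer than a single ~30-second pass, recent audiocraft releases let `set_generation_params` take an `extend_stride` value: the model generates in overlapping windows, and `extend_stride` (in seconds) controls how far each window advances. A minimal sketch, assuming the default sliding-window extension:

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-small')

# Durations beyond ~30 s are produced by iteratively extending the clip.
model.set_generation_params(duration=60, extend_stride=18)

wav = model.generate(["slowly evolving ambient drone"])
torchaudio.save("long.wav", wav[0].cpu(), sample_rate=model.sample_rate)
```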