Low-overhead graphics + compute for every Apple GPU
Metal is Apple’s GPU API that unifies high-performance graphics and general-purpose compute under a single, low-overhead interface across all of its platforms. By exposing near-bare-metal control of command submission, memory residency, and shader execution, it lets you squeeze maximum throughput from A-series, M-series, and discrete AMD GPUs on iPhone, iPad, Mac, Apple TV, and Apple Vision Pro.
Devices, queues, resources, and pipelines
1. MTLDevice represents the physical GPU and creates every other object.
2. MTLCommandQueue issues command buffers, each housing parallel MTLCommandEncoders for graphics, compute, or blit workloads.
3. MTLBuffer and MTLTexture store data; heaps & residency sets group related resources for unified-memory GPUs.
4. MTLRenderPipelineState & MTLComputePipelineState compile shader functions and fixed-function state into GPU-specific binaries; a minimal host-side setup using all four appears right after this list.
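In Swift, wiring those four objects together might look like the sketch below. The shader names vertMain and fragMain match the MSL example later in this section; the pixel format, buffer size, and error handling are simplifying assumptions, not a fixed recipe.
// Host-side setup sketch (Swift) — names and formats are illustrative assumptions.
import Metal

guard let device = MTLCreateSystemDefaultDevice(),       // MTLDevice: the physical GPU
      let queue  = device.makeCommandQueue() else {      // MTLCommandQueue: submits command buffers
    fatalError("Metal is not supported on this system")
}

// MTLBuffer: 1 MiB of shared (unified-memory) storage.
let vertexBuffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)!

// MTLRenderPipelineState: shader functions + fixed-function state compiled to a GPU binary.
let library = device.makeDefaultLibrary()!
let desc = MTLRenderPipelineDescriptor()
desc.vertexFunction   = library.makeFunction(name: "vertMain")
desc.fragmentFunction = library.makeFunction(name: "fragMain")
desc.colorAttachments[0].pixelFormat = .bgra8Unorm        // assumed drawable format
let pipeline = try! device.makeRenderPipelineState(descriptor: desc)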
C++17-inspired syntax for vertex, fragment, kernel, and mesh functions
MSL adopts modern C++ features (templates, namespaces, operator overloading) while adding GPU-centric types such as float3, texture2d<float, access::sample>, and threadgroup memory. You define entry points with explicit buffer, texture, and sampler indices, or use argument buffers for bindless resource access. Object (amplification) and mesh shaders (Apple GPU-only) enable culling and primitive generation directly on the GPU.
// Vertex → Fragment example
#include <metal_stdlib>
using namespace metal;

struct VIn  { float3 position [[attribute(0)]]; float2 uv [[attribute(1)]]; };
struct VOut { float4 clipPos [[position]]; float2 uv; };

vertex VOut vertMain(VIn in [[stage_in]],
                     constant float4x4 &proj [[buffer(1)]])
{
    VOut out;
    out.clipPos = proj * float4(in.position, 1);
    out.uv = in.uv;
    return out;
}

fragment float4 fragMain(VOut in [[stage_in]],
                         texture2d<float> tex [[texture(0)]],
                         sampler samp [[sampler(0)]])
{
    return tex.sample(samp, in.uv);
}
From vertices to pixels
Stage 1 — Vertex Processing: vertex/mesh shaders transform geometry and optionally perform GPU frustum culling.
Stage 2 — Rasterization: fixed-function hardware converts primitives into fragments.
Stage 3 — Fragment Processing: fragment shaders compute per-pixel color, depth, and stencil.
Stage 4 — Tile Resolve & Store: results accumulated in on-chip tile memory are resolved and written back to system memory at the end of the render pass, minimizing bandwidth.
State is configured once per encoder via descriptors (render pass, depth-stencil, multisample, etc.), then reused across frames.
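Roughly, one frame of that work in Swift could look like the sketch below; the MTKView, vertex buffer, and vertex count are placeholders, and the pipeline is assumed to carry a vertex descriptor matching the [[stage_in]] layout above.
// Per-frame encoding sketch (Swift) — `queue`, `pipeline`, and `vertexBuffer`
// are assumed to come from the one-time setup shown earlier.
import MetalKit

func draw(in view: MTKView, queue: MTLCommandQueue,
          pipeline: MTLRenderPipelineState, vertexBuffer: MTLBuffer) {
    guard let passDesc = view.currentRenderPassDescriptor,  // render pass descriptor: attachments + load/store actions
          let drawable = view.currentDrawable,
          let cmdBuf   = queue.makeCommandBuffer(),
          let encoder  = cmdBuf.makeRenderCommandEncoder(descriptor: passDesc) else { return }

    encoder.setRenderPipelineState(pipeline)                 // reuse the precompiled pipeline state
    encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)
    encoder.endEncoding()

    cmdBuf.present(drawable)
    cmdBuf.commit()
}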
Harnessing thousands of GPU cores for non-graphics workloads
An MTLComputePipelineState wraps a kernel function. Threads are dispatched in grids of threadgroups, each with its own fast threadgroup SRAM; threads in a group synchronize via threadgroup_barrier. For large data sets, use the newer dispatchThreads API, which handles non-uniform threadgroup sizes at the grid edges automatically, or dispatchThreadgroups for fine-grained control over the threadgroup count.
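A host-side dispatch might look like the following sketch; the kernel name addArrays is hypothetical, and device and queue are assumed to come from the setup shown earlier.
// Compute dispatch sketch (Swift) — "addArrays" is a placeholder kernel name;
// `device` and `queue` are assumed to exist from the earlier setup.
let kernel          = device.makeDefaultLibrary()!.makeFunction(name: "addArrays")!
let computePipeline = try! device.makeComputePipelineState(function: kernel)

let count  = 1_000_000
let input  = device.makeBuffer(length: count * MemoryLayout<Float>.stride, options: .storageModeShared)!
let output = device.makeBuffer(length: count * MemoryLayout<Float>.stride, options: .storageModeShared)!

let cmdBuf  = queue.makeCommandBuffer()!
let encoder = cmdBuf.makeComputeCommandEncoder()!
encoder.setComputePipelineState(computePipeline)
encoder.setBuffer(input,  offset: 0, index: 0)
encoder.setBuffer(output, offset: 0, index: 1)

// dispatchThreads: give the exact grid size; Metal handles the partial threadgroup at the edge.
let tgWidth = computePipeline.threadExecutionWidth
encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: tgWidth, height: 1, depth: 1))
encoder.endEncoding()
cmdBuf.commit()
cmdBuf.waitUntilCompleted()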
Drop-in accelerated kernels & ML graph execution
MPS bundles hand-tuned kernels for image filters, BLAS/LAPACK routines, and neural-network layers. MPSGraph builds heterogeneous ML graphs that run side by side with your custom compute work. WWDC 2024 introduced fused scaledDotProductAttention and KV-cache ops for transformer models, FFT speed-ups, and a new visual MPSGraph viewer.
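As one concrete drop-in example, an MPS image filter encodes into the same command buffer as your own passes; the sketch below assumes device, queue, and two compatible textures (srcTexture, dstTexture) already exist.
// MPS drop-in kernel sketch (Swift) — textures and sigma are illustrative.
import MetalPerformanceShaders

let blur   = MPSImageGaussianBlur(device: device, sigma: 4.0)
let cmdBuf = queue.makeCommandBuffer()!
blur.encode(commandBuffer: cmdBuf,
            sourceTexture: srcTexture,          // assumed existing input texture
            destinationTexture: dstTexture)     // assumed existing output texture
cmdBuf.commit()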
Mesh shaders, hardware ray tracing, residency sets, unified libraries
Unified Shaders & Simplified Device Init: build a single library that runs unmodified on iPhone, iPad, and Mac.
Residency Sets: atomically make related buffers, textures, and heaps resident to cut memory-management overhead (see the sketch after this list).
Hardware Ray Tracing Improvements: Apple Silicon now provides direct-state access for faster intersection result retrieval and row-major matrix layouts that ease HLSL ports.
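A rough sketch of the residency-set flow, assuming the API shape Apple introduced for this feature and three pre-existing buffers (all names are placeholders):
// Residency set sketch (Swift) — `device`, `queue`, and the three buffers are assumed to exist;
// availability depends on OS and GPU support for residency sets.
let setDesc = MTLResidencySetDescriptor()
setDesc.label = "Frame resources"
setDesc.initialCapacity = 3

let residencySet = try! device.makeResidencySet(descriptor: setDesc)
residencySet.addAllocation(vertexBuffer)      // buffers, textures, and heaps all qualify
residencySet.addAllocation(indexBuffer)
residencySet.addAllocation(uniformBuffer)
residencySet.commit()                         // make the whole set resident atomically

queue.addResidencySet(residencySet)           // command buffers on this queue can now rely on residency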