Low-overhead graphics + compute for every Apple GPU
Metal is Apple’s GPU API that unifies high-performance graphics and general-purpose compute under a single, low-overhead interface across all of its platforms. By exposing near-bare-metal control of command submission, memory residency, and shader execution, it lets you squeeze maximum throughput from A-series, M-series, and discrete AMD GPUs on iPhone, iPad, Mac, Apple TV, and Apple Vision Pro.
Devices, queues, resources, and pipelines
1. MTLDevice represents the physical GPU and creates every other object.
2. MTLCommandQueue issues command buffers, each housing parallel MTLCommandEncoders for graphics, compute, or blit workloads.
3. MTLBuffer and MTLTexture store data; heaps & residency sets group related resources for unified-memory GPUs.
4. MTLRenderPipelineState & MTLComputePipelineState compile shader functions and fixed-function state into GPU-specific binaries; a minimal host-side setup using all four appears right after this list.
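In Swift, wiring those four objects together might look like the sketch below. The shader names vertMain and fragMain match the MSL example later in this section; the pixel format, buffer size, and error handling are simplifying assumptions, not a fixed recipe.
// Host-side setup sketch (Swift) — names and formats are illustrative assumptions.
import Metal

guard let device = MTLCreateSystemDefaultDevice(),       // MTLDevice: the physical GPU
      let queue  = device.makeCommandQueue() else {      // MTLCommandQueue: submits command buffers
    fatalError("Metal is not supported on this system")
}

// MTLBuffer: 1 MiB of shared (unified-memory) storage.
let vertexBuffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)!

// MTLRenderPipelineState: shader functions + fixed-function state compiled to a GPU binary.
let library = device.makeDefaultLibrary()!
let desc = MTLRenderPipelineDescriptor()
desc.vertexFunction   = library.makeFunction(name: "vertMain")
desc.fragmentFunction = library.makeFunction(name: "fragMain")
desc.colorAttachments[0].pixelFormat = .bgra8Unorm        // assumed drawable format
let pipeline = try! device.makeRenderPipelineState(descriptor: desc)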
C++17-inspired syntax for vertex, fragment, kernel, and mesh functions
MSL adopts modern C++ features (templates, namespaces, operator overloading) while adding GPU-centric types such as float3, texture2d<float, access::sample>, and threadgroup memory. You define entry points with explicit buffer, texture, and sampler indices, or use argument buffers for bindless resource access. Object (amplification) and mesh shaders (Apple GPU-only) enable culling and primitive generation directly on the GPU.
// Vertex → Fragment example
#include <metal_stdlib>
using namespace metal;

struct VIn  { float3 position [[attribute(0)]]; float2 uv [[attribute(1)]]; };
struct VOut { float4 clipPos [[position]]; float2 uv; };

vertex VOut vertMain(VIn in [[stage_in]],
                     constant float4x4 &proj [[buffer(1)]])
{
    VOut out;
    out.clipPos = proj * float4(in.position, 1);
    out.uv = in.uv;
    return out;
}

fragment float4 fragMain(VOut in [[stage_in]],
                         texture2d<float> tex [[texture(0)]],
                         sampler samp [[sampler(0)]])
{
    return tex.sample(samp, in.uv);
}
From vertices to pixels
Stage 1 — Vertex Processing: vertex/mesh shaders transform geometry and optionally perform GPU frustum culling.
Stage 2 — Rasterization: fixed-function hardware converts primitives into fragments.
Stage 3 — Fragment Processing: fragment shaders compute per-pixel color, depth, and stencil.
Stage 4 — Tile Resolve & Store: results accumulated in on-chip tile memory are resolved and written back to system memory at the end of the render pass, minimizing bandwidth.
State is configured once per encoder via descriptors (render pass, depth-stencil, multisample, etc.), then reused across frames.
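Roughly, one frame of that work in Swift could look like the sketch below; the MTKView, vertex buffer, and vertex count are placeholders, and the pipeline is assumed to carry a vertex descriptor matching the [[stage_in]] layout above.
// Per-frame encoding sketch (Swift) — `queue`, `pipeline`, and `vertexBuffer`
// are assumed to come from the one-time setup shown earlier.
import MetalKit

func draw(in view: MTKView, queue: MTLCommandQueue,
          pipeline: MTLRenderPipelineState, vertexBuffer: MTLBuffer) {
    guard let passDesc = view.currentRenderPassDescriptor,  // render pass descriptor: attachments + load/store actions
          let drawable = view.currentDrawable,
          let cmdBuf   = queue.makeCommandBuffer(),
          let encoder  = cmdBuf.makeRenderCommandEncoder(descriptor: passDesc) else { return }

    encoder.setRenderPipelineState(pipeline)                 // reuse the precompiled pipeline state
    encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)
    encoder.endEncoding()

    cmdBuf.present(drawable)
    cmdBuf.commit()
}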
Harnessing thousands of GPU cores for non-graphics workloads
An MTLComputePipelineState wraps a kernel function. Threads are dispatched in grids of threadgroups, each with its own fast threadgroup SRAM; threads in a group synchronize via threadgroup_barrier. For large data sets, use the newer dispatchThreads API, which handles non-uniform threadgroup sizes at the grid edges automatically, or dispatchThreadgroups for fine-grained control over the threadgroup count.
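A host-side dispatch might look like the following sketch; the kernel name addArrays is hypothetical, and device and queue are assumed to come from the setup shown earlier.
// Compute dispatch sketch (Swift) — "addArrays" is a placeholder kernel name;
// `device` and `queue` are assumed to exist from the earlier setup.
let kernel          = device.makeDefaultLibrary()!.makeFunction(name: "addArrays")!
let computePipeline = try! device.makeComputePipelineState(function: kernel)

let count  = 1_000_000
let input  = device.makeBuffer(length: count * MemoryLayout<Float>.stride, options: .storageModeShared)!
let output = device.makeBuffer(length: count * MemoryLayout<Float>.stride, options: .storageModeShared)!

let cmdBuf  = queue.makeCommandBuffer()!
let encoder = cmdBuf.makeComputeCommandEncoder()!
encoder.setComputePipelineState(computePipeline)
encoder.setBuffer(input,  offset: 0, index: 0)
encoder.setBuffer(output, offset: 0, index: 1)

// dispatchThreads: give the exact grid size; Metal handles the partial threadgroup at the edge.
let tgWidth = computePipeline.threadExecutionWidth
encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: tgWidth, height: 1, depth: 1))
encoder.endEncoding()
cmdBuf.commit()
cmdBuf.waitUntilCompleted()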
Drop-in accelerated kernels & ML graph execution
MPS bundles hand-tuned kernels for image filters, BLAS/LAPACK routines, and neural-network layers. MPSGraph builds heterogeneous ML graphs that run side by side with your custom compute work. WWDC 2024 introduced fused scaledDotProductAttention and KV-cache ops for transformer models, FFT speed-ups, and a new visual MPSGraph viewer.
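As one concrete drop-in example, an MPS image filter encodes into the same command buffer as your own passes; the sketch below assumes device, queue, and two compatible textures (srcTexture, dstTexture) already exist.
// MPS drop-in kernel sketch (Swift) — textures and sigma are illustrative.
import MetalPerformanceShaders

let blur   = MPSImageGaussianBlur(device: device, sigma: 4.0)
let cmdBuf = queue.makeCommandBuffer()!
blur.encode(commandBuffer: cmdBuf,
            sourceTexture: srcTexture,          // assumed existing input texture
            destinationTexture: dstTexture)     // assumed existing output texture
cmdBuf.commit()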
Mesh shaders, hardware ray tracing, residency sets, unified libraries
Unified Shaders & Simplified Device Init: build a single library that runs unmodified on iPhone, iPad, and Mac.
Residency Sets: atomically make related buffers, textures, and heaps resident to cut memory-management overhead (see the sketch after this list).
Hardware Ray Tracing Improvements: Apple Silicon now provides direct-state access for faster intersection result retrieval and row-major matrix layouts that ease HLSL ports.
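A rough sketch of the residency-set flow, assuming the API shape Apple introduced for this feature and three pre-existing buffers (all names are placeholders):
// Residency set sketch (Swift) — `device`, `queue`, and the three buffers are assumed to exist;
// availability depends on OS and GPU support for residency sets.
let setDesc = MTLResidencySetDescriptor()
setDesc.label = "Frame resources"
setDesc.initialCapacity = 3

let residencySet = try! device.makeResidencySet(descriptor: setDesc)
residencySet.addAllocation(vertexBuffer)      // buffers, textures, and heaps all qualify
residencySet.addAllocation(indexBuffer)
residencySet.addAllocation(uniformBuffer)
residencySet.commit()                         // make the whole set resident atomically

queue.addResidencySet(residencySet)           // command buffers on this queue can now rely on residency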