Rendering Pipeline

Date: 2026.05.17 Updated: 2026.05.17

Category: Metal

Main Reference
- Metal by Tutorials - 4th edition

CPU vs GPU

CPU: latency first (cache matters most)
GPU: throughput first (ALUs matter most)

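The ideal setup combines low latency with high throughput. Low latency lets the CPU work through its queued tasks in order without the system stalling or becoming unresponsive. High throughput keeps the GPU running efficiently so that the CPU does not waste time waiting for its response.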

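The CPU has to keep feeding the GPU work so that it never sits idle, yet at some point one of the two can still end up waiting on the other. To avoid this, Metal submits work asynchronously and lets the CPU keep multiple command buffers in flight.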
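To see why several command buffers in flight help, here is a minimal timing sketch in plain Swift (no Metal calls). It assumes a fixed, made-up CPU encode time and GPU execute time per frame; `totalTime` and `maxInFlight` are hypothetical names, where `maxInFlight` limits how many submitted frames may be pending on the GPU at once.

```swift
/// Simulates a CPU that encodes frames and a GPU that executes them in order.
/// Returns the time at which the GPU finishes the last frame.
func totalTime(frames: Int, cpuTime: Double, gpuTime: Double, maxInFlight: Int) -> Double {
    precondition(maxInFlight >= 1, "at least one frame must be allowed in flight")
    var gpuDone: [Double] = []   // GPU completion time of each submitted frame
    var cpuClock = 0.0           // time at which the CPU can start encoding
    for frame in 0..<frames {
        // The CPU may not start until frame (frame - maxInFlight) has left the GPU.
        if frame >= maxInFlight {
            cpuClock = max(cpuClock, gpuDone[frame - maxInFlight])
        }
        cpuClock += cpuTime                                  // CPU encodes the frame
        let gpuStart = max(cpuClock, gpuDone.last ?? 0.0)    // GPU runs frames in order
        gpuDone.append(gpuStart + gpuTime)
    }
    return gpuDone.last ?? 0.0
}

// One buffer at a time: CPU and GPU take turns waiting on each other.
let serial = totalTime(frames: 4, cpuTime: 2, gpuTime: 3, maxInFlight: 1)
// Two buffers in flight: the CPU encodes the next frame while the GPU is busy.
let overlapped = totalTime(frames: 4, cpuTime: 2, gpuTime: 3, maxInFlight: 2)
```

With these made-up numbers the overlapped case finishes earlier because the GPU never waits for the CPU after the first frame.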

Pipeline State

OpenGL (state-machine based)

glUseProgram(shaderProgram);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glEnable(GL_BLEND);

glDrawElements(...);  // draw with whatever state is currently set

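Each binding or state change is issued to the driver right away
- and mutates global state in real time
If the pipeline changes midway, every setting has to be applied again
State can only be validated at run time, when the draw is actually executed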

Metal (state-object based)

// Metal: build the render pipeline state ahead of time
let pipelineDescriptor = MTLRenderPipelineDescriptor()
pipelineDescriptor.vertexFunction = vertexShader
pipelineDescriptor.fragmentFunction = fragmentShader
// makeRenderPipelineState(descriptor:) throws, so it needs `try`
let renderPipeline = try device.makeRenderPipelineState(descriptor: pipelineDescriptor)

func draw() {
    renderEncoder.setRenderPipelineState(renderPipeline)  // reuse the prebuilt object
    renderEncoder.drawPrimitives(...)
}

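Metal records commands into a buffer and submits them together (the command buffer system)
Pipeline state objects are created ahead of time and switched between while drawing
Most validation can happen up front (shaders at build time, the pipeline when its state object is created) instead of at draw time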

// Pipeline state (immutable)
let pipelineState = try device.makeRenderPipelineState(descriptor: descriptor)

// Dynamic state (can change between draws)
renderEncoder.setViewport(MTLViewport(...))
renderEncoder.setScissorRect(MTLScissorRect(...))
renderEncoder.setCullMode(.back)
renderEncoder.setTriangleFillMode(.lines)

Creating a pipeline state is usually expensive, so it is built once up front and treated as immutable
Lightweight settings can instead be set dynamically on the encoder
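One common way to "build once, switch often" is to cache pipeline states by the settings that define them. The sketch below is plain Swift with hypothetical stand-in types (`PipelineKey`, `PipelineState`, `PipelineCache`); in a real app the key would be derived from an MTLRenderPipelineDescriptor and the value would be an MTLRenderPipelineState.

```swift
// Hypothetical stand-ins for a pipeline descriptor and a pipeline state object.
struct PipelineKey: Hashable {
    let vertexFunction: String
    let fragmentFunction: String
}

final class PipelineState {
    let key: PipelineKey
    init(key: PipelineKey) { self.key = key }
}

final class PipelineCache {
    private var states: [PipelineKey: PipelineState] = [:]
    private(set) var buildCount = 0   // number of expensive builds performed

    /// Returns the cached state for `key`, building it only on first use.
    func state(for key: PipelineKey) -> PipelineState {
        if let existing = states[key] { return existing }
        buildCount += 1               // stands in for the costly pipeline creation
        let created = PipelineState(key: key)
        states[key] = created
        return created
    }
}

let cache = PipelineCache()
// "vertex_main" and "fragment_main" are illustrative function names.
let opaqueKey = PipelineKey(vertexFunction: "vertex_main", fragmentFunction: "fragment_main")
let first = cache.state(for: opaqueKey)    // built here
let second = cache.state(for: opaqueKey)   // reused, no second build
```

The draw loop then only pays for a dictionary lookup when it switches pipelines.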

Render Pipeline

As a Metal programmer you only need to think about vertex processing and fragment processing. Only these two stages are directly programmable; the remaining stages are handled by specially designed hardware units.

Vertex Fetch

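The Vertex Fetch stage, also called the Input Assembler, is where the GPU computes the actual memory addresses of the vertex data in the vertex buffers and hands that data to the vertex shader. The data for a single vertex can arrive through several buffers, and the offset of each attribute has to be known, so this stage relies on the vertex descriptor set up on the CPU. Because the vertex shader runs once per vertex, cache efficiency matters a great deal here.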
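The address calculation can be sketched in a few lines of plain Swift. `AttributeLayout` and `BufferLayout` are hypothetical, simplified mirrors of what a vertex descriptor carries, and the base addresses are made up for illustration.

```swift
// Hypothetical, simplified mirror of one vertex-descriptor attribute.
struct AttributeLayout {
    let bufferIndex: Int   // which vertex buffer the attribute lives in
    let offset: Int        // byte offset of the attribute inside one vertex
}

// Hypothetical, simplified mirror of one vertex-buffer layout.
struct BufferLayout {
    let baseAddress: Int   // made-up start address of the buffer
    let stride: Int        // bytes from one vertex to the next
}

/// Byte address of `attribute` for the vertex at `vertexIndex`.
func address(of attribute: AttributeLayout, vertexIndex: Int, buffers: [BufferLayout]) -> Int {
    let buffer = buffers[attribute.bufferIndex]
    return buffer.baseAddress + vertexIndex * buffer.stride + attribute.offset
}

// Buffer 0 interleaves position (12 bytes) and normal (12 bytes); buffer 1 holds UVs (8 bytes).
let buffers = [
    BufferLayout(baseAddress: 1000, stride: 24),
    BufferLayout(baseAddress: 5000, stride: 8),
]
let normal = AttributeLayout(bufferIndex: 0, offset: 12)
let uv = AttributeLayout(bufferIndex: 1, offset: 0)

let normalAddress = address(of: normal, vertexIndex: 3, buffers: buffers)  // 1000 + 3 * 24 + 12
let uvAddress = address(of: uv, vertexIndex: 3, buffers: buffers)          // 5000 + 3 * 8 + 0
```

One vertex here pulls data from two different buffers, which is why the stride and offset of every attribute have to be described up front.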

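// GPU-internal behavior (conceptual model, not a real API)
struct ShadedVertex {
    let position: SIMD4<Float>  // vertex shader output
    let color: SIMD4<Float>     // vertex shader output
    let vertexID: Int           // original vertex ID
}

// Hypothetical stand-in for running the vertex shader on one vertex
func runVertexShader(vertexID: Int) -> ShadedVertex {
    ShadedVertex(position: SIMD4<Float>(Float(vertexID), 0, 0, 1),
                 color: SIMD4<Float>(1, 1, 1, 1),
                 vertexID: vertexID)
}

// Post-transform cache: vertexID → shaded result
var vertexCache: [Int: ShadedVertex] = [:]
var shaderRuns = 0

// Two indexed triangles that share vertices 1 and 2
for triangle in [[0, 1, 2], [2, 1, 3]] {
    for index in triangle {
        if vertexCache[index] != nil {
            // Cache hit: the vertex shader does not run again
            continue
        }
        // Cache miss: run the vertex shader and remember the result
        vertexCache[index] = runVertexShader(vertexID: index)
        shaderRuns += 1
    }
}
// shaderRuns is 4 here, not 6: indices 2 and 1 of the second triangle hit the cache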

Duplicate indices would otherwise make the same vertex shader work run several times. To avoid that, the results are stored in a second cache (the post-transform cache) and reused instead of being recomputed.

Once this is done, a hardware unit called the Scheduler hands the work on to the vertex processing stage.

Vertex Processing

The Vertex Processing stage runs once per vertex and takes each vertex through several coordinate spaces until it reaches its final position in the framebuffer.

[Figures: AMD GPU architecture; PowerVR GPU (mobile)]

AMD

1 Graphics Command Processor: This coordinates the work processes.
4 Shader Engines (SE): An SE is an organizational unit on the GPU that can serve an entire pipeline. Each SE has a geometry processor, a rasterizer and Compute Units.
9 Compute Units (CU): A CU is nothing more than a group of shader cores.
64 shader cores: A shader core is the basic building block of the GPU where all of the shading work is done.

Mobile (iOS)

Instead of having SEs and CUs, the PowerVR GPU has Unified Shading Clusters (USC).
This particular GPU model has 6 USCs and 32 cores per USC for a total of only 192 cores.

There are a few rules:

Inside a CU, you can only process either vertices or fragments, and only one of them at a time.
You can only process one shader function per SE.
To run a vertex shader and a fragment shader in parallel, use different SEs.
- For example, the fragment shader for the current frame and the vertex shader for the next frame can run in parallel on different Shader Engines.

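#include <metal_stdlib>
using namespace metal;

struct VertexIn {
	float4 position [[attribute(0)]];
};

// Vertex function: passes the fetched position through unchanged
vertex float4 vertex_main(const VertexIn vertexIn [[stage_in]]) {
	return vertexIn.position;
}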

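Vertex data is stored in the buffer in indexed form. Through the [[stage_in]] attribute, the vertex shader receives the data for the index its core is responsible for, unpacked into a VertexIn struct and served from cache where possible. The [[stage_in]] qualifier marks a parameter as input handed over by the previous stage.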

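Why define the VertexIn struct in the shader when the vertex descriptor already carries that information? It lets the compiler type-check the shader at compile time. Because of padding, the struct also does not have to match the vertex descriptor exactly.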

A CU executes work in batches sized to its number of shader cores, and the shader cores in a batch can all access the CU's whole cache. The key point is that the vertices come out ordered and grouped, which makes them ready for the primitive assembly stage.

A special hardware unit known as the Distributer sends the grouped blocks of vertices on to the Primitive Assembly stage.

Primitive Assembly

[Figures: primitive assembly; primitive types]

Because the previous stage delivers data grouped into blocks, all the data needed to build a geometric shape (a primitive) is available within a block.

The Metal API officially supports five primitive types: point, line, lineStrip, triangle and triangleStrip.

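As mentioned in the previous chapter, the winding order has to be specified when setting up the pipeline. Whether the winding order is clockwise or counter-clockwise decides which side of a primitive is the front and which is the back, and culling is performed on that basis.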

Culling works on whole primitives, not on individual pixels. When only part of a primitive is hidden (for example, partially off-screen), it is clipped rather than culled.
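How winding order turns into a front/back decision can be sketched with the signed area of the projected triangle. This is a minimal plain-Swift illustration, assuming 2D coordinates with the y axis pointing up; `Point2D`, `Winding` and `isBackFacing` are hypothetical names, not Metal API.

```swift
struct Point2D { let x: Double; let y: Double }

enum Winding { case clockwise, counterClockwise }

/// Twice the signed area of triangle (a, b, c); positive means counter-clockwise when y points up.
func signedArea2(_ a: Point2D, _ b: Point2D, _ c: Point2D) -> Double {
    (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y)
}

/// Winding of the triangle, or nil when it is degenerate (zero area).
func winding(_ a: Point2D, _ b: Point2D, _ c: Point2D) -> Winding? {
    let area = signedArea2(a, b, c)
    if area == 0 { return nil }
    return area > 0 ? .counterClockwise : .clockwise
}

/// True when the whole primitive faces away, given which winding counts as front-facing.
func isBackFacing(_ a: Point2D, _ b: Point2D, _ c: Point2D, frontFacing: Winding) -> Bool {
    guard let w = winding(a, b, c) else { return true }  // degenerate: nothing to draw
    return w != frontFacing
}

let a = Point2D(x: 0, y: 0)
let b = Point2D(x: 1, y: 0)
let c = Point2D(x: 0, y: 1)

let ccwTriangleCulled = isBackFacing(a, b, c, frontFacing: .clockwise)  // back-facing
let cwTriangleCulled = isBackFacing(a, c, b, frontFacing: .clockwise)   // front-facing
```

The decision is made once for the whole triangle, which matches the point that culling is per primitive rather than per pixel.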

Rasterization

There are two modern rendering techniques:

ray tracing: preferred when rendering content that is static and far away
- For each pixel on the screen, it sends a ray into the scene
rasterization: preferred when the content is closer to the camera and more dynamic
- For each object in the scene, it sends rays back into the screen

Rasterization in detail

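==1. Triangle Setup== Compute the 2D grid coordinates of the connected vertices (the triangle primitive) passed in from the previous stage, then compute the slopes of the line segments so that the three edges form the triangle.

==2. Scan Conversion== Scan the screen one horizontal line (scan line) at a time from top to bottom and compute where each line crosses the triangle, using the line-segment slopes computed earlier.

==3. Fill Algorithm== Determine which pixels inside the triangle, including those on its boundary, are visible.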
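The inside/outside question behind steps 2 and 3 can be sketched in plain Swift. Note that this sketch does not walk scan lines with edge slopes as described above; it swaps in per-pixel edge functions, a different and commonly used way to get the same coverage answer. It assumes a counter-clockwise triangle in a y-up grid and samples pixel centers; all names are hypothetical.

```swift
struct Vec2 { let x: Double; let y: Double }
struct Pixel: Equatable { let x: Int; let y: Int }

/// Edge function: non-negative when p lies on or to the left of the directed edge a→b (y up).
func edge(_ a: Vec2, _ b: Vec2, _ p: Vec2) -> Double {
    (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)
}

/// Pixels whose centers lie inside or on the boundary of the counter-clockwise triangle (a, b, c).
func coveredPixels(_ a: Vec2, _ b: Vec2, _ c: Vec2, width: Int, height: Int) -> [Pixel] {
    var result: [Pixel] = []
    for y in 0..<height {
        for x in 0..<width {
            let center = Vec2(x: Double(x) + 0.5, y: Double(y) + 0.5)
            let inside = edge(a, b, center) >= 0
                && edge(b, c, center) >= 0
                && edge(c, a, center) >= 0
            if inside { result.append(Pixel(x: x, y: y)) }
        }
    }
    return result
}

// Right triangle covering the lower-left half of a 4x4 grid.
let covered = coveredPixels(Vec2(x: 0, y: 0), Vec2(x: 4, y: 0), Vec2(x: 0, y: 4),
                            width: 4, height: 4)
```

Each covered pixel becomes a fragment candidate; visibility is then settled by the depth and stencil tests.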

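Three specialized hardware-unit tasks then run:

Hierarchical-Z: quickly rejects hidden pixels level by level, from coarse tiles down to individual pixels (e.g. 64, 32, 16, …)
Z and Stencil Test: precise per-pixel visibility test
Interpolator: interpolates the vertex attributes for each pixel's position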

1. Rasterization
   ↓
   Fragment generation (x, y, depth)
   ↓
2. Hierarchical-Z Test
   ↓
   Fast rejection per tile
   ↓
3. Z/Stencil Test (per-pixel)
   ↓
   Exact depth comparison
   ↓
4. Interpolation
   ↓
   Fragment attributes computed (normal, texCoord, color, etc.)
   ↓

The important point is that Z-testing runs before Fragment Processing, which keeps unnecessary pixel (fragment) shader executions to a minimum.
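The saving from testing depth before shading can be illustrated with a small plain-Swift model. It assumes a "less than" depth comparison and a depth buffer cleared to 1.0; `Fragment` and `shadeWithEarlyZ` are hypothetical names, and the counter merely stands in for real fragment shader work.

```swift
struct Fragment { let x: Int; let y: Int; let depth: Double }

/// Depth-tests every fragment before shading and returns how many reach the shader.
func shadeWithEarlyZ(_ fragments: [Fragment], width: Int, height: Int) -> Int {
    var depthBuffer = [Double](repeating: 1.0, count: width * height)  // 1.0 = far plane
    var shaderInvocations = 0
    for fragment in fragments {
        let index = fragment.y * width + fragment.x
        // Rejected here, before any shading cost is paid.
        guard fragment.depth < depthBuffer[index] else { continue }
        depthBuffer[index] = fragment.depth
        shaderInvocations += 1   // stands in for running the fragment shader
    }
    return shaderInvocations
}

// Three fragments landing on the same pixel, in two different submission orders.
let unordered = [Fragment(x: 0, y: 0, depth: 0.2),
                 Fragment(x: 0, y: 0, depth: 0.5),
                 Fragment(x: 0, y: 0, depth: 0.1)]
let frontToBack = unordered.sorted { $0.depth < $1.depth }

let unorderedRuns = shadeWithEarlyZ(unordered, width: 1, height: 1)
let sortedRuns = shadeWithEarlyZ(frontToBack, width: 1, height: 1)
```

In this model, drawing front to back lets the test reject more fragments, so fewer of them are shaded.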

Once these steps are done, the Scheduler unit dispatches work again, this time to Fragment Processing.

[!note] On mobile devices, rasterization proceeds over a grid of 32x32 tiles; since one USC has 32 shader cores, this maps onto the hardware very efficiently.

Fragment Processing

The Vertex Fetch unit grabs vertices from memory and passes them to the Scheduler unit.
The Scheduler unit knows which Shader Engines are available, so it dispatches work to them.
The Distributer unit knows whether this work was Vertex or Fragment Processing.
- If the work was Vertex Processing, it sends the result to the Primitive Assembly unit.
- This path continues to the Rasterization unit, and then back to the Scheduler unit.
- If the work was Fragment Processing, it sends the result to the Color Writing unit.

Like Vertex Processing, Fragment Processing is a programmable stage. The fragment shader receives, as its parameters, the vertex attributes interpolated per pixel after Primitive Assembly and Rasterization, and it returns a single color for that pixel.

If you don’t want a vertex output to be interpolated, add the attribute [[flat]] to its definition.

struct VertexOut {
	float4 position [[position]];
	float3 normal; // default: interpolated
	float2 texCoord; // default: interpolated
	float3 color [[flat]]; // flat: not interpolated
	int materialID [[flat]]; // flat: integers must not be interpolated
	uint primitiveID [[flat]]; // flat: ID values are not interpolated either
};
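Why IDs should stay flat becomes clear from a tiny numeric sketch in plain Swift. It uses simple linear barycentric blending (ignoring perspective correction) and assumes, purely for illustration, that a flat attribute takes the value of the triangle's first vertex; `Weights`, `interpolate` and `flatValue` are hypothetical names.

```swift
/// Barycentric weights of a point inside a triangle; assumed to sum to 1.
struct Weights { let w0: Double; let w1: Double; let w2: Double }

/// Linear blend of one per-vertex value (no perspective correction in this sketch).
func interpolate(_ v0: Double, _ v1: Double, _ v2: Double, at w: Weights) -> Double {
    w.w0 * v0 + w.w1 * v1 + w.w2 * v2
}

/// Flat attribute: every fragment gets one vertex's value unchanged.
/// Picking the first vertex is an assumption made for this illustration.
func flatValue(_ v0: Int, _ v1: Int, _ v2: Int) -> Int { v0 }

let weights = Weights(w0: 0.5, w1: 0.25, w2: 0.25)

// A smoothly varying attribute blends sensibly.
let shade = interpolate(0.0, 1.0, 2.0, at: weights)

// Material IDs 3, 3 and 7 blend to 4, an ID that none of the vertices carries.
let blendedID = interpolate(3, 3, 7, at: weights)
let flatID = flatValue(3, 3, 7)
```

The blended ID points at a material the triangle never asked for, while the flat one stays valid.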

The GPU takes the fragments and runs a series of post-processing steps:

alpha-testing & alpha-blending
scissor testing
stencil testing
late Z testing
antialiasing
… see the later Post Processing chapter for details
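As a taste of one item in the list, alpha blending with the same factors as the earlier OpenGL call `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)` can be sketched in plain Swift. `RGBA` and `blend` are hypothetical names, and the sketch applies the same factors to the alpha channel as to the color channels.

```swift
struct RGBA: Equatable { var r, g, b, a: Double }

/// Blends `source` over `destination` using source alpha and one-minus-source-alpha factors.
func blend(source: RGBA, destination: RGBA) -> RGBA {
    let srcFactor = source.a
    let dstFactor = 1.0 - source.a
    return RGBA(r: source.r * srcFactor + destination.r * dstFactor,
                g: source.g * srcFactor + destination.g * dstFactor,
                b: source.b * srcFactor + destination.b * dstFactor,
                a: source.a * srcFactor + destination.a * dstFactor)
}

// Half-transparent red drawn over opaque blue.
let halfRed = RGBA(r: 1, g: 0, b: 0, a: 0.5)
let opaqueBlue = RGBA(r: 0, g: 0, b: 1, a: 1)
let blended = blend(source: halfRed, destination: opaqueBlue)
```

The result depends on what is already in the framebuffer, which is why blending happens after the fragment shader has produced its color.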

FrameBuffer

As soon as fragments have been processed into pixels, the Distributer unit sends them to the Color Writing unit. This unit is responsible for writing the final color in a special memory location known as the FrameBuffer.

+ Swap-chain (double or triple buffering)
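The buffer rotation behind double or triple buffering can be sketched as a ring of indices. This is a plain-Swift illustration with a hypothetical `SwapChain` type; it models only which buffer is shown and which is rendered next, not real presentation timing.

```swift
/// Hypothetical swap chain that cycles through a fixed number of buffers.
struct SwapChain {
    let bufferCount: Int               // 2 = double buffering, 3 = triple buffering
    private(set) var frontIndex = 0    // buffer currently shown on screen

    init(bufferCount: Int) {
        precondition(bufferCount >= 2, "a swap chain needs at least two buffers")
        self.bufferCount = bufferCount
    }

    /// Buffer the next frame should be rendered into.
    var backIndex: Int { (frontIndex + 1) % bufferCount }

    /// Presents the back buffer, which becomes the new front buffer.
    mutating func present() { frontIndex = backIndex }
}

var chain = SwapChain(bufferCount: 3)
let firstTarget = chain.backIndex   // render into buffer 1 while buffer 0 is displayed
chain.present()                     // buffer 1 is now on screen
let secondTarget = chain.backIndex  // next frame goes into buffer 2
```

Because the displayed buffer is never the one being written, the screen does not show a half-drawn frame.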


정인호