Adding a GPU Node to Kubernetes (1) - Setting Up the GPU Node

천재보단범재 2024. 5. 14. 15:17

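Setting Up the GPU Node

As more applications are built as containers, Kubernetes is seeing wider use.

AI services are also increasingly deployed and operated on Kubernetes.

For AI services, GPUs matter for inference performance.

Adding a GPU node to Kubernetes lets AI services use the GPU at inference time.

We happened to have a spare PC with a GPU at the office, so I set it up.

I plan to cover how to add a GPU node to Kubernetes across two posts.

This post covers how to prepare the GPU node before it joins the cluster.

Setting up the GPU node as a k8s worker

Installing the NVIDIA driver and Container Toolkit

Use the lshw (list hardware) command to check the GPU information.

Setting the class option (-C) to display lists the display adapters along with the kernel driver bound to each one.

Because this is an internal company server, the GPU model is masked.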

$ sudo lshw -C display
  *-display
       description: VGA compatible controller
       product: ******
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: iomemory:600-5ff iomemory:600-5ff irq:149 memory:50000000-50ffffff memory:6000000000-600fffffff memory:6010000000-6011ffffff ioport:5000(size=128) memory:51000000-5107ffff
  *-display
       description: Display controller
       product: AlderLake-S GT1
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 0c
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm bus_master cap_list
       configuration: driver=i915 latency=0
       resources: iomemory:600-5ff iomemory:400-3ff irq:148 memory:6013000000-6013ffffff memory:4000000000-400fffffff ioport:6000(size=64) memory:4010000000-4016ffffff memory:4020000000-40ffffffff

This confirms that the node to be added to Kubernetes has an NVIDIA GPU.
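The manual check above can also be scripted. The following is a minimal sketch, assuming the `lshw -C display` output format shown above; the function name `has_nvidia_display` is hypothetical (it is not part of lshw) and it only inspects the `vendor:` lines.

```shell
# Hypothetical helper (not part of lshw): reads `lshw -C display` output
# on stdin and succeeds if any display device reports NVIDIA as its vendor.
has_nvidia_display() {
  grep -Eiq '^[[:space:]]*vendor:[[:space:]]*NVIDIA'
}

# Example usage on the node (requires lshw and root privileges):
#   sudo lshw -C display | has_nvidia_display && echo "NVIDIA GPU found"
```

The function only defines the check, so it can be sourced from a provisioning script and run before the driver installation step.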

To use the GPU on the node, the NVIDIA driver must be installed.

The NVIDIA driver is the software that manages and controls communication between the NVIDIA graphics card (GPU) and the OS.

The driver installation steps are covered in the post linked below.

How to use a GPU on AWS EC2?! (AWS EC2에서 GPU를 사용하려면?!) - developnote-blog.tistory.com

If the NVIDIA driver is installed correctly, nvidia-smi prints output like the following.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA *********               Off |   00000000:01:00.0  On |                  N/A |
| 41%   34C    P8             20W /  280W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
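If a script needs only the driver version rather than the full table, it can be pulled out of the banner line. This is a minimal sketch, assuming the banner format shown above (`Driver Version: X.Y.Z`); `parse_driver_version` is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: reads nvidia-smi output on stdin and prints the
# value that follows "Driver Version:" in the banner line (first match only).
parse_driver_version() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example usage on the node:
#   nvidia-smi | parse_driver_version
```

nvidia-smi can also print this value directly with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which avoids parsing the table at all.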

If the node runs Ubuntu, refer to the driver installation docs on the official Ubuntu site.

https://ubuntu.com/server/docs/nvidia-drivers-installation

Next, install a container runtime.

I used containerd as the container runtime.

The detailed installation steps are in the post below.

AWS EC2 Kubernetes Cluster Installation (1) Kubeadm - Installing containerd (AWS EC2 Kubernetes Cluster 설치(1) Kubeadm - containerd 설치) - developnote-blog.tistory.com

Once the NVIDIA driver is installed, install the NVIDIA Container Toolkit.

The NVIDIA Container Toolkit lets containers use NVIDIA GPUs.

It does this by mounting the CUDA driver installed on the host into the container.

It is the toolkit required to use the host's GPU from a container environment.

Source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

The NVIDIA official documentation describes how to install the Container Toolkit.

Add the package repository, then install the package.

# Add the apt repository and its signing key
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Check the repository list file that was written
$ cat /etc/apt/sources.list.d/nvidia-container-toolkit.list

deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /

# Refresh the package index (needs root privileges)
$ sudo apt-get update

# Install the toolkit
$ sudo apt-get install -y nvidia-container-toolkit

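After installation, the container runtime configuration must be updated so that the toolkit takes effect when containers start.

The nvidia-ctk CLI, which is installed with the package, makes this change in one step.

Because the container runtime on this node is containerd, I ran it with the matching option.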

$ sudo nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Loading config from /etc/containerd/config.toml
INFO[0000] Wrote updated config to /etc/containerd/config.toml
INFO[0000] It is recommended that containerd daemon be restarted.

For containerd, the configuration file is /etc/containerd/config.toml.

Part of this file has changed.

default_runtime_name changed from runc to nvidia,

and an nvidia section was added under the runtimes plugin.

# Changed (runc -> nvidia)
[plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"
      
# Added (nvidia runtime section)
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = "podsandbox"
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true
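To confirm these two changes without reading the whole file, a small check can grep for them. This is a sketch under the assumption that the config uses the same key names as the excerpt above; `check_containerd_nvidia` is a hypothetical helper, and its path argument defaults to the file that nvidia-ctk reported writing.

```shell
# Hypothetical helper: succeeds only if the given containerd config sets
# "nvidia" as the default runtime and defines a runtime whose BinaryName
# ends in nvidia-container-runtime.
check_containerd_nvidia() {
  local cfg="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$cfg" &&
    grep -Eq '^[[:space:]]*BinaryName[[:space:]]*=[[:space:]]*"[^"]*nvidia-container-runtime"' "$cfg"
}

# Example usage:
#   check_containerd_nvidia && echo "containerd is configured for the NVIDIA runtime"
```

The check is purely textual, so it does not prove containerd has loaded the new settings; the restart below is still required.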

Restart containerd to apply the changed configuration.

# Restart containerd
$ sudo systemctl restart containerd

# Check containerd status
$ systemctl status containerd
● containerd.service - containerd container runtime
     Loaded: loaded (/usr/local/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-05-14 04:48:59 UTC; 47s ago
       Docs: https://containerd.io
    Process: 164377 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 164378 (containerd)
      Tasks: 127
     Memory: 6.5G
        CPU: 1.296s
     CGroup: /system.slice/containerd.service
             ├─111058 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id ac637d84f9b807a3aee6c7f3fa3db79d2c21f7cdd6f3e747f96bb92262893197 -address /run/containerd/containerd.sock
             ├─111098 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2a040d90e10266726ccbdb2e14b77e502fc08adda8a4425a328e3ced9ff42d4c -address /run/containerd/containerd.sock
             ├─111143 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id c391cec06c47b0583e4b07edb523a95a246546bcf1e0a0077c83dec8e9fe18e0 -address /run/containerd/containerd.sock
             ├─111614 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id dde0b6647367884100ffd9a8d1ab8a4da8a44fad1081fe8b6d68a7012ecaef09 -address /run/containerd/containerd.sock
             ├─111657 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 66177e74bef45b2de82aabc333a8037f67ccf0afaa1c3640aaa219b0f978dc38 -address /run/containerd/containerd.sock
             ├─111691 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 4f224f3ffe00400510a25091859f09d6c447f591286e7291386903856879b39a -address /run/containerd/containerd.sock
             ├─112175 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3aed55c492ea17dd314e9dfc7288f862d49a9ac48a894194328ff3313fef0804 -address /run/containerd/containerd.sock
             └─164378 /usr/local/bin/containerd

May 14 04:48:58 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:58.568399959Z" level=info msg="Start subscribing containerd event"
May 14 04:48:58 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:58.568476673Z" level=info msg="Start recovering state"
May 14 04:48:58 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:58.568594821Z" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
May 14 04:48:58 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:58.568652835Z" level=info msg=serving... address=/run/containerd/containerd.sock
May 14 04:48:59 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:59.028030755Z" level=info msg="Start event monitor"
May 14 04:48:59 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:59.028061764Z" level=info msg="Start snapshots syncer"
May 14 04:48:59 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:59.028072026Z" level=info msg="Start cni network conf syncer for default"
May 14 04:48:59 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:59.028079211Z" level=info msg="Start streaming server"
May 14 04:48:59 k8sw04.coxspace.biz containerd[164378]: time="2024-05-14T04:48:59.028132456Z" level=info msg="containerd successfully booted in 0.575287s"
May 14 04:48:59 k8sw04.coxspace.biz systemd[1]: Started containerd container runtime.

Now check whether the GPU can be used from inside a container.

Using NVIDIA's CUDA base image, run the nvidia-smi command inside a container.

# Pull the NVIDIA CUDA image
$ ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04:                                    resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:0f6bfcbf267e65123bcc2287e2153dedfc0f24772fb5ce84afe16ac4b2fada95:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:8767a245ed2c481eb245d8f6c625accc3788e1fb8612403d6b4cd4645a4f09c7: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:56dc8550293751a1604e97ac949cfae82ba20cb2a28e034737bafd7382559609:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0d6448aff88945ea46a37cfe4330bdb0ada228268b80da6258a0fec63086f404:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0a7674e3e8fe69dcd7f1424fa29aa033b32c42269aab46cbe9818f8dd7154754:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:b71b637b97c5efb435b9965058ad414f07afa99d320cf05e89f10441ec1becf4:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:3c645031de2917ade93ec54b118d5d3e45de72ef580b8f419a8cdc41e01d042c:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:ca14dc8401b66a20e1ca678268250834c5c66ac3f458dd570088bb681444ffc0:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 6.6 s                                                                    total:  83.1 M (12.6 MiB/s)
unpacking linux/amd64 sha256:0f6bfcbf267e65123bcc2287e2153dedfc0f24772fb5ce84afe16ac4b2fada95...
done: 2.140010405s

# Check the GPU from inside the NVIDIA CUDA container
$ ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-test nvidia-smi
Tue May 14 04:59:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN RTX               Off |   00000000:01:00.0  On |                  N/A |
| 41%   32C    P8             19W /  280W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The GPU is visible from inside the container as well.
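For a scripted version of this check, `nvidia-smi -L` prints one line per GPU, which is easier to count than the table. This is a minimal sketch, assuming each line starts with `GPU <index>:`; `count_gpus` is a hypothetical helper, and the image and container name in the usage comment are the ones from the command above.

```shell
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints
# how many lines look like "GPU <index>: ...". Prints 0 when none match.
count_gpus() {
  grep -Ec '^GPU [0-9]+:' || true
}

# Example usage, reusing the image pulled above (no -t, so output can be piped):
#   ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-test nvidia-smi -L | count_gpus
```

A result of 0 would suggest the runtime configuration or the driver mount is not working, even if the container itself starts.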

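Summary

Using a GPU in Kubernetes requires some groundwork on the node first.

The NVIDIA driver and the NVIDIA Container Toolkit must be installed, and the container runtime must be configured to use the toolkit.

That completes the node setup needed to use the GPU.

The next post will cover how to add this node to Kubernetes as a worker node and configure it.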

[References]

Installing the NVIDIA Container Toolkit - NVIDIA Container Toolkit 1.15.0 documentation (docs.nvidia.com)

GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes (github.com)

Does the nvidia-device-plugin need to be running on all worker nodes? · Issue #45 · NVIDIA/k8s-device-plugin (github.com)

Deploying Omniverse Farm on Kubernetes - Omniverse Farm latest documentation (docs.omniverse.nvidia.com)

https://ubuntu.com/server/docs/nvidia-drivers-installation
