VMware vSphere GPU Virtualization with Bitfusion: Installation Reference Guide

2024-09-05

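The Review details page shows a warning that the vSphere Bitfusion OVF uses advanced configuration values that may pose a security risk.

The configuration values that trigger the warning are pciPassthru.use64bitMMIO = true and pciPassthru.64bitMMIOSizeGB = 256.

The first parameter enables PCI passthrough for GPU devices that require 16 GB or more of memory mapping.

The second parameter sets the memory-mapped I/O (MMIO) size to 256 GB. You can adjust this value later in the settings of the vSphere Bitfusion virtual machine.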
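
For reference, the two settings can be written out as configuration lines. The sketch below is illustrative only: the helper name render_vmx_lines is hypothetical, and the key = "VALUE" layout is an assumption modeled on how .vmx files commonly store advanced settings. The keys and values themselves come from the warning described above.

```python
# The two advanced settings named in the OVF warning (values from the guide).
ADVANCED_SETTINGS = {
    "pciPassthru.use64bitMMIO": True,
    "pciPassthru.64bitMMIOSizeGB": 256,
}


def render_vmx_lines(settings):
    """Render settings as key = "VALUE" lines (layout is an assumption)."""
    lines = []
    for key, value in settings.items():
        # bool must be tested before int-like handling: True is also an int.
        if isinstance(value, bool):
            text = "TRUE" if value else "FALSE"
        else:
            text = str(value)
        lines.append(f'{key} = "{text}"')
    return lines


for line in render_vmx_lines(ADVANCED_SETTINGS):
    print(line)
```

Changing the 256 in the dictionary models the later adjustment of the MMIO size mentioned above.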

Select a network

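2.3.3 Customize the OVF template

Set the hostname.

In the Bitfusion server settings section, enter the user name and password of the vCenter Server instance on which you are deploying the vSphere Bitfusion OVF template.

In the Bitfusion server settings section, enter the vCenter Server TLS certificate thumbprint [in practice, this field can be left blank].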

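1. In the Credentials section, specify the customer password [optional, but recommended; note that the user name is customer].

Configuring the credentials is recommended: if you need to install the NVIDIA driver manually, you log in with this account and run the installation with elevated privileges.

After the deployment completes, use the customer user account to log in to the vSphere Bitfusion appliance through the console shell or SSH.

2. In the NVIDIA packages section, select the Download and install NVIDIA packages check box to accept the NVIDIA license.

When you accept the NVIDIA license, vSphere Bitfusion downloads and installs the NVIDIA driver, the CUDA libraries, and NVIDIA Fabric Manager during the first boot of the virtual machine.

If you run vSphere Bitfusion in an environment without Internet access (for example, an air-gapped network), do not select the check box.

In that case, you must download and install the NVIDIA software manually after deploying the vSphere Bitfusion appliance.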

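Note

You must configure network adapter 1, which carries management and data traffic. Network adapter 1 must be connected to a network that can communicate with the vCenter Server instance.

Network adapters 2, 3, and 4 are optional and carry data traffic only. Each network adapter must be connected to a separate network.

vSphere Bitfusion selects the network that transfers data to the vSphere Bitfusion server most efficiently.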
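
The adapter rules in this note can be checked before deployment. The following is a minimal sketch, not part of Bitfusion: the function name validate_adapters and the network names in the usage note are hypothetical, and the plan is modeled as a plain mapping from adapter number to network name.

```python
def validate_adapters(adapters):
    """Check an {adapter_number: network_name} plan against the note's rules.

    Returns a list of problem descriptions; an empty list means the plan
    satisfies the rules stated in the guide.
    """
    problems = []
    # Adapter 1 carries management and data traffic and is mandatory.
    if 1 not in adapters:
        problems.append("network adapter 1 is required")
    # Only adapters 1 through 4 exist.
    for number in adapters:
        if number not in (1, 2, 3, 4):
            problems.append(f"adapter {number} is out of range (only 1-4 exist)")
    # Every adapter must sit on its own network.
    networks = list(adapters.values())
    if len(set(networks)) != len(networks):
        problems.append("each adapter must be connected to a separate network")
    return problems
```

For example, validate_adapters({1: "mgmt", 2: "data-a"}) returns an empty list, while a plan that reuses one network for two adapters reports the separate-network rule.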

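3 Configure the Bitfusion server

3.1 Add GPU devices

In the vSphere Client, right-click the vSphere Bitfusion virtual machine in the inventory and select Edit Settings.

On the Virtual Hardware tab, click the Add New Device button.

From the drop-down menu, under Other Devices, select PCI Device.

Expand the New PCI device section and select the access type.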

Select the GPU card to add, in this case a Tesla V100S 32GB.

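About VMDirectPath I/O

VMDirectPath I/O lets the guest operating system access the GPU directly, bypassing the ESXi hypervisor.

Passthrough devices use resources more efficiently and improve the performance of the vSphere Bitfusion environment. With GPU passthrough enabled, performance on vSphere is close to that of a bare-metal system.

3.2 Configure CPU and memory resources on the ESXi host

If the ESXi host is dedicated to the vSphere Bitfusion server, set CPU and memory to their maximum values.

If the host is not dedicated to vSphere Bitfusion:

Set the minimum CPU count to the number of GPUs multiplied by 4.

Set the minimum memory to 1.5 times the total memory of the GPU cards or 32 GB, whichever is larger.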
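
The two sizing rules for a non-dedicated host are simple arithmetic. As a minimal sketch (the function name minimum_reservation is hypothetical, and memory is expressed in GB):

```python
def minimum_reservation(gpu_memory_gb):
    """Return (min_cpus, min_memory_gb) for a host not dedicated to Bitfusion.

    gpu_memory_gb holds one entry per GPU: that card's memory in GB.
    """
    # Rule 1: number of GPUs multiplied by 4.
    min_cpus = len(gpu_memory_gb) * 4
    # Rule 2: 1.5 times the total GPU memory, or 32 GB, whichever is larger.
    min_memory_gb = max(1.5 * sum(gpu_memory_gb), 32)
    return min_cpus, min_memory_gb


print(minimum_reservation([32]))  # one 32 GB card, as used in this guide
```

For the single 32 GB card in this setup, the rules give 4 CPUs and 48 GB of memory. A single 16 GB card would fall back to the 32 GB floor, because 1.5 x 16 = 24 GB is smaller.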

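In the vSphere Client, right-click the vSphere Bitfusion virtual machine and select Edit Settings.

Expand the CPU section and edit the resources.

Expand the Memory section and edit the resources.

Under Memory, select the Reserve all guest memory (All locked) check box.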

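3.3 Power on the Bitfusion server virtual machine

Power on the virtual machine and wait for it to finish booting.

Cassandra errors at this stage are expected.

Because the NVIDIA driver was not installed when customizing the OVF template, a could not load NVML library error is reported. This is also expected.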

After the installation completes, the browser reports that the plug-in was deployed successfully.

Note that the deployment also installs the Bitfusion plug-in in vSphere, which later steps rely on.

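View the installation log.

4 Continue configuring the vSphere Bitfusion server

4.1 Add a new network

You can connect the vSphere Bitfusion server virtual machine to up to four networks.

During deployment of the vSphere Bitfusion server, you must configure at least network adapter 1, which carries management and data traffic. Network adapters 2, 3, and 4 are optional and carry data traffic only. To add network interfaces for data traffic after the server is deployed, follow this procedure.

Each network adapter must be connected to a separate network. vSphere Bitfusion selects the network that transfers data to the vSphere Bitfusion server most efficiently.

Expand the New Network section and, from the Adapter Type drop-down menu, select the network adapter to assign to the virtual machine.

vSphere Bitfusion supports VMXNET3 and PVRDMA adapters.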

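4.2 Add subsequent vSphere Bitfusion servers

When you need more GPU resources, you can add more servers to the vSphere Bitfusion cluster.

Use the vSphere Bitfusion plug-in to add subsequent vSphere Bitfusion servers to the cluster.

The plug-in reuses the configuration data of the primary server, so subsequent servers deploy faster.

Added vSphere Bitfusion servers must be managed by the same vCenter Server instance as the first vSphere Bitfusion server.

5 Install the NVIDIA driver

Install the NVIDIA driver on the Bitfusion server virtual machine.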

Select the NVIDIA driver

Note

The NVIDIA driver certified for use with vSphere Bitfusion 4.5.0 and 4.5.1 is NVIDIA-Linux-x86_64-460.73.01.run.

The NVIDIA driver certified for use with vSphere Bitfusion 4.5.2 is NVIDIA-Linux-x86_64-470.129.06.run.
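
The pairing in this note can be kept as a small lookup table so that scripts pick the matching installer. This is only a sketch: the names CERTIFIED_DRIVERS and certified_driver are hypothetical, and the table holds just the three versions listed above.

```python
# Certified driver per vSphere Bitfusion version, as listed in the note above.
CERTIFIED_DRIVERS = {
    "4.5.0": "NVIDIA-Linux-x86_64-460.73.01.run",
    "4.5.1": "NVIDIA-Linux-x86_64-460.73.01.run",
    "4.5.2": "NVIDIA-Linux-x86_64-470.129.06.run",
}


def certified_driver(bitfusion_version):
    """Return the certified driver file name, or None for an unlisted version."""
    return CERTIFIED_DRIVERS.get(bitfusion_version)


print(certified_driver("4.5.2"))
```

Returning None for an unlisted version, rather than guessing, keeps the lookup from suggesting a driver that the note does not cover.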

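You can install the driver either online or offline.

1. Online installation

Log in with the customer account and password set when customizing the OVF template, then run the online installation with elevated privileges: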

sudo install-nvidia-packages --defaults --yes

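2. Offline installation

As a prerequisite, temporarily run an HTTP web server on a machine in the local network and copy the driver to a location that it serves: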

sudo install-nvidia-packages --driver http://172.18.2.12/NVIDIA/NVIDIA-Linux-x86_64-470.129.06.run
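
Any web server works for the temporary HTTP host. As one possibility, the sketch below uses Python's standard http.server module (Python 3.7 or later); the function name start_driver_server and the default port 8000 are assumptions made for illustration, not something the guide prescribes.

```python
import functools
import http.server
import threading


def start_driver_server(directory, port=8000, host="0.0.0.0"):
    """Serve the files under `directory` over HTTP in a background thread.

    Returns the server object; call shutdown() and server_close() on it
    once the driver has been installed.
    """
    # Bind the request handler to the directory that holds the .run file.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # A daemon thread lets the interpreter exit even if shutdown is skipped.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

With a non-default port the URL passed to --driver must include it, for example http://172.18.2.12:8000/NVIDIA/NVIDIA-Linux-x86_64-470.129.06.run when the served directory contains an NVIDIA subdirectory. Serving on port 80, as in the command above, normally requires root privileges.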

After the driver is installed, VMware vCenter displays the Bitfusion cluster and its GPU information.

6 Install the vSphere Bitfusion client

Download the vSphere Bitfusion client package for your Linux distribution from the VMware website.

1. Ubuntu 22.04

wget https://packages.vmware.com/bitfusion/ubuntu/22.04/bitfusion-client-ubuntu2204_4.5.2-16_amd64.deb

2. RHEL 7/8

This walkthrough was tested on RHEL 8.6.

wget https://packages.vmware.com/bitfusion/rhel/8/bitfusion-client-rhel8-4.5.2-16.x86_64.rpm

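After installing the Bitfusion client package, shut down the virtual machine.

7 Activate the client with the plug-in

For a virtual machine on which the Bitfusion client package has been installed as described above, activate the client while the VM is powered off, using the Bitfusion plug-in already installed and configured in vCenter.

7.1 Prerequisites

Check the environment again.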

Select the first option, which enables the VM as a client.

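7.2 Configure the user group

Power on the client virtual machine.

In a terminal on the client machine, add the user to the vSphere Bitfusion Linux user group by running sudo usermod -aG bitfusion username, where username is the name of the new user.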

usermod -aG bitfusion root

7.3 Verification

$ bitfusion list_gpus
 - server 0 (leader)  [172.18.3.60:56001]: running 0 tasks 
   |- GPU [0]: free memory (32510 / 32510MiB) Tesla V100S-PCIE-32GB (7.0)
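
If the check needs to run from a script, the listing can be parsed. The sketch below is inferred from the single sample above, so the line format it expects is an assumption and other Bitfusion versions may print something different. The name parse_gpus is hypothetical, and reading the trailing (7.0) as the GPU's compute capability is likewise an interpretation.

```python
import re

# Pattern modeled on the one GPU line shown in the sample output.
GPU_LINE = re.compile(
    r"GPU \[(?P<index>\d+)\]: free memory "
    r"\((?P<free>\d+) / (?P<total>\d+)MiB\) "
    r"(?P<name>.+?) \((?P<capability>[\d.]+)\)\s*$"
)


def parse_gpus(output):
    """Return one dict per GPU line found in `bitfusion list_gpus` output."""
    gpus = []
    for line in output.splitlines():
        match = GPU_LINE.search(line)
        if match:
            gpus.append({
                "index": int(match["index"]),
                "free_mib": int(match["free"]),
                "total_mib": int(match["total"]),
                "name": match["name"],
                "compute_capability": match["capability"],
            })
    return gpus


SAMPLE = """\
 - server 0 (leader)  [172.18.3.60:56001]: running 0 tasks
   |- GPU [0]: free memory (32510 / 32510MiB) Tesla V100S-PCIE-32GB (7.0)
"""
print(parse_gpus(SAMPLE))
```

An empty result means no line matched, which signals either that no GPU is visible or that the output format differs from the sample.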

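This completes the basic installation of the Bitfusion server and client.

8 Deployment and usage example

To use AI and ML applications with vSphere Bitfusion, install and configure several software packages and programming frameworks.

8.1 Install NVIDIA CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs).

CUDA uses the processing power of GPUs to speed up compute applications substantially. For example, the TensorFlow and PyTorch benchmarks use CUDA.

To download and install the NVIDIA CUDA 11 package for CentOS 8 or Red Hat Linux 8, run the commands below.

The rpm package is fairly large, but it installs quickly.

$ wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm

$ rpm -i cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm

Installing cuda pulls in many packages, 214 in total.

$ yum -y install cuda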

Transaction Summary
====================================================================================================================================================================================================================
Install  214 Packages

8.2 Install NVIDIA cuDNN

NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.

8.2.1 Prerequisites

$ wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/libcudnn8-8.0.5.39-1.cuda11.0.x86_64.rpm

[I downloaded the file directly from the website instead:]

https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/libcudnn8-8.0.5.39-1.cuda11.0.x86_64.rpm

Install

$ rpm -ivh libcudnn8-8.0.5.39-1.cuda11.0.x86_64.rpm

Verify

$ ldconfig -p | grep cudnn
        libcudnn_ops_train.so.8 (libc6,x86-64) => /lib64/libcudnn_ops_train.so.8
        libcudnn_ops_infer.so.8 (libc6,x86-64) => /lib64/libcudnn_ops_infer.so.8
        libcudnn_cnn_train.so.8 (libc6,x86-64) => /lib64/libcudnn_cnn_train.so.8
        libcudnn_cnn_infer.so.8 (libc6,x86-64) => /lib64/libcudnn_cnn_infer.so.8
        libcudnn_adv_train.so.8 (libc6,x86-64) => /lib64/libcudnn_adv_train.so.8
        libcudnn_adv_infer.so.8 (libc6,x86-64) => /lib64/libcudnn_adv_infer.so.8
        libcudnn.so.8 (libc6,x86-64) => /lib64/libcudnn.so.8
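
The same check can be scripted by parsing the ldconfig -p listing. In the sketch below, the function names are hypothetical and the expected library list is simply the seven names from the output above; treating that list as complete for cuDNN 8.0.5 is an assumption.

```python
# The seven cuDNN libraries that appear in the verification output above.
EXPECTED_CUDNN_LIBS = [
    "libcudnn_ops_train.so.8",
    "libcudnn_ops_infer.so.8",
    "libcudnn_cnn_train.so.8",
    "libcudnn_cnn_infer.so.8",
    "libcudnn_adv_train.so.8",
    "libcudnn_adv_infer.so.8",
    "libcudnn.so.8",
]


def parse_ldconfig(output):
    """Map library name -> path for each 'name (arch) => path' line."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue
        left, path = line.split("=>", 1)
        # Drop the '(libc6,x86-64)' part and surrounding whitespace.
        name = left.split("(", 1)[0].strip()
        libs[name] = path.strip()
    return libs


def missing_cudnn_libs(output):
    """Return the expected cuDNN libraries that the listing does not contain."""
    found = parse_ldconfig(output)
    return [name for name in EXPECTED_CUDNN_LIBS if name not in found]
```

Passing the text printed by ldconfig -p to missing_cudnn_libs returns an empty list when all seven libraries are registered.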

Verify the python3 version

$ dnf install python3
$ python3 -V
Python 3.6.

8.3 Install pip3

sudo yum install -y python36-devel
sudo pip3 install -U pip setuptools

Use the pip3 install command to install TensorFlow.

sudo pip3 install tensorflow-gpu==2.4

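8.4 Install the TensorFlow benchmarks

The TensorFlow benchmarks are an open-source ML application for testing the performance of the TensorFlow framework.

If you are using CentOS or Red Hat Linux, you must install Python 3.

You can clone the TensorFlow benchmarks repository to your local environment and switch to a suitable branch.

Prerequisite: confirm that TensorFlow is installed.

yum install git
mkdir -p ~/bitfusion
cd ~/bitfusion
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks

In the repository's benchmarks directory, list the branches: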

git branch -a

[root@localhost benchmarks]# git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/chenGitHuber-patch-1
  remotes/origin/cnn_tf_v1.10_compatible
  remotes/origin/cnn_tf_v1.11_compatible
  remotes/origin/cnn_tf_v1.12_compatible
  remotes/origin/cnn_tf_v1.13_compatible
  remotes/origin/cnn_tf_v1.14_compatible
  remotes/origin/cnn_tf_v1.15_compatible
  remotes/origin/cnn_tf_v1.5_compatible
  remotes/origin/cnn_tf_v1.8_compatible
  remotes/origin/cnn_tf_v1.9_compatible
  remotes/origin/cnn_tf_v2.0_compatible
  remotes/origin/cnn_tf_v2.1_compatible
  remotes/origin/cpbr-patch
  remotes/origin/cpbr-patch-1
  remotes/origin/data-gen
  remotes/origin/feat/log_pip_packages
  remotes/origin/feat/more_pip_pinning
  remotes/origin/fix-class-instantiation
  remotes/origin/fix/perfzero_pip_ver
  remotes/origin/keras-benchmarks
  remotes/origin/master
  remotes/origin/mkl_experiment
  remotes/origin/no_keras_benchmark
  remotes/origin/pkanwar23-patch-1
  remotes/origin/pkanwar23-patch-1-1
  remotes/origin/pkanwar23-patch-2
  remotes/origin/reedwm-patch-1
  remotes/origin/revert-382-python2_fix
  remotes/origin/s4tf
  remotes/origin/tf_benchmark_stage

Switch to the cnn_tf_v2.1_compatible branch:

git checkout cnn_tf_v2.1_compatible
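
The branch list has no entry for TensorFlow 2.4, so this guide uses the newest cnn_tf_vX.Y_compatible branch available. One way to automate that choice is sketched below. The selection rule (newest compatible branch that is not newer than the installed TensorFlow) is an assumption that happens to match the choice made here, and both function names are hypothetical.

```python
import re

BRANCH = re.compile(r"cnn_tf_v(\d+)\.(\d+)_compatible$")


def compatible_branches(branch_listing):
    """Map (major, minor) -> branch name for each cnn_tf_vX.Y_compatible line."""
    found = {}
    for line in branch_listing.splitlines():
        # Drop the '* ' current-branch marker and any ' -> target' suffix.
        stripped = line.strip().lstrip("* ")
        if not stripped:
            continue
        name = stripped.split(" ")[0]
        match = BRANCH.search(name)
        if match:
            found[(int(match[1]), int(match[2]))] = match[0]
    return found


def newest_branch_up_to(branch_listing, tf_version):
    """Pick the newest compatible branch whose version is <= tf_version."""
    candidates = {
        version: branch
        for version, branch in compatible_branches(branch_listing).items()
        if version <= tf_version
    }
    if not candidates:
        return None
    return candidates[max(candidates)]
```

Versions are compared as integer pairs rather than strings, so v1.10 correctly sorts after v1.9. With the listing above and tf_version set to (2, 4), the result is cnn_tf_v2.1_compatible.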

Confirm the current path:

$ pwd
/root/bitfusion/benchmarks

cd ~/bitfusion/

To run the tf_cnn_benchmarks.py benchmark script, use the bitfusion run command.

The command in this example uses the full memory of a single GPU and the pre-installed ML data in the /data directory.

bitfusion run -n 1 -- python3 \
./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW \
--batch_size=64 \
--model=resnet50 \
--variable_update=replicated \
--local_parameter_device=gpu \
--nodistortions \
--num_gpus=1 \
--num_batches=100 \
--data_dir=/data \
--data_name=imagenet \
--use_fp16=False

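To run the tf_cnn_benchmarks.py benchmark script on part of a GPU, pass the -p 0.67 parameter to the bitfusion run command.

The command in this example uses 67% of the memory of a single GPU and the pre-installed ML data in the /data directory. With the -p 0.67 parameter, you can run another job in the remaining 33% of the GPU's memory.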
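
Both variants of the command differ only in the -p option, so they can be assembled from one argument list. The sketch below is illustrative: the function name bitfusion_run is hypothetical, it only builds the command line and does not run it, and it uses just the -n and -p options named in this guide. It also assumes that the benchmark's --num_gpus should match the -n value.

```python
import shlex


def bitfusion_run(num_gpus=1, partial=None):
    """Build the argument list for the benchmark run described above.

    partial is the fraction of GPU memory to request (for example 0.67);
    None requests the full memory of each GPU.
    """
    cmd = ["bitfusion", "run", "-n", str(num_gpus)]
    if partial is not None:
        if not 0 < partial <= 1:
            raise ValueError("partial must be in the range (0, 1]")
        cmd += ["-p", str(partial)]
    cmd += [
        "--",
        "python3",
        "./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
        "--data_format=NCHW",
        "--batch_size=64",
        "--model=resnet50",
        "--variable_update=replicated",
        "--local_parameter_device=gpu",
        "--nodistortions",
        f"--num_gpus={num_gpus}",
        "--num_batches=100",
        "--data_dir=/data",
        "--data_name=imagenet",
        "--use_fp16=False",
    ]
    return cmd


# Print a copy-pasteable command line for the 67% variant.
print(" ".join(shlex.quote(arg) for arg in bitfusion_run(partial=0.67)))
```

The list returned by bitfusion_run can be passed directly to subprocess.run on a client where the bitfusion command is installed.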

Running warm up
2022-10-02 21:34:51.121315: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-10-02 21:34:53.122748: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-10-02 21:34:53.324492: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
Done warm up
Step    Img/sec total_loss
1       images/sec: 407.6 +/- 0.0 (jitter = 0.0)        7.608
10      images/sec: 405.0 +/- 0.9 (jitter = 2.9)        7.849
20      images/sec: 404.5 +/- 0.6 (jitter = 3.1)        8.013
30      images/sec: 403.7 +/- 0.5 (jitter = 3.6)        7.940
40      images/sec: 403.0 +/- 0.4 (jitter = 3.8)        8.137
50      images/sec: 403.0 +/- 0.4 (jitter = 3.7)        8.052
60      images/sec: 402.9 +/- 0.3 (jitter = 3.4)        7.784
70      images/sec: 402.5 +/- 0.6 (jitter = 2.7)        7.855
80      images/sec: 402.3 +/- 0.6 (jitter = 3.2)        8.012
90      images/sec: 402.4 +/- 0.5 (jitter = 2.7)        7.840
100     images/sec: 402.5 +/- 0.5 (jitter = 2.5)        8.090
----------------------------------------------------------------
total images/sec: 402.22
----------------------------------------------------------------
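
To compare runs, for example full GPU against -p 0.67, the throughput figures can be pulled out of the log. This is a minimal sketch that assumes the log layout shown above; the function name summarize is hypothetical.

```python
import re

# Patterns modeled on the step lines and the summary line in the log above.
STEP = re.compile(r"^(\d+)\s+images/sec: ([\d.]+) \+/- ([\d.]+)")
TOTAL = re.compile(r"^total images/sec: ([\d.]+)")


def summarize(log_text):
    """Return ({step: images_per_sec}, total_images_per_sec or None)."""
    steps = {}
    total = None
    for line in log_text.splitlines():
        match = STEP.match(line)
        if match:
            steps[int(match[1])] = float(match[2])
            continue
        match = TOTAL.match(line)
        if match:
            total = float(match[1])
    return steps, total
```

Applied to the log above, the per-step dictionary holds the eleven reported steps and the total is 402.22 images/sec.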