2nd Generation Intel® Xeon® Scalable Processors Outperform NVIDIA* GPUs on NCF Deep Learning Inference

This article is a reference translation of "2nd Generation Intel® Xeon® Scalable CPUs Outperform NVIDIA GPUs on NCF Deep Learning Inference", published on the Intel® AI Blog.

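Recommender systems are among the most complex and widely deployed commercial AI applications that internet companies run today. One of the biggest challenges in using these systems is collaborative filtering: predicting a user's interests from the preferences of that user and of similar users. A newer model called Neural Collaborative Filtering (NCF) uses deep learning to learn the interactions between users and items and so improve recommendation performance. MLPerf* (https://mlperf.org/) has adopted NCF (https://github.com/mlperf/training/blob/master/recommendation/pytorch/ncf.py) as a key benchmark.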
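
To make the idea of learning user-item interactions concrete, here is a minimal sketch of a NeuMF-style forward pass, the NCF variant that combines an element-wise product of embeddings (GMF) with a small multilayer perceptron. It is an illustration only: the class name `TinyNeuMF`, the layer sizes, and the random, untrained weights are all hypothetical and are not taken from the MXNet example used later in this article.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU; one weight row per output unit."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


class TinyNeuMF:
    """Hypothetical, untrained NeuMF-style scorer (illustration only)."""

    def __init__(self, num_users, num_items, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        # Separate embedding tables for the GMF branch and the MLP branch.
        self.user_gmf = make_matrix(num_users, dim, rng)
        self.item_gmf = make_matrix(num_items, dim, rng)
        self.user_mlp = make_matrix(num_users, dim, rng)
        self.item_mlp = make_matrix(num_items, dim, rng)
        self.w_hidden = make_matrix(hidden, 2 * dim, rng)
        self.b_hidden = [0.0] * hidden
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(dim + hidden)]
        self.b_out = 0.0

    def predict(self, user, item):
        # GMF branch: element-wise product of user and item embeddings.
        gmf = [u * i for u, i in zip(self.user_gmf[user], self.item_gmf[item])]
        # MLP branch: concatenate the embeddings and apply one hidden layer.
        mlp_in = self.user_mlp[user] + self.item_mlp[item]
        mlp = dense_relu(mlp_in, self.w_hidden, self.b_hidden)
        # Fuse both branches and squash to an interaction probability.
        logit = sum(w * v for w, v in zip(self.w_out, gmf + mlp)) + self.b_out
        return 1.0 / (1.0 + math.exp(-logit))


model = TinyNeuMF(num_users=3, num_items=5)
scores = [model.predict(0, item) for item in range(5)]
# Rank the items for user 0 from highest to lowest predicted score.
ranking = sorted(range(5), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In a trained model the embeddings and weights would be learned from interaction data such as MovieLens; inference then amounts to running this forward pass for many (user, item) pairs, which is the workload the benchmark below measures.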

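Through recent hardware advances, software tool development, and framework optimizations, Intel has significantly improved deep learning performance on CPUs. Using the Intel® Deep Learning Boost (Intel® DL Boost) feature of 2nd Generation Intel® Xeon® Scalable processors, Intel achieved NCF model inference performance of more than 64 million requests per second with less than 1.22 milliseconds (msec) of latency on a dual-socket Intel® Xeon® Platinum 9282 processor-based system, outperforming the NCF GPU performance that NVIDIA published on January 16, 2020 ^[1].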

Model	Platform	Throughput	Latency	Precision	Dataset
NCF	Intel® Xeon® Platinum 9282 processor	64.54 million requests/sec	1.22 msec	INT8	MovieLens 20M
NCF	NVIDIA* V100 Tensor Core GPU	61.94 million requests/sec	20 msec	Mixed	MovieLens 20M
NCF	NVIDIA* T4 Tensor Core GPU	55.34 million requests/sec	1.8 msec	INT8	Synthetic

Figure 1: Performance comparison of 2nd Generation Intel® Xeon® Scalable processors and NVIDIA* GPUs on the NCF model.

For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks/. For notices regarding the performance and optimization of Intel® software products, see http://software.intel.com/en-us/articles/optimization-notice.

Steps to reproduce:

Step 1: Install Intel® Math Kernel Library (Intel® MKL) from the YUM or APT repository (https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo).

Step 2: Build MXNet with Intel® MKL and set up the runtime environment.

Intel® Deep Neural Network Library (Intel® DNNL), formerly Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), is enabled by default. For details, see "Install MXNet with Intel® MKL-DNN".

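# fetch MXNet at the commit used for the benchmark
# (assumption: run from /workspace so that the paths used below resolve)
git clone https://github.com/apache/incubator-mxnet
cd ./incubator-mxnet && git checkout dfa3d07
git submodule update --init --recursive
# build against Intel MKL installed under /opt/intel
make -j USE_BLAS=mkl USE_INTEL_PATH=/opt/intel
# set up the Intel runtime environment and expose the MXNet Python package
source /opt/intel/bin/compilervars.sh intel64
export PYTHONPATH=/workspace/incubator-mxnet/python/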

Step 3: Launch NCF (see the README for details). The following commands run a simple benchmark.

# go to NCF dir
cd /workspace/incubator-mxnet/example/neural_collaborative_filtering/
# install some python libraries
pip install numpy pandas scipy tqdm
# prepare ml-20m dataset
python convert.py
# download pre-trained models
# optimize model
python model_optimizer.py
# calibration on ml-20m dataset
python ncf.py --prefix=./model/ml-20m/neumf-opt --calibration
# benchmark
bash benchmark.sh -p model/ml-20m/neumf-opt-quantized
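
The calibration step above produces an INT8 model. As a rough illustration of what INT8 calibration does in general, the sketch below derives a threshold from sample data and maps floating-point values onto the signed 8-bit range with a single scale factor. This is a simplified, symmetric max-abs scheme written for this article; it is not MXNet's implementation, the helper names are hypothetical, and real calibration may choose thresholds differently.

```python
def calibrate_threshold(samples):
    """Return the largest absolute value seen in the calibration samples."""
    threshold = max(abs(v) for v in samples)
    if threshold <= 0:
        raise ValueError("calibration data must contain a non-zero value")
    return threshold


def quantize_int8(values, threshold):
    """Map floats to integers in [-127, 127] using one symmetric scale."""
    scale = threshold / 127.0
    quantized = []
    for v in values:
        level = round(v / scale)
        # Values beyond the calibrated threshold saturate at the range limits.
        quantized.append(max(-127, min(127, level)))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 levels."""
    return [level * scale for level in quantized]


# Hypothetical calibration data standing in for activations seen on real inputs.
calibration_data = [-0.8, 0.1, 0.45, 1.27, -0.3]
threshold = calibrate_threshold(calibration_data)

q, scale = quantize_int8([0.5, -1.27, 2.0], threshold)
restored = dequantize(q, scale)
print(q, restored)
```

The value 2.0 lies outside the calibrated range and saturates, which is why calibration on representative data (here, the ml-20m dataset) matters: the threshold trades clipping of rare large values against resolution for common small ones.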

Summary

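As this article has shown, Intel® Xeon® Scalable processors are highly effective for NCF model inference. Going forward, Intel plans to accelerate a broader range of recommender system models such as DLRM, and to add new extensions to Intel® DL Boost in the next-generation Intel® Xeon® Scalable processors scheduled for release in 2020, in order to demonstrate training efficiency with mixed precision that combines single precision (float 32) and 16-bit half precision (bfloat16).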

^[1] System configuration

Intel Xeon Platinum 9282 Processor: Tested by Intel as of 01/17/2020. DL Inference: Platform: Intel S2900WK 2S Intel Xeon Platinum 9282 (56 cores per socket), HT ON, turbo ON, Total Memory 384 GB (24 slots/ 16 GB/ 2933 MHz), microcode:0x500002c, BIOS: PLYXCRB1.86B.0576.D18.1902140627, Ubuntu 18.04.2 LTS (GNU/Linux 5.4.0-050400-generic x86_64), Deep Learning Framework: Apache MXNet version: https://github.com/apache/incubator-mxnet Commit id: dfa3d07, GCC 7.4.0 for build, DNNL version: v1.1.2 (commit hash: cb2cc7a), model: https://github.com/apache/incubator-mxnet/tree/dfa3d07/example/neural_collaborative_filtering, BS=700, Real Data:MovieLens-20M, 112 software instance/2 socket, Datatype: INT8; throughput: 64.54 million samples/s.

NVIDIA performance and configuration details taken from https://developer.nvidia.com/deep-learning-performance-training-inference on 01/16/2020. Batch latency of NVIDIA V100 is inferred by equation ‘1048576*1000/61941700~=16.92ms=0.01692s~=0.02s=20ms’
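
The batch-latency estimate in the note above can be reproduced with a few lines of arithmetic. The sketch below applies the same formula (batch size divided by throughput) to the V100 figures. As an additional assumption that the source does not state, it applies the formula to the Intel configuration by supposing that the 112 instances run concurrently with BS=700 each and share the 64.54 million samples/s equally.

```python
def batch_latency_ms(batch_size, samples_per_second):
    """Time in milliseconds to process one batch at the given throughput."""
    return batch_size * 1000.0 / samples_per_second


# NVIDIA V100: batch size and throughput as used in the note above.
v100_ms = batch_latency_ms(1048576, 61941700)

# Intel Xeon Platinum 9282 (assumption: 112 concurrent instances, each
# processing batches of 700 and receiving an equal share of throughput).
xeon_per_instance = 64.54e6 / 112
xeon_ms = batch_latency_ms(700, xeon_per_instance)

print(f"V100 batch latency: {v100_ms:.2f} ms")
print(f"Xeon per-instance batch latency: {xeon_ms:.2f} ms")
```

Under these assumptions the V100 estimate comes out at roughly 16.9 ms, which the note rounds to 20 ms, and the Xeon estimate lands just under the 1.22 msec reported in the table.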

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Refer to http://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.

Legal Notices

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available security updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.