Run DAPPLE

record machine learning

Published: 2021-06-01 10:11

Reference and Notations
README
What is HPGO
Get Started

Reference and Notations

Item	Instructions
DAPPLE	github
HPGO	github

README

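DAPPLE consists of three components:

profiler: takes a DNN model as input and outputs the execution time, activation size, and parameter size of each layer
planner: takes the profiling results as input and, given the global batch size, generates an optimized hybrid parallelization strategy
runtime system: the execution framework.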
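
To make the hand-off from profiler to planner concrete, here is a minimal sketch. It is not DAPPLE's planner or HPGO's API: the `LayerProfile` record, its field names, and `split_two_stages` are hypothetical, and the toy objective only balances compute time across two pipeline stages while ignoring communication and memory.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayerProfile:
    """Hypothetical per-layer profiling record (field names are illustrative)."""
    name: str
    exec_time_ms: float
    activation_bytes: int
    param_bytes: int


def split_two_stages(layers: List[LayerProfile]) -> Tuple[int, float]:
    """Return (cut, bottleneck_ms) for the most balanced two-stage split.

    Layers [0, cut) form stage 0 and layers [cut, n) form stage 1.
    Only compute time is considered; a real planner also weighs
    communication cost, memory, and replication.
    """
    if len(layers) < 2:
        raise ValueError("need at least two layers to split")
    total = sum(layer.exec_time_ms for layer in layers)
    best_cut, best_bottleneck = 1, float("inf")
    prefix = 0.0
    for cut in range(1, len(layers)):
        prefix += layers[cut - 1].exec_time_ms
        # The slower stage bounds pipeline throughput.
        bottleneck = max(prefix, total - prefix)
        if bottleneck < best_bottleneck:
            best_cut, best_bottleneck = cut, bottleneck
    return best_cut, best_bottleneck


# Made-up numbers, purely for illustration.
profile = [
    LayerProfile("conv1", 4.0, 1000, 100),
    LayerProfile("conv2", 6.0, 1000, 100),
    LayerProfile("fc1", 7.0, 1000, 100),
    LayerProfile("fc2", 3.0, 1000, 100),
]
cut, bottleneck = split_two_stages(profile)
print(cut, bottleneck)
```

The point of the sketch is only that the planner needs nothing beyond the cached per-layer numbers, which is why it can run on a machine other than the one that was profiled.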

The repo includes implementations of the following models:

VGG19
AmoebaNet
BERT
GNMT
XLNET

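Using the profiler

The README does not explain how to use the profiler, but it says that profiling results for some models are provided.

The repo does not appear to provide a way to reproduce the profiling step.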

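Using the planner

Profiler results depend on the machine, but the planner reads the cached profiling results directly, so reproducing the planner's results does not depend on the machine.

The GitHub repo provides detailed instructions for using the planner.

"DAPPLE Planner Experiments Reproduction" notes: "Currently our profiling is done offline, and the results are cached within the profiler folder."

Use the Python API for single-threaded planning and the Rust API for multi-threaded planning.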

Using the runtime system

See the run.sh in each model's directory.

What is HPGO

HPGO comes up frequently in DAPPLE. So what is HPGO?

HPGO: Hybrid Parallelism Global Orchestration

Get Started

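Choosing a suitable Docker image

Looking at the DAPPLE code, the part of the planner that calls HPGO requires Python 3.6 to 3.7, which can be handled with a conda virtual environment. The vgg19 code uses Python 2 (python2.7) and imports tensorflow and horovod.tensorflow. The TensorFlow version is not stated, but judging from the APIs used it is 1.x. I looked for a suitable image among the Horovod Docker images.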

I chose horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py2.7
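
The requirements above (Python 2.7 and TensorFlow 1.x for the vgg19 runtime) can be checked against an image tag before pulling it. The sketch below assumes the tag follows the `<horovod>-<name><version>-...` pattern seen in the tag I chose; `parse_horovod_tag` and `fits_vgg19_runtime` are hypothetical helpers, not part of Horovod or Docker.

```python
import re


def parse_horovod_tag(image):
    """Split 'repo:tag' and extract component versions from the tag.

    Assumes the first dash-separated field is the Horovod version and the
    rest look like '<name><version>', e.g. 'tf1.14.0' or 'py2.7'.
    """
    repo, _, tag = image.partition(":")
    parts = tag.split("-")
    versions = {"horovod": parts[0]}
    for part in parts[1:]:
        match = re.match(r"([a-z]+)(\d[\d.]*)$", part)
        if match:
            versions[match.group(1)] = match.group(2)
    return repo, versions


def fits_vgg19_runtime(versions):
    """True if the image ships Python 2.7 and a TensorFlow 1.x build."""
    return versions.get("py") == "2.7" and versions.get("tf", "").startswith("1.")


repo, versions = parse_horovod_tag(
    "horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py2.7"
)
print(repo, versions, fits_vgg19_runtime(versions))
```

The planner's Python 3.6 to 3.7 requirement is separate and is handled by the conda environment, so it is not checked here.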

Modifying vgg19/run.sh

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..dc0a722 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
 batch_size=32 # max = 190
 batch_num=12
 replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
 local_test=true

Error 0

Traceback (most recent call last):
  File "tf-keras-dapple.py", line 391, in <module>
    est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "tf-keras-dapple.py", line 254, in model_fn
    total_loss, outputs = model.build(features, labels)
  File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 329, in build
    x = self.block4_conv1[0](x)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 561, in __call__
    base_layer_utils.create_keras_history(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 200, in create_keras_history
    _, created_layers = _create_keras_history_helper(tensors, set(), [])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 244, in _create_keras_history_helper
    constants[i] = backend.function([], op_input)([])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 3253, in __call__
    session = get_session(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 462, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 879, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation block1_conv1/kernel/Initializer/random_uniform/RandomUniform: Could not satisfy explicit device specification '' because the node node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) placed on device Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
  with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174>  was colocated with a group of nodes that required incompatible device '/job:worker/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3].
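
One thing the message does show is that the requested device lives under `/job:worker`, while every device the session lists lives under `/job:localhost`. The sketch below only restates that comparison in code. The `job_of` and `diagnose` helpers are hypothetical, the device list is abbreviated from the log, and the guess that the failing session was created against the local runtime rather than the worker's gRPC target is an assumption, not a confirmed diagnosis.

```python
import re


def job_of(device):
    """Return the job name from a TensorFlow device string, or None."""
    match = re.match(r"/job:([^/]+)/", device)
    return match.group(1) if match else None


def diagnose(requested, available):
    """Classify why a requested device cannot be satisfied."""
    if requested in available:
        return "ok"
    available_jobs = {job_of(device) for device in available}
    if job_of(requested) not in available_jobs:
        # No visible device belongs to the requested job at all.
        return "job-mismatch"
    return "device-missing"


requested = "/job:worker/replica:0/task:0/device:GPU:0"
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:1",
]
print(diagnose(requested, available))
```

A "job-mismatch" result means adding more GPUs would not help; the session simply does not see the `worker` job.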

Modification

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..dc0a722 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
 batch_size=32 # max = 190
 batch_num=12
 replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
 local_test=true

Filing an issue

Hello, I tried to run DAPPLE on my server, but encountered some odd problems.

The following is the hardware information for my environment.

Item	type/version
GPU	1080Ti x 4
CPU

I pulled the Docker image horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py2.7 and used nvidia-docker to start it.

The following is the software information for my environment.

Item	version
horovod	0.18
tensorflow	1.14.0
mpirun	4.0.0
cuda	10.0
nvidia driver	418.197.02

I found that the code is written for 8 GPUs, so I modified it as pasted below:

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..57674f7 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
 batch_size=32 # max = 190
 batch_num=12
 replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
 local_test=true

 export CUDA_VISIBLE_DEVICES=2,3
@@ -16,7 +16,7 @@ if [ ${cross_node} = "0" ]; then
 python tf-keras-ds.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none

 else
-ip_list="localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost"
+ip_list="localhost,localhost,localhost,localhost"
 worker_hosts=${local_ip}
 if [ ${device_num} != "1" ]; then
   if [ ${local_test} != "true" ]; then
@@ -28,9 +28,9 @@ fi
 echo ${worker_hosts}

 if [ ${local_test} == "true" ]; then
-CUDA_VISIBLE_DEVICES=4,5,6,7 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO nohup python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --job_name=worker --task_index=1 --worker_hosts=${worker_hosts} > ${inst_id}_2.log 2>&1 &
+CUDA_VISIBLE_DEVICES=2,3 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO nohup python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --job_name=worker --task_index=1 --worker_hosts=${worker_hosts} > ${inst_id}_2.log 2>&1 &
 fi

-CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --num_replica=${replica} --job_name=worker --task_index=0 --worker_hosts=${worker_hosts} #> ${inst_id}.log 2>&1 &
+CUDA_VISIBLE_DEVICES=0,1 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --num_replica=${replica} --job_name=worker --task_index=0 --worker_hosts=${worker_hosts} #> ${inst_id}.log 2>&1 &
 fi


diff --git a/vgg19/applications/cluster_utils.py b/vgg19/applications/cluster_utils.py
index c7bc21b..edf483d 100644
--- a/vgg19/applications/cluster_utils.py
+++ b/vgg19/applications/cluster_utils.py
@@ -2,7 +2,7 @@ import tensorflow as tf
 import horovod.tensorflow as hvd
 from tensorflow.python.platform import flags
 FLAGS = flags.FLAGS
-MAX_GPUS_PER_NODE = 8
+MAX_GPUS_PER_NODE = 4

 def get_cluster_manager(config_proto):
   """Returns the cluster manager to be used."""



diff --git a/vgg19/tf-keras-dapple.py b/vgg19/tf-keras-dapple.py
index 41bb661..a0cee41 100755
--- a/vgg19/tf-keras-dapple.py
+++ b/vgg19/tf-keras-dapple.py
@@ -100,7 +100,7 @@ def prepare_tf_config():


 if __name__ == '__main__':
-    default_raw_data_dir = '/tmp/dataset/mini-imagenet/raw-data/train/n01440764/'
+    default_raw_data_dir = '/data/DNN_Dataset/imagenet/tiny/meshtf/mininet/mininet/mini-imagenet-sp2/train/n01532829/'
     default_ckpt_dir = 'mycheckpoint'
     flags.DEFINE_string('model', 'resnet50', 'imagenet model name.')
     flags.DEFINE_string('strategy', 'none', 'strategy of variable updating')

Executing bash run.sh gave the following result. I searched for the error online but could not solve it, so I'm here to ask for your help. BTW, the runtime system should take the planner result as input, but I don't see it in run.sh or anywhere else. Can you point out where the planner result is used?

>>> bash run.sh
localhost,localhost
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

logdir = 
mkdir: missing operand
Try 'mkdir --help' for more information.
('Training model: ', 'vgg19')
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From tf-keras-dapple.py:159: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From tf-keras-dapple.py:159: The name tf.logging.DEBUG is deprecated. Please use tf.compat.v1.logging.DEBUG instead.

WARNING:tensorflow:From tf-keras-dapple.py:172: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From tf-keras-dapple.py:175: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

['localhost:4000', 'localhost:5000']
WARNING:tensorflow:From /root/source/DAPPLE/vgg19/applications/cluster_utils.py:57: The name tf.train.replica_device_setter is deprecated. Please use tf.compat.v1.train.replica_device_setter instead.

WARNING:tensorflow:From /root/source/DAPPLE/vgg19/applications/cluster_utils.py:80: The name tf.train.Server is deprecated. Please use tf.distribute.Server instead.

2021-06-02 08:21:53.616519: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-06-02 08:21:53.623581: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-06-02 08:21:54.910991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5591bba7fb30 executing computations on platform CUDA. Devices:
2021-06-02 08:21:54.911022: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2021-06-02 08:21:54.911028: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2021-06-02 08:21:54.936597: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2095155000 Hz
2021-06-02 08:21:54.936705: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5591bba8d3b0 executing computations on platform Host. Devices:
2021-06-02 08:21:54.936718: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2021-06-02 08:21:54.939208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
2021-06-02 08:21:54.940335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2021-06-02 08:21:54.940375: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:21:54.941705: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-06-02 08:21:54.942817: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-06-02 08:21:54.943097: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-06-02 08:21:54.944652: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2021-06-02 08:21:54.945872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2021-06-02 08:21:54.950215: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-06-02 08:21:54.955523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2021-06-02 08:21:54.955562: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:21:55.338971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 08:21:55.338998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 
2021-06-02 08:21:55.339005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y 
2021-06-02 08:21:55.339009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N 
2021-06-02 08:21:55.345703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 10452 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-06-02 08:21:55.347717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2021-06-02 08:21:55.350960: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:250] Initialize GrpcChannelCache for job worker -> {0 -> localhost:4000, 1 -> localhost:5000}
2021-06-02 08:21:55.351616: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:4000
INFO:tensorflow:TF_CONFIG environment variable: {u'cluster': {u'chief': [u'localhost:4000'], u'worker': [u'localhost:5000']}, u'task': {u'index': 0, u'type': u'chief'}}
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp5c7pnu
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_task_type': u'chief', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe074a88a90>, '_model_dir': '/tmp/tmp5c7pnu', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 2, '_task_id': 0, '_log_step_count_steps': 10, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': u'grpc://localhost:4000'}
Training starts.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py:236: initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Using Fake IO
((224, 224, 3), (224, 224))
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From tf-keras-dapple.py:73: prefetch_to_device (from tensorflow.contrib.data.python.ops.prefetching_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.prefetch_to_device(...)`.
INFO:tensorflow:Calling model_fn.
#### Devices ####
hvd.local_rank() == 0
['/job:worker/replica:0/task:0/device:GPU:0', '/job:worker/replica:0/task:1/device:GPU:0']
#### Replica Devices ####
['/job:worker/replica:0/task:0/device:GPU:0', '/job:worker/replica:0/task:0/device:GPU:1']
2021-06-02 08:22:00.197197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
2021-06-02 08:22:00.198343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2021-06-02 08:22:00.198377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:22:00.198403: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-06-02 08:22:00.198416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-06-02 08:22:00.198429: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-06-02 08:22:00.198441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2021-06-02 08:22:00.198453: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2021-06-02 08:22:00.198466: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-06-02 08:22:00.202942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2021-06-02 08:22:00.203023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 08:22:00.203032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 
2021-06-02 08:22:00.203037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y 
2021-06-02 08:22:00.203042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N 
2021-06-02 08:22:00.206630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10452 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-06-02 08:22:00.207768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "tf-keras-dapple.py", line 391, in <module>
    est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "tf-keras-dapple.py", line 254, in model_fn
    total_loss, outputs = model.build(features, labels)
  File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 329, in build
    x = self.block4_conv1[0](x)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 561, in __call__
    base_layer_utils.create_keras_history(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 200, in create_keras_history
    _, created_layers = _create_keras_history_helper(tensors, set(), [])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 244, in _create_keras_history_helper
    constants[i] = backend.function([], op_input)([])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 3253, in __call__
    session = get_session(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 462, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 879, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation block1_conv1/kernel/Initializer/random_uniform/RandomUniform: Could not satisfy explicit device specification '' because the node node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) placed on device Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
  with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174>  was colocated with a group of nodes that required incompatible device '/job:worker/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/job:worker/replica:0/task:0/device:GPU:0' assigned_device_name_='' resource_device_name_='/job:worker/replica:0/task:0/device:GPU:0' supported_device_types_=[GPU, CPU, XLA_CPU, XLA_GPU] possible_devices_=[]
AssignVariableOp: GPU CPU XLA_CPU XLA_GPU 
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU 
Mul: GPU CPU XLA_CPU XLA_GPU 
Add: GPU CPU XLA_CPU XLA_GPU 
Sub: GPU CPU XLA_CPU XLA_GPU 
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU 
RandomUniform: GPU CPU XLA_CPU XLA_GPU 
VarHandleOp: GPU CPU XLA_CPU XLA_GPU 
Const: GPU CPU XLA_CPU XLA_GPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  block1_conv1/kernel/Initializer/random_uniform/shape (Const) 
  block1_conv1/kernel/Initializer/random_uniform/min (Const) 
  block1_conv1/kernel/Initializer/random_uniform/max (Const) 
  block1_conv1/kernel/Initializer/random_uniform/RandomUniform (RandomUniform) 
  block1_conv1/kernel/Initializer/random_uniform/sub (Sub) 
  block1_conv1/kernel/Initializer/random_uniform/mul (Mul) 
  block1_conv1/kernel/Initializer/random_uniform (Add) 
  block1_conv1/kernel (VarHandleOp) /job:worker/replica:0/task:0/device:GPU:0
  block1_conv1/kernel/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /job:worker/replica:0/task:0/device:GPU:0
  block1_conv1/kernel/Assign (AssignVariableOp) /job:worker/replica:0/task:0/device:GPU:0
  block1_conv1/kernel/Read/ReadVariableOp (ReadVariableOp) /job:worker/replica:0/task:0/device:GPU:0
  block1_conv1/Conv2D/ReadVariableOp (ReadVariableOp) /job:worker/replica:0/task:0/device:GPU:0
  VarIsInitializedOp_18 (VarIsInitializedOp) /job:worker/replica:0/task:1/device:GPU:0

	 [[node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) ]]Additional information about colocations:No node-device colocations were active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation.
Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
  with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
  with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174>

Original stack trace for u'block1_conv1/kernel/Initializer/random_uniform/RandomUniform':
  File "tf-keras-dapple.py", line 391, in <module>
    est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "tf-keras-dapple.py", line 254, in model_fn
    total_loss, outputs = model.build(features, labels)
  File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 280, in build
    x = self.block1_conv1[i](img_input)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 591, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1881, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 165, in build
    dtype=self.dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 384, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/tracking/base.py", line 663, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 155, in make_variable
    shape=variable_shape if variable_shape.rank else None)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 259, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 198, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2495, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 263, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 460, in __init__
    shape=shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 604, in _init_from_args
    initial_value() if init_from_fn else initial_value,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 135, in <lambda>
    init_val = lambda: initializer(shape, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 533, in __call__
    shape, -limit, limit, dtype, seed=self.seed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 247, in random_uniform
    rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 820, in random_uniform
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63853,1],0]
  Exit code:    1
--------------------------------------------------------------------------
