Run dapple

  1. Reference and Notations
  2. README
    1. Using the profiler
    2. Using the planner
    3. Using the runtime system
  3. What is HPGO
  4. Get Started
    1. Choosing a suitable Docker image
    2. Modifying vgg19/run.sh
    3. Error 0
      1. Fix
    4. Submitting an issue

Reference and Notations

Item Instructions
DAPPLE github
HPGO github

README

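DAPPLE consists of three parts:

  1. profiler: takes a DNN model as input and outputs the execution time, activation size, and parameter size of each layer.
  2. planner: takes the profiling results as input and, given the global batch size, generates an optimized hybrid parallelization strategy.
  3. runtime system: the execution framework.

The repo contains implementations of the following models: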

  • VGG19
  • AmoebaNet
  • BERT
  • GNMT
  • XLNET

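Using the profiler

The README does not explain how to use the profiler, but it says that profiling results are provided for some models.

The repo does not appear to provide a way to reproduce the profiling step.

Using the planner

Profiler results depend on the machine, but the planner reads the cached profiling results directly, so reproducing the planner results does not depend on the machine.

The GitHub repo provides detailed instructions for using the planner.

"DAPPLE Planner Experiments Reproduction" notes: "Currently our profiling is done offline, and the results are cached within the profiler folder".

Use the Python API for single-threaded runs and the Rust API for multi-threaded runs.

Using the runtime system

See run.sh in each model's directory.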

What is HPGO

DAPPLE uses HPGO frequently. So what is HPGO?

HPGO: Hybrid Parallelism Global Orchestration

Get Started

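Choosing a suitable Docker image

Looking at the DAPPLE code, the part of the planner that calls HPGO requires Python 3.6 to 3.7, which can be handled with a conda virtual environment. The vgg19 code uses Python 2 (python2.7) and imports tensorflow and horovod.tensorflow. The TensorFlow version is not stated, but judging from the APIs used it is 1.x. I looked for a suitable image among the horovod Docker images.

I chose horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py2.7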
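
Before running anything inside the container, it may help to confirm that the interpreter and TensorFlow match what the code appears to expect. The following is a minimal sketch, assuming the requirements inferred above (Python 2.7 and TensorFlow 1.x for vgg19). The helper name check_runtime is hypothetical and not part of DAPPLE.

```python
import sys


def check_runtime(py_version, tf_version):
    """Return a list of problems for the vgg19 runtime (empty means OK).

    py_version: tuple such as (2, 7); tf_version: string such as "1.14.0".
    The expected values are assumptions inferred from reading the DAPPLE code.
    """
    problems = []
    major_minor = tuple(py_version[:2])
    if major_minor != (2, 7):
        problems.append("expected Python 2.7, got %d.%d" % major_minor)
    if tf_version.split(".")[0] != "1":
        problems.append("expected TensorFlow 1.x, got %s" % tf_version)
    return problems


if __name__ == "__main__":
    try:
        import tensorflow as tf
        tf_version = tf.__version__
    except ImportError:
        tf_version = "0 (not installed)"
    # Print each problem, or a single OK line when the list is empty.
    for line in check_runtime(sys.version_info, tf_version) or ["runtime looks OK"]:
        print(line)
```

Run inside the chosen image, the script should print a single OK line; anywhere else it lists what differs from the assumed setup.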

Modifying vgg19/run.sh

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..dc0a722 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
batch_size=32 # max = 190
batch_num=12
replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
local_test=true

Error 0

Traceback (most recent call last):
File "tf-keras-dapple.py", line 391, in <module>
est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "tf-keras-dapple.py", line 254, in model_fn
total_loss, outputs = model.build(features, labels)
File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 329, in build
x = self.block4_conv1[0](x)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 561, in __call__
base_layer_utils.create_keras_history(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 200, in create_keras_history
_, created_layers = _create_keras_history_helper(tensors, set(), [])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 244, in _create_keras_history_helper
constants[i] = backend.function([], op_input)([])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 3253, in __call__
session = get_session(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 462, in get_session
_initialize_variables(session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 879, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation block1_conv1/kernel/Initializer/random_uniform/RandomUniform: Could not satisfy explicit device specification '' because the node node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) placed on device Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174> was colocated with a group of nodes that required incompatible device '/job:worker/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3].
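
The key detail in this message is that the op requests a device under job worker, while every entry in the "All available devices" list is under job localhost. The sketch below uses hypothetical helpers (not part of TensorFlow or DAPPLE) to extract the job names from such device strings so the mismatch is easy to spot. The device list is abbreviated from the message above. This only illustrates how to read the error; it does not fix it.

```python
import re

# Matches the job name in a TensorFlow device string such as
# "/job:worker/replica:0/task:0/device:GPU:0".
_JOB_RE = re.compile(r"/job:([^/]+)")


def job_names(device_strings):
    """Return the set of job names found in the given device strings."""
    jobs = set()
    for dev in device_strings:
        match = _JOB_RE.search(dev)
        if match:
            jobs.add(match.group(1))
    return jobs


def job_mismatch(requested, available):
    """Return True if the requested device's job is not among the available jobs."""
    return not (job_names([requested]) <= job_names(available))


requested = "/job:worker/replica:0/task:0/device:GPU:0"
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:1",
]
print(sorted(job_names(available)))         # ['localhost']
print(job_mismatch(requested, available))   # True
```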

Fix

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..dc0a722 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
batch_size=32 # max = 190
batch_num=12
replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
local_test=true

Submitting an issue

Hello, I tried to run DAPPLE on my server, but encountered some odd problems.

The following is the hardware information of my environment.

Item type/version
GPU 1080Ti x 4
CPU

I pulled the Docker image horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py2.7 and used nvidia-docker to start it.

The following is the software information of my environment.

Item version
horovod 0.18
tensorflow 1.14.0
mpirun 4.0.0
cuda 10.0
nvidia driver 418.197.02

I found that the code is written for 8 GPUs, so I modified it as pasted below:

diff --git a/vgg19/run.sh b/vgg19/run.sh
index 7617b7e..57674f7 100755
--- a/vgg19/run.sh
+++ b/vgg19/run.sh
@@ -4,8 +4,8 @@ device_num=2
batch_size=32 # max = 190
batch_num=12
replica=2
-remote_ip=<SET YOUR REMOTE IP>
-local_ip=<SET YOUR LOCAL IP>
+remote_ip=none
+local_ip=localhost
local_test=true

export CUDA_VISIBLE_DEVICES=2,3
@@ -16,7 +16,7 @@ if [ ${cross_node} = "0" ]; then
python tf-keras-ds.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none

else
-ip_list="localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost"
+ip_list="localhost,localhost,localhost,localhost"
worker_hosts=${local_ip}
if [ ${device_num} != "1" ]; then
if [ ${local_test} != "true" ]; then
@@ -28,9 +28,9 @@ fi
echo ${worker_hosts}

if [ ${local_test} == "true" ]; then
-CUDA_VISIBLE_DEVICES=4,5,6,7 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO nohup python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --job_name=worker --task_index=1 --worker_hosts=${worker_hosts} > ${inst_id}_2.log 2>&1 &
+CUDA_VISIBLE_DEVICES=2,3 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO nohup python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --job_name=worker --task_index=1 --worker_hosts=${worker_hosts} > ${inst_id}_2.log 2>&1 &
fi

-CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --num_replica=${replica} --job_name=worker --task_index=0 --worker_hosts=${worker_hosts} #> ${inst_id}.log 2>&1 &
+CUDA_VISIBLE_DEVICES=0,1 mpirun -np $np --host ${ip_list} --allow-run-as-root -x NCCL_DEBUG=INFO python tf-keras-dapple.py --fake_io=True --model=vgg19 --num_batches=10000 --batch_size=${batch_size} --strategy=none --cross_pipeline=True --pipeline_device_num=${device_num} --micro_batch_num=${batch_num} --num_replica=${replica} --job_name=worker --task_index=0 --worker_hosts=${worker_hosts} #> ${inst_id}.log 2>&1 &
fi


diff --git a/vgg19/applications/cluster_utils.py b/vgg19/applications/cluster_utils.py
index c7bc21b..edf483d 100644
--- a/vgg19/applications/cluster_utils.py
+++ b/vgg19/applications/cluster_utils.py
@@ -2,7 +2,7 @@ import tensorflow as tf
import horovod.tensorflow as hvd
from tensorflow.python.platform import flags
FLAGS = flags.FLAGS
-MAX_GPUS_PER_NODE = 8
+MAX_GPUS_PER_NODE = 4

def get_cluster_manager(config_proto):
"""Returns the cluster manager to be used."""



diff --git a/vgg19/tf-keras-dapple.py b/vgg19/tf-keras-dapple.py
index 41bb661..a0cee41 100755
--- a/vgg19/tf-keras-dapple.py
+++ b/vgg19/tf-keras-dapple.py
@@ -100,7 +100,7 @@ def prepare_tf_config():


if __name__ == '__main__':
- default_raw_data_dir = '/tmp/dataset/mini-imagenet/raw-data/train/n01440764/'
+ default_raw_data_dir = '/data/DNN_Dataset/imagenet/tiny/meshtf/mininet/mininet/mini-imagenet-sp2/train/n01532829/'
default_ckpt_dir = 'mycheckpoint'
flags.DEFINE_string('model', 'resnet50', 'imagenet model name.')
flags.DEFINE_string('strategy', 'none', 'strategy of variable updating')

Running bash run.sh produced the following output. I searched online but could not solve it, so I am asking for your help here. By the way, the runtime system should take the planner result as input, but I do not see it in run.sh or anywhere else. Could you point out where the planner result is used?

>>> bash run.sh
localhost,localhost
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

logdir =
mkdir: missing operand
Try 'mkdir --help' for more information.
('Training model: ', 'vgg19')
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From tf-keras-dapple.py:159: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From tf-keras-dapple.py:159: The name tf.logging.DEBUG is deprecated. Please use tf.compat.v1.logging.DEBUG instead.

WARNING:tensorflow:From tf-keras-dapple.py:172: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From tf-keras-dapple.py:175: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

['localhost:4000', 'localhost:5000']
WARNING:tensorflow:From /root/source/DAPPLE/vgg19/applications/cluster_utils.py:57: The name tf.train.replica_device_setter is deprecated. Please use tf.compat.v1.train.replica_device_setter instead.

WARNING:tensorflow:From /root/source/DAPPLE/vgg19/applications/cluster_utils.py:80: The name tf.train.Server is deprecated. Please use tf.distribute.Server instead.

2021-06-02 08:21:53.616519: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-06-02 08:21:53.623581: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-06-02 08:21:54.910991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5591bba7fb30 executing computations on platform CUDA. Devices:
2021-06-02 08:21:54.911022: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2021-06-02 08:21:54.911028: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2021-06-02 08:21:54.936597: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2095155000 Hz
2021-06-02 08:21:54.936705: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5591bba8d3b0 executing computations on platform Host. Devices:
2021-06-02 08:21:54.936718: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2021-06-02 08:21:54.939208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
2021-06-02 08:21:54.940335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2021-06-02 08:21:54.940375: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:21:54.941705: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-06-02 08:21:54.942817: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-06-02 08:21:54.943097: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-06-02 08:21:54.944652: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2021-06-02 08:21:54.945872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2021-06-02 08:21:54.950215: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-06-02 08:21:54.955523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2021-06-02 08:21:54.955562: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:21:55.338971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 08:21:55.338998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2021-06-02 08:21:55.339005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2021-06-02 08:21:55.339009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2021-06-02 08:21:55.345703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 10452 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-06-02 08:21:55.347717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2021-06-02 08:21:55.350960: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:250] Initialize GrpcChannelCache for job worker -> {0 -> localhost:4000, 1 -> localhost:5000}
2021-06-02 08:21:55.351616: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:4000
INFO:tensorflow:TF_CONFIG environment variable: {u'cluster': {u'chief': [u'localhost:4000'], u'worker': [u'localhost:5000']}, u'task': {u'index': 0, u'type': u'chief'}}
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmp5c7pnu
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
allow_growth: true
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_task_type': u'chief', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe074a88a90>, '_model_dir': '/tmp/tmp5c7pnu', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 2, '_task_id': 0, '_log_step_count_steps': 10, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': u'grpc://localhost:4000'}
Training starts.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py:236: initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Using Fake IO
((224, 224, 3), (224, 224))
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From tf-keras-dapple.py:73: prefetch_to_device (from tensorflow.contrib.data.python.ops.prefetching_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.prefetch_to_device(...)`.
INFO:tensorflow:Calling model_fn.
#### Devices ####
hvd.local_rank() == 0
['/job:worker/replica:0/task:0/device:GPU:0', '/job:worker/replica:0/task:1/device:GPU:0']
#### Replica Devices ####
['/job:worker/replica:0/task:0/device:GPU:0', '/job:worker/replica:0/task:0/device:GPU:1']
2021-06-02 08:22:00.197197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
2021-06-02 08:22:00.198343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
2021-06-02 08:22:00.198377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-06-02 08:22:00.198403: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-06-02 08:22:00.198416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-06-02 08:22:00.198429: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-06-02 08:22:00.198441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2021-06-02 08:22:00.198453: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2021-06-02 08:22:00.198466: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-06-02 08:22:00.202942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2021-06-02 08:22:00.203023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 08:22:00.203032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2021-06-02 08:22:00.203037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2021-06-02 08:22:00.203042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2021-06-02 08:22:00.206630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10452 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-06-02 08:22:00.207768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "tf-keras-dapple.py", line 391, in <module>
est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "tf-keras-dapple.py", line 254, in model_fn
total_loss, outputs = model.build(features, labels)
File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 329, in build
x = self.block4_conv1[0](x)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 561, in __call__
base_layer_utils.create_keras_history(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 200, in create_keras_history
_, created_layers = _create_keras_history_helper(tensors, set(), [])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 244, in _create_keras_history_helper
constants[i] = backend.function([], op_input)([])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 3253, in __call__
session = get_session(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 462, in get_session
_initialize_variables(session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.py", line 879, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation block1_conv1/kernel/Initializer/random_uniform/RandomUniform: Could not satisfy explicit device specification '' because the node node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) placed on device Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174> was colocated with a group of nodes that required incompatible device '/job:worker/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/job:worker/replica:0/task:0/device:GPU:0' assigned_device_name_='' resource_device_name_='/job:worker/replica:0/task:0/device:GPU:0' supported_device_types_=[GPU, CPU, XLA_CPU, XLA_GPU] possible_devices_=[]
AssignVariableOp: GPU CPU XLA_CPU XLA_GPU
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU
Mul: GPU CPU XLA_CPU XLA_GPU
Add: GPU CPU XLA_CPU XLA_GPU
Sub: GPU CPU XLA_CPU XLA_GPU
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
RandomUniform: GPU CPU XLA_CPU XLA_GPU
VarHandleOp: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU

Colocation members, user-requested devices, and framework assigned devices, if any:
block1_conv1/kernel/Initializer/random_uniform/shape (Const)
block1_conv1/kernel/Initializer/random_uniform/min (Const)
block1_conv1/kernel/Initializer/random_uniform/max (Const)
block1_conv1/kernel/Initializer/random_uniform/RandomUniform (RandomUniform)
block1_conv1/kernel/Initializer/random_uniform/sub (Sub)
block1_conv1/kernel/Initializer/random_uniform/mul (Mul)
block1_conv1/kernel/Initializer/random_uniform (Add)
block1_conv1/kernel (VarHandleOp) /job:worker/replica:0/task:0/device:GPU:0
block1_conv1/kernel/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /job:worker/replica:0/task:0/device:GPU:0
block1_conv1/kernel/Assign (AssignVariableOp) /job:worker/replica:0/task:0/device:GPU:0
block1_conv1/kernel/Read/ReadVariableOp (ReadVariableOp) /job:worker/replica:0/task:0/device:GPU:0
block1_conv1/Conv2D/ReadVariableOp (ReadVariableOp) /job:worker/replica:0/task:0/device:GPU:0
VarIsInitializedOp_18 (VarIsInitializedOp) /job:worker/replica:0/task:1/device:GPU:0

[[node block1_conv1/kernel/Initializer/random_uniform/RandomUniform (defined at /root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:280) ]]Additional information about colocations:No node-device colocations were active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation.
Device assignments active during op 'block1_conv1/kernel/Initializer/random_uniform/RandomUniform' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
with tf.device(/job:worker/replica:0/task:0/device:GPU:0): </root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py:275>
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1174>

Original stack trace for u'block1_conv1/kernel/Initializer/random_uniform/RandomUniform':
File "tf-keras-dapple.py", line 391, in <module>
est.train( input_fn = fn_image_preprocess , steps = FLAGS.num_batches) #, hooks = hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "tf-keras-dapple.py", line 254, in model_fn
total_loss, outputs = model.build(features, labels)
File "/root/source/DAPPLE/vgg19/applications/vgg19_slice_model.py", line 280, in build
x = self.block1_conv1[i](img_input)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 591, in __call__
self._maybe_build(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1881, in _maybe_build
self.build(input_shapes)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 165, in build
dtype=self.dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 384, in add_weight
aggregation=aggregation)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/tracking/base.py", line 663, in _add_variable_with_custom_getter
**kwargs_for_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 155, in make_variable
shape=variable_shape if variable_shape.rank else None)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 259, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call
shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 198, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2495, in default_variable_creator
shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 263, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 460, in __init__
shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 604, in _init_from_args
initial_value() if init_from_fn else initial_value,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 135, in <lambda>
init_val = lambda: initializer(shape, dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 533, in __call__
shape, -limit, limit, dtype, seed=self.seed)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 247, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 820, in random_uniform
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[63853,1],0]
Exit code: 1
--------------------------------------------------------------------------


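Please credit the source when reposting. You are welcome to verify the references cited in this article and to point out any errors or unclear wording.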