Distributed TensorFlow replicated training example: grpc_tensorflow_server - No such file or directory


I am trying to make a distributed TensorFlow implementation by following the instructions in this blog: Distributed TensorFlow by Leo K. Tam. My aim is to perform replicated training as mentioned in this post.

I have completed the steps up to installing TensorFlow, and I have successfully run the following command and obtained results:

sudo bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

The next thing I want to do is launch the gRPC server on one of the nodes with the following command:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='worker|192.168.555.254:2500;192.168.555.255:2501' --job_name=worker --task_id=0 &

However, when I run it, I get the following error:

-bash: bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server: No such file or directory

The contents of my rpc folder are:

 libgrpc_channel.pic.a              libgrpc_remote_master.pic.lo       libgrpc_session.pic.lo             libgrpc_worker_service_impl.pic.a  _objs/                             
 libgrpc_master_service_impl.pic.a  libgrpc_remote_worker.pic.a        libgrpc_tensor_coding.pic.a        libgrpc_worker_service.pic.a       
 libgrpc_master_service.pic.lo      libgrpc_server_lib.pic.lo          libgrpc_worker_cache.pic.a         librpc_rendezvous_mgr.pic.a

I am clearly missing a step in between that is not mentioned in the blog. My objective is to run the command above (to launch the gRPC server) so that I can start a worker process on one of the nodes.

The grpc_tensorflow_server binary was a temporary measure used in the pre-release version of distributed TensorFlow, and it is no longer built by default or included in the binary distributions. Its replacement is the tf.train.Server Python class, which is more programmable and easier to use.

You can write simple Python scripts using tf.train.Server to reproduce the behavior of grpc_tensorflow_server:

# ps.py. Run this on 192.168.0.1. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="ps", task_index=0)
server.join()

# worker_0.py. Run this on 192.168.0.2.
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=0)
server.join()

# worker_1.py. Run this on 192.168.0.3. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=1)
server.join()

Clearly this example could be cleaned up and made reusable with command-line flags etc., but TensorFlow doesn't prescribe a particular form for these. The main things to note are that (i) there is one tf.train.Server instance per TensorFlow task, (ii) all Server instances must be constructed with the same "cluster definition" (the dictionary mapping job names to lists of addresses), and (iii) each task is identified by a unique pair of job_name and task_index.
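For instance, one way to make this reusable is to drive a single script from command-line flags. This is only a sketch; the file name server.py, the flag names, and the use of argparse are illustrative choices, not anything TensorFlow requires:

# server.py -- a single reusable server script (hypothetical file name).
# Example: python server.py --job_name=worker --task_index=1
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, required=True)
args = parser.parse_args()

# The cluster definition must be identical in every task.
cluster = {"ps": ["192.168.0.1:2222"],
           "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]}

server = tf.train.Server(cluster,
                         job_name=args.job_name,
                         task_index=args.task_index)
server.join()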

Once you have run the three scripts on their respective machines, you can create another script that connects to them:

import tensorflow as tf

sess = tf.Session("grpc://192.168.0.2:2222")
# ...
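
From there, between-graph replicated training is typically set up with tf.train.replica_device_setter, which places variables on the "ps" job and the compute ops on each worker. The following is only a minimal sketch assuming the TF 1.x-era API; the toy model, the hard-coded task_index, and the step count are illustrative and not part of the original answer:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["192.168.0.1:2222"],
                                "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]})

task_index = 0  # 0 on 192.168.0.2, 1 on 192.168.0.3

# Pin variables to the "ps" job and computation to this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    w = tf.Variable(0.0)          # toy model: drive w towards 2.0
    loss = tf.square(w - 2.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    init_op = tf.global_variables_initializer()

with tf.Session("grpc://192.168.0.2:2222") as sess:
    sess.run(init_op)
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(w))            # converges towards 2.0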