Executor interface design and implementation #4537

Merged · 64 commits · Oct 11, 2017
Commits
540cc2c
add executor class and interface
QiJune Sep 30, 2017
3481bdc
add global device context
QiJune Sep 30, 2017
e42cafb
add executor unittest
QiJune Sep 30, 2017
d4be973
fix gpu build error
QiJune Sep 30, 2017
b630d40
add scope
QiJune Sep 30, 2017
39b2ff3
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Sep 30, 2017
0950091
pass place to GetCUDADeviceContext
QiJune Sep 30, 2017
ce4d14b
add struct Device
QiJune Oct 1, 2017
f1c5d9e
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Oct 1, 2017
f29a6b0
fix gpu build error
QiJune Oct 1, 2017
b5dbe88
follow comments
QiJune Oct 3, 2017
023ed5e
merge baidu/develop
QiJune Oct 3, 2017
6e2f968
simple test
Oct 3, 2017
1a0d8fa
done merged
Oct 3, 2017
e946fc1
add elementwise_add
Oct 3, 2017
6c4d1f5
refine codes
QiJune Oct 3, 2017
f5e73f4
pass simple elementwise_add op
Oct 3, 2017
0009a30
pass cpu test; gpu seg fault
Oct 3, 2017
3950515
remove device context manager
QiJune Oct 3, 2017
cb198fa
merge baidu/develop
QiJune Oct 4, 2017
fe10e86
fix gpu build error
QiJune Oct 4, 2017
e02cc57
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Oct 4, 2017
3014f6a
correct macro
QiJune Oct 5, 2017
623848a
add feed operator
QiJune Oct 5, 2017
20725f2
add executor feed operator test
QiJune Oct 5, 2017
45c4dca
add fetch operator
QiJune Oct 5, 2017
48b080d
ensure global BuddyAllocator is initialized before global Scope
QiJune Oct 6, 2017
bbceb72
refine some codes
QiJune Oct 6, 2017
39f75a1
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Oct 6, 2017
1f5192a
fix executor gpu unittest
QiJune Oct 6, 2017
ac0e382
test text
Oct 6, 2017
e8a678e
fix executor gpu unittest runtime error
QiJune Oct 6, 2017
d73aa87
merge develop
Oct 6, 2017
91f5d2b
follow comments and create local_scope inside executor run method
QiJune Oct 6, 2017
f087533
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Oct 6, 2017
a7d700e
revert local scope to TODO
QiJune Oct 6, 2017
683ef60
Merge commit 'refs/pull/4537/head' of https://github.com/PaddlePaddle…
gangliao Oct 6, 2017
b68a95f
prune pass simple test
Oct 7, 2017
005f15b
FeedOp and FetchOp unit test
Oct 7, 2017
a67e8ea
Add AddOp
Oct 8, 2017
c83ea1c
remove hardcode add_XX_op
Oct 8, 2017
6e7666f
before backward
Oct 8, 2017
c93d74a
merge develop
Oct 8, 2017
089cc11
clean up && fix #4624
Oct 9, 2017
e515571
clean up for review
Oct 9, 2017
340d21d
Init at block[0]; Run at block[1]
Oct 10, 2017
e655d29
merge develop
gangliao Oct 10, 2017
932402c
debug for sum
Oct 10, 2017
1540074
follow comments and refine codes
QiJune Oct 10, 2017
7d21d8c
Merge remote-tracking branch 'baidu/develop' into executor_impl
QiJune Oct 10, 2017
0e1f21a
Fix bug
JiayiFeng Oct 10, 2017
a17442d
merge 4664 in advance
Oct 10, 2017
e3161bb
pass simple backward
Oct 10, 2017
2fc7fc7
pass multiple forward backward
Oct 10, 2017
975a512
infer feed operator output variable shape with dims attribute
QiJune Oct 10, 2017
a308ff2
make infershape of feedop and fetchop compatible with compile-time de…
QiJune Oct 10, 2017
3f9e247
set variable support dim
Oct 10, 2017
293a7d1
add feed infershape todo
Oct 10, 2017
062ff4d
clean up
Oct 10, 2017
2e7cd20
remove log in backward
Oct 10, 2017
436ea50
follow comments
QiJune Oct 10, 2017
a528a97
remove prune as member function to function
Oct 10, 2017
f410622
merge follow comment
Oct 10, 2017
434949c
clean up for merge
Oct 10, 2017
7 changes: 7 additions & 0 deletions paddle/framework/CMakeLists.txt
@@ -42,5 +42,12 @@ add_custom_command(TARGET framework_py_proto POST_BUILD
cc_library(backward SRCS backward.cc DEPS net_op)
cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)

cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward ${GLOB_OP_LIB})
if(WITH_GPU)
nv_test(executor_test SRCS executor_test.cc DEPS executor)
Contributor:
Should we run cc_test(executor_test SRCS executor_test.cc DEPS executor) as well when WITH_GPU is ON?

Member Author:
When WITH_GPU is ON, both the CPU and GPU code paths will be tested.

else()
cc_test(executor_test SRCS executor_test.cc DEPS executor)
endif()

cc_library(tensor_array SRCS tensor_array.cc DEPS lod_tensor)
cc_test(tensor_array_test SRCS tensor_array_test.cc DEPS tensor_array place)
1 change: 1 addition & 0 deletions paddle/framework/backward.cc
@@ -378,6 +378,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
backward_descs[dup_op[i]]->Rename(out_name, new_name);
sum_op_inputs.emplace_back(new_name);
}
LOG(INFO) << "sum_op_inputs size " << sum_op_inputs.size();
std::unique_ptr<OpDescBind> sum_op(new OpDescBind(
"sum", {{"X", sum_op_inputs}}, {{"Out", {out_name}}}, {}));
pending_sum_ops.push_back({dup_op.back(), std::move(sum_op)});
176 changes: 176 additions & 0 deletions paddle/framework/executor.cc
@@ -0,0 +1,176 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/framework/executor.h"

#include <algorithm>
#include <iostream>
#include <memory>
#include <set>
#include <vector>

#include "paddle/framework/lod_tensor.h"
#include "paddle/framework/op_registry.h"
#include "paddle/framework/scope.h"

#include <boost/range/adaptor/reversed.hpp>

namespace paddle {
namespace framework {

const std::string kFeedOpType = "feed";
const std::string kFetchOpType = "fetch";

Executor::Executor(const std::vector<platform::Place>& places) {
PADDLE_ENFORCE_GT(places.size(), 0);
device_contexts_.resize(places.size());
Member:

enforce places.size() > 0 ?

Member Author:

Done

for (size_t i = 0; i < places.size(); i++) {
if (platform::is_cpu_place(places[i])) {
device_contexts_[i] = new platform::CPUDeviceContext(
boost::get<platform::CPUPlace>(places[i]));
} else if (platform::is_gpu_place(places[i])) {
#ifdef PADDLE_WITH_CUDA
device_contexts_[i] = new platform::CUDADeviceContext(
boost::get<platform::GPUPlace>(places[i]));
#else
PADDLE_THROW("'GPUPlace' is not supported in CPU only device.");
Contributor (@helinwang, Oct 7, 2017):
"CPU only device." does not make much sense, maybe change to: 'GPUPlace' is not supported, please recompile with PADDLE_WITH_CUDA=ON.

Member Author:
Done

#endif
}
}
}

Executor::~Executor() {
for (auto& device_context : device_contexts_) {
delete device_context;
}
}

void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
// TODO(tonyyang-svail):
// - only runs on the first device (i.e. no interdevice communication)
// - will change to use multiple blocks for RNN op and Cond Op
PADDLE_ENFORCE_GT(pdesc.blocks_size(), block_id);
auto& block = pdesc.blocks(block_id);
auto& device = device_contexts_[0];

// Instantiate all the vars in the global scope
for (auto& var : block.vars()) {
scope->NewVar(var.name());
Collaborator:
What if the Variable has already been created?

Reply:
Then it returns a pointer without re-creating a new Variable.
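// A small illustration of the behaviour described above (assumed Scope
// semantics, not part of this diff): NewVar with an existing name does not
// re-create the Variable, it simply returns the existing pointer, so
//   Variable* a = scope->NewVar("x");
//   Variable* b = scope->NewVar("x");
// leaves a == b.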

}

Scope& local_scope = scope->NewScope();
Collaborator:
It seems that we should drop local_scope after invoking Run?

Reply:
Looks like there is no easy way to do this. I will add a TODO on this.


std::vector<bool> should_run = Prune(pdesc, block_id);
PADDLE_ENFORCE_EQ(should_run.size(), block.ops_size());
for (size_t i = 0; i < should_run.size(); ++i) {
// if (should_run[i]) {
Contributor:
No commented-out code, please.

if (true) {
Contributor:
It's better to add a named constant assigned to true; otherwise this is a magic value.

for (auto& var : block.ops(i).outputs()) {
for (auto& argu : var.arguments()) {
if (local_scope.FindVar(argu) == nullptr) {
local_scope.NewVar(argu);
}
}
}
LOG(INFO) << block.ops(i).type();
if (block.ops(i).type() == "sum") {
LOG(INFO) << "Here";
for (auto& var : block.ops(i).inputs()) {
for (auto& argu : var.arguments()) {
LOG(INFO) << var.parameter() << " " << argu;
}
}
}
auto op = paddle::framework::OpRegistry::CreateOp(block.ops(i));
Collaborator:
I think it will be extremely slow if we create the operators every time, because the protobuf message will be parsed and copied.

Reply:
Let's keep the most straightforward implementation (i.e. avoid any premature optimization). Once we get it running, we can go ahead and do profiling.
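// One possible direction for the profiling follow-up mentioned above (purely
// illustrative, not part of this PR): create the operators once per block,
// e.g. keep a std::vector<std::unique_ptr<OperatorBase>> that is filled with
// OpRegistry::CreateOp on the first Run and reused on later runs, so each
// OpDesc protobuf is parsed only once.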

LOG(INFO) << op->DebugString();
op->Run(local_scope, *device);
}
}

// TODO(tonyyang-svail):
// - Destroy local_scope
}

std::vector<bool> Executor::Prune(const ProgramDesc& pdesc, int block_id) {
// TODO(tonyyang-svail):
// - will change to use multiple blocks for RNN op and Cond Op

Member:
only runs the first block for now, will change to use multiple blocks for RNN op and Cond Op

Reply:
done

auto& block = pdesc.blocks(block_id);
auto& ops = block.ops();

bool expect_feed = true;
for (auto& op_desc : ops) {
PADDLE_ENFORCE(op_desc.type() != kFeedOpType || expect_feed,
"All FeedOps are at the beginning of the ProgramDesc");
expect_feed = (op_desc.type() == kFeedOpType);
}

bool expect_fetch = true;
for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
auto& op_desc = *op_iter;
PADDLE_ENFORCE(op_desc.type() != kFetchOpType || expect_fetch,
"All FetchOps must at the end of the ProgramDesc");
expect_fetch = (op_desc.type() == kFetchOpType);
}

std::set<std::string> dependent_vars;
std::vector<bool> should_run;
for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
auto& op_desc = *op_iter;

bool found_dependent_vars = false;
Contributor:
This is a complicated and long method; write some comments for each code block, or split it into multiple smaller functions.

for (auto& var : op_desc.outputs()) {
for (auto& argu : var.arguments()) {
if (dependent_vars.count(argu) != 0) {
found_dependent_vars = true;
}
}
}

if (op_desc.type() == kFetchOpType || found_dependent_vars) {
// erase its output to the dependency graph
for (auto& var : op_desc.outputs()) {
for (auto& argu : var.arguments()) {
dependent_vars.erase(argu);
}
}

// insert its input to the dependency graph
for (auto& var : op_desc.inputs()) {
for (auto& argu : var.arguments()) {
dependent_vars.insert(argu);
}
}

LOG(INFO) << "1 " << op_desc.type();
should_run.push_back(true);
} else {
LOG(INFO) << "0 " << op_desc.type();
should_run.push_back(false);
}
}

Reviewer:
should ENFORCE dependent_vars to be empty here.

Reply:
done

// TODO(tonyyang-svail):
// - check this after integration of Init
// PADDLE_ENFORCE(dependent_vars.empty());

// since we are traversing the ProgramDesc in reverse order
// we reverse the should_run vector
std::reverse(should_run.begin(), should_run.end());

return should_run;
}

} // namespace framework
} // namespace paddle
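To make the reverse sweep in Prune easier to follow, here is a small worked example (illustrative only; the ops and variable names are invented):

// Block 0, forward order:              Reverse sweep over ops:
//   0: feed           -> x             3: fetch(y)        fetch op: keep, deps = {y}
//   1: mul(x)         -> tmp (unused)  2: add(x) -> y     y in deps: keep, deps = {x}
//   2: add(x)         -> y             1: mul(x) -> tmp   tmp not in deps: skip
//   3: fetch(y)                        0: feed   -> x     x in deps: keep, deps = {}
//
// should_run after std::reverse == {true, false, true, true}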
56 changes: 56 additions & 0 deletions paddle/framework/executor.h
@@ -0,0 +1,56 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once

#include "paddle/framework/framework.pb.h"
#include "paddle/framework/op_info.h"
#include "paddle/framework/scope.h"
#include "paddle/framework/tensor.h"

namespace paddle {
namespace framework {

class Executor {
public:
explicit Executor(const std::vector<platform::Place>& places);
~Executor();

/* @Brief
Collaborator:
I think the C++ code itself is the documentation, and we don't really need to use Doxygen. Therefore, we can write much shorter comments. For this specific case, Executor::Run, I don't even think it needs a comment.

* Runtime evaluation of the given ProgramDesc under certain Scope
*
* @param
* ProgramDesc
* Scope
*/
void Run(const ProgramDesc&, Scope*, int);

protected:
/* @Brief
* Pruning the graph
*
* @param
* ProgramDesc
*
* @return
* vector<bool> Same size as ops. Indicates whether an op should be run.
*/
std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id);
Contributor:
Prune is a high-level optimization pass and should run before the executor; the executor should receive an op group with no redundant ops.


private:
std::vector<platform::DeviceContext*> device_contexts_;
};

} // namespace framework
} // namespace paddle
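For reference, here is a minimal usage sketch of the interface declared above (not part of the diff). The header paths, the add_blocks() call on the ProgramDesc protobuf, and running an empty block are assumptions made purely to show the call sequence; real programs are built by the compile-time API.

#include <vector>

#include "paddle/framework/executor.h"
#include "paddle/framework/framework.pb.h"
#include "paddle/framework/scope.h"
#include "paddle/platform/place.h"

int main() {
  namespace f = paddle::framework;
  namespace p = paddle::platform;

  f::ProgramDesc program;
  program.add_blocks();  // block 0, left empty here purely for illustration

  f::Scope scope;
  std::vector<p::Place> places;
  places.emplace_back(p::CPUPlace());

  f::Executor executor(places);      // one DeviceContext per place
  executor.Run(program, &scope, 0);  // run block 0 on the first place
  return 0;
}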