/**
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
folly/AtomicHashMap.h

folly/AtomicHashMap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.
Although it can provide extreme performance, AtomicHashMap has some unique limitations as well:

* The space for erased elements cannot be reclaimed (they are tombstoned forever), so it's generally not a good idea to use this if you're erasing things a lot.
* Only 32- or 64-bit keys are supported - this is because they must be atomically compare-and-swap'ed.
* Growth beyond initialization reduces performance - if you don't know the approximate number of elements you'll be inserting into the map, you probably shouldn't use this class.
* Synchronization must be managed externally in order to modify values in the map after insertion. Lock pools are a common way to do this, or you may consider using folly::PackedSyncPtr<T> as your ValueT.
* Special reserved key values must be defined for empty, erased, and locked elements.

For a complete list of limitations and departures from the UnorderedAssociativeContainer concept, see folly/AtomicHashMap.h.
AtomicHashMap also has some unusual benefits:

* value_type references remain valid as long as the map itself. Note this is not true for most other probing hash maps, which move elements when rehashing (necessary for them to grow). AtomicHashMap grows by chaining additional slabs, so elements never need to be moved.
* Unique 32-bit ids can be used to reference elements in the map via iterator::getIndex() (see the short sketch after this list). This can be helpful to save memory in the rest of the application by replacing 64-bit pointers or keys.
* Iterators are never invalidated. This means you can iterate through the map while simultaneously inserting and erasing. This is particularly useful for non-blocking map serialization.
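For illustration, here is a short sketch of the id mechanism; treating findAt() as the id-based accessor is my reading of folly/AtomicHashMap.h, so consider the details an assumption:

AtomicHashMap<int64_t, int64_t> ahm(1024);
auto ret = ahm.insert(std::make_pair(42, 1));
uint32_t id = ret.first.getIndex(); // permanent 32-bit id for this element
// Later, re-fetch the element by id instead of storing a 64-bit pointer:
auto it = ahm.findAt(id);
assert(it->first == 42);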
Usage is similar to most maps, although note the conspicuous lack of operator[], which would encourage non-thread-safe access patterns.
Below is a synchronized key counter implementation that allows the counter values to be incremented in parallel with serializing all the values to a string.
class Counters {
 private:
  AtomicHashMap<int64_t, int64_t> ahm;

 public:
  explicit Counters(size_t numCounters) : ahm(numCounters) {}

  void increment(int64_t obj_id) {
    auto ret = ahm.insert(make_pair(obj_id, 1));
    if (!ret.second) {
      // obj_id already exists, increment the counter
      NoBarrier_AtomicIncrement(&ret.first->second, 1);
    }
  }

  int64_t getValue(int64_t obj_id) {
    auto ret = ahm.find(obj_id);
    return ret != ahm.end() ? ret->second : 0;
  }

  // Serialize the counters without blocking increments
  string toString() {
    string ret = "{\n";
    ret.reserve(ahm.size() * 32);
    for (const auto& e : ahm) {
      ret += folly::to<string>(
          "  [", e.first, ":", NoBarrier_Load(&e.second), "]\n");
    }
    ret += "}\n";
    return ret;
  }
};
AtomicHashMap is a composition of AtomicHashArray submaps, which implement the meat of the functionality. Only one AHA is created on initialization, and additional submaps are appended if the first one gets full. If the AHM grows, there will be multiple submaps that must be probed in series to find a given key. The more growth, the more submaps will be chained, and the slower it will get. If the initial size estimate is good, only one submap will ever be created and performance will be optimal.
AtomicHashArray is a fixed-size probing hash map (also referred to as an open addressed hash map) where hash collisions are resolved by checking subsequent elements. This means that they can be allocated in slabs as arrays of value_type elements, have excellent cache performance, and have no memory overhead from storing pointers.
The algorithm is simple - when inserting, the key is hash-mod’ed to an offset, and that element-key is atomically compare-and-swap’ed with the locked key value. If successful, the value is written and the element-key is unlocked by setting it to the input key value. If the compare fails, the next element is tried until success or the map is full.
Finds are even simpler. The key is hash-mod’ed to an offset, and the element-key is examined. If it is the same as the input key, the reference is returned, if it’s the empty key, failure is returned, otherwise the next key is tried. This can be done wait-free without any atomic instructions because the elements are always in a valid state.
Erase is done by finding the key, then compare-and-swap’ing the element-key with the reserved erased key value. If the swap succeeds, return success, otherwise return failure (the element was erased by a competing thread). If the key does not exist, return failure.
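To make the insert algorithm concrete, here is a minimal single-slab sketch using std::atomic. This is illustrative only: kEmptyKey and kLockedKey are stand-ins for the user-supplied reserved key values, and folly's real implementation in folly/AtomicHashArray.h handles many more details (a real insert would also spin while a cell is locked by a concurrent inserter rather than skipping it).

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>

// Illustrative stand-ins for the user-supplied reserved key values.
constexpr int64_t kEmptyKey = 0;
constexpr int64_t kLockedKey = -1;

struct Cell {
  std::atomic<int64_t> key{kEmptyKey};
  int64_t value{0};
};

bool insert(Cell* cells, size_t capacity, int64_t key, int64_t value) {
  size_t idx = std::hash<int64_t>()(key) % capacity;
  for (size_t probes = 0; probes < capacity; ++probes) {
    int64_t expected = kEmptyKey;
    // Claim the cell by swinging its key from empty to locked...
    if (cells[idx].key.compare_exchange_strong(expected, kLockedKey)) {
      cells[idx].value = value; // ...write the value while "locked"...
      cells[idx].key.store(key, std::memory_order_release); // ...then unlock.
      return true;
    }
    if (expected == key) {
      return false; // key already present
    }
    idx = (idx + 1) % capacity; // collision: try the next element
  }
  return false; // map is full
}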
folly/Benchmark.h
folly/Benchmark.h
provides a simple framework for writing and executing benchmarks. Currently the framework targets only single-threaded testing (though you can internally use fork-join parallelism and measure total run time).
Using folly/Benchmark.h
is very simple. Here’s an example:
#include <folly/Benchmark.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector) {
  // Let's insert 100 elements at the front of a vector
  vector<int> v;
  for (unsigned int i = 0; i < 100; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector) {
  // Let's insert 100 elements at the back of a vector
  vector<int> v;
  for (unsigned int i = 0; i < 100; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
Compiling and running this code produces the following on standard output:
===============================================================================
test.cpp relative ns/iter iters/s
===============================================================================
insertFrontVector 3.84K 260.38K
insertBackVector 1.61K 622.75K
===============================================================================
Let’s worry about the empty column “relative” later. The table contains, for each benchmark, the time spent per call and the converse number of calls per second. Numbers are represented in metric notation (K for thousands, M for millions etc). As expected, in this example the second function is much faster (fewer ns/iter and more iters/s).
The macro BENCHMARK
introduces a function and also adds it to an internal array containing all benchmarks in the system. The defined function takes no arguments and returns void
.
The framework calls the function many times to collect statistics about it. Sometimes the function itself would want to do that iteration—for example how about inserting n
elements instead of 100 elements? To do the iteration internally, use BENCHMARK
with two parameters. The second parameter is the number of iterations and is passed by the framework down to the function. The type of the count is implicitly unsigned
. Consider a slightly reworked example:
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
The produced numbers are substantially different:
===============================================================================
Benchmark relative ns/iter iters/s
===============================================================================
insertFrontVector 39.92 25.05M
insertBackVector 3.46 288.89M
===============================================================================
Now the numbers indicate the speed of one single insertion because the framework assumed the user-defined function used internal iteration (which it does). So inserting at the back of a vector is more than 10 times faster than inserting at the front! Speaking of comparisons…
Choosing one or more good baselines is a crucial activity in any measurement. Without a baseline there is little information to derive from the sheer numbers. If, for example, you do experimentation with algorithms, a good baseline is often an established approach (e.g. the built-in std::sort
for sorting). Essentially all experimental numbers should be compared against some baseline.
To support baseline-driven measurements, folly/Benchmark.h
defines BENCHMARK_RELATIVE
, which works much like BENCHMARK
, except it considers the most recent lexically-occurring BENCHMARK
a baseline, and fills the “relative” column. Say, for example, we want to use front insertion for a vector as a baseline and see how back insertion compares with it:
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK_RELATIVE(insertBackVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
This program prints something like:
===============================================================================
Benchmark relative ns/iter iters/s
===============================================================================
insertFrontVector 42.65 23.45M
insertBackVector 1208.24% 3.53 283.30M
===============================================================================
showing the 1208.24% relative speed advantage of inserting at the back compared to front. The scale is chosen in such a way that 100% means identical speed, numbers smaller than 100% indicate the benchmark is slower than the baseline, and numbers greater than 100% indicate the benchmark is faster. For example, if you see 42% that means the speed of the benchmark is 0.42 of the baseline speed. If you see 123%, it means the benchmark is 23% or 1.23 times faster.
To close the current benchmark group and start another, simply use BENCHMARK
again.
If you want to draw a horizontal line of dashes (e.g. at the end of a group or for whatever reason), use BENCHMARK_DRAW_LINE()
. The line fulfills a purely aesthetic role; it doesn’t interact with measurements in any way.
BENCHMARK(foo) {
  Foo foo;
  foo.doSomething();
}

BENCHMARK_DRAW_LINE();

BENCHMARK(bar) {
  Bar bar;
  bar.doSomething();
}
Sometimes benchmarking code must do some preparation work that is physically inside the benchmark function, but should not count toward its time budget. To temporarily suspend the benchmark, use the pseudo-statement BENCHMARK_SUSPEND
as follows:
BENCHMARK(insertBackVector, n) {
  vector<int> v;
  BENCHMARK_SUSPEND {
    v.reserve(n);
  }
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}
The preallocation effected with v.reserve(n)
will not count toward the total run time of the benchmark.
Only the main thread should call BENCHMARK_SUSPEND
(and of course it should not call it while other threads are doing actual work). This is because the timer is application-global.
If the scope introduced by BENCHMARK_SUSPEND
is not desired, you may want to “manually” use the BenchmarkSuspender
type. Constructing such an object suspends time measurement, and destroying it resumes the measurement. If you want to resume time measurement before the destructor, call dismiss
against the BenchmarkSuspender
object. The previous example could have been written like this:
BENCHMARK(insertBackVector, n) {
  BenchmarkSuspender braces;
  vector<int> v;
  v.reserve(n);
  braces.dismiss();
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}
doNotOptimizeAway
Finally, the small utility function doNotOptimizeAway
prevents compiler optimizations that may interfere with benchmarking. Call doNotOptimizeAway(var) against variables that you use for benchmarking but otherwise are useless. The compiler tends to do a good job at eliminating unused variables, and this function fools it into thinking a variable is in fact needed. Example:
BENCHMARK(fpOps, n) {
  double d = 1;
  FOR_EACH_RANGE (i, 1, n) {
    d += i;
    d -= i;
    d *= i;
    d /= i;
  }
  doNotOptimizeAway(d);
}
folly/Benchmark.h
has a simple, systematic approach to collecting timings.
First, it organizes measurements in several large epochs, and takes the minimum over all epochs. Taking the minimum gives the closest result to the real runtime. Benchmark timings are not a regular random variable that fluctuates around an average. Instead, the real time we're looking for is one to which a variety of additive noise is applied (i.e. there is no noise that could actually shorten the benchmark time below its real value). In theory, if we took an infinite number of samples and kept the minimum, we would obtain exactly the time that needs measuring. That's why the accuracy of benchmarking increases with the number of epochs.
Clearly, in real functioning there will also be noise and a variety of effects caused by the running context. But the noise during the benchmark (straight setup, simple looping) is a poor model for the noise in the real application. So taking the minimum across several epochs is the most informative result.
Inside each epoch, the function measured is iterated an increasing number of times until the total runtime is large enough to make noise negligible. At that point the time is collected, and the time per iteration is computed. As mentioned, the minimum time per iteration over all epochs is the final result.
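Here is a minimal sketch of that strategy; workload, timeNIterations, kEpochs, and kMinEpochTimeNs are hypothetical stand-ins for illustration, not folly's actual internals:

#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the function under test.
void workload() { /* ... */ }

// Time n back-to-back iterations of the workload, in nanoseconds.
uint64_t timeNIterations(uint64_t n) {
  auto start = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < n; ++i) {
    workload();
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}

uint64_t minNsPerIteration() {
  constexpr int kEpochs = 1000;                // more epochs, better accuracy
  constexpr uint64_t kMinEpochTimeNs = 100000; // "large enough" threshold
  uint64_t best = UINT64_MAX;
  for (int epoch = 0; epoch < kEpochs; ++epoch) {
    // Grow the iteration count until the epoch runs long enough for
    // timer noise to be negligible, then record this epoch's ns/iter.
    for (uint64_t n = 1;; n *= 2) {
      uint64_t ns = timeNIterations(n);
      if (ns >= kMinEpochTimeNs) {
        best = std::min(best, ns / n);
        break;
      }
    }
  }
  return best; // the minimum over all epochs is the reported result
}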
The timer function used is clock_gettime
with the CLOCK_REALTIME
clock id. Note that you must use a recent Linux kernel (2.6.38 or newer), otherwise the resolution of CLOCK_REALTIME
is inadequate.
folly/Conv.h
folly/Conv.h
is a one-stop-shop for converting values across types. Its main features are simplicity of the API (only the names to
and toAppend
must be memorized), speed (folly is significantly faster, sometimes by an order of magnitude, than comparable APIs), and correctness.
All examples below assume that folly/Conv.h has been included and a using namespace folly; directive is in effect.
// To format as text and append to a string, use toAppend.
fbstring str;
toAppend(2.5, &str);
CHECK_EQ(str, "2.5");

// Multiple arguments are okay, too. Just put the pointer to string at the end.
toAppend(" is ", 2, " point ", 5, &str);
CHECK_EQ(str, "2.5 is 2 point 5");

// You don't need to use fbstring (although it's much faster for conversions and in general).
std::string stdStr;
toAppend("Pi is about ", 22.0 / 7, &stdStr);

// In general, just use to<TargetType>(sourceValue). It returns its result by value.
stdStr = to<std::string>("Variadic ", "arguments also accepted.");

// to<fbstring> is 2.5x faster than to<std::string> for typical workloads.
str = to<fbstring>("Variadic ", "arguments also accepted.");
Using to<Target>(value)
to convert one integral type to another will behave as follows:
short x;
unsigned short y;
...
auto a = to<int>(x); // zero overhead conversion
auto b = to<int>(y); // zero overhead conversion
to inserts bounds checks and throws std::range_error if the target type cannot accommodate the source value. Example:

short x;
unsigned short y;
long z;
...
x = 123;
auto a = to<unsigned short>(x); // fine
x = -1;
a = to<unsigned short>(x); // THROWS
z = 2000000000;
auto b = to<int>(z); // fine
z += 1000000000;
b = to<int>(z); // THROWS
auto c = to<unsigned int>(z); // fine
As mentioned, there are two primitives for converting anything to string: to
and toAppend
. They support the same set of source types, literally by definition (to
is implemented in terms of toAppend
for all types). The call toAppend(value, &str)
formats and appends value
to str
whereas to<StringType>(value)
formats value
as a StringType
and returns the result by value. Currently, the supported StringType
s are std::string
and fbstring.
Both toAppend
and to
with a string type as a target support variadic arguments. Each argument is converted in turn. For toAppend
the last argument in a variadic list must be the address of a supported string type (no need to specify the string type as a template argument).
Nothing special here - integrals are converted to strings in decimal format, with a ‘-’ prefix for negative values. Example:
auto a = to<fbstring>(123);
assert(a == "123");
a = to<fbstring>(-456);
assert(a == "-456");
The conversion implementation is aggressively optimized. It converts two digits at a time assisted by fixed-size tables. Converting a long
to an fbstring
is 3.6x faster than using boost::lexical_cast
and 2.5x faster than using sprintf
even though the latter is used in conjunction with a stack-allocated constant-size buffer.
Note that converting integral types to fbstring
has a particular advantage compared to converting to std::string:
No integral type (<= 64 bits) has more than 20 decimal digits including sign. Since fbstring
employs the small string optimization for up to 23 characters, converting an integral to fbstring
is guaranteed to not allocate memory, resulting in significant speed and memory locality gains. Benchmarks reveal a 2x gain on a typical workload.
char to string conversion

Although char is technically an integral type, most of the time you want the string representation of 'a' to be "a", not 97. That's why folly/Conv.h handles char as a special case that does the expected thing. Note that signed char and unsigned char are still considered integral types.
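A quick sketch of the special-casing described above, reusing the include/using assumptions stated earlier:

auto c = to<std::string>('a');
assert(c == "a"); // char is formatted as a character...
auto sc = to<std::string>(static_cast<signed char>(65));
assert(sc == "65"); // ...while signed char is formatted as an integer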
folly/Conv.h
uses V8’s double conversion routines. They are accurate and fast; on typical workloads, to<fbstring>(doubleValue)
is 1.9x faster than sprintf
and 5.5x faster than boost::lexical_cast. (It is also 1.3x faster than to<std::string>(doubleValue).)

const char* to string conversion

For completeness, folly/Conv.h supports const char*, i.e. including string literals. The "conversion" consists, of course, of the string itself. Example:
auto s = to<fbstring>("Hello, world");
assert(s == "Hello, world");
folly/Conv.h
includes three kinds of parsing routines:
* to<Type>(const char* begin, const char* end) rigidly converts the range [begin, end) to Type. These routines have drastic restrictions (e.g. allow no leading or trailing whitespace) and are intended as an efficient back-end for more tolerant routines.
* to<Type>(stringy) converts stringy to Type. Value stringy may be of type const char*, StringPiece, std::string, or fbstring. (Technically, the requirement is that stringy implicitly converts to a StringPiece.)
* to<Type>(&stringPiece) parses with progress information: given stringPiece of type StringPiece, it parses as much as possible from it as type Type, and alters stringPiece to remove the munched characters. This is easiest clarified by an example:

fbstring s = "  1234 angels on a pin";
StringPiece pc(s);
auto x = to<int>(&pc);
assert(x == 1234);
assert(pc == " angels on a pin");
Note how the routine ate the leading space but not the trailing one.
Parsing integral types is unremarkable - decimal format is expected, optional '+'
or '-'
sign for signed types, but no optional '+'
is allowed for unsigned types. The one remarkable element is speed - parsing typical long
values is 6x faster than sscanf
. folly/Conv.h
uses aggressive loop unrolling and table-assisted SIMD-style code arrangement that avoids integral division (slow) and data dependencies across operations (ILP-unfriendly). Example:
= " 12345 ";
fbstring str assert(to<int>(str) == 12345);
= " 12345six seven eight";
str (str);
StringPiece pcassert(to<int>(&pc) == 12345);
assert(str == "six seven eight");
folly/Conv.h
uses, again, V8’s double-conversion routines as back-end. The speed is 3x faster than sscanf
and 1.7x faster than in-house routines such as parse<double>. But the more important detail is accuracy - even if you do code a routine that works faster than to<double>, chances are it is incorrect and will fail in a variety of corner cases. Using to<double>
is strongly recommended.
Note that if the string “NaN” (with any capitalization) is passed to to<double>
then NaN
is returned, which can be tested for as follows:
= "nan"; // "NaN", "NAN", etc.
fbstring str double d = to<double>(str);
if (std::isnan(d)) {
// string was a valid representation of the double value NaN
}
Note that passing “-NaN” (with any capitalization) to to<double>
also returns NaN
.
Note that if the strings “inf” or “infinity” (with any capitalization) are passed to to<double>
then infinity
is returned, which can be tested for as follows:
= "inf"; // "Inf", "INF", "infinity", "Infinity", etc.
fbstring str double d = to<double>(str);
if (std::isinf(d)) {
// string was a valid representation of one of the double values +Infinity
// or -Infinity
}
Note that passing “-inf” or “-infinity” (with any capitalization) to to<double>
returns -infinity
rather than +infinity
. The sign of the infinity
can be tested for as follows:
= "-inf"; // or "inf", "-Infinity", "+Infinity", etc.
fbstring str double d = to<double>(str);
if (d == std::numeric_limits<double>::infinity()) {
// string was a valid representation of the double value +Infinity
} else if (d == -std::numeric_limits<double>::infinity()) {
// string was a valid representation of the double value -Infinity
}
Note that if an unparseable string is passed to to<double>
then an exception is thrown, rather than NaN
being returned. This can be tested for as follows:
= "not-a-double"; // Or "1.1.1", "", "$500.00", etc.
fbstring str double d;
try {
= to<double>(str);
d } catch (const std::range_error &) {
// string could not be parsed
}
Note that the empty string (""
) is an unparseable value, and will cause to<double>
to throw an exception.
tryTo<T>
is the non-throwing variant of to<T>
. It returns an Expected<T, ConversionCode>
. You can think of Expected
as like an Optional<T>
, but if the conversion failed, Expected
stores an error code instead of a T
.
tryTo<T>
has similar performance as to<T>
when the conversion is successful. On the error path, you can expect tryTo<T>
to be roughly three orders of magnitude faster than the throwing to<T>
and to completely avoid any lock contention arising from stack unwinding.
Here is how to use non-throwing conversions:
auto t1 = tryTo<int>(str);
if (t1.hasValue()) {
  use(t1.value());
}
Expected
has a composability feature to make the above pattern simpler.
tryTo<int>(str).then([](int i) { use(i); });
folly/dynamic.h
folly/dynamic.h
provides a runtime dynamically typed value for C++, similar to the way languages with runtime type systems work (e.g. Python). It can hold types from a predetermined set of types (ints, bools, arrays of other dynamics, etc), similar to something like std::variant
, but the syntax is intended to be a little more like using the native type directly.
Here are some code samples to get started (assumes a using folly::dynamic;
was used):
dynamic twelve = 12; // creates a dynamic that holds an integer
dynamic str = "string"; // yep, this one is an fbstring

// A few other types.
dynamic nul = nullptr;
dynamic boolean = false;

// Arrays can be initialized with dynamic::array.
dynamic array = dynamic::array("array ", "of ", 4, " elements");
assert(array.size() == 4);
dynamic emptyArray = dynamic::array;
assert(emptyArray.empty());

// Maps from dynamics to dynamics are called objects. The
// dynamic::object constant is how you make an empty map from dynamics
// to dynamics.
dynamic map = dynamic::object;
map["something"] = 12;
map["another_something"] = map["something"] * 2;

// Dynamic objects may be initialized this way
dynamic map2 = dynamic::object("something", 12)("another_something", 24);
Any operation on a dynamic requires checking at runtime that the type is compatible with the operation. If it isn’t, you’ll get a folly::TypeError
. Other exceptions can also be thrown if you try to do something impossible (e.g. if you put a very large 64-bit integer in and try to read it out as a double).
More examples should hopefully clarify this:
dynamic dint = 42;

dynamic str = "foo";
dynamic anotherStr = str + "something"; // fine
dynamic thisThrows = str + dint; // TypeError is raised
Explicit type conversions can be requested for some of the basic types:
dynamic dint = 12345678;
dynamic doub = dint.asDouble(); // doub will hold 12345678.0
dynamic str = dint.asString(); // str == "12345678"

dynamic hugeInt = std::numeric_limits<int64_t>::max();
dynamic hugeDoub = hugeInt.asDouble(); // throws a folly/Conv.h error,
                                       // since it can't fit in a double
For more complicated conversions, see DynamicConverter.
Equality operators (==
, !=
) are supported for all types.
For dynamics of the same type, the underlying equality operator shall apply.
For comparisons between different numeric types (double and int64), numeric equality will apply - thus 2.0 == 2
.
Values of any other different types will be deemed to not be equal.
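For instance, a small sketch (reusing the using folly::dynamic; assumption from above):

dynamic two = 2;
dynamic twoPointZero = 2.0;
assert(two == twoPointZero); // numeric equality across int64 and double

dynamic twoString = "2";
assert(twoString != two); // different non-numeric types never compare equal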
Ordering operators (<
, <=
, >
, >=
) are supported for all types, except dynamic::object
which will throw if it is involved in an ordering operator.
For dynamics of the same type, the underlying ordering operator shall apply.
For comparisons between different numeric types (double and int64), numeric ordering will apply - thus 1.5 < 2
.
Ordering of values between other different types will maintain total ordering properties and be consistent within a given binary run, and thus safe for use in e.g. std::set
. The actual ordering is undefined and could change across versions, thus a dependency should not be taken outside of the total ordering property within a given binary.
Hashing is supported by all types, and the hashes of two values will match if they are equal per dynamic::operator==
.
As a result, numerical types have the same numerical hashing regardless of int64 vs double - so e.g. std::hash<dynamic>()(2)
will give the same value as std::hash<dynamic>()(2.0)
.
You can iterate over dynamic arrays as you would over any C++ sequence container.
= dynamic::array(2, 3, "foo");
dynamic array
for (auto& val : array) {
(val);
doSomethingWith}
You can iterate over dynamic maps by calling items()
, keys()
, values()
, which behave similarly to the homonymous methods of Python dictionaries.
= dynamic::object(2, 3)("hello", "world")("x", 4);
dynamic obj
for (auto& pair : obj.items()) {
// Key is pair.first, value is pair.second
(pair.first);
processKey(pair.second);
processValue}
for (auto& key : obj.keys()) {
(key);
processKey}
for (auto& value : obj.values()) {
(value);
processValue}
You can find an element by key in a dynamic map using the find()
method, which returns an iterator compatible with items()
:
dynamic obj = dynamic::object(2, 3)("hello", "world")("x", 4);

auto pos = obj.find("hello");
// pos->first is "hello"
// pos->second is "world"

auto pos2 = obj.find("no_such_key");
// pos2 == obj.items().end()
The original motivation for implementing this type was to try to make dealing with json documents in C++ almost as easy as it is in languages with dynamic type systems (php or javascript, etc). The reader can judge whether we’re anywhere near that goal, but here’s what it looks like:
// Parsing JSON strings and using them.
std::string jsonDocument = R"({"key":12,"key2":[false, null, true, "yay"]})";
dynamic parsed = folly::parseJson(jsonDocument);
assert(parsed["key"] == 12);
assert(parsed["key2"][0] == false);
assert(parsed["key2"][1] == nullptr);

// Building the same document programmatically.
dynamic sonOfAJ = dynamic::object
    ("key", 12)
    ("key2", dynamic::array(false, nullptr, true, "yay"));

// Printing. (See also folly::toPrettyJson)
auto str = folly::toJson(sonOfAJ);
assert(jsonDocument.compare(str) == 0);
Dynamic typing is more expensive than static typing, even when you do it in C++. ;)
However, some effort has been made to keep folly::dynamic
and the json (de)serialization at least reasonably performant for common cases. The heap is only used for arrays and objects, and move construction is fully supported. String formatting internally also uses the highly performant folly::to<>
(see folly/Conv.h
).
A trade off to keep in mind though, is that sizeof(folly::dynamic)
is 64 bytes. You probably don’t want to use it if you need to allocate large numbers of them (prefer static types, etc).
Q. Why doesn’t a dynamic string support begin(), end(), and operator[]?
The value_type of a dynamic iterator is dynamic
, and operator[]
(or the at()
function) has to return a reference to a dynamic. If we wanted this to work for strings, this would mean we’d have to support dynamics with a character type, and moreover that the internal representation of strings would be such that we can hand out references to dynamic as accessors on individual characters. There are a lot of potential efficiency drawbacks with this, and it seems like a feature that is not needed too often in practice.
Q. Isn’t this just a poor imitation of the C# language feature?
Pretty much.
folly/DynamicConverter.h
When dynamic objects contain data of a known type, it is sometimes useful to have its well-typed representation. A broad set of type-conversions are contained in DynamicConverter.h
, and facilitate the transformation of dynamic objects into their well-typed format.
Simply pass a dynamic into a templated convertTo:
dynamic d = dynamic::array(dynamic::array(1, 2, 3), dynamic::array(4, 5)); // a vector of vector of int
auto vvi = convertTo<fbvector<fbvector<int>>>(d);

convertTo naturally supports conversions to arithmetic types, strings (std::string, fbstring), and standard containers, including map-like containers.

NOTE: convertTo<Type> will assume that Type is a container if it has a Type::value_type, a Type::iterator, and a constructor that accepts two InputIterators. Additionally, convertTo will assume that Type is a map if it also has a Type::key_type and a Type::mapped_type.

If Type meets the container criteria, then it will be constructed by calling its InputIterator constructor.
If you want to use convertTo to convert dynamics into your own custom class, then all you have to do is provide a template specialization of DynamicConverter with the static method convert. Make sure you put it in namespace folly.
Example:
struct Token {
  int kind_;
  fbstring lexeme_;

  explicit Token(int kind, const fbstring& lexeme)
      : kind_(kind), lexeme_(lexeme) {}
};

namespace folly {
template <>
struct DynamicConverter<Token> {
  static Token convert(const dynamic& d) {
    int k = convertTo<int>(d["KIND"]);
    fbstring lex = convertTo<fbstring>(d["LEXEME"]);
    return Token(k, lex);
  }
};
}
Run your concurrent code in a performant way
Wangle provides two concrete thread pools (IOThreadPoolExecutor, CPUThreadPoolExecutor) as well as building them in as part of a complete async framework. Generally you might want to grab the global executor, and use it with a future, like this:
auto f = someFutureFunction().via(getCPUExecutor()).then(...)
Or maybe you need to construct a thrift/memcache client, and need an event base:
auto f = getClient(getIOExecutor()->getEventBase())
             ->callSomeFunction(args...)
             .via(getCPUExecutor())
             .then([](Result r) { /* ... do something with result */ });
The current C++11 std::launch only has two modes: async or deferred. In a production system, neither is what you want: async will launch a new thread for every launch without limit, while deferred will defer the work until it is needed lazily, but then do the work in the current thread synchronously when it is needed.
Wangle's thread pools always launch work as soon as possible, have limits to the maximum number of tasks / threads allowed, so we will never use more threads than absolutely needed. See implementation details below about each type of executor.
Unfortunately none of the existing thread pools had every feature needed - things based on pipes are too slow. Several older ones didn't support std::function.
If you want epoll support, you need an fd - event_fd is the latest notification hotness. Unfortunately, an active fd triggers all the epoll loops it is in, leading to thundering herd - so if you want a fair queue (one queue total vs. one queue per worker thread), you need to use some kind of semaphore. Unfortunately semaphores can't be put in epoll loops, so they are incompatible with IO. Fortunately, you usually want to separate the IO and CPU bound work anyway to give stronger tail latency guarantees on IO.
ThreadPoolExecutor is the base class that contains the thread startup/shutdown/stats logic, since this is pretty disjoint from how tasks are actually run.
An observer interface is provided to listen for thread start/stop events. This is useful to create objects that should be one-per-thread, but also have them work correctly if threads are added/removed from the thread pool.
PoolStats are provided to get task count, running time, waiting time, etc.
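As a minimal sketch of putting the two concrete pool types to work - the constructor arguments and the drain-with-join pattern reflect my reading of the headers named below, so treat the details as assumptions:

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/IOThreadPoolExecutor.h>

int main() {
  // CPU-bound work: one shared (fair) queue feeding 4 worker threads.
  folly::CPUThreadPoolExecutor cpuPool(4);
  cpuPool.add([] { /* crunch numbers */ });

  // IO-bound work: each worker thread runs an epoll-backed event base.
  folly::IOThreadPoolExecutor ioPool(2);
  auto* evb = ioPool.getEventBase(); // e.g. for a thrift/memcache client
  (void)evb;

  cpuPool.join(); // drain remaining tasks and stop the workers
  return 0;
}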
folly/FBString.h
fbstring
is a drop-in replacement for std::string
. The main benefit of fbstring
is significantly increased performance on virtually all important primitives. This is achieved by using a three-tiered storage strategy and by cooperating with the memory allocator. In particular, fbstring
is designed to detect use of jemalloc and cooperate with it to achieve significant improvements in speed and memory usage.
fbstring
supports 32- and 64-bit and little- and big-endian architectures.
* Small strings (<= 23 chars) are stored in-situ without memory allocation.
* Medium strings (24 - 255 chars) are stored in malloc-allocated memory and copied eagerly.
* Large strings (> 255 chars) are stored in malloc-allocated memory and copied lazily.
* 100% compatible with std::string.
* Thread-safe reference counted copy-on-write for "large" strings (> 255 chars).
* Uses malloc instead of allocators.
* Jemalloc-friendly. fbstring automatically detects if the application uses jemalloc and, if so, significantly improves the allocation strategy by using non-standard jemalloc extensions.
* find() is implemented using a simplified Boyer-Moore algorithm. Casual tests indicate a 30x speed improvement over string::find() for successful searches and a 1.5x speed improvement for failed searches.
* Offers conversions to and from std::string.
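A small sketch of the drop-in usage and the std::string round-trip; toStdString() is the conversion member named in the header, but treat the details as assumptions:

#include <folly/FBString.h>
#include <cassert>
#include <string>

int main() {
  folly::fbstring s = "in-situ"; // <= 23 chars: stored in place, no allocation
  std::string stdCopy = s.toStdString(); // conversion to std::string...
  folly::fbstring back(stdCopy); // ...and back again
  assert(back == s);
  return 0;
}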
folly/FBVector.h
Simply replacing std::vector
with folly::fbvector
(after having included the folly/FBVector.h
header file) will improve the performance of your C++ code using vectors with common coding patterns. The improvements are always non-negative, almost always measurable, frequently significant, sometimes dramatic, and occasionally spectacular.
folly::fbvector<int> numbers({0, 1, 2, 3});
numbers.reserve(10);
for (int i = 4; i < 10; i++) {
  numbers.push_back(i * 2);
}
assert(numbers[6] == 12);
std::vector
is the stalwart abstraction many use for dynamically-allocated arrays in C++. It is also the best known and most used of all containers. It may therefore seem a surprise that std::vector
leaves important - and sometimes vital - efficiency opportunities on the table. This document explains how our own drop-in abstraction fbvector
improves key performance aspects of std::vector
. Refer to folly/test/FBVectorTest.cpp for a few benchmarks.
It is well known that std::vector
grows exponentially (at a constant factor) in order to avoid quadratic growth performance. The trick is choosing a good factor. Any factor greater than 1 ensures O(1) amortized append complexity towards infinity. But a factor that’s too small (say, 1.1) causes frequent vector reallocation, and one that’s too large (say, 3 or 4) forces the vector to consume much more memory than needed.
The initial HP implementation by Stepanov used a growth factor of 2; i.e., whenever you’d push_back
into a vector without there being room, it would double the current capacity. This was not a good choice: it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any of its previously-allocated memory. Despite other compilers reducing the growth factor to 1.5, gcc has staunchly maintained its factor of 2. This makes std::vector
cache-unfriendly and memory-manager unfriendly.
To see why that’s the case, consider a large vector of capacity C. When there’s a request to grow the vector, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk of memory next to its current chunk, copy its existing data to the new chunk, and then deallocate the old chunk. So now we have a chunk of size C followed by a chunk of size k * C. Continuing this process we’ll then have a chunk of size k * k * C to the right and so on. That leads to a series of the form (using ^^ for power):
C, C*k, C*k^^2, C*k^^3, ...
If we choose k = 2 we know that every element in the series will be strictly larger than the sum of all previous ones because of the remarkable equality:
1 + 2^^1 + 2^^2 + 2^^3... + 2^^n = 2^^(n+1) - 1
This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can’t move back to its deallocated chunks. But any number smaller than 2 guarantees that you’ll at some point be able to reuse the previous chunks. For instance, choosing 1.5 as the factor allows memory reuse after 4 reallocations; 1.45 allows memory reuse after 3 reallocations; and 1.3 allows reuse after only 2 reallocations.
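To make the arithmetic concrete, here is the k = 1.5 series worked out (a sketch that assumes the freed chunks are adjacent and can be coalesced, and that the chunk being copied from is still live during a reallocation):

Chunks requested:  C, 1.5C, 2.25C, 3.375C, 5.0625C, 7.59375C, ...
4th reallocation:  request 5.0625C;  freed so far C + 1.5C + 2.25C = 4.75C   -> not enough, no reuse
5th reallocation:  request 7.59375C; freed so far 4.75C + 3.375C  = 8.125C  -> reuse possible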
Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don’t want to choose the theoretically absolute worst growth factor. fbvector
uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector
cooperates with jemalloc (below).
Virtually all modern allocators allocate memory in fixed-size quanta that are chosen to minimize management overhead while at the same time offering good coverage at low slack. For example, an allocator may choose blocks of doubling size (32, 64, 128, ...).
As discussed above, std::vector
also needs to (re)allocate in quanta. The next quantum is usually defined in terms of the current size times the infamous growth constant. Because of this setup, std::vector
has some slack memory at the end much like an allocated block has some slack memory at the end.
It doesn’t take a rocket surgeon to figure out that an allocator- aware std::vector
would be a marriage made in heaven: the vector could directly request blocks of “perfect” size from the allocator so there would be virtually no slack in the allocator. Also, the entire growth strategy could be adjusted to work perfectly with allocator’s own block growth strategy. That’s exactly what fbvector
does - it automatically detects the use of jemalloc and adjusts its reallocation strategy accordingly.
But wait, there’s more. Many memory allocators do not support in- place reallocation, although most of them could. This comes from the now notorious design of realloc()
to opaquely perform either in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack of control subsequently forced all clib-based allocator designs to avoid in-place reallocation, and that includes C++’s new
and std::allocator
. This is a major loss of efficiency because an in-place reallocation, being very cheap, may mean a much less aggressive growth strategy. In turn that means less slack memory and faster reallocations.
One particularly sensitive topic about handling C++ values is that they are all conservatively considered non- relocatable. In contrast, a relocatable value would preserve its invariant even if its bits were moved arbitrarily in memory. For example, an int32_t
is relocatable because moving its 4 bytes would preserve its actual value, so the address of that value does not “matter” to its integrity.
C++’s assumption of non-relocatable values hurts everybody for the benefit of a few questionable designs. The issue is that moving a C++ object “by the book” entails (a) creating a new copy from the existing value; (b) destroying the old value. This is quite vexing and violates common sense; consider this hypothetical conversation between Captain Picard and an incredulous alien:
Incredulous Alien: “So, this teleporter, how does it work?”
Picard: “It beams people and arbitrary matter from one place to another.”
Incredulous Alien: “Hmmm… is it safe?”
Picard: “Yes, but earlier models were a hassle. They’d clone the person to another location. Then the teleporting chief would have to shoot the original. Ask O’Brien, he was an intern during those times. A bloody mess, that’s what it was.”
Only a tiny minority of objects are genuinely non-relocatable: objects that use internal pointers, such as the class below, and objects that need to update "observers" that store pointers to them.

class Ew {
  char buffer[1024];
  char* pointerInsideBuffer;
 public:
  Ew() : pointerInsideBuffer(buffer) {}
  ...
};

The first class of designs can always be redone at small or no cost in efficiency. The second class of objects should not be values in the first place - they should be allocated with new
and manipulated using (smart) pointers. It is highly unusual for a value to have observers that alias pointers to it.
Relocatable objects are of high interest to std::vector
because such knowledge makes insertion into the vector and vector reallocation considerably faster: instead of going to Picard’s copy-destroy cycle, relocatable objects can be moved around simply by using memcpy
or memmove
. This optimization can yield arbitrarily high wins in efficiency; for example, it transforms vector< vector<double> >
or vector< hash_map<int, string> >
from risky liabilities into highly workable compositions.
In order to allow fast relocation without risk, fbvector
uses a trait folly::IsRelocatable
defined in "folly/Traits.h"
. By default, folly::IsRelocatable::value
conservatively yields false. If you know that your type Widget
is in fact relocatable, go right after Widget
’s definition and write this:
// at global namespace level
namespace folly {
template <>
struct IsRelocatable<Widget> : boost::true_type {};
}
If you don’t do this, fbvector<Widget>
will fail to compile with a static_assert
.
fbvector
uses a careful implementation all around to make sure it doesn’t lose efficiency through the cracks. Some future directions may be in improving raw memory copying (memcpy
is not an intrinsic in gcc and does not work terribly well for large chunks) and in furthering the collaboration with jemalloc. Have fun!
folly/Format.h
folly/Format.h
provides a fast, powerful, type-safe, flexible facility for formatting text, using a specification language similar to Python’s str.format. By default, it can format strings, numbers (integral and floating point), and dynamically-typed folly::dynamic
objects, and can extract values from random-access containers and string-keyed maps. In many cases, format
is faster than sprintf
as well as being fully type-safe.
Here are some code samples to get started:
using folly::format;
using folly::sformat;
// Objects produced by format() can be streamed without creating
// an intermediary string; {} yields the next argument using default
// formatting.
std::cout << format("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"
// If you just want the string, though, you're covered.
std::string result = sformat("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"
// To insert a literal '{' or '}', just double it.
std::cout << format("{} {{}} {{{}}}", 23, 42);
// => "23 {} {42}"
// Arguments can be referenced out of order, even multiple times
std::cout << format("The answers are {1}, {0}, and {1} again", 23, 42);
// => "The answers are 42, 23, and 42 again"
// It's perfectly fine to not reference all arguments
std::cout << format("The only answer is {1}", 23, 42);
// => "The only answer is 42"
// Values can be extracted from indexable containers
// (random-access sequences and integral-keyed maps), and also from
// string-keyed maps
std::vector<int> v {23, 42};
std::map<std::string, std::string> m { {"what", "answer"} };
std::cout << format("The only {1[what]} is {0[1]}", v, m);
// => "The only answer is 42"
// format works with pairs and tuples
std::tuple<int, std::string, int> t {42, "hello", 23};
std::cout << format("{0} {2} {1}", t);
// => "42 23 hello"
// Format supports width, alignment, arbitrary fill, and various
// format specifiers, with meanings similar to printf
// "X<10": fill with 'X', left-align ('<'), width 10
std::cout << format("{:X<10} {}", "hello", "world");
// => "helloXXXXX world"
// Field width may be a runtime value rather than part of the format string
int x = 6;
std::cout << format("{:-^*}", x, "hi");
// => "--hi--"
// Explicit arguments work with dynamic field width, as long as indexes are
// given for both the value and the field width.
std::cout << format("{2:+^*0}",
9, "unused", 456); // => "+++456+++"
// Format supports printf-style format specifiers
std::cout << format("{0:05d} decimal = {0:04x} hex", 42);
// => "00042 decimal = 002a hex"
// Formatter objects may be written to a string using folly::to or
// folly::toAppend (see folly/Conv.h), or by calling their appendTo(),
// and str() methods
std::string s = format("The only answer is {}", 42).str();
std::cout << s;
// => "The only answer is 42"
// Decimal precision usage
std::cout << format("Only 2 decimals is {:.2f}", 23.34134534535);
// => "Only 2 decimals is 23.34"
Format string (format): "{" [arg_index] ["[" key "]"] [":" format_spec] "}"

* arg_index: index of the argument to format; default = next argument. Note that a format string may have either default argument indexes or explicit argument indexes, but not both (to avoid confusion).
* key: if the argument is a container (C-style array or pointer, std::array, vector, deque, map), you may use this to select the element to format; works with random-access sequences and integer- and string-keyed maps. Multiple level keys work as well, with components separated with "."; for example, given map<string, map<string, string>> m, {[foo.bar]} selects m["foo"]["bar"].
* format_spec: format specification, see below

Format specification: [[fill] align] [sign] ["#"] ["0"] [width] [","] ["." precision] ["."] [type]

* fill (may only be specified if align is also specified): pad with this character (' ' (space) or '0' (zero) might be useful; space is the default)
* align: one of '<', '>', '=', '^':
  * '<': left-align (default for most objects)
  * '>': right-align (default for numbers)
  * '=': pad after sign, but before significant digits; used to print -0000120; only valid for numbers
  * '^': center
* sign: one of '+', '-', ' ' (space) (only valid for numbers):
  * '+': output '+' if positive or zero, '-' if negative
  * '-': output '-' if negative, nothing otherwise (default)
  * ' ' (space): output ' ' (space) if positive or zero, '-' if negative
* '#': output base prefix (0 for octal, 0b or 0B for binary, 0x or 0X for hexadecimal; only valid for integers)
* '0': 0-pad after sign, same as specifying "0=" as the fill and align parameters (only valid for numbers)
* width: minimum field width. May be '*' to indicate that the field width is given by an argument. Defaults to the next argument (preceding the value to be formatted) but an explicit argument index may be given following the '*'.
* ',' (comma): output comma as thousands' separator (only valid for integers, and only for decimal output)
* precision (not allowed for integers): number of digits past the decimal point ('f' or 'F' presentation) or number of significant digits ('g' or 'G')
* '.' (when used after precision or in lieu of precision): forces a trailing decimal point to make it clear this is a floating point value
* type: presentation format, see below

Presentation formats:

Strings (folly::StringPiece, std::string, folly::fbstring, const char*):
* 's' (default)

Integers:
* 'b': output in binary (base 2) ("0b" prefix if '#' specified)
* 'B': output in binary (base 2) ("0B" prefix if '#' specified)
* 'c': output as a character (cast to char)
* 'd': output in decimal (base 10) (default)
* 'o': output in octal (base 8)
* 'O': output in octal (base 8) (same as 'o')
* 'x': output in hexadecimal (base 16) (lower-case digits above 9)
* 'X': output in hexadecimal (base 16) (upper-case digits above 9)
* 'n': locale-aware output (currently same as 'd')

bool:
* output "true" or "false" as strings

char:
* same as other integral types, but the default presentation is 'c' instead of 'd'

Floating point (float, double; long double is not implemented):
* 'e': scientific notation using 'e' as exponent character
* 'E': scientific notation using 'E' as exponent character
* 'f': fixed point
* 'F': fixed point (same as 'f')
* 'g': general; use either 'f' or 'e' depending on magnitude (default)
* 'G': general; use either 'f' or 'E' depending on magnitude
* 'n': locale-aware version of 'g' (currently same as 'g')
* '%': percentage: multiply by 100 then display as 'f'

You can extend format for your own class by providing a specialization for folly::FormatValue. See folly/Format.h and folly/FormatArg.h for details, and the existing specialization for folly::dynamic in folly/dynamic-inl.h for an implementation example.
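As a rough illustration of that extension point, here is a hedged sketch for a hypothetical Point type. Delegating to FormatValue<std::string> mirrors how the folly::dynamic specialization handles its string case, but that delegation target is an assumption; consult dynamic-inl.h for the authoritative pattern.

#include <folly/Format.h>
#include <string>

struct Point { int x; int y; }; // hypothetical user type

namespace folly {
template <>
class FormatValue<Point> {
 public:
  explicit FormatValue(const Point& p) : p_(p) {}

  // Render the Point as a string, then delegate so that width, fill,
  // and alignment specifiers keep working for Point as well.
  template <class FormatCallback>
  void format(FormatArg& arg, FormatCallback& cb) const {
    std::string rep = sformat("({}, {})", p_.x, p_.y);
    FormatValue<std::string>(rep).format(arg, cb);
  }

 private:
  const Point& p_;
};
} // namespace folly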
folly/Function.h
folly::Function
is a polymorphic function wrapper that is not copyable and does not require the wrapped function to be copy constructible. It is similar to std::function
, but different with respect to some interesting features.
There are some limitations in std::function
that folly::Function
tries to avoid. std::function
is copy-constructible and requires that the callable that it wraps is copy-constructible as well, which is a constraint that is often inconvenient. In most cases when using a std::function
you don’t make use of its copy-constructibility, so you might sometimes feel like you get back very little in return for a noticeable restriction.
This restriction becomes apparent when trying to use a lambda capturing a unique_ptr
(or any non-copyable type) as a callback for a folly::Future.
std::unique_ptr<Foo> foo_ptr = std::make_unique<Foo>();
some_future.then(
    [foo_ptr = std::move(foo_ptr)](int x) mutable {
      foo_ptr->setX(x);
    });
This piece of code did not compile before folly::Future
started using folly::Function
instead of std::function
to store the callback. Because the lambda captures something non-copyable (the unique_ptr
), it is not copyable itself. And std::function
can only store copyable callables.
The implementation of folly::Future did not make use of the copy-constructibility of std::function
at any point. There was no benefit from the fact that the std::function
is copy-constructible, but the fact that it can only wrap copy-constructible callables posed a restriction.
A workaround was available: folly::MoveWrapper
, which wraps an object that may be non-copyable and implements copy operations by moving the embedded object. Using a folly::MoveWrapper
, you can capture non-copyable objects in a lambda, and the lambda itself is still copyable and may be wrapped in a std::function
. It is a pragmatic solution for the above problem, but you have to be a little careful. The problem is that you can’t use a MoveWrapper
anywhere where copy operations are assumed to behave like actual copy operations. Also, a folly::MoveWrapper<std::unique_ptr<T>>
essentially behaves like auto_ptr<T>
. Ask yourself whether you’d want to use lots of auto_ptr
s in your codebase. And the original question still persists: we very often don’t benefit from copy-constructibility of std::function
, so why do we have to live with this restriction? I.e. why do we have to use MoveWrapper
?
folly::Function
is an actual solution to this problem, as it can wrap non-copyable callables, at the cost of not being copy-constructible, which more often than not is not a relevant restriction. folly::Future
now uses folly::Function
to store callbacks, so the good news is: the code example from the top of this note is becoming a perfectly valid way to use future callbacks. The code compiles and behaves as you would expect.
Here are more details about folly::Function
: much like std::function
, it wraps an arbitrary object that can be invoked like a given function type. E.g. a folly::Function<int(std::string, double)>
can wrap any callable object that returns an int
(or something that is convertible to an int
) when invoked with a std::string
and a double
argument. The function type is a template parameter of folly::Function
, but the specific type of the callable is not.
Other than copyability, there is one more significant difference between std::function
and folly::Function
, and it concerns const-correctness. std::function
does not enforce const-correctness: it allows you to store mutable callables (i.e. callables that may change their inner state when executed, such as a mutable lambda) and call them in a const context (i.e. when you only have access to a const reference to the std::function
object). For example:
class FooBar {
public:
void call_func() const {
func_();
}
private:
std::function<void()> func_;
};
The call_func
member function is declared const. However, when it calls func_()
, it may change the inner state of func_
, and thereby the inner state of the FooBar
object. Inside the FooBar
class, func_
is like a non-const method that is callable from const methods.
Some people consider std::function
in the standard broken with respect to this. (Paper N4348 explains this problem in more detail.) It also lists possible ways to fix the problem. folly::Function
, however, goes a different way: you have to declare whether you want to store a const function, in which case you can invoke any reference (const or non-const) of the folly::Function
, or a non-const (mutable) function, in which case you need a non-const reference to the folly::Function
to be able to invoke it. In the above example, let’s say that func_
stores a const function, which makes it okay that it gets invoked from call_func
(a const method). Instead of std::function
, you could use folly::Function<void() const>
for the func_
member.
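To make that concrete, here is the earlier FooBar example rewritten with a const function type - a sketch that assumes only what the text above states:

#include <folly/Function.h>

class FooBar {
 public:
  void call_func() const {
    func_(); // okay: a folly::Function<void() const> is invocable via a const reference
  }

 private:
  // The wrapped callable must promise not to mutate its inner state.
  folly::Function<void() const> func_;
};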
Const-ness is part of a function type. To illustrate:
class Foo {
public:
int operator()() { return 1; }
int operator()(char const*) { return 2; }
int operator()(int) { return 3; }
int operator()(int) const { return 4; }
int operator()(int, int) const { return 5; }
};
You can overload methods multiple times using different argument signatures. Const-ness is part of that signature, so even for the same set of argument types you can overload a const and a non-const version. It’s not even particularly unusual to do that. Take for instance the begin()
method of STL container types: begin()
returns an iterator
, begin() const
returns a const_iterator
. folly::Function
allows you to select a specific overload easily:
folly::Function<int()> uf1 = Foo();
// uf1() returns 1
folly::Function<int(char const*)> uf2 = Foo();
// uf2() returns 2
folly::Function<int(int)> uf3 = Foo();
// uf3() returns 3
folly::Function<int(int) const> uf4 = Foo();
// uf4() returns 4
folly::Function<int(int, int) const> uf5 = Foo();
// uf5() returns 5
If cfoo
is a const-reference to a Foo
object, cfoo(int)
returns 4. If foo
is a non-const reference to a Foo
object, foo(int)
returns 3. Normal const-to-non-const conversion behavior applies: if you call foo(int, int)
it will return 5: a non-const reference will invoke the const method if no non-const method is defined. Which leads to the following behavior:
folly::Function<int(int, int)> uf5nc = Foo();
// uf5nc() returns 5
If you are wondering if the introduction of const function types means that you have to change lots of normal function types to const function types if you want to use folly::Function: not really, or at least not as much as you might think. There are only two reasons to use a folly::Function with a const function type:

* a callable object defines both const and non-const operator() and you explicitly want to select the const one
* you need to invoke a folly::Function from a const context (i.e. you only have a const reference to the folly::Function)
In practice, you will not need the const variant very often. Adding const to a function type adds a restriction for the callable: it must not change its inner state when invoked. If you don’t care whether it does or not, don’t worry about const!
A folly::Function<R(Args...) const>
can be converted into a folly::Function<R(Args...)>
: either way the stored callable will not change its inner state when invoked. The former type expresses and guarantees that, the latter does not. When you get rid of the const, the selected function stays the same:
folly::Function<int(int)> uf4nc = std::move(uf4);
// uf4nc() returns 4, not 3!
If you want to go the other way, you are talking about a (potentially dangerous) const cast: a callable that may or may not change its inner state is declared as one that guarantees not to do that. Proceed at your own risk! This conversion does not happen implicitly, though:
folly::Function<int() const> uf1c = folly::constCastFunction(std::move(uf1));
// uf1c() returns 1
Admittedly, seeing const function types as template parameters is unfamiliar. As far as I am aware it happens nowhere in the standard library. But it is the most consistent way to deal with the issue of const-correctness here. Const qualifiers are part of a function type, as a matter of fact. If you require a const-qualified function to be wrapped in a folly::Function, just declare it as that! More often than not you will find that you do not need the const qualifier. While the folly::Function implementation was being written, a good set of unit tests already existed before const function types were introduced. Not a single one of those unit tests had to be changed: they all compiled and passed after the introduction of const function types. New ones were of course added to test const-correctness. But in your day-to-day use of folly::Function you won't have to worry about const very often.
Like most implementations of std::function
, folly::Function
stores small callable objects in-line and larger callable objects will be stored on the heap. folly::Function
returns the size of the allocation it has made from its heapAllocatedMemory()
member function; naturally it will return 0
when the callable is stored in-line.
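For illustration, a minimal sketch (my example, not from the docs): a capture-less lambda is small enough for in-line storage, so heapAllocatedMemory() should report zero:

#include <folly/Function.h>
#include <cassert>

int main() {
  // A tiny lambda fits in folly::Function's in-line storage.
  folly::Function<int(int)> f = [](int x) { return x + 1; };
  assert(f.heapAllocatedMemory() == 0);
  assert(f(1) == 2);
  return 0;
}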
Folly Futures is an async C++ framework inspired by Twitter’s Futures implementation in Scala (see also Future.scala, Promise.scala, and friends), and loosely builds upon the existing but anemic Futures code found in the C++11 standard (std::future) and boost::future (especially >= 1.53.0). Although inspired by the C++11 std::future interface, it is not a drop-in replacement because some ideas don’t translate well enough to maintain API compatibility.
The primary difference from std::future is that you can attach callbacks to Futures (with thenValue
or thenTry
), under the control of an executor to manage where work runs, which enables sequential and parallel composition of Futures for cleaner asynchronous code.
#include <folly/futures/Future.h>
#include <folly/executors/ThreadedExecutor.h>
using namespace folly;
using namespace std;

void foo(int x) {
  // do something with x
  cout << "foo(" << x << ")" << endl;
}

// ...

folly::ThreadedExecutor executor;
cout << "making Promise" << endl;
Promise<int> p;
Future<int> f = p.getSemiFuture().via(&executor);
auto f2 = move(f).thenValue(foo);
cout << "Future chain made" << endl;

// ... now perhaps in another event callback

cout << "fulfilling Promise" << endl;
p.setValue(42);
move(f2).get();
cout << "Promise fulfilled" << endl;
This would print:
making Promise
Future chain made
fulfilling Promise
foo(42)
Promise fulfilled
In addition to this document, there is a blog post on code.facebook.com (June 2015).
This brief guide covers the basics. For a more in-depth coverage skip to the appropriate section.
Let’s begin with an example using an imaginary simplified Memcache client interface:
using std::string;

class MemcacheClient {
 public:
  struct GetReply {
    enum class Result {
      FOUND,
      NOT_FOUND,
      SERVER_ERROR,
    };

    Result result;
    // The value when result is FOUND,
    // The error message when result is SERVER_ERROR or CLIENT_ERROR
    // undefined otherwise
    string value;
  };

  GetReply get(string key);
};
This API is synchronous, i.e. when you call get()
you have to wait for the result. This is very simple, but unfortunately it is also very easy to write very slow code using synchronous APIs.
Now, consider this traditional asynchronous signature for the same operation:
int async_get(string key, std::function<void(GetReply)> callback);
When you call async_get()
, your asynchronous operation begins and when it finishes your callback will be called with the result. Very performant code can be written with an API like this, but for nontrivial applications the code devolves into a special kind of spaghetti code affectionately referred to as “callback hell”.
The Future-based API looks like this:
SemiFuture<GetReply> future_get(string key);
A SemiFuture<GetReply>
or Future<GetReply>
is a placeholder for the GetReply
that we will eventually get. For most of the descriptive text below, Future can refer to either folly::SemiFuture
or folly::Future
as the former is a safe subset of the latter. A Future usually starts life out “unfulfilled”, or incomplete, i.e.:
fut.isReady() == false
fut.value() // will throw an exception because the Future is not ready
At some point in the future, the Future
will have been fulfilled, and we can access its value.
fut.isReady() == true
GetReply& reply = fut.value();
Futures support exceptions. If the asynchronous producer fails with an exception, your Future may represent an exception instead of a value. In that case:
fut.isReady() == true
fut.value() // will rethrow the exception
Just what is exceptional depends on the API. In our example we have chosen not to raise exceptions for SERVER_ERROR
, but represent this explicitly in the GetReply
object. On the other hand, an astute Memcache veteran would notice that we left CLIENT_ERROR
out of GetReply::Result
, and perhaps a CLIENT_ERROR
would have been raised as an exception, because CLIENT_ERROR
means there’s a bug in the library and this would be truly exceptional. These decisions are judgement calls by the API designer. The important thing is that exceptional conditions (including and especially spurious exceptions that nobody expects) get captured and can be handled higher up the “stack”.
So far we have described a way to initiate an asynchronous operation via an API that returns a Future, and then sometime later after it is fulfilled, we get its value. This is slightly more useful than a synchronous API, but it’s not yet ideal. There are two more very important pieces to the puzzle.
First, we can aggregate Futures, to define a new Future that completes after some or all of the aggregated Futures complete. Consider two examples: fetching a batch of requests and waiting for all of them, and fetching a group of requests and waiting for only one of them.
MemcacheClient mc;

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto all = collectAll(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto any = collectAny(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto anyv = collectAnyWithoutException(futs.begin(), futs.end());
all and any are Futures (for the exact type and usage see the header files). They will be complete when all/one of futs are complete, respectively. (There is also collectN() for when you need some, and collectAnyWithoutException for when you need the first value that completes without an exception.)
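For instance, a minimal sketch of collectN(), reusing mc and keys from the examples above, that completes once any 10 of the requests have finished:

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
// Completes once any 10 of the futures have completed.
auto some = collectN(futs.begin(), futs.end(), 10);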
Second, we can associate a Future with an executor. An executor specifies where work will run, and we detail this more later. In summary, given an executor we can convert a SemiFuture
to a Future
with an executor, or a Future
on one executor to a Future
on another executor.
For example:
folly::ThreadedExecutor executor;
SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);
Once an executor is attached, a Future
allows continuations to be attached and chained together monadically. An example will clarify:
SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);

Future<string> fut2 = std::move(fut1).thenValue(
    [](GetReply reply) {
      if (reply.result == MemcacheClient::GetReply::Result::FOUND)
        return reply.value;
      throw SomeException("No value");
    });

Future<Unit> fut3 = std::move(fut2)
    .thenValue([](string str) {
      cout << str << endl;
    })
    .thenTry([](folly::Try<string> strTry) {
      cout << strTry.value() << endl;
    })
    .thenError(folly::tag_t<std::exception>{}, [](std::exception const& e) {
      cerr << e.what() << endl;
    });
That example is a little contrived but the idea is that you can transform a result from one type to another, potentially in a chain, and unhandled errors propagate. Of course, the intermediate variables are optional.
Using .thenValue or .thenTry to add callbacks is idiomatic. It brings all the code into one place, which avoids callback hell. .thenValue appends a continuation that takes T&& for some Future<T>; an error bypasses the callback and is passed on to the next one. thenTry takes a callback taking folly::Try<T>, which encapsulates both value and exception. thenError(tag_t<ExceptionType>{}, ...) bypasses a value and runs only if there is an exception; the ExceptionType template parameter of the tag type allows filtering by exception type. The tag_t<ExceptionType>{} is optional, and if no tag is passed the function will be parameterized with a folly::exception_wrapper. With C++17 there is a global inline variable, and folly::tag<ExceptionType> may be passed directly without explicit construction.
Up to this point we have skirted around the matter of waiting for Futures. You may never need to wait for a Future, because your code is event-driven and all follow-up action happens in a then-block. But if you want a batch workflow, where you initiate a batch of asynchronous operations and then wait for them all to finish at a synchronization point, then you will want to wait for a Future. Futures have a blocking method called wait() that does exactly that and optionally takes a timeout.
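For example, a minimal sketch of a bounded wait (startAsyncWork is a hypothetical producer, not part of folly):

folly::SemiFuture<int> f = startAsyncWork(); // hypothetical producer
f.wait(std::chrono::seconds(1)); // block until ready, or for at most 1s
if (f.isReady()) {
  int result = std::move(f).get();
  // ... use result ...
}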
Futures are partially threadsafe. A Promise or Future can migrate between threads as long as there’s a full memory barrier of some sort. Future::thenValue
and Promise::setValue
(and all variants that boil down to those two calls) can be called from different threads. But, be warned that you might be surprised about which thread your callback executes on. Let’s consider an example, where we take a future straight from a promise, without going via the safer SemiFuture, and where we therefore have a Future
that does not carry an executor. This is in general something to avoid.
// Thread A
Promise<Unit> p;
auto f = p.getFuture();

// Thread B
std::move(f).thenValue(x).thenValue(y).thenTry(z);

// Thread A
p.setValue();
This is legal and technically threadsafe. However, it is important to realize that you do not know in which thread x
, y
, and/or z
will execute. Maybe they will execute in Thread A when p.setValue()
is called. Or, maybe they will execute in Thread B when f.thenValue
is called. Or, maybe x
will execute in Thread A, but y
and/or z
will execute in Thread B. There’s a race between setValue
and then
—whichever runs last will execute the callback. The only guarantee is that one of them will run the callback.
For safety, .via
should be preferred. We can chain .via
operations to give very strong control over where callbacks run:
std::move(aFuture)
.thenValue(x)
.via(e1).thenValue(y1).thenValue(y2)
.via(e2).thenValue(z);
x
will execute in the context of the executor associated with aFuture
. y1
and y2
will execute in the context of e1
, and z
will execute in the context of e2
. If after z
you want to get back to the original context, you need to get there with a call to via
passing the original executor.
If you are wrapping an asynchronous operation, or providing an asynchronous API to users, then you will want to make Promise
s. Every Future has a corresponding Promise (except Futures that spring into existence already completed, with makeFuture()
). Promises are simple: you make one, you extract the Future, and you fulfill it with a value or an exception. Example:
Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setValue(42);

f.isReady() == true
f.value() == 42
and an exception example:
Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setException(std::runtime_error("Fail"));

f.isReady() == true
f.value() // throws the exception
It’s good practice to use setWith which takes a function and automatically captures exceptions, e.g.
Promise<int> p;
p.setWith([]{
  try {
    // do stuff that may throw
    return 42;
  } catch (MySpecialException const& e) {
    // handle it
    return 7;
  }
  // Any exceptions that we didn't catch, will be caught for us
});
folly/GroupVarint.h
folly/GroupVarint.h
is an implementation of variable-length encoding for 32- and 64-bit integers using the Group Varint encoding scheme as described in Jeff Dean’s WSDM 2009 talk and in Information Retrieval: Implementing and Evaluating Search Engines.
Briefly, a group of four 32-bit integers is encoded as a sequence of variable length, between 5 and 17 bytes; the first byte encodes the length (in bytes) of each integer in the group. A group of five 64-bit integers is encoded as a sequence of variable length, between 7 and 42 bytes; the first two bytes encode the length (in bytes) of each integer in the group.
GroupVarint.h
defines a few classes:
GroupVarint<T>
, where T
is uint32_t
or uint64_t
:
Basic encoding / decoding interface, mainly aimed at encoding / decoding one group at a time.
GroupVarintEncoder<T, Output>
, where T
is uint32_t
or uint64_t
, and Output
is a functor that accepts StringPiece
objects as arguments:
Streaming encoder: add values one at a time, and they will be flushed to the output one group at a time. Handles the case where the last group is incomplete (the number of integers to encode isn’t a multiple of the group size)
GroupVarintDecoder<T>
, where T
is uint32_t
or uint64_t
:
Streaming decoder: extract values one at a time. Handles the case where the last group is incomplete.
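As an illustration, here is a minimal sketch (assuming your platform supports GroupVarint; the values are arbitrary) of streaming 32-bit values through the encoder into a std::string:

#include <folly/GroupVarint.h>
#include <string>

std::string out;
auto append = [&](folly::StringPiece sp) { out.append(sp.data(), sp.size()); };
folly::GroupVarintEncoder<uint32_t, decltype(append)> enc(append);
for (uint32_t v : {1u, 128u, 100000u, 1u << 30}) {
  enc.add(v);
}
enc.finish(); // flushes the last (possibly incomplete) group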
The 32-bit implementation is significantly faster than the 64-bit implementation; on platforms supporting the SSSE3 instruction set, we use the PSHUFB instruction to speed up lookup, as described in SIMD-Based Decoding of Posting Lists (CIKM 2011).
For more details, see the header file folly/GroupVarint.h
and the associated test file folly/test/GroupVarintTest.cpp
.
folly/stats/Histogram.h
Histogram
Histogram.h
defines a simple histogram class, templated on the type of data you want to store. This class is useful for tracking a large stream of data points, where you want to remember the overall distribution of the data, but do not need to remember each data point individually.
Each histogram bucket stores the number of data points that fell in the bucket, as well as the overall sum of the data points in the bucket. Note that no overflow checking is performed, so if you have a bucket with a large number of very large values, it may overflow and cause inaccurate data for this bucket. As such, the histogram class is not well suited to storing data points with very large values. However, it works very well for smaller data points such as request latencies, request or response sizes, etc.
In addition to providing access to the raw bucket data, the Histogram
class also provides methods for estimating percentile values. This allows you to estimate the median value (the 50th percentile) and other values such as the 95th or 99th percentiles.
All of the buckets have the same width. The number of buckets and bucket width is fixed for the lifetime of the histogram. As such, you do need to know your expected data range ahead of time in order to have accurate statistics. The histogram does keep one bucket to store all data points that fall below the histogram minimum, and one bucket for the data points above the maximum. However, because these buckets don’t have a good lower/upper bound, percentile estimates in these buckets may be inaccurate.
HistogramBuckets
The Histogram
class is built on top of HistogramBuckets
. HistogramBuckets
provides an API very similar to Histogram
, but allows a user-defined bucket class. This allows users to implement more complex histogram types that store more than just the count and sum in each bucket.
When computing percentile estimates HistogramBuckets
allows user-defined functions for computing the average value and data count in each bucket. This allows you to define more complex buckets which may have multiple different ways of computing the average value and the count.
For example, one use case could be tracking timeseries data in each bucket. Each set of timeseries data can have independent data in the bucket, which can show how the data distribution is changing over time.
Say we have code that sends many requests to remote services, and we want to generate a histogram showing how long the requests take. The following code will initialize a histogram with 50 buckets, tracking values between 0 and 5000. (There are 50 buckets since the bucket width is specified as 100. If the histogram range is not an even multiple of the bucket width, the last bucket will simply be narrower than the others.)
folly::Histogram<int64_t> latencies(100, 0, 5000);
The addValue() method is used to add values to the histogram. Each time a request finishes we can add its latency to the histogram:
latencies.addValue(now - startTime);
You can access each of the histogram buckets to display the overall distribution. Note that bucket 0 tracks all data points that were below the specified histogram minimum, and the last bucket tracks the data points that were above the maximum.
auto numBuckets = latencies.getNumBuckets();
cout << "Below min: " << latencies.getBucketByIndex(0).count << "\n";
for (unsigned int n = 1; n < numBuckets - 1; ++n) {
  cout << latencies.getBucketMin(n) << "-" << latencies.getBucketMax(n)
       << ": " << latencies.getBucketByIndex(n).count << "\n";
}
cout << "Above max: "
     << latencies.getBucketByIndex(numBuckets - 1).count << "\n";
You can also use the getPercentileEstimate()
method to estimate the value at the Nth percentile in the distribution. For example, to estimate the median, as well as the 95th and 99th percentile values:
int64_t median = latencies.getPercentileEstimate(0.5);
int64_t p95 = latencies.getPercentileEstimate(0.95);
int64_t p99 = latencies.getPercentileEstimate(0.99);
Note that Histogram
and HistogramBuckets
objects are not thread-safe. If you wish to access a single Histogram
from multiple threads, you must perform your own locking to ensure that multiple threads do not access it at the same time.
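If you do need shared access, one option (a sketch only, using folly::Synchronized, which is described later in this document) is to wrap the histogram so every access takes a lock:

#include <folly/Synchronized.h>
#include <folly/stats/Histogram.h>

// Same bucketing as the example above (width 100, range 0-5000).
folly::Synchronized<folly::Histogram<int64_t>> latencies{
    folly::Histogram<int64_t>(100, 0, 5000)};

void recordLatency(int64_t value) {
  latencies.wlock()->addValue(value); // lock held only for this call
}

int64_t estimateMedian() {
  return latencies.rlock()->getPercentileEstimate(0.5);
}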
folly/
Below is a list of (some) Folly components in alphabetical order, along with a brief description of each.
Arena.h
, ThreadCachedArena.h
Simple arena for memory allocation: multiple allocations get freed all at once. With threaded version.
AtomicHashMap.h
, AtomicHashArray.h
, AtomicLinkedList.h
, …
High-performance atomic data-structures. Many of these are built with very specific tradeoffs and constraints in mind that make them faster than their more general counterparts. Each header should contain information about what these tradeoffs are.
Baton.h
A Baton allows a thread to block once and be awoken: it captures a single handoff. It is essentially a (very small, very fast) semaphore that supports only a single call to sem_call
and sem_wait
.
Benchmark.h
A small framework for benchmarking code. Client code registers benchmarks, optionally with an argument that dictates the scale of the benchmark (iterations, working set size etc). The framework runs benchmarks (subject to a command-line flag) and produces formatted output with timing information.
Bits.h
Various bit manipulation utilities optimized for speed; includes functions that wrap the ffsl(l) primitives in a uniform interface.
ConcurrentSkipList.h
An implementation of the structure described in A Provably Correct Scalable Concurrent Skip List by Herlihy et al.
Conv.h
A variety of data conversion routines (notably to and from string), optimized for speed and safety.
Demangle.h
Pretty-printing C++ types.
DiscriminatedPtr.h
Similar to std::variant
, but restricted to pointers only. Uses the highest-order unused 16 bits in a pointer as discriminator. So sizeof(DiscriminatedPtr<int, string, Widget>) == sizeof(void*)
.
dynamic.h
Dynamically-typed object, created with JSON objects in mind. DynamicConverter.h
is a utility for efficiently converting from a dynamic
to a more concrete structure when the scheme is known (e.g. json -> map<int,int>
).
EvictingCacheMap.h
A simple LRU hash map.
FBString.h
A drop-in implementation of std::string
with a variety of optimizations.
FBVector.h
A mostly drop-in implementation of std::vector
with a variety of optimizations.
File.h
A C++ abstraction around files.
Fingerprint.h
Rabin fingerprinting.
Function.h
A polymorphic wrapper for callables similar to std::function
but not copyable and therefore able to wrap non-copyable callables, such as lambdas that capture move-only types like std::unique_ptr
or folly::Promise
.
futures/
Futures is a framework for expressing asynchronous code in C++ using the Promise/Future pattern.
Format.h
Python-style formatting utilities.
gen/
This library makes it possible to write declarative comprehensions for processing sequences of values efficiently in C++ akin to C#’s LINQ.
GroupVarint.h
Group Varint encoding for 32-bit values.
IPAddress.h
A collection of utilities to deal with IPAddresses, including ipv4 and ipv6.
io/
A collection of useful abstractions for high-performance I/O. This is heavily relied upon in Facebook's internal networking code.
Hash.h
Various popular hash function implementations.
Histogram.h
A simple class for collecting histogram data.
IntrusiveList.h
Convenience type definitions for using boost::intrusive_list
.
json.h
JSON serializer and deserializer. Uses dynamic.h
.
Likely.h
Wrappers around __builtin_expect
.
Malloc.h
, Memory.h
Memory allocation helpers, particularly when using jemalloc.
MicroSpinLock.h
A really, really small spinlock for fine-grained locking of lots of teeny-tiny data.
MPMCQueue.h
A high-performance bounded concurrent queue that supports multiple producers and multiple consumers. The additional utility MPMCPipeline.h
is an extension that lets you chain several queues together with processing steps in between.
PackedSyncPtr.h
A highly specialized data structure consisting of a pointer, a 1-bit spin lock, and a 15-bit integral, all inside one 64-bit word.
Poly.h
A class template that makes it relatively easy to define a type-erasing polymorphic object wrapper.
Preprocessor.h
Necessarily evil stuff.
ProducerConsumerQueue.h
Lock free single-reader, single-writer queue.
Random.h
Defines only one function—randomNumberSeed()
.
Range.h
Boost-style range facility and the StringPiece
specialization.
RWSpinLock.h
Fast and compact reader-writer spin lock.
ScopeGuard.h
C++11 incarnation of the old ScopeGuard idiom.
Singleton.h
A singleton to rule the singletons. This is an attempt to insert a layer between C++ statics and the fiasco that ensues, so that things can be created, and destroyed, correctly upon program creation, program end and sometimes dlopen
and fork
.
Singletons are bad for you, but this may help.
SmallLocks.h
Very small spin locks (1 byte and 1 bit).
small_vector.h
Vector with the small buffer optimization and an optional embedded PicoSpinLock
.
sorted_vector_types.h
Collections similar to std::map
but implemented as sorted vectors.
stats/
A collection of efficient utilities for collecting statistics:
* time series counters, gauges, histograms, and quantiles;
* single-pass mean and variance.
StlAllocator.h
STL allocator wrapping a simple allocate/deallocate interface.
String.h
String utilities that connect folly::fbstring
with std::string
.
Subprocess.h
Subprocess library, modeled after Python’s subprocess module.
Synchronized.h
High-level synchronization library.
System.h
Demangling and errno utilities.
ThreadCachedInt.h
High-performance atomic increment using thread caching.
ThreadLocal.h
Improved thread local storage for non-trivial types.
TimeoutQueue.h
Queue with per-item timeout.
Traits.h
Type traits that complement those defined in the standard C++11 header <type_traits>.
Unicode.h
Defines the codePointToUtf8
function.
Uri.h
A collection of utilities to deal with URIs.
folly/PackedSyncPtr.h
A highly specialized data structure consisting of a pointer, a 1-bit spin lock, and a 15-bit integral packed into sizeof(void*)
.
Typical application is for microsharding of many elements within containers. Because there is no memory overhead, an arbitrarily large number of locks can be used to minimize lock contention with no memory penalty. Additionally, excellent cache performance is obtained by storing the lock inline with the pointer (no additional cache miss or false sharing). Finally, because it uses a simple spinlock mechanism, the cost of acquiring an uncontended lock is minimal.
This is not a “smart” pointer: nothing automagical is going on here. Locking is up to the user. Resource deallocation is up to the user. Locks are never acquired or released outside explicit calls to lock() and unlock().
Change the value of the raw pointer with set(), but you must hold the lock when calling this function if multiple threads could be using it.
Here is an example of using a PackedSyncPtr to build a synchronized vector with no memory overhead - the spinlock and size are stored in the 16 unused bits of the pointer, the rest of which points to the actual data. See folly/small_vector.h
for a complete implementation of this concept.
template<typename T>
class SyncVec {
  PackedSyncPtr<T> base;

 public:
  SyncVec() { base.init(); }

  void push_back(const T& t) {
    base.set(
      static_cast<T*>(realloc(base.get(), (base.extra() + 1) * sizeof(T))));
    base[base.extra()] = t;
    base.setExtra(base.extra() + 1);
  }

  size_t size() const {
    return base.extra();
  }

  void lock() {
    base.lock();
  }

  void unlock() {
    base.unlock();
  }

  T* begin() const {
    return base.get();
  }

  T* end() const {
    return base.get() + base.extra();
  }
};
This is using an x64-specific detail about the effective virtual address space. Long story short: the upper two bytes of all our pointers will be zero in reality—and if you have a couple billion such pointers in core, it makes pretty good sense to try to make use of that memory. The exact details can be perused here:
http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses
folly/Poly.h
Poly
is a class template that makes it relatively easy to define a type-erasing polymorphic object wrapper.
std::function
is one example of a type-erasing polymorphic object wrapper; folly::exception_wrapper
is another. Type-erasure is often used as an alternative to dynamic polymorphism via inheritance-based virtual dispatch. The distinguishing characteristics of type-erasing wrappers are:

* Duck typing: types do not need to inherit from an abstract base class in order to be stored in the wrapper; they merely need to satisfy the required interface.
* Value semantics: inheritance-based polymorphism often forces objects onto the heap and leads to a proliferation of shared_ptrs and unique_ptrs in APIs, complicating their point-of-use. APIs that take type-erasing wrappers, on the other hand, can often store small objects in-situ, with no dynamic allocation. The memory management, if any, is handled for you, and leads to cleaner APIs: consumers of your API don't need to pass shared_ptr<AbstractBase>; they can simply pass any object that satisfies the interface you require. (std::function is a particularly compelling example of this benefit. Far worse would be an inheritance-based callable solution like shared_ptr<ICallable<void(int)>>.)

folly::Poly
Defining a polymorphic wrapper with Poly is a matter of defining two things:

* an interface, consisting of the public member functions of the wrapper, and
* a mapping from a concrete type to a set of member function bindings.
Below is a simple example program that defines a drawable
wrapper for any type that provides a draw
member function. (The details will be explained later.)
// This example is an adaptation of one found in Louis Dionne's dyno library.
#include <folly/Poly.h>
#include <iostream>
struct IDrawable {
// Define the interface of something that can be drawn:
template <class Base> struct Interface : Base {
void draw(std::ostream& out) const { folly::poly_call<0>(*this, out);}
};
// Define how concrete types can fulfill that interface (in C++17):
template <class T> using Members = folly::PolyMembers<&T::draw>;
};
// Define an object that can hold anything that can be drawn:
using drawable = folly::Poly<IDrawable>;
struct Square {
void draw(std::ostream& out) const { out << "Square\n"; }
};
struct Circle {
void draw(std::ostream& out) const { out << "Circle\n"; }
};
void f(drawable const& d) {
  d.draw(std::cout);
}

int main() {
  f(Square{}); // prints Square
  f(Circle{}); // prints Circle
}
The above program prints:
Square
Circle
Here is another (heavily commented) example of a simple implementation of a std::function
-like polymorphic wrapper. Its interface has only a single member function: operator().
// An interface for a callable object of a particular signature, Fun
// (most interfaces don't need to be templates, FWIW).
template <class Fun>
struct IFunction;
template <class R, class... As>
struct IFunction<R(As...)> {
// An interface is defined as a nested class template called
// Interface that takes a single template parameter, Base, from
// which it inherits.
template <class Base>
struct Interface : Base {
// The Interface has public member functions. These become the
// public interface of the resulting Poly instantiation.
// (Implementation note: Poly<IFunction<Sig>> will publicly
// inherit from this struct, which is what gives it the right
// member functions.)
    R operator()(As... as) const {
      // The definition of each member function in your interface will
      // always consist of a single line dispatching to folly::poly_call<N>.
      // The "N" corresponds to the N-th member function in the
      // list of member function bindings, Members, defined below.
      // The first argument will always be *this, and the rest of the
      // arguments should simply forward (if necessary) the member
      // function's arguments.
      return static_cast<R>(
          folly::poly_call<0>(*this, std::forward<As>(as)...));
    }
};
// The "Members" alias template is a comma-separated list of bound
// member functions for a given concrete type "T". The
// "FOLLY_POLY_MEMBERS" macro accepts a comma-separated list, and the
// (optional) "FOLLY_POLY_MEMBER" macro lets you disambiguate overloads
// by explicitly specifying the function signature the target member
// function should have. In this case, we require "T" to have a
// function call operator with the signature `R(As...) const`.
//
// If you are using a C++17-compatible compiler, you can do away with
// the macros and write this as:
//
// template <class T>
// using Members =
// folly::PolyMembers<folly::sig<R(As...) const>(&T::operator())>;
//
// And since `folly::sig` is only needed for disambiguation in case of
// overloads, if you are not concerned about objects with overloaded
// function call operators, it could be further simplified to:
//
// template <class T>
// using Members = folly::PolyMembers<&T::operator()>;
//
  template <class T>
  using Members = FOLLY_POLY_MEMBERS(
      FOLLY_POLY_MEMBER(R(As...) const, &T::operator()));
};
// Now that we have defined the interface, we can pass it to Poly to
// create our type-erasing wrapper:
template <class Fun>
using Function = Poly<IFunction<Fun>>;
Given the above definition of Function
, users can now initialize instances of (say) Function<int(int, int)>
with function objects like std::plus<int>
and std::multiplies<int>
, as below:
Function<int(int, int)> fun = std::plus<int>{};
assert(5 == fun(2, 3));

fun = std::multiplies<int>{};
assert(6 == fun(2, 3));
With C++17, defining an interface to be used with Poly
is fairly straightforward. As in the Function
example above, there is a struct with a nested Interface
class template and a nested Members
alias template. No macros are needed with C++17.
Imagine we were defining something like a Java-style iterator. If we are using a C++17 compiler, our interface would look something like this:
template <class Value>
struct IJavaIterator {
template <class Base>
struct Interface : Base {
bool Done() const { return folly::poly_call<0>(*this); }
    Value Current() const { return folly::poly_call<1>(*this); }
    void Next() { folly::poly_call<2>(*this); }
};
// NOTE: This works in C++17 only:
template <class T>
using Members = folly::PolyMembers<&T::Done, &T::Current, &T::Next>;
};
template <class Value>
using JavaIterator = Poly<IJavaIterator<Value>>;
Given the above definition, JavaIterator<int>
can be used to hold instances of any type that has Done
, Current
, and Next
member functions with the correct (or compatible) signatures.
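For instance, a hypothetical concrete type (my example, not from the docs) that satisfies the interface and can be stored in a JavaIterator<int>:

#include <iostream>

struct Range {
  int cur = 0;
  int end = 3;
  bool Done() const { return cur == end; }
  int Current() const { return cur; }
  void Next() { ++cur; }
};

int main() {
  JavaIterator<int> it = Range{};
  while (!it.Done()) {
    std::cout << it.Current() << '\n'; // prints 0, 1, 2
    it.Next();
  }
}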
The presence of overloaded member functions complicates this picture. Often, property members are faked in C++ with const
and non-const
member function overloads, like in the interface specified below:
struct IIntProperty {
template <class Base>
struct Interface : Base {
int Value() const { return folly::poly_call<0>(*this); }
void Value(int i) { folly::poly_call<1>(*this, i); }
};
// NOTE: This works in C++17 only:
template <class T>
using Members = folly::PolyMembers<
      folly::sig<int() const>(&T::Value),
      folly::sig<void(int)>(&T::Value)>;
};
using IntProperty = Poly<IIntProperty>;
Now, any object that has Value
members of compatible signatures can be assigned to instances of IntProperty
object. Note how folly::sig
is used to disambiguate the overloads of &T::Value
.
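For instance, a hypothetical type (my example) with both overloads can be stored and used like this:

struct Prop {
  int v = 0;
  int Value() const { return v; }
  void Value(int i) { v = i; }
};

IntProperty p = Prop{};
p.Value(42);
assert(42 == p.Value());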
In C++14, the nice syntax above doesn’t work, so we have to resort to macros. The two examples above would look like this:
template <class Value>
struct IJavaIterator {
template <class Base>
struct Interface : Base {
bool Done() const { return folly::poly_call<0>(*this); }
    Value Current() const { return folly::poly_call<1>(*this); }
    void Next() { folly::poly_call<2>(*this); }
};
// NOTE: This works in C++14 and C++17:
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Done, &T::Current, &T::Next);
};
template <class Value>
using JavaIterator = Poly<IJavaIterator<Value>>;
and
struct IIntProperty {
template <class Base>
struct Interface : Base {
int Value() const { return folly::poly_call<0>(*this); }
void Value(int i) { return folly::poly_call<1>(*this, i); }
};
// NOTE: This works in C++14 and C++17:
template <class T>
using Members = FOLLY_POLY_MEMBERS(
      FOLLY_POLY_MEMBER(int() const, &T::Value),
      FOLLY_POLY_MEMBER(void(int), &T::Value));
};
using IntProperty = Poly<IIntProperty>;
One typical advantage of inheritance-based solutions to runtime polymorphism is that one polymorphic interface could extend another through inheritance. The same can be accomplished with type-erasing polymorphic wrappers. In the Poly
library, you can use folly::PolyExtends
to say that one interface extends another.
struct IFoo {
template <class Base>
struct Interface : Base {
void Foo() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Foo);
};
// The IFooBar interface extends the IFoo interface
struct IFooBar : PolyExtends<IFoo> {
template <class Base>
struct Interface : Base {
void Bar() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Bar);
};
using FooBar = Poly<IFooBar>;
Given the above definition, instances of type FooBar
have both Foo()
and Bar()
member functions.
Sensible conversions exist between a wrapped derived type and a wrapped base type. For instance, assuming IDerived
extends IBase
with PolyExtends
:
Poly<IDerived> derived = ...;
Poly<IBase> base = derived; // This conversion is OK.
As you would expect, there is no conversion in the other direction, and at present there is no Poly
equivalent to dynamic_cast
.
Sometimes you don’t need to own a copy of an object; a reference will do. For that you can use Poly
to capture a reference to an object satisfying an interface rather than the whole object itself. The syntax is intuitive.
int i = 42;

// Capture a mutable reference to an object of any IRegular type:
Poly<IRegular &> intRef = i;

assert(42 == folly::poly_cast<int>(intRef));
// Assert that we captured the address of "i":
assert(&i == &folly::poly_cast<int>(intRef));
A reference-like Poly
has a different interface than a value-like Poly
. Rather than calling member functions with the obj.fun()
syntax, you would use the obj->fun()
syntax. This is for the sake of const
-correctness. For example, consider the code below:
struct IFoo {
template <class Base>
struct Interface : Base {
void Foo() { folly::poly_call<0>(*this); }
};
template <class T>
using Members = folly::PolyMembers<&T::Foo>;
};
struct SomeFoo {
void Foo() { std::printf("SomeFoo::Foo\n"); }
};
SomeFoo foo;
Poly<IFoo &> const anyFoo = foo;
anyFoo->Foo(); // prints "SomeFoo::Foo"
Notice in the above code that the Foo
member function is non-const
. Notice also that the anyFoo
object is const
. However, since it has captured a non-const
reference to the foo
object, it should still be possible to dispatch to the non-const
Foo
member function. When instantiated with a reference type, Poly
has an overloaded operator->
member that returns a pointer to the IFoo
interface with the correct const
-ness, which makes this work.
The same mechanism also prevents users from calling non-const
member functions on Poly
objects that have captured const
references, which would violate const
-correctness.
Sensible conversions exist between non-reference and reference Poly
s. For instance:
Poly<IRegular> value = 42;
Poly<IRegular &> mutable_ref = value;
Poly<IRegular const &> const_ref = mutable_ref;
assert(&poly_cast<int>(value) == &poly_cast<int>(mutable_ref));
assert(&poly_cast<int>(value) == &poly_cast<int>(const_ref));
If you wanted to write the interface ILogicallyNegatable
, which captures all types that can be negated with unary operator!
, you could do it as we’ve shown above, by binding &T::operator!
in the nested Members
alias template, but that has the problem that it won’t work for types that have defined unary operator!
as a free function. To handle this case, the Poly
library lets you use a free function instead of a member function when creating a binding.
With C++17 you may use a lambda to create a binding, as shown in the example below:
struct ILogicallyNegatable {
template <class Base>
struct Interface : Base {
bool operator!() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = folly::PolyMembers<
+[](T const& t) -> decltype(bool(!t)) { return bool(!t); }>;
};
This requires some explanation. The unary operator+
in front of the lambda is necessary! It causes the lambda to decay to a C-style function pointer, which is one of the types that folly::PolyMembers
accepts. The decltype
in the lambda return type is also necessary. Through the magic of SFINAE, it will cause Poly<ILogicallyNegatable>
to reject any types that don’t support unary operator!
.
If you are using a free function to create a binding, the first parameter is implicitly the this
parameter. It will receive the type-erased object.
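A brief usage sketch (my example): int supports unary operator!, so it satisfies the interface defined above:

folly::Poly<ILogicallyNegatable> p = 5;
assert(!p == false); // !5 is false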
If you are using a C++14 compiler, the definition of ILogicallyNegatable
above will fail because lambdas are not constexpr
. We can get the same effect by writing the lambda as a named free function, as shown below:
struct ILogicallyNegatable {
template <class Base>
struct Interface : Base {
bool operator!() const { return folly::poly_call<0>(*this); }
};
template <class T>
static auto negate(T const& t)
-> decltype(bool(!t)) { return bool(!t); }
template <class T>
using Members = FOLLY_POLY_MEMBERS(&negate<T>);
};
As with the example that uses the lambda in the preceding section, the first parameter is implicitly the this
parameter. It will receive the type-erased object.
What if you want to create an IAddable
interface for things that can be added? Adding requires two objects, both of which are type-erased. This interface requires dispatching on both objects, doing the addition only if the types are the same. For this we make use of the PolySelf
template alias to define an interface that takes more than one object of the erased type.
struct IAddable {
template <class Base>
struct Interface : Base {
friend PolySelf<Base>
operator+(PolySelf<Base> const& a, PolySelf<Base> const& b) {
return folly::poly_call<0>(a, b);
}
};
template <class T>
using Members = folly::PolyMembers<
+[](T const& a, T const& b) -> decltype(a + b) { return a + b; }>;
};
Given the above definition of IAddable
we would be able to do the following:
Poly<IAddable> a = 2, b = 3;
Poly<IAddable> c = a + b;
assert(poly_cast<int>(c) == 5);
If a
and b
stored objects of different types, a BadPolyCast
exception would be thrown.
If you want to store move-only types, then your interface should extend the poly::IMoveOnly
interface.
Poly
will store “small” objects in an internal buffer, avoiding the cost of dynamic allocations. At present, this size is not configurable; it is pegged at the size of two double
s.
Poly
objects are always nothrow movable. If you store an object in one that has a potentially throwing move constructor, the object will be stored on the heap, even if it could fit in the internal storage of the Poly
object. (So be sure to give your objects nothrow move constructors!)
Poly
implements type-erasure in a manner very similar to how the compiler accomplishes virtual dispatch. Every Poly
object contains a pointer to a table of function pointers. Member function calls involve a double indirection: once through the v-pointer, and another indirect call through the function pointer.
folly/ProducerConsumerQueue.h
The folly::ProducerConsumerQueue
class is a one-producer one-consumer queue with very low synchronization overhead.
The queue must be created with a fixed maximum size (and allocates that many cells of sizeof(T)), and it provides just a few simple operations:

read: Attempt to read the value at the front of the queue into a variable; returns false iff the queue was empty.
write: Emplace a value at the end of the queue; returns false iff the queue was full.
frontPtr: Retrieve a pointer to the item at the front of the queue, or nullptr if it is empty.
popFront: Remove the item from the front of the queue (the queue must not be empty).
isEmpty: Check if the queue is empty.
isFull: Check if the queue is full.
sizeGuess: Returns the number of entries in the queue. Because of the way we coordinate threads, this guess could be slightly wrong when called by the producer/consumer thread, and it could be wildly inaccurate if called from any other threads. Hence, only call from the producer/consumer threads!

All of these operations are wait-free. The read operations (including frontPtr and popFront) and write operations must only be called by the reader and writer thread, respectively. isFull, isEmpty, and sizeGuess may be called by either thread, but the return values from read, write, or frontPtr are sufficient for most cases.
write
may fail if the queue is full, and read
may fail if the queue is empty, so in many situations it is important to choose the queue size such that the queue filling or staying empty for long is unlikely.
A toy example that doesn’t really do anything useful:
folly::ProducerConsumerQueue<folly::fbstring> queue{size};

std::thread reader([&queue] {
  for (;;) {
    folly::fbstring str;
    while (!queue.read(str)) {
      // spin until we get a value
      continue;
    }
    sink(str);
  }
});

// producer thread:
for (;;) {
  folly::fbstring str = source();
  while (!queue.write(str)) {
    // spin until the queue has room
    continue;
  }
}
Alternatively, the consumer may be written as follows to use the ‘front’ value in place, thus avoiding moves or copies:
std::thread reader([&queue] {
  for (;;) {
    folly::fbstring* pval;
    do {
      pval = queue.frontPtr();
    } while (!pval); // spin until we get a value
    sink(*pval);
    queue.popFront();
  }
});
folly/synchronization/Rcu.h
C++ read-copy-update (RCU) functionality in folly.
Read-copy-update (RCU) is a low-overhead synchronization mechanism that provides guaranteed ordering between operations on shared data. In the simplest usage pattern, readers enter a critical section, view some state, and leave the critical section, while writers modify shared state and then defer some cleanup operations.
Proper use of the APIs provided by folly RCU will guarantee that a cleanup operation that is deferred during a reader critical section will not be executed until after that critical section is over.
The main synchronization primitive in folly RCU is a folly::rcu_domain
. A folly::rcu_domain
is a “universe” of deferred execution. Each domain has an executor on which deferred functions may execute. folly::rcu_domain
provides the requirements to be BasicLockable, and readers may enter a read region in a folly::rcu_domain
by treating the domain as a mutex type that can be wrapped by C++ locking primitives.
For example, to enter and exit a read region using non-movable RAII semantics, you could use an std::scoped_lock
:
{
  std::scoped_lock<folly::rcu_domain> lock(folly::rcu_default_domain());
  protectedData.read();
}
Alternatively, if you need to be able to move the lock, you could use std::unique_lock
:
class ProtectedData {
 private:
  std::unique_lock<folly::rcu_domain> lock_;
  void* data_;
};
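A minimal usage sketch (an assumption about how this might be used, not taken from the header): take the lock with std::unique_lock, then move it wherever it needs to live:

std::unique_lock<folly::rcu_domain> lock(folly::rcu_default_domain());
// ... read the protected data ...
// Because std::unique_lock is movable, it can be moved into a member
// such as ProtectedData::lock_ and released later from there.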
There is a global, default domain that can be accessed using folly::rcu_default_domain()
as in the example above. If required, you can also create your own domain:
{
  Executor* my_executor = getExecutor();
  folly::rcu_domain domain(my_executor /* or nullptr to use default */);
}
In general, using the default domain is strongly encouraged, as you will likely get better cache locality by sharing a domain that is used by other callers in the process. If, however, you can't avoid blocking during reader critical sections, your own custom domain should be used to avoid delaying reclamation from other updaters. A custom domain can also be used if you want update callbacks to be invoked on a specific executor.
A typical reader / updater synchronization will look something like this:
void doSomethingWith(IPAddress host);

static std::atomic<ConfigData*> globalConfigData;

void reader() {
  while (true) {
    IPAddress curManagementServer;
    {
      // We're about to do some reads we want to protect; if we read a
      // pointer, we need to make sure that if the writer comes along and
      // updates it, the writer's cleanup operation won't happen until we're
      // done accessing the pointed-to data. We get a Guard on that
      // domain; as long as it exists, no function subsequently passed to
      // invokeEventually will execute.
      std::scoped_lock<folly::rcu_domain> guard(folly::rcu_default_domain());
      ConfigData* configData = globalConfigData.load(std::memory_order_consume);
      // We created a guard before we read globalConfigData; we know that the
      // pointer will remain valid until the guard is destroyed.
      curManagementServer = configData->managementServerIP;
      // RCU domain via the scoped mutex is released here; retired objects
      // may be freed.
    }
    doSomethingWith(curManagementServer);
  }
}
void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData);
    // Alternatively, in a blocking manner:
    //   folly::rcu_synchronize();
    //   delete oldConfigData;
  }
}
In the example above, a single writer updates ConfigData*
in a loop, and then defers freeing it until all RCU readers have exited their read regions. The writer may use either of the following two APIs to safely defer freeing the old ConfigData*
:
rcu_retire(): to schedule asynchronous deletion of the oldConfigData pointer when all readers have exited their read regions.
rcu_synchronize(): to block the calling thread until all readers have exited their read regions, at which point the pointer is safe to be deleted.

If you expect there to be very long read regions, it may be required to use rcu_synchronize()
or a periodic rcu_barrier()
(described below) to avoid running out of memory due to delayed reclamation.
Another synchronization primitive provided by the folly RCU library is rcu_barrier()
. Unlike rcu_synchronize()
, which blocks until all outstanding readers have exited their read regions, rcu_barrier()
blocks until all outstanding deleters (specified in a call to rcu_retire()
) are completed.
As mentioned above, one example of where this may be useful is avoiding out-of-memory errors due to scheduling too many objects whose reclamation is delayed. Taking our example from above, we could avoid OOMs using a periodic invocation to rcu_barrier()
as follows:
static std::atomic<ConfigData*> globalConfigData;

void writer() {
  uint32_t retires = 0;
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    if (retires++ % 1000 == 0) {
      folly::rcu_barrier();
    }
    folly::rcu_retire(oldConfigData);
  }
}
When invoking folly::rcu_retire()
, you may optionally also pass a custom deleter function that is invoked instead of std::default_delete
:
#include <folly/logging/xlog.h>

static std::atomic<ConfigData*> globalConfigData;

void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData, [](ConfigData* obj) {
      XLOG(INFO) << "Deleting retired config " << obj->version;
      delete obj;
    });
  }
}
Exceptions may not be thrown at any point in a retire callback. This includes both the deleter, as well as the object’s destructor. Other than this, any operation is safe from within a retired object’s destructor, including retiring other objects, or even retiring the same object as long as the custom deleter did not free it.
When using the default domain or the default executor, it is not legal to hold a lock across a call to rcu_retire() if that lock is also acquired by the deleter. This is normally not a problem when using the default deleter delete, which does not acquire any user locks. However, even when using the default deleter, an object having a user-defined destructor that acquires locks held across the corresponding call to rcu_retire() can still deadlock.
Note as well that there is no guarantee of the order in which retire callbacks are invoked. A retire callback is guaranteed to be invoked only after all readers that were present when the callback was scheduled have exited. Otherwise, any ordering of callback invocation may occur.
fork()
may not be invoked in a multithreaded program where any thread other than the calling thread is in an RCU read region. Doing so will result in undefined behavior, and will likely lead to deadlock. If the forking thread is inside of an RCU read region, it must invoke exec()
before exiting the read region.
std::scoped_lock<folly::rcu_domain>
creation/destruction is on the order of ~5ns. By comparison, folly::SharedMutex::lock_shared()
followed by folly::SharedMutex::unlock_shared()
is ~26ns.
folly/SmallLocks.h
This module is currently x64 only.
This header defines two very small mutex types. These are useful in highly memory-constrained environments where contention is unlikely. The purpose of these is to allow fine-grained locking in massive data structures where memory is at a premium. Often, each record may have a spare bit or byte lying around, so sometimes these can be tacked on with no additional memory cost.
There are two types exported from this header. MicroSpinLock
is a single byte lock, and PicoSpinLock
can be wrapped around an integer to use a single bit as a lock. Why do we have both? Because you can’t use x64 bts
on a single byte, so sizeof(MicroSpinLock)
is smaller than sizeof(PicoSpinLock)
can be, giving it some use cases.
Both the locks in this header model the C++11 Lockable concept. So you can use std::lock_guard
or std::unique_lock
to lock them in an RAII way if you want.
Additional information is in the header.
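For illustration, a minimal sketch (the record type is my own example) of guarding a field with a MicroSpinLock via std::lock_guard:

#include <folly/SmallLocks.h>
#include <mutex>

struct Record {
  folly::MicroSpinLock lock_ = {}; // must be zero-initialized before use
  int value_ = 0;

  void bump() {
    std::lock_guard<folly::MicroSpinLock> g(lock_); // models Lockable
    ++value_;
  }
};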
folly/Synchronized.h
folly/Synchronized.h
introduces a simple abstraction for mutex-based concurrency. It replaces convoluted, unwieldy, and just plain wrong code with simple constructs that are easy to get right and difficult to get wrong.
Many of our multithreaded C++ programs use shared data structures associated with locks. This follows the time-honored adage of mutex-based concurrency control “associate mutexes with data, not code”. Consider the following example:
class RequestHandler {
  ...
  RequestQueue requestQueue_;
  SharedMutex requestQueueMutex_;

  std::map<std::string, Endpoint> requestEndpoints_;
  SharedMutex requestEndpointsMutex_;

  HandlerState workState_;
  SharedMutex workStateMutex_;
  ...
};
Whenever the code needs to read or write some of the protected data, it acquires the mutex for reading or for reading and writing. For example:
void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  std::unique_lock lock(requestQueueMutex_);
  requestQueue_.push_back(request);
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
However, the correctness of the technique is entirely predicated on convention. Developers manipulating these data members must take care to explicitly acquire the correct lock for the data they wish to access. There is no ostensible error for code that, for example, acquires the wrong lock for a piece of data, modifies data while holding a lock in read mode, or takes a lock in read-write mode for merely const access to the guarded data.

folly/Synchronized.h
The same code sample could be rewritten with Synchronized
as follows:
class RequestHandler {
  ...
  Synchronized<RequestQueue> requestQueue_;
  Synchronized<std::map<std::string, Endpoint>> requestEndpoints_;
  Synchronized<HandlerState> workState_;
  ...
};

void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  requestQueue_.wlock()->push_back(request);
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
The rewrite does at maximum efficiency what needs to be done: acquires the lock associated with the RequestQueue
object, writes to the queue, and releases the lock immediately thereafter.
On the face of it, that's not much to write home about, and not an obvious improvement over the previous state of affairs. But the features at work invisible in the code above are as important as those that are visible: if you tried to use requestQueue_ without acquiring the lock you wouldn't be able to; it is virtually impossible to access the queue without acquiring the correct lock.
provides several options for doing this.
The wlock()
method (or lock()
if you have a non-shared mutex type) returns a LockedPtr
object that can be stored in a variable. The lock will be held for as long as this object exists, similar to a std::unique_lock
. This object can be used as if it were a pointer to the underlying locked object:
{
  auto lockedQueue = requestQueue_.wlock();
  lockedQueue->push_back(request1);
  lockedQueue->push_back(request2);
}
The rlock()
function is similar to wlock()
, but acquires a shared lock rather than an exclusive lock.
We recommend explicitly opening a new nested scope whenever you store a LockedPtr
object, to help visibly delineate the critical section, and to ensure that the LockedPtr
is destroyed as soon as it is no longer needed.
Alternatively, Synchronized
also provides mechanisms to run a function while holding the lock. This makes it possible to use lambdas to define brief critical sections:
void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  requestQueue_.withWLock([&](auto& queue) {
    // withWLock() automatically holds the lock for the
    // duration of this lambda function
    queue.push_back(request);
  });
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
One advantage of the withWLock()
approach is that it forces a new scope to be used for the critical section, making the critical section more obvious in the code, and helping to encourage code that releases the lock as soon as possible.
Synchronized<T>
Synchronized
is a template with two parameters, the data type and a mutex type: Synchronized<T, Mutex>
.
If not specified, the mutex type defaults to folly::SharedMutex
. However, any mutex type with an interface compatible with the standard mutex types is supported. Additionally, any mutex type compatible with the extended protocol implemented in folly/synchronization/Lock.h
is supported.
Synchronized
provides slightly different APIs when instantiated with a shared mutex type or an upgrade mutex type than with a plain exclusive mutex. If instantiated with either of the two mutex types above (detected by the presence of a lock_shared() member), the Synchronized
object has corresponding wlock
, rlock
or ulock
methods to acquire different lock types. When using a shared or upgrade mutex type, these APIs ensure that callers make an explicit choice to acquire a shared, exclusive or upgrade lock and that callers do not unintentionally lock the mutex in the incorrect mode. The rlock()
APIs only provide const
access to the underlying data type, ensuring that it cannot be modified when only holding a shared lock.
The default constructor default-initializes the data and its associated mutex.
The copy constructor locks the source for reading and copies its data into the target. (The target is not locked as an object under construction is only accessed by one thread.)
Finally, Synchronized<T>
defines an explicit constructor that takes an object of type T
and copies it. For example:
// Default constructed
Synchronized<map<string, int>> syncMap1;

// Copy constructed
Synchronized<map<string, int>> syncMap2(syncMap1);

// Initializing from an existing map
map<string, int> init;
init["world"] = 42;
Synchronized<map<string, int>> syncMap3(init);
EXPECT_EQ(syncMap3->size(), 1);
The copy assignment operator copies the underlying source data into a temporary with the source mutex locked, and then moves the temporary into the destination data with the destination mutex locked. This technique avoids the need to lock both mutexes at the same time. Mutexes are not copied or moved.
The move assignment operator assumes the source object is a true rvalue and does not lock the source mutex. It moves the source data into the destination data with the destination mutex locked.
swap
acquires locks on both mutexes in increasing order of object address, and then swaps the underlying data. This avoids potential deadlock, which may otherwise happen should one thread do a = b
while another thread does b = a
.
The data copy assignment operator copies the parameter into the destination data while the destination mutex is locked.
The data move assignment operator moves the parameter into the destination data while the destination mutex is locked.
To get a copy of the guarded data, there are two methods available: void copy(T*)
and T copy()
. The first copies data to a provided target and the second returns a copy by value. Both operations are done under a read lock. Example:
Synchronized<vector<string>> syncVec1, syncVec2;
vector<string> vec;

// Assign
syncVec1 = syncVec2;
// Assign straight from vector
syncVec1 = vec;

// Swap
syncVec1.swap(syncVec2);
// Swap with vector
syncVec1.swap(vec);

// Copy to given target
syncVec1.copy(&vec);
// Get a copy by value
auto copy = syncVec1.copy();
lock()
If the mutex type used with Synchronized
is a simple exclusive mutex type (as opposed to a shared mutex), Synchronized<T>
provides a lock()
method that returns a LockedPtr<T>
to access the data while holding the lock.
The LockedPtr
object returned by lock()
holds the lock for as long as it exists. Whenever possible, prefer declaring a separate inner scope for storing this variable, to make sure the LockedPtr
is destroyed as soon as the lock is no longer needed:
void fun(Synchronized<vector<string>, std::mutex>& vec) {
  {
    auto locked = vec.lock();
    locked->push_back("hello");
    locked->push_back("world");
  }
  LOG(INFO) << "successfully added greeting";
}
wlock()
and rlock()
If the mutex type used with Synchronized
is a shared mutex type, Synchronized<T>
provides a wlock()
method that acquires an exclusive lock, and an rlock()
method that acquires a shared lock.
The LockedPtr
returned by rlock()
only provides const access to the internal data, to ensure that it cannot be modified while only holding a shared lock.
int computeSum(const Synchronized<vector<int>>& vec) {
  int sum = 0;
  auto locked = vec.rlock();
  for (int n : *locked) {
    sum += n;
  }
  return sum;
}

void doubleValues(Synchronized<vector<int>>& vec) {
  auto locked = vec.wlock();
  for (int& n : *locked) {
    n *= 2;
  }
}
This example brings us to a cautionary discussion. The LockedPtr
object returned by lock()
, wlock()
, or rlock()
only holds the lock as long as it exists. This object makes it difficult to access the data without holding the lock, but not impossible. In particular you should never store a raw pointer or reference to the internal data for longer than the lifetime of the LockedPtr
object.
For instance, if we had written the following code in the examples above, this would have continued accessing the vector after the lock had been released:
// No. NO. NO!
for (int& n : *vec.wlock()) {
  n *= 2;
}
The vec.wlock()
return value is destroyed in this case as soon as the internal range iterators are created. The range iterators point into the vector’s data, but the lock is released immediately, before the loop body executes.
Needless to say, this is a crime punishable by long debugging nights.
Range-based for loops are slightly subtle about the lifetime of objects used in the initializer statement. Most other problematic use cases are a bit easier to spot than this, since the lifetime of the LockedPtr
is more explicitly visible.
withLock()
As an alternative to the lock()
API, Synchronized
also provides a withLock()
method that executes a function or lambda expression while holding the lock. The function receives a reference to the data as its only argument.
This has a few benefits compared to lock():
The lambda expression requires its own nested scope, making the critical section clearly visible in the code. Callers can define a new scope when using lock() if they choose to, but this is not required. withLock() ensures that a new scope must always be defined.
Because a new scope is required, withLock() also helps encourage users to release the lock as soon as possible. Because the critical section scope is easily visible in the code, it is harder to accidentally put extraneous code inside the critical section without realizing it.
For example, withLock() makes the range-based for loop mistake from above much harder to accidentally run into:
vec.withLock([](auto& locked) {
  for (int& n : locked) {
    n *= 2;
  }
});
This code does not have the same problem as the counter-example with wlock()
above, since the lock is held for the duration of the loop.
When using Synchronized
with a shared mutex type, it provides separate withWLock()
and withRLock()
methods instead of withLock()
.
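As a brief sketch, assuming a Synchronized<vector<int>> named vec as in the earlier examples:

// Mutate under an exclusive lock.
vec.withWLock([](auto& locked) {
  locked.push_back(42);
});

// Read under a shared lock; the data is only accessible as const.
int first = vec.withRLock([](const auto& locked) {
  return locked.empty() ? 0 : locked.front();
});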
ulock()
and withULockPtr()
Synchronized
also supports upgrading and downgrading mutex lock levels as long as the mutex type used to instantiate the Synchronized
type has the same interface as the mutex types in the C++ standard library. See below for an intro to upgrade mutexes.
An upgrade lock can be acquired as usual either with the ulock()
method or the withULockPtr()
method, as follows:
{
// only const access allowed to the underlying object when an upgrade lock
// is acquired
auto ulock = vec.ulock();
auto newSize = ulock->size();
}
auto newSize = vec.withULockPtr([](auto ulock) {
// only const access allowed to the underlying object when an upgrade lock
// is acquired
return ulock->size();
});
An upgrade lock acquired via ulock()
or withULockPtr()
can be upgraded or downgraded by calling any of the following methods on the LockedPtr
proxy:
moveFromUpgradeToWrite()
moveFromWriteToUpgrade()
moveFromWriteToRead()
moveFromUpgradeToRead()
Calling these leaves the LockedPtr
object on which the method was called in an invalid null
state and returns another LockedPtr proxy holding the specified lock. The upgrade or downgrade is done atomically - the Synchronized
object is never in an unlocked state during the lock state transition. For example:
auto ulock = obj.ulock();
if (ulock->needsUpdate()) {
  auto wlock = ulock.moveFromUpgradeToWrite();
  // ulock is now null
  wlock->updateObj();
}
This “move” can also occur in the context of a withULockPtr()
(withWLockPtr()
or withRLockPtr()
work as well!) function, as follows:
auto newSize = obj.withULockPtr([](auto ulock) {
  if (ulock->needsUpdate()) {
    // release upgrade lock and acquire write lock atomically
    auto wlock = ulock.moveFromUpgradeToWrite();
    // ulock is now null
    wlock->updateObj();

    // release write lock and acquire read lock atomically
    auto rlock = wlock.moveFromWriteToRead();
    // wlock is now null
    return rlock->newSize();
  } else {
    // release upgrade lock and acquire read lock atomically
    auto rlock = ulock.moveFromUpgradeToRead();
    // ulock is now null
    return rlock->newSize();
  }
});
An upgrade mutex is a shared mutex with an extra state called upgrade
and an atomic state transition from upgrade
to unique
. The upgrade
state is more powerful than the shared
state but less powerful than the unique
state.
An upgrade lock permits only const access to shared state for doing reads. It does not permit mutable access to shared state for doing writes. Only a unique lock permits mutable access for doing writes.
An upgrade lock may be held concurrently with any number of shared locks on the same mutex. An upgrade lock is exclusive with other upgrade locks and unique locks on the same mutex - only one upgrade lock or unique lock may be held at a time.
The upgrade mutex solves the problem of doing a read of shared state and then optionally doing a write to shared state efficiently under contention. Consider this scenario with a shared mutex:
struct MyObject {
  bool isUpdateRequired() const;
  void doUpdate();
};

struct MyContainingObject {
  folly::Synchronized<MyObject> sync;

  void mightHappenConcurrently() {
    // first check
    if (!sync.rlock()->isUpdateRequired()) {
      return;
    }
    sync.withWLock([&](auto& state) {
      // second check
      if (!state.isUpdateRequired()) {
        return;
      }
      state.doUpdate();
    });
  }
};
Here, the second isUpdateRequired
check happens under a unique lock. This means that the second check cannot be done concurrently with other threads doing first isUpdateRequired
checks under the shared lock, even though the second check, like the first check, is read-only and requires only const access to the shared state.
This may even introduce unnecessary blocking under contention. Since the default mutex type, folly::SharedMutex
, has write priority, the unique lock protecting the second check may introduce unnecessary blocking to all the other threads that are attempting to acquire a shared lock to protect the first check. This problem is called reader starvation.
One solution is to use a shared mutex type with read priority, such as folly::SharedMutexReadPriority
. That can introduce less blocking under contention to the other threads attempting to acquire a shared lock to do the first check. However, that may backfire and cause threads which are attempting to acquire a unique lock (for the second check) to stall, waiting for a moment in time when there are no shared locks held on the mutex, a moment in time that may never even happen. This problem is called writer starvation.
Starvation is a tricky problem to solve in general. But we can partially side-step it in our case.
An alternative solution is to use an upgrade lock for the second check. Threads attempting to acquire an upgrade lock for the second check do not introduce unnecessary blocking to all other threads that are attempting to acquire a shared lock for the first check. Only after the second check passes, and the upgrade lock transitions atomically from an upgrade lock to a unique lock, does the unique lock introduce necessary blocking to the other threads attempting to acquire a shared lock. With this solution, unlike the solution without the upgrade lock, the second check may be done concurrently with all other first checks rather than blocking or being blocked by them.
The example would then look like:
struct MyObject {
  bool isUpdateRequired() const;
  void doUpdate();
};

struct MyContainingObject {
  folly::Synchronized<MyObject> sync;

  void mightHappenConcurrently() {
    // first check
    if (!sync.rlock()->isUpdateRequired()) {
      return;
    }
    sync.withULockPtr([&](auto ulock) {
      // second check
      if (!ulock->isUpdateRequired()) {
        return;
      }
      auto wlock = ulock.moveFromUpgradeToWrite();
      wlock->doUpdate();
    });
  }
};
Note: Some shared mutex implementations offer an atomic state transition from shared
to unique
and some upgrade mutex implementations offer an atomic state transition from shared
to upgrade
. These atomic state transitions are dangerous, however, and can deadlock when done concurrently on the same mutex. For example, if threads A and B both hold shared locks on a mutex and are both attempting to transition atomically from shared to upgrade locks, the threads are deadlocked. Likewise if they are both attempting to transition atomically from shared to unique locks, or one is attempting to transition atomically from shared to upgrade while the other is attempting to transition atomically from shared to unique. Therefore, Synchronized
’s LockedPtr
proxies do not expose either of these dangerous atomic state transitions even when the underlying mutex type supports them.
Timed locking
When Synchronized
is used with a mutex type that supports timed lock acquisition, lock()
, wlock()
, and rlock()
can all take an optional std::chrono::duration
argument. This argument specifies a timeout to use for acquiring the lock. If the lock is not acquired before the timeout expires, a null LockedPtr
object will be returned. Callers must explicitly check the return value before using it:
void fun(Synchronized<vector<string>>& vec) {
  {
    auto locked = vec.lock(10ms);
    if (!locked) {
      throw std::runtime_error("failed to acquire lock");
    }
    locked->push_back("hello");
    locked->push_back("world");
  }
  LOG(INFO) << "successfully added greeting";
}
unlock()
and scopedUnlock()
Synchronized
is a good mechanism for enforcing scoped synchronization, but it has the inherent limitation that it requires the critical section to be, well, scoped. Sometimes the code structure requires a fleeting “escape” from the iron fist of synchronization, while still inside the critical section scope.
One common pattern is releasing the lock early on error code paths, prior to logging an error message. The LockedPtr
class provides an unlock()
method that makes this possible:
Synchronized<map<int, string>> dic;
...
{
  auto locked = dic.rlock();
  auto iter = locked->find(0);
  if (iter == locked->end()) {
    locked.unlock(); // don't hold the lock while logging
    LOG(ERROR) << "key 0 not found";
    return false;
  }
  processValue(*iter);
}
LOG(INFO) << "succeeded";
For more complex nested control flow scenarios, scopedUnlock()
returns an object that will release the lock for as long as it exists, and will reacquire the lock when it goes out of scope.
Synchronized<map<int, string>> dic;
...
{
  auto locked = dic.wlock();
  auto iter = locked->find(0);
  if (iter == locked->end()) {
    {
      auto unlocker = locked.scopedUnlock();
      LOG(INFO) << "Key 0 not found, inserting it.";
    }
    locked->emplace(0, "zero");
  } else {
    iter->second = "zero";
  }
}
Clearly scopedUnlock()
comes with specific caveats and liabilities. You must assume that during the scopedUnlock()
section, other threads might have changed the protected structure in arbitrary ways. In the example above, you cannot use the iterator iter
and you cannot assume that the key 0
is not in the map; another thread might have inserted it while you were bragging on LOG(INFO)
.
Whenever a LockedPtr
object has been unlocked, whether with unlock()
or scopedUnlock()
, it will behave as if it is null. isNull()
will return true. Dereferencing an unlocked LockedPtr
is not allowed and will result in undefined behavior.
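A minimal sketch of this behavior, reusing the dic map from the examples above:

auto locked = dic.wlock();
locked.unlock();
assert(locked.isNull()); // true: the pointer now behaves as null
// Dereferencing locked here (e.g. locked->size()) would be undefined behavior.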
Synchronized
and std::condition_variable
When used with a std::mutex
, Synchronized
supports using a std::condition_variable
with its internal mutex. This allows a condition_variable
to be used to wait for a particular change to occur in the internal data.
The LockedPtr
returned by Synchronized<T, std::mutex>::lock()
has an as_lock()
method that returns a reference to a std::unique_lock<std::mutex>
, which can be given to the std::condition_variable
:
Synchronized<vector<string>, std::mutex> vec;
std::condition_variable emptySignal;

// Assuming some other thread will put data on vec and signal
// emptySignal, we can then wait on it as follows:
auto locked = vec.lock();
emptySignal.wait(locked.as_lock(),
                 [&] { return !locked->empty(); });
acquireLocked()
Sometimes locking just one object won’t be able to cut the mustard. Consider a function that needs to lock two Synchronized
objects at the same time - for example, to copy some data from one to the other. At first sight, it looks like sequential wlock()
calls will work just fine:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto lockedA = a.wlock();
auto lockedB = b.wlock();
... use lockedA and lockedB ...
}
This code compiles and may even run most of the time, but embeds a deadly peril: if one thread calls fun(x, y)
and another thread calls fun(y, x)
, then the two threads are liable to deadlock, as each thread will be waiting for a lock the other is holding. This is a classic problem, and it applies regardless of the fact that the objects involved have the same type.
This classic problem has a classic solution: all threads must acquire locks in the same order. The actual order is not important, just the fact that the order is the same in all threads. Many libraries simply acquire mutexes in increasing order of their address, which is what we’ll do, too. The acquireLocked()
function takes care of all details of proper locking of two objects and offering their innards. It returns a std::tuple
of LockedPtr
s:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto ret = folly::acquireLocked(a, b);
auto& lockedA = std::get<0>(ret);
auto& lockedB = std::get<1>(ret);
... use lockedA and lockedB ...
}
Note that C++17 introduces structured binding syntax, which makes the returned tuple more convenient to use:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto [lockedA, lockedB] = folly::acquireLocked(a, b);
... use lockedA and lockedB ...
}
An acquireLockedPair()
function is also available, which returns a std::pair
instead of a std::tuple
. This is more convenient to use in many situations, until compiler support for structured bindings is more widely available.
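For example, a sketch of the acquireLockedPair() variant (the function fun and its body are illustrative):

void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
  auto ret = folly::acquireLockedPair(a, b);
  ret.first->push_back(1);  // locked view of a
  ret.second->push_back(2); // locked view of b
}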
Synchronizing several data items with one mutex
The library is geared at protecting one object of a given type with a mutex. However, sometimes we’d like to protect two or more members with the same mutex. Consider for example a bidirectional map, i.e. a map that holds an int
to string
mapping and also the converse string
to int
mapping. The two maps would need to be manipulated simultaneously. There are at least two designs that come to mind.
struct
You can easily pack the needed data items in a little struct. For example:
class Server {
  struct BiMap {
    map<int, string> direct;
    map<string, int> inverse;
  };
  Synchronized<BiMap> bimap_;
  ...
};
...
bimap_.withLock([](auto& locked) {
  locked.direct[0] = "zero";
  locked.inverse["zero"] = 0;
});
With this code in tow you get to use bimap_
just like any other Synchronized
object, without much effort.
std::tuple
If you won’t stop short of using a spaceship-era approach, std::tuple
is there for you. The example above could be rewritten for the same functionality like this:
class Server {
  Synchronized<tuple<map<int, string>, map<string, int>>> bimap_;
  ...
};
...
bimap_.withLock([](auto& locked) {
  get<0>(locked)[0] = "zero";
  get<1>(locked)["zero"] = 0;
});
The code uses std::get
with compile-time integers to access the fields in the tuple. The relative advantages and disadvantages of using a local struct vs. std::tuple
are quite obvious - in the first case you need to invest in the definition, in the second case you need to put up with slightly more verbose and less clear access syntax.
Synchronized
and its supporting tools offer you a simple, robust paradigm for mutual exclusion-based concurrency. Instead of manually pairing data with the mutexes that protect it and relying on convention to use them appropriately, you can benefit from encapsulation and typechecking to offload a large part of that task and to provide good guarantees.
folly/ThreadCachedInt.h
High-performance atomic increment using thread caching.
folly/ThreadCachedInt.h
introduces an integer class designed for high-performance increments from multiple threads simultaneously without loss of precision. It has two read modes: readFast
gives a potentially stale value with one load, and readFull
gives the exact value, but is much slower, as discussed below.
Increment performance is up to 10x greater than std::atomic_fetch_add
in high contention environments. See folly/test/ThreadCachedIntTest.h
for more comprehensive benchmarks.
readFast
is as fast as a single load.
readFull
, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so is significantly slower than readFast
.
Create an instance and increment it with increment
or the operator overloads. Read the value with readFast
for quick, potentially stale data, or readFull
for a more expensive but precise result. There are additional convenience functions as well, such as set
.
ThreadCachedInt<int64_t> val;
EXPECT_EQ(0, val.readFast());
++val; // increment in thread local counter only
EXPECT_EQ(0, val.readFast()); // increment has not been flushed
EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
val.set(2);
EXPECT_EQ(2, val.readFast());
EXPECT_EQ(2, val.readFull());
folly::ThreadCachedInt
uses folly::ThreadLocal
to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast
via a simple load, but will not count any of the updates that haven’t been flushed.
In order to read the exact value, ThreadCachedInt
uses the extended readAllThreads()
API of folly::ThreadLocal
to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt
. If you plan on using a lot of ThreadCachedInt
s in your application, consider breaking up the global mutex by introducing additional Tag
template parameters.
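A sketch of such sharding; the tag types here are hypothetical names chosen for illustration:

// Each Tag gets its own global mutex and thread local list.
struct NetworkTag {};
struct StorageTag {};

folly::ThreadCachedInt<int64_t, NetworkTag> networkBytes;
folly::ThreadCachedInt<int64_t, StorageTag> storageBytes;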
set
simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull
, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.
Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn’t destroyed in another thread already.
There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp
that provides similar performance. ShardedAtomicInt
simply uses an array of std::atomic<int64_t>
’s and hashes threads across them to do low-contention atomic increments, and readFull
just sums up all the ints.
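The approach can be sketched roughly as follows. This is an illustration of the idea, not folly's actual ShardedAtomicInt; the class name, shard count, and hashing scheme are assumptions:

#include <array>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

class ShardedCounterSketch {
  static constexpr size_t kShards = 2048;
  std::array<std::atomic<int64_t>, kShards> shards_{};

  // Hash the thread id onto a shard to keep contention low.
  static size_t myShard() {
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
  }

 public:
  void increment(int64_t amount) {
    shards_[myShard()].fetch_add(amount, std::memory_order_relaxed);
  }

  // Lock-free, but must walk every shard.
  int64_t readFull() const {
    int64_t sum = 0;
    for (const auto& shard : shards_) {
      sum += shard.load(std::memory_order_relaxed);
    }
    return sum;
  }
};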
This sounds great, but in order to get the contention low enough to get similar performance as ThreadCachedInt with 24 threads, ShardedAtomicInt
needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull
has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt
in low contention situations, which is hopefully the common case since it’s designed for high-write, low-read access patterns. Performance of readFull
is about the same speed as ThreadCachedInt
in high contention environments.
Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt
with a much smaller array without hurting performance, while improving memory consumption and perf of readFull
.
folly/ThreadLocal.h
Improved thread local storage for non-trivial types.
Faster than boost::thread_specific_ptr.
Similar speed to using pthread_getspecific directly, but only consumes a single pthread_key_t per Tag template param.
Expands the thread_specific_ptr API with accessAllThreads and extended custom deleter support.
The API of ThreadLocalPtr
is very close to boost::thread_specific_ptr
with the notable addition of the accessAllThreads
method. There is also a ThreadLocal
class which is a thin wrapper around ThreadLocalPtr
that manages allocation automatically (creates a new object the first time it is dereferenced from each thread).
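For example, a minimal sketch of the ThreadLocal wrapper (the variable name is illustrative):

folly::ThreadLocal<int> perThreadCount;
// The first dereference on each thread default-constructs that thread's int.
*perThreadCount += 1;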
ThreadLocalPtr
simply gives you a place to put and access a pointer local to each thread such that it will be destroyed appropriately.
{
  folly::ThreadLocalPtr<Widget> w;
  w.reset(new Widget(0), Widget::customDeleterA);
  std::thread([&w]() {
    w.reset(new Widget(1), Widget::customDeleterB);
    w.get()->mangleWidget();
  } // Widget(1) is destroyed with customDeleterB
  ).join();
} // Widget(0) is destroyed with customDeleterA
Note that customDeleterB
will get called with TLPDestructionMode::THIS_THREAD
and customDeleterA
will get called with TLPDestructionMode::ALL_THREADS
. This is to distinguish between thread exit vs. the entire ThreadLocalPtr
getting destroyed, in which case there is cleanup work that may be avoided.
The accessAllThreads
interface is provided to walk all the thread local child objects of a parent. accessAllThreads
initializes an accessor which holds a global lock that blocks all creation and destruction of ThreadLocal
objects with the same Tag
and can be used as an iterable container. Typical use is for frequent write, infrequent read data access patterns such as counters. Note that you must specify a unique Tag type so you don't block other ThreadLocal object usage, and you should try to minimize the lifetime of the accessor so the lock is held for as short a time as possible.
The following example is a simplification of folly/ThreadCachedInt.h
. It keeps track of a counter value and allows multiple threads to add to the count without synchronization. In order to get the total count, read()
iterates through all the thread local values via accessAllThreads()
and sums them up. class NewTag
is used to break the global mutex so that this class won’t block other ThreadLocal
usage when read()
is called.
Note that read()
holds the global mutex which blocks construction, destruction, and read()
for other SimpleThreadCachedInt
’s, but does not block add()
. Also, since it uses the unique NewTag
, SimpleThreadCachedInt
does not affect other ThreadLocal
usage.
class SimpleThreadCachedInt {
  class NewTag; // Segments the global mutex
  ThreadLocal<int, NewTag> val_;

 public:
  void add(int val) {
    *val_ += val; // operator*() gives a reference to the thread local instance
  }

  int read() {
    int ret = 0;
    // accessAllThreads acquires the global lock
    for (const auto& i : val_.accessAllThreads()) {
      ret += i;
    } // Global lock is released on scope exit
    return ret;
  }
};
We keep a __thread
array of pointers to objects (ThreadEntry::elements
) where each array has an index for each unique instance of the ThreadLocalPtr
object. Each ThreadLocalPtr
object has a unique id that is an index into these arrays so we can fetch the correct object from thread local storage very efficiently.
In order to prevent unbounded growth of the id space and thus huge ThreadEntry::elements
arrays, for example due to continuous creation and destruction of ThreadLocalPtr
objects, we keep track of all active instances by linking them together into a list. When an instance is destroyed we remove it from the chain and insert the id into freeIds_
for reuse. These operations require a global mutex, but only happen at construction and destruction time. accessAllThreads
also acquires this global mutex.
We use a single global pthread_key_t
per Tag
to manage object destruction and memory cleanup upon thread exit because there is a finite number of pthread_key_t
’s available per machine.
folly/Traits.h
Implements traits complementary to those provided in <type_traits>: the IsRelocatable trait and the IsOneOf trait.
<type_traits> is the Standard type-traits library defining a variety of traits such as is_integral or is_floating_point. This helps to gain more information about a given type.
folly/Traits.h
implements traits complementing those present in the Standard.
In C++, the default way to move an object is by calling the copy constructor and destroying the old copy instead of directly copying the memory contents by using memcpy(). The conservative approach of moving an object assumes that the copied object is not relocatable. The two following code sequences should be semantically equivalent for a relocatable type:
template <class T>
void conservativeMove(T* from, T* to) {
  new (to) T(*from);
  (*from).~T();
}

template <class T>
void optimizedMove(T* from, T* to) {
  memcpy(to, from, sizeof(T));
}
Very few C++ types are non-relocatable. The type defined below maintains a pointer inside an embedded buffer and hence would be non-relocatable. Moving the object by simply copying its memory contents would leave the internal pointer pointing to the old buffer.
class NonRelocatableType {
 private:
  char buffer[1024];
  char* pointerToBuffer;
  ...
 public:
  NonRelocatableType() : pointerToBuffer(buffer) {}
  ...
};
We can optimize the task of moving a relocatable type T using memcpy; IsRelocatable<T>::value describes whether a value of type T can be moved around in memory using memcpy.
Declaring types
template <class T1, class T2>
class MyParameterizedType;
class MySimpleType;
Declaring a type as relocatable
Appending the lines below after the definition of My*Type (MyParameterizedType
or MySimpleType
) will declare it as relocatable
/* Definition of My*Type goes here */
// global namespace (not inside any namespace)
namespace folly {
// defining specialization of IsRelocatable for MySimpleType
template <>
struct IsRelocatable<MySimpleType> : std::true_type {};
// defining specialization of IsRelocatable for MyParameterizedType
template <class T1, class T2>
struct IsRelocatable<MyParameterizedType<T1, T2>>
: ::std::true_type {};
}
To make it easy to state assumptions for a regular type or a family of parameterized types, various macros can be used as shown below.
Stating that a type is Relocatable using a macro
// global namespace
namespace folly {
// For a Regular Type
FOLLY_ASSUME_RELOCATABLE(MySimpleType);
// For a Parameterized Type
FOLLY_ASSUME_RELOCATABLE(MyParameterizedType<T1, T2>);
}
fbvector
only works with relocatable objects. If assumptions are not stated explicitly, fbvector<MySimpleType>
or fbvector<MyParameterizedType>
will fail to compile due to the assertion below:
static_assert(IsRelocatable<My*Type>::value, "");
FOLLY_ASSUME_FBVECTOR_COMPATIBLE*(type) macros can be used to state that a type is relocatable and has a nothrow constructor.
Stating that a type is fbvector-compatible using macros, i.e. relocatable with a nothrow default constructor
// at global level, i.e. no namespace
// macro for regular type
FOLLY_ASSUME_FBVECTOR_COMPATIBLE(MySimpleType)
// macro for types having 2 template parameters (MyParameterizedType)
FOLLY_ASSUME_FBVECTOR_COMPATIBLE_2(MyParameterizedType)
Similarly,
FOLLY_ASSUME_FBVECTOR_COMPATIBLE_1(MyTypeHavingOneParameter) is the macro for families of parameterized types having 1 parameter,
FOLLY_ASSUME_FBVECTOR_COMPATIBLE_3(MyTypeHavingThreeParameters) is the macro for families of parameterized types having 3 parameters, and
FOLLY_ASSUME_FBVECTOR_COMPATIBLE_4(MyTypeHavingFourParameters) is the macro for families of parameterized types having 4 parameters.
A few common types, namely std::basic_string
, std::vector
, std::list
, std::map
, std::deque
, std::set
, std::unique_ptr
, std::shared_ptr
, std::function
, are already declared compatible with fbvector
. fbvector
can be used directly with any of these C++ types.
std::pair
can be safely assumed to be compatible with fbvector
if both of its components are.
std::is_same<T1, T2>::value
can be used to test whether T1 and T2 are the same type. folly::IsOneOf<T, T1, Ts...>::value
can be used to test whether the type T matches the type of one of the other template parameters, T1, T2, …Tn. Recursion is used to implement this type trait.
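A small sketch of both traits in use (the assertion messages are illustrative):

#include <folly/Traits.h>
#include <type_traits>

static_assert(std::is_same<int, int>::value, "int is the same type as int");
static_assert(folly::IsOneOf<int, char, int, double>::value,
              "int matches one of {char, int, double}");
static_assert(!folly::IsOneOf<float, char, int, double>::value,
              "float matches none of {char, int, double}");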
folly/small_vector.h
folly::small_vector<T,Int=1,...>
is a sequence container that implements small buffer optimization. It behaves similarly to std::vector, except that it does not use the heap until it grows past a fixed number of in-place elements.
Like standard vector, it is guaranteed to use contiguous memory. (So, after it spills to the heap all the elements live in the heap buffer.)
Simple usage example:
small_vector<int, 2> vec;
vec.push_back(0); // Stored in-place on stack
vec.push_back(1); // Still on the stack
vec.push_back(2); // Switches to heap buffer
This class is useful in any of the following cases:
Short-lived stack vectors with few elements (or maybe with a usually-known number of elements), if you want to avoid malloc.
If the vector(s) are usually under a known size and lookups are very common, you’ll save an extra cache miss in the common case when the data is kept in-place.
You have billions of these vectors and don’t want to waste space on std::vector
’s capacity tracking. This vector lets malloc
track our allocation capacity. (Note that this slows down the insertion/reallocation code paths significantly; if you need those to be fast you should use fbvector
.)
The last two cases were the main motivation for implementing it.
There are also a couple of flags you can pass into this class template to customize its behavior. You can provide them in any order after the in-place count. They are all in the namespace small_vector_policy
.
NoHeap
- Avoid the heap entirely. (Throws std::length_error
if you would’ve spilled out of the in-place allocation.)
<Any integral type>
- customizes the amount of space we spend on tracking the size of the vector.
A couple more examples:
// With space for 32 in situ unique pointers, and only using a
// 4-byte size_type.
small_vector<std::unique_ptr<int>, 32, uint32_t> v;

// An inline vector of up to 256 ints which will not use the heap.
small_vector<int, 256, NoHeap> v;

// Same as the above, but making the size_type smaller too.
small_vector<int, 256, NoHeap, uint16_t> v;