Bit packing like a mad man

Amaury SECHET
@deadalnix

Memory is slow
• About 300 cycles to hit memory
• Bandwidth is still increasing
• Latency is only marginally improving

Memory is slow - Caching
• Add faster memory on the CPU
• Various sizes and speeds
– Signal needs time to travel
– L1: 3-4 cycles, 32 KB (separate instruction and data caches)
– L2: 8-14 cycles, 256 KB
– L3: tens of cycles, a few MB, often shared
– Cache line: 64 bytes

The king is throwing a party
He has 1000 bottles in his cellar

An evil man poisoned a bottle with his secret recipe with 11 herbs and spices !
• The poison will kill anyone, even in small doses.
• It takes several hours for someone to die from poisoning.
• The King has 1000 servants and 20 prisoners.
• He would like to avoid killing servants if possible, but killing prisoners is fine.
• What should the king do ?

The answer
• The king can use 10 prisoners (2^10 = 1024 ≥ 1000 bottles).
• Number each bottle in binary
• Each prisoner will drink from multiple bottles
– Prisoner n drinks from every bottle whose nth binary digit is 1
• The set of prisoners who die spells out the poisoned bottle's number in binary.
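The encoding above can be checked with a short simulation. This is only a sketch: the function name `deadPrisoners` and the 0-based bottle numbering are assumptions made for the example, not part of the talk.

```d
enum prisoners = 10; // 2^10 = 1024 covers the 1000 bottles

/// Returns a bitmask of the prisoners who die when bottle `poisoned`
/// is the bad one. Prisoner n drinks from every bottle whose nth bit is 1.
uint deadPrisoners(uint poisoned, uint bottles = 1000) {
    uint dead = 0;
    foreach (n; 0 .. prisoners) {
        foreach (bottle; 0 .. bottles) {
            immutable drinks = ((bottle >> n) & 1) != 0;
            if (drinks && bottle == poisoned) {
                dead |= 1u << n;
            }
        }
    }
    return dead;
}

unittest {
    // Whatever bottle is poisoned, the dead prisoners spell its number.
    foreach (poisoned; 0 .. 1000) {
        assert(deadPrisoners(poisoned) == poisoned);
    }
}
```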

The king’s party was a real success !

Bit packing
• Reduce memory waste
• Increase cache utilization
• Minimal CPU cost
• Not a replacement for better algorithms
– Instantiating fewer objects saves a lot of memory !

Alignment
• Ensure that loads/stores do not
– Cross cache lines
– Cross page boundaries
• Unaligned access: severe penalties
– Bad performance on some CPUs, loss of atomicity
• Hardware is doing 2 accesses
– Hard error on others (SIGBUS or similar)
• Defined by ABI

Alignment – Rule of thumb
• Integral types smaller than size_t
– T.sizeof
• Integral types bigger than size_t
– size_t.sizeof
– Compiler will decompose memory accesses
• Structs
– Max(alignment of each field)
– Add padding to respect alignment
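The rule of thumb can be written down as compile-time checks. The struct names below are made up for illustration, and the values assume a typical x86 / x86-64 ABI; other targets may differ.

```d
struct Small { ubyte a; ubyte b; }  // only byte-aligned fields
struct Mixed { ubyte a; uint b; }   // alignment comes from the uint
struct Wide  { ubyte a; void* p; }  // alignment comes from the pointer

// Small integral types are aligned on their own size.
static assert(ubyte.alignof == 1);
static assert(ushort.alignof == 2);
static assert(uint.alignof == 4);

// A struct takes the largest alignment among its fields,
// and padding is added so every field respects its own alignment.
static assert(Small.alignof == 1 && Small.sizeof == 2);
static assert(Mixed.alignof == 4 && Mixed.sizeof == 8);
static assert(Mixed.b.offsetof == 4); // 3 bytes of padding after `a`
static assert(Wide.alignof == (void*).alignof);
static assert(Wide.sizeof == 2 * (void*).sizeof);
```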

Struct padding
struct S {
    bool f1;
    uint f2;
    bool f3;
}
Layout: f1 | pad (3 bytes) | f2 | f3 | pad (3 bytes)
12 bytes, 6 wasted

Struct padding
struct S {
    uint f2;
    bool f1;
    bool f3;
}
Layout: f2 | f1 | f3 | pad (2 bytes)
8 bytes, 2 wasted

Padding tips
• Start with fields with high alignment
• Know where pads are
• Enforce assumptions using static assert
– alignof
– sizeof
• Classes, like structs, but
– Implicit fields
• Vtable
• Monitor
– At least pointer size alignment

Information density
• How much actual information ?
• Bool
– 1 bit of information
– 8 bits of storage
• Object
– 45 bits of information
– 64 bits of storage
• Dump memory and zip it
– Aim for that size

Bit packing
• Trade memory consumption for CPU
– Usually a good deal
• Use one integral as storage
– Store several elements in that integral
– Use bitwise operations to manipulate elements
• std.bitmanip can help

Struct packing
import std.bitmanip;
struct S {
    mixin(bitfields!(
        uint, "f1", 30,
        bool, "f2", 1,
        bool, "f3", 1,
    ));
}
Layout: f1 (30 bits) | f2 (1 bit) | f3 (1 bit)
4 bytes, 0 wasted
• f1 is now 30 bits instead of 32 bits
– Its maximum is now about 1 billion (2^30 - 1)
• Fields aren’t atomic anymore
• bitfields does all the magic
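As a quick check of the packed struct, the sketch below uses `std.bitmanip.bitfields` and the `f1_max` constant it generates. The test values are arbitrary.

```d
import std.bitmanip : bitfields;

struct S {
    mixin(bitfields!(
        uint, "f1", 30,
        bool, "f2", 1,
        bool, "f3", 1));
}

// The three fields share a single 32-bit integral.
static assert(S.sizeof == 4);
// bitfields generates min/max constants for integral fields.
static assert(S.f1_max == (1u << 30) - 1);

unittest {
    S s;
    s.f1 = 123_456;
    s.f2 = true;
    assert(s.f1 == 123_456 && s.f2 && !s.f3);

    // Writing one field leaves its neighbours untouched.
    s.f3 = true;
    assert(s.f1 == 123_456 && s.f2 && s.f3);
}
```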

Bit packing integrals
Data: bits 32 .. N + S (other fields) | entry (bits N + S .. N) | bits N .. 0 (other fields)
enum ReadMask = (1 << S) - 1;
enum WriteMask = ReadMask << N;
@property uint entry() {
    return (data >> N) & ReadMask;
}
@property void entry(uint val) in {
    // The parentheses matter: == binds tighter than &.
    assert((val & ReadMask) == val);
} do {
    data = (data & ~WriteMask) | ((val << N) & WriteMask);
}

Bit packing bools
Data: bits 32 .. N + 1 (other fields) | entry (bit N) | bits N .. 0 (other fields)
enum Mask = 1 << N;
@property bool entry() {
    return (data & Mask) != 0;
}
@property void entry(bool val) {
    if (val) {
        data = data | Mask;
    } else {
        data = data & ~Mask;
    }
}
Note: data ^ Mask will flip the bit.
It is sometimes faster than setting it.
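Putting the two accessors together gives something like the sketch below. The `Packed` struct, its field names and the chosen offsets (flag in bit 0, a 12-bit counter above it) are hypothetical, picked only to make the example concrete.

```d
struct Packed {
    uint data;

    enum FlagMask = 1u << 0;        // the bool lives in bit 0
    enum N = 1;                     // offset of the counter
    enum S = 12;                    // width of the counter in bits
    enum ReadMask = (1u << S) - 1;
    enum WriteMask = ReadMask << N;

    @property bool flag() const {
        return (data & FlagMask) != 0;
    }

    // XOR flips the bit without looking at its current value.
    void toggleFlag() {
        data ^= FlagMask;
    }

    @property uint counter() const {
        return (data >> N) & ReadMask;
    }

    @property void counter(uint val) {
        assert((val & ReadMask) == val, "value does not fit in S bits");
        data = (data & ~WriteMask) | ((val << N) & WriteMask);
    }
}

unittest {
    Packed p;
    p.counter = 0xABC;
    p.toggleFlag();
    assert(p.flag && p.counter == 0xABC);

    p.toggleFlag();
    assert(!p.flag && p.counter == 0xABC);
    assert(p.data == 0xABC << 1);
}
```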

Bitfield layout
• 2 special spots
– Rightmost: mask only
– Leftmost: shift only
• Large elements require large masks
– Put them leftmost
• Bools always use masks
– Can be checked in the leftmost spot with a signed < 0 comparison
– Don’t put them in special spots unless very hot

Bitfield layout
• We want :
– One flag
– One 2-bit enum E
– A 29-bit integral
• What is the best layout ?

Bitfield layout
enum E { E0, E1, E2, E3 }
struct S {
    mixin(bitfields!(
        E, "e", 2,
        bool, "flag", 1,
        uint, "integral", 29,
    ));
}
Codegen:
e = cast(E) (data & 0x03);
flag = (data & 0x04) != 0;
integral = data >> 3;
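The layout behind that codegen can be observed by reading the raw storage. This sketch relies on `std.bitmanip.bitfields` filling bits in declaration order starting from the least significant bit; the `raw` helper is an assumption of the example and simply reinterprets the 4-byte struct as a `uint`.

```d
import std.bitmanip : bitfields;

enum E { E0, E1, E2, E3 }

struct S {
    mixin(bitfields!(
        E, "e", 2,
        bool, "flag", 1,
        uint, "integral", 29));
}
static assert(S.sizeof == uint.sizeof);

// Hypothetical helper: view the packed storage as one integral.
uint raw(ref const S s) {
    return *cast(const(uint)*) &s;
}

unittest {
    S s;
    s.e = E.E3;
    s.flag = true;
    s.integral = 5;

    // e sits in bits 0-1 (mask only), flag in bit 2,
    // integral in bits 3-31 (shift only).
    assert(raw(s) == ((5 << 3) | 0x04 | 0x03));
    assert((raw(s) >> 3) == s.integral);
}
```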

Unused bits
• Sometimes the whole bitfield is not needed
– Create a nameless field
• uint, "", 29
– Make it usable by outer structs/subclasses
• uint, "_derived", 29
• Ideally make it private/protected
• Or use it in private struct elements
• Need to implement the remaining fields manually
• Feature request: bitfields with explicit storage

Unused bits - example
class Symbol : Node {
    Name name;
    Name mangle;
    mixin(bitfields!(
        Step, "step", 2,
        Linkage, "linkage", 3,
        Visibility, "visibility", 3,
        InTemplate, "inTemplate", 1,
        bool, "hasThis", 1,
        bool, "hasContext", 1,
        bool, "isPoisoned", 1,
        bool, "isAbstract", 1,
        bool, "isProperty", 1,
        uint, "derived", 18,
    ));
}
class Field : Symbol {
    // ...
    this(..., uint index, ... ) {
        // ...
        this.derived = index;
        // Always true for fields.
        this.hasThis = true;
    }
    @property index() const {
        // Only 262 143 fields possible !
        return derived;
    }
}

Tagging pointers - @trusted
• Least significant bits are known to be 0
– How many depends on alignment
– log2(T.alignof)
– At least 3 bits on Objects (2 on 32-bit systems)
• Once again, std.bitmanip can help
– taggedPointer/taggedClassRef
– Checks alignment constraints at compile time
– Misaligned pointers are not safe
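To make the idea concrete, here is a hand-rolled sketch of what tagging the low bits involves. `TaggedPtr` is a hypothetical type written for this example; it is not how `std.bitmanip` implements it.

```d
struct TaggedPtr(T) {
    // The low log2(T.alignof) bits of an aligned pointer are zero.
    enum size_t TagMask = T.alignof - 1;

    // Kept as a pointer so the GC still sees it (as an interior pointer).
    private void* raw;

    this(T* p, size_t tag) @trusted {
        assert((cast(size_t) p & TagMask) == 0, "misaligned pointer");
        assert(tag <= TagMask, "tag does not fit in the free bits");
        raw = cast(void*) (cast(size_t) p | tag);
    }

    @property T* ptr() @trusted {
        return cast(T*) (cast(size_t) raw & ~TagMask);
    }

    @property size_t tag() const @trusted {
        return cast(size_t) raw & TagMask;
    }
}

unittest {
    auto value = new uint;
    *value = 42;

    // uint.alignof == 4 leaves 2 free bits, so tags 0 to 3 fit.
    auto t = TaggedPtr!uint(value, 3);
    assert(t.ptr is value);
    assert(*t.ptr == 42);
    assert(t.tag == 3);
}
```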

Tagging pointers - @trusted
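enum Color { Black, Red }
struct Link(T) {
    mixin(taggedPointer!(
        T*, "child",
        Color, "color", 1,
    ));
}
struct Node(T) {
    Link!T left;
    Link!T right;
}
[Diagram: the actual pointer points at the start of the pointed object; the stored "child" value, with its tag bits set, points slightly inside it.]
• The actual pointer points at the object
• The tagged pointer points within the object
• The GC knows about interior pointers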
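A short usage sketch of `std.bitmanip.taggedPointer` with the `Link` struct from the slide. The `int` payload and the values are arbitrary choices for the example.

```d
import std.bitmanip : taggedPointer;

enum Color { Black, Red }

struct Link(T) {
    mixin(taggedPointer!(
        T*, "child",
        Color, "color", 1));
}

// The color rides along in the pointer: no extra storage.
static assert((Link!int).sizeof == (int*).sizeof);

unittest {
    auto target = new int;
    *target = 7;

    Link!int link;
    link.child = target;
    link.color = Color.Red;

    // Both values are recovered independently.
    assert(link.child is target && *link.child == 7);
    assert(link.color == Color.Red);

    // Changing the tag does not disturb the pointer.
    link.color = Color.Black;
    assert(link.child is target);
}
```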

Tagging pointers - @system
• Allocate in the lower 32 bits of the address space
– Truncate pointers to 32 bits
– Limited to 4 GB
– jemalloc can do that for you
– Used by HHVM for codegen
• On x86-64, the most significant 16 bits of user-space pointers are zeros
– Hijack them !
– Confuse the GC !
– Try not to SEGFAULT

Intermission – Germany loves D !
They even put stickers on their cars !

Let’s use a context
• Useful for cold but often reused data
• For instance, identifiers in a compiler
– Usually don’t care about the actual value
• The context stores identifiers and provides a unique id
– 32 bits vs 128 bits
– Equality can be tested with an int compare
– Can be its own hash for hashtable lookups
• Make the GC happy
– Fewer pointers
– More noscan !

struct Name {
private:
    uint id;
    this(uint id) {
        this.id = id;
    }
public:
    string toString(const Context c) const {
        return c.names[id];
    }
    immutable(char)* toStringz(const Context c) const {
        auto s = toString(c);
        // Context.getName stores every name with a trailing '\0'.
        assert(s.ptr[s.length] == '\0', "Expected a zero terminated string");
        return s.ptr;
    }
}

class Context {
private:
    string[] names;
    uint[string] lookups;
public:
    auto getName(const(char)[] str) {
        if (auto id = str in lookups) {
            return Name(*id);
        }
        // As we are cloning, make sure it is 0 terminated so it can be passed to C.
        import std.string;
        auto s = str.toStringz()[0 .. str.length];
        auto id = lookups[s] = cast(uint) names.length;
        names ~= s;
        return Name(id);
    }
}
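A stripped-down, self-contained version of the same idea, to show the payoff. `Interner` and `lookup` are names invented for this sketch, and it skips the zero-termination trick used above.

```d
struct Name {
    uint id;
}

final class Interner {
    private string[] names;
    private uint[string] lookups;

    /// Returns the unique id for `str`, registering it on first use.
    Name getName(const(char)[] str) {
        if (auto id = str in lookups) {
            return Name(*id);
        }
        auto s = str.idup; // own an immutable copy
        auto id = cast(uint) names.length;
        lookups[s] = id;
        names ~= s;
        return Name(id);
    }

    /// Maps an id back to its text, for the rare cases that need it.
    string lookup(Name n) const {
        return names[n.id];
    }
}

// 32 bits for a Name versus a pointer plus a length for a string.
static assert(Name.sizeof == 4);
static assert(string.sizeof == 2 * size_t.sizeof);

unittest {
    auto c = new Interner;
    auto foo = c.getName("foo");
    auto bar = c.getName("bar");

    // Equality of identifiers is now an integer compare.
    assert(foo == c.getName("foo"));
    assert(foo != bar);
    assert(c.lookup(bar) == "bar");
}
```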

Context prefill
• Useful to pin some ids at compile time
• Can be used without lookup in the context
• Generated identifiers
• object.d
• Linkage/Version/Scope/Attribute

Context prefill
enum Reserved = [
"__ctor", "__dtor", "__postblit", "__vtbl",
];
enum Prefill = [
// Linkages
"C", "D", "C++", "Windows", "System",
// Generated
"init", "length", "max", "min",
"ptr", "sizeof", "alignof",
// Scope
"exit", "success", "failure",
// Defined in object
"object", "size_t", "ptrdiff_t", "string",
"Object",
"TypeInfo", "ClassInfo",
"Throwable", "Exception", "Error",
// Attribute
"property", "safe", "trusted", "system", "nogc",
// ...
];
auto getNames() {
import d.lexer;
auto identifiers = [""];
foreach(k, _; getOperatorsMap()) {
identifiers ~= k;
}
foreach(k, _; getKeywordsMap()) {
identifiers ~= k;
}
return identifiers ~ Reserved ~ Prefill;
}
enum Names = getNames();

Context prefill
auto getLookups() {
    uint[string] lookups;
    foreach(uint i, id; Names) {
        lookups[id] = i;
    }
    return lookups;
}
enum Lookups = getLookups();
template BuiltinName(string name) {
    private enum id = Lookups.get(name, uint.max);
    static assert(
        id < uint.max,
        name ~ " is not a builtin name.",
    );
    enum BuiltinName = Name(id);
}
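The same pinning technique in a self-contained form. This sketch swaps the associative array for a linear search run at compile time, and uses a tiny made-up `Names` list; both are simplifications for the example.

```d
struct Name {
    uint id;
}

// Hypothetical prefilled table; the index of each entry is its id.
enum string[] Names = ["", "init", "length", "sizeof"];

// Evaluated at compile time when used from the template below.
ptrdiff_t indexOfName(string name) {
    foreach (i, n; Names) {
        if (n == name) {
            return cast(ptrdiff_t) i;
        }
    }
    return -1;
}

template BuiltinName(string name) {
    private enum idx = indexOfName(name);
    static assert(idx >= 0, name ~ " is not a builtin name.");
    enum BuiltinName = Name(cast(uint) idx);
}

// The id is a compile-time constant: no context lookup at runtime.
static assert(BuiltinName!"length".id == 2);
// Unknown names are rejected during compilation.
static assert(!__traits(compiles, BuiltinName!"nope"));
```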

More context !
• Track locations in a compiler
– They are everywhere
• Register files in the context
– Allocate a range of values from N to N + sizeof(file)
– A position for each byte in the file !
• Add a flag for mixin (D) / macros (C++)
– Register expansions in the context.

More context !
• Use cases:
– Emit debug info
– Error messages
• Performance does not matter for errors
• The access pattern is mostly predictable for debug info
• Find file/line from location using
– One element cache
– Linear search (8 elements)
– Binary search

More context !
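[Diagram: the position space. Files (File 1, File 2, File 3) occupy the range 0 to 2B; mixins (Mixin 1, Mixin 2, Mixin 3) occupy the range -2B to -1; the unused remainder of each range is empty.]
The context stores file boundaries and line positions within files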

More context !
• A position is a 31-bit number + a flag
– Up to 2 GB of source code + 2 GB of macros/mixins
• A pair of positions is a location
– Used for tokens/expressions/symbols/statements
• The lexer only needs to bump the position value for each token by the length of the token
• Strategy used by clang / SDC
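A possible shape for such a position, as a sketch only: the `Position` and `Location` types, their method names, and the choice of the top bit for the flag are assumptions of this example rather than the actual clang or SDC definitions.

```d
struct Position {
    private uint raw;

    // Top bit: set for positions inside a mixin/macro expansion.
    enum uint MixinFlag = 1u << 31;

    static Position fromFile(uint offset) {
        assert(offset < MixinFlag, "offset needs more than 31 bits");
        return Position(offset);
    }

    static Position fromMixin(uint offset) {
        assert(offset < MixinFlag, "offset needs more than 31 bits");
        return Position(offset | MixinFlag);
    }

    @property bool isMixin() const {
        return (raw & MixinFlag) != 0;
    }

    @property uint offset() const {
        return raw & ~MixinFlag;
    }

    /// What the lexer does after each token: bump by the token length.
    Position advance(uint length) const {
        return Position(raw + length);
    }
}

struct Location {
    Position start;
    Position stop;
}

// A full source range costs 8 bytes.
static assert(Location.sizeof == 8);

unittest {
    auto p = Position.fromFile(10).advance(5);
    assert(!p.isMixin && p.offset == 15);

    auto m = Position.fromMixin(3).advance(4);
    assert(m.isMixin && m.offset == 7);
}
```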

Tagged reference
• Useful to encapsulate several reference types
• Can provide methods forwarding to elements
– Use reflection to do so
– Avoid vtable lookups/cascaded loads
– No common layout in the referenced object
• Number of elements limited by alignment
– Easy to get up to 8 on x64
• LLVM’s call/invoke

Tagged reference
template TagFields(uint i, U...) {
    import std.conv;
    static if (U.length == 0) {
        enum TagFields = "\n\t" ~ T.stringof ~ " = "
            ~ to!string(i) ~ ",";
    } else {
        enum S = U[0].stringof;
        static assert(
            (S[0] & 0x80) == 0,
            S ~ " must not start with a Unicode character.",
        );
        static assert(
            U[0].sizeof <= size_t.sizeof,
            "Elements must be of pointer size or smaller.",
        );
        import std.ascii;
        enum Name = (S == "typeof(null)")
            ? "Undefined"
            : toUpper(S[0]) ~ S[1 .. $];
        enum TagFields = "\n\t" ~ Name ~ " = "
            ~ to!string(i) ~ "," ~ TagFields!(i + 1, U[1 .. $]);
    }
}
mixin("enum Tag {" ~ TagFields!(0, U) ~ "\n}");
import std.traits;
alias Tags = EnumMembers!Tag;
import std.typetuple;
alias TagTuple = TypeTuple!(uint, "tag", EnumSize!Tag);

Tagged reference
struct TaggedRef(U...) {
private:
    mixin(taggedPointer!(
        void*, "ptr", TagTuple));
public:
    auto get(Tag E)() in {
        assert(tag == E);
    } do {
        static union Helper {
            void* __ptr;
            U u;
        }
        return Helper(ptr).u[E];
    }
    template opDispatch(string s, T...) {
        auto opDispatch(A...)(A args) {
            final switch(tag) {
                foreach(T; Tags) {
                    case T:
                        auto r = get!T();
                        return mixin("r." ~ s)(args);
                }
            }
        }
    }
}

Value Type Polymorphism
• All subtypes fit under a given size budget
• A tag is used to differentiate them
• The whole thing is wrapped in a nice API
• Being able to hide atrocities behind a nice façade: that’s the power of D
• Example: Representing D types

template SizeOfBitField(T...) {
    // Bitfield arguments come in (type, name, size) triples.
    static if (T.length < 3) {
        enum SizeOfBitField = 0;
    } else {
        enum SizeOfBitField =
            T[2] + SizeOfBitField!(T[3 .. $]);
    }
}
enum EnumSize(E) =
    computeEnumSize!E();
size_t computeEnumSize(E)() {
    size_t size = 0;
    import std.traits;
    foreach (m; EnumMembers!E) {
        // Number of bits needed to represent this member.
        size_t ms = 0;
        while ((m >> ms) != 0) {
            ms++;
        }
        import std.algorithm;
        size = max(size, ms);
    }
    return size;
}

struct TypeDescriptor(K, T...) {
    enum DataSize = ulong.sizeof * 8 - 3 - EnumSize!K - SizeOfBitField!T;
    mixin(bitfields!(
        K, "kind", EnumSize!K,
        TypeQualifier, "qualifier", 3,
        ulong, "data", DataSize,
        T,
    ));
    static assert(TypeDescriptor.sizeof == ulong.sizeof);
    this(K k, TypeQualifier q, ulong d = 0) {
        kind = k;
        qualifier = q;
        data = d;
    }
}

• A type is a TypeDescriptor + an indirection field
• Data depends on the kind
– If it doesn’t fit, use the indirection field
• There are many kinds of types:
– Builtin
– Struct
– Class
– Alias
– Function
– …
• A common API switches on the kind to do the right thing

Layout: [ data | Qualifier | Kind ] [ Indirection ]
• 128-bit budget
• Indirection is used when
– The type needs extra space (Function)
– The type needs to refer to a symbol (Aggregate, Alias)
– Otherwise null
• Advantageously replaced the type class hierarchy
– Significant memory consumption reduction
– Significantly faster runtime (about 20%)
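A reduced sketch of such a 128-bit value type. The `Kind` and `Qualifier` enums, the field widths, and the struct name `Type` are placeholders chosen for the example and are much smaller than a real compiler's definitions.

```d
import std.bitmanip : bitfields;

// Hypothetical, shortened enums.
enum Kind : ubyte { Builtin, Struct, Class, Alias, Function }
enum Qualifier : ubyte { Mutable, Const, Immutable }

struct Type {
    // 64-bit descriptor: kind + qualifier + kind-specific data.
    mixin(bitfields!(
        Kind, "kind", 3,
        Qualifier, "qualifier", 3,
        ulong, "data", 58));

    // Used only when the data does not fit in the descriptor.
    void* indirection;
}

// The whole type fits the 128-bit budget on 64-bit targets.
static assert(size_t.sizeof != 8 || Type.sizeof == 16);

unittest {
    Type t;
    t.kind = Kind.Function;
    t.qualifier = Qualifier.Const;
    t.data = 12_345;

    assert(t.kind == Kind.Function);
    assert(t.qualifier == Qualifier.Const);
    assert(t.data == 12_345);
    assert(t.indirection is null);
}
```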

• You can nest, effectively creating hierarchies
• For instance, Identifiable is
– A type
– An expression
– A symbol
• More packing !

Layout: [ data | Qualifier | Kind ] [ Indirection/Expression/Symbol | Tag ]
• Tag is used to discriminate between
– Type
– Expression
– Symbol
• Tag is zeroed out to find the type
• Saved 70 MB (!) of template bloat in SDC

import d.semantic.identifier;
Identifiable i = ...;
i.apply!(delegate Expression(identified) {
    alias T = typeof(identified);
    static if (is(T : Expression)) {
        return identified;
    } else {
        return getError(
            identified,
            location,
            t.name.toString(pass.context) ~ " isn't callable",
        );
    }
})();

[Diagram: Identifiable is one of Type, Expression or Symbol; Type is in turn one of Builtin, Class, Alias, Struct, Pointer, Function, …]

Value Type - ABI
• Structs of up to 2 fields can be passed in registers
– Each field up to pointer sized
– Slice !
– No float/integral mixing
• Common anti-pattern: 2 pointers + a bool
– std.bigint.BigInt is a slice + a bool
– Passed in memory instead of registers
• More than one pointer tends to use 2
– Use either 1 or 2 pointer-sized structs

Classless Polymorphism
• Create a base struct
• All substructs use it as their first field
• Contains a tag describing the type
– The tag can be part of a bitfield
• Use a mixin in all substructs
– Include static asserts to check this is done right
– Alias this the base

• Each leaf of the hierarchy has a tag value
• Each non-leaf has a range of tag values
• The root matches all values
• The hierarchy must be known at compile time
• Use a bunch of mixin templates
– Add the boilerplate
– A ton of static asserts

struct Child {
mixin Parent!Root;
}
struct Root {
mixin Childs!(Child, SubStruct);
}
struct SubStruct {
mixin GrandChilds!(
Root,
SubChild,
);
}
struct SubChild {
mixin Parent!SubStruct;
}

Layouts:
Root: [ Root ]
Child: [ Root | Child’s fields ]
SubStruct: [ Root | SubStruct’s fields ]
SubChild: [ Root | SubStruct’s fields | SubChild’s fields ]

• Children share the parent’s part of the layout
– It is safe to upcast
– Done via alias this
• Downcast to a leaf: check tag’s value
– Cheap
– Easy pattern matching
• Downcast to substruct: check tag range
– Cheap
• No typeid pointer chasing

Virtualish Dispatch
• No virtual table
• Get function pointer in a table
– One table per method
– One entry per leaf type
– Using the tag as an index
• Used by HHVM for PHP arrays
– Creative data structure
– Is a vector/hashmap/set/tuple/whatever…

Regular Virtual Dispatch
T1 object: [ Vtable pointer | T1’s data ], T1’s vtable: [ f1 | f2 | f3 | f4 ]
T2 object: [ Vtable pointer | T2’s data ], T2’s vtable: [ g1 | g2 | g3 | g4 ]
• One vtable per type
• Vtable has one entry per method
• Load vtable then load function address

Virtualish Dispatch
T1 object: [ Tag | T1’s data ], T2 object: [ Tag | T2’s data ]
Table for method 1: [ f1 | g1 | h1 | i1 ]
Table for method 2: [ f2 | g2 | h2 | i2 ]
• One vtable per method
• Vtable has one entry per type
• Load tag then use it as index in per function table

Virtualish Dispatch
• Usually better locality
– Calling the same method on objects of various types is more common than calling various methods on objects of the same type
• Often worked around by sorting by type
– Classless gets most of the benefit without sorting
– Still helps branch prediction
• Tables can be generated using reflection in D

Classless visitors !
• A regular class hierarchy needs to know all methods at compile time
– Can add types dynamically
• A classless hierarchy needs to know all types at compile time
– Can add methods dynamically
• A visitor can create a table of visit methods
– And use the tag to dispatch
• Closes extensibility one way, opens it another way
