A Scalable, Correct Time-Stamped Stack
2014-08-11
Andreas Haas
Google Tech Talk
November 23, 2015
Fast Concurrent Data Structures Through Timestamping
Andreas Haas
Defensio, University of Salzburg
Introduction
Implementation
Correctness
Performance
Optimizations
[Figure: a stack accessed by a stream of push and pop operations]
execute as many operations as possible per second
more threads => more operations
correct behavior: indistinguishable from a lock-based stack
linearizability
[Conceptual plot: operations per second vs. number of threads. Ideal (what we want): good performance and good scalability. Also shown: bad performance with good scalability; good performance with negative scalability.]
Treiber stack:
[Figure: elements a, b, c, d in a singly-linked list reached through a single top pointer]
element order defined by the link structure
contention on the top pointer
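For context, a minimal Treiber-stack sketch in C++ (my own illustrative code, not from the talk; memory reclamation is omitted). It shows why every operation contends on the single top pointer:

    #include <atomic>

    // Minimal Treiber-stack sketch (illustrative only; popped nodes are leaked).
    template <typename T>
    class TreiberStack {
      struct Node { T value; Node* next; };
      std::atomic<Node*> top_{nullptr};

    public:
      void push(T v) {
        Node* n = new Node{v, nullptr};
        n->next = top_.load();
        // Every push and pop competes on this single CAS: contention on top.
        while (!top_.compare_exchange_weak(n->next, n)) {}
      }

      bool pop(T* out) {
        Node* t = top_.load();
        while (t != nullptr && !top_.compare_exchange_weak(t, t->next)) {}
        if (t == nullptr) return false;   // stack was empty
        *out = t->value;                  // node intentionally not freed in this sketch
        return true;
      }
    };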
timestamped stack:
[Figure: elements a|1, b|2, c|3, d|3 ordered by their timestamps]
element order defined by timestamps
concurrently inserted elements can share a timestamp
contention on timestamp generation
Implementation
Push is very fast
Pop is slow but can be optimized: make pop operations fast
Sharing timestamps allows more concurrency and lower cost per operation
Correctness
Show linearizability with respect to stack semantics
Traditional approaches based on linearization points do not work
[Preview plot: operations per ms (more is better) vs. number of threads (2 to 64, up to the number of cores). Series: Treiber Stack (standard baseline), EB Stack (best other stack), TS Stack (our TS stack).]
the same approach works for queues and deques
Timestamped queue: remove the element with the oldest timestamp. Fast, but the LCRQ (PPoPP 2013) is even faster.
Timestamped deque: combination of TS queue and TS stack. Fastest concurrent deque we are aware of; a complete correctness proof is still missing.
Introduction
Implementation
Correctness
Performance
Optimizations
TS stack
[Figure: one singly-linked list per thread (thread 1 to thread 4), each reached through its own top pointer; elements a|1, b|2, d|3, c|4 are spread across the lists]
push: timestamps order elements
push: exclusively inserts into the thread's list
elements in one list are ordered; no order guarantees across list boundaries
pop: iterate over the list heads to find the youngest element
pop: try to remove the identified element
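A hedged sketch of how this layout might look in C++ (the type and field names are my own, not the authors' code): one singly-linked list per thread, each node carrying a timestamp and a flag that marks it as logically removed.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Illustrative layout only: an array of per-thread singly-linked lists.
    constexpr uint64_t TOP_TS = UINT64_MAX;    // "T": treated as newer than any real timestamp

    struct Node {
      int value;
      std::atomic<uint64_t> timestamp{TOP_TS};
      std::atomic<bool> taken{false};          // logically removed?
      Node* next{nullptr};                     // next (older) element in the same list
    };

    struct ThreadList {
      std::atomic<Node*> head{nullptr};        // only the owning thread inserts here
    };

    struct TSStack {
      std::vector<ThreadList> lists;           // one list per thread
      explicit TSStack(int num_threads) : lists(num_threads) {}
    };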
TS stack
[Figure: a|1, d|3, b|2 already in the lists; a new element c|T is being pushed]
push: insert the element with an initial top (T) timestamp
pseudo code: push
1. Insert the element into the thread's list (with the initial top (T) timestamp)
2. Generate a timestamp
3. Assign the timestamp to the element
non-atomic, lock-free implementation: eventually the actual timestamp is assigned
multiple pops can succeed at the same time
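A hedged sketch of this push path, building on the layout types sketched above (again my own names; unlinking of taken nodes and memory reclamation are omitted). The placeholder newTimestamp() stands in for whichever timestamping scheme is used; here it is a single atomic counter, one of the options discussed later.

    #include <atomic>
    #include <cstdint>

    // Placeholder timestamping for the sketch: a single atomic counter.
    std::atomic<uint64_t> g_counter{0};
    uint64_t newTimestamp() { return g_counter.fetch_add(1) + 1; }

    // Sketch of push for thread `tid`; uses Node/ThreadList/TSStack from the layout sketch.
    void push(TSStack& s, int tid, int value) {
      Node* n = new Node;
      n->value = value;                       // timestamp is still TOP_TS ("T") at this point
      n->next = s.lists[tid].head.load();
      s.lists[tid].head.store(n);             // only this thread ever inserts into this list
      uint64_t ts = newTimestamp();           // generate the timestamp ...
      n->timestamp.store(ts);                 // ... and assign it afterwards
    }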
pseudo code: pop
1. Search for the youngest element by iterating over the TS stack (array of lists), starting at a random index; it is enough to read the timestamps of the list head elements
2. Try to remove the identified element
3. If removal fails, restart the iteration
lock-free: concurrent operations may change the TS stack during the iteration
[Figure: during the search, elements may get removed (e.g. the identified c|4) and elements may get inserted (e.g. e|5); is the identified element still the youngest element in the TS stack since the search began?]
optimizations: reduce the number of failed removals
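A hedged sketch of this pop path in the same style (my own names, building on the layout and push sketches above). It uses plain integer timestamps; the real algorithm additionally handles interval timestamps, emptiness detection, and unlinking of taken nodes.

    #include <optional>
    #include <cstdint>

    // Sketch of pop: scan the list heads for the youngest (largest-timestamp) untaken
    // element, then try to claim it with a CAS on its taken flag; retry on failure.
    std::optional<int> pop(TSStack& s, int start_index /* e.g. random */) {
      for (;;) {                                        // restart the iteration on failed removal
        Node* best = nullptr;
        uint64_t best_ts = 0;
        const int k = static_cast<int>(s.lists.size());
        for (int i = 0; i < k; ++i) {                   // iterate over all lists, starting at start_index
          Node* head = s.lists[(start_index + i) % k].head.load();
          while (head != nullptr && head->taken.load()) // skip logically removed nodes
            head = head->next;
          if (head == nullptr) continue;
          uint64_t ts = head->timestamp.load();         // reading the head timestamps is enough
          if (best == nullptr || ts >= best_ts) { best = head; best_ts = ts; }
        }
        if (best == nullptr) return std::nullopt;       // no candidate found in this pass
        bool expected = false;
        if (best->taken.compare_exchange_strong(expected, true))
          return best->value;                           // removal succeeded
        // Another pop took the element (or our view was stale): retry.
      }
    }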
Introduction
Implementation
Correctness
Performance
Optimizations
TS stack
[Figure: a|1, d|3, b|2 in the lists; element c|T still carries the initial top (T) timestamp]
elements without a timestamp can be removed faster: no younger element can exist, so try to remove it immediately
this corresponds to the elimination of the EB stack (SPAA 2004)
by using timestamps, elimination is a common execution path
TS stack
[Figure: a|1, b|2, c|3, d|3; c and d share timestamp 3]
elements can share timestamps
sequentially inserted elements never share timestamps
concurrent pushes (push(c) || push(d)) can take effect in any order, so either element can be popped first
multiple pops can succeed at the same time
Various timestamping algorithms possible:
An atomic counter
A hardware clock (e.g. x86's TSC)
Weakly synchronized counters (e.g. Lamport clocks)
A specialized timestamping algorithm based on CAS
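A hedged sketch of two of these options (my own code, not the authors'): a shared atomic counter, and the x86 TSC read via the __rdtsc intrinsic. The counter serializes all pushes; the hardware clock avoids the shared write but assumes the TSC is synchronized across cores.

    #include <atomic>
    #include <cstdint>
    #if defined(__x86_64__) || defined(__i386__)
    #include <x86intrin.h>   // __rdtsc()
    #endif

    // (1) A single atomic counter: simple, but fetch_add serializes all timestamp requests.
    std::atomic<uint64_t> counter{0};
    uint64_t counterTimestamp() { return counter.fetch_add(1) + 1; }

    // (2) A hardware clock (x86 TSC): no shared write, assumes a core-synchronized TSC.
    uint64_t tscTimestamp() {
    #if defined(__x86_64__) || defined(__i386__)
      return __rdtsc();
    #else
      return counterTimestamp();   // fallback so the sketch compiles everywhere
    #endif
    }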
use intervals as timestamps to increase timestamp sharing
generate two timestamps to define the timestamp interval
interval timestamps approximate execution times
[Figure: concurrent pushes of a [1,2], b [3,5], d [4,7], c [6,8]; overlapping intervals belong to concurrent pushes]
trade off push performance against pop performance
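A hedged sketch of interval timestamping (my own code and names): take one timestamp, wait for a configurable pause, take a second one. The pause corresponds to the "interval length" parameter in the plots below; a longer pause makes intervals of concurrent pushes more likely to overlap.

    #include <chrono>
    #include <cstdint>

    struct Interval { uint64_t start, end; };

    // Monotonic clock in nanoseconds, just for the sketch.
    uint64_t nowNs() {
      using namespace std::chrono;
      return static_cast<uint64_t>(
          duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
    }

    // An interval timestamp is simply two timestamps taken in sequence.
    Interval intervalTimestamp(uint64_t pause_ns) {
      Interval iv;
      iv.start = nowNs();
      while (nowNs() - iv.start < pause_ns) { /* busy wait for the configured interval length */ }
      iv.end = nowNs();
      return iv;
    }

    // Two intervals impose an order only if they do not overlap;
    // otherwise the two elements may be popped in either order.
    bool definitelyBefore(const Interval& a, const Interval& b) { return a.end < b.start; }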
[Figure: three concurrent pushes of a, b, c under short vs. long timestamp intervals]
short intervals do not overlap: only one top element
increasing the interval length makes more pushes concurrent: more top elements for concurrent pops, so nearly every pop succeeds after one iteration
[Plot: TS Stack (hardware clock) with increasing interval length. Left axis: operations per ms (more is better); right axis: number of iterations (less is better); x axis: increased interval length in ns. Series: Performance TS Stack (hardware clock), Iterations TS Stack (hardware clock).]
with long intervals, push gets slower than pop
Introduction
Implementation
Correctness
Performance
Optimizations
[Figure: concurrent push(c), push(a), push(b); the elements end up as a, c, b in the TS stack]
the linearization points depend on later pops
[Figure: concurrent push(c), push(a), push(b); depending on the execution, later pops may return b, c, a or a, c, b or c, b, a]
where can we place correct linearization points?
stack theorem: show linearizability without linearization points
Forbid bad local behavior instead of establishing a global linearization order
Generic theorem: applies to all stack algorithms we are aware of
Related to the queue theorem by Henzinger, Sezgin and Vafeiadis (CONCUR 2013)
The stack theorem is more restrictive than the queue theorem (are stacks more restricted than queues?)
Proved in Isabelle/HOL (~5000 lines)
The following two conditions are sufficient for linearizability with respect to stack semantics:
1. Linearizable without order requirements (linearizable with respect to set semantics)
2. Order correct: the following behavior does not happen:
   push(a) "before" push(b), push(b) "before" pop(a), and pop(a) "before" pop(b)
The three "before" relations are algorithm-specific:
push(a) before push(b): Treiber stack: CAS of push(a) before CAS of push(b); TS stack: timestamp order
push(b) before pop(a): Treiber stack: CAS of push(b) before CAS of pop(a); TS stack: b is inserted into the TS stack before pop(a) starts its search
pop(a) before pop(b): Treiber stack: CAS of pop(a) before CAS of pop(b); TS stack: a is removed from the TS stack before b
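One compact way to write the order-correctness condition down (the symbol ≺ and this notation are mine, not the talk's; each ≺ stands for the corresponding algorithm-specific "before" relation listed above). Together with set-semantics linearizability, ruling out this pattern suffices for stack linearizability:

    \[
      \neg \exists\, a, b :\;
        \mathit{push}(a) \prec \mathit{push}(b)
        \;\wedge\;
        \mathit{push}(b) \prec \mathit{pop}(a)
        \;\wedge\;
        \mathit{pop}(a) \prec \mathit{pop}(b)
    \]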
Introduction
Implementation
Correctness
Performance
Optimizations
Evaluated on two server machines:
Server with four 10-core 2GHz Intel Xeon processors (40 cores, 2 hyperthreads per core), 24MB shared L3-cache, and 128GB of UMA memory (only in the paper)
Server with four 16-core 2.3GHz AMD Opteron processors (64 cores), 16MB shared L3-cache, and 512GB of cc-NUMA memory
All stacks are configured to show optimal performance with 64 threads
Producer-consumer benchmark (Scal benchmarking framework):
n/2 producers, n/2 consumers, each executing 1 million operations
Configurable contention by a busy wait between stack operations
[Plot: producer-consumer benchmark, operations per ms (more is better) vs. number of threads (2 to 64). Series: Treiber Stack, EB Stack (delay=21μs), TS Stack (hardware clock, interval length=6μs), TS Stack (software counter, interval length=10.5μs).]
[Plot: producer-consumer benchmark, operations per ms (more is better) vs. number of threads (2 to 64). Series: Treiber Stack, EB Stack (delay=21μs), TS Stack (hardware clock, interval length=4.5μs), TS Stack (software counter, interval length=9μs).]
Fastest concurrent stack we know of
Generic stack theorem that does not require linearization points
Future work:
Are there more generic correctness conditions for linearizable data structures?
Is linearizability even the right correctness condition for data structures?
Relaxed memory: the stack theorem specifies the order requirements of a concurrent stack; which relaxed memory models provide cheap ways to satisfy these requirements?
Thank You