HPC in Python with TensorFlow

try:
    import google.colab
except ImportError:
    pass
else:
    !pip install zfit numba numpy tensorflow_probability
import tensorflow as tf

Depending on the CPU and the TensorFlow binary used, we may see (in the terminal, not in the Jupyter Notebook) a bunch of messages, including the following: tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

What could they mean?

AVX stands for Advanced Vector Extensions; these are instructions that the CPU can perform. They are specializations for vector operations (remember? SIMD, CPU instruction set, etc.)

Why do they appear?

The code that we are using was not compiled with these flags on. This means TensorFlow assumes that the CPU does not support these instructions and uses non-optimized ones instead. The reason is that this allows the binary (= compiled code) to also run on a CPU that does not support them, while we lose only some speed. (Yes, technically TensorFlow can be faster when compiled natively on your computer, but that takes time and effort.)

import numba
import numpy as np
import tensorflow_probability as tfp
import zfit
from zfit import z

Let’s start with a simple comparison of Numpy, an AOT compiled library, versus pure Python

size1 = 100000
list1 = [np.random.uniform() for _ in range(size1)]
list2 = [np.random.uniform() for _ in range(size1)]
list_zeros = [0] * size1

ar1 = np.array(list1)
ar2 = np.random.uniform(size=size1)  # way more efficient!
ar_zeros = np.zeros_like(ar1)  # the *_like functions are quite useful: same shape and dtype as the given object
# we could also create an uninitialized array, which is in general better (no filling with zeros needed):
# ar_empty = np.empty(shape=size1, dtype=np.float64)
%%timeit
for i in range(size1):
    list_zeros[i] = list1[i] + list2[i]
14.1 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
ar1 + ar2
108 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(playground: we can also try assignments here or similar)

Fast and slow

Before we go deeper into the topic, we can draw two conclusions:

  • slow Python is not “slow”: it is still blazingly fast on an absolute scale, e.g. if you need to loop over a few hundred points, it’s still nothing. But it can add up!

  • Numpy is a factor of ~300 faster for this case (and: more readable!)

=> there is no reason to ever add (numerical) arrays with a for loop in Python (except for numba jit)

As mentioned, TensorFlow is basically Numpy. Let’s check that out

rnd1_np = np.random.uniform(size=10, low=0, high=10)
rnd1_np  # adding a return value on the last line without assigning it prints the value
array([2.16963493, 6.97459476, 0.59067682, 2.30885495, 3.81780519,
       5.56890749, 1.07919196, 9.69959239, 6.135924  , 5.7163101 ])
rnd1 = tf.random.uniform(shape=(10,),  # notice the "shape" argument: it's more picky than Numpy
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd2 = tf.random.uniform(shape=(10,),
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd1
<tf.Tensor: shape=(10,), dtype=float64, numpy=
array([2.26114644, 0.68319395, 7.34361396, 6.84196525, 1.36759423,
       6.69716447, 3.72511919, 3.50020386, 7.42365297, 1.98782923])>

This is in fact a wrapped Numpy array and can explicitly be converted to one

rnd1.numpy()
array([2.26114644, 0.68319395, 7.34361396, 6.84196525, 1.36759423,
       6.69716447, 3.72511919, 3.50020386, 7.42365297, 1.98782923])

Other operations behave as we would expect

rnd1 + 10
<tf.Tensor: shape=(10,), dtype=float64, numpy=
array([12.26114644, 10.68319395, 17.34361396, 16.84196525, 11.36759423,
       16.69716447, 13.72511919, 13.50020386, 17.42365297, 11.98782923])>

… and it converts itself (often) to Numpy when needed.

np.sqrt(rnd1)
array([1.50371089, 0.82655547, 2.70991032, 2.61571505, 1.16944184,
       2.58788803, 1.93005678, 1.87088318, 2.72463814, 1.40990398])

We can slice it…

rnd1[1:3]
<tf.Tensor: shape=(2,), dtype=float64, numpy=array([0.68319395, 7.34361396])>

…expand it….

rnd1[None, :, None]
<tf.Tensor: shape=(1, 10, 1), dtype=float64, numpy=
array([[[2.26114644],
        [0.68319395],
        [7.34361396],
        [6.84196525],
        [1.36759423],
        [6.69716447],
        [3.72511919],
        [3.50020386],
        [7.42365297],
        [1.98782923]]])>

…and broadcast with the known (maybe slightly stricter) rules

matrix1 = rnd1[None, :] * rnd1[:, None]
matrix1
<tf.Tensor: shape=(10, 10), dtype=float64, numpy=
array([[ 5.1127832 ,  1.54480156, 16.60498654, 15.47068532,  3.09233081,
        15.14326956,  8.42303998,  7.91447349, 16.78596646,  4.49477299],
       [ 1.54480156,  0.46675397,  5.01711261,  4.67438924,  0.9343321 ,
         4.57546223,  2.54497888,  2.39131809,  5.07179478,  1.3580729 ],
       [16.60498654,  5.01711261, 53.92866605, 50.24475151, 10.04308407,
        49.18139051, 27.3558373 , 25.70414596, 54.51644164, 14.59785052],
       [15.47068532,  4.67438924, 50.24475151, 46.81248841,  9.35703218,
        45.82176653, 25.48713603, 23.94827318, 50.79237563, 13.60065853],
       [ 3.09233081,  0.9343321 , 10.04308407,  9.35703218,  1.87031397,
         9.15900347,  5.0944515 ,  4.7868586 , 10.15254496,  2.71854379],
       [15.14326956,  4.57546223, 49.18139051, 45.82176653,  9.15900347,
        44.85201191, 24.94773588, 23.44144094, 49.71742491, 13.31281931],
       [ 8.42303998,  2.54497888, 27.3558373 , 25.48713603,  5.0944515 ,
        24.94773588, 13.87651298, 13.03867658, 27.65399215,  7.40490082],
       [ 7.91447349,  2.39131809, 25.70414596, 23.94827318,  4.7868586 ,
        23.44144094, 13.03867658, 12.25142708, 25.98429881,  6.95780756],
       [16.78596646,  5.07179478, 54.51644164, 50.79237563, 10.15254496,
        49.71742491, 27.65399215, 25.98429881, 55.11062347, 14.7569544 ],
       [ 4.49477299,  1.3580729 , 14.59785052, 13.60065853,  2.71854379,
        13.31281931,  7.40490082,  6.95780756, 14.7569544 ,  3.95146506]])>

Equivalent operations

Many operations that exist in Numpy also exist in TensorFlow, sometimes with a different name.

The concept, however, is exactly the same: we have higher-level objects such as Tensors (or arrays) and call operations on them with arguments. This is (theoretically) a “strong limitation” of what we can do; however, since we do math, there is only a limited set of operations we need, and in practice this suffices in 98% of the cases.

Therefore we won’t dive too deep into the operations that TensorFlow/Numpy offer; it is suggested to read the API docs, many of which are self-explanatory. It may be surprising that there is also some support for more exotic elements such as RaggedTensors and SparseTensors, together with their operations.
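As a tiny illustration of these more exotic structures, here is a sketch (not used further in this notebook):

# a sketch of the more exotic structures mentioned above (not used further here)
ragged = tf.ragged.constant([[1., 2., 3.], [4.]])  # rows of different lengths
sparse = tf.sparse.SparseTensor(indices=[[0, 1], [2, 3]],  # mostly-zero data, stored by index
                                values=[10., 20.],
                                dense_shape=[3, 4])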

Mostly, the differences and the terminology will be introduced.

tf.sqrt(rnd1)
<tf.Tensor: shape=(10,), dtype=float64, numpy=
array([1.50371089, 0.82655547, 2.70991032, 2.61571505, 1.16944184,
       2.58788803, 1.93005678, 1.87088318, 2.72463814, 1.40990398])>
tf.reduce_sum(matrix1, axis=0)  # with the axis argument to specify over which to reduce
<tf.Tensor: shape=(10,), dtype=float64, numpy=
array([ 94.5871099 ,  28.57901637, 307.19426671, 286.20955657,
        57.20849544, 280.15232525, 155.82726211, 146.41872027,
       310.5424172 ,  83.15384588])>

DTypes

TensorFlow is more picky about dtypes than Numpy and does not automatically cast them. That’s why we can often get a dtype error. Solution: add an x = tf.cast(x, dtype=tf.float64) (or whatever dtype we want) to cast it into the right dtype.

One noticeable difference: TensorFlow uses float32 as the default for all operations. Neural networks work quite well with that (sometimes even with float16), but for (most) scientific use-cases we want float64. So yes, currently, we have to define this in (too) many places.

Sidenote: zfit also uses tf.float64 throughout internally. A frequently seen problem is that custom shapes (e.g. custom PDFs) raise a dtype error.
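A minimal sketch (not from the original notebook) of what typically goes wrong and how tf.cast fixes it: mixing the float32 default with float64 raises an error.

x32 = tf.constant([1., 2., 3.])  # defaults to tf.float32
y64 = tf.constant([1., 2., 3.], dtype=tf.float64)
try:
    x32 + y64  # no automatic upcasting as in Numpy
except tf.errors.InvalidArgumentError as error:
    print(error)
x64 = tf.cast(x32, dtype=tf.float64)  # cast explicitly to the dtype we want
x64 + y64  # now the dtypes match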

What we can’t do: assignments

The idea of TensorFlow revolves around building a graph 1); the eager (Numpy-like) execution is more of a “we get it for free” addition and does not lead the development. This has one profound implication, namely that we cannot make an assignment to a Tensor, because it is a node in a graph. The logic just does not work (exception: tf.Variable 2)). This does not mean that TensorFlow never performs in-place operations behind the scenes - it very well does if it is safe to do so. Since TensorFlow knows the whole graph with all dependencies, this can be figured out.

Even though in eager mode assignments could work (as they do for Numpy arrays), they are forbidden for consistency (one of the great plus points of TensorFlow).

TensorFlow list

An advanced structure that can also be modified is the tf.TensorArray, which acts like a list or an ArrayList (in other languages). It is however advanced and not usually needed, so we won’t go into the details here.

1) if you’re familiar with TensorFlow 1, this statement will not surprise you but rather seem obvious; in TensorFlow 2, the graph is luckily more hidden.
2) more generally, a resource handle

rnd1_np[5] = 42
try:
    rnd1[5] = 42
except TypeError as error:
    print(error)
'tensorflow.python.framework.ops.EagerTensor' object does not support item assignment
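If a single element really needs to be replaced, the functional way is to build a new Tensor instead of assigning in place; a minimal sketch (not part of the original notebook) using tf.tensor_scatter_nd_update:

# build a new Tensor with index 5 replaced by 42 (the original rnd1 stays unchanged)
rnd1_updated = tf.tensor_scatter_nd_update(rnd1,
                                           indices=[[5]],
                                           updates=tf.constant([42.], dtype=tf.float64))
rnd1_updated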

Speed comparison

Let’s do the same calculation as with Numpy. The result should be comparable: both are AOT-compiled libraries specialized in numerical, vectorized operations.

rnd1_big = tf.random.uniform(shape=(size1,),  # notice the "shape" argument: it's more picky than Numpy
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
rnd2_big = tf.random.uniform(shape=(size1,),
                         minval=0,
                         maxval=10,
                         dtype=tf.float64)
%%timeit
rnd1_big + rnd2_big
151 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit  # using numpy, same as before
ar1 + ar2
78.1 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Looks like the same as Numpy. Let’s compare with smaller arrays

rnd1_np = rnd1.numpy()
rnd2_np = rnd2.numpy()
rnd1_np
array([2.26114644, 0.68319395, 7.34361396, 6.84196525, 1.36759423,
       6.69716447, 3.72511919, 3.50020386, 7.42365297, 1.98782923])
%%timeit
rnd1_np + rnd2_np
592 ns ± 5.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
rnd1 + rnd2
15 µs ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

TensorFlow is slow?

We now see a significant difference in the runtime! This is because TensorFlow has a larger overhead than Numpy. As seen before, this is not (or barely) noticeable for larger arrays; for very small calculations, however, it is visible.

There is more overhead because TensorFlow tries to be “smarter” about many things than Numpy and does not simply execute the computation directly.

The cost is a slowdown on very small operations but a better scaling and improved performance with larger arrays and more complicated calculations.

# relative speeds may differ, depending on the hardware used.
# size_big = 10  # numpy faster
size_big = 20000  # sameish
# size_big = 100000  # TF faster
# size_big = 1000000  # TF faster
# size_big = 10000000  # TF faster
# size_big = 100000000  # TF faster
%%timeit
tf.random.uniform(shape=(size_big,), dtype=tf.float64) + tf.random.uniform(shape=(size_big,), dtype=tf.float64)
562 µs ± 3.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
np.random.uniform(size=(size_big,)) + np.random.uniform(size=(size_big,))
394 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

TensorFlow kernels

In general, TensorFlow is more precise about the required input arguments than Numpy: it does less automatic dtype casting and is more explicit about shapes. For example, integers don’t work in the logarithm. However, the resulting error message illustrates the kernel dispatch system of TensorFlow very well, so let’s provoke it!

try:
    tf.math.log(5)
except (tf.errors.NotFoundError, tf.errors.InvalidArgumentError) as error: # TF 2.4 error, TF 2.5 error
    print(error)
Value for attr 'T' of int32 is not in the list of allowed values: bfloat16, half, float, double, complex64, complex128
	; NodeDef: {{node Log}}; Op<name=Log; signature=x:T -> y:T; attr=T:type,allowed=[DT_BFLOAT16, DT_HALF, DT_FLOAT, DT_DOUBLE, DT_COMPLEX64, DT_COMPLEX128]> [Op:Log]

What we see here: it searches the registered kernels and does not find any that supports this operation. We find different classifications:

  • GPU: normal GPU kernel

  • CPU: normal CPU kernel

  • XLA: Accelerated Linear Algebra is a high-level compiler that can fuse operations, which would result in single calls to a fused kernel.
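The fix is simply to provide a dtype for which a kernel is registered; a quick sketch (not part of the original demonstration):

tf.math.log(5.)                            # a float32 kernel exists
tf.math.log(tf.cast(5, dtype=tf.float64))  # or cast explicitly to float64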

tf.function

Let’s see the JIT in action. For this, we use the example from the slides and start modifying it.

def add_log(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq

As seen before, we can use it like Python. To make sure that we know when the actual Python is executed, we inserted a print and a tf.print; the latter is a TensorFlow operation and is therefore expected to be called every time we compute something.

add_log(4., 5.)
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=2.9957323>
add_log(42., 52.)
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=7.6889133>

As we see, both the Python and the TensorFlow operation execute. Now we can do the same with a decorator. Note that so far we passed pure Python numbers, not Tensors. Since we ran in eager mode, this did not matter so far.

@tf.function(autograph=False)
def add_log_tf(x, y):
    print('running Python')
    tf.print("running TensorFlow")
    x_sq = tf.math.log(x)
    y_sq = tf.math.log(y)
    return x_sq + y_sq
add_log_tf(1., 2.)
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=0.6931472>
add_log_tf(11., 21.)  # again with different numbers
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=5.442418>

As we see, Python is still run: this happens because 11. is not equal to 1.; TensorFlow does not convert Python numbers to Tensors and therefore retraces the function for every new value. Let’s use it in the right way, with Tensors

add_log_tf(tf.constant(1.), tf.constant(2.))  # first compilation
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=0.6931472>
add_log_tf(tf.constant(11.), tf.constant(22.))
running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=5.488938>

Now only the TensorFlow operations get executed! Everything else became static. We can illustrate this more extremely here

@tf.function(autograph=False)
def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
running Python

running TensorFlow
(<tf.Tensor: shape=(), dtype=float32, numpy=0.7154827>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.95440686>)

The first time, the numpy code was executed as well, no difference so far. However, running it a second time, only the TensorFlow parts can change

add_rnd(tf.constant(1.))
running TensorFlow
(<tf.Tensor: shape=(), dtype=float32, numpy=0.7154827>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.8802124>)
add_rnd(tf.constant(2.))
running TensorFlow
(<tf.Tensor: shape=(), dtype=float32, numpy=1.4309654>,
 <tf.Tensor: shape=(), dtype=float32, numpy=1.2755852>)

We now see clearly: TensorFlow executes the function but only cares about the TensorFlow operations; everything else is regarded as static. This can be a large pitfall! If we executed this function without the decorator, we would get a different result, since Numpy would also sample a new random variable every time.

Large functions

That said, we can build graphs that require thousands of lines of Python code to stick them together correctly. Function calls within function calls etc. are all possible.
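A tiny sketch (not from the original notebook): a decorated function can freely call other decorated or plain Python functions, and everything gets traced into one graph.

@tf.function(autograph=False)
def composed(x, y):
    # calls into the functions defined above; all of it is traced into a single graph
    return add_log_tf(x, y) + add_log(y, x)

composed(tf.constant(3.), tf.constant(4.))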

Shapes

Tensors have a shape, similar to Numpy arrays. But Tensors have two kinds of shape, a static and a dynamic one. The static shape is what can be inferred before executing the computation, while the dynamic shape is only known during the execution of the code. The latter typically arises with random variables and masking or cuts.

We can access the static shape with Tensor.shape

If the shape is known inside a graph, this will be the same. If the shape is unknown, the unknown axis will be None.

@tf.function(autograph=False)
def func_shape(x):
    print(f"static shape: {x.shape}")  # static shape
    tf.print('dynamic shape ',tf.shape(x))  # dynamic shape
    x = x[x>3.5]
    print(f"static shape cuts applied: {x.shape}")  # static shape
    tf.print('dynamic shape cuts applied',tf.shape(x))  # dynamic shape
func_shape(rnd1)
static shape: (10,)

static shape cuts applied: (None,)

dynamic shape  [10]
dynamic shape cuts applied [6]

We can access the axes by indexing

rnd3 = rnd1[None, :] * rnd1[:, None]
rnd3
<tf.Tensor: shape=(10, 10), dtype=float64, numpy=
array([[ 5.1127832 ,  1.54480156, 16.60498654, 15.47068532,  3.09233081,
        15.14326956,  8.42303998,  7.91447349, 16.78596646,  4.49477299],
       [ 1.54480156,  0.46675397,  5.01711261,  4.67438924,  0.9343321 ,
         4.57546223,  2.54497888,  2.39131809,  5.07179478,  1.3580729 ],
       [16.60498654,  5.01711261, 53.92866605, 50.24475151, 10.04308407,
        49.18139051, 27.3558373 , 25.70414596, 54.51644164, 14.59785052],
       [15.47068532,  4.67438924, 50.24475151, 46.81248841,  9.35703218,
        45.82176653, 25.48713603, 23.94827318, 50.79237563, 13.60065853],
       [ 3.09233081,  0.9343321 , 10.04308407,  9.35703218,  1.87031397,
         9.15900347,  5.0944515 ,  4.7868586 , 10.15254496,  2.71854379],
       [15.14326956,  4.57546223, 49.18139051, 45.82176653,  9.15900347,
        44.85201191, 24.94773588, 23.44144094, 49.71742491, 13.31281931],
       [ 8.42303998,  2.54497888, 27.3558373 , 25.48713603,  5.0944515 ,
        24.94773588, 13.87651298, 13.03867658, 27.65399215,  7.40490082],
       [ 7.91447349,  2.39131809, 25.70414596, 23.94827318,  4.7868586 ,
        23.44144094, 13.03867658, 12.25142708, 25.98429881,  6.95780756],
       [16.78596646,  5.07179478, 54.51644164, 50.79237563, 10.15254496,
        49.71742491, 27.65399215, 25.98429881, 55.11062347, 14.7569544 ],
       [ 4.49477299,  1.3580729 , 14.59785052, 13.60065853,  2.71854379,
        13.31281931,  7.40490082,  6.95780756, 14.7569544 ,  3.95146506]])>
tf.shape(rnd3)
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([10, 10], dtype=int32)>
rnd3.shape[1]
10

Variables

TensorFlow offers the possibility to have stateful objects inside a compiled graph (which is, e.g., not possible with Numba). The most commonly used one is the tf.Variable. Technically, Variables are automatically captured at function compilation and belong to the function.

var1 = tf.Variable(1.)
@tf.function(autograph=False)
def scale_by_var(x):
    print('running Python')
    tf.print("running TensorFlow")
    return x * var1
scale_by_var(tf.constant(1.))
running Python

running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=1.0>
scale_by_var(tf.constant(2.))
running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
var1.assign(42.)
scale_by_var(tf.constant(1.))
running TensorFlow
<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

As we see, the output changed. This is of course especially useful in the context of model-fitting libraries, be it likelihoods or neural networks.

Back to the Numpy-in-graph pitfall: let’s define add_rnd again, this time without the decorator. Now the Numpy code runs (and samples a new value) on every call.

def add_rnd(x):
    print('running Python')
    tf.print("running TensorFlow")
    rnd_np = np.random.uniform()
    rnd_tf = tf.random.uniform(shape=())
    return x * rnd_np, x * rnd_tf
add_rnd(tf.constant(1.))
running Python

running TensorFlow
(<tf.Tensor: shape=(), dtype=float32, numpy=0.39230967>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.2781849>)
add_rnd(tf.constant(2.))
running Python

running TensorFlow
(<tf.Tensor: shape=(), dtype=float32, numpy=1.9611908>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.6551211>)

This means that Numpy is fully compatible with TensorFlow in eager mode, but not when decorated.

def try_np_sqrt(x):
    return np.sqrt(x)
try_np_sqrt(tf.constant(5.))
2.236068
try_np_sqrt_tf = tf.function(try_np_sqrt, autograph=False)  # equivalent to decorator
try:
    try_np_sqrt_tf(tf.constant(5.))
except NotImplementedError as error:
    print(error)
Cannot convert a symbolic Tensor (x:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

As we see, Numpy complains in graph mode, since it cannot handle the symbolic Tensor.

Having the tf.function decorator means that we can’t use arbitrary Python dynamicity. Here is something that fails when decorated but works nicely if not:

def greater_python(x, y):
    if x > y:
        return True
    else:
        return False
greater_python(tf.constant(1.), tf.constant(2.))
False

This works in eager mode, but fails with the graph decorator.

greater_python_tf = tf.function(greater_python, autograph=False)
try:
    greater_python_tf(tf.constant(1.), tf.constant(2.))
except Exception as error:
    print(error)
using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph is disabled in this function. Try decorating it directly with @tf.function.

The error message hints at something: while this does not work now - Python does not yet know the value of the Tensors, so it can’t decide whether the condition evaluates to True or False - there is the possibility of “autograph”: it automatically converts (a subset of) Python to TensorFlow: while loops, for loops over Tensors and conditionals. However, this is usually less effective and more error-prone than using the tf.* functions explicitly. Let’s try it!

greater_python_tf_autograph = tf.function(greater_python, autograph=True)
greater_python_tf_autograph(tf.constant(1.), tf.constant(2.))
<tf.Tensor: shape=(), dtype=bool, numpy=False>

This now works seamlessly! But we can never be quite sure what autograph converts.

We can also inspect the converted code explicitly.

code = tf.autograph.to_code(greater_python)
print(code)
def tf__greater_python(x, y):
    with ag__.FunctionScope('greater_python', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
        do_return = False
        retval_ = ag__.UndefinedReturnValue()

        def get_state():
            return (do_return, retval_)

        def set_state(vars_):
            nonlocal do_return, retval_
            (do_return, retval_) = vars_

        def if_body():
            nonlocal do_return, retval_
            try:
                do_return = True
                retval_ = True
            except:
                do_return = False
                raise

        def else_body():
            nonlocal do_return, retval_
            try:
                do_return = True
                retval_ = False
            except:
                do_return = False
                raise
        ag__.if_stmt((ag__.ld(x) > ag__.ld(y)), if_body, else_body, get_state, set_state, ('do_return', 'retval_'), 2)
        return fscope.ret(retval_, do_return)

Performance

In the end, this is what matters. And a comparison would be nice. Let’s do that and see how Numpy and TensorFlow compare.

nevents = 10000000
data_tf = tf.random.uniform(shape=(nevents,), dtype=tf.float64)
data_np = np.random.uniform(size=(nevents,))
def calc_np(x):
    x_init = x
    i = 42.
    x = np.sqrt(np.abs(x_init * (i + 1.)))
    x = np.cos(x - 0.3)
    x = np.power(x, i + 1)
    x = np.sinh(x + 0.4)
    x = x ** 2
    x = x / np.mean(x)
    x = np.abs(x)
    logx = np.log(x)
    x = np.mean(logx)
    
    x1 = np.sqrt(np.abs(x_init * (i + 1.)))
    x1 = np.cos(x1 - 0.3)
    x1 = np.power(x1, i + 1)
    x1 = np.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / np.mean(x1)
    x1 = np.abs(x1)
    logx = np.log(x1)
    x1 = np.mean(logx)
    
    x2 = np.sqrt(np.abs(x_init * (i + 1.)))
    x2 = np.cos(x2 - 0.3)
    x2 = np.power(x2, i + 1)
    x2 = np.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / np.mean(x2)
    x2 = np.abs(x2)
    logx = np.log(x2)
    x2 = np.mean(logx)
    return x + x1 + x2

calc_np_numba = numba.jit(nopython=True, parallel=True)(calc_np)
def calc_tf(x):
    x_init = x
    i = 42.
    x = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x = tf.cos(x - 0.3)
    x = tf.pow(x, tf.cast(i + 1, tf.float64))
    x = tf.sinh(x + 0.4)
    x = x ** 2
    x = x / tf.reduce_mean(x)
    x = tf.abs(x)
    x = tf.reduce_mean(tf.math.log(x))
    
    x1 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x1 = tf.cos(x1 - 0.3)
    x1 = tf.pow(x1, tf.cast(i + 1, tf.float64))
    x1 = tf.sinh(x1 + 0.4)
    x1 = x1 ** 2
    x1 = x1 / tf.reduce_mean(x1)
    x1 = tf.abs(x1)
    x1 = tf.reduce_mean(tf.math.log(x1))  # same final step as in calc_np, so both versions do the same work
    
    x2 = tf.sqrt(tf.abs(x_init * (tf.cast(i, dtype=tf.float64) + 1.)))
    x2 = tf.cos(x2 - 0.3)
    x2 = tf.pow(x2, tf.cast(i + 1, tf.float64))
    x2 = tf.sinh(x2 + 0.4)
    x2 = x2 ** 2
    x2 = x2 / tf.reduce_mean(x2)
    x2 = tf.abs(x2)
    x2 = tf.reduce_mean(tf.math.log(x2))  # same final step as in calc_np, so both versions do the same work
    
    return x + x1 + x2

calc_tf_func = tf.function(calc_tf, autograph=False)
%%timeit -n1 -r1  # compile time, just for curiosity
calc_tf_func(data_tf)
959 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -n1 -r1  # compile time, just for curiosity
calc_np_numba(data_np)
5.59 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit calc_np(data_np)  # not compiled
3.74 s ± 8.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit calc_tf(data_tf)  # not compiled
4.5 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit -n1 -r7
calc_np_numba(data_np)
1.79 s ± 6.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit -n1 -r7
calc_tf_func(data_tf)
876 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We can now play around with these numbers. Depending on the size (we can go up to 10 million) and the parallelizability of the problem, the numbers differ.

In general:

  • Numpy is faster for small numbers

  • TensorFlow is faster for larger arrays and well-parallelizable computations. Due to the larger dispatch overhead in eager mode, it is significantly slower for very small (1-10 element) sample sizes.

=> there is no free lunch

Note: this has not run on a GPU; with TensorFlow, that would happen automatically if one were available.
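As a small aside (not part of the benchmark), we can check whether TensorFlow sees a GPU:

tf.config.list_physical_devices('GPU')  # an empty list means everything above ran on the CPU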

def calc_tf2(x, n):
    sum_init = tf.zeros_like(x)
    for i in range(1, n + 1):
        x = tf.sqrt(tf.abs(x * (tf.cast(i, dtype=tf.float64) + 1.)))
        x = tf.cos(x - 0.3)
        x = tf.pow(x, tf.cast(i + 1, tf.float64))
        x = tf.sinh(x + 0.4)
        x = x ** 2
        x = x / tf.reduce_mean(x, axis=None)
        x = tf.abs(x)
        x = x - tf.reduce_mean(tf.math.log(x, name="Jonas_log"), name="Jonas_mean")  # name for ops, see later ;)
        sum_init += x
    return sum_init

calc_tf_func2 = tf.function(calc_tf2, autograph=False)

@numba.njit(parallel=True)  # njit is equal to jit(nopython=True), meaning "compile everything or raise error"
def calc_numba2(x, n):
    sum_init = np.zeros_like(x)
    for i in range(1, n + 1):
        x = np.sqrt(np.abs(x * (i + 1.)))
        x = np.cos(x - 0.3)
        x = np.power(x, i + 1)
        x = np.sinh(x + 0.4)
        x = x ** 2
        x = x / np.mean(x)
        x = np.abs(x)
        x = x - np.mean(np.log(x))
        sum_init += x
    return sum_init
%%timeit -n1 -r1  #compile
calc_numba2(rnd1_big.numpy(), 1)
1.54 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

calc_numba2(rnd1_big.numpy(), 1)
array([0.82991846, 2.45507623, 1.49106249, ..., 1.58784451, 0.45862357,
       2.36444783])
%%timeit -n1 -r1  #compile
calc_tf_func2(rnd1_big, 1)
33.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

calc_tf_func2(rnd1_big, 1)
<tf.Tensor: shape=(100000,), dtype=float64, numpy=
array([0.82991846, 2.45507623, 1.49106249, ..., 1.58784451, 0.45862357,
       2.36444783])>
%%timeit
calc_numba2(rnd1_big.numpy(), 1)
6.56 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
calc_tf_func2(rnd1_big, 1)
3.89 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

calc_tf_func2(rnd1_big, 10)
<tf.Tensor: shape=(100000,), dtype=float64, numpy=
array([10.13421598, 16.0298037 , 12.03320054, ..., 13.13887434,
       12.0921866 , 15.94627606])>
calc_numba2(rnd1_big.numpy(), 10)
array([10.13421598, 16.0298037 , 12.03320054, ..., 13.13887434,
       12.0921866 , 15.94627606])
%%timeit
calc_numba2(rnd1_big.numpy(), 10)
63.6 ms ± 615 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
calc_tf_func2(rnd1_big, 10)
74.4 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Control flow

Since Python control flow cannot depend on Tensor values inside a graph (as we saw above), TensorFlow has its own functions for that, mainly:

  • while_loop(): a while loop taking a body and condition function

  • cond(): if-like

  • case and switch_case: if/elif statements

  • tf.where: element-wise selection (see the sketch after the cond example below)

def true_fn():
    return 1.

def false_fn():
    return 0.

value = tf.cond(tf.greater(111., 42.), true_fn=true_fn, false_fn=false_fn)
value
1.0
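The list above also mentions tf.where; a minimal sketch (not in the original) of this element-wise selection:

a = tf.constant([1., 2., 3., 4.])
b = tf.constant([10., 20., 30., 40.])
tf.where(a > 2.5, a, b)  # picks from a where the condition holds, otherwise from b -> [10., 20., 3., 4.]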

While loops

We can create while loops for repetitive tasks

def cond(x, y):
    return x > y

def body(x, y):
    return x / 2, y + 1

x, y = tf.while_loop(cond=cond,
                     body=body,
                     loop_vars=[100., 1.])
x, y
(<tf.Tensor: shape=(), dtype=float32, numpy=3.125>,
 <tf.Tensor: shape=(), dtype=float32, numpy=6.0>)

map a function

We can also map a function over each element. While this is not very efficient, it allows for high flexibility.

%%timeit -n1 -r1
tf.map_fn(tf.math.sin, rnd1_big)
25.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -n1 -r1
tf.vectorized_map(tf.math.sin, rnd1_big)  # can speedup things sometimes
54.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

@tf.function(autograph=False)
def do_map(func, tensor):
    return tf.map_fn(func, tensor)

do_map(tf.math.sin, rnd1_big)
<tf.Tensor: shape=(100000,), dtype=float64, numpy=
array([ 0.08780093, -0.61491212,  0.99557477, ...,  0.99827938,
        0.95044042, -0.79466407])>
@tf.function(autograph=False)
def do_map_vec(func, tensor):
    return tf.vectorized_map(func, tensor)

do_map_vec(tf.math.sin, rnd1_big)
<tf.Tensor: shape=(100000,), dtype=float64, numpy=
array([ 0.08780093, -0.61491212,  0.99557477, ...,  0.99827938,
        0.95044042, -0.79466407])>
%%timeit
do_map(tf.math.sin, rnd1_big)
1.27 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
do_map_vec(tf.math.sin, rnd1_big)
1.76 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
tf.math.sin(rnd1_big)
1.44 ms ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As we can see, the generic mapping is surely not optimal. However, it works “always”. vectorized_map, on the other hand, gives a huge speedup and performs nearly as well as the native function! While this works nicely for this case, its applications are limited and depend heavily on the use-case; more complicated examples can easily result in longer runtimes and huge memory consumption. Caution is therefore advised when using this function.

Gradients

TensorFlow allows us to calculate gradients automatically.

var2 = tf.Variable(2.)
with tf.GradientTape() as tape:
    tape.watch(var2)  # actually watches all variables already by default
    y = var2 ** 3
y
<tf.Tensor: shape=(), dtype=float32, numpy=8.0>
grad = tape.gradient(y, var2)
grad
<tf.Tensor: shape=(), dtype=float32, numpy=12.0>

Tasks to try out

Following are a few ideas to check whether you understand things

  • can you get the second derivative? play around ;) (a hint is sketched below)

  • write a shape that you know as a function

  • create the MC integral of it over a certain range

This allows us to do many things with gradients and, e.g., solve differential equations.
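As a hint for the first task, here is a minimal sketch (reusing var2 from above): nested GradientTapes give higher-order derivatives.

with tf.GradientTape() as tape_outer:
    with tf.GradientTape() as tape_inner:
        y = var2 ** 3
    dy_dx = tape_inner.gradient(y, var2)  # 3 * var2 ** 2 = 12
d2y_dx2 = tape_outer.gradient(dy_dx, var2)  # 6 * var2 = 12
d2y_dx2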

Bonus: the graph

We talked about the graph back and forth, but where is it?

The graph can be retrieved from a function that has already been traced.

concrete_func = calc_tf_func2.get_concrete_function(rnd1, 2)
concrete_func
<ConcreteFunction calc_tf2(x, n=2) at 0x7F994EDE7640>
graph = concrete_func.graph
graph
<tensorflow.python.framework.func_graph.FuncGraph at 0x7f994df799a0>
ops = graph.get_operations()
ops
[<tf.Operation 'x' type=Placeholder>,
 <tf.Operation 'zeros_like' type=Const>,
 <tf.Operation 'Cast/x' type=Const>,
 <tf.Operation 'Cast' type=Cast>,
 <tf.Operation 'add/y' type=Const>,
 <tf.Operation 'add' type=AddV2>,
 <tf.Operation 'mul' type=Mul>,
 <tf.Operation 'Abs' type=Abs>,
 <tf.Operation 'Sqrt' type=Sqrt>,
 <tf.Operation 'sub/y' type=Const>,
 <tf.Operation 'sub' type=Sub>,
 <tf.Operation 'Cos' type=Cos>,
 <tf.Operation 'Cast_1/x' type=Const>,
 <tf.Operation 'Cast_1' type=Cast>,
 <tf.Operation 'Pow' type=Pow>,
 <tf.Operation 'add_1/y' type=Const>,
 <tf.Operation 'add_1' type=AddV2>,
 <tf.Operation 'Sinh' type=Sinh>,
 <tf.Operation 'pow_1/y' type=Const>,
 <tf.Operation 'pow_1' type=Pow>,
 <tf.Operation 'Const' type=Const>,
 <tf.Operation 'Mean' type=Mean>,
 <tf.Operation 'truediv' type=RealDiv>,
 <tf.Operation 'Abs_1' type=Abs>,
 <tf.Operation 'Jonas_log' type=Log>,
 <tf.Operation 'Const_1' type=Const>,
 <tf.Operation 'Jonas_mean' type=Mean>,
 <tf.Operation 'sub_1' type=Sub>,
 <tf.Operation 'add_2' type=AddV2>,
 <tf.Operation 'Cast_2/x' type=Const>,
 <tf.Operation 'Cast_2' type=Cast>,
 <tf.Operation 'add_3/y' type=Const>,
 <tf.Operation 'add_3' type=AddV2>,
 <tf.Operation 'mul_1' type=Mul>,
 <tf.Operation 'Abs_2' type=Abs>,
 <tf.Operation 'Sqrt_1' type=Sqrt>,
 <tf.Operation 'sub_2/y' type=Const>,
 <tf.Operation 'sub_2' type=Sub>,
 <tf.Operation 'Cos_1' type=Cos>,
 <tf.Operation 'Cast_3/x' type=Const>,
 <tf.Operation 'Cast_3' type=Cast>,
 <tf.Operation 'Pow_2' type=Pow>,
 <tf.Operation 'add_4/y' type=Const>,
 <tf.Operation 'add_4' type=AddV2>,
 <tf.Operation 'Sinh_1' type=Sinh>,
 <tf.Operation 'pow_3/y' type=Const>,
 <tf.Operation 'pow_3' type=Pow>,
 <tf.Operation 'Const_2' type=Const>,
 <tf.Operation 'Mean_1' type=Mean>,
 <tf.Operation 'truediv_1' type=RealDiv>,
 <tf.Operation 'Abs_3' type=Abs>,
 <tf.Operation 'Jonas_log_1' type=Log>,
 <tf.Operation 'Const_3' type=Const>,
 <tf.Operation 'Jonas_mean_1' type=Mean>,
 <tf.Operation 'sub_3' type=Sub>,
 <tf.Operation 'add_5' type=AddV2>,
 <tf.Operation 'Identity' type=Identity>]
log_op = ops[-6]
log_op
<tf.Operation 'Jonas_log_1' type=Log>
log_op.outputs
[<tf.Tensor 'Jonas_log_1:0' shape=(10,) dtype=float64>]
op_inputs_mean = ops[-4].inputs
op_inputs_mean
(<tf.Tensor 'Jonas_log_1:0' shape=(10,) dtype=float64>,
 <tf.Tensor 'Const_3:0' shape=(1,) dtype=int32>)
log_op.outputs[0] is op_inputs_mean[0]
True

The output of the log operation is the input to the mean operation! We can just walk along the graph here. TensorFlow graphs are no magic; they are simple objects that store their inputs, their outputs and their operation. That’s it!