Symbolic amplitude models#

Amplitude analysis is a method that is used intensively in particle and hadron physics experiments. It allows us to describe the intensity distributions obtained from these experiments with amplitude models, from which we can extract parameters about the intermediate states appearing in the scattering process, which is governed by the electroweak force and the strong force.

The complicated nature of the strong force, described by Quantum Chromodynamics, makes it difficult to derive intensity models from first principles. Instead, we have to rely on approximations under specific assumptions about the scattering process that we study. Each amplitude model that we formulate is therefore almost always merely an approximation of the true scattering process. As a consequence, we continually have to reassess our analysis results and try alternative models. In addition, amplitude models can be extremely complicated, with large, complex-valued parametrizations and dozens of input parameters. We therefore want to evaluate these models with as much information as possible: large input data samples and ‘fits’ using the full likelihood function, which provides a multidimensional description of the data through event-based, unbinned fit methods.
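In one common form (a sketch, not tied to any specific experiment), such an unbinned fit minimizes the negative log-likelihood

\[-\log L(\theta) = -\sum_{i=1}^{N} \log\frac{I(x_i; \theta)}{\int I(x; \theta)\,\mathrm{d}x},\]

where \(I\) is the intensity model, \(\theta\) its parameters, and \(x_i\) the kinematic variables of event \(i\). Minimizing this sum requires evaluating the model over every event in the sample, which is why performance matters so much.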

Given these challenges, we can identify three major requirements that amplitude analysis software should satisfy:

Performance

We want to evaluate likelihood functions as fast as possible over large data samples, so that we can optimize our model parameters by testing several hypotheses in a reasonable amount of time.

Flexibility

We want to quickly formulate a wide range of amplitude models, given the latest theoretical and experimental insights.

Transparency

It should be easy to inspect the implemented amplitude models, ideally by using mathematical formulas, so that the analysis can easily be reproduced or compared to results from other experiments, tools, or theoretical models.


Performance#

Array-oriented programming#

Even though Python is a popular programming language for data science, it is too slow on its own for computations over large data samples. Computations in Python programs are therefore almost always outsourced to third-party libraries that are written in C++ or other compiled languages. This leads to an array-oriented programming style: variables represent multidimensional arrays and the computational backend performs the element-wise operations behind the scenes. An additional benefit is that the higher-level Python code becomes more readable.

In the following example, we have two data samples \(a\) and \(b\), each containing a million data points, and we want to compute \(c_i=a_i+b_i^2\) for each data point \(i\). For simplicity, we set both \(a\) and \(b\) to [0, 1, 2, ..., 999_999].

a_lst = list(range(1_000_000))
b_lst = list(range(1_000_000))

Pure Python loop#

Naively, one could compute \(c\) for each data point by creating a list and filling it with \(c_i = a_i+b_i^2\).

%%timeit
c_lst = []
for a_i, b_i in zip(a_lst, b_lst):
    c_lst.append(a_i + b_i**2)
302 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for loops like these are a natural choice when coming from compiled languages like C++, but they are considerably slower in Python.

Equivalent computation with arrays#

NumPy is one of the most popular array-oriented libraries for Python. The data points for \(a\) and \(b\) are now represented by array objects…

import numpy as np

a = np.array(a_lst)
b = np.array(b_lst)

…and the array-oriented computation of \(c = a+b^2\) becomes much faster and more readable.

%%timeit
c = a + b**2
1.03 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Accelerated computational libraries#

The 2010s saw the release of a number of Python packages that provide highly optimized numerical computational backends and that gained popularity within the data science and Machine Learning communities. Examples are Numba (2012), TensorFlow (2015), PyTorch (2016), and JAX (2018). Just like NumPy, the core of these packages is written in highly performant languages like C++, but they apply several smart techniques to make the computations even faster. The main techniques that these backends apply are:

  • Just-In-Time Compilation (JIT): Python code is compiled only when it is actually run. JIT not only offers performance in a dynamic workflow, but also allows the compiler to optimize the code at runtime based on the actual data input.

  • Hardware acceleration: JIT compilation is performed through an intermediate, device-agnostic layer of code (particularly XLA), which allows the user to run their code not only on regular CPUs, but also on different types of hardware accelerators, like GPUs and TPUs.

  • Parallelization: array-oriented computations can automatically be parallelized over multiple CPU cores (multithreading) or multiple CPU, GPU, or TPU devices (multiprocessing).

  • Automatic Differentiation: many of these libraries can automatically compute derivatives, which is useful for gradient-based optimization algorithms. While this functionality was designed with Machine Learning models in mind, it can be used to compute exact gradients over mathematical models, as the sketch after this list shows.
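As a small illustration of that last point, here is a sketch with JAX (which also appears below) that turns an array-oriented function into its exact derivative:

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x + x**2)

grad_f = jax.grad(f)  # a new function that evaluates df/dx exactly
grad_f(jnp.array([1.0, 2.0]))  # -> [3., 5.], since d/dx (x + x**2) = 1 + 2x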

These techniques are usually directly available with minor changes to existing array-oriented code. In most cases, it is just a matter of decorating the array-oriented function with a JIT-compile decorator and, where needed, replacing calls to vectorized functions (such as summing up a column in a two-dimensional array) with their accelerated equivalents.

# NumPy (reference implementation, no JIT)
import numpy as np

def my_func(a, b):
    return np.sum(a + b**2, axis=0)

# Numba (JIT compilation for the CPU)
import numba as nb
import numpy as np

@nb.jit(nopython=True)
def my_func(a, b):
    return np.sum(a + b**2, axis=0)

# JAX (JIT compilation through XLA)
import jax
import jax.numpy as jnp

@jax.jit
def my_func(a, b):
    return jnp.sum(a + b**2, axis=0)

# TensorFlow (graph compilation with XLA)
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

@tf.function(jit_compile=True)
def my_func(a, b):
    return tnp.sum(a + b**2, axis=0)

# PyTorch (TorchScript compilation)
import torch

@torch.jit.script
def my_func(a, b):
    return torch.sum(a + b**2, dim=0)

As can be seen, the implementation of the array-oriented NumPy function remains largely unaffected by the switch to these accelerated computational libraries. The resulting function objects are JIT-compiled and parallelized for the selected device, enabling fast numerical computations over large data samples.
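For instance, taking the JAX variant from above (and reusing its imports), the first call triggers compilation for the selected device, while subsequent calls reuse the compiled kernel. The sample values here are arbitrary:

sample_a = jnp.linspace(0, 1, num=1_000_000)
sample_b = jnp.linspace(0, 1, num=1_000_000)
my_func(sample_a, sample_b)  # first call compiles, later calls reuse the kernel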


Flexibility#

Python offers us the flexibility to write concise and understandable code that can be run interactively through a terminal or in Jupyter notebooks. As we saw, array-oriented computational backends make this code suitable for high-performance, parallelized computations over large data samples. The fact that array-oriented code looks so similar across different accelerated computational libraries raises the question of whether we can find a way to directly convert the mathematical expressions that we as physicists are familiar with into these fast numerical functions. It turns out that we can do this with a Computer Algebra System.

Computer Algebra System#

Programs like Mathematica, Maple, and Matlab are popular among mathematicians, physicists, and engineers, as they allow one to simplify expressions, solve equations or integrals, investigate their behavior with plots, et cetera. At their core, these programs are Computer Algebra Systems (CAS): they represent mathematical expressions as graphs or trees and transform and modify them through algorithms that implement algebraic operations.

The most commonly used CAS in Python is SymPy. It has a major advantage over commercial CAS programs in that it is open source and can be used as a library. This allows us to integrate it into our own applications for amplitude analysis or to build up simple mathematical expressions in Jupyter notebooks, so that we can inspect them in \(\LaTeX\) form. For example, a simple Breit-Wigner function is written as:

import sympy as sp

s, m0, Γ0, g = sp.symbols("s m0 Gamma0 g")
expression = g * m0 * Γ0 / (m0**2 - s - sp.I * Γ0 * m0)
expression
\[\displaystyle \frac{\Gamma_{0} g m_{0}}{- i \Gamma_{0} m_{0} + m_{0}^{2} - s}\]

Note

SymPy orders commutative symbolic terms its own way, independently of the order in which they appear in the Python code given by the user.
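A quick illustration of this reordering:

x, y = sp.symbols("x y")
y + x  # renders as x + y: SymPy sorts the commutative terms itself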

Expression trees#

Internally, SymPy expressions are built up by applying mathematical operations to algebraic objects, such as symbols and numbers. In this example, we see how the Breit-Wigner function is built up from four symbols, a complex number, and a few integers. The resulting expression can be visualized as an expression tree of fundamental mathematical operations.

import graphviz

style = [
    (sp.Atom, {"color": "grey", "fontcolor": "grey"}),
    (sp.Symbol, {"color": "royalblue", "fontcolor": "royalblue"}),
]
src = sp.dotprint(expression, styles=style)
graphviz.Source(src)
[Figure: expression tree of the Breit-Wigner expression, with symbols in blue and other atoms in grey]
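The same tree can also be inspected programmatically, since every SymPy expression carries its top operation and child nodes as attributes (a quick sketch):

expression.func  # Mul: the mathematical operation at the top node
expression.args  # the child nodes that this Mul combines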

Algebraic substitutions#

An example of an algebraic computation is the substitution of some of the symbols. Here’s an example where we substitute the symbols \(g\), \(m_0\), and \(\Gamma_0\) with fixed values (like model parameters).

substituted_expr = expression.subs({m0: 0.980, Γ0: 0.06, g: 1})
substituted_expr
\[\displaystyle \frac{0.0588}{- s + 0.9604 - 0.0588 i}\]

With the substitutions, the expression tree shrinks. Subtrees that contained only real-valued numbers or one of the three substituted symbols are collapsed into a single number node.

src = sp.dotprint(substituted_expr.n(3), styles=style)
graphviz.Source(src)
[Figure: shrunken expression tree of the substituted Breit-Wigner expression]
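Substituting the remaining symbol \(s\) as well collapses the tree to a single complex number. A quick check with an arbitrary value:

complex(substituted_expr.subs(s, 1.0))  # roughly (-0.463+0.688j)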

Code generation#

Expression trees are not only useful for applying algebraic operations to their nodes. They can also be used as a template for generating code. In fact, the \(\LaTeX\) formula is generated using SymPy’s \(\LaTeX\) printer:

from IPython.display import Markdown

src = sp.latex(expression)
Markdown(f"```latex\n{src}\n```")
\frac{\Gamma_{0} g m_{0}}{- i \Gamma_{0} m_{0} + m_{0}^{2} - s}

SymPy provides a large number of code printers for different languages and human-readable serialization standards. A few examples are shown below.

from IPython.display import Markdown
from sympy.printing.mathml import MathMLPresentationPrinter


def to_mathml(expr: sp.Expr) -> str:
    printer = MathMLPresentationPrinter()
    xml = printer._print(expr)
    return xml.toprettyxml().replace("\t", "  ")


Markdown(
    f"""
```python
# Python
{sp.pycode(expression)}
```
```cpp
// C++
{sp.cxxcode(expression, standard="c++17")}
```
```fortran
! Fortran
{sp.fcode(expression).strip()}
```
```julia
# Julia
{sp.julia_code(expression)}
```
```rust
// Rust
{sp.rust_code(expression)} 
```
```xml
<!-- MathML -->
{to_mathml(expression)}
```
"""
)
# Python
Gamma0*g*m0/(-1j*Gamma0*m0 + m0**2 - s)
// C++
Gamma0*g*m0/(-I*Gamma0*m0 + std::pow(m0, 2) - s)
! Fortran
Gamma0*g*m0/(-cmplx(0,1)*Gamma0*m0 + m0**2 - s)
# Julia
Gamma0 .* g .* m0 ./ (-im * Gamma0 .* m0 + m0 .^ 2 - s)
// Rust
Gamma0*g*m0/(-I*Gamma0*m0 + m0.powi(2) - s) 
<!-- MathML -->
<mrow>
  <mfrac>
    <mrow>
      <msub>
        <mi>Γ</mi>
        <mi>0</mi>
      </msub>
      <mo>&InvisibleTimes;</mo>
      <mi>g</mi>
      <mo>&InvisibleTimes;</mo>
      <msub>
        <mi>m</mi>
        <mi>0</mi>
      </msub>
    </mrow>
    <mrow>
      <mrow>
        <mo>-</mo>
        <mi>&ImaginaryI;</mi>
        <mo>&InvisibleTimes;</mo>
        <msub>
          <mi>Γ</mi>
          <mi>0</mi>
        </msub>
        <mo>&InvisibleTimes;</mo>
        <msub>
          <mi>m</mi>
          <mi>0</mi>
        </msub>
      </mrow>
      <mo>+</mo>
      <msup>
        <msub>
          <mi>m</mi>
          <mi>0</mi>
        </msub>
        <mn>2</mn>
      </msup>
      <mo>-</mo>
      <mi>s</mi>
    </mrow>
  </mfrac>
</mrow>

Since SymPy is a Python library, the code generation process can be completely customized. This allows us to generate code for languages that are not yet implemented or to modify the behavior of existing code printers, which makes it possible to generate array-oriented Python code for several computational libraries. For the Breit-Wigner example, the generated NumPy function looks like the following.

full_func = sp.lambdify(args=(s, m0, Γ0, g), expr=expression, modules="numpy")
def _lambdifygenerated(s, m0, Gamma0, g):
    return Gamma0 * g * m0 / (-1j * Gamma0 * m0 + m0**2 - s)

substituted_func = sp.lambdify(args=s, expr=substituted_expr, modules="numpy")
def _lambdifygenerated(s):
    return 0.0588 / (-s + 0.9604 - 0.0588 * 1j)
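The generated functions behave like any other array-oriented NumPy function. For example (with arbitrarily chosen values), they evaluate element-wise over an array of \(s\) values:

s_values = np.linspace(0.0, 2.0, num=5)
full_func(s_values, 0.98, 0.06, 1)  # complex-valued array of the same length
substituted_func(s_values)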

The example described here is kept small for illustrative purposes. It turns out that code generation works just as well for expressions with a much larger number of mathematical operations, even on the order of hundreds of thousands of nodes. This is exactly what is needed for fitting amplitude models to data.
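As a rough sketch of how this scales (with hypothetical symbolic parameters, not part of the example above), a model that sums many Breit-Wigner terms can be built programmatically and lambdified in one call:

masses = sp.symbols("m:100")  # hypothetical resonance masses m0...m99
widths = sp.symbols("Gamma:100")  # hypothetical widths Gamma0...Gamma99
model = sum(m * w / (m**2 - s - sp.I * m * w) for m, w in zip(masses, widths))
model_func = sp.lambdify(args=(s, *masses, *widths), expr=model, modules="numpy")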

We now have a flexible and transparent way of formulating amplitude models that can be easily modified with algebraic operations. The models can immediately be inspected as mathematical expressions and can be used as a template for generating array-oriented numerical functions for efficient computations over large data samples. In addition, any algebraic operations that simplify the expression tree map directly onto the generated array-oriented code, which can result in better numerical performance of the generated code.
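A toy example (unrelated to the Breit-Wigner above) of how such a simplification propagates into the generated code:

x = sp.Symbol("x")
expr = sp.sin(x) ** 2 + sp.cos(x) ** 2
naive_func = sp.lambdify(x, expr)  # generated code computes sin(x)**2 + cos(x)**2
simple_func = sp.lambdify(x, sp.simplify(expr))  # generated code just returns 1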


Transparency#

We have seen how a Computer Algebra System that generates array-oriented code allows us to formulate performant and flexible amplitude models. Physicists can now focus on implementing theory in a central place, while the computations are outsourced to optimized libraries. In themselves, these are ingredients that make it much easier to write analysis code, but the set-up offers major indirect benefits to the wider amplitude analysis community as well.

Self-documenting workflow#

The combination of a CAS, performant code generation, and a dynamically typed language like Python is ideal when working with Jupyter notebooks. Amplitude models can be built up interactively, with the option to directly inspect their implementation as mathematical formulas or plot their behavior with visualization libraries. Fast numerical performance is near at hand through automatic code generation, which bridges the gap between formulating amplitude models and performing fits.

This ties in perfectly with recent trends in data science and modern publishing tools. In recent years, there have been several initiatives that render Jupyter notebooks as publication-ready documents (websites, PDF files, etc.). Community-wide examples are the Executable Book Project, Curvenote, and Quarto, while CERN has launched platforms like SWAN for running notebooks with direct access to CERN computing resources and REANA for making analyses more reproducible and scalable.

Given these trends, writing an amplitude analysis with symbolics is not only intuitive to the physicists who write it, but also results in a self-documenting workflow that naturally evolves towards publication-ready materials. An example is the polarimetry analysis by LHCb, where the entire implemented amplitude model is directly rendered from the codebase as mathematical formulas, and analysis results from high-performance computations are directly available through code generation.

Model preservation#

Amplitude analyses are notoriously hard to reproduce. There are often several ways to formulate the same model, the model complexity makes the source code of the analysis model hard to understand, and starting conditions in a fit can lead to completely different fit results. Formulating the amplitude models symbolically addresses exactly these difficulties.

First of all, the self-documenting workflow with symbolic expressions removes the need for readers to dig through the underlying code, as the formulas are directly visible to the reader. Mathematics is the language we all speak. This makes it easy for others to reimplement models in their own framework of choice, now or in the future, for example in new programming languages.

Second, a symbolic amplitude model can be serialized to human-readable format with code generation. Just as with the generation of numerical functions, the model’s expression tree can be used to generate nodes in a serialization format like YAML or JSON. Other analysis frameworks can then import the model for cross-checks or for adapting the analysis to other experiments.
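A minimal sketch of such a round-trip, using SymPy’s own string representation instead of a dedicated serialization format:

serialized = sp.srepr(expression)  # "Mul(Symbol('Gamma0'), Symbol('g'), ...)"
recovered = sp.sympify(serialized)
assert recovered == expression  # the expression tree survives the round-trip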

On a technical note, the Python ecosystem in combination with Jupyter notebooks and Sphinx makes it possible for any reader to directly rerun the analysis in the browser or in a local environment. Pinned dependencies ensure that the analysis produces the same results.

Knowledge exchange#

Exciting times are coming for amplitude analysis studies. On the one hand, collider experiments are producing increasingly large data samples that provide the high statistics suitable for amplitude analysis and challenge the tools on the market. On the other hand, computational power has become so widely available that it is now possible to perform fits with more complicated models over larger data input. While this provides many opportunities, it also poses challenges to the community.

There are many physicists who have little background in amplitude analysis but need to become familiar with the theory and the techniques. For now, however, the literature is sparse, highly technical, and often specific to the experiment for which it was written. This makes the learning curve for newcomers extremely steep: it takes a long time before one can perform an actual amplitude analysis.

As amplitude models become more complex, it becomes crucial to get input from other experiments. This requires a proper matching of amplitude models, so it is essential that the models become more reproducible, extendable, and portable. All of this makes it paramount for the amplitude analysis community to provide easy access to the basic principles of the theoretical model.

Symbolics can play a valuable role here as well, as it becomes much easier to share and maintain knowledge gained about amplitude models and amplitude analysis theory. Symbolic amplitude models directly show the implemented mathematics and their numerical functions can directly be used for interactive visualizations. In addition, the self-documenting workflow makes it more inviting to contribute to community documentation as it narrows the gap between theory and code.