On this blog, I typically talk about things that I have *some* understanding of. In this post, I want to try something a little different and instead cover a topic that I am utterly *clueless* about. There’s an interesting aspect about Julia, which I know embarrassingly little about at this point: Metaprogramming.

Having worked with Julia for about 1.5 years now, I have so far employed a successful strategy of occasionally taking a glimpse at that part of the Julia documentation and then deciding to go back to pretending it never existed. Meanwhile, Karandeep Singh stepped onto the Julia scene around 5 minutes ago and has already developed a package that literally oozes macro: Tidier.jl. The package API is such a joy to work with that I have felt inspired to finally take a serious look at what metaprogramming is all about.

My goal for this post is to get to the point where I can confidently write my first macro for `CounterfactualExplanations.jl` - more on this below! If you are as clueless and curious about the topic as I am, then follow along on a journey into the metaverse. Buckle in though, it’s going to be a bumpy ride! You have been warned.

You guessed it, we’ll start with the official Julia documentation on the topic:

Like Lisp, Julia represents its own code as a data structure of the language itself.

Hmmm … I know nothing about Lisp and this is already beyond me. Code as a data structure?

Let’s first try to understand what exactly metaprogramming is, outside of Julia and Lisp. We’ll take a quick detour before we’re back. Wikipedia has the following to say on the topic:

Metaprogramming is a programming technique in which computer programs have the ability to treat other programs as their data.

Right … What about ChatGPT? I wanted to try out ReplGPT anyway so here goes^{1}:

```
julia> using ReplGPT
ChatGPT> Hey hey, can you please explain metaprogramming to me (I have no computer science background, but am experienced with programming and data science)
Metaprogramming is a programming technique where a program is capable of creating or manipulating code at
runtime. It involves writing computer programs that create, modify, or analyze other computer programs or data
about those programs.
```

So, in layman’s terms, metaprogramming involves code that generates code - I guess we really have entered the metaverse!

Let’s head back to the Julia documentation (v1.8) and look at the first couple of examples.

Skipping the details here, it turns out that when I write `1 + 1` in the REPL, Julia first parses this program, given as the string `"1 + 1"`, into an expression `ex=:(1 + 1)::Expr`, which is then evaluated `eval(ex)` (Figure 1). I’ve used a `quote` here to generate the expression because I’ve used **quoting** before for use with `Documenter.jl`.

And if I understand this correctly, the expression `ex::Expr` is literally **code as a data structure**:

```
ex = :(sum([1,2,3]))
dump(ex)
```

```
Expr
  head: Symbol call
  args: Array{Any}((2,))
    1: Symbol sum
    2: Expr
      head: Symbol vect
      args: Array{Any}((3,))
        1: Int64 1
        2: Int64 2
        3: Int64 3
```

That data structure can be “manipulated from within the language”.

Let’s try that! Currently, evaluating this expression yields the `sum` of the `Array`:

`eval(ex)`

`6`

Upon manipulation (that sounds weird!), we have:

```
ex.args[1] = :maximum
eval(ex)
```

`3`

Ok ok, things are starting to make sense now!
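Swapping `ex.args[1]` by hand works for a flat call, but the same idea extends to nested expressions. Below is a minimal sketch that walks an expression tree and replaces one function name with another; `replace_symbol` is a hypothetical helper I made up for illustration, not something that ships with Julia.

```julia
# Recursively replace every occurrence of the symbol `old` with `new`.
# Anything that is neither a Symbol nor an Expr (e.g. a number) is left as-is.
replace_symbol(ex, old::Symbol, new::Symbol) = ex
replace_symbol(s::Symbol, old::Symbol, new::Symbol) = s == old ? new : s
function replace_symbol(ex::Expr, old::Symbol, new::Symbol)
    return Expr(ex.head, [replace_symbol(arg, old, new) for arg in ex.args]...)
end

ex = :(sum([1, 2, 3]) + sum([4, 5]))
ex_max = replace_symbol(ex, :sum, :maximum)

result_sum = eval(ex)      # 6 + 9
result_max = eval(ex_max)  # 3 + 5
```

The original `ex` is left untouched here because a new `Expr` is built at every level, which may be preferable to mutating `args` in place.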

Back to the Julia documentation and next on the agenda we have Interpolation. Skipping the details again, it seems like I can interpolate an expression much like strings. Using interpolation I can recreate the expression from above as follows:

```
fun = maximum
x = [1,2,3]
ex_from_ex_interpolation = :($fun($x))
a = eval(ex_from_ex_interpolation)
```

Using string interpolation is quite similar:

```
fun = maximum
x = [1,2,3]
ex_from_string_interpolation = Meta.parse("$fun($x)")
eval(ex_from_string_interpolation) == a
```

`true`
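One detail that I found worth checking: interpolating with `$` stores the *value* inside the expression, whereas leaving the `$` out only stores the *name*. A small sketch, with `ex_value` and `ex_name` being names I picked for illustration:

```julia
x = [1, 2, 3]

ex_value = :(maximum($x))  # `$x` splices the array itself into the expression
ex_name = :(maximum(x))    # without `$`, only the name `x` is stored

is_array = ex_value.args[2] isa Vector{Int}
is_symbol = ex_name.args[2] isa Symbol

# Since the data travels with the expression, evaluating it does not
# depend on any variable called `x`:
result = eval(ex_value)
```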

And much like with function arguments, we can also use splatting:

`eval(:(zeros($x...)))`

```
1×2×3 Array{Float64, 3}:
[:, :, 1] =
0.0 0.0
[:, :, 2] =
0.0 0.0
[:, :, 3] =
0.0 0.0
```

Next up, we have **nested quotes**. I can’t see myself using these anytime soon, but anyway it seems that for each `$` sign that we prepend to `x`, an evaluation is triggered:

`eval(quote quote $$:(x) end end)`

```
quote
#= In[34]:2 =#
[1, 2, 3]
end
```

Moving on, we have `QuoteNodes`, which I will steer clear of because I probably won’t be doing any super advanced metaprogramming anytime soon. The next two sections on evaluating expressions and functions on `Expr`essions also look somewhat more involved than what I need right now, but I expect I’ll find myself back here when I write that first macro for `CounterfactualExplanations.jl`.

Ahhh, I see we’ve finally arrived in Macroland!

A macro maps a tuple of arguments to a returned expression, and the resulting expression is compiled directly rather than requiring a runtime eval call.

Let’s see if we can make sense of this as we move on. The `Hello, world!` example makes the concept quite clear:

```
macro sayhello(name)
    return :( println("Hello, ", $name) ) # return the expression ...
end
@sayhello "reader" # ... to be immediately compiled.
```

`Hello, reader`

It seems that a macro is a way to build and return an expression inside a block (a bit like a function), but the returned expression is spliced into the program and compiled right where the macro is called, without a runtime `eval`. In other words, we can use macros to write code that generates code that is then evaluated.
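To peek at what a macro returns before anything is run, `@macroexpand` comes in handy. The sketch below redefines the `@sayhello` macro so that it is self-contained; `expanded` is just an illustrative variable name.

```julia
macro sayhello(name)
    return :(println("Hello, ", $name))
end

# `@macroexpand` returns the expression that the macro call is replaced with,
# without evaluating it (so nothing is printed here):
expanded = @macroexpand @sayhello "reader"

is_expr = expanded isa Expr        # the macro produced code as a data structure
call_head = expanded.head          # :call
call_args = expanded.args[2:3]     # the two arguments passed on to println
```

Inspecting the expansion like this is a cheap way to debug a macro, since the generated code can be checked without triggering any side effects.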

To fully grasp the next part, I should not have skipped the part on functions on `Expr`essions. We’ll leave that little nugget for Future Me. That same guy will also have to suffer the consequences of Present Me merely skimming the details in the next section on macro invocation. Present Me is impatient and overly confident in the level of knowledge that we have just acquired about metaprogramming.

Let’s get ahead of ourselves and meet the final boss of the metaverse: an Advanced Macro.

I’ll leave it to you to thoroughly read that section in the Julia docs. Here, we’ll jump straight into building the macro I want to have for `CounterfactualExplanations.jl`. I now think it’ll be less involved than I thought: optimism in the face of uncertainty!

`CounterfactualExplanations.jl` is a package for generating Counterfactual Explanations for predictive models. This is a relatively recent approach to Explainable AI that I am (probably a little too) excited about and won’t dwell on here. For what follows, it suffices to say that generating Counterfactual Explanations can be seen as a generative modelling task because it involves generating samples in the input space. To this end, the package has previously shipped with a number of `Generators`: composite types that contain information about how counterfactuals ought to be generated.

This has allowed users to specify the type of generator they want to use by instantiating it. For example, the DiCE generator by Mothilal, Sharma, and Tan (2020) could (and still can) be instantiated as follows:

`generator = DiCEGenerator()`

This has been a straightforward way for users to use off-the-shelf counterfactual generators. But relying on separate composite types for this task may have been overkill. In fact, all this time there was some untapped potential here, as we will see next.

One of my key objectives for the package has always been composability. It turns out that many of the various counterfactual generators that have been proposed in the literature essentially do the same thing: they optimize an objective function. In Altmeyer et al. (2023), we denote that objective formally as follows,

$$
\mathbf{s}^\prime = \arg \min_{\mathbf{s}^\prime \in \mathcal{S}} \left\{ \text{yloss}(M(f(\mathbf{s}^\prime)), y^*) + \lambda \, \text{cost}(f(\mathbf{s}^\prime)) \right\} \tag{1}
$$

where $\text{yloss}$ denotes the main loss function and $\text{cost}$ is a penalty term. I won’t cover this in any more detail here, but you can read about it in the package docs. The important thing is that Equation 1 very closely describes how counterfactual search is actually implemented in the package.

In other words, all generators currently implemented share a common starting point. They largely just vary in the exact way the objective function is specified. This gives rise to an interesting idea:

Why not compose generators that combine ideas from different off-the-shelf generators?

I want to give users an easy way to do that, without having to build custom `Generator` types from scratch. This (I think) is a good use case for metaprogramming.

Let’s try and see if we can make that work. We’ll simply extend `CounterfactualExplanations` right here in this repo hosting the blog (easily done in Julia) and, provided everything works out well, create a pull request. I already have a GitHub issue for this with a linked branch, so that’s the one I’ll use in my environment:

```
(metaprogramming) pkg> add https://github.com/JuliaTrustworthyAI/CounterfactualExplanations.jl#118-add-losses-and-penalties-modules-or-group-under-objectives-module
julia> using CounterfactualExplanations
```

By the time you’re reading this, all changes to that branch will have hopefully already been committed and merged.

Let’s start by instantiating a generic generator:

`generator = GenericGenerator()`

Our goal is to create macros that build expressions that, when evaluated, mutate the `generator` instance.

`@objective`

Our first and most important macro shall define the counterfactual search objective. In particular, the `@objective` macro should accept an expression that looks much like the right-hand-side of Equation 1, which is essentially just a weighted sum.

Let’s start with that part. Naively, we could begin by writing it out very literally:

```
ex = :(yloss + λ*cost)
eval(ex)
```

Of course, evaluating this expression throws an error because none of the variables are actually defined. Let’s work on that …

For the loss and penalty functions, we will use methods available from the `CounterfactualExplanations.Objectives` module, while for λ we will use a literal:

`ex = :(logitbinarycrossentropy + 0.1 * distance_l2)`

Let’s try to make sense of the data structure we have created:

`ex.args`

```
3-element Vector{Any}:
:+
:logitbinarycrossentropy
:(0.1distance_l2)
```

My first naive approach is shown below. It errors because I forgot to interpolate the variables inside the `quote`.

```
macro objective(generator, ex)
    loss = ex.args[2]
    ex_penalty = ex.args[3]
    λ = ex_penalty.args[2]
    cost = ex_penalty.args[3]
    ex_generator = quote
        generator.loss = loss
        generator.cost = cost
        generator.λ = λ
    end
    return ex_generator
end
```

Having fixed that below, I still get an error because the `loss` and `cost` functions are not part of the global scope. I am pretty sure that this error would have occurred anyway and has nothing to do with the fact that I’m writing a macro.

```
macro objective(generator, ex)
    loss = ex.args[2]
    ex_penalty = ex.args[3]
    λ = ex_penalty.args[2]
    cost = ex_penalty.args[3]
    ex_generator = quote
        $generator.loss = $loss
        $generator.cost = $cost
        $generator.λ = $λ
    end
    return ex_generator
end
```

Instead of importing the functions, I just get them explicitly from the `Objectives` module,

```
macro objective(generator, ex)
    loss = getfield(CounterfactualExplanations.Objectives, ex.args[2])
    ex_penalty = ex.args[3]
    λ = ex_penalty.args[2]
    cost = getfield(CounterfactualExplanations.Objectives, ex_penalty.args[3])
    ex_generator = quote
        $generator.loss = $loss
        $generator.penalty = $cost
        $generator.λ = $λ
        $generator # return whichever generator was passed in
    end
    return ex_generator
end
```

and, finally, this works:

`@objective(generator, logitbinarycrossentropy + 0.1distance_l2)`

`Generator(Flux.Losses.logitbinarycrossentropy, CounterfactualExplanations.Objectives.distance_l2, 0.1, false, Flux.Optimise.Descent(0.1))`

But what about adding multiple penalties? The DiCE generator, for example, also takes into account how diverse the counterfactual explanations are (Mothilal, Sharma, and Tan 2020). The corresponding penalty is called `ddp_diversity`. Let’s start with the expression again:

```
ex = :(logitbinarycrossentropy + 0.1distance_l2 + 1.0ddp_diversity)
ex.args
```

```
4-element Vector{Any}:
:+
:logitbinarycrossentropy
:(0.1distance_l2)
:(1.0ddp_diversity)
```

This time there’s a second nested `Expr`ession among the arguments: `:(1.0ddp_diversity)`.
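Before touching the macro again, it may help to isolate the parsing step as a plain function that can be tested on quoted expressions. This is only a sketch: `split_objective` is a hypothetical helper (not part of `CounterfactualExplanations.jl`) and it assumes that every term after the loss has the form `λ * penalty`.

```julia
# Split an objective like `loss + 0.1pen1 + 1.0pen2` into its components.
function split_objective(ex::Expr)
    (ex.head == :call && ex.args[1] == :(+)) ||
        throw(ArgumentError("expected a sum of terms"))
    loss = ex.args[2]
    weights = Float64[]
    penalties = Symbol[]
    for term in ex.args[3:end]
        # each penalty term must look like `λ * penalty`
        (term isa Expr && term.head == :call && term.args[1] == :(*)) ||
            throw(ArgumentError("expected terms of the form λ * penalty"))
        push!(weights, term.args[2])
        push!(penalties, term.args[3])
    end
    return loss, weights, penalties
end

loss, weights, penalties =
    split_objective(:(logitbinarycrossentropy + 0.1distance_l2 + 1.0ddp_diversity))
```

Since the function only looks at symbols and literals, none of the loss or penalty functions need to be defined for it to run.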

```
macro objective(generator, ex)
    loss = getfield(CounterfactualExplanations.Objectives, ex.args[2])
    Λ = Vector{AbstractFloat}()
    costs = Vector{Function}()
    for i in 3:length(ex.args)
        ex_penalty = ex.args[i]
        λ = ex_penalty.args[2]
        push!(Λ, λ)
        cost = getfield(CounterfactualExplanations.Objectives, ex_penalty.args[3])
        push!(costs, cost)
    end
    ex_generator = quote
        $generator.loss = $loss
        $generator.penalty = $costs
        $generator.λ = $Λ
        $generator # return whichever generator was passed in
    end
    return ex_generator
end
```

That works well,

`@objective(generator, logitbinarycrossentropy + 0.05distance_l2 + 1.0ddp_diversity)`

`Generator(Flux.Losses.logitbinarycrossentropy, Function[CounterfactualExplanations.Objectives.distance_l2, CounterfactualExplanations.Objectives.ddp_diversity], AbstractFloat[0.05, 1.0], false, Flux.Optimise.Descent(0.1))`

but we should still make sure that this `generator` is also compatible with our package. Below we go through some of the typical workflows associated with Counterfactual Explanations. Firstly, we load some synthetic data and fit a black-box model to it.

```
n_dim = 2
n_classes = 4
n_samples = 400
model_name = :MLP
counterfactual_data = CounterfactualExplanations.load_blobs(n_samples; k=n_dim, centers=n_classes)
M = fit_model(counterfactual_data, model_name)
plot(M, counterfactual_data)
```

Next, we begin by specifying our target and factual label. We then draw a random sample from the non-target (factual) class.

```
# Factual and target:
target = 2
factual = 4
chosen = rand(findall(predict_label(M, counterfactual_data) .== factual))
x = select_factual(counterfactual_data,chosen)
```

Finally, we use our `generator` to generate counterfactuals:

```
ce = generate_counterfactual(
x, target, counterfactual_data, M, generator;
num_counterfactuals = 5,
converge_when = :generator_conditions
)
```

It worked! 🎉 The resulting counterfactual search is illustrated in Figure 2. I may have overspecified the size of the `ddp_diversity` penalty a little bit here, but it sure makes for a cool chart!
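For readers who want to play with the idea without installing the package, here is a toy version of the whole pattern. Everything in it is a hypothetical stand-in that I made up for illustration (`ToyGenerator`, `ToyObjectives`, `@toy_objective`, `objective`), and the toy losses have nothing to do with the real ones. The generator argument is wrapped in `esc`, which relates to the hygiene issue discussed further down.

```julia
# Toy stand-ins for the package's generator type and Objectives module:
mutable struct ToyGenerator
    loss::Union{Nothing,Function}
    penalty::Vector{Function}
    λ::Vector{Float64}
end
ToyGenerator() = ToyGenerator(nothing, Function[], Float64[])

module ToyObjectives
toy_loss(x) = sum(abs2, x)
toy_l1(x) = sum(abs, x)
toy_l2(x) = sqrt(sum(abs2, x))
end

macro toy_objective(generator, ex)
    # look up the loss and penalties by name at macro-expansion time
    loss = getfield(ToyObjectives, ex.args[2])
    λs = Float64[]
    penalties = Function[]
    for term in ex.args[3:end]
        push!(λs, term.args[2])
        push!(penalties, getfield(ToyObjectives, term.args[3]))
    end
    gen = esc(generator)  # refer to the caller's variable
    return quote
        $(gen).loss = $loss
        $(gen).penalty = $penalties
        $(gen).λ = $λs
        $(gen)
    end
end

# The weighted sum that the generator now describes:
objective(g, x) = g.loss(x) + sum(λ * pen(x) for (λ, pen) in zip(g.λ, g.penalty))

generator = ToyGenerator()
@toy_objective(generator, toy_loss + 0.1toy_l1 + 1.0toy_l2)
value = objective(generator, [3.0, 4.0])  # 25 + 0.1 * 7 + 1.0 * 5
```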

Time for me to add this all to `CounterfactualExplanations.jl` … ⏳

… aaand I’m back. There was one thing I had ignored that ended up causing a minor complication: macro hygiene.

Again, I’ll leave it to you to read up on the details, but the bottom line is that when writing macros, we need to keep variable scopes in mind. `CounterfactualExplanations.jl` is composed of various (sub)modules, and when I initially added the macro to the `CounterfactualExplanations.Generators` module, it errored.

The problem was (I believe) that the `generator` variable existed in the global scope (`Main`) but it was not accessible for the `@objective` macro that at runtime lives in `Main.Generators`. Fortunately, it is easy to make the variable accessible by wrapping it inside an `esc()` call:

This escaping mechanism can be used to “violate” hygiene when necessary, in order to introduce or manipulate user variables.

This may not be the ideal way to do this, and as always, if you have any suggestions I’d be happy to hear about them.
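To convince myself of what `esc` does, I find a stripped-down reproduction helpful. The sketch below uses a hypothetical module `Toy` as a stand-in for the submodule that hosts the macro; both macro names are made up for illustration.

```julia
module Toy  # hypothetical submodule hosting the macros

# Hygienic version: the variable name is looked up inside `Toy`.
macro double_unescaped(x)
    return :(2 * $x)
end

# Escaped version: the variable name is looked up where the macro is called.
macro double(x)
    return :(2 * $(esc(x)))
end

end # module

y = 3
escaped_result = Toy.@double y

# There is no `Toy.y`, so the unescaped version cannot find the variable:
unescaped_fails = try
    Toy.@double_unescaped y
    false
catch err
    err isa UndefVarError
end
```

As far as I can tell, this mirrors the situation above: the user's variable lives in the calling module, so the macro has to opt out of hygiene for that one argument.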

If you want to find out more about how macros can now be used to easily compose counterfactual generators, check out the new section in the package documentation.

In this blog post, I’ve done something I usually try to avoid: talk about things I don’t know. Metaprogramming is an exciting topic and if you’re still here, you just got to experience it through the lens of an absolute novice. During our leap of faith into Julia’s metaverse we’ve learned the following things:

- Code in Julia is internally represented as a mutable data structure.
- Macros are a way to take such data structures and transform them before they get evaluated at runtime.
- An important thing to keep in mind when writing macros is variable scopes.

Throughout this post, I have skipped various important details that (I think) were not immediately relevant to the goal I had in mind: adding my first macro to `CounterfactualExplanations.jl`. In the future, I may write about this topic again and cover some of these missing details (hopefully with a bit more insight at that point!).

Altmeyer, Patrick, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen, and Cynthia Liem. 2023. “Endogenous Macrodynamics in Algorithmic Recourse.” In *First IEEE Conference on Secure and Trustworthy Machine Learning*.

Mothilal, Ramaravind K, Amit Sharma, and Chenhao Tan. 2020. “Explaining Machine Learning Classifiers Through Diverse Counterfactual Explanations.” In *Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency*, 607–17.

A very cool package! Funny story, though, I somehow managed to commit my OpenAI API key to GitHub on the first go (developing).

BibTeX citation:

```
@online{altmeyer2023,
author = {Altmeyer, Patrick},
title = {A {Leap} of {Faith} into {Julia’s} {Metaverse}},
date = {2023-03-13},
url = {https://www.paltmeyer.com/blog//blog/posts/meta-programming},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2023. “A Leap of Faith into Julia’s
Metaverse.” March 13, 2023. https://www.paltmeyer.com/blog//blog/posts/meta-programming.

I’ve said it before and I’ll say it again: Quarto is amazing! Since the beginning of my PhD I haven’t used any other tool for prototyping, writing and publishing any of my work.^{1} That work has included: this website, presentations, academic articles, notebooks and more. By highlighting useful features of Quarto in articles like this one, I hope to encourage more people to try it out.

While I’m convinced that Quarto can be useful in almost any context including industry, I realize that certain obstacles may have so far prevented some of you from using it. One such obstacle concerns custom formats: the standard Quarto formats for HTML, PDF, Revealjs, etc. are slick but minimalistic. For many formats, there are various themes to choose from, but they too lack personal touch (or corporate identity in the industry setting).

At first sight, traditional publishing tools like MS Office seem to have an edge here: customization is made easy through GUIs and standardization through templates is possible to a certain degree. I understand the appeal but still would encourage you to look beyond MS Word, PowerPoint and Beamer presentations. To this end, I’ve put together this short tutorial that explains how I have built and contributed a TU Delft theme for Revealjs. If nothing else, this theme can be used by my colleagues at Delft University of Technology to create beautiful, Delft-styled presentations with ease.

Advanced and reproducible customization in Quarto is done through Quarto Extensions:

“Quarto Extensions are a powerful way to modify or extend the behavior of Quarto, and can be created and distributed by anyone.”

— Quarto team

Users can already utilize several open-sourced extensions that add filters, journal article formats and other custom formats. As we will see, it is very straightforward to contribute extensions, so the list of available extensions is growing quickly.

Normally, I would start by explaining how to use Quarto Extensions, but in this particular case the user and developer experience is so close that I’ll jump straight into development.

To get started with building the TU Delft Custom Format I followed the official Quarto docs. I first used the appropriate Quarto command, which initiates an interactive process in the command line:

```
$ quarto create extension format:revealjs
? Extension Name › lexdoc
```

Once done, the basic folder structure for my extension was set up and ready to be pushed to a remote GitHub repository for distribution: https://github.com/pat-alt/quarto-tudelft. Even though I had not yet added any custom formatting rules, anyone would now be able to use this empty extension for their work.

To actually add some custom formatting rules to the extension I started working on the files contained in `_extensions/tudelft/`. Using my institution’s PowerPoint template as a reference, I previewed the `template.qmd` file and simply made appropriate adjustments to the `_extensions/tudelft/custom.scss` and `_extensions/tudelft/_extension.yml` files until I was satisfied. To help me in that process, I took inspiration from various existing Revealjs extensions all listed in the awesome-quarto repository.

I am no expert in CSS (far from it!), so this was very much trial-and-error based, but I got there eventually. One feature I am particularly happy about is the custom transition slides: by default all slides at level 1, so slides that initiate a new section,

`# Transition Slide`

will be formatted in a standardized way. The relevant CSS rule can be found here.

The Revealjs template also includes a few images, which I have lifted from my institution’s PowerPoint template. To make sure that these images are also available locally when users install the template, any resources need to be stored inside the theme directory `_extensions/tudelft/`. I have had some issues pointing to the right location of these images in the `_extensions/tudelft/custom.scss` and `_extensions/tudelft/_extension.yml` files. At the time of writing this, the image URLs are pointing to their remote location on GitHub (see here). This works, but probably isn’t ideal, so any suggestions are welcome.

In February, 2023, I will present a research paper on Algorithmic Recourse at the first IEEE Conference on Secure and Trustworthy Machine Learning: SaTML 2023. This was a good incentive for me to build a TU Delft Theme once for this occasion and then be able to reuse it again in the future.

With the template built and distributed, how do you actually use it?

This part is truly a walk in the park. As outlined in the README, users can either work directly with the template,

`quarto use template pat-alt/quarto-tudelft`

or add the template to an existing Quarto project:

`quarto add pat-alt/quarto-tudelft`

The first option will get you started with a working document straight away. For my paper presentation, I worked with the second option. At the time of writing, I am building and hosting all of my presentations in my website repository (the repo that also builds this very article you’re reading): https://github.com/pat-alt/pat-alt.github.io.

With the extension added to the project, I can now use it anywhere within that project by simply specifying,

`format: tudelft-revealjs`

in the YAML header of my Quarto document where `tudelft-revealjs` is just the name of the custom format.

It gets better … The extension can be extended further by providing yet another custom style sheet, as I have done for my paper presentation:

```
format:
  tudelft-revealjs:
    theme: custom.scss
```

Check out the final presentation here or see the embedded version below:

Not entirely true: I’ve also used `Pluto.jl` 🎈 and had to resort to `.Rmd` in one particular case.

BibTeX citation:

```
@online{altmeyer2023,
author = {Altmeyer, Patrick},
title = {Quarto on {Steroids:} {Advanced} {Customization} Through
{Quarto} {Extensions}},
date = {2023-01-16},
url = {https://www.paltmeyer.com/blog//blog/posts/quarto-extensions},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2023. “Quarto on Steroids: Advanced
Customization Through Quarto Extensions.” January 16, 2023. https://www.paltmeyer.com/blog//blog/posts/quarto-extensions.

This is the third (and for now final) part of a series of posts that introduce Conformal Prediction in Julia using `ConformalPrediction.jl`. The first post introduced Conformal Prediction for supervised classification tasks: we learned that conformal classifiers produce set-valued predictions that are guaranteed to include the true label of a new sample with a certain probability. In the second post we applied these ideas to a more hands-on example: we saw how easy it is to use `ConformalPrediction.jl` to conformalize a Deep Learning image classifier.

In this post, we will look at regression models instead, that is, supervised learning tasks involving a continuous outcome variable. Regression tasks are as ubiquitous as classification tasks. For example, we might be interested in using a machine learning model to predict house prices or the inflation rate of the Euro or the parameter size of the next large language model. In fact, many readers may be more familiar with regression models than classification, in which case it may also be easier for you to understand Conformal Prediction (CP) in this context.

Before we start, let’s briefly recap what CP is all about. Don’t worry, we’re not about to deep-dive into methodology. But just to give you a high-level description upfront:

Conformal prediction (a.k.a. conformal inference) is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.

— Angelopoulos and Bates (2021) (arXiv)

Intuitively, CP works under the premise of turning heuristic notions of uncertainty into rigorous uncertainty estimates through repeated sampling or the use of dedicated calibration data.

In what follows we will explore what CP can do by going through a standard machine learning workflow using `MLJ.jl` and `ConformalPrediction.jl`. There will be less focus on how exactly CP works, but references will point you to additional resources.

Most machine learning workflows start with data. For illustrative purposes we will work with synthetic data. The helper function below can be used to generate some regression data.

```
using Distributions, MLJBase

function get_data(; N=1000, xmax=3.0, noise=0.5, fun::Function=X -> X * sin(X))
    # Inputs: N draws from a uniform distribution on [-xmax, xmax]
    d = Distributions.Uniform(-xmax, xmax)
    X = rand(d, N)
    X = MLJBase.table(reshape(X, :, 1))
    # Outputs: ground-truth mapping plus Gaussian noise
    ε = randn(N) .* noise
    y = @.(fun(X.x1)) + ε
    y = vec(y)
    return X, y
end
```

Figure 1 illustrates our observations (dots) along with the ground-truth mapping from inputs to outputs (line). We have defined that mapping as follows:

`f(X) = X * cos(X)`

`MLJ`

`ConformalPrediction.jl` is interfaced to `MLJ.jl` (Blaom et al. 2020): a comprehensive Machine Learning Framework for Julia. `MLJ.jl` provides a large and growing suite of popular machine learning models that can be used for supervised and unsupervised tasks. Conformal Prediction is a model-agnostic approach to uncertainty quantification, so it can be applied to any common supervised machine learning model.

The interface to `MLJ.jl` therefore seems natural: any (supervised) `MLJ.jl` model can now be conformalized using `ConformalPrediction.jl`. By leveraging existing `MLJ.jl` functionality for common tasks like training, prediction and model evaluation, this package is light-weight and scalable. Now let’s see how all of that works …

To start with, let’s split our data into a training and test set:

`train, test = partition(eachindex(y), 0.4, 0.4, shuffle=true)`

Now let’s define a model for our regression task:

```
Model = @load KNNRegressor pkg = NearestNeighborModels
model = Model()
```

Have it your way!

Think this dataset is too simple? Wondering why on earth I’m not using XGBoost for this task? In the interactive version of this post you have full control over the data and the model. Try it out!

Using standard `MLJ.jl` workflows, let us now first train the unconformalized model. We first wrap our model in data:

`mach_raw = machine(model, X, y)`

Then we fit the machine to the training data:

`MLJBase.fit!(mach_raw, rows=train, verbosity=0)`

Figure 2 below shows the resulting point predictions for the test data set:

How is our model doing? It’s never quite right, of course, since predictions are estimates and therefore uncertain. Let’s see how we can use Conformal Prediction to express that uncertainty.

We can turn our `model` into a conformalized model in just one line of code:

`conf_model = conformal_model(model)`

By default `conformal_model` creates an Inductive Conformal Regressor (more on this below) when called on a `<:Deterministic` model. This behaviour can be changed by using the optional `method` keyword argument.

To train our conformal model we can once again rely on standard `MLJ.jl` workflows. We first wrap our model in data:

`mach = machine(conf_model, X, y)`

Then we fit the machine to the data:

`MLJBase.fit!(mach, rows=train, verbosity=0)`

Now let us look at the predictions for our test data again. The chart below shows the results for our conformalized model. Predictions from conformal regressors are range-valued: for each new sample the model returns an interval that covers the test sample with a user-specified probability $(1-\alpha)$, where $\alpha$ is the expected error rate. This is known as the **marginal coverage guarantee** and it is proven to hold under the assumption that training and test data are exchangeable.

Intuitively, a higher coverage rate leads to larger prediction intervals: since a larger interval covers a larger subspace of $\mathcal{Y}$, it is more likely to cover the true value.
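The link between coverage and interval size can be made concrete with a few lines of plain Julia. The sketch below assumes absolute errors as nonconformity scores and uses made-up calibration scores; `half_width` is a hypothetical helper, not part of `ConformalPrediction.jl`.

```julia
# Hypothetical calibration scores: absolute errors 1.0, 2.0, ..., 100.0
scores = collect(1.0:100.0)

# Half-width of the interval: the ⌈(n + 1)(1 - α)⌉-th smallest score.
function half_width(scores, α)
    n = length(scores)
    k = ceil(Int, (n + 1) * (1 - α))
    return k > n ? Inf : sort(scores)[k]
end

# Coverage levels of 80%, 90% and 95%:
widths = [half_width(scores, α) for α in (0.2, 0.1, 0.05)]
# widths == [81.0, 91.0, 96.0]
```

The smaller the error rate we are willing to tolerate, the further out in the tail of the scores we have to go, and the wider the interval becomes.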

I don’t expect you to believe me that the marginal coverage property really holds. In fact, I couldn’t believe it myself when I first learned about it. If you like mathematical proofs, you can find one in this tutorial, for example. If you like convincing yourself through empirical observations, read on below …

To verify the marginal coverage property empirically we can look at the empirical coverage rate of our conformal predictor (see Section 3 of the tutorial for details). To this end our package provides a custom performance measure `emp_coverage` that is compatible with `MLJ.jl` model evaluation workflows. In particular, we will call `evaluate!` on our conformal model using `emp_coverage` as our performance metric. The resulting empirical coverage rate should then be close to the desired level of coverage.

```
model_evaluation =
    evaluate!(_mach, operation=MLJBase.predict, measure=emp_coverage, verbosity=0)
println("Empirical coverage: $(round(model_evaluation.measurement[1], digits=3))")
println("Coverage per fold: $(round.(model_evaluation.per_fold[1], digits=3))")
```

`Empirical coverage: 0.902`

`Coverage per fold: [0.94, 0.904, 0.874, 0.874, 0.898, 0.922]`

✅ ✅ ✅ Great! We got an empirical coverage rate that is slightly higher than desired 😁 … but why isn’t it exactly the same?

In most cases it will be slightly higher than desired, since $(1-\alpha)$ is a lower bound. But note that it can also be slightly lower than desired. That is because the coverage property is “marginal” in the sense that the probability is averaged over the randomness in the data. For most purposes a large enough calibration set size mitigates that randomness enough. Depending on your choices above, the calibration set may be quite small (set to 500), which can lead to **coverage slack** (see Section 3 in the tutorial).

Inductive Conformal Prediction (also referred to as Split Conformal Prediction) broadly speaking works as follows:

- Partition the training data into a proper training set and a separate calibration set.
- Train the machine learning model on the proper training set.
- Using some heuristic notion of uncertainty (e.g., absolute error in the regression case), compute nonconformity scores using the calibration data and the fitted model.
- For the given coverage ratio compute the corresponding quantile of the empirical distribution of nonconformity scores.
- For the given quantile $\hat{q}$ and test sample $X_{\text{test}}$, form the corresponding conformal prediction set like so:

$$
C(X_{\text{test}})=\{y:s(X_{\text{test}},y) \le \hat{q}\}
$$

Here $s(\cdot,\cdot)$ denotes the nonconformity score.
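The steps above can be sketched in a few lines of plain Julia. This is a toy illustration under my own assumptions (simulated linear data, a least-squares line through the origin as the model, absolute error as the nonconformity score and the usual finite-sample corrected quantile). It is not how `ConformalPrediction.jl` is implemented internally.

```julia
using Random, Statistics

rng = MersenneTwister(123)

# Simulated data: y = 2x + noise (an assumption made purely for illustration).
n = 1000
x = 6 .* rand(rng, n) .- 3
y = 2 .* x .+ 0.5 .* randn(rng, n)

# 1. Partition into a proper training set and a calibration set.
train, calib = 1:500, 501:n

# 2. Train the model on the proper training set (least squares through the origin).
slope = sum(x[train] .* y[train]) / sum(abs2, x[train])
mu_hat(x_new) = slope * x_new

# 3. Nonconformity scores on the calibration set: absolute errors.
scores = abs.(y[calib] .- mu_hat.(x[calib]))

# 4. Quantile of the scores for coverage 1 - alpha (finite-sample corrected).
alpha = 0.1
n_cal = length(scores)
k = min(n_cal, ceil(Int, (n_cal + 1) * (1 - alpha)))
q_hat = sort(scores)[k]

# 5. Prediction interval: all y whose score does not exceed q_hat.
interval(x_new) = (mu_hat(x_new) - q_hat, mu_hat(x_new) + q_hat)

# Empirical coverage on fresh data from the same process.
x_test = 6 .* rand(rng, 2000) .- 3
y_test = 2 .* x_test .+ 0.5 .* randn(rng, 2000)
coverage = mean(abs.(y_test .- mu_hat.(x_test)) .<= q_hat)
println("Empirical coverage: $(round(coverage, digits=3))")
```

Because the test data is drawn from the same process as the calibration data, the printed coverage should land close to 0.9.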

This has been a super quick tour of `ConformalPrediction.jl`. We have seen how the package naturally integrates with `MLJ.jl`, allowing users to generate rigorous predictive uncertainty estimates for any supervised machine learning model.

Quite cool, right? Using a single API call we are able to generate rigorous prediction intervals for all kinds of different regression models. Have we just solved predictive uncertainty quantification once and for all? Do we even need to bother with anything else? Conformal Prediction is a very useful tool, but like so many other things, it is not the final answer to all our problems. In fact, let’s see if we can take CP to its limits.

The helper function to generate data from above takes an optional argument `xmax`. By increasing that value, we effectively expand the domain of our input. Let’s do that and see how our conformal model does on this new out-of-domain data.

Whooooops 🤕 … looks like we’re in trouble: in Figure 4 the prediction intervals do not cover out-of-domain test samples well. What happened here?

By expanding the domain of our inputs, we have violated the exchangeability assumption. When that assumption is violated, the marginal coverage property does not hold. But do not despair! There are ways to deal with this.

If you are curious to find out more, be sure to read on in the docs. There are also a number of useful resources to learn more about Conformal Prediction, a few of which I have listed below:

- *A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification* by Angelopoulos and Bates (2022).
- *Awesome Conformal Prediction* repository by Manokhin (2022).
- **MAPIE**: a comprehensive Python library for conformal prediction.
- My previous two blog posts.

Enjoy!

Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” https://arxiv.org/abs/2107.07511.

Blaom, Anthony D., Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J. Vollmer. 2020. “MLJ: A Julia Package for Composable Machine Learning.” *Journal of Open Source Software* 5 (55): 2704. https://doi.org/10.21105/joss.02704.

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {Prediction {Intervals} for Any {Regression} {Model}},
date = {22-12-12},
url = {https://www.paltmeyer.com/blog//blog/posts/conformal-regression},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “Prediction Intervals for Any Regression
Model.” December 12, 2022. https://www.paltmeyer.com/blog//blog/posts/conformal-regression.

Deep Learning is popular and — for some tasks like image classification — remarkably powerful. But it is also well-known that Deep Neural Networks (DNN) can be unstable (Goodfellow, Shlens, and Szegedy 2014) and poorly calibrated. Conformal Prediction can be used to mitigate these pitfalls.

In the first part of this series of posts on Conformal Prediction, we looked at the basic underlying methodology and how CP can be implemented in Julia using `ConformalPrediction.jl`. This second part of the series is a more goal-oriented how-to guide: it demonstrates how you can conformalize a deep learning image classifier built in `Flux.jl` in just a few lines of code.

Since this is meant to be more of a hands-on article, we will avoid diving too deeply into methodological concepts. If you need more colour on this, be sure to check out the first article on this topic and also A. N. Angelopoulos and Bates (2021). For a more formal treatment of Conformal Prediction see also A. Angelopoulos et al. (2022).

The task at hand is to predict the labels of handwritten images of digits using the famous MNIST dataset (LeCun 1998). Importing this popular machine learning dataset in Julia is made remarkably easy through `MLDatasets.jl`:

```
using MLDatasets
N = 1000
Xraw, yraw = MNIST(split=:train)[:]
Xraw = Xraw[:,:,1:N]
yraw = yraw[1:N]
```

Figure 1 below shows a few random samples from the training data:

```
using MLJ
using Images
X = map(x -> convert2image(MNIST, x), eachslice(Xraw, dims=3))
y = coerce(yraw, Multiclass)
n_samples = 10
mosaic(rand(X, n_samples)..., ncol=n_samples)
```

To model the mapping from image inputs to labels we will rely on a simple Multi-Layer Perceptron (MLP). A great Julia library for Deep Learning is `Flux.jl`. But wait … doesn’t `ConformalPrediction.jl` work with models trained in `MLJ.jl`? That’s right, but fortunately there exists a `Flux.jl` interface to `MLJ.jl`, namely `MLJFlux.jl`. The interface is still in its early stages, but already very powerful and easily accessible for anyone (like myself) who is used to building Neural Networks in `Flux.jl`.

In `Flux.jl`, you could build an MLP for this task as follows,

```
using Flux
mlp = Chain(
    Flux.flatten,
    Dense(prod((28,28)), 32, relu),
    Dense(32, 10)
)
```

where `(28,28)` is just the input dimension (28x28 pixel images). Since we have ten digits, our output dimension is ten.^{1}

We can do the exact same thing in `MLJFlux.jl` as follows,

```
using MLJFlux
builder = MLJFlux.@builder Chain(
    Flux.flatten,
    Dense(prod(n_in), 32, relu),
    Dense(32, n_out)
)
```

where here we rely on the `@builder` macro to make the transition from `Flux.jl` to `MLJ.jl` as seamless as possible. Finally, `MLJFlux.jl` already comes with a number of helper functions to define plain-vanilla networks. In this case, we will use the `ImageClassifier` with our custom builder and cross-entropy loss:

```
ImageClassifier = @load ImageClassifier
clf = ImageClassifier(
builder=builder,
epochs=10,
loss=Flux.crossentropy
)
```

The generated instance `clf` is a model (in the `MLJ.jl` sense) so from this point on we can rely on standard `MLJ.jl` workflows. For example, we can wrap our model in data to create a machine and then evaluate it on a holdout set as follows:

```
mach = machine(clf, X, y)
evaluate!(
mach,
resampling=Holdout(rng=123, fraction_train=0.8),
operation=predict_mode,
measure=[accuracy]
)
```

The accuracy of our very simple model is not amazing, but good enough for the purpose of this tutorial. For each image, our MLP returns a softmax output for each possible digit: 0,1,2,3,…,9. Since each individual softmax output is valued between zero and one, $y_k\in(0,1)$, this is commonly interpreted as a probability: $y_k \coloneqq p(y=k|X)$. Edge cases, that is values close to either zero or one, indicate high predictive certainty. But this is only a heuristic notion of predictive uncertainty (A. N. Angelopoulos and Bates 2021). Next, we will turn this heuristic notion of uncertainty into a rigorous one using Conformal Prediction.
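To see why softmax outputs are only a heuristic, consider the following small sketch in plain Julia. The function `my_softmax` and the logits are made up for illustration (in practice `Flux.jl` provides its own `softmax`). Rescaling the raw network outputs changes how “confident” the softmax output looks, without changing which label is predicted.

```julia
# Numerically stable softmax: subtracting the maximum does not change the result.
function my_softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

logits = [2.0, 0.5, -1.0, 0.1]      # hypothetical raw network outputs
p = my_softmax(logits)              # values in (0, 1) that sum to one
p_sharp = my_softmax(5 .* logits)   # same ranking, but much more peaked

println("max softmax output:          $(round(maximum(p), digits=3))")
println("after rescaling the logits:  $(round(maximum(p_sharp), digits=3))")
```

Both vectors point to the same label, yet the second one looks far more certain. Nothing about the data has changed, which is exactly why we want something more rigorous.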

Since `clf` is a model, it is also compatible with our package: `ConformalPrediction.jl`. To conformalize our MLP, we therefore only need to call `conformal_model(clf)`. Since the generated instance `conf_model` is also just a model, we can still rely on standard `MLJ.jl` workflows. Below we first wrap it in data and then fit it. Aaaand … we’re done! Let’s look at the results in the next section.

```
using ConformalPrediction
conf_model = conformal_model(clf; method=:simple_inductive, coverage=.95)
mach = machine(conf_model, X, y)
fit!(mach)
```

Figure 2 below presents the results. Figure 2 (a) displays highly certain predictions, now defined in the rigorous sense of Conformal Prediction: in each case, the conformal set (just beneath the image) includes only one label.

Figure 2 (b) and Figure 2 (c) display increasingly uncertain predictions of set size two and three, respectively. They demonstrate that CP is well equipped to deal with samples characterized by high aleatoric uncertainty: digits four (4), seven (7) and nine (9) share certain similarities. So do digits five (5) and six (6) as well as three (3) and eight (8). These may be hard to distinguish from each other even after seeing many examples (and even for a human). It is therefore unsurprising to see that these digits often end up together in conformal sets.

To evaluate the performance of conformal models, specific performance measures can be used to assess if the model is correctly specified and well-calibrated (A. N. Angelopoulos and Bates 2021). We will look at this in some more detail in another post in the future. For now, just be aware that these measures are already available in `ConformalPrediction.jl` and we will briefly showcase them here.

As for many other things, `ConformalPrediction.jl` taps into the existing functionality of `MLJ.jl` for model evaluation. In particular, we will see below how we can use the generic `evaluate!` method on our machine. To assess the correctness of our conformal predictor, we can compute the empirical coverage rate using the custom performance measure `emp_coverage`. With respect to model calibration we will look at the model’s conditional coverage. For adaptive, well-calibrated conformal models, conditional coverage is high. One general go-to measure for assessing conditional coverage is size-stratified coverage. The custom measure for this purpose is just called `size_stratified_coverage`, aliased by `ssc`.
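To make the two measures a little more tangible, here is a hand-rolled sketch in plain Julia. It is not the package implementation: the prediction sets and labels are made up, and size-stratified coverage is computed in the way I understand it from Angelopoulos and Bates (2021), namely as the lowest coverage across groups of samples with equal set size.

```julia
using Statistics

# Hypothetical prediction sets and true labels for six test samples.
sets  = [[1], [1, 7], [3], [4, 9], [5, 6, 8], [2]]
truth = [1, 7, 3, 4, 6, 0]

# Was the true label covered by the prediction set?
covered = [y in C for (y, C) in zip(truth, sets)]

# Empirical coverage: share of samples whose true label is covered.
emp_cov = mean(covered)

# Size-stratified coverage: worst coverage across groups of equal set size.
sizes = length.(sets)
ssc_value = minimum(mean(covered[sizes .== s]) for s in unique(sizes))

println("Empirical coverage: $(round(emp_cov, digits=3))")
println("SSC: $(round(ssc_value, digits=3))")
```

In this toy example five out of six labels are covered overall, but among the sets of size one only two out of three are, so the size-stratified measure comes out lower than the overall coverage.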

The code below implements the model evaluation using cross-validation. The Simple Inductive Classifier that we used above is not adaptive and hence the attained conditional coverage is low compared to the overall empirical coverage, which is close to $0.95$ and hence in line with the desired coverage rate specified above.

```
_eval = evaluate!(
mach,
resampling=CV(),
operation=predict,
measure=[emp_coverage, ssc]
)
display(_eval)
println("Empirical coverage: $(round(_eval.measurement[1], digits=3))")
println("SSC: $(round(_eval.measurement[2], digits=3))")
```

```
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌───────────────────────────────────────────────────────────┬───────────┬───────
│ measure │ operation │ meas ⋯
├───────────────────────────────────────────────────────────┼───────────┼───────
│ emp_coverage (generic function with 1 method) │ predict │ 0.95 ⋯
│ size_stratified_coverage (generic function with 1 method) │ predict │ 0.77 ⋯
└───────────────────────────────────────────────────────────┴───────────┴───────
3 columns omitted
```

```
Empirical coverage: 0.951
SSC: 0.771
```

We can attain higher adaptivity (SSC) when using adaptive prediction sets:

```
conf_model = conformal_model(clf; method=:adaptive_inductive, coverage=.95)
mach = machine(conf_model, X, y)
fit!(mach)
_eval = evaluate!(
mach,
resampling=CV(),
operation=predict,
measure=[emp_coverage, ssc]
)
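# Store the fitted machine; `results` is created here if no earlier chunk defined it.
@isdefined(results) || (results = Dict{Symbol,Any}())
results[:adaptive_inductive] = mach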
display(_eval)
println("Empirical coverage: $(round(_eval.measurement[1], digits=3))")
println("SSC: $(round(_eval.measurement[2], digits=3))")
```

```
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌───────────────────────────────────────────────────────────┬───────────┬───────
│ measure │ operation │ meas ⋯
├───────────────────────────────────────────────────────────┼───────────┼───────
│ emp_coverage (generic function with 1 method) │ predict │ 0.99 ⋯
│ size_stratified_coverage (generic function with 1 method) │ predict │ 0.94 ⋯
└───────────────────────────────────────────────────────────┴───────────┴───────
3 columns omitted
```

```
Empirical coverage: 0.991
SSC: 0.948
```
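For some intuition on why the adaptive approach behaves differently, here is a rough sketch of the idea behind adaptive prediction sets, loosely following the description in A. N. Angelopoulos and Bates (2021). It is a deterministic toy version with made-up function names and class indices running from 1 to $K$. The `:adaptive_inductive` method in the package may well differ in its details.

```julia
# Adaptive score: softmax mass accumulated, from the most likely class downwards,
# until the true class `y` has been included.
function adaptive_score(p::AbstractVector, y::Integer)
    order = sortperm(p, rev=true)      # classes from most to least likely
    rank = findfirst(==(y), order)     # position of the true class
    return sum(p[order[1:rank]])
end

# Prediction set: add classes in order of likelihood until the mass reaches q_hat.
function adaptive_set(p::AbstractVector, q_hat::Real)
    order = sortperm(p, rev=true)
    cum = cumsum(p[order])
    k = something(findfirst(>=(q_hat), cum), length(p))
    return order[1:k]
end

p_easy = [0.96, 0.02, 0.01, 0.01]   # hypothetical softmax output, clear-cut input
p_hard = [0.40, 0.35, 0.20, 0.05]   # hypothetical softmax output, ambiguous input
q_hat = 0.9                          # assumed calibrated quantile

println(adaptive_set(p_easy, q_hat))
println(adaptive_set(p_hard, q_hat))
```

The easy input ends up with a single label while the hard one needs three labels to accumulate enough mass, which is the kind of adaptivity that size-stratified coverage rewards.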

We can also have a look at the resulting set size for both approaches using a custom `Plots.jl` recipe. In line with the above, the spread is wider for the adaptive approach, which reflects that “the procedure is effectively distinguishing between easy and hard inputs” (A. N. Angelopoulos and Bates 2021).

```
using Plots
# One bar chart of set sizes per fitted conformal model stored in `results`
plt_list = []
for (_mod, mach) in results
    push!(plt_list, bar(mach.model, mach.fitresult, X; title=String(_mod)))
end
plot(plt_list..., size=(800,300))
```

In this short guide we have seen how easy it is to conformalize a deep learning image classifier in Julia using `ConformalPrediction.jl`. Almost any deep neural network trained in `Flux.jl` is compatible with `MLJ.jl` and can therefore be conformalized in just a few lines of code. This makes it remarkably easy to move from uncertainty heuristics to rigorous predictive uncertainty estimates. We have also seen a sneak peek at performance evaluation of conformal predictors. Stay tuned for more!

Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” https://arxiv.org/abs/2107.07511.

Angelopoulos, Anastasios, Stephen Bates, Jitendra Malik, and Michael I. Jordan. 2022. “Uncertainty Sets for Image Classifiers Using Conformal Prediction.” arXiv. http://arxiv.org/abs/2009.14193.

Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy. 2014. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572.

LeCun, Yann. 1998. “The MNIST Database of Handwritten Digits.”

For a full tutorial on how to build an MNIST image classifier relying solely on `Flux.jl`, check out this tutorial.

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {How to {Conformalize} a {Deep} {Image} {Classifier}},
date = {22-12-05},
url = {https://www.paltmeyer.com/blog//blog/posts/conformal-image-classifier},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “How to Conformalize a Deep Image
Classifier.” December 5, 2022. https://www.paltmeyer.com/blog//blog/posts/conformal-image-classifier.

Earlier this year in July, I gave a short Experience Talk at JuliaCon. In a related blog post I explained how the introduction of Quarto made my transition from R to Julia painless: I would be able to start learning Julia without having to give up on all the benefits associated with R Markdown.

In November 2022, I am presenting on this topic again at the 2nd JuliaLang Eindhoven meetup. In addition to the slides, I thought I’d share a small companion blog post that highlights some useful tips and tricks for anyone interested in using Quarto with Julia.

We will start in this section with a few general recommendations.

I continue to recommend using VSCode for any work with Quarto and Julia. The Quarto docs explain how to get started by installing the necessary Quarto and IJulia extensions. Since most Julia users will regularly want to update their Julia version, I would additionally recommend adding `IJulia.jl` to your `~/.julia/config/startup.jl` file:^{1}

```
# Make sure Revise, OhMyREPL, Term and IJulia are installed
import Pkg
let
    pkgs = ["Revise", "OhMyREPL", "Term", "IJulia"]
    for pkg in pkgs
        # Only add packages that cannot be found in the current environment
        if Base.find_package(pkg) === nothing
            Pkg.add(pkg)
        end
    end
end
```

Additionally, you only need to remember that …

… if you install a new Julia binary […], you must update the IJulia installation […] by running

`Pkg.build("IJulia")`

— Source: IJulia docs

I guess this step can also be automated in `~/.julia/config/startup.jl`, but I haven’t tried that yet.
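For what it’s worth, one way this could look is sketched below. To be clear, this is a hypothetical helper that I have not used in my own `startup.jl`: the function name `rebuild_if_new_julia` and the stamp file are made up, and the only real API involved is `Pkg.build("IJulia")` from the quote above. The idea is to remember the Julia version that IJulia was last built for and to rebuild only when that version changes.

```julia
import Pkg

# Hypothetical helper: rebuild IJulia only if the running Julia version differs
# from the version recorded in `stamp_file`. Returns true if a rebuild happened.
function rebuild_if_new_julia(stamp_file::AbstractString; build=() -> Pkg.build("IJulia"))
    current = string(VERSION)
    previous = isfile(stamp_file) ? strip(read(stamp_file, String)) : ""
    if previous != current
        build()                      # update the IJulia installation for this binary
        write(stamp_file, current)   # remember the version we built for
        return true
    end
    return false
end

# In startup.jl one might then call (the file location is an arbitrary choice):
# rebuild_if_new_julia(joinpath(homedir(), ".julia", "config", "ijulia_version.txt"))
```

The `build` keyword only exists so that the logic can be tried out without actually triggering a rebuild.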

`.ipynb` vs `.qmd`

I also continue to recommend working with Quarto notebooks as opposed to Jupyter notebooks (files ending in `.qmd` and `.ipynb`, respectively). This is partially just based on preference (from R Markdown I’m used to working with `.Rmd` files), but there is also a good reason to consider using `.qmd`, even if you’re used to working with Jupyter: the code chunks in your Quarto notebook automatically link to the Julia REPL in VSCode. In other words, you can run code chunks in your notebook and then access any variable that you may have created in the REPL. I find this quite useful, because it allows me to quickly test code. Perhaps there’s a good way to do this with Jupyter notebooks as well, but when I last used them I would always have to insert new code cells to test stuff.

Either way, switching between Jupyter and Quarto notebooks is straightforward: `quarto convert notebook.qmd` will convert any Quarto notebook into a Jupyter notebook and vice versa. One potential benefit of Jupyter notebooks is their connection to Google Colab: it is possible to store Jupyter notebooks on GitHub and make them available on Colab, allowing users to quickly interact with your code without the need to clone anything. If this is important to you, you can still work with `.qmd` documents and simply specify `keep-ipynb: true` in the YAML header.

The world and the data that describe it are not static 📈. Why should scientific outputs be?

One of the things I have always really loved about R Markdown was the ability to use inline code: the Knitr engine allows you to call and render any object `x` that you have created in preceding R chunks like this: `r x`. This is very powerful, because it enables us to bridge the gap between computations and output. In other words, it allows us to easily produce reproducible and dynamic content.

Until recently I had not been aware that this is also possible for Julia. Consider the following example. The code below depends on remote data that is continuously updated:

```
using MarketData
snp = yahoo("^GSPC")
using Dates
last_trade_day = timestamp(snp[end])[1]
p_close = values(snp[end,:Close])[1]
last_trade_day_formatted = Dates.format(last_trade_day, "U d, yyyy")
```

It loads the most recent publicly available data on equity prices from Yahoo Finance. In an ideal world, we’d like any updates to these inputs to be reflected in our output. That way you can just re-render the Quarto notebook to get an updated report. To render Julia code inline, we use `Markdown.jl` like so:

```
using Markdown
Markdown.parse("""
When the S&P 500 last traded, on $(last_trade_day_formatted), it closed at $(p_close).
""")
```

When the S&P 500 last traded, on February 23, 2023, it closed at 4012.320068.

In practice, one would of course set `#| echo: false` in this case. Whatever content you publish, this approach will keep it up-to-date. This practice of simply re-rendering the source notebook also ensures that any other output remains up-to-date (e.g. Figure 1).

Related to the previous point, I typically define the following execution options in my `_quarto.yml` or `_metadata.yml`. The `freeze: auto` option ensures that documents are only re-rendered if the source changes. In cases where code should always be re-executed you would want to set `freeze: false` instead. I set `output: false` because typically I have a lot of code chunks that don’t generate any output that is of immediate interest to readers.

```
execute:
  freeze: auto
  eval: true
  echo: true
  output: false
```

To ensure that your content can be reproduced easily, it may additionally be helpful to explicitly specify the Julia version you used (`jupyter: julia-1.8`) and set up a global or local Julia environment. Inserting the following at the beginning of your Quarto notebook

`using Pkg; Pkg.activate("<path>")`

ensures that the desired environment that lives in `<path>` is actually activated and used.

I have also continued to use Quarto in combination with `Documenter.jl` to document my Julia packages. This essentially boils down to writing up documentation using interactive `.qmd` notebooks and then rendering those to `.md` files as inputs for `Documenter.jl`. There are a few good reasons for this approach, especially if you’re used to working with Quarto anyway:

- Re-rendering any docs with `eval: true` provides an additional layer of quality assurance: if any of the code chunks throws an error, you know that your documentation is outdated (perhaps due to an API change). It also offers a straight-forward way to test package functions that produce non-testable (e.g. stochastic) output. In such cases, the use of `jldoctest` is not always straight-forward (see here).
- You get some stuff for free, e.g. citation management. Unfortunately, as far as I’m aware there is still no support for cross-referencing.
- You can use Quarto execution options like `execute-dir: project` and `resources: www/` to globally specify the working directory and a directory for external resources like images.

There are also a few peculiarities to be aware of. To avoid any issues with `Documenter.jl`, I’ve found it useful to ensure that the rendered `.md` files do not contain any raw HTML and to preserve text wrapping:

```
format:
  commonmark:
    variant: -raw_html
    wrap: preserve
```

When working with `.qmd` files you also need to use a slightly different syntax for admonitions. The following syntax inside the `.qmd`

```
| !!! note \"An optional title\"
|     Here is something that you should pay attention to.
```

will generate the desired output inside the rendered `.md`:^{2}

```
!!! note "An optional title"
    Here is something that you should pay attention to.
```

Any of my package repos (`CounterfactualExplanations.jl`, `LaplaceRedux.jl`, `ConformalPrediction.jl`) should provide additional colour on this topic.

Quarto supports templates/classes, which has helped me with paper submissions in the past (e.g. my pending JuliaCon Proceedings submissions). I’ve found that `rticles` still has an edge here, but the list of out-of-the-box templates for journal articles is growing. Should I find some time in the future, I will try to add a template for JuliaCon Proceedings. The beauty of this is that it should enable publishers to not only use traditional forms of publication (PDF), but also include more dynamic formats with ease (think distill, but more than that).

This short post has provided a bit of an update on using Quarto with Julia. From my own experience so far, things have been getting easier and better (thanks to the amazing work of the Quarto dev team). I’m excited to see things improve even further and still think that Quarto is a revolutionary new tool for scientific publishing. Let’s hope publishers eventually recognise this value 👀.

Unrelated to Quarto, but this thread on Discourse is full of other useful ideas for your `startup.jl`.

See related discussion.

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {A Year of Using {Quarto} with {Julia}},
date = {22-11-21},
url = {https://www.paltmeyer.com/blog//blog/posts/tips-and-tricks-for-using-quarto-with-julia},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “A Year of Using Quarto with
Julia.” November 21, 2022. https://www.paltmeyer.com/blog//blog/posts/tips-and-tricks-for-using-quarto-with-julia.

As coverage grows, so does the size of the prediction sets.

A first crucial step towards building trustworthy AI systems is to be transparent about predictive uncertainty. Model parameters are random variables and their values are estimated from noisy data. That inherent stochasticity feeds through to model predictions and should be addressed, at the very least in order to avoid overconfidence in models.

Beyond that obvious concern, it turns out that quantifying model uncertainty actually opens up a myriad of possibilities to improve up- and down-stream modeling tasks like active learning and robustness. In Bayesian Active Learning, for example, uncertainty estimates are used to guide the search for new input samples, which can make ground-truthing tasks more efficient (Houlsby et al. 2011). With respect to model performance in downstream tasks, uncertainty quantification can be used to improve model calibration and robustness (Lakshminarayanan, Pritzel, and Blundell 2016).

In previous posts we have looked at how uncertainty can be quantified in the Bayesian context (see here and here). Since in Bayesian modeling we are generally concerned with estimating posterior distributions, we get uncertainty estimates almost as a byproduct. This is great for all intents and purposes, but it hinges on assumptions about prior distributions. Personally, I have no quarrel with the idea of making prior distributional assumptions. On the contrary, I think the Bayesian framework formalizes the idea of integrating prior information in models and therefore provides a powerful toolkit for conducting science. Still, in some cases this requirement may be seen as too restrictive or we may simply lack prior information.

Enter: Conformal Prediction (CP) — a scalable frequentist approach to uncertainty quantification and coverage control. In this post we will go through the basic concepts underlying CP. A number of hands-on usage examples in Julia should hopefully help to convey some intuition and ideally attract people interested in contributing to a new and exciting open-source development.

🏃♀️ TL;DR

- Conformal Prediction is an interesting frequentist approach to uncertainty quantification that can even be combined with Bayes (Section 1).
- It is scalable and model-agnostic and therefore well applicable to machine learning (Section 1).
- `ConformalPrediction.jl` implements CP in pure Julia and can be used with any supervised model available from `MLJ.jl` (Section 2).
- Implementing CP directly on top of an existing, powerful machine learning toolkit demonstrates the potential usefulness of this framework to the ML community (Section 2).
- Standard conformal classifiers produce set-valued predictions: for ambiguous samples these sets are typically large (for high coverage) or empty (for low coverage) (Section 2.1).

Conformal Prediction promises to be an easy-to-understand, distribution-free and model-agnostic way to generate statistically rigorous uncertainty estimates. That’s quite a mouthful, so let’s break it down: firstly, as I will hopefully manage to illustrate in this post, the underlying concepts truly are fairly straight-forward to understand; secondly, CP indeed relies on only minimal distributional assumptions; thirdly, common procedures to generate conformal predictions really do apply almost universally to all supervised models, therefore making the framework very intriguing to the ML community; and, finally, CP does in fact come with a frequentist coverage guarantee that ensures that conformal prediction sets contain the true value with a user-chosen probability. For a formal proof of this *marginal coverage* property and a detailed introduction to the topic, I recommend Angelopoulos and Bates (2021).

Note

In what follows we will loosely treat the tutorial by Angelopoulos and Bates (2021) and the general framework it sets as a reference. You are not expected to have read the paper, but I also won’t reiterate any details here.

CP can be used to generate prediction intervals for regression models and prediction sets for classification models (more on this later). There is also some recent work on conformal predictive distributions and probabilistic predictions. Interestingly, it can even be used to complement Bayesian methods. Angelopoulos and Bates (2021), for example, point out that prior information should be incorporated into prediction sets and demonstrate how Bayesian predictive distributions can be conformalized in order to comply with the frequentist notion of coverage. Relatedly, Hoff (2021) proposes a Bayes-optimal prediction procedure. And finally, Stanton, Maddox, and Wilson (2022) very recently proposed a way to introduce conformal prediction in Bayesian Optimization. I find this type of work that combines different schools of thought very promising, but I’m drifting off a little … So, without further ado, let us look at some code.

In this section of this first short post on CP we will look at how conformal prediction can be implemented in Julia. In particular, we will look at an approach that is compatible with any of the many supervised machine learning models available in MLJ: a beautiful, comprehensive machine learning framework funded by the Alan Turing Institute and the New Zealand Strategic Science Investment Fund (Blaom et al. 2020). We will go through some basic usage examples employing a new Julia package that I have been working on: `ConformalPrediction.jl`.

`ConformalPrediction.jl` is a package for uncertainty quantification through conformal prediction for machine learning models trained in MLJ. At the time of writing it is still in its early stages of development, but already implements a range of different approaches to CP. Contributions are very much welcome.

We consider a simple binary classification problem. Let $(X_i, Y_i), \ i=1,\dots,n$ denote our feature-label pairs and let $\mu: \mathcal{X} \mapsto \mathcal{Y}$ denote the mapping from features to labels. For illustration purposes we will use the moons dataset 🌙. Using `MLJ.jl` we first generate the data and split it into a training and test set:

```
using MLJ
using Random
Random.seed!(123)
# Data:
X, y = make_moons(500; noise=0.15)
train, test = partition(eachindex(y), 0.8, shuffle=true)
```

Here we will use a specific case of CP called *split conformal prediction* which can then be summarized as follows:^{1}

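- Partition the training data into a proper training set and a separate calibration set: $\mathcal{D}_n=\mathcal{D}^{\text{train}} \cup \mathcal{D}^{\text{cali}}$.
- Train the machine learning model on the proper training set: $\hat\mu_{i \in \mathcal{D}^{\text{train}}}(X_i,Y_i)$.
- Compute nonconformity scores, $\mathcal{S}$, using the calibration data $\mathcal{D}^{\text{cali}}$ and the fitted model $\hat\mu_{i \in \mathcal{D}^{\text{train}}}$.
- For a user-specified desired coverage ratio $(1-\alpha)$ compute the corresponding quantile, $\hat{q}$, of the empirical distribution of nonconformity scores, $\mathcal{S}$.
- For the given quantile and test sample $X_{\text{test}}$, form the corresponding conformal prediction set:

$$
C(X_{\text{test}})=\{y:s(X_{\text{test}},y) \le \hat{q}\} \tag{1}
$$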

This is the default procedure used for classification and regression in `ConformalPrediction.jl`.
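Before turning to the package itself, here is how the recipe above might look in plain Julia for a two-class problem. Everything in this sketch is an assumption made for illustration: the “fitted model” is a hard-coded logistic curve standing in for any classifier, the data is simulated, the score is one minus the softmax output of the true label and the quantile uses the usual finite-sample correction. It is not the package source code.

```julia
using Random

rng = MersenneTwister(1)

# Stand-in for a fitted classifier: softmax-like output for labels 1 and 2.
function prob(x)
    p2 = 1 / (1 + exp(-3x))
    return [1 - p2, p2]
end

# Simulated calibration data drawn from that very model.
n_cal = 500
x_cal = randn(rng, n_cal)
y_cal = [rand(rng) < prob(x)[2] ? 2 : 1 for x in x_cal]

# Nonconformity scores: one minus the probability assigned to the true label.
scores = [1 - prob(x)[y] for (x, y) in zip(x_cal, y_cal)]

# Quantile of the scores for coverage 1 - alpha (finite-sample corrected).
alpha = 0.1
k = min(n_cal, ceil(Int, (n_cal + 1) * (1 - alpha)))
q_hat = sort(scores)[k]

# Prediction set: all labels whose score does not exceed q_hat.
predict_set(x) = [y for y in 1:2 if 1 - prob(x)[y] <= q_hat]

println(predict_set(3.0))   # far to the right, where label 2 is very likely
println(predict_set(-3.0))  # far to the left, where label 1 is very likely
```

Note that the model is only ever used through its softmax output, which is what makes the procedure model-agnostic.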

You may want to take a look at the source code for the classification case here. As a first important step, we begin by defining a concrete type `SimpleInductiveClassifier` that wraps a supervised model from `MLJ.jl` and reserves additional fields for a few hyperparameters. As a second step, we define the training procedure, which includes the data-splitting and calibration step. Finally, as a third step we implement the procedure in Equation 1 to compute the conformal prediction set.

Development Status

The permalinks above take you to the version of the package that was up-to-date at the time of writing. Since the package is in its early stages of development, the code base and API can be expected to change.

Now let’s take this to our 🌙 data. To illustrate the package functionality we will demonstrate the envisioned workflow. We first define our atomic machine learning model following standard `MLJ.jl` conventions. Using `ConformalPrediction.jl` we then wrap our atomic model in a conformal model using the standard API call `conformal_model(model::Supervised; kwargs...)`. To train and predict from our conformal model we can then rely on the conventional `MLJ.jl` procedure again. In particular, we wrap our conformal model in data (turning it into a machine) and then fit it on the training set. Finally, we use our machine to predict the label for a new test sample `Xtest`:

```
# Model:
KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels
model = KNNClassifier(;K=50)
# Training:
using ConformalPrediction
conf_model = conformal_model(model; coverage=.9)
mach = machine(conf_model, X, y)
fit!(mach, rows=train)
# Conformal Prediction:
Xtest = selectrows(X, first(test))
ytest = y[first(test)]
predict(mach, Xtest)[1]
```

`import NearestNeighborModels ✔`

UnivariateFinite{Multiclass{2}} ┌ ┐ 0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.94 └ ┘

The final predictions are set-valued. While the softmax output remains unchanged for the `SimpleInductiveClassifier`, the size of the prediction set depends on the chosen coverage rate, $(1-\alpha)$.

When specifying a coverage rate very close to one, the prediction set will typically include many (in some cases all) of the possible labels. Below, for example, both classes are included in the prediction set when setting the coverage rate equal to $(1-\alpha)=1.0$. This is intuitive, since high coverage quite literally requires that the true label is covered by the prediction set with high probability.

```
# Coverage rate used for this example:
coverage = 1.0
conf_model = conformal_model(model; coverage=coverage)
mach = machine(conf_model, X, y)
fit!(mach, rows=train)
# Conformal Prediction:
Xtest = (x1=[1],x2=[0])
predict(mach, Xtest)[1]
```

UnivariateFinite{Multiclass{2}} ┌ ┐ 0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.5 1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.5 └ ┘

Conversely, for low coverage rates, prediction sets can also be empty. For a choice of $(1-\alpha)=0.1$, for example, the prediction set for our test sample is empty. This is a bit difficult to think about intuitively and I have not yet come across a satisfactory, intuitive interpretation.^{2} When the prediction set is empty, the `predict` call currently returns `missing`:

```
# Coverage rate used for this example:
coverage = 0.1
conf_model = conformal_model(model; coverage=coverage)
mach = machine(conf_model, X, y)
fit!(mach, rows=train)
# Conformal Prediction:
predict(mach, Xtest)[1]
```

`missing`

Figure 1 should provide some more intuition as to what exactly is happening here. It illustrates the effect of the chosen coverage rate on the predicted softmax output and the set size in the two-dimensional feature space. Contours are overlaid with the moon data points (including test data). The two samples highlighted in red, $X_1$ and $X_2$, have been manually added for illustration purposes. Let’s look at these one by one.

Firstly, note that $X_1$ (red cross) falls into a region of the domain that is characterized by high predictive uncertainty. It sits right at the bottom-right corner of our class-zero moon 🌜 (orange), a region that is almost entirely enveloped by our class-one moon 🌛 (green). For low coverage rates the prediction set for $X_1$ is empty: on the left-hand side this is indicated by the missing contour for the softmax probability; on the right-hand side we can observe that the corresponding set size is indeed zero. For high coverage rates the prediction set includes both $y=0$ and $y=1$, indicative of the fact that the conformal classifier is uncertain about the true label.

With respect to $X_2$, we observe that while also sitting on the fringe of our class-zero moon, this sample populates a region that is not fully enveloped by data points from the opposite class. In this region, the underlying atomic classifier can be expected to be more certain about its predictions, but still not highly confident. How is this reflected by our corresponding conformal prediction sets?

```
Xtest_2 = (x1=[-0.5],x2=[0.25])
cov_ = .9
conf_model = conformal_model(model; coverage=cov_)
mach = machine(conf_model, X, y)
fit!(mach, rows=train)
p̂_2 = pdf(predict(mach, Xtest_2)[1], 0)
```

Well, for low coverage rates the conformal prediction set for $X_2$ does not include $y=0$: the set size is zero (right panel). Only for higher coverage rates does the set include $y=0$: the coverage rate is then high enough to include $y=0$, but the corresponding softmax probability is still fairly low. For example, for $(1-\alpha)=0.9$ this probability is the value `p̂_2` computed above.

These two examples illustrate an interesting point: for regions characterised by high predictive uncertainty, conformal prediction sets are typically empty (for low coverage) or large (for high coverage). While set-valued predictions may be something to get used to, this notion is overall intuitive.

```
# Setup
coverages = range(0.75,1.0,length=5)
n = 100
x1_range = range(extrema(X.x1)...,length=n)
x2_range = range(extrema(X.x2)...,length=n)
anim = @animate for coverage in coverages
    conf_model = conformal_model(model; coverage=coverage)
    mach = machine(conf_model, X, y)
    fit!(mach, rows=train)
    p1 = contourf_cp(mach, x1_range, x2_range; type=:proba, title="Softmax", axis=nothing)
    scatter!(p1, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)
    scatter!(p1, Xtest.x1, Xtest.x2, ms=6, c=:red, label="X₁", shape=:cross, msw=6)
    scatter!(p1, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label="X₂", shape=:diamond, msw=6)
    p2 = contourf_cp(mach, x1_range, x2_range; type=:set_size, title="Set size", axis=nothing)
    scatter!(p2, X.x1, X.x2, group=y, ms=2, msw=0, alpha=0.75)
    scatter!(p2, Xtest.x1, Xtest.x2, ms=6, c=:red, label="X₁", shape=:cross, msw=6)
    scatter!(p2, Xtest_2.x1, Xtest_2.x2, ms=6, c=:red, label="X₂", shape=:diamond, msw=6)
    plot(p1, p2, plot_title="(1-α)=$(round(coverage,digits=2))", size=(800,300))
end
gif(anim, fps=0.5)
```

This has really been a whistle-stop tour of Conformal Prediction: an active area of research that probably deserves much more attention. Hopefully, though, this post has helped to provide some color and, if anything, made you more curious about the topic. Let’s recap the TL;DR from above:

- Conformal Prediction is an interesting frequentist approach to uncertainty quantification that can even be combined with Bayes (Section 1).
- It is scalable and model-agnostic and therefore well applicable to machine learning (Section 1).
- `ConformalPrediction.jl` implements CP in pure Julia and can be used with any supervised model available from `MLJ.jl` (Section 2).
- Implementing CP directly on top of an existing, powerful machine learning toolkit demonstrates the potential usefulness of this framework to the ML community (Section 2).
- Standard conformal classifiers produce set-valued predictions: for ambiguous samples these sets are typically large (for high coverage) or empty (for low coverage) (Section 2.1).

Below I will leave you with some further resources.

Chances are that you have already come across the Awesome Conformal Prediction repo: Manokhin (n.d.) provides a comprehensive, up-to-date overview of resources related to conformal prediction. Among the listed articles you will also find Angelopoulos and Bates (2021), which inspired much of this post. The repo also points to open-source implementations in other popular programming languages including Python and R.

Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” https://arxiv.org/abs/2107.07511.

Blaom, Anthony D., Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J. Vollmer. 2020. “MLJ: A Julia Package for Composable Machine Learning.” *Journal of Open Source Software* 5 (55): 2704. https://doi.org/10.21105/joss.02704.

Hoff, Peter. 2021. “Bayes-Optimal Prediction with Frequentist Coverage Control.” https://arxiv.org/abs/2105.14045.

Houlsby, Neil, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. “Bayesian Active Learning for Classification and Preference Learning.” https://arxiv.org/abs/1112.5745.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” https://arxiv.org/abs/1612.01474.

Manokhin, Valery. n.d. “Awesome Conformal Prediction.”

Stanton, Samuel, Wesley Maddox, and Andrew Gordon Wilson. 2022. “Bayesian Optimization with Conformal Coverage Guarantees.” https://arxiv.org/abs/2210.12496.

^{1} In other places split conformal prediction is sometimes referred to as *inductive* conformal prediction.

^{2} Any thoughts/comments welcome!

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {Conformal {Prediction} in {Julia} 🟣🔴🟢},
date = {2022-10-25},
url = {https://www.paltmeyer.com/blog//blog/posts/conformal-prediction},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “Conformal Prediction in Julia 🟣🔴🟢.” October 25, 2022. https://www.paltmeyer.com/blog//blog/posts/conformal-prediction.

Counterfactual explanations, which I introduced in one of my previous posts^{1}, offer a simple and intuitive way to explain black-box models without opening them. Still, as of today there exists only one open-source library that provides a unifying approach to generate and benchmark counterfactual explanations for models built and trained in Python (Pawelczyk et al. 2021). This is great, but of limited use to users of other programming languages 🥲.

Enter `CounterfactualExplanations.jl`: a Julia package that can be used to explain machine learning algorithms developed and trained in Julia, Python and R. Counterfactual explanations fall into the broader category of explainable artificial intelligence (XAI).

Explainable AI typically involves models that are not inherently interpretable but require additional tools to be explainable to humans. Examples of such models include ensembles, support vector machines and deep neural networks. This is not to be confused with interpretable AI, which involves models that are inherently interpretable and transparent such as generalized additive models (GAM), decision trees and rule-based models.

Some would argue that we best avoid explaining black-box models altogether (Rudin 2019) and instead focus solely on interpretable AI. While I agree that initial efforts should always be geared towards interpretable models, stopping there would entail missed opportunities and is, in any case, probably not very realistic in times of DALL-E and Co.

Even though […] interpretability is of great importance and should be pursued, explanations can, in principle, be offered without opening the “black box.”

— Wachter, Mittelstadt, and Russell (2017)

This post introduces the main functionality of the new Julia package. Following a motivating example using a model trained in Julia, we will see how easily the package can be adapted to work with models trained in Python and R. Since the motivation for this post is also to hopefully attract contributors, the final section outlines some of the exciting developments we have planned.

To introduce counterfactual explanations I used a simple binary classification problem in my previous post. It involved a linear classifier and a linearly separable, synthetic data set with just two features. This time we are going to step it up a notch: we will generate counterfactual explanations for MNIST data. The MNIST dataset contains 60,000 training samples of handwritten digits in the form of 28x28 pixel grey-scale images (LeCun 1998). Each image is associated with a label indicating the digit (0-9) that the image represents.

The `CounterfactualExplanations.jl` package ships with two black-box models that were trained to predict labels for this data: firstly, a simple multi-layer perceptron (MLP) and, secondly, a corresponding deep ensemble. Originally proposed by Lakshminarayanan, Pritzel, and Blundell (2016), deep ensembles are really just ensembles of deep neural networks. They are still among the most popular approaches to Bayesian deep learning.^{2}

The code below loads relevant packages along with the MNIST data and pre-trained models.

```
# Load package, models and data:
using CounterfactualExplanations, Flux
using CounterfactualExplanations.Data: mnist_data, mnist_model, mnist_ensemble
data, X, ys = mnist_data()
model = mnist_model()
ensemble = mnist_ensemble()
counterfactual_data = CounterfactualData(X,ys;domain=(0,1))
```

While the package can currently handle a few simple classification models natively, it is designed to be easily extensible by users and contributors. Extending the package to deal with custom models typically involves only two simple steps:

1. **Subtyping**: the custom model needs to be declared as a subtype of the package-internal type `AbstractFittedModel`.
2. **Multiple dispatch**: the package-internal functions `logits` and `probs` need to be extended through custom methods for the new model type.

The following code implements these two steps first for the MLP and then for the deep ensemble.

```
using CounterfactualExplanations.Models
import CounterfactualExplanations.Models: logits, probs
# MLP:
# Step 1)
struct NeuralNetwork <: Models.AbstractFittedModel
    model::Any
end
# Step 2)
logits(M::NeuralNetwork, X::AbstractArray) = M.model(X)
probs(M::NeuralNetwork, X::AbstractArray) = softmax(logits(M, X))
M = NeuralNetwork(model)
# Deep ensemble:
using Flux: stack
# Step 1)
struct FittedEnsemble <: Models.AbstractFittedModel
    ensemble::AbstractArray
end
# Step 2)
using Statistics
logits(M::FittedEnsemble, X::AbstractArray) = mean(stack([m(X) for m in M.ensemble],3),dims=3)
probs(M::FittedEnsemble, X::AbstractArray) = mean(stack([softmax(m(X)) for m in M.ensemble],3),dims=3)
M_ensemble = FittedEnsemble(ensemble)
```

Next, we need to specify the counterfactual generators we want to use. The package currently ships with two default generators that both need gradient access: firstly, the generic generator introduced by Wachter, Mittelstadt, and Russell (2017) and, secondly, a greedy generator introduced by Schut et al. (2021).

The greedy generator is designed to be used with models that incorporate uncertainty in their predictions such as the deep ensemble introduced above. It works for probabilistic (Bayesian) models, because they only produce high-confidence predictions in regions of the feature domain that are populated by training samples. As long as the model is expressive enough and well-specified, counterfactuals in these regions will always be realistic and unambiguous since by construction they should look very similar to training samples. Other popular approaches to counterfactual explanations like REVISE (Joshi et al. 2019) and CLUE (Antorán et al. 2020) also play with this simple idea.
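To see why averaging over ensemble members tempers confidence, consider the following toy sketch. It does not use the package or `Flux`: the `softmax` helper, the three hand-written "members" and the input are hypothetical stand-ins chosen so that the members disagree.

```julia
using Statistics

# Numerically stable softmax for a single logit vector.
function softmax(z::AbstractVector{<:Real})
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Hypothetical ensemble members: each maps a feature vector to two logits.
members = [
    x -> [1.0, 0.0] .* sum(x),  # leans towards class 1
    x -> [0.0, 1.0] .* sum(x),  # leans towards class 2
    x -> [0.5, 0.5] .* sum(x),  # undecided
]

# Ensemble prediction: average of the members' softmax outputs.
ensemble_probs(members, x) = mean([softmax(m(x)) for m in members])

x = [1.0, 1.0]
p_members = [softmax(m(x)) for m in members]
p_ensemble = ensemble_probs(members, x)
```

The first two members are individually confident but contradict each other, so the averaged prediction is spread evenly over both classes. High ensemble confidence requires agreement among members, which is what one would expect mainly in regions well covered by training data.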

The following code instantiates the two generators for the problem at hand.

```
generic = GenericGenerator(;loss=:logitcrossentropy)
greedy = GreedyGenerator(;loss=:logitcrossentropy)
```
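For intuition about what a gradient-based generator does, here is a rough, self-contained sketch in the spirit of Wachter, Mittelstadt, and Russell (2017). It is not the package's implementation: the linear classifier, the squared Euclidean distance penalty, the stopping threshold `γ` and all function names are assumptions made for this example.

```julia
using LinearAlgebra

sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical linear classifier: logit(x) = w'x + b
w, b = [2.0, -1.0], 0.0
logit(x) = dot(w, x) + b

# Gradient with respect to x′ of: binary logit cross-entropy + λ‖x′ - x‖²
function ∇objective(x′, x, target; λ=0.1)
    return (sigmoid(logit(x′)) - target) .* w .+ 2λ .* (x′ .- x)
end

# Gradient descent on the counterfactual objective until the target class
# is predicted with probability of at least γ (or max_iter is reached).
function counterfactual_search(x, target; η=0.1, γ=0.75, max_iter=1000, λ=0.1)
    x′ = copy(x)
    for _ in 1:max_iter
        p = sigmoid(logit(x′))
        p_target = target == 1 ? p : 1 - p
        p_target >= γ && break
        x′ .-= η .* ∇objective(x′, x, target; λ=λ)
    end
    return x′
end

x = [-1.0, 1.0]                   # factual, predicted as class 0
x′ = counterfactual_search(x, 1)  # counterfactual for target class 1
```

The penalty term keeps the counterfactual close to the factual, while the loss term pulls it across the decision boundary; `λ` trades off the two.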

Once the model and counterfactual generator are specified, running counterfactual search is very easy using the package. For a given factual (`x`), target class (`target`) and data set (`counterfactual_data`), simply running

`generate_counterfactual(x, target, counterfactual_data, M, generic)`

will generate the results, in this case using the generic generator (`generic`) for the MLP (`M`). Since we have specified two different black-box models and two different counterfactual generators, we have four combinations of a model and a generator in total. For each of these combinations I have used the `generate_counterfactual` function to produce the results in Figure 1.

In every case the desired label switch is in fact achieved, but arguably from a human perspective only the counterfactuals for the deep ensemble look like a four. The generic generator produces mild perturbations in regions that seem irrelevant from a human perspective, but nonetheless yields a counterfactual that can pass as a four. The greedy approach clearly targets pixels at the top of the handwritten nine and yields the best result overall. For the non-Bayesian MLP, both the generic and the greedy approach generate counterfactuals that look much like adversarial examples: they perturb pixels in seemingly random regions on the image.

The Julia language offers unique support for programming language interoperability. For example, calling R or Python is made remarkably easy through `RCall.jl` and `PyCall.jl`, respectively. This functionality can be leveraged to use `CounterfactualExplanations.jl` to generate explanations for models that were developed in other programming languages. At this time there is no native support for foreign programming languages, but the following example involving a `torch` neural network trained in `R` demonstrates how versatile the package is.^{3}

`torch` model

We will consider a simple MLP trained for a binary classification task. As before we first need to adapt this custom model for use with our package. The code below implements the two necessary steps: sub-typing and method extension. Logits are returned by the `torch` model and copied from the R environment into the Julia scope. Probabilities are then computed inside the Julia scope by passing the logits through the sigmoid function.

```
using Flux
using RCall # provides the R"..." string macro and rcopy
using CounterfactualExplanations, CounterfactualExplanations.Models
import CounterfactualExplanations.Models: logits, probs # import functions in order to extend
# Step 1)
struct TorchNetwork <: Models.AbstractFittedModel
    nn::Any
end
# Step 2)
function logits(M::TorchNetwork, X::AbstractArray)
    nn = M.nn
    # Forward pass in R; the result is copied back into Julia:
    y = rcopy(R"as_array($nn(torch_tensor(t($X))))")
    y = isa(y, AbstractArray) ? y : [y]
    return y'
end
function probs(M::TorchNetwork, X::AbstractArray)
    return σ.(logits(M, X))
end
M = TorchNetwork(R"model")
```

Compared to models trained in Julia, we need to do a little more work at this point. Since our counterfactual generators need gradient access, we essentially need to allow our package to communicate with the R `torch` library. While this may sound daunting, it turns out to be quite manageable: all we have to do is respecify the function that computes the gradient with respect to the counterfactual loss function so that it can deal with the `TorchNetwork` type we defined above. That is all the adjustment needed to use `CounterfactualExplanations.jl` for our custom R model. Figure 2 shows a counterfactual path for a randomly chosen sample with respect to the MLP trained in R.

Experimental functionality

You may have stumbled across the term *respecify* above: does it really seem like a good idea to just replace an existing function from our package? Surely not! There are certainly better ways to go about this, which we will consider when adding native support for Python and R models in future package releases. Which brings us to our final section …

```
import CounterfactualExplanations.Generators: ∂ℓ
using LinearAlgebra
# Counterfactual loss:
function ∂ℓ(
    generator::AbstractGradientBasedGenerator,
    counterfactual_state::CounterfactualState)
    M = counterfactual_state.M
    nn = M.nn
    x′ = counterfactual_state.x′
    t = counterfactual_state.target_encoded
    # Compute the loss and back-propagate inside the R session:
    R"""
    x <- torch_tensor($x′, requires_grad=TRUE)
    output <- $nn(x)
    loss_fun <- nnf_binary_cross_entropy_with_logits
    obj_loss <- loss_fun(output,$t)
    obj_loss$backward()
    """
    # Copy the gradient back into Julia:
    grad = rcopy(R"as_array(x$grad)")
    return grad
end
```

The ambition for `CounterfactualExplanations.jl` is to provide a go-to place for counterfactual explanations to the Julia community and beyond. This is a grand ambition, especially for a package that has so far been built by a single developer who has little prior experience with Julia. We would therefore very much like to invite community contributions. If you have an interest in trustworthy AI, the open-source community and Julia, please do get involved! This package is still in its early stages of development, so any kind of contribution is welcome: advice on the core package architecture, pull requests, issues, discussions and even just comments below would be much appreciated.

To give you a flavor of what type of future developments we envision, here is a non-exhaustive list:

- Native support for additional counterfactual generators and predictive models including those built and trained in Python or R.
- Additional datasets for testing, evaluation and benchmarking.
- Improved preprocessing including native support for categorical features.
- Support for regression models.

Finally, if you like this project but don’t have much time, then simply sharing this article or starring the repo on GitHub would also go a long way.

If you’re interested in learning more about this development, feel free to check out the following resources:

- Package docs: [stable], [dev].
- Contributor’s guide.
- GitHub repo.

Antorán, Javier, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel Hernández-Lobato. 2020. “Getting a Clue: A Method for Explaining Uncertainty Estimates.” https://arxiv.org/abs/2006.06848.

Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” https://arxiv.org/abs/1907.09615.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” https://arxiv.org/abs/1612.01474.

LeCun, Yann. 1998. “The MNIST Database of Handwritten Digits.”

Pawelczyk, Martin, Sascha Bielawski, Johannes van den Heuvel, Tobias Richter, and Gjergji Kasneci. 2021. “Carla: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms.” https://arxiv.org/abs/2108.00783.

Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” *Nature Machine Intelligence* 1 (5): 206–15.

Schut, Lisa, Oscar Key, Rory Mc Grath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In *International Conference on Artificial Intelligence and Statistics*, 1756–64. PMLR.

Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” *Harv. JL & Tech.* 31: 841.

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {A New Tool for Explainable {AI}},
date = {2022-04-20},
url = {https://www.paltmeyer.com/blog//blog/posts/a-new-tool-for-explainable-ai},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “A New Tool for Explainable AI.” April 20, 2022. https://www.paltmeyer.com/blog//blog/posts/a-new-tool-for-explainable-ai.

Does your work involve research, coding, writing and publishing? If so, then chances are that you often find yourself bouncing back and forth between different open-source text editors, IDEs, programming languages and platforms depending on your current needs. Using a diverse set of tools is reasonable, because there typically is no single perfect approach that solves all our problems. For example, interactive notebooks like Jupyter are useful for working with code and communicating it to others, but they are probably not anyone’s first choice for producing a scientific article. Similarly, Beamer presentations can be useful for presenting science in a standardized fashion, but they are the very opposite of interactive and look incredibly boring.

As much as the great variety of free tools deserves being celebrated, all this bouncing back and forth can be really tiring. What if there was a single tool, an engine that can turn your work into all kinds of different outputs? I mean literally any output you can think of: Markdown, HTML, PDF, LaTeX, ePub, entire websites, presentations (yes, also Beamer if you have to), MS Word, OpenOffice, … the list goes on. All of that starting from the same place: a plain Markdown document blended with essentially any programming language of your choice and a YAML header defining your output. This tool now exists and it goes by the name Quarto.

In this short blog post I hope to convince you that Quarto is the only publishing engine you will ever need. What I am definitely not going to tell you is which IDE, text editor or programming language you should be using to actually produce your work. Quarto does not care about that. Quarto is here to make your life a bit easier (and by ‘a bit’ I mean a whole lot). Quarto is nothing less than a revolution for scientific publishing.

To put this all in some context (well, my context), I will now tell you a bit about what has led me to making such bold claims about yet another open-source tool.

Hold up?! Wasn’t this supposed to be about Julia and Quarto?

Yes! But it’s worth noting that a lot of the benefits that Quarto brings have been available to R users for many years, thanks to the amazing work of many great open-source contributors like @xieyihui. Julia was the main reason for me to branch out of this comfortable R bubble as I describe below. That said, if you are a Julia user who really couldn’t care less about my previous experiences with R Markdown, this is a good time to skip straight ahead to Section 2. By the way, if you haven’t clicked on that link, here’s a small showcase demonstrating how it was generated. It shows how easy it is to have everything well organised and connected with Quarto.

Cross-referencing

There is a standard recipe for generating cross-references in Quarto. The example below involves a section cross-reference.

```
If you are a Julia user that really couldn't care less about my previous experiences with R Markdown, this is a good time to skip straight ahead to @sec-match.
## Quarto and Julia - a perfect match {#sec-match}
```

There is actually a comprehensive 8-step guide explaining how to achieve something similar in MS Word, but personally I wouldn’t go there. Anyway, take your pick:

- 🔴 Go there.
- 🟢 Move on to Section 1 #safespace.

For many years I have used R Markdown for essentially anything work-related. As an undergraduate economics student facing the unfortunate reality that people still teach Stata, I was drawn to R. This was partially because R has a great open-source community and also partially because Stata. Once I realised that I would be able to use R Markdown to write up all of my future homework assignments and even my thesis, I never looked back. MS Word was now officially dead to me. Overleaf was nothing more than a last resort if everyone else in my team insisted on using it for a group project. Being able to write my undergraduate dissertation in R Markdown was a first truly triumphant moment. Soon after that I would also try my hand at Shiny, produce outputs in HTML and build entire websites through `blogdown`. And all of that from within R Studio involving R and Markdown and really not much else. During my first professional job at the Bank of England I was reluctant to use anything other than R Markdown to produce all of my output. Luckily for me, the Bank was very much heading in that same direction at the time and my reluctance was not perceived as stubbornness, but actually welcome (at least I hoped so).

Soon though, part of me felt a little boxed in. For any work that required me to look outside of the R bubble, I knew I might also have to give up a very, very comfortable work environment and my productivity would surely take a hit. During my master’s in Data Science, for example, the mantra was very much “Python + Jupyter or die”. Through `reticulate` and R Studio’s growing support for Python I managed to get by without having to leave my bubble too often. But `reticulate` always felt a little clunky (sorry!) and some professors were reluctant to accept anything other than Jupyter notebooks. Even if others had not perceived it that way in the past, I certainly started to feel that I might just be a little too attached to the beautiful bubble that R Studio had created around me.

Then there was Julia: elegant, fast, pure, scientific and - oh my REPL! - those beautiful colors and unicode symbols. The stuff of dreams, really! Geeky dreams, but dreams nonetheless. I had once before given Julia a shot when working with high-frequency trade data for a course in market microstructure. This was the first time R really revealed its limitations to me and my bubble nearly burst, but thanks to `data.table` and `Rcpp` I managed to escape with only minor bruises. Still, Julia kept popping up, teasing me whenever I would work on some Frankenstein-style C++ code snippets that would hopefully resolve my R bottlenecks. I actually enjoyed mixing **some** C++ into my R code like I did here, but the process was just a little painful and slow. But wouldn’t learning **all of** Julia take even more time and patience? And what about my dependence on R Markdown?

As I started my PhD in September 2021, I eventually gave in. New beginnings - time to suck it up! If it meant that I’d have to use Jupyter notebooks with Julia, so be it! And so I was off to a somewhat bumpy start that would have me bouncing back and forth between trying to make Julia work in R Studio (meh), setting up Jupyter Lab (meeeh), just using the Julia REPL because “the REPL is all you need” (nope) and even struggling with Vim and Emacs. Then there was also `Pluto.jl`, of course, which admittedly looks amazing! But it also looks very much tailored to Julia and (I believe) the number of different output formats you can produce is still very limited. Eventually, I settled for VSCode in combination with Jupyter notebooks. As much as I dreaded the latter, Jupyter is popular, arguably versatile and supports both R and Julia. This setup worked well enough for me, but it still definitely fell short of the breeze that R Studio had always provided. One thing that really bugged me, for example, was the fact that the IJulia kernel was not accessible from the Julia REPL. Each notebook would have its own environment, which could only be accessed through the notebook. In R Studio the interaction between R Markdown and the console is seamless, as both have access to the same environment variables.

Around the same time that I started using Julia, I read about Quarto for the first time. It looked … great! Like a timely little miracle really! But also … unfinished? Definitely experimental at the time. I loved the idea though and in a footnote somewhere on their website it said that the project was supported by R Studio which I took as a very good sign. So I decided to at least give it a quick try and built a small (tiny) website summarising some of the literature I had read for my PhD:

Just had my first go #quarto and I absolutely love the concept! Open-source and language agnostic - truly amazing work from @rstudio https://t.co/veCg7ywQ8v

— Patrick Altmeyer (@paltmey) October 29, 2021

This was a first very pleasant encounter with Quarto, arguably even smoother than building websites in `blogdown`. As for working with Julia though, I had made up my mind that VSCode was the way to go and at the time there was no Quarto extension (there is now). There was also little in terms of communication about the project by R Studio, probably because things were really still in the early development stages. I was hopeful that eventually Quarto would enable me to emulate the R Studio experience in VS Code, but for now things were not quite there yet.

Since I was now working with VSCode + Jupyter and since Quarto supports Jupyter as well as all of my old R Markdown work, my next little Quarto project involved turning my old `blogdown`-powered blog into a Quarto-powered blog. This was not strictly necessary, as I could always export my new Jupyter notebooks to HTML and let `blogdown` do the rest. But it did streamline things a little bit and the default Quarto blog theme - you are staring at it - is actually 🔥. I also did not have to feel guilty towards @xieyihui about leaving `blogdown`, because unsurprisingly he is on the Quarto team. As I was working on this little project I started noticing that the Quarto website was updated regularly and issues I opened, like this one, were answered very swiftly. Clearly, things were moving and they were moving fast. More recently, the news about Quarto has been spreading and it’s left some folks as confused and amazed as I was, when I first heard about it:

#RStats can someone explain to me what's the difference between {Quarto} and {RMarkdown}? I saw a tweet about Quarto and now I'm all confused … What gap is it supposed to fill?

— Erwin Lares (@lasrubieras) March 30, 2022

This is why finally I’ve decided I should write a brief post about how and why I use Quarto. Since I have been working mostly with Julia for the past couple of months, I’ve chosen to focus on the interaction between Quarto and Julia. Coincidentally, yesterday was also the first time I saw a guide dedicated to Julia on the Quarto website, so evidently I am not the only one interested in that marriage. This also means that there really is not too much left for me to talk about now, since Quarto’s documentation is state-of-the-art. But a few bits and pieces I mention below might hopefully still be useful or at least some food for thought.

While what follows may be relevant to other programming languages, my main goal for this last section is to flag Quarto to the Julia community. In any case, #rstats folks have been using R and Python in R Markdown documents for a while now and won’t need much of an introduction to Quarto. As for Python aficionados, I can only recommend giving Quarto a shot (you will still be able to use Jupyter notebooks).

The very article you are reading right now was composed in a Quarto document. These documents feel and look very much like standard Julia Markdown documents, but you can do a lot more with them. You can find the source code for this and other documents presented in this blog here.

To get you started, here is my current setup combining VSCode, Quarto and Julia:

- VSCode extensions: in addition to the Julia extension you will need the Quarto extension. The YAML extension and some extension to preview Markdown docs would also be helpful. I am not sure if Markdown Julia and Jupyter are strictly necessary, but they won’t hurt.
- I do most of my work in Quarto documents (`.qmd`).
- If you choose to also do that, make sure that the `.qmd` document has access to a `Pkg.jl` environment that has `IJulia` added.

Julia code cells can be added anywhere along with your plain text Markdown. They look like this:

````
```{julia}
using Pkg
Pkg.add("CounterfactualExplanations")
```
````

Contrary to Jupyter notebooks, executing these code cells will start a Julia REPL in VSCode. I find this very helpful, because it lets me fiddle with anything I have created inside the Quarto notebook without having to click into cells all the time. Quarto comes with great support for specifying code execution options. For example, for the code below I have specified `#| echo: true` in order for the code to be rendered. The code itself is the code I actually used to build the animation above (heavily borrowed from this `Javis.jl` tutorial).

```
#| echo: true
using Javis, Animations, Colors

size = 600
radius_factor = 0.33

function ground(args...)
    background("transparent")
    sethue("white")
end

function rotate_anim(idx::Number, total::Number)
    distance_circle = 0.875
    steps = collect(range(distance_circle,1-distance_circle,length=total))
    Animation(
        [0, 1], # must go from 0 to 1
        [0, steps[idx]*2π],
        [sineio()],
    )
end

translate_anim = Animation(
    [0, 1], # must go from 0 to 1
    [O, Point(size*radius_factor, 0)],
    [sineio()],
)

translate_back_anim = Animation(
    [0, 1], # must go from 0 to 1
    [O, Point(-(size*radius_factor), 0)],
    [sineio()],
)

julia_colours = Dict(
    :blue => "#4063D8",
    :green => "#389826",
    :purple => "#9558b2",
    :red => "#CB3C33"
)
colour_order = [:red, :purple, :green, :blue]
n_colours = length(julia_colours)

function color_anim(start_colour::String, quarto_col::String="#4b95d0")
    Animation(
        [0, 1], # must go from 0 to 1
        [Lab(color(start_colour)), Lab(color(quarto_col))],
        [sineio()],
    )
end

video = Video(size, size)
frame_starts = 1:10:40
n_total = 250
n_frames = 150
Background(1:n_total, ground)

# Blob:
function element(; radius = 1)
    circle(O, radius, :fill) # filled circle centred at the origin
end

# Cross:
function cross(color="black";orientation=:horizontal)
    sethue(color)
    setline(10)
    if orientation==:horizontal
        out = line(Point(-size,0),Point(size,0), :stroke)
    else
        out = line(Point(0,-size),Point(0,size), :stroke)
    end
    return out
end

for (i, frame_start) in enumerate(frame_starts)
    # Julia circles:
    blob = Object(frame_start:n_total, (args...;radius=1) -> element(;radius=radius))
    act!(blob, Action(1:Int(round(n_frames*0.25)), change(:radius, 1 => 75))) # scale up
    act!(blob, Action(n_frames:(n_frames+50), change(:radius, 75 => 250))) # scale up further
    act!(blob, Action(1:30, translate_anim, translate()))
    act!(blob, Action(31:120, rotate_anim(i, n_colours), rotate_around(Point(-(size*radius_factor), 0))))
    act!(blob, Action(121:150, translate_back_anim, translate()))
    act!(blob, Action(1:150, color_anim(julia_colours[colour_order[i]]), sethue()))
    # Quarto cross:
    cross_h = Object((n_frames+50):n_total, (args...) -> cross(;orientation=:horizontal))
    cross_v = Object((n_frames+50):n_total, (args...) -> cross(;orientation=:vertical))
end

render(
    video;
    pathname = joinpath(www_path, "intro.gif"), # `www_path` (output folder) is assumed to be defined earlier
)
```

`Documenter.jl` and Quarto

An interesting application of Quarto in the Julia ecosystem is package documentation. This is of course best done using `Documenter.jl` and fortunately the two play nicely with each other, since both share a common ground (Markdown). Their interaction is perhaps best demonstrated through this Julia library I recently developed: `CounterfactualExplanations.jl`. On there you will find a lot of Julia scripts (`*.jl`) under `src/` and `test/`, as well as many Markdown (`.md`) and Quarto documents (`.qmd`) under `docs`. I *wrote* the package documentation in the Quarto documents, *rendered* the documents individually through `quarto render [doc].qmd` and then fed the resulting Markdown documents to `Documenter.jl` as always.
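Rendering each document by hand gets tedious once there are more than a handful of them. As a minimal sketch (the helper names `find_qmd` and `render_commands` are made up for illustration and are not part of Quarto or `Documenter.jl`), one could collect all `.qmd` files below a folder and build one `quarto render` command per file:

```julia
# Collect all Quarto documents below `root` (hypothetical helper).
function find_qmd(root::AbstractString)
    docs = String[]
    for (dir, _, files) in walkdir(root)
        for f in files
            endswith(f, ".qmd") && push!(docs, joinpath(dir, f))
        end
    end
    return sort(docs)
end

# Build (but do not run) one `quarto render` command per document.
render_commands(root::AbstractString) = [`quarto render $doc` for doc in find_qmd(root)]

# To actually render, assuming the Quarto CLI is on the PATH:
# foreach(run, render_commands("docs"))
```

Building the commands separately from running them makes it easy to inspect what would be rendered before handing the resulting Markdown files over to `Documenter.jl`.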

Below is my standard YAML header for those Quarto documents:

```
format:
  commonmark:
    variant: -raw_html
    wrap: none
    self-contained: true
crossref:
  fig-prefix: Figure
  tbl-prefix: Table
bibliography: https://raw.githubusercontent.com/pat-alt/bib/main/bib.bib
output: asis
execute:
  echo: true
  eval: false
jupyter: julia-1.7
```

You can see that it points to a BibTeX file I host on another GitHub repository. This makes it very easy to generate citations and references for the rendered Markdown documents, which also show up in the docs (e.g. here). Unfortunately, cross-referencing only partially works, because it relies on auto-generated HTML and `Documenter.jl` expects this to be passed in blocks. Choosing `variant: -raw_html` is only a workaround as I have discussed here. Ideally, `Documenter.jl` would just accept HTML documents rendered from Quarto, but currently only Markdown documents are accepted by `makedocs`. Still, if anything this workaround is a nice gimmick that extends the default `Documenter.jl` functionality, without any hassle involved. Hopefully, this can be improved in the future.

Another very good use-case for Quarto involves actual scientific publications in journals such as JuliaCon Proceedings. The existing submission process is tailored towards reproducibility and actually involves reviews directly on GitHub, which is fantastic. But currently only submissions in TeX format are accepted, which is not so great. Using Quarto would not only streamline this process further, but also open the JuliaCon Proceedings Journal up to publishing content in different output formats. Quarto docs could be used to still render the traditional PDF. But those same documents could also be used to create interactive versions in HTML. Arguably, the entire journal could probably be built through Quarto.

In this post I wanted to demonstrate that Quarto might just be the next revolution in scientific publishing. In particular, I hope I have managed to demonstrate its appeal to the Julia community, which I am proud to be part of now that I have managed to branch out of my old R bubble. Please let me hear your thoughts and comments below!

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {Julia and {Quarto:} A Match Made in Heaven? 🌤},
date = {2022-04-07},
url = {https://www.paltmeyer.com/blog//blog/posts/julia-and-quarto-a-match-made-in-heaven},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “Julia and Quarto: A Match Made in Heaven? 🌤.” April 7, 2022. https://www.paltmeyer.com/blog//blog/posts/julia-and-quarto-a-match-made-in-heaven.

Deep learning has dominated AI research in recent years^{1} - but how much promise does it really hold? That is very much an ongoing and increasingly polarising debate that you can follow live on Twitter. On one side you have optimists like Ilya Sutskever, chief scientist of OpenAI, who believes that large deep neural networks may already be slightly conscious - that’s “may” and “slightly” and only if you just go deep enough? On the other side you have prominent skeptics like Judea Pearl who has long since argued that deep learning still boils down to curve fitting - purely associational and not even remotely intelligent (Pearl and Mackenzie 2018).

Whatever side of this entertaining twitter dispute you find yourself on, the reality is that deep-learning systems have already been deployed at large scale both in academia and industry. More pressing debates therefore revolve around the trustworthiness of these existing systems. How robust are they and in what way exactly do they arrive at decisions that affect each and every one of us? Robustifying deep neural networks generally involves some form of adversarial training, which is costly, can hurt generalization (Raghunathan et al. 2019) and does ultimately not guarantee stability (Bastounis, Hansen, and Vlačić 2021). With respect to interpretability, surrogate explainers like LIME and SHAP are among the most popular tools, but they too have been shown to lack robustness (Slack et al. 2020).

Exactly why are deep neural networks unstable and opaque? Let $\mathcal{D}=\{(x_n,y_n)\}_{n=1}^N$ denote our feature-label pairs and let $f_\theta$ denote some deep neural network specified by its parameters $\theta$. Then the first thing to note is that the number of free parameters $\theta$ is typically huge (if you ask Mr Sutskever it really probably cannot be huge enough!). That alone makes it very hard to monitor and interpret the inner workings of deep-learning algorithms. Perhaps more importantly though, the number of parameters *relative* to the size of $\mathcal{D}$ is generally huge:

[…] deep neural networks are typically very underspecified by the available data, and […] parameters [therefore] correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)

In other words, training a single deep neural network may (and usually does) lead to one random parameter specification that fits the underlying data very well. But in all likelihood there are many other specifications that also fit the data very well. This is both a strength and vulnerability of deep learning: it is a strength because it typically allows us to find one such “compelling explanation” for the data with ease through stochastic optimization; it is a vulnerability because one has to wonder:

How compelling is an explanation really if it competes with many other equally compelling, but potentially very different explanations?

A scenario like this very much calls for treating predictions from deep learning models probabilistically (Wilson 2020)^{2}^{3}.

Formally, we are interested in estimating the posterior predictive distribution as the following Bayesian model average (BMA):

$$
p(y|x,\mathcal{D}) = \int p(y|x,\theta)\,p(\theta|\mathcal{D})\,d\theta
$$

The integral implies that we essentially need many predictions from many different specifications of $\theta$. Unfortunately, this means more work for us or rather our computers. Fortunately though, researchers have proposed many ingenious ways to approximate the equation above in recent years: Gal and Ghahramani (2016) propose using dropout at test time while Lakshminarayanan, Pritzel, and Blundell (2016) show that averaging over an ensemble of just five models seems to do the trick. Still, despite their simplicity and usefulness these approaches involve additional computational costs compared to training just a single network. As we shall see now though, another promising approach has recently entered the limelight: **Laplace approximation** (LA).
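To make the model-averaging idea concrete before moving on, here is a minimal sketch of the ensemble flavour of it. Everything in it is made up for illustration: the function `bma_predict`, the three weight vectors in `Θ` and the test point `x_new`. It simply averages the predictions of several linear classifiers that disagree away from the training data:

```julia
using Statistics

# Logistic sigmoid.
σ(a) = 1 / (1 + exp(-a))

# Hypothetical ensemble: each column of `Θ` holds the weights of one fitted linear classifier.
# The Bayesian model average is approximated by the mean of the members' predictions.
function bma_predict(Θ::AbstractMatrix, x::AbstractVector)
    return mean(σ(θ' * x) for θ in eachcol(Θ))
end

# Three made-up weight vectors that disagree strongly at the new point.
Θ = [2.0 1.0 0.5;
     0.5 1.0 2.0]
x_new = [3.0, -3.0]

p_members = [σ(θ' * x_new) for θ in eachcol(Θ)]  # individual predictions, some very confident
p_bma = bma_predict(Θ, x_new)                     # averaged, more cautious prediction
```

Each member on its own may be very sure of itself, while the average reflects the disagreement between them.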

If you have read my previous post on Bayesian Logistic Regression, then the term Laplace should already sound familiar to you. As a matter of fact, we will see that all concepts covered in that previous post can be naturally extended to deep learning. While some of these concepts will be revisited below, I strongly recommend you check out the previous post before reading on here. Without further ado let us now see how LA can be used for truly effortless deep learning.

While LA was first proposed in the 18th century, it has so far not attracted serious attention from the deep learning community largely because it involves a possibly large Hessian computation. Daxberger et al. (2021) are on a mission to change the perception that LA has no use in DL: in their NeurIPS 2021 paper they demonstrate empirically that LA can be used to produce Bayesian model averages that are at least on par with existing approaches in terms of uncertainty quantification and out-of-distribution detection and significantly cheaper to compute. They show that recent advancements in automatic differentiation can be leveraged to produce fast and accurate approximations of the Hessian and even provide a fully-fledged Python library that can be used with any pretrained Torch model. For this post, I have built a much less comprehensive, pure-play equivalent of their package in Julia - LaplaceRedux.jl can be used with deep learning models built in Flux.jl, which is Julia’s main DL library. As in the previous post on Bayesian logistic regression I will rely on Julia code snippets instead of equations to convey the underlying maths. If you’re curious about the maths, the NeurIPS 2021 paper provides all the detail you need.

Let’s recap: in the case of logistic regression we had assumed a zero-mean Gaussian prior $p(w)=\mathcal{N}(w|0,H_0^{-1})$ for the weights $w$ that are used to compute logits $w^Tx_n$, which in turn are fed to a sigmoid function to produce probabilities $\mu_n=\sigma(w^Tx_n)$. We saw that under this assumption solving the logistic regression problem corresponds to minimizing the following differentiable loss function:

$$
\ell(w) = -\sum_{n=1}^{N}\left[y_n\log\mu_n+(1-y_n)\log(1-\mu_n)\right]+\frac{1}{2}w^TH_0w
$$

As our first step towards Bayesian deep learning, we observe the following: the loss function above corresponds to the objective faced by a single-layer artificial neural network with sigmoid activation and weight decay^{4}. In other words, regularized logistic regression is equivalent to a very simple neural network architecture and hence it is not surprising that underlying concepts can in theory be applied in much the same way.

So let’s quickly recap the next core concept: LA relies on the fact that the second-order Taylor expansion of our loss function evaluated at the **maximum a posteriori** (MAP) estimate amounts to a multi-variate Gaussian distribution. In particular, that Gaussian is centered around the MAP estimate with covariance equal to the inverse Hessian evaluated at the mode (Murphy 2022).
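Spelled out, and writing $\hat\theta$ for the MAP estimate and $H$ for the Hessian of the loss $\ell$ at $\hat\theta$ (notation assumed here for illustration), the argument runs roughly as follows. The gradient vanishes at the mode, so the second-order expansion has no linear term:

$$
\ell(\theta) \approx \ell(\hat\theta) + \frac{1}{2}(\theta-\hat\theta)^TH(\theta-\hat\theta)
$$

Since the posterior is proportional to $\exp(-\ell(\theta))$, exponentiating the right-hand side gives the kernel of a Gaussian:

$$
p(\theta|\mathcal{D}) \approx \mathcal{N}\left(\theta|\hat\theta, H^{-1}\right)
$$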

That is basically all there is to the story: if we have a good estimate of the Hessian at the mode, we have an analytical expression for an (approximate) posterior over parameters. So let’s go ahead and start by running Bayesian logistic regression using Flux.jl. We begin by loading some required packages including LaplaceRedux.jl. It ships with a helper function `toy_data_linear` that creates a toy data set composed of linearly separable samples evenly balanced across the two classes.

```
# Import libraries.
using Flux, Plots, Random, PlotThemes, Statistics, LaplaceRedux
theme(:wong)
# Number of points to generate.
xs, y = toy_data_linear(100)
X = hcat(xs...); # bring into tabular format
data = zip(xs,y);
```

Then we proceed to prepare the single-layer neural network with weight decay. The term $\lambda$ determines the strength of the penalty: we regularize parameters $\theta$ more heavily for higher values. Equivalently, we can say that from the Bayesian perspective it governs the strength of the zero-mean Gaussian prior on $\theta$: a higher value of $\lambda$ indicates a higher conviction about our prior belief that $\theta=0$, which is of course equivalent to regularizing more heavily. The exact choice of $\lambda=0.5$ for this toy example is somewhat arbitrary (it made for good visualizations below). Note that I have used $\theta$ to denote our neural parameters to distinguish the case from Bayesian logistic regression, but we are in fact still solving the same problem.

```
nn = Chain(Dense(2,1))
λ = 0.5
sqnorm(x) = sum(abs2, x)
weight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))
loss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization();
```

Before we apply Laplace approximation we train our model:

```
using Flux.Optimise: update!, ADAM
opt = ADAM()
epochs = 50
for epoch = 1:epochs
    for d in data
        gs = gradient(params(nn)) do
            l = loss(d...)
        end
        update!(opt, params(nn), gs)
    end
end
```

Up until this point we have just followed the standard recipe for training a regularized artificial neural network in Flux.jl for a simple binary classification task. To compute the Laplace approximation using LaplaceRedux.jl we need just two more lines of code:

```
la = laplace(nn, λ=λ)
fit!(la, data);
```

Under the hood the Hessian is approximated through the **empirical Fisher**, which can be computed using only the gradients of our loss function evaluated at the individual training samples (see NeurIPS 2021 paper for details). Finally, LaplaceRedux.jl ships with a function `predict(𝑳::LaplaceRedux, X::AbstractArray; link_approx=:probit)` that computes the posterior predictive using a probit approximation, much like we saw in the previous post. That function is used under the hood of the `plot_contour` function below to create the right panel of Figure 1. It visualizes the posterior predictive distribution in the 2D feature space. For comparison I have added the corresponding plugin estimate as well. Note how for the Laplace approximation the predicted probabilities fan out, indicating that confidence decreases in regions where data is scarce.

```
p_plugin = plot_contour(X',y,la;title="Plugin",type=:plugin);
p_laplace = plot_contour(X',y,la;title="Laplace")
# Plot the posterior distribution with a contour plot.
plt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))
savefig(plt, "www/posterior_predictive_logit.png");
```
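For intuition about the empirical Fisher mentioned above, here is a minimal sketch for the logistic regression case. It is not the code used inside LaplaceRedux.jl: the function `empirical_fisher`, its keyword `prior_precision` and the toy data are all assumptions made for illustration. The idea is to sum the outer products of the per-sample gradients and add the prior precision:

```julia
using LinearAlgebra

σ(a) = 1 / (1 + exp(-a))

# Empirical Fisher for a logistic model: sum of outer products of the per-sample
# gradients of the negative log-likelihood, plus the (assumed isotropic) prior precision.
function empirical_fisher(w, X, y; prior_precision=1.0)
    D = length(w)
    F = zeros(D, D)
    for n in eachindex(y)
        x = X[n, :]
        g = (σ(w' * x) - y[n]) * x   # gradient of the n-th loss term
        F += g * g'
    end
    return F + prior_precision * I
end

# Made-up data: first column is the constant.
X = [1.0 2.0; 1.0 -1.0; 1.0 0.5]
y = [1.0, 0.0, 1.0]
w = [0.1, 0.5]

F = empirical_fisher(w, X, y)
Σ = inv(F)   # approximate posterior covariance
```

Only first-order gradients are needed, which is what makes this approximation cheap compared to computing second derivatives directly.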

Now let’s step it up a notch: we will repeat the exercise from above, but this time for data that is not linearly separable using a simple MLP instead of the single-layer neural network we used above. The code below is almost the same as above, so I will not go through the various steps again.

```
# Number of points to generate:
xs, y = toy_data_non_linear(200)
X = hcat(xs...); # bring into tabular format
data = zip(xs,y)
# Build MLP:
n_hidden = 32
D = size(X)[1]
nn = Chain(
    Dense(D, n_hidden, σ),
    Dense(n_hidden, 1)
)
λ = 0.01
sqnorm(x) = sum(abs2, x)
weight_regularization(λ=λ) = 1/2 * λ^2 * sum(sqnorm, Flux.params(nn))
loss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y) + weight_regularization()
# Training:
epochs = 200
for epoch = 1:epochs
    for d in data
        gs = gradient(params(nn)) do
            l = loss(d...)
        end
        update!(opt, params(nn), gs)
    end
end
```

Fitting the Laplace approximation is also analogous, but note that this time we have added an argument: `subset_of_weights=:last_layer`. This specifies that we only want to use the parameters of the last layer of our MLP. While we could have used all of them (`subset_of_weights=:all`), Daxberger et al. (2021) find that the last-layer Laplace approximation produces satisfying results, while being computationally cheaper. Figure 2 demonstrates that once again the Laplace approximation yields a posterior predictive distribution that is more conservative than the over-confident plugin estimate.

```
la = laplace(nn, λ=λ, subset_of_weights=:last_layer)
fit!(la, data);
p_plugin = plot_contour(X',y,la;title="Plugin",type=:plugin)
p_laplace = plot_contour(X',y,la;title="Laplace")
# Plot the posterior distribution with a contour plot.
plt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400))
savefig(plt, "www/posterior_predictive_mlp.png");
```

To see why this is a desirable outcome consider the zoomed out version of Figure 2 below: the plugin estimator classifies with full confidence in regions completely devoid of data. Arguably Laplace approximation produces a much more reasonable picture, even though it too could likely be improved by fine-tuning our choice of $\lambda$ and the neural network architecture.

```
zoom=-50
p_plugin = plot_contour(X',y,la;title="Plugin",type=:plugin,zoom=zoom);
p_laplace = plot_contour(X',y,la;title="Laplace",zoom=zoom);
# Plot the posterior distribution with a contour plot.
plt = plot(p_plugin, p_laplace, layout=(1,2), size=(1000,400));
savefig(plt, "www/posterior_predictive_mlp_zoom.png");
```

Recent state-of-the-art research on neural information processing suggests that Bayesian deep learning can be effortless: Laplace approximation for deep neural networks appears to work very well and it does so at minimal computational cost (Daxberger et al. 2021). This is great news, because the case for turning Bayesian is strong: society increasingly relies on complex automated decision-making systems that need to be trustworthy. More and more of these systems involve deep learning which in and of itself is not trustworthy. We have seen that typically there exist various viable parameterizations of deep neural networks each with their own distinct and compelling explanation for the data at hand. When faced with many viable options, don’t put all of your eggs in one basket. In other words, go Bayesian!

To get started with Bayesian deep learning I have found many useful and free resources online, some of which are listed below:

- `Turing.jl` tutorial on Bayesian deep learning in Julia.
- Various RStudio AI blog posts including this one and this one.
- TensorFlow blog post on regression with probabilistic layers.
- Kevin Murphy’s draft text book, now also available as print.

Bastounis, Alexander, Anders C Hansen, and Verner Vlačić. 2021. “The Mathematics of Adversarial Attacks in AI–Why Deep Learning Is Unstable Despite the Existence of Stable Neural Networks.” https://arxiv.org/abs/2109.06098.

Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux-Effortless Bayesian Deep Learning.” *Advances in Neural Information Processing Systems* 34.

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In *International Conference on Machine Learning*, 1050–59. PMLR.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” https://arxiv.org/abs/1612.01474.

Murphy, Kevin P. 2022. *Probabilistic Machine Learning: An Introduction*. MIT Press.

Pearl, Judea, and Dana Mackenzie. 2018. *The Book of Why: The New Science of Cause and Effect*. Basic Books.

Raghunathan, Aditi, Sang Michael Xie, Fanny Yang, John C Duchi, and Percy Liang. 2019. “Adversarial Training Can Hurt Generalization.” https://arxiv.org/abs/1906.06032.

Slack, Dylan, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. “Fooling Lime and Shap: Adversarial Attacks on Post Hoc Explanation Methods.” In *Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society*, 180–86.

Wilson, Andrew Gordon. 2020. “The Case for Bayesian Deep Learning.” https://arxiv.org/abs/2001.10995.

See for example this article in the MIT Technology Review↩︎

In fact, not treating probabilistic deep learning models as such is sheer madness because remember that the underlying parameters are random variables. Frequentists and Bayesians alike will tell you that relying on a single point estimate of random variables is just nuts!↩︎

Proponents of Causal AI like Judea Pearl would argue that the Bayesian treatment still does not go far enough: in their view model explanations can only be truly compelling if they are causally founded.↩︎

See this answer on Stack Exchange for a detailed discussion.↩︎

BibTeX citation:

```
@online{altmeyer2022,
author = {Altmeyer, Patrick},
title = {Go Deep, but Also ... Go {Bayesian!}},
date = {2022-02-18},
url = {https://www.paltmeyer.com/blog//blog/posts/effortsless-bayesian-dl},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2022. “Go Deep, but Also ... Go Bayesian!” February 18, 2022. https://www.paltmeyer.com/blog//blog/posts/effortsless-bayesian-dl.

If you’ve ever searched for evaluation metrics to assess model accuracy, chances are that you found many different options to choose from (too many?). Accuracy is in some sense the holy grail of prediction so it’s not at all surprising that the machine learning community spends a lot of time thinking about it. In a world where more and more high-stakes decisions are being automated, model accuracy is in fact a very valid concern.

But does this recipe for model evaluation seem like a sound and complete approach to automated decision-making? Haven’t we forgotten anything? Some would argue that we need to pay more attention to **model uncertainty**. No matter how many times you have cross-validated your model, the loss metric that it is being optimized against as well as its parameters and predictions remain inherently random variables. Focusing merely on prediction accuracy and ignoring uncertainty altogether can instill a false level of confidence in automated decision-making systems. Any **trustworthy** approach to learning from data should therefore at the very least be transparent about its own uncertainty.

How can we estimate uncertainty around model parameters and predictions? **Frequentist** methods for uncertainty quantification generally involve either closed-form solutions based on asymptotic theory or bootstrapping (see for example here for the case of logistic regression). In Bayesian statistics and machine learning we are instead concerned with modelling the **posterior distribution** over model parameters. This approach to uncertainty quantification is known as **Bayesian Inference** because we treat model parameters in a Bayesian way: we make assumptions about their distribution based on **prior** knowledge or beliefs and update these beliefs in light of new evidence. The frequentist approach avoids the need for being explicit about prior beliefs, which in the past has sometimes been considered as *un*scientific. However, frequentist methods come with their own assumptions and pitfalls (see for example Murphy (2012) for a discussion). Without diving further into this argument, let us now see how **Bayesian Logistic Regression** can be implemented from the bottom up.

In this post we will work with a synthetic toy data set $\mathcal{D}$ composed of $N$ binary labels $y_n\in\{0,1\}$ and corresponding feature vectors $x_n\in\mathbb{R}^D$. Working with synthetic data has the benefit that we have control over the **ground truth** that generates our data. In particular, we will assume that the binary labels $y_n$ are generated by a logistic regression model

$$
y_n \sim \text{Ber}\left(y_n|\sigma(w^Tx_n)\right)
$$

where $\sigma(a)=1/(1+e^{-a})$ is the **sigmoid** (or **logistic**) function (Murphy 2022).^{1} Features are generated from a mixed Gaussian model.

To add a little bit of life to our example we will assume that the binary labels classify samples into cats and dogs, based on their height and tail length. Figure 1 shows the synthetic data in the two-dimensional feature domain. Following an introduction to Bayesian Logistic Regression in the next section we will use the synthetic data to estimate our model.

Estimation usually boils down to finding the vector of parameters $\hat{w}$ that maximizes the likelihood of observing $\mathcal{D}$ under the assumed model. That estimate can then be used to compute predictions for some new unlabelled data set $\{x_m\}_{m=1}^M$.

The starting point for Bayesian Logistic Regression is **Bayes’ Theorem**:

$$
p(w|\mathcal{D}) \propto p(\mathcal{D}|w)\,p(w) \tag{2}
$$

Formally, this says that the posterior distribution of parameters $w$ is proportional to the product of the likelihood of observing $\mathcal{D}$ given $w$ and the prior density of $w$. Applied to our context this can intuitively be understood as follows: our posterior beliefs around $w$ are formed by both our prior beliefs and the evidence we observe. Yet another way to look at this is that maximising Equation 2 with respect to $w$ corresponds to maximum likelihood estimation regularized by prior beliefs (we will come back to this).

Under the assumption that individual label-feature pairs are **independently** and **identically** distributed, their joint likelihood is simply the product over their individual densities. The prior beliefs around $w$ are at our discretion. In practice they may be derived from previous experiments. Here we will use a zero-mean spherical Gaussian prior for reasons explained further below. To sum this up we have

$$
p(\mathcal{D}|w) = \prod_{n=1}^{N} p(y_n|x_n,w), \qquad p(w) = \mathcal{N}(w|w_0,\Sigma_0)
$$

with $w_0=0$ and $\Sigma_0=\sigma^2I$ (here $\sigma^2$ denotes the prior variance, not the sigmoid function). Plugging this into Bayes’ rule we finally have

$$
p(w|\mathcal{D}) \propto \prod_{n=1}^{N}\text{Ber}\left(y_n|\sigma(w^Tx_n)\right)\,\mathcal{N}(w|w_0,\Sigma_0)
$$

Unlike with linear regression there are no closed-form analytical solutions to estimating or maximising this posterior, but fortunately accurate approximations do exist (Murphy 2022). One of the simplest approaches called **Laplace Approximation** is straightforward to implement and computationally very efficient. It relies on the observation that under the assumption of a Gaussian prior, the posterior of logistic regression is also approximately Gaussian: in particular, this Gaussian distribution is centered around the **maximum a posteriori** (MAP) estimate $\hat{w}=\arg\max_w p(w|\mathcal{D})$ with a covariance matrix equal to the inverse Hessian evaluated at the mode, $\hat{\Sigma}=H(\hat{w})^{-1}$. With that in mind, finding $\hat{w}$ seems like a natural next step.

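In practice we do not maximize the posterior directly. Instead we minimize the negative log likelihood, which is an equivalent optimization problem and easier to implement. In Equation 4 below I have denoted the negative log likelihood as $\ell(w)$ indicating that this is the **loss function** we aim to minimize. The following two lines in Equation 4 show the gradient and Hessian - so the first- and second-order derivatives of $\ell$ with respect to $w$ - where $H_0=\Sigma_0^{-1}$ and $\mu_n=\sigma(w^Tx_n)$. To understand how exactly the gradient and Hessian are derived see for example chapter 10 in Murphy (2022).^{2}

$$
\begin{aligned}
\ell(w) &= -\sum_{n=1}^{N}\left[y_n\log\mu_n+(1-y_n)\log(1-\mu_n)\right]+\frac{1}{2}(w-w_0)^TH_0(w-w_0) \\
\nabla_w\ell(w) &= \sum_{n=1}^{N}(\mu_n-y_n)x_n+H_0(w-w_0) \\
\nabla_w^2\ell(w) &= \sum_{n=1}^{N}\mu_n(1-\mu_n)x_nx_n^T+H_0
\end{aligned}
\tag{4}
$$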

**SIDENOTE** 💡

Note how earlier I mentioned that maximising the posterior likelihood can be seen as regularized maximum likelihood estimation. We can now make that connection explicit: in Equation 4 let us assume that $\Sigma_0=\lambda^{-1}I$. Then since $H_0=\lambda I$ with $w_0=0$ the second term in the first line is simply $\frac{\lambda}{2}w^Tw=\frac{\lambda}{2}\|w\|_2^2$. This is equivalent to running logistic regression with an $\ell_2$-penalty (Bishop 2006).

Since minimizing the loss function in Equation 4 is a convex optimization problem we have many efficient algorithms to choose from in order to solve this problem. With the Hessian at hand it seems natural to use a second-order method, because incorporating information about the curvature of the loss function generally leads to faster convergence. Here we will implement **Newton’s method** in line with the presentation in chapter 8 of Murphy (2022).
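As a rough sketch of what such a Newton routine might look like (this is not the code from the original notebook; the function names, the vectorized form of the derivatives, the zero-mean prior and the toy data are all assumptions made for illustration):

```julia
using LinearAlgebra

σ(a) = 1 / (1 + exp(-a))

# Gradient and Hessian of the regularized loss (zero-mean prior with precision matrix H_0).
∇ℓ(w, X, y, H_0) = X' * (σ.(X * w) .- y) + H_0 * w
function ∇∇ℓ(w, X, y, H_0)
    μ = σ.(X * w)
    return X' * Diagonal(μ .* (1 .- μ)) * X + H_0
end

# Plain Newton iterations: w ← w - H⁻¹g until the gradient is numerically zero.
function newton(X, y, H_0; w=zeros(size(X, 2)), tol=1e-10, max_iter=100)
    for _ in 1:max_iter
        g = ∇ℓ(w, X, y, H_0)
        norm(g) < tol && break
        w = w - ∇∇ℓ(w, X, y, H_0) \ g
    end
    return w
end

# Made-up data: first column is the constant.
X = [1.0 0.5; 1.0 1.5; 1.0 -1.0; 1.0 -2.0]
y = [1.0, 1.0, 0.0, 0.0]
H_0 = Matrix(1.0I, 2, 2)   # prior precision
w_map = newton(X, y, H_0)
```

This version takes full Newton steps without a line search, which is usually fine for a small, well-regularized problem like this one but may need safeguarding in general.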

Suppose now that we have trained the Bayesian Logistic Regression model as our binary classifier $g_N(x)$ using our training data $\mathcal{D}$. A new unlabelled sample $x_{N+1}$ arrives. As with any binary classifier we can predict the missing label by simply plugging the new sample into our classifier, $\hat{y}_{N+1}=g_N(x_{N+1})=\sigma(\hat{w}^Tx_{N+1})$, where $\hat{w}$ is the MAP estimate as before. If at training phase we have found $g_N(x)$ to achieve good accuracy, we may expect $(x_{N+1},\hat{y}_{N+1})$ to be a reasonably good approximation of the true and unobserved pair $(x_{N+1},y_{N+1})$. But since we are still dealing with an expected value of a random variable, we would generally like to have an idea of how noisy this prediction is.

Formally, we are interested in the **posterior predictive** distribution:

$$
p(y=1|x,\mathcal{D}) = \int\sigma(w^Tx)\,p(w|\mathcal{D})\,dw \tag{5}
$$

**SIDENOTE** 💡

The approach that ignores uncertainty altogether corresponds to what is referred to as **plugin** approximation of the posterior predictive. Formally, it imposes $p(y=1|x,\mathcal{D})=p(y=1|x,\hat{w})=\sigma(\hat{w}^Tx)$.

With the posterior distribution over model parameters $p(w|\mathcal{D})$ at hand we have the necessary ingredients to estimate the posterior predictive distribution $p(y=1|x,\mathcal{D})$.

An obvious, but computationally expensive way to estimate it is through Monte Carlo: draw $w_s$ from $p(w|\mathcal{D})$ for $s=1,\dots,S$ and compute the fitted value $\sigma(w_s^Tx)$ for each draw. Then the posterior predictive distribution corresponds to the average over all fitted values, $p(y=1|x,\mathcal{D})\approx\frac{1}{S}\sum_{s=1}^{S}\sigma(w_s^Tx)$. By the law of large numbers the Monte Carlo estimate is an accurate estimate of the true posterior predictive for large enough $S$. Of course, “large enough” is somewhat loosely defined here and depending on the problem can mean “very large”. Consequently, the computational costs involved essentially know no upper bound.
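The Monte Carlo recipe can be sketched in a few lines, assuming the Gaussian (Laplace) posterior from above. The function `mc_predict` and the numbers used for the MAP estimate and covariance are made up for illustration:

```julia
using LinearAlgebra, Random, Statistics

σ(a) = 1 / (1 + exp(-a))

# Monte Carlo estimate of p(y=1 | x, D) under a Gaussian posterior N(w_map, Σ_map):
# draw S weight vectors and average the fitted probabilities.
function mc_predict(x, w_map, Σ_map; S=10_000, rng=Random.default_rng())
    L = cholesky(Symmetric(Σ_map)).L   # Σ_map = L * L'
    draws = (w_map + L * randn(rng, length(w_map)) for _ in 1:S)
    return mean(σ(w' * x) for w in draws)
end

w_map = [0.0, 2.0]               # made-up MAP estimate
Σ_map = [1.0 0.0; 0.0 4.0]       # made-up posterior covariance
x = [1.0, 1.5]

p_plugin = σ(w_map' * x)                                     # ignores uncertainty
p_mc = mc_predict(x, w_map, Σ_map; rng=MersenneTwister(42))  # accounts for it
```

With a wide posterior the Monte Carlo estimate is pulled towards 0.5 relative to the plugin estimate, which is exactly the more cautious behaviour we are after.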

Fortunately, it turns out that we can trade off a little bit of accuracy in return for a convenient analytical solution. In particular, we have that $\sigma(a)\approx\Phi(\lambda a)$ where $\Phi(\cdot)$ is the standard Gaussian cdf and $\lambda^2=\pi/8$ ensures that the two functions have the same slope at the origin (Figure 2). Without dwelling further on the details we can use this finding to approximate the integral in Equation 5 as a sigmoid function. This is called **probit approximation** and implemented below.
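In its standard textbook form the probit approximation reads $p(y=1|x,\mathcal{D})\approx\sigma\left(\hat{w}^Tx\big/\sqrt{1+\tfrac{\pi}{8}x^T\hat{\Sigma}x}\right)$. A minimal sketch of that formula follows; the function name `probit_predict` and the example values are assumptions for illustration and not taken from the original notebook:

```julia
σ(a) = 1 / (1 + exp(-a))

# Probit approximation: shrink the MAP logit by a factor that grows with the
# variance of the logit under the Gaussian posterior N(w_map, Σ_map).
function probit_predict(x, w_map, Σ_map)
    μ = w_map' * x             # mean of the logit
    v = x' * Σ_map * x         # variance of the logit
    κ = 1 / sqrt(1 + π / 8 * v)
    return σ(κ * μ)
end

x = [1.0, 1.5]
w_map = [0.0, 2.0]
p_certain = probit_predict(x, w_map, zeros(2, 2))              # no posterior uncertainty
p_uncertain = probit_predict(x, w_map, [1.0 0.0; 0.0 4.0])     # wide posterior
```

With zero posterior variance the result collapses to the plugin estimate, and it moves towards 0.5 as the variance of the logit grows.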

We now have all the necessary ingredients to code Bayesian Logistic Regression up from scratch. While in practice we would usually want to rely on existing packages that have been properly tested, I often find it very educative and rewarding to program algorithms from the bottom up. You will see that Julia’s syntax so closely resembles the mathematical formulas we have seen above, that going from maths to code is incredibly easy. Seeing those formulas and algorithms then actually doing their magic is quite fun! The code chunk below, for example, shows the implementation of the loss function and its derivatives from Equation 4 above. Take a moment to go through the code line-by-line and try to understand how it relates back to the equations in Equation 4. Isn’t it amazing how closely the code resembles the actual equations?
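As a minimal sketch of such an implementation (not necessarily identical to the code in the notebook; it assumes a data matrix `X` with one row per sample and uses the names `ℓ`, `∇ℓ` and `∇∇ℓ` purely for illustration), the three lines of Equation 4 translate almost symbol by symbol:

```julia
using LinearAlgebra

sigmoid(w, X) = 1 ./ (1 .+ exp.(-X * w))

# Loss: negative log-likelihood plus the Gaussian prior penalty.
function ℓ(w, w_0, H_0, X, y)
    μ = sigmoid(w, X)
    Δw = w - w_0
    return -sum(y[n] * log(μ[n]) + (1 - y[n]) * log(1 - μ[n]) for n in eachindex(y)) + 1/2 * Δw' * H_0 * Δw
end

# Gradient:
function ∇ℓ(w, w_0, H_0, X, y)
    μ = sigmoid(w, X)
    return sum((μ[n] - y[n]) * X[n, :] for n in eachindex(y)) + H_0 * (w - w_0)
end

# Hessian:
function ∇∇ℓ(w, w_0, H_0, X, y)
    μ = sigmoid(w, X)
    return sum(μ[n] * (1 - μ[n]) * X[n, :] * X[n, :]' for n in eachindex(y)) + H_0
end

# Made-up example: first column of X is the constant.
X = [1.0 0.5; 1.0 -1.0; 1.0 2.0]
y = [1.0, 0.0, 1.0]
w = [0.2, -0.3]
w_0 = zeros(2)
H_0 = Matrix(1.0I, 2, 2)
loss_value = ℓ(w, w_0, H_0, X, y)
```

A quick way to gain confidence in hand-derived derivatives like these is to compare the gradient against a finite-difference approximation of the loss.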

Aside from the optimization routine this is essentially all there is to coding up Bayesian Logistic Regression from scratch in Julia Language. If you are curious to see the full source code in detail you can check out this interactive notebook. Now let us finally turn back to our synthetic data and see how Bayesian Logistic Regression can help us understand the uncertainty around our model predictions.

**DISCLAIMER** ❗️

I should mention that this is the first time I program in Julia, so for any Julia pros out there: please bear with me! Happy to hear your suggestions/comments.

Figure 3 below shows the resulting posterior distribution for the two non-constant coefficients at varying degrees of prior uncertainty. The constant is held fixed at the mode. The red dot indicates the MLE. Note how with zero prior uncertainty the posterior is equal to the prior. This is intuitive since we have imposed that we have no uncertainty around our prior beliefs and hence no amount of new evidence can move us in any direction. Conversely, when prior uncertainty is very large the posterior distribution is centered around the unconstrained MLE: prior knowledge is very uncertain and hence the posterior is dominated by the likelihood of the data.

What about the posterior predictive? The story is similar: since with zero prior uncertainty the posterior is completely dominated by the zero-mean prior, the predicted probability is 0.5 everywhere (top left panel in Figure 4). As we gradually increase uncertainty around our prior the posterior predictive depends more and more on the data: uncertainty around predicted labels is high only in regions that are not populated by samples. Not surprisingly, this effect is strongest for the MLE where we see some evidence of overfitting.

In this post we have seen how Bayesian Logistic Regression can be implemented from scratch in Julia language. The estimated posterior distribution over model parameters can be used to quantify uncertainty around coefficients and model predictions. I have argued that it is important to be transparent about model uncertainty to avoid being overly confident in estimates.

There are many more benefits associated with Bayesian (probabilistic) machine learning. Understanding where in the input domain our model exhibits high uncertainty can for example be instrumental in labelling data: see for example Gal, Islam, and Ghahramani (2017) and follow-up works for an interesting application to **active learning** for image data. Similarly, there is recent work that uses estimates of the posterior predictive in the context of **algorithmic recourse** (Schut et al. 2021). For a brief introduction to algorithmic recourse see one of my previous posts.

As a great reference for further reading about probabilistic machine learning I can highly recommend Murphy (2022). An electronic version of the book is currently freely available as a draft. Finally, remember that if you want to try your hand at the code, you can check out this interactive notebook.

Bishop, Christopher M. 2006. *Pattern Recognition and Machine Learning*. Springer.

Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. 2017. “Deep Bayesian Active Learning with Image Data.” In *International Conference on Machine Learning*, 1183–92. PMLR.

Murphy, Kevin P. 2012. *Machine Learning: A Probabilistic Perspective*. MIT Press.

———. 2022. *Probabilistic Machine Learning: An Introduction*. MIT Press.

Schut, Lisa, Oscar Key, Rory Mc Grath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In *International Conference on Artificial Intelligence and Statistics*, 1756–64. PMLR.

We let define the true coefficients.↩︎

Note that the author works with the negative log likelihood scaled by the sample size↩︎

BibTeX citation:

```
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {Bayesian {Logistic} {Regression}},
date = {2021-11-15},
url = {https://www.paltmeyer.com/blog//blog/posts/bayesian-logit},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2021. “Bayesian Logistic Regression.”
November 15, 2021. https://www.paltmeyer.com/blog//blog/posts/bayesian-logit.

“You cannot appeal to [algorithms]. They do not listen. Nor do they bend.”

— Cathy O’Neil

In her popular book Weapons of Math Destruction Cathy O’Neil presents the example of public school teacher Sarah Wysocki, who lost her job after a teacher evaluation algorithm had rendered her redundant (O’Neil 2016). Sarah was highly popular among her peers, supervisors and students.

This post looks at a novel algorithmic solution to the problem that individuals like Sarah, who are faced with an undesirable outcome, should be provided with means to revise that outcome. The literature commonly refers to this as *individual recourse*. One of the first approaches towards individual recourse was proposed by Ustun, Spangher, and Liu (2019). In a recent follow-up paper, Joshi et al. (2019) propose a methodology coined `REVISE`, which extends the earlier approach in at least three key ways:

- `REVISE` provides a framework that avoids suggesting an unrealistic set of changes by imposing a threshold likelihood on the revised attributes.
- It is applicable to a broader class of models including Black Box classifiers and structural causal models.
- It can be used to detect poorly defined proxies and biases.

For a detailed discussion of these points you may check out this slide deck or consult the paper directly (freely available on DeepAI). Here, we will abstract from some of these complications and instead look at an application of a slightly simplified version of `REVISE`. This should help us to first build a good intuition. Readers interested in the technicalities and code may find all of this in the annex below.

We will explain `REVISE` through a short tale of cats and dogs. The protagonist of this tale is Kitty 🐱, a young cat that identifies as a dog. Unfortunately, Kitty is not very tall and her tail, though short for a cat, is longer than that of the average dog (Figure 1).

Much to her dismay, Kitty has been recognized as a cat by a linear classifier that we trained through stochastic gradient descent using the data on animals’ height and tail length. Once again interested readers may find technical details and code in the annex below. Figure 2 shows the resulting linear separation in the attribute space with the decision boundary in solid black and Kitty’s location indicated by a red circle. Can we provide individual recourse to Kitty?

Let’s see if and how we can apply `REVISE` to Kitty’s problem. The following summary should give you some flavour of how the algorithm works:

- Initialize the revised attributes, that is the attributes that will be revised recursively. Kitty’s original attributes seem like a reasonable place to start.
- Through gradient descent recursively revise these attributes until the classifier predicts 🐶. At this point the descent terminates since for these revised attributes the classifier labels Kitty as a dog.
- Return the resulting revision, that is the individual recourse for Kitty.
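The three steps above can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: a linear classifier with a hypothetical coefficient vector `w`, labels +1 for dogs and -1 for cats, a Hinge-loss gradient and a Euclidean distance penalty weighted by `lam`. The numbers for Kitty are made up; the annex contains the R implementation that was actually used.

```python
import math


def predict(w, x):
    """Linear classifier: +1 (dog) or -1 (cat)."""
    score = sum(w_k * x_k for w_k, x_k in zip(w, x))
    return 1 if score > 0 else -1


def revise(w, x_star, eta=0.1, lam=0.01, n_iter=1000):
    """Simplified recourse search directly in attribute space."""
    x = list(x_star)  # step 1: start from the original attributes
    for _ in range(n_iter):
        if predict(w, x) == 1:  # step 2: stop once the label has switched
            break
        score = sum(w_k * x_k for w_k, x_k in zip(w, x))
        # Gradient of the Hinge loss with respect to x for target label +1
        g_loss = [-w_k if score <= 1 else 0.0 for w_k in w]
        dist = math.dist(x, x_star)
        # The distance gradient is undefined at zero, so skip the penalty there
        if dist > 0:
            g_dist = [(x_k - s_k) / dist for x_k, s_k in zip(x, x_star)]
        else:
            g_dist = [0.0 for _ in x]
        x = [x_k - eta * (g + lam * gd) for x_k, g, gd in zip(x, g_loss, g_dist)]
    if predict(w, x) != 1:
        return None  # no recourse found within n_iter steps
    return [x_k - s_k for x_k, s_k in zip(x, x_star)]  # step 3: the recourse


w = [1.0, -1.0]     # hypothetical: height counts towards dog, tail length against
kitty = [0.2, 0.8]  # hypothetical (height, tail length)
print(revise(w, kitty))            # small penalty: Kitty crosses the boundary
print(revise(w, kitty, lam=2.0))   # heavy penalty: no recourse is found
```

In this toy example a small penalty yields a revision that increases height and decreases tail length, while a very large penalty keeps Kitty stuck close to where she started.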

Figure 3 illustrates what happens when this approach is applied to Kitty’s problem. The different panels show the results for different values of a regularization parameter $\lambda$ that governs the trade-off between achieving the desired label switch and keeping the distance between the original and revised attributes small. In all but one case, `REVISE` converges: a decrease in tail length along with an increase in height eventually allows Kitty to cross the decision boundary. In other words, we have successfully turned Kitty into a dog - at least in the eyes of the linear classifier!

We also observe that as we increase the regularization parameter $\lambda$ for a fixed learning rate, `REVISE` takes longer to converge. This should come as no surprise, since higher values of $\lambda$ lead to greater regularization with respect to the penalty we place on the distance that Kitty has to travel. When we penalize too much, Kitty never reaches the decision boundary, because she is reluctant to change her characteristics beyond a certain point. While not visible to the naked eye, one of the candidate values does stand out as the best choice in this particular example.

`REVISE` algorithm in action: how Kitty crosses the decision boundary by changing her attributes. Regularization with respect to the distance penalty increases from top left to bottom right.

While hopefully Kitty’s journey has provided you with some useful intuition, the story is of course very silly. Even if your cat ever seems to signal that she wants to be a dog, helping her cross that decision boundary will be tricky. Some attributes are simply immutable or very difficult to change, which Joshi et al. (2019) do not fail to account for in their framework. Their proposed methodology offers a simple and ingenious approach towards providing individual recourse. Instead of concerning ourselves with Black Box interpretability, why not simply provide remedy in case things go wrong?

To some extent that idea has its merit. As this post has hopefully shown, `REVISE` is straightforward to understand and readily applicable. It could be a very useful tool to provide individual recourse in many real-world applications. As the implementation of our simplified version of `REVISE` demonstrates, researchers should also find it relatively easy to develop the methodology further and tailor it to specific use cases. The simpler version here, for example, may be useful in settings where the dimensionality is relatively small and one can reasonably model the distribution of attributes without the need for generative models.

Still, you may be wondering: if the original classifier is based on poorly defined rules and proxies, then what good does `REVISE` really do? Going back to the example of high-school teacher Sarah Wysocki, one of the key attributes determining teachers’ evaluations was their students’ performance. Realizing this, some teachers took the shortest route to success by artificially inflating their students’ test scores. That same course of action may well have been suggested by `REVISE`. As Joshi et al. (2019) demonstrate, this very property of `REVISE` may actually prove useful in detecting weaknesses of decision making systems before setting them loose (key contribution 3).

Nonetheless, the example above also demonstrates that approaches like `REVISE`, useful as they may be, tend to provide solutions for very particular problems. In reality data-driven decision making systems are often subject to many different problems and hence research on trustworthy AI will need to tackle the issue from various angles. A few places to start include the question of dealing with data that is inherently biased, improving ad-hoc and post-hoc model interpretability and continuing efforts around causality-inspired AI.

Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” https://arxiv.org/abs/1907.09615.

O’Neil, Cathy. 2016. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown.

Ustun, Berk, Alexander Spangher, and Yang Liu. 2019. “Actionable Recourse in Linear Classification.” In *Proceedings of the Conference on Fairness, Accountability, and Transparency*, 10–19.

In my blog posts I aim to implement interesting ideas from scratch even if that sometimes means that things need to undergo some sort of simplification. The benefit of this approach is that the experience is educationally rewarding - both for myself and hopefully also for readers. The first two sections of this annex show how `REVISE` and linear classification can be implemented in R. The final section just shows how the synthetic data was generated. To also inspect the code that generates the visualizations and everything else, you can find the source code of this file on GitHub.

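Linear classification is implemented through stochastic gradient descent (SGD) with Hinge loss

$$
\ell(\mathbf{w};\mathbf{x}_i,y_i) = \max\left(0,\; 1 - \mathbf{w}^T\mathbf{x}_i y_i\right)
$$

where $\mathbf{w}$ is a coefficient vector, $\mathbf{x}_i$ is the attribute vector of individual $i$ and $y_i$ is the individual’s outcome. Since we apply SGD in order to minimize the loss function by varying $\mathbf{w}$, we need an expression for its gradient with respect to $\mathbf{w}$, which is given by:

$$
\nabla_{\mathbf{w}}\, \ell(\mathbf{w};\mathbf{x}_i,y_i) =
\begin{cases}
-\mathbf{x}_i y_i & \text{if } \mathbf{w}^T\mathbf{x}_i y_i \le 1 \\
0 & \text{otherwise}
\end{cases}
$$

The code below uses this analytical solution to perform SGD over `n_iter` iterations or as long as updates yield feasible parameter values. As the final vector of coefficients the function returns the average of the coefficient vectors across iterations (`w_avg`). Denoting the optimal coefficient vector as $\mathbf{w}^*$, it can be shown that under certain conditions this average converges to $\mathbf{w}^*$ as the number of iterations grows.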

```
library(data.table) # provides data.table(), used below when save_steps = TRUE

#' Stochastic gradient descent
#'
#' @param X Feature matrix.
#' @param y Vector containing training labels.
#' @param eta Learning rate.
#' @param n_iter Maximum number of iterations.
#' @param w_init Initial parameter values.
#' @param save_steps Boolean checking if coefficients should be saved at each step.
#'
#' @return
#' @export
#'
#' @author Patrick Altmeyer
linear_classifier <- function(X,y,eta=0.001,n_iter=1000,w_init=NULL,save_steps=FALSE) {
# Initialization: ----
n <- nrow(X) # number of observations
d <- ncol(X) # number of dimensions
if (is.null(w_init)) {
w <- matrix(rep(0,d)) # initialize coefficients as zero...
} else {
w <- matrix(w_init) # ...unless initial values have been provided.
}
w_avg <- 1/n_iter * w # initialize average coefficients
iter <- 1 # iteration count
if (save_steps) {
steps <- data.table(iter=0, w=c(w), d=1:d) # if desired, save coefficient at each step
} else {
steps <- NA
}
feasible_w <- TRUE # to check if coefficients are finite, non-nan, ...
# Surrogate loss:
l <- function(X,y,w) {
x <- (-1) * crossprod(X,w) * y
pmax(0,1 + x) # Hinge loss
}
grad <- function(X,y,w) {
X %*% ifelse(crossprod(X,w) * y<=1,-y,0) # Gradient of Hinge loss
}
# Stochastic gradient descent: ----
while (feasible_w & iter<n_iter) {
t <- sample(1:n,1) # random draw
X_t <- matrix(X[t,])
y_t <- matrix(y[t])
v_t <- grad(X_t,y_t,w) # compute estimate of gradient
# Update:
w <- w - eta * v_t # update coefficient vector
feasible_w <- all(sapply(w, function(i) !is.na(i) & is.finite(i))) # check if feasible
if (feasible_w) {
w_avg <- w_avg + 1/n_iter * w # update average
}
if (save_steps) {
steps <- rbind(steps, data.table(iter=iter, w=c(w), d=1:d))
}
iter <- iter + 1 # increase counter
}
# Output: ----
output <- list(
X = X,
y = matrix(y),
coefficients = w_avg,
eta = eta,
n_iter = n_iter,
steps = steps
)
class(output) <- "classifier" # assign S3 class
return(output)
}
# Methods: ----
print.classifier <- function(classifier) {
print("Coefficients:")
print(classifier$coefficients)
}
print <- function(classifier) {
UseMethod("print")
}
predict.classifier <- function(classifier, newdata=NULL, discrete=TRUE) {
if (!is.null(newdata)) {
fitted <- newdata %*% classifier$coefficients # out-of-sample prediction
} else {
fitted <- classifier$X %*% classifier$coefficients # in-sample fit
}
if (discrete) {
fitted <- sign(fitted) # map to {-1,1}
}
return(fitted)
}
predict <- function(classifier, newdata=NULL, discrete=TRUE) {
UseMethod("predict")
}
```

`REVISE` (simplified)

As flagged above, we are looking at a slightly simplified version of the algorithm presented in Joshi et al. (2019). In particular, the approach here does not incorporate the threshold on the likelihood nor does it account for immutable attributes.

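Let $y\in\{-1,1\}$ be a binary outcome variable, $X$ a feature matrix containing individuals’ attributes and consider a corresponding data-dependent linear classifier with coefficient vector $\mathbf{w}$. Suppose the classifier predicts $-1$ (the negative outcome) for some individual characterized by attributes $\mathbf{x}^*$. Then we want to find $\mathbf{x}$ closest to $\mathbf{x}^*$ such that the classifier assigns the positive outcome $1$. In order to do so, we use gradient descent with Hinge loss to minimize the following function

$$
\max\left(0,\; 1 - \mathbf{w}^T\mathbf{x}\right) + \lambda\, d(\mathbf{x},\mathbf{x}^*)
$$

where $d(\mathbf{x},\mathbf{x}^*)=\lVert \mathbf{x}-\mathbf{x}^* \rVert_2$ denotes the Euclidean distance and $\lambda$ is the regularization parameter. Note that this time we take the coefficient vector $\mathbf{w}$ defining the classifier as given and instead vary the attributes. In particular, we will perform gradient descent steps as follows

$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \left( \nabla_{\mathbf{x}} \max\left(0,\; 1 - \mathbf{w}^T\mathbf{x}\right) + \lambda \nabla_{\mathbf{x}}\, d(\mathbf{x},\mathbf{x}^*) \right)
$$

where $\eta$ is the learning rate. The descent step is almost equivalent to the one described in Joshi et al. (2019), but here we greatly simplify things by optimizing directly in the attribute space instead of a latent space. The gradient of the loss function looks very similar to Equation 1. With respect to the Euclidean distance partial derivatives are of the following form:

$$
\nabla_{\mathbf{x}}\, d(\mathbf{x},\mathbf{x}^*) = \frac{\mathbf{x}-\mathbf{x}^*}{d(\mathbf{x},\mathbf{x}^*)}
$$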

The code that implements this optimization follows below.

```
#' REVISE algorithm - a simplified version
#'
#' @param classifier The fitted classifier.
#' @param x_star Attributes of individual seeking individual recourse.
#' @param eta Learning rate.
#' @param lambda Regularization parameter.
#' @param n_iter Maximum number of operations.
#' @param save_steps Boolean indicating if intermediate steps should be saved.
#'
#' @return
#' @export
#'
#' @author Patrick Altmeyer
revise.classifier <- function(classifier,x_star,eta=1,lambda=0.01,n_iter=1000,save_steps=FALSE) {
# Initialization: ----
d <- length(x_star) # number of dimensions
if (!is.null(names(x_star))) {
d_names <- names(x_star) # names of attributes, if provided
} else {
d_names <- sprintf("X%i", 1:d)
}
w <- classifier$coefficients # coefficient vector
x <- x_star # initialization of revised attributes
distance <- 0 # initial distance from starting point
converged <- predict(classifier, newdata = x)[1,1]==1 # positive outcome?
iter <- 1 # counter
if (save_steps) {
steps <- data.table(iter=1, x=x, d=d_names) # save intermediate steps, if desired
} else {
steps <- NA
}
# Gradients:
grad <- function(x,y,w) {
w %*% ifelse(crossprod(x,w) * y<=1,-y,0) # gradient of Hinge loss with respect to X
}
grad_dist <- function(x,x_star) {
d <- length(x_star)
distance <- dist(matrix(cbind(x_star,x),nrow=d,byrow = T))
matrix((x-x_star) / distance) # gradient of Euclidean distance with respect to X
}
# Gradient descent: ----
while(!converged & iter<n_iter) {
if (distance!=0) {
x <- c(x - eta * (grad(x=matrix(x),y=1,w) + lambda * grad_dist(x,x_star))) # gradient descent step
} else {
x <- c(x - eta * grad(x=matrix(x),y=1,w)) # gradient with respect to distance not defined at zero
}
converged <- predict(classifier, newdata = x)[1,1]==1 # positive outcome?
iter <- iter + 1 # update counter
if (save_steps) {
steps <- rbind(steps, data.table(iter=iter, x=x, d=d_names))
}
distance <- dist(matrix(cbind(x_star,x),nrow=d,byrow = T)) # update distance
}
# Output: ----
if (converged) {
revise <- x - x_star
} else {
revise <- NA
}
output <- list(
x_star = x_star,
revise = revise,
classifier = classifier,
steps = steps,
lambda = lambda,
distance = distance,
mean_distance = mean(abs(revise))
)
return(output)
}
revise <- function(classifier,x_star,eta=1,lambda=0.01,n_iter=1000,save_steps=FALSE) {
UseMethod("revise")
}
```

The synthetic data describing cats and dogs was generated as follows:

```
sim_data <- function(n=100,averages,noise=0.1) {
d <- ncol(averages)
y <- 2*(rbinom(n,1,0.5)-0.5) # generate binary outcome: 1=dog, -1=cat
X <- as.matrix(averages[(y+1)/2+1,]) # generate attributes conditional on y
dogs <- y==1 # boolean index for dogs
cats <- y==-1 # boolean index for cats
X[cats,] <- X[cats,] +
matrix(rnorm(sum(cats)*d),nrow=sum(cats)) %*% diag(noise*averages[2,]) # add noise for y=-1 (cats)
X[dogs,] <- X[dogs,] +
matrix(rnorm(sum(dogs)*d),nrow=sum(dogs)) %*% diag(noise*averages[2,]) # add noise for y=1 (dogs)
return(list(X=X,y=y))
}
```

BibTeX citation:

```
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {Individual Recourse for {Black} {Box} {Models}},
date = {2021-04-27},
url = {https://www.paltmeyer.com/blog//blog/posts/individual-recourse-for-black-box-models},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2021. “Individual Recourse for Black Box
Models.” April 27, 2021. https://www.paltmeyer.com/blog//blog/posts/individual-recourse-for-black-box-models.

Propelled by advancements in modern computer technology, deep learning has re-emerged as perhaps the most promising artificial intelligence (AI) technology of the last two decades. By treating problems as a nested hierarchy of hidden layers, deep artificial neural networks achieve the power and flexibility necessary for AI systems to navigate complex real-world environments. Unfortunately, their very nature has earned them a reputation as *Black Box* algorithms and their lack of interpretability remains a major impediment to their more widespread application.

In science, research questions usually demand not just answers but also explanations and variable selection is often as important as prediction (Ish-Horowicz et al. 2019). Economists, for example, recognise the undeniable potential of deep learning, but are rightly hesitant to employ novel tools that are not fully transparent and ultimately cannot be trusted. Similarly, real-world applications of AI have come under increasing scrutiny with regulators imposing that individuals influenced by algorithms should have the right to obtain explanations (Fan, Xiong, and Wang 2020). In high-risk decision-making fields such as AI systems that drive autonomous vehicles the need for explanations is self-evident (Ish-Horowicz et al. 2019).

In light of these challenges it is not surprising that research on explainable AI has recently gained considerable momentum (Arrieta et al. 2020). While in this short essay we will focus on deep learning in particular, it should be noted that this growing body of literature is concerned with a broader realm of machine learning models. The rest of this note is structured as follows: the first section provides a brief overview of recent advancements towards interpreting deep neural networks largely drawing on Fan, Xiong, and Wang (2020); the second section considers a novel entropy-based approach towards interpretability proposed by Crawford et al. (2019); finally, in the last section we will see how this approach can be applied to deep neural networks as proposed in Ish-Horowicz et al. (2019).

Before delving further into *how* the intrinsics of deep neural networks can be disentangled we should first clarify *what* interpretability in the context of algorithms actually means. Fan, Xiong, and Wang (2020) describe model interpretability simply as the extent to which humans can “understand and reason” the model. This may concern an understanding of both the *ad-hoc* workings of the algorithm as well as the *post-hoc* interpretability of its output. In the context of linear regression, for example, *ad-hoc* workings of the model are often described through the intuitive idea of linearly projecting the outcome variable onto the column space of the regressors. *Post-hoc* interpretations usually center around variable importance – the main focus of the following sections. Various recent advancements tackle interpretability of DNNs from different angles depending on whether the focus is on *ad-hoc* or *post-hoc* interpretability. Fan, Xiong, and Wang (2020) further argue that model interpretability hinges on three main aspects of *simulatability*, *decomposability* and *algorithmic transparency*, but for the purpose of this short note the *ad-hoc* vs. *post-hoc* taxonomy provides a simpler, more natural framework. ^{1}

Understanding the *ad-hoc* intrinsic mechanisms of a DNN is inherently difficult. While generally transparency may be preserved in the presence of nonlinearity (e.g. decision trees), multiple hidden layers, each involving nonlinear operations, are usually out of the realm of human comprehension (Fan, Xiong, and Wang 2020). Training also generally involves optimization of non-convex functions that involve an increasing number of saddle points as the dimensionality increases (Fan, Xiong, and Wang 2020). Methods to circumvent this problem usually boil down to decreasing the overall complexity, either by regularizing the model or through proxy methods. Regularization – while traditionally done to avoid overfitting – has been found to be useful to create more interpretable representations. Monotonicity constraints, for example, impose that as the value of a specified covariate increases model predictions either monotonically decrease or increase. Proxy methods construct simpler representations of a learned DNN, such as a rule-based decision tree. This essentially involves repeatedly querying the trained network while varying the inputs and then deriving decision rules based on the model output.

Post-hoc interpretability usually revolves around the understanding of feature importance. A greedy approach to this issue involves simply removing features one by one and checking how model predictions change. A more sophisticated approach along these lines is the *Shapley* value, which draws on cooperative game theory. The Shapley value assigns varying payouts to players depending on their contribution to overall payout. In the context of neural networks each input covariate represents a player while overall payout is represented by the difference between average and individual outcome predictions.^{2} Exact computation of Shapley values becomes prohibitive as the dimensionality increases, though approximate methods have recently been developed (Fan, Xiong, and Wang 2020).
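To see why exact Shapley values get expensive so quickly, consider the following minimal Python sketch. It averages each player’s marginal contribution over all possible orderings, so the cost grows factorially in the number of players. The payout function `toy_value` and the feature names are hypothetical and only serve to illustrate the mechanics; they do not correspond to any model in this note.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)  # marginal contribution
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical payout: x1 and x2 matter on their own and interact; x3 does not."""
    payout = 0.0
    if "x1" in coalition:
        payout += 2.0
    if "x2" in coalition:
        payout += 1.0
    if {"x1", "x2"} <= coalition:
        payout += 1.0  # interaction bonus
    return payout


phi = shapley_values(["x1", "x2", "x3"], toy_value)
print(phi)
```

The interaction bonus is split equally between the two features involved, the irrelevant feature receives nothing, and the individual payouts add up to the overall payout.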

The remainder of this note focuses on a novel approach to feature extraction that measures entropy shifts in a learned probabilistic neural network in response to model inputs. We will first introduce this methodology in the context of Gaussian Process regression in the following section before finally turning to its application to Bayesian neural networks.

Ish-Horowicz et al. (2019) motivate their methodology for interpreting neural networks through Gaussian Process regression. Consider the following Bayesian regression model with Gaussian priors:

This naturally gives rise to a particular example of a Gaussian Process (GP). In particular, since the regression function is just a linear combination of Gaussian random variables it follows a Gaussian Process itself

where the covariance is given by the Kernel (or Gram) matrix, which is built from the kernel function (Bishop 2006). In other words, the prior distribution over the weights induces a probability distribution over random functions. Similarly, the GP can be understood as a prior distribution over an infinite-dimensional reproducing kernel Hilbert space (RKHS) (Crawford et al. 2019), which in a finite-dimensional setting becomes multivariate Gaussian.
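For readers who have not come across Gram matrices before, the short Python sketch below builds one from a kernel function. The squared-exponential kernel used here is just one common choice that I am assuming for illustration; it is not necessarily the kernel implied by the model above, and the inputs are made up.

```python
import math


def rbf_kernel(x, z, length_scale=1.0):
    """Squared-exponential kernel, one common (assumed) choice of kernel function."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def gram_matrix(X, kernel):
    """K[i][j] = kernel(x_i, x_j) for all pairs of inputs."""
    return [[kernel(x_i, x_j) for x_j in X] for x_i in X]


X = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]  # three hypothetical inputs
K = gram_matrix(X, rbf_kernel)
for row in K:
    print([round(v, 3) for v in row])
```

The resulting matrix is symmetric with ones on the diagonal, and entries shrink towards zero as inputs move further apart.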

In a standard linear regression model coefficients characterize the projection of the outcome variable onto the column space of the regressors. In particular, with ordinary least squares we define:

$$
\beta = (X^TX)^{-1}X^Ty
$$

The primary focus here is to learn the mapping from input to output. The key differentiating feature between this approach and the non-parametric model in Equation 1 is the fact that in case of the latter we are interested in learning not only the mapping from inputs to outputs, but also the representation of the inputs (see for example Goodfellow, Bengio, and Courville 2016). To be even more specific, treating the feature representation itself as random as in Equation 1 allows us to learn non-linear relationships between the covariates, since they are implicitly captured by the RKHS (Crawford et al. 2019). Neural networks share this architecture and hence it is worth dwelling on it a bit further: the fact that the learned model inherently incorporates variable interactions leads to the observation that an individual feature is rarely important on its own with respect to the mapping from inputs to outputs (Ish-Horowicz et al. 2019). Hence, in order to gain an understanding of individual variable importance, one should aim to understand what role a feature plays *within* the learned model, thereby taking into account its interactions with other covariates. Formally, Crawford et al. (2019) define the *effect size analogue* as the equivalent of the familiar regression coefficient in the non-parametric setting

Here the covariates enter through their Moore-Penrose pseudo-inverse (see for example Goodfellow, Bengio, and Courville (2016)). Intuitively the effect size analogue can be thought of as the resulting coefficients from regressing the fitted values from the learned probabilistic model on the covariates. It can be interpreted in the same way as linear regression coefficients, in the sense that each entry describes the marginal change in the fitted values given a unit increase in the corresponding covariate, holding all else constant. Note here the subtle, but crucial difference between Equation 3 – a projection from the outcome variable onto the column space of the covariates – and Equation 4 – a projection from the learned model onto the covariates. In other words, looking at the effect size analogue can be thought of as peeking directly into the *Black Box*. Unfortunately, as Crawford et al. (2019) point out, working with Equation 4 is usually not straightforward. From a practitioner’s point of view, it may also not be obvious how to interpret a coefficient that describes marginal effects of input variables on a learned model. A more useful indicator in this context would provide a measure of how much individual variables contribute to the overall variation in the learned model. For this purpose Crawford et al. (2019) propose to work with a distributional centrality measure based on the effect size analogue, which we shall turn to next.
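The idea of regressing fitted values on the covariates can be illustrated with a small Python sketch. It assumes two covariates with full column rank, in which case the pseudo-inverse reduces to the familiar least-squares formula; the function name `effect_size_analogue` and the toy numbers are my own, and the "fitted values" are a simple stand-in for the output of a learned model.

```python
def effect_size_analogue(X, f):
    """Project fitted values f onto two covariates by least squares.

    Assumption: X has two columns and full column rank, so that the
    pseudo-inverse equals (X'X)^{-1} X', written out here for the 2x2 case.
    """
    a = sum(r[0] * r[0] for r in X)   # entries of X'X
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    g0 = sum(r[0] * v for r, v in zip(X, f))  # entries of X'f
    g1 = sum(r[1] * v for r, v in zip(X, f))
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]  # hypothetical covariates
fitted = [2.0 * r[0] - 1.0 * r[1] for r in X]          # stand-in for model output
print(effect_size_analogue(X, fitted))
```

If the stand-in model happens to be linear, as here, the projection simply recovers its coefficients. For a genuinely non-linear model it returns the best linear summary of the fitted values.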

The proposed methodology in Crawford et al. (2019) and Ish-Horowicz et al. (2019) depends on the availability of a posterior distribution over the effect size analogue in that it measures its entropic shifts in response to the introduction of covariates. The intuition is straightforward: within the context of the learned probabilistic model, is a given covariate informative or not? More formally this boils down to determining if the posterior distribution of the remaining effect sizes is dependent on the effect of that covariate. This can be quantified through the Kullback-Leibler divergence (KLD) between the marginal posterior and the conditional posterior:

Covariates that contribute significant information to the model will have a strictly positive KLD, while for insignificant covariates the KLD will be approximately zero. The measure of induced entropy change gives rise to a ranking of the covariates in terms of their relative importance in the model. The RATE criterion of a variable is then simply defined as

which in light of its bounds can naturally be interpreted as the variable’s percentage contribution to the learned model. It is worth noting that the divergence of course depends on the value of the conditioning variable. A natural choice is zero, which usually corresponds to the null hypothesis.
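Assuming the criterion is simply each covariate’s KLD divided by the sum of all KLDs (which is what the bounds and the percentage interpretation suggest), the ranking step can be sketched in Python as follows. The divergence values are hypothetical and the function name `rate` is my own.

```python
def rate(kld):
    """Normalise non-negative KLD values so that they sum to one."""
    total = sum(kld)
    if total == 0:
        raise ValueError("all KLD values are zero; no covariate is informative")
    return [k / total for k in kld]


kld = [0.8, 0.1, 0.0, 0.1]  # hypothetical divergences for four covariates
scores = rate(kld)
# Rank covariates from most to least important
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A covariate with zero divergence receives a score of zero, and the scores of all covariates sum to one.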

In order to use the RATE criterion in the context of deep learning we need to work in the Bayesian setting. Contrary to standard artificial neural networks which work under the assumption that weights have some true latent value, Bayesian neural networks place a prior distribution over network parameters and hence treat weights as random variables (Goan and Fookes 2020). Not only does it perhaps seem more natural to treat unobserved weights as random, but the Bayesian setting also naturally allows us to reason about uncertainty in predictions, which can ultimately help us develop more trustworthy models (Goan and Fookes 2020). A drawback of BNNs is that exact computation of posteriors is computationally challenging and often intractable (a non-trivial issue that we will turn back to in a moment).

When the prior placed over parameters is Gaussian, the output of the BNN approaches a Gaussian Process as the width of the network grows, in line with the discussion in the previous section. This is exactly the assumption that Ish-Horowicz et al. (2019) work with. They propose an architecture for a multi-layer perceptron (MLP) composed of (1) an input layer collecting covariates, (2) a single deterministic, hidden layer and (3) an outer layer producing predictions from a probabilistic model. Let the covariates be collected in a matrix. Then formally, we have

where a link function is applied to the probabilistic model learned in the outer layer, with weights assumed to be Gaussian random variables.^{3} Finally, the inner (or more generally penultimate) layer yields a matrix of neural activations. Ish-Horowicz et al. (2019) work with a simple single-layer MLP, but it should be evident that this can be extended to arbitrary depth and complexity, while still maintaining the high-level structure imposed by Equation 7. This flexibility allows RATE to be applied to a wide range of Bayesian network architectures, since all that is really required is the posterior distribution over weights, which arises from the probabilistic outer layer. The fact that only the outer layer needs to be probabilistic has the additional benefit of mitigating the computational burden that comes with Bayesian inference, which was mentioned earlier.

Having established this basic, flexible set-up, Ish-Horowicz et al. (2019) go on to derive closed-form expressions for RATE in this setting. The details are omitted here since the logic is largely analogous to what we learned above, but they can be found in Ish-Horowicz et al. (2019).

The RATE criterion originally proposed by Crawford et al. (2019) and shown to be applicable to Bayesian neural networks in Ish-Horowicz et al. (2019) offers an intuitive way to measure variable importance in the context of deep learning. By defining variable importance as the contribution inputs make to a probabilistic model, it implicitly incorporates the interactions between covariates and nonlinearities that the model has learned. In other words, it allows researchers to peek directly into the *Black Box*. This opens up interesting avenues for future research, as the approach can be readily applied in academic disciplines and real-world applications that rely heavily on explainability of outcomes.

Arrieta, Alejandro Barredo, Natalia Diaz-Rodriguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, et al. 2020. “Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI.” *Information Fusion* 58: 82–115.

Bishop, Christopher M. 2006. *Pattern Recognition and Machine Learning*. Springer.

Crawford, Lorin, Seth R Flaxman, Daniel E Runcie, and Mike West. 2019. “Variable Prioritization in Nonlinear Black Box Methods: A Genetic Association Case Study.” *The Annals of Applied Statistics* 13 (2): 958.

Fan, Fenglei, Jinjun Xiong, and Ge Wang. 2020. “On Interpretability of Artificial Neural Networks.” https://arxiv.org/abs/2001.02522.

Goan, Ethan, and Clinton Fookes. 2020. “Bayesian Neural Networks: An Introduction and Survey.” In *Case Studies in Applied Bayesian Data Science*, 45–87. Springer.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. *Deep Learning*. MIT Press.

Ish-Horowicz, Jonathan, Dana Udwin, Seth Flaxman, Sarah Filippi, and Lorin Crawford. 2019. “Interpreting Deep Neural Networks Through Variable Importance.” https://arxiv.org/abs/1901.09839.

Simulatability describes the overall, high-level understandability of the mechanisms underlying the model – put simply, the less complex the model, the higher its simulatability. Decomposability concerns the extent to which the model can be taken apart into smaller pieces – neural networks by their very nature are compositions of multiple layers. Finally, algorithmic transparency refers to the extent to which the training of the algorithm is well-understood and to some extent observable – since DNNs generally deal with optimization of non-convex functions and often lack a unique solution, they are inherently non-transparent.↩︎

For more detail see for example here.↩︎

For simplicity I have omitted the deterministic bias term.↩︎

BibTeX citation:

```
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {A Peek Inside the “{Black} {Box}” - Interpreting Neural
Networks},
date = {2021-02-07},
url = {https://www.paltmeyer.com/blog//blog/posts/a-peek-inside-the-black-box-interpreting-neural-networks},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2021. “A Peek Inside the ‘Black Box’ - Interpreting Neural Networks.” February 7, 2021. https://www.paltmeyer.com/blog//blog/posts/a-peek-inside-the-black-box-interpreting-neural-networks.

Note

**Update on Feb 20, 2022**

The post below was written when I still used `blogdown` in combination with Hugo to build this blog. I have recently migrated the blog (along with pretty much everything else I do) to quarto.

Quarto® is an open-source scientific and technical publishing system built on Pandoc.

Based on my first few experiences I would go further and say that quarto is *the only* open-source scientific and technical publishing system you’ll ever need. The project is supported by RStudio and (unsurprisingly) Yihui Xie is one of the contributors. Go check it out!

It turns out building a static website in R is remarkably easy, as long as you know your way around R Markdown. Knowledge of HTML and CSS helps, but is not strictly necessary and can be acquired along the way. My package of choice for this website is `blogdown` by Yihui Xie, who has had a major impact on the R community through his many package contributions (`knitr`, `bookdown`, `pagedown`, …) and certainly made my life a lot easier on many occasions.

To get started just follow the instructions on `blogdown`’s GitHub repository or keep reading here for a high-level overview. Setting up a basic website in R requires exactly two steps:

1. Set up a local directory for the website. Let’s suppose you create it here: `~/Documents/myAwesomeWebsite`.
2. In R, navigate to the directory and simply run `blogdown::new_site()`.

This will set up a basic template which you can develop. Changing the theme and playing with the basic structure of the website is relatively straight-forward. Personally I have so far managed to work things out based on a working knowledge of HTML and CSS that I’ve developed in the past through my work with R Shiny.

There are various ways to deploy your website, i.e. make it accessible to the public. This website is deployed through GitHub pages. Detailed instructions on how to do this can be found here. Since I already had an existing local clone of my `pat-alt.github.io` repo, I just dropped it in the source directory of the website:

```
source/
│
├── config.yaml
├── content/
├── themes/
└── ...
pat-alt.github.io/
│
├── .git/
├── .nojekyll
├── index.html
├── about/
└── ...
```

After adding `publishDir: pat-alt.github.io` to my `config.yaml` and then running `blogdown::hugo_build()`, the website was built inside the clone. All that was left to do was to commit changes from the local clone to the `pat-alt.github.io` remote repo. A few moments later the website was already up and running.

There are certainly easier ways to build a website. But if like me you do pretty much all your work in R Markdown and want to share some of it, then you will love `blogdown`. The beauty of it is that once the basic infrastructure is set up, adding content is as simple as running the following wrapper function

`blogdown::new_post("Your new post", ext = ".Rmd")`

where the first argument is just the title of your post and the `ext` argument can be used to specify that you want to create an R Markdown document that can include code chunks. The wrapper function will automatically set up a directory for your post under `/post/`. RStudio will redirect you to the relevant `.Rmd` file that you can then fill with content. By default that folder will look roughly like this:

```
├── index.Rmd
├── index.html
└── index_files
    └── header-attrs
        └── header-attrs.js
```

As you can probably tell from the code chunks above, this post was created just in the way I described. So I thought I might as well go ahead with a simple coding example to add some flavour. Suppose you have built some function that you think is worth sharing with the world or simply learned something new and interesting. As a case in point, I recently had a look at the `Rcpp` package and wrote a small program in C++ to be used in R. Since R Markdown supports `Rcpp` code chunks (along with Python, bash, SQL, …) it is straightforward to showcase that code on this website.

The program can be used to simulate data from a categorical distribution. This distribution describes the possible results of a random variable that can take on one of $K$ possible categories with different probabilities. In base R we could use `rmultinom(n=1000,1,p=c(0.5,0.1,0.4))` to simulate draws from one such distribution with three different categories. Alternatively, we could write the program in C++ as follows:

```
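#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix simCategorical(int n, NumericVector p) {
  int k = p.size();
  // k x n matrix of zeros that will hold the 1-hot-encoded draws:
  NumericMatrix mat(k, n);
  // Normalise into a new vector so the caller's p is left untouched:
  NumericVector prob = p / sum(p);
  NumericVector emp_cdf = cumsum(prob);
  NumericVector u = Rcpp::runif(n, 0, 1);
  for (int j = 0; j < n; j++) {
    // Binary search for the first category whose cumulative
    // probability exceeds the uniform draw:
    int l = 0;
    // Start at the last category so that rounding error in emp_cdf
    // can never push the index out of bounds:
    int r = k - 1;
    double target = u[j];
    while (l < r) {
      int m = (l + r) / 2; // integer division rounds down
      if (emp_cdf[m] > target) {
        r = m;
      } else {
        l = m + 1;
      }
    }
    mat(r, j) = 1;
  }
  return mat;
}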
```
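The same inverse-CDF idea can also be tried outside of R. The following is a minimal standalone sketch that assumes only the C++ standard library; the function names and the explicit `seed` argument are my own choices for illustration and are not part of the Rcpp program above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Cumulative probabilities of a (possibly unnormalised) weight vector.
std::vector<double> cumulativeProbs(const std::vector<double>& p) {
  assert(!p.empty());
  const double total = std::accumulate(p.begin(), p.end(), 0.0);
  assert(total > 0.0);
  std::vector<double> cdf(p.size());
  double running = 0.0;
  for (std::size_t i = 0; i < p.size(); ++i) {
    running += p[i] / total;
    cdf[i] = running;
  }
  return cdf;
}

// Index of the first category whose cumulative probability exceeds u,
// clamped to the last category to guard against rounding error.
std::size_t findCategory(const std::vector<double>& cdf, double u) {
  assert(!cdf.empty());
  std::size_t l = 0;
  std::size_t r = cdf.size() - 1;
  while (l < r) {
    const std::size_t m = (l + r) / 2;
    if (cdf[m] > u) {
      r = m;
    } else {
      l = m + 1;
    }
  }
  return r;
}

// Draw n category indices with a seeded Mersenne Twister generator.
std::vector<std::size_t> sampleCategorical(std::size_t n,
                                           const std::vector<double>& p,
                                           unsigned seed) {
  const std::vector<double> cdf = cumulativeProbs(p);
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<std::size_t> draws(n);
  for (std::size_t j = 0; j < n; ++j) {
    draws[j] = findCategory(cdf, unif(gen));
  }
  return draws;
}
```

With `p = (0.5, 0.1, 0.4)` the cumulative probabilities are roughly `(0.5, 0.6, 1.0)`, so a uniform draw of 0.55 lands in the second category and `findCategory` returns index 1. Returning indices rather than a one-hot matrix keeps the sketch short; the one-hot encoding can be built from the indices if needed.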

In terms of performance it turns out that the simple C++ program actually does somewhat better than the base R alternative:

```
library(microbenchmark)
library(ggplot2) # needed for autoplot()
n <- 1000
p <- c(0.5, 0.1, 0.4)
# Time both samplers on the same inputs:
mb <- microbenchmark(
  "rmultinom" = {rmultinom(n, 1, p)},
  "Rcpp" = {simCategorical(n, p)}
)
autoplot(mb)
```

If you have some existing work that you would like to share you can just use it to overwrite the `index.Rmd` file. `blogdown` supports any kind of R Markdown document, so you can use all of your favourite markdown packages (`bookdown`, `pagedown`, …). Just make sure to specify HTML output in the YAML header.

For more information about `blogdown` see here. To inspect the code that builds this website check out my GitHub repository.

BibTeX citation:

```
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {How {I’m} Building This Website in {R}},
date = {2021-02-02},
url = {https://www.paltmeyer.com/blog//blog/posts/how-i-m-building-this-website-in-r},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2021. “How I’m Building This Website in R.” February 2, 2021. https://www.paltmeyer.com/blog//blog/posts/how-i-m-building-this-website-in-r.

Having worked with R Markdown and some of Yihui Xie’s amazing packages for years, I have only now come across his blogdown package. For a while I have been thinking about a good way to share some of my work and actually started collecting snippets in a Gitbook through bookdown quite some time ago. While the book is a work-in-progress that I aim to finish eventually, I will use this website to regularly share content related to my work, research and other things.

Note

**Update on Feb 20, 2022**

I have recently migrated this blog and pretty much everything else I do to quarto.

Quarto® is an open-source scientific and technical publishing system built on Pandoc.

Based on my first few experiences I would go further and say that quarto is *the only* open-source scientific and technical publishing system you’ll ever need. The project is supported by RStudio and (unsurprisingly) Yihui Xie is one of the contributors. Go check it out!

BibTeX citation:

```
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {Welcome},
date = {2021-02-01},
url = {https://www.paltmeyer.com/blog//blog/posts/welcome},
langid = {en}
}
```

For attribution, please cite this work as:

Altmeyer, Patrick. 2021. “Welcome.” February 1, 2021. https://www.paltmeyer.com/blog//blog/posts/welcome.