Heat #1: The Heat Equation (and a Classical Solver)
To mark Valentine’s day, we let ourselves get inspired by feelings of love and human warmth. Love is to be spread around, so they say.
Yeah, we’re sure you guessed it right — in this all-new blog series, we delve into the problem of … 🔥 heat diffusion on a plate! 🔥
Let us set the bar here — in the Heat series, we will take you step by step up to a point where you can gauge the potential of Neural Networks – in both strength and versatility – in solving Partial Differential Equations.
The Heat Series
A 3-Part Journey on ML for PDEs
For us, it’s about No Love Lost and so, in this 3-part series, we’ll use the Heat Equation to plough through. We will:

Heat #1: introduce the physics behind heat diffusion and use a classical numerical routine to approximate its solution across a 2D plate.

Heat #2: with that estimate of the solution as a benchmark, see how Machine-Learning algorithms fare in the same situation, by deploying a PINN (Physics-Informed Neural Network).

Heat #3: explore how versatile these Neural Networks are in handling more complex geometries than the 2D domain, as well as varying initial/boundary conditions.
This field is fast-growing but, most importantly, is still at its blissful infancy stage. Our dream, rooted in a belief, is that Machine Learning may be refined to excel in extremely complex and customizable cases.
This sophistication could potentially outperform every classical solver, allowing us to reach lands classical frameworks have never granted us so far!
Alright. 🔥 Let’s get hot in here! 🔥
Heat #1: the Heat Equation
(and a Classical Solver)
Hello, I Love You
The Heat Equation: a warm-up
Everybody has an elementary feeling for what heat is. Heat diffusion is the process through which energy is transported in space due to gradients in temperature.
These considerations are reflected mathematically in the so-called Heat Equation, which sets how the temperature \(u(t, \mathbf{r})\) at a point in space \(\mathbf{r}\) and time instant \(t\) evolves. It must satisfy:
\[\begin{equation} \frac{\partial u (t, \mathbf{r})}{\partial t} - D \Delta u(t, \mathbf{r}) = 0. \end{equation}\]The thermal diffusivity \(D\) controls how fast heat can spread around a neighborhood. For more realistic (and oftentimes less structured) media, their specifics result in a local diffusivity \(D(t, \mathbf{r})\). In the Heat series, we consider homogeneous media, for which \(D\) is constant throughout.
Now, what about \(\Delta\) — the Laplacian operator? Let’s see how it acts on a function \(u\):
\[\Delta u = \nabla \cdot (\nabla u) = \sum_{i} \frac{\partial^2 u}{\partial {r_i}^2}\]So we have \(n\) \(2^\mathrm{nd}\)-order partial spatial derivatives to account for, but also a single \(1^\mathrm{st}\)-order partial derivative in time. That’s a large batch of differential operators right there. We can lump them into a single differential operator which governs the evolution of the solution we’re after:
\[\mathcal{L} \left[ \frac{\partial}{\partial t}, \frac{\partial^2}{\partial r_1^2}, \dots, \frac{\partial^2}{\partial r_n^2} \right] u(t, \mathbf{r}) = 0\]Alright — no doubt we’re in the presence of a Partial Differential Equation.
These beasts are not easy to tame. However, throughout the Heat series, we exploit the structural simplicity of the Heat Equation — alongside the fact it is both very intuitive to understand and easy to set up meaningfully.
Between Love and Hate
Heating across a 2D square plate
We’ll guide you on how to solve this equation in a simple yet realistic geometric setup: across a 2-dimensional plate of variable size [2].
As you will soon see, this domain is simple enough not only to allow intuitive visualization but also to provide the possibility of adding extra complexity without much effort.
So let’s have a go. Our spatial vector in 2 dimensions is \(\mathbf{r} = (x, y)\) and so the Heat Equation takes the form:
\[\frac{\partial u}{\partial t} - D \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = 0.\]
From a computational point of view, it will do us a favor to think in terms of the result upon applying each differential operator to the solution:
\[\begin{equation} u_t - D \left( u_{xx} + u_{yy} \right) = 0 \label{HE_2D} \end{equation}\]where the partial derivatives are shorthanded to, say, \(\partial_{x} \equiv \partial / \partial x\) and \(\partial_{xx} \equiv \partial_{x} \partial_{x}\).
But hold on! We can’t start solving this beast just yet…
Love is the Drug
Setting the boundaries (& initial condition)
We need more information to completely formulate this PDE. Initial and boundary conditions ensure that a unique solution exists, by fixing the value of the function at particular points \((t,x,y)\) of the input space.
These are normally reasoned through intuition and hindsight. Throughout the Heat series, we employ Dirichlet boundary conditions, which ensure:

through the initial condition, that a certain temperature is fixed across all points on the plate at the initial time instant (when \(t=0\)). We’ll take \(u(0, x, y) = 0 \ ^\mathrm{o}C\).

through the boundary conditions, that a temperature is fixed at all times only for points along the 4 edges of the plate (top, bottom, left and right).
We choose the energy source to act so as to bring the top edge to some hot temperature. We’ll take \(u(t, x, y_\mathrm{max}) = 1 \ ^\mathrm{o}C \ .\)
 The remaining edges are held at some cold temperature. We’ll take
$$\underbrace{u(t, x, y_\mathrm{min})}_{\text{Bottom}} = \underbrace{u(t, x_{\mathrm{min}}, y)}_{\text{Left}} = \underbrace{u(t, x_\mathrm{max}, y)}_{\text{Right}} = 0 \ ^\mathrm{o}C \ .$$
Fig. 1: The boundary and initial conditions used throughout the Heat series. Energy is pumped from the top edge onto an initially completely cold 2D plate. Credits: David Carvalho / Inductiva
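To make these conditions concrete, here is a minimal sketch of how they can be encoded on a discrete grid. The grid size and the `u0[i, j]` indexing convention (with `j` running along \(y\)) are our own illustrative choices, not the repository’s:

```python
import numpy as np

# Sketch: encode the initial/boundary conditions on a discrete grid.
# u0[i, j] holds the temperature at (x_i, y_j); the 50 x 50 size is arbitrary.
nx, ny = 50, 50
u0 = np.zeros((nx, ny))  # initial condition: the whole plate starts cold (0 °C)
u0[:, -1] = 1.0          # top edge (y = y_max) is held at the hot 1 °C
# bottom, left and right edges stay at 0 °C, already covered by np.zeros
```

The boundary values will be re-imposed (or simply left untouched) at every time step, which is what makes these Dirichlet conditions hold for all \(t\).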
The Power of Love
… and Classical Numerical Methods
As usual, finding governing equations from first principles is actually the easy part. Rather, solving them presents us major challenges. Why?
Most PDEs do not admit analytical, pen-and-paper solutions. Only a handful of cherry-picked differential operators can give rise to closed-form solutions. Lesson learnt — numerical approximations must be used.
But which ones? In which conditions? For what type of PDE?
These are general and difficult questions to answer. Wildly different methods have been curated by the computationally-inclined in the Mathematical and Physical communities — not as recipes set in stone but rather as frameworks prone to constant scrutiny.
Mesh-based methods are traditionally the dominant approaches. The main idea is to discretize the domain of interest into a set of mesh points and, with them, approximate the solution. A whole zoo of methods exists, in particular:
Finite Difference Methods (FDMs) replace the PDE with a discretized analogue on a grid, with the aid of function differences.
Finite Element Methods (FEMs) subdivide the problem domain into multiple elements and apply the equations to each one of them.
Finite Volume Methods (FVMs) build an approximation to the solution from the exact computation (taking advantage of the divergence theorem) of its average value over each of the smaller subvolumes into which the domain is partitioned.
Cupid shouldn’t show his preference here. All these methods have their advantages and shortcomings given the idiosyncrasies of each PDE and setup of the problem. Sometimes they may even coincide — for some simple scenarios (such as regular grids) FEMs and FDMs might end up being the same [1].
Typically, mesh-based classical routines tend to be very efficient in low-dimensional problems on regular geometries. We will then use this convenience to our advantage by applying an FDM in an FTCS (Forward in Time, Centered in Space) scheme.
Never Mesh with Love
Discretizing our domain
Deciding how to discretize the domain is also far from set in stone: for a given instantiation of a PDE problem, it is typically one of the most complex steps in its resolution. Squares are cool and all, but what if we want to simulate the Heat Equation in a mug or a pan?
We’ll make our lives easier (for now!) by using a regular grid to represent the domain. With the aid of the cube:
in a regular grid with \(N_t\), \(N_x\) and \(N_y\) points along the \(t\), \(x\) and \(y\)-axes, respectively, we can set the step intervals \(\Delta t\), \(\Delta x\) and \(\Delta y\), defined by the regular spacing along each dimension:
\[\Delta t = t_{k+1} - t_k \ \ \ \ \ \ \Delta x = x_{i+1} - x_i \ \ \ \ \ \ \Delta y = y_{j+1} - y_j.\]
Consequently, input points \((t,x,y)\) become discretized as \((t_k, x_i, y_j)\) and associated with a node \([k,i,j]\). Here, \(t_k = k \, \Delta t\), \(x_i = i \, \Delta x\) and \(y_j = j \, \Delta y\).
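As a sketch of this discretization step (the extents and point counts below are illustrative, not the ones used later in the post), NumPy’s `linspace` produces exactly such regularly spaced samples:

```python
import numpy as np

# Illustrative grid: unit intervals in t, x and y.
t_max, x_max, y_max = 1.0, 1.0, 1.0
Nt, Nx, Ny = 101, 51, 51

t = np.linspace(0.0, t_max, Nt)  # t_k = k * dt
x = np.linspace(0.0, x_max, Nx)  # x_i = i * dx
y = np.linspace(0.0, y_max, Ny)  # y_j = j * dy

# regular spacing along each dimension
dt, dx, dy = t[1] - t[0], x[1] - x[0], y[1] - y[0]
```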
It is in this pixelated world we will express how heat will diffuse away…
Eu Tenho Dois Amores (I Have Two Lovers)
From continuous to discrete derivatives
So, we now need to express a differential operator in a finite, discretized domain. How exactly do we discretize such abstract objects, like the (univariate) derivative \(f_x(x)\):
\[f_x(x) = \underset{\Delta x \to 0}{\mathrm{lim}} \; \frac{f(x+ \Delta x) - f(x)}{\Delta x},\]or a partial derivative with respect to, say, a coordinate \(x_j\):
\[f_{x_j}(\mathbf{x}) = \underset{\Delta x_j \to 0}{\mathrm{lim}} \; \frac{f(\mathbf{x} + \Delta x_j \, \mathbf{e}_j) - f(\mathbf{x})}{\Delta x_j},\]where \(\mathbf{e}_j\) is the unit vector along the \(j\)-th coordinate.
Being on a grid means that we require information on a finite number of points in the domain. The FDM philosophy is to express (infinitesimal) differentials by (finite) differences between any nodes.
The continuous thus becomes discrete — the differential operator \(\mathcal{L}(t,x,y)\) becomes a function-valued discrete grid \(\equiv \mathcal{L}[k,i,j] = \mathcal{L}(t_k,x_i,y_j)\). As a result, our solution will also be output as a grid \(u[k,i,j]\).
In essence, we must find an algorithm which propagates the constraints set by the PDE from the initial conditions, as faithfully as possible, along the grid. More specifically, you might wonder:
How to approximate \(\mathcal{L}\) reasonably to obtain \(u[k,i,j]\) for all nodes \([k,i,j]\)?
This is exactly what the FDM solver will accomplish. For that, it has to approximate the differential operators — \(u_{t}[k,i,j]\), \(u_{xx}[k,i,j]\) and \(u_{yy}[k,i,j]\) — at all nodes \([k,i,j]\).
Goodbye operators, hello grids!
You’ve Got (Not) to Hide Your Love Away
Cupid chooses a Finite Difference Method
Approximating differentials on a discrete set is also not a recipe set in stone. Let us look at two Finite Difference approximations to a function of a single variable \(x\):
Fig. 2: Approximation schemes for the derivative of \(f(x)\) at a grid point \(x_i\) using (left) forward and (right) centered schemes. Note that either method differs from the actual value \(f_x(x_i)\), given by the slope of the solid straight line. Credits: David Carvalho / Inductiva
These will estimate the true instantaneous rate via neighboring differences. Expectedly, an error is incurred in the process.
The price to pay for “dropping” the limit is a truncation error: for the forward scheme it is 1st-order, meaning the error of this estimate scales linearly with the spacing \(\Delta x\). For our approximation to have a chance to succeed, we had better sample the \(x\)-axis with a high \(N_x\)!
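To see this scaling in action, here is a small experiment (the test function \(\sin\) and the evaluation point are our own choices): the forward difference error shrinks roughly linearly with the spacing, while the centered one shrinks quadratically.

```python
import numpy as np

# Forward vs. centered differences for f(x) = sin(x) at x0 = 1.0;
# the exact derivative there is cos(1.0).
f, x0 = np.sin, 1.0
exact = np.cos(x0)

for h in (1e-1, 1e-2, 1e-3):
    fwd = (f(x0 + h) - f(x0)) / h            # forward: error ~ O(h)
    ctr = (f(x0 + h) - f(x0 - h)) / (2 * h)  # centered: error ~ O(h^2)
    print(f"h={h:g}  forward err={abs(fwd - exact):.2e}  "
          f"centered err={abs(ctr - exact):.2e}")
```

Halving \(h\) roughly halves the forward error but quarters the centered one, which is why the centered scheme is preferred for the spatial derivatives below.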
You Can’t Hurry Love
Choosing how to spread heat
Given a particular PDE, different FDMs iterate differently over the grid, node by node, to obtain estimates of the differential operators. So, fixing a scheme is (again) a delicate task.
You guessed it right: for this problem, we will use the 2 approximation schemes we showed before to evaluate the differential operators. They are combined in the FTCS scheme:

Forward in Time: the time partial derivative \(u_t\) uses the forward 1st-order difference:
\[u^{k, i, j}_t = \frac{u^{k+1, i, j} - u^{k, i, j}}{\Delta t}\]
Centered in Space: the spatial partial derivatives \(u_{xx}\) and \(u_{yy}\) are computed through a 2nd-order central difference, by applying the centered difference twice:
We first approximate the second-order derivative (say, \(u_{xx}\)) via the centered difference of the first derivative, evaluated at half-step nodes:
\[u_{xx}^{k,i,j} \approx \frac{u_x^{k,i+\frac{1}{2},j} - u_x^{k,i-\frac{1}{2},j}}{\Delta x}\]
And then express each first-order term \(u_x\) in the same fashion, e.g.:
\[u_x^{k,i+\frac{1}{2},j} \approx \frac{u^{k,i+1,j} - u^{k,i,j}}{\Delta x}\]
With a bit of rearranging, both spatial derivatives become:
$$ \begin{split} u_{xx}^{k,i,j} = \frac{u^{k, i+1, j} - 2 u^{k, i, j} + u^{k, i-1, j}}{(\Delta x)^2} \ \ \ \ \ \ && \ \ \ \ \ \ u_{yy}^{k,i,j} = \frac{u^{k,i,j+1} - 2 u^{k,i,j} + u^{k,i,j-1}}{(\Delta y)^2} \end{split} $$
Look — some terms are evaluated outside the original grid. However, you will notice that these fractional indexes only appear for the computation of intermediary constructions/variables [4].
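The resulting stencil is straightforward to apply with array slices. A small self-check of our own, using \(u(x) = x^2\), whose second derivative is exactly 2 everywhere:

```python
import numpy as np

# Centered 2nd-order difference written with array slices: the same stencil
# the FDM applies at every interior node. For u(x) = x^2, u_xx = 2 exactly.
x = np.linspace(0.0, 1.0, 101)
dx = x[1] - x[0]
u = x**2

u_xx = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2  # interior nodes only
```

Note the result only covers interior nodes: the stencil needs both a left and a right neighbor, which the endpoints lack.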
Victim of Love
Heat diffusion on a grid
Phew! That was intense but we now have all the tools to start simulating heat flow! For that, we must:
discretize the domain by sampling each dimension individually and considering all possible combinations of the coordinates, creating points \((t_k,x_i,y_j)\) indexed by nodes \([k,i,j]\);
discretize the differential operators using an FTCS scheme;
solve for the solution grid \(u[k,i,j]\) by using the iteration rules that propagate the solution across all nodes \([k,i,j]\).
In this fashion, we convert the 2D Heat Equation into this difference equation:
\[u^{k+1,i,j} = u^{k,i,j} + \alpha \left( u^{k,i+1,j} - 2 u^{k,i,j} + u^{k,i-1,j} \right) + \beta \left( u^{k,i,j+1} - 2 u^{k,i,j} + u^{k,i,j-1} \right),\]where we lump all parameters into the ratios:
\[\begin{split} \alpha \equiv D\frac{\Delta t}{(\Delta x)^2} \ \ \ \ \ \ && \ \ \ \ \ \ \ \beta \equiv D \frac{\Delta t}{(\Delta y)^2}. \end{split}\]We can understand how knowledge about the function propagates along the grid with a stencil: a depiction of which input grid points are needed to iterate the algorithm so that a solution estimate may be computed at all grid points.
Fig. 3: A stencil used by the FTCS scheme. At a given node \([k,i,j]\), the solution is propagated forward in time as \(u[k+1,i,j]\) by considering the 2-point centered derivatives in \(x\) (the neighboring nodes \([k,i+1,j]\) and \([k,i-1,j]\)) and in \(y\) (the nodes \([k,i,j+1]\) and \([k,i,j-1]\)). Credits: Augusto Peres, Inductiva.
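Putting the stencil to work, one FTCS update can be sketched in a few lines. This is our own minimal version, not the exact code from the repository; note that the Dirichlet boundaries are simply left untouched at every step:

```python
import numpy as np

def ftcs_step(u, alpha, beta):
    """One FTCS update of the interior nodes; the boundary rows/columns
    (the Dirichlet conditions) are left untouched."""
    u_new = u.copy()
    u_new[1:-1, 1:-1] = (
        u[1:-1, 1:-1]
        + alpha * (u[2:, 1:-1] - 2 * u[1:-1, 1:-1] + u[:-2, 1:-1])  # u_xx term
        + beta * (u[1:-1, 2:] - 2 * u[1:-1, 1:-1] + u[1:-1, :-2])   # u_yy term
    )
    return u_new

# Tiny demo: a cold 5 x 5 plate with a hot top edge, stepped once in time.
u = np.zeros((5, 5))
u[:, -1] = 1.0
u = ftcs_step(u, alpha=0.2, beta=0.2)
```

After a single step, only the interior nodes adjacent to the hot edge have warmed up, exactly as the stencil predicts.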
Time to Heat [Start]
Run our code
It’s time to play with all these concepts and see how our FDM approximation fares.
All our code can be accessed on our code snippets GitHub repository. There, you will find the heat_fdm.py file for the simulations, plus plotting utilities in /utils.
Have a go yourself!
Fast(er) Love
The role of the \(D\)ffusivity
We will pick a spatial discretization of, say, \(500 \times 500\) points. We can now run for different thermal diffusivities and see their effect. Below you can see the temperature profiles as we increase \(D\) — first with \(D=0.01\), then \(D=0.1\) and finally \(D=1\).
Fig. 4: Role of various \(D\) in the diffusion. Credits: Manuel Madeira / Inductiva
Love is in the air — but heat for sure is on the plate!
We can reason with these results.

At the beginning (for \(t=0\)), heat starts developing from the only points that are not at the common temperature (\(0 \; ^\mathrm{o}C\)), where temperature gradients exist. These are points near the hot (top) edge.

From that point on, we can observe the progressive heat diffusion from the hot edge towards lower regions, until we reach a stationary state, where no further changes occur (and the system is in thermal equilibrium).

Heat does not spread evenly across the plate — its spread is faster in regions that are further from the cold regions. Given this symmetric setup, points along the central line between the edges allow for faster diffusion than regions near the edges. This will progressively wind down as we approach the cold edges, and a paraboloid-like pattern is observed.

But more importantly — the higher the \(D\), the faster this spread occurs. This directly impacts the choice of the time sampling.
So — it seems that each internal parameter \(D\) requires its own discretization setup somehow. But in what way?
How Stable Is Your Love?
Setting stability criteria
We ran our first classical PDE solver. Yay! However, should we be happy with the results? Clearly, we can think of potential issues, such as:
for excessively sparse discretizations, our approximations of the differential operators will be miserably undersampled and, consequently, the trial solution will be shockingly wrong.
error buildup will take place, as the iterative procedure tends to propagate and amplify errors across iterations. In those cases, we say that the method is unstable.
However, these are the extreme instances. Were we lucky? If we are using “reasonable” discretizations, a question should make you scratch your head:
How can we be sure a certain discretization provides a trustworthy approximation to the PDE solution?
Fortunately for us, you can notice that the discretization setup can be monitored exclusively through the ratios \(\alpha\) and \(\beta\)! This raises the question:
Can we somehow fine-tune them to ensure stability, i.e. find admissible combinations of \((\Delta t, \Delta x, \Delta y)\) for a fixed \(D\)?
Well, theoretical studies can be performed to find a restricted space of acceptable discretization setups. For our case [3], Von Neumann analysis establishes a surprisingly simple stability criterion:
\[\alpha + \beta \leq \frac{1}{2}\]These two constants provide a straightforward procedure to be sure that we i) are not oversampling or undersampling a dimension with respect to any other and ii) have spacings that can capture meaningful changes of the solution.
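The criterion is cheap to check before launching a simulation. A sketch (the grid spacing and time steps below are our own illustrative values, not the post’s exact parameters):

```python
def is_stable(D, dt, dx, dy):
    """Von Neumann stability criterion for the FTCS scheme: alpha + beta <= 1/2."""
    alpha = D * dt / dx**2
    beta = D * dt / dy**2
    return alpha + beta <= 0.5

# Example: D = 0.1 on a 500 x 500 spatial grid over the unit square.
dx = dy = 1.0 / 499
print(is_stable(0.1, dt=1e-5, dx=dx, dy=dy))  # alpha + beta ≈ 0.498, just inside
```

Doubling that time step pushes \(\alpha + \beta\) to roughly 0.996 and the check fails, which is exactly the kind of blow-up we demonstrate next.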
This is not a loose condition. Just to show you that we’re playing with fire, we cannot think of a more illustrative example than ramping up that bound by a mere \(2 \%\) above the theoretical maximum, i.e. for \(\alpha + \beta \approx 0.51\):
Fig. 5: In a discretization step for which \(\alpha + \beta > 0.5\), our algorithm is bound to fail. Credits: Manuel Madeira / Inductiva
Yup: for this setup, an “explosion” caused by the FDM instability can be seen propagating along the trial solution — quickly unfolding into some psychedelic burst!
At the very beginning, some similarities to the downwards diffusion pattern (that we just saw) are still visible, but they quickly fade away as the error buildup spreads, faster in regions with already large errors.
It’s a race for disaster: the exponential buildup of the errors, even if sustained for only a handful of propagation steps, ensures that some nodes will experience huge growth (and \(u\) becomes unbounded), \(u \mapsto 1 \mapsto 10^{15} \mapsto \dots 10^{80} \dots\), until the maximum computer precision is reached — at which point its value becomes a NaN (Not a Number). You can see that eventually all points are not even represented on the scale (they’re white)!
By the same token, let’s now see what happens if we stay close to the edge, but within the admissible area. Dropping \(2 \%\) below the bound, let’s heat the start button for \(\alpha + \beta \approx 0.49\).
Fig. 6: With our parameters recalibrated, stability has been restored! Credits: Manuel Madeira / Inductiva
Incredible! Not even a minimal sign of instability coming through.
We’re happy: we obtained seemingly satisfactory outputs in a reasonable way. The simulations take about \(10\,s\) and CPUs can handle the computation just fine.
But remember: this is an extremely simple system… We could be considering this problem in 3D, or on way more complex domains and boundaries. In an extreme limit, we could be using these FDMs on highly-nonlinear equations with hundreds of variables.
To Vectorize or Not To Vectorize
That is not the question
Understandably, instructing how to compute many functions over a grid is not a trivial task — issues like computational power and memory storage can become bottlenecks even for discretization setups far from ideal.
We must understand how our implementation choices affect the algorithm performance and the output accuracy. In essence:
How much faster can we solve the PDE without compromising the output?
Using the running time as a metric, we’ll compare 3 mainstream implementations:

Nested For cycles: we fill in the entries of the plate at the new time frame \(u^{k+1, i, j}\) by nesting two For cycles that iterate through the \(x\) and \(y\) axes of the previous time frame \(u^{k, i, j}\).

Vectorized approach: we obtain \(u^{k+1,i,j}\) as a linear combination of slightly shifted versions of \(u^{k,i,j}\), multiplied by constants known a priori, resulting in scalar multiplications and additions of matrices. NumPy implementations will be used.

Compiled approach: we compile the vectorized approach, which allows sequences of operations to be optimized together and run at once. The code compilation is performed through JAX, using its just-in-time (JIT) compilation decorator.

The approach used is passed as a flag: vectorization_strategy=my_strategy. The default is numpy, for which the vectorized approach is used. When set to jax, a compiled version of the vectorized approach is used and, when set to serial, the nested For cycles approach is used.
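To make the contrast concrete, here is a minimal sketch of the serial and vectorized strategies for a single FTCS step (our own simplified versions, not the repository’s code; a JAX variant would simply wrap the vectorized function with jax.jit):

```python
import numpy as np

def step_serial(u, alpha, beta):
    """Nested-For version of one FTCS update (slow, shown for comparison)."""
    u_new = u.copy()
    nx, ny = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u_new[i, j] = (u[i, j]
                           + alpha * (u[i + 1, j] - 2 * u[i, j] + u[i - 1, j])
                           + beta * (u[i, j + 1] - 2 * u[i, j] + u[i, j - 1]))
    return u_new

def step_vectorized(u, alpha, beta):
    """Same update, written as a linear combination of shifted views of u."""
    u_new = u.copy()
    u_new[1:-1, 1:-1] = (
        u[1:-1, 1:-1]
        + alpha * (u[2:, 1:-1] - 2 * u[1:-1, 1:-1] + u[:-2, 1:-1])
        + beta * (u[1:-1, 2:] - 2 * u[1:-1, 1:-1] + u[1:-1, :-2])
    )
    return u_new

# Both strategies must produce identical grids.
rng = np.random.default_rng(0)
u = rng.random((30, 30))
assert np.allclose(step_serial(u, 0.2, 0.2), step_vectorized(u, 0.2, 0.2))
```

The two functions compute exactly the same grid; only how the work is dispatched to the hardware differs, which is where the timings below come from.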
After running each implementation for the same parameters, we obtain wildly different running times:
vectorization_strategy=serial  Time spent in FDM: 93.7518858909607s
vectorization_strategy=numpy  Time spent in FDM: 0.6520950794219971s
vectorization_strategy=jax  Time spent in FDM: 0.297029972076416s
Wow! By using the vectorized approach, we obtain more than a \(100\)-fold boost in running speed!
This is not surprising. The vectorized approach takes advantage of SIMD (Single Instruction, Multiple Data) computations, where one instruction carries out the same operation on a number of operands in parallel. The discrepancy is then easy to understand: a huge acceleration is observed in contrast with the nested For cycles, where each instruction is executed one at a time.
We are not done yet. By compiling the vectorized approach, we obtain yet another \(3\)-fold speedup, further improving our metric! This improvement would be even more significant if we ran the compiled code on GPUs (currently, JAX still only supports the CPU on Apple Silicon machines).
Computation resources are finite (and costly). Even with the huge boost granted by the compiled vectorized approach, our task improved by just over 2 orders of magnitude. Is that enough when dealing with far more complex PDEs in higher dimensions?
Limit to Your Love
How far can we push classical methods?
Despite our pleasant experience with a low-dimensional problem on a regular geometry, we should be careful when thinking ahead. Classical methods for solving PDEs have long been recognized to suffer from some significant drawbacks, and we will comment on a few:
Meshing
… Gives Love a Bad Name
Meshing becomes a daunting task in more complex geometries. Different discretization setups can always be chosen but once one is fixed, we have limited ourselves to computations on the nodes generated by that particular setup. We can no longer apply arbitrary smaller steps to approximate the differential operators.
Worse still — depending on the behavior of the differential operator, we might need to focus on certain regions more than others. It’s unclear how these choices impact the quality of the approximations.
Discretizing the problem also creates a bias in favor of the grid points with respect to all others.
But what if the PDE behaves differently outside the mesh?
The best we can do is to infer a solution through interpolation, but this is admittedly unsatisfactory.
Scaling
Will Tear Us Apart
Classical algorithms scale terribly with the PDE dimensions and grid refinements.
For all the Complexity lovers out there — you know that sweeping the regular grid \([k,i,j]\) step-by-step requires algorithms of order \(\mathcal{O}(N_t N_x N_y)\). By simply refining our grid \(5\) times along each dimension, we scale the number of nodes by a whopping factor of \(5^3 = 125\) (!!!).
Similar bad news is expected as we consider simulations of PDEs in higher dimensions. For a problem of 3D heat diffusion, the complexity grows as \(\mathcal{O} (N_t N_x N_y N_z)\) and, naturally, \(d\) spatial dimensions scale as \(\mathcal{O}(N^d)\) — an exponential scaling of the algorithm with dimension.
Crazy Little Thing Called…
Feasibility
Even though we can rely on stability conditions, this does not mean we can deploy the discretizations they prescribe — they may be computationally very, very, …, heavy, or simply impossible.
As we saw, simulating a medium with higher \(D\) requires a finer time discretization. For some instances, satisfactory \(\Delta t\) might be too small to be practical and, in the extreme case, downright untenable!
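Rearranging the stability criterion gives the largest usable time step explicitly, \(\Delta t_{\max} = \frac{1}{2D} \left( \frac{1}{(\Delta x)^2} + \frac{1}{(\Delta y)^2} \right)^{-1}\), which shrinks linearly as \(D\) grows. A quick illustration (the spacing is our own choice):

```python
# Maximum stable time step from alpha + beta <= 1/2:
#   dt_max = 1 / (2 * D * (1/dx**2 + 1/dy**2))
dx = dy = 0.002  # illustrative spatial spacing

for D in (0.01, 0.1, 1.0):
    dt_max = 1.0 / (2.0 * D * (1.0 / dx**2 + 1.0 / dy**2))
    print(f"D={D:>5}  dt_max={dt_max:.1e}")
```

A 100-fold increase in \(D\) forces a 100-fold smaller \(\Delta t\), and hence 100 times more time steps to cover the same interval.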
This comes at a bitter price and is a major drawback of the FDM. Baby, was that a cheap Mon Chéri knockoff?
All for Love
… and Machine Learning?
PDE simulations must be as fast and light as possible. But can classical methods, in general, get fast enough?
We can look farther ahead. Consider something extreme (but likely). Imagine our task is to find parameters that lead to optimal performance for some problem.
Let us assume we have access to the theoretically fastest PDE classical solver, foolproof to instabilities and extremely efficient in terms of memory and processing.
If hundreds, millions, …, billions of intermediate PDE calculations had to be performed with that algorithm, could we afford the computational budget? Would the time elapsed be reasonable?
Certainly not. It is not difficult to estimate runtimes comparable to our lifetime (even if run across millions of processors!) for some real-life optimization tasks.
This would forbid us from iterating across the parameter space fast enough — Game over.
Efficiency in classical methods won’t provide the scaling we need to overcome this barrier. A new paradigm is needed.
Fortunately, a large class of algorithms can handle this curse of dimensionality exceptionally well. Can you guess what we are talking about?
Yup, Deep Learning!
In the following posts of this series, we will address how we can leverage those approaches to possibly overcome the issues raised by classical methods.
Until then, don’t melt away!
Advice For The Young At Heart
References & Remarks
[1] More on this possible similarity between FEMs and FDMs here.
[2] The FDM-based approach we present loosely follows this excellent example that inspired us.
[3] More details on the derivation of this expression can be found here (pages 11/12).
[4] There are more complex cases where it is convenient to define such staggered grids, with fractional offsets to the original (e.g. for Maxwell’s Equations).
Love For…
Credits
We thank our Inductiva pals — Manuel Madeira (main author), David Carvalho (editor and coauthor), Fábio Cruz (technical editor and reviewer) and Augusto Peres (reviewer and animation creator) for this post.
❤️❤️❤️️ Pounds of Love, from us at Inductiva. ❤️❤️❤️