Heat #2: The Heat Equation (and a Physics-Informed Neural Network)
In the debut of this 3-post series, where we intend to showcase the power of Neural Networks to solve differential equations, we introduced you to the equation that serves as our prototypical example (the Heat Equation) and to the general setup we will use throughout (a 2D plate with edges kept at fixed temperatures).
At that initial stage, we showcased a classical algorithm comprising a set of numerical routines to solve the Heat Equation. It consisted of discretizing the differential operators to obtain an estimate that approximates the solution at specific positions and times (remember the Finite Difference Method?).
At the very end of that post, we motivated the importance of fast computation runtimes and explained how runtimes can become a bottleneck for large optimization problems.
As our scientific and technological needs evolve, it is not unrealistic to expect problems that depend on solving billions of partial differential equations (PDEs) with hundreds of variables, defined on high-dimensional spaces with hard boundary constraints.
Can Deep Learning lend us a hand in this regard? Let’s find out!
Employing a disruptive ethos, in this post we will hopefully convince you that Deep Learning (DL) methods hold a whole lot of potential for tackling this issue.
In recent years, huge developments in Machine Learning have been triggered by ever-improving computational infrastructure and modeling sophistication.
In fields such as Computer Vision or Natural Language Processing, Deep Learning (the subfield of Machine Learning that handles Neural Networks (NNs)) has enabled huge advances in what can be done.
In fact, three extremely powerful points assure any researcher that NNs can be leveraged in their field:
- the capability NNs have to handle extremely high-dimensional spaces (for which, as we have motivated, bottleneck-level issues due to the curse of dimensionality will almost invariably arise);
- theoretical guarantees of success, namely the Universal Approximation Theorem [1];
- better still, all the DL software artillery available has dramatically lowered the entry barrier for deploying NN frameworks. It has never been this easy to set up sophisticated models with a handful of lines of code!
It is no surprise that NNs started invading other territories and Fundamental Sciences are no exception.
Deep Learning for… Partial Differential Equations?
The underlying laws behind most natural sciences rely on equations dictating how certain quantities vary in time and space. Naturally, we must wonder:
Can we possibly solve such an abstract problem like a Partial Differential Equation (PDE) with the aid of NNs?
To answer this question, let’s first formalize a bit. In PDEs, the unknowns are functions which depend on various variables and are defined over some domain. Solving the PDE (or systems thereof) entails finding such functions.
For a single PDE in \(n\) spatial variables \(r_i\) and a single temporal variable \(t\), we are after a function \(u: \Omega \subset \mathbb{R}^{n+1} \rightarrow \mathbb{R}^d\) which satisfies:
\[\mathcal{F} \left[u, \frac{\partial u}{\partial r_i}, \frac{\partial u}{\partial t}, \ldots \right] = 0,\]where \(d\) sets the dimensionality of the output.
\(\mathcal{F}\) is a differential operator — an operator which may depend on those variables but also on the trial solution itself and, most importantly, on its derivatives. In general, these derivatives can be of any order.
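For concreteness (and anticipating the example we will solve below), the 2D Heat Equation with diffusivity \(D\) fits this template with
\[\mathcal{F}\left[u, \ldots\right] = \frac{\partial u}{\partial t} - D \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = 0.\]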
Our hope is thus to find a NN whose output trial solution can mimic the actual PDE solution.
Deep Advantages
Guess what? As parametric function approximators, NNs are particularly well tailored for this problem.
If you recall what we discussed in Heat #1, many limiting issues are found when numerically solving a PDE via a classical method by discretizing the domain — usually performed with Finite Difference or Finite Element Methods. Ideally, we would like to circumvent these problems.
As we will soon see, PINNs (Physics-Informed Neural Networks) address most of the previously mentioned limitations of the classical methods:
- they are mesh-free, so interpolation at non-mesh points is no longer a concern. Once trained, PINNs can be evaluated at any point of the domain of interest. The way the training is performed can also be adjusted so that the function-valued solution we seek can be found in very efficient ways;
- there is no error build-up or instability coming from the iterative nature and the discretization imposed by the classical methods.
For now, it’s clear — we should express a solution of the PDE as a NN.
Buckle up and let the fun begin!
Physics-Informed Neural Networks (PINNs)
We will now showcase one of the hottest approaches to tackling PDEs from a DL perspective — Physics-Informed Neural Networks (PINNs) [2,3].
In what way does this architecture differ from more conventional NN models? Well, firstly we:
- try to approximate the solution of the PDE with a NN that fits some provided data points. Typically, these are points set by the boundary and initial conditions, but they can also lie in the interior of the domain;
- constrain the space of functions from which the NN can learn by penalizing solutions whose partial derivatives do not satisfy the desired PDE.
The first point is commonplace in DL approaches. It simply entails a ‘supervised’ minimization of the mismatch between the output and the ground truth set by the conditions. This flavor of architecture is normally termed vanilla.
The whole novelty and potential of PINNs stem from the second point, which fundamentally deviates from vanilla DL by regularizing the NN guesses.
It is this aspect that makes these NNs Physics-informed: of all the NNs that we could learn, we impose an inductive bias towards those that verify the Physics of the system at hand (i.e. what the conditions expressed in the PDE enforce).
The word Physics here is a bit misleading: PINNs do not have to pertain to physical systems. By Physics, it is simply meant that some strong principle or law must hold across the system. Curiously enough, such laws tend to be expressed precisely with PDEs!
Training the PINN
But how can we obtain a NN that we are confident mimics the actual solution of the PDE, taking into consideration the two aforementioned notions?
Needless to say, there are many answers, depending on the problem at hand. For instance, we could
- try to construct (activation) functions that necessarily respect the differential equation;
- if, say, the PDE were linear, use linear combinations of solutions so as to respect the PDE;
- work in the spectral (frequency) domain — where the derivatives could be tied by some relation — and then use an inverse fast Fourier transform to map the solution back to the direct domain.
A Clever Loss Function
The answer that requires minimal change, though, is to tweak the way the loss function is computed. This objective function \(\mathcal{L}\) depends on the model's tunable parameters \(\mathbf{\theta} = (\theta_1, \theta_2, \ldots)\). In order to learn the solution, this loss must be minimized.
We want to make the loss as small as possible. Once we’re in that regime, we can be fairly sure our trial output is somehow close to the true solution and we can say our model has learnt how to solve the PDE.
Tracking the behavior of \(\mathcal{L}\) allows the NN to adjust its internal parameters towards the values that lead to a smaller loss and thus to the best approximation of the PDE solution by the NN.
With this motivation — admittedly a more mathematical one — we can now formulate the impact of each requirement on the overall loss through an individual term:

Output values loss \(\mathcal{L}_\mathrm{data}\) — any loss that penalizes a NN that does not reproduce the function values at the several prescribed points. For instance, we can consider a Mean Squared Error between the values output by the NN and the target ones (which we know a priori):
$$ \mathcal{L}_\mathrm{data} = \frac{1}{N} \sum_{i=1}^N \left\| \hat{u}(t^i, r^i_1, \ldots, r^i_n) - u^i \right\|^2, $$where \(\hat{u}(\ldots)\) is the NN estimate and \(u^i\) is the ground truth value of \(u\) at the point \((t^i, r^i_1, \ldots, r^i_n)\).
Typically, the \(N\) points used to compute \(\mathcal{L}_\mathrm{data}\) are drawn from the initial and the boundary conditions, but this is not strict — those could be any points!
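As a minimal sketch of this term (plain PyTorch for illustration, not the IDRLnet implementation; the network model and the tensors txy_data and u_target are hypothetical placeholders), we could write:

```python
import torch

# Hypothetical fully-connected NN mapping (t, x, y) -> u.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

# N points (t, x, y) with prescribed target values (e.g. drawn from the
# initial and boundary conditions) -- placeholder values for illustration.
txy_data = torch.rand(128, 3)
u_target = torch.rand(128, 1)

# Mean Squared Error between the NN estimates and the prescribed values.
loss_data = torch.mean((model(txy_data) - u_target) ** 2)
```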

PDE “Regularization” \(\mathcal{L}_\mathrm{PDE}\) — a term which penalizes models that do not satisfy the PDE. To achieve this, we can use a little trick: since we wrote our PDE in the form \(\mathcal{F}[u, \ldots]=0\), we want the left-hand side to be as close to \(0\) as possible. We can then simply make sure our mimicking function \(\hat{u}\) yields the lowest loss:
\[\mathcal{L}_\mathrm{PDE} = \frac{1}{N} \sum_{i=1}^N \left\| \mathcal{F}[\hat{u}(t^i, \mathbf{r}^i)] \right\|^2.\]Easier said than done. This is a colossal task for each parameter instantiation of our NN (hundreds of billions of parameters may have to be recalibrated!)
However, we can benefit from the autodifferentiation artillery that has been developed for NNs: the available packages allow for a straightforward and efficient way of obtaining the partial derivatives with respect to the many model inputs.
In contrast to the first term, this loss term can be applied at any point, since we do not require prescribed target values there.
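To make this concrete, here is a hedged sketch (again plain PyTorch, not the IDRLnet API) of how \(\mathcal{L}_\mathrm{PDE}\) could be assembled for the 2D Heat Equation via automatic differentiation; the network and the collocation points are placeholders:

```python
import torch

# Hypothetical NN mapping (t, x, y) -> u; collocation points drawn at random.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
D = 0.1  # diffusivity used throughout this post

txy = torch.rand(256, 3, requires_grad=True)  # columns: (t, x, y)
u = model(txy)

# First derivatives of u with respect to (t, x, y).
grads = torch.autograd.grad(u, txy, grad_outputs=torch.ones_like(u),
                            create_graph=True)[0]
u_t, u_x, u_y = grads[:, 0], grads[:, 1], grads[:, 2]

# Second derivatives (the graph was kept, so we can differentiate again).
u_xx = torch.autograd.grad(u_x, txy, grad_outputs=torch.ones_like(u_x),
                           create_graph=True)[0][:, 1]
u_yy = torch.autograd.grad(u_y, txy, grad_outputs=torch.ones_like(u_y),
                           create_graph=True)[0][:, 2]

# Residual of F[u] = u_t - D (u_xx + u_yy); its mean square is L_PDE.
residual = u_t - D * (u_xx + u_yy)
loss_pde = torch.mean(residual ** 2)
```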
We are now in a position to obtain a total loss by summing both terms:
\[\mathcal{L}_\mathrm{PINN} = \mathcal{L}_\mathrm{data} + \mathcal{L}_\mathrm{PDE}.\]Even though both terms are weighted evenly here, this is not a requirement. In fact, it is quite common to find this loss defined as a convex combination of the two terms, where the weighting coefficient is an extra hyperparameter requiring appropriate fine-tuning.
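For instance, with a weighting hyperparameter \(\lambda \in [0, 1]\), that convex-combination variant reads
\[\mathcal{L}_\mathrm{PINN} = \lambda \, \mathcal{L}_\mathrm{data} + (1 - \lambda) \, \mathcal{L}_\mathrm{PDE}.\]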
But putting these and other technicalities aside, this is pretty much what you need to know about PINNs to follow the rest of this post!
Fig. 1: The PINN we used to solve the 2D Heat Equation consists of two parts: firstly, by updating the NN’s weights \(\mathbf{W}\) and biases \(\mathbf{b}\) we can minimize the residual difference between the model and the prescribed values at some selected points; then the NN output is also optimized so that \(\|\mathcal{F}[u, \ldots]\|\) gets as close to 0 as possible. Credits: David Carvalho / Inductiva
IDRLnet: a PINNs library
To set up a DL algorithm, we use IDRLnet, a PINNs library built on PyTorch [4]. To the best of our knowledge, it is one of the most versatile open-source PINN libraries available.
The IDRLnet library has an architecture based on three different types of nodes:
- DataNodes: these are essentially domains obtained from Geometric Objects (also provided by the library), from which it is possible to sample the points used to train the PINN;
- NetNodes: these abstract the neural networks, i.e. the architecture, training hyperparameters and everything else related to the NN used are defined in this node;
- PDENodes: these contain all the information related to the PDE we wish to solve. Contrary to NetNodes, PDENodes do not have trainable parameters.
Fine — we are ready to try it out on our new favorite PDE: the 2D Heat Equation!
Hot NN Coming Through!
To keep things simple, let us start with the exact same situation as the one presented for FDM from Heat #1:
Fig. 2: The boundary and initial conditions used throughout the Heat series. Energy is pumped from the top edge onto an initially completely cold 2D plate. Credits: David Carvalho / Inductiva
A very simple domain was chosen — a regular 2D square plate. We must then consider points of the form \((t, \mathbf{r}) = (t, x, y)\). The temperature \(u(t, \mathbf{r})\) must satisfy the 2D Heat Equation:
\[\frac{\partial u}{\partial t} = D \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right).\]
Recall that the hot edge was kept at the maximal temperature (\(1\;^\mathrm{o}C\)), while the remaining boundaries were kept at the minimal temperature (\(-1\;^\mathrm{o}C\)). Initially, at \(t=0\), all points were at the minimal temperature too.
This prescription allows us to compute the loss term \(\mathcal{L}_{\mathrm{data}}\). As for the regularization, the loss term \(\mathcal{L}_{\mathrm{PDE}}\) becomes:
\[\mathcal{L}_\mathrm{PDE} = \frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial \hat{u}}{\partial t}(t^i, \mathbf{r}^i) - D \left( \frac{\partial^2 \hat{u}}{\partial x^2}(t^i, \mathbf{r}^i) + \frac{\partial^2 \hat{u}}{\partial y^2}(t^i, \mathbf{r}^i) \right) \right\|^2.\]
Time to Heat [Start]
You can find our code here and run your very first PINN! Run this experiment with the following command-line instruction:
python heat_idrlnet.py max_iter=10000 output_num_x=500 output_num_y=500 colorbar_limits=-1.5,1.5
The flags used have the following meaning:
- max_iter defines the total number of training epochs;
- output_num_x and output_num_y define the discretization along the x-axis and y-axis, respectively, of the grid on which we infer results;
- colorbar_limits defines the range of the colorbar used.
For illustrative purposes, we set the diffusivity constant to \(D=0.1\) throughout the entire post.
Disclaimer: our code is able to accommodate some extra complexity that is not needed in this post. For now though, we will not dive into the details but let the magic happen in Heat #3 😎.
Classical vs NN — the fight begins
To see how well our NNbased framework handles the task, we can compare the NN output to the one generated from the classical algorithm we used in Heat #1:
Let’s plot the output obtained with the FDM (a classical algorithm) [top] and a PINN we trained [middle], as well as the error \(\text{Error} = u_{\rm FDM} - u_{\rm PINN}\) [bottom]. This error plot can be easily computed by running the provided heat_error.py Python script.
Fig. 3: Comparison of the results obtained via a classical algorithm (an FDM) [top], a DL algorithm (the PINN computed with IDRLnet) [middle] and the absolute value of their difference [bottom]. They seem very similar! Indeed, an inspection of the error shows global convergence. Credits: Manuel Madeira / Inductiva
Wow — this looks rather good!
The NN output approximates quite closely the results obtained with the FDM. There are some deviations, mainly at the initial instants and in the upper corners. These are actually the regions where sharper transitions of the solution function are found, so it is natural that our PINN has more difficulty fitting correctly there.
This certainly seems hopeful — but an inquisitive mind like yours must be wondering why this PINN worked.
How long should we train?
In order to train the PINN, a rather large number of epochs (\(N_{\rm epochs}=10000\)) was used for training.
This can make us think: just like the classical algorithm had to be tuned so that no nonsensical estimates were output, there must be some suitable tuning to assure us that the algorithm can indeed approximate the solution.
Fig. 4: Difference between the PINN output after training with different numbers of epochs and the respective classical output (serving as a benchmark). We can see a substantial error for few epochs (\(N_{\rm epochs} = 100\)) [top]. If we ramp up 10-fold (\(N_{\rm epochs} = 1000\)), the error is essentially gone [middle] and it is suppressed even further for \(N_{\rm epochs} = 10000\) [bottom]. In this regime, both the classical and DL algorithms provide essentially the same estimate. Credits: Manuel Madeira / Inductiva
The results are just as we expected: the higher \(N_{\rm epochs}\), the better the NN learns. In this particular case, it seems that around \(N_{\rm epochs} \approx 1000\) the training already leads to appropriate learning. In general, tuning this parameter can be costly (more so if performed in a brute-force fashion).
Specific knowledge of the PDE or domain in question may make our lives easier!…
Customized training
The versatility PINNs offer in treating different regions differently raises a question of the utmost relevance:
Are we favoring or penalizing regions that require a different level of refinement?
Fortunately, via TensorBoard, it is possible to track the various loss terms throughout the training procedure by using the command
tensorboard --logdir [insert path to "network_dir"]
With it, we can then see the effect of extending the training procedure. For this, we set different numbers of epochs and see how the output is impacted by this choice.
Fig. 5: Learning curves (in logarithmic scale) for different subdomains. We can see that, even though all terms have different behavior, they eventually converge to exceedingly small values. Credits: Manuel Madeira / Inductiva
You can see how the loss for various subdomains (the captions should be clear to follow) changes as training takes place.
Training PINNs (or NNs in general) is not a completely trivial task. In fact, from the learning curves above, it is clear that if we stopped our training too early, our results would necessarily be worse.
The ghost of overfitting
We typically find overfitting whenever we train our model excessively on a training dataset, ultimately fitting the given data points too strictly, or when training on a closed dataset (i.e. not using randomly-generated data).
This leads to a huge drawback: the model loses its generalization capability.
In other words, when we present the model with new, unseen data from the same distribution that generated the training dataset, the model will have weak predictive power for those points and thus a much higher loss.
A validation set of data points is typically kept aside during the NN training procedure to ensure, through periodic checks, that we do not enter such an overfitting regime.
In this case, we do not have to worry about the possibility of overfitting for two main reasons:
- firstly, we are sampling a different set of data points in each epoch, i.e., our training set changes from epoch to epoch (see the sketch right after this list);
- secondly, we are considering a regularization term in our loss (recall \(\mathcal{L}_\mathrm{PDE}\)).
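As an illustrative sketch of the first point (plain PyTorch, with a hypothetical network and a simplified sampler standing in for the actual IDRLnet sampling), resampling fresh training points at every epoch could look like this:

```python
import torch

# Hypothetical NN and optimizer (illustration only).
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def sample_initial_condition(n):
    # Fresh points at t = 0 with a placeholder target temperature,
    # standing in for the full boundary/initial-condition sampling.
    txy = torch.cat([torch.zeros(n, 1), torch.rand(n, 2)], dim=1)
    u_target = torch.zeros(n, 1)
    return txy, u_target

for epoch in range(1000):
    # The training set changes at every epoch: a brand-new random draw.
    txy, u_target = sample_initial_condition(256)
    loss = torch.mean((model(txy) - u_target) ** 2)  # add the PDE term here too
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```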
Furthermore, we see that the points coming from the boundary conditions are the most troublesome ones, given their higher loss. In contrast, FDM implementations have those points directly hardcoded into the final solution.
Learning Rate
Another typical issue impacting the performance of PINNs (and NNs in general) is the choice of the learning rate \(\alpha\).
This is a hyperparameter — a variable that configures the training procedure rather than being learnt by the model. The update of the NN's parameters is performed by applying optimization algorithms, known as optimizers.
The way in which the learning rate is used by these optimizers varies. However, we can intuitively think of it as the magnitude of the step taken in parameter space at each iteration.
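As a rough mental model (plain gradient descent; optimizers such as Adam are more elaborate), each iteration updates the parameters as
\[\mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha \, \nabla_{\mathbf{\theta}} \mathcal{L}(\mathbf{\theta}),\]so \(\alpha\) directly scales how far we move in parameter space per step.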
Striking a balance between speed and accuracy is not only a vital part of setting the model but also a delicate task on its own.
A larger learning rate allows for larger updates of the parameters and thus faster progress, but at the same time it may lead to large instabilities in the convergence process. In extreme instances, we may never reach the optimal regions at all.
To see this, let us see the effect of using larger and smaller learning rates \(\alpha\) than the one chosen in the previous experiment.
Fig. 6: Learning curves (in logarithmic scale) for two different learning rates. Note that a very large rate \(\alpha = 0.1\) leads to curves that never get smaller, so the NN is bound to fail [top]. Using a smaller rate (\(\alpha=0.0001\)) [bottom], although it takes more time to run, indeed leads to increasingly smaller losses. Credits: Manuel Madeira / Inductiva
On the one hand, for \(\alpha = 0.1\), the model runs rather fast but the learning curve stagnates quickly, since further progress will not get us closer to the minima of the loss function – we simply keep meandering in parameter space!
On the other hand, a small rate \(\alpha=0.0001\) may take more time but allows the model to eventually reach ever-smaller losses.
Balancing performance, accuracy and computation resources surely is a sophisticated feat!
There is art in DL
The lesson we learnt here is that NNs can indeed be adequately trained to yield estimates of PDE solutions without explicitly discretizing and solving the PDE.
For that though, we must be crafty — choices like the learning rate and the number of epochs directly impact the performance of the PINN.
The fine-tuning of hyperparameters is not something set in stone and requires subtlety and exploration to be successful. The technical aspects and their formulation are still open topics, intensively studied in the DL and ML communities.
Just as with FDMs, where we had to fine-tune the discretization parameters in order to ensure the stability of the method, the same must be considered for PINNs — they are not a panacea! Even though they offer an edge in streamlining the computation of the estimate, extra care is still needed!
NN architectures come in many flavors and recipes and may be improved in many intuitive ways. As a general question, we can wonder:
Should we take some inspiration from the classical methods and explore a direction where the boundary (and initial) conditions are hardconstrained in the learning problem (e.g. verified by construction)?
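To make the question more concrete, one common construction (a sketch of one possibility, not what we did in this post) writes the trial solution as
\[\hat{u}(t, x, y) = A(t, x, y) + B(t, x, y)\, N_{\mathbf{\theta}}(t, x, y),\]where \(A\) satisfies the boundary and initial conditions exactly, \(B\) vanishes wherever those conditions are imposed, and \(N_{\mathbf{\theta}}\) is the NN. The conditions are then verified by construction, and only \(\mathcal{L}_\mathrm{PDE}\) remains to be minimized.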
This is one of the questions that we are tackling at Inductiva — and we will show you more as progress is made 😛!
Next episode
In the next (and final!) post of the Heat series, we will precisely unveil some of the aspects that can make our problem harder to solve but also more realistic.
In particular, we will show you how to extend this vanilla version of PINNs to be able to:
- deal with complex geometries (rather than a simple square plate);
- solve a PDE for more than one instance of the boundary conditions without having to retrain the whole model.
Neural Networks sure sound exciting, right? Stay tuned for Heat #3 🌶️!
References
[1] A bit more about the mathematical foundation for the suitability of NNs to model challenging maps from input to output.
[2] Defining papers outlining the architecture and ideas of PINNs in the literature.
[3] Yet another influential paper on PINNs.
[4] Check out the documentation of the IDRLnet library here.
Credits & Thanks
We thank our Inductiva pals — Manuel Madeira (main author), David Carvalho (editor and co-author), Fábio Cruz (technical editor and reviewer) and Mariana Prazeres (who helped us set up and use the IDRLnet library).
Dedication
This post is dedicated to Sílvio Macedo (01/02/1975  17/03/2022), a technology enthusiast, a founder of many companies, and a friend.
Thank you Sílvio for inspiring everyone around you to go further and higher. We try to do that every day at Inductiva. This post is also a bit of you.