Nikolay Babakov is a PhD student at the University of Santiago de Compostela in Spain, and I am helping to supervise him as part of the EU NL4XAI project. Nikolay is working on making Bayesian Networks (BNs) more useful and attractive, and has recently published three papers on this topic, which I describe below.
Bayesian networks are causal graphs which contain probabilistic information about the relationships between nodes. To take a very simple example, a BN might show that Smoking and Pollution are causes of LungCancer, and that LungCancer in turn causes ChestPain and WeightLoss. Probabilities are included, eg how likely Smoking is to cause LungCancer. Reasoning algorithms can use the BN to calculate the probability of an unknown node from observables; for example, if a person smokes heavily and has ChestPain and WeightLoss, but has not been exposed to heavy Pollution, how likely is he to have LungCancer?
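To make this concrete, here is a minimal sketch of the toy network in plain Python. The probabilities are made up purely for illustration (they do not come from any real study), and the query is answered by brute-force enumeration over the one unobserved node, LungCancer:

```python
# Illustrative (invented) probabilities for the toy network:
# Smoking, Pollution -> LungCancer -> ChestPain, WeightLoss
p_smoking = {True: 0.3, False: 0.7}
p_pollution = {True: 0.1, False: 0.9}
p_cancer = {  # P(LungCancer=true | Smoking, Pollution)
    (True, True): 0.20, (True, False): 0.10,
    (False, True): 0.05, (False, False): 0.01,
}
p_chestpain = {True: 0.8, False: 0.1}   # P(ChestPain=true | LungCancer)
p_weightloss = {True: 0.7, False: 0.2}  # P(WeightLoss=true | LungCancer)

def joint(s, p, c, pain, loss):
    """Probability of one complete assignment, via the chain rule."""
    return (p_smoking[s] * p_pollution[p]
            * (p_cancer[(s, p)] if c else 1 - p_cancer[(s, p)])
            * (p_chestpain[c] if pain else 1 - p_chestpain[c])
            * (p_weightloss[c] if loss else 1 - p_weightloss[c]))

def prob_cancer(smoking, pollution, pain, loss):
    """P(LungCancer=true | evidence) by enumerating LungCancer."""
    num = joint(smoking, pollution, True, pain, loss)
    den = num + joint(smoking, pollution, False, pain, loss)
    return num / den

# Heavy smoker, low pollution, with ChestPain and WeightLoss:
print(round(prob_cancer(True, False, True, True), 3))  # → 0.757
```

Real BN engines use cleverer inference algorithms (eg variable elimination) rather than enumeration, but the underlying idea of propagating evidence through the graph is the same.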
BNs as a reasoning mechanism are attractive in principle because they deal naturally with uncertainty while being configurable and explainable (blog), which is very important when an AI system is used to support human professionals in making decisions. However, in practice they can be difficult to build and are not easy to explain, which limits their usefulness. Nikolay’s goal is to make BNs easier to build and explain, and hence more useful.
Reusability of Bayesian Networks case studies: a survey
Like many other PhD students, Nikolay started his PhD doing some surveys of previous work. One of his surveys was recently published in Applied Intelligence (DOI); this focuses on reusability of previously published BNs.
Scientific experiments must of course be reproducible, and an experiment involving a BN can only be reproduced if (amongst other things) sufficient information is provided about the BN to allow it to be reused. Nikolay did a PRISMA structured survey of previous work which showed that only 18% of published papers provided sufficient information. Also, when he directly contacted authors, only 12% provided sufficient information for reusability.
This result was very disappointing to Nikolay (and to me), but perhaps it should not have been a surprise. When we tried to reproduce NLP evaluations in ReproHum, we discovered that only 13% of contacted authors were willing and able to provide sufficient information to allow their experiment to be replicated (blog). So this is a generic problem across AI and NLP; it’s not just a problem with BNs.
Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the Medical Domain
Nikolay published some of his work on explaining BNs in the ExpliMed workshop (paper) at ECAI. He essentially took some ideas which had been developed but not published by Jaime Sevilla when Jaime was a PhD student at Aberdeen, and did a substantial amount of work to further develop and evaluate these ideas.
The goal of Jaime and Nikolay’s explanation algorithm is to explain how an output variable is influenced by input variables. In the above example, this means explaining how the network’s prediction of LungCancer is influenced by Smoking, Pollution, ChestPain, and WeightLoss. This is pretty trivial in the above example, but it’s not trivial in more complex networks which contain dozens or even hundreds of nodes, most of which are intermediate nodes (neither input nor output). Essentially the algorithm applies a factor-based analysis to the network to assess the influence of a specific input X on the target output, looking at how evidence propagates across the network from input to output.
The algorithm was evaluated by asking human judges to rate explanations from the factor-based algorithm and two baselines; judges saw the textual explanations from the algorithms as well as a visualisation of the BN. There was a strong preference for the explanations from the factor-based algorithm, and also a suggestion that the explanation might be more effective if it was integrated into the visualisation.
Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis
The last of the three papers was presented at COLING (ACL Anthology), and explores whether LLMs can be used to help create BNs. In rough terms, there are three steps to creating a BN:
- Choose the nodes in the network.
- Create causal links between the nodes.
- Add information about probabilities.
Probabilities are usually inferred from data. In theory the structure of the causal graph can also be inferred from data, but in practice it’s much more common for domain experts to manually design the graphs.
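To illustrate the "probabilities from data" step: in the simplest case, estimating a conditional probability table is just counting. A minimal sketch, using invented records of (Smoking, LungCancer) observations:

```python
from collections import Counter

# Invented toy records of (smoking, lung_cancer) observations.
records = [(True, True), (True, False), (True, False), (True, True),
           (False, False), (False, False), (False, True), (False, False)]

def estimate_cpt(records):
    """Maximum-likelihood estimate of P(LungCancer=true | Smoking) from counts."""
    counts = Counter(records)
    cpt = {}
    for parent in (True, False):
        total = counts[(parent, True)] + counts[(parent, False)]
        cpt[parent] = counts[(parent, True)] / total
    return cpt

print(estimate_cpt(records))  # → {True: 0.5, False: 0.25}
```

Real systems add smoothing and handle missing data, but the principle is the same: each CPT entry is a conditional frequency in the data.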
Nikolay’s paper explores whether LLMs can be used for the second of the above steps, that is to suggest the links in the causal graph, from the names of the nodes. For example, if we tell GPT that the network includes the nodes Smoking, Pollution, LungCancer, ChestPain and WeightLoss, can GPT use its domain/world knowledge to infer that there are causal links from Smoking and Pollution to LungCancer, and from LungCancer to ChestPain and WeightLoss? Again the goal is purely to suggest links; the nodes and probabilities come from elsewhere.
Of course this approach only makes sense if the nodes have meaningful names, which is not always the case (eg, the networks in https://www.bnlearn.com/bnrepository/discrete-medium.html include nodes such as “dg25”, “temp2” and “CKND_12_30”).
Nikolay’s approach is to create multiple LLM “experts” (eg Oncologist, GP, and Nurse for the above example) and ask each expert to suggest links between nodes; he then uses majority voting to combine the results (ie, a link is included in the final result if more than half of the experts suggest it).
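The voting step itself is simple to sketch. Below is a minimal illustration in plain Python (this is not Nikolay’s actual code, and the expert suggestions are invented): each expert contributes a set of directed edges, and an edge survives only if more than half of the experts propose it.

```python
from collections import Counter

def majority_vote(expert_edge_sets):
    """Keep an edge iff more than half of the experts suggested it."""
    counts = Counter(edge for edges in expert_edge_sets for edge in set(edges))
    threshold = len(expert_edge_sets) / 2
    return {edge for edge, n in counts.items() if n > threshold}

# Invented suggestions from three LLM "experts":
oncologist = {("Smoking", "LungCancer"), ("Pollution", "LungCancer"),
              ("LungCancer", "ChestPain")}
gp = {("Smoking", "LungCancer"), ("LungCancer", "ChestPain"),
      ("LungCancer", "WeightLoss")}
nurse = {("Smoking", "LungCancer"), ("Pollution", "LungCancer"),
         ("LungCancer", "WeightLoss"), ("ChestPain", "WeightLoss")}

print(sorted(majority_vote([oncologist, gp, nurse])))
```

Here the spurious ChestPain→WeightLoss edge is filtered out because only one of the three experts suggested it, while the four edges of the true structure each get at least two votes.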
Evaluation is done by taking a real BN (with meaningful names), extracting the nodes, using the above technique to generate links, and comparing the generated links to the links in the original BN. A data contamination check is done first to exclude BNs which may have been memorised by the LLMs. Comparison is based on F-score (precision/recall of generated links vs original links) and Structural Hamming Distance (SHD) (normalised by edge count) between original and generated graphs.
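The two comparison metrics can be sketched as follows. This is a simple variant in which SHD counts edge additions and deletions, with a reversed edge counted as a single error; the paper’s exact implementation may differ in details:

```python
def f_score(pred, gold):
    """F1 of predicted directed edges vs the reference edges."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def shd(pred, gold):
    """Structural Hamming Distance: additions + deletions,
    with a reversed edge counted as one error rather than two."""
    extra = pred - gold      # edges to delete from pred
    missing = gold - pred    # edges to add to pred
    reversed_edges = {(a, b) for (a, b) in extra if (b, a) in missing}
    return len(extra) + len(missing) - len(reversed_edges)

# Invented example: the generated graph gets one edge backwards
# and misses one edge entirely.
gold = {("Smoking", "LungCancer"), ("Pollution", "LungCancer"),
        ("LungCancer", "ChestPain"), ("LungCancer", "WeightLoss")}
pred = {("Smoking", "LungCancer"), ("LungCancer", "Pollution"),
        ("LungCancer", "ChestPain")}

print(round(f_score(pred, gold), 3))  # → 0.571
print(shd(pred, gold) / len(gold))    # normalised SHD → 0.5
```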
For GPT-4 (the best-performing LLM in this context), F-score averages around 0.6, and SHD/edgeCount averages around 0.9. So there are some similarities between the LLM-generated links and the human-authored links, but also many differences. Performance is also lower for larger BNs.
Final thoughts
I don’t know what the future of BNs is, but I repeatedly hear doctors and other domain experts insist on explainable and configurable AI models that work well with uncertainty, and BNs are one way of achieving this. So it is important to improve BNs (as Nikolay is doing) to make them easier to build and explain. More generally, I think an academic monoculture (where everyone focuses on the latest trendy tech) is not ideal; we will achieve more scientifically if academic researchers are “scouts and explorers” who investigate a wide variety of different ideas and approaches.
References
N Babakov, E Reiter, A Bugarín-Diz (2025). Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis. Proc of COLING-2025. (ACL Anthology)
N Babakov, A Sivaprasad, E Reiter, A Bugarín-Diz (2025). Reusability of Bayesian Networks case studies: a survey. Applied Intelligence. (DOI)
J Sevilla, N Babakov, E Reiter, A Bugarín-Diz (2024). Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain. Proc of EXPLIMED workshop. (PDF)