In this post we'll be exploring the dataset posted by Itani et al. last month. This is a bunch of thermoelectric data mined by LLMs from journal publications. It contains things like material names, crystal structures, properties such as electrical and thermal conductivity, figure of merit and plenty more. Not every material has the full feature set, and as we'll see, it's all a bit messy.
I've got a lot to say about this dataset and have been working with it for a while, but the write-up keeps dragging on. I want to get a post up, so I'm just going to share some short notes here.
First, a quick reminder of the fundamental equation of thermoelectrics - the figure of merit $ZT = \frac{S^2 \sigma T}{\kappa}$. This characterises the performance of thermoelectric materials. Good thermoelectric materials have ZT ~ 1, great ones have ZT > 1.5. So a reasonable expectation is for our data to lie in the range 0 < ZT < 2.
Where:
- $ZT$ = dimensionless thermoelectric figure of merit
- $S$ = Seebeck coefficient
- $\sigma$ = electrical conductivity
- $T$ = absolute temperature
- $\kappa$ = thermal conductivity
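As a quick sanity check on the formula, here is a minimal Python sketch that evaluates ZT from the four quantities above. The function name and the input values are my own illustrative assumptions (round numbers in SI units), not values taken from the dataset.

```python
def figure_of_merit(seebeck, sigma, temperature, kappa):
    """Return ZT = S^2 * sigma * T / kappa.

    seebeck      Seebeck coefficient S in V/K
    sigma        electrical conductivity in S/m
    temperature  absolute temperature T in K
    kappa        thermal conductivity in W/(m*K)
    """
    if kappa <= 0:
        raise ValueError("thermal conductivity must be positive")
    return seebeck ** 2 * sigma * temperature / kappa


# Illustrative round numbers only (assumed, not from the dataset):
# S = 200 uV/K, sigma = 1e5 S/m, T = 300 K, kappa = 1.5 W/(m*K)
zt_example = figure_of_merit(200e-6, 1.0e5, 300.0, 1.5)
print(round(zt_example, 3))  # roughly 0.8
```

Note how the Seebeck coefficient enters squared, so a modest gain in S matters more than the same relative gain in conductivity.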
Let's look at the distribution of ZT for our dataset:
ZT distribution from the Itani et al. dataset.
Turns out our dataset is corrupted by some wildly non-physical ZT values ranging into the hundreds. Let's crop to a more realistic ZT < 5.
ZT values from Itani et al. cropped to ZT < 5.
Nice - this looks better. With a median of 0.72 and a sharp drop-off above ZT = 1, this has the kind of shape that makes sense. It's worth mentioning here that the dataset is a mix of experimental and theoretical results, so the excellent materials with ZT > 2 may be either further data corruption or physically unproven modelling results.
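For anyone wanting to repeat the cropping step, here is a rough sketch using only the Python standard library. The record layout and the `"zt"` key are assumptions for illustration (the real dataset's column names may differ), and the sample rows are made up.

```python
from statistics import median


def crop_zt(records, upper=5.0, key="zt"):
    """Keep records whose ZT is numeric and below `upper`.

    Missing or non-numeric values are dropped rather than guessed at.
    """
    kept = []
    for rec in records:
        value = rec.get(key)
        if isinstance(value, (int, float)) and value < upper:
            kept.append(rec)
    return kept


# Made-up sample rows standing in for the real dataset (hypothetical keys).
sample = [
    {"material": "A", "zt": 0.4},
    {"material": "B", "zt": 0.9},
    {"material": "C", "zt": 1.3},
    {"material": "D", "zt": 250.0},  # non-physical outlier
    {"material": "E", "zt": None},   # value missing
]

cropped = crop_zt(sample)
cropped_median = median(rec["zt"] for rec in cropped)
print(len(cropped), cropped_median)
```

Cropping at a fixed threshold is a blunt tool: it removes the obvious junk but says nothing about whether the values that survive were extracted correctly.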
Let's look at the source literature for some of the materials in the range 2 < ZT < 4.5 for validation. There are 211 materials in this range. We have to get down to material #35 for the first experimental result, which is SnSe with ZT = 3.1 at 783 K (510 °C / 950 °F). The dataset contains the DOIs of all the materials, so we can track down the original paper. Tracking it down, though, reveals another issue with using LLMs to mine text from papers: this ZT didn't come from the actual paper at the DOI given in the dataset, but from a different paper cited by the authors.
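The selection described above could be sketched like this. The keys `"zt"`, `"doi"` and `"source_type"` are hypothetical (I'm assuming some experimental/theoretical label is available or added by hand), sorting by descending ZT is just one plausible ordering, and the DOIs in the sample are placeholders, not real papers.

```python
def high_zt_candidates(records, low=2.0, high=4.5):
    """Return records with low < ZT < high, highest ZT first."""
    selected = [
        rec for rec in records
        if isinstance(rec.get("zt"), (int, float)) and low < rec["zt"] < high
    ]
    return sorted(selected, key=lambda rec: rec["zt"], reverse=True)


def first_experimental(ranked):
    """Return (1-based position, record) of the first experimental entry, or None."""
    for position, rec in enumerate(ranked, start=1):
        if rec.get("source_type") == "experimental":
            return position, rec
    return None


def doi_url(doi):
    """Build a resolver link from a bare DOI string."""
    return "https://doi.org/" + doi.strip()


# Made-up rows with placeholder DOIs (hypothetical keys and values).
sample = [
    {"material": "X1", "zt": 4.2, "source_type": "theoretical", "doi": "10.0000/example-1"},
    {"material": "X2", "zt": 3.1, "source_type": "experimental", "doi": "10.0000/example-2"},
    {"material": "X3", "zt": 2.5, "source_type": "theoretical", "doi": "10.0000/example-3"},
    {"material": "X4", "zt": 1.0, "source_type": "experimental", "doi": "10.0000/example-4"},
    {"material": "X5", "zt": 120.0, "source_type": "theoretical", "doi": "10.0000/example-5"},
]

ranked = high_zt_candidates(sample)
hit = first_experimental(ranked)
if hit is not None:
    position, record = hit
    print(position, record["material"], doi_url(record["doi"]))
```

Even with the link in hand, the paper still has to be read by a human: as the SnSe case shows, the DOI can point to a paper that merely quotes the value.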
We can find that paper as well, though, at https://doi.org/10.1038/s41563-021-01064-6. Zhou et al. 2021 has been cited 564 times in the past 4 years, which suggests this was a breakthrough work. It seems they pioneered a purification technique which removed thermally conductive tin oxides from the SnSe, bringing the thermal conductivity down to ultra-low levels and boosting the ZT.
Let's continue the exploration by looking at some of the high-performing theoretical results - these could be guiding lights showing us promising avenues for future research. FeCrSb is one that appeals to me, made of 3 commonly available, non-toxic elements. The paper can be found at https://doi.org/10.1016/j.cap.2023.02.013. These authors modelled XCrSb (X = Fe, Ru, Os) using a DFT-based approach. Here, the LLM dataminer did get it right, and the authors predict a ZT of 3.27 at 300 K. A room-temperature thermoelectric of the efficiency implied by a ZT that high sounds too good to be true, but the paper has been cited 20 times since publication in 2023, so it has drawn some attention. We will dig into the citing papers to see if anyone has confirmed or refuted these results.
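To put a rough number on "the efficiency implied by a ZT that high", one standard textbook estimate is the maximum conversion efficiency of a thermoelectric generator, $\eta_{max} = \frac{T_h - T_c}{T_h} \cdot \frac{\sqrt{1 + ZT} - 1}{\sqrt{1 + ZT} + T_c / T_h}$. The sketch below treats ZT as constant across the temperature span, which is a simplification, and the hot and cold side temperatures (350 K and 300 K) are my own assumptions for illustration, not values from the paper.

```python
from math import sqrt


def max_efficiency(zt, t_hot, t_cold):
    """Idealised maximum generator efficiency for a constant ZT.

    eta = Carnot * (sqrt(1 + ZT) - 1) / (sqrt(1 + ZT) + Tc / Th)
    Temperatures are in kelvin.
    """
    if not (t_hot > t_cold > 0):
        raise ValueError("need t_hot > t_cold > 0 (kelvin)")
    if zt < 0:
        raise ValueError("ZT must be non-negative")
    carnot = (t_hot - t_cold) / t_hot
    root = sqrt(1.0 + zt)
    return carnot * (root - 1.0) / (root + t_cold / t_hot)


# Assumed operating temperatures, chosen only for illustration.
T_HOT, T_COLD = 350.0, 300.0

eta_predicted = max_efficiency(3.27, T_HOT, T_COLD)  # the predicted FeCrSb value
eta_typical = max_efficiency(1.0, T_HOT, T_COLD)     # a "good" material for comparison
carnot_limit = (T_HOT - T_COLD) / T_HOT

print(f"ZT=3.27: {eta_predicted:.1%}, ZT=1: {eta_typical:.1%}, Carnot: {carnot_limit:.1%}")
```

Under these assumptions, the higher ZT captures a noticeably larger fraction of the Carnot limit, which is why a confirmed room-temperature ZT above 3 would be such a big deal.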
That's it for now. I hope to continue with more regular short-form posts instead of trying to get it all into one.