I have started exploring another large dataset of thermoelectric materials data. Like the last one, it was mined from a large collection of literature using LLMs. First impressions are that this is a higher-quality set: better documented and easier to work with.
There was still some corruption in the set, visible in the naïve histogram of ZT. Ridiculously large values stretch the axis and force the majority of the data into a single bin.
I addressed this by simply filtering the entire set down to entries with sensible ZT values, which gives a histogram shape similar to the one seen in the first dataset explored: high frequency up to ZT ~ 1, then a drop-off out to ZT < 3.
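As a rough illustration of that filtering step, here is a minimal sketch in plain Python. The cutoff `ZT_MAX = 3.0`, the `"ZT"` key and the toy rows are all assumptions made for the example; the actual bounds and column names used in the analysis are not stated here.

```python
import math

# Hypothetical cutoff, chosen only for illustration.
ZT_MAX = 3.0

def filter_sensible_zt(rows, zt_max=ZT_MAX):
    """Keep rows whose ZT is a finite number in (0, zt_max]."""
    kept = []
    for row in rows:
        zt = row.get("ZT")
        if isinstance(zt, (int, float)) and math.isfinite(zt) and 0 < zt <= zt_max:
            kept.append(row)
    return kept

def histogram(values, bin_width=0.5, upper=ZT_MAX):
    """Count values in equal-width bins covering 0 to upper."""
    n_bins = math.ceil(upper / bin_width)
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value exactly equal to `upper` lands in the last bin.
        idx = min(int(v / bin_width), n_bins - 1)
        counts[idx] += 1
    return counts

# Toy rows (made up, not taken from the dataset): one absurd value, one missing.
rows = [{"ZT": 0.3}, {"ZT": 0.9}, {"ZT": 1.4}, {"ZT": 2.6}, {"ZT": 5.0e4}, {"ZT": None}]
clean = filter_sensible_zt(rows)
counts = histogram([r["ZT"] for r in clean])
```

Without the filter, a single value like 5.0e4 would set the histogram range and push every realistic entry into the first bin, which is the effect described above.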
Next is a plot breaking down ZT by power factor vs. thermal resistivity. The Pareto front is highlighted and ZT is indicated by marker size.
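For anyone unfamiliar with Pareto fronts, the idea can be sketched as follows. This assumes both power factor and thermal resistivity are to be maximised (higher is better for each, since ZT grows with both), and it uses a simple pairwise comparison rather than whatever routine produced the plot. The function name and the toy points are made up for the example.

```python
def pareto_front(points):
    """Return indices of non-dominated points when maximising both coordinates.

    A point is dominated if some other point is at least as good in both
    coordinates and strictly better in at least one of them.
    """
    front = []
    for i, (pf_i, tr_i) in enumerate(points):
        dominated = any(
            pf_j >= pf_i and tr_j >= tr_i and (pf_j > pf_i or tr_j > tr_i)
            for j, (pf_j, tr_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Toy (power factor, thermal resistivity) pairs in arbitrary units.
points = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (1.5, 3.0), (2.0, 2.0)]
front = pareto_front(points)
```

Here the last two points are both beaten by (2.0, 4.0) on thermal resistivity without gaining any power factor, so they fall off the front. The pairwise check is quadratic in the number of points, which is fine for a sketch but may be slow on a very large set.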
A strange feature of this plot is that the Pareto front seems disconnected from the bulk of the data. Let's take a closer look at these Pareto front materials anyway, before re-analysing the front of the bulk dataset. This initial front is made up of 6 materials (it looks like 7, but one is just very close). 3 of them actually have quite low ZT (<0.5). The remaining 3 are Ge0.89Cd0.03Sb0.08Te, SnSe thin film and Na0.005Ag0.015Sn0.98Se, with ZTs of 1.47, 1.2 and 0.81 respectively. Some of these rows look suspect - for example, Na0.005Ag0.015Sn0.98Se has a ZT of ~70 if you calculate it from the reported values of temperature, PF and TR, but a reported ZT of 0.81. In other words, there's a discrepancy between the mined thermoelectric parameters and the mined ZT value.
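The consistency check described here follows from the definition ZT = PF × T / κ, where thermal resistivity is TR = 1/κ, so ZT = PF × TR × T. A minimal sketch, assuming SI units (PF in W m⁻¹ K⁻², TR in m K W⁻¹, T in K); the dataset's actual units and column names may differ, and the input values below are toy numbers rather than a real row.

```python
def zt_from_components(power_factor, thermal_resistivity, temperature):
    """Recompute ZT from its components: ZT = PF * TR * T.

    Assumes SI units: PF in W/(m K^2), TR in m K/W, T in K.
    """
    return power_factor * thermal_resistivity * temperature

# Toy inputs, not taken from the dataset.
zt_calculated = zt_from_components(
    power_factor=2.0e-3, thermal_resistivity=1.0, temperature=600.0
)
```

If the recomputed value lands far from the reported one, as in the ~70 vs. 0.81 case, at least one of the mined fields (or its unit) is likely wrong.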
This is where the higher quality of this dataset becomes apparent - the authors have already done this work and include a "relative error" column, which is the difference between the reported ZT and the ZT calculated from the other reported values (e.g. PF, TR etc). This highlights an issue that cropped up in the other literature-mined database we examined: LLMs are not a foolproof method of data mining. The difference is that this set has a nice way to handle these errors. When we filter the dataset to include only materials with relative errors ≤0.5 (calculated and reported ZT agree within 50%), the Pareto front expands from 6 to 18 materials, there's no longer a weird disconnect between the front and the dominated materials, and we're getting ZTs ~ 2.
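A sketch of that filter is below. The dataset ships its own relative error column, so in practice you would filter on that directly. This version recomputes the error as |calculated − reported| / |reported|, which is an assumption about the definition, and the field names and rows are invented for illustration.

```python
def relative_error(zt_reported, zt_calculated):
    """|calculated - reported| / |reported| (assumed definition)."""
    if zt_reported == 0:
        # Avoid dividing by zero; treat the row as maximally inconsistent.
        return float("inf")
    return abs(zt_calculated - zt_reported) / abs(zt_reported)

def filter_consistent(rows, max_rel_error=0.5):
    """Keep rows whose reported and recalculated ZT agree within max_rel_error."""
    return [
        row for row in rows
        if relative_error(row["zt_reported"], row["zt_calculated"]) <= max_rel_error
    ]

# Toy rows with hypothetical field names.
rows = [
    {"name": "A", "zt_reported": 1.0, "zt_calculated": 1.2},   # 20% off, kept
    {"name": "B", "zt_reported": 0.8, "zt_calculated": 70.0},  # wildly off, dropped
    {"name": "C", "zt_reported": 2.0, "zt_calculated": 1.0},   # exactly 50% off, kept
]
kept_names = [row["name"] for row in filter_consistent(rows)]
```

Dropping the inconsistent rows before computing the front is what removes outliers like row B, which would otherwise sit far from the bulk of the data.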
We'll explore the actual materials on this front in more detail in Pt. 2.