Our first full training run extended the work done in our previous post, which explored a simple idea: train a model to convert crystallography data from CIF to JSON, then judge how well that JSON could reconstruct the original CIF. The policy model, a 3B language model with LoRA adapters, performs the forward conversion (CIF → JSON). A separate, frozen judge model evaluates how likely the exact CIF is to be recovered from that JSON by computing a reverse-probability score token by token, without actually generating the CIF. That score is the reward signal for training the policy. The setup has three parts: the policy (the converter), the judge (which scores round trips), and a reference model for regularization. Training runs on Modal with three GPUs, using vLLM to serve the judge and a careful memory plan. The goal is to create a reliable, reversible representation and to extend the approach to descriptions that generate CIF files.
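To make that reward concrete, here is a minimal sketch of the reverse-probability scoring, assuming a Hugging Face causal-LM judge. The checkpoint name, prompt wording, and length normalization below are illustrative assumptions, and the actual run serves the judge through vLLM rather than plain transformers:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical judge checkpoint; the post does not name the judge model.
JUDGE_NAME = "Qwen/Qwen2.5-3B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(JUDGE_NAME)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_NAME, torch_dtype=torch.bfloat16).to(device)
judge.eval()  # the judge stays frozen; only the policy (not shown) is trained


@torch.no_grad()
def reverse_logprob_reward(json_text: str, original_cif: str) -> float:
    """Score how recoverable the original CIF is from the policy's JSON.

    The judge never generates anything: we teacher-force the original CIF
    after a prompt containing the JSON and read off its per-token log-probs.
    The prompt wording here is illustrative, not the one used in the run.
    """
    prompt = f"Convert this JSON back to the original CIF file.\n\nJSON:\n{json_text}\n\nCIF:\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    cif_ids = tokenizer(original_cif, return_tensors="pt", add_special_tokens=False).input_ids.to(device)

    input_ids = torch.cat([prompt_ids, cif_ids], dim=1)
    logits = judge(input_ids).logits

    # Logits at position i predict token i + 1, so slice the positions that
    # predict each CIF token.
    cif_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    logprobs = F.log_softmax(cif_logits.float(), dim=-1)
    token_logprobs = logprobs.gather(-1, cif_ids.unsqueeze(-1)).squeeze(-1)

    # Length-normalized log-probability; the real reward shaping may differ.
    return token_logprobs.mean().item()
```

The key point is that the judge never samples a CIF: it only scores the ground-truth CIF under teacher forcing, so the reward is cheap relative to a full round-trip generation.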
That run fell short of our expectations. The goal was to flip the test training from the first experiment on its head: instead of going from CIF input to JSON output, we wanted Qwen 2.5 to take a semantic description of a crystal structure as input and return a valid CIF.
Training progress looked good from the perspective of the logged metrics (note that completion length was capped at 756 tokens; this cap will be raised significantly for the next run):
but we should have been closely monitoring the raw outputs from the policy model.
Somewhere between steps 70 and 100, the policy model learned that repeating tokens was often 'good enough' to earn a sizable reward from our judge. Trained policy outputs would closely match valid CIF tokens for the first few dozen tokens before degrading into pure repetition:
As we have come to learn, this is a common degradation mode in LLM RL post-training, and one that can be both monitored more effectively and defended against with more aggressive KL divergence penalties.
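One cheap check we could have run on sampled completions at every logging step is an n-gram repetition rate. The sketch below is illustrative only; the window size and threshold are arbitrary choices, not values from our pipeline:

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 8) -> float:
    """Fraction of whitespace-token n-grams that appear more than once.

    Healthy completions sit near 0; the token-loop collapse above pushes
    this toward 1. Both n and any alert threshold are arbitrary choices.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def flag_degenerate(completions: list[str], threshold: float = 0.5) -> list[bool]:
    """Flag completions in a batch that look like repetition collapse."""
    return [repeated_ngram_fraction(c) > threshold for c in completions]
```

Tracking the batch-level average of this fraction alongside the reward would make this kind of collapse obvious at a glance.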
The next run will employ both a higher KL divergence penalty and extra monitoring infrastructure that lets us steadily track the policy model's raw outputs as training progresses.
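In rough shape, those two changes look like the sketch below: a larger coefficient on the KL penalty toward the frozen reference model, plus dumping raw completions to disk every logging step. This assumes the trainer exposes per-token log-probs for each sampled completion; the beta value and JSONL logging are placeholders rather than our actual configuration:

```python
import json
import time
from typing import Sequence


def kl_penalized_reward(
    task_reward: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.2,
) -> float:
    """Judge reward minus a KL penalty toward the frozen reference model.

    policy_logprobs / ref_logprobs are per-token log-probs of the sampled
    completion under the current policy and the reference model; the summed
    difference is the usual single-sample KL estimate. beta = 0.2 is a
    placeholder for the larger coefficient the next run will use.
    """
    approx_kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * approx_kl


def log_raw_completions(
    step: int,
    prompts: Sequence[str],
    completions: Sequence[str],
    path: str = "raw_completions.jsonl",
) -> None:
    """Append the current batch of raw policy outputs so collapse is visible early."""
    with open(path, "a") as f:
        for prompt, completion in zip(prompts, completions):
            record = {
                "step": step,
                "time": time.time(),
                "prompt": prompt,
                "completion": completion,
            }
            f.write(json.dumps(record) + "\n")
```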
More to come.
