This document provides concrete technical specifications for integrating MatGL and CHGNet into Ouro's infrastructure. It covers compute allocation, storage, API contracts, data pipelines, and deployment considerations.
MatGL per-request requirements (single structure):
CPU: 1 vCPU, ~100MB RAM for structures up to 100 atoms
Typical latency:
CPU: 200-800ms per structure
GPU (CUDA): 20-100ms per structure
Batch of 100 structures: 5-15s on GPU
Recommended production setup:
2x GPU nodes (NVIDIA A40 or equivalent) for inference load balancing
CPU fallback for small batch jobs
Single GPU handles ~1000 inferences/minute (100-atom structures)
CHGNet per-request requirements:
Single-point evaluation: 1 vCPU, 200MB RAM (10-50ms on CPU, 1-5ms on GPU)
Full relaxation (50-100 steps): Requires GPU
GPU memory: 8-16GB for structures up to 100 atoms
Time: 2-10 minutes depending on structure size and convergence
A CPU would take hours per structure; a GPU is mandatory for realistic relaxation workflows
Recommended production setup:
Dedicated GPU cluster for structure relaxation (at least 4x GPU nodes)
Queue-based job dispatch (async task queue)
Single GPU handles ~10-20 relaxations/hour (100-atom structures at 2-10 minutes each)
For a typical materials discovery workload (1000 candidate structures):
MatGL screening: ~100 structures/second on 2 GPUs → 10 seconds total
CHGNet relaxation (assuming 10% survive screening): 100 structures × 5 min average ≈ 8 GPU-hours, or ~2 hours wall time on the 4-GPU cluster
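The campaign sizing above can be reproduced with a back-of-envelope helper. The defaults mirror the numbers in this section and are planning assumptions, not measured throughput:

```python
def discovery_wall_time(n_candidates: int,
                        screen_rate_per_s: float = 100.0,
                        survival_fraction: float = 0.10,
                        relax_minutes: float = 5.0,
                        n_relax_gpus: int = 4) -> dict:
    """Estimate wall time for a screen-then-relax campaign.

    screen_rate_per_s: MatGL throughput across the screening GPUs.
    relax_minutes:     average CHGNet relaxation time on one GPU.
    """
    screen_s = n_candidates / screen_rate_per_s
    survivors = int(n_candidates * survival_fraction)
    relax_gpu_hours = survivors * relax_minutes / 60.0
    relax_wall_hours = relax_gpu_hours / n_relax_gpus
    return {
        "screening_seconds": screen_s,
        "survivors": survivors,
        "relax_gpu_hours": relax_gpu_hours,
        "relax_wall_hours": relax_wall_hours,
    }

# 1000 candidates -> 10 s of screening, 100 survivors,
# ~8.3 GPU-hours of relaxation, ~2.1 h wall time on 4 GPUs
```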
MatGL: ~200MB per model (universal model + variants for different properties)
CHGNet: ~150MB per model
Total: ~350MB for both Phase 1 models + variants
Storage location: Object storage (S3 or equivalent)
Store with versioning enabled
Local caching on GPU nodes (NVMe recommended for fast loading)
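The local-cache-with-object-storage-fallback scheme could look like the following sketch. The `fetch` callback, the file naming, and the `.pt` extension are illustrative; in production `fetch` would wrap the S3 client:

```python
import os

def cached_model_path(name: str, version: str, fetch, cache_dir: str) -> str:
    """Return a local path to versioned model weights, pulling from
    object storage only on a cache miss.

    fetch(name, version) -> bytes is injected so this sketch stays
    independent of any particular storage client.
    """
    path = os.path.join(cache_dir, f"{name}-{version}.pt")
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as fh:  # write-then-rename for atomicity
            fh.write(fetch(name, version))
        os.replace(tmp, path)
    return path
```

Because weights are keyed by name and version, rolling out a new model version never overwrites a cached file that running workers may still be reading.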
Per-structure metadata:
Structure file (POSCAR/CIF): 1-5 KB
Input JSON (coordinates + metadata): 2-10 KB
MatGL predictions (properties + confidence): 2-5 KB
CHGNet predictions (forces + energy): 5-20 KB (proportional to atom count)
Total per structure: ~30-50 KB
Projected storage:
10,000 materials processed: ~500 MB (negligible)
1 million materials: ~50 GB (manageable in standard database)
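A minimal sketch of the storage projection, using the upper per-artifact estimates above. The 1.25x overhead factor (indexing and database metadata) is an assumption chosen so the totals line up with the projections:

```python
ARTIFACT_KB = {  # upper-bound size estimates per structure
    "structure_file": 5,       # POSCAR/CIF
    "input_json": 10,          # coordinates + metadata
    "matgl_predictions": 5,    # properties + confidence
    "chgnet_predictions": 20,  # forces + energy, grows with atom count
}

def projected_storage_gb(n_structures: int, overhead_factor: float = 1.25) -> float:
    """Projected result-store size in GB (decimal units),
    including indexing/metadata overhead (assumed factor)."""
    kb_per_structure = sum(ARTIFACT_KB.values()) * overhead_factor  # 50 KB
    return n_structures * kb_per_structure / 1e6

# 10_000 structures -> 0.5 GB; 1_000_000 structures -> 50 GB
```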
Archival strategy:
Keep recent (last 30 days) in hot storage
Archive older results to cold storage
Maintain audit trail for reproducibility
Endpoint: POST /api/v1/materials/matgl/predict
Request schema with structures, optional properties filter, and model versioning.
Rate limiting:
100 requests/minute per API key
Batch size limit: 1000 structures/request
Timeout: 60 seconds per request
Response includes per-structure predictions with status codes and computation metadata.
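For illustration, a request/response pair for this endpoint might look as follows. The field names are assumptions sketched from the description above, not the finalized schema:

```python
# Hypothetical payloads for POST /api/v1/materials/matgl/predict.
request = {
    "structures": [
        {"id": "cand-001", "format": "cif", "data": "..."},  # structure text elided
    ],
    "properties": ["formation_energy", "band_gap"],  # optional filter
    "model_version": "matgl-universal-v1",
}

response = {
    "results": [
        {
            "id": "cand-001",
            "status": "ok",  # per-structure status code
            "predictions": {"formation_energy": -1.82, "band_gap": 1.10},
        },
    ],
    "meta": {"model_version": "matgl-universal-v1", "compute_ms": 43},
}
```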
Endpoint: POST /api/v1/materials/chgnet/evaluate
Returns energy, forces, and stress tensor for submitted structures. Designed for rapid evaluation of already-relaxed geometries.
Endpoint: POST /api/v1/materials/chgnet/relax (returns job ID)
Asynchronous relaxation workflow with configurable convergence criteria, ensemble selection, and optional webhook callbacks. Users poll job endpoint or wait for webhook notification.
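On the client side, the polling half of this workflow reduces to a loop over the job endpoint. This sketch injects the status lookup as a callback so no particular HTTP client or URL is assumed; the job-state names are illustrative:

```python
import time

def wait_for_relaxation(job_id: str, get_status, poll_s: float = 5.0,
                        timeout_s: float = 3600.0) -> dict:
    """Poll the relax job endpoint until it finishes or times out.

    get_status(job_id) -> dict with at least a "state" key
    ("queued" | "running" | "done" | "failed").
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job["state"] in ("done", "failed"):
            return job
        time.sleep(poll_s)
    raise TimeoutError(f"relaxation job {job_id} did not finish in {timeout_s}s")
```

Webhook delivery makes this loop unnecessary, but a polling fallback is still useful when the caller cannot expose an endpoint.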
User Input
    ↓
[Validation Layer]
    - Check structure format
    - Validate atomic species
    - Detect size (routing to CPU/GPU)
    ↓
[Queuing]
    - Route to MatGL or CHGNet queue
    - Async job dispatch
    - Rate limit check
    ↓
[Compute Layer]
    - GPU/CPU inference
    - Result aggregation
    ↓
[Storage Layer]
    - Save results to database
    - Cache frequently accessed predictions
    ↓
[Response/Webhook]
    - Synchronous: return results directly
    - Asynchronous: send webhook or poll endpoint
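The validation and routing step at the top of the pipeline can be sketched as follows. The element whitelist is truncated, and the 100-atom GPU threshold echoes the sizing notes above; both are assumptions:

```python
# Truncated whitelist for the sketch; production would cover the periodic table.
KNOWN_ELEMENTS = {"H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                  "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca",
                  "Fe", "Ni", "Cu", "Zn"}

def validate_and_route(structure: dict, gpu_atom_threshold: int = 100) -> str:
    """Reject malformed input, then pick a compute target by size."""
    if structure.get("format") not in ("poscar", "cif", "json"):
        raise ValueError(f"unsupported format: {structure.get('format')!r}")
    species = structure.get("species", [])
    unknown = set(species) - KNOWN_ELEMENTS
    if not species or unknown:
        raise ValueError(f"invalid atomic species: {sorted(unknown) or 'none given'}")
    # Small cells stay on the CPU pool; large ones go to the GPU queue.
    return "gpu" if len(species) > gpu_atom_threshold else "cpu"
```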
MatGL benchmarks (against test set from Materials Project):
Formation energy MAE: <0.1 eV/atom (target)
Band gap MAE: <0.3 eV (typical)
Elastic modulus error: <10% (for available data)
CHGNet benchmarks (against AIMD trajectories):
Force MAE: <0.1 eV/Å (oxide systems)
Energy MAE: <10 meV/atom
Stress tensor RMSE: <1 GPa
Sanity checks on predictions:
Formation energies within reasonable range (-10 to +5 eV/atom)
Forces don't violate energy conservation
Band gaps non-negative (for insulators)
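These checks are cheap enough to run on every prediction. In the sketch below the field names are illustrative, and a net-force (translational-invariance) test stands in for the energy-conservation check, which would require trajectory data:

```python
def sanity_check_prediction(pred: dict) -> list:
    """Return the list of violated sanity rules (empty list = passes)."""
    problems = []
    ef = pred.get("formation_energy_ev_per_atom")
    if ef is not None and not (-10.0 <= ef <= 5.0):
        problems.append("formation energy outside [-10, +5] eV/atom")
    gap = pred.get("band_gap_ev")
    if gap is not None and gap < 0.0:
        problems.append("negative band gap")
    forces = pred.get("forces")  # list of (fx, fy, fz) per atom
    if forces:
        net = [sum(f[i] for f in forces) for i in range(3)]
        if any(abs(c) > 1e-2 for c in net):  # net force on the cell should vanish
            problems.append("net force on cell is non-zero")
    return problems
```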
Cross-validation with small DFT benchmark set:
Run 50 representative structures through both MatGL and DFT
Track prediction errors over time
Alert if accuracy degrades
User feedback loop:
Allow users to report when predictions disagree with experiments/DFT
Track feedback metrics for model retraining decisions (Phase 2)
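Accuracy-degradation alerts for the cross-validation loop can be driven by a rolling MAE monitor like the following. The window size is an assumption; the 0.1 eV/atom default echoes the formation-energy target above:

```python
from collections import deque

class DriftMonitor:
    """Track rolling MAE against the DFT benchmark set and flag degradation."""

    def __init__(self, threshold: float = 0.1, window: int = 50):
        self.threshold = threshold
        self.errors = deque(maxlen=window)  # oldest errors drop off automatically

    def record(self, predicted: float, reference: float) -> None:
        self.errors.append(abs(predicted - reference))

    @property
    def mae(self) -> float:
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def degraded(self) -> bool:
        # Only alert once the window is full, to avoid noisy early triggers.
        return len(self.errors) == self.errors.maxlen and self.mae > self.threshold
```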
Infrastructure:
Provision GPU nodes (2 for MatGL, 4 for CHGNet)
Set up object storage for model weights
Configure async task queue (Celery/RQ/Airflow)
Set up monitoring and alerting
API:
Implement endpoint handlers
Add rate limiting and authentication
Deploy API server with load balancing
Version endpoint and models
Data:
Create database schema for results
Set up caching layer (Redis)
Configure backups
Plan data retention policy
Testing:
Unit tests for validation logic
Integration tests for full pipeline
Load testing (simulate peak usage)
Benchmark accuracy against test sets
Documentation:
API reference (OpenAPI/Swagger)
User guides for each endpoint
Model limitations and accuracy expectations
Troubleshooting guide
Monitoring:
Latency tracking per endpoint
GPU utilization graphs
Queue depth and wait times
Prediction accuracy tracking
Cost per prediction (for billing)
Hardware (one-time, approximate):
2x NVIDIA A40 GPUs for MatGL: $20k
4x NVIDIA A40 GPUs for CHGNet: $40k
Storage/compute infrastructure: $30k
Total: ~$90k + operational costs
Per-prediction costs (estimated):
MatGL: ~0.001-0.005 USD per prediction (GPU time)
CHGNet single-point: ~0.01-0.05 USD
CHGNet relaxation (100-atom structure): ~2-5 USD
Monthly operational costs (scaling):
10k MatGL + 100 CHGNet relaxations: $500-1000
100k MatGL + 1000 CHGNet relaxations: $5k-10k
1M MatGL + 10k CHGNet relaxations: $50k-100k
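The per-prediction figures above bound only the variable compute cost; fixed operational costs (storage, idle GPU time, staffing) sit on top, which presumably accounts for the higher published monthly ranges. A sketch:

```python
def monthly_cost_range_usd(matgl_preds: int, chgnet_relaxations: int) -> tuple:
    """Bound the variable monthly compute bill using the per-prediction
    ranges above (MatGL $0.001-0.005, CHGNet relaxation $2-5).
    Fixed operational costs are excluded."""
    low = matgl_preds * 0.001 + chgnet_relaxations * 2.0
    high = matgl_preds * 0.005 + chgnet_relaxations * 5.0
    return low, high
```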
API Authentication:
OAuth 2.0 or API key-based
Rate limiting per user/organization
Audit logging for all predictions
Data Privacy:
User structures stored securely
Option to delete results after retrieval
No model retraining on proprietary user data without consent
Compliance:
Ensure DFT training data licensing is properly documented
Published metrics for model uncertainty/accuracy
Disclosure of model limitations (especially CHGNet oxide bias)