This document provides concrete technical specifications for integrating MatGL and CHGNet into Ouro's infrastructure. It covers compute allocation, storage, API contracts, data pipelines, and deployment considerations.
MatGL per-request requirements (single structure):
CPU: 1 vCPU, ~100MB RAM for structures up to 100 atoms
Typical latency:
CPU: 200-800ms per structure
GPU (CUDA): 20-100ms per structure
Batch of 100 structures: 5-15s on GPU
Recommended production setup:
2x GPU nodes (NVIDIA A40 or equivalent) for inference load balancing
CPU fallback for small batch jobs
Single GPU handles ~1000 inferences/minute (100-atom structures)
CHGNet per-request requirements:
Single-point evaluation: 1 vCPU, 200MB RAM (10-50ms on CPU, 1-5ms on GPU)
Full relaxation (50-100 steps): Requires GPU
GPU memory: 8-16GB for structures up to 100 atoms
Time: 2-10 minutes depending on structure size and convergence
A CPU would take hours per structure; a GPU is mandatory for realistic relaxation workflows
Recommended production setup:
Dedicated GPU cluster for structure relaxation (at least 4x GPU nodes)
Queue-based job dispatch (async task queue)
Single GPU handles ~10-20 relaxations/hour (100-atom structures at 2-10 minutes each)
For a typical materials discovery workload (1000 candidate structures):
MatGL screening: ~100 structures/second on 2 GPUs → 10 seconds total
CHGNet relaxation (assuming 10% survive screening): 100 structures × 5 min average ≈ 8 GPU-hours, or ~2 hours wall time on the 4-GPU cluster
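The campaign sizing above can be reproduced with a back-of-envelope helper. The defaults mirror the numbers in this section and are planning assumptions, not measured throughput:

```python
def discovery_wall_time(n_candidates: int,
                        screen_rate_per_s: float = 100.0,
                        survival_fraction: float = 0.10,
                        relax_minutes: float = 5.0,
                        n_relax_gpus: int = 4) -> dict:
    """Estimate wall time for a screen-then-relax campaign.

    screen_rate_per_s: MatGL throughput across the screening GPUs.
    relax_minutes:     average CHGNet relaxation time on one GPU.
    """
    screen_s = n_candidates / screen_rate_per_s
    survivors = int(n_candidates * survival_fraction)
    relax_gpu_hours = survivors * relax_minutes / 60.0
    relax_wall_hours = relax_gpu_hours / n_relax_gpus
    return {
        "screening_seconds": screen_s,
        "survivors": survivors,
        "relax_gpu_hours": relax_gpu_hours,
        "relax_wall_hours": relax_wall_hours,
    }

# 1000 candidates -> 10 s of screening, 100 survivors,
# ~8.3 GPU-hours of relaxation, ~2.1 h wall time on 4 GPUs
```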
MatGL: ~200MB per model (universal model + variants for different properties)
CHGNet: ~150MB per model
Total: ~350MB for both Phase 1 models + variants
Storage location: Object storage (S3 or equivalent)
Store with versioning enabled
Local caching on GPU nodes (NVMe recommended for fast loading)
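The local-cache-with-object-storage-fallback scheme could look like the following sketch. The `fetch` callback, the file naming, and the `.pt` extension are illustrative; in production `fetch` would wrap the S3 client:

```python
import os

def cached_model_path(name: str, version: str, fetch, cache_dir: str) -> str:
    """Return a local path to versioned model weights, pulling from
    object storage only on a cache miss.

    fetch(name, version) -> bytes is injected so this sketch stays
    independent of any particular storage client.
    """
    path = os.path.join(cache_dir, f"{name}-{version}.pt")
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as fh:  # write-then-rename for atomicity
            fh.write(fetch(name, version))
        os.replace(tmp, path)
    return path
```

Because weights are keyed by name and version, rolling out a new model version never overwrites a cached file that running workers may still be reading.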
Per-structure metadata:
Structure file (POSCAR/CIF): 1-5 KB
Input JSON (coordinates + metadata): 2-10 KB
MatGL predictions (properties + confidence): 2-5 KB
CHGNet predictions (forces + energy): 5-20 KB (proportional to atom count)
Total per structure: ~30-50 KB
Projected storage:
10,000 materials processed: ~500 MB (negligible)
1 million materials: ~50 GB (manageable in standard database)
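A minimal sketch of the storage projection, using the upper per-artifact estimates above. The 1.25x overhead factor (indexing and database metadata) is an assumption chosen so the totals line up with the projections:

```python
ARTIFACT_KB = {  # upper-bound size estimates per structure
    "structure_file": 5,       # POSCAR/CIF
    "input_json": 10,          # coordinates + metadata
    "matgl_predictions": 5,    # properties + confidence
    "chgnet_predictions": 20,  # forces + energy, grows with atom count
}

def projected_storage_gb(n_structures: int, overhead_factor: float = 1.25) -> float:
    """Projected result-store size in GB (decimal units),
    including indexing/metadata overhead (assumed factor)."""
    kb_per_structure = sum(ARTIFACT_KB.values()) * overhead_factor  # 50 KB
    return n_structures * kb_per_structure / 1e6

# 10_000 structures -> 0.5 GB; 1_000_000 structures -> 50 GB
```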
Archival strategy:
Keep recent (last 30 days) in hot storage
Archive older results to cold storage
Maintain audit trail for reproducibility
Endpoint: POST /api/v1/materials/matgl/predict
Request schema with structures, optional properties filter, and model versioning.
Rate limiting:
100 requests/minute per API key
Batch size limit: 1000 structures/request
Timeout: 60 seconds per request
Response includes per-structure predictions with status codes and computation metadata.
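For illustration, a request/response pair for this endpoint might look as follows. The field names are assumptions sketched from the description above, not the finalized schema:

```python
# Hypothetical payloads for POST /api/v1/materials/matgl/predict.
request = {
    "structures": [
        {"id": "cand-001", "format": "cif", "data": "..."},  # structure text elided
    ],
    "properties": ["formation_energy", "band_gap"],  # optional filter
    "model_version": "matgl-universal-v1",
}

response = {
    "results": [
        {
            "id": "cand-001",
            "status": "ok",  # per-structure status code
            "predictions": {"formation_energy": -1.82, "band_gap": 1.10},
        },
    ],
    "meta": {"model_version": "matgl-universal-v1", "compute_ms": 43},
}
```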
Endpoint: POST /api/v1/materials/chgnet/evaluate
Returns energy, forces, and stress tensor for submitted structures. Designed for rapid evaluation of already-relaxed geometries.
Endpoint: POST /api/v1/materials/chgnet/relax (returns job ID)
Asynchronous relaxation workflow with configurable convergence criteria, ensemble selection, and optional webhook callbacks. Users poll job endpoint or wait for webhook notification.
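On the client side, the polling half of this workflow reduces to a loop over the job endpoint. This sketch injects the status lookup as a callback so no particular HTTP client or URL is assumed; the job-state names are illustrative:

```python
import time

def wait_for_relaxation(job_id: str, get_status, poll_s: float = 5.0,
                        timeout_s: float = 3600.0) -> dict:
    """Poll the relax job endpoint until it finishes or times out.

    get_status(job_id) -> dict with at least a "state" key
    ("queued" | "running" | "done" | "failed").
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job["state"] in ("done", "failed"):
            return job
        time.sleep(poll_s)
    raise TimeoutError(f"relaxation job {job_id} did not finish in {timeout_s}s")
```

Webhook delivery makes this loop unnecessary, but a polling fallback is still useful when the caller cannot expose an endpoint.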
User Input
    ↓
[Validation Layer]
    - Check structure format
    - Validate atomic species
    - Detect size (routing to CPU/GPU)
    ↓
[Queuing]
    - Route to MatGL or CHGNet queue
    - Async job dispatch
    - Rate limit check
    ↓
[Compute Layer]
    - GPU/CPU inference
    - Result aggregation
    ↓
[Storage Layer]
    - Save results to database
    - Cache frequently accessed predictions
    ↓
[Response/Webhook]
    - Synchronous: return results directly
    - Asynchronous: send webhook or poll endpoint
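The validation and routing step at the top of the pipeline can be sketched as follows. The element whitelist is truncated, and the 100-atom GPU threshold echoes the sizing notes above; both are assumptions:

```python
# Truncated whitelist for the sketch; production would cover the periodic table.
KNOWN_ELEMENTS = {"H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
                  "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca",
                  "Fe", "Ni", "Cu", "Zn"}

def validate_and_route(structure: dict, gpu_atom_threshold: int = 100) -> str:
    """Reject malformed input, then pick a compute target by size."""
    if structure.get("format") not in ("poscar", "cif", "json"):
        raise ValueError(f"unsupported format: {structure.get('format')!r}")
    species = structure.get("species", [])
    unknown = set(species) - KNOWN_ELEMENTS
    if not species or unknown:
        raise ValueError(f"invalid atomic species: {sorted(unknown) or 'none given'}")
    # Small cells stay on the CPU pool; large ones go to the GPU queue.
    return "gpu" if len(species) > gpu_atom_threshold else "cpu"
```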
MatGL benchmarks (against test set from Materials Project):
Formation energy MAE: <0.1 eV/atom (target)
Band gap MAE: <0.3 eV (typical)
Elastic modulus error: <10% (for available data)
CHGNet benchmarks (against AIMD trajectories):
Force MAE: <0.1 eV/Å (oxide systems)
Energy MAE: <10 meV/atom
Stress tensor RMSE: <1 GPa
Sanity checks on predictions:
Formation energies within reasonable range (-10 to +5 eV/atom)
Forces don't violate energy conservation
Band gaps non-negative (for insulators)
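These checks are cheap enough to run on every prediction. In the sketch below the field names are illustrative, and a net-force (translational-invariance) test stands in for the energy-conservation check, which would require trajectory data:

```python
def sanity_check_prediction(pred: dict) -> list:
    """Return the list of violated sanity rules (empty list = passes)."""
    problems = []
    ef = pred.get("formation_energy_ev_per_atom")
    if ef is not None and not (-10.0 <= ef <= 5.0):
        problems.append("formation energy outside [-10, +5] eV/atom")
    gap = pred.get("band_gap_ev")
    if gap is not None and gap < 0.0:
        problems.append("negative band gap")
    forces = pred.get("forces")  # list of (fx, fy, fz) per atom
    if forces:
        net = [sum(f[i] for f in forces) for i in range(3)]
        if any(abs(c) > 1e-2 for c in net):  # net force on the cell should vanish
            problems.append("net force on cell is non-zero")
    return problems
```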
Cross-validation with small DFT benchmark set:
Run 50 representative structures through both MatGL and DFT
Track prediction errors over time
Alert if accuracy degrades
User feedback loop:
Allow users to report when predictions disagree with experiments/DFT
Track feedback metrics for model retraining decisions (Phase 2)
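Accuracy-degradation alerts for the cross-validation loop can be driven by a rolling MAE monitor like the following. The window size is an assumption; the 0.1 eV/atom default echoes the formation-energy target above:

```python
from collections import deque

class DriftMonitor:
    """Track rolling MAE against the DFT benchmark set and flag degradation."""

    def __init__(self, threshold: float = 0.1, window: int = 50):
        self.threshold = threshold
        self.errors = deque(maxlen=window)  # oldest errors drop off automatically

    def record(self, predicted: float, reference: float) -> None:
        self.errors.append(abs(predicted - reference))

    @property
    def mae(self) -> float:
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def degraded(self) -> bool:
        # Only alert once the window is full, to avoid noisy early triggers.
        return len(self.errors) == self.errors.maxlen and self.mae > self.threshold
```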
Infrastructure:
Provision GPU nodes (2 for MatGL, 4 for CHGNet)
Set up object storage for model weights
Configure async task queue (Celery/RQ/Airflow)
Set up monitoring and alerting
API:
Implement endpoint handlers
Add rate limiting and authentication
Deploy API server with load balancing
Version endpoint and models
Data:
Create database schema for results
Set up caching layer (Redis)
Configure backups
Plan data retention policy
Testing:
Unit tests for validation logic
Integration tests for full pipeline
Load testing (simulate peak usage)
Benchmark accuracy against test sets
Documentation:
API reference (OpenAPI/Swagger)
User guides for each endpoint
Model limitations and accuracy expectations
Troubleshooting guide
Monitoring:
Latency tracking per endpoint
GPU utilization graphs
Queue depth and wait times
Prediction accuracy tracking
Cost per prediction (for billing)
Hardware (one-time, approximate):
2x NVIDIA A40 GPUs for MatGL: $20k
4x NVIDIA A40 GPUs for CHGNet: $40k
Storage/compute infrastructure: $30k
Total: ~$90k + operational costs
Per-prediction costs (estimated):
MatGL: ~0.001-0.005 USD per prediction (GPU time)
CHGNet single-point: ~0.01-0.05 USD
CHGNet relaxation (100-atom structure): ~2-5 USD
Monthly operational costs (scaling):
10k MatGL + 100 CHGNet relaxations: $500-1000
100k MatGL + 1000 CHGNet relaxations: $5k-10k
1M MatGL + 10k CHGNet relaxations: $50k-100k
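The per-prediction figures above bound only the variable compute cost; fixed operational costs (storage, idle GPU time, staffing) sit on top, which presumably accounts for the higher published monthly ranges. A sketch:

```python
def monthly_cost_range_usd(matgl_preds: int, chgnet_relaxations: int) -> tuple:
    """Bound the variable monthly compute bill using the per-prediction
    ranges above (MatGL $0.001-0.005, CHGNet relaxation $2-5).
    Fixed operational costs are excluded."""
    low = matgl_preds * 0.001 + chgnet_relaxations * 2.0
    high = matgl_preds * 0.005 + chgnet_relaxations * 5.0
    return low, high
```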
API Authentication:
OAuth 2.0 or API key-based
Rate limiting per user/organization
Audit logging for all predictions
Data Privacy:
User structures stored securely
Option to delete results after retrieval
No model retraining on proprietary user data without consent
Compliance:
Ensure DFT training data licensing is properly documented
Published metrics for model uncertainty/accuracy
Disclosure of model limitations (especially CHGNet oxide bias)