Futures, Privacy & Adapters:
The Three Pillars
A confidential compute exchange built on three foundational pillars—forward reservations for guaranteed capacity, privacy primitives for hardware-attested security, and adapters for workload abstraction—with enabling infrastructure that makes it all work.
The futures adapter is an autonomous microservice that mints, settles, and trades forward reservations (RC_Reserve) for GPU compute capacity. It runs alongside the broader platform but is implemented as a strictly additive adapter: it exposes its own FastAPI service, persists to its own SQLite (or cloud RDBMS) schema, and integrates with the existing UI purely via HTTP proxied through platform-server.py.
The adapter treats every provider as a short forward counterparty. Buyers lock compute supply by paying a split fee—commit fee paid immediately to providers, usage fee escrowed until jobs actually burn the reservation—mirroring oil futures logic where spot supply is guaranteed in the future at a fixed price. The service implements zero-collateral architecture: no platform treasury, no provider collateral; solvency emerges from algorithmic capacity ceilings, deterministic curve pricing, and strict ledger debits/credits.
Market Discovery
Compute per-provider, per-tenor curves (capacity, utilization, lock price, fee split) from telemetry and utilization data.
Purchase Lifecycle
Accept user quotes, verify capacity/price, debit ACU balances, mint RCs, and escrow usage fees.
Execution Coverage
On job completion events, allocate usage against outstanding RCs, credit providers, fall back to spot when futures inventory exhausted.
Secondary Trades
Allow RC splits/transfers, create order-book style listings, and expire unused capacity with γ-based refunds.
Observability & Verification
Persist detailed tables (stats, reserves, allocations, listings, trades) and ship a sophisticated testing harness that simulates telemetry and user flows at scale.
futures_trades · job_allocations · job_allocation_details
Context Diagram
┌─────────────────────┐ HTTPS via platform-server ┌────────────────────────┐
│ Browser UI │ <──────────────────────────────────── │ Flask Proxy │
│ (vracu-platform) │ /api/futures/* │ (platform-server) │
└─────────────────────┘ └───────────┬────────────┘
│ │
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ FastAPI (services/futures_adapter/app.py) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ api.py │ │ service.py │ │ capacity.py │ │ pricing.py │ │ ledger.py │ │
│ │ Router │─▶│ Service │─▶│ Forecast │─▶│ Curves │─▶│ Debits │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │
│ │ SQLAlchemy ORM │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Futures DB (SQLite/Postgres) │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ rc_reserves │ │ account_balances │ │
│ ├──────────────┤ ├──────────────────┤ │
│ │ rc_stats │ │ futures_listings │ │
│ ├──────────────┤ ├──────────────────┤ │
│ │ job_allocs │ │ futures_trades │ │
│ └──────────────┘ └──────────────────┘ │
└──────────────────────────────────────────┘
| Module | Purpose |
|---|---|
| config.py | Environment variables: provider IDs, pricing parameters, reserved fractions, default spot price, ledger backend selection |
| db.py | SQLAlchemy session + ORM declarations for all tables (RCReserveORM, RCStatsORM, AccountBalanceORM, FuturesListingORM, etc.) |
| models.py | Dataclasses representing domain entities returned to service/API layers |
| schemas.py | Pydantic models for FastAPI request/response validation |
| capacity.py | Deterministic capacity forecasting: translates per-tenor days to SCU capacity via reserved fraction and reliability floor |
| pricing.py | Pricing curves: term premium, utilization premium, commit/usage split; produces lock price per tenor |
| ledger.py | Balance mutation abstraction; default uses internal account_balances, optional HTTP backend for payments integration |
| service.py | Heart of the system: market stats, quote/purchase flow, job coverage, expiry, secondary market operations |
| secondary.py | Thin wrappers to orchestrate listing/trade creation |
| job_hook.py | CLI utility to call /jobs/apply for completed jobs (control plane bridge) |
Entity Relationship Diagram
┌───────────────────────┐ ┌───────────────────────┐
│ account_balances │ │ rc_stats │
│───────────────────────│ │───────────────────────│
│ account_id (PK) │ │ provider_id, tenor │
│ balance │ │ lock_price, fees │
│ updated_at │ │ capacity, utilization │
└───────────────────────┘ └───────────────────────┘
┌───────────────────────────────────────────────────────────────────────┐
│ rc_reserves (PK: rc_id) │
│───────────────────────────────────────────────────────────────────────│
│ owner_id │ provider_id │ gpu_profile │ region │ tenor │ expiry_ts │
│ max_scu │ used_scu │ fee_comm │ fee_usage │ lock_price │
│ escrow_acu │ status (ACTIVE/FULLY_USED/EXPIRED) │ parent_rc_id │
└─────────────────────────────────┬─────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ futures_listings │ │ job_alloc_details │ │ futures_trades │
│─────────────────────│ │─────────────────────│ │─────────────────────│
│ listing_id (PK) │ │ job_id (FK) │ │ trade_id (PK) │
│ rc_id (FK) │ │ rc_id (FK) │ │ listing_id (FK) │
│ owner_id │ │ alloc_scu │ │ rc_id_buyer (FK) │
│ scu_available │ │ fee_usage │ │ buyer_id, seller_id │
│ ask_price_acu/scu │ │ payout_acu │ │ scu, price │
│ status (OPEN/FILLED)│ └─────────────────────┘ └─────────────────────┘
└─────────────────────┘
▲ ▲
│ │
┌──────────┴──────────────────────┴──────────┐
│ job_allocations (PK: job_id) │
│─────────────────────────────────────────────│
│ user_id │ provider_id │ total_scu │ spot_* │
└─────────────────────────────────────────────┘
Quote/Purchase Sequence
User Proxy Adapter(API) FuturesService Ledger DB
│ │ │ │ │ │
│──POST /quote──────────────▶ │ │ │
│ │ │──market()─────────▶│ │ │
│ │ │ │──compute_stats()──▶│ │
│ │ │ │ for each provider│ │
│ │ │ │◀──────────────────────────read stats─┤
│ │ │◀──QuoteResponse────│ │ │
│◀─────allocations, total_cost, partial──────────│ │ │
│ │ │ │ │
│──POST /purchase───────────▶ │ │ │
│ │ │──purchase()───────▶│ │ │
│ │ │ │──validate drift───▶│ │
│ │ │ │──debit(owner)─────▶│ │
│ │ │ │◀──────────────────ok│ │
│ │ │ │──credit(provider)─▶│ (commit fee) │
│ │ │ │──INSERT RC_Reserve─────────────────▶│
│ │ │ │ (escrow_acu = usage fee) │
│◀─────PurchaseResponse {reservations[], cost}───│ │ │
│ │ │ │ │
function quote(gpu_profile, region, tenor, quantity, provider_ids):
    offers = []
    for pid in provider_ids:
        stat = compute_stats(pid, tenor)
        capacity_avail = remaining_capacity(pid, tenor, stat.notional)
        if capacity_avail > 0:
            offers.append((pid, stat, capacity_avail))

    sort offers by stat.lock_price   # cheapest first

    allocations = []
    filled = 0
    for offer in offers:
        take = min(offer.capacity_avail, quantity - filled)
        if take <= 0:
            break
        allocations.append({
            provider_id: offer.pid,
            scu: take,
            lock_price: offer.stat.lock_price,
            fee_comm: offer.stat.fee_comm,
            fee_usage: offer.stat.fee_usage
        })
        filled += take

    total_cost = sum(alloc.scu * alloc.lock_price for alloc in allocations)
    return allocations, total_cost, filled < quantity
Job Coverage Algorithm
┌─────────────────────────────┐
│ Control Plane POST │
│ /jobs/apply │
│ {job_id, user, provider, │
│ scu_used, spot_rate} │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Fetch RCs for user/provider│
│ ORDER BY expiry_ts ASC │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ For each RC: │
│ alloc = min(remaining, │
│ rc.available) │
│ rc.used_scu += alloc │
│ rc.escrow -= fee_usage │
│ credit provider │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ If remaining > 0: │
│ Bill spot to user │
│ Credit provider │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Insert job_allocation │
│ Insert detail records │
│ Return coverage summary │
└─────────────────────────────┘
RC Reserve Lifecycle
┌──────────┐
│ MINT │
│ (purchase│
│ action) │
└────┬─────┘
│
▼
┌──────────────┐
│ ACTIVE │◀──────────┐
│ │ │
│ max_scu=250 │ secondary
│ used_scu=0 │ trades add
│ escrow=full │ child RCs
└──────┬───────┘ │
│ │
job allocations │
consume SCU │
│ │
▼ │
┌──────────────┐ │
│ FULLY_USED │───────────┘
│ │ (split)
│ used >= max │
│ escrow ~ 0 │
└──────┬───────┘
│
expiry reached
│
▼
┌──────────────┐
│ EXPIRED │
│ │
│ γ refund │
│ to user │
│ breakage to │
│ provider │
└──────────────┘
When the expiry sweep runs, RCs whose expiry timestamp has passed are processed. The adapter computes remaining SCU (max - used) and remaining escrow (unused usage fees). Utilization is calculated as used/max, then fed into the γ function—a linear interpolation that returns 0 at 0% utilization and 0.7 at 90%+ utilization.
User refund equals remaining_escrow × γ(utilization). Provider breakage equals remaining_escrow − refund. This incentivizes users to maximize utilization while compensating providers for reserved capacity. The RC status transitions to EXPIRED and escrow zeroes out.
Gamma Function & Expiry Logic
γ (gamma)
│
0.7├────────────────────────────●━━━━━━━━━━ (90%+ utilization → max refund)
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
0.0├●─────────────────────────────────────── (0% utilization → no refund)
└─────────────────────────────────────────▶
0% 90% 100% Utilization
┌─────────────────────────────────────────────────────────────────────┐
│ Expiry Calculation: │
│ │
│ remaining_escrow = rc.escrow_acu │
│ utilization = rc.used_scu / rc.max_scu │
│ gamma = 0.7 × min(1.0, utilization / 0.9) │
│ │
│ refund_user = remaining_escrow × gamma │
│ provider_breakage = remaining_escrow - refund_user │
│ │
│ ledger.credit(owner, refund_user) │
│ ledger.credit(provider, provider_breakage) │
│ rc.status = EXPIRED │
│ rc.escrow_acu = 0 │
└─────────────────────────────────────────────────────────────────────┘
Listing Creation & Trade Execution
┌─────────────────────────────────────────────────────────────────────────────┐
│ LISTING CREATION │
└─────────────────────────────────────────────────────────────────────────────┘
User A (Seller) Adapter Database
│ │ │
│──POST /listings───────────────▶ │
│ {rc_id, scu_to_sell, ask} │ │
│ │──validate owner == rc.owner───▶│
│ │──check remaining >= scu────────▶│
│ │──INSERT listing────────────────▶│
│◀──────────ListingResponse─────│ │
│ │ │
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRADE EXECUTION │
└─────────────────────────────────────────────────────────────────────────────┘
User B (Buyer) Adapter Ledger Database
│ │ │ │
│──POST /trades──▶│ │ │
│ {listing_id, │──fetch OPEN listing─────────────────────────▶ │
│ scu_to_buy} │ │ │
│ │──verify buyer ≠ seller │ │
│ │──verify scu available │ │
│ │ │ │
│ │──debit(buyer, total)────▶│ │
│ │──credit(seller, total)──▶│ │
│ │ │ │
│ │──reduce seller_rc.max_scu───────────────────▶ │
│ │──create buyer_rc (child)────────────────────▶ │
│ │──update listing.scu_available───────────────▶ │
│ │──INSERT trade record────────────────────────▶ │
│ │ │ │
│◀──TradeResponse─│ │ │
│ {buyer_rc_id} │ │ │
capacity.py
max_notional = forecast × reserved_fraction × reliability_floor
remaining = max(0, max_notional - current_notional)
pricing.py
util_premium = util_slope × max(0, utilization - util_target)
lock_price = spot × (1 + term_premium) × (1 + util_premium)
fee_commit = lock_price × commit_fraction[tenor]
fee_usage = lock_price - fee_commit
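A minimal Python sketch of how those two formulas compose, assuming illustrative parameter values (util_slope, util_target, and the commit fraction below are placeholders, not the shipped defaults):

# Illustrative pricing sketch; parameter values are hypothetical, not the shipped defaults.
def lock_price_curve(spot: float, term_premium: float, utilization: float,
                     util_target: float = 0.7, util_slope: float = 0.5) -> float:
    util_premium = util_slope * max(0.0, utilization - util_target)
    return spot * (1 + term_premium) * (1 + util_premium)

def fee_split(lock_price: float, commit_fraction: float) -> tuple[float, float]:
    fee_commit = lock_price * commit_fraction
    fee_usage = lock_price - fee_commit
    return fee_commit, fee_usage

# Example: a 30-day tenor at 80% utilization on a 0.04 ACU/SCU spot price.
price = lock_price_curve(spot=0.04, term_premium=0.10, utilization=0.80)
commit, usage = fee_split(price, commit_fraction=0.4)
print(round(price, 5), round(commit, 5), round(usage, 5))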
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Readiness check returns {"status": "ok"} |
| /market | GET | List of RCStats: provider, tenor, lock price, fees, capacity, utilization |
| /providers/{id}/curve | GET | Tenor-wise curve for provider (UI "Curve" modal) |
| /quote | POST | Request quote for GPU profile, region, tenor, quantity |
| /purchase | POST | Finalize reservations along quoted allocations; returns minted RCs |
| /portfolio/{user_id} | GET | Current RCs for user (UI "My Portfolio" tab) |
| /positions/{rc_id} | GET | RC-level detail with job allocations |
| /listings | GET/POST | Secondary market listing feed / create listing |
| /trades | POST | Execute listing purchase |
| /jobs/apply | POST | Job completion hook; returns allocation summary |
| /jobs/{job_id}/allocations | GET | Detailed coverage info for a job |
| /expire | POST | Manual expiry sweep (cron/batch) |
SCM (Standard Compute Minutes) is not "one minute on any GPU"—it is one minute on a reference machine with calibrated GFLOPS and bandwidth figures. Different GPUs deliver different SCM/min scores via benchmarking. The scheduler and futures adapter convert between SCM and actual runtime per hardware using those scores.
When a provider onboards, the provider_agent collects hardware attestation plus micro-benchmarks (GEMM FP16/FP32 GFLOPS, memory bandwidth, interconnect throughput). ACURateCalibrator normalizes each metric against reference values and applies weights to produce acurate_scm_per_min: how many standardized compute minutes that hardware delivers per wall-clock minute.
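A rough sketch of that weighted normalization; only the weight percentages come from the calibrator description, while the reference values, metric keys, and penalty handling are assumptions for illustration:

# Hypothetical sketch of ACURateCalibrator-style normalization; not the real calibrator.
REFERENCE = {"gemm_fp16": 1000.0, "gemm_fp32": 500.0, "mem_bw": 2000.0, "interconnect": 900.0}  # arbitrary units
WEIGHTS = {"gemm_fp16": 0.55, "gemm_fp32": 0.15, "mem_bw": 0.15, "interconnect": 0.10}

def acurate_scm_per_min(bench: dict, stability_penalty: float = 0.0) -> float:
    # Normalize each micro-benchmark against the reference machine, then apply weights.
    weighted = sum(WEIGHTS[k] * (bench[k] / REFERENCE[k]) for k in WEIGHTS)
    score = weighted / sum(WEIGHTS.values())        # reference hardware scores 1.0
    # Stability contributes up to a 5% penalty on the final score.
    return score * (1.0 - min(stability_penalty, 0.05))

print(round(acurate_scm_per_min(dict(REFERENCE)), 2))   # 1.0 for the reference baseline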
Calibration Pipeline
┌──────────────────────────────────────────────────────────────────────────────┐
│ PROVIDER ONBOARDING │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ provider_agent/microbench/ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GEMM FP16 │ │ GEMM FP32 │ │ Mem BW │ │ Interconnect│ │
│ │ GFLOPS │ │ GFLOPS │ │ GB/s │ │ Latency │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────┼────────────────┼────────────────┼─────────────────┘
│ │ │ │
└────────────────┴────────────────┴────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ ACURateCalibrator (provider_agent/calibration/calibrator.py) │
│ │
│ weights = { FP16: 55%, FP32: 15%, MemBW: 15%, Interconnect: 10%, │
│ Stability: 5% penalty } │
│ │
│ acurate_scm_per_min = weighted_score × reference_normalization │
│ │
│ H200 → ~1.4 SCM/min (faster than reference) │
│ H100 → ~1.0 SCM/min (reference baseline) │
│ A100 → ~0.7 SCM/min (slower than reference) │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ control_plane/services.py::record_attestation │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ provider_attestations table │ │
│ │ ───────────────────────────────────────────────────────────────────── │ │
│ │ provider_id │ hardware_spec │ acurate_scm_per_min │ attestation_ts │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ FUTURES ADAPTER │
│ │
│ Capacity model operates on SCU totals—already normalized to SCM. │
│ When a job requests 100 SCM, the control plane divides by each provider's │
│ acurate_scm_per_min to determine actual runtime. Futures contracts settle │
│ in ACU tokens at the SCM-normalized rate. │
└──────────────────────────────────────────────────────────────────────────────┘
futures_adapter/testing/
Comprehensive testing infrastructure supports realistic multi-provider simulations, stress testing, and UI replay for demonstrations.
providers.yaml (13 providers)
GPU specs & attestation
Telemetry per tenor
run_load.py (asyncio)
curve_simulator.py
provider_catalog.py
purchases.jsonl
jobs.jsonl, trades.jsonl
run_summary.json
Privacy Design Principles
The privacy subsystem lives under vracu-launcher/launcher/privacy/ and is composed of cooperating modules: attesters.py validates hardware claims; service.py orchestrates authorisation and on-demand key issuance; key_broker.py implements the various key distribution strategies; revocation.py stores revocation state and runs background watchers; proof.py loads optional verifiers; interrank.py produces per-rank crypto bundles; and loader.py constructs the system based on configuration.
AttestationResult is a frozen dataclass containing valid (bool) and claims (dict). This simple shape allows attesters to return rich structured claims when verification succeeds or a human-readable error when it fails. PrivacyAuthorization stores claims (per-attester claim maps), attestation (flattened dictionary), optional dek bytes, and optional session metadata.
Privacy Module Architecture
launcher/privacy/
├── __init__.py
├── attesters.py ← Attester protocol + implementations (NVIDIA, TDX, SNP)
│ ├── Attester (Protocol)
│ ├── NvidiaCcOnAttester
│ ├── TdxAttester
│ └── SnpAttester
├── service.py ← PrivacyGate orchestration
│ ├── PrivacyGate
│ ├── authorize_job()
│ └── issue_session_dek()
├── key_broker.py ← Key distribution strategies
│ ├── KeyBroker (abstract)
│ ├── SessionKeyBroker (abstract)
│ ├── KeyBrokerAwsKms
│ ├── KeyBrokerVaultTransit
│ ├── KeyBrokerStatic
│ ├── SplitKeyBroker
│ └── HttpSplitKeyShareClient
├── revocation.py ← Revocation registry + watcher
│ ├── RevocationRegistry
│ ├── RevocationWatcher
│ └── SessionInvalidationPipeline
├── proof.py ← Proof verifier plugins
│ ├── load_proof_verifier()
│ └── run_proof_verifier()
├── interrank.py ← Inter-rank cryptography
│ ├── InterRankCryptoConfig
│ └── build()
├── loader.py ← Configuration-driven construction
│ └── build_privacy_components()
└── errors.py ← Custom exceptions
├── PrivacyViolation
├── PrivacyInitializationError
└── ProofVerificationError
This layered architecture ensures each module handles a specific security domain while sharing the same foundation—typed, deterministic Python code built for confidential compute workloads. The modules cooperate to validate hardware claims, issue cryptographic keys, track sessions, and enforce revocation policies.
RevocationDelta encapsulates newly revoked attestation hashes and session IDs. SessionState structures track session IDs, salts, attestation hashes, step counters, tokens, and broker state.
Attester is a typing.Protocol with verify(evidence, challenge) returning AttestationResult. Implementations must bind evidence to the challenge (job ID). The protocol enables static type checking and encourages consistent error handling across attesters. Because the protocol returns AttestationResult, attesters never raise exceptions for expected validation failures; they return valid=False with an informative error message.
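A minimal sketch of those shapes, assuming an error field for the human-readable failure message (the real dataclass may name it differently):

from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass(frozen=True)
class AttestationResult:
    valid: bool
    claims: dict[str, Any] = field(default_factory=dict)
    error: str | None = None          # assumed field for the failure message

class Attester(Protocol):
    def verify(self, evidence: dict[str, Any], challenge: str) -> AttestationResult: ...

class ToyNonceAttester:
    """Toy attester illustrating the no-raise contract for expected failures."""
    def verify(self, evidence: dict[str, Any], challenge: str) -> AttestationResult:
        if evidence.get("nonce") != challenge:
            return AttestationResult(valid=False, error="nonce does not match challenge")
        return AttestationResult(valid=True, claims={"vendor": "example"})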
NvidiaCcOnAttester
Enforces SPDM certificate chain validation, challenge binding, CC-On mode requirements, and payload signature verification.
Validates: nonce == challenge
Requires: cc_mode == "CC_ON"
Signature: RSA PSS / ECDSA
Returns: vendor, product_id, measurement
TdxAttester
Handles Intel Trust Domain Extensions quotes with report, signature, and cert_chain validation.
Validates: nonce, mr_enclave, mr_signer
Signature: SHA-384 verification
Returns: vendor, challenge, attributes
SnpAttester
Validates SEV-SNP attestation reports with policy enforcement and VCEK chain verification.
Validates: nonce, policy
Certificate: PEM, chain verification
Returns: vendor, policy, platform_version
PrivacyGate.authorize_job iterates over the configured attesters. Evidence is looked up by attester name; missing evidence triggers PrivacyViolation. Each attester's verify method is called with challenge=job_id. If valid is false, the raised PrivacyViolation includes the attester name and error message. Claims are stored in a map keyed by attester name and flattened into flat_claims.
PrivacyGate Authorization Flow
PrivacyGate.authorize_job(job_id, evidence_map)
│
├──▶ FOR attester_name IN configured_attesters:
│ │
│ ├──▶ evidence = evidence_map.get(attester_name)
│ │ │
│ │ └── IF NOT evidence:
│ │ RAISE PrivacyViolation("missing evidence")
│ │
│ ├──▶ result = attester.verify(evidence, challenge=job_id)
│ │ │
│ │ └── IF NOT result.valid:
│ │ RAISE PrivacyViolation(attester_name, result.error)
│ │
│ └──▶ claims[attester_name] = result.claims
│
├──▶ flat_claims = flatten(claims)
│
├──▶ attestation_hash = compute_attestation_hash(flat_claims)
│
├──▶ IF revocation_registry.is_attestation_revoked(attestation_hash):
│ RAISE PrivacyViolation("attestation revoked")
│
├──▶ IF broker IS SessionKeyBroker:
│ │
│ ├──▶ session = broker.create_session(job_id, attestation_hash)
│ │
│ └──▶ revocation_registry.track_session(session)
│
└──▶ RETURN PrivacyAuthorization(claims, flat_claims, dek, session)
key_broker.py defines a class hierarchy: KeyBroker (abstract), SessionKeyBroker (abstract subclass), KeyBrokerAwsKms, KeyBrokerVaultTransit, KeyBrokerStatic, SplitKeyBroker, and HttpSplitKeyShareClient. Each broker provides asynchronous methods (release or create_session/issue). Retry logic uses exponential_backoff accepting attempts, initial, maximum, and exceptions.
KeyBrokerAwsKms
Generates data keys using AWS KMS with encryption context binding to job ID and attestation hash.
Context: job_id + attestation_hash
Retry: exponential_backoff
Key Size: 256-bit
KeyBrokerVaultTransit
Posts to Vault's transit endpoint for key generation with context binding.
Headers: X-Vault-Token
Response: base64-encoded key
Key Size: 256-bit
KeyBrokerStatic
Deterministic key derivation using HKDF for development and testing.
Info: job_id:attestation_hash
Salt: configurable
Use: development only
SplitKeyBroker
Composes a primary broker with remote share client for threshold key issuance. Enforces monotonic step increments, proof verification, and session state management.
Remote: HttpSplitKeyShareClient
Combination: XOR(primary_share, remote_share)
Final Key: HKDF(combined, salt=session.salt, info=f"{session_id}:{step}")
Session State: session_id, attestation_hash, salt, last_step, threshold
ASYNC FUNCTION issue(job_id, session, attestation, step, proof):
    # Verify attestation hash matches session
    verify_attestation_hash(session["attestation_hash"], attestation)

    # Ensure monotonic step progression
    ensure_step_monotonic(session["last_step"], step)

    # Run proof verifier if required
    IF require_proof:
        proof_context = run_proof_verifier(job_id, session_id, step, attestation)

    # Get share from primary broker (KMS/Vault)
    primary_share = AWAIT primary_broker.release(job_id, attestation, step)

    # Get share from remote aggregator
    remote_response = AWAIT splitkey_client.issue_share(session_token, step, proof_context)
    remote_share = base64_decode(remote_response["share_b64"])

    # Combine shares via XOR
    final_key_material = xor_bytes(primary_share, remote_share)

    # Derive final DEK using HKDF
    dek = hkdf(
        final_key_material,
        salt=session["salt"],
        info=f"{session_id}:{step}"
    )

    # Update session state
    update_session_state(session, remote_response, step)

    RETURN dek, session
RevocationRegistry maintains sets of revoked attestation hashes and session IDs plus a dictionary of active sessions. A threading.Lock serialises access. update normalises inputs, determines new entries, updates sets, attaches revoked_at timestamps to tracked sessions, and updates version and updated_at.
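A compressed sketch of that update path, mirroring the fields described above (the exact method signatures are assumptions):

from __future__ import annotations
import threading
from datetime import datetime, timezone

class RevocationRegistrySketch:
    """Illustrative registry; mirrors the described fields, not the real class."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._revoked_attestations: set[str] = set()
        self._revoked_sessions: set[str] = set()
        self._tracked_sessions: dict[str, dict] = {}
        self._version = 0
        self._updated_at: datetime | None = None

    def update(self, attestation_hashes: list[str], session_ids: list[str]) -> dict:
        with self._lock:
            new_hashes = set(attestation_hashes) - self._revoked_attestations
            new_sessions = set(session_ids) - self._revoked_sessions
            self._revoked_attestations |= new_hashes
            self._revoked_sessions |= new_sessions
            now = datetime.now(timezone.utc)
            for sid in new_sessions:
                if sid in self._tracked_sessions:
                    self._tracked_sessions[sid]["revoked_at"] = now
            self._version += 1
            self._updated_at = now
            # The delta feeds the SessionInvalidationPipeline shown below.
            return {"attestation_hashes": new_hashes, "session_ids": new_sessions}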
Revocation Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ REVOCATION FEED SOURCE │
│ (HTTP endpoint / local file / control plane) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ REVOCATION WATCHER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ _run() loop: │ │
│ │ 1. sleep(poll_interval) │ │
│ │ 2. _fetch_payload() ─▶ HTTP GET or file read │ │
│ │ 3. _apply_payload() ─▶ validate + update registry │ │
│ │ 4. on_update() callback ─▶ trigger invalidation pipeline │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ REVOCATION REGISTRY │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ _revoked_attestations: Set[str] │ │
│ │ _revoked_sessions: Set[str] │ │
│ │ _tracked_sessions: Dict[str, SessionData] │ │
│ │ _version: int │ │
│ │ _updated_at: datetime │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SESSION INVALIDATION PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ handle(delta: RevocationDelta): │ │
│ │ FOR session_id IN delta.session_ids: │ │
│ │ job_id = registry.get_job_id(session_id) │ │
│ │ cancel_job(job_id, reason="revoked_session") │ │
│ │ IF stop_job: stop_job(job_id, execution_metadata) │ │
│ │ FOR hash IN delta.attestation_hashes: │ │
│ │ sessions = registry.get_sessions_by_attestation(hash) │ │
│ │ FOR session IN sessions: cancel_job(...) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
interrank.py delivers cryptographic material for multi-rank workloads. InterRankCryptoConfig accepts algorithm (aes-gcm), key size, nonce size, pad multiple, and Gaussian noise settings. build(world_size) generates handshake ID, per-rank keys, nonces, tags (SHA-256 over handshake ID + rank + key), and zipped contexts.
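A sketch of what a per-rank bundle could look like under those rules; the output structure and field names are assumptions, only the tag recipe comes from the description above:

import hashlib
import os

def build_interrank_bundles(world_size: int, key_size: int = 32, nonce_size: int = 12) -> list:
    """Illustrative per-rank crypto bundle generation (not the real interrank.build)."""
    handshake_id = os.urandom(16).hex()
    bundles = []
    for rank in range(world_size):
        key = os.urandom(key_size)
        nonce = os.urandom(nonce_size)
        # Tag binds handshake ID, rank, and key, as described above.
        tag = hashlib.sha256(handshake_id.encode() + str(rank).encode() + key).hexdigest()
        bundles.append({"handshake_id": handshake_id, "rank": rank,
                        "key": key, "nonce": nonce, "tag": tag})
    return bundles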
Sidecar ↔ Launcher ↔ Attester Handshake
Sidecar Launcher API PrivacyGate Attesters Key Broker
│ │ │ │ │
│──POST /v1/jobs/{id}/attestation─────────────▶ │ │
│ {evidences: {...}} │ │ │ │
│ │──authorize_job()─────▶│ │ │
│ │ │──verify(nvidia)────▶│ │
│ │ │◀───AttestResult────│ │
│ │ │──verify(tdx)───────▶│ │
│ │ │◀───AttestResult────│ │
│ │ │ │ │
│ │ │──check_revocation()│ │
│ │ │ (registry lookup)│ │
│ │ │ │ │
│ │ │──create_session()──────────────────────────▶
│ │ │◀──────────session + dek────────────────────│
│ │◀─PrivacyAuthorization│ │ │
│◀──200 {session_id, dek_hint}────────────────│ │ │
│ │ │ │ │
│ │ │ │ │
│══════════════════════│══ JOB EXECUTION ═════│════════════════════│═══════════════════════│
│ │ │ │ │
│──POST /v1/jobs/{id}/rotation────────────────▶ │ │
│ {step: N, proof} │ │ │ │
│ │──issue_session_dek()─▶ │ │
│ │ │──verify_step() │ │
│ │ │──run_proof() │ │
│ │ │──issue()───────────────────────────────────▶
│ │ │◀──────────new_dek──────────────────────────│
│◀──200 {dek_hint, next_step}─────────────────│ │ │
│ │ │ │ │
Broker Selection Logic
settings.privacy.key_broker
│
▼
┌─────────────┐
│ "aws_kms"? │──YES──▶ KeyBrokerAwsKms
└──────┬──────┘ │
│NO │ generate_data_key()
▼ │ encryption_context
┌─────────────┐ │
│"vault_transit"──YES──▶ KeyBrokerVaultTransit
└──────┬──────┘ │
│NO │ POST /transit/datakey
▼ │
┌─────────────┐ │
│ "static"? │──YES──▶ KeyBrokerStatic
└──────┬──────┘ │
│NO │ HKDF derivation
▼ │
┌─────────────┐ │
│"split_key"? │──YES──▶ SplitKeyBroker
└──────┬──────┘ │
│NO ├─▶ primary: KMS/Vault
▼ └─▶ remote: HTTP share
┌─────────────┐
│ ERROR │
│ InvalidConf │
└─────────────┘
Split-Key Threshold Crypto
┌─────────────────────────────┐
│ SPLIT-KEY ISSUANCE │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Primary Broker (KMS/Vault) │
│ ┌───────────────────────┐ │
│ │ share_A = release() │ │
│ │ (256-bit) │ │
│ └───────────────────────┘ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Remote Share Aggregator │
│ ┌───────────────────────┐ │
│ │ share_B = issue() │ │
│ │ (256-bit) │ │
│ └───────────────────────┘ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ XOR COMBINATION │
│ combined = share_A ⊕ share_B│
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ HKDF DERIVE │
│ DEK = HKDF(combined, │
│ salt=session.salt, │
│ info=session:step) │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ 256-bit AES-GCM DEK │
│ (per-step rotation) │
└─────────────────────────────┘
Detection → Invalidation → Cleanup
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ REVOCATION CASCADE │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ DETECTION │────▶│ REGISTRY │────▶│ INVALIDATION │────▶│ CLEANUP │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ RevocationWatcher│ │ registry.update│ │ pipeline.handle│ │ Driver.stop() │
│ polls source │ │ (attestations, │ │ (delta) │ │ Kubernetes │
│ every N seconds │ │ sessions) │ │ │ │ pod delete │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
│ │ │ │
│ │ │ │
┌───────▼───────┐ ┌───────▼───────┐ ┌───────▼───────┐ ┌───────▼───────┐
│ HTTP GET │ │ Set operations│ │ Lookup job_id │ │ Mark job │
│ /revocations │ │ - add hashes │ │ from session │ │ CANCELLED │
│ │ │ - add sessions│ │ │ │ │
│ or File read │ │ - version++ │ │ Cancel job │ │ Emit metrics │
│ revoked.json │ │ - timestamp │ │ with reason │ │ & logs │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
│ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ Timeline: Detection (0s) → Registry Update (10ms) → Job Cancel (50ms) → Pod Delete (1s) │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
| Component | Key File | Responsibility |
|---|---|---|
| Attesters | attesters.py | Validate vendor evidence (NVIDIA, TDX, SNP) |
| Privacy Gate | service.py | Orchestrate attestation checks, sessions, DEKs |
| Key Broker | key_broker.py | Issue secrets via KMS, Vault, static, split-key |
| Revocation | revocation.py | Track and apply revocation payloads |
| Proofs | proof.py | Execute optional proof verifiers |
| Inter-rank | interrank.py | Derive per-rank crypto materials |
| Invalidation | revocation.py | Cancel workloads when sessions revoked |
Adapter Design Principles
Adapters convert user intent into the normalised ResourceProfile and ExecutionPlan dataclasses defined in launcher/adapters/base.py. Every job submitted through the API traverses an adapter before it touches persistence. This guarantees that driver-facing specifications (image, command, env, volumes), placement inputs (GPU count, VRAM, interconnect, features), telemetry metadata, and IO descriptors are encoded in a predictable format.
launcher/adapters/loader.py exposes REGISTRY mapping adapter names to classes. load_adapter(name) raises KeyError if the name is absent, preventing silent fallbacks. register_adapter(name, cls) allows runtime extension. Adapters are instantiated on demand so constructor parameters can be supplied by features or tests.
Adapter Flow Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER PAYLOAD │
│ { │
│ "adapter": "training", │
│ "spec": { │
│ "command": ["torchrun", "train.py"], │
│ "num_gpus": 4, │
│ "min_vram_gb": 48, │
│ ... │
│ } │
│ } │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADAPTER LOADER │
│ load_adapter("training") ──▶ TrainingAdapter │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADAPTER.PREPARE(spec) │
│ │
│ ┌──────────────────────────┐ ┌──────────────────────────┐ │
│ │ ResourceProfile │ │ ExecutionPlan │ │
│ │ ──────────────────── │ │ ──────────────────── │ │
│ │ num_gpus: 4 │ │ image: "ghcr.io/..." │ │
│ │ min_vram_gb: 48 │ │ command: ("torchrun", │ │
│ │ interconnect: ("nvlink",)│ │ "train.py") │ │
│ │ scm_minutes: 60 │ │ env: {"ADAPTER": "..."}│ │
│ │ features: ("cuda>=12.1",│ │ strategy: "ddp" │ │
│ │ "nccl") │ │ io: {...} │ │
│ └──────────────────────────┘ └──────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERSISTENCE │
│ Job(spec={...}, profile={...}, plan={...}, status=PENDING) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLACEMENT / DRIVER / TELEMETRY │
│ profile ──▶ placement decisions │
│ plan ──▶ driver.launch() │
│ adapter.map_metrics() ──▶ telemetry aggregator │
└─────────────────────────────────────────────────────────────────────────────┘
ResourceProfile is a frozen dataclass capturing: num_gpus (integer GPU count), min_vram_gb (minimum VRAM per GPU), interconnect (tuple of required interconnects), scm_minutes (scheduled compute minutes for billing), and features (hardware/software feature flags like "cuda>=12.1", "nccl", "nvenc").
ExecutionPlan includes: image (container image), command (tuple of arguments), env (environment variables), volumes, strategy ("ddp", "service", "single", "tiling", "composite"), rendezvous, io descriptor, metadata, and Kubernetes-specific fields (labels, annotations, service_account, restart_policy, replicas, service_ports, probes, autoscaling).
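A trimmed sketch of the two dataclasses, keeping only the fields discussed above (defaults are assumptions, and the real ExecutionPlan carries many more fields):

from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class ResourceProfileSketch:
    num_gpus: int = 1
    min_vram_gb: int = 0
    interconnect: tuple[str, ...] = ()
    scm_minutes: int = 0
    features: tuple[str, ...] = ()

@dataclass
class ExecutionPlanSketch:
    image: str = ""
    command: tuple[str, ...] = ()
    env: dict[str, str] = field(default_factory=dict)
    strategy: str = "single"
    io: dict[str, Any] = field(default_factory=dict)

profile = ResourceProfileSketch(num_gpus=4, min_vram_gb=48,
                                interconnect=("nvlink",), features=("cuda>=12.1", "nccl"))
plan = ExecutionPlanSketch(image="ghcr.io/example/train:v2",
                           command=("torchrun", "--nproc_per_node=4", "train.py"),
                           strategy="ddp")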
TrainingAdapter
Prepares distributed training jobs with multi-GPU coordination strategies. Sets defaults for DDP/FSDP rendezvous, volume mounts for datasets, and priority metadata.
Features: ("cuda>=12.1", "nccl")
Strategy: "ddp"
IO Mode: checkpoint
Metrics: step, loss, throughput
InferenceAdapter
Targets long-running model serving with health probes, autoscaling, and load balancer exposure.
Strategy: "service"
Probes: readiness, liveness
Autoscaling: HPA support
Metrics: latency_p95_ms, QPS, error_rate
QuantizationAdapter
Handles model compression with PTQ and QAT modes, different resource profiles per mode.
QAT: 2 GPUs, 120 SCM, strategy="ddp"
Output: onnx, tensorrt
Metrics: step, loss, accuracy
RenderingAdapter
Manages visual workloads with frames, tiles, resolution, and NVENC requirements.
Features: ("nvenc") when required
Strategy: "tiling"
Metrics: frames_rendered, tiles_completed
CompositeAdapter
Chains multiple adapter stages sequentially. Validates stages, merges resource profiles, generates orchestration script.
Merges: max GPUs, max VRAM, sum SCM minutes, union features
Environment: COMPOSITE_STAGE_COUNT, STAGE_n_NAME
Metrics: stage, step, loss, throughput
# Core registry with built-in adapters
_ADAPTER_REGISTRY: Dict[str, AdapterFactory] = {
    "training": TrainingAdapter,
    "inference": InferenceAdapter,
    "render": RenderingAdapter,
    "quant": QuantizationAdapter,
    "composite": CompositeAdapter,
}

# Third-party registration (zero core changes)
def register_adapter(name: str, factory: AdapterFactory) -> None:
    _ADAPTER_REGISTRY[name.lower()] = factory

# Load adapter by name
def load_adapter(name: str, **options) -> Adapter:
    if name.lower() not in _ADAPTER_REGISTRY:
        raise KeyError(f"Unknown adapter: {name}")
    return _ADAPTER_REGISTRY[name.lower()](**options)

# Example: Custom fine-tuning adapter
class FineTuneAdapter(Adapter):
    def prepare(self, job_spec):
        profile = ResourceProfile(
            num_gpus=job_spec.get("num_gpus", 1),
            min_vram_gb=40,
            features=("cuda>=12.1", "peft", "bitsandbytes")
        )
        plan = ExecutionPlan(...)
        return profile, plan

# Register without platform redeployment
register_adapter("finetune", FineTuneAdapter)
Adapter-to-Driver Mapping
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADAPTER PREPARE │
│ │
│ TrainingAdapter.prepare(spec) │
│ │ │
│ ├──▶ ResourceProfile │
│ │ num_gpus: 4 │
│ │ min_vram_gb: 48 │
│ │ interconnect: ("nvlink",) │
│ │ features: ("cuda>=12.1", "nccl") │
│ │ │
│ └──▶ ExecutionPlan │
│ image: "ghcr.io/example/train:v2" │
│ command: ("torchrun", "--nproc_per_node=4", "train.py") │
│ strategy: "ddp" │
│ env: {"ADAPTER": "training", "WORLD_SIZE": "4"} │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLACEMENT PLANNER │
│ │
│ ResourceProfile ──▶ MultiProviderPlacementPlanner │
│ │ │
│ ├── GPU count ──▶ rank distribution │
│ ├── interconnect ──▶ provider selection │
│ └── features ──▶ capability matching │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES DRIVER │
│ │
│ ExecutionPlan ──▶ driver.launch() │
│ │ │
│ ├── image ──▶ container spec │
│ ├── command ──▶ container args │
│ ├── env ──▶ environment variables │
│ ├── strategy ──▶ Job vs Deployment │
│ └── probes ──▶ readiness/liveness │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TELEMETRY AGGREGATOR │
│ │
│ Adapter.map_metrics(raw_data) ──▶ normalized metrics │
│ │ │
│ └── {"step": 100, "loss": 0.42, "throughput": 1234.5} │
└─────────────────────────────────────────────────────────────────────────────┘
| Adapter | Key Fields | Strategy | Metrics |
|---|---|---|---|
| Training | num_gpus, strategy, datasets | ddp | step, loss, throughput |
| Inference | replicas, autoscaling, probes | service | latency_p95_ms, QPS, error_rate |
| Quantization | mode, output_format, precision | single/ddp | step, loss, accuracy |
| Rendering | frames, tiles, resolution | tiling | frames_rendered, tiles_completed |
| Composite | stages[] | composite | stage, step, loss |
Enabling Infrastructure
The systems that power Futures, Privacy & Adapters
Configuration Design Principles
Pydantic settings hierarchies, environment variable mapping, feature toggles, and configuration validation patterns that enable type-safe deployments across all infrastructure layers.
launcher/config/settings.py defines LauncherSettings, a Pydantic model that aggregates subordinate models: APISettings, FeatureSettings, PrivacySettings, DriverSettings, ObservabilitySettings, SecuritySettings, ControlPlaneSettings, and StorageSettings. Each submodel pulls values from environment variables prefixed with LAUNCHER_. Defaults are sensible but conservative: the API binds to 0.0.0.0:8080, rate limits default to 100/minute, observability exports JSON logs, and privacy requires at least one attester.
The settings module also includes helper constructors (load_settings()) and caching to ensure a single configuration object is reused across the process.
Configuration Hierarchy
LauncherSettings
├── APISettings
│ ├── host: str = "0.0.0.0"
│ ├── port: int = 8080
│ └── workers: int = 4
├── FeatureSettings
│ ├── enable_multi_provider_jobs: bool
│ ├── enable_revocation_watcher: bool
│ ├── enable_revocation_stop: bool
│ ├── enable_policy_engine: bool
│ ├── enable_artifact_encryption: bool
│ ├── enable_session_replay_protection: bool
│ └── enable_composite_jobs: bool
├── PrivacySettings
│ ├── attesters: List[str]
│ ├── key_broker: str
│ ├── split_key_threshold: int
│ ├── proof_verifier: str | None
│ └── revocation: RevocationSettings
├── DriverSettings
│ ├── backend: "kubernetes" | "simulation"
│ ├── namespace: str
│ └── service_account: str
├── ObservabilitySettings
│ ├── log_format: str = "json"
│ ├── traces_enabled: bool
│ └── metrics_enabled: bool
├── SecuritySettings
│ ├── rate_limit: str = "100/minute"
│ ├── admin_token: str
│ └── trusted_origins: List[str]
├── ControlPlaneSettings
│ ├── enabled: bool
│ ├── base_url: str
│ ├── api_key: str
│ └── timeout_seconds: float
└── StorageSettings
├── artifact_path: str
├── encryption_key: str
└── s3_bucket: str | None
The API application entrypoint invokes load_settings(), storing the resulting object in FastAPI's state. Middlewares, routers, and background tasks receive the same object via dependencies defined in launcher/api/dependencies.py. Workers reuse the same config by calling load_settings_cached when booting from launcher/worker/main.py.
FeatureSettings toggles govern major behaviours: enable_multi_provider_jobs, enable_revocation_watcher, enable_revocation_stop, enable_policy_engine, enable_artifact_encryption, enable_session_replay_protection, and enable_composite_jobs. Each flag is consulted at multiple call sites. Tests rely on these toggles to simulate different deployment profiles.
PrivacySettings
Describes attesters, key broker, optional split-key parameters, proof verifier module references, and revocation configuration.
key_broker: "static" | "aws_kms" | "vault_transit" | "split_key"
split_key_threshold: int = 2
split_key_endpoint: str | None
split_key_participants: List[str] | None
proof_verifier: str | None
revocation.enabled: bool
revocation.source_url: str | None
revocation.poll_interval_seconds: float = 60.0
DriverSettings
Specifies default driver backend, container registry overrides, and namespace names.
namespace: str
service_account: str
artifact_encryption: bool
ControlPlaneSettings
Exposes connection parameters for control plane integration.
base_url: str
api_key: str
timeout_seconds: float
reservation_timeout: float
ObservabilitySettings
Controls logging format, tracing, and metrics exposition.
traces_enabled: bool
otlp_endpoint: str | None
metrics_enabled: bool
SidecarSettings
Maps environment variables for provider-side runtime including attestation, rotation, and TLS parameters.
SIDECAR_CHALLENGE · SIDECAR_ROTATION_SECRET · SIDECAR_ROTATION_DUE_AT
SIDECAR_TLS_REQUIRED · SIDECAR_ARTIFACT_PATH · SIDECAR_STEP_SIGNAL_PATH
SIDECAR_ROTATION_GRACE_SECONDS · SIDECAR_ROTATION_RETRY_DELAY
PrivacySettings describe attesters (attesters list), key broker (key_broker string), optional split-key parameters (split_key_threshold, split_key_endpoint, split_key_participants), proof verifier module references (proof_verifier), and revocation configuration. The revocation section includes enabled, source_url, source_path, poll_interval_seconds, timeout_seconds, and TLS settings.
DriverSettings specify default driver backend ("kubernetes" or "simulation"), container registry overrides, namespace names, service account names, artifact encryption toggles, and file system locations for staging. These settings impact packager behaviour and driver manifests.
Configuration Propagation Flow
Environment Variables / Secrets Manager
│
├─────────────────────────────────────────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ launcher/config/ │ │ sidecar/config/ │
│ settings.py │ │ settings.py │
│ │ │ │
│ load_settings() ───────────────────────────────────────────▶│ SidecarSettings │
│ │ │ │ │ │
│ ├─▶ APISettings │ │ ├─▶ job_id │
│ ├─▶ PrivacySettings │ │ ├─▶ launcher_url │
│ ├─▶ DriverSettings │ │ ├─▶ rotation_secret │
│ └─▶ ... │ │ └─▶ tls_required │
└─────────┬───────────────┘ └─────────────────────────┘
│
├─▶ create_app() ──▶ FastAPI state, middlewares, routers
│
├─▶ LauncherService ──▶ driver registry, artifact storage
│
└─▶ JobProcessor ──▶ multi-provider strategies, control-plane hooks
┌─────────────────────────┐ ┌─────────────────────────┐
│ control_plane/config.py │ │ payments/stripe_service │
│ ServiceConfig │ │ /config.py │
│ │ │ │
│ ├─▶ database │ │ ├─▶ stripe_api_key │
│ ├─▶ oracle │ │ ├─▶ webhook_secret │
│ ├─▶ scheduler │ │ ├─▶ queue_url │
│ ├─▶ metering │ │ └─▶ ledger_db_url │
│ ├─▶ settlement │ │ │
│ ├─▶ chain │ └─────────────────────────┘
│ ├─▶ resilience │
│ └─▶ governance │
└─────────────────────────┘
control_plane/config.py's ServiceConfig aggregates numerous sub-configs: database, oracle, scheduler, metering, settlement, chain, resilience, regional, signing, governance, enterprise, optimizer, and attestation. Each sub-config defines typed fields with defaults and validation.
payments/stripe_service/config.py defines a Settings model with stripe_api_key, stripe_endpoint_secret, webhook_rate_limit, queue_url, queue_batch_size, queue_wait_seconds, region_name, ledger_database_url, currency, enable_queue, and log_level.
API → Worker → Driver Configuration Chain
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ CONFIGURATION PROPAGATION │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Environment Application Runtime Components
Variables Startup (Injected Settings)
│ │ │
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────────────────────┐
│ LAUNCHER_* │─────────▶│ load_settings│─────────────▶│ FastAPI State │
│ PRIVACY_* │ │ () │ │ ├─ app.state.settings │
│ DRIVER_* │ │ │ │ └─ Dependency injection │
│ CONTROL_* │ │ Pydantic │ │ │
└──────────────┘ │ validation │ │ LauncherService │
└──────┬───────┘ │ ├─ adapter_options │
│ │ ├─ policy_engine │
│ │ └─ quota_enforcer │
│ │ │
│ │ PrivacyGate │
▼ │ ├─ attesters[] │
┌──────────────┐ ┌──────────────┐ │ ├─ key_broker │
│ AWS Secrets │─────────▶│ Secret │ │ └─ revocation_registry │
│ Manager │ │ Resolution │ │ │
│ /ssm/params │ │ │ │ JobProcessor (Worker) │
└──────────────┘ └──────┬───────┘ │ ├─ driver_backend │
│ │ ├─ telemetry_aggregator │
│ │ └─ placement_strategy │
▼ │ │
┌──────────────┐ ┌──────────────┐ │ KubernetesDriver │
│ Config File │─────────▶│ File Loader │ │ ├─ namespace │
│ launcher.yaml│ │ (optional) │ │ ├─ service_account │
└──────────────┘ └──────────────┘ │ └─ image_pull_secrets │
└──────────────────────────────────┘
Runtime Feature Evaluation
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ FEATURE TOGGLE EVALUATION │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ Job Submission │
│ (POST /v1/jobs) │
└──────────┬──────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────────────┐ ┌─────────────────────────────┐ ┌─────────────────────────────┐
│ enable_multi_provider_jobs? │ │ enable_revocation_watcher? │ │ enable_artifact_encryption? │
└──────────────┬──────────────┘ └──────────────┬──────────────┘ └──────────────┬──────────────┘
│ │ │
┌───────┴───────┐ ┌───────┴───────┐ ┌───────┴───────┐
│YES │NO │YES │NO │YES │NO
▼ ▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Multi-rank │ │ Single │ │ Start │ │ Skip │ │ Encrypt │ │ Plain │
│ placement │ │ provider │ │ watcher │ │ watcher │ │ artifacts │ │ artifacts │
│ strategy │ │ only │ │ background │ │ │ │ w/ DEK │ │ │
└─────────────┘ └─────────────┘ └──────┬──────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────────────┐
│ enable_revocation_ │
│ stop? │
└──────────┬──────────┘
│
┌───────┴───────┐
│YES │NO
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Stop jobs │ │ Cancel only │
│ on revoke │ │ (no pod │
│ (pod delete)│ │ delete) │
└─────────────┘ └─────────────┘
from pydantic import BaseSettings

class PrivacyRevocationSettings(BaseSettings):
    enabled: bool = False
    source_url: str | None = None
    source_path: str | None = None
    poll_interval_seconds: float = 60.0
    timeout_seconds: float = 10.0

class PrivacySettings(BaseSettings):
    attesters: list[str]
    key_broker: str = "static"
    split_key_enabled: bool = False
    split_key_threshold: int = 2
    split_key_endpoint: str | None = None
    proof_verifier: str | None = None
    revocation: PrivacyRevocationSettings = PrivacyRevocationSettings()

class LauncherSettings(BaseSettings):
    api: APISettings = APISettings()
    features: FeatureSettings = FeatureSettings()
    privacy: PrivacySettings
    driver: DriverSettings = DriverSettings()
    observability: ObservabilitySettings = ObservabilitySettings()
    security: SecuritySettings = SecuritySettings()
    control_plane: ControlPlaneSettings = ControlPlaneSettings()
    storage: StorageSettings = StorageSettings()

# Environment variables populate fields
settings = LauncherSettings()
| Configuration Block | Key File | Purpose |
|---|---|---|
| LauncherSettings | launcher/config/settings.py | Aggregates all launcher configuration |
| SidecarSettings | sidecar/config/settings.py | Provider-side runtime configuration |
| ServiceConfig | control_plane/config.py | Control plane services configuration |
| Settings | payments/stripe_service/config.py | Payments and Stripe integration |
Overture: System Design Principles
Directory semantics, dependency flows, persistence layers, and the high-level control cycle that orchestrates jobs from SDK submission to on-chain settlement.
The repository at GPU-LAYER-DECENTRALISED resembles a densely populated city whose districts map directly to production responsibilities. vracu-launcher/ houses application facing APIs, worker orchestration, adapters, privacy layers, packagers, and placement planners. sidecar/ contains the runtime that executes on provider hosts, including attestation producers, rotation loops, TLS handlers, and configuration loaders. control_plane/ embeds the economic and scheduling core. payments/ owns Stripe ingestion, ledger persistence, SQS queues, and Arbitrum wallet integration.
contracts/ stores Solidity code (notably ConversionRouter.sol) plus supporting ABIs and Foundry configurations. phase4_sdk/ provides client libraries and Typer CLI wrappers. The alien-* families cover provider onboarding, resilience, and observability control planes. Operational scripts live under deployment/, ops/, and infra/. Validation evidence resides in numerous Markdown and PDF files, ranging from system overviews to production evidence reports. Each directory ships its own __init__.py or configuration files; the structure is not incidental but the result of iterative deployments.
Repository Structure Overview
GPU-LAYER/
├── vracu-launcher/
│ ├── launcher/api/ ← FastAPI routes, dependencies, middleware
│ ├── launcher/config/ ← Pydantic settings, secrets wiring
│ ├── launcher/adapters/ ← Training, inference, quantization, rendering, composite
│ ├── launcher/worker/ ← JobProcessor, queue consumers, multi-provider logic
│ ├── launcher/privacy/ ← Attesters, key brokers, revocation, proof plugins
│ ├── launcher/placement/ ← Strategy compilation, rank commitments, planners
│ ├── launcher/packager/ ← Artifact materialisation, encryption utilities
│ └── launcher/observability/ ← Metrics, SLOs, telemetry aggregator
├── sidecar/
│ ├── runtime/ ← Attestation handshake, rotation loops, TLS writers
│ └── attestation/ ← Evidence producers for NVIDIA CC-On, TDX, SNP
├── control_plane/ ← API server, services, oracle, scheduler, governance
├── payments/ ← Ledger DAO, Stripe routers, SQS queue workers, ACU wallet
├── contracts/ ← ConversionRouter.sol, ABIs, Foundry config
├── phase4_sdk/ ← ControlPlaneClient, Typer CLI, configuration helpers
├── alien-* ← Directory API, provider node, observability, resilience
├── provider_agent/ ← MIG manager, diagnostics collectors
├── deployment/, ops/, infra/ ← Helm charts, Terraform, Kyverno policies, scripts
└── docs/, PDFs, Markdown reports ← Architecture briefs, evidence, governance policies
The directory structure illustrates disciplined engineering where source, infrastructure, operations, and validation co-reside, ready for continuous inspection.
Dependencies flow from outer surfaces to inner utilities. FastAPI entry points in launcher/api/app.py import Pydantic settings from launcher/config/settings.py, which rely on helper modules in launcher/utils. Workers leverage adapters, strategies, and placement logic. Privacy components depend on cryptographic helpers and attester definitions. Control-plane clients mirror SDK structures ensuring the launcher and external clients use the same typed requests.
Dependency Flow Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL CLIENTS │
│ phase4_sdk/client.py ←→ Typer CLI │
└────────────────────────────────┬────────────────────────────────────────────┘
│ HTTP/JSON
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAUNCHER API LAYER │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ launcher/api/ │───▶│ launcher/core/ │───▶│ launcher/utils/ │ │
│ │ app.py │ │ service.py │ │ coerce.py │ │
│ │ routes/ │ │ │ │ retry.py │ │
│ └────────┬────────┘ └────────┬────────┘ └─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ launcher/ │ │ launcher/ │ │
│ │ adapters/ │ │ privacy/ │ │
│ │ loader.py │ │ service.py │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER / DRIVER LAYER │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ launcher/worker │───▶│ launcher/driver │───▶│ sidecar/runtime │ │
│ │ processor.py │ │ kubernetes.py │ │ main.py │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE / PAYMENTS │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ control_plane/ │◀──▶│ payments/ │◀──▶│ contracts/ │ │
│ │ api_server.py │ │ ledger/ │ │ Router.sol │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Persistence is handled primarily via SQLAlchemy models in launcher/db/models.py, payments/ledger/models.py, and alien-directory-api/directory_api/db. SQLite databases are committed as artifacts to prove real test runs. Migrations exist for directory API (Alembic) and ledger (SQLAlchemy's metadata declarations). Control plane optionally uses Postgres, while ledger DAO can operate on Postgres or SQLite.
Job Persistence
SQLAlchemy models store job specs, status transitions, privacy handshakes, and telemetry snapshots.
Tables: jobs, privacy_sessions, artifacts
Backend: SQLite / Postgres
Ledger Persistence
Payment credits, provider payouts, and Stripe event records with full audit trail.
Tables: payment_credits, payouts
Backend: SQLite / Postgres
Provider Registry
Provider metadata, join tokens, heartbeats, and capability manifests.
Tables: providers, tokens, heartbeats
Migrations: Alembic
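A minimal sketch of what a job row could look like in SQLAlchemy terms; the column set and model name are illustrative assumptions, not the real launcher/db/models.py declaration:

from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, JSON, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class JobSketch(Base):
    """Illustrative persistence model for a submitted job."""
    __tablename__ = "jobs"
    id = Column(String, primary_key=True)
    status = Column(String, default="PENDING")
    spec = Column(JSON, default=dict)        # adapter spec, profile, plan
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

# SQLite works for local runs; Postgres for production deployments.
engine = create_engine("sqlite:///jobs.db")
Base.metadata.create_all(engine)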
Clients use phase4_sdk/client.py or CLI equivalents to submit workloads. The SDK constructs HTTP requests with JSON payloads, attaches X-API-Key, and handles retries. On the launcher side, launcher/api/routes/jobs.py exposes POST /v1/jobs, which validates request bodies against Pydantic models, chooses the appropriate adapter via launcher/adapters/loader.py, persists job specs, and enqueues work onto asynchronous queues.
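The shape of that round trip, sketched with the standard library; the path and X-API-Key header come from the description above, while the helper name and payload fields are assumptions rather than the real SDK surface:

import json
import urllib.request

def submit_job(base_url: str, api_key: str, payload: dict) -> dict:
    """Hypothetical client helper mirroring what phase4_sdk/client.py does over HTTP."""
    req = urllib.request.Request(
        f"{base_url}/v1/jobs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    job = submit_job("http://localhost:8080", "dev-key",
                     {"adapter": "training",
                      "spec": {"command": ["torchrun", "train.py"], "num_gpus": 4, "min_vram_gb": 48}})
    print(job)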
JobProcessor handles exceptions deliberately. A failed privacy check raises PrivacyViolation, which is recorded via structured logs and metrics. Placement mismatches log a PlacementError, while control-plane reservation failures mark jobs as ABORTED. Because the exception architecture is consistent, the same types appear across API responses, worker logs, and observability dashboards.
Exception Handling Architecture
JobProcessor.process()
│
├──▶ _process_privacy()
│ │
│ ├── PrivacyViolation ──▶ job.status = FAILED
│ │ record_attestation_failure()
│ │
│ └── Success ──▶ continue
│
├──▶ _apply_placement()
│ │
│ ├── PlacementError ──▶ log warning
│ │ attempt fallback
│ │
│ └── Success ──▶ continue
│
├──▶ _reserve_capacity()
│ │
│ ├── ControlPlaneError ──▶ job.status = ABORTED
│ │ reason = "reservation_failed"
│ │
│ └── Success ──▶ continue
│
└──▶ driver.launch()
│
├── DriverError ──▶ job.status = FAILED
│ driver.cleanup()
│
└── Success ──▶ job.status = RUNNING
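A minimal sketch of that exception-to-status mapping. The phase helpers and exception classes below are stand-ins for the launcher's own modules, not the actual worker implementation.

# Hypothetical exception types mirroring those referenced above.
class PrivacyViolation(Exception): pass
class PlacementError(Exception): pass
class ControlPlaneError(Exception): pass
class DriverError(Exception): pass

# Stand-in phase helpers; the real worker wires these to privacy, placement,
# control-plane, and driver components.
def verify_privacy(job): pass
def apply_placement(job): pass
def apply_fallback_placement(job): return job
def reserve_capacity(job): pass
def launch(job): pass
def cleanup(job): pass

def process(job: dict) -> dict:
    """Map per-phase exceptions onto the job statuses shown in the diagram."""
    try:
        verify_privacy(job)
    except PrivacyViolation:
        return {**job, "status": "FAILED", "reason": "attestation_failure"}
    try:
        apply_placement(job)
    except PlacementError:
        job = apply_fallback_placement(job)  # warn and fall back instead of failing
    try:
        reserve_capacity(job)
    except ControlPlaneError:
        return {**job, "status": "ABORTED", "reason": "reservation_failed"}
    try:
        launch(job)
    except DriverError:
        cleanup(job)
        return {**job, "status": "FAILED", "reason": "driver_error"}
    return {**job, "status": "RUNNING"}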
Once a job passes privacy, the launcher may publish demand to the control plane. The control plane API server authenticates via X-API-Key, parses JSON into typed dataclasses (ReservationRequest, DemandConfig), and interacts with ControlPlaneContext. The context references PriceIndexOracleService, VRACUScheduler, MeteringService, ResilienceGuards, and optionally ArbitrumContracts.
Providers register with the directory API through join tokens issued by operators or the control plane. Onboarding includes publishing provider metadata, hardware capabilities, and heartbeats. When a job is assigned, the sidecar runtime fetches its configuration, contacts the launcher to submit attestation, and retrieves rotation secrets and TLS artifacts. The sidecar's rotation loop ensures DEKs are refreshed before expiry.
Provider Onboarding
┌─────────────┐
│ Join Token │
│ Issued │
└──────┬──────┘
▼
┌─────────────┐
│ Provider │
│ Register │
└──────┬──────┘
▼
┌─────────────┐
│ Heartbeat │
│ Loop │
└──────┬──────┘
▼
┌─────────────┐
│ Active │
│ Provider │
└─────────────┘
Sidecar Lifecycle
┌─────────────┐
│ Config │
│ Loaded │
└──────┬──────┘
▼
┌─────────────┐
│ Attestation │
│ Handshake │
└──────┬──────┘
▼
┌─────────────┐
│ Rotation │
│ Loop │
└──────┬──────┘
▼
┌─────────────┐
│ Workload │
│ Execute │
└─────────────┘
Observability is multi-pronged. launcher/observability/slo.py exports Prometheus metrics that feed dashboards. payments/stripe_service/metrics.py tracks queue lengths, webhook failures, and ledger states. control_plane/metrics.py monitors HTTP requests, queue depths, and scheduler backlogs. alien-observability offers provider-side exporters.
FUNCTION lifecycle(job_id, provider_endpoint):
    WITH db_session() AS session:
        job = session.get(Job, job_id)
        spec = dict(job.spec or {})
        profile = ResourceProfile(**spec["profile"])
        plan = ExecutionPlan(**spec["plan"])

        IF strategy:
            compiled = strategy.compile(profile, plan)
            spec["strategy"] = strategy_payload(compiled)

        IF settings.features.enable_multi_provider_jobs AND placement:
            placement_map = placement.place(provider_endpoint.capabilities, compiled)
            spec["placement"] = serialise(placement_map)

        record_status(job, JobStatus.PROFILING)
        record_status(job, JobStatus.ALLOCATING)

        IF control_plane_client AND spec.policy:
            publish_demand(job_id, spec.policy, offers)
            reservation = reserve_capacity(job_id, provider_endpoint, profile, spec.policy, offers)
            IF reservation IS False:
                abort(job, JobStatus.ABORTED, reason="reservation_failed")
                RETURN
            ELSE IF reservation IS dict:
                spec.setdefault("control_plane", {}).setdefault("reservation", {}).update(reservation)

        abort_reason = process_privacy(session, job, spec)
        IF abort_reason:
            RETURN

        apply_rank_attestation(job, provider_endpoint, spec, profile, plan, compiled)
        materialize_artifacts(job_id, plan)
        launch_result = driver.launch(job_id, LaunchSpec(profile, plan, distribution), provider_endpoint)
        record_status(job, launch_result.status)

        job.spec = spec
        flag_modified(job, "spec")
        session.add(job)

        IF reservation:
            complete_reservation(reservation)

        RETURN {"job_id": job_id, "status": launch_result.status}
Telemetry flows through multiple channels: Prometheus metrics via launcher/observability/slo.py, structured logs via the standard logging module, tracing via OpenTelemetry, and ledger reports via payments/stripe_service/metrics.py. Evidence artifacts prove that integration tests, staging runs, and production cutovers were executed and recorded.
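To make the Prometheus channel concrete, here is a sketch of counter and histogram registration with prometheus_client; the metric names are illustrative assumptions, not the ones exported by launcher/observability/slo.py.

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; the real names live in the observability modules.
JOBS_SUBMITTED = Counter("launcher_jobs_submitted_total", "Jobs accepted by POST /v1/jobs")
ATTESTATION_FAILURES = Counter("launcher_attestation_failures_total", "Privacy handshakes rejected")
LAUNCH_LATENCY = Histogram("launcher_job_launch_seconds", "Time from submission to driver.launch")

def record_submission(duration_seconds: float) -> None:
    JOBS_SUBMITTED.inc()
    LAUNCH_LATENCY.observe(duration_seconds)

if __name__ == "__main__":
    start_http_server(9102)   # expose /metrics for Prometheus scraping
    record_submission(12.5)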
Consider a developer using the SDK: they configure adapters, invoke phase4_sdk commands, and verify deployments. Following this path leads through launcher/api/app.py, the test suites, the sidecar runtime, the control plane, ledger updates, and on-chain event verification. This scenario underscores how repository artifacts, documentation, and code interlock to provide a reproducible journey.
End-to-End Job Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ SDK / CLI │
│ phase4 sdk submit-job │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAUNCHER API │
│ POST /v1/jobs ──▶ validate ──▶ adapter.prepare() ──▶ persist ──▶ enqueue │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ JOB PROCESSOR │
│ dequeue ──▶ privacy ──▶ placement ──▶ control_plane ──▶ driver.launch() │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SIDECAR RUNTIME │
│ attest ──▶ handshake ──▶ rotate_keys ──▶ execute_workload ──▶ complete │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE / PAYMENTS │
│ metering ──▶ settlement ──▶ ledger_update ──▶ on_chain_mint │
└─────────────────────────────────────────────────────────────────────────────┘
Beyond the main services, the repository houses CLI tools for local demos, orchestrated end-to-end runs, staging validation, and production verification. Scripts under tools/ and scripts/ manage migrations, benchmarking, and provider onboarding. This tooling ensures that engineers can reproduce complex scenarios with a single command.
Documentation spans Markdown, HTML, and PDF. Architecture diagrams, UI flows, security controls, and domain-specific analysis files demonstrate analytical depth. Many Markdown files follow naming patterns signifying milestones. This corpus bridges the gap between code and operational proof.
| Component | Key Path | Responsibility |
|---|---|---|
| Launcher API | launcher/api/ | FastAPI routes, job submission, attestation endpoints |
| Privacy Gate | launcher/privacy/ | Attesters, key brokers, revocation, proofs |
| Adapters | launcher/adapters/ | Training, inference, quantization, rendering, composite |
| Control Plane | control_plane/ | Oracle, scheduler, metering, governance |
| Payments | payments/ | Ledger DAO, Stripe webhooks, ACU wallet |
| Contracts | contracts/ | ConversionRouter.sol, ABIs, Foundry config |
| Sidecar | sidecar/ | Runtime, attestation, rotation, TLS |
| SDK | phase4_sdk/ | ControlPlaneClient, Typer CLI |
Launcher Orchestration
FastAPI wiring, job submission flows, worker processing, placement strategies, and the complete lifecycle from request to execution.
Launcher Orchestration
create_app constructs a FastAPI instance titled "VR-ACU Launcher". It loads LauncherSettings, invokes configure_observability to register logging and tracing middleware, instantiates SlowAPI's Limiter with settings.security.rate_limit, attaches rate-limit exception handlers, and registers startup/shutdown callbacks that start and stop a RevocationWatcher when privacy revocation is enabled.
During application creation the launcher builds privacy components via build_privacy_components(settings.privacy). Successful initialisation yields a PrivacyGate, optionally coupled with a RevocationRegistry and SessionInvalidationPipeline. Failures raise PrivacyInitializationError when confidential compute is required; otherwise the app logs a warning.
LauncherService initialises adapters using load_adapter, applies adapter-specific options from configuration, constructs BasicProfiler, DefaultPolicy, QuotaEnforcer, and PolicyEngine, and ensures database schema exists. It also stores adapter privacy templates to enrich job specs with expected attestation requirements.
FastAPI Application Wiring
create_app()
│
├──▶ load_settings() ──▶ LauncherSettings
│
├──▶ configure_observability()
│ ├── logging formatters
│ ├── OTLP exporters
│ └── Prometheus metrics
│
├──▶ SlowAPI Limiter(settings.security.rate_limit)
│
├──▶ build_privacy_components(settings.privacy)
│ ├── PrivacyGate
│ ├── RevocationRegistry
│ └── SessionInvalidationPipeline
│
├──▶ LauncherService
│ ├── load_adapter() for each adapter
│ ├── BasicProfiler
│ ├── DefaultPolicy
│ ├── QuotaEnforcer
│ └── PolicyEngine
│
├──▶ Register routers
│ ├── /v1/jobs
│ ├── /v1/jobs/{job_id}/attestation
│ ├── /v1/jobs/{job_id}/dek
│ ├── /v1/jobs/{job_id}/rotation
│ └── /v1/artifacts/verify
│
└──▶ Startup/Shutdown callbacks
├── RevocationWatcher.start()
└── RevocationWatcher.stop()
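A compressed sketch of that wiring, assuming FastAPI plus SlowAPI (with its middleware so default limits apply); the rate-limit string, health route, and omitted privacy/router registration are placeholders for the real launcher modules.

from fastapi import FastAPI
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

def create_app(rate_limit: str = "100/minute") -> FastAPI:
    app = FastAPI(title="VR-ACU Launcher")

    # Rate limiting: register the limiter, its exception handler, and the middleware.
    limiter = Limiter(key_func=get_remote_address, default_limits=[rate_limit])
    app.state.limiter = limiter
    app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
    app.add_middleware(SlowAPIMiddleware)

    # Placeholder for build_privacy_components(settings.privacy) and the /v1 routers.
    @app.get("/healthz")
    def healthz() -> dict:
        return {"status": "ok"}

    return app

app = create_app()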
POST /v1/jobs parses JobSubmitRequest, logs payloads, and calls service.submit_job. The service evaluates policy (PolicyEngine.evaluate), resolves adapter, prepares ResourceProfile and ExecutionPlan, profiles the workload, computes policy constraints, shapes offers, validates quotas (QuotaEnforcer.validate_submission), and persists a Job model with status PENDING.
After submission the API queues background work using enqueue_background_job(job_id) backed by Redis/RQ. Worker processes launched via launcher/worker/main.py consume these jobs. They instantiate JobProcessor, configure drivers, telemetry aggregators, placement planners, control plane clients, and privacy gate references.
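A sketch of the Redis/RQ hand-off with a hypothetical worker entry point; the queue name and dotted function path are assumptions standing in for the launcher's actual worker package.

from redis import Redis
from rq import Queue

redis_conn = Redis(host="localhost", port=6379)
job_queue = Queue("launcher-jobs", connection=redis_conn)   # queue name is an assumption

def enqueue_background_job(job_id: str) -> None:
    # The worker process resolves and runs this dotted path; adjust to the real module.
    job_queue.enqueue("launcher.worker.processor.process", job_id, job_timeout=3600)

enqueue_background_job("job-42")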
Job Lifecycle State Machine
┌─────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌─────────┐│
│ PENDING │───▶│PROFILING │───▶│ALLOCATING │───▶│LAUNCHING │───▶│ RUNNING ││
└─────────┘ └──────────┘ └───────────┘ └──────────┘ └────┬────┘│
│ │ │ │ │ │
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ │
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐│
│ ABORTED │ │ FAILED │ │ ABORTED │ │ FAILED │ │COMPLETED││
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘│
│ │
└──────┘
(revocation)
State Transitions:
PENDING ──▶ PROFILING : worker picks up job
PROFILING ──▶ ALLOCATING : profile computed successfully
ALLOCATING ──▶ LAUNCHING : control plane reservation acquired
LAUNCHING ──▶ RUNNING : driver.launch() succeeded
RUNNING ──▶ COMPLETED : workload finished successfully
RUNNING ──▶ ABORTED : revocation triggered
* ──▶ FAILED : exception during processing
* ──▶ ABORTED : reservation failed / policy denied
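The transitions above can be captured in a small table-driven guard. The enum values follow the state machine, while the helper name and transition table are illustrative rather than the launcher's own code.

from enum import Enum

class JobStatus(str, Enum):
    PENDING = "PENDING"
    PROFILING = "PROFILING"
    ALLOCATING = "ALLOCATING"
    LAUNCHING = "LAUNCHING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"

# Forward edges from the diagram; FAILED/ABORTED are reachable from any
# non-terminal state, mirroring the "*" transitions above.
_FORWARD = {
    JobStatus.PENDING: {JobStatus.PROFILING},
    JobStatus.PROFILING: {JobStatus.ALLOCATING},
    JobStatus.ALLOCATING: {JobStatus.LAUNCHING},
    JobStatus.LAUNCHING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.ABORTED},
}
_TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.ABORTED}

def can_transition(current: JobStatus, new: JobStatus) -> bool:
    if current in _TERMINAL:
        return False
    if new in (JobStatus.FAILED, JobStatus.ABORTED):
        return True
    return new in _FORWARD.get(current, set())

assert can_transition(JobStatus.LAUNCHING, JobStatus.RUNNING)
assert not can_transition(JobStatus.COMPLETED, JobStatus.RUNNING)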
JobProcessor.process(job_id) retrieves the job, extracts profile and plan, compiles strategies, optionally applies placement, coordinates privacy, materialises artifacts, and launches drivers. It publishes events to EventSink, updates job statuses throughout (PROFILING, ALLOCATING, LAUNCHING, RUNNING), interacts with control-plane reservations, and records metrics.
Job Routes
Core job management endpoints for submission, inspection, and lifecycle control.
GET /v1/jobs
GET /v1/jobs/{job_id}
GET /v1/jobs/{job_id}/status
GET /v1/jobs/{job_id}/logs
DELETE /v1/jobs/{job_id}
Attestation
Endpoints for privacy handshake, key rotation, and session management.
POST /v1/jobs/{id}/dek
POST /v1/jobs/{id}/rotation
GET /v1/jobs/{id}/privacy
Verification
Merkle proof verification and artifact streaming endpoints.
GET /v1/jobs/{id}/artifacts
GET /v1/jobs/{id}/artifacts/{aid}
JobProcessor integrates MultiProviderPlacementPlanner when multi-provider jobs are enabled. Strategy compilation uses adapter-specific logic to determine rank layouts. launcher/placement/rank_attestation.py commits environment variables and annotations that providers later echo, enabling the launcher to verify strategies were executed as planned.
ASYNC FUNCTION process(job_id):
    WITH db_session() AS session:
        job = session.get(Job, job_id)
        spec = dict(job.spec or {})

        # Extract profile and plan from spec
        profile = ResourceProfile(**spec["profile"])
        plan = ExecutionPlan(**spec["plan"])

        # Compile strategy if multi-provider enabled
        IF settings.features.enable_multi_provider_jobs:
            compiled = strategy.compile(profile, plan)
            placement_map = placement_planner.place(provider.capabilities, compiled)
            spec["placement"] = serialise(placement_map)

        # Update status and process privacy
        record_status(job, JobStatus.PROFILING)
        record_status(job, JobStatus.ALLOCATING)

        # Control plane reservation
        IF control_plane_client:
            reservation = AWAIT reserve_capacity(job_id, profile)
            IF NOT reservation:
                abort(job, reason="reservation_failed")
                RETURN

        # Privacy handshake
        abort_reason = AWAIT process_privacy(session, job, spec)
        IF abort_reason:
            RETURN

        # Materialize artifacts and launch
        AWAIT materialize_artifacts(job_id, plan)
        launch_result = AWAIT driver.launch(job_id, LaunchSpec(profile, plan))
        record_status(job, launch_result.status)

        RETURN {"job_id": job_id, "status": launch_result.status}
Sidecar Runtime
Provider-side execution runtime handling attestation handshakes, DEK rotation, TLS certificate management, and secure workload execution.
Sidecar Runtime
The sidecar runtime executes on provider hosts alongside GPU workloads. It orchestrates attestation, key rotation, TLS handling, artifact downloads, and workload execution. It interacts with the launcher via HTTP endpoints (/attestation, /dek, /rotation) and uses configuration from SidecarSettings. Logging is handled through structured context in provider logs.
SidecarSettings supplies job ID, provider ID, launcher URLs, bearer token, attestation challenge, rotation deadlines, proof URI, TLS requirements, artifact path, step signal path, and optional command overrides. The runtime stores settings, initialises AttestationProvider, builds a LauncherClient with HTTPX, and prepares asynchronous tasks (rotation loop, step watcher).
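A dataclass sketch of that settings surface; field names approximate the prose above and should not be read as the actual SidecarSettings schema, which may be Pydantic-based and carry more fields.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SidecarSettings:
    """Illustrative settings bundle; real field names may differ."""
    job_id: str
    provider_id: str
    launcher_url: str
    bearer_token: str
    attestation_challenge: str
    rotation_grace_seconds: int = 30
    proof_uri: Optional[str] = None
    require_tls: bool = True
    artifact_path: str = "/artifacts"
    step_signal_path: Optional[str] = None
    command_override: List[str] = field(default_factory=list)

settings = SidecarSettings(
    job_id="job-42",
    provider_id="prov-123",
    launcher_url="https://launcher.internal",
    bearer_token="***",
    attestation_challenge="nonce-abc",
)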
Sidecar Control Flow
SidecarRuntime.run()
│
├──▶ _perform_handshake()
│ │
│ ├──▶ AttestationProvider.produce(challenge)
│ │ ├── NVIDIA CC-On evidence
│ │ ├── Intel TDX quote
│ │ └── AMD SEV-SNP report
│ │
│ ├──▶ LauncherClient.attest(evidence)
│ │ └── POST /v1/jobs/{job_id}/attestation
│ │
│ ├──▶ _write_certificate_bundle(tls_certs)
│ │ ├── ca.pem
│ │ ├── cert.pem
│ │ └── key.pem
│ │
│ ├──▶ _write_dek_file(dek_bytes)
│ │
│ └──▶ _determine_command()
│
├──▶ Start rotation loop (if rotation_due_at provided)
│ └── asyncio.create_task(_rotation_loop())
│
├──▶ Start step watcher loop (if step_signal_path configured)
│ └── asyncio.create_task(_step_watcher_loop())
│
├──▶ _execute(command)
│ ├── asyncio.create_subprocess_exec()
│ ├── stream stdout/stderr
│ └── monitor return code
│
└──▶ stop()
├── cancel rotation task
├── cancel step watcher task
├── _zeroize_dek_file()
└── cleanup resources
AttestationProvider.produce(challenge) assembles evidence for configured attesters (NVIDIA CC-On, TDX, SEV-SNP). It may gather SPDM certificate chains, GPU reports, quote structures, and certificates depending on hardware. Errors raise RuntimeError, preventing handshake from proceeding with stale evidence.
The runtime acquires an async lock to ensure single handshake execution. It collects evidence, posts attestation, verifies responses include rotation_secret, attestation_hash, optional TLS bundle, and optional workload overrides. It writes TLS certificates, persists rotation secret, stores attestation hash, calculates rotation due timestamps, and optionally downloads artifacts.
ASYNC FUNCTION _rotation_loop():
    WHILE NOT stop_event.is_set():
        due_at = session.rotation_due_epoch
        IF NOT due_at:
            AWAIT sleep(default_interval)
            CONTINUE

        # Calculate sleep duration with grace period
        now = current_time()
        delay = max(0, due_at - now - settings.rotation_grace_seconds)
        AWAIT sleep(delay)

        TRY:
            AWAIT _rotate_key(reason="timer")
        EXCEPT HTTPError AS exc:
            logger.warning("rotation failed", error=str(exc))
            AWAIT sleep(settings.rotation_retry_delay)
        ELSE:
            logger.info("rotation completed", step=session.last_step)

ASYNC FUNCTION _rotate_key(reason):
    # Increment step counter
    step = session.last_step + 1

    # Build request payload
    payload = {
        "session_token": session.token,
        "step": step,
        "proof_uri": settings.proof_uri,
        "attestation_hash": session.attestation_hash
    }

    # Request new DEK from launcher
    response = AWAIT launcher_client.request_dek(payload)

    # Decode and write new key
    dek_bytes = base64_decode(response["dek_b64"])
    _write_dek_file(dek_bytes)

    # Update session state
    session.last_step = step
    session.rotation_secret = response.get("rotation_secret")
    session.rotation_due_at = response.get("due_at")

    # Acknowledge rotation
    AWAIT launcher_client.acknowledge_rotation(step)
Handshake Flow
Sidecar Launcher
│ │
│──▶ POST /attest ──▶│
│ {evidence} │
│ │
│◀── response ◀──────│
│ {dek, certs, │
│ rotation_due} │
│ │
│──▶ write certs ────│
│──▶ write dek ──────│
│ │
Rotation Flow
Sidecar Launcher
│ │
│──▶ POST /dek ─────▶│
│ {step, token, │
│ proof_uri} │
│ │
│◀── response ◀──────│
│ {new_dek, │
│ new_due_at} │
│ │
│──▶ POST /rotation ▶│
│ {ack, step} │
│ │
_write_certificate_bundle writes CA, certificate, and key files to the artifact directory with secure permissions. _build_environment adds TLS_CA_PATH, TLS_CERT_PATH, TLS_KEY_PATH. When fingerprint validation is enabled, _load_certificate_fingerprint computes SHA-256 to send with rotation acknowledgements.
The runtime stores DEKs in a designated file, used by the workload to decrypt artifacts. After workloads finish or runtime shuts down, _zeroize_dek_file overwrites the file with zeros, flushes, closes, and unlinks it. This ensures no residual secrets remain on disk.
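A best-effort zeroize routine in that spirit is sketched below. Note that on journaling or copy-on-write filesystems an in-place overwrite is not a hard guarantee, so this is a defence-in-depth illustration rather than the sidecar's exact implementation.

import os
from pathlib import Path

def zeroize_dek_file(path: Path) -> None:
    """Overwrite the DEK file with zeros, flush to disk, then unlink it."""
    if not path.exists():
        return
    size = path.stat().st_size
    with open(path, "r+b") as handle:
        handle.write(b"\x00" * size)   # overwrite key material in place
        handle.flush()
        os.fsync(handle.fileno())      # force the zeros to stable storage
    path.unlink()                      # remove the file entry

zeroize_dek_file(Path("/artifacts/dek.bin"))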
Complete Sidecar Lifecycle
Provider Host Sidecar Runtime Launcher API Privacy Gate
│ │ │ │
│──start container────▶│ │ │
│ │ │ │
│ │══ HANDSHAKE PHASE ═════│═══════════════════════│
│ │ │ │
│ │──produce_evidence() │ │
│ │ ├─ NVIDIA CC-On │ │
│ │ ├─ Intel TDX quote │ │
│ │ └─ AMD SNP report │ │
│ │ │ │
│ │──POST /attestation────▶│ │
│ │ {evidences, job_id} │──authorize_job()─────▶│
│ │ │◀──PrivacyAuth────────│
│ │◀──200 {dek, certs, ───│ │
│ │ rotation_due} │ │
│ │ │ │
│ │──write_certs() │ │
│ │──write_dek() │ │
│ │ │ │
│ │══ EXECUTION PHASE ═════│═══════════════════════│
│ │ │ │
│ │──spawn_workload() │ │
│ │ ├─ env: TLS_*, DEK │ │
│◀─────GPU compute─────│ └─ subprocess │ │
│ │ │ │
│ │══ ROTATION LOOP ═══════│═══════════════════════│
│ │ │ │
│ │──[sleep until due_at] │ │
│ │──POST /rotation───────▶│ │
│ │ {step, token, proof} │──issue_session_dek()─▶│
│ │ │◀──new_dek─────────────│
│ │◀──200 {dek, next_due}──│ │
│ │──write_new_dek() │ │
│ │──zeroize_old_dek() │ │
│ │ [repeat...] │ │
│ │ │ │
│ │══ SHUTDOWN PHASE ══════│═══════════════════════│
│ │ │ │
│──workload complete──▶│ │ │
│ │──cancel_tasks() │ │
│ │──zeroize_all_keys() │ │
│ │──cleanup() │ │
│◀──exit 0─────────────│ │ │
Certificate Bundle Structure
/artifacts/tls/
├── ca.pem ← Platform CA
│ └── Issuer: Platform Root
│ └── Subject: Platform CA
│ └── Validity: 10 years
│
├── cert.pem ← Job Certificate
│ └── Issuer: Platform CA
│ └── Subject: job-{job_id}
│ └── Validity: job duration
│ └── Extensions:
│ └── subjectAltName:
│ └── DNS:*.job.internal
│
└── key.pem ← Private Key
└── Algorithm: ECDSA P-256
└── Permissions: 0600
└── Usage: TLS client auth
Certificate Flow
┌─────────────────────────────┐
│ LAUNCHER API │
│ ┌───────────────────────┐ │
│ │ generate_job_cert() │ │
│ │ ├─ load platform CA │ │
│ │ ├─ create CSR │ │
│ │ ├─ sign with CA key │ │
│ │ └─ bundle response │ │
│ └───────────────────────┘ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ SIDECAR RUNTIME │
│ ┌───────────────────────┐ │
│ │ _write_cert_bundle() │ │
│ │ ├─ mkdir -p /tls │ │
│ │ ├─ write ca.pem │ │
│ │ ├─ write cert.pem │ │
│ │ └─ write key.pem │ │
│ │ chmod 0600 │ │
│ └───────────────────────┘ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ WORKLOAD │
│ ┌───────────────────────┐ │
│ │ TLS_CA_PATH=/tls/ca │ │
│ │ TLS_CERT_PATH=/tls/.. │ │
│ │ TLS_KEY_PATH=/tls/.. │ │
│ │ │ │
│ │ mTLS connections to │ │
│ │ other ranks / storage │ │
│ └───────────────────────┘ │
└─────────────────────────────┘
Data Encryption Key Lifecycle
┌─────────────────────────────────────────────────────────┐
│ DEK STATE MACHINE │
└─────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐
│ INITIAL │ │ ZEROIZED │
│ (no DEK) │ │ (cleaned) │
└──────┬───────┘ └──────▲───────┘
│ │
│ handshake_complete shutdown OR error
│ │
▼ │
┌──────────────┐ rotation_due ┌──────────────┐ │
│ ACTIVE │──────────────────────────────▶│ ROTATING │ │
│ step = 0 │ │ step = N+1 │ │
│ dek = k₀ │◀──────────────────────────────│ new_dek │ │
└──────┬───────┘ rotation_ack └──────┬───────┘ │
│ │ │
│ │ │
│ ┌─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ ACTIVE │ │
│ │ step = N+1 │ │
│ │ dek = k_{N+1}│───────────────────────────────────────┘
│ └──────────────┘
│ │
│ │ rotation_due (repeat)
│ ▼
│ ┌ ─ ─ ─ ─ ─ ─ ┐
└────────────▶ ROTATING ─────▶ ...
└ ─ ─ ─ ─ ─ ─ ┘
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ State Transitions: │
│ INITIAL → ACTIVE: handshake returns first DEK (step=0) │
│ ACTIVE → ROTATING: rotation_due_at reached, request new DEK │
│ ROTATING → ACTIVE: new DEK received and written, step incremented │
│ ACTIVE → ZEROIZED: shutdown signal OR workload complete OR error │
│ │
│ Invariants: │
│ - Only one DEK active at a time │
│ - Steps monotonically increase │
│ - Old DEK overwritten before new DEK written (atomic swap) │
│ - Zeroization always occurs on exit │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Hardware → Evidence → Verification
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ ATTESTATION EVIDENCE CHAIN │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ HARDWARE LAYER │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ NVIDIA GPU │ │ Intel CPU │ │ AMD CPU │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │
│ │ │ CC-On Mode │ │ │ │ TDX Module │ │ │ │ SEV-SNP PSP │ │ │
│ │ │ SPDM Engine │ │ │ │ TD-VMCALL │ │ │ │ Guest Req │ │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │
│ └─────────┬─────────┘ └─────────┬─────────┘ └─────────┬─────────┘ │
└────────────┼─────────────────────────┼─────────────────────────┼────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ EVIDENCE PRODUCTION │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ NVIDIA Evidence │ │ TDX Quote │ │ SNP Report │ │
│ │ ├─ SPDM Cert Chain│ │ ├─ TD Report │ │ ├─ Guest Report │ │
│ │ ├─ GPU Report │ │ ├─ Signature │ │ ├─ Signature │ │
│ │ ├─ Measurements │ │ ├─ Cert Chain │ │ ├─ VCEK Cert │ │
│ │ └─ Nonce Binding │ │ └─ Challenge Hash │ │ └─ Challenge Hash │ │
│ └───────────────────┘ └───────────────────┘ └───────────────────┘ │
└────────────┬─────────────────────────┬─────────────────────────┬────────────────────────────┘
│ │ │
└─────────────────────────┼─────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ VERIFICATION LAYER (Launcher Attesters) │
│ │
│ ┌───────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ NvidiaCcOnAttester.verify() │ │
│ │ ├─ Validate SPDM certificate chain against NVIDIA root │ │
│ │ ├─ Verify GPU report signature │ │
│ │ ├─ Check measurements against known-good baseline │ │
│ │ ├─ Validate nonce matches challenge (job_id binding) │ │
│ │ └─ Extract claims: gpu_model, driver_version, cc_mode │ │
│ └───────────────────────────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ TdxAttester.verify() │ │
│ │ ├─ Verify quote signature against Intel attestation service │ │
│ │ ├─ Validate TD report body │ │
│ │ ├─ Check MRTD/MRCONFIGID measurements │ │
│ │ └─ Extract claims: mrenclave, mrsigner, tcb_level │ │
│ └───────────────────────────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ SnpAttester.verify() │ │
│ │ ├─ Validate VCEK certificate chain against AMD root │ │
│ │ ├─ Verify attestation report signature │ │
│ │ ├─ Check launch measurement │ │
│ │ └─ Extract claims: guest_svn, policy, platform_info │ │
│ └───────────────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Control Plane
Economic core with price index oracles, dominant resource fairness scheduling, metering slices, and governance timelocks.
Control Plane Services
ControlPlaneContext.from_config constructs services based on ServiceConfig. It initialises database connections, price index oracle, scheduler, metering service, settlement router, optional Arbitrum contracts, resilience guards, regional coordinator, health monitor, governance, key transparency, enterprise services, and fleet optimiser. Each component references its respective module.
ControlPlaneHTTPServer embeds context, API key, and metrics collector. RequestHandler enforces X-API-Key for all routes except /health. GET serves health data, metrics, governance proposals, key transparency roots, provider lists, policy snapshots, ledger pools, resilience status, regional status, fleet join tokens, and dashboards. POST handles demand configuration, supply submissions, bucket finalisation, reservations, metering acknowledgements, and governance proposals.
Control Plane Service Architecture
ControlPlaneContext.from_config(ServiceConfig)
│
├──▶ Database connections (Postgres / SQLite)
│
├──▶ PriceIndexOracleService
│ ├── record_supply_offers()
│ ├── configure_demand()
│ └── calculate_clearing_price()
│
├──▶ VRACUScheduler
│ ├── Dominant Resource Fairness
│ ├── Attained-service scoring
│ └── MIG slice support
│
├──▶ MeteringService
│ ├── record_slice()
│ ├── SHA-256 idempotency
│ └── duplicate detection
│
├──▶ SettlementRouter
│ ├── SCM-weighted TWAP
│ ├── Ceiling rounding to micro-ACU
│ └── Hold fraction for disputes
│
├──▶ ArbitrumContracts (optional)
│ ├── ConversionRouter interface
│ └── Web3 transaction signing
│
├──▶ ResilienceGuards
│ ├── Outstanding exposure limits
│ ├── Price floor enforcement
│ └── Credit utilisation monitoring
│
├──▶ RegionalCoordinator
│ ├── Heartbeat timeout handling
│ ├── Provisional receipts
│ └── Replay mode support
│
├──▶ GovernanceService
│ ├── Proposal lifecycle
│ ├── Timelock enforcement
│ └── Multi-role approvals
│
└──▶ KeyTransparencyLog
├── Merkle tree per week
├── Inclusion proofs
└── Provider key verification
PriceIndexOracleService records supply offers keyed by demand buckets. Demand configuration defines windows, reserve requirements, and price bounds. The oracle calculates clearing prices, returning BucketResult objects with allocations. Surge multipliers apply when utilization exceeds 95%, scaling linearly from 1.0x at 95% to 1.5x at 100%.
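Under those stated bounds the surge multiplier reduces to a clamped linear interpolation; this sketch encodes exactly the 1.0x-at-95% to 1.5x-at-100% rule quoted above, with the function name chosen for illustration.

def surge_multiplier(utilization: float) -> float:
    """Linear surge: 1.0x at or below 95% utilization, rising to 1.5x at 100%."""
    if utilization <= 0.95:
        return 1.0
    fraction = min((utilization - 0.95) / 0.05, 1.0)   # clamp beyond 100% utilization
    return 1.0 + 0.5 * fraction

assert surge_multiplier(0.90) == 1.0
assert abs(surge_multiplier(0.975) - 1.25) < 1e-9
assert surge_multiplier(1.00) == 1.5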
VRACUScheduler implements Dominant Resource Fairness with attained-service scoring. Each provider maintains a running total of attained service minutes; new allocations favor providers with lower historical utilization. NVIDIA MIG support extends the scheduler's capacity model, treating each MIG slice as an independent scheduling unit.
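A toy version of that attained-service selection, with a capacity check that applies equally to whole GPUs or MIG slices; the provider records and field names are hypothetical.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Provider:
    provider_id: str
    attained_minutes: float   # running total of service already delivered
    free_scm_minutes: float   # remaining capacity in this window

def allocate(providers: List[Provider], scm_minutes: float) -> Optional[Provider]:
    """Favor the provider with the least attained service that still has capacity."""
    for provider in sorted(providers, key=lambda p: p.attained_minutes):
        if provider.free_scm_minutes >= scm_minutes:
            provider.attained_minutes += scm_minutes
            provider.free_scm_minutes -= scm_minutes
            return provider
    return None   # insufficient capacity -> reservation declined upstream

fleet = [Provider("A", 120, 400), Provider("B", 80, 400), Provider("C", 200, 400)]
chosen = allocate(fleet, 100)
assert chosen is not None and chosen.provider_id == "B" and chosen.attained_minutes == 180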
MeteringService
Provides strict idempotency via SHA-256 content hashing. Each slice contains job ID, bucket ID, sequence number, SCM delta, and price index.
Duplicates: succeed silently
Conflicts: rejection with reason
SettlementRouter
Computes SCM-weighted TWAP across slices. Burn amounts use ceiling rounding to micro-ACU units.
Rounding: ceiling to micro-ACU
Hold fraction: 0.0-1.0 for disputes
DualSignatureService
Produces cryptographic attestations with Ed25519 primary and post-quantum secondary signatures.
Secondary: SHAKE-256 → Dilithium3
Covers: JCS-canonicalized JSON
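The MeteringService idempotency rule above can be sketched as a content hash over a canonical JSON encoding of the slice: an identical replay succeeds silently, while a slice that reuses a sequence number with different content is rejected. The in-memory store and field names below are illustrative.

import hashlib, json

_seen: dict = {}   # (job_id, bucket_id, sequence) -> content hash

def record_slice(slice_row: dict) -> str:
    key = (slice_row["job_id"], slice_row["bucket_id"], slice_row["sequence"])
    digest = hashlib.sha256(
        json.dumps(slice_row, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    previous = _seen.get(key)
    if previous is None:
        _seen[key] = digest
        return "recorded"
    if previous == digest:
        return "duplicate"   # identical replay succeeds silently
    raise ValueError("conflict: same sequence, different content")

s = {"job_id": "job-42", "bucket_id": "b-1", "sequence": 7, "scm_delta": 5, "price_index": 181}
assert record_slice(s) == "recorded"
assert record_slice(s) == "duplicate"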
Reservation Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESERVATION REQUEST │
│ {job_id, tenant_id, gpu_profile, scm_minutes, policy_metadata} │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESILIENCE GUARDS │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ IF outstanding_exposure > max_exposure: │ │
│ │ RETURN ReservationDeclined(reason="resilience_guard") │ │
│ │ IF price < price_floor: │ │
│ │ RETURN ReservationDeclined(reason="price_floor") │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ allocation = scheduler.allocate(request) │ │
│ │ IF allocation IS None: │ │
│ │ RETURN ReservationDeclined(reason="insufficient_capacity") │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERSISTENCE │
│ database.save_reservation(allocation) │
│ metrics.record_reservation(allocation) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESERVATION RESPONSE │
│ ReservationAccepted(reservation_id, provider_id, expiry) │
└─────────────────────────────────────────────────────────────────────────────┘
Price Discovery & Clearing
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ PRICE INDEX ORACLE PIPELINE │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ SUPPLY SIDE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Provider A │ │ Provider B │ │ Provider C │ │ Provider D │ │ Provider E │ │
│ │ H100 x 8 │ │ A100 x 4 │ │ H100 x 2 │ │ A100 x 8 │ │ H200 x 4 │ │
│ │ $2.50/SCM │ │ $1.80/SCM │ │ $2.60/SCM │ │ $1.75/SCM │ │ $3.00/SCM │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────┼────────────────┼────────────────┼────────────────┼──────────────┘
│ │ │ │ │
└────────────────┴────────────────┼────────────────┴────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ ORACLE SERVICE │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ record_supply_offers(): │ │
│ │ - Bucket by GPU profile + region │ │
│ │ - Sort by price ascending │ │
│ │ - Build supply curve (cumulative capacity) │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ calculate_clearing_price(bucket, demand_scm): │ │
│ │ │ │
│ │ Price │ │
│ │ ▲ │ │
│ │ │ ╱ Supply Curve │ │
│ │ │ ╱ │ │
│ │ P*├─────────────────────● ← Clearing Price │ │
│ │ │ ╱ │ │ │
│ │ │ ╱ │ │ │
│ │ │ ╱ │ │ │
│ │ │ ╱ │ │ │
│ │ └─────────────────────┼──────────────────────▶ Quantity (SCM) │ │
│ │ Q* (demand) │ │
│ │ │ │
│ │ IF utilization > 95%: apply surge_multiplier (1.0x → 1.5x) │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ BUCKET RESULT │
│ { bucket_id, clearing_price, allocations: [{provider, scm, price}], surge_applied } │
└───────────────────────────────────────────────────────────────────────────────────────────┘
Dominant Resource Fairness
┌─────────────────────────────┐
│ SCHEDULER STATE │
│ ───────────────────────── │
│ providers: [ │
│ {id: A, attained: 120}, │
│ {id: B, attained: 80}, │
│ {id: C, attained: 200}, │
│ ] │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ NEW ALLOCATION REQUEST │
│ {job_id, scm: 100} │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ SORT BY ATTAINED SERVICE │
│ (ascending) │
│ ───────────────────────── │
│ 1. Provider B (80 min) │
│ 2. Provider A (120 min) │
│ 3. Provider C (200 min) │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ SELECT FIRST WITH CAPACITY │
│ Provider B: capacity ✓ │
│ ───────────────────────── │
│ B.attained += 100 │
│ B.attained = 180 │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ RETURN ALLOCATION │
│ {provider: B, scm: 100} │
└─────────────────────────────┘
MIG Slice Support
┌─────────────────────────────┐
│ PHYSICAL GPU: H100 80GB │
│ ═══════════════════════════│
│ ┌─────────────────────────┐│
│ │ MIG Instance 0 ││
│ │ Profile: 1g.10gb ││
│ │ Status: allocated ││
│ │ Job: job-123 ││
│ └─────────────────────────┘│
│ ┌─────────────────────────┐│
│ │ MIG Instance 1 ││
│ │ Profile: 1g.10gb ││
│ │ Status: available ││
│ └─────────────────────────┘│
│ ┌─────────────────────────┐│
│ │ MIG Instance 2 ││
│ │ Profile: 2g.20gb ││
│ │ Status: available ││
│ └─────────────────────────┘│
│ ┌─────────────────────────┐│
│ │ MIG Instance 3 ││
│ │ Profile: 4g.40gb ││
│ │ Status: allocated ││
│ │ Job: job-456 ││
│ └─────────────────────────┘│
└─────────────────────────────┘
Scheduler treats each MIG
instance as independent
scheduling unit with own:
- Capacity (VRAM, SMs)
- Attained service
- Allocation state
Proposal → Queue → Execute
Proposer Governance Service Timelock Target Contract
│ │ │ │
│──propose(action)─────▶│ │ │
│ │──validate_proposer() │ │
│ │──check_quorum() │ │
│ │──create_proposal() │ │
│◀──proposal_id─────────│ │ │
│ │ │ │
│ │ │ │
════│═══════════════════════│═══ VOTING PERIOD ════│════════════════════════│═══
│ │ │ │
│──vote(yes/no)────────▶│ │ │
│ │──record_vote() │ │
│ │──update_tally() │ │
│ │ │ │
│ │ │ │
════│═══════════════════════│═══ VOTING ENDS ══════│════════════════════════│═══
│ │ │ │
│──queue()─────────────▶│ │ │
│ │──check_passed() │ │
│ │──schedule_timelock()─▶│ │
│ │ │──start_delay() │
│ │ │ (48h minimum) │
│ │ │ │
│ │ │ │
════│═══════════════════════│═══ TIMELOCK DELAY ═══│════════════════════════│═══
│ │ │ │
│──execute()───────────▶│ │ │
│ │──verify_timelock()───▶│ │
│ │ │──check_eta_passed() │
│ │ │──execute_action()─────▶│
│ │ │ │──apply()
│ │ │◀──────────success──────│
│ │◀──execution_receipt───│ │
│◀──tx_hash─────────────│ │ │
│ │ │ │
Payments
Payment processing with Stripe webhooks, SQS queue workers, ledger persistence, and Arbitrum-based ACU token integration.
Payments Infrastructure
LedgerDAO encapsulates SQLAlchemy interactions for payment credits. create_entry inserts PaymentCredit rows with PENDING status. mark_minted, mark_escrowed, mark_failed, mark_refunded update statuses and append transaction hashes. reserve_credit_for_job atomically assigns minted credits to jobs, raising ValueError if no credits match criteria.
FastAPI routes verify Stripe signatures via stripe.Webhook.construct_event, normalise events, record idempotency via ledger.record_stripe_event, and enqueue messages on SQS when configured. Payloads capture trainer IDs, GPU profiles, minute counts, wallet addresses, job IDs, regions, price versions, and invoice references.
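A minimal FastAPI handler in that shape, using Stripe's published construct_event verification; the route path, handler, and hand-off are placeholders rather than the service's actual wiring to ledger.record_stripe_event and WebhookQueue.enqueue.

import os
import stripe
from fastapi import APIRouter, HTTPException, Request

router = APIRouter()
WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET", "whsec_test")

@router.post("/webhooks/stripe")
async def stripe_webhook(request: Request) -> dict:
    payload = await request.body()
    signature = request.headers.get("stripe-signature", "")
    try:
        event = stripe.Webhook.construct_event(payload, signature, WEBHOOK_SECRET)
    except stripe.error.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="invalid signature")

    # Stand-in for normalisation, idempotency recording, and SQS enqueue.
    if event["type"] in {"checkout.session.completed", "payment_intent.succeeded"}:
        handle_payment_event(event)
    return {"received": True}

def handle_payment_event(event: dict) -> None:
    print("queued", event["id"])   # hypothetical hand-off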
Payment Flow Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ STRIPE WEBHOOK │
│ checkout.session.completed / payment_intent.succeeded │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WEBHOOK ROUTER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. stripe.Webhook.construct_event(payload, signature, secret) │ │
│ │ 2. _normalize_event() ──▶ extract metadata │ │
│ │ 3. ledger.record_stripe_event() ──▶ idempotency check │ │
│ │ 4. WebhookQueue.enqueue() ──▶ SQS FIFO │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUEUE WORKER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ WHILE True: │ │
│ │ messages = queue.receive_messages() │ │
│ │ FOR message IN messages: │ │
│ │ payload = StripePaymentPayload.parse(message.body) │ │
│ │ process_payment(payload) │ │
│ │ queue.delete(message) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEDGER DAO │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ create_entry(trainer_id, amount, wallet) ──▶ PENDING │ │
│ │ mark_minted(payment_id, tx_hash) ──▶ MINTED │ │
│ │ reserve_credit_for_job(job_id) ──▶ assign credit │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ACU WALLET │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ tx = wallet.build_mint(recipient, acu_amount, avl_burn) │ │
│ │ tx_hash = wallet.send(tx) ──▶ Arbitrum RPC │ │
│ │ ledger.mark_minted(payment_id, tx_hash) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
WebhookQueue wraps boto3 SQS interactions, supporting FIFO deduplication identifiers. QueueWorker polls messages, deserialises JSON into StripePaymentPayload, invokes payment processing logic, and deletes successfully processed messages. On exceptions the worker logs the failure and leaves the message in SQS, so it is retried after the visibility timeout expires.
# Aggregate metering slices
total_minutes = sum(slice.minutes for slice in slices)
numerator = sum(slice.minutes * slice.price for slice in slices)

# Compute TWAP (Time-Weighted Average Price)
twap_micro_usd = numerator // total_minutes

# Compute burn with ceiling rounding
burn_micro_acu = ceil(numerator / mint_price_micro_usd)

# Apply hold fraction for dispute buffer
provider_micro_acu = int(burn_micro_acu * (1.0 - hold_fraction))
refund_micro_acu = burn_micro_acu - provider_micro_acu

# Generate canonical receipt (JCS sorted keys)
receipt = {
    "burn_micro_acu": burn_micro_acu,
    "job_id": job_id,
    "provider": provider_address,
    "provider_micro_acu": provider_micro_acu,
    "refund_micro_acu": refund_micro_acu,
    "twap_micro_usd_per_scm": twap_micro_usd
}
Credit → Debit → Balance Reconciliation
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ LEDGER TRANSACTION FLOW │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ CREDIT FLOW (Stripe Payment → ACU Tokens) │
│ │
│ Stripe Webhook LedgerDAO ACU Wallet │
│ │ │ │ │ │
│ │──payment.success─▶│ │ │ │
│ │ │──create_entry()────▶│ │ │
│ │ │ │ status: PENDING │ │
│ │ │ │──────────────────────▶│ │
│ │ │ │ │──mint_acu() │
│ │ │ │◀──tx_hash─────────────│ │
│ │ │ │ status: MINTED │ │
│ │ │ │ │ │
└───────────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ DEBIT FLOW (Job Execution → Provider Payout) │
│ │
│ Job Complete MeteringService LedgerDAO Provider Wallet │
│ │ │ │ │ │
│ │──record_slice()───────▶│ │ │ │
│ │ │──reserve_credit()───▶│ │ │
│ │ │ │ status: ESCROWED │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ Settlement SettlementRouter │ │ │
│ │ │ │ │ │
│ │──finalize()───────────▶│ │ │ │
│ │ │──compute_payout() │ │ │
│ │ │──mark_settled()─────▶│ │ │
│ │ │ │ status: SETTLED │ │
│ │ │ │──transfer_acu()─────▶│ │
│ │ │ │ │──received │
│ │ │ │ │ │
└───────────────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ LEDGER STATES │
│ │
│ PENDING ───▶ MINTED ───▶ ESCROWED ───▶ SETTLED │
│ │ │ │ │ │
│ │ │ │ └──▶ Provider receives ACU │
│ │ │ └──▶ Reserved for job, locked │
│ │ └──▶ On-chain ACU minted, available │
│ └──▶ Awaiting blockchain confirmation │
│ │
│ Alternative paths: │
│ PENDING ───▶ FAILED (blockchain error) │
│ ESCROWED ───▶ REFUNDED (job cancelled, dispute won) │
└───────────────────────────────────────────────────────────────────────────────────────────┘
FIFO Queue Architecture
┌─────────────────────────────┐
│ STRIPE WEBHOOKS │
│ ┌───────────────────────┐ │
│ │ checkout.completed │ │
│ │ payment.succeeded │ │
│ │ refund.created │ │
│ └───────────┬───────────┘ │
└──────────────┼──────────────┘
│
▼
┌─────────────────────────────┐
│ SQS FIFO QUEUE │
│ payments-prod.fifo │
│ ═══════════════════════════│
│ ┌─────────────────────────┐│
│ │ Message 1 (oldest) ││
│ │ dedupe: evt_abc123 ││
│ │ group: tenant_001 ││
│ └─────────────────────────┘│
│ ┌─────────────────────────┐│
│ │ Message 2 ││
│ │ dedupe: evt_def456 ││
│ │ group: tenant_002 ││
│ └─────────────────────────┘│
│ ┌─────────────────────────┐│
│ │ Message 3 (newest) ││
│ │ dedupe: evt_ghi789 ││
│ │ group: tenant_001 ││
│ └─────────────────────────┘│
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ QUEUE WORKER │
│ (ECS / Lambda) │
└─────────────────────────────┘
Worker Processing Loop
┌─────────────────────────────┐
│ QUEUE WORKER LOOP │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ receive_messages( │
│ max_messages=10, │
│ wait_time=20s │
│ ) │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ FOR message IN messages: │
│ ┌─────────────────────────┐│
│ │ payload = parse(body) ││
│ │ validate_signature() ││
│ │ idempotency_check() ││
│ └─────────────────────────┘│
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ TRY: │
│ process_payment(payload) │
│ ledger.create_entry() │
│ wallet.mint_acu() │
│ queue.delete(message) │
│ EXCEPT: │
│ log_error() │
│ # message stays in queue │
│ # retry after visibility │
│ # timeout (30s default) │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ CONTINUE (next iteration) │
└─────────────────────────────┘
Signature Verification & Event Handling
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ STRIPE WEBHOOK PROCESSING │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Stripe FastAPI Router Handlers Queue
│ │ │ │
│──POST /webhooks/stripe────▶│ │ │
│ Headers: │ │ │
│ Stripe-Signature │ │ │
│ Body: {event} │ │ │
│ │ │ │
│ │──stripe.Webhook.construct_event() │
│ │ (payload, signature, secret) │
│ │ │ │
│ │──IF InvalidSignature: │ │
│◀──────────400 Bad Req──────│ RETURN 400 │ │
│ │ │ │
│ │──_normalize_event() │ │
│ │ extract: trainer_id, │ │
│ │ amount, │ │
│ │ wallet_address, │ │
│ │ metadata │ │
│ │ │ │
│ │──ledger.record_stripe_event()│ │
│ │ (idempotency check) │ │
│ │ │ │
│ │──IF event.type == "checkout.session.completed": │
│ │ handler = handle_checkout│ │
│ │──ELIF event.type == "payment_intent.succeeded": │
│ │ handler = handle_payment │ │
│ │──ELIF event.type == "refund.created": │
│ │ handler = handle_refund │ │
│ │ │ │
│ │──handler(event)─────────────▶│ │
│ │ │──build_payload() │
│ │ │──validate_schema() │
│ │ │ │
│ │◀─────────payload─────────────│ │
│ │ │ │
│ │──queue.enqueue(payload)─────────────────────────────▶│
│ │ │ │
│◀──────────200 OK───────────│ │ │
│ │ │ │
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│ Event Types Handled: │
│ checkout.session.completed → Initial payment, mint ACU │
│ payment_intent.succeeded → Recurring/additional payment │
│ refund.created → Refund request, update ledger │
│ invoice.paid → Subscription renewal │
│ customer.subscription.* → Subscription lifecycle │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Contracts
Solidity smart contracts including ConversionRouter for AVL-to-ACU burns, spend limits, and dual-signature governance.
Smart Contracts
contracts/ConversionRouter.sol uses OpenZeppelin AccessControl, IERC20, and Pausable. Roles include DEFAULT_ADMIN_ROLE, OPERATOR_ROLE, and PAUSER_ROLE. Constructor stores immutable AVL token, ACU token, price oracle, and burn agent addresses, initialises spend limit, and sets role admins. The contract enforces non-zero addresses to prevent deployment mistakes.
burnAVLForACU (operator-only, when not paused) validates recipient, ACU amount, spend limit, queries AVL required via oracle, transfers AVL from operator to burn agent, calls IAVL.burn, transfers ACU from reserve to recipient, increments acuSpent, and emits ConversionExecuted. Administrative functions adjust reserve address, spend limit, and pause/unpause conversions.
On-Chain Settlement Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ OPERATOR WALLET │
│ 1. approve(ConversionRouter, avlAmount) │
│ 2. burnAVLForACU(acuAmount, recipient) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONVERSION ROUTER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ require(hasRole(OPERATOR_ROLE, msg.sender)) │ │
│ │ require(recipient != address(0)) │ │
│ │ require(acuAmount > 0) │ │
│ │ require(acuSpent + acuAmount <= acuSpendLimit) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PRICE ORACLE │
│ avlNeeded = oracle.quoteAVLforACU(acuAmount) │
│ require(avlNeeded > 0) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TOKEN OPERATIONS │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. avl.transferFrom(operator, burnAgent, avlNeeded) │ │
│ │ 2. avl.burn(burnAgent, avlNeeded) │ │
│ │ 3. acu.transferFrom(reserve, recipient, acuAmount) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STATE UPDATE │
│ acuSpent += acuAmount │
│ emit ConversionExecuted(operator, recipient, acuAmount, avlNeeded) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RECIPIENT WALLET │
│ ACU tokens received │
│ Event logs ──▶ Off-chain ledger reconciliation │
└─────────────────────────────────────────────────────────────────────────────┘
function burnAVLForACU(
    uint256 acuAmount,
    address recipient
) external whenNotPaused returns (uint256) {
    require(hasRole(OPERATOR_ROLE, msg.sender), "not operator");
    require(recipient != address(0), "zero recipient");
    require(acuAmount > 0, "zero amount");
    require(acuSpent + acuAmount <= acuSpendLimit, "spend limit");

    uint256 avlNeeded = oracle.quoteAVLforACU(acuAmount);
    require(avlNeeded > 0, "zero quote");

    require(avl.transferFrom(msg.sender, burnAgent, avlNeeded));
    avl.burn(burnAgent, avlNeeded);
    require(acu.transferFrom(reserve, recipient, acuAmount));

    acuSpent += acuAmount;
    emit ConversionExecuted(msg.sender, recipient, acuAmount, avlNeeded);
    return avlNeeded;
}
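From the operator's side, the conversion can be driven from Python. This web3.py (v6+ naming) sketch assumes a deployed router address, an exported ABI file, and a funded operator key; all of these are placeholders, and nonce/gas handling is simplified.

import json, os
from web3 import Web3

w3 = Web3(Web3.HTTPProvider(os.environ.get("ARBITRUM_RPC_URL", "http://localhost:8545")))
router = w3.eth.contract(
    address=Web3.to_checksum_address("0x0000000000000000000000000000000000000001"),  # placeholder address
    abi=json.load(open("ConversionRouter.abi.json")),                                 # hypothetical ABI export
)

operator = w3.eth.account.from_key(os.environ["OPERATOR_PRIVATE_KEY"])
tx = router.functions.burnAVLForACU(10_000_000, operator.address).build_transaction({
    "from": operator.address,
    "nonce": w3.eth.get_transaction_count(operator.address),
})
signed = operator.sign_transaction(tx)
raw = getattr(signed, "raw_transaction", None) or signed.rawTransaction  # attribute name differs by web3.py version
tx_hash = w3.eth.send_raw_transaction(raw)
print("conversion tx:", tx_hash.hex())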
Contract State
Immutable references and mutable accounting state for spend tracking.
acu: IERC20 (immutable)
oracle: IPriceOracle (immutable)
burnAgent: address (immutable)
reserve: address (mutable)
acuSpendLimit: uint256
acuSpent: uint256
Role Hierarchy
OpenZeppelin AccessControl with three distinct roles for operations.
OPERATOR_ROLE: burn/mint ops
PAUSER_ROLE: pause/unpause
Role admin: DEFAULT_ADMIN_ROLE
Event Emissions
Events for off-chain indexing and ledger reconciliation.
ConversionExecuted(operator, recipient, acuAmount, avlBurned)
SpendLimitUpdated(old, new)
ReserveUpdated(old, new)
SDK
Phase4 SDK with ControlPlaneClient, Typer CLI wrappers, and configuration helpers for external integration.
SDK and Client Libraries
ControlPlaneClient wraps urllib.request to interact with control plane endpoints. It stores base URL, API key, timeout, and optional SSL context. _request builds requests, attaches headers (X-API-Key, Content-Type), serialises JSON, parses responses, and raises ControlPlaneError on HTTP failures. _request_with_retry handles HTTP 429 with exponential backoff and jitter.
Methods include register_provider, configure_demand, submit_supply, reserve_capacity, allocate_reservation, finalize_bucket, wait_for_task, and get_task. Typer-based CLI exposes commands mirroring client methods, parsing command-line options, environment variables, and printing JSON responses.
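A self-contained sketch of the 429 retry behaviour described for _request_with_retry, using only the standard library; the header names, timeout, and backoff constants are assumptions rather than the SDK's exact values.

import json, random, time, urllib.error, urllib.request

def request_with_retry(url: str, payload: dict, api_key: str, max_attempts: int = 5) -> dict:
    """POST JSON, backing off exponentially with jitter when the server returns 429."""
    body = json.dumps(payload).encode()
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            url, data=body, method="POST",
            headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read().decode())
        except urllib.error.HTTPError as exc:
            if exc.code != 429 or attempt == max_attempts - 1:
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)   # exponential backoff plus jitter
            time.sleep(delay)
    raise RuntimeError("unreachable")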
SDK Interaction Surfaces
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEVELOPER CLI / SCRIPTS │
│ │
│ $ phase4 sdk configure-demand --tenant team-a --gpu-profile a100x8 │
│ $ phase4 sdk submit-supply --provider prov-123 --capacity 100 │
│ $ phase4 sdk reserve-capacity --job-id job-42 --bucket-id bucket-1 │
│ $ phase4 sdk wait-for-task --task-id task-99 --timeout 300 │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE CLIENT │
│ │
│ ControlPlaneClient(base_url, api_key, timeout, ssl_context) │
│ │ │
│ ├── _request(method, path, payload) │
│ │ ├── build urllib.request.Request │
│ │ ├── attach X-API-Key header │
│ │ ├── serialize JSON payload │
│ │ └── parse JSON response │
│ │ │
│ └── _request_with_retry(...) │
│ ├── handle HTTP 429 │
│ ├── exponential backoff │
│ └── jitter randomization │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE ENDPOINTS │
│ │
│ /oracle/* ← demand, supply, finalize │
│ /scheduler/* ← reservations, allocations │
│ /metering/* ← slices, acknowledgements │
│ /settlement/* ← receipts, burns │
│ /governance/* ← proposals, execution │
│ /enterprise/* ← treasury, transfers │
└─────────────────────────────────────────────────────────────────────────────┘
import os

from phase4_sdk import ControlPlaneClient

# Configure client
client = ControlPlaneClient(
    base_url="https://control-plane.vracu.net",
    api_key=os.environ["CONTROL_PLANE_API_KEY"],
    timeout=30.0
)

# Submit demand configuration
demand = client.configure_demand(
    tenant="team-vision",
    gpu_profile="a100x8",
    scm_minutes=720,
    reserve_ratio=0.15
)

# Reserve capacity for job
reservation = client.reserve_capacity(
    job_id="job-42",
    bucket_id=demand["bucket_id"],
    required_scm_minutes=60
)

# Poll for task completion
result = client.wait_for_task(
    task_id=reservation["task_id"],
    timeout=300
)

print(f"Reservation: {reservation['reservation_id']}")
print(f"Provider: {result['provider_id']}")
Providers
Provider ecosystem including directory API, join tokens, heartbeat monitoring, MIG management, and capability attestation.
Provider Ecosystem
alien-directory-api/directory_api/app.py builds a FastAPI service exposing /v1 endpoints. Routes include tokens (issuing join tokens), providers (registering providers, fetching profiles, updating metadata), and heartbeats (recording periodic health signals). Tokens include TTLs and scopes, enabling operators to issue time-bound onboarding credentials.
alien-provider-node implements a Typer CLI that consumes join tokens, registers hardware inventory, and reports telemetry. It communicates with the directory API using HTTPX, handles retries, and caches tokens securely. The node collects GPU inventory (count, memory, MIG profiles) and publishes metrics at regular intervals.
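A heartbeat loop in that spirit, built on HTTPX; the endpoint path follows the directory API route shown in the onboarding flow below, but the payload fields and interval are illustrative.

import time
import httpx

def heartbeat_loop(base_url: str, provider_id: str, token: str, interval_seconds: int = 30) -> None:
    """Periodically report liveness and basic GPU telemetry to the directory API."""
    client = httpx.Client(base_url=base_url, headers={"Authorization": f"Bearer {token}"})
    while True:
        payload = {"status": "healthy", "gpu_count": 8, "gpu_utilization": 0.42}   # illustrative fields
        try:
            resp = client.post(f"/v1/providers/{provider_id}/heartbeat", json=payload, timeout=10.0)
            resp.raise_for_status()
        except httpx.HTTPError as exc:
            print(f"heartbeat failed, will retry: {exc}")
        time.sleep(interval_seconds)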
Provider Onboarding Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ JOIN TOKEN ISSUER │
│ Control Plane / Operator ──▶ POST /fleet/join-tokens │
│ {policy, ttl, scope, treasury_account} │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DIRECTORY API │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ POST /v1/join-tokens ──▶ generate token with TTL │ │
│ │ POST /v1/join ──▶ validate token, create provider record │ │
│ │ POST /v1/join/verify ──▶ verify signature, store public key │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROVIDER NODE │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ vracu-provider init --join-token │ │
│ │ ├── redeem token │ │
│ │ ├── verify signature │ │
│ │ └── store provider ID │ │
│ │ │ │
│ │ vracu-provider configure-mig --profile 1g.5gb --count 7 │ │
│ │ └── MIGManager.configure() ──▶ nvidia-smi │ │
│ │ │ │
│ │ vracu-provider run │ │
│ │ ├── heartbeat loop ──▶ POST /providers/{id}/heartbeat │ │
│ │ ├── telemetry loop ──▶ GPU metrics, temperature │ │
│ │ └── attestation loop ──▶ refresh credentials │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY AGENTS │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Prometheus │ │ Loki │ │ OTLP │ │
│ │ Exporter │ │ Tailer │ │ Exporter │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESILIENCE CONTROLLER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Monitor heartbeats ──▶ detect offline nodes │ │
│ │ Check revocation feeds ──▶ isolate compromised providers │ │
│ │ Orchestrate failover ──▶ reassign workloads │ │
│ │ Notify operators ──▶ Slack / PagerDuty webhooks │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Provider Registry
Central registry for provider metadata, capabilities, and health status.
POST /v1/join
POST /v1/join/verify
GET /v1/providers/{id}
POST /v1/providers/{id}/heartbeat
GPU Partitioning
Idempotent MIG configuration using nvidia-smi commands.
_destroy_instances()
_create_instances(profile, count)
query() ──▶ MIGStatus
Telemetry Stack
Multi-protocol exporters for metrics, logs, and traces.
Loki: structured JSON logs
OTLP: distributed traces
nvidia-smi --query
Operations
Helm charts, Terraform modules, Kyverno policies, and operational scripts for production deployments.
Operations and Deployment
deployment/deploy_payments.sh orchestrates Kubernetes deployments for the payments service. It wraps kubectl and helm commands, applies manifests, waits for rollouts, and verifies service readiness. deployment/helm/ contains charts for launcher, sidecar relay, control plane, payments, and observability stacks. Values files configure image tags, replica counts, secrets references, and service accounts.
Kyverno policies enforce Cosign signatures on all pods labelled vracu-job=true. The policy references a cosign-public-key secret, instructing Kyverno to reject unsigned images. Network policies deny all ingress and egress for pods with vracu-job label, forcing operators to explicitly allow required traffic.
Deployment Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ GIT COMMIT │
│ feature branch ──▶ pull request ──▶ code review ──▶ merge │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CI PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Lint (ruff, mypy) │ │
│ │ 2. Unit tests (pytest) │ │
│ │ 3. Integration tests │ │
│ │ 4. Security scan (Trivy, Anchore) │ │
│ │ 5. Build container images │ │
│ │ 6. Sign images (Cosign) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Terraform plan ──▶ review ──▶ apply │ │
│ │ ├── VPCs, subnets, security groups │ │
│ │ ├── EKS clusters, node groups │ │
│ │ ├── RDS instances, SQS queues │ │
│ │ └── Secrets Manager, KMS keys │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ HELM DEPLOYMENT │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ helm upgrade --install launcher deployment/helm/launcher │ │
│ │ helm upgrade --install control-plane deployment/helm/control-plane │ │
│ │ helm upgrade --install payments deployment/helm/payments │ │
│ │ helm upgrade --install observability deployment/helm/observability │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADMISSION CONTROL │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Kyverno policies: │ │
│ │ ├── require-signed-images (Cosign verification) │ │
│ │ ├── deny-privileged-containers │ │
│ │ └── require-resource-limits │ │
│ │ │ │
│ │ Network policies: │ │
│ │ ├── deny-all-ingress (vracu-job pods) │ │
│ │ └── deny-all-egress (vracu-job pods) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ POST-DEPLOY VALIDATION │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ scripts/validate_production.py │ │
│ │ ├── health checks │ │
│ │ ├── smoke tests │ │
│ │ ├── metrics verification │ │
│ │ └── dashboard screenshots │ │
│ │ │ │
│ │ Generate: DEPLOYMENT_COMPLETE_*.md │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Validation
Comprehensive testing infrastructure, Prometheus metrics, structured logging, and production evidence artifacts.
Validation and Testing
vracu-launcher/tests/ contains unit tests covering adapters, privacy, and worker logic. tests/privacy/test_split_key.py validates session rotation monotonicity and attestation hash binding. tests/adapters/test_training.py asserts default resource profiles and command normalisation. tests/worker/test_processor.py uses fixtures to simulate job processing.
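The rotation-monotonicity and hash-binding checks follow a conventional pytest shape. The SplitKeySession stand-in below is hypothetical and exists only to show the structure of the assertions; the real tests exercise the launcher's actual split-key module.

class SplitKeySession:
    """Stand-in for the launcher's split-key session object (hypothetical)."""
    def __init__(self, attestation_hash: str):
        self.attestation_hash = attestation_hash
        self.epoch = 0

    def rotate(self) -> int:
        self.epoch += 1
        return self.epoch

def test_rotation_is_monotonic():
    session = SplitKeySession(attestation_hash="abc123")
    epochs = [session.rotate() for _ in range(5)]
    # Each rotation must strictly advance the epoch; no reuse, no rollback.
    assert epochs == sorted(epochs)
    assert len(set(epochs)) == len(epochs)

def test_attestation_hash_stays_bound():
    session = SplitKeySession(attestation_hash="abc123")
    session.rotate()
    # Rotation must never rebind the session to different hardware evidence.
    assert session.attestation_hash == "abc123"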
validation_e2e/ and local_validation/ directories host scripts and transcripts demonstrating full-stack runs. Transcript files capture validation output, test summaries, and the fixes applied along the way. Shell scripts orchestrate integration runs, capturing logs and metrics snapshots, and the resulting SQLite databases are retained as audit artefacts.
Validation Feedback Loop
┌─────────────────────────────────────────────────────────────────────────────┐
│ AUTOMATED TESTS │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Unit Tests │ │ Integration │ │ E2E Tests │ │
│ │ pytest │ │ Tests │ │ validation_ │ │
│ │ tests/ │ │ integration_ │ │ e2e/ │ │
│ │ │ │ test.py │ │ │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CI PIPELINES │ │
│ │ Run on every commit, PR, and scheduled │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Prometheus Metrics │ │
│ │ job_submit_latency_seconds │ │
│ │ privacy_attestation_failures_total │ │
│ │ privacy_session_cancellations_total │ │
│ │ payments_webhooks_total │ │
│ │ control_plane_reservation_duration_seconds │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Structured Logs (JSON) │ │
│ │ {"event": "job_submit", "job_id": "...", "adapter": "..."} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Distributed Traces (OTLP) │ │
│ │ launcher.request → privacy.authorize → control_plane.reserve │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ VALIDATION CHECKLISTS │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ [ ] Health checks passing │ │
│ │ [ ] Metrics endpoints responding │ │
│ │ [ ] Kyverno policies enforced │ │
│ │ [ ] Ledger reconciliation complete │ │
│ │ [ ] Dashboard screenshots captured │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVIDENCE ARCHIVES │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ validation_output.txt │ │
│ │ PRODUCTION_PROOF.md │ │
│ │ PRODUCTION_EVIDENCE_FINAL.md │ │
│ │ AWS_DEPLOYMENT_COMPLETE_FINAL_REPORT.md │ │
│ │ BLOCKCHAIN_VALIDATION_COMPLETE.md │ │
│ │ payments_production.db (SQLite snapshot) │ │
│ │ control_plane_local.db (SQLite snapshot) │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
# Excerpt from scripts/validate_production.py; helper functions are defined
# elsewhere in the script.
from datetime import datetime

def validate_production():
    # Health checks
    assert check_http("/health") == 200
    assert check_http("/metrics") == 200

    # Submit test job and verify completion
    job_id = submit_test_job()
    wait_for_completion(job_id, timeout=300)

    # Verify metrics within thresholds
    metrics = scrape_prometheus()
    assert metrics["job_submit_latency_seconds"] < 30.0
    assert metrics["privacy_attestation_failures_total"] == 0

    # Reconcile ledger with on-chain state
    ledger_total = query_ledger_minted()
    onchain_total = query_onchain_minted()
    assert ledger_total == onchain_total

    # Generate validation report
    write_report({
        "metrics": metrics,
        "ledger_total": ledger_total,
        "onchain_total": onchain_total,
        "timestamp": datetime.utcnow().isoformat(),
    })
Prometheus Metrics
Comprehensive metrics across all subsystems feed Grafana dashboards for operational visibility.
privacy_attestation_failures_total (counter)
privacy_session_cancellations_total (counter)
payments_webhooks_total (counter)
payments_queue_backlog (gauge)
control_plane_reservation_duration_seconds (histogram)
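Each of these series maps directly onto a prometheus_client primitive. The sketch below shows plausible definitions and instrumentation points; the label names are assumptions, and the real registrations live inside the respective services.

from prometheus_client import Counter, Gauge, Histogram

ATTESTATION_FAILURES = Counter(
    "privacy_attestation_failures_total",
    "Attestation verifications that failed", ["attester"])
SESSION_CANCELLATIONS = Counter(
    "privacy_session_cancellations_total",
    "Privacy sessions cancelled mid-flight", ["reason"])
PAYMENT_WEBHOOKS = Counter(
    "payments_webhooks_total", "Payment webhooks received", ["status"])
QUEUE_BACKLOG = Gauge(
    "payments_queue_backlog", "Webhook events awaiting processing")
RESERVATION_DURATION = Histogram(
    "control_plane_reservation_duration_seconds",
    "Time from reservation request to confirmation")

# Illustrative instrumentation points:
ATTESTATION_FAILURES.labels(attester="nvidia").inc()
with RESERVATION_DURATION.time():
    pass  # reserve capacity here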
Audit Artifacts
Numerous PDFs and Markdown reports document validation efforts for auditors.
PRODUCTION_PROOF.md
AWS_DEPLOYMENT_COMPLETE.md
BLOCKCHAIN_VALIDATION.md
Epilogue
System interlocks, security guarantees, governance mechanisms, and the path forward for confidential compute.
Epilogue
Throughout this document each guarantee traces to concrete modules: privacy enforcement in launcher/privacy, orchestration in launcher/worker, sidecar execution in sidecar/runtime, control plane economics in control_plane/*, payments in payments/*, on-chain settlement in contracts/, and provider tooling in alien-* directories. Signed-image policies, network isolation, and Terraform infrastructure anchor operational claims.
Confidential compute requires interlocks: attesters validate hardware, key brokers issue secrets, revocation pipelines cancel workloads, sidecars enforce key usage, drivers launch jobs, the control plane allocates capacity, payments reconcile usage, and smart contracts finalise settlement. Each interlock exchanges typed data structures so that schemas cannot drift between components, and metrics and logs stitch the loops together.
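As an illustration of that typed-payload discipline, the dataclasses below show how attestation evidence might be bound to a key release; the field names are hypothetical, not the platform's actual schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class AttestationEvidence:
    attester: str          # "nvidia" | "tdx" | "snp"
    measurement: str       # hex-encoded launch measurement
    nonce: str             # freshness challenge issued by the key broker
    expires_at: float      # unix timestamp after which evidence is stale

@dataclass(frozen=True)
class KeyRelease:
    session_id: str
    wrapped_key: bytes              # key material sealed to the attested enclave
    evidence: AttestationEvidence   # binds the release to one exact measurement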
System Weave
┌─────────────────────────────────────────────────────────────────────────────┐
│ SDK / CLI │
│ phase4_sdk ←→ Typer CLI ←→ External Partners │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAUNCHER API │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Routes │───▶│ Service │───▶│ Privacy │───▶│ Worker │ │
│ │ /v1/jobs │ │ LauncherSvc│ │ PrivacyGate│ │ JobProcessor│ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ └──────┬──────┘ │
└────────────────────────────────────────────────┼──────────────────┼─────────┘
│ │
┌────────────────────────────┘ │
│ │
▼ ▼
┌─────────────────────────────────┐ ┌─────────────────────────────┐
│ PRIVACY DESIGN PRINCIPLES │ │ SIDECAR RUNTIME │
│ ┌─────────────────────────┐ │ │ ┌─────────────────────┐ │
│ │ Attesters (NVIDIA/TDX/ │ │◀───────────▶│ │ Handshake / Rotate │ │
│ │ SNP) │ │ Mutual │ │ TLS / Execute │ │
│ ├─────────────────────────┤ │ Attestation│ └─────────────────────┘ │
│ │ Key Brokers (KMS/Vault/ │ │ │ │
│ │ Split-Key) │ │ │ │
│ ├─────────────────────────┤ │ │ │
│ │ Revocation Registry │ │ │ │
│ └─────────────────────────┘ │ │ │
└────────────────────────────────┘ └──────────────────────────────┘
│
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Oracle │ │ Scheduler │ │ Metering │ │Settlement │ │Governance │ │
│ │ Pricing │ │ DRF │ │ Slices │ │ Router │ │ Timelock │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └───────────┘ │
└────────┼──────────────┼──────────────┼──────────────┼────────────────────────┘
│ │ │ │
└──────────────┼──────────────┼──────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PAYMENTS / ON-CHAIN │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ Ledger DAO │ │ ConversionRouter │ │
│ │ ┌─────────────────┐ │◀──────────▶│ ┌─────────────────┐ │ │
│ │ │ PaymentCredits │ │ Mint │ │ burnAVLForACU() │ │ │
│ │ │ ProviderPayouts │ │ Burn │ │ setSpendLimit() │ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ └───────────────────────┘ └───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Arbitrum L2 │ │
│ │ AVL ←→ ACU │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROVIDER ECOSYSTEM │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Directory API │ │ Provider Node │ │ MIG Manager │ │ Observability │ │
│ │ Join Tokens │ │ Heartbeats │ │ nvidia-smi │ │ Prometheus │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Potential extensions include new attesters (e.g., Intel TDX updates), additional adapters for specialised workloads, deeper integration with external observability platforms, and on-chain governance. The modular architecture—loaders, registry-based factories, configuration-driven services—facilitates such evolution without rewriting core components.
Security manifests in signed images, Kyverno policies, network isolation, Secrets Manager integration, dual signatures for on-chain settlements, and attested hardware enforcement. Governance modules enforce timelocks and multi-role approvals. Compliance artifacts connect governance actions to sign-offs, proving that cross-organisational approvals are embedded in both software and process.
The network's privacy guarantees emerge from three complementary layers. Hardware attestation prevents operators from inspecting workload data. Encrypted input/output pipelines protect data in transit and at rest. Cryptographic secure aggregation prevents peers from observing individual contributions during federated training. Together, these mechanisms enable confidential computation on untrusted infrastructure.
Readers can independently verify claims: run tests/, execute run_live_demo.sh, inspect payments_production.db, replay control-plane logs, verify Cosign signatures, query on-chain contracts via ABI, and compare metrics dashboards. The epilogue invites verification rather than asking for trust.
As new features land (PQ-safe keys, new adapters, region expansion), the same pattern will continue: code first, tests second, documentation third, evidence fourth. Future whitepaper revisions will follow this cadence, keeping technical truth aligned with narrative. The repository is both blueprint and proof, inviting stakeholders to inspect, validate, and extend the platform with confidence rooted in verifiable engineering.