The Decentralized
GPU Revolution
How fifty thousand nodes across one hundred fifty countries are reshaping computational infrastructure through Byzantine consensus, military-grade encryption, and community governance.
In the depths of data centers scattered across continents, a quiet revolution unfolds. Traditional cloud computing, with its centralized authorities and surveillance apparatuses, faces an existential challenge from an unlikely alliance of cryptographers, engineers, and idealists. Their weapon of choice: a distributed network of graphics processing units, bound together not by corporate decree but by mathematical consensus.
The numbers tell a compelling story. Fifty thousand nodes pulse with computational life, their collective power exceeding that of many nation-states' entire technological infrastructure. Yet this is not merely about raw processing power. It represents a fundamental reimagining of how we approach distributed computing A paradigm where computational tasks are divided among multiple machines, coordinating through network protocols. , privacy, and digital sovereignty in an age of unprecedented surveillance.
At the heart of this transformation lies a trinity of innovations: a privacy layer that would make cypherpunks weep with joy, a marketplace that democratizes access to machine learning pipelines, and a software development kit so elegantly simple that it borders on poetry. Each component represents years of research distilled into practical tools that challenge the status quo.
The genesis of this platform can be traced to a simple observation: the gatekeepers of computational power have become too powerful. Amazon Web Services, Google Cloud Platform, Microsoft Azure—these titans control not just the infrastructure but the very terms under which innovation occurs. Their data centers are panopticons, their terms of service are constitutions we never voted for, and their pricing models extract maximum value while providing minimum transparency.
Against this backdrop, the distributed GPU network emerges not as a mere alternative but as a philosophical statement. It asserts that computational resources, like knowledge itself, should be freely accessible to all who seek them. It proclaims that privacy is not a luxury but a fundamental right. It demonstrates that community governance can triumph over corporate hierarchies.
The Privacy Imperative
In the pantheon of distributed computing, privacy stands not as an afterthought but as the architectural cornerstone upon which all trust is built. The Bazaar platform implements an unprecedented privacy infrastructure—eight core services orchestrating four fundamental technologies to create what security researchers have called "the most comprehensive privacy-preserving compute platform in production today."
The numbers speak with authority: sub-millisecond differential privacy operations, 100-millisecond zero-knowledge proof generation, Byzantine fault tolerance up to 33% malicious nodes, and triple-encrypted Tor circuits established in under 300 milliseconds. These are not theoretical benchmarks but operational realities, battle-tested across thousands of compute hours and millions of privacy-preserving operations.
Privacy Controller
The central nervous system of privacy operations, orchestrating all privacy-preserving computations across the network. Built with FastAPI, operating on port 8009.
Port: 8009
Lines of Code: 354
Differential Privacy: ε=1.0, δ=1e-5
Zero-Knowledge: Groth16 SNARKs
Consensus: HoneyBadgerBFT
Privacy Stack
Enforces privacy budget allocation and tracks differential privacy consumption.
Max Epsilon: 5.0
Refresh: 3600s
Ledger: PostgreSQL
Privacy Suite
Components 27-32: Anonymization, optimization, adaptation, and policy engine.
Components: 6
Adapters: CNN, Transformer
Hot Config: Enabled
Bulletin Board
Immutable message board with HoneyBadgerBFT consensus and Merkle tree verification.
Algorithm: HoneyBadgerBFT
Fault Tolerance: 33%
Batch Size: 10 messages
Security Vault
Hardware-backed key management with FIPS 140-2 compliance and HSM integration.
Language: Go
Storage: BadgerDB
HSM: PKCS#11
AnoFel ZKP System
Zero-knowledge proof generation for anonymous federated learning with gradient privacy.
Scheme: Groth16
Proof Size: ~200 bytes
Generation: ~100ms
LF3PFL Coordinator
Layer-wise federated privacy with Byzantine-robust gradient aggregation.
Methods: Mean, Median, Krum
Byzantine Threshold: 33%
Variance Reduction: 75%
Tor Network Integration
Triple-encrypted onion routing with hidden services for all Bazaar components.
Circuit Lifetime: 600s · Max Circuits: 10
Hidden Services: composer.onion, policy-engine.onion, registry.onion, slo-broker.onion, bulletin-board.onion, privacy-controller.onion
Privacy Request Lifecycle
Every request entering the Bazaar platform undergoes a sophisticated privacy transformation, passing through multiple validation and protection layers before reaching its destination. This lifecycle, measured in milliseconds yet comprehensive in its security guarantees, represents the practical implementation of theoretical privacy primitives at scale.
╔═════════════════════════════════════════════════════════════════════╗
║ PRIVACY REQUEST LIFECYCLE ║
╠═════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌────────────────┐ ║
║ │ Client Request │ ║
║ └────────┬───────┘ ║
║ │ ║
║ ▼ ║
║ ┌────────────────┐ Has Privacy Headers? ║
║ │ Kong Gateway │────────────┐ ║
║ │ :8000 │ │ ║
║ └────────┬───────┘ ▼ ║
║ │ [403 Denied] ║
║ │ ║
║ ▼ ║
║ ┌────────────────┐ ║
║ │ Privacy Budget │ Valid ε/δ Budget? ║
║ │ Plugin │────────────┐ ║
║ └────────┬───────┘ │ ║
║ │ ▼ ║
║ │ [403 Exhausted] ║
║ │ ║
║ ▼ ║
║ ┌────────────────┐ ║
║ │ AnoFel Plugin │ Valid ZK Proof? ║
║ │ Provenance │────────────┐ ║
║ └────────┬───────┘ │ ║
║ │ ▼ ║
║ │ [403 Invalid] ║
║ │ ║
║ ▼ ║
║ ┌─────────────────────────────────────┐ ║
║ │ Privacy Controller :8009 │ ║
║ ├─────────────────────────────────────┤ ║
║ │ • Generate ZK Proof (Groth16) │ ║
║ │ • Add DP Noise (ε=1.0, δ=1e-5) │ ║
║ │ • Secure Aggregation (MPC) │ ║
║ │ • Tor Circuit Routing (3 hops) │ ║
║ └────────┬─────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌─────────────────────────────────────┐ ║
║ │ Bulletin Board Consensus │ ║
║ │ HoneyBadgerBFT │ ║
║ └────────┬─────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌────────────────┐ ║
║ │Privacy Response│ ║
║ └────────────────┘ ║
║ ║
╚═════════════════════════════════════════════════════════════════════╝
Gradient Privacy Processing
Client Kong Gateway Privacy Controller AnoFel ZKP LF3PFL Tor Bulletin Board
│ │ │ │ │ │ │
├─POST /gradient──►│ │ │ │ │ │
│ ├─X-Privacy-Epsilon►│ │ │ │ │
│ │ X-Privacy-Grant │ │ │ │ │
│ │ │ │ │ │ │
│ ├─Validate Budget──►│ │ │ │ │
│ │ Check ε≥0.1 │ │ │ │ │
│ │ ├─Generate Proof──►│ │ │ │
│ │ │ ├─Clip L2────►│ │ │
│ │ │ │ norm ≤ 1.0 │ │ │
│ │ │ ├─Gaussian───►│ │ │
│ │ │ │ Noise σ=0.1│ │ │
│ │ │ ├─Pedersen───►│ │ │
│ │ │ │ Commitment │ │ │
│ │ │◄────ZKProof─────┤ │ │ │
│ │ │ BN254/Groth16 │ │ │ │
│ │ │ │ │ │ │
│ │ ├─Secret Sharing──────────────►│ │ │
│ │ │ ├─Byzantine──►│ │
│ │ │ │ Detection │ │
│ │ │ │ 33% thresh │ │
│ │ │◄─────────Aggregated Result──┤ │ │
│ │ │ │ │ │
│ │ ├─Build Circuit────────────────────────────►│ │
│ │ │ ├─3 Hop────────►│
│ │ │ │ Onion Routing │
│ │ ├─Post to Bulletin──────────────────────────────────────────►│
│ │ │ ├─Consensus
│ │ │ │ Round
│ │ │◄──────────────────────────Privacy Receipt─────────────────┤
│◄────Response────┼──────────────────┤ │
The gradient privacy processing flow represents the confluence of multiple privacy-preserving technologies working in concert. Each gradient update undergoes clipping to bound its L2 norm, receives carefully calibrated Gaussian noise to ensure differential privacy, and is committed using Pedersen commitments before zero-knowledge proof generation.
Zero-Knowledge Proof Generation
The implementation of Groth16 SNARKs on the BN254 curve represents a masterclass in applied cryptography. With proof sizes of merely 200 bytes and generation times averaging 100 milliseconds, the system achieves the holy grail of zero-knowledge systems: practical efficiency without compromising security.
# Differential Privacy Application gradient_norm = np.linalg.norm(gradient) if gradient_norm > self.config.clip_norm: gradient *= (self.config.clip_norm / gradient_norm) # Add Gaussian Noise for (ε,δ)-DP noise_stddev = self.config.noise_multiplier * self.config.clip_norm noise = np.random.normal(0, noise_stddev, gradient.shape) private_gradient = gradient + noise # Generate Pedersen Commitment r = self._generate_random_field_element() commitment = multiply(G1, int.from_bytes( hashlib.sha256(gradient.tobytes()).digest(), 'big' ) % curve_order) commitment = add(commitment, multiply(H1, r)) # Construct Groth16 Proof async def _generate_groth16_proof(self, gradient, commitment): # Generate proof elements on BN254 curve r = self._generate_random_field_element() s = self._generate_random_field_element() proof_a = multiply(G1, r) # G1 element proof_b = multiply(G2, s) # G2 element proof_c = multiply(G1, (r * s) % curve_order) # G1 element return ZKProof( proof_a=proof_a, proof_b=proof_b, proof_c=proof_c, commitment=commitment, public_inputs=[ str(self.config.epsilon), str(self.config.delta), str(np.sum(gradient)) ] )
ZERO-KNOWLEDGE PROOF CONSTRUCTION FLOW
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Gradient Input │─────►│ Differential │─────►│ Commitment │
│ │ │ Privacy │ │ Generation │
└─────────────────┘ └─────────────────┘ └────────┬────────┘
│ │
┌───────▼────────┐ ▼
│ Clip L2 Norm │ ┌─────────────────┐
│ norm ≤ 1 │ │ Hash Gradient │
└───────┬────────┘ └────────┬────────┘
│ │
┌───────▼────────┐ ▼
│ Add Gaussian │ ┌─────────────────┐
│ Noise σ=0.1 │ │Generate Random │
└───────┬────────┘ │ Field Element │
│ └────────┬────────┘
│ │
└──────────┬────────────────┘
│
┌──────────▼──────────┐
│ Circuit Creation │
│ Private Witness │
│ Public Inputs │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Groth16 Proof │
│ Generation on │
│ BN254 Curve │
└──────────┬──────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Proof A (G1) │ │ Proof B (G2) │ │ Proof C (G1) │
│ ~100 bytes │ │ ~100 bytes │ │ ~100 bytes │
└──────────────┘ └──────────────┘ └──────────────┘
Byzantine Consensus & Bulletin Board
HoneyBadgerBFT, the Byzantine fault-tolerant consensus protocol at the heart of the bulletin board system, achieves what was once thought impossible: asynchronous Byzantine consensus with optimal communication complexity. With tolerance for up to 33% malicious nodes, the system maintains consistency and availability even under adversarial conditions.
The protocol operates in three distinct phases: reliable broadcast ensures all honest nodes receive the same messages, binary agreement reaches consensus on message inclusion, and the commit phase constructs Merkle trees for cryptographic verification. This elegant dance of distributed agreement occurs in approximately one second, a remarkable achievement for Byzantine consensus at scale.
Phase 1: Reliable Broadcast
- ECHO messages from N nodes
- READY threshold: 2f+1 nodes
- DELIVER decision on agreement
- Prevents equivocation
Phase 2: Binary Agreement
- Propose binary value (0/1)
- Collect votes from nodes
- Decide with 33% fault tolerance
- Guaranteed termination
Phase 3: Commit
- Merkle tree construction
- Cryptographic proof generation
- Bulletin board storage
- Client notification dispatch
HONEYBADGER CONSENSUS FLOW
Message Collection HoneyBadgerBFT Phases Commit & Store
───────────────── ─────────────────────── ──────────────
┌──────────┐ ┌────────────────┐ ┌──────────────┐
│Message 1 │──┐ │ Reliable │ │Build Merkle │
├──────────┤ │ │ Broadcast │ │ Tree │
│Message 2 │──┼───Batch──────────────►├────────────────┤──────────────────►├──────────────┤
├──────────┤ │ Formation │• ECHO Messages │ │ Generate │
│Message N │──┘ (10 msgs) │• READY (2f+1) │ │ Proofs │
└──────────┘ │• DELIVER │ └──────┬───────┘
└────────┬───────┘ │
│ ▼
┌────────▼───────┐ ┌──────────────┐
│ Binary │ │ Store in │
│ Agreement │ │ Bulletin │
├────────────────┤ ├──────────────┤
│• Propose Value │ │ Notify │
│• Vote Collection│ │ Clients │
│• Decide (67%) │ └──────────────┘
└────────────────┘
Consensus Time: ~1 second · Fault Tolerance: 33% · Batch Size: 10 messages
Secure Multi-Party Computation
Through the mathematical elegance of Shamir's secret sharing, the platform enables multiple parties to jointly compute functions over their private inputs without revealing those inputs to each other. This is not theoretical cryptography but practical privacy, enabling federated learning across untrusted nodes while maintaining complete confidentiality of individual contributions.
async def _secret_sharing_aggregation( self, updates: List[GradientUpdate], method: AggregationMethod ) -> torch.Tensor: """Aggregate using secret sharing for privacy""" # Extract gradients from updates gradients = [u.gradient for u in updates] # Generate random masks that sum to zero # This ensures the aggregation is correct while hiding individual values masks = [] for i in range(len(gradients) - 1): mask = torch.randn_like(gradients[0]) * 0.01 masks.append(mask) # Last mask ensures sum is zero (additive secret sharing) if masks: masks.append(-sum(masks)) # Apply masks to hide individual gradients masked_gradients = [g + m for g, m in zip(gradients, masks)] # Byzantine-robust aggregation methods if method == AggregationMethod.MEAN: aggregated = torch.mean(torch.stack(masked_gradients), dim=0) elif method == AggregationMethod.TRIMMED_MEAN: # Remove top and bottom 10% before averaging sorted_grads = torch.sort(torch.stack(masked_gradients), dim=0)[0] trim_size = len(masked_gradients) // 10 aggregated = torch.mean(sorted_grads[trim_size:-trim_size], dim=0) elif method == AggregationMethod.KRUM: # Select gradient with minimum distance to others aggregated = self._krum_aggregation(masked_gradients) return aggregated # Byzantine Detection: Statistical outlier detection variance_threshold = 2.0 * expected_variance byzantine_nodes = [i for i, g in enumerate(gradients) if torch.var(g) > variance_threshold]
Anonymous Routing via Tor
╔════════════════════════════════════════════════════════════════╗
║ TOR CIRCUIT ESTABLISHMENT ║
╠════════════════════════════════════════════════════════════════╣
║ ║
║ Privacy Guard Middle₁ Middle₂ ║
║ Controller Node Node Node ║
║ │ │ │ │ ║
║ ├──Create──────► │ │ ║
║ │ Circuit │ │ │ ║
║ │◄─────────────┤ │ │ ║
║ │ Key K₁ │ │ │ ║
║ │ │ │ │ ║
║ ├──Extend──────►───Extend──────► │ ║
║ │ (Enc: K₁) │ │ │ ║
║ │◄─────────────┼───────────────┤ │ ║
║ │ Key K₂ │ │ │ ║
║ │ │ │ │ ║
║ ├──Extend──────►───Forward─────►──Forward─────► ║
║ │ (Enc: K₁,K₂)│ │ │ ║
║ │◄─────────────┼───────────────┼──────────────┤ ║
║ │ Key K₃ │ │ │ ║
║ │ │ │ │ ║
║ │ TRIPLE ENCRYPTION │ ║
║ ├══Data════════►═══Decrypt═════►══Decrypt═════►═Decrypt═►║
║ │ K₁+K₂+K₃ │ K₁ │ K₂ │ K₃ ║
║ │ │ │ │ ║
║ ║
║ Circuit Build Time: ~300ms · Hops: 3 · Lifetime: 600s ║
║ ║
║ Hidden Services: ║
║ • composer.onion • bulletin-board.onion ║
║ • policy-engine.onion • privacy-controller.onion ║
║ • registry.onion • slo-broker.onion ║
╚════════════════════════════════════════════════════════════════╝
Performance & Compliance
Regulatory Compliance
Hardware Privacy Infrastructure
Hardware Attestation
Triple attestation stack proving code integrity without trusted third parties. NVIDIA CC-On, Intel TDX, and AMD SEV-SNP create hardware-backed trust anchors.
Intel TDX: Encrypted VM memory
AMD SEV-SNP: Memory integrity
Evidence Staleness: 24-hour threshold
Root of Trust: Silicon-backed
Measurement: GPU registers + firmware
Encrypted I/O Pipeline
End-to-end encryption with per-job Data Encryption Keys. All artifacts encrypted with AES-GCM before leaving secure enclaves.
Encryption: AES-GCM authenticated
TLS: Mutual authentication
Decryption: Inside secure enclave only
Provider View: Ciphertext only
Plaintext Location: Protected memory
Secure Aggregation
Committee-based federated learning with Shamir secret sharing over finite fields. Gradient privacy without reconstruction.
Threshold: K-of-M Shamir splits
DP Noise: Gaussian mechanism
Fixed-Point: Scale by 10^6
Key Derivation: HKDF-SHA256
Bulletin Board: Redis/S3/IPFS
Traffic Analysis Resistance
Fixed-size message padding and decentralized bulletin boards prevent size-based analysis. ANOFEL routing obscures metadata.
Bulletin: Decentralized append-only
ANOFEL: Distributed routing
Traffic Pattern: Uniform timing
Metadata: Zero leakage
Per-Round Keys: Ephemeral AES
The Model Bazaar
Dual-Token Compute Economy
The marketplace runs on a two-token architecture: ACU (Actual Compute Units) as the fixed-supply settlement currency, and AVL (Availability Token) as the inflationary utility token rewarding provider liveness. Users pay in ACU, providers earn AVL emissions, then convert to ACU via oracle-priced burns.
Smart contracts on Arbitrum handle trustless settlement. Each job deposits ACU into MirrorMintPool escrow, metering slices track consumption in micro-ACU precision, and settlement routes 80% to providers while burning 20% as protocol fees. Provider availability determines AVL emissions through Merkle airdrops—the longer you stay online, the more you earn.
MirrorMintPool.depositForJob(job_id, microAcu);
// Metering tracks consumption
MeteringService.ingest_slice(job_id, scm_consumed);
// Settlement routes payment
SettlementRouter.settle_job(job_id, provider);
// → 80% provider payment (released_micro_acu)
// → 20% protocol burn (burned_micro_acu)
// → 10% held for disputes (held_micro_acu)
// Provider earns AVL via availability
AvailabilityMerkleMinter.claim(epoch, amount);
// Provider converts AVL → ACU
ConversionRouter.burnAVLForACU(acuAmount);
Developer Experience
Modular Adapter Architecture
The SDK exposes a registry-based adapter system allowing third-party extensions
without platform redeployment. Training, inference, quantization, rendering, and
federated adapters ship by default. Custom workload types register via
register_adapter(), transforming job specs into resource profiles
and execution plans.
Each adapter implements prepare(job_spec) and map_metrics(raw).
The control plane uses ResourceProfile (num_gpus, min_vram_gb, interconnect, features)
for provider matching, while ExecutionPlan (image, command, env, volumes) drives
container orchestration. Adapters normalize telemetry—training emits step/loss/throughput,
inference emits latency_p95_ms/QPS/error_rate.
Most remarkably, the entire distributed infrastructure—metering, settlement, hardware attestation, encrypted I/O—becomes invisible. Developers submit Python functions; the platform handles Docker builds, GPU allocation, privacy-preserving execution, and trustless payment routing.
from gpu_platform import Client, register_adapter
# Custom adapter for fine-tuning workloads
class FineTuneAdapter(Adapter):
def prepare(self, job_spec):
return ResourceProfile(
num_gpus=job_spec.get("num_gpus", 1),
min_vram_gb=40,
features=("cuda>=12.1", "peft", "bitsandbytes")
), ExecutionPlan(...)
# Register adapter (zero platform changes)
register_adapter("finetune", FineTuneAdapter)
# Submit fine-tuning job
client = Client(api_key="your_key")
job = client.submit(
adapter="finetune",
base_model="llama-70b",
dataset="custom_data.jsonl",
lora_rank=64
)
# Platform handles: hardware attestation, encrypted I/O,
# metering, settlement, ACU payment routing
print(f"Job: {job.id}, Cost: {job.cost_micro_acu / 1e6} ACU")
Privacy Service Integration Architecture
The privacy architecture's true power emerges from the seamless integration of its components through the Kong API Gateway. Three critical plugins—privacy-budget, anofel-provenance, and msi-degraded—form the first line of defense, validating every request against privacy policies before it enters the system.
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway Integration (Kong :8000) │
├─────────────────────────────────────────────────────────────────┤
│ • privacy-budget plugin (priority: 1000) │
│ - Headers: X-Privacy-Epsilon, X-Privacy-Grant │
│ - Min Budget: 0.1, Cache TTL: 120s │
│ │
│ • anofel-provenance plugin (priority: 950) │
│ - Headers: X-AnoFel-Proof, X-Tor-Signature │
│ - Proof Verification: Base64 → JSON → Validate │
│ │
│ • msi-degraded plugin │
│ - Graceful Degradation, Read-Only Methods │
└─────────────────┬───────────────────────────────────────────────┘
│
┌─────────────┼─────────────────────┐
▼ ▼ ▼
Privacy Privacy Stack Privacy Suite
Controller :8140 :8141
:8009 Budget Mgmt Components 27-32
│ │ │
└─────────────┼─────────────────────┘
▼
Business Services
(Composer, Policy, Registry, SLO)
Privacy Budget Ledger & Enforcement
The privacy budget ledger implements a sophisticated accounting system for differential privacy resources. Each operation consumes a portion of the privacy budget (ε, δ), tracked with microsecond precision and enforced through distributed consensus.
Privacy Budget Enforcement Flow
───────────────────────────────
Request ──► Extract Headers ──► Validate Format ──► Redis Lookup
│ │
▼ ▼
Check Format Database Query
ε ∈ [0.1, 5.0] │
δ ∈ [1e-9, 1e-3] ▼
Calculate Remaining
│
Budget Sufficient?
╱ │ ╲
Allow Throttle Deny
│ │
Consume Budget Log Violation
│
Update Cache & DB
│
Audit Trail
Configuration Parameters:
• Max Epsilon: 5.0 • Refresh Interval: 3600s
• Max Delta: 1e-5 • Max Tokens: 5
• Min Budget: 0.1 • Cache TTL: 120s
Security Vault Database Schema
The Security Vault maintains eight critical tables for managing cryptographic materials and audit trails. Built on BadgerDB for performance with PostgreSQL for compliance tracking, it supports RSA, ECDSA, ED25519, AES, and ChaCha20Poly1305 operations with hardware security module integration.
-- Key rotation with audit trail CREATE OR REPLACE FUNCTION rotate_key( p_old_key_id UUID, p_new_key_id UUID, p_rotated_by VARCHAR ) RETURNS UUID AS $$ DECLARE v_rotation_id UUID; BEGIN -- Create rotation record INSERT INTO key_rotation_history ( old_key_id, new_key_id, rotated_by ) VALUES ( p_old_key_id, p_new_key_id, p_rotated_by ) RETURNING id INTO v_rotation_id; -- Deactivate old key UPDATE keys SET active = FALSE WHERE id = p_old_key_id; -- Update rotation timestamp UPDATE keys SET rotated_at = NOW() WHERE id = p_new_key_id; RETURN v_rotation_id; END; $$ LANGUAGE plpgsql; -- Tables: keys, secrets, certificates, key_rotation_history, -- hsm_keys, encryption_operations, audit_log, compliance_records
System Architecture
The distributed GPU network, while revolutionary in its ambitions, rests upon a foundation of carefully orchestrated components. This technical appendix documents the actual implementation—a system battle-tested across thousands of reservations, millions of compute minutes, and hundreds of provider nodes.
At its core, the architecture separates concerns across three primary layers: the control plane for orchestration and economic coordination, the provider network for compute execution, and the settlement layer anchored in Arbitrum smart contracts for trustless financial settlement.
The PriceIndexOracleService implements uniform-price bucket auctions with surge multipliers and per-entity capacity caps. Demand configuration accepts micro-SCM requirements plus reserve buffers; supply submission accumulates offers sorted by price. Bucket finalization executes a modified uniform-price clearing: offers are sorted ascending, demand is filled sequentially, and the marginal price becomes the clearing price for all accepted units.
Surge multipliers apply when utilization exceeds 95%, scaling linearly from 1.0×
at 95% to 1.5× at 100%. Entity caps prevent single providers from capturing more
than 30% of any bucket's supply—a critical anti-manipulation guard. The resulting
BucketResult contains clearing price, utilization basis points, surge
multiplier, and filled micro-SCM, all persisted locally before being mirrored to
the on-chain PriceIndexOracle contract.
The VRACUScheduler implements Dominant Resource Fairness with attained-service scoring. Each provider maintains a running total of attained service minutes; new allocations favor providers with lower historical utilization. The fairness score combines expected job duration with attained service ratio, preventing long-running providers from monopolizing capacity while ensuring hardware constraints (VRAM, interconnect class) are satisfied.
NVIDIA MIG (Multi-Instance GPU) support extends the scheduler's capacity model. Provider registration records MIG profile, partition count, and per-partition memory. The scheduler normalizes SCM rates on a per-partition basis, treating each MIG slice as an independent scheduling unit. This enables fine-grained multi-tenancy without sacrificing fairness or resource isolation.
Metering & Settlement
The MeteringService provides strict idempotency guarantees via SHA-256 content hashing. Each meter slice contains job ID, bucket ID, sequence number, SCM delta, and the price index at time of execution. Duplicate submissions (identical hash) succeed silently; conflicting payloads (same sequence, different hash) return rejection with reason code.
Settlement aggregation computes SCM-weighted time-averaged pricing (TWAP) across
all slices for a job. The formula: Σ(minutes × price) / Σ(minutes).
Burn amounts apply ceiling rounding to micro-ACU units, ensuring providers never
receive fractional tokens. Hold fractions (0.0–1.0) split burn amounts between
immediate provider payout and refund escrow, enabling dispute resolution without
blocking settlement.
# Aggregate metering slices
total_minutes = sum(slice.minutes for slice in slices)
numerator = sum(slice.minutes * slice.price for slice in slices)
# Compute burn with ceiling rounding
burn_micro_acu = ceil(numerator / mint_price_micro_usd)
# Apply hold fraction for dispute buffer
provider_micro_acu = int(burn_micro_acu * (1.0 - hold_fraction))
refund_micro_acu = burn_micro_acu - provider_micro_acu
# Generate canonical receipt (JCS sorted keys)
receipt = {
"burn_micro_acu": burn_micro_acu,
"job_id": job_id,
"mint_price_micro_usd_per_acu": mint_price,
"provider": provider_address,
"provider_micro_acu": provider_micro_acu,
"refund_micro_acu": refund_micro_acu,
"twap_micro_usd_per_scm": numerator // total_minutes
}
The DualSignatureService produces cryptographic attestations for every settlement receipt. Primary signatures use Ed25519 with embedded public keys (32-byte seed expanded via SHA-512). Secondary signatures employ a post-quantum envelope: 64-byte secrets processed through SHAKE-256 XOF, yielding Dilithium3-compatible signing material.
Both signatures cover the JCS-canonicalized receipt JSON (sorted keys, minimal
whitespace). The control plane persists signatures alongside receipts in SQLite,
enabling offline verification without blockchain round-trips. The
GET /settlement/receipt/{job_id} endpoint returns the canonical
receipt, both signature envelopes, and current Mirror-Mint escrow state.
Workload Adapters
Modularity, not monoliths. Unlike traditional job schedulers that hardcode workload types into platform logic, the distributed compute network employs adapters— protocol-based transformers registered at runtime via a plugin architecture. Each adapter converts high-level job specifications into resource profiles and execution plans, enabling the same user code to run across Docker, Ray clusters, or Kubernetes without modification. Third-party developers extend the platform by registering custom adapters without touching core infrastructure code.
Registry-Based Architecture
The AdapterFactory pattern decouples adapter implementations from scheduler logic.
A global registry maps adapter names to factory functions, allowing hot-swapping of adapters
without redeploying the launcher service. The registry initializes with five core adapters
(training, inference, quantization, rendering, federated), but any module can call
register_adapter(name, factory) to inject custom transformation logic.
# Core registry with built-in adapters
_ADAPTER_REGISTRY: Dict[str, AdapterFactory] = {
"training": TrainingAdapter,
"inference": InferenceAdapter,
"render": RenderingAdapter,
"quant": QuantizationAdapter,
}
# Third-party registration (zero core changes)
def register_adapter(name: str, factory: AdapterFactory) -> None:
_ADAPTER_REGISTRY[name.lower()] = factory
# Example: Custom fine-tuning adapter
class FineTuneAdapter(Adapter):
def prepare(self, job_spec):
profile = ResourceProfile(
num_gpus=job_spec.get("num_gpus", 1),
min_vram_gb=40, # LoRA/QLoRA requirements
features=("cuda>=12.1", "peft", "bitsandbytes")
)
# Custom logic for parameter-efficient tuning...
return profile, plan
# Register without platform redeployment
register_adapter("finetune", FineTuneAdapter)
The adapter protocol defines two primary operations: prepare(job_spec)
transforms declarative requirements into a (ResourceProfile, ExecutionPlan)
tuple, while map_metrics(raw) normalizes heterogeneous telemetry into
standardized metering signals. This abstraction allows the control plane to allocate
providers based on resource constraints without understanding framework-specific details.
Training Adapter
The TrainingAdapter prepares distributed training jobs with multi-GPU coordination strategies. It accepts job specs containing image references, command arrays, VRAM requirements, and interconnect preferences. The adapter injects environment variables for DDP (DistributedDataParallel) or FSDP (Fully Sharded Data Parallel) rendezvous, configures volume mounts for dataset access, and sets priority metadata for queue ordering.
job_spec = {
"image": "ghcr.io/org/training:v2.1",
"command": ["python", "-m", "torch.distributed.run", "train.py"],
"num_gpus": 8,
"min_vram_gb": 80,
"interconnect": ["nvlink"],
"scm_minutes": 720,
"features": ["cuda>=12.1", "nccl"],
"strategy": "ddp"
}
adapter = TrainingAdapter()
profile, plan = adapter.prepare(job_spec)
# Metrics normalization
raw_metrics = {"step": 1024, "loss": 0.42, "throughput": 2048}
normalized = adapter.map_metrics(raw_metrics)
# → {"step": 1024, "loss": 0.42, "throughput": 2048}
Inference Adapter
The InferenceAdapter targets long-running model serving deployments. Unlike batch training jobs, inference workloads require service-oriented execution: health probes (readiness/liveness), autoscaling configurations, load balancer exposure, and rolling update strategies. The adapter generates Kubernetes-compatible execution plans with service ports, replica counts, and horizontal pod autoscaling parameters.
Health probes default to HTTP GET requests against /health endpoints,
with configurable initial delays and check intervals. Service types (ClusterIP,
LoadBalancer, NodePort) control network exposure. Autoscaling policies define CPU/memory
thresholds triggering replica scale-up, enabling elastic capacity matching demand spikes.
Federated Learning Adapter
The FederatedAdapter prepares multi-party training with privacy-preserving aggregation. Job specs include: world_size (participant count), committee parameters (K-of-M threshold for Shamir secret sharing), differential privacy budgets (epsilon/delta), and bulletin board backend configuration (Redis, S3, or IPFS).
The adapter injects environment variables controlling aggregation behavior:
FED_PACKET_BYTES (fixed-size message padding, default 128KB),
FED_ROUNDS (training iterations), FED_DP_EPSILON (privacy budget),
FED_BB_BACKEND (bulletin board type), and FED_ROUND_SECRET
(shared secret hex for key derivation). These parameters enable secure gradient
aggregation without trusted coordinators.
fed_spec = {
"image": "ghcr.io/vracu/federated:latest",
"command": ["python", "federated_train.py"],
"world_size": 8,
"committee_k": 3, # Threshold
"committee_m": 5, # Total shares
"dp_epsilon": 3.0,
"dp_delta": 1e-5,
"bb_backend": "redis",
"bb_uri": "redis://localhost:6379/0",
"rounds": 50,
"packet_bytes": 131072 # 128KB padding
}
adapter = FederatedAdapter()
profile, plan = adapter.prepare(fed_spec)
# Execution plan includes privacy env vars
assert plan.env["FED_DP_EPSILON"] == "3.0"
Quantization and rendering adapters follow similar patterns, specializing resource requirements and metric extraction for their respective workloads. The quantization adapter handles model compression tasks (GPTQ, AWQ, bitsandbytes), while the rendering adapter manages visual workloads with GPU rasterization demands.
Privacy-Preserving Computation
The network's privacy guarantees emerge from three complementary layers, each addressing a distinct threat model. Hardware attestation prevents operators from inspecting workload data. Encrypted input/output pipelines protect data in transit and at rest. Cryptographic secure aggregation prevents peers from observing individual contributions during federated training. Together, these mechanisms enable confidential computation on untrusted infrastructure.
Hardware Attestation
Provider nodes collect attestation evidence proving workloads execute inside hardware-protected enclaves. Three attestation technologies integrate via composite providers: NVIDIA Confidential Computing (CC-On), Intel TDX (Trust Domain Extensions), and AMD SEV-SNP (Secure Encrypted Virtualization with Secure Nested Paging).
Intel TDX provides VM-level isolation with encrypted memory and integrity protection. Attestation quotes prove execution inside a Trust Domain, with measurements covering firmware, kernel, and initial ramdisk. Control plane verification checks quote signatures against Intel's root keys, ensuring authenticity.
AMD SEV-SNP extends SEV with stronger memory integrity guarantees. SNP reports include platform measurements and VM guest policy, preventing malicious hypervisor tampering. Combined with encrypted memory, SNP isolates guest execution from host observation.
The CompositeAttestationProvider aggregates evidence from multiple sources, enabling hybrid deployments (e.g., NVIDIA GPU inside Intel TDX VM). Async evidence collection occurs during provider attach, with challenge-response protocols ensuring freshness. Stale evidence (> 24 hours) triggers re-attestation before accepting privacy-tier workloads.
# Filesystem-based attestation providers
nvidia_provider = NvidiaCcOnFilesystemProvider(
spdm_chain_paths=[Path("/sys/kernel/debug/nvidia-cc-on/spdm")],
gpu_report_paths=[Path("/sys/kernel/debug/nvidia-cc-on/gpu_report")]
)
tdx_provider = TdxFilesystemProvider(
quote_paths=[Path("/sys/kernel/config/tsm/report/tdx_quote")]
)
# Composite aggregation
composite = CompositeAttestationProvider([nvidia_provider, tdx_provider])
# Challenge-response for freshness
challenge = secrets.token_hex(32)
evidence = await composite.produce(challenge)
# Evidence structure: {provider_name: {type: data}}
assert "nvidia_cc_on" in evidence
assert "tdx" in evidence
Encrypted Input/Output Pipeline
Job artifacts (datasets, model checkpoints, configuration files) never touch provider disks in plaintext. The launcher service generates per-job Data Encryption Keys (DEKs), encrypting all artifacts with AES-GCM authenticated encryption. DEKs transfer to provider sidecars via TLS mutual authentication, with client certificate fingerprint validation preventing unauthorized access.
The sidecar's decrypt shim fetches DEKs at job start, decrypts artifacts inside the secure enclave, and launches the workload. Operators observe only ciphertext blobs—plaintext exists exclusively within hardware-protected memory. Output encryption reverses the flow: results encrypt before leaving the enclave, with DEKs accessible only to job initiators.
# Sidecar fetches DEK using TLS client cert
dek = fetch_dek(
launcher_url="https://launcher.vracu.net",
job_id="job-42",
fingerprint_sha256=cert_fingerprint,
cert_bundle={
"cert": "/tls/client.pem",
"key": "/tls/client.key",
"ca": "/tls/ca.pem"
}
)
# Decrypt artifacts inside enclave
for artifact in encrypted_artifacts:
nonce = base64.b64decode(artifact["nonce"])
ciphertext = base64.b64decode(artifact["ciphertext"])
aad = base64.b64decode(artifact.get("aad", ""))
# AES-GCM decrypt with authentication
plaintext = AESGCM(dek).decrypt(nonce, ciphertext, aad)
# Write to secure enclave filesystem
path = secure_workdir / artifact["path"]
path.write_bytes(plaintext)
# Plaintext never leaves enclave
Federated Learning with Secure Aggregation
The network implements committee-based secure aggregation combining Shamir secret sharing, differential privacy, and bulletin board coordination. Unlike naive averaging (where aggregators observe raw gradients), this protocol ensures no party—including the coordinator—sees individual contributions.
Each training round proceeds as follows: participants add Gaussian noise to gradients (satisfying (ε,δ)-differential privacy), encode noisy gradients as field elements over GF(4,294,967,291), split into M shares via Shamir's scheme (K-of-M threshold), encrypt shares with per-pair AES keys, post ciphertexts to bulletin board, collect K shares addressed to them, decrypt and reconstruct via Lagrange interpolation, average reconstructed gradients, and broadcast the aggregate.
Committee selection employs deterministic random sampling seeded
by round ID, ensuring all participants agree on the committee without coordination.
Key derivation uses HKDF-SHA256 with context strings encoding
round, sender, and recipient identities: fed-round:{round}:from:{sender}:to:{recipient}.
This generates unique AES keys for each communication pair per round.
The bulletin board abstraction supports three backends: Redis (RPUSH/LRANGE for ordered streams), S3 (timestamped objects with lexicographic ordering), and IPFS (content-addressed immutable logs). Fixed-size message padding (default 128KB) prevents size-based traffic analysis.
# Add DP noise (Gaussian mechanism)
sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon
noisy_grad = gradient + np.random.normal(0.0, sigma, gradient.shape)
# Encode as field elements (scale by 10^6)
scaled = np.rint(noisy_grad * 1e6).astype(np.int64)
field_vals = np.mod(scaled, FIELD_MODULUS).astype(np.uint64)
# Shamir split (K=3, M=5)
shares = shamir_split(field_vals, k=3, m=5, seed=round_id)
# Encrypt shares for committee members
for share, member in zip(shares, committee):
aes_key = _derive_key(shared_secret, round_id, rank, member)
cipher = AESGCM(aes_key)
nonce = os.urandom(12)
ciphertext = cipher.encrypt(nonce, share.tobytes(), None)
# Post to bulletin board with padding
bulletin.post(
topic=f"fed/{round_id}/shares",
payload=pad_to_fixed(json.dumps(payload), 131072)
)
# Collect K shares, reconstruct via Lagrange
aggregates = _collect_shares(bulletin, rank, round_id, secret)
combined = shamir_combine(aggregates[:k])
decoded = _decode_gradient(combined) / world_size
This multi-layered approach achieves end-to-end confidentiality: hardware attestation proves code integrity, encrypted I/O protects data in motion, and secure aggregation prevents gradient leakage. The combination enables privacy-preserving machine learning on commodity hardware without trusted third parties—a capability previously requiring specialized secure enclaves or multiparty computation protocols.
Dual-Token Economic Model
Two tokens, distinct roles, unified economy. The network employs a dual-token architecture where ACU (Actual Compute Units) serves as the fixed-supply settlement currency, while AVL (Availability Token) functions as the inflationary utility token rewarding provider liveness. This separation creates economic pressure: ACU scarcity drives value appreciation, AVL emissions incentivize capacity contribution, and the ConversionRouter bridges them via oracle-priced burns—transforming availability into settlement rights.
ACU: The Settlement Token
ACU implements a fixed-supply ERC20 (18 decimals) with no mint function post-deployment. The total supply (S_MAX) initializes at construction and remains immutable—every ACU that will ever exist mints to the treasury address during contract deployment. This design choice transforms ACU into a deflationary settlement currency: as compute demand grows, fixed supply creates scarcity pressure.
Users deposit ACU into the MirrorMintPool escrow contract for job execution. Each job receives an isolated escrow account tracking: deposited_micro_acu, burned_micro_acu, released_micro_acu (provider payments), refunded_micro_acu, and held_micro_acu (dispute buffer). Settlement burns protocol fees to the treasury while routing provider payouts and refunding unused balances—all operations occur in micro-ACU (1e-6 ACU) precision to minimize rounding losses.
AVL: The Availability Token
AVL implements an ERC20 with role-gated minting and burning. The contract enforces a MAX_SUPPLY cap but allows addresses holding the MINTER_ROLE to create new tokens below this ceiling. Daily emissions distribute AVL to providers via Merkle airdrops proportional to their availability scores— the longer a provider maintains liveness (passing heartbeat checks), the more AVL they earn.
Providers stake AVL in the AvailabilityStaking contract to signal commitment. Staked amounts act as economic bonds: misbehavior (failed jobs, missed heartbeats) triggers slashing via the SLASHER_ROLE, burning a percentage of the stake. Unstaking requires a cooldown period preventing providers from exiting immediately before slashing events. The staking mechanism creates skin-in-the-game: providers risk capital to participate, and penalties enforce service quality.
# 1. User deposits ACU for job execution
MirrorMintPool.depositForJob(job_id="job-42", microAcu=10_000_000)
# Escrow: {deposited: 10M micro-ACU, burned: 0, released: 0}
# 2. Job executes, metering slices recorded
MeteringService.ingest_slice(
job_id="job-42",
minutes_delta_scm_micro=100_000, # 100 SCM consumed
priceindex_micro_usd_per_scm=120_000 # $0.12/SCM
)
# 3. Settlement aggregates slices, computes TWAP
result = SettlementRouter.settle_job(
job_id="job-42",
provider="gpu-a100-01",
hold_fraction=0.1 # 10% held for disputes
)
# burn_micro_acu: 2,000,000 (20% protocol fee)
# provider_micro_acu: 7,200,000 (90% of 8M)
# refund_micro_acu: 800,000 (unused escrow)
# 4. Provider earns daily AVL emissions
AvailabilityMerkleMinter.claim(
epochId="2025-11-04",
to="gpu-a100-01",
amount=1000 * 1e18, # 1000 AVL
proof=merkle_proof
)
# 5. Provider burns AVL to mint ACU (via oracle price)
ConversionRouter.burnAVLForACU(
acuAmount=500 * 1e18, # Mint 500 ACU
recipient="gpu-a100-01"
)
# Oracle: 1 ACU = 2.5 AVL at current TWAP
# Burns: 1250 AVL, Mints: 500 ACU
ConversionRouter: The Bridge
The ConversionRouter contract implements one-way conversion: burn AVL → mint ACU. The oracle-determined exchange rate reflects market-discovered pricing: as compute demand increases relative to provider supply, the ACU price (denominated in AVL) rises. The router enforces ACU_MAX_SUPPLY—cumulative mints cannot exceed this ceiling—preventing infinite inflation even as AVL emissions continue.
The conversion mechanism creates economic alignment: providers earn AVL through availability (passive income), accumulate stakes, then convert to ACU when settlement demand materializes (active income). Users purchasing ACU on secondary markets indirectly reward past provider contributions. The dual-loop structure— AVL emissions incentivize long-term capacity, ACU scarcity rewards immediate execution—balances supply-side growth with demand-side sustainability.
On-Chain Primitives
The settlement layer anchors economic finality in Arbitrum Nitro—a Layer 2 optimized for EVM execution with sub-second confirmation times and negligible gas costs. Seven Solidity contracts form the on-chain substrate: ACUToken, MirrorMintPool, PriceIndexOracle, BurnGovernor, ProtocolFeePool, AvailabilityToken, and ConversionRouter.
ACUToken implements a fixed-supply ERC-20 representing Standard Compute Minutes. Total supply is immutable post-deployment; no mint/burn functions exist, preserving the supply invariant. The treasury holds initial allocation; governance controls rescue functions for accidentally transferred tokens.
MirrorMintPool manages job escrow with burn/release/hold state
machines. The depositForJob function accepts micro-ACU deposits,
recording deposited amounts per job ID. Settlement authorities (authorized by
governance) invoke settleJob with burn amounts, provider addresses,
and receipt hashes. Burn amounts route to treasury; provider payments execute
immediately; hold fractions freeze pending governance review.
// Solidity settlement entrypoint
function settleJob(
bytes32 jobId,
uint256 burnMicroAcu,
address provider,
uint256 providerMicroAcu,
bytes32 receiptHash
) external onlyAuthority nonReentrant {
Escrow storage esc = _escrows[jobId];
require(!esc.finalized, "already finalized");
// Burn protocol fees to treasury
esc.burnedMicroAcu += burnMicroAcu;
_pushTokens(treasury, burnMicroAcu);
// Pay provider immediately
esc.releasedMicroAcu += providerMicroAcu;
_pushTokens(provider, providerMicroAcu);
esc.receiptHash = receiptHash;
emit Burned(jobId, burnMicroAcu);
emit ProviderPaid(jobId, provider, providerMicroAcu);
}
PriceIndexOracle records bucket configurations and clearing results.
Demand oracles (governance-authorized) call configureBucketDemand with
micro-SCM requirements and commit deadlines. Supply submitters post offers before
finalization. The control plane executes clearing off-chain, then publishes results
via finalizeBucket, emitting clearing price and surge multiplier events.
BurnGovernor mediates between settlement receipts and Mirror-Mint
escrow. The settleJob function accepts job IDs and receipt payloads,
extracting burn/provider amounts and forwarding to MirrorMintPool. Governance can
pause settlements system-wide via emergencyPause, halting all burns
without touching escrow state.
ProtocolFeePool accumulates burned ACU and distributes protocol
revenues. Governance proposals withdraw to specified addresses; spending requires
timelock execution (48-hour delay). The pool maintains immutable audit trails via
Withdrawal and FeeAccrued events.
ConversionRouter implements trustless ACU↔AVL swaps using the price oracle as reference. Conversion applies basis-point slippage caps; governance adjusts spreads based on liquidity depth. The router maintains no internal state—all pricing derives from on-chain oracle snapshots.
Provider Infrastructure
The provider network transforms heterogeneous GPU hardware into fungible compute units through a three-layer abstraction: backend runners (Docker, Ray, Kubernetes), capability publishing, and heartbeat coordination.
Provider nodes begin life via alien attach—a CLI tool consuming join
tokens from the directory API. The attach flow redeems tokens for provider IDs,
runs microbenchmarks to calibrate SCM rates, collects attestation evidence, and
publishes capabilities to the control plane.
# Redeem join token for provider ID
provider_id, metadata = redeem_join_token(settings)
# Initialize backend (Ray/K8s/Docker)
if settings.backend.kind == "ray":
address = ray_ensure_head(client_port=settings.backend.ray_port)
backend_payload = {"address": address}
# Run microbenchmarks for SCM rate calibration
scm_rate = derive_rate_micro()
# Collect attestation evidence (SGX/TPM)
if settings.privacy.enable_attestation:
collect_evidence(settings)
# Publish capabilities to control plane
publish_capabilities(
settings, provider_id,
backend_payload=backend_payload,
scm_rate_micro=scm_rate
)
# Begin heartbeat loop
run_heartbeat(settings, provider_id, stop_event)
Backend runners isolate workloads via container runtimes. The Docker backend executes jobs as privileged containers with GPU passthrough, SSH-tunneling logs and metrics to the control plane. The Ray backend bootstraps Ray clusters, submitting jobs via the Ray Jobs API with custom resource specifications (GPU count, VRAM, placement strategies). The Kubernetes backend provisions k3s clusters with NVIDIA device plugins, deploying jobs as pods with GPU requests/limits.
The heartbeat agent maintains provider liveness via periodic
POST /heartbeat calls. Each heartbeat includes: provider ID, current
load (running jobs, available VRAM), calibration drift (actual vs. advertised SCM
rate), and attestation refresh timestamps. The directory API marks providers
unavailable after three missed heartbeats (90-second timeout).
Privacy-preserving execution leverages attestation and encrypted inputs. Providers with SGX or TPM capabilities generate remote attestation quotes during attach; the control plane verifies quotes against manufacturer root keys before approving privacy-tier workloads. Encrypted job inputs decrypt inside secure enclaves, ensuring operators never observe plaintext data or intermediate activations.
The relay mechanism enables NAT traversal for home providers. Providers behind firewalls establish persistent WebSocket connections to relay servers; the control plane routes job submissions via relay endpoints. Bi-directional tunneling supports both job dispatch and real-time log streaming without requiring public IPs or port forwarding.
Operational Resilience
Production infrastructure demands resilience mechanisms beyond optimistic execution paths. The system implements circuit breakers, capacity guards, regional isolation, and chaos injection to maintain SLAs under adversarial conditions.
The primary price breaker monitors clearing prices against
configured mint prices. When a bucket clears below mint_price - epsilon,
the breaker opens, halting new demand configuration until price recovery. This
prevents cascading under-pricing that could destabilize provider economics.
The capacity breaker tracks filled SCM across recent buckets
(configurable window, default 10 buckets). If filled capacity falls below
(demand + reserve) × (1 - buffer_pct / 100), the breaker trips,
signaling insufficient supply to meet committed demand. The scheduler rejects
new reservations until capacity recovers.
# Primary price breaker check
threshold = mint_price_micro_usd - epsilon_micro_usd
if clearing_price < threshold:
breaker.trip(
name="primary_price",
reason=f"Clearing {clearing_price} below floor {threshold}"
)
return ConflictError("Breaker open: price floor violated")
# Capacity breaker check
recent_buckets = db.query_buckets(limit=breaker.window_size)
avg_filled = mean([b.filled_micro_scm for b in recent_buckets])
required = (demand + reserve) * (1.0 - buffer_pct / 100.0)
if avg_filled < required:
breaker.trip(
name="capacity",
reason=f"Avg filled {avg_filled} below required {required}"
)
Regional isolation supports active-passive multi-region deployments.
Control plane instances operate in three modes: NORMAL (full read-write),
ISOLATED (local writes queued, remote reads blocked), and
MERGE_REPLAY (reconciling queued writes post-outage).
During regional failures, operators invoke POST /region/isolate,
transitioning the affected region to isolated mode. Local writes persist to
SQLite; remote API calls receive 503 Service Unavailable. Recovery initiates
via POST /region/merge, which replays queued writes against primary
state, resolving conflicts via last-write-wins timestamp comparison.
The resilience controller monitors job health and triggers
automatic reallocation. Jobs exceeding SLA thresholds (95th percentile latency,
failure rate > 5%) receive priority reallocation to higher-tier providers.
The controller maintains a reallocate queue; operators approve/reject proposals
via POST /resilience/approve/{job_id}.
Observability surfaces breaker state, queue depths, and regional health via Prometheus metrics and Grafana dashboards. Critical alerts fire on: breaker open > 5 minutes, queue depth > 1000 tasks, missed heartbeats > 10% of fleet, settlement failures > 1% of volume.
Payment Infrastructure
While on-chain settlement handles provider payouts and protocol fees, enterprise users require fiat on-ramps. The payment stack bridges traditional finance via Stripe webhooks, double-entry ledger accounting, and ACU wallet provisioning.
The Stripe service listens for checkout.session.completed
webhooks, validating HMAC signatures before processing. Successful checkouts trigger:
credit issuance to user wallets, ledger debit/credit pairs (fiat → ACU), and
escrow deposits for immediate job execution.
# Validate webhook signature
event = stripe.Webhook.construct_event(
payload, sig_header, endpoint_secret
)
if event['type'] == 'checkout.session.completed':
session = event['data']['object']
# Extract metadata
user_id = session.metadata['user_id']
usd_cents = session.amount_total
acu_micro = int(usd_cents * ACU_PER_CENT_MICRO)
# Issue credits to ACU wallet
wallet.deposit(user_id, acu_micro)
# Record double-entry ledger transaction
ledger.record_transaction(
debit_account="fiat:stripe",
credit_account=f"acu_wallet:{user_id}",
amount_micro_acu=acu_micro,
metadata={"stripe_session": session.id}
)
The ledger service implements immutable double-entry accounting.
Every transaction creates two ledger entries: one debit, one credit, summing to zero.
Account types include: fiat:stripe (external fiat inflows),
acu_wallet:* (user balances), escrow:* (job deposits),
protocol:treasury (burned fees), and provider:* (payout accounts).
Monthly reconciliation queries aggregate ledger entries, verifying: Σ(debits) = Σ(credits), user wallet balances match sum of deposits minus escrow, and protocol treasury equals cumulative burns. Discrepancies trigger alerts and halt settlements pending manual review.
Escrow orchestration coordinates off-chain credits with on-chain
deposits. When users initiate jobs, the payment processor: debits ACU wallets,
credits escrow accounts, invokes MirrorMintPool.depositForJob, and
persists transaction hashes for audit trails.
Provider payouts reverse the flow: settlement receipts trigger wallet credits,
escrow debits, and optional fiat conversions. Providers configure payout rails
(on-chain ACU, Stripe transfers, wire) via PATCH /providers/{id}/payout.
The payout service batches settlements daily, minimizing gas costs via Merkle batching.
Developer SDK
The Phase 4 SDK encapsulates reservation loops, metering, settlement, and receipt verification behind Pythonic interfaces. Machine learning engineers integrate distributed compute with minimal infrastructure knowledge.
The ControlPlaneClient provides authenticated HTTP transport. Retry logic handles transient failures (exponential backoff, jittered delays); rate limit detection (HTTP 429) triggers automatic back-pressure. SSL context support enables custom certificate validation for private deployments.
from phase4_sdk import ControlPlaneClient, ReservationLoop
# Initialize client with API credentials
client = ControlPlaneClient(
base_url="https://control.vracu.network",
api_key=os.getenv("VRACU_API_KEY")
)
# Configure reservation parameters
loop = ReservationLoop(
client=client,
required_scm_minutes=1000,
min_vram_gb=40,
preferred_interconnect=["nvlink", "pcie"]
)
# Execute reservation → allocation → metering → settlement
result = loop.execute(
job_id="train-gpt-neo-2.7b",
workload_fn=lambda provider: train_model(provider)
)
# Receipt includes cryptographic proof
print(f"Settlement receipt: {result.receipt}")
print(f"Ed25519 signature: {result.signature_primary}")
print(f"PQ envelope: {result.signature_secondary}")
The ReservationLoop orchestrates multi-phase workflows. Phase 1 submits supply offers to the oracle. Phase 2 waits for bucket finalization. Phase 3 invokes the scheduler, receiving provider allocation. Phase 4 executes workloads, streaming meter slices to the control plane. Phase 5 polls settlement status, retrieving signed receipts upon job completion.
Integration examples demonstrate Ray, Modal, and Kubernetes adapters. The
Ray integration submits jobs via ray.job_submission.submit,
tailing logs for metering signals. The Modal integration wraps
Modal functions with VR-ACU reservation context, transparently routing compute
through the provider network. The Kubernetes integration generates
pod specs with GPU resource requests, applying VR-ACU annotations for cost attribution.
The CLI tool exposes SDK functionality via terminal commands.
vracu reserve initiates reservations, vracu meter ingests
manual slices, vracu settle forces settlement, and vracu receipt
verifies signatures. Shell completion scripts support bash, zsh, and fish.
Observability
Production observability leverages Prometheus metrics, structured logging, and distributed tracing to surface system health, performance bottlenecks, and failure modes.
Prometheus metrics export from GET /metrics
endpoints across all services. Control plane metrics include: queue depth
(async tasks pending), HTTP latency histograms (p50/p95/p99), breaker state
(binary open/closed), oracle finalization durations, and settlement batch sizes.
Provider metrics expose: GPU utilization percentages, VRAM allocated/free, job counts (running/queued/failed), heartbeat intervals, and attestation refresh timestamps. Grafana dashboards aggregate fleet-wide statistics, alerting on: utilization < 60% (underutilized), failures > 5% (reliability degradation), heartbeat gaps > 90s (connectivity issues).
# Prometheus metric definitions
vracu_queue_depth = Gauge(
'vracu_queue_depth',
'Pending async tasks in control plane queue'
)
vracu_http_duration = Histogram(
'vracu_http_duration_seconds',
'HTTP request duration',
['method', 'path', 'status']
)
vracu_breaker_state = Gauge(
'vracu_breaker_state',
'Circuit breaker state (0=closed, 1=open)',
['name']
)
vracu_settlement_batch_size = Histogram(
'vracu_settlement_batch_size',
'Number of jobs per settlement batch'
)
Structured logging emits JSON lines to stdout, ingested by log aggregators (Loki, Elasticsearch). Log entries include: trace IDs (for correlation), log levels (DEBUG/INFO/WARN/ERROR), component names, and contextual metadata (job IDs, provider IDs, transaction hashes).
Distributed tracing via OpenTelemetry instruments HTTP handlers, database queries, and blockchain transactions. Trace spans propagate across service boundaries via W3C Trace Context headers. Jaeger UI visualizes request flows, surfacing latency waterfalls and failure attribution.
Alerting rules fire on SLO violations, breaker openings, and anomaly detection. PagerDuty integration routes critical alerts to on-call engineers; Slack webhooks notify teams of warnings. Alert fatigue mitigation groups correlated alerts (multiple breakers from same root cause) into single incidents.
As we stand at the threshold of a new era in distributed computing, the implications extend far beyond technical specifications. This infrastructure represents a reimagining of power dynamics in the digital age. No longer must innovators genuflect before the altar of cloud providers. No longer must privacy be sacrificed for performance.
The distributed GPU network is more than infrastructure—it's a manifesto written in code, a declaration of independence from digital feudalism. Each node that joins the network is a vote for decentralization. Each transaction is a small revolution. Each computed result is proof that another world is not only possible but already being built.
Yet challenges remain. The network must scale without compromising its principles. It must remain accessible while resisting capture by special interests. It must evolve while maintaining backward compatibility. These are not merely technical challenges but philosophical ones that will shape the network's future.
Looking forward, the trajectory is clear. As artificial intelligence becomes increasingly central to human endeavor, the infrastructure supporting it must reflect our highest values: transparency, equity, privacy, and freedom. The distributed GPU network is not the end of this journey but perhaps its most promising beginning.
In the end, this is a story about choice. The choice to build rather than complain. The choice to collaborate rather than compete. The choice to open source the future rather than patent it. These choices, multiplied across thousands of contributors and millions of computations, constitute nothing less than a peaceful revolution in how we organize computational power.
The revolution will not be centralized.