Distributed Training
Subnet ID: 38
Coldkey: 5EFJ7mEVyjxT3iXqxuHyMMuGgJKFt85tb5s4vEvnCTpSoP3w
The Distributed Training subnet enables decentralized training of large language models, using the hivemind library for peer-to-peer gradient sharing and all-reduce operations. Miners contribute GPU resources to collaborative training runs, and model checkpoints are stored on Cloudflare R2. The subnet supports multi-GPU configurations per miner, with automatic updates and WandB integration for monitoring training progress.
Hivemind Integration
Peer-to-peer gradient synchronization via the hivemind library, enabling distributed training across decentralized miners
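A minimal sketch of what peer-to-peer gradient averaging with hivemind can look like; the bootstrap peer address, run_id, batch sizes, and the toy model are illustrative assumptions, not the subnet's actual configuration.

```python
# Sketch of collaborative training with hivemind (illustrative values only).
import torch
import hivemind

model = torch.nn.Linear(512, 512)                 # placeholder for the actual LLM
base_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Join the DHT; initial_peers would come from the subnet's bootstrap nodes (example multiaddr).
dht = hivemind.DHT(
    initial_peers=["/ip4/203.0.113.1/tcp/31337/p2p/QmExamplePeerID"],
    start=True,
)

# Wrap the local optimizer so gradients are all-reduced across peers
# once the swarm has accumulated target_batch_size samples.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="example-training-run",                # assumed run identifier
    optimizer=base_opt,
    target_batch_size=4096,
    batch_size_per_step=32,
    use_local_updates=False,
    verbose=True,
)

for _ in range(10):                               # stand-in training loop
    loss = model(torch.randn(32, 512)).pow(2).mean()
    loss.backward()
    opt.step()                                    # triggers the collaborative all-reduce when ready
    opt.zero_grad()
```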
Multi-GPU Support
Configure multiple GPUs per miner node with the nproc_per_node parameter for maximum training throughput
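The nproc_per_node flag belongs to torchrun, which spawns one worker process per GPU and sets rank environment variables for each. A hedged sketch of the per-rank setup each worker would run; the script name and process-group details are assumptions.

```python
# Launched with, e.g.:  torchrun --nproc_per_node=4 train.py
# torchrun sets LOCAL_RANK / WORLD_SIZE for each spawned worker process.
import os
import torch
import torch.distributed as dist

def setup_local_worker() -> int:
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)             # pin this process to one GPU
    dist.init_process_group(backend="nccl")       # intra-node all-reduce over NCCL
    return local_rank

if __name__ == "__main__":
    rank = setup_local_worker()
    print(f"worker ready on cuda:{rank} of {dist.get_world_size()} processes")
```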
R2 Checkpoint Storage
Model states saved to Cloudflare R2 buckets using the {model_name}-{uid} naming format for reliable checkpointing
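R2 exposes an S3-compatible API, so a checkpoint upload could look like the following sketch; the account ID, credential environment variables, bucket name, and object key layout are placeholders, with the bucket named per the {model_name}-{uid} convention above.

```python
# Sketch: upload a checkpoint to a Cloudflare R2 bucket via its S3-compatible API.
import os
import boto3

ACCOUNT_ID = os.environ["R2_ACCOUNT_ID"]           # placeholder credentials/IDs
bucket = "distributed-llm-123"                     # assumed {model_name}-{uid} bucket

s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{ACCOUNT_ID}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

# Push a locally saved model state to R2.
s3.upload_file("checkpoint.pt", bucket, "checkpoints/step_1000/checkpoint.pt")
```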
Training Monitoring
WandB and HuggingFace integration for tracking training metrics, logging, and model publishing
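A hedged sketch of metric logging and model publishing with the standard wandb and huggingface_hub clients; the project name, repo ID, and logged values are illustrative, not the subnet's real identifiers.

```python
# Sketch: log training metrics to WandB and publish a checkpoint to the HuggingFace Hub.
import wandb
from huggingface_hub import HfApi

run = wandb.init(
    project="distributed-training-sn38",           # assumed project name
    config={"lr": 1e-4, "target_batch_size": 4096},
)

for step in range(3):                              # stand-in for the real training loop
    wandb.log({"loss": 1.0 / (step + 1), "global_step": step})

run.finish()

# Push a saved checkpoint directory to a (hypothetical) model repo.
HfApi().upload_folder(
    folder_path="./checkpoint",
    repo_id="example-org/distributed-llm-123",
    repo_type="model",
)
```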