From Computing Power to Intelligence: A Decentralized AI Investment Map Driven by Reinforcement Learning

Author: Jacob Zhao @IOSG

Artificial intelligence is moving from statistical learning focused on "pattern fitting" toward a capability system centered on "structured reasoning," and the importance of post-training is rising rapidly. The emergence of DeepSeek-R1 marks a paradigm shift for reinforcement learning in the era of large models and has produced an industry consensus: pre-training lays the foundation for general-purpose models, while reinforcement learning is no longer merely a value-alignment tool. It has been shown to systematically improve the quality of reasoning chains and complex decision-making, and it is gradually becoming a technical path for continuously raising the level of intelligence.

A Panoramic View of Reinforcement Learning Technology: Architecture, Framework, and Applications

System Architecture and Core Elements of Reinforcement Learning

Reinforcement Learning (RL) drives a model to improve its decision-making autonomously through a loop of "environment interaction → reward feedback → policy update." Its core structure is a feedback loop of state, action, reward, and policy. A complete RL system typically comprises three components: Policy (policy network), Rollout (experience sampling), and Learner (policy updater). The policy interacts with the environment to generate trajectories, and the Learner updates the policy based on reward signals, forming a continuously iterating learning process:

**Policy Network:** Generates actions from the environment state and is the core of the system's decision-making. During training, centralized backpropagation is needed to keep parameters consistent; during inference, it can be distributed to different nodes and run in parallel.

**Rollout (Experience Sampling):** Nodes interact with the environment according to the policy, generating trajectories of states, actions, and rewards.
This process is highly parallel, requires very little communication, and is insensitive to hardware differences, making it the component best suited to scaling out in a decentralized environment.

**Learner:** Aggregates all Rollout trajectories and performs policy gradient updates. It is the module with the highest demands on computing power and bandwidth, so it is usually deployed in a centralized or lightly centralized manner to ensure stable convergence.

Reinforcement Learning Stage Framework (RLHF → RLAIF → PRM → GRPO)

Reinforcement learning can generally be divided into six stages; the overall process is as follows:

# Data Generation Stage (Policy Exploration)

Given input prompts, the policy model πθ generates multiple candidate reasoning chains or complete trajectories. These provide the samples for subsequent preference evaluation and reward modeling, and they determine the breadth of policy exploration.

# Preference Feedback Phase (RLHF / RLAIF)

**RLHF (Reinforcement Learning from Human Feedback):** Collects multiple candidate answers, has humans annotate their preferences, trains a reward model (RM) on those annotations, and optimizes the policy with PPO so that model outputs align more closely with human values. It was a key step in the transition from GPT-3.5 to GPT-4.

**RLAIF (Reinforcement Learning from AI Feedback):** Replaces manual annotation with AI judges or constitutional rules, automating preference acquisition, significantly reducing cost, and making the process scalable. It has become the mainstream alignment paradigm at Anthropic, OpenAI, DeepSeek, and others.

# Reward Modeling Stage

Preference data is fed into a reward model, which learns to map outputs to rewards. The RM teaches the model "what the correct answer is"; the PRM teaches the model "how to reason correctly."
**RM (Reward Model):** Evaluates the quality of the final answer, scoring only the output.

**PRM (Process Reward Model):** Rather than evaluating only the final answer, it scores each reasoning step, token, and logical segment. It is often described as a key technique behind reasoning models such as OpenAI o1 and DeepSeek-R1, essentially "teaching the model how to think."

# Reward Verification Phase

Introducing "verifiable constraints" when reward signals are generated and used ensures that rewards originate from reproducible rules, facts, or consensus. This reduces the risk of reward hacking and bias and improves auditability and scalability in open environments.

# Policy Optimization Phase

Policy parameters θ are updated under the guidance of the reward signal to obtain a policy πθ′ with stronger reasoning ability, higher safety, and more stable behavior. Mainstream optimization methods include:

**PPO (Proximal Policy Optimization):** The traditional optimizer in RLHF. It is valued for its stability, but in complex reasoning tasks it often converges slowly and can still become unstable.

**GRPO (Group Relative Policy Optimization):** A core innovation behind DeepSeek-R1. It estimates each answer's advantage relative to the reward distribution within a group of candidate answers to the same prompt, rather than simply ranking them. This preserves reward magnitude information, is better suited to optimizing reasoning chains, and makes training more stable. It is regarded as an important RL optimization framework for deep reasoning scenarios after PPO.

**DPO (Direct Preference Optimization):** A post-training method that does not use reinforcement learning: it does not generate trajectories or build a reward model, but optimizes directly on preference pairs. It is low-cost and stable, and is therefore widely used to align open-source models such as Llama and Gemma, but it does not improve reasoning ability.
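
To make the group-relative idea behind GRPO concrete, here is a minimal sketch of how advantages might be computed for one prompt. It assumes the commonly described formulation in which each reward is normalized by the mean and standard deviation of its own group; the function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not details taken from this article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    The group mean serves as the baseline, so no separate critic network
    is needed. `eps` (an assumed constant) avoids division by zero when
    every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate answers to the same prompt, scored by a verifier or RM.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group mean receive a positive advantage and those below receive a negative one; the size of the gap is kept rather than reduced to a rank order, which is the property the GRPO description above emphasizes.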
# New Policy Deployment Phase

The optimized model exhibits stronger reasoning-chain generation (System-2 reasoning), behavior more consistent with human or AI preferences, a lower hallucination rate, and higher safety. Through continuous iteration the model keeps learning preferences, optimizing its process, and improving decision quality, forming a closed loop.

Five Major Categories of Industrial Applications of Reinforcement Learning

Reinforcement learning has evolved from early game-playing AI into a core framework for autonomous decision-making across industries. Its applications fall into five categories by technological maturity and industrial adoption, each driving key breakthroughs in its own area.

**Game & Strategy Systems:** This is the earliest validated direction of RL. In environments with "perfect information + explicit rewards," systems such as AlphaGo, AlphaZero, AlphaStar, and OpenAI Five demonstrated decision-making comparable to or surpassing human experts, laying the foundation for modern RL algorithms.

**Embodied AI:** Through continuous control, dynamics modeling, and environmental interaction, RL enables robots to learn manipulation, motion control, and cross-modal tasks (such as RT-2 and RT-X). It is moving rapidly toward industrialization and is a key technical route for deploying robots in the real world.

**Digital Reasoning (LLM System-2):** RL + PRM moves large models from "language imitation" to "structured reasoning." Representative achievements include DeepSeek-R1, OpenAI o1/o3, Anthropic Claude, and AlphaGeometry. Its essence lies in optimizing rewards at the level of the reasoning chain, rather than only evaluating the final answer.

**Automated Scientific Discovery and Mathematical Optimization:** RL searches for optimal structures or strategies in settings with no labels, complex rewards, and vast search spaces.
Systems such as AlphaTensor, AlphaDev, and Fusion RL have achieved fundamental breakthroughs, demonstrating exploratory capabilities that surpass human intuition.

**Economic Decision-making & Trading:** RL is used for strategy optimization, high-dimensional risk control, and the generation of adaptive trading systems. Compared with traditional quantitative models, it can keep learning in uncertain environments and is an important component of intelligent finance.

The Natural Match Between Reinforcement Learning and Web3

The strong fit between RL and Web3 stems from the fact that both are essentially "incentive-driven systems." RL relies on reward signals to optimize policies, while blockchains rely on economic incentives to coordinate participant behavior, so the two are naturally consistent at the mechanism level. The core requirements of RL (large-scale heterogeneous rollout, reward distribution, and authenticity verification) are precisely where the structural advantages of Web3 lie.

# Decoupling Inference and Training

The training process of reinforcement learning can be clearly divided into two stages:

**Rollout (Exploratory Sampling):** The model generates a large amount of data under the current policy. This is computationally intensive but communication-sparse: it does not require frequent communication between nodes and is well suited to parallel generation on globally distributed consumer-grade GPUs.

**Update (Parameter Update):** Model weights are updated from the collected data, which requires high-bandwidth centralized nodes.

"Inference-training decoupling" aligns naturally with decentralized, heterogeneous computing power: Rollout can be outsourced to open networks and settled by contribution through a token mechanism, while model updates remain centralized to ensure stability.
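
The split between communication-sparse rollout and a centralized update, with settlement by contribution, can be sketched as follows. This is a toy illustration under stated assumptions: the worker names, the random "rewards," the `settle` function, and the pro-rata payout rule are all hypothetical stand-ins, and threads stand in for remote GPUs.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def rollout(worker_id, policy_version, n_samples, seed):
    """Hypothetical rollout worker: samples trajectories without talking to peers."""
    rng = random.Random(seed)
    return [
        {"worker": worker_id, "policy_version": policy_version, "reward": rng.random()}
        for _ in range(n_samples)
    ]


def centralized_update(policy_version, trajectories):
    """Stand-in for the learner's gradient step: consume the batch, bump the version."""
    mean_reward = sum(t["reward"] for t in trajectories) / len(trajectories)
    return policy_version + 1, mean_reward


def settle(trajectories, reward_pool):
    """Split an assumed token pool in proportion to rollouts contributed per worker."""
    counts = Counter(t["worker"] for t in trajectories)
    total = sum(counts.values())
    return {w: reward_pool * c / total for w, c in counts.items()}


# Heterogeneous workers contribute different amounts of work.
workloads = {"gpu-a": 8, "gpu-b": 4, "gpu-c": 4}
with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(rollout, worker, 0, n, seed=i)
        for i, (worker, n) in enumerate(workloads.items())
    ]
    batch = [t for f in futures for t in f.result()]

new_version, mean_reward = centralized_update(0, batch)
payouts = settle(batch, reward_pool=100.0)
```

Only trajectories cross the network in this sketch; weights change in one place. A real system would also need to verify the submitted rollouts before paying for them, which is the subject of the verifiability discussion.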
# Verifiability

ZK and Proof-of-Learning provide means to verify whether nodes actually performed the inference they claim, addressing the honesty problem in open networks. In deterministic tasks such as coding and mathematical reasoning, verifiers only need to check the answers to confirm the workload, which significantly improves the credibility of decentralized RL systems.

# Incentive Layer: A Token-Based Feedback Production Mechanism

Web3's token mechanism can directly reward contributors of RLHF/RLAIF preference feedback, giving preference data generation a transparent, settleable, and permissionless incentive structure. Staking and slashing further constrain feedback quality, forming a feedback market that is more efficient and better aligned than traditional crowdsourcing.

# Potential of Multi-Agent Reinforcement Learning (MARL)

A blockchain is essentially a public, transparent, and continuously evolving multi-agent environment. Accounts, contracts, and agents constantly adjust their strategies in response to incentives, which gives it natural potential as a large-scale MARL testbed. Although still at an early stage, its public state, verifiable execution, and programmable incentives provide a fundamental advantage for the future development of MARL.

Based on the theoretical framework above, we briefly analyze the most representative projects in the current ecosystem:

Prime Intellect: An Asynchronous Reinforcement Learning Paradigm (prime-rl)

Prime Intellect is committed to building a global open computing power market, lowering the barrier to training, promoting collaborative decentralized training, and developing a complete open-source superintelligence technology stack. Its system includes Prime Compute (a unified cloud/distributed computing environment), the INTELLECT model family (10B–100B+), the open reinforcement learning Environments Hub, and a large-scale synthetic data engine (SYNTHETIC-1/2).
Among Prime Intellect's core infrastructure components, the prime-rl framework is designed for asynchronous distributed environments and is the one most relevant to reinforcement learning. Other components include the OpenDiLoCo communication protocol, which overcomes bandwidth bottlenecks, and the TopLoc verification mechanism, which ensures computational integrity.

# A Glance at Prime Intellect Core Infrastructure Components

# Technical Foundation: prime-rl Asynchronous Reinforcement Learning Framework

prime-rl is the core training engine of Prime Intellect, designed specifically for large-scale asynchronous decentralized environments. It achieves high-throughput inference and stable updates by fully decoupling Actor and Learner. The Rollout Worker and the Trainer no longer block each other synchronously; nodes can join or leave at any time and only need to keep pulling the latest policy and uploading the data they generate.

**Actor (Rollout Worker):** Responsible for model inference and data generation. Prime Intellect integrates the vLLM inference engine into the Actor; vLLM's PagedAttention and continuous batching allow the Actor to generate reasoning trajectories at very high throughput.

**Learner (Trainer):** Responsible for policy optimization. The Learner asynchronously pulls data from the shared Experience Buffer for gradient updates, without waiting for all Actors to complete the current batch.

**Coordinator (Orchestrator):** Responsible for scheduling model weights and data flow.

# Key Innovations of prime-rl

**True Asynchrony:** prime-rl abandons the synchronous paradigm of traditional PPO. It does not wait for slow nodes and does not require batch alignment, so any number of GPUs of any performance level can join at any time, which lays the foundation for the feasibility of decentralized RL.
**Deep Integration of FSDP2 and MoE:** Through FSDP2 parameter sharding and MoE sparse activation, prime-rl enables efficient training of models with billions of parameters in a distributed environment. Actors run only the active experts, significantly reducing memory and inference costs.

**GRPO+ (Group Relative Policy Optimization):** GRPO eliminates the need for a Critic network, significantly reducing computation and memory overhead, and is naturally suited to asynchronous environments. prime-rl's GRPO+ adds stabilization mechanisms to ensure reliable convergence under high latency.

# INTELLECT Model Family: A Marker of the Maturity of Decentralized RL Technology

**INTELLECT-1 (10B, October 2024):** The first to prove that OpenDiLoCo can train efficiently over a heterogeneous network spanning three continents (communication ratio <2%, computing power utilization 98%), challenging the assumed physical limits of cross-region training.

**INTELLECT-2 (32B, April 2025):** As the first permissionless RL model, it validates that prime-rl and GRPO+ converge stably under multi-step delay and asynchronous conditions, enabling decentralized RL with participation from globally open computing power.

**INTELLECT-3 (106B MoE, November 2025):** Adopts a sparse architecture that activates only 12B parameters and was trained on 512×H200 to achieve flagship-level reasoning performance (AIME 90.8%, GPQA 74.4%, MMLU-Pro 81.9%, etc.), with overall performance approaching or even surpassing centralized closed-source models far larger than itself.
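
The asynchronous Actor-Learner pattern described for prime-rl, where the Learner pulls from a shared Experience Buffer without waiting for slow nodes, can be illustrated with a small sketch. This is not prime-rl's implementation or API: the `ExperienceBuffer` class, the `max_staleness` bound, and the rule of discarding rollouts generated by weights that are too old are assumptions chosen to show one plausible way to tolerate multi-step delay.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    worker: str
    policy_version: int  # version of the weights that generated this sample
    reward: float


class ExperienceBuffer:
    """Shared buffer: actors push whenever they finish; the learner never waits."""

    def __init__(self, max_staleness):
        # Hypothetical knob for illustration, not a documented prime-rl setting.
        self.max_staleness = max_staleness
        self._items = deque()

    def push(self, rollout):
        self._items.append(rollout)

    def pull(self, current_version, batch_size):
        """Return up to batch_size sufficiently fresh rollouts; drop stale ones."""
        batch = []
        while self._items and len(batch) < batch_size:
            item = self._items.popleft()
            if current_version - item.policy_version <= self.max_staleness:
                batch.append(item)
        return batch


buffer = ExperienceBuffer(max_staleness=2)
buffer.push(Rollout("fast-gpu", policy_version=5, reward=1.0))
buffer.push(Rollout("slow-gpu", policy_version=1, reward=0.0))  # lagging node
buffer.push(Rollout("mid-gpu", policy_version=3, reward=0.5))

# The learner is at version 5 and trains on whatever fresh data is available.
batch = buffer.pull(current_version=5, batch_size=8)
kept_workers = [r.worker for r in batch]
```

Here the lagging node's sample is discarded while the other two are used, so a slow participant costs some wasted work but never stalls the update loop.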
Prime Intellect also built several supporting infrastructure components. OpenDiLoCo reduces cross-region training communication by hundreds of times through time-sparse communication and quantized weight differences, allowing INTELLECT-1 to maintain 98% utilization across three continents. TopLoc + Verifiers form a decentralized trusted execution layer that uses activation fingerprints and sandbox verification to ensure the authenticity of inference and reward data. The SYNTHETIC data engine produces large-scale, high-quality reasoning chains and uses pipeline parallelism to run the 671B model efficiently on consumer-grade GPU clusters. These components provide a crucial engineering foundation for data generation, verification, and inference throughput in decentralized RL. The INTELLECT series demonstrates that this technology stack can produce mature, world-class models, marking the transition of decentralized training systems from concept to practical application.

Gensyn: The Core Reinforcement Learning Stack RL Swarm and SAPO

Gensyn aims to aggregate idle global computing power into an open, trustless, and infinitely scalable AI training infrastructure. Its core includes a standardized cross-device execution layer, a peer-to-peer coordination network, and a trustless task verification system, with tasks and rewards allocated automatically through smart contracts. Building on the characteristics of reinforcement learning, Gensyn introduces core mechanisms such as RL Swarm, SAPO, and SkipPipe to decouple the three stages of generation, evaluation, and updating, achieving collective evolution through a "swarm" of globally heterogeneous GPUs. What it ultimately delivers is not merely computing power, but verifiable intelligence.

# Reinforcement Learning Applications of the Gensyn Stack

# RL Swarm: A Decentralized Collaborative Reinforcement Learning Engine

RL Swarm demonstrates a novel collaborative model.
It is no longer simple task distribution, but a decentralized "generate-evaluate-update" loop that simulates human social learning, analogous to a collaborative learning process that repeats indefinitely:

**Solvers:** Responsible for local model inference and Rollout generation, seamlessly integrating heterogeneous nodes. Gensyn integrates a high-throughput inference engine (such as CodeZero) locally and outputs complete trajectories rather than just answers.

**Proposers:** Dynamically generate tasks (mathematical problems, coding problems, etc.), supporting task diversity and adaptive difficulty similar to Curriculum Learning.

**Evaluators:** Evaluate local Rollouts using a frozen "judge model" or rules, generating local reward signals. The evaluation process is auditable, reducing opportunities for malicious behavior.

These three roles together form a P2P RL organizational structure, enabling large-scale collaborative learning without centralized scheduling.

# SAPO: A Policy Optimization Algorithm for Decentralized Reconstruction

SAPO (Swarm Sampling Policy Optimization) is based on the principle of "sharing Rollouts and filtering out samples that carry no gradient signal, rather than sharing gradients." Through large-scale decentralized Rollout sampling, and by treating received Rollouts as if they were locally generated, it achieves stable convergence in environments with no central coordination and large differences in node latency.
Compared with PPO, which relies on a Critic network and has high computational costs, or GRPO, which is based on intra-group advantage estimation, SAPO enables consumer-grade GPUs to participate effectively in large-scale reinforcement learning optimization with extremely low bandwidth.

# Nous Research Component Overview

# Model Layer: Hermes and the Evolution of Reasoning Capabilities

The Hermes series is Nous Research's main user-facing model interface. Its evolution clearly demonstrates the industry's migration path from traditional SFT/DPO alignment to Reasoning Reinforcement Learning (Reasoning RL).

# Echo: Reinforcement Learning Training Architecture

Echo is Gradient's reinforcement learning framework. Its core design philosophy is to decouple the training, inference, and data (reward) paths in reinforcement learning, so that Rollout generation, policy optimization, and reward evaluation can scale and be scheduled independently in heterogeneous environments. It runs collaboratively across a heterogeneous network of inference and training nodes and uses a lightweight synchronization mechanism to keep training stable over wide-area heterogeneous environments. This effectively alleviates the SPMD failures and GPU utilization bottlenecks caused by mixing inference and training in traditional DeepSpeed RLHF/VERL setups.