About Our Foundation
Upshift Foundation is dedicated to advancing digital skills and contributing to open-source projects for the benefit of society. We support research, development, and collaboration across various domains. Our current focus is on polytechnic colleges and open-source contributions to AI research.
Our Projects & Initiatives
Educational Outreach
Open Source Contributions
- Performance profiling tools for AI workloads.
- Hardware description language implementation of LLMs.
FY2025 Conference Impact Reports
Detailed technical notes from foundation-sponsored conference attendances.
PyTorch Conference 2025
Overview
PyTorch Conference 2025 convened over 3,400 developers with a clear message: the industry is pivoting from "Model Architecture" to "System Architecture." The event highlighted that training reasoning models (like DeepSeek-R1) and deploying goal-directed agents requires a fundamental rethink of infrastructure. This year saw the introduction of dedicated "Agentic" tracks and the co-located Open Agent Summit, signaling that static prompt-response loops are being replaced by dynamic, multi-step agent workflows.
Major announcements reinforced this, including torchforge (for agentic development) and torchtitan (for scaling large MoE models), alongside a broader push to treat the entire datacenter—not just the GPU—as the unit of compute.
Detailed Session Notes
1. Verl: A Flexible & Efficient RL Framework (ByteDance)
This session introduced Verl, a hybrid framework designed to tackle the massive distributed computing workloads of large-scale RLHF.
- The Complexity Challenge: Modern RLHF involves coordinating four distinct models simultaneously: Policy, Reward, Reference, and Value models. This creates a massive distributed system challenge that standard trainers cannot handle efficiently.
- Hybrid-Controller Architecture: Verl solves this with a dual approach:
- MPMD (Multiple Program Multiple Data): Used for flexibility, allowing developers to prototype diverse algorithms (PPO, GRPO) with core logic in just a few lines of Python.
- SPMD (Single Program Multiple Data): Used for efficiency, enabling scaling to thousands of GPUs.
- 3D-HybridEngine & Async Rollout: To maximize throughput, Verl uses a "3D-HybridEngine" to eliminate memory redundancy, plus an Async Rollout mechanism that decouples generation from training so GPUs never idle while waiting for text generation, a critical optimization for reasoning models (a conceptual sketch follows this list).
- Agentic RL: A key update is support for "Agentic" workflows. The Policy LLM now generates not just text, but context and executable code, requiring a feedback loop that validates code execution results.
- Q&A Highlight: A critical discussion point was the load balancing mismatch in Mixture-of-Experts (MoE) models. When policy and reward models route tokens to different experts, it creates uneven GPU utilization. This remains an open challenge in the field.
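To make the async-rollout idea concrete, here is a minimal, self-contained sketch (not Verl's actual API) of a generation loop and a training loop decoupled by a bounded queue; `generator` and `trainer` are hypothetical stand-ins for the rollout engine and the PPO/GRPO update.

```python
import queue
import threading
import time

rollouts = queue.Queue(maxsize=4)  # bounded buffer between the two loops
NUM_BATCHES = 8

def generator():
    """Producer: stands in for the (slow) rollout/generation engine."""
    for step in range(NUM_BATCHES):
        time.sleep(0.05)              # pretend generation latency
        rollouts.put({"step": step})  # blocks only if the trainer falls behind

def trainer():
    """Consumer: stands in for the PPO/GRPO policy update."""
    for _ in range(NUM_BATCHES):
        batch = rollouts.get()        # ready batches overlap with generation
        print("updating policy on rollout batch", batch["step"])

threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
for t in threads: t.start()
for t in threads: t.join()
```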
2. Scaling Ingest Pipelines with HPC Principles
A technical deep-dive into the bottlenecks of ingesting raw, high-resolution time series data for training.
- The Memory Wall: As processor core speeds increase, main memory bandwidth fails to keep up.
- Cache Optimization: The session emphasized that successful scaling relies on minimizing cache misses. Understanding cache lines, the fixed-size blocks in which data moves from main memory into the L1/L2/L3 caches, is now a prerequisite for writing high-performance data loaders (see the sketch below).
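As a rough illustration of why access patterns matter, the snippet below (a toy benchmark, not from the session) sums the same row-major NumPy array twice: once along contiguous rows and once along strided columns. The arithmetic is identical; the gap in wall-clock time comes from cache misses.

```python
import time
import numpy as np

a = np.random.rand(4096, 4096)  # row-major (C order) by default

t0 = time.perf_counter()
row_total = sum(a[i, :].sum() for i in range(a.shape[0]))  # contiguous: whole cache lines used
t1 = time.perf_counter()
col_total = sum(a[:, j].sum() for j in range(a.shape[1]))  # strided: one element per cache line
t2 = time.perf_counter()

print(f"row-wise (cache-friendly):   {t1 - t0:.3f} s")
print(f"column-wise (cache-hostile): {t2 - t1:.3f} s")
```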
3. Keynote Panel: Scaling Laws & The Datacenter
Featuring speakers from CoreWeave, SemiAnalysis, AMD, and Crusoe.
- The "Datacenter as Computer" Shift: The industry demand has moved from "XPU" (individual chip performance) to "Datacenter" (cluster-level performance).
- New Scaling Laws: Future performance gains will not come from better silicon alone, but from the interconnect topology, thermal management, and power efficiency of the entire facility. The "unit of compute" is no longer the server, but the pod.
Closing Reflections
PyTorch 2025 demonstrated that we are entering the "System Era" of AI. The conference atmosphere suggested that the low-hanging fruit of model architecture (Transformers) has been harvested; the next frontier is System-Level Co-Design. We are moving toward a world where "Agentic AI" drives the workload, and the datacenter itself acts as the computer. For the Upshift Foundation, this validates our focus on "systems skills"—understanding not just how to define a model, but how to orchestrate the massive, distributed machinery required to train and serve it.
NSDI 2025: Networked Systems Design & Implementation
Overview
NSDI 2025 brought together researchers and practitioners to discuss advances in distributed systems, networking, and large-scale machine learning infrastructure. The sessions I attended focused on five major tracks: Infrastructure for Machine Learning, Machine Learning for Networks, RDMA and Transport Systems, Operational Experiences, and Storage Systems.
Detailed Session Notes
1. Infrastructure for Machine Learning
- AutoCCL: Automates the tuning of communication for distributed training. It builds tuning subspaces and applies coordinate descent to adapt to specific workloads (a sketch of the technique follows this list). Reported results included 1.23× higher bandwidth and latency reduced to 0.75× of the baseline.
- SuperServe: A dynamic serving framework featuring "SubNetAct" for instantaneous model switching and "SlackFit" for reactively scheduling workloads under bursty demand.
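For readers unfamiliar with the technique, here is a generic coordinate-descent tuner in the spirit of AutoCCL's search (a sketch of the general idea, not AutoCCL's code): each knob is swept in turn while the others stay fixed, repeating until no single change improves the measured cost. The knob names and cost function below are invented for illustration.

```python
def coordinate_descent(space, measure):
    """space: {name: [candidate values]}; measure: config -> cost (lower is better)."""
    config = {name: values[0] for name, values in space.items()}
    best = measure(config)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():  # sweep one coordinate at a time
            for v in values:
                trial = {**config, name: v}
                cost = measure(trial)
                if cost < best:
                    config, best, improved = trial, cost, True
    return config, best

# Toy usage: tune two hypothetical communication knobs against a fake cost.
space = {"chunk_kb": [64, 128, 256, 512], "channels": [1, 2, 4, 8]}
cost = lambda c: abs(c["chunk_kb"] - 256) / 256 + abs(c["channels"] - 4) / 4
print(coordinate_descent(space, cost))
```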
2. Machine Learning for Networks
- Mutant: A reinforcement learning-based congestion control framework. It learns from a pool of existing protocols without requiring large datasets, using Bayesian linear regression for reward prediction (sketched below).
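Bayesian linear regression is attractive here because it yields calibrated uncertainty from few samples. Below is a standard textbook implementation (a sketch of the general technique, not Mutant's code), with a Gaussian prior on the weights and Gaussian observation noise; the feature dimensions and precisions are illustrative.

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """Posterior mean and covariance of weights, given prior N(0, alpha^-1 I)
    and observation noise precision beta."""
    d = X.shape[1]
    S_inv = alpha * np.eye(d) + beta * X.T @ X  # posterior precision
    S = np.linalg.inv(S_inv)                    # posterior covariance
    m = beta * S @ X.T @ y                      # posterior mean
    return m, S

def predict(x, m, S, beta=25.0):
    """Predictive mean and variance for a new feature vector x."""
    mean = x @ m
    var = 1.0 / beta + x @ S @ x                # noise + weight uncertainty
    return mean, var

# Toy usage: features could be measured network signals, y observed rewards.
X = np.random.rand(100, 3)
y = X @ np.array([0.5, -1.0, 2.0]) + 0.2 * np.random.randn(100)
m, S = blr_posterior(X, y)
print(predict(np.array([0.3, 0.3, 0.3]), m, S))
```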
3. RDMA and Network Transport
- ScalaCN: Addressed scalability in RDMA. Highlighted that scaling can increase latency by 34× and reduce bandwidth by 87% due to opaque RNIC hardware behavior.
- Juneberry: Redesigns RPC with RDMA for ultra-low latency, allowing server NICs to execute responses directly.
Closing Reflections
Across all sessions, the conference reinforced how system design and machine learning are merging disciplines. The future of scalable AI infrastructure lies in systems that learn, adapt, and self-optimize across layers—networking, compute, and storage.
GTC 2025: NVIDIA GPU Technology Conference
Overview
GTC 2025 highlighted NVIDIA’s expanding vision of end-to-end accelerated computing. The conference emphasized the convergence of AI, robotics, simulation, and data-center infrastructure to enable a continuous learning cycle.
Detailed Session Notes
1. Edge AI and Robotics
- Isaac Platform: Enables RL-based robotics development in Omniverse. Validating "Inside-out AI" (agents trained in simulation) in realistic physics engines minimizes "sim-to-real" transfer errors.
2. Vision and Search Acceleration
- Visual Search: Explored GPU-accelerated semantic image search. Tools like nvImageCodec and cv-cuda eliminate CPU bottlenecks by moving the entire preprocessing pipeline onto the GPU (a sketch follows).
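I have not verified the nvImageCodec/cv-cuda APIs first-hand, so the sketch below makes the same point with torchvision's nvJPEG-backed decoder as a stand-in: once the bytes are decoded on the device, every later stage stays there, avoiding host-device copies. It assumes a CUDA-capable machine and a recent torchvision.

```python
import torch
from torchvision.io import read_file, decode_jpeg
from torchvision.transforms.v2 import functional as F

def preprocess_on_gpu(path: str) -> torch.Tensor:
    data = read_file(path)                  # raw JPEG bytes (CPU)
    img = decode_jpeg(data, device="cuda")  # decoded directly into GPU memory
    img = F.resize(img, [224, 224])         # resize runs on the GPU tensor
    return img.float() / 255.0              # normalized, still on the GPU
```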
Closing Reflections
GTC 2025 demonstrated how AI development is becoming a closed feedback loop. From robotics validation in Isaac Sim to GPU-accelerated vision processing, the sessions revealed a future in which intelligence is distributed across both the cloud and the edge.
OFC 2025: Optical Fiber Communication Conference
Overview
OFC 2025 showcased cutting-edge innovations redefining high-speed interconnects for AI data centers, with growing emphasis on 400G, 800G, and emerging 1.6T Ethernet technologies.
Detailed Session Notes
1. Emerging Trends & Demos
- Transceivers: The move toward 800G and 1.6T using PAM4 signaling, which carries two bits per symbol over four amplitude levels (a toy encoder follows this list).
- Liquid-Cooled Switches: Supermicro demonstrated switch chassis for 51.2 Tbps ASICs with direct liquid loops.
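Since PAM4 came up repeatedly, a short refresher: PAM4 encodes two bits per symbol across four amplitude levels, doubling throughput over NRZ at the same symbol rate. The toy encoder below uses a Gray-coded level map (a common choice; real transceivers layer equalization and FEC on top).

```python
# Gray-coded map: adjacent amplitude levels differ by exactly one bit,
# so a one-level detection error corrupts only one bit.
LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

def pam4_encode(bits):
    """bits: sequence of 0/1 with even length -> list of amplitude levels."""
    assert len(bits) % 2 == 0, "PAM4 consumes bits two at a time"
    return [LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

print(pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))  # -> [-3, -1, 1, 3]
```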
2. Optical Interconnects
- Co-Packaged Optics (CPO): Integrating optical engines directly with switch ASICs to reduce electrical trace lengths.
- Silicon Photonics: Using photonic integrated circuits (PICs) for optical circuit switching with near-zero latency.
Closing Reflections
OFC 2025 offered a hands-on perspective into the physical infrastructure enabling global AI. Observing the detailed structure of optical cables and switch systems deepened my understanding of how data moves—photon by photon—through the world’s networks.
How to Contribute
Upshift Foundation is a private foundation. We welcome project ideas in the areas of upskilling students from marginalized communities and the open-source AI tools we are developing.
Contact Us
If you have any questions or suggestions, feel free to reach out to us at
info@upshiftfoundation.org