About Our Foundation

Upshift Foundation is dedicated to advancing digital skills and contributing to open-source projects for the benefit of society. We support research, development, and collaboration across various domains. Our current focus is on polytechnic colleges and open-source contributions to AI research.

Our Projects & Initiatives

Educational Outreach

Open Source Contributions

AI Profiler

Performance profiling tools for AI workloads.

LLM in Verilog

Hardware description language implementation of LLMs.

FY2025 Conference Impact Reports

Detailed technical notes from foundation-sponsored conference attendances.

PyTorch Conference 2025
Author: Ajay Anubolu | Date: 10/23/2025 | Location: San Francisco, CA
Note: This report focuses on Distributed RL infrastructure, High-Performance Computing (HPC) optimization, and the emergence of "Agentic AI" as a primary system workload.

Overview

PyTorch Conference 2025 convened over 3,400 developers with a clear message: the industry is pivoting from "Model Architecture" to "System Architecture." The event highlighted that training reasoning models (like DeepSeek-R1) and deploying goal-directed agents requires a fundamental rethink of infrastructure. This year saw the introduction of dedicated "Agentic" tracks and the co-located Open Agent Summit, signaling that static prompt-response loops are being replaced by dynamic, multi-step agent workflows.

Major announcements reinforced this, including torchforge (for agentic development) and torchtitan (for scaling large Mixture-of-Experts (MoE) models), alongside a broader push to treat the entire datacenter—not just the GPU—as the unit of compute.

Detailed Session Notes

1. Verl: A Flexible & Efficient RL Framework (ByteDance)

This session introduced Verl, a hybrid framework designed to tackle the massive distributed computing workload required for large-scale RLHF (Reinforcement Learning from Human Feedback).

  • The Complexity Challenge: Modern RLHF involves coordinating four distinct models simultaneously: Policy, Reward, Reference, and Value models. This creates a massive distributed system challenge that standard trainers cannot handle efficiently.
  • Hybrid-Controller Architecture: Verl solves this with a dual approach:
    • MPMD (Multiple Program Multiple Data): Used for flexibility, allowing developers to prototype diverse algorithms (PPO, GRPO) with core logic in just a few lines of Python.
    • SPMD (Single Program Multiple Data): Used for efficiency, enabling scaling to thousands of GPUs.
  • 3D-HybridEngine & Async Rollout: To maximize throughput, Verl uses a "3D-HybridEngine" to eliminate memory redundancy and an Async Rollout mechanism. This decouples generation from training, preventing GPUs from idling while waiting for text generation—a critical optimization for reasoning models (a minimal sketch of this decoupling follows this list).
  • Agentic RL: A key update is support for "Agentic" workflows. The Policy LLM now generates not just text, but context and executable code, requiring a feedback loop that validates code execution results.
  • Q&A Highlight: A critical discussion point was the load-balancing mismatch in MoE models. When policy and reward models route tokens to different experts, it creates uneven GPU utilization. This remains an open challenge in the field.
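
To make the async-rollout idea concrete, here is a minimal, self-contained Python sketch. It is illustrative only (this is not Verl's API), and the policy and reward models are stubbed with toy stand-ins; the point is that a generation thread keeps a queue of trajectories full while the trainer consumes batches, so neither side idles.

    # Minimal sketch of async rollout (illustrative; not Verl's actual API).
    # A generation thread streams trajectories into a bounded queue while the
    # trainer consumes batches, so generation and training overlap instead of
    # strictly alternating.
    import queue
    import random
    import threading

    rollout_queue: queue.Queue = queue.Queue(maxsize=64)

    def generate(prompt: str) -> str:
        return prompt + " -> response"     # stand-in for LLM text generation

    def score(prompt: str, response: str) -> float:
        return random.random()             # stand-in for the reward model

    def rollout_worker(prompts):
        for prompt in prompts:
            response = generate(prompt)
            rollout_queue.put({"prompt": prompt,
                               "response": response,
                               "reward": score(prompt, response)})

    def train_loop(num_batches: int, batch_size: int = 4):
        for step in range(num_batches):
            # Blocks only if generation falls behind; otherwise a batch is
            # always ready and the training GPU never waits on generation.
            batch = [rollout_queue.get() for _ in range(batch_size)]
            mean_reward = sum(t["reward"] for t in batch) / batch_size
            print(f"step {step}: mean reward {mean_reward:.3f}")  # PPO update would go here

    prompts = [f"prompt-{i}" for i in range(16)]
    worker = threading.Thread(target=rollout_worker, args=(prompts,), daemon=True)
    worker.start()
    train_loop(num_batches=4)
    worker.join()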

2. Scaling Ingest Pipelines with HPC Principles

A technical deep-dive into the bottlenecks of ingesting raw, high-resolution time-series data for training.

  • The Memory Wall: As processor core speeds increase, main memory bandwidth fails to keep up, so data-bound pipelines stall on memory access rather than compute.
  • Cache Optimization: The session emphasized that successful scaling relies on minimizing cache misses. Understanding cache lines—the fixed-size blocks (typically 64 bytes) in which data moves from main memory into the L1/L2/L3 caches—is now a prerequisite for writing high-performance data loaders (see the sketch below).
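
The effect is easy to observe even from Python. This toy measurement (illustrative; timings vary by machine) copies the same data twice with NumPy: once streaming memory in layout order, and once with strided reads that waste most of every fetched cache line.

    # Toy demonstration of cache-line behavior. Copying a C-contiguous array
    # streams whole cache lines sequentially; copying its transpose forces
    # strided reads, so moving the same amount of data takes far longer.
    import time
    import numpy as np

    a = np.random.rand(6000, 6000)        # C-contiguous (row-major) layout

    t0 = time.perf_counter()
    sequential = a.copy()                 # reads memory in layout order
    t1 = time.perf_counter()
    strided = np.ascontiguousarray(a.T)   # touches one element per cache line
    t2 = time.perf_counter()

    print(f"sequential copy: {t1 - t0:.2f}s   strided copy: {t2 - t1:.2f}s")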

3. Keynote Panel: Scaling Laws & The Datacenter

Featuring speakers from CoreWeave, SemiAnalysis, AMD, and Crusoe.

  • The "Datacenter as Computer" Shift: The industry demand has moved from "XPU" (individual chip performance) to "Datacenter" (cluster-level performance).
  • New Scaling Laws: Future performance gains will not come from better silicon alone, but from the interconnect topology, thermal management, and power efficiency of the entire facility. The "unit of compute" is no longer the server, but the pod.

Closing Reflections

PyTorch 2025 demonstrated that we are entering the "System Era" of AI. The conference atmosphere suggested that the low-hanging fruit of model architecture (Transformers) has been harvested; the next frontier is System-Level Co-Design. We are moving toward a world where "Agentic AI" drives the workload, and the datacenter itself acts as the computer. For the Upshift Foundation, this validates our focus on "systems skills"—understanding not just how to define a model, but how to orchestrate the massive, distributed machinery required to train and serve it.

NSDI 2025: Networked Systems Design & Implementation
Author: Ajay Anubolu | Date: 04/30/2025 | Location: Chapel Hill, NC
Note: The sessions detailed below represent a selected sample of the talks attended.

Overview

NSDI 2025 brought together researchers and practitioners to discuss advances in distributed systems, networking, and large-scale machine learning infrastructure. The sessions I attended focused on five major tracks: Infrastructure for Machine Learning, Machine Learning for Networks, RDMA and Transport Systems, Operational Experiences, and Storage Systems.

Detailed Session Notes

1. Infrastructure for Machine Learning

  • AutoCCL: Automates the tuning of distributed-training communication. It builds tuning subspaces and applies coordinate descent to adapt to specific workloads (a sketch follows this list). Results showed bandwidth improved by 1.23× and latency reduced to 0.75× of the baseline.
  • SuperServe: A dynamic serving framework. Features "SubNetAct" for instantaneous model switching and "SlackFit" to reactively schedule workloads under bursty demand.
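
As a concrete picture of that tuning style, here is a generic coordinate-descent sketch (not AutoCCL's implementation): each communication knob is optimized in turn while the others are held fixed, against a measured cost such as training step time. The knob names and the cost function are hypothetical stand-ins.

    # Generic coordinate descent over discrete tuning knobs (a sketch; the
    # knobs and cost below are synthetic stand-ins for real measurements).
    def coordinate_descent(search_space: dict, cost_fn, sweeps: int = 3) -> dict:
        config = {k: vals[0] for k, vals in search_space.items()}  # start point
        for _ in range(sweeps):
            for knob, candidates in search_space.items():
                # Hold everything else fixed; pick the best value for this knob.
                config[knob] = min(candidates,
                                   key=lambda v: cost_fn({**config, knob: v}))
        return config

    space = {"chunk_size_kb": [64, 128, 256, 512],
             "num_channels":  [1, 2, 4, 8],
             "algorithm":     ["ring", "tree"]}

    def fake_step_time(cfg) -> float:     # stands in for a measured step time
        base = 10.0 / cfg["num_channels"] + cfg["chunk_size_kb"] / 256
        return base * (0.9 if cfg["algorithm"] == "tree" else 1.0)

    print(coordinate_descent(space, fake_step_time))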

2. Machine Learning for Networks

  • Mutant: A reinforcement-learning-based congestion-control framework. It learns from a pool of protocols without large datasets, using Bayesian linear regression for reward prediction (see the sketch below).
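
For readers unfamiliar with the method, the sketch below is standard conjugate Bayesian linear regression (the general technique the talk described, not Mutant's code): given features describing a network state and observed rewards, the Gaussian posterior over weights yields both a reward prediction and an uncertainty estimate for each candidate. The feature choices are hypothetical.

    # Bayesian linear regression with a Gaussian prior N(0, I/alpha) and
    # observation noise variance noise**2 (a generic sketch).
    import numpy as np

    def posterior(X, y, alpha=1.0, noise=0.1):
        """Return the posterior mean and covariance over the weights."""
        d = X.shape[1]
        precision = alpha * np.eye(d) + X.T @ X / noise**2
        cov = np.linalg.inv(precision)
        mean = cov @ X.T @ y / noise**2
        return mean, cov

    def predict(x, mean, cov, noise=0.1):
        """Predictive mean and variance of the reward for features x."""
        return x @ mean, x @ cov @ x + noise**2

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))          # e.g. RTT, loss rate, throughput
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    m, S = posterior(X, y)
    mu, var = predict(np.array([0.2, -0.1, 0.4]), m, S)
    print(f"predicted reward {mu:.3f} +/- {np.sqrt(var):.3f}")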

3. RDMA and Network Transport

  • ScalaCN: Addressed scalability in RDMA. Highlighted that scaling can increase latency by 34× and reduce bandwidth by 87% due to opaque RNIC hardware behavior.
  • Juneberry: Redesigns RPC with RDMA for ultra-low latency, allowing server NICs to execute responses directly.

Closing Reflections

Across all sessions, the conference reinforced how system design and machine learning are merging disciplines. The future of scalable AI infrastructure lies in systems that learn, adapt, and self-optimize across layers—networking, compute, and storage.

GTC 2025: NVIDIA GPU Technology Conference
Author: Ajay Anubolu | Date: 03/19/2025 | Focus: AI, Robotics, Edge Compute
Note: These notes summarize key technical sessions regarding the end-to-end AI lifecycle.

Overview

GTC 2025 highlighted NVIDIA’s expanding vision of end-to-end accelerated computing. The conference emphasized the convergence of AI, robotics, simulation, and data-center infrastructure to enable a continuous learning cycle.

Detailed Session Notes

1. Edge AI and Robotics

  • Isaac Platform: Enables RL-based robotics development in Omniverse. Validating "inside-out AI" (agents trained entirely in simulation) against realistic physics engines minimizes "sim-to-real" transfer errors (see the sketch below).
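
One standard ingredient in closing that gap is domain randomization. The loop below is a generic sketch of the idea (not Isaac's API, and with toy dynamics): physics parameters are re-sampled every episode, so a policy trained this way must work across a whole family of simulators rather than overfitting to one.

    # Domain randomization sketch: re-sample physics parameters per episode.
    import random

    def make_sim(friction: float, mass: float) -> dict:
        """Stand-in for constructing one randomized simulation instance."""
        return {"friction": friction, "mass": mass, "t": 0}

    def step(sim: dict, action: float):
        """Toy dynamics; returns (reward, done)."""
        sim["t"] += 1
        reward = -abs(action * sim["friction"] - sim["mass"])
        return reward, sim["t"] >= 100

    for episode in range(100):
        sim = make_sim(friction=random.uniform(0.5, 1.5),
                       mass=random.uniform(0.8, 1.2))
        done = False
        while not done:
            action = random.uniform(-1.0, 1.0)  # policy(observation) in practice
            reward, done = step(sim, action)
            # a real setup would update the policy from (action, reward) here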

2. Vision and Search Acceleration

  • Visual Search: Explored semantic search. Tools like nvImageCodec and CV-CUDA eliminate CPU bottlenecks by moving the entire preprocessing pipeline to the GPU (illustrated below).
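
The sketch below shows the underlying idea in plain PyTorch rather than those libraries (which additionally keep the image decode on-device; the already-decoded batch here is a stand-in): once pixels reach GPU memory, resizing and normalization happen there too, so nothing round-trips through the CPU before the vision encoder.

    # GPU-resident preprocessing sketch: resize and normalize on-device.
    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stand-in for a decoded image batch (N, C, H, W) already on the GPU.
    images = torch.rand(32, 3, 720, 1280, device=device)

    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

    resized = F.interpolate(images, size=(224, 224),
                            mode="bilinear", align_corners=False)
    batch = (resized - mean) / std
    # `batch` can now feed a vision encoder for embedding/search without
    # ever leaving device memory.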

Closing Reflections

GTC 2025 demonstrated how AI development is becoming a closed feedback loop. From robotics validation in Isaac Sim to GPU-accelerated vision processing, the sessions revealed a future in which intelligence is distributed across both the cloud and the edge.

OFC 2025: Optical Fiber Communication Conference
Author: Ajay Anubolu | Date: 04/03/2025 | Focus: Optical Networking
Note: These notes cover physical layer innovations enabling high-speed AI interconnects.

Overview

OFC 2025 showcased cutting-edge innovations redefining high-speed interconnects for AI data centers, with growing emphasis on 400G, 800G, and emerging 1.6T Ethernet technologies.

Detailed Session Notes

1. Emerging Trends & Demos

  • Transceivers: The industry is moving toward 800G and 1.6T optics using PAM4 (four-level pulse-amplitude modulation) signaling, which carries two bits per symbol (lane math sketched after this list).
  • Liquid-Cooled Switches: Supermicro demonstrated switch chassis for 51.2 Tbps ASICs with direct liquid loops.
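
The lane arithmetic behind those headline rates is straightforward. The back-of-the-envelope calculation below ignores FEC overhead, which in practice raises a 100G PAM4 lane's symbol rate to roughly 53 GBd.

    # Back-of-the-envelope PAM4 lane math (payload only; FEC overhead omitted).
    BITS_PER_SYMBOL = 2   # PAM4: four amplitude levels = 2 bits per symbol

    def lane_symbol_rate_gbd(port_gbps: float, lanes: int) -> float:
        lane_rate = port_gbps / lanes       # payload bits/s per lane
        return lane_rate / BITS_PER_SYMBOL  # symbols/s per lane, in GBd

    for port, lanes in [(400, 4), (800, 8), (1600, 8)]:
        rate = lane_symbol_rate_gbd(port, lanes)
        print(f"{port}G over {lanes} lanes -> {rate:.0f} GBd per lane")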

2. Optical Interconnects

  • Co-Packaged Optics (CPO): Integrating optical engines directly with switch ASICs to reduce electrical trace lengths.
  • Silicon Photonics: Using photonic integrated circuits (PICs) for optical circuit switching with near-zero latency.

Closing Reflections

OFC 2025 offered a hands-on perspective into the physical infrastructure enabling global AI. Observing the detailed structure of optical cables and switch systems deepened my understanding of how data moves—photon by photon—through the world’s networks.

How to Contribute

Upshift Foundation is a private foundation. We welcome project ideas in the areas of upskilling students from marginalized communities and the AI tools we are working on.

Contact Us

If you have any questions or suggestions, feel free to reach out to us at
info@upshiftfoundation.org
