MyArxiv
Hardware Architecture 5
☆ IMS: Intelligent Hardware Monitoring System for Secure SoCs DATE
In the modern Systems-on-Chip (SoC), the Advanced eXtensible Interface (AXI) protocol exhibits security vulnerabilities, enabling partial or complete denial-of-service (DoS) through protocol-violation attacks. The recent countermeasures lack a dedicated real-time protocol semantic analysis and evade protocol compliance checks. This paper tackles this AXI vulnerability issue and presents an intelligent hardware monitoring system (IMS) for real-time detection of AXI protocol violations. IMS is a hardware module leveraging neural networks to achieve high detection accuracy. For model training, we perform DoS attacks through header-field manipulation and systematic malicious operations, while recording AXI transactions to build a training dataset. We then deploy a quantization-optimized neural network, achieving 98.7% detection accuracy with <=3% latency overhead, and throughput of >2.5 million inferences/s. We subsequently integrate this IMS into a RISC-V SoC as a memory-mapped IP core to monitor its AXI bus. For demonstration and initial assessment for later ASIC integration, we implemented this IMS on an AMD Zynq UltraScale+ MPSoC ZCU104 board, showing an overall small hardware footprint (9.04% look-up-tables (LUTs), 0.23% DSP slices, and 0.70% flip-flops) and negligible impact on the overall design's achievable frequency. This demonstrates the feasibility of lightweight, security monitoring for resource-constrained edge environments.
comment: The final version is accepted for publication at the Design, Automation & Test in Europe Conference (DATE) 2026
☆ InterPUF: Distributed Authentication via Physically Unclonable Functions and Multi-party Computation for Reconfigurable Interposers
Modern system-in-package (SiP) platforms increasingly adopt reconfigurable interposers to enable plug-and-play chiplet integration across heterogeneous multi-vendor ecosystems. However, this flexibility introduces severe trust challenges, as traditional authentication schemes fail to scale or adapt in decentralized, post-fabrication programmable environments. This paper presents InterPUF, a compact and scalable authentication framework that transforms the interposer into a distributed root of trust. InterPUF embeds a route-based differential delay physically unclonable function (PUF) across the reconfigurable interconnect and secures authentication using multi-party computation (MPC), ensuring raw PUF signatures are never exposed. Our hardware evaluation shows only 0.23% area and 0.072% power overhead across diverse chiplets while preserving authentication latency within tens of nanoseconds. Simulation results using pyPUF confirm strong uniqueness, reliability, and modeling resistance under process, voltage, and temperature variations. By combining interposer-resident PUF primitives with cryptographic hashing and collaborative verification, InterPUF enforces a minimal-trust authentication model without relying on a centralized anchor.
comment: Accepted to IEEE International Symposium on Hardware Oriented Security and Trust (HOST) 2026
☆ OpenACM: An Open-Source SRAM-Based Approximate CiM Compiler DATE 2026
The rise of data-intensive AI workloads has exacerbated the ``memory wall'' bottleneck. Digital Compute-in-Memory (DCiM) using SRAM offers a scalable solution, but its vast design space makes manual design impractical, creating a need for automated compilers. A key opportunity lies in approximate computing, which leverages the error tolerance of AI applications for significant energy savings. However, existing DCiM compilers focus on exact arithmetic, failing to exploit this optimization. This paper introduces OpenACM, the first open-source, accuracy-aware compiler for SRAM-based approximate DCiM architectures. OpenACM bridges the gap between application error tolerance and hardware automation. Its key contribution is an integrated library of accuracy-configurable multipliers (exact, tunable approximate, and logarithmic), enabling designers to make fine-grained accuracy-energy trade-offs. The compiler automates the generation of the DCiM architecture, integrating a transistor-level customizable SRAM macro with variation-aware characterization into a complete, open-source physical design flow based on OpenROAD and the FreePDK45 library. This ensures full reproducibility and accessibility, removing dependencies on proprietary tools. Experimental results on representative convolutional neural networks (CNNs) demonstrate that OpenACM achieves energy savings of up to 64\% with negligible loss in application accuracy. The framework is available on \href{https://github.com/ShenShan123/OpenACM}{OpenACM:URL}
comment: Accepted by DATE 2026
☆ RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs HPCA 2026
Graph Random Walks (GRWs) offer efficient approximations of key graph properties and have been widely adopted in many applications. However, GRW workloads are notoriously difficult to accelerate due to their strong data dependencies, irregular memory access patterns, and imbalanced execution behavior. While recent work explores FPGA-based accelerators for GRWs, existing solutions fall far short of hardware potential due to inefficient pipelining and static scheduling. This paper presents RidgeWalker, a high-performance GRW accelerator designed for datacenter FPGAs. The key insight behind RidgeWalker is that the Markov property of GRWs allows decomposition into stateless, fine-grained tasks that can be executed out-of-order without compromising correctness. Building on this, RidgeWalker introduces an asynchronous pipeline architecture with a feedback-driven scheduler grounded in queuing theory, enabling perfect pipelining and adaptive load balancing. We prototype RidgeWalker on datacenter FPGAs and evaluated it across a range of GRW algorithms and real-world graph datasets. Experimental results demonstrate that RidgeWalker achieves an average speedup of 7.0x over state-of-the-art FPGA solutions and 8.1x over GPU solutions, with peak speedups of up to 71.0x and 22.9x, respectively. The source code is publicly available at https://github.com/Xtra-Computing/RidgeWalker.
comment: Accepted by HPCA 2026
☆ SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency single-pass attention inference algorithm, where every (kt, vt) in the KV cache is processed exactly once in a uniform per-token pipeline without score materialization, blockwise softmax, or a second pass, thereby enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which enables high precision attention and low precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on the edge accelerator, the SwiftKV Attention algorithm achieves a 7.16* speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48*; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98* compared with state-of-the-art works.
Programming Languages 5
☆ Qihe: A General-Purpose Static Analysis Framework for Verilog
In the past decades, static analysis has thrived in software, facilitating applications in bug detection, security, and program understanding. These advanced analyses are largely underpinned by general-purpose static analysis frameworks, which offer essential infrastructure to streamline their development. Conversely, hardware lacks such a framework, which overshadows the promising opportunities for sophisticated static analysis in hardware, hindering achievements akin to those witnessed in software. We thus introduce Qihe, the first general-purpose static analysis framework for Verilog -- a highly challenging endeavor given the absence of precedents in hardware. Qihe features an analysis-oriented front end, a Verilog-specific IR, and a suite of diverse fundamental analyses that capture essential hardware-specific characteristics -- such as bit-vector arithmetic, register synchronization, and digital component concurrency -- and enable the examination of intricate hardware data and control flows. These fundamental analyses are designed to support a wide array of hardware analysis clients. To validate Qihe's utility, we further developed a set of clients spanning bug detection, security, and program understanding. Our preliminary experimental results are highly promising; for example, Qihe uncovered 9 previously unknown bugs in popular real-world hardware projects (averaging 1.5K+ GitHub stars), all of which were confirmed by developers; moreover, Qihe successfully identified 18 bugs beyond the capabilities of existing static analyses for Verilog bug detection (i.e., linters), and detected 16 vulnerabilities in real-world hardware programs. By open-sourcing Qihe, which comprises over 100K lines of code, we aim to inspire further innovation and applications of sophisticated static analysis for hardware, aspiring to foster a similarly vibrant ecosystem that software analysis enjoys.
☆ Cutting Corners on Uncertainty: Zonotope Abstractions for Stream-based Runtime Monitoring
Stream-based monitoring assesses the health of safety-critical systems by transforming input streams of sensor measurements into output streams that determine a verdict. These inputs are often treated as accurate representations of the physical state, although real sensors introduce calibration and measurement errors. Such errors propagate through the monitor's computations and can distort the final verdict. Affine arithmetic with symbolic slack variables can track these errors precisely, but independent measurement noise introduces a fresh slack variable upon each measurement event, causing the monitor's state representation to grow without bound over time. Therefore, any bounded-memory monitoring algorithm must unify slack variables at runtime in a way that generates a sound approximation. This paper introduces zonotopes as an abstract domain for online monitoring of RLola specifications. We demonstrate that zonotopes precisely capture the affine state of the monitor and that their over-approximation produces a sound bounded-memory monitor. We present a comparison of different zonotope over-approximation strategies in the context of runtime monitoring, evaluating their performance and false-positive rates.
♻ ☆ Towards Industrial-scale Product Configuration
We address the challenge of product configuration in the context of increasing customer demand for diverse and complex products. We propose a solution through a curated selection of product model benchmarks formulated in the COOM language, divided into three fragments of increasing complexity. Each fragment is accompanied by a corresponding bike model example, and additional scalable product models are included in the COOM suite, along with relevant resources. We outline an ASP-based workflow for solving COOM-based configuration problems, highlighting its adaptability to different paradigms and alternative ASP solutions. The COOM Suite aims to provide a comprehensive, accessible, and representative set of examples that can serve as a common ground for stakeholders in the field of product configuration.
comment: Under consideration in Theory and Practice of Logic Programming (TPLP)
♻ ☆ Optimism in Equality Saturation
Equality saturation is a technique for program optimization based on non-destructive rewriting and a form of program analysis called e-class analysis. The current form of e-class analysis is pessimistic and therefore ineffective at analyzing cyclic programs, such as those in SSA form. We propose an abstract interpretation algorithm that can precisely analyze cycles during equality saturation. This results in a unified algorithm for optimistic analysis and non-destructive rewriting. We instantiate this approach on a prototype abstract interpreter for SSA programs using a new semantics of SSA. Our prototype can analyze simple example programs more precisely than clang and gcc.
♻ ☆ Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES
Benchmarks of molecular machine learning models often treat the molecular representation as a neutral input format, yet the representation defines the syntax of validity, edit operations, and invariances that models implicitly learn. We propose MolADT, a typed intermediate representation (IR) for molecules expressed as a family of algebraic data types that separates (i) constitution via Dietz-style bonding systems, (ii) 3D geometry and stereochemistry, and (iii) optional electronic annotations. By shifting from string edits to operations over structured values, MolADT makes representational assumptions explicit, supports deterministic validation and localized transformations, and provides hooks for symmetry-aware and Bayesian workflows. We provide a reference implementation in Haskell (open-source, archived with DOI) and worked examples demonstrating delocalised/multicentre bonding, validation invariants, reaction extensions, and group actions relevant to geometric learning.
comment: 3 Figures
Data Structures and Algorithms 5
☆ Stabilizer Code-Generic Universal Fault-Tolerant Quantum Computation
Fault-tolerant quantum computation allows quantum computations to be carried out while resisting unwanted noise. Several error correcting codes have been developed to achieve this task, but none alone are capable of universal quantum computation. This universality is highly desired and often achieved using additional techniques such as code concatenation, code switching, or magic state distillation, which can be costly and only work for specific codes. This work implements logical Clifford and T gates through novel ancilla-mediated protocols to construct a universal fault-tolerant quantum gate set. Unlike traditional techniques, our implementation is deterministic, does not consume ancilla registers, does not modify the underlying data codes or registers, and is generic over all stabilizer codes. Thus, any single code becomes capable of universal quantum computation by leveraging helper codes in ancilla registers and mid-circuit measurements. Furthermore, since these logical gates are stabilizer code-generic, these implementations enable communication between heterogeneous stabilizer codes. These features collectively open the door to countless possibilities for existing and undiscovered codes as well as their scalable, heterogeneous coexistence.
comment: 34 pages, 6 figures, 7 tables
♻ ☆ Optimal learning of quantum channels in diamond distance
Quantum process tomography, the task of estimating an unknown quantum channel, is a central problem in quantum information theory. A long-standing open question is to determine the optimal number of uses of an unknown channel required to learn it in diamond distance, the standard metric for distinguishing quantum processes. While the analogous problem of quantum state tomography has been settled over the past decades in both the pure- and mixed-state settings, for general quantum channels it remained largely open beyond the unitary case. Here we design an algorithm showing that any channel with input and output dimensions $d_{\mathrm{in}},d_{\mathrm{out}}$ and Kraus rank at most $k$ can be learned to constant accuracy in diamond distance using $Θ(d_{\mathrm{in}}d_{\mathrm{out}}k)$ channel uses, and we prove that this scaling is optimal via a matching lower bound. More generally, achieving accuracy $\varepsilon$ is possible with $O(d_{\mathrm{in}}d_{\mathrm{out}}k/\varepsilon^{2})$ channel uses. Since quantum channels subsume states, unitaries, and isometries as special cases, our protocol provides a unified framework for the corresponding tomography tasks; in particular, it yields the first optimal protocols for isometries and for binary measurement tomography, and it recovers optimal trace-distance tomography for fixed-rank states. Our approach reduces channel tomography to pure-state tomography: we use the channel to prepare copies of its Choi state, purify them in parallel, and run sample-optimal pure-state tomography on the resulting purifications; we then show that the induced diamond-distance error scales essentially linearly with the trace-distance error in estimating the (purified) Choi state. We also resolve an open question by showing that adaptivity does not improve the dimension-optimal query complexity of quantum channel tomography.
comment: 10 + 43 pages, 1 figure. Added a section on query-complexity lower bounds
♻ ☆ A Polylogarithmic Competitive Algorithm for Stochastic Online Sorting and TSP
In \emph{Online Sorting}, an array of $n$ initially empty cells is given. At each time step $t$, an element $x_t \in [0,1]$ arrives and must be placed irrevocably into an empty cell without any knowledge of future arrivals. We aim to minimize the sum of absolute differences between pairs of elements placed in consecutive array cells, seeking an online placement strategy that results in a final array close to a sorted one. An interesting multidimensional generalization, a.k.a. the \emph{Online Travelling Salesperson Problem}, arises when the request sequence consists of points in the $d$-dimensional unit cube and the objective is to minimize the sum of euclidean distances between points in consecutive cells. Motivated by the recent work of (Abrahamsen, Bercea, Beretta, Klausen and Kozma; ESA 2024), we consider the \emph{stochastic version} of Online Sorting (\textit{resp.} Online TSP), where each element (\textit{resp.} point) $x_t$ is an i.i.d. sample from the uniform distribution on $[0, 1]$ (\textit{resp.} $[0,1]^d$). By carefully decomposing the request sequence into a hierarchy of balls-into-bins instances, where the balls to bins ratio is large enough so that bin occupancy is sharply concentrated around its mean and small enough so that we can efficiently deal with the elements placed in the same bin, we obtain an online algorithm that approximates the optimal cost within a factor of $O(\log^2 n)$ with high probability. Our result comprises an exponential improvement on the previously best known competitive ratio of $\tilde{O}(n^{1/4})$ for Stochastic Online Sorting due to (Abrahamsen et al.; ESA 2024) and $O(\sqrt{n})$ for (adversarial) Online TSP due to (Bertram, ESA 2025).
comment: To appear at STACS 2026
♻ ☆ Online Rack Placement in Large-Scale Data Centers: Online Sampling Optimization and Deployment
This paper optimizes the configuration of large-scale data centers toward cost-effective, reliable and sustainable cloud supply chains. The problem involves placing incoming racks of servers within a data center to maximize demand coverage given space, power and cooling restrictions. We formulate an online integer optimization model to support rack placement decisions. We propose a tractable online sampling optimization (OSO) approach to multi-stage stochastic optimization, which approximates unknown parameters with a sample path and re-optimizes decisions dynamically. We prove that OSO achieves a strong competitive ratio in canonical online resource allocation problems and sublinear regret in the online batched bin packing problem. Theoretical and computational results show it can outperform mean-based certainty-equivalent resolving heuristics. Our algorithm has been packaged into a software solution deployed across Microsoft's data centers, contributing an interactive decision-making process at the human-machine interface. Using deployment data, econometric tests suggest that adoption of the solution has a negative and statistically significant impact on power stranding, estimated at 1-3 percentage point. At the scale of cloud computing, these improvements in data center performance result in significant cost savings and environmental benefits.
comment: 73 pages
♻ ☆ Improved Streaming Algorithm for Fair $k$-Center Clustering
Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to ensure fairness. Focusing on big data scenarios, this paper addresses the problem in a streaming setting, where data points arrive one by one sequentially in a continuous stream. Leveraging a structure called the $λ$-independent center set, we propose a one-pass streaming algorithm that first computes a reserved set of points during the streaming process. Then, for the post-streaming process, we propose an approach for selecting centers from the reserved point set by analyzing all three possible cases, transforming the most complicated one into a specially constrained vertex cover problem in an auxiliary graph. Our algorithm achieves a tight approximation ratio of 5 while consuming $O(k\log n)$ memory. It can also be readily adapted to solve the offline fair $k$-center problem, achieving a 3-approximation ratio that matches the current state of the art. Furthermore, we extend our approach to a semi-structured data stream, where data points from each group arrive in batches. In this setting, we present a 3-approximation algorithm for $m = 2$ and a 4-approximation algorithm for general $m$. Lastly, we conduct extensive experiments to evaluate the performance of our approaches, demonstrating that they outperform existing baselines in both clustering cost and runtime efficiency.
Graphics 2
☆ Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program is a long-standing goal of computer vision. Yet even strong VLMs aren't able to achieve this in one-shot as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphic Agent) that starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing, etc. Empirically, we found VIGA substantially improves one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with graphics engine, where VIGA improves by 124.70%.
♻ ☆ Controllable Video Generation: A Survey
With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.
comment: project page: https://github.com/mayuelala/Awesome-Controllable-Video-Generation
Operating Systems 3
☆ Mitigating GIL Bottlenecks in Edge AI Systems
Deploying Python based AI agents on resource-constrained edge devices presents a runtime optimization challenge: high thread counts are needed to mask I/O latency, yet Python's Global Interpreter Lock (GIL) serializes execution. We demonstrate that naive thread-pool scaling causes a "saturation cliff": >= 20% throughput degradation at overprovisioned thread counts (N >= 512) on edge-representative configurations. We present a lightweight profiling tool and adaptive runtime system using a Blocking Ratio metric (beta) that distinguishes genuine I/O wait from GIL contention. Our library-based solution achieves 96.5% of optimal performance without manual tuning, outperforming multiprocessing (limited by ~8x memory overhead on devices with 512 MB-2 GB RAM) and asyncio (blocked by CPU-bound phases). Evaluation across seven edge AI workload profiles, including real ML inference with ONNX Runtime MobileNetV2, demonstrates 93.9% average efficiency. Comparative experiments with Python 3.13t (free threading) show that while GIL elimination enables ~4x throughput on multi-core edge devices, the saturation cliff persists on single-core devices, validating our beta metric for both GIL and no-GIL environments. This provides practical optimization for edge AI systems.
♻ ☆ Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC
File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLMs) to generate and evolve a file system from prompts, effectively addressing the need for robust evolution. Despite the widespread success of LLMs in code generation, attempts to create a functional file system have thus far been unsuccessful, mainly due to the ambiguity of natural language prompts. This paper introduces SYSSPEC, a framework for developing generative file systems. Its key insight is to replace ambiguous natural language with principles adapted from formal methods. Instead of imprecise prompts, SYSSPEC employs a multi-part specification that accurately describes a file system's functionality, modularity, and concurrency. The specification acts as an unambiguous blueprint, guiding LLMs to generate expected code flexibly. To manage evolution, we develop a DAG-structured patch that operates on the specification itself, enabling new features to be added without violating existing invariants. Moreover, the SYSSPEC toolchain features a set of LLM-based agents with mechanisms to mitigate hallucination during construction and evolution. We demonstrate our approach by generating SPECFS, a concurrent file system. SPECFS demonstrates equivalent level of correctness to that of a manually-coded baseline across hundreds of regression tests. We further confirm its evolvability by seamlessly integrating 10 real-world features from Ext4. Our work shows that a specification-guided approach makes generating and evolving complex systems not only feasible but also highly effective.
♻ ☆ AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving SOSP 2025
Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses the context. Further, as these LLM applications scale, the total size of KV caches becomes excessively large and requires both DRAM and SSD for full storage. However, prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate and device placement for each KV cache entry to maximise DRAM hits and minimise loading delay without significantly degrading generation quality. Compared to various static compression baselines across three tasks, our system AdaptCache achieves 1.43--2.4 x delay savings at the same quality and 6--55% quality improvements at the same delay.
comment: Accepted at SOSP 2025 - The International Workshop on Big Memory (BigMem)
Hardware Architecture 4
☆ Architectural Classification of XR Workloads: Cross-Layer Archetypes and Implications
Edge and mobile platforms for augmented and virtual reality, collectively referred to as extended reality (XR) must deliver deterministic ultra-low-latency performance under stringent power and area constraints. However, the diversity of XR workloads is rapidly increasing, characterized by heterogeneous operator types and complex dataflow structures. This trend poses significant challenges to conventional accelerator architectures centered around convolutional neural networks (CNNs), resulting in diminishing returns for traditional compute-centric optimization strategies. Despite the importance of this problem, a systematic architectural understanding of the full XR pipeline remains lacking. In this paper, we present an architectural classification of XR workloads using a cross-layer methodology that integrates model-based high-level design space exploration (DSE) with empirical profiling on commercial GPU and CPU hardware. By analyzing a representative set of workloads spanning 12 distinct XR kernels, we distill their complex architectural characteristics into a small set of cross-layer workload archetypes (e.g., capacity-limited and overhead-sensitive). Building on these archetypes, we further extract key architectural insights and provide actionable design guidelines for next-generation XR SoCs. Our study highlights that XR architecture design must shift from generic resource scaling toward phase-aware scheduling and elastic resource allocation in order to achieve greater energy efficiency and high performance in future XR systems.
☆ FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems
Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage like SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, FaTRQ improves the storage efficiency by 2.4$\times$ and improves the throughput by up to 9$ \times$ than SOTA GPU ANNS system.
☆ Mugi: Value Level Parallelism For Efficient LLMs
Value level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiply (GEMM) between symmetric activations and weights. In transformer based large language models (LLMs), there exist more sophisticated operations beyond activation-weight GEMM. In this paper, we explore how VLP benefits LLMs. First, we generalize VLP for nonlinear approximations, outperforming existing nonlinear approximations in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, where important values are assigned with greater accuracy. Second, we optimize VLP for small-batch GEMMs with asymmetric inputs efficiently, which leverages timely LLM optimizations, including weight-only quantization, key-value (KV) cache quantization, and group query attention. Finally, we design a new VLP architecture, Mugi, to encapsulate the innovations above and support full LLM workloads, while providing better performance, efficiency and sustainability. Our experimental results show that Mugi can offer significant improvements on throughput and energy efficiency, up to $45\times$ and $668\times$ for nonlinear softmax operations, and $2.07\times$ and $3.11\times$ for LLMs, and also decrease operational carbon for LLM operation by $1.45\times$ and embodied carbon by $1.48\times$.
comment: 2026 International Conference on Architectural Support for Programming Languages and Operating Systems
♻ ☆ CarbonClarity: Understanding and Addressing Uncertainty in Embodied Carbon for Sustainable Computing
Embodied carbon footprint modeling has become an area of growing interest due to its significant contribution to carbon emissions in computing. However, the deterministic nature of the existing models fail to account for the spatial and temporal variability in the semiconductor supply chain. The absence of uncertainty modeling limits system designers' ability to make informed, carbon-aware decisions. We introduce CarbonClarity, a probabilistic framework designed to model embodied carbon footprints through distributions that reflect uncertainties in energy-per-area, gas-per-area, yield, and carbon intensity across different technology nodes. Our framework enables a deeper understanding of how design choices, such as chiplet architectures and new vs. old technology node selection, impact emissions and their associated uncertainties. For example, we show that the gap between the mean and 95th percentile of embodied carbon per cm$^2$ can reach up to 1.6X for the 7nm technology node. Additionally, we demonstrate through case studies that: (i) CarbonClarity is a valuable resource for device provisioning, help maintaining performance under a tight carbon budget; and (ii) chiplet technology and mature nodes not only reduce embodied carbon but also significantly lower its associated uncertainty, achieving an 18% reduction in the 95th percentile compared to monolithic designs for the mobile application.
Programming Languages 2
☆ Outrunning Big KATs: Efficient Decision Procedures for Variants of GKAT
This paper presents several efficient decision procedures for trace equivalence of GKAT automata, which make use of on-the-fly symbolic techniques via SAT solvers. To demonstrate applicability of our algorithms, we designed symbolic derivatives for CF-GKAT, a practical system based on GKAT designed to validate control-flow transformations. We implemented the algorithms in Rust and evaluated them on both randomly generated benchmarks and real-world control-flow transformations. Indeed, we observed order-of-magnitude performance improvements against existing implementations for both KAT and CF-GKAT. Notably, our experiments also revealed a bug in Ghidra, an industry-standard decompiler, highlighting the practical viability of these systems.
comment: Conditionally Accepted at ESOP 2026
♻ ☆ Multi-Threaded Software Model Checking via Parallel Trace Abstraction Refinement
Automatic software verification is a valuable means for software quality assurance. However, automatic verification and in particular software model checking can be time-consuming, which hinders their practical applicability e.g., the use in continuous integration. One solution to address the issue is to reduce the response time of the verification procedure by leveraging today's multi-core CPUs. In this paper, we propose a solution to parallelize trace abstraction, an abstraction-based approach to software model checking. The underlying idea of our approach is to parallelize the abstraction refinement. More concretely, our approach analyzes different traces (syntactic program paths) that could violate the safety property in parallel. We realize our parallelized version of trace abstraction in the verification tool Ulti mate Automizer and perform a thorough evaluation. Our evaluation shows that our parallelization is more effective than sequential trace abstraction and can provide results significantly faster on many time-consuming tasks. Also, our approach is more effective than DSS, a recent parallel approach to abstraction-based software model checking.
Data Structures and Algorithms 14
☆ UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees PPoPP 2026
The dynamic trees problem is to maintain a tree under edge updates while supporting queries like connectivity queries or path queries. Despite the first data structure for this fundamental problem -- the link-cut tree -- being invented 40 years ago, our experiments reveal that they are still the fastest sequential data structure for the problem. However, link-cut trees cannot support parallel batch-dynamic updates and have limitations on the kinds of queries they support. In this paper, we design a new parallel batch-dynamic trees data structure called UFO trees that simultaneously supports a wide range of query functionality, supports work-efficient parallel batch-dynamic updates, and is competitive with link-cut trees when run sequentially. We prove that a key reason for the strong practical performance of both link-cut trees and UFO trees is that they can perform updates and queries in sub-logarithmic time for low-diameter trees. We perform an experimental study of our optimized C++ implementations of UFO trees with ten other dynamic tree implementations, several of which are new, in a broad benchmark of both synthetic and real-world trees of varying diameter and size. Our results show that, in both sequential and parallel settings, UFO trees are the fastest dynamic tree data structure that supports a wide range of queries. Our new implementation of UFO trees has low space usage and easily scales to billion-size inputs, making it a promising building block for implementing more complex dynamic graph algorithms in practice.
comment: To appear in PPoPP 2026
☆ Scalable Algorithms for Approximate DNF Model Counting
Model counting of Disjunctive Normal Form (DNF) formulas is a critical problem in applications such as probabilistic inference and network reliability. For example, it is often used for query evaluation in probabilistic databases. Due to the computational intractability of exact DNF counting, there has been a line of research into a variety of approximation algorithms. These include Monte Carlo approaches such as the classical algorithms of Karp, Luby, and Madras (1989), as well as methods based on hashing (Soos et al. 2023), and heuristic approximations based on Neural Nets (Abboud, Ceylan, and Lukasiewicz 2020). We develop a new Monte Carlo approach with an adaptive stopping rule and short-circuit formula evaluation. We prove it achieves Probably Approximately Correct (PAC) learning bounds and is asymptotically more efficient than the previous methods. We also show experimentally that it out-performs prior algorithms by orders of magnitude, and can scale to much larger problems with millions of variables.
☆ Streaming Stochastic Submodular Maximization with On-Demand User Requests NeurIPS'25
We explore a novel problem in streaming submodular maximization, inspired by the dynamics of news-recommendation platforms. We consider a setting where users can visit a news website at any time, and upon each visit, the website must display up to $k$ news items. User interactions are inherently stochastic: each news item presented to the user is consumed with a certain acceptance probability by the user, and each news item covers certain topics. Our goal is to design a streaming algorithm that maximizes the expected total topic coverage. To address this problem, we establish a connection to submodular maximization subject to a matroid constraint. We show that we can effectively adapt previous methods to address our problem when the number of user visits is known in advance or linear-size memory in the stream length is available. However, in more realistic scenarios where only an upper bound on the visits and sublinear memory is available, the algorithms fail to guarantee any bounded performance. To overcome these limitations, we introduce a new online streaming algorithm that achieves a competitive ratio of $1/(8δ)$, where $δ$ controls the approximation quality. Moreover, it requires only a single pass over the stream, and uses memory independent of the stream length. Empirically, our algorithms consistently outperform the baselines.
comment: NeurIPS'25
☆ Two Complexity Results on Spanning-Tree Congestion Problems
In the spanning-tree congestion problem ($\mathsf{STC}$), we are given a graph $G$, and the objective is to compute a spanning tree of $G$ that minimizes the maximum edge congestion. While $\mathsf{STC}$ is known to be $\mathbb{NP}$-hard, even for some restricted graph classes, several key questions regarding its computational complexity remain open, and we address some of these in our paper. (i) For graphs of degree at most $d$, it is known that $\mathsf{STC}$ is $\mathbb{NP}$-hard when $d\ge 8$. We provide a complete resolution of this variant, by showing that $\mathsf{STC}$ remains $\mathbb{NP}$-hard for each degree bound $d\ge 3$. (ii) In the decision version of $\mathsf{STC}$, given an integer $K$, the goal is to determine whether the congestion of $G$ is at most $K$. We prove that this variant is polynomial-time solvable for $K$-edge-connected graphs.
♻ ☆ Convex optimization with $p$-norm oracles
In recent years, there have been significant advances in efficiently solving $\ell_s$-regression using linear system solvers and $\ell_2$-regression [Adil-Kyng-Peng-Sachdeva, J. ACM'24]. Would efficient smoothed $\ell_p$-norm solvers lead to even faster rates for solving $\ell_s$-regression when $2 \leq p < s$? In this paper, we give an affirmative answer to this question and show how to solve $\ell_s$-regression using $\tilde{O}(n^{\fracν{1+ν}})$ iterations of solving smoothed $\ell_p$ regression problems, where $ν:= \frac{1}{p} - \frac{1}{s}$. To obtain this result, we provide improved accelerated rates for convex optimization problems when given access to an $\ell_p^s(λ)$-proximal oracle, which, for a point $c$, returns the solution of the regularized problem $\min_{x} f(x) + λ||x-c||_p^s$. Additionally, we show that these rates for the $\ell_p^s(λ)$-proximal oracle are optimal for algorithms that query in the span of the outputs of the oracle, and we further apply our techniques to settings of high-order and quasi-self-concordant optimization.
comment: 34 pages
♻ ☆ Cost-Free Neutrality for the River Method AAAI 2026
Recently, the River Method was introduced as novel refinement of the Split Cycle voting rule. The decision-making process of River is closely related to the well established Ranked Pairs Method. Both methods consider a margin graph computed from the voters' preferences and eliminate majority cycles in that graph to choose a winner. As ties can occur in the margin graph, a tiebreaker is required along with the preferences. While such a tiebreaker makes the computation efficient, it compromises the fundamental property of neutrality: the voting rule should not favor alternatives in advance. One way to reintroduce neutrality is to use Parallel-Universe Tiebreaking (PUT), where each alternative is a winner if it wins according to any possible tiebreaker. Unfortunately, computing the winners selected by Ranked Pairs with PUT is NP-complete. Given the similarity of River to Ranked Pairs, one might expect River to suffer from the same complexity. Surprisingly, we show the opposite: We present a polynomial-time algorithm for computing River winners with PUT, highlighting significant structural advantages of River over Ranked Pairs. Our Fused-Universe (FUN) algorithm simulates River for every possible tiebreaking in one pass. From the resulting FUN diagram one can then directly read off both the set of winners and, for each winner, a certificate that explains how this alternative dominates the others.
comment: appears at AAAI 2026. V2 Change Notes: in the experiment section changed election generation parameter for more realistic elections, results unchanged; corrected pref_voting reference
♻ ☆ Assessing fault-tolerant quantum advantage for $k$-SAT with structure
For many problems, quantum algorithms promise speedups over their classical counterparts. However, these results predominantly rely on asymptotic worst-case analysis, which overlooks significant overheads due to error correction and the fact that real-world instances often contain exploitable structure. In this work, we employ the hybrid benchmarking method to evaluate the potential of quantum Backtracking and Grover's algorithm against the 2023 SAT competition main track winner in solving random $k$-SAT instances with tunable structure, designed to represent industry-like scenarios, using both $T$-depth and $T$-count as cost metrics to estimate quantum run times. Our findings reproduce the results of Campbell, Khurana, and Montanaro (Quantum '19) in the unstructured case using hybrid benchmarking. However, we offer a more sobering perspective in practically relevant regimes: almost all quantum speedups vanish, even asymptotically, when minimal structure is introduced or when $T$-count is considered instead of $T$-depth. Moreover, when the requirement is for the algorithm to find a solution within a single day, we find that only Grover's algorithm has the potential to outperform classical algorithms, but only in a very limited regime and only when using $T$-depth. We also discuss how more sophisticated heuristics could restore the asymptotic scaling advantage for quantum backtracking, but our findings suggest that the potential for practical quantum speedups in more structured $k$-SAT solving will remain limited.
comment: 40 pages, 17 tables, 1 figure
♻ ☆ Pushing the frontiers of subexponential FPT time for Feedback Vertex Set
The paper deals with the Feedback Vertex Set problem parameterized by the solution size. Given a graph $G$ and a parameter $k$, one has to decide if there is a set $S$ of at most $k$ vertices such that $G-S$ is acyclic. Assuming the Exponential Time Hypothesis, it is known that FVS cannot be solved in time $2^{o(k)}n^{\mathcal{O}(1)}$ in general graphs. To overcome this, many recent results considered FVS restricted to particular intersection graph classes and provided such $2^{o(k)}n^{\mathcal{O}(1)}$ algorithms. In this paper we provide generic conditions on a graph class for the existence of an algorithm solving FVS in subexponential FPT time, i.e. time $2^{k^\varepsilon} \mathop{\rm poly}(n)$, for some $\varepsilon<1$, where $n$ denotes the number of vertices of the instance and $k$ the parameter. On the one hand this result unifies algorithms that have been proposed over the years for several graph classes such as planar graphs, map graphs, unit-disk graphs, pseudo-disk graphs, and string graphs of bounded edge-degree. On the other hand it extends the tractability horizon of FVS to new classes that are not amenable to previously used techniques, in particular intersection graphs of ``thin'' objects like segment graphs or more generally $s$-string graphs.
comment: Appeared in proceedings of ICALP 2025
♻ ☆ Approximations for Fault-Tolerant Total and Partial Positive Influence Domination
In $\textit{total domination}$, given a graph $G=(V,E)$, we seek a minimum-size set of nodes $S\subseteq V$, such that every node in $V$ has at least one neighbor in $S$. We define a $\textit{fault-tolerant}$ version of total domination, where we require any node in $V \setminus S$ to have at least $m$ neighbors in $S$. Let $Δ$ denote the maximum degree in $G$. We prove a first $1 + \ln(Δ+ m - 1)$ approximation for fault-tolerant total domination. We also consider fault-tolerant variants of the weighted $\textit{partial positive influence dominating set}$ problem, where we seek a minimum-size set of nodes $S\subseteq V$, such that every node in $V$ is either a member of $S$ or the sum of weights of its incident edges leading to nodes in $S$ is at least half of the sum of weights over all its incident edges. We prove the first logarithmic approximations for the simple, total, and connected variants of this problem. To prove the result for the connected case, we extend the general approximation framework for non-submodular functions from integer-valued to fractional-valued functions, which we believe is of independent interest.
♻ ☆ ProbeWalk: Fast Estimation of Biharmonic Distance on Graphs via Probe-Driven Random Walks
The biharmonic distance is a fundamental metric on graphs that measures the dissimilarity between two nodes, capturing both local and global structures. It has found applications across various fields, including network centrality, graph clustering, and machine learning. These applications typically require efficient evaluation of pairwise biharmonic distances. However, existing algorithms remain computationally expensive. The state-of-the-art method attains an absolute-error guarantee epsilon_abs with time complexity O(L^5 / epsilon_abs^2), where L denotes the truncation length. In this work, we improve the complexity to O(L^3 / epsilon^2) under a relative-error guarantee epsilon via probe-driven random walks. We provide a relative-error guarantee rather than an absolute-error guarantee because biharmonic distances vary by orders of magnitude across node pairs. Since L is often very large in real-world networks (for example, L >= 10^3), reducing the L-dependence from the fifth to the third power yields substantial gains. Extensive experiments on real-world networks show that our method delivers 10x-1000x per-query speedups at matched relative error over strong baselines and scales to graphs with tens of millions of nodes.
comment: We have received some constructive advice and are going to conduct additional experiments to strengthen the results
♻ ☆ Universal Maximum Likelihood (List) Decoding via Fast Vector-Matrix Multiplication
Maximum-likelihood (ML) decoding for arbitrary block codes remains fundamentally hard, with worst-case time complexity-measured by the total number of multiplications-being no better than straightforward exhaustive search, which requires $q^{k} n$ operations for an $[n,k]_q$ code. This paper introduces a simple, code-agnostic framework that reduces the worst-case complexity by a factor of $n$, down to $q^{k}$ operations, a highly desirable reduction in practice. The result holds for both linear and nonlinear block codes over general memoryless channels and under both hard-decision and soft-decision decoding. It naturally extends to intersymbol-interference (ISI) channels and ML list decoding with only a negligible increase in complexity. Our core insight is that, upon receipt of each sequence at the receiver, the conditional probability of that sequence for each codeword in the codebook (i.e., the \emph{likelihood}) can be expressed as the inner product of two carefully constructed vectors -- the first depending on the received sequence, and the second on that codeword itself. As a result, evaluating the likelihoods for all codewords in the codebook reduces to a single vector-matrix multiplication, and ML decoding (MLD) becomes the simple task of picking the maximum entry in the resulting vector. The only non-trivial cost lies in the vector-matrix product. However, our matrix construction allows the use of the Mailman algorithm to reduce this cost. This time reduction is achieved at the cost of high space complexity, requiring $\mathcal{O}(q^{k+1} n)$ space to store the pre-computed codebook matrix.
comment: Add in figures
♻ ☆ Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph
We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $Ω(k^2)$ queries or many rounds of interactive queries. Technically, we introduce a new object we dub the Scheffé graph, which captures structure of the differences between distributions in $Q$, and may be of more broad interest for hypothesis selection tasks.
♻ ☆ Pinwheel Scheduling with Real Periods
For a sequence of tasks, each with a positive integer period, the pinwheel scheduling problem involves finding a valid schedule in the sense that the schedule performs one task per day and each task is performed at least once every consecutive days of its period. It had been conjectured by Chan and Chin in 1993 that there exists a valid schedule for any sequence of tasks with density, the sum of the reciprocals of each period, at most $\frac{5}{6}$. Recently, Kawamura settled this conjecture affirmatively. In this paper we consider an extended version with real periods proposed by Kawamura, in which a valid schedule must perform each task $i$ having a real period~$a_{i}$ at least $l$ times in any consecutive $\lceil l a_{i} \rceil$ days for all positive integer $l$. We show that any sequence of tasks such that the periods take three distinct real values and the density is at most $\frac{5}{6}$ admits a valid schedule. We hereby conjecture that the conjecture of Chan and Chin is true also for real periods.
♻ ☆ A Dynamic, Self-balancing k-d Tree
The original description of the k-d tree recognized that rebalancing techniques, used for building an AVL or red-black tree, are not applicable to a k-d tree, because these techniques involve cyclic exchange of tree nodes that violates the invariant of the k-d tree. For this reason, a static, balanced k-d tree is often built from all of the k-dimensional data en masse. However, it is possible to build a dynamic k-d tree that self-balances when necessary after insertion or deletion of each k-dimensional datum. This article describes insertion, deletion, and rebalancing algorithms for a dynamic, self-balancing k-d tree, and measures their performance.
comment: 19 pages, 9 figures, 5 tables
Graphics 6
☆ Alterbute: Editing Intrinsic Attributes of Objects in Images
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ''Porsche 911 Carrera'') that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
comment: Project page is available at https://talreiss.github.io/alterbute/
☆ Extrinsic Vector Field Processing
We propose a novel discretization of tangent vector fields for triangle meshes. Starting with a Phong map continuously assigning normals to all points on the mesh, we define an extrinsic bases for continuous tangent vector fields by using the Rodrigues rotation to transport tangent vectors assigned to vertices to tangent vectors in the interiors of the triangles. As our vector fields are continuous and weakly differentiable, we can use them to define a covariant derivative field that is evaluatable almost-everywhere on the mesh. Decomposing the covariant derivative in terms of diagonal multiple of the identity, anti-symmetric, and trace-less symmetric components, we can define the standard operators used for vector field processing including the Hodge Laplacian energy, Connection Laplacian energy, and Killing energy. Additionally, the ability to perform point-wise evaluation of the covariant derivative also makes it possible for us to define the Lie bracket.
☆ Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation
Camera control has been extensively studied in conditioned video generation; however, performing precisely altering the camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.
☆ Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting
In 1888, Vincent van Gogh wrote, "I am seeking exaggeration in the essential." This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
comment: 7 pages, 8 figures
♻ ☆ Efficient Maintenance of Leiden Communities in Large Dynamic Graphs
As a well-known community detection algorithm, Leiden has been widely used in various scenarios such as large language model generation (e.g., Graph-RAG), anomaly detection, and biological analysis. In these scenarios, the graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to obtain the updated communities by Leiden from scratch when the graph has changed. Recently, one work has attempted to study how to maintain Leiden communities in the dynamic graph, but it lacks a detailed theoretical analysis, and its algorithms are inefficient for large graphs. To address these issues, in this paper, we first theoretically show that the existing algorithms are relatively unbounded via the boundedness analysis (a powerful tool for analyzing incremental algorithms on dynamic graphs), and also analyze the memberships of vertices in communities when the graph changes. Based on theoretical analysis, we develop a novel efficient maintenance algorithm, called Hierarchical Incremental Tree Leiden (HIT-Leiden), which effectively reduces the range of affected vertices by maintaining the connected components and hierarchical community structures. Comprehensive experiments in various datasets demonstrate the superior performance of HIT-Leiden. In particular, it achieves speedups of up to five orders of magnitude over existing methods.
♻ ☆ Bayesian Monocular Depth Refinement via Neural Radiance Fields
Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate improvements on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
comment: IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)
Operating Systems 3
☆ LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures combined with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down the inference latency across pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism's capability.
comment: 12 pages, 6 figures
♻ ☆ Secure and Efficient Access Control for Computer-Use Agents via Context Space
Large language model (LLM)-based computer-use agents represent a convergence of AI and OS capabilities, enabling natural language to control system- and application-level functions. However, due to LLMs' inherent uncertainty issues, granting agents control over computers poses significant security risks. When agent actions deviate from user intentions, they can cause irreversible consequences. Existing mitigation approaches, such as user confirmation and LLM-based dynamic action validation, still suffer from limitations in usability, security, and performance. To address these challenges, we propose CSAgent, a system-level, static policy-based access control framework for computer-use agents. To bridge the gap between static policy and dynamic context and user intent, CSAgent introduces intent- and context-aware policies, and provides an automated toolchain to assist developers in constructing and refining them. CSAgent enforces these policies through an optimized OS service, ensuring that agent actions can only be executed under specific user intents and contexts. CSAgent supports protecting agents that control computers through diverse interfaces, including API, CLI, and GUI. We implement and evaluate CSAgent, which successfully defends against all attacks in the benchmarks while introducing only 1.99% performance overhead and 5.42% utility decrease.
♻ ☆ DAVOS: An Autonomous Vehicle Operating System in the Vehicle Computing Era
Vehicle computing represents a fundamental shift in how autonomous vehicles are designed and deployed, transforming them from isolated transportation systems into mobile computing platforms that support both safety-critical, real-time driving and data-centric services. In this setting, vehicles simultaneously support real-time driving pipelines and a growing set of data-driven applications, placing increased responsibility on the vehicle operating system to coordinate computation, data movement, storage, and access. These demands highlight recurring system considerations related to predictable execution, data and execution protection, efficient handling of high-rate sensor data, and long-term system evolvability, commonly summarized as Safety, Security, Efficiency, and Extensibility (SSEE). Existing vehicle operating systems and runtimes address these concerns in isolation, resulting in fragmented software stacks that limit coordination between autonomy workloads and vehicle data services. This paper presents DAVOS, the Delaware Autonomous Vehicle Operating System, a unified vehicle operating system architecture designed for the vehicle computing context. DAVOS provides a cohesive operating system foundation that supports both real-time autonomy and extensible vehicle computing within a single system framework.
Hardware Architecture 5
☆ Late Breaking Results: Quamba-SE: Soft-edge Quantizer for Activations in State Space Models DATE
We propose Quamba-SE, a soft-edge quantizer for State Space Model (SSM) activation quantization. Unlike existing methods, using standard INT8 operation, Quamba-SE employs three adaptive scales: high-precision for small values, standard scale for normal values, and low-precision for outliers. This preserves outlier information instead of hard clipping, while maintaining precision for other values. We evaluate on Mamba- 130M across 6 zero-shot benchmarks. Results show that Quamba- SE consistently outperforms Quamba, achieving up to +2.68% on individual benchmarks and up to +0.83% improvement in the average accuracy of 6 datasets.
comment: Accepted to DATE Late Breaking Results 2026, Verona, Italy
☆ Relational Hoare Logic for High-Level Synthesis of Hardware Accelerators
High-level synthesis (HLS) is a powerful tool for developing efficient hardware accelerators that rely on specialized memory systems to achieve sufficient on-chip data reuse and off-chip bandwidth utilization. However, even with HLS, designing such systems still requires careful manual tuning, as automatic optimizations provided by existing tools are highly sensitive to programming style and often lack transparency. To address these issues, we present a formal translation framework based on relational Hoare logic, which enables robust and transparent transformations. Our method recognizes complex memory access patterns in naïve HLS programs and automatically transforms them by inserting on-chip buffers to enforce linear access to off-chip memory, and by replacing non-sequential processing with stream processing, while preserving program semantics. Experiments using our prototype translator, combined with an off-the-shelf HLS compiler and a real FPGA board, have demonstrated significant performance improvements.
comment: An extended version of the paper to appear in Proceedings of ESOP 2026
☆ Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization
Deploying deep neural networks (DNNs) on resource-constrained edge devices such as FPGAs requires a careful balance among latency, power, and hardware resource usage, while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs -- such as LogicNets, PolyLUT, and NeuraLUT -- face two critical challenges: the exponential growth of LUT size and inefficient random sparse connectivity. This paper presents SparseLUT, a comprehensive framework that addresses these challenges through two orthogonal optimizations. First, we propose an architectural enhancement that aggregates multiple PolyLUT sub-neurons via an adder, significantly reducing LUT consumption by 2.0x-13.9x and lowering inference latency by 1.2x-1.6x, all while maintaining comparable accuracy. Building upon this foundation, we further introduce a non-greedy training algorithm that optimizes neuron connectivity by selectively pruning less significant inputs and strategically regrowing more effective ones. This training optimization, which incurs no additional area and latency overhead, delivers consistent accuracy improvements across benchmarks -- achieving up to a 2.13% gain on MNIST and 0.94% on Jet Substructure Classification compared to existing LUT-DNN approaches.
comment: arXiv admin note: substantial text overlap with arXiv:2503.12829, arXiv:2406.04910
♻ ☆ SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
♻ ☆ Challenges and Research Directions for Large Language Model Inference Hardware
Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speedup communication. While our focus is datacenter AI, we also review their applicability for mobile devices.
comment: Accepted for publication by IEEE Computer, 2026
Programming Languages 7
☆ MLIR-Forge: A Modular Framework for Language Smiths
Optimizing compilers are essential for the efficient and correct execution of software across various scientific fields. Domain-specific languages (DSL) typically use higher level intermediate representations (IR) in their compiler pipelines for domain-specific optimizations. As these IRs add to complexity, it is crucial to test them thoroughly. Random program generators have proven to be an effective tool to test compilers through differential and fuzz testing. However, developing specialized program generators for compiler IRs is not straightforward and demands considerable resources. We introduce MLIR-Forge, a novel random program generator framework that leverages the flexibility of MLIR, aiming to simplify the creation of specialized program generators. MLIR-Forge achieves this by splitting the generation process into fundamental building blocks that are language specific, and reusable program creation logic that constructs random programs from these building blocks. This hides complexity and furthermore, even the language specific components can be defined using a set of common tools. We demonstrate MLIR-Forge's capabilities by generating MLIR with built-in dialects, WebAssembly, and a data-centric program representation, DaCe -- requiring less than a week of development time in total for each of them. Using the generated programs we conduct differential testing and find 9 MLIR, 15 WebAssembly, and 774 DaCe groups of bugs with the corresponding program generators, after running them until the rate of new bugs stagnates.
☆ Relational Hoare Logic for High-Level Synthesis of Hardware Accelerators
High-level synthesis (HLS) is a powerful tool for developing efficient hardware accelerators that rely on specialized memory systems to achieve sufficient on-chip data reuse and off-chip bandwidth utilization. However, even with HLS, designing such systems still requires careful manual tuning, as automatic optimizations provided by existing tools are highly sensitive to programming style and often lack transparency. To address these issues, we present a formal translation framework based on relational Hoare logic, which enables robust and transparent transformations. Our method recognizes complex memory access patterns in naïve HLS programs and automatically transforms them by inserting on-chip buffers to enforce linear access to off-chip memory, and by replacing non-sequential processing with stream processing, while preserving program semantics. Experiments using our prototype translator, combined with an off-the-shelf HLS compiler and a real FPGA board, have demonstrated significant performance improvements.
comment: An extended version of the paper to appear in Proceedings of ESOP 2026
☆ Lazy Evaluation: A Comparative Analysis of SAS MACROs and R Functions
Lazy evaluation is a powerful technique that can optimize code execution by deferring evaluations until their results are required, thus enhancing efficiency. In most modern programming languages, like R, lazy evaluation is commonly applied to function arguments. However, the application of lazy evaluation in SAS has not been extensively explored. This paper focuses on the mechanisms of lazy evaluation in SAS MACROs and R functions, offering a comparative analysis of the underlying principles that drive these processes. R's lazy evaluation is driven by a data structure called Promise, which postpones evaluation and does not occupy memory until the value is needed, utilizing a call-by-need strategy. SAS, on the other hand, achieves lazy evaluation through its symbol tables, employing memory to store parameters, and operates on a call-by-name basis. These discrepancies in lazy evaluation strategies can notably impact the results of R functions and SAS MACROs. By examining these distinct approaches, the paper illuminates the impact of lazy evaluation on programming efficiency, supported by illustrative examples. As the shift from SAS to R becomes increasingly prevalent in the pharmaceutical industry, understanding these techniques enables programmers to optimize their code for greater efficacy. This exploration serves as a guide to enhance programming capabilities and performance in both languages.
comment: This paper was originally published in SESUG 2025 Conference Proceedings. Cary, NC: SouthEast SAS Users Group
☆ From Dynamic to Lexical: A Comparative Exploration of Scoping Rules in SAS and R
Variable scoping dictates how and where variables are accessible within programming languages, playing a crucial role in code efficiency and organization. This paper examines the distinct scoping rules in SAS and R, focusing on SAS's dynamic scoping and R's lexical scoping. In SAS, dynamic scoping utilizes symbol tables, resolving variables at runtime by dynamically searching through active macro layers. R, in contrast, employs lexical scoping, using environments to resolve variables based on the structure in which functions are defined. Illustrative examples highlight the differences between these scoping strategies, showcasing their impact on code behavior. Additionally, the paper outlines methods for inspecting variables in SAS's symbol tables and R's environments, offering practical insights for debugging and optimization. Strategies for controlling variable scope in both languages are discussed, enhancing code precision and reliability. This exploration equips programmers with critical understanding to optimize variable management, improving their programming practices in SAS and R.
comment: This paper was originally published in the SESUG 2025 Conference Proceedings. Cary, NC
♻ ☆ Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES
Algebraic data types (ADTs) let a representation specify at the type level what molecular values are valid and what transformations are meaningful. We propose a molecular representation as a family of typed ADTs that separates (i) constitution (Dietz style bonding systems), (ii) 3D coordinates and stereochemistry, and (iii) electronic structure annotations. This separation makes invariants explicit, supports deterministic local edits, and provides hooks for symmetry aware and Bayesian modeling. These data structures allow us to consider how the representation constrains operations which may be performed over them. Types make invalid manipulations unrepresentable and make it easier to define meaningful priors/likelihoods over generative models (programs with sample and score operations). Unlike string based formats, the ADT exposes chemical structure directly; validity conditions (e.g., valence and symmetry constraints) can be enforced by construction and checked deterministically during transformations. We optionally attach electronic structure annotations (shell/subshell/orbital metadata) to atoms when such information is available; we do not attempt to compute orbitals in this work. We sketch Bayesian probabilistic programming via an integration with LazyPPL, a lazy probabilistic programming library; molecules can be made instances of a group under rotation to support geometric learning settings where molecular properties are invariant under rigid motions and relabellings; and the framework's flexibility is demonstrated through an extension to represent chemical reactions. We provide a Haskell library implementing the representation, released under an OSI approved open source license and archived with a DOI.
comment: 3 Figures
♻ ☆ Expressivity of AuDaLa: Turing Completeness and Possible Extensions
AuDaLa is a recently introduced programming language that follows the new data autonomous paradigm. In this paradigm, small pieces of data execute functions autonomously. Considering the paradigm and the design choices of AuDaLa, it is interesting to determine the expressivity of the language. In this paper, we implement Turing machines in AuDaLa and prove that implementation correct. This proves that AuDaLa is Turing complete, giving an initial indication of AuDaLa's expressivity. Additionally, we give examples of how to add extensions to AuDaLa to increase its practical expressivity and to better match conventional parallel languages, allowing for a more straightforward and performant implementation of algorithms.
comment: 30 pages, 1 figure, submitted to LMCS, extension of submission to FORTE (preprint at arXiv:2404.12934)
♻ ☆ CSSG: Measuring Code Similarity with Semantic Graphs
Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between programs.We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code.Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.
Data Structures and Algorithms 18
☆ Permutation Matching Under Parikh Budgets: Linear-Time Detection, Packing, and Disjoint Selection
We study permutation (jumbled/Abelian) pattern matching over a general alphabet $Σ$. Given a pattern P of length m and a text T of length n, the classical task is to decide whether T contains a length-m substring whose Parikh vector equals that of P . While this existence problem admits a linear-time sliding-window solution, many practical applications require optimization and packing variants beyond mere detection. We present a unified sliding-window framework based on maintaining the Parikh-vector difference between P and the current window of T , enabling permutation matching in O(n + σ) time and O(σ) space, where σ = |Σ|. Building on this foundation, we introduce a combinatorial-optimization variant that we call Maximum Feasible Substring under Pattern Supply (MFSP): find the longest substring S of T whose symbol counts are component-wise bounded by those of P . We show that MFSP can also be solved in O(n + σ) time via a two-pointer feasibility maintenance algorithm, providing an exact packing interpretation of P as a resource budget. Finally, we address non-overlapping occurrence selection by modeling each permutation match as an equal-length interval and proving that a greedy earliest-finishing strategy yields a maximum-cardinality set of disjoint matches, computable in linear time once all matches are enumerated. Our results provide concise, provably correct algorithms with tight bounds, and connect frequency-based string matching to packing-style optimization primitives.
comment: 12 pages (Excluding reference)
☆ Boltzmann Sampling for Powersets without an Oracle
We show that powersets over structures with a bounded counting sequence can be sampled efficiently without evaluating the generating function. An algorithm is provided, implemented, and tested. Runtimes are comparable to existing Boltzmann samplers reported in the literature.
☆ On Numbers of Simplicial Walks and Equivalent Canonizations for Graph Recognition
Two graphs are isomorphic exactly when they admit the same number of homomorphisms from every graph. Hence, a graph is recognized up to isomorphism by homomorphism counts over the class of all graphs. Restricting to a specific graph class yields some natural isomorphism relaxations and modulates recognition to particular graph properties. A notable restriction is to the classes of bounded treewidth, yielding the isomorphism relaxation of Weisfeiler--Leman refinement (WL), as shown by Dvořák [JGT 2010]. The properties recognized by WL are exactly those definable in fragments of first-order logic with counting quantifiers, as shown by Cai, Fürer, and Immerman [Comb. 1992]. We characterize the restriction to the classes of bounded pathwidth by numbers of simplicial walks, and formalize it into a refinement procedure (SW). The properties recognized by SW are exactly those definable in fragments of restricted-conjunction first-order logic with counting quantifiers, introduced by Montacute and Shah [LMCS 2024]. Unlike WL, computing SW directly is not polynomial-time in general. We address this by representing SW in terms of multiplicity automata. We equip these automata with an involution, simplifying the canonization to standard forward reduction and omitting the backward one. The resulting canonical form is computable in time $O(kn^{3k})$ for any graph on $n$ vertices and the restriction to pathwidth at most $k$.
comment: Accepted for LATIN 2026
☆ How many users have been here for a long time? Efficient solutions for counting long aggregated visits
This paper addresses the Counting Long Aggregated Visits problem, which is defined as follows. We are given $n$ users and $m$ regions, where each user spends some time visiting some regions. For a parameter $k$ and a query consisting of a subset of $r$ regions, the task is to count the number of distinct users whose aggregate time spent visiting the query regions is at least $k$. This problem is motivated by queries arising in the analysis of large-scale mobility datasets. We present several exact and approximate data structures for supporting counting long aggregated visits, as well as conditional and unconditional lower bounds. First, we describe an exact data structure that exhibits a space-time tradeoff, as well as efficient approximate solutions based on sampling and sketching techniques. We then study the problem in geometric settings where regions are points in $\mathbb{R}^d$ and queries are hyperrectangles, and derive exact data structures that achieve improved performance in these structured spaces.
☆ Engineering Compressed Matrix Multiplication with the Fast Walsh-Hadamard Transform
We present an implementation of Pagh's compressed matrix multiplication algorithm, a randomized algorithm that constructs sketches of matrices to compute an unbiased estimate of their product. By leveraging fast polynomial multiplication via the FFT, the algorithm achieves high performance when the product matrix is sparse or contains only a small number of entries with magnitudes significantly larger than the rest. We show empirically that the algorithm is practical and can outperform state-of-the-art DGEMM implementations when the product matrix has few nonzero entries or is otherwise dominated by a small subset of elements with large magnitude. As a minor theoretical contribution, we replace the FFT with the Fast Walsh-Hadamard Transform (FWHT) in sketched multiplication, preserving all correctness and variance guarantees of the original algorithm. Experiments with our carefully engineered multithreaded CPU implementation for dense double-precision matrices on 64-core CPU nodes across a range of synthetic benchmarks, exhibiting variable sparsity patterns, show that the FWHT variant is up to 4 times faster than the FFT-based version. Under favorable sparsity and magnitude patterns in the product matrix, our FWHT-based implementation achieves a speedup of up to 40 over DGEMM from Intel MKL, with low probability of error in the estimates. Our implementation is released as free software and comes with NumPy-compatible Python bindings.
comment: 23 pages
☆ Computational Complexity of Swish
Swish is a card game in which players are given cards having symbols (hoops and balls), and find a valid superposition of cards, called a "swish." Dailly, Lafourcade, and Marcadet (FUN 2024) studied a generalized version of Swish and showed that the problem is solvable in polynomial time with one symbol per card, while it is NP-complete with three or more symbols per card. In this paper, we resolve the previously open case of two symbols per card, which corresponds to the original game. We show that Swish is NP-complete for this case. Specifically, we prove the NP-hardness when the allowed transformations of cards are restricted to a single (horizontal or vertical) flip or 180-degree rotation, and extend the results to the original setting allowing all three transformations. In contrast, when neither transformation is allowed, we present a polynomial-time algorithm. Combining known and our results, we establish a complete characterization of the computational complexity of Swish with respect to both the number of symbols per card and the allowed transformations.
comment: 10 pages, 5 figures
☆ On the complexity of global Roman domination problem in graphs
A Roman dominating function of a graph $G=(V,E)$ is a labeling $f: V \rightarrow{} \{0 ,1, 2\}$ such that for each vertex $u \in V$ with $f(u) = 0$, there exists a vertex $v \in N(u)$ with $f(v) =2$. A Roman dominating function $f$ is a global Roman dominating function if it is a Roman dominating function for both $G$ and its complement $\overline{G}$. The weight of $f$ is the sum of $f(u)$ over all the vertices $u \in V$. The objective of Global Roman Domination problem is to find a global Roman dominating function with minimum weight. The objective of Global Roman Domination is to compute a global Roman dominating function of minimum weight. In this paper, we study the algorithmic aspects of Global Roman Domination problem on various graph classes and obtain the following results. 1. We prove that Roman domination and Global Roman Domination problems are not computationally equivalent by identifying graph classes on which one is linear-time solvable, while the other is NP-complete. 2. We show that Global Roman Domination problem is NP-complete on split graphs, thereby resolving an open question posed by Panda and Goyal [Discrete Applied Mathematics, 2023]. 3. We prove that Global Roman Domination problem is NP-complete on chordal bipartite graphs, planar bipartite graphs with maximum degree five and circle graphs. 4. On the positive side, we present a linear-time algorithm for Global Roman domination problem on cographs.
comment: Submitted to Elsevier, 27 pages
☆ Dynamic Hierarchical $j$-Tree Decomposition and Its Applications
We develop a new algorithmic framework for designing approximation algorithms for cut-based optimization problems on capacitated undirected graphs that undergo edge insertions and deletions. Specifically, our framework dynamically maintains a variant of the hierarchical $j$-tree decomposition of [Madry FOCS'10], achieving a poly-logarithmic approximation factor to the graph's cut structure and supporting edge updates in $O(n^ε)$ amortized update time, for any arbitrarily small constant $ε\in (0,1)$. Consequently, we obtain new trade-offs between approximation and update/query time for fundamental cut-based optimization problems in the fully dynamic setting, including all-pairs minimum cuts, sparsest cut, multi-way cut, and multi-cut. For the last three problems, these trade-offs give the first fully-dynamic algorithms achieving poly-logarithmic approximation in sub-linear time per operation. The main technical ingredient behind our dynamic hierarchy is a dynamic cut-sparsifier algorithm that can handle vertex splits with low recourse. This is achieved by white-boxing the dynamic cut sparsifier construction of [Abraham et al. FOCS'16], based on forest packing, together with new structural insights about the maintenance of these forests under vertex splits. Given the versatility of cut sparsification in both the static and dynamic graph algorithms literature, we believe this construction may be of independent interest.
comment: SODA 2026
☆ A Grouped Sorting Queue Supporting Dynamic Updates for Timer Management in High-Speed Network Interface Cards
With the hardware offloading of network functions, network interface cards (NICs) undertake massive stateful, high-precision, and high-throughput tasks, where timers serve as a critical enabling component. However, existing timer management schemes suffer from heavy software load, low precision, lack of hardware update support, and overflow. This paper proposes two novel operations for priority queues--update and group sorting--to enable hardware timer management. To the best of our knowledge, this work presents the first hardware priority queue to support an update operation through the composition and propagation of basic operations to modify the priorities of elements within the queue. The group sorting mechanism ensures correct timing behavior post-overflow by establishing a group boundary priority to alter the sorting process and element insertion positions. Implemented with a hybrid architecture of a one-dimension (1D) systolic array and shift registers, our design is validated through packet-level simulations for flow table timeout management. Results demonstrate that a 4K-depth, 16-bit timer queue achieves over 500 MHz (175 Mpps, 12 ns precision) in a 28nm process and over 300 MHz (116 Mpps) on an FPGA. Critically, it reduces LUTs and FFs usage by 31% and 25%, respectively, compared to existing designs.
☆ SCaLE: Switching Cost aware Learning and Exploration
This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm SCaLE that provably achieves a distribution-agnostic sub-linear dynamic regret, without the knowledge of hitting cost structure. En-route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments, against online-learning baselines, corroborate our claims, and highlight statistical consistency of our algorithm.
comment: 42 pages
☆ Improved Algorithms for Fair Matroid Submodular Maximization
Submodular maximization subject to matroid constraints is a central problem with many applications in machine learning. As algorithms are increasingly used in decision-making over datapoints with sensitive attributes such as gender or race, it is becoming crucial to enforce fairness to avoid bias and discrimination. Recent work has addressed the challenge of developing efficient approximation algorithms for fair matroid submodular maximization. However, the best algorithms known so far are only guaranteed to satisfy a relaxed version of the fairness constraints that loses a factor 2, i.e., the problem may ask for $\ell$ elements with a given attribute, but the algorithm is only guaranteed to find $\lfloor \ell/2 \rfloor$. In particular, there is no provable guarantee when $\ell=1$, which corresponds to a key special case of perfect matching constraints. In this work, we achieve a new trade-off via an algorithm that gets arbitrarily close to full fairness. Namely, for any constant $\varepsilon>0$, we give a constant-factor approximation to fair monotone matroid submodular maximization that in expectation loses only a factor $(1-\varepsilon)$ in the lower-bound fairness constraint. Our empirical evaluation on a standard suite of real-world datasets -- including clustering, recommendation, and coverage tasks -- demonstrates the practical effectiveness of our methods.
♻ ☆ Parameterized Complexity of Biclique Contraction and Balanced Biclique Contraction
In this work, we initiate the complexity study of Biclique Contraction and Balanced Biclique Contraction. In these problems, given as input a graph G and an integer k, the objective is to determine whether one can contract at most k edges in G to obtain a biclique and a balanced biclique, respectively. We first prove that these problems are NP-complete even when the input graph is bipartite. Next, we study the parameterized complexity of these problems and show that they admit single exponential-time FPT algorithms when parameterized by the number k of edge contractions. Then, we show that Balanced Biclique Contraction admits a quadratic vertex kernel while Biclique Contraction does not admit any polynomial compression (or kernel) under standard complexity-theoretic assumptions. We also give faster FPT algorithms for contraction to restricted bicliques.
♻ ☆ The Longest Common Bitonic Subsequence: A Match-Sensitive Dynamic Programming Approach
Given two sequences $A[1..n]$ and $B[1..m]$ over a totally ordered alphabet, the \emph{Longest Common Bitonic Subsequence} (LCBS) problem asks for a longest common subsequence that is strictly increasing up to a single peak element and strictly decreasing thereafter (allowing either phase to be empty). The only explicitly documented approach evaluates a quadratic dynamic program over the full $n\times m$ grid, which is prohibitive on large inputs. We present two exact algorithms. First, we give a simple $Θ(nm)$-time baseline that computes LCBS by combining a longest common increasing subsequence (LCIS) computation on $(A,B)$ with a second LCIS computation on the reversed inputs, and then maximizing $INC(i,j)+DEC(i,j)-1$ over all common peaks. The method is constructive via parent pointers. Second, we develop an \emph{instance-sensitive} algorithm whose running time depends on the number $\mathcal{M}$ of matching pairs $(i,j)$ with $A[i]=B[j]$. We view matches as vertices of a dominance-ordered poset and compute the increasing and decreasing halves by two 2D dominance DP passes supported by orthogonal range-maximum data structures, followed by a linear peak scan. With a standard 2D range tree (or equivalent), this yields $O(\mathcal{M}\log^{2}\mathcal{M} + \mathcal{M} + (n+m)\log(n+m))$ time and $O(\mathcal{M}\log \mathcal{M})$ space, and it improves over the dense baseline whenever $M\log^2 M\ll nm$.
comment: 12 pages, 2 figres, In the process of submission to 37th Annual Symposium on Combinatorial Pattern Matching
♻ ☆ The PRODSAT phase of random quantum satisfiability
The $k$-QSAT problem is a quantum analog of the famous $k$-SAT constraint satisfaction problem. We must determine the zero energy ground states of a Hamiltonian of $N$ qubits consisting of a sum of $M$ random $k$-local rank-one projectors. It is known that product states of zero energy exist with high probability if and only if the underlying factor graph has a clause-covering dimer configuration. This means that the threshold of the PRODSAT phase is a purely geometric quantity equal to the dimer covering threshold. We revisit and fully prove this result through a combination of complex analysis and algebraic methods based on Buchberger's algorithm for complex polynomial equations with random coefficients. We also discuss numerical experiments investigating the presence of entanglement in the PRODSAT phase in the sense that product states do not span the whole zero energy ground state space.
♻ ☆ Hamiltonian Property Testing
Locality is a fundamental feature of many physical time evolutions. Assumptions on locality and related structural properties also underlie recently proposed procedures for learning an unknown Hamiltonian from access to the induced time evolution. However, no protocols to rigorously test whether an unknown Hamiltonian is local were known. We investigate Hamiltonian locality testing as a property testing problem, where the task is to determine whether an unknown $n$-qubit Hamiltonian $H$ is $k$-local or $\varepsilon$-far from all $k$-local Hamiltonians, given access to the time evolution along $H$. First, we emphasize the importance of the chosen distance measure: With respect to the operator norm, a worst-case distance measure, incoherent quantum locality testers require $\tildeΩ(2^n)$ many time evolution queries and an expected total evolution time of $\tildeΩ(2^n / \varepsilon)$, and even coherent testers need $Ω(2^{n/2})$ many queries and $Ω(2^{n/2}/\varepsilon)$ total evolution time. In contrast, when distances are measured according to the normalized Frobenius norm, corresponding to an average-case distance, we give a sample-, time-, and computationally efficient incoherent Hamiltonian locality testing algorithm based on randomized measurements. In fact, our procedure can be used to simultaneously test a wide class of Hamiltonian properties beyond locality. Finally, we prove that learning a general Hamiltonian remains exponentially hard with this average-case distance, thereby establishing an exponential separation between Hamiltonian testing and learning. Our work initiates the study of property testing for quantum Hamiltonians, demonstrating that a broad class of Hamiltonian properties is efficiently testable even with limited quantum capabilities, and positioning Hamiltonian testing as an independent area of research alongside Hamiltonian learning.
comment: 70 pages, 3 figures; minor improvements. Version accepted for publication in Quantum
♻ ☆ Merging RLBWTs adaptively
We show how to merge two run-length compressed Burrows-Wheeler Transforms (RLBWTs) into a run-length compressed extended Burrows-Wheeler Transform (eBWT) in $O (r)$ space and $O ((r + L) \log (m + n))$ time, where $m$ and $n$ are the lengths of the uncompressed strings, $r$ is the number of runs in the final eBWT and $L$ is the sum of its irreducible LCP values.
♻ ☆ The Polymatroid Representation of a Greedoid, and Associated Galois Connections
A greedoid is a generalization of a matroid allowing for more flexible analyses and modeling of combinatorial optimization problems. However, these structures decimate many matroid properties contributing to their pervasive nature. A polymatroid greedoid [KL85a] presents an interesting middle ground, so we further develop this class. First we prove every local poset greedoid for which the greedy algorithm correctly solves linear optimizations over its basic words must have a polymatroid representation. For this, we use relationships between the lattices of greedoid flats and closed sets of a polymatroid to generalize concepts in [KL85a]. Then, we show our generalization induces a Galois injection between the greedoid flats and closed sets of a representation. Finally, we apply this duality to identify a subclass of polymatroid greedoids with a maximum representation, giving a partial answer to an open problem of [KL85a]. As technical tools for our analyses, we introduce optimism and the Forking Lemma for interval greedoids. Both are pervasive in our work, and are of independent interest.
comment: 41 pages, 8 figures, 4 appendices. In versions 1 and 2 there is an error in the proof of the main claim of an alternative description of polymatroid greedoids. In the latest version we have changed our main results to remove any overclaims or errors, and provide careful analyses of the inclusions and boundaries of the greedoid classes in our appendix which we interact with in our work
♻ ☆ Approximation algorithms for scheduling with rejection in green manufacturing
Motivated by green manufacturing, this paper investigates a scheduling with rejection problem subject to an energy consumption constraint. Machines are associated with non-uniform energy consumption rates, defined as the energy consumed per unit time. Each job is either rejected with a rejection penalty or accepted and scheduled on some machine for processing, which incurs energy consumption. The problem aims to minimize the makespan of the accepted jobs plus the total penalty of the rejected jobs while the total energy consumption is bounded by a given threshold. In this paper, when the number of machines is part of the input, we develop the first $(2+ε)$-approximation algorithm for any fixed constant $ε$ and a simple QPTAS as well as a PTAS for uniform energy consumption rates. Moreover, we present an FPTAS when the number of machines is a fixed constant.
Graphics 5
☆ Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps
We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer's input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.
☆ Variable Basis Mapping for Real-Time Volumetric Visualization
Real-time visualization of large-scale volumetric data remains challenging, as direct volume rendering and voxel-based methods suffer from prohibitively high computational cost. We propose Variable Basis Mapping (VBM), a framework that transforms volumetric fields into 3D Gaussian Splatting (3DGS) representations through wavelet-domain analysis. First, we precompute a compact Wavelet-to-Gaussian Transition Bank that provides optimal Gaussian surrogates for canonical wavelet atoms across multiple scales. Second, we perform analytical Gaussian construction that maps discrete wavelet coefficients directly to 3DGS parameters using a closed-form, mathematically principled rule. Finally, a lightweight image-space fine-tuning stage further refines the representation to improve rendering fidelity. Experiments on diverse datasets demonstrate that VBM significantly accelerates convergence and enhances rendering quality, enabling real-time volumetric visualization.
comment: 11 pages. Under review
☆ TIDI-GS: Floater Suppression in 3D Gaussian Splatting for Enhanced Indoor Scene Fidelity
3D Gaussian Splatting (3DGS) is a technique to create high-quality, real-time 3D scenes from images. This method often produces visual artifacts known as floaters--nearly transparent, disconnected elements that drift in space away from the actual surface. This geometric inaccuracy undermines the reliability of these models for practical applications, which is critical. To address this issue, we introduce TIDI-GS, a new training framework designed to eliminate these floaters. A key benefit of our approach is that it functions as a lightweight plugin for the standard 3DGS pipeline, requiring no major architectural changes and adding minimal overhead to the training process. The core of our method is a floater pruning algorithm--TIDI--that identifies and removes floaters based on several criteria: their consistency across multiple viewpoints, their spatial relationship to other elements, and an importance score learned during training. The framework includes a mechanism to preserve fine details, ensuring that important high-frequency elements are not mistakenly removed. This targeted cleanup is supported by a monocular depth-based loss function that helps improve the overall geometric structure of the scene. Our experiments demonstrate that TIDI-GS improves both the perceptual quality and geometric integrity of reconstructions, transforming them into robust digital assets, suitable for high-fidelity applications.
♻ ☆ OpenPBR: Novel Features and Implementation Details SIGGRAPH 2025
OpenPBR is a physically based, standardized uber-shader developed for interoperable material authoring and rendering across VFX, animation, and design visualization workflows. This document serves as a companion to the official specification, offering deeper insight into the model's development and more detailed implementation guidance, including code examples and mathematical derivations. We begin with a description of the model's formal structure and theoretical foundations - covering slab-based layering, statistical mixing, and microfacet theory - before turning to its physical components. These include metallic, dielectric, subsurface, and glossy-diffuse base substrates, followed by thin-film iridescence, coat, and fuzz layers. A special-case mode for rendering thin-walled objects is also described. Additional sections explore technical topics in greater depth, such as the decoupling of specular reflectivity from transmission, the choice of parameterization for subsurface scattering, and the detailed physics of coat darkening and thin-film interference. We also discuss planned extensions, including hazy specular reflection and retroreflection.
comment: Part of Physically Based Shading in Theory and Practice, SIGGRAPH 2025 Course
♻ ☆ Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models
Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
comment: 10 pages, 6 figures
Hardware Architecture 7
☆ Memory DisOrder: Memory Re-orderings as a Timerless Side-channel
To improve efficiency, nearly all parallel processing units (CPUs and GPUs) implement relaxed memory models in which memory operations may be re-ordered, i.e., executed out-of-order. Prior testing work in this area found that memory re-orderings are observed more frequently when other cores are active, e.g., stressing the memory system, which likely triggers aggressive hardware optimizations. In this work, we present Memory DisOrder: a timerless side-channel that uses memory re-orderings to infer activity on other processes. We first perform a fuzzing campaign and show that many mainstream processors (X86/Arm/Apple CPUs, NVIDIA/AMD/Apple GPUs) are susceptible to cross-process signals. We then show how the vulnerability can be used to implement classic attacks, including a covert channel, achieving up to 16 bits/second with 95% accuracy on an Apple M3 GPU, and application fingerprinting, achieving reliable closed-world DNN architecture fingerprinting on several CPUs and an Apple M3 GPU. Finally, we explore how low-level system details can be exploited to increase re-orderings, showing the potential for a covert channel to achieve nearly 30K bits/second on X86 CPUs. More precise attacks can likely be developed as the vulnerability becomes better understood.
☆ Bio-RV: Low-Power Resource-Efficient RISC-V Processor for Biomedical Applications
This work presents Bio-RV, a compact and resource-efficient RISC-V processor intended for biomedical control applications, such as accelerator-based biomedical SoCs and implantable pacemaker systems. The proposed Bio-RV is a multi-cycle RV32I core that provides explicit execution control and external instruction loading with capabilities that enable controlled firmware deployment, ASIC bring-up, and post-silicon testing. In addition to coordinating accelerator configuration and data transmission in heterogeneous systems, Bio-RV is designed to function as a lightweight host controller, handling interfaces with pacing, sensing, electrogram (EGM), telemetry, and battery management modules. With 708 LUTs and 235 flip-flops on FPGA prototypes, Bio-RV, implemented in a 180 nm CMOS technology, operate at 50 MHz and feature a compact hardware footprint. According to post-layout results, the proposed architectural decisions align with minimal energy use. Ultimately, Bio-RV prioritises deterministic execution, minimal hardware complexity, and integration flexibility over peak computing speed to meet the demands of ultra-low-power, safety-critical biomedical systems.
comment: IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI-2026)
☆ A New Tool to Find Lightweight (And, Xor) Implementations of Quadratic Vectorial Boolean Functions up to Dimension 9
The problem of finding a minimal circuit to implement a given function is one of the oldest in electronics. It is known to be NP-hard. Still, many tools exist to find sub-optimal circuits to implement a function. In electronics, such tools are known as synthesisers. However, these synthesisers aim to implement very large functions (a whole electronic chip). In cryptography, the focus is on small functions, hence the necessity for new dedicated tools for small functions. Several tools exist to implement small functions. They differ by their algorithmic approach (some are based on Depth-First-Search as introduced by Ullrich in 2011, some are based on SAT-solvers like the tool desgined by Stoffelen in 2016, some non-generic tools use subfield decomposition) and by their optimisation criteria (some optimise for circuit size, others for circuit depth, and some for side-channel-protected implementations). However, these tools are limited to functions operating on less than 5 bits, sometimes 6 bits for quadratic functions, or to very simple functions. The limitation lies in a high computing time. We propose a new tool (The tool is provided alongside the IEEE article with CodeOcean and at https://github.com/seduval/implem-quad-sbox) to implement quadratic functions up to 9 bits within AND-depth 1, minimising the number of AND gates. This tool is more time-efficient than previous ones, allowing to explore larger implementations than others on 6 bits or less and allows to reach larger sizes, up to 9 bits.
☆ Annotated PIM Bibliography
Processing in Memory (PIM) and similar terms such as Compute In Memory (CIM), Logic in Memory (LIM), In Memory Computing (IMC), and Near Memory Computing (NMC) have gained attention recently as a potentially ``revolutionary new'' technique. The truth, however, is that many examples of the technology go back over 60 years. This document attempts to provide an annotated bibliography of PIM technology that attempts to cover the whole time-frame, and is organized to augment a forth-coming article.
comment: Initial version. Will be updated with more references and detail in future releases
♻ ☆ Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions
As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamic, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. While the conductance changes by a constant in response to each pulse, in reality, the change is scaled by asymmetric and non-linear response functions, leading to a non-ideal training dynamic. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions. We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose Residual Learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We demonstrate that the proposed method can be extended to address other hardware imperfections, such as limited response granularity. As we know, it is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.
♻ ☆ Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates
The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. A straightforward hardwiring of gpt-oss 120 B would require fabricating photomask sets valued at over 6 billion dollars, rendering this straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 photomask layers are homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x that of GPU/WSE), 36 tokens/J (1,047x/283x that of GPU/WSE), 13,232 mm2 total die area, $59.46 M-123.5 M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 41.7-80.4x improvement in cost-effectiveness and 357x reduction in carbon footprint compared to OpenAI-scale H100 clusters, under an annual weight updating assumption.
♻ ☆ H3PIMAP: A Heterogeneity-Aware Multi-Objective DNN Mapping Framework on Electronic-Photonic Processing-in-Memory Architectures
The future of artificial intelligence (AI) acceleration demands a paradigm shift beyond the limitations of purely electronic or photonic architectures. Photonic analog computing delivers unmatched speed and parallelism but struggles with data movement, robustness, and precision, while electronic processing-in-memory (PIM) enables energy-efficient computing by co-locating storage and computation but suffers from endurance and reconfiguration constraints, limiting it to static weight mapping. Neither approach alone achieves the balance needed for adaptive, efficient AI. To break this impasse, we study a hybrid electronic-photonic-PIM computing architecture and introduce H3PIMAP, a heterogeneity-aware mapping framework that seamlessly orchestrates workloads across electronic and optical tiers. By optimizing workload partitioning through a two-stage multi-objective exploration method, H3PIMAP harnesses light speed for high-throughput operations and PIM efficiency for memory-bound tasks. In system-level evaluations, H3PIMAP delivers a 3.32x latency reduction across language and vision models and, on large language models, achieves 77.0% lower latency with 14.6% lower energy at matched quality, outperforming homogeneous and naive mapping strategies. This proposed framework lays the foundation for hybrid AI accelerators, bridging the gap between electronic and photonic computation for next-generation efficiency and scalability.
Programming Languages 3
☆ Formalization and Implementation of Safe Destination Passing in Pure Functional Programming Settings
Destination-passing style programming introduces destinations, which represent the address of a write-once memory cell. These destinations can be passed as function parameters, allowing the caller to control memory management: the callee simply fills the cell instead of allocating space for a return value. While typically used in systems programming, destination passing also has applications in pure functional programming, where it enables programs that were previously unexpressible using usual immutable data structures. In this thesis, we develop a core λ-calculus with destinations, {λ_d}. Our new calculus is more expressive than similar existing systems, with destination passing designed to be as flexible as possible. This is achieved through a modal type system combining linear types with a system of ages to manage scopes, in order to make destination-passing safe. Type safety of our core calculus was proved formally with the Coq proof assistant. Then, we see how this core calculus can be adapted into an existing pure functional language, Haskell, whose type system is less powerful than our custom theoretical one. Retaining safety comes at the cost of removing some flexibility in the handling of destinations. We later refine the implementation to recover much of this flexibility, at the cost of increased user complexity. The prototype implementation in Haskell shows encouraging results for adopting destination-passing style programming when traversing or mapping over large data structures such as lists or data trees.
comment: PhD Manuscript, 148 pages. Sources: https://github.com/tweag/tbagrel-phd-manuscript/
♻ ☆ StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution
Recent advances in large language models (LLMs) have substantially enhanced automated code generation across a wide range of programming languages. Nonetheless, verifying the correctness and executability of LLM-generated code remains a significant challenge, as traditional methods rely on language-specific compilers and environment-dependent runtimes. To overcome these limitations, we introduce StackPilot, an LLM-native, multi-agent framework designed for language-agnostic code verification and execution, which operates independently of conventional toolchains. StackPilot offers three principal innovations: (1) a Function-as-Agents paradigm, in which each function is modeled as an autonomous agent capable of fine-grained reasoning and collaborative verification; (2) an LLM-as-Executor strategy, which enables scalable verification via stack-based scheduling; and (3) a novel snapshot mechanism that preserves complete execution contexts, facilitating deterministic and lossless context switching during verification. Empirical evaluations demonstrate that StackPilot achieves framework reliability rates between 89% and 97%, substantially outperforming baseline approaches. These results indicate that StackPilot can reliably verify and execute a significantly larger proportion of LLM-generated code across diverse programming tasks compared to existing methods.
comment: This method needs to be reconsidered and there is something wrong with experiment
♻ ☆ Forward Symbolic Execution for Trustworthy Automation of Binary Code Verification
Control flow in unstructured programs can be complex and dynamic, which makes static analysis difficult. Yet, automated reasoning about unstructured control flow is important when certifying properties of binary (machine) code in trustworthy systems, e.g., cryptographic routines. We present a theory of forward symbolic execution for unstructured programs suitable for use in theorem provers that enables automated verification of both functional and non-functional program properties. The theory's foundation is a set of inference rules where each member corresponds to an operation in a symbolic execution engine. The rules are designed to give control over the tradeoff between the preservation of precision and introduction of overapproximation. We instantiate our theory for BIR, a previously proposed intermediate language for binary analysis. We demonstrate how symbolic executors can be constructed for BIR with common optimizations such as pruning of infeasible symbolic states. We implemented our theory in the HOL4 theorem prover using the HolBA binary analysis library, obtaining machine-checked proofs of soundness of symbolic execution for BIR. We practically evaluated two applications of our theory: verification of functional properties of RISC-V binaries and verification of execution time bounds of programs running on the ARM Cortex-M0 processor. The evaluation shows that such verification can be automated with moderate overhead on medium-sized programs.
Data Structures and Algorithms 15
☆ FPT Approximations for Connected Maximum Coverage
We revisit connectivity-constrained coverage through a unifying model, Partial Connected Red-Blue Dominating Set. Given a red-blue bipartite graph $G$ and an auxiliary connectivity graph $G_{conn}$ on red vertices, and integers $k, t$, the task is to find a $k$-sized subset of red vertices that dominates at least $t$ blue vertices, and that induces a connected subgraph in $G_{conn}$. This formulation captures connected variants of Max Coverage, Partial Dominating Set, and Partial Vertex Cover studied in prior literature. After identifying (parameterized) inapproximability results inherited from known problems, we first show that the problem is fixed-parameter tractable by $t$. Furthermore, when the bipartite graph excludes $K_{d,d}$ as a subgraph, we design (resp. efficient) parameterized approximation schemes for approximating $t$ (resp. $k$). Notably, these FPT approximations do not impose any restrictions on $G_{conn}$. Together, these results chart the boundary between hardness and FPT-approximability for connectivity-constrained coverage.
comment: Full version of ITCS 2026 paper. Abstract shortened due to character limit
☆ How Hard Is It to Rig a Tournament When Few Players Can Beat or Be Beaten by the Favorite? AAAI 2026
In knockout tournaments, players compete in successive rounds, with losers eliminated and winners advancing until a single champion remains. Given a tournament digraph $D$, which encodes the outcomes of all possible matches, and a designated player $v^* \in V(D)$, the \textsc{Tournament Fixing} problem (TFP) asks whether the tournament can be scheduled in a way that guarantees $v^*$ emerges as the winner. TFP is known to be NP-hard, but is fixed-parameter tractable (FPT) when parameterized by structural measures such as the feedback arc set (fas) or feedback vertex set (fvs) number of the tournament digraph. In this paper, we introduce and study two new structural parameters: the number of players who can defeat $v^*$ (i.e., the in-degree of $v^*$, denoted by $k$) and the number of players that $v^*$ can defeat (i.e., the out-degree of $v^*$, denoted by $\ell$). A natural question is that: can TFP be efficiently solved when $k$ or $\ell$ is small? We answer this question affirmatively by showing that TFP is FPT when parameterized by either the in-degree or out-degree of $v^*$. Our algorithm for the in-degree parameterization is particularly involved and technically intricate. Notably, the in-degree $k$ can remain small even when other structural parameters, such as fas or fvs, are large. Hence, our results offer a new perspective and significantly broaden the parameterized algorithmic understanding of the \textsc{Tournament Fixing} problem.
comment: Accepted by AAAI 2026
☆ Lower Bounds for Dominating Set in Ball Graphs and for Weighted Dominating Set in Unit-Ball Graphs
Recently it was shown that many classic graph problems -- Independent Set, Dominating Set, Hamiltonian Cycle, and more -- can be solved in subexponential time on unit-ball graphs. More precisely, these problems can be solved in $2^{O(n^{1-1/d})}$ time on unit-ball graphs in $\mathbb R^d$, which is tight under ETH. The result can be generalized to intersection graphs of similarly-sized fat objects. For Independent Set the same running time can be achieved for non-similarly-sized fat objects, and for the weighted version of the problem. We show that such generalizations most likely are not possible for Dominating Set: assuming ETH, we prove that - there is no algorithm with running time $2^{o(n)}$ for Dominating Set on (non-unit) ball graphs in $\mathbb R^3$; - there is no algorithm with running time $2^{o(n)}$ for Weighted Dominating Set on unit-ball graphs in $\mathbb R^3$; - there is no algorithm with running time $2^{o(n)}$ for Dominating Set, Connected Dominating Set, or Steiner Tree on intersections graphs of arbitrary convex (but non-constant-complexity) objects in the plane.
comment: 2020 article, uploaded for free access and archival purposes
☆ Protrusion Decompositions Revisited: Uniform Lossy Kernels for Reducing Treewidth and Linear Kernels for Hitting Disconnected Minors
Let F be a finite family of graphs. In the F-Deletion problem, one is given a graph G and an integer k, and the goal is to find k vertices whose deletion results in a graph with no minor from the family F. This may be regarded as a far-reaching generalization of Vertex Cover and Feedback vertex Set. In their seminal work, Fomin, Lokshtanov, Misra & Saurabh [FOCS 2012] gave a polynomial kernel for this problem when the family F contains a planar graph. As the size of their kernel is g(F) * k^{f(F)}, a natural follow-up question was whether the dependence on F in the exponent of k can be avoided. The answer turned out to be negative: Giannapoulou, Jansen, Lokshtanov & Saurabh [TALG 2017] proved that this is already inevitable for the special case of the Treewidth-d-Deletion problem. In this work, we show that this non-uniformity can be avoided at the expense of a small loss. First, we present a simple 2-approximate kernelization algorithm for Treewidth-d-Deletion with kernel size g(d) * k^5. Next, we show that the approximation factor can be made arbitrarily close to 1, if we settle for a kernelization protocol with O(1) calls to an oracle that solves instances of size bounded by a uniform polynomial in k. We also obtain linear kernels on sparse graph classes when F contains a planar graph, whereas the previously known theorems required all graphs in F to be connected. Specifically, we generalize the kernelization algorithm by Kim, Langer, Paul, Reidl, Rossmanith, Sau & Sikdar [TALG 2015] on graph classes that exclude a topological minor.
comment: To appear at STACS 2026. Abstract shortened to fit into the word limit
☆ Kantorovich Distance via Spanning Trees: Properties and Algorithms
We study optimal transport between probability measures supported on the same finite metric space, where the ground cost is a distance induced by a weighted connected graph. Building on recent work showing that the resulting Kantorovich distance can be expressed as a minimization problem over the set of spanning trees of this underlying graph, we investigate the implications of this reformulation on the construction of an optimal transport plan and a dual potential based on the solution of such an optimization problem. In this setting, we derive an explicit formula for the Kantorovich potential in terms of the imbalanced cumulative mass (a generalization of the cumulative distribution in R) along an optimal spanning tree solving such a minimization problem, under a weak non-degeneracy condition on the pair of measures that guarantees the uniqueness of a dual potential. Our second contribution establishes the existence of an optimal transport plan that can be computed efficiently by a dynamic programming procedure once an optimal spanning tree is known. Finally, we propose a stochastic algorithm based on simulated annealing on the space of spanning trees to compute such an optimal spanning tree. Numerical experiments illustrate the theoretical results and demonstrate the practical relevance of the proposed approach for optimal transport on finite metric spaces.
comment: 37 pages, 43 figures
☆ Symbolic Functional Decomposition: A Reconfiguration Approach
Functional decomposition is the process of breaking down a function $f$ into a composition $f=g(f_1,\dots,f_k)$ of simpler functions $f_1,\dots,f_k$ belonging to some class $\mathcal{F}$. This fundamental notion can be used to model applications arising in a wide variety of contexts, ranging from machine learning to formal language theory. In this work, we study functional decomposition by leveraging on the notion of functional reconfiguration. In this setting, constraints are imposed not only on the factor functions $f_1,\dots,f_k$ but also on the intermediate functions arising during the composition process. We introduce a symbolic framework to address functional reconfiguration and decomposition problems. In our framework, functions arising during the reconfiguration process are represented symbolically, using ordered binary decision diagrams (OBDDs). The function $g$ used to specify the reconfiguration process is represented by a Boolean circuit $C$. Finally, the function class $\mathcal{F}$ is represented by a second-order finite automaton $\mathcal{A}$. Our main result states that functional reconfiguration, and hence functional decomposition, can be solved in fixed-parameter linear time when parameterized by the width of the input OBDD, by structural parameters associated with the reconfiguration circuit $C$, and by the size of the second-order finite automaton $\mathcal{A}$.
☆ Derandomizing Matrix Concentration Inequalities from Free Probability
Recently, sharp matrix concentration inequalities~\cite{BBvH23,BvH24} were developed using the theory of free probability. In this work, we design polynomial time deterministic algorithms to construct outcomes that satisfy the guarantees of these inequalities. As direct consequences, we obtain polynomial time deterministic algorithms for the matrix Spencer problem~\cite{BJM23} and for constructing near-Ramanujan graphs. Our proofs show that the concepts and techniques in free probability are useful not only for mathematical analyses but also for efficient computations.
comment: 105 pages
☆ Delaunay Triangulations with Predictions
We investigate algorithms with predictions in computational geometry, specifically focusing on the basic problem of computing 2D Delaunay triangulations. Given a set $P$ of $n$ points in the plane and a triangulation $G$ that serves as a "prediction" of the Delaunay triangulation, we would like to use $G$ to compute the correct Delaunay triangulation $\textit{DT}(P)$ more quickly when $G$ is "close" to $\textit{DT}(P)$. We obtain a variety of results of this type, under different deterministic and probabilistic settings, including the following: 1. Define $D$ to be the number of edges in $G$ that are not in $\textit{DT}(P)$. We present a deterministic algorithm to compute $\textit{DT}(P)$ from $G$ in $O(n + D\log^3 n)$ time, and a randomized algorithm in $O(n+D\log n)$ expected time, the latter of which is optimal in terms of $D$. 2. Let $R$ be a random subset of the edges of $\textit{DT}(P)$, where each edge is chosen independently with probability $ρ$. Suppose $G$ is any triangulation of $P$ that contains $R$. We present an algorithm to compute $\textit{DT}(P)$ from $G$ in $O(n\log\log n + n\log(1/ρ))$ time with high probability. 3. Define $d_{\mbox{\scriptsize\rm vio}}$ to be the maximum number of points of $P$ strictly inside the circumcircle of a triangle in $G$ (the number is 0 if $G$ is equal to $\textit{DT}(P)$). We present a deterministic algorithm to compute $\textit{DT}(P)$ from $G$ in $O(n\log^*n + n\log d_{\mbox{\scriptsize\rm vio}})$ time. We also obtain results in similar settings for related problems such as 2D Euclidean minimum spanning trees, and hope that our work will open up a fruitful line of future research.
comment: 29 pages, 6 figures, ITCS 2026
☆ An Almost-Optimal Upper Bound on the Push Number of the Torus Puzzle
We study the Torus Puzzle, a solitaire game in which the elements of an input $m \times n$ matrix need to be rearranged into a target configuration via a sequence of unit rotations (i.e., circular shifts) of rows and/or columns. Amano et al.\ proposed a more permissive variant of the above puzzle, where each row and column rotation can shift the involved elements by any amount of positions. The number of rotations needed to solve the puzzle in the original and in the permissive variants of the puzzle are respectively known as the \emph{push number} and the \emph{drag number}, where the latter is always smaller than or equal to the former and admits an existential lower bound of $Ω(mn)$. While this lower bound is matched by an $O(mn)$ upper bound, the push number is not so well understood. Indeed, to the best of our knowledge, only an $O(mn \cdot \max\{ m, n \})$ upper bound is currently known. In this paper, we provide an algorithm that solves the Torus Puzzle using $O(mn \cdot \log \max \{m, n\})$ unit rotations in a model that is more restricted than that of the original puzzle. This implies a corresponding upper bound on the push number and reduces the gap between the known upper and lower bounds from $Θ(\max\{m,n\})$ to $Θ(\log \max\{m, n\})$.
comment: 22 pages, 8 figures
☆ Continuous Fairness On Data Streams
We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.
♻ ☆ The Conflict Graph Design: Estimating Causal Effects under Arbitrary Neighborhood Interference
A fundamental problem in network experiments is selecting an appropriate experimental design in order to precisely estimate a given causal effect of interest. In this work, we propose the Conflict Graph Design, a general approach for constructing experiment designs under network interference with the goal of precisely estimating a pre-specified causal effect. A central aspect of our approach is the notion of a conflict graph, which captures the fundamental unobservability associated with the causal effect and the underlying network. In order to estimate effects, we propose a modified Horvitz--Thompson estimator. We show that its variance under the Conflict Graph Design is bounded as $O(λ(H) / n )$, where $λ(H)$ is the largest eigenvalue of the adjacency matrix of the conflict graph. These rates depend on both the underlying network and the particular causal effect under investigation. Not only does this yield the best known rates of estimation for several well-studied causal effects (e.g. the global and direct effects) but it also provides new methods for effects which have received less attention from the perspective of experiment design (e.g. spill-over effects). Finally, we construct conservative variance estimators which facilitate asymptotically valid confidence intervals for the causal effect of interest.
♻ ☆ A refined graph container lemma and applications to the hard-core model on bipartite expanders
We establish a refined version of a graph container lemma due to Galvin and discuss several applications related to the hard-core model on bipartite expander graphs. Given a graph $G$ and $λ>0$, the hard-core model on $G$ at activity $λ$ is the probability distribution $μ_{G,λ}$ on independent sets in $G$ given by $μ_{G,λ}(I)\propto λ^{|I|}$. As one of our main applications, we show that the hard-core model at activity $λ$ on the hypercube $Q_d$ exhibits a `structured phase' for $λ= Ω( \log^2 d/d^{1/2})$ in the following sense: in a typical sample from $μ_{Q_d,λ}$, most vertices are contained in one side of the bipartition of $Q_d$. This improves upon a result of Galvin which establishes the same for $λ=Ω(\log d/ d^{1/3})$. As another application, we establish a fully polynomial-time approximation scheme (FPTAS) for the hard-core model on a $d$-regular bipartite $α$-expander, with $α>0$ fixed, when $λ= Ω( \log^2 d/d^{1/2})$. This improves upon the bound $λ=Ω(\log d/ d^{1/4})$ due to the first author, Perkins and Potukuchi. We discuss similar improvements to results of Galvin-Tetali, Balogh-Garcia-Li and Kronenberg-Spinka.
♻ ☆ Optimal Deterministic Rendezvous in Labeled Lines
In a rendezvous task, some mobile agents dispersed in a network have to gather at an arbitrary common site. We consider the rendezvous problem on the infinite labeled line, with $2$ agents, without communication, and a synchronous notion of time. Each node on the line is labeled with a unique positive integer. The initial distance between the agents is denoted by $D$. Time is divided into rounds and measured from the moment an agent first wakes up. We denote by $τ$ the delay between the two agents' wake up times. If awake in a given round $T$, an agent at a node $v$ has three options: stay at the node $v$, take port $0$, or take port $1$. If it decides to stay, the agent will still be at node $v$ in round $T+1$. Otherwise, it will be at one of the two neighbors of $v$ on the infinite line, depending on the port it chose. The agents achieve rendezvous in $T$ rounds if they are at the same node in round $T$. We aim for a deterministic algorithm for this problem. The problem was recently considered by Miller and Pelc [Distributed Computing 2025]. With $\ell_{\max}$ the largest label of the two starting nodes, they showed that no algorithm can guarantee rendezvous in $o(D \log^* \ell_{\max})$ rounds. The lower bound follows from a connection with the LOCAL model of distributed computing, and holds even if the agents are guaranteed simultaneous wake-up ($τ= 0$) and are told their initial distance $D$. Miller and Pelc also gave an algorithm of optimal matching complexity $O(D \log^* \ell_{\max})$ when the agents know $D$, but only obtained the higher bound of $O(D^2 (\log^* \ell_{\max})^3)$ when $D$ is unknown to the agents. We improve this complexity to a tight $O(D \log^* \ell_{\max})$. In fact, our algorithm achieves rendezvous in $O(D \log^* \ell_{\min})$ rounds, where $\ell_{\min}$ is the smallest label within distance $O(D)$ of the two starting positions.
comment: 24 pages, 3 figures. To appear in the proceedings of STACS 2026
♻ ☆ The Mixed Birth-death/death-Birth Moran Process
We study evolutionary dynamics on graphs in which each step consists of one birth and one death, also known as the Moran processes. There are two types of individuals: residents with fitness $1$ and mutants with fitness $r$. Two standard update rules are used in the literature. In Birth-death (Bd), a vertex is chosen to reproduce proportional to fitness, and one of its neighbors is selected uniformly at random to be replaced by the offspring. In death-Birth (dB), a vertex is chosen uniformly to die, and then one of its neighbors is chosen, proportional to fitness, to place an offspring into the vacancy. We formalize and study a unified model, the $λ$-mixed Moran process, in which each step is independently a Bd step with probability $λ\in [0,1]$ and a dB step otherwise. We analyze this mixed process for undirected, connected graphs. As an interesting special case, we show at $λ=1/2$, for any graph that the fixation probability when $r=1$ with a single mutant initially on the graph is exactly $1/n$, and also at $λ=1/2$ that the absorption time for any $r$ is $O_r(n^4)$. We also show results for graphs that are "almost regular," in a manner defined in the paper. We use this to show that for suitable random graphs from $G \sim G(n,p)$ and fixed $r>1$, with high probability over the choice of graph, the absorption time is $O_r(n^4)$, the fixation probability is $Ω_r(n^{-2})$, and we can approximate the fixation probability in polynomial time. Another special case is when the graph has only two distinct degree values $\{d_1, d_2\}$ with $d_1 \leq d_2$. For those graphs, we give exact formulas for fixation probabilities when $r = 1$ and any $λ$, and establish an absorption time of $O_r(n^4 α^4)$ for all $λ$, where $α= d_2 / d_1$. We also provide explicit formulas for the star and cycle under any $r$ or $λ$.
comment: Extended Abstract at ITCS 2026
♻ ☆ Optimal Extended Formulations from Optimal Dynamic Programming Algorithms
Vertex Subset Problems (VSPs) are a class of combinatorial optimization problems on graphs where the goal is to find a subset of vertices satisfying a predefined condition. Two prominent approaches for solving VSPs are dynamic programming over tree-like structures, such as tree decompositions or clique decompositions, and linear programming. In this work, we establish a sharp connection between both approaches by showing that if a vertex-subset problem $Π$ admits a solution-preserving dynamic programming algorithm that produces tables of size at most $α(k,n)$ when processing a tree decomposition of width at most $k$ of an $n$-vertex graph $G$, then the polytope $P_Π(G)$ defined as the convex-hull of solutions of $Π$ in $G$ has extension complexity at most $O(α(k,n)\cdot n)$. Additionally, this upper bound is optimal under the exponential time hypothesis (ETH). On the one hand, our results imply that ETH-optimal solution-preserving dynamic programming algorithms for combinatorial problems yield optimal-size parameterized extended formulations for the solution polytopes associated with instances of these problems. On the other hand, unconditional lower bounds obtained in the realm of the theory of extended formulations yield unconditional lower bounds on the table complexity of solution-preserving dynamic programming algorithms.
Graphics 8
☆ Efficient Maintenance of Leiden Communities in Large Dynamic Graphs
As a well-known community detection algorithm, Leiden has been widely used in various scenarios such as large language model generation (e.g., Graph-RAG), anomaly detection, and biological analysis. In these scenarios, the graphs are often large and dynamic, where vertices and edges are inserted and deleted frequently, so it is costly to obtain the updated communities by Leiden from scratch when the graph has changed. Recently, one work has attempted to study how to maintain Leiden communities in the dynamic graph, but it lacks a detailed theoretical analysis, and its algorithms are inefficient for large graphs. To address these issues, in this paper, we first theoretically show that the existing algorithms are relatively unbounded via the boundedness analysis (a powerful tool for analyzing incremental algorithms on dynamic graphs), and also analyze the memberships of vertices in communities when the graph changes. Based on theoretical analysis, we develop a novel efficient maintenance algorithm, called Hierarchical Incremental Tree Leiden (HIT-Leiden), which effectively reduces the range of affected vertices by maintaining the connected components and hierarchical community structures. Comprehensive experiments in various datasets demonstrate the superior performance of HIT-Leiden. In particular, it achieves speedups of up to five orders of magnitude over existing methods.
☆ Deep Learning Based Facial Retargeting Using Local Patches
In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While the retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.
comment: Eurographics 25
☆ Geo-NVS-w: Geometry-Aware Novel View Synthesis In-the-Wild with an SDF Renderer ICCV 2025
We introduce Geo-NVS-w, a geometry-aware framework for high-fidelity novel view synthesis from unstructured, in-the-wild image collections. While existing in-the-wild methods already excel at novel view synthesis, they often lack geometric grounding on complex surfaces, sometimes producing results that contain inconsistencies. Geo-NVS-w addresses this limitation by leveraging an underlying geometric representation based on a Signed Distance Function (SDF) to guide the rendering process. This is complemented by a novel Geometry-Preservation Loss which ensures that fine structural details are preserved. Our framework achieves competitive rendering performance, while demonstrating a 4-5x reduction reduction in energy consumption compared to similar methods. We demonstrate that Geo-NVS-w is a robust method for in-the-wild NVS, yielding photorealistic results with sharp, geometrically coherent details.
comment: Presented at the ICCV 2025 Workshop on Large Scale Cross Device Localization
☆ Data-Induced Groupings and How To Find Them
Making sense of a visualization requires the reader to consider both the visualization design and the underlying data values. Existing work in the visualization community has largely considered affordances driven by visualization design elements, such as color or chart type, but how visual design interacts with data values to impact interpretation and reasoning has remained under-explored. Dot plots and bar graphs are commonly used to help users identify groups of points that form trends and clusters, but are liable to manifest groupings that are artifacts of spatial arrangement rather than inherent patterns in the data itself. These ``Data-induced Groups'' can drive suboptimal data comparisons and potentially lead the user to incorrect conclusions. We conduct two user studies using dot plots as a case study to understand the prevalence of data-induced groupings. We find that users rely on data-induced groupings in both conditions despite the fact that trend-based groupings are irrelevant in nominal data. Based on the study results, we build a model to predict whether users are likely to perceive a given set of dot plot points as a group. We discuss two use cases illustrating how the model can assist visualization designers by both diagnosing potential user-perceived groupings in dot plots and offering redesigns that better accentuate desired groupings through data rearrangement.
☆ Instruction-Driven 3D Facial Expression Generation and Transition
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications More information about our project can be found at https://vohoanganh.github.io/tg3dfet/
♻ ☆ FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training
Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20\% of the hierarchy pruned. Our code will be available soon.
comment: 14 pages
♻ ☆ Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting NeurIPS 2025
Dynamic 4D Gaussian Splatting (4DGS) effectively extends the high-speed rendering capabilities of 3D Gaussian Splatting (3DGS) to represent volumetric videos. However, the large number of Gaussians, substantial temporal redundancies, and especially the absence of an entropy-aware compression framework result in large storage requirements. Consequently, this poses significant challenges for practical deployment, efficient edge-device processing, and data transmission. In this paper, we introduce a novel end-to-end RD-optimized compression framework tailored for 4DGS, aiming to enable flexible, high-fidelity rendering across varied computational platforms. Leveraging Fully Explicit Dynamic Gaussian Splatting (Ex4DGS), one of the state-of-the-art 4DGS methods, as our baseline, we start from the existing 3DGS compression methods for compatibility while effectively addressing additional challenges introduced by the temporal axis. In particular, instead of storing motion trajectories independently per point, we employ a wavelet transform to reflect the real-world smoothness prior, significantly enhancing storage efficiency. This approach yields significantly improved compression ratios and provides a user-controlled balance between compression efficiency and rendering quality. Extensive experiments demonstrate the effectiveness of our method, achieving up to 91$\times$ compression compared to the original Ex4DGS model while maintaining high visual fidelity. These results highlight the applicability of our framework for real-time dynamic scene rendering in diverse scenarios, from resource-constrained edge devices to high-performance environments. The source code is available at https://github.com/HyeongminLEE/RD4DGS.
comment: 24 pages, 10 figures, NeurIPS 2025
♻ ☆ Aplicación Gráfica para el estudio de un Modelo de Celda Electrolítica usando Técnicas de Visualización de Campos Vectoriales
The use of floating bipolar electrodes in copper electro-winning cells represents an emerging technology that promises economic and operational impacts. This thesis presents EWCellCAD, a computational tool designed for the simulation and analysis of these electrochemical systems. Based on the generalization and optimization of an existing 2D finite difference model for calculating electrical variables in rectangular cells, EWCellCAD implements a new 3D model capable of processing complex geometries, not necessarily rectangular, which also accelerates calculations by several orders of magnitude. At the same time, a new analytical method for estimating potentials in floating electrodes is introduced, overcoming the inaccuracies of previous heuristic approaches. The analysis of the results is supported by an interactive visualization technique of three-dimensional vector fields as flow lines.
comment: BSc Thesis in Electronic Engineering (part of the research project FONDECYT 1970955), Universidad de Concepción, 2000, 105 pages, 22 figures, in Spanish. Related publication: arXiv:1001.3974 [cs.GR]. Metadata-only update: Author name standardized (maternal surname removed; paternal surname as sole last name). Title orthography corrected with TeX accents. Abstract refined
Operating Systems 2
☆ Peformance Isolation for Inference Processes in Edge GPU Systems
This work analyzes the main isolation mechanisms available in modern NVIDIA GPUs: MPS, MIG, and the recent Green Contexts, to ensure predictable inference time in safety-critical applications using deep learning models. The experimental methodology includes performance tests, evaluation of partitioning impact, and analysis of temporal isolation between processes, considering both the NVIDIA A100 and Jetson Orin platforms. It is observed that MIG provides a high level of isolation. At the same time, Green Contexts represent a promising alternative for edge devices by enabling fine-grained SM allocation with low overhead, albeit without memory isolation. The study also identifies current limitations and outlines potential research directions to improve temporal predictability in shared GPUs.
♻ ☆ CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions (Extended Version) ASPLOS
Hardware event counters offer the potential to reveal not only performance bottlenecks but also detailed microarchitectural behavior. In practice, this promise is undermined by their vague specifications, opaque designs, and multiplexing noise, making event counter data hard to interpret. We introduce CounterPoint, a framework that tests user-specified microarchitectural models - expressed as $μ$path Decision Diagrams - for consistency with performance counter data. When mismatches occur, CounterPoint pinpoints plausible microarchitectural features that could explain them, using multi-dimensional counter confidence regions to mitigate multiplexing noise. We apply CounterPoint to the Haswell Memory Management Unit as a case study, shedding light on multiple undocumented and underdocumented microarchitectural behaviors. These include a load-store queue-side TLB prefetcher, merging page table walkers, abortable page table walks, and more. Overall, CounterPoint helps experts reconcile noisy hardware performance counter measurements with their mental model of the microarchitecture - uncovering subtle, previously hidden hardware features along the way.
comment: This is an extended version of a paper which has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems conference (ASPLOS, March 2026). 20 pages, 20 figures, 8 tables
Hardware Architecture 4
☆ GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation
RTL design often relies heavily on ad-hoc testbench creation early in the design cycle. While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that decomposes test plan generation from testbench execution. Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates on generating stimuli that pass golden RTL designs. To improve LLM generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations. Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide rich learning signals. Training on this curated dataset, our 7B parameter model achieves a 33.3% golden test pass rate and a 13.9% mutation detection rate, representing a 17.6% absolute improvement over baseline and outperforming much larger general-purpose models. These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows.
☆ VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches often underutilize circuit schematics and lack the explainability required for industry adoption. To tackle these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-starting from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance, achieving a 100% success rate in optimizing an amplifier with a complementary input and a class-AB output stage, while maintaining total runtime under 43 minutes across all experiments.
comment: 8 pages, 5 figures
♻ ☆ SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables an efficient on-device MoE deployment.
♻ ☆ CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions (Extended Version) ASPLOS
Hardware event counters offer the potential to reveal not only performance bottlenecks but also detailed microarchitectural behavior. In practice, this promise is undermined by their vague specifications, opaque designs, and multiplexing noise, making event counter data hard to interpret. We introduce CounterPoint, a framework that tests user-specified microarchitectural models - expressed as $μ$path Decision Diagrams - for consistency with performance counter data. When mismatches occur, CounterPoint pinpoints plausible microarchitectural features that could explain them, using multi-dimensional counter confidence regions to mitigate multiplexing noise. We apply CounterPoint to the Haswell Memory Management Unit as a case study, shedding light on multiple undocumented and underdocumented microarchitectural behaviors. These include a load-store queue-side TLB prefetcher, merging page table walkers, abortable page table walks, and more. Overall, CounterPoint helps experts reconcile noisy hardware performance counter measurements with their mental model of the microarchitecture - uncovering subtle, previously hidden hardware features along the way.
comment: This is an extended version of a paper which has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems conference (ASPLOS, March 2026). 20 pages, 20 figures, 8 tables
Data Structures and Algorithms 9
☆ Dynamic $(Δ+ 1)$ Vertex Coloring
Several recent results from dynamic and sublinear graph coloring are surveyed. This problem is widely studied and has motivating applications like network topology control, constraint satisfaction, and real-time resource scheduling. Graph coloring algorithms are called colorers. In §1 are defined graph coloring, the dynamic model, and the notion of performance of graph algorithms in the dynamic model. In particular $(Δ+ 1)$-coloring, sublinear performance, and oblivious and adaptive adversaries are noted and motivated. In §2 the pair of approximately optimal dynamic vertex colorers given in arXiv:1708.09080 are summarized as a warmup for the $(Δ+ 1)$-colorers. In §3 the state of the art in dynamic $(Δ+ 1)$-coloring is presented. This section comprises a pair of papers (arXiv:1711.04355 and arXiv:1910.02063) that improve dynamic $(Δ+ 1)$-coloring from the naive algorithm with $O(Δ)$ expected amortized update time to $O(\log Δ)$, then to $O(1)$ with high probability. In §4 the results in arXiv:2411.04418, which gives a sublinear algorithm for $(Δ+ 1)$-coloring that generalizes oblivious adversaries to adaptive adversaries, are presented.
comment: 16 pages, 5 figures
☆ The Secretary Problem with Predictions and a Chosen Order
We study a learning-augmented variant of the secretary problem, recently introduced by Fujii and Yoshida (2023), in which the decision-maker has access to machine-learned predictions of candidate values. The central challenge is to balance consistency and robustness: when predictions are accurate, the algorithm should select a near-optimal secretary, while under inaccurate predictions it should still guarantee a bounded competitive ratio. We consider both the classical Random Order Secretary Problem (ROSP), where candidates arrive in a uniformly random order, and a more natural learning-augmented model in which the decision-maker may choose the arrival order based on predicted values. We call this model the Chosen Order Secretary Problem (COSP), capturing scenarios such as interview schedules set in advance. We propose a new randomized algorithm applicable to both ROSP and COSP. Our method switches from fully trusting predictions to a threshold-based rule once a large prediction deviation is detected. Let $ε\in [0,1]$ denote the maximum multiplicative prediction error. For ROSP, our algorithm achieves a competitive ratio of $\max\{0.221, (1-ε)/(1+ε)\}$, improving upon the prior bound of $\max\{0.215, (1-ε)/(1+ε)\}$. For COSP, we achieve $\max\{0.262, (1-ε)/(1+ε)\}$, surpassing the $0.25$ worst-case bound for prior approaches and moving closer to the classical secretary benchmark of $1/e \approx 0.368$. These results highlight the benefit of combining predictions with arrival-order control in online decision-making.
comment: Accepted to the International Conference on Innovations in Theoretical Computer Science (ITCS 2026)
♻ ☆ The Power of Iterative Filtering for Supervised Learning with (Heavy) Contamination
Inspired by recent work on learning with distribution shift, we give a general outlier removal algorithm called iterative polynomial filtering and show a number of striking applications for supervised learning with contamination: (1) We show that any function class that can be approximated by low-degree polynomials with respect to a hypercontractive distribution can be efficiently learned under bounded contamination (also known as nasty noise). This is a surprising resolution to a longstanding gap between the complexity of agnostic learning and learning with contamination, as it was widely believed that low-degree approximators only implied tolerance to label noise. In particular, it implies the first efficient algorithm for learning halfspaces with $η$-bounded contamination up to error $2η+ε$ with respect to the Gaussian distribution. (2) For any function class that admits the (stronger) notion of sandwiching approximators, we obtain near-optimal learning guarantees even with respect to heavy additive contamination, where far more than $1/2$ of the training set may be added adversarially. Prior related work held only for regression and in a list-decodable setting. (3) We obtain the first efficient algorithms for tolerant testable learning of functions of halfspaces with respect to any fixed log-concave distribution. Even the non-tolerant case for a single halfspace in this setting had remained open. These results significantly advance our understanding of efficient supervised learning under contamination, a setting that has been much less studied than its unsupervised counterpart.
comment: 36 pages
♻ ☆ Certificates in P and Subquadratic-Time Computation of Radius, Diameter, and all Eccentricities in Graphs
In the context of fine-grained complexity, we investigate the notion of certificate enabling faster polynomial-time algorithms. We specifically target radius (minimum eccentricity), diameter (maximum eccentricity), and all-eccentricity computations for which quadratic-time lower bounds are known under plausible conjectures. In each case, we introduce a notion of certificate as a specific set of nodes from which appropriate bounds on all eccentricities can be derived in subquadratic time when this set has sublinear size. The existence of small certificates is a barrier against SETH-based lower bounds for these problems. We indeed prove that for graph classes with small certificates, there exist randomized subquadratic-time algorithms for computing the radius, the diameter, and all eccentricities respectively. Moreover, these notions of certificates are tightly related to algorithms probing the graph through one-to-all distance queries and allow to explain the efficiency of practical radius and diameter algorithms from the literature. Our formalization enables a novel primal-dual analysis of a classical approach for diameter computation that leads to algorithms for radius, diameter and all eccentricities with theoretical guarantees with respect to certain graph parameters. This is complemented by experimental results on various types of real-world graphs showing that these parameters appear to be low in practice. Finally, we obtain refined results for several graph classes.
comment: Accept{é} {à} SODA 2025
♻ ☆ Deterministic approximate counting of colorings with fewer than $2Δ$ colors via absence of zeros
Let $Δ,q\geq 3$ be integers. We prove that there exists $η\geq 0.002$ such that if $q\geq (2-η)Δ$, then there exists an open set $\mathcal{U}\subset \mathbb{C}$ that contains the interval $[0,1]$ such that for each $w\in \mathcal{U}$ and any graph $G=(V,E)$ of maximum degree at most $Δ$, the partition function of the anti-ferromagnetic $q$-state Potts model evaluated at $w$ does not vanish. This provides a (modest) improvement on a result of Liu, Sinclair, and Srivastava, and breaks the $q=2Δ$-barrier for this problem. As a direct consequence we obtain via Barvinok's interpolation method a deterministic polynomial time algorithm to approximate the number of proper $q$-colorings of graphs of maximum degree at most $Δ$, provided $q\geq (2-η)Δ$.
comment: 41 pages. This is the TheoretiCS journal version
♻ ☆ ALNS for Tugboat Scheduling in Inland Waterway
This paper focuses on the barges shipping problem, also known as the tugboats scheduling problem, within the context of a scenario where a single tugboat has the capacity to tow multiple barges and conduct multiple trips in a drop-and-pull mode during a daily work shift. The problem is mathematically formalized as mixed-integer programming models. To tackle real-world-sized problem instances, an adaptive large neighborhood search (ALNS) algorithm integrated with a decoding mathematical model is proposed. When applied to large-scale instances, the ALNS algorithm showcases performance superiority over the strengthened mathematical model.
comment: It's written so poorly, my bad
♻ ☆ Tree Embedding in High Dimensions: Dynamic and Massively Parallel
Tree embedding has been a fundamental method in algorithm design with wide applications. We focus on the efficiency of building tree embedding in various computational settings under high-dimensional Euclidean $\mathbb{R}^d$. We devise a new tree embedding construction framework that operates on an arbitrary metric decomposition with bounded diameter, offering a tradeoff between distortion and the locality of its algorithmic steps. This framework works for general metric spaces and may be of independent interest beyond the Euclidean setting. Using this framework, we obtain a dynamic algorithm that maintains an $O_ε(\log n)$-distortion tree embedding with update time $\tilde O(n^ε+ d)$ subject to point insertions/deletions, and a massively parallel algorithm that achieves $O_ε(\log n)$-distortion in $O(1)$ rounds and total space $\tilde O(n^{1 + ε})$ (for constant $ε\in (0, 1)$). These new tree embedding results allow for a wide range of applications. Notably, under a similar performance guarantee as in our tree embedding algorithms, i.e., $\tilde O(n^ε+ d)$ update time and $O(1)$ rounds, we obtain $O_ε(\log n)$-approximate dynamic and MPC algorithms for $k$-median and earth-mover distance in $\mathbb{R}^d$.
♻ ☆ Bounds on Longest Simple Cycles in Weighted Directed Graphs via Optimum Cycle Means
The problem of finding the longest simple cycle in a directed graph is NP-hard, with critical applications in computational biology, scheduling, and network analysis. Existing approaches include exact algorithms with exponential runtimes, approximation algorithms limited to specific graph classes, and heuristics with no formal guarantees. In this paper, we exploit optimum cycle means (minimum and maximum cycle means), computable in strongly polynomial time, to derive both strict bounds and heuristic estimates for the weight and length of the longest simple cycle in general graphs. The strict bounds can prune search spaces in exact algorithms while the heuristic estimates (the arithmetic mean and geometric mean of the optimum cycle means) guarantee bounded approximation error. Crucially, a single computation of optimum cycle means yields both the bounds and the heuristic estimates. Experimental evaluation on ISCAS benchmark circuits demonstrates that, compared to true values, the strict algebraic lower bounds are loose (median 80--99% below) while the heuristic estimates are much tighter: the arithmetic mean and the geometric mean have median errors of 6--13% vs. 11--21% for symmetric (uniform) weights and 41--92% vs. 25--35% for skewed (log-normal) weights, favoring the arithmetic mean for symmetric distributions and the geometric mean for skewed distributions.
comment: 17 pages, 2 figures
♻ ☆ Subgraph Isomorphism: Prolog vs. Conventional
Subgraph Isomorphism uses a small graph as a pattern to identify within a larger graph a set of vertices that have matching edges. This paper addresses a logic program written in Prolog for a specific relatively complex graph pattern for which multiple conventional implementations (including parallel) exist. The goal is to understand the complexity differences between programming logically and programming conventionally. Discussion includes the process of converting the graph pattern into logic statements in Prolog, and the resulting characteristics as the size of the graph increased. The analysis shows that using a logic paradigm is an efficient way to attack complex graph problems.
Graphics 6
☆ R3-RECON: Radiance-Field-Free Active Reconstruction via Renderability
In active reconstruction, an embodied agent must decide where to look next to efficiently acquire views that support high-quality novel-view rendering. Recent work on active view planning for neural rendering largely derives next-best-view (NBV) criteria by backpropagating through radiance fields or estimating information entropy over 3D Gaussian primitives. While effective, these strategies tightly couple view selection to heavy, representation-specific mechanisms and fail to account for the computational and resource constraints required for lightweight online deployment. In this paper, we revisit active reconstruction from a renderability-centric perspective. We propose $\mathbb{R}^{3}$-RECON, a radiance-fields-free active reconstruction framework that induces an implicit, pose-conditioned renderability field over SE(3) from a lightweight voxel map. Our formulation aggregates per-voxel online observation statistics into a unified scalar renderability score that is cheap to update and can be queried in closed form at arbitrary candidate viewpoints in milliseconds, without requiring gradients or radiance-field training. This renderability field is strongly correlated with image-space reconstruction error, naturally guiding NBV selection. We further introduce a panoramic extension that estimates omnidirectional (360$^\circ$) view utility to accelerate candidate evaluation. In the standard indoor Replica dataset, $\mathbb{R}^{3}$-RECON achieves more uniform novel-view quality and higher 3D Gaussian splatting (3DGS) reconstruction accuracy than recent active GS baselines with matched view and time budgets.
comment: 18 pages, 11 figures
☆ SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model WACV
Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
comment: 12 pages, 14 figures, accepted in WACVW 2026
☆ Statistical Blendshape Calculation and Analysis for Graphics Applications
With the development of virtualization and AI, real-time facial avatar animation is widely used in entertainment, office, business and other fields. Against this background, blendshapes have become a common industry animation solution because of their relative simplicity and ease of interpretation. Aiming for real-time performance and low computing resource dependence, we independently developed an accurate blendshape prediction system for low-power VR applications using a standard webcam. First, blendshape feature vectors are extracted through affine transformation and segmentation. Through further transformation and regression analysis, we were able to identify models for most blendshapes with significant predictive power. Post-processing was used to further improve response stability, including smoothing filtering and nonlinear transformations to minimize error. Experiments showed the system achieved accuracy similar to ARKit 6. Our model has low sensor/hardware requirements and realtime response with a consistent, accurate and smooth visual experience.
comment: 12 figures
♻ ☆ Sketch&Patch++: Efficient Structure-Aware 3D Gaussian Representation
We observe that Gaussians exhibit distinct roles and characteristics analogous to traditional artistic techniques -- like how artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions analogous to brush strokes that add volume and depth. Based on this observation, we propose a hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which represent high-frequency, boundary-defining features, and (ii) Patch Gaussians, which cover low-frequency, smooth regions. This semantic separation naturally enables layered progressive streaming, where the compact Sketch Gaussians establish the structural skeleton before Patch Gaussians incrementally refine volumetric detail. In this work, we extend our previous method to arbitrary 3D scenes by proposing a novel hierarchical adaptive categorization framework that operates directly on the 3DGS representation. Our approach employs multi-criteria density-based clustering, combined with adaptive quality-driven refinement. This method eliminates dependency on external 3D line primitives while ensuring optimal parametric encoding effectiveness. Our comprehensive evaluation across diverse scenes, including both man-made and natural environments, demonstrates that our method achieves up to 1.74 dB improvement in PSNR, 6.7% in SSIM, and 41.4% in LPIPS at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, our method can maintain visual quality with only 0.5\% of the original model size. This structure-aware representation enables efficient storage, adaptive streaming, and rendering of high-fidelity 3D content across bandwidth-constrained networks and resource-limited devices.
♻ ☆ Paris: A Decentralized Trained Open-Weight Diffusion Model
We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure. Paris is open for research and commercial use. Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation with no gradient, parameter, or intermediate activation synchronization. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition data into semantically coherent clusters where each expert independently optimizes its subset while collectively approximating the full distribution. A lightweight transformer router dynamically selects appropriate experts at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris's decentralized training maintains generation quality while removing the dedicated GPU cluster requirement for large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.
♻ ☆ Volume Encoding Gaussians: Transfer Function-Agnostic 3D Gaussians for Volume Rendering
Visualizing the large-scale datasets output by HPC resources presents a difficult challenge, as the memory and compute power required become prohibitively expensive for end user systems. Novel view synthesis techniques can address this by producing a small, interactive model of the data, requiring only a set of training images to learn from. While these models allow accessible visualization of large data and complex scenes, they do not provide the interactions needed for scientific volumes, as they do not support interactive selection of transfer functions and lighting parameters. To address this, we introduce Volume Encoding Gaussians (VEG), a 3D Gaussian-based representation for volume visualization that supports arbitrary color and opacity mappings. Unlike prior 3D Gaussian Splatting (3DGS) methods that store color and opacity for each Gaussian, VEG decouple the visual appearance from the data representation by encoding only scalar values, enabling transfer function-agnostic rendering of 3DGS models. To ensure complete scalar field coverage, we introduce an opacity-guided training strategy, using differentiable rendering with multiple transfer functions to optimize our data representation. This allows VEG to preserve fine features across the full scalar range of a dataset while remaining independent of any specific transfer function. Across a diverse set of volume datasets, we demonstrate that our method outperforms the state-of-the-art on transfer functions unseen during training, while requiring a fraction of the memory and training time.
Programming Languages 1
☆ S4 modal sequent calculus as intermediate logic and intermediate language
In this short paper, we advocate for the idea that continuation-based intermediate languages correspond to intermediate logics. The goal of intermediate languages is to serve as a basis for compiler intermediate representations, allowing to represent expressive program transformations for optimisation and compilation, while preserving the properties that make programs compilable efficiently in the first place, such as the "stackability" of continuations. Intermediate logics are logics between intuitionistic and classical logic in terms of provability. Second-class continuations used in CPS-based intermediate languages correspond to a classical modal logic S4 with the added restriction that implications may only return modal types. This indeed corresponds to an intermediate logic, owing to the Gödel-McKinsey-Tarski theorem which states the intuitionistic nature of the modal fragment of S4. We introduce a three-kinded polarised sequent calculus for S4, together with an operational machine model that separates a heap from a stack. With this model we study a stackability property for the modal fragment of S4.
comment: 14 pages, paper accepted for PEPM 2026 in the "short paper" category