MyArxiv
Operating Systems 1
♻ ☆ EnerInfer: Energy-Aware On-Device LLM Inference
On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding speed, implicitly assuming that faster execution is always preferable. We show instead that on-device LLM inference often has exploitable configuration slack: modestly lowering NPU and memory frequencies preserves quality of experience (QoE) while substantially improving energy efficiency and reducing heat. Realizing this opportunity in production is challenging. The most energy-efficient NPU/DDR setting varies with the model, inference engine, platform, and runtime conditions, with no stable ranking across configurations. Commercial devices further lack component-level power sensing, and shell temperature evolves with request arrivals, response lengths, and thermal history. To address these challenges, we propose EnerInfer, the first on-device LLM inference framework that jointly manages energy efficiency, throughput, and thermal comfort for LLM workloads. EnerInfer replaces per-model profiling and sensor-heavy control with disaggregated, model-structure-aware prediction and ranking-driven online feedback. It predicts throughput and power for unseen LLMs across NPU/DDR frequency settings, selects QoE-satisfying efficient configurations under runtime interference, and uses lightweight limited-horizon thermal prediction to dynamically switch between energy-optimized and thermally constrained inference. Evaluations on real-world LLMs show that EnerInfer improves energy efficiency by up to 65%, 12%, and 24% on phones, a laptop, and a development board, respectively, without QoE violation.
Hardware Architecture 7
☆ Croc: Training the Next Generation Chip Designers on Domain-Specific End-to-End Open Source Silicon
The demand for domain-specific systems-on-chip (SoCs) in artificial intelligence, robotics, and automotive systems is increasing the need for engineers with hands-on expertise on very-large-scale integration (VLSI) design from architecture specification to fabricated silicon. Yet, most VLSI courses rely on restrictively licensed electronic design automation tools and process design kits (PDKs), as well as closed-source hardware designs. We present an end-to-end open-source domain-specific SoC design and fabrication flow built around Croc, a highly customizable RISC-V platform. Built from open-source SystemVerilog intellectual property blocks and integrated with an end-to-end open-source design flow in a 130nm open PDK, Croc enables tapeout projects supporting multiple domain customization options: instruction-set extensions, accelerator co-processors, and peripherals. In our first open-source course experience using Croc, 65 students completed 33 projects, 30 of which produced manufacturable layouts. 18 designs were selected as tapeout candidates, and five were fabricated. A first baseline chip has already been successfully characterized in silicon, demonstrating microcontroller-class functionality and implementation metrics comparable to those of products with similar functional complexity completed with closed-source toolchains and PDKs.
comment: 2 pages, 4 figures, accepted for poster at the IEEE Hot Chips 2026 Conference
☆ Energy-Efficient CNN Acceleration with MSDF Digit-Serial Arithmetic on FPGA CEC
This paper presents an energy-efficient hardware acceleration of the convolutional layers in the U-Net architecture for image segmentation, implemented on FPGA. While digit-serial arithmetic, particularly most-significant-digit-first (MSDF) techniques, offers a compact hardware footprint, it suffers from initial latency before producing the first output digit. This delay accumulates in cascaded operations like multiplication followed by addition, where each unit introduces its own startup overhead. To overcome this, we propose a merged multiply-add (MMA) architecture that fuses these operations into a unified pipeline. Instead of incurring separate delays, the MMA introduces a single streamlined latency per iteration, shorter than the combined latency of conventional cascaded units, resulting in enhanced throughput and efficiency. The MMA units are designed to process spatial input depths in parallel, achieving significantly higher performance than both standalone MSDF-based and conventional designs. We evaluate the proposed design using U-Net as a target application. Despite operating at a lower frequency than a CPU, the FPGA-based accelerator achieves up to an order of magnitude higher energy efficiency, delivering up to $15.14$ GOPS/W compared to $1.93$ GOPS/W for CPU-based inference. The design also shows approximately $9\times$ reduction in energy consumption compared to MSDF-based FPGA implementations. These results highlight the efficacy of the merged arithmetic approach for resource-constrained, latency-sensitive edge applications in medical imaging and computer vision.
comment: Presented at 2025 32nd IEEE International Conference on Electronics, Circuits and Systems (ICECS)
☆ Agentic evolution of physically constrained foundation models
Artificial intelligence increasingly drives automated scientific discovery, yet contemporary generalist agents lack physical grounding, frequently hallucinating hardware-incompatible designs. Here, we present a physically grounded, multi-agent discovery engine that autonomously architects hardware-compliant computing systems. Anchored by an Evolutionary Knowledge Graph structuring past scientific innovations, the framework extracts an "algorithmic Chain-of-Thought" to transform blind stochastic search into directed structural evolution. Applied to the extreme testbed of foundation model deployment, the engine evolved two hardware-aware compression methodologies surpassing human-engineered heuristics: Q-Enhance mitigates long-context accuracy loss in dense models, and MoE-Salient-AQ outperforms state-of-the-art manual sparse Mixture-of-Experts designs by 3.7% at sub-3-bit regimes. Utilizing a bandwidth-efficient Sensitivity Profile, we successfully deployed a massive 235-billion-parameter model onto a constrained dual-A100 server, reducing memory requirements by 75% with a marginal 0.64% accuracy degradation. By transforming unconstrained combinatorial search into knowledge-driven autonomy, this establishes a scalable hardware-software co-design paradigm for machine-driven discovery within strict physical boundaries.
comment: 29 pages, 5 main figures and 4 extended data figures
☆ Cache-Resident LLM Inference in GB-Scale Last-Level Caches
Large language model (LLM) inference is increasingly dominated by data movement across the memory hierarchy. Recent 3D-stacked cache technologies have enabled GB-scale last-level caches in modern server CPUs, making it possible to keep reusable model weights on chip and exploit cache bandwidth and latency. Achieving this regime is not straightforward: deeper pipelining for weight residency increases in-flight requests and KV-cache footprint, while cache-resident operators make operator-boundary synchronization a visible bottleneck. We present a cache-resident execution model for inference on hierarchical-memory clustered systems. The model separates weight-centric operators from attention and KV-cache management into dedicated resource domains, keeping reusable weights cache-resident while scaling KV capacity independently of pipeline depth. It also relaxes synchronization from operator boundaries to true sub-operator dependencies, reducing coordination overhead in the cache-resident regime. We instantiate this model on a multi-socket CPU cluster with a weight-attention decoupled architecture, locality-aware placement, and a specialized static runtime. The prototype substantially outperforms equally provisioned llama.cpp. On deployed Llama-3.2-3B and Llama-2-7B configurations, it achieves 2.04x-11.51x speedup on time-per-output-token (TPOT). Under a validated analytical model, it further reaches up to 13.9x TPOT speedup across model sizes, context lengths, and batch sizes. These results show that commodity CPUs with GB-scale last-level caches can support efficient LLM inference when execution is organized around cache residency, decoupled state management, and dependency-aware coordination.
☆ Programmable Probabilistic Computer with 1,000,000 p-bits
Probabilistic computers built from p-bits have been proposed as hardware accelerators for sampling and optimizing Ising models, but existing systems have been confined to a single chip, capped by its capacity and memory bandwidth. Here we break this limit by networking FPGAs into a single Ising machine far larger than any one device could hold, realizing a programmable probabilistic computer with one million p-bits. The machine performs Gibbs sampling at over a trillion flips per second while keeping every coupling weight in local on-chip memory. During execution, devices exchange nothing but 1-bit boundary states. This architecture exposes a question fundamental to any distributed sampler: how frequently boundary information must be refreshed for a partitioned machine to behave as an unpartitioned one. Using three-dimensional Edwards-Anderson spin glasses, we show that the answer is set by a single timing ratio, eta = f_comm/f_p-bit, of the boundary-exchange frequency to the local p-bit update frequency. Above a topology-dependent threshold, the distributed machine matches a monolithic GPU reference. Below it, residual energy still decays as a power law but with a reduced exponent, turning parallelism into a quantifiable throughput-accuracy tradeoff. A theoretical cluster mean-field model reproduces the same behavior, showing that this tradeoff is a universal property of partitioned stochastic dynamics. These results provide a programmable million-p-bit platform, demonstrated across spin glasses, Max-Cut, and Boolean satisfiability, together with a quantitative design rule for scaling probabilistic computers beyond the single-chip limit.
☆ SafeGen: LLM-Driven Assertion Generation and Fault Criticality Evaluation for Functional Safety
With advances in autonomous driving and electric vehicle technologies, functional safety has become a critical requirement in automotive chip design. Traditional simulation-based fault analysis is often overly conservative at the module level and fails to accurately reflect fault criticality. This paper presents SafeGen, an LLM-driven, formal-verification-assisted framework for functional-safety-oriented fault criticality assessment. SafeGen leverages large language models (LLMs) and a document-level Hyper Knowledge Graph (HyperKG) that incorporates Failure Modes, Effects, and Diagnostic Analysis (FMEDA) guidelines to extract verifiable specifications from design and safety documents and evaluate their relevance to overall system safety. The HyperKG is further enriched with register-transfer-level (RTL) information to guide the generation of Functional Safety Assertions (FSAs) that are both semantically grounded and design-aware. Each assertion is linked to its corresponding specification, enabling traceable reasoning throughout the assessment process. A gate-to-RTL fault-mapping mechanism supporting both stuck-at and bridging faults, combined with formal property verification (FPV), enables semantic-level fault criticality grading based on specification-linked assertion violations. A digital-physical co-simulation platform for a field-oriented control (FOC) system is developed to validate SafeGen. Experimental results demonstrate that SafeGen generates higher-quality assertions than existing LLM-based assertion generation frameworks while providing greater semantic interpretability in fault criticality assessment compared with traditional simulation-based approaches.
comment: Accepted by DAC 2026
♻ ☆ PVF:Understanding AI Vulnerability Against SDCs
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDC), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of the escalating threat, it is crucial to address key questions: How vulnerable are AI models to parameter corruptions, and how do different components (such as modules, layers) of the models exhibit varying vulnerabilities to parameter corruptions? To systematically address this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT), while presenting an in-depth vulnerability analysis on DLRM. PVF has been a critical metric used for making key error management design decisions in productionizing Meta's in-house AI chip - MTIA.
Programming Languages 5
☆ Information flow security on persistent memory
Persistent memory is a recently proposed memory paradigm that delivers many system-wide benefits, including improved runtime efficiency and the ability of programs to recover from power outages and system crashes. While recent research has investigated techniques for proving functional correctness of programs running on related architectures, this is not the case for the orthogonal concept of information flow security. In this paper, we provide an information flow logic for an unstructured language (i.e., with gotos rather than loops) modelling a simple assembly language. We apply this logic to x86 assembly using a notion of reordering interference freedom (rif) to reason about potential out-of-order propagation of instructions to memory. We then show how this same notion of rif can be used to similarly reason about information flow on persistent memory.
☆ Reading AI Model Compilation in MLIR Through the Lens of Formal Theories
Compiler infrastructures such as MLIR rest on a set of design principles: IR abstractions, interfaces, match-and-rewrite, flow analysis, type conversion, staged lowering, and so on. These concepts have proven themselves in practice. Good designs typically arrive through engineering knowledge, intuition and experience. Many of them, however, have correspondences in formal theory. MLIR's match-and-rewrite engine has correspondence to a \emph{term-rewriting-system}~\cite{baadernipkow1998}; staged lowering has the structure of \emph{refinement calculus}~\cite{back1998}; and range analysis is grounded in \emph{abstract interpretation}~\cite{cousot1977,cousot1979}. Highlighting these correspondences is useful because each theory supplies vocabulary precise enough to discuss structural questions. Moreover, as coding agents lower the cost of implementation, good design and abstractions become the main concern~\cite{Lattner2026ClaudeCCompiler}. A coding agent can generate a pass, but it can only reason over the semantics the representation exposes. When essential structure is missing, the limitation is one of abstraction, not of implementation. The natural next question is how to design that substrate well. Well-chosen abstractions emerge from experience and intuition, but they often mirror concepts given a more precise treatment in formal theory. We argue that knowledge of these formal concepts clarifies what completeness means for a given abstraction, what the ideal design would be, and where practical trade-offs depart from it.
♻ ☆ Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces
As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that create, operate on and manage other agents. Meta-agent operations such as coordinating agents, halting risky actions before execution, or repairing failed runs, require runtime manipulation of agentic execution. Yet existing agentic substrates make this difficult: they expose only transcripts and environment snapshots, forcing meta-agents to build ad hoc tooling to reconstruct and operate over full execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can easily inspect and transform. Every model action, tool call, and environment change becomes a structured event in a reversible, Git-like execution trace, where any past state can be reverted 5x faster than docker commit and fork. Three example use cases show Shepherd's versatility: (1) a supervisor meta-agent prevents conflicts among parallel coding agents, lifting pair-coding pass rate from 28.8% to 54.7% on CooperBench; (2) a counterfactual optimization meta-agent repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on Terminal-Bench 2.0 by 12.8% with 58% lower wall-clock; (3) a training meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's uplift on Terminal-Bench 2.0. We open-source Shepherd to enable principled and efficient operations over agentic execution for both users and meta-agents.
comment: 50 pages, 22 figures, 14 tables
♻ ☆ ESBMC-PLC+: A Unified IEC 61131-3 Formal Verification Framework as a PLCverif Successor
PLCverif is the most mature open-source platform for PLC formal verification, developed at CERN and in production use since 2019. Yet it has two fundamental limitations: no support for Ladder Diagram (LD) programs, the dominant PLC notation, and reliance on CBMC as its primary backend, which restricts verification to bounded proofs. The PLCverif authors themselves identified ESBMC as the appropriate backend improvement. Prior work established ESBMC-PLC (a textual LD frontend with k-induction) and ESBMC-GraphPLC (graphical PLCopen XML support); together, they cover LD with unbounded proofs but not Structured Text (ST), and graphical LD with timer/counter function blocks remains unverifiable. This paper presents ESBMC-PLC+, a unified framework that closes both gaps: (1) an ST/SCL frontend via the MATIEC IEC 61131-3 compiler, routing C-compiled ST to ESBMC with nondeterministic input modeling and YAML property injection; (2) function block state semantics for graphical LD, extending the DFS resolver to model TON/TOF/TP timers, CTU/CTD counters, and R_TRIG/F_TRIG edge triggers as persistent scan-cycle state variables in the GOTO IR. ESBMC-PLC+ is the first open-source PLC verification framework to support all three major IEC 61131-3 input formats via a single ESBMC backend, enabling k-induction-unbounded safety proofs. A feature comparison with PLCverif and experimental evaluation on 8 benchmark programs, including programs with up to 8 integer timers, shows that ESBMC-PLC+ matches PLCverif's input coverage while providing stronger guarantees. Against nuXmv's BDD backend, ESBMC-PLC+ is 400-2,000x faster on timer programs and completes proofs where nuXmv BDD times out at 120s.
comment: 21pages
♻ ☆ Specifying AI-SDLC Processes: A Protocol Language for Human-Agent Boundaries
AI agents now participate as first-class team members across the software development lifecycle, yet no specification language exists for expressing the human-agent responsibility boundaries, approval gates, and governance constraints this collaboration requires. Existing approaches encode process in agent prompts (subject to drift), target adjacent domains (workflow management, business processes), or address only fragments (access control, approval gates). We propose a domain-specific language for specifying AI-SDLC processes as protocols, with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non-determinism through primitives such as validation tokens and capability boundaries. Three results follow. A failure rate analysis shows that structural enforcement bounds system failure rates at a weighted product of agent and validator rates, while behavioral compliance permits cumulative or near-saturating growth. The 2+N team pattern (two human-in-control roles plus N specialized agent members) formalizes classical Separation of Duties for AI-SDLC. Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs. We position the contribution against multi-agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability-based security (SAGA): the novelty lies in the specific integration, not any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.
comment: Position paper with formal specification, failure rate analysis, and feasibility demonstration. Companion empirical paper and open-source implementation forthcoming
Data Structures and Algorithms 8
☆ On the Parameterized Complexity of Bounded-Density Vertex Deletion
We explore the parameterized complexity of Bounded Density Vertex Deletion (BDVD): given a graph $G$, an integer budget $k$, and a target density $τ_ρ$, the task is to determine whether the density (i.e. number of edges divided by number of vertices) of the densest subgraph of $G$ can be reduced to at most $τ_ρ$ by deleting at most $k$ vertices. Our primary focus is on structural graph parameters related to treewidth, as the parameterized complexity of BDVD with respect to treewidth was left as open question by Bazgan et al. [JCSS, 2025]. We resolve this question by showing W[1]-hardness with respect to various parameters, including treedepth and feedback vertex number. These results imply W[1]-hardness with respect to treewidth. We obtain positive results for parameters larger than treedepth and feedback vertex number, namely we show BDVD is in FPT parameterized by the max leaf number or vertex integrity. Under the assumption that the target density $τ_ρ$ is a fixed constant the parameterized complexity landscape of BDVD changes drastically, allowing a fixed-parameter tractable algorithm even for parameters smaller than treewidth, namely cliquewidth. Altogether, our results provide a refined complexity landscape for Bounded Density Vertex Deletion, sharply distinguishing between tractable and intractable parameter regimes under structural parameterizations.
comment: To appear at MFCS 2026
☆ A Stronger Conditional Running-Time Lower Bound for Global Label Min-Cut
Let $n$ and $p$ denote the numbers of vertices and labels, respectively, in an undirected edge-labeled graph. Previous work showed that, under the Exponential Time Hypothesis (ETH), there is no deterministic algorithm with running time \[ (np)^{o\left(\frac{\log n}{(\log\log n)^2}\right)}. \] In this paper, we give a deterministic reduction that strengthens this conditional running-time lower bound to \[ (np)^{o\left(\frac{\log n}{\log\log n}\right)}. \]
☆ Parameterized Complexity of Power Network Design: Coordinating Cable Placement is Hard
We study generalizations of the Steiner Tree problem motivated by the design of power networks. While Steiner Tree asks for a single minimum-cost tree connecting given terminal vertices, a power network typically consists of multiple trees, each connecting a subset of the terminals, to avoid electrical overloads. The cost of installing depends on both the cable lengths and the cost of digging underground trenches for putting the cables where the digging costs can be shared. These leads to variants of Steiner Tree where the goal is to compute a minimum-cost set of Steiner trees with a common root, that together connect all terminals while balancing the power demand of the terminals in each tree. Two important variants arise depending on whether the network is intended for low-voltage or high-voltage power. In the low-voltage case, power loss imposes a bound on the maximum depth of each tree, while no such restriction applies in the high-voltage case. We study the parameterized complexity of several power network design problems, parameterized by the number of terminals. While Steiner Tree is fixed-parameter tractable under this parameterization, most of our variants are W[1]-hard. For low-voltage networks, we present an XP-algorithm for planar inputs based on structural bounds on the treewidth of solution subgraphs. We also give a reduction from Grid Tiling showing tightness under ETH. The XP-algorithm extends to the high-voltage setting and general graphs, albeit at a cost in the running time. For high-voltage networks, we show the problem remains W[1]-hard even on planar graphs. Finally, we explore a variant of the cost model for sharing digging costs in which both problems become fixed-parameter tractable.
☆ Paths and Intersections: Recognizing Outerplanar Metrics
We study the following distance realization problem: given a metric $D$ on a set $T$ of terminals, does there exist an (edge-weighted) outerplanar graph $G$, such that $T\subseteq V(G)$, and for every pair $t,t'\in T$, $\textsf{dist}_G(t,t')=D(t,t')$? We first prove that there is no ``local characterization'', forming a contrast with trees and Okamura-Seymour instances. Our main result is an efficient algorithm for this problem whose running time is polynomial in $|T|$. Both our proof and our algorithm utilize a recent new approach of analyzing graph structures, by viewing graphs as paths and their intersections, which we believe is of independent interest.
☆ Segment Watchman Routes
Motivated by applications for robust guarding, we consider a variant of the multiple-watchmen problem that ensures that every point within a polygon $P$ is seen from more than one direction: we search for two routes $W_1,W_2$, such that every point $p\in P$ is contained in a segment $\overline{w_1w_2}\subseteq P$ such that $w_1\in W_1$ and $w_2\in W_2$. We call such routes segment watchman routes. We show that finding the two routes that are optimal with respect to the min-max criterion is weakly NP-hard even in simple polygons, and that finding the routes that are optimal with respect to the min-sum criterion is NP-hard in polygons with holes. Moreover, we present sufficient conditions for routes to be segment watchman routes, and provide a polynomial-time $2$-approximation under both the min-max criterion and the min-sum criterion, both in simple polygons. Finally, we show how to generalize our results for $k$ watchmen.
comment: 19 pages, 10 figures, accepted to the 51st International Symposium on Mathematical Foundations of Computer Science (MFCS 2026)
☆ Space-Efficient Language Generation in the Limit COLT 2026
We initiate a resource-aware theory of \textit{language generation in the limit} under the minimal constraint of space efficiency. In our framework, a learner observes an adversarial positive stream from a target language $K$ and must eventually output a hallucination-free hypothesis language $L \subseteq K$ while omitting at most $Δ$ strings of $K$. We focus on $\mathcal{C}_{s,k}$, the collection of languages recognized by DFAs with at most $s$ states over an alphabet of size $k$, as the natural hypothesis class for memory-bounded learners. In the exponential-space regime, we prove that a learner can exactly identify the target $K$. Under a stricter memory budget, we characterize the strongest possible generation guarantees. In particular, we present a streaming algorithm using $\mathrm{poly}(s,k)$ space that converges to a hypothesis with generation gap $Δ= O(k^{2s-2})$. Moreover, the learned hypothesis captures every string in $K$ of length at least $2s-1$. We complement this result with a near-matching lower bound through a reduction from a standard communication complexity problem. Specifically, achieving generation gap $Δ\le k^{(1-\varepsilon)s}$ requires $k^{Ω(\varepsilon s)}$ memory. Together, these results reveal a sharp transition between polynomial-space generation and exponential-space exact identification.
comment: Accepted at COLT 2026
☆ Multi-Source Reachability in Near-Optimal Time
The multi-source reachability problem asks to compute the reachable sets from a given subset of source vertices. For $n$-vertex digraphs $G=(V,E)$ and a subset of sources $S \subseteq V$ with $|S|=n^σ$ for some $σ\in [0,1]$, we present a near-optimal deterministic algorithm that solves this problem in $\tilde{O}(n^{ω(σ)})$ time, where $ω(σ)$ is the rectangular matrix multiplication exponent for multiplying an $n^σ\times n$ matrix by an $n \times n$ matrix. For dense graphs, this yields reachability from up to $n^{0.32}$ sources in near-linear time, breaking the super-quadratic time barrier and improving over the state-of-the-art $n^{1+2/3ω(σ)}$-time randomized algorithm of Elkin and Trehan [arXiv:2401.05628, 2024].
♻ ☆ Cutting Planarians: Planar Emulators for String Graphs STOC 2026
In this paper we construct distance sketches for intersection graphs of arbitrary path-connected regions in the plane (known as the string graphs) in the constant and $1+\varepsilon$ distortion regimes. Furthermore, the distance sketches themselves are planar graphs. First, we show that every unweighted string graph $G$ has an $O(1)$-distortion planar emulator: that is, there exists an edge-weighted planar graph $H$ containing every vertex in $G$, such that every pair of vertices $(u,v)$ satisfies $δ_G(u,v) \le δ_H(u,v) \le O(1) \cdot δ_G(u,v)$. Furthermore, we show that for any constant $\varepsilon > 0$, there is an edge-weighted planar graph $H'$ such that every pair of vertices $(u,v)$ satisfies $δ_G(u,v) \le δ_{H'}(u,v) \le (1+\varepsilon) \cdot δ_G(u,v) + O(\varepsilon^{-4}\textrm{poly}\log n)$. No previous constructions of sparse distance sketches were known even for intersection graphs of simple shapes like axis-parallel rectangles or fat convex polygons. As applications, we construct the first $(1+\varepsilon, +O(1))$ mixed-distortion tree cover and distance oracle for arbitrary string graphs, as well as the first additive $+(\varepsilonΔ+O(1))$-distortion embedding of string graphs $G$ with diameter $Δ$ into graphs of constant treewidth $O(\varepsilon^{-4})$.
comment: full version of STOC 2026 paper
Graphics 5
☆ From Sparse and Imperfect 2D Anchors to Consistent 3D Gaussian Street Scenes: Support-Aware Appearance
Image priors can synthesize target conditions for 3D Gaussian street scenes, but independently edited views do not define a coherent 3D target. Direct fitting can propagate view-specific noise, while existing pipelines do not jointly handle imperfect sparse anchors and standard-rasterizer deployment. To address this gap, teacher-relative appearance residual distillation is introduced for appearance baking. A structured space for frequency decomposition, confidence estimation, and primitive-level lifting is formed by residuals between teacher anchors and original renders. The direct optimization signal is supplied by renderer-space matching, while primitive assignment is regularized by support-aware Gaussian-space aggregation. Supported detail is admitted and unsupported noise is suppressed through confidence-gated coarse-to-fine optimization, after which all residuals are baked into fixed-geometry spherical-harmonic coefficients. The teacher and auxiliary training modules are discarded at inference. Evaluation across Waymo street assets, Tanks and Temples scenes, and multiple target conditions shows a favorable overall balance of target alignment, content preservation, artifact suppression, and cross-view consistency over editing-based baselines. Ablations confirm the effectiveness of the main components. Code will be released at https://github.com/Cagares/Baking-for-3D-Gaussian.
☆ Cross-View Variance Correlation in Path-Traced Stereo:A Hidden Shortcut in Synthetic Training Data
Path-traced synthetic stereo data underlie a large fraction of modern disparity-estimation training pipelines. We report a previously unrecognised property of such data: while the Monte Carlo (MC) noise streams of the two cameras are statistically independent, the underlying \emph{variance fields} -- deterministic per-pixel functions of the rendering integrand -- are highly correlated once aligned by the ground-truth disparity warp. Across 20 scenes rendered with Mitsuba~3, the warped Pearson correlation reaches $ρ{=}0.754{\pm}0.016$ across 20 scenes at $\mathrm{SPP}{=}512$, and on a representative scene remains essentially invariant ($ρ{=}0.778{\pm}0.001$) over a $16\times$ range of samples per pixel. The effect is strongest in Lambertian regions ($ρ{\approx}0.78$) and substantially weaker in glass ($ρ{\approx}0.30$), as predicted by an integrand decomposition into view-independent and view-dependent components. A residual-shuffle intervention that breaks the cross-view alignment while preserving the clean image degrades the GT cost margin by $33\%$ on non-glass and the variance-based winner-take-all accuracy on glass by $4.3\times$, confirming the structure functions as a matching cue. This signal is unique to MC-rendered data and constitutes a candidate sim-to-real shortcut whose impact on trained networks remains to be quantified.
☆ EvoFlock: evolved inverse design of multi-agent motion
This paper describes an automatic method for adjusting or tuning models of multi-agent motion. Simulating the motion of bird flocks, human crowds, vehicle traffic, and other multi-agent systems is a widely used technique. These simulations model the behavior of a single group member (bird, human, or vehicle). The group behaviors (flock, crowd, traffic) emerge from interactions between group members. These models typically have many numerical control parameters. Even if each parameter is intuitive in isolation, their interaction can be complex and nonlinear. It is challenging to determine which parameters to adjust for the desired change in group behavior. Changing one aspect of group behavior often causes other aspects to change, leading to a tedious process of incremental changes. This work takes an inverse design approach. The desired group behavior is measured with a user-defined objective(/fitness/loss) function and optimized with a genetic algorithm. The objective function used here for basic flocking rewards proper spacing with neighbors, flying near a desired speed, and avoiding obstacles. Interestingly, the vivid alignment seen in bird flocks appears to emerge from maintaining proper spacing between flockmates.
comment: To appear in the Proceedings of Artificial Life 2026
♻ ☆ Virtual Reality Alters Perceived Functional Body Size
Virtual reality (VR) introduces sensory perturbations that may impact perception and action. The current study was designed to investigate how immersive VR presented through a head-mounted display (HMD) affects perceived functional body size using a passable aperture paradigm. Participants (n=60) performed an action task (sidle through apertures) and a perception task (adjust aperture width until passable without contact) in both physical, unmediated reality (UR) and VR. Results revealed significantly higher action and perceptual thresholds in VR compared to UR. Affordance ratios (perceptual threshold over action threshold) were also higher in VR, indicating that the increase in perceptual thresholds in VR was driven partly by sensorimotor uncertainty, as reflected in the increase in the action thresholds, and partly by perceptual distortions imposed by VR. This perceptual overestimation in VR also persisted as an aftereffect in UR following VR exposure. Geometrical modelling attributed the disproportionate increase in the perceptual threshold in VR primarily to depth compression. This compression, stemming from the vergence-accommodation conflict (VAC), caused the virtual aperture to be perceived as narrower than depicted, thus requiring a wider adjusted aperture. Critically, after mathematically correcting for the VAC's impact on perceived aperture width, the affordance ratios in VR became equivalent to those in UR. These outcomes demonstrate a recovered invariant geometrical scaling, suggesting that perception remained functionally attuned to action capabilities once VAC-induced distortions were accounted for. These findings highlight that VR-induced depth compression systematically alters perceived body-environment relationships, leading to an altered sense of one's functional body size.
♻ ☆ SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training
The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset's identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.
comment: project page https://y-zheng18.github.io/SplatPainter/
Operating Systems 3
☆ ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses
AI agents increasingly run in production through harnesses, the software around the LLM, including an engine that enforces safety and effectiveness policies, e.g., 'run tests before committing.' Enforcing these policies requires bridging a semantic gap: policy intent is expressed in underspecified natural language, while enforcement must act on concrete system actions, e.g., which test to run. Many policies also define event ordering or data flow actions. Yet existing approaches fall short. Tool-call guardrails miss system actions that bypass the tool layer, while OS sandboxes control resource access instead of actions, returning opaque errors that confuse the agent. Our key insight is that policy context lives within the agent closest to the task, while enforcement must happen at the OS to cover all execution paths. We introduce ActPlane, a policy engine that lets agents declare policies and enforces them in the OS kernel with semantic feedback and isolation. ActPlane uses a simple information-flow control (IFC) DSL to support cross-event policies. We implement ActPlane with eBPF and evaluate it on policies from the empirical study, coding-task benchmarks, and safety benchmarks. ActPlane improves policy compliance, including on indirect execution paths that tool-call interception cannot observe, with 1.9%-8.4% overhead. ActPlane is at https://github.com/eunomia-bpf/ActPlane
☆ Kops: Safely Extending the eBPF Compilation Pipeline with Native Operations
eBPF safely extends OS kernels in domains such as networking, observability, and security. The safety comes from an in-kernel compilation pipeline where a verifier checks every program, and a kernel just-in-time compiler (JIT) translates the verified bytecode to native code. The kernel keeps the JIT simple to stay trustworthy, translating one bytecode instruction at a time in a single pass. This single-pass design misses optimization opportunities, so eBPF runs up to twice as slow as natively compiled code in our characterization. Adding optimizations to the kernel JIT directly requires upstream acceptance and a long release cycle, enlarges the trusted computing base (TCB), and grows the per-architecture kernel code. To address this, we present Kops, an extension interface that lets userspace compilers and kernel modules introduce new operations without modifying the kernel core, while keeping a minimal trusted computing base (TCB). Each operation has two forms, a proof sequence of vanilla eBPF instructions that the existing verifier checks and a native emit of machine instructions that the JIT compiles. Because the verifier checks the proof sequence, the native emit is the only per-operation addition to the TCB. Hardware idioms are the lowest-hanging fruit for this interface. With Kops, we build EInsn, seven operations such as rotate and conditional select that CPUs execute as single instructions. Lean 4 proofs show that each native emit computes the same result as its proof sequence. On x86-64 and ARM64, EInsn speeds up eBPF microbenchmarks by up to 24% and production applications by up to 12%. The same interface also supports whole-program native replacement, reaching 2.358x at the cost of a larger TCB.
comment: Yusheng Zheng and Zhengjie Ji contributed equally to this work
☆ Aquifer: Hierarchical Memory Pooling with CXL and RDMA for MicroVM Snapshots
Memory stranding wastes 25-35% of installed DRAM in production cloud clusters. Memory pooling over CXL and RDMA offers a remedy, but neither technology alone suffices: CXL provides low-latency, load/store-transparent access limited to a pod, while RDMA provides cluster-wide reach at higher latency with software overhead. A hierarchical architecture combining both tiers is the practical path forward, yet remains unexplored for MicroVM-based serverless computing, where snapshot restore latency is the dominant cold-start bottleneck. We present Aquifer, the first system to serve MicroVM snapshots from a hierarchical CXL+RDMA memory pool. A characterization of snapshot images reveals that the vast majority of pages are either zero or cold, enabling a hotness-based snapshot format that eliminates zero pages and places only the hot working set in the CXL pool while storing cold pages in the RDMA pool. Sharing these snapshots across hosts on CXL 2.0 multi-headed devices, which lack hardware cache coherence, requires Aquifer's ownership-based coherence protocol to ensure correctness. Finally, Aquifer uses a copy-based page serving mechanism pre-installs hot pages from CXL memory before MicroVM resume and demand-pages cold pages asynchronously from RDMA. On emulated CXL+RDMA hardware, Aquifer achieves a 2.2x geometric-mean speedup in end-to-end invocation time over Firecracker and 1.1x over the next best alternative.
Hardware Architecture 3
☆ PDS Joint: A Parametric Double-Spiral Joint Tailored for Dexterous Hands
Compliant joints can embed safety and adaptability into dexterous hands, but achieving large-stroke anthropomorphic motion while maintaining joint-specific, directiondependent stiffness and reliable proprioception remains challenging. This paper presents the PDS joint, a parametric doublespiral (PDS) compliant joint that enables systematic shaping of directional stiffness across multiple deformation modes, including flexion/extension, abduction/adduction, and pronation/supination. We instantiate the joint using Archimedean and logarithmic spiral templates for different hand joints and introduce an asymmetry ratio to tailor stiffness distributions for both grasp stability and hyperextension resistance. To make the joint practically usable under large deformation, we co-design embedded inductive proprioception and propose a learningbased calibration pipeline that maps raw inductive signals to joint states using ArUco-marker tracking. Experiments characterize the stiffness landscapes across geometric parameters and demonstrate a non-monotonic dependence of lateral support on asymmetry, indicating the importance of principled parameter tuning. For joint-state estimation in the most challenging abduction/adduction motion, a learned multilayer-perceptron (MLP) mapping reduces the error compared with conventional curve fitting by 41.6%. Finally, we integrate the proposed joints into an open-source dexterous hand as a demonstration platform, on which the hand grasps a set of nine everyday objects and performs safe, contact-rich human-involved interactions.
♻ ☆ ColumnKeeper: Efficient Solutions to the ColumnDisturb Vulnerability in DRAM-based Systems ISCA 2026
Modern DRAM chips are vulnerable to read disturbance phenomena such as RowHammer and RowPress, which induce bitflips after accessing nearby rows a certain number of times (the read disturbance threshold). ColumnDisturb is a new, fundamentally different DRAM read disturbance phenomenon. Specifically, ColumnDisturb (i) disturbs DRAM columns instead of rows, and (ii) increases the number of affected DRAM cells from those in only a few neighboring rows to all cells across three consecutive DRAM subarrays. We propose ColumnKeeper, the first set of ColumnDisturb mitigations, in two variants: ColumnKeeper-D (CK-D), a deterministic mechanism, and ColumnKeeper-P (CK-P), a probabilistic one. CK-D exploits DRAM's open-bitline architecture to provide deterministic security guarantees at low performance and energy overheads: it uses two counters per subarray to track activations affecting the odd and even columns, and refreshes one row in a subarray when either counter reaches a predetermined threshold. CK-P instead refreshes one row in three consecutive subarrays upon a row activation in the middle subarray, with a predetermined probability, providing configurable security guarantees at low area overhead. Both mechanisms prevent ColumnDisturb bitflips at low performance, energy, and area overheads. At the current experimentally-demonstrated ColumnDisturb threshold (1M), CK-D and CK-P incur very low average single-core performance overheads of 0.15% and 0.36%, respectively. For near-future thresholds (128K), these rise to a still low average of 1.70% and 2.73%. Mitigating ColumnDisturb at low thresholds (e.g., 16K) remains possible by adopting smaller subarray sizes or enabling subarray-level parallelism. CK-D and CK-P require low area overheads of 0.1 mm^2 and 0.03 mm^2, respectively. ColumnKeeper is freely available at https://github.com/CMU-SAFARI/ColumnKeeper .
comment: To appear at ISCA 2026
♻ ☆ Implementation and Performance Evaluation of CMOS-integrated Memristor-driven Flip-flop Circuits
In this work, we report implementation and performance evaluation of memristor-driven fundamental logic gates, including NOT, AND, NAND, OR, NOR, and XOR, and novel and optimized design of the sequential logic circuits, such as D flip-flop, T-flip-flop, JK-flip-flop, and SR-flip-flop. The design, implementation, and optimization of these logic circuits were performed in SPECTRE in Cadence Virtuoso and integrated with 90 nm CMOS technology node. Additionally, we discuss an optimized design of memristor-driven logic gates and sequential logic circuits, and draw a comparative analysis with the other reported state-of-the-art work on sequential circuits. Moreover, the utilized memristor framework was experimentally pre-validated with the experimental data of Y2O3-based memristive devices, which shows significantly low values of variability during switching in both device-to-device (D2D) and cycle-to-cycle (C2C) operation. The performance metrics were calculated in terms of area, power, and delay of these sequential circuits and were found to be reduced by more than ~24%, 60%, and 58%, respectively, as compared to the other state-of-the-art work on sequential circuits. Therefore, the implemented memristor-based design significantly improves the performance of various logic designs, which makes it more area and power-efficient and shows the potential of memristor in designing various low-power, low-cost, ultrafast, and compact circuits.
Programming Languages 3
☆ Verifiable Auto-Formalization of Mathematics Using a Relaxed Natural Formal Language
Auto-formalization aims to translate informal mathematical content into formal languages that can be processed by theorem provers. However, directly targeting existing theorem provers requires LLMs to bridge a substantial representational gap between informal mathematical writing and formal proof languages. This gap also makes semantic consistency difficult to evaluate. We address these difficulties by introducing a Relaxed Natural Formal Language (Relaxed NFL) as an intermediate target for auto-formalization. The Relaxed NFL is designed to remain close to informal mathematical writing: it preserves the usual structure of informal reasoning and allows partially specified expressions and propositions, without requiring their precise interpretation to be fixed at the auto-formalization stage. The remaining ambiguity and implicitness inherited from informal reasoning are resolved during a later elaboration stage, which transforms Relaxed NFL proofs into Core Natural Formal Language (Core NFL) proofs with formally defined semantics. The elaboration procedure combines rule-based transformations with LLM-generated heuristics, while maintaining verifiability through explicit constraints on each transformation step. The Core NFL is then used to generate proof gaps, namely verification conditions that must hold for the formalized proof to be correct. These gaps are discharged by LLM-generated proof scripts written in a domain-specific tactic language, which provides commands for invoking theorem libraries and domain-specific solvers implemented as part of our system.
♻ ☆ Constraint-Level Design of zkEVMs: Architectures, Trade-offs, and Evolution
Zero-Knowledge Ethereum Virtual Machines (zkEVMs) must reconcile an inherent tension. The Ethereum Virtual Machine (EVM) was designed for transparent step-by-step execution with dynamic control flow. Proving such execution in zero-knowledge, however, requires transforming it into algebraic circuit representations that encode computation as mathematical constraints. Existing surveys address zkEVMs at the level of implementations, cryptographic primitives, or Layer 2 deployment, leaving the constraint-system design that governs their cost largely unexamined. This survey provides the first constraint-level analysis of how five production zkEVM systems and three universal Zero-Knowledge Virtual Machines (zkVMs) resolve this tension through constraint engineering. We show that the degree of EVM compatibility, captured by the Type 1-4 spectrum, is the defining architectural decision that shapes all subsequent technical choices. We classify the design space along four architectural dimensions, namely arithmetization frameworks, dispatch strategies, semantic rewrites, and recursion approaches. Examining the mechanisms within each dimension, we identify the technical factors and trade-offs that drive each choice. The analysis reveals that all five surveyed production zkEVMs adopt PLONKish arithmetization. The zkVMs instead rely on the Algebraic Intermediate Representation (AIR), which suits uniform state machines. A single trade-off between EVM compatibility and constraint cost underlies these choices. The most Ethereum-equivalent systems accept higher constraint counts to preserve full bytecode fidelity, while systems that relax that fidelity attain substantially lower constraint counts. We close with the critical open problems and future research directions that this constraint-level view brings into focus.
♻ ☆ Scalable Probabilistic Program Verification via Typed Extended Decision Diagrams
Weakest pre-expectations are the probabilistic program analogue to weakest preconditions in classical programs. Deductive verification approaches aim to establish bounds on these quantitative expectations. Their automation has been successful in analysing a variety of discrete probabilistic programs. Key routines in that automation require reasoning about (partially unrolled) loops, however, the logical representation of weakest pre-expectations on such unrollings often explodes. In this paper, we develop typed extended decision diagrams (TEDDs), inspired by various extensions to binary decision diagrams. We demonstrate computing WPs represented as TEDDs, SMT-based pruning to further shrink their representation, and we lift some proof rules to operate on TEDDs. Finally, we demonstrate that TEDDs boost the scalability of deductive probabilistic program verification by orders of magnitudes over the state of the art.
Data Structures and Algorithms 14
☆ Scheduling with Testing: Competitive Algorithms for Minimizing the Total Weighted Completion Time in the Adversarial Model
We study scheduling with testing on a single machine and on identical parallel machines to minimize the total \emph{weighted} completion time in the adversarial model. In this setting, each job is equipped with a weight, an upper bound on its processing time, and a testing time. An algorithm can either execute a job for an amount of time equal to the upper bound or test it first to reveal a potentially lower processing time used to schedule the job later. We establish the first constant-competitive algorithms for this problem with job-dependent weights that reflect each job's relative importance. For single-machine scheduling, we present a deterministic algorithm with a competitive ratio of 2.3166 and show that a randomized variant has a competitive ratio of 2.1523. These guarantees match the best-known upper bounds in the unweighted setting. Combining these algorithms with list scheduling yields competitive ratios of 2.7763 and 2.5110 for identical-parallel-machine scheduling, improving the previously best-known bounds even in the unweighted case.
☆ A Near-Optimal Parallel Algorithm for Finding Matroid Bases
We settle the classic question of the parallel complexity of computing a matroid basis, as first posed in the seminal work of Karp, Upfal, and Wigderson (FOCS 1985, JCSS 1988). Our algorithm runs in $O(n^{1/3}\log^{1/3}n)$ rounds, matching the lower bound of KUW up to a $\log^{2/3}(n)$ factor.
☆ Faster algorithm for achieving minimal-size quantum decision diagrams
The decision diagram (DD) data structure enables fast linear-algebra calculations by bringing vectors into a normal form and subsequently merging equivalent ones, yielding a minimally-sized DD modulo the equivalence relation. A fruitful application area is quantum-circuit simulation, where the vectors represent quantum states. The Local Invertible Map Decision Diagram (LIMDD) type, merges LIM-equivalent (typically Pauli-gate equivalent) vectors, can efficiently simulate Clifford circuits as well as some high-T-count circuits, and has theoretically been proven exponentially faster for simulation than other well-developed data structures, including other common DD variants. However, these exponential advantages have not fully materialized yet in existing implementations, for which the normal-form procedure, which is a highly complex algorithm, is either absent or only partially implemented. We here present a novel normal-form algorithm for Pauli-LIMDDs, achieving a worst-case speedup from $O(n^3)$ to $O(n^2)$ for an $n$-qubit DD node with a single child node while keeping the $O(n^3)$ run time in case of two distinct children nodes. We implement the algorithm as part of QolDDer, our Pauli-LIMDD simulator for quantum circuits, written from scratch in C/C++. The implementation realizes the theoretically-proven advantages of Pauli-LIMDDs on Clifford circuits, is significantly faster than the existing LIMDD simulators on such circuits, and on a public quantum-circuit data set often outperforms them by an order of magnitude. In the future, we envision that our work will enable further application and development of LIMDD variants, not only for quantum design tasks, but also for analysis of linear-algebra-based systems in general.
☆ Scheduling jobs with unknown size distribution in a M/G/1 queue: the shifted empirical Gittins
In this paper we consider a M/G/1 queue for which we want to minimize the expected response time. We show how to compute indices from $n$ samples of the job size distribution such that the corresponding index policy is asymptotically optimal as $n$ grows. This construction is based on a discretization of the bounded support of the job size distribution and a shift of the samples to their nearest discrete point to the right. We show that the Gittins index of the empirical distribution of these shifted samples is close to the Gittins index of the original distribution. This translates to the asymptotic optimality of the corresponding index policy for minimizing the expected response time. Numerical comparison with other approaches further confirm the efficiency of our approach.
☆ Can Aggregate Invariants Accelerate Continuous Subgraph Matching? Limits, Laws, and a Dynamic Spectral Index
Spectral filtering recently delivered substantial pruning for \emph{static} subgraph matching: Laplacian interlacing rejects candidates whose neighborhoods cannot host the query. We study whether such aggregate structural tests can accelerate \emph{continuous} subgraph matching (CSM) over dynamic graphs, and answer in three parts. First, lazily maintained spectral bounds are infeasible exactly where spectral pruning has value: we characterize the tightest safe rule over a formalized perturbation relaxation and show that even it loses essentially all pruning power within four touching updates. Second, exact maintenance is affordable when selective: pruning utility and recomputation cost are anti-correlated across vertices -- hubs provably never prune -- so recomputing small-neighborhood spectra on touch sustains exact local spectra at microseconds per update, complete by construction. Third, integrated into a decoupled CSM benchmark against an identical-minus-spectra control, the tests remove up to $51\%$ of candidates or safely skip up to $47\%$ of update enumerations, yet enumeration intermediates remain unchanged -- beyond the gates' skipped first-level bindings, typically zero -- across two engines, four real graphs, two stream types, and $77$ solved queries; a constructed radius-stratified workload confirms the instrument detects the exception when one exists ($-99.9\%$ intermediates, $748\times$ faster). Aggregate tests accelerate what scales with candidate sets -- construction, list scans -- never adjacency-guided exploration. We distill an intermediate-invariance methodology for evaluating CSM filters and release a reusable dynamic local-spectra index.
comment: 11 pages, 3 figures, 3 tables
☆ cuSBF: A Minimizer-Aware Bloom Filter for Genomic Sequence Data on Modern GPUs
Efficient genomic k-mer indexing depends on approximate membership query (AMQ) structures that must deliver high throughput, low false-positive rates (FPR), and modest memory footprints. The Super Bloom filter (SBF) is attractive for this scenario because minimizer-guided sharding and the Findere scheme exploit the redundancy of overlapping k-mers. However, those same features cause high per-k-mer compute cost, severe register pressure, and irregular memory accesses, which hinder an effective GPU implementation. We present cuSBF, an open-source, header-only CUDA library that implements SBF for sequence-native workloads. cuSBF's design merges sectorized shards, cooperative shared-memory tiling, warp-level shard sharing, and segmented warp reductions, turning super-k-mer locality into scalable GPU parallelism. Across real genomic workloads on RTX PRO 6000 Blackwell and GH200 systems, cuSBF achieves the highest throughput among all evaluated sequence-capable baselines. On the RTX PRO 6000, it surpasses the cuCollections blocked Bloom filter baseline by up to 9.1x for insertion and 7.7x for query, while reaching up to 92x and 234x speedups over the multi-threaded CPU Super Bloom reference implementation. It also outperforms GPU-based dynamic AMQs (Cuckoo, Two-Choice, Quotient filters) by 1.5-3400x depending on workload characteristics. A parameter sweep identifies (s = 28, m = 16, H = 4) as Pareto-optimal for k = 31, yielding significantly lower FPR than cuCollections at matched memory budgets. Crucially, cuSBF's architecture-aware design sustains 85% streaming multiprocessor utilization even for out-of-cache filters - proving that sequence locality, not raw bandwidth, is the key to GPU-accelerated genomic indexing.
♻ ☆ Merge-width and First-Order Model Checking
We introduce merge-width, a family of graph parameters that unifies several structural graph measures, including treewidth, degeneracy, twin-width, clique-width, and generalized coloring numbers. Our parameters are based on new decompositions called construction sequences. These are sequences of ever coarser partitions of the vertex set, where each pair of parts has a specified default connection, and all vertex pairs of the graph that differ from the default are marked as resolved. The radius-$r$ merge-width is the maximum number of parts reached from a vertex by following a path of at most $r$ resolved edges. Graph classes of bounded merge-width -- for which the radius-$r$ merge-width parameter can be bounded by a constant, for each fixed $r=1,2,3,\ldots$ -- include all classes of bounded expansion or of bounded twin-width, thus unifying two central notions from the Sparsity and Twin-width frameworks. Furthermore, they are preserved under first-order transductions, which attests to their robustness. We conjecture that classes of bounded merge-width are equivalent to the previously introduced classes of bounded flip-width. As our main result, we show that the model checking problem for first-order logic is fixed-parameter tractable on graph classes of bounded merge-width, assuming the input includes a witnessing construction sequence. This unites and extends two previous model checking results: the result of Dvořák, Král, and Thomas for classes of bounded expansion, and the result of Bonnet, Kim, Thomassé, and Watrigant for classes of bounded twin-width. Finally, we suggest future research directions that could impact the study of structural and algorithmic graph theory, in particular of monadically dependent graph classes, which we conjecture to coincide with classes of almost bounded merge-width.
♻ ☆ Hardness of Dynamic Core and Truss Decompositions
The k-core of a graph is its maximal subgraph with minimum degree at least k, and the core value of a vertex u is the largest k for which u is contained in the k-core of the graph. Among cohesive subgraphs, k-core and its variants have received a lot of attention recently, particularly on dynamic graphs, as reported by Hanauer, Henzinger, and Schulz in their recent survey on dynamic graph algorithms. We answer questions on k-core stated in the survey, proving that there is no efficient dynamic algorithm for k-core or to find (2 - ε)-approximations for the core values, unless we can improve decade-long state-of-the-art algorithms in many areas including matrix multiplication and satisfiability, based on the established OMv and SETH conjectures. Some of our results show that there is no dynamic algorithm for k-core asymptotically faster than the trivial ones. This explains why most recent research papers in this area focus not on a generic efficient dynamic algorithm, but on finding a bounded algorithm, which is fast when few core values change per update. However, we also prove that such bounded algorithms do not exist, based on the OMv conjecture. We present lower bounds also for a directed version of the problem, and for the edge variant of the problem, known as k-truss. On the positive side, we present a polylogarithmic dynamic algorithm for 2-core.
comment: Full version of the paper accepted in WAOA 2025
♻ ☆ Learning with Monotone Adversarial Corruptions
We study the extent to which standard machine learning algorithms rely on exchangeability and independence of data by introducing a monotone adversarial corruption model. In this model, an adversary, upon looking at a "clean" i.i.d. dataset, inserts additional "corrupted" points of their choice into the dataset. These added points are constrained to be monotone corruptions, in that they get labeled according to the ground-truth target function. Perhaps surprisingly, we demonstrate that in this setting, all known optimal learning algorithms for binary classification can be made to achieve suboptimal expected error on a new independent test point drawn from the same distribution as the clean dataset. On the other hand, we show that uniform convergence-based algorithms do not degrade in their guarantees. Our results showcase how optimal learning algorithms break down in the face of seemingly helpful monotone corruptions, exposing their overreliance on exchangeability.
comment: added a few references in related work
♻ ☆ Bi-Lipschitz extensions and outlier embeddings into trees
We develop low distortion embeddings with outliers from arbitrary metrics into hierarchically separated trees (HSTs). In particular, we develop an efficient algorithm that for any $ε>0$, given an input metric $(X,d)$, and a probabilistic embedding of all but $k$ points from $X$ into HSTs with distortion $c$, samples from a probabilistic embedding of all but $O(\frac{k}ε\log k)$ points into HSTs that achieves distortion at most $(32+ε)c$. Our results are based on two key technical components. First, we extend an algorithm of Munagala et al. [2023] for minimizing the distortion of embeddings without outliers into HSTs to the setting with outliers. We combine this with new results on bi-Lipschitz extensions into trees and $\ell_1$ space. In particular, we show that any probabilistic embedding into HSTs can be extended to $k$ additional points with only a factor $O(\log k)$ of additional distortion. This bi-Lipschitz extension result utilizes a new probabilistic partitioning scheme that we call onion partitioning.
♻ ☆ Automated Lower Bounds for Bilinear Complexity over Finite Fields
We present a general, automated framework for proving lower bounds on the bilinear complexity (tensor rank) of multiplication problems over a finite field $\mathbb{F}_q$. The framework is parameterized only by the multiplication tensor and by a group of rank-preserving symmetries acting on one argument: it classifies the orbits of constraint subspaces under that group, runs a dynamic program over the orbits combining four lower-bound techniques, and emits a proof certificate that a verifier rechecks, typically faster than the search. Instantiating the framework for matrix multiplication, we improve the lower bounds for four small formats over $\mathbb{F}_2$, most notably showing that the bilinear complexity of multiplying two $3 \times 3$ matrices over $\mathbb{F}_2$ is at least $20$, raising the bound of $19$ that had stood since Bläser (2003). Instantiating it for polynomial multiplication -- full products, cyclic convolution, and the truncated (modulo $x^N$) and negacyclic (modulo $x^N+1$) products -- we obtain eighteen new lower bounds over $\mathbb{F}_2$ and $\mathbb{F}_3$. Every bound is backed by a machine-checkable certificate.
♻ ☆ On estimating Schatten norm and power distances between quantum states
We study the computational complexity of estimating the quantum Schatten $α$-norm distance ${\rm T}_α(ρ_0,ρ_1)$, given ${\rm poly}(n)$-size state-preparation circuits of $n$-qubit quantum states $ρ_0$ and $ρ_1$. This quantity serves as a lower bound on the trace distance and, for $α> 1$, is interchangeable with its powered version $Λ_α(ρ_0,ρ_1)$. For any constant $α> 1$, we develop an efficient rank-independent quantum estimator for ${\rm T}_α(ρ_0,ρ_1)$ with time complexity ${\rm poly}(n)$, achieving an exponential speedup over the prior best results of $\exp(n)$ due to Wang, Guan, Liu, Zhang, and Ying (TIT 2024). When $0<α<1$ is a constant, the quantum Schatten $α$-power distance $Λ_α(ρ_0,ρ_1)$ becomes a distance metric. Accordingly, we provide a rank-efficient quantum estimator for this quantity. Our quantum algorithm reveals a dichotomy in the computational complexity of the Quantum State Distinguishability Problem with Schatten $α$-norm (QSD $_α$), which involves deciding whether ${\rm T}_α(ρ_0,ρ_1)$ is at least $2/5$ or at most $1/5$. This dichotomy arises between the cases of $α> 1$ and $0 < α\leq 1$: 1. For any constant $α>1$, QSD$_α$ is $\sf BQP$-complete. 2. For any $1 \leq α(n) \leq 1+{\rm negl}(n)$, QSD$_α$ is $\sf QSZK$-complete, implying that no efficient quantum estimator for ${\rm T}_α(ρ_0,ρ_1)$ exists unless ${\sf BQP}={\sf QSZK}$. This $\sf QSZK$-hardness result also extends to the promise problem defined by $Λ_α(ρ_0,ρ_1)$ for constant $0<α<1$. The hardness results follow from reductions based on new rank-dependent inequalities for ${\rm T}_α(ρ_0,ρ_1)$ when $1\leq α\leq \infty$ and for $Λ_α(ρ_0,ρ_1)$ when $0<α<1$, which are of independent interest.
comment: 50 pages, 1 table, 4 algorithms. v3: Added results on estimating the Schatten power distance of order between 0 and 1; reorganized sections and made minor changes. v2: Minor changes; corrected parameters in the proofs of Theorems 4.5 and A.1
♻ ☆ A Private Approximation of the 2nd-Moment Matrix of Any Subsamplable Input
We study the problem of differentially private second moment estimation and present a new algorithm that achieve strong privacy-utility trade-offs even for worst-case inputs under subsamplability assumptions on the data. We call an input $(m,α,β)$-subsamplable if a random subsample of size $m$ (or larger) preserves w.p $\geq 1-β$ the spectral structure of the original second moment matrix up to a multiplicative factor of $1\pm α$. Building upon subsamplability, we give a recursive algorithmic framework similar to Kamath et al 2019, that abides zero-Concentrated Differential Privacy (zCDP) while preserving w.h.p. the accuracy of the second moment estimation upto an arbitrary factor of $(1\pmγ)$. We then show how to apply our algorithm to approximate the second moment matrix of a distribution $\mathcal{D}$, even when a noticeable fraction of the input are outliers.
♻ ☆ Fully Dynamic Graph Algorithms with Edge Differential Privacy
We study differentially private algorithms for analyzing graphs in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time, and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only (called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA 21), and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML 23). We provide the first differentially private and fully dynamic graph algorithms for several other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error and show strong lower bounds on the error for all algorithms in this setting. We study two variants of edge differential privacy for fully dynamic graph algorithms: event-level and item-level. We give upper and lower bounds on the error of both event-level and item-level fully dynamic algorithms for several fundamental graph problems. No fully dynamic algorithms that are private at the item-level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems, our algorithms match our lower bounds.
comment: minor fixes; 31 pages, 3 figures
Graphics 8
☆ Cage-based Texture Transfer with Geometric Filtering SIGGRAPH 2026
Real-time texture transfer expands the creative horizon for interactive applications, enabling seamless detail projection in scenarios that range from digital character cosmetics to procedural automotive texturing. Yet, its practical application is governed by inherent trade-offs between processing speed and suppression of artifacts. Low-latency transfer methods frequently fail to suppress artifacts, and robust alternatives rely on large-scale models that are costly in training and memory. Our proposed method bridges the gap between efficiency and robustness by using a cage-based geometric filtering method to identify Non-Cosmetic Zones (NCZs) for artifact suppression. While other models are resource-intensive and require multiple days of training on manually annotated datasets, we are able to successfully suppress artifacts and achieve immediate deployment on consumer-grade hardware. Our framework achieved highly efficient runtimes of ~70ms on mobile devices for a ~4.8k triangle mesh.
comment: Accepted to SIGGRAPH 2026
Self-supervised Garment Dynamics with Persistent Wrinkles ECCV 2026
The self-supervised neural garment simulator has become popular due to its high efficiency, good visual realism, and no reliance on training data. However, existing methods greatly simplify the mechanical properties of fabrics, ignoring persistent wrinkles caused by plasticity. Although this simplification allows for modeling of purely elastic material and simple training via energy minimization, the lack of believable wrinkles adversely affects the visual realism. Therefore, we introduce the first self-supervised neural garment simulator that explicitly models persistent wrinkles. This is achieved by a novel physics-inspired loss function, turning learning into a moving energy minimization problem to mimic plasticity. However, this requires learning to use a changing loss function, which causes difficulties in training i.e. the loss function changes during training. To this end, we propose a new physics-inspired curriculum learning scheme where the target material for learning gradually changes from pure elasticity to elasto-plasticity, allowing the loss function and the learnable parameters to jointly converge. Through a comprehensive evaluation, we show that for the first time, self-supervised learning models can generate natural persistent wrinkles, outperforming existing methods on a variety of garments, body shapes, and body motions, according to a range of metrics.
comment: 14 pages, 6 figures. Accepted to ECCV 2026
☆ Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
comment: Website: https://wan-streamer.com
☆ Visualizing "We the People": Bridging the Perception Gap through Pluralistic Data Storytelling
Traditional visual data storytelling relies on binary graphics that depict two simplified groups in conflict. This can increase political polarization by oversimplifying intra-group disagreements and erasing ambiguity and shared ideas or values. This can inadvertently foster "us versus them" thinking. Intentional, pluralistic design choices for AI-enabled digital platforms can produce visualizations that emphasize nuance, opinion distribution, and intergroup commonalities. To demonstrate this potential, we examine deliberative technologies that map high-dimensional opinion spaces and highlight areas of both consensus and dissensus. The paper highlights the We the People deliberation conducted by Jigsaw and the Napolitan Institute in September 2025, which engaged over 2,400 Americans across all 435 congressional districts in an AI-supported, asynchronous dialogue regarding freedom and equality. By utilizing AI to synthesize long-form, text-based participant inputs into interactive "opinion landscapes," the initiative provided an alternative format for pluralistic data storytelling that humanized diverse viewpoints and revealed hidden areas of substantial broad consensus. The paper concludes that shifting from divisive, contrast-heavy visual frameworks to distribution-focused, interactive models represents a highly scalable, low-cost intervention capable of bridging perceptual gaps and cultivating a more resilient, collaborative democratic culture.
☆ FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image
We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from just a single image is significantly challenging due to the limited visual information available to accurately infer the 3D appearance and geometry of human heads. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model. This system is designed to fully exploit partial visual observations to generate lifelike human avatars. Our proposed diffusion model learns a generative mapping from these partial observations to complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar, capable of being driven with novel expressions in real-time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.
comment: Project page: https://kim-youwang.github.io/FiCA
♻ ☆ Differentiable Packing of Irregular 3D Objects with Adaptive Container Estimation
Most existing approaches either fix the container in advance or optimize only a single container dimension through an outer search loop, leaving the remaining dimensions as a manual tuning problem. We present a differentiable packing framework that jointly optimizes all 6N object pose parameters and all three container side lengths inside a single gradient-based loop. The formulation combines six physics-inspired, differentiable loss terms computed directly on triangle meshes through axis-aligned bounding-box proxies. An adaptive squeezing mechanism periodically tightens the container whenever the overlap loss falls below a pair-count-scaled threshold, producing a large initial drop in container volume, followed by small refinements. All pairwise computations are written in tensor-broadcasting form, giving a 3.4 to 54 times speedup over a reference loop-based implementation. The pipeline is implemented in Python and PyTorch, with no physics engine, FFT library, or convex decomposition. On multiple object categories, the method produces containers that are 11 to 32 percent smaller than time-matched DBLF and simulated-annealing baselines at N =100, while running in under 4 minutes per instance on a single consumer GPU.
comment: 20 pages, 8 figures, 5 tables
♻ ☆ AI+CAD Data Representation Architecture: From DeepCAD Solid Modeling to WHUCAD Industrial-Level Parametric Feature Modeling
In July 2025, Study Times, sponsored by the Party School of the Central Committee of the CPC, pointed out that 95% of industrial software for R&D and design in China relies on imports, and that 90% of the high-end CAD/CAE/CAM software market is monopolized by European and American giants. This is a typical strategic bottleneck problem. Unlike the visually oriented goal of "visual plausibility" pursued by related sister disciplines such as CV and CG, CAD places greater emphasis on "industrial usability". In CAD, data representation architecture is more foundational than the optimization of network algorithms. This paper first starts from data representation in AI+CAD and reports a classification paradigm and research progress in AI+CAD. Then, using the open-source DeepCAD data representation as an example, it analyzes the pain points of representative AI+CAD work and the gap between such work and real industrial-level parametric feature modeling. Next, by comparison with the open-source WHUCAD data representation, it discusses how its three-level architecture provides fundamental support for industrial-grade parametric feature modeling. Finally, in view of the rapid iteration of the AI wave, large models, and agents, this paper offers an outlook on AI+industrial-grade CAD.
♻ ☆ Stochastic Signed Distance Processes
Multi-view surface reconstruction is a core problem in computer vision. One prominent line of work represents the surface implicitly as a signed distance field (SDF), optimizing it based on the photometric loss between rendered and observed pixel colors. These approaches typically employ SDF-based volume rendering to obtain a differentiable relaxation of discontinuous visibility along rays, thereby reducing reliance on silhouette supervision. In this paper, we reformulate SDF-based volume rendering as probabilistic surface rendering, where each pixel color is modeled as a mixture distribution induced by the random first ray-surface intersection. To this end, we introduce Stochastic Signed Distance Processes (SSDP), which model the SDF along each ray as a stochastic process, inducing a first-passage-time distribution for each ray. We then derive the first-passage probability for each sampling interval based on Bayesian filtering, together with its practical approximation for parallel rendering. We further show that NeuS, an existing SDF-based volume rendering method, arises as a special case of our formulation. Experiments on the DTU and MobileBrick datasets demonstrate that our method outperforms baselines in both surface reconstruction and uncertainty quantification, supporting the effectiveness of our first-passage formulation. Our code is available at https://github.com/skmhrk1209/SSDP.
Operating Systems 3
☆ LMS-AR: LMS Prediction-based Adaptive Regulator for Memory Bandwidth in Multicore Systems
Memory bandwidth contention in multi-core systems severely impacts application performance and quality-of-service (QoS) guarantees. Regulating the shared memory bandwidth mitigates the memory performance uncertainty thereby making it a manageable resource and improving trustworthiness of multi-core systems. In this work we propose a memory bandwidth regulation mechanism LMS-AR, i.e., LMS Prediction-based Adaptive Regulator within a Linux kernel module to distribute the memory bandwidth as a resource among the CPU cores. We describe a design in which both monitoring and regulation is enforced from outside by a master core - which is not a dedicated controller for regulation. This allows for plugging in computationally heavy prediction and regulation algorithms without interfering with the regulated core. An adaptive filtering technique was employed for prediction of per-core bandwidth requirement. We conducted several experiments with SPEC CPU 2017 benchmarks distributed across multiple cores. Our proposed approach demonstrated significant improvement over Memguard with respect to slowdown ratios caused due to memory contention. Our solution is hosted publicly at $\href{https://github.com/ss22ongithub/LMSAdaptiveRegulator}{https://github.com/ss22ongithub/LMSAdaptiveRegulator}$.
comment: 7 pages
☆ AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction
AI agents are driving a new software paradigm, with the ability to autonomously call tools, extract information, manage memory, and complete tasks that span applications and data sources. Most existing end-user operating systems, however, are designed for application-centric workflows and offer little native support for AI agents. This mismatch limits the wider adoption of agents and leads to execution overhead and safety risks when running agents on conventional systems. While the concept of agent-native operating systems is emerging, the research community lacks an open testbed to explore the architectural primitives desired for agent-mediated interaction. We present AOHP (Android Open Harness Project), an OS-level agent harness built on the Android Open Source Project (AOSP). The core design principle of AOHP is to treat agents as first-class OS actors, enabling adaptive user interfaces and agent-friendly runtime environments. AOHP preserves the mature Android software and hardware ecosystem while introducing three agent-oriented system mechanisms: personalized service composition, efficient agent interfaces, and secure information flow. Based on preliminary experiments on challenging tasks covering key capabilities of OS agents, AOHP shows clear advantages in task completion (+21.12% completion rate), execution cost (-51.55% token cost), and security-policy compliance.
comment: 17 pages, 3 figures
☆ FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead to both the secure inference and the normal aplications, due to two challenges: the inflexible resource isolation and the inefficient secure resource management. To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). They can only be accessed by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Based on them, FlexServe further introduces a FlexServe Framework to run secure LLM inference in the secure world. It works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average TTFT speedups of 10.05X over the strawman and 2.44X over an optimized strawman.
Hardware Architecture 13
☆ The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model
Transformer-based models underpin modern natural language processing but incur rapidly growing computational and energy costs. As training scales in both model size and parallelism, accurately predicting energy consumption has become critical for sustainable and cost-aware system design. We present a framework for modeling the energy consumption of Transformer training on multiple GPUs. Using controlled architectural sweeps of BERT models, we relate measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency. Inspired by roofline models, our approach incorporates a speedup-based hardware-efficiency factor that captures the effects of tensor parallelism and fully sharded data parallelism. We derive a scaling law model that accurately predicts training energy across heterogeneous configurations.
☆ HeteroViT: A Versatile Single-Layer Vision Transformer Concept, Co-Designed for Distributed Real-Time Data Reduction on Scientific Detectors
Next-generation X-ray detectors generate data faster than any system can affordably store or process. LCLS-II, the upgraded Linac Coherent Light Source at SLAC, produces data on the order of terabytes per second, with raw-data transfer and storage projected to be prohibitively costly, even though much of the data is not scientifically useful. This concept paper focuses on two major points. The first is versatility: a deliberately tiny, single-layer Vision Transformer (ViT) is enough to serve distinct scientific quick-evaluation tasks. We demonstrate this on two very different problems: (a) a supervised hit/miss/maybe classification on the CSPAD dataset, made to resemble ePixUHR-like detector frames, and (b) a self-supervised latent space for rare-event detection in X-ray diffraction spanning two learning paradigms, two output types, and two detector modalities, with one small backbone. The second is hardware co-design: because the ViT's blocks are structurally uniform, the model maps cleanly onto the heterogeneous hardware already present in the LCLS detector pipeline (ASIC -> FPGA -> GPU) under a simple rule one ASIC is one token so the data is reduced progressively at each stage and a keep/discard decision is produced in real time at the edge. The two claims reinforce each other: versatility is precisely what justifies freezing the front-end in silicon, since a reusable front-end is only worth committing to hardware if it serves many tasks. We are explicit that this is a concept supported by early software analysis, not a hardware demonstration. The natural and primary next phase is the hardware implementation of this distributed pipeline. The decisive evidence still owed an end-to-end latency budget, ASIC feasibility of the in-sensor embedding, and the false-negative behavior that matters for a data veto defines that program. HeteroViT is our first step toward it.
comment: N/A
☆ An Open-Source LFSR-Based Stochastic Leaky Integrate-and-Fire Neuron in SkyWater 130 nm: Design, Stochastic Characterisation, and Rate Coding
Stochastic spiking neurons trade exact arithmetic for controlled randomness, lowering area and tolerating input noise, which suits event-driven edge hardware. We present a compact, configurable stochastic leaky integrate-and-fire neuron in standard-cell CMOS on the SkyWater 130 nm process, released openly. A 16-bit configurable-polynomial linear-feedback shift register drives an eight-entry programmable activation table that sets a Bernoulli firing probability, and a saturating 16-bit leaky integrator with a programmable threshold and a refractory period of zero to seven cycles produces the spike train. All parameters are set through a sixteen-register serial interface, and the neuron runs from parallel inputs or entirely from the register file. From a model checked bit-exact against the register-transfer code, the period is 65535 states for a maximal-length polynomial and 63 for the shipped default, the eight-bit comparison value is uniform over the full period, and the per-entry firing probability equals the table value divided by 256. We also characterise a property a system-level model would not expose: the comparator output is serially correlated at short lags, with a negative lobe near lag eight, because the compared byte shifts by one bit each cycle; subsampling every sixteen cycles restores whiteness. Rate-coding sweeps show monotonic control of the output rate by the input weight and the threshold, and the refractory period caps the rate at one spike per refractory-plus-one cycles. The neuron occupies about 10,600 square micrometres at 70 per cent utilisation on a single Tiny Tapeout tile, meets 50 MHz timing with positive margin, and passes eighteen directed cocotb tests at register-transfer and gate level. All results are pre-silicon, from simulation and the open flow. The neuron is an openly released companion to a four-block neuromorphic suite reported separately.
comment: 12 pages, 7 figures, 3 tables. Open-source hardware on SkyWater 130 nm via Tiny Tapeout. RTL, cocotb tests, and the open implementation flow are public at https://github.com/santhoshs93/tt_um_santhosh_stoch_neuron (commit 225ce6e). All results are pre-silicon, from simulation and the open hardening flow
☆ Unprivileged Topology Certificates for Cloud GPU Attestation
Cloud GPU tenants receive a model name and a region, but cannot directly inspect the physical accelerator that runs their job. We present a software-only attestation primitive for this setting. A CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. The certificate supports three claims. First, the per-SM latency map is a stable physical fingerprint. Over a six-hour full-load RTX 5090 run, its median temporal jitter is 0.09 cycles, while shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy. Second, cache-bypassing HBM sweeps recover hardware-class topology across generations, including a unified Volta V100 memory domain, a two-way Hopper H200 L2 split, and a Blackwell B200 two-die NV-HBI package whose 74/74 SM partition carries a 30-cycle, 15.5 ns cross-die penalty. Third, public network landmarks bind the same certificate to a coarse location. In the B200 run, 169 RIPE Atlas probes place the server within 44 km of its claimed datacentre and reject all 11 decoy sites. Together, these measurements check cloud-GPU identity, class, and coarse location without privileged access or a vendor key.
comment: 12 pages, 2 figures, 4 tables. Source package includes code, certificates, JSON summaries, and a lossless SQLite ancillary payload for the full raw data tree with SHA-256 manifest
☆ VeriPilot: An LLM-Powered Verilog Debugging Framework
Verilog debugging remains one of the most time-consuming stages in digital circuit design. Recent advances in Large Language Models (LLMs) have enabled automated debugging; however, most existing approaches rely solely on test outputs and compiler feedback in an end-to-end manner, limiting their effectiveness on complex bugs. A key challenge is that the root cause of an error may be far removed from its observable outputs, making it difficult for LLMs to trace long dependency chains in code. This challenge is further exacerbated in large codebases, where long context lengths hinder efficient reasoning. To address these limitations, we propose VeriPilot, an LLM-powered debugging framework that leverages golden reference models to enable fine-grained bug localization and repair. VeriPilot goes beyond output-level comparison by aligning internal variable semantics between the Verilog design and its corresponding golden model through LLM-based analysis. It then performs step-by-step signal tracing using Control-Data-Flow Graphs (CDFGs) derived from static analysis, identifying a minimal set of suspicious code regions along with their correct counterparts from the golden model. These structured insights are subsequently provided to the LLM to guide reasoning and automated code repair. Experimental results on the Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA demonstrate that VeriPilot improves the repair success rate of GPT-4o from 54.3\% to 85.71\%, significantly enhancing both bug localization accuracy and repair effectiveness for complex Verilog designs. The source code and benchmark are publicly available at Github https://github.com/YihanWn/VeriPilot.git.
comment: 13 pages, 6 figures
☆ MOCAP: Wafer-Scale-Chip-Oriented Memory-Orchestrated Chunked Pipelining Framework for Prefill-Only LLM Inference
Large language models (LLMs) are increasingly used in prefill-only workloads, where end-to-end latency is dominated by the prefill phase. For long-context prefill, communication overhead grows with sequence length and quickly becomes a bottleneck on conventional GPU systems, making wafer-scale chips (WSCs) a promising substrate due to their high communication bandwidth and large aggregate compute and memory capacity. A natural way to accelerate prefill is to partition a long input sequence into multiple chunks and execute them in a finer-grained pipeline across devices. However, directly applying this idea to long-context prefill on WSCs remains challenging. First, causal dependency across chunks causes KV cache to accumulate unevenly across pipeline stages, creating severe memory imbalance and limiting the feasible sequence length. Second, later chunks require more attention computation because each chunk depends on preceding chunks, leading to chunk-level latency imbalance. To address these challenges, we present MOCAP, a memory-orchestrated chunked pipelining framework for prefill-only LLM inference on WSCs. MOCAP introduces Memory-Balanced KV Reallocation (MBKR) to alleviate memory imbalance by redistributing KV cache across pipeline stages, thereby extending the feasible sequence length. It further incorporates Latency-Balanced Chunk Partitioning (LBCP) to balance chunk execution cost under both attention-cost growth and KV reallocation overhead, improving pipeline efficiency. Experimental results show that, compared with GPipe, MOCAP achieves 76.4\% lower end-to-end latency and 3.24$\times$ higher throughput on average. MOCAP also extends the maximum supported sequence length by up to 1.31$\times$ compared with Terapipe.
comment: 15 pages, 6 figures, accepted by APPT 2026
☆ Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding
Vector-scalar comparison is a fundamental computation primitive that compares each element in a vector against a single scalar value. It is widely used in various data-intensive workloads from databases to machine learning. Due to its low computational intensity, its execution tends to be memory-bound, limiting the utilization of compute resources. Processing-using-DRAM (PuD) is an emerging computing paradigm that performs massively parallel bitwise operations directly inside DRAM arrays, alleviating off-chip data movement. Existing PuD-based approaches require many DRAM commands because the comparison's algorithmic complexity grows with operand bit-width in the bit-serial execution model. This command overhead becomes the dominant bottleneck, limiting application-level speedup. We propose Clutch, a data representation and comparison algorithm that accelerates vector-scalar comparisons in PuD systems with high efficiency and scalability. Clutch first uses temporal coding, encoding each vector value as a sequence of leading ones, which enables lookup-based comparison against a scalar by accessing the corresponding DRAM row. To avoid the prohibitive memory footprint of lookup tables at high precision, Clutch partitions operands into multiple multi-bit chunks, compares chunks independently using compact lookup tables, and merges the per-chunk results with a PuD-efficient procedure. By adjusting the number of chunks, Clutch provides a flexible tradeoff between throughput and memory usage. Across predicate evaluation and decision tree inference, Clutch improves end-to-end application throughput and energy efficiency by an average of 12x and 69x over highly optimized CPU and GPU execution, and by 2.9x and 3.0x over the state-of-the-art bit-serial PuD implementation. We also present the first mapping of decision tree inference to PuD execution, extending PuD to a new application domain.
comment: To appear at ICS 2026 (July 2026)
♻ ☆ Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators
Neural network accelerators have been widely applied to edge devices for complex tasks like object tracking, image recognition, etc. Previous works have explored the quantization technologies in related lightweight accelerator designs to reduce hardware resource consumption. However, low precision leads to high accuracy loss in inference. Therefore, mixed-precision quantization becomes an alternative solution by applying different precision in different layers to trade off resource consumption and accuracy. Because regular designs for multiplication on hardware cannot support the precision reconfiguration for a multi-precision Quantized Neural Network (QNN) model in runtime, we propose a runtime reconfigurable multi-precision multi-channel bitwise systolic array design for QNN accelerators. We have implemented and evaluated our work on the Ultra96 FPGA platform. Results show that our work can achieve 1.3185 to 3.5671 times speedup in inferring mixed-precision models and has less critical path delay, supporting a higher clock frequency (250MHz).
comment: Accepted at ISQED 2025. RTL source code, Vivado projects, and packaged Vivado IPs are available at: https://github.com/liuyh-Horizon/BitSys
♻ ☆ Dynamic Symmetric Point Tracking: Tackling Non-ideal Reference in Analog In-memory Training
Analog in-memory computing (AIMC) performs computation directly within resistive crossbar arrays, offering an energy-efficient platform to scale large vision and language models. However, non-ideal analog device properties make the training on AIMC devices challenging. In particular, its update asymmetry can induce a systematic drift of weight updates towards a device-specific symmetric point (SP), which typically does not align with the optimum of the training objective. To mitigate this bias, most existing works assume the SP is known and pre-calibrate it to zero before training by setting the reference point as the SP. Nevertheless, calibrating AIMC devices requires costly pulse updates, and residual calibration error can directly degrade training performance. In this work, we present the first theoretical characterization of the pulse complexity of SP calibration and the resulting estimation error. We further propose a dynamic SP estimation method that tracks the SP during model training, and establishes its convergence guarantees. In addition, we develop an enhanced variant based on chopping and filtering techniques from digital signal processing. Numerical experiments demonstrate both the efficiency and effectiveness of the proposed method.
♻ ☆ CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access
CXL-attached memory lets servers add more memory while keeping the standard load/store programming model. The main drawback is latency. CXL memory accesses are too slow for normal CPU mechanisms to hide reliably, especially when each access depends on the result of a previous one. At the same time, they are too fast for traditional software techniques, such as context switches or interrupt-based asynchrony, to manage one load at a time. On a real Granite Rapids CXL platform, we find that placing GAPBS graph workloads in CXL memory slows execution by 2.44x on average compared with local DRAM. A state-of-the-art software prefetcher still leaves a 2.21x slowdown. This paper presents our system, a hardware/software co-designed approach for hiding CXL latency using larger units of asynchronous work. The system creates regions: parts of the original program that include one or more CXL-resident memory operations together with the nearby address-generation and memory-orchestration logic needed to run them. The host CPU launches these regions asynchronously on a near-memory accelerator built on a commodity CXL Type 2 FPGA, then continues useful work while the device runs ahead and prepares future memory accesses. A compiler identifies candidate regions from unmodified source programs, and an online JIT refines region boundaries and execution parameters based on workload behavior. We implement the system as a prototype compiler, runtime, and Vortex-based CXL-side accelerator. Across GAPBS, MCF, Spatter, and NAS Parallel Benchmark workloads, it improves end-to-end performance by 1.45x to 1.75x, with a 1.59x geometric mean speedup, compared with native CXL-memory execution.
♻ ☆ An 83-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats
Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 83 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.
comment: 17 pages. Source repository: https://github.com/gHashTag/paper3-methodology tag v4.0-trinity. Paper CC BY 4.0; code MIT. ORCID 0009-0008-4294-6159
♻ ☆ GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF1024 with a Lucas-Exact Integer Identity SC
We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit. The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them. What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.
comment: 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised
♻ ☆ TrojanGYM: A Detector-in-the-Loop LLM for Adaptive RTL Hardware Trojan Insertion
Hardware Trojans (HTs) remain a critical threat because learning-based detectors often overfit to narrow trigger/payload patterns and small, stylized benchmarks. We introduce TrojanGYM, an agentic, LLM-driven framework that automatically curates HT insertions to expose detector blind spots while preserving design correctness. Given high-level HT specifications, a suite of cooperating LLM agents (instantiated with GPT-4, LLaMA-3.3-70B, and Gemini-2.5Pro) proposes and refines RTL modifications that realize diverse triggers and payloads without impacting normal functionality. TrojanGYM implements a feedback-driven benchmark generation loop co-designed with HT detectors, in which constraint-aware syntactic checking and GNN-based HT detectors provide feedback that iteratively refines HT specifications and insertion strategies to better surface detector blind spots. We further propose Robust-GNN4TJ, a new implementation of the GNN4TJ with improved graph extraction, training robustness, and prediction reliability, especially on LLM-generated HT designs. On the most challenging TrojanGYM-generated benchmarks, Robust-GNN4TJ raises HT detection rates from 0% to 60% relative to a prior GNN-based detector. We instantiate TrojanGYM on SRAM, AES-128, and UART designs at RTL level, and show that it systematically produces diverse, functionally correct HTs that reach up to 83.33% evasion rates against modern GNN-based detectors, revealing robustness gaps that are not apparent when these detectors are evaluated solely on existing TrustHub-style benchmarks. Post peer-review, we will release all codes and artifacts.
Programming Languages 7
☆ DissProve: Automated Verification of Distributed Protocols with Affine Communication
We consider the problem of automatically proving safety properties of distributed protocols. Distributed protocols have been particularly challenging for automated verification due to their asynchronous and parametric nature. Compared to synchronous systems, asynchronous communication leads to a combinatorial explosion of possible execution histories of message handlers. And because distributed protocols are typically defined parametrically on the number of actors, these definitions lead to an unbounded number of possible execution histories of unbounded length. Existing verification techniques for such distributed protocols typically require global invariants about the entire actor system, which are complex even for simple protocols. In this paper, we present an automated verification technique based on proving unreachability backwards from error states in an actor system. One key insight is that the unboundedness from parametricity can be further classified into \emph{affine} and non-affine protocols, where affine protocols have execution histories of unbounded length in a bounded number of communication rounds. We show how to use novel, goal-directed notions of materialization, causality, and summarization to verify safety properties of affine protocols with an unbounded number of actors in an automated manner. Using our prototype verification tool DissProve, we provide evidence for the feasibility of automated safety verification of asynchronous parametrized systems with affine communication.
☆ A Compositional Language for Property Graphs
A major shortcoming of the recently standardized graph query languages GQL and SQL/PGQ is their lack of compositionality. Given the importance of these languages in querying knowledge graphs, we address this shortcoming and propose both theoretical solutions and a path to adding them to the new standards. The highlight of the non-compositionality problem is that while both GQL and SQL/PGQ can express graph reachability and all first-order queries, they fall short of the problems in NLOGSPACE. In view of the completeness of reachability for NLOGSPACE under first-order reductions, this is extremely counterintuitive. The issue is well recognized by the standards committee that has been searching for language extensions to fill the gaps at the level of some specific inexpressible queries. We address the issue in a systematic way and propose a language that fills expressivity gaps by allowing full compositionality between graph patterns and relational queries. It does so by using two key components: a cleaned up definition of regular path queries with variables and data value comparisons, and a fully compositional graph-to-graph language #Datalog with complete support for constructing new graph elements from nodes, edges, lists of nodes and edges, and even entire paths. We show that the resulting language addresses the issues facing the standards committee, and propose a concrete addition to GQL and SQL/PGQ that incorporates its main features.
☆ Cyclic Graphs and Memoization in Pure $λ$-Calculus
$λ$-calculus normally requires an added recursion construct, a \texttt{letrec}, a $μ$-binder, or a built-in $Y$ for graph reduction, and sharing the repeated work of a memoized or dynamic-programming function normally requires an impure cache. We show that no extension is needed. We apply tabling, the standard method for solving a least-fixpoint equation, to weak-head reduction; this defines a new operational semantics for the pure $λ$-calculus that keeps each term's standard lazy meaning. A term that reaches finitely many distinct states comes out as a finite graph, possibly cyclic; the calculus stays pure, and the graph is sound and independent of reduction order. We implemented this operational semantics as a $λ$-calculus interpreter. It does dynamic programming automatically, sharing repeated subproblems with no memoization table. It creates and transforms cyclic graphs with no added recursion construct. And it decides an unproductive loop, returning $\bot$ for $Ω$ in finite time. What the evaluator returns is a graph, so the $λ$-calculus becomes a DSL for graph computation: the memo table of dynamic programming, the transposition table of game search, and the visited set of graph reachability and points-to analysis are all tabling on state identity, and none of them is written by hand. Compilation is one more such problem: we write a bootstrap compiler that compiles its own source, all as a pure $λ$-term.
☆ Lifting E-Graphs: A Function Isn't a Constant
Variables are quite subtle and easy to get wrong. An approach is described to support rigid $α$ canonical variables in an e-graph. The lifting e-graph has a baked-in notion of functional lifting combinator. It is implemented by fattening the usual integer identifiers with thinning bitvectors, lift-pulling smart constructors, and a special thinning-aware union find variation. The approach is inspired by slotted e-graphs and Co-de Bruijn syntax.
comment: Presented at EGRAPHS 2026
♻ ☆ ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking
PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).
comment: 18 pages
♻ ☆ Executing as You Generate: Hiding Execution Latency in LLM Code Interpreters
Current LLM systems are increasingly equipped with a code interpreter that executes generated code to obtain results. This works serially: the model first generates the complete code, then an interpreter executes it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. Our key observation is that an LLM, unlike a human developer, emits code tokens left to right and does not backtrack over what it has already written. This makes it possible to start executing a piece of code while later tokens are still being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present EAGER, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate EAGER across four benchmarks, seven LLMs, and three execution environments. The overlap mechanism hides almost all execution behind generation, reducing the non-overlapped portion of execution time by up to 99.8% and cutting end-to-end latency by up to 37.3% on error-free runs.
comment: 10 pages
♻ ☆ Types, equations, dimensions and the Pi theorem
The languages of mathematical physics and modelling are endowed with a rich ``grammar of dimensions'' that common abstractions of programming languages fail to represent. We propose a dependently typed domain-specific language (embedded in Idris) that captures this grammar. We apply it to formalize basic notions of dimensional analysis: those of dimension function, physical quantity, homomorphic measurement, the covariance principle and Buckingham's Pi theorem. We hope that the language makes mathematical physics more accessible to computer scientists and functional programming more palatable to modellers and physicists.
Data Structures and Algorithms 13
☆ Is competitive online paging an artifact?
In any real system a newly computed datum begins its existence in the processor rather than in external memory, and thus does not inevitably incur a cold miss. This was captured by early I/O models, but not by the Sleator-Tarjan one that has come to underpin competitive analysis of paging. If one corrects the Sleator-Tarjan model by charging no cost for the first access to newly computed data, optimal offline algorithms such as LFD remain optimal, but no online paging algorithm can be competitive, even if randomized, even with arbitrary resource augmentation, even against request sequences that are not tailored against it but are instead representative of widely used computational techniques. The proofs are simple, and appear robust against any reasonable assumption/model adjustment, including virtually all tools developed to make competitive analysis less pessimistic. In other words, while competitive analysis does predict the good performance exhibited in practice by online paging algorithms such as LRU, these predictions seem just a fortuitous artifact of an incorrect assumption that has crept into the underlying model several decades ago. And there are implications beyond paging, too: for example, the same issue undermines the Ideal Cache model on which the popular Cache-Oblivious and Cache-Adaptive algorithmic frameworks are based.
☆ Flood-It with Jewelry -- Characterizing the Game Complexity for Cograph Generalizations
Flood-It is a single-player game played on a precolored graph $G$, where the objective is to make $G$ monochromatic using as few flooding moves as possible. In each move, a color $c$ is selected and all vertices reachable from a fixed pivot vertex via a monochromatic path are recolored with $c$. In the free variant, the pivot may be chosen anew in every move. Deciding whether a graph can be made monochromatic in at most $k$ moves is NP-complete for both variants, fixed and free. This hardness persists even under strong structural restrictions such as split graphs and trees. The Free Flood-It variant is generally considered more difficult than its fixed-pivot counterpart, as it remains hard on several graph classes where the latter becomes tractable, including co-comparability and AT-free graphs. Cographs, that is, $P_4$-free graphs, are among the few classes on which even Free Flood-It is solvable in polynomial time and therefore serve as our starting point. We consider the ten natural one-vertex extensions of $P_4$ -- referred to as jewels -- and study the complexity of both flooding games on the $1024$ graph classes obtained by forbidding subsets of these graphs as induced subgraphs. Our main contribution is a polynomial-time algorithm for Free Flood-It on graphs that are free of the three jewels bull, gem, and $P_5$, covering $128$ of the $1024$ classes. In addition, we prove that both variants remain NP-complete on thin-spider graphs, which exclude the eight jewels banner, co-banner, chair, gem, house, kite, $P_5$, and $C_5$, thereby establishing hardness for $256$ additional classes. Combined with known algorithms and hardness results, our work determines the complexity of both Flood-It variants for $896$ of the $1024$ considered graph classes.
comment: A short version of this paper will appear in the proceedings of the 51st International Symposium on Mathematical Foundations of Computer Science (MFCS 2026)
☆ Dynamic estimation of slowly varying sequences
We consider the problem of sequentially approximating functions of each element in a slowly-varying sequence, i.e. one where the magnitude $α_i$ of the difference between the elements at positions $i$ and $i-1$ is small. Recent work on implicit trace estimation shows that when $α_t$ is small, reusing queries to past sequence elements can reduce the overall cost [Dharangutte \& Musco, NeurIPS~2021; Woodruff et al., NeurIPS~2022]. We introduce a framework generalizing this to a variety of linear and nonlinear functions on diverse vector spaces, obtaining novel sequential estimation results for matrix powers, spectral densities, Monte Carlo integration, and a boundary value problem from partial differential equations~(PDEs). Furthermore, we develop a novel algorithm for use with this framework that locally scales the estimation budget with $α_t$, obtaining sharper path-length-style variation bounds of form $\mathcal O(\sum_{i=1}^mα_i)$ on the cost of estimating a sequence of length $m$. This improves upon the previous implicit trace estimation bound of $\mathcal O(m\cdot\max_iα_i)$ [Dharangutte \& Musco, NeurIPS~2021], which is achieved by fixing the query budget using the worst-case $α_i$ and is thus inefficient for stable sequences with rare bursts. Lastly, while all past work assumes a known bound on $α_i$, we show in certain cases how the changes can be estimated on-the-fly with (nearly) no added cost. In summary, our framework makes the sequential approximation toolkit general-purpose and adaptive while improving upon state-of-the-art-guarantees for dynamic trace estimation.
comment: Preprint. 14 pages, 4 figures
☆ Log-concavity and tunneling: adiabatic quantum optimization for convex functions (with a spike)
Quantum tunneling is expected to provide a computational speedup in quantum computing, a phenomenon that Adiabatic Quantum Optimization (AQO) aims to leverage. While some academic proofs of concept have been studied, such as the "Hamming weight with a spike" (HWS) problem, the algorithmic gains of this effect remain underexplored. In this work we extend the analysis underlying HWS to more general potentials. In the first half of the work, we establish (discrete) log-concavity of the ground state as a key structural property in this context. We devise a framework for establishing log-concavity of the ground state for a large family of discrete, 1-dimensional Schrödinger operators. The family includes convex potentials, but also certain potentials with local minima. In the convex case, this provides a discrete version of a continuous result by Brascamp and Lieb ('76). We demonstrate the utility of our result by establishing new spectral gap bounds, going beyond related results by Jarret and Jordan ('14) for convex potentials. In the second half of the work, we use our results on log-concavity to extend the perturbative analysis of HWS by Reichardt ('04) to the larger family of potentials with log-concave ground state. As a concrete instantiation, we use our result to extend the HWS analysis from a linear potential (which is exactly solvable) to a quadratic potential (which is no longer solvable). Our result strongly suggests the broader applicability of tunneling to convex potentials with spikes
☆ Computing Gaussian and exponential integrals in ${\Bbb R}^n$
We consider expectations of the type $E\ \exp \left\{\sum_{i=1}^m φ_i \right\}$, where $φ_i: {\Bbb R}^n \longrightarrow {\Bbb C}$ are functions, each depending on a few coordinates of a point in ${\Bbb R}^n$, and the expectation is taken with respect to the standard Gaussian or symmetric exponential probability measures. We prove sufficient conditions, in terms of the Lipschitz constants of $φ_i$ and the combinatorics of their dependencies, for the integral to be separated from 0, and, consequently, to be amenable to a computationally efficient approximation. We discuss applications to computing volumes of bodies and statistics on integer points in polyhedra in ${\Bbb R}^n$.
comment: 31 pages
☆ Multi-Vector Embeddings are Provably More Expressive than Single Vector Embeddings
Multi-vector (MV) embeddings have become a powerful paradigm in neural information retrieval (IR), achieving high retrieval accuracy by representing data with multiple vectors and scoring them via the non-linear Chamfer similarity. Despite their widely perceived superiority over single-vector (SV) embeddings which use inner product similarity, to date there is no formal proof that SV similarities cannot approximate MV similarities with the same representation size. Specifically, we ask the following: for any bounded dataset size $n \leq 2^{poly(m)}$, what is the smallest dimension $D$ so that given any collection of MV embeddings $Q_1,\dots,Q_n,X_1,\dots,X_n \subset \mathbb{R}^d$ containing at most $m$ vectors each, there always exist $q_1,\dots,q_n$, $d_1,\dots,d_n \in \mathbb{R}^{D}$ satisfying $|\langle q_i, d_j \rangle - \texttt{Chamfer}(Q_i,X_j)| \leq ε$ for all $i,j$? Recently, the MUVERA algorithm demonstrated that $D = m^{O(1/ε^2)}$ is possible. If improved to $D = md$, this would imply that MV embeddings are no more expressive than SV embeddings. In this paper, we rule out this scenario. Specifically, we prove the existence of a collection of MV embeddings in $\mathbb{R}^d$, each containing at most $m$ vectors, which require single-vector dimension of $D =(ε^2 m)^{Ω(1/ε)}$ to approximate, establishing a strong separation in representation size between MV and SV embeddings. Our proof leverages the Pattern Matrix Method by constructing a hard instance whose Chamfer similarity matrix encodes the $NAND_k$ boolean function. Our results confirm a long-held belief in the IR community: at a fixed representation size, multi-vector embeddings can express similarities which cannot even be approximately represented by single vector embeddings.
☆ Exact and Fast Subset Selection Algorithms for the Bi-objective Integral R2 Indicator
We study fixed-cardinality subset selection for the exact integral bi-objective $R_2$ indicator with a uniform continuum of weighted Tchebycheff scalarizing functions. The indicator measures the area under the lower envelope of scalarizing losses over weight space, rather than a finite sample average over weight vectors. For a sorted bi-objective Pareto-front approximation, represented by points ordered by increasing first objective and decreasing second objective, we derive an exact adjacent-neighbor decomposition of this integral objective into boundary terms, unary diagonal corrections, and selected-neighbor transition terms. This yields an exact Bellman dynamic program with $O(kn^2)$ running time for selecting $k$ of $n$ candidate points. We then prove that the transition matrix is Monge. This gives a divide-and-conquer implementation with $O(kn\log n)$ running time and, more strongly, a staircase matrix-search implementation with $O(kn)$ running time under constant-time arithmetic comparisons. The matrix-search proof is presented through a lower-envelope sweep over single-crossing transition functions and includes the triangular feasibility condition $i
comment: 5 Figures, 17 Pages
☆ Sublinear Time Algorithms for Abelian Group Property Testing
In this paper, we study the problems of abelian group property testing in two models. In the partially specified model (PS-model), the algorithm does not know the group size but can access randomly chosen elements of the group, along with the Cayley table of these elements, which provides the result of the binary operation for every pair of selected elements. In the stronger fully specified model (FS-model), the algorithm knows the size of the group and has access to all its elements and the Cayley table. In property testing of abelian group property, given a finite set $G$ and oracle access to a binary operation $*:G^2\to G$, we aim to distinguish whether $(G,*)$ is an abelian group or is $ε$-far from any abelian group over $G$. Using a novel approach, we present a tester in the PS-model (and consequently in the FS-model) that runs in time $\tilde O(\sqrt{|G|}+1/ε)$, improving upon the Goldreich-Tauber tester, which runs in time $O(|G|/ε)$. Additionally, our tester improves another tester by Goldreich and Tauber that runs in time $O(|G|^2)$ and makes $\tilde O(|G|+1/ε)$ queries. We further extend our result to testing subclasses of abelian groups ${\cal G}$ that are closed under isomorphism. Specifically, if one can decide in time $T$ whether an abelian group of the form $Z_{m_1}\times \cdots\times Z_{m_r}$ belongs to ${\cal G}$, then there exists a tester for ${\cal G}$ that runs in time $T+\tilde O(\sqrt{|G|}+1/ε)$ and makes $O(\sqrt{|G|}+1/ε)$ queries. This result gives testers that run in time $O(\sqrt{|G|}+1/ε)$ for subclasses such as abelian groups of rank at most $k$, abelian $p$-groups, and vector spaces over~$Z_p$.
☆ Faster enumeration of primes
We describe several new algorithms for finding all prime numbers up to a given bound $N$, achieving the first ever speedup by a positive power of $\log N$ over the ancient sieve of Eratosthenes. The fastest version, which is not fully rigorous, runs in \[ N (\log \log N)^{1+o(1)} \] bit operations when analysed in the multitape Turing model. This improves on the best existing algorithms due to Pritchard (1981), Atkin--Bernstein (2004) and Sergeev (2016) by a factor of almost $\log N$. We also present a rigorous randomised (Las Vegas) variant that is slower by a factor of $(\log \log N)^{1+o(1)}$, and a rigorous deterministic variant that is slower by a factor of $(\log N)^{1/2+o(1)}$. The new algorithms make heavy use of fast polynomial arithmetic over finite fields, and also involve ideas from the theory of error-correcting codes.
comment: 121 pages
♻ ☆ Separating Oblivious and Adaptive Models of Variable Selection COLT 2026
Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees. This variant of the problem is motivated by \emph{variable selection} tasks, where the goal is to estimate the support of a $k$-sparse signal in $\mathbb{R}^d$. Our main contribution is a provable separation between the \emph{oblivious} (``for each'') and \emph{adaptive} (``for all'') models of $\ell_\infty$ sparse recovery. We show that under an oblivious model, the optimal $\ell_\infty$ error is attainable in near-linear time with $\approx k\log d$ samples, whereas in an adaptive model, $\gtrsim k^2$ samples are necessary for any algorithm to achieve this bound. This establishes a surprising contrast with the standard $\ell_2$ setting, where $\approx k \log d$ samples suffice even for adaptive sparse recovery. We conclude with a preliminary examination of a \emph{partially-adaptive} model, where we show nontrivial variable selection guarantees are possible with $\approx k\log d$ measurements.
comment: 40 pages, Extended abstract accepted for presentation at COLT 2026
♻ ☆ Kernelization dichotomies for hitting minors under structural parameterizations
For a finite collection of connected graphs $\mathcal{F}$, the $\mathcal{F}$-MINOR-DELETION problem consists in, given a graph $G$ and an integer $\ell$, deciding whether $G$ contains a vertex set of size at most $\ell$ whose removal results in an $\mathcal{F}$-minor-free graph. We lift the existence of (approximate) polynomial kernels for $\mathcal{F}$-MINOR-DELETION by the solution size to (approximate) polynomial kernels parameterized by the vertex-deletion distance to graphs of bounded elimination distance to $\mathcal{F}$-minor-free graphs. This results in exact polynomial kernels for every family $\mathcal{F}$ that contains a planar graph, and an approximate polynomial kernel for PLANAR VERTEX DELETION. Moreover, combining our result with a previous lower bound, we obtain the following infinite set of dichotomies, assuming $NP \not\subseteq coNP/poly$: for any finite set $\mathcal{F}$ of biconnected graphs on at least three vertices containing a planar graph, and any minor-closed class of graphs $\mathcal{C}$, $\mathcal{F}$-MINOR-DELETION admits a polynomial kernel parameterized by the vertex-deletion distance to $\mathcal{C}$ if and only if $\mathcal{C}$ has bounded elimination distance to $\mathcal{F}$-minor-free graphs. For instance, this yields dichotomies for CACTUS VERTEX DELETION, OUTERPLANAR VERTEX DELETION, and TREEWIDTH-$t$ VERTEX DELETION for every integer $t \geq 0$. Prior to our work, such dichotomies were only known for the particular cases of VERTEX COVER and FEEDBACK VERTEX SET. We also provide lower bounds on the size of the kernels.
comment: 76 pages, 20 figures. An extended abstract presenting the kernelization results of this paper appeared in the Proceedings of STACS 2026, volume 364 of LIPIcs, pages 17:1--17:19. This extended abstract did not contain the lower bounds presented in Section 7 and Appendix B of this version, nor the analysis of the size of the kernels presented in Appendix A
♻ ☆ Solving Zero-Sum Games with Fewer Matrix-Vector Products
In this paper we consider the problem of computing an $ε$-approximate Nash Equilibrium of a zero-sum game in a payoff matrix $A \in \mathbb{R}^{m \times n}$ with $O(1)$-bounded entries given access to a matrix-vector product oracle for $A$ and its transpose $A^\top$. We provide a deterministic algorithm that solves the problem using $\tilde{O}(ε^{-8/9})$-oracle queries, where $\tilde{O}(\cdot)$ hides factors polylogarithmic in $m$, $n$, and $ε^{-1}$. Our result improves upon the state-of-the-art query complexity of $\tilde{O}(ε^{-1})$ established by [Nemirovski, 2004] and [Nesterov, 2005]. We obtain this result through a general framework that yields improved deterministic query complexities for solving a broader class of minimax optimization problems which includes computing a linear classifier (hard-margin support vector machine) as well as linear regression.
comment: FOCS 2025. v2: Minor updates and typo fixes
♻ ☆ Solving Matrix Games with Near-Optimal Matvec Complexity
We study the problem of computing an $ε$-approximate Nash equilibrium of a two-player, bilinear game with a bounded payoff matrix $A \in \mathbb{R}^{m \times n}$, when the players' strategies are constrained to lie in simple sets. We provide algorithms which solve this problem in $\tilde{O}(ε^{-2/3})$ matrix-vector multiplies (matvecs) in two well-studied cases: $\ell_1$-$\ell_1$ (or zero-sum) games, where the players' strategies are both in the probability simplex, and $\ell_2$-$\ell_1$ games (encompassing hard-margin SVMs), where the players' strategies are in the unit Euclidean ball and probability simplex respectively. These results improve upon the previous state-of-the-art complexities of $\tilde{O}(ε^{-8/9})$ for $\ell_1$-$\ell_1$ and $\tilde{O}(ε^{-7/9})$ for $\ell_2$-$\ell_1$ due to [KOS '25]. In both settings our results are nearly-optimal as they match lower bounds of [KS '25] up to polylogarithmic factors.
comment: v2: A few updates to the title, abstract, and intro to reflect the near optimality of our results for $\ell_1$-$\ell_1$ games in light of arXiv:2412.06990 v3. v3: Overhauled Section 3 technical overview. Minor updates and typo fixes in other sections
Graphics 9
☆ Token-to-Token Alignment of Text Embeddings for Semantic Blending
In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.
☆ Semantic Browsing: Controllable Diversity for Image Generation ECCV 2026
Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.
comment: ECCV 2026. Project page: https://saradorfman1.github.io/SemanticBrowsing-webpage/
☆ Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation
Text and image conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. A chair should fit a seating envelope, a prop should leave clearance for motion, or a part should expose a contact surface. Prompts and image views are poor carriers for such constraints, requiring the need for an explicit control interface. We present Arbor, a trainable attachment for text conditioned latent 3D generation. Arbor introduces constraint meshes as a native 3D control interface. The interface uses hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Unlike completion or whole object scaffold control, these meshes are not target evidence. They are local typed requirements and can include regions where no surface should appear. Arbor keeps this signal as geometry by converting constraint meshes into tokens and learning a routed attachment inside a frozen denoiser. Each latent region can therefore receive the part of the constraint that matters for its spatial location. We evaluate Arbor on automatic and artist curated control benchmarks with hull, avoidance, and touch constraints, and compare the metric trends to a user preference study. Even without dedicated compliance losses, Arbor improves constraint obedience while preserving object quality and variation under fixed constraints.
comment: Project Page: https://arbor.jdihlmann.com/
☆ MeshFlow: Mesh Generation with Equivariant Flow Matching SIGGRAPH 2026
Meshes are among the most common 3D scene representations, but directly generating meshes is challenging because the representation contains important symmetries, including permutation invariance of faces and vertices. MeshFlow learns to generate triangle meshes directly as triangle soups, avoiding the need to serialize meshes into long autoregressive sequences. We adopt equivariant optimal-transport flow matching models that respect the key symmetries of triangle soups: arbitrary permutations of faces and permutations of the vertices within each face. Toward this goal, we propose a simple yet effective modification to the Diffusion Transformer architecture, resulting in a scalable network capable of modeling a velocity field while maintaining the desired equivariance. We further introduce an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate these symmetries. MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators while providing about an 18$\times$ speedup during inference. Project page is at https://qiisun.github.io/MeshFlow/.
comment: SIGGRAPH 2026
☆ VolHuMe: a High-Resolution Large Scale Dataset of Volumetric Human Meshes
We introduce VolHuMe, a dataset of high-quality 4D human scans captured with a state-of-the-art volumetric studio using 64 RGB and 32 depth cameras. VolHuMe contains individual captures of 104 subjects and provides extensive ground truth, including SMPL-X, high-resolution meshes, multi-view RGB/depth images, rigged meshes, point clouds, garment segmentation, and detailed hand and facial geometry. Unlike prior datasets that primarily rely on full-body imagery, VolHuMe uses a close-range, high-resolution capture setup that preserves fine-grained body-part details, improving geometric fidelity and texture resolution. We benchmark VolHuMe on state-of-the-art methods across 3D and 4D human reconstruction tasks, showcasing the dataset's quality and exposing the limitations of current evaluation testbeds.
☆ Controllable Texture Tiling with Transformed RoPE-Enhanced Diffusion Models
Realistic integration of user-specified textures into scene images is a fundamental task in computer graphics and image editing. While existing material transfer and reference-guided inpainting methods can edit surface appearances, they often fail to address the specific requirements of texture tiling. This task necessitates precisely repeating a reference pattern according to user-defined parameters such as frequency, orientation, and scale. Furthermore, current generative approaches often struggle to maintain the structural fidelity of the reference texture, limited by either destructive pixel-level resampling or the lack of fine-grained spatial information in semantic image encoders, and they frequently fail to preserve the coherent lighting and geometry of the original scene. In this paper, we propose a novel framework for controllable and high-fidelity texture tiling based on Diffusion Transformers. Our approach introduces two key technical innovations to decouple spatial manipulation from content generation. First, we propose a Coordinate-Transformed Rotary Embedding mechanism. By applying 2D affine transformations directly to the relative positional embeddings between the target latent and the image condition, we achieve precise control over tiling patterns without explicit pixel warping, thereby utilizing the full information of the reference condition without degradation. Second, a Disjoint Attention Mask is employed to shield reference features from semantic leakage. This preserves structural integrity while seamlessly blending the synthesized texture with the scene's original lighting and geometry. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both control accuracy and texture fidelity.
comment: The code and dataset are publicly accessible at https://github.com/junrongh/ControlTile
☆ DJM: Compact Base Meshes for Displacement Mapping using Triangle Jacobians SIGGRAPH 2026
Representing complex geometry as a displacement function defined over a coarse base mesh enables compact storage and accelerated rendering. The core challenge in converting detailed triangle meshes into this representation is computing base meshes that have as few triangles as possible, while also supporting displacement functions that accurately approximate the input. Accurate approximation requires the supported displacement functions to bijectively map the input surface onto the base with low parametric distortion. We observe that this distortion can be measured by evaluating the pointwise Jacobian of the displacement functions. Our new DJM (Displacement Jacobian Metric)-based base-mesh construction method uses the Jacobian of the displacement functions to guide base mesh computation, enabling us to outperform prior approaches in terms of accuracy to size trade-off. We achieve this goal by proposing a variant of the QEM-based simplification scheme that constrains the displacement mapping between the input and the base to be bijective and low distortion (defined as satisfying a lower bound on the mapping Jacobian). When evaluating and encoding the displacement maps, we avoid unreliable ray-mesh intersections by explicitly storing the mapping between the input mesh and the base throughout the construction process, and use this mapping within a robust inverse barycentric displacement solver to obtain dense base-to-mesh correspondences to assist all computations. We demonstrate DJM to outperform alternative schemes in terms of reconstruction accuracy to size trade-off, and demonstrate its robustness and usability for micromesh-based rendering and neural encoding.
comment: Accepted to SIGGRAPH 2026. Project page: https://www.cs.ubc.ca/labs/imager/tr/2026/djm
♻ ☆ Configurable Holography: Towards Display and Scene Adaptation
Rendering holograms for holographic displays is often an iterative and computationally costly process. Emerging learned holography methods have alleviated this bottleneck by enabling fast hologram rendering with improved reconstruction quality. However, existing methods still depend on fixed display hardware and scene parameters, requiring retraining for each new configuration. This limits rapid adaptation to different visual needs, including scene brightness, user focus preference, and hardware compatibility. We introduce Configurable Holography, a learned CGH framework in which a single model adapts to diverse display-scene parameters through explicit conditioning, eliminating the need for retraining. As a prototype, we present a configurable structure and derive a family of models that continuously adapt to propagation distance, volume depth, peak brightness, pixel pitch, and wavelength. To further improve efficiency, we incorporate auxiliary monocular depth estimation for depth-aware 3D hologram synthesis from RGB-only inputs and apply knowledge distillation for interactive inference. Our extensive simulation and hardware experiments on three holographic display prototypes with different combinations of configurations show on-par reconstruction quality with existing methods, offering up to 2x speed-up in fp32. Our work represents an initial step toward flexible, general-purpose learned holography systems that can seamlessly adapt across diverse hardware and user-specific visual requirements.
comment: 27 pages, 29 figures
♻ ☆ Shellular Metamaterial Design via Compact Electric Potential Parametrization
We present a compact yet highly expressive design space for shellular metamaterials that support both interactive exploration and inverse design. With only a few dozen charges, our representation generates a wide family of periodic shells, spanning from simple planar configurations to complex TPMS-like morphologies. To enable rapid evaluation, we introduce an efficient GPU-based homogenization pipeline that computes the effective elastic tensor of a candidate design in near real time (~ 0.4), making interactive shellular design practical. Across a large set of synthesized structures, our design space exhibits geometric diversity and spans a broad spectrum of mechanical responses, covering a wide range of effective material properties. This fast evaluation further enables inverse design for target macroscopic properties. In the low-solid-volume regime, the resulting shellular structures achieve performance competitive with state-of-the-art shell-based metamaterials in multiple material properties. Finally, we validate manufacturability by fabricating tiled prototypes via additive manufacturing, demonstrating the potential of our approach for real-world engineering applications.
Operating Systems 1
☆ Apple Neural Engine: Architecture, Programming, and Performance
The Apple Neural Engine (ANE) is the fixed-function matrix accelerator that has shipped in Apple systems-on-chip since the A11-class iPhone and iPad chips and the M1-class Mac chips, exposed to applications only through the Core ML model framework. This guide reports a reverse-engineered account of the engine, based on direct measurement on Apple silicon and static analysis of the private runtime, compiler, kernel driver, and firmware. It documents the datapath and the roofline that bound the engine's throughput and energy, the dispatch route that reaches it below Core ML, the compiler and on-disk program format, the weight-compression scheme, and the kernel driver, firmware, and command protocol beneath them. The account covers the A11 through A18 and M1 through M5 families, with per-chip target tables and an operation-by-device matrix; the direct measurements are on the M1 and M5. Claims are labeled as measured, decompile-derived, or predicted, and the methodology and open questions are recorded. The direct route is callable from ordinary user space but remains undocumented, unsupported, and version-fragile; it is intended for measurement, research, and on-device work, not for shipping software, where Core ML remains the supported path.
comment: 302 pages, 12 figures. A reference for the Apple Neural Engine
Hardware Architecture 9
☆ Architecture for Health Initiative (Arch4Health): Computational Challenges in Health-Related Applications and the Role of Computer Architecture in Addressing Them
Recent biotechnological advances enable high-throughput, low-cost, and accurate biological data generation. This wealth of data enables unique opportunities for advancing healthcare. Despite these opportunities, efficiently analyzing large-scale biological data poses significant challenges for conventional computing systems. These systems often cannot keep up with the high-throughput rate at which data is generated, and they face additional constraints related to energy efficiency, scalability, privacy, and security. Therefore, to facilitate the wide adoption of recent advances in healthcare, there is a need to optimize the computing systems to enable high-performance, energy-efficient, low-cost, private, and secure analysis of biological data. We introduce the Architecture for Health (Arch4Health) initiative, which aims to (i) identify and analyze key computational challenges in current and future health- and life science-related applications and (ii) explore how computer architects and computing system designers can advance healthcare by addressing these challenges. In this short paper, we first present the motivations behind the Arch4Health initiative and, second, elaborate on its vision and goals, related topics, Arch4Health workshops, and future outlooks.
comment: To appear as a short invited paper in ICS 2026 (Workshops) Proceedings as part of the Arch4Health workshop. No AI or LLM help was used in creating this work
☆ Design and Development of a Neuromorphic Silicon Suite: PVT Sensing, Stochastic LIF Inference, On-Chip STDP Learning, and Crossbar Programming
Edge neuromorphic systems need compact, configurable hardware that combines probabilistic inference, local learning, and an interface to emerging analogue memory. We present four interface-compatible digital IP blocks implemented as standard-cell CMOS on the SkyWater 130 nm process: a process, voltage and temperature (PVT) sensor built from five selectable ring oscillators that also provides a jitter-based true-random-number generator and a frequency-bounds health monitor; a stochastic leaky integrate-and-fire (LIF) neuron with a configurable LFSR, a programmable activation table, and a refractory period; an on-chip spike-timing-dependent plasticity (STDP) controller with a programmable curve and reward-modulated, eligibility-trace, and anti-Hebbian modes; and a memristive-crossbar controller supporting forming, set, reset, read, and automated current-voltage sweep with current-compliance limiting and half-select biasing. All four blocks share a common serial peripheral interface (SPI) register file; the sensor also exposes a parallel readout. Each occupies a single tile at a 50 MHz target. The suite was verified with 99 cocotb tests at register-transfer and gate level (all passing) and taken through an open standard-cell flow, then submitted for tapeout via the Tiny Tapeout shared-silicon programme. Mapped to the open cell library, each block occupies a post-synthesis cell area of 9.3 to 10.6 thousand square micrometres, places at 61 to 70 per cent tile utilisation, meets the 50 MHz constraint with positive setup and hold margin after clock-tree synthesis, and draws an estimated 0.64 to 0.70 mW under a default switching-activity assumption. The contribution is a coherent, openly released set of building blocks unified by one register interface and one verification flow. All results are from simulation and the implementation flow; no fabricated silicon is reported.
comment: 20 pages, 14 figures, 8 tables. Pre-silicon: RTL with OpenLane/OpenROAD implementation on the SkyWater 130 nm open PDK via Tiny Tapeout (shuttle ttsky26a); no fabricated silicon. Source, testbenches, and layout configuration are publicly available
☆ Multi-Level Resistive Synapses for On-Chip Neural Networks: A Physics-Based Design of a Memristive Crossbar Fabric with Quasi-Continuous Conductance States SP
Building on resistive communication, this paper presents a physics-based design of an on-chip neural network with multi-level memristive synapses supporting a dense spectrum of conductance states. Derived from ionic transport physics, we develop a state-variable model and quantify storable sub-levels under thermal noise, drift, and quantized conductance. We assemble these devices into a 1T1R crossbar fabric, derive the linear algebra of analog vector-matrix multiplication (VMM) under wire resistance, and design a differential synapse for signed weights. A multilayer pipeline executes inference, backpropagation, and weight updates physically in the analog domain. We derive the in-situ outer-product learning rule, its discretization onto the conductance lattice, and the resulting quantization noise. We provide energy, area, capacity, and inter-tile models, showing this substrate is ideally suited for large language models (LLMs). Our design eliminates weight movement, surpassing binary ReRAM and traditional CMOS. We detail the material stack (HfO_2-based), the FEOL/BEOL CMOS foundry-integration flow, a self-contained SPICE model, the complete memristive-FPGA neuromorphic system, and an in-memory self-attention engine with current-mode translinear softmax. Finally, a ternary BitNet datapath shows projected per-token efficiency orders of magnitude better than advanced CPUs or GPUs. The result is a self-contained hardware-native blueprint for a high-density, analog, in-memory neural processor.
comment: 22 pages, 14 figures, includes self-contained SPICE compact model validation
☆ Non-Uniform L2 Cache Latency Across the Streaming Multiprocessors of an NVIDIA L40
The NVIDIA L40 exposes a 96 MiB L2 cache usually modeled as one uniform pool with a single hit latency. We show this is wrong at the granularity a kernel sees: L2-hit latency depends strongly and reproducibly on which physical streaming multiprocessor (SM) issues the load. A turn-serialized, %smid-resolved probe maps the hit latency across all 142 SMs in one launch; it is not a constant near 279 cycles but spans 222-339 cycles (a 52% range), with per-repetition noise below 0.01 cycles. An additive model $L = μ+ a(\mathrm{sm}) + b(\mathrm{slice})$ explains $R^2 = 0.87$ (0.98 with one rank-1 term), and the SM term is two-fold symmetric (two halves of 72 SMs at correlation $r = 0.999$), following the AD102 GPC layout. Independent access patterns agree per SM at $r = 1.000$, so the effect is physical. The same probe on a Blackwell RTX 5090 shows it generalizes, while the per-die pattern is device-specific. Read as a fingerprint, a single user-level probe identifies the SM within a device at 92%, and two physically identical L40s are separated at 100% despite near-identical mean latency (per-SM map $r = 0.63$): a per-die hardware identity, not a clock artifact. This is a self-localization and fingerprinting primitive: a kernel reads its own placement and device, not a victim's, and extracts no secret data. The map is stable, unchanged after an hour at full utilization on both devices. As a consequence, distributing latency-bound work by the map cuts makespan by up to 11%. Single-thread capacity, line-tag, prefetch-modifier, and persisting-L2 results appear as controls. The artifact contains seeds, raw observations, the trained model, and regeneration scripts.
comment: 16 pages, 10 figures, 5 tables. Artifact (seeds, raw CSV observations, trained placement oracle, regeneration scripts) provided as ancillary files
☆ Low-power analogue neural networks with trainable nonlinear connections for continuous control
Physical neural networks promise low-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights. Inspired by Kolmogorov-Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element. Realising these functions as analogue band-pass filters on field-programmable analogue arrays, we find that the benefit is task-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets, including robotic kinematics, continuous control, and photovoltaic maximum-power-point tracking, with far fewer nodes and connections than multilayer perceptrons, but offer no parameter-efficiency advantage on classification-like decision boundaries. Trained networks transfer to hardware across approximately 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at approximately 30 microwatts. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, rather than from a particular device.
comment: Preprint. Further verification of all simulations is ongoing. Any resulting corrections will be incorporated in a revised version
☆ NeutronSparse: Coordinating Heterogeneous Engines for Sparse Matrix Multiplication on NPUs SIGMOD2027
Sparse matrix-matrix multiplication (SpMM) is a fundamental data operation for large-scale sparse data processing. With NPUs increasingly deployed in data centers for their performance and energy efficiency, accelerating SpMM on these platforms is a natural choice. However, high-performance SpMM on NPUs poses a data management challenge, as irregular sparsity demands efficient data organization and scheduling. On Ascend 910B, the official MindSpore implementation achieves only 36.3% of the performance of GPU-based sparse libraries such as cuSPARSE on NVIDIA A100. To this end, we conduct an in-depth architectural analysis of SpMM execution on NPUs versus GPU and identify that the key performance bottleneck for SpMM on NPUs lies in the lack of efficient coordination across heterogeneous compute units under tile-based execution model. Therefore, we propose NeutronSparse, a coordination-first SpMM framework for NPUs. NeutronSparse integrates two key techniques: (i) Sparsity-aware coordination of heterogeneous engines, which adaptively partitions and balances workloads between heterogeneous compute units to keep them busy, and (ii) Locality-aware tile orchestrating, which reorganizes and reuses data tiles to reduce redundant computation and memory movement overhead. Evaluations on Ascend 910B show that NeutronSparse achieves 1.26x-7.78x speedup over NPU baselines and 1.03x-3.07x speedup over leading GPU libraries on NVIDIA A100, revealing untapped potential of NPUs for sparse computation.
comment: 28 pages 27 figures, SIGMOD2027
☆ DejaVu: Why You Should Write to Your DRAM Rows Twice, Carefully
We provide the first experimental demonstration of DejaVu, a phenomenon where the data previously written to DRAM cells affects DRAM's vulnerability to read disturbance. Our experimental characterization using 112 COTS DDR4 DRAM chips from all three major manufacturers shows that, compared to the baseline where we initialize the victim row by writing to it only once, 1) overwriting it with the opposite data reduces ACmin, the minimum aggressor row activation count to induce a bitflip, and 2) writing the same data twice increases ACmin. We provide two hypotheses to explain DejaVu. First, we hypothesize that overwriting the victim row with opposite data values causes under-restoration of charge in DRAM cells. Second, we hypothesize that overwriting the victim row changes charge trap states in the active region, affecting read-disturbance-induced cell leakage current. We conduct controlled characterization to provide insight into these hypotheses. We further characterize the reliability of Processing-Using-DRAM (PUD) operations with DRAM rows initialized with DejaVu patterns. Our characterization of 32-row MAJ-3 operation shows that overwriting the DRAM rows used in the operation reduces the number of bitlines that fail to reliably perform MAJ-3 by 32.7% on average compared to the baseline where rows are written only once. Based on our observations, we describe two major implications of DejaVu. We show how DRAM testing and characterization methodologies should account for DejaVu to accurately characterize read disturbance vulnerability under fixed data patterns and rigorously study data-pattern effects without unintended interference from DejaVu. We also evaluate the performance overhead of read disturbance mitigation techniques when thresholds need to be lowered to be secure against DejaVu, showing a 6.3% overhead when reducing the threshold by 20%.
☆ Apple Neural Engine: Architecture, Programming, and Performance
The Apple Neural Engine (ANE) is the fixed-function matrix accelerator that has shipped in Apple systems-on-chip since the A11-class iPhone and iPad chips and the M1-class Mac chips, exposed to applications only through the Core ML model framework. This guide reports a reverse-engineered account of the engine, based on direct measurement on Apple silicon and static analysis of the private runtime, compiler, kernel driver, and firmware. It documents the datapath and the roofline that bound the engine's throughput and energy, the dispatch route that reaches it below Core ML, the compiler and on-disk program format, the weight-compression scheme, and the kernel driver, firmware, and command protocol beneath them. The account covers the A11 through A18 and M1 through M5 families, with per-chip target tables and an operation-by-device matrix; the direct measurements are on the M1 and M5. Claims are labeled as measured, decompile-derived, or predicted, and the methodology and open questions are recorded. The direct route is callable from ordinary user space but remains undocumented, unsupported, and version-fragile; it is intended for measurement, research, and on-device work, not for shipping software, where Core ML remains the supported path.
comment: 302 pages, 12 figures. A reference for the Apple Neural Engine
♻ ☆ WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network
With the advent of system-in-package (SiP) chiplet-based design and heterogeneous 2.5D/3D integration, thermal-induced warpage has become a critical reliability concern. While conventional numerical approaches can deliver highly accurate results, they often incur prohibitively high computational costs, limiting their scalability for complex chiplet-package systems. In this paper, we present WarPGNN, an efficient and accurate parametric thermal warpage analysis framework powered by Graph Neural Networks (GNNs). By operating directly on graphs constructed from the floorplans, WarPGNN enables fast warpage-aware floorplan exploration and exhibits strong transferability across diverse package configurations. Our method first encodes multi-die floorplans into reduced Transitive Closure Graphs (rTCGs), then a Graph Convolution Network (GCN)-based encoder extracts hierarchical structural features, followed by a U-Net inspired decoder that reconstructs warpage maps from graph feature embeddings. Furthermore, to address the long-tailed pattern of warpage data distribution, we developed a physics-informed loss and revised a message-passing encoder based on Graph Isomorphic Network (GIN) that further enhance learning performance for extreme cases and expressiveness of graph embeddings. Numerical results show that WarPGNN achieves more than 205.91x speedup compared with the 2-D efficient FEM-based method and over 119766.64x acceleration with 3-D FEM method COMSOL, respectively, while maintaining comparable accuracy at only 1.26% full-scale normalized RMSE and 2.21% warpage value error. Compared with recent DeepONet-based model, our method achieved comparable prediction accuracy and inference speedup with 3.4x lower training time. In addition, WarPGNN demonstrates remarkable transferability on unseen datasets with up to 3.69% normalized RMSE and similar runtime.
comment: Accepted to IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) 2026
Programming Languages 1
☆ Compositional Generator Equivalence (Extended Version)
Property-based testing (PBT) is a powerful technique for software verification that relies on random input generators and "shrinking" processes to find and minimize counterexamples to executable specifications called properties. While optimizing these generators is crucial for testing efficiency, formally justifying such optimizations is currently difficult because existing languages lack a compositional semantics that is coarse-grained enough for high-level reasoning. In this paper, we first provide a formal account of the syntax and semantics of Hedgehog, a popular PBT framework. We demonstrate that Hedgehog's distribution semantics - which models how users typically reason about generators - is non-compositional. Furthermore, we prove that any sound and complete compositional semantics for Hedgehog must necessarily be equivalent to its sampling semantics, which is too fine-grained to justify common program optimizations. To resolve this dilemma, we introduce Hedgehog$^\rightarrow$, a restricted version of the language based on the arrow calculus, and prove that Hedgehog$^\rightarrow$ possesses a compositional distribution semantics. We evaluate Hedgehog$^\rightarrow$ through a Haskell implementation and show that it remains expressive enough to capture generators of practical interest, while providing the formal foundation needed for compositional generator equivalence proofs.
comment: 30 pages. 19 figures. 1 table. To be published in ICFP 2026
Data Structures and Algorithms 5
☆ Modular Rank and Linear-Complexity Tests for Pseudorandom Number Generators
Standard batteries of tests for pseudorandom number generators (such as dieharder, the NIST suite, and TestU01) provide two empirical tests for linearity, the binary rank and linear-complexity tests. Both operate over the field $\mathbf F_2$, and thus detect generators that are linear over $\mathbf F_2$. However, generators can be linear over a larger field, as in the case of congruential generators, single-modulus multiple-recursive recurrences, and of matrix generators such as MIXMAX. We introduce a modular version of the rank and linear-complexity tests, and provide modlin, a Rust program that implements it efficiently for fields of prime size. modlin can detect in minutes statistical bias in all current CERN's ROOT's implementations of the MIXMAX generator, for which no standard statistical test failure has been reported before.
☆ Spectral Gap for the Binary Fixed-Margin Swap Chain
We prove an inverse-polynomial spectral-gap bound for the lazy swap chain on binary matrices with prescribed row and column sums. This chain is a standard sampler for fixed-margin null models in ecology, statistics, and network analysis, and its rapid mixing for arbitrary feasible margins was conjectured by Kannan, Tetali, and Vempala in 1997. We show that for every feasible set of margins on an $m\times n$ binary matrix, the lazy swap chain has spectral gap at least $$ \binom{m}{2}^{-1}\binom{n}{2}^{-1}, $$ which is tight in the worst case. The proof compares the swap chain with a two-row heat-bath chain, reduces the analysis from arbitrary $m\times n$ matrices to the case of three rows, and proves the resulting three-row inequality by decomposing functions according to the column-count variable and the associated Johnson harmonic sectors. The proof itself was generated by ChatGPT 5.5 Pro. ChatGPT proposed the whole proof strategy, including the comparison with the two-row heat-bath chain, the reduction to the three-row case, and the decomposition of the three-row function space into the count sector and the Johnson harmonic sectors. It also generated all the technical lemmas and initial proofs. The author's role was to pose the problem, guide the search direction, evaluate the AI-generated arguments, rewrite the proof, and take responsibility for the final form and validity of the result.
☆ Submodular Welfare Maximization with Budget Constraints in the Random-Order Model
We study an online item-allocation problem with budgets and a submodular objective. A set of $m$ agents is known in advance, and each agent $j$ has a known budget. A set of $n$ items arrives over time in a uniformly random order. When item $i$ arrives, its cost $c_{i,j}$ for each agent $j$ is revealed, and the algorithm must irrevocably assign $i$ to an agent without violating any budget constraint. The goal is to maximize a monotone submodular function defined over all possible assignments $[n] \times [m]$. At the time of decision, the algorithm has only oracle access to this submodular function restricted to items seen so far. This model subsumes welfare maximization with submodular valuations, agent-specific item costs, and agent-specific budgets. We measure the performance of an algorithm by its competitive ratio, i.e., the worst-case ratio between the algorithm's expected value and that of the offline optimum, which knows all item costs and the full submodular function in advance. Prior work only considered the case of a single agent and achieved a $1/54.4$-competitive algorithm. We generalize and improve this result to a polynomial-time $α$-competitive algorithm with $α\approx 1/14.85$ for an arbitrary number of agents. We also study the special case in which all item costs and all budgets equal $1$, which yields an online submodular matching problem. Prior work achieved a polynomial $1/9.66$-competitive algorithm for this problem; we improve this to a factor of $1/6.86$. Both our algorithms rely on repeatedly computing $(1 - 1/e)$-approximations of the multilinear extensions of offline variants of the subproblems. If super-polynomial runtime is allowed, these subproblems can be solved optimally, and our competitive ratios improve by this factor.
♻ ☆ The Min Max Average Cycle Weight Problem
When an old apartment building is demolished and rebuilt, how can we fairly redistribute the new apartments to minimize envy among residents? We reduce this question to a combinatorial optimization problem called the *Min Max Average Cycle Weight* problem. In that problem we seek to assign objects to agents in a way that minimizes the maximum average weight of directed cycles in an associated envy graph. While this problem reduces to maximum-weight matching when starting from a clean slate (achieving polynomial-time solvability), we show that this is not the case when we account for preexisting conditions, such as residents' satisfaction with their original apartments. Whether the problem is polynomial-time solvable in the general case remains an intriguing open problem.
comment: Fixed calculation mistakes in the table. The question is still open
♻ ☆ Modern Primal-Dual Frameworks for Prior-Free Online Resource Allocation
Linear-programming (LP)-based primal-dual methods are fundamental for designing and analyzing algorithms in adversarial (prior-free) online resource allocation. This chapter provides a tutorial on two modern primal-dual frameworks, emphasizing recent developments and contemporary models in operations research. Part~I develops an LP-based convex-programming framework where solving a regularized convex program at each arrival captures the tradeoff between greediness and hedging, yielding a dual certificate via Karush-Kuhn-Tucker (KKT) conditions. Because standard LP relaxations can be weak or intractable for stochastic outcomes, Part~II introduces a complementary LP-free framework that provides a universal certificate system for evaluating competitive ratios under such uncertainty. Covering a wide array of models -- including online vertex-weighted bipartite matching, edge-weighted online matching with free disposal, online matching with stochastic rewards, reusable resources, two-sided assortment optimization, configuration allocation (whole-page optimization), AdWords, and costly cancellations -- the tutorial equips readers with versatile proof templates to analyze existing algorithms and develop new solutions for emerging applications.
comment: Chapter in INFORMS Tutorial (2026)
Graphics 4
☆ Illuminating English Letters Using a Flying Light Speck
This paper presents the design and implementation of a Flying Light Speck (FLS) to illuminate English letters. The FLS uses its onboard camera and computing to localize and follow a trajectory to illuminate a letter. We evaluate the illuminations quantitatively and qualitatively. The latter is based on an IRB approved human subject study with 20 participants. The obtained results show a 42 to 56 millimeter error that impacts the detection of letters. A key finding is that the order in which the illumination of letters is presented to subjects has a significant effect on detection duration.
comment: Appeared in Proceedings of the 3rd International Workshop on UAVs in Multimedia: Capturing the World from a New Perspective (UAVM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 5 pages
☆ Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation
Few-shot prompt learning is an effective strategy for adapting CLIP to downstream tasks, but class-only prompt optimization can overfit base-class supervision and weaken transfer to unseen classes. We propose Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework that anchors learnable class prompts to frozen concept-level text prototypes without updating CLIP encoders. CCPL learns a set of shared context tokens, instantiates class prompts by appending class names, and constructs frozen concept prototypes from a class-level concept bank. During training, a text-space cosine consistency objective aligns learnable class-prompt embeddings with frozen concept prototypes; concept dropout provides additional regularization against over-reliance on fixed concept lists. At inference, CCPL optionally fuses class-prompt logits with concept-prototype logits using a controllable ensemble weight alpha. Our default configuration uses text-space concept regularization lambda = 0.5, concept dropout p = 0.3 and weak concept-guided fusion (alpha = 0.1), with no KL-based prediction consistency term. Experiments under identical automatically-generated fallback splits show that CCPL improves the base-to-new harmonic mean on DTD (+0.6) and EuroSAT (+2.9) compared with CoOp, while remaining near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, while the best concept-guided inference strength is dataset- and protocol-sensitive. These results suggest concept constraints are most effective when concept prototypes align naturally with dataset semantics, and identify fine-grained categories as a current boundary condition. The code is released at: https://github.com/richael-sang/concept-constrained-prompt-learning.
☆ Line Drawings using LightBenders: Authoring and Illuminating
This study presents the hardware and software architecture of a transformative system for illuminating line drawings and letterforms. These mid-air illuminations are indoors and might be animated. The hardware contribution is a drone equipped with servo-actuated rod joints and a dense, addressable LED strip that enables arbitrary orientation, a LightBender. The software contributions are threefold. First, the system implements algorithms and heuristics to estimate the minimum number of LightBenders required to render a line drawing or letterform, stagger swarm formations to mitigate LightBender downwash, generate Swarm Flight and Lighting (SFL) files, and execute these files using a swarm of LightBenders to illuminate line drawings and letterforms. Second, a Blender add-on enables users to register LightBenders, author graphics and animations represented by swarms of LightBenders, and deploy the swarm for illumination through one-click functions. Third, users may import SVG files into either the Blender add-on or a standalone LB-Author tool to illuminate line drawings directly from vector graphics. We present results from an IRB-approved human subject study (n=21) to evaluate the impact of LightBender misalignment on the perceived illuminations. Obtained results demonstrate that the system's 10.1 mm maximum misalignment is perceptually acceptable across tested illuminations, with a median quality rating of 8 on a 0-10 scale.
☆ Lighting-Consistent Object Transfer Across Radiance Fields
3D Gaussian Splatting (3DGS) is widely used to capture and render real scenes. Compositing objects from one capture into another has applications in many domains, such as VFX, architecture and interior design, or marketing. However, extracting an object from a source scene and naively pasting it into a target scene will fail to produce realistic results due to the different lighting conditions between the two scenes. To address this problem, we introduce a diffusion model that harmonizes naively composited images with inconsistent lighting. The model is trained with a heterogeneous dataset of image pairs (inconsistent composite input, consistent output), combining synthetic, generated, and real data. Our complete 3D solution allows a user to extract an object from the source scene and composite it into the target scene. From this, the (inconsistent) views of the target scene with the composite object are rendered. Our diffusion model harmonizes each one of these views, which are finally consolidated in a 3DGS representation with a post-optimization step. Our method provides visually compelling results, making object transfer between 3DGS easy to use and significantly improving quality compared to previous methods.
comment: Project page: https://repo-sam.inria.fr/nerphys/dot3d
Hardware Architecture 9
☆ 2.5D Root of Trust: Securing the Chiplet Ecosystem SC
The semiconductor industry is rapidly transitioning from monolithic systems-on-chip toward heterogeneous, multi-vendor 2.5D chiplet ecosystems integrated via silicon interposers. While this paradigm shift offers immense benefits in yield, cost, and time-to-market, it radically expands the attack surface. Integrating chiplets from untrusted foundries and design houses introduces vulnerabilities to hardware Trojans, IP piracy, and system-level communication exploits. Critically, chip-level security features and conventional Root of Trust (RoT) proposals are insufficient in this context: any component, including the interconnect fabric itself, may be sourced from an untrusted vendor. This perspective paper surveys state-of-the-art security strategies for interposer-based 2.5D integration, focusing on three threat categories: interconnect attacks (snooping, spoofing, and man-in-the-middle), cache coherence exploits including complex forging attacks, and microarchitectural side-channel threats. We examine design-time defenses via 2.5D split manufacturing and, more crucially, runtime defenses that establish an active interposer as a physically isolated 2.5D RoT. By embedding so-called transaction monitors and coherence message checkers within the trusted interposer fabric, the system enforces memory access permissions by construction and neutralizes coherence-level attacks without need for modifying/securing the commodity chiplets. Finally, we review the EDA flows required to realize these defenses and show they concurrently improve power and signal integrity while reducing overall system footprint.
comment: MWSCAS 2026
☆ Shuttling in Bidimensional Segmented Ion-Trap Quantum Processors with T-Junctions
Shuttle-based trapped ion quantum processors typically employ a one-dimensional (1D) linear architecture to transport ion-qubits between one ore more laser interaction zones where the quantum gates are implemented, along with several qubit register storage segments. The two-dimensional (2D) quantum CCD architecture employs also T- or X-junctions for an improved scaling and efficiency. Here, we explore the shuttling layer in the compilation of quantum algorithm typical building blocks in such architecture. To weight the effort of linear shuttle and junction shuttle, we introduce individual cost functions for each operation. This allows comparing the total cost for quantum circuit building blocks such as the QFT, Carry, Adder, Shift, and Comparator circuits. We study their scaling properties with increased qubit numbers. At equivalent transport cost for junction and linear shuttling, we show that 2D architectures outperform the 1D linear trap with the ratio improving as the number of ions increases. Finally, we discuss the use of cells, such that the entire processor is constructed from a 2D array of such interconnected cells. The work aims to optimize quantum processor architectures, implementing a co-design that fits to the specific task and scaling up in a shuttle-efficient way.
☆ AgentDSE: Reasoning-Augmented Architectural Design Space Exploration ISCA 2026
Traditional architectural design space exploration (DSE) is highly inefficient, typically requiring tens of thousands of simulator evaluations across various optimization methods. This inefficiency arises because conventional methods treat the simulator as a black-box oracle. In contrast, human architects effectively guide exploration by reasoning through physical constraints, performance bottlenecks, data reuse, and workload structures. To bridge this gap, we introduce AgentDSE, a simulator-in-the-loop methodology driven by a general-purpose large language model (LLM) coding agent. AgentDSE automates this architectural-reasoning loop without requiring model fine-tuning, precomputed design databases, or domain-specific optimizer code. Across deep neural network (DNN) accelerator mapping, hardware/software co-design, and CPU cache-hierarchy optimization, AgentDSE achieves competitive or better design quality with up to two orders of magnitude fewer evaluations. AgentDSE also produces inspectable traces that surface architectural hypotheses, performance cliffs, implicit priors, and simulator artifacts, making every search decision traceable rather than buried in optimizer state.
comment: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026
♻ ☆ Photonic Quantum Computing on Spin Memory Architecture with Tree-Encoded Fusion
Photonic quantum computing provides a promising route toward quantum computation by naturally supporting the measurement-based quantum computation (MBQC) model. In MBQC, programs are executed through measurements on a pre-generated graph state, whose construction largely depends on probabilistic fusion operations. However, fusion operations in PQC are vulnerable to two major error sources: fusion failure and fusion erasure. As a result, MBQC compilation must account for both error mechanisms to generate reliable and efficient photonic executions. Prior state-of-the-art MBQC compilation, represented by OneAdapt, is designed for all-photonic architectures and mainly focuses on handling fusion failures. Nevertheless, it does not explicitly model fusion erasures induced by photon loss, which can be substantially more damaging than fusion failures. To mitigate fusion erasure errors, we introduce a new MBQC compilation scheme built upon the spin qubit quantum memory. We propose tree-encoded fusion, an encoding strategy that suppresses erasure errors during graph-state generation. We further incorporate this scheme into a compiler framework with algorithms that reduce the execution overhead of quantum programs. We evaluate the proposed framework using a realistic PQC simulator on six representative quantum algorithm benchmarks across multiple program scales. The results show that tree-encoded fusion achieves better robustness than alternative fusion-encoding strategies, and that our compiler provides exponential improvement over OneAdapt. In addition, we validate the feasibility of our approach through a proof-of-concept demonstration on real PQC hardware.
♻ ☆ Ramulator 2.1: A Composable Memory System Simulator for Modern DRAM Systems
Ramulator 2.1 is a major overhaul of Ramulator 2.0 that substantially improves the simulator in three directions: 1) support of modern and emerging DRAM and memory-controller features, 2) better usability and extensibility of the simulator, and 3) more comprehensive tests and validation workflows. Ramulator 2.1 adds support for advanced features in recent and emerging DRAM standards and memory controllers, including HBM3/4, LPDDR5/6, and GDDR7. To improve usability and extensibility, Ramulator 2.1 introduces a Python-based modeling and configuration interface backed by a two-way code-generation framework that 1) hides low-level C++ code behind high-level DRAM specifications written in Python, and 2) automatically creates Python proxies for all components of the simulator. Doing so enables users to rapidly create variants of DRAM standards and automate design-space-exploration workflows. To improve trustworthiness in simulation results, Ramulator 2.1 provides a comprehensive testing and validation infrastructure that covers both 1) fine-grained validation of specific DRAM timing constraints and memory-controller scheduling behavior, and 2) system-level performance evaluation using latency-throughput curves. To aid performance analysis and debugging, Ramulator 2.1 also includes an easy-to-use and high-performance DRAM command trace visualizer. Ramulator 2.1 is open-source on GitHub and under active development.
♻ ☆ A comparative study on power delivery aspects of compute-in/near-memory approaches using DRAM
Compute-in-memory (PIM) mitigates the memory wall by performing computation within memory, reducing data movement and improving energy efficiency. DRAM-based PIM is particularly attractive due to its high density, mature manufacturing ecosystem, and compatibility with existing systems. Recent works exploit multiple levels of the DRAM hierarchy - including subarrays, banks, and 3D-stacked organizations - to enable in-memory computation using mechanisms such as multi-row activation, row-buffer operations, and near-bank compute units. However, these approaches introduce non-traditional current demand patterns that challenge the power delivery network (PDN). This paper surveys PDN challenges in DRAM-based PIM systems and proposes a unified taxonomy that characterizes PIM-induced current behavior along temporal (burst vs. sustained) and spatial (localized vs. distributed) dimensions. Using this framework, we analyze how representative PIM techniques stress the PDN through bursty activations, multi-row concurrency, and large-scale parallel execution, leading to voltage droop, IR drop, and thermal hotspots. We further discuss DRAM-specific mitigation strategies leveraging existing architectural and circuit-level mechanisms, including timing constraints, memory controller scheduling, data placement, and bank- and vault-level power management. This survey highlights the importance of PDN-aware design for scalable and reliable DRAM-based PIM systems and outlines key future research directions.
♻ ☆ A detailed algorithmic study on a reuse-aware, near memory, all-digital Ising machine
Recently, nature-inspired computing approaches have gained significant attention for solving difficult optimization problems, particularly through Ising machines for NP-complete applications. Existing Ising accelerators range from quantum and optical annealers to CMOS-based von-Neumann and in-memory architectures. However, many prior designs are specialized accelerators limited to specific problem classes, rely on ADC/DAC circuits, and suffer from reliability challenges due to process-variation-sensitive embedded memory technologies. This paper presents SACHI, an all-digital Ising architecture implemented by repurposing the L1 cache of a CPU using SRAM-based processing-in-memory techniques. SACHI eliminates the need for ADCs/DACs, improves reliability compared to prior approaches such as BRIM, and enables Ising acceleration with minimal hardware overhead integrated into the CPU pipeline. The paper also provides detailed architectural analysis and pseudo-code for the proposed algorithms. The key contributions of SACHI are: (i) tight integration of the accelerator with the CPU pipeline, (ii) reuse of existing cache hardware for acceleration, (iii) higher parallelism enabled through reuse-aware computation, and (iv) improved performance and energy efficiency for large-scale, high-precision optimization problems using novel compute and mapping strategies. Compared to BRIM, SACHI achieves 300x performance improvement and 80x energy reduction across applications including asset allocation, molecular dynamics, image segmentation, and traveling salesman problems. Additionally, reuse factors up to 4000x are observed for several workloads. This work demonstrates that reliable and efficient all-digital Ising acceleration can be achieved using commodity SRAM structures tightly integrated with general-purpose processors.
♻ ☆ A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery due to their ability to apply machine learning techniques on graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. These heterogeneous memory behaviors make conventional CPU- and GPU-based execution energy inefficient due to substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from scalability limitations, area overheads, restricted parallelism, and energy inefficiencies associated with analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable DAC/ADC-less processing-in-memory architecture for graph neural network acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation using a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results demonstrate that NEM-GNN achieves approximately 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density compared to prior state-of-the-art approaches.
♻ ☆ A comprehensive study on ILP acceleration accounting for sparsity, area, energy, data movement using near-memory architecture
Integer Linear Programming (ILP) is widely used for solving real-world optimization problems, including network routing, map routing, and traffic scheduling. However, ILP algorithms are sparse and branch-intensive, making them inefficient on conventional CPUs and GPUs. Prior work has shown that large-scale ILP problems can require tens of hours of execution time even on massively parallel systems, limiting their applicability to time-sensitive decision-making workloads. Existing ILP solvers such as Gurobi employ software-level optimizations to handle sparsity on CPUs, but still face throughput limitations. GPU-based ILP solvers are also constrained because GPUs are not well suited for sparse and branch-heavy workloads, leading to thread divergence, under-utilization of streaming multiprocessors, and frequent host-device interactions. This paper presents SPARK, a sparsity-aware, reuse-aware, energy-efficient, reconfigurable near-cache ILP accelerator. SPARK repurposes the existing L1 cache in CPUs to provide near-cache acceleration with minimal hardware overhead of approximately 1.4\% of the CPU area. The architecture performs near-cache sparsity detection and sparsity-aware computation to reduce insignificant computations and data movement energy. SPARK also exploits computational reuse patterns in ILP algorithms to improve parallelism and efficiency. The proposed design supports both sparse and dense ILPs as well as Linear Programs (LPs). Evaluations on real-world workloads from MIPLIB 2017 show that SPARK achieves up to 15x and 20x performance improvement, and up to 152x and 740x energy reduction compared to AMD Zen3 CPUs and NVIDIA Tesla V100 GPUs, respectively, for sparse ILPs. For sparse LPs, SPARK achieves 7-17x performance improvement and 103-250x energy reduction over CPU and GPU baselines, demonstrating the broad applicability of the proposed architecture.
Programming Languages 4
☆ Powerdomains and nondeterminism in synthetic domain theory
Synthetic domain theory is an axiomatization of domain theory within a constructive universe of sets such that all definable maps between domains are continuous. In this paper we construct the counterparts to the well-known lower, upper, and convex powerdomains in the setting of synthetic domain theory and prove that they produce computationally adequate denotational models of nondeterminism. By developing the theory of powerdomains in synthetic domain theory, we obtain a nondeterministic metalanguage that directly embeds into dependent type theory, where the latter serves as an expressive logic for reasoning about the metalanguage. Moreover, the computational adequacy results imply that denotational reasoning through the metalanguage may be used to study operational behaviors of actual programs.
☆ Analyzing the Analyzers: Model Counting Meets Abstract Interpretation
Abstract interpretation offers a principled foundation for static analysis by approximating concrete program semantics via abstract domains. However, quantitatively comparing the precision of different domains remains a longstanding challenge. We present MCAI (Model Counting meets Abstract Interpretation), a new methodology that employs model counting to measure the precision of abstract domains. Unlike prior approaches that assess precision relative to specific analysis queries, MCAI encodes both concrete semantics and abstract values as logical formulas, enabling a client-independent, quantitative metric of imprecision that captures the inherent semantic loss in the abstractions. We apply MCAI to four abstract domains and evaluate the precision of their best abstract transformers via symbolic abstraction. Our results yield several insights: the Interval domain, despite its simplicity, often achieves precision comparable to that of Octagon; many octagonal constraints are semantically redundant; and the bit-level KnownBit domains consistently outperform the word-level domains. MCAI offers both theoretical insights into the precision of abstract domains and practical guidance for selecting suitable abstractions.
☆ Shared-Context Batched Satisfiability
Program analyzers often issue batches of SMT queries that share a large symbolic context and differ only in a small predicate. We formalize this recurring pattern as \emph{Shared-Context Batched Satisfiability}: given a formula $\varphi$ and predicates $P$, determine whether $\varphi \land p$ is satisfiable for each $p \in P$. We study three theory-agnostic strategies for this problem: predicate-by-predicate checking, disjunctive over-approximation, and Core-Literal Filter (CLF), a new algorithm that learns literals inconsistent with $\varphi$ and uses them to reject later predicates. Our evaluation on symbolic abstraction and active property checking shows that no strategy dominates universally: over-approximation is fastest on solved symbolic-abstraction queries, while CLF increases the number of solved hard instances and is fastest on active property checking. We advocate treating shared-context batched satisfiability as a first-class primitive in design program analyzers and exploring the algorithmic design space more systematically.
☆ CNnotator: LLM-Guided Memory Safety Annotation Synthesis ICSE 2026
Memory safety errors account for a large proportion of security bugs in systems written in C; modern languages such as Java and Rust prevent such bugs because they are memory-safe by design. To migrate systems to safer languages or identify memory errors, we must first determine how legacy code manipulates memory. This information is only represented implicitly in such code. In many cases, memory usage patterns are merely tedious for humans to figure out, rather than truly difficult. In this work, we ask if large language models (LLMs) can perform this task by having them synthesize annotations representing memory usage as specifications in CN, a hybrid testing/verification tool. Our tool, CNnotator, uses LLMs to automatically generate and test CN specifications. We find that current models are able to generate CN specifications for small-to-medium C programs, with the OpenAI o3 reasoning model achieving a 90% success rate on first attempts and 97% overall success, while the chat model GPT-4o correctly annotates 65% of first attempts. These results suggest AI-assisted annotation is becoming practical for real-world C codebases.
comment: 6 pages. Published at ReCode 2026 (1st Workshop on Code Translation, Transformation, and Modernization), co-located with ICSE 2026. This version corrects the description of the property-based testing backend (Bennet, built on Fulminate) relative to the published version
Data Structures and Algorithms 10
☆ Service-Cut Certificates for Aligned Eviction in Tiered Cache Networks
In a tiered cache, eviction is a graph decision: removing one aligned storage block can disconnect downstream demand that never addressed that block directly, so request recency alone cannot price the action. This paper studies aligned eviction as a vertex-separation problem and gives a selection rule whose decisions carry independently checkable service-cut evidence. For every candidate block, it computes the exact weighted downstream demand cut, rejects actions that disconnect protected demand, and selects the minimum-impact admissible eviction. Reclamation is characterized as vertex separation: minimum-location reclamation reduces to node-capacitated flow, while minimum aligned block actions are NP-complete. In two-hop cache networks, one streaming pass evaluates every candidate impact; a matching adversarial construction proves that a history-only victim selector has unbounded one-step damage. The packet-scale implementation combines a seed-indexed exact-cardinality residency structure with collision-aware, 32-bank impact counters. Replay compression makes the result auditable: counter intervals reproduce the stream, exact monoid summaries retain every reported additive statistic, and a counting lower bound quantifies the state required by any exact all-candidate summary. A 144-scenario evaluation processes 582.90 trillion packets (404.86 PiB of simulated payload), validates the coordinate expectations, and exposes a zero-impact extreme-value transition near $Nζ=\log m$. Complete impact vectors, decoded audit samples, telemetry, and logs remain within the ancillary-file budget. Finally, invalidation is monotone replicated state: fair asynchronous delivery converges without coordination, with a diameter bound under synchronous full-edge rounds. The architecture therefore binds capacity reclamation, path continuity, and distributed invalidation to one certifying interface.
comment: 23 pages, 1 table, no figures. Ancillary files (anc/): CUDA source and benchmark binary, reproducible exact per-block impact vectors, decoded audit samples, GPU telemetry, build and run logs, and an independent verifier
☆ Efficient Algorithms for Influence Maximization in General Models and Observed Cascades
We study influence maximization in general stochastic models, the observed cascades model, and the independent cascade (IC) model. For general stochastic models with only black-box sample access, we introduce a low-adaptivity optimization framework that improves sample complexity and running time over Sadeh et al. (2020) and is instrumental to all our results. We further introduce an adaptive algorithm guided by empirical variance, avoiding pessimistic worst-case bounds. Combining our optimization framework with sketching, we obtain the first algorithm with provable guarantees and nearly-linear running time for influence maximization on observed cascades, optimal up to logarithmic factors. For IC, we prove a novel tail bound replacing a factor $n$ with $τ$ (the number of diffusion steps) in sample complexity, improving over prior work when $τ$ is small, as is common due to small-world phenomena. Experiments confirm substantial speedups while maintaining solution quality.
☆ Optimal Macroitem Sequences in the Precedence Constrained Knapsack Problem
The Precedence Constrained Knapsack Problem (PCKP) asks for a maximum-profit subset of items, subject to a knapsack capacity constraint and precedence constraints encoded by a directed acyclic graph. We study the structure of optimal solutions of the Linear Programming (LP) relaxation of the natural Integer Linear Programming formulation of the PCKP. We introduce the notion of macroitem and of feasible sequence of macroitems, which partitions the item set while respecting the precedence structure. We establish that an optimal LP solution is fully characterized by the optimal sequence of macroitems: items are packed in nonincreasing order of the profit-to-weight ratio of their macroitem, with at most one macroitem fractionally included. We further show that the breakpoints of the parametric Lagrangian function of the capacity constraint coincide with the profit-to-weight ratios of the macroitems in the optimal sequence, and provide a complete combinatorial characterization of optimal dual solutions in terms of a feasible flow within each macroitem. Finally, for the special case in which the precedence graph is a forest, we devise an O(n^2) algorithm to compute the optimal sequence, which improves to O(n log n) for in-trees or out-trees, where n denotes the number of items.
☆ Freeze-Tag with Return
In the standard Freeze-Tag Problem (FTP), an initially awake robot (the source) is in charge of waking up a swarm of sleeping robots by moving towards them, given that all the awake robots can participate in the awakening process. The goal is to minimize the makespan to wake up all robots assuming they move at unit speed. In this paper we introduce the Freeze-Tag-with-Return Problem (FTRP) variant, where the robots must eventually return to their initial positions. In the Euclidean plane with $n$ sleeping robots lying on the unit disk centered at the initial position of the source, we show a non-trivial relationship between FTP and FTRP by proving that the difference between the optimal makespan of both problems never exceeds $1.959$, and is at least $1.732$ in the worst-case. We also present several upper and lower bounds on the optimal makespan. In particular, we show that if the sleeping robots are in convex positions, then the optimal makespan is at most $2 + 2\sqrt{2}$, which is achieved by some instances. From an algorithmic point-of-view, we present single-exponential algorithms for general distance functions. In metric spaces, these algorithms are asymptotically optimal under the ETH, which we show via an NP-hardness reduction on unweighted graphs.
☆ On the Curse of Dimensionality in Private Sparse Covariance Estimation and PCA
We study high-dimensional differentially private (DP) covariance estimation in the operator norm, and principal component analysis (PCA), under $k$-row-column sparsity ($k$-RCS) of the covariance matrix. In the non-private setting, it is known that $\mathsf{poly}(k, \log d)$ samples suffice to solve both of these problems. However, the only comparable result known under DP (Wang et al. 2021) requires $Ω(d)$ samples under standard parameterizations of the problem. We investigate when this curse of dimensionality is inherent for sparse covariance estimation tasks under DP. On the upper bound front, we show that a $\mathsf{poly}(k, \log d)$ sample complexity for PCA is possible under DP, if we also posit sparsity of the leading eigenvector. We complement this result with $\mathsf{poly}(d)$ lower bounds under DP for both sparse covariance estimation and PCA, establishing an exponential gap between the private and non-private variants of these problems when $k = \mathsf{polylog}(d)$. To our knowledge, no such separation has previously been demonstrated for any sparse estimation problems in private high-dimensional statistics. Our techniques are flexible enough that they imply stronger lower bounds even for the well-studied problem of standard DP PCA, without sparsity assumptions.
♻ ☆ Explainable Information Design
Optimal signaling schemes in information design (Bayesian persuasion) often involve randomization or disconnected partitions of state space, which might be too intricate to be audited or communicated. We propose explainable information design in the context of linear information design with a continuous state space. In the case of single-dimensional state, we restrict the information designer to use interval-partitional signaling schemes defined by deterministic and monotone partitions of the state space, where a unique signal is sent for all states in each part. We prove that the price of explainability (PoE) -- the ratio between the performances of the optimal explainable signaling scheme and unrestricted signaling scheme -- is exactly $1/2$ in the worst case, meaning that partitional signaling schemes are never worse than arbitrary signaling schemes by a factor of $2$. For a uniform prior, this PoE can be improved to a tight $2/3$. We then extend the analysis to multi-dimensional state spaces by studying two notions of explainability: convex-partitional policies and axis-aligned rectangular policies. We prove a tight PoE of $1/(m+1)$ for convex-partitional policies, while for rectangular policies we establish a PoE guarantee under uniform prior that is independent of the number of signals but unavoidably exponential in the dimension $m$. We also study the computational complexity of explainable information design, proving that the exactly optimal explainable policy is NP-hard to compute, but an explainable policy with $1/2$ approximation guarantee can be computed in polynomial time for piecewise Lipschitz utility functions.
comment: A one-page abstract of this work will appear on ACM Conference on Economics and Computation, 2026, under the title "The Price and Complexity of Explainable Information Design"
♻ ☆ Dynamic direct access of MSO query evaluation over strings
We study the problem of evaluating a Monadic Second Order (MSO) query over strings under updates in the setting of direct access. We present an algorithm that, given an MSO query with first-order free variables represented by an unambiguous variable-set automaton $\mathcal{A}$ with state set $Q$ and variables $X$ and a string $s$, computes a data structure in time $\mathcal{O}(|Q|^ω\cdot |X|^2 \cdot |s|)$ and, then, given an index $i$ retrieves, using the data structure, the $i$-th output of the evaluation of $\mathcal{A}$ over $s$ in time $\mathcal{O}(|Q|^ω\cdot |X|^3 \cdot \log(|s|)^2)$ where $ω$ is the exponent for matrix multiplication. Ours is the first efficient direct access algorithm for MSO query evaluation over strings; such algorithms so far had only been studied for first-order queries and conjunctive queries over relational data. Our algorithm gives the answers in lexicographic order where, in contrast to the setting of conjunctive queries, the order between variables can be freely chosen by the user without degrading the runtime. Moreover, our data structure can be updated efficiently after changes to the input string, allowing more powerful updates than in the enumeration literature, e.g.~efficient deletion of substrings, concatenation and splitting of strings, and cut-and-paste operations. Our approach combines a matrix representation of MSO queries and a novel data structure for dynamic word problems over semi-groups which yields an overall algorithm that is elegant and easy to formulate.
♻ ☆ Online Block Packing and Multidimensional EIP-1559
We consider the online algorithmic challenge that is faced by blockchains that have multidimensional block constraints and serve quasi-patient bidders. We first provide online approximation algorithms for the important special cases of small transactions or a small number of dimensions; this solves open problems left by [Babaioff and Nisan, EC 2025]. Second, we study multidimensional variants of Ethereum's EIP-1559 protocol. We show that if the block builders manage to approximately optimize each block's welfare myopically, then an approximation to the global offline optimal welfare is obtained. On the other hand, we show that, unlike in the single-dimensional case, EIP-1559 by itself does not guarantee any good approximation.
comment: To appear at EC'26. Title updated from 'Online Block Packing' to reflect added analysis on multidimensional EIP-1559
♻ ☆ A Constant-Factor Approximation for Directed Latency
In the Directed Latency problem, we are given an asymmetric metric space $(V \cup \{s\},c)$ on a set $V$ of clients and a depot $s$. We are looking for a path $P$ starting in $s$ that visits all clients and minimizes the sum of the clients' waiting times (also known as latency) before being visited on the path. In contrast to the symmetric version of this problem (also known as the Deliveryperson problem and the Repairperson problem in the literature), there are significant gaps in our understanding of Directed Latency. The best approximation factor has remained at $O(\log |V|)$, as shown by [Friggstad, Salavatipour, and Svitkina, '13], for more than a decade. Only recently, [Friggstad and Swamy, '22] presented a constant-factor approximation but in quasi-polynomial time. Both results follow similar ideas: they consider buckets with geometrically increasing distances, build a path on each bucket, and then stitch together all these paths to get a feasible solution. [Friggstad and Swamy, '22] showed that by guessing a vertex from each bucket and augmenting a standard LP relaxation with these guesses, one can reduce the stitching cost. Unfortunately, the number of buckets is logarithmic in the number of vertices, so the running time of their algorithm is quasi-polynomial. In this paper, we present the first constant-factor approximation for Directed Latency in polynomial time by introducing a completely new way of bucketing, which helps us strengthen a standard LP relaxation with less aggressive guessing. Although the resulting LP is no longer a relaxation of Directed Latency, it still admits a good solution. We present a rounding algorithm for fractional solutions of our LP, crucially exploiting the way we restricted the feasibility region of the LP formulation.
♻ ☆ The Complexity Landscape of Distributed Locally Checkable Problems on Trees SC 2020
Recent research revealed the existence of gaps in the complexity landscape of locally checkable labeling (LCL) problems in the LOCAL model of distributed computing. For example, the deterministic round complexity of any LCL problem on bounded-degree graphs is either $O(\log^\ast n)$ or $Ω(\log n)$ [Chang, Kopelowitz, and Pettie, FOCS 2016]. The complexity landscape of LCL problems is now quite well-understood, but a few questions remain open. For bounded-degree trees, there is an LCL problem with round complexity $Θ(n^{1/k})$ for each positive integer $k$ [Chang and Pettie, FOCS 2017]. It is conjectured that no LCL problem has round complexity $o(n^{1/(k-1)})$ and $ω(n^{1/k})$ on bounded-degree trees. As of now, only the case of $k = 2$ has been proved [Balliu et al., DISC 2018]. In this paper, we show that for LCL problems on bounded-degree trees, there is indeed a gap between $Θ(n^{1/(k-1)})$ and $Θ(n^{1/k})$ for each $k \geq 2$. Our proof is constructive in the sense that it offers a sequential algorithm that decides which side of the gap a given LCL problem belongs to. We also show that it is EXPTIME-hard to distinguish between $Θ(1)$-round and $Θ(n)$-round LCL problems on bounded-degree trees. This improves upon a previous PSPACE-hardness result [Balliu et al., PODC 2019].
comment: DISC 2020
Graphics 1
☆ Mesh2GS: White-Box 3DGS Construction via Plenoptic Sampling
3D Gaussian Splatting (3DGS) has emerged as a promising method for high-quality, real-time 3D reconstruction. To associate 3DGS with mesh representations, existing methods primarily focus on 3DGS-to-mesh reconstruction from multi-view images. In contrast, the problem of converting a mesh into 3DGS has received comparatively less attention. Instead of relying on heuristic strategies that bind 3D Gaussians to the mesh, we propose a novel white-box 3DGS construction framework, termed Mesh2GS, which generates 3DGS directly from mesh geometry based on plenoptic sampling theory, achieving Nyquist-level performance for high-quality global illumination rendering. Firstly, we propose a plenoptic sampling guided 3DGS construction strategy that theoretically derives the minimum sampling rate of the sampled views and the distribution of 3D Gaussians. Second, we propose a novel 3DGS update procedure with albedo--shading decomposition for efficient global-illumination capture. Finally, we introduce a neural illumination enhancement module to handle non-Lambertian effects. Experimental results demonstrate that our method surpasses state-of-the-art baselines and is practically effective for both real-time shared rendering and non-Lambertian effects capturing specular highlights. The project code will be released upon acceptance.
comment: 16 pages, 7 figures
Operating Systems 2
☆ AgenticOS: An Intent-Oriented Secure Operating System Architecture for Autonomous AI Agents
Traditional OS security models based on "resource exposure plus permission checks" face structural challenges as LLM-driven autonomous agents acquire capabilities for planning, tool use, network access, and code execution. Once an agent runtime is compromised through prompt injection or malicious tool outputs, an attacker can compose POSIX-style resource primitives into behaviors far beyond the user's task authorization. To address this, we propose AgenticOS, an intent-oriented secure OS architecture that consolidates delegable, auditable software capabilities into OS-native ones rather than replacing all applications. The core insight is to reframe the OS from a "resource manager" into an "intent filter": instead of requesting low-level resources directly, agents submit structured intent declarations, from which the system synthesizes a least-privilege environment with mandatory mediation, auditing, and information-flow constraints. At the implementation level, we introduce a four-layer architecture -- Ghost Kernel, Logic Shutter, Agent Capsule, and Semantic Boundary Gateway -- together with the Intent ABI, Manifest-Only Runtime, Weaver-based capability generation, and an admission model for AgenticOS-native Skills.
comment: 11 pages,5 figures,1 tables
♻ ☆ SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
Hardware Architecture 10
☆ Solid-state transcapacitor, a new gain element for logic, memory and interconnects
Today's transistors dictate the voltage and charge scales for both logic and memory. While AI systems are recognized to be limited by memory energy, the dominant share of the energy is expended in the intrachip interconnects whose voltage and charge scales are set by transistors. The energy scaling challenges of transistors can be attributed to simultaneously meeting high current density, high current/impedance modulation, and the inability to lower voltages. Hence, a new logic element that lowers the voltage and charge needs is a priority, not only for lowering logic power but also memory access power. Here, we propose a novel 3-terminal logic element for low energy computing, a solid-state transcapacitor (TCAP). A TCAP is a solid state displacement current modulator realized by a gate which controls the charge-voltage relationship of the channel. Unlike transistors, TCAPs eliminate the dissipative transport current, are not bound by the Boltzmann current modulation limit, and operate with displacement currents limited only by the polarization response and contact resistance. Hence, TCAP circuits may simultaneously overcome the voltage, current density, and current modulation limits of CMOS. We describe a solid state TCAP using a piezoelectric transcapacitor in which a gate-controlled stressor modulates the capacitance of a polar channel via electromechanical coupling. This device achieves inversion and gain, essential for logic, and is functionally equivalent to a 1T-1C memory cell, enabling dense memory. Using voltage scaling, capacitive energy recovery, and high polarization densities of polar materials, the logic based on TCAP offers a pathway to 100 fold lower energy consumption with a delay comparable to ultimately scaled CMOS devices. This approach provides a new potential pathway for low-energy computing beyond the limits of transistors using electro-mechanics and multiferroics.
comment: 70 pages, 8 figures and 21 supplementary figures
☆ Row-Based Layout Synthesis for Analog Circuits Using Height-Quantized Primitives
Restrictive design rules and strong layout-dependent effects have tightened the coupling between physical layout decisions and electrical performance in advanced process nodes, such as FinFET, making analog and mixed-signal (AMS) layout automation increasingly difficult. This paper presents a quantized row-height layout synthesis methodology for AMS circuits, a methodology that has previously been shown to reduce the simulation-to-silicon gap. The proposed flow optimizes a row height fabric from circuit requirements and layout constraints while mapping analog building blocks into quantized-height rows. Results on multiple testcases demonstrate that the proposed flow synthesizes layouts with similar postlayout performance relative to less-constrained custom baseline designs, with comparable performance metrics. Our quantized-height designs are shown to reduce the schematic-to-postlayout performance gap by up to 68.5% and result in lower area for most of our testcases, with a maximum area reduction of 24.1%.
☆ SPORT: Spherical-PSNR-Optimized tRuncaTion for Power-Efficient 360-Degree Video Systems
Memory bandwidth accounts for 30-40% of total power consumption in standalone virtual reality (VR) headsets, yet existing systems typically store the entire 360-degree frame at a uniform resolution regardless of viewer gaze. This paper presents SPORT (Spherical-PSNR Optimized tRuncaTion), a bit-truncation framework that reduces display-path memory power by storing only the most significant bits of pixels outside the user's field of view (FoV). Specifically, a new bit-truncation framework is developed to use weighted-to-spherically-uniform PSNR (WS-PSNR) directly in the optimization constraint, eliminating the metric inconsistency that arises when standard PSNR is used for a WS-PSNR quality target. Also, gaze-predictive tile classification compensates for the 9.33 ms end-to-end pipeline latency, reducing boundary misclassifications by 5.2 percentage points at a cost of only 0.01 ms. In addition, the developed SPORT-B variant, which keeps the FoV lossless, achieves 47.9% memory power saving and 47.9% bandwidth reduction across different 4K video sequences while satisfying all three per-region WS-PSNR thresholds and maintaining SSIM = 1.000 in the attended region. The full adaptive variant SPORT-A reaches 51.6% power saving, 3.1percentage points more than a PSNR-based optimizer at equal measured quality. SPORT is validated on the TrunMEM360 flexible SRAM Application-Specific Integrated Circuit (ASIC) fabricated in SkyWater 130 nm CMOS, confirming byte-exact silicon-software agreement, with WS-PSNR and SSIM matching within 0.1 dB and 0.001. CACTI-based analysis confirms 48.72% DRAM leakage reduction and 36.4%/36.7% read/write energy reduction. The total motion-to-photon latency of 9.33 ms satisfies the 20 ms VR comfort budget with a 53.3% safety margin.
☆ COMPOSE: Static Timing-driven Composable Reconfigurable Architecture for Accelerating Recurrence-Bound Loops
Coarse-Grained Reconfigurable Architectures (CGRAs) provide a spatially programmable substrate well suited for accelerating compute-intensive workloads with abundant parallelism. However, traditional CGRA execution models rely on rigid, fixed-size processing elements (PEs) that are statically bound to individual operations, which forces inter-iteration dependencies to be resolved through serialized scheduling. This limits throughput and reduces parallelism across loop iterations. Moreover, static execution schedules often fail to exploit available timing slack between operations, leading to resource underutilization and increased latency. The frequent registering of intermediate results further exacerbates pressure on register files and local memories, introducing data movement overheads that reduce energy efficiency, particularly in power or memory constrained environments. To address these challenges, we introduce COMPOSE, a composable CGRA architecture that enables dynamic formation of PEs at compile time guided by static timing information. By spatially fusing operations across loop iterations and selectively utilizing slack, COMPOSE resolves inter-iteration dependencies that limit throughput and enables low latency execution by reducing slack wastage. Additionally, the architecture reduces register file pressure by deferring output registration when intermediate values remain locally consumable, which significantly lowers redundant memory traffic. Across a diverse set of workloads, COMPOSE on average delivers 1.6x performance improvement and 2.9x EDP reduction over state-of-the-art (SOTA), at minimal area and power overheads.
☆ Closing the Loop on LLM-Generated RTL Assertions with Quality-Aware Formal Verification
Large language model (LLM) based assertion generation is making formal verification more accessible for Register Transfer Level (RTL) designs, but three practical issues remain. Generated properties can pass for the wrong reason, proof cost can vary widely from one design to another, and failing traces are hard to interpret. This paper presents a lightweight, open-source framework that addresses these issues in one loop. Our method combines mutation-guided refinement to reject weak assertions, including vacuous ones and those that fail to distinguish faulty behaviour, a solver-selection stage that chooses among candidate Satisfiability Modulo Theories (SMT) backends using RTL structure, and causal narrative synthesis to explain why a proof failed. Across diverse RTL designs, the framework improves confidence in generated assertions, reduces runtime variability over fixed-solver choices, and produces failure explanations that remain grounded in the counterexample trace. The results suggest that quality-aware closure, rather than assertion generation alone, is the missing step for practical LLM-assisted formal verification.
♻ ☆ PowerFlow-DNN: Compiler-Directed Fine-Grained Power Orchestration for End-to-End Edge AI Inference
Edge AI systems often operate under stringent energy and volume constraints that demand extreme efficiency under limited battery capacity, with requirements worsening as intelligent capability demands advance. Prior literature suggests that fine-grained power orchestration, including both dynamic voltage and frequency scaling and power gating, enables significant energy efficiency benefits that are critical to meeting such constraints, while still exhibiting unexplored challenges. We observe that layer-level approaches incur unintended overheads due to inter-layer coupling of power control decisions, and that jointly managing these mechanisms under practical constraints such as limited voltage rails and transition overheads leads to a rapidly growing combinatorial schedule space. To address this, we propose PowerFlow-DNN, a compiler-directed framework for end-to-end power-state orchestration in ultra-low-power accelerators. By constructing a rigorous problem formulation for deadline-constrained, real-time, periodic inference as a unified inter-layer power-scheduling problem, our framework enables automated discovery of energy-minimal power-state schedules that adhere to a deadline while accounting for end-to-end, inter-layer impacts. We evaluate the framework on a DNN accelerator VLSI implementation in TSMC 40nm technology. Across representative edge networks, we show that our approach discovers near-optimal solutions under the discretized formulation and achieves energy within 0.04\% of the exact ILP oracle, reducing energy by up to 48\% compared to an aggressive baseline without power orchestration, while reasoning over a combinatorial schedule space of over $10^{160}$ possible power-state assignments, yet operating on a structured layered state graph that enables efficient optimization, achieving up to 2.14$\times$ solver speedup via lightweight pruning.
comment: Accepted at ISLPED 2026. Best Paper Nomination
♻ ☆ Ultrafast On-Chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks ICML'26
Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
comment: Forty-Third International Conference on Machine Learning (ICML'26)
♻ ☆ ExSpike: A General Full-Event Neuromorphic Architecture for Exploiting Irregular Sparsity with Event Compression
Spiking neural networks (SNNs) promise energy-efficient computing due to their sparse spatio-temporal activity. However, effectively translating such irregular sparsity into practical performance and energy gains remains challenging, as full-event computing architectures are still underexplored. This paper proposes ExSpike, a general full-event neuromorphic architecture that fully exploits irregular sparsity in SNNs. To realize pure event-driven execution, we first propose a set of dataflow optimizations to ensure that the inputs to each SNN layer remain spike-based, thereby enabling full-event execution throughout the network. We then design a hardware-efficient full-event architecture, named ExSpike, which supports the optimized pure event-driven dataflow and an additional Attention Core for spike-driven self-attention. To further improve computing efficiency, we introduce adjacent-position event compression to reduce redundant accumulations across spatially adjacent spike sequences. ExSpike is implemented on an AMD Xilinx Virtex-7 FPGA and evaluated on both classification and segmentation workloads. Experimental results show that ExSpike achieves high normalized energy efficiency across diverse SNN models while maintaining competitive accuracy, delivering up to 479.15 GOPS, 281.85 GOPS/W, and 0.80 GOPS/W/PE. In particular, ExSpike achieves up to 10$\times$ higher PE-normalized energy efficiency than the SOTA FPGA-based SNN accelerator (FireFly-T). The code for ExSpike is available at https://github.com/xiaoyuehai/ExSpike.
comment: Accepted by the 36th International Conference on Field-Programmable Logic and Applications (FPL 2026); 9 pages, 9 figures
♻ ☆ RTLScout: Joint Agentic Code and Synthesis Optimization for Efficient Digital Circuits
We present RTLScout, an autonomous system that combines LLM-driven agentic design with circuit-level synthesis optimization and arithmetic architecture sweeps. An LLM agent iteratively writes, evaluates, and refines RTL designs using tool calls, guided by quantitative PPA (power, performance, area) feedback from Yosys and OpenROAD. We introduce a multi-run elite pool framework, where the best designs and lessons learned seed subsequent agent runs. The pipeline comprises four complementary phases: agentic code optimization, agentic gate-level rewriting, arithmetic architecture sweeps, and an optional high-effort gate-level refinement pass. On an IEEE-754-compliant 16-bit floating-point multiplier with subnormal support, RTLScout reduces area by 35% and delay by 45% relative to a starting design synthesized in ASAP7 technology. Each phase provides distinct improvements, and high-effort gate-level optimization is most effective as a refinement of already well-optimized designs rather than a substitute for earlier stages. The resulting Pareto front outperforms a commercial-tool reference design on the same technology.
comment: Updated Appendix Table 4: corrected three Verilog baseline values. Conclusions unchanged
♻ ☆ Monolithic 3D Integration for Null Convention Logic (NCL)-Based Asynchronous Circuits
As the demand for high-speed and low-power electronics continues to grow, the quasi-delay-insensitive (QDI) asynchronous domain of digital design has emerged as a promising alternative to traditional clock-based designs. However, the adoption of the paradigm has been greatly limited due to the lack of mature computer-aided design (CAD) tools and a substantially larger area footprint, owing to various architectural constraints. Monolithic-3D (M3D) technology has recently paved the way for manufacturing highly dense integrated circuits (ICs) through sequential integration, resulting in a reduced area footprint, shorter wirelengths, and increased performance. In this study, we integrate M3D technology with QDI Null Convention Logic (NCL) and propose a design methodology for the implementation of M3D-based NCL standard cells, aimed at mitigating the area inefficiencies of traditional planar or 2D counterparts. Furthermore, we employed the threshold gates to design an M3D-NCL unsigned array multiplier circuit. Simulation results suggest that, for a conservative wirelength reduction resulting from M3D implementation, a substantial area reduction of 44% can be achieved while simultaneously reducing delay and power by approximately 31% and 17%, respectively.
comment: 8 Pages, 6 Figures, and 2 Tables
Programming Languages 6
☆ Correct and Complete Symbolic Execution for Free
Symbolic execution is a powerful technique for program analysis. However, the formal semantics underlying symbolic execution is often developed on an ad-hoc basis and decoupled from the concrete semantics of the programming language. To overcome this issue, we introduce symbolic SOS: a rule format that allows us to simultaneously specify concrete and symbolic operational semantics. We prove that symbolic semantics, when generated from symbolic SOS, is both correct and complete with respect to the corresponding concrete semantics. The approach relies only on an algebraic signature of the source language, and is thus language-independent.
comment: 17 pages excl references and proofs. Published at iFM24
☆ The Alignment Problem in Constrained Code Generation
Large Language Models (LLMs) have demonstrated strong capabilities in code generation, but their outputs frequently contain syntax or type errors that result in compilation failures. Constrained decoding has been proposed as a solution to mitigate compilation errors by construction, improving functional correctness as a byproduct. However, previous works overlook a critical aspect of constrained decoding: the alignment between constrainer (e.g., types), language model and the target specification language (e.g., TypeScript). Misalignment is caused by the constrainer being incomplete--rejecting programs that belong to the target--or unsound--allowing programs that are not part of the target. The bias created by incompleteness distorts the language model distribution, and can be detrimental for code generation. We evaluate this hypothesis using seven language models, two target languages, two constrainers, enforcing types and syntax during decoding, and we study how language models react to varying levels of incompleteness. On three benchmarks, when the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%. These contributions make the community aware of the negative effects of misalignment in constrained decoding, and provide quantitative insights on how to design constrainers that are beneficial for code generation systems with formal guarantees.
☆ KBSpec: LLM-driven Formal Specification Generation with Evolving Domain Knowledge Base
Automated formal specification generation is a key step towards program understanding and formal verification. Recently, due to the success of large language models (LLMs) in code generation, researchers have made early attempts to adopt LLMs for generating formal specifications. However, the lack of formal specification language corpora in the wild often makes LLMs fail to generate syntactically correct and semantically verifiable specifications. To mitigate this gap, we propose KBSpec, which augments LLMs with dual-source knowledge of formal specification language: external knowledge from official documentation, and internal knowledge distilled from verifier feedback on LLM-generated specifications. KBSpec maintains a self-evolving knowledge base that is continuously updated from successful generation and repair trajectories, without any LLM parameter tuning or labeled training data. We evaluate KBSpec on Java Modeling Language (JML) specification generation with three LLM backends, and results show that KBSpec improves verification pass rates by 10-25% over state-of-the-art LLM-based approaches, while producing the largest number of high-completeness specifications.
☆ Compiling Differentiable Audio Graphs to Real-Time DSP
Differentiable audio processors are habitually designed and optimised in machine-learning frameworks, but deploying them as real-time audio effects still often requires non-automatic implementation in a dedicated digital signal processing language. The translation is error-prone, demands an onerous verification process, and detaches research prototypes from usable production tools. That being so, we present ADAC, a compiler that lowers a trained model to a framework-agnostic intermediate representation and emits efficient FAUST code whose impulse response matches the source model to within floating-point arithmetic noise, direct paths included. The optimisation loop is made audible by replacing the model in a running plugin after each gradient step. The exported processor carries a small set of macro-controls that leave its stability intact. A stability certificate computed from the shipped parameters is checked before the plugin is built. At the demonstration, a feedback delay network is trained and exported to a working plugin.
comment: 4 pages, 5 figures. Demonstration paper submitted to the 29th International Conference on Digital Audio Effects (DAFx26), Cambridge, MA
☆ AI-Assisted Completion of CertiGC Proofs: An Experience Report
This experience report describes the Codex-assisted completion and stabilization of a substantial Rocq (formerly Coq) proof development for CertiGC, the verified generational garbage collector in the CertiGraph project. The development extends the collector from an effectively immutable setting to a mutable one by adding remembered-set forwarding to the collection path and re-establishing the top-level graph-isomorphism correctness theorem. The central technical issue was not low-level proof scripting alone: mutable updates invalidate the old global no-backward-edge assumption, so the proof had to be reorganized around a recorded-backward-edge invariant stating that every backward edge is recorded in the appropriate remembered-set component. This case differs from recent AI-assisted formal-proof accounts: unlike the adaptation of a nearby compilerproof architecture or a fresh metatheory formalization, it completes a long-running verification in a mature codebase built on the Verified Software Toolchain (VST) and CertiGraph, both mechanized in Rocq. The Rocq kernel remained the arbiter of correctness, while our role shifted toward adjudicating invariant proposals, constraining specification changes, reviewing theorem statements, and deciding when proof cleanup was justified. The Codex-assisted phase repaired the VST relation proofs first, then restored the mathematical graph-isomorphism theorem, and only then audited the premise path from the VST specification to the theorem. That audit found and removed a stale no-backward-edge condition from the VST-facing proof path. This report presents the workflow, resulting proof artifact, and lessons for agentic proof maintenance.
comment: 14 pages
♻ ☆ PRDTs: Composable Design and Verification of Consensus Protocols using Replicated Data Types
Consensus protocols are fundamental in distributed systems as they enable services with strong consistency properties. However, designing protocols optimized for specific use-cases under certain system assumptions is typically an error-prone process requiring expert knowledge. Furthermore, while most recent optimized protocols are variations of well-known algorithms like Paxos or Raft, they often necessitate complete re-implementations, potentially introducing new bugs and complicating the application of existing verification results. This approach impedes application-specific consistency protocols that can easily be amended or swapped out, depending on the given application and deployment scenario. We propose Protocol Replicated Data Types (PRDTs), a novel programming model for implementing consensus protocols using replicated data types (RDTs). Inspired by the knowledge-based view of consensus, PRDTs employ RDTs to monotonically accumulate knowledge until agreement is reached. This approach allows for implementations focusing on high-level protocol logic that abstracts away network details and facilitates automated verification. Moreover, by applying existing algebraic composition techniques for RDTs in the PRDT context, we enable composable protocol building-blocks for implementing complex protocols. We present a formal model of our approach and implement a proof procedure that allows automated reasoning about the consensus safety of concrete PRDT implementations. Additionally, we demonstrate the applicability of our model in verified PRDT-based implementations of existing consensus protocols, and report empirical performance evaluation results. Our findings indicate that the PRDT approach offers enhanced flexibility and composability in protocol design, facilitates reasoning about correctness, and is suited for real-world adoption without intrinsic performance drawbacks.
comment: Paper accepted at OOPSLA 26, 33 pages, 14 figures
Data Structures and Algorithms 7
☆ Breaking chains with trees: Deep learning with $\mathcal{O}(\log N)$ parallel time complexity
Modern deep neural network architectures are trained via backpropagation, which requires errors to be sequentially propagated through all layers before parameters can be updated. This introduces two limitations: locking, where layer-wise updates are strictly interdependent and cannot proceed in parallel, and the weight transport problem, which requires symmetric forward and backward pathways for exact gradient computation. These constraints restrict parallelism, increase memory and communication overhead, and pose challenges for scalable learning. In this work, we propose Hierarchical Block-Local Learning (HBLL), a framework that decomposes deep neural networks into hierarchically linked blocks trained using local learning objectives derived from variational principles, eliminating the need for full end-to-end backpropagation while maintaining effective information propagation across the network. HBLL is the first algorithm that is able to train deep neural networks in $\mathcal{O}(\log N)$ parallel time complexity, where $N$ is the number of network layers. We show that HBLL implicitly defines a family of subnetworks corresponding to different hierarchical paths, enabling flexible inference with different effective numbers of layers. We evaluate HBLL on a set of challenging vision and language modeling tasks, achieving competitive performance. We also extend HBLL to recurrent sequence architectures, applying to settings that otherwise rely on backpropagation through time.
comment: 20 pages, 2 figures, 12 tables
☆ Certifying Quantum Optimization and Circuit Cutting by Using Quantum-Classical Moment Duality
We establish a direct quantum-classical duality based on the degree-$2$ Sum-of-Squares (SoS) semidefinite programming cone: the matrix of two-qubit Pauli-$Z$ correlation functions obtained from \emph{any} quantum state $ρ$ is automatically a feasible point of the classical Goemans-Williamson (GW) relaxation. This observation provides a universal ``safety net'' for quantum optimization algorithms: applying GW random hyperplane rounding to the quantum-driven moment matrix yields a certified expected cut value $\mathbb{E}[\mathrm{Cut}] \ge α_{\mathrm{GW}}\langle\mathcal{H}\rangle_ρ$, valid for every state produced by variational algorithms such as QAOA or the Variational Quantum Power Method (VQPM), regardless of convergence quality. We further show that the same moment matrix reveals the tensor-product structure of the underlying unitary circuit, enabling a polynomial-time, correlation-based circuit cutting procedure with rigorous error bounds. The framework is validated numerically on Max-Cut instances for variational quantum algorithms and on random states for circuit cutting, demonstrating that the cheap two-point correlation data are sufficient to locate near-optimal bipartitions and that the theoretical error bounds hold in practice.
comment: submitted; feedback welcome! simulation code is available at https://github.com/adaskin/quantum-moment-certification
☆ Online Stacking with a Few Load/Unload Points
We consider the stacking problem where items are temporarily stored in stacks in a stacking area. The stacks are working as Last In First Out (LIFO) data structures. The objective is to avoid {\em shifts} or {\em restows} that occur when an item has to leave the stacking area earlier than an item above it. We present a simple online algorithm for the problem where we pick a stack for an incoming item without any information on future items. We present a sufficient condition for the algorithm to avoid shifts expressed as an inequality involving the dimension of the stacking area, the number of load/unload points and the maximum number of items present at the same time. The condition is relatively tight in the sense that we can find instances requiring shifts for {\em any} algorithm (including offline algorithms) with small modifications of the condition. Our results indicate that our algorithm is useful if the number of load/unload points is relatively small compared to the number of items.
☆ Counting and Sampling Anti-Ferromagnetic Potts Models on Random Regular Bipartite Graphs in the Non-uniqueness Regime
The anti-ferromagnetic multi-state Potts model, a generalization of the Ising model, is one of the most fundamental models in statistical physics. It was conjectured by Kotecký (Phys.~Rev.~B, 1985) that the model undergoes a phase transition from a disordered phase at infinite temperature to an ordered phase at sufficiently low temperature on lattices. Such phase transitions are believed to play an important role in computational complexity theory and remain closely connected to the problem of approximating the partition function of the system. For proper three-coloring models (corresponding to the zero-temperature), torpid mixing of a family of local-update Markov chains on lattices was established by Galvin, Kahn, Randall and Sorkin (SIDMA, 2015), coinciding with the presence of phase coexistence following shown by Feldheim and Spinka (J.~Eur.~Math.~Soc., 2019). In this work, we study approximating the partition function of the anti-ferromagnetic multi-state Potts model at low temperature on random regular bipartite graphs, which are with high probability good bipartite expanders. On the negative side, we generalize the result by Geisler, Kang, Sarantis and Wdowinski (arXiv, 2026) for anti-ferromagnetic Ising models to show that when the temperature is sufficiently low relative to the degree of the underlying graph, the celebrated single-site Glauber dynamics has exponentially slow mixing time. On the positive side, we design a deterministic algorithm that yields an approximation to the partition function of the model via the framework of abstract polymer models as Jenssen, Keevash and Perkins (SICOMP, 2020), Liao, Lin, Lu and Mao (Theor.~Comput.~Sci., 2022), Galanis, Goldberg and Stewart (TOCT, 2021) and Geisler, Kang, Sarantis and Wdowinski (arXiv, 2026).
♻ ☆ On estimating the trace of quantum state powers
We investigate the computational complexity of estimating the trace of quantum state powers $\text{tr}(ρ^q)$ for an $n$-qubit mixed quantum state $ρ$, given its state-preparation circuit of size $\text{poly}(n)$. This quantity is closely related to and often interchangeable with the Tsallis entropy $\text{S}_q(ρ) = \frac{1-\text{tr}(ρ^q)}{q-1}$, where $q = 1$ corresponds to the von Neumann entropy. For any non-integer $q \geq 1 + Ω(1)$, we provide a quantum estimator for $\text{S}_q(ρ)$ with time complexity $\text{poly}(n)$, exponentially improving the prior best results of $\exp(n)$ due to Acharya, Issa, Shende, and Wagner (ISIT 2019), Wang, Guan, Liu, Zhang, and Ying (TIT 2024), and Wang, Zhang, and Li (TIT 2024), and Wang and Zhang (ESA 2024). Our speedup is achieved by introducing efficiently computable uniform approximations of positive power functions into quantum singular value transformation. Our quantum algorithm reveals a sharp phase transition between the case of $q=1$ and constant $q>1$ in the computational complexity of the Quantum $q$-Tsallis Entropy Difference Problem (TsallisQED$_q$), particularly deciding whether the difference $\text{S}_q(ρ_0) - \text{S}_q(ρ_1)$ is at least $0.001$ or at most $-0.001$: - For any $1+Ω(1) \leq q \leq 2$, TsallisQED$_q$ is $\mathsf{BQP}$-complete, which implies that Purity Estimation is also $\mathsf{BQP}$-complete. - For any $1 \leq q \leq 1 + \frac{1}{n-1}$, TsallisQED$_q$ is $\mathsf{QSZK}$-hard, leading to hardness of approximating the von Neumann entropy because $\text{S}_q(ρ) \leq \text{S}(ρ)$, as long as $\mathsf{BQP} \subsetneq \mathsf{QSZK}$. The hardness results are derived from reductions based on new inequalities for the quantum $q$-Jensen-(Shannon-)Tsallis divergence with $1\leq q \leq 2$, which are of independent interest.
comment: 59 pages, 3 tables, 3 algorithms. v4: Minor changes to align with the published version. v3: Added a paragraph on recent developments, fixed the proof of Lemma 2.17 (Lemma 2.9 in the SODA proceedings), and made other minor changes. v2: Minor changes (particularly quantum query complexity lower bound for the hard regime, Theorem 5.8 in the SODA proceedings) and added references
♻ ☆ List Decoding Reed--Solomon Codes in the Lee, Euclidean, and Other Metrics
Reed--Solomon error-correcting codes are ubiquitous across computer science and information theory, with applications in cryptography, computational complexity, communication and storage systems, and more. Most works on efficient error correction for these codes, like the celebrated Berlekamp--Welch unique decoder and the (Guruswami--)Sudan list decoders, are focused on measuring error in the Hamming metric, which simply counts the number of corrupted codeword symbols. However, for some applications, other metrics that depend on the specific values of the errors may be more appropriate. This work gives a polynomial-time algorithm that list decodes (generalized) Reed--Solomon codes over prime fields in $\ell_p$ (semi)metrics, for any $0 < p \leq 2$. Compared to prior algorithms for the Lee ($\ell_1$) and Euclidean ($\ell_2$) metrics, ours decodes to arbitrarily large distances (for correspondingly small rates), and has better distance-rate tradeoffs for all decoding distances above some moderate thresholds. We also prove lower bounds on the $\ell_{1}$ and $\ell_{2}$ minimum distances of a certain natural subclass of GRS codes, which establishes that our list decoder is actually a \emph{unique} decoder for many parameters of interest. Finally, we analyze our algorithm's performance under \emph{random} Laplacian and Gaussian errors, and show that it supports even larger rates than for corresponding amounts of worst-case error in $\ell_{1}$ and $\ell_{2}$ (respectively).
comment: 26 pages, 1 figure
♻ ☆ Caching for Dollars, Not Hits: An Exact Offline Reference for Cloud-Egress Caching and the Crossover That Decides When It Pays
When a cache miss fetches from cloud object storage, the bill is per GET request and per byte of egress, not latency. Classic caching minimizes the miss rate, the wrong objective: a rarely but expensively fetched object can cost thousands of times more dollars than a frequently but cheaply fetched one. Generalized-caching theory bounds the miss-cost objective, but no reported benchmark measures how far deployed heuristics sit from the dollar-optimal offline policy on real cloud prices. We supply that reference. For uniform-size page caches with heterogeneous miss costs the offline dollar-optimum is exact in polynomial time via an integral interval linear program -- validated against brute force; variable sizes are NP-hard, so we extend the flow-based offline bound from the hit-ratio objective to dollars (cost-FOO), tight to about four percent. Against this reference we find: (i) a heterogeneity-regret law -- LRU's dollar-regret rises with miss-cost dispersion (Spearman 0.87) while cost-aware GreedyDual cuts it to roughly a tenth; (ii) a contention frontier -- GreedyDual's residual regret collapses to near zero exactly when the budget fits the expensive working set, and is the open slice otherwise; and (iii) a closed-form crossover s* = GET_fee/egress_rate (about 4 KB on S3, 330 B on GCS) that predicts which deployments need dollar-aware caching. On real memcache and CDN traces the price vector alone moves the workload across s*, shifting the regime as predicted. The artifact is a reproducible billing-faithful benchmark; heuristics and bounds it builds on are prior work, credited.
comment: 6 pages, 4 figures. Code, benchmarks, and full pre-registration: https://github.com/samyama-ai/cloud-egress-cache
Graphics 9
☆ Single-Event Upsets in 3D Gaussian Splatting Rendering: Bit-Level Criticality, Spatial Extent, and a Parallel Support Guard
Three-dimensional Gaussian splatting is a standard real-time scene representation increasingly deployed on hardware exposed to transient faults, such as spaceborne processors and robotic edge devices where silent data corruption occurs. A trained model is a large array of floating-point parameters in GPU memory, where a single-event upset corresponds to a single flipped bit. This paper measures these effects and constructs a defense. A GPU-resident parallel fault-injection engine applies over 3.8 million controlled single-bit upsets across four scenes, six fields, all bit positions, and three numeric formats (fp32, fp16, bf16), using 5.3 GPU-hours. The effect is highly concentrated: most upsets leave the image perceptually unchanged due to high redundancy, but a small set of high-order bits principally the logarithmic scale's sign bit enlarge a single primitive to cover up to 75.7% of the frame. A closed-form perturbation bound derived from the IEEE-754 layout and pipeline activations predicts this per-bit ordering. This concentration motivates a support guard: a per-primitive clamp of each parameter to the coordinate box observed during training, costing 76 us per frame. Over 768,000 guarded upsets, the worst corruption footprint is restricted to 11.68% of the frame. We prove the guard leaves clean models unchanged and prevents frame-covering corruption. Under an accumulated dose of 20,000 simultaneous upsets, the unguarded renderer degrades to 10.6 dB, whereas the guarded renderer remains at 21.8 dB. The corruption footprint also dictates the number of tile/compositing nodes contaminated in distributed renderers, where the per-node guard contains it.
comment: 21 pages, 7 figures, 4 tables; ancillary files provided; artifacts: https://huggingface.co/datasets/Lightcap/seu-3dgs
☆ Scene-Level Heterogeneous Physics Simulation with 3D Gaussian Splats CVPR 2026
3D Gaussian Splatting (3DGS) has achieved state-of-the-art photorealistic rendering, but the representation gap prevents these assets from being physically interactive. Production-grade physics engines do not understand the 3DGS representation, while prior physics-for-3DGS methods are monolithic silos. These prior works are fundamentally limited, demonstrating only object-centric physics in isolated environments, such as on an ideal plane. They are incapable of interacting with complex static collision geometry or heterogeneous assets. We propose a novel framework that, for the first time, bridges this gap by enabling 3DGS assets to participate in scene-level, heterogeneous, multi-solver physical simulations. Our core contribution is a Representation Abstraction Framework that translates all diverse assets, including 3DGS, virtual meshes, and fluids, into a unified physical particle set. This abstraction is key to enabling complex behaviors, such as the non-rigid deformation of 3DGS assets, within a unified physics pipeline. This particle set, along with the static scene collision boundaries derived from scene capture, is processed within a solver-agnostic physics kernel. The physical results are then mapped back to drive each asset's specific visual reconstruction. This architecture unlocks capabilities impossible with prior art. We demonstrate complex, two-way interactions between deformable 3DGS assets, standard CG assets such as fluids and meshes, and large-scale captured static environments, showcasing realistic coupled phenomena that were previously unattainable.
comment: Accepted to CVPR 2026 Findings
☆ OSOG: A Differentiable, Physics-Informed Synthetic Data Engine for Micro-Optical Environments
Deep learning in computational microscopy is severely constrained by the scarcity of densely annotated datasets. While synthetic data generation has bridged this gap in macroscopic computer vision, traditional graphics engines rely on geometric ray-tracing, failing to capture the micro-optical phenomena required for microscopy. Conversely, while wave-optics formulations exist, rendering them computationally tractable at the scale required for deep learning remains a massive systems challenge. To address this, we introduce the Optical Synthetic Object Generator (OSOG), a high-performance, fully differentiable forward-modeling engine. Drawing on established physical models of diffraction and phase retardation, OSOG maps continuous Optical Path Difference (OPD) calculations into a highly optimized, PyTorch-native Structure-of-Arrays (SoA) architecture. We validate this computational framework across three axes: First, object detection models (YOLOv11-OBB) trained purely on OSOG-generated data achieve robust zero-shot transfer to real-world highly occluded Lysozyme micrographs. Second, we introduce DiffOSOG, demonstrating that the engine's end-to-end differentiability allows for the exact recovery of continuous optical parameters via curriculum-guided inverse rendering. Finally, OSOG bypasses the $\mathcal{O}(N)$ bottlenecks of sequential ray-tracing, demonstrating sub-linear scaling by synthesizing 40,000 complex wave-optic particles in under 50 milliseconds (\>20 FPS). By providing a fast, scalable, and physically grounded tensor pipeline, OSOG enables true real-time, on-the-fly dataset generation.
comment: 9 pages, 5 figures
☆ PIAvatar: Physically Interactive Avatars via Deformation Gradient Decoupling
3D human avatars have shown impressive visual fidelity driven by pose-conditioned models, yet they still lack the physical ability required for interactions with each other and environments. Although recent studies have made various attempts to incorporate physical characteristics into 3D avatars, they only exhibit limited physical deformations, often leading to constrained interaction behaviors. To resolve this issue, we present PIAvatar, a framework to simultaneously enable physically aware interactions between avatar-avatar and avatar-environment, and a non-rigid deformable human body simulation. In this work, our key insight is to decouple kinematic velocity from deformation gradient. When external forces act on avatars, the kinematic velocity induces stress which hinders the avatar's ability to achieve a desired pose. In addition, we integrate a skeletal framework within the avatar. It allows estimating its poses and real-time tracking in a closed form, even during non-rigid physical interactions. Our approach is implemented within a conventional Material Point Method framework to ensure physically consistent dynamics. We lastly evaluate the method on both human-object and human-human interaction scenarios to assess its behavior under diverse interaction settings.
comment: 24 pages, 13 figures
☆ DPLAN: Minimal Connectivity to Floorplan Generation
Automated floor plan generation is an important problem in computational architectural design. The goal is to construct a floor plan from user-defined room numbers and door requirements. The user specifies which rooms must share a door and which rooms must not be adjacent. However, these requirements do not determine the exact placement or shape of the rooms. The task is therefore to arrange the rooms in a single floor plan so that all required door connections are satisfied and no rooms overlap. To address this problem, we propose DPLAN (Door Connectivity to Floor Plan Generation), a graph-based prototype that generates floor plans from door and non-adjacency constraints. The framework operates in three stages. First, the user-defined graph is examined and, if disconnected, additional edges are added to connect its components. Second, a bi-connected plane triangulation is constructed to ensure the existence of a floor plan without overlapping rooms or empty spaces. Third, the triangulated graph is transformed into floor plans. For rectangular floor plans (RFPs), separating triangles are removed by modifying edges without adding new vertices, thereby avoiding the creation of extra rooms. For orthogonal floor plans (OFPs), separating triangles are removed by introducing additional vertices, allowing rectilinear room shapes. By enforcing both door and non-adjacency requirements, the framework generates floor plans that satisfy the given constraints. The method is implemented in Python and includes a prototype for interactive constraint specification and floor plan visualization. Currently, the framework supports rectangular plot boundaries. Future work includes support for non-rectangular plots, dimension-based scaling, and circulation modeling.
☆ Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion ECCV 2026
Human motion generation has been widely studied across diverse input modalities, text, music, and video, and recent efforts have unified these into single multimodal frameworks. However, while morphological factors such as gender and body shape are known to produce distinct kinematic signatures, no existing unified framework incorporates this into generation, treating all subjects as morphologically equivalent. We present Odoriko, the first unified multimodal motion generation framework that reflects subject bio-morphological information directly in synthesized motion output. Rather than averaging over subject variation, Odoriko generates motion that is consistent with who is moving, not just what they are asked to do, across text, music, and video conditions within a single model. When explicit morphological information is unavailable, Odoriko additionally recovers subject morphology alongside motion, unifying estimation and generation in one framework. Extensive experiments across text-to-motion, music-to-dance, and video-to-motion benchmarks demonstrate that Odoriko matches or exceeds prior specialized models on standard metrics, while enabling morphology-consistent generation that no existing unified framework supports.
comment: ECCV 2026
♻ ☆ TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields
We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
comment: Project page: https://derkleineli.github.io/triflow/ Video: https://www.youtube.com/watch?v=Nl57QcuKkeA
♻ ☆ Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction
View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on SH. However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function that enables efficient modeling and learning of high-frequency appearance effects while maintaining compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.
comment: 19 pages, 11 figures
♻ ☆ Enhancing Line Density Plots with Outlier Control and Bin-based Illumination
Density plots effectively summarize large numbers of points, which would otherwise lead to severe overplotting in, for example, a scatter plot. However, when applied to line-based datasets, such as trajectories or time series, density plots alone are insufficient, as they disrupt path continuity, obscuring smooth trends and rare anomalies. We propose a bin-based illumination model that decouples structure from density to enhance flow and reveal sparse outliers while preserving the original colormap. We introduce a bin-based outlierness metric to rank trajectories. Guided by this ranking, we construct a structural normal map and apply locally-adaptive lighting in the luminance channel to highlight chosen patterns -- from dominant trends to atypical paths -- with acceptable color distortion. Our interactive method enables analysts to prioritize main trends, focus on outliers, or strike a balance between the two. We demonstrate our method on several real-world datasets, showing it reveals details missed by simpler alternatives, achieves significantly lower CIEDE2000 color distortion than standard shading, and supports interactive updates for up to 10,000 lines.
comment: Published in IEEE Transactions on Visualization and Computer Graphics (Supplementary Material: https://doi.org/10.17605/OSF.IO/PY5DZ)
Hardware Architecture 7
☆ A3C3: AI Algorithm and Accelerator Co-design, Co-search, and Co-generation
We present a holistic methodology for artificial intelligence algorithm and accelerator co-design, co-search, and co-generation (A3C3), which jointly optimizes neural network architectures and their hardware implementations to address the inefficiencies of traditional top-down AI system design flows. Conventional AI deployment often treats model design and hardware mapping as separate stages: an algorithm is first developed for accuracy, and only afterward adapted to meet latency, throughput, energy, or resource constraints. This separation can lead to suboptimal systems, particularly as modern AI workloads become increasingly heterogeneous, memory-intensive, and platform-dependent. A3C3 instead parameterizes both algorithmic and accelerator design spaces and searches them jointly, enabling the automatic generation of model-accelerator pairs that better balance accuracy, latency, throughput, energy efficiency, and hardware utilization. This article is a book chapter of the Handbook of Embedded Machine Learning, edited by Sudeep Pasricha and Muhammad Shafique, Springer Nature.
☆ Memory-Centric Computing: Security Benefits and Challenges of Processing-in-DRAM CCS
Today's computing systems are processor-centric: they require frequent data movement between processing elements (e.g., CPU) and main memory (DRAM), leading to significant inefficiencies in performance and energy consumption. Memory-centric computing instead moves computation to the data, enabling computation capability in and near all places where data is generated and stored, and greatly reducing the performance and energy overheads of data access and data movement. This shift from a processor-centric to a memory-centric paradigm has important and underexplored consequences for system security. Turning memory from a dumb, inactive store into an active computing substrate introduces benefits as well as challenges for system security: it can provide new in-memory security primitives and also reduce data exposure, but it can also expose new attack surfaces. This work discusses the security benefits and challenges of memory-centric computing, specifically Processing-in-DRAM (PiD), a paradigm where the operational characteristics of a DRAM chip are exploited and enhanced to perform computation on data stored in DRAM. Specifically, we describe 1) new state-of-the-art DRAM-based true random number generators that provide up to 16.05 Gb/s throughput and physical unclonable functions with 5.75% lower evaluation latency than the prior state-of-the-art, both on real DRAM chips and 2) two key security challenges of PiD: amplified DRAM read disturbance (e.g., 158x reduction in the minimum number of DRAM accesses required to induce the first bitflip) and high throughput memory timing channels (e.g., a communication throughput of 14.8Mb/s). We believe it is time to design, use, and program DRAM, and in general memory, not as an inactive storage substrate, but as a combined computation, storage, and security substrate, where computational capability, storage density, and security are all key goals.
comment: To appear at the Seventh Workshop on Memory-Centric Computing Systems (MCCSys-7), colocated with ICS 2026. This extended abstract is an overview paper that summarizes some recent works, including SiMRA-TRNG (arXiv:2510.20269, ICCD'25), SiMRA-PUF (arXiv:2606.15470, DSN Disrupt'26), PuDHammer (arXiv:2506.12947, ISCA'25), and IMPACT (arXiv:2404.11284, DSN'25)
☆ Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge
Tsetlin Machine (TM) is a logic-based machine learning approach that relies on simple bitwise operations and finite-state automata, which makes it attractive for edge AI deployments. Recent work has focused on co-processor and accelerator designs based on Tsetlin Machines (TMs). Although these designs achieve high performance, they typically depend on tightly coupled interfaces, microcode-style programming, and external host processors, limiting flexibility and ease of programming. In this work, we present a domain-specific RISC-V microprocessor architecture and design flow tailored for TM inference. Leveraging the modular structure of RISC-V, we design a reduced instruction subset processor that retains programmability while targeting improved performance and lower energy consumption for TM workloads. Instruction profiling is employed to guide instruction reduction, followed by datapath and control path simplifications tailored to TM inference. Both the baseline RV32IM core and the proposed reduced core are evaluated across multiple datasets and compared with Binarized Neural Networks (BNNs), which serve as a hardware-efficient baseline due to their reliance on bitwise operations during inference. Results show that TM achieves comparable or higher accuracy (e.g., up to 88.18% on CIFAR-2 compared to 60.0% for BNN) while reducing execution time by up to 98% across multiple datasets. Furthermore, the proposed design achieves an average $29.7\times$ reduction in energy consumption, demonstrating its effectiveness for programmable and efficient edge AI systems.
comment: 6 pages, 6 Figures, Accepted in IEEE ISVLSI Conference 2026
☆ Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture
In this paper, we implement and evaluate Whisper dot-product kernel offloading on IMAX, a programmable Coarse-Grained Linear Arrays (CGLAs) architecture. Whisper-tiny.en profiling on an ARM Cortex-A72 shows that dot-product operations account for 90.6% of FP16 execution time and 87.1% of Q8_0 execution time. To address this kernel bottleneck, we combine kernel mapping, local-memory sizing, and burst scheduling. The implementation uses inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution in which aligned vector segments run on IMAX and residual segments run concurrently on the host CPU. We evaluate the design with an FPGA prototype and a 28nm ASIC projection at 840MHz. For Whisper-tiny.en, 32KB local memory and burst length 16 jointly minimize PDP and EDP. Under a TDP-based cross-platform comparison, the projected IMAX records a PDP of 11.58J for Whisper-tiny.en Q8_0, 2.35x lower than Jetson AGX Orin (27.16J) and 10.48x lower than RTX 4090 (121.38J). The same design extends to Whisper-base.en and Whisper-small.en, where the PDP gap narrows as 32KB local-memory coverage drops from 93.8% for tiny to about 66.5% for base and small. These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime.
comment: This paper is accepted at Concurrency and Computation: Practice and Experience (Wiley)
♻ ☆ Performance Analysis of Digital Processing-in-Memory through a Case Study on Convolutional-Neural-Network Acceleration
Processing-in-Memory (PIM) architectures are evolving to minimize data movement by leveraging the same physical devices for both memory and logic functionalities. While analog PIM harnesses crossbar arrays for efficient approximate matrix-vector multiplication, digital PIM architectures facilitate massively-parallel bitwise operations for more general workloads. Recent works have extended digital PIM towards the full-precision acceleration of convolutional neural networks (CNNs), yet a comprehensive comparison with GPUs remains a gap in the literature that may illuminate the limitations of digital PIM. This paper aims to fill this void by conducting a thorough examination of CNN acceleration through an updated quantitative comparison with GPUs. Our approach begins with a theoretical investigation into various PIM architectures, shedding light on their performance characteristics and constraints. Subsequently, through a series of benchmarks spanning memory-bound vectored arithmetic to CNN acceleration, we provide insights into digital PIM performance that may guide the acceleration of applications in the future.
comment: Revised and expanded version with additional evaluation, CNN training results, and broader architectural analysis
♻ ☆ Edge-Inference Governors Need Memory-Clock State
Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.
comment: 20 pages, 13 figures, 11 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20745228
♻ ☆ Shift-Left High-Level Synthesis Verification via Knowledge-Augmented LLM Agent
High-Level Synthesis (HLS) relies on transforming original C specifications into synthesizable HLS-oriented C (HLS-C) implementations. Functional consistency verification between original C specifications and HLS-C implementations is a critical yet labor-intensive task in HLS design flows. While Large Language Models (LLMs) have recently shown promise in automated testbench generation, their stochastic nature often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence checking results. To address these limitations, we propose a knowledge-augmented, agent-driven shift-left verification framework for automated functional consistency checking between original C and HLS-C implementations before synthesis. The framework introduces a Dual-Tier Consistency Checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage-driven refinement to improve verification completeness. Furthermore, we construct a heterogeneous HLS Verification Knowledge Graph to provide topology-aware reasoning priors for testbench generation, and design an autonomous verification agent to orchestrate iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs demonstrate that the proposed framework achieves 0.9826 average coverage and 0.9533 dynamic consistency, outperforming representative AST-based, retrieval-augmented, and iterative agent-based baselines. https://github.com/cz-5f/HLS-LeVeri.git
Programming Languages 7
☆ Big-step and small-step Horn clause derivations applied to operational semantics
The concepts of big-step and small-step derivations are familiar from the operational semantics of programming languages. These concepts are applicable in the more general setting of Horn clause derivations. We prove equivalence between big-step derivations and two versions of small-step derivations for Horn clauses. By specialising interpreters for these derivation strategies, any set of Horn clauses can be transformed into a provably equivalent set of clauses that inherits the behaviour of a given (big- or small-step) Horn clause interpreter. As a special case of this transformation, big-step semantics for any programming language, expressed directly as Horn clauses, can be transformed into equivalent small-step semantics. Experiments with a variety of programming languages are reported.
comment: 32 pages, 15 figures, 3 appendices (30 pages)
☆ Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages ICLR 2026
LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
comment: ICLR 2026
☆ A cubical formalisation of conditional independence, Bayesian conditioning, and Pearl's d-separation soundness
The standard convex-algebra interchange axiom, common to probability-monad formalisations since Stone, is provably too weak to support full Bayesian conditioning. We make this precise in Cubical Agda: finite distributions as a higher inductive type, conditional independence as a cubical path between kernels, recursive Bayesian conditioning as a total function on a full-support fragment. Lifting conditioning to the full HIT exposes a structural mismatch -- the two halves of the rearranged 4-leaf mix carry distinct Bayesian weights related by Bayes' formula, not the single shared inner weight the standard axiom provides. We exhibit the minimal generalisation that resolves this and prove the standard form is the degenerate case where the two inner weights coincide. Around this observation we verify the algebraic context constructively, with zero postulates above an abstract ordered-field interface: bind commutativity, the four semi-graphoid axioms, intersection (reduced to contraction via structural $Σ$-witnesses, without positivity), Pearl's do-calculus Rules~1, 2, and~3 in kernel form, finite-type Bayesian conditioning, and Pearl's d-separation theorem (soundness) on arbitrary $n$-vertex finite directed acyclic graphs (DAGs) in both interventional and Bayesian forms. The probability monad is also verified as a Markov category; the abstract interface discharges at $Q$.
☆ An MSO Framework for Weak-Memory Verification and Robustness
Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem $\unicode{x2013}$ which requires quadratic time under SETH $\unicode{x2013}$ can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.
comment: Accepted at CONCUR 2026
☆ Effect Systems as Abstract Interpretations
Many forms of static reasoning about program behaviours are known in the literature, yet formal relationships are studied surprisingly infrequently. While most type systems are well-known to be captured by abstract interpretations, the situation for type-and-effect systems is, in the general case, unsettled despite strong hypotheses and occasional framing of effect systems as abstract interpretations. We develop a formal relationship between abstract interpretations and a general class of effect systems. First, we describe an embedding of effect quantales into abstract domains. Second, we recover the general form of an effect quantale as an abstract interpretation -- not on states or values, but on event occurrences.
comment: Draft short paper
♻ ☆ Look Before You Leap: Checking In on Type Tag Checking
Tagging of generic dynamic values is important in symbolic-computation and dynamic-language systems, but the trade-offs change as machine architectures and workloads evolve. In particular, old folklore about boxed values, immediate values, and type tags must be recalibrated from time to time. We revisit the performance of badged object headers, low-bit tagging, and two NaN-boxing layouts on a range of platforms in use today, including AArch64 and x86-64 architectures from different manufacturers. The experiments isolate two distinct effects: the cost avoided by not heap-allocating common scalar values, and the cost avoided by obtaining tag information from the value word rather than by performing a heap read. The results show that several local bit operations are often cheaper than opening a heap object to obtain a tag or small value. Low-bit tagging remains the simplest and usually fastest choice for mostly symbolic workloads, while NaN-boxing is close in access cost and avoids the time and space of heap allocation for ordinary floating-point values.
♻ ☆ Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking
This paper introduces Triosecuris, a formally verified defense against Spectre BTB, RSB, and PHT that combines CET-style hardware-assisted control-flow integrity with compiler-inserted speculative load hardening (SLH). Triosecuris is based on the novel observation that in the presence of CET-style protection, we can precisely detect BTB misspeculation for indirect calls and RSB misspeculation for returns and set the SLH misspeculation flag. We formalize Triosecuris as a transformation in Rocq and provide a machine-checked proof that it achieves relative security: any transformed program running with speculation leaks no more than what the source program leaks without speculation. This strong security guarantee applies to arbitrary programs, even those not following the cryptographic constant-time programming discipline.
comment: To appear at CSF'26; extended version with appendices. W.r.t. first revision: extended with concrete protection against Spectre RSB and renamed to Triosecuris
Data Structures and Algorithms 12
☆ Sparsify Submodular Functions under Cardinality Constraints
Submodular sparsification generalizes the classical sparsification problems of graphs and matrices to summations of submodular functions. Given the summation $F(S):=f_1(S)+\cdots+f_m(S)$ of $m$ submodular functions $f_1,\ldots,f_m:\{0,1\}^n \to \mathbb{R}_{\ge 0}$. An size-$s$ sparsification of $F$ is a weight vector $w \in \mathbb{R}^m_{\ge 0}$ such that $w_1 f_1(S) + \cdots w_m f_m(S) \approx F(S)$ for every subset $S \subset [n]$. Motivated by the wide applications of submodular functions in data mining and economics, submodular sparsification has been studied in the last few years. For general submodular functions, Kenneth and Krauthgamer provided an efficient construction of size $O(n^3)$. Although several families of submodular functions admit sparsifiers of size $\tilde{O}(n)$, there is a lower bound $Ω(n^2)$ on the size of sparsifiers by Cohen et al. In this work, we study whether cardinality constraints, such as restricting $S$ to subsets of size at most $k$, could reduce the size of sparsifiers or not. Namely, if the guaranty is $w_1 f_1(S) + \cdots w_m f_m(S) \approx F(S)$ for every $S$ in $[n]$ of cardinality at most $k$, are there sparsifiers of size smaller than $o(n^2)$? Our main result shows an efficient construction of size-$O(n k^2 \log n)$ sparsifiers for summations of arbitrary submodular functions. This improves the $Ω(n^2)$ bound for the general setting. Then we consider the existence of size-$(k \log n)^{O(1)}$ sparsifiers under the constraint of cardinality at most $k$ and show several natural families do not admit such a small sparsifier. Technically, our algorithm applies the Lovász extension and Edmonds' greedy algorithm to extend Kenneth and Krauthgamer's approach. In particular, we provide an efficient algorithm to provide a tight estimate (up to a constant) of the sensitivity of each $f_i$ under cardinality constraints.
☆ Computing Twin-Width via Treedepth and Vertex Integrity
Twin-width is a graph parameter that has become central to explaining the fixed-parameter tractability of first-order model checking across many graph classes. Despite its algorithmic importance, computing twin-width remains poorly understood: even recognizing graphs of twin-width at most four is NP-hard, and no fixed-parameter approximations parameterized by twin-width itself are known. A recent approach towards breaking this barrier focuses on first developing fixed-parameter algorithms for computing or approximating twin-width under parameterizations distinct from twin-width. Our first result establishes that approximating twin-width is fixed-parameter tractable when parameterized by treedepth, thereby breaking the long-standing barrier that all previous tractable parameterizations were based on deletion distance. The proof proceeds via oriented twin-width, yielding the first constructive evidence that this variant may be easier to handle algorithmically. As our second main result, we show that computing twin-width exactly is fixed-parameter tractable with respect to vertex integrity. This constitutes the first non-trivial parameterized algorithm for computing optimal contraction sequences.
comment: A short version of this preprint appeared at STACS 2026
☆ Tight Algorithm and Hardness for Submodular Linear Ordering
We consider the Minimum Linear Ordering Problem: given a ground set $N$ of cardinality $n$ and a non-negative set function $f\colon 2^N\rightarrow \mathbb{R}_{\geq 0}$, the goal is to find an ordering $π$ of $N$ that minimizes the sum of the values of $f$ over all prefixes of $π$. This problem has been studied for various classes of set functions, and the case of a submodular $f$ is of special interest, as it captures classic problems including Minimum Linear Arrangement and Minimum Containing Interval Graph. In this work, we resolve the approximability of the Minimum Linear Ordering Problem for a general submodular $f$ by establishing matching upper and lower bounds and present: $(1)$ a polynomial-time algorithm achieving an $O(\sqrt{n/\ln n})$-approximation; and $(2)$ a matching information-theoretic hardness result, showing that no algorithm evaluating $f$ a polynomial number of times can achieve an $o(\sqrt{n/\ln n})$-approximation. Previously, the best known hardness of approximation was $2$, and an $O(\sqrt{n/\ln n})$-approximation was known only for the special case where $f$ is both submodular and symmetric.
comment: 25 pages. Accepted to the 53rd International Colloquium on Automata, Languages, and Programming (ICALP 2026)
☆ Beyond Averaging in John Ellipsoid Approximation: High-Accuracy Algorithms in the Leverage-Score Model
The John ellipsoid of a symmetric polytope $P=\{\mathbf{x}\in\mathbb{R}^d:\|\mathbf{A}\mathbf{x}\|_\infty\le1\}$, $\mathbf{A}\in\mathbb{R}^{n\times d}$, is computed by a long line of leverage-score algorithms, from Cohen, Cousins, Lee and Yang (COLT 2019) to its successors [WY24, CLS+25], all reaching a $(1+\varepsilon)$-approximation in $Θ(\varepsilon^{-1}\log(n/d))$ iterations. We separate this complexity into three costs the modern line conflates (certification, identification, and accuracy) and locate the historical $\varepsilon^{-1}$ in the first alone. In the equivalent D-optimal-design form $\min_{\mathbf{p}\inΔ_n}-\log\det(\sum_i p_i\mathbf{a}_i\mathbf{a}_i^\top)$, the leverage-score oracle is exactly the first-order oracle and the $(1+\varepsilon)$-John guarantee the Frank-Wolfe gap $g(\mathbf{p})\le\varepsilon d$; through this dictionary the costs come apart. The $\varepsilon^{-1}$ is a certification artifact: the uniform average of the iterates, the certificate used throughout the line, has gap exactly $Θ(1/T)$, however cheap each iteration is made. Pointed instead at the last iterate the same oracle is fast: a warm-started accelerated method reaches the guarantee in $C(\mathbf{A})+O(\sqrtκ\log(1/\varepsilon))$ queries after an $\varepsilon$-independent setup $C(\mathbf{A})$, and once the optimal face is identified the facial problem is an unconstrained self-concordant minimization whose Hessian the oracle recovers exactly, so damped Newton needs only $O(\log\log(1/\varepsilon))$ steps, for a total of $C(\mathbf{A})+O(d^2\log\log(1/\varepsilon))$ queries. The accuracy dependence is thus doubly logarithmic after an $\varepsilon$-independent, condition-dependent setup; the open problem is the remaining identification cost (a condition-free bound on reaching the optimal face) and lower bounds. Accuracy is not the obstruction.
☆ Optimal Sparsification of Gaussian Processes
We prove an optimal dimension-free sparsification theorem for suprema of centered Gaussian processes. Given a bounded set $T\subseteq\mathbb{R}^n$, we show that the supremum of the canonical Gaussian process on $T$ can be $L^2$-approximated by the supremum of a shifted subprocess indexed by only $\exp(O(1/\varepsilon^2))$ points, with error at most $\varepsilon$ times the Gaussian width of $T$. In particular, the size of the approximating process is independent of both the ambient dimension and the cardinality of the original index set. This improves a recent sparsification theorem of De, Nadimpalli, O'Donnell, and Servedio (2026) by an exponential factor, and we show that the dependence on $\varepsilon$ is tight up to constants in the exponent. As consequences, we obtain an exponentially improved junta theorem for norms over Gaussian space and sharpen results on learning, property testing, and polyhedral approximation of convex sets under the Gaussian measure. The proof is based on an interpolation argument that combines Sudakov's minoration with the Brascamp--Lieb inequality.
comment: 38 pages, 1 figure
♻ ☆ On Rounding on the Hypersimplex
We study correlated rounding on the hypersimplex, the base polytope of the uniform matroid. For each point \(x\) in the hypersimplex, the goal is to sample a \(k\)-subset \(A(x)\) with marginals \(x\), while coupling the samples for all choices of \(x\) so that nearby inputs produce nearby sets. We give conditional constant-stretch results for the natural maximum-entropy sequential scheme, based on a conjectured spectral property of the covariance matrix of the maximum-entropy distribution over \(k\)-subsets; this conjecture has been extensively tested numerically. Under this property, the scheme samples the maximum-entropy \(k\)-subset distribution with prescribed marginals using a common random ordering and common uniform thresholds. For every \(x,y\in[0,1]^n\) with \(\sum_i x_i=\sum_i y_i=k\), it satisfies \[ \mathbb{E}\!\left[|A(x)\triangle A(y)|\right] \le 6\|x-y\|_1 . \] Thus, conditional on the spectral hypothesis, the previous \(O(\log k)\) bound for hypersimplex correlated rounding would improve to a constant and the open question raised by Naor, Raju, Shetty, Srinivasan, Valieva, and Wajc would have a positive answer. By adding dummy coordinates, the same conditional result gives stretch at most \(12\) for the at-most-\(k\) polytope.
comment: 15 pages
♻ ☆ Combinatorial Sparse PCA Beyond the Spiked Identity Model ICML 2026
Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $Σ$, whose top eigenvector $v \in R^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $Σ= I_d + γvv^\top$ for some $γ> 0$), whereas SDP-based algorithms require no additional assumptions on $Σ$. We demonstrate explicit counterexample covariances $Σ$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $Σ$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
comment: 36 pages, 6 figures; accepted to ICML 2026
♻ ☆ Efficient Computation of Time-Index Powered Weighted Sums Using Cascaded Accumulators
This letter presents a novel approach for \mbox{efficiently} computing time-index powered weighted sums of the form $\sum_{n=0}^{N-1} n^{K} v[n]$ using cascaded accumulators. Traditional direct computation requires $K{\times}N$ general multiplications, which become prohibitive for large $N$, while alternative strategies based on lookup tables or signal reversal require storing entire data blocks. By exploiting accumulator properties, the proposed method eliminates the need for such storage and reduces the multiplicative cost to only $K{+}1$ constant multiplications, enabling efficient real-time implementation. The approach is particularly useful when such sums need to be efficiently computed in sample-by-sample processing systems.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Non-Splitting Coflow Scheduling with Provable Guarantees in Heterogeneous Parallel Networks
As a prominent network abstraction, coflow models efficiently capture communication patterns in data centers. Since coflow scheduling in large-scale data centers is $\mathcal{NP}$-hard, the existing literature has predominantly focused on limited environments with $m=2$ network cores, relying on flow splitting, which introduces substantial operational overhead. Crucially, no approximation algorithm with provable performance guarantees has been proposed for the more practical, non-splitting coflow scheduling problem, even for the $m=2$ case, let alone for general hybrid architectures. To bridge this critical gap, this paper investigates the non-splitting problem within a hybrid, heterogeneous parallel network featuring multiple network cores ($m \ge 2$) composed of Electronic Packet Switches (EPS), not-all-stop Optical Circuit Switches (OCS), and all-stop OCS. We propose three unified polynomial-time approximation algorithms that minimize the makespan and the total weighted coflow completion time across this hybrid environment without incurring any splitting overhead. Let $τ$ denote the maximum flow degree across all ports in the network, and let $m$ be the number of network cores. To minimize the makespan, our algorithm achieves an approximation ratio of $2\min\left\{2τ-1, m+τ-1\right\}$ in the hybrid architecture. To minimize the total weighted coflow completion time, our algorithm achieves an approximation ratio of $16\min\left\{2τ-1, 2m+τ-1\right\}$ in the hybrid architecture. Moreover, we characterize the approximation ratios of our algorithm under different architectural combinations.
♻ ☆ The World's Fastest Matching Engine Algorithm
A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
comment: 20 pages, 5 figures, 7 tables
♻ ☆ Fair Online Resource Allocation
We study the problem of fair online resource allocation, motivated by applications such as refugee resettlement and airline scheduling, where agents arrive sequentially and must be assigned to facilities with limited capacities. We introduce a model that maximizes the overall welfare subject to resource constraints and a Lipschitz fairness requirement, which ensures that similar agents arriving in the same batch receive similar expected outcomes. We first analyze the offline problem, proving that the value of the optimal fair allocation is at least an $Ω(1/γ)$ fraction of the optimal unfair allocation, where $γ$ is the fairness coefficient, thereby bounding the price of fairness. For the online setting, we propose an algorithm based on dual mirror descent that enforces fairness constraints within batches while estimating optimal dual variables. We prove that this algorithm achieves sublinear regret relative to the optimal offline fluid benchmark. Finally, we validate our theoretical results using real-world data from the Refugee Economies Programme, demonstrating the algorithm's performance and examining the trade-offs between welfare maximization and fairness enforcement.
comment: 30 pages, 4 figures. To appear in the proceedings of EC 2026
♻ ☆ Efficient $\varepsilon$-approximate minimum-entropy couplings
Given $m \ge 2$ discrete probability distributions over $n$ states each, the minimum-entropy coupling is the minimum-entropy joint distribution whose marginals are the same as the input distributions. Computing the minimum-entropy coupling is NP-hard, but there has been significant progress in designing approximation algorithms; prior to this work, the best known polynomial-time algorithms attain guarantees of the form $H(\operatorname{ALG}) \le H(\operatorname{OPT}) + c$, where $c \approx 0.53$ for $m=2$, and $c \approx 1.22$ for general $m$ [CKQGK '23]. A main open question is whether this task is APX-hard, or whether there exists a polynomial-time approximation scheme (PTAS). In this work, we design an algorithm that produces a coupling with entropy $H(\operatorname{ALG}) \le H(\operatorname{OPT}) + \varepsilon$ in running time $n^{O(\operatorname{poly}(1/\varepsilon) \cdot \operatorname{exp}(m) )}$: showing a PTAS exists for constant $m$.
Graphics 7
☆ The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups
We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $ρ(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_λ^2/τ$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
comment: preprint, 19 pages, 3 figures
☆ One Image is All You Need: Agentic One-Shot Image Generation via Text-Based World Models for Long-Tail Spatial Perception
Reliable spatial decision automation, such as autonomous driving and maritime surveillance, critically depends on robust visual perception. However, real-world spatiotemporal data exhibits severe heterogeneity, often manifesting as extreme long-tail distributions for safety-critical scenarios. This data scarcity induces dataset shift that degrades detection performance and pose safety risks. While synthetic data generation offers a potential solution, existing generative approaches, such as diffusion models and Generative Adversarial Networks (GANs), often lack explicit spatial grounding and structural constraints, resulting in spatial and physical inconsistencies in generated scenes. To address these challenges, we introduce WMGen-v1, an agentic text-based world model framework for long-tail spatial data generation. WMGen-v1 employs a Large Vision-Language Model (LVLM) to construct a structured scene representation from a single reference image, while a Large Language Model (LLM) performs guidance-based scene expansion under physical plausibility and commonsense constraints. Subsequently, conditioned on the structured semantic representations produced by this reasoning process, a diffusion model generates diverse and physically grounded long-tail training data. Experiments on internal industrial datasets, ROADWork, and LaRS benchmarks demonstrate that WMGen-v1 outperforms baseline approaches. Notably, detectors trained solely on WMGen-v1 synthetic data approach real-only performance on aggregate dataset-level metrics, highlighting its potential to alleviate long-tail data scarcity for downstream spatial perception.
☆ MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer
Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevent skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by +60%, reduces relative skin tone difference by -50% over Stable-Makeup, with a latency of 0.7s, while achieving expert acceptance rate of 94% across core facial identity preservation criteria.
♻ ☆ RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.
♻ ☆ Fluid Control with Localized Spacetime Windows
We present a physics-based fluid control method utilizing localized spacetime windows, extending force-based fluid control to substantially larger simulation scales. In many practical editing scenarios, user-specified objectives affect only a small region of an otherwise satisfactory simulation, resulting in optimal control force distributions that are highly sparse in both space and time. However, existing optimization-based fluid control methods typically solve for control forces over the entire spacetime domain, leading to unnecessarily high computational cost and poor scalability. Motivated by this observation, we restrict optimization to localized spacetime regions surrounding the edit of interest, significantly reducing the dimensionality of the control problem. Within this framework, control forces are parameterized on a coarse "floating" background grid, decoupling control degrees of freedom from simulation resolution and promoting smooth, physically plausible forces. We further analyze spacetime-window selection as a joint spatial-temporal problem. While the full problem can be formulated as a 2D search over spatial and temporal window extents, practical workflows can often leverage user-specified spatial regions and lightweight temporal-window selection strategies to reduce search cost. Our method enables a range of intuitive editing tasks, where sparse user inputs can induce coherent motion in surrounding fluid structures. We demonstrate the effectiveness and efficiency of our method with various 2D and 3D particle-based free-surface simulation examples.
♻ ☆ VEPHand: View-Efficient Photometric Hand Performance Capture at Scale
Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page are available at https://vephand.github.io/.
♻ ☆ MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing
We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
comment: Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E
Operating Systems 1
☆ LiveStack: OS Support for Cluster-Scale Full-Stack Live Simulation
Cluster-scale full-stack simulation is essential for evaluating distributed software stacks and emerging hardware components before deployment. Such simulation must achieve both full-stack fidelity for the unmodified production stack and the simulation performance required for iterative configuration exploration. However, no existing method achieves both. We present LiveStack, an OS-level approach to cluster-scale full-stack simulation built on top of the Linux virtualization stack. LiveStack comprises four subsystems: simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration. Together, they coordinate live and modeled components under shared simulated time while controlling interference among co-located live hosts. These mechanisms point toward simulation-native OS support, where simulation control and orchestration become core OS responsibilities.
Hardware Architecture 7
☆ A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model MICRO
This work presents a tool for the synthesis and simulation of probabilistic architectures for solving combinatorial optimization problems by mapping them to the Ising model. The proposed approach automatically constructs the Ising Hamiltonian and determines the number of probabilistic elements (p-bits) based on problem characteristics such as size and topology. Furthermore, the tool introduces an adaptive strategy for selecting the most suitable update algorithm among Gibbs Sampling, Simulated Annealing (SA), Simulated Quantum Annealing (SQA), and cluster-based methods. Experimental results using benchmark problems demonstrate improved convergence behavior and flexibility compared to fixed approaches. The proposed framework enables systematic evaluation of probabilistic computing strategies and supports the development of future hardware implementations based on MTJs and p-bits.
comment: ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026
☆ SPINE: A Fault Injection Profiler for Quantized Neural Networks under Accumulated Faults MICRO
Deploying deep neural networks at the edge demands efficient inference under strict cost and power constraints. Quantized neural networks address these demands by replacing floating-point parameters with low-precision integers, yet their weights remain continuously exposed to radiation-induced bit-flips during inference. Fault Injection can be used to simulate those environments, but existing studies fail to characterize how accumulated upsets translate into mispredictions under realistic memory layouts. This paper presents a GDB-driven profiling framework that injects cumulative weight bit-flips directly onto the target binary of edge CPUs, generating per-layer fault profiles without requiring model retraining or code modification. Evaluated across multiple topologies, quantization efforts, and memory layouts, the results indicate how selective hardening strategies should be applied to effectively protect neural networks.
comment: ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026
☆ PuDGhost: Experimental Analysis of Computation Result Corruption in Processing-using-DRAM Operations on Real DRAM Chips and Implications for Future Systems ISCA 2026
Processing-using-DRAM (PuD) is a promising computation paradigm that alleviates frequent data movement between main memory and processing units by using each DRAM column as a computation engine via simultaneous multiple-row activation (SiMRA). Unfortunately, DRAM density scaling may hinder PuD's benefits: denser cell arrays bring rows and columns closer, making regular DRAM operations susceptible to noise and interference from neighboring cells. Yet no prior work investigates whether interference from rows or columns not intended to participate in computation can compromise PuD robustness. In this work, we reveal PuDGhost, an interference phenomenon where a PuD operation in a given column produces erroneous results due to interference from 1) data in non-activated DRAM rows and 2) data in other columns that compute concurrently under the same SiMRA operation. PuDGhost violates the ideal picture that each column's computation depends solely on its own operand data, threatening future PuD systems. We present the first extensive characterization of PuDGhost using 96 real DDR4 DRAM chips from 12 modules, quantifying these two interference sources under various conditions. Among our 15 new empirical observations, we highlight two major results: 1) data in adjacent non-activated rows affects SiMRA outputs by up to 10% for random inputs, and 2) data in concurrently computing columns affects SiMRA outputs by up to 48% for random inputs. Guided by these findings, we propose countermeasures across multiple layers of the PuD computing stack. Specifically, we evaluate on real DDR4 DRAM chips: 1) robust column screening that reduces the risk of using unreliable columns in the presence of PuDGhost, and 2) a compute row layout that mitigates PuDGhost via dedicated rows between compute rows. Our solutions greatly improve PuD computation accuracy and provide a foundation for robust future PuD systems.
comment: To appear at ISCA 2026 (June 2026)
☆ CHERI-D: Secure and efficient inline object ID for CHERI temporal memory safety
We propose CHERI-D, an architectural extension to CHERI that supports efficient temporal memory safety. Efficient memory safety is an increasing priority for programming languages, operating systems, and hardware designs, and CHERI is a leading hardware/software system that provides native spatial safety and a foundation for temporal memory safety. Due to CHERI lacking intrinsic architectural support for temporal memory safety, the state-of-the-art CHERI temporal safety solution, Cornucopia Reloaded, is a software-based solution that provides use-after-reallocation (UAR) protections instead of the stronger use-after-free (UAF) mitigation, and suffers performance overhead due to delayed reallocation and revocation. CHERI-D associates object identification (ID) metadata with capability pointers to provide temporal integrity of allocations. CHERI spatial safety allows CHERI-D to store object IDs safely inline with allocation data, potentially within unused fragmentation. Evaluated in simulation and in hardware, CHERI-D significantly reduces the revocation overhead of Cornucopia Reloaded while allowing it to support strict use-after-free mitigation.
☆ Nanoscale memristive devices: Threats and solutions
Due to their incentivizing features, memristors are a promising candidate for replacing CMOS-based memories, which are faced with various functional challenges in deep submicron process technologies. Memristors are nonvolatile, have low leakage, and are dense in comparison to CMOS-based memories like SRAM. In this regard, resistive RAM (ReRAM) and spin-transfer-torque RAM (STT-RAM) memristors are distinguished among other memristor-based memory technologies, due to their superiority in process maturity and metrics such as memory operation energy, memory latency, and area. Hence, this chapter focuses on these two memristor-based memory technologies. Despite the good features of these types of memory, they suffer from some reliability threats. Reliability parameters affect each other, and examining their positive and negative effects has a significant impact on the effectiveness of the proposed solutions. In one view, the threats can be categorized into two classes: (1) read/write error and (2) soft error. In this chapter, we comprehensively describe these threats and present the state-of-the-art solutions that enable the widespread use of memristors, particularly ReRAM and STT-RAM, in different applications. Finally, we introduce the emerging ability of memristors as a computing unit aiming to minimize data restoration in computing, and we show how to perform logic and arithmetic computation in a crossbar array.
☆ FPGA-Accelerated Neuromorphic Vision System for Real-Time Orbital Object Detection
The escalating congestion in orbital space demands advanced monitoring solutions. This work presents a comprehensive open-source framework for neuromorphic resident space object (RSO) detection, adapting the foundational grid clustering algorithm for FPGA acceleration. The system integrates a single event-based camera (EBC) with a custom, distributed processing architecture, where rapid spatial quantization is executed in programmable logic (FPGA) and cluster formation is managed by a software client. We validate this architecture through systematic sampling of night-sky observations from the EVAS dataset, demonstrating 97% detection accuracy for RSOs. The implementation, which serves as a foundational toolkit for event-based FPGA processing, achieves efficient throughput with a total power consumption of 8.5 W and deterministic processing latencies below 62 ms. The architecture's energy efficiency and high-precision detection position it as a viable solution for distributed space surveillance networks.
comment: 9 pages, 12 figures
♻ ☆ ChatModel: Automating Reference Model Design and Verification with LLMs
As the complexity of integrated circuit designs continues to escalate, functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. Therefore, we introduce ChatModel, an LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 58.99% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Moreover, ChatModel accelerates the reference model design and validation cycles by an average of 7.11x over traditional manual approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.
Programming Languages 3
☆ QDSV: A Semantic Problem Representation and Multi-Backend Execution Framework for Quantum-Oriented Computation
Predicate-based computation over state spaces separates a problem specification from the backend that realizes it. Building on the model introduced in arXiv:2606.15027, this paper studies QDSV as a semantic, multi-backend execution framework for quantum-oriented computation. We describe how QDSV, QIntent, and Qruba connect declarative problem intent to a structured semantic representation, realize that representation under heterogeneous backend constraints, and report execution trace outputs that separate model-level semantic outputs from backend-specific observations. The framework supports execution modes that do not require the original problem to be authored as a circuit, while still allowing circuit-compatible artifacts when required. As a case study, we evaluate EEG ictal/interictal classification using prepared signal features from the Bonn and Delhi datasets. The study compares classical machine-learning references, a circuit-first variational quantum classifier baseline, QDSV simulator executions, and controlled IBM Quantum hardware runs. The paper does not claim general quantum advantage or superiority over classical machine learning. Its contribution is a semantic execution validation showing how a problem-first representation can remain stable across simulator and hardware realizations while retaining interpretable execution trace outputs.
comment: 12 pages, 1 figure, 6 tables
☆ OpenRath: Session-Centered Runtime State for Agent Systems
Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.
☆ VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving
LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.
Data Structures and Algorithms 11
☆ Guarded Epoch Bloom Filters for Sliding-Window Membership
Approximate membership queries in streams often need recent-window semantics rather than membership over all items ever seen. This paper studies guarded epoch Bloom filters, a sliding-window alternative to counting and stable Bloom filters. The structure partitions a fixed bit budget into rotating epochs, inserts only into the current epoch, clears whole segments at epoch boundaries, and keeps one additional guard epoch. This guard yields a deterministic live-window invariant: every item inserted in the last W positions remains represented, while rotation-induced stale retention is bounded by one epoch beyond the target window. We give the construction, prove its live-coverage and bounded-staleness properties, derive a false-positive approximation, and include a blocked variant that improves locality by confining probes to one block per epoch. Experiments cover 225 synthetic streaming configurations and 45 configurations from a timestamp-ordered web-server access-log stream. At 14 bits per live item, the guarded epoch filter reduces median synthetic false positives from 0.191 for a four-bit counting Bloom baseline to 0.02225 while preserving zero measured live-key false negatives. The method is not a replacement for exact deletion; it targets systems where bounded stale positives are acceptable but false negatives inside the live window are not.
comment: 6 pages, 1 table; ancillary files provided with code, tests, experiment scripts, CSV outputs, and validation logs
☆ Tractable Gap-Constraint Languages for Complex Event Recognition
For strings $u, D \in Σ^*$, a subsequence embedding of $u$ in $D$ is a function $e \colon \{1, 2, \ldots, |u|\} \to \{1, 2, \ldots, |D|\}$ with $e(i) < e(i+1)$ for every $i \in \{1, 2, \ldots, |u|-1\}$ and the $i$-th symbol of $u$ equals the $e(i)$-th symbol of $D$. A gap-constraint for $u$ is a triple $(i, j, L)$ with $1 \leq i < j \leq |u|$ and $L$ is a regular language over $Σ$. An embedding $e$ satisfies a gap-constraint $(i, j, L)$ if the factor of $D$ strictly between positions $e(i)$ and $e(j)$ is a word from $L$. We investigate the subsequence matching problem with gap-constraints, which is relevant in the context of complex event recognition (CER): given $u, D \in Σ^*$ and a set $C$ of gap-constraints, find an embedding of $u$ in $D$ that satisfies all gap-constraints from $C$. In general, subsequence matching is NP-complete and the only known tractable variants restrict the interval structure of the gap-constraints. In this work, we show that we can solve subsequence matching with gap-constraints with an arbitrary interval structure rather efficiently (in fact, optimally under SETH) in time $O(|D| (|u| + |C|))$ if the gap-constraint languages satisfy a property which we dub left-convexity: whenever $u v w \in L$ and $v \in L$, then also $uv \in L$. Left-convex languages are sufficiently expressive to model interesting real-world scenarios considered in CER, e.g., length constraints $L = \{w \mid a \leq |w| \leq b\}$ for $a, b \in \mathbb{N}$. We also show how our algorithm can be used in order to efficiently enumerate all satisfying embeddings, which is particularly relevant for possible applications in CER. Finally, we show how non-left-convex languages can lead to intractability, i.e., if in addition to length constraints we allow $\{aa, ε\}$ as the only non-left-convex constraint language, then the problem is NP-complete again.
comment: 50 pages
☆ Learning Augmented Exact Exponential Algorithms
The field of learning-augmented algorithms has demonstrated that machine-learned predictions can bypass worst-case lower bounds across a wide range of problems. So far, however, the focus has been almost exclusively on polynomial-time algorithms, where predictions improve competitive ratios, approximation guarantees, or running times. In this paper, we raise the question of whether predictions can push the frontier of exact exponential-time algorithms for NP-hard problems. We answer this question affirmatively by proposing a general approach that augments an entire family of state-of-the-art exact algorithms for a variety of subset selection problems. We show that a noisy predictor that is only marginally better than random guessing suffices to provably reduce the search space, and that the resulting runtime speedup scales smoothly with the prediction quality. Importantly, our algorithms require only pairwise independence of predictions or, alternatively, do not require the knowledge of the predictor's accuracy - both strictly weaker and more realistic settings than typically assumed.
☆ Compact multi-text index for circular Cartesian tree matching
Cartesian tree matching (CTM) is a structural pattern matching approach that identifies sequences with the same Cartesian tree topology, making it suitable for data with natural variability where exact comparisons carry little semantic meaning. While theoretical algorithms for CTM have been studied extensively, systematic empirical evaluations of practical implementations remain rare. This article presents an implementation of the Cartesian Extended Burrows-Wheeler Transform (ceBWT), a BWT-based index structure for CTM. The implementation supports both a dynamically extendable and a statically compressed index variant.
comment: Implementation of https://doi.org/10.1007/s00224-025-10257-4
☆ On (Non-)Isomorphism of Self-Dual Lattices and Codes
A recent line of work motivated by cryptographic applications has studied the complexity of the Lattice Isomorphism Problem (LIP). In this work, we study LIP on self-dual lattices $\cal{L} \subset \mathbb{R}^n$, which appear naturally in many applications. Our main results are a $2^{n/2 + o(n)}$-time randomized algorithm for LIP and a $\mathsf{coNP}$ protocol for LIP on a broad class of self-dual lattices. These results extend recent work on ZLIP, the problem of deciding whether a lattice is isomorphic to $\mathbb{Z}^n$. In particular, the former result extends the $2^{n/2 + o(n)}$-time algorithms for ZLIP of Bennett, Ganju, Peetathawachai, and Stephens-Davidowitz (Eurocrypt, 2023) and of Ducas (Des. Codes Cryptogr., 2024). The latter result extends the $\mathrm{ZLIP} \in \mathsf{coNP}$ result of Hunkenschröder (Math. Prog. Series A, 2024). Our results leverage two key structural properties of self-dual lattices $\cal{L} \subset \mathbb{R}^n$: (1) every such lattice $\cal{L}$ is isomorphic to $\cal{L}_0 \oplus \mathbb{Z}^r$ for some self-dual lattice $\cal{L}_0$ with $λ_1(\cal{L}_0)^2 \geq 2$, and (2) every such lattice $\cal{L}$ has \emph{characteristic vectors}, i.e., there exist vectors $\mathbf{w} \in \cal{L}$ such that for every $\mathbf{v} \in \cal{L}$, $\langle\mathbf{v}, \mathbf{w}\rangle \equiv \langle\mathbf{v}, \mathbf{v}\rangle \pmod{2}$. Our results use a line of work by Elkies and Gaulter on lattices with long shortest characteristic vectors, and can be strengthened assuming a positive answer to a related question of Elkies (Math. Res. Lett., 1995). We also study Permutation Code Equivalence (PCE) on self-dual codes, and we observe that similar structural properties imply a polynomial-time algorithm for PCE on certain such codes. This gives a natural class of codes with large hull for which PCE is easy.
♻ ☆ How fast can you find a good hypothesis? COLT 2026
In the hypothesis selection problem, we are given sample and query access to finite set of candidate distributions (hypotheses), $\mathcal{H} = \{H_1, \ldots, H_n\}$, and samples from an unknown distribution $P$, both over a domain $\mathcal{X}$. The goal is to output a distribution $Q$ whose distance to $P$ is comparable to that of the nearest hypothesis in $\mathcal{H}$. Specifically, if the minimum distance is $\mathsf{OPT}$, we aim to output $Q$ such that, with probability at least $1-δ$, its total variation distance to $P$ is at most $C \cdot \mathsf{OPT} + \varepsilon$. The optimal approximation for proper algorithms (where $Q \in \mathcal{H}$) is $C=3$ using $Θ(\log(n/δ)/\varepsilon^2)$ samples from $P$ and for improper algorithms (where $Q$ is not necessarily in $\mathcal{H}$) is $C=2$ using $\tildeΘ(\log(n/δ)/\varepsilon^2)$ samples from $P$. In the improper setting, the algorithm achieving $C=2$ [Bousquet, Braverman, Kol, Efremenko, Moran, FOCS 2021] runs in time which grows polynomially with $|\mathcal{X}|$ -- it does not run in finite time for real-valued distributions. A promising path towards improved runtime is to consider improper algorithms which output a mixture $Q$ of the hypotheses as such a distribution can be represented in $n$ words of memory. We show (1) a lower bound that no algorithm which outputs a mixture can achieve approximation better than $C = 3-2/n$ unless the number of samples is polynomial in $|\mathcal{X}|$, as well as (2) an algorithm which runs in time $\text{poly}(n)$ and achieves the same approximation guarantee. In the proper setting, [Aliakbarpour, Bun, Smith, NeurIPS 2024] provided an algorithm with $C=3$ running in $\tilde{O}(n/(δ^3\varepsilon^3))$ time. We improve this time complexity to $\tilde{O}(n/(δ\varepsilon^2))$, significantly reducing the dependence on the confidence and error parameters.
comment: Abstract abridged to meet arxiv requirements. This is the full version of a paper appearing at COLT 2026
♻ ☆ Face-hitting dominating sets in planar graphs: Alternative proof and linear-time algorithm
In a recent paper, Francis, Illickan, Jose and Rajendraprasad showed that every $n$-vertex plane graph $G$ has (under some natural restrictions) a vertex-partition into two sets $V_1$ and $V_2$ such that each $V_i$ is \emph{dominating} (every vertex of $G$ contains a vertex of $V_i$ in its closed neighbourhood) and \emph{face-hitting} (every face of $G$ is incident to a vertex of $V_i$). Their proof works by considering a supergraph $G'$ of $G$ that has certain properties, and among all such graphs, taking one that has the fewest edges. As such, their proof is not algorithmic. Their proof also relies on the 4-color theorem, for which a quadratic-time algorithm exists, but it would not be easy to implement. In this paper, we give a new proof that every $n$-vertex plane graph $G$ has (under the same restrictions) a vertex-partition into two dominating face-hitting sets. Our proof is constructive, and requires nothing more complicated than splitting a graph into 2-connected components, finding an ear decomposition, and computing a perfect matching in a 3-regular plane graph. For all these problems, linear-time algorithms are known and so we can find the vertex-partition in linear time.
comment: Appeared at SOFSEM 2026
♻ ☆ Successor-bispecial strings with minimum Burrows--Wheeler transform runs SP
We study successor-bispecial strings over an alphabet $Σ$ of size $σ$, a minimal-branching analogue of de Bruijn strings, and ask how few Burrows--Wheeler transform (BWT) runs are possible. In a de Bruijn string of order $k$, every $(k-1)$-gram has all $σ$ right-extensions; here, every $(k-1)$-gram has exactly two right-extensions, determined by a successor rule, which also forces two left-extensions. For order $3$, we construct an explicit family $B_σ^{(3)}$, for every $σ\geq 2$, whose cyclic BWT has $r_c = σ^2 + 2$ runs. A suitable terminated linearization has the same run count, $r = r_c = σ^2 + 2$, while the smallest suffixient set has size $χ= 2σ^2 + 1$. The ratio $χ/r = 2 - 3/(σ^2 + 2)$ nearly saturates the known bound $χ/r \leq 2$, which we have previously shown to be asymptotically tight. Compared with our earlier general construction, this improves the gap from $O(1/σ)$ to $O(1/σ^2)$. We also show that the order-$3$ pattern appears as a blockwise two-row projection of normalized linear-feedback shift register (LFSR) de Bruijn sequences over $\mathbb F_q$, when primitive trinomials $x^3 - x + c$ exist. For higher orders, we prove a general lower bound $r_c \geq σ^{k-1} + 2$ for every $σ\geq 3$ in the exact-length regime and analyze the boundary-merged higher-order candidate using the last-to-first (LF) permutation: it fails for $k = 4$ and all $σ\geq 3$, while verified $k = 5$ instances for $σ\in {3,4}$ yield $χ/r$ ratios exceeding $1.96$.
comment: 13 pages, 1 table. Revised version: terminology changed from successor right-special to successor-bispecial; added the exact-length lower bound and finalized the higher-order LF obstruction. Submitted to SPIRE 2026
♻ ☆ First-order optimal codes for an adversarial nanopore channel
We study error-correcting codes for an adversarial nanopore channel, where a $q$-ary string is first transformed by an inter-symbol interference channel with window size $\ell$ into a sequence of overlapping $\ell$-mers, and an adversary then corrupts this $\ell$-mer sequence by introducing at most $t$ edits. For the deletion-only nanopore channel, we show that the optimal redundancy of $t$-deletion-correcting codes of length $n$ lies between $t\log_q n+Ω(1)$ and $2t\log_q n-\log_q\log_2 n+O(1)$. We then give two explicit deletion-correcting constructions in the regime $t\leq \min\{(\ell-1)/2,(\ell+2)/3\}$. The first construction relies on generalized Reed-Solomon codes and has redundancy $2t\log_q n+Θ(\log\log n)$. The second is based on Sidon sets (or rather $B_t$ sequences) and has redundancy $t\log_q n+Θ(\log\log n)$, matching the lower bound to first order. We further extend the $B_t$-based approach to the edit channel, allowing insertions, deletions, and substitutions of $\ell$-mers. In the regime $t\leq \min\{(\ell-1)/4,(\ell+2)/6\}$, this gives explicit $t$-edit-correcting codes with redundancy $t\log_q n+Θ(\log\log n)$, which is first-order optimal.
comment: Updated title and abstract. Added order-optimal constructions. Corrected the constraint on $t$
♻ ☆ Quantum Pattern Matching in Generalised Degenerate Strings
A degenerate string is a sequence of sets of characters. A generalized degenerate (GD) string extends this notion to the sequence of sets of strings, where strings of the same set are of equal length. Finding an exact match for a pattern string inside a GD string can be done in $O(mn+N)$ time (Ascone et al., WABI 2024), where $m$ is the pattern length, $n$ is the number of strings and $N$ the total length of strings constituting the GD string. This is the best classical algorithm achieved so far, and no matching lower bound, neither unconditional nor conditional, has been shown. We make progress on this problem proposing a quantum algorithm that achieves running time $\tilde{O}(\sqrt{mnN})$, thus beating the current best classical solution. To the best of our knowledge, this is the first quantum algorithm proposed in the context of GD strings. We present our results starting from the framework of classical parallel computing, which we believe makes them intuitive to understand and possibly easy to generalise to other similar structures.
♻ ☆ Robust Detection of Planted Subgraphs in Semi-Random Models
Detection of planted subgraphs in Erdös-Rényi random graphs has been extensively studied, leading to a rich body of results characterizing both statistical and computational thresholds. However, most prior work assumes a purely random generative model, making the resulting algorithms potentially fragile in the face of real-world perturbations. In this work, we initiate the study of semi-random models for the planted subgraph detection problem, wherein an adversary is allowed to remove edges outside the planted subgraph before the graph is revealed to the statistician. Crucially, the statistician remains unaware of which edges have been removed, introducing fundamental challenges to the inference task. We establish fundamental statistical limits for detection under this semi-random model, revealing a sharp dichotomy. Specifically, for planted subgraphs with strongly sub-logarithmic maximum density detection becomes information-theoretically impossible in the presence of an adversary-despite being possible for some planted subgraphs in the classical random model. In stark contrast, for subgraphs with super-logarithmic density, the statistical limits remain essentially unchanged; we prove that the optimal (albeit computationally intractable) likelihood ratio test remains robust. Beyond these statistical boundaries, we design a new computationally efficient and robust detection algorithm, and provide rigorous statistical guarantees for its performance. Our results establish the first robust framework for planted subgraph detection and open new directions in the study of semi-random models, computational-statistical trade-offs, and robustness in graph inference problems.
comment: 38 pages, 2 figures
Graphics 8
☆ GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution
We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
☆ Building Drift: Documenting On-Site Construction Adaptations Across Material Lifecycles
In a circular economy for construction, reclaimed materials carry prior lives of use and go on to have post-lives in future buildings. Yet working with such materials introduces unpredictability that requires on-site improvisation, making their reuse challenging to document and scale across building lifetimes. Without documentation, the on-site adaptations that make construction with reclaimed materials possible leave collaborators, evaluators, and inheritors without the information they need to continue, assess, and reuse materials. We call the collective deviation of the physical state from the digital model through these adaptations "building drift." Through a case study, ReShelter, a reclaimed timber pavilion constructed in the forest, we develop a taxonomy for building drift that characterizes the collective deviation across building lifetimes: Tending the Site, Foraging for Fit, Interpreting the Material, Marking Measurements, and Coordinating Across Communities. To put our taxonomy for building drift into practice, we present Pentimento, a documentation tool that leverages video documentation and 3D Gaussian Splatting to spatially, temporally, and semantically represent on-site adaptations in relation to the designed model. Pentimento enables each stakeholder to navigate material histories in ways that reduce barriers to material reuse. Together, these contributions open pathways towards computational tools that support the on-site improvisation essential to construction with reclaimed materials, enabling more sustainable cycles of recovery, repair, and reuse.
comment: In submission
☆ RespGeomLib: A Reproducible Parametric Engine for Generating Analysis-Ready Human Airway Lumen Geometry
CT-derived airway models support pulmonary morphometry and airflow simulation, but are often limited by distal scan resolution and the need for substantial cleanup near bifurcations. Procedural alternatives are reproducible, yet many rely on stitched tubular primitives that introduce non-smooth junctions and poorly defined open boundaries. We present RespGeomLib, a reproducible parametric engine for generating analysis-ready human airway lumen surfaces from compact YAML specifications. The framework combines port-based assembly with implicit smooth-min junction blending to produce seamless junctions, while avoiding full-tree voxelization through analytic segments and local implicit extraction around bifurcations. Quantitatively, RespGeomLib yields cleaner junctions than a Boolean/stitch baseline and is substantially faster and more memory-efficient than whole-tree global implicit extraction. We further demonstrate morphometry-guided tree generation, controlled synthetic airway variants, and CFD-ready export with stable airflow simulation. RespGeomLib targets biomedical workflows requiring reproducible morphometry, controlled synthetic variants, and simulation-ready lumen geometry. The code is publicly available at https://nichula01.github.io/Respgeomlib/
comment: Accepted to Publication at 2026 IEEE Mercon
☆ Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting
Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA\_Dark and SHB\_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.
♻ ☆ Region-Aware Wasserstein Distances of Persistence Diagrams and Merge Trees
This paper presents a generalization of the Wasserstein distance for both 0th persistence diagrams and merge trees [21], [68] that takes advantage of the regions of their topological features in the input domain. Specifically, we redefine the comparison of topological features as a distance between the values of their extrema-aligned regions. It results in a more discriminative metric than the classical Wasserstein distance and generalizes it through an input parameter adjusting the impact of the region properties in the distance. We present two strategies to control both computation time and memory storage of our method by respectively enabling the use of subsets of the regions in the computation, and by compressing the regions' properties to obtain low-memory representations. Extensive experiments on openly available ensemble data demonstrate the efficiency of our method, with running times on the orders of minutes on average. We show the utility of our contributions with two applications. First, we use the assignments between topological features provided by our method to track their evolution in time-varying ensembles and propose the temporal persistence curves to facilitate the understanding of how these features appear, disappear and change over time. Second, our method allows to compute a distance matrix of an ensemble that can be used for dimensionality reduction purposes and visually represent in 2D all its members, we show that such distance matrices also allow to detect key phases in the ensemble. Finally, we provide a C++ implementation that can be used to reproduce our results.
♻ ☆ Maintaining Leiden Communities in Large Dynamic Graphs
Community detection is a foundational capability in large-scale industrial graph analytics, powering applications such as fraud-ring discovery, recommendation systems, and hierarchical indexing for retrieval-augmented generation. Among modularity-based methods, the Leiden algorithm has been widely adopted in production because it delivers high-quality communities with connectivity guarantees. However, real-world graphs evolve continuously, and timely community updates are needed to keep downstream features and retrieval indices fresh. Meanwhile, existing dynamic Leiden approaches recompute the communities whenever their vertices and edges change, thereby almost degrading to near-full recomputation under frequent updates. To alleviate the efficiency issue, we study the efficient maintenance of Leiden communities in large dynamic graphs and present a novel algorithm, called Hierarchical Incremental Tree Leiden (HIT-Leiden). We first provide a boundedness analysis showing that prior incremental Leiden methods can incur essentially unbounded work even for small updates. Guided by this analysis, we propose HIT-Leiden which effectively reduces the range of affected vertices by maintaining connected components and hierarchical community structures. Extensive experiments on large real-world dynamic graphs demonstrate that HIT-Leiden achieves community quality comparable to the state-of-the-art competitors while delivering speedups of up to five orders of magnitude over existing solutions. The production deployment results show that HIT-Leiden meets stringent latency requirements under high-rate updates at scale.
♻ ☆ Efficient upsampling for tensor-network and quantum-state encoded functions
Both tensor trains (TTs) and quantum states provide compressed representations of grid-structured data with potentially exponential compression power. We present a unified framework for upsampling data encoded in vector amplitudes, with efficient realizations in both classical TT and quantum settings. Starting from an \(n\)-core TT or an \(n\)-qubit state on a coarse grid with \(2^n\) points, the construction produces an \((n+m)\)-core TT or \((n+m)\)-qubit state on a finer grid with \(2^{n+m}\) points. In the TT setting, it supports interpolation, quasi-interpolation, augmentation, and synthesis through efficient low-rank contractions, with the added \(m\) cores retaining constant rank. For function-value encodings, the resulting interpolation satisfies an \(\ell^2\)-error bound independent of the number of added grid points, achieves exponential compression at fixed accuracy, and has a logarithmic complexity in the number of grid points. In the quantum setting, the refined state is prepared by a \(\mathrm{poly}(n,m)\)-size circuit using \(\log(p+1)\) ancillas, where \(p\) controls the smoothness of the quasi-interpolant; the corresponding error scales quadratically with the initial grid spacing. We validate our framework for tensor networks in one-, two-, and three-dimensional examples, including functions, derivatives, airfoil masks, and synthetic random fields such as three-dimensional turbulence. In particular, fractal fields can be generated directly in TT format with logarithmic memory and runtime. These results open a practical route to multiscale solvers, generative models, and geometry-aware algorithms on tensor-network and quantum platforms, with potential applications in scientific simulation, imaging, and real-time graphics.
comment: 19 pages, 9 figures
♻ ☆ Evolution & Foundation: AI Shares Creative Control
This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist role from that of intensive direct selection to one of system design; transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and providing a transparent insight into the AI-guided process.