Hardware Architecture 5
☆ IMS: Intelligent Hardware Monitoring System for Secure SoCs DATE
Wadid Foudhaili, Aykut Rencber, Anouar Nechi, Rainer Buchty, Mladen Berekovic, Andres Gomez, Saleh Mulhem
In the modern Systems-on-Chip (SoC), the Advanced eXtensible Interface (AXI) protocol exhibits security vulnerabilities, enabling partial or complete denial-of-service (DoS) through protocol-violation attacks. The recent countermeasures lack a dedicated real-time protocol semantic analysis and evade protocol compliance checks. This paper tackles this AXI vulnerability issue and presents an intelligent hardware monitoring system (IMS) for real-time detection of AXI protocol violations. IMS is a hardware module leveraging neural networks to achieve high detection accuracy. For model training, we perform DoS attacks through header-field manipulation and systematic malicious operations, while recording AXI transactions to build a training dataset. We then deploy a quantization-optimized neural network, achieving 98.7% detection accuracy with <=3% latency overhead, and throughput of >2.5 million inferences/s. We subsequently integrate this IMS into a RISC-V SoC as a memory-mapped IP core to monitor its AXI bus. For demonstration and initial assessment for later ASIC integration, we implemented this IMS on an AMD Zynq UltraScale+ MPSoC ZCU104 board, showing an overall small hardware footprint (9.04% look-up-tables (LUTs), 0.23% DSP slices, and 0.70% flip-flops) and negligible impact on the overall design's achievable frequency. This demonstrates the feasibility of lightweight, security monitoring for resource-constrained edge environments.
comment: The final version is accepted for publication at the Design, Automation & Test in Europe Conference (DATE) 2026
☆ InterPUF: Distributed Authentication via Physically Unclonable Functions and Multi-party Computation for Reconfigurable Interposers
Modern system-in-package (SiP) platforms increasingly adopt reconfigurable interposers to enable plug-and-play chiplet integration across heterogeneous multi-vendor ecosystems. However, this flexibility introduces severe trust challenges, as traditional authentication schemes fail to scale or adapt in decentralized, post-fabrication programmable environments. This paper presents InterPUF, a compact and scalable authentication framework that transforms the interposer into a distributed root of trust. InterPUF embeds a route-based differential delay physically unclonable function (PUF) across the reconfigurable interconnect and secures authentication using multi-party computation (MPC), ensuring raw PUF signatures are never exposed. Our hardware evaluation shows only 0.23% area and 0.072% power overhead across diverse chiplets while preserving authentication latency within tens of nanoseconds. Simulation results using pyPUF confirm strong uniqueness, reliability, and modeling resistance under process, voltage, and temperature variations. By combining interposer-resident PUF primitives with cryptographic hashing and collaborative verification, InterPUF enforces a minimal-trust authentication model without relying on a centralized anchor.
comment: Accepted to IEEE International Symposium on Hardware Oriented Security and Trust (HOST) 2026
☆ OpenACM: An Open-Source SRAM-Based Approximate CiM Compiler DATE 2026
Yiqi Zhou, JunHao Ma, Xingyang Li, Yule Sheng, Yue Yuan, Yikai Wang, Bochang Wang, Yiheng Wu, Shan Shen, Wei Xing, Daying Sun, Li Li, Zhiqiang Xiao
The rise of data-intensive AI workloads has exacerbated the ``memory wall'' bottleneck. Digital Compute-in-Memory (DCiM) using SRAM offers a scalable solution, but its vast design space makes manual design impractical, creating a need for automated compilers. A key opportunity lies in approximate computing, which leverages the error tolerance of AI applications for significant energy savings. However, existing DCiM compilers focus on exact arithmetic, failing to exploit this optimization. This paper introduces OpenACM, the first open-source, accuracy-aware compiler for SRAM-based approximate DCiM architectures. OpenACM bridges the gap between application error tolerance and hardware automation. Its key contribution is an integrated library of accuracy-configurable multipliers (exact, tunable approximate, and logarithmic), enabling designers to make fine-grained accuracy-energy trade-offs. The compiler automates the generation of the DCiM architecture, integrating a transistor-level customizable SRAM macro with variation-aware characterization into a complete, open-source physical design flow based on OpenROAD and the FreePDK45 library. This ensures full reproducibility and accessibility, removing dependencies on proprietary tools. Experimental results on representative convolutional neural networks (CNNs) demonstrate that OpenACM achieves energy savings of up to 64\% with negligible loss in application accuracy. The framework is available on \href{https://github.com/ShenShan123/OpenACM}{OpenACM:URL}
comment: Accepted by DATE 2026
☆ RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs HPCA 2026
Graph Random Walks (GRWs) offer efficient approximations of key graph properties and have been widely adopted in many applications. However, GRW workloads are notoriously difficult to accelerate due to their strong data dependencies, irregular memory access patterns, and imbalanced execution behavior. While recent work explores FPGA-based accelerators for GRWs, existing solutions fall far short of hardware potential due to inefficient pipelining and static scheduling. This paper presents RidgeWalker, a high-performance GRW accelerator designed for datacenter FPGAs. The key insight behind RidgeWalker is that the Markov property of GRWs allows decomposition into stateless, fine-grained tasks that can be executed out-of-order without compromising correctness. Building on this, RidgeWalker introduces an asynchronous pipeline architecture with a feedback-driven scheduler grounded in queuing theory, enabling perfect pipelining and adaptive load balancing. We prototype RidgeWalker on datacenter FPGAs and evaluated it across a range of GRW algorithms and real-world graph datasets. Experimental results demonstrate that RidgeWalker achieves an average speedup of 7.0x over state-of-the-art FPGA solutions and 8.1x over GPU solutions, with peak speedups of up to 71.0x and 22.9x, respectively. The source code is publicly available at https://github.com/Xtra-Computing/RidgeWalker.
comment: Accepted by HPCA 2026
☆ SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency single-pass attention inference algorithm, where every (kt, vt) in the KV cache is processed exactly once in a uniform per-token pipeline without score materialization, blockwise softmax, or a second pass, thereby enabling fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which enables high precision attention and low precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on the edge accelerator, the SwiftKV Attention algorithm achieves a 7.16* speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48*; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98* compared with state-of-the-art works.