2026 IEEE International Conference on Acoustics, Speech, and Signal Processing

4-8 May 2026, Barcelona, Spain

Industry Program

Featuring world-class keynotes, industry expert speakers, technology trend panels, standard overviews, in-depth workshops, specialized seminars, and interactive demos, the Industry Program offers a comprehensive agenda designed to connect innovation with real-world applications. It consists of five well-organized components:

Industry Keynotes
Industry Expert Speakers
Industry Panels
Industry Workshops
Spotlight Sessions
Show and Tell Demos

Industry Keynotes

Program Schedule

Date	Time	Speaker	Talk Title
May 6	8:00-9:00	Dr. Tara Sainath Distinguished Research Scientist Google DeepMind	Audio Processing with Large Language Models
May 7	8:00-9:00	Dr. Hamid Sheikh Vice President Samsung Research	Latest Trends in AI Signal Processing for Consumer Experiences
May 8	8:00-9:00	Dr. Chris Dick Senior Distinguished Engineer NVIDIA	AI-Native 6G: Building the Wireless Stack for AI-RAN and 6G Innovation

Talk 1: Audio Processing with Large Language Models

Wednesday, 6 May 2026, 8:00 – 9:00
Location: Auditorium
Speaker: Dr. Tara Sainath, Distinguished Research Scientist, Google Deep Mind

Abstract: Large Language Models (LLMs) have recently introduced a paradigm shift in Machine Learning, including in audio processing tasks. In automatic speech recognition, we have improved understanding quality across a large number of languages, trained within one universal model. In generation, prompt-based capabilities and naturalness have opened up new paradigm shifts. In translation, we can translate numerous language pairs in real time. Finally, in dialogue, we can build single end-to-end systems that understand and reply to user queries. These capabilities are transforming audio products across the industry. This talk will detail the research and product impact of audio LLMs at Google.

Dr. Tara Sainath holds an S.B., M.Eng, and PhD in Electrical Engineering and Computer Science from MIT. Her expertise in speech recognition and deep neural networks led to a 5 year stint at IBM T.J. Watson Research Center, and currently fuels her work as the Lead of Gemini Audio at Google DeepMind. There, she focuses on the integration of audio capabilities with large language models (LLMs).

Her technical prowess is recognized through her IEEE Fellowship and awards such as the 2021 IEEE SPS Industrial Innovation Award and co-recipient of the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) as well as the Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. Dr. Sainath’s leadership is exemplified by her roles as Program Chair for ICLR (2017, 2018) and her extensive work co-organizing influential conferences and workshops, including: Interspeech (2010, 2016, 2019), ICML (2013, 2017), and NeurIPS 2020. Her primary research interests are in deep neural networks for speech recognition.

Talk 2: Latest Trends in AI Signal Processing for Consumer Experiences

Thursday, 7 May 2026, 8:00 – 9:00
Location: Auditorium
Speaker: Dr. Hamid Sheikh, Vice President, Samsung Research

Abstract: AI has been the driving force in technological innovation for the past several years, significantly impacting signal processing in consumer devices such as cameras, audio systems, computer vision, and health and fitness devices. The rapid development of generative techniques has also necessitated constant adaptation of these techniques to adapt to evolving consumer expectations. The challenges coming from small form factor mobile devices such as small batteries, low light and signal strength, and limited compute, continue to be hard physical constraints against which algorithmic methods need to keep improving. This talk will go over the challenges and opportunities that come from enabling the latest generation of AI techniques to improve consumer experiences.

Dr. Hamid Sheikh is Vice President R&D at Mobile Processor Innovation Team in Samsung Research America, where he leads a team of AI computational imaging experts developing algorithms and features for Samsung flagship smartphones. Prior to Samsung, he was Camera technology lead in OMAP Platform Business Unit at Texas Instruments Inc, where he led the development of ISP and camera solution. He completed his PhD from the University of Texas at Austin, where he researched image and video quality assessment algorithms, and contributed to the development of the world famous Structural Similarity Metric (SSIM). He is an IEEE Fellow, and winner of two technology Emmy Awards for his contributions to image and video quality, and numerous Samsung internal awards for his contributions to camera technology innovations.

Talk 3: AI-Native 6G: Building the Wireless Stack for AI-RAN and 6G Innovation

Friday, 8 May 2026, 8:00 – 9:00
Location: Auditorium
Speaker: Dr. Chris Dick, Senior Distinguished Engineer, NVIDIA

Abstract: The transition to 6G marks a fundamental evolution from connectivity-centric networks to AI-native wireless infrastructure. In this keynote, we explore how AI-RAN principles redefine the wireless stack, embedding intelligence across signal processing, radio access, core networks, and edge computing. We highlight how data-driven learning, accelerated computing, and tight AI–network co-design enable autonomous operation, dynamic optimization, and multi-modal sensing-communication. AI-native 6G provides the foundation for scalable innovation, supporting applications that extend beyond traditional communications and positioning wireless networks as intelligent, adaptive platforms for the next decade.

Dr. Chris Dick is a Senior Distinguished Engineer at NVIDIA working on the application of artificial intelligence and machine learning to 5G and 6G wireless. In his 35 years working in signal processing and communications, he has worked on silicon and software products for radar, 3G, 4G and 5G baseband DSP and Docsis cable access, FPGA architecture, and ML silicon accelerator architecture. His research is in the area of 6G architecture, ML model architecture, channel coding, design flows for GPU signal processing systems, digital front-end (DFE) technology for cellular systems with a particular emphasis on ML algorithms for RF signal processing.

Prior to moving to Silicon Valley in 1998 he was a tenured academic in Melbourne Australia for 13 years. He has over 250 publications and 100 patents. From 1998 to 2020 he was a Fellow and the DSP Chief Architect at Xilinx.

Chris is the chair of the AI-for-RAN Working Group in the AI-RAN Alliance, and he serves on the Board of Directors for the OpenAIRInterface Software Alliance. In 2018 he was awarded the IEEE Communications Society Award for Advances in Communication for research in the area of full-duplex wireless communication.

Industry Expert Speakers

Program Schedule

Location: Auditorium

Session order	Session’s Theme	Date	Time	Proposal ID
1	Integrated Sensing, Communications and Radar	May 5	14:00–16:00	603, 631, 641
2	Audio Technology and Consumer Audio Innovation	May 6	9:00–11:00	628, 629, 632
3	Speech and Audio AI Systems	May 6	14:00–16:00	609, 626, 640
4	Edge AI and Efficient Intelligence	May 7	9:00–11:00	612, 616, 639
5	AI Security and Strategic Applications	May 7	14:00–16:00	624, 633
6	Applied AI Systems and Defense Applications	May 8	9:00–11:00	618, 627, 638

603: The Future of Mobile Communications: Challenges and Opportunities

Bio: Dr. Yongxing Zhou (IEEE Fellow) received his Ph.D degree in Tsinghua University, China in 2002. He is now a professor with Beijing University of Posts and Telecommunications (BUPT). He has been a prestigious wireless telecom technical leader in the development of MIMO and smart spectrum access innovative technologies for 4G and 5G cellular communication standards and products. Before joining BUPT in September 2025, he was with Huawei as Principal Scientist of Standard & Patent and Huawei Device communication standard Principal Expert. He has led 16 years of Huawei 4G/5G/Device Communication research and standardization including multiple paradigm shifts such as Linear Combination Double MIMO Codebook, Flexible Bandwidth Part (a.k.a. BWP), beam based cellular initial access and the world’s first commercial satellite-smart phone direct communication protocols that have shaped landscape of 5G mobile communications.
Dr. Zhou’s current research interest includes 5G Advanced and 6G technologies such as Artificial Intelligence (AI) Communication, Satellite Communication, Integrated Sensing and Communication (ISAC), cell free MIMO and reconfigurable intelligent surface (RIS).
Dr. Zhou has more than 200 Granted U.S. Utility Patents and several tens of Standard Essential Patents (SEP) have been widely used in global commercial 4G/5G base stations and terminals.

Abstract: 5G was expected to be the key driving force behind the shift of digitalization of all industries and a more intelligent, connected world in addition to the provision of much improved mobile broadband (MBB) internet user experience. However, ten years later it seems both industry digitalization transformation and MBB business are quite behind the schedule and expectation. For example, it is worth mentioning the whole MBB revenue of telecom industry in China has been declining since 2023. There would be no future of mobile communications if those deadly challenges were always put aside and could not be met effectively.
Now 6G is on the way. Without exception, the newer generation of mobile communication has always been highly expected and 6G has been given the label of “enabler of connected intelligence”. It has been conceived with many attractive properties such as ultra-fast, extremely low latency, ubiquitous connectivity, network and edge sensing, along with inferencing capabilities and distributed learning. However, how to meet and tackle with the aforementioned unprecedented challenges still remained unclear.
This talk explores challenges and opportunities of mobile communications at the moment, how 6G technology components combined with device and network API exposure, transform wireless networks from passive data conduits into active enablers of high value businesses (e.g. AI services). Meanwhile, the cutting edge technologies in the areas of AI communication, Satellite Communication, advanced Waveform&Modulation, Coding and Sequences are also deeply addressed in the context.

631: Enhanced Integrated Sensing and Communications for 6G with AI and multi-modal fusion

Bio: Christian Ibars Casas has been with the Nvidia Aerial project since its inception. He led the vRAN implementation of 5G, focusing on accelerating the physical (PHY) and MAC layers on NVIDIA GPUs, and is now engaged in 6G pathfinding in the areas of ISAC and AI-for-RAN. Before joining NVIDIA, he was the standards and technology development lead at Cohere Technologies, where he contributed to the development of a novel physical layer based on the Orthogonal Time Frequency Space (OTFS) waveform. He has held research and engineering roles at Intel and served as a wireless researcher at the CTTC in Barcelona, working in the areas of cellular communications, satellite, and ultra-wideband. Christian has a Ph.D. degree E.E. and over 20 years of experience in wireless communications R&D, has published over 100 technical papers and is the inventor of numerous patents in the field.

Abstract: Wireless networks have become ubiquitous and are a basic pillar of modern society. Integrated sensing and communications (ISAC) present the opportunity to reuse this massive infrastructure for an entirely new purpose, with potentially huge societal advantages, and revenue opportunities for network operators. However, the unproven nature of ISAC technology presents technical challenges that need to be addressed by the wireless community. In this talk we explore such challenges and describe potential solutions facilitated by the recent onslaught of new technologies, namely virtualized RAN, AI, and the availability of NVIDIA GPUs for accelerated computing. We will review standardization progress for ISAC in 6G, AI-based solutions for reliable, accurate and efficient multi-modal fusion, and functional architectures for ISAC-capable wireless networks. Finally, we will present NVIDIA’s open source reference design for ISAC.

641: “Modern radar systems, challenges and perspectives: a personal viewpoint”

Bio: Alfonso Farina, received the Dr. Ing. degree in electronic engineering from the University of Rome La Sapienza, Rome, Italy, and a PhD HC in ICT (Information and Communication Technology) from the University of Palermo, Palermo, Italy. From 1973 to 2014, he worked at Selex ES (Finmeccanica) until he became Senior Vice President (SVP) and Chief Technology Officer (CTO). Currently, he is President of the Leonardo Radar & Sensors Academy and past President of the Academy of Underwater and Sensor Systems of the Leonardo Electronics Division. A multifaceted engineer, scientist, and university professor (Royal Academy of Engineering (UK), Académico Correspondiente de la Real Academia de Ingeniería de España, European Academy of Science, and National Academy of Engineering (USA), he is among the 2% top world scientists and one of the world's leading experts on radar systems. He authored or co-authored more than 1000 publications. His h-index is 62, with 14,673 citations (Scopus, Sep. 2025).

https://en.wikipedia.org/wiki/Alfonso_Farina
https://scholar.google.it/citations?user=A0jgvksAAAAJ&hl=it

Abstract: The lecture focuses on the development of modern radar systems in the background of Alfonso Farina’s carrier and technical achievements. Target tracking systems, Multistatic architectures, adaptivity, cognitive radar, green radar, waveform diversity and design, electromagnetic spectrum management, space-time adaptive processing (STAP), synthetic aperture radars (SAR) are among the most important modern applications which will be reviewed, highlighting the profound transformation driven by the convergence of advanced signal processing, high performance computing, and increasingly complex operational scenarios. Contemporary radars are expected to deliver unprecedented levels of accuracy, resilience, and spectral efficiency, not without significant challenges, such as the progressively congested e.m. spectrum, the demand for low-cost, low-power and highly integrated platforms, from automotive radars to spaceborne constellations, pushes technology toward new materials, digital front ends, and AI driven processing pipelines.
In this personal viewpoint, the envisaged future of radar lies in embracing heterogeneity, of sensors, processing layers and mission profiles, in a perspective that invites a shift from radar as a standalone instrument to radar as an adaptive, collaborative node within a broader sensing ecosystem.

628: Immersive Audio via Headphones: Status and New Solutions

Bio: Prof. Dr.-Ing. Karlheinz Brandenburg (IEEE life fellow) is a world-renowned inventor and entrepreneur best known as the co-inventor of the MP3 and AAC audio coding standards. He holds Dipl.-Ing. and Dipl.-Math. degrees in electrical engineering and mathematics, and a Dr.-Ing. degree in electrical engineering from the Friedrich-Alexander-Universität, Erlangen-Nürnberg, Germany. He is currently CEO of Brandenburg Labs GmbH, a startup company specializing in immersive audio technologies.
Following times as a Postdoctoral Member of Technical Staff at AT&T Bell Laboratories in Murray Hill, U.S.A., and again at Friedrich-Alexander-Universität, he joined the Fraunhofer Institute for Integrated Circuits IIS, Erlangen, as head of the Audio and Multimedia Department. He is the founding director of the Fraunhofer Institute for Digital Media Technology IDMT, Ilmenau, where he retired in July 2019. He is a retired full professor of TU Ilmenau as of May 2020.
For his pioneering work in digital audio coding, perceptual measurement techniques, wave field synthesis, psychoacoustics, and analysis of audio and video signals, he received several awards. Among them are the IEEE Masaru Ibuka Consumer Electronics Award, the German Future Award (shared with his colleagues), and the Audio Engineering Society Silver Medal Award. He is a member of the Hall of Fame of the Internet Society and the IEEE Consumer Electronics Association. He received honorary doctorate degrees from the Universities Koblenz-Landau, Germany, Leuphana University of Lüneburg, Germany and Universitat Politéchnica de Valencia, Spain.

Abstract: Currently, nearly all audio productions for movies and music are mastered to both traditional two channel stereo sound and newer multichannel formats. These include Dolby Atmos, MPEG-H/Sony-360 and Eclipsa audio. To listen to these sounds, until now many loudspeakers (e.g., 12 at an 7.1.4 arrangement) are necessary. There has been work on headphone reproduction for immersive audio, but all current software-based systems pale when compared to a loudspeaker-based solution.
The talk will both introduce the basics of spatial audio including standards like MPEG-H, and give details of earlier advancements in headphone-based reproduction over the last 50 years.
The company of the presenter introduced a state-of-the-art headphone-based system in early 2025. This excels in the plausibility of the reproduction of sounds via headphones. It is currently targeted at professional users at mixing studios or post-production facilities as well as at schools educating the next generation of audio engineers. There are already plans for a second generation of these products for users craving the best audio fidelity, who want to have a headphone-based solution.
A multichannel system with the best plausibility should reproduce sound in a room so the listener doesn’t have the feeling of wearing headphones at all. This necessitates the virtual reproduction of sound in a way that it is nearly impossible to distinguish from real sound sources like loudspeakers in the room.
Technically, this process relies on measuring the room’s acoustics so that rendering can be performed using a simplified model of the room. Additionally, fast and accurate 6DoF (six degrees of freedom) head tracking is required, allowing the algorithm to estimate the room’s impulse responses from a virtual sound source to the listener’s headphone position. Thus, the main cues necessary to trick the human brain into perceiving virtual sources as real are reconstructed as needed for a plausible reproduction of sounds via headphones.

629: Building Dolby Atmos FlexConnect: From Research Project to Product

Bio: Daniel Arteaga is a Senior Staff Researcher at Dolby Laboratories in Barcelona, where he has been part of the research organization since 2014. His work at Dolby spans a broad range of audio technologies, including spatial audio, room mapping, acoustic inference, audio capture and rendering, and the integration of physical acoustics with machine learning, optimization, and signal‑processing methods. His research has contributed to several core Dolby initiatives, including next‑generation spatial rendering technologies and adaptive audio systems such as Dolby Atmos FlexConnect.

Before joining Dolby, he transitioned into audio technology in 2008 after completing a Ph.D. in physics at the University of Barcelona, where he specialized in quantum effects in gravitation. Prior to joining Dolby, we contributed to algorithms for Imm Sound, a spatial‑audio start‑up later pioneer in bringing spatial audio to cinemas, acquired by Dolby in 2012.

Parallel to his research activity, Daniel is an Associate Lecturer at Universitat Pompeu Fabra, where he teaches spatial audio and has supervised numerous bachelor’s, master’s, and doctoral students.

Abstract: Dolby Atmos FlexConnect (DAFC) is an adaptive home‑audio solution designed to operate with an arbitrary number of loudspeakers placed freely throughout a room, without requiring predefined layouts. The system begins by performing acoustic mapping to localize each speaker, estimating its position, distance, and orientation. These estimations feed into a flexible rendering framework that renders the Dolby Atmos or multichannel soundtrack in real time to maintain a faithful and stable soundstage across highly asymmetric speaker configurations.

A significant portion of the DAFC effort centered on developing algorithms that remain reliable under practical, uncontrolled conditions. The mapping pipeline had to handle uncertainty, ambiguities, reflections, and occasionally contradictory information arising from real acoustic environments. The rendering stage required preserving the spatial and timbral characteristics of the original mix despite irregular geometries and variable device capabilities. In both areas, we employed a combination of signal‑processing methods, optimization‑based approaches, and data‑driven techniques to find the most appropriate solution for each problem. Much of the algorithmic insight arose from exploring how to generalize solutions beyond ideal test settings and how to design behaviors that degrade gracefully when assumptions inevitably break.

Equally important were the engineering and product‑focused challenges that shaped the final system. Bringing DAFC from a research prototype to a deployable technology required adapting algorithms to embedded hardware with strict constraints on compute, memory, and power, all influenced by device Bill of Materials (BOM) costs. Integration with partner devices introduced variability in acoustics, transducer performance, and wireless synchronization characteristics. This demanded extensive stress testing, data collection, and automated evaluation across diverse room configurations. These constraints guided algorithmic simplifications, robustness strategies, and the overall system architecture.

By presenting both the algorithmic foundations and the practical steps required to transform them into a shipping product, the talk aims to provide ICASSP attendees with insights relevant to real‑world algorithmic and engineering work: how theoretical approaches evolve under practical pressures, how robustness becomes a central design principle, and how interdisciplinary iteration enables the deployment of complex audio technology in everyday environments. The goal is to convey lessons that are technically grounded, broadly applicable, and motivating to researchers developing systems intended for real use.

632: From ANC to Blood Pressure: How Earbuds Are Becoming Multimodal Health Sensors

Bio: Alessandro is the deep technical architect behind OmniBuds, driven by a relentless focus on turning scientific breakthroughs into robust, manufacturable products. He combines rigorous sensing science with pragmatic engineering, building full‑stack systems that span novel biosensors, embedded algorithms, and clinically grounded applications. Over more than a decade at the intersection of wearable sensing and embedded systems, he established the Earable Computing field, driving both academic insight and industry impact. His work has led to over 50 peer‑reviewed publications and 20+ patents in pervasive computing, embedded AI, and wearable sensing. Previously, he was a Principal Research Scientist and Tech Lead at Nokia Bell Labs, where he led the team that created the original OmniBuds platform, earning the Top Nokia Innovator Award.

As Founder & Chief Scientific Officer of OmniBuds, he is responsible for the research strategy that harnesses in-ear sensing and analytics to transform how major cardiovascular conditions are managed, leading the technology and engineering programmes that translate cutting-edge research into production-ready medical devices.

Alessandro holds a PhD in Computer Science from the University of Cambridge and remains active in the research community through organising committee roles and delivering keynotes at leading mobile and sensor systems conferences (https://scholar.google.com/citations?user=5y9eO9MAAAAJ&hl=en).

Abstract: True wireless earbuds have become one of the most ubiquitous computing platforms we wear—yet we still mostly use them for audio. This talk provides a general overview of the emerging technology area of in-ear sensing and explains why earables are poised to follow the smartwatch trajectory: from convenience features to sensor-first, health-oriented systems. The ear is a particularly attractive measurement site because it combines stable skin contact, rich local vasculature, natural vibration damping from the musculoskeletal system, and a built-in acoustic interface for privacy-preserving feedback and just-in-time interventions.

I will survey the field from early earable computing platforms (IMU + microphone) that enabled head-gesture, activity, diet (chewing/drinking), and facial-expression inference using lightweight time-frequency features and compact classifiers, to modern multimodal earables that add optical biosensing (PPG), temperature, multiple microphones, storage, and on-device machine learning. For the signal processing community, the key point is that earables are constrained, real-time, multi-rate sensing systems where algorithm design must co-optimize accuracy, latency, memory, and energy. I will discuss architecture: low-power scheduling, on-device fusion, and privacy-preserving processing that avoids cloud dependence.

The core technical deep dive is in-ear photoplethysmography (PPG). PPG can yield heart rate (HR), heart-rate variability (HRV), oxygen saturation (SpO₂), and respiration rate (RR) from a small number of LEDs and a photodiode, but in-ear deployment introduces unique constraints: anatomical variability, comfort, seal quality, ambient-light leakage, and motion artifacts. I will outline an end-to-end vital-sign extraction pipeline (bandpass filtering, normalization, peak detection, AC/DC component estimation, and windowed estimation), then zoom into the most consequential design choice: placement behind-the-ear (BTE), in-the-ear (ITE), or in-the-canal (ITC). Using controlled recordings across rest and motion (speaking, walking, running), I will show why ITC placement typically reduces error variability for HR/HRV/SpO₂ via stronger skin–sensor adhesion and improved ambient-light shielding, while also emphasizing the remaining Achilles’ heel: motion artifacts that can drive errors from about 15% (speaking) up to about 30% (running). This motivates co-design of ear-tip mechanics, seal-quality estimation, and artifact suppression.

Finally, I will highlight a novel multimodal route to cuffless blood pressure (BP) sensing from a single earbud: combining in-ear PPG with an in-ear microphone that can capture attenuated heart sounds (S1/S2) when the ear canal is well sealed (occlusion effect). Because acoustic propagation through tissue is far faster than blood flow, timing features such as vascular transit time (S1→PPG peak) and ejection time (S1→S2) become measurable from one location. I will describe a low-compute, time-domain pipeline for marker extraction and a personalized calibration procedure; in a pilot with 10 healthy participants and induced BP changes (slow breathing and cold pressor), we observed SBP MAE = 2.50 ± 2.20 mmHg and DBP MAE = 2.42 ± 2.62 mmHg under controlled conditions.

The session closes with a roadmap of open problems for ICASSP: robustness under ANC and music playback, in-the-wild validation at scale, and principled personalization without sacrificing generalization. The goal is to equip attendees with a practical taxonomy of in-ear signals—and a research agenda where signal processing can turn earables into “tiny but mighty” health platforms.

609: Why Blind Audio Processing Fails: Edge Intelligence for Content-Aware Audio Processing in Streaming Media

Bio: Sunil Bharitkar received his Ph.D in Electrical Eng. from the University of Southern California (USC) in 2004 and is presently at Samsung Research. His research spans neural networks, multimedia signal processing, speech & audio processing, & machine learning. From 2016 to 2020, he was at HP Labs, involved in audio/speech and deep learning. From 2011 to 2016, he was the Director of Audio at Dolby, leading/guiding research in audio, signal processing, haptics, machine learning, & helping standardization at ITU, SMPTE. He co-founded the Intel-funded company Audyssey Labs in 2002, where he served as VP/Research, responsible for inventing new technologies. His research has led to technologies that are in Samsung, HP, Dolby, and Audyssey products. He also taught in the Dept. of Electrical Engineering at USC.

Sunil has published over 75 peer-reviewed papers and holds over 30 patents in the areas of signal processing, acoustics, and neural networks, and has authored a textbook, "Immersive Audio Signal Processing" from Springer-Verlag. He is a recipient of the Best Paper Award at the 2023 154th Audio Eng. Soc. Conv., Outstanding Paper Award at the 2019 9th IEEE Int. Conf on Consumer Electronics, and a Best Paper Award at the 2003 37th IEEE Asilomar Conf. on Signals, Systems, &Computers.

He is a reviewer for various IEEE journals and conferences, the Journal of the Acoustical Society of America, EURASIP, and the Journal of the Audio Eng. Soc. He has also been on the Organizing and Technical Program Committees of the 2008 & 2009 European Sig. Proc. Conference (EUSIPCO), the 57th AES Conference. He has also served as an invited tutorial speaker at the 2006 IEEE Conf. on Acoust. Speech and Sig. Proc. (ICASSP).

Sunil is an IEEE Senior Member, member of the ILB of the IEEE Systems, Man, & Cybern. Society, the Acoustical Soc. of Amer. (ASA), European Association for Signal and Image Processing (EURASIP), and the Audio Eng. Soc. (AES). Sunil is a PADI diver, plays the Digi, plays football (soccer), and is a FIFA-licensed coach for a USA club.

Abstract: Streaming platforms such as YouTube, Vimeo, and Youku host a diverse mix of content, including movies, music videos, news, documentaries, and advertisements, with hundreds of hours of video uploaded every minute. While this diversity enables scale, it introduces a fundamental challenge for consumer electronics (CE) devices: content-agnostic audio and video post-processing can degrade user experience and violate artistic intent.

A concrete example arises in audio rendering. Movies are typically authored in multichannel formats such as spatial 5.1 or 7.1.4, while music content is intentionally produced in stereo. To conserve transmission bandwidth, multichannel movie audio is often downmixed to stereo for streaming. Edge devices such as TVs, soundbars, smartphones/tablets, and audio-video receivers rely on post-processing DSP pipelines to upmix the stereo to spatial audio for movies. However, when stereo music is blindly upmixed using the same processing chain, audible artifacts are introduced, and artistic intent is compromised. This is just one example of content-agnostic signal processing that can degrade the quality of experience. This motivates the need for real-time multimedia content classification directly on edge devices to guide appropriate post-processing decisions by identifying the content type.

This talk presents an industry-driven view of multimedia classification for edge deployment, focusing on real-world constraints rather than algorithmic benchmarks alone. We will briefly review current state-of-the-art deep learning approaches for audio-visual classification and explain why many frame-level, audio- or vision-centric models—while accurate—are impractical for deployment on CE hardware due to latency, memory footprint, and power constraints. Model compression techniques such as pruning and quantization help, but often at the cost of degraded classification reliability in real-time settings. We also address why server-side classification is not a viable alternative at scale. Embedding content class metadata upstream would require changes to existing MPEG standards and would be incompatible with billions of legacy decoders already deployed worldwide. These realities shift the problem decisively toward edge-based intelligence.

The core of the presentation introduces a low-latency, low-memory edge deep-learning classifier that leverages linguistic metadata in the MPEG standard, specifically video titles rather than raw audio or video frames. This approach achieves high classification accuracy with a fraction of the computational cost of conventional deep learning pipelines. We will also present the latest extensions that enable multilingual support via neural machine translation, allowing the solution to interface with the DSP audio signal processing chain and to deploy across global streaming ecosystems.

The session concludes with a video demonstration of the classifier deployed on a TV connected to a YouTube streaming service for content-aware processing to improve the quality of experience in practice.

Attendees will leave with a concrete understanding of how edge-intelligence and signal-processing can be co-designed to improve the quality of experience while readily scaling to billions of CE devices.

626: How to Build Realistic Acoustic Datasets For AI-audio Training Using Simulated Data From Validation to Large-Scale Datasets

Bio: Steinar Guðjónsson is a Senior Acoustic Engineer and Team Lead at Treble Technologies. With extensive experience in room acoustics modeling, spatial audio, and binaural rendering, Steinar specializes in bridging measured and simulated acoustics to enable scalable, physically grounded dataset generation. His work supports applications ranging from architectural acoustics to machine learning, with a focus on accuracy, efficiency, and practical deployment of simulation pipelines.

Abstract: This presentation demonstrates a practical, end-to-end workflow for building realistic acoustic datasets using modern simulation tools, emphasizing validation, efficiency, and scalability. Using realistic acoustic datasets in audio AI training as opposed to simplified approaches in empty shoebox rooms has shown to give 25% lower WER through improved speech enhancement.

The talk begins with validation of the Treble SDK simulation engine by comparing simulated results against measurements from the Benchmark for Room Acoustic Simulations (BRAS) database. A controlled single-reflection scenario is analyzed across multiple boundary conditions to establish physical accuracy and confidence in the underlying solver.

Next, a full Head-Related Transfer Function (HRTF) is simulated by importing a 3D scan of a KEMAR mannequin into Treble SDK. The audience will see how dense, high-quality HRTFs can be generated rapidly and efficiently.

Building on this foundation, a realistic room environment is created, and the simulated HRTF is used to render binaural room impulse responses at arbitrary listener positions. These results are validated through direct comparison with measured data.

Finally, the presentation scales these techniques to large dataset production. Using 1,000 procedurally generated living room models—each containing five source locations and fifty receiver positions—the pipeline produces a total of 250,000 binaural impulse responses. This final step illustrates how physically validated simulation can enable diverse, large-scale datasets suitable for training and evaluating spatial audio and machine learning systems. These large datasets can then be used to build a realistic audio scenes, including multiple speakers and background noises, ideal for training and evaluating complex audio AI enhancement algorithms.

Attendees will gain concrete insights into validated simulation workflows and practical strategies for generating realistic acoustic data at scale and why that matters.

640: From Text to Talk: How New Speech LLMs Will Make Conversations with Technology More Natural

Bio: Kyu Jeong Han earned his Ph.D. from the University of Southern California in 2009 and currently serves as Senior Director of Applied Science at Oracle Cloud Infrastructure. Dr. Han has held leading research roles at organizations including IBM, Ford, Capio.ai (acquired by Twilio), JD.com, ASAPP, and AWS, where he has made significant contributions to the advancement of speech and language technologies.

Dr. Han is an engaged member of the speech and language processing community. He regularly serves as a reviewer for top journals and conferences organized by IEEE, ISCA, and ACL, and since 2019, has been a member of the Speech and Language Processing Technical Committee of the IEEE Signal Processing Society. Dr. Han is a seasoned speaker and educator, having delivered tutorials at Interspeech 2021 and 2025, as well as a survey presentation at Interspeech 2019, sharing emerging insights and best practices with the global research community. In 2018, he received the ISCA Best Paper Award for his work published in Computer Speech & Language between 2013 and 2017, recognizing his outstanding impact on the field.

Dr. Han’s ongoing research and leadership are dedicated to driving innovation in speech recognition and natural language processing, with a particular focus on scalable, real-world applications.

Abstract: The rapid rise of Large Language Models (LLMs) has redefined the boundaries of natural language understanding and generation, propelling advances across machine learning, conversational AI, and human-computer interaction. However, LLMs, while remarkable in text-based tasks, inherently overlook the vibrant complexity of spoken communication—where meaning is interwoven with emotion, prosody, timbre, and speaker individuality. For the ICASSP community, which sits at the forefront of speech, signal processing, and audio research, the evolution from text-only LLMs to models natively bridging speech and language stands as a defining technical frontier.

This talk spotlights Speech Large Language Models (SpeechLLMs)—a novel class of models that move beyond the traditional ASR → LLM → TTS pipeline by directly learning from and generating speech waveforms. SpeechLLMs fuse the representational power of LLMs with the rich acoustic and prosodic informatics of speech. This paradigm shift resolves persistent bottlenecks faced by cascaded systems: information loss during conversion, compounded errors across modules, and latency that limits real-time interaction. By integrating raw audio processing with end-to-end context-aware generation, SpeechLLMs capture nuances such as emotion, speaker traits, and conversational dynamics, enabling new forms of expressive, natural dialogue.

The technical content will delve into the architectures and training strategies that empower SpeechLLMs—from self-supervised audio representation learning and sequence-to-sequence modeling, to tokenization techniques that merge acoustic and semantic information. Real case studies will illustrate how SpeechLLMs enable capabilities like real-time speaker turn-taking, emotional tracking, and cross-lingual voice interaction. This talk will review pioneering benchmarks and evaluation frameworks, offering a candid look at open research questions around scalability, bias, and robustness.

ICASSP attendees will see how SpeechLLMs not only push the envelope in foundational areas like neural signal processing, end-to-end modeling, and multimodal learning, but also open up broader interdisciplinary collaborations across audio, NLP, and user experience design. The relevance and novelty for ICASSP is clear: as generative models become increasingly universal, the integration of speech into the LLM ecosystem will fuel fundamentally new applications in assistive technology, global communication, accessibility, and interactive media.

Participants will gain technical insights into this fast-emerging landscape as well as concrete inspiration. The session is designed to motivate not just application-oriented engineers, but also researchers interested in core algorithms, data representation, and theoretical challenges. By surfacing both the current limitations and the promise of SpeechLLMs, this talk invites ICASSP’s diverse audience to join in shaping the future of conversational AI—one that doesn’t just generate text, but listens, responds, and connects through speech as naturally as humans do.

Join us to explore how SpeechLLMs will unlock the next frontier of conversational intelligence, energizing future research and redefining how we interact with and through technology.

612: Signal Processing Inspired AI for Sensing - an Industry Perspective

Bio: Arpan Pal has more than 33 years of experience in Intelligent Sensing, Signal Processing &AI, Edge Computing and Affective Computing. Currently, as Distinguished Chief Scientist and Research Area Head, Embedded Devices and Intelligent Systems, TCS Research at Tata Consultancy Services (TCS) India, he is working in the areas of Connected Health, Smart Manufacturing, Smart Retail and Remote Sensing.

Arpan has filed 225+ patents (out of which 140+ granted in different geographies) and has published 200+ papers and book chapters in reputed conferences and journals. He is listed among the top 15 innovators in India in terms of patents. He is two times winner of Tata Group top Innovation award in Tata Innovista under Piloted technology category. He is recipient of Distinguished Alumnus award from IIT Kharagpur, India and is awarded as a Fellow of Indian National Academy of Engineers (FNAE) in 2025.

Prior to joining TCS, Arpan had worked for Indian Defence R&D lab DRDO as Scientist for Missile Seeker Systems and in Rebeca Technologies as their Head of Real-time Systems. He is a B.Tech and M. Tech from IIT, Kharagpur, India and PhD. from Aalborg University, Denmark.

He is on the editorial board of notable journals like ACM Transactions on Embedded Systems and has been on different organizing committees of notable conferences like IEEE Sensors, IEEE APSCON, ICPR and IEEE Percom. He has written three complete books on IoT, Digital twins in Manufacturing and Application of AI in Cardiac Disease screening. He has featured in interviews in IEEE Signal Processing and IEEE Computer Society newsletters and has given invited talks in various industry forums and workshops of ICASSP, ICIP and Percom.

Linked In - http://in.linkedin.com/in/arpanpal
Google Scholar - http://scholar.google.co.in/citations?user=hkKS-xsAAAAJ&hl=en

Abstract: Sensing is the key for creation of new perception modalities in cyber-physical system (CPS) applications – it provides the right data with relevant markers for the downstream AI inferencing applications - this is well-understood in the context of IoT and CPS.

In this talk we introduce the concept of “AI for Sensing" – integration of advanced ML techniques at every stage of the sensing workflow, from transducer data acquisition/ calibration to signal enhancement/ denoising to signal representation/ fusion to meet diverse sensitivity/ specificity/ resolution/ dynamic range requirements for a given sensor type. This is a sensor-specific yet application-agnostic soft-sensing pipeline that can be computed on-board the sensing device. It needs to be lightweight enough to be embedded in the sensing device and low latency enough to enable closed-loop acquisition and calibration strategies, such as adaptive sampling, auto-gain-control, auto-filtering, beam steering and auto-calibration.

The talk will cover ML/DNN based signal enhancement/ denoising; attention/ auto-encoder based learning of embeddings for signal representation from signal features followed by multi-modal early-fusion; and RL based closed-loop control for acquisition/calibration and will give evidence of how this pipeline is sensor type specific and is not dependant on the application. The talk will present a novel idea of how such a pre-trained pipeline can be created in a lightweight edge-deployable manner (low latency/ low memory/ low power) with examples from real life industry applications and how such a pipeline can be adopted for different make/model /configuration of the same sensor type. The output signal representation can be considered as unsupervised or pre-trained; can be considered as a master sensor-specific feature representation that can be seen as an equivalent of token or vocabulary in the sensing context. This can be used by any downstream AI/ML models pipeline to learn application specific features via application focussed supervised learning. The whole idea will be explained with real-life examples of ECG Sensing for cardiac conditions, Microwave radar based sensing for concealed imaging/ in-body imaging; Acousto-optic sensing for heat susceptibility; Quantum sensing for high-resolution/ high sensitivity electromagnetic field ; Nano-sensing based physiological sensing for early disease screening.

The talk will outline how such sensor specific chips can lead to Sensor Specific Intelligent chips (SSICs) that can embed both the transducer and a "AI of Sensing" pipeline of data acquisition/calibration to signal enhancement/denoising to signal representation/fusion and present why such systems can be of real value for deployable systems.

The talk will also briefly cover how AI can be used for design of sensing systems with an example of Plasmonic sensor design for micro-plastics detection in water. It will present how a Generative AI based system can be used for structural design of complex Plasmonic sensors.

AI for sensing transforms traditional, handcrafted pipelines into intelligent, self-adapting closed-loop systems towards software-defined, scalable, and explainable sensing representations that are application-agnostic and can be used by downstream application specific systems. This intelligent sensing elevates the performance/reliability of sensing and it uses Signal Processing based signal morphology understanding to make the AI-based representation syntactically interpretable.

616: Enabling End-to-End Ecosystem of Spatial-Temporal Gaussian Splatting

Bio: Guan-Ming Su received the Ph.D. degree from the University of Maryland, College Park. He is currently the Director of Research with the Dolby Laboratories, Sunnyvale, CA, USA. He is the inventor of more than 220 U.S./international patents and pending applications. He is one of the recipients of 2020 (72nd) Technology and Engineering Emmy Award and 2021 (73rd) Engineering Emmy Award Philo T. Farnsworth Award for the contribution to high dynamic range (HDR) and wide color gamut (WCG) video as Dolby Vision format. He received 2025 University of Maryland ECE Distinguished Alumni Award and 2025 APSIPA Industrial Distinguished Leader Award. His co-authored paper won the best industry paper award in IEEE ICIP 2025. He served in multiple IEEE international conferences, such as the TPC Co-Chair in ICME 2021, the Industry Innovation Forum Chair in ICIP 2023 and 2025, and the General Co-Chair in MIPR 2024 and 2025. He served as a VP for industrial relations and development in APSIPA, from 2018 to 2019. He has been serving as the Vice Chair for Conference in IEEE Technical Committee on Multimedia Computing (TCMC), since 2021. He served as an Associate Editor for APSIPA Transactions on Signal and Information Processing, IEEE MultiMedia Magazine, and now IEEE Transactions on Circuits and Systems for Video Technology.

Abstract: in the past few years, Gaussian Splatting has become the most promising volumetric video representation. For real-world scenario, a great amount of research efforts has been conducted in many perspectives for better spatial-temporal multi-modal scene reconstruction and more effective deployment. In this talk, we will first introduce the fundamental representation of Gaussian Splatting, including attributes, construction, rendering, and optimization. Then, we will overview the recent development of Gaussian Splatting along the end-to-end ecosystem from content capture, content creation, content delivery, to content consumption. More specifically, on the content capture stage, we will address issues and solutions for multi-view camera set up in both spatial and temporal domain to assist good quality model building. On the content creation stage, to enrich the multi-modal experience, learned audio and semantics attributes from foundation models, such as CLIP, DINO, etc., are embedded into the Gaussian Splatting primitives to enable the joint audio-visual-semantics representation. Volumetric video editing to enhance perceived experience is also an important tool along the pipeline. On the content delivery stage, instead of explicitly coding dozens of attributes per Gaussian, implicit methods such as using coarse geometry representations, 2D plane projection, and MLP to leverage the conventional 2D video codec to reach high photorealistic quality and streamable bit rate will be discussed. On the content consumption stage, language and semantics guided methods are presented to enable interactive 3D scene navigation and efficient physics-aware multi-modal rendering. At the end of this talk, we will present the latest international standardization efforts and highlight future research trends. In summary, the goal of this talk is to enable the ICASSP 2026 attendees to understand the latest technology for Gaussian Splatting in both theories and applications. The other goal is to enable more discussion for the volumetric video research in ICASSP 2026, and hope this talk can motivate attendees to identify the potential research topics along those directions and inspire more innovative solutions and technical papers to ICASSP 2027.

639: Language Models on Microcontrollers: Achieving Cloud-Class AI in <32MB

Bio: Niall Lyons is a Senior Staff Machine Learning Engineer at Infineon Technologies, where he leads the edge language model development team responsible for pre-training, post-training, and deployment of the breakthrough Nexus model family. His work has achieved 3rd place globally on HuggingFace's LLM Edge leaderboard, demonstrating that sophisticated language models can run on microcontrollers with under 32MB of memory.
With a background spanning explainable AI, WiFi systems, computer vision, speech processing, and audio, Niall bridges the gap between ML research and production deployment on resource-constrained embedded platforms. He has published multiple papers across leading conferences and holds 26 filed patents in machine learning and embedded systems.
At Infineon, Niall's focus on data-centric AI and hardware-software co-design has enabled AI capabilities on billions of edge devices previously dependent on cloud infrastructure, fundamentally expanding what's possible in industrial IoT, medical devices, and consumer electronics.

Abstract: Thirty billion microcontrollers ship annually, powering everything from industrial sensors to medical devices. Yet AI remains trapped in the cloud, inaccessible to the vast majority of embedded systems. We've broken this barrier.
Infineon's Nexus model family demonstrates that model efficiency begins with data, not just architecture. Our sophisticated data curation pipeline, quality filtering, synthetic data generation, and strategic dataset composition enables 8M-25M parameter models to achieve capabilities typically requiring 10-100x more parameters. Leveraging Infineon's unique position in embedded silicon for hardware-software co-design, we developed novel quantization techniques and hardware-optimized attention mechanisms that achieved 3rd place globally on HuggingFace's LLM Edge leaderboard. Our 25M parameter model outperforms 1.5B parameter models while ranking behind only 2B parameter solutions, a 60-80x parameter efficiency advantage. Running entirely on microcontrollers with under 32MB memory, this fundamentally changes what's possible at the edge.
This efficiency unlocks entirely new markets. Battery-powered industrial sensors now perform intelligent audio analysis for predictive maintenance, enabling extended multi-year operation without network connectivity. Medical wearables process patient speech on-device with sub-50ms latency, maintaining HIPAA compliance while enabling real-time health monitoring. Smart home devices achieve always-on voice activation at minimal power consumption, impossible with cloud-dependent solutions. Consumer devices with $1-5 BOMs gain AI capabilities previously reserved for premium products. These aren't theoretical, pilot deployments are currently validating these capabilities across industrial, consumer, and medical applications.
Our architecture extends beyond text to simultaneously enable speech-to-text, text-to-speech, and audio classification, all on the same PSOC™ Edge platform. A single microcontroller can understand voice commands, generate speech responses, and classify environmental sounds concurrently, transforming passive sensors into intelligent systems capable of rich environmental understanding through multiple modalities.
We present the complete pipeline: curated training datasets specifically designed for parameter efficiency, distributed training frameworks optimized for small-scale models, aggressive quantization and optimization techniques, and seamless deployment tooling for embedded conversion. This end-to-end workflow enables rapid iteration from research to production-ready firmware, addressing the critical gap that has historically prevented edge AI adoption at scale.
The implications extend beyond technical achievement. When AI inference costs approach zero and operate entirely offline, new business models emerge. Privacy-critical applications become viable. Battery-powered devices gain intelligence without infrastructure dependencies. This presentation demonstrates that the future of AI isn't solely about frontier models, it's about making sophisticated intelligence accessible everywhere, enabling billions of existing devices to gain capabilities previously impossible at their price point, power budget, and connectivity constraints.

624: Personalising GenAI: Fine-Tuning Models to Understand & Perform Specific Tasks

Bio: Ondrej is a senior machine learning researcher at Samsung R&D Institute UK, where he focuses on personalization of generative AI models. Before joining Samsung, he was a postdoctoral researcher at the University of Edinburgh, working on topics such as multimodal large language models, image generation, fairness, uncertainty calibration and out-of-distribution generalization. He did his PhD on Meta-Learning Algorithms and Applications at the University of Edinburgh.

Abstract: The proposed presentation will address a critical challenge in modern AI: adapting foundation models to individual user needs efficiently. As generative AI systems with large numbers of parameters become ubiquitous, personalization is essential for practical deployment across diverse applications. This talk will be highly relevant to ICASSP attendees, bridging signal processing, machine learning, and efficient algorithm design.

The technical content will cover parameter-efficient fine-tuning techniques, with particular focus on Low-Rank Adaptation (LoRA). I will explain how freezing foundation models and training only low-rank adapters enables cost-effective personalization. The presentation will detail LoRA implementation strategies, including for text, speech and image generation applications, demonstrating how these methods achieve state-of-the-art results with minimal computational overhead.

Beyond single-task personalization, a key frontier is enabling models to handle multiple specialized tasks efficiently. The talk will introduce advanced adapter merging techniques that address this challenge by combining multiple task-specific LoRA adapters into unified models. I will present three complementary research contributions from our lab. 1) Compositional Multi-tasking (EMNLP'25), which merges adapters for complex operations like translated summarization in a single inference pass. 2) LoRA.rar (ICCV'25), which uses hypernetwork-based merging to combine subject and style adapters for image generation with 4000x speedup. 3) D2C (ICASSP'26), a data-driven clustering method that identifies suitable groupings of task-specific adapters using minimal number of examples. It then merges adapters within each cluster to create compact multi-task adapters deployable on resource-constrained devices.

Such research direction is motivated by the practical limitations of current foundation models: full fine-tuning is computationally prohibitive, while "one-size-fits-all" models fail to capture individual user preferences and task-specific requirements. The need for efficient, personalized AI is particularly acute in resource-constrained environments where memory and compute are scarce, yet users demand sophisticated multi-task capabilities.

This talk will inspire the signal processing community by showcasing how efficient parameter adaptation techniques can democratize access to powerful AI systems. Attendees will gain practical insights into LoRA-based personalization, understanding adapter merging strategies, and applying these methods to real-world applications. The presentation will balance theoretical foundations with empirical results, providing both researchers and practitioners with actionable knowledge for advancing personalized generative AI.

633: AIGuardrail: A Skill-Driven, Zero-Training Security Framework for Telecom LLMs in Resource-Constrained Environments

Bio: Professional Profile
Security and Compliance Officer for AI at Huawei GTS, responsible for defining security and compliance requirements, standards, and technical capability roadmaps. Successfully delivered key AI safety components and services—including content risk control, topic restriction, IP protection, privacy protection, security sandboxing, and AI safety guardrails—enabling secure and compliant commercial deployment of over xx AI-powered products across xx+ countries and regions, with zero major security incidents since launch.
WorkExperience
• Assistant Chief Expert (Applied Security) — Huawei Technologies Co., Ltd., Nanjing, China, 2023– Present
• Senior Engineer — Huawei Technologies Co., Ltd., Nanjing, China, 2020– 2023
Education Experience
• Ph.D. in Computer Science and Technology Harbin Engineering University, Harbin, China, 2017– 2020
• M.S. in Computer Science and Technology Harbin Engineering University, Harbin, China, 2016– 2017
Research Interests
AI Security, Agent Security, Software Supply Chain Security, Application Security
4 Publications
4.1 Patents
[1] Method and Apparatus for Deploying Models, Electronic Device, and Storage Medium, CN Patent, 2025
[2] Method and Apparatus for Controlling Model Execution, CN Patent, 2025
[3] Large Model Safety Inspection Method and Related System, CN Patent, 2025
[4] Content Generation Review Method and Related System, CN Patent, 2025
[5] Method, System, and Apparatus for Generating Synthetic Data, CN Patent, 2024
[6] Software Detection Method and Related Device, PCT Patent, 2024
[7] System for Improving Efficiency of Open-Source Component Poisoning Attack Detection, CN Patent, 2024
[8] Code Analysis Method and Related System, PCT Patent, 2023
[9] Multimedia Data Processing Method and Apparatus, CN Patent, 2023
[10] Open-Source Vulnerability Analysis Method, Device, and Computer-Readable Storage Medium, CN Patent, 2022
[11] Security Inspection Method and Apparatus for Open-Source Component Packages, PCT Patent, 2022
[12] Open-Source Component Package Detection Method, Apparatus, and Device, CN Patent, 2021
4.2 Conference Proceedings
[1] An Exploration of Large Language Models in Malicious Source Code Detection, Di Xue, Gang Zhao, et al., in ACM
CCS2024 (Poster Presentation)
5 Invited Talks
[1] ”Interpretation of LLM Security, Compliance Requirements and Standards”, Huawei GTS Cybersecurity and Privacy
Protection Training Program, Nanjing, China, 2025
[2] ”Technical Solutions and Industrial Applications for Large Model Asset Protection”, Huawei Cybersecurity and Privacy
Protection Technology Conference, Wuhan, China, 2024
[3] ”Security Risks and Detection Techniques for Large Language Models”, China Mobile & Huawei Joint Workshop,
Beijing, China, 2024
[4] ”Attack and Defense: LLMs and Malware Detection under the Software Supply Chain”, Huawei Cybersecurity and
Privacy Protection Technology Conference, Beijing, China, 2023
6 Honors and Awards
[1] Huawei GTS Product and Architecture Competitiveness Award, 2025
[2] Huawei GTS Cybersecurity and User Privacy Protection Award, 2025
[3] Huawei ICT Outstanding Individual Award, 2024
[4] Huawei GTS Outstanding Expert, 2024
[5] Huawei Cloud Gold Team Award, 2023
[6] Outstanding Individual, Huawei Company System Technology Conference, 2023
[7] Huawei GTS Outstanding Innovation Practice Individual, 2021
7 Industrial Projects
[1] AI Safety Guardrails and Agent Security Project, Huawei, 2025– Present [Role: Project Leader]
[2] AI Content Safety and Application Security Project, Huawei, 2024 [Role: Project Leader]
[3] AI-Assisted Malware Detection Project, Huawei, 2023 [Role: Project Leader]
[4] Low-Code Orchestration Security Project, Huawei, 2022 [Role: Project Leader]
[5] Open-Source Software Supply Chain Poisoning Attack Detection Project, Huawei, 2021 [Role: Project Leader]
8 Professional Service
[1] Huawei AI Compliance Working Group, Product Security & Compliance Manager
[2] Huawei GTS AI Agile Delivery Working Group, Security & Compliance Officer
[3] Huawei GTS AI Governance Working Group, Security & Compliance Officer

Abstract: Large language models (LLMs) in the telecommunications domain are accelerating the digital transformation of emerging industries like low-altitude economies and IoT. However, AIGC (AI-Generated Content) faces critical safety risks—such as value misalignment, privacy leakage, and prompt injection. The unique business environment of the communication industry requires security solutions to operate with extremely low resource overhead, creating an urgent need for lightweight, low-cost security approaches.
Our solution: We introduce AIGuardrail, a zero-training, plug-and-play LLM safety guardrail derived from industrial deployment practices. By pioneering a "Security-as-a-Skill" paradigm, leveraging agent skills and prompt engineering, it establishes a skill-driven automated lifecycle for security compliance, drastically reducing reliance on expert personnel and intricate coding.
Key Innovations & Methodology
AIGuardrail integrates non-intrusively into AIGC processing pipelines, augmenting security without disrupting core business logic. By embedding structured safety guidelines and few-shot examples into system prompts, it enables low-latency first-token judgments for inference-time safety checks.
Harnessing the natural language orchestration of Agent Skills, complex compliance requirements are encapsulated as modular "Skills" spanning prompt authoring, optimization, testing, and deployment. Business users simply describe emerging risks in natural language, triggering the skill engine to auto-update safeguards for "hot-swapping" policies.
The safety adjudication of user inputs hinges on five pivotal elements:
(1) Role assignment: explicitly designate the model as a safety auditor and constrain it to perform only safety-review tasks.
(2) Safety guidelines: the guidelines cover known unsafe content in current AIGC applications (e.g., illegal content, IP infringement, privacy leakage, injection attacks) and instruct the LLM to identify potentially non-compliant queries or malicious attack intent in user inputs. For emerging threats or product-specific policies, AIGuardrail supports dynamic, in-runtime insertion and updates of safety guidelines.
(3) Global principle: To address challenges such as moderation in low-resource languages, code-injection risks, and false positives triggered by English abbreviations, AIGuardrail adopts a globally scoped prompting scheme that enables the model to construct a consistent safety decision criterion prior to inference. For example, by composing structured prompts with few-shot examples, we elicit the model’s inherent multilingual capability and enable safety screening for inputs written in low-resource languages.
(4) Moderation principle: AIGuardrail employs a chain-of-thought (CoT)–based hierarchical moderation mechanism，the moderation flow follows the priorities below. Block, Allow, Review.
(5) Output format: AIGuardrail adopts a first-token safety-moderation policy. The system completes compliance determination early in decoding, significantly reducing the latency and computational overhead induced by deep reasoning, making it well-suited for high-throughput, low-latency industrial deployments.
Experimental Results and Industrial Impact: AIGuardrail pioneers an agent-skill-driven paradigm for LLM security, automating a closed-loop from compliance specification to defensive execution by externalizing safeguards onto the model's inference. Deployed across 20+ production systems (including Qwen and DeepSeek variants) for six months, our solution comprehensively outperforms the current SOTA—Qwen3Guard—in core metrics: the overall detection rate reaches 90.2% (8-12% higher), with a False Positive Rate (FPR) of only 0.08% (∼10% lower)

618: AI/ML for defense applications, its impact and limitations.

Bio: Dr. Shubha Kadambe currently is a Principal Tech Fellow at Raytheon, a business unit of RTX. She has in-depth experience in the development of advanced and innovative machine learning (ML) and artificial intelligence (AI) algorithms for different defense applications. Shubha uses this background to lead engineers in the development of advanced solutions for many different problems (e.g., cognitive EW, activity-based intelligence) by applying ML/AI approaches. Currently Shubha is the PI for the EA sub-system development as part of the AFRL E-Gon program. She is also technical lead for couple of DARPA programs and Raytheon IRAD projects. Shubha was the PI for Cognitive EW programs of ONR Reactive Electronic Attack Measures (REAM) and Electromagnetic Maneuver Warfare Resource Allocation Manager (EMWRAM). Shubha actively mentor engineers and serves as a chair of the Autonomous Intelligent Systems (AIS) Community of Practice of RTX. Prior to joining Raytheon in February of 2013, Shubha worked at Rockwell Collins, Inc. where she led a team of engineers to develop, from concepts to prototype, a cognitive EW architecture and system to be part of a larger communication system. She was a program officer at the Office of Naval Research (ONR) prior to joining Rockwell Collins. At ONR, she managed signal/image processing and understanding, a multi-university research initiative, and Small Business Innovation Research (SBIR) programs. Additionally, Shubha has held various research positions at HRL Laboratories, Atlantic Aerospace Electronics Corporation and AT&T Bell Laboratories. Shubha’s technical credits include more than ninety refereed journal and conference papers, eight invited chapters, an IEEE video tutorial on Wavelets and its applications, 30 US patents and eight trade secrets. Shubha has taught graduate and undergraduate courses at California Institute of Technology, University of California Los Angeles, University of Southern California and University of Maryland Baltimore Campus. She is an active senior member of IEEE and has served as an associate editor of IEEE transactions on signal processing and technical committee member of many national/international conferences. She has also served as a technical session chair at these conferences.

Abstract: Applications of AI/ML in the commercial world is fast growing. However, its adaptation in to defense applications is slow. After working in the defense industry and in the area of AI/ML for decades we understand the differences between commercial and defense world. In this presentation would like to share that experience, and discuss the reasons why adaptation of AI/ML approaches to defense applications is slow to catch up. In particular, propose to cover the below topics during the presentation:
1. Differences between commercial and defense applications
2. Reasons for slow adaptation
3. Some of the defense problems where it is gaining some traction and
4. Issues that AI/ML approaches need to address for them to make a difference and have a significant impact.

This presentation and discussion would be of interest to ICASSP audience since it helps the community in understanding what needs to be worked on for AI/ML approaches to have a major impact in defense applications and how the community can help defense organizations to imbibe AI/ML approaches in solving their problems.

627: From Signals to Systems: Making AI Industrial-Grade Across the Engineering Lifecycle

Bio: Dr. Sanjukta Ghosh is a Senior Data and AI Leader at Siemens AG, with over 20 years of experience in designing and deploying complex intelligent systems across industrial, defense and aerospace, automotive, and life sciences sectors. She currently leads global multi-disciplinary teams that develop and deploy scalable, production-grade AI applications, focusing on the intersection of generative AI, deep learning, and classical signal processing.
Dr. Ghosh holds a Ph.D. in AI/ML and Computer Vision from Friedrich Alexander University (FAU) Erlangen-Nuremberg. Her career spans the full engineering lifecycle—from research in computer vision to developing underwater acoustic imaging and thermal imaging systems to architecting modern AI systems for mission-critical industrial applications.
A recognized expert in her field, she holds multiple patents in deep learning and has been a frequent contributor to the signal processing community, with several publications in IEEE ICASSP and ICIP. Her current work at Siemens focuses on bridging the gap between theoretical AI research and the rigorous reliability requirements of industrial cyber-physical systems.

Abstract: The rapid adoption of artificial intelligence (AI) has transformed many research areas in signal processing, yet deploying AI reliably at scale across real industrial engineering lifecycles remains a significant challenge. Industrial environments impose constraints that go far beyond benchmark performance: data scarcity and imbalance, data quality, heterogeneity of data sources, non-stationarity, strict reliability and safety requirements, explainability, latency, and long operational lifetimes. This talk explores how signal processing principles, combined with modern AI and emerging computational paradigms, provide a rigorous foundation for making AI truly industrial grade.
We present a lifecycle-centric view of engineering problems—from design and simulation, to manufacturing, deployment, operations and maintenance—and discuss some of the challenges that arise at each stage. Topics include representation learning for unstructured multimodal data, physics-informed and hybrid model-based/data-driven approaches, robust and adaptive learning under distribution shifts, and uncertainty quantification for decision-critical systems. Emphasis is placed on how classical signal processing concepts such as filtering, spectral analysis, system identification, and optimization continue to play a central role in addressing these challenges when integrated with AI.
Beyond AI, the talk also highlights the growing role of alternative computational approaches, including quantum-inspired algorithms and advanced optimization techniques, for tackling large-scale, combinatorial industrial problems. Beyond individual algorithms, the talk will emphasize system-level considerations for deploying signal-driven AI on an industrial scale.
By drawing on Siemens’ global research and real-world industrial examples from manufacturing, engineering design and simulation software, process industries and more, this presentation aims to bridge signal processing, AI and industrial domains. This talk aims to provide the signal processing community with insights into what it takes to move AI from prototypes to mission-critical industrial systems. The talk will conclude by outlining open research challenges and opportunities at the intersection of signal processing, AI, and cyber-physical systems in engineering—areas where the signal processing community can play a decisive role in shaping the next generation of industrial intelligence.

638: Real-Time Human–AI Collaboration for Trustworthy Conversational Agentic Systems

Bio: Mahnoosh holds a Ph.D. in Electrical Engineering (2013) with a career-long focus on advancing speech and language technologies. Following early research at AT&T Labs, she spent over a decade at Interactions, leading the state-of-the-art research, design, and deployment of human-in-the-loop and human–AI collaborative conversational systems. She is now at SoundHound AI, a leader in enterprise conversational intelligence, continuing to advance these systems at scale. Her work focuses on uncertainty estimation, real-time mitigation, human-in-the-loop workflows, and multi-agent conversational architectures, integrating signal processing, machine learning, and NLP to build robust human–AI systems. Beyond technical leadership, she actively bridges industry and academia through collaborations and participation in technical program committees. She also served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society. Her contributions advance both industry applications and research in conversational AI.

Abstract: As large language models (LLMs) continue to improve in accuracy and capability, they are increasingly deployed in user-facing conversational applications. Despite these advances, LLM-based systems remain susceptible to unpredictable behaviors, including hallucinations, unsafe outputs, and inconsistent or contextually inappropriate responses. In conversational multi-agent systems, such failures can propagate rapidly across interactions and agents, amplifying risk for both users and enterprises. Ensuring trustworthiness in these systems therefore requires not only more accurate models, but also effective strategies for real-time monitoring, evaluation, and mitigation to maintain reliable and safe operation.

Most current research and industrial practice focuses on offline evaluation, post-deployment monitoring, and periodic human review. While valuable, these approaches are insufficient for interactive systems operating under strict latency constraints, where delayed intervention can already result in user harm or degraded customer experience. A key open challenge is real-time mitigation in conversational multi-agent systems, where autonomous AI agents interact and coordinate with one another. Addressing this challenge requires architectures in which AI agents collaborate with human agents to generate, evaluate, and, when necessary, correct responses on the fly, ensuring both safety and high-quality user interactions.

This talk focuses on real-time human–AI collaborative architectures for building trust and managing risk in conversational generative AI systems. Drawing on our company’s decades of experience deploying large-scale, real-time conversational AI systems with human-in-the-loop for millions of customers in enterprise customer care environments, we explore how established principles of human–machine collaboration can be adapted and extended to generative and multi-agent systems, and where new design paradigms are needed to enable seamless interaction.

We describe system architectures in which AI and human agents bring specialized, complementary expertise to produce high-quality conversational responses. AI agents may include customer-facing agents responsible for communication and guidance, transactional agents handling bookings or payments, orchestration agents coordinating multi-agent workflows, and evaluation agents monitoring outputs for confidence, policy compliance, or risk. Similarly, human agents contribute diverse skills, including domain knowledge, familiarity with enterprise policies, and the ability to review and correct unacceptable or high-risk model outputs. This diversity enables nuanced, context-aware responses: AI agents provide real-time assistance through summarization, intent detection, emotion and sentiment analysis, and customer experience signals, while human agents can operate behind the scenes to review or approve AI-generated outputs without disrupting conversation flow. When automated mitigation is insufficient, human agents can seamlessly take over to de-escalate issues or handle high-risk scenarios. Beyond real-time collaboration, closed-loop feedback mechanisms leverage multiple end-to-end measures—including AI performance, efficiency, and customer experience—to continuously optimize the system. Reinforcement learning and adaptive escalation strategies allow both AI behavior and human workflows to improve over time, creating a dynamic, self-improving human–AI collaborative ecosystem.

The talk concludes with lessons learned from real-world deployments and data-driven insights, highlighting key trade-offs among accuracy, latency, cost, and user experience, as well as open research and standardization challenges at the intersection of signal processing, human-centered AI, and industrial-scale multi-agent conversational systems. The discussion emphasizes actionable guidance for designing real-time human–AI collaborative systems that integrate multiple AI and human expertise, effectively manage risk, and continuously improve through feedback. The goal is to provide researchers and practitioners with practical insights from real-world applications to build robust, scalable conversational AI systems capable of operating safely and efficiently in complex, dynamic environments.

Industry Panels

Program Schedule

Date	Time	Panel Title
May 5	16:30-18:30	Open Audio Codecs for the Next Generation of Immersive and Scalable Media
May 6	16:30-18:30	Industrializing AI-Native, Distributed, and Sustainable 6G with Open RAN and TN-NTN Integration
May 7	16:30-18:30	From Labs to Learners: Preparing the Next Generation Signal Processing Workforce through Industry-Academic Coalitions
May 8	14:00-16:00	Scaling Intelligence for the Smart Society: Human-Centric, Sovereign, and Efficient Digital Twins

Panel 1: Open Audio Codecs for the Next Generation of Immersive and Scalable Media

Tuesday, 5 May 2026, 16:30 – 18:30
Location: Auditorium

Moderator: Rémi Audfray, Meta, ACWG Co-Chair

Panelists:

Jan Skoglund: Google
Alan Silva: Spatial9
Jean-Marc Valin: Google
Chris Hold: Meta
Nick Zacharov: Meta

Abstract

The audio industry is at a pivotal moment: immersive experiences, spatial audio, and scalable streaming demand new, open, and royalty-free codec solutions. The Alliance for Open Media (AOM) Audio Codec Working Group (ACWG) is driving the development of the Open Audio Codec (OAC) and Open Audio Renderer (OAR) specifications to address these challenges. This panel will convene leading industry experts to discuss the technical, business, and standardization imperatives for open audio formats, with a special focus on spatial audio, real-time communication, and efficiency gains. The topics will include:

Limitations of Existing Open Audio Codecs for Immersive Audio
Innovations in Codec Algorithms
Advanced Renderer Design
Renderer Listening Tests and Perceptual Evaluation
Call to Action

Bio

Rémi Audfray is an Engineering Manager on the Media Core team at Meta, building audio technologies used by billions of people worldwide in audio/video calling, messaging, conversational AI, and entertainment across Messenger, Instagram, WhatsApp, Facebook, MetaAI, Wearables, and other applications. Rémi’s prior experience includes XR Audio at Reality Labs, Sound Technology Research at Dolby Labs, and AR audio at Magic Leap. He received his ‘Diplôme d’Ingénieur’ from the Ecole Centrale de Lyon (France), and MSc. in Music Technology from IUPUI (USA) in 2006. He is passionate about advancing the state of the art of audio technology in the service of great user experiences.

Jan Skoglund leads a team at Google in San Francisco, CA, developing speech and audio signal processing components, contributing to Google’s software products (such as Meet) and hardware products (such as Chromebooks). Jan received his Ph.D. degree in 1998 from Chalmers University of Technology in Sweden. His doctoral research was centered around low bitrate speech coding. After obtaining his Ph.D., he joined AT&T Labs-Research in Florham Park, NJ, where he continued to work on low bit rate speech coding. In 2000, he moved to Global IP Solutions (GIPS) in San Francisco and worked on speech and audio processing technologies, including compression, enhancement, and echo cancellation, which were particularly tailored for packet-switched networks. GIPS’ audio and video technology was integrated into numerous deployments by companies such as IBM, Google, Yahoo, WebEx, Skype, and Samsung. The technology was later open-sourced as WebRTC after GIPS was acquired by Google in 2011. Jan is an IEEE Senior Member involved in the Audio and Acoustic Signal Processing TC and the Speech and Language Processing TC, and he is an Associate Editor for IEEE Transactions on Acoustics, Speech, and Language Processing.

Alan Silva is Chief Technology Officer at Spatial9, where he leads research and development initiatives at the intersection of machine learning, distributed systems, and immersive media. He specializes in designing scalable, real‑world solutions that combine advanced algorithms with high‑performance computing to address complex data and AI challenges. He has held technical and research roles at Alcatel‑Lucent and Samsung, where his work resulted in several granted patents. He has also contributed to the growth of leading data and AI organizations, including Cloudera, H2O.ai, and Databricks, spanning open‑source technologies, large‑scale analytics, and enterprise AI platforms. Alan is a strong advocate for open source and actively contributes to collaborative projects that advance the state of the art and strengthen the broader engineering community. His current interests center on immersive audio, where he applies machine learning and artificial intelligence to develop adaptive, interactive soundscapes that respond to user context and behavior, enhancing both engagement and the perceived quality of musical experiences.

Jean-Marc Valin, Ph.D., is a Senior Staff Research Scientist at Google and a long-time contributor to the Xiph.Org Foundation. He received his B.Eng., M.Sc.A., and Ph.D. in Electrical Engineering from the University of Sherbrooke, Canada. He is a lead architect of the Opus and Speex audio codecs and also contributed to the AV1 video codec. His research focuses on speech and audio coding, neural vocoders (LPCNet, FARGAN), and deep-learning-based speech enhancement. He was previously at Amazon Web Services and Mozilla.

Chris Hold is a Research Scientist at Meta Reality Labs Research Audio, where he focuses on spatial audio capture, reproduction, and perception. His current research interests include head-worn microphone arrays and spatial audio coding. Chris earned his PhD from the Aalto Acoustics Lab, specializing in perceptually-motivated parametric coding of higher-order Ambisonics.

Nick Zacharov (D.Sc. (Tech.), M.Sc., B.Eng. (Hons.), C.Eng., FAES) is Perceptual Audio Evaluation Technical Lead at Meta Reality Labs, focusing on applied sound quality research, aerodynamics and ML-model development for wearables product development. With an academic background in electroacoustics, acoustics and signal processing, Nick has broad industrial experience in the audio profession spanning from mobile phone audio to AR/VR devices, professional studio monitor design to smart and VR glasses. Nick is the co-author of “Perceptual Audio Evaluation – theory, method and application”, and also editor/co-author of the book “Sensory Evaluation of Sound”. He has been an active member of the Audio Engineering Society and has more than 90 publications and patents to his name.

Panel 2: Industrializing AI-Native, Distributed, and Sustainable 6G with Open RAN and TN-NTN Integration

Wednesday, 6 May 2026, 16:30 – 18:30
Location: Auditorium

Moderator: Engin Zeydan: CTTC (Centre Tecnològic de Telecomunicacions de Catalunya)

Panelists:

Luis M. Contreras: Telefónica CTIO
Sihem Cherrared: Orange Innovation
Carles Navarro Manchón: Keysight Technologies
Maria A. Serrano: Nearby Computing

Abstract

As 6G research quickly advances from conceptual visions to large-scale experimental platforms and pre-commercial trials, the industry faces a critical question: how can AI-native, Open RAN-based, and TN-NTN-integrated architectures be deployed in a trustworthy, scalable, and sustainable way?

This panel brings together leading industrial stakeholders, operators, vendors, and applied research organizations to discuss the challenges and opportunities of industrializing 6G connectivity. A unified 6G architecture proposes a unified architecture based on Open RAN, distributed cloud-native orchestration, AI-driven control loops, intent-based communications for TN-NTN, and network exposure using frameworks such as CAPIF, CAMARA APIs. Although these concepts are well explored in research, their real-world deployment presents significant issues related to operational complexity, AI governance, interoperability, cost, sustainability, and regulatory compliance (AI Act, GDPR, CRA).

The panel will focus on six tightly coupled industrial themes:

From Architecture to Operations: Making Unified 6G Architecture Deployable at Scale
TN-NTN Convergence as a Commercial Service, Not a Research Demo
AI-Native Control Loops vs. Human Control: Where Should Automation Stop?
Interoperability Nightmares: Multi-Vendor Open RAN in Practice
Network Exposure in AI-Driven Networks
Sustainability Beyond KPIs: Is Unified 6G Architecture Actually Greener?

Bio

Dr. Engin Zeydan is a senior researcher at CTTC with extensive experience in AI-native network management, Open RAN, TN-NTN integration, and trust frameworks for 6G. He has contributed to multiple EU SNS JU projects (including UNITY-6G) and regularly collaborates with industry partners on experimental platforms, standardization, and large-scale validation.

Dr. Luis M. Contreras completed a six-year Telecom Engineer degree (M.Sc.) at the Universidad Politécnica of Madrid (1997), holds an M.Sc. on Telematics jointly by the Universidad Carlos III of Madrid and the Universitat Politècnica of Catalunya (2010), and a Ph.D. on Telematics by the Universidad Carlos III of Madrid (2021). Since August 2011 he is part of Telefónica I+D / Telefónica CTIO, working on SDN, transport networks and their interaction with cloud and distributed services, and interconnection topics.

Dr. Carles Navarro Manchón received the degree in telecommunication engineering from the Miguel Hernández University of Elche, Spain, in 2006, and the Ph.D. degree in wireless communications from Aalborg University, Denmark, in 2011. Since 2023, he has been Senior Researcher with Keysight Technologies, Spain.

Dr. Sihem Cherrared received her PhD degree in 2020 at the University of Rennes 1 France, in INRIA and ORANGE Labs, on the fault management of programmable multitenant networks. She is currently working as a senior R&D engineer on network management and automation at Orange Innovation.

Dr. Maria A. Serrano is a senior researcher in the R&I department at Nearby Computing. She received her PhD in Computer Architecture from the Technical University of Catalonia (UPC) in March 2019 and works on orchestration techniques in edge/cloud computing environments.

Panel 3: From Labs to Learners: Preparing the Next Generation Signal Processing Workforce through Industry-Academic Coalitions

Thursday, 7 May 2026, 16:30 – 18:30
Location: Auditorium

Moderator:

Arvind Rao: The University of Michigan, Ann Arbor
Yang Lei: HP Inc.

Panelists:

Ioannis Katsavounidis: Meta
Ivan Tashev: Microsoft
Gabriele Bunkheila: MathWorks
Marios S. Pattichis: IEEE Education Board
Ramani Duraiswami: University of Maryland

Abstract

As AI, signal processing, and intelligent sensing systems transition from research labs into mission-critical roles across industry sectors (including healthcare, mobility, energy, defense, media, and sustainability), workforce readiness has emerged as a pressing bottleneck. While ICASSP is home to world-class research, a key translational gap remains: how do we prepare the next generation of engineers and interdisciplinary professionals to translate these innovations into real-world deployment?

This panel explores how industry-academic partnerships and professional societies like IEEE can co-create scalable, inclusive educational ecosystems to meet that need. The panel will examine the full continuum of talent development: from K–12 STEM engagement to university curricula redesign, to professional and continuing education for current and emerging roles in the AI/SP workforce.

It will emphasize:

The shifting expectations of employers: from algorithmic skillsets alone to domain-contextualized system thinking, teamwork, and lifecycle awareness.
The growing need for interdisciplinary fluency: as professionals in product, regulatory, clinical, or sustainability roles increasingly interact with SP/AI systems.
The role of companies like Microsoft, Meta, and MathWorks: in creating content, platforms, and credentialing models to scale global talent capacity.
How IEEE can act as a trusted, neutral convenor: co-developing open, modular, and locally adaptable educational formats for use by chapters, institutions, and startups alike.
How to convert ICASSP research outputs into real-world learning artifacts: enabling faculty and companies to jointly build pathways that connect signal processing innovation with employment opportunity.
This panel intends to be a dynamic conversation among stakeholders building the future of work in SP and AI: equally relevant to researchers, engineers, educators, product leaders, and outreach directors. Panelists will share deployment experiences, program models, and lessons learned, followed by collaborative ideation with the audience.

Bio

Arvind Rao is a Professor in the Department of Computational Medicine and Bioinformatics, Biostatistics, Radiation Oncology at the University of Michigan. His group uses image analysis and AI methods to link imaging and non Euclidean signals across biological scales (i.e. single cell, tissue and radiology data). Such methods have found application in various areas of biomedical data science. Arvind received his PhD in Electrical Engineering and Bioinformatics from the University of Michigan, specializing in transcriptional genomics, and was a Lane Postdoctoral Fellow at Carnegie Mellon University, specializing in bioimage informatics. He is also a Fellow of American Medical Informatics Association (AMIA), The Royal College of Pathology (RCPath) in the UK (by published works) and American Association for Advancement in Science (AAAS). He is an active contributing member of several initiatives with the IEEE SPS Education Board (K-12, Education Community) and Data Science Initiative.

Dr. Yang Lei is a Principal AI Research Engineer in HP Inc. She currently leads the development of novel and robust AI technologies for future video conferencing solutions. In previous roles, she defined and developed the core computer vision technologies for HP Labs’ education initiative and expanded the AI capabilities in the company’s immersive computing platform. During HP’s microfluidics business creation, she led the development of key algorithms for isolating circulating tumor cells. This was a significant step toward affordable cancer diagnosis and personalized treatment and earned her the HP Reinventor Award, HP’s highest innovation honor. She authored 20+ patent applications and more than 19 publications and talks (IEEE ICASSP, ICIP, ISBI, Grace Hopper Celebration) in high-priority areas of Computer Vision and AI. Dr. Lei is an IEEE Senior Member. She received the inaugural Purdue Engineering 38 by 38 award in 2024, for her track record of successfully applying AI technologies across key domains. She is also the recipient of the 2025 Society of Women Engineers (SWE) Pathfinder Award, the 2023 IEEE SPS Industry Young Professional Leadership Award, and the 2021 Eaton Award of Design Excellence.

Dr. Ioannis Katsavounidis is part of the Video Infrastructure team, leading technical efforts in improving video quality and quality of experience across all video products at Meta. Before joining Meta, he spent 3.5 years at Netflix, contributing to the development and popularization of VMAF, Netflix’s open-source video quality metric, as well as inventing the Dynamic Optimizer, a shot-based perceptual video quality optimization framework that brought significant bitrate savings across the whole video streaming spectrum. He was a professor for 8 years at the University of Thessaly’s Electrical and Computer Engineering Department in Greece, teaching video compression, signal processing and information theory. He was one of the cofounders of Cidana, a mobile multimedia software company in Shanghai, China. He was the director of software for advanced video codecs at InterVideo, the makers of the popular SW DVD player, WinDVD, in the early 2000’s and he has also worked for 4 years in high-energy experimental Physics in Italy. He is one of the co-chairs for the statistical analysis methods (SAM) and no-reference metrics (NORM) groups at the Video Quality Experts Group (VQEG). He is actively involved within the Alliance for Open Media (AOMedia) as co-chair of the software implementation working group (SWIG). He has over 150 publications, including 50 patents. His research interests lie in video coding, quality of experience, adaptive streaming, and energy efficient HW/SW multimedia processing. He is an IEEE Fellow.

Dr. Ivan Tashev is a Partner Software Architect in Microsoft Research (MSR), Redmond, WA, USA, where he leads the Audio and Acoustics Research Group. His interests include multichannel signal processing with machine learning and artificial intelligence. Ivan Tashev also coordinates the Brain-Computer Interfaces project in MSR. Dr.Tashev has published two books as a sole author, two book chapters, 100+ scientific papers, listed as inventor in 50 US patents. Ivan Tashev is affiliate professor in the Department for Electrical and Computer Engineering of University of Washington in Seattle, USA, and honorary professor at Technical University of Sofia, Bulgaria. Technologies created by Dr. Tashev are incorporated in many Microsoft products, he served as the audio architect for Kinect and for HoloLens. He is an IEEE Fellow, member of AES and ASA. More details about him can be found in his web page https://www.microsoft.com/en-us/research/people/ivantash/.

Gabriele Bunkheila is a senior product manager at MathWorks, where he coordinates the strategy of MATLAB toolboxes for audio and DSP. After joining MathWorks in 2008, he worked as a signal processing application engineer for several years, supporting MATLAB and Simulink users across industries from algorithm design to real-time implementations. Before MathWorks, he held a number of research and development positions related to signal processing. Manages global partnerships with universities and K–12 systems for STEM and SP/AI learning.

Marios S. Pattichis ([email protected]) is a Professor in the Department of Electrical and Computer Engineering at the University of New Mexico. He holds a Ph.D. In Computer Engineering, an M.S.E. In Electrical Engineering, a B.A. (high honors) in Mathematics, and a B.Sc. (high honors and special honors) in Computer Sciences, all from the University of Texas at Austin. Since 2012, he has been involved in research projects that teach Python to middle-school students from underrepresented groups. He is the chair of the IEEE Signal Processing Society’s K-12 subcommittee on Education. At UNM, he is the director of Online Programs, including the new online M.Sc. Degree in Applied Machine Learning & Artificial Intelligence Systems Engineering. His current research interests include integrating Mathematics, Computer Programming, and AI into Engineering Education. He has served as Associate Editor to several journals and as Senior Area Editor for IEEE Transactions on Image Processing and IEEE Signal Processing Letters. He is a Senior Member of the IEEE, a Senior Member of the National Academy of Inventors, and a Fellow of the European Alliance of Medical and Biological Engineering and Science (EAMBES).

Ramani Duraiswami is Professor and Associate Chair (for Graduate Studies) at the Department of Computer Science, and in UMIACS, at the University of Maryland. Prof. Duraiswami got his B. Tech. at IIT Bombay, and his Ph.D. at The Johns Hopkins University. After spending a few years working in industry, he joined the University of Maryland, where he established the Perceptual Interfaces and Reality Lab. He has broad research interests, including both algorithm development (for machine learning, statistics, wave propagation and scattering, the fast multipole method), and systems development/applications (spatial audio capture rendering and personalization; computer vision, acoustics). He has published over 280 peer-reviewed archival papers, co-authored a book, has several issued patents, and according to Google Scholar has an h-index of 64 (in 2023). Some of his research has been spun out into a startup, VisiSonics, whose technology is in millions of devices. A particular theme of Prof. Duraiswami’s recent research has been combining machine learning with scientific simulation, and the understanding of the interaction of waves with objects – electromagnetic, acoustic, and visual.

Panel 4: Scaling Intelligence for the Smart Society: Human-Centric, Sovereign, and Efficient Digital Twins

Friday, 8 May 2026, 14:00 – 16:00
Location: Auditorium

Moderator:

Antonio J. Jara: Libelium and Gaia-X evangelist
Arijit Ukil: TCS Research

Panelists:

Martin Serrano: NIST and University of Galway
Juan Jose Hierro: FIWARE Foundation
Francisca Rubio: Gaia-X Hub
Oscar Garcia, Bosch Global Software Technologies

Abstract

The paradigm is shifting from the “Smart City,” a geography anchored in infrastructure, to the “Smart Society,” an ecosystem defined by its people. In this expanded canvas, the Industrial Digital Twin (IDT) must evolve from asset monitors to engines of public well-being with scaling across health, energy, mobility, and logistics with data aggregation and ensuring privacy and trust.

This panel outlines the architectural overhaul and technologies for human-centric twins, rooted in signal and information processing that decouple intelligence from centralized clouds; adopt efficient, decentralized AI, edge-native GenAI and Small Language Models (SLMs), learning locally from multimodal signals with IoT infrastructure, augmented by federated learning and privacy-preserving analytics with zero-touch operations twins, edge-resident agents with executable policies, and causal and counterfactual twins. The outcome is data sovereignty and resource-efficient scale, enabling equitable deployment across diverse socioeconomic contexts.

Key Technical Topics:

Sovereignty by design: Local-first edge AI and federated learning extract societal (clinical, energy, mobility,…) insights, ensuring compliant, trusted, human-centric platforms.
Modeling the human element and context: stochastic human behaviors driving adaptive, trustworthy urban intelligence with empathetic machine to human conversation.
Sustainable scalability: Green-AI economics; quantization, pruning, sparsity, mixed precision delivers high-fidelity reasoning on constrained devices.

Representative Use Cases:

Connected Mobility Twin runs city traffic itself with pedestrians, connected vehicles, roadside sensors.
Climate resilience and air-quality Twin with radar, satellites, IoT sensors to forecast micro-events and trigger controls.
Energy-Equity District Twin coordinates Distributed Energy Resources with federated learning to enforce carbon and comfort optimization.
Community Digital Health Twin with on-device multimodal learning delivers services using wearables, clinics, and causal policies towards health equity.

Bio

Dr. Antonio J. Jara: Chief Scientific Officer at Libelium and Gaia-X evangelist, Dr. Jara is a highly cited IoT researcher bridging sensing, AI, data spaces, and city digital twins. He founded HOPU (now part of Libelium) and contributes to SENSE CitiVerses and the EU Local Digital Twin Toolbox, enabling trustworthy, sovereign, and operational twins for cities. He earned his Ph.D. (Cum Laude) from the University of Murcia and has participated as speaker in 100+ international events and publications, holding several IoT patents.

Dr. Arijit Ukil: Principal Scientist at TCS Research, Kolkata and IEEE Senior Member (2016), Dr. Ukil brings 22 more years of industrial research experience across ML, deep learning, and interpretable AI. He has published 50+ research papers and filed 60+ patents (50+ granted across multiple geographies). He earned his Ph.D. (Cum Laude) from the University of Murcia and serves as adjunct faculty at the Defence Institute of Advanced Technology, India. He has organized workshops/tutorials at ACM CIKM, Ubicomp, ICASSP, ACM SAC, and ECAI, and delivered invited talks at leading universities and venues.

Dr. Martin Serrano: International associate at NIST (USA) and Principal Investigator/Head of the AIoT Research Unit at the University of Galway (Ireland). Dr. Serrano is an engineer, data scientist, and lecturer on smart cities with 15+ years of experience in semantic interoperability, distributed data & information systems, and cybersecurity. He is active in IEEE and ACM, has 100+ peer-reviewed publications, and contributes broadly to EU and Irish innovation programs.

Dr. Juan Jose Hierro: CTO of the FIWARE Foundation and Chair of the FIWARE Technical Steering Committee, Dr. Hierro champions open, royalty-free standards (e.g., NGSI-LD) and Smart Data Models that make digital-twin data interoperable and reusable across domains. He focuses on efficient, open-source platforms and data-space building blocks that help cities scale intelligence without vendor lock-in and supports the Open & Agile Smart Cities initiative.

Ms. Francisca Rubio: General Manager, Gaia-X Hub Spain, Ms. Rubio is an electronics engineer (University of Granada) with MBA and Big Data & Data Engineering credentials. She has led R&D and innovation across sectors and now helps organizations apply Gaia-X federation, trust, and governance so data and AI can scale across borders—supporting sovereign, human-centric digital twins and European data spaces. She has been instrumental in establishing new R&D centers including ISFOC and the Ricardo Valle Institute.

Oscar García: Head of Strategy & Business Development for Digital Transformation and Sustainability at Bosch Global Software Technologies, leading initiatives across EMEA and Latin America. With a strong background in Lean Manufacturing and Industry 4.0, he has driven high-impact transformation projects that enhance productivity, quality, and cost efficiency in manufacturing environments. His expertise spans AI-driven solutions, PLM, and end-to-end digital strategy, combining technical depth with business execution.

Industry Workshops

Meta Industry Workshop: “Frontiers in Human-Machine Communication: Wearable Sensing, Speech Enhancement, and Conversational Interaction”

Tuesday, May 5, 14:00 -15:30
Location: Room 112
Session Chair: Sriram Srinivasan

Talk 1: Multichannel Voice AI for Wearables: From Cocktail Parties to Directional Intelligence

Abstract: Wearable devices with microphone arrays offer a unique opportunity to tackle the cocktail party problem — selectively attending to a target speaker amid competing voices. This talk presents our work on multichannel voice AI for Meta smart glasses across three progressively challenging application classes: single-speaker assistance, two-speaker live translation, and multi-talker ambient comprehension. We review SAIN, a cascade approach for directional listening in today’s production systems, and MD-ASR, an end-to-end multichannel architecture for live speech translation that replaces the traditional M:1 beamforming bottleneck with an M:N frontend preserving spatial cues. We close with our vision of LLM-based systems that jointly model direction and meaning.

Bio: Yiteng (Arden) Huang is an AI research scientist on the Wearables AI team in Reality Labs at Meta Platforms, focusing on voice-driven AI models, devices, and systems. He earned his Ph.D. in electrical and computer engineering from Georgia Tech with a distinction in signal processing. He began his professional career at Bell Labs before founding and running WeVoice Inc., where he helped NASA develop acoustic and audio signal processing technologies for future missions under SBIR (Small Business Innovation Research) contracts. Prior to joining Meta, Arden held positions at Google and Amazon, continuously conducting research and development in speech and audio across signal processing, machine learning, and artificial intelligence. Dr. Huang has published extensively, with 8 co-authored/co-edited books, nearly 100 peer-reviewed journal and conference papers, and approximately 20 patents. He received the 2008 Best Paper Award and the 2002 Young Author Best Paper Award from the IEEE Signal Processing Society (SPS), along with awards from NASA and during his graduate studies. He has also served the IEEE SPS as an associate editor and technical committee member for multiple terms.

Talk 2: Conversational Fluidity: A Full-Stack Audio Approach to Natural AI Voice Interaction

Abstract: Voice-based AI agents shift real-time communication from human-to-human to human-to-machine conversation, requiring fundamental redesign of the audio processing stack. We present a full-stack audio approach to enabling conversational fluidity, an interaction that feels as responsive and natural as face-to-face conversation, deployed at scale across Meta’s communication platforms. Our stack addresses several challenges: distinguishing the primary user from background speakers and noise, end-of-turn detection to distinguish true conversational yields from mid-utterance pauses enabling responsive yet non-interruptive agent behavior, and systematic latency reduction across the capture-to-playout pipeline to shrink round-trip response time. We present production results at scale and argue that conversational fluidity is an emergent property of coordinated improvements across the full audio stack, not a single-model solution.

Bio: Sriram Srinivasan is Director of Engineering at Meta, leading the core media teams powering video-on-demand, real-time audio-video calling, and full-duplex MetaAI Voice experiences across Meta’s family of apps including Facebook, Instagram, WhatsApp, Messenger and Meta AI. Prior to Meta, he was at Microsoft leading the teams working on audio technologies for Microsoft Teams. He has over 20 years of real-time signal processing and ML experience in areas such as real-time media, audio/video processing, low bitrate audio/video codecs, spatial audio and network resilience algorithms. He holds a PhD in audio signal processing from KTH Royal Institute of Technology, Stockholm, Sweden, 30+ granted US patents and over 50 peer-reviewed publications.

Talk 3: Text input via noninvasive neuromotor signals

Abstract: Noninvasive neuromotor interfaces have the potential to transform human-computer interaction by providing users with low friction, information rich, always available inputs. Reality Labs at Meta has recently shipped the Meta Neural Band, which uses electromyographic (EMG) signals captured at the wrist to control the Meta Ray-Ban Display glasses and provide text-entry capabilities via handwriting recognition. This talk will describe the development of these control and text entry systems and discuss their important aspects and properties as well as briefly describing several other interactions built using the same hardware and signals. Building such interfaces requires collecting data at scale and building machine learning models that can generalize across individuals. Shipping these interactions brings us closer to our ultimate goal of building effortless and joyful human-computer interfaces.

Bio: Michael I Mandel is a Research Scientist in Reality Labs at Meta Platforms, Inc building text interactions for neural interfaces using machine learning and signal processing. He earned his BSc in Computer Science from the Massachusetts Institute of Technology and his MS and PhD with distinction in Electrical Engineering from Columbia University as a Fu Foundation Presidential Scholar. He was an FQRNT Postdoctoral Research Fellow in the Machine Learning laboratory (LISA/MILA) at the Université de Montréal, an Algorithm Developer at Audience Inc, a company that has shipped over 500 million noise suppression chips for cell phones, a Research Scientist in Computer Science and Engineering at the Ohio State University, and an Associate Professor of Computer and Information Science at Brooklyn College and the CUNY Graduate Center. His work has been supported by the National Science Foundation, including via a CAREER award, the Alfred P. Sloan Foundation, and Google, Inc.

Talk 4: Artifree: Detecting And Reducing Generative Artifacts In Diffusion-Based Speech Enhancement

Abstract: Diffusion-based speech enhancement (SE) achieves natural-sounding speech and strong generalization, yet suffers from key limitations like generative artifacts and high inference latency. In this work, we systematically study artifact prediction and reduction in diffusion-based SE. We show that variance in speech embeddings can be used to predict phonetic errors during inference. Building on these findings, we propose an ensemble inference method guided by semantic consistency across multiple diffusion runs. This technique reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility. Finally, we analyze the effect of the number of diffusion steps, showing that adaptive diffusion steps balance artifact suppression and latency. Our findings highlight semantic priors as a powerful tool to guide generative SE toward artifact-free outputs.

Bio: Yang Gao is an ML Research Scientist on the XR Sonic team in Reality Labs at Meta Platforms. He focuses on low-level audio and speech AI models, devices, and systems, including a neural-based ambisonic encoder, the Gleam ML framework, and models for speech enhancement and echo cancellation. He earned his Ph.D. in Computer Science from the University of Utah, where his research focused on medical image processing, including longitudinal MRI segmentation. Prior to joining Meta, Yang worked at Amazon, where he conducted R&D in speech and audio across signal processing, machine learning, and AI—developing ASR, wake-word, and endpointing models for Alexa.

Talk 5: WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

Abstract: Wearable devices like AI glasses are transforming voice assistants into hands-free collaborators, but introduce challenges such as egocentric audio with motion and noise, micro-interactions, and distinguishing device-directed speech from background conversations. We present WearVox, the first benchmark for evaluating voice assistants in realistic wearable scenarios. It comprises 3,842 multi-channel egocentric recordings collected via AI glasses across five tasks — Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation — spanning diverse acoustic conditions. Leading speech LLMs achieve only 29-59% accuracy, with significant degradation on noisy outdoor audio. We further show that multi-channel audio inputs substantially improve robustness, highlighting the importance of spatial audio cues for wearable voice AI.

Bio: Florian Metze is a Director in the Wearables AI org in Meta’s Reality Labs division. He is also an Adjunct Professor at Carnegie Mellon University, in the School of Computer Science’s Language Technologies Institute. His work covers many areas of speech recognition and multi-media analysis with a focus on end-to-end deep learning. Currently, he focuses on speech translation and language coaching for the RayBan Meta range of AI glasses. In previous roles, he supported multi-modal Content understanding for recommendation systems across Meta’s Family of Apps. Before joining Meta in 2019, he worked on multi-modal processing of speech in videos and, as one of the co-founders of Abridge, information extraction from medical interviews. He has also worked on low resource and multi-lingual speech processing, speech recognition with articulatory features, large-scale multi-media retrieval and summarization, along with recognition of personality or similar meta-data from speech.

Spotlight Sessions

MERL: “Every Moment | Something New — Signal Processing at MERL”

Tuesday, May 5, 13:05 -13:20
Location: Spotlight Area, Exhibition Hall
Speakers: Petros T. Boufounos, Distinguished Research Scientist, Deputy Director; Jonathan Le Roux, Distinguished Research Scientist, Senior Team Leader

Abstract: Signal processing is a core research area at Mitsubishi Electric Research Laboratories (MERL). Our work is at the cutting edge of signal processing research in a broad range of applications, from speech and audio processing to sensing and optimization. In this talk we will spotlight some of our most recent development, in areas including machine learning, speech and audio, multimodal imaging and physics-informed neural networks for sensing. We will discuss how to use a single radar to sense people’s pose and movement, a single camera to sense airflows, single photons to measure distance and velocity, a single microphone to detect equipment anomalies, or a single model to rule all sound separation tasks. Join us to learn more about MERL’s work and engagement with the signal processing community.

Adobe: “Enabling Rich Creative Control in Generative Audio”

Wednesday, May 6, 13:05 -13:20
Location: Spotlight Area, Exhibition Hall
Speakers: Jiaqi Su, Senior Research Scientist

Show and Tell Demos

Program Schedule

Location: Exhibition Hall

Session order	Date	Time	Proposal ID
Demo Session 1	May 5	14:00–16:00	501, 515, 536, 553, 555, 574
Demo Session 2	May 5	16:30–18:30	506, 521, 529, 532, 533, 570
Demo Session 3	May 6	9:00–11:00	519, 538, 548, 557, 558, 563
Demo Session 4	May 6	14:00–16:00	520, 523, 546, 552, 568, 571
Demo Session 5	May 6	16:30–18:30	510, 512, 526, 527, 543, 545
Demo Session 6	May 7	9:00–11:00	511, 549, 550, 554, 561
Demo Session 7	May 7	14:00–16:00	508, 525, 531, 562, 567
Demo Session 8	May 7	16:30–18:30	502, 516, 517, 524, 573
Demo Session 9	May 8	9:00–11:00	528, 565, 537, 556, 542
Demo Session 10	May 8	14:00–16:00	522, 530, 539, 544, 559

For Show & Tell demos requiring shipping assistance, please contact [email protected] for further details. Please note that this is not a complimentary service and will incur additional cost.

501: Nkululeko 1.0: A Python package to predict speaker characteristics with a high-level interface

Authors: Felix Burkhardt: audEERING, Bagus Tris Atmaja: NAIST, Florian Eyben: audEERING, Björn Schuller: audEERING, TUM, Imperial CL

Description: The Nkululeko demo showcases a cutting-edge, open-source Python toolkit designed to simplify audio-based machine learning tasks, particularly in speech processing. Aimed at users with varying levels of expertise, Nkululeko eliminates the need for coding by leveraging a command-line interface (CLI) and configuration files. Built on scikit-learn and PyTorch, it provides a powerful yet user-friendly framework for training, evaluating, and analyzing speech databases using advanced machine learning methods and acoustic features.

Novelty and Innovations
The key innovation of Nkululeko lies in its ability to empower users—whether novices or experienced researchers—to easily experiment with speech processing tasks without deep technical knowledge. With version 1.0, Nkululeko introduces several significant advancements, such as:
* Transformer model fine-tuning: Users can now fine-tune pre-trained transformer models, enabling them to achieve state-of-the-art performance with minimal data and computation.
* Ensemble learning: This feature allows users to combine multiple models to improve prediction accuracy and robustness.
* Linguistic feature modeling: Nkululeko also supports advanced linguistic feature extraction, enabling the incorporation of higher-level language characteristics into speech analysis.
These innovations make it an invaluable tool for quickly testing hypotheses and deploying machine learning models, especially for those working with speech data and acoustic features.

Impact to Signal Processing Communities
Nkululeko has the potential to make a significant impact on various fields within the signal processing and machine learning communities. By simplifying complex workflows, it lowers the barrier for entry to speech processing research and application, making it accessible to a broader range of users, from educators to researchers. Additionally, its ability to detect biases in speech data (e.g., correlations between speaker characteristics and target labels) provides a novel approach to addressing fairness in AI-driven speech processing.

Interactivity for Attendees
At the ICASSP 2020 demo session, attendees will have the opportunity to engage with live demonstrations of key Nkululeko features, including model training, database analysis, and bias detection. The demo is designed to be highly interactive, allowing participants to explore various machine learning experiments in real-time on a laptop. This hands-on experience will give attendees a practical understanding of how Nkululeko can be used in both academic and industry settings.

515: Interactive Spectrogram-Based Rhythm and Melody Annotation for Speech Analysis

Authors: Shreevatsa G. Hegde, Department of Computing and Software Systems, University of Washington Bothell
Min Chen, Department of Computing and Software Systems, University of Washington Bothell

Description: This demo presents MeTILDA (https://metilda.net/), an interactive, cloud-based, and open-access signal processing platform for endangered language documentation and education. The system supports human-centered speech signal analysis by enabling hands-on exploration of rhythm and melody through direct interaction with audio representations. The demo showcases a complete, end-to-end speech analysis workflow, including spectrogram-based rhythm annotation, melody analysis, and pitch visualization. It also demonstrates the integration of our proposed MeT perceptual pitch scale, a key innovation that allows users to focus on relative melodic contours by normalizing speaker-dependent pitch variation caused by age, gender, or physiological differences.
Attendees can inspect spectral content while controlling audio playback for improved perception of rhythmic and melodic features. Interactive zooming and time navigation enable close inspection of short temporal regions, supporting precise analysis of rapidly changing acoustic events. For rhythm analysis, attendees can place vertical markers as taps on the spectrogram to annotate perceived rhythmic boundaries. The system provides multiple playback modes with rate control, enabling focused exploration of temporal structure and alignment of rhythmic annotations with acoustic events.
The demo also highlights melody analysis workflows in which users select regions of the spectrogram and apply different pitch extraction strategies, including region-based averaging, contour-based extraction, and manual frequency selection. Each method generates pitch data that are mapped to interactive Pitch Art visualizations, which abstract pitch movement patterns while remaining grounded in the underlying signal representation. Users can label syllables, apply time normalization, vertically center pitch ranges, and mark
primary and secondary accent positions. The system further supports multi-speaker prosodic analysis by overlaying pitch representations from multiple speakers within a single Pitch Art chart, enabling direct visual comparison of pronunciation and intonation patterns.
The main novelty of the demo lies in its integration of interactive spectrogram manipulation, rhythm annotation, and melody visualization within a single end-to-end workflow with human-in-the-loop.
The demo is highly interactive, with attendees directly manipulating speech signals and receiving immediate auditory and visual feedback. It demonstrates the broader impact of interactive and perceptually grounded signal processing tools for researchers in speech and audio processing, prosody analysis, signal visualization, and human-centered and explainable signal processing systems.

536: An Interactive Demonstration of the Open ASR Leaderboard

Authors: Eric Bezzam (Hugging Face), Steven Zheng (Hugging Face), Eustache Le Bihan (Hugging Face)

Description: With the proliferation of automatic speech recognition (ASR) systems, selecting the right model for a given application can be challenging. We present a live demonstration of the Open ASR Leaderboard, a community-driven benchmarking platform that enables transparent, reproducible, and continuously updated comparison of ASR systems: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

The leaderboard evaluates a wide range of ASR models across standardized datasets and metrics, and aggregates both open and closed-source systems. For open-source models, links to their Hugging Face model cards provide example code and implementation details. For closed-source systems, links point to the corresponding API documentation. In addition, an open-source GitHub repository provides evaluation scripts to reproduce leaderboard results: https://github.com/huggingface/open_asr_leaderboard

The Open ASR Leaderboard has seen strong adoption across academia and industry, with participation from major speech toolkits and companies (SpeechBrain, NVIDIA NeMo, ElevenLabs, IBM, Microsoft, etc.), and significant community engagement (550K+ total visits, 37K+ in the last month, 48 merged GitHub PRs).

To move beyond static benchmarking, the ICASSP demonstration will include a Reachy Mini desktop companion robot, enabling live speech interaction. Leveraging the open-source and rapid prototyping nature of Reachy Mini, different ASR and text-to-speech (TTS) models can be used interchangeably. This allows attendees to directly experience how offline benchmark scores translate into perceptual quality, latency, and robustness in realistic human-machine interactions.

Main Novelty and Innovations
- A unified, public benchmark comparing open and closed-source ASR systems.
- Community-driven, continuously evolving evaluation framework.
- Human-in-the-loop, embodied evaluation to translate offline metrics into live interactions with Reachy Mini.

Impact to the Signal Processing Community
- Promotes transparent and reproducible evaluation practices.
- A shared reference point for comparing ASR systems across academia and industry.

Interactivity for Attendees
- Attendees can speak directly to the Reachy Mini robot, select different ASR/TTS backends, observe real-time transcriptions, latency differences, and qualitative behavior. They can compare these observations with leaderboard results, making the leaderboard metrics more tangible.

553: Speech Enhancement Intelligence - Inspecting a Model Under Controlled Degradation

Authors: Yair Amar (Technion - Israel Institute of Technology)
Amir Ivry (Technion - Israel Institute of Technology)
Israel Cohen (Technion - Israel Institute of Technology)

Description: This Show and Tell demonstration presents an interactive system for speech enhancement intelligence: observing, probing, and interpreting how a speech enhancement model responds as noise conditions change. Rather than treating the model as a black box, the demo provides an interface that exposes how internal representations evolve under increasing noise, controlled by the user.
Attendees begin by speaking a short utterance into a microphone. This recording is treated as a clean reference. Artificial noise is then added in a controlled manner using an SNR slider, allowing users to smoothly move from clean to highly noisy conditions while keeping the underlying speech fixed. At each noise level, the clean and noisy signals are processed through a speech enhancement model, and internal activations from selected layers are extracted.
The interface visualizes how the activations evolve under increasing noise and evaluates how closely the model’s representations under noise resemble those elicited by clean speech. These similarities are shown layer by layer using Centered Kernel Alignment (CKA), revealing which parts of the model remain stable, which become noise-sensitive, and which recover as noise conditions improve. These measures are summarized via linearization of the CKA versus SNR trend. Alongside these internal indicators, standard enhancement performance metrics such as PESQ, STOI, and SI-SDR are updated in real time.
By interacting with the noise controls, attendees can observe how internal representation stability degrades and recovers, and how these internal changes align with variations in output quality. This enables inspection of model behavior beyond post-hoc evaluation of enhanced signals alone.
The demo offers an intuitive, hands-on view of how speech enhancement models internally respond to noise. It is relevant to the ICASSP community, as it illustrates how signal processing, learning-based models, and interpretability tools can be combined to better understand the internal behavior of modern speech systems.

555: SCRIBAL: A Multilingual Transcription Platform for Academic Lectures and Impaired Speech Accessibility

Authors: Pol Pastells (1,2), Javier Román (1), Mauro Vázquez (1), Clara Puigventós (1), Montserrat Nofre (1), Mariona Taulé (1,2), Mireia Farrús (1,2)
1 - Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona, Spain
2 - Institut de Recerca en Sistemes Complexos (UBICS), Universitat de Barcelona, Spain

Description: SCRIBAL is a comprehensive web-based transcription and translation ecosystem comprising three integrated products. SCRIBAL provides real-time multilingual transcription and translation for university lectures and conferences, supporting most major languages through Whisper-based models, with specialized domain-optimized terminology currently available for Catalan. SCRIBAL-Social specializes in transcribing and translating impaired speech from Catalan speakers with Down syndrome and cerebral palsy, addressing critical accessibility needs. Additionally, the platform offers file-based transcription for post-processing scenarios.
SCRIBAL exemplifies how speech processing can bridge digital divides across linguistic and ability spectrums. It demonstrates practical solutions for low-resource language ASR, domain adaptation in academic contexts, and impaired speech transcription.

The key innovation is a modular architecture that seamlessly integrates general-purpose multilingual ASR, domain-adapted academic models, and specialized impaired speech recognition within a unified platform. The impaired speech component represents pioneering work in Catalan speech processing, an under-resourced language for accessibility applications. This dual-focus approach—combining broad multilingual coverage with deep specialization for underserved populations—sets SCRIBAL apart from conventional transcription services while maintaining real-time performance suitable for live deployment.

Participants will actively engage with all three SCRIBAL modalities through multiple interaction modes. Using either our demonstration laptop or their own smartphones, attendees can: speak directly into the system in their native language to experience live multilingual transcription, upload audio files to test batch processing capabilities, and observe specialized impaired speech recognition through pre-recorded Catalan samples. This hands-on experience allows attendees to compare transcription accuracy across academic and general domains, experiment with various acoustic conditions and speaking styles, and discuss potential deployment strategies for their own institutions or research applications.

This work has been funded by the Generalitat de Catalunya (2024 PROD 00016 grant). It is also part of the FairTransNLP-Language project (PID2021-124361OB-C33), funded by MICIU/AEI/10.13039/501100011033/FEDER, UE.

574: Tahlil: An Interactive Toolkit for Standardized ASR Evaluation and Error Analysis

Authors: Yousseif Alshahawy, Daniel Izham, Aljawharah Bin Tamran, Ahmed Ali
HUMAIN, Riyadh, Saudi Arabia

Description: Demo Overview
Tahlil is a stand-alone web application for standardized automatic speech recognition (ASR) evaluation and error analysis, designed to improve the transparency, interpretability, and reproducibility of reported ASR results. The system provides both single-utterance inspection and large-scale batch evaluation, with asynchronous processing to support realistic experimental workflows. Its architecture combines a Nuxt 4 frontend with a FastAPI backend, enabling responsive interaction and scalable evaluation.
A key component of Tahlil is a custom Rust-based alignment module that enables efficient token-level alignment, detailed error inspection, and confusion statistics. By unifying ASR hypotheses and human annotations within a single evaluation framework, Tahlil allows systematic comparison across annotators, models, datasets, and normalization settings. The toolkit was initially motivated by inconsistencies observed in Arabic ASR reporting, where divergent text normalization practices, such as diacritic handling and letter-form variants that can substantially influence reported error rates. However, the framework itself is language-agnostic and applicable to a wide range of ASR evaluation scenarios.

Novelty and Innovation
Tahlil transforms ASR evaluation from a single aggregate score into a structured and reproducible analysis workflow. It integrates a custom Rust extension into the JiWER evaluation stack, enabling fast and consistent alignment with optional custom-cost or weighted alignment strategies. The resulting RapidFuzz-compatible opcode streams form a single source of truth from which all metrics, visualizations, and confusion statistics are derived.
This design allows users to directly trace how normalization choices, alignment parameters, and input annotations affect final WER/CER values and error distributions. In addition, Tahlil provides built-in tools for text cleaning and normalization, supports both single and batch evaluation with asynchronous job tracking, and enables export of analysis artifacts. Together, these features standardize evaluation practices across models, datasets, and annotators.

Impact on the Signal Processing and ASR Community
By combining standardized scoring with interactive, alignment-based error analysis, Tahlil enables researchers and engineers to move beyond reporting a single WER or CER figure. The toolkit facilitates identification of systematic failure patterns, such as consistent substitutions, deletion bursts, and normalization-sensitive errors, in a manner that is transparent and easy to communicate. Its support for batch evaluation and self-contained deployment enables scalable, reproducible comparisons across ASR systems, improving experimental reliability and accelerating iteration cycles. In multilingual and morphologically rich settings, Tahlil provides a shared reference point for fairer benchmarking and clearer reporting of ASR performance.

Interactivity for Attendees
During the demo, attendees will interactively upload ASR hypotheses and references, adjust normalization and alignment settings in real time, visualize token-level errors and confusion statistics, compare multiple systems side-by-side, and export reproducible evaluation artifacts.

506: Flow Matching for Real-Time Joint Speech Enhancement and Bandwidth Extension

Authors: Simon Welker (Signal Processing Group, University of Hamburg, Germany)
Bunlong Lay (Signal Processing Group, University of Hamburg, Germany; Hamburg Informatik Technologie-Center e.V.)
Maris Hillemann (Signal Processing Group, University of Hamburg, Germany)
Tal Peer (Signal Processing Group, University of Hamburg, Germany)
Timo Gerkmann (Signal Processing Group, University of Hamburg, Germany)

Description: Diffusion-based speech enhancement is a popular and active research topic. In our demo, we present a real-time generative system for joint speech enhancement and bandwidth extension with flow matching, a method closely related to diffusion. The system receives a noisy and reverberant single-channel input on a consumer GPU laptop, which can optionally be low-pass filtered at a configurable frequency cutoff before being fed to the system. With an efficiently cached frame-wise inference scheme and an optimized causal DNN, our system achieves a total latency of only 48ms (32ms algorithmic latency + 16ms computational latency), bringing low-latency, high-quality generative speech restoration with generative flow matching models to consumer hardware for the first time. The underlying real-time flow matching backbone is described in our accepted 2026 ICASSP paper (ID: 16059).

We combine a predictive network with a generative flow network in a joint predictive-generative scheme, outputting a clean bandwidth-extended speech estimate with up to 24 kHz bandwidth (48kHz sampling rate). The graphical user interface allows three interactive changes: (1) the flow network can be toggled on/off to switch between predictive and predictive-generative speech restoration; (2) using a graphical slider, attendees can set the frequency cutoff of the low-pass filter (4-16 kHz) to simulate a lower sampling frequency; (3) either one or multiple generative sampling steps can be chosen to show how the generative model behaves in each scenario.

The demo lets attendees switch between unprocessed speech and three possible variants of enhanced speech on the fly, allowing them to explore the advantages and downsides of predictive and generative speech restoration in a real-time setting. We use one omnidirectional microphone placed in an open conference area, and run our models on a laptop with a NVIDIA RTX 5090 Laptop GPU. The laptop is connected to a soundcard and headphone amplifier. Up to five active noise-canceling headphones can be connected, so that multiple attendees can listen and interact simultaneously.

Our demonstration offers an interactive experience, illustrating how modern generative methods can be used for real-time single-channel speech enhancement and bandwidth extension in a real conference environment, and how they differ qualitatively from predictive methods.

521: NPU-Accelerated Real-Time Voice Conversion for Customizable Digital Identities

Authors: Andrey Kramer (Voicemod), Pritish Chandna (Voicemod), Merlijn Blaauw (Voicemod), Jordi Bonada (Voicemod), Jordi Janer (Voicemod)

Description: Real-time voice conversion has seen widespread adoption by millions of users within gaming and digital identity ecosystems. Historically, however, these systems have been restricted to low-complexity models due to limited CPU overhead and the need to maintain stability alongside concurrent, high-demand applications.

To overcome these limits, we present a high-fidelity, low-latency voice conversion system optimized for Neural Processing Units (NPUs). This solution leverages Transformer-based models specifically architected for edge-device acceleration, moving beyond cloud or GPU reliance.

Our demo prototype showcases two recent areas of research of our team: moving towards larger models that leverage NPUs on-device, and ways to control the speaker identity and voice characteristics using higher level, intuitive controls - such as age, gender, depth, and breathiness - via sliders or text prompts.

Main Novelty and Innovations

While NPUs are becoming standard in modern chipsets, their application for real-time, stream-based signal processing is a new frontier. Our innovation lies in:

- Architecture Optimization: A transition from low complexity recurrent models suitable for CPU to highly parallelizable, high complexity models suitable for NPU. This approach maps workloads directly to NPU instruction sets to maximize throughput while offering superior synthesis quality.
- Accelerator-Agnostic Inference: Transitioning complex neural audio tasks from GPUs to dedicated NPU silicon, enabling professional-grade AI audio on consumer desktops, laptops and mobile platforms.

Impact on Signal Processing

This demonstrator attempts also to bridge the gap between describing voices via perceptual characteristics that can be extracted via DSP algorithms and common approaches in Deep Learning-based generative voice conversion models (e.g. a speaker embedding that is learned or estimated from audio).

By using signal processing-based annotators to map speakers into a 5-D space, we provide a framework for a parametrized exploration of voice timbres using perceptually meaningful controls of neural vocal transformation in timing-critical environments.

Interactivity for Attendees

- Generative Voice Design: Create bespoke identities using 5-D descriptors or text prompts (e.g., "a deep, gravelly, yet smooth voice").
- Real-Time Identity Swap: Experience live vocal transformation with an algorithmic latency of 45 ms for the low-complexity model, ensuring a seamless feedback loop without cognitive dissonance.

529: Speaking rate control in the stream

Authors: Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

Description: We introduce an online speaking-rate control mechanism for streaming text-to-speech (TTS) that adjusts duration at the frame level while audio is being generated. A continuous control signal is provided as an additional model input and is consumed causally, enabling the system to smoothly speed up or slow down the emitted speech. The controller supports gradual transitions, so rate changes do not introduce audible discontinuities. Unlike prior duration-control approaches that work only offline or use post-processing, the proposed method changes the speaking rate online as frames are being produced, enabling true speaking-rate control for streaming TTS.

This work contributes to the signal processing community in two ways. First, it introduces causal, frame-level duration control for streaming TTS, enabling low-latency, real-time adaptation of speaking rate. The system can dynamically slow down or speed up based on user preference or text buffer size, mimicking how humans regulate speech flow under different conditions. This enables new research on adaptive, feedback-driven audio generation under strict latency constraints.
Second, the method improves rate-dependent speech realism. Our analysis shows that speaking rate affects not only timing but also content and articulation: slow speech includes fillers (e.g., “uhm,” “yeah”), while fast speech reduces fillers and increases articulation speed. These effects are often overlooked in modern TTS systems. By enabling online rate control, our approach helps close this gap and moves streaming synthesis closer to natural human speech.

The demo is implemented as a Gradio app. Users upload a short reference clip (3-5 s) of a target speaker and enter text. The system begins emitting an audio stream after a short initial delay (~150 ms). While audio is playing, users can repeatedly adjust the speaking rate (speed up or slow down) and immediately hear the effect on the continuing stream, evaluating naturalness and voice similarity across rates. Any language can be used for the reference voice.

This demo extends ICASSP 2026 paper 4854, “VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency,” recently accepted for presentation.

532: Real-Time Demo of Single-Channel Target Speaker Extraction Using State-Space Modeling

Authors: Hiroshi Sato (NTT, Inc.)
Takafumi Moriya (NTT, Inc.)
Marc Delcroix (NTT, Inc.)
Tsubasa Ochiai (NTT, Inc.)
Taichi Asami (NTT, Inc.)

Description: Target speaker extraction (TSE) aims to extract the voice of a pre-enrolled speaker from a single-channel audio mixture that may contain competing talkers and background noise. While recent TSE models demonstrate strong offline performance, practical deployment is often constrained by latency, compute, and stability under continuously changing acoustic conditions. This demo showcases an on-device, real-time single-channel TSE system that runs entirely on a laptop CPU and produces low-latency enhanced audio suitable for live listening. The core novelty lies in the adoption of state-space sequence modeling for streaming acoustic modeling. Specifically, we introduce a new state-space modeling (SSM) -based architecture into Conv-TasNet-based TSE, which has been shown to efficiently capture long-term temporal dependencies. By leveraging SSM, the proposed model requires fewer dilated convolutional layers to model temporal context, resulting in a reduction in overall model complexity. Consequently, the proposed method achieves a more favorable trade-off between computational efficiency and extraction performance.

Demo description: In the demo, we perform online TSE using real recordings captured during the demonstration session. First, the target speaker is enrolled by recording a short voice prompt of approximately 10 seconds. Then, the target speaker talks into the microphone while an interferer speaks nearby, and ambient noise from the demo environment is simultaneously captured. The system processes audio in real time and outputs the extracted target speech to headphones. Participants can directly compare the processed and unprocessed audio streams in real time. In addition, an optional visualization panel displays input and output waveforms as well as basic runtime statistics (e.g., real-time factor / latency) to facilitate understanding of the relationship between perceptual quality and system behavior.

Interactivity: Participants can actively vary conditions such as speaking style, distance to mic, overlap ratio, etc., and immediately hear how extraction quality changes. This hands-on experience fosters deeper discussion on the current state of TSE technology and highlights the gap between academic benchmark evaluations and real-world streaming constraints.

Impact: Overall, this demo provides a concrete reference for the signal processing community regarding the current practicality of TSE systems, and is expected to stimulate further discussion on streaming architectures and on-device efficiency.

533: Semantic-Aware Speech Anonymization via Neural Codec Editing

Authors: Ngoc Hung Le, Soongsil University; Kyujin Kim, Soongsil University; Sangjun Park, Soongsil University; Yowon Lee, Soongsil University; An Thien Nguyen, Soongsil University; Souhwan Jung, Soongsil University

Description: This demo presents a novel Content Speaker Anonymization Pipeline designed to redact Personally Identifiable Information (PII) from speech while preserving prosodic continuity and naturalness. The system integrates an efficient Whisper-based Automatic Speech Recognition (ASR) module leveraging precise word-level forced alignment with a robust BERT-based Named Entity Recognition (NER) system to locate the timestamp of sensitive information in terms of semantic meaning. After detecting the identifiable information, unlike traditional obfuscation methods that rely on destructive signal masking (e.g., beeping) or artifact-prone copy-paste concatenation, our system utilizes a neural codec language model for speech editing. This architecture treats speech synthesis as a token prediction task, allowing the system to generate pseudonymized speech segments that blend seamlessly with the unedited surrounding context. The pipeline supports flexible replacement strategies, allowing users to switch between rule-based substitution and generative infilling, effectively editing the audio waveform through text manipulation.

For the signal processing community, this system directly addresses the critical privacy-utility trade-off in creating public datasets. By removing sensitive semantic content without degrading signal coherence, it enables the ethical sharing of speech data for downstream tasks such as ASR training and sentiment analysis. It demonstrates a shift from signal-level anonymization to semantic-level editing, setting up a new standard for intelligibility in privacy-preserving speech processing.

The demonstration offers a real-time, hands-on experience. Attendees will be invited to record live speech containing mock sensitive information (e.g., names, locations, phone number). They will visualize the pipeline in action via a dashboard that displays the ASR transcription and highlights detected PII entities. Users can then interactively select replacement methods (e.g., manually type in the replacement or let the system decide) and immediately listen to the anonymized output. This allows for a direct comparison between our neural editing approach and traditional baselines, showcasing the system’s ability to maintain smooth transitions and high speech quality. Currently we support 2 languages: Korean and English for the users to choose from.

570: Electrolaryngeal Speech Enhancement Based on Any-to-Many Voice Conversion

Authors: Bowen Wu (RIKEN, GRP), Carlos Toshinori Ishi (RIKEN, GRP)

Description: A common vocalization alternative for Laryngectomees (individuals who have undergone laryngectomy) is the use of the electrolarynx (EL), a handheld device that generates mechanical vibrations to enable speech production. However, EL speech sounds unnatural due to its monotonous pitch and mechanical excitation, which reduces communicative efficiency. In this demo, we will show an online (on-site) electrolaryngeal-to-normal (EL2NL) voice conversion (VC) system which is based on any-to-many DNN-based VC. We fine-tuned an existing VC model using a small amount of collected EL speech to synthesize EL speech from large-scale NL speech datasets. Using the synthetic EL speech and their NL counterparts as pairs, we fine-tuned an VC for NL2NL to adapt to EL2NL VC. The resulting system can restore reasonably natural intonation and improve intelligibility in the EL speech. Participants may try the EL to experience our EL2NL VC system.

519: Seamlessly Upgrading On-Device Speech Recognition System with More Recent Foundation Models

Authors: Sheng Li (Institute of Science Tokyo)

Description: Recent advances in automatic speech recognition (ASR) have been driven by foundation models trained on massive datasets with large-scale parameters. However, deploying these models on edge AI devices, especially robotic platforms, remains a significant challenge due to limited computational resources. Existing solutions often rely on cloud-based APIs or traditional DNN-HMM frameworks, which may raise privacy concerns or fall short of state-of-the-art performance. This paper presents a novel solution that enables ASR decoders, originally designed for GMM-HMM, DNN-HMM, or End-to-End CTC-Attention architectures, to support modern foundation models such as wav2vec2, HuBERT, Whisper, and recent speech LLMs. Our approach enables seamless integration of cutting-edge, highly accurate speech recognition capabilities into edge AI systems, including ROS-based robotic and smart glass platforms, without compromising user privacy or sacrificing performance. We will provide two interactive demonstrations on robotic platforms and smart glass platforms. This work is a joint project between the School of Engineering, the Institute of Science, Tokyo (previously TokyoTech), and the Department of Informatics, Kyoto University, aiming to integrate recent EdgeAI and spoken-language processing technologies.

538: NVIDIA NeMo Voice Agent: An Open-Source, Multi-Model Framework for Building Your Own Real-Time Conversational AI

Authors: Taejin Park (NVIDIA), He Huang (NVIDIA), Kunal Dhawan (NVIDIA), Jagadeesh Balam (NVIDIA) and Boris Ginsburg (NVIDIA)

Description: We propose a demonstration of the NVIDIA NeMo Voice Agent framework, a comprehensive open-source toolkit designed for building low-latency, real-time voice-to-voice agents. While traditional voice systems often rely on fragmented pipelines, this framework provides a unified, modular architecture that orchestrates the entire conversational loop—from STT(ASR) and speaker diarization to LLM and TTS—within a single high-performance ecosystem.

A core innovation is the framework's commitment to a plug-and-play open ecosystem. Beyond support for NVIDIA’s models, it is fully compatible with widely adopted open-source models, including LLaMA and Qwen for text-based instruction following, and Kokoro and ChatterBox for TTS. For ASR, we provide a rich selection of NeMo endpointing and speaker diarization tools. This flexibility ensures developers can swap components to meet specific performance or hardware needs.

In addition, we support a simulated human evaluator to benchmark other voice agents. Users can fully customize this evaluator by plugging in specific ASR, LLM, and TTS models, allowing for automated, end-to-end testing that is tailored to unique domain requirements or real-life scenarios, not to mention comparisons with other industry-leading voice agents.

Crucially, this initiative democratizes Voice AI, ensuring advanced conversational technology is no longer the exclusive domain of large corporations. By providing this open framework, we empower small startups, students, and researchers to experience, investigate, and innovate on equal footing with industry giants, lowering the barrier to entry for high-quality voice agent development.

Interactive Component: Attendees will interact with a live, ultra-low latency agent and view a real-time "Under the Hood" dashboard. Participants in our Show and Tell session can expect the following:
- Live Model Interoperability: Experience the "plug-and-play" nature of the framework by hot-swapping models on the fly to hear immediate changes in latency and reasoning.
- Task-oriented agent-to-agent Evaluation: Evaluations that focus on achieving real-life tasks, highlighting the practical deployment of voice agents.
- Architectural Transparency: Students and researchers can inspect how the system handles complex audio signals and manages state across the unified pipeline.
- Hardware Agnostic Insights: View performance data across a range of profiles, from local consumer-grade GPUs to data-center infrastructure, offering a wide variety for academic experimentation.

548: Lightweight End-to-end Spoken Language Understanding System for Speech-controlled Video games

Authors: Alex Peiró-Lilja, Barcelona Supercomputing Center and Universitat de Barcelona
Rodolfo Zevallos, Barcelona Supercomputing Center
Iván Cobos, Universitat Politècnica de Catalunya
Xin Lu, Universitat Politècnica de Catalunya
Javier Hernando, Universitat Politècnica de Catalunya and Barcelona Supercomputing Center

Description: We are developing a cross-platform, voice-controlled video game. In this video game, the player is a construction site manager who must instruct worker robots to install the ornamental elements of a chapel’s façade in the correct positions. Players are encouraged to speak commands naturally and can even adapt their speaking style. The game engine uses the SLU to obtain labels, enabling the robots to obey commands accordingly and respond with synthetic voices trained in Catalan. To map natural speech to specific labels, we previously used a cascaded Spoken Language Understanding (SLU) system based on Whisper-Large and a BERT model, both adapted to Catalan. This system was very expensive, and the video game had to remain constantly connected to a server providing inference. To address this issue, we fine-tuned Whisper-Tiny as an end-to-end SLU system, achieving a solution that is more than 20 times lighter while maintaining similar performance. This allows us to integrate the SLU locally, enabling devices to run inference on their own. The game mechanics are original, and therefore no existing data was available to train the model. To solve this, based on samples created by a group of humans, combinations of natural sentences useful for the game mechanics were designed. A text-to-speech system trained in Catalan was then used to synthesize these sentences using different voices. In total, more than 475k labeled samples were generated. In the demo, the video game will be presented ready to play fully offline, using only a laptop and a headset with a microphone to interact with the robots. If the player is not a Catalan speaker, we can interact by translating the player’s intended command ourselves. Moreover, we will share the knowledge required to reproduce the system for other video games or interactive applications that use speech in any language.

557: Toward Realistic Multimodal Speech Processing Benchmarks Using a Multi-Talker Audio-Visual Conversational Corpus

Authors: Bryony Buck, Edinburgh Napier University & University of Dundee
Lorena Aldana, University of Edinburgh
Ondrej Klejch, University of Edinburgh
Peter Bell, University of Edinburgh,
Michael Akeroyd, University of Nottingham

Description: This demonstration presents a novel audio-visual testbed corpus designed for realistic evaluation of multimodal speech enhancement, speech separation, and speech intelligibility systems. The corpus consists of free-flowing three-person conversations recorded under controlled quiet and noisy conditions, including both normal-hearing participants and experienced hearing aid users. Sessions were captured using synchronised lapel microphones and multi-angle video, enabling multimodal signal processing, feature fusion, and joint audio-visual analysis in complex multi-talker environments.

Attendees will engage in immersive speech-in-noise intelligibility assessment scenarios derived from the corpus, experiencing conversational excerpts with varying acoustic interference and audiovisual cues. The demonstration will showcase the application of an established keyword-based data mining evaluation framework (Valentini-Botinhao et al., 2023), previously applied to scripted speech with synthetic noise (Blanco et al., 2023), extending it to spontaneous conversational speech recorded in real noisy environments. This enables scalable intelligibility evaluation without reliance on scripted material while directly testing model robustness and generalisation to ecologically valid conditions. Comparative examples using established speech corpora will illustrate improvements in lexical diversity, reduced repetition, and increased conversational realism afforded by the presented dataset.

The presented corpus is the first audio-visual dataset of free-flowing small-group conversations recorded directly in realistic noisy environments with mixed hearing abilities among interlocutors. Unlike conventional scripted or synthetically corrupted datasets commonly used for benchmarking, it captures natural turn-taking, overlapping speech, lexical variability, and visual articulatory cues critical for multimodal signal processing under real-world communication conditions.

The demonstration further introduces a novel layered real-world soundscape, incorporating competing talkers and multi-level environmental interference known to challenge hearing-impaired listeners. As such, the corpus provides ecologically valid, highly realistic validation conditions for multimodal speech enhancement and separation algorithms.

This demo addresses a critical gap in current evaluation practices (see Buck et al., 2024 for review) by providing realistic, out-of-domain conversational data for benchmarking and generalisation testing of multimodal systems. Attendees can actively engage in immersive listening tasks, exploring the influence of audiovisual cues on intelligibility first-hand. They are invited to provide feedback on dataset usability and evaluation design, contributing to future corpus development and supporting community-driven, realistic benchmarking standards for multimodal communication technology evaluation.

558: Modular, Safe Granite Speech Conversation with Multiple Speakers

Authors: IBM Research: Nathaniel Mills, George Saon, Zvi Kons, Hagai Aronowitz, Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Samuel Thomas, Alexander Brooks, Sashi Novitasari, Tohru Nagano, Takashi Fukuda, Ron Hoory, Brian Kingsbury, Luis Lastras
IBM Software: Richie Verma, Leonid Rachevsky

Description: This demonstration presents the Granite Speech framework as a novel, modular platform for conversational voice interaction, highlighting its ability to support fluid, natural, and contextually grounded exchanges between humans and AI systems. The framework tightly orchestrates high quality, low latency Granite Speech-based transcription, language reasoning with integrated safety guardrails, and expressive speech synthesis within a coordinated runtime. This design enables responsive, low latency interactions while maintaining the flexibility and interpretability characteristic of modular architectures. The demonstration aims to illustrate how such an approach can deliver an end-to-end conversational experience comparable to monolithic speech-to-speech models, yet retain the adaptability needed for the continuous integration of innovations across individual components, such as improved ASR, enhanced LLM reasoning, updated guardrails, or more advanced speech generation.

One such innovation highlighted in the demonstration is the system’s ability to support structured multiparty dialogue through speaker-attributed automatic speech recognition (SA-ASR), a Granite Speech capability that appends explicit speaker labels to the ASR transcript. This mechanism enables the framework to manage conversations involving multiple human speakers while preserving clear attribution, continuity, and contextual grounding across turns. In a typical demonstration scenario, the interaction begins with an initial user exchange in which each participant introduces themselves. As the dialogue progresses, the system continues to detect the inputs of individual speakers and generate responses tailored to the appropriate participant, for example by referencing the corresponding speaker’s name. Prompt-based contextual biasing, a new Granite Speech capability that injects bias keywords into the ASR prompt, can improve the recognition of foreign or otherwise rare names.

Impact to signal processing communities:
1. Contributing mindshare on the design of modern modular spoken conversation systems and their components.
2. Demonstrating the use of open weights models such as Granite Speech, released on Hugging Face, and providing guidance on how to employ these models and their newly released features.

This will be an interactive demonstration in which participants converse with an AI system. Most of the interaction will be conducted by IBM demonstrators, with attendee participation enabled when technically feasible.

563: Reliable Real-Time Meeting Transcription through Multimodal Speaker Detection and Emotion Recognition

Authors: Ran Han (Electronics and Telecommunications Research Institute), Jeom-ja Kang (Electronics and Telecommunications Research Institute), Kiyoung Park (Electronics and Telecommunications Research Institute), Woo Yong Choi (Electronics and Telecommunications Research Institute), Changhan Oh (University of Science and Technology, Electronics and Telecommunications Research Institute), Yeeun Jo (University of Science and Technology, Electronics and Telecommunications Research Institute), Hwa Jeon Song (Electronics and Telecommunications Research Institute)

Description: We present a reliable real-time meeting transcription system that integrates multimodal speaker detection and emotion recognition using video and circular array microphone signals. The proposed Show & Tell demo targets realistic multi-party meeting environments, where background noise and overlapping speech often degrade conventional speech-only transcription systems. By combining acoustic processing from a circular array microphone with visual information from a 360-degree camera, the system enables robust speaker diarization and real-time speaker-aware transcription.
The system processes synchronized audio-visual streams captured during meetings. On the acoustic side, beamforming and noise reduction enhance target speech signals recorded by a six-channel circular array microphone. In parallel, a video-based Active Speaker Detection (ASD) module estimates speaker locations and speaking activity using visual cues. The integration of acoustic and visual modalities allows recognized speech segments to be accurately associated with individual speakers, even under overlapping speech and spatial ambiguity.
Beyond speaker-aware transcription, the system incorporates a multimodal emotion recognition module that analyzes visual cues, acoustic characteristics, and linguistic context derived from recognized speech. Facial expressions, acoustic features, and semantic information are jointly used to estimate the emotional states underlying each utterance. This allows users to understand not only what was said and by whom, but also how participants expressed themselves emotionally during the meeting.
To support an interactive Show & Tell experience, the demo provides a live visualization interface where attendees can observe, in real time, who is speaking and how emotional states are reflected alongside the transcribed speech. Participants can engage in spontaneous discussions and observe how speaker overlaps, turn-taking dynamics, and emotional shifts are captured and visualized by the system. After the meeting concludes, users can generate a summary of the entire conversation upon request, including speaker-wise summaries for structured review of individual contributions.
By integrating array microphone–based acoustic processing, video-based speaker detection, automatic speech recognition, multimodal emotion recognition, and post-meeting summarization into a unified pipeline, the proposed demo demonstrates the feasibility of multimodal signal processing for real-time meeting transcription and offers an intuitive Show & Tell experience for robust multimodal meeting understanding in multi-speaker environments.

520: DeepAudioX: An Open-Source Python Library for Audio Learning and Rapid Prototyping with Pretrained Models

Authors: Christos Nikou, National Centre for Scientific Research Demokritos
Stefanos Vlachos, National Centre for Scientific Research Demokritos
Ellie Vakalaki, National Centre for Scientific Research Demokritos
Theodoros Giannakopoulos, National Centre for Scientific Research Demokritos

Description: We present DeepAudioX, an open-source PyTorch-based library that enables rapid development, training, evaluation, and deployment of audio classification systems using pretrained audio foundation models as feature extractors. Unlike existing toolkits that require extensive boilerplate code or impose rigid workflows, DeepAudioX combines plug-and-play pretrained backbones, modular pooling and classifier components, and unified high-level training and evaluation loops, offering an end-to-end solution from model development to deployment. Additionally, the library is designed to be easily extensible and customizable, enabling users to integrate their own backbones, datasets, and pooling methods while leveraging the remaining pipelines of the framework. In this way, DeepAudioX meets the needs of both novice and expert users.

The demo will feature a series of interactive Jupyter notebooks showcasing end-to-end workflows on representative benchmark tasks, including speech emotion recognition, music genre classification, language identification, and sound event classification. Attendees will observe and directly interact with dataset construction, integration of pretrained backbones (e.g., BEATs), classifier configuration, and training and evaluation through concise APIs. Live performance metrics and efficiency indicators — including coding effort, training time, and accuracy — will be displayed to illustrate the effectiveness of pretrained audio representations combined with modular pooling and classifier architectures.

Source code and documentation are publicly available at: https://github.com/magcil/deepaudio-x

523: Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

Authors: Louis Lerbourg, CEA Grenoble, UGA
Paul Peyret, Biophonia
Juliette Linossier, Biophonia
Marielle Malfante, CEA Grenoble, UGA

Description: Passive Acoustic Monitoring (PAM) is an efficient and non-invasive method for monitoring ecosystems by allowing the acquisition of large bioacoustic datasets during lengthy deployment campaigns.
The AudioMoth is a standard among autonomous recorders allowing such studies by providing means to record data insitu at a reduced cost.
In this demonstration, we show two concepts of smart-AudioMoth by extending the original system capabilities with AI functions without adding any hardware components.
Specifically, the first device firmware is updated to allow a continuous analysis of the soundscape in real time, along side its recording.
The second device is flashed with a firmware starting a recording only if the targetted bird specy has been detected in the soundscape.
To the best of the authors knowledge such a contribution displaying real time classification of the soundscape directly on the AudioMoth has never been published.
The same neural network is used in both cases and fits under the remaining 10kB of RAM available on the AudioMoth.
It is based on a 1d-CNN architecture (12 layers) and is trained on more than 10.000 calls of Scopoli’s Shearwater seabird (male, female, chicks and background noise & non-target species) recorded at 24kHz with 16bits resolution using ten different recorders dispatched in the Pelagie islands.
The model is validated with 91% accuracy on the test set and 100% accuracy in experimental conditions.

Both devices can be manupulated by the audience.
Their analysis capabilities are illustrated by playing real world recordings within earshot of the devices.
Furthermore, the energy consumption and latency are visible in real time for the audience with the experimental set-up that was built for this purpose.
A five minutes video illlustrating the demonstration is also avaible.

Our aim with this demonstration for the signal processing community is to show the possibility to develop but also deploy models performing a continuous analysis of acoustic data, in real time and with strong memory, computation and energy constraints.
The success of this contribution lies within the multi-disciplinary skills of the authors which gathers: signal processing and machine learning, data and eco-acoustics, and embedded devices expertises.

546: An Interactive Music Analysis Platform for Pedagogy and Audio Organization

Authors: Parampreet Singh: Indian Institute of Technology Kanpur
Sumit Kumar: Indian Institute of Technology Kanpur
Vipul Arora: Indian Institute of Technology Kanpur, Katholieke Universiteit Leuven

Description: We present an interactive platform for Indian Art Music (IAM) that is useful for music pedagogy and automatic analysis of music audio in terms of aspects such as raga, ornamentation, and melody.

Pedagogy: The teaching module[1] enables music teachers to digitally record structured lessons by selecting specific tonic and tāla. The complete lesson package can then be shared with students. Students can import lessons via a dropdown menu, set their preferred tonic and tempo, practice repeatedly, and submit their final recording to the teacher. An AI-based automatic mistake recognition module compares the student’s performance with the teacher’s reference audio and highlights mistakes, and assigns an overall score.

Organisation: The analysis module enables users to either upload an audio file or provide a YouTube link. The system analyses the audio and identifies its raga[2], various ornamentations[3], and main melody[4]. Some components enable interactive corrections.

Novelty: The demo uniquely integrates novel pedagogical and analysis workflows with signal-processing-driven analysis, bridging education, MIR, and explainable AI.

Impact: The platform demonstrates how signal processing and machine learning can be embedded into culturally rich, real-world music analysis and education workflows, opening new directions for applied MIR, educational signal processing, and human-centered AI systems.

Interactivity: The booth will feature
-fully interactive "Music Classroom" where attendees can wear headsets and act as students or teachers. They can listen and sing. The system will provide immediate visual feedback on their singing, highlighting specific mistakes and assigning a score.
-an analysis system where attendees can provide IAM clips from YouTube or their own devices to know the musicological aspects, such as raga, melody contours and ornamentations in real-time.

References:
[1] doi.org/10.36227/techrxiv.23269502.v2
[2] doi.org/10.1109/TASLPRO.2025.3574839
[3] doi.org/10.1109/TASLPRO.2025.3639738
[4] doi.org/10.1109/TASLP.2024.3399614

552: A System-Integrated Parametric Array Loudspeaker Prototype for Controllable and Localized Sound Field Regulation

Authors: Jun Yang, Institute of Acoustics, Chinese Academy of Sciences
Yunxi Zhu, Institute of Acoustics, Chinese Academy of Sciences
Xiaoyi Shen, Institute of Acoustics, Chinese Academy of Sciences

Description: This Show & Tell demo showcases a parametric array loudspeaker (PAL)-based sound field control system developed by the research group led by Prof. Jun Yang from the Institute of Acoustics, Chinese Academy of Sciences. The demo is rooted in the group’s long-term, systematic research on the theoretical analysis and engineering implementation of PAL technology—findings comprehensively summarized in their recent monograph Parametric Array Loudspeakers: From Theory to Application (Yang & Ji, 2025, Springer Nature)—presenting a fully integrated PAL sound field control prototype.
The prototype encompasses core components including ultrasonic transducer arrays, driving electronics, modulation and control modules, as well as real-time measurement and visualization tools. It enables the generation of highly directional audible sound and localized sound field regulation in free space, exhibiting practical sound field control capabilities that surpass conventional loudspeaker systems.
The key innovation of this demo lies in its systematic integration of PAL theory and engineering applications. Leveraging rigorous nonlinear acoustic modeling, the demo illustrates how theoretical findings are translated into actionable engineering design decisions, such as modulation strategies, carrier frequency selection, and array configuration optimization. Distinct from typical PAL demonstrations that merely focus on perceptual effects, this work emphasizes stable, controllable, and repeatable sound field regulation—an indicator of the maturity of its underlying theoretical and engineering framework.
This pioneering integration breaks through the limitations of traditional discrete PAL setups, which often suffer from instability and poor repeatability, and establishes a standardized engineering paradigm for directional sound technology.
This demo provides the signal processing community with a representative case study on the practical implementation of nonlinear acoustic signal processing and array-based control techniques in PAL systems. It serves as a valuable reference for researchers and engineers engaged in spatial audio, sound field control, and the engineering application of advanced acoustic signal processing.
During the demonstration, attendees will have the opportunity to interact with the system by adjusting modulation parameters and playback signals. They can observe real-time variations in sound directivity and spatial confinement through both perceptual experience and on-site acoustic measurements, fostering an intuitive understanding of PAL-based sound field control technology.

568: SciPhi - a spatial audio language model that understands real multi-source audio scenes

Authors: Sebastian Braun (Microsoft), Dimitra Emmanouilidou (Microsoft), David Johnston (Microsoft), Xilin Jiang (Columbia University New York), Hannes Gamper (Microsoft)

Description: Spatial audio records sound sources and their directions, allowing humans and machines to hear not just what happens when, but also where. Audio-language models (ALMs) play a crucial role in bridging the gap between audio and language understanding, expanding the modalities of human-computer interaction. While most established ALMs are only monaural, offering no spatial understanding, our IEEE OJSP paper presented here at ICASSP 2026 “Sci-Phi: A Large Language Model Spatial Audio Descriptor” introduces the first ALM with spatial audio support that shows generalization to real recordings beyond synthetic audio data. Sci-Phi describes sound events, direction, time occurrence, loudness and acoustic attributes such as reverb and room characteristics.

In this demo, we showcase Sci-Phi with an improved and more robust spatial audio encoder, and extensive Question and Answering (Q&A) capabilities trained on a new Q&A dataset. Attendees see and hear real spatial sound scenes recorded with a first-order Ambisonics microphone array (and a 3D camera for reference). The analysis output of the Sci-Phi ALM is overlaid onto the panoramic video at the corresponding time and location, while the model input can be heard binaurally via headphones. The interactive Question and Answering mode allows attendees to query the language model about specific sound objects and their relations in the scene.

This demo illustrates the capabilities and limitations of spatial audio understanding and spark future developments. Spatial ALMs can have great impact on the signal processing community as generalist tools for data analysis and curation, enablers for spatially aware agents and hearing assistive or augmented technologies.

571: VisionSFX: Cross-Shot Consistent Video-to-Audio Generation with Depth-Aware Binaural Audio Rendering

Authors: Hyeonwoo Park, Gwangju Institute of Science and Technology
Dayeon Ku, Gwangju Institute of Science and Technology
Jung Hyuk Lee, Gwangju Institute of Science and Technology
Hwa-Young Park, Gwangju Institute of Science and Technology
Jongyeon Park, Gwangju Institute of Science and Technology
Hong Kook Kim, Gwangju Institute of Science and Technology

Description: Problem definition.
As Video-to-Audio (V2A) technology advances toward real-world deployment in film post-production and automated video editing, perceptual consistency across shots becomes critical. However, current V2A methods [1] operate on a shot-by-shot basis, treating each shot as an isolated unit. This leads to two critical limitations: (1) ambient sounds disappear in shots where the source is not visible, and (2) the same sound effects (SFX) exhibit inconsistent characteristics across different shots. Moreover, existing methods ignore spatial positioning, producing audio without directional correspondence to on-screen objects. These limitations severely degrade perceptual continuity, spatial realism, and overall production quality.

Key Challenges.
. Cross-shot Consistency: Grouping shots that share the same physical space despite different camera angles, ensuring consistent ambient sound.
. Binaural Audio Rendering: Estimating sound source positions and listener perspective from visual cues alone, without explicit depth or 3D scene data.

Methodology.
We demonstrate VisionSFX, a working system that addresses both challenges. For cross-shot consistency, our system feeds all shot keyframes to a vision-language model (VLM) at once, detecting objects and their visual attributes [2]. The VLM groups shots by shared visual features, identifies sound-producing objects, and generates SFX for each using TangoFlux [3]. For binaural audio rendering, each video frame is converted to a monocular depth map to estimate where sound sources are located in 3D space. The system then binaurally renders each sound—anchoring dialogue to the center and placing SFX across the binaural field based on their on-screen positions.

System Pipeline.
Our system comprises two stages. For SFX generation: (1) TransNetV2 [4] for shot boundary detection, and (2) parallel VLM inference for shot grouping. For binaural audio rendering: (3) Grounded-SAM [5] for object detection, (4) Depth Anything 3 [6] for depth estimation, and (5) HRTF-based binaural rendering [7].

What to Show at ICASSP.
We demonstrate VisionSFX running fully offline on a laptop and an edge device. Attendees can explore the pipeline with prepared videos or upload their own to experience cross-shot consistency and binaural audio rendering firsthand. Additional materials at [https://drive.google.com/drive/folders/10BqWfxXRZTwPcwy_3lX2ObSaLSW4vLx3?usp=sharing].

[1] https://github.com/hkchengrex/MMAudio
[2] https://doi.org/10.48550/arXiv.2511.21631
[3] https://github.com/declare-lab/TangoFlux
[4] https://github.com/soCzech/TransNetV2
[5] https://github.com/IDEA-Research/Grounded-Segment-Anything
[6] https://github.com/ByteDance-Seed/Depth-Anything-3
[7] https://www.isca-archive.org/interspeech_2019/lee19b_interspeech.html

510: Sub-Nyquist DoA Estimation of an Ultrasound Source in a Sector of Interest

Authors: Paul Barend Groen, Bein Frederik Jacob Kamminga, Lars Jisse Hoogland, Boele van Schaik, Anniek Christine van der Veen, Edoardo Focante, Hamed Masoumi, Nitin Jonathan Myers

Description: Our prototype demonstrates direction-of-arrival (DoA) estimation of an ultrasound source from a limited number of microphone measurements. The demo leverages prior knowledge that the sound source lies within a known sector of interest. In practice, this sector is typically identified from an initial scan with wide beams (as done in this demo) or from known source motion statistics. The scientific challenge here is to estimate the DoA of any source, within the sector of interest, for a specified number of active microphones. We propose an integer programming-based approach that optimizes the set of active microphones so that the aliasing artifacts within the sector of interest are minimized. While the optimized configuration results in higher aliasing artifacts outside the sector, the out-of-sector artifacts are irrelevant as the source is known to lie within the identified sector.

Our live demonstration comprises a 40 kHz ultrasound testbed with a single speaker (transmitter) and an array of eight microphones (receive channels). To make the demo interactive, we will show the ultrasound signal power using FFTWave Android application. Furthermore, the receive array of microphones is mechanically steered (instead of electronic beam steering) so that visitors can observe how the sector of interest is identified. The received signals at the microphone array are processed on a Teensy 4.1 to estimate inter-microphone phase differences. These phase differences are then used for matched-filter-based DoA estimation. In our experiment, we consider only four active microphones of the available eight to demonstrate sub-Nyquist sampling. Using these four measurements, we show the matched-filtered output, corresponding to our optimized configuration, for DoA estimation on our laptop. We will demonstrate, in the form of a live plot, the aliasing artifacts within and outside the sector of interest. Visitors will also have the opportunity to enter their own choice of active microphones to observe the aliases. Furthermore, they can also visualize how the resolution of the DoA estimate varies with the number of microphones using our demo. A video of our setup is available here: https://www.youtube.com/watch?v=-2LHcuZD1S0.

512: Photo-Driven Multimodal Conversational AI for Reminiscence-Based Cognitive Training and Longitudinal Cognitive Assessment

Authors: 1. Byung Ok Kang (Electronics and Telecommunications Research Institute)
2. Byounghwa Lee (Electronics and Telecommunications Research Institute)
3. Hwa Jeon Song (Electronics and Telecommunications Research Institute)

Description: We present an interactive multimodal cognitive training and assessment system that leverages photo-driven conversational AI to support cognitive enhancement and early detection of cognitive decline. The proposed demo focuses on reminiscence-based cognitive training using a large-scale historical photo and video database that reflects past lifestyles, everyday objects, cultural contexts, and autobiographical memories. By selecting photos from the database, users engage in natural spoken dialogues generated by a multimodal generative AI model that adaptively tailors questions according to each user’s cognitive level, response context, and prior interaction history.

The framework builds upon reminiscence therapy, a clinically validated cognitive intervention in which narrating personal past experiences stimulates memory, emotion, and communication. During reminiscence-based interaction, recalling and describing past events primarily activates episodic long-term memory, which stores autobiographical experiences associated with temporal and contextual cues. As users reconstruct these memories through dialogue, semantic long-term memory is simultaneously engaged to support conceptual understanding and language processing, while working memory is recruited to maintain conversational flow, comprehend questions, and organize responses. Through this process, multiple cognitive systems—including attention, language generation, executive function, and emotional regulation—are integratively stimulated.

The system integrates visual understanding, automatic speech recognition, natural language generation, and cognitive signal analysis to automatically generate clinically informed questions that stimulate memory recall, attention, language, and executive functions. Unlike conventional repetitive question-and-answer cognitive training, the proposed approach emphasizes personalized, context-aware, and emotionally engaging conversations grounded in autobiographical memory cues and multisensory associations.

To support continuous monitoring, the system quantitatively analyzes linguistic, acoustic, and response-pattern signals extracted from spoken interactions and compares them with historical cognitive assessment records. This enables intuitive visualization of individual cognitive trajectories over time, supports early identification of cognitive decline trends, and provides timely and interpretable feedback for preventive intervention and personalized training adjustment.

This work extends our ICASSP 2025 Show&Tell system, which focused on spoken language–based screening of mild cognitive impairment, toward an active and interactive cognitive training paradigm. The unified integration of reminiscence-based cognitive theory, adaptive multimodal conversational AI, and longitudinal cognitive signal analysis demonstrates how multimodal signal processing can be operationalized in practical, user-centered cognitive healthcare systems beyond laboratory settings.

526: Plug-and-Play Latent Diffusion for Ultrasound Inverse Imaging – Show and Tell Demonstration

Authors: Ruizhi Zhang, Rui Guo, Yonatan Kvich, Adi Wegerhoff, Yonina C. Eldar
Weizmann Institute of Science, Rehovot, Israel

Description: Pathological tissues exhibit acoustic property variations (e.g., speed of sound and attenuation) relative to healthy tissue. Recovering spatial maps of these properties provides valuable diagnostic information, but ultrasound imaging in the presence of strong acoustic contrasts (e.g., bone vs. soft tissue) becomes a challenging ultrasound inverse problem due to severe reflections and multiple scattering. This demo showcases a physics-guided plug-and-play latent diffusion approach that reconstructs speed of sound maps directly from measured channel data, enabling stable imaging under strong scattering conditions.

This demo demonstrates how modern diffusion-based generative priors can be combined with physics-based models to tackle challenging nonlinear inverse problems, with relevance to medical ultrasound imaging. Our key innovation is physics-guided latent diffusion inference for ultrasound inverse imaging: iterative latent refinement integrates (i) the learned generative prior and (ii) the measurement-consistency constraint from the forward acoustic model, enabling stable reconstructions under severe multiple scattering.

The demonstration platform is built on a Verasonics research ultrasound system with 16 individually addressable custom immersion transducers (300 kHz center frequency) arranged in a circular configuration with a 10 cm aperture diameter. The Verasonics platform performs synchronized transmission/reception and stores raw channel data for reconstruction. To provide intuitive insight into the measurement process, the platform additionally display example received waveforms in real time. We scan tissue-mimicking phantoms containing bone-like structures and edema-like inclusions and reconstruct speed-of-sound maps from the time-domain channel measurements. Attendees can interact with a custom GUI to (1) visualize raw channel data, calibration, and preprocessing, and (2) run reconstructions and compare with baseline methods.

Reference:
[1] R. Guo, Y. Zhang, Y. Kvich, T. Huang, M. Li, and Y. C. Eldar, “Plug-and-Play Latent Diffusion for Electromagnetic Inverse Scattering with Application to Brain Imaging,” arXiv preprint arXiv:2509.04860, 2025.

527: Full Wave Inversion for Pulse-Echo Ultrasound Linear Arrays

Authors: 1. Ditza Auerbach, Weizmann Institute of Science, Rehovot, Israel
2. Rui Guo, Weizmann Institute of Science, Rehovot, Israel
3. Adi Wegerhoff, Weizmann Institute of Science, Rehovot, Israel
4. Yonina C. Eldar, Weizmann Institute of Science, Rehovot, Israel

Description: Acoustic tissue parameters such as the speed of sound (SoS) and attenuation can serve as localized biomarkers for pathological conditions of tissues such as fibrosis, tumors and inflammation. Changes in these parameters relative to their counterpart in heathy tissue can provide diagnostic information beyond structural imaging. However, conventional clinical pulse-echo systems use B-mode imaging, which captures qualitative anatomical structure but cannot quantitatively map the underlying acoustic properties throughout the medium. In the proposed demonstration, we show for the first time how local acoustic properties can be inferred from the raw data of a commercial pulse-echo ultrasound system by carrying out a physical full wave inversion (FWI).
The demonstration is based on applying an efficient FWI algorithm framework that enables SoS reconstruction from the raw channel data sensed by a linear array ultrasound probe. Specifically, it will be carried out using a portable commercial research ultrasound system ArtUs, manufactured by Telemed UAB, enabling the attendees to experience scanning phantoms to get the look and feel of standard beamformed images. On a separate route, the raw data will be collected, and a custom GUI will be used to visualize & demonstrate the algorithmic route leading to the reconstruction of the SoS maps. The reconstruction itself is an extremely ill-posed inverse problem, which we solve using regularization, ADMM optimization, and efficient computational frameworks for both inversion and calibration.
The demonstration serves as an initial guide and proof of concept for FWI of pulse-echo Ultrasound data; we plan to generalize it to clinical scenarios by incorporating model-based AI into the FWI framework. We will allow attendees to scan some phantoms mimicking clinical targets using B-mode imaging, thus demonstrating some of the challenges in bringing inverse-imaging to the clinic.

543: From Wearables to Generative Insight: A Multimodal Framework for AI-Augmented Cardiology Assistance with Single-lead Electrocardiogram

Authors: Arijit Ukil, TCS Research, India
Prithwiraj Mitra, TCS Research, India
Trisrota Deb, TCS Research, India

Description: We present an interactive demonstration of an edge-native Cardiac Clinical Guidance System (CCGS) that translates raw ECG signals into real-time, personalized clinical insights, exemplifying a “Signal-to-Semantics” paradigm. Using a Polar H10 wearable, the system performs real-time signal processing for artefact suppression and feature extraction, enabling accurate detection of cardiac Arrhythmias and other cardiac anomalies. These signals are fused with multi-modal patient data, including symptoms (for e.g., fatigue, fever, chest pain, etc…), blood reports (Blood Glucose, Complete Blood Count, Cholesterol, etc…), blood pressure measurements, imaging (for e.g., Echocardiogram), other reports like tread-mill test, and self and family history on cardiac diseases to generate a comprehensive cardiac risk profile, incorporating standard clinical scores (ASCVD, QRisk3, Framingham, CHA₂DS₂ VASc). BioMistral-7B, a domain-specific medical small language model (MSLM), then produces tailored guidance, including lifestyle advice, medication suggestions, and diagnostic recommendations. The system runs on an Android interface of a smartphone with GPU-accelerated inference for responsive performance. Designed as a clinical companion rather than a replacement for medical professionals, the CGS supports personalized care and informed decision-making.
Novelty and Impact:
This demonstration features a novel integration of biosignal processing with generative AI, showing how signal analysis can serve as the foundation for semantic-level clinical reasoning. It highlights how signal processing and established risk scoring enable a small language model to produce actionable health insights. This approach integrates signal processing and Gen AI for intelligent clinical guidance and demonstrates the critical role of signal processing in the real-world healthcare AI applications.
Impact to Signal Processing Community:
By bridging low-level physiological data with high-level semantic inference, our work positions signal processing as a key enabler of context-aware AI systems. It showcases new opportunities for signal processing professionals to contribute to intelligent, conversational AI-driven smart healthcare solutions that yield real-time, clinically meaningful guidance and support.
Interactive Demo:
Attendees will engage directly with the system by wearing a Polar H10 chest strap to capture their ECG signals. The system will display real-time cardiac analytics and generate anonymized clinical reports, accessible either on participants’ Android smartphones via a companion application or on a dedicated demo tablet provided at the venue.

545: Size Doesn’t Matter: Interactive Acoustic Imaging System Design Using a 1024-Channel Ultrasound Array

Authors: Wouter Jansen - Cosys-Lab, University of Antwerp, Antwerp, Belgium
Dennis Laurijssen - Cosys-Lab, University of Antwerp, Antwerp, Belgium
Jan Steckel - Cosys-Lab, University of Antwerp, Antwerp, Belgium

Description: As a research lab specialized in in-air ultrasound, we present an interactive demonstration of our latest sensor technology, showcasing our expertise in this field. We propose a live demonstration of our HiRIS sensor, a massive 1024-channel in-air ultrasonic array [10.1109/ACCESS.2024.3385232]. While its dense aperture offers artifact-free imaging with high dynamic range, this exhibit highlights its broader contribution as a reconfigurable validation platform for array signal processing. By treating the 1024 elements as a dense candidate grid, the system can function as a virtual, programmable aperture. This allows for the experimental validation of arbitrary sparse array geometries and beamforming pipelines using real-world acoustic data. Typically, validating these geometries relies heavily on simulation, which often fails to capture the real-world performance.

The setup of our demonstrator consists of the HiRIS sensor facing a set of closely positioned, adjustable acoustic sources. Visitors will interact via a computer interface (e.g. tablet) to define the virtual array geometry by selecting from standard topologies (e.g. regular grid, spiral, poisson disk sampling, random, concentric) or drawing custom patterns. In addition, users can choose among different beamforming and imaging algorithms, such as Delay-and-Sum, Delay-Multiply-and-Sum, MUSIC, and MVDR, enabling a direct exploration of how array geometry and processing method jointly influence performance. The system will compute on the spot the resulting beam patterns using our fast signal processing pipeline. This allows a direct comparison of spatial resolution, peak-to-sidelobe ratio, and grating-lobe suppression against the ground truth of the full 1024-element array, which will also be visualized for the visitor.

To encourage visitor engagement, we introduce an optimization challenge: participants must design a geometry that maximizes source separation while minimizing the active channel count, optionally leveraging advanced beamforming methods to compensate for sparsity. This interactive benchmark effectively illustrates the fundamental trade-offs between array density, algorithmic complexity, and imaging quality. A live leaderboard will track the most efficient designs, encouraging participants to find the best solutions in terms of both hardware sparsity and imaging fidelity.

511: Millisecond-Order Self-Adaptive AI WiFi Receiver

Authors: Nimrod Glazer, Ben-Gurion University of the Negev, Israel
Gal Francis, Ben-Gurion University of the Negev, Israel
Gur Masury, Ben-Gurion University of the Negev, Israel
Nir Shlezinger, Ben-Gurion University of the Negev, Israel

Description: Deep learning is expected to greatly facilitate the operation of wireless receivers in challenging environments. However, traditional deep learning methods, when adapted to wireless receivers, struggle to adapt in real-time due to rapidly changing channels, hardware limitations, and latency constraints. In particular, conventional deep learning models, though powerful, are large, static, and ill-suited for dynamic, resource-limited environments, making real-time adaptation challenging.
This demo proposal for ICASSP 2026 aims to showcase an innovative approach for realizing AI for wireless receivers that is both (i) lightweight and (ii) adaptive in orders of milliseconds. Our proposed solution leverages recent developments in modular, Bayesian architectures designed for rapid, online training and adaptation. In doing so, we develop autonomous, energy-efficient, and reliable physical-layer systems tailored for next-generation networks. Key to our approach is achieving real-time self-adaptation with training times in millisecond-order, with limited data and on hardware-constrained edge devices, by deviating from traditional stochastic gradient descent based optimization. Instead, we cast adaptation of AI receivers as ultra-fast lightweight continual Bayesian tracking to facilitate swift online learning and implement asynchronous training updates triggered only when drift detectors identify significant environmental changes. This strategy minimizes computational load while maintaining high responsiveness.
The demo showcases our novel methodology using Pluto Plus SDR hardware to implement the receiver chain for OFDM Wi-Fi 802.11a/c signals. The standard equalizers and least squares symbol detection are replaced with an AI-aided pipeline comprised of a modular convolutional neural network constantly adapted using single-step Extended Kalman Filter (EKF) based continuous learning. Training data includes the 802.11 a/c synchronization symbols acquired from over-the-air signals and processed on GNU platform.
Our demonstration highlights several key aspects of AI for wireless receivers:
(i) it can lead to superior performance over classical equalizers;
(ii) it can be made adaptive without inducing notable latency in neither inference nor learning;
and importantly (iii) it can realize self-autonomous protocol-compliant wireless receivers on limited off-the-shelf SDR hardware.
In doing so, we illustrate its potential for real-time, adaptive wireless communication systems.

549: Over-the-Air Computation with Neural Constellations for Two-Way Streaming

Authors: Aswathylakshmi Pallathadka (CTTC), Adriano Pastore (CTTC)

Description: Description:
The demo consists of two nodes (UEs) transmitting video streams in the same time and frequency resources by mapping their messages onto end-to-end learned neural constellations optimized for over-the-air computation of their sum. A relay located between the UEs captures this composite signal, decodes the bitwise XOR sum of their messages, and retransmits the sum bits. Each UE receives the sum signal from the relay and subtracts its own message bits to recover the video stream from the other UE.

Set-up:
Two USRPs transmit the video streams of the two UEs and receive the sum of the streams from a third USRP functioning as the relay. The UEs and relay are controlled by MATLAB interfaces.

Novelty and innovation:
While traditional QAM constellations optimize the performance of individual transmitter-receiver links, coherent interference from another transmitting node can degrade its error performance. The neural constellations used by the UEs in this demo are optimized for over-the-air computation of the sum of their transmitted signals and offer better error rates when both the UEs are transmitting concurrently in the same frequency. The use of these neural constellations for signaling not only doubles the spectral efficiency, but it also reduces receiver complexity and decoding time at the relay node by circumventing the need to decode the individual messages through expensive successive interference cancellation or joint decoding operations.

User interactivity
1. Live over-the-air communication using USRPs.
2. User interface for adjusting power and phase differences between the UEs (as a proxy for relative changes in UE locations).
3. Audience can also manipulate the antennas on the USRPs to see changes in the constellations received at the relay in real-time.
4. Live display of decoded video messages next to the original transmitted ones for comparison of video quality recovered through the neural scheme against the baseline (traditional QAM) scheme.
5. Display of the neural vs baseline constellations used at the UEs, and the respective sum constellations received at the relay in real-time.
6. Display of the live bit error rates.

550: Near-field MIMO with tri-polarized antennas

Authors: Adrian Agustin
Centre Tecnològic de Telecomunicacions de Catalunya (CTTC/iCERCA)
Xavier Mestre
Centre Tecnològic de Telecomunicacions de Catalunya (CTTC/iCERCA)

Description: Detailed description of the demo:
In the journal paper:
A. Agustin and X. Mestre, "Exploiting Multiple Polarizations in Extra Large Holographic MIMO," in IEEE Transactions on Wireless Communications, vol. 25, 2026, doi: 10.1109/TWC.2025.3623866
We investigated the spatial multiplexing capabilities of large multi-antenna configurations under line-of-sight and near field conditions with multiple orthogonal polarizations, by means of three infinitesimal dipoles.
In the proposed demo we will show, on a small scale, when it is possible to use up to three spatial dimensions with tri-polarized antennas. Each prototyped tri-polarized antenna consists of a patch antenna (2 polarizations) and 1 monopole (perpendicular polarization) working at 2.45 GHz. Considering the transmitter and receiver are separated by around half-a meter (for the demo), it will be evaluated how the separation among the transmitter elements or the separation between transmitter and receiver influences the eigenvalues of the channel. The concept can be extended to larger distances if the aperture array increases.
The demo is based on a single SDR with 8 radio frequency channels that controls the transmitted and received signals through the tri-polarized antenna (3 RF chains per antenna). The equivalent wireless channel is estimated from dedicated pilots sent by each transmitting polarization.

Main novelty and innovations of the demo:
This demo will present protypes of antennas with 3 polarizations and show under what conditions additional spatial streams can be transmitted in the near-field by means of polarized antennas.

Impact to signal processing communities:
Considering that conventional systems are designed to work with 2 polarizations, since communications are designed to work in the far field region, this demo shows a new dimension that can be exploited.

Interactivity for attendees:
The attendees will observe through the screen how the eigenvalues of the equivalent channel vary as a function of the distance of terminal and separation of transmitting antennas, elucidating the benefit of the near-field transmission.

554: A flexible system-on-chip FPGA architecture for prototyping experimental GNSS receivers

Authors: Marc Majoral (CTTC), Javier Arribas (CTTC), Carles Fernández-Prades (CTTC)

Description: This demo presents a flexible and low-cost Global Navigation Satellite System (GNSS) receiver prototype based on a System-on-Chip Field Programmable Gate Array (SoC FPGA) architecture, enabling efficient prototyping of experimental GNSS signals and advanced signal processing algorithms. The platform combines the adaptability of Software-Defined Radio (SDR) concepts with the massive parallelism and power efficiency of reconfigurable hardware, addressing the limited flexibility of Application-Specific Integrated Circuit (ASIC)-based receivers and the high power consumption of software-only implementations.

The prototype integrates a Free and Open Source Software (FOSS) GNSS processing engine, providing full visibility and control over the baseband processing chain. The architecture emphasizes customization and reprogrammability, enabling researchers to implement, test, and refine novel receiver concepts.

By offloading computationally intensive signal processing tasks to the FPGA while retaining software flexibility on the embedded processor, the proposed architecture achieves improved energy efficiency compared to software-only GNSS receivers operating on general-purpose processors. This balance between performance, power efficiency, and programmability enables advanced GNSS concepts to be evaluated in realistic field testing environments using small form factor (SFF) devices.

The capabilities of the proposed platform have been validated through multiple concept implementations, including a low-power spaceborne GNSS receiver capable of processing signals in Low Earth Orbit scenarios, a real-time GNSS signal rebroadcaster enabling signal generation and regeneration with minimal latency, and a high-sensitivity GNSS receiver capable of acquiring and tracking weak signals with carrier-to-noise density ratios as low as 20 dB-Hz.

Attendees will be guided through the configuration and operation of the receiver. A detailed description of the architecture will be provided, along with an explanation of how experimental signal processing strategies can be implemented within the GNSS processing engine.

The demo will include a live demonstration, with the receiver operating in real time, and a post-processing demonstration using recorded multi-frequency and multi-constellation GPS and Galileo signals. The demonstrations will cover a static and a Low Earth Orbit (LEO) scenario, highlighting the platform’s capabilities under different signal conditions.

During the demo, standard-format receiver outputs, such as navigation solutions and measurement data, will be demonstrated, and key receiver measurements will be monitored live to illustrate real-time system performance.

561: FPGA Demonstration of High-reliability Low-latency Belief Propagation Decoding of Quantum LDPC Codes

Authors: Alexios Balatsoukas-Stimming (TU/e)
Alex Alvarado (TU/e)

Description: Quantum computing could transform fields like drug discovery, materials science, and cryptography. However, qubits are susceptible to noise, decoherence, and operational errors that rapidly corrupt information. Robust quantum error correction (QEC) is vital for scaling past today’s noisy prototypes. QEC must operate with micro-to-nanosecond-scale latency and high reliability to maintain logical error rates below 1e-12 to achieve fault tolerance.

Low-density parity-check (LDPC) codes are classical error correction schemes defined by a binary parity-check (PC) matrix that can be represented as a Tanner graph. LDPC codes are decoded using graph-based iterative message passing, i.e., belief propagation (BP). Their QEC counterparts, quantum LDPC (QLDPC) codes, are defined by quaternary PC matrices to handle the three types of Pauli errors under the depolarizing quantum noise model. QLDPC codes can be decoded via quaternary BP4.

In this demo, a custom hardware architecture of an FPGA-based QEC simulator will be demonstrated. We consider the [[126, 28]] and [[254, 28]]] generalized bicycle QLDPC codes. The core component is a fixed-point hardware-optimized BP4 decoder implemented in HDL. The decoder is specifically designed to efficiently decode QLDPC codes with low latency and low complexity.

Our FPGA demo emulates a quantum (depolarizing) channel in real time, with channel quality (physical error rate) controllable by the attendees. The attendees will also be able to pick the QEC code and the parameters (e.g., number of iterations) of the BP decoder on the FPGA in runtime. Resulting logical error rates and the decoding latency (in nano or microsecond scale) will be visualized (vs. physical error rate) in Jupyter Notebook environment connected to the FPGA board.

Our implementation (A) utilizes less than 25% of the resources on a commercially-available FPGA; (B) achieves a logical error rate of about 1e-12 at a physical error rate of 1e−4, with an average decoding latency below 50 nanoseconds; and (C) decodes 15.8 Mcodewords/s. (A) allows it to potentially coexist with the DSP and control algorithms required in quantum computers. (B) and (C) demonstrate to the broader quantum signal processing community the feasibility of integrating high-reliability low-latency QEC in future quantum systems.

508: Vehicular Hazard Detection with Multi-Object Multi-Camera Tracking in Open RAN Networks

Authors: Anton Aguilar (Technological Telecommunications Centre of Catalonia (CTTC/CERCA)),
Jordi Serra (Technological Telecommunications Centre of Catalonia (CTTC/CERCA)),
Raúl Parada (Technological Telecommunications Centre of Catalonia (CTTC/CERCA)),
Ebrahim Abu-Helalah (Technological Telecommunications Centre of Catalonia (CTTC/CERCA)),
Paolo Dini (Technological Telecommunications Centre of Catalonia (CTTC/CERCA))

Description: Hazard detection is an important problem for Intelligent Transportation Systems (ITS), although, developing and deployment of such systems is a complicated task, given human presence in the loop and a large account of non-controlled variables (e.g. traffic levels, weather conditions, etc). In addition, a distributed hazard detection system will require a communication network for a large number of sensors, making this network a crucial element of the system and a potential bottleneck for its performance. Therefore, a suitable platform is needed for the development and validation of vehicular applications that handle both the vehicular and network aspects of the system to seamlessly migrate an application to an urban environment. This demo introduces a distributed hazard detection service integrated to open-RAN that guarantees both reliability and low-latency required for ITS applications. The demo, leveraged by CARLA simulator, shows connected vehicles driving around a virtual environment, while detecting objects using machine learning models and estimating their partial trajectories (a.k.a, tracklets). The information is converted to metadata, which is transmitted to an edge server via the open-RAN network through universal software radio peripherals (USRPs). The server integrates the partial trajectories using Kalman-based methods and computes risk-metrics in real-time to provide timely warnings to connected vehicles. The demo shows an implementation of the open-RAN network using SRS software, introduces a new tracking association algorithm based on Kalman filters and Mahalonabis distance, and implements different risk metrics. such Streetscope collision hazard measure (SHM) or time-to-collision (TtC). Metadata communications is implemented to reduce network traffic and ensure system scalability. The connected vehicle could be driven by the attendees or set to autopilot.

525: Sensing-Triggered Adaptive ISAC: High-Speed Data and Instant Emergency Alerts

Authors: Homa Nikbakht (New York University), Hang Ruan (Tampere University), Shlomi Savariego (Weizmann Institute of Science), Moshe Namer (Weizmann Institute of Science), Adi Wegerhoff (Weizmann Institute of Science), Yonina Eldar (Weizmann Institute of Science, Northeastern University)

Description: This demonstration presents an integrated sensing and communications (ISAC) system that can detect and broadcast time-critical emergencies occuring at random times. We emulate a V2X scenario where a roadside base station continuously serves connected vehicles with a high-rate data stream. Beyond standard connectivity, the same infrastructure must recognize priority events such as an approaching emergency vehicle and immediately broadcast an alert to nearby road users, without waiting for any pre-scheduled control interval.
The hardware testbed uses distributed Adalm Pluto+ SDRs to realize the full closed loop in real time. In baseline operation, the base station transmits background data streams to users. In parallel, a dedicated sensing node monitors a defined “sensing zone”. When a target enters this zone (like an ambulance), the sensing node triggers an immediate safety message, and the base station promptly adapts its transmission to broadcast the alert while maintaining the ongoing data service.
To guarantee that the emergency alert is received reliably without disrupting high-speed data, the system employs Dirty Paper Coding (DPC). Unlike time-sharing approaches that stop data to send alerts, or power-sharing approaches that boost the alert while treating interference as noise, DPC pre-cancels the predictable interference created by the simultaneous sensing and data signals. This allows the system to sustain higher data rates while ensuring the safety alert is decoded correctly.
The main novelty is a hardware demonstration of sensing-triggered dynamic coexistence: the network “reflexively” changes its signaling immediately when an event is sensed. This validates the “sensing-for-communication” paradigm for V2X and highlights how signal processing enforces reliability for emergency services even under heavy traffic. The demo is interactive: attendees generate random arrivals by placing a physical object into the sensing zone at any time. A real-time dashboard visualizes the reflex through (i) a radar view confirming detection, (ii) a data view showing the signal constellation adapting to accommodate the alert, and (iii) live performance metrics demonstrating error-free decoding of the emergency message.
Reference: H. Nikbakht, Y. C. Eldar and H. V. Poor, "An Integrated Sensing and Communication System for Time-Sensitive Targets with Random Arrivals", IEEE Journal on Selected Areas in Communications, 2026.

531: Live Demonstration of Doppler Radiance Fields (DoRF) for Robust Human Activity Recognition Using Wi-Fi

Authors: 1- Navid Hasanzadeh: Department of Electrical & Computer Engineering, University of Toronto, Toronto, ON, Canada
2- Shahrokh Valaee: Department of Electrical & Computer Engineering, University of Toronto, Toronto, ON, Canada

Description: This proposal outlines a live, interactive demonstration of a novel Wi-Fi–based Human Activity Recognition (HAR) framework built on Doppler Radiance Fields (DoRF)- our recently proposed method accepted at ICASSP 2026 (Paper ID: 9898).

The core idea is that Wi-Fi multipath reflections act as a collection of virtual cameras that observe the same human motion from different unknown angles. Each path provides a one-dimensional Doppler projection of motion. DoRF fuses these complementary views into a structured latent motion representation. To our knowledge, this is the first Neural Radiance Fields (NeRF)-like framework introduced for Wi-Fi sensing.

The demo system consists of one or two commodity Wi-Fi routers configured as passive sniffers, a Raspberry Pi that transmits packets, and a laptop that performs real-time processing and visualization while a participant performs a gesture or activity. The goal is to recognize the performed activity from Wi-Fi Channel State Information (CSI) using our proposed method, DoRF. A custom graphical interface displays the following live components:

1-Real-time CSI monitoring, showing how wireless channels respond to human motion.
2-Online Doppler extraction, dynamically illustrating how multiple virtual viewpoints are fused into a coherent motion field.
3-Live activity and gesture recognition, where DoRF representations are fed into a trained model and the predicted activity is displayed instantly.

-Main Novelty and Innovations:
1-First demonstration of a NeRF-inspired representation for Wi-Fi sensing.
2-Real-time visualization of virtual Doppler cameras and radiance-field-style fusion.
3-Demonstrates how DoRF improves generalization across users and environments, addressing a core limitation of existing Wi-Fi HAR methods.

-Impact to the Signal Processing Community:
With IEEE 802.11bf bringing WLAN sensing into the standard, there is increasing demand for practical and robust wireless sensing solutions. This demo presents our proposed method, which directly addresses the critical challenge of generalization and enables reliable real-world Wi-Fi sensing.

-Interactivity for Attendees:
The demonstration is fully interactive. Attendees actively participate by performing gestures and observing real-time changes in CSI, DoRF visualizations, and recognition outputs. This hands-on experience goes well beyond static plots, making the system intuitive, engaging, and educational for the ICASSP audience.

More information about our work: https://dorf.navidhasanzadeh.com/
Data collected for this study: https://ieee-dataport.org/documents/uthamo-multi-modal-wi-fi-csi-based-hand-motion-dataset-0

562: ELLAS: Enhancing LiDAR Perception with Location-Aware Scanning Profile Adaptation

Authors: Roger Kalkman -TU Delft, Thymon Rhemrev-TU Delft, Emma de Jong-TU Delft, Gideon van Triest-TU Delft, Jordy Pronk-TU Delft, Ashish Pandharipande -NXP Eindhoven, Nitin Jonathan Myers-TU Delft

Description: Light Detection and Ranging (LiDAR) is widely used in robotics and automotive systems to perceive the surrounding environment. Conventional spinning LiDARs operate at a constant rotational speed and employ fixed laser scanning parameters, resulting in uniform angular resolution and range across the entire field of view. Such a uniform scanning profile, however, is suboptimal when prior information about static obstacles in the environment is available from street topology maps.

In this demo, we present ELLAS, our situation-aware LiDAR system that dynamically adapts its scanning profile to location-specific street maps. ELLAS jointly optimizes the laser ranging parameters and the instantaneous rotational speed of the spinning platform over different sectors to maximize the scanning envelope around the vehicle. By adapting these parameters to the ego vehicle's location, ELLAS achieves a longer range and a higher angular resolution in critical regions. The live demo allows attendees to see themselves represented as points in the LiDAR point cloud. Participants can directly observe how adaptive sensing produces a higher-density point cloud compared to a standard LiDAR configuration operating at the same frame rate. Finally, the attendees can see how the spin rate profile in ELLAS changes with the location of the ego vehicle and the static obstacles around it.

A video demonstration of ELLAS is available here: https://www.youtube.com/watch?v=DYse8EQgHYI

Note: ELLAS was presented as a live demonstration at the IEEE Sensors 2024 conference, but we have not presented this at any IEEE Signal Processing Society conference yet. We believe that a live demonstration at ICASSP would be highly valuable for introducing location-aware sensing as an emerging research direction. Although this demo will also be included as part of my ETON talk at ICASSP 2026, I believe it would be beneficial to also present it as a show-and-tell demo, which facilitates one-on-one interactions for those interested.

567: CONVERGE Multimodal Wireless ISAC Demo: Video-Aided Beamforming

Authors: Jichao Chen, EURECOM
Filipe B. Teixeira, INESC TEC
Francisco M. Ribeiro, INESC TEC
Luis M. Pessoa, INESC TEC
Raymond Knopp, EURECOM
Dirk Slock, EURECOM

Description: This demonstration presents a cutting-edge Integrated Sensing and Communications (ISAC) system designed to overcome the fragility of mmWave/6G links. While high-frequency bands offer immense bandwidth, they suffer from severe sensitivity to blockage. We demonstrate a "Vision-Aided Beamforming" solution that integrates a LiteOn FR2 Radio Unit (running the OpenAirInterface software stack) with a synchronized Nerian RGB-D camera.
Unlike traditional reactive systems, our setup utilizes an advanced machine learning model to fuse visual data with RF measurements. The demo showcases the system's ability to anticipate blockage events and predict the best beams when the link is severed. We leverage the CONVERGE experimental infrastructure—a unique platform integrating a mobile gNB, a User Equipment (UE), and programmable obstacles—to validate these multimodal algorithms in real-time.
Main Novelty and Innovations:
The primary innovation is the practical, real-world implementation of multimodal Deep Learning for 6G, moving beyond the pure simulations common in this field. We demonstrate:
1. True-Multimodality: Tight synchronization between visual frames (RGB-D) and RF signatures (CSI/SRS).
2. OAI-Integration: A fully functional 5G NR software stack (OpenAirInterface) augmented with vision-based control.
3. Proactive-Intelligence: Machine learning models that predict optimal beam indices and blockage status using visual context and RF measurements.
Impact to Signal Processing Communities:
This demo directly addresses the "Where Signals Meet Intelligence" theme. It bridges the gap between Computer Vision and Wireless Signal Processing, offering a tangible platform for validating AI/ML algorithms in 6G ISAC scenarios. It provides a benchmark for cross-modal learning, crucial for future reliable low-latency communications.
Interactivity for Attendees:
The demonstration goes beyond static graphs. Attendees will interact with a "Digital Twin" dashboard of the CONVERGE chamber. They will be able to:
1. Live Multimodal Dashboard: Visualise synchronized RGB-D video streams and RF measurements in real-time, overlaid with predicted blockage events and signal status.
2. Monitor Beamforming: Watch the system dynamically select and switch between the hardware’s available beams in real-time response to visual cues.
3. Test the Model: Attendees can manually toggle specific blockage scenarios to see how the beam prediction algorithm adapts instantly.

502: SensingSP™: An Open-Source Digital Twin for 4D Imaging Radar Design, Simulation, and AI

Authors: Moein Ahmadi — Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg
Bhavani Shankar M. R. — Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg
Björn Ottersten — Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg
Thomas Stifter — IEE S.A., Luxembourg

Description: SensingSP™ is an open-source digital twin framework that enables end-to-end modelling, simulation, and processing of 4D imaging radar systems. The demo showcases how high-level sensing requirements such as maximum range, bandwidth, angular resolution, and update rate are automatically translated into complete FMCW radar parameters including waveform design, PRF, antenna configuration, and MIMO virtual aperture sizing. The system operates directly inside Blender, allowing radar designers and researchers to construct rich 3D environments containing vehicles, pedestrians, and infrastructure.

A key innovation is the integration of electromagnetic ray tracing, signal generation, MIMO demodulation, CFAR detection, and 4D point-cloud formation into a single, interactive pipeline. For every frame, SensingSP computes multipath-aware propagation paths, synthesizes raw ADC cubes, performs range–Doppler–angle processing, and visualizes the resulting detections. CUDA acceleration enables rapid experimentation with radar modes, resolutions, and scene geometries. The framework also includes machine-learning modules for gesture recognition, health monitoring, WiFi sensing, and generative waveform modelling, making it a unified platform for sensing and intelligence, aligned with the ICASSP 2026 theme, “Where Signals Meet Intelligence.”

The demo’s novelty lies in providing a fully open-source, physically grounded digital twin that bridges electromagnetic modelling, advanced signal processing, and AI-based sensing within a single environment.

Attendees will engage directly with the system by manipulating 3D scenes, repositioning sensors, modifying radar parameters, and immediately observing how these changes affect radar signatures, Doppler spectra, angular estimates, and 4D point clouds. This hands-on exploration illustrates key radar trade-offs, such as the impact of bandwidth on resolution, virtual aperture on angular accuracy, and motion dynamics on Doppler structure.

The demonstration will appeal broadly to the signal-processing community, including researchers in array processing, radar, machine learning for sensing, wireless environments, and digital twin technologies. It provides a practical, accessible tool for education, research, and rapid prototyping of next-generation sensing systems.

516: Unlimited Sensing Radar: Enhancing Resolution via Modulo ADCs

Authors: Ruiming Guo, Dept. of Electrical and Electronic Engg., Imperial College London

Vaclav Pavlicek, Dept. of Electrical and Electronic Engg., Imperial College London

Ayush Bhandari, Dept. of Electrical and Electronic Engg., Imperial College London

Description: Conventional radar receivers face a fundamental trade-off between dynamic range and digital resolution: strong reflections saturate the ADC while weak targets are buried in quantization noise. This limits the simultaneous detection of near–far targets and constrains achievable resolution.

We demonstrate an end-to-end radar prototype based on the Unlimited Sensing Framework (USF) that breaks this limitation using analog-domain modulo folding prior to digitization. Instead of saturating, large-amplitude returns are folded and later reconstructed algorithmically, enabling high-dynamic-range acquisition from low-resolution hardware.

The system integrates a custom modulo-ADC front-end, real-time reconstruction and estimation algorithms, and an interactive GUI into a complete acquisition–processing–detection pipeline. Attendees can directly interact with the radar: moving in front of the sensor generates folded measurements in real time, which are visualized alongside the recovered signals and detected targets. The demo allows side-by-side comparison with conventional sampling to highlight clipping and missed detections.

We showcase two configurations:

• Doppler radar: 12.35× dynamic-range expansion with Hz-scale frequency resolution.
• FMCW radar: reliable detection using extremely low-resolution sampling while maintaining 0.1 Hz precision.

These results illustrate how hardware–algorithm co-design with modulo sensing reduces sampling rate and bit depth without sacrificing accuracy.
This demonstration provides a tangible, working example of high-dynamic-range radar sensing and will be of interest to researchers in radar, sampling theory, and low-power sensing systems.

This demo validates the hardware implementation for the upcoming ICASSP conference presentation (ID: 14095), “Enhancing Doppler and FMCW Radars via Unlimited Sensing.”

517: Hand Gesture Recognition with USF-Radar

Authors: Václav Pavlíček, Dept. of Electrical and Electronic Engg., Imperial College London
Ayush Bhandari, Dept. of Electrical and Electronic Engg., Imperial College London

Description: Radar-based recognition of hand gestures, human activities, and motions has experienced a significant research interest due to the ability of radar to operate under diverse lighting conditions while preserving privacy. The radar signal is digitized using an Analog-to-Digital Converter (ADC) with constrained Dynamic Range (DR) and Digital Resolution (DRes). These bottlenecks were solved by the Unlimited Sensing Framework (USF), which inserts a zero-centered modulo operation in the analog domain.

We demonstrate USF-enabled radar for hand gesture recognition, which relies on processing directly on modulo-folded measurements without signal reconstruction.

The demonstration features custom-built modulo ADCs (each for I and Q channels) integrated into a Doppler radar acquisition pipeline, together with a real-time GUI processing pipeline. Attendees can interact with the real system:
- trigger the classifier by performing a hand gesture in front of the radar,
- see the modulo-folded version of their gesture waveform,
- see a wavelet scalogram and the classified hand gesture in real-time.

In https://doi.org/10.1109/RadarConf2559087.2025.11204889, we show 15% classification accuracy improvement achieved by USF-Radar with low-resolution ADCs when compared with a conventional radar.

This demo provides a tangible, hardware-validated example of a radar-based human-computer interface and will be of interest to researchers in sampling theory, classification, low-power acquisition, and radar systems.

524: Contact-Free Blood Pressure Waveform Estimation From Radar Signals Via Multimodal Dictionary Learning

Authors: Mengchu Xu, Hang Ruan, Rui Guo, Daniel Kogan, Luda Nisnevich, Adi Wegerhoff, and Yonina C. Eldar, Weizmann Institute of Science

Description: Continuous, non-invasive monitoring of arterial blood pressure (BP) is critical for early cardiovascular risk detection. While Frequency-Modulated Continuous-Wave (FMCW) radar offers a promising contact-free alternative to uncomfortable cuff-based devices, accurately estimating the full BP waveform remains a significant challenge. The mapping from radar-sensed chest displacement to arterial pressure is highly non-linear and subject-specific, often obscured by respiration and noise. Conventional approaches typically resort to simple regression for scalar values (SBP/DBP) or "black-box" deep learning, lacking the physiological interpretability required for clinical trust.
This demonstration is based on our work that reconstructs high-fidelity BP waveforms from radar signals within a Multimodal Convolutional Sparse Coding (CSC) framework. Our approach leverages a learned dictionary trained on synchronized Radar, PPG, and BP data to capture shared physiological "cardiac codes". By integrating a non-linear feature extraction backend, we address the complex mapping between chest micro-vibrations and BP amplitude, enabling the precise recovery of systolic peaks and dicrotic notches. This architecture explicitly separates cardiac activity from respiratory artifacts, delivering accurate, full-waveform reconstruction with "glass-box" transparency suitable for clinical analysis.
Our demonstration platform utilizes a visualization system running on practical radar recordings acquired with the TI IWR1443 FMCW radar, operating at 77–81 GHz. The radar is positioned approximately 50 cm above the subject’s chest while the subject lies supine. Ground-truth signals are obtained from a CNAP continuous blood pressure monitor and a finger-clip PPG sensor, synchronized with the radar data for multimodal training and validation. A custom GUI replays recorded data to visualize the signal processing pipeline across three synchronized windows: (1) The Raw Radar Waveform showing original chest displacement; (2) The Filtered Intermediate Signal, displaying the extracted sparse cardiac features and physiological localization maps; and (3) The synthesized BP Waveform overlaid with the ground truth. Below these waveforms, the interface displays physiological metrics, including SBP, DBP, Heart Rate, and Respiration Rate. This setup allows attendees to observe the system's robustness in separating vital signs from noise and its accuracy in tracking hemodynamic changes using pre-acquired experimental recordings.

573: EVENT-DRIVEN NEUROMORPHIC SAMPLING AND RANGE ESTIMATION ON RADAR

Authors: Ayush Jha, Abijith Jagannath Kamath, Chandra Sekhar Seelamantula, Chetan Singh Thakur
Indian Institute of Science, Bengaluru

Description: This demonstration presents an innovative event-driven signal acquisition strategy for frequency-modulated continuous-wave (FMCW) radar, replacing traditional power-hungry analog-to-digital converters (ADCs) with neuromorphic encoders. While standard radar systems rely on uniform Nyquist sampling generating a constant, heavy data stream regardless of whether a target is present our setup performs precise range estimation using opportunistic sampling. We utilize a mmWave radar integrated with a neuromorphic encoder that operates asynchronously, triggering a measurement only when a significant change in signal amplitude occurs. This effectively compresses the signal at the source and ensures the system remains quiet in the absence of targets.
The main novelty lies in the hardware-software synergy of applying neuromorphic principles to RF sensing. By exploiting the sum-of-complex-exponentials (SWCE) structure inherent in dechirped radar signals, we prove that range information can be accurately recovered from sparse, non-uniform event triggers. We pose the signal recovery as a sparse reconstruction problem in the Fourier domain, achieving high-resolution range profiles with significantly reduced data overhead compared to traditional methods.
The impact on the signal processing community is substantial, as it bridges the gap between asynchronous hardware and classical estimation theory. This work provides a practical blueprint for ultra-low-power, "always-on" sensing in resource-constrained environments, such as IoT devices and autonomous micro-robotics, where power efficiency and bandwidth are critical bottlenecks.
Interactivity is central to our showcase. Attendees will engage with a live hardware demo where they can move objects or their hands in front of the radar. An interactive dashboard will visualize the real-time "event stream" alongside the reconstructed range profile. Participants will see firsthand how the systems sampling rate dynamically adapts to movement: generating dense spikes for moving targets and dropping to near-zero activity when the scene is static. This hands-on experience highlights the efficiency of event-based sensing without sacrificing the precision required for high-fidelity range estimation.

528: GridSense: Ask Your Power Grid

Authors: Kim, Changhun (Pattern Recognition Lab, FAU); Karim, Redwanul (Pattern Recognition Lab, FAU); Conrad, Timon (Institute of Electrical Energy Systems, FAU); Riebesel, David (Institute of Electrical Energy Systems, FAU); Mayerhofer, Lukas (LEW Verteilnetz); Mengele, Fabian (LEW Verteilnetz); Oelhaf, Julian (Pattern Recognition Lab, FAU); Gourmelon, Nora (Pattern Recognition Lab, FAU); Arias Vergara, Tomás (Pattern Recognition Lab, FAU); Jaworski, Michael (LEW Verteilnetz); Maier, Andreas (Pattern Recognition Lab, FAU); Jäger, Johann (Institute of Electrical Energy Systems, FAU); Bayer, Siming (Pattern Recognition Lab, FAU)

Description: GridSense is an open-source Python package and web demo for natural-language exploration of operational power-grid models by grounding large language models (LLMs) in a Neo4j knowledge graph. Grid data is often exchanged as CGMES (Common Grid Model Exchange Standard) / CIM (Common Information Model) RDF/XML: interoperable, but difficult to query directly because it is document-centric, often requiring full-file parsing and offering limited indexing and topology traversal. GridSense converts CGMES/CIM into a queryable Neo4j graph and applies a GraphRAG-style workflow: it detects grid-related intent, generates schema-aware Cypher, executes it on the database, and summarizes strictly from retrieved results to reduce hallucinations and improve auditability.

The representation follows a two-layer architecture. A static CIM knowledge graph stores hierarchy, topology, and equipment. A dynamic snapshot overlay links time-indexed operating states such as load (P/Q), terminal power flows, bus voltage magnitude/angle, breaker status, and tap positions to the same assets, enabling efficient time-series queries without duplicating static equipment across timestamps.

In the ICASSP 2026 Show-and-Tell, we demonstrate interactive inspection of grid assets and conversational querying, including: (1) transformer utilization trends over snapshots, (2) extracting line parameters/connectivity with Cypher to build the network admittance (Y-bus) matrix, and (3) restoration decision support by computing impedance-weighted shortest paths under current switch states to propose candidate energization routes, leveraging Neo4j traversal and graph algorithms. We also highlight optional fusion of exogenous signals (e.g., weather) to enrich operational context.

Impact to signal processing communities: GridSense bridges graph-structured, time-varying grid signals to verifiable retrieval and computation, supporting reproducible topology-aware analytics (matrix construction, temporal monitoring, and graph-algorithm decision primitives).

Interactivity for attendees: participants can ask their own questions, see how each question is translated into Cypher, execute it live against the Neo4j graph, and click grid elements to inspect connected topology and time-indexed state; the demo returns retrieved subgraphs and plots derived from query results.

The grid dataset is anonymized because it represents critical infrastructure. Finally, we provide “bring-your-own-grid” documentation to import CGMES/CIM XML, validate mappings, build snapshot overlays, and run the same LLM-to-Cypher workflow on a user’s dataset.

Demo video:
https://www.youtube.com/playlist?list=PL2EIbrGMjR_m2khSEMiUtbKo0dY-LOn3-

565: UtilityTwin: A Knowledge Graph based Digital Twin for Municipal Utilities

Authors: Adithya Ramachandran, Thorkil Flensmark B. Neergaard, Andreas Maier, and Siming Bayer

Description: Municipal utilities work with massive volumes of heterogeneous data, ranging from high-frequency sensor signals (smart meters, SCADA) and static geospatial topologies to unstructured historical records and data from external sources. However, effective analysis is currently hindered because these modalities remain siloed, creating technical barriers that obscure the semantic context required for rapid decision-making.

We present UtilityTwin, a digital twin framework that fuses these disparate streams into a unified Knowledge Graph. By modeling the complex relationships between physical assets, sensor time-series, and legacy documentation (structured, unstructured and aerial images), we establish a semantic relationship for the different components of the network.

In the interactive demonstration, attendees will access a web-based interface connected to a live data server to navigate a real-world water and heating network. Users can explore the infrastructure behind residential and industrial consumption through visual navigation or a natural language interface. Participants will be invited to pose complex queries, such as "Summarize recent anomalies in the city alongside repair history", to receive real-time, visually corroborated answers. This demonstrates how concepts such as Retrieval-Augmented Generation, LLMs, Agentic framework, and Knowledge Graphs can enable advanced downstream tasks, from demand forecasting to leak detection, by democratizing access to complex signal data.

Specifically, for the research community, this addresses the critical challenge of the "context gap." While researchers possess deep technical expertise, they often lack the domain-specific nuances usually expected from partnering utilities. By providing semantic context via a modeled Knowledge Graph, our solution allows researchers to interpret abstract signal anomalies by instantly correlating them with physical asset history and geospatial reality, rather than analyzing data in isolation.

537: TwinShip: A Decision Support Platform for Maritime Operations

Authors: Loukas Ilias, Afroditi Blika, Ariadni Michalitsi-Psarrou, Theodoros Florakis, Anastasia Askouni, Georgios Klavdianos, Giannis Xidias, Dimitris Askounis, Spiros Mouzakitis

Decision Support Systems Laboratory, School of ECE, National Technical University of Athens

Description: 1. The TwinShip demo presents an AI-powered maritime analytics platform developed within the Horizon Europe TwinShip project, building upon the deployed VesselAI platform from the Horizon 2020 programme. The demo focuses on scalable data integration, signal processing, and operational performance analysis for ships. Heterogeneous maritime data sources, including AIS trajectories, onboard engine and propulsion measurements, and environmental signals, are ingested, harmonized, and stored in a secure datalake. The platform supports energy-efficient navigation, emissions-aware operation, and data-driven decision making through advanced analytics services. As part of its analytical capabilities, the platform includes the Engine-Propeller Combinator Diagram (EPCD) as a supporting representation linking indicative operational and environmental inputs to propulsion and energy performance outputs. The demo does not present digital twins, but highlights analytical foundations supporting TwinShip’s long-term digital twin vision.

2. The main novelty of the demo lies in its modular analytics architecture. TwinShip provides an environment for constructing AI and data analytics workflows, allowing researchers to develop and test algorithms using JupyterLab with direct access to the secure datalake. Dagster is used to orchestrate data processing and model pipelines, enabling transparent experimentation and repeatable results. The platform also incorporates EPCD as a structured analytical representation supporting propulsion and energy performance analysis. In addition, the integration of AI-agent frameworks through LangGraph enhances assisted data exploration and analysis across large datasets.

3. The demo demonstrates how modern signal processing methods, such as time-series analysis, sensor fusion, spatiotemporal modeling, and data-driven estimation, can be operationalized within a real-world maritime analytics platform. By providing structured access to heterogeneous datasets and end-to-end analytics workflows, TwinShip provides a concrete platform for applied signal processing research.

4. Interactivity is a key element of the demo. Attendees interact with live dashboards and user interface components to explore vessel trajectories, fuel consumption patterns, performance indicators, and indicative operating points within the Engine-Propeller Combinator Diagram. Guided demonstrations show how analytics workflows are constructed in JupyterLab, executed and monitored through Dagster, and explored through AI-assisted analysis using LangGraph. This experience provides transparency across the analytics lifecycle.

556: esp-data: A Unified Python Library for Large-Scale Cross-Taxonomic Bioacoustics Research

Authors: Gagan Narula, Milad Alizadeh, Ellen Gilsenan-McMahon, Paul Laisne, Marius Miron, David Robinson, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Maddie Cusimano, Ines Nolasco, Logan James, Anthony Fine, Eklavya Sarkar, Emmanuel Fernandez, Jules Cauzinille, Gregory Yauney, Diane Kim, Laura Hay, Jane Lawton, Brittany Solano, Matthieu Geist, Emmanuel Chemla, Aza Raskin, Olivier Pietquin

Affiliation for all authors: Earth Species Project

Description: Recent bioacoustic foundation models show performance gains with larger, diverse datasets, but the field struggles with fragmented data and incompatible formats. We present esp-data, an open-source Python library developed by Earth Species Project (ESP) that provides a unified, cloud-native interface for loading, transforming, and combining over 35 curated bioacoustics datasets spanning birds (BirdSet, Xeno-Canto and iNaturalist), marine mammals (whale and dolphin phonations with ecotype annotations), primates (gibbon, macaque, and marmoset vocalizations), amphibians (AnuraSet), and insects (InsectSet459) All datasets with permissive licenses are hosted publicly on ESP infrastructure, providing a single source of truth. New datasets will be added periodically, and researchers can easily integrate and optionally store their open-source datasets within esp-data infrastructure with access to the full collection under a consistent and actively maintained interface.

The library introduces the following innovations:

(1) Registry-based dataset abstraction which enables YAML driven dataset configuration for reproducible ML pipelines. Users can install the library, plugin their research datasets by “registering” and seamlessly work with the library’s tools.
(2) Cloud storage and local filesystem access via a unified path abstraction
(3) A composable transform system for performing common operations on datasets such as filtering, balanced sampling over features, adding taxonomic information, label encoding; transforms also allow for easy extension
(4) Flexible dataset concatenation strategies and simple abstractions like “chaining” heterogeneous sources while preserving annotation fidelity
(5) A backend API that integrates with popular libraries such as Pandas, Polars, and the Pytorch Dataloader
(6) Iterable-only “streaming” mode that allows users without access to high memory compute infrastructure to iterate over their desired datasets
(7) Comprehensive documentation and tutorial notebooks.

Our demonstration showcases workflows for training bioacoustic classifiers, real-time dataset discovery, transform pipelines for class balancing, and cross-taxa model evaluation—all through a consistent API that abstracts storage backends and preprocessing complexity.
esp-data addresses critical infrastructure gaps in and lowers barriers to multi-species acoustic analysis. The library provides ways to accelerate AI applications in conservation monitoring, ethological research, and the emerging field of interspecies communication.

542: Signal-Driven Autonomous Satellite Tasking via Tip-and-Cue: An Interactive Demo

Authors: Gil Weissman, Amir Ivry, Israel Cohen
Faculty of Electrical and Computer Engineering
Technion – Israel Institute of Technology, Haifa, Israel

Description: This Show & Tell demonstration presents an interactive implementation of AI-driven Tip-and-Cue framework for autonomous satellite sensing, focusing on how heterogeneous spatiotemporal signals can drive task formulation and scheduling decisions in a closed loop. The demo is designed to make the behavior of such systems observable and understandable through direct interaction.
The demonstration consists of an interactive map-based visual interface shows a geospatial scene with satellite ground tracks and time-varying signal indicators such as trajectory deviations and natural disasters. Such predefined scenarios are initialized in which signals evolve over time and are visualized as overlays on the scene. The signals are converted into tips, probabilistic hypotheses about events of interest. The tips are later converted into cues, candidate satellite imaging tasks. Unlike static task planning approaches, the system continuously updates cue priorities and schedules as signals evolve.
Attendees can interact with the system by adjusting a set of high-level parameters using sliders and toggles. These controls modify properties of the signals (e.g., strength, uncertainty, or temporal persistence) as well as system constraints (e.g., sensing priority or satellite availability). Following each interaction, the system updates the generated tips and cues and recomputes feasible acquisition windows and scheduling decisions. The resulting changes are immediately reflected in the visualization, allowing attendees to observe how different signal interpretations lead to different sensing outcomes.
The demo provides an intuitive view of how future satellite systems can move from passive signal analysis toward intelligent, self-tasking sensing architectures. The demo is relevant to the ICASSP community as it illustrates how signal processing outputs can be integrated into AI-driven decision systems operating under constraints. By enabling hands-on exploration, the demonstration supports intuition-building around adaptive sensing, interpretability, and signal-driven autonomy.

522: Seeing Smoke: A Large-Scale Open-Source Multimodal Dataset for Real-Time Wildfire Detection Models.

Authors: Emadeldeen Hamdan (University of Illinois Chicago), Yingyi Luo (University of Illinois Chicago), Behcet Ugur Toreyin (Technical University of Istanbul), Ugur Gudukbay (Bilkent University), Ahmet Enis Cetin (University of Illinois Chicago).

Description: Early and automatic wildfire detection is critical for minimizing environmental damage, infrastructure loss, and threats to human life. However, real-time detection and monitoring remain challenging due to varying conditions, including smoke, atmospheric distortion, motion, and illumination variability. To this end, we present the Global Wildfire Prevention Dataset (GWFP): a large-scale, open-source multimodal dataset to support robust, efficient, real-time AI-based detection models.
The GWFP dataset is compiled from public sources, including The High Performance Wireless Research and Education Network (HPWREN) and General Directorate of Forestry-Turkey camera networks, as well as drone-based recordings. It consists of approximately 40GB of video and 80GB of image data. The video component is divided into five categories: Flame/Smoke, Negative Samples, Waterdogs, Ember, and Unlabeled sequences. Flame/Smoke videos include flame-only, smoke-only, and flame-to-smoke transitions, captured from stationary cameras and drones. The Ember class contains recordings of airborne embers from active fires, while waterdogs represent natural motion patterns that cause false alarms.

The image component includes eight classes: Flames, Smoke, Negative Samples, Waterdogs, Near-Infrared (NIR) Fire, NIR No Fire, and Unlabeled images. NIR imagery enables cross-spectral analysis and supports multimodal fusion under challenging visibility. A dedicated classification subset with standardized training, validation, and test splits facilitates benchmarking.

Firefighters often report excessive false alarms from video-based smoke detection systems, reducing trust in automated alerts and wasting critical response resources. These errors are often caused by visually similar phenomena, such as clouds, fog, or changes in lighting. This dataset and real-time wildfire smoke detection demonstration aim to advance robust video-based smoke detection and efficient edge AI systems, designed with low-cost FPGA-based deployment in mind, to enable more reliable early wildfire detection and prevention.

This work was supported in part by the National Science Foundation (NSF) under Award No. 2531376

530: Pre-Characterization of Electromagnetic Side-Channel Leakage Using Publicly Available Information: A Case Study on E-Voting Screens

Authors: Leonardo Teodoro, Federal University of Technology - Paraná, Brazil;
Kemuel Vieira, Federal University of Technology - Paraná, Brazil;
Saulo Queiroz, Federal University of Technology - Paraná, Brazil.

Description: Wireless side-channel attacks (SCAs) on monitor displays—often referred to as TEMPEST attacks—constitute a class of threats in which an eavesdropper remotely infers sensitive screen information by processing electromagnetic emanations unintentionally emitted by the display. In this demo, we present public TEMPEST, a variant of the TEMPEST threat model in which publicly available system information is leveraged to identify structural signal characteristics ex ante, prior to the physical acquisition of electromagnetic leakage. Such pre-characterized properties can both facilitate subsequent side-channel exploitation and support jamming-based mitigation strategies. We illustrate the public TEMPEST concept through a case study based on the Brazilian electronic voting machine.

This research is motivated by a public call issued by the Brazilian electoral authority aimed at anticipating security issues in the electronic voting process and by a recent judicial decision that revoked a councilman’s mandate after identifying the use of micro-cameras to violate voting privacy. We examine how publicly available information about the Brazilian electoral system can expose electronic voting machines to TEMPEST-related SCAs.

We show that key design characteristics of the Brazilian e-voting interface—such as high-contrast images and minimal on-screen information adopted to improve usability for over 150 million electors—result in a highly distinctive spectral signature. Because these interfaces are publicly available, this signature can be analyzed offline and used to support the automatic tuning of electromagnetic parameters that vary across different e-voting machine models (e.g., critical harmonic frequencies), which is a relevant feature to automating mitigation strategies.

The demo consists of a computer running the official public simulator of the Brazilian electronic voting machine and a software-defined radio (SDR) setup that displays the identified spectral signature and leakage-derived voting information in real time. We believe the public TEMPEST concept presented in this demo can initiate discussion among academia, industry, and government on information forensics challenges and best practices for mitigating signal-processing threats in public systems.

539: Real-Time Continuous EEG Authentication: Streaming Neural Biometrics

Authors: Arnault H. Caillet (Yneuro)
Apolline Mellot (Yneuro)
Bruno Aristimunha (Yneuro)
Arnaud Delorme (Institute for Neural Computation, University of California San Diego (UCSD))
Thomas Semah (Yneuro)

Description: This demo presents a real-time EEG-based authentication system designed to continuously verify a user’s identity from neural activity. It offers a hands-on demonstration of the feasibility of EEG-based biometrics in everyday conditions, highlighting the distinctive advantages of neural signals over traditional static biometrics for passive and continuous identity verification.

The demo is fully interactive and participants will engage with the authentication system. Participants wear a consumer-grade EEG headset connected to a laptop running the pipeline. Once EEG is detected, the system initializes automatically, enters a short setup phase with live signal-quality feedback, builds a biometric profile directly from the streaming EEG signals, and performs continuous verification over a sliding window of neural signatures. During authentication, a desktop session remains accessible as long as incoming signatures remain consistent with the enrolled user; when signal quality degrades or signatures drift, access is paused or locked, making security decisions directly observable and signal-driven.

A real-time dashboard displays the internal behavior of the pipeline, including (i) live EEG traces and quality metrics, (ii) channel-activity heatmaps, (iii) low-dimensional projections of streaming signatures, and (iv) a “uniqueness” view illustrating separation from a reference cohort.

Novelty for ICASSP lies in the end-to-end demonstration of how streaming biomedical signal processing (online preprocessing + artifact/quality handling), machine learning for multivariate time–space-dependent time series (online embedding and matching), and security-oriented decision logic interact under real-world conditions. This research area at the intersection of neuroscience and cybersecurity remains relatively underexplored and has not seen real-world translation into deployed systems (>20 publications/year since 2010).

Previously presented at NeurIPS 2025 and designed to be accessible to a broad audience, the demo aims to stimulate ICASSP-aligned discussions around neural biometric evaluation, real-time signal-processing constraints (e.g., domain adaptation, frugal real-time artifact rejection), and how EEG authentication fundamentally differs from voice recognition, another signal-based biometric.

For SP impact, the demo illustrates a general blueprint for real-time, closed-loop signal processing, where signal-quality estimation, representation learning, and online decision rules are tightly coupled. It offers a testbed to discuss robustness to non-stationarity, domain shift, and low-SNR multichannel signals, which are common challenges across many “signals meet intelligence” applications.

544: Listening to Food: Interactive Multimodal Signal Intelligence for Texture determination of food

Authors: Michaël Verlinden, Vives University of Applied Sciences
Tom Van Gaever, Vives University of Applied Sciences
Catherine Middag, Vives University of Applied Sciences
Thomas Sprangers, Vives University of Applied Sciences
Kevin Vynckier, Vives University of Applied Sciences
Jonas Lannoo, Vives University of Applied Sciences
Mohammed Saif Ismail Hameed, KU Leuven
Bart De Ketelaere, KU Leuven

Description: Topics: Acoustic Signal Processing, Food Texture Analysis, Multimodal Sensing, Machine Learning, Signal-Based Perception

This Show & Tell Demo presents an interactive experimental platform that demonstrates how signal processing and machine learning can be used to objectively analyze food texture through sound and vibration. Developed within the TETRA KRAK project (Vives University of Applied Sciences and KU Leuven), the demo focuses on controlled food fracture and real-time acoustic intelligence, aligning closely with the ICASSP 2026 theme ‘Where Signals Meet Intelligence’.
Food products such as croissants, biscuits, breaded products and even Belgian chocolate are broken in a compact mechanical setup with 3D-printed probes under well-defined and reproducible conditions. During fracture, three synchronized sensors capture complementary signals: a microphone records airborne acoustic emissions, an accelerometer measures structure-borne vibrations, and a load cell registers force-displacement behavior. All signals are streamed live and processed in real time.

Advanced audio and vibration signal processing methods extract time-domain, spectral, and time-frequency features that characterize the fracture events. Machine-learning models learn relationships between these signals and sensory attributes such as crispness and crunchiness. Live visualizations present both raw sensor data and ML-based texture predictions.

The key novelty of this demo is twofold. First, it demonstrates audio analysis as a new methodological tool for food research, extending acoustic signal processing into a domain historically reliant on sensory panels. Second, it introduces machine learning as a new layer in sensory science, moving beyond feature analysis toward learned models that link physical signals to human perception. The integration of controlled mechanical breaking, multimodal sensing, and ML-based analysis represents a unique approach in today’s research landscape.

Attendees can see and hear food being broken live, listen via headphones to fracture sounds, compare different gradations of crispiness, observe live audio analysis outputs and taste the products being analyzed, offering a multisensory illustration of how signals meet intelligence.

559: Open-Source FPGA-Based Echo State Network Nodes for Low-Power Distributed Wildfire Risk Detection

Authors: Nima Ghaffarzadeh 1, Matteo Mendula 1, Raúl Parada 1, and Paolo Dini 1
1 Centre Tecnològic de Telecomunicacions de Catalunya (CTTC/CERCA), Castelldefels, Barcelona 08860, Spain

Description: Wildfires pose a serious and escalating threat to ecosystems, economies, and human safety, demanding faster and more reliable detection systems. Traditional monitoring approaches often depend on centralized data processing and external communication links, which can introduce delays and reduce effectiveness in remote forest regions. This demo presents a distributed, low-power sensing architecture for early wildfire risk detection using field-deployable edge devices built around Field Programmable Gate Arrays (FPGAs) running embedded Echo State Network (ESN) models. Each node integrates temperature and humidity sensors with a pre-trained ESN that performs real-time inference directly on the FPGA. To validate the architecture's feasibility for energy-constrained environments, we present comprehensive power efficiency benchmarking, demonstrating that the FPGA-accelerated ESN achieves high inference throughput while minimizing total power consumption. The novelty of the system lies in combining reservoir computing with reconfigurable hardware for scalable, autonomous environmental intelligence at the forest edge, using a fully open-source hardware and software stack to facilitate reproducibility and further research. The demonstration includes an interactive setup in which attendees can manipulate sensor conditions (for example, locally increasing temperature or modifying humidity) and immediately observe the resulting on-board ESN predictions, instantaneous power usage metrics, and device behavior through a live interface. This hands-on interaction highlights the practical potential of low-power, hardware-accelerated machine learning for real-time environmental monitoring and wildfire risk assessment.

2026 IEEE International Conference on Acoustics, Speech, and Signal Processing

Top Reading

Most Upvoted

Top Reading

Most Upvoted

Industry Program

Industry Keynotes

Program Schedule

Talk 1: Audio Processing with Large Language Models

Talk 2: Latest Trends in AI Signal Processing for Consumer Experiences

Talk 3: AI-Native 6G: Building the Wireless Stack for AI-RAN and 6G Innovation

Industry Expert Speakers

Program Schedule

Location: Auditorium

603: The Future of Mobile Communications: Challenges and Opportunities

631: Enhanced Integrated Sensing and Communications for 6G with AI and multi-modal fusion

641: “Modern radar systems, challenges and perspectives: a personal viewpoint”

628: Immersive Audio via Headphones: Status and New Solutions

629: Building Dolby Atmos FlexConnect: From Research Project to Product

632: From ANC to Blood Pressure: How Earbuds Are Becoming Multimodal Health Sensors

609: Why Blind Audio Processing Fails: Edge Intelligence for Content-Aware Audio Processing in Streaming Media

626: How to Build Realistic Acoustic Datasets For AI-audio Training Using Simulated Data From Validation to Large-Scale Datasets

640: From Text to Talk: How New Speech LLMs Will Make Conversations with Technology More Natural

612: Signal Processing Inspired AI for Sensing - an Industry Perspective

616: Enabling End-to-End Ecosystem of Spatial-Temporal Gaussian Splatting

639: Language Models on Microcontrollers: Achieving Cloud-Class AI in <32MB

624: Personalising GenAI: Fine-Tuning Models to Understand & Perform Specific Tasks

633: AIGuardrail: A Skill-Driven, Zero-Training Security Framework for Telecom LLMs in Resource-Constrained Environments

618: AI/ML for defense applications, its impact and limitations.

627: From Signals to Systems: Making AI Industrial-Grade Across the Engineering Lifecycle

638: Real-Time Human–AI Collaboration for Trustworthy Conversational Agentic Systems

Industry Panels

Program Schedule

Panel 1: Open Audio Codecs for the Next Generation of Immersive and Scalable Media

Panel 2: Industrializing AI-Native, Distributed, and Sustainable 6G with Open RAN and TN-NTN Integration

Panel 3: From Labs to Learners: Preparing the Next Generation Signal Processing Workforce through Industry-Academic Coalitions

Panel 4: Scaling Intelligence for the Smart Society: Human-Centric, Sovereign, and Efficient Digital Twins

Industry Workshops

Meta Industry Workshop: “Frontiers in Human-Machine Communication: Wearable Sensing, Speech Enhancement, and Conversational Interaction”

Spotlight Sessions

MERL: “Every Moment | Something New — Signal Processing at MERL”

Adobe: “Enabling Rich Creative Control in Generative Audio”

Show and Tell Demos

Program Schedule

Location: Exhibition Hall

501: Nkululeko 1.0: A Python package to predict speaker characteristics with a high-level interface

515: Interactive Spectrogram-Based Rhythm and Melody Annotation for Speech Analysis

536: An Interactive Demonstration of the Open ASR Leaderboard

553: Speech Enhancement Intelligence - Inspecting a Model Under Controlled Degradation

555: SCRIBAL: A Multilingual Transcription Platform for Academic Lectures and Impaired Speech Accessibility

574: Tahlil: An Interactive Toolkit for Standardized ASR Evaluation and Error Analysis

506: Flow Matching for Real-Time Joint Speech Enhancement and Bandwidth Extension

521: NPU-Accelerated Real-Time Voice Conversion for Customizable Digital Identities

529: Speaking rate control in the stream

532: Real-Time Demo of Single-Channel Target Speaker Extraction Using State-Space Modeling

533: Semantic-Aware Speech Anonymization via Neural Codec Editing

570: Electrolaryngeal Speech Enhancement Based on Any-to-Many Voice Conversion

519: Seamlessly Upgrading On-Device Speech Recognition System with More Recent Foundation Models

538: NVIDIA NeMo Voice Agent: An Open-Source, Multi-Model Framework for Building Your Own Real-Time Conversational AI

548: Lightweight End-to-end Spoken Language Understanding System for Speech-controlled Video games

557: Toward Realistic Multimodal Speech Processing Benchmarks Using a Multi-Talker Audio-Visual Conversational Corpus

558: Modular, Safe Granite Speech Conversation with Multiple Speakers

563: Reliable Real-Time Meeting Transcription through Multimodal Speaker Detection and Emotion Recognition

520: DeepAudioX: An Open-Source Python Library for Audio Learning and Rapid Prototyping with Pretrained Models

523: Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

546: An Interactive Music Analysis Platform for Pedagogy and Audio Organization

552: A System-Integrated Parametric Array Loudspeaker Prototype for Controllable and Localized Sound Field Regulation

568: SciPhi - a spatial audio language model that understands real multi-source audio scenes

571: VisionSFX: Cross-Shot Consistent Video-to-Audio Generation with Depth-Aware Binaural Audio Rendering

510: Sub-Nyquist DoA Estimation of an Ultrasound Source in a Sector of Interest

512: Photo-Driven Multimodal Conversational AI for Reminiscence-Based Cognitive Training and Longitudinal Cognitive Assessment

526: Plug-and-Play Latent Diffusion for Ultrasound Inverse Imaging – Show and Tell Demonstration

527: Full Wave Inversion for Pulse-Echo Ultrasound Linear Arrays

543: From Wearables to Generative Insight: A Multimodal Framework for AI-Augmented Cardiology Assistance with Single-lead Electrocardiogram

545: Size Doesn’t Matter: Interactive Acoustic Imaging System Design Using a 1024-Channel Ultrasound Array

511: Millisecond-Order Self-Adaptive AI WiFi Receiver

549: Over-the-Air Computation with Neural Constellations for Two-Way Streaming

550: Near-field MIMO with tri-polarized antennas

554: A flexible system-on-chip FPGA architecture for prototyping experimental GNSS receivers

561: FPGA Demonstration of High-reliability Low-latency Belief Propagation Decoding of Quantum LDPC Codes