AI Machine Sound Inspection: why LLMs and pretrained models fail on the factory floor
Inspecting wireless-charger operating sounds — coil rub, shaft contact, scratch — shows why both large AI models and pretrained audio backbones fall short, and why QAAD builds a specialised CNN from scratch.
In consumer-electronics factories there is a class of defects that cameras and AOI simply cannot catch: mechanical-acoustic anomalies that only reveal themselves when the product is running. For wireless chargers, these are coil movement, shaft rubbing, or scratching sounds that appear only during high-frequency vibration. A veteran technician can hear them by ear — but at 1000 units/hour for 8 hours straight, human accuracy collapses.
"Small but critical" sounds
- Very short — typically 20–200 ms, buried inside the continuous coil hum.
- High frequency — most discriminative energy sits in the 3–10 kHz band, far above the 80–300 Hz fundamentals of speech.
- Low amplitude — only 3–8 dB above the noise floor; linear thresholds will miss them.
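To make these numbers concrete, here is a minimal detection sketch: short-time band energy in the 3–10 kHz band, compared against a median noise-floor estimate. The sample rate, band, and dB margin come from the text; the FFT size, hop length, and median-based floor are illustrative assumptions — this is a thresholding baseline, not the QAAD model.

```python
import numpy as np

def band_energy_flags(x, sr=48000, n_fft=1024, hop=256,
                      f_lo=3000.0, f_hi=10000.0, margin_db=3.0):
    """Flag STFT frames whose 3-10 kHz energy rises above the noise floor.

    Returns one boolean per frame.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    e_db = 10.0 * np.log10(spec[:, band].sum(axis=1) + 1e-12)
    floor_db = np.median(e_db)                           # robust noise-floor estimate
    return e_db > floor_db + margin_db

# Synthetic check: 1 s of low-frequency hum plus a 50 ms burst at 5 kHz
sr = 48000
t = np.arange(sr) / sr
x = 0.05 * np.sin(2 * np.pi * 120 * t)                   # coil hum
burst = slice(int(0.5 * sr), int(0.55 * sr))
x[burst] += 0.03 * np.sin(2 * np.pi * 5000 * t[burst])   # short high-band defect
flags = band_energy_flags(x)
print(flags.any())  # True: the burst frames are flagged
```

A fixed threshold like this is exactly what the low-amplitude point above breaks: with real line noise the floor drifts, which is why a learned model is needed rather than this baseline.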
Why LLMs and big AI models fail
- LLMs (GPT, Gemini, Claude): they don't process raw audio at all. Even "multi-modal audio" variants are trained for speech and content captioning — not micro-mechanical features.
- Latency: Round-trip to cloud is seconds; a moving line needs < 100 ms to reject a bad unit at the right position.
- Cost per inference: Millions/day × cloud API = not viable.
- Reliability: No internet, no line — unacceptable.
Audio pretrained models also collapse
"Just use a pretrained audio model like Whisper, Wav2Vec2, AST, YAMNet, PANNs, or CLAP and fine-tune?" We tried; the answer is no:
- Whisper / Wav2Vec2 are trained for speech; their feature extractors focus on formants & phonemes — irrelevant for mechanical noise.
- YAMNet / PANNs / AST are trained on AudioSet (dogs, cars, music, voice). No class is remotely close to "5 kHz coil rub". In our experiments, transfer learning performed worse than training from scratch.
- CLAP maps audio to text — yet no natural-language vocabulary covers the specific defects our customers care about.
- The domain gap is huge: sample rate, SNR, window length, dominant-energy region — all different. A thousand NG samples is nowhere near enough to shift the feature space of a 100M-parameter pretrained encoder.
Conclusion: for a specific factory-acoustic problem, pretrained is not a starting point — it's a dead end.
The QAAD approach: a specialised CNN, built from scratch
- On-site data collection: measurement microphone at the correct pickup point, 48 kHz sampling, labels per defect class: coil_rub, shaft_contact, scratch, solder_crack, …
- Feature: STFT → Mel, 128 bands × 96 frames over a 1-second window.
- Architecture: 5 Conv2D + BatchNorm + ReLU blocks with Squeeze-and-Excitation channel attention, Global Average Pool + FC 256. Total ~320k parameters — runs on a Raspberry Pi 4 or an Intel N100.
- Training: SpecAugment + MixUp, focal loss because NG:OK ≈ 1:50.
- Deployment: ONNX → edge inference, latency < 50 ms.
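The feature pipeline and the imbalance-aware loss above can be sketched in plain NumPy. The 48 kHz rate, 128 mel bands × 96 frames, 1-second window, and focal loss come from the text; the FFT size, hop length, padding, and the `alpha`/`gamma` values are illustrative assumptions.

```python
import numpy as np

SR, N_FFT, HOP = 48000, 1024, 500      # FFT size and hop are assumptions
N_MELS, N_FRAMES = 128, 96             # 128 bands x 96 frames, per the pipeline

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters mapping an rFFT power spectrum to mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def log_mel(x):
    """1-second waveform -> (128, 96) log-mel feature."""
    need = (N_FRAMES - 1) * HOP + N_FFT
    x = np.pad(x, (0, max(0, need - len(x))))   # pad so we get exactly 96 frames
    win = np.hanning(N_FFT)
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win
                       for i in range(N_FRAMES)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(mel_filterbank(SR, N_FFT, N_MELS) @ power.T + 1e-6)

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss: down-weights easy majority (OK) samples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)            # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)     # extra weight on the rare NG class
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))

feat = log_mel(np.random.default_rng(0).standard_normal(SR))
print(feat.shape)  # (128, 96)
```

The `(1 - pt) ** gamma` factor is what makes focal loss suit a 1:50 NG:OK imbalance: confidently classified OK units contribute almost nothing, so gradients concentrate on the rare, hard NG samples.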
Results
- Real customer test set: F1 = 0.983, false-negative rate < 0.4%.
- Average edge-CPU latency: 38 ms.
- Model size: 1.3 MB — fully offline, no cloud dependency.
- Explainable: spectrogram heatmaps + class-activation → factory operators see why a unit was flagged.
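The class-activation step pairs naturally with the GAP + FC head described above: with global average pooling, a class-activation map is just the last-conv channel activations weighted by the final layer's weights for the chosen class. A minimal sketch — the array shapes below are toy placeholders, not the production model's dimensions:

```python
import numpy as np

def class_activation_map(feat_maps, fc_weights, cls):
    """CAM for a GAP + FC head.

    feat_maps:  (C, H, W) activations from the last conv block
    fc_weights: (n_classes, C) weights of the final linear layer
    Returns an (H, W) heatmap in [0, 1] over the mel-spectrogram grid.
    """
    cam = np.tensordot(fc_weights[cls], feat_maps, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0.0)                              # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                                    # normalise to [0, 1]
    return cam

rng = np.random.default_rng(0)
A = rng.random((256, 8, 6))          # toy activations (channels, mel bands, frames)
W = rng.standard_normal((5, 256))    # toy head: 5 defect classes
heat = class_activation_map(A, W, cls=2)
print(heat.shape)  # (8, 6)
```

Upsampled back onto the 128×96 mel grid, the heatmap shows operators which time–frequency region triggered the NG verdict.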
Our mission
Quality Assurance · Active Development — not chasing the biggest model, but building the right solution for a specific customer problem.
In a world obsessed with LLMs, we believe domain expertise + small, specialised models remains the only path onto the factory floor. If your business has a "trivial-looking problem nobody can solve", contact QAAD Vietnam — that's exactly the kind of problem we want.
