QAAD Vietnam

AI Machine Sound Inspection: why LLMs and pretrained models fail on the factory floor

Published: April 20, 2026 · By: QAAD Engineering

Inspecting wireless-charger operating sounds — coil rub, shaft contact, scratch — shows why both large AI models and pretrained audio backbones fall short, and why QAAD builds a specialised CNN from scratch.

In consumer-electronics factories there is a class of defects that cameras and AOI simply cannot catch: mechanical-acoustic anomalies that only reveal themselves when the product is running. For wireless chargers, these are coil movement, shaft rubbing, or scratching sounds that appear only during high-frequency vibration. A veteran technician can hear them by ear — but at 1000 units/hour for 8 hours straight, human accuracy collapses.

"Small but critical" sounds

  • Very short — typically 20–200 ms, buried inside the continuous coil hum.
  • High frequency — most discriminative energy sits in the 3–10 kHz band, unlike speech (80–300 Hz).
  • Low amplitude — only 3–8 dB above the noise floor; linear thresholds will miss them.
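The last bullet is easy to verify numerically. Below is a minimal sketch with a fully synthetic signal (all amplitudes, durations, and frequencies are illustrative, not measurements from a real line): a 50 ms, 5 kHz burst sitting only a few dB above the floor barely moves a raw-amplitude threshold, because the coil hum dominates the waveform — but it stands out sharply in 3–10 kHz band energy.

```python
import numpy as np

SR = 48_000  # sample rate used on the line (per the article)

# Synthetic stand-in for a charger recording: a steady low-frequency coil
# hum plus broadband noise, with a 50 ms, 5 kHz "rub" burst at low amplitude.
rng = np.random.default_rng(0)
t = np.arange(SR) / SR                       # 1 s of audio
signal = 0.30 * np.sin(2 * np.pi * 140 * t)  # coil hum
signal += 0.05 * rng.standard_normal(SR)     # noise floor
burst = slice(int(0.40 * SR), int(0.45 * SR))          # 50 ms anomaly
signal[burst] += 0.09 * np.sin(2 * np.pi * 5000 * t[burst])

def band_energy(frame, lo=3000, hi=10000, sr=SR):
    """Energy of one frame restricted to the 3-10 kHz band."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    return spec[(freqs >= lo) & (freqs <= hi)].sum()

frame_len = 1024  # ~21 ms frames
frames = [signal[i:i + frame_len] for i in range(0, SR - frame_len, frame_len)]
peaks = np.array([np.abs(f).max() for f in frames])      # raw-amplitude view
energies = np.array([band_energy(f) for f in frames])    # band-limited view

# The raw peak barely moves (the hum dominates every frame),
# while the in-band energy jumps by several times during the burst.
print("peak ratio:       ", peaks.max() / np.median(peaks))
print("band-energy ratio:", energies.max() / np.median(energies))
```

The band-energy ratio localises the burst to the right ~21 ms frames, while the amplitude ratio stays close to 1 — which is exactly why a linear threshold on the waveform misses these defects.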
Fig. 1. Audio signal during charger operation: the OK region is uniform; the anomaly (dashed red) shows a short transient burst.

Why LLMs and big AI models fail

  • LLMs (GPT, Gemini, Claude): They don't process raw audio signal at all. Even "multi-modal audio" variants are trained for speech & content captioning — not micro-mechanical features.
  • Latency: Round-trip to cloud is seconds; a moving line needs < 100 ms to reject a bad unit at the right position.
  • Cost per inference: Millions/day × cloud API = not viable.
  • Reliability: if the internet connection drops, the line stops — unacceptable.

Audio pretrained models also collapse

"Why not just fine-tune a pretrained audio model — Whisper, Wav2Vec2, AST, YAMNet, PANNs, or CLAP?" We tried. The answer is no:

  • Whisper / Wav2Vec2 are trained for speech; their feature extractors focus on formants & phonemes — irrelevant for mechanical noise.
  • YAMNet / PANNs / AST are trained on AudioSet (dogs, cars, music, voice). No class is remotely close to "5 kHz coil rub". Transfer learning is worse than training from scratch.
  • CLAP maps audio to text — yet no natural-language vocabulary covers the specific defects our customers care about.
  • The domain gap is huge: sample rate, SNR, window length, dominant-energy region — all different. A thousand NG samples is nowhere near enough to shift the feature space of a 100M-parameter pretrained encoder.

Conclusion: for a specific factory-acoustic problem, pretrained is not a starting point — it's a dead end.

Fig. 2. Mel-spectrogram: the 4–8 kHz band during a coil rub event clearly out-energises the baseline.

The QAAD approach: a specialised CNN, built from scratch

  1. On-site data collection: measurement microphone at the correct pickup point, 48 kHz sampling, labels per defect class: coil_rub, shaft_contact, scratch, solder_crack, …
  2. Feature: STFT → Mel 128 bands × 96 frames over a 1-second window.
  3. Architecture: 5 Conv2D + BatchNorm + ReLU blocks with Squeeze-and-Excitation channel attention, Global Average Pool + FC 256. Total ~320k parameters — runs on a Raspberry Pi 4 or an Intel N100.
  4. Training: SpecAugment + MixUp, focal loss because NG:OK ≈ 1:50.
  5. Deployment: ONNX → edge inference, latency < 50 ms.
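Step 2 of the pipeline can be sketched in plain numpy. The FFT size (1024) and hop (500 samples) below are our own assumptions, chosen so that a 1-second, 48 kHz clip yields exactly 128 mel bands × 96 frames — the article does not specify the STFT parameters:

```python
import numpy as np

SR, N_MELS, N_FRAMES, N_FFT = 48_000, 128, 96, 1024
HOP = SR // N_FRAMES  # 500-sample hop -> 96 frames per 1 s window

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS, fmin=50.0, fmax=None):
    """Triangular mel filters mapping power-spectrum bins to n_mels bands."""
    fmax = fmax or sr / 2
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)                    # filter edges in Hz
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        up = (fft_freqs - lo) / (ctr - lo)         # rising edge
        down = (hi - fft_freqs) / (hi - ctr)       # falling edge
        fb[m] = np.maximum(0.0, np.minimum(up, down))
    return fb

def mel_features(audio, sr=SR):
    """1-second clip -> (128, 96) log-mel map, the CNN input of step 3."""
    fb = mel_filterbank()
    window = np.hanning(N_FFT)
    cols = []
    for i in range(N_FRAMES):
        frame = audio[i * HOP:i * HOP + N_FFT]
        frame = np.pad(frame, (0, N_FFT - len(frame)))  # pad final frames
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        cols.append(fb @ power)
    return np.log(np.stack(cols, axis=1) + 1e-8)        # log compression

clip = np.random.default_rng(1).standard_normal(SR)  # stand-in 1 s recording
feat = mel_features(clip)
print(feat.shape)  # (128, 96)
```

Each 1-second clip becomes a fixed-size (128, 96) "image", which is what lets a small Conv2D stack handle the problem at all.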
Fig. 3. Pipeline: Audio → Mel spectrogram → 5-block CNN + SE attention → GAP/FC → OK/NG + confidence + reason.
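The focal loss in step 4 deserves a word: with NG:OK ≈ 1:50, plain cross-entropy lets the flood of easy OK units drown the gradient from the rare defects. A minimal numpy sketch of binary focal loss (Lin et al.); the α and γ values are the paper's common defaults, not necessarily QAAD's settings:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma, so easy,
    well-classified examples contribute almost nothing to the gradient.
    p: predicted P(NG); y: 1 for NG, 0 for OK."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # probability of true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A confidently correct OK unit (p_t = 0.99) is almost free;
# a badly missed NG unit (p_t = 0.10) still carries a large loss.
easy_ok = focal_loss(np.array([0.01]), np.array([0]))
hard_ng = focal_loss(np.array([0.10]), np.array([1]))
print(float(easy_ok[0]), float(hard_ng[0]))
```

With γ = 2 the easy example's loss is suppressed by a factor of (1 − 0.99)² = 10⁻⁴, which is why the 50:1 majority of clean OK units no longer dominates training.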

Results

  • Real customer test set: F1 = 0.983, false-negative rate < 0.4%.
  • Average edge-CPU latency: 38 ms.
  • Model size: 1.3 MB — fully offline, no cloud dependency.
  • Explainable: spectrogram heatmaps + class-activation → factory operators see why a unit was flagged.
Unit A · 12:34:05 · OK · confidence 98.6% · features: low-noise hum, no transient · latency 38 ms on edge CPU. Unit B · 12:34:11 · NG · confidence 96.3% · reason: coil-rub burst at 4–8 kHz · action: flag for manual recheck.
Fig. 4. Real-time dashboard: every unit passing the band test is tagged OK/NG with a probability and a reason.
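The class-activation heatmaps mentioned in the results fall out of a GAP + FC head almost for free (CAM, Zhou et al.): the spatial evidence for the NG class is simply the FC-weighted sum of the last conv layer's feature maps. A sketch with random stand-in tensors — the shapes here are hypothetical, not the production model's:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical final-layer shapes: C feature maps of size H x W,
# followed by GAP -> FC producing 2 logits (OK / NG).
C, H, W = 256, 16, 12
feature_maps = rng.random((C, H, W))
fc_weights = rng.standard_normal((2, C))     # row 0: OK, row 1: NG

def class_activation_map(feats, weights, cls):
    """CAM: FC-weighted sum over channels gives a per-class spatial map,
    normalised to [0, 1] for display over the mel-spectrogram."""
    cam = np.tensordot(weights[cls], feats, axes=1)   # (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)

ng_heatmap = class_activation_map(feature_maps, fc_weights, cls=1)
# Upsampled onto the input spectrogram, this is the "reason" view:
# which time-frequency region drove the NG decision.
print(ng_heatmap.shape)  # (16, 12)
```

A useful sanity check when wiring this up: because global average pooling commutes with the weighted channel sum, the spatial mean of the raw (un-normalised) map equals the class logit, ignoring the FC bias.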

Our mission

Quality Assurance · Active Development — not chasing the biggest model, but building the right solution for a specific customer problem.

In a world obsessed with LLMs, we believe domain expertise + small, specialised models remains the only path onto the factory floor. If your business has a "trivial-looking problem nobody can solve", contact QAAD Vietnam — that's exactly the kind of problem we want.