AI Machine Sound Inspection: why LLMs and pretrained models fail on the factory floor
Inspecting wireless-charger operating sounds — coil rub, shaft contact, scratch — shows why both large AI models and pretrained audio backbones fall short, and why QAAD builds a specialised CNN from scratch.
In consumer-electronics factories there is a class of defects that cameras and AOI simply cannot catch: mechanical-acoustic anomalies that only reveal themselves when the product is running. For wireless chargers, these are coil movement, shaft rubbing, or scratching sounds that appear only during high-frequency vibration. A veteran technician can hear them by ear — but at 1000 units/hour for 8 hours straight, human accuracy collapses.
"Small but critical" sounds
- Very short — typically 20–200 ms, buried inside the continuous coil hum.
- High frequency — most discriminative energy sits in the 3–10 kHz band, far above the 80–300 Hz fundamentals of speech.
- Low amplitude — only 3–8 dB above the noise floor; linear thresholds will miss them.
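To make these numbers concrete, here is a minimal detection sketch: short-time band energy in the 3–10 kHz band, compared against a median noise-floor estimate. The sample rate, band, and dB margin come from the text; the FFT size, hop length, and median-based floor are illustrative assumptions — this is a thresholding baseline, not the QAAD model.

```python
import numpy as np

def band_energy_flags(x, sr=48000, n_fft=1024, hop=256,
                      f_lo=3000.0, f_hi=10000.0, margin_db=3.0):
    """Flag STFT frames whose 3-10 kHz energy rises above the noise floor.

    Returns one boolean per frame.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    e_db = 10.0 * np.log10(spec[:, band].sum(axis=1) + 1e-12)
    floor_db = np.median(e_db)                           # robust noise-floor estimate
    return e_db > floor_db + margin_db

# Synthetic check: 1 s of low-frequency hum plus a 50 ms burst at 5 kHz
sr = 48000
t = np.arange(sr) / sr
x = 0.05 * np.sin(2 * np.pi * 120 * t)                   # coil hum
burst = slice(int(0.5 * sr), int(0.55 * sr))
x[burst] += 0.03 * np.sin(2 * np.pi * 5000 * t[burst])   # short high-band defect
flags = band_energy_flags(x)
print(flags.any())  # True: the burst frames are flagged
```

A fixed threshold like this is exactly what the low-amplitude point above breaks: with real line noise the floor drifts, which is why a learned model is needed rather than this baseline.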
Why LLMs and big AI models fail
- LLMs (GPT, Gemini, Claude): they don't process raw audio at all. Even "multi-modal audio" variants are trained for speech and content captioning — not micro-mechanical features.
- Latency: Round-trip to cloud is seconds; a moving line needs < 100 ms to reject a bad unit at the right position.
- Cost per inference: Millions/day × cloud API = not viable.
- Reliability: No internet, no line — unacceptable.
Audio pretrained models also collapse
"Just use a pretrained audio model like Whisper, Wav2Vec2, AST, YAMNet, PANNs, or CLAP and fine-tune?" We tried; the answer is no:
- Whisper / Wav2Vec2 are trained for speech; their feature extractors focus on formants & phonemes — irrelevant for mechanical noise.
- YAMNet / PANNs / AST are trained on AudioSet (dogs, cars, music, voice). No class is remotely close to "5 kHz coil rub". In our experiments, transfer learning performed worse than training from scratch.
- CLAP maps audio to text — yet no natural-language vocabulary covers the specific defects our customers care about.
- The domain gap is huge: sample rate, SNR, window length, dominant-energy region — all different. A thousand NG samples is nowhere near enough to shift the feature space of a 100M-parameter pretrained encoder.
Conclusion: for a specific factory-acoustic problem, pretrained is not a starting point — it's a dead end.
The QAAD approach: a specialised CNN, built from scratch
- On-site data collection: measurement microphone at the correct pickup point, 48 kHz sampling, labels per defect class: coil_rub, shaft_contact, scratch, solder_crack, …
- Feature: STFT → Mel, 128 bands × 96 frames over a 1-second window.
- Architecture: 5 Conv2D + BatchNorm + ReLU blocks with Squeeze-and-Excitation channel attention, Global Average Pool + FC 256. Total ~320k parameters — runs on a Raspberry Pi 4 or an Intel N100.
- Training: SpecAugment + MixUp, focal loss because NG:OK ≈ 1:50.
- Deployment: ONNX → edge inference, latency < 50 ms.
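The feature pipeline and the imbalance-aware loss above can be sketched in plain NumPy. The 48 kHz rate, 128 mel bands × 96 frames, 1-second window, and focal loss come from the text; the FFT size, hop length, padding, and the `alpha`/`gamma` values are illustrative assumptions.

```python
import numpy as np

SR, N_FFT, HOP = 48000, 1024, 500      # FFT size and hop are assumptions
N_MELS, N_FRAMES = 128, 96             # 128 bands x 96 frames, per the pipeline

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters mapping an rFFT power spectrum to mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def log_mel(x):
    """1-second waveform -> (128, 96) log-mel feature."""
    need = (N_FRAMES - 1) * HOP + N_FFT
    x = np.pad(x, (0, max(0, need - len(x))))   # pad so we get exactly 96 frames
    win = np.hanning(N_FFT)
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win
                       for i in range(N_FRAMES)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(mel_filterbank(SR, N_FFT, N_MELS) @ power.T + 1e-6)

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss: down-weights easy majority (OK) samples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)            # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)     # extra weight on the rare NG class
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))

feat = log_mel(np.random.default_rng(0).standard_normal(SR))
print(feat.shape)  # (128, 96)
```

The `(1 - pt) ** gamma` factor is what makes focal loss suit a 1:50 NG:OK imbalance: confidently classified OK units contribute almost nothing, so gradients concentrate on the rare, hard NG samples.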
Results
- Real customer test set: F1 = 0.983, false-negative rate < 0.4%.
- Average edge-CPU latency: 38 ms.
- Model size: 1.3 MB — fully offline, no cloud dependency.
- Explainable: spectrogram heatmaps + class-activation → factory operators see why a unit was flagged.
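The class-activation step pairs naturally with the GAP + FC head described above: with global average pooling, a class-activation map is just the last-conv channel activations weighted by the final layer's weights for the chosen class. A minimal sketch — the array shapes below are toy placeholders, not the production model's dimensions:

```python
import numpy as np

def class_activation_map(feat_maps, fc_weights, cls):
    """CAM for a GAP + FC head.

    feat_maps:  (C, H, W) activations from the last conv block
    fc_weights: (n_classes, C) weights of the final linear layer
    Returns an (H, W) heatmap in [0, 1] over the mel-spectrogram grid.
    """
    cam = np.tensordot(fc_weights[cls], feat_maps, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0.0)                              # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                                    # normalise to [0, 1]
    return cam

rng = np.random.default_rng(0)
A = rng.random((256, 8, 6))          # toy activations (channels, mel bands, frames)
W = rng.standard_normal((5, 256))    # toy head: 5 defect classes
heat = class_activation_map(A, W, cls=2)
print(heat.shape)  # (8, 6)
```

Upsampled back onto the 128×96 mel grid, the heatmap shows operators which time–frequency region triggered the NG verdict.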
Our mission
Quality Assurance · Active Development — not chasing the biggest model, but building the right solution for a specific customer problem.
In a world obsessed with LLMs, we believe domain expertise + small, specialised models remains the only path onto the factory floor. If your business has a "trivial-looking problem nobody can solve", contact QAAD Vietnam — that's exactly the kind of problem we want.
