By the end of Part 2, I had 14 plugins running in parallel, bypassing WAF protection, returning real catalog data. The search worked. Then I looked at the results more carefully.
I searched for part number 18-B5032. Eleven plugins returned results. But when I sorted by price, I was comparing eleven raw listings with no deduplication and no merging, and several of them were clearly the same part under different number formats.
One source listed it as 18B5032. Another as 018-B5032. Another as WK18B5032. Same part, same manufacturer, four different representations. My system was treating them as four unrelated items.
Without deduplication, the price sort was meaningless. I could be showing the "cheapest" result for a subtly different part number that referred to the wrong component.
The Scope of the Problem
I spent a day cataloging every part number format variant I encountered across all 18 sources. The patterns:
Hyphenation. 18-B5032 vs 18B5032. Some suppliers include hyphens, some strip them. No standard.
Leading zeros. 18B5032 vs 018B5032. Some suppliers pad part numbers to a fixed width with leading zeros; others don't.
Manufacturer prefixes. 18B5032 vs WK18B5032 vs ACD18B5032. Independent OEM suppliers often prepend their own brand code. The core part number is identical; the prefix is supplier-specific packaging.
Remanufactured suffixes. 18B5032 vs 18B5032X vs 18B5032-REMAN. Remanufactured parts often get a suffix to distinguish them from new OEM. This one is semantically significant — it's a different product, not just a formatting variant.
Category codes. Some suppliers prepend a category code: BRK-18B5032 for a brake component. The category code is implicit in the catalog context but explicit in the part number.
The naive approach, stripping hyphens and leading zeros before comparing, would handle the first two patterns, but it fails entirely on prefix and suffix variants; and a cruder pass that also stripped alphabetic affixes would incorrectly merge remanufactured parts with new OEM.
The Normalization Function
I built a normalization layer with four stages:
Stage 1: Canonicalize formatting. Strip all hyphens, spaces, and dots. Uppercase. Remove leading zeros after any alphabetic prefix. This handles the pure formatting variants.
```python
import re

def basic_normalize(part_number: str) -> str:
    pn = part_number.upper().strip()
    pn = re.sub(r'[-\s.]', '', pn)                # strip hyphens, spaces, dots
    pn = re.sub(r'^([A-Z]*)0+(\d)', r'\1\2', pn)  # strip leading zeros, after any alphabetic prefix
    return pn
```
Stage 2: Strip known manufacturer prefixes.
I built a lookup table of known OEM supplier prefixes: WK, ACD, BRK, FDX, and about 40 others. If the normalized part number starts with a known prefix followed by a digit string that looks like a part number, strip the prefix.
```python
KNOWN_PREFIXES = {'WK', 'ACD', 'ACN', 'BRK', 'FDX'}  # ... plus ~40 more in the real table

def strip_prefix(pn: str) -> str:
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if pn.startswith(prefix) and pn[len(prefix):len(prefix) + 1].isdigit():
            return pn[len(prefix):]
    return pn
```
Longest-match first: if the table held both ACD and AC, ACD18B5032 would match the three-character ACD instead of stopping at the shorter AC.
Stage 3: Flag remanufactured markers.
Before stripping suffixes, check for known remanufactured markers: X, REMAN, R, REMANUFACTURED. If one is found, set an is_reman flag on the result object and strip the marker from the core number. Remanufactured parts get filtered out by default but can be shown with a flag.
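Stage 3 is prose-only above; a minimal sketch of what it might look like, assuming a hypothetical strip_reman helper that runs on the already-normalized number. The single-letter markers R and X are the risky ones, since a part number can legitimately end in a letter, so a real implementation would need per-supplier rules there:

```python
# Longest markers first, so REMAN is matched before R. Illustrative list
# taken from the markers named in the text; not exhaustive.
REMAN_SUFFIXES = ('REMANUFACTURED', 'REMAN', 'X', 'R')

def strip_reman(pn: str) -> tuple[str, bool]:
    """Return (core_number, is_reman) for an already-normalized part number."""
    for suffix in REMAN_SUFFIXES:
        # Only strip if what remains still ends in a digit, i.e. looks
        # like a core part number rather than a coincidental match.
        if pn.endswith(suffix) and pn[:-len(suffix)][-1:].isdigit():
            return pn[:-len(suffix)], True
    return pn, False
```

Run after stages 1 and 2, this turns 18B5032X and 18B5032REMAN into the same core number 18B5032 with is_reman set, so they can be grouped with the new-OEM listings but filtered or flagged separately.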
Stage 4: Fuzzy matching for remaining variants. Even after the three previous stages, there are cases where two part numbers refer to the same component but differ in ways that don't match any known pattern. For these, I use Levenshtein distance with a tight threshold (distance ≤ 1, i.e. one character insertion/deletion/substitution) to flag candidate duplicates for review.
The fuzzy stage is conservative on purpose. At distance ≤ 1, false positives (merging actually different parts) are rare. At distance ≤ 2, I started getting incorrect merges on shorter part numbers. I'd rather show a few unmerged duplicates than incorrectly consolidate two different parts.
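The fuzzy stage described above can be sketched with a textbook Levenshtein implementation plus a pairing helper; both function names are hypothetical, and a real version would block candidates by length or prefix rather than comparing all pairs:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidate_duplicates(pns):
    """Pairs of normalized part numbers within edit distance 1, for review."""
    pns = sorted(set(pns))
    return [(a, b) for i, a in enumerate(pns) for b in pns[i + 1:]
            if levenshtein(a, b) <= 1]
```

Note that the pairs are flagged for review, not merged automatically, which is what keeps the tight threshold safe.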
Results
After adding the normalization layer and re-running the test queries:
Cross-source deduplication rate: 68% of part numbers that appeared in multiple sources had at least one formatting variant. Without normalization, these showed up as separate results. With it, they're merged into a single result showing all prices for that part.
False positive rate: In a test of 200 queries with manual verification, the normalization merged two actually-different parts zero times at the threshold I settled on. I consider that acceptable; a single incorrect merge would be a bad user experience.
Price data quality: With deduplication working correctly, the effective price comparison is now meaningful. The cheapest result for a given part is reliably the cheapest result, not a price for a differently-named related part.
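With the stages in place, the merge itself reduces to a group-by on the canonical key. A sketch, assuming a normalize function that composes the stages above and listings shaped as (source, raw part number, price); all names here are illustrative:

```python
from collections import defaultdict

def merge_listings(listings, normalize):
    """
    Group listings by canonical part number and sort each group's offers
    by price, so the cheapest offer per part comes first.
    """
    groups = defaultdict(list)
    for source, raw, price in listings:
        groups[normalize(raw)].append((price, source))
    return {key: sorted(offers) for key, offers in groups.items()}
```

The key design point is that normalization happens only at grouping time; the raw supplier part numbers stay on the listings, so each merged result can still show every source's own formatting.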
What I'd Do Differently
The prefix lookup table is manual and needs periodic maintenance. A better approach would be to learn prefixes automatically from the data: if a normalized part number appears across 5+ sources and one source consistently adds a 2–3 character alphabetic prefix, flag the prefix as a candidate for the lookup table.
I've prototyped this but haven't shipped it yet. The manual table has been accurate enough that the automated approach isn't urgent.
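The learning heuristic described above might look roughly like this; the function name, input shape, and thresholds are all illustrative, not the actual prototype:

```python
import re
from collections import Counter

def candidate_prefixes(listings, min_sources=5):
    """
    listings: (source, part_number) pairs, already formatting-normalized.
    Flags 2-3 letter prefixes that some sources consistently prepend to a
    core number that is seen widely enough to trust.
    """
    by_core = {}
    for source, pn in listings:
        m = re.match(r'^([A-Z]{2,3})(\d\w*)$', pn)
        core = m.group(2) if m else pn
        by_core.setdefault(core, set()).add((source, pn))

    votes = Counter()
    for core, seen in by_core.items():
        if len({s for s, _ in seen}) < min_sources:
            continue  # core not seen across enough sources to trust
        for source, pn in seen:
            if pn != core:  # this source used a prefixed form
                votes[pn[:-len(core)]] += 1
    # Require at least two prefixed sightings before suggesting a prefix.
    return [p for p, n in votes.items() if n >= 2]
```

The output would be reviewed by hand before anything lands in KNOWN_PREFIXES, which keeps the human in the loop the same way the fuzzy stage does.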
The full OEM Hunter codebase is on GitHub. The normalization module (lib/normalize.py) is the most interesting part to borrow if you're building something similar — the patterns it handles come up in any domain where the same entity appears under different identifiers across multiple sources.