Simple Stock Description (string) comparison on Syspro data, using MS SQL Server

When bringing companies’ systems together, ‘duplications’ in master data are likely to occur.

Take for example the following three StockCodes and Descriptions:

StockCode Description
001 LG TV 42 inch LCD
ABC TV LG (42 lcd)
A01 LG TV 32″ LED 32LJ500D

For a human it is easy to see that the first two products refer to the same thing.¬† A simple ‘exact’ comparison (WHERE InvMaster.Description = #InvMaster.Description) will fail, so one will have to use a method to show¬†‘most likely’ matches.

I’ve chosen a ‘word match and score’ technique because of the programming simplicity and (small) size of my data set (+-2000 lines).
(Machine learning algorithms are more suited for much larger data sets)

Using the example above, I create the following ‘base’ table:

StockCode Word
001 LG
001 TV
001 42
001 INCH
001 LCD
ABC TV
ABC LG
ABC 42
ABC LCD
A01 LG
A01 TV
A01 32
A01 32LJ500D

Note the standardisation, using a nested REPLACE containing
– UPPER
– elimination of sequential spaces (delimiter in this case)
– removing of ‘special’ characters and punctuation: ‘&’, ‘-‘, ‘.’, ‘”‘, ‘/’, ‘\’, ‘(‘ and ‘)’.

I break up each string into words by looping the following insert into my (say) ‘StockCodeWordList’ table:

SELECT StockCode,
SUBSTRING(Description, 1, CASE WHEN CHARINDEX(‘ ‘, Description) = 0 THEN 100 ELSE CHARINDEX(‘ ‘, Description) END) AS [Word]

The next step is to ‘score’ StockCodes based on the amount of ‘word matches’ with a specific (Source) Description.

I do this by looping the following statement:

UPDATE dbo.StockCodeWordList SET Score =0; — Reset for a new match exercise
DECLARE @DescString VARCHAR(100) = (SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER(Description),’-‘,”),’&’,”),’ ‘,’ ‘),’ ‘,’ ‘),’ ‘,’ ‘),’ ‘,’ ‘) FROM dbo.InvMaster WHERE StockCode = @SourceStockCode);

DECLARE @Counter INT = 0;
WHILE @Counter < 10 — Assuming there’s a maximum of 10 ‘words’ in a Description
BEGIN
UPDATE dbo.StockCodeWordList SET Score = Score+1
WHERE [Word] = (SELECT REPLACE(REPLACE(SUBSTRING(@DescString, 1, CASE WHEN CHARINDEX(‘ ‘, @DescString) = 0 THEN 100 ELSE CHARINDEX(‘ ‘, @DescString) END),'(‘,”),’)’,”))

SET @DescString = (SELECT CASE WHEN CHARINDEX(‘ ‘, @DescString) = 0 THEN NULL
ELSE LTRIM(SUBSTRING(@DescString, CHARINDEX(‘ ‘, @DescString)+1, 100)) END);
SET @Counter = @Counter+1
END

Using StockCode: 001 Description: ‘LG TV 42 inch LCD’ as Source, will produce the following result:

StockCode: ABC Description: TV LG (42 lcd) Score: 4

StockCode: A01 Description: LG TV 32″ LED 32LJ500D Score: 2