Simple Stock Description (string) comparison on Syspro data, using MS SQL Server
When bringing companies’ systems together, ‘duplications’ in master data are likely to occur.
Take for example the following three StockCodes and Descriptions:
StockCode | Description |
001 | LG TV 42 inch LCD |
ABC | TV LG (42 lcd) |
A01 | LG TV 32″ LED 32LJ500D |
For a human it is easy to see that the first two products refer to the same thing. A simple ‘exact’ comparison (WHERE InvMaster.Description = #InvMaster.Description) will fail, so one will have to use a method to show ‘most likely’ matches.
I’ve chosen a ‘word match and score’ technique because of the programming simplicity and (small) size of my data set (+-2000 lines).
(Machine learning algorithms are more suited for much larger data sets)
Using the example above, I create the following ‘base’ table:
StockCode | Word |
001 | LG |
001 | TV |
001 | 42 |
001 | INCH |
001 | LCD |
ABC | TV |
ABC | LG |
ABC | 42 |
ABC | LCD |
A01 | LG |
A01 | TV |
A01 | 32 |
A01 | 32LJ500D |
Note the standardisation, using a nested REPLACE containing
– UPPER
– elimination of sequential spaces (delimiter in this case)
– removing of ‘special’ characters and punctuation: ‘&’, ‘-‘, ‘.’, ‘”‘, ‘/’, ‘\’, ‘(‘ and ‘)’.
I break up each string into words by looping the following insert into my (say) ‘StockCodeWordList’ table:
SELECT StockCode,
SUBSTRING(Description, 1, CASE WHEN CHARINDEX(‘ ‘, Description) = 0 THEN 100 ELSE CHARINDEX(‘ ‘, Description) END) AS [Word]
The next step is to ‘score’ StockCodes based on the amount of ‘word matches’ with a specific (Source) Description.
I do this by looping the following statement:
UPDATE dbo.StockCodeWordList SET Score =0; — Reset for a new match exercise
DECLARE @DescString VARCHAR(100) = (SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UPPER(Description),’-‘,”),’&’,”),’ ‘,’ ‘),’ ‘,’ ‘),’ ‘,’ ‘),’ ‘,’ ‘) FROM dbo.InvMaster WHERE StockCode = @SourceStockCode);
DECLARE @Counter INT = 0;
WHILE @Counter < 10 — Assuming there’s a maximum of 10 ‘words’ in a Description
BEGIN
UPDATE dbo.StockCodeWordList SET Score = Score+1
WHERE [Word] = (SELECT REPLACE(REPLACE(SUBSTRING(@DescString, 1, CASE WHEN CHARINDEX(‘ ‘, @DescString) = 0 THEN 100 ELSE CHARINDEX(‘ ‘, @DescString) END),'(‘,”),’)’,”))
SET @DescString = (SELECT CASE WHEN CHARINDEX(‘ ‘, @DescString) = 0 THEN NULL
ELSE LTRIM(SUBSTRING(@DescString, CHARINDEX(‘ ‘, @DescString)+1, 100)) END);
SET @Counter = @Counter+1
END
Using StockCode: 001 Description: ‘LG TV 42 inch LCD’ as Source, will produce the following result:
StockCode: ABC Description: TV LG (42 lcd) Score: 4
StockCode: A01 Description: LG TV 32″ LED 32LJ500D Score: 2