Zoomix: a self-learning data quality engine

Zoomix is an Israeli company that you probably haven’t heard of,
though it has offices in both the UK and the Netherlands (a US
office is planned for next year). It is interesting because it
takes an innovative approach to data quality.

Traditionally, data quality solutions were (and are) purely
based on statistics. That is, you apply a particular algorithm or
rule, and the software compares the data using that algorithm and
tells you the likelihood of a match. An extension to this is to use
natural language capabilities that provide context for the
matching. The upside is that the software will know that the
“New York Yankees”, for example, represents a single term
rather than three separate words. The downside is that you have to
develop that natural language capability for each individual
language. What Zoomix has done is go beyond natural language
capability by using a self-learning dictionary that works with any
language.
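
To make the traditional approach concrete, here is a minimal sketch of rule-based matching in Python. It is not Zoomix's method or any vendor's algorithm: the function names and the 0.85 threshold are assumptions chosen for illustration, and the score comes from the standard library's difflib.

```python
from difflib import SequenceMatcher


def match_likelihood(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two values as a match when the score reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return match_likelihood(a, b) >= threshold


if __name__ == "__main__":
    for left, right in [("New York Yankees", "new york yankees"),
                        ("white", "weiß")]:
        print(left, right, round(match_likelihood(left, right), 2))
```

A purely character-based score of this kind treats “white” and “weiß” as unrelated, which is the gap that language-specific knowledge or a learned dictionary has to fill.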

Suppose you are matching product data. Since the software has
classification capabilities built in, it is able to recognise
that “wit”, “weiß” and “white” are potentially all the same.
The software will suggest that these form a match and, once you
approve the suggestion, the dictionary will retain that fact and
apply it automatically thereafter.
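
The confirm-once, apply-thereafter behaviour described above can be sketched as follows. This is a hypothetical illustration rather than Zoomix's implementation: the class and method names are invented here, and the step that proposes candidate matches is assumed to happen elsewhere.

```python
class LearningDictionary:
    """Remembers which terms a user has confirmed as equivalent."""

    def __init__(self) -> None:
        # Maps a normalised term to its canonical form.
        self._canonical: dict[str, str] = {}

    def normalise(self, term: str) -> str:
        """Return the canonical form of a term, or the term itself if unknown."""
        key = term.strip().lower()
        return self._canonical.get(key, key)

    def confirm(self, term: str, same_as: str) -> None:
        """Record a user-approved match so it is applied automatically later."""
        old = self.normalise(term)
        new = self.normalise(same_as)
        if old == new:
            return
        # Re-point every term that previously resolved to the old canonical form.
        for key, value in self._canonical.items():
            if value == old:
                self._canonical[key] = new
        self._canonical[old] = new

    def same(self, a: str, b: str) -> bool:
        """True when both terms resolve to the same canonical form."""
        return self.normalise(a) == self.normalise(b)


if __name__ == "__main__":
    d = LearningDictionary()
    d.confirm("wit", "white")    # user okays the first suggestion
    d.confirm("weiß", "wit")     # and a second one, in another language
    print(d.same("weiß", "white"))
```

Because the dictionary stores confirmed equivalences rather than grammar rules, nothing in it is specific to one language, which is the property the article attributes to Zoomix's approach.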

The second thing that makes Zoomix different is that its whole
methodology is to move away from a project-based approach.
Traditionally, a project team developed data quality rules to
ensure that the data was fit for purpose and then loaded the data
into the live system. At that point, in effect, you had to start
all over again, because there was no way of guaranteeing the
accuracy of further input data. Zoomix’s view (and it is not alone
in this) is that we need to move to an environment in which data
quality is assured up front and on a continuous basis.

The product, which is metadata driven, includes five steps,
which can be used independently and in any order:

Pattern detection to determine meaning.
Attribute detection (attributes can be weighted) to create
structured records associated with this meaning.
Matching and de-duplication to spot and eliminate
duplicates.
Classification, so that records (particularly products) can be
categorised.
Cleansing and normalisation so that you can spot and correct
data errors.
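
The idea that the steps above can be used independently and in any order can be sketched as a list of interchangeable functions. This is only an illustration under stated assumptions: the two steps shown are simplistic stand-ins written for this example, and the names cleanse, deduplicate and run are hypothetical, not part of the product.

```python
from typing import Callable

Record = dict[str, str]
Step = Callable[[list[Record]], list[Record]]


def cleanse(records: list[Record]) -> list[Record]:
    """Normalise whitespace and case in every field (a stand-in for cleansing)."""
    return [{k: " ".join(v.split()).lower() for k, v in r.items()}
            for r in records]


def deduplicate(records: list[Record]) -> list[Record]:
    """Drop records that exactly repeat an earlier one, keeping the first."""
    seen: set[tuple] = set()
    unique: list[Record] = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def run(records: list[Record], steps: list[Step]) -> list[Record]:
    """Apply whichever steps the caller chooses, in the order given."""
    for step in steps:
        records = step(records)
    return records


if __name__ == "__main__":
    raw = [{"name": "White  Mug"}, {"name": "white mug"}]
    print(run(raw, [cleanse, deduplicate]))
    print(run(raw, [deduplicate]))
```

Each step takes and returns a list of records, so any subset can be run alone or reordered without changing the others.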

Other notable features include support for multiple category
hierarchies, the ability for items to be in multiple hierarchies
(for example, a diving watch can be in both the diving and watch
categories), the ability to modify standards (that is, you are not
restricted to stated UNSPSC categories, say, but can amend these
if appropriate), and accuracy matrices that continuously check the
accuracy of results. This last feature is important because it
helps users (and the product is aimed primarily at end users who
do not want or need the involvement of IT) to understand what is
happening and to have confidence in the system, which should lead
to further automation of data quality.

That, primarily, is what Zoomix is about: the automation of data
quality. And about time too.