IRI is a privately-owned ISV founded in 1978. Its offices are in Florida and it relies on a partner network of resellers for international coverage (in 40 locations throughout the world).
The company’s first product, CoSort, is a high-performance data transformation utility that was first designed to offload JCL sort/merge steps to CP/M. Needless to say, this has been extended and ported to other environments but it remains at the heart of IRI’s offerings, including Voracity.
Voracity is a data management platform designed to perform and consolidate common work in Data Discovery, Data Integration, Data Migration, Data Governance and Analytics.
Last Updated: 9th November 2020
Powered by CoSort (or Hadoop), and built on Eclipse, IRI Voracity is a multi-purpose data management platform designed to perform, speed, and consolidate common work in 5 general areas:
- Data Discovery – data profiling, classification, search, and metadata redefinition
- Data Integration – high volume ETL, change data capture, slowly changing dimensions
- Data Migration – file/data/database type conversion, replication, and federation
- Data Governance – data quality, PII masking, re-ID risk scoring, test data synthesis
- Analytics – embedded BI, tie-ups to Datadog, KNIME and Splunk, wrangling for the rest
As can be seen in Figure 1, Voracity drives solution depth by including standalone products in both the IRI Data Manager Suite and the IRI Data Protector Suite, each of which has various sub-components that support multiple capabilities.
Voracity is an integrated platform with metadata shared across the whole environment, which supports the provision of data lineage. A formal data catalogue is missing, though the product does have inherent data classification capabilities and its central metadata stores are easy to understand, share, and use across the above applications, or create for Collibra.
A similar consideration applies to data governance: some capabilities are provided, mostly related to data privacy and quality, but there is no general-purpose capability, for which the company would rely on integration with partners like Erwin. The most notable of the ancillary governance capabilities in Voracity is test data management, with options for synthetic data generation, database subsetting, and static and dynamic data masking (with the option to combine both).
Illustrated in Figure 1 but not discussed is the IRI Workbench IDE, which supports graphical metadata creation, conversion, discovery, and application wizards to create, deploy, and manage data rules, job scripts, data definition files (DDF), and the XML workflows common to all IRI software. In the same pane of glass, you can also administer your databases and develop or use applications in other languages and any plug-in supported in Eclipse. As an alternative to the wizards, you can also develop jobs using diagrams, dialogs, or IRI’s domain-specific language (a 4GL), called SortCL.
“We sought a reliable tool that would quickly sort and transform very large files… we see the Voracity platform as a much more cost-effective (and higher-performing) alternative to legacy ETL tools.”
“CoSort accurately and quickly processes billions of rows of data and allows us to join and analyze this information in connection with our other data warehouse processes. No other tool gives us this much speed and flexibility.”
IRI CoSort is the default Voracity data integration engine. Unlike other such products, it is not confined to ETL (extract, transform and load) operations but also performs data replication (change data capture), federation, masking, cleansing, and reporting. Another key point to note about it is that it does not have to transform data in separate steps. You can define jobs that way, but at run time the engine consolidates multiple steps to reduce I/O. Add to this the fact that the run-time engine is a 2MB, multi-threaded C executable that loads only the libraries it requires, and you will appreciate why CoSort has a performance advantage over its competitors.
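To make the idea of step consolidation concrete, the following is a minimal Python sketch, not IRI's implementation and not SortCL. It assumes a small CSV input and three logical steps (read, filter, mask); the function names, the sample data, and the masking rule are all hypothetical. The point it illustrates is that steps defined separately can still be executed in a single pass over the data, with no intermediate files written between them.

```python
import csv
import io

# Hypothetical sample input; in practice this would be a large file or table.
SAMPLE = "name,region,amount\nAlice,EU,120.5\nBob,US,15.0\nCarol,EU,30.0\n"

def read_rows(text):
    """Step 1: parse the CSV text once and yield each row as a dict."""
    yield from csv.DictReader(io.StringIO(text))

def filter_rows(rows, min_amount):
    """Step 2: keep only rows whose amount is at least min_amount."""
    return (r for r in rows if float(r["amount"]) >= min_amount)

def mask_names(rows):
    """Step 3: replace all but the first character of each name with '*'."""
    for r in rows:
        name = r["name"]
        yield {**r, "name": name[:1] + "*" * (len(name) - 1)}

def run_pipeline(text, min_amount=20.0):
    # The steps are defined separately but chained lazily, so each row flows
    # through every step before the next row is read: one pass, no staging.
    return list(mask_names(filter_rows(read_rows(text), min_amount)))

if __name__ == "__main__":
    for row in run_pipeline(SAMPLE):
        print(row)
```

A staged design would instead write the filtered rows out and read them back before masking, which is where the extra I/O comes from.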
Note that IRI also offers a Hadoop-based option that does not have the same footprint advantages as CoSort but otherwise runs in a similar fashion. Moreover, many jobs developed for native CoSort implementations will run without change in MapReduce 2, Spark, Spark Streams, Storm or Tez. Dataflows are actually stored in files and can be executed from anywhere.
The company offers an extensive range of native connectors (including MQTT and Kafka) plus JDBC support. Not surprisingly given its heritage, it also supports mainframe sources that use COBOL copybooks, EBCDIC and so on. While it does not run on z/OS, it does support mainframe databases as sources and will itself run on z/Linux.
While IRI Voracity does not offer a module called “data quality”, it does provide substantial relevant capability, as illustrated in Figure 2.
A major strength of IRI Voracity is clearly in its Data Protector Suite. To begin with, IRI has deployed machine learning (including within IRI DarkShield) to support the identification of sensitive data (though we are disappointed that ML has not yet been implemented more widely across the platform). It also uses natural language processing for this purpose. Once sensitive data has been discovered, the company offers, as mentioned, significant capabilities when it comes to masking. In particular, dynamic data masking may be proxy-based, run in situ or driven by APIs, and can be mixed and matched with static masking. It is also worth mentioning that Voracity supports the ability to search, parse and protect multiple sources containing semi- and unstructured data.
Finally, given the current predilection for companies to migrate from on-premises data warehouses to cloud-native data warehouses such as Snowflake or Google BigQuery, it is worth noting the availability of IRI FACT and IRI NextForm, which bolster high volume database migration operations.
IRI Voracity is close to being a complete data management platform. It only lacks a formal data catalogue and some extensions to its policy and governance capabilities, which are in development. On the other hand, it is much more advanced when it comes to ETL performance and sensitive data protection than many of its competitors. The company’s data migration capabilities will also be a boon in the current environment, as will its relatively attractive price points and licensing options.
The Bottom Line
The key features of IRI Voracity are the performance that the CoSort engine offers, and the depth of capability it provides in extending its data management platform into the identification and management of sensitive data. If these are important issues for you, then you should seriously consider IRI Voracity.
Sensitive Data Discovery and Masking in IRI Voracity
Last Updated: 3rd March 2022
IRI Voracity is a data management platform that offers its core capabilities through two product suites: IRI Data Manager Suite, and IRI Data Protector Suite. In particular, the latter provides a selection of data masking products (namely IRI FieldShield, CellShield EE, and DarkShield, plus a services option that leverages them called DMaaS) that also come equipped with significant data discovery capabilities. This functionality can be used for a variety of purposes, not the least of which is to find and protect your sensitive data.
The Voracity platform, including the above products, can be accessed through either IRI Workbench, a largely wizard-driven Eclipse interface backed by graphical modelling, or via APIs. Licensing is flexible, with options available for Voracity as a whole as well as individual products and APIs. IRI also partners (and integrates) with a number of other vendors. These can variously add additional capabilities to the IRI offering as well as provide enhanced support for provisioning and CI/CD pipelines.
“Our experience with millions of unstructured files confirms the need to identify and mitigate the data privacy risks within them. Standalone and embedded spreadsheets, Word and PDF documents, image files in multiple formats, as well as logs and emails, are strewn with PII unknown to our customers. These needles in historical or operational customer haystacks need to be found and blunted. Fortunately, the search methods and masking functions in IRI DarkShield specifically and Voracity generally help us get control of these hidden risks.”
Masking in Voracity is rule-based and powered by the CoSort engine. FieldShield masks structured databases and flat files, CellShield masks Excel sheets, and DarkShield can search and mask structured, semi-structured and unstructured data sources simultaneously. Several dozen static masking functions are available for FieldShield and DarkShield, and about half of those are available in CellShield as well. In static operations, masked data is kept consistent across multiple data sources so that referential integrity is always maintained. Dynamic data masking is also available.
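One common way to keep masked values consistent across sources is deterministic pseudonymisation: the same input always maps to the same output, so joins on the masked key still work. The sketch below illustrates this with a keyed hash (HMAC-SHA-256) from the Python standard library. It is an assumption-laden illustration of the general technique, not a description of how FieldShield or DarkShield implement it; the key, table contents, and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load this from a key store.
SECRET_KEY = b"demo-key-change-me"

def pseudonymise(value: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Return a deterministic, keyed pseudonym for value (truncated hex digest)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

def mask_column(rows, column):
    """Return copies of rows with the given column replaced by its pseudonym."""
    return [{**r, column: pseudonymise(r[column])} for r in rows]

# Two hypothetical sources that share a key column.
customers = [
    {"customer_id": "C-1001", "name": "Alice"},
    {"customer_id": "C-1002", "name": "Bob"},
]
orders = [
    {"order_id": 1, "customer_id": "C-1001"},
    {"order_id": 2, "customer_id": "C-1002"},
    {"order_id": 3, "customer_id": "C-1001"},
]

masked_customers = mask_column(customers, "customer_id")
masked_orders = mask_column(orders, "customer_id")

if __name__ == "__main__":
    # Every masked foreign key still resolves to a masked primary key.
    known = {c["customer_id"] for c in masked_customers}
    print(all(o["customer_id"] in known for o in masked_orders))
```

Because the mapping depends only on the key and the input value, the two sources can be masked at different times, or by different jobs, and still line up.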
In addition to data masking, the various Data Protector Suite products provide data discovery and profiling capabilities. This enables you to classify your data against a centralised library of either pre-configured or bespoke data classes shared between all of the shield products, which can in turn be married to masking rules when they correspond to sensitive data (see Figure 1). These rules are acted on at execution time, ensuring that the associated sensitive data is protected. Each data class can also be equipped with a search methodology that is used to locate matching data in your system. This means that when set up correctly IRI can effectively automate the process of finding and anonymising your sensitive data: it will discover your sensitive data using the aforementioned search, associate it with the appropriate data class, and mask it at execution time. There are also considerations for performance that have been built in. For instance, tables that have already been scanned will be skipped during repeated discovery phases, and you can choose to exclude specific tables or data classes from the process entirely.
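The workflow described above (a shared library of data classes, each with a search method and a masking rule, applied automatically at execution time) can be sketched in a few lines of Python. This is an illustrative model only: the `DataClass` structure, the regular expressions, the masking functions, and the table layout are all assumptions made for the example and do not reflect IRI's actual metadata formats or APIs.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class DataClass:
    """A hypothetical data class: a name, a search pattern, and a masking rule."""
    name: str
    pattern: re.Pattern
    mask: Callable[[str], str]

def mask_email(value: str) -> str:
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain

def mask_ssn(value: str) -> str:
    return "***-**-" + value[-4:]

# Central library shared by every discovery and masking job.
DATA_CLASSES = [
    DataClass("EMAIL", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"), mask_email),
    DataClass("US_SSN", re.compile(r"\d{3}-\d{2}-\d{4}"), mask_ssn),
]

def discover(tables, already_scanned=frozenset(), excluded=frozenset()):
    """Map (table, column) to a data class when all non-empty values match it."""
    found = {}
    for table, rows in tables.items():
        # Skip tables scanned in an earlier run or excluded by the user.
        if table in already_scanned or table in excluded or not rows:
            continue
        for column in rows[0]:
            values = [str(r[column]) for r in rows if r[column] not in (None, "")]
            for dc in DATA_CLASSES:
                if values and all(dc.pattern.fullmatch(v) for v in values):
                    found[(table, column)] = dc
                    break
    return found

def mask_tables(tables, found):
    """Apply each discovered column's masking rule; leave other columns as-is."""
    return {
        table: [
            {c: found[(table, c)].mask(str(v)) if (table, c) in found else v
             for c, v in row.items()}
            for row in rows
        ]
        for table, rows in tables.items()
    }

if __name__ == "__main__":
    tables = {"users": [{"id": 1, "email": "alice@example.com", "ssn": "123-45-6789"}]}
    print(mask_tables(tables, discover(tables)))
```

Keeping the pattern and the masking rule on the same object is what lets discovery results drive masking without any per-column configuration.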
An impressive range of discovery methods can be used as part of these capabilities, including lookup value or pattern matching, NER (Named Entity Recognition), column name matching, fuzzy or exact dictionary matching, path searching, facial recognition matching, font matching, character recognition, and coordinate matching (the latter two mostly for images). NER in particular uses semi-supervised machine learning to enable more sophisticated and effective language analysis of highly unstructured data. In addition, any number of these methods can be used in concert with each other to improve the accuracy of your results. There is also a configurable matching threshold for discovery, allowing you to specify how sure you want to be before settling on a result.
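As a rough illustration of how several discovery methods might be combined under a matching threshold, the Python sketch below scores a column using three of the methods named above: column name matching, fuzzy dictionary matching, and pattern matching. The weights, the dictionary, the regular expressions, and the scoring formula are all hypothetical choices made for this example; they are not taken from IRI's products.

```python
import re
from difflib import SequenceMatcher

# Hypothetical dictionary and column-name hint for a "first name" data class.
FIRST_NAMES = {"alice", "robert", "maria"}
NAME_HINT = re.compile(r"(first|given)[_ ]?name", re.IGNORECASE)
NAME_SHAPE = re.compile(r"[A-Za-z][A-Za-z'-]*")

def column_name_score(column: str) -> float:
    """1.0 if the column name looks like a first-name field, else 0.0."""
    return 1.0 if NAME_HINT.search(column) else 0.0

def fuzzy_dictionary_score(value: str) -> float:
    """Best similarity ratio (0.0 to 1.0) between value and any dictionary entry."""
    v = value.lower()
    return max(SequenceMatcher(None, v, word).ratio() for word in FIRST_NAMES)

def pattern_score(value: str) -> float:
    """1.0 if the value has the shape of a name (letters, apostrophes, hyphens)."""
    return 1.0 if NAME_SHAPE.fullmatch(value) else 0.0

def classify_column(column, values, threshold=0.8):
    """Combine the three methods into one score and compare it to the threshold.

    The weights (0.3 / 0.5 / 0.2) are arbitrary assumptions for this sketch.
    """
    if not values:
        return False, 0.0
    per_value = [
        0.5 * fuzzy_dictionary_score(v) + 0.2 * pattern_score(v) for v in values
    ]
    score = 0.3 * column_name_score(column) + sum(per_value) / len(per_value)
    return score >= threshold, score

if __name__ == "__main__":
    print(classify_column("first_name", ["Alice", "Robert", "Maria"]))
    print(classify_column("order_ref", ["A-1001", "B-2002"]))
```

Raising the threshold trades recall for precision: fewer columns are flagged, but those that are flagged are supported by more of the evidence.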
Moreover, there are two ways to consume Voracity’s discovery and masking capabilities. You can go through Workbench – which has the advantage of a relatively friendly, wizard-driven user interface coupled with visualised reporting, as shown in Figure 2 – or you can leverage them directly through an API. In the latter case, this essentially allows you to use Voracity as a discovery and masking engine that underpins your other data pipelines. This has obvious (and positive) implications for integration and automation.
IRI Voracity uses a robust, centralised architecture for your data classes: it manages data class definitions, and assigns discovery and masking methods to them, in one place. It offers a healthy range of discovery methods running from the simple to the sophisticated, and its applicability to highly unstructured data, such as image files, is particularly notable.
Moreover, Voracity is billed as a total data management platform, and to that end it offers a wealth of additional capabilities – data integration, governance, quality, and so on – that will frequently tie into, and either augment or be augmented by, data discovery (and, to a lesser extent, masking) in one way or another. These capabilities are offered through a unified and user-friendly interface, complete with wizards, visual programming, and so on. This makes it easy to use each individual product and to shift your attention from one product to another. These advantages carry over to data discovery and data masking, at least if you plan to leverage these technologies through Workbench. That said, even if you don’t, you will benefit instead from the flexibility, integration and automation offered by an API-driven approach. By way of example, data discovery through the DarkShield API can be coupled with test data generation using IRI RowGen to replace values in images and documents with synthetic but realistic data and fonts, providing more safety for applications and processes that handle these sorts of files.
The bottom line
IRI justifiably positions Voracity as a total data management platform. As a solution for data masking and data discovery, either for sensitive data or not, it is both highly competent and rather flexible in how you can interact with it. In short, whether you want a solution that comes integrated into a larger platform, or one that works as a standalone engine, IRI Voracity should satisfy.