IRI is a privately-owned ISV founded in 1978. Its offices are in Florida and it relies on a partner network of resellers for international coverage (in 40 locations throughout the world).
The company’s first product, CoSort, is a high-performance data transformation utility that was first designed to offload JCL sort/merge steps to CP/M. Needless to say, this has been extended and ported to other environments but it remains at the heart of IRI’s offerings, including Voracity.
Voracity is a data management platform designed to perform and consolidate common work in Data Discovery, Data Integration, Data Migration, Data Governance and Analytics.
Last Updated: 9th November 2020
Powered by CoSort (or Hadoop), and built on Eclipse, IRI Voracity is a multi-purpose data management platform designed to perform, speed, and consolidate common work in 5 general areas:
- Data Discovery – data profiling, classification, search, and metadata redefinition
- Data Integration – high volume ETL, change data capture, slowly changing dimensions
- Data Migration – file/data/database type conversion, replication, and federation
- Data Governance – data quality, PII masking, re-ID risk scoring, test data synthesis
- Analytics – embedded BI, tie-ups to Datadog, KNIME and Splunk, wrangling for the rest
As can be seen in Figure 1, Voracity drives solution depth by including standalone products in both the IRI Data Manager Suite and the IRI Data Protector Suite, each of which has various sub-components that support multiple capabilities.
Voracity is an integrated platform with metadata shared across the whole environment, which supports the provision of data lineage. A formal data catalogue is missing, though the product does have inherent data classification capabilities and its central metadata stores are easy to understand, share, and use across the above applications, or create for Collibra.
A similar consideration applies to data governance: some capabilities are provided, mostly related to data privacy and quality, but there is no general-purpose capability, and for that the company would rely on integration with partners like Erwin. Most notable among the ancillary governance capabilities in Voracity is test data management, with options for synthetic data generation, database subsetting, and static and dynamic data masking (with the option to combine both).
Illustrated in Figure 1 but not discussed is the IRI Workbench IDE, which supports graphical metadata creation, conversion, discovery, and application wizards to create, deploy, and manage data rules, job scripts, data definition files (DDF), and the XML workflows common to all IRI software. In the same pane of glass, you can also administer your databases and develop or use applications in other languages and any plug-in supported in Eclipse. As an alternative to the wizards you can also develop jobs using diagrams, dialogs, or IRI’s domain specific language (a 4GL), called SortCL.
“We sought a reliable tool that would quickly sort and transform very large files… we see the Voracity platform as a much more cost-effective (and higher-performing) alternative to legacy ETL tools.”
“CoSort accurately and quickly processes billions of rows of data and allows us to join and analyze this information in connection with our other data warehouse processes. No other tool gives us this much speed and flexibility.”
IRI CoSort is the default Voracity data integration engine. Unlike other such products, it is not confined to ETL (extract, transform and load) operations but also performs data replication (change data capture), federation, masking, cleansing, and reporting. Another key point to note about it is that it does not have to transform data in separate steps. You can define jobs that way, but at run time the engine consolidates multiple steps to reduce I/O. Add to this the fact that the run-time engine is a 2MB, multi-threaded C executable that loads only the libraries it requires, and you will appreciate why CoSort has a performance advantage over its competitors.
Note that IRI also offers a Hadoop-based option that does not have the same footprint advantages as CoSort but otherwise runs in a similar fashion. Moreover, many jobs developed for native CoSort implementations will run without change in MapReduce 2, Spark, Spark Streams, Storm or Tez. Dataflows are actually stored in files and can be executed from anywhere.
The company offers an extensive range of native connectors (including MQTT and Kafka) plus JDBC support. Not surprisingly given its heritage, it also supports mainframe sources that use COBOL copybooks, EBCDIC and so on. While it does not run on z/OS it does support mainframe databases as sources and will itself run on z/Linux.
While IRI Voracity does not offer a module called “data quality”, it does provide substantial relevant capability, as illustrated in Figure 2.
A major strength of IRI Voracity is clearly in its Data Protector Suite. To begin with, IRI has deployed machine learning (including within IRI DarkShield) to support the identification of sensitive data (though we are disappointed that machine learning has not yet been implemented more widely across the platform). It also uses natural language processing for this purpose. Once sensitive data has been discovered, the company offers, as mentioned, significant capabilities when it comes to masking. In particular, dynamic data masking may be proxy-based, run in situ or driven by APIs, and can be mixed and matched with static masking. It is also worth mentioning that Voracity can search, parse and protect multiple sources containing semi-structured and unstructured data.
Finally, given the current predilection for companies to migrate from on-premises data warehouses to cloud-native data warehouses such as Snowflake or Google BigQuery, it is worth noting the availability of IRI FACT and IRI NextForm, which bolster high volume database migration operations.
IRI Voracity is close to being a complete data management platform. It only lacks a formal data catalogue and some extensions to its policy and governance capabilities, which are in development. On the other hand, it is much more advanced when it comes to ETL performance and sensitive data protection than many of its competitors. The company’s data migration capabilities will also be a boon in the current environment, as will its relatively attractive price points and licensing options.
The bottom line
The key features of IRI Voracity are the performance that the CoSort engine offers, and the depth of capability it provides in extending its data management platform into the identification and management of sensitive data. If these are important issues for you, then you should seriously consider IRI Voracity.
Sensitive Data Discovery and Masking in IRI Voracity
Last Updated: 3rd March 2022
IRI Voracity is a data management platform that offers its core capabilities through two product suites: IRI Data Manager Suite, and IRI Data Protector Suite. In particular, the latter provides a selection of data masking products (namely IRI FieldShield, CellShield EE, and DarkShield, plus a services option that leverages them called DMaaS) that also come equipped with significant data discovery capabilities. This functionality can be used for a variety of purposes, not the least of which is to find and protect your sensitive data.
The Voracity platform, including the above products, can be accessed through either IRI Workbench, a largely wizard-driven Eclipse interface backed by graphical modelling, or via APIs. Licensing is flexible, with options available for Voracity as a whole as well as individual products and APIs. IRI also partners (and integrates) with a number of other vendors. These can variously add additional capabilities to the IRI offering as well as provide enhanced support for provisioning and CI/CD pipelines.
“Our experience with millions of unstructured files confirms the need to identify and mitigate the data privacy risks within them. Standalone and embedded spreadsheets, Word and PDF documents, image files in multiple formats, as well as logs and emails, are strewn with PII unknown to our customers. These needles in historical or operational customer haystacks need to be found and blunted. Fortunately, the search methods and masking functions in IRI DarkShield specifically and Voracity generally help us get control of these hidden risks.”
Masking in Voracity is rule-based and powered by the CoSort engine. FieldShield masks structured databases and flat files, CellShield masks Excel sheets, and DarkShield can search and mask structured, semi-structured and unstructured data sources simultaneously. Several dozen static masking functions are available for FieldShield and DarkShield, and about half of those are available in CellShield as well. In static operations, masked data is kept consistent across multiple data sources so that referential integrity is always maintained. Dynamic data masking is also available.
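To make the point about consistency concrete, the following is a minimal sketch of deterministic masking in Python. It is not IRI's implementation, and it uses neither SortCL nor any IRI API: the key, the function name `pseudonymise` and the two sample tables are all hypothetical. It only illustrates the general principle that if the same input always yields the same masked value, keys masked in separate sources will still join.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would manage this key securely.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymise(value: str, prefix: str = "CUST") -> str:
    """Return a stable token: the same input and key always give the same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"


# Two hypothetical sources that share the customer_id key.
customers = [
    {"customer_id": "A1001", "name": "Ada"},
    {"customer_id": "B2002", "name": "Bob"},
]
orders = [
    {"order_id": 1, "customer_id": "A1001"},
    {"order_id": 2, "customer_id": "B2002"},
    {"order_id": 3, "customer_id": "A1001"},
]

# Mask the shared key independently in each source.
masked_customers = [{**row, "customer_id": pseudonymise(row["customer_id"])} for row in customers]
masked_orders = [{**row, "customer_id": pseudonymise(row["customer_id"])} for row in orders]

# Referential integrity check: every masked order must still point at a masked customer.
known_ids = {row["customer_id"] for row in masked_customers}
orphans = [row for row in masked_orders if row["customer_id"] not in known_ids]
print(f"orphaned orders after masking: {len(orphans)}")
```

A keyed hash is used here so that the mapping cannot be reproduced without the key. Other deterministic techniques, such as format-preserving encryption or lookup tables, would serve the same purpose.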
In addition to data masking, the various Data Protector Suite products provide data discovery and profiling capabilities. This enables you to classify your data against a centralised library of either pre-configured or bespoke data classes shared between all of the shield products, which can in turn be married to masking rules when they correspond to sensitive data (see Figure 1). These rules are acted on at execution time, ensuring that the associated sensitive data is protected. Each data class can also be equipped with a search methodology that is used to locate matching data in your system. This means that when set up correctly IRI can effectively automate the process of finding and anonymising your sensitive data: it will discover your sensitive data using the aforementioned search, associate it with the appropriate data class, and mask it at execution time. Performance considerations have also been built in. For instance, tables that have already been scanned will be skipped during repeated discovery phases, and you can choose to exclude specific tables or data classes from the process entirely.
An impressive range of discovery methods can be used as part of these capabilities, including lookup value or pattern matching, NER (Named Entity Recognition), column name matching, fuzzy or exact dictionary matching, path searching, facial recognition matching, font matching, character recognition, and coordinate matching (the latter two mostly for images). NER in particular uses semi-supervised machine learning to enable more sophisticated and effective language analysis of highly unstructured data. In addition, any number of these methods can be used in concert with each other to improve the accuracy of your results. There is also a configurable matching threshold for discovery, allowing you to specify how sure you want to be before settling on a result.
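As an illustration of how several discovery methods and a matching threshold might work together, consider the following Python sketch. It is an assumption-laden example rather than a description of how Voracity scores matches: the pattern, the column-name hints, the weights and the `classify` function are all hypothetical. It combines pattern matching on values with column name matching, and only flags a column when the combined score reaches the threshold.

```python
import re

# Hypothetical data class: US Social Security numbers.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
NAME_HINTS = ("ssn", "social", "tax_id")


def pattern_score(values):
    """Fraction of non-empty values that match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_PATTERN.match(v))
    return hits / len(non_empty)


def column_name_score(column_name):
    """1.0 if the column name contains one of the hints, otherwise 0.0."""
    lowered = column_name.lower()
    return 1.0 if any(hint in lowered for hint in NAME_HINTS) else 0.0


def classify(column_name, values, threshold=0.8, weights=(0.7, 0.3)):
    """Combine both methods; report a match only at or above the threshold."""
    score = weights[0] * pattern_score(values) + weights[1] * column_name_score(column_name)
    score = round(score, 4)  # avoid floating-point noise at the threshold boundary
    return score >= threshold, score


for name, sample in [
    ("ssn", ["123-45-6789", "987-65-4321", ""]),
    ("notes", ["123-45-6789", "call back", "n/a"]),
    ("col7", ["123-45-6789", "987-65-4321"]),
]:
    print(name, classify(name, sample))
```

Lowering the threshold makes discovery more inclusive at the cost of more false positives. In this sketch the anonymously named `col7` is only flagged if the threshold is reduced to 0.7 or below, because the pattern evidence alone carries a weight of 0.7.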
Moreover, there are two ways to consume Voracity’s discovery and masking capabilities. You can go through Workbench – which has the advantage of a relatively friendly, wizard-driven user interface coupled with visualised reporting, as shown in Figure 2 – or you can leverage them directly through an API. In the latter case, this essentially allows you to use Voracity as a discovery and masking engine that underpins your other data pipelines. This has obvious (and positive) implications for integration and automation.
IRI Voracity uses a robust architecture for managing your data classes: data class definitions are managed centrally, as is the assignment of discovery and masking methods to them. It offers a healthy range of discovery methods running from the simple to the sophisticated, and its applicability to highly unstructured data, such as image files, is particularly notable.
Moreover, Voracity is billed as a total data management platform, and to that end it offers a wealth of additional capabilities – data integration, governance, quality, and so on – that will frequently tie into, and either augment or be augmented by, data discovery (and, to a lesser extent, masking) in one way or another. These capabilities are offered through a unified and user-friendly interface, complete with wizards, visual programming, and so on. This makes it easy to use each individual product and to shift your attention from one product to another. These advantages carry over to data discovery and data masking, at least if you plan to leverage these technologies through Workbench. That said, even if you don’t, you will simply benefit from the flexibility, integration and automation offered by an API-driven approach instead. By way of example, data discovery through the DarkShield API can be coupled with test data generation using IRI RowGen to replace values in images and documents with synthetic, but realistic data and fonts – providing more safety for applications and processes that handle these sorts of files.
The bottom line
IRI justifiably positions Voracity as a total data management platform. As a solution for data masking and data discovery, either for sensitive data or not, it is both highly competent and rather flexible in how you can interact with it. In short, whether you want a solution that comes integrated into a larger platform, or one that works as a standalone engine, IRI Voracity should satisfy.
Test Data Management in IRI Voracity
Last Updated: 6th December 2023
IRI Voracity contains two product suites that are relevant to test data management (TDM): IRI Data Manager Suite and IRI Data Protector Suite. The latter provides a selection of masking products (IRI FieldShield, CellShield EE, and DarkShield) suitable for various use cases, including TDM, that also come equipped with significant data discovery and classification capabilities. It also offers data classification, discovery, and masking as a professional service, aptly named Data Masking as a Service, or DMaaS. The former, on the other hand, contains IRI RowGen, which can be used to generate synthetic test data. In principle it also provides data subsetting, but in practice this is more typically delivered as part of the platform’s broader data integration capabilities.
The Voracity platform, including the above products, is accessed through either IRI Workbench, a largely wizard-driven Eclipse interface backed by graphical modelling (displayed in Figure 1), or via APIs. Licensing is flexible, with options available for Voracity as a whole as well as individual products and APIs. Database virtualisation is not offered directly, but is provided through integration with partner vendors Windocks and Actifio. Other partnerships support integration with provisioning and CI/CD pipelines – among other things – and recent collaborative efforts with Cigniti and ValueLabs are resulting in those companies’ more workflow-oriented front-ends being applied to the core Voracity engine, creating a smoother experience when they are deployed together (at least for organisations that require extensive approval processes as part of their data access).
“Test data management (TDM) is a critical part of our agile SDLC, and is subject to data privacy regulations. Integrated data classification, discovery, anonymization, subsetting, and synthesis functions in Voracity improve our time-to-market delivery strategy, and help us comply with GDPR and similar laws.”
Capgemini Technology Services
The Data Protector Suite provides (sensitive) data discovery and classification facilities in support of data profiling and masking operations (and thence TDM). It categorises your data against an extensible library of pre-configured or bespoke data classes, which can be tagged with varying levels of sensitivity and then married to appropriate masking or test data generation rules that are acted on at execution time. In this way, you can use Voracity to find and protect your sensitive data, allowing it to be used for testing.
Various discovery methods are available, including pattern matching, named entity recognition (which in turn leverages semi-supervised machine learning), column name matching, fuzzy and exact dictionary matching, path searching, font matching, character recognition, and coordinate matching. Any number of these methods can be used together for additional accuracy, and validation scripts can be employed to reduce false positives. Discovery results can be rendered as graphical reports; an example of this is shown in Figure 2.
Masking is powered by the CoSort engine. FieldShield masks relational databases and flat files, CellShield masks Microsoft Excel data, and DarkShield masks structured, semi-structured and unstructured data (including images and documents) simultaneously and consistently. Static and dynamic masking are available, as is support for a variety of data sources. Various masking functions are provided out of the box, and you can build your own functions externally and integrate them via an API. You can also combine multiple discovery methods and/or masking functions together and apply them simultaneously. Masked data is consistent across all sources, while referential integrity is always maintained.
RowGen provides synthetic data generation. It emphasises the customisation of test data, giving you fine-grained control over what, how and where your data is generated. For instance, it can generate test data using parameters you provide to it (including which class of data it should belong to) or select data randomly from one or more “set files” that have been prepared ahead of time, creating a holistic data profile for a person or other entity that does not exist but that has realistic attributes drawn directly from your data. Moreover, this extends past just what data you are generating and also encompasses how and where you are generating it (which means that you could, say, generate test data within a CI/CD pipeline).
Various generation functions are available for creating test data sets, including both the specific – such as national ID number generation – and the generic – generating data according to a predefined, weighted statistical distribution. There are multiple ways to customise the end results of these functions: test data can be generated in such a way that each value is unique, each value in a set file can be mandated to be used exactly once, and so on. You can even define your own compound data formats. Regardless, the characteristics of your production data – including original data formats and sizes, value ranges, key relationships, and frequency distributions – are preserved. You can also generate test data in a variety of unstructured formats, including images and PDFs, based on predefined templates.
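The generation options described above can be sketched in a few lines of Python. This is a generic illustration and not RowGen syntax: the "set file" contents, the weights and the `generate_rows` function are hypothetical. It shows three of the behaviours mentioned: values drawn from a weighted distribution, identifiers guaranteed to be unique, and set-file values that are never used more than once.

```python
import random

rng = random.Random(42)  # fixed seed so that runs are repeatable

# Hypothetical "set file" contents, prepared ahead of time.
first_names = ["Ada", "Grace", "Alan", "Edsger"]

# Hypothetical weighted distribution for a categorical column.
countries = ["UK", "US", "NL"]
country_weights = [0.5, 0.3, 0.2]


def generate_rows(count):
    """Generate synthetic rows with unique ids and non-repeating names."""
    if count > len(first_names):
        raise ValueError("not enough set-file values to use each one only once")
    # Sampling without replacement: no name is used twice, and when count
    # equals the size of the set file, every name is used exactly once.
    names = rng.sample(first_names, k=count)
    # Sampling without replacement from a range guarantees unique ids.
    ids = rng.sample(range(100000, 1000000), k=count)
    # Sampling with replacement, following the weighted distribution.
    picked = rng.choices(countries, weights=country_weights, k=count)
    return [{"id": i, "name": n, "country": c} for i, n, c in zip(ids, names, picked)]


for row in generate_rows(4):
    print(row)
```

Seeding the generator is a design choice that suits testing, since the same test data can be regenerated on demand, for example within a CI/CD pipeline.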
Subsetting is delivered via either RowGen or Voracity’s data integration capabilities. In either case, you can specify a driver table and trace its foreign key relationships to create a self-contained subset. Voracity gives you the option to follow these relationships “downhill” – only moving from parent to child – or to move through them in either direction. The former is faster, but the latter is more comprehensive. In addition to quantitative subsetting based purely on volume, you can also employ more qualitative methods that apply conditions to the initial data set in order to create a coherent subset (which will, again, be self-contained).
All of this functionality can be executed as individual scripts or batch jobs, which can be created using various wizards, form editors, and mapping diagrams. They can then be executed from within IRI Workbench, the command line, or a partnered database virtualisation environment such as Windocks. In the latter case in particular, Voracity and Windocks can be used to create sanitised clones of your production data in on-demand, self-service, containerised, and virtualised repositories.
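The difference between "downhill" and bidirectional subsetting from a driver table can be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not Voracity's algorithm: the schema, the in-memory tables and the `subset` function are hypothetical, and every table is assumed to have a single-column primary key called `id`.

```python
from collections import deque

# Hypothetical schema: (child_table, child_column, parent_table, parent_column).
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_items", "order_id", "orders", "id"),
    ("order_items", "product_id", "products", "id"),
]

# Hypothetical in-memory tables standing in for a database.
TABLES = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 2}],
    "order_items": [
        {"id": 100, "order_id": 10, "product_id": 7},
        {"id": 101, "order_id": 11, "product_id": 8},
    ],
    "products": [{"id": 7}, {"id": 8}],
}


def subset(driver_table, driver_ids, both_directions=False):
    """Return {table: set of selected ids}, starting from the driver table."""
    selected = {driver_table: set(driver_ids)}
    queue = deque([driver_table])
    while queue:
        table = queue.popleft()
        for child, child_col, parent, parent_col in FOREIGN_KEYS:
            if parent == table:
                # Downhill: pull in child rows that reference selected parents.
                keys = {r[parent_col] for r in TABLES[parent] if r["id"] in selected[parent]}
                found = {r["id"] for r in TABLES[child] if r[child_col] in keys}
                target = child
            elif child == table and both_directions:
                # Uphill: pull in parent rows that selected children refer to.
                wanted = {r[child_col] for r in TABLES[child] if r["id"] in selected[child]}
                found = {r["id"] for r in TABLES[parent] if r[parent_col] in wanted}
                target = parent
            else:
                continue
            new_ids = found - selected.get(target, set())
            if new_ids:
                selected.setdefault(target, set()).update(new_ids)
                queue.append(target)  # revisit the table now that it has grown
    return selected


print(subset("customers", {1}))
print(subset("customers", {1}, both_directions=True))
```

In this example the downhill traversal selects the customer, their order and its order item, but leaves out the product that the order item refers to. Following relationships in both directions also brings in that product, which is what makes the second option more comprehensive, and more work.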
APIs are provided, meaning that Voracity TDM functions are also operable as part of an external pipeline, and can be invoked directly from within your CI/CD platform processes, either on-premises or in the cloud. The test data created by Voracity’s processes can be exported to many databases and file formats, including spreadsheets, PDFs and images.
Finally, IRI Ripcurrent, a real-time database event processing module, was recently added to Voracity. Ripcurrent offers incremental data replication by detecting and acting on changes to relational database tables in real-time. This works by monitoring log events for inserts, updates, deletes and schema structural changes, then mapping the data on-the-fly and/or issuing alerts. Applied to TDM, it can be used to refresh your test data environment by carrying out both data replication and masking processes automatically as soon as a corresponding production environment is changed.
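The general pattern of refreshing a masked test environment from a stream of change events can be sketched as follows. This is not Ripcurrent's interface or event format, which this report does not describe: the event dictionaries, the `apply_event` function and the choice of a truncated SHA-256 hash as the masking function are all assumptions made for illustration.

```python
import hashlib

# Hypothetical list of columns classified as sensitive.
SENSITIVE_COLUMNS = {"email"}


def mask(value):
    """Illustrative one-way mask; a real rule would be chosen per data class."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_event(replica, event, alerts):
    """Apply one change event to the test replica, masking on the fly."""
    op = event["op"]
    key = event.get("key")
    if op in ("insert", "update"):
        row = {
            col: mask(val) if col in SENSITIVE_COLUMNS else val
            for col, val in event["row"].items()
        }
        # Updates may carry only the changed columns, so merge with any existing row.
        replica[key] = {**replica.get(key, {}), **row}
    elif op == "delete":
        replica.pop(key, None)
    elif op == "schema_change":
        alerts.append(f"schema change: {event['detail']}")
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical events, as they might be read from a database log.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "email": "ada@example.com"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "email": "bob@example.com"}},
    {"op": "update", "key": 1, "row": {"email": "ada@new.example.com"}},
    {"op": "delete", "key": 2},
    {"op": "schema_change", "detail": "column phone added"},
]

replica, alerts = {}, []
for event in events:
    apply_event(replica, event, alerts)

print(replica)
print(alerts)
```

Because masking happens as each event is applied, unmasked values never reach the test replica in this sketch.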
IRI’s subsetting, masking and synthetic data generation capabilities are all highly competent. The ability to create representative synthetic data sets via analysis is particularly notable and useful, as is Ripcurrent’s automatic and real-time refresh of your test data. That said, in this paper we have only been able to scratch the surface of the product’s capabilities. There is a significant depth of functionality here: IRI has been organically growing its technology for over 40 years, and it shows. If you would like to learn more, we refer you to our recent series of articles on IRI and Voracity, which explores several of the topics touched on in this report in greater detail. We are also told that IRI is working on implementing generative AI as part of its sensitive data discovery and synthetic data generation capabilities, although the details of this have yet to be announced.
What is more, TDM is only one aspect of Voracity. It is billed as a total data management platform, and to that end it offers a wealth of other capabilities – data integration, governance, quality, and so on – that stretch beyond just TDM. Moreover, these capabilities (including TDM) are offered through a unified and user-friendly interface, complete with wizards, visual programming and so on. This makes it easy to use each individual product and to shift your attention from one product to another. Integration with CI/CD pipelines is also a useful feature, enabling Voracity to automate both the production and consumption of test data.
The company’s partnership with containerised database virtualisation vendor Windocks is particularly notable, and its other relevant partners, including Actifio, CommVault, Cigniti, and ValueLabs, should be considered as well.
The bottom line
IRI Voracity is both a data management platform and TDM solution, with many elements of the former being highly applicable to the latter. Ripcurrent is a particularly compelling example of this kind of applicability. The end result is an effective and versatile solution for TDM.