Trifacta is a self-service data preparation platform that was originally designed primarily for data scientists. However, the company has gradually evolved the product to target business analysts as well, while maintaining the flexibility and power required by data scientists. This makes it unique in the marketplace: other products are (currently) aimed only at business analysts and lack specific features to support the work of data scientists, a group for which Trifacta is the clear market leader. Nevertheless, Trifacta tells us that it has more business analysts as users than it does data scientists.
Trifacta offers both Trifacta Wrangler Enterprise and Trifacta Wrangler, with the former being Hadoop-based and the latter being a desktop product that does not require Hadoop. Wrangler Enterprise can be deployed in the Cloud or on-premises.
Trifacta primarily uses a direct sales model, though it also has reseller partners and systems integrator partners (Infosys is an example). Its technical partnerships include Hadoop distributors such as Cloudera, Hortonworks and MapR as well as business intelligence vendors such as Tableau, Qlik and ZoomData. Especially notable partnerships, where the respective products have been closely integrated, are with data cataloguing and governance products such as Waterline and Cloudera Navigator.
Trifacta has a substantial user base for both the enterprise and desktop products. Many of the company's users are household names: for example, UnitedHealth Group, Santander, Zurich Insurance, GoPro, LinkedIn, Dish Networks, Lockheed Martin, Royal Bank of Scotland, and many others. As can be seen from this list, the use of Trifacta is not limited to any particular vertical or industry sector.
Trifacta offers its capabilities to a variety of users. For example, business analysts would use what Trifacta calls visual transform cards. These allow business users to see the results of available transformations in a visual format. At a lower level, data scientists can script directly in Trifacta's scripting language, which is called Wrangle; it will build regular expressions for you, and these are then compiled to MapReduce or Spark. In this context it is worth remarking that Trifacta uses the term "wrangling" in a broader sense than some other vendors, encompassing discovery, structuring, cleaning, enrichment, validation, and the publishing of data.
Various tools are available to make the process of development easier: for example, there is a Transform Builder and a Script IDE (integrated development environment). Wrangle itself is delivered with several hundred pre-built functions (for example, changing case from upper to lower). There is also support for User Defined Functions (UDFs), which can be written in Python or Java.
The platform is highly available, and there are built-in connections to the NameNode (in Hadoop environments) and ZooKeeper in order to ensure this. The company is a partner of both Cloudera and Hortonworks (as well as MapR) and the product supports both Sentry and Ranger for security purposes. Other notable Hadoop-based capabilities include support for HCatalog and the Hive MetaStore. While Trifacta will work with traditional relational and file-based data (XLSX, CSV), it also supports cloud sources in AWS, Microsoft Azure and Google Cloud Platform as well as more modern file formats such as JSON, Parquet, ORC and Avro. Lastly, Trifacta supports publishing of data in specific file formats for downstream use in business intelligence products such as Tableau.
In terms of underlying features and functions, Trifacta has the sorts of capabilities that one might expect: profiling that allows you to see type-specific histograms of values; automatic identification of data quality issues such as missing or mismatched values; automatic parsing of nested data formats and structures such as JSON (important when Trifacta is used in conjunction with business intelligence products such as Tableau); data enrichment; task orchestration and scheduling; and machine learning capabilities that will progressively improve its recommendations with respect to appropriate transformations. All of these are accessible through the browser and can run at scale on Hadoop, and the platform meets industry standards for security with support for Kerberos, Secure Impersonation, Sentry and Ranger.
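To make the idea of profiling for missing and mismatched values concrete, the following is a minimal Python sketch. It is not Trifacta's implementation, and the names `matches_type` and `profile_column` are hypothetical; it simply assumes that a "missing" value is empty or absent and that a "mismatched" value is one that does not parse as the column's expected type.

```python
from collections import Counter


def matches_type(value, expected):
    """Return True if the string value parses as the expected type.

    Hypothetical helper: only "integer" and "string" are handled here.
    """
    if expected == "integer":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if expected == "string":
        return True
    raise ValueError(f"unsupported type: {expected}")


def profile_column(values, expected):
    """Count valid, missing and mismatched values and build a histogram."""
    missing = 0
    mismatched = 0
    histogram = Counter()
    for raw in values:
        # Treat None and blank strings as missing.
        if raw is None or raw.strip() == "":
            missing += 1
            continue
        value = raw.strip()
        if matches_type(value, expected):
            histogram[value] += 1
        else:
            mismatched += 1
    return {
        "valid": sum(histogram.values()),
        "missing": missing,
        "mismatched": mismatched,
        "histogram": dict(histogram),
    }


if __name__ == "__main__":
    print(profile_column(["1", "2", "", "x", None, "2"], "integer"))
```

A real profiler would cover many more types and would sample or distribute the work across large data sets; the sketch only shows the three-way classification that sits behind this kind of quality indicator.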
One notable feature is what the company calls "Interactive Data Exploration", which is a form of data visualisation, not in the traditional sense of visualisation for end-consumption in analysis, but instead as a way of giving users information about the data they are working with, in order to jump-start or guide the process of transformation. The system presents the user with automated visual representations of the data based upon the inferred data type of each attribute of the data. These profiles require no specification by the user and automatically present each data type using the most compelling visual representation: geographic elements are presented as maps, time-oriented elements are presented according to common hierarchies such as day, month, year, and so on. Every Trifacta profile is completely interactive, allowing the user to simply select certain elements of the profile to prompt transformation suggestions.
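The general principle of inferring a type and then choosing a representation to match can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than a description of how Trifacta does it: the function `infer_profile`, the candidate types, the date format and the chart labels are all hypothetical.

```python
from datetime import datetime


def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    # Assumes ISO-style dates (YYYY-MM-DD) purely for illustration.
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def _is_lat_lon(value):
    # Assumes "latitude,longitude" pairs purely for illustration.
    parts = value.split(",")
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


# Candidate types are tried in order; each maps to a suggested view.
CANDIDATES = [
    ("integer", _is_int, "histogram"),
    ("date", _is_date, "timeline by day/month/year"),
    ("geo", _is_lat_lon, "map"),
]


def infer_profile(values):
    """Return (inferred type, suggested visual) for a column of strings."""
    present = [v.strip() for v in values if v and v.strip()]
    for name, check, chart in CANDIDATES:
        if present and all(check(v) for v in present):
            return name, chart
    # Fall back to a categorical view when no candidate fits every value.
    return "string", "bar chart of top values"


if __name__ == "__main__":
    print(infer_profile(["2016-03-01", "2016-04-15", ""]))
    print(infer_profile(["51.5,-0.12", "40.7,-74.0"]))
```

The sketch requires every non-empty value to match a candidate type; a production system would more plausibly tolerate a proportion of mismatches and report them as data quality issues.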
Although the core focus of Trifacta is enabling the people who know the data best to access and transform it themselves, the company has recognised that its enterprise customers still must have centralised processes for determining who has access to data, how metadata and lineage are tracked, how transformation jobs are operationalised and how datasets and transformation scripts are shared with other users. Instead of creating a completely separate governance framework in Trifacta, the company has built support for the existing enterprise standard frameworks on Hadoop for security, user authentication, access controls, job scheduling and so forth. This enables Trifacta customers to simply implement existing governance policies in Hadoop instead of creating a new, entirely separate governance framework for Trifacta.
The company recently launched (March 2016) the Photon Compute Framework, which incorporates high-performance in-memory capabilities (compatible with Apache Arrow) directly into the Trifacta interface.
Trifacta provides the sort of training, consulting services and support that you would expect.