If you have been following the news, you may have noticed that the EU is getting increasingly serious about its data protection directives - and that there are no easy safe harbours left (the recent invalidation of the Safe Harbour agreement saw to that).
This has implications for Big Data, since one of the obligations under the EU directives is to keep "personal data" (roughly, anything that can be attributed to an identifiable person) secure. As I've pointed out before, your obligations start when you store personal data, whether or not you use it - and if you just dump data into a data lake for possible future use, how can you be sure whether it is personal data or not? At the very least, you had better keep it all reasonably secure, even if you aren't using it to run your business - yet. You have other data protection obligations too - they're all part of proper governance - but those are "left as an exercise for the reader".
Unfortunately, several sessions at Apache Big Data 2015 in Budapest pointed out that you have very little security for big data on the Hadoop platform unless you use Kerberos - a network authentication protocol designed to provide strong authentication for client/server applications using secret-key cryptography (MIT, which maintains the reference implementation, publishes a good introduction; IETF RFC 4120 has the full detail). So people using Hadoop data stores had better get to grips with Kerberos - and probably not switch it off for performance reasons; not without considerable attention to appropriate risk analysis, anyway.
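For orientation, the on/off switch itself is a cluster-wide setting in Hadoop's core-site.xml - roughly the fragment below, using the standard Hadoop property names. This is only a sketch of the first step: the per-service principal and keytab settings that follow it are where the real work (and the misconfiguration risk) lives.

```xml
<!-- core-site.xml: switch authentication from "simple"
     (trust whatever identity the client claims) to Kerberos. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- the default is "simple" -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>     <!-- enable service-level authorisation checks -->
</property>
```

Note that both properties matter: the first makes users prove who they are, the second makes the services check what they are allowed to ask for.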
Cerberus (Kerberos in Greek) is the three-headed dog that guards the gates of the underworld, which might tell you something about managing Kerberos - although Kerberos, by all accounts, does its security job pretty well (I would argue that if it is hard to set up and manage, that impacts its security in practice; but that's a discussion for another day). At the conference, Steve Loughran's talk drew a running parallel between Kerberos and the horrors of H.P. Lovecraft:
Lovecraft, evil in New England - Kerberos: Project Athena;
Lovecraft, ancient inhuman deities - Kerberos: Domain Controller;
Lovecraft, the Necronomicon which drives one insane - Kerberos: IETF RFC 4120;
Lovecraft, He Who Must Not Be Named - Kerberos: UserGroupInformation;
Lovecraft, doomed explorers of darkness - Kerberos: YOU.
Scary stuff, but Steve made a good case for it: Cthulhu is not a metaphor, he says. He claims to be building unique documentation for all this at gitbook.com/@steveloughran (there's a signup process involved; then look for author steveloughran). I am no Kerberos expert - the book is where you should go (at your own peril). It is a work in progress, but it looks like a useful resource to me.
Steve confirms that "without Kerberos there's no security in Hadoop - people are trusted to be who they say they are". However, he also points out that "even with Kerberos, you still need to deal with wire encryption and authorisation for data access (which is why Apache Ranger is in the incubator), but at least people and machines really are who they say they are". He goes on to say that "Kerberos itself shouldn't be a performance hit, although encryption might be; but security has to run bottom to top, and cannot be left as an afterthought, or (worse) as 'somebody else's problem'". Basically, I think, security always needs thought and risk/threat analysis; it's never as simple as flipping a "security ON" switch or installing a security technology.
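To illustrate Steve's point that Kerberos alone is not enough: wire encryption is its own set of switches in the Hadoop configuration, separate from authentication. A hedged sketch, using the standard property names (the exact coverage and defaults should be checked against your own Hadoop version):

```xml
<!-- core-site.xml / hdfs-site.xml: the wire-encryption knobs
     that Kerberos does not turn on for you. -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>  <!-- authentication | integrity | privacy -->
</property>
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>     <!-- also encrypt the HDFS block-transfer path -->
</property>
```

This is also where the performance trade-off Steve mentions lives: "privacy" encrypts every RPC, which costs cycles in a way that authentication by itself does not.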
Steve wants to make it clear to the developers in his audience that they need to start assuming that the system will be secure from the outset, and code and test accordingly - because this will have visible implications for their code.
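What those visible implications look like in practice: on a Kerberized cluster, client code can no longer assume an ambient identity - it has to log in (typically from a keytab) and run its Hadoop calls inside that identity. A minimal sketch using Hadoop's UserGroupInformation API; the principal name, keytab path, and HDFS path here are made-up placeholders, not conventions:

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the client library it is talking to a secure cluster.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate from a keytab rather than an interactive ticket.
        // Principal and keytab path are illustrative placeholders.
        UserGroupInformation ugi =
            UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                "etl-service@EXAMPLE.COM",
                "/etc/security/keytabs/etl.keytab");

        // Any Hadoop call that should carry that identity runs in doAs().
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus st : fs.listStatus(new Path("/data/landing"))) {
                System.out.println(st.getPath());
            }
            return null;
        });
    }
}
```

This is exactly the kind of code that cannot be bolted on later: if developers write and test against an insecure cluster, every code path that touches Hadoop has to be revisited once Kerberos is switched on.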
Another issue, it seems to me, is that Kerberos works well when properly set up by people who understand it; but it is remarkably easy to misconfigure Kerberos/Hadoop and bring all work to a shuddering stop. Worse, it is all too easy to misconfigure it so that you think you are more secure than you actually are. And remember: without Kerberos, Hadoop has virtually no security... The answer, of course, is to make developers aware that Kerberos isn't optional, and then to provide appropriate training resources around its use.