I have been hearing the term "data lake" a lot lately. Being the curious person that I am, I decided to go in search of a definition.
Currently, the company Pivotal is responsible for marketing the term. However, the term is usually credited to James Dixon of Pentaho, who used it in 2010, and Dan Woods of CITO Research wrote about it in 2011. Anyhow, here is a basic description of a data lake.
A data lake is an information system with the following two characteristics:
- A parallel system able to store big data
- A system able to perform computations on the data without moving the data
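To make the second characteristic a bit more concrete, here is a minimal Python sketch of computing on data where it sits. Everything in it is an assumption for illustration: the `partitions` list stands in for data spread across storage nodes, the names `local_summary` and `distributed_mean` are made up, and threads in a single process stand in for work running on separate machines. The point is only that each partition is summarized in place, and just the small summaries travel back to be combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for storage nodes: each inner list is the
# partition of the data that one node holds.
partitions = [
    [3, 1, 4, 1, 5],
    [9, 2, 6, 5, 3],
    [5, 8, 9, 7, 9],
]


def local_summary(partition):
    """Summarize one partition where it lives.

    Returns only a small (sum, count) pair, not the data itself.
    """
    return sum(partition), len(partition)


def distributed_mean(partitions):
    """Compute the overall mean from per-partition summaries."""
    # Run the summary on every partition in parallel. Threads here are
    # an assumption standing in for per-node workers.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(local_summary, partitions))

    # Only the small summaries are combined centrally.
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


if __name__ == "__main__":
    print(distributed_mean(partitions))
```

In a real system the function would be shipped to the machines that hold the data, rather than run in local threads, but the shape of the computation is the same.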
Currently, Hadoop is the most common technology used to implement a data lake, but it might not be that way forever. Thus it is important to distinguish between Hadoop and a data lake. A data lake is a concept, and Hadoop is a technology to implement the concept.
The following is a recent Strata Talk by Kaushik Das of Pivotal. He discusses how a data lake can be used to create the digital brain.