Here are five important things to know about data lakes:
1. What is a data lake? That’s a good place to start any conversation! A data lake is essentially a landing zone that stores all the data an organization collects. The main advantage over a traditional enterprise data warehouse (EDW) is that data from operational systems can be ingested in its raw form, without upfront extract-transform-load (ETL) processes; structure is applied only when the data is read. In addition, a data lake is relatively inexpensive and massively scalable.
2. Traditional EDW systems also restrict the data types they can support. Most enterprises today collect more data than they process. The data lake can store data of any type and in any format. As a result, the cost of transforming hitherto inaccessible information (such as text, images and other unstructured data) is eliminated or at least substantially reduced. In practice, this means that new operational systems can be added to the data lake easily, and users can start deriving insights from them almost immediately.
3. Why isn’t everyone adopting data lakes? There are a couple of pertinent reasons. To begin with, many organizations have invested heavily in the infrastructure, support and services offered by the large EDW solution providers (IBM, SAP, Oracle, Microsoft), and making a transition requires many levels of business justification. Also, data lake technology (and the Enterprise Hadoop ecosystem) is new and still evolving. As a result, early adopters will mostly be organizations that want to be on the cutting edge of technological advances, those that would like to capitalize on the financial advantages of the data lake, or those that are willing to bet on revolutionary solutions offered by up-and-coming players like QuickLogix (www.quicklogix.com; full disclosure: I am affiliated with this organization).
4. Data governance has been a challenge with EDW systems, and it is only going to gain prominence with the advent of data lakes. Gaps in data quality and reliability will be more easily exposed. We should collectively applaud this development. IT teams can shift their emphasis from building ETL processes that move data into the common store to ensuring that the data collection (operational) systems meet stringent quality standards.
5. Data lakes are not for everyone. One of the common complaints from data architects and technologists is that their organization is simply not suited for a shift to scale-out, parallel, NoSQL systems. That is true. To dig a hole, you might just need a spade, not a jackhammer. However, it is important to assess the current and future technological requirements of the organization when making these choices.
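
To make points 1 and 2 a little more concrete, here is a minimal sketch of the "land it raw, structure it later" idea. It is only an illustration under some loud assumptions: a local temporary directory stands in for the lake (a real one would more likely sit on something like Hadoop), and the helper names `land_raw` and `read_orders`, the source names and the field names are all hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Hypothetical helper: write the payload into the lake unchanged (no ETL on ingest)."""
    target = lake / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_orders(lake: Path) -> list[dict]:
    """Hypothetical helper: apply a common structure only when the data is read."""
    rows = []
    for path in sorted(lake.rglob("*")):
        if path.suffix == ".json":
            for rec in json.loads(path.read_text()):
                rows.append({"order_id": int(rec["id"]), "amount": float(rec["total"])})
        elif path.suffix == ".csv":
            for rec in csv.DictReader(io.StringIO(path.read_text())):
                rows.append({"order_id": int(rec["order_id"]), "amount": float(rec["amount"])})
        # Anything else (free text, images) simply stays in the lake until someone needs it.
    return rows


# A temporary directory stands in for the lake (assumption for illustration only).
lake = Path(tempfile.mkdtemp())
land_raw(lake, "webshop", "orders.json", b'[{"id": 1, "total": "19.99"}]')
land_raw(lake, "pos", "orders.csv", b"order_id,amount\n2,5.00\n")
land_raw(lake, "support", "ticket.txt", b"Customer reports a late delivery.")

orders = read_orders(lake)
```

Note that the transformation work does not vanish in this sketch; it moves from ingest time to read time, and only the data someone actually asks for pays that cost. Adding a new operational system amounts to one more `land_raw` call.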