A data lake is a central repository that stores data of any type and schema in its raw, unprocessed form. It retains structured, semi-structured, and unstructured data alike. Data lakes are commonly used in data science and impose fewer restrictions on data analysis than data warehouses do. The term 'data lake' is credited to James Dixon, CTO of Pentaho, who explains the concept with the following analogy:
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
Advantages of data lakes over data warehouses include the following: data lakes retain all data, support all data types, and serve all kinds of users; they can be changed more easily than data warehouses; and they generally allow insights to be drawn from data analytics faster than data warehouses do.
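The idea of keeping everything in its raw form and structuring it only when it is analyzed can be illustrated with a small sketch. The example below is an assumption-laden toy, not a description of any real data lake product: a temporary directory stands in for the lake, and the helper names (ingest, read_orders, demo) and the file names are hypothetical. It stores CSV, JSON, and log data unchanged, then applies a structure only at read time for one particular analysis.

```python
from __future__ import annotations

import csv
import io
import json
import tempfile
from pathlib import Path


def ingest(lake: Path, name: str, payload: bytes) -> Path:
    """Store a payload exactly as received; no schema is checked or enforced."""
    target = lake / name
    target.write_bytes(payload)
    return target


def read_orders(lake: Path) -> list[dict]:
    """Impose a structure only at read time, for one particular analysis."""
    rows = []
    for path in sorted(lake.iterdir()):
        if path.suffix == ".csv":
            text = path.read_text(encoding="utf-8")
            records = csv.DictReader(io.StringIO(text))
        elif path.suffix == ".json":
            records = json.loads(path.read_text(encoding="utf-8"))
        else:
            # Other raw files (logs, images, ...) stay in the lake untouched.
            continue
        for rec in records:
            rows.append({"id": int(rec["id"]), "amount": float(rec["amount"])})
    return rows


def demo() -> list[dict]:
    # A temporary directory plays the role of the lake (hypothetical setup).
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest(lake, "orders_a.csv", b"id,amount\n1,9.50\n2,20.00\n")
        ingest(lake, "orders_b.json", b'[{"id": 3, "amount": 5.25}]')
        ingest(lake, "server.log", b"2020-01-01 GET /index.html 200\n")
        return read_orders(lake)


if __name__ == "__main__":
    print(demo())
```

In this sketch the log file is retained even though the order analysis ignores it, which mirrors the retention of all data described above; a different user could later read the same lake with a different structure in mind.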