The concept of «data lakes» or «data hubs» appeared a few
years ago, in 2010. The term itself was introduced by James
Dixon [11], but it is sometimes disparaged as simply a
marketing label for a product that supports Hadoop. There is
also another view: yesterday's unified storage is today's
enterprise data lake [12].
A data lake refers to a massively scalable storage repository
that holds a vast amount of raw data in its native format («as
is») until it is needed, plus processing systems (engines) that
can ingest data without compromising its structure [13].
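This «as is» principle can be sketched in a few lines of Python (a minimal illustration, not taken from [13]; the landing_zone layout and the metadata fields are assumptions made for the example). The payload is written untouched in its native format, and only descriptive metadata is stored beside it; any schema is applied later, on read.

import json, pathlib, time, uuid

def ingest(raw_bytes: bytes, source: str, landing_zone: pathlib.Path) -> str:
    """Store the payload untouched; only metadata is added alongside it."""
    landing_zone.mkdir(parents=True, exist_ok=True)
    element_id = str(uuid.uuid4())
    # Native format is preserved: no parsing, no schema enforcement on write.
    (landing_zone / element_id).write_bytes(raw_bytes)
    meta = {"id": element_id, "source": source, "ingested_at": time.time()}
    (landing_zone / (element_id + ".meta.json")).write_text(json.dumps(meta))
    return element_id

ingest(b'{"reading": 17}', "sensor-a", pathlib.Path("landing"))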
Data lakes are typically built, first of all, to handle large
and rapidly arriving volumes of unstructured data (in contrast
to data warehouses, which process highly structured data) from
which further insights are derived. Lakes therefore rely on
dynamic analytical applications (rather than the pre-built
static ones of data warehouses). Data in the lake becomes
accessible as soon as it is created (again in contrast to data
warehouses, which are designed for slowly changing data).
Data lakes often include a semantic DB: a conceptual model
that leverages the same standards and technologies used to
create Internet hyperlinks and adds a layer of context over
the data, defining its meaning and its interrelationships with
other data. Data lake strategies can combine SQL and NoSQL
DB approaches as well as online analytical processing (OLAP)
and online transaction processing (OLTP) capabilities.
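The idea of such a context layer can be sketched in Python (a hedged illustration: the URNs, the predicate names, and the interrelations helper are invented for the example; a real deployment would typically use RDF tooling). Each fact is a (subject, predicate, object) triple, the same subject-to-object linking that underlies Internet hyperlinks.

# A tiny semantic layer: URIs name data elements and their relations.
context_layer = [
    ("urn:lake:element:42", "meansSameAs", "urn:vocab:CustomerOrder"),
    ("urn:lake:element:42", "relatedTo",   "urn:lake:element:7"),
]

def interrelations(element, triples):
    """Return everything an element is linked to in the semantic layer."""
    return [(p, o) for s, p, o in triples if s == element]

print(interrelations("urn:lake:element:42", context_layer))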
In contrast to a hierarchical data warehouse that stores data
in files and folders, the data lake uses a flat architecture,
where each data element has a unique identifier and a set of
extended metadata tags. The data lake does not require a rigid
schema or manipulation of the data, whatever its shape and
size, but it does require maintaining the order of data arrival.
It can be imagined as a large data pool that brings in all of
the historical data (collected and accumulated data about past
events and circumstances pertaining to a particular subject)
and new data.
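A minimal sketch of such a flat store follows (Python; the FlatLake and DataElement names, the tag vocabulary, and the integer identifiers are illustrative assumptions, not an implementation from the literature). A plain map from identifier to element replaces the folder hierarchy, and insertion order stands in for the order of data arrival.

from dataclasses import dataclass, field
import itertools

@dataclass
class DataElement:
    element_id: int                            # unique identifier instead of a folder path
    payload: bytes                             # raw content, any shape or size
    tags: dict = field(default_factory=dict)   # extended metadata tags

class FlatLake:
    """No hierarchy: one flat map from identifier to element, in arrival order."""
    def __init__(self):
        self._ids = itertools.count()
        self._elements = {}                    # dict preserves insertion (arrival) order

    def put(self, payload: bytes, **tags) -> int:
        eid = next(self._ids)
        self._elements[eid] = DataElement(eid, payload, tags)
        return eid

    def find(self, **tags):
        """Locate elements by metadata tags rather than by location in a tree."""
        return [e for e in self._elements.values()
                if all(e.tags.get(k) == v for k, v in tags.items())]

lake = FlatLake()
lake.put(b"...", kind="historical", subject="orders")
print(lake.find(kind="historical"))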