
Pretending to be a horse won’t grow you hooves (or why a data warehouse is not a data-lake)

  • Writer: Peter D Greaves
  • Jun 19, 2023

One of the aspects of going to conferences like HIMSS that is both annoying and amusing is how many vendors identify the next industry buzz-term and jump on the bandwagon, regardless of their ability to deliver a viable product in that space. In some ways I can’t say I blame companies: the reality is that for any vendor to be successful, they have to constantly look at their position within their chosen industry, determine what differentiated product they are selling, and position themselves to capture a share of the market.

What I do find disturbing are companies that position themselves as selling a solution without understanding the real market need, and that create a new version of an established term to cater to their existing solution, regardless of what the underlying requirements are or what a conceptual framework really means. Let me explain.

We saw this in the early days of Accountable Care Organizations (ACOs). The premise behind an ACO was simple: you were assigned a set of patients, and you took care of their health. If you could do this below a certain cost while demonstrating certain quality measures, you were financially rewarded; if you didn’t, then (depending on the model of the ACO) you either made no extra money or were penalized. (Yes, I know I will likely have folks excoriate me for oversimplifying this.)

At the time ACOs were announced, the healthcare market was rife with vendors trying to be Health Information Exchange (HIE) providers, which involved acquiring, cleansing, and centralizing data. On the face of it, that seems to be a good fit. You get all the data for your ACO into one place, you analyze it to see if you are meeting the measures, you report on the measures, you get your money. And HIE companies saw ACOs as a gravy train, a huge untapped source of federal funding they could use to solidify a somewhat undefined market.

What a whole lot of vendors missed was that ACOs did not want to find out after three months (when you got the claims data) that they were missing the mark. They needed to know on day one how they were measuring up, and to make adjustments as needed. That required operational dashboards based on clinical rather than claims data, aged in days rather than weeks. In most cases, an HIE solution was never (in reality) an out-of-the-box ACO solution, and a lot of unsuspecting clients were oversold by companies that hijacked the ACO concept and simply sold their legacy product as an ACO solution, often without any changes or new development.

So now to my current soap-box. The term data lake has been around for some time. Wikipedia defines a data lake as “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” A simple Google search will come up with definitions such as, “A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed” (TechTarget.com).

Key concepts you will find again and again are storage in native formats, vast amounts of data, and flat architecture. Many companies are looking at using Hadoop as a platform for this.

At the same time you find companies who have large investments in more traditional BI technologies creating their own definition of a data lake. The result is confusion in the market from people who are being marketed to by companies who are trying to morph the concept of a data lake to fit their technologies, rather than innovating around data lakes.

Google “what is a data lake”, and you find a slide-deck from a major IT corporation that has a box labeled data-lake, and inside that box you find Hadoop, a data warehouse, data marts, and sandboxes. The deck talks about Inmon atomic models and Kimball dimensional models as being part of the data lake, alongside the warehouse and data marts. All of this is encapsulated in a slide called “The broadening scope of analytics” in which they state that “Adding in a business desire for real-time analytics, self service data and individual privacy, it becomes necessary to have a well-defined, managed and governed approach to information architecture. We call this ...(our) Data Lake”.

By the way, before I get accused of picking on one company, this is hardly the only example I could cite. The reality is that it has become commonplace for companies and individuals to hijack terms that have an established meaning for their own benefit.

The first public penning of the term data lake is clearly attributable to James Dixon, Pentaho CTO, in a 2010 blog post in which he describes the new concept of a data lake, saying:

"Based on the requirements above and the problems of the traditional solutions we have created a concept called the Data Lake to describe an optimal solution."

He goes on to provide an analogy:

"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

I am sure there are those who argue that companies are simply innovating on the concept of a data lake, and extending the model. That, frankly, is nonsense. The very article that kicked off the concept of data-lake specifically positions it as different to "traditional solutions" and the datamart (and by extension the data warehouse). Datamarts and data warehouses by definition are dependent on highly cleansed and organized data, not data in their "natural state". The Inmon and Kimball models both still have their roles in modern information architecture, but they are not part of a larger data lake.
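To make that distinction concrete, here is a minimal sketch of my own (it is not from Dixon's post, and the feed, the `vitals` table, and the function names are all hypothetical). It assumes a tiny mixed-format feed and uses an in-memory SQLite table as a stand-in for a warehouse. The lake-style loader keeps every record exactly as it arrived; the warehouse-style loader only admits records that can be cleansed into a fixed schema.

```python
import json
import sqlite3

# Hypothetical incoming records, in whatever shape the source systems emit.
RAW_FEED = [
    '{"patient": "A1", "hr": 72}',
    '{"patient": "A2", "note": "free text, no heart rate"}',
    'A3|hr=88|legacy pipe-delimited record',
]


def load_lake(feed):
    """Lake-style: keep every record in its native format, untouched."""
    return list(feed)


def load_warehouse(feed):
    """Warehouse-style: only rows that fit a fixed, cleansed schema get in."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE vitals (patient TEXT NOT NULL, hr INTEGER NOT NULL)")
    for line in feed:
        try:
            rec = json.loads(line)
            row = (rec["patient"], int(rec["hr"]))
        except (ValueError, KeyError, TypeError):
            continue  # record does not fit the schema, so it never lands
        db.execute("INSERT INTO vitals VALUES (?, ?)", row)
    return db


lake = load_lake(RAW_FEED)
warehouse = load_warehouse(RAW_FEED)
warehouse_rows = warehouse.execute("SELECT patient, hr FROM vitals").fetchall()

print(len(lake))       # every raw record is retained
print(warehouse_rows)  # only the record that matched the schema
```

Neither loader is better than the other; they simply make opposite promises about the data, which is why drawing a box around both and calling the box a data lake blurs a real difference.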

I have a background in psychology, and one analogy I would use is that you cannot be a true Freudian psychologist and a true Jungian psychologist at the same time; there are core underlying concepts that are in conflict with each other. You cannot both be a member of the Flat Earth Society and believe the world is a globe. Some concepts are so divergent that simply lumping them together and saying they are the same thing, or trying to grab the “best ideas” from conflicting ideologies, simply does not work.

Likewise, companies and vendors cannot decide they are going to take the whole of their reporting infrastructure, call it a data-lake, and expect to be taken seriously. Words mean something. The fact that a new concept came out and you were not able to technically embrace it does not mean you achieve anything by taking a different (often older) technology and calling it that. And by the way, Dixon's blog came out in 2010; you have had seven years to embrace this concept.

So I have probably annoyed enough people now that I will take a break and take my new hybrid vehicle for a spin. Actually it’s really my old 1998 Camry with 17 car batteries in the trunk, but I call it my hybrid car. It makes me feel better, and gives me bragging rights.

 
 
 
