The Rise of Data Engineering

Transcript

The start point

So we're going to examine this from my own perspective. My start point was in enterprise systems. At that time the AS/400 was still a dominant machine out there. The beauty of the AS/400 was that it basically did everything. There was no real distinction between the database that it ran and the operating system.

It was an extremely robust machine, and you could run a small or medium-sized enterprise on one machine without really needing to know very much about database administration or about data modeling.

It was not until moving to Unix-based machines that I really started to hear about database administrators. But even in those days we didn't really call the database administrator a data engineer. What they were doing was ensuring that the database was up and running, that it had enough space, and so on. They might be tuning the database to make sure that it worked slightly faster, and very occasionally making sure that it would record something new about the world. When I think of the skill set of the database administrator, I don't really think of someone who's out there modeling what they're seeing in the real world, although at that time we certainly did have people that were data modelers.

Data Modeling

In the context of a large enterprise system, the data modeler might have been someone more akin to a draftsman: someone who records the entities that are in the system and how they're related, and who could certainly take advantage of those relationships that had been recorded in the database itself.

They were not as key to the creation of the applications as software engineers, but the scale of understanding that was needed by them was quite vast.

To give you some sense of context, in Oracle's business applications, of which there may have been more than 150 different applications covering all of the business operations of a large enterprise, there might have been over 22,000 tables, at least the last time that I looked, which is now some time ago.

So that gives you some sense of the number of things that have been kept track of within that single database.

Shifts in Business Models

There have been some shifts in business models. Even within that one database, we certainly digitized many more processes. Transactions moved more towards self-service, as we've created all of the applications for the human resources world. And even beyond the bounds of a single enterprise, we put applications on the extranet. So rather than someone calling in to chase up where a payment might be, you've now got a supplier portal. There may be a sale happening on a business-to-business basis through e-commerce. So those are shifts in business models that may or may not be within one larger system.

That was at the time when we were still building these large enterprise systems.

Shifts in Technology

There are also some shifts in technology. We're now connecting many more end points to the system, as we've got sensors that are on machinery there in the factory. Maybe they're not just sensors, but are actually doing some computing themselves in edge computing, creating much more data. But that data has now moved into data stores, and those data stores may not be the kind of relational data stores with transaction consistency. Now some are moving into event-based or NoSQL databases. And we're busy enormously increasing the amount of data that we have.

Shifts in What Is Valuable

All of this has shifted what we consider to be valuable. Being able to make better predictions has become incredibly valuable in and of itself.

So it shifted some of the value from the things that we were creating to the information we've created about them, and that's caused the rise of the data scientist.

Shifts in Thinking

Back in the day when we were building out those large enterprise systems, there was a mantra, or set of beliefs, that storing data in one place, such that you have a picture of the world at one point in time, meant you could make better decisions.

And, well, that thinking has kind of changed a little bit.

With the rise of microservices as a way of thinking, that span of control has shrunk dramatically, so there are much smaller services that can be scaled independently. By having control of much smaller bits of data, they can be confident that they've got control over everything that they need.

And it's not just the data that the enterprise is creating. We have opened ourselves up to the possibility of making better predictions from data the enterprise itself does not necessarily create: social media data or government data might also be going into the mix.

New Problems

We've got a data governance problem, in that we've got lots of applications that need to have confidence in the data that they need, and to know that it's right when they get at it. That means making sure that everyone knows what they own and no-one is trying to update data that they do not own.
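The ownership rule can be sketched in a few lines. This is only an illustration, not a description of any particular system: the `OWNERS` registry, the service names and the `update` helper are all hypothetical, and a real enterprise would hold this in a data catalogue or enforce it through access controls.

```python
# Hypothetical ownership registry: each data entity has exactly one owning service.
OWNERS = {
    "supplier": "procurement-service",
    "employee": "hr-service",
}


def can_update(service: str, entity: str) -> bool:
    """Return True only if `service` is the registered owner of `entity`."""
    return OWNERS.get(entity) == service


def update(service: str, entity: str, record: dict, store: dict) -> None:
    """Write `record` into `store`, refusing writes from services that do not own the entity."""
    if not can_update(service, entity):
        raise PermissionError(f"{service} does not own {entity}")
    store.setdefault(entity, {})[record["id"]] = record
```

The point of the sketch is that the question "who owns this?" has one answer, held in one place, so that a write from anyone else fails loudly rather than silently.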

It also means establishing good naming conventions, so that we know what is what and what belongs to whom. There are certainly cases where things might not be sufficiently different to warrant being treated as truly separate things, so there's a lot of art involved. There's a lot of policy, and lots of local optima that you need to step above to be able to make enterprise-wide decisions. The data governance role needs to rise high in an organization.

And we've also got this data acquisition problem. If there is some government data or social media data that we would like to get access to so that we can make good predictions, we need to work out how to get at it, and make sure that we can bring it in and corral it correctly.

Of course, that opens up the data wrangling problem that we've long known about: many different systems with potentially overlapping data in different formats, which we've got to make sensible and rational to be able to make predictions from.
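As a rough sketch of what that wrangling looks like, suppose two systems hold overlapping supplier payment data in different shapes. Everything here is assumed for illustration: the field names, the two source formats and the "latest payment wins" merge rule are hypothetical, and real sources are rarely this tidy.

```python
from datetime import date, datetime


def from_erp(row: dict) -> dict:
    """Normalize a row from a hypothetical ERP export (upper-case keys, ISO dates)."""
    return {
        "supplier_id": row["SUPPLIER_ID"].strip().upper(),
        "name": row["NAME"].strip(),
        "paid_on": date.fromisoformat(row["PAID_ON"]),
    }


def from_portal(row: dict) -> dict:
    """Normalize a row from a hypothetical supplier portal (camelCase keys, day/month/year dates)."""
    return {
        "supplier_id": row["supplierId"].strip().upper(),
        "name": row["displayName"].strip(),
        "paid_on": datetime.strptime(row["paidOn"], "%d/%m/%Y").date(),
    }


def merge(records: list) -> dict:
    """Keep one record per supplier_id, preferring the most recent paid_on."""
    merged = {}
    for rec in records:
        key = rec["supplier_id"]
        if key not in merged or rec["paid_on"] > merged[key]["paid_on"]:
            merged[key] = rec
    return merged
```

Each source gets its own small normalizer into one common shape, and only then are the overlapping records reconciled. Which record should win when two systems disagree is a governance decision rather than a coding one; the rule used here is just a placeholder.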

Data Governance

A single database at least forces you to have the discussion over your data governance. The immediate need for that discussion is delayed if you move to independent microservices, but the debt of understanding increases with the complexity. Having a chief data officer allows an enterprise to balance the needs of many applications for dominion over the same data, and the need for time to market with the need for good governance.

As the pendulum swings back, we find the need to understand a sea of data whose relationships are not enforced, but must be divined before we can create a picture of the world from which inferences and actions can be drawn.
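One simple way to start divining relationships that no database enforces is to look for columns whose values all appear in a unique column of another table. The sketch below assumes small in-memory tables held as plain dictionaries, and the function name `candidate_references` is made up for this example. It is a heuristic only: it suggests candidates for a person to review, and it will produce false positives on coincidental overlaps.

```python
def candidate_references(tables: dict) -> list:
    """Suggest possible foreign-key style relationships between tables.

    `tables` maps table name -> {column name -> list of values}.
    Returns (child_table, child_column, parent_table, parent_column) tuples
    where every non-null child value appears in a parent column whose
    values are all unique.
    """
    results = []
    for p_table, p_cols in tables.items():
        for p_col, p_vals in p_cols.items():
            # A parent column must look like a key: non-empty and unique.
            if not p_vals or len(set(p_vals)) != len(p_vals):
                continue
            parent = set(p_vals)
            for c_table, c_cols in tables.items():
                if c_table == p_table:
                    continue
                for c_col, c_vals in c_cols.items():
                    child = {v for v in c_vals if v is not None}
                    if child and child <= parent:
                        results.append((c_table, c_col, p_table, p_col))
    return results
```

In a single enforced schema this information would simply be read from the declared constraints; across independent stores it has to be inferred like this and then confirmed by someone who understands the data.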

A New Process Emerges

This has meant that a new process has emerged. It starts with the creation of those applications, whether they be web applications or IoT applications. That's what computer scientists have always done.

Then we have the data engineering problem, which is grabbing that data from wherever it might be, whether in each of those systems, in government data or in social media data, and making sure that we can wrangle it and move it into a form where it can be used. It might be used by systems that are making predictions, or by human beings who need to look at that data in a rational way to be able to make their own decisions.

A Split of Responsibilities

In the days of the single system running the enterprise, it was very important to have a CIO who was responsible for running all of those systems, but many of those systems are now outsourced to cloud providers.

So keeping up with emerging technologies has become much more important, and we've got the rise of the CTO relative to the CIO. We've got the chief data officer there to know what we've got and where everything is, and to be able to make judgments in that overall data governance process. The chief information security officer is going to make sure that we have data appropriately protected in terms of confidentiality, availability and integrity. For the software engineering problem itself, we've still got the creation of those applications, whether they be web-based applications, IoT-based applications or edge computing applications. But we've also got the data engineer, who has to make sure that the data those applications create can be made of service to data scientists, and we obviously have the data scientists. So that is, I think, how we've got the rise of data engineering.

Helping with your data engineering and data governance needs

If you recognize the issues and problems expressed in this piece and would like to have a conversation on how SSTC can bring its expertise in managing the data models that support the world's largest organizations to bear on your problems, please contact us at info@softwarestrategyconsulting.co.uk or click the contact us tab above.
