Data science has become a real buzzword over the last few years. Everybody wants to be a data scientist, resulting in phenomena such as data science boot camps. Yet ten years ago there was no "data science," even though many of the tasks data scientists carry out today were certainly carried out ten years ago, albeit with smaller datasets.
Perhaps, in another ten years, data science will be called something else, although the general principles probably won't change much: using mathematics and statistical models to maximize the value of data.
Data engineering, however, is a more recent concept that people are less likely to have heard of or, if they do know of it, may assume is just a branch of data science. Yet data engineers are every bit as important as data scientists in a successful analytics strategy. Companies should consider the statistic that for every data scientist they employ, they will need at least two and possibly as many as five data engineers, meaning it is essential to recruit not just the right data scientists, but the right engineers.
What is data engineering?
Think of data engineers as being the foundations of a pyramid. If data scientists need to carry out work to help a company to meet its KPIs or achieve its goals and ambitions, data engineers build the structure that the data scientists require to do that work.
If the goal is to mine the company's data to feed predictive analytics, the engineers build the structures needed so the data scientists can complete this task. Data scientists do not want to have to build a data lake or data warehouse, even using available products such as Amazon's Athena and Redshift, platforms such as Oracle Exadata, or open-source data repositories such as Hadoop and Hive; this is a job for data engineers (see our article on choosing a database). The scientists want the database built and ready for use so they can build the models needed to help drive a business forward.
A data engineer is typically a software engineer who ideally already has some experience of working with distributed systems, though a competent candidate can always gain this experience on the job. Once they have built the infrastructure for the data scientists to use, the engineers are also responsible for maintaining it, along with a data operations team. Typically, the initial structure, even if it met requirements at first, has to be modified as the data science team faces unexpected problems or as the business team makes new requests.
One of the main areas where data engineers work is data quality: preventing formatting problems and rogue data so that the data is reliable enough for everything downstream to function smoothly. Scalability and security are other common tasks for the data engineering team. Changes in the company, such as a growth spurt in staff or customers, may drive a need for up-scaling across all departments, including analytics. The structures the data engineers created should have been built with scalability in mind, so part of this task is not just scaling, but doing so in a way that does not disrupt the rest of the business.
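As an illustration of the data quality work described above, here is a minimal Python sketch of a check that separates clean records from rogue ones before they reach the data scientists. The field names (customer_id, signup_date, spend) and the rules applied to them are hypothetical, chosen only for the example; a real pipeline would take its rules from the business.

```python
from datetime import datetime


def validate_record(record):
    """Return a list of problems found in one record (empty list means clean).

    The fields and rules here are hypothetical examples.
    """
    problems = []
    if not str(record.get("customer_id", "")).strip():
        problems.append("missing customer_id")
    try:
        # Formatting check: dates must look like 2024-01-31.
        datetime.strptime(str(record.get("signup_date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append("bad signup_date format")
    try:
        # Rogue data check: spend must be a non-negative number.
        if float(record.get("spend")) < 0:
            problems.append("negative spend")
    except (TypeError, ValueError):
        problems.append("spend is not a number")
    return problems


def split_records(records):
    """Split records into (clean, rejected); rejected items carry their problems."""
    clean, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected
```

Keeping the rejected records together with the reasons for rejection, rather than silently dropping them, makes it easier to trace a formatting problem back to the system that produced it.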
Security involves first identifying which datasets are subject to the highest levels of scrutiny, such as those containing personally identifiable information (PII), and then ensuring the structures are in place so that this data gets extra levels of security and that PII is properly pseudonymized. While the pseudonymization itself is a data science task, the data engineering team needs to create the infrastructure the company uses to house the pseudonymized data, along with all relevant permissions.
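To make the idea of pseudonymization concrete, here is a minimal Python sketch of one common approach: replacing an identifier with a keyed hash (HMAC-SHA256), so the same person always maps to the same token but the original value cannot be read from it. This is an illustrative assumption, not a method the article prescribes; the function name, the email field and the key shown are all hypothetical, and in practice the key would be held in a secrets store with tightly restricted permissions.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a stable token derived from a secret key.

    value is the raw identifier (str); secret_key is bytes.
    Lowercasing and stripping the value is an assumption made so that
    trivially different spellings map to the same token.
    """
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def pseudonymize_records(records, field, secret_key):
    """Return copies of the records with one PII field replaced by its token."""
    return [
        {**record, field: pseudonymize(record[field], secret_key)}
        for record in records
    ]


# Hypothetical usage: the key below is a placeholder, not a real secret.
example_key = b"replace-with-a-key-from-a-secrets-store"
rows = [{"email": "Jane@Example.com", "spend": 42}]
safe_rows = pseudonymize_records(rows, "email", example_key)
```

Because the token is deterministic for a given key, datasets pseudonymized with the same key can still be joined on the token, which is why access to the key needs to be controlled as carefully as the raw data.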
The fundamental task, however, is first creating and then maintaining the data pipeline. This is no easy undertaking, given that a typical modern data pipeline processing large amounts of data every day may require dozens of different technologies to function properly. A well-designed data pipeline allows a company to do what it wants with the data it has, such as maintaining a complex, rapidly changing website, and gives people within the organization access to the analytics, reporting and other essential feeds they require. This is why data engineers are so important. Their work provides the foundations on which everything else, analytics-wise, is built.
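For readers who have not seen a pipeline before, the following Python sketch shows the basic shape of one in miniature: extract raw data, transform it into a clean form, and load it somewhere it can be queried. It uses only the standard library, with an inline CSV string and an in-memory SQLite database standing in for real sources and warehouses. The table, the columns and the sample data are hypothetical, and a production pipeline would involve far more technologies than this.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from files, APIs or queues.
RAW_CSV = """order_id,country,amount
1,gb,10.50
2,US,4.00
3,gb,not-a-number
"""


def extract(text):
    """Read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize country codes and skip rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # rogue value: leave it out of the clean dataset
        cleaned.append((int(row["order_id"]), row["country"].upper(), amount))
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into a table the analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract, transform and load, then return revenue per country."""
    load(transform(extract(text)), conn)
    query = (
        "SELECT country, SUM(amount) FROM orders "
        "GROUP BY country ORDER BY country"
    )
    return conn.execute(query).fetchall()
```

Calling run_pipeline(RAW_CSV, sqlite3.connect(":memory:")) returns the total amount per country from the two valid rows. Keeping each stage as a separate function is a deliberate choice in this sketch: it lets a stage be replaced or scaled without touching the others.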