Using a DataOps approach to your big data project — modeled on similar methods used in DevOps teams — could unlock real value for your firm.
Big data should bring big changes in how you work as well as the tools you use if you want to take full advantage of emerging technologies and innovative architectures. DataOps – a style of work that extends the flexibility of DevOps to the world of large-scale data and data-intensive applications – can make a big difference. It’s more than just a buzzword. To make DataOps work, you have to know how to organize and manage a DataOps team. Let’s look at what DataOps is, why it’s worth your consideration and how to make the necessary changes in your organizational culture to put this style of work into action.
We’ll start with value. There’s huge potential value in large-scale data, but just collecting it and storing it isn’t going to get you much value from it. To get the benefits of big data, you have to connect the results of data-intensive applications to actions that address practical business goals – and you need to be able to do this “at the speed of business”. Modern approaches such as streaming, real-time or near real-time data processing, microservice architectures, and machine learning/AI optimization of certain decisions all offer new ways for data-intensive applications and the actions based on them to be a better fit for the way business happens.
These approaches can also give you the agility and flexibility to respond in a timely manner when conditions change. But it’s difficult to take full advantage of modern agile approaches and emerging technologies if you have a rigid, monolithic organizational culture. You will need to make a change.
This key, but sometimes overlooked, aspect of successful development and production deployment is beginning to get attention. A 2017 survey by NewVantage Partners of Fortune 1000 firms and industry leaders indicated that one of the biggest challenges they face is changing their business culture to deal with big data. Based on what I hear from people in the field, I think that DataOps is a big part of that change.
What exactly is meant by “DataOps”? This term can mean somewhat different things to different people, but fundamentally, roles such as data engineering and data science are coupled with operations and software development to form a DataOps team. This doesn’t usually require hiring additional people; often it’s just a matter of reorganizing people into teams with the right mix of skills. You might embed people with data-heavy skills into an existing DevOps team. When done properly, the result is not only a faster time-to-value and better use of people’s efforts, but also a change in the rhythm of human decisions throughout the lifecycle of an application. Work goes forward efficiently towards a focused goal, but the team also has the ability to pivot and make adjustments in response to new situations. But the thing that defines the DataOps team – indeed the thing that drives its successful execution – is having a shared data-focused goal connected to real business value.
The Power of Owning the Goal
A DataOps team brings together diversity in experience and skills, which is a good thing. But managing this diverse group may feel like herding cats unless you get genuine buy-in to a shared goal. A strength of a well-designed DataOps team, flexibility with focus, is possible in part because the team cuts across skill guilds and is not slowed down or stalled by a cumbersome series of department-based decisions at each step of the pipeline. The team has the skills needed to build and deploy the desired application, from planning to production. A key symptom that this goal-focused style hasn’t been achieved is when people with one skill set feel they are “doing a favor” for those with different skills, particularly if they think the favor gets in the way of getting their own job done.
In contrast, in a smooth-functioning DataOps team, members see that they are executing a variety of needed steps toward a shared goal. One way to help build consensus is to have members re-state the overall goal in their own words as well as to identify the role they will play. By articulating the goal, they are better able to internalize it, and it’s easier for a manager to determine if the whole team feels a sense of ownership. It may sound like a small thing, but this sense of owning the goal, of working towards a shared target instead of only caring about time put in on their own partial steps, makes a big difference in job satisfaction and that, in turn, helps to power better team performance.
Trying to reach consensus on a data-focused goal does not mean planning a project by committee. There is still a need for leadership and for higher level planning, although hopefully with the benefit of input based on diverse experience. But while each individual executes based on her or his skill set, they are focused on the common goal and make decisions and adjust work accordingly. This cross-functional approach combines the strength of individual members who have finely honed specialized skills with a well-coordinated overall effort toward building the particular service or project that is the team goal.
One of the characteristics that distinguish a DataOps team is that its members become “data aware”. This is natural, of course, for data scientists; being aware of the value of data and how to harvest it is their starting condition. Similarly, data engineers also have strong data skills, especially about how to build an appropriate data pipeline. And while the operations group need not be data scientists, one change from traditional ops teams is that they, too, develop some level of data awareness that makes them better able to appreciate and engage with the data-driven goal of the team. For example, an operations person may naturally want a dashboard to monitor when an application runs, how long it took and whether or not this is within an expected and acceptable range.
With additional data awareness, the ops person might monitor the volume of data in the output. Furthermore, working with a data scientist, the ops person might also be open to adding a model that looks at the actual content of the output of the application to make sure it “smells right”. An operations expert shouldn’t be expected to build this model without collaborating, but with a bit of data awareness, they should be able to see the value of such a model and thus be more willing to include it in operations monitoring.
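The three levels of monitoring described above (run time, output volume, output content) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `RunReport` fields, the threshold ranges and the function names are all hypothetical, and the simple z-score comparison stands in for whatever model a data scientist would actually contribute.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class RunReport:
    """Hypothetical summary of one run of a data application."""
    runtime_seconds: float
    output_rows: int
    output_values: list[float]  # a numeric field sampled from the output


def within(value: float, low: float, high: float) -> bool:
    """True if value lies in the closed range [low, high]."""
    return low <= value <= high


def smells_right(values: list[float], baseline_means: list[float],
                 max_z: float = 3.0) -> bool:
    """Crude content check (an assumption, not a real model): compare the
    mean of this run's output with the means seen in earlier runs."""
    if not values or len(baseline_means) < 2:
        return False  # not enough data to judge, so flag it
    spread = stdev(baseline_means)
    gap = abs(mean(values) - mean(baseline_means))
    if spread == 0:
        return gap == 0
    return gap / spread <= max_z


def check_run(report: RunReport, baseline_means: list[float],
              runtime_range: tuple[float, float] = (60.0, 600.0),
              rows_range: tuple[int, int] = (1_000, 50_000)) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    # Traditional ops view: did the run take an acceptable amount of time?
    if not within(report.runtime_seconds, *runtime_range):
        problems.append("runtime out of range")
    # Data-aware ops view: is the volume of output plausible?
    if not within(report.output_rows, *rows_range):
        problems.append("output volume out of range")
    # Joint ops and data science view: does the content look normal?
    if not smells_right(report.output_values, baseline_means):
        problems.append("output content looks unusual")
    return problems


# Example inputs (made-up numbers for illustration only)
baseline_means = [10.0, 10.2, 9.9, 10.1]
healthy = RunReport(120.0, 10_000, [9.8, 10.1, 10.3])
suspicious = RunReport(118.0, 10_200, [42.0, 40.5, 41.7])
```

Here `check_run(healthy, baseline_means)` reports no problems, while `check_run(suspicious, baseline_means)` flags only the content check even though run time and volume look normal. That is the kind of failure a purely timing-based dashboard would miss.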
Cross-Skill Communication is Key
Better focus on a data-driven goal is one of the motivators that make the DataOps style effective. Equally important is better cross-skill communication. Without efficient and timely communication, the end-to-end execution chain breaks down, slows down or misses the target. DataOps calls for transparent accountability – that is, adequate monitoring and reporting done in a manner that is understandable by those with different types of expertise. Once again, if you notice that team members don’t report on status, or report in a way that is inscrutable and mainly intended to show they’ve gone through the motions of their part of the project, you are seeing a symptom that DataOps isn’t really functioning properly. In contrast, in an effective DataOps team, members proactively decide at any given point what other team members need to know in order to successfully reach the goal. They communicate that information, including an honest update on problems they’ve encountered that might affect the work of others.
This type of cross-skill communication becomes easier when people have frequent contact or, even better, work in close proximity to others in the team with different skills. It’s also more likely to happen when people care about reaching the team goal. They continually think about what they can do to facilitate the process, and that includes communicating with others to avoid mismatched efforts and to accurately set expectations on deliverables.
The exact mix of skills on a DataOps team is not a rigidly defined formula: it will vary with the nature of the application being developed and with practical consideration of the human resources available. In addition, the balance of roles will evolve across the life cycle of the application. In planning and development, the mix tends to be richer in data-heavy roles than will be the case in production. Typically, the number of data scientists peaks at the start, as they identify potential value in data and design the application that will be able to take advantage of that value. Data engineers come into the mix as development starts, when their skills are needed to make data available and to help run applications in the appropriate environment. This need for data engineering continues into production, but the number of data scientists typically goes down as the operations experts get more heavily involved.
Technologies That Support a DataOps Style
The agility and flexibility afforded by DataOps rely on a degree of independence for the team, but they also depend on having access to various utilities such as computers, networking and other services as well as having a way for different services to communicate effectively. Providing these basic capabilities becomes the new mission for IT teams. Technologies such as the data platform, containers and an orchestration layer (most often Kubernetes) all play a role in providing a common system that makes DataOps possible.
This design is a bit of a contrast with older systems in which the IT team took on responsibility for managing and maintaining very complex applications. In the new style, IT is responsible for providing very uniform resources for compute and state storage. The key to success is to have both a container orchestrator and a data platform that allow a high degree of multi-tenancy, so that management burdens can be decreased and the agility of DataOps teams becomes practical to achieve.
With the right capabilities, infrastructure such as the data platform can provide a comprehensive view of data and access by multiple teams. Some of the logistics can be handled at the platform level, easing the job of the data engineer who might otherwise have to build all this in the application. Similarly, using a data platform makes a big difference to operations in terms of performance, ease of monitoring and reliability.
DataOps in Machine Learning
The special case of machine learning and AI applications emphasizes the usefulness of a DataOps organization. Increasingly, people are realizing that a large part of what matters for successful machine learning goes far beyond the algorithm – logistics such as well-managed models and input or training data are essential for machine learning to have a practical impact. It’s not surprising that data engineers, for instance, are beginning to be seen as being just as important to a machine learning team as the data scientists are. DataOps makes it more likely for machine learning applications – like other data-intensive applications – to be successful in production.