How to get the most out of your data science initiatives?
“Every business is a software business,” proclaimed Watts S. Humphrey, the “Father of Software Quality,” more than 20 years ago. A cursory look at organizations today, whether big or small, is enough to confirm his prediction. In the 2020s we could even go one step further and say that “Every business is a data business.”
Due to the proliferation of both physical and virtual sensors and cheap, commoditized storage, companies are sitting on valuable data that they have already collected (or that they could easily collect). While some early adopters have used data science to make better business decisions, most companies are only now starting to realize its potential. This is no coincidence: while being data driven can bring enormous benefits to a business, obstacles abound.
We have worked with several companies as consultants for their data science initiatives and during these projects we have found that there are some key steps you can take to increase your return on investment.
Have solid data infrastructure in place
While data is much easier to collect nowadays, one still needs to exercise this ability to collect it. And not all data is created equal: “quality” is an elusive characteristic of good data. It is safe to say that any data science initiative can at most be as good as the data it relies on.
By high quality data, we mean a few things:
- Timely. Old data is of limited use for business decisions. An important question to settle is “how old is still okay.” It is much more expensive to collect data with one second of latency than with one hour of latency. Many organizations will still extract most of the benefit from data that is one day or even one week stale. It is important, however, to be deliberate about where the line between current and historical data lies.
- Relevant. This may sound redundant, but data needs to answer questions that are of interest to the organization. Figuring out what these questions are is more “data art” than “data science.”
- Systematic. Ideally, we want data that is “complete” and captures an entire population of interest, whether consumers, employees, products, etc. Sometimes this is not entirely possible, for reasons having to do with the cost of data collection (or with compliance mandates, in the case of certain data categories such as Personally Identifiable Information or Personal Health Information). In such cases statistical samples are very powerful still, but a systematic sampling frame is required to satisfy the assumptions upon which sampling theory is based.
- Consistent. We ideally want uniform definitions of what each unit of analysis means. To take one example, a user account is meant to represent one person and a person is usually meant to have a single user account on any online platform. This assumption is often violated however — single user accounts used by an entire family or small business exist, as do situations where one person creates multiple accounts. Enforcing consistency over how one defines a user is quite a tricky task for big Internet platforms that live and die by their number of users! The problem of “it’s complicated” is ubiquitous wherever humans collect data!
- Discoverable. Information in organizations often tends to be siloed, with particular datasets belonging to certain departments. The situation is sometimes even more complicated with datasets in different formats or in multiple legacy data warehouses which can be difficult to interrogate. Discoverability does not mean simply building a search engine (although that helps), but also making people aware of data assets’ very existence. This task can be surprisingly hard to do in the cacophony of internal communication tools that characterizes the early 21st century corporate workplace. Good information management practices ultimately mean faster data science. The more time your data science teams spend on finding data, accessing it and cleaning it, the less time they spend on other valuable tasks such as providing better insights for your business decisions.
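To make two of the criteria above more concrete, here is a minimal sketch of how the timeliness and consistency checks might look in Python. Everything in it is an assumption made for illustration: the record fields (account_id, email, collected_at), the one-day freshness threshold and the idea of using a normalized email address as a proxy for “one person” are all hypothetical, and a real organization would choose its own definitions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical freshness threshold: the answer to "how old is still okay"
# for this particular dataset. Each organization has to pick its own.
MAX_AGE = timedelta(days=1)


def stale_records(records, now, max_age=MAX_AGE):
    """Return the records whose 'collected_at' timestamp is older than max_age."""
    return [r for r in records if now - r["collected_at"] > max_age]


def shared_identity_accounts(records):
    """Return emails linked to more than one account id.

    Assumption: a normalized email stands in for "one person". This is only
    a rough proxy, since families may share an address and one person may
    use several.
    """
    accounts_by_email = defaultdict(set)
    for r in records:
        normalized = r["email"].strip().lower()
        accounts_by_email[normalized].add(r["account_id"])
    return {
        email: sorted(ids)
        for email, ids in accounts_by_email.items()
        if len(ids) > 1
    }


# Made-up sample data, with a fixed "now" so the result is reproducible.
now = datetime(2021, 6, 2, 12, 0, tzinfo=timezone.utc)
records = [
    {"account_id": "a1", "email": "pat@example.com",
     "collected_at": now - timedelta(hours=2)},
    {"account_id": "a2", "email": "Pat@example.com ",
     "collected_at": now - timedelta(days=3)},
    {"account_id": "a3", "email": "sam@example.com",
     "collected_at": now - timedelta(hours=20)},
]

stale = stale_records(records, now)
duplicates = shared_identity_accounts(records)

print([r["account_id"] for r in stale])
print(duplicates)
```

Checks like these do not decide where the line between current and historical data should lie, or what a “user” is. They only make the chosen definitions explicit, so that they can be reviewed and applied uniformly.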
Work on creating a culture focused on data
Implementing data science in your company should be treated like any other change management initiative. It is important to make a compelling case that data can provide a competitive advantage and that decisions can be made or challenged based on data. We believe that companies need proactive internal advocates focused on the benefits (and not only the risks!) of widespread data sharing. If you want to read more about this, you can check out our article series on this issue.
Another great strategy to encourage employees to adopt a more data-centric approach is to provide them with training programs and opportunities to learn data science. As Redman and Davenport recently noted in a Harvard Business Review article, such initiatives can create “citizen data scientists.” There is a continuum that begins with Excel and culminates with sophisticated statistical models, enormous machine learning pipelines or complex A/B testing tools. Not all employees need to be working at the frontier of what we think of as “data work,” but it benefits everyone to be a bit conversant in basic statistics and the elements of computer programming.
We specifically propose a strategy that eschews drawing a strict line between “data scientists” or “engineers” and “the others” in the organization. We believe this is good policy. For one, it recognizes the value of the many and diverse talents that need to come together to make even the most deeply technical organization work. Citizen data science is more than a means of spreading data science within the organization, however: through it, some employees will develop their skills to a point where they can constitute a pool of internal specialists on which the organization can then rely.
Integrate the data science team in your company
One of the biggest mistakes one can make is not properly integrating data science teams within the organization. As we already hinted, simply creating a data science department and hiring a team of professionals to fill it will not ensure success.
Your data science team needs to be able to thoroughly understand the business and work together with all the departments that might benefit from its insights. You need to be proactive in your approach. Otherwise, you run the risk of having a data science department that provides reports or analyses only when it receives requests from other departments. Data scientists are ultimately just that… scientists. They are creative types who first and foremost need to understand the business, work together with other departments, come up with their own hypotheses and test them. Seen in this light, data science is a new form of R&D, and good R&D needs deep integration with the business to succeed in developing new products, business lines or solutions.
It is desirable to answer important questions about the function of a data science team before creating it. You want to be able to clearly communicate what the role of the team is and how it will go about doing its job. Once the team is established, you should also ensure that there is ample communication between it and the other teams it will be working with.
Last but not least, we want to emphasize the importance of removing obstacles between data practitioners and actual data. Missing infrastructure is sadly not the only limitation: when data is valuable it is not surprising that “turf wars” will erupt with respect to who has access to it. The good news is that a lot of data territoriality is likely irrational. Squirreling away data not only goes against company goals, but is oftentimes a suboptimal means of advancing one’s own career — data science, like any science, is incremental and there is glory to be had in creating great datasets as well as in training great models.
Ultimately, everybody wins from sharing data to the greatest extent possible. This is prosocial behavior that should be encouraged and rewarded, for instance by including it in employee reviews.
Skill development and autonomy
We already emphasized the need for continuous skill development. We want to emphasize the need to focus on communication skills in particular, as this area is often overlooked in favor of easier-to-quantify technical skills.
Managing expectations for instance is a key data science skill. Given the open-ended nature of data science, mistaken assumptions can often rush in to fill the void created by insufficient direction regarding the role, promise and limitation of data science initiatives. This reality makes expectation setting particularly important in the early stages of data science initiatives.
Communicating clearly is another skill that is learned and should be developed. Plain language is desirable, as is the ability to adapt complex information to audience and context. It is admittedly hard to prescribe how exactly one ought to improve communication skills. A focus on the agency of data scientists (being accountable for things they control and understand) is perhaps the most effective high-level principle to adopt. You, the hypothetical CEO, can tell your data scientists what you need them to do but not how to do it. They must be allowed to refine, prioritize and plan any analysis tasks that come their way. This freedom is matched by accountability for results, which ensures that data scientists can develop ownership in the success of the company.
Given that the judicious use of data can generate a competitive advantage for many businesses, it is likely that data will be among your top priorities. As we said in the beginning, making a company more data oriented is no easy task by any measure. We hope these thoughts help advance your thinking on the best way for your organization to benefit from the data revolution.
Looking for an efficient open-source solution to manage data in your project? Aorist is a tool for managing data for your ML project. It produces readable, intuitive code that you can inspect, edit, and run yourself. You can then focus on the hard parts while automating the repetitive parts. To get this, you just need a description of how your data is formatted and organized, and where it needs to go.
Originally published at https://scie.nz.