Tackling common data science challenges
By Emma Bellamy on February 4, 2022 - 5 minute read

When AI is used to solve real-world problems and make decisions for businesses, it can be very powerful.
Yet, as data scientists know, there’s a lot of work that goes into getting it there, and the problems facing businesses aren’t always easily solved. The Data Community has come up against some big challenges this year, and in this blog we discuss these common problems and how best to tackle them.
1. Poor quality data
We’ve all heard the saying “garbage in, garbage out.” In other words, flawed or nonsensical input data produces nonsensical output.
The data received by data scientists can be incomplete, faulty, inaccurate or scarce. On top of this, data columns or labels can be inconsistent or unstructured. It’s just plain MESSY! All of this makes it hard to build high-performing, well-optimized machine learning models.
Dr. Andrew Ng recently shared a fundamental insight: “using a data-centric approach to machine learning, rather than a model-centric approach, will consistently improve model performance.” This means improving the data in a systematic way, from handling missing values to making the data more consistent and ensuring that there is enough historical data for the model. Investing effort in improving the quality of existing data can be as effective as collecting three times as much data.
Data quality has to be monitored and improved at every step, so having alerts for when data is missing or outliers are identified is a key step when productionizing the model.
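As a concrete illustration (a minimal sketch, not Peak’s actual tooling), a simple pandas-based check could flag missing values and IQR-based outliers per column, which is exactly the kind of signal you’d wire up to a production alert:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, numeric_cols: list) -> dict:
    """Count missing values and simple IQR-based outliers per column."""
    report = {}
    for col in numeric_cols:
        series = df[col]
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        # Values beyond 1.5 * IQR from the quartiles are flagged as outliers
        outliers = series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]
        report[col] = {
            "missing": int(series.isna().sum()),
            "outliers": int(outliers.count()),
        }
    return report

# Toy data: one missing value and one suspiciously large reading
df = pd.DataFrame({"units_sold": [10, 12, 11, None, 400, 9]})
print(quality_report(df, ["units_sold"]))
```

In production the thresholds would be tuned per dataset, and the report fed into whatever alerting system the pipeline already uses.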
For the Peak data science ops team, who work closely with customers, ensuring that an organization has AI-ready data is a key step during initial conversations with prospective customers. One of the first things Peak’s data scientists do at this stage is perform quality assurance and analyze the accuracy of the data. The same principle applies to any data science project, and is equally true for in-house data professionals: it’s important to have confidence in the data before starting to build a model.
To learn more about how Peak ingests data, and how the Peak Decision Intelligence platform can help data scientists to analyze the data they will be working with, be sure to read my colleague Vanessa’s blog: Kick-starting a data science project with Peak.
2. Difficult stakeholders
Getting new stakeholders excited about AI is often a hurdle for a data scientist or project owner. Some stakeholders are skeptical of the benefits of the tech, or mistrustful of the machine learning algorithms, and would like all parts of a solution to be transparent and explainable.
Other stakeholders may never have worked on a data science project before, and expect it to be managed like a typical IT project. The main differences are that data science projects can be pretty unpredictable, and often involve more exploration and iteration than IT projects.
This iterative approach to building data science models helps teams deliver value to their customers quickly, but it relies on stakeholders being prepared to work with the data scientists and give feedback on the models. These projects often start with a narrow scope: a minimum viable product (MVP) built on a small subset of a business’s product portfolio. Once the MVP has been tested and stakeholders have given feedback on the decisions it makes, the data scientist can quickly iterate and expand the scope of the solution.
An all too common problem is misalignment between the data scientist and the stakeholders on the project goals or outcomes. This tends to happen when the project proposal is too vague: the data scientist builds the MVP, and only then does it emerge that what the data scientist expects to deliver and what the stakeholder expects to receive are totally different.
Alternatively, they may agree on the end goal, but the build gets held up by differing priorities: stakeholders requesting many changes, adding extra requirements, or expecting lots of ad-hoc analysis. A good way to avoid this scenario is to agree a tight project scope with everyone before the project begins.
Finally, one of the biggest challenges faced by data scientists is getting buy-in from the many different stakeholders within an organization. The original project sponsors may be on board, but other departments can still resist, or struggle to see, how data science can solve the problems they have been experiencing. Data scientists often have to use their communication skills (and often some super cool data visualizations!) to influence people who don’t believe in data-driven decision making or who don’t trust the decisions delivered by the solution. When all stakeholders are aligned, the project can really come alive and turn into a reality.
In our blog, ‘Top tips for data scientists dealing with difficult stakeholders’, Peak’s data scientists share the best nuggets they’ve learned through experience when it comes to managing data projects.
The main differences are that data science projects can be pretty unpredictable, and often involve more exploration and iteration than IT projects.
Emma Bellamy
Data Scientist at Peak
3. Ensuring that model outputs are reliable, accurate and interpretable
One of the main challenges that data scientists across the industry face is the need to deliver outputs that are reliable, accurate and interpretable. If stakeholders begin to lose trust in the solution, the project could be derailed.
After a model has been trained and deployed, the next step is to develop an understanding of the predictive outputs and communicate them back to the relevant project stakeholders. Data scientists must develop tools that allow these results to be interpreted by subject matter experts, who understand the business needs of the project but not necessarily the technical details. One way this can be done is by showing particular example outputs from a model.
For example, if the model is predicting customer churn then the activities of a few customers can be shown along with their predicted churn score, highlighting the factors that influence the score the most. This communication is crucial, because if it’s not possible to convince others of the business value of the project, then it’s almost certainly destined for failure.
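To sketch what that could look like (purely illustrative: the feature names and weights below are invented, and a real solution would typically use a proper explainability technique such as SHAP), one could rank a customer’s features by their contribution to a linear churn score:

```python
# Hypothetical weights from an assumed linear churn model
weights = {
    "days_since_last_order": 0.04,   # longer gaps push churn risk up
    "support_tickets": 0.3,          # more tickets push churn risk up
    "orders_per_month": -0.5,        # frequent orders push churn risk down
}

def top_factors(customer: dict, n: int = 2) -> list:
    """Return the n features contributing most (by magnitude) to the score."""
    contributions = {f: weights[f] * v for f, v in customer.items()}
    return sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)[:n]

customer = {"days_since_last_order": 60, "support_tickets": 1, "orders_per_month": 0.5}
print(top_factors(customer))  # → ['days_since_last_order', 'support_tickets']
```

Surfacing a short, named list of drivers like this alongside each churn score is often enough for a subject matter expert to sanity-check whether the model’s reasoning matches their intuition.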
Once the stakeholders are happy that they understand the outputs, the focus can then go on deploying the solution to create reliable and accurate outputs, using the most appropriate computing infrastructure.
Model outputs are often displayed in up-to-date dashboards using hosted web apps. These dashboards provide a great way to display the model outputs in an interactive way, providing useful analysis to end users or the project stakeholders.
4. Building trust through partnerships
In the majority of projects, the main stakeholders that data scientists talk to don’t have technical expertise: they haven’t studied for a Master’s in data science, and often this is their first encounter with Decision Intelligence. Explaining both the solution and the outputs to a non-technical audience can be hard. In fact, one of the reasons the majority of AI projects ultimately fail is a poor relationship between stakeholders and the data scientist. The stakeholders’ vision, guidance and input ultimately play a big part in a project’s success.
At the start of a data science project, it’s important that the data science team and stakeholders get to know each other and reaffirm the shared vision of the project – this is where the project vision starts to become a reality. The next most important stage is to explore the stakeholder’s business in detail: exactly how it works, and how it relates to the proposed solution. When data scientists ask probing questions to enrich their contextual understanding, they come to understand the business, how it operates, and the requirements of the end users and stakeholders. This tightens up the definition of the solution and reduces the risk of misalignment on project goals.
For the company’s main stakeholders, the data science solution’s outputs sometimes look very different from how a human would make the decision. Often the data scientist will need to convince the stakeholders that the model is efficient, accurate and adds value to their business.
The project is successful when the end users rely on the decisions output by the model, either because these decisions are made faster or, in many cases, because they are more cost-effective. User adoption is therefore the main goal of a data science project.
If there is a lot of skepticism regarding the model, one way to increase trust is to start by replicating what the business currently does, by adding in lots of business logic and additional constraints into the solution. As the solution matures and the stakeholders trust the model, these constraints and some of the business logic can be removed.
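A minimal sketch of that idea, under an assumed business rule (invented for illustration) that prices shouldn’t move more than 5% from today’s price:

```python
def constrained_price(model_price: float, current_price: float,
                      max_change: float = 0.05) -> float:
    """Clamp the model's recommendation to within ±5% of the current price.

    This keeps early outputs close to current business practice; the band
    can be widened (or removed) as stakeholder trust in the model grows.
    """
    lower = current_price * (1 - max_change)
    upper = current_price * (1 + max_change)
    return min(max(model_price, lower), upper)

# The model suggests a big price cut; the constraint limits the move
print(constrained_price(8.0, 10.0))   # → 9.5
# A recommendation already inside the band passes through unchanged
print(constrained_price(10.2, 10.0))  # → 10.2
```

The same pattern applies to any output type: wrap the raw model decision in the rules the business already follows, then relax them iteratively.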
A solution that no one is using is useless. This risk can be mitigated by constant guidance and input from the main stakeholders. Data scientists can build trust by starting simple, both in terms of the machine learning model and by concentrating efforts on a subset of products. As data scientists and stakeholders assess this first model together, they will understand its shortcomings and develop solutions to overcome them. They can also work together to understand what is missing from the model and codify any business logic or practical considerations that are required. The purely optimal solution produced by the machine learning model isn’t always the best one in practice.
By encouraging open communication and receiving detailed, informative feedback, machine learning model parameters can be tuned and the solution can be iteratively improved to meet the requirements of all stakeholders.
Join our inclusive Data Community
The Peak community exists to support data scientists and analysts who want to make a difference and drive change within their organizations.