How to plan a commercial data science project
By Sorcha Gilroy, Data Science Team Lead, Peak, on November 2, 2021
At Peak, we’ve delivered many data science projects over the past few years.
Here, we’ve gathered together some of our best practices for successful commercial data science projects that we hope will be useful to other data scientists!
Engage with end users
It might sound simple, but speak to the people who will actually use the output of your data science project before you start. Put the time in to really understand their business processes and how you’re going to use data science to help them.
For example, say you’re building something to help merchandisers decide how to allocate stock across different stores. It’s well worth sitting down with the merchandising team to see how they make those decisions now and to understand their pain points. This allows you to…
- Make sure you’re building something useful. Many data science projects fail because the data science solution doesn’t actually fit in with how the business process works. Talking to end users early also lets you check that what is feasible matches what the customer expects.
- Identify quick wins. Sometimes you can automate a simple task before building the full data science solution, which gives the end user value earlier than expected.
- Balance ease with impact. You can create a list of possible solutions and get the end users to rate what’s most useful to them, and also rate the difficulty of each solution (from a change management and implementation point of view, as well as from a data science perspective). Then, you can start with the highest-impact, lowest-effort items.
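The rating exercise in the last point can be kept in a very small script. The sketch below is one possible way to do it, not a fixed method: the `Idea` class, the 1 to 5 scales, the impact-to-effort ratio and the example ideas are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int  # end-user rating, 1 (low) to 5 (high)
    effort: int  # combined delivery difficulty, 1 (easy) to 5 (hard)


def prioritise(ideas: list[Idea]) -> list[Idea]:
    """Order ideas so high-impact, low-effort items come first.

    Uses impact divided by effort as the score, with lower effort
    breaking ties. This scoring rule is an assumption, not a standard.
    """
    return sorted(ideas, key=lambda i: (-(i.impact / i.effort), i.effort))


# Hypothetical ratings gathered in a workshop with the merchandising team.
ideas = [
    Idea("Full stock allocation optimiser", impact=5, effort=5),
    Idea("Automated weekly stock report", impact=3, effort=1),
    Idea("Store-level demand forecast", impact=4, effort=3),
]

for idea in prioritise(ideas):
    print(f"{idea.name}: impact={idea.impact}, effort={idea.effort}")
```

With these example ratings, the automated report comes out first, which is the kind of quick win described above.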
Understand systems
Any commercial data science project will involve some form of data input and data output. For example, a recommender system on a website could take in transactional data and web behavior data and output a set of recommendations for each user, which are then surfaced on a website.
Data inputs
Once you have scoped out your data science project, write down a detailed list of your data requirements. From there, you can find out which systems hold the data you need. If you need datasets from different systems, it’s worth finding out early on how those datasets can be tied together.
For example, you may need a particular customer identifier that’s common to both systems. Often, at this stage, you’ll find that you’ll need help from the teams who manage these data sources – so make sure you know who they are before you begin!
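A quick way to test whether two datasets can be tied together is to measure how many records share the identifier. The following is a minimal sketch using plain Python lists of dictionaries. The `join_coverage` helper, the `customer_id` field and the sample rows are hypothetical, and a real project would more likely run the same check in SQL or a dataframe library.

```python
def join_coverage(left, right, key):
    """Report how many rows in `left` find a match in `right` on `key`."""
    right_keys = {row[key] for row in right if row.get(key) is not None}
    unmatched = [row.get(key) for row in left if row.get(key) not in right_keys]
    matched = len(left) - len(unmatched)
    return {
        "matched": matched,
        "unmatched": len(unmatched),
        "match_rate": matched / len(left) if left else 0.0,
        "unmatched_keys": unmatched,
    }


# Hypothetical extracts from a transactional system and a web analytics system.
transactions = [
    {"customer_id": "C001", "order_value": 40.0},
    {"customer_id": "C002", "order_value": 15.5},
    {"customer_id": "C003", "order_value": 22.0},
    {"customer_id": None, "order_value": 9.99},  # e.g. a guest checkout
]
web_sessions = [
    {"customer_id": "C001", "pages_viewed": 12},
    {"customer_id": "C003", "pages_viewed": 4},
    {"customer_id": "C009", "pages_viewed": 7},
]

report = join_coverage(transactions, web_sessions, "customer_id")
print(report)
```

A low match rate at this stage is a useful prompt to talk to the teams who own the source systems before any modelling starts.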
Data outputs
Before you start building your data science model, find out where the outputs need to go and how they’re going to get there. Do you need to build an API? Or do you need to build an integration that pushes outputs into another system? Finding this out early really helps to shape the project and identifies any constraints on things like data formats. This part is vital to the success of a data science project, yet it often gets overlooked.
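One way to pin down a data format constraint early is to write the agreed output shape as a small check that runs before anything is sent. The sketch below assumes a hypothetical contract for recommender outputs (the `user_id`, `product_ids` and `generated_at` fields are invented for illustration) and serialises valid records to JSON.

```python
import json

# Hypothetical output contract agreed with the team that consumes the data.
REQUIRED_FIELDS = {"user_id": str, "product_ids": list, "generated_at": str}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems with one output record (empty if valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors


def to_payload(records: list[dict]) -> str:
    """Serialise records to JSON, refusing to emit anything that breaks the contract."""
    problems = {}
    for index, record in enumerate(records):
        errors = validate_record(record)
        if errors:
            problems[index] = errors
    if problems:
        raise ValueError(f"invalid records: {problems}")
    return json.dumps({"recommendations": records}, sort_keys=True)


records = [
    {"user_id": "U1", "product_ids": ["P1", "P2"], "generated_at": "2021-11-02T00:00:00Z"},
]
print(to_payload(records))
```

Whether the payload is then served from an API or pushed into another system, the same check applies.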
Start simple
Start by building the simplest solution you can think of. This helps you to…
- Quickly find out if there are any issues with the system integration. If you build something end to end from the beginning, you can find out if there’s a fundamental issue with how the outputs of the solution are processed – and whether you might need to change what you’re doing, or how you’re integrating with another system. It’s better to find this out quickly so that you can fix it before spending a lot of time on the data science solution.
- Handle change management with the teams you’re working with. By quickly getting some output and making small changes, it can be easier for teams to get on board with your data science solution – particularly if it’s something quite new to them. It allows you to show them early on what you’ve done, take on feedback as you go and make updates as they request them. Doing this helps to build trust with end users and demystifies data science a bit!
- Reduce the time to value. With data science projects, the biggest incremental gain in value often comes from going from no solution at all to a simple one, rather than from later iterations on the model. For that reason, it’s useful to get even a basic model out quickly, as this will likely give the end users value – even before the model is as good as it could be.
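As an illustration of how small a first version can be, the sketch below builds a recommender that gives every user the top-selling products. It is only an example of a simplest possible solution: the function names, field names and sample transactions are assumptions for illustration.

```python
from collections import Counter


def top_sellers(transactions, n=3):
    """Return the n product IDs with the most units sold."""
    counts = Counter()
    for t in transactions:
        counts[t["product_id"]] += t["quantity"]
    # Sort by units sold (descending), then product ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product_id for product_id, _ in ranked[:n]]


def recommend(transactions, user_ids, n=3):
    """Simplest possible recommender: everyone gets the same top sellers."""
    best = top_sellers(transactions, n)
    return {user_id: best for user_id in user_ids}


# Hypothetical transactional data.
transactions = [
    {"user_id": "U1", "product_id": "P1", "quantity": 2},
    {"user_id": "U2", "product_id": "P2", "quantity": 5},
    {"user_id": "U1", "product_id": "P3", "quantity": 1},
    {"user_id": "U3", "product_id": "P1", "quantity": 4},
    {"user_id": "U2", "product_id": "P4", "quantity": 1},
]

recommendations = recommend(transactions, ["U1", "U2", "U3"])
print(recommendations)
```

Something this basic is enough to exercise the full path from data input to output, and it can be replaced with a better model once that path works.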
Plan your code
While it’s good to start with something simple, it’s also worth planning what you want to build. Often when doing initial exploratory data analysis, you might find yourself using a lot of different unstructured scripts and notebooks. This can work well for early prototyping and experimentation, but doesn’t tend to translate into a high-quality data science solution.
One option to move beyond loose collections of code is to draw out the solution at a high level. This can help you see the structure and patterns in the problem you’re trying to address – including spelling out what main capabilities and associated components you’ll need and how they’ll interact in order to achieve your desired outcome.
When you’re going through this process, do your best to keep things as simple as they can be – don’t overcomplicate things too quickly. Iteration speed is crucial, so you don’t want to create something that will take months to be ready for initial testing. It’s usually a great idea to draw out these high-level designs collaboratively if you can, and if you can’t, try to get a colleague to review the design afterwards to check your thinking.
Drawing out an approach to a solution also helps you structure and break down your code, which has several benefits, including…
- It makes it easier to apply changes. If you need to make a small change to your code, combing through an unstructured script can take a long time. But, if you have everything organized into functions and classes, it makes it easier to find exactly where to make your changes.
- It makes it easier to hand over projects. If someone else takes over your project, they can ease into it by understanding one small piece of code at a time. You can also help to get them on board by getting them to improve or iterate on a particular function – without them worrying that they’ll break the whole codebase.
- It makes it easier to reuse pieces of code for future projects. As you write out your code, you can spot which functions get used again and again and might be useful for other projects. This could then lead to you making a package that helps you with all of your projects and speeds you up in the future!
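To make the idea of breaking code down concrete, here is a minimal sketch of an exploratory script reorganised into small functions, each of which can be read, changed and tested on its own. The task (average price per store) and all of the function names are hypothetical.

```python
from statistics import mean


def parse_rows(raw_lines):
    """Turn 'store,price' text lines into (store, price) tuples."""
    rows = []
    for line in raw_lines:
        store, price = line.strip().split(",")
        rows.append((store, float(price)))
    return rows


def drop_invalid(rows):
    """Remove rows with non-positive prices."""
    return [(store, price) for store, price in rows if price > 0]


def average_price_by_store(rows):
    """Compute the mean price for each store."""
    prices = {}
    for store, price in rows:
        prices.setdefault(store, []).append(price)
    return {store: mean(values) for store, values in prices.items()}


def run(raw_lines):
    """The whole pipeline, expressed as a composition of the steps above."""
    return average_price_by_store(drop_invalid(parse_rows(raw_lines)))


raw = ["store_a,10.0", "store_a,14.0", "store_b,-1", "store_b,8.0"]
print(run(raw))
```

If the cleaning rule changes, only `drop_invalid` needs to be touched, and a new team member can be handed that one function without needing to understand the rest first.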
Measure impact
Think about the business impact that you’re aiming to make with your data science project. This is something that the business stakeholders or end users can really help you with by telling you their priorities. Examples include: improving profit margins if it’s a pricing project, reducing transportation costs for a logistics project, or increasing orders per customer for a customer retention project. The main points to consider for measuring impact are…
- Make sure you’re measuring a business metric and not just a data science or modelling metric. Model metrics can be useful when building your solution and are important for validating that you’re on the right track. But, at the end of the day, the F1 score of a classification model being 5% higher does not always translate to a better business outcome. Find out the business metrics people are interested in and try to measure your model against those as much as possible.
- Set a benchmark to compare your system against. Measure as much as you can about the status quo before you add in your data science solution, so that you can see what uplift the solution brings. Bear in mind any other changes happening at a similar time that may affect the metrics. It’s also really valuable to set a simple baseline, so you can tell how much of the impact comes from your model itself. A simple rule-based approach works well here. For example, a baseline for a recommender could be to show everyone the top-selling products.
- Set up A/B tests. To see how well your solution is working, set up robust A/B tests. You may need to work with some other teams to achieve this. For example, if you’re running an A/B test for a recommender system that’s been launched on a website, you’ll need to work with the web team to assign people to either the A or B bucket as they enter the website. You’d also need to set up tracking to see how the two groups behave differently. When setting up your test, work out in advance how long it needs to run to collect enough data for the results to be statistically significant.
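The A/B testing steps above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not a full experimentation framework: bucket assignment uses a hash of the user ID, the sample size uses the standard normal approximation for comparing two conversion rates, and the experiment name, conversion rates and traffic figures are all hypothetical.

```python
import hashlib
import math
from statistics import NormalDist


def assign_bucket(user_id: str, experiment: str = "recs-v1") -> str:
    """Deterministically assign a user to bucket A or B by hashing their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.8) -> int:
    """Approximate users needed per bucket to detect the given difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2)


def two_proportion_p_value(conv_a, n_a, conv_b, n_b) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical planning figures: 5% baseline conversion, hoping for 6%,
# with roughly 1,000 visitors entering each bucket per day.
needed = sample_size_per_group(0.05, 0.06)
days = math.ceil(needed / 1000)
print(f"Need about {needed} users per bucket, roughly {days} days of traffic")

# Hypothetical results once the test has finished.
print(f"p-value: {two_proportion_p_value(500, 10_000, 600, 10_000):.4f}")
```

Hashing the user ID means a returning visitor always lands in the same bucket without any stored state, which is one common design choice. The web team may already have its own assignment mechanism, in which case only the planning and analysis functions are relevant.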