Modernizing our data infrastructure for an effective long term Data Science Program

What is data science and the motivation behind this program?

Investing in data science opens new ways of working for RightShip. We have become very proficient at sifting and examining data – our expert teams work through thousands of data points and rules every month. But data science is about something new – it’s about experimentation.

Imagine the traditional scientist in a lab adding different substances to a test tube: they vary the formula to see what happens, trying to find that magic medicine. Penicillin, for example, was identified in 1928 but didn’t become an antibiotic until 1942. I’m not saying that we want to take 14 years to develop our ideas, but sometimes we will try an experiment and find that it doesn’t work. This does not mean that it is a failure or that the project is over. Instead, it is a learning opportunity. We’ll alter our formulae and try again.

Science doesn’t always have to be complex, either. I think there’s a fear that data science is complicated and technical – a black box that people are afraid to open. But it doesn’t have to be. Some of the problems that we solve will be easy wins, others will take more time. And this is precisely the mindset that we wanted to set for this program.

We structured the program in three phases: first “educate”, to establish a common understanding of data science terms across RightShip; then “engage”, to work with teams to identify opportunities; and finally “experiment”, as we try to solve the business problems.

As part of the "educate" phase, our aim is for everyone at RightShip to understand the key concepts of data science and the process of running effective data science projects. Our data team will also receive intensive training and upskilling to enhance our capability, capacity, and agility to work with business users and solve problems through the application of data science. The technology upskilling is taking place at a time when RightShip is also investing in expanding our data platform to support the smooth delivery of data science and analytics projects.

[Figure: the three program phases: educate, engage, experiment]

Data Analytics. Isn’t it the same as Data Science?

Although many people use the terms interchangeably, the two are related but distinct. Data science is effectively an umbrella term for several practices; it includes data analytics but also covers machine learning and predictive modelling. Data analytics focuses more on looking at historical data in context to answer business questions. A data scientist’s role is thus broader than that of a data analyst, and the latter can be thought of as a subset of the former.

The data analysis spectrum is often described across four different dimensions: descriptive, diagnostic, predictive and prescriptive. Descriptive and diagnostic analytics are what typically fit our definition of Data Analytics; they answer the questions “What happened?” and “Why did it happen?”.

Predictive analytics attempts to answer the question “What is likely to happen in the future?”, while prescriptive analytics, the most complex type, attempts to discover the best course of action.

Referring to the Data Information Knowledge Wisdom (DIKW) pyramid, we could argue that Data is the input to both Analytics and Data Science. Information and Knowledge typically deal with the past (answering the ‘how?’ and ‘why?’) and are thus typically attributed to Analytics, while Wisdom attempts to answer the ‘what if?’ and is thus attributed to more advanced techniques and algorithms.

[Figure: the DIKW pyramid, from top to bottom: wisdom, knowledge, information, data]

What is your opinion about ChatGPT?

The success of ChatGPT is more about the potential it could bring to humanity: for example, a virtual assistant with impressive intelligence that could and will enhance our lives. It represents a new era, and sooner rather than later we will wonder how we managed to live without it, in exactly the same way that we cannot imagine ourselves without the smartphone or the internet. My advice to anyone using this technology is to exercise caution, as it is fallible, and to always have a human verify the output for anything considered business critical.

What is this data science program trying to solve?

The gap between the huge investments in AI and the level of impact is a widely discussed issue. Research** suggests that although over 90% of Fortune 1000 companies invest in AI activities, only around 26% report implementations in production. This number will not come as a surprise to anyone in the industry, where data science projects typically get stuck in the prototype phase. There isn’t one single reason for this, but here are some challenges from my personal experience:

First, AI can often be the elephant in the room, with so much hype and confusion about what AI actually is. Organizations also typically invest in setting up siloed data science teams without putting the needs of the business at the centre of the initiative.

Data scientists are typically equipped with enough knowledge and skills to ingest data and, out of enthusiasm to meet the demands of the business, often bypass the data team and create an end-to-end solution in the “laboratory”. Once the model has gone through enough iterations to be “accepted” by the business, we realize that the journey to production is a long and painful one that typically requires re-implementing feature extraction to meet performance needs and integrating with the rest of the data infrastructure.

Data lineage is often disregarded, which reduces the ability of data scientists to properly understand the data. Data lineage means understanding the data as it flows from data sources to consumption, including the transformations that happen along the way. It cannot be an afterthought and is difficult to achieve unless it is built into the holistic data architecture.
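
To make the idea of lineage concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of RightShip's platform: the TrackedDataset class, the load_source helper and the vessel_inspections data are all hypothetical names invented for this example. Each transformation returns a new dataset and appends its name to a lineage trail, so anyone consuming the result can see where it came from and what happened to it along the way.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrackedDataset:
    """Hypothetical wrapper: a list of records plus the lineage trail behind it."""
    records: List[dict]
    lineage: List[str] = field(default_factory=list)

    def apply(self, step_name: str,
              fn: Callable[[List[dict]], List[dict]]) -> "TrackedDataset":
        # Run the transformation and return a NEW dataset whose lineage
        # is the old trail plus the name of this step.
        return TrackedDataset(fn(self.records), self.lineage + [step_name])


def load_source(name: str, records: List[dict]) -> TrackedDataset:
    # The source name becomes the first entry in the lineage trail.
    return TrackedDataset(records, [f"source:{name}"])


# Made-up example data and steps, for illustration only.
raw = load_source("vessel_inspections", [
    {"vessel": "A", "score": 4},
    {"vessel": "B", "score": None},
    {"vessel": "C", "score": 2},
])
clean = raw.apply(
    "drop_missing_scores",
    lambda rows: [r for r in rows if r["score"] is not None],
)
flagged = clean.apply(
    "flag_low_scores",
    lambda rows: [{**r, "low": r["score"] < 3} for r in rows],
)

# The trail lists the source first, then each transformation in order.
print(flagged.lineage)
```

In a real platform this bookkeeping would be handled by the data architecture itself rather than by hand-written wrappers, which is why lineage is hard to retrofit.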

The industry has a lot of experience in productionizing software over the last 40 years or so, and the process is mature. Productionizing machine learning can be seen as productionizing software + data. Unfortunately, the equivalent of the “DevOps” process for the data element is far from mature, and this makes deploying data science projects a pain.

To summarize, data science is only possible with the right combination of business experts, data and technology. Businesses should therefore invest in the infrastructure, skills and strategy to embed data science within the DNA of the organization. And this is exactly the objective of this program.

** https://www.businesswire.com/news/home/20220103005036/en/NewVantage-Partners-Releases-2022-Data-And-AI-Executive-Survey

How are you solving this problem?

RightShip has been working on data science and analytics for many years and has successfully deployed several models to production. I am lucky to be surrounded by talented individuals within our data science, data engineering and data analytics teams. We are committed and determined to accelerate our data science journey and to be among those few companies that manage not only to introduce AI into their organizations, but actually to productionize the successful models within short timeframes.

We believe that the right partnerships are critical to our success. We have therefore partnered with industry partners Red Marble AI, Eunoia and Otzma Analytics, and are working in close collaboration with the National University of Singapore (NUS) as our academic partner. Finally, we have entered into long-term commitments on the adoption of Azure and Databricks as our technologies of choice to implement our vision of the next-generation data science platform. More specifically, we look forward to implementing our MLOps process using Databricks MLflow, with Unity Catalog as the platform of choice for federated governance.

