Introduction

As the adoption of Materials Informatics (MI) continues to grow, have you encountered challenges like these?

“We built a model using our data, but don’t know how to connect it to actual product development.” “We obtained results, but they don’t align with practical experience, and the project stalled.”

In recent years, MI tools have become more accessible, significantly lowering the barrier to entry. However, ease of access does not necessarily mean ease of effective use.

Now that tools are widely available, it is no longer enough to simply build models. What matters is what to do after building them, and even more importantly, why they are built in the first place.

This requires designing the overall framework of the project, which we call Materials R&D DX (Digital Transformation). In this article, we introduce a structured approach based on CRISP-DM (Cross-Industry Standard Process for Data Mining), a widely used standard process model for data analytics, adapted specifically for MI projects.

These steps will help organizations move from “just trying things out” to consistently delivering results at scale.

What is CRISP-DM?

CRISP-DM is a data analytics framework consisting of six iterative steps. It is not a linear process but a cycle: teams move back and forth between steps to improve outcomes. Running this cycle is itself what enables organizations to embed a culture of data-driven R&D, in other words, to practice Materials R&D DX. Let’s walk through the six steps.

1. Defining the R&D Challenge (Business Understanding)

Clarify questions such as “What kind of materials do we want to develop?” and “What problems are we trying to solve?”

- Key point: The objective does not need to be perfect from the beginning. Even a simple goal like “Let’s see if we can predict this property” is sufficient.

- Why it matters: Clear objectives make it much easier to evaluate model performance later.

This step answers the fundamental question: “What is the purpose of DX?”

2. Data Understanding

Take inventory of your available data: experimental notebooks, Excel files on personal PCs, historical reports on shared drives, and so on.

Assess data potential: Can this data be used for machine learning? Is the volume sufficient? Understanding both data quality and quantity at a high level is critical.
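As a rough illustration of this high-level assessment, the sketch below counts rows and missing values per column for a small in-memory table. The column names (`resin_wt_pct`, `cure_temp_c`, `tensile_mpa`) and all records are hypothetical, invented purely for illustration.

```python
def profile_records(records):
    """Return the row count and, per column, how many values are present or missing."""
    columns = sorted({key for row in records for key in row})
    summary = {"n_rows": len(records), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in records]
        # A value counts as missing if the key is absent, None, or an empty string.
        present = [v for v in values if v not in (None, "")]
        summary["columns"][col] = {
            "present": len(present),
            "missing": len(values) - len(present),
        }
    return summary


# Hypothetical records, e.g. transcribed from lab notebooks and Excel files.
records = [
    {"resin_wt_pct": 60.0, "cure_temp_c": 120, "tensile_mpa": 41.2},
    {"resin_wt_pct": 65.0, "cure_temp_c": None, "tensile_mpa": 44.0},
    {"resin_wt_pct": 70.0, "cure_temp_c": 140, "tensile_mpa": ""},
    {"resin_wt_pct": 75.0, "cure_temp_c": 150},
]

report = profile_records(records)
print("rows:", report["n_rows"])
for name, stats in report["columns"].items():
    print(name, stats)
```

A summary like this will not tell you whether the data is good enough for machine learning, but it makes gaps visible early, before any modeling effort is spent.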

3. Data Preparation

This is the most labor-intensive, most critical, and generally the most time-consuming step. First, collect and consolidate data scattered across individuals and departments into one place, and organize it into a common format (template). Then, correct inconsistencies in notation, handle missing values, and organize the data into a structured format for machine learning.

- DX perspective: This is not just preprocessing. It is the process of transforming fragmented, individual data into organizational assets.

- Role of platforms: MI platforms don’t magically clean data, but they help standardize formats and reduce inconsistencies, making data structuring much more efficient.
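The consolidation steps described in this section (common template, notation cleanup, missing values) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of how any particular platform works: the template, the column aliases, and the sample rows are hypothetical, and median imputation is just one possible way to handle missing values.

```python
import re
from statistics import median

# Hypothetical template: canonical column name -> aliases seen in source files.
TEMPLATE = {
    "resin_wt_pct": {"resin_wt_pct", "Resin (wt%)", "resin%"},
    "cure_temp_c": {"cure_temp_c", "Cure Temp", "cure temp [C]"},
    "tensile_mpa": {"tensile_mpa", "Tensile", "TS (MPa)"},
}
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in TEMPLATE.items()
    for alias in aliases
}
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def to_number(value):
    """Extract the first number from values like '120 C'; return None if absent.

    Assumes all sources already use the same unit; unit text is only stripped.
    """
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = NUMBER.search(str(value))
    return float(match.group()) if match else None


def standardize(row):
    """Map one raw record onto the template; unknown columns are dropped."""
    clean = {canonical: None for canonical in TEMPLATE}
    for key, value in row.items():
        canonical = ALIAS_TO_CANONICAL.get(key.strip().lower())
        if canonical is not None:
            clean[canonical] = to_number(value)
    return clean


def prepare(rows, target="tensile_mpa"):
    """Standardize rows, drop rows without a target, median-impute missing features."""
    table = [standardize(r) for r in rows]
    table = [r for r in table if r[target] is not None]
    for col in TEMPLATE:
        if col == target:
            continue
        known = [r[col] for r in table if r[col] is not None]
        fill = median(known) if known else None
        for r in table:
            if r[col] is None:
                r[col] = fill
    return table


# Hypothetical raw rows collected from different people and files.
raw = [
    {"Resin (wt%)": "60", "Cure Temp": "120 C", "Tensile": "41.2"},
    {"resin%": 65, "cure temp [C]": "", "TS (MPa)": "44.0 MPa"},
    {"resin_wt_pct": 70.0, "cure_temp_c": 140, "tensile_mpa": None},
    {"Resin (wt%)": "75", "Cure Temp": "150", "Tensile": "47.5"},
]
prepared = prepare(raw)
for row in prepared:
    print(row)
```

In real projects the alias table is usually the part that takes the most effort, since it encodes how each person or department has been naming the same quantity.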

4. Modeling

Build machine learning models. Start simple: there is no need to begin with complex algorithms. Start by creating a baseline model. By actually running models, you gain insights such as “We need more data here” or “This might be more predictable than expected.”
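To make the idea of a baseline concrete, here is a minimal sketch that compares a "predict the mean" baseline with a one-feature least-squares line, using leave-one-out error. The choice of these two models and the data (resin content versus tensile strength) are assumptions made for illustration; real projects would typically use more features and a dedicated library.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def loo_mae(fit, xs, ys):
    """Leave-one-out mean absolute error: hold out each point once."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        errors.append(abs(predict(xs[i]) - ys[i]))
    return mean(errors)


# Hypothetical data: resin content (wt%) vs. tensile strength (MPa).
resin = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]
tensile = [41.2, 44.0, 45.9, 47.5, 50.8, 52.1]

mae_mean = loo_mae(fit_mean, resin, tensile)
mae_line = loo_mae(fit_line, resin, tensile)
print(f"mean baseline MAE: {mae_mean:.2f} MPa")
print(f"linear model MAE:  {mae_line:.2f} MPa")
```

If a more complex model cannot clearly beat numbers like these, that is itself a useful signal about the data.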

5. Evaluation

Evaluate the model not only by accuracy, but also by usability and interpretability.

(1) Forward Prediction: Can the model reasonably predict properties under new conditions?

(2) Inverse Design: Can the model propose compositions or process conditions to achieve target properties?

(3) Cross-check with domain knowledge (Interpretability):
- Review SHAP analysis and feature importance to understand which factors the model considers important.
- Check whether these align with past knowledge and domain expertise (chemical intuition).
- If they do align, confidence in the model increases significantly.
- If they contradict, it may present an opportunity. While it could indicate potential data bias, it may also reveal hidden correlations or new insights that humans have overlooked. This can lead to breakthroughs in R&D.

(4) Perspective for improving accuracy: If accuracy or confidence is insufficient, how can it be improved?
- Should we simply increase the amount of experimental data?
- Should we refine the existing data (e.g., feature engineering or transformations)?
- Should we add metadata on raw materials or SMILES information to provide more informative inputs to the model?
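The first three evaluation checks in this section (forward prediction, inverse design, interpretability) can be sketched in a few lines. Everything here is an assumption for illustration: `predict_tensile` is a hypothetical already-fitted model with made-up coefficients, inverse design is done by a plain grid search, and a permutation-style importance (using a deterministic cyclic shift rather than random shuffling) stands in for SHAP, which would require an external library.

```python
from itertools import product
from statistics import mean


def predict_tensile(resin_wt_pct, cure_temp_c, batch_no):
    """Hypothetical fitted model; coefficients are made up. batch_no is ignored."""
    return -20.0 + 0.5 * resin_wt_pct + 0.25 * cure_temp_c


# Forward prediction: estimate a property at an untested condition.
forward = predict_tensile(72.0, 130.0, 0)


# Inverse design: search a candidate grid for conditions closest to a target.
def inverse_design(target, resin_grid, temp_grid, top_n=3):
    candidates = [
        (abs(predict_tensile(r, t, 0) - target), r, t)
        for r, t in product(resin_grid, temp_grid)
    ]
    return [(r, t) for _, r, t in sorted(candidates)[:top_n]]


proposals = inverse_design(
    target=50.0,
    resin_grid=[60.0, 65.0, 70.0, 75.0, 80.0],
    temp_grid=[120.0, 140.0, 160.0],
)

# Interpretability: shift one column at a time and measure how much the
# mean absolute error grows. A feature the model ignores scores zero.
FEATURES = ["resin_wt_pct", "cure_temp_c", "batch_no"]
X = [
    [60.0, 120.0, 1], [65.0, 160.0, 2], [70.0, 140.0, 3],
    [75.0, 120.0, 4], [80.0, 160.0, 5], [85.0, 140.0, 6],
]
# Noise-free targets generated from the model itself, for illustration only.
y = [predict_tensile(*row) for row in X]


def mae(rows, targets):
    return mean(abs(predict_tensile(*r) - t) for r, t in zip(rows, targets))


def shift_importance(rows, targets):
    base = mae(rows, targets)
    scores = {}
    for j, name in enumerate(FEATURES):
        column = [r[j] for r in rows]
        shifted = column[1:] + column[:1]  # deterministic cyclic permutation
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shifted)]
        scores[name] = mae(permuted, targets) - base
    return scores


importance = shift_importance(X, y)
print("forward prediction:", forward)
print("inverse design proposals:", proposals)
print("importance:", importance)
```

Note that the grid search returns several quite different conditions that hit the same target. Inverse design rarely has a unique answer, which is one reason the cross-check with domain knowledge matters.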

6. Deployment

Integrate the model into actual R&D workflows. This is the true goal of Materials R&D DX.

Operation in Practice
- Use forward prediction (simulation) to reduce the number of experiments
- Use inverse design (optimization) to discover new formulations that humans might not conceive
- Treat AI not as a replacement for researchers, but as a partner that augments their capabilities

Iterative Cycle
- Accumulate newly obtained experimental results as data
- Retrain the model
- Improve model performance
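The iterative cycle can be pictured as a simple loop: add new results, refit, and track error on a fixed set of held-out measurements. The sketch below assumes a one-feature least-squares model and entirely made-up data points; it illustrates the loop, not any specific retraining mechanism.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least squares on (x, y) pairs; returns a prediction function."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def holdout_mae(predict, holdout):
    """Mean absolute error on measurements never used for training."""
    return mean(abs(predict(x) - y) for x, y in holdout)


# Hypothetical data: (resin wt%, tensile strength in MPa).
dataset = [(60.0, 41.0), (62.0, 43.5)]           # initial experiments
new_batches = [
    [(75.0, 47.2), (85.0, 50.8)],                # results from cycle 1
    [(68.0, 44.1), (95.0, 55.1)],                # results from cycle 2
]
holdout = [(70.0, 45.0), (80.0, 49.0), (90.0, 53.0)]

history = []
for batch in [[]] + new_batches:
    dataset.extend(batch)                        # accumulate new results
    model = fit_line(dataset)                    # retrain
    history.append((len(dataset), holdout_mae(model, holdout)))

for n_points, error in history:
    print(f"{n_points} data points -> hold-out MAE {error:.2f} MPa")
```

On this made-up data the hold-out error drops as results accumulate, but that is not guaranteed in general, which is why the error is tracked on every cycle rather than assumed.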

Conclusion: “Start small, iterate fast”

These six steps may seem complex, but in practice the opposite is true. Start by quickly running through steps 3–5 using your existing data. This will naturally reveal “What data is missing” and “How the problem should be redefined.”

Accelerating the Cycle with the Platform

Our Materials R&D DX platform is designed to accelerate the CRISP-DM cycle and enable it to be repeated continuously.

- Data Preparation: Standardization and centralized management (assetization) of data through the use of templates

- Modeling: Automated modeling capabilities that can be used even without specialized expertise

- Deployment: Implementation of forward prediction and inverse design using the developed models

Whether you want to “just try it out” or “build a full-scale process,” this platform enables you to move forward smoothly without losing sight of where you are in your project. Why not start with a free trial and experience your “first cycle” using your own data?