
Research-Oriented Code in AI/ML Projects

*This is the same article I originally published on Medium.

Recently, web application engineers have more opportunities than before to work with data scientists or researchers. At the same time, they often encounter research-oriented code written by data scientists or researchers. Such code tends to be written for products involving artificial intelligence (AI) or machine learning (ML). This article might help junior and mid-level engineers or data scientists understand the code written for AI/ML products.

What is research-oriented code?

Research-oriented code is code written mainly by researchers or scientists; it refers to analysis scripts and/or prototype code, and it is written based on the scientific research paradigm. Products can be developed through three iterative processes: 1) writing analysis scripts, 2) developing prototypes based on those scripts, and 3) transforming prototypes into application products.

Figure 1 created by Jesse Tetsuya

Data scientists and researchers tend to have a strong background in mathematics and algorithmic modeling. They can use these skill sets to aid development work, such as writing analysis scripts and developing prototypes. In many AI/ML projects, they frequently write analysis scripts and prototype-level code.

Although some data scientists and researchers have engineering skill sets, production-level code is mainly written by web application engineers. Web application engineers are responsible for the quality of the production code itself and for making it run on servers. For example, in order to integrate code written by scientists or researchers into systems such as APIs, web application engineers need to refactor it.

How are analysis scripts written in AI/ML projects?

To narrow down the meaning of analysis scripts, it helps to look at how they are written in business situations. Analysis scripts are used in both academia and business, and the setting largely shapes how they should be understood.

In academia, writing analysis scripts is part of the research process. Traditional research tends to be deductive and functionalistic in nature, as described in figure 2. Analysis scripts are written to uncover new knowledge and to write it down in a thesis. Rigid, exact evidence is required to support the new knowledge and the arguments in the thesis, regardless of how beneficial it may be in business. Unless the work involves a business, researchers might not need to consider strict resource planning or an estimated delivery date beyond academic conference or journal deadlines.

Figure 2 created by Jesse Tetsuya

In real work in business situations, especially in AI/ML projects, the analysis scripts themselves can be the base of prototypes and products. By looking at demonstrations of the prototypes, decision makers need to judge whether the idea is beneficial in business before other companies start the same business. So quick, iterative outputs of analysis scripts, which can also serve as prototypes, are required. The flow below suits these cases, being inductive, iterative, and organic.

Figure 3 created by Jesse Tetsuya

The cycle of the yellow boxes in figure 3 above is discussed in detail below.

Research Methods
  • In research, there are two ways of analysis: 1) qualitative analysis, such as observation and unstructured interviews, which uses subjective judgment based on unquantifiable information, and 2) quantitative analysis, which seeks to understand behavior through mathematical and statistical modeling, measurement, and research.
  • In AI/ML projects, the second way is chosen, and the stakeholders need to select analysis methods such as regression, classification, or neural networks. They also decide which analysis tools (Python, R, SPSS, etc.) to use.
Data Collection
  • First, secondary data such as log data needs to be collected from databases and other sources. Before collecting data, the stakeholders judge what data can properly stand for what they want to know. In academic research, the database or the data itself sometimes sits outside the campus or laboratory.
  • On the other hand, most IT companies tend to store the data in their own databases. That makes it easier for projects to go back and forth among data collection, data analysis, and research method decisions.
Data Analysis
  • In both academia and business, the collected data is pre-processed. The way of pre-processing depends on the analysis methods and tools. The coders feed the pre-processed data into the analysis models and verify whether the output is appropriate for answering what they wanted to know (a minimal sketch of one pass through this cycle follows this list).
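
As a rough illustration of one pass through this cycle, the sketch below uses Python with pandas and scikit-learn; the file name, column names, and the choice of a logistic regression baseline are hypothetical stand-ins for whatever a given project actually uses.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Data collection: load secondary data (hypothetical log export).
    df = pd.read_csv("logs.csv")

    # Pre-processing: drop rows with missing values and select features.
    df = df.dropna()
    X = df[["feature_a", "feature_b"]]   # hypothetical feature columns
    y = df["converted"]                  # hypothetical binary target

    # Research method: a simple classification model as a first baseline.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Verify whether the output is appropriate for the question at hand.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
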
Types of prototype

Prototypes can be categorized into two groups: something visible, such as a client application or web application, and something invisible, such as an API or a statistical/mathematical model (described in figure 4). In AI/ML projects, the latter are the prototypes developed through the above cycle of writing analysis scripts.

Figure 4 created by Jesse Tetsuya
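
To make the "invisible" kind of prototype more concrete, here is a minimal sketch that assumes a scikit-learn model has already been trained and saved with joblib and is exposed through a small Flask endpoint; the file name, route, and request format are hypothetical.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # hypothetical pre-trained model file

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body like {"features": [[1.0, 2.0]]}.
        features = request.get_json()["features"]
        prediction = model.predict(features).tolist()
        return jsonify({"prediction": prediction})

    if __name__ == "__main__":
        app.run(port=5000)
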
Five indicators for evaluating research-oriented code

This last section describes which evaluation indicators can be useful when fixing and writing the actual programming code. Understanding these indicators can help coders make the iterative cycle of writing analysis scripts work better.

Research-oriented code can be evaluated using the scientific research indicators suggested in The Essential Guide to Doing Your Research Project by Zina O’Leary, because such code is based on the scientific research paradigm, as described in the section above. The five indicators can be applied to the actual process of writing research-oriented code in AI/ML projects.

1. Objectivity
Objectivity implies distance between the researcher and the researched, and suggests that relationships are mediated by protocol, theory, and method. This standard exists in order to prevent personal bias from ‘contaminating’ results. Have subjectivities been acknowledged and managed?

Situation: when pre-processing data, programmers or researchers need to decide which data, such as outliers, is not important for the analysis and drop it. Having a third person look at the data might provide valuable insight and is an effective way to avoid personal bias. Having a stable environment in which to quickly go back and forth among data analysis, data collection, and research method selection is another part of the solution.
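
One way to keep this decision transparent is to encode the outlier rule explicitly so that a third person can review it. The sketch below assumes a simple z-score threshold on a hypothetical numeric column; the function name, column name, and sample data are made up for illustration.

    import pandas as pd

    def drop_outliers(df: pd.DataFrame, column: str, z_threshold: float = 3.0) -> pd.DataFrame:
        """Drop rows whose value in `column` is more than `z_threshold`
        standard deviations from the column mean, making the rule explicit and reviewable."""
        z_scores = (df[column] - df[column].mean()) / df[column].std()
        return df[z_scores.abs() <= z_threshold]

    # Hypothetical usage on a response-time column; the sample is tiny,
    # so a lower threshold than the usual 3.0 is used for the demonstration.
    df = pd.DataFrame({"response_time": [0.2, 0.3, 0.25, 12.0, 0.28]})
    cleaned = drop_outliers(df, "response_time", z_threshold=1.5)
    print(cleaned)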

2. Validity
Validity is premised on the assumption that what is being studied can be measured or captured, and seeks to confirm the truth and accuracy of this measured and captured ‘data’, as well as the truth and accuracy of any findings or conclusions drawn from the data. It indicates that the conclusions you have drawn are trustworthy. Has ‘true essence’ been captured?

Situation: the analysis algorithm does not output accurate results, or the learning models do not work well. These problems can be caught by using machine learning metrics such as precision.
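
For instance, scikit-learn's metrics module provides precision and related scores for checking whether a classifier really captures what it is supposed to measure; the labels below are hypothetical.

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Hypothetical ground-truth labels and model predictions.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("f1:       ", f1_score(y_true, y_pred))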

3. Reliability
Reliability is premised on the notion that there is some sense of uniformity or standardization in what is being measured, and that methods need to consistently capture what is being explored. Reliability is thus the extent to which a measure, procedure, or instrument provides the same result on repeated trials. Are methods approached with consistency?

Situation: if the pre-processing code outputs inconsistent results, it is not reliable. For example, if the pre-processing procedure, the data input flow, and the parameter checks are all defined in code, reliability depends on the quality of that code.
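
As a sketch of what such code can look like, the pre-processing procedure, parameter check, and output ordering below are all made explicit, and a repeated run is asserted to return the same result; the function and column names are hypothetical.

    import pandas as pd

    def preprocess(df: pd.DataFrame, fill_value: float = 0.0) -> pd.DataFrame:
        """Explicit, repeatable pre-processing: validate the parameter,
        fill missing values, and sort so the output order is deterministic."""
        if not isinstance(fill_value, (int, float)):
            raise TypeError("fill_value must be numeric")
        out = df.fillna(fill_value)
        return out.sort_values(by=list(out.columns)).reset_index(drop=True)

    df = pd.DataFrame({"a": [3, None, 1], "b": [0.5, 0.1, None]})
    first = preprocess(df)
    second = preprocess(df)
    # Repeated trials on the same input should give identical results.
    assert first.equals(second)
    print(first)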

4. Generalizability
Generalizability indicates that the findings of a sample are directly applicable to a larger population. While findings from the sample may vary from those of the population, findings considered generalizable show a statistical probability of being representative. Are findings applicable outside the immediate frame of reference?

Situation: except in extreme cases, the input data size and data type largely influence the output of algorithms. Generalizability is governed by the amount and properties of the input data. Weak generalizability of an algorithm can show up as over-fitting or data leakage. These can be prevented by tuning parameters with cross-validation, applying regularization, and inspecting the learning curve.
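
A minimal sketch of these checks with scikit-learn is shown below, using hypothetical synthetic data in place of a real dataset: cross-validation scores, a regularization parameter (C in logistic regression), and a learning curve.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, learning_curve

    # Hypothetical synthetic data standing in for the project's real dataset.
    X, y = make_classification(n_samples=500, n_features=20, random_state=42)

    # C controls regularization strength (smaller C = stronger regularization).
    model = LogisticRegression(C=1.0, max_iter=1000)

    # 5-fold cross-validation estimates how well the model generalizes.
    scores = cross_val_score(model, X, y, cv=5)
    print("cross-validation accuracy:", scores.mean())

    # The learning curve shows how train/validation scores evolve with more data.
    train_sizes, train_scores, valid_scores = learning_curve(model, X, y, cv=5)
    print("train sizes:      ", train_sizes)
    print("train scores:     ", train_scores.mean(axis=1))
    print("validation scores:", valid_scores.mean(axis=1))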

5. Reproducibility
Reproducibility is directly concerned with issues of credibility and indicates that the research process can be replicated in order to verify research findings. In other words, conclusions would be supported if the same methodology was used in a different study with the same/similar context. Can the research be verified?

Situation: when the code outputs different results depending on the operating environment or server, it is not reproducible. The infrastructure configuration might have problems, or the operating environment or server might not be appropriate for running the code.
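
One small, partial step toward reproducibility is to fix random seeds and record the interpreter and library versions a run depends on, so the same code can be re-run in another environment; the sketch below assumes NumPy and scikit-learn are the main dependencies, which may not match a given project.

    import random
    import sys

    import numpy as np
    import sklearn

    # Fix random seeds so repeated runs produce the same results.
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)

    # Record the environment so the run can be replicated elsewhere.
    print("python:      ", sys.version.split()[0])
    print("numpy:       ", np.__version__)
    print("scikit-learn:", sklearn.__version__)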

In short, 1) objectivity and 5) reproducibility are indicators related to the analysis script itself, while 2) validity, 3) reliability, and 4) generalizability are directly related to the analysis algorithms. Machine learning metrics built into scikit-learn can be useful for measuring the machine learning algorithms themselves.


“Research-oriented code” can be a useful term for breaking the problem into pieces and looking at them from the engineer’s or researcher’s perspective. The skill sets of engineers and researchers overlap slightly, so their job responsibilities in AI/ML projects sometimes seem vague. Less experienced engineers and data scientists, in particular, might find their actual tasks in AI/ML projects more ambiguous than seniors do. This is why I conceptualized research-oriented code in AI/ML projects and described it in detail.

References

Bhattacherjee, A. (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida, Scholar Commons & Open Textbook Library. Available at: https://open.umn.edu/opentextbooks/BookDetail.aspx?bookId=79

O’Leary, Z. (2014). The Essential Guide to Doing Your Research Project (2nd ed.). Los Angeles: SAGE.