CLASSIFICATION OF STRAWBERRY FRUIT SHAPE BY MACHINE LEARNING

Shape is one of the most important traits of agricultural products due to its relationships with the quality, quantity, and value of the products. For strawberries, nine types of fruit shape have been defined, and fruits are classified by humans based on sample patterns of the nine types. In this study, we tested the classification of strawberry shapes by machine learning in order to increase the accuracy of the classification and to introduce the concept of computerization into this field. Four types of descriptors were extracted from the digital images of strawberries: (1) the Measured Values (MVs), including the length of the contour line, the area, the fruit length and width, and the fruit width/length ratio; (2) the Ellipse Similarity Index (ESI); (3) Elliptic Fourier Descriptors (EFDs); and (4) Chain Code Subtraction (CCS). We used these descriptors for the classification test along with the random forest approach, and eight of the nine shape types were classified with the combination of MVs + CCS + EFDs. CCS is a descriptor that adds human knowledge to the chain codes, and it showed higher robustness in classification than the other descriptors. Our results suggest that machine learning has a high ability to classify fruit shapes accurately. We will attempt to increase the classification accuracy and apply the machine learning methods to other plant species.


INTRODUCTION
Shape is one of the most important traits of agricultural products due to its relationships with the quality, quantity and value of the products. Shape is also important in cultivar identification, as cultivars are defined by their morphological characteristics. Although mechanization and automation are used widely in agriculture (mainly for cultivation systems such as planting and harvesting), automated technologies for the identification of the shapes of agricultural products have not been established. The identification of the shapes of agricultural products has been done for centuries by visual assessment only, and the criteria for judgement have not been well defined. It has been considered challenging to transfer the classification of shapes from visual assessment to computerization, in part because it is difficult to describe shapes verbally in detail and in a standard manner.

Strawberries are an important fruit species used worldwide as fresh or processed products. In Japan, strawberries are third in production value among agricultural crops (after rice and tomatoes), contributing a significant portion of the agricultural economy. Shape carries a higher value as a trait for several horticultural products, including strawberries, than for other crops, because it is directly linked to the price per unit. In addition, a specific fruit shape is often preferred, such as the conical shape of strawberries for visual presentation. A cultivar's specific shape is also important for a brand's value; an example is the round shape of the strawberries of the cultivar 'Fukuoka S6 Go' (trademark name, 'Amaou').

Agricultural products are not generated from artificial designs, and the shapes of the products are not uniform. In addition, the shapes of these products can change depending on the growing conditions. This is true for strawberries, and there are no well-defined guidelines for the classification of strawberry shapes. Therefore, sample patterns defined by agriculture authorities have been used for shape classification, and the judgement is made by a human comparing the sample pattern with the fruit. This judgement often changes depending on the 'human sensor' used, i.e., among individuals. Practice is thus required to increase a human sensor's precision. In Japan, nine types of strawberry shape are defined in the guideline issued for strawberries by the Plant Variety Protection (PVP) office at the Ministry of Agriculture, Forestry and Fisheries (MAFF, 2011). That guideline was issued for the registration of strawberry varieties, and the nine shape types are used for characterizing the shape of novel strawberry varieties that are developed and registered.

In this study, we tested the classification of strawberry shapes by machine learning with the goal of increasing the accuracy of the classification, and in order to introduce computerization into this field. We describe the types of strawberry shape in Section 2, our image analysis (Section 3), machine learning and feature descriptions for classification (Section 4), our results (Section 5), and a discussion and summary (Section 6).

THE NINE TYPES OF STRAWBERRY SHAPE
The nine types of strawberry shape defined by the PVP office at MAFF (Fragaria L. Char.37, Fruit: shape) are shown in Figure 1: reniform, conical, cordate, ovoid, cylindrical, rhomboid, obloid, globose, and wedged. Examples of the nine sample patterns of the fruit are provided in Figure 3.

Image capture
A total of 2,969 strawberry fruit images were captured by a digital camera (Canon EOS 40D). The 388 strawberry plants in a Multi-parent Advanced Generation Inter-crosses (MAGIC) population, which was derived from six founder lines, were grown at the Fukuoka Agricultural and Forestry Research Centre (Wada et al. 2017a). The harvested fruits were put on Styrofoam trays (Figure 4), and the digital images were taken under artificial lighting covered with light-scattering film. A colour checker was not used in this study; we plan to use it in the future for the evaluation of breeding improvement. The image resolution of the system in this study was 0.08 mm/pixel, and the measuring accuracy was estimated as < 0.8 mm when the range of error due to lens distortion was assumed to be within 10 pixels (calibrated result). The acceptable range of error in practical measurement was considered to be 0.5-1.0 mm; we therefore considered that camera calibration was not an essential process in our measuring system and skipped it to increase the throughput of the work. The accuracy of measurement was instead improved by increasing the amount of learning data. The digital images were processed by the following image analysis.

Image analysis software
The length (i.e., the major axis of the ellipse), the width (the minor axis of the ellipse) and the ratio of width/length in each fruit image were determined with the software developed by Hayashi et al. (2017b). This software uses the algorithms developed by Tanabata et al. (2012), and it automatically extracts the fruit from the image's background based on colours.
The contour line of the fruit was determined by the software, and ellipse approximation was performed for the measurement of the length and width of the ellipse. Figures 4-6 show an input image, extracted fruit images, and an approximate ellipse, respectively.
We developed a program for generating chain codes (Freeman, 1974) of fruit contour lines in order to add feature descriptors. The directions of the approximate ellipses were identified by extracting the top and bottom coordinates of the approximate ellipses (Figure 6). The start points of the chain codes were identified, and the directions of the connection points were determined. We performed a principal component analysis (PCA) with Elliptic Fourier Descriptors (EFDs) that were determined based on the chain codes as well as other feature descriptors.
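The chain-code generation step can be sketched as follows. This is only a minimal illustration and not the authors' program: it assumes the contour is already available as an ordered list of 8-connected pixel coordinates, and the direction numbering (code 0 along +x, increasing counter-clockwise with y pointing up) is an assumed convention that the text does not specify. The function name `freeman_chain_code` is hypothetical.

```python
# Hypothetical sketch of 8-direction Freeman chain coding for a closed contour.
# Assumed convention: code 0 points along +x and codes increase counter-clockwise
# (y pointing up); the paper does not state its numbering.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def freeman_chain_code(contour):
    """Return the chain code of a closed, 8-connected contour.

    `contour` is an ordered list of (x, y) integer pixel coordinates. The
    first point is the start point, and the last point is linked back to it.
    """
    codes = []
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {i} and {(i + 1) % n} are not 8-connected")
        codes.append(DIRECTIONS[step])
    return codes


# Toy contour: a small square traversed counter-clockwise from its lower-left corner.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(freeman_chain_code(square))
```

In a real pipeline the start point would be chosen from the orientation of the approximate ellipse, as described above, so that chain codes of different fruits are comparable.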

SHAPE (software for the calculation of the EFDs)
The Elliptic Fourier Descriptors (EFDs) method, a mathematical method for the description of contours, is often used for the investigation of organs in organisms (Furuta et al. 1995, Keith et al. 2013). In this study, we determined the EFDs with the use of freeware, SHAPE (Iwata and Ukai, 2002), which was designed for the quantitative evaluation of biological shapes based on EFDs. SHAPE has functions for extracting contour lines from the digital images of (plant) organs, and for investigating the contours in quantitative ways based on EFDs and principal component scores. The features extracted as principal components are visualized, and the scores are used as the descriptors of the features. SHAPE runs on Windows OS and includes four applications: (1) the extraction of contour lines, (2) the derivation and normalization of the EFDs, (3) a principal component analysis, and (4) visualization of the PCA results (Figure 8). We input the chain code data to SHAPE, and we investigated the fruit shapes by using functions (2) to (4). The obtained Fourier series was used for machine learning as a feature descriptor.

Identification of fruits at the species level (such as apple and banana) is easier than classification of fruit shape patterns within the same species (as in this study), because the target fruits show obviously different shapes among different species. Several studies have reported the classification of fruits within a single species: separation of 'good' and 'defective' strawberry fruits by a neural network (Morimoto et al., 2000), classification of lemon fruits into three types by colour and volume (Khojastehnazh et al. 2010), and classification of strawberry fruits into four shape patterns (Nagata et al. 1996 and Liming et al. 2010). Gonzalo et al. (2009) classified tomato fruits into nine shape patterns for genetic analysis. In that case, they used sliced fruit images, which gave a larger number of characteristics than surface appearances. Our methods identify finer differences than previous studies, based on digital images of surface appearance.

We will first discuss the most appropriate machine learning approach to the classification of strawberry shapes into the nine types. We focused on two criteria: (1) the accuracy of classification and (2) the smallest number of parameters needed, to minimize the process of trial and error. Deep learning was excluded from the candidates because of the small size of the training and assessment data in this study (1,500 data points for nine types). The random forest approach was reported to show higher accuracy compared to other approaches (Kobayashi, 2011). Among the programs available for use with a random forest approach, the scikit-learn module in Python has the benefit of a smaller number of main parameters that must be set by the user. Table 2 summarizes the results of the comparisons of characteristics between a random forest classifier and a support vector machine (SVM) classifier, which is currently the most commonly used classifier. An SVM requires the input parameters to be normalized and the kernel to be selected and tuned.
In this study, we planned to perform tests 36 times (the number of possible pairs from the nine types: 9C2 = 36) to determine the feature descriptors. To avoid configuring the parameters 36 times, we used a random forest approach. The random forest was coded with the following parameters.
A: The number of binary decision trees in learning. A larger number of binary decision trees increases the accuracy of classification, but it requires more calculation. We used 500 binary decision trees in this study, in accord with the recommendation of Breiman (2001).
B: The maximum number of feature descriptors used for the learning in each binary decision tree. We used √M, where M is the number of explanatory variables, according to Breiman (2001).
C: The maximum depth in each binary decision tree. 'Random forest' is an ensemble learning approach using binary decision trees, and the depth of each decision tree affects the variance of the learning model. According to Breiman (2001), the possibility of over-fitting is increased when the maximum depth is increased. However, the proper depth changes depending on the data set used. In this study, we set the maximum depth to 1 when selecting effective feature descriptors in the test described in section 5.1. In the test described in section 5.2, the overall structure of the learning data changed depending on the selected feature descriptors, and the maximum depth was therefore set to 7. The use of two maximum depth values was expected to suppress the possibility of over-fitting.
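The three settings above map onto scikit-learn's `RandomForestClassifier` roughly as in the sketch below. This is an illustration under assumptions rather than the authors' code: the helper name `make_forest`, the fixed `random_state`, and the synthetic data are all hypothetical, and scikit-learn applies the √M limit at each split rather than once per tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_forest(max_depth):
    """Build a random forest with the settings A, B and C described in the text."""
    return RandomForestClassifier(
        n_estimators=500,     # A: number of binary decision trees
        max_features="sqrt",  # B: sqrt(M) descriptors considered (per split in scikit-learn)
        max_depth=max_depth,  # C: 1 for the pairwise tests, 7 for the nine-type test
        random_state=0,       # assumption: fixed seed so that runs are repeatable
    )


# Synthetic stand-in data (NOT the strawberry data): 60 samples, 5 descriptors,
# two classes separated by the sign of the first descriptor.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

pair_forest = make_forest(max_depth=1).fit(X, y)
predictions = pair_forest.predict(X)
print(predictions.shape)
```

With a depth of 1 each tree is a single split (a decision stump), which is why this setting is only suited to the two-class pairwise tests.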

Feature descriptors
We used the following four feature descriptors: Measured Values (MVs), the Ellipse Similarity Index (ESI), Elliptic Fourier Descriptors (EFDs), and Chain Code Subtraction (CCS).

Elliptic Fourier Descriptors (EFDs):
The EFD method (Kuhl and Giardina, 1982) uses the X and Y coordinates of a point on the contour as functions of the arc length. Here, we determined the EFDs for the contour line of strawberries. First, suppose a point 'P' travels around the contour line of a fruit from the starting point 'S' at a constant speed. With the coordinates of 'P' at the time point 't' written as x(t) and y(t), both coordinates are periodic functions with a period of 'T', because P returns to the starting point 'S' after each round. A periodic function can be represented by a Fourier series.
Here, x(t) and y(t) are represented as follows:

$$x(t) = a_0 + \sum_{n=1}^{N}\left(a_n\cos\frac{2n\pi t}{T} + b_n\sin\frac{2n\pi t}{T}\right) \tag{2}$$

$$y(t) = c_0 + \sum_{n=1}^{N}\left(c_n\cos\frac{2n\pi t}{T} + d_n\sin\frac{2n\pi t}{T}\right) \tag{3}$$

In this case, $a_n$, $b_n$, $c_n$, and $d_n$ are the coefficients of the EFDs. The first terms of the formulas, $a_0$ and $c_0$, represent the X and Y coordinates of the origin. The shape information is thus contained in the coefficients of the EFDs. Both formulas are Fourier series with N harmonics, and therefore, a more refined contour can be represented with a higher N. Figure 9 shows the reconstruction of a contour line of a strawberry from N=1 (ellipse) to higher dimensions (N=80) that represent the contour line with harmonic ellipses. We performed a PCA with the coefficients in order to extract the major differences between the shapes. We used the first to fifth principal components, following the approach using the principal component scores (Rohlf and Archie, 1984).
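To make the role of the coefficients concrete, the sketch below computes $a_n$, $b_n$, $c_n$ and $d_n$ for a closed polygonal contour using the closed-form sums for piecewise-linear contours associated with Kuhl and Giardina (1982). It is a simplified illustration, not the SHAPE implementation: the normalization that SHAPE applies is omitted, the function name `elliptic_fourier_coefficients` is hypothetical, and consecutive contour points are assumed to be distinct.

```python
import math


def elliptic_fourier_coefficients(points, n_harmonics):
    """Return [(a_n, b_n, c_n, d_n)] for n = 1..n_harmonics.

    `points` is an ordered list of (x, y) vertices of a closed contour; the
    last vertex is joined back to the first. Consecutive vertices must differ.
    """
    m = len(points)
    dx = [points[(i + 1) % m][0] - points[i][0] for i in range(m)]
    dy = [points[(i + 1) % m][1] - points[i][1] for i in range(m)]
    dt = [math.hypot(ex, ey) for ex, ey in zip(dx, dy)]  # segment lengths

    # Cumulative arc length: t[p] is the start and t[p + 1] the end of segment p.
    t = [0.0]
    for step in dt:
        t.append(t[-1] + step)
    period = t[-1]  # T, the total contour length

    coeffs = []
    for n in range(1, n_harmonics + 1):
        scale = period / (2.0 * n * n * math.pi ** 2)
        w = 2.0 * n * math.pi / period
        a = b = c = d = 0.0
        for p in range(m):
            dcos = math.cos(w * t[p + 1]) - math.cos(w * t[p])
            dsin = math.sin(w * t[p + 1]) - math.sin(w * t[p])
            a += dx[p] / dt[p] * dcos
            b += dx[p] / dt[p] * dsin
            c += dy[p] / dt[p] * dcos
            d += dy[p] / dt[p] * dsin
        coeffs.append((scale * a, scale * b, scale * c, scale * d))
    return coeffs


# Toy contour: a circle of radius 10 sampled at 360 points, starting at (10, 0)
# and traversed counter-clockwise. Its first harmonic should be close to
# (10, 0, 0, 10), i.e. x = 10 cos(2*pi*t/T) and y = 10 sin(2*pi*t/T).
circle = [(10 * math.cos(math.radians(k)), 10 * math.sin(math.radians(k)))
          for k in range(360)]
first = elliptic_fourier_coefficients(circle, 3)[0]
print([round(v, 3) for v in first])
```

For a circle only the first harmonic is non-negligible; a strawberry contour spreads its shape information over the higher harmonics, which is what the PCA then summarizes.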

Chain Code Subtraction (CCS):
The directions of the connection points of the chain codes are expressed with eight direction codes, and we thus considered that the appearance ratio of the eight directions represented the shape of the fruit.
Figure 10 shows the appearance ratio of chain codes (the maximum ratio is 1) for the chain code representative value (CR) of each fruit shape, estimated based on the sample patterns in Figure 3. The x- and y-axes represent the nine types of strawberry shape and the average of the appearance ratio, respectively. The colours of the bars show the eight directions. The chain code subtraction (CCS) was calculated by subtracting the chain codes of each fruit (CS) from the CR:

$$\mathrm{CCS}_n = \mathrm{CR}_n - \mathrm{CS}_n \quad (n = \text{fruit type } 1\text{-}9) \tag{4}$$

The histograms of CCS values are shown in Figure 11. The distribution patterns of the histograms were different for each shape type. Because the differences in distribution pattern were more clearly represented in CCS, we added CCS in the subsequent analysis as a feature descriptor.
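One plausible reading of this calculation is sketched below: the appearance ratio of each of the eight directions is computed for a fruit's chain code and subtracted, direction by direction, from the representative ratios of one shape type. The function names and the representative values are hypothetical placeholders (the real CR values come from the sample patterns in Figure 3), and whether the authors combined the eight differences further is not stated.

```python
from collections import Counter


def direction_ratios(chain_code, n_directions=8):
    """Appearance ratio of each direction code (the ratios sum to 1)."""
    if not chain_code:
        raise ValueError("chain code must not be empty")
    counts = Counter(chain_code)
    total = len(chain_code)
    return [counts.get(d, 0) / total for d in range(n_directions)]


def chain_code_subtraction(representative_ratios, fruit_chain_code):
    """CCS for one shape type: CR minus CS, computed per direction."""
    fruit_ratios = direction_ratios(fruit_chain_code, len(representative_ratios))
    return [cr - cs for cr, cs in zip(representative_ratios, fruit_ratios)]


# Placeholder representative ratios for one shape type (NOT values from the paper):
# all eight directions equally frequent.
cr_example = [0.125] * 8
# Chain code of a toy square contour: only directions 0, 2, 4 and 6 occur.
cs_example = [0, 0, 2, 2, 4, 4, 6, 6]

print(direction_ratios(cs_example))
print(chain_code_subtraction(cr_example, cs_example))
```

Repeating the subtraction against the representative ratios of all nine types would give one set of differences per type for each fruit.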

Exploring the variables that affect the classification of fruit shapes
To explore the variables that affect the classification of fruit shapes, we performed a classification test using each pair of the nine types of shape, in a round-robin system (9C2 = 36 pairs). A total of 85 to 100 images of the fruits were selected for each shape type in order to avoid bias in the learning data. The images of each type were divided at a 7:3 ratio, and the former and latter images were used as the learning data and test data, respectively. The numbers of images of each shape type are shown in Table 3. We tested the usefulness of the MVs (the length of the contour line, the area, the fruit length, width, and width/length) for the classification of strawberry shapes, and we then added the ESI, EFD and CCS to the MVs and investigated the effectiveness of these additions by using the random forest approach. The depth of the decision tree was set to 1 because the test was for the classification of pairs of the nine shape types. A coefficient of agreement defined by Cohen (1960; hereinafter, Kappa coefficient) was used as the index of the classification test. The Kappa coefficient is an index representing the degree of agreement between two examiners. It is calculated by subtracting the agreement expected by chance (theoretical coincidence) from the observed agreement (practical coincidence) and dividing the result by one minus the chance agreement. In the present test, the Kappa coefficient expressed the agreement between the shape types predicted by the model trained on the learning data and the actual shape types of the test data. Kappa coefficients of ≥0.9, ≥0.8, ≥0.7, ≥0.6, and <0.6 are rated as great, good, OK (fair), possible, and re-work needed, respectively (Imai and Shiomi, 2004). Here we used a Kappa coefficient of 0.7 as the decision threshold for successful classification.
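The round-robin design and the Kappa index can be illustrated with a short sketch. The labels below are invented toy data, and `cohen_kappa` is a hypothetical helper that follows the standard definition of Cohen's coefficient, kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of the marginal proportions, summed over labels.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both sequences use one identical label")
    return (observed - expected) / (1.0 - expected)


# All unordered pairs of the nine shape types: 9C2 = 36 pairwise tests.
pairs = list(combinations(range(1, 10), 2))
print(len(pairs))

# Toy example for one pair (Type 1 vs Type 2): true types and predicted types.
true_types = [1, 1, 2, 2]
predicted_types = [1, 2, 2, 2]
kappa = cohen_kappa(true_types, predicted_types)
print(kappa, kappa >= 0.7)  # 0.7 is the success threshold used in this study
```

In the toy example three of four predictions agree (p_o = 0.75), but half of that agreement is expected by chance (p_e = 0.5), so the Kappa coefficient is well below the raw agreement.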

5.1.1 Classification by MVs: The x axes of Figures 12-17 show the 36 pairs and the y axes show the Kappa coefficient.
Sixteen of the 36 pairs showed a Kappa coefficient ≥0.7, suggesting that these 16 pairs were accurately classified by the MVs (Figure 12). The results suggested that the MVs are not sufficient for the classification of all of the strawberry shape types, but the MVs are useful in several combinations. We therefore concluded that the MVs could be used as the basic information for classification, and we used them for the subsequent analyses. Based on the results of the subsequent tests with additional descriptors (Figures 13-17), we concluded that the best combination for the classification was MVs + CCS + EFD (combination e).

Classification test for the nine strawberry shape types
We performed a classification test for the nine strawberry shape types with the combination of MVs + CCS + EFD for practical applications. That is, the data of all nine shape types were used at the same time for learning, and the classification ability was determined. The recall ratio, i.e., number of agreed / number of tested, was used for the investigation of accuracy, because the use of Kappa coefficients was not adequate for the investigation of accuracy in the classification of more than three types (Tsushima, 2002). The test design was as follows:
1) Seventy fruit images of each of the first eight types (Types 1-8) were used as learning data. For Type 9, 60 fruit images were used due to a smaller total number of images. The total number of fruit images for learning data was thus 620.
2) The fruit images for the test data did not overlap with the learning data. A total of 871 fruit images were used (Table 4). The numbers of images for each type ranged from 25 to 150.
3) The depth of the decision tree was set at 7 to increase the accuracy of learning in the complex structure of the learning data.
The results are shown in Table 4. A recall ratio ≥0.7 was observed in eight of the nine types. The recall ratio of Type 1 was 0.71, which is slightly more than 0.7, and classification of this type is thus considered possible. Type 7 showed a smaller recall ratio, 0.52, and this type was more difficult to classify than the other types.
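The recall ratio used here (number of agreed / number of tested, per shape type) can be computed as in the following sketch. The labels are invented toy data, not results from Table 4, and the function name `recall_per_type` is hypothetical.

```python
from collections import defaultdict


def recall_per_type(true_types, predicted_types):
    """Recall ratio for each true type: agreed predictions / tested fruits."""
    tested = defaultdict(int)
    agreed = defaultdict(int)
    for true_type, predicted_type in zip(true_types, predicted_types):
        tested[true_type] += 1
        if true_type == predicted_type:
            agreed[true_type] += 1
    return {t: agreed[t] / tested[t] for t in sorted(tested)}


# Toy test set: four Type 1 fruits and two Type 7 fruits.
true_types = [1, 1, 1, 1, 7, 7]
predicted_types = [1, 1, 1, 7, 7, 1]

recalls = recall_per_type(true_types, predicted_types)
print(recalls)
# Types whose recall ratio reaches the 0.7 threshold used in the text.
print([t for t, r in recalls.items() if r >= 0.7])
```

Because recall is computed per true type, the unequal numbers of test images per type (25 to 150) do not distort the comparison between types.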

Results summary
The results described above in section 5.1.1 suggested that the MVs (i.e., the length of the contour line, the area, the fruit length and width, and the fruit width/length) were useful as basal descriptors for the classification of strawberry shapes. In the next test, adding the ESI, the EFD or the CCS to the MVs improved the classification ability, in the descending order CCS > EFD > ESI. Based on the subsequent 3-fold cross-validation, we concluded that the best combination for the classification of strawberry shapes was MVs + CCS + EFD. We further investigated the classification ability of MVs + CCS + EFD by using the learning data with the nine shape types simultaneously. Although the Type 7 results indicated that re-working was necessary, the other types showed a recall ratio >0.7 (Types 1, 2, 3, 4, 5, 6, 8 and 9). These results suggested that eight of the nine strawberry shape types could be classified by using MVs + CCS + EFD. Nagata et al. (1996) reported that the agreement ratio (recall) between their method and personal decision was 71% in the classification of four types of strawberry fruit shape (triangle, square, round and abnormal), and considered this a 'high' ratio because the average agreement ratio of classifications among multiple persons was 67%. In this study, we classified the strawberry fruits into nine types, which is a more complex classification than that of Nagata et al. (1996). We therefore concluded that an agreement ratio of 70% is a considerably high value.

Classification of type 7 and type 1
Both Type 7 and Type 1, which showed the worst and second-worst accuracy rates, have a horizontal ellipse-based shape. The difference between these two types is the shape of the top part: Type 1 is curved and Type 7 is flat. This difference is often subtle and difficult to identify, because the degree of curving of the top part changes depending on the orientation in which the fruit is set on the tray. Therefore, for the classification between Type 1 and Type 7, it is necessary to improve the classification ability for the top part of the fruit. We considered several options to improve the accuracy of distinguishing Types 1 and 7: (1) zooming in on the top part of the fruit and adding a further image analysis; (2) increasing the number of chain code directions (e.g., from eight to 16) for higher accuracy in the expression of the shape; (3) increasing the number of sample images; and (4) trying reinforcement learning. Solutions (2) to (4) are general approaches and are effective for a variety of targets. The effectiveness of reinforcement learning has been reported, and it is expected to become one of the most promising approaches for increasing the accuracy of shape classification. Kochi et al. (2018) reported a robust method for the reconstruction of three-dimensional models of strawberry fruits from digital images. The use of three-dimensional models would support solutions (1)-(4) and further improve accuracy, because a larger number of contour lines (or descriptors) could be detected from the circumference of a 3D model than from 2D images, increasing the data available for analysis and learning.

The classification abilities of CCS and EFD
Our present findings indicated that the Fourier series and chain codes, which represent the features of shapes in a numeric form, were effective in the classification of strawberry shapes. The CCS in particular showed higher classification ability than the EFD and the ESI. Because the CCS was determined based on the classification results of the sample patterns by humans, it includes human knowledge in the descriptor. In other words, CCS is a descriptor that adds human knowledge to the chain codes, and the combination resulted in a more robust classification than the other indices. Because the classification with MVs + CCS showed a Kappa coefficient <0.7 only for the classifications of Type 1 versus Type 7, Type 2 versus Type 4, and Type 2 versus Type 5, it was predicted that CCS better classifies the shape of the bottom part of the strawberry (see Figure 13). EFDs quantify the features of fruit shapes and include comprehensive information such as sharpness in the bottom part, roundness of the fruit, and curving or raising of the top part. Their effectiveness was lower than that of the chain codes. However, by adding the EFD to MVs + CCS, Type 2 versus Type 4 and Type 2 versus Type 5 were successfully classified (see Figure 14). We thus considered that using EFDs improved the classification of the shape of the strawberries' top part.

Conclusion
The results obtained in this study suggested that machine learning has a high ability to classify fruit shapes. Machine learning has so far gained more attention in the study of plant genetics than in the identification of plant shapes. The investigation of the phenotypes of organisms on a large scale is known as 'phenomics', which is becoming a significant field in biology. It is expected that machine learning will contribute to phenomics as well as genetics. Based on the present findings, we will attempt to increase the classification accuracy of machine learning and the number of target plant species. The system's packaging and automation are also necessary for practical applications.

Future perspectives
Automation and labour-saving in the classification of strawberry fruit shapes are desired by strawberry farmers, distributors and breeders. The process of classifying fruit shape and packing occupies more than 60% of the working time of farmers (Suenaga et al., 1989). Farmers must pack the fruits in a limited time; for example, 20,000 fruits are classified and packed (Hayashi et al., 2004). Hayashi et al. (2004) therefore estimated that more than 80% of workers desired automation of classification. Meanwhile, only a few studies have reported the use of machine learning for the classification of fruit shapes, e.g., classification into large or small fruit size, or into four types (Cao et al., 1996). The results obtained in this study suggested that machine learning is effective for the classification of fruit shape into nine types. In order to realize automation of the process of strawberry shape classification, we will improve the classification accuracy of machine learning and try to develop an automation system for fruit classification.

Figure 4. Input image
Figure 5. Extracted fruit images

Figure 8. The work of SHAPE

Measured Values (MVs): Five measured values were used: (1) the length of the contour line, (2) the area, (3) the major axis of the approximate ellipse (i.e., the fruit length), (4) the minor axis of the approximate ellipse (i.e., the fruit width), and (5) the ratio of the fruit width/length.

4.2.2 Ellipse Similarity Index (ESI): To investigate the ellipse similarity of the fruits, we used two feature descriptors: the optimum ellipse area ratio, and the optimum ellipse boundary length ratio.
(a) Optimum ellipse area ratio (ER) = Optimum ellipse area (EA) / Fruit area (FA), where EA = a × b × π/4 (a = length, i.e., the major axis; b = width, i.e., the minor axis).
(b) Optimum ellipse boundary length ratio = Optimum ellipse boundary length (L) / fruit boundary length (LI), obtained with the following equation: L

Figure 9. Reconstruction of a contour line of a strawberry by Fourier series. White and blue lines show the contour lines of the strawberry and the harmonic ellipses, respectively.

Figure 11. Distribution of CCS

5. RESULTS

Figure 12. The results of classification by MVs

5.1.2 The ESI, EFD, and CCS: A further classification was performed by adding the ESI, the EFD, and the CCS, respectively, to the MVs. The results are indicated in Figure 13.

Figure 13. The classification results by adding the ESI (blue), the EFD (orange) and the CCS (grey) to the measured values.

The numbers of pairs that showed Kappa coefficients ≥0.7 were:
a) MVs + ESI = 21/36 pairs
b) MVs + EFD = 23/36 pairs
c) MVs + CCS = 33/36 pairs
Larger Kappa coefficients were shown for many of the pairs with the additional descriptors. MVs + CCS showed the highest classification ability. Based on the results described above, we used the following combinations for a further classification test:
d) MVs + CCS + ESI
e) MVs + CCS + EFD
The results are illustrated in Figure 14.

Figure 14. The classification results by MVs + CCS + ESI (blue) and MVs + CCS + EFD (orange)

The numbers of pairs that showed Kappa coefficients ≥0.7 were:
d) MVs + CCS + ESI = 33/36 pairs
e) MVs + CCS + EFD = 35/36 pairs
An improvement in classification ability was observed by adding EFD to MVs + CCS, i.e., combination c). The addition of ESI did not produce significant differences in the classification by MVs + CCS. To test the classification ability in a different way, we performed a 3-fold cross-validation for combinations c) and e) above, which showed the best and second-best results. The structure of the data set is shown in Figure 15, and the results are illustrated in Figures 16 and 17.

Figure 15. Structure of data sets in the 3-fold cross-validation

Figure 17. The classification results with MVs + CCS + EFD in the 3-fold cross-validation.

The numbers of pairs that showed Kappa coefficients ≥0.7 in all three folds were:
c) MVs + CCS = 31/36 pairs
e) MVs + CCS + EFD = 34/36 pairs

Table 3. Number of images of each shape type used in the training data and test data

Table 4. Classification results with MVs + CCS + EFD and using all nine of the strawberry shape types at once for the learning data