Anomaly detection in the Ivory Coast

In the Ivory Coast alone, 2 million tons(!) of cocoa beans were harvested last year, which makes the country the largest producer in the world. Unfortunately, corruption and child labor cannot be ruled out. Organizations like UTZ Certified exist to fight this.

Logo UTZ certified

Does this logo look familiar to you? Surely you have seen it on packaging in the supermarket before. UTZ Certified, now part of the Rainforest Alliance, is a non-profit organization that aims to protect farmers, create fair working conditions, and document the entire supply chain from start to finish in order to achieve complete transparency.

Despite very careful certification processes, it cannot yet be ruled out that farmers report false production data for their own benefit, nor can such cases yet be identified in time. To support UTZ in recognizing false declarations (so-called potential outliers or anomalies), my bachelor thesis examined the use of machine learning algorithms to identify outliers and thus possible fraud.

Just as with classical machine learning methods, anomaly detection with machine learning distinguishes three cases:

Supervised Anomaly Detection
Semi-Supervised Anomaly Detection and
Unsupervised Anomaly Detection

They differ in the availability of labels. If labels are available, a model can be trained on them and then used to decide whether future values are anomalous. Since I did not have any labels, I resorted to unsupervised learning. Selected features were then prepared and normalized, and the data was fed to ten different unsupervised algorithms developed specifically for such cases. The following algorithms were used:

Global k-nearest-neighbor (kNN Global)
Local Outlier Factor (LOF)
Connectivity-based Outlier Factor (COF)
Local Outlier Probability (LoOP)
Influenced Outlierness (INFLO)
Cluster-based Local Outlier Factor (CBLOF)
Unweighted CBLOF
Local Density Cluster-based Outlier Factor (LDCOF)
Histogram-based Outlier Score (HBOS)
Eta One-Class Support Vector Machine (eta OC SVM)
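
Before algorithms like these can be run, the features have to be brought onto a comparable scale, because distance-based methods would otherwise be dominated by the feature with the largest numbers. The text above does not say which normalization was used, so the following is only a minimal sketch that assumes simple min-max scaling. The function name `min_max_normalize` and the two example features (farm area and reported yield) are hypothetical and chosen purely for illustration.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` to the range [0, 1].

    Assumption: min-max scaling; the original work may have used
    a different normalization.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical features per farmer: (farm area in hectares, reported yield in kg)
farmers = [
    (2.0, 900.0),
    (4.0, 1800.0),
    (3.0, 1500.0),
    (2.5, 6000.0),
]

scaled = min_max_normalize(farmers)
for row in scaled:
    print(row)
```

After scaling, both features lie between 0 and 1, so neither of them dominates a distance computation merely because of its unit.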

What is special about these algorithms is that they all output a continuous score indicating how likely an instance is to be an anomaly, and that between them they cover both global and local anomalies. What are global and local anomalies? Global anomalies are instances that stand out from all other data points, for example because they have very high values. They can usually be recognized with the naked eye. A local anomaly, in contrast, only stands out from a smaller group of nearby data points, so it becomes visible only when you look at that section of the data set.
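
The difference between global and local anomalies can be illustrated with a small toy example. This is a sketch under stated assumptions, not one of the ten algorithms listed above: `knn_score` is a basic global score (average distance to the k nearest neighbors), and `local_ratio` is a deliberately simplified, LOF-like ratio that compares a point's score with the scores of its neighbors. The one-dimensional data, the function names, and the choice of k = 2 are all made up for illustration.

```python
def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index]."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: abs(points[index] - points[j]))
    return others[:k]


def knn_score(points, index, k):
    """Global score: average distance to the k nearest neighbors."""
    nbrs = nearest_neighbors(points, index, k)
    return sum(abs(points[index] - points[j]) for j in nbrs) / k


def local_ratio(points, index, k):
    """Simplified local score (not the real LOF): own kNN score divided
    by the mean kNN score of the point's neighbors. Values well above 1
    mean the point is much more isolated than its neighborhood."""
    nbrs = nearest_neighbors(points, index, k)
    neighbor_mean = sum(knn_score(points, j, k) for j in nbrs) / k
    return knn_score(points, index, k) / neighbor_mean


dense_group = [0.0, 0.1, 0.2, 0.3, 0.4]         # tightly packed values
sparse_group = [10.0, 12.0, 14.0, 16.0, 18.0]   # widely spaced values
local_anomaly = 2.0     # close to the dense group, but not part of it
global_anomaly = 40.0   # far away from everything

points = dense_group + sparse_group + [local_anomaly, global_anomaly]
k = 2

global_scores = [knn_score(points, i, k) for i in range(len(points))]
local_scores = [local_ratio(points, i, k) for i in range(len(points))]

for value, g, l in zip(points, global_scores, local_scores):
    print(f"value={value:5.1f}  global={g:6.2f}  local={l:6.2f}")
```

On this toy data the global score ranks the value 40.0 highest, but it ranks 2.0 below every member of the sparse group, because the ordinary spacing in that group is larger than the gap between 2.0 and the dense group. The local ratio, which measures each point against its own neighborhood, puts 2.0 at the top instead.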

In unsupervised learning, the results cannot be validated against labels. To produce a robust result nonetheless, the global and local aspects are combined into a so-called outlier ensemble. Each algorithm generated a score for each farmer, and these scores were then combined into a single score, either by taking the average or by using the maximum value. Validation with a domain expert from UTZ confirmed that the scores point in the right direction, and they will be used for the further verification of farmers in the future.
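
The combination step can be sketched in a few lines. The example below is a hedged illustration, not the thesis implementation: it assumes that each algorithm's scores are first rescaled to [0, 1] so that they are comparable (the text does not say how the scores were aligned), and the names `rescale` and `combine` as well as all score values are hypothetical.

```python
def rescale(scores):
    """Map one algorithm's scores to [0, 1] (assumed alignment step)."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def combine(score_lists, how="mean"):
    """Combine per-algorithm scores into one score per farmer.

    score_lists: one list per algorithm, each with one score per farmer.
    how: "mean" averages the rescaled scores, "max" keeps the largest.
    """
    rescaled = [rescale(scores) for scores in score_lists]
    per_farmer = list(zip(*rescaled))  # one tuple of scores per farmer
    if how == "mean":
        return [sum(values) / len(values) for values in per_farmer]
    if how == "max":
        return [max(values) for values in per_farmer]
    raise ValueError(f"unknown combination rule: {how!r}")


# Made-up scores from three algorithms for four farmers.
algorithm_a = [1.0, 1.2, 5.0, 1.1]
algorithm_b = [0.2, 0.9, 0.3, 0.25]
algorithm_c = [10.0, 12.0, 30.0, 11.0]

all_scores = [algorithm_a, algorithm_b, algorithm_c]
mean_scores = combine(all_scores, how="mean")
max_scores = combine(all_scores, how="max")

print("mean:", mean_scores)
print("max: ", max_scores)
```

Averaging favors farmers that several algorithms agree on, while the maximum keeps any single strong signal, even if only one algorithm raises it. In the made-up data, the third farmer leads under the average, whereas the second and third farmer tie under the maximum.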