4 Farm typologies

This chapters proposes a step-by-step method to understand farm diversity and select representative farms for e.g., on-farm experiments.

4.1 Concept of farm typology

Farm typologies aims at characterizing the diversity of farming systems and their distribution in heterogenous communities, as a basis for prioritizing interventions (Hammond et al. 2020) and/or to select representative farms, or ‘prototype farms’ (Alvarez et al. 2018).

Various methodologies, all with their strengths and their weaknesses, are commonly used to delineate typologies, from (quantitative) statistical typology (Tittonell et al. 2010) to (qualitative) expert-based typology involving farmers and other knowledgeable stakeholders (Kebede 2009).

Here, we are using a method of statistical typology delineation developed by Hassall et al. (2023). One of the key strength of the method is that non-metric multidimensional scaling can be used, allowing for the selection of qualitative variables, not only quantitative ones, for typology delineation.

In the example below, we will use the small dataset we created in the previous chapter (and that was saved as a rds file). You can also download it here (and load it with the function read.csv()): https://github.com/FBaudron/Supporting-codesign/blob/main/04-typologies/small_dataset.csv

A standalone R script containing the full workflow is available at the following link: https://github.com/FBaudron/Supporting-codesign/blob/main/04-typologies/typology.R.

4.2 Data preparation

We begin by clearing the environment, ensuring all necessary packages are installed and loaded, and importing the data.

rm(list = ls())


if (!require("openxlsx")) install.packages("openxlsx")
if (!require("vegan")) install.packages("vegan")
if (!require("ade4")) install.packages("ade4")
if (!require("dendextend")) install.packages("dendextend")
if (!require("gtsummary")) install.packages("gtsummary")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("splitstackshape")) install.packages("splitstackshape")
if (!require("NbClust")) install.packages("NbClust")
if (!require("randomForest")) install.packages("randomForest")


library(openxlsx)
library(vegan)
library(ade4)
library(dendextend)
library(gtsummary)
library(kableExtra)
library(splitstackshape)
library(NbClust)
library(randomForest)

data = readRDS("small_dataset.rds")
# data = read.xlsx('small_dataset.xlsx', sheet = 1)

We first rename some variables, recode some categorical variables into binary ones, and convert NAs into zeros for some variables.

names(data)[c(1, 4:6, 27, 29, 33:36)] = c("ID", "age_hhh", "sex_hhh", "edu_hhh",
    "cover_crops", "integrated_pest_management", "homemade_feeds", "fodder_production",
    "fodder_preservation", "survival_feeding")

data$sex_hhh = as.numeric(data$sex_hhh == "Female")

data$edu_hhh = as.numeric(data$edu_hhh != "Primary level")

data$main_food_source = as.numeric(data$main_food_source == "Own production")

data$main_income_source = as.numeric(data$main_income_source == "Crop sales" | data$main_income_source ==
    "Livestock sales")

data[c(23:36)][is.na(data[c(23:36)])] = 0

4.2.1 Removing outliers

We remove extreme values to avoid skewing the analysis. For this, we plot and examine boxplots. Points beyond the whiskers of the boxplot have values higher than 1.5 times the interquartile range and are considered potential outliers. When such extreme points are observable, we remove them from the dataset.

Age of the head of the household doesn’t have any true marked outlier.

boxplot(data$age_hhh)

But family size does.

boxplot(data$family_size)

Thus, we remove all records with a family size greater than 20.

data = data[data$family_size < 20, ]
boxplot(data$family_size)

We inspect all other continuous variables, and find outliers for the number of small ruminants, the number of poultry, livestock offtake, and equipment value. We remove these extreme values from the dataset that we will use to delineate the farm typology.

# boxplot(data$cropped_area)

# boxplot(data$fallow)

# boxplot(data$prop_non_cereals)

# boxplot(data$fertilizers)

# boxplot(data$amendment)

# boxplot(data$cattle)

# boxplot(data$small_rum)
data = data[data$small_rum < 100, ]
# boxplot(data$small_rum)

# boxplot(data$poultry)
data = data[data$poultry < 400, ]
# boxplot(data$poultry)

# boxplot(data$food_security)

# boxplot(data$hdds)

# boxplot(data$tot_div)

# boxplot(data$cereal_prod)

# boxplot(data$offtake)
data = data[data$offtake < 6, ]
# boxplot(data$offtake)

# boxplot(data$eq_value)
data = data[data$eq_value < 8000, ]
# boxplot(data$eq_value)

4.2.2 Normalization of continuous variables

To reduce the influence of extreme values, continuous variables are log-transformed.

We create first a copy of the dataset before data transformation.

typo = data

We plot histograms of continuous variables, and when distribution departs too much from a Gaussian distribution, we logtransform the variable. In that case, we take care of adding a term - half of the minimum positive value - to avoid passing zeros to log10.

Age of the head of the household has a relatively Gaussian distribution.

hist(typo$age_hhh)

But family size has a distribution that departs markedly from a Gaussian distribution.

hist(typo$family_size)

Thus, we logtransform the variable family size.

typo$family_size = log10(typo$family_size + (0.5 * min(typo$family_size[typo$family_size >
    0], na.rm = TRUE)))
hist(typo$family_size)

We inspect all other continuous variables, and find the total cropped area, the fallow area, the total quantity of fertilizer used, the total quantity of organic amendments used, the number of cattle, the number of small ruminants, the number of poulty, the total quantity of cereal produced, the livestock offtake, and the equipment value to depart markedly from a Gaussian distribution. We therefore log-transformed these variables.

# hist(typo$cropped_area)
typo$cropped_area = log10(typo$cropped_area + (0.5 * min(typo$cropped_area[typo$cropped_area >
    0], na.rm = TRUE)))
# hist(typo$cropped_area)

# hist(typo$fallow)
typo$fallow = log10(typo$fallow + (0.5 * min(typo$fallow[typo$fallow > 0], na.rm = TRUE)))
# hist(typo$fallow)

# hist(typo$prop_non_cereals)

# hist(typo$fertilizers)
typo$fertilizers = log10(typo$fertilizers + (0.5 * min(typo$fertilizers[typo$fertilizers >
    0], na.rm = TRUE)))
# hist(typo$fertilizers)

# hist(typo$amendment)
typo$amendment = log10(typo$amendment + (0.5 * min(typo$amendment[typo$amendment >
    0], na.rm = TRUE)))
# hist(typo$amendment)

# hist(typo$cattle)
typo$cattle = log10(typo$cattle + (0.5 * min(typo$cattle[typo$cattle > 0], na.rm = TRUE)))
# hist(typo$cattle)

# hist(typo$small_rum)
typo$small_rum = log10(typo$small_rum + (0.5 * min(typo$small_rum[typo$small_rum >
    0], na.rm = TRUE)))
# hist(typo$small_rum)

# hist(typo$poultry)
typo$poultry = log10(typo$poultry + (0.5 * min(typo$poultry[typo$poultry > 0], na.rm = TRUE)))
# hist(typo$poultry)

# hist(typo$food_security)

# hist(typo$hdds)

# hist(typo$tot_div)

# hist(typo$cereal_prod)
typo$cereal_prod = log10(typo$cereal_prod + (0.5 * min(typo$cereal_prod[typo$cereal_prod >
    0], na.rm = TRUE)))
# hist(typo$cereal_prod)

# hist(typo$offtake)
typo$offtake = log10(typo$offtake + (0.5 * min(typo$offtake[typo$offtake > 0], na.rm = TRUE)))
# hist(typo$offtake)

# hist(typo$eq_value)
typo$eq_value = log10(typo$eq_value + (0.5 * min(typo$eq_value[typo$eq_value > 0],
    na.rm = TRUE)))
# hist(typo$eq_value)

4.3 Delineation of farm types

Next, we calculate dissimilariries and distances, reduce the dimensionality of the dataset through principal coordinate analysis, and delineate farm types through hierarchical cluster analysis.

We first remove records with missing values.

typo = na.omit(typo)

4.3.1 Calculation of dissimilarities and distances

We calculate dissimilarities for 3 groups of variables: structural variables, functional variables, and adoption variables. Structural and functional variables are further divided into continuous and binary variables, for which different methods are used in the calculation of dissimilarities.

We first calculate the scaled euclidean dissimilarity between continuous structural variables using the function vegdist() of the R package vegan (Oksanen et al. 2025).

dEucs = vegdist(typo[, c("age_hhh", "family_size", "cropped_area", "fallow", "prop_non_cereals",
    "cattle", "small_rum", "poultry", "eq_value")], method = "gower")

Similarly, we calculate the scaled euclidean dissimilarity between continuous functional variables.

dEucf = vegdist(typo[, c("fertilizers", "amendment", "food_security", "hdds", "tot_div",
    "cereal_prod", "offtake")], method = "gower")

We then calculate the dissimilarity between binary structural variables using the function dist.binary() of the R package ade4 (Thioulouse et al. 2018).

dBin_ppls = dist.binary(typo[, c("sex_hhh", "edu_hhh", "bee_hives", "ind_garden",
    "com_garden")], method = 2)

Similarly, we calculate the dissimilarity between binary functional variables.

dBin_pplf = dist.binary(typo[, c("main_food_source", "main_income_source")], method = 2)

Finally, we calculate the dissimilarity between binary adoption variables.

d_practices = dist.binary(typo[, c("community_seed_banks", "small_grains", "crop_rotation",
    "intercropping", "cover_crops", "mulching", "integrated_pest_management", "compost_manure",
    "tree_planting_on_farm", "tree_retention_natural_regeneration_on_farm", "homemade_feeds",
    "fodder_production", "fodder_preservation", "survival_feeding")], method = 2)

We combine the 2 dissimilarities of structural variables, weighing them by the number of variables in each dissimilarity.

We do the same for the 2 dissimilarities of functional variables.

d_structural = (9 * dEucs^2 + 5 * dBin_ppls^2)/14

d_functional = (7 * dEucf^2 + 2 * dBin_pplf^2)/9

We combine the three dissimilarity matrices — structural, functional, and adoption variables — into a single overall dissimilarity. Because the adoption component includes many binary variables, we downweight its contribution to avoid letting it dominate the resulting typology. Specifically, we multiply the adoption dissimilarity by 0.5 before combining, and then normalize by dividing by the sum of the weights (2.5) to keep the overall scale consistent. The factor of 0.5 used is arbitrary, but could be fine-tuned through iterations.

dAll = (d_structural + d_functional + 0.5 * d_practices^2)/2.5

Finaly, we transform the matrix of dissimilarities to a matrix of distances.

distAll = sqrt(2 * dAll)

4.3.2 Principal coordinate analysis & hierarchical cluster analysis

We reduce dimensionality while preserving distances using using principal coordinate analysis (PCO) implemented via the base R function cmdscale().

We select k axes to explain at least 50% of the variability .

pco = cmdscale(distAll, eig = TRUE, k = 10)
barplot(pco$eig[1:20])

cum_var = cumsum(pco$eig)/sum(pco$eig)
k_axes = which(cum_var >= 0.5)[1]
k_axes

[1] 4

We cluster farms based on PCO axes through hierarchical cluster analysis (HCA) carried out using the base R function hclust().

We plot the dendrogram.

pco_var = pco$points[, 1:k_axes]

hc_pco = hclust(dist(pco_var), method = "complete")
plot(hc_pco, hang = -1)

4.3.3 Selection of the number of clusters

We select the optimum number of clusters using the function NbClust() of the R package NbClust (Charrad et al. 2014), which computes multiple cluster validity indices (30 in our case) and compares them. The final number of clusters is determined using a majority rule, where each index ‘votes’ for a number of clusters and the number supported by the largest number of indices is chosen.

NbClust(pco_var, diss = dist(pco_var), distance = NULL, method = "complete")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 4 proposed 2 as the best number of clusters 
* 2 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 2 proposed 7 as the best number of clusters 
* 4 proposed 10 as the best number of clusters 
* 1 proposed 13 as the best number of clusters 
* 4 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
*******************************************************************

$All.index
        KL      CH Hartigan     CCC     Scott   Marriot  TrCovW  TraceW
2   1.9625 72.0646  43.1709 -5.1974  236.6688 1596.8350 30.0213 19.0591
3   0.4950 63.4680  68.8217 -9.0010  353.6366 2279.2036 18.3870 16.2997
4   7.4435 76.4151  33.7409 -5.7550  633.6448 1362.9670 10.2851 12.8248
5   0.1917 73.0997  45.3120 -6.1273  761.7904 1293.4722  8.3896 11.3157
6   1.3434 77.7476  36.5987 -3.1357  944.7205  914.0985  5.8329  9.5911
7   8.0015 80.0165  15.2453 -1.1029 1053.3798  815.2061  4.7069  8.3706
8   0.1962 74.6461  25.6635 -1.6139 1100.9807  884.7331  4.0523  7.8895
9   1.0510 74.9528  23.7527 -0.5060 1206.6020  742.3897  3.5915  7.1523
10 34.9270 75.3400  12.3494  0.5180 1302.6660  630.6844  3.1240  6.5272
11  0.6694 72.1379  11.3744  0.2748 1346.8089  642.6907  2.7626  6.2163
12  0.1881 69.3630   7.7251  0.0600 1392.9727  639.0998  2.5024  5.9416
13 35.2137 65.9610   9.2750 -0.4788 1427.3519  656.1405  2.3591  5.7600
14  0.0045 63.6531  23.5850 -0.7189 1473.8089  635.1265  2.2128  5.5491
15  8.1803 66.2540   9.7789  0.7913 1589.2913  465.1975  1.8209  5.0581
   Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2  Beale Ratkowsky
2    1.1437 1.2826 0.4627 1.8797     0.2032 0.7318  46.1796 0.8779    0.1985
3    1.9536 1.4997 0.4461 1.6481     0.1752 0.6038  83.3362 1.5718    0.2320
4    4.5921 1.9061 0.4219 1.5696     0.2122 0.5819  43.8218 1.7064    0.2608
5    5.3597 2.1603 0.4035 1.3629     0.2229 0.5600  49.5005 1.8673    0.2900
6    7.3190 2.5488 0.4416 1.3070     0.2533 0.6521  35.7503 1.2692    0.2938
7    8.2513 2.9204 0.4134 1.3268     0.2640 0.6361  21.1712 1.3450    0.2906
8    8.8637 3.0985 0.4172 1.2797     0.2509 0.5048  33.3575 2.3009    0.2762
9   10.2412 3.4178 0.4039 1.1882     0.2611 0.6503  31.1912 1.2763    0.2686
10  11.4810 3.7452 0.3929 1.1489     0.2719 0.6977  21.6634 1.0255    0.2618
11  12.1036 3.9324 0.3994 1.2175     0.2483 0.7228  10.7362 0.8938    0.2523
12  12.9869 4.1143 0.4142 1.2728     0.2421 0.7429   9.3449 0.8057    0.2433
13  13.4337 4.2440 0.4148 1.2218     0.2454 0.3628  21.0719 3.9132    0.2358
14  14.1698 4.4053 0.4199 1.1575     0.2444 0.6213  22.5571 1.4331    0.2294
15  16.4827 4.8329 0.3988 1.1471     0.2608 0.5936  14.3791 1.5779    0.2256
     Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
2  9.5296     0.3479 0.1994  0.7892 0.0697 0.0692 14.8165 0.2640 1.2413
3  5.4332     0.4039 0.1306  1.2356 0.0674 0.0793 12.2688 0.2424 0.9594
4  3.2062     0.4745 0.1319  1.9716 0.0736 0.0973 11.3109 0.2104 0.5103
5  2.2631     0.4896 0.0831  2.2104 0.0741 0.1025 10.2074 0.1962 0.4275
6  1.5985     0.5146 0.1635  2.4950 0.0878 0.1124 10.1193 0.1794 0.3865
7  1.1958     0.5217 0.1528  2.9490 0.0908 0.1237 10.1449 0.1678 0.3377
8  0.9862     0.5225 0.1245  3.0371 0.0935 0.1253 10.0881 0.1629 0.2970
9  0.7947     0.5253 0.1011  3.1675 0.0940 0.1311 10.2124 0.1551 0.3008
10 0.6527     0.5310 1.1800  3.3425 0.0974 0.1364  9.8162 0.1490 0.2538
11 0.5651     0.4957 0.2488  3.9393 0.0952 0.1407 12.9801 0.1452 0.2411
12 0.4951     0.4890 0.0694  4.1510 0.1011 0.1444 12.9797 0.1425 0.2365
13 0.4431     0.4901 0.1565  4.1792 0.1026 0.1459 13.2881 0.1406 0.2206
14 0.3964     0.4894 0.1347  4.2132 0.1045 0.1468 12.8471 0.1372 0.1994
15 0.3372     0.4845 0.1589  4.5381 0.1068 0.1548 12.7409 0.1307 0.1942

$All.CriticalValues
   CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2          0.6623            64.2590       0.4769
3          0.6629            64.5683       0.1805
4          0.5863            43.0451       0.1492
5          0.5902            43.7378       0.1167
6          0.5976            45.1144       0.2824
7          0.5173            34.5247       0.2560
8          0.5041            33.4479       0.0618
9          0.5800            42.0003       0.2800
10         0.5607            39.1821       0.3952
11         0.4720            31.3281       0.4703
12         0.4656            30.9841       0.5241
13         0.3008            27.8883       0.0079
14         0.5173            34.5247       0.2259
15         0.4195            29.0549       0.1877

$Best.nc
                     KL      CH Hartigan     CCC    Scott  Marriot  TrCovW
Number_clusters 13.0000  7.0000   4.0000 15.0000   4.0000   4.0000  3.0000
Value_Index     35.2137 80.0165  35.0808  0.7913 280.0083 846.7419 11.6342
                TraceW Friedman   Rubin  Cindex      DB Silhouette   Duda
Number_clusters 4.0000   4.0000  7.0000 10.0000 15.0000    10.0000 2.0000
Value_Index     1.9658   2.6385 -0.1935  0.3929  1.1471     0.2719 0.7318
                PseudoT2  Beale Ratkowsky   Ball PtBiserial Frey McClain
Number_clusters   2.0000 2.0000    6.0000 3.0000     10.000    1  2.0000
Value_Index      46.1796 0.8779    0.2938 4.0964      0.531   NA  0.7892
                   Dunn Hubert SDindex Dindex    SDbw
Number_clusters 15.0000      0 10.0000      0 15.0000
Value_Index      0.1068      0  9.8162      0  0.1942

$Best.partition
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  1   2   2   2   2   2   2   2   3   2   4   2   3   2   2   2   2   1   1   2 
 21  23  24  25  26  27  28  29  30  31  32  34  35  36  37  38  39  40  41  42 
  2   2   3   2   1   3   3   1   3   4   3   4   3   3   4   3   4   2   4   1 
 43  44  45  46  47  48  49  51  52  53  54  55  56  57  58  59  60  61  63  64 
  3   2   2   1   4   1   2   1   3   3   2   4   3   4   1   2   4   3   2   4 
 65  66  67  68  69  70  71  72  74  75  76  77  78  80  81  82  83  84  85  86 
  4   1   1   4   4   1   2   4   1   3   4   4   4   1   1   1   2   1   2   1 
 87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 
  2   1   1   1   1   4   1   2   2   4   1   3   2   3   4   1   1   2   3   3 
107 108 109 110 111 112 113 114 115 116 117 118 119 121 123 124 126 127 128 129 
  2   2   3   2   4   4   3   4   3   4   3   4   4   1   4   1   1   3   2   1 
130 131 132 133 134 135 136 137 138 139 140 141 142 143 146 147 148 149 150 151 
  4   4   4   2   1   1   4   4   1   1   1   4   4   4   1   1   1   1   2   3 
152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 
  4   2   1   3   4   1   4   3   3   3   4   4   4   1   4   4   1   4   1   2 
172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 
  1   1   2   3   4   3   2   2   3   4   3   3   3   4   3   3   2   4   4   4 
192 193 194 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 
  4   2   4   4   4   3   3   3   1   2   2   2   1   1   3   4   3   3   4   3 
213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 
  3   3   3   4   3   3   2   3   2   2   2   2   1   4   2   2   3   2   2   3 
233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 
  3   2   1   2   4   4   4   1   1   3   1   2   2   2   1   2   2   3   4   1 
253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 
  3   3   2   3   1   3   2   2   2   1   2   1   1   1   1   4   4

We plot the dendrogram using the suggested optimum number of clusters of 4, visualized with functions from the R package dendextend (Galili 2015). We reset the plotting window to default before that (single plot)

par(mfrow = c(1, 1))

hdend = as.dendrogram(hc_pco)
hdend = color_branches(hdend, k = 4)
hdend = color_labels(hdend, k = 4)
plot(hdend)

grpPCO = cutree(hc_pco, k = 4)

We plot the clusters on PCO planes.

plot(pco$points[, 1], pco$points[, 2], col = grpPCO)

plot(pco$points[, 1], pco$points[, 3], col = grpPCO)

plot(pco$points[, 2], pco$points[, 3], col = grpPCO)

We add ‘Type’ in our dataset.

typo$type = grpPCO


data = merge(data, typo[, c(1, 41)], by = "ID")

4.4 Interpretation of the farm typology

4.4.1 Random forest

To understand which variables are the most segregating in our typology, we fit a random forest classification model and assess variable importance through the Mean Decrease in Gini index using the R package randomForest (Liaw and Wiener 2002).

In our case, the prediction is good, with a low out of bage estimate of error rate of 10.75%, and class errors for the different types between 5.19% and 18.18%.

data$type = as.factor(data$type)

rf_type = randomForest(type ~ ., data = data[, -c(1:3)], ntree = 1500)
print(rf_type)


Call:
 randomForest(formula = type ~ ., data = data[, -c(1:3)], ntree = 1500) 
               Type of random forest: classification
                     Number of trees: 1500
No. of variables tried at each split: 6

        OOB estimate of  error rate: 15.95%
Confusion matrix:
   1  2  3  4 class.error
1 43 10  1  9  0.31746032
2  3 64  2  0  0.07246377
3  0  0 55  5  0.08333333
4  4  0  7 54  0.16923077

Next, we plot variable importance.

The plot shows that the most segregating variables in our typology are farming as the main source of income, and total diversity (crop and livestock).

varImpPlot(rf_type)

4.4.2 Summary table

To further interprete the typology, we produce a table from variables used in the delineation of the typology, displaying means and standard deviations for continuous ones and % for binary ones for the different types, using tbl_summary() from the R package gtsummary (see previous chapter.

add_p() adds a column of statistical test, and add_overall() values for the whole population.

data[, -c(1:3)] %>%
    tbl_summary(by = type, statistic = list(all_continuous() ~ "{mean} ({sd})", all_categorical() ~
        "{p}%"), type = list(hdds ~ "continuous"), digits = all_continuous() ~ 1,
        label = list(age_hhh ~ "Age of the head of the household", sex_hhh ~ "Female-headed households",
            edu_hhh ~ "Education of the head of the household higher than primary",
            family_size ~ "Family size", eq_value ~ "Equipment value (USD)", cropped_area ~
                "Total cropped area (ha)", fallow ~ "Fallow land (ha)", prop_non_cereals ~
                "Non-cereal crops (% total cropped area)", fertilizers ~ "Total fertilizer used (kg)",
            amendment ~ "Total manure + compost used (kg)", cattle ~ "Cattle (n)",
            small_rum ~ "Small ruminants (n)", poultry ~ "Poultry (n)", bee_hives ~
                "Bee keeping", ind_garden ~ "Owning an individual garden", com_garden ~
                "Having access to a communal garden", main_food_source ~ "Own production as main source of food",
            main_income_source ~ "Farming as main source of income", food_security ~
                "Proportion of the year being food secured", hdds ~ "24H household dietary diversity score (0-12)",
            community_seed_banks ~ "Using a community seed bank", small_grains ~
                "Using small grains", crop_rotation ~ "Using crop rotation", intercropping ~
                "Using intercropping", cover_crops ~ "Using cover crops", mulching ~
                "Using mulching", integrated_pest_management ~ "Using integrated pest management",
            compost_manure ~ "Using compost and manure", tree_planting_on_farm ~
                "Planting trees on-farm", tree_retention_natural_regeneration_on_farm ~
                "Retaining naturally regenerated trees on-farm", homemade_feeds ~
                "Using homemade animal feed", fodder_production ~ "Produccing fodder",
            fodder_preservation ~ "Preserving fodder", survival_feeding ~ "Using survival feeding",
            tot_div ~ "Total farm diversity (n species)", cereal_prod ~ "Total cereal production (kg/yr)",
            offtake ~ "Total livestock offtake (TLU/yr)")) %>%
    add_p(test = list(all_continuous() ~ "kruskal.test", all_categorical() ~ "chisq.test")) %>%
    add_overall()

Characteristic	Overall N = 257¹	1 N = 63¹	2 N = 69¹	3 N = 60¹	4 N = 65¹	p-value²
Age of the head of the household	51.6 (15.8)	50.8 (13.3)	50.3 (16.5)	57.0 (17.5)	48.8 (14.7)	0.035
Female-headed households	24%	17%	35%	27%	15%	0.032
Education of the head of the household higher than primary	38%	48%	39%	5.0%	58%	<0.001
Family size	5.4 (2.5)	5.9 (2.5)	4.5 (2.1)	4.9 (2.1)	6.4 (3.0)	<0.001
Total cropped area (ha)	1.7 (1.4)	1.5 (0.9)	0.9 (0.9)	1.9 (1.4)	2.7 (1.7)	<0.001
Fallow land (ha)	0.7 (1.4)	1.0 (1.1)	0.3 (0.7)	0.7 (1.4)	1.0 (1.9)	<0.001
Non-cereal crops (% total cropped area)	29.3 (29.1)	30.6 (23.6)	15.7 (27.8)	31.4 (32.0)	40.5 (27.3)	<0.001
Total fertilizer used (kg)	85.6 (115.9)	77.9 (69.3)	32.4 (56.9)	59.7 (72.5)	173.7 (170.7)	<0.001
Total manure + compost used (kg)	258.0 (543.0)	328.7 (384.9)	29.5 (71.5)	148.5 (454.2)	533.2 (831.1)	<0.001
Cattle (n)	3.6 (6.7)	2.7 (3.6)	0.6 (2.2)	2.0 (3.3)	9.2 (10.4)	<0.001
Small ruminants (n)	8.5 (11.2)	9.9 (12.1)	2.9 (3.9)	4.1 (4.7)	17.0 (14.1)	<0.001
Poultry (n)	9.4 (11.3)	10.4 (9.7)	4.0 (5.2)	6.8 (11.1)	16.6 (13.5)	<0.001
Bee keeping	11%	11%	2.9%	17%	12%	0.075
Owning an individual garden	47%	95%	23%	13%	55%	<0.001
Having access to a communal garden	1.2%	1.6%	0%	1.7%	1.5%	0.8
Own production as main source of food	75%	62%	57%	87%	95%	<0.001
Farming as main source of income	51%	13%	4.3%	100%	91%	<0.001
Proportion of the year being food secured	0.5 (0.3)	0.6 (0.2)	0.4 (0.3)	0.5 (0.3)	0.7 (0.2)	<0.001
24H household dietary diversity score (0-12)	3.1 (1.0)	3.3 (0.9)	2.8 (0.8)	2.7 (1.0)	3.6 (1.2)	<0.001
Using a community seed bank	12%	6.3%	5.8%	23%	12%	0.007
Using small grains	86%	89%	72%	93%	92%	0.001
Using crop rotation	50%	62%	30%	37%	72%	<0.001
Using intercropping	16%	27%	13%	0%	25%	<0.001
Using cover crops	8.2%	6.3%	5.8%	13%	7.7%	0.4
Using mulching	9.7%	13%	1.4%	8.3%	17%	0.019
Using integrated pest management	49%	57%	25%	47%	71%	<0.001
Using compost and manure	16%	16%	2.9%	17%	28%	0.001
Planting trees on-farm	14%	21%	5.8%	13%	15%	0.094
Retaining naturally regenerated trees on-farm	36%	43%	32%	23%	45%	0.044
Using homemade animal feed	25%	30%	7.2%	17%	45%	<0.001
Produccing fodder	3.9%	6.3%	0%	3.3%	6.2%	0.2
Preserving fodder	3.1%	3.2%	0%	0%	9.2%	0.007
Using survival feeding	3.9%	4.8%	0%	1.7%	9.2%	0.034
Total farm diversity (n species)	6.7 (3.5)	9.1 (2.1)	3.8 (1.8)	4.8 (2.4)	9.1 (3.6)	<0.001
Total cereal production (kg/yr)	839.4 (923.3)	734.2 (790.5)	410.5 (454.5)	840.5 (773.1)	1,395.8 (1,228.9)	<0.001
Total livestock offtake (TLU/yr)	0.3 (0.5)	0.2 (0.4)	0.1 (0.1)	0.1 (0.3)	0.6 (0.8)	<0.001
Equipment value (USD)	231.5 (296.5)	207.9 (250.5)	58.0 (91.8)	202.5 (275.0)	465.4 (350.4)	<0.001
¹ Mean (SD); %
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test

4.4.3 Most representative farm of each type

Finally, we pursue our interpretation of farm types by analyzing the most representative farm of each class, as the farm closest to the medoid of its farm type.

For that, we first compute distances on PCO variables.

dist_mat = dist(pco_var)

We then extract medoid (e.g., the most central observation) for each cluster/farm type, and we display IDs of these medoids (row numbers from ‘data’)

medoids = sapply(unique(grpPCO), function(cl) which.min(colSums(as.matrix(dist_mat)[grpPCO ==
    cl, grpPCO == cl])))
medoids

243  17 207 156 
 55  13  42  38

We then retrieve the full farm data for these medoids/most representative farms.

most_representative_farms = data[medoids, ]
most_representative_farms

    ID           ward district age_hhh sex_hhh edu_hhh family_size cropped_area
55 363 Ward 3 9 12 17    Mbire      57       0       0           5          1.0
13 318         Ward 2    Mbire      69       0       0           8          0.5
42 349 Ward 3 9 12 17    Mbire      39       0       1           4          1.0
38 345 Ward 3 9 12 17    Mbire      60       0       0           7          0.0
   fallow prop_non_cereals fertilizers amendment cattle small_rum poultry
55    2.0          0.00000          50       600      0         5       0
13    0.0          0.00000           5         0      0         0      20
42    0.8         19.98002           0         0      0         0       1
38    1.2          0.00000          10       250      0         0       0
   bee_hives ind_garden com_garden main_food_source main_income_source
55         1          1          0                0                  0
13         1          0          0                0                  1
42         0          0          0                1                  0
38         0          1          0                0                  0
   food_security hdds community_seed_banks small_grains crop_rotation
55     0.7500000    2                    1            1             1
13     0.3333333    3                    0            1             1
42     0.5000000    3                    0            1             1
38     0.0000000    2                    0            0             0
   intercropping cover_crops mulching integrated_pest_management compost_manure
55             1           1        1                          1              1
13             0           0        0                          1              0
42             1           0        0                          0              0
38             0           0        0                          0              0
   tree_planting_on_farm tree_retention_natural_regeneration_on_farm
55                     1                                           0
13                     0                                           0
42                     0                                           1
38                     0                                           0
   homemade_feeds fodder_production fodder_preservation survival_feeding
55              1                 1                   1                0
13              0                 0                   0                0
42              0                 0                   0                0
38              0                 0                   0                0
   tot_div cereal_prod offtake eq_value type
55      10         257    0.20      150    1
13       4         170    0.02        0    3
42       5         550    0.00       50    2
38       3           0    0.00      100    2

4.5 Selection of stratified samples

Typologies can be used to select a sample of farms representative of the overall population to e.g., host on-farm experiments.

Below, we select 30 representative farms using farm type as single stratum.

sample1 = as.data.frame(stratified(data, "type", 30/nrow(data)))
table(sample1$type)


1 2 3 4 
7 8 7 8

We can also use more than one stratum: here we select 30 representative farms using ward (an administrative unit) and farm type as strata.

sample2 = as.data.frame(stratified(data, c("ward", "type"), 30/nrow(data)))
table(sample2$type, sample2$ward)

   
    Ward 2 Ward 3 9 12 17
  1      4              3
  2      5              3
  3      2              5
  4      3              4

3 Baselining

5 Managing on-farm experiments