
Fair-Capacitated Clustering

Tai Le Quy (Leibniz University Hannover, Hannover, Germany, [email protected])
Arjun Roy (Leibniz University Hannover, Hannover, Germany, [email protected])
Gunnar Friege (Leibniz University Hannover, Hannover, Germany, [email protected])
Eirini Ntoutsi (Freie Universität Berlin, Berlin, Germany, [email protected])

Tai Le Quy, Arjun Roy, Gunnar Friege and Eirini Ntoutsi. "Fair-Capacitated Clustering". 2021. In: Proceedings of The 14th International Conference on Educational Data Mining (EDM 21), June 29 - July 02, 2021, Paris, France. International Educational Data Mining Society, 407-414. https://educationaldatamining.org/edm2021/

ABSTRACT
Traditionally, clustering algorithms focus on partitioning the data into groups of similar instances. The similarity objective, however, is not sufficient in applications where a fair representation of the groups in terms of protected attributes, like gender or race, is required for each cluster. Moreover, in many applications, to make the clusters useful for the end-user, a balanced cardinality among the clusters is required. Our motivation comes from the education domain, where studies indicate that students might learn better in diverse student groups and, of course, groups of similar cardinality are more practical, e.g., for group assignments. To this end, we introduce the fair-capacitated clustering problem, which partitions the data into clusters of similar instances while ensuring cluster fairness and balancing cluster cardinalities. We propose a two-step solution to the problem: i) we rely on fairlets to generate minimal sets that satisfy the fairness constraint and ii) we propose two approaches, namely hierarchical clustering and partitioning-based clustering, to obtain the fair-capacitated clustering. Our experiments on three educational datasets show that our approaches deliver well-balanced clusters in terms of both fairness and cardinality while maintaining a good clustering quality.

Keywords
fair-capacitated clustering, fair clustering, capacitated clustering, fairness, learning analytics, fairlets, knapsack

1. INTRODUCTION
Machine learning (ML) plays a crucial role in decision-making in almost all areas of our lives, including areas of high societal impact, like healthcare and education. Our work's motivation comes from the education domain, where ML-based decision-making has been used in a wide variety of tasks, from student dropout prediction [9] and forecasting on-time graduation of students [15] to education admission decisions [21]. Recently, the issue of bias and discrimination in ML-based decision-making systems has been receiving a lot of attention [28], as there are many recorded incidents of discrimination (e.g., recidivism prediction [20], grades prediction [4, 14]) caused by such systems against individuals or groups of people on the basis of protected attributes like gender, race, etc. Bias in education is not a new problem: there is already a long literature on different sources of bias in education [24] and in students' data analysis [3], as well as studies on racial bias [31] and gender bias [22]. However, ML-based decision-making systems have the potential to amplify prevalent biases or create new ones, and therefore fairness-aware ML approaches are required for the educational domain as well.

In this work, we focus on fairness in clustering, as in educational activities group assignments [8] and student team achievement divisions [30] are important tools that help students work together towards shared learning goals. Clustering is an effective solution for partitioning students into groups of similar instances [3, 26]. Traditional algorithms, however, focus solely on the similarity objective and do not consider the fairness of the resulting clusters w.r.t. protected attributes like gender. Studies indicate that students might learn better in diverse groups, e.g., mixed-gender groups [11, 32]. Lately, fair-clustering solutions have been proposed [1, 2, 5, 6] which aim to discover clusters with a fair representation regarding some protected attributes. In this work, we adopt the cluster fairness of [6], called cluster balance, according to which protected groups must have approximately equal representation in every cluster.

In a teaching situation, it is obvious that the sizes of the groups should be comparable to allow a fair allocation of work among the students. As traditional clustering algorithms do not consider this requirement, clusters of varying sizes might be extracted, reducing the usefulness and applicability of the partitioning for end-users/teachers. This leads to the demand for clustering solutions that also take into account the size of the clusters. The problem is known as the capacitated clustering problem (CCP) [25], which aims to extract clusters with a limited capacity while minimizing the total dissimilarity in the clusters. Capacitated clustering is useful in quite a few applications, such as transferring goods/services from service providers (post offices, stores, etc.), garbage collection, and sales force territorial design [27]. To the best of our knowledge, no solution exists that considers both fairness and capacity of clusters on top of the similarity objective.

To this end, we propose a new problem, the so-called fair-capacitated clustering, that ensures fairness and balanced cardinalities of the resulting clusters. We decompose the problem into two subproblems: i) the fairness-requirement compliance step, which preserves fairness at a minimum threshold of balance score, and ii) the capacity-requirement compliance step, which ensures clusters of comparable sizes. For the first step, we generate fairlets [6], which are minimal sets that satisfy fair representation w.r.t. a protected attribute. For the second step, we propose two solutions for two different clustering types, namely hierarchical and partitioning-based clustering, that consider the capacity constraint during the merge step (hierarchical approach) or the assignment step (partitioning approach). Experimental results on three real datasets from the education domain show that our methods result in fair and capacitated clusters while preserving the clustering quality.

2. RELATED WORK
Chierichetti et al. [6] introduced the fair clustering problem with the aim to ensure equal representation for each protected attribute, such as gender, in every cluster. In their formulation, each instance is assigned one of two colors (red, blue). They proposed a two-phase approach: clustering all instances into fairlets, small clusters preserving the fairness measure, and then applying vanilla clustering methods on those fairlets. Subsequent studies focus on generalization and scalability. Backurs et al. [1] presented an approximate fairlet decomposition algorithm which can formulate the fairlets in nearly linear time, thus tackling the efficiency bottleneck of [6]. Rösner and Schmidt [29] generalized the fair clustering problem to more than two protected attributes. A more generalized and tunable notion of fairness for clustering was introduced by Bera et al. [2]. Chhabra and Mohapatra [5] introduced a fair hierarchical agglomerative clustering method for multiple protected attributes.

The capacitated clustering problem (CCP), a combinatorial optimization problem, was first introduced by Mulvey and Beck [25], who proposed solutions using heuristic and subgradient algorithms. Several approaches exist to improve the efficiency of solutions or to adapt CCP to different cluster types. Khuller and Sussmann [17], for example, introduced an approximation algorithm for the capacitated k-Center problem. Geetha et al. [10] improved the k-Means algorithm for CCP by using a priority measure to assign points to their centroid. Lam and Mittenthal [19] proposed a heuristic hierarchical clustering method for CCP to solve the multi-depot location-routing problem.

In this work, we introduce the problem of fair-capacitated clustering, which builds upon the notions from fair clustering and capacitated clustering. In particular, we build upon the notion of fairlets [6] to extract the minimal sets that preserve fairness. Regarding the CCP, we follow the formulation of [25] to ensure balanced cluster cardinalities. To the best of our knowledge, the combined problem has not been studied before and, as already discussed, comprises a useful tool in quite a few domains like education.

3. PROBLEM DEFINITION
Let X ⊆ R^n be a set of instances to be clustered and let d(): X × X → R be the distance function. For an integer k we use [k] to denote the set {1, 2, ..., k}. A k-clustering C is a partition of X into k disjoint subsets, C = {C_1, C_2, ..., C_k}, called clusters, with S = {s_1, s_2, ..., s_k} the corresponding cluster centers. The goal of clustering is to find an assignment φ: X → [k] that minimizes the objective function:

L(X, C) = Σ_{s_i ∈ S} Σ_{x ∈ C_i} d(x, s_i)    (1)

As shown in Eq. 1, the goal is to find an assignment that minimizes the sum of distances between each point x ∈ X and its corresponding cluster center s_i ∈ S. It is clear that such an assignment optimizes the similarity but does not consider fairness or capacity of the resulting clusters.

Capacitated clustering: The goal of capacitated clustering [25] is to discover clusters of given capacities while still minimizing the distance objective L(X, C). The capacity constraint is defined as an upper bound Q_i on the cardinality of each cluster C_i:

|C_i| ≤ Q_i    (2)

Clustering fairness: We assume the existence of a binary protected attribute P = {0, 1}, e.g., gender = {male, female}. Let ψ: X → P denote the demographic group to which a point belongs, i.e., male or female. Fairness of a cluster is evaluated in terms of the balance score [6], the minimum ratio between the two groups:

balance(C_i) = min( |{x ∈ C_i | ψ(x) = 0}| / |{x ∈ C_i | ψ(x) = 1}|, |{x ∈ C_i | ψ(x) = 1}| / |{x ∈ C_i | ψ(x) = 0}| )    (3)

Fairness of a clustering C equals the balance of its least balanced cluster C_i ∈ C:

balance(C) = min_{C_i ∈ C} balance(C_i)    (4)

We now introduce the problem of fair-capacitated clustering, which combines all aforementioned objectives regarding distance, fairness and capacity.

Definition 1 (Fair-capacitated clustering problem). We define the problem of (t, k, q)-fair-capacitated clustering as finding a clustering C = {C_1, C_2, ..., C_k} that partitions the data X into |C| = k clusters such that the cardinality of each cluster C_i ∈ C does not exceed a threshold q, i.e., |C_i| ≤ q (the capacity constraint), the balance of each cluster is at least t, i.e., balance(C) ≥ t (the fairness constraint), and which minimizes the objective function L(X, C). The parameters k, t, q are user defined, referring to the number of clusters, the minimum balance threshold and the maximum cluster capacity, respectively.

4. FAIR-CAPACITATED CLUSTERING

4.1 Fairlet decomposition
Traditionally, the vanilla versions of clustering algorithms are not capable of ensuring fairness, because they assign the data points to the closest center without any fairness consideration. Hence, if we could divide the original data set into subsets such that each of them satisfies the balance threshold t, then grouping these subsets to generate the final clustering would still preserve the fairness constraint. Each such fair subset is defined as a fairlet. We follow the definition of fairlet decomposition by [6].
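Before the formal definition, the fairlet idea can be illustrated in code. This is our own minimal sketch (the helper names `balance` and `naive_fairlets` are ours): it assumes the two group sizes stand in exactly the ratio f:m, whereas the actual decomposition of [6] handles the general case and also optimizes the clustering cost.

```python
from math import gcd

def balance(labels):
    # Eq. 3: minimum ratio between the two protected groups in a set
    ones = sum(labels)
    zeros = len(labels) - ones
    return 0.0 if min(ones, zeros) == 0 else min(ones / zeros, zeros / ones)

def naive_fairlets(points, labels, f, m):
    """Greedy sketch: build fairlets of f minority plus m majority points,
    so each fairlet has balance exactly f/m. Assumes the group sizes are
    exactly in ratio f:m (the decomposition of [6] is more general)."""
    assert 1 <= f <= m and gcd(f, m) == 1
    minority = [p for p, l in zip(points, labels) if l == 1]
    majority = [p for p, l in zip(points, labels) if l == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    assert len(minority) * m == len(majority) * f, "sketch needs an exact f:m split"
    fairlets = []
    while minority:
        fairlets.append([minority.pop() for _ in range(f)] +
                        [majority.pop() for _ in range(m)])
    return fairlets
```

For example, with t = 1/2 (f = 1, m = 2), six points of which two carry the minority label yield two fairlets of size three, each with balance 0.5.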
Definition 2 (Fairlet decomposition). Suppose that balance(X) ≥ t with t = f/m for some integers 1 ≤ f ≤ m such that the greatest common divisor gcd(f, m) = 1. A decomposition F = {F_1, F_2, ..., F_l} of X is a fairlet decomposition if: i) each point x ∈ X belongs to exactly one fairlet F_j ∈ F; ii) |F_j| ≤ f + m for each F_j ∈ F, i.e., the size of each fairlet is small; and iii) for each F_j ∈ F, balance(F_j) ≥ t, i.e., the balance of each fairlet satisfies the threshold t. Each F_j is called a fairlet.

By applying fairlet decomposition on the original dataset X, we obtain a set of fairlets F = {F_1, F_2, ..., F_l}. For each fairlet F_j we randomly select a point r ∈ F_j as the center. For a point x ∈ X, we denote by γ: X → [1, l] the index of the fairlet to which it is mapped. The second step is to cluster the set of fairlets F into k final clusters, subject to the cardinality constraint. The clustering process is described below for the hierarchical clustering type (Section 4.2) and for the partitioning-based clustering type (Section 4.3). Clustering results in an assignment from fairlets to final clusters, δ: F → [k]. The final fair-capacitated clustering C can be determined by the overall assignment function φ(x) = δ(F_γ(x)), where γ(x) returns the index of the fairlet to which x is mapped.

4.2 Fair-capacitated hierarchical clustering
Given the set of fairlets F = {F_1, F_2, ..., F_l}, let W = {w_1, w_2, ..., w_l} be their corresponding weights, where the weight w_j of a fairlet F_j is defined as its cardinality, i.e., the number of points in F_j.

Traditional agglomerative clustering approaches merge the two closest clusters and thus rely solely on similarity. We extend the merge step by also ensuring that merging does not violate the cardinality constraint w.r.t. the cardinality threshold q.

Theorem 1. The balance score of a cluster formed by the union of two or more fairlets is at least t: balance(Y) ≥ t, where Y = ∪_{j≤l} F_j and balance(F_j) ≥ t.

Proof. We use the method of induction. Assume we have a set of fairlets F = {F_1, F_2, ..., F_l} in which balance(F_j) ≥ t, j = 1, ..., l. We first consider the case of any two fairlets {F_1, F_2} ⊆ F. We have balance(F_1) = f_1/m_1 ≥ t and balance(F_2) = f_2/m_2 ≥ t. We denote by Y the union of the two fairlets F_1 and F_2; then

balance(Y) = balance(F_1 ∪ F_2) = (f_1 + f_2) / (m_1 + m_2)    (5)

It holds that f_1/m_1 ≥ t, hence f_1/(m_1 + m_2) ≥ t·m_1/(m_1 + m_2); similarly, f_2/(m_1 + m_2) ≥ t·m_2/(m_1 + m_2). Therefore,

f_1/(m_1 + m_2) + f_2/(m_1 + m_2) ≥ t·m_1/(m_1 + m_2) + t·m_2/(m_1 + m_2)
⟹ (f_1 + f_2)/(m_1 + m_2) ≥ t·(m_1 + m_2)/(m_1 + m_2) = t    (6)

Therefore, from Eq. 5 and Eq. 6 we get

balance(Y) ≥ t    (7)

Thus, the statement given in Theorem 1 is true for any cluster formed by the union of any two fairlets. Now we assume that the statement holds true for a cluster formed from i fairlets, i.e., Y = ∪_{j≤i} F_j, where 1 < i < l. Then,

balance(Y) = (Σ_{j≤i} f_j) / (Σ_{j≤i} m_j) ≥ t    (8)

Consider another fairlet F_{i+1} ∈ F which is not in the formed cluster Y, with balance(F_{i+1}) = f_{i+1}/m_{i+1} ≥ t. Then, by joining F_{i+1} with the cluster Y we get the new cluster Y′ such that

balance(Y′) = (f_{i+1} + Σ_{j≤i} f_j) / (m_{i+1} + Σ_{j≤i} m_j)    (9)

Following the steps in Eq. 6, we can similarly show that

(f_{i+1} + Σ_{j≤i} f_j) / (m_{i+1} + Σ_{j≤i} m_j) ≥ t ⟹ balance(Y′) ≥ t    (10)

Hence, the theorem holds true for a cluster formed from i+1 fairlets if it is true for i fairlets. Since i is an arbitrary number of fairlets, the theorem holds true in all cases. □

Theorem 1 shows that for any cluster formed by a union of fairlets, the fairness constraint is always preserved. Henceforth, we do not need further interventions w.r.t. fairness.

The pseudocode is shown in Algorithm 2 of Appendix B. In each step, the closest pair of clusters is identified and a new cluster is created only if its capacity does not exceed the capacity threshold q. Otherwise, the next closest pair is investigated. The procedure continues until k clusters remain. The remaining clusters are fair and capacitated according to the corresponding thresholds t and q. To compute the proximity matrix (line 1 and line 8), we use the distance between the centroids of the corresponding clusters. The function capacity(cluster) in line 5 returns the size of a cluster.

4.3 Fair-capacitated partitioning clustering
Partitioning-based clustering algorithms, such as k-Medoids, can be viewed as a distance minimization problem in which we try to minimize the objective function in Eq. 1. The vanilla k-Medoids does not satisfy the cardinality constraint, since the step that allocates points to clusters is based only on the distance among them. Now, if we change the goal of this assignment step to finding the "best" data points with a defined capacity for each medoid, instead of searching for the most suitable medoid for each point, we can control the cardinality of the clusters. We formulate the problem of assigning points to clusters subject to a capacity threshold q as a knapsack problem [23].

Let S = {s_1, s_2, ..., s_k} be the cluster centers, i.e., the medoids, and C = {C_1, C_2, ..., C_k} be the resulting clusters. We change the assignments of points to clusters, using knapsack, in order to meet the capacity constraint q. In particular, we define a flag variable y_j = 1 if x_j is assigned to cluster C_i, otherwise y_j = 0. Now, we define a value v_j for data point x_j, which depends on the distance of x_j to C_i, with v_j being maximum if C_i is the best cluster for x_j, i.e., the distance between x_j and s_i is minimum. We formulate the value v_j of instance x_j based on an exponentially decaying distance function:

v_j = e^(−λ·d(x_j, s_i))    (11)

where d(x_j, s_i) is the Euclidean distance between the point x_j and the medoid s_i. The higher λ is, the lower the effect of distance on the value of the points. A point which is closer to the medoid will have a higher value.

Then, the objective function for the assignment step is:

maximize Σ_{j=1}^{n} v_j y_j    (12)

Now, let F = {F_1, F_2, ..., F_l} and W = {w_1, w_2, ..., w_l} be the set of fairlets and their corresponding weights, i.e., the number of instances in each fairlet, and let q be the maximum capacity of the final clusters. Our target is to cluster the set of fairlets F into k clusters centered by k medoids. We apply the formulas in Eq. 11 and Eq. 12 on the set of fairlets F, i.e., each fairlet F_j plays the same role as x_j. Then, the problem of assigning the fairlets to each medoid in the cluster assignment step becomes finding a set of fairlets with total weight less than or equal to q whose total value is maximized. In other words, we can formulate the cluster assignment step in the partitioning-based clustering as a 0-1 knapsack problem:

maximize Σ_{j=1}^{l} v_j y_j  subject to  Σ_{j=1}^{l} w_j y_j ≤ q and y_j ∈ {0, 1}    (13)

in which y_j is the flag variable for F_j (y_j = 1 if F_j is assigned to the cluster, otherwise y_j = 0), v_j is the value of F_j computed by Eq. 11, and q is the desired maximum capacity.

The pseudocode of our k-Medoids fair-capacitated algorithm is described in Algorithm 1.

Algorithm 1: k-Medoids fair-capacitated algorithm
Input: F = {F_1, F_2, ..., F_l}: a set of fairlets; W = {w_1, w_2, ..., w_l}: weights of the fairlets; k: number of clusters; q: a given maximum capacity of the final clusters
Output: A fair-capacitated clustering
 1 Function ClusterAssignment(medoids):
 2   clusters ← ∅
 3   for each medoid s in medoids do
 4     candidates ← all fairlets which are not assigned to any cluster
 5     p ← length(candidates)
 6     w ← weights(candidates)
 7     for each fairlet_i in candidates do
 8       values[i] ← v(fairlet_i)  // Eq. 11
 9     end
10     clusters[s] ← knapsack(p, values, w, q)
11   end
12   return clusters
13 Function main():
14   medoids ← select k of the l fairlets arbitrarily
15   ClusterAssignment(medoids)
16   cost_best ← current clustering cost
17   s_best ← null
18   o_best ← null
19   repeat
20     for each medoid s in medoids do
21       for each non-medoid o in F do
22         consider the swap of s and o; compute the current clustering cost
23         if current clustering cost < cost_best then
24           s_best ← s
25           o_best ← o
26           cost_best ← current clustering cost
27         end
28       end
29     end
30     update medoids by the swap of s_best and o_best
31     ClusterAssignment(medoids)
32   until no improvements can be achieved by any replacement
33   return clusters
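The two ingredients of the assignment step, the value function of Eq. 11 and the 0-1 knapsack of Eq. 13 (called in line 10 of Algorithm 1), can be sketched as follows. This is an illustrative reconstruction under our own naming, not the authors' implementation; the knapsack uses the dynamic programming described in Section 4.3.

```python
import math

def value(dist, lam=0.3):
    # Eq. 11: exponential-decay value of a candidate fairlet for a given
    # medoid (lam = 0.3 is the lambda the experiments later select)
    return math.exp(-lam * dist)

def knapsack(values, weights, q):
    """0-1 knapsack by dynamic programming: returns the index set with
    maximum total value whose total (integer) weight does not exceed q."""
    n = len(values)
    best = [[0.0] * (q + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(q + 1):
            best[i][c] = best[i - 1][c]
            if weights[i - 1] <= c:
                cand = best[i - 1][c - weights[i - 1]] + values[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand
    chosen, c = [], q  # backtrack to recover the chosen items
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return chosen[::-1]
```

For instance, for four candidate fairlets at distances [0.5, 3.0, 1.0, 2.0] from a medoid, with sizes [2, 2, 3, 2] and q = 5, the knapsack selects the fairlets at indices 0 and 2: the closest fairlet plus the next-most-valuable one that still fits the capacity.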
For each medoid, we search for the adequate points (line 3) by using the function knapsack(p, values, w, q) (line 10), implemented using dynamic programming, which returns a list of items with maximum total value whose total weight does not exceed q. In the main function, we optimize the clustering cost by replacing medoids with non-medoid instances whenever the clustering cost decreases; this optimization procedure stops when no improvement in the clustering cost is found (lines 19 to 32).

5. EXPERIMENTS
In this section, we describe our experiments and the performance of our proposed methods on three educational datasets.

5.1 Experimental setup
Datasets. The datasets are summarized in Table 1.

UCI student performance. This dataset includes the demographics, grades, and social and school-related features of students in secondary education at two Portuguese schools [7] in 2005-2006. "Gender" is selected as the protected attribute, i.e., we aim to balance gender in the resulting clusters.

Open University Learning Analytics (OULAD). This is the dataset from the OU Analyse project [18] of the Open University, England, in 2013-2014. The information on students includes their demographics, courses, interactions with the virtual learning environment, and final outcome. We aim to balance the "Gender" attribute in the results.

MOOC. The data covers students who enrolled in the 16 edX courses offered by two institutions (Harvard University and the Massachusetts Institute of Technology) during 2012-2013 [13]. The dataset contains aggregated records which represent students' activities and their final grades in the courses. "Gender" is the protected attribute.
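As a sanity check, the dataset balance scores reported in Table 1 follow directly from Eq. 3 applied to the gender counts of each dataset (`dataset_balance` is our own helper name):

```python
def dataset_balance(n_female, n_male):
    # Eq. 3 applied to the whole dataset: min ratio between the two groups
    return min(n_female / n_male, n_male / n_female)

print(round(dataset_balance(383, 266), 3))  # UCI student performance: 0.695
print(dataset_balance(2000, 2000))          # OULAD and MOOC: 1.0
```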
Baselines. We compare against well-known fairness-aware clustering methods and a vanilla clustering method.

k-Medoids [16]: a traditional partitioning clustering technique that divides the dataset into k clusters so as to minimize the clustering cost. Cluster centers are actual instances.

Vanilla fairlet [6]: a fairness-aware clustering approach that i) decomposes the dataset into fairlets and ii) applies a vanilla k-Center clustering algorithm [12] to form the final k clusters.

MCF fairlet [6]: similar to Vanilla fairlet, but the fairlet decomposition part is transformed into a minimum cost flow (MCF) problem, by which an optimized version of fairlet decomposition in terms of cost value is computed.

Table 1: An overview of the datasets

Dataset | #instances | #attributes | Protected attribute | Balance score
UCI student performance | 649 | 33 | Gender (F: 383; M: 266) | 0.695
OULAD | 4,000 | 12 | Gender (F: 2,000; M: 2,000) | 1
MOOC | 4,000 | 21 | Gender (F: 2,000; M: 2,000) | 1

Evaluation measures. We report on clustering quality (measured as clustering cost, see Eq. 1), cluster fairness (expressed as cluster balance, see Eq. 4) and cluster capacity (expressed as cluster cardinality).

Parameter selection. Regarding fairness, the minimum balance threshold t is set to 0.5 for all datasets in our experiments; this means that in each resulting cluster the minority group is at least half the size of the majority group. Regarding the λ factor in Eq. 11, the value λ = 0.3 is chosen for our experiments from the range [0.1, 1.0] via grid search; we evaluate the clustering cost and balance score w.r.t. λ on a small dataset, namely the UCI student performance dataset (Mathematics subject). Theoretically, the ideal capacity of clusters is ⌈|X|/k⌉, where |X| is the population of dataset X and k is the number of desired clusters. However, in many cases the clustering models cannot satisfy this constraint, especially the hierarchical clustering model. Hence, we use the formula in Eq. 14 to compute the maximum capacity q of clusters, where ε is a parameter chosen in the experiments for each fair-capacitated clustering approach:

q = ⌈|X|·ε / k⌉    (14)

To find an appropriate value of ε, we consider the range [1.0, 1.3], to ensure that all generated clusters have members, and evaluate the cardinality of the resulting clusters on the UCI student performance (Mathematics subject) dataset. ε is set to 1.01 and 1.2 for the k-Medoids fair-capacitated and hierarchical fair-capacitated methods, respectively.

5.2 Experimental results
UCI student performance. When k is less than 4, as shown in Figure 1-a, the clustering quality of our models is close to that of the vanilla k-Medoids method. However, the clustering cost fluctuates thereafter due to the effort of the methods to maintain fairness and cardinality. Our vanilla fairlet hierarchical fair-capacitated method outperforms the other competitors in most cases. Vanilla fairlet and MCF fairlet show the worst clustering cost, as an effect of the k-Center method. Figure 1-b depicts the clustering fairness. As we can observe, in terms of fairness, vanilla fairlet hierarchical fair-capacitated has the best performance when k is less than 10. Contrary to that, by selecting each point for each cluster in the cluster assignment step, the k-Medoids fair-capacitated method maintains fairness well in many cases. Regarding cardinality, as illustrated in Figure 1-c, our approaches outperform the competitors, as they keep the number of instances in each cluster under the specified thresholds.

OULAD. Our MCF fairlet k-Medoids fair-capacitated approach outperforms the other methods in terms of clustering cost, although there is an increase compared to the vanilla k-Medoids algorithm, as we can see in Figure 2-a. Concerning fairness (Figure 2-b), k-Medoids is the weakest method while the others achieve the highest balance. The balance of the Gender feature in the dataset is the main reason for this result: all fairlets are fully fair, which is a prerequisite for our methods being able to maintain the perfect balance. Regarding cardinality, our approaches demonstrate their strength in ensuring the capacity of clusters (Figure 2-c). The difference in the sizes of the clusters generated by our methods is tiny, in stark contrast to the trend of the competitors.

MOOC. The results on clustering quality are shown in Figure 3-a (Appendix A). Although an increase in the clustering cost is the main trend, our methods outperform the vanilla fairlet and MCF fairlet methods. Regarding clustering fairness, as depicted in Figure 3-b, our approaches maintain the perfect balance in all experiments. This is the result of an actual balance in the dataset and the fairlets. Notably, our methods can divide all the experimented instances into capacitated clusters, as shown in Figure 3-c, which proves their superiority over the competitors regarding the clusters' cardinality.

Summary of the results. In general, fairness is well maintained in all of our experiments. When the data is fair, as in the case of the OULAD and MOOC datasets, our methods achieve perfect fairness. In terms of cardinality, our methods are able to keep the cardinality of the resulting clusters within the maximum capacity threshold, which is significantly superior to the competing methods. The fair-capacitated partitioning-based method is better than the hierarchical approach, since it can determine the capacity threshold closest to the ideal capacity. Regarding the clustering cost, the hierarchical approach has an advantage over the other methods, outperforming its competitors in most experiments.

6. CONCLUSION AND OUTLOOK
In this work, we introduced the fair-capacitated clustering problem, which extends traditional clustering, solely focusing on similarity, by also aiming at a balanced cardinality among the clusters and a fair representation of instances in each cluster according to some protected attributes like gender or race.
Our solutions work on the fairlets derived from the original instances: the hierarchical-based approach takes into account the cardinality requirement during the merging step, whereas the partitioning-based approach takes into account the cardinality of the final clusters during the assignment step, which is formulated as a knapsack problem. Our experiments show that our methods are effective in terms of fairness and cardinality while maintaining clustering quality. In the future, we plan to extend our approach to more than one protected attribute, as well as to further investigate what fair group assignments mean in educational settings.

Figure 1: Performance of different methods on the UCI student performance dataset: a) clustering quality (lower is better), b) clustering fairness (higher is better), c) clustering cardinality.

Figure 2: Performance of different methods on the OULAD dataset: a) clustering quality (lower is better), b) clustering fairness (higher is better), c) clustering cardinality.

Acknowledgements
The work of the first author is supported by the Ministry of Science and Education of Lower Saxony, Germany, within the PhD program "LernMINT: Data-assisted teaching in the MINT subjects". The work of the second author is supported by the Volkswagen Foundation under the call "Artificial Intelligence and the Society of the Future".

7. REFERENCES
[1] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405-413. PMLR, 2019.
[2] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[3] S. Bharara, S. Sabitha, and A. Bansal. Application of learning analytics using clustering data mining for students' disposition analysis. Education and Information Technologies, 23(2):957-984, 2018.
[4] K. Bhopal and M. Myers. The impact of covid-19 on A level students in England. SocArXiv, 2020.
[5] A. Chhabra and P. Mohapatra. Fair algorithms for hierarchical agglomerative clustering. arXiv preprint arXiv:2005.03197, 2020.
[6] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5036-5044, 2017.
[7] P. Cortez and A. M. G. Silva. Using data mining to predict secondary school student performance. EUROSIS-ETI, 2008.
[8] M. Ford and J. Morice. How fair are group assignments? A survey of students and faculty and a modest proposal. Journal of Information Technology Education: Research, 2(1):367-378, 2003.
[9] J. Gardner, C. Brooks, and R. Baker. Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, pages 225-234, 2019.
[10] S. Geetha, G. Poonthalir, and P. Vanathi. Improved k-means algorithm for capacitated clustering problem. INFOCOMP, 8(4):52-59, 2009.
[11] D. Gnesdilow, A. Evenstone, J. Rutledge, S. Sullivan, and S. Puntambekar. Group work in the science classroom: How gender composition may affect individual performance. 2013.
[12] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
[17] S. Khuller and Y. J. Sussmann. The capacitated k-center problem. SIAM Journal on Discrete Mathematics, 13(3):403-418, 2000.
[18] J. Kuzilek, M. Hlosta, and Z. Zdrahal. Open university learning analytics dataset. Scientific Data, 4:170171, 2017.
[19] M. Lam and J. Mittenthal. Capacitated hierarchical clustering heuristic for multi depot location-routing problems. Int. J. Logist. Res. Appl., 16(5):433-444, 2013.
[20] J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica (5 2016), 9(1), 2016.
[21] F. Marcinkowski, K. Kieslich, C. Starke, and M. Lünich. Implications of AI (un-)fairness in higher education admissions: the effects of perceived AI (un-)fairness on exit, voice and organizational reputation. In FAT* '20, pages 122-130, 2020.
[22] T. Masterson. An empirical analysis of gender bias in education spending in Paraguay. World Development, 40(3):583-593, 2012.
[23] G. B. Mathews. On the partition of numbers. Proceedings of the London Mathematical Society, 1(1):486-490, 1896.
[24] M. Meaney and T. Fikes. Early-adopter iteration bias and research-praxis bias in the learning analytics ecosystem. In Companion Proceedings of the 9th International Conference on Learning Analytics & Knowledge (LAK'19), Fairness and Equity in Learning Analytics Systems Workshop, pages 14-20, 2019.
[25] J. M. Mulvey and M. P. Beck. Solving capacitated clustering problems. European Journal of Operational Research, 18(3):339-348, 1984.
[26] Á. A. M. Navarro and P. M. Ger. Comparison of clustering algorithms for learning analytics with educational datasets. IJIMAI, 5(2):9-16, 2018.
[27] M. Negreiros and A. Palhano. The capacitated centred clustering problem. Computers & Operations Research, 33(6):1639-1663, 2006.
[28] E. Ntoutsi, P. Fafalios, U. Gadiraju, V. Iosifidis, W. Nejdl, M.-E. Vidal, S. Ruggieri, F. Turini, S. Papadopoulos, E. Krasanakis, et al. Bias in data-driven artificial intelligence systems: an introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3):e1356, 2020.
[29] C. Rösner and M. Schmidt. Privacy preserving
clustering with constraints. arXiv preprint [13] HarvardX. HarvardX Person-Course Academic Year arXiv:1802.02497, 2018. 2013 De-Identified dataset, version 3.0, 2014. [30] M. Tiantong and S. Teemuangsai. Student team [14] S. Hubble and P. Bolton. A level results in england achievement divisions (stad) technique through the and the impact on university admissions in 2020-21. moodle to enhance learning achievement. House of Commons Library, 2020. International Education Studies, 6(4):85–92, 2013. [15] S. Hutt, M. Gardner, A. L. Duckworth, and S. K. [31] N.Warikoo,S.Sinclair,J.Fei,andD.Jacoby-Senghor. D’Mello. Evaluating fairness and generalizability in Examining racial bias in education: A new approach. models predicting on-time graduation from college Educational Researcher, 45(9):508–514, 2016. applications. International Educational Data Mining [32] Z. Zhan, P. S. Fong, H. Mei, and T. Liang. Effects of Society, 2019. gender grouping on students’ group performance, [16] L. Kaufman and P. J. Rousseeuw. Partitioning around individual achievements and attitudes in medoids (program pam). Finding groups in data: an computer-supported collaborative learning. Computers introduction to cluster analysis, 344:68–125, 1990. in Human Behavior, 48:587–596, 2015. [17] S. Khuller and Y. J. Sussmann. The capacitated Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 413 APPENDIX A. 
MOOCDATASET a) Clustering quality (lower is better) b) Clustering fairness (higher is better) 1.0 8000 7000 0.8 k-Medoids Vanilla fairlet Clustering cost56000000 kVV-aaMnneiilldllaao iffdaasiirrlleett hierarchical fair-capacitated Balance00..46 VVMMMMaaCCCinnnFFFiiim llfffllaaaaau iiimffrrraallleee iirrbtttll aeehkltt-iae Mhknre-iaceMdrerceoahdirdicocshia diflcas fa iafrla- ifrcia-rac-iprca-aapcpcaaiapctaciatitacaeitttdeaadtteedd Vanilla fairlet k-Medoids fair-capaciatated 0.2 Dataset's balance 4000 MCF fairlet MCF fairlet hierarchical fair-capacitated MCF fairlet k-Medoids fair-capacitated 0.0 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Number of clusters Number of clusters c) Clustering cardinality k-Medoids 3000 Vanilla fairlet Vanilla fairlet hierarchical fair-capacitated mber of instances122505000000 VMMMMMaCCCaanxxFFFiii lmmffflaaaauu iiifrrrmmallleeei rtttccl ehkaat-ippe Mkaare-accMdiirttceoyyhdi dooiocsffia dhkflas -ifeM iafrra-iearcid-rrac-copcahiaadpicpcsaai actflacai tfitiaatrea-tidrcet-eadcdpaapcaictaittaetded Nu1000 500 0 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Number of clusters Figure3: PerformanceofdifferentmethodsonMOOCdataset B. HIERARCHICAL FAIR-CAPACITATED ALGORITHM Algorithm2: Hierarchical fair-capacitated algorithm Input: F ={F ,F ,...,F}: a set of fairlets 1 2 l q: a given maximum capacity of final clusters W ={w ,w ,...,w}: weights of fairlets 1 2 l k: number of clusters Output: A fair-capacitated clustering 1 compute the proximity matrix ; 2 clusters←F //each fairlet Fj is considered as cluster ; 3 repeat 4 cluster1,cluster2 ← the closest pair of clusters ; 5 if capacity(cluster1)+capacity(cluster2)≤q then 6 newcluster← merge(cluster1,cluster2); 7 update clusters with newcluster; 8 update the proximity matrix ; 9 else 10 continue; 11 end 12 until k clusters remain; 13 return clusters; 414 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)
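The merging loop of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes fairlets are given as lists of point indices, capacity is measured as point count, and the proximity between clusters is supplied as a caller-provided distance function (e.g., distance between centroids).

```python
def hierarchical_fair_capacitated(fairlets, distance, q, k):
    """Greedily merge fairlets into at most q-sized clusters until k remain.

    fairlets : list of lists of point indices (each fairlet is already fair)
    distance : callable(cluster_a, cluster_b) -> float
    q        : maximum capacity (number of points) of a final cluster
    k        : desired number of clusters
    """
    # Each fairlet starts out as its own cluster (step 2 of Algorithm 2).
    clusters = [list(f) for f in fairlets]
    while len(clusters) > k:
        # Rank all candidate pairs by proximity, closest first.
        pairs = sorted(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge the closest pair whose combined size respects the capacity q;
        # skipping infeasible pairs (rather than retrying the same closest
        # pair) keeps the loop from stalling.
        for _, i, j in pairs:
            if len(clusters[i]) + len(clusters[j]) <= q:
                merged = clusters[i] + clusters[j]
                clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
                clusters.append(merged)
                break
        else:
            break  # no feasible merge remains; stop with more than k clusters
    return clusters
```

Recomputing all pairwise distances each round keeps the sketch short; an actual implementation would maintain and update the proximity matrix incrementally, as steps 1 and 8 of Algorithm 2 indicate.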
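The conclusion notes that the partitioning-based approach formulates its assignment step as a knapsack problem. As a hedged illustration of that building block only (the paper's exact objective, value function, and per-cluster ordering are not reproduced here), a standard 0/1 knapsack over fairlets for a single cluster could look like this, where weights are fairlet sizes and values encode closeness to the cluster's medoid:

```python
def knapsack_assign(weights, values, capacity):
    """0/1 knapsack: choose fairlets for one cluster, maximizing total value
    (e.g., closeness to the cluster medoid) under the capacity cap.

    Returns (best_value, chosen_item_indices)."""
    n = len(weights)
    # dp[c] = (best value achievable with total weight <= c, chosen indices)
    dp = [(0.0, [])] * (capacity + 1)
    for i in range(n):
        new_dp = dp[:]  # read from dp (without item i), write to new_dp
        for c in range(weights[i], capacity + 1):
            cand = dp[c - weights[i]][0] + values[i]
            if cand > new_dp[c][0]:
                new_dp[c] = (cand, dp[c - weights[i]][1] + [i])
        dp = new_dp
    return max(dp, key=lambda t: t[0])
```

Applied per cluster, fairlets selected for one cluster would be removed from the pool before solving for the next, so that every fairlet is assigned exactly once.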
