# Clustering datasets

## Datasets

This project contains a collection of labeled clustering problems found in the literature. Most of the datasets were artificially created.

All datasets can be found in the link:https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial[data folder].

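Each dataset is stored as an ARFF file. The following Python sketch shows one way to read such a file and recover the three numbers listed in the tables below (data points, clusters, dimension). It assumes that the last attribute holds the cluster label; the inline sample, its attribute names, and the `parse_arff` and `summarize` helpers are hypothetical, and the parser ignores ARFF features such as quoted values or sparse rows.

```python
from typing import List, Tuple

# Hypothetical sample in ARFF layout; the real files live in the data folder.
SAMPLE = """\
@relation toy
@attribute x numeric
@attribute y numeric
@attribute class {0,1}
@data
0.1,0.2,0
0.3,0.1,0
5.0,5.1,1
% a comment line
4.9,5.3,1
"""


def parse_arff(text: str) -> Tuple[List[List[float]], List[str]]:
    """Return (points, labels), assuming the last attribute is the class label."""
    points: List[List[float]] = []
    labels: List[str] = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # Header section: wait for the @data marker (case-insensitive).
            in_data = line.lower().startswith("@data")
            continue
        *features, label = [field.strip() for field in line.split(",")]
        points.append([float(value) for value in features])
        labels.append(label)
    return points, labels


def summarize(points: List[List[float]], labels: List[str]) -> Tuple[int, int, int]:
    """Return (data points, clusters, dimension) in the order used by the tables."""
    dimension = len(points[0]) if points else 0
    return len(points), len(set(labels)), dimension


points, labels = parse_arff(SAMPLE)
print(summarize(points, labels))  # prints (4, 2, 2)
```

To apply the same helpers to a downloaded dataset, pass the file contents instead of `SAMPLE`, after checking that the label really is the last attribute in that file.
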
### 2d-10c

[align="right",options="header"]
|===
| data points | clusters | dimension
| 2990 | 10 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/2d-10c.png["2d-10c",400,float="left"]

[.float .right]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/2d-10c.arff[ARFF]
* link:https://github.com/deric/handl-data-generators[generator]

> J. Handl and J. Knowles, “Multiobjective clustering with automatic
> determination of the number of clusters,” UMIST, Tech. Rep., 2004.

### atom

[align="right",options="header"]
|===
| data points | clusters | dimension
| 800 | 2 | 3
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/atom.png["atom",400,float="left"]

[.float .right]
* source: link:https://www.uni-marburg.de/fb12/datenbionik/data?language_sync=1[FCPS]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/atom.arff[ARFF]

### aggregation

[align="right",options="header"]
|===
| data points | clusters | dimension
| 788 | 7 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/aggregation.png[aggregation,400,float="left"]

[.float .right]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/aggregation.arff[ARFF]
* link:http://cs.joensuu.fi/sipu/datasets/[original source]

> Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation.
> ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

### chainlink

[align="right",options="header"]
|===
| data points | clusters | dimension
| 1000 | 2 | 3
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/chainlink.png["chainlink",400,float="left"]

[.float .right]
* source: link:https://www.uni-marburg.de/fb12/datenbionik/data?language_sync=1[FCPS]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/chainlink.arff[ARFF]

> Alfred Ultsch, Clustering with SOM: U*C,
> in Proc. Workshop on Self Organizing Feature Maps, pp. 31-37, Paris, 2005.

### D31

[align="right",style="asciidoc",options="noborders,wide"]
|===
| data points | 3100
| clusters | 31
| dimensions | 2
| image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/D31.png["D31",400,float="left"] | * link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/D31.arff[ARFF]
|===

> Veenman, C.J., M.J.T. Reinders, and E. Backer,
> A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 2002. 24(9): p. 1273-1280.

### 3MC

[align="right",options="header",style="literal"]
|===
| data points | clusters | dimension
| 400 | 3 | 2
|===

[.float .right]
image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/3MC.png["3MC",400,float="left"]

### DS577

[align="right",options="header"]
|===
| data points | clusters | dimension
| 577 | 3 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/DS577.png["DS577",400,float="left"]

[.float .right]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/DS577.arff[ARFF]

> M. C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,”
> International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

### cluto-t4_8k

[align="right",options="header"]
|===
| data points | clusters | dimension
| 8000 | 7 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/cluto-t4_8k.png["cluto-t4_8k",400,float="left"]

[.float .right]
* link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/cluto-t4.8k.arff[ARFF]

> G. Karypis, “CLUTO A Clustering Toolkit,”
> Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at
> http://www.cs.umn.edu/~cluto.

## Experiments

This project contains a set of benchmarks of clustering methods on various datasets. It depends on the [Clueminer project](https://github.com/deric/clueminer).

To run a benchmark, first compile the dependencies into a single JAR file:

    mvn assembly:assembly

### Consensus experiment

The consensus experiment runs the same algorithm repeatedly:

```
./run consensus --dataset "triangle1" --repeat 10
```

By default, the k-means algorithm is used.

For available datasets see the [resources folder](https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial).
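
To illustrate what repeating the same algorithm can reveal, the following self-contained Python sketch runs a plain Lloyd-style k-means several times with different random seeds and reports how much the resulting partitions agree. This is not the Clueminer implementation: the `kmeans`, `rand_index`, and `repeated_runs` helpers, the toy data, and the use of the Rand index as the agreement measure are all assumptions made for illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, seed, iterations=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
    return assignment


def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def repeated_runs(points, k, repeat):
    """Run k-means `repeat` times and return the mean pairwise Rand index."""
    runs = [kmeans(points, k, seed) for seed in range(repeat)]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy data: two well-separated blobs (hypothetical, not one of the datasets above).
rng = random.Random(0)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(30)]

print(f"mean Rand index over 10 runs: {repeated_runs(data, k=2, repeat=10):.3f}")
```

A mean close to 1 suggests that the algorithm is stable under different initialisations on that dataset. Lower values indicate that individual runs disagree, which is the situation consensus methods are meant to address.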