且构网

Sharing all things programming and development...
Running clustering algorithms in ELKI

Updated: 2022-12-30 18:22:55

We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)

ELKI advocates against embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Custom code is best added as, for example, a custom ResultHandler, or by just using the ResultWriter and parsing the resulting text files.
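As a rough illustration of the "parse the resulting text files" route, the sketch below reads one gzip-compressed output file (as written with the `-out.gzip` option) and keeps only the data lines. It is not ELKI code: `ResultFileReader` and `readDataLines` are hypothetical names, and the assumption that metadata lines start with `#` should be checked against the files your ELKI version actually writes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper (not part of ELKI): reads a gzipped result file. */
final class ResultFileReader {
    private ResultFileReader() {}

    /**
     * Returns the non-empty lines of a gzipped text file, skipping lines
     * that start with '#'. Assumption: '#' marks comment/metadata lines.
     */
    static List<String> readDataLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // skip blank and (assumed) metadata lines
                }
                lines.add(trimmed);
            }
        }
        return lines;
    }
}
```

Each returned line can then be split into fields in whatever way suits your own analysis.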

If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:

// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!

// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class,
    params);
// Don't forget this, it will load the actual data...
db.initialize();

Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);

If you want your code to be more general, use NumberVector<?>.


  1. The API is still changing a lot. We keep on adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / Parameterization is much more stable because of how default values are handled: the parameterization only lists the non-default parameters, so you will only notice a change if one of these changes.

Note that the code example above also uses this pattern, so a change to the parsers, database, etc. will likely not affect this program!

  2. Memory usage: data mining is quite memory intensive. If you use the MiniGUI or the command line, you get a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep some reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!

By running ELKI from the command line, you get two things for free:


     1. No memory leaks. When the task is finished, the process quits and frees all memory.

     2. No need to run the algorithm twice on the same data. Subsequent analysis does not need to rerun it.


  3. ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price in runtime (although it can easily outperform R and Weka, for example!), in memory usage, and in particular in code complexity. ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
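If a Java program has to drive the analysis anyway, one way to keep the cleanup benefit described under "Memory usage" above is to start ELKI as a separate process instead of calling it in-process. The following is only a minimal sketch of that idea, not an ELKI API: `ElkiLauncher`, `buildCommand` and `run` are hypothetical names, `elki.jar` is assumed to sit in the working directory, and the options are simply the ones used in the shell script further down in this post.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of ELKI): runs ELKI in a child JVM. */
final class ElkiLauncher {
    private ElkiLauncher() {}

    /** Builds the command line for one clustering run with the given k. */
    static List<String> buildCommand(String inputFile, int k, String outputDir) {
        return Arrays.asList(
            "java", "-jar", "elki.jar", // assumption: jar is in the working directory
            "KDDCLIApplication",
            "-dbc.in", inputFile,
            "-algorithm", "clustering.kmeans.KMedoidsEM",
            "-kmeans.k", Integer.toString(k),
            "-resulthandler", "ResultWriter", "-out.gzip",
            "-out", outputDir);
    }

    /**
     * Starts the command as a child process and waits for it to finish.
     * Whatever memory the run used is released when the child exits.
     */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .inheritIO() // show ELKI's log output on our own console
            .start();
        return process.waitFor();
    }
}
```

A caller would loop over the values of k, call `ElkiLauncher.run(ElkiLauncher.buildCommand(file, k, "output/k-" + k))`, and check the returned exit code before reading the output directory.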



    Best ways of using ELKI

    Here are some tips and tricks:


    1. Use the MiniGUI to build a command line. Note that the logging window of the "GUI" shows the corresponding command line parameters. Running ELKI from the command line is easy to script, and can easily be distributed to multiple computers, e.g. via Grid Engine.

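    #!/bin/bash
    # Run the algorithm once for each k from 3 to 39, each in its own process.
    # "whatever" is a placeholder for your input file.
    for k in $(seq 3 39); do
        java -jar elki.jar KDDCLIApplication \
            -dbc.in whatever \
            -algorithm clustering.kmeans.KMedoidsEM \
            -kmeans.k "$k" \
            -resulthandler ResultWriter -out.gzip \
            -out "output/k-$k"
    done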


    2. Use indexes. For many algorithms, index structures can make a huge difference! (But you need to do some research into which indexes can be used with which algorithms!)

    3. Consider using the extension points such as ResultWriter. It may be easiest for you to hook into this API, then use ResultUtil to select the results that you want to output in your own preferred format, or to analyze:

    List<Clustering<? extends Model>> clusterresults =
        ResultUtil.getClusteringResults(result);
    


    4. To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file with lines such as

    1.0 2.0 3.0 ObjectLabel1
    

    will make it easy to identify the object by its label!
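To make the labeling idea concrete, here is a small sketch of how a line like the one above can be split into a numeric vector and its labels. It only illustrates the principle and is not ELKI's parser: `LabeledLine` and `parse` are hypothetical names, and treating every token that does not parse as a number as a label is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not ELKI code): one parsed input line. */
final class LabeledLine {
    final List<Double> values = new ArrayList<>();
    final List<String> labels = new ArrayList<>();

    /** Splits a whitespace-separated line into numeric values and labels. */
    static LabeledLine parse(String line) {
        LabeledLine result = new LabeledLine();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for a blank line
            }
            try {
                result.values.add(Double.parseDouble(token));
            } catch (NumberFormatException e) {
                // Assumption: anything that is not a number is a label.
                result.labels.add(token);
            }
        }
        return result;
    }
}
```

For the sample line, `LabeledLine.parse("1.0 2.0 3.0 ObjectLabel1")` yields the values 1.0, 2.0 and 3.0 and the single label `ObjectLabel1`, which is what lets you trace a result back to the original object.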

    Update: see the ELKI tutorial created from this post for updates.