Updated: 2023-10-24 09:04:34
I would recommend manually reading in the entries from the CSV file, creating NamedVectors from them, and then using a sequence file writer to write the vectors to a sequence file. From there on, the KMeansDriver run method should know how to handle these files.
Sequence files encode key-value pairs, so the key would be the ID of the sample (it should be a string, stored as a Hadoop Text), and the value is a VectorWritable wrapper around the vector.
Here is a simple code sample on how to do this:
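import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Build the samples; each NamedVector carries its own ID.
List<NamedVector> vectors = new LinkedList<NamedVector>();
vectors.add(new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one"));

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");

// Write a SequenceFile from the vectors: key = name (Text), value = VectorWritable
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vectors) {
    vec.set(v);
    // Append the VectorWritable wrapper, not the raw NamedVector
    writer.append(new Text(v.getName()), vec);
}
writer.close();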
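The sample above hard-codes a single vector. For the CSV-reading step, a minimal sketch might look like the following. It uses only the Java standard library and assumes each line has the form "id,x1,x2,..." with no header row and no quoted fields; the class name CsvVectorReader and its Row type are hypothetical helpers for illustration, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Mahout class): turns CSV lines into (name, values) rows.
final class CsvVectorReader {

    /** One parsed CSV row: the sample ID plus its numeric features. */
    static final class Row {
        final String name;
        final double[] values;

        Row(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Assumption: each line is "id,x1,x2,..." with no header and no quoted fields.
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<Row>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                values[i - 1] = Double.parseDouble(fields[i].trim());
            }
            rows.add(new Row(fields[0].trim(), values));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = parse(Arrays.asList("Item number one,0.1,0.2,0.5"));
        for (Row r : rows) {
            // With Mahout on the classpath, each row would become:
            //   new NamedVector(new DenseVector(r.values), r.name)
            System.out.println(r.name + " has " + r.values.length + " features");
        }
    }
}
```

To read from an actual file, the lines could come from java.nio.file.Files.readAllLines; each Row would then be wrapped in a NamedVector and appended to the sequence file exactly as in the sample above.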
Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.