
Using HanLP for Word Segmentation in Spark

Updated: 2022-04-16 11:29:42

1. Upload HanLP's data directory (containing the dictionaries and models) to HDFS, then set the root path in the project's hanlp.properties configuration file accordingly, for example:
root=hdfs://localhost:9000/tmp/
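
If the data directory is not on HDFS yet, it can be uploaded with hadoop fs -put data /tmp/, or programmatically. A minimal sketch, where the local path /path/to/hanlp/data and the class name UploadHanLPData are placeholders:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadHanLPData {
    public static void main(String[] args) throws Exception {
        // Copy the local HanLP data directory into HDFS so that the root above
        // resolves dictionaries and models under hdfs://localhost:9000/tmp/data/
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        fs.copyFromLocalFile(new Path("/path/to/hanlp/data"), new Path("/tmp/"));
        fs.close();
    }
}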

2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.hankcs.hanlp.corpus.io.IIOAdapter;

public static class HadoopFileIoAdapter implements IIOAdapter {

    @Override
    public InputStream open(String path) throws IOException {
        // Read dictionary and model files from HDFS instead of the local file system
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    @Override
    public OutputStream create(String path) throws IOException {
        // Write files (e.g. dictionary caches) back to HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}
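
HanLP routes all of its dictionary and model reads (and cache writes) through the configured IOAdapter, so once this adapter is registered, every path resolved under the HDFS root above is opened through the Hadoop FileSystem API rather than java.io.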

3. Register the IOAdapter and create the segmenter:

private static Segment segment;

static {
    // Register the HDFS adapter before any dictionary or model is loaded
    HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
    segment = new CRFSegment();
}
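
Doing the registration in a static initializer guarantees that HanLP.Config.IOAdapter is set before any dictionary or model load is triggered, and each JVM runs the initializer exactly once when the class is first loaded, so the model is read from HDFS only once per Spark executor.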

After that, segment can be used inside Spark transformations to perform word segmentation.
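
As a minimal sketch, assuming the static segment field from step 3 lives in the same class, a job like the following tokenizes each line of a text file (the application name and the input/output paths are placeholders):

import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("HanLPSegmentDemo");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // seg() returns a List<Term>; join the words back into a space-separated line.
    // The static initializer from step 3 runs once per executor JVM, loading the
    // model from HDFS before the first call to seg().
    JavaRDD<String> segmented = sc.textFile("hdfs://localhost:9000/tmp/input.txt")
            .map(line -> segment.seg(line).stream()
                    .map(term -> term.word)
                    .collect(Collectors.joining(" ")));

    segmented.saveAsTextFile("hdfs://localhost:9000/tmp/segmented");
    sc.stop();
}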

This article comes from 云聪's blog.