Using JDBCDataModel with Mahout 0.7

Updated: 2022-10-05 08:10:59

First, create the database and the corresponding table in MySQL:

mysql> create database mahout;
Query OK, 1 row affected (0.00 sec)
mysql> use mahout;
Database changed
mysql> create table intro(
    ->  uid varchar(20) not null,
    ->  iid varchar(50) not null,
    ->  val varchar(50) not null,
    ->  time varchar(50) default null
    -> );

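Note: the computation consumes a lot of resources, so it is advisable to add indexes and to set tuning parameters in my.ini.

(Here the goal is only to get things working.)

Insert the data. This uses the data from the first recommender example in Mahout in Action. Remove the blank lines from the file first, otherwise MySQL reports that a column cannot be null.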


mysql> load data local infile 'D:/intro.csv' replace into table intro fields terminated by ',' lines terminated by '\n' (@col1,@col2,@col3) set uid=@col1,iid=@col2,val=@col3;
Query OK, 21 rows affected (0.19 sec)
Records: 21  Deleted: 0  Skipped: 0  Warnings: 0
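
The blank-line cleanup mentioned above can be scripted instead of done by hand. The following is a minimal sketch using only the JDK; the class name and the default file names (intro.csv, intro-clean.csv) are assumptions for illustration and are not part of the original setup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class CsvBlankLineStripper {

    // Returns the input lines with empty and whitespace-only lines dropped.
    static List<String> stripBlankLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default paths; pass your own as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "intro.csv");
        Path out = Paths.get(args.length > 1 ? args[1] : "intro-clean.csv");

        List<String> cleaned = stripBlankLines(Files.readAllLines(in, StandardCharsets.UTF_8));
        Files.write(out, cleaned, StandardCharsets.UTF_8);
        System.out.println("Wrote " + cleaned.size() + " rows to " + out);
    }
}
```

Point the load data command at the cleaned file afterwards.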

Check the data:

mysql> select * from intro;
+-----+-----+-----+------+
| uid | iid | val | time |
+-----+-----+-----+------+
| 1   | 101 | 5.0 | NULL |
| 1   | 102 | 3.0 | NULL |
| 1   | 103 | 2.5 | NULL |
| 2   | 101 | 2.0 | NULL |
| 2   | 102 | 2.5 | NULL |
| 2   | 103 | 5.0 | NULL |
| 2   | 104 | 2.0 | NULL |
| 3   | 101 | 2.5 | NULL |
| 3   | 104 | 4.0 | NULL |
| 3   | 105 | 4.5 | NULL |
| 3   | 107 | 5.0 | NULL |
| 4   | 101 | 5.0 | NULL |
| 4   | 103 | 3.0 | NULL |
| 4   | 104 | 4.5 | NULL |
| 4   | 106 | 4.0 | NULL |
| 5   | 101 | 4.0 | NULL |
| 5   | 102 | 3.0 | NULL |
| 5   | 103 | 2.0 | NULL |
| 5   | 104 | 4.0 | NULL |
| 5   | 105 | 3.5 | NULL |
| 5   | 106 | 4.0 | NULL |
+-----+-----+-----+------+
21 rows in set (0.00 sec)


Next comes the actual program. It is kept simple, since the main goal is to get things working.

import java.util.List;

import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.JDBCDataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

public class MysqlJDBCRecommender {
    public static void main(String[] args) throws Exception {
        // Connection settings for the local MySQL instance
        MysqlDataSource dataSource = new MysqlDataSource();
        dataSource.setServerName("localhost");
        dataSource.setUser("root");
        dataSource.setPassword("toor");
        dataSource.setDatabaseName("mahout");

        // Table name, then the user ID, item ID, preference, and timestamp columns
        JDBCDataModel dataModel = new MySQLJDBCDataModel(dataSource, "intro", "uid", "iid", "val", "time");

        DataModel model = dataModel;
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood made up of the 2 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Ask for up to 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/java%e9%a9%b1%e5%8a%a8/mahout0.7/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/12/07 13:56:41 WARN jdbc.AbstractJDBCDataModel: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]
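
The scores above can be checked by hand. The sketch below is not Mahout's implementation; it is a plain-Java approximation of the arithmetic, assuming that the estimate is a similarity-weighted average over the neighbors and that users 4 and 5 are the two nearest neighbors of user 1 (they have the highest Pearson correlation with user 1 in this data set). The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UserBasedEstimateSketch {

    // Builds an item -> preference map from two parallel arrays.
    static Map<Long, Double> prefs(long[] items, double[] values) {
        Map<Long, Double> m = new LinkedHashMap<>();
        for (int i = 0; i < items.length; i++) {
            m.put(items[i], values[i]);
        }
        return m;
    }

    // Pearson correlation computed over the items both users have rated.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        double sumA = 0, sumB = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sumA += e.getValue();
                sumB += other;
                n++;
            }
        }
        if (n == 0) {
            return Double.NaN;
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue;
            }
            double da = e.getValue() - meanA;
            double db = other - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB); // NaN when either variance is 0
    }

    // Similarity-weighted average of the neighbors' preferences for one item.
    static double estimate(long item, Map<Long, Double> target, List<Map<Long, Double>> neighbors) {
        double weighted = 0, totalWeight = 0;
        for (Map<Long, Double> neighbor : neighbors) {
            Double pref = neighbor.get(item);
            double sim = pearson(target, neighbor);
            if (pref != null && !Double.isNaN(sim)) {
                weighted += sim * pref;
                totalWeight += sim;
            }
        }
        return weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Rows for users 1, 4 and 5 from the intro table
        Map<Long, Double> u1 = prefs(new long[]{101, 102, 103}, new double[]{5.0, 3.0, 2.5});
        Map<Long, Double> u4 = prefs(new long[]{101, 103, 104, 106}, new double[]{5.0, 3.0, 4.5, 4.0});
        Map<Long, Double> u5 = prefs(new long[]{101, 102, 103, 104, 105, 106},
                                     new double[]{4.0, 3.0, 2.0, 4.0, 3.5, 4.0});
        List<Map<Long, Double>> neighbors = Arrays.asList(u4, u5);

        System.out.printf("sim(1,4)=%.4f sim(1,5)=%.4f%n", pearson(u1, u4), pearson(u1, u5));
        System.out.printf("item 104: %.4f%n", estimate(104L, u1, neighbors));
        System.out.printf("item 106: %.4f%n", estimate(106L, u1, neighbors));
    }
}
```

With these inputs the estimates for items 104 and 106 should come out close to the 4.257081 and 4.0 reported by Mahout.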


Recommendations from the MySQLJDBCDataModel API documentation

JDBCDataModel backed by a MySQL database and accessed via JDBC. It may work with other JDBC databases. By default, this class assumes that there is a DataSource available under the JNDI name "jdbc/taste", which gives access to a database with a "taste_preferences" table with the following schema:

user_id  item_id  preference
987      123      0.9
987      456      0.1
654      123      0.2
654      789      0.3

preference must have a type compatible with the Java float type. user_id and item_id should be compatible with long type (BIGINT). For example, the following command sets up a suitable table in MySQL, complete with primary key and indexes:

 CREATE TABLE taste_preferences (
   user_id BIGINT NOT NULL,
   item_id BIGINT NOT NULL,
   preference FLOAT NOT NULL,
   PRIMARY KEY (user_id, item_id),
   INDEX (user_id),
   INDEX (item_id)
 )
 

The table may optionally have a timestamp column whose type is compatible with Java long.

Performance Notes

See the notes in AbstractJDBCDataModel regarding using connection pooling. It's pretty vital to performance.

Some experimentation suggests that MySQL's InnoDB engine is faster than MyISAM for these kinds of applications. While MyISAM is the default and, I believe, generally considered the lighter-weight and faster of the two engines, my guess is the row-level locking of InnoDB helps here. Your mileage may vary.

Here are some key settings that can be tuned for MySQL, and suggested size for a data set of around 1 million elements:

  • innodb_buffer_pool_size=64M

  • myisam_sort_buffer_size=64M

  • query_cache_limit=64M

  • query_cache_min_res_unit=512K

  • query_cache_type=1

  • query_cache_size=64M

Also consider setting some parameters on the MySQL Connector/J driver:

 cachePreparedStatements = true
 cachePrepStmts = true
 cacheResultSetMetadata = true
 alwaysSendSetIsolation = false
 elideSetAutoCommits = true
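
One way to apply driver parameters like these is to append them to the JDBC URL as query parameters. The sketch below only assembles such a URL with the JDK; the class and method names are hypothetical, the host and database are the ones used earlier in this post, and cachePreparedStatements is left out on the assumption that cachePrepStmts is the name the driver recognizes. Check the property names against your Connector/J version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConnectorJUrlSketch {

    // Driver properties taken from the list above, kept in insertion order.
    static Map<String, String> suggestedProps() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("cachePrepStmts", "true");
        props.put("cacheResultSetMetadata", "true");
        props.put("alwaysSendSetIsolation", "false");
        props.put("elideSetAutoCommits", "true");
        return props;
    }

    // Builds jdbc:mysql://host/database?key=value&key=value...
    // Values are not URL-encoded; that is fine for simple true/false flags.
    static String buildUrl(String host, String database, Map<String, String> props) {
        StringBuilder url = new StringBuilder("jdbc:mysql://")
                .append(host).append('/').append(database);
        char separator = '?';
        for (Map.Entry<String, String> e : props.entrySet()) {
            url.append(separator).append(e.getKey()).append('=').append(e.getValue());
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost", "mahout", suggestedProps()));
    }
}
```

The resulting string could then be handed to the data source instead of setting the server and database names separately.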
 

Thanks to Amila Jayasooriya for contributing MySQL notes above as part of Google Summer of Code 2007.



Reposted from 拖鞋崽's 51CTO blog. Original link: http://blog.51cto.com/1992mrwang/1337759