且构网


Comparing strings between two tables in a database, or locally

Updated: 2023-02-22 12:56:43



Edit: SQL doesn't work for this. I just found out about Solr/Sphinx and it seems like the right tool for this problem, so if you know Solr or Sphinx I'm eager to hear from you.

Basically, I have a .tsv with patent info and a .csv with product names. I need to match each row of the patents column against the product names and extract the occurrences in a new .csv column.

You can scroll down and see the example at the end.
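For the file-based version of the task, the plumbing might look roughly like the sketch below. It only illustrates reading a .tsv and a .csv with Python's `csv` module and writing the matches to a new column. The header names (`patent_id`, `text`, `name`), the output column `products`, and the in-memory stand-ins for the files are all assumptions, and the per-row substring test is far too slow for the full data set.

```python
import csv
import io

# Stand-ins for the real files; header names and contents are assumptions.
products_csv = io.StringIO("name\nvalve\na/c fan\nfarmed salmon\n")
patents_tsv = io.StringIO(
    "patent_id\ttext\n"
    "p1\tWith some new valve the a/c fan is so much better.\n"
    "p2\tThis patent has no product names in it.\n"
)
output_csv = io.StringIO()

# Load every product name into memory.
names = [row["name"] for row in csv.DictReader(products_csv)]

# Stream the patents and append a hypothetical "products" column.
reader = csv.DictReader(patents_tsv, delimiter="\t")
writer = csv.DictWriter(output_csv, fieldnames=reader.fieldnames + ["products"])
writer.writeheader()
for row in reader:
    text = row["text"].lower()
    # Naive substring test per product name: fine for a demo only.
    row["products"] = ", ".join(n for n in names if n.lower() in text)
    writer.writerow(row)

print(output_csv.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by `open(..., newline="")` handles.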

Original question:

SQL newbie here so bear with me :). I can't figure out how to do this:

My database:

mysql> SHOW TABLES;
+-----------------------+
| Tables_in_prodpatdb   |
+-----------------------+
| assignee              |
| patents               |
| patent_info           |
| products              |
+-----------------------+
mysql> DESCRIBE patents;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| ...         |             |      |     |         |       |
| patent_id   | varchar(20) | YES  |     | NULL    |       |
| text        | text        | YES  |     | NULL    |       |
| ...         |             |      |     |         |       |
+-------------+-------------+------+-----+---------+-------+
mysql> DESCRIBE products;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| name        | text        | YES  |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

I have to work with the columns name and text, they look like this:

name
product1
product2
product3
...
~10M rows


text
long text description 1
long text description 2
long text description 3
...
~88M rows

I need to check patents.text row 1 and match it against the products.name column to find every product name in that row, then store those product names in a new table. Then check row 2 and repeat.

If a patents.text row has a product name several times, only copy it to the new table once. If some row has no product names, just skip it. The output should be something like this:

Operation  Product
1          prod5, prod6
2          prod7
...

An example:

name
valve
a/c fan
farmed salmon
...


text
This patent deals with a new approach to air-conditioned fan. With some new valve the a/c fan is so much better. The new valve is great.
This patent has no product names in it.
This patent talks about farmed salmon.
...


Desired output:

Operation   Product
1           valve, a/c fan
2           farmed salmon
...
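To pin down the semantics of the desired output, here is a minimal Python sketch run on the example data. It assumes case-insensitive substring matching (so "valve" would also match "valves"), assumes the product list has no duplicates, and treats "Operation" as a running number over the rows that had at least one hit. The function names are made up for illustration, and comparing every text with every name does not scale to the real data.

```python
def find_products(text, product_names):
    """Return the product names occurring in text, each once, by first occurrence."""
    lowered = text.lower()
    hits = []
    for name in product_names:
        pos = lowered.find(name.lower())
        if pos != -1:
            hits.append((pos, name))
    return [name for _, name in sorted(hits)]


def match_all(texts, product_names):
    """Yield (operation_number, names) for every text with at least one hit."""
    operation = 0
    for text in texts:
        found = find_products(text, product_names)
        if found:  # rows without any product name are skipped
            operation += 1
            yield operation, found


products = ["valve", "a/c fan", "farmed salmon"]
patents = [
    "This patent deals with a new approach to air-conditioned fan. "
    "With some new valve the a/c fan is so much better. The new valve is great.",
    "This patent has no product names in it.",
    "This patent talks about farmed salmon.",
]

results = list(match_all(patents, products))
for operation, names in results:
    print(operation, ", ".join(names))
```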

The only way I can see to do this with reasonable performance is a full-text search. I've seldom done these myself (maybe 3 times in 20+ years now), so I'll defer to someone else with more experience.

Use https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html as a starting point.

Provided the full-text index has been created, it may be something as simple as:

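-- Sketch only: assumes a FULLTEXT index exists on patents.text.
-- Caveat: MySQL requires the AGAINST() search string to be constant during
-- query evaluation, so a column such as p.name is rejected here; the lookup
-- would have to be issued once per product name instead.
SELECT pat.patent_id, GROUP_CONCAT(p.name)
FROM patents pat
CROSS JOIN products p
WHERE MATCH (pat.text)
      AGAINST (p.name IN NATURAL LANGUAGE MODE)
GROUP BY pat.patent_id;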

Since every product has to be compared against every patent, we have to cross join, which gives roughly 880 trillion candidate pairs (10M products × 88M patents). That alone is a lot. The more reading I do on this, however, the more I realize we're dealing with unstructured data in an RDBMS. By its nature that's not an ideal fit, and there may be much better optimized methods to handle this outside of an RDBMS. Otherwise, we have to spend the time to structure the data in the RDBMS so it can make more effective use of the indexes (such as splitting the text into its own rows, one per word, for indexing).
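One way to picture the "structure the data by word" idea outside the database is a lookup keyed on the first word of each product name, so each patent is scanned once instead of being compared with every product. This is only a sketch of that idea, not a tuned implementation. The tokenisation rule and the function names are assumptions, and unlike a `LIKE '%...%'` search it matches whole words only.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9/'-]+")  # assumed tokenisation rule


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def build_index(product_names):
    """Map the first word of each product name to the names starting with it."""
    index = defaultdict(list)
    for name in product_names:
        words = tuple(tokenize(name))
        if words:
            index[words[0]].append((words, name))
    return index


def products_in(text, index):
    """Return product names found in text, each once, in order of appearance."""
    words = tokenize(text)
    found = []
    for i, word in enumerate(words):
        # Only products whose first word equals this token are candidates.
        for name_words, name in index.get(word, ()):
            window = tuple(words[i:i + len(name_words)])
            if window == name_words and name not in found:
                found.append(name)
    return found


index = build_index(["valve", "a/c fan", "farmed salmon"])
print(products_in("With some new valve the a/c fan is so much better.", index))
```

The work per patent then depends on the length of its text and the number of products sharing a first word, rather than on the total number of products.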

Lastly, do we really need to look for ALL products? The sheer size of the data involved on both sides means this is going to take time in a database that doesn't handle unstructured data well.

Scratch the approach below, as it will not be able to handle the load effectively. I'm keeping it here for posterity.

I think concat() and group_concat() may do the trick.

We join where patents.text is like the product name, which generates multiple rows. The group_concat then combines these rows into one record. I'm not sure where "Operation" comes from in your result.

SELECT pat.text, GROUP_CONCAT(p.name) AS Product
FROM patents pat
INNER JOIN products p
  ON pat.text LIKE CONCAT('%', p.name, '%')
GROUP BY pat.text;

However, don't expect this to be fast: we're doing a wildcard search with a % on both ends, so no index can be used.
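To try the shape of this join on the example data without a MySQL server, something like the following should work with Python's built-in `sqlite3` module. It is a stand-in, not the MySQL query itself: SQLite's `||` operator replaces `CONCAT()`, the patent IDs `p1` to `p3` are invented, and the result is grouped by `patent_id` rather than by the full text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patents (patent_id TEXT, text TEXT);
    CREATE TABLE products (name TEXT);
""")
# Example rows from the question; the patent IDs are hypothetical.
conn.executemany("INSERT INTO patents VALUES (?, ?)", [
    ("p1", "With some new valve the a/c fan is so much better. The new valve is great."),
    ("p2", "This patent has no product names in it."),
    ("p3", "This patent talks about farmed salmon."),
])
conn.executemany("INSERT INTO products VALUES (?)",
                 [("valve",), ("a/c fan",), ("farmed salmon",)])

# The inner join drops patents with no matching product name.
rows = conn.execute("""
    SELECT pat.patent_id, group_concat(p.name, ', ') AS product
    FROM patents pat
    INNER JOIN products p
        ON pat.text LIKE '%' || p.name || '%'
    GROUP BY pat.patent_id
    ORDER BY pat.patent_id
""").fetchall()

for patent_id, product in rows:
    print(patent_id, product)
```

Each product name appears at most once per patent because the join produces one row per (patent, product) pair, however many times the name occurs in the text. The order of names inside each concatenated value is not guaranteed.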