且构网

Detecting a (naughty or nice) URL or link in a text string

Updated: 2023-02-23 09:07:01

I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to circumvent your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. The solution would look different if your goal were something else.

I think your best bet is going to be the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to catch "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:

\w\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(/|\b)

This will match:

  • buyfunkypharmaceuticals.it
  • google.com
  • http://stackoverflo**w.com/**questions/700163/
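To make the behaviour concrete, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not a definitive implementation: the names `TLD_PATTERN` and `contains_url` are made up for this example, and the pattern writes the boundary checks as `\w` and `\b` outside of character classes, because in most regex flavours `\b` inside `[...]` means a backspace character rather than a word boundary.

```python
import re

# Hypothetical names for illustration; the TLD list is the one from the answer.
# A word character, a literal dot, a TLD, then either a slash or a word boundary.
TLD_PATTERN = re.compile(
    r"\w\.(?:[a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs"
    r"|mil|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)"
)


def contains_url(text: str) -> bool:
    """Return True if the text contains something that looks like a URL."""
    return TLD_PATTERN.search(text) is not None


if __name__ == "__main__":
    samples = [
        "buyfunkypharmaceuticals.it",
        "google.com",
        "http://stackoverflow.com/questions/700163/",
        "I tried again. it doesn't work",   # legitimate text, not flagged
        "buyfunkypharmaceuticals . it",     # spaced-out form, not flagged
    ]
    for sample in samples:
        print(contains_url(sample), repr(sample))
```

The first three samples are flagged and the last two are not, which is the trade-off described above: the spaced-out form slips through so that ordinary sentences are left alone.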

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your concern here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that can be copied and pasted into the address bar, whilst keeping collateral damage to a bare minimum.