且构网

How to get the domain name (name + TLD) from a URL in Python

Updated: 2023-02-25 15:08:45

I want to extract the domain name (site name + TLD) from a list of URLs that may vary in format. For instance (current state ----> what I want):

mail.yahoo.com ----> yahoo.com
account.hotmail.co.uk ----> hotmail.co.uk
x.it ----> x.it
google.mail.com ----> google.com

Is there any Python code that can help me extract what I want from a URL, or should I do it manually?

This is somewhat non-trivial, as there is no simple rule for determining where the site name ends and the public suffix (com, co.uk, and so on) begins. Instead, the valid public suffixes are maintained as a list at PublicSuffix.org.
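To make the idea concrete, here is a minimal sketch of how a lookup against such a list might work. The SAMPLE_SUFFIXES set and the registered_domain function are hypothetical names made up for this illustration, and the set is a tiny hand-picked subset; the real list has thousands of entries plus wildcard and exception rules that this sketch ignores.

```python
# Tiny hand-picked subset of suffixes, for illustration only.
# The real list at PublicSuffix.org is far larger and also has
# wildcard and exception rules, which this sketch does not handle.
SAMPLE_SUFFIXES = {"com", "it", "uk", "co.uk"}


def registered_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label to its left."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate to the shortest,
    # so that "co.uk" is matched before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


print(registered_domain("mail.yahoo.com"))         # yahoo.com
print(registered_domain("account.hotmail.co.uk"))  # hotmail.co.uk
print(registered_domain("x.it"))                   # x.it
```

Note that under this rule google.mail.com comes out as mail.com, because the site name is taken to be the single label directly left of the suffix.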

A Python package exists that queries the Public Suffix List (stored locally); it's called publicsuffix:

>>> from publicsuffix import PublicSuffixList
>>> psl = PublicSuffixList()
>>> print(psl.get_public_suffix('mail.yahoo.com'))
yahoo.com
>>> print(psl.get_public_suffix('account.hotmail.co.uk'))
hotmail.co.uk
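
The question mentions URLs, while the session above passes bare hostnames. If the inputs carry a scheme, port, or path, the hostname probably needs to be isolated first. A minimal sketch using the standard library's urllib.parse follows; hostname_from_url is a hypothetical helper name, not part of publicsuffix.

```python
from urllib.parse import urlsplit


def hostname_from_url(url):
    """Return the lowercase hostname of a URL, or None if there is none."""
    parts = urlsplit(url.strip())
    if not parts.netloc:
        # urlsplit only recognises the host when "//" precedes it, so
        # retry scheme-less inputs such as "mail.yahoo.com/inbox".
        parts = urlsplit("//" + url.strip())
    # .hostname drops any user info and port, and lowercases the result.
    return parts.hostname


print(hostname_from_url("https://mail.yahoo.com/inbox?folder=1"))  # mail.yahoo.com
print(hostname_from_url("account.hotmail.co.uk/login"))            # account.hotmail.co.uk
```

The returned hostname can then be passed to psl.get_public_suffix(...) as in the session above.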