Quickly analyzing a large number of links with Selenium WebDriver

I have a web page with an extremely large number of links (around 300) and I would like to collect information on these links.

Here is my code:

beginning_time = Time.now
#This gets a collection of links from the webpage
tmp = driver.find_elements(:xpath,"//a[string()]")
end_time = Time.now
puts "Execute links:#{(end_time - beginning_time)*1000} milliseconds for #{tmp.length} links"


before_loop = Time.now
#Here I iterate through the links
tmp.each do |link|
    #I am not interested in the links I can't see
    if(link.location.x < windowX and link.location.y < windowY)
        #I then insert the links into a NoSQL database, 
        #but for all purposes you could imagine this as just saving the data in a hash table.
        $elements.insert({
            "text" => link.text,
            "href" => link.attribute("href"),
            "type" => "text",
            "x" => link.location.x,
            "y" => link.location.y,
            "url" => url,
            "accessTime" => accessTime,
            "browserId" => browserId
        })
    end
end
after_loop = Time.now
puts "The loop took #{(after_loop - before_loop)*1000} milliseconds"

It currently takes 20 ms to get the link collection and around 4000 ms (4 seconds) to retrieve the information for the links. When I separate the accessors from the NoSQL insert, I find that the NoSQL insert only takes 20 ms and that the majority of the time is spent in the accessors (which became much slower after being separated from the NoSQL insert, for reasons I don't understand), which makes me conclude that the accessors must be executing JavaScript.
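For scale, here is a minimal timing sketch (assuming the `driver` and `tmp` collection from the code above, and using Ruby's built-in Benchmark module); each accessor is its own WebDriver command, and therefore its own round trip to the browser:

    require 'benchmark'

    # Sketch only: every accessor below is a separate WebDriver command,
    # i.e. a separate round trip to the browser, so ~300 links multiply
    # that cost quickly.
    link = tmp.first
    per_link = Benchmark.realtime do
      link.text               # round trip 1
      link.attribute("href")  # round trip 2
      link.location           # round trip 3 (x/y are then read locally)
    end
    puts "Accessors for one link: #{(per_link * 1000).round(1)} ms"

At roughly 13 ms per link (4000 ms / 300 links), a handful of round trips per link at a few milliseconds each would account for the totals above.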

My question is: How do I collect these links and their information more quickly?

The first solution that came to mind was to run two drivers in parallel, but WebDrivers are not thread-safe, which means I would have to create a new WebDriver instance and navigate to the page again. That raises the question of how to download the page source so it can be loaded into another driver; this cannot be done in Selenium alone, so it would have to be performed on Chrome itself with desktop automation tools, adding a considerable amount of overhead.

Another alternative I have heard of is to stop using ChromeDriver and just use PhantomJS, but I need to display the page in a visible browser.

Is there any other alternative that I haven't considered yet?

You seem to be using WebDriver purely to execute JavaScript rather than to access the objects.

A couple of ideas to try IF you drop using JavaScript (excuse the Java, but you get the idea):

    // We have restricted via XPath, so we get fewer links back AND will not have to check the text within the loop
    List<WebElement> linksWithText = driver.findElements(By.xpath("//a[text() and not(text()='')]"));

    for (WebElement link : linksWithText) {

        // Store the location details rather than re-getting them each time
        Point location = link.getLocation();
        Integer x = location.getX();
        Integer y = location.getY();

        if (x < windowX && y < windowY) {
            // Insert all info using WebDriver commands
        }
    }

I normally use remote grids, so performance is a key concern in my tests, which is why I always try to restrict by CSS selectors or XPath rather than getting everything and looping.
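For the Ruby code in the question, a rough equivalent of the Java sketch might look like this (untested; it reuses the question's windowX/windowY bounds):

    links_with_text = driver.find_elements(:xpath, "//a[text() and not(text()='')]")

    links_with_text.each do |link|
      # Store the location details rather than re-getting them for each accessor
      location = link.location
      x = location.x
      y = location.y

      if x < windowX && y < windowY
        # Insert all info using WebDriver commands
      end
    end

The restricted XPath returns fewer elements to iterate, and caching the location cuts the four location round trips per link in the original loop down to one.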