且构网 - Sharing all things programming and development

How to make a SPA SEO crawlable?

Updated: 2023-11-04 07:56:04

Before starting, please make sure you understand what Google requires, particularly the use of pretty and ugly URLs. Now let's see the implementation:

On the client side you have only a single HTML page, which interacts with the server dynamically via AJAX calls. That's what a SPA is about. All the a tags on the client side are created dynamically in my application; we'll see later how to make these links visible to Google's bot on the server. Each such a tag needs a pretty URL in its href attribute so that Google's bot will crawl it. You don't want the href to be followed when the user clicks the link (even though you do want the server to be able to parse it, as we'll see later), because we don't want a new page to load; we only want to make an AJAX call that gets some data to display in part of the page, and to change the URL via JavaScript (e.g. using HTML5 pushState or Durandal). So we have both an href attribute for Google and an onclick handler that does the job when the user clicks the link. Now, since I use push-state I don't want any # in the URL the user sees, so a typical a tag may look like this:
<a href="http://www.xyz.com/#!/category/subCategory/product111" onClick="loadProduct('category','subCategory','product111'); return false;">see product111...</a>
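
As a rough illustration of the onclick side, here is a minimal sketch of what such a handler might do if you are not relying on Durandal for this. Only the name loadProduct comes from the link above; prettyPath, the env argument and fetchAndRender are hypothetical names introduced for this example, and env exists only so the function can be exercised outside a browser.

```javascript
// Build the pretty path shown in the address bar, e.g. /computers/laptops/product111.
function prettyPath(category, subCategory, product) {
    return '/' + [category, subCategory, product].map(encodeURIComponent).join('/');
}

// Hypothetical click handler: updates the URL locally and loads data via AJAX.
// env is an assumption of this sketch; in a browser it falls back to window.history
// and a do-nothing loader that you would replace with your own AJAX code.
function loadProduct(category, subCategory, product, env) {
    env = env || { history: window.history, fetchAndRender: function () {} };
    var path = prettyPath(category, subCategory, product);
    // Change the address bar without asking the server for a new page.
    env.history.pushState({ path: path }, '', path);
    // Fetch the data and draw it into the relevant part of the page.
    env.fetchAndRender(path);
    // Returning false lets an inline onclick cancel the href navigation.
    return false;
}
```

With this shape, the inline handler could also be written as onClick="return loadProduct('category','subCategory','product111')".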

'category' and 'subCategory' would probably be other phrases, such as 'communication' and 'phones' or 'computers' and 'laptops' for an electrical appliances store. Obviously there would be many different categories and subcategories. As you can see, the link goes directly to the category, subcategory and product, rather than passing them as extra parameters to a specific 'store' page such as http://www.xyz.com/store/category/subCategory/product111. This is because I prefer shorter and simpler links. It implies that there can be no category with the same name as one of my 'pages', e.g. 'about'.
I will not go into how to load the data via AJAX (the onclick part); search for it on Google, there are many good explanations. The only important thing I do want to mention here is that when the user clicks on this link, I want the URL in the browser to look like this:
http://www.xyz.com/category/subCategory/product111. And this URL is not sent to the server! Remember, this is a SPA where all the interaction between the client and the server is done via AJAX, with no links at all. All 'pages' are implemented on the client side, and a different URL does not trigger a call to the server (the server does need to know how to handle these URLs in case they are used as external links from another site to yours; we'll see that later in the server-side part). Now, this is handled wonderfully by Durandal. I strongly recommend it, but you can skip this part if you prefer other technologies. If you do choose it, and you're also using MS Visual Studio Express 2012 for Web like me, you can install the Durandal Starter Kit and then, in shell.js, use something like this:

define(['plugins/router', 'durandal/app'], function (router, app) {
    return {
        router: router,
        activate: function () {
            router.map([
                { route: '', title: 'Store', moduleId: 'viewmodels/store', nav: true },
                { route: 'about', moduleId: 'viewmodels/about', nav: true }
            ])
                .buildNavigationModel()
                .mapUnknownRoutes(function (instruction) {
                    instruction.config.moduleId = 'viewmodels/store';
                    instruction.fragment = instruction.fragment.replace("!/", ""); // for pretty-URLs, '#' already removed because of push-state, only ! remains
                    return instruction;
                });
            return router.activate({ pushState: true });
        }
    };
});

There are a few important things to notice here:

  1. The first route (with route:'') is for the URL which has no extra data in it, i.e. http://www.xyz.com. On this page you load general data using AJAX. There may actually be no a tags at all on this page. You will want to add the following tag so that Google's bot will know what to do with it:
    <meta name="fragment" content="!">. This tag will make Google's bot transform the URL to www.xyz.com?_escaped_fragment_= which we'll see later.
  2. The 'about' route is just an example of a link to other 'pages' you may want in your web application.
  3. Now, the tricky part is that there is no 'category' route, and there may be many different categories, none of which has a predefined route. This is where mapUnknownRoutes comes in. It maps these unknown routes to the 'store' route and also removes any '!' from the URL in case it's a pretty URL generated by Google's search engine. The 'store' route takes the info in the 'fragment' property and makes the AJAX call to get the data, display it, and change the URL locally. In my application, I don't load a different page for every such call; I only change the part of the page where this data is relevant and also change the URL locally.
  4. Notice the pushState:true, which instructs Durandal to use push-state URLs.
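
To make point 3 more concrete, here is a minimal sketch of how the 'store' view model might take the cleaned-up fragment apart before making its AJAX call. The helper name parseStoreFragment and the shape of the returned object are assumptions for illustration; they are not part of Durandal.

```javascript
// Hypothetical helper: applies the same "!/" clean-up as mapUnknownRoutes above,
// then splits the fragment into the three path segments used in this article.
function parseStoreFragment(fragment) {
    var clean = fragment
        .replace('!/', '')            // drop the hash-bang remainder, if any
        .replace(/^\/+|\/+$/g, '');   // trim leading and trailing slashes
    var parts = clean === '' ? [] : clean.split('/');
    return {
        category: parts[0] || null,
        subCategory: parts[1] || null,
        product: parts[2] || null
    };
}
```

Missing segments come back as null, so the same function covers the site root, a category listing and a single product.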

This is all we need on the client side. It can also be implemented with hashed URLs (in Durandal you simply remove the pushState:true for that). The more complex part (at least for me...) was the server part:

I'm using MVC 4.5 on the server side with WebAPI controllers. The server actually needs to handle 3 types of URLs: the ones generated by Google, both pretty and ugly, and also a 'simple' URL with the same format as the one that appears in the client's browser. Let's look at how to do this:

Pretty URLs and 'simple' ones are first interpreted by the server as if trying to reference a non-existent controller. The server sees something like http://www.xyz.com/category/subCategory/product111 and looks for a controller named 'category'. So in web.config I add the following lines to redirect these to a specific error-handling controller:

<customErrors mode="On" defaultRedirect="Error">
    <error statusCode="404" redirect="Error" />
</customErrors>

Now, this transforms the URL to something like: http://www.xyz.com/Error?aspxerrorpath=/category/subCategory/product111. I want the URL to be sent to the client, which will load the data via AJAX, so the trick here is to call the default 'index' controller as if no controller were referenced. I do that by adding a hash to the URL before all the 'category' and 'subCategory' parameters. The hashed URL does not require any special controller except the default 'index' controller, and the data is sent to the client, which then removes the hash and uses the info after the hash to load the data via AJAX. Here is the error handler controller code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Web.Http;

using System.Web.Routing;

namespace eShop.Controllers
{
    public class ErrorController : ApiController
    {
        [HttpGet, HttpPost, HttpPut, HttpDelete, HttpHead, HttpOptions, AcceptVerbs("PATCH"), AllowAnonymous]
        public HttpResponseMessage Handle404()
        {
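            // ASP.NET appends the original path as ?aspxerrorpath=/category/subCategory/product111
            string[] parts = Request.RequestUri.OriginalString.Split(new[] { '?' }, StringSplitOptions.RemoveEmptyEntries);
            string parameters = parts[1].Replace("aspxerrorpath=", "");
            // Redirect to the site root with the original path moved behind a hash,
            // so the default 'index' controller serves the SPA and the client reads the path
            var response = Request.CreateResponse(HttpStatusCode.Redirect);
            response.Headers.Location = new Uri(parts[0].Replace("Error", "") + string.Format("#{0}", parameters));
            return response;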
        }
    }
}
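
For readers who want to check the string handling without running the controller, the rewrite it performs can be sketched in a few lines of JavaScript. The function name errorUrlToHashUrl is made up for this illustration, and unlike the C# Replace, which replaces every occurrence, this sketch replaces only the first 'Error' and the first 'aspxerrorpath='.

```javascript
// Hypothetical mirror of Handle404's rewrite:
// .../Error?aspxerrorpath=/a/b/c  ->  .../#/a/b/c
function errorUrlToHashUrl(errorUrl) {
    var parts = errorUrl.split('?');
    if (parts.length < 2) {
        return null; // no aspxerrorpath query to move behind the hash
    }
    var originalPath = parts[1].replace('aspxerrorpath=', '');
    return parts[0].replace('Error', '') + '#' + originalPath;
}
```

For the example URL from the text, this gives http://www.xyz.com/#/category/subCategory/product111, which the server answers with the regular index page.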


But what about the ugly URLs? These are created by Google's bot and should return plain HTML that contains all the data the user sees in the browser. For this I use phantomjs. Phantom is a headless browser that does what the browser does on the client side, but on the server side. In other words, phantom knows (among other things) how to get a web page via a URL, parse it, including running all the javascript code in it (as well as getting data via AJAX calls), and give you back the HTML that reflects the DOM. If you're using MS Visual Studio Express you may want to install phantom via this link.
But first, when an ugly URL is sent to the server, we must catch it. For this, I added the following file to the 'App_Start' folder:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Web;
using System.Web.Mvc;
using System.Web.Routing;

namespace eShop.App_Start
{
    public class AjaxCrawlableAttribute : ActionFilterAttribute
    {
        private const string Fragment = "_escaped_fragment_";

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var request = filterContext.RequestContext.HttpContext.Request;

            if (request.QueryString[Fragment] != null)
            {

                var url = request.Url.ToString().Replace("?_escaped_fragment_=", "#");

                filterContext.Result = new RedirectToRouteResult(
                    new RouteValueDictionary { { "controller", "HtmlSnapshot" }, { "action", "returnHTML" }, { "url", url } });
            }
            return;
        }
    }
}

This filter is registered in 'FilterConfig.cs', also in 'App_Start':

using System.Web.Mvc;
using eShop.App_Start;

namespace eShop
{
    public class FilterConfig
    {
        public static void RegisterGlobalFilters(GlobalFilterCollection filters)
        {
            filters.Add(new HandleErrorAttribute());
            filters.Add(new AjaxCrawlableAttribute());
        }
    }
}
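
The detection and rewrite done by the filter can be sketched outside ASP.NET as follows. The function names isUglyUrl and uglyToClientUrl are assumptions for this example; the sketch relies on the standard URL class available in modern browsers and Node.js, and it does not unescape any characters the crawler may have percent-encoded in the fragment value.

```javascript
// True when the crawler has rewritten the URL into its _escaped_fragment_ form.
function isUglyUrl(rawUrl) {
    return new URL(rawUrl).searchParams.has('_escaped_fragment_');
}

// Same substitution as the filter: turn the ugly query back into a hash,
// giving the URL that phantom should open to render the page.
function uglyToClientUrl(rawUrl) {
    return rawUrl.replace('?_escaped_fragment_=', '#');
}
```

For example, http://www.xyz.com/?_escaped_fragment_=/category/subCategory/product111 becomes http://www.xyz.com/#/category/subCategory/product111.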

As you can see, 'AjaxCrawlableAttribute' routes ugly URLs to a controller named 'HtmlSnapshot', and here is this controller:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Web;
using System.Web.Mvc;

namespace eShop.Controllers
{
    public class HtmlSnapshotController : Controller
    {
        public ActionResult returnHTML(string url)
        {
            string appRoot = Path.GetDirectoryName(AppDomain.CurrentDomain.BaseDirectory);

            var startInfo = new ProcessStartInfo
            {
                Arguments = String.Format("{0} {1}", Path.Combine(appRoot, "seo\\createSnapshot.js"), url),
                FileName = Path.Combine(appRoot, "bin\\phantomjs.exe"),
                UseShellExecute = false,
                CreateNoWindow = true,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                RedirectStandardInput = true,
                StandardOutputEncoding = System.Text.Encoding.UTF8
            };
            var p = new Process();
            p.StartInfo = startInfo;
            p.Start();
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            ViewData["result"] = output;
            return View();
        }

    }
}

The associated view is very simple, just one line of code:
@Html.Raw( ViewBag.result )
As you can see in the controller, phantom loads a javascript file named createSnapshot.js under a folder I created called seo. Here is this javascript file:

var page = require('webpage').create();
var system = require('system');

var lastReceived = new Date().getTime();
var requestCount = 0;
var responseCount = 0;
var requestIds = [];
var startTime = new Date().getTime();

page.onResourceReceived = function (response) {
    if (requestIds.indexOf(response.id) !== -1) {
        lastReceived = new Date().getTime();
        responseCount++;
        requestIds[requestIds.indexOf(response.id)] = null;
    }
};
page.onResourceRequested = function (request) {
    if (requestIds.indexOf(request.id) === -1) {
        requestIds.push(request.id);
        requestCount++;
    }
};

function checkLoaded() {
    return page.evaluate(function () {
        return document.all["compositionComplete"];
    }) != null;
}
// Open the page
page.open(system.args[1], function () { });

var checkComplete = function () {
    // We don't allow it to take longer than 10 seconds, and otherwise
    // don't return until all requests are finished or the page signals completion
    if ((new Date().getTime() - lastReceived > 300 && requestCount === responseCount) || new Date().getTime() - startTime > 10000 || checkLoaded()) {
        clearInterval(checkCompleteInterval);
        var result = page.content;
        //result = result.substring(0, 10000);
        console.log(result);
        //console.log(results);
        phantom.exit();
    }
}
// Let us check to see if the page is finished rendering
var checkCompleteInterval = setInterval(checkComplete, 300);

I first want to thank Thomas Davis for the page where I got the basic code from :-).
You will notice something odd here: phantom keeps re-checking the page until the checkLoaded() function returns true. Why is that? This is because my specific SPA makes several AJAX calls to get all the data and place it in the DOM on my page, and phantom cannot know when all the calls have completed before returning the HTML reflection of the DOM. What I did here is that after the final AJAX call I add a <span id='compositionComplete'></span>, so that if this tag exists I know the DOM is complete. I do this in response to Durandal's compositionComplete event, see here for more. If this does not happen within 10 seconds I give up (it should take only a second or so at the most). The HTML returned contains all the links that the user sees in the browser. The scripts in the snapshot will not work properly, because the <script> tags that do exist in the HTML snapshot do not reference the right URL. This can be changed too in the javascript phantom file, but I don't think this is necessary, because the HTML snapshot is only used by Google to get the a links and not to run javascript. These links do reference a pretty URL, and in fact, if you try to view the HTML snapshot in a browser, you will get javascript errors, but all the links will work properly and direct you to the server once again with a pretty URL, this time getting the fully working page.
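
The stopping rule in createSnapshot.js combines three exits, which is easy to miss when reading the one-line condition. A minimal sketch that isolates it as a pure function follows; the name isSnapshotReady and the shape of its argument are invented for this illustration, while the 300 ms and 10000 ms thresholds are taken from the script.

```javascript
// Hypothetical restatement of the checkComplete condition as a pure function.
// s.now, s.lastReceived and s.startTime are timestamps in milliseconds.
function isSnapshotReady(s) {
    // Network has been quiet for 300 ms and every request got a response.
    var idle = s.now - s.lastReceived > 300 && s.requestCount === s.responseCount;
    // Hard limit: give up after 10 seconds.
    var timedOut = s.now - s.startTime > 10000;
    // The page itself signalled completion via the compositionComplete marker.
    return idle || timedOut || s.markerPresent;
}
```

Because the three conditions are ORed, the marker is only one of the ways the wait can end; a quiet network can also end it before the marker appears.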
This is it. Now the server knows how to handle both pretty and ugly URLs, with push-state enabled on both server and client. All ugly URLs are treated the same way using phantom, so there's no need to create a separate controller for each type of call.
One thing you might prefer to change is not to make a general 'category/subCategory/product' call but to add a 'store' segment, so that the link will look something like: http://www.xyz.com/store/category/subCategory/product111. This will avoid the problem in my solution that all invalid URLs are treated as if they were actually calls to the 'index' controller, and I suppose that these can then be handled within the 'store' controller without the addition to web.config I showed above.