且构网

Indexing static HTML blob storage content with Azure Cognitive Search does not work properly

Updated: 2023-02-08 23:02:19

I'm working on indexing static HTML content in blob storage. The documentation states that preprocessing analyzers will strip surrounding HTML tags when indexing content from that data source. However, our content value is always the entire raw HTML document. I'm also unable to pull out the value of our "meta description" tags. According to the documentation on Indexing Blob Storage, HTML content should automatically produce a metadata_description property, but the value is always null.

I've tried many different indexer configurations, but thus far have not been able to tell if I have something misconfigured or if Azure Search doesn't recognize the content type properly.

All of the files in blob storage have a .html file extension, and the Content Type column shows text/html.

This is the indexer configuration (some bits <redacted>):

{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_description",
      "targetFieldName": "description",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": "<delimiter>",
          "position": 1
        }
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null
}

This is likely due to the "parsingMode": "text" setting in your indexer configuration.

That parsing mode treats each document as plain text and extracts its literal contents. For an HTML file, that includes all of the HTML tags.

Change the setting to "parsingMode": "default" to strip the HTML tags from your documents.
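As a rough sketch of that change, the "parameters" block from the indexer in the question might look like the fragment below. Only the "parsingMode" value differs. The other values are copied from the question, the null-valued settings are left out for brevity, and the rest of the indexer definition is assumed to stay as it was.

```json
{
  "parameters": {
    "maxFailedItems": -1,
    "configuration": {
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  }
}
```

Since "default" is the default value, removing the "parsingMode" entry entirely should have the same effect. Blobs that were already indexed are probably not reprocessed until they change, so you may need to reset and rerun the indexer after updating it. The metadata_description property is, as far as I know, only produced when a blob is parsed as HTML, so the null values may be resolved by the same change, though the answer above does not confirm this.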