
Azure Data Factory: connecting to Blob Storage via access key

Updated: 2023-02-08 23:14:52

At the very bottom of the article linked above, about whitelisting the IP ranges of the integration runtime, Microsoft says the following:

When connecting to an Azure Storage account, IP network rules have no effect on requests originating from the Azure integration runtime in the same region as the storage account. For more details, please refer to this article.

I spoke to Microsoft support about this, and the issue is that whitelisting public IP addresses does not work for resources within the same region: because the resources are on the same network, they connect to each other using private IPs rather than public ones.

There are four options to resolve the original issue:

  • Allow access from all networks under "Firewalls and virtual networks" in the storage account (obviously a concern if you are storing sensitive data). I tested this and it works.
  • Create a new Azure-hosted integration runtime that runs in a different region. I tested this as well: my ADF data flow runs in the East region, and a runtime I created in East 2 worked immediately. The issue for me here is that I would have to have this reviewed by security before pushing to prod, because we'd be sending data across the public network. Even though it's encrypted, it's still not as secure as having two resources talking to each other on the same network.
  • Use a separate activity, such as an HDInsight activity (e.g. Spark) or an SSIS package. I'm sure this would work, but the issue with SSIS is cost, as we would have to spin up an SSIS DB and then pay for the compute. You also need to execute additional activities in the pipeline to start and stop the SSIS integration runtime before and after execution. Also, I don't feel like learning Spark just for this.
  • Finally, the solution I used was to create a new connection that replaced the Blob Storage connection with a Data Lake Storage Gen2 connection for the dataset. It worked like a charm. Unlike the Blob Storage connection, Managed Identity is supported for Azure Data Lake Storage Gen2, as per this article. In general, the more specific the connection type, the more likely its features will fit the specific need.
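
The last option amounts to swapping the dataset's linked service for an ADLS Gen2 one. A minimal sketch of what that linked service definition looks like in ADF, assuming a placeholder account name `mystorageaccount` (when no explicit credential is specified, ADF authenticates with the data factory's system-assigned managed identity, which must be granted a role such as Storage Blob Data Reader on the storage account):

```json
{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://mystorageaccount.dfs.core.windows.net"
        }
    }
}
```

Note the `type` is `AzureBlobFS` (the ADLS Gen2 connector) and the endpoint uses `dfs.core.windows.net` rather than the `blob.core.windows.net` endpoint the Blob Storage connector uses.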