How to find the transform matrix that maps a 3D model onto the object in a 2D image

Updated: 2022-10-27 20:47:31



Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3d object in the image?

I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3d mesh file (wavefront .obj)?

Here's an example of a 2D image that can be expected.

  1. FOV of camera

    Field of view of the camera is the absolute minimum to know to even start with this (how can you determine how to place an object when you have no idea how it would affect the scene). Basically you need a transform matrix that maps from world GCS (global coordinate system) to camera/screen space and back. If you have no clue what I am writing about, then perhaps you should not try any of this before you learn the math.

    For an unknown camera you can do some calibration based on markers or etalons (known size and shape) in the view. But it is much better to use real camera values (like FOV angles in the x,y directions, focal length, etc. ...)

    The goal of this is to create a function that maps world GCS(x,y,z) into screen LCS(x,y).

    For more info read:

  2. Silhouette matching

    In order to compare rendered and real image similarity you need some kind of measure. As you need to match geometry I think silhouette matching is the way (ignoring textures, shadows and stuff).

    So first you need to obtain the silhouettes. Use image segmentation for that and create a ROI mask of your object. For the rendered image this is easy, as you can render the object in a single color, without any lighting, directly into the ROI mask.

    Then you need to construct a function that computes the difference between silhouettes. You can use any kind of measure, but I think you should start with the non-overlapping pixel count (it is easy to compute).

    Basically you count pixels that are present only in one ROI (region of interest) mask.

  3. Estimate position

    As you have the mesh, you know its size, so place it in the GCS so that the rendered image has a bounding box very close to the real image's. If you do not have the FOV parameters, then you need to rescale and translate each rendered image so it matches the image's bounding box (and as a result you obtain only the orientation of the object, not its position, of course). Cameras have perspective, so the farther from the camera you place your object, the smaller it will be.

  4. Fit orientation

    Render a few fixed orientations covering all orientations with some step, e.g. 8^3 orientations. For each, compute the silhouette difference and choose the orientation with the smallest difference.

    Then fit the orientation angles around it to minimize the difference. If you do not know how optimization or fitting works, see this:

    Beware: too small an amount of initial orientations can cause false positives or missed solutions, and too high an amount will be slow.
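The silhouette measure (step 2) and the coarse orientation search (step 4) can be sketched as follows. This is a minimal illustration with silhouettes as flat binary masks; the rendering and segmentation steps are assumed to happen elsewhere, and the function names are my own:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Step 2: non-overlapping pixel count between two ROI masks of equal
// size, i.e. the number of pixels set in exactly one of the two masks.
std::size_t silhouette_diff(const std::vector<std::uint8_t>& a,
                            const std::vector<std::uint8_t>& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] != 0) != (b[i] != 0)) ++d;
    return d;
}

// Step 4: among masks rendered at trial orientations, return the index
// of the one whose silhouette is closest to the real image's mask.
std::size_t best_candidate(const std::vector<std::vector<std::uint8_t>>& rendered,
                           const std::vector<std::uint8_t>& real_mask) {
    std::size_t best = 0;
    std::size_t best_d = std::numeric_limits<std::size_t>::max();
    for (std::size_t i = 0; i < rendered.size(); ++i) {
        const std::size_t d = silhouette_diff(rendered[i], real_mask);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```

From the best candidate you would then refine the orientation angles with any local optimizer, as described in step 4.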

Now that was some basics in a nutshell. As your mesh is not very simple, you may need to tweak this, like using contours instead of silhouettes and using the distance between contours instead of the non-overlapping pixel count, which is really hard to compute otherwise ... You should start with simpler meshes like a die, a coin, etc., and when you grasp all of this, move on to more complex shapes ...

[Edit1] algebraic approach

If you know some points in the image that correspond to known 3D points (in your mesh), then together with the FOV of the camera used you can compute the transform matrix placing your object ...

If the transform matrix is M (OpenGL style):

M = xx,yx,zx,ox
    xy,yy,zy,oy
    xz,yz,zz,oz
     0, 0, 0, 1

Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:

(x',y',z') = M * (x,y,z)

The pixel position (x'',y'') is done by camera FOV perspective projection like this:

y''=FOVy*y'/(z'+focus) + ys2;
x''=FOVx*x'/(z'+focus) + xs2;

where camera is at (0,0,-focus), projection plane is at z=0 and viewing direction is +z so for any focal length focus and screen resolution (xs,ys):

xs2=xs*0.5; 
ys2=ys*0.5;
FOVx=xs2/focus;
FOVy=ys2/focus;
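Put together as code, the projection might look like the sketch below. Note that I write the perspective term as a division by (z'+focus) — the perspective divide — so that more distant points project closer to the screen center; the function signature and names are my own:

```cpp
// Pinhole projection: camera at (0,0,-focus), projection plane at z=0,
// viewing direction +z. Maps a world-space point (x,y,z) to pixel
// coordinates (px,py) on a screen of xs x ys pixels.
void project(double x, double y, double z, double focus,
             double xs, double ys, double& px, double& py) {
    const double xs2  = xs * 0.5;
    const double ys2  = ys * 0.5;
    const double FOVx = xs2 / focus;
    const double FOVy = ys2 / focus;
    px = FOVx * x / (z + focus) + xs2;  // perspective divide + shift to screen center
    py = FOVy * y / (z + focus) + ys2;
}
```

A point farther along +z lands closer to the screen center (xs2,ys2), which is the expected perspective behavior.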

When you put all this together you obtain:

xi'' = FOVx * ( xx*xi + yx*yi + zx*zi + ox ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + xs2
yi'' = FOVy * ( xy*xi + yy*yi + zy*zi + oy ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + ys2

where (xi,yi,zi) is the i-th known point's 3D position in mesh local coordinates and (xi'',yi'') is the corresponding known 2D pixel position. So the unknowns are the M values:

{ xx,xy,xz,yx,yy,yz,zx,zy,zz,ox,oy,oz }

So we get 2 equations per known point and 12 unknowns in total, so you need to know 6 points. Solve the system of equations and construct your matrix M.
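One way to make the system tractable with standard linear algebra: writing the projection with the perspective divide and cross-multiplying by the denominator makes each equation linear in the 12 unknowns. The sketch below (my own naming; unknowns ordered { xx,xy,xz,yx,yy,yz,zx,zy,zz,ox,oy,oz }) builds the two coefficient rows contributed by one correspondence; stacking the rows from 6 points gives a 12x12 linear system you can hand to any solver:

```cpp
#include <array>

// One linear equation a[0]*xx + ... + a[11]*oz = rhs in the unknowns
// ordered { xx,xy,xz, yx,yy,yz, zx,zy,zz, ox,oy,oz }.
struct Row { std::array<double, 12> a; double rhs; };

// For a correspondence (xi,yi,zi) -> (pxi,pyi), cross-multiplying
//   pxi - xs2 = FOVx * x' / (z' + focus)
// by (z' + focus) yields two equations linear in M's entries.
void correspondence_rows(double xi, double yi, double zi,
                         double pxi, double pyi,
                         double focus, double xs2, double ys2,
                         double FOVx, double FOVy,
                         Row& rx, Row& ry) {
    const double a = pxi - xs2;  // pixel x relative to screen center
    const double b = pyi - ys2;  // pixel y relative to screen center
    rx.a.fill(0.0);
    ry.a.fill(0.0);
    // x-equation: FOVx*(xx*xi+yx*yi+zx*zi+ox) - a*(xz*xi+yz*yi+zz*zi+oz) = a*focus
    rx.a[0] = FOVx * xi;  rx.a[3] = FOVx * yi;  rx.a[6] = FOVx * zi;  rx.a[9]  = FOVx;
    rx.a[2] = -a * xi;    rx.a[5] = -a * yi;    rx.a[8] = -a * zi;    rx.a[11] = -a;
    rx.rhs  = a * focus;
    // y-equation: FOVy*(xy*xi+yy*yi+zy*zi+oy) - b*(xz*xi+yz*yi+zz*zi+oz) = b*focus
    ry.a[1] = FOVy * xi;  ry.a[4] = FOVy * yi;  ry.a[7] = FOVy * zi;  ry.a[10] = FOVy;
    ry.a[2] = -b * xi;    ry.a[5] = -b * yi;    ry.a[8] = -b * zi;    ry.a[11] = -b;
    ry.rhs  = b * focus;
}
```

As a sanity check, plugging the entries of a known M into a row's dot product should reproduce that row's right-hand side.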

Also you can exploit the fact that M is a uniform orthogonal/orthonormal matrix, so the vectors

X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)

are perpendicular to each other, so:

(X.Y) = (Y.Z) = (Z.X) = 0.0

Introducing these into your system can lower the number of needed points. You can also exploit the cross product: if you know 2 of the vectors, the third can be computed

Z = (X x Y)*scale

So instead of 3 variables you need just a single scale (which is 1 for an orthonormal matrix). If I assume an orthonormal matrix, then:

|X| = |Y| = |Z| = 1

so we get 6 additional equations (3 dot products and 3 unit lengths) without any additional unknowns, so 3 points are indeed enough.
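As a quick sanity check of these constraints (plain helper functions, not from the original post): for an orthonormal right-handed basis, the dot products vanish, the lengths are 1, and the cross product recovers the third axis with scale 1.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Dot product, used for the perpendicularity constraints (X.Y)=(Y.Z)=(Z.X)=0
// and the unit-length constraints |X|=|Y|=|Z|=1 (via dot(V,V)=1).
double dot(const Vec3& a, const Vec3& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Cross product: for an orthonormal right-handed basis, Z = X x Y exactly
// (scale = 1), which removes 3 unknowns from the system.
Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a[1]*b[2] - a[2]*b[1],
             a[2]*b[0] - a[0]*b[2],
             a[0]*b[1] - a[1]*b[0] };
}
```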