COMPUTER VISION

Computer vision is a field of artificial intelligence that enables computers to "see" and interpret visual information from the world. It involves using algorithms to process and analyze digital images and videos, allowing machines to understand and extract meaningful data. This technology aims to replicate human visual capabilities, enabling tasks like object detection, image recognition, and scene understanding. Applications range from self-driving cars and facial recognition to medical imaging and industrial automation. Essentially, computer vision empowers machines to derive insights from visual inputs, much like humans do.
by Roderick Paulino

March 2, 2025

problem statement

Develop an algorithm to generate a synthetic dataset, analogous to MNIST, that represents printed text characters rendered in diverse fonts and augmented with pseudo-isometric transformations. The challenge lies in creating realistic pseudo-isometric views of printed text without introducing new font styles, thereby simulating variations in printed text orientation. The algorithm must effectively transform standard font renderings into credible pseudo-isometric perspectives, ensuring the resulting dataset maintains the integrity of character recognition (primarily for engineering-drawing OCR) while introducing realistic variations in visual orientation.

INTRODUCTION

OpenCV, a widely used computer vision library, provides powerful tools for image manipulation, including rotation. However, its getRotationMatrix2D function is inherently limited to rotations within the 2D plane, effectively handling rotations around the z-axis. This constraint becomes a significant hurdle when attempting to simulate true 3D rotations, such as those around the x or y axes, which are essential for creating realistic perspective transformations. To achieve such 3D rotations, more sophisticated transformation methods, like 3x3 or 4x4 perspective matrices, are required, surpassing the capabilities of the 2x3 affine transformations offered by getRotationMatrix2D. This limitation necessitates the development and implementation of alternative approaches for accurate 3D image rotations within OpenCV environments.
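For reference, the in-plane rotation that getRotationMatrix2D provides looks like this; a minimal example, where the file name is just a placeholder:

import cv2

img = cv2.imread("char.png")                 # placeholder: any rendered character image
h, w = img.shape[:2]

# 2x3 affine matrix: 30-degree rotation in the image plane (about the z-axis), scale 1.0
M = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))

Whatever angle is chosen, this only spins the image within its own plane; it cannot produce the foreshortening of a tilt around the x or y axis.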

SOLUTION

To achieve authentic 3D rotations of a 2D image around the x or y axes, a perspective transform, specifically a 3x3 homography, is essential. This approach involves first simulating the 3D rotation in space and then projecting the resulting image back onto a 2D plane. The key idea is to treat the source image as a flat plane in 3D space, defining its four corner points at a fixed z-coordinate, such as z=0. These 3D corner points are then subjected to a rotation around the desired axis (x or y). Subsequently, these rotated 3D points are projected back onto a 2D "camera" plane, yielding new 2D coordinates. Finally, OpenCV's cv2.getPerspectiveTransform(src_pts, dst_pts) is employed to calculate the 3x3 homography, which is then used in conjunction with cv2.warpPerspective() to transform the image accordingly. This method surpasses the limitations of 2D rotation functions by enabling rotations around the base axes of a 3D coordinate system, thereby producing a more realistic and visually accurate output.
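A minimal sketch of this pipeline is shown below. The function name rotate_image_3d and its parameters (focal_length, viewer_distance) are illustrative assumptions, not the exact code behind the results in this article:

import cv2
import numpy as np

def rotate_image_3d(img, rot_x_deg=0.0, rot_y_deg=0.0,
                    focal_length=500.0, viewer_distance=500.0):
    # Treat the image as a flat plane at z = 0, centred on the origin.
    h, w = img.shape[:2]
    src = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    corners = np.hstack([src - [w / 2, h / 2], np.zeros((4, 1), dtype=np.float32)])

    # Rotate the four corner points around the x and y axes.
    ax, ay = np.radians(rot_x_deg), np.radians(rot_y_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [ 0,          1, 0         ],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rotated = corners @ (Ry @ Rx).T

    # Project the rotated corners back onto the 2D "camera" plane (pinhole model).
    z = rotated[:, 2] + viewer_distance
    dst = np.empty((4, 2), dtype=np.float32)
    dst[:, 0] = focal_length * rotated[:, 0] / z + w / 2
    dst[:, 1] = focal_length * rotated[:, 1] / z + h / 2

    # Homography from the original corners to the projected ones, then warp.
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h))

Calling rotate_image_3d(img, rot_x_deg=30), for example, tilts the image around the horizontal axis and yields the kind of foreshortened, pseudo-isometric view described above.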

Adjusting the Look

Rotation angles

Increase or decrease the angle around 𝑥 (or 𝑦) to get our desired tilt.

Combining rotations

We can rotate around 𝑥 by some angle, then around 𝑦 by another angle to get a more complex orientation.

Focal length & viewer distance

Tweak these if the image looks too “stretched” or too “flat.” A larger focal length with a moderate viewer distance often looks more “telephoto,” while a smaller focal length with a moderate viewer distance looks more “wide‐angle.”

In contrast, the simpler 2D code snippet that uses cv2.getRotationMatrix2D(center, angle, scale) is effectively a rotation around the 𝑧-axis, but only in 2D image space. Internally, getRotationMatrix2D rotates the plane as though we were spinning it around an axis coming "straight out of" the image (which we can think of as the 𝑧-axis in 3D). It does not perform an actual 3D transformation of the scene, just a flat rotation in the 2D plane. In the true 3D rotation example (the code with the rotate_around_x and rotate_around_y functions), there is no explicit rotation around the 𝑧-axis; those functions only construct the rotation matrices for 𝑥-axis or 𝑦-axis rotations. If we wanted to rotate around the 𝑧-axis in that 3D approach, we would need a similar function for z.
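A sketch of that missing z-axis helper is shown here; the signature is an assumption, mirroring how rotate_around_x and rotate_around_y are described (an (N, 3) array of 3D points and an angle in degrees):

import numpy as np

def rotate_around_z(points, angle_deg):
    # Rotate (N, 3) points about the z-axis; the counterpart to rotate_around_x / rotate_around_y.
    a = np.radians(angle_deg)
    Rz = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0,          0,         1]])
    return points @ Rz.T

With all three rotations available, this is how the approach works: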

1. 3D Rotation:

We can rotate the image in 3D space around three axes:

Rotate around X-axis: rotate_around_x rotates the image around the horizontal (x) axis.

Rotate around Y-axis: rotate_around_y rotates around the vertical (y) axis.

Rotate around Z-axis: rotate_around_z rotates around the depth (z) axis.

These rotations can be combined in any order, allowing for complex 3D orientations.

2. 2D Projection:

The project_points function simulates a "camera" viewing the rotated 3D image (a sketch of this projection step is given after this walkthrough).

It uses a perspective model to project the 3D points onto a 2D plane.

The viewer_distance parameter controls how far the "camera" is from the image, affecting the perspective effect.

3. Perspective Transform Calculation:

OpenCV's cv2.getPerspectiveTransform function takes the original 2D corner points of the image and the newly projected 2D corner points (after 3D rotation and projection).

It calculates a 3x3 homography matrix (H), which describes the transformation needed to map the original image to the rotated and projected image.

4. Image Warping:

OpenCV's cv2.warpPerspective function applies the homography matrix (H) to the original image.

It re-maps each pixel of the original image to its corresponding location in the new, transformed image.

The output image size can be specified; just ensure it is large enough to contain the entire transformed image.

Controlling the 3D Effect:

By adjusting the rotation angles (rot_x, rot_y, rot_z in degrees), you can control the 3D orientation of the image.

The focal_length and viewer_distance parameters allow you to fine-tune the perspective look, creating different visual effects.
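Below is a minimal sketch of the projection step named in the walkthrough, together with example parameter values; the exact signature of project_points and the values shown are assumptions:

import numpy as np

def project_points(points_3d, focal_length=500.0, viewer_distance=500.0):
    # Pinhole projection: shift the rotated points in front of the "camera",
    # then divide by depth to land on the 2D image plane.
    z = points_3d[:, 2] + viewer_distance
    x = focal_length * points_3d[:, 0] / z
    y = focal_length * points_3d[:, 1] / z
    return np.stack([x, y], axis=1).astype(np.float32)

# Illustrative settings:
#   rot_x, rot_y, rot_z = 25, -15, 0            (degrees)
#   focal_length, viewer_distance = 800, 900    -> flatter, more "telephoto" look
#   focal_length, viewer_distance = 300, 350    -> stronger, more "wide-angle" look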

But from a pure matrix-math perspective, the focal-length-plus-viewer-distance approach is simply the more conventional way to generate the perspective projection matrix; bottom line, they are just different parameterizations of the same perspective geometry. focal_length and viewer_distance happen to be straightforward in a simple pinhole model and align nicely with OpenCV's typical conventions, but under the hood we just convert whichever parameters we use into equivalent values of focal length and distance for the projection equations.

Because this is purely a mathematical model, we can tweak how large or small object_distance is to see more or less perspective distortion. Larger object_distance → milder perspective (like a telephoto lens). Smaller object_distance → stronger perspective (like a wide-angle lens).
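As a quick numerical check of that statement (illustrative numbers only), project two points that share the same lateral offset but sit at different depths and compare how much the nearer one is magnified:

def project_x(x, z, focal_length, object_distance):
    # Same pinhole equation as above: x' = f * x / (object_distance + z)
    return focal_length * x / (object_distance + z)

f = 300.0
print(project_x(100.0, 0.0, f, 200.0) / project_x(100.0, 50.0, f, 200.0))    # 1.25 -> strong perspective
print(project_x(100.0, 0.0, f, 1000.0) / project_x(100.0, 50.0, f, 1000.0))  # ~1.05 -> much milder perspective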

We further enhanced it by following this method:

Centering the text is also crucial in the transformation; a minimal sketch of this centering step is shown below.
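This sketch assumes the rendered text crop is a 3-channel image smaller than the target canvas; the helper name and canvas size are illustrative:

import numpy as np

def center_on_canvas(text_img, canvas_size=256):
    # Place the rendered text in the middle of a square white canvas so the warped
    # result stays inside the output frame after the perspective transform.
    h, w = text_img.shape[:2]
    canvas = np.full((canvas_size, canvas_size, 3), 255, dtype=np.uint8)
    y0 = (canvas_size - h) // 2
    x0 = (canvas_size - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = text_img
    return canvas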

output:

[Figures: Original, TOP, RIGHT ISOMETRIC, and LEFT ISOMETRIC views of a Piston-Operated P&ID Symbol and of an Actual Valve, each shown alongside its reoriented result.]

[Figures: ORB and SIFT feature-detector results.]

USING FEATURE DETECTORS IN THE MOST SKEWED VIEW

We propose leveraging a single, augmented image to generate a diverse synthetic dataset, effectively replacing the need for physical repositioning or complex 3D modeling. By systematically varying background, lighting, and viewing angles, and introducing diverse line styles, we create a rich collection of images that significantly enhance model robustness. This approach offers a practical alternative to traditional data acquisition methods, enabling the generation of supplementary training data that rivals or even surpasses the variability achieved through rendering CAD models or capturing real-world images from multiple perspectives.
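A sketch of that augmentation loop, reusing the rotate_image_3d sketch from the SOLUTION section above; the angle grid, brightness gains, and the ORB/SIFT check at the end are illustrative assumptions (SIFT ships with recent OpenCV builds):

import cv2

# img: the single source image described above
variants = []
for rx in (-35, 0, 35):
    for ry in (-35, 0, 35):
        warped = rotate_image_3d(img, rot_x_deg=rx, rot_y_deg=ry)
        for gain in (0.8, 1.0, 1.2):                      # simple lighting variation
            variants.append(cv2.convertScaleAbs(warped, alpha=gain, beta=0))

# Sanity check on the most skewed view: do ORB and SIFT still find keypoints?
most_skewed = rotate_image_3d(img, rot_x_deg=35, rot_y_deg=35)
orb_kp = cv2.ORB_create().detect(most_skewed, None)
sift_kp = cv2.SIFT_create().detect(most_skewed, None)
print(len(orb_kp), len(sift_kp))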

conclusion:

The current limitation of OpenCV's perspective transformation, when applied to 2D images for simulating 3D rotations, stems from its reliance on a pinhole camera model. This model, while effective for many applications, primarily calculates perspective based on the object's distance from the "camera" (object distance) and the "camera's" field of view (focal length). Consequently, it approximates the object's dimensions by projecting points based solely on these parameters. This approach overlooks a crucial aspect of human visual perception and realistic 3D rendering: the observer's viewpoint relative to vanishing points. In true perspective, the perceived size and shape of an object are determined not just by its distance and focal length, but also by the convergence of parallel lines towards vanishing points as the object recedes into the distance. By neglecting the observer's viewpoint and the resulting vanishing point dynamics, OpenCV's current pinhole-based method can produce less visually accurate 3D rotations, especially when dealing with complex scenes or large rotation angles. Essentially, it simplifies the complex interplay of perspective cues into a distance-and-focal-length equation, leading to potential inaccuracies in simulating realistic 3D transformations within 2D images.

The challenge of accurately recognizing text within engineering drawings is significant due to the inherent variations in text orientation, perspective, and style. By augmenting training datasets with synthetically generated images that simulate these variations, particularly those produced through pseudo-isometric transformations, we can significantly improve the robustness and accuracy of text prediction models. Specifically, using regenerative architectures like Generative Adversarial Networks (GANs), transformers, or diffusion models for training allows the model to learn the underlying distribution of text appearances within engineering drawings. When exposed to a diverse dataset that includes rotated and perspective-shifted text, these models can better generalize and extract meaningful features, even in the presence of noise and distortions. For instance, GANs can learn to generate realistic variations of text, while transformers can capture long-range dependencies and contextual information, and diffusion models can learn to reverse the process of adding noise to create realistic images. This enhanced training regimen enables the model to effectively "regenerate" or reconstruct the original text from distorted or rotated inputs, drastically increasing the reliability of text prediction in complex engineering drawing scenarios.