This post compares diagrams generated by StabilityAI's image models, analyzing layout, visual distinction, text quality, and accuracy. It highlights each model's strengths and weaknesses, emphasizing the importance of model selection for technical visualizations.
Stability has three classes of text-to-image models, each with different price points and image capabilities. Let's explore the differences between them so we can better understand when to use one over the others.
I want to learn about the viability of these models to help generate graphics for content people would likely create on the platform. The goal will be to see how well the models can create technical diagrams. I expect this to be outside the comfort zone for these models, as they typically excel at characters and natural/physical scenery. Generating text in these images is notoriously difficult as well.
We'll take a single detailed prompt and explore how well each model does with it. Some aspects we can judge objectively; for the rest, I'll give my opinion.
I used ChatGPT as a preprocessing step to create a detailed prompt:
A split-screen image comparing a relational database and a graph database. On the left side, show a relational database with tables and rows, connected by lines representing foreign keys and relationships. On the right side, show a graph database with nodes and edges, illustrating relationships as direct connections between data points. Use clear labels for each side: 'Relational Database' and 'Graph Database'. The relational database side should have tables with headers like 'Users', 'Orders', and 'Products', while the graph database side should show interconnected nodes labeled 'User', 'Order', and 'Product'. Use a clean and modern design with distinct colors for clarity.
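The preprocessing step can be sketched roughly as follows. This assumes the OpenAI Python SDK; the model name, system instruction, and `build_prompt_request` helper are all illustrative, not the exact setup used for this post.

```python
import os

def build_prompt_request(rough_idea: str) -> list:
    """Build chat messages that ask an LLM to expand a rough idea
    into a detailed image-generation prompt."""
    return [
        {"role": "system",
         "content": "You write detailed, specific prompts for text-to-image models."},
        {"role": "user",
         "content": f"Write a detailed image-generation prompt for: {rough_idea}"},
    ]

# Only call the API when a key is configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # third-party SDK
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=build_prompt_request(
            "a split-screen diagram comparing a relational database "
            "and a graph database"
        ),
    )
    print(resp.choices[0].message.content)
```

Starting from a short idea and letting the LLM fill in layout, labels, and styling details tends to produce prompts the image models can act on more reliably.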
Our most advanced text to image generation service, Stable Image Ultra creates the highest quality images with unprecedented prompt understanding. Ultra excels in typography, complex compositions, dynamic lighting, vibrant hues, and overall cohesion and structure of an art piece. Made from the most advanced models, including Stable Diffusion 3, Ultra offers the best of the Stable Diffusion ecosystem.
Resolution: 1 megapixel, 1024x1024
8 credits per successful generation
The Ultra model is Stability's flagship model. It's about twice as expensive to use as the other models.
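Generating an image looks roughly like this. The sketch is based on my understanding of Stability's v2beta REST API (Ultra and Core share the same request shape, just different endpoints); the `build_request` helper and field names should be checked against the current API docs.

```python
import os

STABILITY_BASE = "https://api.stability.ai/v2beta/stable-image/generate"

def build_request(model_class: str, prompt: str, api_key: str) -> dict:
    """Assemble one Stable Image request; model_class is 'ultra' or 'core'."""
    return {
        "url": f"{STABILITY_BASE}/{model_class}",
        "headers": {"authorization": f"Bearer {api_key}", "accept": "image/*"},
        "files": {"none": ""},  # forces multipart/form-data encoding
        "data": {"prompt": prompt, "output_format": "png"},
    }

def generate(model_class: str, prompt: str, api_key: str) -> bytes:
    import requests  # third-party; deferred so the sketch imports without it
    req = build_request(model_class, prompt, api_key)
    resp = requests.post(req["url"], headers=req["headers"],
                         files=req["files"], data=req["data"])
    resp.raise_for_status()  # non-2xx means the generation failed
    return resp.content      # raw image bytes

if os.environ.get("STABILITY_API_KEY"):
    png = generate("ultra",
                   "A split-screen image comparing a relational database "
                   "and a graph database",
                   os.environ["STABILITY_API_KEY"])
    with open("ultra.png", "wb") as f:
        f.write(png)
```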
It handled the structuring pretty well. I think both database types are visually represented well; however, when it comes to text and labeling, it unsurprisingly struggles.
The labels above each visual are nonsense, but the ones below are pretty decent. "graph database" came out well while "relational dararbase" has some issues. These could be cleaned up relatively easily with some post-processing.
The default aspect ratio of these images definitely feels like a limiting factor here, with the square aspect causing both halves to feel rather cramped. As we'll see in the other images, not all of the models chose to go this route. I would say it followed the directions the best though as the prompt specified what to include on the left and right sides.
Our primary service for text-to-image generation, Stable Image Core represents the best quality achievable at high speed. No prompt engineering is required! Try asking for a style, a scene, or a character, and see what you get.
Resolution: 1.5 megapixels, 1536x1536
3 credits per successful generation
The Core model seems to have taken a different approach compared to the Ultra model. Instead of a left-right split, it opted for a top-bottom division of the database types. This choice doesn't align with the original prompt's specifications; however, I think it makes better use of the square aspect ratio.
The visual representation of the databases lacks distinction between relational and graph structures. This is problematic, as it fails to capture the fundamental differences between these database types.
Text rendering is also notably worse in this image. There's a lot of fine text, but none of it is readable. This is a step back from the Ultra model, where at least some labels were readable.
Spelling errors seem more prevalent in this version compared to the Ultra model. This further detracts from the image's quality and usefulness as a visual aid for understanding database types.
Resolution: 1 megapixel, 1024x1024
3.5 credits for Medium, 6.5 for Large, 4 for Large Turbo
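Unlike Ultra and Core, which each have their own endpoint, the SD3 variant is selected with a `model` form field, as sketched below. The variant names (`sd3-medium`, etc.) are my best understanding of the API and worth verifying; the credit costs come from the figures quoted above.

```python
import os

# Credit costs per successful generation, as quoted above.
SD3_CREDITS = {"sd3-medium": 3.5, "sd3-large": 6.5, "sd3-large-turbo": 4.0}

SD3_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"

def sd3_form_data(prompt: str, model: str) -> dict:
    """Form fields for the SD3 endpoint, where the variant is a parameter."""
    if model not in SD3_CREDITS:
        raise ValueError(f"unknown SD3 variant: {model}")
    return {"prompt": prompt, "model": model, "output_format": "png"}

if os.environ.get("STABILITY_API_KEY"):
    import requests  # third-party; only needed when actually calling the API
    resp = requests.post(
        SD3_URL,
        headers={"authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
                 "accept": "image/*"},
        files={"none": ""},  # forces multipart/form-data encoding
        data=sd3_form_data("a split-screen database comparison",
                           "sd3-large-turbo"),
    )
    resp.raise_for_status()
    with open("sd3.png", "wb") as f:
        f.write(resp.content)
```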
The SD3 model successfully created a side-by-side layout, adhering well to the original prompt's specification. Interestingly, it didn't use the full space available, opting for a more rectangular aspect ratio with gray bars filling the rest of the space.
Like the Ultra image, one of the standout features is the clear visual distinction between the relational and graph databases. The model has effectively captured the fundamental structural differences between these database types.
Text labeling in this image is of good quality. The labels are clear and legible, which is a significant improvement over the previous image. However, it's worth noting that there are still some spelling issues and artifacting present, similar to the Ultra model.
The relational database visualization is particularly impressive in this version. It correctly includes labeled tables as specified in the prompt, making it the best representation of a relational database among all the versions. This attention to detail enhances the educational value of the image.
On the other hand, the graph database representation, while visually distinct, has some issues. The connections between nodes have become overly complex and unrealistic. They look more like tangled hair than the clean edges typically seen in a graph database diagram.
The Ultra, Core, and SD3 models each approached the task differently, with varying degrees of success.
Layout and adherence to prompts: Both Ultra and SD3 models closely followed the prompt, creating side-by-side comparisons of relational and graph databases. In contrast, the Core model deviated by using a less effective top-bottom split. SD3 emerged as the top performer, particularly excelling in its depiction of relational databases.
Visual distinction and accuracy: Ultra and SD3 successfully created clear visual distinctions between database types, enhancing the educational value of their images. The Core model, however, failed to differentiate between the two, significantly reducing its usefulness. SD3 stood out with its accurate representation of relational database structures, though all models faced challenges in depicting graph databases realistically.
Text and labeling quality: Text rendering varied dramatically across the models. While Ultra produced some readable labels with minor issues, the Core model's text was completely illegible. SD3 outperformed in this aspect, providing the clearest and most legible labels. However, all models exhibited some level of spelling errors and artifacting in text elements.
Space utilization and layout effectiveness: The Ultra model's square format resulted in cramped visuals, while the Core model's top-bottom split was unexpected and a clear violation of the prompt. SD3's side-by-side layout proved most effective for easy comparison of the database types although it didn't use all of the space available.
Specific database representations: In depicting relational databases, SD3 excelled by correctly including labeled tables as specified. For graph databases, all models struggled to some degree, with SD3's representation being distinct but overly complex, resembling "hair" more than a realistic graph structure.
This comparison underscores the variability in AI image generation capabilities and highlights the importance of selecting the appropriate model for specific visualization tasks, especially for technical content requiring accuracy and clarity.
The results suggest that while AI-generated technical diagrams can be effective, they may still require some level of post-processing to correct errors, particularly in text elements, backgrounds, and clarity.