Microsoft has launched MAI-Image-2, its most capable text-to-image model to date. The model immediately claimed the #3 spot on the Arena.ai text-to-image leaderboard, placing directly behind Google's Gemini 3.1 Flash and OpenAI's GPT Image 1.5.
The diffusion-based model works by progressively transforming random noise into a coherent image aligned with the text prompt. What sets it apart is photorealism: MAI-Image-2 produces natural lighting, accurate skin tones, and environments that feel lived-in.
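The noise-to-image process described above can be sketched as a reverse-diffusion sampling loop. This is a toy illustration, not Microsoft's implementation: the `denoise_step` stand-in here simply nudges the sample toward a fixed target, whereas a real diffusion model uses a large neural network conditioned on the text prompt to predict and remove noise at each step.

```python
import numpy as np

def toy_reverse_diffusion(denoise_step, shape=(8, 8), steps=50, seed=0):
    """Illustrative DDPM-style sampling: start from pure Gaussian noise
    and repeatedly apply a denoising step until an image emerges."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)   # pure noise at the start
    for t in reversed(range(steps)):
        x = denoise_step(x, t)       # remove a little noise each step
    return x

# Hypothetical stand-in "denoiser": pulls the sample toward a fixed
# target, mimicking how a trained network steers noise toward
# prompt-consistent pixels.
target = np.ones((8, 8))             # placeholder "image"
def denoise_step(x, t):
    return x + 0.1 * (target - x)

img = toy_reverse_diffusion(denoise_step)
```

After 50 steps the residual noise shrinks geometrically (by a factor of 0.9 per step), so the output lies very close to the target, which is the essential intuition behind diffusion sampling.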
Public comparison figures show an overall Elo increase of approximately 97 points over MAI-Image-1, with particularly notable gains in portrait generation, product and branding work, and text rendering within images.
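To put the quoted gain in perspective, the standard logistic Elo formula converts a rating gap into an expected head-to-head win rate; a roughly 97-point lead implies MAI-Image-2 would be preferred over MAI-Image-1 in about 64% of pairwise comparisons.

```python
def elo_win_prob(delta: float) -> float:
    """Expected win probability for the higher-rated side,
    given an Elo rating gap `delta` (standard logistic formula)."""
    return 1.0 / (1.0 + 10 ** (-delta / 400))

print(round(elo_win_prob(97), 3))  # ≈ 0.636
```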
For enterprise users, the model excels at consistently producing infographics, slides, diagrams, and branded materials, with little gap between creative direction and final output.
MAI-Image-2 is rolling out across Microsoft's ecosystem: it is available in the MAI Playground for experimentation, and deployment is beginning on Copilot and Bing Image Creator. API access is available today for select Microsoft customers, with broader availability coming soon.
The launch signals Microsoft's commitment to building its own foundation models rather than relying solely on OpenAI's technology for image generation capabilities.