I’ve been thinking a lot about what artificial intelligence models actually understand versus what they just pattern-match. Google’s latest research on MapTrace hits this issue squarely. Despite all the hype around multimodal large language models being able to “see” and “understand” images, most of them can’t trace a simple path on a map without walking straight through walls.
Think about that for a second. These models can identify animals in photos, generate photorealistic images, and explain complex diagrams. But ask them to draw a route from point A to point B on a shopping mall map, and they’ll confidently route you through the food court kitchen.
The problem isn’t that the models are stupid. It’s that they’ve never really been taught spatial reasoning as an explicit skill. They’ve seen millions of images with the word “path” associated with sidewalks and trails, but they’ve never learned the underlying rules: paths connect, walls block movement, routes are ordered sequences of points that respect physical constraints.
The Data Problem Nobody Talks About
Here’s where it gets interesting. The most obvious solution would be to collect a massive dataset of maps with hand-traced paths. But if you’ve ever tried to annotate anything with pixel-level accuracy, you know this is a nightmare. One path might take 10-15 minutes to trace properly. Scale that to the millions of examples needed for training a large model and the arithmetic gets absurd: at 12 minutes per path, a million paths alone works out to 200,000 annotation hours, roughly a hundred person-years of full-time work.
Google’s research team, led by Artemis Panagopoulou and Mohit Goyal, took a different approach. Instead of collecting real-world data, they built a fully automated pipeline that uses AI models to generate synthetic training data. The clever bit is using AI as both creator and critic throughout the process.
The pipeline starts with an LLM generating diverse map prompts. Not just “a shopping mall” but rich descriptions like “a fantasy theme park with winding paths through different themed lands” or “a zoo with interconnected habitats.” These prompts feed into a text-to-image model that renders actual map images.
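To make that first stage concrete, here’s a minimal sketch of the kind of prompt diversification it implies. The venue, feature, and style lists are my own illustrative placeholders, not the paper’s prompts, and the actual pipeline has an LLM write the descriptions rather than filling a template.

```python
import random

# Hypothetical building blocks; the real pipeline has an LLM write the
# prompts, but a template makes the "diverse map prompts" idea concrete.
VENUES = ["shopping mall", "fantasy theme park", "zoo", "university campus"]
FEATURES = ["winding paths through themed lands", "interconnected habitats",
            "a central plaza with radiating corridors"]
STYLES = ["clean top-down illustration", "hand-drawn visitor map"]

def make_map_prompt(rng: random.Random) -> str:
    """Compose one text-to-image prompt describing a navigable map."""
    return (f"A {rng.choice(STYLES)} of a {rng.choice(VENUES)} "
            f"with {rng.choice(FEATURES)} and clearly marked walkways.")

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_map_prompt(rng))
```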
But generating pretty maps isn’t enough. The system needs to identify which parts are actually walkable.
Four Stages of Synthetic Map Generation
The pipeline uses color clustering to create candidate path masks, essentially black-and-white versions showing potential walkways. Then comes the first quality gate: another MLLM acts as a “Mask Critic” that examines each candidate and judges whether it represents a realistic, connected path network. Only the high-quality masks move forward.
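Here’s a minimal sketch of what that clustering step could look like, assuming the rendered map is available as an RGB array. The paper doesn’t spell out its exact clustering setup, so the k-means choice and cluster count are illustrative; the Mask Critic itself is an MLLM call and is left out.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_path_masks(image: np.ndarray, n_clusters: int = 6) -> list[np.ndarray]:
    """Cluster pixel colors and return one binary mask per color cluster.

    image: H x W x 3 RGB array of the rendered map. Each returned mask marks
    the pixels assigned to one color cluster; a few will correspond to
    walkways and most won't, which is what the Mask Critic is there
    to sort out.
    """
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    return [(labels == k).reshape(h, w) for k in range(n_clusters)]
```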
Once you have clean masks of traversable areas, the system converts them into graph format. Intersections become nodes, paths become edges. This is the kind of structured representation that makes computational pathfinding possible.
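A rough sketch of that conversion, assuming a binary mask as input: skeletonize the mask and keep a pixel-level graph. This is a simplification of the intersections-as-nodes, paths-as-edges representation described above, but it already gives you structure that graph algorithms can run on.

```python
import networkx as nx
import numpy as np
from skimage.morphology import skeletonize

def mask_to_graph(mask: np.ndarray) -> nx.Graph:
    """Turn a binary walkway mask into a graph suitable for pathfinding.

    Every skeleton pixel becomes a node, and 8-connected skeleton neighbours
    are joined by edges weighted with Euclidean distance, so shortest-path
    search follows the walkway geometry.
    """
    skeleton = skeletonize(mask.astype(bool))
    graph = nx.Graph()
    pixels = set(zip(*np.nonzero(skeleton)))
    for y, x in pixels:
        graph.add_node((y, x))
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                neighbour = (y + dy, x + dx)
                if (dy, dx) != (0, 0) and neighbour in pixels:
                    graph.add_edge((y, x), neighbour, weight=(dy * dy + dx * dx) ** 0.5)
    return graph
```

Collapsing that pixel graph down to intersection nodes with path-segment edges is a further contraction step, but the pixel-level version is enough to make the next stage concrete.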
The final stage generates thousands of random start and end points for each map. Dijkstra’s algorithm calculates the shortest path between these points, then a “Path Critic” MLLM does a final quality check. Does the route look logical? Does it stay within the lines? Would a human actually take this path?
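Continuing the sketch, sampling endpoint pairs and running Dijkstra over that graph might look like this; the pair count and seed are assumptions, and the Path Critic, being an MLLM judgment, is again left out.

```python
import random
import networkx as nx

def sample_reference_paths(graph: nx.Graph, n_pairs: int = 1000, seed: int = 0) -> list:
    """Sample random start/goal nodes and keep their Dijkstra shortest paths.

    Unreachable pairs are simply skipped; in the real pipeline the surviving
    routes would still face the Path Critic before becoming training labels.
    """
    rng = random.Random(seed)
    nodes = list(graph.nodes)
    paths = []
    for _ in range(n_pairs):
        start, goal = rng.sample(nodes, 2)
        try:
            paths.append(nx.dijkstra_path(graph, start, goal, weight="weight"))
        except nx.NetworkXNoPath:
            continue
    return paths
```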
This pipeline produced 2 million annotated map images with valid paths. The researchers acknowledge that generated images sometimes have typographic errors, but they’re focused on path quality. Fair enough. Text rendering in synthetic images is a known issue that’ll improve as generation models get better.
Does Synthetic Data Actually Work?
The real test was fine-tuning models on this synthetic data and evaluating them on MapBench, a benchmark of real-world maps the models had never seen. The team fine-tuned several models including Gemma 3 27B and Gemini 2.5 Flash on just 23,000 paths from their dataset.
The results are compelling. Fine-tuned Gemini 2.5 Flash saw its normalized dynamic time warping (NDTW) score drop from 1.29 to 0.87. Lower is better here, since NDTW measures the distance between the predicted path and the reference path. But the more interesting metric is success rate: the percentage of times the model produces a valid, parsable path at all.
Gemma’s success rate jumped 6.4 percentage points, and its NDTW improved from 1.29 to 1.13. These aren’t just incremental gains. The models went from frequently failing completely to reliably producing valid paths. That’s the difference between a research curiosity and something potentially useful.
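For intuition about what those NDTW numbers measure, here’s a small sketch of dynamic time warping between a predicted path and a reference path, normalized by the reference path’s length. That normalization is one common choice; the paper’s exact formulation may differ, but the intuition is the same.

```python
import numpy as np

def ndtw(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dynamic time warping cost between two paths (N x 2 and M x 2 arrays
    of points), normalized by the reference path's length. Smaller values
    mean the predicted path hugs the reference more closely."""
    n, m = len(pred), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(pred[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    ref_length = np.linalg.norm(np.diff(ref, axis=0), axis=1).sum()
    return float(cost[n, m] / max(ref_length, 1e-9))
```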
The researchers manually reviewed the critics’ decisions to validate their pipeline. The Path Critic achieved 76% accuracy with an 8% false positive rate. The Mask Critic hit 83% accuracy with a 9% false positive rate. These aren’t perfect, but they’re good enough for generating training data at scale. The errors mostly stem from color similarity issues and missing thin paths within larger open regions.
What This Actually Means
I find the core insight here more interesting than the specific results. Fine-grained spatial reasoning isn’t some emergent property of scale. It’s a learnable skill that requires explicit supervision. You can teach it with synthetic data if you design the data generation process carefully enough.
This has implications beyond map navigation. The same principles could apply to circuit board routing, warehouse logistics, protein folding pathways, or any domain where understanding connectivity and spatial constraints matters. The technique of using AI models as both generators and quality critics could generalize to other structured prediction tasks.
Google open-sourced 2 million question-answer pairs generated with this pipeline using Gemini 2.5 Pro and Imagen-4. That’s actually significant. Most industrial AI research stays locked behind closed doors. Having this data available means researchers outside Google can explore spatial reasoning without rebuilding the entire generation pipeline.
The paper doesn’t oversell what they’ve built. They’re honest about the limitations: text rendering artifacts, critic accuracy in the 76-83% range, focus on path quality over photorealism. This is a research contribution, not a product launch.
But it demonstrates something I think we’ll see more of: using capable models to generate training data for teaching specific skills to other models. As base models get better at generation and evaluation, the bottleneck shifts from “can we collect enough data” to “can we design good data generation processes.”
The real question isn’t whether models can learn to trace paths on maps; it’s what other spatial reasoning tasks remain inaccessible because we haven’t figured out how to generate the right training data yet.