Meta's Muse Spark: A Tooled-Up Return to Frontier Models : mgks.dev

Meta just dropped Muse Spark, and it’s their first model release since Llama 4 about a year ago. This time though, they’re playing a different game. It’s hosted, not open weights, and the API is locked behind a “private preview to select users” wall. You can try it on meta.ai if you’re willing to log in with Facebook or Instagram credentials.

The benchmarks put it alongside Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected tests, though it falls behind on Terminal-Bench 2.0. Meta admits they’re “continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows”. So they know where they’re weak.

What caught my attention wasn’t the model itself but the tooling harness they built around it. The meta.ai interface exposes two modes: “Instant” and “Thinking”, with a promised “Contemplating” mode coming later that should compete with Gemini Deep Think or GPT-5.4 Pro. I ran my pelican test against both modes since there’s no API access yet.

Both versions generated SVG pelicans, but the implementation differed wildly. Instant spit out raw SVG with code comments. Thinking wrapped everything in an HTML shell with unused Playables SDK v1.0.0 JavaScript libraries. This tells me the modes aren’t just inference-time compute differences but actually distinct model behaviors or system prompts.

The Tool Collection Nobody Expected

Meta’s chat harness has tools wired up, at least for rendering SVG and HTML as embedded frames like Claude Artifacts. I asked for the exact tool names and descriptions, and it just gave them to me. No jailbreak needed. Sixteen different tools with full parameter schemas.

The browser tools are straightforward. browser.search runs web searches through an undisclosed engine, browser.open loads full pages, and browser.find does pattern matching on content. Standard web browsing capabilities.

Then it gets interesting. meta_1p.content_search does “Semantic search across Instagram, Threads, and Facebook posts” but only for posts you have access to and only since January 1st, 2025. The parameters include author_ids, key_celebrities, commented_by_user_ids, and liked_by_user_ids. This is Meta’s entire social graph exposed as a tool for artificial intelligence queries.

There’s also meta_1p.meta_catalog_search for product searches, which I assume powers the Shopping mode. And media.image_gen for image generation with “artistic” and “realistic” modes plus square, vertical, or landscape outputs. The generated images save to a sandbox, which becomes relevant in a moment.

container.python_execution is the Code Interpreter equivalent. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV. The usual data science stack. Python 3.9 is EOL now, and when I checked versions I got Python 3.9.25 with SQLite 3.34.1 from January 2021. Not cutting edge but functional.

Visual Grounding That Actually Works

The most interesting tool is container.visual_grounding. The description says it “analyzes the image, identifies and labels objects, locates regions, or counts objects.” It takes an image path, optional object names, and returns results in bbox, point, or count format.

I had it generate an image of a raccoon wearing trash as a hat. Classic AI image generation test. Then I asked it to use OpenCV to analyze the image. It ran Python code against the generated image without showing me the code, which is annoying, but produced an edge-detected version of the raccoon.

Then the visual grounding test. I asked it to identify objects using the tool. It pinpointed every piece of the outfit with pixel-level precision: the raccoon body, face, eyes, nose, coffee cup crown, banana peel brim, newspaper pieces. All returned as coordinate points overlaid on the image.

I tried bounding box mode next. The boxes nested properly: face inside raccoon, eyes inside face, hat pieces overlapping but staying above the face box. This is real object localization, not center-point guessing.

When I asked about masks, it said visual_grounding doesn’t output pixel-level segmentation masks, only point, bbox, and count formats. But it offered to fake masks using OpenCV in the container. I had it dump the raw JSON for all three modes instead.

Count mode is particularly wild. It counted individual whiskers on the raccoon. The JSON showed 18 whiskers with confidence scores and coordinates for each one. This level of detail in a default tool is unexpected.

Meta’s Jack Wu confirmed these tools are part of the new harness launched with the model, and “bigger models are already in development with infrastructure scaling to match.” He mentioned plans to open source future versions, which would be a return to form after this closed release.

The Efficiency Claim

Meta’s intro post claims they “can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick.” If true, this is significant for deployment costs and inference speed. Artificial Analysis scored Muse Spark at 52, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Last year’s Llama 4 Maverick scored 18.

The other tools round out the harness: container.create_web_artifact for HTML+JavaScript interactives served as secure sandboxed iframes, container.download_meta_1p_media for pulling Instagram/Facebook/Threads content into the sandbox for processing, container.file_search for digging through uploaded PDFs, and file editing tools (container.view, container.insert, container.str_replace) that mirror Claude’s text editor pattern.

There’s subagents.spawn_agent for the sub-agent delegation pattern, and third_party.link_third_party_account for connecting Google Calendar, Outlook Calendar, Gmail, or Outlook accounts. The harness is comprehensive but locked to meta.ai for now.

I’m waiting for API access because the real test isn’t what Meta built on top of their model, it’s what developers can build. The tool collection is impressive but also creates lock-in to Meta’s ecosystem. If this stays closed and the API stays restricted, it’s just another hosted LLM with fancy demos. If they follow through on open sourcing future versions with this level of tooling capability baked in, that changes the landscape considerably.

Meta's Muse Spark: A Tooled-Up Return to Frontier Models

The Tool Collection Nobody Expected

Visual Grounding That Actually Works

The Efficiency Claim

Read Next

Teaching AI to Read Maps: Why Your GPS Assistant Still Gets Lost

Amazon GameLift Streams Gets Live Session Admin Shell