Meta dropped Muse Spark today, their first model release since Llama 4 about a year ago. This one’s different though. It’s hosted, not open weights, and the API is locked behind a “private preview to select users” gate. You can play with it on meta.ai if you’re willing to log in with Facebook or Instagram.
The benchmarks look competitive with the current frontier models like Opus 4.6, Gemini 3.1 Pro, and GPT-5.4. Meta’s honest about the gaps too: they explicitly call out underperformance on Terminal-Bench 2.0 and say they’re still working on “long-horizon agentic systems and coding workflows”. I appreciate that level of directness instead of the usual benchmark cherry-picking.
What caught my attention isn’t the model itself though. It’s the tooling Meta has built around it. The meta.ai interface exposes two modes right now: “Instant” and “Thinking”. They’re promising a “Contemplating” mode later that should compete with Gemini Deep Think or GPT-5.4 Pro for extended reasoning tasks.
Getting the Tool Definitions
I usually prefer running tests via API to avoid invisible system prompts, but since that wasn’t an option I went straight to the chat UI. I ran my standard pelican test on both modes. The Instant model spat out an SVG with code comments, while the Thinking model wrapped it in an HTML shell with some unused Playables SDK v1.0.0 JavaScript libraries. Both rendered inline, Claude Artifacts style.
This meant tools were definitely wired up. So I asked for them.
“I want the exact tool names, parameter names and tool descriptions, in the original format”
It gave me 16 tools. Full descriptions, parameter schemas, the works. Credit to Meta for not training their bot to hide this information. It makes exploration so much less frustrating when you don’t have to mess around with jailbreaks just to understand what you’re working with.
The Tool Collection
The browser tools are standard stuff at this point. browser.search runs web searches through an undisclosed engine, browser.open loads full pages, and browser.find does pattern matching on page content. Every AI assistant has these now.
More interesting is meta_1p.content_search. It does semantic search across Instagram, Threads, and Facebook posts, but only for content you have access to and only since January 1st, 2025. The parameters are surprisingly powerful: author_ids, key_celebrities, commented_by_user_ids, liked_by_user_ids. That’s a lot of signal for building context.
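To make that concrete, here’s roughly what a call looks like based on the dumped schema. The parameter names are straight from the definition; the overall shape, the query field, and the values are my own reconstruction, not Meta’s actual wire format:

```python
# Hypothetical reconstruction of a meta_1p.content_search call.
# The parameter names come from the dumped definition; the "query"
# field and the values are my guesses, not Meta's schema.
content_search_call = {
    "tool": "meta_1p.content_search",
    "arguments": {
        "query": "posts about the marathon",   # assumed free-text field
        "author_ids": ["1234567890"],          # restrict to specific authors
        "key_celebrities": ["zuck"],           # public figures mentioned
        "commented_by_user_ids": [],           # posts a given user commented on
        "liked_by_user_ids": [],               # posts a given user liked
    },
}
```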
There’s also meta_1p.meta_catalog_search for product searches, presumably powering the Shopping mode. And media.image_gen for image generation, which returns a CDN URL and saves to the sandbox. It has “artistic” and “realistic” modes plus aspect ratio controls.
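A quick sketch of an image_gen call, for reference. The style modes and aspect ratio control are from the dumped description; the argument names are guesses:

```python
# Hypothetical media.image_gen call. The "artistic"/"realistic" modes and
# the aspect ratio control come from the dumped description; the exact
# parameter names ("prompt", "style", "aspect_ratio") are my assumptions.
image_gen_call = {
    "tool": "media.image_gen",
    "arguments": {
        "prompt": "a raccoon sitting on a trash can wearing trash as a hat",
        "style": "realistic",      # or "artistic"
        "aspect_ratio": "1:1",
    },
}
# Per the description, the tool responds with a CDN URL and also writes
# the image into the sandbox for follow-up tool calls.
```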
container.python_execution is here too. Yes, it’s Code Interpreter, my favorite feature from both ChatGPT and Claude. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV. Python 3.9 is EOL now but the library selection is solid. I tested it and got Python 3.9.25 with SQLite 3.34.1 from January 2021.
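Those version numbers came from a quick probe inside the sandbox, which anyone can reproduce since it’s just standard-library Python:

```python
# The kind of probe I ran inside container.python_execution to check
# the runtime. Plain standard-library Python, nothing Meta-specific.
import sys
import sqlite3
import platform

print("Python:", platform.python_version())   # reported 3.9.25
print("SQLite:", sqlite3.sqlite_version)      # reported 3.34.1
print("Full build:", sys.version)
```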
container.create_web_artifact is what powers those inline visualizations. You can create HTML+JavaScript files that get served as secure sandboxed iframes. Set kind to “html” for websites or “svg” for vector graphics.
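Here’s the kind of minimal call I imagine drives that. The kind parameter is from the definition; the name of the field carrying the markup is my guess:

```python
# Hypothetical container.create_web_artifact call. "kind" comes from the
# dumped definition; "content" as the name of the markup field is assumed.
svg_artifact_call = {
    "tool": "container.create_web_artifact",
    "arguments": {
        "kind": "svg",   # "html" for websites, "svg" for vector graphics
        "content": "<svg xmlns='http://www.w3.org/2000/svg' width='100' height='100'>"
                   "<circle cx='50' cy='50' r='40' fill='teal'/></svg>",
    },
}
```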
Visual Grounding Gets Weird
The tool that really grabbed me was container.visual_grounding. The description: “Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.”
It takes an image path and optional object names, returns results in bbox, point, or count format. This sounded like Segment Anything at first, but apparently it’s a native model feature exposed through a tool call with a custom system prompt.
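Pieced together from that description, a grounding call probably looks something like this. Only the image-path, object-names, and output-format concepts come from the dump; the argument names are my assumptions:

```python
# Hypothetical container.visual_grounding call, reconstructed from the
# dumped description. Argument names are guesses; the sandbox path is
# a placeholder.
grounding_call = {
    "tool": "container.visual_grounding",
    "arguments": {
        "image_path": "/mnt/sandbox/raccoon.png",
        "object_names": ["raccoon", "trash can", "banana peel"],
        "format": "bbox",   # or "point" / "count"
    },
}
```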
I wanted to test the full pipeline. First I generated an image:
“generate a photo of a raccoon sitting on a trash can wearing trash as a hat”
It gave me a raccoon with a coffee cup crown, banana peel brim, and newspaper feather. Peak raccoon fashion. My guess is image generation runs on Emu or an updated version, same pattern as ChatGPT and Gemini where images come from tool calls.
The image_gen tool description said it “saves the image to the sandbox”, so I figured I could run Python against it:
“use python OpenCV to analyze that image and find out neat things about it”
It didn’t show me the code, but it generated an OpenCV analysis visualization. We can generate images and immediately manipulate them with Python tools in the same conversation. That’s a nice workflow.
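For a sense of what that pass involves, here’s the sort of OpenCV analysis you could run yourself against the saved image. The file path is a placeholder; the calls are standard OpenCV and NumPy:

```python
# A rough sketch of the kind of OpenCV pass the model appears to have run:
# load the generated image from the sandbox and pull out a few basic stats.
import cv2
import numpy as np

img = cv2.imread("raccoon.png")                   # BGR image from the sandbox
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

print("Resolution:", img.shape[1], "x", img.shape[0])
print("Mean brightness:", round(float(gray.mean()), 1))

# Edge density as a crude "visual complexity" measure
edges = cv2.Canny(gray, 100, 200)
print("Edge pixels:", int(np.count_nonzero(edges)))

# Average color via the per-channel mean (OpenCV stores BGR)
b, g, r = (float(c) for c in img.mean(axis=(0, 1)))
print(f"Average color (R,G,B): ({r:.0f}, {g:.0f}, {b:.0f})")
```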
Then I tested visual grounding directly:
“use visual_grounding on that raccoon image”
It pinpointed every piece of the outfit with pixel-level precision. Point mode showed exact coordinates for the raccoon, face, eyes, nose, whiskers, trash can, coffee cup, banana peel, newspaper, even the alley background.
I asked for bbox mode next. The boxes nested properly: face inside raccoon, eyes inside face, hat pieces overlapping but staying above the face box. Real object localization, not just center point guessing.
No mask output though. Visual grounding only returns three formats: point, bbox, and count. For masks you’d need to fake them with OpenCV in the container, which the model offered to do.
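The workaround it offered is straightforward: seed OpenCV’s GrabCut with a bounding box from visual_grounding and let it refine that into an approximate mask. A minimal sketch, with placeholder bbox values and file paths:

```python
# Approximate a segmentation mask from a visual_grounding bbox using
# OpenCV's GrabCut. The bbox values and paths are placeholders.
import cv2
import numpy as np

img = cv2.imread("raccoon.png")
x, y, w, h = 120, 80, 300, 360        # placeholder bbox from visual_grounding

mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# GrabCut refines a rectangular prior into a rough foreground segmentation
cv2.grabCut(img, mask, (x, y, w, h), bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
binary_mask = np.where(
    (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0
).astype(np.uint8)
cv2.imwrite("raccoon_mask.png", binary_mask)
```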
Finally I had it dump count mode JSON. The output was fascinating. It counted whiskers. 47 of them, to be specific, with a confidence score of 0.89.
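The output looked roughly like this. This is paraphrased rather than a verbatim dump, so the field names are approximate, but the whisker numbers are the real ones it gave me:

```python
# Approximate shape of the count-mode result (paraphrased, field names
# are my approximation). The whisker count and confidence are the actual
# numbers from the response.
count_result = {
    "format": "count",
    "objects": [
        # ...other objects elided...
        {"label": "whisker", "count": 47, "confidence": 0.89},
    ],
}
```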
Meta AI has raccoon whisker counting baked into the default tool set. I don’t know what to do with that information but it’s extremely on-brand for Meta.
Sub-Agents and Third Party Integrations
subagents.spawn_agent follows the sub-agent-as-a-tool pattern. “Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response.” We’re seeing this across all the major platforms now. Decomposing complex tasks into smaller agent calls is becoming standard architecture.
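A hypothetical spawn call, assuming a single free-text task field, since the dump only gave me the tool name and that one-line description:

```python
# Hypothetical subagents.spawn_agent call. Only the tool name and the
# "returns its final text response" behavior come from the dump; the
# "task" argument name is my guess.
spawn_call = {
    "tool": "subagents.spawn_agent",
    "arguments": {
        "task": "Research the release history of Meta's Llama models "
                "and summarize the licensing changes between versions.",
    },
}
# Per the description, the parent conversation gets back a single block
# of final text from the sub-agent.
```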
third_party.link_third_party_account handles OAuth flows for Google Calendar, Outlook Calendar, Gmail, and Outlook. I expect this list will grow fast if Meta opens up a proper developer platform.
There are also file manipulation tools that look similar to Claude’s text editor commands: container.view, container.insert, container.str_replace. These patterns are converging across every file-equipped agent harness.
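If it really does mirror Claude’s editor commands, a str_replace call would look something like this. The old_str/new_str argument names are an assumption based on that resemblance, not the actual schema:

```python
# Hypothetical container.str_replace call, assuming it follows the
# old_str/new_str pattern from Claude's text editor tool.
edit_call = {
    "tool": "container.str_replace",
    "arguments": {
        "path": "analysis.py",
        "old_str": "plt.show()",
        "new_str": "plt.savefig('output.png', dpi=150)",
    },
}
```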
container.download_meta_1p_media is particularly interesting. “Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or catalog_search_citation_id for catalog product images.” Pull in content from across Meta’s platforms and manipulate it in the Python sandbox. That’s a powerful combination if you’re already deep in the Meta ecosystem.
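A sketch of what that call might look like, with post_id and catalog_search_citation_id taken from the description and the value an obvious placeholder:

```python
# Hypothetical container.download_meta_1p_media call. The two parameter
# names are straight from the dumped description; the ID is a placeholder.
download_call = {
    "tool": "container.download_meta_1p_media",
    "arguments": {
        "post_id": "0000000000000000000",   # an Instagram/Facebook/Threads post
        # or: "catalog_search_citation_id": "..." for catalog product images
    },
}
```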
Are They Back in the Frontier Game?
Meta claims they can reach the same capabilities as Llama 4 Maverick with over an order of magnitude less compute. That’s a huge efficiency gain if true. The introductory blog post says Muse Spark is “significantly more efficient than the leading base models available for comparison.”
Artificial Analysis scored Muse Spark at 52, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Last year’s Llama 4 Maverick and Scout scored 18 and 13 respectively. That’s a massive jump.
Jack Wu from Meta confirmed these tools are part of the new harness they launched alongside the model. He also said “bigger models are already in development with infrastructure scaling to match” and that they have “plans to open-source future versions.”
I really hope they follow through on open sourcing. Llama 3.1, 3.2, and 3.3 were excellent laptop-scale model families. The community needs more open weights at the frontier, not fewer.
But right now I’m stuck in wait mode. The tool collection on meta.ai is impressive, but the real test comes when we can build on top of the API. A model’s value isn’t just in its benchmark scores or even its tool ecosystem; it’s in what developers can create when they have programmatic access to both.