StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production–Living Simulations with Stardew Valley

Abstract

Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production–living simulations. In StarDojo, agents are tasked with performing essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that removes the need for keyboard and mouse control, supports all mainstream operating systems, and enables parallelized execution of multiple environment instances, making it particularly suited for evaluating the most capable foundation agents built on multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLM agents reveal substantial limitations: the best-performing model, GPT-4.1, achieves only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning, and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production–living environments.

Stardew Valley: An Ideal Production-Living Simulation



Stardew Valley is an open-ended simulation RPG where players inherit a run-down farm. Players must thoughtfully manage their farming strategies, explore the surrounding village, and gather diverse resources to revitalize the farm. Players are encouraged to build meaningful relationships with local villagers and participate in community events. Its well-integrated systems of time management, resource allocation, economic planning, and social interaction provide a dynamic and complex environment that requires strategic thinking and adaptability. The game's structured yet open-ended nature makes it an excellent testbed for evaluating decision-making capabilities in simulated real-world conditions. Overall, Stardew Valley serves as an ideal environment for decision-making agents in the production–living simulation.

StarDojo Environment



As a classic video game, Stardew Valley supports only human-like interaction, i.e., observing gameplay through screenshots and controlling the game with keyboard and mouse. The game must also remain active and focused in the foreground, which severely limits the ability to automate gameplay or run multiple instances simultaneously. We introduce the StarDojo environment, which is designed to make agent interaction and assessment practical.

Unified User-friendly Interface. We present StarDojoMod, a novel extension built upon the Stardew Modding API (SMAPI), a widely adopted, open-source modding framework designed specifically for Stardew Valley. SMAPI offers developers extensive APIs that expose key game events and internal states, facilitating the creation of interactive and sophisticated mods. Based on SMAPI, StarDojoMod provides structured and efficient interactions between agents and the game environment. It communicates in real time with the Stardew Valley game engine through a socket server and gives agents direct access to rendered gameplay images (avoiding time-consuming screen captures) and to internal game states (such as character positions, statuses, and environmental details). It also exposes diverse callable functions as action skills that go beyond traditional keyboard and mouse inputs. Moreover, we implement a configurable pause-and-resume mechanism by directly modifying the internal state of the game, allowing the game to pause during agent planning and resume before action execution. Like SMAPI, StarDojoMod is implemented in C# to be consistent with the game engine. To enhance the ease of use and accessibility of the environment, we provide a user-friendly Python Wrapper around StarDojoMod for observation retrieval, action execution, and task customization, so that users can work with the StarDojo environment with little effort.
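The pause-and-resume cycle described above can be illustrated with a minimal sketch. It does not use the StarDojo Python Wrapper: MockGame, plan, and run_episode are hypothetical names invented for this illustration, the game is reduced to a toy one-dimensional position, and the planner is a simple rule standing in for an MLLM call. The only point it makes is the ordering: the game is paused while the agent observes and plans, and resumed just before the action is executed, so game time advances only during actions.

```python
from dataclasses import dataclass, field


@dataclass
class MockGame:
    """Hypothetical stand-in for a game instance; not the StarDojoMod API."""
    paused: bool = False
    ticks: int = 0        # game time advances only when an action executes
    position: int = 0     # toy 1-D player position
    log: list = field(default_factory=list)

    def pause(self):
        self.paused = True
        self.log.append("pause")

    def resume(self):
        self.paused = False
        self.log.append("resume")

    def observe(self):
        # A real observation would bundle a rendered frame with internal state.
        return {"position": self.position, "ticks": self.ticks}

    def execute(self, action):
        if self.paused:
            raise RuntimeError("cannot act while the game is paused")
        self.position += {"move_right": 1, "move_left": -1}.get(action, 0)
        self.ticks += 1
        self.log.append(action)


def plan(observation, target):
    """Toy rule-based planner standing in for an MLLM call."""
    if observation["position"] < target:
        return "move_right"
    if observation["position"] > target:
        return "move_left"
    return "noop"


def run_episode(game, target, max_steps):
    """Pause while planning, resume just before acting; True on success."""
    for _ in range(max_steps):
        game.pause()                      # freeze game time during planning
        obs = game.observe()
        if obs["position"] == target:
            game.resume()
            return True
        action = plan(obs, target)
        game.resume()                     # unfreeze right before acting
        game.execute(action)
    return game.observe()["position"] == target


game = MockGame()
success = run_episode(game, target=3, max_steps=30)
```

In this toy run the planner may take arbitrarily long without the game clock moving, which is the property the pause-and-resume mechanism is meant to provide for slow MLLM inference.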

System Compatibility. Stardew Valley is one of the few games that can be played on all mainstream operating systems (Linux, macOS and Windows). We also ensured the compatibility of StarDojoMod and the Python Wrapper, enabling the entire environment to run seamlessly across different systems.

Parallel Execution. Our architecture is designed for scalability and parallel execution. Each instance of Stardew Valley is independently managed through unique ports, enabling simultaneous control of multiple game instances without interference. Communication efficiency is further enhanced through the use of shared memory, reducing observation retrieval time to as little as 30 milliseconds. Furthermore, StarDojoMod supports headless operation through the X Virtual Framebuffer (Xvfb), enabling compatibility with Linux systems without graphical interfaces, thus broadening accessibility across diverse hardware and system configurations.
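As a rough sketch of the one-port-per-instance idea, the snippet below assigns each instance a distinct port and runs the rollouts concurrently. Everything here is an assumption made for illustration: BASE_PORT, assign_ports, run_instance, and run_parallel are hypothetical names, the starting port 10000 is arbitrary, and run_instance is a placeholder that opens no connection. A real worker would connect to the game instance listening on its port and drive the task there.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_PORT = 10000  # assumed starting port, not a StarDojo default


def assign_ports(num_instances, base_port=BASE_PORT):
    """Give every instance its own port so connections never collide."""
    return {i: base_port + i for i in range(num_instances)}


def run_instance(instance_id, port, task):
    """Placeholder rollout: a real worker would talk to the game on `port`."""
    return {"instance": instance_id, "port": port, "task": task, "done": True}


def run_parallel(tasks):
    """Run one task per instance concurrently; results keep task order."""
    ports = assign_ports(len(tasks))
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [
            pool.submit(run_instance, i, ports[i], task)
            for i, task in enumerate(tasks)
        ]
        return [f.result() for f in futures]


results = run_parallel(["farming_task", "crafting_task", "social_task"])
```

Threads are used here because each worker would spend most of its time waiting on the game or on a model call; a process pool would be an equally reasonable choice.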

StarDojo Benchmark

1000 Tasks: Distribution of the 1000 tasks in StarDojo across five categories (Farming, Crafting, Exploration, Combat, and Social), each with Easy, Medium, and Hard difficulties.
StarDojo-Lite: Task statistics of StarDojo-Lite. The suite comprises 100 tasks, selected as the most representative early-stage examples from each category.


We carefully curate 1000 tasks to benchmark agents' various behaviors in StarDojo. These tasks are divided into five distinct categories (Farming, Crafting, Exploration, Combat, and Social), which cover most of the production–living activities in the early and middle stages of the game. Each task is assigned one of three difficulty levels (easy, medium, or hard), with heuristic maximum step budgets of 30, 50, and 150 respectively, based on the task's complexity and the time it requires. To facilitate efficient agent evaluation, we also curate a smaller representative task suite, StarDojo-Lite, comprising 100 core tasks from the full collection and balancing coverage with practicality. This lite task set covers most of the representative activities in the early stage of the game.
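The step budgets can be written down as a small lookup, sketched below. Only the three budgets (30, 50, 150) come from the text. The function names, the convention that an unfinished task is recorded as None, and the episode outcomes are all made up for illustration and are not benchmark results.

```python
# Step budgets per difficulty, as stated in the text.
MAX_STEPS = {"easy": 30, "medium": 50, "hard": 150}


def step_budget(difficulty):
    """Look up the maximum number of steps allowed for a difficulty."""
    return MAX_STEPS[difficulty.lower()]


def is_success(completed_at_step, difficulty):
    """A task counts only if it was finished within its step budget.

    `completed_at_step` is None when the goal was never reached
    (an assumed convention for this sketch).
    """
    if completed_at_step is None:
        return False
    return completed_at_step <= step_budget(difficulty)


def success_rate(episodes):
    """Fraction of (difficulty, completed_at_step) episodes that succeeded."""
    if not episodes:
        return 0.0
    wins = sum(is_success(step, diff) for diff, step in episodes)
    return wins / len(episodes)


# Toy outcomes, invented purely to exercise the functions.
toy_episodes = [("easy", 12), ("easy", None), ("medium", 61), ("hard", 140)]
rate = success_rate(toy_episodes)
```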



SOTA Models in StarDojo-Lite

Major Results


Among the evaluated models, GPT-4.1 achieves the highest overall success rate, 12.7% across all tasks, while every other model scores below 11%. Among open-source models, Llama 4 Maverick demonstrates the strongest performance, benefiting from its larger model size. Most successful completions are limited to easy tasks; all models struggle significantly with medium and hard tasks, achieving near-zero success rates due to increased task complexity and longer sequences of required actions. Models show some proficiency in farming and crafting tasks but exhibit considerable difficulty in exploration, combat, and social interactions. These results highlight the significant room for improvement that remains for MLLMs, particularly in visual understanding, multimodal reasoning, low-level manipulation, and long-term planning.