Evaluations

Evaluations help you answer two questions: is your agent actually getting work done on a real computer or sandbox, and how long does the work take?

MIOSA evaluations cover three layers:

Layer	What it measures
Sandbox workflow	Create a sandbox, write files, run commands, expose a preview, publish, and clean up
Desktop workflow	Boot a Computer, open a browser, screenshot, click, type, scroll, and observe the result
Agent task	Give an agent a goal, collect the trace, and score whether the final state matches the expected outcome

Why evaluations matter

For app-builder products and computer-use agents, normal API uptime is not enough. You need to know:

How quickly a workspace becomes usable.
Whether the desktop stream is ready before the agent acts.
Whether screenshots and actions stay responsive during real tasks.
Whether a generated app can move from prompt to preview to publish.
Whether a model, tool loop, or provider change improves task success.

OSWorld-compatible benchmark direction

OSWorld is a public benchmark for multimodal agents in real computer environments. It defines real desktop tasks, setup state, agent interaction, and execution-based scoring across open-ended workflows. The public OSWorld site describes 369 real-world computer tasks, with 361 tasks commonly used when excluding Google Drive tasks that need manual configuration.

MIOSA uses the same idea for product benchmarking:

Start from a known computer state.
Run a task through screenshot and action APIs.
Capture every observation and action.
Score the final state with a deterministic evaluator where possible.
Save the trace so failures can be replayed and compared.
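
The five steps above can be sketched as a small loop. This is a minimal illustration only: `FakeComputer`, `run_task`, `demo_policy`, and `demo_evaluator` are hypothetical names invented for this example and are not part of any MIOSA SDK. The "screenshot" here is just a copy of an in-memory state dictionary, which keeps the sketch runnable without a real desktop.

```python
import json
from dataclasses import asdict, dataclass, field


class FakeComputer:
    """Hypothetical in-memory stand-in for a computer (not a MIOSA class)."""

    def __init__(self, initial_state: dict):
        self.state = dict(initial_state)  # step 1: known starting state

    def screenshot(self) -> dict:
        # A real computer would return pixels; here the observation is the state.
        return dict(self.state)

    def act(self, action: dict) -> None:
        if action["type"] == "type":
            self.state[action["field"]] = action["text"]


@dataclass
class Trace:
    task_id: str
    steps: list = field(default_factory=list)
    passed: bool = False


def run_task(task_id, computer, policy, evaluator, max_steps=10) -> Trace:
    trace = Trace(task_id=task_id)
    for _ in range(max_steps):
        observation = computer.screenshot()   # step 2: observe
        action = policy(observation)          # step 2: decide
        # Step 3: capture every observation and action.
        trace.steps.append({"observation": observation, "action": action})
        if action is None:                    # policy signals it is done
            break
        computer.act(action)
    trace.passed = evaluator(computer.state)  # step 4: deterministic score
    return trace


def demo_policy(observation: dict):
    if observation.get("search_box") != "miosa":
        return {"type": "type", "field": "search_box", "text": "miosa"}
    return None


def demo_evaluator(state: dict) -> bool:
    return state.get("search_box") == "miosa"


trace = run_task("fill-search", FakeComputer({"search_box": ""}),
                 demo_policy, demo_evaluator)
trace_json = json.dumps(asdict(trace))        # step 5: save for replay
```

Because the evaluator reads only the final state, the same trace can be re-scored later without re-running the task.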

Desktop readiness metrics

These are the computer desktop metrics we track for V1 quality:

Metric	What it measures
Provision time	Time from create request to assigned machine.
Desktop ready	Time until the desktop stream accepts a real connection.
First screenshot	Time until the agent can observe the screen.
First action	Time until click, type, or key events are accepted.
Screenshot latency	Round-trip time for repeated screenshots during a task.
Action latency	Round-trip time for click, type, scroll, and key events.
Task success	Whether the final desktop state matches the evaluator.
Trace completeness	Whether every observation and action is captured for replay.
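
One way to collect the time-based readiness metrics above is to poll a probe for each milestone and record the elapsed time since the create request. The sketch below assumes each milestone can be checked by a function that returns `True` once it is reached. `measure_readiness` and `ready_after` are hypothetical helpers written for this example, and the probes are simulated with timers rather than real API calls.

```python
import time
from typing import Callable, Dict, Optional


def measure_readiness(
    probes: Dict[str, Callable[[], bool]],
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> Dict[str, Optional[float]]:
    """Poll milestones in order; report seconds from one shared start time."""
    start = time.monotonic()  # monotonic clock is safe for measuring durations
    results: Dict[str, Optional[float]] = {name: None for name in probes}
    for name, probe in probes.items():
        while time.monotonic() - start < timeout_s:
            if probe():
                results[name] = time.monotonic() - start
                break
            time.sleep(interval_s)
        else:
            # Timed out: leave this and all later milestones as None.
            break
    return results


def ready_after(delay_s: float) -> Callable[[], bool]:
    """Simulated probe that turns True after delay_s (an assumption for the demo)."""
    t0 = time.monotonic()
    return lambda: time.monotonic() - t0 >= delay_s


timings = measure_readiness({
    "provision": ready_after(0.02),
    "desktop_ready": ready_after(0.05),
    "first_screenshot": ready_after(0.08),
})
```

Measuring every milestone from the same start keeps the numbers comparable: "first screenshot" then includes provisioning and desktop boot, which is what an agent actually waits for.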

Sandbox workflow metrics

For AI app builders, the main path is:

Create a sandbox.
Write generated files.
Install and run the app.
Expose a live preview.
Iterate on feedback.
Publish to a durable deployment.
Destroy or pause the workspace.

MIOSA tracks timing and failure rates across that whole path, not just isolated endpoint latency.
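
As a rough sketch of tracking the whole path rather than a single endpoint, the runner below times each step and records the first step that fails. The step names mirror the list above. `run_workflow`, `ok`, and `failing_preview` are hypothetical, and the step bodies are stubs standing in for real sandbox calls.

```python
import time
from typing import Callable, List, Tuple


def run_workflow(steps: List[Tuple[str, Callable[[], None]]]) -> dict:
    """Run steps in order, timing each one and stopping at the first failure."""
    report = {"timings_s": {}, "failed_step": None, "error": None}
    started = time.monotonic()
    for name, step in steps:
        step_start = time.monotonic()
        try:
            step()
        except Exception as exc:  # any failure ends the run at this step
            report["failed_step"] = name
            report["error"] = str(exc)
        finally:
            report["timings_s"][name] = time.monotonic() - step_start
        if report["failed_step"] is not None:
            break
    report["total_s"] = time.monotonic() - started
    report["success"] = report["failed_step"] is None
    return report


def ok() -> None:
    """Stub for a step that succeeds."""


def failing_preview() -> None:
    """Stub that simulates a preview that never comes up."""
    raise RuntimeError("preview port did not open")


report = run_workflow([
    ("create_sandbox", ok),
    ("write_files", ok),
    ("install_and_run", ok),
    ("expose_preview", failing_preview),
    ("publish", ok),
    ("destroy", ok),
])
```

This sketch stops at the first failure, so the stubbed `destroy` step does not run here. A real runner would likely perform cleanup regardless of where the run failed.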

What customers see

Each evaluation run reports success rate, total time, p50 and p95 latency, the failed step, and resource usage.

Each run also records screenshots, commands, preview URLs, action events, and the final result state.

You can compare native MIOSA Computers, BYOC machines, and external providers through one task shape.

You can re-run the same task suite after image, SDK, agent, or provider changes.
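
To make the summary numbers concrete, the sketch below computes success rate, p50 and p95 total time, and a count of failures per step from a list of run records. The record fields (`success`, `total_s`, `failed_step`) and the sample values are assumptions made up for this example, not a MIOSA report format. Percentiles use the nearest-rank method; other definitions interpolate and can give slightly different values.

```python
import math
from collections import Counter
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: List[dict]) -> dict:
    totals = [run["total_s"] for run in runs]
    return {
        "success_rate": sum(run["success"] for run in runs) / len(runs),
        "p50_s": percentile(totals, 50),
        "p95_s": percentile(totals, 95),
        "failures_by_step": dict(
            Counter(run["failed_step"] for run in runs if not run["success"])
        ),
    }


# Illustrative sample data only.
runs = [
    {"success": True, "total_s": 10.0, "failed_step": None},
    {"success": True, "total_s": 12.0, "failed_step": None},
    {"success": False, "total_s": 30.0, "failed_step": "publish"},
    {"success": True, "total_s": 11.0, "failed_step": None},
]
summary = summarize(runs)
# summary["success_rate"] == 0.75, summary["p50_s"] == 11.0, summary["p95_s"] == 30.0
```

With only four samples the p95 is simply the slowest run, which is a reminder that tail percentiles need a reasonably large suite to be meaningful.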

V1 scope

Capability	Status
Sandbox lifecycle benchmark	Available in docs and internal release checks.
Desktop readiness benchmark	Available as a V1 quality target.
Provider comparison adapter	Available for selected external computer providers.
OSWorld-style task scoring	In hardening for partner use.
Public leaderboard	Planned after the runner and reporting format stabilize.

Use cases

Validate that a ClinicIQ lead magnet builder can create, preview, and publish a generated app.
Compare two agent prompts against the same browser task.
Check whether a desktop image change affected first screenshot time.
Compare a native MIOSA Computer with an external computer provider.
Prove that a full-stack app builder still works after SDK or deploy changes.
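
Several of these use cases come down to running one task suite twice and comparing the results. A minimal sketch, assuming each suite run is reduced to a mapping from task ID to pass or fail; `compare_suites` and the task IDs are hypothetical:

```python
from typing import Dict


def compare_suites(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> dict:
    """Compare pass/fail results for tasks present in both runs."""
    shared = sorted(set(baseline) & set(candidate))

    def rate(results: Dict[str, bool]) -> float:
        return sum(results[t] for t in shared) / len(shared) if shared else 0.0

    return {
        "baseline_rate": rate(baseline),
        "candidate_rate": rate(candidate),
        # Passed before the change, fails after it.
        "regressions": [t for t in shared if baseline[t] and not candidate[t]],
        # Failed before the change, passes after it.
        "fixes": [t for t in shared if not baseline[t] and candidate[t]],
    }


# Illustrative results for two agent prompts on the same tasks.
prompt_a = {"open-settings": True, "fill-form": False, "download-file": True, "rename-tab": True}
prompt_b = {"open-settings": True, "fill-form": True, "download-file": False, "rename-tab": True}
diff = compare_suites(prompt_a, prompt_b)
```

Listing regressions and fixes separately matters because two runs can have the same overall success rate while passing different tasks, as in this sample.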