Life-cycle legal-agent simulation and benchmark

LEGALWORLD: A Life-Cycle Interactive Environment for Legal Agents

Songhan Zuo1,2,* Shengbin Yue1,* Tao Chiang1 Guanying Li1 Yun Song3 Xuanjing Huang1,2 Zhongyu Wei1,2,#

1Fudan University    2Shanghai Innovation Institute    3Northwest University of Political Science and Law

* Equal contribution. # Corresponding author.

[Figure: LEGALWORLD environment overview with agent roles, memory, tools, skills, and connected litigation stages]
LEGALWORLD connects consultation, document drafting, first-instance trial, appeal determination, appeal drafting, and second-instance trial through persistent memory and procedural Tool/Skill support.

Overview

Civil litigation is not a collection of isolated tasks. Earlier consultation and drafting choices shape what can be argued at trial, how evidence is used, and what happens on appeal. LEGALWORLD turns that process into a connected interactive environment for legal agents, modeling Chinese civil litigation as a causally linked state chain across five stages and seven sub-scenarios.

Built on 75,309 paired Chinese civil judgments, LEGALWORLD provides role-bound interfaces for clients, lawyers, and judges, plus local memory, global case memory, and a modular Skill/Tool library. On top of the environment, LongJud-Bench evaluates legal-agent capability across consultation, drafting, courtroom advocacy, appeal, and second-instance proceedings.

Data Foundation

LEGALWORLD is grounded in 75,309 paired first- and second-instance Chinese civil judgments. The paired structure is important: it lets each runnable case preserve the relationship between an initial dispute, its first-instance judgment, the appeal context, and the final appellate outcome.

The corpus covers more than 500 causes of action, ranging from high-frequency disputes such as private lending and labor disputes to long-tail civil cases. This breadth is what makes it possible to construct life-cycle trajectories rather than a small set of hand-written scenes.

In the current paper version, the 75,309 paired cases support five connected stages and seven sub-scenarios, and 18,992 human ratings from 217 legal-background evaluators are used for reliability validation.

[Figure: court-level and cause-of-action distribution for the 75,309 paired cases]
Data foundation: paired appellate judgments provide both procedural continuity and broad cause-of-action coverage.

Life-Cycle Environment

LEGALWORLD unfolds a civil dispute from the first client consultation to the final second-instance judgment. Each stage consumes the artifacts produced by the previous stage: facts disclosed during consultation become complaint claims, drafted claims constrain courtroom advocacy, and trial outcomes shape the appeal record.

This is why the environment uses both in-scenario local memory and global case memory. Local memory keeps a single stage coherent, while global case memory carries facts, evidence, litigation goals, and role-specific cognition across the full procedural chain. Clients, lawyers, and judges therefore do not restart from a shared omniscient state; they operate under their own visibility, duties, and procedural permissions.
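The split between local and global memory can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names (`MemoryEntry`, `GlobalCaseMemory`, `StageSession`), the `visible_to` field, and the sample entries are hypothetical and are not taken from the LEGALWORLD codebase.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    stage: str             # stage that produced the entry
    kind: str              # e.g. "fact", "goal", "claim"
    content: str
    visible_to: frozenset  # roles allowed to read the entry


@dataclass
class GlobalCaseMemory:
    """Hypothetical case-level store that persists across stages."""
    entries: list = field(default_factory=list)

    def write(self, stage, kind, content, visible_to):
        self.entries.append(MemoryEntry(stage, kind, content, frozenset(visible_to)))

    def view(self, role):
        # Each role sees only the entries it is permitted to read.
        return [e for e in self.entries if role in e.visible_to]


@dataclass
class StageSession:
    """Hypothetical per-stage session; local memory starts empty in every stage."""
    stage: str
    role: str
    global_memory: GlobalCaseMemory
    local_memory: list = field(default_factory=list)

    def context(self):
        # Role-filtered artifacts from the whole case, then this stage's own turns.
        carried = [e.content for e in self.global_memory.view(self.role)]
        return carried + self.local_memory


# Illustrative entries only; not drawn from the dataset.
memory = GlobalCaseMemory()
memory.write("consultation", "fact", "Loan was never repaid", {"client", "lawyer"})
memory.write("consultation", "goal", "Client would accept a settlement", {"client", "lawyer"})
memory.write("drafting", "claim", "Repayment of principal plus interest", {"client", "lawyer", "judge"})

lawyer = StageSession("first_instance_trial", "lawyer", memory)
judge = StageSession("first_instance_trial", "judge", memory)
lawyer.local_memory.append("Judge asks for proof of transfer")

print(len(lawyer.context()))  # 4: three carried entries plus one local turn
print(judge.context())        # ['Repayment of principal plus interest']
```

In this sketch the judge never sees the client's private litigation goal, which mirrors the point that roles do not share an omniscient state.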

The Tool/Skill layer gives agents procedural support for retrieval, document drafting, citation checking, artifact reading, and memory writing. The goal is not only to make agents talk like legal roles, but to make them act through the same kinds of artifacts and constraints that structure legal practice.
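One way to picture role-bound tool access is a small registry that checks the caller's role before dispatching. This is a sketch under stated assumptions: `ToolRegistry`, the `check_citation` tool, its permitted roles, and the placeholder citation strings are all hypothetical, and the real Tool/Skill library may be organized quite differently.

```python
from typing import Callable, Dict, Set


class ToolRegistry:
    """Hypothetical registry mapping tool names to functions and permitted roles."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self._allowed: Dict[str, Set[str]] = {}

    def register(self, name: str, roles: Set[str]):
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            self._tools[name] = fn
            self._allowed[name] = set(roles)
            return fn
        return decorator

    def call(self, role: str, name: str, *args, **kwargs) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        if role not in self._allowed[name]:
            raise PermissionError(f"{role} may not call {name}")
        return self._tools[name](*args, **kwargs)


registry = ToolRegistry()


# Placeholder statute names; a real citation checker would query a law index.
@registry.register("check_citation", roles={"lawyer", "judge"})
def check_citation(citation: str, known: frozenset = frozenset({"Statute A, Art. 1"})) -> str:
    return "valid" if citation in known else "not found"


print(registry.call("lawyer", "check_citation", "Statute A, Art. 1"))  # valid
print(registry.call("judge", "check_citation", "Statute B, Art. 9"))   # not found
```

Keeping permissions in the registry rather than in each tool means a role's procedural limits are enforced in one place.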

Example case flow from legal consultation to first-instance trial

Example trajectory: facts and evidence discovered in early consultation are carried into complaint drafting and then used during first-instance trial.

Experiments

The experiments first ask whether LEGALWORLD produces reliable legal trajectories, then use the validated environment to evaluate model backbones through LongJud-Bench. The paper studies stage authenticity, role consistency, judicial-output alignment, cross-stage dependence, cross-model capability profiles, and a small trajectory-reflection probe.

Environment reliability.

Across all five stages and three roles, 217 legal-background evaluators rate LEGALWORLD at 8.96/10 on stage authenticity and 8.98/10 on role consistency. The human study contributes 18,992 ratings, and 64.4 percent of aligned human and LLM-as-Judge metric pairs fall within one point of each other. The free-text explanations also point to process coherence, procedural completeness, and role authenticity as the dominant reasons for high scores.
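The within-one-point agreement figure can be read as a simple tolerance rate over aligned score pairs. The sketch below assumes that reading; the function name `within_tolerance_rate` and the sample scores are hypothetical and are not data from the study.

```python
def within_tolerance_rate(human, judge, tolerance=1.0):
    """Share of aligned pairs whose absolute score difference is <= tolerance."""
    if len(human) != len(judge):
        raise ValueError("score lists must be aligned pair by pair")
    if not human:
        raise ValueError("at least one pair is required")
    hits = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return hits / len(human)


# Illustrative scores on a 10-point scale; NOT ratings from the paper.
human_scores = [9, 8, 10, 7, 9]
judge_scores = [8, 8, 7, 9, 9.5]

rate = within_tolerance_rate(human_scores, judge_scores)
print(f"{rate:.1%}")  # 60.0%
```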

[Figure: human and LLM-as-Judge score difference distribution]
Human and LLM-as-Judge validation: 64.4 percent of aligned pairs fall within ±1 point.
[Figure: reason themes from human ratings]
Human rating reasons emphasize process coherence, procedural completeness, and role authenticity.

LongJud-Bench cross-model evaluation.

LongJud-Bench scores lawyer agents across the full litigation process through eight capabilities: issue spotting, party identification, claim construction, fact marshalling, evidence marshalling, position consistency, evidentiary advocacy, and legal reasoning. The capability profile reveals that no backbone wins everywhere. Drafting strength, claim construction, and courtroom advocacy separate in ways that aggregate scores would hide.
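A toy example shows how an aggregate score can hide a split capability profile. The backbone names and all numbers below are invented for illustration, use only three of the eight capabilities, and are not results from LongJud-Bench.

```python
from statistics import mean

# Hypothetical scores on a 0-100 scale; NOT results from the paper.
profiles = {
    "backbone_a": {"claim_construction": 82, "evidentiary_advocacy": 50, "legal_reasoning": 63},
    "backbone_b": {"claim_construction": 60, "evidentiary_advocacy": 70, "legal_reasoning": 65},
}

# Aggregate view: one number per backbone.
aggregate = {name: mean(scores.values()) for name, scores in profiles.items()}

# Profile view: which backbone leads on each capability.
capabilities = list(next(iter(profiles.values())))
best_per_capability = {
    cap: max(profiles, key=lambda name: profiles[name][cap]) for cap in capabilities
}

print(aggregate)            # both backbones average 65
print(best_per_capability)  # the leader differs by capability
```

Both hypothetical backbones tie on the mean, yet one leads on claim construction and the other on advocacy and reasoning, which is the kind of separation a per-capability profile exposes.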

The paper finds courtroom advocacy to be the shared frontier. Compared with drafting, advocacy requires the lawyer agent to integrate accumulated memory, opposing statements, judge prompts, and evidentiary constraints in multi-turn interaction, which makes it the most discriminative target for future legal-agent training.

[Figure: cross-model capability heatmap on LongJud-Bench]
Cross-model capability profile. Courtroom advocacy cells remain lighter than drafting cells across backbones, consistent with advocacy being the shared frontier.

Trajectory reflection as a training signal.

As an exploratory probe, the paper tests whether completed life-cycle trajectories can produce reusable reflective Skill notes. On the three most frequent civil causes of action in the dataset, adding these notes raises the average LongJud-Bench overall score from 61.56 to 65.29, a gain of 3.73 points. Gains appear both on reflected cases and on held-out same-cause cases, suggesting that life-cycle interaction traces can serve not only for evaluation, but also as procedurally grounded data for improving legal agents.
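The reported gain can be checked directly from the two averages. The relative improvement in the second line is derived here from those numbers and is not a figure stated in the paper.

```latex
\begin{align*}
\Delta &= 65.29 - 61.56 = 3.73 \\
\frac{\Delta}{61.56} &= \frac{3.73}{61.56} \approx 0.0606 \quad (\text{about } 6.1\% \text{ relative gain})
\end{align*}
```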

[Figure: reflective Skill gains across high-frequency civil causes]
Exploratory reflective Skill gains across high-frequency civil causes.

Release Status

This project page was prepared ahead of the arXiv release. The paper PDF and public backend code are available for preview, while the arXiv page, data release, and BibTeX entry are intentionally left pending until the public release is ready. Large data assets, including the law-retrieval vector index, will be distributed through the data release rather than stored in the GitHub code repository.

Paper PDF: ready | arXiv: pending | Code: ready | Data: pending | Citation: pending

Citation

BibTeX will be added after the arXiv version is public.