HappyHorse-1.0: The Open Video Model That Topped the Artificial Analysis Arena
2026/04/09

HappyHorse-1.0, a 15B unified Transformer, claims #1 in both Text-to-Video and Image-to-Video on the Artificial Analysis blind-preference leaderboard. Here's what makes it different.

A new leader has emerged in AI video generation. HappyHorse-1.0 has claimed first place in both core tracks of the Artificial Analysis Video Arena, with an Elo of 1383 in Text-to-Video and 1413 in Image-to-Video. These are not self-reported benchmarks: they reflect real users choosing between anonymized outputs in blind comparisons.

Why the #1 Ranking Matters

Artificial Analysis operates a blind comparison platform where participants evaluate side-by-side generated videos without knowing which model produced which result. Rankings use the Elo rating system, the same method that powers competitive chess leaderboards. Every pairwise comparison dynamically adjusts each model's score.
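The pairwise update behind such a leaderboard can be sketched in a few lines. Note the K-factor of 32 and the 400-point scale below are the standard chess defaults; the arena's actual parameters are not stated in this post.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise Elo update.

    score_a is 1.0 if model A's video is preferred, 0.0 if model B's is,
    and 0.5 for a tie. Ratings move toward the observed outcome by an
    amount proportional to how surprising it was.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the blind comparison.
r_a, r_b = elo_update(1000.0, 1000.0, 1.0)  # -> (1016.0, 984.0)
```

Because the expected score is 0.5 between equally rated models, an upset against a much higher-rated opponent moves ratings far more than a predictable win, which is what lets a strong newcomer climb quickly.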

This carries several important implications:

  • Not a synthetic benchmark: Rankings derive from genuine human preference judgments rather than pre-defined automated metrics, making them far more reflective of real-world usage.
  • Dual-track dominance: HappyHorse-1.0 ranks first simultaneously in both Text-to-Video (T2V) and Image-to-Video (I2V), a feat rarely achieved by open-source models.
  • Blind-test fairness: Participants do not know which two models they are comparing, eliminating brand bias from the evaluation.

Technical Architecture of HappyHorse-1.0

HappyHorse-1.0 is a 15-billion-parameter unified Transformer purpose-built for joint video and audio generation. Its core architecture consists of 40 self-attention layers organized in a "sandwich" modality distribution strategy:

  • Edge layers: The layers nearest to the input and output allocate modality-specific pathways for video and audio, handling their respective feature encoding and decoding.
  • Middle layers: Core shared layers perform cross-modal feature fusion and alignment.

This design allows the model to capture fine-grained details within each modality while achieving genuine multimodal joint understanding at deeper levels. The model is released with a base version, a distilled version, a companion super-resolution module, and full inference code, all under the Apache 2.0 license.
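The layer layout described above can be illustrated with a small role-assignment sketch. The edge depth of 4 layers per side is purely an illustrative assumption; the post only says the 40 layers follow a "sandwich" split, not where the boundary falls.

```python
def sandwich_layout(num_layers=40, edge_depth=4):
    """Assign each self-attention layer a role in a 'sandwich' modality layout:
    modality-specific layers near the input and output, shared cross-modal
    fusion layers in the middle. edge_depth is a hypothetical choice.
    """
    roles = []
    for i in range(num_layers):
        if i < edge_depth or i >= num_layers - edge_depth:
            roles.append("modality_specific")  # separate video/audio pathways
        else:
            roles.append("shared_fusion")      # cross-modal feature fusion
    return roles

layout = sandwich_layout()
# First/last layers handle per-modality encoding/decoding; the middle block fuses.
```

The design intuition is that low-level features (pixels, waveforms) need modality-specific processing, while semantic alignment between what is seen and what is heard benefits from shared capacity in the middle of the stack.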

The Advantage of a Unified Multimodal Architecture

Generating video and audio within a single Transformer, rather than training two separate models, delivers tangible benefits:

  • More natural cross-modal alignment: When a character's mouth moves on screen, synchronized audio is generated in the same forward pass, with no post-processing alignment step.
  • Lower handoff loss: In a two-model pipeline, the outputs of a video model and an audio model need an additional alignment pass. A unified architecture maintains timeline consistency throughout the generation process itself.
  • Stronger prompt fidelity: A single text instruction influences both video and audio generation simultaneously, eliminating the risk of two models interpreting the same prompt differently.

The most visible manifestation of this advantage is native multilingual lip-sync. Users can describe a scene in Chinese, English, or any other language, and the generated video will feature naturally matching lip movements and speech, without requiring post-processing tools like Wav2Lip.

8-Step Distilled Inference: Speed as Capability

HappyHorse-1.0 employs DMD-2 (Distribution Matching Distillation) to compress inference to just 8 steps. Paired with MagiCompiler optimization, generating a 5-second 1080p clip takes approximately 38 seconds on a single H100.
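The cost structure of few-step inference can be shown with a toy sampling loop. This is not the DMD-2 algorithm itself (the post gives only the step count), just a sketch of why an 8-step distilled student is cheap: the model is queried 8 times instead of the dozens of steps a non-distilled diffusion sampler typically needs.

```python
def fewstep_sample(denoise_fn, noise, num_steps=8):
    """Minimal few-step sampling loop in the spirit of distilled inference.

    denoise_fn is one forward pass of the (hypothetical) student model;
    total cost is num_steps forward passes, regardless of clip length.
    """
    # Uniformly spaced timesteps from 1.0 (pure noise) down toward 0.0:
    # 1.0, 0.875, ..., 0.125 for num_steps=8.
    ts = [1.0 - i / num_steps for i in range(num_steps)]
    x = noise
    for t in ts:
        x = denoise_fn(x, t)  # one student forward pass per step
    return x
```

Since each step is a full forward pass of a 15B model over a video-length token sequence, cutting the step count from, say, 50 to 8 cuts the dominant cost term by the same factor.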

Fast inference is not merely an engineering optimization. It directly impacts creative productivity:

  • Iteration speed: Creators can explore more prompt variants in less time, converging on the best result faster.
  • Near-real-time preview: Generation fast enough to approach real-time feedback makes interactive, iterative creation workflows viable.
  • Production cost: Fewer inference steps mean lower GPU time consumption, a critical factor for large-scale deployment.
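The production-cost point can be made concrete with back-of-the-envelope arithmetic from the numbers quoted above (about 38 seconds per 5-second 1080p clip on one H100):

```python
def clips_per_gpu_hour(seconds_per_clip=38.0):
    """5-second 1080p clips generated per H100 GPU-hour at the quoted latency."""
    return 3600.0 / seconds_per_clip

def video_seconds_per_gpu_hour(seconds_per_clip=38.0, clip_len=5.0):
    """Seconds of finished video produced per GPU-hour."""
    return clips_per_gpu_hour(seconds_per_clip) * clip_len

# ~95 clips, i.e. ~474 seconds (~8 minutes) of video per H100 GPU-hour.
```

At that rate, a single H100 sustains roughly 95 five-second clips per hour, which is the figure a deployment team would multiply against GPU pricing to estimate per-clip cost.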

Why Leading Both T2V and I2V Is Significant

Text-to-Video and Image-to-Video may appear to be two entry points for the same task, but they impose structurally different demands on a model:

  • T2V requires the model to build spatiotemporal consistency from scratch, placing heavy demands on semantic understanding and compositional ability.
  • I2V must inject plausible motion and dynamics while preserving the style and content of a reference image, testing the model's capacity to "animate" static information.

HappyHorse-1.0 topping both leaderboards indicates that its architecture does not favor one input modality over another. Instead, it has established a genuinely universal video generation capability. For downstream applications, this means users receive equally high-quality output regardless of whether they start from a text prompt or a reference image.
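From an application's point of view, the two tracks collapse into one conditional entry point. A minimal dispatch sketch, assuming a hypothetical pipeline (none of these names come from the released API):

```python
def select_mode(prompt=None, image_path=None):
    """Pick the generation mode for a hypothetical unified pipeline.

    A reference image (with or without a prompt) implies Image-to-Video;
    a prompt alone implies Text-to-Video.
    """
    if image_path is not None:
        return "i2v"   # animate the reference image, optionally guided by text
    if prompt:
        return "t2v"   # build the scene from scratch from the text alone
    raise ValueError("need a text prompt and/or a reference image")

select_mode(prompt="a horse galloping through surf at sunset")   # -> "t2v"
select_mode(prompt="make it gallop", image_path="horse.png")     # -> "i2v"
```

A single model serving both branches is what makes this dispatch trivial; with two separate models, the application would also have to reconcile their differing styles and prompt behaviors.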

What This Means for Developers and Creators

HappyHorse-1.0's open-source release opens new possibilities across the AI video ecosystem:

  • Local deployment: The Apache 2.0 license permits commercial use and modification, enabling developers to integrate the model directly into their own products.
  • API integration: The distilled version's efficient inference makes it well-suited as a cloud service backend, reducing operational costs.
  • Creative workflows: Native multilingual lip-sync and joint audio-video generation substantially streamline production pipelines for short-form video, advertising, and animation.

Impact on the Open Video Ecosystem

An open-source model claiming the top spot on a user-preference blind leaderboard is shifting the competitive landscape of video generation. Until now, cutting-edge video generation has been nearly monopolized by closed API services. HappyHorse-1.0 demonstrates that the open-source community has reached parity with commercial systems on quality as measured by real users.

The signal to the industry is clear: the capability ceiling of AI video generation is no longer the exclusive domain of closed-source labs. Developers, startups, and research teams can now build products and iterate on a top-tier model validated by actual users, without relying on expensive API calls.

Ready to see what HappyHorse-1.0 can do? Visit https://happyhorse.design/ to try it yourself.
