NEW AI Video Generator Kling 2.6 DESTROYS Veo 3.1 & Sora 2? Full Comparison

Dan Kieft
8 Dec 2025 · 20:04

TLDR: In this video, the new AI video generator Kling 2.6 is compared to its competitors, Sora 2 and Google Veo 3.1. Key comparisons include dialogue generation, dynamic shots, realism, and audio effects. While Kling 2.6 impresses with its prompt adherence and animation quality, Google Veo 3.1 excels in realistic sound and detail, and Sora 2 stands out for its image quality and dynamic effects. Despite some flaws, Kling 2.6 shows promise, especially given its lower price point compared to the other models. The video explores various real-world scenarios to highlight the strengths and weaknesses of each tool.

Takeaways

  • 😀 Kling 2.6 has finally added native audio to its video generation, making it directly comparable to tools like Sora 2 and Google Veo 3.1.
  • 🎬 The first test focuses on dialogue generation, where Google Veo 3.1 impresses with its audio quality, Sora 2 excels in image quality, and Kling 2.6 still needs some refinement.
  • 🔊 Kling 2.6's audio generation is improving but doesn't quite match the realism of Google Veo 3.1, especially in complex scenarios like podcasts.
  • 📷 In a dynamic skateboard-trick example, Kling 2.6's generation was the most realistic, with proper camera flow and sound effects, outperforming Sora 2 and Google Veo 3.1.
  • ⚠️ Sora 2 struggles when given realistic human references, while Google Veo 3.1 and Kling 2.6 handle them better, with Kling 2.6 holding a slight edge.
  • 👻 For horror scenes, Kling 2.6 and Sora 2 outperform Google Veo 3.1, with Sora 2 providing the best sound design.
  • 🍳 In a cooking scene, Google Veo 3.1 stands out for its realistic kitchen sounds and satisfying audio effects, while Kling 2.6 lacks some detail in visual motion.
  • 🎥 Kling 2.6 excels at prompt adherence and animation quality, especially when generating static animations, but Sora 2 struggles with certain prompts.
  • 🕵️‍♂️ Kling 2.6's animation and character-movement quality beats Google Veo 3.1 in some instances, though not consistently.
  • 💸 Kling 2.6 is more affordable than Sora 2 and Google Veo 3.1, making it a viable option for users looking for a cost-effective AI video generator.

Q & A

  • What is the main topic of the video transcript?

    -A hands-on comparison of AI video generators (Kling 2.6, Google Veo 3.1, and Sora 2), evaluating audio, image quality, motion, prompt adherence, and practical use cases.

  • Which new feature does Kling 2.6 introduce according to the transcript?

    -Kling 2.6 introduces native audio generation for its videos, allowing the model to produce synchronized speech and sound effects within video outputs.

  • Where does the creator say viewers can access Kling 2.6?

    -The creator says Kling 2.6 is available on Artlist and mentions a link in the video description; Artlist sponsored the video.

  • Which model does the creator generally prefer for audio quality?

    -The creator repeatedly ranks Google Veo 3.1 as the best for audio quality and realistic environmental sound design.

  • Which model is credited with the best image/animation and prompt adherence in the transcript?

    -Kling 2.6 is credited with the strongest prompt adherence and animation quality, while Sora 2 is singled out for raw image quality in several tests.

  • How does Sora 2 compare across the tests?

    -Sora 2 often produces strong, realistic-looking visuals and convincing vlog-style audio in some tests, but it has restrictions when given realistic reference images (it can reject or fail those prompts).

  • What recurring limitation did the creator encounter with Sora 2?

    -Sora 2 refused or failed to generate outputs when realistic reference images of real people were used, due to its safety/guideline constraints.

  • In the skateboard and motorcycle examples, which model performed best for realism and believable motion?

    -Kling 2.6 was favored for the skateboard trick and given a slight advantage on the motorcycle action shot; Sora 2 sometimes produced inconsistent physics or strange editing, and Google Veo 3.1 struggled with realistic motion in those examples.

  • How did the models perform on complex audio + SFX scenes, like cooking or horror?

    -Google Veo 3.1 excelled at detailed SFX (e.g., egg cracking, fridge noise), while Kling 2.6 and Sora 2 performed well in horror and atmospheric scenes (Kling for visuals, Sora often for creepy sound design), depending on the example.

  • Were there examples where a model failed to follow the intended narration or role (e.g., narration vs. on-screen speaking)?

    -Yes; when the creator wanted narration over changing camera angles (the Pennywise example), Kling 2.6 incorrectly had the clown speak the lines as on-screen dialogue instead of producing a separate narration voice, while Google Veo 3.1 handled the narration better.

  • What issues did the creator notice about audio mixing and environmental sound in some Kling 2.6 outputs?

    -Kling sometimes lacked rich environmental sounds (e.g., street honking) or had quieter audio than Google, and its audio-camera synchronization and mic-level variation were less refined than Google Veo 3.1's.

  • How did the models handle prompt complexity and camera instructions (dolly, whip pan, angle changes)?

    -All models had mixed results: Kling 2.6 often followed prompts well (good prompt adherence), Google Veo 3.1 handled some camera changes and narration pacing reliably, and Sora 2 sometimes introduced unexpected edits (e.g., slow motion) or failed on certain directed camera moves.

  • What does the creator conclude about using Kling 2.6 versus Veo 3.1 and Sora 2?

    -The creator sees Kling 2.6 as a strong, cost-effective new option, particularly for animation and prompt fidelity, and recommends adding it to a toolkit alongside Veo 3.1 and Sora 2, choosing the tool by task (audio vs. image vs. cost).

  • Does the video mention costs or pricing differences between the tools?

    -Yes; the creator notes that Kling 2.6 is significantly cheaper than Google Veo 3.1 and Sora 2, making it an attractive option for budget-conscious creators.

  • What practical advice does the creator give for viewers who want to experiment with these tools?

    -Try multiple generators depending on the scene: use Google Veo 3.1 for detailed audio/SFX, Kling 2.6 for prompt-accurate visuals and cheaper runs, and Sora 2 for highly realistic, polished footage where its input restrictions allow; viewers are also invited to join the creator's community for prompts and tips.

Outlines

00:00

🆕 Introduction & First Impressions of Kling 2.6

The paragraph introduces the newly launched Kling 2.6 model, which finally includes native audio for AI video generation. The speaker explains they will compare Kling 2.6 against Google Veo 3.1 and Sora 2 across several categories, notes how to access Kling 2.6 via Artlist (link in description), and mentions the sponsor. The author runs a text-to-video dialogue test (a woman vlogging on a busy New York street) across Google Veo 3.1, Sora 2, and Kling 2.6. Key observations: Google Veo 3.1 produces strong, emotive audio with convincing ambient sounds (cars honking); Sora 2 provides convincing background noise and a realistic vlog feel; Kling 2.6's audio is weaker in ambient fidelity, mostly capturing walking and speech while missing the richer environmental sounds. The author's initial verdict for this section: Veo 3.1 leads on audio, Sora 2 leads on image quality, and Kling 2.6 isn't yet as impressive. The paragraph ends by introducing the next test (a podcast-style UFC interview) and noting the testing methodology (the same prompt run across all models).

05:01

🎙️ Dialogue & Dynamic-Scene Tests (Podcast, Animation, Skateboard)

This paragraph covers multiple focused tests comparing the three models. First, a podcast/interview-style UFC scene with quick shot changes: Google Veo 3.1 provides excellent audio dynamics (mic-proximity effects, clear loudness changes) and strong overall audio realism; Kling 2.6 struggles with consistent speech rendering and scene coherence in this example; Sora 2 performs well and in some cases ties with Google for quality. Next, the author tests an animated character intro (YouTube channel/Dracula prompt): Google's voice is well liked, Kling 2.6 shows excellent prompt adherence and solid animation and character movement (the author roots for Kling's progress), while Sora 2's result for this prompt is weak. The paragraph then moves to a harder test: dynamic, realistic skateboard physics and camera flow. Google's result fails to meet expectations, Kling 2.6 nails a convincing trick with believable audio sync and motion (the author calls it "sick"), and Sora 2 produces visually interesting shots but with odd camera logic and inconsistent continuity (extraneous slow motion and compilation-like output). The author awards the skateboard example to Kling 2.6, then notes a general limitation: Sora 2 refuses to process realistic reference images, which blocks some use cases. Finally, the author compares a motorcycle mountain-road action shot generated from a reference image: Google's render has direction and lean issues; Kling 2.6 handles the turn better and is judged slightly superior (with the caveat that both could be improved with more generations).

10:03

👻 Audio, Sound Effects & Horror / Domestic Scene Comparisons

This paragraph focuses on how well each model handles complex audio design and sound effects across different genres. The author first tests a horror prompt (close-up of a frightened woman, whip pan revealing a ghost): Google's output is decent but misses the whip-pan reveal; Kling 2.6 produces very realistic close-ups, convincing emotional beats, and a striking monster reveal, with acceptable sound design; Sora 2 excels at the horror sound design and delivers a freakier, more effective audio/shot pairing. The author ranks this horror test roughly as Kling 2.6 and Sora 2 as strong contenders (Kling slightly ahead visually), then Google. The paragraph then shifts to a gentle anime-style domestic kitchen scene (fridge, egg cracking, sizzling): Google Veo 3.1 performs best at realistic foley (fridge noise, egg crack, sizzle) and is rated highly; Kling 2.6 shows promising sound but produces surreal visual and audio artifacts (a morphing egg, odd continuity); Sora 2 could not produce this scene in the author's attempts (likely blocked by its constraints). The author concludes that Google Veo 3.1 clearly wins for domestic, foley-heavy scenes, Kling shows potential but has odd visual and audio glitches, and Sora is inconsistent or constrained by its safeguards. The paragraph ends by introducing a filmmaking/narration test using a Pennywise-like input image, covered next.

15:04

🎬 Narration, Ads, Transforms & Final Verdict

This final paragraph describes narration and ad-style tests, plus a transformation/entertainment test, and closes with an overall recommendation. For the narration (Pennywise-style) sequence: Google Veo 3.1 produces a solid voice-over that changes scenes appropriately with the narration; Kling 2.6 fails to produce a consistent voice-over (the clown speaks inconsistently and switches tone mid-narration); Sora 2 could not generate this example in the author's attempts. Next the author tests a UGC-style ad (a woman holding an avocado skincare product): Google's generation includes product handling and a convincing voice but can look "plastic" and sometimes inserts odd sound effects; Kling 2.6 produces surprisingly realistic-looking output with a good voice; Sora 2 yields very realistic, polished results, though the author cautions that Sora's ability to render highly realistic people is exactly why it restricts usage (and why it sometimes blocks realistic references). The G-Wagon-to-Transformer transformation test shows Google Veo 3.1 delivering the best, most recognizable "Transformers"-style line and transformation pacing, while Kling's attempt looks low quality and Sora's attempt is weak. In the wrap-up the author says Kling 2.6 is "not bad": a strong contender, especially given its lower cost compared to Google Veo 3.1 and Sora 2, and recommends considering Kling 2.6 for workflows, urging viewers to try it via the Artlist link (sponsor mention). The paragraph ends with calls to action: join the author's community for prompts, watch a linked tutorial on combining tools like Nano Banana Pro, and try the models themselves.

Keywords

💡Kling 2.6

Kling 2.6 is the new AI video model being evaluated in the script and the main subject of the comparison. The presenter tests Kling 2.6 across dialogue, action, horror, and ad-style scenes to judge audio, motion, and prompt-following; for example, the narrator repeatedly generates scenes with Kling 2.6 (a woman vlogging in New York, a skateboard trick, a motorcycle turn) and comments on its strengths (prompt adherence, a convincing skateboard trick) and weaknesses (imperfect camera stabilization).

💡Google Veo 3.1

Google Veo 3.1 (transcribed in the script as "VO 3.1" or "V3.1") is another AI video generator used as a benchmark in the comparison. The script repeatedly praises its audio quality (the narrator calls its audio the "best" in some samples) and shows it performing well with realistic sounds like fridge noise or egg cracking; the kitchen anime-style example, for instance, is rated "9 out of 10" for sound design.

💡Sora 2

Sora 2 is the third video generator compared in the video; it is evaluated for image quality and realism but also criticized for strict content restrictions. The presenter often says Sora 2 produces strong visuals (he calls its images the best in one example) but notes that it blocks realistic-person image references and sometimes fails to generate when realistic reference images are used, which limits certain workflows.

💡AI video generator

An AI video generator is a model that converts text prompts (and sometimes images) into moving-image sequences with audio, camera moves, and visual detail. The whole video is framed as a hands-on comparison of three such generators, Kling 2.6, Google Veo 3.1, and Sora 2, testing how well each translates prompts into coherent scenes like vlogs, interviews, skateboard tricks, horror reveals, and product UGC.

💡native audio

Native audio means the model generates synchronized speech, sound effects, and environmental noise as part of the video (rather than requiring separate audio editing). The transcript emphasizes that Kling 2.6 added native audio, and the presenter compares how natural and contextually appropriate that audio is in several examples, for instance noting Veo 3.1's superior street ambience and Kling's well-matched audio for a skateboard trick.

💡text-to-video

Text-to-video is the workflow where the user types a descriptive prompt and the model generates a video matching that description. Throughout the script the creator uses text-to-video prompts (e.g., "A woman walking on a busy New York street..." or "skater approaches a handrail...") to compare how each model handles dialogue, camera motion, scene continuity, and sound design.
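
As a rough illustration of the same-prompt methodology the creator uses, the sketch below sends one text-to-video prompt to several generators and saves each clip for side-by-side review. The endpoint URLs, payload fields, and file names are hypothetical placeholders (not the real Kling, Veo, or Sora APIs); it is a minimal sketch of the workflow, not a definitive implementation.

    # Minimal sketch of a "same prompt, several models" comparison run.
    # The endpoints and payload shape are hypothetical placeholders; consult
    # each provider's actual API documentation before doing anything real.
    import json
    import urllib.request

    PROMPT = (
        "A woman walking on a busy New York street, vlogging to a handheld "
        "camera; ambient traffic and honking, natural speech, handheld motion."
    )

    # Hypothetical endpoints standing in for the three generators in the video.
    ENDPOINTS = {
        "kling-2.6": "https://api.example.com/kling/generate",
        "veo-3.1": "https://api.example.com/veo/generate",
        "sora-2": "https://api.example.com/sora/generate",
    }

    def generate(url: str, prompt: str) -> bytes:
        """Send the same text prompt to one generator and return the video bytes."""
        payload = json.dumps({"prompt": prompt, "duration_s": 8, "audio": "native"}).encode()
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    if __name__ == "__main__":
        for model, url in ENDPOINTS.items():
            video = generate(url, PROMPT)
            with open(f"{model}-vlog-test.mp4", "wb") as f:
                f.write(video)  # review the clips side by side, as in the video

The point of the structure is simply that every model receives an identical prompt, so differences in audio, motion, and image quality can be attributed to the model rather than the wording.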

💡prompt adherence

Prompt adherence refers to how faithfully a generated video follows the user's written instructions (shots, actions, audio cues, character lines). The narrator praises Kling for strong prompt adherence, saying it "followed my prompt" in the YouTube-intro example, while criticizing other models when they omit requested camera moves or change the intended narration style (e.g., Kling accidentally makes the clown speak instead of producing a voice-over).

💡realism (realistic people)

Realism here means how believable people, motion, and lighting appear in the generated footage. The script repeatedly tests realism, as in the motorcycle and skateboard scenes, and notes that Sora 2 often refuses or blocks realistic-person references, while Google Veo 3.1 and Kling 2.6 can render realistic characters, sometimes with anatomical or physics errors (e.g., an unnatural motorcycle lean or strange camera angles).

💡audio & sound design

Audio and sound design covers generated speech, environmental sounds, foley (footsteps, doors, eggs cracking), and SFX that make a scene convincing. The presenter treats audio as a major comparison axis, praising Google Veo 3.1's detailed ambience (cars honking, fridge noise) and egg sizzle, noting Sora's strong horror soundscape, and grading Kling's audio as improving but not always matching Google's richness.

💡camera movement / cinematography

Camera movement and cinematography refer to the simulated shot types (close-ups, dolly, whip pan, slow motion) and how well the model composes and transitions between them. The video's prompts request specific cinematography, such as fast angle changes in the UFC interview, a whip pan to reveal a ghost, and a dolly left/right for a room reveal, and the narrator evaluates whether each model executes those moves (for example, Google sometimes fails to perform the whip pan, while Kling often follows camera instructions but sometimes produces odd stabilization).
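
For illustration only, here is one way such camera directions could be structured as a shot list before being flattened into a single text prompt. The field names ("shot", "camera", "audio") and the helper function are hypothetical and not part of any of the three tools; it is just a sketch of how explicit cinematography cues can be spelled out per shot.

    # Hypothetical shot-list structure for spelling out camera directions in a prompt.
    # Field names are illustrative, not a real API schema for Kling, Veo, or Sora.
    SHOTS = [
        {"shot": "close-up of a frightened woman in a dim hallway",
         "camera": "slow push-in", "audio": "tense room tone"},
        {"shot": "reveal of a ghost standing behind her",
         "camera": "whip pan", "audio": "sharp sting"},
        {"shot": "empty room as she backs toward the door",
         "camera": "dolly left", "audio": "low drone"},
    ]

    def to_prompt(shots: list[dict]) -> str:
        """Flatten the shot list into one text-to-video prompt string."""
        lines = [
            f"Shot {i + 1}: {s['shot']}. Camera: {s['camera']}. Audio: {s['audio']}."
            for i, s in enumerate(shots)
        ]
        return " ".join(lines)

    print(to_prompt(SHOTS))

Writing the shot list this explicitly makes it easier to check, clip by clip, whether a model actually performed the whip pan or dolly it was asked for.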

💡UGC style / product ad

UGC (user-generated content) style refers to informal, selfie-like product videos used in ads, often handheld, direct-to-camera, and authentic-feeling. The presenter runs a handheld selfie prompt for an avocado product; Google generates a somewhat plastic result, Kling produces a more realistic-looking clip that sounds "surprisingly good," and Sora can produce very realistic-looking ad clips but is restricted when real-person image references are applied.

💡image reference restrictions

Image reference restrictions are content-moderation or policy limits that prevent a model from using real-person images or realistic likenesses as inputs. The transcript highlights that Sora 2 "shuts you down" when a realistic reference image is used (e.g., a realistic girl on a motorcycle), which frustrates the creator because it blocks certain legitimate use cases like matching a specific actor or subject in a scene.

💡hallucinations / artifacts

Hallucinations and artifacts are unintended or unrealistic results (weird object morphs, impossible physics, or nonsensical edits) generated by the model. The narrator points to examples such as the egg turning into glass and producing an endless egg in Kling's kitchen attempt, or odd slow-motion edits and strange landing positions in the skateboard sequences, as instances where the model's output doesn't match real-world expectations.

💡cost / pricing

Cost or pricing refers to how affordable a given AI service is for the creators compared in the video. The presenter explicitly says Kling 2.6 is "a lot cheaper" than Google Veo 3.1 and Sora 2, framing cost as a practical factor when choosing a tool, even if a cheaper model may require extra generations to reach the desired quality.

💡use-case fit / workflow

Use-case fit or workflow describes which model best serves particular creative tasks: vlogs, interviews, horror, product ads, or highly realistic character-driven scenes. The comparison demonstrates that each model has different strengths (Google: audio and some SFX; Sora: visuals but restricted inputs; Kling: prompt adherence and competitive motion), so creators should choose the tool that aligns with their desired end product and the constraints mentioned in the script.

Highlights

Kling 2.6 introduces native audio support for video generation, allowing for more realistic and immersive creations.

Kling 2.6 is compared to other tools like Sora 2 and Google Veo 3.1, with a particular focus on the quality of speech and audio.

Google Veo 3.1 provides high-quality audio with realistic speech, while Kling 2.6 still needs improvement in environmental sound such as honking cars.

Sora 2 provides convincing background noise and a realistic vlog feel, and its image quality leads the vlog test ahead of Kling 2.6.

Kling 2.6 shows promise in prompt adherence, with good animation and image generation, particularly in character movement.

Sora 2 faces issues with generating realistic images of people, especially with reference images of real humans.

Kling 2.6 outperforms both Google Veo 3.1 and Sora 2 in skateboard physics generation, offering a more realistic and smooth trick sequence.

In the motorcycle generation test, Kling 2.6 offers better motion and audio than Google Veo 3.1, though both tools still have flaws.

Kling 2.6 excels at generating horror scenes, with an eerie atmosphere and good sound design for supernatural settings.

Sora 2 does well with sound in a horror setting but fails in terms of visual consistency and camera movement.

For food preparation scenes, Google Veo 3.1 stands out with accurate sound effects like cracking eggs and fridge noises, while Kling 2.6 struggles with animation and sound realism.

In a storytelling scenario with Pennywise, Google Veo 3.1 shines with accurate narration, scene changes, and a menacing atmosphere.

Kling 2.6 fails to replicate the narration and camera-angle changes effectively in the Pennywise scene, making Google Veo 3.1 the winner.

Sora 2’s product advertisement generation is impressive, but it has limitations when generating realistic human characters with product images.

Kling 2.6 is cheaper than both Sora 2 and Google Veo 3.1 and is still a valuable addition to any AI video production toolkit, especially for budget-conscious users.