Video search, part 2 (or, The Plot Thickens)

October 22, 2008

A few weeks ago, I wrote about an interesting new video search application that debuted in September called VideoSurf. I mentioned that VideoSurf is a great piece of the overall video search puzzle, alongside some other obvious pieces like video speech-to-text search. But I didn’t raise the question of what should happen next in speech-to-text.

As if on cue, EveryZing (formerly PodZinger, but that was a long, long time ago in Internet time) announced the integration of the speech-to-text technology it uses in its video search engine optimization and site search tools into a consumer-facing video player it calls the EveryZing MetaPlayer. See the example that CEO Tom Wilde shared with me.


The text generated from the video shows up on the side of the video, allowing you to search through the text, then find the spots in the video that mention those words (a little yellow dot above the playline tells you where the word is mentioned so you can visualize where in the clip you are).
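The mechanics of this kind of transcript search can be sketched in a few lines. Here's a toy version in Python — the transcript format and function names are my own invention for illustration, not EveryZing's actual API:

```python
# Toy sketch of transcript-based video search: each transcript entry
# pairs a spoken word with the second at which it occurs in the clip.
from collections import defaultdict

def build_index(transcript):
    """Map each (lowercased) word to the timestamps where it is spoken."""
    index = defaultdict(list)
    for seconds, word in transcript:
        index[word.lower()].append(seconds)
    return index

def find_mentions(index, query):
    """Return the timestamps to mark on the playline for a query word."""
    return index.get(query.lower(), [])

# Made-up transcript data for a football clip.
transcript = [(12, "Romo"), (15, "touchdown"), (73, "Romo"), (90, "defense")]
index = build_index(transcript)
print(find_mentions(index, "romo"))  # every second at which "Romo" is spoken
```

Those returned timestamps are exactly what you'd need to draw the little yellow dots above the playline.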

As I mentioned in my last post, Nexidia has been doing a slight variant (speech-to-phoneme search), which has a lot of cool features such as multiple-language support. However, it hasn’t been implemented widely, leading me to speculate that it’s not as easy to integrate as one might hope. I suspect that’s what drove EveryZing to fashion its solution as a video player (one that handily integrates with whatever Flash-based player environment a company uses, like Brightcove or others). Providing a simple package like this may motivate more organizations like the Dallas Cowboys, a launch client of EveryZing, to try it out.

The extra cool part goes beyond text search: this system gives a video content provider a way to auto-tag video content — even content it doesn’t own but can embed from YouTube or elsewhere — thus matching it to the tagging systems it has already built. These tags can trigger ads or pull in relevant sidebar content (like player stats or stock information), all in an automated and scalable way.
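At its simplest, that auto-tagging step is just matching a clip's transcript against a site's existing tag vocabulary. A minimal sketch, with made-up tags and data (not EveryZing's actual method):

```python
def auto_tag(transcript_text, tag_vocabulary):
    """Return the site's existing tags whose names appear in a clip's transcript."""
    words = set(transcript_text.lower().split())
    return sorted(tag for tag in tag_vocabulary if tag.lower() in words)

# Hypothetical tag vocabulary a sports site might already maintain.
tags = {"Cowboys", "Romo", "injury", "playoffs"}
text = "Romo led the Cowboys on a late drive"
print(auto_tag(text, tags))  # ['Cowboys', 'Romo']
```

Once a clip carries those tags, the same machinery that already serves ads or sidebar content against hand-applied tags can fire against the automatic ones.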

We’re still not all the way there, and as I said before, this will eventually get built into the cloud (a la Google) or the OS (a la MSFT and Apple) or both, but I like what I see so far.

Maybe video search doesn’t have to stink after all.

Video Search, are we there yet?

October 3, 2008

A few weeks back, a company called VideoSurf announced a new video search technology that blew me away when co-founder and CEO Lior Delgo briefed me ahead of its debut at the TechCrunch 50 conference in September. Don’t get too excited — I say “blew me away” because, so far, the only thing in video search that has impressed me is Nexidia’s speech-to-phoneme technology (which used to let you search the news clips on WXIA Atlanta, but it’s no longer there, which means either I just can’t find it or the broadcaster decided it wasn’t being used enough. Yikes!).

[Embedded video: “VideoSurf | TechCrunch50 Conference 2008,” posted with Vodpod; no longer available.]

I raved about VideoSurf’s long-term potential to the LA Times. I should be clear: the beta site itself doesn’t threaten Google or even Blinkx as a way to find videos today. But the math behind it is what makes this technology worth paying attention to. VideoSurf doesn’t focus on speech-to-text as so many others have (including Google, as reported by ReelSEO).

Instead, VideoSurf processes the image itself to do a few things your brain does automatically but that computers are generally lame at.

First, it can match a face from one video to another. This means that if you are looking for videos of Milo Ventimiglia (not that you would, mind you, but you could), the search engine can find any video that has his face in it, even if the video metadata doesn’t mention Milo. That means you’re likely to pick up his image in more places than just the TV shows and movies he has appeared in (the Emmys, news clips, etc.).
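The standard way this kind of matching works is to reduce each detected face to a numeric "embedding" vector and compare vectors; two faces from the same person land close together. Here's a bare-bones sketch with made-up vectors — real systems produce the embeddings with a trained model, and the threshold here is arbitrary:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(emb1, emb2, threshold=0.9):
    """Guess whether two face embeddings belong to the same person."""
    return cosine_similarity(emb1, emb2) >= threshold

# Invented toy embeddings: a and b are "the same face," c is someone else.
emb_a = [0.12, 0.98, 0.33]
emb_b = [0.11, 0.97, 0.35]
emb_c = [0.90, 0.10, 0.05]
print(same_person(emb_a, emb_b))  # True
print(same_person(emb_a, emb_c))  # False
```

Run that comparison between a query face and every face detected across a video library, and you get exactly the metadata-free matching described above.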

The second thing is an extension of the first. By recognizing faces, VideoSurf can identify when a scene changes or when a significant event occurs inside a scene. Applying some simple rules about what percentage of the screen a face takes up can even tell you when the most dramatic moments of a show occur. By parsing scenes this way, VideoSurf can give you a relatively intelligent way not only to search for a person, but to scan the video and find the scenes that person is in.
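A rule of that kind is easy to picture in code. This sketch flags "close-up" moments where a face fills a big chunk of the frame; the data, threshold, and function names are all invented for illustration, not VideoSurf's actual rules:

```python
def face_screen_fraction(face_box, frame_size):
    """Fraction of the frame covered by a face bounding box (width, height)."""
    w, h = face_box
    fw, fh = frame_size
    return (w * h) / (fw * fh)

def dramatic_moments(frames, frame_size, close_up_threshold=0.15):
    """Timestamps where a face fills enough of the frame to count as a close-up."""
    return [t for t, box in frames
            if face_screen_fraction(box, frame_size) >= close_up_threshold]

# Invented detections: (timestamp in seconds, face bounding box in pixels).
frames = [(5, (100, 120)), (42, (600, 700)), (88, (200, 220))]
print(dramatic_moments(frames, (1280, 720)))  # [42] — only the big close-up
```

Scene-change detection works on the same principle: instead of asking how big a face is, you ask when the set of faces (or the overall image) changes abruptly between frames.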

These are two important pieces of a much bigger puzzle. The future of video search will combine image recognition — extending beyond faces to objects and even abstractions like the pace or mood of the video — with speech-to-text and a whole bunch of viewer behaviors, eventually allowing you to pull exactly the shot you want or answer exactly the question you had. For example:

Search engine query: When did we get footage of Grandma telling how she and Grandad met?

The result would be a three-minute clip of her telling the story, culled — using face recognition, speech-to-text, and metadata about where the clip was shot — from the hundreds of hours of video you have sitting on your home server, your cloud account with Google, or even the linked accounts you and your siblings all share.
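Structurally, answering a query like that is just intersecting the three signals. A toy sketch of the idea, with invented clip records (real systems would rank fuzzy matches rather than filter exactly):

```python
def find_clips(clips, face=None, spoken_word=None, location=None):
    """Filter a video library by who appears, what's said, and where it was shot.

    Note: the spoken-word check is a simple substring match on the
    transcript, so it can match inside longer words; fine for a sketch.
    """
    results = []
    for clip in clips:
        if face and face not in clip["faces"]:
            continue
        if spoken_word and spoken_word not in clip["transcript"]:
            continue
        if location and clip.get("location") != location:
            continue
        results.append(clip["title"])
    return results

# Invented home-video library: each clip carries all three kinds of metadata.
clips = [
    {"title": "Grandma's story", "faces": {"grandma"},
     "transcript": "how we met at the dance", "location": "home"},
    {"title": "Birthday", "faces": {"grandma", "grandad"},
     "transcript": "happy birthday", "location": "home"},
]
print(find_clips(clips, face="grandma", spoken_word="met"))  # ["Grandma's story"]
```

The hard parts, of course, are producing those face, transcript, and location fields automatically — which is exactly what the pieces above are starting to do.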

The ultimate destination for this engine will be in the cloud (Google will host it) and in the OS (Microsoft and Apple will bake it in). It won’t be a website that people use to find celebrity videos. How many years away is this? Sooner than you think. It sits in the collision between YouTube video and online TV shows and tapeless digital cameras. I give it 3-5 years before people like me can do it, 5+ years before average people can do it.