A few weeks back, a company called VideoSurf announced a new video search technology that blew me away when co-founder and CEO Lior Delgo pre-briefed me ahead of its debut at the TechCrunch 50 conference in September. Don’t get too excited — I say “blew me away” because, so far, the only thing in video search that has impressed me is Nexidia’s speech-to-phoneme technology. (It used to let you search the news clips on WXIA Atlanta, but it’s no longer there, which means either I just can’t find it or the broadcaster decided it wasn’t being used enough. Yikes!)
I raved about VideoSurf’s long-term potential to the LA Times. I should be clear: the beta site itself doesn’t threaten Google or even Blinkx as a way to find videos today. But the math behind it is what makes this technology worth paying attention to. VideoSurf doesn’t focus on speech-to-text as so many others have (including Google, as reported by ReelSEO).
Instead, VideoSurf processes the image itself to do a few things your brain does automatically but that computers are generally lame at.
First, it can match a face from one video to another. This means that if you are looking for videos of Milo Ventimiglia (not that you would, mind you, but you could), the search engine can find any video that has his face in it, even if the video metadata doesn’t mention Milo. That means you’re likely to pick up his image in more places than just the TV shows and movies he has appeared in (the Emmys, news clips, etc.).
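VideoSurf hasn’t said how its face matching works under the hood, but the general technique is well known: represent each detected face as a numeric “embedding” vector and call two faces the same person when their embeddings are close enough. Here’s a toy sketch of that idea; the tiny 3-number embeddings, the index, the 0.9 threshold, and the `videos_with_face` helper are all my own hypothetical illustration, not VideoSurf’s code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two face-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def videos_with_face(query_embedding, face_index, threshold=0.9):
    """Return IDs of every video containing at least one face embedding
    close enough to the query face -- metadata never enters into it."""
    return [video_id
            for video_id, embeddings in face_index.items()
            if any(cosine_similarity(query_embedding, e) >= threshold
                   for e in embeddings)]

# Toy index: one made-up 3-dimensional embedding per detected face.
# Real systems use hundreds of dimensions per face.
face_index = {
    "news_clip_07":  [(0.90, 0.10, 0.40)],
    "awards_show":   [(0.88, 0.12, 0.41), (0.10, 0.90, 0.20)],
    "cooking_video": [(0.20, 0.80, 0.10)],
}
```

Querying with a face embedding then surfaces videos whose metadata never mentions the actor: `videos_with_face((0.90, 0.10, 0.40), face_index)` returns `["news_clip_07", "awards_show"]`.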
The second thing is an extension of the first. By recognizing faces, VideoSurf can identify when a scene changes or when a significant event occurs inside a scene. Applying some simple rules about what percentage of the screen a face takes up can even tell you when the most dramatic moments of a show are. By parsing scenes this way, VideoSurf can give you a relatively intelligent way not only to search for a person, but to scan the video and find the scenes that person is in.
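Both ideas can be sketched with simple rules: treat a change in which faces are on screen as a scene boundary, and treat a face filling a large fraction of the frame (a close-up) as a dramatic moment. The frame sampling, the face IDs, the 25% close-up threshold, and both function names below are my assumptions, not VideoSurf’s actual rules:

```python
def scene_boundaries(frame_faces):
    """frame_faces: one set of recognized face IDs per sampled frame.
    Guess a scene boundary wherever the cast of visible faces changes
    between consecutive frames."""
    return [i for i in range(1, len(frame_faces))
            if frame_faces[i] != frame_faces[i - 1]]

def dramatic_frames(face_area_fractions, close_up=0.25):
    """face_area_fractions: per frame, the fraction of the screen filled
    by the largest detected face. A face big enough to read as a
    close-up is a crude proxy for a dramatic moment."""
    return [i for i, frac in enumerate(face_area_fractions)
            if frac >= close_up]

# Five sampled frames: Milo alone, then a two-shot, then Peter alone.
frames = [{"milo"}, {"milo"}, {"milo", "peter"}, {"peter"}, {"peter"}]
```

With that toy data, `scene_boundaries(frames)` returns `[2, 3]` (the two-shot starts, then Milo leaves), and `dramatic_frames([0.05, 0.30, 0.10, 0.40])` returns `[1, 3]`, the two close-ups.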
These are two important pieces of a much bigger puzzle. The future of video search will combine image recognition — extending beyond faces to objects and even abstractions like the pace or mood of a video — with speech-to-text and a whole bunch of viewer behaviors, eventually allowing you to pull exactly the shot you want or answer exactly the question you had. For example:
Search engine query: When did we get footage of grandma telling about how she and grandad met?
The result would be a three-minute clip of her telling the story, culled — using face recognition, speech-to-text, and metadata about where the clip was shot — from the hundreds of hours of video sitting on your home server, your cloud account with Google, or even the linked accounts you and your siblings share.
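No engine answers that query today, so the blending below is pure speculation on my part about how the three signals could be combined: a weighted sum of a face-recognition score, a speech-to-text transcript score (crude substring matching here), and a location-metadata score. The clip fields, the weights, and the `score_clip` helper are all invented for illustration:

```python
def score_clip(clip, wanted_faces, wanted_words, wanted_place,
               weights=(0.5, 0.3, 0.2)):
    """Blend three signals into one relevance score: who is on screen
    (face recognition), what is said (speech-to-text), and where the
    clip was shot (metadata). The weights are arbitrary."""
    w_face, w_speech, w_meta = weights
    face_score = len(wanted_faces & clip["faces"]) / max(len(wanted_faces), 1)
    transcript = clip["transcript"].lower()
    speech_score = (sum(word in transcript for word in wanted_words)
                    / max(len(wanted_words), 1))
    meta_score = 1.0 if clip.get("place") == wanted_place else 0.0
    return w_face * face_score + w_speech * speech_score + w_meta * meta_score

# A two-clip stand-in for hundreds of hours of home video.
home_movies = [
    {"id": "tape_12", "faces": {"grandma"},
     "transcript": "We met at a dance in 1952", "place": "living room"},
    {"id": "tape_31", "faces": {"kids"},
     "transcript": "Happy birthday to you", "place": "park"},
]

# Rank the archive for the grandma query and keep the best clip.
best = max(home_movies,
           key=lambda c: score_clip(c, {"grandma"}, ["met", "dance"],
                                    "living room"))
```

Here `best["id"]` comes out as `"tape_12"`: the clip with grandma’s face, the right words in the transcript, and the matching location wins on all three signals.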
The ultimate destination for this engine will be in the cloud (Google will host it) and in the OS (Microsoft and Apple will bake it in). It won’t be a website that people use to find celebrity videos. How many years away is this? Sooner than you think. It sits at the collision of YouTube video, online TV shows, and tapeless digital cameras. I give it 3-5 years before people like me can do it, 5+ years before average people can do it.