My recent projects are:
Today's video conferencing systems have had limited success. While this is due to many factors, we think they have at least the following technical deficiencies: poor camera viewpoints, insufficient resolution, and inaccurate speaker detection. The RingCam system is designed to overcome some of these problems. See our paper here. Any teleconferencing system needs to know how many people are in the meeting, where they are, and who is talking. These are exactly the problems I am attacking now. Specifically, I am interested in real-time multi-person tracking, sound source localization, sensor fusion, and speaker timeline clustering.
|
|
Tracking multiple people reliably in real time in a meeting is not easy. For example, the tracker needs
to be initialized automatically, lighting conditions can change, and people can occlude one another. We tackle
this problem with a statistical tool: particle filters (PF). For linear Gaussian systems, the Kalman filter
is the well-known, elegant solution. Unfortunately, in real life we have to handle non-linear,
non-Gaussian systems, and the PF is a tool for exactly that. It uses a set of weighted particles (samples) to approximate
the posterior probability. We have advanced this technique on several fronts:
|
|
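The weighted-particle idea behind the tracker can be illustrated with a minimal sketch. This is a generic bootstrap particle filter for a one-dimensional position, not the tracker used in our system. The random-walk motion model, the Gaussian observation model, and every name and parameter value (`pf_step`, `track`, `motion_std`, `obs_std`) are assumptions made for illustration only.

```python
import math
import random


def pf_step(particles, weights, observation, rng, motion_std=1.0, obs_std=2.0):
    """One predict / update / resample cycle of a bootstrap particle filter."""
    n = len(particles)
    # Predict: move every particle with an assumed random-walk motion model.
    particles = [x + rng.gauss(0.0, motion_std) for x in particles]
    # Update: reweight each particle by an assumed Gaussian observation likelihood.
    # The tiny constant keeps the weights from all collapsing to zero.
    weights = [
        w * math.exp(-0.5 * ((observation - x) / obs_std) ** 2) + 1e-300
        for w, x in zip(weights, particles)
    ]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample when the effective sample size drops below half the particle count.
    ess = 1.0 / sum(w * w for w in weights)
    if ess < n / 2:
        particles = rng.choices(particles, weights=weights, k=n)
        weights = [1.0 / n] * n
    return particles, weights


def track(observations, n_particles=500, seed=0):
    """Return the posterior-mean position estimate after each observation."""
    rng = random.Random(seed)
    # Automatic initialization: spread particles over the whole (assumed) range.
    particles = [rng.uniform(-50.0, 50.0) for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    estimates = []
    for z in observations:
        particles, weights = pf_step(particles, weights, z, rng)
        estimates.append(sum(x * w for x, w in zip(particles, weights)))
    return estimates
```

Calling `track([10.0] * 30)` feeds the filter thirty identical measurements, and the estimates should settle near 10. Because the posterior is carried by samples rather than by a mean and covariance, the same loop works unchanged when the motion or observation model is non-linear or non-Gaussian.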
Knowing where people are is one thing; knowing who is currently talking is another. The goal of sound source localization (SSL) is to determine where a sound is coming from with respect to a microphone array. One of the most successful SSL techniques is based on time delay estimation. If two microphones are located at A and B, and the sound source is at C, the sound will reach A and B at slightly different times. From this slight difference (the time delay), we can figure out the sound location. I am preparing a draft of the full technique. But if you cannot wait, take a look at Section 5 of this paper first. |
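The time-delay idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the technique from the paper. The delay is found by brute-force cross-correlation, and the conversion to an angle assumes a far-field source and a single microphone pair. The function names, the 343 m/s speed of sound, and the example spacing and sample rate are all hypothetical choices.

```python
import math


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a.

    The lag is chosen to maximize the cross-correlation of the two signals.
    """
    n = len(sig_a)
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(delay_samples, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Convert a delay into an angle (degrees from broadside), far-field assumption."""
    tau = delay_samples / sample_rate
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    s = max(-1.0, min(1.0, tau * speed_of_sound / mic_spacing))
    return math.degrees(math.asin(s))
```

A positive delay means the sound reached microphone A first, so the source lies on A's side of the array. With a single pair the delay only constrains the direction, which is one reason practical systems use an array of several microphones.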
|
Given rapid improvements in storage devices, network infrastructure and streaming-media technologies, a large number of corporations and universities are recording lectures and making them available online for anytime, anywhere access. However, producing high-quality lecture videos is still labor intensive and expensive. Fortunately, recent technology advances are making it feasible to build automated camera management systems to capture lectures. In this paper we report our design of such a system, including system configuration, audio-visual tracking techniques, software architecture, and user study. Motivated by the different roles in a professional video production team, we have developed a multi-cinematographer, single-director camera management system. The system performs lecturer tracking, audience tracking, and video editing fully automatically, and offers quality close to that of human-operated systems. See paper . |
|
For teleconferencing and meeting recording, we are not limited to the RingCam. We can also use a parabolic mirror coupled with a high-resolution camera sensor: Omnicam. |
|
Tired of watching a three-hour baseball game, but do not want to miss any highlights? This project is the answer. We focus on detecting highlights using audio-track features alone, without relying on expensive-to-compute video-track features. We use a combination of generic sports features and baseball-specific features to obtain our results, but believe that many other sports offer the same opportunity and that the techniques presented here will apply to them. We present details on the relative performance of various learning algorithms, and a probabilistic framework for combining multiple sources of information. We compare the output of our algorithms against human-selected highlights for a diverse collection of baseball games, with very encouraging results. See paper . |
| This is a very nice generalization of relevance feedback techniques from the text-based document retrieval domain to the content-based image retrieval domain. For a detailed description, please see the related publications. |
|
|
|
But images are more complicated than text. That is why there is an old saying: "An image is worth a thousand words". The major complexity comes from the subjectivity of human perception of image content, so we need a better model to deal with it. My old team at UIUC was one of the earliest teams in the nation to look into this problem. This system is an effective interactive image retrieval system that supports relevance feedback at the feature, representation, and vector levels. The user's initial query is dynamically refined so that the new query better approximates the user's information need. For a more complete description, please see the related publications . |
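To give a feel for how a query can be refined from user feedback, here is a minimal sketch of the classic Rocchio update from text retrieval, applied to image feature vectors. It is a stand-in for illustration and not the multi-level method used in our system. The function name, the `alpha`, `beta`, and `gamma` weights, and the idea of representing each image as a plain list of numbers are all assumptions.

```python
def rocchio_refine(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward images marked relevant and away from the rest.

    query, and every item in relevant / non_relevant, is a feature vector
    (a list of floats of equal length), e.g. a hypothetical color histogram.
    """
    dim = len(query)

    def centroid(vectors):
        # An empty feedback set contributes nothing to the update.
        if not vectors:
            return [0.0] * dim
        return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

    pos = centroid(relevant)
    neg = centroid(non_relevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
```

Each round of feedback produces a new query vector, which is then used to rank the image database again. Repeating this loop is what lets the query drift toward what the user actually has in mind.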
What is even more important in the video domain is that the ToC and the index
should be inter-related. For a long, continuous medium such as
video, such a "back and forth" mechanism between browsing and retrieval
is crucial. Video library users may have to browse the video first
before they know what to retrieve. Conversely, once they have retrieved
some video objects, those objects guide them to browse the video in the
right direction.
For a more detailed description, please see the related publications .
Copyright © 1995-1999