Sunday, June 28, 2015

Face Triangulation for Graphic Design

Low poly has recently emerged as a popular element in graphic design, creating sculpture- or crystal-like effects. Various movie posters use this technique to highlight characteristics of the main subject, e.g. the Star Wars poster shown below, and multiple mobile apps aim to make the effect more approachable for ordinary users.

Image Credit: Star Wars low poly portraits, designed by Vladan Filipovic

To convert a portrait photo to the low poly style, designers employ different tools and emphasize different aspects. Among all these variants, a typical production process involves three steps: 1. place vertices (usually 500~5000) at suitable locations in the image; 2. generate a mesh from the vertices using Delaunay triangulation; 3. fill each triangle with a color fused (e.g. averaged) from the area it covers.
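For concreteness, steps 2 and 3 can be sketched with off-the-shelf tools like scipy and OpenCV (a minimal illustration of my own, assuming the vertex array comes from step 1):

    import cv2
    import numpy as np
    from scipy.spatial import Delaunay

    def low_poly(image, points):
        # points: (N, 2) array of (x, y) vertex locations from step 1
        tri = Delaunay(points)                      # step 2: Delaunay triangulation
        out = np.zeros_like(image)
        for simplex in tri.simplices:               # step 3: fill each triangle
            corners = points[simplex].astype(np.int32)
            mask = np.zeros(image.shape[:2], np.uint8)
            cv2.fillConvexPoly(mask, corners, 1)
            color = cv2.mean(image, mask=mask)[:3]  # fuse colors by averaging
            cv2.fillConvexPoly(out, corners, color)
        return out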

The last two steps are quite straightforward, and there are several off-the-shelf tools for them. What takes the most effort, and exhibits unique aesthetic taste, is the first step, i.e. choosing the right locations in the image to place vertices. So in this post, we attempt to automate this vertex-selection step by drawing on computer vision approaches, especially from the face triangulation domain.


Generally, two principles define a good set of vertices: 1. vertices should be distributed more or less uniformly, so that no generated triangle is sheared too dramatically; 2. vertices should be placed more densely on the boundaries of semantic parts and textured details, so that unions of triangles correspond to semantic parts. In this sense, the problem is cast as seeking a proper probability density map for sampling.

Here we advocate combining face alignment and boundary detection to produce the desired density map. Dense facial landmarks place more emphasis on semantic facial parts (e.g. eyebrows, eyes, nose, mouth) while edge maps preserve object contours and fine details in the image. My lovely Alice and two dear professors were taken as models; the final effect is illustrated after the sketch below.
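A minimal sketch of building such a density map and sampling vertices from it, using a Canny edge map as a stand-in for the boundary detector and taking precomputed facial landmarks (e.g. from an off-the-shelf face alignment model) as given:

    import cv2
    import numpy as np

    def sample_vertices(image, landmarks, n_points=2000, edge_weight=4.0):
        # Density map = uniform component + blurred edge map, so vertices are
        # roughly uniform overall but denser around contours and details.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 50, 150).astype(np.float64)
        edges = cv2.GaussianBlur(edges, (9, 9), 0)   # spread mass around contours
        density = 1.0 + edge_weight * edges / (edges.max() + 1e-8)
        density /= density.sum()
        idx = np.random.choice(density.size, size=n_points,
                               replace=False, p=density.ravel())
        ys, xs = np.unravel_index(idx, density.shape)
        pts = np.column_stack([xs, ys])
        # Landmarks anchor semantic facial parts; in practice also add the four
        # image corners so the triangulation covers the whole frame.
        return np.vstack([pts, landmarks]) if len(landmarks) else pts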


Image Credit: Prof. Xiaoou Tang & Prof. Xiaogang Wang

The current results look surreal but pleasing. Future improvements could be made in two directions: 1. devise a post-processing method to re-split and re-merge triangles according to a desired scale distribution; 2. form an end-to-end system that learns point correspondences, much like Pointer Networks.

Friday, January 30, 2015

Extrapolating Paintings by PatchMatch

PatchMatch is a popular image editing technique for establishing dense correspondences within or across images. A group of Cambridge students used it to extrapolate classic paintings (e.g. Van Gogh's 'Starry Night') beyond their frames, with fantastic results.

The key ingredients of PatchMatch are two insights into natural image statistics, particularly in the space of patch offsets: 1. natural images contain abundant similar patches (Non-local Means shares this spirit), so a large number of random samples yields a good initialization; 2. neighboring patches often have similar offset vectors, which enables propagating offset vectors among neighboring patches and leads to an efficient iterative solution.
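A minimal grayscale sketch of the resulting scheme (random initialization, scan-order propagation, and a shrinking random search; the real algorithm is more refined):

    import numpy as np

    def patchmatch(A, B, patch=7, iters=5, rng=np.random.default_rng(0)):
        # A, B: float grayscale arrays. Returns nnf[y, x] = (by, bx), the
        # top-left corner of B's patch matching A's patch at (y, x).
        h, w = A.shape[0] - patch + 1, A.shape[1] - patch + 1
        hb, wb = B.shape[0] - patch + 1, B.shape[1] - patch + 1

        def dist(ay, ax, by, bx):
            d = A[ay:ay+patch, ax:ax+patch] - B[by:by+patch, bx:bx+patch]
            return float(np.sum(d * d))

        # Insight 1: abundant similar patches -> random init is a fair start.
        nnf = np.stack([rng.integers(0, hb, (h, w)),
                        rng.integers(0, wb, (h, w))], axis=-1)
        cost = np.array([[dist(y, x, *nnf[y, x]) for x in range(w)]
                         for y in range(h)])

        for it in range(iters):
            step = 1 if it % 2 == 0 else -1          # alternate scan order
            ys = range(h) if step == 1 else range(h - 1, -1, -1)
            for y in ys:
                xs = range(w) if step == 1 else range(w - 1, -1, -1)
                for x in xs:
                    # Insight 2: neighbors share offsets -> propagate them.
                    for dy, dx in ((step, 0), (0, step)):
                        py, px = y - dy, x - dx
                        if 0 <= py < h and 0 <= px < w:
                            by = int(np.clip(nnf[py, px, 0] + dy, 0, hb - 1))
                            bx = int(np.clip(nnf[py, px, 1] + dx, 0, wb - 1))
                            c = dist(y, x, by, bx)
                            if c < cost[y, x]:
                                nnf[y, x], cost[y, x] = (by, bx), c
                    r = max(hb, wb)                   # shrinking random search
                    while r >= 1:
                        by = int(np.clip(nnf[y, x, 0] + rng.integers(-r, r + 1), 0, hb - 1))
                        bx = int(np.clip(nnf[y, x, 1] + rng.integers(-r, r + 1), 0, wb - 1))
                        c = dist(y, x, by, bx)
                        if c < cost[y, x]:
                            nnf[y, x], cost[y, x] = (by, bx), c
                        r //= 2
        return nnf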

I have also tried this useful algorithm on one of Jimmy Liao's hand drawings and an ancient Chinese painting. The extrapolation effects are illustrated as follows:



Both results are visually plausible overall. Inspecting local regions and details further shows that: 1. this category of methods (like PatchMatch) works well on smooth regions and repetitive patterns, and can also maintain local semantic coherence; 2. global semantic coherence still can't be handled, which leaves room for interdisciplinary research between low-level image processing and high-level visual reasoning.

Wednesday, December 31, 2014

An Interesting Perspective towards Machine Reasoning

Recently, I came across an interesting article exploring potential directions for future machine learning research, "From Machine Learning to Machine Reasoning". Its main theme is a plausible definition of "reasoning": "algebraically manipulating previously acquired knowledge in order to answer a new question".

It is an interesting perspective because it explains how representation learning, transfer learning and multi-task learning could help construct practical machine learning systems for computer vision and natural language processing. Below is an example from the paper of training a face recognition system:

Figure 1 in the paper "From Machine Learning to Machine Reasoning"

If we consider trained models (either for the underlying task or for related tasks) as previously acquired knowledge, then the article actually advocates a sequential construction style that re-uses the learned representations (of samples or of categories). In this sense, the recent Dark Knowledge and FitNets work shares a similar spirit in the realm of neural networks.
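As a minimal illustration of the Dark Knowledge idea (my own sketch, not code from any of the papers): a student model is trained against the teacher's temperature-softened outputs in addition to the hard labels, so the teacher's acquired knowledge is re-used.

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Blend hard-label cross-entropy with cross-entropy against the
        # teacher's softened distribution (the "dark knowledge").
        soft_t = softmax(teacher_logits, T)
        soft_s = softmax(student_logits, T)
        hard_s = softmax(student_logits)
        n = len(labels)
        hard_ce = -np.log(hard_s[np.arange(n), labels] + 1e-12).mean()
        soft_ce = -(soft_t * np.log(soft_s + 1e-12)).sum(axis=-1).mean()
        return alpha * hard_ce + (1 - alpha) * (T * T) * soft_ce  # T^2 rescales the soft term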

Saturday, December 20, 2014

The Secret of Bokeh

Bokeh, the technique of deliberately creating out-of-focus areas in photos, has been widely used by photographers to emphasize semantic subjects. These subjects can lie in either the foreground or the background. An example of portrait bokeh is shown below:


Photograph by Di Liu

The bokeh effect is usually produced by particular lenses on SLR cameras. However, mobile cameras have become ubiquitous and are now a common choice for recording daily life. So the question arises: how can we design algorithms that generate the bokeh effect from a well-focused photo shot on a mobile device?

In fact, this functionality (creating a semantically out-of-focus photo from a well-focused one) is a well-studied topic in computational photography. A typical algorithm comprises two stages: depth map estimation and selective blurring.

We would first like to obtain a depth map of the underlying photo in order to distinguish foreground from background. But estimating a depth map from a single image has long been known to be difficult. A practical solution is to leverage the information in an image sequence (burst mode on mobile devices) so that a multi-view stereo problem can be formed. Google's Lens Blur adopts this practice, and they wrote a research blog elaborating the techniques. BTW, burst images can also be utilized for denoising.

After that, we can perform adaptive per-pixel blurring according to the computed depth map, or adopt more advanced techniques to enable motion blur rendering. To make the final result more seamless and natural, the rendering can be done in a multi-scale manner, e.g. with local Laplacian pyramids. The same pipeline can also be applied to generate 3D-effect images from burst images, as demonstrated on this website. A crude sketch of the selective blurring stage follows.
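This sketch (my own simplification, not Lens Blur's renderer) blurs pixels in proportion to their distance from the focal depth by blending a small stack of pre-blurred images, assuming the depth map is normalized to [0, 1]:

    import cv2
    import numpy as np

    def naive_bokeh(image, depth, focus_depth, dof=0.1, max_sigma=8.0):
        # Blur strength grows with |depth - focus_depth| outside the dof band.
        blur_amount = np.clip((np.abs(depth - focus_depth) - dof) / (1 - dof), 0, 1)
        sigmas = [0.0, max_sigma / 4, max_sigma / 2, max_sigma]
        stack = [image.astype(np.float64)] + \
                [cv2.GaussianBlur(image, (0, 0), s).astype(np.float64) for s in sigmas[1:]]
        # Fractional index into the blur stack, linearly blending between levels.
        level = blur_amount * (len(stack) - 1)
        lo = np.floor(level).astype(int)
        hi = np.minimum(lo + 1, len(stack) - 1)
        frac = (level - lo)[..., None]
        out = np.zeros_like(stack[0])
        for i in range(len(stack)):
            out += np.where((lo == i)[..., None], (1 - frac) * stack[i], 0)
            out += np.where((hi == i)[..., None], frac * stack[i], 0)
        return out.astype(image.dtype)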

Wednesday, September 19, 2012

Analogy among PageRank Algorithm in Search Engine, Graph Transduction in Image Retrieval and Graph-based Visual Saliency

"A mathematician is a person who can find analogies between theorems; a better mathematician is one who can see analogies between proofs; and the best mathematician can notice analogies between theories. One can imagine that the ultimate mathematician is one who can see analogies between analogies. "
                                                                                                                                  ------ Stefan Banach

It is often fruitful to find analogies between algorithms in different fields. Here, I will talk about Random Walks and Markov Chains, especially their applications in three different fields: Search Engines, Image Retrieval and Visual Saliency.


1. Background Knowledge about Random Walk and Markov Chain

  • Markov Chain: the random walk defined by a stochastic matrix;
  • Chapman-Kolmogorov Equation: P(k)P(n-k) = P(n), where P(n) = P^n denotes the n-step transition matrix;
  • Stochastic Matrix: a square matrix of non-negative entries, each of whose rows adds to 1;
  • Probability Vector: a vector of non-negative entries that add to 1; a stochastic matrix is thus a square matrix whose rows are probability vectors;
  • Limiting Distribution: if p(n) = p(0)P^n converges to a limit vector p(∞) as n → ∞, that vector is called the limiting distribution of the Markov chain;
  • Stationary Distribution: a probability vector x is a stationary distribution (or equilibrium vector, steady-state vector, or fixed point) for a Markov chain with transition matrix P if and only if p(n) = x ⇒ p(n+1) = x for any and all n. In words, once the distribution reaches a stationary distribution, it never changes;
  • Transition Matrix: for a graph walk, the adjacency matrix of the graph tells which vertices you can move to.
We can represent the collection of all possible data sets as the vertices of a graph, with moves between data sets as edges. Choosing moves at random gives us a random walk on the graph.
Given a graph, what is the limiting distribution of the walk on that graph? How does the limiting distribution depend on the structure of the graph? If the limiting distribution is not uniform, is it possible to alter the graph walk to get a uniform distribution? How quickly does the walk reach its limiting distribution, and how does the convergence rate depend on the structure of the graph?
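To make these questions concrete, here is a tiny toy experiment of my own: repeatedly applying the transition matrix to any starting distribution reveals the limiting distribution of the walk, when it exists.

    import numpy as np

    # Random walk on a 4-node graph: from each node, move to a uniformly
    # chosen neighbor. Rows of P are probability vectors (row-stochastic).
    A = np.array([[0, 1, 1, 0],     # adjacency matrix of the graph
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    P = A / A.sum(axis=1, keepdims=True)

    p = np.array([1.0, 0.0, 0.0, 0.0])   # start at node 0: p(0)
    for _ in range(100):                  # p(n) = p(0)P^n
        p = p @ P
    print(p)  # limiting distribution is proportional to node degree: [0.2, 0.3, 0.3, 0.2]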

2. PageRank Algorithm in Search Engine


PageRank (PR) is a term used by Google. In simple terms, PageRank measures how your site ranks in the Google search engine. So what is the basic algorithm, and what does it mean to you? The algorithm, with damping factor d and C(T) the number of outgoing links on page T:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

This may remind you of the chicken and the egg: to determine the importance of a page, we first need to know the importance of all the pages linking to it. However, we may recast the problem into one that is more mathematically familiar.

Here's how the PageRank is determined. Suppose that page Pj has lj links. If one of those links is to page Pi, then Pj will pass on 1/lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to Pi by Bi, then

I(Pi) = Σ I(Pj)/lj, where the sum runs over all pages Pj in Bi.

Let's first create a matrix H, called the hyperlink matrix, in which the entry in the i-th row and j-th column is

H(i,j) = 1/lj if Pj is in Bi, and H(i,j) = 0 otherwise.

We will also form a vector I whose components are the PageRanks, i.e. the importance rankings, of all the pages. The condition above defining the PageRank may then be expressed as

I = HI

In other words, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call this a stationary vector of H.

Notice that H has some special properties. First, its entries are all nonnegative. Also, the sum of the entries in a column is one unless the page corresponding to that column has no links. Matrices in which all the entries are nonnegative and the sum of the entries in every column is one are called stochastic; they will play an important role in our story. A more detailed discussion about the PageRank algorithm can be found here.
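In code, PageRank can be approximated by power iteration on the column-stochastic hyperlink matrix. The sketch below uses the normalized teleport term (1 - d)/n, a common variant of the formula above, and ignores dangling nodes:

    import numpy as np

    def pagerank(H, d=0.85, iters=100):
        # Power iteration: I converges to the dominant eigenvector of the
        # damped matrix, i.e. the stationary vector of the random surfer.
        n = H.shape[0]
        I = np.full(n, 1.0 / n)
        for _ in range(iters):
            I = (1 - d) / n + d * (H @ I)
        return I

    # Three pages: page 0 links to 1 and 2; page 1 links to 2; page 2 links to 0.
    H = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])
    print(pagerank(H))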

3. Graph Transduction in Image Retrieval 


Given a set of objects X = {x1, ..., xn} and a similarity function sim: X×X→R+ that assigns a positive similarity value to each pair of objects, assume that x1 is a query object (e.g. a query shape) and {x2, ..., xn} is a set of known database objects (or a training set). Then, by sorting the values sim(x1, xi) in decreasing order for i = 2, ..., n, we obtain a ranking of the database objects according to their similarity to the query. A critical issue is then to learn a faithful sim. A schematic diagram is shown below.


Let w(i,j) = sim(xi, xj), for i, j = 1, ..., n, be the similarity matrix, and obtain an n×n probabilistic transition matrix P by row-wise normalizing w. A new similarity measure s is computed based on P. Since s is defined as the similarity of the other elements to the query x1, we denote f(xi) = s(x1, xi) for i = 1, ..., n. The key function f expresses the similarity of xi to the query x1 as a weighted average over all other database objects, where the weights sum to one and are proportional to the similarity of the other database objects to xi:

f(xi) = P(i,1)f(x1) + P(i,2)f(x2) + ... + P(i,n)f(xn)

We can see that the formulations in PageRank and Graph Transduction are quite similar, because the mathematical foundation behind both is the random walk and Markov chain. A more detailed discussion about Graph Transduction in Image Retrieval can be found here.
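A minimal sketch of this iteration, with the query's own similarity clamped to 1 (my own simplification; the actual paper adds refinements such as restricting the propagation to neighborhoods):

    import numpy as np

    def graph_transduction(sim, query=0, iters=1000):
        # sim: n x n pairwise similarity matrix w(i,j).
        # Iterate f(xi) <- sum_j P(i,j) f(xj), keeping the query fixed.
        P = sim / sim.sum(axis=1, keepdims=True)   # row-wise normalization
        f = sim[:, query].astype(float)            # initialize with raw similarity
        for _ in range(iters):
            f = P @ f
            f[query] = 1.0                         # clamp the query
        return f                                   # learned similarities to the query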

4. Graph-based Visual Saliency 


Following an analysis similar to the PageRank algorithm, we can calculate the stationary probability of each node being visited during a random walk. If a cluster of nodes differs from its surroundings, the probability of arriving at those nodes will be very small; such a cluster therefore forms a visually salient region. Results of Graph-based Visual Saliency are shown below:


Previous visual saliency research often utilizes the contrast between center and surround, both globally and locally, and edge cues are also sometimes introduced. This method, in contrast, treats every pixel (or superpixel, or patch) as a node in a graph. The edges between nodes represent the similarity between the two nodes, and the similarity measure can consist of color, texture and so on; an adjacency matrix is thus obtained. A more detailed discussion about Graph-based Visual Saliency can be found here. A toy version of this idea is sketched below.
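This sketch (a simplification of my own, not the exact GBVS formulation) uses grid patches as nodes, color similarities as edge weights, and the inverse stationary probability as saliency, since rarely visited nodes are the distinctive ones:

    import cv2
    import numpy as np
    from scipy.spatial.distance import cdist

    def random_walk_saliency(image, grid=32):
        h, w = image.shape[:2]
        # Nodes: non-overlapping grid patches, described by downsampled colors.
        patches = [cv2.resize(image[y:y+grid, x:x+grid], (8, 8)).ravel().astype(float)
                   for y in range(0, h - grid + 1, grid)
                   for x in range(0, w - grid + 1, grid)]
        F = np.array(patches)
        d2 = cdist(F, F, 'sqeuclidean')
        W = np.exp(-d2 / (d2.mean() + 1e-8))       # similarity edge weights
        P = W / W.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
        pi = np.full(len(F), 1.0 / len(F))
        for _ in range(200):                        # power iteration to stationarity
            pi = pi @ P
        sal = 1.0 / (pi + 1e-12)                   # rarely visited nodes are salient
        return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)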

Sunday, September 16, 2012

Ten Questions on Computer Vision from VALSE

At the 2nd Vision And Learning Seminar (VALSE), held by the Chinese Academy of Sciences, participants raised ten questions on Computer Vision after free discussion. I have translated and summarized them as follows:

1. Does Computer Vision have a real, solid theoretical foundation? Apart from Marr's imperfect theory of visual computation, what kind of theoretical foundation do we need to establish now?

2. What are the ultimate questions in the Computer Vision field, which should be high-level enough? Consider some ultimate questions in Computer Science for reference: What is computable? Does P = NP? What is intelligence? What is information? (How) can we build complex systems simply?

3. In the Computer Vision and Machine Learning fields, which is king: math or data? Is big data the ultimate solution to Computer Vision problems?

4. Are biologically inspired models the hope of the computer vision community?

5. What are the cans and can'ts of sparse representation? How far can it go? What else can be done and what is the killer application?

6. How far can local features go in visual recognition problems?

7. Ridiculous Computer Vision reviewers? LeCun's anger in his CVPR 2012 rebuttal: what has happened to the evaluation system in the computer vision community, and how can we improve it?

8. The challenge from industry: does industry hold the best technology? Does industry's occupation of specific resources, especially big data and big platforms, form an invisible academic barrier?

9. What has been the biggest progress in the Computer Vision and Machine Learning communities over the past ten years? What are the top 5 advances? The top 10?

10. Compared with researchers in Europe and America, what are the gap and the hope of the Chinese Computer Vision community?

Monday, September 10, 2012

What We Talk About When We Talk About Object

In a CVPR 2010 paper, Vittorio Ferrari raised the question "What is an object?", i.e., how can we define a measure of objectness that is generic over classes? I think it is an interesting question that not only relates to generic object detection, but also sheds light on saliency/attention analysis and segmentation tasks.

First, we give a brief introduction to Ferrari's generic object detection method. Ferrari argues that any object has at least one of three distinctive characteristics:
  • a well-defined closed boundary in space;
  • a different appearance from their surroundings;
  • sometimes it is unique within the image and stands out as salient.
According to these three characteristics, Ferrari utilizes four image cues in the objectness measure: Multi-scale Saliency (MS), Color Contrast (CC), Edge Density (ED) and Superpixels Straddling (SS). All the parameters of these cues are learned in a Bayesian framework. Below are some generic object detection results using Ferrari's method, after a quick sketch of the Color Contrast cue:
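This is a rough sketch of my own, not the paper's implementation (the paper learns the surround size and combines all four cues in the Bayesian framework): the Color Contrast cue scores a window by the dissimilarity between its color histogram and that of its surrounding ring.

    import cv2
    import numpy as np

    def color_contrast(image, box, surround=0.5):
        # box = (x, y, w, h); score = chi-square distance between the color
        # histograms of the window and of its surrounding ring.
        x, y, w, h = box
        dx, dy = int(w * surround), int(h * surround)
        H, W = image.shape[:2]
        x0, y0 = max(x - dx, 0), max(y - dy, 0)
        x1, y1 = min(x + w + dx, W), min(y + h + dy, H)
        ring = np.ones((y1 - y0, x1 - x0), np.uint8)
        ring[y - y0:y - y0 + h, x - x0:x - x0 + w] = 0   # cut out the window

        def hist(img, mask=None):
            hst = cv2.calcHist([img], [0, 1, 2], mask, [8, 8, 8],
                               [0, 256, 0, 256, 0, 256]).ravel()
            return hst / (hst.sum() + 1e-12)

        hw = hist(image[y:y+h, x:x+w])
        hs = hist(image[y0:y1, x0:x1], ring)
        return 0.5 * ((hw - hs) ** 2 / (hw + hs + 1e-12)).sum()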


2. Improvements under a weakly supervised manner:

Another of Ferrari's papers, "Figure-ground segmentation by transferring window masks", furthers the discussion of generic object detection by introducing transfer learning.

The intuition behind the weakly supervised method is that similar windows often have similar segmentation masks. Therefore we can rely on nearest neighbors in the appearance space to transfer segmentation masks. The appearance model consists of two Gaussian mixture models (GMMs), one for the foreground and one for the background. Some segmentation results as well as their masks are shown below, after a quick sketch of the appearance model:
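A sketch of the appearance-model stage (my own, using scikit-learn's GaussianMixture), assuming mask transfer has already produced foreground and background color samples:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def foreground_probability(image, fg_pixels, bg_pixels, k=5):
        # fg_pixels, bg_pixels: (N, 3) color samples from transferred masks.
        fg = GaussianMixture(k).fit(fg_pixels)   # foreground appearance GMM
        bg = GaussianMixture(k).fit(bg_pixels)   # background appearance GMM
        flat = image.reshape(-1, 3).astype(float)
        log_fg = fg.score_samples(flat)          # per-pixel log-likelihoods
        log_bg = bg.score_samples(flat)
        prob = 1.0 / (1.0 + np.exp(log_bg - log_fg))   # posterior, equal priors
        return prob.reshape(image.shape[:2])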
