<p><em>zhenhuan-yang.github.io</em>, the personal academic website of Dr. Zhenhuan Yang.</p>
<h1 id="leveraging-volatility-in-trading">Leveraging Volatility in Trading</h1>
<p>2022-05-25, <a href="https://zhenhuan-yang.github.io/leveraging-volatility">https://zhenhuan-yang.github.io/leveraging-volatility</a></p>
<p>VIX is a volatility index that measures the volatility of S&P 500 Index returns <a href="https://www.investopedia.com/terms/i/iv.asp#toc-what-is-implied-volatility-iv">implied</a> over the next 30 days and priced into S&P 500 options. The VIX gauges the level of fear or stress in the stock market and hence enjoys a vivid nickname, the “Fear Index.”</p>
<h2 id="how-to-compute-vix">How to compute VIX?</h2>
<p>The components of the VIX Index are near- and next-term <a href="https://www.investopedia.com/options-basics-tutorial-4583012#toc-what-are-options">put and call options</a> with more than 23 days and less than 37 days to expiration. They are computed in the same way, hence we focus on the generalized formula. The time to expiration is given by the following expression:</p>
\[T = (M_c + M_s + M_o) / M_{365}\]
<p>where $M_c$ is the number of minutes remaining until midnight of the current day; $M_s$ is the number of minutes from midnight until 9:30 am ET for “standard” SPX expirations, or from midnight until 4:00 pm ET for “weekly” SPX expirations; $M_o$ is the total number of minutes in the days between the current day and the expiration day; and $M_{365}$ is the number of minutes in a 365-day year.</p>
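As a rough illustration of this bookkeeping, the sketch below computes $T$ for a hypothetical near-term contract settling at 9:30 am ET; the dates are made up and this is not CBOE's official code.

```python
# Sketch of T = (M_c + M_s + M_o) / M_365 with made-up dates; illustrative only.
from datetime import datetime, timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # M_365

def time_to_expiration(now, settlement):
    next_midnight = datetime.combine(now.date() + timedelta(days=1), datetime.min.time())
    m_current = (next_midnight - now).total_seconds() / 60         # M_c: until midnight today
    day_start = datetime.combine(settlement.date(), datetime.min.time())
    m_settlement = (settlement - day_start).total_seconds() / 60   # M_s: midnight to settlement
    m_other = max((settlement.date() - now.date()).days - 1, 0) * 24 * 60  # M_o: full days between
    return (m_current + m_settlement + m_other) / MINUTES_PER_YEAR

# e.g. from 9:46 am ET today to a "standard" 9:30 am ET settlement 25 days out
T1 = time_to_expiration(datetime(2022, 5, 3, 9, 46), datetime(2022, 5, 28, 9, 30))
```

With these dates, $M_c = 854$, $M_s = 570$ and $M_o = 24 \times 1440$ minutes, so $T_1 = 35984/525600$.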
<p>The risk-free interest rates, $R_1$ and $R_2$, are yields based on <a href="https://www.investopedia.com/terms/c/cmtindex.asp">U.S. Treasury yield curve rates</a> (commonly referred to as “Constant Maturity Treasury” rates or CMTs), to which a <a href="https://mathworld.wolfram.com/CubicSpline.html">cubic spline</a> is applied to derive yields on the expiration dates of relevant SPX options. As such, the VIX Index calculation may use different risk-free interest rates for near- and next-term options.</p>
<p>The selected options are <a href="https://www.investopedia.com/ask/answers/042715/what-difference-between-money-and-out-money.asp">out-of-the-money</a> SPX calls and out-of-the-money SPX puts centered around an at-the-money <a href="https://www.investopedia.com/terms/s/strikeprice.asp#toc-what-is-a-strike-price">strike</a> price, $K_0$. Only SPX options quoted with non-zero bid prices are used in the VIX Index calculation. For each contract month, one determines the forward SPX level, $F$, by identifying the strike price at which the absolute difference between the call and put prices is smallest:</p>
\[F = \text{Strike Price} + e^{RT} \times (\text{Call Price} - \text{Put Price}).\]
<p>Next, one determines $K_0$, the strike price equal to or otherwise immediately below the forward index level $F$, for the near- and next-term options. Then one selects out-of-the-money put options with strike prices $< K_0$: start with the put strike immediately below $K_0$ and move to successively lower strike prices, excluding any put option that has a bid price equal to zero (i.e., no bid). Likewise, one selects out-of-the-money call options with strike prices $> K_0$: start with the call strike immediately above $K_0$ and move to successively higher strike prices, excluding call options that have a bid price of zero.</p>
<p>Finally, select both the put and call with strike price $K_0$. The VIX Index uses the midpoint of quoted bid and ask prices for each option selected. The $K_0$ put and call prices are averaged to produce a single value.</p>
<p>Now we are ready to apply the generalized formula of the raw VIX index</p>
\[\sigma^2 = \frac{2}{T} \sum_{i}\frac{\Delta K_i}{K_i^2} e^{RT}Q(K_i) - \frac{1}{T}\Big(\frac{F}{K_0}-1\Big)^2\]
<p>where $\sigma = \text{VIX}/100$, $T$ is the time to expiration, $F$ is the forward index level derived from index option prices, $K_0$ is the first strike below the forward index level $F$, $K_i$ is the strike price of the $i$th out-of-the-money option (a call if $K_i > K_0$, a put if $K_i < K_0$, and both the put and call if $K_i = K_0$), $\Delta K_i$ is the interval between strike prices, half the difference between the strikes on either side of $K_i$: $\Delta K_i = \frac{K_{i+1} - K_{i-1}}{2}$, $R$ is the risk-free interest rate to expiration, and $Q(K_i)$ is the midpoint of the bid-ask spread for each option with strike $K_i$.</p>
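To make the summation concrete, here is a minimal sketch with invented strikes and mid-quotes (the numbers are illustrative, not real SPX data); for the two edge strikes, $\Delta K$ is taken as the gap to the single adjacent strike.

```python
# Sketch of sigma^2 = (2/T) * sum_i dK_i/K_i^2 * e^{RT} * Q(K_i) - (1/T)*(F/K_0 - 1)^2
# with invented strikes and quotes; not real SPX data.
import math

def raw_variance(strikes, quotes, K0, F, R, T):
    total = 0.0
    n = len(strikes)
    for i in range(n):
        if i == 0:
            dK = strikes[1] - strikes[0]                 # edge strike: gap to its neighbor
        elif i == n - 1:
            dK = strikes[-1] - strikes[-2]
        else:
            dK = (strikes[i + 1] - strikes[i - 1]) / 2   # half the gap between neighbors
        total += dK / strikes[i] ** 2 * math.exp(R * T) * quotes[i]
    return 2 / T * total - (F / K0 - 1) ** 2 / T

sigma2 = raw_variance([90, 100, 110], [2.0, 3.0, 1.5], K0=100, F=100.5, R=0.0, T=30 / 365)
```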
<p>Going through the above process, one calculates $\sigma_1^2$ and $\sigma_2^2$ for the near- and next-term options, respectively, takes their 30-day weighted average, then takes the square root of that value and multiplies by 100 to get the VIX Index value.</p>
\[\text{VIX} = 100 \times \sqrt{\Big(T_1\sigma_1^2 \frac{M_{T_2} - M_{30}}{M_{T_2} - M_{T_1}} + T_2\sigma_2^2 \frac{M_{30} - M_{T_1}}{M_{T_2} - M_{T_1}}\Big) \times \frac{M_{365}}{M_{30}}}\]
<p>where $M_{T_1}$ is the number of minutes to settlement of the near-term options; $M_{T_2}$ is the number of minutes to settlement of the next-term options; $M_{30}$ is the number of minutes in 30 days and $M_{365}$ is the number of minutes in a 365-day year.</p>
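The blending step itself is a small amount of arithmetic; below is a sketch using the minute counts from the formula above. As a sanity check, a flat variance term structure $\sigma_1^2 = \sigma_2^2 = 0.04$ should yield a VIX of exactly 20 regardless of the expiration dates, since the weights interpolate the total variance to exactly 30 days.

```python
# Sketch of the final 30-day interpolation; T1, T2 in years, variances from the
# near- and next-term option strips.
import math

M_30 = 30 * 24 * 60
M_365 = 365 * 24 * 60

def vix(sigma1_sq, sigma2_sq, T1, T2):
    m_t1, m_t2 = T1 * M_365, T2 * M_365       # minutes to each settlement
    w1 = (m_t2 - M_30) / (m_t2 - m_t1)
    w2 = (M_30 - m_t1) / (m_t2 - m_t1)
    blended = (T1 * sigma1_sq * w1 + T2 * sigma2_sq * w2) * M_365 / M_30
    return 100 * math.sqrt(blended)

v = vix(0.04, 0.04, 25 / 365, 32 / 365)   # flat term structure -> 20.0
```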
<p>The inclusion of SPX Weeklys in the VIX Index calculation means that the near-term options will always have more than 23 days to expiration and the next-term options always have less than 37 days to expiration, so the resulting VIX Index value will always reflect an interpolation of $\sigma_1^2$ and $\sigma_2^2$ ; i.e., each individual weight is less than or equal to 1 and the sum of the weights equals 1.</p>
<p>A concrete example is given in the <a href="https://cdn.cboe.com/resources/vix/vixwhite.pdf">CBOE white paper</a> or by the <a href="https://www.optionseducation.org/referencelibrary/white-papers/page-assets/vixwhite.aspx">OIC</a>.</p>
<p>For those interested in what the number mathematically represents, here it is in the simplest terms: the VIX represents the annualized one-standard-deviation $\pm$ percentage move of the S&P 500 Index. For example, if the VIX is currently at 15, then based on the option premiums in the S&P 500 index, the S&P is expected to stay within a $\pm 15\%$ range over 1 year, $68\%$ of the time (which represents one standard deviation).</p>
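As a back-of-the-envelope conversion (my own arithmetic, not from the CBOE paper), the annual one-sigma range can be scaled down to an expected 30-day move by dividing by $\sqrt{12}$, assuming independent monthly returns:

```python
# Back-of-the-envelope reading of a VIX level; the sqrt(12) scaling assumes
# independent monthly returns, an illustrative simplification.
import math

vix_level = 15.0                                   # quoted VIX level
annual_one_sigma = vix_level                       # +/- percent over one year, one std dev
monthly_one_sigma = vix_level / math.sqrt(12)      # ~4.33 percent expected 30-day move
```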
<h2 id="what-does-vix-signalmean">What does VIX signal/mean?</h2>
<p>The VIX is a good indicator of the expectation of market volatility. This is a very important point; it is just a general assumption based on the <a href="https://www.investopedia.com/terms/o/option-premium.asp#toc-what-is-an-option-premium">premiums</a> investors are willing to pay for the right to buy or sell stock.</p>
<p>This premium in options can be loosely defined as risk. Just like other forms of insurance, the greater the risk, the higher the premiums, and the lower the risk, the lower the premiums. When option premiums fall the VIX falls, and when premiums rise the VIX rises. Buyers and sellers move option prices: more buyers and the premiums go up, more sellers and the premiums go down.</p>
<p>A high VIX means that traders expect the underlying futures index, in this case the S&P 500, to see choppy trading going forward. When the VIX is rising, this means that traders are paying more for the average monthly S&P 500 put or call option. Traders pay more because they expect bigger price swings.</p>
<p>A low VIX means that traders aren’t willing to pay too much for put and call options on the S&P 500. This usually happens during periods of quiet market trading where the S&P 500 forms small average trading ranges each day. During this sort of period, traders aren’t willing to pay much for protection against market swings, and thus prices of S&P 500 calls and puts are muted.</p>
<p>The above explanation can be found <a href="https://seekingalpha.com/article/4493104-cboe-volatility-index-vix">here</a>.</p>
<h2 id="empiricalhistorical-chart-of-vix">Empirical/Historical chart of VIX</h2>
<p>Historically, the VIX tends to settle in the 15-25 range during an average market. A VIX between 0-15 usually indicates optimism in the market and very low volatility. However, if the VIX falls too low it reflects complacency, and that is dangerous according to the shoeshine boy theory: if everyone is bullish, there are no buyers left and the market comes tumbling down. A VIX between 25-30 indicates some market turbulence, with volatility increasing and investor confidence likely declining. A VIX over 30 typically indicates some extreme swings in the market coming up.</p>
<p>In general, the VIX exhibits a spiking, <a href="https://www.investopedia.com/terms/m/meanreversion.asp">mean-reverting</a> pattern. A high VIX is often associated with a falling stock market, since stock investors generally prefer bull markets. The highest VIX readings are seen during major market panics such as the 2008 Financial Crisis and the Covid-19 shock of March 2020. Sometimes the VIX also spikes ahead of upcoming events that could lead to market turmoil, such as presidential elections.</p>
<h2 id="how-traders-can-leverage-vix">How traders can leverage VIX</h2>
<h3 id="use-vix-as-a-signal">Use VIX as a signal</h3>
<p>The VIX serves as a market timing signal. Since the VIX is a mean-reverting index that doesn’t trend over time, certain levels tend to hold importance. Some traders like to buy stocks when the VIX hits historically high numbers such as 30 or 40. Likewise, when the VIX hits unusually low levels such as 10 or 12, it might be a good time to take profits on the stock market.</p>
<h3 id="trade-vix-future-and-options">Trade VIX futures and options</h3>
<p>CBOE offers VIX options (VIX, VIXW) and futures (VX01 through VX53), enabling investors to trade volatility independent of the direction or the level of stock prices. Notable examples include the story of <a href="https://www.yahoo.com/entertainment/mystery-trader-50-cent-made-133000978.html">“50 Cent”</a> and the <a href="https://www.amazon.com/When-Genius-Failed-Long-Term-Management/dp/0375758259">LTCM fiasco</a>.</p>
<p>With the rise of <a href="https://www.investopedia.com/terms/e/etf.asp">ETFs</a> and <a href="https://www.investopedia.com/terms/e/etn.asp">ETNs</a>, there are products that go long VIX futures (e.g. VXX, UVXY) and products that short them (e.g. XIV). Such ETFs do not track the VIX exactly, and <a href="https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation">“correlation does not imply causation”</a>. We will introduce them in later posts.</p>
<h1 id="how-to-understand-the-epsilon-value-in-differential-privacy">How to understand the $\epsilon$ value in differential privacy</h1>
<p>2020-07-23, <a href="https://zhenhuan-yang.github.io/privacy-loss">https://zhenhuan-yang.github.io/privacy-loss</a></p>
<p>After reading so many papers about the application of differential privacy to machine learning, a natural question was raised in my mind:</p>
\[\textit{"So my algorithm is $\epsilon$-DP, but what does it mean?"}\]
<p>In fact the answer to this question is not that hard if you know some basic probability theory - Bayes’ rule to be more specific.</p>
<p>We imagine ourselves as an attacker trying to figure out whether someone (the target) is in the database $S$, i.e. whether $S = S_{\text{yes}}$ or $S = S_{\text{no}}$. Let’s be the strongest attacker we can think of: we know the entire database except for the target. Our prior guess satisfies</p>
\[\mathbb{P}[S = S_{\text{yes}}] = 1 - \mathbb{P}[S = S_{\text{no}}].\]
<p>Now the $\epsilon$-DP algorithm $A$ returns an output $O$, and we compare the prior $\mathbb{P}[S = S_{\text{yes}}]$ with the posterior $\mathbb{P}[S = S_{\text{yes}} \mid A(S) \in O]$. By Bayes’ rule, we know</p>
\[\mathbb{P}[S = S_{\text{yes}}|A(S) \in O] = \frac{\mathbb{P}[S = S_{\text{yes}}] \cdot \mathbb{P}[A(S) \in O|S = S_{\text{yes}}]}{\mathbb{P}[A(S) \in O]}.\]
<p>Dividing by the analogous expression for $S = S_{\text{no}}$, we know</p>
\[\frac{\mathbb{P}[S = S_{\text{yes}}|A(S) \in O]}{\mathbb{P}[S = S_{\text{no}}|A(S) \in O]} = \frac{\mathbb{P}[S = S_{\text{yes}}]}{\mathbb{P}[S = S_{\text{no}}]} \cdot \frac{\mathbb{P}[A(S_{\text{yes}}) \in O]}{\mathbb{P}[A(S_{\text{no}}) \in O]}.\]
<p>Now by the DP constraint, we know</p>
\[e^{-\epsilon}\leq \frac{\mathbb{P}[A(S_{\text{yes}}) \in O]}{\mathbb{P}[A(S_{\text{no}}) \in O]} \leq e^\epsilon.\]
<p>Plugging it into the formula above, we know</p>
\[e^{-\epsilon} \cdot \frac{\mathbb{P}[S = S_{\text{yes}}]}{\mathbb{P}[S = S_{\text{no}}]} \leq \frac{\mathbb{P}[S = S_{\text{yes}}|A(S) \in O]}{\mathbb{P}[S = S_{\text{no}}|A(S) \in O]} \leq e^\epsilon \cdot \frac{\mathbb{P}[S = S_{\text{yes}}]}{\mathbb{P}[S = S_{\text{no}}]}.\]
<p>Solving for $\mathbb{P}[S = S_{\text{yes}} \mid A(S) \in O]$, we know</p>
\[\frac{\mathbb{P}[S = S_{\text{yes}}]}{e^\epsilon + (1 - e^\epsilon)\cdot \mathbb{P}[S = S_{\text{yes}}]} \leq \mathbb{P}[S = S_{\text{yes}}|A(S) \in O] \leq \frac{e^\epsilon \cdot \mathbb{P}[S = S_{\text{yes}}]}{1 + (e^\epsilon - 1) \cdot \mathbb{P}[S = S_{\text{yes}}]}.\]
<p>For example, if the prior is $50\%$ and $\epsilon = 1.1$, then the posterior is between $25\%$ and $75\%$.</p>
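To put numbers on this guarantee, a tiny helper (my own, not from any DP library) evaluates the two bounds:

```python
# Evaluate the posterior bounds implied by the epsilon-DP constraint above.
import math

def posterior_bounds(prior, eps):
    e = math.exp(eps)
    lo = prior / (e + (1 - e) * prior)        # lower bound on the posterior
    hi = e * prior / (1 + (e - 1) * prior)    # upper bound on the posterior
    return lo, hi

lo, hi = posterior_bounds(0.5, 1.1)   # roughly (0.25, 0.75)
```

Note how weakly the bounds constrain a confident prior: with a prior of $90\%$ the upper bound is already above $96\%$.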
<p>Reading the book by <a href="https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf">Dwork and Roth</a> will give you a deeper understanding of what differential privacy guarantees. We will omit it here.</p>
<h1 id="everything-about-gans-model-training-and-beyond">Everything about GANs: Model, Training and Beyond</h1>
<p>2019-07-26, <a href="https://zhenhuan-yang.github.io/everything-about-gan">https://zhenhuan-yang.github.io/everything-about-gan</a></p>
<h2 id="the-original-definition">The original definition</h2>
<p>The framework of generative adversarial networks was first introduced in <a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets">Generative Adversarial Nets</a> by I. Goodfellow et al.:</p>
\[\min_G\max_D J(G,D) = \mathbb{E}_{x \sim p_{data}(x)}[\log(D(x))] + \mathbb{E}_{z\sim p(z)}[\log(1 - D(G(z)))]\]
<p>One insight into this loss is from the viewpoint of cross-entropy. The cross-entropy between two distributions $p$ and $q$ is defined as</p>
\[H(p,q) = - \sum_{i} p_i \log(q_i)\]
<p>where $p$ and $q$ denote the true and estimated distributions, respectively. For a single point $(x_i, y_i)$, we define its true distribution as $\mathbb{P}[y_i = 0] = 1$ if $y_i = 0$ and $\mathbb{P}[y_i = 1] = 1$ if $y_i = 1$. Written as a vector, it is either $[0,1]$ or $[1,0]$. Thus, for a data point $(x,y)$, we have the following loss for $D$:</p>
\[H((x,y),D) = -y\log(D(x)) - (1 - y)\log(1-D(x)).\]
<p>In the case of GANs, $x_i$ comes from two sources: either $x_i \sim p_{data}$ (labeled $y_i = 1$) or $x_i = G(z_i)$ with $z_i \sim p_z$ (labeled $y_i = 0$). Summing the cross-entropy loss over both sources recovers the objective above. In addition, we want the discriminator to be unable to tell where $x_i$ comes from, which at the optimum corresponds to $D(x_i) = \frac{1}{2}$.</p>
<p>So far we have specified the cost function $J^{(D)}$ for only the discriminator. A complete specification of the game requires that we specify a cost function also for the generator. The simplest version of the game is a zero-sum game</p>
\[J^{(G)} = -J^{(D)}\]
<p>We can summarize the entire game with a value function specifying the discriminator’s payoff</p>
\[V(D,G) = -J^{(D)}\]
<p>Zero-sum games are also called minimax games because their solution involves minimization in an outer loop and maximization in an inner loop</p>
\[G^* = \arg\min_G\max_D V(G,D).\]
<p>Another insight is the viewpoint of Jensen-Shannon divergence between the data and the model distribution. The KL (Kullback–Leibler) divergence measures how one probability distribution $p$ diverges from a second expected probability distribution $q$.</p>
\[D_ {KL}(p || q) = \int_x p(x)\log(\frac{p(x)}{q(x)})\mathrm{d}x.\]
<p>It is noticeable according to the formula that KL divergence is asymmetric. JS (Jensen–Shannon) Divergence is another measure of similarity between two probability distributions, which is symmetric.</p>
\[D_{JS}(p||q) = \frac{1}{2}D_{KL}(p||\frac{p+q}{2}) + \frac{1}{2}D_{KL}(q||\frac{p+q}{2})\]
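For intuition, the two divergences can be evaluated on small discrete distributions (the vectors below are arbitrary illustrative choices):

```python
# KL is asymmetric, JS is symmetric; checked on two toy discrete distributions.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])
# kl(p, q) and kl(q, p) differ, while js(p, q) == js(q, p)
```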
<p>Let’s first examine what is the best value for $D$.</p>
\[V(D,G) = \int_x p_{data}(x)\log(D(x))\mathrm{d}x + \int_z p_{z}(z) \log(1 - D(G(z)))\mathrm{d}z \\ = \int_x p_{data}(x)\log(D(x))+ p_{gen}(x) \log(1 - D(x))\mathrm{d}x.\]
<p>The best value of the discriminator is achieved at $D^* (x) = \frac{p_{data}(x)}{p_{data}(x) + p_{gen}(x)}$. Once the generator is trained to its optimum, $p_{gen}$ gets very close to $p_{data}$ and $D^* (x) = \frac{1}{2}$, so the loss function becomes $V(D^* ,G) = \int_x p_{data}(x)\log(\frac{1}{2})+ p_{gen}(x) \log(1 - \frac{1}{2})\mathrm{d}x = -\log 4$. Adding and subtracting this value in $V(D^* ,G)$ for a general $G$, we obtain</p>
\[V(D^* ,G) = -\log 4 + D_{KL}(p_{data}||\frac{p_{data} + p_{gen}}{2}) + D_{KL}(p_{gen}||\frac{p_{data} + p_{gen}}{2})\\ = -\log 4 + 2D_{JS}(p_{data}||p_{gen})\]
<p>Since the JS divergence between two distributions is always non-negative, and zero iff they are equal, we have shown that $V(D^* ,G) = -\log 4$ is the global minimum of $V(D^* ,G)$ and that the only solution is $p_{gen} = p_{data}$.</p>
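This identity can be sanity-checked numerically, with toy discrete distributions standing in for $p_{data}$ and $p_{gen}$ (an assumption for illustration; the real densities are continuous):

```python
# Check V(D*, G) = -log 4 + 2 * JS(p_data || p_gen) on discrete toy distributions.
import numpy as np

p_data = np.array([0.6, 0.3, 0.1])
p_gen = np.array([0.2, 0.3, 0.5])

d_star = p_data / (p_data + p_gen)   # optimal discriminator on each atom
v = float(np.sum(p_data * np.log(d_star) + p_gen * np.log(1 - d_star)))

m = (p_data + p_gen) / 2
js = 0.5 * float(np.sum(p_data * np.log(p_data / m))) \
   + 0.5 * float(np.sum(p_gen * np.log(p_gen / m)))
# v agrees with -log(4) + 2 * js up to floating point
```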
<p>The last insight is the connection between GANs and other generative models. For generative models, the simplest and most classical principle is maximum likelihood. The basic idea of maximum likelihood is to define a model that provides an estimate of a probability distribution, parameterized by parameters $\theta$; the MLE is given by</p>
\[\theta^{MLE} = \arg\max \prod_{i=1}^n p_{model}(x_i|\theta)\\ = \arg\min \frac{1}{n}\sum_{i=1}^n -\log p_{model}(x_i|\theta).\]
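As a concrete instance of this principle, the sketch below fits the mean of a Gaussian model with known unit variance on synthetic data, by grid-searching the average negative log-likelihood; the minimizer lands on the sample mean, as the closed-form MLE predicts.

```python
# Minimizing the average negative log-likelihood of N(theta, 1) over a grid;
# synthetic data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

thetas = np.linspace(0, 4, 4001)
# average NLL of N(theta, 1), up to an additive constant: mean((x - theta)^2) / 2
nll = np.array([np.mean((x - t) ** 2) / 2 for t in thetas])
theta_mle = thetas[np.argmin(nll)]   # lands on (the grid point nearest) x.mean()
```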
<p>By the strong law of large numbers,</p>
\[\frac{1}{n}\sum_{i=1}^n -\log p_{model}(x_i|\theta) \rightarrow \mathbb{E}_ {p_{data}}[-\log p_{model}(x|\theta)]\]
<p>Hence we can think of maximum likelihood as trying to minimize</p>
\[\mathbb{E}_ {p_{data}}[-\log p_{model}(x|\theta)]\]
<p>Here</p>
\[p_{data} = p_{model}(x|\theta^* )\]
<p>i.e. the data is generated by the true parameter. Furthermore, the minimization is equivalent to</p>
\[\mathbb{E}[\log p_{model}(x|\theta^* )-\log p_{model}(x|\theta)] = \mathbb{E}[\log(\frac{p_{model}(x|\theta^* )}{p_{model}(x|\theta)})] = D_{KL}(p_{model}(x|\theta^* )|| p_{model}(x|\theta)) \geq 0.\]
<p>Also note that the last inequality becomes equality if and only if</p>
\[p_{model}(x|\theta^* ) = p_{model}(x|\theta).\]
<p>This is because</p>
\[D_{KL}(p|| q) = \mathbb{E}_ p[\log(\frac{p}{q})] = - \mathbb{E}_ p[\log(\frac{q}{p})] \geq -\log\mathbb{E}_ p[\frac{q}{p}] = 0\]
<p>where Jensen’s inequality holds with equality if and only if $p = q$. In practice, we don’t have access to</p>
\[p_{model}(x|\theta^* ) = p_{data}(x)\]
<p>but only the empirical distribution $\hat{p}_ {data}$.</p>
<p>Minimizing the KL divergence between $\hat{p}_ {data}$ and $p_{model}$ is exactly equivalent to maximizing the log-likelihood of the training set.</p>
<p>In explicit density models, $p_{model}$ is explicit; the variational autoencoder (VAE) is one example. GANs fall in the category of implicit density models, where $p_{model}$ is implicit.</p>
<p><img src="/assets/images/KL.png" alt="image" /></p>
<p>Unlike maximum likelihood, reverse KL tends to learn the mode. Here we show an example of a distribution over one-dimensional data $x$. In this example, we use a mixture of two Gaussians as the data distribution, and a single Gaussian as the model family. Because a single Gaussian can not capture the true data distribution, the choice of divergence determines the tradeoff that the model makes.</p>
<h2 id="gan-problem">GAN problem</h2>
<p>GAN is based on the zero-sum non-cooperative game. In short, if one wins the other loses. A zero-sum game is also called minimax. Your opponent wants to maximize its actions and your actions are to minimize them. In game theory, the GAN model converges when the discriminator and the generator reach a Nash equilibrium.</p>
<p>Since both sides want to undermine the other, a Nash equilibrium happens when one player will not change its action regardless of what the opponent may do. Consider two players $A$ and $B$ who control the values of $x$ and $y$, respectively. Player $A$ wants to maximize the value $xy$ while $B$ wants to minimize it.</p>
\[\min_B\max_A V(A,B) = xy\]
<p>The Nash equilibrium is $x=y=0$. We update the parameter $x$ and $y$ based on the gradient of the value function $V$.</p>
<p><img src="/assets/images/xy.png" alt="image" /></p>
<p>Our example is an excellent showcase that some cost functions will not converge with gradient descent, in particular for a non-convex game.</p>
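The failure is easy to reproduce: simultaneous gradient steps on $V = xy$ rotate around the equilibrium and slowly spiral outward rather than converging (a minimal simulation, with a made-up learning rate).

```python
# Player A does gradient ascent on x, player B gradient descent on y, for V = x*y.
x, y = 1.0, 1.0
lr = 0.1
radii = []
for _ in range(200):
    gx, gy = y, x                       # dV/dx = y, dV/dy = x
    x, y = x + lr * gx, y - lr * gy     # simultaneous updates
    radii.append(x * x + y * y)
# squared distance from the Nash equilibrium (0, 0) grows every step:
# each update multiplies x^2 + y^2 by exactly (1 + lr^2)
```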
<p>It is possible that the minimax objective cannot provide sufficient gradient for $G$ to learn well in practice. Generally speaking, $G$ is poor early in learning and its samples are clearly different from the training data. Therefore, $D$ can reject the generated samples with high confidence. In this situation, $\log (1 − D (G (z)))$ saturates. I. Goodfellow et al. suggest using the loss</p>
\[J^{(G)} = \mathbb{E}_ {z \sim p(z)}[-\log(D(G(z)))] = \mathbb{E}_ {x \sim p_{gen}(x)}[-\log(D(x))].\]
<p>In the minimax game, the generator minimizes the log-probability of the discriminator being correct. In this game, the generator maximizes the log-probability of the discriminator being mistaken. This new objective function results in the same fixed point of the dynamics of $D$ and $G$ but provides much larger gradients early in learning. However, the non-saturating game has other problems, such as an unstable numerical gradient for training $G$. With optimal $D^∗$, we have</p>
\[\mathbb{E}_ {x \sim p_{gen}(x)}[-\log(D^* (x))] + \mathbb{E}_ {x \sim p_{gen}(x)}[\log(1 - D^* (x))] = D_{KL}(p_{gen}||p_{data}).\]
<p>Recall that</p>
\[\mathbb{E}_ {x \sim p_{data}(x)}[\log(D^* (x))] + \mathbb{E}_ {x \sim p_{gen}(x)}[\log(1 - D^* (x))] = 2D_{JS}(p_{gen}||p_{data}) - \log 4.\]
<p>Therefore</p>
\[\mathbb{E}_ {x \sim p_{gen}(x)}[-\log(D^* (x))] = D_{KL}(p_{gen}||p_{data}) - 2D_{JS}(p_{gen}||p_{data}) + \mathbb{E}_ {x \sim p_{data}(x)}[\log(D^* (x))] + \log 4.\]
<p>We can see that the optimization of the alternative $G$ loss in the non-saturating game is contradictory: the first term aims to make the divergence between the generated distribution and the real distribution as small as possible, while the second term aims to make the divergence between these two distributions as large as possible due to the negative sign. This brings an unstable numerical gradient for training $G$. Furthermore, KL divergence is not a symmetric quantity, which is reflected in the following two limits: if $p_{data} \rightarrow 0$ and $p_{gen} \rightarrow 1$ we have $D_{KL}(p_{gen}||p_{data}) \rightarrow \infty$; if $p_{data} \rightarrow 1$ and $p_{gen} \rightarrow 0$ we have $D_{KL}(p_{gen}||p_{data}) \rightarrow 0$.</p>
<p>The penalizations for the two kinds of errors made by $G$ are completely different. The first error is that $G$ produces implausible samples, and the penalization is rather large; the second error is that $G$ fails to produce some real samples, and the penalization is quite small. In other words, the first error is that the generated samples are inaccurate, while the second is that the generated samples are not diverse enough. Because of this, $G$ prefers producing repeated but safe samples rather than taking the risk of producing different but unsafe samples, which leads to the mode collapse problem.</p>
<p>There are many divergences that GAN variants can approximate. In particular, we might like to be able to do maximum likelihood learning with GANs. Under the assumption that the discriminator is optimal, minimizing</p>
\[J^{(G)} = \mathbb{E}_ {z\sim p_z}[-\exp(\sigma^{-1}(D(G(z))))]\]
<p>where $\sigma$ is the logistic sigmoid function, is equivalent to minimizing</p>
\[D_{KL}(p_{data}||p_{gen}).\]
<p>The proof is as follows. We wish to find a function $f$ such that the expected gradient of</p>
\[J^{(G)} = \mathbb{E}_ {x\sim p_{gen}(x)}[f(x)]\]
<p>is equal to the expected gradient of</p>
\[D_{KL}(p_{data}||p_{gen}).\]
<p>First we take the derivative of the KL divergence with respect to a parameter $\theta$</p>
\[\frac{\partial}{\partial \theta} D_{KL}(p_{data}||p_{gen}) = - \mathbb{E}_ {x\sim p_{data}}\frac{\partial}{\partial \theta} \log p_{gen}(x).\]
<p>We now want to find the $f$ that will make the derivatives of $J^{(G)}$ match above. We begin by taking the derivatives of $J^{(G)}$</p>
\[\frac{\partial}{\partial \theta} J^{(G)} = \frac{\partial}{\partial \theta}\mathbb{E}_ {x \sim p_{gen}} f(x) = \int_x f(x) \frac{\partial}{\partial \theta} p_{gen}(x) \mathrm{d}x = \int_x f(x) p_{gen}(x) \frac{\partial}{\partial \theta} \log p_{gen}(x) \mathrm{d}x\]
<p>where the last identity is by the derivative of $\log$, and we assume we can use Leibniz’s rule to exchange the order of differentiation and integration.</p>
<p>We see that the derivatives of $J^{(G)}$ come very near to giving us what we want; the only problem is that the expectation is computed by drawing samples from $p_{gen}$ when we would like it to be computed by drawing samples from $p_{data}$. We can fix this problem using an importance sampling trick: set $f = -\frac{p_{data}}{p_{gen}}$. Note that when constructing $J^{(G)}$ we must copy $p_{gen}$ into $f(x)$ so that $f(x)$ has a derivative of zero with respect to the parameters of $p_{gen}$. Fortunately, this happens naturally if we obtain the value of $\frac{p_{data}(x)}{p_{gen}(x)}$ from the discriminator. Suppose our discriminator is given by $D(x) = \sigma(a(x))$ where $\sigma$ is the logistic sigmoid function. Suppose further that our discriminator has converged to its optimal value for the current generator,</p>
\[D^* = \frac{p_{data}}{p_{data} + p_{gen}}.\]
<p>Then $f(x) = − \exp (a(x))$.</p>
<h1 id="jekyll-learning-notes">Jekyll Learning Notes</h1>
<p>2019-02-21, <a href="https://zhenhuan-yang.github.io/jekyll">https://zhenhuan-yang.github.io/jekyll</a></p>
<p>Let us be honest: building a personal academic website is essential, yet a nightmare for a non-CS major. This post summarizes all the experience/suffering that I have.</p>
<h2 id="good-old-days-with-jemdoc">Good old days with jemdoc</h2>
<p><a href="https://jemdoc.jaboc.net/">jemdoc</a> is a light text-based markup language designed for creating websites, developed by Jacob Mattingley. It is popular among researchers, for example, <a href="https://web.stanford.edu/~boyd/">Stephen P. Boyd</a>.</p>
<h3 id="setting-up-jemdoc">Setting up jemdoc</h3>
<p>Following the official site can fail even if you are the superuser! Instead, follow <a href="http://www-personal.umich.edu/~wylguan/using-jemdoc.html">here</a>.</p>
<h3 id="hosting-your-websites">Hosting your websites</h3>
<p>After you build your badass website, you will want to publish and show off (isn’t that all that matters?). If you are a student/faculty/staff in college, contact your Information Technology Services for hosting. Otherwise, try out <a href="https://pages.github.com/">Github Pages</a>.</p>
<h3 id="jemdoc--mathjax">jemdoc + Mathjax</h3>
<p>As a math major, typing $\LaTeX$ equations becomes a necessity. The original jemdoc renders LaTeX equations as PNG images, which look pixelated! Then I found MathJax, a JavaScript display engine for mathematics that works in all browsers!</p>
<p><a href="http://www.mit.edu/~wsshin/jemdoc+mathjax.html">Wonseok Shin</a> made this possible. The usage is quite simple, just change your <code class="language-plaintext highlighter-rouge">mysite.conf</code> and do</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>../jemdoc -c mysite.conf *.jemdoc
</code></pre></div></div>
<p>when you compile. Check details <a href="https://github.com/wsshin/jemdoc_mathjax">here</a>.</p>
<h3 id="writing-blog-with-jupyter-notebook">Writing blog with jupyter notebook</h3>
<p>Since <code class="language-plaintext highlighter-rouge">jemdoc.py</code> is a Python file, I intuitively created a Python project with the <a href="https://www.jetbrains.com/pycharm/">PyCharm</a> editor. Later I found this is not necessary, but I am happy with my choice, since PyCharm comes with a Terminal feature and you can just run <code class="language-plaintext highlighter-rouge">jemdoc</code> in there. Besides, you can call <code class="language-plaintext highlighter-rouge">jupyter notebook</code>, which is an amazing markdown tool as it</p>
<ul>
<li>supports LATEX display</li>
<li>exports to <code class="language-plaintext highlighter-rouge">.html</code> easily</li>
</ul>
<h2 id="migrating-to-jekyll">Migrating to Jekyll</h2>
<p>Yes, if you are satisfied with the jemdoc site, then you can close this post right away (it will save you huge time, I guarantee!). Yet I wanted to make my website a little bit fancier. Let me cut to the chase: after going to hell and back, I found my treasure, Jekyll.</p>
<p>Just like jemdoc, <a href="https://jekyllrb.com/">Jekyll</a> is also a static site generator. It has the following pros</p>
<ul>
<li>Blog-aware</li>
<li>Free hosting with GitHub Pages</li>
<li>Good community with rich choice of theme templates</li>
</ul>
<p>Nonetheless, it has the following cons</p>
<ul>
<li>It is heavy in coding and I know nothing about it!</li>
</ul>
<p>Jekyll is a <a href="https://www.ruby-lang.org/en/">Ruby</a> <a href="https://guides.rubygems.org/what-is-a-gem/">Gem</a> that can be installed on most systems. If you are on macOS like me, you probably have it pre-installed. Without knowing anything about Ruby, all you have to do now is to install the jekyll and <a href="https://bundler.io/">bundler</a> gems.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gem install jekyll bundler
</code></pre></div></div>
<h3 id="life-saver-minimal-mistakes">Life saver: Minimal Mistakes</h3>
<p><a href="https://github.com/mmistakes/minimal-mistakes">Minimal Mistakes</a> is a flexible two-column Jekyll theme, perfect for building personal sites, blogs, and portfolios.</p>
<p>I didn’t install the theme following the <a href="https://mmistakes.github.io/minimal-mistakes/docs/quick-start-guide/">guide</a>; instead, I forked the repository and renamed it <code class="language-plaintext highlighter-rouge">username.github.io</code>. Then I ran <code class="language-plaintext highlighter-rouge">git clone</code> on my own repository to remove the unnecessary files as <a href="https://mmistakes.github.io/minimal-mistakes/docs/quick-start-guide/">suggested</a>. Whenever you have made changes to your local repository, you have to stage, commit and push.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git add
git commit -a
git push origin master
</code></pre></div></div>
<p>Do not remove <code class="language-plaintext highlighter-rouge">minimal-mistakes-jekyll.gemspec</code> if you are not planning to mess with Ruby like me! I accidentally deleted it, so I had to <code class="language-plaintext highlighter-rouge">cd</code> into my local repository and run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>touch minimal-mistakes-jekyll.gemspec
vim minimal-mistakes-jekyll.gemspec
</code></pre></div></div>
<h3 id="first-time-building">First time building…</h3>
<p>To build the website locally, <code class="language-plaintext highlighter-rouge">cd</code> into your local repository and run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bundle install
bundle exec jekyll serve
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">bundle install</code> will download the necessary gem packages and create a <code class="language-plaintext highlighter-rouge">Gemfile.lock</code> file and <code class="language-plaintext highlighter-rouge">bundle exec</code> will stick your gem packages version to <code class="language-plaintext highlighter-rouge">Gemfile.lock</code>. And <code class="language-plaintext highlighter-rouge">jekyll serve</code> will build your website locally. You can now check out your local website from your local server (check your terminal message).</p>
<p>Alternatively, I use <a href="https://atom.io/">Atom</a> instead of the terminal directly. Atom is an amazing editor built with HTML, JavaScript, CSS, and Node.js integration, so you can easily build your Jekyll website by installing the Atom jekyll package. Furthermore, Atom can be configured with your GitHub account and repository, allowing you to use git without the command line!</p>
<h3 id="editing-_configyml">Editing <code class="language-plaintext highlighter-rouge">_config.yml</code></h3>
<p>To make the website look like yours, start by editing the site author section of <code class="language-plaintext highlighter-rouge">_config.yml</code> with your information.</p>
<p>To add your avatar, create a folder named <code class="language-plaintext highlighter-rouge">images</code> in the <code class="language-plaintext highlighter-rouge">assets</code> folder and put your avatar picture inside.</p>
<p>To edit the social links, find your Font Awesome icons <a href="https://www.w3schools.com/icons/default.asp">here</a>. I did not find a brand icon for Google Scholar, so I used <code class="language-plaintext highlighter-rouge">fas fa-fw fa-graduation-cap</code>.</p>
<p>You also want to edit the site settings if you intend to write blog posts and enable comments. Follow the instructions on how to use <a href="https://disqus.com/">Disqus</a>.</p>
<p>One additional thing I did was to add Google Analytics. See how to sign up for an Analytics account and find your Analytics tracking ID <a href="https://support.google.com/sites/answer/97459?hl=en">here</a>. In your <code class="language-plaintext highlighter-rouge">_config.yml</code>, choose the provider as <code class="language-plaintext highlighter-rouge">provider: google-gtag</code> and put down your <code class="language-plaintext highlighter-rouge">tracking_id</code>.</p>
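<p>For reference, here is a sketch of the corresponding block in <code class="language-plaintext highlighter-rouge">_config.yml</code>; the nesting follows the theme's sample config, and the tracking ID is a placeholder you must replace with your own:</p>

```yaml
# Sketch of the analytics block — the tracking_id is a placeholder
analytics:
  provider: "google-gtag"
  google:
    tracking_id: "UA-XXXXXXXXX-X"
```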
<h3 id="adding-favicons">Adding favicons</h3>
<p>Go to <code class="language-plaintext highlighter-rouge">custom.html</code> located in <code class="language-plaintext highlighter-rouge">/_includes/head</code>. Follow the instructions to add your favicons.</p>
<h3 id="adding-mathjax-supports">Adding Mathjax supports</h3>
<p>Go to <code class="language-plaintext highlighter-rouge">scripts.html</code> located in <code class="language-plaintext highlighter-rouge">/_includes</code>. Add this snippet at the end</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><script type="text/javascript" async
  src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_CHTML">
</script>
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    extensions: ["tex2jax.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
  });
</script>
</code></pre></div></div>
<p>Enable MathJax in <code class="language-plaintext highlighter-rouge">_config.yml</code> by adding <code class="language-plaintext highlighter-rouge">mathjax: true</code> to the page defaults</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Defaults
defaults:
  # _posts
  - scope:
      path: ""
      type: posts
    values:
      layout: single
      author_profile: true
      read_time: true
      comments: true
      share: true
      related: true
      mathjax: true
</code></pre></div></div>
<h3 id="editing-home-layout">Editing home layout</h3>
<p>The default homepage layout shows your recent posts, which is good if your website is just a blog. However, if you intend to display your website as a personal academic site like me, you will want the homepage to include basic information and disable the recent posts.</p>
<p>I have not found a perfect way, so I go to <code class="language-plaintext highlighter-rouge">home.html</code> located in <code class="language-plaintext highlighter-rouge">/_layouts</code> and delete the snippet after</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><h3 class="archive__subtitle">Recent posts</h3>
</code></pre></div></div>
<p>Now go ahead to <code class="language-plaintext highlighter-rouge">index.html</code> and change the name to <code class="language-plaintext highlighter-rouge">index.md</code>, which enables you to write in markdown mode. If your website is hosted by GitHub Pages this is fine. Otherwise, you need to export your <code class="language-plaintext highlighter-rouge">.md</code> file to an <code class="language-plaintext highlighter-rouge">.html</code> file.</p>
<h3 id="clustrmaps">ClustrMaps</h3>
<p><a href="https://clustrmaps.com/">ClustrMaps</a> is a fancy widget that can track your website visitors from all over the world and visualize them on a real-time map. Follow the instructions on the official site to create your widget.</p>
<p>Go to <code class="language-plaintext highlighter-rouge">home.html</code> located in <code class="language-plaintext highlighter-rouge">/_layouts</code> and paste your JavaScript snippet at the end. Now you will see the change.</p>
<h3 id="creating-pages">Creating pages</h3>
<p>Pages live alongside the homepage. For an academic site, I need at least a Publications page and a Blog page.</p>
<p>Firstly, go to <code class="language-plaintext highlighter-rouge">navigation.yml</code> located in <code class="language-plaintext highlighter-rouge">/_data</code> to declare your pages; mine looks like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># main links
main:
  # - title: "Quick-Start Guide"
  #   url: https://mmistakes.github.io/minimal-mistakes/docs/quick-start-guide/
  - title: "Publications"
    url: /publications/
  - title: "Projects"
    url: /projects/
  - title: "Teaching"
    url: /teaching/
  - title: "Blog"
    url: /blog/
</code></pre></div></div>
<p>Next, create a folder <code class="language-plaintext highlighter-rouge">_pages</code> together with page files according to <code class="language-plaintext highlighter-rouge">navigation.yml</code>. Specify <code class="language-plaintext highlighter-rouge">layout: archive</code> and set the <code class="language-plaintext highlighter-rouge">permalink</code> in each page to match <code class="language-plaintext highlighter-rouge">navigation.yml</code>.</p>
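<p>As an illustration, a minimal <code class="language-plaintext highlighter-rouge">_pages/publications.md</code> (the filename is my own choice; any name works as long as the permalink matches) could start with front matter like this:</p>

```yaml
---
title: "Publications"
layout: archive
permalink: /publications/   # must match the url in navigation.yml
author_profile: true
---
```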
<h3 id="using-paginate">Using paginate</h3>
<p>Since we have created the <code class="language-plaintext highlighter-rouge">/blog/</code> URL, create a folder also named <code class="language-plaintext highlighter-rouge">blog</code>. Put an <code class="language-plaintext highlighter-rouge">index.html</code> in this folder and paste into it the code originally from the <code class="language-plaintext highlighter-rouge">home.html</code> layout.</p>
<p>In <code class="language-plaintext highlighter-rouge">_config.yml</code>, specify the pagination path</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>paginate_path: /blog/page:num
</code></pre></div></div>
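<p>Note that the <code class="language-plaintext highlighter-rouge">jekyll-paginate</code> plugin also needs a posts-per-page count before it generates anything; a typical pair of settings (5 is an arbitrary sample value) is:</p>

```yaml
paginate: 5                    # posts shown per page
paginate_path: /blog/page:num  # where paginated pages are generated
```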
<h3 id="creating-posts">Creating posts</h3>
<p>Firstly, create a folder named <code class="language-plaintext highlighter-rouge">_posts</code> and then put all your posts in here.</p>
<p>To name your post, use <code class="language-plaintext highlighter-rouge">yyyy-mm-dd-title.md</code> as the file name; it tells Jekyll the date and title. Set your layout as <code class="language-plaintext highlighter-rouge">layout: single</code> and enable <code class="language-plaintext highlighter-rouge">comments: true</code> if you added a comment provider in your <code class="language-plaintext highlighter-rouge">_config.yml</code>.</p>
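<p>Putting this together, a post file such as <code class="language-plaintext highlighter-rouge">_posts/2016-08-16-hello-world.md</code> would begin with front matter along these lines (the tags are sample values):</p>

```yaml
---
title: "Hello world"
layout: single
comments: true
tags:
  - jekyll
---
```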
<p>Happy blogging!</p>
<h3 id="deployment">Deployment</h3>
<p>To upload a Jekyll site to a web host using FTP, run the command</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bundle exec jekyll build
</code></pre></div></div>
<p>and copy the contents of the generated <code class="language-plaintext highlighter-rouge">_site</code> folder to the root folder of your hosting account.</p>
<p>If you are using GitHub Pages, there is no need for this step; and even if you build locally, you do not have to commit or push the generated output, since <code class="language-plaintext highlighter-rouge">_site</code> is in <code class="language-plaintext highlighter-rouge">.gitignore</code>.</p>
<h2 id="check-out-academicpages">Check out academicpages</h2>
<p>Yet, I found this amazing <a href="https://github.com/academicpages/academicpages.github.io">repository</a> after I built everything (yah!).</p>
<p>Basically this template has everything you need, you just need to fill in your info.</p>Dr. Zhenhuan YangLet us be honest. Building a personal academic website is essential, yet nightmare-ful for a non-cs major. This post summarizes all the experience/suffering that I have.Everything about FFNs: Structure, Optimization and Beyond2018-08-08T00:00:00+00:002018-08-08T00:00:00+00:00https://zhenhuan-yang.github.io/everything-about-ffn<h2 id="mathematical-formulation">Mathematical formulation</h2>
<p>A standard fully-connected neural network is given by</p>
\[f_\theta(x) = W^L\phi(W^{L-1}\cdots \phi(W^2\phi(W^1x)))\]
<p>where $\phi: \mathbb{R} \rightarrow \mathbb{R}$ is the neuron activation function, $W^l$ is a matrix of dimension $d_l \times d_{l-1}$, $l=1, \cdots, L$, and $\theta = (W^1,\cdots,W^L)$ represents the collection of all parameters. When applying the scalar function $\phi$ to a matrix $Z$, we apply $\phi$ to each entry of $Z$. Another way to write down the neural network is to use a recursion formula</p>
\[z^0 = x, z^l = \phi(W^l z^{l-1}+ b^l), l= 1,\cdots L.\]
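<p>The recursion translates directly into code. Below is a minimal sketch in plain Python (matrices as nested lists, ReLU activation, bias terms omitted to match the simplified expression; the dimensions are made-up illustrative values):</p>

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, t) for t in v]

def forward(weights, x):
    """z^0 = x; z^l = phi(W^l z^{l-1}); no activation on the output layer."""
    z = x
    for l, W in enumerate(weights):
        h = matvec(W, z)
        z = relu(h) if l < len(weights) - 1 else h
    return z

# Tiny 2-layer example mapping R^2 -> R^2 -> R^1
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]
print(forward([W1, W2], [1.0, 2.0]))  # relu gives [0.0, 1.5], so the output is [3.0]
```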
<p>For simplicity of presentation, we skip the bias term $b^l$ in the expression of neural networks. For computer vision tasks, convolutional neural networks (CNN) are standard. For an input $Z$ of size $I \times I \times C$, where $C$ is the number of channels, a filter $W$ of size $F \times F \times C$ produces an output of size $O \times O \times 1$. The rows and columns of the output matrix are indexed by $m$ and $n$ respectively.</p>
\[Z^l(m,n) = (Z^{l-1} * W^{l}) (m,n) = \sum_{i=1}^F\sum_{j=1}^F\sum_{k=1}^C W^l(i,j,k) Z^{l-1}(m+i-1,n+j-1,k).\]
<p>Note that this definition differs from the mathematical definition of the convolution of two functions, in which the places of $+$ and $-$ are switched; what deep learning calls convolution is really cross-correlation. Applying $K$ filters results in an output of size $O \times O \times K$. With a stride $S$, the index on the input becomes $Z^{l-1}((m-1)S+i, (n-1)S+j, k)$.</p>
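<p>To make the index bookkeeping concrete, here is a sketch of the single-channel, single-filter, stride-$1$ case in plain Python (zero-based indices, so the $m+i-1$ of the formula becomes <code class="language-plaintext highlighter-rouge">m + i</code>; as noted, this is really a cross-correlation):</p>

```python
def conv2d(Z, W):
    """Valid cross-correlation of an I x I input Z with an F x F filter W (C = 1, S = 1)."""
    I, F = len(Z), len(W)
    O = I - F + 1  # output size with no padding and stride 1
    out = [[0.0] * O for _ in range(O)]
    for m in range(O):
        for n in range(O):
            out[m][n] = sum(W[i][j] * Z[m + i][n + j]
                            for i in range(F) for j in range(F))
    return out

# 3x3 input, 2x2 averaging filter -> 2x2 output
Z = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
W = [[0.25, 0.25],
     [0.25, 0.25]]
print(conv2d(Z, W))  # [[3.0, 4.0], [6.0, 7.0]]
```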
<p>The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, that provides some spatial invariance. The operation is similar to convolution, but the weighted sum is replaced with a maximum or an average.</p>
<p>Zero-padding denotes the process of adding $P$ zeros to each side of the boundaries of the input. There are three types of padding. $P = 0$ is called valid padding. $P_s = \lfloor\frac{S\lceil\frac{I}{S}\rceil - I + F - S}{2}\rfloor$ and $P_e = \lceil\frac{S\lceil\frac{I}{S}\rceil - I + F - S}{2}\rceil$ are called same padding, so that the output has size $\lceil\frac{I}{S}\rceil$. $P_s \in [0,F-1]$ and $P_e = F-1$ are called full padding. Finally, the output size $O$ is given by</p>
\[O = \frac{I - F + P_s + P_e}{S} + 1.\]
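<p>These formulas are easy to sanity-check in code; the following sketch computes the same-padding amounts and the resulting output size for a made-up $7 \times 7$ input with a $3 \times 3$ filter:</p>

```python
from math import ceil, floor

def same_padding(I, F, S):
    """Padding (P_s, P_e) so that the output size is ceil(I / S)."""
    total = S * ceil(I / S) - I + F - S
    return floor(total / 2), ceil(total / 2)

def output_size(I, F, S, P_s=0, P_e=0):
    """O = (I - F + P_s + P_e) / S + 1."""
    return (I - F + P_s + P_e) // S + 1

print(output_size(7, 3, 1))            # valid padding: 5
P_s, P_e = same_padding(7, 3, 1)       # (1, 1)
print(output_size(7, 3, 1, P_s, P_e))  # same padding: 7
```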
<p>The last type of neural network we mention in this article is the Residual Network (ResNet). A nested sequence of function classes $\mathcal{F}_ 1 \subseteq \cdots \subseteq \mathcal{F}_ L$ enhances the network's ability to find the optimum, since adding layers can never shrink the function class. Therefore, we would like the mapping between two layers ($\mathcal{F}_ {l-1}$ and $\mathcal{F}_ l$) to include the identity map $\mathcal{H}(x) = x$, or equivalently, to learn the residual map $\mathcal{F}(x) = \mathcal{H}(x) - x$. The original function thus becomes $\mathcal{F}(x) + x$. The building blocks of ResNet look like</p>
\[z^l = \phi(\mathcal{F}(z^{l-k},\{W^{j}\}_ {j=l-k}^l) + z^{l-k}),\]
<p>where $\mathcal{F}$ represents the residual mapping to be learned. For example, with two layers, $\mathcal{F}(z^{l-1}) = W^l\phi(W^{l-1}z^{l-1})$.</p>
<h2 id="backpropagation">Backpropagation</h2>
<p>From an optimization perspective, backpropagation is an efficient implementation of gradient computation. To illustrate how BP works, suppose the loss function is quadratic and consider the per-sample loss</p>
<p>\(F(\theta) = \|y - W^L\phi(W^{L-1} \cdots W^2\phi(W^1x))\|^2\).</p>
<p>We define an important set of intermediate variables</p>
<p>\(z^0 = x, h^1 = W^1 z^0,\)
\(z^1 = \phi(h^1), h^2 = W^2 z^1,\)
\(\cdots\)
\(z^{L-1} = \phi(h^{L-1}), h^L = W^L z^{L-1}.\)</p>
<p>Furthermore, define $D^l = diag(\phi’(h_1^l),\cdots, \phi’(h_d^l))$, which is a diagonal matrix with the $i$-th diagonal entry being the derivative of the activation function evaluated at the $i$-th pre-activation $h_i^l$. Let the error vector be $e = h^L - y$. The gradient over the weight matrix $W^l$ is given by</p>
\[\frac{\partial F(\theta)}{\partial W^l} = (W^LD^{L-1} \cdots W^{l+1}D^l)^\top 2(h^L - y)(z^{l-1})^\top, l = 1,\cdots, L.\]
<p>Define a sequence of backpropagated errors as</p>
<p>\(e^L = 2(h^L - y),\)
\(e^{L-1} = (W^LD^{L-1})^\top e^L,\)
\(\cdots\)
\(e^1 = (W^2D^1)^\top e^2.\)</p>
<p>Then the partial gradient can be written as</p>
\[\frac{\partial F(\theta)}{\partial W^l} = e^l(z^{l-1})^\top, l = 1,\cdots, L.\]
<p>A naive method to compute all partial gradients would require $\mathcal{O}(L^2)$ matrix multiplications, since each partial gradient requires $\mathcal{O}(L)$ matrix multiplications. A smarter algorithm reuses multiplications as follows.</p>
<p><img src="/assets/images/bp.png" alt="image" /></p>
<p>In the forward pass, from the bottom layer $1$ to the top layer $L$, the post-activation $z^l$ is computed recursively and stored for future use. After computing the last-layer output $h^L$, we compare it with the ground truth $y$ to obtain the error $e = h^L - y$. In the backward pass, from the top layer $L$ to the bottom layer $1$, two quantities are computed at each layer $l$. First, the backpropagated error $e^l$ is computed. Second, the partial gradient over the $l$-th layer weight matrix $W^l$ is computed. After the forward pass and backward pass, we have computed the partial gradient of every weight for one sample $x$.</p>
<p>By a small modification to this procedure, we can implement SGD as follows. After the partial gradient over $W^l$ is computed, we update $W^l$ by a gradient step. After updating all weights $W^l$, we have completed one iteration of SGD. In mini-batch SGD, the implementation is slightly different: in the forward and backward passes, a mini-batch of multiple samples passes through the network together.</p>
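<p>The forward and backward passes can be verified against finite differences. The sketch below does this for a tiny two-layer network with a sigmoid hidden layer and the quadratic per-sample loss (the weights and data are arbitrary illustrative values):</p>

```python
import copy
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

def matvec(W, x):
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def matvec_T(W, x):
    """Compute W^T x."""
    return [sum(W[i][j] * x[i] for i in range(len(W))) for j in range(len(W[0]))]

def loss(W1, W2, x, y):
    h2 = matvec(W2, sigmoid(matvec(W1, x)))
    return sum((a - b) ** 2 for a, b in zip(h2, y))

def backprop_gW1(W1, W2, x, y):
    """Gradient of the per-sample quadratic loss over W^1."""
    z1 = sigmoid(matvec(W1, x))
    h2 = matvec(W2, z1)
    e2 = [2 * (a - b) for a, b in zip(h2, y)]           # e^L = 2(h^L - y)
    d1 = [s * (1 - s) for s in z1]                      # diagonal of D^1
    e1 = [d * t for d, t in zip(d1, matvec_T(W2, e2))]  # e^1 = D^1 (W^2)^T e^2
    return [[e1[i] * x[j] for j in range(len(x))] for i in range(len(e1))]

x, y = [1.0, -1.0], [0.5]
W1 = [[0.2, -0.3], [0.4, 0.1]]
W2 = [[0.7, -0.6]]
gW1 = backprop_gW1(W1, W2, x, y)

# Finite-difference check of dF/dW1[0][0]
eps = 1e-6
Wp = copy.deepcopy(W1); Wp[0][0] += eps
Wm = copy.deepcopy(W1); Wm[0][0] -= eps
numeric = (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * eps)
print(abs(gW1[0][0] - numeric))  # agrees up to finite-difference error
```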
<h2 id="gradient-explosion--vanishing">Gradient Explosion / Vanishing</h2>
<p>Consider the following example of $1$-dimensional problem</p>
\[\min_{w^1,\cdots,w^L} F(\theta) = (1 - w^1\cdots w^L)^2.\]
<p>The gradient over $w^l$ is</p>
\[\frac{\partial F(\theta)}{\partial w^l} = -2w^1\cdots w^{l-1}w^{l+1}\cdots w^L(1-w^1\cdots w^L) = -2w^1\cdots w^{l-1}w^{l+1}\cdots w^L e.\]
<p>If all $w^l = 2$, then the gradient has norm $2 \cdot 2^{L-1} |e|$, which is exponentially large; if all $w^l = \frac{1}{2}$, then the gradient has norm $2 \cdot (\frac{1}{2})^{L-1} |e|$, which is exponentially small. Note that many works do not mention gradient explosion, but only gradient vanishing. This is partially because the non-linear activation function can reduce the signal, partially because of empirical tricks such as gradient clipping, and partially because of regularization.</p>
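<p>The two regimes are easy to check numerically; a sketch for the scalar product example above:</p>

```python
def grad_wl(w, l):
    """Partial gradient of F = (1 - w^1 ... w^L)^2 with respect to w^l."""
    prod = 1.0
    for v in w:
        prod *= v
    rest = prod / w[l]  # product of all weights except w^l
    return -2.0 * rest * (1.0 - prod)

L = 10
g_big = grad_wl([2.0] * L, 0)    # all weights 2: magnitude grows like 2^{L-1}
g_small = grad_wl([0.5] * L, 0)  # all weights 1/2: magnitude shrinks like (1/2)^{L-1}
print(abs(g_big), abs(g_small))  # roughly 1e6 versus 4e-3 already at L = 10
```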
<h3 id="relu-activation">ReLU Activation</h3>
<p>The use of sigmoid-type activation functions, such as the logistic sigmoid $\phi(x) = \frac{1}{1+\exp(-x)}$, suffers from the gradient vanishing problem. Recall the backpropagated error $e^{l-1} = (W^lD^{l-1})^\top e^l$; since $\phi’(x) \in (0,\frac{1}{4}]$, the gradient keeps shrinking (especially when the magnitude of the input is large, where $\phi’(x)$ is close to $0$) as it backpropagates to the earlier layers.</p>
<p>The introduction of the rectified linear unit (ReLU), $\phi(x) = \max(0,x)$, mitigates this issue: the gradient is $1$ whenever $x > 0$, so the speed of convergence is much faster than with sigmoid activations. ReLU also easily yields sparse representations.</p>
<p><img src="/assets/images/sparse.png" alt="image" /></p>
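<p>A back-of-the-envelope comparison of the two activations: the sigmoid derivative is at most $1/4$, so even in the best case (and ignoring the weight matrices) the backpropagated signal through $L$ sigmoid layers is damped by at least $4^{-L}$, while ReLU passes it through unchanged on its active region. A sketch:</p>

```python
import math

def sigmoid_prime(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0

L = 10
# Best case for sigmoid: every pre-activation is exactly 0, where phi'(0) = 1/4
sig_factor = sigmoid_prime(0.0) ** L  # 0.25 ** 10, already under 1e-6
relu_factor = relu_prime(1.0) ** L    # 1.0 on the active region
print(sig_factor, relu_factor)
```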
<h3 id="initialization">Initialization</h3>
<h3 id="batchnorm">BatchNorm</h3>
<p>Batch normalization (BatchNorm) is another way to avoid gradient vanishing (or to avoid internal covariate shift). Recall that in the linear regression problem $\min_w \sum_{i=1}^n(y_i - w^\top x_i)^2$, we often scale each row of the data matrix $[x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$ so that each row has zero mean and unit variance (one row corresponds to one feature). This operation can be viewed as a pre-conditioning technique that reduces the condition number of the Hessian matrix.</p>
<p>Consider the matrix of pre-activations $[h^l(1), \cdots, h^l(n)]$, where $h^l(i)$ represents the pre-activation at the $l$-th layer for the $i$-th sample. It is natural to hope that each row of $[h^l(1), \cdots, h^l(n)]$ has zero mean and unit variance. However, normalizing over the whole dataset is computationally hard; normalizing each mini-batch instead, and treating the normalization as a layer of its own (to which the chain rule applies), saves time. Formally, for any $x \in B$, the operator is defined as</p>
\[BN(x) = \gamma \frac{x - \mu_B}{\sigma_B} +\beta\]
<p>where $\mu_B$ is the batch mean and $\sigma_B$ is the batch standard deviation (perturbed by a small constant for numerical stability). Note that $\gamma$ and $\beta$ are parameters that need to be learned jointly with the other model parameters. The BatchNorm layer for a fully connected network is</p>
\[z^{l} = \phi(BN(Wz^{l-1})).\]Dr. Zhenhuan YangMathematical formulationHello world2016-08-16T00:00:00+00:002016-08-16T00:00:00+00:00https://zhenhuan-yang.github.io/hello-world<ul>
<li>
<p>First post.</p>
</li>
<li>
<p>MathJax is supported</p>
</li>
</ul>
\[e^{i\pi} + 1 = 0.\]
<ul>
<li>Tags are sorted.</li>
</ul>Dr. Zhenhuan YangFirst post.