Coursera - Deep Learning Quiz Record: Optimization algorithms

2020. 11. 4. 13:05


Coursera - Deep Learning

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Week 2 quiz record

You need to score above 80% to pass, but you can keep retrying even if you get answers wrong. Since the quiz is in English... I'm copying the questions and correct answers here as a record so I can look back at them later.

 

Q. Which of these statements about mini-batch gradient descent do you agree with?

One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
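A minimal sketch of why this is true, assuming a simple linear-regression cost, NumPy, and illustrative sizes (m = 50,000 examples, mini-batch size 64): one mini-batch iteration computes gradients over 64 examples instead of all m, so the step is far cheaper.

```python
import numpy as np

np.random.seed(0)
m, n = 50_000, 100                       # m examples, n features (illustrative sizes)
X = np.random.randn(m, n)
y = X @ np.random.randn(n) + 0.1 * np.random.randn(m)
w = np.zeros(n)
alpha = 0.01

def gradient(Xb, yb, w):
    """Gradient of the mean-squared-error cost on the given (mini-)batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# One iteration of batch gradient descent: touches all m examples.
w_batch = w - alpha * gradient(X, y, w)

# One iteration of mini-batch gradient descent: touches only 64 examples,
# so the gradient computation is roughly m/64 times cheaper.
idx = np.random.choice(m, size=64, replace=False)
w_mini = w - alpha * gradient(X[idx], y[idx], w)
```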

 

Q. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

 

  • If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
  • If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
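A minimal sketch of the shuffle-and-split step behind this trade-off (the helper name, shapes, and batch size 64 are illustrative, following the course's column-per-example convention): size 1 gives one-example slices with no vectorization, size m gives back the whole set, and an in-between size keeps vectorization while still updating often.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) and split them into mini-batches of `batch_size` examples.

    X has shape (n_features, m) and Y has shape (1, m), one column per example.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

# batch_size = 1 -> stochastic gradient descent: no vectorization across examples
# batch_size = m -> batch gradient descent: one update per full pass over the data
batches = random_mini_batches(np.random.randn(5, 1000), np.random.randn(1, 1000))
```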

 

Q. Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

 

  • If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
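A minimal sketch of why the two cases differ, on synthetic linear-regression data with illustrative hyperparameters: if the plotted cost is measured on the current mini-batch, it jumps around because every iteration uses a different small subset, even though the full-training-set cost trends down; batch gradient descent with a suitable learning rate should decrease the cost on every iteration.

```python
import numpy as np

np.random.seed(1)
m, n, bs, alpha = 4096, 20, 64, 0.05
X = np.random.randn(m, n)
y = X @ np.random.randn(n) + 0.5 * np.random.randn(m)
w = np.zeros(n)

mini_batch_cost, full_cost = [], []
for t in range(200):
    idx = np.random.choice(m, bs, replace=False)
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb
    mini_batch_cost.append((err ** 2).mean() / 2)    # noisy: a different batch each step
    full_cost.append(((X @ w - y) ** 2).mean() / 2)  # smooth downward trend
    w -= alpha * Xb.T @ err / bs                     # mini-batch gradient step
```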

 

 

Q. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: v_t = \beta v_{t-1} + (1-\beta)\theta_t. The red line below was computed using \beta = 0.9. What would happen to your red curve as you vary \beta? (Check the two that apply)

  • Increasing \beta will shift the red line slightly to the right.

-> True. Remember that the red line corresponds to \beta = 0.9. In lecture we had a green line (\beta = 0.98) that is slightly shifted to the right.

  • Decreasing \beta will create more oscillation within the red line.

-> True. Remember that the red line corresponds to \beta = 0.9. In lecture we had a yellow line (\beta = 0.5) that had a lot of oscillations.
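A minimal sketch of this behavior (using a synthetic daily-temperature series as a stand-in for the London data): the exponentially weighted average v_t = \beta v_{t-1} + (1-\beta)\theta_t averages over roughly 1/(1-\beta) days, so a larger \beta gives a smoother curve that lags (shifts right), while a smaller \beta hugs the raw measurements and oscillates more.

```python
import numpy as np

def ewa(theta, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

# Synthetic "daily temperature" series (illustrative stand-in for the London dataset).
days = np.arange(365)
theta = 10 + 10 * np.sin(2 * np.pi * days / 365) + 3 * np.random.randn(365)

v_red    = ewa(theta, beta=0.90)  # the red curve in the question
v_green  = ewa(theta, beta=0.98)  # smoother, shifted further to the right
v_yellow = ewa(theta, beta=0.50)  # noisier, follows the raw measurements closely
```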

 

 

Q. Consider this figure:

These plots were generated with gradient descent; with gradient descent with momentum (\beta = 0.5); and with gradient descent with momentum (\beta = 0.9). Which curve corresponds to which algorithm?

 

  • (1) is gradient descent. (2) is gradient descent with momentum (small \beta). (3) is gradient descent with momentum (large \beta).
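A minimal sketch of the momentum update from lecture (the parameter/gradient dictionaries and sizes are illustrative): v is an exponentially weighted average of past gradients, so a larger \beta averages out more of the oscillation, which is why curves (2) and (3) are progressively smoother.

```python
import numpy as np

def update_with_momentum(params, grads, v, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum step.

    v[k]      = beta * v[k] + (1 - beta) * grads[k]   (EWA of the gradients)
    params[k] = params[k] - alpha * v[k]
    beta = 0 recovers plain gradient descent (curve 1 in the figure).
    """
    for k in params:
        v[k] = beta * v[k] + (1 - beta) * grads[k]
        params[k] = params[k] - alpha * v[k]
    return params, v

# Illustrative usage with a single weight matrix.
params = {"W": np.random.randn(3, 3)}
grads  = {"W": np.random.randn(3, 3)}
v      = {"W": np.zeros((3, 3))}
params, v = update_with_momentum(params, grads, v, beta=0.9)
```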

 

Q. Which of the following statements about Adam is False?

(False) Adam should be used with batch gradient computations, not with mini-batches.

  • (True) The learning rate hyperparameter \alpha in Adam usually needs to be tuned.
  • (True) Adam combines the advantages of RMSProp and momentum.
  • (True) We usually use “default” values for the hyperparameters \beta_1, \beta_2 and \varepsilon in Adam (\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}).
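A minimal sketch of one Adam step for a single parameter vector, using the default hyperparameters from the question (the variable names and the random stand-in gradient are illustrative): it combines momentum's first-moment average with RMSProp's second-moment average, plus bias correction.

```python
import numpy as np

def adam_step(w, grad, m_v, s_v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector w at time step t (t starts at 1)."""
    m_v = beta1 * m_v + (1 - beta1) * grad          # momentum-style first moment
    s_v = beta2 * s_v + (1 - beta2) * grad ** 2     # RMSProp-style second moment
    m_hat = m_v / (1 - beta1 ** t)                  # bias correction
    s_hat = s_v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return w, m_v, s_v

# Illustrative usage: alpha usually still needs tuning, while beta1, beta2, and eps
# are normally left at their defaults, and Adam is typically run on mini-batches.
w = np.zeros(5)
m_v, s_v = np.zeros(5), np.zeros(5)
for t in range(1, 11):
    grad = np.random.randn(5)                       # stand-in for a mini-batch gradient
    w, m_v, s_v = adam_step(w, grad, m_v, s_v, t)
```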

 
