If $y=1$, then $L(\hat{y},y)=-\log\hat{y}$. We want $L$ to be as small as possible, which means $\hat{y}$ should be as large as possible. Since $\hat{y}$ is the output of the sigmoid function, it cannot exceed 1, so we want $\hat{y}$ to be close to 1.
If $y=0$, then $L(\hat{y},y)=-\log(1-\hat{y})$. We want $L$ to be as small as possible, which means $\hat{y}$ should be as small as possible. Since $\hat{y}$ is the output of the sigmoid function, it cannot go below 0, so we want $\hat{y}$ to be close to 0.
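As a quick sanity check, here is a minimal NumPy sketch (not from the original notes; the `loss` helper is my own) that evaluates the loss at a few values of $\hat{y}$ for each label. The loss shrinks as $\hat{y}$ moves toward the true $y$.

```python
import numpy as np

def loss(y_hat, y):
    # Cross-entropy loss for a single example:
    # L(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

for y in (1, 0):
    for y_hat in (0.1, 0.5, 0.9):
        print(f"y={y}, y_hat={y_hat:.1f}, L={loss(y_hat, y):.3f}")
# For y=1 the loss drops as y_hat -> 1; for y=0 it drops as y_hat -> 0.
```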
We want to find the $w$ and $b$ that minimize $J(w, b)$, which gradient descent does by repeatedly stepping opposite the gradient. Here is a simplified example.
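For reference, the standard gradient descent update, with $\alpha$ denoting the learning rate, repeated until convergence, is:

$$
w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w,b)}{\partial b}
$$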
Here is one step of gradient descent based on a single training sample.
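For a single sample with $z = w^T x + b$ and $a = \hat{y} = \sigma(z)$, the chain rule gives (a standard derivation, stated here without the intermediate algebra):

$$
\frac{\partial L}{\partial z} = a - y, \qquad \frac{\partial L}{\partial w_j} = x_j\,(a - y), \qquad \frac{\partial L}{\partial b} = a - y
$$

so the one-sample update is $w_j := w_j - \alpha\, x_j (a - y)$ and $b := b - \alpha\, (a - y)$.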
One step of gradient descent over $m$ samples:
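A minimal NumPy sketch of one such step (the function and variable names here are my own, not from the notes), assuming `X` is an $n \times m$ matrix with one training sample per column and `Y` a $1 \times m$ row of labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(w, b, X, Y, alpha):
    """One step of gradient descent on the logistic-regression cost J(w, b).

    X: (n, m) matrix, one sample per column.
    Y: (1, m) labels in {0, 1}.
    w: (n, 1) weights, b: scalar bias, alpha: learning rate.
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # predictions for all m samples, shape (1, m)
    dZ = A - Y                        # dL/dz for every sample
    dw = np.dot(X, dZ.T) / m          # average gradient w.r.t. w, shape (n, 1)
    db = np.sum(dZ) / m               # average gradient w.r.t. b
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```

Averaging the per-sample gradients over $m$ is exactly the derivative of $J(w,b)$, since $J$ is the mean of the per-sample losses.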