The State of Computer Vision
- Generally, we need more data for more complex problems
- For example, speech recognition has:
  - A large amount of data available
  - Simpler algorithms
- However, image recognition is a more complicated problem than speech recognition
- Therefore, image recognition needs even more data than speech recognition
- More complex problems often call for carefully designed features as well as large amounts of data
- Simpler problems can use simpler network architectures with smaller amounts of data
Introducing Image Classification and Detection
- Localization refers to determining where in the picture the detected object is
- There are three general types of image classification:
  - Image classification without localization
  - Image classification with localization
  - Object detection
- Image classification without localization refers to labeling an image with a broad category
- For example, running image classification without localization on a picture of a car would hopefully output a car label
- Image classification with localization includes the following:
  - Labeling an image with a broad category
  - Providing a bounding box around the classified object
- For example, running image classification with localization on a picture of a car would hopefully output a car label and a box around the car in the image
- Object detection refers to finding multiple objects in a picture and localizing them
- For example, running object detection on an image with multiple cars and people would hopefully detect each object (e.g. each person and each car) in the image
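The difference between the three tasks is easiest to see in the shape of what each one returns. The sketch below is only an illustration: the class names, the field names, and the `(center_x, center_y, height, width)` box layout are assumptions made for this example, not notation from the notes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout for this sketch: (center_x, center_y, height, width),
# expressed as fractions of the image size.
Box = Tuple[float, float, float, float]


@dataclass
class Classification:
    """Image classification without localization: a label only."""
    label: str


@dataclass
class LocalizedClassification:
    """Image classification with localization: a label plus one box."""
    label: str
    box: Box


@dataclass
class Detections:
    """Object detection: any number of labeled, localized objects."""
    objects: List[LocalizedClassification]


car_only = Classification(label="car")
car_with_box = LocalizedClassification(label="car", box=(0.5, 0.6, 0.3, 0.4))
street_scene = Detections(objects=[
    LocalizedClassification(label="car", box=(0.3, 0.6, 0.2, 0.3)),
    LocalizedClassification(label="person", box=(0.7, 0.5, 0.4, 0.1)),
])
```

Each step adds information: first a label, then a label with a position, then a list of such pairs.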
Describing Object Localization
- Most object localization problems will return:
  - A classification label
  - A probability tied to the label
  - A bounding box
- These outputs break down as follows:
  - A label
  - A probability
  - A bounding box, described by:
    - The center of the bounding box
    - The height of the bounding box
    - The width of the bounding box
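A box stored as a center, a height, and a width often has to be converted to corner coordinates before it can be drawn or compared with another box. The following is a minimal sketch of that conversion; the class and field names are hypothetical, and it assumes the center is an (x, y) pair measured in the same units as the height and width.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    # Hypothetical field names; the notes only name a center, a height, and a width.
    center_x: float
    center_y: float
    height: float
    width: float

    def corners(self):
        """Return (x_min, y_min, x_max, y_max) for this box."""
        half_w = self.width / 2.0
        half_h = self.height / 2.0
        return (
            self.center_x - half_w,
            self.center_y - half_h,
            self.center_x + half_w,
            self.center_y + half_h,
        )


box = BoundingBox(center_x=0.5, center_y=0.5, height=0.5, width=0.25)
print(box.corners())  # (0.375, 0.25, 0.625, 0.75)
```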
- The output is given by the following output layers:
  - A fully-connected layer outputting the label
  - A fully-connected layer outputting the probability
  - A fully-connected layer outputting the bounding box
- The three fully-connected layers are specifically:
  - The layer outputting the label is a softmax layer
  - The layer outputting the probability is a logistic layer
  - The layer outputting the bounding box is a linear layer
- As a reminder, a linear neuron uses a quadratic loss function
- Softmax and logistic neurons use a cross-entropy loss function
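The three output layers and their losses can be sketched in NumPy as follows. This is a rough illustration under several assumptions: the function names, the weight shapes, the omission of bias terms, the reading of the probability as "an object is present", and the choice to simply add the three losses without weighting are all made up for this example.

```python
import numpy as np


def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def localization_outputs(features, heads):
    """Apply three fully-connected heads (no biases) to one shared feature vector."""
    W_label, W_prob, W_box = heads
    label_probs = softmax(W_label @ features)  # softmax layer -> label
    prob = sigmoid(W_prob @ features)[0]       # logistic layer -> probability
    box = W_box @ features                     # linear layer -> bounding box
    return label_probs, prob, box


def localization_loss(outputs, true_label, true_prob, true_box):
    """Cross-entropy for the softmax and logistic heads, quadratic loss for the box."""
    label_probs, prob, box = outputs
    eps = 1e-12  # keeps log() away from zero
    label_loss = -np.log(label_probs[true_label] + eps)
    prob_loss = -(true_prob * np.log(prob + eps)
                  + (1.0 - true_prob) * np.log(1.0 - prob + eps))
    box_loss = 0.5 * np.sum((box - true_box) ** 2)
    # Assumption: the three terms are summed with equal weight.
    return label_loss + prob_loss + box_loss


rng = np.random.default_rng(0)
features = rng.normal(size=8)        # stand-in for the last hidden layer
heads = (
    rng.normal(size=(3, 8)),         # 3 classes
    rng.normal(size=(1, 8)),         # 1 probability
    rng.normal(size=(4, 8)),         # center x, center y, height, width
)
outputs = localization_outputs(features, heads)
loss = localization_loss(
    outputs,
    true_label=1,
    true_prob=1.0,
    true_box=np.array([0.5, 0.5, 0.5, 0.25]),
)
```

Because each head has its own loss, the gradient flowing back into the shared features is the sum of the gradients from all three.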
Introducing Landmark Detection
- Previously, our network specified the bounding box of an object
- Our network can also output the x and y coordinates of the most important points in an image
- These points are called landmarks
- Essentially, these landmarks represent locations, rather than a bounding box
- For example, let's say we have a bunch of images of faces
- We may want to output the x and y coordinates of the corner of someone's eye
- Specifically, four landmarks, each an (x, y) pair, could represent the four corners of someone's eyes
- Being able to detect these landmarks is crucial for creating computer graphics effects
- For example, Snapchat uses landmark detection for its filters
- Landmark detection is a type of supervised learning
- Therefore, we typically need to manually label the x and y coordinates of each landmark in every image for our network to learn from
- In this case, each landmark must keep the same identity across all images
- For example, the coordinates of the left corner of the left eye must always occupy the same pair of outputs
- Likewise, the coordinates of the right corner of the right eye must always occupy their own fixed pair of outputs
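One way to keep landmark identities consistent is to fix an ordering once and build every training target from it. The sketch below assumes a hypothetical annotation format, a dictionary from landmark name to an (x, y) pair. The landmark names themselves are also invented for illustration.

```python
# Hypothetical landmark names; what matters is that the order never changes.
LANDMARK_ORDER = [
    "left_eye_left_corner",
    "left_eye_right_corner",
    "right_eye_left_corner",
    "right_eye_right_corner",
]


def landmarks_to_target(annotation):
    """Flatten a {name: (x, y)} annotation into [x1, y1, x2, y2, ...] in a fixed order."""
    target = []
    for name in LANDMARK_ORDER:
        x, y = annotation[name]  # raises KeyError if a landmark was not labeled
        target.extend([x, y])
    return target


# The annotation lists landmarks in a different order, but the target layout is fixed.
image_a = {
    "right_eye_right_corner": (0.70, 0.40),
    "left_eye_left_corner": (0.30, 0.40),
    "left_eye_right_corner": (0.42, 0.41),
    "right_eye_left_corner": (0.58, 0.41),
}
target_a = landmarks_to_target(image_a)
print(target_a)
```

With this layout, the first two entries of every target always belong to the left corner of the left eye, whatever order the labeler used.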
tldr
- Localization refers to determining where in the picture the detected object is
- There are three general types of image classification:
  - Image classification without localization
  - Image classification with localization
  - Object detection
- Image classification without localization refers to labeling an image with a broad category
- Image classification with localization includes the following:
  - Labeling an image with a broad category
  - Providing a bounding box around the classified object
- Our network can also output the x and y coordinates of the most important points in an image
- These points are called landmarks
- Essentially, these landmarks represent locations, rather than a bounding box