Commit 683d1f9

homework 04 basic and advanced

1 parent: 3a3406a

7 files changed: +2794 -0 lines changed

homework04/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
In this homework you will combine your knowledge of convolutional and recurrent neural networks to build an image captioning model.

The assignment is provided in captioning_pytorch.ipynb as usual, but before you start, please __download the data from [here](https://yadi.sk/d/b4nAwIE73TVcp5)__.

homework04/beheaded_inception3.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
import torch, torch.nn as nn
import torch.nn.functional as F
from torchvision.models.inception import Inception3
from torch.utils.model_zoo import load_url
from warnings import warn


class BeheadedInception3(Inception3):
    """ Like torchvision.models.inception.Inception3 but the head goes separately """

    def forward(self, x):
        if self.transform_input:
            x = x.clone()
            x[:, 0] = x[:, 0] * (0.229 / 0.5) + (0.485 - 0.5) / 0.5
            x[:, 1] = x[:, 1] * (0.224 / 0.5) + (0.456 - 0.5) / 0.5
            x[:, 2] = x[:, 2] * (0.225 / 0.5) + (0.406 - 0.5) / 0.5
        else:
            warn("Input isn't transformed")
        x = self.Conv2d_1a_3x3(x)
        x = self.Conv2d_2a_3x3(x)
        x = self.Conv2d_2b_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = self.Conv2d_3b_1x1(x)
        x = self.Conv2d_4a_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = self.Mixed_5b(x)
        x = self.Mixed_5c(x)
        x = self.Mixed_5d(x)
        x = self.Mixed_6a(x)
        x = self.Mixed_6b(x)
        x = self.Mixed_6c(x)
        x = self.Mixed_6d(x)
        x = self.Mixed_6e(x)
        x = self.Mixed_7a(x)
        x = self.Mixed_7b(x)
        x_for_attn = x = self.Mixed_7c(x)
        # 8 x 8 x 2048
        x = F.avg_pool2d(x, kernel_size=8)
        # 1 x 1 x 2048
        x_for_capt = x = x.view(x.size(0), -1)
        # 2048
        x = self.fc(x)
        # 1000 (num_classes)
        return x_for_attn, x_for_capt, x


def beheaded_inception_v3(transform_input=True):
    model = BeheadedInception3(transform_input=transform_input)
    inception_url = 'https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth'
    model.load_state_dict(load_url(inception_url))
    return model
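For context, here is a minimal usage sketch (not part of this commit) of how `beheaded_inception_v3` might be used to pre-compute the activations the captioning notebooks consume. The resize/normalize recipe below is the standard torchvision Inception one, `example.jpg` is a placeholder, and the import assumes you run from the homework04 directory; the course's own preprocessing notebook may differ.

    # Usage sketch (assumptions noted above): extract the three beheaded outputs for one image.
    import torch
    from torchvision import transforms
    from PIL import Image

    from beheaded_inception3 import beheaded_inception_v3

    model = beheaded_inception_v3(transform_input=True).eval()

    preprocess = transforms.Compose([
        transforms.Resize((299, 299)),                    # Inception v3 expects 299x299 inputs
        transforms.ToTensor(),                            # float tensor in [0, 1]
        transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics; transform_input
                             std=[0.229, 0.224, 0.225]),  # then rescales channels internally
    ])

    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 299, 299]

    with torch.no_grad():
        feats_8x8, feats_vec, logits = model(img)

    print(feats_8x8.shape)  # [1, 2048, 8, 8] -- the grid an attentive decoder looks at
    print(feats_vec.shape)  # [1, 2048]       -- the vector used for plain captioning (x_for_capt)
    print(logits.shape)     # [1, 1000]       -- ImageNet class logits from the original head

Saving `feats_8x8` for every image yields the `[batch_size, 2048, 8, 8]` tensors that the advanced notebook's attention operates on, while `feats_vec` is the pooled feature named `x_for_capt` above.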
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Advanced task: image captioning with visual attention\n",
"\n",
"![img](https://i.imgur.com/r3r0fS4.jpg)\n",
"\n",
"This task is designed for folks who have prior experience with deep NLP. If you're new to this, you'll be better off sticking to the basic track. It also requires persistent storage space; running it on Colab is possible but quite complicated.\n",
"\n",
"__This task__ walks you through all the steps required to build an attentive image captioning system. Except this time, there are no `<YOUR CODE HERE>`'s. You write all the code.\n",
"\n",
"You are free to approach this task in any way you want. Follow our step-by-step guide or abandon it altogether. Use the notebook or add extra .py files (remember to add them to your anytask submission). The only limitation is that your code should be readable and runnable top-to-bottom.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: image preprocessing\n",
"\n",
"First, you need to prepare images for captioning. Just like in the basic notebook, you are going to use a pre-trained image classifier from the model zoo. Let's go to the [`preprocess_data.ipynb`](./preprocess_data) notebook and change a few things there. This stage is mostly running the existing code with minor modifications.\n",
"\n",
"1. Download the data someplace where you have enough space. You will need around 100 GB for the whole thing.\n",
"2. Pre-compute and save Inception activations at the layer directly __before the average pooling__.\n",
" - the correct shape should be `[batch_size, 2048, 8, 8]`. Your LSTM will attend to that 8x8 grid.\n",
"\n",
"\n",
"__Note 1:__ Inception is great, but not the best model in the field. If you have enough courage, consider using ResNet or DenseNet from the same model zoo. Just remember that different models may require different image preprocessing.\n",
"\n",
"__Note 2:__ Running this model on CPU may take days. You can speed things up by processing data in parts using Colab + Google Drive. Here's how to do that: https://colab.research.google.com/notebooks/io.ipynb"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"<...>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: sub-word tokenization\n",
"\n",
"While it is not strictly necessary for image captioning, you can generally improve generative text models by using sub-word units. There are several sub-word tokenizers available in open source (BPE, WordPiece, etc.).\n",
"\n",
"* __[recommended]__ A BPE implementation you can use: [github_repo](https://github.com/rsennrich/subword-nmt).\n",
"* Theory on how it works: https://arxiv.org/abs/1508.07909\n",
"* We recommend starting with __4000 BPE rules__.\n",
"* The result@@ ing lines will contain splits for rare and mis@@ typed words like this: ser@@ endi@@ pity\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"<...>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: define attentive decoder\n",
"\n",
"Your model works similarly to the normal image captioning decoder, except that it has an additional mechanism for peeping into the image at each step. We recommend implementing this mechanism as a separate Attention layer, inheriting from `nn.Module`. Here's what it should do:\n",
"\n",
"![img](https://camo.githubusercontent.com/1f5d1b5def5ab2933b3746c9ef51f4622ce78b86/68747470733a2f2f692e696d6775722e636f6d2f36664b486c48622e706e67)\n",
"\n",
"\n",
"__Input:__ 8x8=64 image encoder vectors $h^e_0, h^e_1, h^e_2, ..., h^e_{63}$ and a single decoder LSTM hidden state $h^d$.\n",
"\n",
"* Compute logits with a 2-layer neural network with tanh activation (or anything similar):\n",
"\n",
"$$a_t = linear_{out}(tanh(linear_{e}(h^e_t) + linear_{d}(h^d)))$$\n",
"\n",
"* Get probabilities from the logits:\n",
"\n",
"$$ p_t = {{e ^ {a_t}} \\over { \\sum_\\tau e^{a_\\tau} }} $$\n",
"\n",
"* Sum the encoder states weighted by these probabilities to get the __attention response__:\n",
"$$ attn = \\sum_t p_t \\cdot h^e_t $$\n",
"\n",
"You can now feed this $attn$ to the decoder LSTM, concatenated with the previous token embedding.\n",
"\n",
"__Note 1:__ If you need more information on how attention works, here's [a class in attentive seq2seq](https://github.com/yandexdataschool/nlp_course/tree/master/week04_seq2seq) from the NLP course.\n",
"\n",
"__Note 2:__ There's always a choice of whether to initialize the LSTM state with image features or with zeros. We recommend zeros: it is a good way to check whether your attention is working, and it usually produces better-looking attention maps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"<...>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: training\n",
"\n",
"The training procedure for your model is no different from the non-attentive captioner in the base track: iterate over minibatches, compute the loss, backprop, and step the optimizer.\n",
"\n",
"Feel free to use the [`basic track notebook`](./homework04_basic_part2_image_captioning) for \"inspiration\" :)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"<...>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final step: show us what it's capable of!\n",
"\n",
"The task is exactly the same as in the base track _(with the exception that you don't have to deal with salary prediction :) )_\n",
"\n",
"\n",
"__Task: Find at least 10 images to test it on.__\n",
"\n",
"* Seriously, that's part of the assignment. Go get at least 10 pictures for captioning.\n",
"* Make sure it works okay on __simple__ images before going to something more complex.\n",
"* Your pictures must feature both successful and failed captioning. Get creative :)\n",
"* Use photos, not animation/3D/drawings, unless you want to re-train the CNN on anime.\n",
"* Mind the aspect ratio."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# apply your network on images you've found\n",
"#\n",
"#\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What else to try\n",
"\n",
"If you've made it this far, you're awesome and you should know it already. All the tasks below are completely optional and may take a lot of your time. Proceed at your own risk.\n",
"\n",
"#### Hard attention\n",
"\n",
"* There are more ways to implement attention than simple softmax averaging. Here's [a lecture](https://www.youtube.com/watch?v=_XRBlhzb31U) on that.\n",
"* We recommend starting with [Gumbel-softmax](https://blog.evjang.com/2016/11/tutorial-categorical-variational.html) or [sparsemax](https://arxiv.org/abs/1602.02068) attention.\n",
"\n",
"#### Reinforcement learning\n",
"\n",
"* After your model has been pre-trained with teacher forcing, you can fine-tune it for captioning-specific metrics like CIDEr.\n",
"* Tutorial on RL for sequence models: [practical_rl week7](https://github.com/yandexdataschool/Practical_RL/tree/spring19/week7_seq2seq)\n",
"* Theory: https://arxiv.org/abs/1612.00563\n",
"\n",
"#### Chilling out\n",
"\n",
"This is the final and most advanced task in the DL course. And if you're doing this with the on-campus YSDA students, it should be late spring by now. There's got to be a better way to spend a few days than coding another deep learning model. If you have no idea what to do, ask Yandex. Or your significant other.\n",
"\n",
"![img](https://imgs.xkcd.com/comics/computers_vs_humans.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
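The attention layer described in Step 3 of the notebook above lends itself to a short reference sketch. The following is a minimal implementation of exactly those formulas (a two-layer tanh scorer, a softmax over the 64 grid positions, and a probability-weighted sum); it is not part of this commit or the assignment's reference solution, and the layer sizes (`enc_size=2048`, `dec_size=512`, `hid_size=256`) are illustrative assumptions rather than values prescribed by the course.

    # Sketch of the Step 3 attention layer (sizes are assumptions, not prescribed).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Attention(nn.Module):
        """Additive attention over the 8x8 grid of Inception features."""

        def __init__(self, enc_size=2048, dec_size=512, hid_size=256):
            super().__init__()
            self.linear_e = nn.Linear(enc_size, hid_size)   # projects encoder vectors h^e_t
            self.linear_d = nn.Linear(dec_size, hid_size)   # projects decoder state h^d
            self.linear_out = nn.Linear(hid_size, 1)        # one logit a_t per grid location

        def forward(self, enc_states, dec_state):
            # enc_states: [batch, 64, enc_size] -- the 8x8 grid flattened to 64 vectors
            # dec_state:  [batch, dec_size]     -- current LSTM hidden state
            logits = self.linear_out(torch.tanh(
                self.linear_e(enc_states) + self.linear_d(dec_state).unsqueeze(1)
            )).squeeze(-1)                                       # [batch, 64]
            probs = F.softmax(logits, dim=-1)                    # p_t
            attn = (probs.unsqueeze(-1) * enc_states).sum(dim=1) # [batch, enc_size]
            return attn, probs                                   # probs are useful for attention maps

At each decoding step you would concatenate `attn` with the previous token embedding before feeding the LSTM cell, as the notebook suggests; returning `probs` as well makes it easy to plot the 8x8 attention maps mentioned in Note 2.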
