This post is the last part of the

robotics/duckietownseries.

See my github repository for the code in this document.

# Introduction

This is the final post in the *Robotics/DuckieTown* series. For those of you who have made it all the way through, I hope that you enjoyed the series.

I think that I’ve saved the best post for last. In this post, we’re going to be using Reinforcement Learning (RL) methods called *General Value Functions* (GVFs) to predict which apriltag we are going to see next when following a set policy.

If you don’t know much about RL, this post may be a bit difficult for you. I suggest first looking into RL before continuing. Some great resources are the RL Book by Sutton and Barto or the RL Coursera Course (which I currently moderate).

Before we begin, let’s briefly discuss exactly what we are going to be doing in this post. We’re going to have our DuckieBot follow a set policy. You can use any policy you’d like (e.g. you could use the lane following code we wrote previously). The policy I’ve used is to drive clockwise around DuckieTown (in a circle) always. I’ve simply used manual keyboard control to implement this.

Our DuckieBot will follow this set policy, and it will predict which apriltag it will see next using GVFs. I am predicting four different apriltags (hence, using four GVFs) and use seven apriltags in total for apriltag localization. For the state representation, we are going to break the town into a grid and use both odometry and apriltag localization to figure out exactly which grid cell we are in. The grid cell will serve as the state representation for the GVF learners.

In you’re interested to learn more about how GVFs have been used in Robotics, see my survey.

# An Introduction to General Value Functions

In this section, we briefly review RL and value functions before discussing general value functions.

## A Brief Refresher on Reinforcement Learning

In RL, we formalize the interaction between the agent and environment using Markov Decision Processes (MDPs). An MDP is a 5-tuple $(\mathscr{S}, \mathscr{A}, \mathscr{P}, \mathscr{R}, \gamma)$:

- $\mathscr{S}$ is the set of all states
- $\mathscr{A}$ is the set of all actions
- $\mathscr{P}: \mathscr{S} \times \mathscr{A} \times \mathscr{S} \to [0, \infty)$ is the one-step transition dynamics
- $\mathscr{R}: \mathscr{S} \times \mathscr{A} \times \mathscr{S} \to \mathbb{R}$ is the reward function
- $\gamma$ is the discount factor, weighting the relative importance of immediate and future rewards

At each discrete time step $t = 0, 1, 2, 3, 4, 5, …$ an intelligent agent finds itself in some state $S_t$. The agent samples an action $A_t \sim \pi(\cdot \mid S_t)$ from its stochastic policy $π: \mathscr{S} \to \Delta_\mathscr{A}$, a mapping from states to probability distributions over actions (here, we’ve used the notation $\Delta_\mathscr{A}$ to denote the set of all probability distributions on $\mathscr{A}$). After taking action $A_t$, the agent is transitioned to a new state $S_{t+1} \in \mathscr{S}$ based on the transition dynamics $\mathscr{P}$ and obtains a scalar reward $R_{t+1} \in \mathscr{R}$. The overall goal is to learn which actions to select in order to maximize the rewards obtained over time.

To learn how to maximize reward over time, many RL algorithms use value functions. In RL, we generally talk about two different kinds of value functions. A state-value function measures the expected sum of discounted, future rewards when arriving in some state $s \in \mathscr{S}$ (by any means) and following policy $\pi$ thereafter:

\[v(s) = \mathbb{E}_{\pi} [G_t \mid S_t = s]\]where

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1}\]Similar to the state-value function, an action value function measures the expected sum of discounted, future rewards when arriving in some state $s \in \mathscr{S}$ (by any means), taking action $a \in \mathscr{A}$, and then following policy $\pi$ thereafter:

\[q(s, a) = \mathbb{E}_{\pi} [G_t \mid S_t = s, A_t = a]\]These value functions form the basis of all Temporal Difference learning algorithms. Many popular algorithms such as Sarsa, Q-Learning, DQN, SQL, SAC, DDPG, D4PG, DPG, TD3, and others utilize such value functions.

State- and action-value functions measure the possible future **reward** obtained when following a policy. What happens if, instead of measuring reward, these functions measure some other signal? For example, what if a value function measured the current drawn from a motor on a robot or the current apriltag in view by a robot’s camera? In such a case, we get *general value functions*.

## General Value Functions

I’ve referenced [2] for this section.

Value functions predict the expected sum of discounted future rewards for a given policy. In contrast general value functions [1] (GVFs) make predictions with respect to four functions:

- The
*policy*$π: \mathscr{S} \to \Delta_\mathscr{A}$ - The
*cumulant*$C: \mathscr{S} \times \mathscr{A} \times \mathscr{S} \to \mathbb{R}$ is the signal that we are tyring to predict - The
*continuation factor*or*termination*: $\gamma: \mathscr{S} \to \in [0, 1]$ - The
*terminal value*$z: \mathscr{S} \to \mathbb{R}$

These four functions uniquely define the *predictive question* that the GVF answers. For example, a GVF can be used to answer the following questions on a DuckieBot:

*Over the next 10 seconds, how much current will my wheel motors draw?**If I continue straight indefinitely, how long until I hit a wall?**If I continue on-policy, which apriltag will I see next?*

There are two main types of GVFs corresponding to state- and action-value functions. The state-value GVF makes a prediction based on the current state we are in:

\[\tilde{v}(s) = \mathbb{E}_{\pi} [C_{t+1} + \dots C_{t+k} + z(S_{t+k}) \mid S_t = s, k \sim \gamma ]\]The action-value GVF makes a prediction based on an action selected in the current state we are in:

\[\tilde{q}(s, a) = \mathbb{E}_{\pi} [C_{t+1} + \dots C_{t+k} + z(S_{t+k}) \mid S_t = s, A_t = a, k \sim \gamma ]\]where $s$ is the current state, $a$ is the current action, and discounting is implicit.

GVFs also satisfy the Bellman equations:

\[\tilde{q}(s, a) = \sum\limits_{s'} p(s' \mid s, a) \left[ C(s, a, s') + \gamma(s') \sum\limits_{a'} \pi(a' \mid s') \tilde{q}(s', a') + (1 - \gamma(s'))z(s') \right]\]and have a corresponding TD error:

\[\delta_t = C(S_t, A_t, S_{t+1}) + γ(S_{t+1}) \sum\limits_{a'} \left( \pi(a' \mid S_{t+1}) \tilde{q}(S_{t+1}, a') \right) + (1 - \gamma(S_{t+1})) z(S_{t+1}) - \tilde{q}(S_t, A_t)\]Therefore, we can learn GVFs with the TD learning algorithms we already know! For my implementation of the GVF learners, I’ve simply used the `TD(0)`

algorithm!

You may have noticed the strange notation above $k \sim \gamma$. This is really an abuse of notation, but indicates that $k$ follows a distribution induced by $\gamma$. This distribution is a geometric distribution with probability of success (i.e. successfully terminating the episode) $(1 - \gamma)$ [3].

### Simplifying the GVF Formulation

If you search the GVF literature, you may see another GVF formulation, one without the termination value $z(s)$. Actually, it turns out that the termination value can be absorbed into the cumulant [4]!

In this case, we say that a GVF is a predictive question defined by *three* functions:

- The
*policy*$π: \mathscr{S} \to \Delta_\mathscr{A}$ - The
*cumulant*$C: \mathscr{S} \times \mathscr{A} \times \mathscr{S} \to \mathbb{R}$ is the signal that we are tyring to predict - The
*continuation factor*or*termination*: $\gamma: \mathscr{S} \to \in [0, 1]$

In this case, GVFs have simplified equations. The GVF for state-values is:

\[\tilde{v}(s) = \mathbb{E}_{\pi} [C_{t+1} + \gamma(S_{t+1}) \tilde{v}(S_{t+1}) \mid S_t = s ]\]and the GVF for action-values is:

\[\tilde{q}(s, a) = \mathbb{E}_{\pi} [C_{t+1} + \gamma(S_{t+1}) \tilde{q}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]\]### The Termination Trick

In RL, it is common to give no rewards for all transitions in an environment except for the terminal transition. Upon transitioning to a terminal state, a +1 reward is given for winning (or solving the problem/environment) and a -1 for losing (or not solving the problem/environment). This kind of reward scheme will encourage the agent to solve the problem.

Here, *transitioning to a terminal state* refers to a complete stop to the agent-environment interaction. This interaction is completely over, at least for the current episode. A new episode must be started for the agent-environment interaction to continue.

But termination can be seen in an alternate way. We can consider that the environment actually never stops; the agent-environment interaction never ends, but rather the discount factor becomes 0 after the terminal state is reached. When viewing termination in this way, we can also separate the agent-environment interaction into episodes. Upon transitioning into the terminal state, we set the discount factor to 0 before updating the agent. We then transition the agent to one of the starting states and reset the discount factor to its previous value.

Because of this alternative view, we can escape from exponential discounting in GVFs by using no discounting $(\gamma = 1)$ until we reach a terminal state, then use complete discounting ($\gamma = 0$) for the transition into the terminal state:

\[\gamma(s) = \begin{cases} 1 \quad \text{if $s$ is not terminal} \\ 0 \quad \text{if $s$ is terminal} \end{cases}\]### The Pseudo-Termination Trick

I’ve referenced [3] for this section. See this reference for more details.

Often, we may like to ask questions like *How much current will my wheel motors draw over the next N timesteps?*. To answer such questions, we could use the termination trick which we just discussed to measure the wheel motor current (i.e. the cumulant) over $N$ sequential steps and then terminate using a termination factor of $0$ after these $N$ steps. We could continue this over and over again to predict the answer to this predictive question.

But this kind of termination is wasteful of data. Instead, we can use the *pseudo-termination trick*. We can consider $\bar{\gamma} = (1 - \gamma)$ as the probability of successfully terminating on the next step, following a geometric distribution. The mean of this geometric distribution is $\frac{1}{\bar{\gamma}}$ and supplies the number of steps until successfully terminating, in expectation.

Therefore, if we are interested in answering the above question, *How much current will my wheel motors draw over the next N timesteps?*, we just need to set $\gamma$ to satisfy the equation:

# Implementing Apriltag Prediction

Okay! Now that we know a bit more about GVFs, it’s time to implement apriltag detection using GVFs. We’re going to be using a very simple implementation for this exercise. In particular, we’re going to consider DuckieTown to be a gridworld. Each foam mat in DuckieTown will be a single grid cell. Figure 1 shows the DuckieTown I’m working with. It is a $5 \times 3$ grid of foam mats, and hence I’m considering the town to have 15 states, each corresponding to a single cell in the $5 \times 3$ gridworld. The state representation that we will provide to our learning algorithm will be the grid cell’s index, shown in Figure 1 as white integers.

* Figure 1: My DuckieTown setup. We separate DuckieTown into a grid, where each foam mat is a single cell in the grid. My DuckieTown has 5 rows and 3 columns of foam mats, hence I’ve separated it into a $5 \times 3$ grid. Integers within the cells denote the cell index.*

### Specifying the GVF Question

We are going to be answering the following question using GVFs: *Will I see apriltag ID next?*, where

`ID`

is the ID of an apriltag. We will have one question for each unique apriltag and one GVF per question. I am going to be using four unique apriltags, with the IDs 8, 22, 63, and 67 and will have one GVF for each apriltag (four in total).For each GVF, We will use a continuation factor of $\gamma = 1$ everywhere, except in states where we can see an apriltag. These states will have a continuation factor $\gamma = 0$ (i.e., they are considered as terminal states) and will have the terminal value $z(s) = 1$ if the apriltag is the one associated with the GVF and a terminal value of $z(s) = 0$ otherwise. For example, for the GVF associated with apriltag 8:

\[z(s) = \begin{cases} 1 \quad \text{if we can see apriltag 8} \\ 0 \quad \text{otherwise} \end{cases}\]The cumulant will be 0 everywhere^{1}. The policy is going to be to proceed in a clockwise fashion around DuckieTown (I’ll use manual control), and we will start the DuckieBot at $(x, y) = (0.32, 0.32)$ with a yaw angle of $\frac{\pi}{2}$, which corresponds to state $0$ in our gridworld. We’ll initialize all weights to be 0, meaning that we suspect that we will not see any apriltags next.

### Determining the DuckieBot’s Grid Cell

There is one final important piece to this puzzle. How can we figure out which cell of the total 15 grid cells the DuckieBot is in (remember, this is our state representation for the GVF learners)? For this, we will be using the odometry and apriltag localization code that we wrote previously. When we can see an apriltag, we’ll localize with it; otherwise we will use odometry. Based on these $(x, y)$ coordinates, we can figure out which grid cell we are in. Finally, if our odometry estimates that the DuckieBot is outside of DuckieTown, then we will project our estimated position back to DuckieTown before estimating the grid cell that we are in. This can happen in a few scenarios:

- $x < 0$ or $x > 3 * \text{mat width}$
- $y < 0$ or $y > 5 * \text{mat height}$

My apriltags are placed in the following locations (see Figure 1):

ID | (x, y, z) | (yaw, pitch, roll) | Predicted? |
---|---|---|---|

8 | 1.58, 0.17, 0 | 3.93, 0, -1.57 | Yes |

15 | 0.66, 1.83, 0 | 0, 0, -1.57 | No |

22 | 0.17, 0.20, 0 | 2.35, 0, -1.57 | Yes |

59 | 1.17, 1.83, 0 | 3.142, 0, -1.57 | No |

63 | 0.215, 2.815, 0 | 0.785, 0, -1.57 | Yes |

67 | 1.65, 2.755, 0 | -0.785, 0, -1.57 | Yes |

76 | 0.66, 1.17, 0 | 0, 0, -1.57 | No |

Finally, keep in mind that odometry is inaccurate. Based on our odometry, we may think that the robot is in grid cell $i$ when really it is in grid cell $j$ with $i \ne j$. This will affect the quality and accuracy of our GVF predictions. Unfortunately, there’s not much more we can do about this other than using more apriltags or other localization techniques.

## Localization

For localization, we already have all the code we need. Create a new DuckieTown project and copy all the code from Localization using Sensor Fusion. That’s it for this part! We now have code that will recognize apriltags, localize with apriltags when they are available, and localize with odometry when apriltags are not available!

You can get rid of the `led_emitter`

package as we won’t be needing it for the rest of the post. I find that the performance of the DuckieBot is quite a bit better when you remove this package. You’ll also have to make the appropriate changes in the launch files and the `apriltag`

package, which changes the LEDs based on the apriltag in view. My code for this project already has the `led_emitter`

package removed, so you could also grab my implementation.

I’ve modified my implementation to wait a some time before publishing the updated bot position from apriltag localization. This is because I found that the apriltag localization took a while to become accurate and resulted in multiple (imaginary) jumps in the bot’s predicted location. This significantly reduced the accuracy of the GVF learners because it appeared as if the bot was moving from one state to another (sometimes these jumps happened between states that were not even connected, and so the transition was impossible). These transitions actually did not actually happen in real life.

This modification is live on my Github repo in the directory for this project.

## Learner Node

The final part of our implementation is to create the GVF learner node, which I’ve created in the `learner`

package (see my Github repo). I’m not going to go into detail on the implementation for two reasons. First, I want this blog post to focus on whether these predictions are possible on a DuckieBot and how the DuckieBot performs at answering these predictive GVF questions. Second, I think the implementation is fairly straight forward if you know basic RL concepts, I’m just using TD(0). I’ve included the code for this GVF learner node in the following code block:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279

#!/usr/bin/env python3
import pickle
import cv2
import rospy
import rosbag
import numpy as np
import time
from std_msgs.msg import String, Int64, Float64
import pickle
from geometry_msgs.msg import Pose
from duckietown.dtros import DTROS, NodeType, TopicType, DTParam, ParamType
from duckietown_msgs.msg import AprilTagDetectionArray, AprilTagDetection
SPIN_OR_RATE = "spin" # Whether to update in a callback or at a set rate
class LearnerNode(DTROS):
def __init__(self, node_name):
super(LearnerNode, self).__init__(
node_name=node_name, node_type=NodeType.PERCEPTION
)
self.node_name = node_name
self._current_tag = -1
self.previous_state = None
self.current_state = None
self._current_state = None
self.current_state_same_count = 0
self.current_state_same_count_thresh = DTParam(
"~current_state_same_count_thresh",
param_type=ParamType.INT,
default=3,
)
self._lr = rospy.get_param("~learning_rate", None)
self._tags_used = rospy.get_param("~tags_used", None)
self._pose = None
self._town_dims = (1.83, 3.0) # metres
self._n = len(self._tags_used)
self._grid_dims = rospy.get_param("~grid_dims", None)
self._n_features = np.prod(self._grid_dims)
# create subscribers
self._tag_sub = rospy.Subscriber(
"~detections",
AprilTagDetectionArray,
self._tag_cb,
queue_size=1,
)
self.pub = rospy.Subscriber(
"~odom", Odometry, self._odom_cb, queue_size=1,
)
self._switch = rospy.get_param("~switch", False)
self.srv_set_pattern_ = rospy.Service(
"~stop_learning", SetBool, self._stop_service,
)
self._filename = time.time()
self._bag = rosbag.Bag(f"/data/bags/learner_{self._filename}.bag", "w")
# Load weights, either from init file or as zeros
self._weights_init = rospy.get_param("~weights", None)
if self._weights_init is None:
self._weights = np.zeros((self._n, self._n_features))
else:
with open(self._weights_init, "rb") as infile:
self._weights = pickle.load(infile)
# Continue writing to rosbag using the previous rosbag cache
self._bag_cache = rospy.get_param("~bag_cache", None)
if self._bag_cache is None:
raise ValueError("must specify bag cache when specifying " +
"weight init")
# Populate new rosbag with old data
self._old_bag = rosbag.Bag(self._bag_cache, "r")
for topic, msg, _ in self._old_bag.read_messages():
self._bag.write(topic, msg)
self._old_bag.close()
def _stop_service(self, msg):
self._switch = msg.data
if not self._switch:
resp = SetBoolResponse()
resp.success = True
resp.message = "learning stopped"
else:
resp = SetBoolResponse()
resp.success = True
resp.message = "learning started"
return resp
def _odom_cb(self, msg):
self._pose = msg.pose.pose
current_state = self._get_state()
# Count the number of times we've seen the current state, only update
# if we've seen it sufficient times
if current_state == self._current_state:
self.current_state_same_count += 1
else:
self._current_state = current_state
self.current_state_same_count = 0
cont = (self.current_state_same_count <
self.current_state_same_count_thresh.value)
if cont:
return
if current_state != self.current_state:
self.previous_state = self.current_state
self.current_state = current_state
# Predict the apriltag that we are going to see
gvf_prediction_ind = np.argmax(
self._weights[:, self.current_state],
)
gvf_prediction = self._tags_used[gvf_prediction_ind]
rospy.loginfo(f"=== Predicting apriltag {gvf_prediction} will " +
"be seen next")
if SPIN_OR_RATE == "spin":
self.learn()
def _tag_cb(self, msg):
if len(msg.detections) == 0:
self._current_tag = -1
return
detection = msg.detections[0]
if detection.tag_id not in self._tags_used:
return
rospy.loginfo(f"=== Saw apriltag {detection.tag_id}")
if detection.tag_id != self._current_tag:
self._current_tag = detection.tag_id
def on_shutdown(self):
rospy.loginfo(f"{self.node_name} received shutdown request")
with open(f"/data/weights_{self._filename}.pkl", "wb") as outfile:
pickle.dump(self._weights, outfile)
self._bag.close()
def get_tag_index(self, i):
try:
return self._tags_used.index(i)
except KeyError:
return None
def _get_state(self):
pose = (self._pose.position.x, self._pose.position.y)
return features(pose, self._town_dims, self._grid_dims)
def _get_reward(self, i):
if self._current_tag < 0:
return 0
return int(i == self._tags_used.index(self._current_tag))
def learn(self):
if not self._switch:
return
if self.previous_state is None:
return
rospy.loginfo("=== learn called ===")
rospy.loginfo(f" prev_state: {self.previous_state}")
rospy.loginfo(f" curr_state: {self.current_state}")
rospy.loginfo(f" curr_tag: {self._current_tag}")
self._bag.write("curr_tag", Int64(self._current_tag))
self._bag.write("prev_state", Int64(self.previous_state))
self._bag.write("curr_state", Int64(self.current_state))
for i in range(self._n):
rospy.loginfo(f" update_weights ({i}) called")
self.update_weights(i)
rospy.loginfo("=====================")
def update_weights(self, i):
"""
Update the weights for the GVF predictor for apriltag i.
"""
weights = self._weights[i, :]
reward = self._get_reward(i) # cumulant + terminal value
# Continuation should be 1 until the agent sees an apriltag, in which
# case the continuation is 0 for all GVFs.
continuation = self._current_tag < 0
v = weights[self.previous_state]
next_v = weights[self.current_state]
δ = reward + continuation * next_v - v
weights[self.previous_state] += self._lr * δ
rospy.loginfo(
f" updated weights: {weights}"
)
rospy.loginfo(f" δ: {δ}")
rospy.loginfo(f" continuation: {continuation}")
self._bag.write(f"weights_{i}", String(str(weights)))
self._bag.write(f"delta_{i}", Float64(δ))
self._bag.write(f"reward_{i}", Float64(reward))
self._bag.write(f"continuation_{i}", Float64(continuation))
def bin(input_, metres_dims, grid_dims):
xi, yi = input_
x, y = metres_dims
xgrid, ygrid = grid_dims
grid_width = x / xgrid
grid_height = y / ygrid
return int(xi // grid_width), int(yi // grid_height)
def bin_to_features(bins, grid_dims):
x, y = bins
xgrid, _ = grid_dims
return int(y * xgrid + x)
def features(input_, metres_dims, grid_dims, prev_state=None):
"""
Return the flattened index of the bin that the bot is currently in
"""
i = bin_to_features(bin(input_, metres_dims, grid_dims), grid_dims)
# Project i back to the legal states
max_ = np.prod(grid_dims) - 1
min_ = 0
if i > max_:
i -= (
i // grid_dims[0] - grid_dims[1] + 1
) * grid_dims[0]
elif i < min_:
i += (grid_dims[0] * (np.abs(i // (grid_dims[0]))))
# Don't allow the state to wrap around the sides of DuckieTown
if prev_state is not None:
if prev_state % grid_dims[0] == 0:
if i < prev_state and i != prev_state - grid_dims[0]:
i = prev_state
elif prev_state % (grid_dims[0] - 1) == 0:
if i > prev_state and i != prev_state + grid_dims[0]:
i = prev_state
return i
if __name__ == "__main__":
rospy.loginfo("learner initializing...")
learner = LearnerNode("learner_node")
rospy.loginfo("done!")
if SPIN_OR_RATE == "spin":
rospy.spin()
else:
rate = rospy.Rate(5)
while not rospy.is_shutdown():
learner.learn()
rate.sleep()

The `LearnerNode`

will take care of training all 4 GVF learners. Additionally, it will keep a ROS bag of the sequence of states, apriltags observed, sequence of weight vectors learned, and other data and save this bag once the node shuts down.

The `LearnerNode`

has a number of hyperparameters which are set in the `packages/learner/config/learner_node/default.yaml`

file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# How many times the current state should be read before updating, helps with
# jumps due to apriltag localization
current_state_same_count_thresh: 3
learning_rate: 0.1 # The learning rate for the GVF leaners
tags_used: [8, 22, 63, 67] # Apriltags to predict
grid_dims: [3, 5] # The dimensions of DuckieTown in foam mat units
# If specified, the LearnerNode will initialize weights to those in the
# following file before beginning.
weights: null
# If specified, initialize the ROS bag with the contents of the ROS bag at this
# file
bag_cache: null
# Whether or not learning should progress, if False, then learning is halted.
# If True, then learning progresses. Useful for offline evaluation.
switch: False

# Running the Code and Analyzing the Performance

I ran the above code and manually controlled the DuckieBot. I drove around DuckieTown approximately 20 times, which was about 45 minutes, and allowed the GVF learners to learn the entire time. After, I turned off learning and continued driving the DuckieBot around town to see its predictions. The video below shows the predictions the DuckieBot makes during this evaluation procedure:

As you can see, the predictions are fairly accurate! You’ll notice that at one point, when the bot is in state 14 (see rviz in the video), it predicts seeing apriltag 67 next, but then sees apriltag 8 next. This is because the bot is predicting that it will see apriltag 67 after it leaves state 14, but while it was in state 14, it saw apriltag 8 (which is expected). The prediction was still accurate because it was predicting what would happen **after** state 14 was exited and not while it was still in state 14. This also happens in state 2.

If you’re interested, here are the bag and weights I generated from this experiment.

## Analyzing the Data

In this section, I’ve moved from indexing apriltags based on ID to apriltag indices in the array

`[8, 22, 63, 67]`

since this is how the code works. Apriltag 8 has index 0, apriltag 22 has index 1, apriltag 63 has index 2, and apriltag 67 has index 3.

Let’s look at the data generated during learning. In [Figure 2] I’ve plotted the root mean squared value error (RMSVE) of the four predictions over time, as well as the average RMSVE over time. You’ll notice that all four GVF learners improved over time: the prediction error reduced during training.

*Figure 2: Prediction error over time for each apriltag prediction. I drove the DuckieBot around DuckieTown for 15 loops (approximately 45 minutes) and learned 4 different GVFs, each of which predicted whether the DuckieBot would see the corresponding apriltag next. As can be seen in the figure, the error in predictions decreases over time.*

Why are two of the errors very low (for apriltags 1 and 3) while the errors of the other two are larger? This is because of the placement of the apriltags. Apriltags 1 and 3 are placed at the extremities of DuckieTown along the x-axis, while apriltags 0 and 2 are placed at the extremities along the y-axis. For apriltags 1 and 3, there are only two states for which these apriltags will be seen next. For apriltags 0 and 2 there are four states for which these apriltags will be seen next. Hence, the prediction error is higher for apriltags 0 and 2 than it is for 1 and 3 since the GVFs need to be accurate in more states for apriltags 0 and 2 than for apriltags 1 and 3.

Another question is why the error for apriltag 0 seems to stay above the error for apriltag 2. This is because of two things: the number of localization apriltags placed along the sequence of states proceeding the prediction apriltag and the location of where these localization apriltags are placed. By *localization apriltag*, I mean an apriltag used only for localization. By *prediction apriltag*, I mean one used for localization and prediction.

*Figure 3: The two longest paths in my DuckieTown. Path Pink has two localization apriltags along it, while path Cyan has only one.*

[Figure 3] shows the path of states before prediction apriltag 2 (path Pink) and the path of states before prediction apriltag 0 (path Cyan). Along path Pink, apriltag 2 will be seen next, and so this is what our GVFs should predict. Along path Cyan, apriltag 0 will be seen next, and this is what our GVFs will hopefully tell us.

As you can see in [Figure 3], I’ve placed two localization apriltags along path Pink. Because of this, the DuckieBot can better localize where it is along path Pink compared to path Cyan, where there is a single localization apriltag. When learning to predict when apriltag 2 will be seen, the DuckieBot has a good idea where it is, and so the state representation it uses to update with is accurate. Conversely, for apriltag 0, the DuckieBot has a poorer sense of where it is, and the state representation it uses to update with is fairly inaccurate, resulting in a more inaccurate GVF for apriltag 0 than for apriltag 2.

Additionally, I tested this setup with only one localization apriltag along each of these paths. For path Pink, I placed the localization apriltag closer to the target apriltag of the GVF’s prediction (apriltag 2). For path Cyan, I placed the localization apriltag farther from the target apriltag of the GVF’s prediction (apriltag 3):

ID | (x, y, z) | (yaw, pitch, roll) | Closer to Target Apriltag for Prediction |
---|---|---|---|

15 | 0.66, 1.83, 0 | 0, 0, -1.57 | Yes |

59 | 1.17, 1.83, 0 | 3.142, 0, -1.57 | No |

I noticed that the error in prediction for apriltag 2 was **still lower** than the error in prediction for apriltag 0. This is likely due to the placement of the localization apriltags. When approaching apriltag 2 along path Pink, the localization apriltag is placed at a distance where more errors in odometry can accumulate. Hence, we correct our pose estimation once the errors in odometry have built up a lot, and our state representations will be more accurate. On the other hand, when approaching apriltag 0 along path Cyan, the localization apriltag is placed at a distance where there is not much error in the odometry estimate. Hence we aren’t correcting the errors in odometry enough, and our state representation is fairly inaccurate, resulting in a more inaccurate GVF.

Next, let’s consider Figure 4. This figure shows apriltag predictions and actual apriltags observed over time. The dark lines indicate which apriltag the bot believes it will see next, while the coloured symbols denote which apriltag was seen at that point in time. Ideally, we would like the dark lines to precede the next symbol as time increases. Once that symbol is reached, we would like the dark line to jump to just before the next symbol seen in time.

What we do see is that at the beginning of training, the predictions are fairly random. As training progresses, the bot can detect just before an apriltag is observed. At the end of training, the bot quite accurately predicts the next apriltag a number of states into the future. You’ll notice that at some points (especially along the horizontal line corresponding to apriltag 3), the bot predicts seeing the apriltag but then does not see it at all. This is because I turned the bot too fast so that it did not have time to actually see the apriltag. Had I been more careful, the bot’s prediction would have been accurate.

Unfortunately, as can be seen in the Figure 4, some of the dark lines overlap, appearing as if the bot is predicting to see multiple apriltags next. This is an artifact of the plotting software I used.

*Figure 4: Apriltag prediction and actual apriltags seen. The dark lines indicate which apriltag the bot believes it will see next (with ties broken randomly), after leaving its current state (time is on the x-axis). The coloured symbols denote the actual apriltag observed. At the beginning of training, apriltag predictions are random, but by the end of training the bot predicts fairly accurately which tag it will see next.*

Finally, consider Figure 5. This figure is a plot, for each GVF, of state value (i.e. whether the corresponding apriltag will be seen in the next state) vs time. A value of 1 means that the GVF is certain that the corresponding apriltag will be seen next, and a value of 0 indicates certainty that the corresponding apriltag will not be seen next. I’ve also included the actual observation: if the apriltag corresponding to the GVF was seen. We would like each prediction to rise **just before** observing the correct apriltag. Once the apriltag is observed the prediction should fall down to 0.

At the beginning of training, the bot does not think seeing any apriltag is likely. But, the bot quickly corrects its predictions, and by the end of training, the bot can accurately anticipate which apriltag it will see next. Similar to what we saw in Figure 4, we see here that at certain points the bot anticipates seeing an apriltage (e.g. apriltag 3), but then does not observe the apriltag at all. Once again, this is because of how I controlled the bot.

*Figure 5: Predictions of seeing each apriltag over time. By the end of training, the bot correctly anticipates which apriltag it will see next.*

# Conclusion

In this blog post, we attempted to use general value functions (GVFs) to predict which apriltag a DuckieBot would observe next when following a set policy. This is sometimes called *nexting* [4], predicting what will happen next.

We started by reducing DuckieTown to a gridworld. We then constructed one GVF learner for each of four apriltags. Each GVF learner asked: *Will we see my apriltag next?*.

We used odometry and apriltag localization to figure out where in DuckieTown the bot was, and used this pose estimate to estimate which grid cell the DuckieBot was in. This served as the state representation for our GVF learners.

By reducing DuckieTown to a gridworld, we were able to successfully next four apriltags with a high degree of accuracy.

# References

[1] Richard Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick Pilarski, Adam White, and Precup Doina. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. The 10th International Conference on Autonomous Agents and Multiagent Systems, 2011.

[2] Richard Sutton. AGI Lecture Slides. Link.

[3] Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White, and Martha White. General Value Function Networks. Journal of Artificial Intelligence Research. 2021.

[4] Multi-Timescale Nexting in a Reinforcement Learning Robot. Joseph Modayil, Adam White, Richard Sutton. Proceedings of the Conference on the Simulation of Adaptive Behaviour. 2012.

# Footnotes

In the simplified GVF formulation, these GVFs are equivalent to having $C(s) = 0$ for all non-terminal $s \in \mathscr{S}$ and $C(s) = 1$ for all terminal $s \in \mathscr{S}$. The implementation reflects this. ↩