Notes on reading Planning-oriented Autonomous Driving

Source

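Overview

Current autonomous driving systems are usually built by decomposing the problem into modular tasks such as Perception, Prediction, and Planning, but information is lost at the interfaces between those modules. This paper proposes UniAD, which keeps a degree of modularity while connecting the modules through queries in the Transformer sense, giving a roughly end-to-end architecture.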

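Where UniAD stands

First, the ways of splitting the system into modules such as Perception, Prediction, and Planning can be classified as in the comparison figure in the paper.

UniAD is end-to-end as a neural network, yet it is still properly modularized, and the modules are connected in a reasonably plausible structure. That is where its novelty and advantage lie.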

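UniAD in detail

The overall architecture figure in the paper says almost everything. UniAD consists of a Backbone, four Transformer-based modules, and one Planner.

The key point is that information is passed between modules as queries.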
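To make the query-passing idea concrete for myself, here is a minimal data-flow sketch. Every function is a hypothetical stub that returns random arrays of the right shape, so it only shows which output feeds which module as described in these notes. It is not the authors' implementation. `N_A` and the small 50 x 50 BEV grid are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_A, N_M, K = 256, 4, 300, 6   # N_A is arbitrary; the others follow the notation tables
T_O, T_P = 5, 6

def track_former(bev):
    """Stub: returns agent queries Q_A and agent positions P_A."""
    return rng.normal(size=(N_A, D)), rng.normal(size=(N_A, D))

def map_former(bev):
    """Stub: returns map queries Q_M."""
    return rng.normal(size=(N_M, D))

def motion_former(q_a, q_m, bev):
    """Stub: returns one motion query per agent and mode."""
    return rng.normal(size=(q_a.shape[0], K, D))

def occ_former(q_a, p_a, q_x, bev):
    """Stub: returns one occupancy map per future timestep."""
    h, w = bev.shape[:2]
    return rng.random(size=(T_O, h, w))

def planner(command, bev, occ):
    """Stub: returns planned (x, y) waypoints."""
    return rng.normal(size=(T_P, 2))

bev = rng.normal(size=(50, 50, D))      # small BEV grid to keep the demo light
q_a, p_a = track_former(bev)            # queries, not boxes, are handed downstream
q_m = map_former(bev)
q_motion = motion_former(q_a, q_m, bev)
q_x = q_motion.max(axis=1)              # max-pool over the K modes
occ = occ_former(q_a, p_a, q_x, bev)
plan = planner("go straight", bev, occ)
print(q_a.shape, q_m.shape, q_motion.shape, occ.shape, plan.shape)
```

The point of the sketch is that each downstream module receives 256-dimensional query tensors rather than post-processed boxes or polylines.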

Backbone

A pretrained BEVFormer model is used. Any other model would do as long as it produces features in BEV space.

References

[55] : BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

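TrackFormer

Using queries (learnable embeddings) that attend to the BEV feature B, it detects and tracks objects.

Initialization (detection) queries: responsible for detecting agents that appear for the first time in the current frame
Track queries: re-detect in the current frame the agents that were detected in previous frames
Ego-vehicle query: explicitly models the ego vehicle itself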
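The hand-off between the query types can be sketched as below. This is a simplified sketch under my own assumptions: the function name `update_track_queries` and both thresholds are hypothetical, and the ego-vehicle query is left out. Detection queries with a high enough score become new tracks, and track queries with a high enough score are carried to the next frame.

```python
import numpy as np

def update_track_queries(det_feats, det_scores, trk_feats, trk_scores,
                         new_thresh=0.4, keep_thresh=0.35):
    """Return the track queries carried over to the next frame.

    new_thresh and keep_thresh are assumed values for illustration only.
    """
    born = det_feats[det_scores > new_thresh]     # newly appeared agents
    alive = trk_feats[trk_scores > keep_thresh]   # agents still being tracked
    return np.concatenate([alive, born], axis=0)

rng = np.random.default_rng(0)
det_feats = rng.normal(size=(5, 256))
det_scores = np.array([0.9, 0.1, 0.5, 0.2, 0.05])
trk_feats = rng.normal(size=(3, 256))
trk_scores = np.array([0.8, 0.2, 0.6])

next_queries = update_track_queries(det_feats, det_scores, trk_feats, trk_scores)
print(next_queries.shape)  # (4, 256): 2 surviving tracks + 2 new detections
```

This is why $N _ a$ is dynamic: the number of queries handed downstream changes from frame to frame.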

Notation	Shape and Params	Description
$N _ a$	dynamic	number of agents from TrackFormer
$Q _ A$	$N _ a \times 256$	agent features from TrackFormer
$P _ A$	$N _ a \times 256$	agent positions from TrackFormer
layers	6	number of transformer decoder layers for TrackFormer

MapFormer

Detects lanes, dividers, and crossings as things, and the drivable area as stuff. During training every layer is supervised, while at inference only the output of the final layer is used.

Notation	Shape and Params	Description
$N _ m$	300	number of map queries from MapFormer
$Q _ M$	$N _ m \times 256$	map features from MapFormer
layers1	6	number of transformer decoder layers for MapFormer
layers2	4	number of mask decoder layers for MapFormer

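MotionFormer

Predicts a trajectory for each agent. Rather than predicting each agent separately, it predicts all of them jointly in one pass. The top-k trajectories are predicted in a scene-centric manner. For $N _ a$ agents it predicts $\mathcal{K}$ candidate trajectories each, with $(x, y)$ coordinates at $T$ timesteps, so $N _ a \times \mathcal{K} \times T \times 2$ floats come out.

The Transformer used here is slightly unusual: it accounts for three kinds of interaction, agent-agent, agent-map, and agent-goal. These are computed with a somewhat messy mix of Self-Attention and Cross-Attention, plus Deformable Attention for agent-goal. Below, MHSA and MHCA denote multi-head self-attention and multi-head cross-attention.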
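A quick shape check of that output, as a sketch with random numbers. $\mathcal{K} = 6$ and $T = 12$ follow the notation table, while `N_a = 4` and the per-mode `scores` array are my own assumptions for illustration.

```python
import numpy as np

N_a, K, T = 4, 6, 12                       # N_a is arbitrary here
rng = np.random.default_rng(0)
trajs = rng.normal(size=(N_a, K, T, 2))    # (x, y) per agent, mode, and timestep
scores = rng.random(size=(N_a, K))         # hypothetical per-mode confidence

best_mode = scores.argmax(axis=1)              # (N_a,)
best_traj = trajs[np.arange(N_a), best_mode]   # (N_a, T, 2)
print(trajs.size, best_traj.shape)  # 576 (4, 12, 2)
```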

$Q _ a = \mathrm{MHCA} (\mathrm{MHSA}(Q), Q _ A)$
$Q _ m = \mathrm{MHCA} (\mathrm{MHSA}(Q), Q _ M)$
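$Q _ g = \mathrm{DeformAttn}(Q, \hat{x} ^ {l-1} _ T, B)$

Here $\hat{x} ^ {l-1} _ T$ is the endpoint of the trajectory predicted by the previous layer, and DeformAttn gathers information from B around that point. $Q _ a, Q _ m, Q _ g$ are concatenated and passed through an MLP to produce $Q _ \mathrm{ctx}$, which is fed to the next layer. When it is fed in, the query position embedding $Q _ \mathrm{pos}$ is added to it. $Q _ \mathrm{pos}$ is given by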

$Q _ \mathrm{pos} = \mathrm{MLP}(\mathrm{PE}(I ^ s)) + \mathrm{MLP}(\mathrm{PE}(I ^ a)) + \mathrm{MLP}(\mathrm{PE}(\hat{x} _ 0)) + \mathrm{MLP}(\mathrm{PE}(\hat{x} ^ {l-1} _ T))$
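The formulas above can be sketched in NumPy as follows. This is only a shape-level sketch under several assumptions of mine: PE is taken to be a standard sinusoidal encoding, the MLPs are two-layer ReLU networks with random untrained weights, every positional input is reduced to one 2D point per (agent, mode), and $Q _ a, Q _ m, Q _ g$ are random stand-ins for the attention outputs. The helper names `sinusoidal_pe` and `make_mlp` are hypothetical.

```python
import numpy as np

D = 256

def sinusoidal_pe(xy, dim=D):
    """Encode 2D points of shape (..., 2) into (..., dim) sinusoidal features."""
    half = dim // 4                                   # sin and cos for each of x and y
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    ang = xy[..., None] * freqs                       # (..., 2, half)
    pe = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return pe.reshape(*xy.shape[:-1], dim)

def make_mlp(rng, d_in, d_out):
    """Return a two-layer ReLU MLP with random (untrained) weights."""
    w1 = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
    w2 = rng.normal(scale=d_out ** -0.5, size=(d_out, d_out))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
N_a, K = 3, 6

# Hypothetical positional inputs, one 2D point per (agent, mode).
I_s = rng.normal(size=(N_a, K, 2))                                # scene-level anchor
I_a = np.broadcast_to(rng.normal(size=(K, 2)), (N_a, K, 2))       # agent-level anchor
x_0 = np.broadcast_to(rng.normal(size=(N_a, 1, 2)), (N_a, K, 2))  # current position
x_T_prev = rng.normal(size=(N_a, K, 2))                           # previous-layer endpoint

# Q_pos: one MLP per positional term, summed.
pos_mlps = [make_mlp(rng, D, D) for _ in range(4)]
Q_pos = sum(mlp(sinusoidal_pe(p))
            for mlp, p in zip(pos_mlps, (I_s, I_a, x_0, x_T_prev)))

# Q_ctx: concatenate the three interaction outputs and fuse with an MLP.
Q_a, Q_m, Q_g = (rng.normal(size=(N_a, K, D)) for _ in range(3))
fuse = make_mlp(rng, 3 * D, D)
Q_ctx = fuse(np.concatenate([Q_a, Q_m, Q_g], axis=-1))

Q_next = Q_ctx + Q_pos                                # query fed to the next layer
print(Q_pos.shape, Q_ctx.shape, Q_next.shape)
```

All three tensors come out as $N _ a \times \mathcal{K} \times 256$, matching the notation table.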

I did not really understand the following part.

The scene-level anchor represents prior movement statistics in a global view, while the agent-level anchor captures the possible intention in the local coordinate. They are both clustered by k-means algorithm on the endpoints of groundtruth trajectories, to narrow down the uncertainty of prediction. Contrary to the prior knowledge, the start point provides customized positional embedding for each agent, and the predicted endpoint serves as a dynamic anchor optimized layer-by-layer in a coarse-to-fine fashion.

The paper also mentions a step that smooths the trajectories during training, but I could not clearly understand that part either.

Notation	Shape and Params	Description
$\mathcal{K}$	6	number of forecasting modality in MotionFormer
$T$	12	length of prediction timestamps in MotionFormer
layers	3	number of transformer decoder layers for MotionFormer
$Q _ \mathrm{pos}$	$N _ a \times \mathcal{K} \times 256$	query position in MotionFormer
$Q _ \mathrm{ctx}$	$N _ a \times \mathcal{K} \times 256$	query context in MotionFormer
$Q _ \mathrm{a}$	$N _ a \times \mathcal{K} \times 256$	motion query after agent-agent interaction in MotionFormer
$Q _ \mathrm{m}$	$N _ a \times \mathcal{K} \times 256$	motion query after agent-map interaction in MotionFormer
$Q _ \mathrm{g}$	$N _ a \times \mathcal{K} \times 256$	motion query after agent-goal point interaction in MotionFormer
$I ^ {s}$	$\mathcal{K} \times T \times 2$	scene-level anchor position in MotionFormer
$I ^ {a}$	$\mathcal{K} \times T \times 2$	agent-level anchor position in MotionFormer

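OccFormer

Predicts future occupancy. Each block predicts one timestep.

For each block, the agent feature $G ^ t$ is computed by passing TrackFormer's output $Q _ A$, the position embedding $P _ A$, and $Q _ X \in \mathbb{R} ^ {N _ a \times D}$ (MotionFormer's output compressed by max-pooling over the modes) through an MLP. The first block also takes the BEV feature B, downscaled to 1/4 size, as input.

Inside each block, the dense feature is downscaled once partway through and restored to its original size before the output. Attention against the agent features above is computed in that downscaled state. A mask restricts the attention so that each pixel attends only to the agents occupying that pixel.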
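The masked pixel-agent attention can be sketched as below. It is a deliberately simplified version: single-head, no learned projections, and a plain residual connection, all of which are my assumptions. The function name `masked_pixel_agent_attention` is hypothetical. Pixels that belong to no agent receive zero attention and pass through unchanged.

```python
import numpy as np

def masked_pixel_agent_attention(F_ds, G, mask):
    """F_ds: (P, C) pixel features, G: (N_a, C) agent features,
    mask: (P, N_a) boolean, True where the pixel may attend to the agent."""
    C = F_ds.shape[-1]
    logits = F_ds @ G.T / np.sqrt(C)                       # (P, N_a)
    logits = np.where(mask, logits, -np.inf)               # block disallowed pairs
    has_agent = mask.any(axis=1, keepdims=True)
    logits = np.where(has_agent, logits, 0.0)              # avoid NaN on empty rows
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits) * mask
    denom = w.sum(axis=1, keepdims=True)
    w = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
    return F_ds + w @ G                                    # residual update

rng = np.random.default_rng(0)
P, C, N_a = 25 * 25, 256, 3
F_ds = rng.normal(size=(P, C))
G = rng.normal(size=(N_a, C))
mask = np.zeros((P, N_a), dtype=bool)
mask[:100, 0] = True       # agent 0 covers pixels 0..99
mask[50:200, 1] = True     # agent 1 covers pixels 50..199; agent 2 covers none

D_ds = masked_pixel_agent_attention(F_ds, G, mask)
print(D_ds.shape)  # (625, 256)
```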

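The instance-level occupancy $\hat{O} ^ t _ A$ is obtained by multiplying the decoded dense feature $F ^ {t} _ \mathrm{dec} \in \mathbb{R} ^ {C\times H\times W}$ with the occupancy feature $U ^ t \in \mathbb{R} ^ {N _ a \times C}$, which is computed from the agent features.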
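That product is a contraction over the channel axis, which can be sketched with `np.einsum`. The sigmoid and the pixel-wise max used to merge agents into one map are my assumptions, and the grid is shrunk to 50 x 50 to keep the demo light.

```python
import numpy as np

rng = np.random.default_rng(0)
N_a, C, H, W = 3, 256, 50, 50                  # the real grid in the table is 200 x 200
U = rng.normal(scale=0.1, size=(N_a, C))       # occupancy feature per agent
F_dec = rng.normal(scale=0.1, size=(C, H, W))  # decoded dense feature

logits = np.einsum("nc,chw->nhw", U, F_dec)    # contract over channels
O_A = 1.0 / (1.0 + np.exp(-logits))            # assumed sigmoid -> per-agent probability
O = O_A.max(axis=0)                            # assumed merge: pixel-wise max over agents
print(O_A.shape, O.shape)  # (3, 50, 50) (50, 50)
```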

Notation	Shape and Params	Description
$T _ o$	5	length of prediction timestamps in OccFormer
$G ^ t$	$N _ a \times 256$	agent feature input
$F ^ t$	$200 \times 200 \times 256$	future state output
$Q _ X$	$N _ a \times 256$	motion query (max-pooled on modality level) from the last layer of MotionFormer
$F ^ t _ \mathrm{ds}$	$25 \times 25 \times 256$	downscaled dense feature
$F ^ t _ \mathrm{dec}$	$200 \times 200 \times 256$	decoded dense feature after convolutional decoder
$D ^ t _ \mathrm{ds}$	$25 \times 25 \times 256$	agent-aware dense feature after pixel-agent interaction
$\hat{O} ^ t _ A$	$N _ a \times 200 \times 200$	instance-level probability map
$\hat{O} ^ t$	$200 \times 200$	classical instance-agnostic occupancy map merged from $\hat{O} ^ t _ A$ for planning
$O ^ t _ m$	$200 \times 200$	attention mask for pixel-agent interaction
$M ^ t$	$N _ a \times 256$	mask feature
$U ^ t$	$N _ a \times 256$	occupancy feature

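Planner

The command indicating which way to go (turn left, turn right, or go straight) is converted into a command embedding. A trajectory is predicted from this and the BEV features, and it is then optimized to avoid locations where the occupancy map gives a high occupancy probability.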
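The avoidance step can be sketched as a small optimization. This is a sketch under my own assumptions, not the paper's solver: the cost is a squared distance to the originally predicted trajectory plus a Gaussian penalty around occupied points, minimized with plain gradient descent. The function name `refine_trajectory` and all weights are hypothetical.

```python
import numpy as np

def refine_trajectory(traj, occ_points, lam_coord=1.0, lam_obs=5.0,
                      sigma=1.0, lr=0.05, steps=100):
    """traj: (T, 2) predicted plan. occ_points: list of (M_t, 2) arrays with
    the occupied positions at each timestep. Returns the refined (T, 2) plan."""
    tau = traj.copy()
    for _ in range(steps):
        grad = 2.0 * lam_coord * (tau - traj)          # stay close to the prediction
        for t, pts in enumerate(occ_points):
            if len(pts) == 0:
                continue
            diff = tau[t] - pts                        # (M_t, 2)
            g = np.exp(-(diff ** 2).sum(axis=1) / (2 * sigma ** 2))
            # Gradient of the Gaussian penalty; descending it moves away from pts.
            grad[t] += lam_obs * (-(diff / sigma ** 2) * g[:, None]).sum(axis=0)
        tau -= lr * grad
    return tau

T_p = 6
traj = np.stack([np.zeros(T_p), np.arange(1.0, T_p + 1)], axis=1)  # straight ahead
occ_points = [np.empty((0, 2)) for _ in range(T_p)]
occ_points[4] = np.array([[0.3, 5.0]])     # one occupied point next to waypoint 4

refined = refine_trajectory(traj, occ_points)
before = np.linalg.norm(traj[4] - occ_points[4][0])
after = np.linalg.norm(refined[4] - occ_points[4][0])
print(before < after)  # the waypoint is pushed away from the occupied point
```

Only the waypoint near the occupied point moves. The others stay on the predicted trajectory because their gradient is zero.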

Notation	Shape and Params	Description
layers	3	number of transformer decoder layers for Planner
$T _ p$	6	length of planning timestamps in Planner

Training

Training has two stages. First the Tracking and Mapping parts are trained for 6 epochs, and then the whole model is trained for 20 epochs.
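Written out as a schedule, the two stages might look like the sketch below. The stage names, the loss names, and the `active_losses` helper are all hypothetical. Only the epoch counts and the split between the Tracking/Mapping stage and the full-model stage come from the description above.

```python
STAGES = [
    {"name": "stage1_perception", "epochs": 6,
     "losses": ["track", "map"]},
    {"name": "stage2_end_to_end", "epochs": 20,
     "losses": ["track", "map", "motion", "occ", "plan"]},
]

def active_losses(epoch):
    """Return the loss names used at a 0-indexed global epoch."""
    start = 0
    for stage in STAGES:
        if epoch < start + stage["epochs"]:
            return stage["losses"]
        start += stage["epochs"]
    raise ValueError("epoch is beyond the schedule")

print(active_losses(0))   # ['track', 'map']
print(active_losses(6))   # ['track', 'map', 'motion', 'occ', 'plan']
```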

Experiments

Dataset used: nuScenes

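Ablation study

Training experiments with individual parts of UniAD removed gave the results in the paper's ablation table. Configurations that include many modules, such as No.9 and No.12, give good results, which indicates that joint training helps each module learn.

That said, for metrics such as Tracking, I cannot tell how much a numerical gain of this size changes real-world performance. For Planning too, is the improvement really worth running this many modules?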

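Comparison with existing methods specialized for each module

Compared with existing methods specialized for individual Perception tasks such as Object Tracking and Online Mapping, UniAD falls behind in some respects. Still, UniAD is a method focused on Planning, and on Prediction and Planning it achieves better metrics than existing methods.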

Visualization results

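Impressions

It is not impossible that methods like this become the de facto standard, but a neural-network puzzle game this complicated feels exhausting. Can't it be made simpler?

I also could not really picture how specifying the desired direction with a command would work in actual operation. What kind of module would issue that command? Would issuing it require separate localization and a map after all? How accurate would those need to be?

If the paper's claim holds, then even if all you want is Tracking, you are better off training through to the downstream tasks, so how data is collected and handled seems likely to matter more and more. The data has to be in a form that allows many things to be trained at once.

Part of me feels this is slightly off, but if it works we have to adopt whatever works, even a method that is not fully convincing, so for now I will keep it in mind as an option.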