cs.CV 方向,今日共计135篇
Transformer(10篇)
【1】 DoodleFormer: Creative Sketch Drawing with Transformers 标题:DoodleFormer:用Transformer创作素描 链接:https://arxiv.org/abs/2112.03258
作者:Ankan Kumar Bhunia,Salman Khan,Hisham Cholakkal,Rao Muhammad Anwer,Fahad Shahbaz Khan,Jorma Laaksonen,Michael Felsberg 机构:Mohamed bin Zayed University of AI, UAE, Australian National University, Australia, Aalto University, Finland, Linköping University, Sweden 摘要:创造性素描(涂鸦)是一种富有表现力的活动,人们在其中对日常视觉对象作出富有想象力、前所未见的描绘。创意草图图像生成是一个具有挑战性的视觉问题,其任务是生成多样而逼真、且包含视觉世界对象前所未见组合的创意草图。为此,我们提出了一个新颖的由粗到精两阶段框架DoodleFormer,它将创意草图生成问题分解为先生成粗略的草图构图,再在草图中加入细节。我们引入了图感知Transformer编码器,能有效捕捉不同身体部位之间的全局动态结构关系与局部静态结构关系。为确保生成创意草图的多样性,我们引入了概率粗略草图解码器,显式建模每个待绘制草图身体部位的变化。实验在两个创意草图数据集上进行:Creative Birds与Creative Creatures。定性、定量以及基于人工的评估表明,DoodleFormer在两个数据集上均优于现有最先进方法,能生成逼真而多样的创意草图。在Creative Creatures上,DoodleFormer在Fréchet inception distance(FID)指标上相比最先进方法取得了25的绝对增益。我们还展示了DoodleFormer在文本到创意草图生成和草图补全等相关应用中的有效性。 摘要:Creative sketching or doodling is an expressive activity, where imaginative and previously unseen depictions of everyday visual objects are drawn. Creative sketch image generation is a challenging vision problem, where the task is to generate diverse, yet realistic creative sketches possessing the unseen composition of the visual-world objects. Here, we propose a novel coarse-to-fine two-stage framework, DoodleFormer, that decomposes the creative sketch generation problem into the creation of coarse sketch composition followed by the incorporation of fine-details in the sketch. We introduce graph-aware transformer encoders that effectively capture global dynamic as well as local static structural relations among different body parts. To ensure diversity of the generated creative sketches, we introduce a probabilistic coarse sketch decoder that explicitly models the variations of each sketch body part to be drawn. Experiments are performed on two creative sketch datasets: Creative Birds and Creative Creatures. Our qualitative, quantitative and human-based evaluations show that DoodleFormer outperforms the state-of-the-art on both datasets, yielding realistic and diverse creative sketches. On Creative Creatures, DoodleFormer achieves an absolute gain of 25 in terms of Fréchet inception distance (FID) over the state-of-the-art. We also demonstrate the effectiveness of DoodleFormer for related applications of text to creative sketch generation and sketch completion.
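为便于理解"概率粗略草图解码器"的思路,下面给出一个极简示意(非论文官方实现;模块名与维度均为本文为演示所作的假设):对每个身体部位的编码特征预测高斯分布参数,再用重参数化采样得到该部位的包围盒,从而显式建模生成结果的多样性。

```python
import torch
import torch.nn as nn

class ProbabilisticPartDecoder(nn.Module):
    """示意:为每个身体部位预测一个高斯分布并采样包围盒(cx, cy, w, h)。
    仅用于说明"显式建模每个部位变化"的思想,并非DoodleFormer官方代码。"""
    def __init__(self, feat_dim=256, box_dim=4):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, box_dim)       # 均值分支
        self.to_logvar = nn.Linear(feat_dim, box_dim)   # 对数方差分支

    def forward(self, part_feats):
        # part_feats: (B, N_parts, feat_dim),假设来自图感知Transformer编码器的部位特征
        mu = self.to_mu(part_feats)
        logvar = self.to_logvar(part_feats)
        eps = torch.randn_like(mu)                      # 重参数化采样,保证可微
        boxes = mu + eps * torch.exp(0.5 * logvar)      # 每次采样得到不同但合理的粗略布局
        return boxes, mu, logvar

# 用法示意
decoder = ProbabilisticPartDecoder()
feats = torch.randn(2, 8, 256)     # 2幅草图、8个部位
boxes, mu, logvar = decoder(feats)
print(boxes.shape)                 # torch.Size([2, 8, 4])
```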
【2】 PTTR: Relational 3D Point Cloud Object Tracking with Transformer 标题:PTTR:基于Transformer的关系型三维点云目标跟踪 链接:https://arxiv.org/abs/2112.02857
作者:Changqing Zhou,Zhipeng Luo,Yueru Luo,Tianrui Liu,Liang Pan,Zhongang Cai,Haiyu Zhao,Shijian Lu 机构: Nanyang Technological University , S-Lab, Nanyang Technological University , Sensetime Research 摘要:在点云序列中,三维对象跟踪旨在预测给定模板点云的当前搜索点云中对象的位置和方向。基于transformers的成功,我们提出了点跟踪TRansformer(PTTR),它可以借助TRansformer操作以从粗到精的方式有效地预测高质量的3D跟踪结果。PTTR由三种新颖的设计组成。1) 代替随机抽样,我们设计了关系感知抽样,以在子抽样过程中保留给定模板的相关点。2) 此外,我们还提出了一个由自我注意和交叉注意模块组成的点关系变换器(PRT)。全局自我关注操作捕获长距离依赖关系,以分别增强搜索区域和模板的编码点特征。随后,我们通过交叉注意匹配两组点特征来生成粗略的跟踪结果。3) 基于粗跟踪结果,我们采用了一种新的预测细化模块来获得最终的细化预测。此外,我们基于Waymo开放数据集创建了大规模点云单目标跟踪基准。大量实验表明,PTTR在精度和效率上都实现了优越的点云跟踪。 摘要:In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in the current search point cloud given a template point cloud. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve relevant points to given templates during subsampling. 2) Furthermore, we propose a Point Relation Transformer (PRT) consisting of a self-attention and a cross-attention module. The global self-attention operation captures long-range dependencies to enhance encoded point features for the search area and the template, respectively. Subsequently, we generate the coarse tracking results by matching the two sets of point features via cross-attention. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction. In addition, we create a large-scale point cloud single object tracking benchmark based on the Waymo Open Dataset. Extensive experiments show that PTTR achieves superior point cloud tracking in both accuracy and efficiency.
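下面用PyTorch给出"自注意增强 + 交叉注意匹配"这一思路的简化示意(非PTTR官方实现;层数、维度与是否共享权重均为假设):模板与搜索区域的点特征先各自做自注意,再以搜索特征为query、模板特征为key/value做交叉注意,得到用于粗略跟踪预测的融合特征。

```python
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    """示意性的点关系模块:自注意 + 交叉注意(非PTTR官方实现)。"""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, N_s, dim); template_feat: (B, N_t, dim)
        s, _ = self.self_attn(search_feat, search_feat, search_feat)
        t, _ = self.self_attn(template_feat, template_feat, template_feat)
        s, t = self.norm1(search_feat + s), self.norm1(template_feat + t)
        # 交叉注意:搜索区域点特征查询模板点特征,完成两组特征的匹配
        fused, _ = self.cross_attn(s, t, t)
        return self.norm2(s + fused)   # 可送入后续的粗略预测与细化模块

block = RelationBlock()
out = block(torch.randn(1, 1024, 128), torch.randn(1, 512, 128))
print(out.shape)  # torch.Size([1, 1024, 128])
```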
【3】 GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation 标题:GETAM:用于弱监督语义分割的梯度加权元素转换器注意图 链接:https://arxiv.org/abs/2112.02841
作者:Weixuan Sun,Jing Zhang,Zheyuan Liu,Yiran Zhong,Nick Barnes 机构:Australian National University, SenseTime 摘要:弱监督语义分割(WSSS)具有挑战性,尤其是当使用图像级标签来监督像素级预测时。为了弥补两者之间的差距,通常会生成类激活图(CAM)来提供像素级伪标签。卷积神经网络中的CAM受部分激活问题的影响,即只有最具辨别力的区域被激活。另一方面,基于Transformer的方法通过长程依赖建模探索全局上下文,非常有效,有可能缓解"部分激活"问题。在本文中,我们提出了第一种基于Transformer的WSSS方法,并引入了梯度加权逐元素Transformer注意图(GETAM)。GETAM对所有特征图元素给出精细尺度的激活,能揭示物体在不同Transformer层上的不同部分。此外,我们提出了一个激活感知的标签补全模块来生成高质量的伪标签。最后,我们使用双重反向传播将我们的方法整合到WSSS的端到端框架中。在PASCAL VOC和COCO上的大量实验表明,我们的结果大幅超过最先进的端到端方法,并优于大多数多阶段方法。 摘要:Weakly Supervised Semantic Segmentation (WSSS) is challenging, particularly when image-level labels are used to supervise pixel level prediction. To bridge their gap, a Class Activation Map (CAM) is usually generated to provide pixel level pseudo labels. CAMs in Convolutional Neural Networks suffer from partial activation ie, only the most discriminative regions are activated. Transformer based methods, on the other hand, are highly effective at exploring global context with long range dependency modeling, potentially alleviating the "partial activation" issue. In this paper, we propose the first transformer based WSSS approach, and introduce the Gradient weighted Element wise Transformer Attention Map (GETAM). GETAM shows fine scale activation for all feature map elements, revealing different parts of the object across transformer layers. Further, we propose an activation aware label completion module to generate high quality pseudo labels. Finally, we incorporate our methods into an end to end framework for WSSS using double backward propagation. Extensive experiments on PASCAL VOC and COCO demonstrate that our results beat the state-of-the-art end-to-end approaches by a significant margin, and outperform most multi-stage methods.
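"梯度加权逐元素注意图"的核心想法可以用如下简化代码说明(非GETAM官方实现;这里假设各层注意力张量保留在计算图中,聚合方式也为本文假设):用目标类别得分对注意力图求梯度,与注意力图逐元素相乘后只保留正贡献,再跨层累加得到定位图。

```python
import torch

def gradient_weighted_attention(attn_maps, class_score):
    """attn_maps: 各层注意力张量组成的list,每个形状为 (heads, N, N) 且处于计算图中;
    class_score: 标量,目标类别的分类得分。返回逐元素梯度加权后累加的注意力图(示意)。"""
    grads = torch.autograd.grad(class_score, attn_maps, retain_graph=True)
    fused = None
    for a, g in zip(attn_maps, grads):
        weighted = torch.relu(a * g)          # 逐元素梯度加权,只保留正贡献
        layer_map = weighted.mean(dim=0)      # 平均多头
        fused = layer_map if fused is None else fused + layer_map  # 跨层累加
    return fused   # (N, N),可进一步映射回图像patch作为伪标签种子

# 用伪造的"类别得分"演示接口
attn = torch.rand(4, 16, 16, requires_grad=True)
score = (attn * 2).sum()
print(gradient_weighted_attention([attn], score).shape)  # torch.Size([16, 16])
```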
【4】 Dynamic Token Normalization Improves Vision Transformer 标题:动态标记归一化改进了视觉转换器 链接:https://arxiv.org/abs/2112.02624
作者:Wenqi Shao,Yixiao Ge,Zhaoyang Zhang,Xuyuan Xu,Xiaogang Wang,Ying Shan,Ping Luo 机构: The Chinese University of Hong Kong, ARC Lab, Tencent PCG, AI Technology Center of Tencent Video, The University of Hong Kong 备注:18 pages, 12 Tables, 9 Figures 摘要:视觉Transformer(ViT)及其变体(例如Swin、PVT)凭借学习长程上下文信息的能力,在各种计算机视觉任务中取得了巨大成功。层归一化(LN)是这些模型中的重要组成部分。然而,我们发现普通LN在每个词元(token)内部进行归一化,使不同位置的词元在幅值上趋于相似,这使得Transformer难以在LN下捕捉诸如图像中位置上下文之类的归纳偏置。我们提出了一种新的归一化方法来解决这一问题,称为动态词元归一化(DTN),其归一化同时在每个词元内部(词元内)和不同词元之间(词元间)进行。DTN有几个优点。首先,它建立在统一的公式之上,因此可以表示多种现有的归一化方法。其次,DTN学习以词元内和词元间两种方式归一化词元,使Transformer既能捕获全局上下文信息,又能捕获局部位置上下文。第三,只需简单替换LN层,DTN即可方便地插入各种视觉Transformer,如ViT、Swin、PVT、LeViT、T2T-ViT、BigBird和Reformer。大量实验表明,配备DTN的Transformer在仅增加极少参数和计算开销的情况下始终优于基线模型。例如,DTN在ImageNet上比LN提升0.5%-1.2%的top-1精度,在COCO目标检测基准上提升1.2-1.4的box AP,在ImageNet-C鲁棒性实验中改善2.3%-3.9%的mCE,在Long-Range Arena的Long ListOps任务上提升0.5%-0.8%的精度。代码将公开于 https://github.com/wqshao126/DTN 。 摘要:Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we found that the ordinary LN makes tokens at different positions similar in magnitude because it normalizes embeddings within each token. It is difficult for Transformers to capture inductive bias such as the positional context in an image with LN. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), where normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. Firstly, it is built on a unified formulation and thus can represent various existing normalization methods. Secondly, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both the global contextual information and the local positional context. Thirdly, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird and Reformer. Extensive experiments show that the transformer equipped with DTN consistently outperforms baseline model with minimal extra parameters and computational overhead. For example, DTN outperforms LN by $0.5\%$ - $1.2\%$ top-1 accuracy on ImageNet, by $1.2$ - $1.4$ box AP in object detection on COCO benchmark, by $2.3\%$ - $3.9\%$ mCE in robustness experiments on ImageNet-C, and by $0.5\%$ - $0.8\%$ accuracy in Long ListOps on Long-Range Arena. Codes will be made public at \url{https://github.com/wqshao126/DTN}
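下面给出一个"词元内 + 词元间"混合归一化的简化示意(非DTN官方公式;混合方式与可学习权重均为本文假设),用于说明统一公式如何同时涵盖类似LN的词元内统计量与跨词元统计量:

```python
import torch
import torch.nn as nn

class IntraInterTokenNorm(nn.Module):
    """示意:按可学习系数混合词元内(类LN)与词元间统计量的归一化(非DTN官方实现)。"""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # 词元内/词元间混合系数
        self.eps = eps

    def forward(self, x):
        # x: (B, N, C) 词元序列
        mu_intra = x.mean(dim=-1, keepdim=True)              # 词元内统计(逐token)
        var_intra = x.var(dim=-1, keepdim=True, unbiased=False)
        mu_inter = x.mean(dim=1, keepdim=True)               # 词元间统计(跨token、逐通道)
        var_inter = x.var(dim=1, keepdim=True, unbiased=False)
        a = torch.sigmoid(self.alpha)
        mu = a * mu_intra + (1 - a) * mu_inter
        var = a * var_intra + (1 - a) * var_inter
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias

norm = IntraInterTokenNorm(dim=384)
print(norm(torch.randn(2, 197, 384)).shape)  # torch.Size([2, 197, 384])
```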
【5】 Learning Tracking Representations via Dual-Branch Fully Transformer Networks 标题:基于双支路全Transformer网络的跟踪表示学习 链接:https://arxiv.org/abs/2112.02571
作者:Fei Xie,Chunyu Wang,Guangting Wang,Wankou Yang,Wenjun Zeng 机构:Southeast University, China, Microsoft Research Asia 备注:ICCV21 Workshops 摘要:我们提出了一种仅基于Transformer的孪生式(Siamese)双分支跟踪网络。给定模板图像和搜索图像,我们将它们划分为不重叠的图像块,并根据每个图像块在注意窗口内与其他图像块的匹配结果为其提取一个特征向量。对于每个词元,我们估计它是否包含目标对象以及相应的尺寸。这种方法的优点是,特征从匹配中学习而来,并最终服务于匹配,因此这些特征与目标跟踪任务是一致的。与先用CNN提取特征、再用Transformer进行融合的性能最佳方法相比,该方法取得了更好或相当的结果。它在GOT-10k和VOT2020基准上的性能优于最先进的方法。此外,该方法在单个GPU上实现了实时推理速度(约40 fps)。代码和模型将发布。 摘要:We present a Siamese-like Dual-branch network based on solely Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with others within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of the approach is that the features are learned from matching, and ultimately, for matching. So the features are aligned with the object tracking task. The method achieves better or comparable results as the best-performing methods which first use CNN to extract features and then use Transformer to fuse them. It outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about $40$ fps) on one GPU. The code and models will be released.
【6】 Adaptive Channel Encoding Transformer for Point Cloud Analysis 标题:用于点云分析的自适应通道编码转换器 链接:https://arxiv.org/abs/2112.02507
作者:Guoquan Xu,Hezhi Cao,Jianwei Wan,Ke Xu,Yanxin Ma,Cong Zhang 机构:National University of Defense Technology, Changsha, CHINA, University of Science and Technology of China, Hefei, CHINA 摘要:Transformer在各种计算机视觉领域发挥着越来越重要的作用,在点云分析方面也取得了显著的成就。由于它们主要关注点态变换,本文提出了一种自适应信道编码变换器。具体而言,称为Transformer Conv的信道卷积被设计用于对信道进行编码。它可以通过捕获坐标和特征之间的潜在关系对特征通道进行编码。与简单地为每个通道分配注意权重相比,我们的方法旨在对通道进行自适应编码。此外,我们的网络采用低层和高层双语义感受域的邻域搜索方法来提高性能。大量实验表明,在三个基准数据集上,我们的方法优于最新的点云分类和分割方法。 摘要:Transformer plays an increasingly important role in various computer vision areas and remarkable achievements have also been made in point cloud analysis. Since they mainly focus on point-wise transformer, an adaptive channel encoding transformer is proposed in this paper. Specifically, a channel convolution called Transformer-Conv is designed to encode the channel. It can encode feature channels by capturing the potential relationship between coordinates and features. Compared with simply assigning attention weight to each channel, our method aims to encode the channel adaptively. In addition, our network adopts the neighborhood search method of low-level and high-level dual semantic receptive fields to improve the performance. Extensive experiments show that our method is superior to state-of-the-art point cloud classification and segmentation methods on three benchmark datasets.
【7】 Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer 标题:基于Transformer的姿态引导特征解缠的遮挡行人重识别 链接:https://arxiv.org/abs/2112.02466
作者:Tao Wang,Hong Liu,Pinhao Song,Tianyu Guo,Wei Shi 机构:Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University 备注:Accepted by AAAI2022 摘要:由于在某些场景中,人体部位可能被某些障碍物(如树木、汽车和行人)遮挡,因此被遮挡者的重新识别是一项具有挑战性的任务。现有的一些姿态引导方法通过根据图形匹配对齐身体部位来解决这一问题,但这些基于图形的方法并不直观和复杂。因此,我们提出了一种基于变换器的姿势引导特征分离(PFD)方法,该方法利用姿势信息对语义成分(如人体或关节部位)进行清晰的分离,并对非遮挡部位进行相应的选择性匹配。首先,利用视觉变换器(ViT)的强大功能提取面片特征。其次,在姿态引导特征聚合(PFA)模块中,利用匹配和分布机制,初步分离姿态信息和面片信息。第三,在transformer解码器中引入一组可学习的语义视图来隐式增强分离的身体部位特征。然而,如果没有额外的监督,这些语义观点不能保证与身体相关。因此,提出了姿态-视图匹配(PVM)模块来显式匹配可见身体部位并自动分离遮挡特征。第四,为了更好地防止遮挡的干扰,我们设计了一种姿势引导的推离,以强调可见身体部位的特征。对两个任务(闭塞和整体重建)的五个具有挑战性的数据集进行的大量实验表明,我们提出的PFD具有优越的前景,其性能优于最先进的方法。代码可在https://github.com/WangTaoAs/PFD_Net 摘要:Occluded person re-identification is a challenging task as human body parts could be occluded by some obstacles (e.g. trees, cars, and pedestrians) in certain scenes. Some existing pose-guided methods solve this problem by aligning body parts according to graph matching, but these graph-based methods are not intuitive and complicated. Therefore, we propose a transformer-based Pose-guided Feature Disentangling (PFD) method by utilizing pose information to clearly disentangle semantic components (e.g. human body or joint parts) and selectively match non-occluded parts correspondingly. First, Vision Transformer (ViT) is used to extract the patch features with its strong capability. Second, to preliminarily disentangle the pose information from patch information, the matching and distributing mechanism is leveraged in Pose-guided Feature Aggregation (PFA) module. Third, a set of learnable semantic views are introduced in transformer decoder to implicitly enhance the disentangled body part features. However, those semantic views are not guaranteed to be related to the body without additional supervision. Therefore, Pose-View Matching (PVM) module is proposed to explicitly match visible body parts and automatically separate occlusion features. Fourth, to better prevent the interference of occlusions, we design a Pose-guided Push Loss to emphasize the features of visible body parts. Extensive experiments over five challenging datasets for two tasks (occluded and holistic Re-ID) demonstrate that our proposed PFD is superior promising, which performs favorably against state-of-the-art methods. Code is available at https://github.com/WangTaoAs/PFD_Net
【8】 TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection 标题:TransCMD:用于RGB-D显著目标检测的Transformer交叉模式解码器 链接:https://arxiv.org/abs/2112.02363
作者:Youwei Pang,Xiaoqi Zhao,Lihe Zhang,Huchuan Lu 备注:Manuscript Version 摘要:现有的RGB-D突出目标检测方法大多利用卷积运算,构造复杂的交织融合结构,实现跨模态信息融合。卷积运算固有的局部连通性限制了基于卷积的方法的性能。在这项工作中,我们从全球信息对齐和转换的角度重新思考这项任务。具体而言,所提出的方法(TransCMD)级联多个跨模态积分单元,以构造自顶向下的基于Transformer的信息传播路径(TIPP)。TransCMD将多尺度和多模态特征集成视为构建在转换器上的序列到序列上下文传播和更新过程。此外,考虑到二次复杂度w.r.t.和输入令牌的数量,我们设计了一种具有可接受计算成本的分片令牌重新嵌入策略(PTRE)。在七个RGB-D SOD基准数据集上的实验结果表明,当配备TIPP时,一个简单的双流编码器-解码器框架可以超过最先进的纯CNN方法。 摘要:Most of the existing RGB-D salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink this task from the perspective of global information alignment and transformation. Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path (TIPP). TransCMD treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on the transformer. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a patch-wise token re-embedding strategy (PTRE) with acceptable computational cost. Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods when it is equipped with the TIPP.
【9】 U2-Former: A Nested U-shaped Transformer for Image Restoration 标题:U2-Former:一种用于图像恢复的嵌套式U型Transformer 链接:https://arxiv.org/abs/2112.02279
作者:Haobo Ji,Xin Feng,Wenjie Pei,Jinxing Li,Guangming Lu 摘要:虽然Transformer在各种高级视觉任务中取得了显著的性能,但要充分发挥Transformer在图像恢复中的潜力仍然具有挑战性。问题的关键在于,由于沉重的自注意计算负载以及不同深度(尺度)层之间低效的信息交流,在典型的图像恢复编码器-解码器框架中Transformer能够应用的深度有限。在本文中,我们提出了一种深层且有效的基于Transformer的图像恢复网络,称为U2-Former,它能够以Transformer作为核心操作,在深层编码和解码空间中执行图像恢复。具体地说,它利用嵌套的U形结构来促进具有不同尺度特征图的不同层之间的交互。此外,我们通过引入特征过滤机制来压缩词元表示,从而优化基本Transformer块的计算效率。除了图像恢复的典型监督方式外,我们的U2-Former还从多个方面进行对比学习,以进一步将噪声成分与背景图像解耦。在各种图像恢复任务上的大量实验,包括反射去除、雨纹去除和去雾,证明了所提出的U2-Former的有效性。 摘要:While Transformer has achieved remarkable performance in various high-level vision tasks, it is still challenging to exploit the full potential of Transformer in image restoration. The crux lies in the limited depth of applying Transformer in the typical encoder-decoder framework for image restoration, resulting from heavy self-attention computation load and inefficient communications across different depth (scales) of layers. In this paper, we present a deep and effective Transformer-based network for image restoration, termed as U2-Former, which is able to employ Transformer as the core operation to perform image restoration in a deep encoding and decoding space. Specifically, it leverages the nested U-shaped structure to facilitate the interactions across different layers with different scales of feature maps. Furthermore, we optimize the computational efficiency for the basic Transformer block by introducing a feature-filtering mechanism to compress the token representation. Apart from the typical supervision ways for image restoration, our U2-Former also performs contrastive learning in multiple aspects to further decouple the noise component from the background image. Extensive experiments on various image restoration tasks, including reflection removal, rain streak removal and dehazing respectively, demonstrate the effectiveness of the proposed U2-Former.
【10】 LAVT: Language-Aware Vision Transformer for Referring Image Segmentation 标题:LAVT:用于参考图像分割的语言感知视觉转换器 链接:https://arxiv.org/abs/2112.02244
作者:Zhao Yang,Jiaqi Wang,Yansong Tang,Kai Chen,Hengshuang Zhao,Philip H. S. Torr 机构:University of Oxford,Shanghai AI Laboratory, SenseTime Research,The University of Hong Kong 备注:10 pages, 8 figures 摘要:参考图像分割是一项基础的视觉语言任务,旨在从图像中分割出自然语言表达式所指的对象。此任务背后的一个关键挑战是利用指代表达式突出显示图像中的相关位置。解决这一问题的一种范式是利用强大的视觉语言("跨模态")解码器,融合从视觉编码器和语言编码器中独立提取的特征。最近的方法在这一范式下取得了显著进展,将Transformer用作跨模态解码器,这与Transformer在许多其他视觉语言任务中的巨大成功相呼应。本工作采用了不同的思路:我们表明,通过在视觉Transformer编码器网络的中间层对语言特征和视觉特征进行早期融合,可以实现明显更好的跨模态对齐。通过在视觉特征编码阶段进行跨模态特征融合,我们可以利用Transformer编码器已被充分验证的相关性建模能力来挖掘有用的多模态上下文。这样,只需一个轻量级的掩码预测器就能得到准确的分割结果。我们的方法无需任何附加技巧,就在RefCOCO、RefCOCO+和G-Ref上大幅超过了以前最先进的方法。 摘要:Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
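"在视觉编码阶段做早期跨模态融合"的思路可以用如下简化模块说明(非LAVT官方实现;门控形式与接口均为本文假设):在某个视觉Transformer阶段之后,让视觉词元作为query去注意语言词元,并用门控把语言信息加回视觉特征,再送入下一阶段。

```python
import torch
import torch.nn as nn

class EarlyLanguageFusion(nn.Module):
    """示意:视觉词元对语言词元做交叉注意并门控融合(非LAVT官方实现)。"""
    def __init__(self, vis_dim=256, lang_dim=768, heads=8):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, N_v, vis_dim)  某一编码阶段输出的视觉特征
        # lang_tokens: (B, N_l, lang_dim) 语言编码器输出
        lang = self.lang_proj(lang_tokens)
        fused, _ = self.cross_attn(vis_tokens, lang, lang)   # 视觉查询语言
        return vis_tokens + self.gate(fused) * fused          # 门控残差,送入下一视觉阶段

fusion = EarlyLanguageFusion()
out = fusion(torch.randn(2, 196, 256), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 196, 256])
```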
检测相关(17篇)
【1】 Learning to Reason from General Concepts to Fine-grained Tokens for Discriminative Phrase Detection 标题:用于鉴别性短语检测的从一般概念到细粒度标记的学习推理 链接:https://arxiv.org/abs/2112.03237
作者:Maan Qraitem,Bryan A. Plummer 机构:Department of Computer Science, Boston University 摘要:短语检测需要识别短语是否与图像相关的方法,然后在适用的情况下对其进行定位。训练更具辨别力的短语检测模型的一个关键挑战是采样硬否定。这是因为很少有短语被注释为可能适用的几乎无限的变化。为了解决这个问题,我们引入了PFP-Net,一种通过两种新方法区分短语的短语检测器。首先,我们将相关对象的短语组合成视觉连贯概念的粗略组(例如动物与汽车),然后训练我们的PFP网络,根据它们的概念成员来区分它们。其次,对于包含细粒度互斥标记(例如颜色)的短语,我们强制模型为每个区域仅选择一个适用短语。我们在Flickr30K实体和RefCOCO+数据集上评估了我们的方法,在这项具有挑战性的任务中,我们将mAP比最新技术提高了1-1.5个百分点。当只考虑受细粒度推理模块影响的短语时,我们在两个数据集上都提高了1-4个点。 摘要:Phrase detection requires methods to identify if a phrase is relevant to an image and then localize it if applicable. A key challenge in training more discriminative phrase detection models is sampling hard-negatives. This is because few phrases are annotated of the nearly infinite variations that may be applicable. To address this problem, we introduce PFP-Net, a phrase detector that differentiates between phrases through two novel methods. First, we group together phrases of related objects into coarse groups of visually coherent concepts (eg animals vs automobiles), and then train our PFP-Net to discriminate between them according to their concept membership. Second, for phrases containing fine grained mutually-exclusive tokens (eg colors), we force the model into selecting only one applicable phrase for each region. We evaluate our approach on the Flickr30K Entities and RefCOCO+ datasets, where we improve mAP over the state-of-the-art by 1-1.5 points over all phrases on this challenging task. When considering only the phrases affected by our fine-grained reasoning module, we improve by 1-4 points on both datasets.
【2】 Context-Aware Transfer Attacks for Object Detection 标题:面向对象检测的上下文感知传输攻击 链接:https://arxiv.org/abs/2112.03223
作者:Zikui Cai,Xinxin Xie,Shasha Li,Mingjun Yin,Chengyu Song,Srikanth V. Krishnamurthy,Amit K. Roy-Chowdhury,M. Salman Asif 机构: Electrical and Computer Engineering, University of California Riverside, Computer Science and Engineering, University of California Riverside 备注:accepted to AAAI 2022 摘要:近年来,针对图像分类器的黑盒迁移攻击得到了广泛研究。相比之下,针对目标检测器的迁移攻击进展甚微。目标检测器对图像进行整体观察,某个目标能否被检测到通常取决于场景中的其他目标。这使得此类检测器天然具有上下文感知能力,因此在这一领域进行对抗攻击比攻击图像分类器更具挑战性。在本文中,我们提出了一种针对目标检测器生成上下文感知攻击的新方法。我们表明,通过使用目标的共现关系及其相对位置和大小作为上下文信息,可以成功生成有针对性的错误分类攻击,在黑盒目标检测器上取得比现有最佳方法更高的迁移成功率。我们使用PASCAL VOC和MS COCO数据集的图像在多种目标检测器上测试了我们的方法,结果表明与其他最先进方法相比,性能最多可提高20个百分点。 摘要:Blackbox transfer attacks for image classifiers have been extensively studied in recent years. In contrast, little progress has been made on transfer attacks for object detectors. Object detectors take a holistic view of the image and the detection of one object (or lack thereof) often depends on other objects in the scene. This makes such detectors inherently context-aware and adversarial attacks in this space are more challenging than those targeting image classifiers. In this paper, we present a new approach to generate context-aware attacks for object detectors. We show that by using co-occurrence of objects and their relative locations and sizes as context information, we can successfully generate targeted mis-categorization attacks that achieve higher transfer success rates on blackbox object detectors than the state-of-the-art. We test our approach on a variety of object detectors with images from PASCAL VOC and MS COCO datasets and demonstrate up to $20$ percentage points improvement in performance compared to the other state-of-the-art methods.
【3】 Fusion Detection via Distance-Decay IoU and weighted Dempster-Shafer Evidence Theory 标题:基于距离衰减IoU和加权Dempster-Shafer证据理论的融合检测 链接:https://arxiv.org/abs/2112.03044
作者:Fang Qingyun,Wang Zhaokui 机构:Department of Aeronautics and Astronautics Engineering, Tsinghua University, Beijing, China 备注:18 pages, 7 figures, under consideration at Journal of Aerospace Information Systems 摘要:近年来,遥感图像中的目标检测越来越受到关注。然而,传统的光学检测对光照和天气异常非常敏感。如何有效利用多源遥感图像,特别是光学与合成孔径雷达图像之间的跨模态信息,实现全天时、全天候、高精度、高速度的目标检测,是一个挑战。为此,本文提出了一种快速的多源融合检测框架。我们采用一种新的距离衰减交并比(distance-decay IoU),以尺度不变的方式编码目标的形状特性,从而可以对多源图像中的同一目标进行精确配对。此外,利用加权Dempster-Shafer证据理论融合光学与合成孔径雷达检测结果,克服了特征级融合需要大量成对数据的缺点。另外,本文以曾在苏伊士运河搁浅的集装箱船Ever Given的成对光学与合成孔径雷达图像为例,演示了我们的融合算法。为了测试所提方法的有效性,在自建数据集上,所提融合检测框架的平均精度比光学检测高出20.13%。 摘要:In recent years, increasing attentions are paid on object detection in remote sensing imagery. However, traditional optical detection is highly susceptible to illumination and weather anomaly. It is a challenge to effectively utilize the cross-modality information from multi-source remote sensing images, especially from optical and synthetic aperture radar images, to achieve all-day and all-weather detection with high accuracy and speed. Towards this end, a fast multi-source fusion detection framework is proposed in current paper. A novel distance-decay intersection over union is employed to encode the shape properties of the targets with scale invariance. Therefore, the same target in multi-source images can be paired accurately. Furthermore, the weighted Dempster-Shafer evidence theory is utilized to combine the optical and synthetic aperture radar detection, which overcomes the drawback in feature-level fusion that requires a large amount of paired data. In addition, the paired optical and synthetic aperture radar images for container ship Ever Given which ran aground in the Suez Canal are taken to demonstrate our fusion algorithm. To test the effectiveness of the proposed method, on self-built data set, the average precision of the proposed fusion detection framework outperform the optical detection by 20.13%.
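摘要并未给出"距离衰减IoU"的具体公式。下面给出一个满足"随中心距离衰减、具备尺度不变性"直觉的示意实现(衰减形式与归一化方式均为本文假设,并非论文原始定义),仅用于帮助理解:

```python
import numpy as np

def distance_decay_iou(box_a, box_b, alpha=1.0):
    """box: (x1, y1, x2, y2)。返回 IoU 乘以随归一化中心距离衰减的因子(示意公式)。"""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # 普通IoU
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    iou = inter / union if union > 0 else 0.0
    # 用最小外接框对角线归一化中心距离,使度量具有尺度不变性
    cxa, cya = (xa1 + xa2) / 2, (ya1 + ya2) / 2
    cxb, cyb = (xb1 + xb2) / 2, (yb1 + yb2) / 2
    ex1, ey1 = min(xa1, xb1), min(ya1, yb1)
    ex2, ey2 = max(xa2, xb2), max(ya2, yb2)
    diag = np.hypot(ex2 - ex1, ey2 - ey1) + 1e-9
    dist = np.hypot(cxa - cxb, cya - cyb) / diag
    return iou * np.exp(-alpha * dist)

print(distance_decay_iou((0, 0, 10, 10), (2, 2, 12, 12)))
```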
【4】 Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery 标题:多光谱遥感图像目标检测的跨模态注意特征融合 链接:https://arxiv.org/abs/2112.02991
作者:Qingyun Fang,Zhaokui Wang 备注:23 pages,11 figures, under consideration at Pattern Recognition 摘要:跨模态融合多光谱遥感图像对的互补信息可以提高检测算法的感知能力,使其在更广泛的应用(如夜间检测)中更加稳健和可靠。与以前的方法相比,我们认为不同的特征应该被专门处理,模态特定的特征应该被保留和增强,而模态共享的特征应该从RGB和热红外模态中挑选出来。基于这一思想,提出了一种新颖、轻量级的多光谱特征融合方法,即跨模态注意特征融合(CMAFF)。给定RGB和IR图像的中间特征映射,我们的模块并行地从两种不同的模式(公共模式和差分模式)推断注意映射,然后将注意映射分别乘以输入特征映射以进行自适应特征增强或选择。大量的实验表明,我们提出的方法可以在较低的计算成本下实现最先进的性能。 摘要:Cross-modality fusing complementary information of multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable for a wider range of applications, such as nighttime detection. Compared with prior methods, we think different features should be processed specifically, the modality-specific features should be retained and enhanced, while the modality-shared features should be cherry-picked from the RGB and thermal IR modalities. Following this idea, a novel and lightweight multispectral feature fusion approach with joint common-modality and differential-modality attentions are proposed, named Cross-Modality Attentive Feature Fusion (CMAFF). Given the intermediate feature maps of RGB and IR images, our module parallel infers attention maps from two separate modalities, common- and differential-modality, then the attention maps are multiplied to the input feature map respectively for adaptive feature enhancement or selection. Extensive experiments demonstrate that our proposed approach can achieve the state-of-the-art performance at a low computation cost.
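"共同模态 + 差分模态注意"的结构可以用如下简化代码说明(非CMAFF官方实现;注意力采用常见的通道注意形式,属本文假设):对RGB与红外特征分别取逐元素平均(共同模态)与差值(差分模态),各自推断注意图后再作用回两路输入并融合。

```python
import torch
import torch.nn as nn

class CommonDifferentialAttention(nn.Module):
    """示意:共同/差分模态通道注意融合(非CMAFF官方实现)。"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        def make_att():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.common_att = make_att()
        self.diff_att = make_att()

    def forward(self, rgb_feat, ir_feat):
        common = (rgb_feat + ir_feat) / 2        # 模态共享信息
        diff = rgb_feat - ir_feat                # 模态特有信息
        w_c = self.common_att(common)            # 共同模态注意图
        w_d = self.diff_att(diff)                # 差分模态注意图
        rgb_out = rgb_feat * w_c + rgb_feat * w_d
        ir_out = ir_feat * w_c + ir_feat * w_d
        return rgb_out + ir_out                  # 融合后的特征可送入检测头

fuse = CommonDifferentialAttention(channels=64)
print(fuse(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80)).shape)
```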
【5】 Anomaly Detection in IR Images of PV Modules using Supervised Contrastive Learning 标题:基于有监督对比学习的光伏组件红外图像异常检测 链接:https://arxiv.org/abs/2112.02922
作者:Lukas Bommes,Mathis Hoffmann,Claudia Buerhop-Lutz,Tobias Pickel,Jens Hauch,Christoph Brabec,Andreas Maier,Ian Marius Peters 机构:Forschungszentrum Jülich GmbH, Helmholtz-Institute Erlangen-Nuremberg for Renewable Energies (HI ERN), Pattern Recognition Lab, Department Informatik, Universität Erlangen-Nürnberg (FAU) 摘要:光伏(PV)发电厂的不断增加需要自动检测故障光伏组件的方法,如红外(IR)图像。最近,深度学习为此变得流行起来。然而,相关工作通常从同一分布中采样序列和测试数据,忽略不同光伏电站数据之间存在的域转移。相反,我们将故障检测视为更现实的无监督域自适应问题,在该问题中,我们对一个源光伏电站的标记数据进行训练,并对另一个目标电站进行预测。我们训练了一个具有监督对比损失的ResNet-34卷积神经网络,在此基础上我们使用k-最近邻分类器来检测异常。我们的方法在四个源和目标数据集的九个组合上获得了73.3%到96.6%的接收机工作特性(AUROC)下的满意区域,有292万幅红外图像,其中8.5%是异常的。在某些情况下,它甚至优于二进制交叉熵分类器。在固定的决策阈值下,这将分别导致79.4%和77.1%的正常和异常图像正确分类。大多数错误分类的异常严重程度较低,如热二极管和小热点。我们的方法对超参数设置不敏感,收敛速度快,能够可靠地检测未知类型的异常,非常适合于实际应用。可能的用途是在自动光伏电站检查系统中,或通过过滤掉正常图像来简化红外数据集的手动标签。此外,我们的工作为社区提供了一个更现实的视角,即使用无监督的领域自适应进行光伏组件故障检测,以开发具有良好泛化能力的更高性能的方法。 摘要:Increasing deployment of photovoltaic (PV) plants requires methods for automatic detection of faulty PV modules in modalities, such as infrared (IR) images. Recently, deep learning has become popular for this. However, related works typically sample train and test data from the same distribution ignoring the presence of domain shift between data of different PV plants. Instead, we frame fault detection as more realistic unsupervised domain adaptation problem where we train on labelled data of one source PV plant and make predictions on another target plant. We train a ResNet-34 convolutional neural network with a supervised contrastive loss, on top of which we employ a k-nearest neighbor classifier to detect anomalies. Our method achieves a satisfactory area under the receiver operating characteristic (AUROC) of 73.3 % to 96.6 % on nine combinations of four source and target datasets with 2.92 million IR images of which 8.5 % are anomalous. It even outperforms a binary cross-entropy classifier in some cases. With a fixed decision threshold this results in 79.4 % and 77.1 % correctly classified normal and anomalous images, respectively. Most misclassified anomalies are of low severity, such as hot diodes and small hot spots. Our method is insensitive to hyperparameter settings, converges quickly and reliably detects unknown types of anomalies making it well suited for practice. Possible uses are in automatic PV plant inspection systems or to streamline manual labelling of IR datasets by filtering out normal images. Furthermore, our work serves the community with a more realistic view on PV module fault detection using unsupervised domain adaptation to develop more performant methods with favorable generalization capabilities.
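摘要中"在对比学习特征之上用k近邻分类器检测异常"的流程可以用scikit-learn简要示意(非论文官方代码;k值与阈值均为假设):先用源电站正常红外图像的嵌入建立近邻索引,测试图像的异常分数取其到k个最近正常嵌入的平均距离。

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_knn_scorer(normal_embeddings, k=5):
    """normal_embeddings: (N, D) 正常红外图像经对比学习骨干网络得到的特征。"""
    return NearestNeighbors(n_neighbors=k).fit(normal_embeddings)

def anomaly_score(nn_index, test_embeddings):
    """返回每幅测试图像到k个最近正常样本的平均距离,距离越大越可能异常。"""
    dists, _ = nn_index.kneighbors(test_embeddings)
    return dists.mean(axis=1)

# 用随机向量演示接口
index = fit_knn_scorer(np.random.randn(1000, 128))
scores = anomaly_score(index, np.random.randn(8, 128))
pred_anomalous = scores > np.percentile(scores, 90)   # 阈值仅为示意
print(pred_anomalous)
```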
【6】 ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction 标题:ALIKE:准确、轻量级的关键点检测和描述符提取 链接:https://arxiv.org/abs/2112.02906
作者:Xiaoming Zhao,Xingming Wu,Jinyu Miao,Weihai Chen,Peter C. Y. Chen,Zhengguo Li 机构:National University of Singapore 备注:10 pages, 10 figures 摘要:现有的方法是以不可微的方式检测关键点,因此不能通过反向传播直接优化关键点的位置。为了解决这个问题,我们提出了一个可微关键点检测模块,它输出精确的亚像素关键点。然后提出重投影损耗直接优化这些亚像素关键点,并提出色散峰值损耗用于精确的关键点正则化。我们还以亚像素的方式提取描述符,并使用稳定的神经重投影误差损失对其进行训练。此外,还为关键点检测和描述符提取设计了一个轻量级网络,该网络在商用GPU上可以每秒95帧运行640x480图像。在单应估计、摄像机姿态估计和视觉(再)定位任务上,该方法与最新的方法实现了同等的性能,同时大大减少了推理时间。 摘要:Existing methods detect the keypoints in a non-differentiable way, therefore they can not directly optimize the position of keypoints through back-propagation. To address this issue, we present a differentiable keypoint detection module, which outputs accurate sub-pixel keypoints. The reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and the dispersity peak loss is presented for accurate keypoints regularization. We also extract the descriptors in a sub-pixel way, and they are trained with the stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second for 640x480 images on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves equivalent performance with the state-of-the-art approaches, while greatly reduces the inference time.
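"可微的亚像素关键点检测"的常见做法是在局部得分图上做softmax加权求坐标(soft-argmax)。下面给出这一思路的示意实现(非ALIKE官方代码;窗口大小与温度系数为假设),这样得到的坐标可被重投影损失直接反向传播优化:

```python
import torch
import torch.nn.functional as F

def soft_argmax_refine(score_map, keypoints, radius=2, temperature=0.1):
    """score_map: (H, W) 关键点得分图; keypoints: (K, 2) 整数像素坐标 (x, y)。
    返回亚像素坐标,整个过程可微(示意实现)。"""
    H, W = score_map.shape
    refined = []
    for x, y in keypoints.tolist():
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        patch = score_map[y0:y1, x0:x1]
        prob = F.softmax(patch.reshape(-1) / temperature, dim=0).reshape(patch.shape)
        ys, xs = torch.meshgrid(
            torch.arange(y0, y1, dtype=torch.float32),
            torch.arange(x0, x1, dtype=torch.float32), indexing="ij")
        refined.append(torch.stack([(prob * xs).sum(), (prob * ys).sum()]))
    return torch.stack(refined)   # (K, 2) 亚像素坐标

pts = soft_argmax_refine(torch.rand(480, 640), torch.tensor([[100, 200], [320, 240]]))
print(pts)
```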
【7】 Seeing BDD100K in dark: Single-Stage Night-time Object Detection via Continual Fourier Contrastive Learning 标题:在黑暗中看到BDD100K:基于连续傅立叶对比学习的单级夜间目标检测 链接:https://arxiv.org/abs/2112.02891
作者:Ujjal Kr Dutta 摘要:尽管最先进的目标检测器已取得巨大进步,针对夜间目标检测的研究却很少,且现有少量论文采用的评估协议并不统一。除了缺乏解决该问题的方法外,也缺乏足够大的基准数据集来研究夜间目标检测。最近,大规模的BDD100K数据集被提出,我们认为应选择它作为基准,以启动这一领域的研究。就方法而言,现有方法(数量有限)主要基于生成式图像翻译,或基于图像增强/光照处理,二者都不够自然,也不符合人类在夜间观察物体的方式(即聚焦于物体轮廓)。在本文中,我们弥补了这三个不足:1)缺乏统一的评估协议(出于有效性与效率考虑,采用单阶段检测器);2)夜间目标检测基准数据集的选择;3)提出一种解决现有替代方案局限性的新方法。我们的方法利用基于对比学习的特征提取器,通过傅立叶变换从频域借用信息,并以持续学习的方式进行训练。将学到的特征用于目标检测(微调分类与回归层之后)时,可取得新的最先进性能,轻松超越大量竞争方法。 摘要:Despite tremendous improvements in state-of-the-art object detectors, addressing object detection in the night-time has been studied only sparsely, that too, via non-uniform evaluation protocols among the limited available papers. In addition to the lack of methods to address this problem, there was also a lack of an adequately large benchmark dataset to study night-time object detection. Recently, the large scale BDD100K was introduced, which, in our opinion, should be chosen as the benchmark, to kickstart research in this area. Now, coming to the methods, existing approaches (limited in number), are mainly either generative image translation based, or image enhancement/ illumination based, neither of which is natural, conforming to how humans see objects in the night time (by focusing on object contours). In this paper, we bridge these 3 gaps: 1. Lack of an uniform evaluation protocol (using a single-stage detector, due to its efficacy, and efficiency), 2. Choice of dataset for benchmarking night-time object detection, and 3. A novel method to address the limitations of current alternatives. Our method leverages a Contrastive Learning based feature extractor, borrowing information from the frequency domain via Fourier transformation, and trained in a continual learning based fashion. The learned features when used for object detection (after fine-tuning the classification and regression layers), help achieve a new state-of-the-art empirical performance, comfortably outperforming an extensive number of competitors.
【8】 SyntEO: Synthetic Dataset Generation for Earth Observation with Deep Learning -- Demonstrated for Offshore Wind Farm Detection 标题:SyntEO:基于深度学习的对地观测综合数据集生成--以海上风电场探测为例 链接:https://arxiv.org/abs/2112.02829
作者:Thorsten Hoeser,Claudia Kuenzer 机构:German Remote Sensing Data Center (DFD), German Aerospace Center (DLR), Department of Remote Sensing, Institute of Geography and Geology, University of Wuerzburg 备注:25 pages, 12 figures 摘要:随着过去几年深入学习的出现,地球观测研究出现了新的机遇。然而,它们也带来了新的挑战。深度学习模型的数据饥渴训练过程需要大量的、资源昂贵的、带注释的数据集,并部分取代了知识驱动的方法,因此模型行为和最终预测过程成为一个黑箱。拟议的SyntEO方法使地球观测研究人员能够自动生成大型深度学习准备数据集,从而释放其他占用的资源。SyntEO通过在数据生成过程中以高度结构化的方式包含专家知识来实现这一点。通过这种方式,建立了完全可控的实验环境,支持模型训练中的洞察。因此,SyntEO使学习过程可接近,模型行为可解释,这是可解释机器学习的重要基石。我们通过在两个世界上最大的海上风能生产基地的Sentinel-1图像中预测海上风电场来演示SyntEO方法。生成的最大数据集有90000个训练示例。用于目标检测的基本卷积神经网络(仅针对该合成数据进行训练)通过在具有挑战性的环境中最小化错误检测,自信地检测海上风电场。此外,还生成了四个顺序数据集,演示了SyntEO方法如何精确定义数据集结构并影响训练过程。因此,SyntEO是一种混合方法,它在专家知识和数据驱动的图像分析之间创建了一个接口。 摘要:With the emergence of deep learning in the last years, new opportunities arose in Earth observation research. Nevertheless, they also brought with them new challenges. The data-hungry training processes of deep learning models demand large, resource expensive, annotated datasets and partly replaced knowledge-driven approaches, so that model behaviour and the final prediction process became a black box. The proposed SyntEO approach enables Earth observation researchers to automatically generate large deep learning ready datasets and thus free up otherwise occupied resources. SyntEO does this by including expert knowledge in the data generation process in a highly structured manner. In this way, fully controllable experiment environments are set up, which support insights in the model training. Thus, SyntEO makes the learning process approachable and model behaviour interpretable, an important cornerstone for explainable machine learning. We demonstrate the SyntEO approach by predicting offshore wind farms in Sentinel-1 images on two of the worlds largest offshore wind energy production sites. The largest generated dataset has 90,000 training examples. A basic convolutional neural network for object detection, that is only trained on this synthetic data, confidently detects offshore wind farms by minimising false detections in challenging environments. In addition, four sequential datasets are generated, demonstrating how the SyntEO approach can precisely define the dataset structure and influence the training process. SyntEO is thus a hybrid approach that creates an interface between expert knowledge and data-driven image analysis.
【9】 A Survey of Deep Learning for Low-Shot Object Detection 标题:深度学习在低镜头目标检测中的研究进展 链接:https://arxiv.org/abs/2112.02814
作者:Qihan Huang,Haofei Zhang,Jie Song,Mingli Song 摘要:目标检测是计算机视觉和图像处理中的一项基本任务。目前,基于深度学习的目标检测器已经取得了巨大的成功,拥有丰富的标记数据。但在现实生活中,并不能保证每个对象类别都有足够的标记样本用于训练。当训练数据有限时,这些大目标检测器容易过度拟合。因此,有必要在目标检测中引入少量镜头学习和Zero-Shot学习,将其统称为低镜头目标检测。低镜头目标检测(LSOD)的目的是从少量甚至零标记的数据中检测目标,这可以分为Few-Shot目标检测(FSOD)和Zero-Shot目标检测(ZSD)。本文对基于FSOD和ZSD的深度学习进行了全面调查。首先,本调查将FSOD和ZSD的方法分为不同的类别,并讨论了它们的优缺点。其次,本次调查回顾了FSOD和ZSD的数据集设置和评估指标,然后分析了不同方法在这些基准上的性能。最后,本调查讨论了FSOD和ZSD的未来挑战和前景。 摘要:Object detection is a fundamental task in computer vision and image processing. Current deep learning based object detectors have been highly successful with abundant labeled data. But in real life, it is not guaranteed that each object category has enough labeled samples for training. These large object detectors are easy to overfit when the training data is limited. Therefore, it is necessary to introduce few-shot learning and zero-shot learning into object detection, which can be named low-shot object detection together. Low-Shot Object Detection (LSOD) aims to detect objects from a few or even zero labeled data, which can be categorized into few-shot object detection (FSOD) and zero-shot object detection (ZSD), respectively. This paper conducts a comprehensive survey for deep learning based FSOD and ZSD. First, this survey classifies methods for FSOD and ZSD into different categories and discusses the pros and cons of them. Second, this survey reviews dataset settings and evaluation metrics for FSOD and ZSD, then analyzes the performance of different methods on these benchmarks. Finally, this survey discusses future challenges and promising directions for FSOD and ZSD.
【10】 MetaCloth: Learning Unseen Tasks of Dense Fashion Landmark Detection from a Few Samples 标题:MetaCloth:从几个样本中学习密集时尚标志检测的未见任务 链接:https://arxiv.org/abs/2112.02763
作者:Yuying Ge,Ruimao Zhang,Ping Luo 机构:The University of Hong Kong, The Chinese University of Hong Kong (Shenzhen) 备注:Accepted by IEEE Transactions on Image Processing 摘要:目前流行的时尚地标检测方法主要是通过在大规模时尚数据集上训练卷积神经网络来实现的,这些数据集具有大量的带注释的地标。然而,在实际应用中很难获得如此大规模的注释,并且成本高昂,因此需要能够从少量标记数据中很好地概括的模型。我们调查这个问题的少数镜头时尚地标检测,其中只有少数标记样本可用于一个看不见的任务。本文提出了一种新的元学习框架MetaCloth,该框架只需少量的标注样本就可以学习密集时尚地标检测的不可见任务。与以前的元学习工作不同,元学习工作的重点是解决“N-way K-shot”任务,其中每个任务通过对每个类的K个带注释的样本进行训练来预测N个类的数量(N对于所有可见和不可见的任务都是固定的),元Cloth中的任务使用K个样本检测不同服装类别的N个不同地标,其中N在任务中有所不同,因为不同的服装类别通常有不同数量的标志。因此,MetaCloth中不同的可见和不可见任务的参数数量是不同的。MetaCloth经过精心设计,可以为不同的任务动态生成不同数量的参数,并从带有一组良好初始化参数的几个带注释的样本中学习可概括的特征提取网络。大量实验表明,MetaCloth的性能大大优于同类产品。 摘要:Recent advanced methods for fashion landmark detection are mainly driven by training convolutional neural networks on large-scale fashion datasets, which has a large number of annotated landmarks. However, such large-scale annotations are difficult and expensive to obtain in real-world applications, thus models that can generalize well from a small amount of labelled data are desired. We investigate this problem of few-shot fashion landmark detection, where only a few labelled samples are available for an unseen task. This work proposes a novel framework named MetaCloth via meta-learning, which is able to learn unseen tasks of dense fashion landmark detection with only a few annotated samples. Unlike previous meta-learning work that focus on solving "N-way K-shot" tasks, where each task predicts N number of classes by training with K annotated samples for each class (N is fixed for all seen and unseen tasks), a task in MetaCloth detects N different landmarks for different clothing categories using K samples, where N varies across tasks, because different clothing categories usually have various number of landmarks. Therefore, numbers of parameters are various for different seen and unseen tasks in MetaCloth. MetaCloth is carefully designed to dynamically generate different numbers of parameters for different tasks, and learn a generalizable feature extraction network from a few annotated samples with a set of good initialization parameters. Extensive experiments show that MetaCloth outperforms its counterparts by a large margin.
【11】 Facial Emotion Characterization and Detection using Fourier Transform and Machine Learning 标题:基于傅立叶变换和机器学习的面部情感表征与检测 链接:https://arxiv.org/abs/2112.02729
作者:Aishwarya Gouru,Shan Suthaharan 机构:Department of Computer Science, University of North Carolina at Greensboro, Greensboro, NC 备注:8 pages, 3 figures 摘要:我们提出了一种基于傅立叶变换的机器学习技术,用于表征和检测面部情绪。在人脸情感分类的机器学习(ML)模型开发中,主要的挑战性任务是从一组训练样本中检测准确的情感特征,并生成特征向量以构建有意义的特征空间和建立ML模型。在本文中,我们假设情感特征隐藏在频域中;因此,可以通过利用频域和掩蔽技术来捕获它们。我们还利用了一个假设,即一个面部情绪与正常面部特征和其他情绪特征相卷积;然而,它们携带线性可分离的空间频率(我们称之为计算情感频率)。因此,我们提出了一种利用快速傅立叶变换(FFT)和矩形窄带频率核以及广泛使用的耶鲁人脸图像数据集的技术。我们使用随机森林(RF)和人工神经网络(ANN)分类器的性能分数作为度量来验证捕获的情感频率的有效性,从而验证假设。我们的发现是,通过提出的方法发现的计算情感频率提供了有意义的情感特征,帮助RF和ANN实现平均93%以上的高精度分数。 摘要:We present a Fourier-based machine learning technique that characterizes and detects facial emotions. The main challenging task in the development of machine learning (ML) models for classifying facial emotions is the detection of accurate emotional features from a set of training samples, and the generation of feature vectors for constructing a meaningful feature space and building ML models. In this paper, we hypothesis that the emotional features are hidden in the frequency domain; hence, they can be captured by leveraging the frequency domain and masking techniques. We also make use of the conjecture that a facial emotions are convoluted with the normal facial features and the other emotional features; however, they carry linearly separable spatial frequencies (we call computational emotional frequencies). Hence, we propose a technique by leveraging fast Fourier transform (FFT) and rectangular narrow-band frequency kernels, and the widely used Yale-Faces image dataset. We test the hypothesis using the performance scores of the random forest (RF) and the artificial neural network (ANN) classifiers as the measures to validate the effectiveness of the captured emotional frequencies. Our finding is that the computational emotional frequencies discovered by the proposed approach provides meaningful emotional features that help RF and ANN achieve a high precision scores above 93%, on average.
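"利用FFT与矩形窄带频率核提取情感频率"的处理流程可以用NumPy简要示意(非论文官方实现;带宽参数为假设):对人脸图像做二维FFT,仅保留一个矩形窄带内的频率分量,逆变换后得到该频带对应的空间特征,再送入随机森林或神经网络分类器。

```python
import numpy as np

def narrow_band_features(face_img, r_low=8, r_high=16):
    """face_img: (H, W) 灰度人脸图像。保留频谱中心附近 [r_low, r_high) 的矩形窄带(示意)。"""
    H, W = face_img.shape
    spec = np.fft.fftshift(np.fft.fft2(face_img))
    cy, cx = H // 2, W // 2
    mask = np.zeros((H, W), dtype=bool)
    mask[cy - r_high:cy + r_high, cx - r_high:cx + r_high] = True      # 外矩形
    mask[cy - r_low:cy + r_low, cx - r_low:cx + r_low] = False         # 挖去内矩形,得到窄带
    band = np.fft.ifft2(np.fft.ifftshift(spec * mask))
    return np.abs(band).ravel()    # 展平作为特征向量,供RF/ANN分类器使用

feat = narrow_band_features(np.random.rand(64, 64))
print(feat.shape)  # (4096,)
```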
【12】 Joint Symmetry Detection and Shape Matching for Non-Rigid Point Cloud 标题:非刚性点云的联合对称性检测与形状匹配 链接:https://arxiv.org/abs/2112.02713
作者:Abhishek Sharma,Maks Ovsjanikov 机构:LIX, Ecole Polytechnique, IPParis, France 备注:Under Review. arXiv admin note: substantial text overlap with arXiv:2110.02994 摘要:尽管深度函数映射在非刚性三维形状匹配中取得了成功,但目前还没有同时对自对称性和形状匹配进行建模的学习框架。尽管对称性失配导致的误差是非刚性形状匹配中的一个主要挑战,这一点仍然存在。在本文中,我们提出了一个新的框架,同时学习自对称性以及一对形状之间的成对映射。我们的关键思想是通过正则化项将自对称映射和成对映射耦合在一起,正则化项为它们提供联合约束,从而获得更精确的映射。我们在几个基准上验证了我们的方法,在这两个任务上,我们的方法都优于许多竞争性基线。 摘要:Despite the success of deep functional maps in non-rigid 3D shape matching, there exists no learning framework that models both self-symmetry and shape matching simultaneously. This is despite the fact that errors due to symmetry mismatch are a major challenge in non-rigid shape matching. In this paper, we propose a novel framework that simultaneously learns both self symmetry as well as a pairwise map between a pair of shapes. Our key idea is to couple a self symmetry map and a pairwise map through a regularization term that provides a joint constraint on both of them, thereby, leading to more accurate maps. We validate our method on several benchmarks where it outperforms many competitive baselines on both tasks.
【13】 Simple Adaptive Projection with Pretrained Features for Anomaly Detection 标题:带预训练特征的简单自适应投影异常检测 链接:https://arxiv.org/abs/2112.02597
作者:Xingtai Gui 机构:University of Electronic Science and Technology of China 摘要:深度异常检测的目的是用高质量的表示将异常样本与正常样本区分开来。预训练特征带来了有效的表示和良好的异常检测性能。然而,在只有单类训练数据的情况下,如何调整预训练特征是一个棘手的问题。具体来说,现有的以全局目标为优化目标的方法往往会导致模式崩溃,即所有输入都被映射到同一个点。在本文中,我们提出了一个新的自适应框架,包括简单的线性变换和自注意。这种自适应针对特定输入进行,挖掘其在预训练特征空间中正常样本的k近邻表示,以及相似单类语义特征之间的内在关系。此外,基于该框架,我们提出了一个有效的约束项来避免学习到平凡解。我们的带预训练特征的简单自适应投影(SAP2)产生了一种新的异常检测准则,该准则更精确,且对模式崩溃更鲁棒。我们的方法在语义异常检测和感官异常检测基准上实现了最先进的异常检测性能,包括CIFAR-100数据集上96.5%的AUROC、CIFAR-10数据集上97.0%的AUROC和MvTec数据集上88.1%的AUROC。 摘要:Deep anomaly detection aims to separate anomaly from normal samples with high-quality representations. Pretrained features bring effective representation and promising anomaly detection performance. However, with one-class training data, adapting the pretrained features is a thorny problem. Specifically, the existing optimization objectives with global target often lead to pattern collapse, i.e. all inputs are mapped to the same. In this paper, we propose a novel adaptation framework including simple linear transformation and self-attention. Such adaptation is applied on a specific input, and its k nearest representations of normal samples in pretrained feature space and the inner-relationship between similar one-class semantic features are mined. Furthermore, based on such framework, we propose an effective constraint term to avoid learning trivial solution. Our simple adaptive projection with pretrained features(SAP2) yields a novel anomaly detection criterion which is more accurate and robust to pattern collapse. Our method achieves state-of-the-art anomaly detection performance on semantic anomaly detection and sensory anomaly detection benchmarks including 96.5% AUROC on CIFAR-100 dataset, 97.0% AUROC on CIFAR-10 dataset and 88.1% AUROC on MvTec dataset.
【14】 BAANet: Learning Bi-directional Adaptive Attention Gates for Multispectral Pedestrian Detection 标题:BAANET:用于多光谱行人检测的学习双向自适应注意门 链接:https://arxiv.org/abs/2112.02277
作者:Xiaoxiao Yang,Yeqian Qiang,Huijie Zhu,Chunxiang Wang,Ming Yang 摘要:热红外(TIR)图像在为多光谱行人检测的RGB功能提供温度提示方面已被证明是有效的。大多数现有方法直接将TIR模式注入基于RGB的框架,或者简单地集成两种模式的结果。然而,这可能导致较差的检测性能,因为RGB和TIR特征通常具有特定于模态的噪声,这可能随着网络的传播而恶化特征。因此,本文提出了一种高效的跨模态融合模块,称为双向自适应注意门(BAA门)。基于注意机制,BAA门被设计用于提取信息特征并渐进地重新校准表征。具体而言,采用双向多阶段融合策略,逐步优化两种模式的特征,并在传播过程中保持其特异性。此外,通过基于光照的加权策略引入BAA门的自适应交互,自适应调整BAA门中的重新校准和聚集强度,增强对光照变化的鲁棒性。在具有挑战性的KAIST数据集上进行的大量实验表明,该方法性能优越,速度令人满意。 摘要:Thermal infrared (TIR) image has proven effectiveness in providing temperature cues to the RGB features for multispectral pedestrian detection. Most existing methods directly inject the TIR modality into the RGB-based framework or simply ensemble the results of two modalities. This, however, could lead to inferior detection performance, as the RGB and TIR features generally have modality-specific noise, which might worsen the features along with the propagation of the network. Therefore, this work proposes an effective and efficient cross-modality fusion module called Bi-directional Adaptive Attention Gate (BAA-Gate). Based on the attention mechanism, the BAA-Gate is devised to distill the informative features and recalibrate the representations asymptotically. Concretely, a bi-direction multi-stage fusion strategy is adopted to progressively optimize features of two modalities and retain their specificity during the propagation. Moreover, an adaptive interaction of BAA-Gate is introduced by the illumination-based weighting strategy to adaptively adjust the recalibrating and aggregating strength in the BAA-Gate and enhance the robustness towards illumination changes. Considerable experiments on the challenging KAIST dataset demonstrate the superior performance of our method with satisfactory speed.
【15】 Dense Extreme Inception Network for Edge Detection 标题:用于边缘检测的稠密极限Inception网络 链接:https://arxiv.org/abs/2112.02250
作者:Xavier Soria Poma,Angel Sappa,Patricio Humanante,Arash Arbarinia 机构:Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain, National University of Chimborazo, Riobamba, Ecuador, ESPOL Polytechnic University, FIEC, CIDIS, Guayaquil, Ecuador 备注:Paper submitted to an Elsevier journal 摘要:边缘检测是许多计算机视觉应用的基础。最先进的技术主要依赖于深度学习,有两个决定性因素:数据集内容和网络架构。大多数公开可用的数据集都不是为边缘检测任务而设计的。在这里,我们为这个约束提供了一个解决方案。首先,我们认为边缘、轮廓和边界,尽管它们相互重叠,但它们是三种不同的视觉特征,需要单独的基准数据集。为此,我们提出了一个新的边数据集。其次,我们提出了一种新的结构,称为稠密极限初始边缘检测网络(DexiNed),它可以从零开始训练,而无需任何预先训练的权重。在所提供的数据集中,DexiNed的性能优于其他算法。它还可以很好地推广到其他数据集,而无需任何微调。由于其输出的边缘更加锐利和精细,DexiNed的更高质量在视觉上也是显而易见的。 摘要:Edge detection is the basis of many computer vision applications. State of the art predominantly relies on deep learning with two decisive factors: dataset content and network's architecture. Most of the publicly available datasets are not curated for edge detection tasks. Here, we offer a solution to this constraint. First, we argue that edges, contours and boundaries, despite their overlaps, are three distinct visual features requiring separate benchmark datasets. To this end, we present a new dataset of edges. Second, we propose a novel architecture, termed Dense Extreme Inception Network for Edge Detection (DexiNed), that can be trained from scratch without any pre-trained weights. DexiNed outperforms other algorithms in the presented dataset. It also generalizes well to other datasets without any fine-tuning. The higher quality of DexiNed is also perceptually evident thanks to the sharper and finer edges it outputs.
【16】 Orientation Aware Weapons Detection In Visual Data : A Benchmark Dataset 标题:视觉数据中的方位感知武器检测:一个基准数据集 链接:https://arxiv.org/abs/2112.02221
作者:Nazeef Ul Haq,Muhammad Moazam Fraz,Tufail Sajjad Shah Hashmi,Muhammad Shahzad 机构:National University of Sciences and Technology (NUST), Islamabad, Pakistan, The Alan Turing Institute, Euston Rd, London NW,DB, United Kingdom, arXiv:,.,v, [cs.CV] , Dec 备注:Submitted this paper in Journal 摘要:武器的自动检测对于提高个人的安全和福祉具有重要意义,但由于武器的尺寸、形状和外观多种多样,这是一项艰巨的任务。视点变化和遮挡也是使此任务更加困难的原因。此外,当前的目标检测算法处理矩形区域,然而细长的步枪可能只覆盖一小部分区域,其余区域可能包含不必要的细节。为了克服这些问题,我们提出了一种用于方向感知武器检测的CNN体系结构,它提供了具有改进的武器检测性能的方向包围盒。该模型不仅以角度为分类问题,将角度分为八类,而且以角度为回归问题,提供了方向性。为了训练我们的武器检测模型,我们从web上收集了一个新的数据集,该数据集由6400个武器图像组成,然后用面向位置的边界框进行手动注释。我们的数据集不仅提供了定向边界框作为基本事实,还提供了水平边界框。我们还提供了多种格式的现代对象检测器数据集,以供该领域的进一步研究。在该数据集上对所提出的模型进行了评估,通过与现成的目标探测器进行比较分析,得出了所提出模型的优越性能,并用标准评估策略进行了衡量。数据集和模型实现可通过以下链接公开:https://bit.ly/2TyZICF. 摘要:Automatic detection of weapons is significant for improving security and well being of individuals, nonetheless, it is a difficult task due to large variety of size, shape and appearance of weapons. View point variations and occlusion also are reasons which makes this task more difficult. Further, the current object detection algorithms process rectangular areas, however a slender and long rifle may really cover just a little portion of area and the rest may contain unessential details. To overcome these problem, we propose a CNN architecture for Orientation Aware Weapons Detection, which provides oriented bounding box with improved weapons detection performance. The proposed model provides orientation not only using angle as classification problem by dividing angle into eight classes but also angle as regression problem. For training our model for weapon detection a new dataset comprising of total 6400 weapons images is gathered from the web and then manually annotated with position oriented bounding boxes. Our dataset provides not only oriented bounding box as ground truth but also horizontal bounding box. We also provide our dataset in multiple formats of modern object detectors for further research in this area. The proposed model is evaluated on this dataset, and the comparative analysis with off-the shelf object detectors yields superior performance of proposed model, measured with standard evaluation strategies. The dataset and the model implementation are made publicly available at this link: https://bit.ly/2TyZICF.
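"将角度同时作为8类分类与回归问题"的输出头可以用如下简化模块示意(非论文官方实现;解码方式为本文假设):分类分支把角度划分为8个45°区间,回归分支预测区间内的偏移量,两者组合得到最终朝向角。

```python
import torch
import torch.nn as nn

class OrientationHead(nn.Module):
    """示意:朝向角 = 8类角度区间(分类) + 区间内偏移(回归)。非论文官方实现。"""
    def __init__(self, in_dim=256, num_bins=8):
        super().__init__()
        self.num_bins = num_bins
        self.cls = nn.Linear(in_dim, num_bins)   # 每个区间跨 360/8 = 45 度
        self.reg = nn.Linear(in_dim, num_bins)   # 每个区间对应一个归一化偏移量

    def forward(self, feat):
        return self.cls(feat), self.reg(feat)

    def decode(self, cls_logits, reg_offsets):
        bin_size = 360.0 / self.num_bins
        bin_idx = cls_logits.argmax(dim=-1)                       # 选出角度区间
        offset = torch.gather(reg_offsets, -1, bin_idx.unsqueeze(-1)).squeeze(-1)
        return bin_idx.float() * bin_size + offset.clamp(0, 1) * bin_size

head = OrientationHead()
feat = torch.randn(4, 256)
angles = head.decode(*head(feat))
print(angles.shape)  # torch.Size([4])
```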
【17】 Behind the Curtain: Learning Occluded Shapes for 3D Object Detection 标题:幕后:学习用于3D对象检测的遮挡形状 链接:https://arxiv.org/abs/2112.02205
作者:Qiangeng Xu,Yiqi Zhong,Ulrich Neumann 机构:University of Southern California 备注:None 摘要:激光雷达传感器的进步提供了丰富的3D数据,支持3D场景理解。然而,由于遮挡和信号缺失,激光雷达点云实际上是2.5D的,因为它们只覆盖部分底层形状,这对3D感知提出了根本性的挑战。为了应对这一挑战,我们提出了一种新的基于激光雷达的三维物体检测模型,称为幕后探测器(BtcDet),该模型学习物体形状先验知识,并估计点云中部分遮挡的完整物体形状。BtcDet首先识别受遮挡和信号缺失影响的区域。在这些区域中,我们的模型预测占用的概率,这表明区域是否包含对象形状。与此概率图集成,BtcDet可以生成高质量的3D提案。最后,占用概率还集成到提案细化模块中,以生成最终边界框。在KITTI数据集和Waymo开放数据集上的大量实验证明了BtcDet的有效性。特别是在KITTI基准上对汽车和自行车的3D检测方面,BtcDet以惊人的优势超过了所有已发布的最新方法。代码已发布于 https://github.com/Xharlie/BtcDet 。 摘要:Advances in LiDAR sensors provide rich 3D data that supports 3D scene understanding. However, due to occlusion and signal miss, LiDAR point clouds are in practice 2.5D as they cover only partial underlying shapes, which poses a fundamental challenge to 3D perception. To tackle the challenge, we present a novel LiDAR-based 3D object detection model, dubbed Behind the Curtain Detector (BtcDet), which learns the object shape priors and estimates the complete object shapes that are partially occluded (curtained) in point clouds. BtcDet first identifies the regions that are affected by occlusion and signal miss. In these regions, our model predicts the probability of occupancy that indicates if a region contains object shapes. Integrated with this probability map, BtcDet can generate high-quality 3D proposals. Finally, the probability of occupancy is also integrated into a proposal refinement module to generate the final bounding boxes. Extensive experiments on the KITTI Dataset and the Waymo Open Dataset demonstrate the effectiveness of BtcDet. Particularly, for the 3D detection of both cars and cyclists on the KITTI benchmark, BtcDet surpasses all of the published state-of-the-art methods by remarkable margins. Code is released at https://github.com/Xharlie/BtcDet.
分类|识别相关(9篇)
【1】 Interpretable Image Classification with Differentiable Prototypes Assignment 标题:基于可微原型分配的可解释图像分类 链接:https://arxiv.org/abs/2112.02902
作者:Dawid Rymarczyk,Łukasz Struski,Michał Górszczak,Koryna Lewandowska,Jacek Tabor,Bartosz Zieliński 机构:Jagiellonian University, Ardigen SA, Department of Cognitive Neuroscience and Neuroergonomics, Institute of Applied Psychology 备注:Code will be published after paper acceptance 摘要:我们介绍ProtoPool,一个可解释的图像分类模型,它有一个由类共享的原型池。与现有方法相比,该训练更为直接,因为它不需要修剪阶段。它是通过引入原型对特定类的完全可微赋值来实现的。此外,我们还引入了一种新的焦点相似性函数,将模型聚焦在罕见的前景特征上。我们表明,ProtoPool在CUB-200-2011和斯坦福汽车数据集上获得了最先进的准确性,大大减少了原型的数量。我们提供了该方法的理论分析和用户研究,以表明我们的原型比通过竞争方法获得的原型更具特色。 摘要:We introduce ProtoPool, an interpretable image classification model with a pool of prototypes shared by the classes. The training is more straightforward than in the existing methods because it does not require the pruning stage. It is obtained by introducing a fully differentiable assignment of prototypes to particular classes. Moreover, we introduce a novel focal similarity function to focus the model on the rare foreground features. We show that ProtoPool obtains state-of-the-art accuracy on the CUB-200-2011 and the Stanford Cars datasets, substantially reducing the number of prototypes. We provide a theoretical analysis of the method and a user study to show that our prototypes are more distinctive than those obtained with competitive methods.
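"原型到类别的完全可微分配"与"焦点相似性"可以用如下简化代码说明(非ProtoPool官方实现;焦点相似性取"最大响应减平均响应"的形式以及分配矩阵的softmax化,均为本文依据摘要所作的假设):

```python
import torch
import torch.nn as nn

class SoftPrototypeAssignment(nn.Module):
    """示意:共享原型池 + 可微的原型-类别分配 + 焦点相似性(非ProtoPool官方实现)。"""
    def __init__(self, feat_dim=128, num_prototypes=200, num_classes=200):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        # 分配矩阵经softmax后,每个原型对各类别的归属是可微的软分配
        self.assign_logits = nn.Parameter(torch.zeros(num_prototypes, num_classes))

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, feat_dim) 图像局部特征
        protos = self.prototypes.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        sim = -torch.cdist(patch_feats, protos)                    # 距离取负作为相似度
        # 焦点相似性:最大响应减平均响应,突出稀有的前景激活(形式为假设)
        focal = sim.max(dim=1).values - sim.mean(dim=1)            # (B, num_prototypes)
        assignment = torch.softmax(self.assign_logits, dim=-1)     # (num_prototypes, num_classes)
        return focal @ assignment                                   # (B, num_classes) 分类logits

model = SoftPrototypeAssignment()
print(model(torch.randn(2, 49, 128)).shape)  # torch.Size([2, 200])
```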
【2】 Letter-level Online Writer Identification 标题:字母级在线书写者识别 链接:https://arxiv.org/abs/2112.02824
作者:Zelin Chen,Hong-Xing Yu,Ancong Wu,Wei-Shi Zheng 备注:Published in International Journal of Computer Vision (cite as: Chen, Zelin & Yu, Hong-Xing & Wu, Ancong & Zheng, Wei-Shi. Letter-Level Online Writer Identification. International Journal of Computer Vision) 摘要:书写者识别(writer-id)是生物特征识别的一个重要领域,旨在通过书写者的笔迹来识别其身份。现有writer-id研究中的身份识别需要完整的文档或文本,这限制了writer-id在实际应用中的可扩展性和灵活性。为了使writer-id的应用更加实用(例如在移动设备上),我们关注一个新的问题,即字母级的在线writer-id,它只需要少量书写字母的轨迹作为识别线索。与具有丰富识别上下文的基于文本/文档的writer-id不同,仅从几个字母中识别书写者的线索要少得多。一个主要的挑战在于,同一个人在不同时刻常以不同的风格书写同一个字母。我们把这个问题称为在线书写风格的变化(Var-O-Styles)。我们以"捕获-规范化-聚合"的方式处理Var-O-Styles:首先,我们通过精心设计的多分支编码器提取字母轨迹的不同特征,试图捕获不同的在线书写风格;然后通过一个新的规范化层将所有这些风格特征转换到参考风格特征域;最后,我们通过层次注意力池化(HAP)聚合规范化后的特征,将具有多种书写风格的所有输入字母融合成一个紧凑的特征向量。此外,我们还提供了一个用于评估的大规模字母级在线书写者识别数据集(LERID)。大量的对比实验证明了该框架的有效性。 摘要:Writer identification (writer-id), an important field in biometrics, aims to identify a writer by their handwriting. Identification in existing writer-id studies requires a complete document or text, limiting the scalability and flexibility of writer-id in realistic applications. To make the application of writer-id more practical (e.g., on mobile devices), we focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues. Unlike text-/document-based writer-id which has rich context for identification, there are much fewer clues to recognize an author from only a few single letters. A main challenge is that a person often writes a letter in different styles from time to time. We refer to this problem as the variance of online writing styles (Var-O-Styles). We address the Var-O-Styles in a capture-normalize-aggregate fashion: Firstly, we extract different features of a letter trajectory by a carefully designed multi-branch encoder, in an attempt to capture different online writing styles. Then we convert all these style features to a reference style feature domain by a novel normalization layer. Finally, we aggregate the normalized features by a hierarchical attention pooling (HAP), which fuses all the input letters with multiple writing styles into a compact feature vector. In addition, we also contribute a large-scale LEtter-level online wRiter IDentification dataset (LERID) for evaluation. Extensive comparative experiments demonstrate the effectiveness of the proposed framework.
【3】 STSM: Spatio-Temporal Shift Module for Efficient Action Recognition 标题:STSM:一种高效的动作识别时空移位模块 链接:https://arxiv.org/abs/2112.02523
作者:Zhaoqilin Yang,Gaoyun An 备注:9 pages,4 figures 摘要:传统时空网络的建模、计算量和准确性是视频动作识别领域的三大研究热点。传统的二维卷积算法计算量小,但不能捕获时间关系;基于三维卷积的卷积神经网络(CNNs)模型可以获得良好的性能,但其计算量大,参数量大。在本文中,我们提出了一种即插即用的时空移位模块(STSM),这是一种既有效又高性能的通用模块。具体地说,在将STSM插入其他网络之后,可以在不增加计算和参数数量的情况下改进网络的性能。特别是,当网络是2D CNN时,我们的STSM模块允许网络学习有效的时空特征。我们对所提出的模块进行了广泛的评估,进行了大量实验以研究其在视频动作识别中的有效性,并在Kinetics-400和Something-Something V2数据集上获得了最先进的结果。 摘要:The modeling, computational cost, and accuracy of traditional Spatio-temporal networks are the three most concentrated research topics in video action recognition. The traditional 2D convolution has a low computational cost, but it cannot capture the time relationship; the convolutional neural networks (CNNs) model based on 3D convolution can obtain good performance, but its computational cost is high, and the amount of parameters is large. In this paper, we propose a plug-and-play Spatio-temporal Shift Module (STSM), which is a generic module that is both effective and high-performance. Specifically, after STSM is inserted into other networks, the performance of the network can be improved without increasing the number of calculations and parameters. In particular, when the network is 2D CNNs, our STSM module allows the network to learn efficient Spatio-temporal features. We conducted extensive evaluations of the proposed module, conducted numerous experiments to study its effectiveness in video action recognition, and achieved state-of-the-art results on the Kinetics-400 and Something-Something V2 datasets.
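下面是一个示意性的PyTorch片段,演示“零参数、零额外计算量的通道时间移位”这一思路(类似TSM的做法;并非论文STSM的官方实现,fold_div等参数为假设值):

import torch

def spatio_temporal_shift(x, fold_div=8):
    # 示意性实现:对一部分通道沿时间维做 +/-1 帧移位,其余通道保持不变
    # x: (N, T, C, H, W)
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # 第一组通道向后移位
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # 第二组通道向前移位
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # 其余通道不移位
    return out

x = torch.randn(2, 8, 64, 16, 16)
y = spatio_temporal_shift(x)    # 输出形状不变: (2, 8, 64, 16, 16)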
【4】 Face Trees for Expression Recognition 标题:用于表情识别的人脸树 链接:https://arxiv.org/abs/2112.02487
作者:Mojtaba Kolahdouzi,Alireza Sepas-Moghaddam,Ali Etemad 机构:Dept. ECE and Ingenuity Labs Research Institute, Queen’s University, Kingston, Canada 摘要:我们提出了一种端到端的人脸表情识别体系结构。我们的模型学习人脸标志的最佳树拓扑结构,通过遍历生成一个序列,我们从中获得一个嵌入来为序列学习者提供信息。提出的体系结构包含两个主流,一个侧重于地标位置以学习人脸结构,另一个侧重于地标周围的面片以学习纹理信息。每个流后面都有一个注意机制,输出被馈送到两流融合组件以执行最终分类。我们在两个大规模公开的面部表情数据集AffectNet和FER2013上进行了广泛的实验,以评估我们方法的有效性。我们的方法在这方面优于其他解决方案,并在这些数据集上设置了新的最先进的表达式识别率。 摘要:We propose an end-to-end architecture for facial expression recognition. Our model learns an optimal tree topology for facial landmarks, whose traversal generates a sequence from which we obtain an embedding to feed a sequential learner. The proposed architecture incorporates two main streams, one focusing on landmark positions to learn the structure of the face, while the other focuses on patches around the landmarks to learn texture information. Each stream is followed by an attention mechanism and the outputs are fed to a two-stream fusion component to perform the final classification. We conduct extensive experiments on two large-scale publicly available facial expression datasets, AffectNet and FER2013, to evaluate the efficacy of our approach. Our method outperforms other solutions in the area and sets new state-of-the-art expression recognition rates on these datasets.
【5】 Label Hierarchy Transition: Modeling Class Hierarchies to Enhance Deep Classifiers 标题:标签层次转换:对类层次进行建模以增强深度分类器 链接:https://arxiv.org/abs/2112.02353
作者:Renzhen Wang,De cai,Kaiwen Xiao,Xixi Jia,Xiao Han,Deyu Meng 机构:Xi’an Jiaotong University, Tencent, Xidian University 摘要:层次分类的目的是将对象按类别层次进行分类。例如,一只鸟可以按照顺序、科和种的三级层次结构进行分类。现有方法通常通过将层次分类分解为多个多类分类任务来解决层次分类问题。然而,这种多任务学习策略未能充分利用不同层次中不同类别之间的相关性。在本文中,我们提出了标签层次转换,一个基于深度学习的统一概率框架,以解决层次分类问题。具体地说,我们明确地学习了标签层次转换矩阵,其列向量表示类在两个相邻层次之间的条件标签分布,并且能够编码嵌入在类层次中的相关性。我们进一步提出了一种混淆损失,它鼓励分类网络在训练期间学习不同标签层次之间的相关性。所提出的框架只需稍作修改即可适应任何现有的深度网络。我们对三个具有不同类层次结构的公共基准数据集进行了实验,结果表明我们的方法优于现有技术。源代码将公开提供。 摘要:Hierarchical classification aims to sort the object into a hierarchy of categories. For example, a bird can be categorized according to a three-level hierarchy of order, family, and species. Existing methods commonly address hierarchical classification by decoupling it into several multi-class classification tasks. However, such a multi-task learning strategy fails to fully exploit the correlation among various categories across different hierarchies. In this paper, we propose Label Hierarchy Transition, a unified probabilistic framework based on deep learning, to address hierarchical classification. Specifically, we explicitly learn the label hierarchy transition matrices, whose column vectors represent the conditional label distributions of classes between two adjacent hierarchies and could be capable of encoding the correlation embedded in class hierarchies. We further propose a confusion loss, which encourages the classification network to learn the correlation across different label hierarchies during training. The proposed framework can be adapted to any existing deep network with only minor modifications. We experiment with three public benchmark datasets with various class hierarchies, and the results demonstrate the superiority of our approach beyond the prior arts. Source code will be made publicly available.
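为说明“层级间转移矩阵的列向量表示相邻两层类别之间的条件标签分布”这一点,下面给出一个最小化的PyTorch草图(假设的简化实现,父类/子类数量为占位参数):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelHierarchyTransition(nn.Module):
    # 示意性草图:学习一个层级间转移矩阵,每一列经 softmax 后
    # 表示“给定父类时子类的条件分布”(对论文思想的假设性简化)
    def __init__(self, num_parent, num_child):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_child, num_parent))

    def forward(self, parent_probs):            # parent_probs: (B, num_parent)
        T = F.softmax(self.logits, dim=0)       # 每列归一化为条件分布 P(child|parent)
        return parent_probs @ T.t()             # 得到子层级概率, (B, num_child)

# 用法示例:13 个“目”级类别映射到 38 个“科”级类别
lht = LabelHierarchyTransition(num_parent=13, num_child=38)
child = lht(torch.softmax(torch.randn(4, 13), dim=-1))   # (4, 38), 每行和为 1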
【6】 Ablation study of self-supervised learning for image classification 标题:自监督学习在图像分类中的消融研究 链接:https://arxiv.org/abs/2112.02297
作者:Ilias Papastratis 机构:Aristotle University, Department of Informatics, DMCI, Thessaloniki 摘要:本项目的重点是卷积神经网络(CNN)和Transformer网络的自我监督训练,用于图像识别任务。为了最大化来自同一源图像的两个增强变换图像的相似性,使用了具有不同主干的简单暹罗网络。通过这种方式,主干能够在没有监督的情况下学习视觉信息。最后,在三个图像识别数据集上对该方法进行了评价。 摘要:This project focuses on the self-supervised training of convolutional neural networks (CNNs) and transformer networks for the task of image recognition. A simple siamese network with different backbones is used in order to maximize the similarity of two augmented transformed images from the same source image. In this way, the backbone is able to learn visual information without supervision. Finally, the method is evaluated on three image recognition datasets.
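摘要描述的是“简单孪生网络最大化同一源图像两个增强视图的相似度”,下面给出一个SimSiam风格的自监督损失草图(仅为示意,假设p*为predictor输出、z*为encoder输出,并对目标分支停止梯度):

import torch.nn.functional as F

def siamese_similarity_loss(p1, z1, p2, z2):
    # 负余弦相似度:最大化两个增强视图表示的相似度,目标分支停止梯度
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)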
【7】 Feature-based Recognition Framework for Super-resolution Images 标题:基于特征的超分辨率图像识别框架 链接:https://arxiv.org/abs/2112.02270
作者:Jing Hu,Meiqi Zhang,Rui Zhang 机构:School of Artificial Intelligence and Automation, HUST 备注:7 pages, 2 figures 摘要:在实际应用中,识别网络应用于超分辨率图像时,其性能往往会下降。在本文中,我们提出了一种结合GAN(FGAN)的基于特征的识别网络。我们的网络通过从SR图像中提取更多有利于识别的特征来提高识别精度。在实验中,我们使用三种不同的超分辨率算法构建了三个数据集,与ResNet50和DenseNet121相比,我们的网络将识别准确率提高了6%以上。 摘要:In practical application, the performance of recognition network usually decreases when being applied on super-resolution images. In this paper, we propose a feature-based recognition network combined with GAN (FGAN). Our network improves the recognition accuracy by extracting more features that benefit recognition from SR images. In the experiment, we build three datasets using three different super-resolution algorithms, and our network increases the recognition accuracy by more than 6% compared with ResNet50 and DenseNet121.
【8】 Novel Local Radiomic Bayesian Classifiers for Non-Invasive Prediction of MGMT Methylation Status in Glioblastoma 标题:用于无创性预测胶质母细胞瘤MGMT甲基化状态的新型局部放射贝叶斯分类器 链接:https://arxiv.org/abs/2112.03259
作者:Mihir Rao 机构:Chatham High School, Chatham, NJ 摘要:胶质母细胞瘤是一种侵袭性脑癌,是所有癌症中最致命的一种。胶质母细胞瘤组织中O6-甲基鸟嘌呤DNA甲基转移酶(MGMT)基因的表达具有临床意义,因为它对替莫唑胺的疗效有显著影响,替莫唑胺是治疗胶质母细胞瘤患者的主要化疗药物。目前,MGMT甲基化是通过侵入性脑活检和随后提取的肿瘤组织的遗传分析确定的。在这项工作中,我们提出了新的贝叶斯分类器,基于从FLAIR序列磁共振图像(MRIs)中提取的放射特征对MGMT甲基化状态进行概率预测。我们利用局部放射技术生成放射激活图,并基于原始体素强度的统计特征分析MGMT生物标记物的磁共振成像。我们展示了简单贝叶斯分类器在建模局部放射性数据而非全局特征时提供预测性能提升的能力。所提出的技术为确定胶质母细胞瘤患者MGMT甲基化状态提供了一种基于MRI的非侵入性方法。 摘要:Glioblastoma, an aggressive brain cancer, is amongst the most lethal of all cancers. Expression of the O6-methylguanine-DNA-methyltransferase (MGMT) gene in glioblastoma tumor tissue is of clinical importance as it has a significant effect on the efficacy of Temozolomide, the primary chemotherapy treatment administered to glioblastoma patients. Currently, MGMT methylation is determined through an invasive brain biopsy and subsequent genetic analysis of the extracted tumor tissue. In this work, we present novel Bayesian classifiers that make probabilistic predictions of MGMT methylation status based on radiomic features extracted from FLAIR-sequence magnetic resonance imagery (MRIs). We implement local radiomic techniques to produce radiomic activation maps and analyze MRIs for the MGMT biomarker based on statistical features of raw voxel-intensities. We demonstrate the ability for simple Bayesian classifiers to provide a boost in predictive performance when modelling local radiomic data rather than global features. The presented techniques provide a non-invasive MRI-based approach to determining MGMT methylation status in glioblastoma patients.
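下面用scikit-learn给出一个示意性片段,演示“在体素强度的统计特征上用简单贝叶斯分类器做概率预测”的流程(特征与标签均为随机占位数据,并非论文所用的FLAIR影像放射组学特征):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# 示意:每个病例用若干体素强度统计量(均值/方差/偏度/峰度等)作为特征
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))      # 占位特征,实际应为放射组学统计特征
y = rng.integers(0, 2, size=120)   # 0 = 未甲基化, 1 = 甲基化

clf = GaussianNB()
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
clf.fit(X, y)
print("P(MGMT methylated):", clf.predict_proba(X[:3])[:, 1])   # 概率化预测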
【9】 Classification of COVID-19 on chest X-Ray images using Deep Learning model with Histogram Equalization and Lungs Segmentation 标题:基于直方图均衡化和肺部分割的深度学习模型在胸片冠状病毒分类中的应用 链接:https://arxiv.org/abs/2112.02478
作者:Hitendra Singh Bhadouria,Krishan Kumar,Aman Swaraj,Karan Verma,Arshpreet Kaur,Shasvat Sharma,Ghanshyam Singh,Ashok Kumar,Leandro Melo de Sales 机构:National Institute of Technology, Delhi; Indian Institute of Technology, Roorkee; DIT University, Dehradun; Malaviya National Institute of Technology, Jaipur; Government Mahila Engineering College, Ajmer; Universidade Federal De Alagoas-UFAL, Brasil 备注:Total number of words of the manuscript- 6577 The number of words of the abstract- 238 The number of figures- 8 The number of tables- 10 摘要:背景和目的:人工智能(AI)方法与生物医学分析相结合,在大流行期间发挥着关键作用,有助于缓解医疗系统和医生所承受的巨大压力。随着COVID-19疫情在巴西和印度等人口密集且检测试剂不足的国家持续恶化,放射影像可以作为重要的诊断工具,准确地对COVID-19患者进行分类,并及时给出必要的治疗。基于这一动机,我们提出了基于深度学习架构、利用胸部X射线检测COVID-19感染肺部的研究。数据集:我们共收集了2470幅图像,分为三个类别标签:健康肺、普通肺炎和COVID-19感染性肺炎,其中470幅X射线图像属于COVID-19类别。方法:我们首先使用直方图均衡化技术对所有图像进行预处理,然后使用U-net结构对其进行分割。接着利用VGG-16网络对预处理后的图像进行特征提取,再利用SMOTE过采样技术得到一个类别平衡的数据集。最后,使用支持向量机(SVM)分类器结合10折交叉验证对类别平衡后的特征进行分类,并评估分类精度。结果与结论:将成熟的预处理技术、特征提取方法与数据集平衡方法相结合,我们在2470幅X射线图像数据集上对COVID-19图像取得了98%的识别率。因此,我们的模型适合在医疗机构中用于筛查。 摘要:Background and Objective: Artificial intelligence (AI) methods coupled with biomedical analysis has a critical role during pandemics as it helps to release the overwhelming pressure from healthcare systems and physicians. As the ongoing COVID-19 crisis worsens in countries having dense populations and inadequate testing kits like Brazil and India, radiological imaging can act as an important diagnostic tool to accurately classify covid-19 patients and prescribe the necessary treatment in due time. With this motivation, we present our study based on deep learning architecture for detecting covid-19 infected lungs using chest X-rays. Dataset: We collected a total of 2470 images for three different class labels, namely, healthy lungs, ordinary pneumonia, and covid-19 infected pneumonia, out of which 470 X-ray images belong to the covid-19 category. Methods: We first pre-process all the images using histogram equalization techniques and segment them using U-net architecture. VGG-16 network is then used for feature extraction from the pre-processed images which is further sampled by SMOTE oversampling technique to achieve a balanced dataset. Finally, the class-balanced features are classified using a support vector machine (SVM) classifier with 10-fold cross-validation and the accuracy is evaluated. Result and Conclusion: Our novel approach combining well-known pre-processing techniques, feature extraction methods, and dataset balancing method, lead us to an outstanding rate of recognition of 98% for COVID-19 images over a dataset of 2470 X-ray images. Our model is therefore fit to be utilized in healthcare facilities for screening purposes.
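方法部分描述的“特征提取后用SMOTE过采样平衡类别,再用SVM做10折交叉验证分类”可以用下面的示意性片段表达(需要第三方库imbalanced-learn;特征X为占位的随机数组,实际应替换为VGG-16提取的特征):

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 示意:假设已用 VGG-16 对每幅 X 光片提取好特征 X 与标签 y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 512))           # 占位特征,实际应为 VGG-16 特征
y = rng.integers(0, 3, size=300)          # 0=健康, 1=普通肺炎, 2=COVID-19

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)    # SMOTE 类别平衡
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X_bal, y_bal, cv=10)          # 10 折交叉验证
print("mean accuracy:", scores.mean())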
分割|语义相关(14篇)
【1】 Unsupervised Domain Adaptation for Semantic Image Segmentation: a Comprehensive Survey 标题:无监督区域自适应语义图像分割研究综述 链接:https://arxiv.org/abs/2112.03241
作者:Gabriela Csurka,Riccardo Volpi,Boris Chidlovskii 机构:Naver Labs Europe, France 备注:33 pages 摘要:语义分割在各种计算机视觉应用中起着基础性的作用,为图像的全局理解提供了关键信息。然而,最先进的模型依赖于大量带注释的样本,这些样本的获取成本比图像分类等任务更高。由于未标记数据的获取成本显著降低,因此无监督领域自适应在语义分割社区中取得了广泛的成功也就不足为奇了。本次调查旨在总结这一飞速发展的领域五年来的工作,其中包括语义分割本身的重要性以及使分割模型适应新环境的关键需求。我们提出了最重要的语义分割方法;我们提供了一个全面的调查领域适应技术的语义分割;我们揭示了新的趋势,如多领域学习、领域泛化、测试时间自适应或无源领域自适应;我们通过描述语义切分研究中最广泛使用的数据集和基准来总结这项调查。我们希望这项调查将为学术界和工业界的研究人员提供全面的参考指南,并帮助他们在该领域培养新的研究方向。 摘要:Semantic segmentation plays a fundamental role in a broad variety of computer vision applications, providing key information for the global understanding of an image. Yet, the state-of-the-art models rely on large amount of annotated samples, which are more expensive to obtain than in tasks such as image classification. Since unlabelled data is instead significantly cheaper to obtain, it is not surprising that Unsupervised Domain Adaptation reached a broad success within the semantic segmentation community. This survey is an effort to summarize five years of this incredibly rapidly growing field, which embraces the importance of semantic segmentation itself and a critical need of adapting segmentation models to new environments. We present the most important semantic segmentation methods; we provide a comprehensive survey on domain adaptation techniques for semantic segmentation; we unveil newer trends such as multi-domain learning, domain generalization, test-time adaptation or source-free domain adaptation; we conclude this survey by describing datasets and benchmarks most widely used in semantic segmentation research. We hope that this survey will provide researchers across academia and industry with a comprehensive reference guide and will help them in fostering new research directions in the field.
【2】 Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples 标题:未见任何切分示例的自然语义切分 链接:https://arxiv.org/abs/2112.03185
作者:Nir Zabari,Yedid Hoshen 机构:School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel 摘要:语义分割是计算机视觉的一项重要任务,几十年来一直在积极研究。近年来,有监督的方法已经达到了前所未有的精度,但是它们需要对每个新的类别进行许多像素级的注释,这非常耗时和昂贵。此外,当前语义分割网络处理大量类别的能力是有限的。这意味着包含稀有类别的图像不太可能被当前的方法很好地分割。在本文中,我们提出了一种为每个对象创建语义分割模板的新方法,无需训练分割网络或查看任何分割模板。我们的方法以图像中存在的类类别的图像级标签作为输入;它们可以自动或手动获取。我们利用视觉语言嵌入模型(特别是CLIP)使用模型可解释性方法为每个类创建一个粗略的分割图。我们使用测试时间增强技术来优化映射。此阶段的输出提供像素级伪标签,而不是监督方法所需的手动像素级标签。在给定伪标签的情况下,我们利用单个图像分割技术获得高质量的输出分割模板。我们的方法在数量和质量上都优于使用类似数量监督的方法。对于包含稀有类别的图像,我们的结果尤其显著。 摘要:Semantic segmentation is a key computer vision task that has been actively researched for decades. In recent years, supervised methods have reached unprecedented accuracy, however they require many pixel-level annotations for every new class category which is very time-consuming and expensive. Additionally, the ability of current semantic segmentation networks to handle a large number of categories is limited. That means that images containing rare class categories are unlikely to be well segmented by current methods. In this paper we propose a novel approach for creating semantic segmentation masks for every object, without the need for training segmentation networks or seeing any segmentation masks. Our method takes as input the image-level labels of the class categories present in the image; they can be obtained automatically or manually. We utilize a vision-language embedding model (specifically CLIP) to create a rough segmentation map for each class, using model interpretability methods. We refine the maps using a test-time augmentation technique. The output of this stage provides pixel-level pseudo-labels, instead of the manual pixel-level labels required by supervised methods. Given the pseudo-labels, we utilize single-image segmentation techniques to obtain high-quality output segmentation masks. Our method is shown quantitatively and qualitatively to outperform methods that use a similar amount of supervision. Our results are particularly remarkable for images containing rare categories.
【3】 Diffusion Models for Implicit Image Segmentation Ensembles 标题:隐式图像分割集成的扩散模型 链接:https://arxiv.org/abs/2112.03145
作者:Julia Wolleb,Robin Sandkühler,Florentin Bieder,Philippe Valmaggia,Philippe C. Cattin 机构:Robin Sandkühler, Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland 摘要:扩散模型在图像生成建模方面表现出了令人印象深刻的性能。本文提出了一种基于扩散模型的语义分割方法。通过修改训练和采样方案,我们证明了扩散模型可以对医学图像进行病变分割。为了生成特定于图像的分割,我们在地面真值分割上对模型进行训练,并在训练期间和采样过程的每个步骤中使用图像作为先验。在给定的随机采样过程中,我们可以生成分割掩模的一个分布。该属性使我们能够计算分割结果的逐像素不确定性图,并允许对分割结果进行隐式集成,从而提高分割性能。我们在BRATS2020数据集上评估了我们的脑肿瘤分割方法。与最先进的分割模型相比,我们的方法产生了良好的分割结果,此外,还产生了有意义的不确定性图。 摘要:Diffusion models have shown impressive performance for generative modelling of images. In this paper, we present a novel semantic segmentation method based on diffusion models. By modifying the training and sampling scheme, we show that diffusion models can perform lesion segmentation of medical images. To generate an image specific segmentation, we train the model on the ground truth segmentation, and use the image as a prior during training and in every step during the sampling process. With the given stochastic sampling process, we can generate a distribution of segmentation masks. This property allows us to compute pixel-wise uncertainty maps of the segmentation, and allows an implicit ensemble of segmentations that increases the segmentation performance. We evaluate our method on the BRATS2020 dataset for brain tumor segmentation. Compared to state-of-the-art segmentation models, our approach yields good segmentation results and, additionally, meaningful uncertainty maps.
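下面给出一个示意性片段,说明如何利用分割采样的随机性做隐式集成并得到逐像素不确定性图(sample_fn为假设的接口,代表论文中一次完整的扩散采样;并非官方实现):

import torch

def ensemble_segmentation(sample_fn, image, n_samples=10):
    # sample_fn(image) 假设返回一张 (H, W) 的分割概率图
    samples = torch.stack([sample_fn(image) for _ in range(n_samples)])  # (N, H, W)
    mean_mask = samples.mean(dim=0)       # 隐式集成后的分割概率
    uncertainty = samples.var(dim=0)      # 逐像素不确定性图
    return (mean_mask > 0.5).float(), uncertainty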
【4】 Label-Efficient Semantic Segmentation with Diffusion Models 标题:基于扩散模型的高效标注语义分割 链接:https://arxiv.org/abs/2112.03126
作者:Dmitry Baranchuk,Ivan Rubachev,Andrey Voynov,Valentin Khrulkov,Artem Babenko 机构: Yandex, Russia, National Research University Higher School of Economics, Russia 摘要:去噪扩散概率模型最近受到了广泛的研究关注,因为它们优于其他方法,如GANs,并且目前提供了最先进的生成性能。扩散模型优越的性能使其在修复、超分辨率和语义编辑等应用中成为一种极具吸引力的工具。在本文中,我们证明了扩散模型也可以作为语义分割的工具,特别是在标记数据稀少的情况下。特别是,对于几个预训练扩散模型,我们研究了执行反向扩散过程马尔可夫步的网络的中间激活。我们表明,这些激活有效地捕获了输入图像的语义信息,并且似乎是分割问题的优秀像素级表示。基于这些观察结果,我们描述了一种简单的分割方法,即使只提供少量的训练图像,该方法也可以工作。我们的方法在几个数据集上显著优于现有的替代方法,以获得相同数量的人类监督。 摘要:Denoising diffusion probabilistic models have recently received much research attention since they outperform alternative approaches, such as GANs, and currently provide state-of-the-art generative performance. The superior performance of diffusion models has made them an appealing tool in several applications, including inpainting, super-resolution, and semantic editing. In this paper, we demonstrate that diffusion models can also serve as an instrument for semantic segmentation, especially in the setup when labeled data is scarce. In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process. We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem. Based on these observations, we describe a simple segmentation method, which can work even if only a few training images are provided. Our approach significantly outperforms the existing alternatives on several datasets for the same amount of human supervision.
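下面是一个示意性的PyTorch流程,演示“抓取预训练网络中间激活作为逐像素表示,仅训练一个轻量分割头”的做法(此处用占位backbone与随机数据代替论文中的扩散UNet与真实标注,仅说明流程):

import torch
import torch.nn as nn
import torch.nn.functional as F

# 占位 backbone:真实方法中应为扩散模型反向过程某个时间步的 UNet
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)
feats = {}
backbone[2].register_forward_hook(lambda m, i, o: feats.update(mid=o))  # 抓取中间激活

head = nn.Conv2d(64, 2, kernel_size=1)        # 逐像素线性分类头(2 类)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

img = torch.randn(4, 3, 64, 64)               # 占位图像与标注
mask = torch.randint(0, 2, (4, 64, 64))
with torch.no_grad():
    backbone(img)                             # 特征提取器冻结,只前向
logits = head(feats["mid"])                   # 只训练分割头
loss = F.cross_entropy(logits, mask)
loss.backward()
opt.step()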
【5】 Reliable Propagation-Correction Modulation for Video Object Segmentation 标题:用于视频对象分割的可靠传播-校正调制 链接:https://arxiv.org/abs/2112.02853
作者:Xiaohao Xu,Jinglu Wang,Xiao Li,Yan Lu 机构: Huangzhong University of Science & Technology, Microsoft Research Asia 备注:13 pages, 8 figures, AAAI 2022 Accepted 摘要:错误传播是在线半监督视频对象分割中一个普遍但关键的问题。我们的目标是通过高可靠性的校正机制来抑制误差传播。关键的洞察是从传统的掩模传播过程中分离出具有可靠线索的校正。我们引入了两个调制器,即传播调制器和校正调制器,分别根据局部时间相关性和可靠参考对目标帧嵌入进行信道重新校准。具体来说,我们使用级联传播校正方案组装调制器。这避免了传播调制器覆盖可靠校正调制器的效果。尽管带有地面真值标签的参考框架提供了可靠的线索,但它可能与目标框架非常不同,并引入不确定或不完整的相关性。我们通过向维护的池补充可靠的特征补丁来增加参考线索,从而为调制器提供更全面和更具表现力的对象表示。此外,还设计了一个可靠性过滤器来检索可靠的补丁并在后续帧中传递它们。我们的模型在YouTube-VOS18/19和DAVIS17 Val/Test基准上实现了最先进的性能。大量的实验表明,通过充分利用可靠的制导,校正机制提供了可观的性能增益。代码可从以下网址获取:https://github.com/JerryX1110/RPCMVOS. 摘要:Error propagation is a general but crucial problem in online semi-supervised video object segmentation. We aim to suppress error propagation through a correction mechanism with high reliability. The key insight is to disentangle the correction from the conventional mask propagation process with reliable cues. We introduce two modulators, propagation and correction modulators, to separately perform channel-wise re-calibration on the target frame embeddings according to local temporal correlations and reliable references respectively. Specifically, we assemble the modulators with a cascaded propagation-correction scheme. This avoids overriding the effects of the reliable correction modulator by the propagation modulator. Although the reference frame with the ground truth label provides reliable cues, it could be very different from the target frame and introduce uncertain or incomplete correlations. We augment the reference cues by supplementing reliable feature patches to a maintained pool, thus offering more comprehensive and expressive object representations to the modulators. In addition, a reliability filter is designed to retrieve reliable patches and pass them in subsequent frames. Our model achieves state-of-the-art performance on YouTube-VOS18/19 and DAVIS17-Val/Test benchmarks. Extensive experiments demonstrate that the correction mechanism provides considerable performance gain by fully utilizing reliable guidance. Code is available at: https://github.com/JerryX1110/RPCMVOS.
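摘要中的“按通道对目标帧嵌入做重标定”可以用类似如下的调制器来理解(示意性草图,并非论文的传播/校正调制器实现;scale/bias的生成方式为假设):

import torch
import torch.nn as nn

class ChannelModulator(nn.Module):
    # 根据参考特征生成逐通道的缩放与偏置,对目标帧特征做通道级重标定
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 2 * channels),
        )

    def forward(self, target_feat, reference_feat):
        scale, bias = self.fc(reference_feat).chunk(2, dim=1)
        scale = torch.sigmoid(scale)[..., None, None]     # (B, C, 1, 1)
        bias = bias[..., None, None]
        return target_feat * scale + bias

mod = ChannelModulator(channels=256)
out = mod(torch.randn(2, 256, 24, 24), torch.randn(2, 256, 24, 24))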
【6】 A hybrid convolutional neural network/active contour approach to segmenting dead trees in aerial imagery 标题:航空影像枯木分割的卷积神经网络/活动轮廓混合方法 链接:https://arxiv.org/abs/2112.02725
作者:Jacquelyn A. Shelton,Przemyslaw Polewski,Wei Yao,Marco Heurich 机构:The Hong Kong Polytechnic University, Dept. of Land Surveying and Geo-Informatics, Bavarian Forest National Park, Dept. for Visitor Management and National Park Monitoring, and the Dept. of Wildlife Ecology and Management 摘要:生态系统的稳定性和抵御气候变化的能力与其生物多样性直接相关。枯树是整个森林健康的关键指标,拥有森林生态系统生物多样性的三分之一,占全球碳储量的8%。它们被多种自然因素分解,如气候、昆虫和真菌。准确地检测和模拟死木块对于理解森林生态、碳循环和分解者至关重要。我们提出了一种新的方法,通过将已建立的卷积神经网络与能量最小化框架下的新的活动轮廓模型相结合,从航空照片中构建精确的死亡树木形状轮廓。我们的方法在精确度、召回率和相交性方面优于最新技术,而不是检测到的死树的并集。这种改进的绩效对于应对气候变化(以及其他人为干扰系统)带来的新挑战至关重要,特别是对于监测和估计碳储量衰减率、监测森林健康和生物多样性以及枯木对气候变化的总体影响。 摘要:The stability and ability of an ecosystem to withstand climate change is directly linked to its biodiversity. Dead trees are a key indicator of overall forest health, housing one-third of forest ecosystem biodiversity, and constitute 8%of the global carbon stocks. They are decomposed by several natural factors, e.g. climate, insects and fungi. Accurate detection and modeling of dead wood mass is paramount to understanding forest ecology, the carbon cycle and decomposers. We present a novel method to construct precise shape contours of dead trees from aerial photographs by combining established convolutional neural networks with a novel active contour model in an energy minimization framework. Our approach yields superior performance accuracy over state-of-the-art in terms of precision, recall, and intersection over union of detected dead trees. This improved performance is essential to meet emerging challenges caused by climate change (and other man-made perturbations to the systems), particularly to monitor and estimate carbon stock decay rates, monitor forest health and biodiversity, and the overall effects of dead wood on and from climate change.
【7】 Boosting Mobile CNN Inference through Semantic Memory 标题:利用语义记忆促进移动CNN推理 链接:https://arxiv.org/abs/2112.02644
作者:Yun Li,Chen Zhang,Shihao Han,Li Lyna Zhang,Baoqun Yin,Yunxin Liu,Mengwei Xu 机构:University of Science and Technology, of China, Damo Academy, Alibaba Group, Rose-Hulman Institute of Technology, Microsoft Research, Institute for AI Industry Research, (AIR), Tsinghua University, State Key Laboratory of Networking, and Switching Technology, Beijing 备注:13 pages, 13 figures 摘要:众所周知,人脑能够通过对激活神经元进行更快的记忆编码和访问程序,加速对重复呈现的物体的视觉识别。这是我们第一次借用并提炼这种能力到语义内存设计中,即SMTM,以改进设备上的CNN推理。SMTM采用分层存储体系结构来利用感兴趣对象的长尾分布,并进一步结合了几种新技术来实现:(1)将高维特征映射编码为低维语义向量,以实现低成本但准确的缓存和查找;(2) 考虑到不同层的固有特性,它使用一种新的度量来确定出口定时;(3) 它自适应地调整缓存大小和语义向量以适应场景动态。SMTM是商品CNN引擎上的原型,在移动CPU和GPU上运行。在大规模数据集和模型上进行的大量实验表明,与标准方法(高达2倍)和以前的缓存设计(高达1.5倍)相比,SMTM可以显著加快模型推理速度,且精度损失可以接受。 摘要:Human brains are known to be capable of speeding up visual recognition of repeatedly presented objects through faster memory encoding and accessing procedures on activated neurons. For the first time, we borrow and distill such a capability into a semantic memory design, namely SMTM, to improve on-device CNN inference. SMTM employs a hierarchical memory architecture to leverage the long-tail distribution of objects of interest, and further incorporates several novel techniques to put it into effects: (1) it encodes high-dimensional feature maps into low-dimensional, semantic vectors for low-cost yet accurate cache and lookup; (2) it uses a novel metric in determining the exit timing considering different layers' inherent characteristics; (3) it adaptively adjusts the cache size and semantic vectors to fit the scene dynamics. SMTM is prototyped on commodity CNN engine and runs on both mobile CPU and GPU. Extensive experiments on large-scale datasets and models show that SMTM can significantly speed up the model inference over standard approach (up to 2X) and prior cache designs (up to 1.5X), with acceptable accuracy loss.
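下面给出一个极简的语义缓存草图,演示“把高维特征图编码为低维语义向量、用相似度查表并提前退出”的思路(非论文SMTM实现,随机投影降维、FIFO淘汰与阈值均为假设的简化):

import torch
import torch.nn.functional as F

class SemanticCache:
    # 命中(相似度高于阈值)时直接复用缓存结果,从而跳过后续网络计算
    def __init__(self, in_channels=256, dim=64, threshold=0.9, capacity=256):
        self.proj = torch.randn(in_channels, dim) / in_channels ** 0.5  # 固定随机降维
        self.keys = torch.empty(0, dim)
        self.values = []                           # 缓存的预测结果
        self.threshold, self.capacity = threshold, capacity

    def encode(self, feat_map):                    # feat_map: (C, H, W)
        v = feat_map.mean(dim=(1, 2)) @ self.proj  # 全局池化后投影到低维语义向量
        return F.normalize(v, dim=0)

    def lookup(self, feat_map):
        if not self.values:
            return None
        sims = self.keys @ self.encode(feat_map)   # 余弦相似度(向量已归一化)
        score, idx = sims.max(dim=0)
        return self.values[int(idx)] if score >= self.threshold else None

    def insert(self, feat_map, result):
        if len(self.values) >= self.capacity:      # 简化的 FIFO 淘汰策略
            self.keys, self.values = self.keys[1:], self.values[1:]
        self.keys = torch.cat([self.keys, self.encode(feat_map)[None]], 0)
        self.values.append(result)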
【8】 PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation 标题:PolyhonicFormer:深度感知视频全景分割的统一查询学习 链接:https://arxiv.org/abs/2112.02582
作者:Haobo Yuan,Xiangtai Li,Yibo Yang,Guangliang Cheng,Jing Zhang,Yunhai Tong,Lefei Zhang,Dacheng Tao 机构: School of Computer Science, Wuhan University, Key Laboratory of Machine Perception (MOE), Peking University, JD Explore Academy , SenseTime Research , The University of Sydney 摘要:最近提出的深度感知视频全景分割(DVPS)旨在预测视频中的全景分割结果和深度图,这是一个具有挑战性的场景理解问题。在本文中,我们提出了PolyphonicFormer,一种视觉Transformer,用于统一DVPS任务下的所有子任务。我们的方法通过基于查询的学习探索深度估计和全景分割之间的关系。特别地,我们设计了三种不同的查询,包括事物查询、内容查询和深度查询。然后,我们建议通过门控融合来学习这些查询之间的相关性。通过实验,我们从深度估计和全景分割两个方面证明了我们设计的好处。由于每个事物查询也对实例信息进行编码,因此自然可以通过裁剪实例掩码特征并结合外观学习来执行跟踪。我们的方法在ICCV-2021 BMTT挑战赛的视频+深度赛道上排名第一。我们还报告了消融研究,以展示性能是如何提升的。代码将在https://github.com/HarborYuan/PolyphonicFormer 发布。 摘要:The recently proposed Depth-aware Video Panoptic Segmentation (DVPS) aims to predict panoptic segmentation results and depth maps in a video, which is a challenging scene understanding problem. In this paper, we present PolyphonicFormer, a vision transformer to unify all the sub-tasks under the DVPS task. Our method explores the relationship between depth estimation and panoptic segmentation via query-based learning. In particular, we design three different queries including thing query, stuff query, and depth query. Then we propose to learn the correlations among these queries via gated fusion. From the experiments, we prove the benefits of our design from both depth estimation and panoptic segmentation aspects. Since each thing query also encodes the instance-wise information, it is natural to perform tracking via cropping instance mask features with appearance learning. Our method ranks 1st on the ICCV-2021 BMTT Challenge video + depth track. Ablation studies are reported to show how we improve the performance. Code will be available at https://github.com/HarborYuan/PolyphonicFormer.
【9】 End-to-End Segmentation via Patch-wise Polygons Prediction 标题:基于面片多边形预测的端到端分割 链接:https://arxiv.org/abs/2112.02535
作者:Tal Shaharabany,Lior Wolf 机构:Tel-Aviv University 摘要:主流的分割方法将输出图表示为像素网格。我们研究了另一种表示方法,在该方法中,每个图像面片的对象边缘被建模为一个具有$k$顶点的多边形,该多边形与每个面片的标签概率相耦合。顶点通过使用可微分神经渲染器来创建光栅图像进行优化。然后将所描绘的区域与地面真值分割进行比较。我们的方法获得了多个最先进的结果:在Cityscapes验证集上达到76.26% mIoU,在Vaihingen建筑分割基准上达到90.92% IoU,在MoNU显微镜数据集上达到66.82% IoU,在鸟类基准CUB上达到90.91%。我们用于训练和复现这些结果的代码作为补充材料随附。 摘要:The leading segmentation methods represent the output map as a pixel grid. We study an alternative representation in which the object edges are modeled, per image patch, as a polygon with $k$ vertices that is coupled with per-patch label probabilities. The vertices are optimized by employing a differentiable neural renderer to create a raster image. The delineated region is then compared with the ground truth segmentation. Our method obtains multiple state-of-the-art results: 76.26% mIoU on the Cityscapes validation, 90.92% IoU on the Vaihingen building segmentation benchmark, 66.82% IoU for the MoNU microscopy dataset, and 90.91% for the bird benchmark CUB. Our code for training and reproducing these results is attached as supplementary.
【10】 Unsupervised Adaptation of Semantic Segmentation Models without Source Data 标题:无源数据语义分词模型的无监督自适应 链接:https://arxiv.org/abs/2112.02359
作者:Sujoy Paul,Ansh Khurana,Gaurav Aggarwal 机构:Google Research 摘要:我们考虑新的问题的无监督域适应的源模型,而没有访问源数据的语义分割。无监督领域自适应的目的是使在标记的源数据上学习的模型适应新的未标记的目标数据集。现有的方法假设在自适应过程中源数据与目标数据一起可用。然而,在实际场景中,由于隐私、存储等原因,我们可能只能访问源模型和未标记的目标数据,而不能访问标记的源。在这项工作中,我们提出了一种自训练方法来从源模型中提取知识。为了补偿从源到目标的分布偏移,我们首先使用未标记的目标数据只更新网络的规范化参数。然后,我们使用置信度过滤伪标记,并对某些转换强制一致性。尽管我们的框架非常简单直观,但与直接将源模型应用于目标数据相比,我们的框架能够实现显著的性能提升,这反映在我们广泛的实验和烧蚀研究中。事实上,该性能与最近使用源数据进行自适应的最先进方法仅相差几点。我们进一步证明了所提出的方法对于完全测试时间自适应设置的通用性,其中我们不需要任何目标训练数据,并且仅在测试时间内自适应。 摘要:We consider the novel problem of unsupervised domain adaptation of source models, without access to the source data for semantic segmentation. Unsupervised domain adaptation aims to adapt a model learned on the labeled source data, to a new unlabeled target dataset. Existing methods assume that the source data is available along with the target data during adaptation. However, in practical scenarios, we may only have access to the source model and the unlabeled target data, but not the labeled source, due to reasons such as privacy, storage, etc. In this work, we propose a self-training approach to extract the knowledge from the source model. To compensate for the distribution shift from source to target, we first update only the normalization parameters of the network with the unlabeled target data. Then we employ confidence-filtered pseudo labeling and enforce consistencies against certain transformations. Despite being very simple and intuitive, our framework is able to achieve significant performance gains compared to directly applying the source model on the target data as reflected in our extensive experiments and ablation studies. In fact, the performance is just a few points away from the recent state-of-the-art methods which use source data for adaptation. We further demonstrate the generalisability of the proposed approach for fully test-time adaptation setting, where we do not need any target training data and adapt only during test-time.
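摘要描述的两步做法(先只用无标注目标数据刷新归一化统计量,再做置信度过滤的伪标签自训练)大致可以写成如下示意代码(假设模型为逐像素分类网络、loader只产出图像;并非论文官方实现):

import torch
import torch.nn.functional as F

def adapt_without_source(model, target_loader, conf_thresh=0.9, lr=1e-4, epochs=1):
    # 第一步:只用目标数据的前向传播刷新 BN 的 running statistics(简化做法)
    model.train()
    with torch.no_grad():
        for imgs in target_loader:
            model(imgs)

    # 第二步:置信度过滤的伪标签自训练
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for imgs in target_loader:
            model.eval()
            with torch.no_grad():
                probs = F.softmax(model(imgs), dim=1)      # (B, C, H, W)
                conf, pseudo = probs.max(dim=1)
            model.train()
            logits = model(imgs)
            loss = F.cross_entropy(logits, pseudo, reduction="none")
            loss = (loss * (conf > conf_thresh)).mean()     # 只保留高置信像素
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model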
【11】 Separated Contrastive Learning for Organ-at-Risk and Gross-Tumor-Volume Segmentation with Limited Annotation 标题:基于有限标注的高危器官和肿瘤大体分割的分离对比学习 链接:https://arxiv.org/abs/2112.02743
作者:Jiacheng Wang,Xiaomeng Li,Yiming Han,Jing Qin,Liansheng Wang,Qichao Zhou 机构: Department of Computer Science at School of Informatics, Xiamen University, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology 备注:Accepted in AAAI-22 摘要:自动划定危险器官(OAR)和肿瘤总体积(GTV)对于放射治疗计划具有重要意义。然而,在有限的像素(体素)注释下学习功能强大的表示法以进行精确描绘是一项具有挑战性的任务。像素级的对比学习可以通过从未标记数据中学习密集表示来减轻对注释的依赖。这方面的最新研究在特征图上设计了各种对比损失,以产生地图中每个像素的鉴别特征。然而,同一地图中的像素不可避免地共享语义,使其比实际更接近,这可能会影响同一地图中像素的区分,并导致与其他地图中像素的不公平比较。为了解决这些问题,我们提出了一种分离的区域级对比学习方案,即SepaReg,其核心是将每个图像分割成多个区域,并分别对每个区域进行编码。具体而言,SepaReg包括两个组件:结构感知图像分离(SIS)模块和器官内和器官间蒸馏(IID)模块。SIS在结构信息的指导下对图像集进行操作,重建区域集。器官间表征将通过典型的跨区域对比学习。另一方面,IID被提议通过利用器官内表征来解决区域集合中的数量不平衡问题,因为微小的器官可能产生较少的区域。我们在一个公共数据集和两个私有数据集上进行了大量实验来评估所提出的模型。实验结果证明了该模型的有效性,其性能始终优于现有的方法。代码可在https://github.com/jcwang123/Separate_CL. 摘要:Automatic delineation of organ-at-risk (OAR) and gross-tumor-volume (GTV) is of great significance for radiotherapy planning. However, it is a challenging task to learn powerful representations for accurate delineation under limited pixel (voxel)-wise annotations. Contrastive learning at pixel-level can alleviate the dependency on annotations by learning dense representations from unlabeled data. Recent studies in this direction design various contrastive losses on the feature maps, to yield discriminative features for each pixel in the map. However, pixels in the same map inevitably share semantics to be closer than they actually are, which may affect the discrimination of pixels in the same map and lead to the unfair comparison to pixels in other maps. To address these issues, we propose a separated region-level contrastive learning scheme, namely SepaReg, the core of which is to separate each image into regions and encode each region separately. Specifically, SepaReg comprises two components: a structure-aware image separation (SIS) module and an intra- and inter-organ distillation (IID) module. The SIS is proposed to operate on the image set to rebuild a region set under the guidance of structural information. The inter-organ representation will be learned from this set via typical contrastive losses cross regions. On the other hand, the IID is proposed to tackle the quantity imbalance in the region set as tiny organs may produce fewer regions, by exploiting intra-organ representations. We conducted extensive experiments to evaluate the proposed model on a public dataset and two private datasets. The experimental results demonstrate the effectiveness of the proposed model, consistently achieving better performance than state-of-the-art approaches. Code is available at https://github.com/jcwang123/Separate_CL.
【12】 Uncertainty-Guided Mutual Consistency Learning for Semi-Supervised Medical Image Segmentation 标题:不确定性引导的互一致性学习在半监督医学图像分割中的应用 链接:https://arxiv.org/abs/2112.02508
作者:Yichi Zhang,Qingcheng Liao,Rushi Jiao,Jicong Zhang 机构: Beihang University, Jicong Zhang is with School of Biological Science and Medical Engineering, and with Hefei Innovation Research Institute 摘要:医学图像分割是许多临床方法的基础和关键步骤。半监督学习由于减轻了获取专家检查注释的沉重负担,并且利用了更容易获取的未标记数据的优势,已被广泛应用于医学图像分割任务。尽管一致性学习已被证明是一种有效的方法,它可以在不同的分布下实现预测的不变性,但现有的方法不能充分利用未标记数据中的区域级形状约束和边界级距离信息。在本文中,我们提出了一种新的不确定性引导的相互一致性学习框架,通过将最新预测的任务内一致性学习与任务级正则化的跨任务一致性学习相结合来利用几何形状信息,从而有效地利用未标记数据。该框架以估计的模型分段不确定性为指导,选择相对确定的预测进行一致性学习,从而有效地利用未标记数据中更可靠的信息。我们在两个公开的基准数据集上广泛验证了我们提出的方法:左心房分割(LA)数据集和脑肿瘤分割(BraTS)数据集。实验结果表明,我们的方法通过利用未标记的数据实现了性能提升,并且优于现有的半监督分割方法。 摘要:Medical image segmentation is a fundamental and critical step in many clinical approaches. Semi-supervised learning has been widely applied to medical image segmentation tasks since it alleviates the heavy burden of acquiring expert-examined annotations and takes the advantage of unlabeled data which is much easier to acquire. Although consistency learning has been proven to be an effective approach by enforcing an invariance of predictions under different distributions, existing approaches cannot make full use of region-level shape constraint and boundary-level distance information from unlabeled data. In this paper, we propose a novel uncertainty-guided mutual consistency learning framework to effectively exploit unlabeled data by integrating intra-task consistency learning from up-to-date predictions for self-ensembling and cross-task consistency learning from task-level regularization to exploit geometric shape information. The framework is guided by the estimated segmentation uncertainty of models to select out relatively certain predictions for consistency learning, so as to effectively exploit more reliable information from unlabeled data. We extensively validate our proposed method on two publicly available benchmark datasets: Left Atrium Segmentation (LA) dataset and Brain Tumor Segmentation (BraTS) dataset. Experimental results demonstrate that our method achieves performance gains by leveraging unlabeled data and outperforms existing semi-supervised segmentation methods.
【13】 Echocardiography Segmentation with Enforced Temporal Consistency 标题:增强时间一致性的超声心动图分割 链接:https://arxiv.org/abs/2112.02102
作者:Nathan Painchaud,Nicolas Duchateau,Olivier Bernard,Pierre-Marc Jodoin 备注:10 pages, submitted to IEEE TMI 摘要:卷积神经网络(CNN)已经证明了其分割二维心脏超声图像的能力。然而,尽管最近取得了一些成功,根据这些成功,观察者在舒张末期和收缩末期图像上的可变性已经达到,CNN仍然难以利用时间信息在整个周期内提供准确且时间一致的分割图。准确描述心脏功能需要这种一致性,这是诊断许多心血管疾病的必要步骤。在本文中,我们提出了一个框架来学习2D+时间长轴心脏形状,这样分段序列可以受益于时间和解剖一致性约束。我们的方法是一种后处理方法,将任何最先进的方法产生的分段超声心动图序列作为输入,分两步进行处理,以(i)根据心脏序列的整体动力学识别时空不一致性,以及(ii)纠正不一致性。心脏不一致性的识别和纠正依赖于一个受约束的自动编码器,该编码器经过训练以学习生理上可解释的心脏形状嵌入,在这里我们可以检测和修复异常。我们在来自CAMUS数据集的98个完整周期序列上测试了我们的框架,该数据集将与本文一起公开。我们的时间正则化方法不仅提高了整个序列的分割精度,而且增强了时间和解剖的一致性。 摘要:Convolutional neural networks (CNN) have demonstrated their ability to segment 2D cardiac ultrasound images. However, despite recent successes according to which the intra-observer variability on end-diastole and end-systole images has been reached, CNNs still struggle to leverage temporal information to provide accurate and temporally consistent segmentation maps across the whole cycle. Such consistency is required to accurately describe the cardiac function, a necessary step in diagnosing many cardiovascular diseases. In this paper, we propose a framework to learn the 2D+time long-axis cardiac shape such that the segmented sequences can benefit from temporal and anatomical consistency constraints. Our method is a post-processing that takes as input segmented echocardiographic sequences produced by any state-of-the-art method and processes it in two steps to (i) identify spatio-temporal inconsistencies according to the overall dynamics of the cardiac sequence and (ii) correct the inconsistencies. The identification and correction of cardiac inconsistencies relies on a constrained autoencoder trained to learn a physiologically interpretable embedding of cardiac shapes, where we can both detect and fix anomalies. We tested our framework on 98 full-cycle sequences from the CAMUS dataset, which will be rendered public alongside this paper. Our temporal regularization method not only improves the accuracy of the segmentation across the whole sequences, but also enforces temporal and anatomical consistency.
【14】 View-Consistent Metal Segmentation in the Projection Domain for Metal Artifact Reduction in CBCT -- An Investigation of Potential Improvement 标题:基于CBCT的金属伪影消除投影域的视域一致性分割--潜在改进研究 链接:https://arxiv.org/abs/2112.02101
作者:Tristan M. Gottschalk,Andreas Maier,Florian Kordon,Björn W. Kreher 机构: Department of Computer Science, Friedrich-Alexander University Erlangen-Nuremberg, GermanyErlangen Graduate School in Advanced Optical Technologies (SAOT) 备注:Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) 摘要:创伤干预的积极结果取决于植入金属植入物的术中评估。由于出现金属伪影,评估的质量在很大程度上取决于所谓的金属伪影减少方法(MAR)的性能。这些MAR方法中的大多数都需要事先分割插入的金属对象。因此,尽管存在一些主要缺点,但通常在重建的三维体积中应用相当简单的基于阈值的分割方法。通过本出版物,研究了根据下游MAR结果将分割任务转移到基于学习、视图一致的2D投影方法的可能性。为了分割现有的金属,我们研究了一个非常简单的基于学习的2D投影分割网络,该网络使用尸体研究期间获得的真实数据进行训练。为了克服二维投影分割的缺点,提出了一种一致性滤波器。通过比较标准fsMAR和使用新分割模板的改进fsMAR版本的结果,研究了移位分割域的影响。通过对真实尸体数据的定量和定性评估,所研究的方法显示MAR性能提高,对金属伪影高度不敏感。对于重建视野外有金属的病例或金属消失的病例,伪影明显减少。因此,所有切片的平均峰值信噪比指标提高了约3 dB w.r.t.,单个切片的平均峰值信噪比指标提高了约9 dB。显示的结果显示,转移到基于2D的分割方法会对实际数据产生有益的影响,以便与MAR方法(如fsMAR)一起下游使用。 摘要:The positive outcome of a trauma intervention depends on an intraoperative evaluation of inserted metallic implants. Due to occurring metal artifacts, the quality of this evaluation heavily depends on the performance of so-called Metal Artifact Reduction methods (MAR). The majority of these MAR methods require prior segmentation of the inserted metal objects. Therefore, typically a rather simple thresholding-based segmentation method in the reconstructed 3D volume is applied, despite some major disadvantages. With this publication, the potential of shifting the segmentation task to a learning-based, view-consistent 2D projection-based method on the downstream MAR's outcome is investigated. For segmenting the present metal, a rather simple learning-based 2D projection-wise segmentation network that is trained using real data acquired during cadaver studies, is examined. To overcome the disadvantages that come along with a 2D projection-wise segmentation, a Consistency Filter is proposed. The influence of the shifted segmentation domain is investigated by comparing the results of the standard fsMAR with a modified fsMAR version using the new segmentation masks. With a quantitative and qualitative evaluation on real cadaver data, the investigated approach showed an increased MAR performance and a high insensitivity against metal artifacts. For cases with metal outside the reconstruction's FoV or cases with vanishing metal, a significant reduction in artifacts could be shown. Thus, increases of up to roughly 3 dB w.r.t. the mean PSNR metric over all slices and up to 9 dB for single slices were achieved. The shown results reveal a beneficial influence of the shift to a 2D-based segmentation method on real data for downstream use with a MAR method, like the fsMAR.
Zero/Few Shot|迁移|域适配|自适应(11篇)
【1】 Prototypical Model with Novel Information-theoretic Loss Function for Generalized Zero Shot Learning 标题:基于新信息论损失函数的广义Zero-Shot学习原型模型 链接:https://arxiv.org/abs/2112.03134
作者:Chunlin Ji,Hanchu Shen,Zhan Xiong,Feng Chen,Meiying Zhang,Huiwen Yang 机构:Shenzhen Origin AI Technology Co. Ltd, Department of Computer Science and Engineering, Southern University of Science and Technology, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley 摘要:广义零样本学习(GZSL)仍然是深度学习的技术挑战,因为它必须在没有目标类数据的情况下识别源类和目标类。为了在仅使用源类数据进行训练时保持源类和目标类之间的语义关系,我们从信息论的角度对知识转移和语义关系进行了量化。为此,我们遵循原型模型,将关注的变量格式化为概率向量。利用所提出的概率向量表示,可以用简单的闭合形式有效地评估互信息和熵等信息度量。我们讨论了使用原型模型时常用嵌入空间和距离函数的选择。然后,我们提出了确定性GZSL模型的三个信息论损失函数:连接已见类数据和目标类的互信息损失;当使用已见类数据学习目标类的嵌入时防止过拟合的不确定性感知熵约束损失;在将语义表示映射到公共空间时用于保持语义关系的语义保持交叉熵损失。仿真表明,作为一种确定性模型,我们提出的方法在GZSL基准数据集上获得了最先进的结果。我们相对于基线模型深度校准网络(DCN)实现了21%-64%的改进,并首次证明确定性模型可以和生成性模型一样发挥作用。此外,我们提出的模型与生成模型兼容。仿真研究表明,通过与f-CLSWGAN结合,我们获得了与先进生成模型相当的结果。 摘要:Generalized zero shot learning (GZSL) is still a technical challenge of deep learning as it has to recognize both source and target classes without data from target classes. To preserve the semantic relation between source and target classes when only trained with data from source classes, we address the quantification of the knowledge transfer and semantic relation from an information-theoretic viewpoint. To this end, we follow the prototypical model and format the variables of concern as a probability vector. Leveraging on the proposed probability vector representation, the information measurement such as mutual information and entropy, can be effectively evaluated with simple closed forms. We discuss the choice of common embedding space and distance function when using the prototypical model. Then we propose three information-theoretic loss functions for deterministic GZSL model: a mutual information loss to bridge seen data and target classes; an uncertainty-aware entropy constraint loss to prevent overfitting when using seen data to learn the embedding of target classes; a semantic preserving cross entropy loss to preserve the semantic relation when mapping the semantic representations to the common space. Simulation shows that, as a deterministic model, our proposed method obtains state of the art results on GZSL benchmark datasets. We achieve 21%-64% improvements over the baseline model -- deep calibration network (DCN) and for the first time demonstrate a deterministic model can perform as well as generative ones. Moreover, our proposed model is compatible with generative models. Simulation studies show that by incorporating with f-CLSWGAN, we obtain comparable results compared with advanced generative models.
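下面给出一个粗略的示意片段,说明如何从概率向量以闭式形式计算熵,并构造互信息式与熵约束式的损失(公式为对摘要含义的假设性简化,并非论文给出的损失定义):

import torch

def entropy(p, eps=1e-8):
    # 概率向量的香农熵, p: (B, C)
    return -(p * (p + eps).log()).sum(dim=-1)

def info_losses(probs_seen_on_targets):
    # probs_seen_on_targets: 已见样本在目标类上的概率向量, (B, C_target)
    marginal = probs_seen_on_targets.mean(dim=0)             # 边缘分布, (C_target,)
    cond_ent = entropy(probs_seen_on_targets).mean()          # 条件熵 H(Y|X)
    mi_loss = cond_ent - entropy(marginal[None]).squeeze()    # 近似 -I(X;Y):最小化即最大化互信息
    ent_constraint = torch.relu(1.0 - cond_ent)               # 假设的熵下界约束,防止预测过于自信
    return mi_loss, ent_constraint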
【2】 AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural Networks 标题:AdaSTE:一种用于训练二元神经网络的自适应直通估值器 链接:https://arxiv.org/abs/2112.02880
作者:Huu Le,Rasmus Kjær Høier,Che-Tsung Lin,Christopher Zach 机构:Chalmers University of Technology, Gothenburg, Sweden 备注:18 pages 摘要:提出了一种新的二元加权深度神经网络训练算法。特别地,我们首先将二元神经网络(BiNNs)的训练问题作为一个双层优化实例,然后构造该双层规划的灵活松弛。由此产生的训练方法与几种现有的BINN训练方法,特别是成功用于BinaryConnect和后续方法的直通梯度估计器,具有相同的算法简单性。事实上,我们提出的方法可以解释为原始直通估计器的自适应变体,该估计器在误差传播的反向过程中有条件(但并非总是)起到线性映射的作用。实验结果表明,与现有算法相比,新算法具有良好的性能。 摘要:We propose a new algorithm for training deep neural networks (DNNs) with binary weights. In particular, we first cast the problem of training binary neural networks (BiNNs) as a bilevel optimization instance and subsequently construct flexible relaxations of this bilevel program. The resulting training method shares its algorithmic simplicity with several existing approaches to train BiNNs, in particular with the straight-through gradient estimator successfully employed in BinaryConnect and subsequent methods. In fact, our proposed method can be interpreted as an adaptive variant of the original straight-through estimator that conditionally (but not always) acts like a linear mapping in the backward pass of error propagation. Experimental results demonstrate that our new algorithm offers favorable performance compared to existing approaches.
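作为背景,下面给出标准直通估计器(STE)的最小PyTorch实现,论文的AdaSTE可理解为其自适应变体(此处不实现自适应部分;反向传播采用常见的梯度截断近似):

import torch

class BinarizeSTE(torch.autograd.Function):
    # 前向:按符号二值化权重;反向:把梯度近似当作(截断的)恒等映射
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # 常见做法:丢弃落在 [-1, 1] 之外的权重的梯度
        return grad_out * (w.abs() <= 1).float()

# 用法示例:对一组实值权重做二值化并反向传播
w = torch.randn(10, requires_grad=True)
loss = (BinarizeSTE.apply(w) ** 2).sum()
loss.backward()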
【3】 A Dataset-free Self-supervised Disentangled Learning Method for Adaptive Infrared and Visible Images Super-resolution Fusion 标题:一种无数据集的自适应红外与可见光图像超分辨率融合的自监督解缠学习方法 链接:https://arxiv.org/abs/2112.02869
作者:Yuanjie Gu,Zhibo Xiao,Hailun Wang,Cheng Liu,Shouyu Wang 机构: School of Science 备注:10 pages, 9 figures 摘要:本研究提出了一种基于物理模型的通用无数据集自监督学习框架——自监督解纠缠学习(SDL),并提出了一种将SDL框架、生成网络和Retinex理论应用于红外和可见光图像超分辨率融合的深度Retinex融合(DRF)方法。同时,设计了生成型双路径融合网络ZipperNet和自适应融合损失函数Retinex loss,有效地实现了高质量融合。DRF(基于SDL)的核心思想包括两部分:一是利用生成网络生成与物理模型分离的组件;另一种是基于物理关系设计的损耗函数,在训练阶段生成的分量由损耗函数组合而成。此外,为了验证我们提出的DRF的有效性,在三种不同的红外和可见光数据集上与六种最先进的方法进行了定性和定量比较。我们的代码将很快在https://github.com/GuYuanjie/Deep-Retinex-fusion. 摘要:This study proposes a novel general dataset-free self-supervised learning framework based-on physical model named self-supervised disentangled learning (SDL), and proposes a novel method named Deep Retinex fusion (DRF) which applies SDL framework with generative networks and Retinex theory in infrared and visible images super-resolution fusion. Meanwhile, a generative dual-path fusion network ZipperNet and adaptive fusion loss function Retinex loss are designed for effectively high-quality fusion. The core idea of DRF (based-on SDL) consists of two parts: one is generating components which are disentangled from physical model using generative networks; the other is loss functions which are designed based-on physical relation, and generated components are combined by loss functions in training phase. Furthermore, in order to verify the effectiveness of our proposed DRF, qualitative and quantitative comparisons compared with six state-of-the-art methods are performed on three different infrared and visible datasets. Our code will be open source available soon at https://github.com/GuYuanjie/Deep-Retinex-fusion.
【4】 No-Reference Point Cloud Quality Assessment via Domain Adaptation 标题:基于领域适配的无参考点云质量评估 链接:https://arxiv.org/abs/2112.02851
作者:Qi Yang,Yipeng Liu,Siheng Chen,Yiling Xu,Jun Sun 机构: Shanghai Jiaotong University 摘要:针对三维点云,我们提出了一种新的无参考质量评估指标,即图像传输点云质量评估(IT-PCQA)。对于质量评估,深度神经网络(DNN)在无参考指标设计方面表现出令人信服的性能。然而,对于无参考PCQA来说,最具挑战性的问题是我们缺乏大规模的主观数据库来驱动健壮的网络。我们的动机是,人类视觉系统(HVS)是质量评估的决策者,无论媒体类型如何。利用自然图像丰富的主观评分,我们可以通过DNN寻求人类感知的评价标准,并将预测能力转移到3D点云。特别地,我们将自然图像作为源域,点云作为目标域,并通过无监督对抗域自适应来推断点云的质量。为了提取有效的潜在特征并最小化域差异,我们提出了分层特征编码器和条件判别网络。考虑到最终目的是回归客观分数,我们在条件判别网络中引入了一种新的条件交叉熵损失来惩罚阻碍质量回归网络收敛的负样本。实验结果表明,与传统的无参考指标相比,该方法可以获得更高的性能,甚至可以与完全参考指标进行比较。提出的方法还表明,评估特定媒体内容质量的可行性,而无需昂贵且繁琐的主观评估。 摘要:We present a novel no-reference quality assessment metric, the image transferred point cloud quality assessment (IT-PCQA), for 3D point clouds. For quality assessment, deep neural network (DNN) has shown compelling performance on no-reference metric design. However, the most challenging issue for no-reference PCQA is that we lack large-scale subjective databases to drive robust networks. Our motivation is that the human visual system (HVS) is the decision-maker regardless of the type of media for quality assessment. Leveraging the rich subjective scores of the natural images, we can quest the evaluation criteria of human perception via DNN and transfer the capability of prediction to 3D point clouds. In particular, we treat natural images as the source domain and point clouds as the target domain, and infer point cloud quality via unsupervised adversarial domain adaptation. To extract effective latent features and minimize the domain discrepancy, we propose a hierarchical feature encoder and a conditional-discriminative network. Considering that the ultimate purpose is regressing objective score, we introduce a novel conditional cross entropy loss in the conditional-discriminative network to penalize the negative samples which hinder the convergence of the quality regression network. Experimental results show that the proposed method can achieve higher performance than traditional no-reference metrics, even comparable results with full-reference metrics. The proposed method also suggests the feasibility of assessing the quality of specific media content without the expensive and cumbersome subjective evaluations.
【5】 DemoGrasp: Few-Shot Learning for Robotic Grasping with Human Demonstration 标题:DemoGrasp:以人为示范的机器人抓取的极少机会学习 链接:https://arxiv.org/abs/2112.02849
作者:Pengyuan Wang,Fabian Manhardt,Luca Minciullo,Lorenzo Garattoni,Sven Meie,Nassir Navab,Benjamin Busam 机构:Technical University of Munich 备注:Accepted by IROS 2021 摘要:在机器人技术中,成功抓取物体的能力至关重要,因为它支持多个交互式下游应用程序。为此,大多数方法要么计算感兴趣对象的完整6D姿势,要么学习预测一组抓取点。虽然前一种方法不能很好地扩展到多个对象实例或类,但后一种方法需要大量带注释的数据集,并且由于其对新几何体的泛化能力较差而受到阻碍。为了克服这些缺点,我们建议通过简单而简短的人类演示来教机器人如何抓取物体。因此,我们的方法既不需要许多带注释的图像,也不局限于特定的几何体。我们首先展示了一个小序列的RGB-D图像,显示人机交互。然后利用该序列构建表示所描述交互的关联手和对象网格。随后,我们完成重建对象形状的缺失部分,并估计重建与场景中可见对象之间的相对变换。最后,通过对场景中当前物体姿态的估计,将物体与人手之间的相对姿态的先验知识转化为机器人所需的抓取指令。在真实和合成环境中,对丰田的人类支持机器人(HSR)进行了详尽的评估,证明了我们提出的方法的适用性及其与以前方法相比的优势。 摘要:The ability to successfully grasp objects is crucial in robotics, as it enables several interactive downstream applications. To this end, most approaches either compute the full 6D pose for the object of interest or learn to predict a set of grasping points. While the former approaches do not scale well to multiple object instances or classes yet, the latter require large annotated datasets and are hampered by their poor generalization capabilities to new geometries. To overcome these shortcomings, we propose to teach a robot how to grasp an object with a simple and short human demonstration. Hence, our approach neither requires many annotated images nor is it restricted to a specific geometry. We first present a small sequence of RGB-D images displaying a human-object interaction. This sequence is then leveraged to build associated hand and object meshes that represent the depicted interaction. Subsequently, we complete missing parts of the reconstructed object shape and estimate the relative transformation between the reconstruction and the visible object in the scene. Finally, we transfer the a-priori knowledge from the relative pose between object and human hand with the estimate of the current object pose in the scene into necessary grasping instructions for the robot. Exhaustive evaluations with Toyota's Human Support Robot (HSR) in real and synthetic environments demonstrate the applicability of our proposed methodology and its advantage in comparison to previous approaches.
【6】 A Generalized Zero-Shot Quantization of Deep Convolutional Neural Networks via Learned Weights Statistics 标题:基于学习权重统计的深卷积神经网络广义零点量化 链接:https://arxiv.org/abs/2112.02834
作者:Prasen Kumar Sharma,Arun Abraham,Vikram Nelvoy Rajendiran 机构:. 备注:Accepted by IEEE Transactions on Multimedia 摘要:将浮点权重和深度卷积神经网络的激活量化为定点表示可以减少内存占用和推理时间。最近,人们正在努力实现Zero-Shot量化,这种量化不需要给定任务的原始未标记训练样本。这些发表得最好的作品严重依赖于学习的批量归一化(BN)参数来推断量化激活的范围。特别是,这些方法建立在经验估计框架或数据提取方法的基础上,用于计算激活范围。然而,当使用不容纳BN层的网络时,此类方案的性能严重下降。在这种思路下,我们提出了一种既不需要原始数据也不依赖于BN层统计的广义零拍量化(GZSQ)框架。我们使用了数据提取方法,仅利用模型的预训练权重来估计激活范围校准的丰富数据。据我们所知,这是第一个利用预训练权重分布来辅助零拍量化过程的工作。对于各种任务,拟议方案的性能明显优于现有的零炮工作,例如,MobileNet V2和其他几种w&w/o BN层模型的分类精度提高了约33%。我们还展示了所提出的工作在多个开源量化框架中的有效性。重要的是,我们的工作是对未来非规范化深度神经网络的训练后零炮量化的首次尝试。 摘要:Quantizing the floating-point weights and activations of deep convolutional neural networks to fixed-point representation yields reduced memory footprints and inference time. Recently, efforts have been afoot towards zero-shot quantization that does not require original unlabelled training samples of a given task. These best-published works heavily rely on the learned batch normalization (BN) parameters to infer the range of the activations for quantization. In particular, these methods are built upon either empirical estimation framework or the data distillation approach, for computing the range of the activations. However, the performance of such schemes severely degrades when presented with a network that does not accommodate BN layers. In this line of thought, we propose a generalized zero-shot quantization (GZSQ) framework that neither requires original data nor relies on BN layer statistics. We have utilized the data distillation approach and leveraged only the pre-trained weights of the model to estimate enriched data for range calibration of the activations. To the best of our knowledge, this is the first work that utilizes the distribution of the pretrained weights to assist the process of zero-shot quantization. The proposed scheme has significantly outperformed the existing zero-shot works, e.g., an improvement of ~ 33% in classification accuracy for MobileNetV2 and several other models that are w & w/o BN layers, for a variety of tasks. We have also demonstrated the efficacy of the proposed work across multiple open-source quantization frameworks. Importantly, our work is the first attempt towards the post-training zero-shot quantization of futuristic unnormalized deep neural networks.
【7】 ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation 标题:ActiveZero:带零标注的有源立体视觉混合域学习 链接:https://arxiv.org/abs/2112.02772
作者:Isabella Liu,Edward Yang,Jianyu Tao,Rui Chen,Xiaoshuai Zhang,Qing Ran,Zhu Liu,Hao Su 机构:University of California, San Diego, Tsinghua University, Alibaba DAMO Academy 摘要:传统的深度传感器生成准确的真实世界深度估计,甚至超过仅在模拟领域训练的最先进的学习方法。由于地面真实深度在模拟域中很容易获得,但在真实域中很难获得,因此我们提出了一种利用两种方法的优点的方法。在本文中,我们提出了一个新的框架,ActiveZero,这是一个混合领域学习解决方案的主动立体视觉系统,不需要真实世界的深度注释。首先,我们使用混合域学习策略证明了我们的方法对分布外的真实数据的可转移性。在模拟领域,我们在形状基元数据集上使用监督视差损失和自我监督损失的组合。相比之下,在真实域中,我们仅对训练模拟数据或测试真实数据中不分布的数据集使用自监督损失。其次,我们的方法引入了一种称为时间红外重投影的新的自监督损失,以提高我们在难以感知区域的重投影的鲁棒性和准确性。最后,我们展示了如何对该方法进行端到端的训练,以及每个模块对于实现最终结果的重要性。对真实数据进行广泛的定性和定量评估表明,最先进的结果甚至可以击败商业深度传感器。 摘要:Traditional depth sensors generate accurate real world depth estimates that surpass even the most advanced learning approaches trained only on simulation domains. Since ground truth depth is readily available in the simulation domain but quite difficult to obtain in the real domain, we propose a method that leverages the best of both worlds. In this paper we present a new framework, ActiveZero, which is a mixed domain learning solution for active stereovision systems that requires no real world depth annotation. First, we demonstrate the transferability of our method to out-of-distribution real data by using a mixed domain learning strategy. In the simulation domain, we use a combination of supervised disparity loss and self-supervised losses on a shape primitives dataset. By contrast, in the real domain, we only use self-supervised losses on a dataset that is out-of-distribution from either training simulation data or test real data. Second, our method introduces a novel self-supervised loss called temporal IR reprojection to increase the robustness and accuracy of our reprojections in hard-to-perceive regions. Finally, we show how the method can be trained end-to-end and that each module is important for attaining the end result. Extensive qualitative and quantitative evaluations on real data demonstrate state of the art results that can even beat a commercial depth sensor.
【8】 One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning 标题:基于单说话人视听相关学习的一次说话人脸生成 链接:https://arxiv.org/abs/2112.02749
作者:Suzhen Wang,Lincheng Li,Yu Ding,Xin Yu 机构: Netease Fuxi AI Lab, University of Technology Sydney 备注:None 摘要:音频驱动的单镜头对话人脸生成方法通常是在各种人的视频资源上进行训练的。然而,他们制作的视频经常会出现不自然的口型和不同步的嘴唇,因为这些方法很难从不同的说话人那里学习到一致的讲话风格。我们观察到,从一个特定的说话人那里学习一致的讲话风格会容易得多,这会导致真实的口腔运动。因此,我们提出了一种新的单镜头对话人脸生成框架,通过探索特定说话人的音频和视频运动之间的一致相关性,然后将音频驱动的运动场传输到参考图像。具体来说,我们开发了一种视听相关变换器(AVCT),旨在从输入音频中推断出由基于关键点的密集运动场表示的说话运动。特别是,考虑到音频可能来自部署中的不同身份,我们合并了音素来表示音频信号。通过这种方式,我们的AVCT可以固有地推广到其他身份所说的音频。此外,由于人脸关键点用于表示说话人,AVCT对训练说话人的外表是不可知的,因此允许我们随时操纵不同身份的人脸图像。考虑到不同的人脸形状会导致不同的运动,开发了运动场传输模块,以减少训练身份和一次性参考之间的音频驱动密集运动场间隙。一旦我们获得了参考图像的稠密运动场,我们就使用图像渲染器从音频剪辑生成它的对话人脸视频。多亏了我们所学的一贯的说话风格,我们的方法产生了真实的口型和生动的动作。大量实验表明,我们合成的视频在视觉质量和唇形同步方面优于最新技术。 摘要:Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their created videos often suffer unnatural mouth shapes and asynchronous lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions represented by keypoint based dense motion fields from an input audio. In particular, considering audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic against appearances of the training speaker, and thus allows us to manipulate face images of different identities readily. Considering different face shapes lead to different motions, a motion field transfer module is exploited to reduce the audio-driven dense motion field gap between the training identity and the one-shot reference. Once we obtained the dense motion field of the reference image, we employ an image renderer to generate its talking face videos from an audio clip. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.
【9】 Adaptive Channel Encoding for Point Cloud Analysis 标题:用于点云分析的自适应信道编码 链接:https://arxiv.org/abs/2112.02509
作者:Guoquan Xu,Hezhi Cao,Yifan Zhang,Jianwei Wan,Ke Xu,Yanxin Ma 机构:National University of Defense Technology, Changsha, CHINA, University of Science and Technology of China, Hefei, CHINA 摘要:注意机制在点云分析中扮演着越来越重要的角色,通道注意是研究的热点之一。由于信道信息量大,神经网络很难筛选出有用的信道信息。因此,本文提出了一种自适应信道编码机制来捕获信道关系。它通过显式编码其特征的通道之间的相互依赖性来提高网络生成的表示的质量。具体地说,提出了一种基于信道的卷积(channelconv),自适应地学习坐标和特征之间的关系,从而对信道进行编码。与目前流行的注意权重方案不同,本文提出的信道Conv实现了卷积运算的适应性,而不是简单地为信道分配不同的权重。在现有基准上的大量实验验证了我们的方法达到了最先进的水平。 摘要:Attention mechanism plays a more and more important role in point cloud analysis and channel attention is one of the hotspots. With so much channel information, it is difficult for neural networks to screen useful channel information. Thus, an adaptive channel encoding mechanism is proposed to capture channel relationships in this paper. It improves the quality of the representation generated by the network by explicitly encoding the interdependence between the channels of its features. Specifically, a channel-wise convolution (Channel-Conv) is proposed to adaptively learn the relationship between coordinates and features, so as to encode the channel. Different from the popular attention weight schemes, the Channel-Conv proposed in this paper realizes adaptability in convolution operation, rather than simply assigning different weights for channels. Extensive experiments on existing benchmarks verify our method achieves the state of the arts.
【10】 SITA: Single Image Test-time Adaptation 标题:SITA:单幅图像测试时间适配 链接:https://arxiv.org/abs/2112.02355
作者:Ansh Khurana,Sujoy Paul,Piyush Rai,Soma Biswas,Gaurav Aggarwal 机构:Google Research,IIT Kanpur,IISc Bangalore 摘要:在测试时自适应(TTA)中,给定一个在某些源数据上训练的模型,目标是对其进行自适应,以便对来自不同分布的测试实例做出更好的预测。至关重要的是,TTA假设无法访问源数据,甚至无法访问来自目标分布的任何附加标记/未标记样本来微调源模型。在这项工作中,我们在一个更贴近实际的设置下考虑TTA,我们称之为SITA(单图像测试时自适应)。在这里,模型在做出每个预测时,只能访问给定的单个(single)测试实例,而不是文献中通常考虑的一批(batch)实例。其动机是现实场景中需要按需进行推理:这种推理无法为了把传入请求"攒成批次"而延迟,或者推理发生在没有批处理余地的边缘设备(如手机)上。SITA的整个自适应过程应该非常快,因为它发生在推理时。为了解决这个问题,我们针对SITA设置提出了一种名为AugBN的新方法,它只需要前向传播。该方法可以使任何现成的已训练模型适应分类和分割任务中的单个测试实例。AugBN仅使用一次带有标签保持变换的前向传递,从给定测试图像估计未知测试分布的归一化统计量。由于AugBN不涉及任何反向传播,因此与其他最新方法相比,它的速度要快得多。据我们所知,这是第一个只使用一张测试图像来解决这一困难自适应问题的工作。尽管非常简单,但与直接将源模型应用于目标实例相比,我们的框架能够实现显著的性能提升,这反映在我们广泛的实验和消融研究中。 摘要:In Test-time Adaptation (TTA), given a model trained on some source data, the goal is to adapt it to make better predictions for test instances from a different distribution. Crucially, TTA assumes no access to the source data or even any additional labeled/unlabeled samples from the target distribution to finetune the source model. In this work, we consider TTA in a more pragmatic setting which we refer to as SITA (Single Image Test-time Adaptation). Here, when making each prediction, the model has access only to the given single test instance, rather than a batch of instances, as has typically been considered in the literature. This is motivated by the realistic scenarios where inference is needed in an on-demand fashion that may not be delayed to "batch-ify" incoming requests or the inference is happening on an edge device (like mobile phone) where there is no scope for batching. The entire adaptation process in SITA should be extremely fast as it happens at inference time. To address this, we propose a novel approach AugBN for the SITA setting that requires only forward propagation. The approach can adapt any off-the-shelf trained model to individual test instances for both classification and segmentation tasks. AugBN estimates normalisation statistics of the unseen test distribution from the given test image using only one forward pass with label-preserving transformations. Since AugBN does not involve any back-propagation, it is significantly faster compared to other recent methods. To the best of our knowledge, this is the first work that addresses this hard adaptation problem using only a single test image. Despite being very simple, our framework is able to achieve significant performance gains compared to directly applying the source model on the target instances, as reflected in our extensive experiments and ablation studies.
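"用单张测试图像的标签保持增广来重估归一化统计量"这一思路可以用下面的示意性PyTorch片段来理解(并非论文的AugBN实现;增广种类与数量均为假设,且假设输入为单张图像张量):

import torch
import torch.nn as nn
import torchvision.transforms as T

def predict_single_image(model, image, n_aug=8):
    # Build a small batch from label-preserving augmentations of the single test image.
    aug = T.Compose([T.RandomHorizontalFlip(),
                     T.RandomResizedCrop(image.shape[-2:], scale=(0.8, 1.0))])
    batch = torch.stack([image] + [aug(image) for _ in range(n_aug)])

    # Let BatchNorm layers use statistics of this augmented batch instead of the
    # source-domain running statistics, without updating the stored buffers.
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.train()
            m.track_running_stats = False

    with torch.no_grad():                 # forward passes only, no back-propagation
        logits = model(batch)[0:1]        # prediction for the un-augmented image
    return logits

由于整个过程不含反向传播,其额外开销只是若干次前向计算,这与摘要中强调的"推理时必须极快"的约束是一致的。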
【11】 Unsupervised Domain Generalization by Learning a Bridge Across Domains 标题:基于学习跨域桥梁的无监督领域泛化 链接:https://arxiv.org/abs/2112.02300
作者:Sivan Harary,Eli Schwartz,Assaf Arbelle,Peter Staar,Shady Abu-Hussein,Elad Amrani,Roei Herzig,Amit Alfassy,Raja Giryes,Hilde Kuehne,Dina Katabi,Kate Saenko,Rogerio Feris,Leonid Karlinsky 机构:IBM Research, Tel-Aviv University, Technion, Boston University, Goethe University, MIT-IBM Watson AI Lab 摘要:人类视觉系统的一项基本能力是,能够在显著不同的视觉领域,如真实照片、剪贴画、绘画和草图之间,概括学习到的表示。在本文中,不同于大多数跨领域的工作,利用一些(或完全)源域监督,我们探讨了一个相对新的和非常实用的无监督域泛化(UDG)设置,即在源域和目标域中都没有训练监督。我们的方法基于跨域桥(BrAD)的自监督学习,BrAD是一个辅助桥域,伴随着从每个训练域到BrAD的一组语义保持视觉(图像到图像)映射。BrAD及其映射通过对比自监督表示模型(端到端)联合学习,该模型将每个域与其BrAD投影语义对齐,从而隐式驱动所有域(可见或不可见)语义对齐。在这项工作中,我们展示了如何使用边缘正则化方法,我们的方法在多个基准和一系列任务(包括UDG、少量UDA和跨多域数据集的无监督泛化(包括泛化到看不见的域和类)中获得显著的收益。 摘要:The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generalization (UDG) setup of having no training supervision in neither source nor target domains. Our approach is based on self-supervised learning of a Bridge Across Domains (BrAD) - an auxiliary bridge domain accompanied by a set of semantics preserving visual (image-to-image) mappings to BrAD from each of the training domains. The BrAD and mappings to it are learned jointly (end-to-end) with a contrastive self-supervised representation model that semantically aligns each of the domains to its BrAD-projection, and hence implicitly drives all the domains (seen or unseen) to semantically align to each other. In this work, we show how using an edge-regularized BrAD our approach achieves significant gains across multiple benchmarks and a range of tasks, including UDG, Few-shot UDA, and unsupervised generalization across multi-domain datasets (including generalization to unseen domains and classes).
半弱无监督|主动学习|不确定性(5篇)
【1】 3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose from Monocular Video 标题:单目视频深度和姿态无监督学习的3D分层细化和增强 链接:https://arxiv.org/abs/2112.03045
作者:Guangming Wang,Jiquan Zhong,Shijie Zhao,Wenhua Wu,Zhe Liu,Hesheng Wang 机构:Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Department of Engineering Mechanics 备注:10 pages, 7 figures, under review 摘要:深度和自运动估计对于自主机器人的定位和导航以及自动驾驶至关重要。最近的研究使得从未标记的单目视频中学习每像素深度和自运动成为可能。本文提出了一种新的无监督训练框架,利用显式三维几何进行三维分层细化与增强。在该框架中,深度和位姿估计分层地相互耦合,逐层细化估计的位姿。我们提出了中间视图图像的概念:利用估计的深度和粗略位姿对图像中的像素进行warp来合成中间视图图像。然后,从该新视图图像与相邻帧图像中估计残余位姿变换,以细化粗略位姿。本文以可微的方式实现了这一迭代细化,使整个框架得到统一优化。同时,提出了一种新的用于位姿估计的图像增强方法:通过合成新视图图像,创造性地在三维空间中对位姿进行增强,同时得到新的增强二维图像。在KITTI上的实验表明,我们的深度估计达到了最先进的性能,甚至超过了利用其他辅助任务的最新方法。我们的视觉里程计优于所有最近的基于无监督单目学习的方法,并取得了与带后端优化的基于几何的方法ORB-SLAM2相当的性能。 摘要:Depth and ego-motion estimations are essential for the localization and navigation of autonomous robots and autonomous driving. Recent studies make it possible to learn the per-pixel depth and ego-motion from the unlabeled monocular video. A novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry. In this framework, the depth and pose estimations are hierarchically and mutually coupled to refine the estimated pose layer by layer. The intermediate view image is proposed and synthesized by warping the pixels in an image with the estimated depth and coarse pose. Then, the residual pose transformation can be estimated from the new view image and the image of the adjacent frame to refine the coarse pose. The iterative refinement is implemented in a differentiable manner in this paper, making the whole framework optimized uniformly. Meanwhile, a new image augmentation method is proposed for the pose estimation by synthesizing a new view image, which creatively augments the pose in 3D space but gets a new augmented 2D image. The experiments on KITTI demonstrate that our depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. Our visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves competitive performance to the geometry-based method, ORB-SLAM2 with back-end optimization.
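"用估计的深度和位姿把像素warp到新视角"这一步可以用如下标准的逆向warp示意来理解(PyTorch,非原文代码;目标视角深度、内参K与相对位姿T_tgt_to_src均为假设的输入记号):

import torch
import torch.nn.functional as F

def synthesize_view(src_img, tgt_depth, K, T_tgt_to_src):
    # Inverse-warp `src_img` (B,3,H,W) into the target view, given target-view depth
    # (B,1,H,W), intrinsics K (B,3,3) and the relative pose T_tgt_to_src (B,4,4).
    b, _, h, w = src_img.shape
    dev = src_img.device
    ys, xs = torch.meshgrid(torch.arange(h, device=dev),
                            torch.arange(w, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1).expand(b, -1, -1)

    cam = torch.inverse(K) @ pix * tgt_depth.view(b, 1, -1)          # back-project target pixels
    cam = torch.cat([cam, torch.ones(b, 1, h * w, device=dev)], 1)   # homogeneous 3D points
    proj = K @ (T_tgt_to_src @ cam)[:, :3]                           # project into source view
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                  # perspective divide

    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], -1).view(b, h, w, 2)
    return F.grid_sample(src_img, grid, align_corners=True)          # synthesized target view

论文在此基础上再从合成视图与相邻帧之间估计残余位姿来逐层细化,上述代码只展示最基础的几何warp部分。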
【2】 A Tale of Color Variants: Representation and Self-Supervised Learning in Fashion E-Commerce 标题:颜色变体的故事:服装电子商务中的表征和自我监督学习 链接:https://arxiv.org/abs/2112.02910
作者:Ujjal Kr Dutta,Sandeep Repakula,Maulik Parmar,Abhinav Ravi 机构:Data Sciences-Image Sciences, Myntra 备注:In Annual Conference on Innovative Applications of Artificial Intelligence (IAAI)/ AAAI Conference on Artificial Intelligence (AAAI) 2022. arXiv admin note: substantial text overlap with arXiv:2104.08581 摘要:在本文中,我们讨论了时尚电子商务中的一个关键问题(与客户体验以及收入有关):颜色变体识别,即识别在设计(或风格)上完全匹配但仅在颜色上不同的时尚产品。我们提出了一个通用框架,该框架的核心是利用深度视觉表征学习,为我们的时尚电子商务平台解决这个问题。我们的框架可以通过手动获取的三元组形式的监控信号进行训练。然而,在捕获所有困难的情况下,为时尚电子商务平台(如我们的平台)中通常存在的整个庞大数据集合获取手动注释是不可行的。但是,为了拯救我们,有趣的是,我们观察到,时尚电子商务中的这一关键问题也可以通过简单的基于颜色抖动的图像增强来解决,这一点最近在对比自监督学习(SSL)文献中广为流行,该文献旨在学习视觉表示,而不使用手动标签。这自然会在我们的脑海中引出一个问题:我们是否可以在用例中利用SSL,并且仍然可以获得与受监管框架相当的性能?答案是,是的!因为,颜色变化的时尚对象只不过是一种风格的表现形式,不同的颜色,一个经过训练对颜色保持不变的模型(有监督或没有监督)应该能够识别这一点!这是本文进一步从定性和定量两方面论证的内容,同时评估了两种最先进的SSL技术,并提出了一种新方法。 摘要:In this paper, we address a crucial problem in fashion e-commerce (with respect to customer experience, as well as revenue): color variants identification, i.e., identifying fashion products that match exactly in their design (or style), but only to differ in their color. We propose a generic framework, that leverages deep visual Representation Learning at its heart, to address this problem for our fashion e-commerce platform. Our framework could be trained with supervisory signals in the form of triplets, that are obtained manually. However, it is infeasible to obtain manual annotations for the entire huge collection of data usually present in fashion e-commerce platforms, such as ours, while capturing all the difficult corner cases. But, to our rescue, interestingly we observed that this crucial problem in fashion e-commerce could also be solved by simple color jitter based image augmentation, that recently became widely popular in the contrastive Self-Supervised Learning (SSL) literature, that seeks to learn visual representations without using manual labels. This naturally led to a question in our mind: Could we leverage SSL in our use-case, and still obtain comparable performance to our supervised framework? The answer is, Yes! because, color variant fashion objects are nothing but manifestations of a style, in different colors, and a model trained to be invariant to the color (with, or without supervision), should be able to recognize this! This is what the paper further demonstrates, both qualitatively, and quantitatively, while evaluating a couple of state-of-the-art SSL techniques, and also proposing a novel method.
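"基于颜色抖动增广的对比自监督"可以用一个简化的PyTorch示意来说明(并非原文实现;增广参数与温度系数均为假设):让编码器把同一商品图的两个不同颜色视图拉近,从而学到对颜色不变的表示。

import torch
import torch.nn.functional as F
import torchvision.transforms as T

# Two "views" of the same product image that differ mainly in colour, mimicking colour variants.
color_jitter_view = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.8, hue=0.4),
    T.RandomGrayscale(p=0.2),
])

def contrastive_loss(encoder, images, temperature=0.1):
    # NT-Xent-style loss on colour-jittered views, pushing the encoder towards colour invariance.
    v1 = torch.stack([color_jitter_view(x) for x in images])
    v2 = torch.stack([color_jitter_view(x) for x in images])
    z = F.normalize(torch.cat([encoder(v1), encoder(v2)]), dim=1)       # (2B, D)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                                    # exclude self-pairs
    b = len(images)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)                                 # positive = other view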
【3】 Clue Me In: Semi-Supervised FGVC with Out-of-Distribution Data 标题:Clue Me In:利用分布外数据的半监督FGVC 链接:https://arxiv.org/abs/2112.02825
作者:Ruoyi Du,Dongliang Chang,Zhanyu Ma,Yi-Zhe Song,Jun Guo 摘要:尽管在细粒度视觉分类(FGVC)方面取得了巨大的进步,但当前的方法仍然严重依赖于需要大量专家标签的完全监督模式。半监督学习(SSL)技术,从未标记的数据中获取知识,为解决粗粒度问题提供了一种重要的方法。然而,现有的SSL范式大多假设分布(即,类别对齐)中存在未标记的数据,这妨碍了它们在FGVC上重新提出时的有效性。在本文中,我们提出了一种新的设计,专门针对半监督FGVC的分布数据工作,即“提示他们”。我们提出了一个重要的假设,即所有细粒度类别自然遵循一个层次结构(例如,涵盖所有鸟类物种的“鸟类”系统发育树)。因此,我们可以在这个树结构中预测样本关系,而不是对单个样本进行操作,以此作为SSL的优化目标。除此之外,我们还进一步介绍了由这些树结构带来的两种独特策略,以实现样本间一致性正则化和可靠的伪关系。我们的实验结果表明:(i)所提出的方法对分布外的数据具有良好的鲁棒性;(ii)它可以配备现有技术,提高其性能,从而产生最先进的结果。代码可在https://github.com/PRIS-CV/RelMatch. 摘要:Despite great strides made on fine-grained visual classification (FGVC), current methods are still heavily reliant on fully-supervised paradigms where ample expert labels are called for. Semi-supervised learning (SSL) techniques, acquiring knowledge from unlabeled data, provide a considerable means forward and have shown great promise for coarse-grained problems. However, exiting SSL paradigms mostly assume in-distribution (i.e., category-aligned) unlabeled data, which hinders their effectiveness when re-proposed on FGVC. In this paper, we put forward a novel design specifically aimed at making out-of-distribution data work for semi-supervised FGVC, i.e., to "clue them in". We work off an important assumption that all fine-grained categories naturally follow a hierarchical structure (e.g., the phylogenetic tree of "Aves" that covers all bird species). It follows that, instead of operating on individual samples, we can instead predict sample relations within this tree structure as the optimization goal of SSL. Beyond this, we further introduced two strategies uniquely brought by these tree structures to achieve inter-sample consistency regularization and reliable pseudo-relation. Our experimental results reveal that (i) the proposed method yields good robustness against out-of-distribution data, and (ii) it can be equipped with prior arts, boosting their performance thus yielding state-of-the-art results. Code is available at https://github.com/PRIS-CV/RelMatch.
【4】 Gated2Gated: Self-Supervised Depth Estimation from Gated Images 标题:Gated2Gated:基于门控图像的自监督深度估计 链接:https://arxiv.org/abs/2112.02416
作者:Amanpreet Walia,Stefanie Walz,Mario Bijelic,Fahim Mannan,Frank Julca-Aguilar,Michael Langer,Werner Ritter,Felix Heide 机构:Algolux, Mercedes-Benz AG, McGill University, Princeton University 备注:11 pages, 6 Figures 摘要:门控相机有望成为扫描式激光雷达传感器的替代方案,提供对雾、雪和雨中的后向散射具有鲁棒性的高分辨率3D深度。与脉冲激光雷达传感器顺序扫描场景并通过光子飞行时间直接记录深度不同,门控成像仪将深度编码在少量以百万像素分辨率捕获的门控切片的相对强度中。尽管现有方法已经表明可以从这些测量中解码出高分辨率深度,但这些方法需要同步且标定好的激光雷达来监督门控深度解码器——这阻碍了其在不同地理区域的快速推广、在大规模非配对数据集上的训练,以及对汽车用例之外其他应用的探索。在这项工作中,我们填补了这一空白,提出了一种完全自监督的深度估计方法,该方法使用门控强度剖面和时间一致性作为训练信号。所提出的模型由门控视频序列端到端训练,不需要激光雷达或RGB数据,并且学习估计绝对深度值。我们将门控切片作为输入,解耦地估计场景反照率、深度和环境光,然后利用这些估计通过循环损失来学习重建输入切片。我们依赖给定帧与相邻门控切片之间的时间一致性来估计存在阴影和反射的区域中的深度。实验验证了该方法优于现有的基于单目RGB和立体图像的监督与自监督深度估计方法,以及基于门控图像的监督深度估计方法。 摘要:Gated cameras hold promise as an alternative to scanning LiDAR sensors with high-resolution 3D depth that is robust to back-scatter in fog, snow, and rain. Instead of sequentially scanning a scene and directly recording depth via the photon time-of-flight, as in pulsed LiDAR sensors, gated imagers encode depth in the relative intensity of a handful of gated slices, captured at megapixel resolution. Although existing methods have shown that it is possible to decode high-resolution depth from such measurements, these methods require synchronized and calibrated LiDAR to supervise the gated depth decoder -- prohibiting fast adoption across geographies, training on large unpaired datasets, and exploring alternative applications outside of automotive use cases. In this work, we fill this gap and propose an entirely self-supervised depth estimation method that uses gated intensity profiles and temporal consistency as a training signal. The proposed model is trained end-to-end from gated video sequences, does not require LiDAR or RGB data, and learns to estimate absolute depth values. We take gated slices as input and disentangle the estimation of the scene albedo, depth, and ambient light, which are then used to learn to reconstruct the input slices through a cyclic loss. We rely on temporal consistency between a given frame and neighboring gated slices to estimate depth in regions with shadows and reflections. We experimentally validate that the proposed approach outperforms existing supervised and self-supervised depth estimation methods based on monocular RGB and stereo images, as well as supervised methods based on gated images.
【5】 Toward Practical Self-Supervised Monocular Indoor Depth Estimation 标题:走向实用化的自监督单目室内深度估计 链接:https://arxiv.org/abs/2112.02306
作者:Cho-Ying Wu,Jialiang Wang,Michael Hall,Ulrich Neumann,Shuochen Su 机构:Meta Reality Labs, University of Southern California 摘要:大多数自监督单目深度估计方法都集中于驾驶场景。我们表明,这类方法对未见过的复杂室内场景的泛化效果很差,在这些场景中,物体杂乱无章且在近场中任意排列。为了获得更强的鲁棒性,我们提出了一种结构蒸馏方法,从预训练深度估计器中蒸馏经验;该估计器由于在野外混合数据集上训练,能够产生结构化但度量不可知的深度。通过将蒸馏与从左右一致性中学习度量的自监督分支相结合,我们获得了一般室内场景的结构化且有度量的深度,并能实时推理。为了便于学习和评估,我们收集了SimSIN(一个包含数千个环境的模拟数据集)和UniSIN(一个包含约500段普通室内环境真实扫描序列的数据集)。我们在sim-to-real和real-to-real设置中进行了实验,在定性和定量方面以及在使用我们深度图的下游应用中均显示出改进。这项工作提供了覆盖方法、数据和应用的完整研究。我们相信这项工作为通过自监督进行实用的室内深度估计奠定了坚实的基础。 摘要:The majority of self-supervised monocular depth estimation methods focus on driving scenarios. We show that such methods generalize poorly to unseen complex indoor scenes, where objects are cluttered and arbitrarily arranged in the near field. To obtain more robustness, we propose a structure distillation approach to learn knacks from a pretrained depth estimator that produces structured but metric-agnostic depth due to its in-the-wild mixed-dataset training. By combining distillation with the self-supervised branch that learns metrics from left-right consistency, we attain structured and metric depth for generic indoor scenes and make inferences in real-time. To facilitate learning and evaluation, we collect SimSIN, a dataset from simulation with thousands of environments, and UniSIN, a dataset that contains about 500 real scan sequences of generic indoor environments. We experiment in both sim-to-real and real-to-real settings, and show improvements both qualitatively and quantitatively, as well as in downstream applications using our depth maps. This work provides a full study, covering methods, data, and applications. We believe the work lays a solid basis for practical indoor depth estimation via self-supervision.
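"从度量不可知的教师深度中蒸馏结构"通常可以借助逐图的尺度-平移对齐来实现;下面是一个常见写法的PyTorch示意(并非原文的损失形式,仅说明这一类做法),度量信息则由另一支路的左右一致性光度损失提供:

import torch

def scale_shift_invariant_l1(pred, teacher):
    # Align the prediction to the metric-agnostic teacher depth with a per-image
    # least-squares scale and shift, then penalise the remaining L1 difference.
    p, t = pred.flatten(1), teacher.flatten(1)
    p_c = p - p.mean(1, keepdim=True)
    t_c = t - t.mean(1, keepdim=True)
    s = (p_c * t_c).sum(1) / (p_c * p_c).sum(1).clamp(min=1e-8)   # per-image scale
    b = t.mean(1) - s * p.mean(1)                                  # per-image shift
    aligned = s[:, None] * p + b[:, None]
    return (aligned - t).abs().mean()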
时序|行为识别|姿态|视频|运动估计(4篇)
【1】 PP-MSVSR: Multi-Stage Video Super-Resolution 标题:PP-MSVSR:多级视频超分辨率 链接:https://arxiv.org/abs/2112.02828
作者:Lielin Jiang,Na Wang,Qingqing Dang,Rui Liu,Baohua Lai 机构:Baidu Inc. 备注:8 pages, 6 figures, 3 tables 摘要:与单图像超分辨率(SISR)任务不同,视频超分辨率(VSR)任务的关键是充分利用帧间的互补信息重构高分辨率序列。由于来自不同帧的图像具有不同的运动和场景,因此准确对齐多帧并有效融合不同帧一直是VSR任务的关键研究工作。为了利用相邻帧丰富的互补信息,本文提出了一种多阶段VSR深度结构,称为PP-MSVSR,采用局部融合模块、辅助损失和重对齐模块逐步细化增强结果。具体来说,为了加强特征传播中的跨帧特征融合,在第一阶段设计了局部融合模块,在特征传播前进行局部特征融合。此外,我们在第二阶段引入了一个辅助损失,使传播模块获得的特征保留更多与HR空间相关的信息,并在第三阶段引入了一个重对齐模块,以充分利用前一阶段的特征信息。大量实验证明,PP-MSVSR在Vid4数据集上取得了良好的性能,仅用1.45M的参数即可达到28.13dB的峰值信噪比。参数量较大的PP-MSVSR-L则在REDS4数据集上超过了所有最先进的方法。代码和模型将在PaddleGAN(https://github.com/PaddlePaddle/PaddleGAN)中发布。 摘要:Different from the Single Image Super-Resolution(SISR) task, the key for Video Super-Resolution(VSR) task is to make full use of complementary information across frames to reconstruct the high-resolution sequence. Since images from different frames with diverse motion and scene, accurately aligning multiple frames and effectively fusing different frames has always been the key research work of VSR tasks. To utilize rich complementary information of neighboring frames, in this paper, we propose a multi-stage VSR deep architecture, dubbed as PP-MSVSR, with local fusion module, auxiliary loss and re-align module to refine the enhanced result progressively. Specifically, in order to strengthen the fusion of features across frames in feature propagation, a local fusion module is designed in stage-1 to perform local feature fusion before feature propagation. Moreover, we introduce an auxiliary loss in stage-2 to make the features obtained by the propagation module reserve more correlated information connected to the HR space, and introduce a re-align module in stage-3 to make full use of the feature information of the previous stage. Extensive experiments substantiate that PP-MSVSR achieves a promising performance of Vid4 datasets, which achieves a PSNR of 28.13dB with only 1.45M parameters. And the PP-MSVSR-L exceeds all state of the art method on REDS4 datasets with considerable parameters. Code and models will be released in PaddleGAN (https://github.com/PaddlePaddle/PaddleGAN).
【2】 Make It Move: Controllable Image-to-Video Generation with Text Descriptions 标题:让它移动:使用文本描述实现可控的图像到视频生成 链接:https://arxiv.org/abs/2112.02815
作者:Yaosi Hu,Chong Luo,Zhenzhong Chen 机构:Wuhan University, Microsoft Research Asia 摘要:在计算机视觉中,生成符合用户意图的可控视频是一个颇具吸引力但极具挑战性的课题。为了实现符合用户意图的可操纵控制,提出了一种新的视频生成任务,称为文本-图像到视频生成(TI2V)。TI2V具有可控的外观和运动,旨在从静态图像和文本描述生成视频。TI2V任务的关键挑战在于对齐来自不同模态的外观和运动,以及处理文本描述中的不确定性。为了应对这些挑战,我们提出了一种基于运动锚的视频生成器(MAGE),它具有一种创新的运动锚(MA)结构来存储外观与运动对齐的表示。为了对不确定性进行建模并增加多样性,它还允许引入显式条件和隐式随机性。通过三维轴向Transformer,MA与给定图像交互,以递归方式生成后续帧,并具有令人满意的可控性和多样性。伴随这一新任务,我们基于MNIST和CATER构建了两个新的视频-文本配对数据集用于评估。在这些数据集上进行的实验验证了MAGE的有效性,并展示了TI2V任务的吸引力与潜力。模型和数据集的源代码将很快提供。 摘要:Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video generation (TI2V), is proposed. With both controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of TI2V task lie both in aligning appearance and motion from different modalities, and in handling uncertainty in text descriptions. To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor (MA) structure to store appearance-motion aligned representation. To model the uncertainty and increase the diversity, it further allows the injection of explicit condition and implicit randomness. Through three-dimensional axial transformers, MA is interacted with given image to generate next frames recursively with satisfying controllability and diversity. Accompanying the new task, we build two new video-text paired datasets based on MNIST and CATER for evaluation. Experiments conducted on these datasets verify the effectiveness of MAGE and show appealing potentials of TI2V task. Source code for model and datasets will be available soon.
【3】 An Annotated Video Dataset for Computing Video Memorability 标题:一种用于计算视频记忆性的带注释的视频数据集 链接:https://arxiv.org/abs/2112.02303
作者:Rukiye Savran Kiziltepe,Lorin Sweeney,Mihai Gabriel Constantin,Faiyaz Doctor,Alba Garcia Seco de Herrera,Claire-Helene Demarty,Graham Healy,Bogdan Ionescu,Alan F. Smeaton 机构:University of Essex, UK, Insight Centre for Data Analytics, Dublin City University, Glasnevin, Dublin, Ireland, University Politehnica of Bucharest, Romania, InterDigital, R&I, France 备注:None 摘要:基于一组公开可用的短视频剪辑链接(每个视频平均时长6秒),1275名用户对每个视频进行了多次手动标注,以表明视频的长期和短期可记忆性。这些标注是作为在线记忆游戏的一部分收集的,测量了参与者在观看一组视频时回忆起先前看过某视频的能力。短期可记忆性的识别任务在此前几分钟内看到的视频上执行,而长期可记忆性的识别任务则在此前24到72小时内看到的视频上执行。数据包括每个视频每次识别的反应时间。每个视频都附有文本描述(字幕),以及应用于从每个视频中提取的3帧(开始、中间和结束)的图像级特征集合。还提供了视频级特征。该数据集在2020年作为MediaEval基准的一部分用于视频可记忆性任务。 摘要:Using a collection of publicly available links to short form video clips of an average of 6 seconds duration each, 1,275 users manually annotated each video multiple times to indicate both long-term and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant's ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. Data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
【4】 Snapshot HDR Video Construction Using Coded Mask 标题:利用编码掩码构建快照HDR视频 链接:https://arxiv.org/abs/2112.02522
作者:Masheal Alghamdi,Qiang Fu,Ali Thabet,Wolfgang Heidrich 机构:Visual Computing Center, KAUST, Thuwal, SA 备注:13 pages, 7 figures 摘要:本文研究由快照编码的LDR视频重建高动态范围(HDR)视频。构建HDR视频需要恢复每一帧的HDR值并保持连续帧之间的一致性。从单次图像采集获取HDR图像(也称为快照HDR成像)可以通过多种方式实现。例如,可重构快照HDR相机可以通过在相机光学堆栈中引入光学元件来实现,即在传感器前方很小的间隔处放置编码掩模。利用深度学习方法可以从捕获的编码图像中恢复高质量的HDR图像。本研究利用3D-CNN对编码LDR视频进行联合去马赛克、去噪和HDR视频重建。我们通过引入同时考虑短期和长期一致性的时间损失函数,使重建的HDR视频在时间上更加一致。所获得的结果令人鼓舞,有望让传统相机实现低成本的HDR视频捕获。 摘要:This paper studies the reconstruction of High Dynamic Range (HDR) video from snapshot-coded LDR video. Constructing an HDR video requires restoring the HDR values for each frame and maintaining the consistency between successive frames. HDR image acquisition from single image capture, also known as snapshot HDR imaging, can be achieved in several ways. For example, the reconfigurable snapshot HDR camera is realized by introducing an optical element into the optical stack of the camera; by placing a coded mask at a small standoff distance in front of the sensor. High-quality HDR image can be recovered from the captured coded image using deep learning methods. This study utilizes 3D-CNNs to perform a joint demosaicking, denoising, and HDR video reconstruction from coded LDR video. We enforce more temporally consistent HDR video reconstruction by introducing a temporal loss function that considers the short-term and long-term consistency. The obtained results are promising and could lead to affordable HDR video capture using conventional cameras.
医学相关(3篇)
【1】 Joint Learning of Localized Representations from Medical Images and Reports 标题:医学图像和报告本地化表征的联合学习 链接:https://arxiv.org/abs/2112.02889
作者:Philip Müller,Georgios Kaissis,Congyu Zou,Daniel Rückert 机构: Institute for Artificial Intelligence and Informatics in Medicine, Department of Informatics, Institute of Diagnostic and Interventional Radiology, Technical University of Munich, Department of Computing, Imperial College London 备注:14 pages, 3 figures, 2 tables 摘要:对比学习已被证明对未标记数据的预训练图像模型是有效的,对于医学图像分类等任务具有良好的效果。在训练前使用成对的文本和图像(如放射报告和图像)进一步改善了结果。尽管如此,大多数现有的方法将图像分类作为下游任务,对于语义分割或对象检测等局部任务可能不是最优的。因此,据我们所知,我们提出了基于视觉和文本的局部表征学习(LoVT),这是第一种针对局部医学成像任务的文本监督预训练方法。该方法将实例级的图像-报表对比学习与图像区域和报表语句表示的局部对比学习相结合。我们在一个新的评估框架上评估LoVT和常用的预训练方法,该框架由来自五个公共数据集的18个胸部X光局部任务组成。虽然没有单一的最佳方法,但LoVT在18项研究任务中的11项表现最好,因此它是本地化任务的首选方法。 摘要:Contrastive learning has proven effective for pre-training image models on unlabeled data with promising results for tasks such as medical image classification. Using paired text and images (such as radiological reports and images) during pre-training improved the results even further. Still, most existing methods target image classification as downstream tasks and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to our best knowledge, the first text-supervised pre-training method that targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on a novel evaluation framework consisting of 18 localized tasks on chest X-rays from five public datasets. While there is no single best method, LoVT performs best on 11 out of the 18 studied tasks making it the preferred method of choice for localized tasks.
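其中实例级的图像-报告对比项通常可以写成对称的InfoNCE;下面是一个简化的PyTorch示意(并非原文实现,温度系数为假设值),局部的区域-句子对比项在此省略:

import torch
import torch.nn.functional as F

def image_report_nce(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE between paired image and report embeddings of shape (B, D):
    # each image should match its own report more than any other report in the batch.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))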
【2】 Real-time Virtual Intraoperative CT for Image Guided Surgery 标题:实时虚拟CT在图像引导手术中的应用 链接:https://arxiv.org/abs/2112.02608
作者:Yangming Li,Neeraja Konuthula,Ian M. Humphreys,Kris Moe,Blake Hannaford,Randall Bly 机构:Rochester Institute of Technology, RoCALab, Rochester, USA, University of Washington, Department of Otolaryngology–Head and Neck Surgery, Seattle, USA, University of Washington, BioRobotics Lab, Seattle, USA, Seattle Children’s Hospital, Seattle, USA 摘要:目的:本文提出一种在鼻内窥镜手术(ESS)中生成虚拟术中CT扫描的方案,以提高手术的完整性。方法:该工作提出了基于尖端运动、基于尖端轨迹和基于器械的三种方法,并结合非参数平滑和高斯过程回归,用于虚拟术中CT生成。结果:在尸体上进行的ESS手术中研究并比较了所提出的方法。手术结果表明,三种方法均将Dice相似系数提高到>86%,F评分>92%,精确度>89.91%。其中基于尖端轨迹的方法性能最佳,在手术完整性评估中达到96.87%的精确度。结论:这项工作表明,虚拟术中CT扫描提高了实际手术场景与参考模型之间的一致性,并提高了ESS手术的完整性。与实际的术中CT扫描相比,该方案不影响现有的手术流程,不需要大多数ESS中已有硬件之外的额外硬件,克服了实际术中CT造成的高成本、重复辐射和麻醉时间延长等问题,在ESS中是实用的。 摘要:Purpose: This paper presents a scheme for generating virtual intraoperative CT scans in order to improve surgical completeness in Endoscopic Sinus Surgeries (ESS). Approach: The work presents three methods, the tip motion-based, the tip trajectory-based, and the instrument based, along with non-parametric smoothing and Gaussian Process Regression, for virtual intraoperative CT generation. Results: The proposed methods were studied and compared on ESS performed on cadavers. Surgical results show all three methods improve the Dice Similarity Coefficients to > 86%, with F-score > 92% and precision > 89.91%. The tip trajectory-based method was found to have the best performance and reached 96.87% precision in surgical completeness evaluation. Conclusions: This work demonstrated that virtual intraoperative CT scans improve the consistency between the actual surgical scene and the reference model, and improve surgical completeness in ESS. Compared with actual intraoperative CT scans, the proposed scheme has no impact on existing surgical protocols, does not require extra hardware other than what is already available in most ESS, overcomes the high costs, the repeated radiation, and the elongated anesthesia caused by actual intraoperative CTs, and is practical in ESS.
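文中报告的Dice相似系数、精确度与F评分等"手术完整性"指标,在二值分割体数据上的常见计算方式大致如下(numpy示意,非原文代码):

import numpy as np

def completeness_metrics(pred, target):
    # Dice, precision, recall and F-score between two binary masks/volumes (numpy arrays).
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    dice = 2 * tp / (2 * tp + fp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f_score = 2 * precision * recall / (precision + recall + 1e-8)
    return dice, precision, recall, f_score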
【3】 Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides 标题:基于原发肿瘤活检切片的深度学习预测早期乳腺癌腋窝淋巴结转移 链接:https://arxiv.org/abs/2112.02222
作者:Feng Xu,Chuang Zhu,Wenqi Tang,Ying Wang,Yu Zhang,Jie Li,Hongchuan Jiang,Zhongyue Shi,Jun Liu,Mulan Jin 机构:Department of Breast Surgery, Beijing Chao-Yang Hospital, Beijing, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, Department of Pathology, Beijing Chao-Yang Hospital, Beijing 备注:None 摘要:目的:开发并验证一种基于深度学习(DL)的原发肿瘤活检特征(signature),用于术前预测临床ALN阴性的早期乳腺癌(EBC)患者的腋窝淋巴结(ALN)转移。方法:2010年5月至2020年8月,共纳入1058例ALN状态经病理证实的EBC患者。在基于注意力的多实例学习(AMIL)框架上建立了DL粗针穿刺活检(DL-CNB)模型,利用DL特征预测ALN状态;这些DL特征提取自两位病理学家标注的乳腺CNB标本数字化全切片图像(WSI)中的癌区。分析准确度、敏感性、特异性、受试者工作特征(ROC)曲线和ROC曲线下面积(AUC)以评估我们的模型。结果:在独立测试队列中,以VGG16_BN为特征提取器的最佳DL-CNB模型预测ALN阳性转移的AUC为0.816(95%置信区间:0.758,0.865)。此外,结合临床数据的模型(称为DL-CNB+C)取得了0.831(95%置信区间:0.775,0.878)的最佳准确度,对50岁以下患者尤为明显(AUC:0.918,95%置信区间:0.825,0.971)。对DL-CNB模型的解释表明,最能预测ALN转移的特征是细胞核特征,包括密度(p = 0.015)、周长(p = 0.009)、圆度(p = 0.010)和方向(p = 0.012)。结论:我们的研究基于原发肿瘤CNB切片提供了一种新的基于DL的生物标志物,用于术前预测EBC患者的ALN转移状态。 摘要:Objectives: To develop and validate a deep learning (DL)-based primary tumor biopsy signature for predicting axillary lymph node (ALN) metastasis preoperatively in early breast cancer (EBC) patients with clinically negative ALN. Methods: A total of 1,058 EBC patients with pathologically confirmed ALN status were enrolled from May 2010 to August 2020. A DL core-needle biopsy (DL-CNB) model was built on the attention-based multiple instance-learning (AMIL) framework to predict ALN status utilizing the DL features, which were extracted from the cancer areas of digitized whole-slide images (WSIs) of breast CNB specimens annotated by two pathologists. Accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curves, and areas under the ROC curve (AUCs) were analyzed to evaluate our model. Results: The best-performing DL-CNB model with VGG16_BN as the feature extractor achieved an AUC of 0.816 (95% confidence interval (CI): 0.758, 0.865) in predicting positive ALN metastasis in the independent test cohort. Furthermore, our model incorporating the clinical data, which was called DL-CNB+C, yielded the best accuracy of 0.831 (95%CI: 0.775, 0.878), especially for patients younger than 50 years (AUC: 0.918, 95%CI: 0.825, 0.971). The interpretation of the DL-CNB model showed that the top signatures most predictive of ALN metastasis were characterized by the nucleus features including density (p = 0.015), circumference (p = 0.009), circularity (p = 0.010), and orientation (p = 0.012). Conclusion: Our study provides a novel DL-based biomarker on primary tumor CNB slides to predict the metastatic status of ALN preoperatively for patients with EBC.
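AMIL类框架的核心是注意力池化:把一张全切片图像中若干patch的特征加权聚合为切片级表示再分类。下面给出这一常见做法的PyTorch示意(并非原文实现;特征维度、隐藏层宽度均为假设):

import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    # Aggregate a bag of patch features (N, D) into one slide-level logit via attention pooling.
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, patch_features):                               # (N, D) features of CNB patches
        weights = torch.softmax(self.attn(patch_features), dim=0)    # (N, 1), sums to 1 over patches
        bag = (weights * patch_features).sum(dim=0)                  # (D,) weighted bag embedding
        return self.classifier(bag), weights.squeeze(-1)             # ALN logit + patch attention

返回的patch注意力权重也可用于解释模型关注了切片中的哪些区域。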
GAN|对抗|攻击|生成相关(8篇)
【1】 CSG0: Continual Urban Scene Generation with Zero Forgetting 标题:CSG0:零遗忘的连续城市场景生成 链接:https://arxiv.org/abs/2112.03252
作者:Himalaya Jain,Tuan-Hung Vu,Patrick Pérez,Matthieu Cord 机构:Patrick P´erez, Valeo.ai, Paris, France, Sorbonne University, Paris, France 摘要:随着生成性对抗网络(GAN)的快速发展,合成场景的视觉质量不断提高,包括应用于自动驾驶的复杂城市场景。在这项工作中,我们解决了一个连续场景生成设置,在该设置中,在一系列不同的域上训练GANs;理想情况下,学习的模型最终应该能够在所有可见域中生成新场景。此设置反映了在不同时间在不同地点连续采集数据的真实场景。在这样一个连续的设置中,我们的目标是零遗忘学习,也就是说,在早期的领域中,由于灾难性遗忘,合成质量不会下降。为此,我们引入了一个新的框架,该框架不仅(i)在持续训练中实现无缝知识转移,而且(ii)以较小的开销保证零遗忘。由于不断学习,我们的模型具有更高的内存效率,与为每个域训练一个完整模型的蛮力解决方案相比,我们的模型获得了更好的合成质量。特别是,在极低数据的情况下,我们的方法大大优于蛮力方法。 摘要:With the rapid advances in generative adversarial networks (GANs), the visual quality of synthesised scenes keeps improving, including for complex urban scenes with applications to automated driving. We address in this work a continual scene generation setup in which GANs are trained on a stream of distinct domains; ideally, the learned models should eventually be able to generate new scenes in all seen domains. This setup reflects the real-life scenario where data are continuously acquired in different places at different times. In such a continual setup, we aim for learning with zero forgetting, i.e., with no degradation in synthesis quality over earlier domains due to catastrophic forgetting. To this end, we introduce a novel framework that not only (i) enables seamless knowledge transfer in continual training but also (ii) guarantees zero forgetting with a small overhead cost. While being more memory efficient, thanks to continual learning, our model obtains better synthesis quality as compared against the brute-force solution that trains one full model for each domain. Especially, under extreme low-data regimes, our approach significantly outperforms the brute-force one by a large margin.
【2】 RADA: Robust Adversarial Data Augmentation for Camera Localization in Challenging Weather 标题:RADA:挑战天气下摄像机定位的鲁棒对抗性数据增强 链接:https://arxiv.org/abs/2112.02469
作者:Jialu Wang,Muhamad Risqi U. Saputra,Chris Xiaoxuan Lu,Niki Trigon,Andrew Markham 摘要:摄像机定位是许多机器人应用中的一个基本和关键问题。近年来,利用深度学习进行基于摄像机的定位已经成为一个热门的研究方向。然而,它们对训练和测试数据集之间的季节性或光照变化可能导致的大域移动缺乏鲁棒性。数据扩充是解决这一问题的一种有吸引力的方法,因为它不需要提供额外的数据。然而,现有的增强方法盲目地扰动所有像素,因此无法获得令人满意的性能。为了克服这个问题,我们提出了RADA系统,其目的是集中于扰动图像的几何信息部分。结果,它学会了产生最小的图像扰动,这些扰动仍然能够使网络感到困惑。我们表明,当这些例子被用作增广,它大大提高了鲁棒性。我们表明,当在“看不见的”挑战性天气条件下进行测试时,我们的方法优于以前的增强技术,并实现了比SOTA定位模型(例如AtLoc和MapNet)高两倍的精度。 摘要:Camera localization is a fundamental and crucial problem for many robotic applications. In recent years, using deep-learning for camera-based localization has become a popular research direction. However, they lack robustness to large domain shifts, which can be caused by seasonal or illumination changes between training and testing data sets. Data augmentation is an attractive approach to tackle this problem, as it does not require additional data to be provided. However, existing augmentation methods blindly perturb all pixels and therefore cannot achieve satisfactory performance. To overcome this issue, we proposed RADA, a system whose aim is to concentrate on perturbing the geometrically informative parts of the image. As a result, it learns to generate minimal image perturbations that are still capable of perplexing the network. We show that when these examples are utilized as augmentation, it greatly improves robustness. We show that our method outperforms previous augmentation techniques and achieves up to two times higher accuracy than the SOTA localization models (e.g., AtLoc and MapNet) when tested on `unseen' challenging weather conditions.
【3】 Implicit Data Augmentation Using Feature Interpolation for Diversified Low-Shot Image Generation 标题:基于特征插值的隐式数据增强用于多样化的少样本图像生成 链接:https://arxiv.org/abs/2112.02450
作者:Mengyu Dai,Haibin Hang,Xiaoyang Guo 机构:Microsoft, University of Delaware, Meta 摘要:生成模型(尤其是生成对抗网络)的训练在低数据环境下很容易发散。为了缓解这一问题,我们提出了一种新的隐式数据增强方法,有助于稳定训练并合成多样化的样本。具体而言,我们将判别器视为真实数据流形的度量嵌入,它提供了真实数据点之间的合适距离。然后,我们利用特征空间中的信息来开发一种数据驱动的增强方法。我们进一步提出了一个简单的度量来评估合成样本的多样性。在少样本生成任务上的实验表明,与现有方法相比,我们的方法改善了FID和结果的多样性,并且能够使用少于100个训练样本生成高质量且多样的图像。 摘要:Training of generative models especially Generative Adversarial Networks can easily diverge in low-data setting. To mitigate this issue, we propose a novel implicit data augmentation approach which facilitates stable training and synthesize diverse samples. Specifically, we view the discriminator as a metric embedding of the real data manifold, which offers proper distances between real data points. We then utilize information in the feature space to develop a data-driven augmentation method. We further bring up a simple metric to evaluate the diversity of synthesized samples. Experiments on few-shot generation tasks show our method improves FID and diversity of results compared to current methods, and allows generating high-quality and diverse images with less than 100 training samples.
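"把判别器当作度量嵌入、在其特征空间中做插值来隐式扩充数据"这一思路,可以用如下极简示意来理解(PyTorch,非原文实现;混合系数范围为假设,具体的增强方式以原文为准):

import torch

def interpolate_features(feats, alpha_range=(0.2, 0.8)):
    # Create implicit augmentations by convexly mixing discriminator features of real samples.
    # `feats` is a (B, D) tensor of real-sample features; random pairs are mixed with a random
    # alpha, yielding new points lying on chords of the learned embedding of the data manifold.
    perm = torch.randperm(feats.size(0))
    alpha = torch.empty(feats.size(0), 1).uniform_(*alpha_range)
    return alpha * feats + (1 - alpha) * feats[perm]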
【4】 LTT-GAN: Looking Through Turbulence by Inverting GANs 标题:LTT-GAN:通过反演GAN透过湍流成像 链接:https://arxiv.org/abs/2112.02379
作者:Kangfu Mei,Vishal M. Patel 机构:Department of Electrical and Computer Engineering, Johns Hopkins University 备注:Project Page: this https URL 摘要:在远程成像的许多应用中,我们面临着这样一种情况,即在捕获的图像中出现的人通常会因大气湍流而退化。然而,恢复这样的退化图像用于人脸验证是困难的,因为退化会导致图像几何失真和模糊。为了缓解湍流效应,在本文中,我们提出了第一种湍流缓解方法,该方法利用经过良好训练的GAN封装的视觉先验。基于视觉先验,我们建议学习在空间周期性上下文距离上保持恢复图像的身份。这样的距离可以在网络学习中考虑身份差异的同时保持从GAN恢复的图像的真实性。此外,本文还提出了分层伪连接,在不改变身份的情况下引入更多的外观方差,从而促进了身份保持学习。大量实验表明,我们的方法在恢复结果的视觉质量和人脸验证精度方面都显著优于现有技术。 摘要:In many applications of long-range imaging, we are faced with a scenario where a person appearing in the captured imagery is often degraded by atmospheric turbulence. However, restoring such degraded images for face verification is difficult since the degradation causes images to be geometrically distorted and blurry. To mitigate the turbulence effect, in this paper, we propose the first turbulence mitigation method that makes use of visual priors encapsulated by a well-trained GAN. Based on the visual priors, we propose to learn to preserve the identity of restored images on a spatial periodic contextual distance. Such a distance can keep the realism of restored images from the GAN while considering the identity difference at the network learning. In addition, hierarchical pseudo connections are proposed for facilitating the identity-preserving learning by introducing more appearance variance without identity changing. Extensive experiments show that our method significantly outperforms prior art in both the visual quality and face verification accuracy of restored results.
【5】 Construct Informative Triplet with Two-stage Hard-sample Generation 标题:用两阶段硬样本生成构造信息三元组 链接:https://arxiv.org/abs/2112.02259
作者:Chuang Zhu,Zheng Hu,Huihui Dong,Gang He,Zekuan Yu,Shangshang Zhang 机构:School of Information and Communication Engineering, Beijing University of Posts and, Telecommunications, Beijing, China, Center for Shanghai Intelligent Imaging for Critical Brain Diseases Engineering and, Technology Research, Fudan University, Shanghai, China 摘要:在本文中,我们提出了一种稳健的样本生成方案来构造信息三元组。提出的硬样本生成是一个两阶段合成框架,分别通过两个阶段中的有效正样本生成器和负样本生成器生成硬样本。第一阶段通过分段线性操作拉伸锚正对,并通过巧妙地设计条件生成对抗网络来提高生成样本的质量,以降低模式崩溃的风险。第二阶段利用自适应反向度量约束生成最终硬样本。在多个基准数据集上的大量实验验证了我们的方法比现有的硬样本生成算法具有更好的性能。此外,我们还发现,我们提出的硬样本生成方法结合现有的三元组挖掘策略,可以进一步提高深度度量学习的性能。 摘要:In this paper, we propose a robust sample generation scheme to construct informative triplets. The proposed hard sample generation is a two-stage synthesis framework that produces hard samples through effective positive and negative sample generators in two stages, respectively. The first stage stretches the anchor-positive pairs with piecewise linear manipulation and enhances the quality of generated samples by skillfully designing a conditional generative adversarial network to lower the risk of mode collapse. The second stage utilizes an adaptive reverse metric constraint to generate the final hard samples. Extensive experiments on several benchmark datasets verify that our method achieves superior performance than the existing hard-sample generation algorithms. Besides, we also find that our proposed hard sample generation method combining the existing triplet mining strategies can further boost the deep metric learning performance.
【6】 SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing 标题:SemanticStyleGAN:用于可控图像合成和编辑的组合式生成先验学习 链接:https://arxiv.org/abs/2112.02236
作者:Yichun Shi,Xiao Yang,Yangyue Wan,Xiaohui Shen 机构:ByteDance Inc., USA 备注:project page at this https URL 摘要:最近的研究表明,StyleGANs为图像合成和编辑的下游任务提供了有前景的先验模型。然而,由于StyleGANs的潜在代码设计用于控制全局样式,因此很难实现对合成图像的细粒度控制。我们提出了SemanticStyleGAN,其中生成器被训练为分别对局部语义部分建模,并以组合的方式合成图像。不同局部的结构和纹理由相应的潜在代码控制。实验结果表明,我们的模型在不同的空间区域之间提供了很强的解耦。当与为StyleGANs设计的编辑方法相结合时,它可以实现更细粒度的控制来编辑合成图像或真实图像。该模型还可以通过迁移学习扩展到其他领域。因此,作为一种具有内置解纠缠功能的通用先验模型,它可以促进基于GAN的应用程序的开发,并实现更多潜在的下游任务。 摘要:Recent studies have shown that StyleGANs provide promising prior models for downstream tasks on image synthesis and editing. However, since the latent codes of StyleGANs are designed to control global styles, it is hard to achieve a fine-grained control over synthesized images. We present SemanticStyleGAN, where a generator is trained to model local semantic parts separately and synthesizes images in a compositional way. The structure and texture of different local parts are controlled by corresponding latent codes. Experimental results demonstrate that our model provides a strong disentanglement between different spatial areas. When combined with editing methods designed for StyleGANs, it can achieve a more fine-grained control to edit synthesized or real images. The model can also be extended to other domains via transfer learning. Thus, as a generic prior model with built-in disentanglement, it could facilitate the development of GAN-based applications and enable more potential downstream tasks.
【7】 Hyper-GAN: Transferring Unconditional to Conditional GANs with HyperNetworks 标题:Hyper-GAN:通过超级网络将无条件GAN转换为条件GAN 链接:https://arxiv.org/abs/2112.02219
作者:Héctor Laria,Yaxing Wang,Joost van de Weijer,Bogdan Raducanu 机构:Computer Vision Center, Barcelona, Spain 备注:14 pages, 12 figures 摘要:近年来,条件GANs已经成熟,能够生成高质量的真实图像。然而,高质量的GANs训练所需的计算资源和训练数据是巨大的,因此对这些模型的迁移学习的研究是一个迫切的课题。在本文中,我们探讨了从高质量的预先训练的无条件GAN到有条件GAN的转换。为此,我们提出了基于超网络的自适应权重调制。此外,我们引入了一个自初始化过程,该过程不需要任何实际数据来初始化超网络参数。为了进一步提高知识转移的样本效率,我们建议使用自监督(对比)损失来改进GAN鉴别器。在大量的实验中,我们验证了超网络、自初始化和对比损失在多个标准基准上的知识转移效率。 摘要:Conditional GANs have matured in recent years and are able to generate high-quality realistic images. However, the computational resources and the training data required for the training of high-quality GANs are enormous, and the study of transfer learning of these models is therefore an urgent topic. In this paper, we explore the transfer from high-quality pre-trained unconditional GANs to conditional GANs. To this end, we propose hypernetwork-based adaptive weight modulation. In addition, we introduce a self-initialization procedure that does not require any real data to initialize the hypernetwork parameters. To further improve the sample efficiency of the knowledge transfer, we propose to use a self-supervised (contrastive) loss to improve the GAN discriminator. In extensive experiments, we validate the efficiency of the hypernetworks, self-initialization and contrastive loss for knowledge transfer on several standard benchmarks.
【8】 Generative Modeling of Turbulence 标题:湍流的产生式模拟 链接:https://arxiv.org/abs/2112.02548
作者:Claudia Drygala,Benjamin Winhart,Francesca di Mare,Hanno Gottschalk 机构:University of Wuppertal, School of Mathematics and Natural Sciences, IMACM & IZMD, Ruhr University Bochum, Department of Mechanical Engineering, Chair of Thermal, Turbomachines and Aero Engines 摘要:我们提出了一种具有良好数学基础的、基于生成对抗网络(GAN)的湍流合成建模方法。基于对混沌确定性系统遍历性的分析,我们给出了一个数学证明,表明GAN确实可以学习从混沌系统的不变测度中采样状态快照。基于这一分析,我们从洛伦兹吸引子开始研究一系列混沌系统,然后用GAN对湍流进行建模。作为训练数据,我们使用从大涡模拟(LES)获得的速度波动场。详细研究了两种结构:我们使用深度卷积GAN(DCGAN)来合成圆柱体周围的湍流;此外,我们还使用pix2pixHD结构模拟了低压涡轮定子周围的流动,该结构用作条件DCGAN,以定子前方旋转尾迹的位置为条件。文中介绍了对抗训练的设置和使用特定GAN架构的效果。我们由此表明,GAN能够在中等规模训练数据的基础上有效地模拟具有技术挑战性的流动问题中的湍流。与经典数值方法(尤其是LES)相比,GAN的训练和推理时间明显更短,同时仍能提供高分辨率的湍流。 摘要:We present a mathematically well founded approach for the synthetic modeling of turbulent flows using generative adversarial networks (GAN). Based on the analysis of chaotic, deterministic systems in terms of ergodicity, we outline a mathematical proof that GAN can actually learn to sample state snapshots from the invariant measure of the chaotic system. Based on this analysis, we study a hierarchy of chaotic systems starting with the Lorenz attractor and then carry on to the modeling of turbulent flows with GAN. As training data, we use fields of velocity fluctuations obtained from large eddy simulations (LES). Two architectures are investigated in detail: we use a deep, convolutional GAN (DCGAN) to synthesise the turbulent flow around a cylinder. We furthermore simulate the flow around a low pressure turbine stator using the pix2pixHD architecture for a conditional DCGAN being conditioned on the position of a rotating wake in front of the stator. The settings of adversarial training and the effects of using specific GAN architectures are explained. We thereby show that GAN are efficient in simulating turbulence in technically challenging flow problems on the basis of a moderate amount of training data. GAN training and inference times significantly fall short when compared with classical numerical methods, in particular LES, while still providing turbulent flows in high resolution.
OCR|文本相关(3篇)
【1】 Text2Mesh: Text-Driven Neural Stylization for Meshes 标题:Text2Mesh:文本驱动的网格神经样式化 链接:https://arxiv.org/abs/2112.03221
作者:Oscar Michel,Roi Bar-On,Richard Liu,Sagie Benaim,Rana Hanocka 机构:University of Chicago, Tel Aviv University 备注:project page: this https URL 摘要:在这项工作中,我们开发了用于编辑三维对象风格的直观控制方式。我们的框架Text2Mesh通过预测符合目标文本提示的颜色和局部几何细节来对3D网格进行风格化。我们将3D对象的解耦表示构造为:一个固定的网格输入(内容),加上一个学习得到的神经网络,我们称之为神经风格场网络。为了修改风格,我们利用CLIP的表示能力,获得文本提示(描述风格)与风格化网格之间的相似度分数。Text2Mesh既不需要预先训练的生成模型,也不需要专门的三维网格数据集。它可以处理具有任意亏格的低质量网格(非流形、有边界等),并且不需要UV参数化。我们展示了该技术在各种3D网格上合成多种多样风格的能力。 摘要:In this work, we develop intuitive controls for editing the style of 3D objects. Our framework, Text2Mesh, stylizes a 3D mesh by predicting color and local geometric details which conform to a target text prompt. We consider a disentangled representation of a 3D object using a fixed mesh input (content) coupled with a learned neural network, which we term neural style field network. In order to modify style, we obtain a similarity score between a text prompt (describing style) and a stylized mesh by harnessing the representational power of CLIP. Text2Mesh requires neither a pre-trained generative model nor a specialized 3D mesh dataset. It can handle low-quality meshes (non-manifold, boundaries, etc.) with arbitrary genus, and does not require UV parameterization. We demonstrate the ability of our technique to synthesize a myriad of styles over a wide variety of 3D meshes.
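"用CLIP度量文本提示与风格化网格渲染图之间的相似度"这一步,可以借助公开的CLIP包粗略示意如下(并非原文实现;渲染器与提示词均为假设,实际训练中应对可微渲染图取该相似度的负值作为损失并反向传播):

import torch
import clip  # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_style_score(rendered_pil_image, prompt="a 3D render of a colorful crochet candle"):
    # Cosine similarity between one rendered view of the stylised mesh and the style prompt.
    image = preprocess(rendered_pil_image).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum().item()   # higher means the rendering better matches the prompt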
【2】 Embedding Arithmetic for Text-driven Image Transformation 标题:文本驱动图像变换的嵌入算法 链接:https://arxiv.org/abs/2112.03162
作者:Guillaume Couairon,Matthieu Cord,Matthijs Douze,Holger Schwenk 机构:Facebook AI, Sorbonne Université 摘要:文本的潜在表示呈现出几何规律性,比如著名的类比:女王之于国王,如同女人之于男人。这种结构化的语义关系尚未在图像表示上得到证实。最近的工作旨在弥合这一语义鸿沟,将图像和文本嵌入到多模态空间中,使文本定义的变换能够迁移到图像模态。我们引入SIMAT数据集来评估文本驱动的图像变换任务。SIMAT包含6k图像和18k"变换查询",旨在替换场景元素或更改其成对关系。目标是检索与(源图像,变换)查询一致的图像。我们使用图像/文本匹配oracle(OSCAR)来评估图像变换是否成功。SIMAT数据集将公开提供。我们使用SIMAT表明,原始的(vanilla)CLIP多模态嵌入并不非常适合文本驱动的图像变换,但在COCO数据集上进行简单的微调可以带来显著的改进。我们还研究了利用预训练通用句子编码器(FastText、LASER和LaBSE)的几何特性是否有益。 摘要:Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations were not demonstrated on image representations. Recent works aiming at bridging this semantic gap embed images and text into a multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of text-driven image transformation. SIMAT contains 6k images and 18k "transformation queries" that aim at either replacing scene elements or changing their pairwise relationships. The goal is to retrieve an image consistent with the (source image, transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be publicly available. We use SIMAT to show that vanilla CLIP multimodal embeddings are not very well suited for text-driven image transformation, but that a simple finetuning on the COCO dataset can bring dramatic improvements. We also study whether it is beneficial to leverage the geometric properties of pretrained universal sentence encoders (FastText, LASER and LaBSE).
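"把文本定义的变换迁移到图像模态"在嵌入空间里就是一次向量算术加最近邻检索,可用如下numpy示意理解(并非原文实现;假设所有嵌入已处于同一多模态空间并已做L2归一化,lam为假设的缩放系数):

import numpy as np

def transform_and_retrieve(img_emb, src_text_emb, tgt_text_emb, gallery, lam=1.0):
    # Retrieve the image whose embedding is closest to img + lam * (target_text - source_text).
    # `gallery` is an (N, D) matrix of candidate image embeddings in the same multimodal space.
    query = img_emb + lam * (tgt_text_emb - src_text_emb)
    query = query / np.linalg.norm(query)
    scores = gallery @ query                 # cosine similarity against every candidate
    return int(np.argmax(scores)), scores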
【3】 Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation 标题:表现力语音驱动三维人脸动画的联合音文模型 链接:https://arxiv.org/abs/2112.02214
作者:Yingruo Fan,Zhaojiang Lin,Jun Saito,Wenping Wang,Taku Komura 机构:The University of Hong Kong, The Hong Kong University of Science and Technology, Adobe Research, Texas A&M University 摘要:语音驱动的精确嘴唇同步三维人脸动画已经得到广泛的研究。然而,在演讲过程中合成整个人脸的真实运动却很少被探索。在这项工作中,我们提出了一个联合音频文本模型来捕获上下文信息,用于表达语音驱动的三维人脸动画。现有的数据集被收集来覆盖尽可能多的不同音素而不是句子,从而限制了基于音频的模型学习更多不同上下文的能力。为了解决这个问题,我们建议利用从强大的预先训练的语言模型中提取的上下文文本嵌入,该语言模型已经从大规模文本数据中学习了丰富的上下文表示。我们的假设是,文本特征可以消除面部表情变化的歧义,而面部表情变化与音频相关性不强。与以前从文本中学习音素级特征的方法不同,我们研究了语音驱动的三维人脸动画的高级上下文文本特征。我们表明,组合声学和文本模式可以合成真实的面部表情,同时保持语音唇同步。我们进行定量和定性评估以及感性用户研究。结果表明,与现有的最先进的方法相比,我们的模型具有优越的性能。 摘要:Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model to capture the contextual information for expressive speech-driven 3D facial animation. The existing datasets are collected to cover as many different phonemes as possible instead of sentences, thus limiting the capability of the audio-based model to learn more diverse contexts. To address this, we propose to leverage the contextual text embeddings extracted from the powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio. In contrast to prior approaches which learn phoneme-level features from the text, we investigate the high-level contextual text features for speech-driven 3D facial animation. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct the quantitative and qualitative evaluations as well as the perceptual user study. The results demonstrate the superior performance of our model against existing state-of-the-art approaches.
Attention注意力(1篇)
【1】 SSAGCN: Social Soft Attention Graph Convolution Network for Pedestrian Trajectory Prediction 标题:SSAGCN:面向行人轨迹预测的社会软注意图卷积网络 链接:https://arxiv.org/abs/2112.02459
作者:Pei Lv,Wentong Wang,Yunxin Wang,Yuzhen Zhang,Mingliang Xu,Changsheng Xu 机构:Henan Institute of Advanced Technology, Zhengzhou University 备注:14 pages, 8 figures 摘要:行人轨迹预测是自动驾驶的一项重要技术,近年来已成为研究热点。以往的方法主要依赖于行人的位置关系来建模社会交互,这显然不足以表示真实场景中的复杂情况。此外,现有的大多数工作通常将场景交互模块作为一个独立的分支引入,并将社会交互特征嵌入到轨迹生成过程中,而不是同时进行社会交互和场景交互,这可能会破坏轨迹预测的合理性。在本文中,我们提出了一种新的预测模型,称为社会软注意图卷积网络(SSAGCN),旨在同时处理行人之间的社会交互和行人与环境之间的场景交互。具体而言,在对社会交互进行建模时,我们提出了一种新的社会软注意函数(social soft attention function),它充分考虑了行人之间的各种交互因素,并能根据不同情况下的不同因素区分agent周围行人的影响。对于物理交互,我们提出了一种新的顺序场景共享机制。场景在每个时刻对某个agent的影响可以通过社会软注意与其他邻居共享,因此场景的影响在空间和时间维度上都得到了扩展。在这些改进的帮助下,我们成功获得了在社会性和物理性上均合理可接受的预测轨迹。在公开数据集上的实验证明了SSAGCN的有效性,并取得了最先进的结果。 摘要:Pedestrian trajectory prediction is an important technique of autonomous driving, which has become a research hot-spot in recent years. Previous methods mainly rely on the position relationship of pedestrians to model social interaction, which is obviously not enough to represent the complex cases in real situations. In addition, most of existing work usually introduce the scene interaction module as an independent branch and embed the social interaction features in the process of trajectory generation, rather than simultaneously carrying out the social interaction and scene interaction, which may undermine the rationality of trajectory prediction. In this paper, we propose one new prediction model named Social Soft Attention Graph Convolution Network (SSAGCN) which aims to simultaneously handle social interactions among pedestrians and scene interactions between pedestrians and environments. In detail, when modeling social interaction, we propose a new social soft attention function, which fully considers various interaction factors among pedestrians. And it can distinguish the influence of pedestrians around the agent based on different factors under various situations. For the physical interaction, we propose one new sequential scene sharing mechanism. The influence of the scene on one agent at each moment can be shared with other neighbors through social soft attention, therefore the influence of the scene is expanded both in spatial and temporal dimension. With the help of these improvements, we successfully obtain socially and physically acceptable predicted trajectories. The experiments on public available datasets prove the effectiveness of SSAGCN and have achieved state-of-the-art results.
人脸|人群计数(9篇)
【1】 HIVE: Evaluating the Human Interpretability of Visual Explanations 标题:HIVE:评估视觉解释的人类可解释性 链接:https://arxiv.org/abs/2112.03184
作者:Sunnie S. Y. Kim,Nicole Meister,Vikram V. Ramaswamy,Ruth Fong,Olga Russakovsky 机构:Princeton University 备注:HIVE can be found at this https URL 摘要:随着机器学习越来越多地应用于高影响、高风险的领域,出现了许多新方法,旨在使人工智能模型更加人性化。尽管可解释性工作最近有所增长,但对所提议的技术缺乏系统的评估。在这项工作中,我们提出了一个新的人类评估框架HIVE(视觉解释的人类可解释性),用于计算机视觉中的各种可解释性方法;据我们所知,这是同类作品中的第一部。我们认为,人类研究应该是正确评估方法对人类用户的解释能力的金标准。虽然由于成本、研究设计和跨方法比较方面的挑战,人类研究通常被避免,但我们描述了我们的框架如何缓解这些问题,并对代表可解释性工作多样性的四种方法进行了IRB批准的研究:GradCAM、BagNet、ProtoPNet和ProtoTree。我们的结果表明,解释(不管它们是否真的正确)产生了人类的信任,但不足以让用户区分正确和错误的预测。最后,我们还将我们的框架开源,以支持未来的研究,并鼓励更多以人为中心的解释性方法。 摘要:As machine learning is increasingly applied to high-impact, high-risk domains, there have been a number of new methods aimed at making AI models more human interpretable. Despite the recent growth of interpretability work, there is a lack of systematic evaluation of proposed techniques. In this work, we propose a novel human evaluation framework HIVE (Human Interpretability of Visual Explanations) for diverse interpretability methods in computer vision; to the best of our knowledge, this is the first work of its kind. We argue that human studies should be the gold standard in properly evaluating how interpretable a method is to human users. While human studies are often avoided due to challenges associated with cost, study design, and cross-method comparison, we describe how our framework mitigates these issues and conduct IRB-approved studies of four methods that represent the diversity of interpretability works: GradCAM, BagNet, ProtoPNet, and ProtoTree. Our results suggest that explanations (regardless of if they are actually correct) engender human trust, yet are not distinct enough for users to distinguish between correct and incorrect predictions. Lastly, we also open-source our framework to enable future studies and to encourage more human-centered approaches to interpretability.
【2】 General Facial Representation Learning in a Visual-Linguistic Manner 标题:视觉语言方式下的一般面部表征学习 链接:https://arxiv.org/abs/2112.03109
作者:Yinglin Zheng,Hao Yang,Ting Zhang,Jianmin Bao,Dongdong Chen,Yangyu Huang,Lu Yuan,Dong Chen,Ming Zeng,Fang Wen 机构:School of Informatics, Xiamen Unversity, Microsoft Research Asia, Microsoft Cloud+AI 备注:15 pages, 5 figures, 12 tables 摘要:如何学习一个通用的面部表情,以促进所有的面部分析任务?本文朝着这个目标迈出了一步。在本文中,我们研究了预训练模型在人脸分析任务中的迁移性能,并引入了一个称为FaRL的框架,用于以视觉语言方式进行一般的人脸表征学习。一方面,该框架涉及到从图像-文本对中学习高级语义的对比损失。另一方面,我们建议通过添加遮罩图像模型,同时探索低层信息以进一步增强人脸表示。我们对包含大量人脸图像-文本对的数据集LAION-FACE进行预训练,并评估其在多个下游任务上的表示能力。我们表明,FaRL与以前的预训练模型相比,具有更好的传输性能。我们还验证了它在低数据区的优越性。更重要的是,我们的模型在人脸分析任务(包括人脸解析和人脸对齐)上超过了最先进的方法。 摘要:How to learn a universal facial representation that boosts all face analysis tasks? This paper takes one step toward this goal. In this paper, we study the transfer performance of pre-trained models on face analysis tasks and introduce a framework, called FaRL, for general Facial Representation Learning in a visual-linguistic manner. On one hand, the framework involves a contrastive loss to learn high-level semantic meaning from image-text pairs. On the other hand, we propose exploring low-level information simultaneously to further enhance the face representation, by adding a masked image modeling. We perform pre-training on LAION-FACE, a dataset containing large amount of face image-text pairs, and evaluate the representation capability on multiple downstream tasks. We show that FaRL achieves better transfer performance compared with previous pre-trained models. We also verify its superiority in the low-data regime. More importantly, our model surpasses the state-of-the-art methods on face analysis tasks including face parsing and face alignment.
【3】 Pose2Room: Understanding 3D Scenes from Human Activities 标题:Pose2Room:从人类活动中理解3D场景 链接:https://arxiv.org/abs/2112.03030
作者:Yinyu Nie,Angela Dai,Xiaoguang Han,Matthias Nießner 机构:Technical University of Munich, SRIBD, CUHKSZ 备注:Project page: this https URL Video: this https URL 摘要:有了可穿戴IMU传感器,人们可以通过可穿戴设备估计人体姿势,而无需视觉输入。在这项工作中,我们提出了一个问题:我们能否仅从人类轨迹信息来推断真实环境中的对象结构?关键的是,我们观察到,人类的运动和互动往往会提供场景中物体的强烈信息——例如,一个人坐着表示可能有椅子或沙发。为此,我们提出P2R网络来学习场景中对象的概率3D模型,该模型以其类别和定向3D边界框为特征,基于环境中观察到的输入人体轨迹。P2R网络对对象类的概率分布以及对象框的深度高斯混合模型进行建模,从而能够从观察到的人体轨迹中对多个、不同、可能的对象配置模式进行采样。在我们的实验中,我们证明了P2R网络可以有效地学习人体运动中可能对象的多模态分布,并产生各种可能的环境对象结构,即使没有任何视觉信息。 摘要:With wearable IMU sensors, one can estimate human poses from wearable devices without requiring visual input \cite{von2017sparse}. In this work, we pose the question: Can we reason about object structure in real-world environments solely from human trajectory information? Crucially, we observe that human motion and interactions tend to give strong information about the objects in a scene -- for instance a person sitting indicates the likely presence of a chair or sofa. To this end, we propose P2R-Net to learn a probabilistic 3D model of the objects in a scene characterized by their class categories and oriented 3D bounding boxes, based on an input observed human trajectory in the environment. P2R-Net models the probability distribution of object class as well as a deep Gaussian mixture model for object boxes, enabling sampling of multiple, diverse, likely modes of object configurations from an observed human trajectory. In our experiments we demonstrate that P2R-Net can effectively learn multi-modal distributions of likely objects for human motions, and produce a variety of plausible object structures of the environment, even without any visual information.
【4】 HumanNeRF: Generalizable Neural Human Radiance Field from Sparse Inputs 标题:HumanNeRF:稀疏输入的广义神经人类辐射场 链接:https://arxiv.org/abs/2112.02789
作者:Fuqiang Zhao,Wei Yang,Jiakai Zhang,Pei Lin,Yingliang Zhang,Jingyi Yu,Lan Xu 机构: ShanghaiTech University, Huazhong University of Science and Technology, DGene 摘要:最近的人工神经表示可以产生高质量的多视图渲染,但需要使用密集的多视图输入和昂贵的训练。因此,它们很大程度上局限于静态模型,因为训练每个帧是不可行的。我们提出了HumanNeRF——一种可推广的神经表示法——用于动态人的高保真自由视图合成。类似于IBRNet如何通过避免每场景训练来帮助NeRF,HumanNeRF在多视图输入中使用聚合像素对齐功能以及嵌入姿势的非刚性变形场来处理动态运动。raw HumanNeRF已经可以在看不见的对象和相机设置的稀疏视频输入上生成合理的渲染。为了进一步提高渲染质量,我们增加了一个外观混合模块,将神经体积渲染和神经纹理混合的优点结合起来。在各种多视图动态人体数据集上进行的大量实验证明了我们的方法在具有挑战性的运动和非常稀疏的摄像机视图输入下合成具有照片真实感的自由视图人体的通用性和有效性。 摘要:Recent neural human representations can produce high-quality multi-view rendering but require using dense multi-view inputs and costly training. They are hence largely limited to static models as training each frame is infeasible. We present HumanNeRF - a generalizable neural representation - for high-fidelity free-view synthesis of dynamic humans. Analogous to how IBRNet assists NeRF by avoiding per-scene training, HumanNeRF employs an aggregated pixel-alignment feature across multi-view inputs along with a pose embedded non-rigid deformation field for tackling dynamic motions. The raw HumanNeRF can already produce reasonable rendering on sparse video inputs of unseen subjects and camera settings. To further improve the rendering quality, we augment our solution with an appearance blending module for combining the benefits of both neural volumetric rendering and neural texture blending. Extensive experiments on various multi-view dynamic human datasets demonstrate the generalizability and effectiveness of our approach in synthesizing photo-realistic free-view humans under challenging motions and with very sparse camera view inputs.
【5】 PSI: A Pedestrian Behavior Dataset for Socially Intelligent Autonomous Car 标题:PSI:一种面向社会智能自动驾驶汽车的行人行为数据集 链接:https://arxiv.org/abs/2112.02604
作者:Tina Chen,Renran Tian,Yaobin Chen,Joshua Domeyer,Heishiro Toyoda,Rini Sherony,Taotao Jing,Zhengming Ding 机构:Electrical & Computer Engineering, IUPUI, Computer Information Technology, Collaborative Safety Research Center, Toyota Motor North America, Toyota Research Institute, Computer Science, Tulane University 摘要:行人行为预测对于全自动车辆在繁忙的城市街道上安全高效地行驶至关重要。未来的自动驾驶汽车不仅需要具备技术能力,还需要具备社会能力。随着越来越多的算法和数据集被开发用于预测行人行为,这些工作缺乏基准标签和能力来估计行人的时间动态意图变化,提供交互场景的解释,并支持具有社会智能的算法。本文提出并共享了另一个称为IUPUI-CSRC行人情境意图(PSI)的基准数据集,该数据集除了全面的计算机视觉标签外,还具有两个创新性标签。第一个新颖的标签是行人在自车(ego-vehicle)前穿行的动态意图变化,由24名具有不同背景的驾驶员标注得到。第二个是驾驶员在评估行人意图并预测其交互期间行为时推理过程的文本解释。这些创新的标签可以实现多个计算机视觉任务,包括行人意图/行为预测、车辆-行人交互分割以及视频到语言的可解释算法映射。发布的数据集可以从根本上改进行人行为预测模型的开发,并开发出具有社会智能的自动驾驶汽车,以有效地与行人互动。数据集已通过不同的任务进行评估,并发布给公众访问。 摘要:Prediction of pedestrian behavior is critical for fully autonomous vehicles to drive in busy city streets safely and efficiently. The future autonomous cars need to fit into mixed conditions with not only technical but also social capabilities. As more algorithms and datasets have been developed to predict pedestrian behaviors, these efforts lack the benchmark labels and the capability to estimate the temporal-dynamic intent changes of the pedestrians, provide explanations of the interaction scenes, and support algorithms with social intelligence. This paper proposes and shares another benchmark dataset called the IUPUI-CSRC Pedestrian Situated Intent (PSI) data with two innovative labels besides comprehensive computer vision labels. The first novel label is the dynamic intent changes for the pedestrians to cross in front of the ego-vehicle, achieved from 24 drivers with diverse backgrounds. The second one is the text-based explanations of the driver reasoning process when estimating pedestrian intents and predicting their behaviors during the interaction period. These innovative labels can enable several computer vision tasks, including pedestrian intent/behavior prediction, vehicle-pedestrian interaction segmentation, and video-to-language mapping for explainable algorithms. The released dataset can fundamentally improve the development of pedestrian behavior prediction models and develop socially intelligent autonomous cars to interact with pedestrians efficiently. The dataset has been evaluated with different tasks and is released to the public to access.
【6】 Implicit Neural Deformation for Multi-View Face Reconstruction 标题:隐式神经变形在多视点人脸重建中的应用 链接:https://arxiv.org/abs/2112.02494
作者:Moran Li,Haibin Huang,Yi Zheng,Mengtian Li,Nong Sang,Chongyang Ma 机构:∗Kuaishou Technology, †Huazhong University of Science and Technology 备注:13 pages, 4 figures 摘要:在这项工作中,我们提出了一种从多视角RGB图像重建三维人脸的新方法。与以前基于三维可变形模型(3DMMs)的方法不同,我们的方法利用隐式表示对丰富的几何特征进行编码。我们的整个管道由两个主要部分组成,包括一个几何网络,它学习可变形神经符号距离函数(SDF)作为3D人脸表示,以及一个渲染网络,它学习在神经SDF的表面点上渲染,以通过自监督优化匹配输入图像。为了在测试时处理同一目标在不同表情下的真实场景(in-the-wild)稀疏视图输入,我们进一步提出了残差潜在编码来有效扩展所学隐式人脸表示的形状空间,以及一种新的视图切换损失来加强不同视图之间的一致性。我们在几个基准数据集上的实验结果表明,与最新的方法相比,我们的方法优于其他基线,并获得了更好的人脸重建结果。 摘要:In this work, we present a new method for 3D face reconstruction from multi-view RGB images. Unlike previous methods which are built upon 3D morphable models (3DMMs) with limited details, our method leverages an implicit representation to encode rich geometric features. Our overall pipeline consists of two major components, including a geometry network, which learns a deformable neural signed distance function (SDF) as the 3D face representation, and a rendering network, which learns to render on-surface points of the neural SDF to match the input images via self-supervised optimization. To handle in-the-wild sparse-view input of the same target with different expressions at test time, we further propose residual latent code to effectively expand the shape space of the learned implicit face representation, as well as a novel view-switch loss to enforce consistency among different views. Our experimental results on several benchmark datasets demonstrate that our approach outperforms alternative baselines and achieves superior face reconstruction results compared to state-of-the-art methods.
【7】 MoFaNeRF: Morphable Facial Neural Radiance Field 标题:MoFaNeRF:可变形的面神经辐射场 链接:https://arxiv.org/abs/2112.02308
作者:Yiyu Zhuang,Hao Zhu,Xusen Sun,Xun Cao 机构:Nanjing University, Nanjing, China 摘要:我们提出了一个参数化模型,该模型使用神经辐射场,即可变形面部神经网络,将自由视点图像映射到编码面部形状、表情和外观的向量空间。具体地说,MoFaNeRF将编码的面部形状、表情和外观以及空间坐标和视图方向作为MLP的输入,并输出空间点的辐射度以进行照片真实感图像合成。与传统的3D可变形模型(3DMM)相比,MoFaNeRF在直接合成照片逼真的面部细节方面显示出优越性,即使是眼睛、嘴巴和胡须。此外,通过插值输入的形状、表达式和外观代码,可以轻松实现连续的面部变形。通过引入特定于身份的调制和纹理编码器,我们的模型合成了精确的光度细节,并显示了强大的表示能力。我们的模型在多种应用中表现出强大的能力,包括基于图像的拟合、随机生成、人脸装配、人脸编辑和新颖的视图合成。实验表明,我们的方法比以前的参数化模型具有更高的表示能力,并且在多个应用中取得了有竞争力的性能。据我们所知,我们的工作是第一个基于神经辐射场的面部参数化模型,可用于拟合、生成和操作。我们的代码和模型发布于https://github.com/zhuhao-nju/mofanerf. 摘要:We propose a parametric model that maps free-view images into a vector space of coded facial shape, expression and appearance using a neural radiance field, namely Morphable Facial NeRF. Specifically, MoFaNeRF takes the coded facial shape, expression and appearance along with space coordinate and view direction as input to an MLP, and outputs the radiance of the space point for photo-realistic image synthesis. Compared with conventional 3D morphable models (3DMM), MoFaNeRF shows superiority in directly synthesizing photo-realistic facial details even for eyes, mouths, and beards. Also, continuous face morphing can be easily achieved by interpolating the input shape, expression and appearance codes. By introducing identity-specific modulation and texture encoder, our model synthesizes accurate photometric details and shows strong representation ability. Our model shows strong ability on multiple applications including image-based fitting, random generation, face rigging, face editing, and novel view synthesis. Experiments show that our method achieves higher representation ability than previous parametric models, and achieves competitive performance in several applications. To the best of our knowledge, our work is the first facial parametric model built upon a neural radiance field that can be used in fitting, generation and manipulation. Our code and model are released in https://github.com/zhuhao-nju/mofanerf.
【8】 Sphere Face Model:A 3D Morphable Model with Hypersphere Manifold Latent Space 标题:球面模型:一种具有超球流形潜在空间的三维可变形模型 链接:https://arxiv.org/abs/2112.02238
作者:Diqiong Jiang,Yiwei Jin,Fanglue Zhang,Zhe Zhu,Yun Zhang,Ruofeng Tong,Min Tang 机构:Zhejiang University, Fang-Lue Zhang, Victoria University, of Wellington, Duke University, Zhang Yun, Communication University, of Zhejiang, Tong, Ruofeng, Tang, Min 摘要:3D可变形模型(3DMMs)是人脸形状和外观的生成模型。然而,传统3DMMs的形状参数满足多元高斯分布,而身份嵌入满足超球面分布,这一矛盾使得人脸重建模型在保持忠实性和形状一致性的同时面临挑战。为了解决这个问题,我们提出了球面人脸模型(SFM),这是一种新颖的单目人脸重建3DMM,它可以保持形状保真度和身份一致性。我们的SFM的核心是可用于重建3D人脸形状的基础矩阵,并且通过采用两阶段训练方法学习基本矩阵,其中3D和2D训练数据分别用于第一和第二阶段。为了解决分布失配问题,我们设计了一种新的损耗,使形状参数具有超球形的潜在空间。大量实验表明,SFM具有较高的表示能力和形状参数空间的聚类性能。此外,它可以生成逼真的人脸形状,并且在单目人脸重建中,这些形状在具有挑战性的条件下是一致的。 摘要:3D Morphable Models (3DMMs) are generative models for face shape and appearance. However, the shape parameters of traditional 3DMMs satisfy the multivariate Gaussian distribution while the identity embeddings satisfy the hypersphere distribution, and this conflict makes it challenging for face reconstruction models to preserve the faithfulness and the shape consistency simultaneously. To address this issue, we propose the Sphere Face Model(SFM), a novel 3DMM for monocular face reconstruction, which can preserve both shape fidelity and identity consistency. The core of our SFM is the basis matrix which can be used to reconstruct 3D face shapes, and the basic matrix is learned by adopting a two-stage training approach where 3D and 2D training data are used in the first and second stages, respectively. To resolve the distribution mismatch, we design a novel loss to make the shape parameters have a hyperspherical latent space. Extensive experiments show that SFM has high representation ability and shape parameter space's clustering performance. Moreover, it produces fidelity face shapes, and the shapes are consistent in challenging conditions in monocular face reconstruction.
【9】 Face Reconstruction with Variational Autoencoder and Face Masks 标题:基于变分自动编码器和人脸掩码的人脸重建 链接:https://arxiv.org/abs/2112.02139
作者:Rafael S. Toledo,Eric A. Antonelo 机构:Department of Automation and Systems, Federal University of Santa Catarina (UFSC), Florianópolis, Brazil 备注:12 pages, 7 figures, 18th Encontro Nacional de Inteligência Artificial e Computacional (ENIAC) 摘要:变分自动编码器(VAE)利用深度学习模型学习高维观测数据集下的连续潜在z空间。利用该模型,可以完成许多任务,包括人脸重建和人脸合成。在这项工作中,我们研究了人脸掩码如何通过将学习限制在掩码所选的像素上,来帮助用于人脸重建的VAE的训练。使用celebA数据集对该方案进行的评估表明,人脸掩码增强了重建图像,特别是当SSIM损失与l1或l2损失函数一起使用时。我们注意到,在体系结构中加入用于人脸掩码预测的解码器会影响使用l1或l2损失函数时的性能,而SSIM损失则不受影响。此外,SSIM感知损失在所有测试假设中产生了最清晰的样本,尽管它会改变图像的原始颜色;将l1或l2损失与SSIM一起使用有助于解决此问题。 摘要:Variational AutoEncoders (VAE) employ deep learning models to learn a continuous latent z-space that is subjacent to a high-dimensional observed dataset. With that, many tasks are made possible, including face reconstruction and face synthesis. In this work, we investigated how face masks can help the training of VAEs for face reconstruction, by restricting the learning to the pixels selected by the face mask. An evaluation of the proposal using the celebA dataset shows that the reconstructed images are enhanced with the face masks, especially when SSIM loss is used either with l1 or l2 loss functions. We noticed that the inclusion of a decoder for face mask prediction in the architecture affected the performance for l1 or l2 loss functions, while this was not the case for the SSIM loss. Besides, SSIM perceptual loss yielded the crispest samples between all hypotheses tested, although it shifts the original color of the image, making the usage of the l1 or l2 losses together with SSIM helpful to solve this issue.
跟踪(1篇)
【1】 Visual Object Tracking with Discriminative Filters and Siamese Networks: A Survey and Outlook 标题:基于判别滤波器和暹罗网络的视觉目标跟踪:综述与展望 链接:https://arxiv.org/abs/2112.02838
作者:Sajid Javed,Martin Danelljan,Fahad Shahbaz Khan,Muhammad Haris Khan,Michael Felsberg,Jiri Matas 备注:Tracking Survey 摘要:准确、鲁棒的视觉目标跟踪是计算机视觉中最具挑战性和最基本的问题之一。它需要在给定目标初始位置的情况下,估计目标在图像序列中的轨迹,并进行分割,或以边界框的形式进行粗略近似。判别相关滤波器(DCF)和深度暹罗网络(SNs)已成为主要的跟踪模式,并取得了重大进展。随着视觉目标跟踪在过去十年中的快速发展,本调查基于九个跟踪基准的结果,对90多个DCF和暹罗跟踪器进行了系统和彻底的审查。首先,我们介绍了DCF和暹罗跟踪核心配方的背景理论。然后,我们区分并全面回顾了这两种追踪范式中的共享和特定开放研究挑战。此外,我们深入分析了DCF和暹罗跟踪器在九个基准上的性能,涵盖了视觉跟踪的不同实验方面:数据集、评估指标、性能和速度比较。我们在分析的基础上提出了针对突出开放性挑战的建议和建议,从而完成了调查。 摘要:Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial location, and segmentation, or its rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as dominating tracking paradigms, which have led to significant progress. Following the rapid evolution of visual object tracking in the last decade, this survey presents a systematic and thorough review of more than 90 DCFs and Siamese trackers, based on results in nine tracking benchmarks. First, we present the background theory of both the DCF and Siamese tracking core formulations. Then, we distinguish and comprehensively review the shared as well as specific open research challenges in both these tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We finish the survey by presenting recommendations and suggestions for distinguished open challenges based on our analysis.
图像视频检索|Re-id相关(3篇)
【1】 Global-Local Context Network for Person Search 标题:用于人员搜索的全局-局部上下文网络 链接:https://arxiv.org/abs/2112.02500
作者:Peng Zheng,Jie Qin,Yichao Yan,Shengcai Liao,Bingbing Ni,Xiaogang Cheng,Ling Shao 摘要:人员搜索的目的是从自然的、未裁剪的图像中联合定位和识别查询人员,这在过去几年中已经在计算机视觉领域得到了积极的研究。在本文中,我们深入研究了目标人周围丰富的全局和局部上下文信息,分别称为场景上下文和组上下文。与以往分别处理这两种类型的上下文的工作不同,我们在统一的全局-局部上下文网络(GLCNet)中利用它们,直观的目的是增强特征。具体而言,以多阶段的方式同时增强了re-ID嵌入和上下文特征,最终为人员搜索提供了增强的、有判别力的特征。我们在两个人员搜索基准(即CUHK-SYSU和PRW)上进行实验,并将我们的方法扩展到更具挑战性的环境(即MovieNet上的角色搜索)。大量的实验结果表明,在三个数据集上,与最新的方法相比,所提出的GLCNet具有一致的改进。我们的源代码、预先训练的模型和角色搜索的新设置可从以下网址获得:https://github.com/ZhengPeng7/GLCNet. 摘要:Person search aims to jointly localize and identify a query person from natural, uncropped images, which has been actively studied in the computer vision community over the past few years. In this paper, we delve into the rich context information globally and locally surrounding the target person, which we refer to scene and group context, respectively. Unlike previous works that treat the two types of context individually, we exploit them in a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement. Specifically, re-ID embeddings and context features are enhanced simultaneously in a multi-stage fashion, ultimately leading to enhanced, discriminative features for person search. We conduct the experiments on two person search benchmarks (i.e., CUHK-SYSU and PRW) as well as extend our approach to a more challenging setting (i.e., character search on MovieNet). Extensive experimental results demonstrate the consistent improvement of the proposed GLCNet over the state-of-the-art methods on the three datasets. Our source codes, pre-trained models, and the new setting for character search are available at: https://github.com/ZhengPeng7/GLCNet.
【2】 3rd Place: A Global and Local Dual Retrieval Solution to Facebook AI Image Similarity Challenge 标题:第3名:Facebook AI图像相似性挑战的全局和局部双重检索解决方案 链接:https://arxiv.org/abs/2112.02373
作者:Xinlong Sun,Yangyang Qin,Xuyuan Xu,Guoping Gong,Yang Fang,Yexin Wang 机构:AI Technology Center, OVB, Tencent, Shenzhen, China 备注:This is the 3rd solution for Facebook Image Similarity Challenge and NIPS2021 Workshop. The current first draft version will be updated later 摘要:图像相似性检索作为计算机视觉的一项基本任务,面临着大规模数据和图像复制攻击的挑战。本文介绍了我们在Facebook AI组织的2021年图像相似性挑战赛(ISC)匹配赛道中获得第三名的解决方案。我们提出了一种结合全局描述符和局部描述符的多分支检索方法来覆盖所有攻击案例。具体来说,我们尝试了许多策略来优化全局描述符,包括丰富的数据扩充、单一Transformer模型的自监督学习以及叠加(overlay)检测预处理。此外,我们还引入了鲁棒的SIFT特征和GPU Faiss进行局部检索,弥补了全局检索的不足。最后,利用KNN匹配算法判断匹配和合并分数。我们展示了我们方法的一些消融实验,揭示了全局和局部特征的互补优势。 摘要:As a basic task of computer vision, image similarity retrieval is facing the challenge of large-scale data and image copy attacks. This paper presents our 3rd place solution to the matching track of Image Similarity Challenge (ISC) 2021 organized by Facebook AI. We propose a multi-branch retrieval method of combining global descriptors and local descriptors to cover all attack cases. Specifically, we attempt many strategies to optimize global descriptors, including abundant data augmentations, self-supervised learning with a single Transformer model, overlay detection preprocessing. Moreover, we introduce the robust SIFT feature and GPU Faiss for local retrieval which makes up for the shortcomings of the global retrieval. Finally, KNN-matching algorithm is used to judge the match and merge scores. We show some ablation experiments of our method, which reveals the complementary advantages of global and local features.
【3】 HHF: Hashing-guided Hinge Function for Deep Hashing Retrieval 标题:HHF:用于深度散列检索的散列制导的铰链函数 链接:https://arxiv.org/abs/2112.02225
作者:Chengyin Xu,Zhengzhuo Xu,Zenghao Chai,Hongjia Li,Qiruyi Zuo,Lingyu Yang,Chun Yuan 机构:Shenzhen International Graduate School, Tsinghua University, Peng Cheng Laboratory 摘要:深度哈希在大规模图像检索中表现出良好的性能。然而,由深度神经网络(DNN)提取的潜在编码在二值化过程中不可避免地会丢失语义信息,这会损害检索效率并使其具有挑战性。虽然许多现有的方法执行正则化以减轻量化误差,但我们发现度量和量化损失之间存在不相容的冲突。度量损失惩罚类间距离,将不同的类无约束地推远。更糟糕的是,它倾向于映射出偏离理想二值化点的潜在编码,并在二值化过程中产生严重的歧义。基于二元线性码的最小距离,我们提出了散列引导铰链函数(Hashing-guided Hinge Function, HHF)来避免这种冲突。具体来说,我们仔细设计了一个特定的拐点,它依赖于哈希位长度和类别数来平衡度量学习和量化学习。这样的修改可以防止网络在深度散列中陷入局部度量最优极小值。在CIFAR-10、CIFAR-100、ImageNet和MS-COCO中进行的大量实验表明,HHF始终优于现有技术,并且具有鲁棒性和灵活性,可以移植到其他方法中。 摘要:Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Network (DNN) will inevitably lose semantic information during the binarization process, which damages the retrieval efficiency and make it challenging. Although many existing approaches perform regularization to alleviate quantization errors, we figure out an incompatible conflict between the metric and quantization losses. The metric loss penalizes the inter-class distances to push different classes unconstrained far away. Worse still, it tends to map the latent code deviate from ideal binarization point and generate severe ambiguity in the binarization process. Based on the minimum distance of the binary linear code, Hashing-guided Hinge Function (HHF) is proposed to avoid such conflict. In detail, we carefully design a specific inflection point, which relies on the hash bit length and category numbers to balance metric learning and quantization learning. Such a modification prevents the network from falling into local metric optimal minima in deep hashing. Extensive experiments in CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing techniques, and is robust and flexible to transplant into other methods.
裁剪|量化|加速|压缩相关(1篇)
【1】 Generalized Binary Search Network for Highly-Efficient Multi-View Stereo 标题:高效多视点立体的广义对分搜索网络 链接:https://arxiv.org/abs/2112.02338
作者:Zhenxing Mi,Di Chang,Dan Xu 机构:The Department of Computer Science and Engineering, HKUST 备注:16 pages 摘要:具有已知摄像机参数的多视点立体(MVS)本质上是一个有效深度范围内的一维搜索问题。最近基于深度学习的MVS方法通常在深度范围内对深度假设进行密集采样,然后为深度预测构建令人望而却步的占用内存的3D成本量。虽然从粗到精的采样策略在一定程度上缓解了这一开销问题,但MVS的效率仍然是一个开放的挑战。在这项工作中,我们提出了一种新的高效MVS方法,该方法显著减少了内存占用,同时明显提高了最先进的深度预测性能。考虑到效率和有效性,我们研究了什么样的搜索策略可以合理地优化MVS。我们首先将MVS描述为一个二进制搜索问题,并据此提出了一个适用于MVS的广义二进制搜索网络。具体而言,在每一步中,深度范围被分成两个箱子,两侧各有一个额外的误差容限箱子。执行分类以确定哪个箱子包含真实深度。我们还设计了三种机制来分别处理分类错误、处理超出范围的样本和减少训练记忆。新的公式使得我们的方法在每一步中只采样极少量的深度假设,这具有很高的记忆效率,并且极大地促进了快速训练收敛。在竞争性基准测试上的实验表明,我们的方法以更少的内存实现了最先进的精度。特别是,我们的方法在DTU数据集上的总分为0.289,在所有基于学习的方法中,在挑战坦克和庙宇高级数据集上排名第一。经过训练的模型和代码将在https://github.com/MiZhenxing/GBi-Net. 摘要:Multi-view Stereo (MVS) with known camera parameters is essentially a 1D search problem within a valid depth range. Recent deep learning-based MVS methods typically densely sample depth hypotheses in the depth range, and then construct prohibitively memory-consuming 3D cost volumes for depth prediction. Although coarse-to-fine sampling strategies alleviate this overhead issue to a certain extent, the efficiency of MVS is still an open challenge. In this work, we propose a novel method for highly efficient MVS that remarkably decreases the memory footprint, meanwhile clearly advancing state-of-the-art depth prediction performance. We investigate what a search strategy can be reasonably optimal for MVS taking into account of both efficiency and effectiveness. We first formulate MVS as a binary search problem, and accordingly propose a generalized binary search network for MVS. Specifically, in each step, the depth range is split into 2 bins with extra 1 error tolerance bin on both sides. A classification is performed to identify which bin contains the true depth. We also design three mechanisms to respectively handle classification errors, deal with out-of-range samples and decrease the training memory. The new formulation makes our method only sample a very small number of depth hypotheses in each step, which is highly memory efficient, and also greatly facilitates quick training convergence. Experiments on competitive benchmarks show that our method achieves state-of-the-art accuracy with much less memory. Particularly, our method obtains an overall score of 0.289 on DTU dataset and tops the first place on challenging Tanks and Temples advanced dataset among all the learning-based methods. The trained models and code will be released at https://github.com/MiZhenxing/GBi-Net.
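下面给出一个仅作示意的极简Python草图,演示"把有效深度范围反复二分、由分类器判断真实深度落在哪个箱子"这一搜索思路;其中 classify_bin 是假设的占位分类器,误差容限箱等论文细节从略,并非该文的完整算法。

```python
def binary_search_depth(d_min, d_max, classify_bin, num_steps=8):
    """classify_bin(lo, mid, hi) 返回 0 或 1,指示真实深度更可能落在左半区间还是右半区间。"""
    lo, hi = d_min, d_max
    for _ in range(num_steps):
        mid = 0.5 * (lo + hi)
        if classify_bin(lo, mid, hi) == 0:   # 真实深度落在 [lo, mid)
            hi = mid
        else:                                # 真实深度落在 [mid, hi]
            lo = mid
    return 0.5 * (lo + hi)                   # 取剩余区间中点作为深度估计

# 用法示例:用一个"知道真值"的占位分类器验证搜索逻辑
true_depth = 4.37
oracle = lambda lo, mid, hi: 0 if true_depth < mid else 1
print(binary_search_depth(1.0, 10.0, oracle))  # 约等于 4.37
```

可以看到,每一步只需对两个箱子做一次分类,因而每步只需极少量深度假设,这正是摘要所述内存开销小、收敛快的来源。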
表征学习(1篇)
【1】 Forward Compatible Training for Representation Learning 标题:表征学习的前向相容训练 链接:https://arxiv.org/abs/2112.02805
作者:Vivek Ramanujan,Pavan Kumar Anasosalu Vasu,Ali Farhadi,Oncel Tuzel,Hadi Pouransari 机构:University of Washington†, Apple 备注:14 pages with appendix 摘要:在可视化检索系统中,更新嵌入模型需要重新计算每段数据的特征。这种昂贵的过程称为回填。最近,向后兼容训练(BCT)的思想被提出。为了避免回填成本,BCT修改了新模型的训练,使其表示与旧模型的表示兼容。然而,BCT会显著阻碍新模型的性能。在这项工作中,我们提出了一种新的表征学习范式:前向兼容训练(FCT)。在FCT中,当旧模型被训练时,我们也为模型的未来未知版本做准备。我们建议学习侧信息,这是每个样本的一个辅助功能,有助于模型的未来更新。为了开发一个强大而灵活的模型兼容性框架,我们将附加信息与从旧嵌入到新嵌入的正向转换结合起来。新模型的训练不会被修改,因此,其精度不会降低。我们证明,与BCT相比,在各种数据集的检索准确率有显著提高:ImageNet-1k(+18.1%)、Places-365(+5.4%)和VGG-Face2(+8.3%)。当新旧模型跨不同的数据集、损失和体系结构进行训练时,FCT可以获得模型兼容性。 摘要:In visual retrieval systems, updating the embedding model requires recomputing features for every piece of data. This expensive process is referred to as backfilling. Recently, the idea of backward compatible training (BCT) was proposed. To avoid the cost of backfilling, BCT modifies training of the new model to make its representations compatible with those of the old model. However, BCT can significantly hinder the performance of the new model. In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT). In FCT, when the old model is trained, we also prepare for a future unknown version of the model. We propose learning side-information, an auxiliary feature for each sample which facilitates future updates of the model. To develop a powerful and flexible framework for model compatibility, we combine side-information with a forward transformation from old to new embeddings. Training of the new model is not modified, hence, its accuracy is not degraded. We demonstrate significant retrieval accuracy improvement compared to BCT for various datasets: ImageNet-1k (+18.1%), Places-365 (+5.4%), and VGG-Face2 (+8.3%). FCT obtains model compatibility when the new and old models are trained across different datasets, losses, and architectures.
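下面用PyTorch给出一个极简示意,说明"前向兼容"的数据流:旧模型入库时同时保存嵌入与侧信息,模型升级后用一个前向变换把旧的(嵌入, 侧信息)映射到新嵌入空间,从而免去回填;网络结构与维度均为假设示例,并非该文的官方实现。

```python
import torch
import torch.nn as nn

class ForwardTransform(nn.Module):
    """把旧嵌入与配套的侧信息映射到新模型的嵌入空间(示意结构)。"""
    def __init__(self, old_dim=256, side_dim=64, new_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(old_dim + side_dim, 1024), nn.ReLU(),
            nn.Linear(1024, new_dim))

    def forward(self, old_emb, side_info):
        return self.net(torch.cat([old_emb, side_info], dim=-1))

# 入库阶段(旧模型):保存 old_emb 与 side_info;模型升级后无需重新提取特征,
# 只需把库中向量经前向变换映射到新空间,即可与新模型的查询嵌入直接比较。
transform = ForwardTransform()
gallery_old = torch.randn(1000, 256)     # 旧模型提取的库嵌入
gallery_side = torch.randn(1000, 64)     # 入库时一并保存的侧信息
gallery_new_space = transform(gallery_old, gallery_side)   # 与新查询嵌入同空间
```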
蒸馏|知识提取(1篇)
【1】 A comparison study of CNN denoisers on PRNU extraction 标题:用于PRNU提取的CNN去噪器的比较研究 链接:https://arxiv.org/abs/2112.02858
作者:Hui Zeng,Morteza Darvish Morshedi Hosseini,Kang Deng,Anjie Peng,Miroslav Goljan 备注:12 pages, 6 figures, 4 tables 摘要:基于传感器的摄像机识别(SCI)方法的性能在很大程度上依赖于估计光响应不均匀性(PRNU)的去噪滤波器。尽管对提高提取的PRNU的质量进行了各种尝试,但它在低分辨率图像中的性能仍然不令人满意,并且计算量也很高。利用PRNU估计和图像去噪的相似性,我们利用基于卷积神经网络(CNN)的去噪器的最新成果进行PRNU提取。本文在公共的“德累斯顿图像数据库”上对这种CNN去噪器的SCI性能进行了比较评估。我们的发现有两个方面。从一个方面来说,PRNU提取和图像去噪都将噪声从图像内容中分离出来。因此,如果经过仔细训练,SCI可以从最近的CNN去噪器中获益。另一方面,PRNU提取和图像去噪的目标和场景不同,因为一个优化了噪声质量,另一个优化了图像质量。当CNN去噪器用于PRNU估计时,需要仔细定制的训练。对训练数据准备和损失函数设计的备选策略进行了理论分析和实验评估。我们指出,向CNN提供图像PRNU对,并使用基于相关的损失函数对其进行训练,可以获得最佳的PRNU估计性能。为了便于SCI的进一步研究,我们还提出了一种最小损失相机指纹量化方案,使用该方案我们将指纹保存为PNG格式的图像文件。此外,我们还公开了“德累斯顿图像数据库”中摄像机的量化指纹。 摘要:Performance of the sensor-based camera identification (SCI) method heavily relies on the denoising filter in estimating Photo-Response Non-Uniformity (PRNU). Given various attempts on enhancing the quality of the extracted PRNU, it still suffers from unsatisfactory performance in low-resolution images and high computational demand. Leveraging the similarity of PRNU estimation and image denoising, we take advantage of the latest achievements of Convolutional Neural Network (CNN)-based denoisers for PRNU extraction. In this paper, a comparative evaluation of such CNN denoisers on SCI performance is carried out on the public "Dresden Image Database". Our findings are two-fold. From one aspect, both the PRNU extraction and image denoising separate noise from the image content. Hence, SCI can benefit from the recent CNN denoisers if carefully trained. From another aspect, the goals and the scenarios of PRNU extraction and image denoising are different since one optimizes the quality of noise and the other optimizes the image quality. A carefully tailored training is needed when CNN denoisers are used for PRNU estimation. Alternative strategies of training data preparation and loss function design are analyzed theoretically and evaluated experimentally. We point out that feeding the CNNs with image-PRNU pairs and training them with correlation-based loss function result in the best PRNU estimation performance. To facilitate further studies of SCI, we also propose a minimum-loss camera fingerprint quantization scheme using which we save the fingerprints as image files in PNG format. Furthermore, we make the quantized fingerprints of the cameras from the "Dresden Image Database" publicly available.
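摘要指出"向CNN输入图像-PRNU对并用基于相关的损失函数训练"效果最好;下面给出一个常见的归一化互相关(NCC)损失的PyTorch草图作为示意,网络结构与训练细节均为假设,并非该文的官方实现。

```python
import torch

def ncc_loss(pred_noise, ref_prnu, eps=1e-8):
    """最大化预测噪声残差与参考PRNU的归一化互相关,等价于最小化 1 - NCC。
    pred_noise, ref_prnu: (N, H, W) 或 (N, C, H, W)。"""
    p = pred_noise.flatten(1)
    r = ref_prnu.flatten(1)
    p = p - p.mean(dim=1, keepdim=True)       # 去均值
    r = r - r.mean(dim=1, keepdim=True)
    ncc = (p * r).sum(1) / (p.norm(dim=1) * r.norm(dim=1) + eps)
    return (1.0 - ncc).mean()

# 用法示例(denoiser 为任意输出噪声残差估计的CNN去噪器,此处仅作占位):
# loss = ncc_loss(denoiser(image), prnu_reference)
```

与逐像素的l1/l2损失相比,这种损失直接优化后续相机识别所依赖的相关性指标,这与摘要中"优化噪声质量而非图像质量"的观点一致。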
超分辨率|去噪|去模糊|去雾(1篇)
【1】 Deblurring via Stochastic Refinement 标题:基于随机细化的去模糊方法 链接:https://arxiv.org/abs/2112.02475
作者:Jay Whang,Mauricio Delbracio,Hossein Talebi,Chitwan Saharia,Alexandros G. Dimakis,Peyman Milanfar 机构:†University of Texas at Austin, ‡Google Research 摘要:图像去模糊是一个不适定问题,对于给定的输入图像存在多个合理的解。然而,大多数现有方法只给出干净图像的确定性估计,并以最小化像素级失真为目标进行训练。众所周知,这些指标与人类感知的相关性很差,通常会导致不切实际的重建。我们提出了一种基于条件扩散模型的盲去模糊替代框架。与现有技术不同,我们训练了一个随机采样器,该采样器细化确定性预测器的输出,并且能够为给定的输入生成一组多样而合理的重构。这使得在多个标准基准上,感知质量比现有的最先进方法有了显著提高。与典型的扩散模型相比,我们的预测-细化方法还可以实现高效得多的采样。结合精心调整的网络结构和推理过程,我们的方法在失真度量(如PSNR)方面具有竞争力。这些结果显示了我们基于扩散的去模糊方法的明显优势,并对广泛使用的生成单一确定性重建的策略提出了挑战。 摘要:Image deblurring is an ill-posed problem with multiple plausible solutions for a given input image. However, most existing methods produce a deterministic estimate of the clean image and are trained to minimize pixel-level distortion. These metrics are known to be poorly correlated with human perception, and often lead to unrealistic reconstructions. We present an alternative framework for blind deblurring based on conditional diffusion models. Unlike existing techniques, we train a stochastic sampler that refines the output of a deterministic predictor and is capable of producing a diverse set of plausible reconstructions for a given input. This leads to a significant improvement in perceptual quality over existing state-of-the-art methods across multiple standard benchmarks. Our predict-and-refine approach also enables much more efficient sampling compared to typical diffusion models. Combined with a carefully tuned network architecture and inference procedure, our method is competitive in terms of distortion metrics such as PSNR. These results show clear benefits of our diffusion-based method for deblurring and challenge the widely used strategy of producing a single, deterministic reconstruction.
点云|SLAM|雷达|激光|深度RGBD相关(2篇)
【1】 Real-time Registration and Reconstruction with Cylindrical LiDAR Images 标题:柱面激光雷达图像的实时配准与重建 链接:https://arxiv.org/abs/2112.02779
作者:Wei Dong,Kwonyoung Ryu,Michael Kaess,Jaesik Park 机构: 1 Wei Dong and Michael Kaess are with Carnegie Mellon University 备注:6 pages, 7 figures. This paper is under the review 摘要:旋转激光雷达数据普遍用于三维感知任务,但其圆柱形图像形式研究较少。传统方法将扫描视为点云,它们要么依赖昂贵的欧几里德3D最近邻搜索进行数据关联,要么依赖投影范围图像进行进一步处理。我们回顾了激光雷达扫描的形成,并提出了一种用于原始扫描数据的柱面距离图像表示方法,该方法配备了一个有效的校准球面投影模型。根据我们的公式,我们1)收集了一个由室内和室外序列以及伪地面真实姿态组成的大型激光雷达数据集;2) 对具有合成变换和真实变换的序列的投影和常规配准方法进行评估;3) 将最先进的RGB-D算法传输到激光雷达,该激光雷达可高达180 Hz用于配准,150 Hz用于密集重建。数据集和工具将发布。 摘要:Spinning LiDAR data are prevalent for 3D perception tasks, yet its cylindrical image form is less studied. Conventional approaches regard scans as point clouds, and they either rely on expensive Euclidean 3D nearest neighbor search for data association or depend on projected range images for further processing. We revisit the LiDAR scan formation and present a cylindrical range image representation for data from raw scans, equipped with an efficient calibrated spherical projective model. With our formulation, we 1) collect a large dataset of LiDAR data consisting of both indoor and outdoor sequences accompanied with pseudo-ground truth poses; 2) evaluate the projective and conventional registration approaches on the sequences with both synthetic and real-world transformations; 3) transfer state-of-the-art RGB-D algorithms to LiDAR that runs up to 180 Hz for registration and 150 Hz for dense reconstruction. The dataset and tools will be released.
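下面用numpy给出一个常见的球面/柱面投影草图,演示如何把一帧旋转激光雷达点云映射为距离图像(列对应方位角、行对应俯仰角);其中垂直视场 fov_up/fov_down 与图像尺寸均为假设值(接近常见64线激光雷达的配置),并非该论文公开的标定投影模型。

```python
import numpy as np

def lidar_to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """把点云 (N,3) 投影为 (h,w) 的距离图像。"""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                       # 方位角,范围 [-pi, pi]
    pitch = np.arcsin(z / r)                     # 俯仰角
    u = ((1.0 - (yaw + np.pi) / (2 * np.pi)) * w).astype(int) % w
    v = np.clip((fov_up - pitch) / (fov_up - fov_down) * h, 0, h - 1).astype(int)
    range_img = np.zeros((h, w), dtype=np.float32)
    range_img[v, u] = r                          # 落入同一像素时由后写入的点覆盖(简化处理)
    return range_img
```

得到的距离图像可直接套用图像域的邻域查找与RGB-D式配准/重建流程,避免了摘要中提到的昂贵的欧氏3D最近邻搜索。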
【2】 PointCLIP: Point Cloud Understanding by CLIP 标题:PointCLIP:通过剪辑了解点云 链接:https://arxiv.org/abs/2112.02413
作者:Renrui Zhang,Ziyu Guo,Wei Zhang,Kunchang Li,Xupeng Miao,Bin Cui,Yu Qiao,Peng Gao,Hongsheng Li 机构:Shanghai AI Laboratory, Peking University, The Chinese University of Hong Kong 备注:Open sourced, Code and Model Available 摘要:最近,基于对比视觉语言预训练(CLIP)的零样本与少样本学习在2D视觉识别上表现出鼓舞人心的性能,它在开放词汇环境下学习将图像与其对应的文本相匹配。然而,由大规模二维图像-文本对预训练的CLIP是否可以推广到三维识别,这一问题仍有待研究。在本文中,我们通过提出PointCLIP来确定这种设置是可行的,PointCLIP在CLIP编码的点云和3D类别文本之间进行对齐。具体地说,我们通过将点云投影到多视图深度图中而不进行渲染来对其进行编码,并聚合各视图的零样本预测以实现从二维到三维的知识迁移。在此基础上,我们设计了一个视图间适配器,以更好地提取全局特征,并将从三维中学到的少样本知识自适应地融合到在二维上预训练的CLIP中。只需在少样本设置下微调轻量级适配器,PointCLIP的性能就可以大大提高。此外,我们还观察了PointCLIP和经典3D监督网络之间的互补性。通过简单的集成,PointCLIP提高了基线的性能,甚至超过了最先进的模型。因此,PointCLIP是一种在低资源成本和低数据量条件下通过CLIP进行有效三维点云理解的有希望的替代方案。我们在广泛采用的ModelNet10、ModelNet40和具有挑战性的ScanObjectNN上进行了彻底的实验,以证明PointCLIP的有效性。该代码发布于https://github.com/ZrrSkywalker/PointCLIP. 摘要:Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) have shown inspirational performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains under explored that whether CLIP, pre-trained by large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we identify such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point cloud and 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot prediction to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D. By just fine-tuning the lightweight adapter in the few-shot settings, the performance of PointCLIP could be largely improved. In addition, we observe the complementary property between PointCLIP and classical 3D-supervised networks. By simple ensembling, PointCLIP boosts baseline's performance and even surpasses state-of-the-art models. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and data regime. We conduct thorough experiments on widely-adopted ModelNet10, ModelNet40 and the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.
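下面的Python草图仅示意"把点云投影为多视角深度图、再对各视角的零样本logits取平均"这一流程;其中 encode_image 代表任意CLIP风格的图像编码接口,属于假设的占位函数,正交投影方式、分辨率与视角数也只是示例,并非该文的官方实现。

```python
import numpy as np

def point_cloud_to_depth_map(points, resolution=224):
    """把已旋转到某一视角的点云 (N,3) 正交投影成简易深度图(无渲染)。"""
    xy, z = points[:, :2], points[:, 2]
    uv = ((xy - xy.min(0)) / (xy.ptp(0) + 1e-8) * (resolution - 1)).astype(int)
    depth = np.zeros((resolution, resolution), dtype=np.float32)
    for (u, v), d in zip(uv, z):
        depth[v, u] = max(depth[v, u], d)   # 每个像素保留最靠近视角一侧的点
    return depth

def zero_shot_logits(points, view_rotations, encode_image, text_features):
    """对每个视角的深度图做零样本分类,再对各视角logits取平均(示意)。"""
    logits = []
    for R in view_rotations:                      # R: (3,3) 视角旋转矩阵
        depth = point_cloud_to_depth_map(points @ R.T)
        img_feat = encode_image(depth)            # 假设的CLIP图像编码接口
        img_feat = img_feat / (np.linalg.norm(img_feat) + 1e-8)
        logits.append(img_feat @ text_features.T) # text_features 为已归一化的类别文本特征
    return np.mean(logits, axis=0)
```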
多模态(1篇)
【1】 Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction 标题:用于多模态多任务密集图像预测的信道交换网络 链接:https://arxiv.org/abs/2112.02252
作者:Yikai Wang,Wenbing Huang,Fuchun Sun,Fengxiang He,Dacheng Tao 机构: Sun are with the Department of ComputerScience and Technology, Tsinghua University 备注:18 pages. arXiv admin note: substantial text overlap with arXiv:2011.05005 摘要:多模态融合和多任务学习是机器学习中的两个重要课题。尽管取得了卓有成效的进展,但针对这两个问题的现有方法仍然难以应对同样的挑战——在整合不同模式(分别任务)的公共信息的同时保留每个模式(分别任务)的特定模式仍然是一个难题。此外,尽管多模态融合和多任务学习实际上彼此密切相关,但以前很少在同一方法框架内进行研究。在本文中,我们提出了信道交换网络(CEN),它是自适应的,无参数的,更重要的是,适用于多模融合和多任务学习。在其核心,CEN在不同模式的子网之间动态交换信道。具体地说,通道交换过程是由单个通道重要性自我引导的,该重要性通过训练期间的批量归一化(BN)比例因子的大小来测量。对于密集图像预测的应用,CEN的有效性通过四种不同的场景进行了测试:多模态融合、循环多模态融合、多任务学习和多模态多任务学习。通过RGB-D数据进行语义分割和通过多域输入进行图像翻译的大量实验验证了我们的CEN与当前最先进的方法相比的有效性。还进行了详细的烧蚀研究,证实了我们提出的每个组件的优势。 摘要:Multimodal fusion and multitask learning are two vital topics in machine learning. Despite the fruitful progress, existing methods for both problems are still brittle to the same challenge -- it remains dilemmatic to integrate the common information across modalities (resp. tasks) meanwhile preserving the specific patterns of each modality (resp. task). Besides, while they are actually closely related to each other, multimodal fusion and multitask learning are rarely explored within the same methodological framework before. In this paper, we propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning. At its core, CEN dynamically exchanges channels between subnetworks of different modalities. Specifically, the channel exchanging process is self-guided by individual channel importance that is measured by the magnitude of Batch-Normalization (BN) scaling factor during training. For the application of dense image prediction, the validity of CEN is tested by four different scenarios: multimodal fusion, cycle multimodal fusion, multitask learning, and multimodal multitask learning. Extensive experiments on semantic segmentation via RGB-D data and image translation through multi-domain input verify the effectiveness of our CEN compared to current state-of-the-art methods. Detailed ablation studies have also been carried out, which provably affirm the advantage of each component we propose.
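下面是一个极简的PyTorch示意,按照摘要中"以BN缩放因子幅值作为通道重要性、在不同模态子网之间交换通道"的思想写成;阈值大小、仅两个模态且结构完全对称等设定均为摘要之外的假设,仅用于说明机制。

```python
import torch
import torch.nn as nn

def exchange_channels(feat_a, feat_b, bn_a, bn_b, threshold=1e-2):
    """若某模态某通道的BN缩放因子幅值低于阈值,则用另一模态的对应通道替换它。
    feat_*: (N, C, H, W);bn_*: 该层对应的 nn.BatchNorm2d。"""
    gamma_a = bn_a.weight.detach().abs()               # (C,) 通道重要性
    gamma_b = bn_b.weight.detach().abs()
    mask_a = (gamma_a < threshold).view(1, -1, 1, 1)   # 模态A中"不重要"的通道
    mask_b = (gamma_b < threshold).view(1, -1, 1, 1)
    out_a = torch.where(mask_a, feat_b, feat_a)        # 被替换为模态B的对应通道
    out_b = torch.where(mask_b, feat_a, feat_b)
    return out_a, out_b

# 用法示例:两个模态子网在某一层交换通道
bn_rgb, bn_depth = nn.BatchNorm2d(64), nn.BatchNorm2d(64)
f_rgb, f_depth = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
f_rgb, f_depth = exchange_channels(bn_rgb(f_rgb), bn_depth(f_depth), bn_rgb, bn_depth)
```

这种交换不引入额外参数,由训练中BN缩放因子的大小自行引导,与摘要中"自适应、无参数"的描述相对应。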
3D|3D重建等相关(3篇)
【1】 Input-level Inductive Biases for 3D Reconstruction 标题:用于三维重建的输入级感应偏差 链接:https://arxiv.org/abs/2112.03243
作者:Wang Yifan,Carl Doersch,Relja Arandjelović,João Carreira,Andrew Zisserman 机构:ETH Zurich, DeepMind, VGG, Department of Engineering Science, University of Oxford 摘要:我们使用通用感知模型(最近的Perceiver IO)探索3D重建,该模型接收由无序、展平的输入组成的矩阵(例如像素)。该模型使用查询矩阵进行查询,并为每个查询生成一个输出——在本文中,输出是输入图像对中所有像素的深度值。我们将对多视图几何有用的归纳偏差并入这个通用模型,而无需改动其架构,而是直接将它们编码为附加输入。 摘要:We explore 3D reconstruction using a generalist perception model, the recent Perceiver IO which ingests a matrix of unordered and flattened inputs (e.g. pixels). The model is interrogated using a query matrix and generates an output for every query -- in this paper the outputs are depth values for all pixels of the input image pair. We incorporate inductive biases useful for multiple view geometry into this generalist model without having to touch its architecture, by instead encoding them directly as additional inputs.
【2】 4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding 标题:4D对比:三维场景理解的动态对应对比学习 链接:https://arxiv.org/abs/2112.02990
作者:Yujin Chen,Matthias Nießner,Angela Dai 机构:Technical University of Munich 备注:Video: this https URL 摘要:我们提出了一种新的方法,通过无监督的预训练将4D动态对象先验知识灌输到学习的3D表示中。我们观察到,对象在环境中的动态移动提供了有关其对象性的重要线索,因此建议将学习到的三维表示与这种动态理解结合起来,然后可以有效地将这种动态理解转化为下游三维语义场景理解任务中的改进性能。我们提出了一种新的数据增强方案,利用在静态3D环境中移动的合成3D形状,并采用3D-4D约束下的对比学习,将4D不变性编码到学习的3D表示中。实验表明,我们的无监督表示学习可以改善下游3D语义分割、对象检测和实例分割任务,并且可以显著提高数据稀缺场景下的性能。 摘要:We present a new approach to instill 4D dynamic object priors into learned 3D representations by unsupervised pre-training. We observe that dynamic movement of an object through an environment provides important cues about its objectness, and thus propose to imbue learned 3D representations with such dynamic understanding, that can then be effectively transferred to improved performance in downstream 3D semantic scene understanding tasks. We propose a new data augmentation scheme leveraging synthetic 3D shapes moving in static 3D environments, and employ contrastive learning under 3D-4D constraints that encode 4D invariances into the learned 3D representations. Experiments demonstrate that our unsupervised representation learning results in improvement in downstream 3D semantic segmentation, object detection, and instance segmentation tasks, and moreover, notably improves performance in data-scarce scenarios.
【3】 Fast 3D registration with accurate optimisation and little learning for Learn2Reg 2021 标题:Learn2Reg 2021快速3D配准,精确优化,几乎不需要学习 链接:https://arxiv.org/abs/2112.03053
作者:Hanna Siebert,Lasse Hansen,Mattias P. Heinrich 机构:Institute of Medical Informatics, Universit¨at zu L¨ubeck, Germany 摘要:当前的可变形医学图像配准方法往往难以满足以下所有标准:通用性、计算或训练时间小以及能够估计大变形。此外,用于监督注册训练的端到端网络往往变得过于复杂,难以训练。对于Learn2Reg2021挑战,我们的目标是通过解耦特征学习和几何对齐来解决这些问题。首先,我们介绍了一种新的非常快速和准确的优化方法。通过使用离散位移和耦合凸优化程序,我们能够稳健地处理大变形。借助基于Adam的实例优化,我们实现了非常精确的配准性能,并且通过使用正则化,我们获得了平滑和合理的变形场。其次,为了适应不同的注册任务,我们提取了形态和对比度不变的手工特征,并用特定于任务的分段U-Net中的语义特征对其进行补充。凭借我们的成绩,我们获得了Learn2Reg2021挑战赛的第二名,赢得了任务1,并在其他两项任务中分别获得了第二名和第三名。 摘要:Current approaches for deformable medical image registration often struggle to fulfill all of the following criteria: versatile applicability, small computation or training times, and the being able to estimate large deformations. Furthermore, end-to-end networks for supervised training of registration often become overly complex and difficult to train. For the Learn2Reg2021 challenge, we aim to address these issues by decoupling feature learning and geometric alignment. First, we introduce a new very fast and accurate optimisation method. By using discretised displacements and a coupled convex optimisation procedure, we are able to robustly cope with large deformations. With the help of an Adam-based instance optimisation, we achieve very accurate registration performances and by using regularisation, we obtain smooth and plausible deformation fields. Second, to be versatile for different registration tasks, we extract hand-crafted features that are modality and contrast invariant and complement them with semantic features from a task-specific segmentation U-Net. With our results we were able to achieve the overall Learn2Reg2021 challenge's second place, winning Task 1 and being second and third in the other two tasks.
其他神经网络|深度学习|模型|建模(12篇)
【1】 Functional Regularization for Reinforcement Learning via Learned Fourier Features 标题:基于学习傅立叶特征的强化学习函数正则化 链接:https://arxiv.org/abs/2112.03257
作者:Alexander C. Li,Deepak Pathak 机构:Carnegie Mellon University 备注:Accepted at NeurIPS 2021. Website at this https URL 摘要:我们提出了一种简单的深度强化学习体系结构,通过将输入嵌入到学习的傅里叶基中,并表明它提高了基于状态和基于图像的RL的采样效率。我们使用神经切线核对我们的体系结构进行了无限宽度分析,并从理论上表明,调整傅里叶基的初始方差相当于对学习的深层网络进行函数正则化。也就是说,这些学习到的傅立叶特征允许调整网络在训练数据中的欠拟合或过拟合不同频率的程度,并因此提供一种受控机制来改进RL优化的稳定性和性能。从经验上讲,这使我们能够优先学习低频函数,并通过在优化过程中降低网络对噪声的敏感性来加快学习速度,例如在贝尔曼更新期间。在标准的基于状态和基于图像的RL基准测试上的实验表明,我们的体系结构明显优于基线。网址:https://alexanderli.com/learned-fourier-features 摘要:We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis and show that it improves the sample efficiency of both state-based and image-based RL. We perform infinite-width analysis of our architecture using the Neural Tangent Kernel and theoretically show that tuning the initial variance of the Fourier basis is equivalent to functional regularization of the learned deep network. That is, these learned Fourier features allow for adjusting the degree to which networks underfit or overfit different frequencies in the training data, and hence provide a controlled mechanism to improve the stability and performance of RL optimization. Empirically, this allows us to prioritize learning low-frequency functions and speed up learning by reducing networks' susceptibility to noise in the optimization process, such as during Bellman updates. Experiments on standard state-based and image-based RL benchmarks show clear benefits of our architecture over the baselines. Website at https://alexanderli.com/learned-fourier-features
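下面给出一个仅作示意的极简PyTorch草图,展示"把输入嵌入到可学习的傅里叶基"这一思路;特征维度、初始方差 sigma 等均为假设的示例参数,并非论文的官方实现。

```python
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    """把低维状态 x 映射到 [cos(Bx), sin(Bx)],其中 B 为可学习矩阵。"""
    def __init__(self, in_dim, num_features, sigma=1.0):
        super().__init__()
        # 初始方差 sigma 控制偏向低频还是高频,对应摘要中"函数正则化"的调节旋钮
        self.B = nn.Parameter(torch.randn(num_features, in_dim) * sigma)

    def forward(self, x):
        proj = 2 * torch.pi * x @ self.B.t()        # (batch, num_features)
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

# 用法示例:将嵌入后的特征接入普通的策略/价值网络
encoder = LearnedFourierFeatures(in_dim=8, num_features=64, sigma=0.5)
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
state = torch.randn(32, 8)
value = mlp(encoder(state))
```

取较小的 sigma 会使网络倾向于先拟合低频函数,从而如摘要所述降低贝尔曼更新等过程中对噪声的敏感性。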
【2】 CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks 标题:CALVIN:一种用于长视距机器人操作任务的语言条件策略学习基准 链接:https://arxiv.org/abs/2112.03227
作者:Oier Mees,Lukas Hermann,Erick Rosete-Beas,Wolfram Burgard 机构: All authors are with the University of Freiburg 备注:this http URL 摘要:在其环境中与人类共存的通用机器人必须学会将人类语言与其感知和行动联系起来,以便在一系列日常任务中发挥作用。此外,他们还需要掌握多种通用技能,以便通过遵循无约束的语言指令来组合完成长时程任务。在本文中,我们介绍了CALVIN(从语言和视觉合成动作),这是一个开源的模拟基准,用于学习长时程的语言条件任务。我们的目标是开发能够解决许多机器人操作任务的代理,这些任务可以通过车载传感器完成,并且只能通过人类语言指定。CALVIN任务在序列长度、动作空间和语言方面比现有的视觉和语言任务数据集更复杂,并且支持传感器套件的灵活规范。我们在零样本设置下评估代理对新语言指令以及新环境和新物体的泛化能力。我们发现,基于多语境模仿学习的基线模型在CALVIN上表现不佳,这表明在该基准上开发能够将人类语言与其世界模型关联起来的创新代理仍有很大空间。 摘要:General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets and supports flexible specification of sensor suites. We evaluate the agents in zero-shot to novel language instructions and to novel environments and objects. We find that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.
【3】 Temporal-Spatial Causal Interpretations for Vision-Based Reinforcement Learning 标题:基于视觉强化学习的时空因果解释 链接:https://arxiv.org/abs/2112.03020
作者:Wenjie Shi,Gao Huang,Shiji Song,Cheng Wu 备注:Accepted as a Regular Paper in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 摘要:深度强化学习(RL)代理在一系列复杂的控制任务中变得越来越熟练。然而,由于黑盒函数的引入,代理的行为往往难以解释,难以获得用户的信任。尽管有一些有趣的基于视觉的RL解释方法,但大多数方法都无法揭示时间因果信息,从而对其可靠性提出了质疑。为了解决这个问题,我们提出了一个时空因果解释(TSCI)模型来理解代理的长期行为,这对于顺序决策至关重要。TSCI模型建立在时间因果关系的表述上,它反映了连续观测和RL代理决策之间的时间因果关系。然后,采用一个独立的因果发现网络来识别时空因果特征,并对其进行约束以满足时间因果关系。TSCI模型适用于循环(recurrent)代理,训练一次后即可高效地发现因果特征。实验结果表明,TSCI模型可以产生高分辨率且清晰的注意力掩码,突出与任务相关的时空信息,这些信息构成了基于视觉的RL代理如何做出顺序决策的大部分证据。此外,我们进一步证明,我们的方法能够从时间角度为基于视觉的RL代理提供有价值的因果解释。 摘要:Deep reinforcement learning (RL) agents are becoming increasingly proficient in a range of complex control tasks. However, the agent's behavior is usually difficult to interpret due to the introduction of black-box function, making it difficult to acquire the trust of users. Although there have been some interesting interpretation methods for vision-based RL, most of them cannot uncover temporal causal information, raising questions about their reliability. To address this problem, we present a temporal-spatial causal interpretation (TSCI) model to understand the agent's long-term behavior, which is essential for sequential decision-making. TSCI model builds on the formulation of temporal causality, which reflects the temporal causal relations between sequential observations and decisions of RL agent. Then a separate causal discovery network is employed to identify temporal-spatial causal features, which are constrained to satisfy the temporal causality. TSCI model is applicable to recurrent agents and can be used to discover causal features with high efficiency once trained. The empirical results show that TSCI model can produce high-resolution and sharp attention masks to highlight task-relevant temporal-spatial information that constitutes most evidence about how vision-based RL agents make sequential decisions. In addition, we further demonstrate that our method is able to provide valuable causal interpretations for vision-based RL agents from the temporal perspective.
【4】 Adjusting the Ground Truth Annotations for Connectivity-Based Learning to Delineate 标题:调整基于连通性学习的基本事实注释以进行描绘 链接:https://arxiv.org/abs/2112.02781
作者:Doruk Oner,Leonardo Citraro,Mateusz Koziński,Pascal Fua 机构: School of Computerand Communication Sciences 摘要:基于深度学习的三维结构描述方法依赖于精确的注释来训练网络。然而,在实践中,人们无论多么认真,都难以在3D和大规模上精确描绘,部分原因是数据往往难以直观解释,部分原因是3D界面难以使用。在本文中,我们介绍了一种方法,明确解释注释不准确。为此,我们将注释视为活动轮廓模型,可以在保持拓扑的同时使其自身变形。这使我们能够联合训练网络并纠正原始注释中的潜在错误。其结果是一种提高深层网络性能的方法,该网络使用可能不准确的注释进行训练。 摘要:Deep learning-based approaches to delineating 3D structure depend on accurate annotations to train the networks. Yet, in practice, people, no matter how conscientious, have trouble precisely delineating in 3D and on a large scale, in part because the data is often hard to interpret visually and in part because the 3D interfaces are awkward to use. In this paper, we introduce a method that explicitly accounts for annotation inaccuracies. To this end, we treat the annotations as active contour models that can deform themselves while preserving their topology. This enables us to jointly train the network and correct potential errors in the original annotations. The result is an approach that boosts performance of deep networks trained with potentially inaccurate annotations.
【5】 A Survey on Deep learning based Document Image Enhancement 标题:基于深度学习的文档图像增强研究综述 链接:https://arxiv.org/abs/2112.02719
作者:Zahra Anvari,Vassilis Athitsos 机构:Department of Computer Science and Engineering, University of Texas Arlington, Arlington, TX 摘要:如今,诸如科学文章、税务表格、发票、合同文件和历史文本等数字化文档被广泛使用。由于各种原因,这些图像可能会退化或损坏,包括拍摄图像时的照明条件差、扫描图像时的阴影、噪声和模糊等失真、老化、墨迹、渗透、水印、印章等。文档图像增强和恢复在许多自动文档分析和识别任务中起着至关重要的作用,例如使用光学字符识别(OCR)进行内容提取。随着深度学习的发展,人们提出了许多方法来提高这些文档图像的质量。在本文中,我们回顾了针对不同文档图像增强问题的基于深度学习的方法、数据集和度量。我们提供了六种不同文档图像增强任务的基于深度学习的方法的全面概述,包括二值化、去模糊、去噪、去褪色、水印去除和阴影去除。我们总结了每项任务的主要最新工作,并讨论了它们的特点、挑战和局限性。我们还介绍了多个此前几乎没有受到关注的文档图像增强任务,包括过曝与欠曝校正以及渗透(bleed-through)去除,并确定了其他几个有希望的研究方向和未来研究的机会。 摘要:Digitized documents such as scientific articles, tax forms, invoices, contract papers, and historic texts, are widely used nowadays. These images could be degraded or damaged due to various reasons including poor lighting conditions when capturing the image, shadow while scanning them, distortion like noise and blur, aging, ink stain, bleed through, watermark, stamp, etc. Document image enhancement and restoration play a crucial role in many automated document analysis and recognition tasks, such as content extraction using optical character recognition (OCR). With recent advances in deep learning, many methods are proposed to enhance the quality of these document images. In this paper, we review deep learning-based methods, datasets, and metrics for different document image enhancement problems. We provide a comprehensive overview of deep learning-based methods for six different document image enhancement tasks, including binarization, debluring, denoising, defading, watermark removal, and shadow removal. We summarize the main state-of-the-art works for each task and discuss their features, challenges, and limitations. We introduce multiple document image enhancement tasks that have received no to little attention, including over and under exposure correction and bleed-through removal, and identify several other promising research directions and opportunities for future research.
【6】 Learning Query Expansion over the Nearest Neighbor Graph 标题:学习最近邻图上的查询扩展 链接:https://arxiv.org/abs/2112.02666
作者:Benjamin Klein,Lior Wolf 机构:The Blavatnik School of Computer, Science, Tel Aviv University, Israel 备注:BMVC 2021 摘要:查询扩展(QE)是一种成熟的方法,用于改进图像搜索应用中的检索度量。当使用QE时,搜索是在一个新的查询向量上进行的,该查询向量是使用对查询和数据库中的图像的聚合函数构造的。最近的工作产生了QE技术,在QE技术中学习聚合函数,而以前的技术是基于手工制作的聚合函数,例如,获取查询最近邻的平均值。然而,大多数QE方法都关注于直接作用于查询及其最近邻的聚合函数。在这项工作中,提出了一种分层模型,即图查询扩展(GQE),该模型以有监督的方式学习,并在查询的扩展邻域上执行聚合,从而在计算查询扩展时增加从数据库使用的信息,并使用最近邻图的结构。与已知基准相比,该技术实现了最先进的结果。 摘要:Query Expansion (QE) is a well established method for improving retrieval metrics in image search applications. When using QE, the search is conducted on a new query vector, constructed using an aggregation function over the query and images from the database. Recent works gave rise to QE techniques in which the aggregation function is learned, whereas previous techniques were based on hand-crafted aggregation functions, e.g., taking the mean of the query's nearest neighbors. However, most QE methods have focused on aggregation functions that work directly over the query and its immediate nearest neighbors. In this work, a hierarchical model, Graph Query Expansion (GQE), is presented, which is learned in a supervised manner and performs aggregation over an extended neighborhood of the query, thus increasing the information used from the database when computing the query expansion, and using the structure of the nearest neighbors graph. The technique achieves state-of-the-art results over known benchmarks.
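作为对照,下面给出经典的、非学习式的查询扩展草图:取查询与其最近邻的加权平均作为新查询向量;GQE则是以有监督方式在最近邻图的更大邻域上学习这种聚合,此处代码只展示被其改进的传统基线思路,幂指数 alpha、近邻数 k 等均为假设参数。

```python
import numpy as np

def alpha_query_expansion(query, database, k=10, alpha=3.0):
    """query: (d,) 已L2归一化;database: (n, d) 已L2归一化。
    返回由查询与其 top-k 近邻加权平均得到的新查询向量。"""
    sims = database @ query                       # 余弦相似度
    topk = np.argsort(-sims)[:k]
    weights = np.clip(sims[topk], 0, None) ** alpha
    expanded = query + (weights[:, None] * database[topk]).sum(0)
    return expanded / (np.linalg.norm(expanded) + 1e-12)

# 用法示例:用扩展后的查询重新检索
d, n = 128, 1000
db = np.random.randn(n, d); db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[0] + 0.1 * np.random.randn(d); q /= np.linalg.norm(q)
new_q = alpha_query_expansion(q, db)
ranking = np.argsort(-(db @ new_q))
```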
【7】 Next Day Wildfire Spread: A Machine Learning Data Set to Predict Wildfire Spreading from Remote-Sensing Data 标题:第二天野火蔓延:从遥感数据预测野火蔓延的机器学习数据集 链接:https://arxiv.org/abs/2112.02447
作者:Fantine Huot,R. Lily Hu,Nita Goyal,Tharun Sankar,Matthias Ihme,Yi-Fan Chen 备注:submitted to IEEE Transactions on Geoscience and Remote Sensing 摘要:预测野火蔓延对土地管理和防灾至关重要。为此,我们展示了“第二天野火蔓延”,这是一组精心策划的大规模多变量历史野火数据集,汇集了美国近十年的遥感数据。与基于地球观测卫星的现有火灾数据集不同,我们的数据集将二维火灾数据与多个解释变量(例如地形、植被、天气、干旱指数、人口密度)结合在一起,在二维区域上对齐,为机器学习提供了一个功能丰富的数据集。为了证明该数据集的有用性,我们实现了一个卷积自动编码器,该编码器利用该数据的空间信息来预测野火蔓延。我们比较了神经网络与其他机器学习模型的性能:logistic回归和随机森林。该数据集可作为基准,用于开发基于遥感数据的野火传播模型,提前期为一天。 摘要:Predicting wildfire spread is critical for land management and disaster preparedness. To this end, we present `Next Day Wildfire Spread,' a curated, large-scale, multivariate data set of historical wildfires aggregating nearly a decade of remote-sensing data across the United States. In contrast to existing fire data sets based on Earth observation satellites, our data set combines 2D fire data with multiple explanatory variables (e.g., topography, vegetation, weather, drought index, population density) aligned over 2D regions, providing a feature-rich data set for machine learning. To demonstrate the usefulness of this data set, we implement a convolutional autoencoder that takes advantage of the spatial information of this data to predict wildfire spread. We compare the performance of the neural network with other machine learning models: logistic regression and random forest. This data set can be used as a benchmark for developing wildfire propagation models based on remote sensing data for a lead time of one day.
【8】 VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts 标题:VT-CLIP:用视觉引导文本增强视觉语言模型 链接:https://arxiv.org/abs/2112.02399
作者:Renrui Zhang,Longtian Qiu,Wei Zhang,Ziyao Zeng 机构:Shanghai AI Laboratory, ShanghaiTech University 摘要:对比视觉语言预训练(CLIP)因其可转移的视觉表征学习而受到越来越多的关注。在大规模图像-文本对的监督下,CLIP能够对齐成对的图像和文本,从而在开放词汇场景中进行Zero-Shot识别。然而,在特定的应用程序和一般预先训练的知识之间存在语义鸿沟,这使得下游任务的匹配次优。在本文中,我们提出VT-CLIP通过视觉引导文本增强视觉语言建模。具体来说,我们引导文本特征自适应地探索图像上的信息区域,并通过交叉注意机制聚合视觉特征。通过这种方式,视觉引导文本在语义上与图像更加相关,这大大有利于匹配过程。在少数情况下,我们在11个著名的分类数据集上评估我们的VT-CLIP,并进行广泛的消融研究,以证明VT-CLIP的有效性。代码将很快发布。 摘要:Contrastive Vision-Language Pre-training (CLIP) has drown increasing attention recently for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists semantic gap between the specific application and generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text feature to adaptively explore informative regions on the image and aggregate the visual feature by cross-attention machanism. In this way, the visual-guided text become more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets and experiment extensive ablation studies to demonstrate the effectiveness of VT-CLIP. The code will be released soon.
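下面用PyTorch的多头注意力给出"以文本特征为query、图像区域特征为key/value做交叉注意力"的极简草图,体现文本被引导去关注图像中的相关区域;特征维度、头数与区域数均为假设示例,并非该文的官方实现。

```python
import torch
import torch.nn as nn

dim, num_patches, num_classes = 512, 49, 11            # 假设的示例维度
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

text_feat = torch.randn(1, num_classes, dim)            # 各类别提示词的文本特征
patch_feat = torch.randn(1, num_patches, dim)            # 图像各区域(patch)特征
guided_text, attn_weights = cross_attn(query=text_feat, key=patch_feat, value=patch_feat)

# 用视觉引导后的文本特征与全局图像特征做余弦匹配(示意)
img_feat = patch_feat.mean(dim=1)                        # (1, dim) 全局图像特征
logits = torch.nn.functional.cosine_similarity(
    guided_text, img_feat.unsqueeze(1), dim=-1)           # (1, num_classes)
```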
【9】 Interactive Disentanglement: Learning Concepts by Interacting with their Prototype Representations 标题:交互式解缠:通过与原型表示交互来学习概念 链接:https://arxiv.org/abs/2112.02290
作者:Wolfgang Stammer,Marius Memmel,Patrick Schramowski,Kristian Kersting 机构:Technical University of Darmstadt, Computer Science Department, Technical University of Darmstadt, Centre for Cognitive Science 摘要:在没有强有力监督的情况下,从原始图像中学习视觉概念是一项具有挑战性的任务。在这项工作中,我们展示了原型表征在理解和修正神经概念学习者的潜在空间方面的优势。为此,我们引入了交互式概念交换网络(iCSNs),这是一种通过弱监督和隐式原型表征学习概念基础表征的新框架。ICSN通过交换成对图像的潜在表示,学习将概念信息绑定到特定的原型槽。这种语义基础和离散的潜在空间有助于人类理解和人机交互。我们通过对我们的新数据集“基本概念推理”(ECR)进行实验来支持这一说法,重点是几何对象共享的视觉概念。 摘要:Learning visual concepts from raw images without strong supervision is a challenging task. In this work, we show the advantages of prototype representations for understanding and revising the latent space of neural concept learners. For this purpose, we introduce interactive Concept Swapping Networks (iCSNs), a novel framework for learning concept-grounded representations via weak supervision and implicit prototype representations. iCSNs learn to bind conceptual information to specific prototype slots by swapping the latent representations of paired images. This semantically grounded and discrete latent space facilitates human understanding and human-machine interaction. We support this claim by conducting experiments on our novel data set "Elementary Concept Reasoning" (ECR), focusing on visual concepts shared by geometric objects.
【10】 Dual-Flow Transformation Network for Deformable Image Registration with Region Consistency Constraint 标题:具有区域一致性约束的双流变换网络可变形图像配准 链接:https://arxiv.org/abs/2112.02249
作者:Xinke Ma,Yibo Yang,Yong Xia,Dacheng Tao 机构:School of Computer Science and Engineering, Northwestern Polytechnical University, China 摘要:可变形图像配准能够实现快速、准确的配准,在医学图像研究中占有重要地位。当前基于深度学习(DL)的图像配准方法通过卷积神经网络直接学习从一幅图像到另一幅图像的空间变换,需要基础真值或相似性度量。然而,这些方法仅使用全局相似性能量函数来评估一对图像的相似性,而忽略了图像中感兴趣区域(ROI)的相似性。此外,基于DL的方法通常直接估计图像的全局空间变换,而从不关注图像中ROI的区域空间变换。在本文中,我们提出了一种具有区域一致性约束的双流变换网络,该网络可以最大化一对图像中roi的相似性,同时估计全局和区域空间变换。在4个公开的3D-MRI数据集上的实验表明,与其他先进的配准方法相比,该方法在精度和泛化性方面都达到了最佳的配准性能。 摘要:Deformable image registration is able to achieve fast and accurate alignment between a pair of images and thus plays an important role in many medical image studies. The current deep learning (DL)-based image registration approaches directly learn the spatial transformation from one image to another by leveraging a convolutional neural network, requiring ground truth or similarity metric. Nevertheless, these methods only use a global similarity energy function to evaluate the similarity of a pair of images, which ignores the similarity of regions of interest (ROIs) within images. Moreover, DL-based methods often estimate global spatial transformations of image directly, which never pays attention to region spatial transformations of ROIs within images. In this paper, we present a novel dual-flow transformation network with region consistency constraint which maximizes the similarity of ROIs within a pair of images and estimates both global and region spatial transformations simultaneously. Experiments on four public 3D MRI datasets show that the proposed method achieves the best registration performance in accuracy and generalization compared with other state-of-the-art methods.
【11】 A Triple-Double Convolutional Neural Network for Panchromatic Sharpening 标题:一种用于全色锐化的三重双卷积神经网络 链接:https://arxiv.org/abs/2112.02237
作者:Tian-Jing Zhang,Liang-Jian Deng,Ting-Zhu Huang,Jocelyn Chanussot,Gemine Vivone 机构:Yingcai Honors College, University of Electronic Science and Technology of China, Chengdu, China, School of Mathematical Sciences, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France 摘要:泛锐化是指将具有高空间分辨率的全色图像与具有低空间分辨率的多光谱图像进行融合,以获得高空间分辨率的多光谱图像。在本文中,我们提出了一种新的基于层级域损失函数的泛锐化深度神经网络结构,该结构考虑了以下三种双重结构,即双层、双分支和双向,称为三重双网络(TDNet)。利用TDNet的结构,可以充分利用全色图像的空间细节,逐步注入低空间分辨率的多光谱图像,从而获得高空间分辨率的输出。具体的网络设计是由传统多分辨率分析(MRA)方法的物理公式驱动的。因此,有效的MRA融合模块也集成到TDNet中。此外,我们还采用了几个ResNet块和一些多尺度卷积核来加深和加宽网络,从而有效地提高了所提出的TDNet的特征提取和鲁棒性。对WorldView-3、QuickBird和GaoFen-2传感器获取的降低分辨率和全分辨率数据集进行的大量实验表明,与一些最新的泛锐化方法相比,所提出的TDNet具有优越性。消融研究也证实了所提出方法的有效性。 摘要:Pansharpening refers to the fusion of a panchromatic image with a high spatial resolution and a multispectral image with a low spatial resolution, aiming to obtain a high spatial resolution multispectral image. In this paper, we propose a novel deep neural network architecture with level-domain based loss function for pansharpening by taking into account the following double-type structures, i.e., double-level, double-branch, and double-direction, called the triple-double network (TDNet). By using the structure of TDNet, the spatial details of the panchromatic image can be fully exploited and utilized to progressively inject into the low spatial resolution multispectral image, thus yielding the high spatial resolution output. The specific network design is motivated by the physical formula of the traditional multi-resolution analysis (MRA) methods. Hence, an effective MRA fusion module is also integrated into the TDNet. Besides, we adopt a few ResNet blocks and some multi-scale convolution kernels to deepen and widen the network to effectively enhance the feature extraction and the robustness of the proposed TDNet. Extensive experiments on reduced- and full-resolution datasets acquired by WorldView-3, QuickBird, and GaoFen-2 sensors demonstrate the superiority of the proposed TDNet compared with some recent state-of-the-art pansharpening approaches. An ablation study has also corroborated the effectiveness of the proposed approach.
【12】 Bridging the gap between prostate radiology and pathology through machine learning 标题:通过机器学习弥合前列腺放射学和病理学之间的鸿沟 链接:https://arxiv.org/abs/2112.02164
作者:Indrani Bhattacharya,David S. Lim,Han Lin Aung,Xingchen Liu,Arun Seetharaman,Christian A. Kunder,Wei Shao,Simon J. C. Soerensen,Richard E. Fan,Pejman Ghanouni,Katherine J. To'o,James D. Brooks,Geoffrey A. Sonn,Mirabela Rusu 机构:Department of Radiology, Stanford University School of Medicine, Stanford, CA , Department of Urology, Stanford University School of Medicine, Stanford, CA , Department of Computer Science, Stanford University, Stanford, CA 备注:Indrani Bhattacharya and David S. Lim contributed equally as first authors. Geoffrey A. Sonn and Mirabela Rusu contributed equally as senior authors 摘要:前列腺癌是美国男性第二大致命癌症。虽然磁共振成像(MRI)越来越多地用于指导前列腺癌诊断的靶向活检,但由于假阳性率和假阴性率高以及读者之间的一致性低,其实用性仍然有限。在前列腺MRI上检测和定位癌症的机器学习方法可以帮助标准化放射科医生的解释。然而,现有的机器学习方法不仅在模型结构上有所不同,而且在用于模型训练的基本真理标记策略上也有所不同。在这项研究中,我们比较了不同的标记策略,即病理学确认的放射科医生标签、全套组织病理学图像上的病理学家标签以及病变级别和像素级别的数字病理学家标签(先前在组织病理学图像上验证的用于预测像素级别Gleason模式的深度学习算法)全套组织病理学图像。我们分析了这些标签对经过训练的机器学习模型性能的影响。我们的实验表明:(1)使用它们训练的放射科医生标签和模型可能会错过癌症,或低估癌症的程度;(2)使用它们训练的数字病理学家标签和模型与病理学家标签具有高度一致性;(3)使用数字病理学家标签训练的模型在具有不同疾病分布的两个不同队列中实现了最佳前列腺癌检测性能,而与所使用的模型结构无关。数字病理学家标签可以减少与人类注释相关的挑战,包括劳动力、时间、读取器间和读取器内的变异性,并且可以通过训练可靠的机器学习模型,在MRI上检测和定位前列腺癌,帮助缩小前列腺放射学和病理学之间的差距。 摘要:Prostate cancer is the second deadliest cancer for American men. While Magnetic Resonance Imaging (MRI) is increasingly used to guide targeted biopsies for prostate cancer diagnosis, its utility remains limited due to high rates of false positives and false negatives as well as low inter-reader agreements. Machine learning methods to detect and localize cancer on prostate MRI can help standardize radiologist interpretations. However, existing machine learning methods vary not only in model architecture, but also in the ground truth labeling strategies used for model training. In this study, we compare different labeling strategies, namely, pathology-confirmed radiologist labels, pathologist labels on whole-mount histopathology images, and lesion-level and pixel-level digital pathologist labels (previously validated deep learning algorithm on histopathology images to predict pixel-level Gleason patterns) on whole-mount histopathology images. We analyse the effects these labels have on the performance of the trained machine learning models. Our experiments show that (1) radiologist labels and models trained with them can miss cancers, or underestimate cancer extent, (2) digital pathologist labels and models trained with them have high concordance with pathologist labels, and (3) models trained with digital pathologist labels achieve the best performance in prostate cancer detection in two different cohorts with different disease distributions, irrespective of the model architecture used. Digital pathologist labels can reduce challenges associated with human annotations, including labor, time, inter- and intra-reader variability, and can help bridge the gap between prostate radiology and pathology by enabling the training of reliable machine learning models to detect and localize prostate cancer on MRI.
其他(15篇)
【1】 Simultaneously Predicting Multiple Plant Traits from Multiple Sensors via Deformable CNN Regression 标题:基于可变形CNN回归的多传感器同时预测多个植物性状 链接:https://arxiv.org/abs/2112.03205
作者:Pranav Raja,Alex Olenskyj,Hamid Kamangir,Mason Earles 机构:Biological and Agricultural Engineering, University of California, Davis, Viticulture and Enology, University of California, Davis, AI Institute for Food Systems (AIFS) 摘要:性状测量是植物育种和农业生产的关键。通常,一套植物性状是通过费力的手动测量来测量的,然后用于训练和/或验证更高通量的性状估计技术。这里,我们介绍一个相对简单的卷积神经网络(CNN)模型,该模型接受多个传感器输入并预测多个连续性状输出,即多输入、多输出CNN(MIMO-CNN)。此外,我们在该网络结构(MIMO-DCNN)中引入可变形卷积层,使模型能够自适应调整其感受野,对数据中复杂的可变几何变换进行建模,并微调连续性状输出。我们研究了MIMO-CNN和MIMO-DCNN模型在来自2021年自主温室挑战赛的多输入(即RGB和深度图像)、多性状输出莴苣数据集上的表现。进行消融研究以检查使用单输入与多输入以及单输出与多输出的效果。MIMO-DCNN模型取得了0.068的归一化均方误差(NMSE),相比2021年排行榜最高分0.081有显著提升。提供了开放源代码。 摘要:Trait measurement is critical for the plant breeding and agricultural production pipeline. Typically, a suite of plant traits is measured using laborious manual measurements and then used to train and/or validate higher throughput trait estimation techniques. Here, we introduce a relatively simple convolutional neural network (CNN) model that accepts multiple sensor inputs and predicts multiple continuous trait outputs - i.e. a multi-input, multi-output CNN (MIMO-CNN). Further, we introduce deformable convolutional layers into this network architecture (MIMO-DCNN) to enable the model to adaptively adjust its receptive field, model complex variable geometric transformations in the data, and fine-tune the continuous trait outputs. We examine how the MIMO-CNN and MIMO-DCNN models perform on a multi-input (i.e. RGB and depth images), multi-trait output lettuce dataset from the 2021 Autonomous Greenhouse Challenge. Ablation studies were conducted to examine the effect of using single versus multiple inputs, and single versus multiple outputs. The MIMO-DCNN model resulted in a normalized mean squared error (NMSE) of 0.068 - a substantial improvement over the top 2021 leaderboard score of 0.081. Open-source code is provided.
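下面用PyTorch给出一个多输入、多输出并带有可变形卷积层的回归网络草图,对应摘要中的MIMO-DCNN思路;通道数、骨干深度与性状数量均为示意性假设,并非原文实现。

```python
# Sketch of a multi-input, multi-output regressor with a deformable block,
# loosely following the MIMO-DCNN description. Channel sizes and trait count
# are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MIMODCNN(nn.Module):
    def __init__(self, num_traits=5):
        super().__init__()
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.depth_stem = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        # deformable conv: offsets (2 * kH * kW channels) are predicted from the fused features
        self.offset = nn.Conv2d(64, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(64, 64, kernel_size=3, padding=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_traits))  # one continuous output per trait

    def forward(self, rgb, depth):
        x = torch.cat([self.rgb_stem(rgb), self.depth_stem(depth)], dim=1)
        x = self.deform(x, self.offset(x))  # receptive field adapts to plant geometry
        return self.head(x)                 # (batch, num_traits)

# Usage: traits = MIMODCNN()(torch.rand(2, 3, 128, 128), torch.rand(2, 1, 128, 128))
```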
【2】 Encouraging Disentangled and Convex Representation with Controllable Interpolation Regularization 标题:用可控插值正则化鼓励解缠和凸表示 链接:https://arxiv.org/abs/2112.03163
作者:Yunhao Ge,Zhi Xu,Yao Xiao,Gan Xin,Yunkui Pang,Laurent Itti 机构:University of Southern California, Los Angeles, CA 备注:14 pages, 15 figure (including appendix) 摘要:我们专注于可控解纠缠表示学习(C-Dis-RL),其中用户可以控制解纠缠潜在空间的划分,以分解下游任务的数据集属性(概念)。目前的方法仍存在两个普遍问题:(1)缺乏全面的解纠缠约束,特别是缺少潜在域和观察域中不同属性之间互信息的最小化。(2) 它们在分离的潜在空间中缺乏凸性约束,这对于有意义地操纵下游任务的特定属性非常重要。为了同时鼓励全面的C-Dis-RL和凸性,我们提出了一种简单而有效的方法:可控插值正则化(CIR),它创建了一个正循环,其中解纠缠和凸性可以相互帮助。具体来说,我们在训练期间在潜在空间中进行受控插值,并“重用”编码器,以帮助形成“完美解纠缠”正则化。在这种情况下,(a)解纠缠损失隐含地扩大了潜在的“可理解”分布,以鼓励凸性;(b)凸性反过来可以提升解纠缠的鲁棒性和精确性。CIR是一个通用模块,我们将CIR与三种不同的算法结合:ELEGANT、I2I-Dis和GZS-Net,以展示其兼容性和有效性。定性和定量实验表明,CIR改善了C-Dis-RL和潜在凸度。这进一步改善了下游任务:可控图像合成、跨模态图像转换和零样本合成。更多的实验表明,CIR还可以改进其他下游任务,如新属性值挖掘、数据扩充和消除公平性偏差。 摘要:We focus on controllable disentangled representation learning (C-Dis-RL), where users can control the partition of the disentangled latent space to factorize dataset attributes (concepts) for downstream tasks. Two general problems remain under-explored in current methods: (1) They lack comprehensive disentanglement constraints, especially missing the minimization of mutual information between different attributes across latent and observation domains. (2) They lack convexity constraints in disentangled latent space, which is important for meaningfully manipulating specific attributes for downstream tasks. To encourage both comprehensive C-Dis-RL and convexity simultaneously, we propose a simple yet efficient method: Controllable Interpolation Regularization (CIR), which creates a positive loop where the disentanglement and convexity can help each other. Specifically, we conduct controlled interpolation in latent space during training and 'reuse' the encoder to help form a 'perfect disentanglement' regularization. In that case, (a) disentanglement loss implicitly enlarges the potential 'understandable' distribution to encourage convexity; (b) convexity can in turn improve robust and precise disentanglement. CIR is a general module and we merge CIR with three different algorithms: ELEGANT, I2I-Dis, and GZS-Net to show the compatibility and effectiveness. Qualitative and quantitative experiments show improvement in C-Dis-RL and latent convexity by CIR. This further improves downstream tasks: controllable image synthesis, cross-modality image translation and zero-shot synthesis. More experiments demonstrate CIR can also improve other downstream tasks, such as new attribute value mining, data augmentation, and eliminating bias for fairness.
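下面是对CIR核心做法("在潜在空间进行受控插值,并重用编码器构成完美解纠缠正则化")的一个示意性草图;属性划分方式与各接口均为假设,并非作者代码。

```python
# A rough sketch of Controllable Interpolation Regularization as we read it from
# the abstract: interpolate one attribute partition of the latent code, decode,
# re-encode with the same (reused) encoder, and penalise any drift. Hypothetical API.
import torch
import torch.nn.functional as F

def cir_loss(encoder, decoder, x1, x2, attr_idx, alpha=None):
    z1, z2 = encoder(x1), encoder(x2)            # (batch, num_attrs, attr_dim)
    if alpha is None:
        alpha = torch.rand(z1.size(0), 1, device=z1.device)
    z_mix = z1.clone()
    # interpolate only the controlled attribute partition; keep the rest from x1
    z_mix[:, attr_idx] = alpha * z1[:, attr_idx] + (1 - alpha) * z2[:, attr_idx]
    x_mix = decoder(z_mix)
    z_back = encoder(x_mix)                       # 'reuse' the encoder on the interpolation
    # 'perfect disentanglement': re-encoding should recover exactly the mixed code
    return F.mse_loss(z_back, z_mix)
```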
【3】 Ethics and Creativity in Computer Vision 标题:计算机视觉中的伦理与创新 链接:https://arxiv.org/abs/2112.03111
作者:Negar Rostamzadeh,Emily Denton,Linda Petrini 机构:Google Research, Montreal, New York 备注:None 摘要:本文回顾了我们在CVPR 2021会议上组织研讨会《计算机视觉创造性应用中的伦理考量》,以及此前在ECCV 2018、ICCV 2019和CVPR 2020上组织的一系列《面向时尚、艺术与设计的计算机视觉》研讨会的经验与思考。我们希望这一回顾能让艺术家和机器学习研究人员围绕计算机视觉创造性应用的伦理和社会层面展开对话。 摘要:This paper offers a retrospective of what we learnt from organizing the workshop *Ethical Considerations in Creative applications of Computer Vision* at the CVPR 2021 conference and, prior to that, a series of workshops on *Computer Vision for Fashion, Art and Design* at ECCV 2018, ICCV 2019, and CVPR 2020. We hope this reflection will bring artists and machine learning researchers into conversation around the ethical and social dimensions of creative applications of computer vision.
【4】 Scaling Up Influence Functions 标题:放大影响函数 链接:https://arxiv.org/abs/2112.03052
作者:Andrea Schioppa,Polina Zablotskaia,David Vilar,Artem Sokolov 机构:Google Research 备注:Published at AAAI-22 摘要:我们研究影响函数的高效计算问题,以便将模型预测追溯回训练数据。我们提出并分析了一种基于Arnoldi迭代的加速逆Hessian计算的新方法。通过这一改进,据我们所知,我们首次成功实现了可扩展到具有数亿参数的全尺寸(语言和视觉)Transformer模型的影响函数。我们在具有数千万到上亿训练样本的图像分类和序列到序列任务上评估了我们的方法。我们的代码将发布于 https://github.com/google-research/jax-influence。 摘要:We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundreds of millions of parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens to a hundred of millions of training examples. Our code will be available at https://github.com/google-research/jax-influence.
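下面给出Arnoldi思路的一个极简草图:用Hessian-向量积构造一个小的Krylov基,在该子空间内对Hessian做特征分解并求逆,从而近似影响函数所需的 H^{-1}b;假设模型参数已展平为一维向量,且不包含论文中的工程优化与JAX实现细节,仅作示意。

```python
# Minimal sketch of the Arnoldi idea used for influence functions: build a small
# Krylov basis with Hessian-vector products, diagonalise the projected Hessian,
# and apply its (crudely damped) inverse in that subspace. Not the JAX release.
import torch

def hvp(loss_fn, params, v):
    # Hessian-vector product via double backward; params is a flat 1-D tensor
    loss = loss_fn(params)
    (g,) = torch.autograd.grad(loss, params, create_graph=True)
    (hv,) = torch.autograd.grad((g * v).sum(), params)
    return hv

def arnoldi_inverse_hvp(loss_fn, params, b, num_iters=20):
    # Approximates H^{-1} b for the training-loss Hessian H
    Q = [b / b.norm()]
    H = torch.zeros(num_iters + 1, num_iters, device=b.device)
    for j in range(num_iters):
        w = hvp(loss_fn, params, Q[j])
        for i in range(j + 1):                    # Gram-Schmidt against previous basis vectors
            H[i, j] = torch.dot(w.flatten(), Q[i].flatten())
            w = w - H[i, j] * Q[i]
        H[j + 1, j] = w.norm()
        Q.append(w / (H[j + 1, j] + 1e-12))
    # Rayleigh-Ritz: diagonalise the small projected Hessian and invert its spectrum
    Hm = (H[:num_iters, :num_iters] + H[:num_iters, :num_iters].T) / 2
    evals, evecs = torch.linalg.eigh(Hm)
    Qm = torch.stack(Q[:num_iters], dim=0)        # (num_iters, num_params)
    coeff = evecs.T @ (Qm @ b)
    # clamp is a crude damping for tiny/negative eigenvalues in this sketch
    return Qm.T @ (evecs @ (coeff / evals.clamp(min=1e-6)))
```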
【5】 Controllable Animation of Fluid Elements in Still Images 标题:静止图像中流体元素的可控动画 链接:https://arxiv.org/abs/2112.03051
作者:Aniruddha Mahapatra,Kuldeep Kulkarni 机构:Adobe Research India 摘要:我们提出了一种交互控制静止图像中流体元素的动画以生成电影图的方法。具体来说,我们重点关注水、烟、火等流体元素的动画,这些元素具有重复纹理和连续流体运动的特性。从以前的作品中获得灵感,我们以恒定的2D光流图的形式表示图像中此类流体元素的运动。为此,我们允许用户提供任意数量的箭头方向及其相关速度,以及用户想要设置动画的区域的遮罩。用户提供的输入箭头方向、相应的速度值和遮罩随后被转换为表示恒定光流图(FD)的密集流图。我们观察到,使用简单的指数运算获得的FD可以非常接近图像中元素的合理运动。我们使用生成对抗网络(GAN)进一步细化计算的密集光流图FD,以获得更真实的流图。我们设计了一种新的基于UNet的架构,通过在不同分辨率下对输入图像特征进行前向扭曲,使用改进的光流图自动回归生成未来帧。我们在一个公开的数据集上进行了大量的实验,结果表明我们的方法在定性和定量指标方面优于基线。此外,我们还展示了对象在训练集中不存在的方向上的定性动画,并提供了一种合成视频的方法,否则在现实世界中就不会存在。 摘要:We propose a method to interactively control the animation of fluid elements in still images to generate cinemagraphs. Specifically, we focus on the animation of fluid elements like water, smoke, fire, which have the properties of repeating textures and continuous fluid motion. Taking inspiration from prior works, we represent the motion of such fluid elements in the image in the form of a constant 2D optical flow map. To this end, we allow the user to provide any number of arrow directions and their associated speeds along with a mask of the regions the user wants to animate. The user-provided input arrow directions, their corresponding speed values, and the mask are then converted into a dense flow map representing a constant optical flow map (FD). We observe that FD, obtained using simple exponential operations can closely approximate the plausible motion of elements in the image. We further refine computed dense optical flow map FD using a generative-adversarial network (GAN) to obtain a more realistic flow map. We devise a novel UNet based architecture to autoregressively generate future frames using the refined optical flow map by forward-warping the input image features at different resolutions. We conduct extensive experiments on a publicly available dataset and show that our method is superior to the baselines in terms of qualitative and quantitative metrics. In addition, we show the qualitative animations of the objects in directions that did not exist in the training set and provide a way to synthesize videos that otherwise would not exist in the real world.
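下面是"将少量用户箭头转换为带指数衰减影响的稠密恒定光流图"这一步骤的示意性草图;衰减参数sigma与加权方式均为假设,可能与论文的具体公式不同。

```python
# A small sketch of turning a handful of user arrows into a dense constant flow
# map with exponentially decaying influence, as the abstract describes.
import torch

def dense_flow_from_arrows(arrow_xy, arrow_vec, mask, sigma=30.0):
    # arrow_xy:  (K, 2) pixel locations of the user arrows
    # arrow_vec: (K, 2) direction * speed of each arrow
    # mask:      (H, W) 1 where the user wants motion, 0 elsewhere
    H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).float()             # (H, W, 2)
    d = torch.cdist(coords.reshape(-1, 2), arrow_xy.float())   # (H*W, K) distances to arrows
    w = torch.softmax(-d / sigma, dim=-1)                      # exponential falloff per arrow
    flow = (w @ arrow_vec.float()).reshape(H, W, 2)            # weighted blend of arrow vectors
    return flow * mask.unsqueeze(-1).float()                   # zero flow outside the mask

# flow = dense_flow_from_arrows(torch.tensor([[64., 32.]]), torch.tensor([[2., 0.]]),
#                               torch.ones(128, 128))
```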
【6】 The artificial synesthete: Image-melody translations with variational autoencoders 标题:人工联觉者:采用变分自动编码器的图像-旋律翻译 链接:https://arxiv.org/abs/2112.02953
作者:Karl Wienand,Wolfgang M. Heckl 机构:Technische Universität München, Munich, Germany and Deutsches Museum, Munich, Germany 备注:7 pages, 4 figures, supplementary media can be downloaded at this https URL 摘要:这个项目提出了一个神经网络系统,用于在图像和旋律之间进行转换。自动编码器将样本中的信息压缩为抽象表示。翻译网络通过反复的联合接触学习音乐和视觉概念之间的一系列对应关系。由此产生的"人工联觉者"能够生成受图像启发的简单旋律,以及由音乐生成的图像。这些是新颖的解释(而非转置的数据),表达了机器的感知和理解。通过观察这些作品,人们可以探索机器的感知,并由此对照地审视自己的感知。 摘要:This project presents a system of neural networks to translate between images and melodies. Autoencoders compress the information in samples to abstract representation. A translation network learns a set of correspondences between musical and visual concepts from repeated joint exposure. The resulting "artificial synesthete" generates simple melodies inspired by images, and images from music. These are novel interpretations (not transposed data), expressing the machine's perception and understanding. Observing the work, one explores the machine's perception and thus, by contrast, one's own.
【7】 SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation 标题:SelectAugment:用于数据增强的分层确定性样本选择 链接:https://arxiv.org/abs/2112.02862
作者:Shiqi Lin,Zhizheng Zhang,Xin Li,Wenjun Zeng,Zhibo Chen 机构:University of Science and Technology of China, Microsoft Research Asia 摘要:数据扩充(DA)已被广泛研究,以促进许多任务中的模型优化。然而,在大多数情况下,数据扩充是以一定的概率为每个训练样本随机执行的,这可能导致内容破坏和视觉模糊。为了消除这种情况,在本文中,我们提出了一种称为SelectAugment的有效方法,根据样本内容和网络训练状态,以确定性和在线方式选择要增强的样本。具体地说,在每个批次中,我们首先确定扩充比率,然后决定是否在此比率下扩充每个训练样本。我们将该过程建模为两步马尔可夫决策过程,并采用层次强化学习(HRL)来学习扩充策略。这样,可以有效地缓解随机性在选择样本进行增强时的负面影响,提高DA的有效性。大量实验表明,我们提出的SelectAugment可以适应多种常用的DA方法,如Mixup、Cutmix、AutoAugment等,并提高了它们在多个图像分类和细粒度图像识别基准数据集上的性能。 摘要:Data augmentation (DA) has been widely investigated to facilitate model optimization in many tasks. However, in most cases, data augmentation is randomly performed for each training sample with a certain probability, which might incur content destruction and visual ambiguities. To eliminate this, in this paper, we propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner based on the sample contents and the network training status. Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio. We model this process as a two-step Markov decision process and adopt Hierarchical Reinforcement Learning (HRL) to learn the augmentation policy. In this way, the negative effects of the randomness in selecting samples to augment can be effectively alleviated and the effectiveness of DA is improved. Extensive experiments demonstrate that our proposed SelectAugment can be adapted upon numerous commonly used DA methods, e.g., Mixup, Cutmix, AutoAugment, etc, and improve their performance on multiple benchmark datasets of image classification and fine-grained image recognition.
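下面示意性地给出"先决定批内增强比例、再在该比例下确定性地挑选样本"的两步选择过程;两个策略网络的接口为假设,实际方法通过分层强化学习训练这些策略,此处仅展示推断时的选择逻辑。

```python
# Sketch of the two-step, deterministic selection in SelectAugment-style training:
# a batch-level policy picks an augmentation ratio, a sample-level policy scores
# each image, and the top-scoring fraction is augmented. Policy nets are assumed.
import torch

def select_and_augment(batch, ratio_policy, sample_policy, augment_fn):
    # batch: (N, C, H, W) images
    batch_feat = batch.mean(dim=(0, 2, 3))             # crude batch summary for the high-level policy
    ratio = ratio_policy(batch_feat).sigmoid().item()  # step 1: fraction of the batch to augment
    scores = sample_policy(batch).squeeze(-1)          # step 2: (N,) per-sample augmentation scores
    k = max(1, int(round(ratio * batch.size(0))))
    idx = scores.topk(k).indices                        # deterministic: highest-scoring samples
    out = batch.clone()
    out[idx] = augment_fn(batch[idx])                   # e.g. Mixup/CutMix applied to the selected subset
    return out, idx
```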
【8】 Texture Reformer: Towards Fast and Universal Interactive Texture Transfer 标题:纹理重建器:迈向快速、通用的交互式纹理转换 链接:https://arxiv.org/abs/2112.02788
作者:Zhizhong Wang,Lei Zhao,Haibo Chen,Ailin Li,Zhiwen Zuo,Wei Xing,Dongming Lu 机构:College of Computer Science and Technology, Zhejiang University 备注:Accepted by AAAI2022 摘要:在本文中,我们提出了纹理重整器,这是一种基于神经网络的快速通用框架,用于用户指定指导下的交互式纹理传输。挑战在于三个方面:1)任务的多样性,2)制导图的简单性,以及3)执行效率。为了应对这些挑战,我们的关键思想是使用一种新的前馈多视图和多阶段合成过程,包括I)全局视图结构对齐阶段,II)局部视图纹理细化阶段,以及III)整体效果增强阶段,以粗到细的方式合成具有连贯结构和精细纹理细节的高质量结果。此外,我们还引入了一种新的无学习视图特定纹理重构(VSTR)操作,并采用了一种新的语义地图引导策略,以实现更精确的语义引导和结构保持纹理传输。在各种应用场景上的实验结果证明了该框架的有效性和优越性。与最先进的交互式纹理转移算法相比,它不仅获得了更高质量的结果,而且更显著的是,它的速度也快了2-5个数量级。代码可在https://github.com/EndyWon/Texture-Reformer. 摘要:In this paper, we present the texture reformer, a fast and universal neural-based framework for interactive texture transfer with user-specified guidance. The challenges lie in three aspects: 1) the diversity of tasks, 2) the simplicity of guidance maps, and 3) the execution efficiency. To address these challenges, our key idea is to use a novel feed-forward multi-view and multi-stage synthesis procedure consisting of I) a global view structure alignment stage, II) a local view texture refinement stage, and III) a holistic effect enhancement stage to synthesize high-quality results with coherent structures and fine texture details in a coarse-to-fine fashion. In addition, we also introduce a novel learning-free view-specific texture reformation (VSTR) operation with a new semantic map guidance strategy to achieve more accurate semantic-guided and structure-preserved texture transfer. The experimental results on a variety of application scenarios demonstrate the effectiveness and superiority of our framework. And compared with the state-of-the-art interactive texture transfer algorithms, it not only achieves higher quality results but, more remarkably, also is 2-5 orders of magnitude faster. Code is available at https://github.com/EndyWon/Texture-Reformer.
【9】 MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image 标题:MobRecon:基于单目图像的移动友好手部网格重建 链接:https://arxiv.org/abs/2112.02753
作者:Xingyu Chen,Yufeng Liu,Yajiao Dong,Xiong Zhang,Chongyang Ma,Yanmin Xiong,Yuan Zhang,Xiaoyan Guo 机构:Y-tech, Kuaishou Technology, YY Live, Baidu Inc., SEU-ALLEN Joint Center, Institute for Brain and Intelligence, Southeast University, China. 摘要:在这项工作中,我们提出了一个单视图手部网格重建框架,它可以同时实现高重建精度、快速推理速度和时间一致性。具体来说,对于二维编码,我们提出了轻量级但有效的堆叠结构。对于三维解码,我们提供了一种有效的图算子,即深度分离螺旋卷积。此外,我们还提出了一种新的特征提升模块,用于弥合二维和三维表示之间的差距。该模块从基于地图的位置回归(MapReg)块开始,以集成热图编码和位置回归范例的优点,从而提高2D精度和时间一致性。此外,MapReg之后是姿势池和姿势到顶点提升方法,它们将二维姿势编码转换为三维顶点的语义特征。总的来说,我们的手部重建框架,称为MobRecon,具有可承受的计算成本和小巧的模型尺寸,在Apple A14 CPU上达到83FPS的高推理速度。在FreiHAND、RHD和HO3Dv2等流行数据集上进行的大量实验表明,我们的MobRecon在重建精度和时间一致性方面取得了优异的性能。我们的代码公开于 https://github.com/SeanChenxy/HandMesh。 摘要:In this work, we propose a framework for single-view hand mesh reconstruction, which can simultaneously achieve high reconstruction accuracy, fast inference speed, and temporal coherence. Specifically, for 2D encoding, we propose lightweight yet effective stacked structures. Regarding 3D decoding, we provide an efficient graph operator, namely depth-separable spiral convolution. Moreover, we present a novel feature lifting module for bridging the gap between 2D and 3D representations. This module starts with a map-based position regression (MapReg) block to integrate the merits of both heatmap encoding and position regression paradigms to improve 2D accuracy and temporal coherence. Furthermore, MapReg is followed by pose pooling and pose-to-vertex lifting approaches, which transform 2D pose encodings to semantic features of 3D vertices. Overall, our hand reconstruction framework, called MobRecon, comprises affordable computational costs and miniature model size, which reaches a high inference speed of 83FPS on Apple A14 CPU. Extensive experiments on popular datasets such as FreiHAND, RHD, and HO3Dv2 demonstrate that our MobRecon achieves superior performance on reconstruction accuracy and temporal coherence. Our code is publicly available at https://github.com/SeanChenxy/HandMesh.
【10】 Making a Bird AI Expert Work for You and Me 标题:让鸟类AI专家为你我服务 链接:https://arxiv.org/abs/2112.02747
作者:Dongliang Chang,Kaiyue Pang,Ruoyi Du,Zhanyu Ma,Yi-Zhe Song,Jun Guo 机构: School of Artificial Intelligence, Bei-jing University of Posts and Telecommunications 摘要:与细粒度视觉分类(FGVC)一样强大的是,用鸟名“鞭子可怜的威尔”或“绿头鸭”来响应您的查询可能没有多大意义。尽管这在文献中被普遍接受,但它强调了人工智能和人类之间的一个基本问题——什么构成了人类可以从人工智能中学习的可转移知识?本文试图用FGVC作为试验台来回答这个问题。具体地说,我们设想了一个场景,一个训练有素的FGVC模型(AI专家)充当知识提供者,让普通人(你和我)自己成为更好的领域专家,即那些能够区分“鞭打穷人意志”和“绿头鸭”的人。图1展示了我们回答这个问题的方法。假设一位人工智能专家使用专家标签进行训练,我们会问(i)我们可以从人工智能中提取的最佳可转移知识是什么,以及(ii)在给定知识的情况下,测量专业知识收益的最实际方法是什么?对于前者,我们建议将知识表示为高度区分的视觉区域,这些视觉区域是专家专有的。为此,我们设计了一个多阶段学习框架,该框架首先对领域专家和新手的视觉注意力进行建模,然后区分性地提取他们的差异以获取专家专有知识。对于后者,我们模拟评估过程作为书籍指南,以最好地适应人类习惯的学习实践。一项对15000项试验的全面人体研究表明,我们的方法能够持续提高具有不同鸟类专业知识的人对曾经无法识别的鸟类的识别能力。有趣的是,当使用提取的定义知识作为实现区分性定位的手段时,我们的方法还可以提高传统FGVC的性能。代码可从以下网址获取:https://github.com/PRIS-CV/Making-a-Bird-AI-Expert-Work-for-You-and-Me 摘要:As powerful as fine-grained visual classification (FGVC) is, responding your query with a bird name of "Whip-poor-will" or "Mallard" probably does not make much sense. This however commonly accepted in the literature, underlines a fundamental question interfacing AI and human -- what constitutes transferable knowledge for human to learn from AI? This paper sets out to answer this very question using FGVC as a test bed. Specifically, we envisage a scenario where a trained FGVC model (the AI expert) functions as a knowledge provider in enabling average people (you and me) to become better domain experts ourselves, i.e. those capable in distinguishing between "Whip-poor-will" and "Mallard". Fig. 1 lays out our approach in answering this question. Assuming an AI expert trained using expert human labels, we ask (i) what is the best transferable knowledge we can extract from AI, and (ii) what is the most practical means to measure the gains in expertise given that knowledge? On the former, we propose to represent knowledge as highly discriminative visual regions that are expert-exclusive. For that, we devise a multi-stage learning framework, which starts with modelling visual attention of domain experts and novices before discriminatively distilling their differences to acquire the expert exclusive knowledge. For the latter, we simulate the evaluation process as book guide to best accommodate the learning practice of what is accustomed to humans. A comprehensive human study of 15,000 trials shows our method is able to consistently improve people of divergent bird expertise to recognise once unrecognisable birds. Interestingly, our approach also leads to improved conventional FGVC performance when the extracted knowledge defined is utilised as means to achieve discriminative localisation. Codes are available at: https://github.com/PRIS-CV/Making-a-Bird-AI-Expert-Work-for-You-and-Me
【11】 A Dataset of Stationary, Fixed-wing Aircraft on a Collision Course for Vision-Based Sense and Avoid 标题:用于基于视觉的感知和避免的静止固定翼飞机碰撞航线的数据集 链接:https://arxiv.org/abs/2112.02735
作者:Jasmin Martin,Jenna Riseley,Jason J. Ford 摘要:预计到2026年,新兴的全球无人机(UAV)服务市场将达到584亿美元,这将促使人们做出重大努力,以不损害现有安全水平的方式将常规无人机操作安全地整合到国家领空。无人机的商业使用将通过感知和避免潜在空中碰撞威胁的能力得到加强,但由于缺乏可用的数据集,该领域的研究受到阻碍,因为这些数据集价格昂贵,而且捕获技术复杂。在本文中,我们提出了一个基于视觉的飞机检测数据集。该数据集由15个图像序列组成,其中包含55521张固定翼飞机接近固定、接地摄像机的图像。还提供了基本事实标签和性能基准。据我们所知,这是第一个研究中型固定翼飞机与观测者碰撞过程的公共数据集。完整的数据集和地面真相标签可在https://qcr.github.io/dataset/aircraft-collision-course/. 摘要:The emerging global market for unmanned aerial vehicle (UAV) services is anticipated to reach USD 58.4 billion by 2026, spurring significant efforts to safely integrate routine UAV operations into the national airspace in a manner that they do not compromise the existing safety levels. The commercial use of UAVs would be enhanced by an ability to sense and avoid potential mid-air collision threats however research in this field is hindered by the lack of available datasets as they are expensive and technically complex to capture. In this paper we present a dataset for vision based aircraft detection. The dataset consists of 15 image sequences containing 55,521 images of a fixed-wing aircraft approaching a stationary, grounded camera. Ground truth labels and a performance benchmark are also provided. To our knowledge, this is the first public dataset for studying medium sized, fixed-wing aircraft on a collision course with the observer. The full dataset and ground truth labels are publicly available at https://qcr.github.io/dataset/aircraft-collision-course/.
【12】 Neural Photometry-guided Visual Attribute Transfer 标题:神经光度学引导的视觉属性传递 链接:https://arxiv.org/abs/2112.02520
作者:Carlos Rodriguez-Pardo,Elena Garces 机构: Spain) and withUniversidad Carlos III de Madrid ( 2800 5, Spain) and with UniversidadRey Juan Carlos ( 289 3 3 备注:13 pages. To be published in Transactions on Visualizations and Computer Graphics. Project website: this http URL 摘要:我们提出了一种基于深度学习的方法,用于将空间变化的视觉材质属性(例如纹理贴图或图像样式化)传播到相同或类似材质的较大样本。对于训练,我们利用在多个照明下拍摄的材料图像和专用的数据增强策略,使传输对新的照明条件和仿射变形具有鲁棒性。我们的模型依赖于一个有监督的图像到图像的转换框架,并且对转换域是不可知的;我们展示了语义分割、法线映射和样式化。采用图像类比法,该方法只要求训练数据包含与输入制导相同的视觉结构。我们的方法以交互速率工作,使其适合于材料编辑应用程序。我们在一个受控的环境中全面评估我们的学习方法,提供定量的绩效衡量。最后,我们证明了在单一材料上训练模型足以推广到相同类型的材料,而不需要大量数据集。 摘要:We present a deep learning-based method for propagating spatially-varying visual material attributes (e.g. texture maps or image stylizations) to larger samples of the same or similar materials. For training, we leverage images of the material taken under multiple illuminations and a dedicated data augmentation policy, making the transfer robust to novel illumination conditions and affine deformations. Our model relies on a supervised image-to-image translation framework and is agnostic to the transferred domain; we showcase a semantic segmentation, a normal map, and a stylization. Following an image analogies approach, the method only requires the training data to contain the same visual structures as the input guidance. Our approach works at interactive rates, making it suitable for material edit applications. We thoroughly evaluate our learning methodology in a controlled setup providing quantitative measures of performance. Last, we demonstrate that training the model on a single material is enough to generalize to materials of the same type without the need for massive datasets.
【13】 Exploring Complicated Search Spaces with Interleaving-Free Sampling 标题:用非交错抽样探索复杂搜索空间 链接:https://arxiv.org/abs/2112.02488
作者:Yunjie Tian,Lingxi Xie,Jiemin Fang,Jianbin Jiao,Qixiang Ye,Qi Tian 机构:University of Chinese Academy of Sciences, Huawei Inc., Huazhong University of Science and Technology 备注:9 pages, 8 figures, 6 tables 摘要:现有的神经结构搜索算法大多是针对短距离连接的搜索空间。我们认为,这种设计虽然安全稳定,但阻碍了搜索算法探索更复杂的场景。在本文中,我们将搜索算法建立在具有长距离连接的复杂搜索空间上,并表明现有的权重共享搜索算法大多由于\textbf{interleaved connections}的存在而失败。根据观察结果,我们提出了一种简单而有效的算法\textbf{IF-NAS},在搜索过程中,我们执行周期性采样策略来构造不同的子网络,避免在任何子网络中出现交叉连接。在提出的搜索空间中,IF-NAS比随机抽样和以前的权重共享搜索算法都有很大的优势。IF-NAS还推广到更容易使用的基于微单元的空间。我们的研究强调了宏观结构的重要性,我们期待着沿着这一方向进一步努力。 摘要:The existing neural architecture search algorithms are mostly working on search spaces with short-distance connections. We argue that such designs, though safe and stable, obstacles the search algorithms from exploring more complicated scenarios. In this paper, we build the search algorithm upon a complicated search space with long-distance connections, and show that existing weight-sharing search algorithms mostly fail due to the existence of \textbf{interleaved connections}. Based on the observation, we present a simple yet effective algorithm named \textbf{IF-NAS}, where we perform a periodic sampling strategy to construct different sub-networks during the search procedure, avoiding the interleaved connections to emerge in any of them. In the proposed search space, IF-NAS outperform both random sampling and previous weight-sharing search algorithms by a significant margin. IF-NAS also generalizes to the micro cell-based spaces which are much easier. Our research emphasizes the importance of macro structure and we look forward to further efforts along this direction.
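下面用纯Python示意"周期性采样互不交错的连接组"的一种可能实现:先将候选长距离连接贪心地划分为不含交错对(i<k<j<l)的组,再按训练步数轮流激活;分组规则为我们的推测,并非论文原设。

```python
# An interpretive sketch of IF-NAS-style periodic sampling: candidate long-range
# edges are grouped so that no group contains an interleaved pair, and the groups
# are activated cyclically over training steps. The grouping rule is our guess.
from itertools import combinations

def interleaved(e1, e2):
    (i, j), (k, l) = sorted([e1, e2])
    return i < k < j < l            # the two skip connections cross each other

def interleaving_free_groups(edges):
    groups = []
    for e in edges:
        for g in groups:            # greedily place each edge in a compatible group
            if all(not interleaved(e, other) for other in g):
                g.append(e)
                break
        else:
            groups.append([e])
    return groups

edges = [(0, 2), (1, 3), (2, 4), (0, 3), (1, 4)]   # candidate long-distance connections
groups = interleaving_free_groups(edges)

def active_edges(step):
    return groups[step % len(groups)]               # periodic sampling over training steps

print(groups, active_edges(7))
```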
【14】 Scanpath Prediction on Information Visualisations 标题:信息可视化中的扫描路径预测 链接:https://arxiv.org/abs/2112.02340
作者:Yao Wang,Mihai Bâce,Andreas Bulling 机构:Institute for Visualisation and Interactive Systems, University of Stuttgart 备注:11 pages, 6 figures 摘要:我们提出了显著性和扫描路径统一模型(UMSS)——一种学习预测信息可视化的视觉显著性和扫描路径(即眼睛注视序列)的模型。尽管扫描路径提供了关于视觉探索过程中不同视觉化元素重要性的丰富信息,但之前的工作仅限于预测聚合注意统计数据,如视觉显著性。我们对流行的MASSVIS数据集上不同信息可视化元素(如标题、标签、数据)的凝视行为进行了深入分析。我们发现,尽管总体而言,视觉化和观众的注视模式惊人地一致,但不同元素的注视动态也存在结构性差异。根据我们的分析,UMSS首先预测多持续时间元素级显著性图,然后从中概率地采样扫描路径。在MASSVIS上的大量实验表明,我们的方法在几个广泛使用的扫描路径和显著性评估指标方面始终优于最先进的方法。我们的方法实现了扫描路径预测的序列得分相对提高11.5%,显著性预测的Pearson相关系数相对提高高达23.6%。这些结果令人鼓舞,表明无需任何眼动追踪设备即可构建更丰富的用户模型并模拟可视化上的视觉注意力。 摘要:We propose Unified Model of Saliency and Scanpaths (UMSS) -- a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several, widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5% for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6% for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.
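下面示意性地展示UMSS第二阶段"从多持续时间显著性图中概率采样扫描路径"的基本原理;实际模型的采样方案应更复杂,此处的softmax温度与逐步独立采样均为简化假设。

```python
# A simplified sketch of the second stage described for UMSS: turning a stack of
# per-duration saliency maps into a scanpath by sampling one fixation per map.
# The real model's sampling scheme is likely richer; this only shows the principle.
import torch

def sample_scanpath(saliency_maps, temperature=1.0):
    # saliency_maps: (T, H, W), one predicted map per viewing-duration step
    T, H, W = saliency_maps.shape
    probs = torch.softmax(saliency_maps.reshape(T, -1) / temperature, dim=-1)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)  # one fixation per step
    ys, xs = idx // W, idx % W
    return torch.stack([xs, ys], dim=-1)                       # (T, 2) fixation sequence

# scanpath = sample_scanpath(torch.rand(5, 64, 64))
```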
【15】 Tunable Image Quality Control of 3-D Ultrasound using Switchable CycleGAN 标题:基于可切换CycleGan的三维超声可调谐成像质量控制 链接:https://arxiv.org/abs/2112.02896
作者:Jaeyoung Huh,Shujaat Khan,Sungjin Choi,Dongkuk Shin,Eun Sun Lee,Jong Chul Ye 机构:Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon , Republic of Korea, System R&D Group, Samsung Medison Co., Ltd., Seoul, Korea 摘要:与单轴平面成像的二维超声(US)相比,三维超声成像系统可以沿三个轴平面显示体积。这允许全面查看解剖结构,这对于妇科(妇科)和产科(OB)应用非常有用。不幸的是,与二维US相比,三维US在分辨率上存在固有的限制。例如,在使用三维机械探头的三维超声成像中,图像质量沿光束方向相当,但在其他两个轴向图像平面中经常观察到图像质量的显著恶化。为了解决这个问题,我们提出了一种新的无监督深度学习方法来提高三维超声图像质量。特别是,我们使用{\em unmatched}高质量的二维US图像作为参考,训练了最近提出的可切换CycleGAN体系结构,以便三维US中的每个映射平面都可以学习二维US图像的图像质量。由于采用了可切换的体系结构,我们的网络还可以根据用户偏好实时控制图像增强级别,这非常适合以用户为中心的扫描仪设置。大量的临床评估实验证实,我们的方法提供了显著改善的图像质量和用户友好的灵活性。 摘要:In contrast to 2-D ultrasound (US) for uniaxial plane imaging, a 3-D US imaging system can visualize a volume along three axial planes. This allows for a full view of the anatomy, which is useful for gynecological (GYN) and obstetrical (OB) applications. Unfortunately, the 3-D US has an inherent limitation in resolution compared to the 2-D US. In the case of 3-D US with a 3-D mechanical probe, for example, the image quality is comparable along the beam direction, but significant deterioration in image quality is often observed in the other two axial image planes. To address this, here we propose a novel unsupervised deep learning approach to improve 3-D US image quality. In particular, using {\em unmatched} high-quality 2-D US images as a reference, we trained a recently proposed switchable CycleGAN architecture so that every mapping plane in 3-D US can learn the image quality of 2-D US images. Thanks to the switchable architecture, our network can also provide real-time control of image enhancement level based on user preference, which is ideal for a user-centric scanner setup. Extensive experiments with clinical evaluation confirm that our method offers significantly improved image quality as well user-friendly flexibility.
机器翻译,仅供参考