Computer Vision and Pattern Recognition arXiv Digest [12.8]

cs.CV: 75 papers today

Transformer (3 papers)

【1】 SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal
Link: https://arxiv.org/abs/2112.03631

Authors: Zhaoyang Sun, Yaxiong Chen, Shengwu Xiong
Affiliations: School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan; Wuhan University of Technology Chongqing Research Institute, Chongqing; Sanya Science and Education Innovation Park of Wuhan University of Technology, Sanya
Note: Accepted to AAAI 2022
Abstract: Makeup transfer is not only to extract the makeup style of the reference image, but also to render the makeup style to the semantically corresponding position of the target image. However, most existing methods focus on the former and ignore the latter, resulting in a failure to achieve desired results. To solve the above problems, we propose a unified Symmetric Semantic-Aware Transformer (SSAT) network, which incorporates semantic correspondence learning to realize makeup transfer and removal simultaneously. In SSAT, a novel Symmetric Semantic Corresponding Feature Transfer (SSCFT) module and a weakly supervised semantic loss are proposed to model and facilitate the establishment of accurate semantic correspondence. In the generation process, the extracted makeup features are spatially warped by SSCFT to achieve semantic alignment with the target image, and the warped makeup features are then combined with unmodified makeup-irrelevant features to produce the final result. Experiments show that our method obtains more visually accurate makeup transfer results, and a user study comparing against other state-of-the-art makeup transfer methods reflects the superiority of our method. Besides, we verify the robustness of the proposed method under differences in expression and pose and in object-occlusion scenes, and extend it to video makeup transfer. Code will be available at https://gitee.com/sunzhaoyang0304/ssat-msp.
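
The heart of SSCFT is a dense semantic correspondence between reference and target features, used to warp the makeup features into the target's layout. Below is a minimal PyTorch sketch of such attention-based correspondence warping; the function name, temperature, and feature shapes are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def correspondence_warp(feat_ref, feat_tgt, tau=0.01):
    """Warp reference (makeup) features onto the target layout via a
    semantic correspondence (attention) matrix. Shapes: (B, C, H, W)."""
    B, C, H, W = feat_ref.shape
    ref = F.normalize(feat_ref.flatten(2), dim=1)      # (B, C, HW)
    tgt = F.normalize(feat_tgt.flatten(2), dim=1)      # (B, C, HW)
    # Pairwise cosine similarity between target and reference positions.
    attn = torch.bmm(tgt.transpose(1, 2), ref) / tau   # (B, HW_tgt, HW_ref)
    attn = attn.softmax(dim=-1)
    # Each target position gathers semantically corresponding ref features.
    warped = torch.bmm(feat_ref.flatten(2), attn.transpose(1, 2))
    return warped.view(B, C, H, W)

warped_makeup = correspondence_warp(torch.randn(2, 64, 32, 32),
                                    torch.randn(2, 64, 32, 32))
```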

【2】 Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training
Link: https://arxiv.org/abs/2112.03552

Authors: Haofei Zhang, Jiarui Duan, Mengqi Xue, Jie Song, Li Sun, Mingli Song
Affiliations: Zhejiang University; Xidian University
Note: 10 pages; under review at CVPR 2022
Abstract: Recently, vision Transformers (ViTs) have been developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in the realm of computer vision (CV). With the general-purpose Transformer architecture replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially in data-sufficient circumstances. However, ViTs are prone to over-fitting on small datasets and thus rely on large-scale pre-training, which expends enormous time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back to ViTs while preserving their network architectures for a higher upper bound and setting up more suitable optimization objectives. To begin with, an agent CNN is designed based on the given ViT with inductive biases. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and ViT with weight sharing, during which the ViT learns inductive biases from the intermediate features of the agent. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data show the encouraging result that the inductive biases help ViTs converge significantly faster and outperform conventional CNNs with even fewer parameters.

【3】 Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal
Link: https://arxiv.org/abs/2112.03492

Authors: Yucheng Shi, Yahong Han
Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin, China
Abstract: Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness compared to deep convolutional neural networks (CNNs). On the one hand, ViTs' focus on global interaction between individual patches reduces the local noise sensitivity of images. On the other hand, existing decision-based attacks for CNNs ignore the difference in noise sensitivity between different regions of the image, which affects the efficiency of noise compression. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried remains a challenging problem. In this paper, we propose a new decision-based black-box attack against ViTs termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise-initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on the ImageNet-21k, ILSVRC-2012, and Tiny-ImageNet datasets demonstrate that PAR achieves a much lower perturbation magnitude on average with the same number of queries.
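
The core loop of PAR compresses the perturbation patch by patch under a hard-label query budget. Below is a deliberately simplified single-pass sketch, omitting the paper's coarse-to-fine patch splitting and query-value ranking; `model`, `shrink`, and the raster visiting order are assumptions of this sketch.

```python
import torch

def par_compress(model, x_clean, x_adv, true_label, patch=8, shrink=0.5):
    """Visit each patch, tentatively shrink its perturbation toward the
    clean image, and keep the change only if the example stays
    adversarial (each check costs one hard-label query)."""
    x = x_adv.clone()
    _, _, H, W = x.shape
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            sl = (slice(None), slice(None),
                  slice(i, i + patch), slice(j, j + patch))
            cand = x.clone()
            # Move this patch's noise a fraction of the way back to clean.
            cand[sl] = x_clean[sl] + shrink * (x[sl] - x_clean[sl])
            with torch.no_grad():                      # hard-label query
                still_adv = model(cand).argmax(1) != true_label
            if still_adv.all():
                x = cand
    return x
```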

Detection (7 papers)

【1】 MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Link: https://arxiv.org/abs/2112.03902

Authors: Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, François Brémond
Affiliations: Inria, Université Côte d'Azur; Stony Brook University
Abstract: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relations in those datasets are complex, including challenges like composite actions and co-occurring actions. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we propose a novel ConvTransformer network for action detection. This network comprises three main components: (1) a Temporal Encoder module that extensively explores global and local temporal relations at multiple temporal resolutions; (2) a Temporal Scale Mixer module that effectively fuses the multi-scale features into a unified feature representation; (3) a Classification module that learns the instance center-relative position and predicts the frame-level classification scores. Extensive experiments on multiple datasets, including Charades, TSU and MultiTHUMOS, confirm the effectiveness of our proposed method. Our network outperforms the state-of-the-art methods on all three datasets.

【2】 Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection
Link: https://arxiv.org/abs/2112.03650

Authors: Huajun Zhou, Peijia Chen, Lingxiao Yang, Jianhuang Lai, Xiaohua Xie
Note: 11 pages
Abstract: Unsupervised Salient Object Detection (USOD) is of paramount significance for both industrial applications and downstream tasks. Existing deep-learning (DL) based USOD methods utilize some low-quality saliency predictions extracted by several traditional SOD methods as saliency cues, which mainly capture some conspicuous regions in images. Furthermore, they refine these saliency cues with the assistance of semantic information, which is obtained from models trained by supervised learning on other related vision tasks. In this work, we propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues and uses these cues to train a robust saliency detector. More importantly, no human annotations are involved in our framework during the whole training process. In the first stage, we transform a pretrained network (MoCo v2) to aggregate multi-level features into a single activation map, where an Adaptive Decision Boundary (ADB) is proposed to assist the training of the transformed network. To facilitate the generation of high-quality pseudo labels, we propose a loss function that enlarges the feature distances between pixels and their means. In the second stage, an Online Label Rectifying (OLR) strategy updates the pseudo labels during the training process to reduce the negative impact of distractors. In addition, we construct a lightweight saliency detector using two Residual Attention Modules (RAMs), which refine the high-level features using the complementary information in low-level features, such as edges and colors. Extensive experiments on several SOD benchmarks prove that our framework reports significant performance compared with existing USOD methods. Moreover, training our framework on 3000 images takes about 1 hour, which is over 30x faster than previous state-of-the-art methods.
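
One concrete reading of the proposed loss is to push each pixel of the single-channel activation map away from the per-image mean, encouraging a bimodal foreground/background separation. A sketch under that assumption follows (the paper's exact formulation may differ):

```python
import torch

def mean_separation_loss(act):
    """Push each pixel's activation away from the per-image mean so the
    activation map becomes bimodal (foreground vs. background).
    act: (B, 1, H, W) activation map, e.g. after a sigmoid."""
    mu = act.mean(dim=(2, 3), keepdim=True)
    # Negative mean squared distance: minimizing it enlarges |act - mu|.
    return -((act - mu) ** 2).mean()

loss = mean_separation_loss(torch.rand(4, 1, 56, 56))
```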

【3】 Regularity Learning via Explicit Distribution Modeling for Skeletal Video Anomaly Detection
Link: https://arxiv.org/abs/2112.03649

Authors: Shoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, Wei Wu
Affiliations: Shanghai Jiao Tong University; SenseTime Research; Shanghai AI Laboratory
Abstract: Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which can directly exploit explicit motion features such as optical flow, pose-based methods suffer from the lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose motion representation from the probability perspective. Furthermore, a novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction. These two modules are then integrated into a unified framework for pose regularity learning, which is referred to as the Motion Prior Regularity Learner (MoPRL). MoPRL achieves state-of-the-art performance with an average improvement of 4.7% AUC on several challenging datasets. Extensive experiments validate the versatility of each proposed module.

【4】 Gram-SLD: Automatic Self-labeling and Detection for Instance Objects
Link: https://arxiv.org/abs/2112.03641

Authors: Rui Wang, Chengtun Wu, Jiawen Xin, Liang Zhang
Affiliations: School of Instrumentation Science and Opto-electronics Engineering, Beihang University, Key Laboratory of Precision Opto-mechatronics Technology, China; Department of Electrical and Computer Engineering, University of Connecticut
Note: 37 pages with 7 figures
Abstract: Instance object detection plays an important role in intelligent monitoring, visual navigation, human-computer interaction, intelligent services and other fields. Inspired by the great success of Deep Convolutional Neural Networks (DCNNs), DCNN-based instance object detection has become a promising research topic. To address the problem that a DCNN always requires a large-scale annotated dataset to supervise its training, while manual annotation is exhausting and time-consuming, we propose a new framework based on co-training called Gram Self-Labeling and Detection (Gram-SLD). The proposed Gram-SLD can automatically annotate a large amount of data with very limited manually labeled key data and achieve competitive performance. In our framework, a gram loss is defined and used to construct two fully redundant and independent views, and a key-sample selection strategy along with an automatic annotating strategy that comprehensively considers precision and recall is proposed to generate high-quality pseudo-labels. Experiments on the public GMU Kitchen Dataset, Active Vision Dataset, and the self-made BHID-ITEM dataset demonstrate that, with only 5% labeled training data, our Gram-SLD achieves competitive performance in object detection (less than 2% mAP loss) compared with fully supervised methods. In practical applications with complex and changing environments, the proposed method can satisfy the real-time and accuracy requirements of instance object detection.

【5】 DCAN: Improving Temporal Action Detection via Dual Context Aggregation
Link: https://arxiv.org/abs/2112.03612

Authors: Guo Chen, Yin-Dong Zheng, Limin Wang, Tong Lu
Affiliations: State Key Lab for Novel Software Technology, Nanjing University, China
Note: AAAI 2022 camera-ready version
Abstract: Temporal action detection aims to locate the boundaries of actions in a video. Current methods based on boundary matching enumerate and calculate all possible boundary matchings to generate proposals. However, these methods neglect long-range context aggregation in boundary prediction. Meanwhile, due to the similar semantics of adjacent matchings, local semantic aggregation of densely-generated matchings cannot improve semantic richness and discrimination. In this paper, we propose an end-to-end proposal generation method named Dual Context Aggregation Network (DCAN) that aggregates context at two levels, namely the boundary level and the proposal level, to generate high-quality action proposals, thereby improving the performance of temporal action detection. Specifically, we design Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation at the boundary level and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine Matching (CFM) is designed to aggregate context at the proposal level and refine the matching map from coarse to fine. We conduct extensive experiments on ActivityNet v1.3 and THUMOS-14. DCAN obtains an average mAP of 35.39% on ActivityNet v1.3 and reaches 54.14% mAP at IoU@0.5 on THUMOS-14, which demonstrates that DCAN can generate high-quality proposals and achieve state-of-the-art performance. We release the code at https://github.com/cg1177/DCAN.

【6】 Voxelized 3D Feature Aggregation for Multiview Detection
Link: https://arxiv.org/abs/2112.03471

Authors: Jiahao Ma, Jinguang Tong, Shan Wang, Wei Zhao, Liang Zheng, Chuong Nguyen
Affiliations: Cyber Physical Systems, CSIRO Data61, Australia; College of Engineering & Computer Science, Australian National University, Australia
Abstract: Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, where state-of-the-art approaches adopt homography transformations to project multi-view features to the ground plane. However, we find that these 2D transformations do not take into account the object's height, and with this neglected, features along the vertical direction of the same object are likely not projected onto the same ground-plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. Additionally, because different kinds of objects (human vs. cattle) have different shapes on the ground plane, we introduce oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multiview 2D detection and multiview 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is very competitive compared with state-of-the-art approaches. Code and MultiviewC are released at https://github.com/Robert-Mar/VFA.
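
The key operation is projecting 3D voxel centers into each view and sampling the 2D feature map at the projected locations, so that the features of one vertical column can later be aggregated into the same ground-plane cell. A PyTorch sketch for a single view, with an assumed pinhole projection matrix and shapes:

```python
import torch
import torch.nn.functional as F

def sample_voxel_features(feat2d, P, voxels, img_hw):
    """Project 3D voxel centers into one camera view and sample 2D features.
    feat2d: (C, Hf, Wf) feature map; P: (3, 4) projection matrix;
    voxels: (N, 3) world coordinates; img_hw: original (H, W) of the image."""
    N = voxels.shape[0]
    homo = torch.cat([voxels, torch.ones(N, 1, device=voxels.device)], dim=1)
    uvw = homo @ P.t()                                    # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)          # pixel coordinates
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / img_hw[1] * 2 - 1,
                        uv[:, 1] / img_hw[0] * 2 - 1], dim=-1)
    grid = grid.view(1, 1, N, 2)
    out = F.grid_sample(feat2d.unsqueeze(0), grid, align_corners=False)
    return out.reshape(feat2d.shape[0], N)                # (C, N)

# Summing the sampled features over the Z axis of the voxel grid then yields
# one feature per ground-plane cell, so the same vertical line lands in the
# same cell instead of being smeared by a planar homography.
```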

【7】 Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision
Link: https://arxiv.org/abs/2112.03423

Authors: Alexander Kugele, Thomas Pfeil, Michael Pfeiffer, Elisabetta Chicca
Note: Accepted at the DAGM German Conference on Pattern Recognition (GCPR 2021)
Abstract: Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames and yield sparse, energy-efficient encodings of scenes, in addition to low latency, high dynamic range, and lack of motion blur. Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks trained with backpropagation. However, using these approaches for event streams requires a transformation to a synchronous paradigm, which not only loses computational efficiency, but also misses opportunities to extract spatio-temporal features. In this article we propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection, combining a spiking neural network (SNN) backbone for efficient event-based feature extraction and a subsequent analog neural network (ANN) head to solve synchronous classification and detection tasks. This is achieved by combining standard backpropagation with surrogate gradient training to propagate gradients through the SNN. Hybrid SNN-ANNs can be trained without conversion and result in highly accurate networks that are substantially more computationally efficient than their ANN counterparts. We demonstrate results on event-based classification and object detection datasets, in which only the architecture of the ANN head needs to be adapted to the task, and no conversion of the event-based input is necessary. Since ANNs and SNNs require different hardware paradigms to maximize their efficiency, we envision that the SNN backbone and ANN head can be executed on different processing units, and thus analyze the bandwidth necessary to communicate between the two parts. Hybrid networks are promising architectures to further advance machine learning approaches for event-based vision without having to compromise on efficiency.
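
Surrogate-gradient training, which the paper combines with standard backpropagation, replaces the non-differentiable spike with a smooth derivative on the backward pass. A standard PyTorch sketch of a leaky integrate-and-fire step with a fast-sigmoid surrogate follows; the surrogate shape and constants are common choices, not necessarily the paper's.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate in the
    backward pass -- the standard trick for backpropagating through an SNN."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        v, = ctx.saved_tensors
        # d(spike)/dv approximated by 1 / (1 + |v|)^2.
        return grad_out / (1.0 + v.abs()) ** 2

def lif_step(v, x, tau=2.0, v_th=1.0):
    """One leaky integrate-and-fire update with soft reset by subtraction."""
    v = v + (x - v) / tau            # leaky integration of the input current
    s = SpikeFn.apply(v - v_th)      # emit a spike where v crosses threshold
    v = v - v_th * s                 # subtract the threshold where it fired
    return v, s
```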

Classification | Recognition (8 papers)

【1】 Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning
Link: https://arxiv.org/abs/2112.03603

Authors: Xiaohang Bian, Bo Qin, Xiaozhe Xin, Jianwu Li, Xuefeng Su, Yanfeng Wang
Affiliations: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, China; AI Interaction Department, Tencent, China
Abstract: Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from the two inverse directions. Moreover, in order to deal with mathematical symbols at diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves recognition accuracies of 56.85% on CROHME 2014, 52.92% on CROHME 2016, and 53.96% on CROHME 2019 without data augmentation and model ensembling, substantially outperforming state-of-the-art methods. The source code is available in the supplementary materials.
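
The mutual distillation between the L2R and R2L decoders can be written as a symmetric KL term in which each branch matches the detached soft predictions of the other. A sketch, assuming the R2L logits have already been flipped back into L2R order so the time steps align:

```python
import torch
import torch.nn.functional as F

def mutual_distill(logits_l2r, logits_r2l, T=1.0):
    """Bidirectional mutual distillation: each decoder is trained to match
    the (detached) soft predictions of its inverse counterpart.
    logits_*: (B, T_steps, V); logits_r2l is assumed pre-aligned to L2R order."""
    log_p_l2r = F.log_softmax(logits_l2r / T, dim=-1)
    log_p_r2l = F.log_softmax(logits_r2l / T, dim=-1)
    kl = F.kl_div(log_p_l2r, log_p_r2l.detach().exp(), reduction="batchmean")
    kl += F.kl_div(log_p_r2l, log_p_l2r.detach().exp(), reduction="batchmean")
    return (T * T) * kl
```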

【2】 E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition
Link: https://arxiv.org/abs/2112.03596

Authors: Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, Barbara Caputo
Affiliations: Politecnico di Torino; CINI Consortium; Politecnico di Milano
Abstract: Event cameras are novel bio-inspired sensors, which asynchronously capture pixel-level intensity changes in the form of "events". Due to their sensing mechanism, event cameras have little to no motion blur, a very high temporal resolution, and require significantly less power and memory than traditional frame-based cameras. These characteristics make them a perfect fit for several real-world applications such as egocentric action recognition on wearable devices, where fast camera motion and limited power challenge traditional vision sensors. However, the ever-growing field of event-based vision has, to date, overlooked the potential of event cameras in such applications. In this paper, we show that event data is a very valuable modality for egocentric action recognition. To do so, we introduce N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset. In this context, we propose two strategies: (i) directly processing event-camera data with traditional video-processing architectures (E$^2$(GO)) and (ii) using event data to distill optical flow information (E$^2$(GO)MO). On our proposed benchmark, we show that event data provides performance comparable to RGB and optical flow, yet without any additional flow computation at deploy time, and an improved performance of up to 4% with respect to RGB-only information.

【3】 Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-supervised Action Recognition
Link: https://arxiv.org/abs/2112.03590

Authors: Tianyu Guo, Hong Liu, Zhan Chen, Mengyuan Liu, Tao Wang, Runwei Ding
Affiliations: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China; The School of Intelligent Systems Engineering, Sun Yat-sen University, China
Note: Accepted by AAAI 2022
Abstract: In recent years, self-supervised representation learning for skeleton-based action recognition has been developed with the advance of contrastive learning methods. Existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns to improve the universality of the learned representations. Second, since directly using extreme augmentations may not boost performance due to the drastic changes in original identity, the Dual Distributional Divergence Minimization Loss (D$^3$M Loss) is proposed to minimize the distribution divergence in a gentler way. Third, Nearest Neighbors Mining (NNM) is proposed to further expand positive samples to make the abundant information mining process more reasonable. Exhaustive experiments on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets have verified that our AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols, with higher-quality action representations observed. Our code is available at https://github.com/Levigty/AimCLR.
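
A gentler objective for extreme augmentations, in the spirit of the D$^3$M loss, is to match similarity distributions rather than embeddings directly. The sketch below compares the extreme view's similarity distribution over a negative bank with that of the normally augmented view; the exact form used in the paper may differ from this interpretation.

```python
import torch
import torch.nn.functional as F

def d3m_style_loss(z_normal, z_extreme, bank, T=0.07):
    """Match the extreme view's similarity distribution over a memory bank
    to that of the normal view, instead of forcing embedding agreement.
    z_*: (B, D) embeddings of two views; bank: (K, D) negatives."""
    def log_dist(z):
        sim = F.normalize(z, dim=1) @ F.normalize(bank, dim=1).t() / T
        return F.log_softmax(sim, dim=1)
    target = log_dist(z_normal).exp().detach()   # normal view is the teacher
    return F.kl_div(log_dist(z_extreme), target, reduction="batchmean")
```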

【4】 CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
Link: https://arxiv.org/abs/2112.03562

Authors: Huidong Liu, Shaoyuan Xu, Jinmiao Fu, Yang Liu, Ning Xie, Chien-chih Wang, Bryan Wang, Yi Sun
Affiliations: Stony Brook University, Stony Brook, NY, USA; Amazon Inc., Seattle, WA, USA
Abstract: Modern Web systems such as social media and e-commerce contain rich contents expressed in images and text. Leveraging information from multiple modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attention, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task-specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification with multiple modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on the Fashion-Gen dataset by 5.5% in accuracy and achieves competitive performance on the Food101 dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.
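
The modality-wise attention can be pictured as a learned gate per modality applied before fusion. A minimal sketch follows; the real model also applies sequence-wise attention between image patches and text tokens, which is omitted here, and the module layout is an assumption.

```python
import torch
import torch.nn as nn

class ModalityWiseAttention(nn.Module):
    """Weigh each modality's global feature by its learned task relevance
    before fusing (a simplified view of modality-wise attention)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, img_feat, txt_feat):                 # each (B, dim)
        feats = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, dim)
        w = self.score(feats).softmax(dim=1)               # (B, 2, 1) weights
        return (w * feats).sum(dim=1)                      # fused (B, dim)

fused = ModalityWiseAttention(512)(torch.randn(4, 512), torch.randn(4, 512))
```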

【5】 Label Hallucination for Few-Shot Classification
Link: https://arxiv.org/abs/2112.03340

Authors: Yiren Jian, Lorenzo Torresani
Affiliations: Dartmouth College
Note: Accepted by AAAI 2022. Code is available: this https URL
Abstract: Few-shot classification requires adapting knowledge learned from a large annotated base dataset to recognize novel unseen classes, each represented by few labeled examples. In such a scenario, pretraining a network with high capacity on the large dataset and then finetuning it on the few examples causes severe overfitting. At the same time, training a simple linear classifier on top of "frozen" features learned from the large labeled dataset fails to adapt the model to the properties of the novel classes, effectively inducing underfitting. In this paper we propose an alternative to both of these two popular strategies. First, our method pseudo-labels the entire large dataset using the linear classifier trained on the novel classes. This effectively "hallucinates" the novel classes in the large dataset, despite the novel categories not being present in the base database (novel and base classes are disjoint). Then, it finetunes the entire model with a distillation loss on the pseudo-labeled base examples, in addition to the standard cross-entropy loss on the novel dataset. This step effectively trains the network to recognize contextual and appearance cues that are useful for novel-category recognition, but using the entire large-scale base dataset, thus overcoming the inherent data-scarcity problem of few-shot learning. Despite the simplicity of the approach, we show that our method outperforms the state-of-the-art on four well-established few-shot classification benchmarks.
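
The finetuning objective combines cross-entropy on the few novel examples with distillation toward the "hallucinated" soft labels that the novel-class linear classifier assigned to the base images. A sketch of one training step, with `alpha` and the temperature `T` as assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def finetune_step(model, novel_x, novel_y, base_x, base_soft, alpha=0.5, T=4.0):
    """Cross-entropy on the novel support set plus distillation toward the
    stored pseudo (soft) labels previously assigned to base images by the
    novel-class linear classifier. base_soft: (B, num_novel) stored logits."""
    ce = F.cross_entropy(model(novel_x), novel_y)
    kd = F.kl_div(F.log_softmax(model(base_x) / T, dim=1),
                  F.softmax(base_soft / T, dim=1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd
```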

【6】 Learning Connectivity with Graph Convolutional Networks for Skeleton-based Action Recognition
Link: https://arxiv.org/abs/2112.03328

Authors: Hichem Sahbi
Affiliations: Sorbonne University, CNRS, LIP, Paris, France
Note: arXiv admin note: text overlap with arXiv:2104.04255, arXiv:2104.05482
Abstract: Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing convolutional operations to arbitrary non-regular domains. In particular, GCNs operating on spatial domains show superior performance compared to spectral ones; however, their success is highly dependent on how the topology of input graphs is defined. In this paper, we introduce a novel framework for graph convolutional networks that learns the topological properties of graphs. The design principle of our method is based on the optimization of a constrained objective function which learns not only the usual convolutional parameters in GCNs but also a transformation basis that conveys the most relevant topological relationships in these graphs. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method compared to handcrafted graph design as well as related work.
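
The idea of learning connectivity can be illustrated by making the adjacency matrix itself a trainable parameter, optimized jointly with the usual convolution weights. A minimal PyTorch sketch follows; the paper learns a constrained transformation basis, so this is only the simplest instance of learned topology.

```python
import torch
import torch.nn as nn

class LearnableGraphConv(nn.Module):
    """Graph convolution whose connectivity is learned jointly with the
    feature weights, instead of using a handcrafted skeleton graph."""
    def __init__(self, in_dim, out_dim, num_nodes):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim)
        # Free adjacency parameters; row-softmax keeps them normalized.
        self.adj_logits = nn.Parameter(torch.zeros(num_nodes, num_nodes))

    def forward(self, x):                        # x: (B, N, in_dim)
        adj = self.adj_logits.softmax(dim=-1)    # learned (N, N) adjacency
        return torch.relu(adj @ self.weight(x))  # aggregate then activate

layer = LearnableGraphConv(3, 64, num_nodes=25)  # e.g. 25 skeleton joints
out = layer(torch.randn(8, 25, 3))               # (8, 25, 64)
```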

【7】 Hard Sample Aware Noise Robust Learning for Histopathology Image Classification
Link: https://arxiv.org/abs/2112.03694

Authors: Chuang Zhu, Wenkai Chen, Ting Peng, Ying Wang, Mulan Jin
Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Note: 14 pages, 20 figures, IEEE Transactions on Medical Imaging
Abstract: Deep learning-based histopathology image classification is a key technique to help physicians improve the accuracy and promptness of cancer diagnosis. However, noisy labels are often inevitable in the complex manual annotation process, and thus mislead the training of the classification model. In this work, we introduce a novel hard sample aware noise robust learning method for histopathology image classification. To distinguish the informative hard samples from the harmful noisy ones, we build an easy/hard/noisy (EHN) detection model by using the sample training history. Then we integrate the EHN into a self-training architecture to lower the noise rate through gradual label correction. With the obtained almost-clean dataset, we further propose a noise suppressing and hard enhancing (NSHE) scheme to train the noise-robust model. Compared with previous works, our method can save more clean samples and can be directly applied to real-world noisy dataset scenarios without using a clean subset. Experimental results demonstrate that the proposed scheme outperforms the current state-of-the-art methods on both synthetic and real-world noisy datasets. The source code and data are available at https://github.com/bupt-ai-cz/HSA-NRL/.

【8】 RSBNet: One-Shot Neural Architecture Search for A Backbone Network in Remote Sensing Image Recognition
Link: https://arxiv.org/abs/2112.03456

Authors: Cheng Peng, Yangyang Li, Ronghua Shang, Licheng Jiao
Affiliations: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, China
Abstract: Recently, a massive number of deep learning based approaches have been successfully applied to various remote sensing image (RSI) recognition tasks. However, most existing advances of deep learning methods in the RSI field rely heavily on the features extracted by manually designed backbone networks, which severely hinders the potential of deep learning models due to the complexity of RSI and the limitation of prior knowledge. In this paper, we research a new design paradigm for the backbone architecture in RSI recognition tasks, including scene classification, land-cover classification and object detection. A novel one-shot architecture search framework based on a weight-sharing strategy and evolutionary algorithm is proposed, called RSBNet, which consists of three stages: Firstly, a supernet constructed in a layer-wise search space is pretrained on a self-assembled large-scale RSI dataset based on an ensemble single-path training strategy. Next, the pre-trained supernet is equipped with different recognition heads through a switchable recognition module and respectively fine-tuned on the target dataset to obtain task-specific supernets. Finally, we search the optimal backbone architecture for different recognition tasks based on the evolutionary algorithm without any network training. Extensive experiments have been conducted on five benchmark datasets for different recognition tasks; the results show the effectiveness of the proposed search paradigm and demonstrate that the searched backbone is able to flexibly adapt to different RSI recognition tasks and achieve impressive performance.

Segmentation | Semantics (4 papers)

【1】 A Contrastive Distillation Approach for Incremental Semantic Segmentation in Aerial Images
Link: https://arxiv.org/abs/2112.03814

Authors: Edoardo Arnaudo, Fabio Cermelli, Antonio Tavera, Claudio Rossi, Barbara Caputo
Note: 12 pages, ICIAP 2021
Abstract: Incremental learning represents a crucial task in aerial image processing, especially given the limited availability of large-scale annotated datasets. A major issue concerning current deep neural architectures is known as catastrophic forgetting, namely the inability to faithfully maintain past knowledge once a new set of data is provided for retraining. Over the years, several techniques have been proposed to mitigate this problem for image classification and object detection. However, only recently has the focus shifted towards more complex downstream tasks such as instance or semantic segmentation. Starting from incremental-class learning for semantic segmentation tasks, our goal is to adapt this strategy to the aerial domain, exploiting a peculiar feature that differentiates it from natural images, namely the orientation. In addition to the standard knowledge distillation approach, we propose a contrastive regularization, where any given input is compared with its augmented version (i.e. flipping and rotations) in order to minimize the difference between the segmentation features produced by both inputs. We show the effectiveness of our solution on the Potsdam dataset, outperforming the incremental baseline in every test. Code available at: https://github.com/edornd/contrastive-distillation.
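
The contrastive regularization exploits the fact that flips and rotations are label-preserving for overhead imagery: the segmentation features of an input and of its augmented copy, mapped back to the original geometry, should agree. A sketch for the horizontal-flip case; the distance function and weighting are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def flip_consistency(model, x):
    """Compare segmentation features of an image and its flipped copy,
    after undoing the flip on the output, and penalize the difference."""
    f = model(x)                              # (B, C, H, W) features/logits
    f_aug = model(torch.flip(x, dims=[3]))    # same model on flipped input
    f_aug = torch.flip(f_aug, dims=[3])       # map back to original geometry
    return F.mse_loss(f, f_aug)
```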

【2】 Deep Level Set for Box-supervised Instance Segmentation in Aerial Images
Link: https://arxiv.org/abs/2112.03451

Authors: Wentong Li, Yijie Chen, Wenyu Liu, Jianke Zhu
Affiliations: Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
Note: 10 pages, 5 figures
Abstract: Box-supervised instance segmentation has recently attracted lots of research effort, while little attention has been received in the aerial image domain. In contrast to general object collections, aerial objects have large intra-class variance and inter-class similarity with complex backgrounds. Moreover, there are many tiny objects in high-resolution satellite images. This makes the recent pairwise affinity modeling method inevitably involve noisy supervision with inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach, which drives the network to learn a series of level set functions for the aerial objects with only box annotations in an end-to-end fashion. Instead of learning pairwise affinity, the level set method with carefully designed energy functions treats object segmentation as curve evolution, which is able to accurately recover the object's boundaries and prevent interference from indistinguishable background and similar objects. The experimental results demonstrate that the proposed approach outperforms state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/LiWentomng/boxlevelset.
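
Treating segmentation as curve evolution means the network can be supervised, inside each annotated box, by a region-based level-set energy rather than pixel labels. A sketch of a Chan-Vese style energy over a predicted soft mask; the paper's energy functions are more carefully designed than this minimal form.

```python
import torch

def level_set_energy(pred, img):
    """Region energy: penalize intensity variance inside and outside the
    region delimited by the (smoothed) level set.
    pred: (B, 1, H, W) soft mask in [0, 1]; img: (B, C, H, W) box crop."""
    eps = 1e-6
    area_in = pred.sum(dim=(2, 3), keepdim=True) + eps
    area_out = (1 - pred).sum(dim=(2, 3), keepdim=True) + eps
    c_in = (img * pred).sum(dim=(2, 3), keepdim=True) / area_in     # fg mean
    c_out = (img * (1 - pred)).sum(dim=(2, 3), keepdim=True) / area_out
    e_in = ((img - c_in) ** 2 * pred).sum(dim=(2, 3))
    e_out = ((img - c_out) ** 2 * (1 - pred)).sum(dim=(2, 3))
    return (e_in + e_out).mean()

loss = level_set_energy(torch.rand(2, 1, 64, 64), torch.rand(2, 3, 64, 64))
```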

【3】 Hybrid guiding: A multi-resolution refinement approach for semantic segmentation of gigapixel histopathological images
Link: https://arxiv.org/abs/2112.03455

Authors: André Pedersen, Erik Smistad, Tor V. Rise, Vibeke G. Dale, Henrik S. Pettersen, Tor-Arne S. Nordmo, David Bouget, Ingerid Reinertsen, Marit Valla
Affiliations: Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway; Clinic of Surgery, St. Olavs Hospital, Trondheim University Hospital, Trondheim, Norway
Note: 12 pages, 3 figures
Abstract: Histopathological cancer diagnostics has become more complex, and the increasing number of biopsies is a challenge for most pathology laboratories. Thus, development of automatic methods for evaluation of histopathological cancer sections would be of value. In this study, we used 624 whole slide images (WSIs) of breast cancer from a Norwegian cohort. We propose a cascaded convolutional neural network design, called H2G-Net, for semantic segmentation of gigapixel histopathological images. The design involves a detection stage using a patch-wise method, and a refinement stage using a convolutional autoencoder. To validate the design, we conducted an ablation study to assess the impact of selected components in the pipeline on tumour segmentation. Guiding segmentation, using hierarchical sampling and deep heatmap refinement, proved to be beneficial when segmenting the histopathological images. We found a significant improvement when using a refinement network for postprocessing the generated tumour segmentation heatmaps. The overall best design achieved a Dice score of 0.933 on an independent test set of 90 WSIs. The design outperformed single-resolution approaches, such as cluster-guided, patch-wise high-resolution classification using MobileNetV2 (0.872) and a low-resolution U-Net (0.874). In addition, segmentation on a representative x400 WSI took ~58 seconds, using only the CPU. The findings demonstrate the potential of utilizing a refinement network to improve patch-wise predictions. The solution is efficient and does not require overlapping patch inference or ensembling. Furthermore, we showed that deep neural networks can be trained using a random sampling scheme that balances on multiple different labels simultaneously, without the need of storing patches on disk. Future work should involve more efficient patch generation and sampling, as well as improved clustering.

【4】 Quality control for more reliable integration of deep learning-based image segmentation into medical workflows
Link: https://arxiv.org/abs/2112.03277

Authors: Elena Williams, Sebastian Niehaus, Janis Reinelt, Alberto Merola, Paul Glad Mihai, Ingo Roeder, Nico Scherf, Maria del C. Valdés Hernández
Affiliations: AICURA medical, Berlin, Germany; Centre for Clinical Brain Sciences, University of Edinburgh; Institute for Medical Informatics and Biometry, Technische Universität Dresden, Dresden, Germany
Note: 25 pages
Abstract: Machine learning algorithms underpin modern diagnostic-aiding software, which has proved valuable in clinical practice, particularly in radiology. However, inaccuracies, mainly due to the limited availability of clinical samples for training these algorithms, hamper their wider applicability, acceptance, and recognition amongst clinicians. We present an analysis of state-of-the-art automatic quality control (QC) approaches that can be implemented within these algorithms to estimate the certainty of their outputs. We validated the most promising approaches on a brain image segmentation task identifying white matter hyperintensities (WMH) in magnetic resonance imaging data. WMH are a correlate of small vessel disease common in mid-to-late adulthood and are particularly challenging to segment due to their varied size and distributional patterns. Our results show that the aggregation of uncertainty and Dice prediction were most effective for failure detection on this task. Both methods independently improved mean Dice from 0.82 to 0.84. Our work reveals how QC methods can help to detect failed segmentation cases and therefore make automatic segmentation more reliable and suitable for clinical practice.

Zero/Few-Shot | Transfer | Domain Adaptation (5 papers)

【1】 Domain Generalization via Progressive Layer-wise and Channel-wise Dropout
Link: https://arxiv.org/abs/2112.03676

Authors: Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; National Institute of Healthcare Data Science, Nanjing University; Key Lab of Computer Network and Information Integration, Southeast University
Abstract: By training a model on multiple observed source domains, domain generalization aims to generalize well to arbitrary unseen target domains without further training. Existing works mainly focus on learning domain-invariant features to improve the generalization ability. However, since the target domain is not available during training, previous methods inevitably suffer from overfitting on the source domains. To tackle this issue, we develop an effective dropout-based framework to enlarge the region of the model's attention, which can effectively mitigate the overfitting problem. In particular, different from the typical dropout scheme, which normally conducts dropout on a fixed layer, we first randomly select one layer and then randomly select its channels to conduct dropout. Besides, we leverage a progressive scheme to increase the dropout ratio during training, which gradually boosts the difficulty of training the model to enhance its robustness. Moreover, to further alleviate the impact of the overfitting issue, we leverage augmentation schemes on the image level and feature level to yield a strong baseline model. We conduct extensive experiments on multiple benchmark datasets, which show that our method outperforms state-of-the-art methods.
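
This scheme differs from standard dropout in two ways: the layer is chosen at random on each forward pass, and the drop ratio grows over training. A PyTorch sketch, with the linear schedule and `p_max` as assumed hyperparameters:

```python
import random
import torch
import torch.nn as nn

class ProgressiveLayerChannelDropout(nn.Module):
    """Pick one layer at random each forward pass and drop a random subset
    of its output channels; the drop ratio grows linearly over training.
    Assumes 4D (B, C, H, W) activations between layers."""
    def __init__(self, layers, p_max=0.33, total_steps=10000):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p_max, self.total_steps = p_max, total_steps
        self.step = 0

    def forward(self, x):
        target = random.randrange(len(self.layers)) if self.training else -1
        p = self.p_max * min(1.0, self.step / self.total_steps)
        if self.training:
            self.step += 1
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == target and p > 0:
                keep = (torch.rand(1, x.shape[1], 1, 1, device=x.device) > p).float()
                x = x * keep / (1 - p)   # inverted-dropout rescaling
        return x
```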

【2】 Parallel Discrete Convolutions on Adaptive Particle Representations of Images
Link: https://arxiv.org/abs/2112.03592

Authors: Joel Jonsson, Bevan L. Cheeseman, Suryanarayana Maddu, Krzysztof Gonciarz, Ivo F. Sbalzarini
Affiliations: Technische Universität Dresden, Dresden, Germany; Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Center for Systems Biology Dresden, Dresden, Germany
Note: 18 pages, 13 figures
Abstract: We present data structures and algorithms for native implementations of discrete convolution operators over Adaptive Particle Representations (APR) of images on parallel computer architectures. The APR is a content-adaptive image representation that locally adapts the sampling resolution to the image signal. It has been developed as an alternative to pixel representations for large, sparse images as they typically occur in fluorescence microscopy. It has been shown to reduce the memory and runtime costs of storing, visualizing, and processing such images. This, however, requires that image processing natively operates on APRs, without intermediately reverting to pixels. Designing efficient and scalable APR-native image processing primitives, however, is complicated by the APR's irregular memory structure. Here, we provide the algorithmic building blocks required to efficiently and natively process APR images using a wide range of algorithms that can be formulated in terms of discrete convolutions. We show that APR convolution naturally leads to scale-adaptive algorithms that efficiently parallelize on multi-core CPU and GPU architectures. We quantify the speedups in comparison to pixel-based algorithms and convolutions on evenly sampled data. We achieve pixel-equivalent throughputs of up to 1 TB/s on a single Nvidia GeForce RTX 2080 gaming GPU, requiring up to two orders of magnitude less memory than a pixel-based implementation.

【3】 Learning Instance and Task-Aware Dynamic Kernels for Few Shot Learning
Link: https://arxiv.org/abs/2112.03494

Authors: Rongkai Ma, Pengfei Fang, Gil Avraham, Yan Zuo, Tom Drummond, Mehrtash Harandi
Affiliations: Monash University; Australian National University; CSIRO; The University of Melbourne
Abstract: Learning and generalizing to novel concepts with few samples (few-shot learning) is still an essential challenge for real-world applications. A principled way of achieving few-shot learning is to realize a model that can rapidly adapt to the context of a given task. Dynamic networks have been shown capable of learning content-adaptive parameters efficiently, making them suitable for few-shot learning. In this paper, we propose to learn the dynamic kernels of a convolution network as a function of the task at hand, enabling faster generalization. To this end, we obtain our dynamic kernels based on the entire task and each sample, and develop a mechanism further conditioning on each individual channel and position independently. This results in dynamic kernels that simultaneously attend to the global information whilst also considering the minuscule details available. We empirically show that our model improves performance on few-shot classification and detection tasks, achieving a tangible improvement over several baseline models. This includes state-of-the-art results on four few-shot classification benchmarks: mini-ImageNet, tiered-ImageNet, CUB and FC100, and competitive results on a few-shot detection dataset: MS COCO-PASCAL-VOC.
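
The basic mechanism is a small generator that maps a task embedding to convolution kernels, so the feature extractor adapts per episode. A minimal depthwise sketch; the paper additionally conditions on each sample, channel, and position, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareDynamicConv(nn.Module):
    """Generate depthwise convolution kernels from a task embedding so the
    same layer applies different filters in different few-shot episodes."""
    def __init__(self, dim, emb_dim, k=3):
        super().__init__()
        self.k, self.dim = k, dim
        # Depthwise kernels keep the generator small: dim * k * k values.
        self.gen = nn.Linear(emb_dim, dim * k * k)

    def forward(self, x, task_emb):          # x: (B, dim, H, W); task_emb: (emb_dim,)
        w = self.gen(task_emb).view(self.dim, 1, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2, groups=self.dim)

layer = TaskAwareDynamicConv(dim=64, emb_dim=128)
out = layer(torch.randn(5, 64, 16, 16), torch.randn(128))
```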

【4】 Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching
Link: https://arxiv.org/abs/2112.03696

Authors: Kwanyoung Kim, Taesung Kwon, Jong Chul Ye
Affiliations: Department of Bio and Brain Engineering, Kim Jaechul Graduate School of AI, Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST)
Abstract: Tweedie distributions are a special case of exponential dispersion models, which are often used in classical statistics as distributions for generalized linear models. Here, we reveal that Tweedie distributions also play key roles in the modern deep learning era, leading to a distribution-independent self-supervised image denoising formula without clean reference images. Specifically, by combining the recent Noise2Score self-supervised image denoising approach with the saddle-point approximation of Tweedie distributions, we can provide a general closed-form denoising formula that can be used for large classes of noise distributions without ever knowing the underlying noise distribution. Similar to the original Noise2Score, the new approach is composed of two successive steps: score matching using perturbed noisy images, followed by a closed-form image denoising formula via the distribution-independent Tweedie's formula. This also suggests a systematic algorithm to estimate the noise model and noise parameters for a given noisy image dataset. Through extensive experiments, we demonstrate that the proposed method can accurately estimate noise models and parameters, and provides state-of-the-art self-supervised image denoising performance on benchmark and real-world datasets.
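
For additive Gaussian noise, Tweedie's formula gives the posterior mean of the clean image in closed form from the score of the noisy distribution, which is what makes the two-step Noise2Score recipe work. A sketch, assuming `score_fn` is a network already trained by score matching on perturbed noisy images:

```python
import torch

def tweedie_denoise_gaussian(y, score_fn, sigma):
    """Tweedie's formula for additive Gaussian noise: the posterior mean is
    E[x|y] = y + sigma^2 * grad_y log p(y), so denoising is closed-form once
    a score network is available. Analogous closed forms for Poisson/Gamma
    noise follow from the general Tweedie/saddle-point formulation."""
    with torch.no_grad():
        return y + (sigma ** 2) * score_fn(y)

# Usage sketch: denoised = tweedie_denoise_gaussian(noisy, score_net, sigma=25/255)
```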

【5】 Learning Pixel-Adaptive Weights for Portrait Photo Retouching 标题:用于人像照片润色的像素自适应权值学习 链接:https://arxiv.org/abs/2112.03536

作者:Binglu Wang,Chengzhe Lu,Dawei Yan,Yongqiang Zhao 机构: Equal Contribution 备注:Technical report 摘要:人像照片润饰是一项强调人体区域优先和组级一致性的照片润饰任务。基于查找表的方法通过学习图像自适应权重来组合三维查找表(3D LUT)并进行逐像素颜色变换,取得了不错的润饰性能。然而,这一范式忽略了局部上下文线索:当人像像素和背景像素具有相同的原始RGB值时,它会对二者应用相同的变换。相比之下,专家通常会执行不同的操作,分别调整人像区域和背景区域的色温和色调。这启发我们显式地建模局部上下文线索以提高润饰质量。首先,我们考虑一个图像块,并预测像素自适应的查找表权重,以精确地润饰中心像素。其次,由于相邻像素与中心像素的亲和度各不相同,我们估计一个局部注意力掩码来调节相邻像素的影响。第三,通过施加监督可以进一步提高局部注意力掩码的质量,该监督基于由真值人像掩码计算的亲和图。至于组级一致性,我们建议直接约束Lab颜色空间中平均颜色分量的方差。在PPR10K数据集上的大量实验验证了我们方法的有效性,例如在高分辨率照片上,PSNR指标获得超过0.5的增益,而组级一致性指标至少降低2.1。 摘要:Portrait photo retouching is a photo retouching task that emphasizes human-region priority and group-level consistency. The lookup table-based method achieves promising retouching performance by learning image-adaptive weights to combine 3-dimensional lookup tables (3D LUTs) and conducting pixel-to-pixel color transformation. However, this paradigm ignores the local context cues and applies the same transformation to portrait pixels and background pixels when they exhibit the same raw RGB values. In contrast, an expert usually conducts different operations to adjust the color temperatures and tones of portrait regions and background regions. This inspires us to model local context cues to improve the retouching quality explicitly. Firstly, we consider an image patch and predict pixel-adaptive lookup table weights to precisely retouch the center pixel. Secondly, as neighboring pixels exhibit different affinities to the center pixel, we estimate a local attention mask to modulate the influence of neighboring pixels. Thirdly, the quality of the local attention mask can be further improved by applying supervision, which is based on the affinity map calculated by the groundtruth portrait mask. As for group-level consistency, we propose to directly constrain the variance of mean color components in the Lab space. Extensive experiments on PPR10K dataset verify the effectiveness of our method, e.g. on high-resolution photos, the PSNR metric receives over 0.5 gains while the group-level consistency metric obtains at least 2.1 decreases.
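A minimal sketch of the pixel-adaptive LUT fusion step, assuming the basis-LUT outputs and the per-pixel weight map are already computed (the weight predictor and the local attention mask network are stubbed out here):

```python
# Pixel-adaptive fusion of N basis 3D-LUT outputs: every pixel gets its own
# mixing weights, unlike image-adaptive LUT methods that share one weight
# vector for the whole image. Shapes and N are illustrative assumptions.
import torch

B, N, H, W = 1, 3, 64, 64
lut_outputs = torch.rand(B, N, 3, H, W)                  # image passed through N basis LUTs
weights = torch.softmax(torch.rand(B, N, H, W), dim=1)   # per-pixel weights (from a CNN)

# weighted sum over the N LUT branches, independently at every pixel
retouched = (lut_outputs * weights.unsqueeze(2)).sum(dim=1)  # (B, 3, H, W)
print(retouched.shape)
```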

半弱无监督|主动学习|不确定性(7篇)

【1】 STC-mix: Space, Time, Channel mixing for Self-supervised Video Representation 标题:STC-MIX:用于自监督视频表示的空间、时间、通道混合 链接:https://arxiv.org/abs/2112.03906

作者:Srijan Das,Michael S. Ryoo 机构:Stony Brook University 备注:12 pages, codes and model links will be updated soon 摘要:视频对比表征学习在很大程度上依赖于数百万未标注视频的可用性。这对于网络上的视频是可行的,但为现实世界的应用获取如此大规模的视频非常昂贵和费力。因此,在本文中我们专注于为自监督学习设计视频增强:我们首先分析混合视频以创建新增强视频样本的最佳策略。接下来的问题是,我们能否利用视频中的其他模态进行数据混合?为此,我们提出了跨模态流形Cutmix(CMMC),它在特征空间中将一个视频的特征块(tesseract)插入到另一模态的视频特征块中。我们发现,我们的视频混合策略STC-mix——即先对视频进行初步混合,再跨视频中的不同模态执行CMMC——提高了所学视频表示的质量。我们在两个小规模视频数据集UCF101和HMDB51上,对动作识别和视频检索这两个下游任务进行了全面的实验。我们还在领域知识有限的NTU数据集上展示了STC-mix的有效性。我们表明,STC-mix在两个下游任务上的性能与其他自监督方法相当,同时需要更少的训练数据。 摘要:Contrastive representation learning of videos highly relies on the availability of millions of unlabelled videos. This is practical for videos available on web but acquiring such large scale of videos for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning, we first analyze the best strategy to mix videos to create a new augmented video sample. Then, the question remains, can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments for two downstream tasks: action recognition and video retrieval on two small scale video datasets UCF101, and HMDB51. We also demonstrate the effectiveness of our STC-mix on NTU dataset where domain knowledge is limited. We show that the performance of our STC-mix on both the downstream tasks is on par with the other self-supervised approaches while requiring less training data.
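A rough sketch of a cross-modal feature-space cutmix in the spirit of CMMC; the region sampling follows the standard CutMix recipe and is an assumption, not necessarily the paper's exact scheme:

```python
# Feature-space cutmix across two modalities (e.g. RGB and optical-flow
# feature "tesseracts" of shape (C, T, H, W)): a spatial region of one
# modality's features is pasted into the other's.
import torch

def cross_modal_cutmix(feat_a, feat_b, lam=0.6):
    C, T, H, W = feat_a.shape
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    y = torch.randint(0, H - cut_h + 1, (1,)).item()
    x = torch.randint(0, W - cut_w + 1, (1,)).item()
    mixed = feat_a.clone()
    mixed[:, :, y:y + cut_h, x:x + cut_w] = feat_b[:, :, y:y + cut_h, x:x + cut_w]
    return mixed

rgb_feat, flow_feat = torch.randn(256, 8, 14, 14), torch.randn(256, 8, 14, 14)
print(cross_modal_cutmix(rgb_feat, flow_feat).shape)  # torch.Size([256, 8, 14, 14])
```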

【2】 ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints 标题:ViewCLR:学习不可见视点的自监督视频表示 链接:https://arxiv.org/abs/2112.03905

作者:Srijan Das,Michael S. Ryoo 机构:Stony Brook University 备注:13 pages, Codes and models will be updated soon 摘要:学习自监督视频表示的方法主要侧重于判别由简单数据增强方案生成的实例。然而,学到的表示往往无法泛化到未见过的相机视点。为此,我们提出了ViewCLR,它学习对相机视点变化不变的自监督视频表示。我们引入了一个视点生成器(view-generator),它可以被视为适用于任何自监督前置任务(pretext task)的可学习增强,用于生成视频的潜在视点表示。ViewCLR最大化潜在视点表示与其原始视点表示之间的相似性,使学到的视频编码器能够泛化到未见过的相机视点。在包括NTU RGB+D数据集在内的跨视点基准数据集上的实验表明,ViewCLR是一种最先进的视点不变自监督方法。 摘要:Learning self-supervised video representation predominantly focuses on discriminating instances generated from simple data augmentation schemes. However, the learned representation often fails to generalize over unseen camera viewpoints. To this end, we propose ViewCLR, that learns self-supervised video representation invariant to camera viewpoint changes. We introduce a view-generator that can be considered as a learnable augmentation for any self-supervised pre-text tasks, to generate latent viewpoint representation of a video. ViewCLR maximizes the similarities between the latent viewpoint representation with its representation from the original viewpoint, enabling the learned video encoder to generalize over unseen camera viewpoints. Experiments on cross-view benchmark datasets including NTU RGB+D dataset show that ViewCLR stands as a state-of-the-art viewpoint invariant self-supervised method.

【3】 Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning 标题:通过归一化流抑制静态视觉线索的自监督视频表征学习 链接:https://arxiv.org/abs/2112.03803

作者:Manlin Zhang,Jinpeng Wang,Andy J. Ma 机构:School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China 备注:AAAI2022 摘要:尽管深卷积神经网络在视频理解方面取得了巨大的进步,但现有方法学习的特征表示可能偏向于静态视觉线索。为了解决这个问题,我们提出了一种基于概率分析的静态视觉线索抑制方法(SSVC),用于自监督视频表示学习。在我们的方法中,首先对视频帧进行编码,通过归一化流获得标准正态分布下的潜在变量。通过将视频中的静态因素建模为随机变量,每个潜在变量的条件分布将变为平移和标度正态分布。然后,选择随时间变化较小的潜在变量作为静态线索并进行抑制以生成运动保留视频。最后,通过运动保留视频构建正对进行对比学习,以缓解对静态线索的表征偏差问题。较少偏向的视频表示可以更好地推广到各种下游任务。在公开的基准上进行的大量实验表明,当仅使用单个RGB模态进行预训练时,所提出的方法优于现有的方法。 摘要:Despite the great progress in video understanding made by deep convolutional neural networks, feature representation learned by existing methods may be biased to static visual cues. To address this issue, we propose a novel method to suppress static visual cues (SSVC) based on probabilistic analysis for self-supervised video representation learning. In our method, video frames are first encoded to obtain latent variables under standard normal distribution via normalizing flows. By modelling static factors in a video as a random variable, the conditional distribution of each latent variable becomes shifted and scaled normal. Then, the less-varying latent variables along time are selected as static cues and suppressed to generate motion-preserved videos. Finally, positive pairs are constructed by motion-preserved videos for contrastive learning to alleviate the problem of representation bias to static cues. The less-biased video representation can be better generalized to various downstream tasks. Extensive experiments on publicly available benchmarks demonstrate that the proposed method outperforms the state of the art when only single RGB modality is used for pre-training.
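A toy sketch of the static-cue suppression step: latent dimensions with low temporal variance are treated as static factors and frozen over time. The median threshold and the freezing rule are illustrative assumptions, not the paper's exact criterion:

```python
# Select the least-varying latent dimensions across frames as static cues
# and suppress them; decoding the result through the inverse flow would
# yield a motion-preserved video.
import torch

z = torch.randn(16, 128)                  # latents of 16 frames (from a normalizing flow)
var_t = z.var(dim=0)                      # temporal variance of each latent dim
static = var_t < var_t.median()           # least-varying half = static cues
z_motion = z.clone()
z_motion[:, static] = z[:, static].mean(dim=0)  # freeze static dims over time
print(int(static.sum()), "of", z.shape[1], "dims suppressed")
```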

【4】 TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning 标题:TCGL:用于自监督视频表示学习的时间对比图 链接:https://arxiv.org/abs/2112.03587

作者:Yang Liu,Keze Wang,Lingbo Liu,Haoyuan Lan,Liang Lin 机构: Haoyuan Lan and Liang Lin are with the Schoolof Computer Science and Engineering, Sun Yat-sen University 备注:This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. arXiv admin note: substantial text overlap with arXiv:2101.00820 摘要:视频自监督学习是一项具有挑战性的任务,它需要模型具有强大的表达能力来利用丰富的时空知识,并从大量未标记的视频中生成有效的监督信号。然而,现有的方法无法增加未标记视频的时间多样性,并且忽略了以显式方式精心建模多尺度时间依赖关系。为了克服这些局限性,我们利用视频中的多尺度时间依赖性,提出了一种新的视频自监督学习框架,称为时间对比图学习(TCGL),该算法采用混合图对比学习策略,对时间表示学习中的片段间和片段内时间依赖关系进行联合建模。具体而言,首先引入时空知识发现(STKD)模块,基于离散余弦变换的频域分析从视频中提取运动增强的时空表示。为了明确地建模未标记视频的多尺度时间依赖关系,我们的TCGL将帧和片段顺序的先验知识集成到图结构中,即片段内/片段间时间对比图(TCG)。然后,设计了特定的对比学习模块,以最大限度地提高不同图形视图中节点之间的一致性。为了生成未标记视频的监控信号,我们引入了自适应片段顺序预测(ASOP)模块,该模块利用视频片段之间的关系知识来学习全局上下文表示,并自适应地重新校准通道特征。实验结果表明,我们的TCGL在大规模动作识别和视频检索基准测试中优于最新的方法。 摘要:Video self-supervised learning is a challenging task, which requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and ignore elaborately modeling multi-scale temporal dependencies in an explicit way. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and proposes a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models the inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on the frequency domain analysis of discrete cosine transform. To explicitly model multi-scale temporal dependencies of unlabeled videos, our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module which leverages the relational knowledge among video snippets to learn the global context representation and recalibrate the channel-wise features adaptively. Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.

【5】 Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints 标题:多个未指定视点下组合式场景表示的无监督学习 链接:https://arxiv.org/abs/2112.03568

作者:Jinyang Yuan,Bin Li,Xiangyang Xue 备注:AAAI 2022 摘要:视觉场景具有极其丰富的多样性,这不仅因为物体和背景的组合有无限多种,还因为同一场景的观测可能随视点的变化而发生巨大变化。当从多个视点观察包含多个物体的视觉场景时,人类能够从每个视点以组合的方式感知场景,同时在不同视点之间保持所谓的"物体恒常性",即使确切的视点并未告知。这种能力对于人类在移动中识别同一物体以及高效地从视觉中学习至关重要。设计具有类似能力的模型是很有吸引力的。在本文中,我们考虑了一个新问题:在不使用任何监督的情况下,从多个未指定的视点学习组合式场景表示;并提出了一种深度生成模型,它将潜在表示分离为与视点无关的部分和与视点相关的部分来解决该问题。在推断潜在表示时,不同视点所包含的信息通过神经网络进行迭代整合。在几个专门设计的合成数据集上的实验表明,所提方法能够有效地从多个未指定的视点学习。 摘要:Visual scenes are extremely rich in diversity, not only because there are infinite combinations of objects and background, but also because the observations of the same scene may vary greatly with the change of viewpoints. When observing a visual scene that contains multiple objects from multiple viewpoints, humans are able to perceive the scene in a compositional way from each viewpoint, while achieving the so-called "object constancy" across different viewpoints, even though the exact viewpoints are untold. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have the similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision, and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. To infer latent representations, the information contained in different viewpoints is iteratively integrated by neural networks. Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.

【6】 Self-Supervised Camera Self-Calibration from Video 标题:基于视频的自监督摄像机自标定 链接:https://arxiv.org/abs/2112.03325

作者:Jiading Fang,Igor Vasiljevic,Vitor Guizilini,Rares Ambrus,Greg Shakhnarovich,Adrien Gaidon,Matthew R. Walter 摘要:摄像机标定是机器人技术和计算机视觉算法的一个组成部分,这些算法试图从视觉输入流推断场景的几何特性。在实践中,标定是一个费力的过程,需要专门的数据收集和仔细调整。每当摄像机参数发生变化时,必须重复该过程,而这对于移动机器人和自动驾驶车辆来说经常发生。相比之下,自监督深度和自我运动估计方法可以通过推断优化视图合成目标的逐帧投影模型来绕过显式标定。在本文中,我们扩展了这种方法,以从野外原始视频中显式标定各种类型的相机。我们提出了一种学习算法,使用一个高效的通用相机模型族来回归每个序列的标定参数。我们的方法实现了具有亚像素重投影误差的自标定结果,优于其他基于学习的方法。我们在各种相机几何上验证了我们的方法,包括透视、鱼眼和折反射相机。最后,我们表明,我们的方法改进了深度估计这一下游任务,在EuRoC数据集上以高于同期方法的计算效率取得了最先进的结果。 摘要:Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

【7】 Organ localisation using supervised and semi supervised approaches combining reinforcement learning with imitation learning 标题:结合强化学习和模仿学习的监督和半监督器官定位方法 链接:https://arxiv.org/abs/2112.03276

作者:Sankaran Iyer,Alan Blair,Laughlin Dawes,Daniel Moses,Christopher White,Arcot Sowmya 机构:School of Computer Science and Engineering, University of New South Wales Kensington, NSW , Department of Medical Imaging, Prince of Wales Hospital, NSW, Australia, Department of Endocrinology and Metabolism, Prince of Wales Hospital, NSW, Australia 备注:16 pages, 12 figures 摘要:计算机辅助诊断通常需要在放射学扫描中分析感兴趣区域(ROI),ROI可能是一个器官或亚器官。尽管深度学习算法的性能优于其他方法,但它们依赖于大量注释数据的可用性。出于解决这一局限性的需要,本文提出了一种基于监督和半监督学习的多器官定位和检测方法。它借鉴了作者先前在CT图像中定位胸椎和腰椎区域的工作。该方法生成感兴趣器官的六个边界框,然后将它们融合到一个边界框中。使用监督和半监督学习(SSL)对CT图像中的脾脏、左肾和右肾进行定位的实验结果表明,与其他最先进的方法相比,使用更小的数据集和更少的注释可以解决数据限制问题。使用三种不同的标记和未标记数据(即30:70、35:65、40:60)分别对腰椎、脾脏、左肾和右肾的SSL性能进行评估。结果表明,SSL提供了一种可行的替代方案,特别是在难以获得注释数据的医学成像中。 摘要:Computer aided diagnostics often requires analysis of a region of interest (ROI) within a radiology scan, and the ROI may be an organ or a suborgan. Although deep learning algorithms have the ability to outperform other methods, they rely on the availability of a large amount of annotated data. Motivated by the need to address this limitation, an approach to localisation and detection of multiple organs based on supervised and semi-supervised learning is presented here. It draws upon previous work by the authors on localising the thoracic and lumbar spine region in CT images. The method generates six bounding boxes of organs of interest, which are then fused to a single bounding box. The results of experiments on localisation of the Spleen, Left and Right Kidneys in CT Images using supervised and semi supervised learning (SSL) demonstrate the ability to address data limitations with a much smaller data set and fewer annotations, compared to other state-of-the-art methods. The SSL performance was evaluated using three different mixes of labelled and unlabelled data (i.e.30:70,35:65,40:60) for each of lumbar spine, spleen left and right kidneys respectively. The results indicate that SSL provides a workable alternative especially in medical imaging where it is difficult to obtain annotated data.

时序|行为识别|姿态|视频|运动估计(1篇)

【1】 Time-Equivariant Contrastive Video Representation Learning 标题:时间等变对比视频表征学习 链接:https://arxiv.org/abs/2112.03624

作者:Simon Jenni,Hailin Jin 机构:Adobe Research 备注:ICCV 2021 (oral) 摘要:我们提出了一种新的自监督对比学习方法,用于从未标注视频中学习表示。现有方法忽略了输入失真的具体形式,例如通过学习对时间变换的不变性。相反,我们认为视频表示应当保留视频动态并反映对输入的时间操作。因此,我们利用新的约束来构建对时间变换等变、且能更好捕获视频动态的表示。在我们的方法中,视频增强片段之间的相对时间变换被编码在一个向量中,并与其他变换向量进行对比。为了支持时间等变学习,我们还提出了对一个视频的两个片段进行自监督分类:1)重叠;2)有序;或3)无序。我们的实验表明,时间等变表示在UCF101、HMDB51和Diving48的视频检索和动作识别基准上取得了最先进的结果。 摘要:We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos. Existing approaches ignore the specifics of input distortions, e.g., by learning invariance to temporal transformations. Instead, we argue that video representation should preserve video dynamics and reflect temporal manipulations of the input. Therefore, we exploit novel constraints to build representations that are equivariant to temporal transformations and better capture video dynamics. In our method, relative temporal transformations between augmented clips of a video are encoded in a vector and contrasted with other transformation vectors. To support temporal equivariance learning, we additionally propose the self-supervised classification of two clips of a video into 1. overlapping 2. ordered, or 3. unordered. Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks on UCF101, HMDB51, and Diving48.
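One plausible reading of the auxiliary three-way clip-pair task, sketched below from frame intervals alone (the paper's exact labeling rule may differ):

```python
# Label a pair of sampled clips from the same video as overlapping, ordered,
# or unordered, using only their frame intervals. Intervals are [start, end).
def pair_relation(start1, end1, start2, end2):
    """Return 0=overlapping, 1=ordered (clip1 before clip2), 2=unordered."""
    if end1 > start2 and end2 > start1:
        return 0          # intervals intersect -> overlapping
    if end1 <= start2:
        return 1          # clip1 precedes clip2 -> ordered
    return 2              # reversed order -> treated as "unordered" here

print(pair_relation(0, 16, 8, 24))   # 0
print(pair_relation(0, 16, 32, 48))  # 1
print(pair_relation(32, 48, 0, 16))  # 2
```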

GAN|对抗|攻击|生成相关(4篇)

【1】 Generation of Non-Deterministic Synthetic Face Datasets Guided by Identity Priors 标题:基于身份先验的非确定性合成人脸数据集的生成 链接:https://arxiv.org/abs/2112.03632

作者:Marcel Grimmer,Haoyu Zhang,Raghavendra Ramachandra,Kiran Raja,Christoph Busch 机构: NBL - Norwegian Biometrics Laboratory, NTNU, Norway, dasec - Biometrics and Internet Security Research Group, HDA, Germany 备注:None 摘要:通过人脸识别实现高度安全的应用(如边境检查)需要在大规模数据上进行广泛的生物特征性能测试。然而,使用真实人脸图像会引起隐私担忧,因为法律不允许将这些图像用于最初计划之外的其他目的。使用具有代表性的人脸数据及其子集也可能导致不必要的人口统计偏差,并造成数据集不平衡。克服这些问题的一个可能方案是用合成生成的样本替换真实人脸图像。虽然合成图像生成得益于计算机视觉的最新进展,但生成同一合成身份的、具有类似真实世界变化的多个样本(即配对样本)的问题仍未解决。本文提出了一种利用StyleGAN结构良好的潜在空间生成配对人脸图像的非确定性方法。配对样本通过操纵潜在向量生成;更准确地说,我们利用主成分分析(PCA)在潜在空间中定义语义上有意义的方向,并使用预训练的人脸识别系统控制原始样本与配对样本之间的相似性。我们创建了一个新的合成人脸图像数据集(SymFace),由77034个样本组成,包含25919个合成身份。通过使用成熟的人脸图像质量指标进行分析,我们展示了模拟真实生物特征数据特性的合成样本在生物特征质量上的差异。分析及其结果表明,使用所提方法创建的合成样本可以作为真实生物特征数据的可行替代。 摘要:Enabling highly secure applications (such as border crossing) with face recognition requires extensive biometric performance tests through large scale data. However, using real face images raises concerns about privacy as the laws do not allow the images to be used for other purposes than originally intended. Using representative and subsets of face data can also lead to unwanted demographic biases and cause an imbalance in datasets. One possible solution to overcome these issues is to replace real face images with synthetically generated samples. While generating synthetic images has benefited from recent advancements in computer vision, generating multiple samples of the same synthetic identity resembling real-world variations is still unaddressed, i.e., mated samples. This work proposes a non-deterministic method for generating mated face images by exploiting the well-structured latent space of StyleGAN. Mated samples are generated by manipulating latent vectors, and more precisely, we exploit Principal Component Analysis (PCA) to define semantically meaningful directions in the latent space and control the similarity between the original and the mated samples using a pre-trained face recognition system. We create a new dataset of synthetic face images (SymFace) consisting of 77,034 samples including 25,919 synthetic IDs. Through our analysis using well-established face image quality metrics, we demonstrate the differences in the biometric quality of synthetic samples mimicking characteristics of real biometric data. The analysis and results thereof indicate the use of synthetic samples created using the proposed approach as a viable alternative to replacing real biometric data.
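A hedged sketch of the latent-space manipulation: PCA directions are estimated from sampled latent codes and used to perturb an identity's latent. The pre-trained StyleGAN generator and the face-recognition similarity check are left as placeholders:

```python
# Estimate semantically dominant directions in the latent space via PCA
# (computed here with an SVD) and perturb a latent code along one of them
# to produce a candidate mated sample.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10_000, 512))                  # sampled latent codes (stand-in)
W_c = W - W.mean(axis=0)
_, _, Vt = np.linalg.svd(W_c, full_matrices=False)  # rows of Vt = PCA directions

def mated_latent(w, strength=2.0, top_k=10):
    direction = Vt[rng.integers(top_k)]             # one of the top-k directions
    return w + strength * direction

# acceptance test (pseudo): keep the candidate only if
# cosine(face_embed(G(w)), face_embed(G(w_mated))) > tau
w0 = W[0]
w1 = mated_latent(w0)
print(np.linalg.norm(w1 - w0))
```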

【2】 CG-NeRF: Conditional Generative Neural Radiance Fields 标题:CG-NeRF:条件生成神经辐射场 链接:https://arxiv.org/abs/2112.03517

作者:Kyungmin Jo,Gyumin Shim,Sanghun Jung,Soyoung Yang,Jaegul Choo 机构:KAIST, Graduate school of AI, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, Identical Condition with Different Noise Codes, Text, Color Image, Grayscale, Sketch, Low-Resolution 摘要:虽然最近基于NeRF的生成模型实现了不同3D感知图像的生成,但这些方法在生成包含用户指定特征的图像时存在局限性。在本文中,我们提出了一种新的模型,称为条件生成神经辐射场(CG NeRF),它可以生成反映额外输入条件(如图像或文本)的多视图图像。在保持给定输入条件的共同特征的同时,所提出的模型可以生成细节各异的图像。我们提出:1)一种新的统一架构,它将形状和外观从各种形式的给定条件中分离出来;2)姿势一致性多样性损失,用于生成多模态输出,同时保持视图的一致性。实验结果表明,与现有的基于NeRF的生成模型相比,该方法在各种条件下都保持了一致的图像质量,并获得了更好的保真度和多样性。 摘要:While recent NeRF-based generative models achieve the generation of diverse 3D-aware images, these approaches have limitations when generating images that contain user-specified characteristics. In this paper, we propose a novel model, referred to as the conditional generative neural radiance fields (CG-NeRF), which can generate multi-view images reflecting extra input conditions such as images or texts. While preserving the common characteristics of a given input condition, the proposed model generates diverse images in fine detail. We propose: 1) a novel unified architecture which disentangles the shape and appearance from a condition given in various forms and 2) the pose-consistent diversity loss for generating multimodal outputs while maintaining consistency of the view. Experimental results show that the proposed method maintains consistent image quality on various condition types and achieves superior fidelity and diversity compared to existing NeRF-based generative models.

【3】 A Generic Approach for Enhancing GANs by Regularized Latent Optimization 标题:通过正则化潜在优化增强GAN的一种通用方法 链接:https://arxiv.org/abs/2112.03502

作者:Yufan Zhou,Chunyuan Li,Changyou Chen,Jinhui Xu 机构:State University of New York at Buffalo, Microsoft Research, Redmond 摘要:随着模型复杂度和数据量的快速增长,训练深度生成模型(deep generative models, DGMs)以获得更好的性能已成为日益重要的挑战。以往针对该问题的研究主要集中在通过引入新的目标函数或设计更具表达力的模型架构来改进DGM。然而,这类方法通常会带来明显更多的计算和/或设计开销。为了解决这些问题,我们在本文中介绍了一个名为"生成模型推理"(generative-model inference)的通用框架,它能够在各种应用场景中有效且无缝地增强预训练GAN。我们的基本思想是使用Wasserstein梯度流技术,为给定需求高效地推断最优潜在分布,而不是重新训练或微调预训练模型的参数。在图像生成、图像翻译、文本到图像生成、图像修复和文本引导图像编辑等应用上的大量实验结果表明了我们所提框架的有效性和优越性。 摘要:With the rapidly growing model complexity and data volume, training deep generative models (DGMs) for better performance has become an increasingly important challenge. Previous research on this problem has mainly focused on improving DGMs by either introducing new objective functions or designing more expressive model architectures. However, such approaches often introduce significantly more computational and/or designing overhead. To resolve such issues, we introduce in this paper a generic framework called generative-model inference that is capable of enhancing pre-trained GANs effectively and seamlessly in a variety of application scenarios. Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques, instead of re-training or fine-tuning pre-trained model parameters. Extensive experimental results on applications like image generation, image translation, text-to-image generation, image inpainting, and text-guided image editing suggest the effectiveness and superiority of our proposed framework.
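A simplified sketch of inference-time latent optimization, i.e. a discretized gradient flow over the latent code with a prior regularizer; the tiny frozen "generator" and the quadratic task loss are stand-ins for a pre-trained GAN and a real task objective:

```python
# Enhance a frozen generator at inference time by optimizing the latent code
# against a task loss plus a regularizer keeping z near the latent prior.
import torch

G = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.Tanh())  # frozen "generator"
for p in G.parameters():
    p.requires_grad_(False)

target = torch.randn(256)                    # e.g. observed pixels to match
z = torch.zeros(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    opt.zero_grad()
    task_loss = (G(z) - target).pow(2).mean()  # e.g. masked reconstruction
    prior_reg = 1e-3 * z.pow(2).mean()         # stay close to the latent prior
    (task_loss + prior_reg).backward()
    opt.step()
print(float(task_loss))
```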

【4】 Top-Down Deep Clustering with Multi-generator GANs 标题:基于多生成器GAN的自上而下深度聚类 链接:https://arxiv.org/abs/2112.03398

作者:Daniel de Mello,Renato Assunção,Fabricio Murai 机构: Department of Computer Science, Universidade Federal de Minas Gerais, ESRI Inc. 备注:Accepted to AAAI 2021 摘要:深度聚类(DC)利用深度架构的表示能力来学习最适合聚类分析的嵌入空间。这种方法滤除了与聚类无关的低级信息,并已在高维数据空间中被证明非常成功。一些DC方法采用生成对抗网络(GAN),其动机在于这类模型能够隐式学习强大的潜在表示。在这项工作中,我们提出了HC-MGAN,一种基于多生成器GAN(MGAN)的新技术,而MGAN此前尚未被用于聚类。我们的方法受以下观察启发:MGAN的每个生成器都倾向于生成与真实数据分布某个子区域相关的数据。我们利用这种聚类式的生成来训练一个分类器,用于推断给定图像来自哪个生成器,从而为真实分布提供语义上有意义的聚类。此外,我们将方法设计为在自上而下的层次聚类树中执行,据我们所知,这是首个层次化的DC方法。我们进行了多组实验,将所提方法与近期的DC方法进行比较,获得了有竞争力的结果。最后,我们对层次聚类树进行了探索性分析,重点展示了它如何准确地将数据组织成语义一致模式的层次结构。 摘要:Deep clustering (DC) leverages the representation power of deep architectures to learn embedding spaces that are optimal for cluster analysis. This approach filters out low-level information irrelevant for clustering and has proven remarkably successful for high dimensional data spaces. Some DC methods employ Generative Adversarial Networks (GANs), motivated by the powerful latent representations these models are able to learn implicitly. In this work, we propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs), which have not been explored for clustering. Our method is inspired by the observation that each generator of a MGAN tends to generate data that correlates with a sub-region of the real data distribution. We use this clustered generation to train a classifier for inferring from which generator a given image came from, thus providing a semantically meaningful clustering for the real distribution. Additionally, we design our method so that it is performed in a top-down hierarchical clustering tree, thus proposing the first hierarchical DC method, to the best of our knowledge. We conduct several experiments to evaluate the proposed method against recent DC methods, obtaining competitive results. Last, we perform an exploratory analysis of the hierarchical clustering tree that highlights how accurately it organizes the data in a hierarchy of semantically coherent patterns.

自动驾驶|车辆|车道检测等(2篇)

【1】 Vehicle trajectory prediction works, but not everywhere 标题:车辆轨迹预测有效,但不是所有地方都有效 链接:https://arxiv.org/abs/2112.03909

作者:Mohammadhossein Bahari,Saeed Saadatnejad,Ahmad Rahimi,Mohammad Shaverdikondori,Mohammad Shahidzadeh,Seyed-Mohsen Moosavi-Dezfooli,Alexandre Alahi 机构:EPFL, Sharif university of technology, ETH Zurich 摘要:车辆轨迹预测是当今自动驾驶汽车的一个基本支柱。产业界和研究界都通过运行公共基准确认了这一支柱的必要性。虽然最先进的方法令人印象深刻(即它们不会给出驶出路面的预测),但它们能否泛化到基准之外的城市仍属未知。在这项工作中,我们表明这些方法不能泛化到新的场景。我们提出了一种新方法,自动生成真实的场景,使最先进的模型预测驶出路面。我们从对抗性场景生成的角度来刻画这一问题。我们提出了一种简单而有效的生成模型,它基于原子场景生成函数和物理约束。我们的实验表明,当前基准中超过60%的现有场景可以被修改,从而使预测方法失败(预测驶出路面)。我们进一步表明:(i)生成的场景是真实的,因为它们确实存在于现实世界中;(ii)可用于将现有模型的鲁棒性提高30-40%。代码可在https://s-attack.github.io/获取。 摘要:Vehicle trajectory prediction is nowadays a fundamental pillar of self-driving cars. Both the industry and research communities have acknowledged the need for such a pillar by running public benchmarks. While state-of-the-art methods are impressive, i.e., they have no off-road prediction, their generalization to cities outside of the benchmark is unknown. In this work, we show that those methods do not generalize to new scenes. We present a novel method that automatically generates realistic scenes that cause state-of-the-art models go off-road. We frame the problem through the lens of adversarial scene generation. We promote a simple yet effective generative model based on atomic scene generation functions along with physical constraints. Our experiments show that more than $60\%$ of the existing scenes from the current benchmarks can be modified in a way to make prediction methods fail (predicting off-road). We further show that (i) the generated scenes are realistic since they do exist in the real world, and (ii) can be used to make existing models robust by 30-40%. Code is available at https://s-attack.github.io/.

【2】 Causal Imitative Model for Autonomous Driving 标题:自动驾驶的因果模拟模型 链接:https://arxiv.org/abs/2112.03908

作者:Mohammad Reza Samsami,Mohammadhossein Bahari,Saber Salehkaleybar,Alexandre Alahi 机构:Sharif University of Tech., EPFL 摘要:模仿学习是一种利用专家驾驶员演示数据学习自动驾驶策略的强大方法。然而,忽略专家演示因果结构的模仿学习所训练出的驾驶策略会产生两种不良行为:惯性和碰撞。在本文中,我们提出了因果模仿模型(CIM)来解决惯性和碰撞问题。CIM显式地发现因果模型并利用它来训练策略。具体而言,CIM将输入解耦为一组潜在变量,选择其中的因果变量,并利用所选变量确定下一个位置。我们的实验表明,我们的方法在惯性和碰撞率方面优于以往工作。此外,得益于对因果结构的利用,CIM将输入维度缩减为仅两个维度,因此可以在少样本设置下适应新环境。代码可在https://github.com/vita-epfl/CIM获取。 摘要:Imitation learning is a powerful approach for learning autonomous driving policy by leveraging data from expert driver demonstrations. However, driving policies trained via imitation learning that neglect the causal structure of expert demonstrations yield two undesirable behaviors: inertia and collision. In this paper, we propose Causal Imitative Model (CIM) to address inertia and collision problems. CIM explicitly discovers the causal model and utilizes it to train the policy. Specifically, CIM disentangles the input to a set of latent variables, selects the causal variables, and determines the next position by leveraging the selected variables. Our experiments show that our method outperforms previous work in terms of inertia and collision rates. Moreover, thanks to exploiting the causal structure, CIM shrinks the input dimension to only two, hence, can adapt to new environments in a few-shot setting. Code is available at https://github.com/vita-epfl/CIM.

Attention注意力(2篇)

【1】 ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images 标题:ADD:基于频率注意力和多视图知识蒸馏的低质量压缩Deepfake图像检测 链接:https://arxiv.org/abs/2112.03553

作者:Binh M. Le,Simon S. Woo 机构: Department of Computer Science and Engineering, Sungkyunkwan University, South Korea, Department of Applied Data Science, Sungkyunkwan University, South Korea 备注:None 摘要:尽管基于深度学习的伪造检测器在识别被操纵的deepfake图像方面取得了重大进展,但在低质量压缩deepfake图像的情况下,大多数检测方法的性能都会出现中度到显著的下降。由于低质量图像中的信息有限,检测低质量deepfake仍然是一个重要挑战。在这项工作中,我们将频域学习和知识蒸馏(KD)中的最优传输理论应用于低质量压缩deepfake图像的检测。我们探索KD中的迁移学习能力,使学生网络能够有效地从低质量图像中学习判别特征。特别地,我们提出了基于注意力的deepfake检测蒸馏器(ADD),它由两种新的蒸馏组成:1)频率注意力蒸馏,有效地恢复学生网络中被移除的高频成分;2)多视角注意力蒸馏,通过在不同视角下对教师和学生的张量进行切片来创建多个注意力向量,从而更高效地将教师张量的分布传递给学生。我们的大量实验结果表明,我们的方法在检测低质量压缩deepfake图像方面优于最先进的基线。 摘要:Despite significant advancements of deep learning-based forgery detectors for distinguishing manipulated deepfake images, most detection approaches suffer from moderate to significant performance degradation with low-quality compressed deepfake images. Because of the limited information in low-quality images, detecting low-quality deepfake remains an important challenge. In this work, we apply frequency domain learning and optimal transport theory in knowledge distillation (KD) to specifically improve the detection of low-quality compressed deepfake images. We explore transfer learning capability in KD to enable a student network to learn discriminative features from low-quality images effectively. In particular, we propose the Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations: 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently. Our extensive experimental results demonstrate that our approach outperforms state-of-the-art baselines in detecting low-quality compressed deepfake images.

【2】 Graphical Models with Attention for Context-Specific Independence and an Application to Perceptual Grouping 标题:面向上下文特定独立性的注意力图模型及其在感知分组中的应用 链接:https://arxiv.org/abs/2112.03371

作者:Guangyao Zhou,Wolfgang Lehrach,Antoine Dedieu,Miguel Lázaro-Gredilla,Dileep George 机构:Vicarious AI 摘要:离散无向图模型,也称为马尔可夫随机场(MRF),可以灵活地编码多变量的概率交互作用,并已成功应用于广泛的问题。然而,离散MRF的一个众所周知但很少被研究的局限是,它们不能捕获上下文特定独立性(CSI)。现有方法需要精心发展的理论和专门构建的推理方法,这将它们的应用局限于小规模问题。在本文中,我们提出了马尔可夫注意力模型(MAM),这是一族包含注意力机制的离散MRF。注意力机制允许变量动态关注某些变量而忽略其余变量,从而能够在MRF中捕获CSI。MAM被表述为MRF,使其可以受益于丰富的现有MRF推理方法,并扩展到大型模型和数据集。为了展示MAM大规模捕获CSI的能力,我们应用MAM捕获一类重要的CSI,它存在于感知分组中循环计算的一种符号化方法中。在两个最近提出的合成感知分组任务和真实图像上的实验表明,与强循环神经网络基线相比,MAM在样本效率、可解释性和泛化性方面具有优势,并验证了MAM大规模高效捕获CSI的能力。 摘要:Discrete undirected graphical models, also known as Markov Random Fields (MRFs), can flexibly encode probabilistic interactions of multiple variables, and have enjoyed successful applications to a wide range of problems. However, a well-known yet little studied limitation of discrete MRFs is that they cannot capture context-specific independence (CSI). Existing methods require carefully developed theories and purpose-built inference methods, which limit their applications to only small-scale problems. In this paper, we propose the Markov Attention Model (MAM), a family of discrete MRFs that incorporates an attention mechanism. The attention mechanism allows variables to dynamically attend to some other variables while ignoring the rest, and enables capturing of CSIs in MRFs. A MAM is formulated as an MRF, allowing it to benefit from the rich set of existing MRF inference methods and scale to large models and datasets. To demonstrate MAM's capabilities to capture CSIs at scale, we apply MAMs to capture an important type of CSI that is present in a symbolic approach to recurrent computations in perceptual grouping. Experiments on two recently proposed synthetic perceptual grouping tasks and on realistic images demonstrate the advantages of MAMs in sample-efficiency, interpretability and generalizability when compared with strong recurrent neural network baselines, and validate MAM's capabilities to efficiently capture CSIs at scale.

蒸馏|知识提取(2篇)

【1】 Safe Distillation Box 标题:安全蒸馏箱 链接:https://arxiv.org/abs/2112.03695

作者:Jingwen Ye,Yining Mao,Jie Song,Xinchao Wang,Cheng Jin,Mingli Song 机构: Zhejiang University, Hangzhou, National University of Singapore, Fudan University 备注:Accepted by AAAI2022 摘要:知识蒸馏(KD)最近成为一种将知识从预训练教师模型转移到轻量级学生模型的强大策略,并已在广泛的应用中取得了前所未有的成功。尽管取得了令人鼓舞的结果,KD过程本身对网络所有权保护构成了潜在威胁,因为网络中包含的知识可以毫不费力地被蒸馏出来,从而暴露给恶意用户。在本文中,我们提出了一个称为安全蒸馏箱(SDB)的新框架,它允许我们将预训练模型包装在一个虚拟箱中以保护知识产权。具体而言,SDB为所有用户保留被包装模型的推理能力,但禁止未授权用户进行KD。另一方面,对于授权用户,SDB执行知识增强方案,以加强KD性能和学生模型的结果。换句话说,所有用户都可以使用SDB中的模型进行推理,但只有授权用户才能从该模型获得KD。所提出的SDB对模型架构不施加任何约束,可以作为即插即用的解决方案来保护预训练网络的所有权。在各种数据集和架构上的实验表明,使用SDB后,未授权KD的性能显著下降,而授权KD的性能得到增强,证明了SDB的有效性。 摘要:Knowledge distillation (KD) has recently emerged as a powerful strategy to transfer knowledge from a pre-trained teacher model to a lightweight student, and has demonstrated its unprecedented success over a wide spectrum of applications. In spite of the encouraging results, the KD process per se poses a potential threat to network ownership protection, since the knowledge contained in network can be effortlessly distilled and hence exposed to a malicious user. In this paper, we propose a novel framework, termed as Safe Distillation Box (SDB), that allows us to wrap a pre-trained model in a virtual box for intellectual property protection. Specifically, SDB preserves the inference capability of the wrapped model to all users, but precludes KD from unauthorized users. For authorized users, on the other hand, SDB carries out a knowledge augmentation scheme to strengthen the KD performances and the results of the student model. In other words, all users may employ a model in SDB for inference, but only authorized users get access to KD from the model. The proposed SDB imposes no constraints over the model architecture, and may readily serve as a plug-and-play solution to protect the ownership of a pre-trained network. Experiments across various datasets and architectures demonstrate that, with SDB, the performance of an unauthorized KD drops significantly while that of an authorized gets enhanced, demonstrating the effectiveness of SDB.

【2】 VizExtract: Automatic Relation Extraction from Data Visualizations 标题:VizExtract:从数据可视化中自动提取关系 链接:https://arxiv.org/abs/2112.03485

作者:Dale Decatur,Sanjay Krishnan 机构:University of Chicago, Chicago, Illinois 备注:8 pages 摘要:可视化图形,如曲线图、图表和统计图,被广泛用于传达统计结论。直接从这些可视化中提取信息,是在科学语料库中进行有效检索、事实核查和数据提取的关键子问题。本文提出了一个从统计图表中自动提取被比较变量的框架。由于图表样式、库和工具的多样性和差异性,我们利用基于计算机视觉的框架来自动识别和定位折线图、散点图或条形图中的可视化要素,并且每个图可以包含多个序列。该框架在大规模合成生成的matplotlib图表语料库上进行训练,并在其他图表数据集上评估训练后的模型。在受控实验中,对于每图1-3个序列、颜色各异、实线样式的图,我们的框架能够以87.5%的准确率对变量间的相关性进行分类。当部署在从互联网上抓取的真实图表上时,它达到72.8%的准确率(排除"困难"图表时为81.2%)。在FigureQA数据集上部署时,其准确率达到84.7%。 摘要:Visual graphics, such as plots, charts, and figures, are widely used to communicate statistical conclusions. Extracting information directly from such visualizations is a key sub-problem for effective search through scientific corpora, fact-checking, and data extraction. This paper presents a framework for automatically extracting compared variables from statistical charts. Due to the diversity and variation of charting styles, libraries, and tools, we leverage a computer vision based framework to automatically identify and localize visualization facets in line graphs, scatter plots, or bar graphs and can include multiple series per graph. The framework is trained on a large synthetically generated corpus of matplotlib charts and we evaluate the trained model on other chart datasets. In controlled experiments, our framework is able to classify, with 87.5% accuracy, the correlation between variables for graphs with 1-3 series per graph, varying colors, and solid line styles. When deployed on real-world graphs scraped from the internet, it achieves 72.8% accuracy (81.2% accuracy when excluding "hard" graphs). When deployed on the FigureQA dataset, it achieves 84.7% accuracy.

点云|SLAM|雷达|激光|深度RGBD相关(3篇)

【1】 Wild ToFu: Improving Range and Quality of Indirect Time-of-Flight Depth with RGB Fusion in Challenging Environments 标题:Wild ToFu:在具有挑战性的环境中通过RGB融合提高间接飞行时间深度的范围和质量 链接:https://arxiv.org/abs/2112.03750

作者:HyunJun Jung,Nikolas Brasch,Ales Leonardis,Nassir Navab,Benjamin Busam 机构: Technical University of Munich, Huawei Noah’s Ark Lab 摘要:间接飞行时间(I-ToF)成像因其体积小、价格合理,已成为移动设备深度估计的一种广泛方式。以往的工作主要集中在改善I-ToF成像质量,尤其是消除多径干扰(MPI)的影响。这些研究通常在近距离、室内和弱环境光的特定受限场景中进行。令人惊讶的是,很少有工作研究真实场景中的I-ToF质量改善:在强环境光和远距离条件下,有限的传感器功率和光散射造成的衰减会引入大量散粒噪声和信号稀疏性,带来很大困难。在这项工作中,我们提出了一种新的基于学习的端到端深度预测网络,它以带噪的原始I-ToF信号和RGB图像为输入,并基于包含隐式和显式对齐的多步方法融合二者的潜在表示,以预测与RGB视点对齐的高质量长距离深度图。我们在具有挑战性的真实场景上测试了我们的方法,与基线方法相比,最终深度图的RMSE改善超过40%。 摘要:Indirect Time-of-Flight (I-ToF) imaging is a widespread way of depth estimation for mobile devices due to its small size and affordable price. Previous works have mainly focused on quality improvement for I-ToF imaging especially curing the effect of Multi Path Interference (MPI). These investigations are typically done in specifically constrained scenarios at close distance, indoors and under little ambient light. Surprisingly little work has investigated I-ToF quality improvement in real-life scenarios where strong ambient light and far distances pose difficulties due to an extreme amount of induced shot noise and signal sparsity, caused by the attenuation with limited sensor power and light scattering. In this work, we propose a new learning based end-to-end depth prediction network which takes noisy raw I-ToF signals as well as an RGB image and fuses their latent representation based on a multi step approach involving both implicit and explicit alignment to predict a high quality long range depth map aligned to the RGB viewpoint. We test our approach on challenging real-world scenes and show more than 40% RMSE improvement on the final depth map compared to the baseline approach.

【2】 A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion 标题:一种三维点云补全的条件点扩散-细化范式 链接:https://arxiv.org/abs/2112.03530

作者:Zhaoyang Lyu,Zhifeng Kong,Xudong Xu,Liang Pan,Dahua Lin 机构:CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, University of California, San Diego, S-Lab, Nanyang Technological University, Shanghai AI Laboratory, Centre of Perceptual and Interactive Intelligence 摘要:三维点云是捕捉真实世界三维对象的重要三维表示。然而,实际扫描的三维点云通常是不完整的,为下游应用程序恢复完整的点云非常重要。大多数现有的点云完成方法使用倒角距离(CD)损失进行训练。CD loss通过搜索最近邻来估计两个点云之间的对应关系,这不会捕获生成形状上的整体点密度分布,因此可能导致点云生成不均匀。为了解决这个问题,我们提出了一种新的点扩散细化(PDR)模式来完成点云。PDR由条件生成网络(CGNet)和优化网络(RFNet)组成。CGNet使用一种称为去噪扩散概率模型(DDPM)的条件生成模型来生成以部分观测为条件的粗略完成。DDPM在生成的点云和均匀地面真值之间建立一对一的逐点映射,然后优化均方误差损失以实现均匀生成。RFNet细化了CGNet的粗略输出,并进一步提高了完成的点云的质量。此外,我们还为这两个网络开发了一种新的双路径结构。该体系结构可以(1)从部分观测的点云中有效地提取多层次特征以指导完成;(2)精确地操纵三维点的空间位置以获得平滑的表面和清晰的细节。在各种基准数据集上的大量实验结果表明,我们的PDR范式优于以前最先进的点云完成方法。值得注意的是,在RFNet的帮助下,我们可以将DDPM的迭代生成过程加快50倍,而不会有太大的性能下降。 摘要:3D point cloud is an important 3D representation for capturing real world 3D objects. However, real-scanned 3D point clouds are often incomplete, and it is important to recover complete point clouds for downstream applications. Most existing point cloud completion methods use Chamfer Distance (CD) loss for training. The CD loss estimates correspondences between two point clouds by searching nearest neighbors, which does not capture the overall point density distribution on the generated shape, and therefore likely leads to non-uniform point cloud generation. To tackle this problem, we propose a novel Point Diffusion-Refinement (PDR) paradigm for point cloud completion. PDR consists of a Conditional Generation Network (CGNet) and a ReFinement Network (RFNet). The CGNet uses a conditional generative model called the denoising diffusion probabilistic model (DDPM) to generate a coarse completion conditioned on the partial observation. DDPM establishes a one-to-one pointwise mapping between the generated point cloud and the uniform ground truth, and then optimizes the mean squared error loss to realize uniform generation. The RFNet refines the coarse output of the CGNet and further improves quality of the completed point cloud. Furthermore, we develop a novel dual-path architecture for both networks. The architecture can (1) effectively and efficiently extract multi-level features from partially observed point clouds to guide completion, and (2) accurately manipulate spatial locations of 3D points to obtain smooth surfaces and sharp details. Extensive experimental results on various benchmark datasets show that our PDR paradigm outperforms previous state-of-the-art methods for point cloud completion. Remarkably, with the help of the RFNet, we can accelerate the iterative generation process of the DDPM by up to 50 times without much performance drop.

【3】 Dense Depth Priors for Neural Radiance Fields from Sparse Input Views 标题:稀疏输入视图中神经辐射场的稠密深度先验 链接:https://arxiv.org/abs/2112.03288

作者:Barbara Roessle,Jonathan T. Barron,Ben Mildenhall,Pratul P. Srinivasan,Matthias Nießner 机构:Technical University of Munich, Google Research 备注:Video: this https URL 摘要:神经辐射场(NeRF)将场景编码为神经表示,以实现新颖视图的照片级真实感渲染。但是,要从RGB图像成功重建,需要在静态条件下获取大量输入视图;对于房间大小的场景,通常最多需要几百张图像。我们的方法旨在用少一个数量级的图像合成整个房间的新颖视图。为此,我们利用稠密深度先验来约束NeRF优化。首先,我们利用稀疏深度数据,它可以从用于估计相机姿态的运动恢复结构(SfM)预处理步骤中免费获得。其次,我们使用深度补全将这些稀疏点转换为稠密深度图和不确定性估计,用于指导NeRF优化。我们的方法能够在具有挑战性的室内场景中实现数据高效的新视图合成,整个场景仅需18幅图像。 摘要:Neural radiance fields (NeRF) encode a scene into a neural representation that enables photo-realistic rendering of novel views. However, a successful reconstruction from RGB images requires a large number of input views taken under static conditions - typically up to a few hundred images for room-size scenes. Our method aims to synthesize novel views of whole rooms from an order of magnitude fewer images. To this end, we leverage dense depth priors in order to constrain the NeRF optimization. First, we take advantage of the sparse depth data that is freely available from the structure from motion (SfM) preprocessing step used to estimate camera poses. Second, we use depth completion to convert these sparse points into dense depth maps and uncertainty estimates, which are used to guide NeRF optimization. Our method enables data-efficient novel view synthesis on challenging indoor scenes, using as few as 18 images for an entire scene.
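A hedged sketch of how a dense depth prior with uncertainty can enter the NeRF objective, using a Gaussian-NLL-style weighting; the paper's exact loss form may differ:

```python
# Pull rendered ray depth toward the completed depth map, down-weighting
# rays where the depth-completion network reports high uncertainty.
import torch

def depth_prior_loss(depth_render, depth_prior, uncertainty, eps=1e-6):
    # all tensors: (num_rays,); uncertainty = predicted std of the depth prior
    return (((depth_render - depth_prior) ** 2) / (2 * uncertainty**2 + eps)).mean()

d_hat = torch.rand(1024) * 5.0            # rendered depth per ray
d_prior = d_hat + 0.1 * torch.randn(1024) # completed-depth supervision
sigma = torch.full((1024,), 0.2)          # per-ray uncertainty estimate
print(depth_prior_loss(d_hat, d_prior, sigma))
```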

3D|3D重建等相关(1篇)

【1】 Gaussian map predictions for 3D surface feature localisation and counting 标题:用于三维表面特征定位和计数的高斯图预测 链接:https://arxiv.org/abs/2112.03736

作者:Justin Le Louëdec,Grzegorz Cielniak 机构:Lincoln Centre for Autonomous Systems, University of Lincoln, Lincoln LN,TS, United Kingdom 备注:BMVC 2021 摘要:在本文中,我们建议使用高斯图表示来估计三维表面特征的精确位置和数量,以解决基于密度估计的最新方法在存在局部干扰时的局限性。高斯图指示可能的对象位置,可以直接从关键点注释生成,避免了费力且昂贵的逐像素注释。我们将此方法应用于三维类球形对象,这类对象可以投影为二维形状表示,从而由神经网络GNet(一种改进的UNet架构)高效处理,该网络生成表面特征的可能位置及其精确数量。我们展示了该技术在草莓瘦果计数中的实际应用,瘦果计数在表型应用中被用作果实质量度量。在一个公开数据集的数百个草莓3D扫描上训练所提系统的结果表明,该系统的准确性和精度优于此应用中最先进的基于密度的方法。 摘要:In this paper, we propose to employ a Gaussian map representation to estimate precise location and count of 3D surface features, addressing the limitations of state-of-the-art methods based on density estimation which struggle in presence of local disturbances. Gaussian maps indicate probable object location and can be generated directly from keypoint annotations avoiding laborious and costly per-pixel annotations. We apply this method to the 3D spheroidal class of objects which can be projected into 2D shape representation enabling efficient processing by a neural network GNet, an improved UNet architecture, which generates the likely locations of surface features and their precise count. We demonstrate a practical use of this technique for counting strawberry achenes which is used as a fruit quality measure in phenotyping applications. The results of training the proposed system on several hundreds of 3D scans of strawberries from a publicly available dataset demonstrate the accuracy and precision of the system which outperforms the state-of-the-art density-based methods for this application.
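Generating a Gaussian-map target from keypoint annotations is straightforward; the minimal example below (grid size and sigma are illustrative) also shows how a count estimate falls out of the map's total mass:

```python
# Build a Gaussian map from keypoints: each annotated surface feature
# contributes one Gaussian blob, so the map encodes locations (blob centers)
# and count (total mass divided by the mass of one blob, 2*pi*sigma^2).
import numpy as np

def gaussian_map(keypoints, shape=(128, 128), sigma=2.0):
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.zeros(shape, dtype=np.float32)
    for (ky, kx) in keypoints:
        g += np.exp(-((ys - ky) ** 2 + (xs - kx) ** 2) / (2 * sigma**2))
    return g

kps = [(30, 40), (64, 64), (100, 20)]
m = gaussian_map(kps)
print(m.sum() / (2 * np.pi * 2.0**2))  # ~3, the number of keypoints
```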

其他神经网络|深度学习|模型|建模(12篇)

【1】 Variance-Aware Weight Initialization for Point Convolutional Neural Networks 标题:点卷积神经网络的方差感知权值初始化 链接:https://arxiv.org/abs/2112.03777

作者:Pedro Hermosilla,Michael Schelling,Tobias Ritschel,Timo Ropinski 机构:Ulm University, University College London 摘要:适当的权重初始化对于成功训练神经网络至关重要。最近,批归一化通过基于批统计量对每一层进行简单归一化,削弱了权重初始化的作用。不幸的是,批归一化在小批量下存在若干缺点,而在点云上学习时,为应对内存限制又不得不使用小批量。虽然有充分依据的权重初始化策略可以使批归一化变得不必要,从而避免这些缺点,但针对点卷积网络还没有人提出这样的方法。为了填补这一空白,我们提出了一个统一多种连续卷积的框架。这使得我们的主要贡献——方差感知权重初始化——成为可能。我们表明,这种初始化可以避免批归一化,同时获得相似的性能,在某些情况下甚至更好。 摘要:Appropriate weight initialization has been of key importance to successfully train neural networks. Recently, batch normalization has diminished the role of weight initialization by simply normalizing each layer based on batch statistics. Unfortunately, batch normalization has several drawbacks when applied to small batch sizes, as they are required to cope with memory limitations when learning on point clouds. While well-founded weight initialization strategies can render batch normalization unnecessary and thus avoid these drawbacks, no such approaches have been proposed for point convolutional networks. To fill this gap, we propose a framework to unify the multitude of continuous convolutions. This enables our main contribution, variance-aware weight initialization. We show that this initialization can avoid batch normalization while achieving similar and, in some cases, better performance.
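In the He/Kaiming spirit, a variance-aware initialization picks the weight std from the effective fan-in of a point convolution; the sketch below illustrates the principle under that simplification, while the paper's actual variance correction depends on the specific continuous-convolution family being unified:

```python
# Choose the weight std from the effective fan-in (neighbors x input
# channels) so activation variance is roughly preserved through the layer,
# removing the need for batch normalization.
import torch

def variance_aware_init(weight, n_neighbors, in_channels, gain=2.0):
    fan_in = n_neighbors * in_channels     # effective inputs per output unit
    std = (gain / fan_in) ** 0.5           # gain=2 targets ReLU nonlinearities
    with torch.no_grad():
        weight.normal_(0.0, std)
    return weight

w = torch.empty(64, 16 * 32)   # 64 outputs, 16 neighbors x 32 channels
variance_aware_init(w, n_neighbors=16, in_channels=32)
print(w.std())                 # ~ sqrt(2 / 512) ~ 0.0625
```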

【2】 SalFBNet: Learning Pseudo-Saliency Distribution via Feedback Convolutional Networks 标题:SalFBNet:通过反馈卷积网络学习伪显著性分布 链接:https://arxiv.org/abs/2112.03731

作者:Guanqun Ding,Nevrez Imamouglu,Ali Caglayan,Masahiro Murakawa,Ryosuke Nakamura 机构:Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan, National Institute of Advanced Industrial Science and Technology, Tokyo,-, Japan 摘要:尽管仅前馈的卷积神经网络(CNN)具有显著的表示能力,但在显著性检测等视觉任务中,它们可能忽略反馈连接的内在关系和潜在好处。在这项工作中,我们提出了一种用于显著性检测的反馈递归卷积框架(SalFBNet)。该反馈模型通过架设从高层特征块到低层的递归通路,可以学习丰富的上下文表示。此外,我们构建了一个大规模的伪显著性(Pseudo-Saliency)数据集,以缓解显著性检测中数据不足的问题。我们首先使用所提出的反馈模型从伪真值中学习显著性分布,然后在现有的眼动注视数据集上对反馈模型进行微调。此外,我们提出了一种新的选择性注视与非注视误差(sFNE)损失,使所提反馈模型能更好地学习可区分的、基于眼动注视的特征。大量实验结果表明,参数更少的SalFBNet在公共显著性检测基准上取得了有竞争力的结果,这证明了所提反馈模型和伪显著性数据的有效性。源代码和伪显著性数据集可在https://github.com/gqding/SalFBNet获取。 摘要:Feed-forward only convolutional neural networks (CNNs) may ignore intrinsic relationships and potential benefits of feedback connections in vision tasks such as saliency detection, despite their significant representation capabilities. In this work, we propose a feedback-recursive convolutional framework (SalFBNet) for saliency detection. The proposed feedback model can learn abundant contextual representations by bridging a recursive pathway from higher-level feature blocks to low-level layer. Moreover, we create a large-scale Pseudo-Saliency dataset to alleviate the problem of data deficiency in saliency detection. We first use the proposed feedback model to learn saliency distribution from pseudo-ground-truth. Afterwards, we fine-tune the feedback model on existing eye-fixation datasets. Furthermore, we present a novel Selective Fixation and Non-Fixation Error (sFNE) loss to make proposed feedback model better learn distinguishable eye-fixation-based features. Extensive experimental results show that our SalFBNet with fewer parameters achieves competitive results on the public saliency detection benchmarks, which demonstrate the effectiveness of proposed feedback model and Pseudo-Saliency data. Source codes and Pseudo-Saliency dataset can be found at https://github.com/gqding/SalFBNet

【3】 Flexible Networks for Learning Physical Dynamics of Deformable Objects 标题:用于学习可变形物体物理动力学的柔性网络 链接:https://arxiv.org/abs/2112.03728

作者:Jinhyung Park,DoHae Lee,In-Kwon Lee 机构: Yonsei University 摘要:使用基于粒子的表示学习可变形物体的物理动力学一直是机器学习中许多计算模型的目标。虽然一些最先进的模型在模拟环境中实现了这一目标,但大多数现有模型都有一个先决条件,即输入是有序点集的序列,即每个点集中的点在整个输入序列中的顺序必须相同。这限制了模型推广到现实世界的数据,这被认为是一个无序点集序列。在本文中,我们提出了一个称为时间点网(TP-Net)的模型,该模型通过直接使用一系列无序点集来推断基于粒子表示的可变形对象的未来状态,从而解决了这个问题。我们的模型由一个共享特征提取器和一个预测网络组成,共享特征提取器并行地从每个输入点集中提取全局特征,预测网络对这些特征进行聚合和推理,以便将来进行预测。我们方法的关键概念是,我们使用全局特征而不是局部特征来实现对输入置换的不变性,并确保模型的稳定性和可伸缩性。实验表明,我们的模型在合成数据集和真实数据集上都达到了最先进的性能,具有实时预测速度。我们提供定量和定性分析,说明为什么我们的方法比现有方法更有效。 摘要:Learning the physical dynamics of deformable objects with particle-based representation has been the objective of many computational models in machine learning. While several state-of-the-art models have achieved this objective in simulated environments, most existing models impose a precondition, such that the input is a sequence of ordered point sets - i.e., the order of the points in each point set must be the same across the entire input sequence. This restrains the model to generalize to real-world data, which is considered to be a sequence of unordered point sets. In this paper, we propose a model named time-wise PointNet (TP-Net) that solves this problem by directly consuming a sequence of unordered point sets to infer the future state of a deformable object with particle-based representation. Our model consists of a shared feature extractor that extracts global features from each input point set in parallel and a prediction network that aggregates and reasons on these features for future prediction. The key concept of our approach is that we use global features rather than local features to achieve invariance to input permutations and ensure the stability and scalability of our model. Experiments demonstrate that our model achieves state-of-the-art performance in both synthetic dataset and in real-world dataset, with real-time prediction speed. We provide quantitative and qualitative analysis on why our approach is more effective and efficient than existing approaches.
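The permutation invariance that TP-Net's global features rely on comes from symmetric pooling over points; the tiny check below shows that a max-pooled per-point feature is unchanged by reordering the set:

```python
# A shared per-point MLP followed by a symmetric (max) pool is invariant to
# the order of the points, which is why global features generalize to
# unordered point-set sequences.
import torch

mlp = torch.nn.Linear(3, 32)                    # shared per-point feature extractor
points = torch.randn(1024, 3)                   # one unordered point set
perm = torch.randperm(1024)

feat_a = mlp(points).max(dim=0).values          # global feature
feat_b = mlp(points[perm]).max(dim=0).values    # same set, different order
print(torch.allclose(feat_a, feat_b))           # True
```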

【4】 Low-rank Tensor Decomposition for Compression of Convolutional Neural Networks Using Funnel Regularization 标题:基于漏斗正则化的卷积神经网络压缩的低秩张量分解 链接:https://arxiv.org/abs/2112.03690

作者:Bo-Shiuan Chu,Che-Rung Lee 摘要:张量分解能够揭示复杂结构之间的潜在关系,是深度卷积神经网络模型压缩的基本技术之一。然而,现有的大多数方法都是逐层压缩网络,无法提供实现全局优化的令人满意的方案。本文提出了一种模型降阶方法,利用卷积层的低秩张量分解来压缩预训练网络。我们的方法基于优化技术来选择被分解网络层的合适秩。我们提出了一种新的正则化方法,称为漏斗函数,用于在压缩过程中抑制不重要的因子,使合适的秩更容易显现。实验结果表明,与其他张量压缩方法相比,我们的算法可以削减更多的模型参数。对于在ImageNet2012上的ResNet18,我们的精简模型在GMAC方面可达到两倍以上的加速,而Top-1精度仅下降0.7%,在这两个指标上均优于大多数现有方法。 摘要:Tensor decomposition is one of the fundamental technique for model compression of deep convolution neural networks owing to its ability to reveal the latent relations among complex structures. However, most existing methods compress the networks layer by layer, which cannot provide a satisfactory solution to achieve global optimization. In this paper, we proposed a model reduction method to compress the pre-trained networks using low-rank tensor decomposition of the convolution layers. Our method is based on the optimization techniques to select the proper ranks of decomposed network layers. A new regularization method, called funnel function, is proposed to suppress the unimportant factors during the compression, so the proper ranks can be revealed much easier. The experimental results show that our algorithm can reduce more model parameters than other tensor compression methods. For ResNet18 with ImageNet2012, our reduced model can reach more than two times speed up in terms of GMAC with merely 0.7% Top-1 accuracy drop, which outperforms most existing methods in both metrics.

【5】 Does Proprietary Software Still Offer Protection of Intellectual Property in the Age of Machine Learning? -- A Case Study using Dual Energy CT Data 标题:在机器学习时代,专有软件还能提供知识产权保护吗?--基于双能量CT数据的案例研究 链接:https://arxiv.org/abs/2112.03678

作者:Andreas Maier,Seung Hee Yang,Farhad Maleki,Nikesh Muthukrishnan,Reza Forghani 机构:Pattern Recognition Lab, FAU Erlangen-N¨urnberg, Department Artificial Intelligence in Medical Engineering, FAU Erlangen-N¨urnberg, McGill University Hospital, McGill University 备注:6 pages, 2 figures, 1 table, accepted on BVM 2022 摘要:在医学图像处理领域,医疗器械制造商在许多情况下通过只交付编译后的软件(即可以执行、但潜在攻击者难以理解的二进制代码)来保护其知识产权。在本文中,我们研究这种做法对图像处理算法的保护效果究竟如何。特别地,我们研究了从双能CT数据计算单能图像和碘图的过程是否可以被机器学习方法逆向工程。我们的结果表明,在所有被调查的情况下,仅使用一张切片图像作为训练数据,两者都可以以非常高的精度被近似,结构相似性均大于0.98。 摘要:In the domain of medical image processing, medical device manufacturers protect their intellectual property in many cases by shipping only compiled software, i.e. binary code which can be executed but is difficult to be understood by a potential attacker. In this paper, we investigate how well this procedure is able to protect image processing algorithms. In particular, we investigate whether the computation of mono-energetic images and iodine maps from dual energy CT data can be reverse-engineered by machine learning methods. Our results indicate that both can be approximated using only one single slice image as training data at a very high accuracy with structural similarity greater than 0.98 in all investigated cases.

【6】 Defending against Model Stealing via Verifying Embedded External Features 标题:通过验证嵌入的外部特征来防御模型窃取 链接:https://arxiv.org/abs/2112.03476

作者:Yiming Li,Linghui Zhu,Xiaojun Jia,Yong Jiang,Shu-Tao Xia,Xiaochun Cao 机构:Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Peng Cheng Laboratory, Shenzhen, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 备注:This work is accepted by the AAAI 2022. The first two authors contributed equally to this work. 11 pages 摘要:获得训练有素的模型需要昂贵的数据收集和训练过程,因此模型是一项宝贵的知识产权。最近的研究表明,即使对手没有训练样本、也无法访问模型参数或结构,仍然可以"窃取"已部署的模型。目前已有一些防御方法来缓解这种威胁,主要是通过增加模型窃取的成本。在本文中,我们从另一个角度探讨防御:验证可疑模型是否包含防御者指定的外部特征的知识。具体来说,我们通过用风格迁移修改少量训练样本来嵌入外部特征。然后,我们训练一个元分类器来判断某个模型是否窃取自受害者模型。这种方法的灵感来自这样一种理解:被窃取的模型应当包含受害者模型所学特征的知识。我们在CIFAR-10和ImageNet数据集上检验了我们的方法。实验结果表明,我们的方法能够同时有效地检测不同类型的模型窃取,即使被窃取的模型是通过多阶段窃取过程获得的。复现主要结果的代码可在GitHub上获取(https://github.com/zlh-thu/StealingVerification)。 摘要:Obtaining a well-trained model involves expensive data collection and training procedures, therefore the model is a valuable intellectual property. Recent studies revealed that adversaries can `steal' deployed models even when they have no training samples and can not get access to the model parameters or structures. Currently, there were some defense methods to alleviate this threat, mostly by increasing the cost of model stealing. In this paper, we explore the defense from another angle by verifying whether a suspicious model contains the knowledge of defender-specified external features. Specifically, we embed the external features by tempering a few training samples with style transfer. We then train a meta-classifier to determine whether a model is stolen from the victim. This approach is inspired by the understanding that the stolen models should contain the knowledge of features learned by the victim model. We examine our method on both CIFAR-10 and ImageNet datasets. Experimental results demonstrate that our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process. The codes for reproducing main results are available at Github (https://github.com/zlh-thu/StealingVerification).

【7】 Learning to Solve Hard Minimal Problems 标题:学习求解困难的极小问题 链接:https://arxiv.org/abs/2112.03424

作者:Petr Hruby,Timothy Duff,Anton Leykin,Tomas Pajdla 机构:ETH Zürich, Department of Computer Science, University of Washington, Department of Mathematics, Georgia Institute of Technology, School of Mathematics, Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics 备注:24 pages total: 14 pages main paper and 10 pages supplementary 摘要:我们提出了一种在RANSAC框架下求解困难几何优化问题的方法。这类困难极小问题源于把原始几何优化问题松弛为一个带有大量虚假解的极小问题。我们的方法避免了计算大量虚假解:我们设计了一种学习策略,用于挑选一个起始"问题-解"对,使其能够被数值延拓到感兴趣的问题及其解。我们通过开发一个RANSAC求解器来演示该方法,该求解器借助每个视图取四个点的最小松弛,计算三个已标定相机的相对位姿。平均而言,我们可以在不到 70 $\mu s$ 内求解单个问题。我们还在计算两个已标定相机相对位姿这一非常熟悉的问题(两视图五点的极小情形)上,对我们的工程选择进行了基准测试和研究。 摘要:We present an approach to solving hard geometric optimization problems in the RANSAC framework. The hard minimal problems arise from relaxing the original geometric optimization problem into a minimal problem with many spurious solutions. Our approach avoids computing large numbers of spurious solutions. We design a learning strategy for selecting a starting problem-solution pair that can be numerically continued to the problem and the solution of interest. We demonstrate our approach by developing a RANSAC solver for the problem of computing the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view. On average, we can solve a single problem in under 70 $\mu s$. We also benchmark and study our engineering choices on the very familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.
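为便于理解同伦延拓(HC)的基本机制,下面给出一个极简的单变量示意(非论文实现;起始/目标系统与 gamma 技巧的取值均为举例假设):沿 H(x,t)=(1-t)·gamma·g(x)+t·f(x) 从起始系统 g 的已知根出发,用"Euler 预测 + Newton 校正"追踪到目标系统 f 的根。论文中的求解器处理的是远更复杂的多项式方程组,此处仅演示路径追踪这一核心思想。

f,  df = lambda x: x**2 - 3*x + 2, lambda x: 2*x - 3   # 目标系统及其导数(根为 1 与 2)
g,  dg = lambda x: x**2 - 1,       lambda x: 2*x       # 起始系统(根为 ±1)
gamma = 0.8 + 0.6j                                     # gamma 技巧:随机复数,概率 1 地避开路径奇异

def track(x, steps=200, newton_iters=3):
    """沿 H(x,t) = (1-t)*gamma*g(x) + t*f(x) 追踪一条解路径。"""
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        Hx = (1 - t) * gamma * dg(x) + t * df(x)   # dH/dx
        Ht = f(x) - gamma * g(x)                   # dH/dt
        x = x - dt * Ht / Hx                       # Euler 预测步:dx/dt = -Ht/Hx
        t += dt
        for _ in range(newton_iters):              # Newton 校正步:把 x 拉回 H(x,t)=0
            x = x - ((1 - t) * gamma * g(x) + t * f(x)) / ((1 - t) * gamma * dg(x) + t * df(x))
    return x

for x0 in (1.0 + 0j, -1.0 + 0j):   # 从 g 的两个根出发
    print(track(x0))                # 分别收敛到 f 的两个根(1 与 2)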

【8】 Equal Bits: Enforcing Equally Distributed Binary Network Weights 标题:相等比特:实施均匀分布的二进制网络权重 链接:https://arxiv.org/abs/2112.03406

作者:Yunqiang Li,Silvia L. Pintea,Jan C. van Gemert 机构:Computer Vision Lab, Delft University of Technology, Delft, Netherlands 摘要:二值网络非常高效,因为它们仅用两个符号来定义网络:$\{+1,-1\}$。这些符号的先验分布可以作为一种设计选择。Qin 等人最近提出的 IR-Net 认为,对二值权重施加等先验(即等比特占比)的伯努利分布会带来最大熵,从而使信息损失最小。然而,已有工作无法在训练过程中精确控制二值权重的分布,因此无法保证最大熵。本文证明,基于最优传输的量化可以保证任意比特占比,包括相等占比。我们通过实验考察了等比特占比确实更可取,并表明我们的方法带来了优化上的收益。与最先进的二值化方法相比,我们的量化方法是有效的,即使在使用二值权重剪枝时也是如此。 摘要:Binary networks are extremely efficient as they use only two symbols to define the network: $\{+1,-1\}$. One can make the prior distribution of these symbols a design choice. The recent IR-Net of Qin et al. argues that imposing a Bernoulli distribution with equal priors (equal bit ratios) over the binary weights leads to maximum entropy and thus minimizes information loss. However, prior work cannot precisely control the binary weight distribution during training, and therefore cannot guarantee maximum entropy. Here, we show that quantizing using optimal transport can guarantee any bit ratio, including equal ratios. We investigate experimentally that equal bit ratios are indeed preferable and show that our method leads to optimization benefits. We show that our quantization method is effective when compared to state-of-the-art binarization methods, even when using binary weight pruning.
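就一维权重而言,把经验分布传输到 {+1,-1} 等概率两点分布的最优传输解,恰好等价于按秩(中位数)划分。下面是据此观察写出的示意草图(假设性的简化,非论文原始代码),并配上训练中常用的直通估计器:

import torch

def equal_bit_binarize(w: torch.Tensor) -> torch.Tensor:
    """中位数阈值二值化:+1 与 -1 占比相等(无重复值时严格各半),
    等价于到等概率两点分布的一维最优传输(按秩匹配)。"""
    flat = w.flatten()
    med = flat.kthvalue(flat.numel() // 2).values
    return torch.where(w > med, torch.ones_like(w), -torch.ones_like(w))

def binarize_ste(w: torch.Tensor) -> torch.Tensor:
    """直通估计器:前向使用二值权重,反向把梯度原样传给实值权重。"""
    return w + (equal_bit_binarize(w) - w).detach()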

【9】 Efficient Continuous Manifold Learning for Time Series Modeling 标题:用于时间序列建模的高效连续流形学习 链接:https://arxiv.org/abs/2112.03379

作者:Seungwoo Jeong,Wonjun Ko,Ahmad Wisnu Mulyadi,Heung-Il Suk 机构:Department of Artificial Intelligence, Korea University, Department of Brain and Cognitive Engineering, Korea University 摘要:随着深度神经网络在各个领域取得前所未有的成功,非欧几里德数据建模正日益受到关注。特别地,对称正定(SPD)矩阵因其能够学习合适的统计表示,在计算机视觉、信号处理和医学图像分析中被积极研究。然而,由于其严格的约束,SPD 矩阵在优化求解和计算开销方面仍然具有挑战性,尤其是在深度学习框架内。本文提出利用黎曼流形与 Cholesky 空间之间的微分同胚映射,不仅可以高效地求解优化问题,而且可以大大降低计算成本。此外,为了对时间序列数据进行动力学建模,我们通过系统地集成流形常微分方程与门控递归神经网络,设计了一种连续流形学习方法。值得注意的是,得益于矩阵在 Cholesky 空间中的良好参数化,可以直接在配备黎曼几何度量的情况下训练我们提出的网络。实验表明,所提模型能够高效、可靠地训练,并在动作识别和睡眠分期分类这两个分类任务上优于现有的流形方法和最新方法。 摘要:Modeling non-Euclidean data is drawing attention along with the unprecedented successes of deep neural networks in diverse fields. In particular, symmetric positive definite (SPD) matrix is being actively studied in computer vision, signal processing, and medical image analysis, thanks to its ability to learn appropriate statistical representations. However, due to its strong constraints, it remains challenging for optimization problems or inefficient computation costs, especially, within a deep learning framework. In this paper, we propose to exploit a diffeomorphism mapping between Riemannian manifolds and a Cholesky space, by which it becomes feasible not only to efficiently solve optimization problems but also to reduce computation costs greatly. Further, in order for dynamics modeling in time series data, we devise a continuous manifold learning method by integrating a manifold ordinary differential equation and a gated recurrent neural network in a systematic manner. It is noteworthy that because of the nice parameterization of matrices in a Cholesky space, it is straightforward to train our proposed network with Riemannian geometric metrics equipped. We demonstrate through experiments that the proposed model can be efficiently and reliably trained as well as outperform existing manifold methods and state-of-the-art methods in two classification tasks: action recognition and sleep staging classification.
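作为背景示意,下面给出 SPD 矩阵与 Cholesky 因子之间的双向(微分同胚)映射,以及文献中 Log-Cholesky 距离的一个草图(基于 Lin 2019 提出的度量;与论文网络的具体实现无关):

import torch

def spd_to_cholesky(S: torch.Tensor) -> torch.Tensor:
    """SPD -> 下三角 Cholesky 因子 L,满足 S = L L^T(此映射是微分同胚)。"""
    return torch.linalg.cholesky(S)

def cholesky_to_spd(L: torch.Tensor) -> torch.Tensor:
    return L @ L.transpose(-1, -2)

def log_cholesky_distance(S1: torch.Tensor, S2: torch.Tensor) -> torch.Tensor:
    """Log-Cholesky 距离:严格下三角部分用欧氏距离,对角线取对数后用欧氏距离。"""
    L1, L2 = torch.linalg.cholesky(S1), torch.linalg.cholesky(S2)
    strict = torch.tril(L1 - L2, diagonal=-1)
    diag = torch.log(L1.diagonal(dim1=-2, dim2=-1)) - torch.log(L2.diagonal(dim1=-2, dim2=-1))
    return torch.sqrt((strict ** 2).sum(dim=(-2, -1)) + (diag ** 2).sum(-1))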

【10】 Noether Networks: Meta-Learning Useful Conserved Quantities 标题:Noether网络:元学习有用的守恒量 链接:https://arxiv.org/abs/2112.03321

作者:Ferran Alet,Dylan Doblar,Allan Zhou,Joshua Tenenbaum,Kenji Kawaguchi,Chelsea Finn 机构:MIT,Stanford University,National University of Singapore 备注:Accepted to NeurIPS '21. The first two authors contributed equally 摘要:机器学习(ML)的进步源于数据可用性、计算资源和归纳偏差的恰当编码三者的结合。有用的偏差通常利用预测问题中的对称性,例如依赖平移等变性的卷积网络。自动发现这些有用的对称性有望极大提升 ML 系统的性能,但仍然是一个挑战。在这项工作中,我们专注于序列预测问题,并受 Noether 定理启发,把寻找归纳偏差的问题归约为元学习有用守恒量的问题。我们提出了 Noether 网络:一种新型架构,其中元学习得到的守恒损失在预测函数内部被优化。我们从理论和实验上证明,Noether 网络提高了预测质量,为发现序列问题中的归纳偏差提供了一个通用框架。 摘要:Progress in machine learning (ML) stems from a combination of data availability, computational resources, and an appropriate encoding of inductive biases. Useful biases often exploit symmetries in the prediction problem, such as convolutional networks relying on translation equivariance. Automatically discovering these useful symmetries holds the potential to greatly improve the performance of ML systems, but still remains a challenge. In this work, we focus on sequential prediction problems and take inspiration from Noether's theorem to reduce the problem of finding inductive biases to meta-learning useful conserved quantities. We propose Noether Networks: a new type of architecture where a meta-learned conservation loss is optimized inside the prediction function. We show, theoretically and experimentally, that Noether Networks improve prediction quality, providing a general framework for discovering inductive biases in sequential problems.
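其核心思想可用一个极简的守恒损失来说明:用一个小网络 g 把每个时间步的状态嵌入为若干标量"守恒量",并惩罚它们随时间的漂移。以下为示意草图(网络结构与维度均为假设,非论文原始实现):

import torch
import torch.nn as nn

class ConservationLoss(nn.Module):
    """元学习守恒损失的草图:鼓励 g(x_t) 对所有 t 保持不变。"""
    def __init__(self, dim_in: int, n_quantities: int = 8):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(dim_in, 64), nn.ReLU(), nn.Linear(64, n_quantities)
        )

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (T, B, dim_in) 的状态序列(真实或预测轨迹)
        q = self.g(seq)                   # (T, B, n_quantities):每步的"守恒量"
        return ((q - q[:1]) ** 2).mean()  # 相对初始时刻的漂移

# 用法示意:预测函数内部用该损失对预测轨迹做若干步内层优化(测试时微调),
# 外层再元学习 g 的参数,使"守恒"真正有助于预测质量。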

【11】 Image Enhancement via Bilateral Learning 标题:基于双边学习的图像增强 链接:https://arxiv.org/abs/2112.03888

作者:Saeedeh Rezaee,Nezam Mahdavi-Amiri 机构:Sharif University of Technology, Tehran, Iran 摘要:如今,得益于先进的数字成像技术和大众化的互联网接入,数字图像的产生数量急剧增加,对自动图像增强技术的需求因此十分明显。近年来,深度学习在该领域得到了有效应用。本文在介绍近期一些图像增强工作之后,提出了一种基于卷积神经网络的图像增强系统。我们的目标是有效利用两种已有方法:卷积神经网络和双边网格。在我们的方法中,我们扩充了训练数据和模型维度,并在训练过程中提出了一种可变速率。在纳入 5 位不同专家的设置下,我们方法产生的增强结果相比其他现有方法在定量与定性上均有改进。 摘要:Nowadays, due to advanced digital imaging technologies and internet accessibility to the public, the number of generated digital images has increased dramatically. Thus, the need for automatic image enhancement techniques is quite apparent. In recent years, deep learning has been used effectively. Here, after introducing some recently developed works on image enhancement, an image enhancement system based on convolutional neural networks is presented. Our goal is to make an effective use of two available approaches, convolutional neural network and bilateral grid. In our approach, we increase the training data and the model dimensions and propose a variable rate during the training process. The enhancement results produced by our proposed method, while incorporating 5 different experts, show both quantitative and qualitative improvements as compared to other available methods.

【12】 Image Compressed Sensing Using Non-local Neural Network 标题:基于非局部神经网络的图像压缩感知 链接:https://arxiv.org/abs/2112.03712

作者:Wenxue Cui,Shaohui Liu,Feng Jiang,Debin Zhao 机构: Harbin Institute of Technology 备注:None 摘要:近年来,基于深度网络的图像压缩感知(CS)受到了广泛关注。然而,现有的基于深度网络的CS方案,要么以逐块方式重建目标图像而导致严重的块伪影,要么把深度网络当作黑箱训练,因而对图像先验知识的洞察有限。本文提出了一种新的基于非局部神经网络的图像压缩感知框架(NL-CSNet),将非局部自相似性先验与深度网络相结合以提高重建质量。在所提出的NL-CSNet中,构造了两个非局部子网络,分别在测量域和多尺度特征域中利用非局部自相似先验。具体而言,在测量域子网络中,建立不同图像块的测量值之间的长距离依赖关系,以获得更好的初始重建;类似地,在多尺度特征域子网络中,在多尺度空间中挖掘稠密特征表示之间的亲和性以进行深度重建。此外,还提出了一种新的损失函数来增强非局部表示之间的耦合,同时使NL-CSNet可以端到端训练。大量实验表明,NL-CSNet在保持较快计算速度的同时,优于现有最先进的CS方法。 摘要:Deep network-based image Compressed Sensing (CS) has attracted much attention in recent years. However, the existing deep network-based CS schemes either reconstruct the target image in a block-by-block manner that leads to serious block artifacts or train the deep network as a black box that brings about limited insights of image prior knowledge. In this paper, a novel image CS framework using non-local neural network (NL-CSNet) is proposed, which utilizes the non-local self-similarity priors with deep network to improve the reconstruction quality. In the proposed NL-CSNet, two non-local subnetworks are constructed for utilizing the non-local self-similarity priors in the measurement domain and the multi-scale feature domain respectively. Specifically, in the subnetwork of measurement domain, the long-distance dependencies between the measurements of different image blocks are established for better initial reconstruction. Analogically, in the subnetwork of multi-scale feature domain, the affinities between the dense feature representations are explored in the multi-scale space for deep reconstruction. Furthermore, a novel loss function is developed to enhance the coupling between the non-local representations, which also enables an end-to-end training of NL-CSNet. Extensive experiments manifest that NL-CSNet outperforms existing state-of-the-art CS methods, while maintaining fast computational speed.
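作为背景,块式压缩感知的采样与初始重建本质上是两次矩阵乘法;论文的两个非局部子网络即建立在这样的测量与初始重建之上。以下为最小示意(块大小、采样率等均为假设值):

import torch

B, n, ratio = 8, 32, 0.25              # 块数、块边长、采样率(示意取值)
N = n * n
M = int(ratio * N)
Phi = torch.randn(M, N) / N ** 0.5     # 高斯随机测量矩阵

blocks = torch.randn(B, N)             # 拉平后的图像块(此处用随机数代替)
y = blocks @ Phi.T                     # 压缩测量 y = Phi x
x_init = (y @ Phi).view(B, 1, n, n)    # 初始重建 x0 = Phi^T y,再交给深度网络精化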

其他(14篇)

【1】 Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields 标题:Ref-NeRF:用于神经辐射场的结构化视图相关外观 链接:https://arxiv.org/abs/2112.03907

作者:Dor Verbin,Peter Hedman,Ben Mildenhall,Todd Zickler,Jonathan T. Barron,Pratul P. Srinivasan 机构:Harvard University, Google Research 备注:Project page: this https URL 摘要:神经辐射场(NeRF)是一种流行的视图合成技术,它把场景表示为连续的体积函数,由多层感知器参数化,在每个位置给出体积密度和随视角变化的出射辐射。尽管基于NeRF的技术擅长表示具有平滑变化的视图相关外观的精细几何结构,但它们通常无法准确捕捉和再现有光泽(glossy)表面的外观。我们通过引入Ref-NeRF来解决这一局限:它用反射辐射的表示取代NeRF对视图相关出射辐射的参数化,并利用一组空间变化的场景属性来构造该函数。我们表明,结合法向量上的正则化项,我们的模型显著提升了镜面反射的真实感和准确性。此外,我们还表明,模型内部对出射辐射的表示是可解释的,并且对场景编辑很有用。 摘要:Neural Radiance Fields (NeRF) is a popular view synthesis technique that represents a scene as a continuous volumetric function, parameterized by multilayer perceptrons that provide the volume density and view-dependent emitted radiance at each location. While NeRF-based techniques excel at representing fine geometric structures with smoothly varying view-dependent appearance, they often fail to accurately capture and reproduce the appearance of glossy surfaces. We address this limitation by introducing Ref-NeRF, which replaces NeRF's parameterization of view-dependent outgoing radiance with a representation of reflected radiance and structures this function using a collection of spatially-varying scene properties. We show that together with a regularizer on normal vectors, our model significantly improves the realism and accuracy of specular reflections. Furthermore, we show that our model's internal representation of outgoing radiance is interpretable and useful for scene editing.
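其反射辐射重参数化的关键一步,是把方向 MLP 的输入从视线方向换成关于法线的反射方向 w_r = 2(w·n)n - w。下面是该公式的直接实现(仅为公式示意,非完整 Ref-NeRF):

import torch

def reflect(view_dir: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
    """反射方向 w_r = 2 (w · n) n - w;view_dir 与 normal 均为单位向量,形状 (..., 3)。"""
    d = (view_dir * normal).sum(dim=-1, keepdim=True)
    return 2.0 * d * normal - view_dir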

【2】 Traversing within the Gaussian Typical Set: Differentiable Gaussianization Layers for Inverse Problems Augmented by Normalizing Flows 标题:在高斯典型集合内的遍历:归一化流增强的反问题的可微高斯化层 链接:https://arxiv.org/abs/2112.03860

作者:Dongzhuo Li,Huseyin Denli 机构:ExxonMobil Research & Engineering Company, Annandale, NJ , USA 备注:16 pages, 12 figures 摘要:归一化流等生成网络可以作为一种基于学习的先验,用于增强反问题以获得高质量结果。然而,在反演过程中遍历潜在空间时,潜向量可能不再是所期望的高维标准高斯分布的典型样本。因此,获得高保真解可能很困难,尤其在存在噪声和不精确物理模型的情况下。为了解决这个问题,我们提出用新的可微、数据相关的层对潜向量进行重参数化和高斯化,其中自定义算子通过求解优化问题来定义。这些层强制反演在潜在空间的高斯典型集内寻找可行解。我们在图像去模糊任务和程函(eikonal)层析成像(一种PDE约束反问题)上测试并验证了该技术,获得了高保真的结果。 摘要:Generative networks such as normalizing flows can serve as a learning-based prior to augment inverse problems to achieve high-quality results. However, the latent space vector may not remain a typical sample from the desired high-dimensional standard Gaussian distribution when traversing the latent space during an inversion. As a result, it can be challenging to attain a high-fidelity solution, particularly in the presence of noise and inaccurate physics-based models. To address this issue, we propose to re-parameterize and Gaussianize the latent vector using novel differentiable data-dependent layers wherein custom operators are defined by solving optimization problems. These proposed layers enforce an inversion to find a feasible solution within a Gaussian typical set of the latent space. We tested and validated our technique on an image deblurring task and eikonal tomography -- a PDE-constrained inverse problem and achieved high-fidelity results.
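关于"高斯典型集"的一个直观事实是:d 维标准高斯的典型样本范数集中在 sqrt(d) 附近。下面这个可微操作只实现了这一范数投影,作为对论文中(通过求解优化问题定义的)高斯化层思想的粗略示意,并非其完整方法:

import torch

def project_to_typical_shell(z: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """把单个潜向量 z(形状 (d,))重缩放到半径 sqrt(d) 的球壳上,
    即 N(0, I_d) 典型集所在处;整个操作可微,可嵌入反演迭代。"""
    d = z.numel()
    return z * (d ** 0.5) / z.norm().clamp_min(eps)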

【3】 Grounded Language-Image Pre-training 标题:接地语言-图像预训练 链接:https://arxiv.org/abs/2112.03857

作者:Liunian Harold Li,Pengchuan Zhang,Haotian Zhang,Jianwei Yang,Chunyuan Li,Yiwu Zhong,Lijuan Wang,Lu Yuan,Lei Zhang,Jenq-Neng Hwang,Kai-Wei Chang,Jianfeng Gao 机构:UCLA,Microsoft Research,University of Washington, University of Wisconsin-Madison,Microsoft Cloud and AI,International Digital Economy Academy 备注:Code will be released at this https URL 摘要:本文提出了一种接地语言-图像预训练(GLIP)模型,用于学习对象级、语言感知且语义丰富的视觉表示。GLIP将目标检测和短语接地(phrase grounding)统一起来进行预训练。这种统一带来两个好处:1)它使GLIP可以同时从检测数据和接地数据中学习,既改进两项任务,又引导出良好的接地模型;2)GLIP可以通过自训练方式生成接地框,从而利用海量图像-文本对,使学到的表示语义丰富。在实验中,我们在2700万条接地数据上预训练GLIP,其中包括300万条人工标注数据和2400万条网络爬取的图像-文本对。学到的表示对各类对象级识别任务表现出很强的零样本和少样本可迁移性。1)当直接在COCO和LVIS上评估时(预训练期间未见过COCO中的任何图像),GLIP分别达到49.8 AP和26.9 AP,超过许多有监督基线。2)在COCO上微调后,GLIP在val上达到60.8 AP,在test-dev上达到61.5 AP,超过此前的SoTA。3)迁移到13个下游目标检测任务时,单样本(1-shot)GLIP可与全监督的Dynamic Head相匹敌。代码将发布于 https://github.com/microsoft/GLIP。 摘要:This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code will be released at https://github.com/microsoft/GLIP.

【4】 A Survey on Intrinsic Images: Delving Deep Into Lambert and Beyond 标题:本征图像综述:深入朗伯模型及其拓展 链接:https://arxiv.org/abs/2112.03842

作者:Elena Garces,Carlos Rodriguez-Pardo,Dan Casas,Jorge Lopez-Moreno 备注:Accepted at International Journal of Computer Vision (to appear in 2022) this http URL 摘要:本征成像或本征图像分解传统上被描述为把图像分解为两层的问题:反射率层,即材料不随光照变化的反照率颜色;以及明暗层(shading),由光与几何之间的相互作用产生。近年来,深度学习技术被广泛用于提高这类分解的精度。在本综述中,我们结合常用的本征图像数据集和文献中使用的相关指标来梳理这些成果,并讨论它们对预测理想本征图像分解的适用性。尽管朗伯假设仍然是许多方法的基础,但我们指出,人们越来越意识到成像过程中更复杂的、符合物理原理的成分的潜力,即光学上精确的材质模型与几何,以及更完整的逆光传输估计。我们按照分解的类型对这些方法进行分类,考虑其使用的先验和模型,以及驱动分解过程的学习架构与方法。鉴于神经渲染、逆渲染和可微渲染技术的最新进展,我们还对未来的研究方向给出了见解。 摘要:Intrinsic imaging or intrinsic image decomposition has traditionally been described as the problem of decomposing an image into two layers: a reflectance, the albedo invariant color of the material; and a shading, produced by the interaction between light and geometry. Deep learning techniques have been broadly applied in recent years to increase the accuracy of those separations. In this survey, we overview those results in context of well-known intrinsic image data sets and relevant metrics used in the literature, discussing their suitability to predict a desirable intrinsic image decomposition. Although the Lambertian assumption is still a foundational basis for many methods, we show that there is increasing awareness on the potential of more sophisticated physically-principled components of the image formation process, that is, optically accurate material models and geometry, and more complete inverse light transport estimations. We classify these methods in terms of the type of decomposition, considering the priors and models used, as well as the learning architecture and methodology driving the decomposition process. We also provide insights about future directions for research, given the recent advances in neural, inverse and differentiable rendering techniques.

【5】 Polarimetric Pose Prediction 标题:偏振位姿预测 链接:https://arxiv.org/abs/2112.03810

作者:Daoyi Gao,Yitong Li,Patrick Ruhkamp,Iuliia Skobleva,Magdalena Wysock,HyunJun Jung,Pengyuan Wang,Arturo Guridi,Nassir Navab,Benjamin Busam 机构:Technical University of Munich, Germany 摘要:光有许多可以被视觉传感器被动测量的属性。按颜色通道分离的波长和强度可以说是单目6D物体位姿估计中最常用的两种。本文探讨了互补的偏振信息(即光波振荡的方向)如何影响位姿预测的准确性。我们设计了一个将物理先验与数据驱动学习策略相结合的混合模型,并在光度复杂度不同的物体上进行了细致测试。我们的设计不仅相比最先进的光度学方法显著提升了位姿精度,还使得对高反射和透明物体的位姿估计成为可能。 摘要:Light has many properties that can be passively measured by vision sensors. Colour-band separated wavelength and intensity are arguably the most commonly used ones for monocular 6D object pose estimation. This paper explores how complementary polarisation information, i.e. the orientation of light wave oscillations, can influence the accuracy of pose predictions. A hybrid model that leverages physical priors jointly with a data-driven learning strategy is designed and carefully tested on objects with different amount of photometric complexity. Our design not only significantly improves the pose accuracy in relation to photometric state-of-the-art approaches, but also enables object pose estimation for highly reflective and transparent objects.

【6】 Dilated convolution with learnable spacings 标题:具有可学习间隔的扩张卷积 链接:https://arxiv.org/abs/2112.03740

作者:Ismail Khalfaoui Hassani,Thomas Pellegrini,Timothée Masquelier 机构:ANITI, Université de Toulouse, France, IRIT, CNRS, CerCo UMR, CNRS 备注:15 pages 摘要:扩张卷积本质上是通过在核元素之间有规律地插入间隔来构造更宽卷积核的卷积。本文提出了一个新版本的扩张卷积,其中的间隔可以借助插值技术、通过反向传播进行学习。我们称这种方法为"具有可学习间隔的扩张卷积"(DCLS),并将其推广到 n 维卷积情形。不过,本文主要关注二维情形,并为其开发了两种实现:一种是直接构造扩张核的朴素实现,适用于较小的扩张率;另一种是使用改进版"im2col"算法、时间/内存效率更高的实现。随后,我们通过把经典扩张卷积层简单替换为 DCLS 层,展示了该技术如何在 Pascal Voc 2012 数据集的语义分割任务上提升现有架构的精度。此外,我们还表明,DCLS 可以把近期 ConvMixer 架构中深度卷积的可学习参数数量减少为原来的三分之一,而精度不降或仅略有下降,这是通过用稀疏的 DCLS 核替换大的稠密核实现的。该方法的代码基于 Pytorch,可从以下网址获得:https://github.com/K-H-Ismail/Dilated-Convolution-with-Learnable-Spacings-PyTorch. 摘要:Dilated convolution is basically a convolution with a wider kernel created by regularly inserting spaces between the kernel elements. In this article, we present a new version of the dilated convolution in which the spacings are made learnable via backpropagation through an interpolation technique. We call this method "Dilated Convolution with Learnable Spacings" (DCLS) and we generalize its approach to the n-dimensional convolution case. However, our main focus here will be the 2D case for which we developed two implementations: a naive one that constructs the dilated kernel, suitable for small dilation rates, and a more time/memory efficient one that uses a modified version of the "im2col" algorithm. We then illustrate how this technique improves the accuracy of existing architectures on semantic segmentation task on Pascal Voc 2012 dataset via a simple drop-in replacement of the classical dilated convolutional layers by DCLS ones. Furthermore, we show that DCLS allows to reduce the number of learnable parameters of the depthwise convolutions used in the recent ConvMixer architecture by a factor 3 with no or very low reduction in accuracy and that by replacing large dense kernels with sparse DCLS ones. The code of the method is based on Pytorch and available at: https://github.com/K-H-Ismail/Dilated-Convolution-with-Learnable-Spacings-PyTorch.
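DCLS 的"可学习间隔"可以理解为:把每个核元素按连续的可学习位置、经插值散布到一个稠密核上,从而让位置获得梯度。下面给出一个一维简化草图(论文为二维情形并用双线性插值;此处通道数、核元素数等均为示意参数,非官方实现):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DCLS1d(nn.Module):
    """一维 DCLS 草图:每个深度卷积核只有 n_elem 个非零元素,
    其连续位置 p 经线性插值散布到长度 max_size 的稠密核上,因此可反向传播学习。"""
    def __init__(self, channels: int, n_elem: int = 3, max_size: int = 17):
        super().__init__()
        self.w = nn.Parameter(torch.randn(channels, n_elem) * 0.1)              # 核元素权重
        self.p = nn.Parameter(torch.rand(channels, n_elem) * (max_size - 1))    # 连续位置
        self.channels, self.max_size = channels, max_size

    def dense_kernel(self) -> torch.Tensor:
        i0 = self.p.floor().long().clamp(0, self.max_size - 2)      # 左侧整数格点
        frac = (self.p - i0.float()).clamp(0, 1)                    # 小数部分,梯度流向 p
        k = torch.zeros(self.channels, self.max_size, device=self.w.device)
        k.scatter_add_(1, i0, self.w * (1 - frac))                  # 按线性插值权重散布
        k.scatter_add_(1, i0 + 1, self.w * frac)
        return k.unsqueeze(1)                                       # (C, 1, K) 深度卷积核

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, C, L)
        return F.conv1d(x, self.dense_kernel(), padding=self.max_size // 2, groups=self.channels)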

【7】 Saliency Diversified Deep Ensemble for Robustness to Adversaries 标题:显著性多样化的深度集成以增强对抗鲁棒性 链接:https://arxiv.org/abs/2112.03615

作者:Alex Bogun,Dimche Kostadinov,Damian Borth 机构:University of St. Gallen 备注:Accepted to AAAI Workshop on Adversarial Machine Learning and Beyond 2022 摘要:深度学习模型在大量图像识别、分类和重建任务中表现出惊人的性能。尽管其预测能力使其颇具吸引力和价值,但一个共同的威胁始终难以解决:经过专门训练的攻击者可以引入恶意的输入扰动来欺骗网络,从而造成潜在有害的错误预测。而且,无论对手能完全访问目标模型(白盒),还是访问受限(黑盒设置),这些攻击都可能成功。模型集成可以抵御此类攻击,但若其成员之间存在共享漏洞(攻击可迁移性),集成也可能变得脆弱。为此,本工作为深度集成提出了一种新的促进多样性的学习方法:通过在学习目标中引入一个附加项,在集成成员间促进显著性图多样性(SMD),防止攻击者一次性攻击所有集成成员。在训练期间,这有助于最小化各模型显著性之间的对齐程度,减少成员间共享的漏洞,从而提高集成的对抗鲁棒性。实验表明,与针对中、高强度白盒攻击的最先进集成防御相比,我们的方法降低了集成成员之间的可迁移性并提升了性能。此外,我们还证明,我们的方法与现有方法相结合,在白盒和黑盒攻击下的防御性能优于最先进的集成算法。 摘要:Deep learning models have shown incredible performance on numerous image recognition, classification, and reconstruction tasks. Although very appealing and valuable due to their predictive capabilities, one common threat remains challenging to resolve. A specifically trained attacker can introduce malicious input perturbations to fool the network, thus causing potentially harmful mispredictions. Moreover, these attacks can succeed when the adversary has full access to the target model (white-box) and even when such access is limited (black-box setting). The ensemble of models can protect against such attacks but might be brittle under shared vulnerabilities in its members (attack transferability). To that end, this work proposes a novel diversity-promoting learning approach for the deep ensembles. The idea is to promote saliency map diversity (SMD) on ensemble members to prevent the attacker from targeting all ensemble members at once by introducing an additional term in our learning objective. During training, this helps us minimize the alignment between model saliencies to reduce shared member vulnerabilities and, thus, increase ensemble robustness to adversaries. We empirically show a reduced transferability between ensemble members and improved performance compared to the state-of-the-art ensemble defense against medium and high strength white-box attacks. In addition, we demonstrate that our approach combined with existing methods outperforms state-of-the-art ensemble algorithms for defense under white-box and black-box attacks.
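其"显著性图多样性(SMD)"项可以这样示意:用各成员损失对输入的梯度作为显著性图,并惩罚成员两两之间显著性的余弦对齐(显著性定义与配对方式为假设的简化,非论文原始实现;假定集成至少有两个成员):

import torch
import torch.nn.functional as F

def saliency_diversity_loss(models, x, y):
    """models: 集成成员列表;x: (B, ...) 输入;y: (B,) 标签。
    返回成员间显著性图的平均余弦相似度,作为待最小化的附加项。"""
    x = x.clone().detach().requires_grad_(True)
    sal = []
    for m in models:
        loss = F.cross_entropy(m(x), y)
        g, = torch.autograd.grad(loss, x, create_graph=True)  # 保留计算图,便于端到端训练
        sal.append(g.flatten(1))
    pairs = [F.cosine_similarity(sal[i], sal[j], dim=1).mean()
             for i in range(len(sal)) for j in range(i + 1, len(sal))]
    return torch.stack(pairs).mean()

# 总目标示意:各成员的分类损失之和 + lambda * saliency_diversity_loss(models, x, y)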

【8】 GaTector: A Unified Framework for Gaze Object Prediction 标题:GaTector:一个统一的凝视对象预测框架 链接:https://arxiv.org/abs/2112.03549

作者:Binglu Wang,Tao Hu,Baoshan Li,Xiaojuan Chen,Zhijie Zhang 备注:Technical report 摘要:凝视物体预测(GOP)是一项新提出的任务,旨在发现人类注视的物体。它具有重要的应用意义,但仍缺乏统一的解决框架。一个直观的方案是把目标检测分支并入现有的凝视预测方法。然而,以往的凝视预测方法通常使用两个不同的网络分别从场景图像和头部图像中提取特征,这会导致网络结构臃肿,并妨碍各分支的联合优化。在本文中,我们构建了一个名为GaTector的新框架,以统一方式解决凝视物体预测问题。特别地,首次提出了一种"特定-通用-特定"(SGS)特征提取器,利用共享主干同时为场景图像和头部图像提取通用特征。为了更好地兼顾输入和任务的特殊性,SGS在共享主干之前引入两个输入特定块,在共享主干之后引入三个任务特定块。其中,设计了一种新颖的离焦(defocus)层,在不丢失信息、不增加额外计算的前提下为目标检测任务生成对象特定特征。此外,引入能量聚集损失来引导凝视热图集中于被注视的框。最后,我们提出了一种新的mDAP度量:即使两个边界框没有重叠区域,它也能刻画二者之间的差异。在GOO数据集上的大量实验验证了我们的方法在目标检测、凝视估计和凝视物体预测这三个任务上的优越性。 摘要:Gaze object prediction (GOP) is a newly proposed task that aims to discover the objects being stared at by humans. It is of great application significance but still lacks a unified solution framework. An intuitive solution is to incorporate an object detection branch into an existing gaze prediction method. However, previous gaze prediction methods usually use two different networks to extract features from scene image and head image, which would lead to heavy network architecture and prevent each branch from joint optimization. In this paper, we build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way. Particularly, a specific-general-specific (SGS) feature extractor is firstly proposed to utilize a shared backbone to extract general features for both scene and head images. To better consider the specificity of inputs and tasks, SGS introduces two input-specific blocks before the shared backbone and three task-specific blocks after the shared backbone. Specifically, a novel defocus layer is designed to generate object-specific features for object detection task without losing information or requiring extra computations. Moreover, the energy aggregation loss is introduced to guide the gaze heatmap to concentrate on the stared box. In the end, we propose a novel mDAP metric that can reveal the difference between boxes even when they share no overlapping area. Extensive experiments on the GOO dataset verify the superiority of our method in all three tracks, i.e. object detection, gaze estimation, and gaze object prediction.
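其中"能量聚集损失"的直观形式是最大化热图能量落入被注视框内的比例。以下按此直觉给出示意(坐标约定与归一化方式均为假设,论文的确切定义可能不同):

import torch

def energy_aggregation_loss(heatmaps: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, H, W) 非负凝视热图;boxes: (B, 4) 像素坐标 [x1, y1, x2, y2]。
    损失 = 1 - 框内能量 / 总能量,鼓励热图能量集中于凝视框。"""
    losses = []
    for h, box in zip(heatmaps, boxes.long()):
        x1, y1, x2, y2 = box.tolist()
        inside = h[y1:y2, x1:x2].sum()
        losses.append(1.0 - inside / h.sum().clamp_min(1e-8))
    return torch.stack(losses).mean()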

【9】 GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision 标题:基于GPU的计算机视觉极小问题的同伦延拓 链接:https://arxiv.org/abs/2112.03444

作者:Chiang-Heng Chien,Hongyi Fan,Ahmad Abdelfattah,Elias Tsigaridas,Stanimire Tomov,Benjamin Kimia 机构:School of Engineering, Brown University, Innovative Computing Laboratory, University of Tennessee, INRIA 摘要:多项式方程组经常出现在计算机视觉中,尤其是多视图几何问题。求解这些方程组的传统方法通常旨在消去变量、化归为单变量多项式(例如五点位姿估计中的十次多项式),或借助巧妙的变换,或更一般地使用 Grobner 基、结式和消元模板,由此产生了多视图几何等问题上的成功算法。然而,当问题过于复杂时这些方法会失效;即便可行,也面临效率和稳定性问题。同伦延拓(HC)可以求解更复杂的问题而不存在稳定性问题,并能保证得到全局解,但众所周知其速度较慢。在本文中,我们展示了HC可以在GPU上并行化,在多项式基准测试上获得高达26倍的显著加速。我们还表明,GPU-HC可以普遍应用于一系列计算机视觉问题,包括四视图三角化和焦距未知的三焦点位姿估计;这些问题无法用消元模板求解,但可以用HC高效求解。GPU-HC为一系列计算机视觉问题的便捷建模与求解打开了大门。 摘要:Systems of polynomial equations arise frequently in computer vision, especially in multiview geometry problems. Traditional methods for solving these systems typically aim to eliminate variables to reach a univariate polynomial, e.g., a tenth-order polynomial for 5-point pose estimation, using clever manipulations, or more generally using Grobner basis, resultants, and elimination templates, leading to successful algorithms for multiview geometry and other problems. However, these methods do not work when the problem is complex and when they do, they face efficiency and stability issues. Homotopy Continuation (HC) can solve more complex problems without the stability issues, and with guarantees of a global solution, but they are known to be slow. In this paper we show that HC can be parallelized on a GPU, showing significant speedups up to 26 times on polynomial benchmarks. We also show that GPU-HC can be generically applied to a range of computer vision problems, including 4-view triangulation and trifocal pose estimation with unknown focal length, which cannot be solved with elimination template but they can be efficiently solved with HC. GPU-HC opens the door to easy formulation and solution of a range of computer vision problems.

【10】 Producing augmentation-invariant embeddings from real-life imagery 标题:从真实图像生成增广不变嵌入 链接:https://arxiv.org/abs/2112.03415

作者:Sergio Manuel Papadakis,Sanjay Addicam 摘要:本文提出了一种从真实图像中生成特征丰富的高维嵌入空间的高效方法。所生成的特征被设计为与社交媒体真实场景中出现的各类增强无关。我们的方法使用卷积神经网络(CNN)生成嵌入空间,训练时采用 ArcFace 头,并借助自动生成的增广。此外,我们还提出:一种把包含相同语义信息的不同嵌入进行集成的方法;一种利用外部数据集对所得嵌入进行规范化的方法;以及一种在 ArcFace 头包含大量类别时快速训练这些模型的新方法。凭借该方法,我们在 2021 Facebook AI 图像相似性挑战赛(Descriptor Track)中获得第二名。 摘要:This article presents an efficient way to produce feature-rich, high-dimensionality embedding spaces from real-life images. The features produced are designed to be independent from augmentations used in real-life cases which appear on social media. Our approach uses convolutional neural networks (CNN) to produce an embedding space. An ArcFace head was used to train the model by employing automatically produced augmentations. Additionally, we present a way to make an ensemble out of different embeddings containing the same semantic information, a way to normalize the resulting embedding using an external dataset, and a novel way to perform quick training of these models with a high number of classes in the ArcFace head. Using this approach we achieved the 2nd place in the 2021 Facebook AI Image Similarity Challenge: Descriptor Track.
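文中使用的 ArcFace 头,其通用形式是在目标类的角度上加间隔 m、再乘尺度 s 得到 logits。下面是这一常见公式的自包含草图(s、m 取社区常用默认值,与该文具体超参数无关):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace:对目标类 logit_y = s * cos(theta_y + m),其余类为 s * cos(theta)。"""
    def __init__(self, dim: int, n_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # 归一化后点积即余弦相似度;裁剪以保证 acos 数值稳定
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        onehot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(onehot, torch.cos(theta + self.m), cos)
        return self.s * logits   # 再接交叉熵损失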

【11】 DIY Graphics Tab: A Cost-Effective Alternative to Graphics Tablet for Educators 标题:DIY Graphics Tab:面向教育工作者的图形数位板高性价比替代方案 链接:https://arxiv.org/abs/2112.03269

作者:Mohammad Imrul Jubair,Arafat Ibne Yousuf,Tashfiq Ahmed,Hasanath Jamy,Foisal Reza,Mohsena Ashraf 机构:Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Bangladesh 备注:Accepted in AAAI2022 workshop 摘要:每天都有越来越多的人转向在线学习,这改变了传统的课堂教学方式。录制讲课一直是在线教育者的常规工作,在疫情期间更显重要,因为一些国家的线下课程仍被推迟。录制讲课时,图形数位板凭借其便携性和与计算机交互的能力,是白板的绝佳替代品;然而,图形数位板对大多数教师来说过于昂贵。在本文中,我们为教师和教育工作者提出了一种基于计算机视觉的图形数位板替代方案,其功能与图形数位板大体相同,却只需要一支笔、一张纸和一台笔记本电脑的网络摄像头。我们称之为 "Do-It-Yourself Graphics Tab",简称 "DIY Graphics Tab"。我们的系统以摄像头拍摄的人在纸上书写的图像序列为输入,输出包含纸上书写内容的屏幕画面。这项任务并不简单,存在许多障碍,例如手部造成的遮挡、纸张的随机移动、光照条件差、拍摄视角造成的透视失真等。输入录像经由一条流水线进入我们的系统,在生成相应输出之前先执行实例分割和预处理。我们还面向教师和学生开展了用户体验评估,并在本文中分析了他们的反馈。 摘要:Everyday, more and more people are turning to online learning, which has altered our traditional classroom method. Recording lectures has always been a normal task for online educators, and it has lately become even more important during the epidemic because actual lessons are still being postponed in several countries. When recording lectures, a graphics tablet is a great substitute for a whiteboard because of its portability and ability to interface with computers. This graphic tablet, however, is too expensive for the majority of instructors. In this paper, we propose a computer vision-based alternative to the graphics tablet for instructors and educators, which functions largely in the same way as a graphic tablet but just requires a pen, paper, and a laptop's webcam. We call it "Do-It-Yourself Graphics Tab" or "DIY Graphics Tab". Our system receives a sequence of images of a person's writing on paper acquired by a camera as input and outputs the screen containing the contents of the writing from the paper. The task is not straightforward since there are many obstacles such as occlusion due to the person's hand, random movement of the paper, poor lighting condition, perspective distortion due to the angle of view, etc. A pipeline is used to route the input recording through our system, which conducts instance segmentation and preprocessing before generating the appropriate output. We also conducted user experience evaluations from the teachers and students, and their responses are examined in this paper.
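流水线中"矫正视角透视失真"的一步通常可用单应变换完成。以下是基于 OpenCV 的示意(假设纸张四角已由前述实例分割步骤给出;函数名、角点顺序与输出分辨率均为举例):

import cv2
import numpy as np

def rectify_page(frame: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """frame: 摄像头帧;corners: 纸张四角像素坐标,顺序为左上、右上、右下、左下,形状 (4, 2)。
    返回透视校正后的"屏幕"图像。"""
    dst_w, dst_h = 1280, 960   # 输出分辨率(示意值)
    dst = np.float32([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]])
    M = cv2.getPerspectiveTransform(np.float32(corners), dst)   # 四点对应求单应
    return cv2.warpPerspective(frame, M, (dst_w, dst_h))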

【12】 Efficient joint noise removal and multi exposure fusion 标题:高效的联合去噪和多曝光融合 链接:https://arxiv.org/abs/2112.03701

作者:A. Buades,J. L Lisani,O. Martorell 机构:Institute of Applied Computing and Community Code (IAC), Dept. of Mathematics and Computer Science, Universitat de les Illes Balears, Palma, Spain 摘要:多曝光融合(MEF)是把以不同曝光设置获取的同一场景的多幅图像合成为单幅图像的技术。已有的MEF算法都是组合图像集,以某种方式从每幅图像中选取曝光更好的部分。我们提出了一种把去噪纳入考虑的新型多曝光图像融合链,该方法利用了DCT处理以及MEF问题的多图像特性。我们提出一种联合融合与去噪策略,利用时空图像块选择和协同3D阈值化。整体策略允许对图像集直接去噪并融合,而无需先恢复每幅去噪后的曝光图像,从而得到非常高效的流程。 摘要:Multi-exposure fusion (MEF) is a technique for combining different images of the same scene acquired with different exposure settings into a single image. All the proposed MEF algorithms combine the set of images, somehow choosing from each one the part with better exposure. We propose a novel multi-exposure image fusion chain taking into account noise removal. The novel method takes advantage of DCT processing and the multi-image nature of the MEF problem. We propose a joint fusion and denoising strategy taking advantage of spatio-temporal patch selection and collaborative 3D thresholding. The overall strategy permits to denoise and fuse the set of images without the need of recovering each denoised exposure image, leading to a very efficient procedure.
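"协同3D阈值化"的最小示意如下:把来自不同曝光帧的相似图像块堆叠为三维数组,整体做 3D DCT、硬阈值、再逆变换(类 BM3D 的协同滤波;阈值常数为经验假设,非论文给定):

import numpy as np
from scipy.fft import dctn, idctn

def collaborative_dct_threshold(patch_stack: np.ndarray, sigma: float, tau: float = 2.7) -> np.ndarray:
    """patch_stack: (K, p, p),K 个同位置/相似图像块;sigma: 噪声标准差。
    返回去噪后的块堆叠,可再按融合权重合成单幅图像。"""
    coef = dctn(patch_stack, norm='ortho')       # 对整个堆叠做 3D DCT
    coef[np.abs(coef) < tau * sigma] = 0.0       # 硬阈值:抑制噪声主导的系数
    return idctn(coef, norm='ortho')             # 逆变换回像素域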

【13】 Evaluating Generic Auto-ML Tools for Computational Pathology 标题:评估用于计算病理学的通用Auto-ML工具 链接:https://arxiv.org/abs/2112.03622

作者:Lars Ole Schwen,Daniela Schacherer,Christian Geißler,André Homeyer 机构:Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Str. , Bremen, Germany, DAI-Labor, Technische Universität Berlin, Ernst-Reuter-Platz , Berlin, Germany 摘要:计算病理学中的图像分析任务通常使用卷积神经网络(CNN)来解决。合适的CNN架构和超参数的选择通常通过探索式的迭代优化完成,计算代价高且需要大量人工。本文的目标是评估通用的神经网络架构搜索与超参数优化工具在计算病理学常见用例中的表现。为此,我们针对组织学图像的三种不同分类任务(组织分类、突变预测和分级),评估了一种本地部署工具和一种基于云的工具。我们发现,所评估的AutoML工具的默认CNN架构和参数化已经能取得与原始出版物相当的分类性能;尽管付出了额外的计算开销,这些任务上的超参数优化并没有带来显著的性能提升。然而,由于非确定性效应,不同AutoML运行所得分类器之间的性能差异很大。因此,通用CNN架构和AutoML工具可能是人工优化CNN架构与参数化的可行替代方案,这将使计算病理学软件解决方案的开发者能把精力集中在更难自动化的任务(如数据整理)上。 摘要:Image analysis tasks in computational pathology are commonly solved using convolutional neural networks (CNNs). The selection of a suitable CNN architecture and hyperparameters is usually done through exploratory iterative optimization, which is computationally expensive and requires substantial manual work. The goal of this article is to evaluate how generic tools for neural network architecture search and hyperparameter optimization perform for common use cases in computational pathology. For this purpose, we evaluated one on-premises and one cloud-based tool for three different classification tasks for histological images: tissue classification, mutation prediction, and grading. We found that the default CNN architectures and parameterizations of the evaluated AutoML tools already yielded classification performance on par with the original publications. Hyperparameter optimization for these tasks did not substantially improve performance, despite the additional computational effort. However, performance varied substantially between classifiers obtained from individual AutoML runs due to non-deterministic effects. Generic CNN architectures and AutoML tools could thus be a viable alternative to manually optimizing CNN architectures and parametrizations. This would allow developers of software solutions for computational pathology to focus efforts on harder-to-automate tasks such as data curation.

【14】 Dynamic imaging using Motion-Compensated SmooThness Regularization on Manifolds (MoCo-SToRM) 标题:基于流形上运动补偿平滑正则化的动态成像(MoCo-SToRM) 链接:https://arxiv.org/abs/2112.03380

作者:Qing Zou,Luis A. Torres,Sean B. Fain,Nara S. Higano,Alister J. Bates,Mathews Jacob 机构:Department of Electrical and Computer Engineering, The University of Iowa, Iowa, Department of Medical Physics, University of Wisconsin, Madison, WI, USA, Department of Radiology, The University of Iowa, Iowa City, IA, USA 摘要:我们介绍了一种用于高分辨率自由呼吸肺部MRI的无监督运动补偿重建方案。我们把时间序列中的各图像帧建模为三维模板图像体积的变形版本,并假设这些变形映射是高维空间中光滑流形上的点。具体来说,我们把每个时刻的变形映射建模为一个基于CNN的生成器的输出:该生成器在所有时间帧共享权重,由一个低维潜向量驱动。潜向量的时间序列刻画了数据集中的动态,包括呼吸运动和整体运动。模板图像体积、生成器参数和潜向量均以无监督方式直接从k-t空间数据中学习。实验结果表明,与最先进的方法相比,我们的重建效果有所提升,尤其是在扫描过程中存在整体运动的情况下。 摘要:We introduce an unsupervised motion-compensated reconstruction scheme for high-resolution free-breathing pulmonary MRI. We model the image frames in the time series as the deformed version of the 3D template image volume. We assume the deformation maps to be points on a smooth manifold in high-dimensional space. Specifically, we model the deformation map at each time instant as the output of a CNN-based generator that has the same weight for all time-frames, driven by a low-dimensional latent vector. The time series of latent vectors account for the dynamics in the dataset, including respiratory motion and bulk motion. The template image volume, the parameters of the generator, and the latent vectors are learned directly from the k-t space data in an unsupervised fashion. Our experimental results show improved reconstructions compared to state-of-the-art methods, especially in the context of bulk motion during the scans.
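其"模板 + 变形场"的生成过程可以用可微网格采样来示意。下面给出二维版本(论文处理的是三维体数据,变形场由潜向量驱动的 CNN 生成器给出;此处仅为概念草图,位移场的坐标约定为假设):

import torch
import torch.nn.functional as F

def warp_template(template: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """template: (B, 1, H, W) 模板图像;flow: (B, 2, H, W),
    为 [-1, 1] 归一化坐标系下的 (dx, dy) 位移场。返回变形后的图像帧。"""
    B, _, H, W = template.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=template.device),
        torch.linspace(-1, 1, W, device=template.device),
        indexing='ij',
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)  # (B, H, W, 2) 恒等网格
    grid = base + flow.permute(0, 2, 3, 1)                                   # 叠加位移
    return F.grid_sample(template, grid, align_corners=True)                 # 可微采样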

机器翻译,仅供参考