Mass spectrometry imaging (MSI), which localizes molecules in a tag-free, spatially resolved manner, is a powerful tool for understanding the biochemical mechanisms underlying biological phenomena. When analyzing MSI data, it is essential to delineate regions of interest (ROIs) that correspond to tissue areas with different anatomical or pathological labels. Spatial segmentation, obtained by clustering MSI pixels according to their mass spectral similarity, is a popular approach to automating ROI definition. However, how to select the number of clusters (#Clusters), which determines the granularity of the segmentation, remains unresolved, and an inappropriate #Clusters may yield ROIs that are not biologically real. Here we report a multimodal fusion strategy that enables an objective and trustworthy selection of #Clusters by utilizing additional information from the corresponding histology images. A deep learning-based algorithm is proposed to extract "histomorphological feature spectra" across an entire hematoxylin and eosin (H&E) image. These feature spectra are then clustered in the same way as the MSI pixels to produce a histology segmentation. Since ROIs originating from instrumental noise or artifacts would not be reproduced across modalities, the consistency between the histology and MSI segmentations becomes an effective measure of the biological validity of the results. The #Clusters that maximizes this consistency is therefore deemed the most probable. We validated our strategy on mouse kidney and renal tumor specimens by producing multimodally corroborated ROIs that agreed excellently with the ground truth. Downstream analysis based on these ROIs revealed lipid molecules highly specific to tissue anatomy or pathology. Our work will greatly facilitate MSI-mediated spatial lipidomics, metabolomics, and proteomics research by providing intelligent software that automatically and reliably generates ROIs.
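The selection rule described above (cluster both modalities at each candidate #Clusters and keep the value at which the two segmentations agree best) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from this work: k-means stands in for the clustering step, the adjusted Rand index stands in for the consistency measure, the two images are assumed to be co-registered so that pixel i refers to the same tissue location in both, and the function names and one-feature toy "spectra" are hypothetical.

```python
from collections import Counter
from math import comb


def sqdist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, n_iter=50):
    """Tiny deterministic k-means (farthest-point init); one label per point.

    Assumption: a stand-in for whatever clustering method is actually used.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two labelings of the same pixels.

    Assumption: one possible consistency measure, chosen for illustration.
    """
    pairs_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(len(a), 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def select_n_clusters(msi_spectra, histo_features, candidates):
    """Return the candidate #Clusters with the highest cross-modal consistency."""
    scores = {}
    for k in candidates:
        msi_seg = kmeans(msi_spectra, k)
        histo_seg = kmeans(histo_features, k)
        scores[k] = adjusted_rand_index(msi_seg, histo_seg)
    return max(scores, key=scores.get), scores


# Toy co-registered data (hypothetical): 12 pixels from three tissue regions A, B, C.
msi = [[0], [2], [4], [9], [40], [41], [42], [43], [100], [101], [102], [103]]
histo = [[0], [1], [2], [3], [100], [101], [102], [103], [60], [62], [64], [72]]

best_k, scores = select_n_clusters(msi, histo, candidates=range(2, 5))
print(best_k, scores)
```

On this toy data the consistency score peaks at three clusters: at two clusters the modalities merge different pairs of regions, and at four each modality splits a different region, so only the three-region segmentation is reproduced in both.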