Sample preparation and data acquisition
Unlabeled heart tissue sections were obtained from the USC Keck School of Medicine under Institutional Review Board (IRB) #HS-20-00151 with ethical approval granted. Following the clinical standard for amyloid inspection, 8 μm sections were cut from archived cardiac tissue blocks that had tested positive for Congo red. Samples with less than 5 mm² of tissue area or less than 5% amyloid-involved tissue, as determined by the original pathology report, were excluded. After autofluorescence scanning, standard Congo red staining was performed.
Autofluorescence images of the aforementioned label-free cardiac tissue samples were taken using a conventional scanning fluorescence microscope (IX-83, Olympus) equipped with a ×40/0.95NA objective lens (UPLSAPO, Olympus). These images were captured at four distinct excitation and emission wavelengths, each using a fluorescent filter set in a filter cube: DAPI (Semrock DAPI-5060C-OFX, EX 377/50 nm, EM 447/60 nm), FITC (Semrock FITC-2024B-OFX, EX 485/20 nm, EM 522/24 nm), TxRed (Semrock TXRED-4040C-OFX, EX 562/40 nm, EM 624/40 nm), and Cy5 (Semrock CY5-4040C-OFX, EX 628/40 nm, EM 692/40 nm). Note that the selection of the filters was primarily empirical, based on the availability of commonly used fluorescence filter sets, which spanned the visible spectrum while having sufficient spectral selectivity to capture distinct autofluorescence bands effectively42. The autofluorescence images were recorded using a scientific complementary metal-oxide-semiconductor (sCMOS) image sensor (ORCA-flash4.0 V2, Hamamatsu Photonics) with exposure times of 150 ms, 500 ms, 500 ms, and 1000 ms for the DAPI, FITC, TxRed, and Cy5 filters, respectively. The exposure time for each channel was selected based on the observed intensity levels so that a substantial portion of the image sensor's dynamic range was used. μManager (version 1.4) software43, designed for microscope management, was used for the automated image capture process. Autofocus44 was applied to the first autofluorescence channel (DAPI) for each FOV. Following the completion of the standard Congo red staining, high-resolution brightfield WSIs were obtained using a scanning microscope (AxioScan Z1, Zeiss) with a ×20/0.8NA objective lens (Plan-Apo) at the Translational Pathology Core Laboratory (TPCL) at UCLA. Polarized images of Congo red-stained slides were captured using a modified conventional brightfield microscope (IX-83, Olympus) with a halogen lamp used as the illumination source without blue filters. The microscope was equipped with a linear polarizer (U-POT, Olympus) and an analyzer with an adjustable wave plate (U-GAN, Olympus) with a ×20/0.75NA objective lens (UPlanSApo). The linear polarizer was placed on the condenser adapter (between the light source and the sample slide) and fixed in a customized 3D-printed holder (Ultimaker S3, PETG black) to ensure a consistent polarization orientation during the scanning process. The analyzer was inserted into the slider-compatible revolving nosepiece, between the sample slide and the image sensor. The orientations of all polarization components were adjusted by a board-certified pathologist for optimal image quality.
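For reference, the four-channel acquisition settings described above can be summarized in a simple configuration structure; the filter specifications and exposure times are taken directly from the text, while the dictionary itself is only an illustrative convention and not the actual μManager acquisition configuration.

```python
# Illustrative summary of the four autofluorescence channels described above.
# The filter sets and exposure times follow the text; this dictionary is a
# convenience structure, not the actual uManager acquisition settings.
AUTOFLUORESCENCE_CHANNELS = {
    "DAPI":  {"filter": "Semrock DAPI-5060C-OFX",  "ex_nm": "377/50", "em_nm": "447/60", "exposure_ms": 150},
    "FITC":  {"filter": "Semrock FITC-2024B-OFX",  "ex_nm": "485/20", "em_nm": "522/24", "exposure_ms": 500},
    "TxRed": {"filter": "Semrock TXRED-4040C-OFX", "ex_nm": "562/40", "em_nm": "624/40", "exposure_ms": 500},
    "Cy5":   {"filter": "Semrock CY5-4040C-OFX",   "ex_nm": "628/40", "em_nm": "692/40", "exposure_ms": 1000},
}
```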
Image preprocessing and registration
The registration process is essential to successfully train an image-to-image translation network. An alternative is to apply CycleGAN-like architectures33,45, which do not require paired samples for training. However, such unpaired image-based approaches result in inferior image quality, with potential hallucinations, compared to training with precisely registered image pairs18,21,38. In this work, we registered both the brightfield and birefringence images individually to match the autofluorescence images of the same samples using a two-step registration process. First, we stitched all the images into WSIs for all three imaging modalities and globally registered them by detecting and matching speeded-up robust features (SURF) on downsampled WSIs46,47. Then, we estimated the spatial transformations (projective) from the detected features with an M-estimator sample consensus algorithm48 and warped the brightfield and birefringence WSIs accordingly. Following this, the coarsely matched autofluorescence, brightfield and birefringence WSIs were divided into bundles of image tiles, each consisting of 2048 × 2048 pixels. Using these image tiles, we further improved the accuracy of our image registration with a correlation-based elastic registration algorithm18,49 to address optical aberrations among the different imaging systems and morphological changes that occurred during the histochemical staining process. During this elastic registration process, a registration neural network model was trained to align the style of the autofluorescence images with the styles of the brightfield and birefringence images. The registration model used for this purpose shared the same architecture and training strategy as our virtual staining network (detailed in the next section), but without a registration submodule, and was only used in the data preparation stage. After the image style transformation using the registration model, the pyramid elastic image registration algorithm was applied. This process involved hierarchically matching the local features of sub-image blocks at different resolutions and calculating transformation maps. These transformation maps were then used to correct the local distortions in the brightfield and birefringence images, resulting in a better match with their autofluorescence counterparts. This training and registration process was repeated for both brightfield and birefringence images until precise pixel-level registration was achieved. The registration steps were implemented using MATLAB (MathWorks), Fiji50 and PyTorch51. To simulate the polarization filter rotation and generate images with yellow-colored amyloid deposits, we first extracted the apple-green part of an image using a segmentation algorithm (see the “Quantitative evaluation metrics for polarization Congo red virtual staining” section). Then, we multiplied the hue channel of the extracted regions by a fixed factor (~0.6) to transform the color of the amyloid deposits from apple-green to yellow. Meanwhile, we also multiplied the hue channel of the background regions by ~1.1 to simulate the background color change to dark blue. These parameters were tuned and approved by a pathologist.
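As a rough illustration of this hue manipulation, the sketch below (written with scikit-image, which represents hue in the range [0, 1]) scales the hue channel of the segmented amyloid regions and of the background separately; the scaling factors (~0.6 and ~1.1) are those quoted above, while amyloid_mask is a hypothetical binary mask produced by the segmentation algorithm referenced in the text.

```python
import numpy as np
from skimage.color import rgb2hsv, hsv2rgb

def simulate_yellow_amyloid(rgb, amyloid_mask, amyloid_scale=0.6, background_scale=1.1):
    """Shift apple-green amyloid toward yellow and the background toward dark blue
    by scaling the HSV hue channel (illustrative sketch of the procedure above).

    rgb          : float RGB image in [0, 1], shape (H, W, 3)
    amyloid_mask : boolean array, True where apple-green birefringence was segmented
    """
    hsv = rgb2hsv(rgb)
    hue = hsv[..., 0]
    hue[amyloid_mask] *= amyloid_scale      # green (~0.33) moves toward yellow (~0.17)
    hue[~amyloid_mask] *= background_scale  # background hue shifted toward dark blue
    hsv[..., 0] = np.clip(hue, 0.0, 1.0)
    return hsv2rgb(hsv)
```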
Quantitative evaluation metrics for brightfield Congo red virtual staining
To quantitatively evaluate the performance of brightfield Congo red virtual staining, we organized 56 FOVs of virtually generated brightfield Congo red images together with their corresponding histochemically stained images for paired image comparison. Similar to refs. 23,52, we first used the standard metrics of MAE, MS-SSIM, PSNR, and FID, as well as other customized metrics for brightfield images, including the number of nuclei per FOV and the average area of nuclei per FOV, as shown in Supplementary Fig. 9a. MAE was defined as,
$${MAE}=\frac{1}{{MN}} {\sum}_{m} {\sum}_{n}\left|A\left(m,\, n\right)-B\left(m,\, n\right)\right|$$
(1)
where \(A,\, {B}\) represent histochemically and virtually stained brightfield Congo red images, respectively, \(m\) and \(n\) are the pixel indices, and \(M\times N\) denotes the total number of pixels in each image.
The MS-SSIM53 evaluated the SSIM between two images at different spatial levels, where the number of scales was set to 6, and the weights applied to each scale were set to (0.05, 0.05, 0.1, 0.15, 0.2, 0.45).
The PSNR was calculated using,
$${PSNR}=10 \,{\log }_{10}\left(\frac{{\max \left(A\right)}^{2}}{{MSE}}\right)$$
(2)
where \(A\) represents the ground truth image (histochemically stained), and MSE was defined as,
$${MSE}=\frac{1}{MN} {\sum}_{m} {\sum}_{n}{\left(A\left(m,\, n\right)-B\left(m,\, n\right)\right)}^{2}$$
(3)
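For clarity, Eqs. (1)–(3) correspond to the following minimal NumPy implementation, where A is the histochemically stained ground truth and B the virtually stained image:

```python
import numpy as np

def mae(A, B):
    """Mean absolute error, Eq. (1)."""
    return np.mean(np.abs(A.astype(np.float64) - B.astype(np.float64)))

def mse(A, B):
    """Mean squared error, Eq. (3)."""
    return np.mean((A.astype(np.float64) - B.astype(np.float64)) ** 2)

def psnr(A, B):
    """Peak signal-to-noise ratio, Eq. (2), with the peak taken as max(A)."""
    return 10.0 * np.log10(float(A.max()) ** 2 / mse(A, B))
```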
The FID was calculated according to the definition reported in ref. 54, using the default feature number.
As for the quantification of nuclei properties, we adopted a stain deconvolution method55 to first separate the channel corresponding to the nuclear stain. Then, Otsu’s thresholding56 and a series of morphological operations, such as image dilation and erosion, were applied to the nuclei channel to obtain the segmented binary nuclei mask. The number of nuclei per FOV was defined as the number of connected components in the binary nuclei mask, and the average area of nuclei per FOV corresponds to the average area of the connected components across the whole binary nuclei mask, in units of pixel².
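A simplified version of this nuclei quantification pipeline is sketched below with scikit-image; the hematoxylin channel of the default H&E-DAB stain vectors is used as a stand-in for the nuclear counterstain, and the structuring-element radius and minimum object size are illustrative assumptions rather than the exact parameters used.

```python
import numpy as np
from skimage.color import rgb2hed
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops
from skimage.morphology import binary_closing, binary_opening, disk, remove_small_objects

def nuclei_stats(rgb):
    """Return (number of nuclei, average nucleus area in pixel^2) for one FOV."""
    # Stain deconvolution: take the hematoxylin-like channel carrying the nuclear stain.
    nuclei_channel = rgb2hed(rgb)[..., 0]
    # Otsu's threshold to obtain a binary nuclei mask.
    mask = nuclei_channel > threshold_otsu(nuclei_channel)
    # Morphological cleanup (illustrative structuring element and size cutoff).
    mask = binary_opening(mask, disk(2))
    mask = binary_closing(mask, disk(2))
    mask = remove_small_objects(mask, min_size=50)
    # Connected-component analysis: count nuclei and average their areas.
    regions = regionprops(label(mask))
    if not regions:
        return 0, 0.0
    areas = [r.area for r in regions]
    return len(areas), float(np.mean(areas))
```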
Quantitative evaluation metrics for polarization Congo red virtual staining
To perform the quantitative evaluation of polarization Congo red virtual staining, a paired comparison was conducted on 56 pairs of histochemically and virtually stained images covering the same FOVs used in the brightfield image evaluation. We also used the metrics of MAE, MS-SSIM, PSNR, and FID as defined earlier, with two differences: (1) for MS-SSIM, the weights applied to each scale were (0.45, 0.2, 0.15, 0.1, 0.05, 0.05); (2) for the calculation of FID, to adjust the image brightness, we converted both the histochemically and virtually stained polarization images into the YCbCr color space, multiplied the Y-channel values by 1.5, and then converted the images back to the ordinary RGB color space. Additionally, we developed a segmentation algorithm to isolate the apple-green birefringence regions in polarization Congo red images by empirically setting thresholds on the HSV channels. The Hue channel in the HSV space enabled the direct selection of the green color range for amyloid deposits. Thresholds were determined from sample FOVs and then validated and fine-tuned by a board-certified pathologist to ensure accurate amyloid segmentation. Subsequently, we applied several morphological operations, such as image opening and image closing. As shown in Supplementary Fig. 9b, we used the D-IoU metric to measure the similarity between the segmented birefringence masks of the histochemically and virtually stained polarization Congo red images, defined as follows,
$$D-{IoU}=\frac{{\sum}_{m} {\sum}_{n}{A}_{32}\left(m,\, n\right)\,{B}_{32}\left(m,\, n\right)}{{\sum}_{m} {\sum}_{n}{{{\rm{min}}}}\left\{{A}_{32}\left(m,\, n\right)+{B}_{32}\left(m,\, n\right),\, 1\right\}}$$
(4)
where \({A}_{32}\) and \({B}_{32}\) correspond to the 32\(\times\) down-sampled (bilinear) versions of the segmented birefringence masks for the histochemically and virtually stained polarization Congo red images, respectively, and \(m\) and \(n\) denote the pixel indices. Before computing D-IoU, \({A}_{32}\) and \({B}_{32}\) were binarized with a small threshold for the logical operations. Compared to the traditional per-pixel definition of IoU, this down-sampled D-IoU is better suited to measuring the similarity between binary masks and is resilient to small pixel misalignments57. Finally, for the whole FOV and for the apple-green birefringence regions only, we plotted the color distributions using probability density functions fitted to histograms calculated separately in the Y, Cb, and Cr channels to compare the color similarity between the histochemically and virtually stained polarization Congo red images.
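The HSV-based segmentation and the D-IoU of Eq. (4) can be sketched as follows; the hue/saturation/value thresholds are placeholders (the actual thresholds were set empirically and fine-tuned by a pathologist), and the bilinear down-sampling uses scikit-image.

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.morphology import binary_closing, binary_opening, disk
from skimage.transform import resize

def segment_apple_green(rgb, hue_range=(0.20, 0.45), sat_min=0.25, val_min=0.15):
    """Binary mask of apple-green birefringence via empirical HSV thresholds
    (placeholder values, not the tuned thresholds from the study)."""
    hsv = rgb2hsv(rgb)
    mask = ((hsv[..., 0] >= hue_range[0]) & (hsv[..., 0] <= hue_range[1])
            & (hsv[..., 1] >= sat_min) & (hsv[..., 2] >= val_min))
    mask = binary_opening(mask, disk(2))   # morphological opening
    mask = binary_closing(mask, disk(2))   # morphological closing
    return mask

def d_iou(mask_a, mask_b, factor=32, threshold=0.02):
    """Down-sampled IoU of Eq. (4): bilinearly down-sample both masks by `factor`,
    re-binarize with a small threshold, then take intersection over union."""
    shape = (mask_a.shape[0] // factor, mask_a.shape[1] // factor)
    a = resize(mask_a.astype(float), shape, order=1) > threshold
    b = resize(mask_b.astype(float), shape, order=1) > threshold
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 1.0
```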
Network architecture and training strategy
We utilized a conditioned GAN architecture36,58 to learn the transformation from the 4-channel label-free autofluorescence images (DAPI, FITC, TxRed, and Cy5) into the corresponding microscopic images of Congo red-stained tissue under brightfield microscopy and polarized light microscopy. As shown in Fig. 1b, this conditioned GAN employed a DSM, digitally concatenated with the autofluorescence input images, in which all the matrix elements were set to “1” for brightfield Congo red images and to “−1” for birefringence images. The conditioning of this DSM can be represented as:
$$\widetilde{c}={c}_{1}\,\,{{{\rm{or}}}}\,\,{c}_{-1}$$
(5)
where \(\widetilde{c}\) is concatenated to the input as an additional channel, and \({c}_{1}\) and \({c}_{-1}\) denote matrices with all elements equal to 1 and −1, respectively. Note that for the additional third condition used for yellow-colored amyloid deposits (see Fig. 6a), Eq. 5 is expanded to \(\widetilde{c}={c}_{-1}\,{{{\rm{or}}}}\,{c}_{1}\,{{{\rm{or}}}}\,{c}_{2}\).
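A minimal sketch of this DSM conditioning in PyTorch (the function and variable names are illustrative) simply concatenates a constant channel of +1 or −1 to the 4-channel autofluorescence input:

```python
import torch

def add_dsm(autofluorescence, condition):
    """Concatenate the digital staining matrix (DSM) as an extra input channel.

    autofluorescence : tensor of shape (batch, 4, H, W) with the DAPI/FITC/TxRed/Cy5 channels
    condition        : +1 for brightfield Congo red, -1 for birefringence
                       (and a third value, e.g. 2, for the yellow-amyloid variant)
    """
    b, _, h, w = autofluorescence.shape
    dsm = torch.full((b, 1, h, w), float(condition),
                     dtype=autofluorescence.dtype, device=autofluorescence.device)
    return torch.cat([autofluorescence, dsm], dim=1)

# Example: the same label-free input conditioned for both output modalities.
x = torch.randn(1, 4, 256, 256)
brightfield_input = add_dsm(x, +1)     # shape (1, 5, 256, 256)
birefringence_input = add_dsm(x, -1)   # shape (1, 5, 256, 256)
```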
The GAN framework consists of two deep neural networks, a generator and a discriminator. During training, the generator is tasked with learning a statistical transformation to create virtually stained images. Concurrently, the discriminator learns to distinguish the generated virtually stained images from their histochemically stained counterparts. This competitive training paradigm fosters the simultaneous improvement of both neural networks. In our training, the generator (G) and discriminator (D) networks were optimized to minimize the following loss functions:
$${L}_{{{{\rm{generator}}}}}= \alpha {{{L}}}_{1}\big({I}_{{{{\rm{target}}}}},\,{{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big)\, \circ \,\,\, {{{\rm{R}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big),\, {I}_{{{{\rm{target}}}}}\big)\big) \\ +\beta {{{\rm{BCE}}}}\big({{{\rm{D}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big),\; \widetilde{c}\big),\; 1\big)+\gamma {{{\rm{TV}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big)\big)$$
(6)
$${L}_{{{{\rm{discriminator}}}}}={{{\rm{BCE}}}}\big({{{\rm{D}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big)\big),\; 0\big)+{{{\rm{BCE}}}}\big({{{\rm{D}}}}\big({I}_{{{{\rm{target}}}}}\big),\, 1\big)$$
(7)
where \({{{\rm{G}}}}\left(\cdot \right)\) and \({{{\rm{D}}}}\left(\cdot \right)\) refer to the outputs of the generator and discriminator networks, respectively; \({I}_{{{{\rm{target}}}}}\) denotes the ground truth image; and \({I}_{{{{\rm{input}}}}}\) denotes the input label-free autofluorescence images. The binary cross-entropy (BCE) loss is defined as:
$${{{\rm{BCE}}}}\left(p(a),\,p(b)\right)=-\left(p\left(b\right){{{\mathrm{ln}}}}\left(p\left(a\right)\right)+\left(1-p\left(b\right)\right){{{\mathrm{ln}}}}\left(1-p\left(a\right)\right)\right)$$
(8)
where \(p(a)\) represents the discriminator prediction and \(p(b)\) represents the actual label (0 or 1). The total variation (TV) loss acted as a regularizer term and can be defined as:
$${{{\rm{TV}}}}\left(I\right)={\sum}_{p} {\sum}_{q}\left|\,{I}_{p+1,q}-{I}_{p,q}\,\right|+\left|\,{I}_{p,q+1}-{I}_{p,q}\,\right|$$
(9)
where \(p\) and \(q\) are the pixel indices of the image \(I\). The coefficients \(\left(\alpha,\, \beta,\,\gamma \right)\) in Eq. 6 were empirically set as \(\left({{\mathrm{10,\,1}}},\,0.0001\right)\) based on the validation image set and were fixed for the finalized model evaluation on test samples. In Eq. 6, we also incorporated an additional registration module \(R\) (as shown in Fig. 7a) into our virtual staining network to remove the residual alignment errors between the generated images and their corresponding ground truth images. The registration module was fed with a pair of images: a ‘fixed’ image, which was the ground truth, and a ‘moving’ image, the virtually generated one. This registration module outputs a transformation/displacement matrix detailing pixel displacements needed to align the virtually stained image with the ground truth59. Using the displacement matrix, a grid flow matrix containing new locations of each pixel was generated and applied to the virtually stained images via a resampling operation (denoted as \( \circ \)), which was implemented using PyTorch. During the training, the loss function of the registration module was defined as:
$${L}_{{{{\rm{registration}}}}}= \lambda {{{L}}}_{1}\big({I}_{{{{\rm{target}}}}},\; {{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big) \circ {{{\rm{R}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big),\, {I}_{{{{\rm{target}}}}}\big)\big) \\ +\mu {{{\rm{SMTH}}}}\big({{{\rm{R}}}}\big({{{\rm{G}}}}\big({I}_{{{{\rm{input}}}}},\,\widetilde{c}\big),\, {I}_{{{{\rm{target}}}}}\big)\big)$$
(10)
where the smooth loss60 (SMTH) is defined as:
$${{{\rm{SMTH}}}}\left(T\right)=\frac{1}{{XY}} {\sum}_{x} {\sum}_{y}{\left|{T}_{x+1,y}-{T}_{x,y}\right|}^{2}+{\left|{T}_{x,y+1}-{T}_{x,y}\right|}^{2}$$
(11)
where \(T\) is the displacement matrix, \(X\times Y\) denotes the total number of elements in the matrix, and \(x\) and \(y\) are the element indices. The coefficients \(\left(\lambda,\, \mu \right)\) in Eq. 10 were empirically set as \(\left({{\mathrm{20,\, 10}}}\right)\).
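For illustration, the resampling operation (∘) and the smoothness term of Eq. (11) could be implemented with torch.nn.functional.grid_sample roughly as follows; the assumption that the displacement field is expressed in pixels, as well as the normalization details, are choices of this sketch rather than specifics stated in the text.

```python
import torch
import torch.nn.functional as F

def warp_with_displacement(image, displacement):
    """Resample `image` (B, C, H, W) with a per-pixel displacement field
    `displacement` (B, 2, H, W), assumed here to be given in pixels (dx, dy)."""
    b, _, h, w = image.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, shape (B, H, W, 2).
    theta = torch.eye(2, 3, dtype=image.dtype, device=image.device).repeat(b, 1, 1)
    base_grid = F.affine_grid(theta, size=list(image.shape), align_corners=True)
    # Convert pixel displacements to normalized coordinates and add them to the grid.
    disp = displacement.permute(0, 2, 3, 1)  # (B, H, W, 2)
    disp = torch.stack((2.0 * disp[..., 0] / (w - 1),
                        2.0 * disp[..., 1] / (h - 1)), dim=-1)
    return F.grid_sample(image, base_grid + disp, mode="bilinear",
                         padding_mode="border", align_corners=True)

def smoothness_loss(displacement):
    """Mean squared finite differences of the displacement field, as in Eq. (11)."""
    dx = displacement[:, :, 1:, :] - displacement[:, :, :-1, :]
    dy = displacement[:, :, :, 1:] - displacement[:, :, :, :-1]
    return (dx ** 2).mean() + (dy ** 2).mean()
```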
Fig. 7: Network architecture for virtual birefringence imaging and virtual staining of amyloid deposits in label-free tissue using autofluorescence microscopy and deep learning.
a The overview of the generator, discriminator and registration module. b Detailed architecture and building blocks of the generator. c Detailed architecture and building blocks of the discriminator. d The detailed architecture of the registration module.
Note that this registration module \(R\) is not the same as the registration models mentioned in the previous section: a registration model learns the style-transfer to help with the elastic registration and is solely used in the data preparation stage. Unlike other shift-invariant losses used in earlier work61, the registration module in our work was trained simultaneously with the virtual staining model to learn the misalignments between the target and output images from the virtual staining network. These misalignments include not only static optical aberrations but also complex tissue distortions introduced during the histochemical staining process, which can vary from one tissue FOV to another. These distortions are difficult to characterize with a simple loss function or data augmentation techniques but can be effectively learned with an optimizable network module.
Both the discriminator and registration modules were only used in training; during the testing phase, only the generator network was used with a simple forward process to infer the virtually stained images.
To avoid possible exploding gradients, we used the smooth \({L}_{1}\) loss62 represented as:
$${{{L}}}_{1}\left(A,\,B\right)= \frac{1}{{MN}}\bigg({\sum}_{|A\left(m,\, n\right)-B\left(m,\, n\right)| < \varphi }0.5\frac{{\left(A\left(m,\, n\right)-B\left(m,\, n\right)\right)}^{2}}{\varphi } \\ +{\sum}_{|A\left(m,\, n\right)-B\left(m,\, n\right)| \ge \varphi }|A\left(m,\, n\right)-B\left(m,\, n\right)|-0.5\varphi \bigg)$$
(12)
where \(A,\, {B}\) represent the images in the comparison, \(m\) and \(n\) are the pixel indices, and \(M\times N\) denotes the total number of pixels in each image. \(\varphi\) was empirically set to 1.
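Putting Eqs. (6), (8), (9) and (12) together, the generator objective can be sketched as below; torch.nn.SmoothL1Loss with beta = 1 corresponds to Eq. (12), registered_output denotes the generator output resampled by the registration module, and the use of nn.BCELoss assumes a sigmoid-terminated discriminator.

```python
import torch
import torch.nn as nn

smooth_l1 = nn.SmoothL1Loss(beta=1.0)  # Eq. (12) with phi = 1
bce = nn.BCELoss()                     # Eq. (8), assuming the discriminator outputs probabilities

def total_variation(img):
    """Anisotropic total variation over a (B, C, H, W) image batch, Eq. (9)."""
    return (torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]).sum()
            + torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]).sum())

def generator_loss(registered_output, raw_output, target, d_pred_fake,
                   alpha=10.0, beta=1.0, gamma=1e-4):
    """Weighted sum of Eq. (6): pixel loss on the registered output,
    adversarial BCE against the 'real' label, and TV regularization."""
    adversarial = bce(d_pred_fake, torch.ones_like(d_pred_fake))
    return (alpha * smooth_l1(registered_output, target)
            + beta * adversarial
            + gamma * total_variation(raw_output))
```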
Figure 7b depicts the generator network, modeled on the attention U-Net architecture63. This network is composed of a series of four downsampling blocks and four upsampling blocks. Each downsampling block comprises a three-convolution-layer residual block, followed by a leaky rectified linear unit (Leaky ReLU64) with a slope of 0.1 and a \(2\times 2\) max pooling layer with a stride of 2, which together downsample the feature maps and double the number of channels. The three-convolution-layer residual block65 is formed by three consecutive convolution layers and a convolutional residual path bridging the input and output tensors of the residual block.
The input to each upsampling block is a fused tensor, concatenating the output from the preceding block with the corresponding feature maps at the matched level of the downsampling path, passed through the attention gate connections. The attention gate passes a tensor through three convolution layers and a sigmoid operation, generating an activation weight map that is applied to the tensor to strengthen salient features63. The upsampling blocks bilinearly resize (2×) the concatenated tensors and then use the three-convolution-layer residual block to reduce the number of channels by a factor of four. The final upsampling block was followed by a three-convolution-layer residual block together with another single convolution layer, reducing the number of channels to 3 to match the ground truth images.
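A compact sketch of the three-convolution-layer residual block and the attention gate described above is shown below; the kernel sizes and the intermediate channel count of the gate are illustrative choices, as they are not specified in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock3(nn.Module):
    """Three consecutive convolutions with a convolutional residual path,
    as used in the down/upsampling blocks of the generator (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # residual path bridging input and output
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class AttentionGate(nn.Module):
    """Three convolution layers and a sigmoid producing an activation weight map
    that is applied to the incoming skip-connection features (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(ch, ch // 2, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(ch // 2, ch // 2, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(ch // 2, 1, 1), nn.Sigmoid(),
        )

    def forward(self, skip_features):
        return skip_features * self.gate(skip_features)
```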
The discriminator network, as depicted in Fig. 7c, processes inputs that are either the virtually stained images generated by the network or the actual ground truth images. It begins by transforming the input image into a 64-channel tensor via a single convolutional layer, followed by activation through a Leaky ReLU. The resulting tensor then passes through five successive two-convolution-layer residual blocks, each of which sets the stride of its second convolutional layer to 2 for 2× downsampling while doubling the number of channels. These blocks are followed by a global pooling layer and two dense layers that output the probability of the input being the ground truth (histochemically stained) image.
For the registration module37, as depicted in Fig. 7d, we adopted a U-Net architecture akin to that of the generator, albeit with several key modifications. The network comprises seven pairs of downsampling and corresponding upsampling blocks, each integrated with a residual block. Following the final downsampling stage, a convolution layer doubles the feature channels, succeeded by a series of three consecutive residual blocks. Subsequently, another convolution layer halves the channels, transitioning the processed features to the upsampling block sequence. The output layer of the network uses a single convolution layer to condense the number of channels to 2, corresponding to the two orthogonal components (x and y) of the displacement matrix.
The training dataset contained 386 image patches, each with dimensions of 2048 × 2048 pixels, extracted from 8 distinct patients. During training, the networks received image patches of 256 × 256 pixels, randomly cropped from the larger 2048 × 2048 patches in the dataset. The generator, discriminator and registration modules were all optimized using Adam optimizers66, starting with learning rates of \(2\times {10}^{-5}\), \(2\times {10}^{-6}\) and \(2\times {10}^{-6}\), respectively. A batch size of 32 was maintained throughout the training phase. The generator/discriminator/registration module update frequency was set to 4:1:1. The network converged after ~48 hours of training. Training and testing were performed on standard workstations equipped with GeForce RTX 3090 Ti graphics processing units (GPUs), 256 GB of random-access memory (RAM) and Intel Core i9 central processing units (CPUs). The virtual staining network was implemented using Python version 3.12.0 and PyTorch51 version 1.9.0 with CUDA toolkit version 11.8.
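The optimizer setup and the 4:1:1 update schedule can be summarized as follows; the helper below takes the three modules as arguments (the actual model classes are described in Fig. 7), and the schedule shown is one straightforward interpretation of the stated update frequency.

```python
import torch
from torch import nn

def build_optimizers(generator: nn.Module, discriminator: nn.Module, registration: nn.Module):
    """Adam optimizers with the learning rates stated in the text."""
    return (torch.optim.Adam(generator.parameters(), lr=2e-5),
            torch.optim.Adam(discriminator.parameters(), lr=2e-6),
            torch.optim.Adam(registration.parameters(), lr=2e-6))

def update_discriminator_and_registration(step: int) -> bool:
    """4:1:1 schedule: the generator is updated at every step, while the
    discriminator and registration module are updated every fourth step."""
    return step % 4 == 0
```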
Pathologists’ blind evaluations
Three board-certified pathologists were recruited to blindly evaluate the image quality of histochemically stained and virtually stained cardiac tissue sections. The blind evaluations were conducted in two parts: (1) brightfield Congo red stain image quality on small image patches, scored with M1–M4, which are standard quantification metrics for histology images; and (2) large-patch bundled images containing both the brightfield and birefringence images of the same FOV. In anatomic pathology, the evaluation of Congo red-stained slides under polarized light microscopy yields a binary result: positive or negative for amyloidosis. To enhance the quantification of our virtual polarization model, we incorporated additional metrics: amyloid quantification (M5), the quality of the birefringence appearance (M6), and the consistency between brightfield and polarization images (M7). M7 reflects the expectation that amyloid deposits on Congo red-stained slides typically appear pink-salmon under a brightfield microscope, while the same areas exhibit apple-green birefringence under polarized light microscopy. Instances where pink-salmon areas do not show birefringence, or where apple-green birefringence occurs without the typical brightfield morphology, indicate inconsistency. We selected several representative FOVs from our training dataset to establish baseline scores for evaluation metrics M5 to M7 (see Supplementary Fig. 5). Note that these examples, along with their scores, were presented to the evaluating pathologists prior to their assessments to aid in their calibration and understanding of the scoring system.
For the part 1 evaluation, small patches (\(1024\times 1024\) pixels, each corresponding to ~\(166\times 166\,{{{\rm{\mu }}}}{{{{\rm{m}}}}}^{2}\)) were randomly cropped from the histochemical and virtual images without any overlapping FOVs. Each image was then randomly augmented in one of the following ways: original, left-right flip, top-bottom flip, or rotation by 90, 180 or 270 degrees. A total of 163 images were randomly shuffled and sent to the pathologists without labels. The pathologists scored four metrics evaluating the brightfield Congo red image quality: stain quality of nuclei, stain quality of cytoplasm, stain quality of extracellular space, and stain contrast of congophilic areas. For part 2, we selected 56 large image patches (\(2048\times 2048\) pixels, each corresponding to ~\(332\times 332\,{{{\rm{\mu }}}}{{{{\rm{m}}}}}^{2}\)) and collected the brightfield and birefringence images of the same FOVs. For each image bundle, we collected both histochemically and virtually stained images and randomly augmented the image bundles while ensuring a different augmentation for each, resulting in a total of 112 image bundles to be scored. The pathologists scored three metrics evaluating the birefringence channel: amyloid quantification (% of positive areas / total tissue surface), birefringence image quality, and inconsistency between the brightfield and polarization Congo red images. For part 2, the evaluation of the bundled images, we customized the Napari (0.4.18) viewer67 as a graphical user interface (GUI) for pathologists to conveniently evaluate the high-resolution images. The interface features keyboard bindings for switching between the brightfield and birefringence views, selecting images, and controlling zoom functions for regions of interest using a mouse. All the participating pathologists were given a short tutorial on using the GUI, which included scored example FOVs selected from the training dataset. In both evaluations, any images that pathologists declined to score were excluded from the analysis.
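A minimal example of the kind of Napari-based viewer customization used for the bundled-image review is given below; the key binding, layer names and random placeholder images are illustrative, not the exact interface provided to the pathologists.

```python
import numpy as np
import napari

# Placeholder image pair for one bundle; in practice these would be the
# 2048 x 2048 brightfield and birefringence FOVs loaded from disk.
brightfield = np.random.rand(512, 512, 3).astype(np.float32)
birefringence = np.random.rand(512, 512, 3).astype(np.float32)

viewer = napari.Viewer()
bf_layer = viewer.add_image(brightfield, name="brightfield", rgb=True)
pol_layer = viewer.add_image(birefringence, name="birefringence", rgb=True, visible=False)

@viewer.bind_key("t")  # illustrative key binding for toggling between the two views
def toggle_modality(viewer):
    bf_layer.visible = not bf_layer.visible
    pol_layer.visible = not pol_layer.visible

# napari.run()  # start the event loop when running as a standalone script
```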
Statistics and reproducibility
All the network inferences were deterministic for given input image data except for negligible numerical variations in implementation. For pathologists’ blind evaluations, the sample image patch locations were manually selected to maximize the number of sample patches with variable amounts of amyloid deposits while maintaining acceptable overlap regions. Randomization was done through augmentation as described in the Methods section and the pathologists were blinded at the time of evaluation to whether the sample was histochemically or virtually stained. Patches without a morphological appearance indicative of amyloidosis were not included.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.