It can be observed that the face image enlarged by the nearest-neighbor method suffers from severe block artifacts. Existing advanced IQA methods mistakenly interpret noise-related high-frequency components as textures, underscoring a critical limitation in their ability to distinguish useful high-frequency information in images. To overcome this drawback, FRIEREN is built on two main observations about the human visual system (HVS):
1. Edge-related high-frequency components are the most important factor in how humans evaluate image sharpness.
2. Under HVS perception, the detail quality of facial images enlarged with different interpolation methods ranks, from best to worst, as Lanczos, bicubic, bilinear, and nearest-neighbor.
To the best of our knowledge, our work is the first to address interpolation effects on facial image quality. The detail quality degradation induced by interpolation can be quantified effectively using our proposed motion noise, spatial noise, and HVS-based sharpness measures.
Predicting image quality requires a regression model that maps features to quality scores so that the predictions align with human visual judgment. FRIEREN employs Kolmogorov-Arnold Networks (KANs) for this regression. KANs, based on the Kolmogorov-Arnold representation theorem, differ from MLPs in that they replace fixed node activations and linear weights with learnable univariate functions, which lets them capture non-linear relationships effectively and makes them well suited to modeling human perception. We use the Adam (adaptive moment estimation) optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to update the model parameters. To train the quality prediction model, 64% of the images are randomly selected for training, 16% for validation, and the remaining 20% for testing. The proposed image features in FRIEREN, namely motion noise, spatial noise, and the HVS-based sharpness measure, are introduced in the following sections.
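The sketch below illustrates only the 64/16/20 split and the Adam settings described above; it is not the FRIEREN implementation. A small MLP stands in for the KAN so the snippet stays self-contained (any KAN implementation with the same 4-in/1-out interface, e.g. the open-source pykan package, could replace it), and the feature matrix `X` and pseudo-MOS targets `y` are assumed to be precomputed.

```python
import torch
from torch import nn
from sklearn.model_selection import train_test_split

def split_64_16_20(X, y, seed=0):
    # 64% train / 16% validation / 20% test, as described above.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.20, random_state=seed)  # 0.20 * 0.80 = 0.16
    return (X_tr, y_tr), (X_val, y_val), (X_test, y_test)

def train_regressor(model, train, val, epochs=200, lr=1e-3):
    # Adam with beta1 = 0.9 and beta2 = 0.999, as stated in the text.
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    loss_fn = nn.MSELoss()
    to_t = lambda a: torch.as_tensor(a, dtype=torch.float32)
    X_tr, y_tr = to_t(train[0]), to_t(train[1]).reshape(-1, 1)
    X_va, y_va = to_t(val[0]), to_t(val[1]).reshape(-1, 1)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model(X_tr), y_tr)
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_va), y_va).item()
        if val_loss < best_loss:  # keep the best-validated weights
            best_loss = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model

# Placeholder regressor: 4 input features (motion noise, spatial noise,
# HVS-based sharpness, CLIB-FIQA score) -> 1 quality score.
regressor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
# Usage: train, val, test = split_64_16_20(X, y); model = train_regressor(regressor, train, val)
```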
High-level temporal noise disrupts the visual understanding of multimedia content, and it cannot be completely eliminated due to sensor limitations and environmental conditions. Hence, determining the effect of temporal noise is essential to quantify the facial details that a camera can reproduce in images. To make FRIEREN a no-reference method, motion noise is introduced to estimate the temporal noise influence from a single image frame. Frame modification is applied for motion noise estimation. Denote $I$ as the original image frame. By discarding the last row and column of pixels of $I$, a new image frame $I'$ with slightly different content is created. Both $I$ and $I'$ are then enlarged to the same dimensions, simulating the motion variations typically seen in a video sequence. Let $D$ denote the absolute pixel-wise difference between the two enlarged frames. The motion noise of the image frame, $\sigma_m$, is calculated as:
$$ \sigma_m = \sqrt{ \frac{1}{M \cdot N} \sum_{i=1}^{M} \sum_{j=1}^{N} (D_{ij} - \mu_D)^2 } $$
where $\mu_D$ represents the average of the frame difference and $M$ and $N$ are the height and width of $D$, respectively.
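A minimal NumPy/OpenCV sketch of this single-frame estimate follows. The enlargement factor and the bilinear flag used to upscale $I$ and $I'$ are illustrative assumptions; the text only requires that both frames be enlarged to the same dimensions before taking the difference.

```python
import cv2
import numpy as np

def motion_noise(frame, scale=2.0):
    # Work on the luminance channel if a color frame is given.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame
    shifted = gray[:-1, :-1]                      # I': drop the last row and column of I
    h, w = gray.shape
    target = (int(w * scale), int(h * scale))     # cv2.resize expects (width, height)
    big = cv2.resize(gray, target, interpolation=cv2.INTER_LINEAR).astype(np.float64)
    big_shifted = cv2.resize(shifted, target, interpolation=cv2.INTER_LINEAR).astype(np.float64)
    D = np.abs(big - big_shifted)                 # absolute pixel-wise difference
    return float(np.std(D))                       # sigma_m: standard deviation of D around mu_D
```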
Figure: Spatial noise estimation framework in FRIEREN. By excluding the edge-related components in the high-frequency layers, the DWT-based spatial noise estimation enables FRIEREN to quantify the noise level in an image effectively and accurately.
The 2D Discrete Wavelet Transform (DWT) decomposes an image into LL, LH, HL, and HH sub-bands that capture different levels of detail. High-frequency information, including both noise and edges, is contained in the LH, HL, and HH sub-bands, so effective noise estimation depends on separating noise from the other high-frequency components. By removing edge-related coefficients from the LH, HL, and HH layers, the noisy components can be isolated. Denote $LL_i$, $LH_i$, $HL_i$, and $HH_i$ as the frequency layers obtained at decomposition level $i$ of the 2D DWT. Let $EM$ be the edge detection result on the $LL_i$ layer and $IEM$ the inverted edge mask. The noise energy map $NEM$ is expressed as
$$IEM = 1 - EM$$

$$NEM = \sqrt{a \cdot LH_i^2 + b \cdot HL_i^2 + c \cdot HH_i^2} \cdot IEM$$
where $a$, $b$, and $c$ are weighting parameters for regulating the influence of $LH_i$, $HL_i$, and $HH_i$ coefficients, respectively.
When an image suffers from severe noise, the noise dominates the edges, causing the noise influence to be underestimated. Our approach addresses this problem by recognizing that different noise energy distributions require different quantiles to identify the most representative energy values for estimation. In real-world scenarios such as video conferencing, image enlargement is commonly applied when a face appears in a frame to make the subject more visually prominent. Since faces are relatively small in the scene before enlargement, they inherently contain a limited amount of visual detail. Higher decomposition levels can reveal subtle features that might be missed at lower levels, improving noise estimation on small face images. We therefore use a decomposition level of 3, and the edge mask $EM$ is obtained by applying the Sobel operator to the $LL_3$ layer.
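The sketch below illustrates this spatial noise estimation under stated assumptions: the Haar wavelet, equal sub-band weights $a = b = c = 1$, the edge threshold, and the fixed quantile are placeholders, while the three-level decomposition and the Sobel edge mask on $LL_3$ follow the description above.

```python
import cv2
import numpy as np
import pywt

def spatial_noise(gray, level=3, a=1.0, b=1.0, c=1.0, edge_thresh=0.1, quantile=0.5):
    # Three-level 2D DWT; coeffs[0] is LL_3, coeffs[1] holds the level-3
    # detail sub-bands (the LH_3, HL_3, HH_3 layers referred to above).
    coeffs = pywt.wavedec2(np.float32(gray), "haar", level=level)
    LL = np.float32(coeffs[0])
    LH, HL, HH = (np.float32(d) for d in coeffs[1])

    # Edge mask EM from the Sobel gradient magnitude of LL_3,
    # thresholded relative to the strongest edge response.
    gx = cv2.Sobel(LL, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(LL, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    EM = (mag > edge_thresh * mag.max()).astype(np.float32)
    IEM = 1.0 - EM  # inverted edge mask

    # Noise energy map with edge-related coefficients suppressed.
    NEM = np.sqrt(a * LH ** 2 + b * HL ** 2 + c * HH ** 2) * IEM

    # FRIEREN selects the most representative quantile adaptively; a fixed
    # quantile over the non-edge locations is used here as a placeholder.
    values = NEM[IEM > 0]
    return float(np.quantile(values, quantile)) if values.size else 0.0
```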
Sharpness determines how well defined the textures and edges of objects appear in images. Inspired by the spatial noise estimation above, our proposed sharpness measure integrates spatial-domain and transform-domain methods. We apply edge detection to the image's low-frequency sub-band to capture the edge-related high-frequency components that carry the critical sharpness information, and three levels of DWT edge-related high-frequency sub-bands are then incorporated into the sharpness calculation. By combining spatial- and transform-domain information, the measure minimizes the impact of noise while emphasizing edge features when computing image clarity.
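Since the exact weighting of the sharpness measure is not given here, the following sketch only illustrates the described combination: a Sobel edge mask computed on the low-frequency sub-band is used to retain the edge-related high-frequency energy over three DWT levels. The wavelet choice, edge threshold, and the simple averaging across levels are assumptions.

```python
import cv2
import numpy as np
import pywt

def hvs_sharpness(gray, levels=3, edge_thresh=0.1):
    # coeffs[0] is the coarsest low-frequency sub-band; coeffs[1:] are the
    # detail tuples from the coarsest (level 3) down to the finest (level 1).
    coeffs = pywt.wavedec2(np.float32(gray), "haar", level=levels)
    LL = np.float32(coeffs[0])

    # Edge mask from the Sobel gradient magnitude of the low-frequency sub-band.
    gx = cv2.Sobel(LL, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(LL, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    EM = (mag > edge_thresh * mag.max()).astype(np.float32)

    score = 0.0
    for LH, HL, HH in coeffs[1:]:
        # Match the edge mask to the current detail-band resolution and keep
        # only the edge-related high-frequency energy at this level.
        mask = cv2.resize(EM, (LH.shape[1], LH.shape[0]), interpolation=cv2.INTER_NEAREST)
        energy = np.sqrt(LH ** 2 + HL ** 2 + HH ** 2) * mask
        score += float(energy.mean())
    return score / levels
```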
We utilize two face image datasets in our experiments: our collected mannequin face images and the MS1MV2 dataset. Our dataset validates quality degradation due to interpolation, while MS1MV2 demonstrates FRIEREN’s effectiveness compared to other IQA methods.
Mannequin dataset: Three realistic mannequins are used in part of our experiments. To prevent any impact from post-processing on facial details, all images are captured in RAW format using a Sony IMX383-AAQK image sensor.
Enlarged images: In our mannequin dataset, gamma correction is applied to generate additional face images, and all images are enlarged using nearest-neighbor, bilinear, bicubic, and Lanczos interpolation at scales from 2× to 5× in steps of 0.5. For MS1MV2, 10,000 randomly selected images are upscaled by 2× using the same four methods.
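The enlarged images can be generated with standard OpenCV resizing, as sketched below. The interpolation flags correspond to the four methods above; the gamma value used for the additional mannequin images is an illustrative assumption.

```python
import cv2
import numpy as np

# OpenCV flags for the four interpolation methods used in the paper.
INTERPOLATIONS = {
    "nearest": cv2.INTER_NEAREST,
    "bilinear": cv2.INTER_LINEAR,
    "bicubic": cv2.INTER_CUBIC,
    "lanczos": cv2.INTER_LANCZOS4,
}
# Mannequin-dataset protocol: scales from 2x to 5x in steps of 0.5.
SCALES = np.arange(2.0, 5.5, 0.5)

def gamma_correct(img, gamma=2.2):
    # Gamma value is an illustrative assumption; applied via an 8-bit lookup table.
    lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255 for i in range(256)], dtype=np.uint8)
    return cv2.LUT(img, lut)

def enlarge(img, scale, method):
    h, w = img.shape[:2]
    return cv2.resize(img, (int(w * scale), int(h * scale)),
                      interpolation=INTERPOLATIONS[method])

# Usage: for s in SCALES: enlarged = enlarge(gamma_correct(img), s, "lanczos")
```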
Currently, there are no mean opinion scores (MOS) from human subjective judgment for face images enlarged using different interpolation methods. Since adopting full-reference image quality assessments (FR-IQAs) to generate credible pseudo-MOS has proven highly effective, we adopt PSNR as our target metric. Suppose that an original face image is scaled up by a factor of $\beta$ using each of the four interpolation methods. To produce the PSNR value, we first resize the original face image to $1/\beta$ of its size. Next, we enlarge it by a factor of $\beta$, restoring it to the original dimensions. Finally, the PSNR value is calculated between the original image and the interpolated image.
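A minimal sketch of this pseudo-reference PSNR computation follows, assuming 8-bit images. Using the same interpolation flag for both the downscale and the upscale is an assumption; the text only specifies the resize-down-then-up procedure.

```python
import cv2
import numpy as np

def psnr(ref, test):
    # Peak of 255 assumes 8-bit images.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def interpolation_psnr(original, beta, flag):
    # Resize down to 1/beta of the original size, enlarge back by beta with the
    # interpolation under test, then compare against the original image.
    h, w = original.shape[:2]
    small = cv2.resize(original, (max(1, int(w / beta)), max(1, int(h / beta))),
                       interpolation=flag)
    restored = cv2.resize(small, (w, h), interpolation=flag)
    return psnr(original, restored)
```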
To obtain target MOS values for each interpolation method, we scale the MOS values in proportion to the corresponding PSNR values; that is, the ratio between MOS values mirrors the ratio of their PSNRs. We use CLIB-FIQA, a state-of-the-art face image quality assessment method, to evaluate the face images enlarged with Lanczos interpolation, and these scores serve as the reference MOS values. The target MOS values of every image in both our dataset and the MS1MV2 dataset are then derived for the subsequent regression.
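Given the CLIB-FIQA reference score of the Lanczos-enlarged image and the per-method PSNR values, the proportional scaling reduces to a one-liner; a sketch with hypothetical method keys:

```python
def target_mos(mos_lanczos, psnr_by_method):
    # Scale the Lanczos reference MOS to the other interpolations in proportion
    # to their PSNR values: MOS_m / MOS_lanczos = PSNR_m / PSNR_lanczos.
    ref = psnr_by_method["lanczos"]
    return {m: mos_lanczos * p / ref for m, p in psnr_by_method.items()}
```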
We hypothesize that the negative influence of nearest-neighbor interpolation can be captured by the motion and spatial noise estimates. In our dataset, temporal noise is calculated using all preceding frames; for the MS1MV2 dataset, frame modification is applied for motion noise estimation. The mean estimates of $\sigma_m$ and $\sigma_s$ for each type of enlarged face image in our dataset and in the MS1MV2 dataset are listed in Table I and Table II, respectively. The highest results are shown in bold. The results support our hypothesis regarding the side effects of nearest-neighbor interpolation on face image quality.
TABLE I: Average Temporal And Spatial Noise Estimations For Each Type Of Enlarged Image In Our Mannequin Dataset
| | Nearest | Bilinear | Bicubic | Lanczos |
|---|---|---|---|---|
| Temporal Noise | **0.6178** | 0.4880 | 0.5150 | 0.5141 |
| Spatial Noise | **18.4668** | 14.7780 | 17.6856 | 17.7233 |
TABLE II: Average Motion And Spatial Noise Estimations For Each Type Of Enlarged Image In The MS1MV2 Dataset
| | Nearest | Bilinear | Bicubic | Lanczos |
|---|---|---|---|---|
| Motion Noise | **4.3453** | 3.2759 | 3.5767 | 3.5807 |
| Spatial Noise | **25.8520** | 24.1216 | 25.1551 | 25.0978 |
Table III and Table IV list the average PSNR values for each type of enlarged face image in our dataset and in the MS1MV2 dataset, respectively. The highest values are shown in bold. The results also support the subjective human perception of detail quality in facial images after different interpolations: Lanczos, bicubic, bilinear, and nearest-neighbor (from best to worst).
TABLE III: Average PSNR Results For Each Type Of Enlarged Face Images In Our Mannequin Dataset
| | Nearest | Bilinear | Bicubic | Lanczos |
|---|---|---|---|---|
| Mean PSNR | 28.6731 | 29.1761 | 29.3089 | **29.3140** |
TABLE IV: Average PSNR Results For Each Type Of Enlarged Face Images In The MS1MV2 Dataset
| | Nearest | Bilinear | Bicubic | Lanczos |
|---|---|---|---|---|
| Mean PSNR | 29.9370 | 31.7721 | 33.5590 | **33.5791** |
In addition to the three proposed image features, the CLIB-FIQA score of the original, unenlarged image is also used in the regression. Quality prediction performance is assessed using the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC). To ensure FRIEREN's robustness, the train-validation-test split is repeated randomly 10 times, and the median PLCC and SROCC values are reported as the final results. The prediction performance on the MS1MV2 dataset for FRIEREN and existing state-of-the-art no-reference IQA methods, namely MGVG (2017), CDV (2024), and CLIB-FIQA (2024), is reported in Table V, and the average predicted quality scores of all methods are listed in Table VI. The highest values are shown in bold. It is worth noting that CLIB-FIQA is specifically designed for the quality evaluation of face images and is trained on the MS1MV2 dataset. FRIEREN's quality prediction best reflects the trend of the target MOS values across the enlarged face images.
TABLE V: Performance Comparison of Quality Prediction on Testing Face Images in the MS1MV2 Dataset
| | MGVG | CDV | CLIB-FIQA | FRIEREN |
|---|---|---|---|---|
| PLCC | -0.6411 | -0.6860 | 0.7816 | **0.8954** |
| SROCC | -0.6905 | -0.6434 | 0.7327 | **0.8723** |
TABLE VI: Average Quality Scores of Different IQA Methods in the MS1MV2 Dataset. HVS-preferred Quality Ranking: Lanczos > Bicubic > Bilinear > Nearest-neighbor.
| | MGVG | CDV | CLIB-FIQA | FRIEREN |
|---|---|---|---|---|
| Nearest | **56.2486** | **32.4979** | **0.7511** | 0.6176 |
| Bilinear | 28.9318 | 16.5068 | 0.7361 | 0.6615 |
| Bicubic | 33.4192 | 19.1522 | 0.7443 | 0.6621 |
| Lanczos | 31.9301 | 18.2963 | 0.7441 | **0.6629** |
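For reference, the PLCC and SROCC values reported in Table V can be computed from the predicted scores and the target MOS values of the test images with SciPy; a minimal sketch:

```python
from scipy.stats import pearsonr, spearmanr

def plcc_srocc(predicted, target):
    # Pearson linear and Spearman rank-order correlation coefficients
    # between predicted quality scores and target MOS values.
    plcc, _ = pearsonr(predicted, target)
    srocc, _ = spearmanr(predicted, target)
    return float(plcc), float(srocc)
```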
@inproceedings{frieren2025,
title = {FRIEREN: A Lightweight System for Face Resizing Image Detail Quality Evaluation via Robust Estimation of Image Naturalness},
author = {Yuan-Kang Lee and Kuan-Lin Chen and Jian-Jiun Ding},
booktitle = {IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)},
year = {2025},
pages = {to appear},
url = {https://ntuneillee.github.io/research/friereniqa/}
}