Java app here using the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configuration:
String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);
This works beautifully and my Dataset<Row> is instantiated without any issues. But the minute I tell it to read all rows in columns A through D, it reads an empty Dataset<Row>:
// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")
This also happens if I set the sheetName and dataAddress separately:
// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")
And it also happens when, instead of providing the sheetName, I provide a sheetIndex:
// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")
My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?
FWIW, I rolled up my sleeves and it looks like this is a limitation in Apache POI. The spark-excel library uses POI under the hood to read Excel files, and specifically it uses a POI AreaReference when parsing the cells for a data address. AreaReference can only handle specific cell coordinates (e.g. "A2"); it cannot handle whole-column references (e.g. "A"):
package com.example;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.AreaReference;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class PoiExcelDebugger {

    public static void main(String[] args) {
        String filePath = "/Users/<myuser>/path/to/price-sheet-v1-simple.xlsx";
        String sheetName = "Price Sheet";
        // String dataAddress = "'Price Sheet'!A:D";
        String dataAddress = "'Price Sheet'!A2:D7";

        try (FileInputStream fis = new FileInputStream(new File(filePath));
             Workbook workbook = new XSSFWorkbook(fis)) {

            Sheet sheet = workbook.getSheet(sheetName);
            if (sheet == null) {
                System.err.println("Sheet '" + sheetName + "' not found!");
                return;
            }

            AreaReference areaRef = new AreaReference(dataAddress,
                    workbook.getSpreadsheetVersion());
            int minColIndex = areaRef.getFirstCell().getCol();
            int maxColIndex = areaRef.getLastCell().getCol();
            int minRowIndex = areaRef.getFirstCell().getRow();
            int maxRowIndex = areaRef.getLastCell().getRow();

            System.out.println("Parsed Column Range: " + minColIndex + " to " + maxColIndex);
            System.out.println("Parsed Row Range: " + minRowIndex + " to " + maxRowIndex);

            for (Row row : sheet) {
                int rowIndex = row.getRowNum();
                // Skip rows outside the target range
                if (rowIndex < minRowIndex || rowIndex > maxRowIndex) {
                    continue;
                }
                System.out.print("Row " + rowIndex + ": ");
                // Read all columns within A:D (indices 0 to 3)
                for (int colIndex = minColIndex; colIndex <= maxColIndex; colIndex++) {
                    Cell cell = row.getCell(colIndex, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    System.out.print(getCellValueAsString(cell) + " | ");
                }
                System.out.println();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String getCellValueAsString(Cell cell) {
        if (cell == null) return "NULL";
        return switch (cell.getCellType()) {
            case STRING -> cell.getStringCellValue();
            case NUMERIC -> String.valueOf(cell.getNumericCellValue());
            case BOOLEAN -> String.valueOf(cell.getBooleanCellValue());
            case FORMULA -> cell.getCellFormula();
            case BLANK -> "BLANK";
            default -> "UNKNOWN";
        };
    }
}
Above, everything works perfectly fine. But if we swap the commented lines and change:
// String dataAddress = "'Price Sheet'!A:D";
String dataAddress = "'Price Sheet'!A2:D7";
to:
String dataAddress = "'Price Sheet'!A:D";
// String dataAddress = "'Price Sheet'!A2:D7";
then it won't throw an exception, but it won't read anything either.
Hence: I guess this is "expected" POI behavior, albeit a very disappointing one.
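One possible workaround sketch, assuming you can tolerate a quick extra POI pass over the file before handing it to Spark: compute the sheet's last populated row yourself and build an explicit dataAddress from it, since explicit coordinates are the one thing AreaReference does handle. DataAddressHelper and buildDataAddress below are just illustrative names I made up, not part of spark-excel or POI.

package com.example;

import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;

public class DataAddressHelper {

    // Hypothetical helper: builds an explicit address such as "'ExampleData'!A2:D42"
    // by asking POI for the last row containing data, because AreaReference
    // cannot parse a whole-column reference like "A:D".
    public static String buildDataAddress(String xlsxPath, String sheetName,
                                          String firstCol, String lastCol,
                                          int firstRow) throws IOException {
        try (FileInputStream fis = new FileInputStream(xlsxPath);
             Workbook workbook = new XSSFWorkbook(fis)) {
            Sheet sheet = workbook.getSheet(sheetName);
            // getLastRowNum() is 0-based; Excel addresses are 1-based
            int lastRow = sheet.getLastRowNum() + 1;
            return "'" + sheetName + "'!" + firstCol + firstRow + ":" + lastCol + lastRow;
        }
    }
}

Used with the reader from the question, it would look roughly like this (path and sheet name assumed from the question):

String dataAddress = DataAddressHelper.buildDataAddress(
        "/Users/myuser/example-data.xlsx", "ExampleData", "A", "D", 2);
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", dataAddress)
        .load("file:///Users/myuser/example-data.xlsx");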