I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):
=HYPERLINK("/doc", "ref to doc")
but these are rare. Most of them looks like this:
=HYPERLINK(R246,"ref to doc")
or
=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))
Unfortunately. cell.getHyperlink()
always returns null. I can extract the formula text from the cell, with cell.getCellFormula()
, so I decided to get the hyperlink manually from the formula. I created a regular expression for it:
Pattern pattern = Patternpile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();
This results in exception java.lang.IllegalStateException: No match found.
Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?
I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):
=HYPERLINK("https://web.address/doc", "ref to doc")
but these are rare. Most of them looks like this:
=HYPERLINK(R246,"ref to doc")
or
=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))
Unfortunately. cell.getHyperlink()
always returns null. I can extract the formula text from the cell, with cell.getCellFormula()
, so I decided to get the hyperlink manually from the formula. I created a regular expression for it:
Pattern pattern = Patternpile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();
This results in exception java.lang.IllegalStateException: No match found.
Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?
Share Improve this question edited Feb 2 at 13:14 Mark Rotteveel 109k229 gold badges156 silver badges220 bronze badges asked Feb 1 at 2:24 Gary GreenbergGary Greenberg 1,1528 silver badges15 bronze badges 5 |1 Answer
Reset to default 1Parsing Excel formulas using regular expression is not what I would suggest. Even if you get it work, you will get either "\"https://web.address/doc\""
or "R246"
or "U257"
for your examples. The first result is usable. But what to do with the second and the third result then?
Fortunately Apache POI provides a FormulaParser. To parse formulas I always suggest using that.
Complete example:
import java.io.FileInputStream;
import .apache.poi.ss.formula.*;
import .apache.poi.ss.formula.ptg.*;
import .apache.poi.ss.usermodel.*;
import .apache.poi.xssf.usermodel.*;
import .apache.poi.ss.util.CellAddress;
public class ExcelHyperlinkDestinationFromFormula {
private static String getHyperlinkFormulaDestination(XSSFSheet sheet, String hyperlinkFormula) {
String result = "HyperlinkFormulaDestination not found";
XSSFEvaluationWorkbook workbookWrapper = XSSFEvaluationWorkbook.create((XSSFWorkbook) sheet.getWorkbook());
Ptg[] ptgs = FormulaParser.parse(hyperlinkFormula, workbookWrapper, FormulaType.CELL, sheet.getWorkbook().getSheetIndex(sheet));
for (int i = 0; i < ptgs.length; i++) {
//System.out.println(ptgs[i]);
if (ptgs[i] instanceof FuncVarPtg) {
FuncVarPtg funcVarPtg = (FuncVarPtg)ptgs[i];
//System.out.println(funcVarPtg);
if ("HYPERLINK".equals(funcVarPtg.getName())) {
int firstOperand = funcVarPtg.getNumberOfOperands();
Ptg firstOperandPtg = ptgs[i-firstOperand]; //Reverse Polish notation: first operand is number of operands before funcVarPtg
//System.out.println(firstOperandPtg);
if (firstOperandPtg instanceof StringPtg) {
result = ((StringPtg)firstOperandPtg).getValue();
} else if (firstOperandPtg instanceof RefPtg) {
Cell cell = sheet.getRow(((RefPtg)firstOperandPtg).getRow()).getCell(((RefPtg)firstOperandPtg).getColumn());
DataFormatter dataFormatter = new DataFormatter();
dataFormatter.setUseCachedValuesForFormulaCells(true);
result = dataFormatter.formatCellValue(cell);
} else if (firstOperandPtg instanceof FuncVarPtg) {
result = "HyperlinkFormulaDestination behind FuncVarPtg not parsed"; //ToDo
} else if (firstOperandPtg instanceof NamePtg) {
result = "HyperlinkFormulaDestination behind NamePtg not parsed"; //ToDo
}
}
}
}
return result;
}
public static void main(String[] args) throws Exception {
XSSFWorkbook workbook = (XSSFWorkbook)WorkbookFactory.create(new FileInputStream("./test.xlsx"));
XSSFSheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
for (Cell cell : row) {
if (cell.getCellType() == CellType.FORMULA) {
CellAddress source = cell.getAddress();
String formula = cell.getCellFormula();
if (formula.toLowerCase().contains("hyperlink(")) {
System.out.println(source + "=" + formula);
String hyperlinkFormulaDestination = getHyperlinkFormulaDestination(sheet, formula);
System.out.println(hyperlinkFormulaDestination);
}
}
}
}
workbook.close();
}
}
The code should be able getting the hyperlink formula destination of your examples. ToDos are to extend the code so it is able parsing functions and/or names being the first operand in HYPERLINK
formula too.
.apache.poi.ss.formula.functions.Hyperlink
from the cell (not the.apache.poi.ss.usermodel.XSSFHyperlink
, as you are trying). The value in the cell=HYPERLINK("https://web.address/doc", "ref to doc")
represents an Excel function. – prasad_ Commented Feb 1 at 7:54matcher.matches()
is false as the cellFormula does not end with,
. Instead usematcher.find()
thenmatcher.group()
– DuncG Commented Feb 1 at 11:30matcher.group()
is whatfind()
matched, you would need to look at the capture group 1 of your regexmatcher.group(1)
. You still have a problem that this could be an expression (such as another cell reference R246) or string argument"https://web.address/doc"
so further processing is required. – DuncG Commented Feb 2 at 12:33