最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

java - Getting hyperlink from Excel spreadsheet with Apache POI - Stack Overflow

programmeradmin1浏览0评论

I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):

=HYPERLINK("/doc", "ref to doc")

but these are rare. Most of them looks like this:

=HYPERLINK(R246,"ref to doc")

or

=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))

Unfortunately. cell.getHyperlink() always returns null. I can extract the formula text from the cell, with cell.getCellFormula(), so I decided to get the hyperlink manually from the formula. I created a regular expression for it:

Pattern pattern = Patternpile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();

This results in exception java.lang.IllegalStateException: No match found.

Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?

I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):

=HYPERLINK("https://web.address/doc", "ref to doc")

but these are rare. Most of them looks like this:

=HYPERLINK(R246,"ref to doc")

or

=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))

Unfortunately. cell.getHyperlink() always returns null. I can extract the formula text from the cell, with cell.getCellFormula(), so I decided to get the hyperlink manually from the formula. I created a regular expression for it:

Pattern pattern = Patternpile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();

This results in exception java.lang.IllegalStateException: No match found.

Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?

Share Improve this question edited Feb 2 at 13:14 Mark Rotteveel 109k229 gold badges156 silver badges220 bronze badges asked Feb 1 at 2:24 Gary GreenbergGary Greenberg 1,1528 silver badges15 bronze badges 5
  • I think you need to extract the .apache.poi.ss.formula.functions.Hyperlink from the cell (not the .apache.poi.ss.usermodel.XSSFHyperlink, as you are trying). The value in the cell =HYPERLINK("https://web.address/doc", "ref to doc") represents an Excel function. – prasad_ Commented Feb 1 at 7:54
  • 1 In your case matcher.matches() is false as the cellFormula does not end with ,. Instead use matcher.find() then matcher.group() – DuncG Commented Feb 1 at 11:30
  • You are right, I have missed the call to find() function. However, when I did add this function, I do not get an exception, but instead of the URL or cell reference with URL, I am getting string with "HYPERLINK("http:://...". I did follow sample from stackoverflow/questions/16597303/… but still something is wrong with my regex. I can do a substring to extract the URL, but if you can point out where the error in my regex, I'll appreciate. – Gary Greenberg Commented Feb 2 at 1:49
  • I will try .apache.poi.ss.formula.functions.Hyperlink function, but it won't work when hyperlink is within =IF() function anyway. So I will need to use regex anyway. – Gary Greenberg Commented Feb 2 at 1:50
  • @GaryGreenberg matcher.group() is what find() matched, you would need to look at the capture group 1 of your regex matcher.group(1). You still have a problem that this could be an expression (such as another cell reference R246) or string argument "https://web.address/doc" so further processing is required. – DuncG Commented Feb 2 at 12:33
Add a comment  | 

1 Answer 1

Reset to default 1

Parsing Excel formulas using regular expression is not what I would suggest. Even if you get it work, you will get either "\"https://web.address/doc\"" or "R246" or "U257" for your examples. The first result is usable. But what to do with the second and the third result then?

Fortunately Apache POI provides a FormulaParser. To parse formulas I always suggest using that.

Complete example:

import java.io.FileInputStream;

import .apache.poi.ss.formula.*;
import .apache.poi.ss.formula.ptg.*;
import .apache.poi.ss.usermodel.*;
import .apache.poi.xssf.usermodel.*;
import .apache.poi.ss.util.CellAddress;

public class ExcelHyperlinkDestinationFromFormula {

 private static String getHyperlinkFormulaDestination(XSSFSheet sheet, String hyperlinkFormula) {
  String result = "HyperlinkFormulaDestination not found";  
  XSSFEvaluationWorkbook workbookWrapper = XSSFEvaluationWorkbook.create((XSSFWorkbook) sheet.getWorkbook());
  Ptg[] ptgs = FormulaParser.parse(hyperlinkFormula, workbookWrapper, FormulaType.CELL, sheet.getWorkbook().getSheetIndex(sheet));
  for (int i = 0; i < ptgs.length; i++) {
   //System.out.println(ptgs[i]);
   if (ptgs[i] instanceof FuncVarPtg) {
    FuncVarPtg funcVarPtg = (FuncVarPtg)ptgs[i];
    //System.out.println(funcVarPtg);
    if ("HYPERLINK".equals(funcVarPtg.getName())) {
     int firstOperand = funcVarPtg.getNumberOfOperands(); 
     Ptg firstOperandPtg = ptgs[i-firstOperand]; //Reverse Polish notation: first operand is number of operands before funcVarPtg
     //System.out.println(firstOperandPtg);
     if (firstOperandPtg instanceof StringPtg) {
      result = ((StringPtg)firstOperandPtg).getValue();
     } else if (firstOperandPtg instanceof RefPtg) {
      Cell cell = sheet.getRow(((RefPtg)firstOperandPtg).getRow()).getCell(((RefPtg)firstOperandPtg).getColumn());
      DataFormatter dataFormatter = new DataFormatter();
      dataFormatter.setUseCachedValuesForFormulaCells(true);
      result = dataFormatter.formatCellValue(cell);
     } else if (firstOperandPtg instanceof FuncVarPtg) {
      result = "HyperlinkFormulaDestination behind FuncVarPtg not parsed"; //ToDo
     } else if (firstOperandPtg instanceof NamePtg) {
      result = "HyperlinkFormulaDestination behind NamePtg not parsed"; //ToDo       
     }
    }
   }
  }
  return result;
 }

 public static void main(String[] args) throws Exception {

  XSSFWorkbook workbook = (XSSFWorkbook)WorkbookFactory.create(new FileInputStream("./test.xlsx"));  
  XSSFSheet sheet = workbook.getSheetAt(0);
  for (Row row : sheet) {
   for (Cell cell : row) {
    if (cell.getCellType() == CellType.FORMULA) {
     CellAddress source = cell.getAddress();
     String formula = cell.getCellFormula();
     if (formula.toLowerCase().contains("hyperlink(")) {
      System.out.println(source + "=" + formula);
      String hyperlinkFormulaDestination = getHyperlinkFormulaDestination(sheet, formula);
      System.out.println(hyperlinkFormulaDestination);
     }   
    }
   }
  }

  workbook.close();
 }
}

The code should be able getting the hyperlink formula destination of your examples. ToDos are to extend the code so it is able parsing functions and/or names being the first operand in HYPERLINK formula too.

发布评论

评论列表(0)

  1. 暂无评论