Gemini writes Apps Script to create doc review checklist

We have a Google Drive folder with documents needing review. It’s a little hard keeping track of our progress so I got some help from Gemini in creating a checklist. It worked out pretty well. I’m also happy to have a starting point for learning more about some features of Apps Script that I haven’t used much before — the DriveApp API — and a few details on how PDFs work. I was also happy to be reminded about a couple small things on regexes that I’m surprised I forgot.

My first prompt:

I have a Google Drive folder with some documents that we need to review. It’s hard to keep track of our progress reading them. Could I use Apps Script to create a spreadsheet with a checkbox for each item?

You can see the rest of our conversation here.

Final code that it came up with:

function generateLinkedReviewList() {
  const folderId = 'YOUR_FOLDER_ID_HERE'; 
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  const folder = DriveApp.getFolderById(folderId);
  const files = folder.getFiles();
  
  sheet.clear();
  sheet.appendRow(['Document (Link)', 'Total Pages', 'Pages Read', 'Status']);
  sheet.getRange(1, 1, 1, 4).setFontWeight('bold').setBackground('#f3f3f3');

  const data = [];
  
  while (files.hasNext()) {
    let file = files.next();
    let pageCount = 0;
    let fileName = file.getName().replace(/"/g, '""');
    let fileUrl = file.getUrl();

    if (file.getMimeType() === 'application/pdf') {
      try {
        let blob = file.getBlob();
        let content = blob.getDataAsString();

        // Find all instances of /Count inside a /Pages dictionary
        // We use a global search to find every potential candidate
        const globalRegex = /<<[^>]*?\/Type\s*\/Pages[^>]*?\/Count\s+(\d+)[^>]*?>>/g;
        const reverseRegex = /<<[^>]*?\/Count\s+(\d+)[^>]*?\/Type\s*\/Pages[^>]*?>>/g;
        
        let counts = [];
        let match;

        // Collect all forward matches
        while ((match = globalRegex.exec(content)) !== null) {
          counts.push(parseInt(match[1]));
        }
        // Collect all reverse matches
        while ((match = reverseRegex.exec(content)) !== null) {
          counts.push(parseInt(match[1]));
        }

        if (counts.length > 0) {
          // The total page count is the maximum value found in the tree
          pageCount = Math.max(...counts);
        } else {
          pageCount = 'Check Manually';
        }
      } catch (e) {
        pageCount = 'Error';
      }
    } else {
      pageCount = '-';
    }

    let linkFormula = `=HYPERLINK("${fileUrl}", "${fileName}")`;
    data.push([linkFormula, pageCount, 0, false]);
  }

  if (data.length > 0) {
    sheet.getRange(2, 1, data.length, 4).setFormulas(data);
    sheet.getRange(2, 4, data.length, 1).insertCheckboxes();
    
    // Data Validation
    for (let i = 0; i < data.length; i++) {
      let total = data[i][1];
      if (typeof total === 'number' && total > 0) {
        let cell = sheet.getRange(i + 2, 3);
        let rule = SpreadsheetApp.newDataValidation()
          .requireNumberBetween(0, total)
          .setAllowInvalid(false)
          .build();
        cell.setDataValidation(rule);
      }
    }
  }
  sheet.autoResizeColumns(1, 4);
}

I see that if I comment out the lines using reverseRegex, I get the same answer so perhaps we could drop that, but I think I’ll stop here.

To help with understanding what going wrong with certain estimates, I found it helpful to extract strings from some sample PDFs:

% strings doc.pdf | grep "/Count " | less

<< /Type /Pages /MediaBox [0 0 612 792] /Count 3 /Kids [ 1 0 R 24 0 R 31 0 R
<< /Type /Pages /Count 158 /MediaBox [ 0 0 612 792 ] /Kids [ 1 0 R 24 0 R

This seems like a nice resource for learning more about the PDF file format: https://www.elliotclyde.nz/blog/whats-in-a-pdf/ (Archive).


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *