We have a Google Drive folder with documents needing review. It’s a little hard keeping track of our progress so I got some help from Gemini in creating a checklist. It worked out pretty well. I’m also happy to have a starting point for learning more about some features of Apps Script that I haven’t used much before — the DriveApp API — and a few details on how PDFs work. I was also happy to be reminded about a couple small things on regexes that I’m surprised I forgot.
My first prompt:
I have a Google Drive folder with some documents that we need to review. It’s hard to keep track of our progress reading them. Could I use Apps Script to create a spreadsheet with a checkbox for each item?
You can see the rest of our conversation here.
Final code that it came up with:
function generateLinkedReviewList() {
const folderId = 'YOUR_FOLDER_ID_HERE';
const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
const folder = DriveApp.getFolderById(folderId);
const files = folder.getFiles();
sheet.clear();
sheet.appendRow(['Document (Link)', 'Total Pages', 'Pages Read', 'Status']);
sheet.getRange(1, 1, 1, 4).setFontWeight('bold').setBackground('#f3f3f3');
const data = [];
while (files.hasNext()) {
let file = files.next();
let pageCount = 0;
let fileName = file.getName().replace(/"/g, '""');
let fileUrl = file.getUrl();
if (file.getMimeType() === 'application/pdf') {
try {
let blob = file.getBlob();
let content = blob.getDataAsString();
// Find all instances of /Count inside a /Pages dictionary
// We use a global search to find every potential candidate
const globalRegex = /<<[^>]*?\/Type\s*\/Pages[^>]*?\/Count\s+(\d+)[^>]*?>>/g;
const reverseRegex = /<<[^>]*?\/Count\s+(\d+)[^>]*?\/Type\s*\/Pages[^>]*?>>/g;
let counts = [];
let match;
// Collect all forward matches
while ((match = globalRegex.exec(content)) !== null) {
counts.push(parseInt(match[1]));
}
// Collect all reverse matches
while ((match = reverseRegex.exec(content)) !== null) {
counts.push(parseInt(match[1]));
}
if (counts.length > 0) {
// The total page count is the maximum value found in the tree
pageCount = Math.max(...counts);
} else {
pageCount = 'Check Manually';
}
} catch (e) {
pageCount = 'Error';
}
} else {
pageCount = '-';
}
let linkFormula = `=HYPERLINK("${fileUrl}", "${fileName}")`;
data.push([linkFormula, pageCount, 0, false]);
}
if (data.length > 0) {
sheet.getRange(2, 1, data.length, 4).setFormulas(data);
sheet.getRange(2, 4, data.length, 1).insertCheckboxes();
// Data Validation
for (let i = 0; i < data.length; i++) {
let total = data[i][1];
if (typeof total === 'number' && total > 0) {
let cell = sheet.getRange(i + 2, 3);
let rule = SpreadsheetApp.newDataValidation()
.requireNumberBetween(0, total)
.setAllowInvalid(false)
.build();
cell.setDataValidation(rule);
}
}
}
sheet.autoResizeColumns(1, 4);
}
I see that if I comment out the lines using reverseRegex, I get the same answer so perhaps we could drop that, but I think I’ll stop here.
To help with understanding what going wrong with certain estimates, I found it helpful to extract strings from some sample PDFs:
% strings doc.pdf | grep "/Count " | less
<< /Type /Pages /MediaBox [0 0 612 792] /Count 3 /Kids [ 1 0 R 24 0 R 31 0 R
<< /Type /Pages /Count 158 /MediaBox [ 0 0 612 792 ] /Kids [ 1 0 R 24 0 R
This seems like a nice resource for learning more about the PDF file format: https://www.elliotclyde.nz/blog/whats-in-a-pdf/ (Archive).
Leave a Reply