Y’all wanna scrape off the list of Field IDs of your Google Forms easily? Yes, programmatically, through a neat little script! Stick around y’all! 😉
So you remember my last blog posts about my journey of hacking around Google Forms: You may RESTfully submit to your Google Forms… and, the one before it, Let’s auto fill Google Forms with URL parameters… You might remember how crucial it was to get hold of the list of Field identifiers in your Google Form, which you can use to submit data to your Form using the REST API, or even to populate the auto-filled data in the Form. It goes without saying that this blog post is a follow-up to those two articles, so I would prefer if you took a sneak peek at them before continuing with this one.
So the way we retrieved the list of Field identifiers was by manually reading through the rendered HTML code and looking at the network traffic data. Specifically, I introduced three methods as follows…
- Method 1: Looking up page source HTML content.
- Method 2: Inspecting HTML code of each field.
- Method 3: Monitor the network traffic data.
So all those methods are completely manual. Shouldn’t there be an easier way? Yes, I’ve been wondering that myself, and I continued experimenting…
Yes.. Scrape em off programmatically!
Wouldn’t it be easier if we could write a little script to retrieve the list of Field IDs in any given Google Form? Just scrape through the rendered HTML content and pick up the Field identifiers automatically! Oh think about how much time you could save! 😀
Well yes, that’s exactly what we’re going to do. Let’s build a simple script that could scrape off the list of Field IDs of your Google Forms without you breaking a sweat!
But to build the script you might have to break some sweat… 😉 Alright, so what this script is going to do is, given the link of a Google Form, load its HTML content, scrape through it to find the elements that hold the Field IDs, filter them out and return the results! Quite straightforward eh! 😀
Let the hack begin…
Let’s use dotnet and C# to write our script, and yes, an absolutely biased choice given it’s my favorite platform! lol But if you can grasp the process and the idea behind each step, you could easily reproduce the same script in any other language or framework!
It’s all about programming our method of manually reading through the HTML code and figuring out the IDs into self-executing code. This requires us to understand the pattern in which the HTML is rendered for each Field element in the Google Form, and I figured this out by repeatedly looking at the rendered HTML content. Once we understand the pattern and where to find the data that we need to scrape out, we can easily code it into our script.
Now for this post too, let’s use the same sample questionnaire Google Form that I created for the last post’s demo.
So now we got a Google Form, let’s start the little hack by finding the field IDs in the form…
Identify the Pattern..
Now this is the most crucial step, so let me walk you through it by simply looking into the rendered HTML content of our Google Form.
For me though, it took a lot of trial and error to recognize how all the different types of question fields are rendered in Google Forms, whereby I created a bunch of sample Forms and analyzed their HTML structure continuously, until I was sure. 😉
I’ve explained this pattern in my previous blog post: Let’s auto fill Google Forms with URL parameters… in which I dive into the detailed steps of finding the IDs of each question field in your Google Form. If you go through it and focus on the section “Method 2: Inspecting each field” you could easily understand what I’m about to dive into. Therefore let me keep things simple in this article just to avoid repetition. 😀
Assuming you’re using Google Chrome, let’s begin by right clicking on any of the question fields in your Google Form and going to the Inspect Element menu option.
If you carefully take a look at the rendered HTML node of our Short Answer question Field, it’s actually an “input” element, and you can see how the “name” attribute holds the ID for the Field, “entry.1277095329”. Now let’s take another type of Field, how about a Multiple choice selection question Field? 😮
When you’re reading through the parent node of the radio button elements, you can see at the bottom there’s an “input” type element, that’s set to hidden, with the Field ID that we’re looking for. Then how about a Checkbox selection Field? Let’s try the same inspection and see for ourselves… 😉
Now the parent node of it is a bit differently structured, but you can see how it follows the same pattern, of having an “input” type element which holds the ID of the question Field.
Also something to notice is that the same exact child node, containing the ID value, is repeated in each parent div element. And just above those child elements, you’ve got another field which carries the same value, but with a “_sentinel” suffix in the “name” attribute, which sorta creates a repetition of data. So this is something you need to keep in mind, since we will need to filter it out in our script. 😀
Next let’s try out a Paragraph question Field of Google Forms, and try to analyze it.
Now this fella has got a “textarea” element rendered, which also includes the “name” attribute that holds the Field ID, and now we know another element that needs special handling in our script! 😉
Finally, all the other types of Question Fields in Google Forms follow almost the same pattern of rendered HTML, therefore without further ado let’s try to analyze the patterns which we’ve seen behind those elements.
Analyze the Pattern…
Now in this step you need to be able to see the full picture of the pattern in which those Google Forms question Field elements are rendered in their HTML environment. So after analyzing all those different fields we can draw a few major conclusions that we need to keep in mind when we’re thinking of scraping out the Field IDs from the HTML…
- “input” elements hold the Field IDs in their “name” attribute
- “textarea” is an exceptional element which is used by the Paragraph question type
- the Field ID value begins with the “entry.” prefix in the “name” attribute
- Checkbox Field elements render a repetition of their “input” nodes
- Also the Checkbox Field generates an extra “input” element with a “_sentinel” suffix in its “name”
- repeated nodes with the same values should be filtered out
Now keeping all those in mind we need to implement the logic into our script, or in other words we need to code the above logic and filtering as rules into our little script that’ll scrape the HTML of our Google Form to retrieve the Field IDs automatically.
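Before touching any real HTML, here’s a rough sketch of those rules as plain filters, just to see them in one place. The (tag, name) pairs are made up stand-ins for parsed elements, and every ID except “entry.1277095329” is invented for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical (tag, name) pairs standing in for parsed HTML elements.
var elements = new[]
{
    (Tag: "input",    Name: "entry.1277095329"),          // Short Answer field
    (Tag: "textarea", Name: "entry.1111111111"),          // Paragraph field (made-up ID)
    (Tag: "input",    Name: "entry.2222222222"),          // Checkbox option (made-up ID)
    (Tag: "input",    Name: "entry.2222222222"),          // repeated Checkbox node
    (Tag: "input",    Name: "entry.2222222222_sentinel"), // extra sentinel node
    (Tag: "input",    Name: "someOtherHiddenInput"),      // not a question field
};

var fieldIds = elements
    .Where(e => e.Tag == "input" || e.Tag == "textarea")                  // rules 1 and 2
    .Where(e => e.Name.StartsWith("entry.", StringComparison.Ordinal))    // rule 3
    .Where(e => !e.Name.EndsWith("_sentinel", StringComparison.Ordinal))  // rule 5
    .Select(e => e.Name)
    .Distinct()                                                           // rules 4 and 6
    .ToList();

Console.WriteLine(string.Join(", ", fieldIds));
// prints: entry.1277095329, entry.1111111111, entry.2222222222
```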
Let the coding begin!
Assuming that you’re already experienced in dotnet, I’m not going to dive into spoon-feeding details here, and will rather focus on the important bits of the code. We’re going to be using dotnet and C# as the language for our crawler script. We also need a library that can parse HTML content and traverse it programmatically, so I chose HtmlAgilityPack, which is a well known and stable HTML parser for dotnet projects.
So the project type that you’re going to implement is totally up to you, but for this demo I’ll be using a Console Project in dotnet, pretty simple to begin with.
Given you have added HtmlAgilityPack to your dotnet project, let’s create the method definition with a string parameter that represents the URL link of the Google Form that we need to scrape, and have it return the list of Field IDs of the given Google Form!
Oh and make sure to make it an async method, since we will need that for our web call that’ll load the HTML content.
Let’s use the HtmlWeb class, which allows us to load HTML content from a given URL string asynchronously.
There we’re executing LoadFromWebAsync() on the given Google Forms link and loading the result into memory. Next we need to implement our first line of filtering on top of the HTML content that we loaded into memory.
There we’re scooping the “input” and “textarea” elements from the HTML content into a collection of type IEnumerable<HtmlNode>, given the DocumentNode which contains all the HTML elements of the Google Form that we just parsed into memory.
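As a minimal sketch of just this loading step, assuming the HtmlAgilityPack NuGet package is installed: the helper name LoadGoogleFormAsync is my own placeholder, and for now it only returns the parsed document rather than the list of Field IDs. It’s written with top-level statements, so the Form link comes in as the first command line argument.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper name; covers only the loading step.
static async Task<HtmlDocument> LoadGoogleFormAsync(string googleFormLink)
{
    var web = new HtmlWeb();
    // Downloads the page and parses it into an in-memory HtmlDocument.
    return await web.LoadFromWebAsync(googleFormLink);
}

// Pass your own Google Form link as the first command line argument.
if (args.Length > 0)
{
    HtmlDocument htmlDoc = await LoadGoogleFormAsync(args[0]);
    int nodeCount = htmlDoc.DocumentNode.Descendants().Count();
    Console.WriteLine($"Loaded {nodeCount} nodes");
}
```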
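Here’s roughly what that first scoop could look like. To keep it runnable without a network call, this sketch parses a tiny hand-written HTML snippet instead of a live Form; the snippet and the “entry.1111111111” ID are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Made-up stand-in for the rendered HTML of a Google Form.
const string sampleHtml = @"
<div>
  <input type='text' name='entry.1277095329'>
  <textarea name='entry.1111111111'></textarea>
  <span>Not a field</span>
</div>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(sampleHtml);

// All "input" nodes first, then all "textarea" nodes (not document order).
IEnumerable<HtmlNode> fields = htmlDoc.DocumentNode
    .Descendants("input")
    .Concat(htmlDoc.DocumentNode.Descendants("textarea"));

foreach (HtmlNode node in fields)
{
    string nameValue = node.GetAttributeValue("name", string.Empty);
    Console.WriteLine($"{node.Name}: {nameValue}");
}
// prints:
// input: entry.1277095329
// textarea: entry.1111111111
```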
Like I said before we’re basically implementing the logic that we learned in the Pattern Analysis step, therefore next we go on to the next layer of filtering.
There we’re filtering the list based on a predicate: we keep all the HTML elements which have the “entry.” prefix in their “name” attribute, thus securing our Field ID values. Then we exclude all the elements that have the “_sentinel” suffix in their “name” attribute, which takes care of cleaning up the extra Checkbox field element.
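A small sketch of this second layer of filtering, again over a made-up HTML snippet so it runs offline. I’m using StartsWith and EndsWith for the prefix and suffix checks, which is my own choice here; all IDs other than “entry.1277095329” are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Made-up stand-in for the rendered HTML of a Google Form.
const string sampleHtml = @"
<div>
  <input type='text' name='entry.1277095329'>
  <input type='hidden' name='entry.2222222222_sentinel'>
  <input type='hidden' name='entry.2222222222' value='Option 1'>
  <input type='hidden' name='someOtherHiddenInput'>
  <textarea name='entry.1111111111'></textarea>
</div>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(sampleHtml);

IEnumerable<HtmlNode> fields = htmlDoc.DocumentNode
    .Descendants("input")
    .Concat(htmlDoc.DocumentNode.Descendants("textarea"));

List<HtmlNode> questionFields = fields
    // keep only nodes whose "name" starts with the "entry." prefix
    .Where(n => n.GetAttributeValue("name", string.Empty)
                 .StartsWith("entry.", StringComparison.Ordinal))
    // drop the extra "_sentinel" node generated by Checkbox fields
    .Where(n => !n.GetAttributeValue("name", string.Empty)
                  .EndsWith("_sentinel", StringComparison.Ordinal))
    .ToList();

Console.WriteLine(questionFields.Count); // prints: 3
```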
As you see above we’ve singled out the HTML elements that we’re targeting. Next we gotta do some final clean up of our scraped nodes.
We need to clean up any duplicate elements in the list, therefore we’re gonna group the nodes that carry the same value and pick the first element of each group into a list of type List<HtmlNode>, which will eliminate the repeated nodes probably caused by Checkbox fields.
And finally we’re going to access each node’s “name” attribute and load it into a List.
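The grouping step could be sketched like this, assuming the “name” attribute is what we group by. The HTML snippet is made up, with a Checkbox-style ID repeated twice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Made-up snippet where a Checkbox-style field repeats the same "name".
const string sampleHtml = @"
<div>
  <input type='hidden' name='entry.2222222222' value='Option 1'>
  <input type='hidden' name='entry.2222222222' value='Option 2'>
  <input type='text' name='entry.1277095329'>
</div>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(sampleHtml);

List<HtmlNode> uniqueFields = htmlDoc.DocumentNode
    .Descendants("input")
    // group nodes sharing the same "name" value
    .GroupBy(n => n.GetAttributeValue("name", string.Empty))
    // and keep only the first node of each group
    .Select(g => g.First())
    .ToList();

foreach (HtmlNode node in uniqueFields)
    Console.WriteLine(node.GetAttributeValue("name", string.Empty));
// prints:
// entry.2222222222
// entry.1277095329
```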
And return the results. Oh just an add-on I’m printing out the scraped off Field ID elements into the Console.
Let me share the whole script down here…
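Here’s a sketch of how the whole thing could fit together, following the steps described above. Treat it as my reconstruction under assumptions rather than the definitive script: the method names ExtractFieldIds and ScrapeFieldIdsAsync are placeholders, the HtmlAgilityPack NuGet package is assumed to be installed, and the Form link is read from the first command line argument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: applies the pattern rules to an already parsed document.
static List<string> ExtractFieldIds(HtmlDocument htmlDoc)
{
    // First layer: scoop the "input" and "textarea" elements.
    IEnumerable<HtmlNode> fields = htmlDoc.DocumentNode
        .Descendants("input")
        .Concat(htmlDoc.DocumentNode.Descendants("textarea"));

    // Second layer: keep "entry." names, drop "_sentinel" ones.
    IEnumerable<HtmlNode> questionFields = fields
        .Where(n => n.GetAttributeValue("name", string.Empty)
                     .StartsWith("entry.", StringComparison.Ordinal))
        .Where(n => !n.GetAttributeValue("name", string.Empty)
                      .EndsWith("_sentinel", StringComparison.Ordinal));

    // Final clean up: one node per distinct "name".
    List<HtmlNode> uniqueFields = questionFields
        .GroupBy(n => n.GetAttributeValue("name", string.Empty))
        .Select(g => g.First())
        .ToList();

    // Pick the Field ID out of each node's "name" attribute.
    return uniqueFields
        .Select(n => n.GetAttributeValue("name", string.Empty))
        .ToList();
}

// Hypothetical method name: loads the Form and returns its Field IDs.
static async Task<List<string>> ScrapeFieldIdsAsync(string googleFormLink)
{
    var web = new HtmlWeb();
    HtmlDocument htmlDoc = await web.LoadFromWebAsync(googleFormLink);

    List<string> fieldIds = ExtractFieldIds(htmlDoc);

    // Just an add-on: print the scraped Field IDs to the Console.
    foreach (string fieldId in fieldIds)
        Console.WriteLine(fieldId);

    return fieldIds;
}

// Pass your own Google Form link as the first command line argument.
if (args.Length > 0)
    await ScrapeFieldIdsAsync(args[0]);
```

Keeping the parsing rules in their own method is a choice I made for the sketch, so the filtering can be exercised against a saved HTML string without hitting the network.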
Let’s try it out shall we! 😉
Now I’m gonna use the sample Google Form that I created for this demo, pass its URL link into this little script, and hit F5 in Visual Studio!
Look at that beauty! 😉 The complete list of Question Field identifiers in my sample Google Form, just like we expected.
As far as my testing goes, this little script works perfectly for any Google Form that contains the basic main types of Question Fields that are available in Google Forms as of this day!
Well… That’s it!
Basically you can scrape any data off a given HTML content as long as you understand the pattern in which the HTML is rendered around the pieces of data that you’re looking for! Likewise there could be many different Google Forms that contain different types of Question Fields, even with custom content in them, but at the end they all follow a certain HTML rendering pattern, which is just a matter of figuring out.
I would like to remind you again, the reason I considered this as a “SCRIPT” is due to the possibility of converting the same HTML scraping steps into any other language or framework easily, as long as you understand the pattern of the rendered HTML of your Google Form!
Now keep in mind all these are simple hacks and tricks derived from careful observation of the rendered HTML content of a Google Forms page, and we have no control over whether Google will change these formats and rendering patterns in the future, so you gotta keep an eye out if you’re planning to use these hacks for a long term solid implementation.
My suggestion would be to write up a series of Test cases (TDD yo!) which would test for the above process flows to make sure they’re working as expected and notify you in case of any changes from Google. 😉
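As a rough idea of what such a check could look like without pulling in a test framework: run the extraction over HTML from a Form whose Field IDs you already know and compare the results. Everything here is an assumption for illustration, including the ExtractFieldIds helper, the saved HTML snippet and the made-up “entry.1111111111” ID. To actually notice changes from Google, you’d feed it the live HTML of your Form instead of a saved copy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: same pattern rules as the scraping script.
static List<string> ExtractFieldIds(string html)
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    return htmlDoc.DocumentNode.Descendants("input")
        .Concat(htmlDoc.DocumentNode.Descendants("textarea"))
        .Select(n => n.GetAttributeValue("name", string.Empty))
        .Where(name => name.StartsWith("entry.", StringComparison.Ordinal))
        .Where(name => !name.EndsWith("_sentinel", StringComparison.Ordinal))
        .Distinct()
        .ToList();
}

// Made-up fixture: HTML of a Form whose Field IDs we already know.
const string knownFormHtml = @"
<div>
  <input type='text' name='entry.1277095329'>
  <textarea name='entry.1111111111'></textarea>
</div>";
var expectedIds = new[] { "entry.1277095329", "entry.1111111111" };

List<string> actualIds = ExtractFieldIds(knownFormHtml);
string missing = string.Join(", ", expectedIds.Except(actualIds));
string unexpected = string.Join(", ", actualIds.Except(expectedIds));
bool passed = missing.Length == 0 && unexpected.Length == 0;

// With xUnit or NUnit this comparison would become an assertion.
Console.WriteLine(passed
    ? "PASS: scraped Field IDs match the expected list"
    : $"FAIL: missing [{missing}], unexpected [{unexpected}]");
// prints: PASS: scraped Field IDs match the expected list
```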
There you have it, the little magic script to scrape off the list of Field IDs from your Google Forms page!
Share the love! 😀 Cheers!